Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. We've been trying out doing these videos on camera. I really enjoyed it, and your feedback was also absolutely incredible. I've never seen anything like this. So many comments, thank you so much for the kind words everyone. So, we will try to do more of this. But note that this one is a classic voice-over episode, the kind we've always made here. It was recorded before we started the camera experiment, so I thought I would record this little intro now so you don't get surprised. Next video, I'll be back on camera. And for now, please enjoy this super fun paper.
How do robots learn to be good robots? Well, surely not like this. Haha. Not by just running around in the real world. Of course not! I mean, imagine a real robot doing this for years and years. It would be dangerous to others and to itself. So here is a better question: how do we teach a robot to be a helpful, good robot safely? Well, we put it inside a video game and let it start learning there first! In the game, we simulate physics and let it fail. A lot. Then, over time, it gets better. Now, I've been to a bunch of AI and robotics labs around the world, and let me briefly summarize what I saw:
things work fantastically well in simulation, and then, when you put them into the real world, huge disappointment. Something that worked really well suddenly works poorly or not at all. Why? Well, the main reason is that simulations are often just not good enough. They mimic reality, but they are not a substitute for reality. So what do we do? Well, of course, try to use reality itself. In this work, DreamDojo, scientists said: okay, let's feed the AI 44 thousand hours of videos of humans doing stuff. That sounds great, except for the fact that it is completely useless.
Why? Well, humans and robots have entirely different physical bodies, hands, and joints. Also, the video does not contain action information. It's just a soup of data that doesn't say which joints are exerting forces, and how. Nothing. So why do this? Does this even make sense? Well, they propose 4 genius ideas, and I hope they will make this work, because it would be a miracle. One, if the video does not have labels on what actions are taking place, then let the AI try to understand it and make up its own story of what is happening. If you see someone waving at a bus that is pulling away, you don't need a text label to know that someone has just missed their ride.
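For the curious Scholars: the paper's exact machinery is not shown here, but one common way to realize this idea is a latent action model. One network guesses a compact pseudo-action from a pair of frames, and a second network checks that guess by predicting the later frame from the earlier frame plus the pseudo-action. Here is a minimal PyTorch sketch; the class name, layer sizes, and flattened-frame inputs are all illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Infers a pseudo-action from a frame pair, then verifies it by
    reconstructing the next frame. All sizes are illustrative only."""
    def __init__(self, frame_dim=1024, action_dim=16):
        super().__init__()
        # Encoder: looks at (frame_t, frame_t+1) and guesses what "action" happened.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
        # Decoder: predicts frame_t+1 from frame_t and the guessed action.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim))

    def forward(self, frame_t, frame_next):
        action = self.encoder(torch.cat([frame_t, frame_next], dim=-1))
        predicted_next = self.decoder(torch.cat([frame_t, action], dim=-1))
        return predicted_next, action

model = LatentActionModel()
f_t, f_next = torch.randn(8, 1024), torch.randn(8, 1024)  # 8 flattened frame pairs
pred, action = model(f_t, f_next)
loss = nn.functional.mse_loss(pred, f_next)  # no human action labels needed anywhere
loss.backward()
```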
Two, this dataset is stupendously large. It has more than 4 billion frames, and probably more than 1 quadrillion pixels. Okay, that is too much information. It is almost impossible to handle. So the AI has to learn what is important and what isn't. How? Well, it is forced to compress information. A musician does not need to know every song in the universe. They only have to know that there are 12 notes in an octave, and every song is built as a combination of these fundamental notes. In the same way, the compression forces the AI to keep only the most critical information.
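Again, just as a sketch of the principle rather than the paper's actual model: a bottleneck autoencoder is the textbook way to force this kind of compression. All dimensions below are made up.

```python
import torch
import torch.nn as nn

# A bottleneck autoencoder: the network must squeeze a big frame through a
# tiny code, so it can only keep the most important structure.
frame_dim, code_dim = 4096, 64  # illustrative sizes, not the paper's

encoder = nn.Sequential(nn.Linear(frame_dim, 512), nn.ReLU(), nn.Linear(512, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(), nn.Linear(512, frame_dim))

frames = torch.randn(32, frame_dim)     # a batch of flattened video frames
codes = encoder(frames)                 # 4096 numbers squeezed into 64
reconstruction = decoder(codes)
loss = nn.functional.mse_loss(reconstruction, frames)
loss.backward()  # training pressure: keep only what helps rebuild the frame
```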
But guess what, it is still not enough to just dump videos into the robot and make it work. Why? Well, three, if you train a robot to pick up a cup at a global position, it learns to reach for that exact spot in the world. That's no good. Why? Well, if you move the cup a few inches to the left, the global coordinates change entirely, and the robot has no idea what to do. So the scientists said: instead of using absolute robot joint poses, let's transform the inputs into relative actions. If you are cooking, you often don't need absolute coordinates. The knife only needs to know where it is relative to the carrot.
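To make this concrete, here is a tiny sketch of the coordinate change with made-up numbers; the paper's actual transformation over joint poses is more involved than this.

```python
import numpy as np

# World-frame positions, numbers made up for illustration.
gripper = np.array([0.82, 0.31, 0.45])
cup     = np.array([0.90, 0.25, 0.40])

# Absolute action: "go to the cup's world coordinates".
absolute_target = cup                      # becomes stale if the cup moves

# Relative action: "move by your current offset to the cup".
relative_action = cup - gripper            # [ 0.08, -0.06, -0.05]

# Slide the cup 10 cm to the left: the absolute target is now wrong,
# but re-measuring the offset gives a valid relative action again.
cup = cup + np.array([-0.10, 0.0, 0.0])
relative_action = cup - gripper            # [-0.02, -0.06, -0.05]
```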
And believe it or not, this is still not enough. We need something more. What do we need? Well, four, the goal is that the AI learns cause and effect. A jelly bunny hits the wall, and something happens. Try to learn that by predicting the next frame. The problem is that the AI is cheating. Like a student, it just looks at the solution at the end and says, oh yeah, I was going to say exactly that. So how did they prevent that? Well, they fed it actions in small blocks of 4 at a time, so it cannot cheat by peeking at the future to guess what is happening right now.
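Here is a toy sketch of that chunking idea. The world_model call is a hypothetical placeholder, and the block size of 4 is the only number taken from the video.

```python
import torch

# Toy action stream: 20 timesteps of 7-dof actions (numbers made up).
actions = torch.randn(20, 7)
CHUNK = 4

# Split the stream into non-overlapping blocks of 4 actions.
chunks = actions.split(CHUNK, dim=0)   # 5 chunks, each of shape (4, 7)

# Training loop sketch: at step k, the model sees the past and ONLY the
# current action chunk. Future chunks stay hidden, so it cannot "peek at
# the solution" while predicting what happens next.
for k, action_chunk in enumerate(chunks):
    visible_past = actions[: k * CHUNK]   # everything before this block
    # predicted_frames = world_model(visible_past, action_chunk)  # hypothetical model
    # loss = compare(predicted_frames, true_next_frames)          # hypothetical target
    print(f"chunk {k}: conditioning on {len(visible_past)} past actions, {CHUNK} current ones")
```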
Okay, this was a lot of genius stuff, so it had better give us something amazing. Let's see what we got. Previous method: it can't predict the future… oh my, look, that hand clips through the piece of paper. Now hold on to your papers, Fellow Scholars, for the new method and… oh my! Look at that! The paper finally crumples beautifully! And with previous methods, the clipping gets even worse. Look. That's not predicting reality, that's just guessing. New technique - now we're talking! Looking good! Also, previous technique: the hand moves the lid, and the lid refuses to move. No good. New technique: the lid moves! Woo-hoo! Yes, this is the corner of the internet where we get unreasonably happy about a moving lid. And these are not cherry-picked results; the new technique is consistently better than previous methods. This is a huge leap forward! Now, it gets even better. The new method finally understands the world better than previous techniques. So what do we pay for this? How much slower is it? Well, it is pretty slow, because it requires 35 heavy denoising steps just to generate one prediction. But wait, don't despair!
We can use distillation here. Distillation is a training phase where a fast student model learns to imitate the predictions of a slower, high-quality teacher model. The goal is for the student to be nearly as good as the teacher, but much faster.
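Here is a minimal sketch of the general distillation recipe; the tiny networks below are stand-ins, not the paper's 35-step denoiser or its actual student.

```python
import torch
import torch.nn as nn

# Distillation sketch: a slow "teacher" (think: 35 denoising steps) produces
# targets, and a small, fast "student" learns to match them in one shot.
# Both networks and all sizes are stand-ins for illustration.
teacher = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
student = nn.Linear(64, 64)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(32, 64)            # stand-in for encoded video states
    with torch.no_grad():
        target = teacher(x)            # expensive, high-quality prediction
    prediction = student(x)            # cheap, fast prediction
    loss = nn.functional.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, we keep only the student: nearly the teacher's answers,
# at a fraction of the cost, in the spirit of the speedup the video mentions.
```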
Well, let's test that! Oh my, the student is a heck of a lot faster: about 4 times faster than the teacher that was used to train it, running at roughly 10 frames per second. Understanding the world and predicting how it will change at an interactive speed. That is absolutely insane. Well done! And the kicker is that the student also predicts outcomes very similar to the teacher's. This is an absolute slam dunk paper. Wow. Now, for you wise Fellow Scholars out there, I'll note that we previously talked about a technique called NeRD, Neural Robot Dynamics. That was a robot AI that trained in its own imagination. So how does this relate to that? Well, NeRD built a precise 3D environment. This one thinks in 2D: it just sees the world as a bunch of video pixels on a flat screen. And because plain video is so abundant, it is able to learn about thousands of everyday objects. So cool! This finally gives us smarter AI robots, and robots that we can all own ourselves. In a world full of subscriptions,
it is so refreshing that we get all of this for free. A ton of code and pre-trained models are available for all of us. No silly subscriptions, no proprietary code. A free brain that you can upload to your own devices and use however you want. Love it. So this puts us one step closer to having a robot fold our laundry or cook a healthy meal. Or help a specialist doctor perform surgery from the other side of the planet via teleoperation. What a time to be alive!