Physical AI: the new era of robotics

Google · Google for Developers · 21 May 2026, 23:56 · 38:00

INTRO

Advances in multimodal AI and robotics are converging to enable more versatile humanoid robots, but major challenges in dexterity, data, and safety remain before real-world deployment at scale.

KEY POINTS

AI breakthroughs are accelerating robotics

Rapid progress in general-purpose AI, particularly multimodal systems that combine vision, language, and action, is transforming robotics. Researchers have extended vision-language models into vision-language-action (VLA) systems, which let robots interpret commands and act in the physical world. These models generalize in unexpected ways, such as identifying and manipulating unfamiliar objects through semantic understanding: in one demonstration, a robot asked to pick up "the extinct animal" chose a toy dinosaur.
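To make the idea of "action tokens" concrete, here is a minimal sketch of one way continuous robot commands could be turned into discrete tokens that sit alongside vision and language tokens. The talk does not describe the actual tokenization; the uniform binning, the 256-bin vocabulary, the `[-1, 1]` action range, and the `ActionTokenizer` name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ActionTokenizer:
    """Hypothetical tokenizer: uniform binning of each action dimension."""
    low: float = -1.0      # assumed lower bound of a normalized action
    high: float = 1.0      # assumed upper bound of a normalized action
    num_bins: int = 256    # assumed size of the action-token vocabulary

    def encode(self, action: List[float]) -> List[int]:
        """Map each continuous value to the index of the bin it falls in."""
        tokens = []
        for value in action:
            clipped = min(max(value, self.low), self.high)
            fraction = (clipped - self.low) / (self.high - self.low)
            # The upper bound itself is folded into the last bin.
            tokens.append(min(int(fraction * self.num_bins), self.num_bins - 1))
        return tokens

    def decode(self, tokens: List[int]) -> List[float]:
        """Map each token back to the center of its bin."""
        width = (self.high - self.low) / self.num_bins
        return [self.low + (token + 0.5) * width for token in tokens]


tokenizer = ActionTokenizer()
tokens = tokenizer.encode([-1.0, 0.0, 1.0, 2.0])  # 2.0 is clipped to 1.0
recovered = tokenizer.decode(tokens)
```

Decoding returns bin centers, so a round trip loses at most half a bin width per dimension. That resolution trade-off is the main design choice in any scheme of this kind.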

Partnerships target versatile robots

The collaboration between Google DeepMind and Boston Dynamics aims to build robots that combine physical intelligence with reasoning capabilities. The approach mirrors human development: first master balance and movement, then learn abstract concepts such as object affordances. This dual-system design seeks to produce robots that can adapt to new environments and tasks.

The humanoid form factor is gaining ground

The latest Atlas humanoid robot reflects a strategic shift toward human-like designs. Humanoids are compatible with human environments and human data, and offer practical advantages such as two arms for manipulation and two legs for mobility and balance. This form is increasingly seen as essential for reaching physical AGI, defined as matching human physical capabilities.

Simulation and real-world data for training

Robot training relies on two main approaches: reinforcement learning in simulation and real-world data collection. Simulation works well for locomotion and whole-body control, while complex manipulation requires teleoperation, in which humans control the robots to generate data. VR systems are often used to align the human operator's input with the robot's perception.
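As a rough illustration of learning by trial and error in simulation, the sketch below tunes a two-gain feedback controller on a toy, linearized balancing task. It uses plain random hill climbing as a simple stand-in for reinforcement learning; it is not the method used by either company. The dynamics, the constants, and the names `simulate` and `hill_climb` are all hypothetical.

```python
import random
from typing import List, Tuple


def simulate(gains: List[float], steps: int = 200, dt: float = 0.02) -> int:
    """Roll out a toy balancing task and return the number of steps survived.

    The step count plays the role of a verifiable reward: the simulator
    itself decides whether the body stayed upright.
    """
    angle, velocity = 0.05, 0.0  # start slightly off balance
    for step in range(steps):
        # Linear feedback policy: push back against tilt and tilt rate.
        torque = -(gains[0] * angle + gains[1] * velocity)
        # Linearized inverted pendulum: gravity amplifies the tilt.
        acceleration = 9.8 * angle + torque
        velocity += acceleration * dt
        angle += velocity * dt
        if abs(angle) > 0.5:  # fell over
            return step
    return steps


def hill_climb(iterations: int = 200, seed: int = 0) -> Tuple[List[float], int]:
    """Trial and error: perturb the gains, keep any change that does not hurt."""
    rng = random.Random(seed)
    best = [0.0, 0.0]
    best_reward = simulate(best)
    for _ in range(iterations):
        candidate = [gain + rng.gauss(0.0, 1.0) for gain in best]
        reward = simulate(candidate)
        if reward >= best_reward:
            best, best_reward = candidate, reward
    return best, best_reward


baseline_reward = simulate([0.0, 0.0])   # no control: the body falls
tuned_gains, tuned_reward = hill_climb()
```

The loop needs thousands of cheap rollouts and tolerates many failures, which is why this style of learning is run in simulation rather than on hardware.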

Dexterity remains the hardest problem

Despite progress in movement and basic manipulation, fine motor skills remain unsolved. Tasks such as opening containers or handling small objects are difficult because of limited touch sensing and the physical complexity of the real world. Even advanced AI systems that excel at coding or reasoning fail at simple tasks like cooking.

Vision dominates, but touch is crucial

Most current systems rely heavily on vision-based learning, thanks to abundant visual data and mature cameras. However, researchers stress that touch is essential for true dexterity. Humans rely on it heavily, and progress is expected from better haptic sensors and their integration.

The emergence of "thinking" robots

New models incorporate internal reasoning steps, interleaving "thinking tokens" with actions. This lets robots evaluate their decisions before acting, improving adaptability and interpretability. Early demonstrations show dynamic adjustments in unfamiliar situations, a key step toward generalization.
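The transcript describes an experiment in which editing the thought trace changes the action that follows. The toy sketch below shows only the shape of that idea: an interleaved sequence in which each action is conditioned on the thought before it. The rule-based `think` and `act` functions are hypothetical stand-ins for a learned model, and the 2 cm threshold is an arbitrary assumption.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    kind: str  # "thought" or "action"
    text: str


def think(distance_cm: float) -> str:
    """Hypothetical stand-in for the model generating a thought."""
    if distance_cm <= 2.0:  # arbitrary threshold for illustration
        return "close enough, close hand"
    return "not close enough, go lower"


def act(thought: str) -> str:
    """The action depends on the preceding thought, not on the raw state."""
    return "close_gripper" if thought.startswith("close enough") else "move_down"


def step(distance_cm: float, edited_thought: Optional[str] = None) -> List[Token]:
    """Emit one thought token followed by the action conditioned on it."""
    thought = edited_thought if edited_thought is not None else think(distance_cm)
    return [Token("thought", thought), Token("action", act(thought))]


original_trace = step(1.0)
steered_trace = step(1.0, edited_thought="not close enough, go lower")
```

With the same state, the unedited trace ends in `close_gripper` and the edited one in `move_down`. Because the thought sits between perception and action, it serves both as a readable trace and as a handle for steering.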

Industry leads adoption

Near-term deployment is focused on industrial environments, where robots handle repetitive, dangerous, or physically demanding tasks. Examples include unloading heavy loads, inspections, and tool handling. These settings offer controlled conditions and clear economic value, unlike the home.

Home robots remain distant

Experts estimate that meaningful home adoption is still 5 to 10 years away. Key obstacles include dexterity, generalization across tasks, and robust safety systems. Even simple actions, such as retrieving keys from a pocket, remain out of reach.

Safety and scaling are critical

Beyond technical capabilities, robots must reach high levels of reliability and safety to operate alongside humans. This echoes the challenges of autonomous driving, where performance alone is not enough without strong safety guarantees.

CONCLUSION

Robotics is advancing rapidly thanks to the integration of modern AI, but achieving truly general, safe, and dexterous machines will require advances in hardware, learning, and real-world adaptation over the next decade.

Full transcript

[MUSIC PLAYING] [APPLAUSE]

JACKLYN DALLAS: So excited to be here. Today's session is on the future of robotics and we're joined by two absolute geniuses. We have Kanishka Rao, who is the Head of Robotics at Google DeepMind, and Alberto Rodriguez, who's the Head of Robotics Behavior at Boston Dynamics. So stoked to be with you guys.

KANISHKA RAO: It's great to be here.

JACKLYN DALLAS: Yeah.

ALBERTO RODRIGUEZ: Thank you.

JACKLYN DALLAS: Welcome to the stage.

ALBERTO RODRIGUEZ: Yeah.

JACKLYN DALLAS: I want to start us off by grounding this moment. As a kid, I always grew up loving robots. But it feels like 2026 is the year that a lot of it is actually coming to fruition. Can you bring us up to speed on what's different this year, and what breakthroughs have happened recently that have put us on this exponential?

KANISHKA RAO: I think, yeah, we're seeing so many breakthroughs in just general purpose AI. We've seen, just like in the last few days, the rate of progress in just general AI in the digital world. And I think all these breakthroughs are making their way into the physical world and especially impacting how we do robotics. And, if you think about, for robots, if they have to be useful to us humans, they must understand our human world. And I think a lot of this understanding is coming from these large frontier multimodal models. A few years ago, we took these large vision language models, and then we adapted them for robotics by adding a third mode, which was like action. So these are like physical tokens instead of your digital tokens. And it was mind blowing just to see the result of that, because all of a sudden you have this vision language and action model, and you could talk to the robot. And you can ask it to do things, and you get all this understanding for free in robotics. And I remember we put a bunch of toys in front of the robot, and we asked the robot to pick up the extinct animal, and it went for a dinosaur and picked up the dinosaur toy.
And it was an "aha" moment. Because again, the robot had never seen this in the training data. This was just all kind of understanding coming from these digital AI breakthroughs. So I think, yeah, robotics is really riding this wave of general AI intelligence. And I think, yeah, we're in a revolution in the robotics space as well.

JACKLYN DALLAS: We're also in an interesting moment where you guys are partnering. Can you tell me a little bit, Alberto, about that partnership and what you're working on? Because I think Boston Dynamics is the robotics company. A lot of people have seen your robot videos, whether it be the robot dog 10 years ago. What are you working on now, how has that changed, and what does the partnership look like?

ALBERTO RODRIGUEZ: Yeah, I would say that, OK, I'll be honest. This is one question that I was prompted that we were going to get about the partnership. So I came a little bit prepared. And I was thinking what problem I'm going to have on stage and it's clear it's going to be a chair. So I spent a little bit of time reading about how kids learn to sit. And you might relate to this. I think you have a two-year-old.

KANISHKA RAO: I have a two-year-old, yeah.

ALBERTO RODRIGUEZ: Yeah. So I was reading that there's clearly two stages in how kids learn to sit down. There's a first stage, which happens around six to nine months old, it's when they learn the concept of balance. They sit on places and they know how to stay balanced upright with their head standing up. They learn the concept of how to stand up without falling forward or falling back, or without moving the chair back, for example. And then there's a second stage that happens when they are closer to one and a half years old, when they learn the notion of what's a chair, what's a place where you can sit, the affordance of sitting. And actually, it's funny.
There's a period of time where kids might just spend a few weeks just sitting on weird places, like on their toys or on a jacket on the floor, just to understand what it means to sit and where it actually makes sense to sit. The idea that sitting has directionality, like when you're sitting down, you're going to be facing a certain direction and not backwards. So if the chair were to be backwards, I should probably turn the chair so that I'm facing you. And those two ideas relate to the two main ways in which we are building generalization in robotics. There's physical intelligence, the idea that the robots we're building, they need to understand what it means to exert forces on the world, how to use their body to exert those forces, and then there's this idea of reasoning or common sense generalization. And they have to be able to interact with anything in the world, even if they haven't seen it before. You can also sit right there on the ledge or on the table, and it's fine as long as you know how to balance. So there's two things that are interesting. One is that you cannot learn the second thing without first learning the first one. Psychologists have captured this widely. But also the second thing is that we are building our robots with exactly the same model. We're building two different brains that are compatible with each other. And I would say that one of the cool things about this partnership is that we have, in combination, a team that not only can provide excellent hardware, but also has the ability to bring those two things together. And it's like the A-team to build sort of a generalist robot for physical labor.

JACKLYN DALLAS: And historically you guys were working on a lot of different form factors for robotics. But we're specifically talking about Atlas today. Can you tell me about why a humanoid robot makes sense? I feel like there are so many different morphologies that you could do for robots. How do you pick which one to use?
I feel like we're moving from a world of very specialized robotics, like robots in factories, just an arm, to these more, like, generalist robots.

ALBERTO RODRIGUEZ: Yeah, I mean, we actually just released the new generation of our Atlas robot, our humanoid robot at Boston Dynamics, which I believe in a second, we'll see. We'll be able to see a clip. But before, just getting into your question, why humanoid, right? Why do humanoids make sense? And I think that on one hand, maybe Kanishka can expand on this, the promise of scaling up data collection as a means to create generalization and common sense works best when you can learn from humans. That's the simplest path to scale up data generation that feeds these models. But also from a hardware perspective, actually the humanoid form factor, it turns out it also makes sense. Having two arms is way, way better than just having one arm, because you can balance loads more efficiently, because you can reposition objects in a way that would be hard to do with just one arm. Having two legs, it turns out that allows you to get almost any place that humans can go. Not just because you can step up on things, but you can also change your form factor. You can become thin in the right direction, but you can become strong to resist forces in the direction that you want. And being able to change that location of your feet also allows you to change the friction with the ground in a way that allows you to accelerate faster or decelerate faster, more efficiently than if you had wheels. So it just makes sense.

KANISHKA RAO: Yeah, I think robotics, we talk about physical AGI, and I think we define AGI in terms of what us humans can do. So I think for physical AGI, I expect that the robot can do anything I can do or an average human can do. So I think just for that reason alone, like working, or at least doing research on the humanoid is justified, because ultimately, that will be the test of can the AGI do all the things.
And the humanoid form factor is the best place to express that.

JACKLYN DALLAS: Makes sense. Let's look at the video.

ALBERTO RODRIGUEZ: Yeah, let's look at the video.

JACKLYN DALLAS: This is a new generation of Atlas.

ALBERTO RODRIGUEZ: Yes, this is the new generation of Atlas. And here, you see it doing a very physical task, lifting a fridge and carrying it to one of our developers who is very cool, sitting there. It was just asked to bring a soda. I guess that's why we need the reasoning model to understand the difference between soda and the fridge. But a couple of things that are very important about this robot, and why it's very important for us and for the partnership. This robot, while it maintains the physicality and agility that characterizes Boston Dynamics robots, it's been designed especially for mass manufacturing. So it has the simplicity in its design that is necessary to deliver on the reliability that we need to be able to do data collection at scale and deployments at scale. So we're super happy that we've been able to combine those two things in the same piece of hardware.

JACKLYN DALLAS: Can we talk about data collection and how we train robots? Because it seems like when I think about this AI moment that we're in, the breakthrough of the transformer has led to an exponentially huge amount of new information and better AI models. And it feels like with robotics, we've also experienced similar breakthroughs where historically we were training them by just brute force, having them do the activity over and over and over again. And now there's this whole new level of training with training in simulation. But there's also still teleoperation. Can you break down for me the different ways we train robots and models?

KANISHKA RAO: Yeah, so in robotics right now, there's maybe roughly two ways to train robots. And things that you can simulate are easy to learn in simulation.
So for example, the robot body itself, since you've built the hardware or Boston Dynamics has built the hardware, you can build very good simulators for it. And Boston Dynamics, they're the state of the art at doing this. So if you can simulate the task, I think you can then train it. You have the same verifiable rewards loop, and you can do RL, and you can train some really nice policies in simulation, and then you can transfer them to the real world. And that works well.

ALBERTO RODRIGUEZ: That's how the fridge-- [INTERPOSING VOICES]

KANISHKA RAO: Exactly. So I think a lot of the walking, the running, the whole body, the dancing, all of this is powered through a really good simulation, and then setting up this RL loop in simulation. Then there's this other bucket of learning, which is maybe the harder bucket of learning, and this will take longer maybe, this is the dexterity bucket where you cannot simulate the task because you want the robot to interact with all the objects in the world. And there's so many different things, and objects, and the ways you manipulate them in the world, it's hard to simulate all of that. So on the manipulation side, what has really led to a lot of the latest advancements has been leveraging these large foundation models and then weaving in real-world data with them. So I talked about VLAs a while ago. Essentially what you try to do is you try to train the robot on real-world data, and you have these physical tokens collected through teleoperation, and we interleave them with vision and language tokens. And that is a way to generalize the dexterity a bit more. But yeah, that bucket is mostly relying on real-world data collection.

JACKLYN DALLAS: And what is teleoperation? How does it actually work?

KANISHKA RAO: Teleoperation essentially, so the way the robots learn about the physical world is just like we do. It's through your own embodied experience. It's not showing a video to a robot and then it learns that.
Maybe in a year or so we'll be there. But at least for today, it's really important to understand how the robot moves. The robot needs to understand, hey, if I poke my finger right now, what will the world do. So teleoperation is when we collect data, we control the robot to move in the world and to do the task. And that way it builds this physics knowledge by interaction. So it's a human controlling a robot, and then the robot doing the task. And through this teleoperation data, it learns the physics and the task. And that's been the state of the art way of doing manipulation today.

ALBERTO RODRIGUEZ: And I would say that the tighter the embodiment of the pilot that is teleoperating the robot, the better the data that it can generate. In practice, what happens is that we usually put pilots in VR headsets so that they can actually see through the eyes of the robot so that they don't get to use information that the robot would not have when it would try to do that exact same task. So that this notion of what the robot will be able to observe is like ideally replicated one-to-one with what the pilots or demonstrators are generating.

JACKLYN DALLAS: What are the phases of this? Do we need more different types of training data to really get robots in the real world? Or is visual training data enough and we just need a lot more of it?

KANISHKA RAO: I think we can still do more with teleoperation, just because I think we've seen a lot of generality, but we still haven't found like that basic recipe of, again, if you think about vision, language, action models, there's way more vision and language tokens than action tokens. So I think the strategy for robotics right now is how do you leverage the intelligence in these bigger things to then accelerate robotics. So just last year, we announced our Gemini robotics model. And there one of the breakthroughs that we showed was we can introduce thinking in the physical world.
So the idea there is, typically in the digital world when you think it's mostly about what code you're going to write or what text you're going to output. With this model, we had thinking about your physical actions. So it would be if you're trying to grab something, it would be something like, should I close my hand right now? Will that pick up the object? Or should I move further and then close my hand? So by interleaving this thinking with physical action tokens, we're trying to make the actions a bit more general. So we're getting more bang for our buck in terms of the data. And I think we have a video of this in a laundry sorting video. Maybe you can take a look at that.

JACKLYN DALLAS: Yeah, let's do it.

[VIDEO PLAYBACK]

- Now with thinking, we're able to ask the robot to chain a whole series of tasks together and it's completely end to end. The thoughts and the actions are interleaved. I can see it running in person, which is really cool. Let's see what it does with the white cloth here. It's still trying to go for-- OK, so here's the thing. It said adjust the black bin a little bit, so I can pick up the cloth. So this is really cool. We didn't really train it to do this explicit thing. So these kinds of thoughts just emerge during inference, which is really cool to see. And it's super reactive. I can now switch this up on it. Sorry, Apollo. And it knows that I have done that and it will react.

[END PLAYBACK]

KANISHKA RAO: I think this is what we mean by generalization for robotics, because you can train a robot to be pretty instinctive and do repetitive things. But what makes us humans so good at all of this is when faced with a new situation, we can think our way through and solve that new thing. So this was an example where this setup was not in the training data, but it could think its way out of that problem.

JACKLYN DALLAS: Yeah, I see a video like that. I'm like, this is extraordinary. The robot's in real time, this feels like a huge breakthrough.
Which leads me to this bigger question that I think a lot of people in the room have probably been experiencing like hundreds of robots dancing in sync on Twitter, but then a video like that, or a video of a robot picking up a banana is actually the more impressive one. How can people understand the videos online of robots and what's a genuine breakthrough versus what's been like pre-programmed, and it's impressive but less impressive?

KANISHKA RAO: Yeah, so I think as you said earlier, I think the manipulation part of the problem is probably going to be, I think, the final chapter in robotics. That's the most difficult part. And sometimes, I work--

ALBERTO RODRIGUEZ: It's probably the only chapter. Getting robots to dance, it only gets you this far.

KANISHKA RAO: Exactly. So yeah, I feel like--

ALBERTO RODRIGUEZ: Not many people are going to pay for that.

KANISHKA RAO: Yeah, we work with this stuff every day, but it's hard to express to everybody how difficult dexterity is. We have these amazing hands that we use all the time and we take it for granted. But it is really hard. If you look at the state of AI today, we can code up operating systems in whatever 24 hours. And we can solve complicated math. But we can't scramble eggs.

JACKLYN DALLAS: Yeah, why is that hard?

KANISHKA RAO: Yeah, it is, because physical intelligence, it feels slightly different from the digital intelligence. Again, I have a two-year-old at home and he can walk, and run, and climb. He even can do some speech recognition and talking and language understanding. But he still struggles with dexterity. I don't think he can open up this can of water or zippers. So there's something different about physical intelligence. And I think our research of robotics will tell us about ourselves too. How do we learn this stuff and why is it different?
But yeah, the one thing I would say is the dexterity stuff, as Alberto also said, is the more difficult part of robotics versus the dancing, and the walking, and the running, just because, yeah, we still haven't cracked that problem fully yet.

ALBERTO RODRIGUEZ: Yeah, I would say that even just from a technology perspective of what's available to us today, as you were explaining, there's two techniques that we mostly use to engineer behaviors or to get generality to emerge, which are: one, learning from demonstration; and two, learning from trial and error, mostly in simulation. Because that's the only place where we can do the very expensive search of trial and error. And doing something like just grabbing this bottle and opening this cap is something that is, one, very difficult to demonstrate. If you imagine putting yourself in the mindset of a pilot that is piloting the robot, being able to do this without feeling the forces that you are feeling as you are trying to screw or unscrew the cap is just very difficult. But it's also very difficult to simulate realistically the compression that happens in the skin and the sensing that happens in the sensors, in the fingertips in my hand, to understand what's the behavior that is going to drive this. So we have ideas for how to solve those problems reliably, but we still don't have the one key technique that is going to enable the kind of dexterity that we want at the reliability level that is necessary for industrial-scale deployments that actually have value.

JACKLYN DALLAS: Will we have to create almost like a haptic model where it's giving-- I think right now, most of our breakthroughs have been the robots, just seeing things through cameras. How do we make them feel things? Is there a new type of model that comes out?

KANISHKA RAO: So, yeah, so I think the tactile puzzle, it's a real puzzle.
And I think today all the state of the art models that we are showing, at least for manipulation, they are vision-based, which again is a bit weird. Because if you think about all the manipulation that you're doing today in your daily lives, you're probably using your sense of touch and haptics way more than vision. So it's a bit of a conundrum, but maybe there's a few reasons for why vision is state of the art today. I think one is, again, we're building these models on top of frontier models. And there's way more vision data on the internet than any tactile data, so I think just the size of the data set. I think the other thing is we do use these wrist cameras. So robotics, you might see sometimes there's a camera on the wrist or the end effector. And that gives you a close-up view of the actual contact points. So maybe instead of actually feeling these things, if in the pixels you can capture the compression as you're talking about, maybe that's a substitute for tactile in the meantime, and that's how it's working. So there's some unreasonable effectiveness of this vision stuff. And I think we have a clip, maybe we can show this now, of a dexterity task that we--

JACKLYN DALLAS: Like origami, right?

KANISHKA RAO: Yeah, so we're using the robot to fold origami. And this model is completely vision-based. There's no sense of touch or force or any of that. And if you think about it, this task should not have been possible just by vision, but it is able to learn this quite effectively. And again, it's looking at the creases, the folds. And it's probably inferring the forces and the sense of touch from this. So it's neat that this is working with just vision. But I do feel that ultimately, yeah, I think hardware will get better. I think one thing is cameras are so ubiquitous. Skin is hard to build in hardware. So I think, yeah, those maybe hardware developments will also lead to some of this research. But yeah, as of today, vision is what's helping us get through this.
ALBERTO RODRIGUEZ: Yeah, maybe just an anecdote of how important tactile is, so part of the process of scaling data collection at the very, very large scale is equipping people with wearables, like putting a camera on your forehead and just asking people to just go about their lives, just doing as if they were doing normal life, and then using that as a means for pre-training the very large-scale models that reason about behavior. But it turns out that if you do that, the way that people move through their lives is like this. They're doing something here. And while they're thinking about what's the next thing they're going to do, they do it here and then they move. So actually the cameras, their eyes are not looking at what they do with their hands. It turns out that most of what people do is driven by tactile feedback and proprioception, and they didn't need to look at it other than just for a fraction of a second to understand what to do, and then how to lock in the hands, and then just let the hands do their magic. So my hypothesis, my bet is that we're not there yet because we're limited by hardware reliability when it comes to tactile sensing and cutaneous sensing. But the moment we get there, and I'm pretty sure we will get there because there's tons of interest today, we are probably going to see a gradual transition to relying less for control loops or for high-frequency control loops on visual sensing and using it more for common sense understanding, and using more tactile sensations for driving manipulation.

JACKLYN DALLAS: Yeah, I saw a really interesting study where this doctor numbed patients' hands and then watched how well they were able to do different tasks, and it was pretty atrocious. People were not able to do much. So it's amazing to me that the robot was able to learn any level of origami. I'm curious when we think about the different stages of robots, it seems like we have a lot of robots in factories right now.
And then maybe the second level is front of house, and then the third level is like homes. What's the roadmap look like? When do you think people in this room will have a robot in their life, and what breakthroughs are on our journey to get there?

KANISHKA RAO: So yeah, it definitely will be a while. It's not next year. So, yeah. Sorry.

ALBERTO RODRIGUEZ: What?

JACKLYN DALLAS: Things you hate to hear.

KANISHKA RAO: So, I think maybe in the next 5 to 10 years, we'll start seeing.

JACKLYN DALLAS: That's good.

KANISHKA RAO: Yeah, I think again, if there's one takeaway of this I would say is dexterity is hard and it's still an unsolved problem. So I think, yeah, that's one of the main challenges that we still have to solve. I think even generality is not fully solved. Maybe we can learn to open this bottle, but the robot still struggles to open any bottle. When humans learn this skill of unscrewing something, we can start unscrewing almost anything. So we can reapply these verbs to most things. But robots are still a bit narrow-minded. They can learn a verb with a few objects, and then maybe it'll generalize to a few more. But they don't universally learn that skill. So I think, yeah, generality is still like a big bottleneck. And then, yeah, dexterity is probably the big one that we still have to solve. And I think, as Alberto said, maybe we still are waiting for some hardware breakthroughs. A simple task, again for dexterity, is like picking keys out of your pocket or a purse. That's so difficult for a robot to do. We won't even attempt that today, but it's something we just naturally do every day. So I think, yeah, there's still a ways to go where hardware has to meet us there for tactile sensations. There's still work to be done.

JACKLYN DALLAS: What are robots really good at today? If we were to create a list of what they're already excelling at and exceptional, what would it be?
KANISHKA RAO: So yeah, I think again, we have seen, they're very good at whole body control. So even a few years ago, humanoids would not have been a thing because balancing was so hard. But again, like with really good simulation and state of the art breakthroughs in reinforcement learning, balancing is solved, I would say. I don't know if that's a hot take anymore, but yeah, balancing is solved. So humanoids are more--

ALBERTO RODRIGUEZ: I think it's a solved problem.

KANISHKA RAO: OK, you agree? So yeah, humanoids are more possible now. We can see them walking around. And I think basic manipulation, picking up something and placing it somewhere else, they're quite good at. Imitation learning plus this human-collected data is working well. So I think basic pick and place, and walking, and obviously like the dancing and all of this, today they're very good at.

ALBERTO RODRIGUEZ: Yeah. If you look at industrial tasks, for example, in manufacturing.

JACKLYN DALLAS: Yeah.

ALBERTO RODRIGUEZ: The leading skills that are necessary to make an impact are things that are actually very complex. So it's like handling cables. So you want to be able to handle something that is very deformable and put it somewhere, route it through hooks. It's using power tools. So you want to be able to grab a bolt driver, feel the strength of a good grasp, which is something that actually takes many years for people to know. Or just imagine a screwdriver. How long does it take for someone to actually feel what it feels to hold a screwdriver in a way that you will feel confident to actually use it? Some people never actually learn how to do that. So using power tools, and then the third one is bin picking, right? So there's a bin full of objects. And you want to fish one, not two, nor three, nor zero, but just one. Even the ones that get stuck in the corner of the bin, those are very hard things to do still today.
But I think that we are on a great trajectory from trying to squeeze and continuing to invest in these bets. Learning from demonstration, just like we've seen with large language models, has a very long route before performance saturates. I'm convinced that performance will saturate just from demonstration. At some point, robots are going to have to continue to improve performance from trial and error. They're going to have to get there, make mistakes, recognize, feel what it feels like to fail at something, and learn from that. Just like RL is extremely impactful in bringing large language models to high performance, the same thing is going to happen in robotics.
JACKLYN DALLAS: Is there a robotic equivalent to muscle memory, how a human that's played piano a thousand times is able to play and not really think about it? Are there tasks that the robot just gets so good at that it runs at a lower level?
KANISHKA RAO: So I would say most robot models today are in the regime of muscle memory, because again, they see the state and they just react. There's no thinking involved except for the latest models that we're working on. So I think they're kind of moving away from the muscle-memory reactive mode to thinking a bit more about what they're doing, a more thinking, intelligent mode. However, it's reactive, but it's still a large vision model that's controlling this. So in that sense, it's not tactile-based, but it's a vision-based reactive model: it just sees an image and it makes a move, and it doesn't really think about why it's making that move. If you ask even the state-of-the-art robotics models today, hey, why did you take that action? They don't give you an answer.
JACKLYN DALLAS: Oh, interesting.
KANISHKA RAO: Yeah, they don't. Yeah, so it's more of a reactive thing where the actions are just output. But they cannot explain to you why they did that.
JACKLYN DALLAS: How does the new model change that?
KANISHKA RAO: So yeah, with the new model, basically what we did is we interleaved thinking tokens and then action tokens. So before it outputs an action, we basically explicitly added some tokens about what it was doing. And this thinking could influence why it was taking the actions. So it still doesn't explain why it took the action. But you can see the thought trace that led to that action. And if you tweak the thought process, you actually get different actions. So you can influence the actions this way. So again, it's a little bit trying to connect that intelligence of motion and physicality to the digital intelligence. But it's still not fully there yet where you can talk to the robot. Hey, why did you do that?
JACKLYN DALLAS: Can you give me an example of something that you would tweak that would then lead to a different action?
KANISHKA RAO: Yeah, it was just simple things like, let's say it's trying to grab something, and in its thought process it might say, hey, I'm close enough to close my hand and I will grab the water bottle. And if that's its thinking trace, it will just close its hand. But the experiment we tried to do is we edit the thinking trace and we say, I'm not close enough, I need to go lower. And if that's the thought it's conditioned on, then the action would be to go lower. So it makes the models more interpretable and also steerable, because you can look at the thoughts and understand why it is taking that action.
JACKLYN DALLAS: And does that replace the VLAs, the other types of models, or do they all come together?
KANISHKA RAO: So at least this model I talked about was still like a VLA-style model where it's based on top of a vision-language model. I think the one interesting thing is again, we're talking about how robotics is getting impacted by these big breakthroughs in just large frontier models. And I think the one thing I'm really excited about is we saw these omni models, video models.
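The thought-editing experiment Rao describes can be illustrated with a toy sketch. This is not the actual model: the real system is a learned VLA, whereas here `toy_think` and `toy_act` are hypothetical rule-based stand-ins, and the 2 cm grasp range is an assumed value. The sketch only shows the idea that the action is conditioned on the thought text, so overriding the thought changes the action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    thought: str  # the "thinking tokens", as plain text
    action: str   # the "action tokens", as a symbolic command


def toy_think(distance_cm: float, grasp_range_cm: float = 2.0) -> str:
    """Hypothetical thinking stage: describe the state in words."""
    if distance_cm <= grasp_range_cm:
        return "I'm close enough to close my hand and grab the water bottle."
    return "I'm not close enough, I need to go lower."


def toy_act(thought: str) -> str:
    """Hypothetical action stage: depends only on the thought text."""
    if "not close enough" in thought:
        return "move_lower"
    return "close_hand"


def step(distance_cm: float, thought_override: Optional[str] = None) -> Step:
    """Emit a thought, then an action conditioned on that thought.

    Passing thought_override mimics editing the thinking trace by hand.
    """
    thought = thought_override if thought_override is not None else toy_think(distance_cm)
    return Step(thought=thought, action=toy_act(thought))


if __name__ == "__main__":
    print(step(1.0))  # unedited trace
    print(step(1.0, thought_override="I'm not close enough, I need to go lower."))
```

With the same observed distance, the unedited trace leads to closing the hand, while the edited trace leads to moving lower, which is the steerability property described in the conversation.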
JACKLYN DALLAS: Let's go, yes!
KANISHKA RAO: Again, we're talking about physical intelligence, which is intelligence about motion, and physics, and gravity, and friction, and all these things. And these are not really captured in language or even an image. But these models, the video models, these world models, they are able to output videos that look pretty realistic. And I've played around with them and tried to ask them to create videos of dexterous manipulation. And they look pretty good. So they do seem to, again, yeah, Demis was talking about how they understand some basic physics. I do think that they have some understanding of physics through video production, and it will be really interesting to see how, again, robotics can leverage some of this physics and motion intelligence, and then build robot models on top of that.
JACKLYN DALLAS: At Boston Dynamics, what are you personally spending the most time on right now?
ALBERTO RODRIGUEZ: Sorry, what?
JACKLYN DALLAS: At Boston Dynamics, what are you personally spending the most amount of time on right now?
ALBERTO RODRIGUEZ: The most interesting thing we're working on right now?
JACKLYN DALLAS: What are you like spending the most amount of time on?
ALBERTO RODRIGUEZ: Oh, the most amount of time.
JACKLYN DALLAS: Yeah.
ALBERTO RODRIGUEZ: So for us, we design robots not because we like designing robots, but because we know that there's a certain sort of physicality that is possible to iterate through motion. And we want to bring that to products for our customers. And one of the things that we're most interested in is understanding what's the right kind of balance of models to be able to-- our robots develop a sense of general physical understanding of interacting with the world. There's some sense of frequency analysis, so to speak, in how a robot has to make decisions. That tells us already that there are certain things that have to be decided once per second.
When you decide, oh, this is the thing that I want to interact with, and it's not that thing. And I should reach out with a certain sort of direction, or with a certain sort of preshape of the hand. But then there are certain decisions that have to be made way, way, way faster, because when the hand is accommodating to the bottle, it's a control loop in which the robot has to make decisions 50 times a second or 100 times a second. The same thing happens with balance. When the robot is trying to carry the fridge and it needs to react to the fridge side balancing. And the fact that there are these differences in the frequency at which the robot has to make decisions already tells us that we need an architecture that is tailored to that sort of decision making. So we're very interested in understanding what is the right structure to drive a system that has general physical understanding, and what's the right way to populate it with intelligence. Is it just demonstration, or is it trial and error, or a combination of those two things?
JACKLYN DALLAS: Totally. So on my channel, I make optimistic tech videos, what the future will look like if things go well. And I think you guys are two of the driving forces in making that future amazing. I want to hear from both of you. Take me into your head. What does your vision for the future look like? What are you trying to get us to? If everything goes right, what does the world look like 10 years from now?
KANISHKA RAO: Great question. Yeah so I think--
ALBERTO RODRIGUEZ: How many millions of humanoids are we going to have 10 years from now?
KANISHKA RAO: Yeah, I think if everything goes well and we can solve dexterity, I think there's still a huge safety research puzzle as well. I think robots are not going to be useful unless they can be really safe. So I think that's another big roadblock. I think in that sense, robotics is like autonomous driving, too. We have to really solve that part too.
It can't be a side thought. So we're developing all our safety AIs along with this. But I think, yeah, in 10 years, if we are successful, I do think a lot of basic, general day-to-day stuff, we can have robots do. And I think we'll see robots more amongst us helping us. Just on a personal note, I hate to do daily chores. I know it's a very first-world problem, but I think a lot of those kinds of things, the boring, the dull, the repetitive, the dangerous tasks, robots will be physically doing. And I think, generally the goal is to benefit humanity. And I think physical intelligence can enable all these form factors to help us in that. So that's the dream, that in 10 years we will be there.
JACKLYN DALLAS: I believe. I think you guys can do it.
KANISHKA RAO: Yeah.
JACKLYN DALLAS: What about for you?
ALBERTO RODRIGUEZ: Yeah, I would say it's similar. The thing that drives us is to make arduous physical labor, the kind that is backbreaking and very tedious, an option. And we're doing it with our existing products, like Stretch is a robot that unloads boxes from trucks. And these boxes are 50 to 60 pounds. And these trucks, they're not controlled for temperature. So sometimes they can be at 100 degrees Fahrenheit. And the robot does it and it doesn't complain. Or Spot is another example, right, where you will deploy it to patrol and do inspections in your factory every day at the exact same time of the day, and do inspections on the same 100 places in your factory, or 1,000 inspection points. And it has to do that every single day. And most of those days, maybe 499 times in a row, nothing will happen, extremely tedious. But then all of a sudden, the 500th time, something feels off in one of the pieces of machinery in your factory. And having to keep the attention span and having to keep the interest in understanding that just something went wrong is very, very tedious. And using robots to do those kinds of jobs, it makes sense.
And we want Atlas, we're targeting Atlas as a machine that is capable of very arduous labor. That's why it's strong. That's why it has the kind of payload that it has.
JACKLYN DALLAS: So you're more focused on the industrial side of robotics, would you say?
ALBERTO RODRIGUEZ: At this time, currently, yes. Clearly, we've seen that industry, especially manufacturing, is a good entry point, at least for us. It gives us a window into a market that has the right cost for where humanoids are today. And it gives us a window where there are ways to mitigate safety concerns in ways that would be very difficult to do at home, for example.
JACKLYN DALLAS: Yeah.
ALBERTO RODRIGUEZ: And it gives us a ramp-up to eventually go to other markets, like the home, for example, that could also be extremely beneficial.
JACKLYN DALLAS: Well, that's amazing. I admire both of you so much. Thank you for the panel. Thanks, everyone, for watching.
[MUSIC PLAYING]
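Rodriguez's earlier point about decision frequencies (some choices roughly once per second, others 50 to 100 times per second) can be illustrated with a minimal two-rate loop. This is only a sketch under stated assumptions, not Boston Dynamics' architecture: `slow_decide`, `fast_control`, the target values, and the proportional gain are all hypothetical, and time is simulated in ticks rather than run in real time.

```python
FAST_HZ = 100  # assumed rate of the fast control loop (grasp, balance)
SLOW_HZ = 1    # assumed rate of the slow decision loop (what to reach for)


def slow_decide(t_seconds: float) -> float:
    """Hypothetical slow decision: pick a target hand aperture.

    The targets 1.0 and 0.5 are arbitrary illustrative values.
    """
    return 1.0 if t_seconds < 1.0 else 0.5


def fast_control(position: float, target: float, gain: float = 0.1) -> float:
    """Hypothetical fast controller: one proportional step toward the target."""
    return position + gain * (target - position)


def run(seconds: int = 2) -> tuple[float, int, int]:
    """Simulate both loops; return (final position, slow calls, fast calls)."""
    position = 0.0
    target = 0.0
    slow_calls = 0
    fast_calls = 0
    ticks_per_slow = FAST_HZ // SLOW_HZ
    for tick in range(seconds * FAST_HZ):
        # The slow loop only runs on every 100th tick of the fast loop.
        if tick % ticks_per_slow == 0:
            target = slow_decide(tick / FAST_HZ)
            slow_calls += 1
        # The fast loop runs on every tick, tracking the latest target.
        position = fast_control(position, target)
        fast_calls += 1
    return position, slow_calls, fast_calls


if __name__ == "__main__":
    final_position, slow_calls, fast_calls = run(seconds=2)
    print(final_position, slow_calls, fast_calls)
```

Over two simulated seconds the slow decision runs twice while the fast controller runs 200 times, which is the kind of frequency gap that motivates an architecture with separate layers for each rate.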
