Power the future of robotics with Gemini

Google · Google for Developers · May 21, 2026 at 17:15 · 18:05

INTRO

Google DeepMind is advancing robotics with Gemini-powered models that let machines perceive, reason, and act in complex real-world environments.

KEY POINTS

From digital AI to physical agents

Artificial intelligence is expanding beyond text and data processing into real-world applications dominated by uncertainty and physical interaction. Robotics poses challenges such as interpreting cluttered environments, recognizing hazards (e.g. spilled liquids), and navigating dynamic spaces. DeepMind's approach aims to make machines reliable outside controlled settings.

Gemini Robotics ER model for reasoning

The Gemini Robotics ER 1.6 model is designed for robotics, combining vision, language understanding, and spatial reasoning. Unlike a general-purpose language model, it acts as a "logic unit," supporting perception and planning in the perceive-plan-actuate loop.

Open-vocabulary object detection

The system replaces fixed-label models with vision-language models (VLMs) that understand natural-language descriptions. Robots can identify objects from abstract requests such as "the tool that looks like it has been used the most" without predefined datasets, which reduces environment-specific training and eases deployment.
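A minimal sketch of how a pointing response might be consumed on the robot side. It assumes the model replies with a JSON list of objects, each carrying a `point` in `[y, x]` order normalized to 0-1000 and a `label`, possibly wrapped in a Markdown code fence. That response shape, the helper name `parse_points`, and the sample data are assumptions for illustration and are not confirmed by the talk.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker the reply may be wrapped in


def parse_points(response_text, image_width, image_height):
    """Convert normalized [y, x] points (0-1000) into pixel coordinates."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence (and its language tag) and the closing fence.
        text = text[len(FENCE):]
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        text = text.rsplit(FENCE, 1)[0]

    detections = []
    for item in json.loads(text):
        y_norm, x_norm = item["point"]  # assumed order: y first, then x
        detections.append({
            "label": item["label"],
            "x": round(x_norm / 1000 * image_width),
            "y": round(y_norm / 1000 * image_height),
        })
    return detections


# Hypothetical reply for the prompt "point to the screwdriver".
sample = '[{"point": [500, 250], "label": "screwdriver"}]'
print(parse_points(sample, image_width=640, image_height=480))
```

Keeping the conversion in one small function makes it easy to adjust if the actual response format differs from the one assumed here.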

Physical common sense and semantic reasoning

The model incorporates physical reasoning, letting it infer weight, fragility, and structural constraints. It can avoid lifting an object that is too heavy or handle fragile items with care. It also interprets ambiguous instructions using context, for example what "put this away" means.

Temporal and video-based understanding

Built on the Gemini 3 Flash architecture, the system accepts multi-image and video inputs for long-horizon reasoning. Robots can analyze sequences of events, such as checking whether a grasp succeeded, which improves validation without complex hand-written rules.
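One way to prepare input for this kind of success check is to pick a handful of evenly spaced frames from the most recent stretch of video, in chronological order, before sending them to the model. The sketch below shows only that selection step. The helper name `sample_frame_indices`, the 30-second window, and the frame budget are assumptions chosen for illustration and do not come from the SDK.

```python
def sample_frame_indices(total_frames, fps, window_seconds=30, max_frames=8):
    """Pick up to max_frames evenly spaced frame indices from the last
    window_seconds of a recording, oldest first."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if max_frames == 1:
        return [total_frames - 1]  # only room for the latest frame

    window = min(total_frames, int(fps * window_seconds))
    start = total_frames - window
    if window <= max_frames:
        return list(range(start, total_frames))

    # Spread the picks so the first and last frames of the window are included.
    step = (window - 1) / (max_frames - 1)
    return [start + round(i * step) for i in range(max_frames)]


# 90 seconds of video at 10 fps, keeping 4 frames from the last 30 seconds.
print(sample_frame_indices(total_frames=900, fps=10, max_frames=4))
```

Including both the first and the last frame of the window gives the model the state before and after the grasp attempt, which is what a question like "did it slip?" depends on.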

Self-improving perception through code generation

A feature called "agentic vision" lets the model generate code to preprocess images (cropping, rotation). This improves label reading, gauge inspection, and anomaly detection in industrial settings.
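To make the crop-then-rotate idea concrete, here is a dependency-free sketch of the two operations on an image stored as a nested list of pixel values. It is only an illustration of the preprocessing steps described: the code the model actually generates is not shown in the talk and would more likely use an imaging library, and the helper names `crop` and `rotate_180` are hypothetical.

```python
def crop(image, top, left, bottom, right):
    """Return the sub-image covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in image[top:bottom]]


def rotate_180(image):
    """Rotate the image by 180 degrees (reverse the row order and each row)."""
    return [row[::-1] for row in image[::-1]]


# A tiny 3x4 "image"; each number stands in for a pixel.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# Crop down to the region of interest, then flip it the right way up.
region = crop(image, top=1, left=1, bottom=3, right=3)
print(rotate_180(region))
```

Cropping first keeps the later steps focused on the chip or gauge of interest, and the rotation fixes upside-down text before the model tries to read it.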

Task orchestration and motion planning

The system translates high-level instructions into sequences of executable actions. For example, "put the blue block in the orange bowl" is decomposed into grasp, move, and release steps. It also supports trajectory planning with intermediate waypoints and obstacle avoidance.
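The decomposition can be sketched as a plan of named function calls that the developer's code executes against the robot. In the sketch below the plan is written by hand to stand in for model output, and `SimulatedArm`, `execute_plan`, the function names, and the coordinates are all hypothetical. Nothing here reflects a confirmed SDK interface.

```python
class SimulatedArm:
    """Stand-in for real hardware: records each command instead of moving."""

    def __init__(self):
        self.log = []

    def open_gripper(self):
        self.log.append("open_gripper")

    def close_gripper(self):
        self.log.append("close_gripper")

    def move_to(self, x, y):
        self.log.append(f"move_to({x}, {y})")

    def return_home(self):
        self.log.append("return_home")


ALLOWED = {"open_gripper", "close_gripper", "move_to", "return_home"}


def execute_plan(arm, plan):
    """Run each planned step in order, rejecting functions the robot does not expose."""
    for step in plan:
        name = step["function"]
        if name not in ALLOWED:
            raise ValueError(f"Unknown function in plan: {name}")
        getattr(arm, name)(**step.get("args", {}))
    return arm.log


# Hand-written plan for "put the blue block in the orange bowl".
plan = [
    {"function": "open_gripper"},
    {"function": "move_to", "args": {"x": 120, "y": 340}},  # blue block
    {"function": "close_gripper"},
    {"function": "move_to", "args": {"x": 400, "y": 210}},  # orange bowl
    {"function": "open_gripper"},
    {"function": "return_home"},
]

print(execute_plan(SimulatedArm(), plan))
```

Checking every step against an allow-list before dispatching means a malformed or unexpected plan fails loudly instead of reaching the hardware.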

Vision-Language-Action (VLA) models

DeepMind has introduced VLA models that map visual inputs and commands directly to motor actions. They enable real-time control for tasks such as cleaning up or manipulating objects without explicit programming.

Dexterity and general-purpose embodiment

The models aim to run on varied hardware, humanoid or not, while improving fine motor skills (e.g. plugging in cables). Demonstrations show new tasks being carried out, such as placing a ball through a hoop, without prior training.

Human-robot interaction and live feedback

The Gemini Live API enables low-latency, bidirectional communication combining voice, vision, and function calls. Natural conversation can trigger actions, making collaboration more intuitive.
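On the developer side, function calling usually comes down to two pieces: a declaration that describes the function to the model, and a handler that runs when the model asks for it. The sketch below shows both for a hypothetical `tighten_screw` function, using an OpenAPI-style parameter schema. The wiring to the Live API session is deliberately left out, and the shape of the incoming call (a dict with `name` and `args`) is an assumption.

```python
# Declaration the model would see (hypothetical function, OpenAPI-style schema).
TIGHTEN_SCREW = {
    "name": "tighten_screw",
    "description": "Tighten the screw at the given image location.",
    "parameters": {
        "type": "object",
        "properties": {
            "x": {"type": "integer"},
            "y": {"type": "integer"},
        },
        "required": ["x", "y"],
    },
}


def tighten_screw(x, y):
    """Placeholder for the real robot routine."""
    return f"tightening screw at ({x}, {y})"


HANDLERS = {"tighten_screw": (TIGHTEN_SCREW, tighten_screw)}


def handle_function_call(call):
    """Validate a requested call against its declaration, then run it."""
    name = call.get("name")
    if name not in HANDLERS:
        return {"error": f"unknown function: {name}"}
    declaration, handler = HANDLERS[name]
    args = call.get("args", {})
    missing = [p for p in declaration["parameters"]["required"] if p not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    return {"result": handler(**args)}


# Simulated request after the user says "That screw looks a little loose."
print(handle_function_call({"name": "tighten_screw", "args": {"x": 10, "y": 20}}))
```

Returning an error dict rather than raising lets the result be passed back to the model, which can then ask the user for the missing detail.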

Safety frameworks and real-world benchmarks

DeepMind highlights layered safety systems inspired by the Swiss cheese model. The Asimov benchmarks incorporate real injury data from the National Electronic Injury Surveillance System and align with ISO standards to assess risk.
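The layered idea can be illustrated with a few independent checks, each able to object to a proposed action, so that a gap in one layer may still be caught by another. This is a toy sketch and not DeepMind's safety system: the layer functions, the action fields, and the thresholds (2.0 kg, 0.25 m/s) are invented for illustration.

```python
def semantic_layer(action):
    """Object to targets that should never be manipulated."""
    if action["target"] in {"person", "pet"}:
        return "semantic: target must not be a living being"
    return None


def physical_layer(action):
    """Object to payloads beyond an (assumed) 2.0 kg limit."""
    if action["payload_kg"] > 2.0:
        return "physical: payload exceeds 2.0 kg limit"
    return None


def operational_layer(action):
    """Object to fast motion when a human is nearby (assumed 0.25 m/s cap)."""
    if action["human_nearby"] and action["speed_m_s"] > 0.25:
        return "operational: too fast with a human nearby"
    return None


LAYERS = [semantic_layer, physical_layer, operational_layer]


def check_action(action, layers=LAYERS):
    """Run every layer; the action is allowed only if none of them objects."""
    objections = []
    for layer in layers:
        message = layer(action)
        if message is not None:
            objections.append(message)
    return len(objections) == 0, objections


risky = {"target": "toolbox", "payload_kg": 3.5, "human_nearby": True, "speed_m_s": 0.5}
print(check_action(risky))
```

Running all layers instead of stopping at the first objection gives a fuller picture of why an action was refused, which helps when tuning each layer separately.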

CONCLUSION

Integrating reasoning, perception, and action into unified Gemini-based systems marks a major step toward autonomous robots that can operate effectively and safely in the real world.

Full transcript

[music] >> Hey all, I'm Paul Reese and I'm the developer relations lead for robotics at Google DeepMind. For those of you that don't know me, I would generally describe myself as a maker. Whether it's something low-tech like gardening and woodworking or more advanced projects involving the Internet of Things and complex machines, I've always preferred doing things in the real world, which is why I'm very happy to be able to tell you about all the work we're doing in robotics here at DeepMind. All right, so typically AI has existed within the digital realm. It has been amazing for things like editing text, analyzing massive data sets, or writing code. But the physical world is a whole other beast. For a robot, understanding that a discoloration and reflection on the floor is potentially a spilled liquid or figuring out the exact trajectory needed for mapping a busy factory has historically been a massive challenge. DeepMind has been working to bridge that gap by using Gemini models to move from static chatbots to physical agents. We aren't just teaching machines to see the world, but we're enabling them to perceive it with advanced spatial awareness and understanding. We want robots that can navigate unstructured and messy environments rather than only predictable and organized scenes while being able to interact naturally with humans and perform complex tasks. Okay. That's all good, but anyone can say that they're doing great things. So, let's actually look at this in more detail. In robotics, we usually follow a perceive, plan, actuate flow. Google's vision-language models are perfect for those first two stages, including the latest version of Gemini Robotics Embodied Reasoning, or ER 1.6. Unlike a general LLM, the ER model is fine-tuned on robotics data, specifically for spatial reasoning and understanding to essentially be the logic unit of the robot system. To give you an idea of what this actually looks like in practice, let's look at some of the capabilities.
First, there's 2D pointing and object localization. The model is exceptional at identifying points on objects, specific parts on objects where you may want to grasp them, and general labeling, or giving the model a list of objects to look for by their state, such as on desk or closed, and having it identify those objects based on their unique states. What's great about this is that it's easy to get started with our Python SDK. You can initialize the client, pass in the model ID for the latest version of Gemini Robotics ER, your prompt requesting the objects that you're looking for, be it specific items or more abstract requests like the screwdriver I need for the specialty screw, and any additional Gemini configuration items, like temperature or thinking level, to get a response that you can use in your robotics project. All right, so let's look under the hood for a second. If you've worked with computer vision before, you've probably used a model like YOLO that was trained on fixed data sets like COCO or ImageNet. Those are fantastic for what they are, but they are closed vocabulary, meaning they can only identify a predefined list of objects. If you need your robot to find a particular tool for cutting a dovetail joint when woodworking, but that wasn't in your training data set, you're pretty much out of luck. With Gemini Robotics ER, we've moved to a vision-language model, or VLM, architecture. When you use the generate_content method in the SDK, you aren't just running a classification script. The model uses semantic grounding to locate whatever you describe in natural language. Because language and vision are mapped in the same space, the model doesn't need a specific label for every object in your warehouse. This is called open-vocabulary object detection. You can ask for the tool that looks like it's been used the most, or the component that is currently overheating.
The model reasons through the visual data to ground those abstract concepts into precise coordinates. For developers, this is a massive win, as it eliminates the need to build, label, and maintain custom vision models for every unique environment. Now, the reason we call it embodied reasoning instead of just a vision model is because it understands the physics of the scene, not just the pixels. In traditional robotics, if you told a robot to clear the table, it might try to pick up a dish that's far too heavy for it, potentially stripping a gear or dropping the plate, because it doesn't understand weight or structural integrity. Gemini Robotics ER has what we call physical common sense. When it looks at a scene, it isn't just trying to identify the plate and the food, but it's reasoning about the relationship between them. It knows that to move a dish, it must first move food into smaller containers that could be lifted, or request human intervention. It also understands that a glass bottle is fragile, while a plastic one isn't. This reasoning layer allows the robot to handle unseen scenarios, like knowing it shouldn't try to lift a table that is obviously bolted to the floor. It's this kind of practical intuition that has historically been a nightmare to script manually. Beyond physical common sense, there's semantic reasoning. One of the hardest things for an AI-backed robot to navigate isn't necessarily a messy environment, but rather a vague human prompt. If we tell a robot, "Hey, put this away," it's suddenly trying to figure out what "this" is and where "away" is. The Gemini Robotics ER 1.6 model excels at navigating this ambiguity. It uses the visual context of the environment to make an educated guess. If it sees someone holding a screwdriver, and there's an open toolbox nearby, it reasons that "away" likely means that specific tray through the use of the model's thinking capabilities.
For developers, this means we're spending less time writing brittle if-else statements for every possible edge case, and more time defining the actual high-level goals for your robots. Now, typically when we use these models, we think in terms of snapshots, like sending one image to generate_content and getting one answer. But robotics is inherently a temporal problem. So, we need to think about what happened 10 seconds ago in order to solve what we need to do now. Because Gemini Robotics ER 1.6 is built on the Gemini 3 Flash backbone, it natively supports video and multi-image inputs. This allows for what we call long-horizon temporal reasoning. Instead of asking, "Is the door open?" you can pass a sequence of frames from the robot's journey through a facility. And because the model tokenizes these frames chronologically, it can reason about state changes. For example, you can prompt the model with something like, "Look at the last 30 seconds of video. Did the gripper successfully secure the object, or did it slip?" The model isn't just looking at the final frame. It's looking at the motion and delta between frames. This is a game-changer for success detection. In the past, you'd have to write complex heuristic code to check if a task was completed. Now, you can simply use the model as a temporal supervisor that watches the robot's work and confirms that the physical state of the world has actually changed in the way that you intended. Reasoning gets us 90% of the way there, but sometimes the visual input is pretty difficult to process. One issue we see fairly regularly with standard Gemini is that images sent to the model may not be that great for determining what's in them due to size, rotation, clutter, or really a variety of other factors. We've addressed this by adding support for agentic vision to the Gemini Robotics ER 1.6 model.
By enabling the code execution tool with the SDK, the ER model can now generate code for intermediate steps, allowing the model to manipulate those images itself to get a better understanding of content. Let's take a look at an example. Say you have a production line that makes circuit boards, and one of the steps involves taking the unique ID off of a certain chip to record it, such as the ESMT chip in this image. With traditional LLMs, this text might be difficult to read, or it could get mixed up with other text on the board. By enabling the code execution tool, Gemini writes the code to take the image, locate the chip that you're looking for, and then crop the image down to just the necessary area. In this particular example, the chip and text are still upside down. So, the Gemini Robotics ER model does one more step with its generated code to rotate the intermediate image into a more readable orientation, making it far more likely to correctly get the numbers off of that chip. The usefulness of this feature extends far beyond just reading text, as Gemini Robotics ER has been optimized for the common task of reading analog gauges and understanding common environments, like factories and plants, to detect anomalies. If you have a robot running around a facility, you can have it checking pressure gauges, making sure doors are properly closed, looking for spills or snow on the ground, or performing any number of other tasks with the same simple tooling. Once the robot perceives what's around it, we can move into the planning stage through task orchestration. Think of this like giving a robot a toolbox of functions. In your code, you can provide the model with a list of specific operations that your robot can perform, like open gripper, move to coordinate, detect objects, return to home, or really anything else.
When you give the robot a high-level task, like put the blue block in the orange bowl, the Gemini Robotics ER model will break that task into a series of smaller tasks to plan out the entire operation. It looks at the scene, identifies the blue block and the orange bowl, and then determines what order the functions should be called to open the gripper, move towards the block, close the gripper, move it over to the drop-off zone, open the gripper, and then return to a neutral pose. Along with high-level orchestration, the Gemini Robotics ER model can handle micro planning, like coming up with trajectories that a robot should follow. Instead of just a start and end point, the Gemini Robotics ER model can generate a series of intermediate waypoints, while also highlighting ways to avoid obstacles. So, the final part of the perceive, plan, actuate loop, as you can probably guess, is actuate. This is where the gripper meets the object, so to speak. In robotics, actuation is just a fancy way of saying that we're moving the hardware, usually by sending specific values to actuators, or motors and joints, that make the robot go. Now, I know the tendency is to say, "Look at this amazing AI feature that should change everything." But, in traditional robotics, if you have a machine in a factory repeating the exact same motion 10,000 times a day, standard control theory works just fine. But, there are so many scenarios that people are exploring now that aren't in a controlled environment as they branch out into the real world. This leads to environments that are messy, full of objects the robot has never seen before, and situations that weren't in the original training data. For example, a robot rolling down the street trying to deliver food could encounter everything from a junky free couch blocking the entire sidewalk to a flock of unyielding grumpy Canadian geese.
To help solve this, we've introduced our vision-language-action, or VLA, models, including the flagship Gemini Robotics model, which is currently available through our trusted tester program. These models map camera pixels and natural language instructions directly to blocks of motor values. You give it a prompt like, "Clean up the desk." and the VLA streams camera frames to determine exactly how the actuators should move to get the job done. This can even be used in conjunction with the Gemini Robotics ER model to take a video of a larger task and break it into individual parts for actuation, letting the robot solve a series of smaller problems to complete the big one. What makes this special is that we've built these VLAs on top of the state-of-the-art Gemini backbone. Because the model has a massive general understanding of the world, the robot is able to understand context and details about the objects. For example, when we gave our robot a small basketball and a net game and asked it to do a slam dunk, the model was able to take the ball and place it through the hoop despite not being previously trained for this. Beyond simple pick-and-place operations, we're also pushing for true dexterity. We want our machines to do more than just move around, so we're training our models for complex tasks like plugging in cords or performing fine detail precision operations on a workbench. Finally, we've taken a generalist approach to embodiments. Whether you are working with a humanoid, a quadruped, or a mounted bipedal setup, our goal is to meet you where you are. These models are designed to be embodiment-agnostic, providing a powerful, general-purpose brain that can be adapted to whatever hardware you're working with. By combining the high-level reasoning of the ER model with the reactive, real-time control of the VLA, we're closing the loop on truly autonomous physical agents.
All right, so at this point we've gone over the tools for the full perceive, plan, and actuate loop. But, that's really just scratching the surface. The next big question is how do you actually talk to the robot? And how does the robot keep you in the loop? For example, if you have a humanoid robot, people will likely tell it things like hello and thank you. But, there's a lot of variance in how they might say these. If your robot isn't programmed for each variation, it could get kind of awkward as it stands there and just feels unnatural. But, with additional tooling, the work for smooth human-robot interactions could be handled for you. To accomplish this, I'm a big fan of the Gemini Live API. This is what turns a robot from a pre-programmed machine into an interactive partner. It enables low-latency, bidirectional conversation between you and the Gemini model, letting you give instructions through natural language, like saying, "That screw looks a little loose. Can you tighten it?" On top of that, this isn't just about audio. The Live API allows you to stream camera frames directly to the model, providing constant visual context. This helps you create feedback loops where you can use this stream to even alter plans during runtime. Where the real magic happens though is through function calling. Based on what the model hears and sees, it can decide to trigger a specific developer-defined function. This is the bridge that allows a natural conversation to turn into precise physical action, opening the door to incredibly fluid human-robot interactions. And while you're planning out your builds, having a playground to test your logic is definitely a nice-to-have. And that's where AI Studio comes in. This lets you rapidly prototype prompts and test how the model perceives images from your specific hardware without having to constantly re-flash or reload scripts on the robot. Along with this, one of my favorite areas is the build section.
We've actually created a few web app templates to show off what's possible, like integrating the MuJoCo simulation engine directly into the browser. As you can see here, we're running a virtual robotic arm that uses the Gemini Robotics ER model to detect block locations, then performs a pick-and-place task based on that information without the risk of testing on real hardware, really lending itself to a fail-fast, fail-safe strategy. And to wrap up this whirlwind tour of Google DeepMind's robotics offerings, let's talk about safety. In the digital world, if an AI hallucinates, it might give you a weird recipe or a buggy snippet of code. Not ideal, but you know, not catastrophic. When this happens in robotics, we're talking about real-world hardware, sometimes with significant weight, potentially moving around in the same space as people. As we all probably know, tools like a simple table saw deserve respect because of the physical risk involved. When you add an autonomous brain to a machine, that respect for safety has to be built in from the ground up. This is why our team is engaged in ongoing safety research to develop a holistic approach to robot protection. We like to think of it as the Swiss cheese model of defense. No single layer is a perfect barrier, but by stacking multiple layers of safeguards, spanning the semantic, physical, and operational aspects of safety, we can effectively mitigate risk. To evaluate these layers, we've introduced the Asimov safety benchmarks. As discussed in our research, these aren't just theoretical. First, they are grounded in reality using the NEISS, or National Electronic Injury Surveillance System, database. This contains real-world injury reports from hospitals to teach the model about physical common sense. Second, we ground our work in established industrial ISO standards to ensure our benchmarks reflect the same operational safety constraints used in factories today.
So, that's a quick look into what we're working on within the robotics space at Google DeepMind. At the end of the day, we're building tools to be your partners in creation. So, I'm looking forward to seeing what you all build. You can find more information in our developer docs and cookbooks on GitHub, and sign up for the trusted tester program on the Gemini Robotics webpage. Thanks for listening in, and I'm looking forward to seeing what you build. >> [music]
