Power the future of robotics with Gemini

Google • Google for Developers • May 21, 2026 at 05:15 PM • 18:05

TL;DR

Google DeepMind is advancing robotics with Gemini-powered models that enable machines to perceive, reason, and act in complex real-world environments.

KEY POINTS

Shift from digital AI to physical agents

Artificial intelligence is expanding beyond text and data processing into real-world applications where uncertainty and physical interaction dominate. Robotics presents challenges such as interpreting messy environments, recognizing hazards like spills, and navigating dynamic spaces. DeepMind’s approach focuses on enabling machines to operate reliably outside controlled settings.

Gemini Robotics ER model for reasoning

The Gemini Robotics ER 1.6 model is designed specifically for robotics, combining vision and language understanding with spatial reasoning. Unlike a general-purpose LLM, it is fine-tuned on robotics data and functions as a robot’s “logic unit,” covering the perceive and plan stages of the standard perceive, plan, actuate pipeline.

Open vocabulary object detection

The system replaces fixed-label vision models with vision-language models (VLMs) that understand natural language descriptions. Robots can locate objects from abstract queries such as “the tool that looks like it has been used the most” without a predefined label list. This reduces the need for custom training and allows flexible deployment across environments.
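
To make the idea of grounding a description into coordinates concrete, here is a minimal sketch of the post-processing step. It assumes the model's reply is a JSON list of objects with a "point" in [y, x] order normalized to a 0-1000 range and a "label"; that reply format, the helper name points_to_pixels, and the sample reply are assumptions for illustration and should be checked against the current developer docs.

```python
import json


def points_to_pixels(response_text, image_width, image_height):
    """Convert normalized [y, x] points (assumed 0-1000 scale) to pixel coordinates."""
    results = []
    for item in json.loads(response_text):
        y_norm, x_norm = item["point"]  # assumed order: y first, then x
        x_px = round(x_norm / 1000 * image_width)
        y_px = round(y_norm / 1000 * image_height)
        results.append((item["label"], x_px, y_px))
    return results


# Hypothetical model reply, invented for this example.
sample_reply = '[{"point": [500, 250], "label": "worn screwdriver"}]'

print(points_to_pixels(sample_reply, image_width=640, image_height=480))
```

Keeping this conversion in one small function means the rest of the robot code can work in pixel coordinates regardless of how the model reports locations.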

Physical common sense and semantic reasoning

The model incorporates physical reasoning, enabling robots to infer properties like weight, fragility, and structural constraints. It can decide not to lift a heavy object or to handle fragile items carefully. It also interprets ambiguous human instructions, using environmental context to infer intent, such as working out where “away” is when asked to put a tool away.

Temporal and video-based understanding

Built on the Gemini 3 Flash architecture, the system supports multi-frame and video input for long-horizon reasoning. Robots can analyze sequences of events, such as verifying whether a grasp succeeded, improving task validation without complex rule-based systems.
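
Sending every frame of a 30-second clip is rarely practical, so a caller typically picks a handful of frames to pass in chronological order. The sketch below shows one way to choose them; the function name sample_frame_indices and the frame rate, window, and frame budget are all assumptions for illustration, not limits taken from the talk.

```python
def sample_frame_indices(total_frames, fps, window_seconds, max_frames):
    """Pick up to max_frames evenly spaced frame indices from the last window_seconds."""
    window = min(total_frames, int(fps * window_seconds))
    if window <= 0 or max_frames <= 0:
        return []
    start = total_frames - window  # first frame inside the window
    count = min(max_frames, window)
    if count == 1:
        return [total_frames - 1]  # only room for the most recent frame
    step = (window - 1) / (count - 1)
    # Indices are ascending, so the frames stay in chronological order.
    return [start + round(i * step) for i in range(count)]


# Example: a 60-second recording at 30 fps, keeping 4 frames from the last 30 seconds.
indices = sample_frame_indices(total_frames=1800, fps=30, window_seconds=30, max_frames=4)
print(indices)
```

The selected frames would then accompany a prompt such as "Did the gripper successfully secure the object, or did it slip?".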

Self-improving perception via code generation

A feature known as “agentic vision” allows the model to generate code to preprocess images, such as cropping or rotating them for better analysis. This improves tasks like reading component labels, inspecting gauges, or detecting anomalies in industrial environments.
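
The crop-then-rotate preprocessing can be illustrated with a toy example. This is not the code the model generates: the "image" is a small grid of characters, and the helpers crop and rotate_180 are hypothetical names written for this sketch. A real pipeline would apply the same two operations to pixel data with an imaging library.

```python
def crop(image, top, left, height, width):
    """Return the rectangular region starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def rotate_180(image):
    """Rotate a grid by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in image[::-1]]


# Toy "board": the two-row label in the middle is stored upside down.
board = [
    "......",
    "..2B..",
    "..1A..",
    "......",
]

label_region = crop(board, top=1, left=2, height=2, width=2)
readable = rotate_180(label_region)
print(readable)
```

Cropping first removes the surrounding clutter, and rotating afterwards puts the remaining label in reading order.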

Task orchestration and motion planning

The system translates high-level instructions into sequences of executable steps. For example, “put the blue block in the orange bowl” is broken into discrete actions like grasping, moving, and releasing. It also supports trajectory planning, generating intermediate waypoints and obstacle avoidance strategies.
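
The "toolbox of functions" idea can be sketched as a dispatcher that runs an ordered plan against a robot interface. Everything here is hypothetical: SimulatedArm stands in for real hardware, the function names mirror the ones mentioned in the talk, and the plan is written by hand in the shape a model might return rather than produced by an actual model call.

```python
class SimulatedArm:
    """Stand-in for real hardware; records each call instead of moving motors."""

    def __init__(self):
        self.position = (0, 0)
        self.gripper_open = False
        self.log = []

    def open_gripper(self):
        self.gripper_open = True
        self.log.append("open_gripper")

    def close_gripper(self):
        self.gripper_open = False
        self.log.append("close_gripper")

    def move_to(self, x, y):
        self.position = (x, y)
        self.log.append(f"move_to({x}, {y})")

    def return_home(self):
        self.position = (0, 0)
        self.log.append("return_home")


def run_plan(arm, plan):
    """Execute each step in order, rejecting any function not in the toolbox."""
    toolbox = {
        "open_gripper": arm.open_gripper,
        "close_gripper": arm.close_gripper,
        "move_to": arm.move_to,
        "return_home": arm.return_home,
    }
    for step in plan:
        name = step["name"]
        if name not in toolbox:
            raise ValueError(f"unknown function: {name}")
        toolbox[name](**step.get("args", {}))


# Hand-written plan for "put the blue block in the orange bowl"; coordinates are made up.
plan = [
    {"name": "open_gripper"},
    {"name": "move_to", "args": {"x": 120, "y": 80}},
    {"name": "close_gripper"},
    {"name": "move_to", "args": {"x": 300, "y": 200}},
    {"name": "open_gripper"},
    {"name": "return_home"},
]

arm = SimulatedArm()
run_plan(arm, plan)
print(arm.log)
```

Restricting execution to an explicit toolbox means a plan can only ever call operations the developer has chosen to expose.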

Vision-Language-Action (VLA) models

DeepMind introduced VLA models that directly map visual input and language commands to motor actions. These models enable real-time control, allowing robots to perform tasks such as cleaning or object manipulation without explicit programming for each scenario.

Dexterity and general-purpose embodiment

The models aim to support diverse hardware, including humanoid and non-humanoid robots, while improving fine motor skills for tasks like plugging cables. Demonstrations include handling novel tasks, such as placing a ball into a hoop without prior training.

Human-robot interaction and live feedback

The Gemini Live API enables low-latency, bidirectional communication, combining voice, vision, and function calling. This allows natural conversation to trigger real-world actions, creating more intuitive human-robot collaboration.
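
Because a spoken request can end in a physical action, it is worth checking a model-issued function call before acting on it. The sketch below validates a call against a declaration; the JSON-schema-style dictionary shape, the tighten_screw function, and its parameters are assumptions made for illustration and are not taken from the Live API documentation.

```python
# Assumed declaration shape for one developer-defined robot function (illustrative only).
TIGHTEN_SCREW = {
    "name": "tighten_screw",
    "description": "Tighten the screw with the given identifier.",
    "parameters": {
        "type": "object",
        "properties": {
            "screw_id": {"type": "string"},
            "torque_nm": {"type": "number"},
        },
        "required": ["screw_id"],
    },
}

_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate_call(declaration, call):
    """Return a list of problems; an empty list means the call looks safe to dispatch."""
    if call.get("name") != declaration["name"]:
        return [f"unexpected function: {call.get('name')}"]
    schema = declaration["parameters"]
    args = call.get("args", {})
    problems = []
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing argument: {key}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unknown argument: {key}")
        elif not _TYPE_CHECKS[spec["type"]](value):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems


# Hypothetical call, as might follow "That screw looks a little loose. Can you tighten it?"
call = {"name": "tighten_screw", "args": {"screw_id": "S3", "torque_nm": 1.5}}
print(validate_call(TIGHTEN_SCREW, call))
```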

Safety frameworks and real-world benchmarks

DeepMind emphasizes layered safety systems inspired by the Swiss cheese model, combining multiple safeguards. The Asimov benchmarks incorporate real injury data from the National Electronic Injury Surveillance System and align with industrial ISO standards to evaluate risk in physical environments.
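
The layered approach can be sketched as a set of independent checks that must all pass before a command runs. This is an illustration of the general Swiss cheese idea, not DeepMind's safety system: the three layer functions, the thresholds, and the blocked task are hypothetical. The last function shows why stacking helps, under the simplifying assumption that layers fail independently.

```python
import math

# Hypothetical limits, chosen only for this example.
BLOCKED_TASKS = {"hand knife to child"}
MAX_PAYLOAD_KG = 5.0
MIN_PERSON_DISTANCE_M = 0.5


def semantic_layer(command):
    """Reject tasks that are unsafe by their meaning alone."""
    return command["task"] not in BLOCKED_TASKS


def physical_layer(command):
    """Reject payloads heavier than the arm is rated for."""
    return command["payload_kg"] <= MAX_PAYLOAD_KG


def operational_layer(command):
    """Reject motion when a person is too close."""
    return command["nearest_person_m"] >= MIN_PERSON_DISTANCE_M


LAYERS = [
    ("semantic", semantic_layer),
    ("physical", physical_layer),
    ("operational", operational_layer),
]


def check_command(command):
    """Return (allowed, names of the layers that blocked the command)."""
    failed = [name for name, layer in LAYERS if not layer(command)]
    return len(failed) == 0, failed


def combined_miss_rate(miss_rates):
    """Chance that every layer misses a hazard, assuming independent layers."""
    return math.prod(miss_rates)


command = {"task": "move box", "payload_kg": 9.0, "nearest_person_m": 0.2}
print(check_command(command))
print(combined_miss_rate([0.1, 0.1, 0.1]))
```

With three layers that each miss one hazard in ten, the combined miss rate under the independence assumption drops to roughly one in a thousand. Real layers are rarely fully independent, so this figure is an optimistic bound.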

CONCLUSION

DeepMind’s integration of reasoning, perception, and action into unified Gemini-based systems marks a significant step toward autonomous robots capable of operating safely and effectively in real-world conditions.

Full transcript

[music] >> Hey all, I'm Paul Reese and I'm the developer relations lead for robotics at Google DeepMind. For those of you that don't know me, I would generally describe myself as a maker. Whether it's something low-tech like gardening and woodworking or more advanced projects involving the Internet of Things and complex machines, I've always preferred doing things in the real world, which is why I'm very happy to be able to tell you about all the work we're doing in robotics here at DeepMind. All right, so typically AI has existed within the digital realm. It has been amazing for things like editing text, analyzing massive data sets, or writing code. But the physical world is a whole other beast. For a robot, understanding that a discoloration and reflection on the floor is potentially a spilled liquid or figuring out the exact trajectory needed for mapping a busy factory has historically been a massive challenge. DeepMind has been working to bridge that gap by using Gemini models to move from static chatbots to physical agents. We aren't just teaching machines to see the world, but we're enabling them to perceive it with advanced spatial awareness and understanding. We want robots that can navigate unstructured and messy environments rather than only predictable and organized scenes while being able to interact naturally with humans and perform complex tasks. Okay. That's all good, but anyone can say that they're doing great things. So, let's actually look at this in more detail. In robotics, we usually follow a perceive, plan, actuate flow. Google's vision-language models are perfect for those first two stages, including the latest version of Gemini Robotics Embodied Reasoning, or ER 1.6. Unlike a general LLM, the ER model is fine-tuned on robotics data, specifically for spatial reasoning and understanding, to essentially be the logic unit of the robot system. To give you an idea of what this actually looks like in practice, let's look at some of the capabilities.
First, there's 2D pointing and object localization. The model is exceptional at identifying points on objects, specific parts on objects where you may want to grasp them, and general labeling, or giving the model a list of objects to look for by their state, such as "on desk" or "closed," and having it identify those objects based on their unique states. What's great about this is that it's easy to get started with our Python SDK. You can initialize the client, pass in the model ID for the latest version of Gemini Robotics ER, your prompt requesting the objects that you're looking for, be it specific items or more abstract requests like "the screwdriver I need for the specialty screw," and any additional Gemini configuration items, like temperature or thinking level, to get a response that you can use in your robotics project. All right, so let's look under the hood for a second. If you've worked with computer vision before, you've probably used a model like YOLO that was trained on fixed data sets like COCO or ImageNet. Those are fantastic for what they are, but they are closed vocabulary, meaning they can only identify a predefined list of objects. If you need your robot to find a particular tool for cutting a dovetail joint when woodworking, but that wasn't in your training data set, you're pretty much out of luck. With Gemini Robotics ER, we've moved to a vision-language model, or VLM, architecture. When you use the generate_content method in the SDK, you aren't just running a classification script. The model uses semantic grounding to locate whatever you describe in natural language. Because language and vision are mapped in the same space, the model doesn't need a specific label for every object in your warehouse. This is called open vocabulary object detection. You can ask for the tool that looks like it's been used the most, or the component that is currently overheating.
The model reasons through the visual data to ground those abstract concepts into precise coordinates. For developers, this is a massive win, as it eliminates the need to build, label, and maintain custom vision models for every unique environment. Now, the reason we call it embodied reasoning instead of just a vision model is because it understands the physics of the scene, not just the pixels. In traditional robotics, if you told a robot to clear the table, it might try to pick up a dish that's far too heavy for it, potentially stripping a gear or dropping the plate, because it doesn't understand weight or structural integrity. Gemini Robotics ER has what we call physical common sense. When it looks at a scene, it isn't just trying to identify the plate and the food, but it's reasoning about the relationship between them. It knows that to move a dish, it must first move food into smaller containers that could be lifted, or request human intervention. It also understands that a glass bottle is fragile, while a plastic one isn't. This reasoning layer allows the robot to handle unseen scenarios, like knowing it shouldn't try to lift a table that is obviously bolted to the floor. It's this kind of practical intuition that has historically been a nightmare to script manually. Beyond physical common sense, there's semantic reasoning. One of the hardest things for an AI-backed robot to navigate isn't necessarily a messy environment, but rather a vague human prompt. If we tell a robot, "Hey, put this away," it's suddenly trying to figure out what "this" is and where "away" is. The Gemini Robotics ER 1.6 model excels at navigating this ambiguity. It uses the visual context of the environment to make an educated guess. If it sees someone holding a screwdriver, and there's an open toolbox nearby, it reasons, through the use of the model's thinking capabilities, that "away" likely means that specific tray.
For developers, this means we're spending less time writing brittle if-else statements for every possible edge case, and more time defining the actual high-level goals for your robots. Now, typically when we use these models, we think in terms of snapshots, like sending one image to generate_content and getting one answer. But robotics is inherently a temporal problem. So, we need to think about what happened 10 seconds ago in order to solve what we need to do now. Because Gemini Robotics ER 1.6 is built on the Gemini 3 Flash backbone, it natively supports video and multi-image inputs. This allows for what we call long-horizon temporal reasoning. Instead of asking, "Is the door open?" you can pass a sequence of frames from the robot's journey through a facility. And because the model tokenizes these frames chronologically, it can reason about state changes. For example, you can prompt the model with something like, "Look at the last 30 seconds of video. Did the gripper successfully secure the object, or did it slip?" The model isn't just looking at the final frame. It's looking at the motion and delta between frames. This is a game-changer for success detection. In the past, you'd have to write complex heuristic code to check if a task was completed. Now, you can simply use the model as a temporal supervisor that watches the robot's work and confirms that the physical state of the world has actually changed in the way that you intended. Reasoning gets us 90% of the way there, but sometimes the visual input is pretty difficult to process. One issue we see fairly regularly with standard Gemini is that images sent to the model may not be that great for determining what's in them due to size, rotation, clutter, or really a variety of other factors. We've addressed this by adding support for agentic vision to the Gemini Robotics ER 1.6 model.
By enabling the code execution tool with the SDK, the ER model can now generate code for intermediate steps, allowing the model to manipulate those images itself to get a better understanding of the content. Let's take a look at an example. Say you have a production line that makes circuit boards, and one of the steps involves taking the unique ID off of a certain chip to record it, such as the ESMT chip in this image. With traditional LLMs, this text might be difficult to read, or it could get mixed up with other text on the board. By enabling the code execution tool, Gemini writes the code to take the image, locate the chip that you're looking for, and then crop the image down to just the necessary area. In this particular example, the chip and text are still upside down. So, the Gemini Robotics ER model does one more step with its generated code to rotate the intermediate image into a more readable orientation, making it far more likely to correctly get the numbers off of that chip. The usefulness of this feature extends far beyond just reading text, as Gemini Robotics ER has been optimized for the common task of reading analog gauges and understanding common environments, like factories and plants, to detect anomalies. If you have a robot running around a facility, you can have it checking pressure gauges, making sure doors are properly closed, looking for spills or snow on the ground, or performing any number of other tasks with the same simple tooling. Once the robot perceives what's around it, we can move into the planning stage through task orchestration. Think of this like giving a robot a toolbox of functions. In your code, you can provide the model with a list of specific operations that your robot can perform, like open gripper, move to coordinate, detect objects, return to home, or really anything else.
When you give the robot a high-level task, like "put the blue block in the orange bowl," the Gemini ER model will break that task into a series of smaller tasks to plan out the entire operation. It looks at the scene, identifies the blue block and the orange bowl, and then determines what order the functions should be called to open the gripper, move towards the block, close the gripper, move it over to the drop-off zone, open the gripper, and then return to a neutral pose. Along with high-level orchestration, the Gemini Robotics ER model can handle micro-planning, like coming up with trajectories that a robot should follow. Instead of just a start and end point, the Gemini Robotics ER model can generate a series of intermediate waypoints, while also highlighting ways to avoid obstacles. So, the final part of the perceive, plan, actuate loop, as you can probably guess, is actuate. This is where the gripper meets the object, so to speak. In robotics, actuation is just a fancy way of saying that we're moving the hardware, usually by sending specific values to actuators, or motors and joints, that make the robot go. Now, I know the tendency is to say, "Look at this amazing AI feature that should change everything." But, in traditional robotics, if you have a machine in a factory repeating the exact same motion 10,000 times a day, standard control theory works just fine. But, there are so many scenarios that people are exploring now that aren't in a controlled environment as they branch out into the real world. This leads to environments that are messy, full of objects the robot has never seen before, and situations that weren't in the original training data. For example, a robot rolling down the street trying to deliver food could encounter everything from a junky free couch blocking the entire sidewalk to a flock of unyielding grumpy Canadian geese.
To help solve this, we've introduced our vision-language-action, or VLA, models, including the flagship Gemini Robotics model, which is currently available through our trusted tester program. These models map camera pixels and natural language instructions directly to blocks of motor values. You give it a prompt like, "Clean up the desk," and the VLA streams camera frames to determine exactly how the actuators should move to get the job done. This can even be used in conjunction with the Gemini Robotics ER model to take a video of a larger task and break it into individual parts for actuation, letting the robot solve a series of smaller problems to complete the big one. What makes this special is that we've built these VLAs on top of the state-of-the-art Gemini backbone. Because the model has a massive general understanding of the world, the robot is able to understand context and details about the objects. For example, when we gave our robot a small basketball and a net game and asked it to do a slam dunk, the model was able to take the ball and place it through the hoop despite not being previously trained for this. Beyond simple pick-and-place operations, we're also pushing for true dexterity. We want our machines to do more than just move around, so we're training our models for complex tasks like plugging in cords or performing fine detail precision operations on a workbench. Finally, we've taken a generalist approach to embodiments. Whether you are working with a humanoid, a quadruped, or a mounted bipedal setup, our goal is to meet you where you are. These models are designed to be embodiment-agnostic, providing a powerful, general-purpose brain that can be adapted to whatever hardware you're working with. By combining the high-level reasoning of the ER model with the reactive, real-time control of the VLA, we're closing the loop on truly autonomous physical agents.
All right, so at this point we've gone over the tools for the full perceive, plan, and actuate loop. But, that's really just scratching the surface. The next big question is how do you actually talk to the robot? And how does the robot keep you in the loop? For example, if you have a humanoid robot, people will likely tell it things like "hello" and "thank you." But, there's a lot of variance in how they might say these. If your robot isn't programmed for each variation, it could get kind of awkward as it stands there and just feels unnatural. But, with additional tooling, the work for smooth human-robot interactions could be handled for you. To accomplish this, I'm a big fan of the Gemini Live API. This is what turns a robot from a pre-programmed machine into an interactive partner. It enables low-latency bidirectional conversation between you and the Gemini model, letting you give instructions through natural language, like saying, "That screw looks a little loose. Can you tighten it?" On top of that, this isn't just about audio. The Live API allows you to stream camera frames directly to the model, providing constant visual context. This helps you create feedback loops where you can use this stream to even alter plans during runtime. Where the real magic happens though is through function calling. Based on what the model hears and sees, it can decide to trigger a specific developer-defined function. This is the bridge that allows a natural conversation to turn into precise physical action, opening the door to incredibly fluid human-robot interactions. And while you're planning out your builds, having a playground to test your logic is definitely a nice-to-have. And that's where AI Studio comes in. This lets you rapidly prototype prompts and test how the model perceives images from your specific hardware without having to constantly re-flash or reload scripts on the robot. Along with this, one of my favorite areas is the build section.
We've actually created a few web app templates to show off what's possible, like integrating the MuJoCo simulation engine directly into the browser. Like you can see here, where we're running a virtual robotic arm that uses the Gemini Robotics ER model to detect block locations, then performs a pick and place task based on that information without the risk of testing on real hardware, really lending itself to a fail-fast, fail-safe strategy. And to wrap up this whirlwind tour of Google DeepMind's robotics offerings, let's talk about safety. In the digital world, if an AI hallucinates, it might give you a weird recipe or a buggy snippet of code. Not ideal, but you know, not catastrophic. When this happens in robotics, we're talking about real-world hardware, sometimes with significant weight, potentially moving around in the same space as people. As we all probably know, tools like a simple table saw deserve respect because of the physical risk involved. When you add an autonomous brain to a machine, that respect for safety has to be built in from the ground up. This is why our team is engaged in ongoing safety research to develop a holistic approach to robot protection. We like to think of it as the Swiss cheese model of defense. No single layer is a perfect barrier, but by stacking multiple layers of safeguards, spanning the semantic, physical, and operational aspects of safety, we can effectively mitigate risk. To evaluate these layers, we've introduced the Asimov safety benchmarks. As discussed in our research, these aren't just theoretical. They are grounded in reality using the NEISS or National Electronic Injury Surveillance System database. This contains real-world injury reports from hospitals to teach the model about physical common sense. Second, we ground our work in established industrial ISO standards to ensure our benchmarks reflect the same operational safety constraints used in factories today. 
So, that's a quick look into what we're working on within the robotics space at Google DeepMind. At the end of the day, we're building tools to be your partners in creation. So, I'm looking forward to seeing what you all build. You can find more information in our developer docs, cookbooks on GitHub, and sign up for the trusted tester program on the Gemini Robotics webpage. Thanks for listening in, and I'm looking forward to seeing what you build. >> [music]
