Introducing Gemini Omni

Google · Google for Developers · May 19, 2026 at 06:17 PM · 44:07

TL;DR

Google DeepMind’s Gemini Omni introduces a multimodal AI model capable of generating and editing video with high temporal coherence, controllable pacing, and integrated audio-visual consistency.

KEY POINTS

Launch of Gemini Omni

Gemini Omni is positioned as a major step toward fully multimodal AI, capable of handling image, video, audio, and text inputs while producing coherent video outputs. The model builds on earlier systems and consolidates multiple generative capabilities into a single architecture designed for both consumers and professionals.

Advanced Video Editing Capabilities

A defining feature is native video editing, allowing users to transform existing footage through simple prompts. Edits such as changing subjects, removing objects, or altering environments preserve motion, speech, and timing consistency, marking a shift from earlier tools that relied on stitched or layered outputs.

Temporal Awareness and Story Control

The model demonstrates improved temporal reasoning, enabling it to structure sequences across seconds with accurate pacing. Users can specify timing constraints, such as fitting content into 10-second clips, or define when events occur, while the system plans scenes and transitions internally.
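
As a rough illustration of the timing control described above, the sketch below composes a plain-text prompt that states the clip length and when each event should occur. Everything in it is an assumption made for illustration: the `Beat` class, the `build_timed_prompt` helper, and the prompt wording are hypothetical and not a documented Gemini Omni prompt format. It only builds a string and calls no API.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    """One timed event in the clip (hypothetical structure)."""
    start_s: float    # when the event begins, in seconds
    end_s: float      # when the event ends, in seconds
    description: str  # what should happen in that window


def build_timed_prompt(scene: str, beats: list[Beat], clip_length_s: float = 10.0) -> str:
    """Compose a prompt that states the clip length and the timing of each event."""
    ordered = sorted(beats, key=lambda b: b.start_s)
    previous_end = 0.0
    for beat in ordered:
        # Reject beats that overlap the previous one or have no duration.
        if beat.start_s < previous_end or beat.end_s <= beat.start_s:
            raise ValueError(f"overlapping or empty beat: {beat.description!r}")
        # Reject beats that run past the end of the clip.
        if beat.end_s > clip_length_s:
            raise ValueError(f"beat runs past the {clip_length_s:g}s clip: {beat.description!r}")
        previous_end = beat.end_s
    lines = [f"{scene} Total length: {clip_length_s:g} seconds."]
    lines += [f"From {b.start_s:g}s to {b.end_s:g}s: {b.description}" for b in ordered]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_timed_prompt(
        "A dog rides a skateboard.",
        [Beat(0, 4, "the dog pushes off"), Beat(4, 10, "the dog jumps a ramp")],
    ))
```

Leaving the beat list empty produces a prompt with only the scene and the length, which matches the case where the model is left to plan scenes and transitions itself.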

Prompt-Based Creative Transformations

Demonstrations include turning people into animated characters, altering perspectives, and generating rapid sequences like alphabet-based visuals. These transformations require only natural language prompts, highlighting accessibility without technical editing skills.

Reference-Driven Identity and Consistency

Users can supply multiple image and audio references to improve character fidelity. Up to seven reference inputs help the model reconstruct facial structure, voice, and movement, with better results when varied angles are provided. This enables consistent avatars across scenes and projects.
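
The guidance above can be turned into a simple pre-flight check on a reference set. This is a sketch under stated assumptions: the limit of seven comes from this summary (the speakers say "about seven"), while the angle labels and the `check_reference_set` helper are hypothetical and do not correspond to any product setting.

```python
MAX_REFERENCES = 7  # figure from the summary above; treat as approximate

# Hypothetical angle labels; the source only says varied angles help.
RECOMMENDED_ANGLES = ("front", "left_profile", "right_profile")


def check_reference_set(angles: list[str]) -> list[str]:
    """Return warnings for reference images, each tagged with its camera angle."""
    warnings = []
    if not angles:
        warnings.append("no references supplied")
    if len(angles) > MAX_REFERENCES:
        warnings.append(f"{len(angles)} references exceed the limit of {MAX_REFERENCES}")
    # Report every recommended angle that no reference covers.
    missing = [a for a in RECOMMENDED_ANGLES if a not in angles]
    if missing:
        warnings.append("missing angles: " + ", ".join(missing))
    return warnings


if __name__ == "__main__":
    print(check_reference_set(["front"]))
```

A single head-on photo passes the count check but is flagged for missing profiles, which is the case the speakers describe as drifting when the subject turns.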

Multi-Character and Voice Limitations

While the model handles two to three characters reliably, performance degrades with larger groups. Voice synchronization remains a challenge beyond a few speakers, where dialogue attribution can become inconsistent.

Generation Speed and Trade-offs

Outputs typically take 60 to 90 seconds to generate, reflecting increased reasoning and planning compared to faster image tools. The system balances responsiveness with higher-quality outputs and more complex scene construction.
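
To make the trade-off concrete, here is a back-of-the-envelope estimate of how long a longer piece would take if it is assembled from 10-second clips, as the speakers suggest. The 10-second clip length and the 60 to 90 second range come from this page; sequential generation, full-length clips, and the `estimate_generation` helper itself are assumptions for illustration.

```python
import math

CLIP_LENGTH_S = 10           # clip length mentioned in this summary
GEN_TIME_RANGE_S = (60, 90)  # typical generation time per output, in seconds


def estimate_generation(target_length_s: float, attempts_per_clip: int = 1) -> tuple[int, int, int]:
    """Return (clips needed, best-case seconds, worst-case seconds).

    Assumes clips are generated one after another and each uses the full clip length.
    """
    clips = math.ceil(target_length_s / CLIP_LENGTH_S)
    runs = clips * attempts_per_clip
    return clips, runs * GEN_TIME_RANGE_S[0], runs * GEN_TIME_RANGE_S[1]


if __name__ == "__main__":
    clips, low, high = estimate_generation(60)
    print(f"{clips} clips, roughly {low // 60} to {high // 60} minutes of generation")
```

Under these assumptions a one-minute video needs six clips and about six to nine minutes of generation; each extra edit turn or retry per clip multiplies that figure through `attempts_per_clip`.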

Consumer and Professional Access

The model is available through the Gemini app for general users, while advanced workflows are supported in Flow, a creative suite for building longer narratives. Integration with YouTube Shorts enables remixing and editing of existing clips using the model.

Text Rendering and Informational Use Cases

Improved text rendering within video allows for readable overlays and educational content. This capability supports use cases beyond entertainment, including instructional videos, explainers, and visual storytelling tailored to different audiences.

Safety Measures and Watermarking

Deployment includes a cautious rollout with restrictions on likeness and voice replication, requiring an avatar setup process. All generated content is embedded with SynthID watermarking and C2PA metadata, enabling detection of AI-generated media across platforms.

Future Developments

Planned improvements include longer video durations, enhanced character consistency, better voice handling, and deeper grounding in real-world data. The model is also expected to expand interactive and educational applications.

CONCLUSION

Gemini Omni signals a shift toward unified multimodal AI systems that merge generation and editing, with implications for content creation, education, and digital media authenticity.

Full transcript

There's so many different kinds of edits that you can do, and I think we weren't sure what exactly the model was going to pick up on, but it's actually very versatile. We have an amazing group of people who have been making demos for the past couple of weeks, and I've been really blown away by all the kinds of things they have been able to do, like turning themselves into crochet dolls and liquefying things, and they just work, and it's all through prompting, which is pretty incredible. Hey everyone, welcome back to Release Notes. My name is Logan Kilpatrick. I'm on the Google DeepMind team. Today I'm joined by Nicole, Dumi, Gabe, Shomi. We're talking about the new Omni model, Gemini Omni. So maybe one of you wants to kick us off with the high-level news about the model. >> We're really excited to introduce Gemini Omni. I think there's two things that people should take away from this model. One is we're basically bringing Nano Banana to video. So we have a really great video generation model, but it especially shines at video editing. And the second thing, the reason we're calling it Omni, is it's really the next step in the journey of making Gemini fully multimodal in and multimodal out. So you can take in image, video, and audio input and references, and right now get video as output. And then, you know, in the future, we will be adding more modalities as a way to make Gemini fully multimodal in and out. >> I love it. I feel like we've been telling the Gemini multimodal input story. It was a foundational part of the original Gemini models. And then obviously the whole suite, from Nano Banana to Veo to some of the music and audio models, is state-of-the-art across the rest of the gen media portfolio. So it's exciting to see it all come together. Maybe we look at demos and we can do some live reactions. 
>> Let's go to some examples. Let's play this one. So, this is actually a really nice example of bringing together the Gemini world knowledge with media generation. This is literally asking the model to generate the alphabet and then examples of an object for each letter of the alphabet. Nice. Nice. And how aware is the model of its own time? Because I imagine there's a time constraint on how much generation the model can do. For the speed of the sequence, do we have to prompt the model to say "make sure to do it fast because you only have eight seconds to generate", or how much of that context is... >> I think that's one of the things that really improved: the model has an ability to create potentially very fast sequences. So unlike previous generations of our models, the control over the time and being able to tell a story is much better. This is a bit of an extreme kind of illustration, where you want to create very fast frames that go one after the other. Each one has different content and there is a specific order. So the model is very aware temporally of where everything should go, and basically when the user asks for a relatively complex sequence, the model is able to reason about it and create this sequence correctly. >> To your question about whether you have to prompt: you can, but you don't have to. It's pretty flexible in that sense. You can just say "do it in 10 seconds" and it will do it, but you can also specify "I want it to be slow", whatever it is that you want. So it's pretty controllable in that sense. >> Part of that comes from the conditioning. There's this text conditioning that goes into the model. Should it be fast-paced? Should it be really slow? But then another part of it is just from the model itself. 
Like it really looks to find these correlations between what you're asking for and how the video should look. >> Yeah, I love it. There's like an AI video speed. That first example is cool to see, because it is a lot faster-paced than what you would normally get out of an AI video generator, which is cool. And this one's very fast-paced, and we prompted it to be very fast-paced to fit the alphabet in. But we have another example which goes into the editing capabilities and also the anything-in part, because we're using an image as input and a video as input. So let's play a video of this woman talking. >> Well, I don't know what it is. But today is going to be a good day. I can feel it. >> This is Arena, one of our amazing creators of the Greenfield team. And then the prompt is basically "turn me into this animal." Well, I don't know what it is, but today is going to be a good day. I can feel it. >> That's awesome. Just to draw it back for folks who have some familiarity with the Veo models: this is sort of a combination of reference-to-video and audio, and the part that wasn't possible before is the audio consistency, because you could do editing like that with previous video generations, right? With reference-to-video, or is that not possible? >> We do reference-to-video where the reference was an image or an audio clip. But video editing is a net new thing that we're bringing with Omni. And that's why we like to say it's kind of like Nano Banana for video, because we're basically combining everything that we had in Veo, hence also Omni, and we're adding video editing. And it's really nice because with Gemini in the loop, again because we're building on Gemini, you can use fairly simple language, right? 
Like literally "turn me into this animal": there's nothing that complex in the prompt, and then the model just does it, and it keeps the voice the same and it keeps the speech the same, which is really nice. >> Yeah, we pretty much built the model from the ground up to be able to understand different modalities, in this case the image and the video. So it can actually know what information it has to take from the source, or in this case the reference video in which a person is talking, and recontextualize it, in this case as an animal talking. But I think the small movements, the expressions of the animal, kind of transfer from the original person, and I think those nuances really make it look better. Yeah, I think that's a really important point: while Veo could certainly do reference-to-video, that was a capability that we almost layered into it. Whereas with Omni, it's really a step change. We started from the foundation. We kind of had to rethink how to build this model from the ground up. And we're really happy with the results because of that. >> Maybe pragmatically also, the quality actually improves with more references that you put in of the same subject. So you can stuff in more images of yourself, or whatever it is, and it will build a better model, a world model if you want to use the term of art, >> and it will just be a better outcome. >> We can also talk a bit more about the way to provide more references. So we have the avatar workflow, which we'll be talking about more, that allows you to just one time take a photo of yourself and speak and get the sample, and then use it across different generations. But you can also bring three or four or even more, up to I think about seven, references of yourself, and voice references. 
And again, the more information that the model has, the more it can recreate and recontextualize the presence of the specific person or identity in a new scene. And I think what we've seen is that it's not just the number of references, but also what information the references have. So, for example, if I take a few photos of myself from different angles, the model can build a better understanding of the 3D structure of the face. And then if you walk around, it will actually look more like you, right? If you only have a front photo of yourself, then when you turn around, it might look like a different person because the model has no idea how you look from the side. So it's also about that. >> And something important to point out is that these references don't have to be of the same thing or person. So you could have two people, or in this case, we saw there was a wolf. But you could even have multiple animals, whatever you wanted. And then in your prompt, you just point back to the things that you want to incorporate into the video, and it will intuitively figure out what to do with those. >> Yeah, that's awesome. Let's look at some more demos. I feel like we've got lots of cool ones to look through. >> We have an image of a violin on this background, which is just grass. And then we have a woman playing a violin. And the first prompt is basically "take this violinist and transport her into this environment." So, let's play that. Our next prompt is "make the violin invisible and then keep everything else the same." Right? So, let's see what happens. >> That's awesome. Wow. >> It's really awesome because it keeps everything else the same. She looks the same, right? She's still playing, the music is the same, and then the violin just disappears and you see what happens. And then this next one is basically "change the perspective." 
So you can see how this is useful for many different things and it would have been really hard to do. Um and you can also see how the model just is an understanding of perspective and what happens when you change camera angles, right? So let's play this one. I do not play the violin, but I am told this looks right. >> That's awesome. Actually, really quickly on multi-turn, I think one of the one of the pieces of like early feedback from the original Nano Banana models was sort of like when you do multi-turn, sometimes the model actually like wouldn't make the edits. Uh, as we see that feedback all the time from people. Um, and sort of it's gotten progressively better. And I'm curious just to sort of like level set people's expectation as they go in and try this model like does the is like pretty consistent it'll make the changes that you want or does it still have that like odd artifact that Nano Banana has sometimes where you'll ask for an edit and it'll end up just giving you the same image back without anything really changed. >> It's less that I think the main differences between Nano Banana and Omni here is obviously the generations take a lot longer, right? So with Nano Banana it was super snappy. You have 10 seconds and you kind of get the output. here you're probably waiting like 60 seconds plus sometimes 90. So it just I I think that naturally maybe limits how much you want to do multi-turn. Um but it definitely I think the other thing is that we've seen this with Nanana too like obviously the longer you go maybe the more like the instruction following degrades and that's still true but we've been getting kind of two three four turns pretty reliably with this model. Um, but I think to your point, I think every time we introduce new capabilities like this, sometimes the model is better at following some instructions than others, right? So, please send us feedback. Um, I think we're all on Twitter. If we're not, we should be. 
Um, and tag us and let us know what you like and don't like about >> Don't tag me. Tag them. >> I don't know. Yeah. Maybe one limitation I have personally observed is that sometimes editing will edit more than you asked for. >> I think it's almost the opposite of what you were asking about. We need to dig deeper into in which way this is a problem, but it's possible that if you underspecify what your edit is, the model will do more than you thought should be done, right? It may be technically correct, it did what you asked, but it also did some more. >> When it comes to videos, there is so much happening, right? In a 10-second video actually a lot is happening, and to be able to edit it accurately, even to explain what you want with language, is not always very easy. So sometimes you might ask for something to change, but it is kind of ill-defined what the right thing to do is, right? Editing is not always clear. "I want to change this text to something else", that's one thing. But if I want to make something invisible, as we've seen with this violin, or change the camera position, there are multiple ways to do it. So I think the more you specify what you want, the better the model can understand and try to implement the edit. But still, of course, it doesn't always get it right, and we're still working on that to be more and more accurate. Yeah. >> How much reasoning is actually happening? Is part of that 60 or 90 seconds the model actually doing a bunch of reasoning, similar to the latter Nano Banana models, where that's a core part of how we're getting model performance out of it? >> Yeah, there is quite a bit of reasoning that goes on behind the scenes. 
Um, I mean, if you think about when you're doing text-to-video and you give it a simple prompt, as was said, there's a lot more going on in the video than just, you know, a dog on a skateboard or something like that. It's: well, where is the dog? What time of the day is it? Et cetera. And so the model really needs to create a lot of detail. And so it needs to reason about it, just like a person creating a film would reason about: well, what's the story? What's going on? And that's actually what's happening, is it's reasoning through scene by scene or something like that? I don't know how much of this is the secret sauce or how much is actually even visible to users at all, but what happens as it is thinking through what the next eight or nine seconds of the video should look like? >> If you provide a very short prompt, then there is so much to create to just give you the 10-second video, right? So a lot of this information is coming from the model thinking through how it should maybe split the 10 seconds, or longer, into shorter clips, what should happen in each one, and how they should be consistent with each other. So there is a lot of planning around that, and then eventually the video that you get tries to combine all of this planning into a coherent video. So if you provide more information, for example, and one thing that I like is that you can actually say "I want something to happen at specific seconds", then you constrain the video much more, right? So if you provide more information, the model will try to adhere to what the user wants, and if you provide less, then it has to go and be more of a director on your behalf. >> Nice. I'm curious, why 10 seconds? 
Is it just like the model has gotten better and more coherent or like architecturally it's you set up to do 10 seconds or like why 10 seconds? Why not 30? I know 30 seconds would take a long time to generate. >> With Omni, you can create 10 seconds uh videos. That's true. But um because of the consistency that using references, you can easily have the same um basically character or characters in in in a 10-second video. So you can easily create a much longer film. That being said, we are definitely planning to extend that in the in the near future. >> We really try to focus on making this accessible to consumers with this release. I think that's quite new because with VO, we really focus on kind of professional creators and filmmakers and we're still working with those obviously because everything that we've shown you is useful if you're creating content professionally, but we really wanted to make this tech accessible to consumers. So the fact that you know you can create your avatar and then you can put yourself in videos and it looks like you like that's kind of that nano banana moment for video right shorter videos we also think are kind of more conducive to the consumer use cases but we are planning on enabling longer videos and one interesting thing is when we talk to folks who are creating films professionally they actually prefer to have shorter scenes because you kind of create a scene and then you stack those scenes. like the workflow that Schlomi described is actually quite common in the industry. But you're right, like we do want to make longer videos and this definitely on the road map. >> This is also kind of a user experience like you know imagine doing a multi-turn edit of a 90-cond video. Most users will just not find this very pleasant but we could do it as a as a flag planting thing if you want. >> We'll get there. >> Yeah, we'll get there. >> We'll get there. 
One of the things actually that we haven't talked about and this related to like use cases and sort of the different audiences and everything they're looking for, maybe we can actually just talk about availability for a second, which is like for people trying to do the sort of nano banana VO moment, where should they be going for sort of the more professional user who wants to do the 10-second multi- many many iterations, put together something professional, where should they be going? Um, availability >> for easy consumer use cases, go to the Gemini app. Um, so it's available to ultra and pro and plus users. >> Really? >> Yes. >> That's where all the TPUs are going now. It makes sense. >> Um, no, but actually like this this is related, right? Because if we do longer video generations that we can serve it to fewer people, right? So, so this is definitely a trade-off that we thought about with this model, but yes, go to the Gemini app. For people who have much more kind of intensive workflows and who want to really like use a full-on creative suite, go to Flow. Um, you can use Nano Banana to create references, characters. is you can give your characters voices so they're kind of consistent throughout the project that you're working on and you can use um you know omni inflow for all of those creations. Um flow now has some really cool features where they also have like an agentic workflow that can suggest you ideas. It's very cool actually. >> Um shout out to the flow team and Josh. Um where you know an agent can suggest you ideas on what to do. I usually don't have good ideas on what to do so I just use the agent to tell me what to do there and that's really fun. Um, and then we're also, um, I think for the first time also sim shipping this model with YouTube. Um, and so if you go on YouTube shorts, um, or YouTube create on YouTube shorts, basically some videos will be eligible to be remixed with Omni and then you can kind of go in and edit those. >> Very cool. 
Nice. >> And then we will have APIs and you know other developer access. >> Fingers crossed. I'm excited. >> User experience of creating videos is not like it's not a sole thing and then it really depends on on what the user is trying to do. So, I find myself using both the Gemini app for some things when I I just want something to create quickly, maybe some memes sent to someone. Maybe I want to take my um my video and just change something small, make it funny, and send to to a friend or my family. But then if I want to try and create, it's not like I'm amazing film uh uh uh amateur. Yes. So, so but my kids like it. And so I mean so it can be professional but it can also be like you know maybe uh more of like a longer workflow and you have some idea and you want to get it out you know and actually create it then I think there's there amazing improvements to flow and to the to the kind like user experience there you can build those characters and give them a voice and then use them in different scenes. So the ability to create a much longer film or or or kind like a video I think it's it's just a better workflow for that. So yeah, we can try. We believe those both like use cases are very different and require different experiences. >> What about um and actually maybe we go to the next demo, but I'm I'm also curious to plant the seed for like multihuman video generation and editing. And I'm curious how the model does like with character consistency as like is is the the suggestion like two people, it's like pretty reasonably good. You get to five, it starts to break down and things get crazy or like and I feel like it was the same for Nano Banana for like editing. If you had like a big group, a group photo was like one example that I felt like didn't work on the initial iteration of the model and you sort of needed to have like one person or two people. Same vibe for for Omni to start or is it can it handle it all? 
>> I've seen pretty impressive examples of two characters and maybe stretching it to three. Again, you do have to provide quite a few references, right? So like the more image references you provide for each character and the better you get all of their facial angles so they can kind of capture the movement, the better things get. like two to three pretty good. I think once you go to like a big group photo and then you try to animate it or like like it starts to drift much much faster. Um and even if I use like a single head shot of me and then I try to put myself in a scene like it doesn't quite look like me. But if I do the like you know side profile from both sides and kind of head on it it's so much better. >> And can those references be videos? Could you do the like you look at the camera side to side up down like you're >> It will be coming soon. >> It's just Yeah. The model can do it. We run out of >> one of the model can do much more than we're actually >> shipping. I have to say we're going to take us like two months at least to ship all of the features. >> It has a lot in it. >> We packed a lot into a small model and I think this is actually part of why it's called Gemini Omni Flash. Like it's actually really impressive. It is a pretty small model. We're packing a lot of features and we have more coming. So >> we kind of knew this when we were developing the model, but we were developing something that we didn't know how to serve quite yet. Um, and so we're kind of like slowly unrolling this as we figure out, well, how do we actually want to like let users interact with it? Um, or allow users to interact with it. One thing I would say is that the character generation, especially from like TTOV, is is quite good for like, you know, like four or five characters. I think the one thing that is an area to to improve upon is voice generation. 
So once you go beyond like two characters, what you might start seeing is that the same like same person might be talking like it kind of gets confused between who's should be talking exactly when. And this is obviously an area that we want to improve upon. >> I love it. So our podcasting with five people, we're good. It's uh >> but you're still good. >> We're still good now. But if it's only two people, don't do two. It's a matter of time. Let's watch some more demos. >> Next one. This is another example of just multi-turn of kind of like impossible worlds that you can have in the palm of your hand. Um, let's do it. >> Oh, wow. I love this. It looks like that little world orb thing from uh from Project Genie. >> And it's just so much more accurate. Like we we tested all of these against Vio and you can really see the strip step change from where we were with VO. So it's pretty exciting. If you go to the next one, um this is just Celestial World. It's basically a play on the same idea. >> Five fingers. >> Five fingers. >> That looks like a good hand. That's a good hand. >> Very cool. >> And then we have another one. >> That's very cool. very the text rendering I think is really something that we love that I think it's in in text rendering we really feel like it's a step change comparing to our previous video models and and it's so useful because uh text apparently is pretty useful so like you know for anything informational or or even if you just want to have make some fun message to someone like being able to have text on screen >> or adding an overlay with video >> yeah it's just really and and of course like the text I think we have some examples right that the text is in the right place it looks like you can read it. It looks correct and like I think this is super like a useful kind of capability that we are improving and in a way that's coming from this omni approach because the model learns all of modalities including text from the ground up. 
So it can actually do much better on text generation thanks to that. >> I feel like this omni model has to be like the hardest example of taking like all of these different things uh and bringing them together. And I'm curious sort of like actually just like how we think about prioritizing like what do we want to be good? Was it actually like initially we're like hey video the nano banana moment was incredible. We think we could do something really really interesting for video and so like let's go and try to make sure that that use case works really well to begin with and then sort of keep hill climbing everything else to like try to get quality up or is it like you just sort of throw it all together and then do a bunch of evals and play around with it and you're like actually video editing is the thing that works best. We definitely started like the tagline of the project which started a while ago was Nana Banana video, right? That was that was that that was there in the slides ever since the beginning of the project. Um >> uh and so you know that's was definitely sort of a northstar there. You know doesn't mean that we didn't do the other thing where you said where like we just put everything together and hope that it works but like it was it was definitely sort of a you know where we wanted to go to. So, and we've arrived. But like I think it's definitely an open problem of how to how to how to put everything together, right? Like in a way that like gets you to that desired outcome because none of video kind of we all feel it. We sort of understand what that means. >> Yeah. >> But like actually practically measuring are we are we improving towards that goal? It's it's not super obvious, right? Like you know like there's just a lot to there's just a lot to video editing that is not just quantified on a single number that you can hill climb on. >> For us like execution is really key and we're researchers. 
So we incrementally added these things and layered everything on top of each other. Um and of course everything interacts with each other. I think one of the most difficult parts of this particular model was just the number of evaluations that we had to run especially towards the end when we just had everything going and so we had editing references image out text out etc. Um, we just had to evaluate everything and then kind of just understand what's going on and you see some things regress, some things improve. Knowing when the trade-offs are where those trade-offs are, it's it's really challenging. You need a very deep intuition on how these models work and what's going to work at scale and what just isn't. >> First, it builds on on I think like long-term research that we've been doing, I think, for a few years to how to combine modalities. And as you've mentioned, this is being you know a northstar for Gemini and and kind of like our generative media projects to see how we can combine the modalities together. Um an interesting thing is that as as you combine different modalities, you have to decide um to prioritize which one is more important or like how how do we weigh them in a way that that is is gets us the the what we we want the model to be able to excel in. Um and that's something we definitely uh worked on a lot. Uh the interesting thing is that we see that putting some of the modalities together actually helps it like different modalities because there is a lot shared for example for image generation and video generation there are a lot of similarities of generation of the the visual stuff right understanding the visual world same goes for music and audio you know are you can imagine that if you ask for music to be generated in a video then the model has to know how to create like the music to begin with. 
So we also see that this information helps the model do better on videos when it has, for example, audio and music information. So overall we see that all of those modalities make each other better across the board, with the right adjustments, of course. Yeah. >> Historically, we talked about how, for Nano Banana, Koshik and others treated text rendering quality as an overall metric that you can track and that correlates with the quality of the model in general. And I'm curious if that's been true for Omni, as this text rendering capability has gotten so good. And then also the tool calling capability of the model: obviously with the later versions of Nano Banana you can use search out of the box, and it has, I think, image search, which is really cool and has made that experience powerful. I'm curious how much tool use for the Omni model is something that either works today or is something that you all are pushing towards in the future. >> ...works today, but that's for certain. But it's definitely something we're pushing on as well. >> I think what you've seen with Nano Banana generally, right, is that we started with a lot of the fun use cases, like "edit an image of me and make me a mini figurine," right? And then we went to, actually, you can make an infographic and use it for a work presentation, or you can use it to explain a really difficult concept to someone. And I think you're seeing us do that with video now, too, where we're starting with a lot of these fun things, but then you're like, "Oh, it's actually also useful because I take a video on vacation and somebody walks in and maybe I just want to remove them."
In the past, most people would have never learned the professional tools to be able to do those kinds of things, right? So I think we're in that stage, and we're seeing the early glimpses of, well, now you have text rendering working, and then we add in the grounding and actual real information, right? We're also launching Street View grounding with Genie at I/O this year, right? So you can see how you can ground in a lot of this real information. It just makes the models much more useful. It's definitely where we're going. >> Yeah, I think it connects ultimately to Google's mission of organizing the world's information, and we really see that there are really cool applications that are more on the informational side and less on the entertainment side, which of course is interesting as well. But really, on the information side, we see that we're just at the tip of what we can do. >> It works really well, despite our not having poured super many resources into it. And I'm actually quite excited to see whether there's anything else that we haven't discovered, right? People discovered the mini figurines, or whatever it is that people have done with Nano Banana, and I'm sure people will do similarly interesting things with Omni when it ships. So, and what they tried... >> Those Yeti selfie vlogs or whatever. Now you can just be an influencer without actually having to go: just send my avatar out to go explore the world. This example is also really cool. I've really loved the examples people have been doing where you take footage of the real world and then you do something crazy and fantastical in that world. So this is an example where a person is just drawing a circle on a piece of paper and then you turn it into all these wild things.
So this is a circle turning into a black hole. Black hole or DeepMind logo. Wow, that's awesome. >> Don't try that. >> Yeah, don't try that at home. >> One thing that we really like about the model is that it can change the style of a video. And that's something we don't have any such data for when we train the model, right? It's very hard to have a video in one style and then the same video in another style; that's not something that you can easily find. But the model just kind of generalized, and because it has a deep understanding of different types of media and styles and language, you can just say, "Okay, I want to have this video in a different style, now edit it," and it will change the style in a way that looks reasonable. So here, for example, we asked the model to change the style throughout the video, and you can see the scene remains consistent but the style just changes throughout. >> And I think this is interesting just for video editing in general, right? Because there are so many different kinds of edits that you can do. And I think we weren't sure what exactly the model was going to pick up on, but it actually is very versatile. We have an amazing group of people who have been making demos for the past couple of weeks, and I've been really blown away by all the kinds of things that they have been able to do, like turning themselves into crochet dolls and liquefying things, and they just work, and it's all through prompting, which is pretty incredible. >> Something I was really impressed with was actually in an early iteration of the model, after we recently allowed it to do video edits. I was looking at some evaluation prompts and I saw this one prompt.
Well, most of them asked for these kinds of modifications to the scene, but this one prompt asked for the next scene, and the model just flawlessly executed it. It continued the story. There's a woman walking down a corridor, and then it asked for a monster to come out of a door up ahead and then the camera turned around the corner, and it just continued and did it. And that was not something we had ever explicitly trained for. And that's actually something that you can do today, which is pretty cool. >> I feel like this is an interesting new form of media that doesn't really exist. You had those choose-your-own-adventure books that you read growing up; you can kind of do that now with Omni, but in video format, in a much more engaging way, which I think is really interesting. >> I think the way that we generally think about media now is that different people learn in different ways, right? And a lot of people are visual learners, right? And so I've had a lot of people come up to me after Nano Banana saying, "Oh, my dad is a chemistry professor and I have no idea what he does. And so I just took all his lectures, fed them into Nano Banana, and got these sketch notes that visually explain it to me, because that's how I learn, right? And it was the first time that I was able to have a conversation with my dad in like 20 years about what he does." And it was a real bonding moment for that family. And so I think, again, with video, some people may understand a concept better through video, or some concepts are better explained through a video, because you need to see the motion and the transition. So one of the really exciting things, and I think we have a protein folding example, very on brand, is that you can explain things in different styles, right?
You can adjust it to different audiences. And so I think one of the things we're all really excited about is a new tool for learning, right? Because a kid is going to learn about protein folding differently than I am. And so we have one example here. >> Proteins start as chains of amino acids. They fold into patterns like the alpha helix and flat sections called beta sheets, forming a perfect three-dimensional shape. >> I need to extend this video a few times. >> We're gonna have to. This is where extension comes in. You can go deeper into... >> This is where the promise of longer videos is going to hold, and it's going to be more useful than just arbitrarily generating something very long for no good reason. Generating longer videos that explain things, that are educational, that create moments, I think that would be very cool. >> I also feel like this educational type of content with style transfer is really interesting, because I can make it comic book themed or whatever it is, and you can bring it to life in different ways if you don't want this claymation animation style or whatever it is. >> I had a physics teacher who loved bringing memes and things from my generation into the classroom, and it really helped; it made it a lot more engaging. And so you can imagine a teacher bringing videos in and really engaging students a lot more with those.
Yeah, I think it's really promising as a tool for educators and teachers to be able to bring a concept, an idea, and explain whatever they want, but in a visual way that really matches their audience, right? So it's like "explain it to me like I'm five" or "explain it to me like I like memes." The medium is basically matching the audience, and I think we can go much, much deeper. >> It's somewhat obvious, but it's also worth saying, for folks who have tried to make even this type of basic content: I've done this a bunch for programming stuff and it's extremely time-consuming. I want so much to do it because the end product, I think, is great for people. It's just super laborious. It's not what brings energy into my life. And I feel like there are so many cases now where folks are going to get their hands on Omni and go and make a bunch of this content, which is going to be great for other people, and it just wouldn't have existed otherwise, because they don't have the time, means, money, whatever it is, to do it. And now you can get a Gemini Pro subscription and you're off to the races and can go and use those, which is really cool. So it's awesome. >> So we have another demo where we basically show any modality in. So we have some images, we have audio references, we have a video. If you remember the sailor from last year's I/O, he got very famous on Twitter. We show him all the time in our review decks. And so we have an example of taking the style of a storybook and then the sailor from last year and his voice, and we've got a new track for you. >> The sea is a wild untamed might. She commands your awe. >> Very cool. I see. I feel like I need more. Give me another 20 seconds. >> Remember? Yeah. Give me another 20 seconds. >> We hear that.
>> And so, just for this example, the references for this sailor at the sea were a screenshot of what it was before, which was the previous year's video, with that audio track. So what he said in this video is exactly what he was saying in the original video. He was looking into the sea, into the distance, and then we had a style reference image that went into this. >> We've got one more example >> where you gave us your permission to use your >> Sammy took this picture >> to use your like >> I did not know what it was for originally, but I did consent. >> You did consent to your picture being taken, and we have a demo of you at 41. >> Nice, >> Gemini. >> That's pretty uncanny. That's pretty >> That looks pretty on brand, honestly. >> I mean, I think I can see your next Twitter post. >> Yeah, exactly. >> We just have one more video. Let's take a look. >> I've been working on training this new model on my likeness and voice. It is Google's new Omni model and the results are pretty wild. Um, does it actually sound like me? It is hard to tell. >> Does it pass? >> Yeah, that passes the test. >> Uh, throwback to my old department. I miss it. >> The first one was very, very good. I think the first one was A+. I think the second one was like a >> So in that case, how many images? We use multiple photos to get >> did take some side. >> You need the side profile shots so that when you move, it actually looks like you. But I will say even evaluating yourself is sometimes very hard, especially on your voice, because you don't usually listen to yourself speaking. So to show's point, it's actually really helpful to just see other people on the team, and then you look at whether the model is good at reproducing their like >> Logan may be uniquely qualified cuz he does listen to himself a lot.
>> Unfortunately, I have to hear myself talk. Maybe this is actually a great transition to talk generally about safety for this model. Obviously the capability is super interesting. Obviously, with generating things with people's likeness, I'm sure there are many challenges, product- and model-wise, to handle. So I would love to talk at a high level about the things we're thinking about, and anything that folks should know as they get their hands on Omni. >> There are obviously a lot of new things that you can do with this model, right? Including editing videos, which includes speech, and including using voice and audio references. And so we are actually going out pretty conservative when we first release this model. And you might feel like we're blocking a lot of things. Part of the reason is that we really are trying to learn what people will do with this model, and to look at real-world use cases, because we have some idea of what people will do, both positive and negative. But we really need to see what happens when we release the model in the real world. And so we are taking a pretty conservative approach for your own voice and your own likeness. Right now you can only do it through the avatar flow. We will monitor the feedback. Please do send us feedback on what you think should be working and isn't right now, and we will adjust as we go. But we really want to learn and get the feedback from users so that we can both deploy this responsibly and give people creative control over what they want to do, which is the balance that we try to strike every time. >> I don't think we've mentioned the avatar flow yet, so just the really quick TL;DR for folks who want to actually create some sort of video with their own likeness. >> Yeah.
So this is also partly a convenience thing, because as we've been talking about, in order to capture your likeness, you really do have to take multiple shots of your face. We need an audio clip. And so the avatar flow is a one-time setup that you can go through in all of the products that we're launching in, actually: YouTube, Flow, and the Gemini app. And basically, you record yourself speaking: you speak a set of numbers out loud. We capture your visual likeness on camera. And then you store your avatar and you can invoke it the next time you come into the product. So it has this safety element, but it also has the convenience element that you just have to do it once and then you can reuse your avatar every single time you come into the product. >> I think SynthID and the ability to know whether a video is generated is really key, because we see that the capabilities of models across the board get better and better. They look very real, we can have someone's identity very carefully replicated, and that obviously poses a challenge. And I think ultimately part of releasing this responsibly is that all of our videos are always embedded with SynthID, basically a watermark that makes it possible to know that a video is generated. And then on various surfaces like Chrome, the Gemini app, and others, users can check whether a video that they have or received was generated by our models. And we think that this is probably one of the most important ways to know that content is generated, and you should decide whether you want to trust it or not based on that information. Yeah.
>> And just on that: we're obviously very proud of SynthID, which is a tool that we've developed, but we also partner with C2PA, and we embed C2PA metadata in all of our content, so if people want to check it on other platforms, they can also do that. >> And SynthID carries through edits as well, if people make modifications and post. And just so everyone knows, the way to use it is you upload a video to the Gemini app, you can ask, "Is this an AI-generated video?" and it will check that for you. >> I love it. Well, this was awesome. A lot to look forward to, and hopefully a lot of feedback to get. Y'all are all over the internet. Please send us feedback about the models and what doesn't work. Maybe we can also just talk for a second before we close about what comes next? Obviously Omni Flash. Hopefully we'll have a Pro model, but is there anything that y'all are especially excited for in the next iterations of the model? >> Personally, I'm really excited about pushing the envelope on what's possible across quality and the duration of videos. >> Excited about everything. >> Everything. But specifically, I really would love to see people be able to tell stories. I think that's something we've seen with our previous models: it was pretty difficult to actually tell a story where something happens and then something else happens and things remain consistent. We made progress, and we hope to make even more progress towards that. So that can apply to filmmaking, to informational videos, and I'm the most excited about that. And I'm personally also excited about being able to have more interactive versions of our models. So we hope to have some of that in the future.
>> I'm really excited about being able to deliver factual and informational content. If you think about the progress of video models over time, as they become better at grounding and become better world models, you go from enabling people to generate really entertaining and engaging videos to things that people can use for learning and information delivery. And I think that we're going to be able to really push the envelope there in the next generation. >> I love it. >> This kind of technology will empower a new generation of YouTubers, I think, content creators who will hopefully build interesting educational, informational, whatever they are, things. I'm also excited about new emerging capabilities. I don't think we've demoed this here, but we've discovered that the model is pretty good at synchronizing music to video. So you can create videos that respond to music. Again, not something that we trained for specifically, right? I'm wondering what other things people will discover this way, what other things this model is capable of, and then maybe we can double down on that in the Pro version, whenever that comes. >> When we talk about these omni models, I think there are interesting ways in which the modalities can interact that we haven't fully discovered yet. So you can iterate on a storyboard in images and design your character before you ever go into video generation, and I think we're just scratching the surface on that. That's going to be a really interesting way to see people interact with these models, and so I'm excited about what's to come on that over the coming months. >> I'm also excited.
This is going to be great. Well, thank you all for taking the time to sit down and for building a cool model, and thanks to all the other teams across Google who made this launch possible. I think there are probably hundreds and hundreds of people in collaboration, from serving to training to research to all the product work, to bring all these things to life. So yeah, a great Google team effort. Thank you all for sitting down. And thanks everyone for tuning in to this episode of Release Notes. We'll see you in the next one.
