
Google DeepMind has launched Gemini Embedding 2, a multimodal model that unifies text, images, audio, video, and documents into a single embedding space for search and AI workflows.
Gemini Embedding 2 directly maps multiple data types—including text, images, audio, video, and documents—into a shared vector space. This enables systems to compare and retrieve information across formats without converting everything into text, improving both accuracy and efficiency.
The model supports combining different modalities in a single request, such as pairing an image with descriptive text or audio. These inputs are merged into one composite embedding, simplifying system design and reducing the need for separate processing pipelines.
The model handles more than 100 languages out of the box. This allows developers to build global applications without additional translation layers, maintaining semantic consistency across languages and media types.
Using Matryoshka representation learning, the model prioritizes key semantic information in early vector dimensions. Developers can choose between the default 3,072 dimensions or smaller sizes like 1,536 or 768, balancing performance, storage costs, and search latency while preserving quality.
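To illustrate how this is typically used, here is a minimal client-side sketch; it assumes the embedding is returned as a plain list of floats and that, as with other Matryoshka-style models, shortened vectors should be re-normalized before similarity search (the API also accepts an output dimensionality setting directly, as shown later in the walkthrough):

```python
import numpy as np

def truncate_embedding(values, dim=768):
    """Keep the first `dim` Matryoshka dimensions and re-normalize.

    Assumes `values` is the full 3,072-dimensional vector returned by the
    model; the leading dimensions carry the most semantic information, so
    truncating plus re-normalizing is the usual way to trade a little
    quality for lower storage cost and search latency.
    """
    v = np.asarray(values[:dim], dtype=np.float32)
    return v / np.linalg.norm(v)

# Hypothetical usage: `full_embedding` would come from the embedding API.
# compact = truncate_embedding(full_embedding, dim=1536)
```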
The system sets new performance standards across multimodal tasks, outperforming leading models in text, image, and video benchmarks while adding speech understanding. This positions it as a strong foundation for advanced AI retrieval systems.
The model is designed as a backbone for multimodal retrieval-augmented generation (RAG) and agent-based systems. It enables querying across mixed data sources—such as video libraries, meeting recordings, and documents—within a single unified framework.
Gemini Embedding 2 is tuned for multiple applications, including semantic search, question answering, classification, clustering, code retrieval, and fact-checking. Task-specific prompting guidance is provided to improve performance in each area.
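For context, the existing Gemini embedding API exposes a task_type setting that corresponds to these use cases (for example RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, QUESTION_ANSWERING, FACT_VERIFICATION, CODE_RETRIEVAL_QUERY, CLASSIFICATION, CLUSTERING, SEMANTIC_SIMILARITY). Assuming Gemini Embedding 2 keeps this configuration field, a retrieval setup might look like the sketch below; the model ID is taken from the announcement, and the field names should be checked against the prompting guide:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Embed a document and a query with task-specific settings. The task_type
# values follow the existing Gemini embedding API; whether Gemini Embedding 2
# uses the same field is an assumption.
doc = client.models.embed_content(
    model="gemini-embedding-2",  # model ID as given in the announcement
    contents="Gemini Embedding 2 maps text, images, audio, and video into one vector space.",
    config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)
query = client.models.embed_content(
    model="gemini-embedding-2",
    contents="Which model embeds multiple modalities?",
    config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
)

doc_vec = doc.embeddings[0].values
query_vec = query.embeddings[0].values
```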
Developers can perform similarity search across modalities, such as retrieving images from text queries or matching audio clips to visual content. For example, a query like “animal” can return relevant images, while an image input can retrieve visually or semantically similar items.
The model is available via the Gemini API and the Gemini Enterprise Agent Platform. Implementation requires minimal code, with support for direct embedding of raw files like images, audio, and PDFs using standard SDKs.
Gemini Embedding 2 represents a significant step toward unified multimodal AI systems, enabling more efficient search, retrieval, and agent workflows across diverse data types.
Hi, I'm Patrick and I'm a member of the technical staff at Google DeepMind. Today, we're excited to launch Gemini Embedding 2 in general availability. Gemini Embedding 2 is our first natively multimodal embedding model. It is based on Gemini and directly maps text, images, video, audio, and documents into a single unified embedding space. Let me give you a quick overview of the model and then show you how to use it with some code examples.

Because the model processes multiple modalities natively, it understands the semantic relationships across different media types without relying on intermediate text conversions. It also supports interleaved inputs. This means you can pass multiple modalities, like an image alongside a related text description, in a single API request to generate one composite embedding. For developers, this significantly simplifies your architecture. You can pass a raw image or video clip directly to the embedding endpoint and receive a vector that is immediately comparable to text queries. It also supports over 100 languages out of the box, so you can build multilingual applications with it.

Now, let's talk about flexible output size, because vector dimension size is a major factor in storage costs and search latency. To give you more control, we're using Matryoshka representation learning. This nests the most critical semantic information in the earliest dimensions of the vector. You can request the default 3,072 dimensions for maximum precision, or truncate to 1,536 or even 768 to optimize for scale. The model maintains high quality even at lower dimensions, so you can easily tune that cost-performance tradeoff for your specific app.

On the benchmarks, it establishes a new performance standard for multimodal depth. The model outperforms leading models in text, image, and video tasks while also introducing speech capabilities. The primary application for Gemini Embedding 2 is serving as the retrieval backbone for multimodal RAG and agentic workflows. You can build agents that query video libraries, meeting audio, and text documents simultaneously without maintaining separate ingestion pipelines for each modality. To support different use cases, the model is also optimized for specific tasks, including search queries, question answering, fact-checking, code retrieval, classification, clustering, and semantic similarity. Our documentation includes a prompting guide to help you implement these task-specific optimizations. Gemini Embedding 2 is generally available now via the Gemini API and the Gemini Enterprise Agent Platform.

So, let's jump into the code and learn how to use it. I'll walk through how to generate multimodal embeddings and how to perform search across different file types. I'm in Antigravity and we're using Python, so we install the Google GenAI Python SDK, set up a client, and then simply call client.models.embed_content, with gemini-embedding-2 as the model ID. For the first example, we have a simple text sentence, so we embed the sentence "What is the meaning of life?", print the result's embeddings, and also extract the values from the first item in the list and calculate their length. Let's run this and see what we get. Here we have our embedding values, and you can see the length is 3,072, which is the output dimension we get by default. So, this is a simple example for text.
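A minimal sketch of this first text example, assuming the google-genai Python SDK; the method name and response fields follow the current SDK, and the model ID is taken from the video, so verify both against the documentation:

```python
from google import genai

client = genai.Client()  # expects the API key in the environment

# Embed a single text sentence (model ID as stated in the video).
result = client.models.embed_content(
    model="gemini-embedding-2",
    contents="What is the meaning of life?",
)

values = result.embeddings[0].values  # list of floats
print(values[:5])
print(len(values))  # 3,072 by default
```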
Now, let's jump to a multimodal example; the code is similar for, for example, an image. We read an image from disk (I prepared a few images, like this one with a puppy) and then simply add it to the contents. This time we pass the raw bytes together with the correct MIME type, and this is all we need to embed an image. The code is exactly the same for an audio file, so here we also have an audio file, for example: "Look at how fast the dog runs to catch the tennis ball in the park." In this case, of course, we need to change the MIME type. The code is also very similar, or exactly the same, for PDFs and videos, except that we're using a different MIME type. So, let's run this. Let me clear this first and then run the second file; it should also be pretty fast, and here we get our embeddings for an image and our embeddings for an audio file.

In the next example, I want to show you how you can do embedding aggregation, combining multiple inputs from different modalities and calculating one single embedding. Here again, we're reading an image and then an audio file, and all you have to do is append multiple inputs to this contents list. This can be, for example, a text description of a dog, then the image of the dog, and then also a sound file talking about a dog. Now if we run the third file with Python, you should see we only get one embedding back, so this combined everything into one embedding.

Now, I want to show you how to configure the output dimensionality. For this, all you have to do is append a configuration, which is a dictionary where you specify the output dimensionality key and, as its value, the size you want; here we're using 768. Then we again print the embeddings and the length of the embeddings. So, let's clear this and run the next example. It was pretty fast, and now you can see this has a length of 768.

As a last example, I want to show you how we can implement multimodal search with the model and perform, for example, similarity search to find different images. I prepared 10 different images across different categories: food like an apple and bananas, a few animals like a kitten, a bird, and a puppy, and then, for example, an image of a city and some nature. Now we embed all of these images with the code that I showed you before. We iterate over the images, read the bytes, calculate the embeddings, store each one in a dictionary together with the image name, and then dump everything to a JSON file. I already ran this, so the result is in this JSON file: the file name and then the embeddings, for all 10 items. So, this is the first step: calculate the embeddings for your data. As the next step, we perform search. Now we do similarity search, and this can be based on a text query, but it can also be based on, for example, an image or an audio file. First, let me show you how to do this with a text query. Let's start with the query "an image of an animal". We embed this query to get our query embedding, and now we need to calculate the similarity between it and the stored items. For this we're using the cosine similarity, which is defined next.
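Here is a hedged sketch of the image, audio, aggregation, and output-dimensionality steps described above. Passing raw bytes with a MIME type and getting a single composite embedding back for interleaved inputs follows the video's description; the Part constructor, file names, and config shape are assumptions based on the current SDK:

```python
from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-embedding-2"  # model ID as stated in the video

# 1) Embed a single image by passing its raw bytes plus the MIME type
#    (file names here are placeholders).
with open("puppy.jpg", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
image_result = client.models.embed_content(model=MODEL, contents=[image_part])

# 2) Same pattern for audio (or PDFs / video); only the MIME type changes.
with open("dog_park.mp3", "rb") as f:
    audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/mp3")
audio_result = client.models.embed_content(model=MODEL, contents=[audio_part])

# 3) Interleaved inputs: text + image + audio in one request. As described
#    in the video, this returns a single composite embedding.
combined = client.models.embed_content(
    model=MODEL,
    contents=["A playful puppy at the park", image_part, audio_part],
)
print(len(combined.embeddings))  # expected: 1 aggregated embedding

# 4) Shrink the output vector with the output_dimensionality config.
small = client.models.embed_content(
    model=MODEL,
    contents="What is the meaning of life?",
    config={"output_dimensionality": 768},
)
print(len(small.embeddings[0].values))  # 768
```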
The cosine similarity of two vectors is defined as their dot product divided by the product of their lengths. Here we load the stored embeddings, iterate over all the values, and for each value we calculate the similarity between the query embedding and the item in our database. Then we sort the results and, for example, print the top three items. So, if we run python search.py, we see the three closest matches for the query "an image of an animal" are the puppy, the bird, and the kitten. So, this worked. If we change the query to "an image of food" and run this again, you should see that the top three closest matches are now the pizza, the bananas, and the apple image. So, you can see that based on text, we get the corresponding images.

We can also do image-to-image search. Instead of using a text query, we can feed in an image. Here I have a second puppy, which looks similar to the first one but in a different pose and a different setting. Now we feed this in, this image becomes our query embedding, and if we run it again, we should see that the puppy is close, the kitten is close, and the bananas were the third case. But still, the closest matches are our puppy and kitten. This is an easy way to use the model for multimodal search.

And you can take this a step further. For this we have an applet that you can find on ai.studio, called multimodal search. Here, for example, we embedded different text, different images, and different audio samples. If you search for "cat" across each file type, you get matching results: the closest text is a sleepy kitten, the closest image is this one, and the closest audio sample is a cat purring. Let's quickly listen to it: "The little kitten is purring so loudly while she takes a nap in the sun." And of course, you can also upload an image or an audio file to use as your search query. So, play around with this and have fun.

All right, that's it. Detailed documentation and an interactive app are linked below. Give it a try, let us know what you build with it and if you have any feedback, and see you in the next one. Bye.
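As a reference, here is a sketch of the search step walked through above: embed the query, load the stored JSON index, rank by cosine similarity, and print the top three matches. The index file name and its format are assumptions that mirror what the video shows:

```python
import json

import numpy as np
from google import genai

client = genai.Client()
MODEL = "gemini-embedding-2"  # model ID as stated in the video

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the text query (an image could be embedded the same way to do
# image-to-image search).
query = "an image of an animal"
query_embedding = client.models.embed_content(
    model=MODEL, contents=query
).embeddings[0].values

# Load the pre-computed image embeddings (file name assumed).
with open("image_embeddings.json") as f:
    index = json.load(f)  # {"puppy.jpg": [...], "kitten.jpg": [...], ...}

# Rank every stored image by similarity to the query and show the top 3.
ranked = sorted(
    index.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
for name, _ in ranked[:3]:
    print(name)
```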