We’re introducing three audio models in the API

AI • OpenAI • May 7, 2026 • 4:05

TL;DR

OpenAI has introduced real-time audio models capable of live multilingual translation and voice-driven task execution with reasoning and system integration.

KEY POINTS

Launch of real-time audio models

OpenAI has unveiled new real-time audio capabilities in its API, highlighting two systems: GPT Realtime Translate and GPT Realtime 2. These models are designed to process spoken language instantly, enabling both live translation and interactive voice-based assistance. The release signals a push toward making voice a primary interface for digital systems.

Live translation across 70 languages

The GPT Realtime Translate model can translate speech in real time across approximately 70 languages. Rather than waiting for a sentence to end, it starts translating as soon as it has heard a key element such as the verb, so the output follows the pace of the speaker. In the demo, the system stayed fluent when the speaker switched languages abruptly and when the speech included technical terminology.

Natural conversational flow

Unlike traditional translation tools that wait for pauses, the model delivers continuous output, creating the effect of a live dialogue. It preserves sentence structure and tone, enabling smoother communication in multilingual settings such as international presentations, customer support, and education. The system can also dynamically switch between languages during a conversation without disruption.

Voice agents with reasoning and actions

The GPT Realtime 2 model introduces reasoning capabilities to voice assistants, allowing them to interpret requests, access data, and execute tasks. In demonstrations, the system retrieved calendar details, identified meeting participants, and responded conversationally, all while maintaining awareness of context and timing.

Integration with external systems

The model can connect to external tools such as calendars, CRM systems, dashboards, and connected devices. In one example, it updated a CRM entry with meeting summaries and next steps after pulling relevant contextual information. This integration enables voice agents to act directly within existing workflows.
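
As a minimal sketch of how an application might route a model's tool request to its own code, the Python below looks up a tool by name in a registry and calls it with the supplied arguments. Nothing here is OpenAI's API: the JSON shape of the request, the `TOOLS` registry, the `dispatch` helper, and the in-memory `crm` dictionary are all assumptions made for illustration, and the example data is borrowed from the demo.

```python
import json
from typing import Callable

# In-memory stand-in for a CRM (assumption); a real integration
# would call an external service instead.
crm: dict[str, dict] = {}

def update_crm(account: str, brief: str, next_steps: str) -> str:
    """Hypothetical tool: store a meeting brief and next steps for an account."""
    crm[account] = {"brief": brief, "next_steps": next_steps}
    return f"updated {account}"

# Registry mapping tool names to the functions that implement them.
TOOLS: dict[str, Callable[..., str]] = {"update_crm": update_crm}

def dispatch(tool_call_json: str) -> str:
    """Look up the requested tool by name and call it with its arguments."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"unknown tool: {call['name']}"
    return tool(**call["arguments"])

# Example request in an assumed format: a tool name plus keyword arguments.
request = json.dumps({
    "name": "update_crm",
    "arguments": {
        "account": "Sable Crust Robotics",
        "brief": "Launched warehouse automation; expansion is active.",
        "next_steps": "Resolve the security review.",
    },
})
result = dispatch(request)
```

Keeping the registry explicit means the application, not the model, decides which actions are available and what each one is allowed to touch.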

Parallel tool use and user feedback

GPT Realtime 2 supports parallel tool calling, allowing multiple background actions to occur simultaneously. During these processes, the system communicates progress to the user through short updates or “preambles,” ensuring transparency when tasks take several seconds to complete.
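
The pattern of speaking a preamble while several tools run at once can be sketched with Python's `asyncio`. This is an illustration of the general idea, not the model's actual mechanism or API: `fetch_calendar`, `fetch_crm_notes`, `handle_request`, the `say` callback, and the returned strings are hypothetical stand-ins.

```python
import asyncio

# Hypothetical tools (assumptions): stand-ins for real calendar and CRM calls.
async def fetch_calendar() -> str:
    await asyncio.sleep(0.05)  # simulate network latency
    return "meeting in 12 minutes"

async def fetch_crm_notes() -> str:
    await asyncio.sleep(0.05)  # simulate network latency
    return "security review is the blocker"

async def handle_request(say) -> list[str]:
    # Preamble: tell the user what is happening before the slow work starts.
    say("Let me pull the latest context.")
    # Run both tool calls concurrently; gather returns results in argument order.
    results = await asyncio.gather(fetch_calendar(), fetch_crm_notes())
    say("Done.")
    return list(results)

spoken: list[str] = []  # collects what the agent would say aloud
results = asyncio.run(handle_request(spoken.append))
```

Because both coroutines wait at the same time, the total delay is roughly that of the slowest tool rather than the sum of the two, and the preamble is delivered before either one finishes.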

Continuous listening without interruption

A key feature is persistent listening: the assistant remains aware of conversation context without interrupting until prompted. This allows users to speak naturally, pause interactions, and resume without resetting the system, enhancing the sense of an ongoing dialogue.
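
One way to picture this behavior is a small gate that records every utterance but only allows a reply once a resume phrase is heard. The sketch below is an assumption-laden toy, not how the model works internally: the `ListeningGate` class, the substring matching, and the "stay quiet" trigger are invented for illustration, with the "back to demo" phrase taken from the video.

```python
from dataclasses import dataclass, field

@dataclass
class ListeningGate:
    """Hypothetical gate: hears everything, replies only when not muted."""
    resume_phrase: str = "back to demo"
    muted: bool = False
    heard: list[str] = field(default_factory=list)

    def on_utterance(self, text: str) -> bool:
        """Record the utterance; return True if the agent may respond."""
        self.heard.append(text)  # context is retained even while muted
        lowered = text.lower()
        # Check the mute request first: the same sentence may also
        # mention the resume phrase ("stay quiet until I say ...").
        if "stay quiet" in lowered:
            self.muted = True
            return False
        if self.resume_phrase in lowered:
            self.muted = False
            return True
        return not self.muted

gate = ListeningGate()
replies = [
    gate.on_utterance("Can you take a look at my calendar?"),
    gate.on_utterance("Please stay quiet for a second until I say back to demo."),
    gate.on_utterance("Thank you, Jason, for the great reminder."),
    gate.on_utterance("Hey, back to demo. How was that?"),
]
```

The key design point is that muting affects only whether the agent speaks; every utterance still lands in `heard`, so the conversation can resume without losing context.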

Context retention and adaptability

The models maintain conversational context over time, enabling them to handle multi-step interactions and evolving instructions. This supports more complex use cases, such as preparing meeting summaries, tracking tasks, or coordinating across multiple applications.

Implications for voice interfaces

These advancements position voice as a more viable primary interface for digital interaction. By combining translation, reasoning, and action-taking in real time, the models reduce friction between human communication and software execution.

CONCLUSION

The new real-time audio models demonstrate a shift toward seamless, voice-driven computing, combining instant translation with intelligent task execution across connected systems.

Full transcript

Hey everyone, we're introducing new real-time audio models in the OpenAI API. In this demo, I'll show two of them: GPT Realtime Translate for live translations and GPT Realtime 2 for voice agents that can follow instructions and take actions.

Let's start with translations, because that one feels so magical. I speak French, but say I need to present to an audience around the world. The English you'll hear is the model's live audio output, captured directly from this laptop with transcriptions. Now, as I start speaking in French, we'll lower the volume of my mic and increase the one from the model so you can have a real feel for it. No edit to the audio. Let's give it a try.

What's really impressive is that the model can listen to me and translate while I'm speaking. It waits for the key word, like the verb, and starts translating right away, and the result is a much more natural conversation, just like a dialogue between two people. I can even interrupt in German and the model switches effortlessly between my German and your French, and we can even include technical terms like GPT Realtime, OpenAI, or computer use, and the model has no trouble handling that. Isn't that amazing?

The model can translate across 70 different languages in real time, really following the shape of every sentence. So whether you're building a media platform or tools for customer support or education, we believe that this can help you break down the language barriers.

And this model is just one of the ways we're improving voice intelligence. So for the next demo, let's talk about GPT Realtime 2, our new model that brings intelligent reasoning to voice agents. So, let's bring up my phone and take a look at my personal voice assistant.

>> Hi there.

>> Hi again. What's up?

>> Yeah, I have uh a customer meeting coming up. Can you take a look at my calendar?

>> You have a meeting with Sable Crust Robotics in 12 minutes. And you're meeting with Alex Kim, their CTO.

>> Great. Thank you. Oh, please stay quiet for a second until I say back to demo.

>> Ramon, don't forget, now that these models have things like reasoning and parallel tool calling, it's even more important to use things like preambles. This way, the model can explain itself and update the user.

>> Thank you, Jason, for the great reminder. Very important. Actions can of course take a few seconds, and so it's very important for the model to acknowledge those. With GPT Realtime 2, you can communicate directly during the reasoning and the tool calling so the user stays informed. And by the way, what makes voice agents so natural now is that they stay in the conversation. Jason and I have been chatting. Uh the model has been listening, and it's still listening now but not interrupting us until I say back to demo.

>> I'm here when you're ready to continue the demo.

>> Pretty cool, right? So now let's highlight what Jason just mentioned with preambles. Let's ask another task and say, hey, could you uh now uh update the CRM and um put the meeting of today as a brief and the next steps.

>> Let me pull the latest context and update your CRM. Sable Crust launched warehouse automation this morning. Expansion is active. Security review is the blocker.

>> All right. Thank you. I'm all set. Uh please stay quiet again as I wrap this up. What's exciting here is that you can now connect the model to any kind of system. It could be your dashboards, the services you're using, even connected devices, and so much more.

So, that was a quick preview of our new real-time audio models coming to the OpenAI API. You can now create agents that keep the conversation going as they think in the background. They can translate live across 70 languages. They can preserve context, and they can even act inside the products you're already using. Voice can truly become the primary interface now, and we can't wait to see what you build with these new models. Thanks for watching. Hey, back to demo. How was that?

>> Smooth and clear. It felt natural and demo friendly.
