
OpenAI has introduced real-time audio models capable of live multilingual translation and voice-driven task execution with reasoning and system integration.
OpenAI has unveiled new real-time audio capabilities in its API, highlighting two models: GPT Realtime Translate and GPT Realtime 2. Both process speech as it is spoken, enabling live translation and interactive voice-based assistance. The release signals a push toward making voice a primary interface for digital systems.
The GPT Realtime Translate model can translate speech in real time across approximately 70 languages. It begins translating mid-sentence by identifying key linguistic elements such as verbs, producing output that mirrors natural conversational flow. The system maintains fluency even when speakers switch languages abruptly or include technical terminology.
Unlike traditional translation tools that wait for pauses, the model delivers continuous output, creating the effect of a live dialogue. It preserves sentence structure and tone, enabling smoother communication in multilingual settings such as international presentations, customer support, and education. The system can also dynamically switch between languages during a conversation without disruption.
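The difference between pause-based and continuous output can be pictured with a small Python sketch. It is a toy illustration of output timing only, not OpenAI's method: `PAUSE`, `fake_translate`, and both function names are hypothetical, and the stand-in translator merely tags the text.

```python
from typing import Callable, Iterable, Iterator, List, Optional

PAUSE = None  # hypothetical marker for a gap in the speaker's audio


def translate_on_pause(
    chunks: Iterable[Optional[str]], translate: Callable[[str], str]
) -> Iterator[str]:
    """Buffer speech until a pause, then translate the whole utterance."""
    buffered: List[str] = []
    for chunk in chunks:
        if chunk is PAUSE:
            if buffered:
                yield translate(" ".join(buffered))
                buffered = []
        else:
            buffered.append(chunk)
    if buffered:  # flush whatever is left when the stream ends
        yield translate(" ".join(buffered))


def translate_continuously(
    chunks: Iterable[Optional[str]], translate: Callable[[str], str]
) -> Iterator[str]:
    """Emit a translation for each chunk as soon as it arrives."""
    for chunk in chunks:
        if chunk is not PAUSE:
            yield translate(chunk)


def fake_translate(text: str) -> str:
    # Stand-in for a real translation model (assumption for illustration).
    return f"[translated] {text}"


speech = ["we will ship", "the new release", "on Friday", PAUSE]
paused_output = list(translate_on_pause(speech, fake_translate))
continuous_output = list(translate_continuously(speech, fake_translate))
```

In this sketch the pause-based version produces a single output after the speaker stops, while the continuous version produces one output per chunk, so the listener hears the first words while the sentence is still being spoken.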
The GPT Realtime 2 model introduces reasoning capabilities to voice assistants, allowing them to interpret requests, access data, and execute tasks. In demonstrations, the system retrieved calendar details, identified meeting participants, and responded conversationally, all while maintaining awareness of context and timing.
The model can connect to external tools such as calendars, CRM systems, dashboards, and connected devices. In one example, it updated a CRM entry with meeting summaries and next steps after pulling relevant contextual information. This integration enables voice agents to act directly within existing workflows.
GPT Realtime 2 supports parallel tool calling, allowing multiple background actions to occur simultaneously. During these processes, the system communicates progress to the user through short updates or “preambles,” ensuring transparency when tasks take several seconds to complete.
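A rough Python sketch of this pattern, using `asyncio` to run two tool calls concurrently and emit a spoken update if they are still pending after a short delay. It does not use the OpenAI API: the tool functions, their return values, and the `say` callback are hypothetical stand-ins.

```python
import asyncio
from typing import Callable, Tuple


# Hypothetical tools standing in for real calendar and CRM integrations.
async def fetch_calendar(meeting_id: str) -> dict:
    await asyncio.sleep(0.05)  # simulated network latency
    return {"meeting_id": meeting_id, "participants": ["Ana", "Ben"]}


async def fetch_crm_record(account: str) -> dict:
    await asyncio.sleep(0.05)  # simulated network latency
    return {"account": account, "stage": "negotiation"}


async def run_tools_with_preamble(
    say: Callable[[str], None], preamble_after: float = 0.02
) -> Tuple[dict, dict]:
    """Run both tools in parallel; speak a short update if they are slow."""
    tasks = [
        asyncio.create_task(fetch_calendar("m-1")),
        asyncio.create_task(fetch_crm_record("acme")),
    ]
    # Give the tools a brief head start before saying anything.
    _done, pending = await asyncio.wait(tasks, timeout=preamble_after)
    if pending:
        say("One moment, I'm checking your calendar and CRM.")
    # gather returns results in the same order as the tasks list.
    calendar, crm = await asyncio.gather(*tasks)
    return calendar, crm


spoken: list = []
calendar, crm = asyncio.run(run_tools_with_preamble(spoken.append))
```

Because both tasks start before either is awaited, the total wait is roughly that of the slower call rather than the sum of the two, and the preamble is only spoken when the user would otherwise be left in silence.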
A key feature is persistent listening: the assistant keeps following the conversation and does not interject until it is prompted. Users can speak naturally, pause, and resume without resetting the session, which reinforces the sense of an ongoing dialogue.
The models maintain conversational context over time, enabling them to handle multi-step interactions and evolving instructions. This supports more complex use cases, such as preparing meeting summaries, tracking tasks, or coordinating across multiple applications.
These advancements position voice as a more viable primary interface for digital interaction. By combining translation, reasoning, and action-taking in real time, the models reduce friction between human communication and software execution.
The new real-time audio models demonstrate a shift toward seamless, voice-driven computing, combining instant translation with intelligent task execution across connected systems.