
OpenAI has introduced GPT Realtime 2, a voice-first AI model enabling low-latency, multilingual, tool-using agents capable of real-time reasoning and action across applications.
Three models were released: a real-time translation system, a streaming Whisper-based transcription model, and GPT Realtime 2. Together, they enable live multilingual conversations, fast speech recognition with latency as low as 200 milliseconds, and advanced voice-driven reasoning. The translation model supports over 70 input languages and 13 output languages, targeting use cases like customer service and live communication.
The flagship model integrates GPT-5–class reasoning into voice interactions, improving instruction following, multilingual performance, and tool execution. It supports a 128,000-token context window, equivalent to nearly an hour of conversation, allowing sustained dialogue without truncation. New features include parallel tool calls, domain-specific vocabulary handling, and adjustable expressiveness, such as tone and emotion.
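To illustrate why parallel tool calls matter for a voice agent, here is a minimal sketch that runs two independent tools one after the other and then concurrently. The tool names (`check_inventory`, `get_weather`), their return values, and the 50 ms delays are hypothetical stand-ins, not part of any OpenAI API; the sketch only shows the general concurrency pattern using Python's standard `asyncio` library.

```python
import asyncio
import time

# Hypothetical tools: names, payloads, and delays are illustrative assumptions.
async def check_inventory(sku: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a network round trip
    return {"sku": sku, "in_stock": True}

async def get_weather(city: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a network round trip
    return {"city": city, "forecast": "rain"}

async def run_sequential() -> list:
    # Each call waits for the previous one, so the delays add up.
    return [await check_inventory("A1"), await get_weather("Lisbon")]

async def run_parallel() -> list:
    # gather runs both coroutines concurrently and keeps argument order.
    return list(await asyncio.gather(check_inventory("A1"), get_weather("Lisbon")))

def timed(coro_fn):
    """Run an async function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

if __name__ == "__main__":
    seq_result, seq_time = timed(run_sequential)
    par_result, par_time = timed(run_parallel)
    print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

When the tools do not depend on each other, the concurrent version takes about as long as the slowest call rather than the sum of all calls, which is what keeps pauses short in a spoken exchange.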
The model replaces traditional cascaded systems (speech-to-text, then reasoning, then text-to-speech) with a voice-to-voice architecture. This reduces latency and improves conversational flow, eliminating delays that can disrupt real-time interactions. The approach matters most in environments where even half-second pauses can feel unnatural.
Demonstrations showed AI agents not just responding conversationally but actively navigating interfaces and executing tasks. In an e-commerce example, a voice agent searched products, filtered results, analyzed reviews, checked weather conditions, and added items to a cart. The system dynamically selected from 15–20 tools, chaining actions without requiring user micromanagement.
Another demonstration highlighted a voice-driven analytics assistant capable of filtering dashboards, identifying anomalies, and performing root cause analysis. The agent diagnosed a mobile Safari-specific bug affecting European users, summarizing findings into actionable insights. This reflects a shift toward AI acting as an in-loop analyst, not just a conversational interface.
Companies deploying voice agents at scale report that reliability, not just fluency, is the main hurdle. Even 0.1% error rates in decision-making can create significant business risk. Systems must handle noisy audio, interruptions, accents, and ambiguous input while adhering to strict policies and workflows.
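A rough back-of-the-envelope calculation shows why a 0.1% error rate is not negligible at scale. The sketch below assumes errors are independent across decisions, which is a simplification, and the volumes used in the example (100,000 decisions per day, 20 decisions per call) are hypothetical figures chosen only for illustration.

```python
def expected_errors(decisions: int, error_rate: float) -> float:
    """Expected number of wrong decisions out of `decisions` attempts."""
    return decisions * error_rate

def prob_at_least_one_error(decisions: int, error_rate: float) -> float:
    """Chance that at least one of `decisions` independent decisions is wrong."""
    return 1.0 - (1.0 - error_rate) ** decisions

if __name__ == "__main__":
    rate = 0.001  # the 0.1% error rate cited in the text
    # Hypothetical volumes, not taken from the article.
    print(expected_errors(100_000, rate))
    print(prob_at_least_one_error(20, rate))
```

Under these assumptions, a deployment making 100,000 decisions a day would expect about 100 wrong ones daily, and a single call involving 20 decisions would have roughly a 2% chance of containing at least one error.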
Early testing shows notable improvements, including calls that are 30% faster at the median and up to 200% faster at higher latency percentiles compared to older architectures. Voice quality is reported to be competitive with specialized synthesis providers, while maintaining reasoning and task execution capabilities.
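Median and tail figures like these come from looking at the distribution of per-call latencies rather than an average. As a minimal sketch, the function below computes a nearest-rank percentile; the sample values are invented for illustration and are not measurements of any model.

```python
import math

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

if __name__ == "__main__":
    # Hypothetical response latencies in milliseconds.
    latencies_ms = [300, 320, 310, 290, 305, 330, 315, 295, 1200, 2500]
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
```

In data shaped like this, a couple of slow responses barely move the median but dominate the higher percentiles, which is why tail latency is usually reported separately for real-time voice systems.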
Despite improvements, production systems still require additional infrastructure layers. These include custom turn-detection models, workflow orchestration, guardrails, sensitive data handling, and simulation-based evaluation. Voice agents must be tested end-to-end to ensure they complete tasks correctly, not just sound natural.
Common issues include mishearing names or numbers, failing to recover from early mistakes, and misunderstanding intent in high-stakes scenarios such as travel or finance. Another challenge is distinguishing between meaningful input and conversational fillers like “uh-huh,” which humans naturally ignore but models may misinterpret.
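The filler problem can be made concrete with a deliberately naive sketch: treat an utterance as a backchannel only if every word in it comes from a small list of fillers. The word list and the `is_backchannel` helper are hypothetical, and this is not how GPT Realtime 2 or production turn-detection models work; real systems also rely on timing, prosody, and conversational context.

```python
import re

# Hypothetical filler list; a real system would need far more than a lookup table.
BACKCHANNELS = {"uh-huh", "mm-hmm", "mhm", "uh", "um", "hmm"}

def is_backchannel(utterance: str) -> bool:
    """Return True if the utterance consists only of known filler words."""
    # Extract lowercase words, keeping internal hyphens ("uh-huh").
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", utterance.lower())
    return bool(tokens) and all(token in BACKCHANNELS for token in tokens)

if __name__ == "__main__":
    for text in ["Uh-huh.", "Mm-hmm, um", "uh, cancel my flight"]:
        action = "keep talking" if is_backchannel(text) else "treat as a new turn"
        print(f"{text!r}: {action}")
```

A lookup like this breaks down quickly, for example when a filler such as "hmm" signals doubt rather than agreement, which is part of why teams build custom turn-detection layers on top of the base model.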
Voice interfaces are gaining traction across mobile apps, smart devices, gaming, and enterprise tools. In markets such as Brazil and India, voice-first interaction is already prevalent. Because people can speak four times faster than they can type, voice is particularly suited to capturing intent quickly and naturally.
The launch of GPT Realtime 2 signals a shift toward voice-native AI systems that can reason, act, and interact in real time, though enterprise deployment still depends on robust orchestration and reliability beyond the base model.