Build Hour: GPT-Realtime-2


AI · OpenAI · May 13, 2026 at 11:13 PM · 43:00


TL;DR

OpenAI has introduced GPT Realtime 2, a voice-first AI model enabling low-latency, multilingual, tool-using agents capable of real-time reasoning and action across applications.

KEY POINTS

New audio model suite unveiled

Three models were released: a real-time translation system, a streaming Whisper-based transcription model, and GPT Realtime 2. Together, they enable live multilingual conversations, fast speech recognition with latency as low as 200 milliseconds, and advanced voice-driven reasoning. The translation model supports over 70 input languages and 13 output languages, targeting use cases like customer service and live communication.

GPT Realtime 2 brings reasoning to voice

The flagship model integrates GPT-5–class reasoning into voice interactions, improving instruction following, multilingual performance, and tool execution. It supports a 128,000-token context window, equivalent to nearly an hour of conversation, allowing sustained dialogue without truncation. New features include parallel tool calls, domain-specific vocabulary handling, and adjustable expressiveness such as tone and emotion.
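The context-window figure can be turned into a rough token rate. This is a back-of-the-envelope sketch, assuming "nearly an hour" means about 60 minutes and that the whole window is spent on the conversation itself; the function name is illustrative and not part of any OpenAI API.

```python
def implied_tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Average token rate at which `context_tokens` would fill up in `minutes`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return context_tokens / (minutes * 60)


# Figures from the summary: 128,000 tokens, about one hour (assumed 60 minutes).
rate = implied_tokens_per_second(128_000, 60)
print(f"{rate:.1f} tokens per second")  # 35.6 tokens per second
```

Under these assumptions a conversation consumes on the order of 36 tokens per second, so anything else placed in the window (system prompt, tool results) shortens the usable conversation time accordingly.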

Shift from speech-to-text pipelines to native voice AI

The model replaces traditional cascaded systems (speech-to-text, then reasoning, then text-to-speech) with a voice-to-voice architecture. This reduces latency and improves conversational flow, eliminating delays that can disrupt real-time interactions. The approach matters most in environments where even half-second pauses can feel unnatural.

Voice agents now operate software interfaces

Demonstrations showed AI agents not just responding conversationally but actively navigating interfaces and executing tasks. In an e-commerce example, a voice agent searched products, filtered results, analyzed reviews, checked weather conditions, and added items to a cart. The system dynamically selected from 15–20 tools, chaining actions without requiring user micromanagement.
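The tool-chaining behavior described here can be pictured as a registry of callable tools plus a dispatcher that runs independent calls at the same time. The following is a minimal sketch under stated assumptions: the tool names, their return values, and the shape of the `requested` list are all hypothetical stand-ins, not the format used by GPT Realtime 2 or the demo.

```python
import asyncio
from typing import Any, Awaitable, Callable


# Hypothetical tools; a real agent would call a store API and a weather service.
async def search_products(query: str) -> list[str]:
    return [f"{query} (result {i})" for i in range(1, 3)]


async def check_weather(city: str) -> str:
    return f"placeholder forecast for {city}"


# Registry mapping tool names to implementations.
TOOLS: dict[str, Callable[..., Awaitable[Any]]] = {
    "search_products": search_products,
    "check_weather": check_weather,
}


async def run_tool_calls(calls: list[dict[str, Any]]) -> list[Any]:
    """Run independent tool calls concurrently; results keep the request order."""
    unknown = [c["name"] for c in calls if c["name"] not in TOOLS]
    if unknown:
        raise KeyError(f"unknown tools: {unknown}")
    return await asyncio.gather(
        *(TOOLS[c["name"]](**c["arguments"]) for c in calls)
    )


# Simulated model output: two tool calls requested in the same turn.
requested = [
    {"name": "search_products", "arguments": {"query": "rain jacket"}},
    {"name": "check_weather", "arguments": {"city": "Seattle"}},
]
results = asyncio.run(run_tool_calls(requested))
print(results)
```

Rejecting unknown tool names before running anything keeps a misnamed call from partially executing a chain, which is one simple guard when an agent picks from many tools.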

Enterprise analytics and decision support

Another demonstration highlighted a voice-driven analytics assistant capable of filtering dashboards, identifying anomalies, and performing root cause analysis. The agent diagnosed a mobile Safari-specific bug affecting European users, summarizing findings into actionable insights. This reflects a shift toward AI acting as an in-loop analyst, not just a conversational interface.

Production challenges remain for enterprises

Companies deploying voice agents at scale report that reliability, not just fluency, is the main hurdle. Even 0.1% error rates in decision-making can create significant business risk. Systems must handle noisy audio, interruptions, accents, and ambiguous input while adhering to strict policies and workflows.
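The weight of a 0.1% error rate becomes clearer with some arithmetic. The sketch below assumes errors are independent, and the volumes used (one million decisions, ten decisions per call) are hypothetical figures chosen for illustration, not numbers from the session.

```python
def expected_errors(decisions: int, error_rate: float) -> float:
    """Expected number of wrong decisions at a given per-decision error rate."""
    return decisions * error_rate


def chance_of_any_error(decisions_per_call: int, error_rate: float) -> float:
    """Probability that at least one of several independent decisions is wrong."""
    return 1 - (1 - error_rate) ** decisions_per_call


# 0.1% error rate from the summary; the volumes are hypothetical.
print(f"{expected_errors(1_000_000, 0.001):.0f} expected errors")         # 1000 expected errors
print(f"{chance_of_any_error(10, 0.001):.4f} chance a call has an error")  # 0.0100
```

Under these assumptions, a call involving ten decisions has roughly a 1% chance of containing at least one mistake, even though each individual decision is 99.9% reliable.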

Performance gains in real-world deployments

Early testing shows notable improvements: calls are 30% faster at the median and up to 200% faster at higher latency percentiles than with older architectures. Voice quality is reported to be competitive with specialized synthesis providers, and the model retains its reasoning and task execution capabilities.
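Percentile comparisons like these can be computed from raw latency samples. The sketch below makes two assumptions that the summary does not confirm: it uses the nearest-rank percentile definition, and it reads "X% faster" as a speed ratio (old latency divided by new latency, minus one), since a latency cannot fall by more than 100%. The sample values are hypothetical.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def speedup_percent(old_ms: float, new_ms: float) -> float:
    """'X% faster' as a speed ratio: 200 means the new latency is a third of the old."""
    return (old_ms / new_ms - 1) * 100


# Hypothetical latency samples in milliseconds, for illustration only.
old_ms = [400, 450, 500, 600, 3000]
new_ms = [300, 350, 400, 450, 1000]

for pct in (50, 90):
    old, new = percentile(old_ms, pct), percentile(new_ms, pct)
    print(f"p{pct}: {old} ms -> {new} ms ({speedup_percent(old, new):.0f}% faster)")
```

Comparing tail percentiles alongside the median matters for voice, because a few very slow responses are what callers notice as awkward pauses.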

Complex orchestration beyond the model

Despite improvements, production systems still require additional infrastructure layers. These include custom turn-detection models, workflow orchestration, guardrails, sensitive data handling, and simulation-based evaluation. Voice agents must be tested end-to-end to ensure they complete tasks correctly, not just sound natural.
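End-to-end, simulation-based evaluation can be sketched as scripted scenarios that are scored on the final task state instead of on how the replies sound. Everything below is a hypothetical stand-in: `StubAgent` replaces a real voice agent, and the cart-based success check is just one example of an outcome to verify.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A scripted caller plus the end state that counts as success."""
    name: str
    user_turns: list[str]
    expected_cart: set[str]


@dataclass
class StubAgent:
    """Hypothetical stand-in for a voice agent; a real one would wrap the model."""
    cart: set[str] = field(default_factory=set)

    def handle(self, utterance: str) -> str:
        if utterance.startswith("add "):
            self.cart.add(utterance.removeprefix("add "))
            return "Added."
        return "Okay."


def run_scenario(scenario: Scenario) -> bool:
    """Pass only if the task outcome matches, regardless of how replies sound."""
    agent = StubAgent()
    for turn in scenario.user_turns:
        agent.handle(turn)
    return agent.cart == scenario.expected_cart


scenarios = [
    Scenario("adds requested item", ["hello", "add umbrella"], {"umbrella"}),
    Scenario("item never added", ["hello"], {"umbrella"}),
]
for s in scenarios:
    print(f"{s.name}: {'pass' if run_scenario(s) else 'fail'}")
```

Scoring on the resulting state is what separates this from a fluency check: the second scenario fails even though every reply the agent gave was perfectly polite.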

Key failure modes in voice AI

Common issues include mishearing names or numbers, failing to recover from early mistakes, and misunderstanding intent in high-stakes scenarios such as travel or finance. Another challenge is distinguishing between meaningful input and conversational fillers like “uh-huh,” which humans naturally ignore but models may misinterpret.
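The filler problem can be illustrated with a deliberately simple rule: while the agent is speaking, short backchannel sounds should not count as an interruption. This is a toy sketch; the filler list and function names are assumptions, and a production system would more likely rely on a trained turn-detection model than on string matching.

```python
import re

# Hypothetical backchannel list, chosen for illustration only.
FILLERS = {"uh-huh", "uh huh", "mm-hmm", "mhm", "um", "uh", "hmm"}


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'Uh-huh.' matches 'uh-huh'."""
    return re.sub(r"[^a-z\- ]", "", text.lower()).strip()


def should_interrupt(transcript: str, agent_is_speaking: bool) -> bool:
    """Return True if the user's speech should stop the agent mid-reply."""
    if not agent_is_speaking:
        return False  # nothing to interrupt; treat it as a normal turn
    cleaned = normalize(transcript)
    return cleaned != "" and cleaned not in FILLERS


print(should_interrupt("Uh-huh.", agent_is_speaking=True))                # False
print(should_interrupt("Wait, change the date", agent_is_speaking=True))  # True
```

The rule only applies while the agent is talking, because the same "uh-huh" can be a real answer when the agent has just asked a yes-or-no question.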

Expanding use cases and global adoption

Voice interfaces are gaining traction across mobile apps, smart devices, gaming, and enterprise tools. In markets such as Brazil and India, voice-first interaction is already prevalent. Speaking is four times faster than typing, which makes voice particularly suited to capturing intent quickly and naturally.

CONCLUSION

The launch of GPT Realtime 2 signals a shift toward voice-native AI systems that can reason, act, and interact in real time, though enterprise deployment still depends on robust orchestration and reliability beyond the base model.
