OpenAI has rebuilt its WebRTC stack to deliver voice AI with end-to-end latency under 300 milliseconds globally. The new architecture enables fluid conversational turn-taking in the company's latest models, setting a practical benchmark for real-time multimodal agents.
Overview
Voice AI has long struggled with latency: the gap between when a user speaks and when the AI responds. Traditional approaches often exceed one second, making conversations feel stilted. OpenAI's redesigned stack targets sub-300 ms round-trip times, a threshold where interactions begin to feel natural.
What changed
The core change is a re-architected WebRTC implementation. WebRTC is the open standard for real-time audio, video, and data communication in browsers and apps. OpenAI optimized two key areas:
- Edge audio processing: Audio capture and initial processing are offloaded to edge nodes close to the user, reducing the distance data must travel before reaching the AI model.
- Opus codec negotiation: The system dynamically selects the best Opus codec parameters for each connection, balancing audio quality against bandwidth and latency.
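Opus parameters are negotiated through the SDP exchanged during WebRTC connection setup. As a minimal sketch of what codec tuning can look like on the client side, the function below rewrites the Opus `fmtp` line in an SDP blob; the parameter names come from the Opus RTP payload spec (RFC 7587), but the specific values and the `tuneOpusParams` helper are illustrative, not OpenAI's actual settings.

```typescript
// Sketch: adjust Opus fmtp parameters in an SDP string before it is applied
// with setLocalDescription/setRemoteDescription. Illustrative only.
function tuneOpusParams(
  sdp: string,
  params: Record<string, string | number>
): string {
  // Find the Opus payload type from its rtpmap line,
  // e.g. "a=rtpmap:111 opus/48000/2" -> payload type 111.
  const rtpmap = sdp.match(/^a=rtpmap:(\d+) opus\/48000\/\d+/m);
  if (!rtpmap) return sdp; // no Opus offered; leave the SDP unchanged
  const pt = rtpmap[1];
  const extra = Object.entries(params)
    .map(([k, v]) => `${k}=${v}`)
    .join(";");
  const fmtpLine = new RegExp(`a=fmtp:${pt} (.*)`);
  if (fmtpLine.test(sdp)) {
    // Append to the existing fmtp line for that payload type.
    return sdp.replace(fmtpLine, (_m, existing) => `a=fmtp:${pt} ${existing};${extra}`);
  }
  // No fmtp line yet: add one directly after the rtpmap line.
  return sdp.replace(rtpmap[0], `${rtpmap[0]}\r\na=fmtp:${pt} ${extra}`);
}

// Example: favor resilience and latency on a lossy link.
// useinbandfec enables Opus forward error correction; maxaveragebitrate
// caps bandwidth. Values here are hypothetical.
const tuned = (sdp: string) =>
  tuneOpusParams(sdp, { useinbandfec: 1, maxaveragebitrate: 24000 });
```

A production system would typically adapt these parameters per connection based on measured bandwidth and loss, rather than fixing them once.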
These optimizations sustain sub-300 ms response times even under heavy load, according to OpenAI.
How it works
The stack handles the full pipeline: audio capture, encoding, transmission, speech recognition, language model inference, speech synthesis, and playback. By reducing latency at each stage, the system achieves conversational turn-taking — the AI can interrupt, be interrupted, and respond in real time without awkward pauses.
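One way to reason about this pipeline is as a per-stage latency budget that must sum to under 300 ms. The breakdown below is purely illustrative; OpenAI has not published per-stage figures.

```typescript
// Hypothetical per-stage latency budget for a sub-300 ms voice round trip.
// Stage names mirror the pipeline above; the millisecond values are
// illustrative assumptions, not OpenAI's measurements.
const budgetMs: Record<string, number> = {
  capture: 20,     // mic buffering + edge-side audio processing
  encode: 10,      // Opus frame encoding
  transmit: 40,    // client -> edge -> inference, one way
  recognize: 50,   // streaming speech recognition
  infer: 120,      // language model time-to-first-token
  synthesize: 30,  // first chunk of speech synthesis
  playback: 20,    // return path + jitter buffer + speaker output
};

const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`total budget: ${totalMs} ms`); // 290 ms, inside the 300 ms target
```

Framing latency as a budget makes the design pressure clear: a saving at any stage (say, edge capture) can be spent elsewhere (say, a larger model), as long as the total stays under the threshold.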
OpenAI's approach is notable for its global scale. The edge infrastructure ensures that users in different regions experience similar low latency, rather than degrading with distance from central servers.
Tradeoffs
Low latency comes with tradeoffs. Edge processing requires a distributed infrastructure, which increases operational complexity. The Opus codec negotiation, while efficient, may reduce audio quality in low-bandwidth scenarios to maintain speed. OpenAI has not disclosed the exact infrastructure costs or the minimum bandwidth required for consistent sub-300 ms performance.
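The quality-versus-bandwidth tradeoff can be sketched as a simple policy that lowers the Opus target bitrate as estimated bandwidth drops, keeping latency roughly constant at the cost of audio fidelity. The thresholds and the `pickOpusBitrate` helper below are illustrative assumptions, not a disclosed OpenAI policy.

```typescript
// Sketch: choose an Opus target bitrate (bits/s) from an estimated
// link bandwidth (kbit/s), degrading quality before latency.
// All thresholds are illustrative.
function pickOpusBitrate(estimatedKbps: number): number {
  if (estimatedKbps >= 128) return 48000; // high-quality wideband speech
  if (estimatedKbps >= 64) return 32000;  // good quality, modest savings
  if (estimatedKbps >= 32) return 16000;  // noticeable but acceptable loss
  return 8000;                            // narrowband fallback: intelligible
}
```

In a real client this decision would be re-evaluated continuously from WebRTC's bandwidth estimates, so quality recovers as soon as the link improves.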
When to use it
This architecture is relevant for any application requiring real-time voice interaction: customer service bots, voice assistants, language learning tools, and accessibility interfaces. Developers building on OpenAI's voice models benefit from the latency improvements without needing to implement their own WebRTC optimizations.
Bottom line
OpenAI's rebuilt WebRTC stack demonstrates that sub-300 ms voice AI is achievable at scale. For developers and product teams, the key takeaway is that real-time conversational AI is no longer a theoretical goal — it is a working infrastructure decision. The architecture quietly redefines what is possible for interactive AI at scale.