OpenAI has rebuilt its WebRTC stack to deliver voice AI with end-to-end latency under 300 milliseconds globally. The new architecture enables fluid conversational turn-taking in the company's latest models, setting a practical benchmark for real-time multimodal agents.
Overview
Voice AI has long struggled with latency: the gap between when a user speaks and when the AI responds. Traditional approaches often exceed one second, making conversations feel stilted. OpenAI's redesigned stack targets sub-300 ms round-trip times, a threshold where interactions begin to feel natural.
What changed
The core change is a re-architected WebRTC implementation. WebRTC is the open standard for real-time audio, video, and data communication in browsers and apps. OpenAI optimized two key areas:
- Edge audio processing: Audio capture and initial processing are offloaded to edge nodes close to the user, reducing the distance data must travel before reaching the AI model.
- Opus codec negotiation: The system dynamically selects the best Opus codec parameters for each connection, balancing audio quality against bandwidth and latency.
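Opus parameters are negotiated through the SDP exchanged during WebRTC connection setup. As a minimal sketch of what codec tuning can look like on the client side, the function below rewrites the Opus `fmtp` line in an SDP blob; the parameter names come from the Opus RTP payload spec (RFC 7587), but the specific values and the `tuneOpusParams` helper are illustrative, not OpenAI's actual settings.

```typescript
// Sketch: adjust Opus fmtp parameters in an SDP string before it is applied
// with setLocalDescription/setRemoteDescription. Illustrative only.
function tuneOpusParams(
  sdp: string,
  params: Record<string, string | number>
): string {
  // Find the Opus payload type from its rtpmap line,
  // e.g. "a=rtpmap:111 opus/48000/2" -> payload type 111.
  const rtpmap = sdp.match(/^a=rtpmap:(\d+) opus\/48000\/\d+/m);
  if (!rtpmap) return sdp; // no Opus offered; leave the SDP unchanged
  const pt = rtpmap[1];
  const extra = Object.entries(params)
    .map(([k, v]) => `${k}=${v}`)
    .join(";");
  const fmtpLine = new RegExp(`a=fmtp:${pt} (.*)`);
  if (fmtpLine.test(sdp)) {
    // Append to the existing fmtp line for that payload type.
    return sdp.replace(fmtpLine, (_m, existing) => `a=fmtp:${pt} ${existing};${extra}`);
  }
  // No fmtp line yet: add one directly after the rtpmap line.
  return sdp.replace(rtpmap[0], `${rtpmap[0]}\r\na=fmtp:${pt} ${extra}`);
}

// Example: favor resilience and latency on a lossy link.
// useinbandfec enables Opus forward error correction; maxaveragebitrate
// caps bandwidth. Values here are hypothetical.
const tuned = (sdp: string) =>
  tuneOpusParams(sdp, { useinbandfec: 1, maxaveragebitrate: 24000 });
```

A production system would typically adapt these parameters per connection based on measured bandwidth and loss, rather than fixing them once.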
These optimizations sustain sub-300 ms response times even under heavy load, according to OpenAI.
How it works
The stack handles the full pipeline: audio capture, encoding, transmission, speech recognition, language model inference, speech synthesis, and playback. By reducing latency at each stage, the system achieves conversational turn-taking — the AI can interrupt, be interrupted, and respond in real time without awkward pauses.
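One way to reason about this pipeline is as a per-stage latency budget that must sum to under 300 ms. The breakdown below is purely illustrative; OpenAI has not published per-stage figures.

```typescript
// Hypothetical per-stage latency budget for a sub-300 ms voice round trip.
// Stage names mirror the pipeline above; the millisecond values are
// illustrative assumptions, not OpenAI's measurements.
const budgetMs: Record<string, number> = {
  capture: 20,     // mic buffering + edge-side audio processing
  encode: 10,      // Opus frame encoding
  transmit: 40,    // client -> edge -> inference, one way
  recognize: 50,   // streaming speech recognition
  infer: 120,      // language model time-to-first-token
  synthesize: 30,  // first chunk of speech synthesis
  playback: 20,    // return path + jitter buffer + speaker output
};

const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`total budget: ${totalMs} ms`); // 290 ms, inside the 300 ms target
```

Framing latency as a budget makes the design pressure clear: a saving at any stage (say, edge capture) can be spent elsewhere (say, a larger model), as long as the total stays under the threshold.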
OpenAI's approach is notable for its global scale. The edge infrastructure ensures that users in different regions experience similar low latency, rather than degrading with distance from central servers.
Tradeoffs
Low latency comes with tradeoffs. Edge processing requires a distributed infrastructure, which increases operational complexity. The Opus codec negotiation, while efficient, may reduce audio quality in low-bandwidth scenarios to maintain speed. OpenAI has not disclosed the exact infrastructure costs or the minimum bandwidth required for consistent sub-300 ms performance.
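The quality-versus-bandwidth tradeoff can be sketched as a simple policy that lowers the Opus target bitrate as estimated bandwidth drops, keeping latency roughly constant at the cost of audio fidelity. The thresholds and the `pickOpusBitrate` helper below are illustrative assumptions, not a disclosed OpenAI policy.

```typescript
// Sketch: choose an Opus target bitrate (bits/s) from an estimated
// link bandwidth (kbit/s), degrading quality before latency.
// All thresholds are illustrative.
function pickOpusBitrate(estimatedKbps: number): number {
  if (estimatedKbps >= 128) return 48000; // high-quality wideband speech
  if (estimatedKbps >= 64) return 32000;  // good quality, modest savings
  if (estimatedKbps >= 32) return 16000;  // noticeable but acceptable loss
  return 8000;                            // narrowband fallback: intelligible
}
```

In a real client this decision would be re-evaluated continuously from WebRTC's bandwidth estimates, so quality recovers as soon as the link improves.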
When to use it
This architecture is relevant for any application requiring real-time voice interaction: customer service bots, voice assistants, language learning tools, and accessibility interfaces. Developers building on OpenAI's voice models benefit from the latency improvements without needing to implement their own WebRTC optimizations.
Bottom line
OpenAI's rebuilt WebRTC stack demonstrates that sub-300 ms voice AI is achievable at scale. For developers and product teams, the key takeaway is that real-time conversational AI is no longer a theoretical goal — it is a working infrastructure decision. The architecture quietly redefines what is possible for interactive AI at scale.