OpenAI GPT-Realtime-2: GPT-5-Level Reasoning for Voice AI

TL;DR

OpenAI launched GPT-Realtime-2 with GPT-5-class reasoning, a 128K-token context window (4x the previous 32K), and a reported 11% performance improvement over GPT-Realtime-1.5
Two specialized models debut: GPT-Realtime-Translate for live translation across 70+ languages, and GPT-Realtime-Whisper for streaming transcription
GPT-Realtime-2 pricing is unchanged at $32 per million input tokens and $64 per million output tokens; translation costs $0.034 per minute and transcription $0.017 per minute
Advanced agent capabilities include parallel tool calling, conversational preambles, and adjustable reasoning effort from minimal to xhigh

What Happened

OpenAI released three voice-focused models Thursday, led by GPT-Realtime-2, the company’s first voice model with what it calls “GPT-5-class reasoning.” The update quadruples the context window from 32,000 to 128,000 tokens and, by OpenAI’s account, delivers an 11% performance gain over GPT-Realtime-1.5, which was released in February 2026.

The launch includes GPT-Realtime-Translate, a dedicated model handling 70+ input languages with translation into 13 output languages, and GPT-Realtime-Whisper, a streaming transcription model building on OpenAI’s long-standing Whisper brand.

GPT-Realtime-2 introduces conversational awareness features that previous versions lacked. The model can now initiate responses with context-setting phrases like “let me check that” and execute parallel tool calls while maintaining conversation flow—capabilities that mirror modern agentic systems. Developers control reasoning depth through five settings: minimal, low (default), medium, high, and xhigh.

Why It Matters

The 32K token limit was a bottleneck. Voice agents handling customer support, medical consultations, or technical troubleshooting hit that ceiling fast when context includes conversation history, tool outputs, and system instructions. At 128K tokens, GPT-Realtime-2 can maintain coherence through complex, multi-turn interactions without losing the thread.
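To make the context ceiling concrete, here is a minimal sketch of how an application might trim conversation history to fit a fixed token window. The `estimate_tokens` heuristic (roughly four characters per token), the `fit_history` helper, and the output reserve are all illustrative assumptions, not part of any OpenAI API; a real application would count tokens with the model's actual tokenizer.

```python
CONTEXT_WINDOW = 128_000  # GPT-Realtime-2 window size, per the article


def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_history(system_prompt: str, turns: list[str],
                reserve_for_output: int = 4_000,
                window: int = CONTEXT_WINDOW) -> list[str]:
    """Keep the most recent turns that fit in the remaining token budget."""
    budget = window - reserve_for_output - estimate_tokens(system_prompt)
    kept: list[str] = []
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = [f"turn {i}: " + "x" * 2_000 for i in range(400)]
    for window in (32_000, 128_000):
        kept = fit_history("You are a support agent.", history, window=window)
        print(f"window={window}: kept {len(kept)} of {len(history)} turns")
```

Under these assumptions, the larger window simply lets more of the recent history survive trimming, which is the practical effect the paragraph above describes.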

The reasoning upgrade matters more than the speed. OpenAI is explicit about this: “building useful voice products takes more than fast turn-taking and a natural-sounding voice.” Previous voice models excelled at responsiveness but struggled with contextual understanding when requests shifted mid-conversation or required chaining multiple actions. GPT-5-level reasoning means the model can track intent changes, recover from ambiguity, and determine when to use tools—all while speaking.

For developers, parallel tool calling changes what’s possible. A travel assistant can simultaneously check flight status, hotel availability, and weather without forcing sequential queries. The model tells users what it’s doing (“I’m checking three hotels in your price range”), reducing the “black box” anxiety that kills voice UI adoption.
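The latency benefit of parallel tool calls can be sketched on the application side with Python's `asyncio`. The three tool functions below are hypothetical stand-ins that sleep instead of calling real services, and nothing here reflects the actual Realtime API; the sketch only shows why running independent lookups concurrently takes about as long as the slowest one rather than the sum of all three.

```python
import asyncio
import time


# Hypothetical tools: each sleeps briefly to imitate a network call.
async def check_flight_status(flight: str) -> dict:
    await asyncio.sleep(0.1)
    return {"tool": "flight", "flight": flight, "status": "on time"}


async def check_hotel_availability(city: str) -> dict:
    await asyncio.sleep(0.1)
    return {"tool": "hotel", "city": city, "available": 3}


async def check_weather(city: str) -> dict:
    await asyncio.sleep(0.1)
    return {"tool": "weather", "city": city, "forecast": "clear"}


async def run_sequential() -> list[dict]:
    # Each call waits for the previous one to finish.
    return [
        await check_flight_status("XY123"),
        await check_hotel_availability("Lisbon"),
        await check_weather("Lisbon"),
    ]


async def run_parallel() -> list[dict]:
    # gather() runs the coroutines concurrently and returns
    # results in the order the calls were listed.
    return list(await asyncio.gather(
        check_flight_status("XY123"),
        check_hotel_availability("Lisbon"),
        check_weather("Lisbon"),
    ))


def timed(coro_fn):
    """Run an async function to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_t = timed(run_sequential)
    _, par_t = timed(run_parallel)
    print(f"sequential: {seq_t:.2f}s, parallel: {par_t:.2f}s")
```

In a voice setting, that saved time is silence the user would otherwise sit through, which is also where a spoken preamble helps.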

Key Details

GPT-Realtime-2 Specifications:

Context window: 128,000 tokens (vs. 32,000 in v1.5)
Performance: 11% improvement over GPT-Realtime-1.5
Reasoning levels: minimal, low (default), medium, high, xhigh
Pricing: $32 per 1M input tokens, $64 per 1M output tokens (unchanged)
Features: Parallel tool calls, conversational preambles, extended context tracking

GPT-Realtime-Translate:

Input languages: 70+
Output languages: 13
Pricing: $0.034 per minute
Use case: Live translation during voice interactions

GPT-Realtime-Whisper:

Type: Streaming transcription
Pricing: $0.017 per minute
Relationship: Branded successor to OpenAI’s open-weight Whisper models (first released in 2022)
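The prices listed above can be turned into a quick cost estimate. The sketch below uses only the rates quoted in this article; the function names and the example token and minute counts are illustrative assumptions, and real usage will depend on how audio maps to tokens, which the article does not specify.

```python
# Rates quoted in the article (USD).
INPUT_USD_PER_MILLION = 32.0
OUTPUT_USD_PER_MILLION = 64.0
TRANSLATE_USD_PER_MINUTE = 0.034
WHISPER_USD_PER_MINUTE = 0.017


def realtime_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-token cost for GPT-Realtime-2."""
    return (input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
            + output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION)


def per_minute_cost(minutes: float, usd_per_minute: float) -> float:
    """Per-minute cost for the translation or transcription models."""
    return minutes * usd_per_minute


if __name__ == "__main__":
    # Hypothetical workload: 50,000 input and 10,000 output tokens.
    print(f"Realtime-2 session: ${realtime_cost(50_000, 10_000):.2f}")
    print(f"1 hour translate:   ${per_minute_cost(60, TRANSLATE_USD_PER_MINUTE):.2f}")
    print(f"1 hour transcribe:  ${per_minute_cost(60, WHISPER_USD_PER_MINUTE):.2f}")
```

For that hypothetical workload the session comes to $2.24, while an hour of audio costs $2.04 for translation and $1.02 for transcription.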

Implications

OpenAI is segmenting voice AI into three distinct patterns: voice-to-action (“book me a table”), system-to-voice (“your flight is delayed”), and voice-to-voice (full conversational agents). This taxonomy signals where the company sees revenue opportunities—not in demos, but in production workflows.

The specialized model split (Realtime-2, Translate, Whisper) represents architectural maturity. Rather than forcing one model to handle all voice tasks, OpenAI is optimizing for specific use cases. Translation and transcription are commoditizing fast; the real differentiation lives in reasoning-aware voice interaction.

The unchanged pricing for GPT-Realtime-2 despite major capability upgrades suggests OpenAI is trying to capture market share before competitors like Anthropic or Google DeepMind ship comparable voice reasoning. At $32 per million input tokens and $64 per million output tokens, a single model could also undercut the cost of building a custom voice stack from separate ASR, LLM, and TTS components, depending on usage volume.

Our Take

The 128K context window is the headline, but the adjustable reasoning effort is the sleeper feature. Developers can now tune the tradeoff between response latency and answer quality per-query. A simple “set a timer” doesn’t need xhigh reasoning; a multi-step insurance claim does. This granularity matters for cost control and user experience.
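One way to picture per-query tuning is a small router that maps a request's expected complexity to one of the five effort levels. The level names come from the article; the `pick_effort` function and its scoring heuristic are purely hypothetical, and how the chosen value is passed to the model depends on API details the article does not cover.

```python
# The five reasoning-effort levels named in the article, lowest to highest.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


def pick_effort(expected_tool_calls: int, multi_step: bool) -> str:
    """Hypothetical heuristic: more tools and more steps get more effort."""
    score = max(0, expected_tool_calls) + (2 if multi_step else 0)
    # Clamp the score to the highest available level.
    return EFFORT_LEVELS[min(score, len(EFFORT_LEVELS) - 1)]


if __name__ == "__main__":
    requests = {
        "set a timer": (0, False),
        "look up my order": (1, False),
        "file a multi-step insurance claim": (3, True),
    }
    for text, (tools, multi_step) in requests.items():
        print(f"{text!r} -> {pick_effort(tools, multi_step)}")
```

With this heuristic the timer request maps to `minimal` and the insurance claim to `xhigh`, matching the contrast drawn above; a production router would likely use richer signals than two arguments.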

OpenAI’s timing is deliberate. Voice agents were stuck in the “cool demo, broken production” valley because they couldn’t handle real conversational complexity. By upgrading reasoning before competitors ship comparable voice models, OpenAI is setting the bar at GPT-5-level reasoning, making anything less feel outdated on arrival.

Watch for two things. First, whether the open-weight Whisper model gets a 2026 update or whether OpenAI is fully commercializing speech. Second, how enterprises respond to per-minute pricing for translation and transcription alongside per-token pricing for GPT-Realtime-2; this could push cost-sensitive use cases toward self-hosted alternatives.

The voice AI race just entered its reasoning phase. Speed and naturalness are table stakes now. The winners will be whoever ships context-aware, tool-using voice agents that don’t break when users change their mind mid-sentence.