What changed, in plain terms
Voice has been the awkward middle child of the generative-AI stack. Text is cheap and well understood; images and video grab the headlines; but spoken, real-time, two-way audio — the thing that powers a phone agent, a live caption feed, or an interpreter sitting between two people who do not share a language — has stayed fiddly and expensive. OpenAI's 7 May announcement is aimed squarely at that gap.
The headline is that the Realtime API is now generally available, out of beta and intended for production. Alongside that, OpenAI shipped three new models, each tuned for a distinct job rather than one general-purpose "voice" endpoint:
- GPT-Realtime-2 — the company's first voice model with what OpenAI describes as GPT-5-class reasoning. The context window jumps from 32K to 128K tokens, so the model can hold a much longer conversation in its head before it starts to lose the thread.
- GPT-Realtime-Translate — live speech translation from more than 70 input languages into 13 output languages, keeping pace with the speaker instead of waiting for a full sentence.
- GPT-Realtime-Whisper — streaming speech-to-text that transcribes live as the speaker talks, rather than after the recording ends.
The use cases OpenAI calls out read like a roadmap for anyone building service software: customer support, education, meeting management, live captioning, multilingual translation, workflow automation and voice agents. The full announcement is on the OpenAI blog.
Pro tip: if your GPT-Realtime-2 agent reuses the same system prompt and instructions on every call (and it almost certainly should), cache them. Cached audio input is billed at $0.40 per million tokens against $32 for fresh input. On a high-volume support line that is the single biggest lever you have on the bill.
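To put a number on that lever, here is a minimal sketch that prices a reused prompt prefix at the two input rates quoted above. The 2,000-token prefix and the 10,000-call volume are made-up figures for illustration, and the helper `input_cost` is hypothetical; it also assumes the prefix is either always fresh or always cached, which ignores the first cache-miss call.

```python
from decimal import Decimal

# Rates quoted in the article for GPT-Realtime-2 input, in USD per 1M tokens.
FRESH_INPUT_PER_M = Decimal("32")
CACHED_INPUT_PER_M = Decimal("0.40")
MILLION = Decimal(1_000_000)


def input_cost(fresh_tokens: int, cached_tokens: int) -> Decimal:
    """USD cost of one request's input, split into fresh and cached tokens."""
    return (
        Decimal(fresh_tokens) * FRESH_INPUT_PER_M
        + Decimal(cached_tokens) * CACHED_INPUT_PER_M
    ) / MILLION


# Hypothetical workload: a 2,000-token prefix reused across 10,000 calls.
PROMPT_TOKENS = 2_000
CALLS = 10_000

uncached_total = input_cost(PROMPT_TOKENS, 0) * CALLS
cached_total = input_cost(0, PROMPT_TOKENS) * CALLS

print(f"prefix billed fresh on every call:  ${uncached_total:.2f}")  # $640.00
print(f"prefix billed cached on every call: ${cached_total:.2f}")    # $8.00
```

The 80x gap comes straight from the ratio of the two quoted rates, so it holds whatever prefix length and call volume you plug in.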
The multilingual angle is the real story for both our markets
For builders in India and the UK, the translation and transcription pieces are more interesting than another reasoning model, because they map onto problems both markets already feel acutely.
In India, customer support is a multilingual problem by default. A lending app, a logistics startup or a government-services helpline fields calls in Hindi, Tamil, Bengali, Marathi, Telugu and a dozen more, often code-switched mid-sentence with English. Until now, supporting that meant either staffing language-specific queues or bolting a separate translation service onto the audio pipeline, adding latency at exactly the moment a frustrated caller is least patient. With 70-plus input languages flowing through a single streaming endpoint, a Bengaluru team can stand up a Hindi-or-Tamil voice agent on the same infrastructure they would use for English, and have GPT-Realtime-Whisper produce a clean transcript for the quality and compliance team in parallel.
In the UK, the pressure points are different but adjacent. UK accents — Glaswegian, Geordie, Scouse, multicultural London English — have historically tripped up speech recognition trained mostly on American data, so a streaming transcriber that handles them well is worth testing before you commit. And the translation model speaks directly to UK firms trading into Europe post-Brexit: a London SaaS company running support or sales calls into Germany, France, Spain or the Netherlands can now translate live rather than scheduling an interpreter or falling back to email. The 13 output languages are the constraint to check here — confirm your target European languages are on that list before you design around it.
Audio-token pricing is not text-token pricing, and it can quietly blow a budget. GPT-Realtime-2 bills per audio token at $32 per million in and $64 per million out, and spoken audio generates tokens far faster than typed text — think hundreds per second of speech, on both sides of the conversation. A ten-minute call on the reasoning model is a meaningfully larger line item than a ten-minute text chat. Model your cost on real audio length, not on a text-chat analogy, before you ship to volume.
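One way to model cost on real audio length is sketched below. It uses the two per-token rates quoted above; the function name `conversation_audio_cost` is hypothetical, and the tokens-per-second figure is deliberately a required argument because it is an assumption you should measure on your own traffic. The sketch also counts each second of audio once, so it will undercount if earlier turns are re-billed as context.

```python
from decimal import Decimal

INPUT_PER_M = Decimal("32")   # USD per 1M audio input tokens (from the article)
OUTPUT_PER_M = Decimal("64")  # USD per 1M audio output tokens (from the article)
MILLION = Decimal(1_000_000)


def conversation_audio_cost(
    caller_seconds: int, agent_seconds: int, tokens_per_second: int
) -> Decimal:
    """Estimate USD cost of one call from seconds of speech on each side.

    tokens_per_second is an assumption supplied by the caller, not a
    published constant; measure it from real usage reports.
    """
    input_tokens = Decimal(caller_seconds) * Decimal(tokens_per_second)
    output_tokens = Decimal(agent_seconds) * Decimal(tokens_per_second)
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / MILLION


# Hypothetical ten-minute call: caller speaks for 6 minutes, agent for 4.
# The rate of 100 tokens per second is a placeholder, not a measured value.
PLACEHOLDER_TOKENS_PER_SECOND = 100
cost = conversation_audio_cost(360, 240, PLACEHOLDER_TOKENS_PER_SECOND)
print(f"estimated audio cost for the call: ${cost:.2f}")
```

Because output is billed at twice the input rate, the agent's talk time weighs more than the caller's, which is a reason to keep spoken replies short.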
Which of the three do you actually need?
This is the question that decides your architecture, so be honest about the job before you reach for the most capable model. The three split cleanly:
- You need a conversation — the agent listens, reasons and talks back (support bot, voice assistant, tutoring agent). That is GPT-Realtime-2. It is the only one of the three that generates a spoken reply, and the only one you pay reasoning-grade audio-token prices for.
- You need a transcript — live captions, a searchable record, a feed into your analytics or compliance store, with no spoken reply. That is GPT-Realtime-Whisper, and at $0.017 per minute it is the cheapest of the trio by a wide margin.
- You need to bridge languages — two people, two languages, one live conversation, or a speaker addressing a multilingual audience. That is GPT-Realtime-Translate, billed at $0.034 per minute.
Plenty of real builds combine two. A multilingual support line might run GPT-Realtime-2 for the conversation and pipe the audio through GPT-Realtime-Whisper for the compliance transcript. A live event might run Translate for the audience and Whisper for the captions. The point is to reach for each model for the job it is priced and tuned for, rather than routing everything through the expensive reasoning model out of habit.
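The decision above is simple enough to write down as a routing rule. The sketch below is illustrative only: `models_for` is a hypothetical helper, and the strings are the product names used in this article, not confirmed API model identifiers.

```python
def models_for(spoken_reply: bool, transcript: bool, translation: bool) -> list[str]:
    """Pick the models a flow needs, one per job, following the split above."""
    models = []
    if spoken_reply:
        # Only the reasoning model generates a spoken reply.
        models.append("GPT-Realtime-2")
    if translation:
        models.append("GPT-Realtime-Translate")
    if transcript:
        models.append("GPT-Realtime-Whisper")
    return models


# Multilingual support line: conversation plus a compliance transcript.
print(models_for(spoken_reply=True, transcript=True, translation=False))
# ['GPT-Realtime-2', 'GPT-Realtime-Whisper']

# Live event: translation for the audience plus captions.
print(models_for(spoken_reply=False, transcript=True, translation=True))
# ['GPT-Realtime-Translate', 'GPT-Realtime-Whisper']
```

Writing the rule down makes the default explicit: the reasoning model is only selected when the flow actually needs a spoken reply.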
"The instinct is to use the smartest model for everything. With audio that is a trap — the transcript half of our support flow does not need reasoning at all, so moving it to the streaming transcriber cut that leg of the bill to almost nothing. Match the model to the job, not to the brochure."
— Aanya, Verified Builder · Bengaluru, IN
The pricing, side by side
Here is the trio in one table. Prices are as announced on 7 May 2026. Note the two billing units in play — GPT-Realtime-2 is per audio token, the other two are per minute — which is exactly why a like-for-like comparison needs care.
| Model | What it does | Unit | Price | When to use |
|---|---|---|---|---|
| GPT-Realtime-2 | Reasoning-grade voice agent that listens and speaks back (128K context) | Per audio token | $32 / 1M input ($0.40 / 1M cached); $64 / 1M output | Two-way conversational agents |
| GPT-Realtime-Translate | Live speech translation, 70+ in → 13 out, keeps pace with speaker | Per minute | $0.034 / min | Cross-language conversations and events |
| GPT-Realtime-Whisper | Streaming speech-to-text, transcribes live as the speaker talks | Per minute | $0.017 / min | Live captions, transcripts, compliance records |
A worked example: a ten-minute support call
Abstract pricing is hard to reason about, so cost out a concrete scenario. Take a single ten-minute customer support call on a multilingual line, where you want both a live transcript and live translation — but you are handling the conversation logic yourself rather than paying for the reasoning model on this particular flow.
- Transcription with GPT-Realtime-Whisper: 10 minutes × $0.017 = $0.17.
- Translation with GPT-Realtime-Translate: 10 minutes × $0.034 = $0.34.
- Combined transcription plus translation for the full call: $0.51.
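The same arithmetic as a small sketch, using the per-minute rates from the table. The helper name `per_minute_cost` is hypothetical, and `Decimal` is used so the cents come out exact instead of picking up floating-point noise.

```python
from decimal import Decimal

# Per-minute rates in USD, as quoted in the article.
PER_MINUTE_USD = {
    "GPT-Realtime-Whisper": Decimal("0.017"),
    "GPT-Realtime-Translate": Decimal("0.034"),
}


def per_minute_cost(minutes: int, models: list[str]) -> Decimal:
    """Total USD cost of running each listed per-minute model for the call."""
    return sum((PER_MINUTE_USD[m] * minutes for m in models), Decimal("0"))


transcribe = per_minute_cost(10, ["GPT-Realtime-Whisper"])
translate = per_minute_cost(10, ["GPT-Realtime-Translate"])
combined = per_minute_cost(10, ["GPT-Realtime-Whisper", "GPT-Realtime-Translate"])

print(f"transcribe: ${transcribe:.2f}")  # $0.17
print(f"translate:  ${translate:.2f}")   # $0.34
print(f"combined:   ${combined:.2f}")    # $0.51
```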
That is the easy half — both per-minute models are trivial to forecast, because the unit is wall-clock time. The reasoning model is the one to watch. If you instead route the whole call through GPT-Realtime-2 so it actually converses, you are no longer paying per minute; you are paying per audio token in and out, and a ten-minute two-way call can run into the millions of tokens once both speakers and the model's spoken replies are counted. That is where the per-token economics — and the cached-input lever from the pro tip above — start to dominate the bill. The practical takeaway: keep cheap per-minute work on the per-minute models, and reserve the per-token reasoning model for the part of the flow that genuinely needs to think and speak.
Instrument cost per call from day one, broken down by model. Because two of the three bill per minute and one bills per token, a single blended "cost per conversation" number hides the lever you most need to pull. Tag each leg — transcribe, translate, converse — separately in your observability.
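A minimal sketch of what per-leg tagging could look like in application code follows. The `CallCostLedger` class and the leg names are hypothetical; in practice you would emit these as tagged metrics to whatever observability stack you already run.

```python
from collections import defaultdict
from decimal import Decimal

# The three legs named in the article; anything else is rejected.
LEGS = {"transcribe", "translate", "converse"}


class CallCostLedger:
    """Accumulates USD cost for a single call, keyed by model leg."""

    def __init__(self) -> None:
        self._by_leg = defaultdict(lambda: Decimal("0"))

    def record(self, leg: str, usd: Decimal) -> None:
        if leg not in LEGS:
            raise ValueError(f"unknown leg: {leg!r}")
        self._by_leg[leg] += usd

    def breakdown(self) -> dict:
        """Cost per leg, the number you actually want on a dashboard."""
        return dict(self._by_leg)

    def total(self) -> Decimal:
        """Blended cost per call, useful only alongside the breakdown."""
        return sum(self._by_leg.values(), Decimal("0"))


ledger = CallCostLedger()
ledger.record("transcribe", Decimal("0.17"))
ledger.record("translate", Decimal("0.34"))
print(ledger.breakdown())
print(ledger.total())
```

Keeping the breakdown as the primary output, with the total derived from it, means the blended number can never drift from the per-leg figures.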
Where this sits in the wider stack
Voice is becoming one more agentic surface, and it slots in next to the others rather than replacing them. The same architectural questions builders have been working through for coding agents and browser agents — how to scope a tool, how to bill it, where the model should reason versus where it should just execute — now apply to audio. If you have been following how those debates are playing out in agentic coding, the parallels are direct; see our take on the Cursor Composer 2.5 and the Cursor SDK shift. And on the protocol side, the move toward open, interoperable agent surfaces — the kind of thing the WebMCP browser-agent protocol is pushing for — is the backdrop against which a production voice agent will eventually need to talk to your other systems.
For now, the practical advice is unglamorous. Confirm your target languages are on the lists — 70-plus in, 13 out. Run a small pilot on real accents and real code-switched speech from your actual users, not on clean studio audio. Instrument cost per call by model leg. And resist the pull of routing everything through the reasoning model: the streaming transcriber and the translator exist precisely so you do not have to.
So — should you build on this now?
For most teams shipping voice, yes — with eyes open on the billing. The split into three purpose-built models is genuinely useful, the Realtime API is now GA rather than a beta you have to caveat to your stakeholders, and the multilingual reach finally makes a single audio pipeline viable for the polyglot reality of Indian support lines and UK-into-Europe sales calls alike.
- Build now if you have a clear single job — captions, transcripts, or a defined conversational flow — and you can price it on real audio length.
- Pilot first if your users speak in heavy regional accents or code-switch constantly; validate recognition on your own audio before committing.
- Model the bill carefully if your flow leans on GPT-Realtime-2 — per-token audio pricing is the variable that will surprise you, and caching the system prompt is the first thing to switch on.
Full details and the model card are in the OpenAI announcement.