What the raise actually signals
- A billion calls is a usage number, not a demo number. Vapi crossed one billion platform calls before raising — evidence that voice agents are running in production, not just on stage.
- $50M Series B, May 2026. Announced on 12 May and reported as led by Peak XV, the round funds the unglamorous infrastructure layer rather than a single flashy model.
- Three shipped features tell the strategy. Squads v2 (multi-assistant orchestration), Composer (prompt-built agents) and Simulations (systematic testing) move the platform from "wire it yourself" toward "describe it, then prove it works".
- The opportunity is vertical. The defensible business for most builders is not the platform — it is a specific voice agent for clinics, collections, logistics or reception, sold into India's contact-centre base or the UK's regulated sectors.
Voice agents are the defining product category of 2026. OpenAI shipped a real-time audio model trio aimed squarely at voice agents, and the orchestration layer beneath those models is now where serious money is moving. Vapi's $50M round is the clearest signal yet that real-time voice has crossed from novelty into production infrastructure — and that there is a buildable, sellable opportunity sitting on top of it for AI builders in both India and the UK.
What Vapi is — and why the plumbing is hard
Vapi is orchestration infrastructure for voice agents. A production phone agent is not one model; it is a real-time pipeline. Audio arrives over telephony, a speech-to-text engine turns it into text, a large language model decides what to say, a text-to-speech engine speaks the reply, and the whole loop has to close in well under a second so the conversation feels human. Vapi stitches those layers together and handles the parts that look trivial in a demo and brutal in production: turn-taking, knowing when the caller has finished speaking, interruption handling when they talk over the agent, and graceful recovery when the network jitters.
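To make "knowing when the caller has finished speaking" concrete, here is a minimal sketch of silence-based endpointing. It assumes the audio has already been cut into fixed-length frames and reduced to one energy value per frame. The frame length, both thresholds and the function name are illustrative assumptions, not Vapi's implementation; production systems typically rely on trained voice-activity and turn-detection models rather than a fixed energy cut-off.

```python
from typing import Iterable, Optional

FRAME_MS = 20            # assumed duration of one audio frame
ENERGY_THRESHOLD = 0.02  # assumed: frames at or above this count as speech
END_SILENCE_MS = 600     # assumed: trailing silence that closes a turn


def find_end_of_turn(energies: Iterable[float]) -> Optional[int]:
    """Return the time in ms at which the turn is judged finished, else None.

    A turn ends once some speech has been heard and is then followed by
    END_SILENCE_MS of uninterrupted silence.
    """
    heard_speech = False
    silence_ms = 0
    for index, energy in enumerate(energies):
        if energy >= ENERGY_THRESHOLD:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= END_SILENCE_MS:
                return (index + 1) * FRAME_MS
    return None
```

The design tension sits in `END_SILENCE_MS`: a shorter value makes the agent feel snappier but cuts off callers who pause mid-sentence, while a longer value is safer but spends more of the sub-second response window on waiting.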
That plumbing is hard for reasons that have nothing to do with model quality. A text chatbot can take two seconds to think and nobody minds. A voice agent that pauses for two seconds sounds broken. Builders who have tried to assemble speech-to-text, an LLM and text-to-speech by hand discover that the latency budget is unforgiving — every hop adds milliseconds, and the model is only one of five places where time leaks. This is precisely why an orchestration layer is a fundable business: it absorbs the real-time complexity so the builder on top can think about the conversation, not the pipeline.
The three May releases sharpen that pitch. Squads v2 is a visual builder for multi-assistant orchestration — handoffs between specialised agents, so a booking agent can pass cleanly to a billing agent mid-call. Composer, in alpha, lets you describe an agent in plain language and have the platform write the prompt, configure tools, provision a phone number and wire integrations. And Simulations, also alpha, brings systematic, AI-powered testing of voice agents. Of the three, the testing tool is the one that matters most.
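The handoff idea behind multi-assistant orchestration can be sketched in a few lines. This is a hypothetical illustration, not the Squads API: the `CallState` class, the `HANDOFFS` table and the intent labels are all assumptions made for the example. The point it shows is that a handoff switches the active assistant while the call context travels with it, so the caller does not have to repeat themselves.

```python
from dataclasses import dataclass, field


@dataclass
class CallState:
    """Context that must survive a handoff between assistants."""
    active_assistant: str
    facts: dict[str, str] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)


# Hypothetical routing table: (current assistant, detected intent) -> next assistant.
HANDOFFS = {
    ("booking", "billing_question"): "billing",
    ("billing", "reschedule"): "booking",
}


def route(state: CallState, intent: str) -> CallState:
    """Switch assistants when a handoff rule matches; otherwise stay put."""
    target = HANDOFFS.get((state.active_assistant, intent))
    if target is None:
        return state
    state.history.append(f"{state.active_assistant} -> {target} ({intent})")
    state.active_assistant = target
    return state


state = CallState(active_assistant="booking", facts={"caller_name": "Asha"})
state = route(state, "billing_question")
```

After the call to `route`, the billing assistant is active and `state.facts` still holds what the booking assistant collected.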
Why Simulations is the quiet headline
Evaluation is the hard, unglamorous bottleneck for voice agents, and almost nobody markets it. The problem is structural: a voice agent fails in ways a text agent does not. It mishears an accent, talks over the caller, loops on a misunderstanding, or confidently books the wrong appointment. You cannot catch those failure modes by reading a transcript, and you certainly cannot manually dial a thousand call variations the night before launch.
Simulations attacks exactly that gap. Instead of hand-testing, you define scenarios — an irate customer, a heavy regional accent, a caller who interrupts constantly, a number that does not exist in your system — and the platform runs them at volume and scores the outcomes. That turns a slow, subjective manual gate into an automated, repeatable one. For any builder serious about shipping a voice agent into a regulated environment, this is the feature that decides whether the thing is safe to put on a real phone line.
Treat your eval set as the product. Before you tune a single prompt, write twenty adversarial call scenarios — angry callers, code-mixed Hindi-English, a thick Glaswegian accent, silence, someone reciting a long account number. The agent that handles your worst twenty calls ships; the one that only handles the happy path does not. Whichever platform you choose, the eval harness is your moat, not the prompt.
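A scenario suite of this kind can start very small. The sketch below is a platform-agnostic, text-level harness and makes no claim about how Vapi's Simulations works: `Scenario`, `run_suite` and `toy_agent` are hypothetical names, and `toy_agent` is a stand-in you would replace with a call into your own stack. It only checks replies to text turns, so accents, interruptions and audio-level failures still need testing with real or synthesised audio.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    caller_turns: tuple[str, ...]  # what the simulated caller says, in order
    must_include: str              # phrase the final reply must contain to pass


def toy_agent(utterance: str) -> str:
    """Stand-in for a real voice agent; replace with a call to your own stack."""
    text = utterance.lower()
    if "account" in text:
        return "Let me read that account number back to you."
    if not text.strip():
        return "Are you still there?"
    return "I can help you book an appointment."


def run_suite(agent: Callable[[str], str], scenarios: list[Scenario]) -> dict[str, bool]:
    """Run every scenario and record pass/fail on the agent's final reply."""
    results = {}
    for scenario in scenarios:
        reply = ""
        for turn in scenario.caller_turns:
            reply = agent(turn)
        results[scenario.name] = scenario.must_include.lower() in reply.lower()
    return results


SCENARIOS = [
    Scenario("silence", ("",), "still there"),
    Scenario("long_account_number", ("My account is 4471 0092 3381",),
             "read that account number back"),
    Scenario("angry_caller", ("This is the third time I have called!",), "sorry"),
]

results = run_suite(toy_agent, SCENARIOS)
pass_rate = sum(results.values()) / len(results)
```

Here the toy agent passes the silence and account-number scenarios and fails the angry caller, because it never apologises. A failing adversarial case like that is exactly what the suite exists to surface before launch.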
The competitive stack and who owns what
Voice agents are a layered stack, and the field is crowded at every level. Understanding which layer you are competing in — and which you should simply buy — is the first strategic decision a builder makes. The table below maps the layers and the players who dominate each.
| Layer | What it does | Representative players |
|---|---|---|
| Telephony | Carries the call — PSTN, SIP, the phone number itself | Twilio, Telnyx, Plivo, regional carriers |
| Speech-to-text (STT) | Turns caller audio into text in real time | Deepgram, AssemblyAI, OpenAI, Sarvam (Indic) |
| LLM | Decides what the agent says and does | OpenAI, Anthropic, Google, open-weight models |
| Text-to-speech (TTS) | Speaks the reply with natural prosody | ElevenLabs, Cartesia, PlayHT |
| Orchestration | Ties it all together — turn-taking, interruptions, latency, tools | Vapi, Retell, Bland, LiveKit, OpenAI Realtime |
The orchestration row is where the gold rush is loudest. Retell and Bland compete directly with Vapi on developer-friendly agent building; LiveKit owns the open real-time media transport that several others build on; and OpenAI's Realtime API collapses STT, LLM and TTS into a single speech-to-speech model, which threatens to swallow the middle of the stack. For builders, the lesson is to compete on the layer you can defend and rent the rest. Almost nobody should be building their own telephony or TTS.
The India angle — language and the contact-centre base
India is the most interesting voice-agent market in the world, for two reasons. The first is language. A genuinely useful Indian voice agent has to handle Indic languages and, harder still, code-mixing: the Hindi-English-Tamil blend that real callers actually speak. Off-the-shelf English STT mangles this, which is why an Indic-first speech stack, built on the kind of multilingual audio models now shipping, matters so much. The builder who solves code-mixed recognition for a specific domain owns a moat that a generic global platform cannot easily copy.
The second is the contact-centre base. India runs a vast BPO and customer-support industry, and voice agents both disrupt and augment it. The honest framing is augmentation: agents take the repetitive, high-volume calls — order status, balance checks, appointment booking — and escalate the rest to humans. That is a near-bottomless supply of buildable, sellable workflows. Vapi itself lists collections, candidate screening and IVR navigation among its use cases, and every one of those maps onto an Indian enterprise that already does it at scale with people.
The UK angle — regulation as the moat
The UK opportunity looks different, because the binding constraint is compliance, not language. FCA-regulated financial communications carry strict rules on disclosure and record-keeping; a voice agent that discusses a loan or a missed payment has to be auditable and demonstrably fair. NHS and public-sector contact lines handle sensitive health and welfare data under tight scrutiny. And across every sector, GDPR plus PECR govern call recording and consent.
That regulation is not a reason to avoid the UK market — it is the moat. A builder who ships a voice agent that is compliant by construction, with consent capture, AI disclosure and an audit trail baked in, has something a generic platform does not. The same discipline that slows you down is exactly what an FCA-regulated lender or an NHS trust will pay for. We have watched this pattern across the agent stack: the scramble to acquire agent infrastructure is ultimately a scramble for trust and governance, and voice is no exception.
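"Compliant by construction" can be read as a structural property of the code: sensitive steps are gated on disclosure and consent, and every decision leaves a record. The sketch below illustrates that shape only. The `CallSession` class, its method names and the gating policy are assumptions for the example; what a given regulator actually requires is a question for a qualified adviser, not something this code settles.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CallSession:
    call_id: str
    ai_disclosed: bool = False
    recording_consent: bool = False
    audit_log: list[str] = field(default_factory=list)

    def log(self, event: str, **details: object) -> None:
        """Append a timestamped JSON line to the in-memory audit trail."""
        entry = {"ts": datetime.now(timezone.utc).isoformat(),
                 "call_id": self.call_id, "event": event, **details}
        self.audit_log.append(json.dumps(entry, sort_keys=True))

    def disclose_ai(self) -> None:
        """Record that the caller was told they are speaking to an AI."""
        self.ai_disclosed = True
        self.log("ai_disclosure_played")

    def record_consent(self, granted: bool) -> None:
        """Record the caller's answer to the recording-consent prompt."""
        self.recording_consent = granted
        self.log("recording_consent", granted=granted)

    def may_discuss_account(self) -> bool:
        """Gate sensitive topics on disclosure and consent, logging the decision."""
        allowed = self.ai_disclosed and self.recording_consent
        self.log("account_gate", allowed=allowed)
        return allowed
```

In use, a session that calls `may_discuss_account()` before `disclose_ai()` and `record_consent(True)` is refused, and both the refusal and the later approval appear in `audit_log`. A real deployment would write that trail to durable, tamper-evident storage rather than a list in memory.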
Two things will sink a voice agent regardless of how good the model is. First, recording and consent: in the UK, GDPR and PECR require a lawful basis and clear notice before you record, and many jurisdictions require disclosure that the caller is talking to an AI; in India, the DPDP framework governs consent and personal data. Get written sign-off from a data-protection adviser before you go live. Second, the latency budget: aim for under 800ms end-to-end response, because every layer — telephony, STT, LLM, TTS — eats into it, and a sluggish agent feels broken no matter how clever its answers are.
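The latency budget is simple arithmetic, and it is worth writing down before choosing vendors. The sketch below uses the 800ms target from the paragraph above; the per-stage numbers are hypothetical placeholders, not measurements of any provider, and should be replaced with figures from your own traces.

```python
BUDGET_MS = 800  # end-to-end response target from the text

# Hypothetical per-stage latencies in milliseconds; measure your own.
STAGE_MS = {
    "telephony": 120,
    "stt": 200,
    "llm_first_token": 350,
    "tts_first_audio": 180,
}


def budget_report(stages: dict[str, int], budget_ms: int) -> tuple[int, int, str]:
    """Return (total latency, remaining headroom, slowest stage)."""
    total = sum(stages.values())
    slowest = max(stages, key=stages.get)
    return total, budget_ms - total, slowest


total, headroom, slowest = budget_report(STAGE_MS, BUDGET_MS)
```

With these placeholder figures the stages sum to 850ms, 50ms over budget, and the LLM's time to first token is the largest single contributor. Four individually reasonable stages can still add up to an agent that feels slow.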
The builder opportunity: where the moats actually are
The platforms are funded and fighting over the orchestration layer, so the durable opportunity for an independent builder is vertical. A horizontal voice platform is a venture-scale bet that needs a Vapi-sized cheque; a vertical voice agent is a business one or two engineers can ship and sell this quarter. The buildable niches are concrete: a receptionist agent for dental and physiotherapy clinics, a logistics agent that chases delivery confirmations, a collections agent that handles polite payment reminders, an after-hours agent for tradespeople who lose work because they cannot answer the phone.
The moats in a vertical agent are not the model — anyone can call the same LLM. The moats are three: data (the domain-specific scripts, objection patterns and edge cases you accumulate from real calls), integrations (the deep wiring into a clinic's booking system or a lender's CRM that a generalist will not bother to build), and evaluation (the adversarial test suite that proves your agent is safe in your domain). Notice that two of those three are the unglamorous parts. The funding flowing into voice — Vapi's $50M is part of a broader wave that includes raises such as Cognition's $1B round in the adjacent agent space — is a signal to build on top, not a signal that the category is taken. The category is barely started.
The practical move for a builder in Mumbai or Manchester is the same: pick one painful, repetitive, high-volume call type in one industry you understand, rent the stack (telephony, STT, LLM and TTS from the table above, orchestration from Vapi or a rival), and pour your effort into the data, the integration and the eval harness. Primary details on the raise are in Vapi's funding announcement, and the new features are documented in Vapi's changelog.