The round: $107M, NVIDIA, Samsung Next, and the imo co-founder

On 4 May 2026, DeepInfra announced the close of a $107M Series B round, one of the largest rounds dedicated to inference infrastructure since the open-weight boom began in earnest. The round was led by 500 Global and Georges Harik, a former Google engineering director, co-developer of AdSense, and co-founder of imo messenger. That last detail is significant: DeepInfra was built by the team Harik helped assemble at imo, which gives the round an unusual founder-backed-by-original-founder dynamic.

The investor syndicate reads like a who's-who of strategic AI capital: A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90 all participated. The presence of NVIDIA and Supermicro, companies whose business depends on GPU utilisation, signals that hardware vendors are betting on specialised inference clouds rather than assuming the hyperscalers' vertically integrated clouds will capture all the value.

DeepInfra raised the round after tripling revenue since the start of 2026 and growing its processed token volume 25× since its Series A. The company now handles approximately five trillion tokens per week across 190+ open-source and proprietary models.

Who is DeepInfra? The imo messenger lineage

DeepInfra was founded in 2022 by a team whose background is in consumer-scale messaging infrastructure — specifically, the team behind imo messenger, which scaled to more than 200 million users globally. That experience building low-latency, high-throughput communication infrastructure translates directly to inference: both problems require serving many concurrent requests cheaply, with tight latency budgets, under unpredictable demand spikes.

The founders carried that operational intuition into GPU infrastructure. Rather than renting cloud capacity and reselling it with a thin API layer — the model many inference aggregators use — DeepInfra owns its GPU infrastructure across eight US data centres. That vertical ownership matters: it means the company controls its own cost structure and can optimise kernel-level performance rather than working around a hyperscaler's scheduling abstractions.

The company has an early-collaborator relationship with NVIDIA that predates the Series B: DeepInfra supports Nemotron models and NVIDIA Dynamo inference software, positioning it as part of NVIDIA's preferred open-AI-ecosystem partner programme. The Series B investment from NVIDIA cements that alignment strategically.

The growth story: 5 trillion tokens per week and 25× scale

The headline metric of five trillion tokens per week deserves unpacking. Spread over the 604,800 seconds in a week, that is about 8.3 million tokens per second. At roughly 750 tokens per second per H100 GPU, it is the approximate equivalent output of some 11,000 H100s running flat-out, every hour of the day, seven days a week. Even accounting for the efficiency improvements that purpose-built inference stacks deliver, that is a serious infrastructure footprint for a company founded in 2022.
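The GPU-equivalence estimate is simple enough to check by hand, and the sketch below does exactly that. It takes the weekly volume and the 750 tokens-per-second-per-H100 figure quoted above as inputs; the function name is ours, and the result assumes 100% utilisation with no batching gains, so treat it as a rough upper-level sanity check rather than a statement about DeepInfra's actual fleet.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800


def equivalent_gpus(tokens_per_week: float, tokens_per_second_per_gpu: float) -> float:
    """GPUs needed to sustain a weekly token volume at 100% utilisation."""
    aggregate_tokens_per_second = tokens_per_week / SECONDS_PER_WEEK
    return aggregate_tokens_per_second / tokens_per_second_per_gpu


# Figures quoted in the article: 5 trillion tokens/week, ~750 tokens/s per H100.
gpus = equivalent_gpus(5e12, 750)
print(f"{gpus:,.0f} H100-equivalents")  # prints "11,023 H100-equivalents"
```

Changing the per-GPU throughput is the quickest way to see how sensitive the estimate is: doubling it to 1,500 tokens per second halves the fleet size.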

The 25× growth since Series A is equally telling. It implies that the open-weight adoption curve has been steeper than most forecasts predicted eighteen months ago. DeepSeek V3 and V4, Llama 3 and 4, Qwen 2.5 and 3, and Mistral variants together drove an explosion in developer demand for model APIs that were not controlled by a single lab. DeepInfra was positioned to capture that demand because it already supported a broad model catalogue rather than betting on a single frontier model.

Revenue tripling since January 2026 correlates with two market forces: falling GPU acquisition costs (see our H100 GPU price guide for context) and rising demand from agentic workflows. Agentic applications — multi-step reasoning pipelines, RAG architectures, tool-calling loops — generate far more tokens per end-user session than chat interfaces. A single agentic task can consume 50,000–500,000 tokens where a chat session consumes 2,000. That multiplication effect is driving volume growth faster than user-count growth alone explains.

Pro tip

If your agentic pipeline calls a model more than 100 times per user session, a dedicated inference provider like DeepInfra will almost certainly be cheaper than a frontier-model API — even before accounting for model quality differences. Run the per-token maths before assuming a proprietary model is worth the premium.
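To make "run the per-token maths" concrete, here is a minimal sketch. The session sizes are the ones quoted above (2,000 tokens for chat, 50,000 to 500,000 for an agentic task) and the $15 frontier price is the low end of the range cited later in this article. The $0.50 open-weight price is a hypothetical placeholder, not a published DeepInfra rate, and the calculation simplifies by pricing every token at the output rate.

```python
def session_cost_usd(tokens_per_session: int, usd_per_million_tokens: float) -> float:
    """Cost of one session, pricing every token at a single per-million rate."""
    return tokens_per_session / 1_000_000 * usd_per_million_tokens


# Session sizes quoted in the article.
SESSION_TOKENS = {
    "chat": 2_000,
    "agentic (low)": 50_000,
    "agentic (high)": 500_000,
}

# USD per million tokens. The open-weight figure is a HYPOTHETICAL placeholder;
# substitute the published rates for the models you are actually comparing.
PRICES = {
    "open-weight API": 0.50,
    "frontier API": 15.00,
}

for session, tokens in SESSION_TOKENS.items():
    for provider, price in PRICES.items():
        cost = session_cost_usd(tokens, price)
        print(f"{session:15s} {provider:16s} ${cost:.4f}")
```

Multiply the per-session figure by your daily session count before deciding; a gap that looks negligible for chat becomes the dominant line item for agentic workloads.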

Technical architecture: what makes an inference-native cloud different

General-purpose clouds (AWS, GCP, Azure) sell compute and wrap AI in managed services (Bedrock, Vertex AI, Azure OpenAI Service). Those wrappers are designed for breadth: they integrate billing, IAM, VPC networking, compliance certifications, and dozens of other enterprise concerns. That breadth is genuinely valuable for large enterprises with existing cloud commitments, but it comes at a cost lower in the stack and on the bill: scheduling conflicts between training and inference jobs, CUDA driver versions that lag the latest performance-critical updates, and pricing built around GPU-hours or bundled cross-service charges rather than tokens alone.

An inference-native cloud strips those layers out. DeepInfra's stack is built around a single objective: serve tokens at the lowest possible latency and cost per output token. That means custom CUDA kernels tuned for each model architecture, speculative decoding where applicable, continuous batching rather than static batching, and GPU scheduling that never has to compete with training workloads for the same physical machines.
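The difference between static and continuous batching is easiest to see in a toy model. The sketch below counts decode steps for a queue of requests with known output lengths. It is an illustration of the general technique only: it ignores prefill, memory limits, and everything else a real scheduler handles, it assumes every output length is at least one token, and it says nothing about how DeepInfra's own scheduler works. The function names are ours.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; short requests wait."""
    total_steps = 0
    for start in range(0, len(output_lengths), batch_size):
        total_steps += max(output_lengths[start:start + batch_size])
    return total_steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """A slot freed by a finished request is refilled before the next step."""
    queue = deque(output_lengths)
    active: list[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step emits one token for every running request.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Two long requests mixed in with six short ones, four slots available.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lengths, 4))      # 20
print(continuous_batching_steps(lengths, 4))  # 12
```

In the static case the three short requests in each batch sit idle for eight steps while the long one finishes. Continuous batching lets the second long request start at step three, which is where the saving comes from.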

The 190+ model catalogue is both a product and a technical achievement. Running that many models efficiently requires intelligent model sharding, KV-cache management at scale, and the ability to load and unload model weights quickly as demand shifts between models. DeepInfra's NVIDIA Dynamo integration gives it access to inference software optimised at the driver and runtime level — a material advantage over teams running vanilla vLLM on rented cloud GPUs.

For builders working on RAG applications, the architectural benefit is concrete: lower first-token latency reduces the perceived response time of retrieval-augmented answers, where the bottleneck is often the model call rather than the vector search. For teams running cost-sensitive inference workloads, the per-token pricing model means you pay for what you use rather than for GPU-hours whether the GPU is busy or idle.
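The per-token versus GPU-hour trade-off comes down to utilisation, and a short calculation shows where the break-even sits. The sketch below reuses the roughly 750 tokens per second per H100 quoted earlier; the $2.00 GPU-hour rate and $1.00-per-million API price are hypothetical placeholders, not quotes from any provider, and the function names are ours.

```python
def dedicated_cost_per_million(gpu_hour_usd: float,
                               tokens_per_second: float,
                               utilisation: float) -> float:
    """Effective USD per million tokens on a GPU billed by the hour."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hour_usd / tokens_per_hour * 1_000_000


def break_even_utilisation(gpu_hour_usd: float,
                           tokens_per_second: float,
                           api_usd_per_million: float) -> float:
    """Utilisation above which the hourly GPU beats the per-token API."""
    full_rate_tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * 1_000_000 / (full_rate_tokens_per_hour * api_usd_per_million)


# HYPOTHETICAL prices; 750 tokens/s is the per-H100 figure used in this article.
GPU_HOUR_USD = 2.00
API_USD_PER_MILLION = 1.00
THROUGHPUT = 750

print(f"{break_even_utilisation(GPU_HOUR_USD, THROUGHPUT, API_USD_PER_MILLION):.2f}")
for utilisation in (0.25, 0.50, 1.00):
    cost = dedicated_cost_per_million(GPU_HOUR_USD, THROUGHPUT, utilisation)
    print(f"utilisation {utilisation:.0%}: ${cost:.2f} per million tokens")
```

With these placeholder numbers the GPU has to stay about 74% busy around the clock before hourly billing wins. A result above 1.0 means the API is cheaper at any utilisation.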

Watch out

Purpose-built inference clouds can have cold-start latency spikes for less-popular models that are not kept warm in memory. Before committing production traffic to any inference provider, benchmark first-token latency under realistic concurrency for your specific model — not just the median from a synthetic benchmark. Cold-start penalties of 2–5 seconds are not uncommon on long-tail models at off-peak hours.
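The reason the median is not enough shows up as soon as you summarise a benchmark run by tail percentiles. The sketch below works on a list of first-token latencies you have already collected; how you collect them depends on your provider's API and load-testing tool, so that part is left out. The sample values are invented for illustration, and the function names are ours.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise_ttft(samples_ms: list[float]) -> dict[str, float]:
    """Median and tail time-to-first-token for one benchmark run."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# INVENTED data: 18 warm requests at 300 ms, 2 cold starts at 3,000 ms.
samples = [300.0] * 18 + [3000.0] * 2
print(summarise_ttft(samples))
```

Here the median reports a healthy 300 ms while p95 already sits at the three-second cold-start figure, which is the number your slowest users actually experience.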

Inference provider comparison: DeepInfra, AWS Bedrock, CoreWeave, RunPod

Choosing an inference provider is a decision that compounds over time — switching costs are non-trivial once your prompt templates, rate-limit handling, and billing integrations are baked in. The table below summarises the key differentiators as of May 2026.

| Provider | Provider type | Open-weight support | Pricing structure | Best for | Watch out for |
| --- | --- | --- | --- | --- | --- |
| DeepInfra | Inference-only cloud | 190+ models (Llama, DeepSeek, Qwen, Mistral, Nemotron…) | Per output token | High-volume open-weight inference; agentic workloads; RAG | No training; US data centres only (expanding); cold starts on long-tail models |
| AWS Bedrock | Managed AI service | Limited (Claude, Titan, Llama 3 via Bedrock); proprietary bias | Per token + cross-service costs | Enterprises with existing AWS commitments; compliance-heavy use cases | Model selection lag; pricing complexity; locked into AWS ecosystem |
| CoreWeave | GPU cloud (training + inference) | Bring your own model | Per GPU-hour | Large-scale training; custom model serving; dedicated capacity | Requires MLOps expertise; higher operational overhead; GPU-hour billing regardless of utilisation |
| RunPod | GPU marketplace | Serverless endpoints for popular open-weight models | Per GPU-hour (spot or reserved) | Dev/test; cost-sensitive experimentation; hobbyist and research workloads | Variable availability; no enterprise SLA; cold starts can be significant on serverless tier |

The table illustrates the fork in the market: general-purpose GPU clouds (CoreWeave, RunPod) give you maximum flexibility but require you to build and operate the serving stack yourself. Managed AI services (AWS Bedrock) give you enterprise integration but constrain your model choice. Purpose-built inference clouds (DeepInfra) sit in the middle — broad model choice, no serving-stack overhead, per-token economics — and are winning developers who want to ship quickly with open-weight models. For more on the underlying GPU economics, see our piece on NVIDIA B300 inference economics.

What builders should consider when choosing inference infrastructure

The right inference provider depends on three variables: model flexibility requirements, volume profile, and compliance constraints. Here is a framework for the decision.

Model flexibility: If your roadmap requires switching models frequently — chasing the best cost-per-quality frontier as open-weight releases accelerate — a broad-catalogue provider like DeepInfra is defensible. If you are committed to a single proprietary model, a managed service from that lab's preferred cloud partner may offer better integration. The open-weight model roundup shows how quickly the quality frontier is moving.

Volume profile: Below roughly 10 million tokens per day, the overhead of running your own serving stack outweighs the cost savings. At that volume, a per-token API is almost always cheaper than a per-GPU-hour arrangement once you account for idle time, cold starts, and engineering hours. Above 500 million tokens per day, dedicated GPU capacity with a negotiated rate typically undercuts any per-token API. DeepInfra targets the range in between, from 10 million to several billion tokens per day, where managed inference delivers the best price-performance. Because the top of that range overlaps the point where dedicated capacity becomes competitive, price both options once you pass roughly 500 million tokens per day.
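The volume thresholds above can be written down as a first-pass rule of thumb. The sketch below encodes only the two rough cut-offs from this section (10 million and 500 million tokens per day); the function name and the wording of the suggestions are ours, and the output is a starting point for a pricing conversation, not a recommendation.

```python
PER_TOKEN_CEILING = 10_000_000   # below this, self-hosting overhead dominates
DEDICATED_FLOOR = 500_000_000    # above this, negotiated dedicated capacity competes


def suggest_starting_point(tokens_per_day: int) -> str:
    """Map daily token volume to the option worth pricing first."""
    if tokens_per_day < 0:
        raise ValueError("tokens_per_day cannot be negative")
    if tokens_per_day < PER_TOKEN_CEILING:
        return "per-token API; running your own serving stack is not worth it"
    if tokens_per_day <= DEDICATED_FLOOR:
        return "managed inference cloud with per-token pricing"
    return "get a dedicated-capacity quote and compare it with per-token pricing"


for volume in (2_000_000, 80_000_000, 3_000_000_000):
    print(f"{volume:>13,} tokens/day -> {suggest_starting_point(volume)}")
```

Model flexibility and compliance constraints still apply on top of this; volume only narrows the shortlist.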

Compliance constraints: Data residency requirements — particularly relevant for GDPR in Europe and sector-specific regulations in financial services and healthcare — may restrict which providers you can use. DeepInfra's current infrastructure is US-based; the company has signalled global expansion with the Series B proceeds, but teams with strict EU data residency requirements should verify before migrating production workloads.

The India angle: why open-weight inference clouds matter for Indian AI builders

India's AI developer community has a structural preference for open-weight models that is not merely ideological — it is economic. Proprietary frontier APIs (GPT-4o, Claude Opus, Gemini Ultra) price in USD, and at current exchange rates a rupee-denominated startup paying $15–$30 per million output tokens faces a cost structure that is materially harder to sustain than a US competitor paying the same nominal rate from a stronger balance sheet.

Open-weight models running on infrastructure like DeepInfra change that equation. A production-quality Llama 3.1 70B or DeepSeek V3 call at DeepInfra's published rates costs a fraction of a GPT-4-class call — and for a large class of Indian enterprise use cases (document processing in regional languages, classification, summarisation, structured extraction), the quality gap between open-weight and proprietary frontier models has narrowed to the point where the cost difference is the decision.

The 25× token volume growth DeepInfra reports is likely driven in part by Indian and South-East Asian developer adoption, where cost-sensitivity pushes teams towards open-weight models more strongly than in the US or UK markets. As DeepInfra expands globally with Series B capital, an Asia-Pacific data centre presence would be transformative for latency-sensitive Indian workloads: currently, routing through US data centres adds 150–250ms to first-token latency for users in Bengaluru or Mumbai. For more on managing inference costs in production, see our guide on LLM cost reduction via prompt and semantic KV cache.

The UK angle: compliance-aware inference for UK AI builders

UK AI startups face a compliance landscape that is both more mature and more complex than it was two years ago. The UK Frontier AI Bill — still in committee as of May 2026 — is expected to introduce reporting obligations for high-throughput AI deployments, and the existing UK GDPR framework already imposes data residency considerations on any service processing personal data of UK residents.

For UK builders, the practical question when evaluating a provider like DeepInfra is not whether the API is capable — it clearly is — but whether the data-processing agreement is adequate for UK GDPR purposes, and whether the US data centre footprint creates complications for regulated sectors (financial services, healthcare, legal). DeepInfra's NVIDIA and Samsung Next backing gives it enterprise credibility, but UK teams in regulated industries will need to complete their own DPA review before onboarding.

On the positive side, DeepInfra's support for EU-AI-Act-compliant model variants (particularly through Nemotron and other commercially-licensed open-weight models) means that UK teams building products for EU customers can access models that have documented model cards, responsible-use licences, and audit trails — requirements that are increasingly relevant under the Act's General-Purpose AI provisions. As UK builders access global inference infrastructure at improving latency, the gap between London and San Francisco in terms of AI development economics is narrowing meaningfully.

What the Series B signals about the inference economy

The deeper story behind DeepInfra's round is not about one company — it is about a structural shift in where AI infrastructure value accrues. Training dominated the first wave of AI infrastructure investment: large compute clusters, specialised networking, custom silicon. The assumption was that whoever controlled training compute controlled the value chain.

The open-weight revolution complicated that assumption. When model weights are freely available, the training moat erodes. The infrastructure that creates durable value is the one that serves those weights efficiently across billions of API calls. DeepInfra processing five trillion tokens per week with 25× growth since Series A, CoreWeave reporting that inference is growing faster than training as a share of revenue, and analyst forecasts that the current 70:30 training-to-inference split will invert by the end of 2026 all point in the same direction.

Inference is a different optimisation problem from training. Training is throughput-bound: you want to maximise GPU utilisation over long runs. Inference is latency-bound: you want to minimise time-to-first-token and time-per-output-token under variable concurrency. The business models are different too: training is sold by GPU-hour; inference is sold by token. Building a company that is excellent at inference requires different choices at every layer — from the CUDA kernels up to the pricing page — than building a company excellent at training.

DeepInfra's Series B is venture capital's acknowledgement that those are not the same company, and that the inference layer is worth backing as a distinct infrastructure category. The falling cost of AI inference is making it possible to build profitable products on top of open-weight models at scale — and DeepInfra is positioning itself as the infrastructure that enables that category.