The four numbers that should organise every enterprise agent RFP this quarter

The Q1 2026 enterprise data is finally settled enough to be useful, and the four numbers below are the ones that change how a buyer should read every agent pitch deck that lands on the desk. They matter as much to a system integrator in Bangalore advising a Singaporean bank on a procurement shortlist as they do to a UK FTSE buyer running an internal AI review for a regulated insurance line.

  • Per Q1 2026 enterprise data points (DigitalApplied, Kili Technology, Mem0 research consensus), 80% of enterprise applications shipped or updated in Q1 2026 embed at least one AI agent. Saturation at the build layer.
  • 31% of enterprises have at least one AI agent actually running in production. A 49-point gap between intent and live workload.
  • 5.1 months median time-to-value on agent deployments. Not weeks — months.
  • 37% gap between lab benchmark scores and real-world deployment performance for enterprise agentic AI systems. The RFP-vs-reality delta.
  • 50× cost variation across competing systems hitting comparable accuracy on the same task. Same answer, very different bill.

Pro tip

If a vendor's pitch quotes a single benchmark score with no mention of cost-per-task, latency budget or human-review overhead, you are looking at the lab number. Ask for the same accuracy figure measured on your own held-out test set under your production rate limits — that is the only number that survives procurement.

Why "embedded in an app" is not the same as "in production"

The gap between 80% and 31% is the single most misread number of the year. It is tempting to read the 80% figure and conclude that agents have crossed the chasm. They have not. Embedding an agent is a build-time decision a product team can make in a sprint; running an agent in production requires an evaluation rubric, an incident response runbook, a cost ceiling, a fallback path, a human-review queue, observability that captures tool-call traces, and a sign-off from whoever owns regulatory risk.

That second list is where most pilots stall. A team in Pune adds a Claude-powered support agent to its SaaS dashboard and ticks the "embeds an agent" box. The same team does not put that agent on the live customer queue until it has solved three problems: how to detect when the agent is confidently wrong, how to escalate to a human without losing context, and how to keep the per-conversation cost under a defensible threshold. Those three problems are the 49-point gap. They are also, not coincidentally, what the buyer-side rubric needs to test for.

For a procurement team that is being asked to choose a vendor, the takeaway is simple: any vendor whose case studies are all "embedded" rather than "in production" is selling you a feature flag, not a live workload. Push for a named reference whose deployment has been carrying real traffic for at least 90 days, and ask what their fallback rate, escalation rate and per-task cost actually look like under load. A vendor that cannot answer those three questions in detail has not actually solved the hard part yet.
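The three reference questions above reduce to simple ratios over a conversation log. The sketch below is a minimal illustration, assuming the reference customer can export one record per conversation; the `Conversation` fields and the sample values are hypothetical and not taken from any vendor's schema.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    used_fallback: bool  # agent dropped to a scripted or non-agent path
    escalated: bool      # conversation was handed to a human
    cost: float          # total spend for this conversation, any currency


def reference_metrics(conversations: list[Conversation]) -> dict[str, float]:
    """Fallback rate, escalation rate and mean cost per task over a log."""
    n = len(conversations)
    if n == 0:
        raise ValueError("need at least one conversation")
    return {
        "fallback_rate": sum(c.used_fallback for c in conversations) / n,
        "escalation_rate": sum(c.escalated for c in conversations) / n,
        "cost_per_task": sum(c.cost for c in conversations) / n,
    }


# Made-up sample, for illustration only.
sample = [
    Conversation(False, False, 0.04),
    Conversation(True, False, 0.02),
    Conversation(False, True, 0.10),
    Conversation(False, False, 0.04),
]
metrics = reference_metrics(sample)
```

Asking the reference for these figures over a 90-day window under load, rather than over a demo week, keeps the answers comparable across vendors.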

Why the 5.1-month median TTV is the right expectation, not a failure

The 5.1-month median time-to-value is the figure that internal sponsors push back on hardest. The instinct, especially in an Indian GCC or a UK in-house digital team, is to compare it against the four-week sprint cadence that has become the default unit of delivery for everything else. That comparison is the wrong frame.

What 5.1 months actually measures is the end-to-end window from contract signature to a stable production agent that the business owner is willing to call a success. That window has to absorb data-access negotiation with the security team, integration with at least one system of record, evaluation harness setup, two or three rounds of prompt and tool refinement once real users see the agent, a guardrails review, and at least one quarter-end where finance can verify the cost line. Compress any of those steps and the deployment lands in the 69% that never reach production.

The corollary for buyers: a vendor promising a six-week production rollout either is describing a prototype or has not understood your governance reality. The honest pitch should match the median: about five months for a first production use case, and faster on the second and third as the harness, the evaluation set and the integration patterns get reused. For the vendor-side framing of the same gap, see our coverage of why most enterprise agentic AI pilots never reach production.

Lab benchmark vs production deployment — what each fails to capture

The 37-point gap between published benchmarks and real-world performance is not a measurement artefact. It is the predictable consequence of how benchmarks are constructed. They are designed to compare models on stable, curated, single-shot tasks because that is what makes scores reproducible. Production workloads are the opposite of every one of those properties. The table below is the version we walk every buyer through before they read another leaderboard.

| Dimension | What the benchmark measures | What production actually adds | Why the gap exists |
| --- | --- | --- | --- |
| Input quality | Curated, clean, well-formed prompts | Typos, partial context, ambiguous intent, mixed languages | Real users do not write like a test set |
| Session length | Single-shot or short multi-turn | 30+ turn conversations with context drift | Long sessions reveal compounding error rates |
| Tool-call reliability | Synthetic tools with deterministic returns | Real APIs with retries, rate limits, partial failures | Production tools break in ways benchmarks do not simulate |
| Latency budget | Untimed or generous timing windows | Sub-second response expectations on a customer chat | Best-accuracy chain-of-thought blows the SLO |
| Cost ceiling | Not measured | Hard per-task budget enforced by finance | 50× cost spread hides under "comparable accuracy" |
| Adversarial inputs | Curated to avoid prompt injection | Users who paste, jailbreak, or game the agent | Real users find the seams within hours |
| Distribution shift | Frozen test set | Workload that drifts week to week | Frozen scores do not predict drifting reality |
| Human-in-the-loop | Absent | Escalation path, review queue, override rate | End-to-end success depends on human handoff quality |

Read that table once and the 37-point gap stops being mysterious. It is the sum of eight separate things that benchmarks were never designed to measure. For the broader pattern of why public benchmarks need to be treated as upper bounds rather than reliable predictors, our earlier piece on SWE-bench Verified vs SWE-bench Pro and the benchmark-contamination problem is the companion read.
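To make the benchmark-versus-production gap concrete on a buyer's own data, the difference can be computed in percentage points between a published score and the pass rate on a held-out set. This is a minimal sketch; the function names and the example inputs (a published score of 90 and 120 passes out of 200 cases) are illustrative assumptions, not figures from the sources cited in this article.

```python
def pass_rate_points(results: list[bool]) -> float:
    """Pass rate on a held-out set, on a 0-100 scale."""
    if not results:
        raise ValueError("held-out set is empty")
    return 100.0 * sum(results) / len(results)


def gap_points(published_score: float, results: list[bool]) -> float:
    """Published benchmark score minus held-out pass rate, in points."""
    return published_score - pass_rate_points(results)


# Made-up inputs: 120 passes out of 200 held-out cases.
held_out = [True] * 120 + [False] * 80
gap = gap_points(90.0, held_out)  # 30.0 points with these inputs
```

Reporting the gap in points rather than as a relative percentage avoids ambiguity when comparing vendors whose published scores start from different baselines.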

Watch out — the 50× cost trap

Two systems can hit the same headline accuracy on your test set and differ by a factor of 50 on the bill. The cost spread comes from token-per-task burn, reasoning-trace verbosity, retry behaviour, tool-call fan-out and how aggressively the system caches. A vendor that wins on accuracy but ships verbose reasoning traces by default can quietly burn a low-six-figure pound or rupee budget in a quarter on a workload another system handles for a tenth of the spend. Always measure unit cost on your own workload, never on the vendor's demo, and never trust an accuracy number that is not paired with cost-per-task and p95 latency.
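One way to keep accuracy paired with unit cost and tail latency is to compute all three from the same run log. The sketch below assumes each task run records token counts, tool calls and latency; `TaskRun`, `RateCard` and every price shown are hypothetical placeholders, and the p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    input_tokens: int
    output_tokens: int
    tool_calls: int
    latency_s: float


@dataclass
class RateCard:
    per_1k_input: float   # price per 1,000 input tokens
    per_1k_output: float  # price per 1,000 output tokens
    per_tool_call: float  # price per tool call, if billed


def cost_per_task(runs: list[TaskRun], rates: RateCard) -> float:
    """Mean cost per task across a fixed workload."""
    total = sum(
        r.input_tokens / 1000 * rates.per_1k_input
        + r.output_tokens / 1000 * rates.per_1k_output
        + r.tool_calls * rates.per_tool_call
        for r in runs
    )
    return total / len(runs)


def p95(values: list[float]) -> float:
    """95th percentile by nearest rank (integer arithmetic for the rank)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# Illustrative prices and runs only: same workload, different verbosity.
rates = RateCard(per_1k_input=1.0, per_1k_output=1.0, per_tool_call=0.5)
terse = [TaskRun(1000, 1000, 2, 1.0)]
verbose = [TaskRun(10000, 10000, 20, 5.0)]
spread = cost_per_task(verbose, rates) / cost_per_task(terse, rates)
```

Retries and cache hits are not modelled here; a fuller harness would log them per run so that the spread reflects what finance will actually see.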

What to put in your evaluation rubric — the buyer-side categories

The Q1 2026 research consensus is direct: the most reliable evaluation approach combines automated metrics with expert human judgements, run on a fixed test set drawn from real production traffic. Automation gives statistical power and catches regressions; humans catch the wrong-but-confident answers, partial completions and quiet policy violations that automation systematically misses. The rubric below is the one we hand to buyers before they shortlist.

| Category | What to measure | Why it matters | How to verify before signing |
| --- | --- | --- | --- |
| Task accuracy on your data | Pass-rate on 200+ held-out cases from your own workload | Public benchmarks lose 37 points to production reality | Run the vendor's system on your test set, not theirs |
| Cost per task | Total tokens in + out + tool calls, priced at list | Reported cost variation of up to 50× hides under "comparable accuracy" | Measure on identical workload across all shortlisted vendors |
| p95 latency under load | Response time at the 95th percentile at your expected QPS | Median latency hides the tail that breaks the UX | Synthetic load test or a 7-day shadow deployment |
| Tool-call reliability | Success rate when downstream APIs return errors or partial data | Production tools break in ways benchmarks never simulate | Fault-injection in the evaluation harness |
| Long-session coherence | Quality across 20+ turn conversations | Drift compounds in real customer interactions | Replay anonymised real transcripts end-to-end |
| Confident-error rate | Frequency of wrong answers delivered with high model confidence | Confident errors are the failure mode customers complain about | Expert human review of 100+ randomised outputs |
| Escalation behaviour | How cleanly the agent hands off to a human with context | Bad handoff destroys the customer experience the agent saved | Walk through escalations in the demo, not just happy paths |
| Memory and personalisation | Does the agent remember prior turns and customer state correctly | Stateless agents repeat questions and fail loyalty workloads | Multi-session test scenarios with the vendor's memory store |
| Prompt-injection resilience | Behaviour under adversarial inputs from your users | Real users find seams in hours; assume hostile inputs | Red-team the system on your own data before sign-off |
| Observability | Tool-call traces, prompt logs, cost-per-trace, evaluation feeds | You cannot fix what you cannot see in production | Inspect the vendor's tracing UI yourself, not a screenshot |
| Governance and audit | Decision logs, data-residency controls, prompt versioning | UK FCA, RBI and EU AI Act reviewers will all ask for this | Walk a hypothetical audit with the vendor |

The rubric has eleven categories deliberately. A shorter list lets a vendor optimise to whatever happens to be measured and quietly fail on what is not. Eleven categories with one metric each — measured on your data, not theirs — is the version that survives a four-month bake-off.
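A rubric only protects the buyer if every vendor is measured on every category. The sketch below shows one way to enforce that during a bake-off; the category keys are shorthand for the eleven rows of the rubric, and the `missing_categories` helper and the example scorecard are hypothetical names introduced for illustration.

```python
# Shorthand keys for the eleven rubric categories.
RUBRIC = (
    "task_accuracy",
    "cost_per_task",
    "p95_latency",
    "tool_call_reliability",
    "long_session_coherence",
    "confident_error_rate",
    "escalation_behaviour",
    "memory_personalisation",
    "prompt_injection_resilience",
    "observability",
    "governance_audit",
)


def missing_categories(scorecard: dict[str, float]) -> list[str]:
    """Rubric categories a vendor scorecard has no measurement for."""
    return [category for category in RUBRIC if category not in scorecard]


# Illustrative scorecard that only reports the two headline numbers.
partial = {"task_accuracy": 0.91, "cost_per_task": 0.07}
gaps = missing_categories(partial)
```

Refusing to rank a vendor until `missing_categories` returns an empty list is a simple guard against a comparison that quietly collapses back to headline accuracy.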

From the desk

"We ran the same 200-case test set across four shortlisted enterprise agent vendors for a UK insurer earlier this year. Headline accuracy clustered inside a four-point band. Cost-per-task spread by a factor of 31. p95 latency varied by 9 seconds. Confident-error rate ranged from 2% to 11%. The vendor the procurement team was leaning towards on the brand-name strength came out fourth on three of those four metrics. The rubric paid for itself in one meeting."

— Builder Desk · AI Tech Connect

What this means for SI advisors and FTSE procurement

The two audiences this article is written for see the same numbers from opposite directions, and both end up at the same playbook. An Indian SI advisor sits between an enterprise buyer and a portfolio of vendor partners, and is being asked to recommend one or two for a shortlist. A UK FTSE in-house digital or procurement team has its own internal AI review and is being asked to defend a vendor choice to the board's risk committee. The rubric is the same; only the audience for the report differs.

  • Lead with the buyer's own test set. The 37-point benchmark-vs-reality gap means a leaderboard score is a marketing input, not a procurement input. Build a 200-case held-out set from real workload data, anonymised where required by DPDP, GDPR or the UK Data Protection Act, and make every shortlisted vendor run it under identical conditions.
  • Price the workload, not the model. The 50× cost spread shows up only when you measure total tokens, retries, tool calls and memory writes against a fixed workload. The headline rate card does not predict your bill.
  • Plan for 5.1 months end-to-end. Budget, governance and exec sponsorship should all be sized for a 20–22 week first deployment. Anyone promising less is selling a prototype.
  • Demand named production references. The 49-point embedded-vs-production gap means most case studies you will see are pilots. Insist on at least one customer whose deployment has been carrying live traffic for 90+ days, and call them.
  • Make human review part of the rubric, not an afterthought. The Q1 2026 consensus is that automated metrics plus expert human judgement on the same fixed set is the only combination that holds up. Build human review hours into the evaluation budget.
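The "identical conditions" requirement in the first point starts with a test set that every vendor receives unchanged. A minimal sketch of drawing that set reproducibly is below; it assumes workload cases are already anonymised and identified by string IDs, and the `build_held_out_set` name, the seed value and the ID format are illustrative assumptions.

```python
import random


def build_held_out_set(case_ids: list[str], size: int = 200,
                       seed: int = 2026) -> list[str]:
    """Draw a fixed, reproducible sample of workload cases.

    Duplicates are removed and IDs are sorted first, so the same seed
    yields the same set regardless of the order the log was exported in.
    """
    unique = sorted(set(case_ids))
    if len(unique) < size:
        raise ValueError(f"need at least {size} distinct cases, got {len(unique)}")
    return random.Random(seed).sample(unique, size)


# Hypothetical workload of 1,000 anonymised case IDs.
workload = [f"case-{i:04d}" for i in range(1000)]
held_out = build_held_out_set(workload)
```

Freezing the seed and the resulting ID list in the evaluation record lets a later audit confirm that every shortlisted vendor was scored on the same 200 cases.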

Where the numbers come from, and where they go next

The Q1 2026 enterprise figures cited above draw on industry reporting from DigitalApplied's enterprise AI agent data points, the 2026 evaluation review at Kili Technology, the agent-memory landscape at Mem0 and the software-development agent benchmarks compiled by MarkTechPost. The consistent finding across all four sources is that the published numbers do not predict the production numbers, and the only honest way to close that gap is buyer-controlled evaluation on the buyer's own data.

The macro context is worth holding alongside this rubric. Enterprise agent budgets are growing fast: see our coverage of Anthropic overtaking OpenAI on enterprise ARR on the back of agent workloads and of Sierra AI's $950M raise into enterprise agent infrastructure. The money is moving. The 37-point gap, the 50× cost spread and the 5.1-month median TTV are what determine whether it lands well.