What you need to know
- The headline number is 38.78%. PaperArena, an arXiv benchmark for tool-augmented agentic reasoning over scientific literature, reports that even a leading LLM driving a well-established agentic workflow reaches only 38.78% average accuracy.
- It is not an outlier. WebResearcher reports 36.7% on Humanity's Last Exam for its strongest configuration — a state-of-the-art result that still leaves most questions unanswered.
- These are hard-tail benchmarks by design. GSM-Agent builds controllable reasoning environments; OmniEAR probes embodied tasks the authors argue current models do not handle well.
- The number is not the story. The scaffold, the task distribution, and whether a score is single-run or pass@k matter more than the percentage itself.
If you have watched an agent demo lately, you have seen something close to magic: a model that reads a brief, calls three tools, browses the web, and hands back a tidy answer. Then you point the same agent at your own backlog and it stalls, loops, or confidently invents a citation. That gap between the demo and the deployment is exactly what this new generation of agentic-reasoning benchmarks is built to expose — and the scores are sobering.
For builders shipping agents into Indian fintech back-offices or UK public-sector workflows, the lesson is not "agents do not work". It is "agents work on a narrower slice of tasks than the demo implied, and you need to measure that slice yourself". Below we walk through what each benchmark actually measures, why the numbers look low, and a checklist for reading any agent benchmark — including your own — without fooling yourself.
Before you quote any agent benchmark in a board deck, write down three things: which scaffold produced the score, what the task distribution was, and whether it was a single run or pass@k. A number without those three is marketing, not evidence.
The four benchmarks, and what each one actually measures
These are research-paper results, so we present them as the papers report them — not as settled, universal truths. Each benchmark targets a different failure surface, which is precisely why reading them together is more useful than fixating on any single score.
| Benchmark | What it measures | Reported headline result |
|---|---|---|
| PaperArena | Tool-augmented agentic reasoning over scientific literature | A leading LLM in a well-established agentic workflow achieves merely 38.78% average accuracy |
| WebResearcher | Long-horizon web research across multiple challenging benchmarks | The heavy configuration reports 36.7% on Humanity's Last Exam, said to substantially outperform prior systems |
| GSM-Agent | Agentic reasoning inside controllable environments | Introduces a benchmark with controllable environments to evaluate agentic reasoning of LLMs |
| OmniEAR | Agent reasoning in embodied tasks | Argues that embodied reasoning poses fundamentally different challenges from those current models handle well |
PaperArena — tool use over scientific literature
PaperArena tests whether an agent can do what a competent research assistant does: pull the right tools, read across papers, and answer a hard question that no single document contains. The authors report that even a leading LLM driving a well-established agentic workflow lands at 38.78% average accuracy. Read that carefully — the model is strong and the workflow is mature, and the result is still well under half. The failure is not raw knowledge; it is the orchestration of tools and multi-step reasoning under realistic conditions.
WebResearcher — long-horizon web research
WebResearcher is a long-horizon research agent that the authors report as state of the art across six challenging benchmarks. On Humanity's Last Exam — one of the hardest public test sets going — the heavy configuration reports 36.7% accuracy, which the paper frames as substantially outperforming prior systems. The instructive part is the framing: 36.7% is genuinely a leading result, and it still means roughly two of every three questions go unanswered or wrong. "State of the art" and "reliable in production" are not the same claim.
GSM-Agent — controllable reasoning environments
GSM-Agent introduces a benchmark with controllable environments for evaluating the agentic reasoning of LLMs. Controllability is the point: by holding the environment steady and varying one factor at a time, the authors can analyse where reasoning breaks rather than just whether it does. For builders, this is the most directly portable idea — you can build your own controllable harness around your own tools and see which step degrades.
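The "hold the environment steady, vary one factor" idea can be sketched in a few lines. This is a minimal illustration, not GSM-Agent's actual harness: `toy_agent`, the task fields (`steps_needed`, `tool`) and the config keys (`max_steps`, `tools`) are all hypothetical stand-ins for your own agent and environment.

```python
from typing import Any, Callable, Dict, List

Config = Dict[str, Any]
Task = Dict[str, Any]


def accuracy(agent: Callable[[Task, Config], bool], tasks: List[Task], config: Config) -> float:
    """Fraction of tasks the agent solves under one environment config."""
    return sum(agent(task, config) for task in tasks) / len(tasks)


def one_factor_sweep(agent, tasks, baseline: Config, variations: Dict[str, List[Any]]) -> Dict[tuple, float]:
    """Keep the baseline fixed and change exactly one factor per run."""
    results = {("baseline", None): accuracy(agent, tasks, baseline)}
    for factor, values in variations.items():
        for value in values:
            config = {**baseline, factor: value}  # only this one factor differs
            results[(factor, value)] = accuracy(agent, tasks, config)
    return results


def toy_agent(task: Task, config: Config) -> bool:
    """Hypothetical stand-in: succeeds if the step budget and toolset suffice."""
    return task["steps_needed"] <= config["max_steps"] and task["tool"] in config["tools"]


tasks = [
    {"steps_needed": 2, "tool": "search"},
    {"steps_needed": 4, "tool": "search"},
    {"steps_needed": 3, "tool": "calculator"},
    {"steps_needed": 6, "tool": "calculator"},
]
baseline = {"max_steps": 4, "tools": ("search", "calculator")}

results = one_factor_sweep(
    toy_agent,
    tasks,
    baseline,
    {"max_steps": [2, 6], "tools": [("search",)]},
)
for (factor, value), score in results.items():
    print(f"{factor}={value}: {score:.2f}")
```

Because each run differs from the baseline in a single factor, any drop in accuracy can be attributed to that factor alone, which is the property that makes the harness useful for locating the step that degrades.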
OmniEAR — embodied tasks
OmniEAR benchmarks agent reasoning in embodied tasks and argues that embodied reasoning poses fundamentally different challenges from those current models handle well. This matters even if you are not building robots. Any agent that has to act in a stateful world (book the slot, then pay, then confirm, with the world changing between steps) is closer to an embodied task than to a single question-and-answer turn. The authors' argument is a useful warning for anyone wiring agents into live systems.
"We learned this the expensive way. Our support agent hit 91% on our own evaluation set and 54% on the first week of real tickets. The benchmark was clean and single-turn; the real world was messy and multi-step. Now we treat any published score as a ceiling we will never reach, not a floor."
Aditya, Verified Builder · Bengaluru, IN
Why the scores look low, and why that is healthy
A headline below 40% is uncomfortable, but it is a sign these benchmarks are doing their job. Earlier agent benchmarks saturated quickly: a model would post 95%, the number stopped discriminating, and the leaderboard became theatre. The new wave deliberately targets the hard tail: tasks where tool orchestration, long horizons and stateful action all compound. Low scores mean headroom, and headroom means the benchmark can still tell two systems apart.
There is also a structural reason agents underperform their underlying models. A single-turn question gives the model one chance to be right. An agentic task chains many steps, and the per-step success probabilities multiply. If each of eight steps succeeds independently 90% of the time, the end-to-end success rate is 0.9^8, or roughly 43%, and that is before you add tool errors, stale state, or a misread document. This is the maths behind the demo-to-production gap, and it is why scoping an agent to fewer, higher-confidence steps usually beats reaching for a cleverer model.
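The compounding arithmetic is easy to check for your own chain. The sketch below assumes steps fail independently, which is a simplification (real failures are often correlated); `end_to_end_success` is an illustrative helper name.

```python
import math


def end_to_end_success(step_reliabilities):
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(step_reliabilities)


eight_steps = end_to_end_success([0.9] * 8)
two_steps = end_to_end_success([0.9] * 2)

print(f"8 steps at 90% each: {eight_steps:.1%}")  # 43.0%
print(f"2 steps at 90% each: {two_steps:.1%}")    # 81.0%
```

Swapping in measured per-step reliabilities from your own logs, rather than a flat 90%, shows which step contributes most to the end-to-end loss.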
A higher benchmark score with a heavier, more expensive scaffold is not free. WebResearcher's leading number comes from its "heavy" configuration. Before you celebrate a few extra points, check what they cost in latency, tokens and tool calls — your unit economics in Mumbai or Manchester may not survive the heavy setup.
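One way to make that trade-off concrete is cost per correct answer: the cost of a run divided by accuracy. The figures below are purely hypothetical and do not describe WebResearcher or any other system; `cost_per_correct` is an illustrative helper, and the currency unit is whatever you budget in.

```python
def cost_per_correct(accuracy: float, cost_per_task: float) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task / accuracy


# Hypothetical scaffolds: B is more accurate but five times pricier per task.
scaffold_a = cost_per_correct(accuracy=0.50, cost_per_task=0.02)
scaffold_b = cost_per_correct(accuracy=0.60, cost_per_task=0.10)

print(f"Scaffold A: {scaffold_a:.3f} per correct answer")
print(f"Scaffold B: {scaffold_b:.3f} per correct answer")
```

In this made-up case the heavier scaffold buys ten points of accuracy at roughly four times the cost per correct answer; whether that is worth it depends on what a wrong answer costs you.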
How to read an agent benchmark honestly
Most disappointment with agents comes from reading benchmarks the way we read exam scores — as a single number that ranks contestants. Agent benchmarks do not work like that. The same model can swing twenty points depending on the scaffold around it. Here is what to interrogate before you trust any figure, published or your own.
- Watch the harness or scaffold. The orchestration loop, the tool definitions, the retry policy and the prompt all move the score. PaperArena's own framing — "a well-established agentic workflow" — tells you the 38.78% includes a competent scaffold, not a naive one. When a vendor quotes a number, ask what built it.
- Interrogate the task distribution. A benchmark is only as relevant as its task mix. Long-horizon web research says little about your structured-data agent. Map the benchmark's tasks to yours before you generalise from its score.
- Single-run versus pass@k. "Solves it in at least one of five attempts" (pass@5) and "solves it on the first try" (pass@1) are different products. Production usually needs pass@1 behaviour; many headline numbers quietly lean on pass@k.
- Check for contamination. If the benchmark's questions leaked into training data, the score measures memorisation, not reasoning. Newer benchmarks like these are valuable partly because they are fresh — but freshness decays, so favour held-out, recently authored tasks.
- Re-run on your own slice. The only score that should gate a launch is the one measured on a held-out slice of your actual workload, with your actual tools. Treat every public number as a ceiling, never a guarantee.
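To see how far pass@1 and pass@k can diverge on the same data, here is a minimal sketch of the unbiased pass@k estimator widely used in code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n is the number of attempts sampled per task and c the number that were correct. None of the four papers above is claimed to use this exact estimator, and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts is correct.

    n: attempts sampled for the task, c: attempts that were correct, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that all k drawn attempts come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative task: 10 attempts sampled, 2 of them correct.
print(f"pass@1 = {pass_at_k(10, 2, 1):.3f}")  # 0.200
print(f"pass@5 = {pass_at_k(10, 2, 5):.3f}")  # 0.778
```

The same agent looks like a 20% system or a 78% system depending on which figure is quoted, which is why the first-try number is the one to compare against production.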
For a deeper treatment of how lab evaluation diverges from live behaviour, our piece on the enterprise agent evaluation benchmark versus production gap is a useful companion. If you want to see how creatively researchers now stress-test reasoning, the cattle-trade bluffing and bargaining benchmark shows how far the field has moved beyond tidy question-and-answer sets.
A pre-ship checklist for agent builders
None of this means you should wait for benchmarks to hit 90% before shipping. It means you should ship narrow, measure honestly, and add guardrails where the maths says steps will compound. Practical moves that hold up in both Indian and UK deployments:
- Scope to the fewest steps that deliver value. Every step you remove takes one multiplicative factor out of the chain, so end-to-end reliability rises. A dependable two-step agent beats a flaky eight-step one almost every time.
- Build a controllable evaluation harness. Borrow GSM-Agent's logic — hold the environment steady, vary one factor, and find the step that degrades. This is the single highest-leverage thing most teams skip.
- Measure pass@1 on a held-out slice. If you only track pass@k or aggregate accuracy, you will overestimate production reliability. Report the first-try number to your stakeholders.
- Add a confidence gate and a human fallback. Where OmniEAR's embodied warning applies — stateful, multi-step actions in a live system — let the agent escalate rather than act on low confidence.
- Track the cost of accuracy. A heavier scaffold can buy points, but cost and latency must fit your market. Decide your accuracy-per-rupee or accuracy-per-pound budget before you tune.
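The confidence gate in the checklist can start as something very small. The sketch below is one possible shape, not a prescribed design: `ProposedAction`, its `confidence` score and both thresholds are hypothetical, and it assumes you already have a confidence signal you trust enough to threshold.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    confidence: float   # hypothetical calibrated score in [0, 1]
    irreversible: bool  # e.g. a payment, as opposed to a draft


def gate(action: ProposedAction, threshold: float = 0.8,
         irreversible_threshold: float = 0.95) -> str:
    """Execute only above the bar; demand a higher bar for irreversible steps."""
    required = irreversible_threshold if action.irreversible else threshold
    return "execute" if action.confidence >= required else "escalate_to_human"


print(gate(ProposedAction("book_slot", confidence=0.85, irreversible=False)))
print(gate(ProposedAction("take_payment", confidence=0.90, irreversible=True)))
```

Setting a stricter threshold for irreversible actions reflects the stateful, multi-step risk described above: a wrong draft can be discarded, a wrong payment cannot.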
Teams that have already walked this road have written up the practical scars. Our guide on taking an enterprise agentic AI pilot to production covers the operational side, and the analysis of a live SWE agent hitting 79% with an open scaffold is a good demonstration of how much the scaffold — not just the model — moves the final number.
The takeaway
The honest reading of PaperArena, WebResearcher, GSM-Agent and OmniEAR is not pessimism. It is calibration. Agents are genuinely useful today on well-scoped tasks, and these benchmarks tell you where the well-scoped territory ends. A 38.78% on the hardest tool-reasoning tasks, or 36.7% on Humanity's Last Exam, is a measure of ambition, not of failure. Read the scaffold, read the task distribution, re-run on your own data, and ship the narrow thing that works. That is how builders in Bengaluru and Bristol turn a sobering benchmark into a shipped product.
Source papers are available on arXiv under the PaperArena, WebResearcher, GSM-Agent and OmniEAR titles.