The paper that formalises what builders already knew

For the past 18 months, teams shipping with AI coding agents have been running ahead of the academic literature. They knew that something structural had changed — that agents were no longer assistants completing a single prompt but orchestrators driving entire development cycles. The vocabulary to describe the shift, however, was inconsistent and often borrowed awkwardly from classical software engineering.

A paper published on arXiv in June 2026 — "Agentic Software: How AI Agents Are Restructuring the Software Paradigm" — gives the shift a formal name and a rigorous framework. The core argument is direct: AI agents are now the primary orchestrators in agentic software systems. They are not tools invoked by deterministic code. They are the process, and deterministic code has become one of their tools.

This is not a semantic distinction. It changes how software is designed, how teams are structured, how SDLC stages are coordinated, and — critically — how experienced engineers need to think about production reliability when the orchestrator can reason but also hallucinate.

From function-call to tool-call to agent-orchestrator: a structural shift

The paper describes three architectural generations, and the transition between them is the central intellectual contribution.

In the function-call era, software is a graph of deterministic function calls. Behaviour is entirely specified at design time. Testing confirms that the specified behaviour is correct. Debugging finds where the specification deviated from intent. The mental model for reasoning about software is static: you read the code and you know what it does.

The tool-call era introduced the first crack in that model. Large language models could request external tools (APIs, code executors, search), and the overall behaviour of a system became partially dependent on model outputs that were not deterministic. The system prompt was the specification, and "testing" started to mean something different. Evaluation harnesses replaced unit tests for the nondeterministic components. But the LLM was still being called by deterministic code that owned the control flow; it was not the one driving.

The agent-orchestrator era — the subject of the arXiv paper — inverts the architecture. The agent calls the tools. The agent decides which tools to invoke, in which order, and how to interpret their outputs. The agent can loop, backtrack, spawn subagents, and adapt its plan based on real-world feedback. This is the structure that Claude Code, Cursor Agent, and GitHub Copilot Workspace represent in production today. The shift the paper describes is structural, not incremental.
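
To make the inversion concrete, here is a minimal sketch of an agent-driven loop. Everything in it is illustrative rather than taken from the paper: the `TOOLS` registry, `choose_next_action`, and `run_agent` are hypothetical names, and the hard-coded policy stands in for what would be a model call in a real system.

```python
from typing import Callable, Optional

# Hypothetical tool registry: deterministic code exposed to the agent as tools.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda arg: f"contents of {arg}",
    "run_tests": lambda arg: "2 passed, 0 failed",
}

History = list[tuple[str, str, str]]  # (tool, argument, result)


def choose_next_action(goal: str, history: History) -> Optional[tuple[str, str]]:
    """Stand-in for the model: pick the next tool call, or None when done."""
    if not history:
        return ("read_file", "app.py")
    if history[-1][0] == "read_file":
        return ("run_tests", "tests/")
    return None  # the agent considers the goal reached


def run_agent(goal: str, max_steps: int = 10) -> History:
    """The agent owns the loop; tools run only when the agent asks for them."""
    history: History = []
    for _ in range(max_steps):
        action = choose_next_action(goal, history)
        if action is None:
            break
        tool, arg = action
        history.append((tool, arg, TOOLS[tool](arg)))
    return history


trace = run_agent("make the test suite pass")
```

The point of the sketch is the direction of control: `run_agent` never hard-codes the order of tool calls, it only executes whatever the decision function returns, with `max_steps` as a simple guard against runaway loops.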

For builders in Bengaluru, Mumbai, London, or Edinburgh who are building AI products: if your system has an agent that can invoke more than one tool in sequence, you are writing agentic software. The paper's framework applies to you.

The multi-file coherence problem

One of the paper's most practically useful observations is also the most painfully familiar to anyone who has used an agent to make a change that touched more than a couple of files: most agents "fall apart past 3 files."

The reason is coherence. When a function needs to be renamed, the change is not isolated. The function signature changes, which means every caller must update its call site, which may mean updating type annotations in a shared types file, which may mean updating the mock in the test fixtures, which may mean updating the documentation stub. A human engineer making this change carries a mental model of all five locations simultaneously. An agent without multi-file coherence makes the primary change and drops the thread.

Production-worthy agents solve this through three mechanisms the paper identifies. First, they hold context across multiple files — keeping the full working set in the context window rather than reading files one at a time and discarding. Second, they run real tests as verification gates, not as a formality at the end but iteratively after each sub-change. Third, they adjust based on output — if a test fails after a change, the agent re-reads the error, hypothesises a cause, patches, and re-tests before surfacing a result.
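
The third mechanism, adjusting based on test output, can be sketched as a bounded patch-and-retest loop. This is an assumption-laden illustration, not the paper's implementation: `patch_until_green`, `run_tests`, `propose_patch`, and `max_patches` are hypothetical names, and the two callables stand in for a real test runner and a real agent edit step.

```python
from typing import Callable, Optional


def patch_until_green(
    run_tests: Callable[[], Optional[str]],
    propose_patch: Callable[[str], None],
    max_patches: int = 3,
) -> bool:
    """Test, and on failure feed the error back to the patcher, then re-test.

    run_tests returns None on success or an error message on failure.
    Every patch is followed by a test run, and at most max_patches patches
    are applied. Returns True only if the tests end up passing.
    """
    for attempt in range(max_patches + 1):
        error = run_tests()
        if error is None:
            return True
        if attempt == max_patches:
            break  # patch budget spent: surface the failure, do not loop forever
        propose_patch(error)  # the agent reads the error and edits files
    return False
```

Bounding the loop matters as much as the loop itself: an agent that cannot get to green within its patch budget should report the failing state rather than keep editing.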

This is not how most introductory agent tutorials describe the workflow. It is, however, how teams that have moved agents into production actually build them. The paper gives formal grounding to a set of practices that have been emerging through trial and error in engineering teams in India and the UK.

Pro tip

When scoping agentic tasks, identify the blast radius before you start: how many files will this change touch, and how are they coupled? If it is more than five to seven files with tight coupling, break the task into subtasks with verification gates between them. The agent can handle the complexity; the risk is that without gates, errors compound silently. See our planning tips for Claude Code workflows for a practical framework.
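
One rough way to estimate blast radius before starting is to walk the "is imported by" graph outward from the file you plan to change. The sketch below assumes you already have that graph as a dictionary; the file names and the `blast_radius` helper are hypothetical, and the threshold of five simply mirrors the lower bound in the tip above.

```python
from collections import deque


def blast_radius(changed: str, dependents: dict[str, list[str]]) -> set[str]:
    """Return the changed file plus every file that transitively depends on it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical project: each value lists the files that import the key.
dependents = {
    "types.py": ["billing.py", "users.py"],
    "billing.py": ["api.py", "test_billing.py"],
    "users.py": ["api.py"],
}

affected = blast_radius("types.py", dependents)
needs_gates = len(affected) > 5  # more than five files: split into gated subtasks
```

Here a change to `types.py` reaches five files in total, which sits right at the boundary; a sixth dependent would tip the task into the "break it up" category.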

Single-agent versus multi-agent: a practical decision framework

The paper draws a clear and useful line between single-agent and multi-agent architectures, and it is not the line most practitioners draw intuitively. The relevant question is not "how complex is the task?" It is "can the subtasks run independently, or does state need to remain coherent across them?"

| Dimension | Single-agent workflow | Multi-agent workflow |
| --- | --- | --- |
| State coherence | Full: one agent holds all context | Partial: agents share state via explicit interfaces |
| Best for | Tasks where steps depend on prior steps' outputs | Tasks where subtasks can run in parallel |
| Sequential reasoning | Strong: no handoff degradation | Weak: Google Research found 39–70% degradation |
| Throughput | Limited to one agent's rate | Scales horizontally with subtask count |
| SDLC use case | Refactoring a tightly coupled feature | Planning, scaffolding, writing, testing, debugging in parallel |
| Error propagation | Contained within one session | Can cascade across agents without explicit error contracts |
| Debugging difficulty | Moderate: single trace | High: requires inter-agent observability |

The Google Research finding on sequential reasoning deserves particular attention. A 39–70% performance degradation for multi-agent variants on strict sequential reasoning tasks is large enough to be disqualifying for certain workloads. If your task requires reasoning where step N genuinely depends on the exact output of step N-1 — not just a shared data structure, but a reasoning chain — multi-agent architecture will hurt you, not help you. The paper is clear on this boundary.

Multi-agent wins when the SDLC stages can run independently. A planning agent determines what to build. A scaffolding agent creates the file structure. A code-writing agent populates the implementation. A test-writing agent writes the test suite against the interface specification. A debugging agent analyses failing test output. A deployment agent validates the build pipeline. Each of these agents operates on a defined interface and does not need to know what the others are reasoning about. That independence is the signal that multi-agent is the right choice.

Watch out

Splitting a task across agents because it "feels too big for one agent" is one of the most common agentic architecture mistakes. Bigness is not the criterion. Independence is. If agents must share a reasoning chain across handoffs, multi-agent architecture will degrade performance, not improve it. Design for independence first; scale second.
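
The independence criterion can be checked mechanically if you write down which subtasks depend on which. The sketch below is one possible heuristic, not something the paper prescribes: `parallel_waves` and the example task names are hypothetical, and every dependency is assumed to appear as a key in the dictionary.

```python
def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into waves; tasks in the same wave do not depend on each other."""
    remaining = {task: set(needs) for task, needs in deps.items()}
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(task for task, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for needs in remaining.values():
            needs.difference_update(ready)
    return waves


# A rename where each step needs the previous step's output: a pure chain.
chain = {
    "rename": set(),
    "update_callers": {"rename"},
    "fix_tests": {"update_callers"},
}
# Three chores that share nothing: fully independent.
independent = {"docs": set(), "lint_config": set(), "changelog": set()}

chain_waves = parallel_waves(chain)
independent_waves = parallel_waves(independent)
```

If every wave contains a single task, as with `chain`, there is nothing to parallelise and splitting across agents only adds handoffs. A wide wave, as with `independent`, is the signal that multiple agents can work without sharing a reasoning chain.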

State management in multi-agent systems: JSONL and the merge-conflict problem

One of the paper's concrete technical contributions is its treatment of state management in multi-agent workflows. When multiple agents are operating concurrently on the same codebase, they need a shared state store that does not create the classic distributed-systems nightmare of merge conflicts at every write.

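The solution described in the paper, storing issues as JSONL in git with hash-based IDs, is elegant precisely because it works with the tooling teams already have. Each issue is a line in a JSONL file. Each line is identified by a content hash. Concurrent agents append new lines; they never update existing lines. Because git merges text line by line and each agent writes unique lines, merge conflicts effectively disappear from the state management layer. One caveat: under git's default merge, two branches that both append at the end of the same file can still conflict, so the pattern works best when the file is merged as a union of lines, for example via git's built-in merge=union attribute.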

This is a small design decision with large operational consequences. Teams that try to use shared mutable state — a shared JSON object in a database, for example — inevitably hit a class of race condition bugs that are difficult to reproduce and worse to debug. The JSONL-in-git pattern sidesteps the entire category.
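
A minimal sketch of the append-only JSONL idea follows. The details are assumptions made for illustration: the article only says the IDs are hash-based, so the choice of SHA-256, the 12-character truncation, and the `issue_line` and `union_merge` helpers are all hypothetical. `union_merge` imitates what a union-style merge of two append-only files produces.

```python
import hashlib
import json


def issue_line(issue: dict) -> str:
    """Serialise an issue as one JSONL line with an ID derived from its content."""
    canonical = json.dumps(issue, sort_keys=True, separators=(",", ":"))
    issue_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return json.dumps({"id": issue_id, **issue}, sort_keys=True)


def union_merge(ours: str, theirs: str) -> str:
    """Combine two append-only JSONL files, keeping each distinct line once."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for line in ours.splitlines() + theirs.splitlines():
        if line:
            seen.setdefault(line, None)
    return "".join(line + "\n" for line in seen)


base = issue_line({"title": "seed issue"}) + "\n"
agent_a = base + issue_line({"title": "fix login", "agent": "a"}) + "\n"
agent_b = base + issue_line({"title": "add tests", "agent": "b"}) + "\n"
merged = union_merge(agent_a, agent_b)
```

Because the ID is computed from the content, two agents that record the identical issue produce the identical line, and the merge keeps a single copy. Because lines are never edited in place, there is no "which version wins" question to resolve.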

For Indian engineering teams using GitHub and GitLab as their primary collaboration infrastructure, this pattern slots in naturally — it is a state management strategy, not a new tool. For UK teams with stricter audit requirements, the fact that the state history is in git also provides a full audit trail with no additional instrumentation.

The three tiers of agentic tooling in 2026

The paper maps the current production landscape of agentic coding tools into three tiers, differentiated by how much of the SDLC they handle autonomously and how much human oversight remains in the loop.

Tier one: single-terminal agents. The canonical example is Claude Code operating with subagents. The human engineer is in the terminal. The agent runs tasks, reports results, and awaits direction. Autonomy is scoped to the task the human explicitly frames. Oversight is continuous and immediate. This tier is appropriate for most teams entering agentic development — the cost of a mistake is one task, not one CI run.

Tier two: isolated worktree agents with diff review. Cursor Agent is the representative example at this tier. The agent operates in an isolated worktree — it cannot affect the main branch without the engineer reviewing and accepting a diff. Autonomy extends to multi-file changes within the worktree. Oversight happens at the diff boundary. This tier is appropriate for teams comfortable delegating implementation whilst retaining review authority over what lands in the codebase.

Tier three: fully autonomous pipelines with CI integration. GitHub Copilot Workspace represents the current production ceiling of this tier. The agent can open a pull request, run the CI suite, interpret failures, push fixes, and iterate, with no human in the loop between the first commit and PR review. Oversight is at the PR review stage. This tier requires robust CI as a safety net. Teams without strong test coverage should not operate at tier three: the agent will confidently put forward broken code that a weak CI suite waves through.

These tiers are not a progression to race through. They represent different risk profiles for different team contexts. A ten-person startup in Pune shipping a greenfield product may run happily at tier three. A fifteen-person fintech in Edinburgh operating under FCA oversight should probably stay at tier two and treat every agent-generated PR with the same scrutiny as a junior engineer's first contribution.

For more on the agent protocol layer that connects these tiers, see our coverage of the Zed 1.0 parallel agents and agent-client protocol.

The "land the plane" pattern and other production practices

The paper documents several production patterns that have emerged from teams operating agents at scale. Two are worth highlighting in detail.

The "land the plane" pattern addresses one of the more subtle failure modes in long-running agent sessions: accumulated uncommitted state. An agent that has been running for twenty minutes may have modified eight files, staged three of them, run two failing tests, and partially reverted one change — all without committing. If the session is interrupted (network timeout, token limit, explicit stop), this state is difficult to reconstruct. The "land the plane" pattern mandates that agents clean up state at session end: commit or stash modified files, document what was done and what remains, and leave the working directory in a state from which the next session can resume cleanly.

This is the agent equivalent of writing good commit messages — hygiene that feels optional until it is not. Teams running agentic systems in production should enforce this pattern at the tooling level, not rely on the agent to remember it.
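
Enforcing the pattern at the tooling level could look something like the sketch below, which inspects the output of `git status --porcelain` and produces a clean-up plan plus a handoff note. The `landing_plan` function and its return shape are hypothetical, renamed paths are not handled, and stashing is only one of the two options the pattern allows (committing is the other).

```python
def landing_plan(porcelain: str, done: str, remaining: str) -> dict:
    """Summarise uncommitted state and propose clean-up commands.

    `porcelain` is the text printed by `git status --porcelain`, where the
    first column is the index status and the second the working-tree status.
    """
    staged, unstaged, untracked = [], [], []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # not a status entry
        path = line[3:]
        if line.startswith("??"):
            untracked.append(path)
            continue
        if line[0] != " ":
            staged.append(path)
        if line[1] != " ":
            unstaged.append(path)

    commands = []
    if staged or unstaged or untracked:
        # Park everything, untracked files included, under a labelled stash.
        commands.append(
            ["git", "stash", "push", "--include-untracked", "-m", f"agent handoff: {done}"]
        )
    return {
        "staged": staged,
        "unstaged": unstaged,
        "untracked": untracked,
        "commands": commands,
        "note": f"Done: {done}\nRemaining: {remaining}\n",
    }


status = "M  src/a.py\n M src/b.py\nMM src/c.py\n?? notes.txt\n"
plan = landing_plan(status, done="renamed parse()", remaining="update docs")
```

The function only builds the plan; a wrapper that runs at session end would execute `plan["commands"]` and write `plan["note"]` somewhere the next session will read it.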

The verification gate pattern requires agents to run real tests at defined checkpoints rather than assuming correctness. This sounds obvious, but the temptation — in both human and agent workflows — is to run tests once at the end. Verification gates distribute that check across the workflow: after each file change, after the scaffolding phase, after the first implementation pass. Failing early is dramatically cheaper than failing at the final gate.
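
As an illustration, gates can be modelled as a check that runs after every step and aborts the workflow on the first failure. The names `run_with_gates` and `GateFailure` are hypothetical, and the `gate` callable stands in for a real test run.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


class GateFailure(Exception):
    """Raised when the gate fails; records where and what had already passed."""

    def __init__(self, step: str, passed: list[str]):
        super().__init__(f"gate failed after step {step!r}")
        self.step = step
        self.passed = passed


def run_with_gates(steps: list[Step], gate: Callable[[], bool]) -> list[str]:
    """Run each step, then the gate; stop at the first failing gate."""
    passed: list[str] = []
    for name, action in steps:
        action()
        if not gate():
            raise GateFailure(name, passed)
        passed.append(name)
    return passed
```

When a gate fails, the exception names the step that broke the build and lists the steps known to be good, so the search space for the fault is one step wide instead of the whole session.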

What this means for builders in India and the UK

The arXiv paper is an academic contribution, but its implications are immediately practical for engineering teams in both markets.

For Indian product teams — particularly the cohort of early-stage startups and mid-size SaaS companies that have been adopting Claude Code and Cursor Agent rapidly since late 2025 — the paper provides vocabulary and a framework that helps justify architectural decisions to stakeholders. "We are using a multi-agent workflow because the subtasks are independent" is a more defensible position than "we are using multi-agent because it is newer." The paper also gives Indian engineering leads something concrete to share with hiring committees: the distinction between an engineer who understands agentic architecture and one who has simply used an agent once.

For UK teams operating under regulatory constraints — financial services under FCA, healthcare under MHRA, legal under SRA — the paper's treatment of state management and audit trails is directly relevant. JSONL-in-git state management produces an auditable log by default. The tier framework helps teams place themselves within a risk-appropriate level of autonomy. The explicit treatment of the human-in-the-loop boundary gives compliance teams a handle on what agents are and are not doing autonomously.

The hiring implications are significant in both markets. As agentic software becomes the default development paradigm rather than an experimental practice, the relevant engineering skills shift. Understanding multi-file coherence, designing for agent-orchestrator architecture, knowing when to use single-agent versus multi-agent, and operating the three tiers safely — these are now production skills, not research skills. Teams hiring AI engineers in 2026 should be evaluating candidates on exactly this capability set.

Builders who are already working at this level (who can reason about tool-call composition, who have designed a JSONL state layer, who know the Google Research finding about sequential reasoning and can explain when it applies) are the practitioners the paper is describing. They are also the practitioners that teams in both markets most urgently need.

The open question: evaluation

The paper's formal framework is strong on architecture and production patterns. It is appropriately cautious about evaluation. Evaluating agentic software remains an open problem in a way that evaluating deterministic software is not.

For deterministic software, correctness is a binary predicate: for a given input, the output either meets the specification or it does not. For agentic software, correctness is a probabilistic statement over a distribution of tasks, with compounding uncertainty across each tool-call step. The evaluation frameworks that exist (SWE-bench, GAIA, custom task harnesses) are useful but incomplete. No consensus benchmark yet captures the production failure modes that matter most: multi-file coherence degradation, state drift in long sessions, error propagation across agent handoffs.
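
A small worked example shows why compounding matters. It rests on a simplifying assumption that is not from the paper: each tool-call step succeeds independently with the same probability. The helper names `chain_success` and `pass_rate` are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of sampled task runs that succeeded."""
    return sum(outcomes) / len(outcomes)


ten_step = chain_success(0.95, 10)                  # roughly 0.60
observed = pass_rate([True, True, False, True])     # 0.75
```

Under that assumption, an agent that gets each step right 95% of the time completes a ten-step task only about 60% of the time. Real steps are rarely independent, which is one reason an empirical pass rate over sampled tasks is a better basis for decisions than a per-step estimate.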

This is the research frontier that the arXiv paper points at without fully resolving. For builders operating production agentic systems, the practical implication is that you need to build your own evaluation harness, tailored to your task distribution and your risk tolerance. A CI suite is necessary but not sufficient. You need agentic task-level evaluation that exercises the agent across the full range of tasks it will encounter in production, not just the happy path. See our Tips hub for practical guides on building evaluation suites for agentic systems.

The paradigm shift in one sentence

The arXiv paper's lasting contribution may be this: it makes explicit that the shift from function-call to tool-call to agent-orchestrator architecture is not a tooling upgrade. It is a structural change in what software is. In agentic software, the primary unit of composition is not the function but the agent. The primary control structure is not the call graph but the goal. The primary reliability concern is not the exception but the reasoning error.

Teams that have internalised this shift — that have rebuilt their mental models of what a software system is and how it is correctly specified, tested, and operated — are the teams that will ship reliable agentic products in 2026. The paper is a map. The builders who understand the territory it is describing are the ones worth finding.

The full paper is available at arxiv.org/abs/2606.05608.