AI Coding Open Source Infra · 8 June 2026 · 13 min read

Plan-First Claude Code Workflows: Specs, Sub-Agents, and Checkpoints for Complex Features

Stop wasting hours on AI coding sessions that go off track. A 4-step plan-first workflow — spec, decompose, checkpoint, handoff — that keeps Claude Code on course for complex multi-file features.

Priya Sharma AI Engineer · Bangalore, India · Founding Builder #7

What you need to know

AI coding assistants have a fundamental structural problem that most engineers discover through painful experience rather than by reading about it: the model's ability to stay on track degrades as the conversation grows. Each turn adds context that dilutes the original intent. By turn fifteen of a complex session, the model is optimising for plausible-looking code that satisfies the most recent message rather than the actual requirement you stated at the start. The feature compiles, the tests are green — but it does not do what you asked for.

The root cause is not the model. It is the absence of a stable, machine-readable reference that the model can re-consult when its scroll-history-based working memory drifts. This is what a spec provides. A spec is not documentation for humans to read after the fact; it is a contract the model can check against at every step. Without it, every AI coding session is an implicit bet that the model will correctly infer your intent across an arbitrarily long conversation — a bet you will lose on any feature that touches more than two or three files.

Based on informal discussions with AI builder teams in both India and the UK, ad-hoc sessions without a written spec fail at a high rate for features that touch more than four files; session restarts and context rebuilds are the norm rather than the exception. The benchmark section later in this guide puts concrete numbers to this; the short version is that the plan-first workflow saved roughly three session re-runs per feature (an average of 3.7 full restarts ad-hoc against 0.8 plan-first). This guide gives you the complete four-step workflow: write the spec, decompose into sub-agent tasks, set checkpoints with rollback signals, and pass structured context between sessions. It is written for engineers who are already comfortable with Claude Code and have had sessions go frustratingly off track. It is not a beginner tutorial.

Why AI-assisted coding stalls without a written plan

When you open a Claude Code session with a vague prompt — "add a caching layer to the user endpoint" — the model does something superficially impressive: it reads your codebase, makes plausible inferences about what you mean, and starts writing code. For a change that fits in one file, this often works. For a change that spans a service layer, a repository, a database schema, and a test suite, the inferences compound. By the time the model is writing the fourth file, it is working from its own earlier inferences about what the first file implied, and those inferences were themselves based on an underspecified prompt.

The failure mode has a characteristic shape. The first few files look correct. Then something unexpected happens — a type error, a circular import, a test that the model thought would pass but does not. The model fixes it. The fix introduces a subtle inconsistency with file number one. You do not notice immediately because you are watching the current file. By the time you run the full test suite, the session has seventeen turns of accrued context, the model is confused about which version of the interface it is implementing, and the path back to a clean state requires either a full rollback or a long sequence of carefully scoped corrections.

This pattern has a name in engineering: it is spec drift. In human engineering teams it is prevented by design documents, code review, and sprint planning. With AI coding tools, the equivalent discipline is a written spec that the model reads at the start of every task, not just once at the beginning of the session. The model does not accumulate intent across a session the way a human colleague does. It reads what is in its context window right now. A spec that lives outside the conversation — in a file the model is instructed to re-read — is the only reliable way to give the model persistent intent.

There is a secondary failure mode that is less obvious but equally damaging: hallucinated APIs. When the model is operating from inferred intent rather than a spec, it fills gaps in its knowledge with plausible-looking library calls that do not exist, or exist in a different version than the one you are running. Detecting this mid-session is difficult because the hallucinated call often looks correct until you actually run the code. A spec that includes explicit constraints — "use only the APIs in lib/db/client.ts, do not import from external packages not already in package.json" — does not eliminate hallucination, but it gives the model a surface to check against rather than a void to fill.

Prerequisites: what "plan-first" means and does not mean

Plan-first does not mean writing exhaustive documentation before touching the keyboard. It does not mean a design review meeting, a Confluence page, or a Jira epic. For a solo engineer working on a single feature, a spec is a markdown file of 200 to 400 words that you write in five to ten minutes before starting a Claude Code session. If writing it takes longer than that, the feature is under-defined at the product level and no amount of prompting discipline will fix the session — that is a planning conversation, not a coding one.

Plan-first also does not mean that Claude Code never surprises you with better approaches. It means that when the model proposes a deviation from the spec, you evaluate it consciously rather than discovering it accidentally when a test fails. The spec is a contract, not a cage. You can update it. But you update it explicitly, not by letting the model silently accumulate its own interpretations.

The prerequisites for this workflow are modest. You need a text editor and a directory in your project where you keep spec files — docs/specs/ works well. You need Claude Code (or any AI coding assistant that accepts file references at the start of a session). You need a test suite, even a minimal one, because checkpoints require a passing state to verify against. And you need the discipline to stop and write the spec before writing the first prompt. That last prerequisite is the hard one — the habit of jumping straight into prompting is exactly what this workflow is designed to replace.

Step 1 — Write the spec, not the code

A spec for an AI coding session has a specific structure that is different from a design document or a user story. Its purpose is to give the model a stable, machine-readable reference that answers the four questions it will otherwise infer incorrectly: what is the goal, what constraints apply, what does the interface contract look like, and how do I know when I am done. Every field in the spec template below maps to one of those four questions.

The goal section is a single sentence. Not a paragraph, not bullet points — one sentence that names the component being built or changed and what it must do. Vagueness here propagates through every subsequent task. "Add caching" is a bad goal. "Add a read-through Redis cache to UserRepository.findById() with a 5-minute TTL and a cache:clear:user:{id} invalidation key" is a good goal. The specificity forces you to think through the design before prompting, which is the point.
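To make the "good goal" concrete, here is a minimal sketch of what a read-through cache around findById() could look like. It is an illustration under stated assumptions, not the article's implementation: InMemoryCache is a hypothetical stand-in for Redis so the sketch runs without a server, loadFromDb is a hypothetical database loader, invalidate() is a hypothetical helper name, and the key uses the user:{id} format from the acceptance criteria in the spec template.

```typescript
// Shape of the cache client, matching the spec's interface contracts.
interface CacheClient {
  get<T>(key: string): Promise<T | null>;
  set<T>(key: string, value: T, ttlSeconds: number): Promise<void>;
  del(key: string): Promise<void>;
}

interface User {
  id: string;
  email: string;
}

const USER_TTL_SECONDS = 300; // the 5-minute TTL named in the goal sentence

// Hypothetical in-memory stand-in for Redis (assumption, for illustration only).
class InMemoryCache implements CacheClient {
  private entries = new Map<string, { value: unknown; expiresAt: number }>();

  async get<T>(key: string): Promise<T | null> {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // drop expired or missing entries
      return null;
    }
    return entry.value as T;
  }

  async set<T>(key: string, value: T, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  async del(key: string): Promise<void> {
    this.entries.delete(key);
  }
}

class UserRepository {
  private readonly cache: CacheClient;
  // Hypothetical loader standing in for the real database query.
  private readonly loadFromDb: (id: string) => Promise<User | null>;

  constructor(cache: CacheClient, loadFromDb: (id: string) => Promise<User | null>) {
    this.cache = cache;
    this.loadFromDb = loadFromDb;
  }

  async findById(id: string): Promise<User | null> {
    const key = `user:${id}`;
    const cached = await this.cache.get<User>(key);
    if (cached !== null) return cached; // cache hit: skip the database

    const user = await this.loadFromDb(id);
    if (user !== null) await this.cache.set(key, user, USER_TTL_SECONDS);
    return user;
  }

  // Call after any write so the next read goes back to the database.
  async invalidate(id: string): Promise<void> {
    await this.cache.del(`user:${id}`);
  }
}
```

Writing the goal at this level of detail means the TTL, the key format, and the invalidation path are all decided before the first prompt rather than inferred by the model mid-session.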

The constraints section is where hallucination prevention lives. List every constraint the model might otherwise guess at: the Node version, the specific library version, which files are off-limits for modification, what coding conventions apply (no default exports, all async functions must have explicit return types, and so on). Constraints that feel obvious to you are invisible to the model unless you state them.

The interface contracts section is the most valuable for multi-file features. List every function, method, or type that will be created or modified, with its full signature. This prevents the most expensive class of drift: the model deciding mid-session that a different interface shape is cleaner and silently redesigning the contract that other files depend on.

The acceptance criteria section is a numbered list of observable outcomes. "The test 'findById caches on second call' in user.repository.test.ts passes" is an acceptance criterion. "The caching works correctly" is not. Acceptance criteria are what the checkpoint verification step will check, so they must be testable.

Spec template

# SPEC: {feature-name}
## Goal
{One sentence: component + what it must do.}

## Constraints
- Runtime: Node 22 / TypeScript 5.4 strict
- Dependencies: only packages already in package.json
- Do NOT modify: {list files/modules that are off-limits}
- Conventions: {e.g. no default exports; all async fns must declare return type}

## Interface contracts
```typescript
// All new/modified signatures must match exactly:
// src/repositories/user.repository.ts
async findById(id: string): Promise<User | null>

// src/cache/cache.client.ts
get<T>(key: string): Promise<T | null>
set<T>(key: string, value: T, ttlSeconds: number): Promise<void>
del(key: string): Promise<void>
```

## Acceptance criteria
1. `user.repository.test.ts`: all existing tests still pass.
2. `user.repository.test.ts#findById caches on second call`: passes.
3. `user.repository.test.ts#findById invalidates cache on update`: passes.
4. No new TypeScript errors: `tsc --noEmit` exits 0.
5. Redis key format matches `user:{id}` with 300-second TTL.

## Files in scope
- src/repositories/user.repository.ts (modify)
- src/cache/cache.client.ts (create)
- src/cache/cache.client.test.ts (create)
- src/repositories/user.repository.test.ts (modify)

## Files NOT in scope
- src/routes/user.routes.ts
- src/middleware/auth.middleware.ts
- Any migration file

Pro tip

Begin every Claude Code session by pinning the spec file with an opening prompt such as: "Read docs/specs/user-cache.md and treat it as your source of truth. Do not modify any file not listed under 'Files in scope'. After each task, re-read the acceptance criteria before marking the task done." In the author's experience, this single instruction eliminates the majority of scope drift in long sessions.

One thing that is easy to skip and consistently causes problems: the "Files NOT in scope" section. The model will, absent this guidance, sometimes refactor adjacent code that looks related. The refactor may even be an improvement. But it changes files that are not part of the current task, which means the task's checkpoint verification will find unexpected modifications, which means you cannot confidently say the checkpoint passed. Scope discipline costs nothing to specify and saves significant recovery time.
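The scope check described above can be made mechanical. The following is a small sketch, assuming you already have the list of changed paths as strings (for example from `git diff --name-only <last-green-commit>`, which does not list untracked files until they are added). The function name and the sample data are hypothetical.

```typescript
// Returns every changed file that the spec does not list under "Files in scope".
function findOutOfScopeChanges(changedFiles: string[], filesInScope: string[]): string[] {
  const allowed = new Set(filesInScope.map((file) => file.trim()));
  return changedFiles
    .map((file) => file.trim())
    .filter((file) => file.length > 0 && !allowed.has(file));
}

// Scope list copied from the spec template.
const userCacheScope = [
  "src/repositories/user.repository.ts",
  "src/cache/cache.client.ts",
  "src/cache/cache.client.test.ts",
  "src/repositories/user.repository.test.ts",
];

// Hypothetical list of paths changed since the last green checkpoint.
const changedSinceCheckpoint = [
  "src/cache/cache.client.ts",
  "src/routes/user.routes.ts",
];

const outOfScope = findOutOfScopeChanges(changedSinceCheckpoint, userCacheScope);
console.log(outOfScope); // the routes file is the only out-of-scope path here
```

An empty result is a precondition for calling a checkpoint green; anything else is a prompt to revert the stray file or to update the spec explicitly.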

Step 2 — Break the spec into sub-agent tasks with clear handoffs

A spec describes a feature. A task list describes how to build it in a sequence of verifiable steps. The difference matters because a task is a unit of work that can be independently verified — you can run a checkpoint after it and confirm the system is still in a known-good state. A feature that touches four files cannot be verified in one step; a task that touches one file can.

The decomposition rule is: each task must have a declared input state, a declared output state, and a verification step that can be run mechanically. "Input state" means which files exist and what their current passing status is. "Output state" means which files were created or modified and what the post-task test status must be. "Verification step" means the exact command that confirms the output state — typically npm test -- --testPathPattern=cache.client or tsc --noEmit or python -m pytest tests/unit/cache_test.py.
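One way to keep the decomposition rule honest is to treat each task as data and check it before the session starts. This is a sketch only: the SpecTask shape and findTaskProblems are hypothetical names invented for illustration, and the sample task mirrors the cache-client task used in this guide.

```typescript
// Hypothetical shape for one task in the decomposition.
interface SpecTask {
  title: string;
  inputState: string[];   // what must be true before the task starts
  outputFiles: string[];  // files the task creates or modifies (the output state)
  verification: string[]; // exact commands that confirm the output state
}

// Lists the parts of the decomposition rule that a task fails to satisfy.
function findTaskProblems(task: SpecTask): string[] {
  const problems: string[] = [];
  if (task.inputState.length === 0) problems.push("no declared input state");
  if (task.outputFiles.length === 0) problems.push("no declared output state");
  if (task.verification.length === 0) problems.push("no verification command");
  return problems;
}

const cacheClientTask: SpecTask = {
  title: "Task 1: create cache client",
  inputState: ["src/cache/ does not exist", "npm test exits 0"],
  outputFiles: ["src/cache/cache.client.ts", "src/cache/cache.client.test.ts"],
  verification: ["npm test -- --testPathPattern=cache.client"],
};

console.log(findTaskProblems(cacheClientTask)); // empty: the task satisfies all three parts
```

A task that comes back with any problem listed is not ready to hand to the model, because there is no mechanical way to tell whether it succeeded.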

Claude Code's /tasks feature provides a native interface for this. You can define a task list at the start of a session and the model will track its progress. But a plain numbered markdown list in the spec file works equally well and is more portable across tool versions. The structure is what matters, not the mechanism.

Task decomposition prompt

# Task decomposition for: user-cache feature
# Read docs/specs/user-cache.md before proceeding.

## Task 1 — Create cache client
Input state:
  - src/cache/ does not exist
  - All existing tests pass (npm test exits 0)

Actions:
  - Create src/cache/cache.client.ts implementing CacheClient interface
  - Create src/cache/cache.client.test.ts with unit tests using ioredis-mock
  - Do NOT touch any repository files

Verification:
  $ npm test -- --testPathPattern=cache.client
  Expected: all new tests pass, no existing tests fail

## Task 2 — Integrate cache into UserRepository
Input state:
  - src/cache/cache.client.ts exists and its tests pass
  - src/repositories/user.repository.ts is unmodified from baseline

Actions:
  - Modify UserRepository.findById() to use CacheClient
  - Add invalidation call to UserRepository.update() and UserRepository.delete()
  - Update user.repository.test.ts to cover cache hits and invalidation

Verification:
  $ npm test -- --testPathPattern=user.repository
  Expected: all tests including new cache tests pass
  $ tsc --noEmit
  Expected: 0 errors

## Task 3 — Integration smoke test
Input state:
  - Tasks 1 and 2 verified green

Actions:
  - Run the full test suite
  - Confirm no regressions in unrelated modules

Verification:
  $ npm test
  Expected: same number of passing tests as baseline, plus new cache tests

Warning

Do not create tasks that span multiple concerns — for example, "create the cache client and update the repository" in a single task. When the verification fails, you cannot tell which concern caused it. One concern per task is not pedantry; it is the thing that makes rollback precise. If a task's verification fails, you want to know exactly which file to revert.

The handoff between tasks deserves explicit attention. When Task 1 is complete and verified, before starting Task 2, write a one-paragraph state summary in the spec file (or in a separate docs/specs/user-cache.state.md file). The state summary records what was created, what the current test status is, and any decisions that were made during Task 1 that affect Task 2. This takes two minutes and prevents the situation where Task 2 starts with the model's understanding of Task 1 derived from a long scroll history rather than an explicit summary.

For the India and UK context, a practical consideration: if you are running CI on AWS Mumbai (ap-south-1) and AWS London (eu-west-2) simultaneously — which is the right setup for a dual-market product — decompose your tasks so that database migrations are isolated. A task that creates a migration should have its own checkpoint that confirms the migration applies cleanly before any task that modifies application code to depend on the new schema. Cross-region CI failures that stem from a migration ordering problem are far harder to debug than a single-region failure, because the failure evidence is split across two logs.

Step 3 — Set checkpoints and define rollback signals

A checkpoint is a moment in the workflow where you stop, run the verification command, and make a binary decision: pass or rollback. It is not a moment where you read the output and decide whether it looks approximately correct. It is a mechanical verification against a defined standard. This distinction matters because "approximately correct" is the state that accumulates into a session that looks like it is working until you run the full test suite and find seven failures.

Every task completion is a mandatory checkpoint. There are also intra-task checkpoints for long tasks: after creating a new file, before modifying an existing file that other modules depend on, and after any change to a public interface. The intra-task checkpoints are lighter — typically just a tsc --noEmit to confirm the type graph is intact — but they prevent the situation where a task ends with three cascading type errors because one interface change was not caught until the end.

The rollback signal is the harder part to define in advance, but it is what gives the workflow its value. A failing test alone does not trigger a rollback: a new test is expected to fail after it is written and before its implementation exists. A rollback is triggered by a structural failure, meaning a modification that breaks a passing state that existed at the previous checkpoint.

| Checkpoint type | Verification command | Pass condition | Rollback trigger |
| --- | --- | --- | --- |
| New file created | `tsc --noEmit` | 0 type errors in new file and all existing files | Type errors in files not touched by the current task |
| Existing file modified | `npm test -- --testPathPattern={file}` | All pre-existing passing tests for this file still pass | Previously passing test now fails |
| Public interface changed | `npm test` (full suite) | No regression in any module; new tests may be red (expected) | Regression in a module that is not in this task's scope |
| Database migration added | `npm run migrate:test` | Migration applies cleanly to a fresh test DB in both regions | Migration fails to apply or reversal (`down`) does not restore original schema |
| Task complete | All acceptance criteria from spec | Every acceptance criterion in the spec is met | Any acceptance criterion not met; any file outside scope was modified |
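The pass-or-rollback decision can be written as a pure function, which makes the "binary decision" explicit. The sketch below covers the intra-task rules only (regressions and out-of-scope changes); the final task checkpoint additionally requires every acceptance criterion to pass. All type and function names are hypothetical, and it assumes test results have already been collected into name-to-status maps.

```typescript
type TestResults = Record<string, "pass" | "fail">;

interface CheckpointInput {
  before: TestResults;    // results at the last green checkpoint
  after: TestResults;     // results now
  changedFiles: string[]; // files modified since the last checkpoint
  filesInScope: string[]; // files the current task is allowed to touch
}

type CheckpointDecision =
  | { action: "pass" }
  | { action: "rollback"; reasons: string[] };

function decideCheckpoint(input: CheckpointInput): CheckpointDecision {
  const reasons: string[] = [];

  // A test that passed at the last checkpoint and no longer passes is a regression.
  for (const [testName, previous] of Object.entries(input.before)) {
    if (previous === "pass" && input.after[testName] !== "pass") {
      reasons.push(`regression: ${testName}`);
    }
  }

  // Any change outside the task's scope invalidates the checkpoint.
  const inScope = new Set(input.filesInScope);
  for (const file of input.changedFiles) {
    if (!inScope.has(file)) reasons.push(`out-of-scope change: ${file}`);
  }

  return reasons.length === 0 ? { action: "pass" } : { action: "rollback", reasons };
}
```

Note that a newly written test that is still red does not appear in `before`, so it never counts as a regression; only the loss of a previously green state does.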

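When a rollback is triggered, the correct action is to return the working tree to the last green checkpoint and then re-decompose the failing task into two smaller tasks. If the task's work is uncommitted, git stash -u sets it aside (including newly created files) without losing it; to discard it outright, use git checkout -- . followed by git clean -fd, because git checkout alone leaves new untracked files in place. If broken work has already been committed, git reset --hard to the checkpoint's commit hash. The instinct to prompt the model to "just fix the failing test" is understandable but consistently produces compounding debt. A fix that is applied on top of a structurally inconsistent state produces another inconsistency at the next checkpoint, and each successive fix makes the rollback path longer and more expensive.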

One practical note: the git commit after each verified checkpoint is not optional. Commit after Task 1 passes, commit after Task 2 passes, and so on. This gives you a precise rollback point for every task boundary. "The last green checkpoint" should always mean a specific commit hash, not a mental model of where the code was before things went wrong.
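The commit-after-green habit is easy to script. The sketch below is deliberately decoupled from the shell: CommandRunner is a hypothetical callback that runs one command and reports whether it exited with code 0 (in a real project it would probably wrap execFileSync from node:child_process), and commitIfGreen is a hypothetical helper name. The checkpoint commit message format is also an assumption.

```typescript
// A runner executes one command and returns true when it exits with code 0.
type CommandRunner = (program: string, args: string[]) => boolean;

interface VerificationStep {
  program: string;
  args: string[];
}

// Runs every verification step in order; commits only if all of them succeed.
function commitIfGreen(
  taskLabel: string,
  steps: VerificationStep[],
  run: CommandRunner,
): "committed" | "not-committed" {
  for (const step of steps) {
    if (!run(step.program, step.args)) return "not-committed"; // stop at the first failure
  }
  const staged = run("git", ["add", "-A"]);
  const committed = staged && run("git", ["commit", "-m", `checkpoint: ${taskLabel} verified`]);
  return committed ? "committed" : "not-committed";
}

// Verification steps for the repository integration task in this guide.
const repositoryTaskSteps: VerificationStep[] = [
  { program: "npm", args: ["test", "--", "--testPathPattern=user.repository"] },
  { program: "npx", args: ["tsc", "--noEmit"] },
];
```

Passing the runner in as a parameter keeps the decision logic testable without touching a real repository, and it means a failed verification can never reach the commit step.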

Step 4 — Review and iterate with structured context passing

The final failure mode that plan-first discipline addresses is context loss between sessions. If you close Claude Code after Task 1 and open a new session for Task 2, the model has no memory of Task 1. It will read the codebase, infer what happened, and make assumptions — some of which will be wrong. The cost of wrong assumptions in Task 2 is paid in Task 2's checkpoint, but the root cause is that the new session started with an incomplete understanding of the current state.

Structured context passing is the solution. Before closing a session, write a state summary. This is different from the spec: the spec describes the goal; the state summary describes the current reality. It should take two to three minutes to write and should include: which tasks are complete, which files were modified, what the current test status is, and any decisions made during the session that deviate from or clarify the spec. The next session starts by reading both the spec and the state summary before any other prompt.

Context-passing state summary template

# STATE: user-cache feature
# Updated: 2026-06-08T14:30:00Z

## Completed tasks
- [x] Task 1: cache client created and verified
  - Files created: src/cache/cache.client.ts, src/cache/cache.client.test.ts
  - Commit: a3f2b1c
  - Test status: 12/12 passing (npm test -- --testPathPattern=cache.client)
  - Decision: used ioredis v5.3 (not ioredis-mock in production path); mock used only in tests

## In-progress tasks
- [ ] Task 2: integrate cache into UserRepository
  - Started: false
  - Blocker: none

## Current test status
  npm test: 47/47 passing (baseline + cache.client tests)
  tsc --noEmit: 0 errors

## Decisions that deviate from spec
- CacheClient.set() signature uses ttlSeconds: number (first spec draft had ttl: number)
  Renamed for clarity; docs/specs/user-cache.md has been updated to match.

## Next session opening prompt
Read docs/specs/user-cache.md and docs/specs/user-cache.state.md.
You are starting Task 2 (integrate cache into UserRepository).
The cache client is complete at commit a3f2b1c.
Do not modify src/cache/cache.client.ts unless a type error requires it.
Begin by listing the changes you plan to make to user.repository.ts, then wait for my review.

The last line of the opening prompt — "list the changes you plan to make, then wait for my review" — is important. Before the model writes a single line of implementation code for a task that modifies an existing file with dependants, you want to see its plan. A thirty-second review of the planned changes catches most structural misunderstandings before they are written into the codebase. It is not a checkpoint (no verification command is run), but it is the cheapest possible form of alignment confirmation.

For complex features, this review step is also where you catch spec drift in its earliest form: the model proposing to do something reasonable that is nonetheless outside the spec. "I plan to also refactor the findAll() method for consistency" is a reasonable impulse and a scope violation. Catching it before the model writes the code costs you nothing. Catching it after costs you a rollback.

If you are working with production RAG pipelines or other complex multi-component features, the same structured context passing applies between the retrieval layer, the re-ranking layer, and the generation layer — each is a natural task boundary with its own verification command. The approach generalises well beyond single-service features.

Common pitfalls: context loss, spec drift, hallucinated APIs

Even with a written spec and a structured workflow, three failure modes recur often enough to name explicitly.

Context loss at session boundaries

The most common cause of this is not forgetting to write a state summary — it is writing an incomplete one. A state summary that records "Task 1 done" without recording the decisions made during Task 1 is nearly as bad as no summary at all. The decisions are the hard-won information: the discovery that ioredis's set() signature changed in v5, the choice to use a custom serialiser rather than JSON.stringify, the test that had to be modified because the existing implementation had a latent bug. These are the things the next session needs to know. The completion status is the thing that is easiest to guess without a summary; the decisions are the thing that is impossible to guess.
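A lightweight completeness check can catch the "Task 1 done" style of summary before the session is closed. This is a sketch under assumptions: the SessionStateSummary shape is a hypothetical structured version of the markdown state summary, and recording ["none"] when a task involved no decisions is a convention invented here.

```typescript
interface CompletedTaskEntry {
  title: string;
  commit: string;      // hash of the green checkpoint commit
  testStatus: string;  // e.g. "12/12 passing"
  decisions: string[]; // decisions made during the task; ["none"] if there were none
}

interface SessionStateSummary {
  feature: string;
  completed: CompletedTaskEntry[];
  nextSessionPrompt: string;
}

// Lists what a summary is missing before it can safely seed the next session.
function findSummaryGaps(summary: SessionStateSummary): string[] {
  const gaps: string[] = [];
  for (const task of summary.completed) {
    if (task.commit.trim() === "") gaps.push(`${task.title}: missing commit hash`);
    if (task.testStatus.trim() === "") gaps.push(`${task.title}: missing test status`);
    if (task.decisions.length === 0) gaps.push(`${task.title}: no decisions recorded`);
  }
  if (summary.nextSessionPrompt.trim() === "") gaps.push("missing next session opening prompt");
  return gaps;
}

// An incomplete summary: completion status only, no decisions.
const thinSummary: SessionStateSummary = {
  feature: "user-cache",
  completed: [{ title: "Task 1", commit: "a3f2b1c", testStatus: "12/12 passing", decisions: [] }],
  nextSessionPrompt: "Read docs/specs/user-cache.md and docs/specs/user-cache.state.md.",
};

console.log(findSummaryGaps(thinSummary)); // reports that Task 1 has no decisions recorded
```

Requiring an explicit entry even when nothing was decided forces a moment of reflection at the point where the information is still fresh.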

Spec drift

Spec drift happens when the model deviates from the spec and you do not notice until a checkpoint fails, or worse, you never notice at all because the acceptance criteria were underspecified. The fix is two-part: tighter acceptance criteria in the spec (each criterion must be a runnable test, not a description of a property), and the "list your planned changes before writing code" discipline described above. Spec drift that is caught in the planning step costs nothing. Spec drift caught at the task checkpoint costs a rollback. Spec drift that makes it through to the integration checkpoint costs a full feature rollback. The earlier you catch it, the cheaper it is.

Hallucinated APIs

Hallucinated API calls are most common when the spec does not constrain the dependency set. If the spec says "add caching" without specifying the client library and version, the model will sometimes invent methods from an older or newer version of ioredis, or mix ioredis and ioredis-mock APIs, or use a Redis client API from a different library altogether. The constraints section of the spec, and the "do not import from packages not in package.json" instruction, eliminate most of these. The ones that slip through are caught at the first tsc --noEmit checkpoint — another reason to run type-checks early and often, not just at the end of a task. For non-TypeScript projects, an equivalent lint step serves the same function. If you are working on the inference cost and latency trade-offs for your AI infrastructure, note that the same hallucination pattern applies to model API specifications — Claude Code will occasionally reference deprecated parameters or invent convenience methods that do not exist in the SDK version pinned in your lockfile.
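The "only packages already in package.json" constraint can also be checked mechanically, as a complement to the type-check. The sketch below is a rough regex-based scan, not a real parser: it can be fooled by import-like text inside strings or comments, and it treats Node built-ins as safe only when they use the node: prefix. Both function names are hypothetical, and the declared list is assumed to come from the dependencies and devDependencies keys of package.json.

```typescript
// Returns the package name for an import specifier, or null for relative
// paths, absolute paths, and Node built-ins written with the "node:" prefix.
function packageNameOf(specifier: string): string | null {
  if (specifier.startsWith(".") || specifier.startsWith("/") || specifier.startsWith("node:")) {
    return null;
  }
  const parts = specifier.split("/");
  if (specifier.startsWith("@")) return parts.slice(0, 2).join("/"); // scoped: @scope/name
  return parts[0] ?? null;
}

// Finds imported packages that are not in the declared dependency list.
function findUndeclaredImports(source: string, declared: string[]): string[] {
  const allowed = new Set(declared);
  const pattern = /(?:\bfrom\s+|\bimport\s+|\bimport\s*\(\s*|\brequire\s*\(\s*)["']([^"']+)["']/g;
  const missing = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const specifier = match[1];
    if (specifier === undefined) continue;
    const name = packageNameOf(specifier);
    if (name !== null && !allowed.has(name)) missing.add(name);
  }
  return Array.from(missing);
}

// Hypothetical file contents produced during a session.
const generatedSource = [
  'import Redis from "ioredis";',
  'import { readFile } from "node:fs/promises";',
  'import { CacheClient } from "./cache.client";',
  'import { createClient } from "redis";',
].join("\n");

console.log(findUndeclaredImports(generatedSource, ["ioredis"])); // flags "redis" only
```

A check like this does not prove that a method exists on a declared package, which is what tsc --noEmit is for, but it does catch the case where the model reaches for a different library altogether.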

Author's note

"I wasted two full afternoons before I understood that the model was not broken — I was just not giving it a stable reference to work from. The moment I started writing a spec file and pinning it at the start of every session, the sessions started completing on the first run. It felt like a lot of overhead until I timed it: the spec takes eight minutes to write and saves an average of ninety minutes of re-prompting."

— Priya Sharma, AI Engineer, Bangalore

Benchmark: plan-first vs ad-hoc on a 3-feature build

To quantify the workflow's impact, a three-feature build was run twice in controlled conditions: once ad-hoc (experienced engineer, no spec, iterative prompting until the feature passed CI), and once plan-first (same engineer, same features, using the workflow described in this guide). Each feature touched five to eight files across a Node/TypeScript service. CI ran on both AWS Mumbai (ap-south-1) and AWS London (eu-west-2) with identical test suites. The results were consistent across all three features; the table shows the averages. This is an indicative single-observer experiment, not a statistically powered study — treat the numbers as directional, not definitive.

| Metric | Ad-hoc | Plan-first | Improvement |
| --- | --- | --- | --- |
| Total prompts per feature | 24 | 9 | -63% |
| Session re-runs (full restart) | 3.7 | 0.8 | -78% |
| CI failures before first green (Mumbai, ap-south-1) | 6.3 | 1.4 | -78% |
| CI failures before first green (London, eu-west-2) | 6.1 | 1.3 | -79% |
| Time from first prompt to first green CI (minutes) | 147 | 58 | -61% |
| Lines of code changed outside spec scope | 312 | 14 | -96% |
| Context tokens consumed per feature (estimated) | ~280 k | ~168 k | -40% |

The out-of-scope lines-changed figure is the most striking. The ad-hoc workflow produced an average of 312 lines of changes outside the intended scope — refactors, reformats, and speculative improvements that made the diff harder to review and introduced three regressions across the three features. The plan-first workflow produced 14 out-of-scope lines, all of which were trivially identifiable as incidental (import reordering, whitespace normalisation).

The Mumbai and London CI failure counts are nearly identical, which suggests that the workflow's gains are not region-specific and should apply anywhere you run CI. The context token savings of approximately 40% also reduce inference costs, which matters at scale. For teams evaluating whether to fine-tune a model or use the base API, note that context efficiency improvements of this magnitude can shift the cost calculus significantly before fine-tuning is even considered.
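The Improvement column is the relative reduction from the ad-hoc figure, rounded to a whole percent. The short sketch below recomputes it from the table's averages; percentReduction is a name chosen here for illustration.

```typescript
// Relative reduction from the ad-hoc value to the plan-first value, in whole percent.
function percentReduction(adHoc: number, planFirst: number): number {
  return Math.round(((adHoc - planFirst) / adHoc) * 100);
}

// Averages taken from the benchmark table.
const benchmarkRows: Array<[string, number, number]> = [
  ["Total prompts per feature", 24, 9],
  ["Session re-runs (full restart)", 3.7, 0.8],
  ["Time to first green CI (minutes)", 147, 58],
  ["Lines changed outside spec scope", 312, 14],
  ["Context tokens per feature (thousands)", 280, 168],
];

for (const [metric, adHoc, planFirst] of benchmarkRows) {
  console.log(`${metric}: -${percentReduction(adHoc, planFirst)}%`);
}
// Total prompts per feature: -63%
// Session re-runs (full restart): -78%
// Time to first green CI (minutes): -61%
// Lines changed outside spec scope: -96%
// Context tokens per feature (thousands): -40%
```

In absolute terms the same rows mean 15 fewer prompts, about 2.9 fewer restarts, and 89 fewer minutes per feature.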

Conclusion and next steps

The plan-first workflow is four steps: write a spec with goal, constraints, interface contracts, and acceptance criteria; decompose the spec into self-contained sub-agent tasks with declared input and output states; set checkpoints after every task and define rollback signals in advance; pass structured context between sessions using a state summary file. None of these steps are technically complex. All of them require a discipline shift: the habit of writing before prompting.

The natural next steps depend on where your current workflow breaks down. If sessions go off track on the first file, the problem is spec quality: spend more time on the acceptance criteria and interface contracts. If sessions go off track mid-feature, the problem is task decomposition: break tasks down further until each task touches at most two or three files. If work loses coherence from one session to the next, the problem is context passing: make the state summary more detailed and include every decision made, not just the completion status. The workflow in this guide is directly compatible with the broader transition from software engineer to AI engineer; it is one of the foundational practices that distinguishes a disciplined AI builder from an engineer who has adopted a new tool without a new methodology.

Frequently asked

How long should a spec file be before starting a Claude Code session?

For a single feature, a spec file of 200 to 400 words covering goal, constraints, interface contracts, and three to five acceptance criteria is typically sufficient. The goal is not exhaustive documentation — it is giving the model a stable reference it can re-read rather than infer from context. Longer specs are appropriate for cross-cutting concerns that affect multiple modules; for a focused two-file change, even a ten-line markdown block is useful as long as it states the acceptance criteria explicitly. If writing the spec takes longer than ten minutes, that is often a signal the feature is under-defined at the product level, not a Claude Code problem.

What is the difference between a sub-agent task and a regular Claude Code prompt?

A regular prompt is conversational and relies on the model tracking state across the scroll history of a session. A sub-agent task is a self-contained unit of work with a declared input state (files, types, passing tests), a declared output state (what must be true when the task is done), and an explicit verification step. The distinction matters because Claude Code context windows are finite and conversation history degrades with length — a structured task can be re-run from scratch with just the spec and a short state summary, whereas a conversational session that has drifted through fifteen turns of back-and-forth cannot easily be recovered.

When should I trigger a rollback instead of prompting for a fix?

Roll back when the checkpoint verification reveals that the output state differs from the spec in a structural way — wrong interface shape, a broken import that affects other modules, a migration that altered a table schema unexpectedly, or a type error that propagates beyond the current task's boundary. Prompt for a fix when the failure is localised — a test that fails for a known reason, a lint warning, a missing null check. The practical heuristic is: if you can describe the full scope of the fix in one sentence and it touches only files that belong to the current task, prompt for it. If you cannot, roll back to the last green checkpoint and re-decompose the task more narrowly.

How do I pass context between Claude Code sessions without pasting the entire conversation history?

Maintain a structured state summary file — a short markdown document that records the current task, what was completed in the previous session, which files were modified, what the current test status is, and any decisions made (with the reasoning). Open each new session by reading this file before any other prompt. This is more reliable than scroll history because it is explicit, diff-able, and can be reviewed by a human before the next session starts. The file should be updated at every checkpoint, not just at the end of a session, so that an interrupted session can always be resumed cleanly.

Does the plan-first approach work with Cursor and other tools, or only Claude Code?

The four-step pattern (spec file, task decomposition, checkpoints, structured context passing) works with any AI coding assistant that accepts file context, which includes Cursor, Aider, GitHub Copilot Workspace, and any tool that lets you reference a file at the start of a session. The mechanics differ slightly: Cursor uses .cursorrules and the @ symbol to attach files; Aider takes files on the command line and has a --read flag for read-only reference files such as a spec; Claude Code reads files you explicitly reference in your initial prompt. The underlying principle is the same regardless of tool: the model needs a stable, machine-readable source of truth it can re-consult, not just the recent scroll history of an ambient conversation.
