Why Hiring Managers Check GitHub Before Your CV
The shift happened gradually, then all at once. By mid-2026, the proof-of-work expectation that once applied only to senior engineers has spread across every AI-adjacent role. Hiring managers at well-funded AI-native companies — and increasingly at traditional enterprises standing up AI teams — no longer treat a CV as evidence of capability. They treat it as a shortlist filter. Your GitHub, your Hugging Face spaces, your portfolio demo links: these are where capability is demonstrated.
The numbers support the urgency. As of June 2026, AI Engineer is the fastest-growing job title in the United States, with job postings rising 143% year-on-year in 2025. AI postings now make up 2.5% of all US job listings, a 55% jump year-on-year. Average salaries for AI engineers have reached $206,000 in 2026, up $50,000 from the previous year. The US alone projects 1.3 million AI job openings over the next two years, against a supply of fewer than 645,000 candidates.
The talent gap is real and structural. Many of the strongest agentic AI hires in 2026 are not coming from computer science PhD programmes or big-tech ML teams. They are coming from people with domain expertise — in fintech, healthtech, logistics, legal — who have demonstrated AI fluency through what they have actually built and shipped. That demonstration lives in your portfolio.
In London's fintech and healthtech scenes, and across funded AI startups in Bengaluru, Hyderabad, and Mumbai, the same pattern holds: technical hiring leads want to see evidence of agentic thinking. Not slide decks. Not certification badges. Code that runs, agents that loop, pipelines with eval scores attached.
This guide gives you five projects that serve as that evidence. Each one is scoped to be completable in a weekend or two of focused work. Each one maps directly to skills that are repeatedly cited in 2026 job descriptions for AI engineers, ML engineers, and agentic AI product builders. And each one is designed to grow — starting simple enough for someone new to agentic AI, extensible enough to serve as a conversation starter in a senior engineering interview.
What "Proof of Work" Actually Means in 2026
Before diving into the projects themselves, it is worth being precise about what counts and what does not. The following table summarises the signals that experienced technical interviewers describe as meaningful versus the noise that does not move the needle.
| What counts as proof of work | What does not move the needle |
|---|---|
| A GitHub repo with commits, tests, and a clear README | A certificate from an online course platform |
| An eval harness with a golden test set and pass rates | A project that wraps a chat API with no evaluation |
| A recorded walkthrough showing real inputs, tool calls, and output | A screenshot of a successful response |
| A deployed demo (even on a free tier) | A notebook that ran once and was never cleaned up |
| A project that handles failure — retry logic, fallback, graceful errors | A project that only works on the happy path |
| A write-up explaining what you would do differently next time | A README that only documents the happy path |
| Open-source contributions or a pull request to an agent framework | A list of frameworks you have "experience with" |
The underlying principle is simple: hiring managers are pattern-matching for builders who have shipped something that had to work under real-world conditions. That means the agent had to recover from a bad API response. The retrieval had to handle a query that did not match anything in the index. The evaluation had to flag a regression rather than just pass everything.
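Recovering from a bad API response usually comes down to a bounded retry with a fallback. The sketch below is one minimal way to do it, not a prescribed pattern: `call_with_retry`, its parameters, and the injectable `sleep` are illustrative names, and the broad `except Exception` should be narrowed to the error types your client actually raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], fallback: T, max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying with exponential backoff; return fallback if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # assumption: narrow this to your client's error types
            if attempt == max_attempts - 1:
                print(f"giving up after {max_attempts} attempts: {exc}")
                return fallback
            # Exponential backoff plus a little jitter: ~0.5s, ~1s, ~2s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return fallback  # only reached when max_attempts is 0
```

Returning an explicit fallback instead of raising keeps the agent loop alive, and the logged "giving up" line is the kind of evidence worth showing in a demo.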
Add a LESSONS.md or a "What I would do differently" section to every portfolio project README. It signals intellectual honesty and shows you are thinking like an engineer, not just a tutorial follower. Hiring managers explicitly look for this.
Project 1: Autonomous Research Agent
What it demonstrates
An autonomous research agent is the canonical first agentic AI project — and it remains one of the most valued in 2026 precisely because it requires you to solve the core challenges of agent design: search-loop construction, result synthesis, citation tracking, and hallucination mitigation. It shows you understand the difference between a chain and a loop, and that you have thought about when the agent should stop.
What to build
The target: an agent that takes a research question, breaks it into sub-questions, searches the web for each, synthesises the results into a structured report with citations, and evaluates its own confidence before returning output. The agent should run for a configurable number of search rounds and exit when it meets a confidence threshold or exhausts its budget.
A minimal but impressive implementation has four components:
- Query decomposition — break the input question into 3–5 targeted sub-queries using an LLM call with a structured output schema.
- Search loop — call a search API (Brave Search, Tavily, or SerpAPI all offer free tiers) for each sub-query, extract the top 3–5 results, and fetch the page content for each.
- Synthesis layer — pass the fetched content plus the original question to a second LLM call that produces a structured report: executive summary, key findings, evidence table, and confidence score.
- Confidence gate — if the confidence score is below a threshold, run a second search round targeting the gaps identified in the synthesis. Log the loop count so you can show it in your demo.
```python
import asyncio
import json
from dataclasses import dataclass, field

import anthropic

# Async client, so the LLM calls can be awaited inside the async functions below.
client = anthropic.AsyncAnthropic()
MODEL = "claude-opus-4-5"


@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str
    content: str


@dataclass
class ResearchReport:
    question: str
    summary: str
    key_findings: list[str]
    sources: list[str]
    confidence: float
    gaps: list[str] = field(default_factory=list)
    rounds: int = 0


async def your_search_api(query: str, num_results: int = 4):
    """Placeholder: call Tavily, Brave Search, or SerpAPI here.

    Should return objects exposing .url, .title, and .snippet.
    """
    raise NotImplementedError("plug in your chosen search API")


async def fetch_page_text(url: str) -> str:
    """Placeholder: download the page and return its readable text."""
    raise NotImplementedError("plug in your page fetcher")


async def decompose_query(question: str) -> list[str]:
    """Break a research question into targeted sub-queries."""
    response = await client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Break this research question into 3-5 targeted web search queries.
Return as a JSON array of strings.
Question: {question}"""
        }]
    )
    # Assumes the model returns bare JSON; validate before relying on this.
    return json.loads(response.content[0].text)


async def search_and_fetch(query: str) -> list[SearchResult]:
    """Search and retrieve page content for a query."""
    results = await your_search_api(query, num_results=4)
    fetched = []
    for r in results:
        content = await fetch_page_text(r.url)
        fetched.append(SearchResult(
            url=r.url,
            title=r.title,
            snippet=r.snippet,
            content=content[:3000]  # truncate for context budget
        ))
    return fetched


async def synthesise(question: str, results: list[SearchResult]) -> ResearchReport:
    """Synthesise search results into a structured research report."""
    context = "\n\n".join(
        f"SOURCE: {r.url}\n{r.content}" for r in results
    )
    response = await client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""You are a research analyst. Based on the sources below, answer:
{question}
Respond as JSON with keys:
- summary (string, 2-3 sentences)
- key_findings (array of strings)
- sources (array of URLs cited)
- confidence (float 0-1, how confident you are the question is answered)
- gaps (array of strings, what is still uncertain)
SOURCES:
{context}"""
        }]
    )
    data = json.loads(response.content[0].text)
    # Build the dataclass explicitly so the return type matches the annotation.
    return ResearchReport(
        question=question,
        summary=data["summary"],
        key_findings=data["key_findings"],
        sources=data["sources"],
        confidence=float(data["confidence"]),
        gaps=data.get("gaps", []),
    )


async def research(question: str, max_rounds: int = 3,
                   confidence_threshold: float = 0.8) -> ResearchReport:
    """Run the full research loop."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    all_results: list[SearchResult] = []
    sub_queries = await decompose_query(question)
    for round_num in range(1, max_rounds + 1):
        for q in sub_queries:
            all_results.extend(await search_and_fetch(q))
        report = await synthesise(question, all_results)
        report.rounds = round_num
        print(f"Round {round_num}: confidence={report.confidence:.2f}, "
              f"sources={len(report.sources)}")
        if report.confidence >= confidence_threshold:
            break
        # Target the reported gaps next round; if none, retry the first two queries.
        sub_queries = report.gaps or sub_queries[:2]
    return report


# Usage, once the two placeholders are implemented:
# report = asyncio.run(research("your research question"))
```
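Both LLM calls above hand the raw response text straight to `json.loads`, which fails as soon as the model wraps its answer in a code fence or adds a sentence of preamble. A tolerant parser is one way to harden that step. The helper below is a sketch; `extract_json` and `FENCE` are illustrative names, and it assumes the response contains at most one JSON value.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly in Markdown
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Parse JSON from model output, tolerating code fences and surrounding prose."""
    text = text.strip()
    # Case 1: the JSON sits inside a fenced block.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Case 2: the JSON is embedded in prose; take the outermost array or object.
    starts = [i for i in (text.find("["), text.find("{")) if i != -1]
    start = min(starts, default=-1)
    end = max(text.rfind("]"), text.rfind("}"))
    if start == -1 or end <= start:
        raise ValueError("no JSON found in model output")
    return json.loads(text[start:end + 1])
```

A failed parse is itself a useful signal: logging it and retrying the call once is usually better than letting the exception end the research loop.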
What to measure and show
The most impressive portfolio presentation of this project shows the agent running over 3–5 diverse research questions with a table of results: question, rounds taken, confidence score achieved, and a link to the generated report. Include at least one example where the agent ran a second loop because its first-round confidence was low. That shows the loop actually works.
Save every intermediate round's output to a JSON file. This gives you a "reasoning trace" you can show in your demo — hiring managers love seeing the agent's internal deliberation, not just the final answer.
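A trace writer for this can be very small. The sketch below writes one file per round; `save_round_trace`, the `round_NN.json` naming, and the record fields are assumptions chosen for illustration, so adapt them to whatever your report structure contains.

```python
import json
import time
from pathlib import Path


def save_round_trace(trace_dir: str, question: str, round_num: int,
                     sub_queries: list[str], report: dict) -> Path:
    """Write one round's queries and synthesis output to its own JSON file."""
    out_dir = Path(trace_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"round_{round_num:02d}.json"
    record = {
        "question": question,
        "round": round_num,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "sub_queries": sub_queries,
        "report": report,
    }
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False),
                    encoding="utf-8")
    return path
```

Calling it once per loop iteration, right after synthesis, gives you a directory you can commit alongside the final report.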
Project 2: Code-Iterating Agent
What it demonstrates
The generate-test-fix loop is the core of agentic software development. Building a code-iterating agent shows you understand tool use, sandboxed execution, and the feedback-loop architecture that underpins tools like Claude Code and GitHub Copilot Workspace. It is one of the most direct demonstrations of agentic AI engineering competence in 2026.
What to build
The target: an agent that takes a natural-language programming task, generates a solution, runs the code in a sandbox, reads the test output, and iterates until the tests pass or a maximum attempt count is reached. The iteration loop — not the initial generation — is what matters here. Any good LLM can generate code. Only a well-designed agent can fix its own failures.
The four-stage loop:
- Generation — generate a function or module that satisfies a natural-language spec. Include the test file in the prompt so the agent knows what "passing" looks like.
- Execution — run the generated code plus the test suite in an isolated subprocess or Docker container. Capture stdout, stderr, and exit code.
- Diagnosis — pass the failure output back to the LLM with the instruction to identify the root cause and produce a corrected version. This is the step most tutorial implementations skip.
- Loop control — track attempt count, detect when the same error recurs (a sign the model is stuck), and fall back gracefully if the maximum is reached.
```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional

import anthropic

client = anthropic.Anthropic()


def run_in_sandbox(code: str, tests: str, timeout: int = 30) -> dict:
    """Execute code + tests in a temporary directory, return results.

    A subprocess with a timeout is the minimum level of isolation, not a full sandbox.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        code_file = Path(tmpdir) / "solution.py"
        test_file = Path(tmpdir) / "test_solution.py"
        code_file.write_text(code)
        test_file.write_text(tests)
        try:
            result = subprocess.run(
                # sys.executable guarantees the same interpreter (and pytest install).
                [sys.executable, "-m", "pytest", str(test_file), "-v", "--tb=short"],
                capture_output=True, text=True, cwd=tmpdir, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # Treat a hang (for example an infinite loop) as an ordinary failed attempt.
            return {"exit_code": -1, "stdout": "",
                    "stderr": f"Timed out after {timeout}s", "passed": False}
        return {
            "exit_code": result.returncode,
            "stdout": result.stdout[-3000:],  # tail to stay within context
            "stderr": result.stderr[-1000:],
            "passed": result.returncode == 0,
        }


def generate_code(spec: str, tests: str, previous_attempt: Optional[str] = None,
                  error_output: Optional[str] = None) -> str:
    """Generate or fix code given spec, tests, and optional error context."""
    if previous_attempt and error_output:
        prompt = f"""Fix this Python code so it passes all tests.
SPECIFICATION:
{spec}
FAILING CODE:
{previous_attempt}
TEST FAILURE OUTPUT:
{error_output}
Return ONLY the corrected code. No explanation."""
    else:
        prompt = f"""Write a Python solution for the following specification.
SPECIFICATION:
{spec}
TEST SUITE (your code must pass all tests):
{tests}
Return ONLY the solution code. No explanation, no markdown fences."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()


def error_signature(output: str) -> str:
    """Reduce pytest output to its failure lines.

    Raw output contains temp paths and timings that differ on every run,
    so comparing it verbatim would never detect a repeated error.
    """
    lines = [ln.strip() for ln in output.splitlines()
             if ln.startswith(("E ", "FAILED", "ERROR"))]
    return "\n".join(lines) or output


def code_iterate(spec: str, tests: str, max_attempts: int = 5) -> dict:
    """Run the generate-test-fix loop until tests pass or limit is reached."""
    code: Optional[str] = None
    last_error: Optional[str] = None
    previous_signature: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(
            spec=spec,
            tests=tests,
            previous_attempt=code,
            error_output=last_error,
        )
        result = run_in_sandbox(code, tests)
        print(f"Attempt {attempt}: {'PASS' if result['passed'] else 'FAIL'}")
        if result["passed"]:
            return {"code": code, "attempts": attempt, "passed": True}
        last_error = result["stdout"] + result["stderr"]
        # Detect a stuck loop: the same failure twice in a row signals escalation needed.
        signature = error_signature(last_error)
        if signature == previous_signature:
            print("Agent stuck: same error on consecutive attempts, stopping early.")
            return {"code": code, "attempts": attempt, "passed": False, "stuck": True}
        previous_signature = signature
    return {"code": code, "attempts": max_attempts, "passed": False, "stuck": False}
```
What signals competence
The signal hiring managers read in this project is not "it generates correct code on the first try." It is the sophistication of your error handling and loop design. Show a test run where the agent needed three or four attempts and succeeded. Show a test run where it hit the stuck-loop detection and reported failure gracefully rather than spinning. Both outcomes demonstrate engineering judgment.
Never run generated code without a sandbox. At minimum, use a subprocess with a timeout. For a portfolio project you intend to show publicly, use Docker isolation. Hiring managers from security-conscious environments (UK fintech, healthcare) will ask how you handle adversarial code generation.
Project 3: RAG Pipeline with an Eval Harness
What it demonstrates
Retrieval-Augmented Generation is now table stakes for AI engineering roles. But simply building a RAG pipeline is no longer impressive — every tutorial covers it. What separates a portfolio-grade RAG project in 2026 is the evaluation harness: a golden test set with measurable pass rates, retrieval quality metrics, and a clear story about what you measured, what failed, and how you improved it. Evals matter more than the pipeline itself.
What to build
Build a domain-specific RAG pipeline over a corpus you care about — legislation, research papers, product documentation, Hansard transcripts, or the UK Companies House dataset are all good options — and pair it with an evaluation suite that measures the three core failure modes: retrieval failure (right question, wrong chunks retrieved), generation failure (right chunks, wrong answer), and hallucination (answer not grounded in retrieved content).
The architecture:
- Ingestion — chunk your corpus, embed each chunk, store in a vector database (Chroma, Qdrant, or Weaviate all have free local modes). Record chunk metadata: source, page, section.
- Retrieval — hybrid search (BM25 + dense retrieval) outperforms either alone on most real corpora. Use a reranker (Cohere Rerank or a cross-encoder) if your corpus exceeds 10,000 chunks.
- Generation — pass the top-k retrieved chunks plus the query to an LLM with a strict system prompt: "Answer only from the provided context. If the context does not contain the answer, say so."
- Eval harness — the part most projects skip. See below.
For the eval harness, build a golden test set of at least 50 question-answer pairs where you know the correct answer and the source chunk. For each test case, measure:
| Metric | What it measures | How to compute |
|---|---|---|
| Retrieval recall@k | Was the source chunk in the top-k retrieved? | Check if ground-truth chunk ID is in the top-k results |
| Answer correctness | Is the answer factually correct? | LLM-as-judge: compare generated vs. golden answer, return 0/1 |
| Faithfulness | Is the answer grounded in retrieved context? | LLM-as-judge: can each claim in the answer be cited to a retrieved chunk? |
| No-context refusal rate | Does the agent correctly say "I don't know" when context is absent? | Run queries with correct chunks deliberately excluded; count correct refusals |
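Retrieval recall@k is the simplest of these to compute programmatically. The function below is a minimal sketch that assumes each test case is a dict with a `gold_chunk_id` and a ranked `retrieved_ids` list; those field names are illustrative, not a standard.

```python
def recall_at_k(cases: list[dict], k: int = 5) -> float:
    """Fraction of cases whose ground-truth chunk ID appears in the top-k retrieved IDs."""
    if not cases:
        return 0.0
    hits = sum(
        1 for case in cases
        if case["gold_chunk_id"] in case["retrieved_ids"][:k]
    )
    return hits / len(cases)
```

Reporting the metric at two or three values of k (for example 1, 3, and 5) shows whether failures are ranking problems or outright misses.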
Run your eval suite before and after any change to your retrieval or generation config, and commit both scorecard files to your repo. A portfolio that shows eval scores improving over commits is far more compelling than one that just shows a final score. It demonstrates iterative engineering, not a one-shot build.
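Comparing the two scorecards can be automated so that a regression is flagged rather than noticed by eye. The sketch below assumes each scorecard is a flat mapping of metric name to a score where higher is better; `find_regressions` and the `tolerance` default are illustrative choices.

```python
def find_regressions(before: dict[str, float], after: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, tuple[float, float]]:
    """Return metrics (higher is better) that dropped by more than `tolerance`."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }


# Illustrative values only, not measured results.
before = {"recall_at_5": 0.82, "faithfulness": 0.90}
after = {"recall_at_5": 0.88, "faithfulness": 0.85}
print(find_regressions(before, after))  # flags faithfulness, not recall_at_5
```

The tolerance matters because LLM-as-judge scores vary slightly between runs; without it, noise gets reported as regression.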
For further guidance on building evaluation suites and golden test sets, see our guide LLM Evaluation Suite: Golden Sets and Judge Models.
Project 4: Multi-Agent Orchestrator
What it demonstrates
The move from a single agent to a multi-agent system is the architectural leap that separates mid-level AI engineers from senior ones in 2026. A well-designed multi-agent orchestrator demonstrates systems thinking: you have to reason about task decomposition, context handoff between agents, failure propagation, and when to route to which specialist. This is the project that tends to generate the longest technical interviews.
What to build
Build an orchestrator agent that receives a complex task, breaks it into sub-tasks, delegates each sub-task to a specialist subagent, collects the results, and synthesises a final output. The key design decisions — how the orchestrator decides to delegate, how subagents report back, and how the system handles partial failure — are the things hiring managers will ask about in detail.
A recommended scoping: a research and writing pipeline with three specialist agents:
- Research agent — given a topic, performs web searches, fetches pages, and returns structured findings with citations. (This is Project 1, promoted to a subagent.)
- Critique agent — given a draft piece of writing, evaluates it for factual accuracy, logical consistency, and completeness. Returns a list of issues with severity ratings.
- Synthesis agent — given research findings and critique feedback, produces a final polished output that addresses all issues flagged by the critique agent.
The orchestrator's job is to:
- Decide whether the task requires research, critique, synthesis, or some combination.
- Pass context efficiently between agents without re-processing everything from scratch.
- Handle the case where a subagent fails or returns low-confidence output.
- Maintain a task graph so you can visualise and log the full execution trace.
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

import anthropic

client = anthropic.Anthropic()


class AgentRole(Enum):
    ORCHESTRATOR = "orchestrator"
    RESEARCHER = "researcher"
    CRITIC = "critic"
    SYNTHESISER = "synthesiser"


@dataclass
class AgentMessage:
    role: AgentRole
    content: str
    confidence: float = 1.0
    metadata: dict = field(default_factory=dict)


@dataclass
class TaskNode:
    task_id: str
    description: str
    assigned_agent: AgentRole
    status: str = "pending"  # pending | running | done | failed
    result: Any = None
    children: list = field(default_factory=list)


class MultiAgentOrchestrator:
    def __init__(self, model: str = "claude-opus-4-5"):
        self.model = model
        self.task_graph: list[TaskNode] = []
        self.message_log: list[AgentMessage] = []

    def _call_agent(self, role: AgentRole, system: str,
                    user_message: str) -> AgentMessage:
        """Call an LLM-backed agent and return a structured message."""
        # Record every hop in the task graph so the execution trace can be logged.
        node = TaskNode(
            task_id=f"t{len(self.task_graph) + 1}",
            description=user_message[:80],
            assigned_agent=role,
            status="running",
        )
        self.task_graph.append(node)
        try:
            response = client.messages.create(
                model=self.model,
                max_tokens=2048,
                system=system,
                messages=[{"role": "user", "content": user_message}]
            )
        except Exception:
            node.status = "failed"
            raise
        msg = AgentMessage(
            role=role,
            content=response.content[0].text,
        )
        node.status, node.result = "done", msg
        self.message_log.append(msg)
        return msg

    def research(self, topic: str) -> AgentMessage:
        return self._call_agent(
            role=AgentRole.RESEARCHER,
            system="You are a research specialist. Return structured findings with source citations.",
            user_message=f"Research this topic thoroughly: {topic}"
        )

    def critique(self, draft: str, research: str) -> AgentMessage:
        return self._call_agent(
            role=AgentRole.CRITIC,
            system="You are a critical reviewer. Identify factual errors, gaps, and logical inconsistencies.",
            user_message=f"Critique this draft against the research:\n\nDRAFT:\n{draft}\n\nRESEARCH:\n{research}"
        )

    def synthesise(self, research: str, critique: str,
                   original_task: str) -> AgentMessage:
        return self._call_agent(
            role=AgentRole.SYNTHESISER,
            system="You are a synthesis specialist. Produce a final polished output addressing all critique points.",
            user_message=f"Task: {original_task}\n\nResearch:\n{research}\n\nCritique to address:\n{critique}"
        )

    def run(self, task: str) -> str:
        print(f"Orchestrator: routing task: {task[:80]}...")
        # Step 1: Research
        research_msg = self.research(task)
        print(f"  Researcher: returned {len(research_msg.content)} chars")
        # Step 2: Generate a first draft via synthesis (fast pass, no critique yet)
        draft_msg = self.synthesise(research_msg.content, "", task)
        # Step 3: Critique the draft
        critique_msg = self.critique(draft_msg.content, research_msg.content)
        print("  Critic: identified issues")
        # Step 4: Final synthesis incorporating critique
        final_msg = self.synthesise(research_msg.content, critique_msg.content, task)
        print("  Synthesiser: final output ready")
        return final_msg.content
```
Why architectural thinking matters here
The portfolio presentation of this project should include an architecture diagram showing the agent graph, a log of a complete execution trace, and a section discussing design decisions: why you chose this decomposition, what you tried that did not work, and how you would extend it to handle a failure in one subagent without cascading to the whole pipeline. These are exactly the questions that come up in senior AI engineering interviews at companies in London, Bengaluru, and New York in 2026.
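One way to keep a single subagent failure from cascading is to wrap each call so that an exception becomes data the orchestrator can route on. The sketch below is an assumption about how you might structure that, not part of any framework; `StepResult` and `run_step` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    name: str
    ok: bool
    output: str = ""
    error: Optional[str] = None


def run_step(name: str, fn: Callable[[], str], fallback: str = "") -> StepResult:
    """Run one subagent call; convert any exception into a failed StepResult."""
    try:
        return StepResult(name=name, ok=True, output=fn())
    except Exception as exc:  # assumption: narrow to your client's error types
        return StepResult(name=name, ok=False, output=fallback, error=str(exc))
```

With this in place the orchestrator can decide per step: if the critic fails, ship the first draft with a note that it was not reviewed; if the researcher fails, stop, because nothing downstream can proceed without findings.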
Latency compounds in multi-agent pipelines. A pipeline that makes four sequential LLM calls at two seconds each takes at least eight seconds, before retries and before tool calls. Build latency logging into every agent hop from day one. Hiring managers at latency-sensitive companies (fintech, real-time applications) will ask about this immediately.
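Per-hop latency logging needs nothing beyond the standard library. The context manager below is a minimal sketch; `timed_hop` and the module-level `latency_log` list are illustrative, and in a real project you would likely write to your structured logger instead.

```python
import time
from contextlib import contextmanager

latency_log: list[dict] = []


@contextmanager
def timed_hop(agent_name: str):
    """Record wall-clock duration for one agent hop, even if the call raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latency_log.append({
            "agent": agent_name,
            "seconds": time.perf_counter() - start,
        })
```

Wrap each agent call in `with timed_hop("researcher"):` and sum the `seconds` values at the end of a run to see where the time actually goes.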
Project 5: Production Evaluation Suite
What it demonstrates
A standalone evaluation suite — one that can be run against any LLM or any version of your agent — is one of the most underrepresented and most valued portfolio artefacts in 2026. Most builders build a pipeline and then manually eyeball the outputs. The builders who stand out have built a repeatable, automated process for measuring whether their system is getting better or worse. This is the project that best predicts whether someone will be able to maintain and improve an AI system in production.
What to build
Build an evaluation framework with three components:
- A golden dataset — at least 100 input-output pairs, hand-curated or sourced from real usage, covering the full distribution of tasks your system is expected to handle, including edge cases and failure modes.
- An automated runner — a script that takes a model name or agent config, runs the golden dataset through it, and produces a scorecard in a consistent format (JSON + HTML report).
- An LLM-as-judge layer — for tasks where correctness cannot be checked programmatically (open-ended generation, summarisation, multi-step reasoning), use a separate LLM call to score each output against a rubric. Log the judge's reasoning, not just the score.
| Task type | Primary metric | Secondary metric | Judge type |
|---|---|---|---|
| Classification | Accuracy | F1 by class | Programmatic |
| Extraction | Precision / Recall | Field-level F1 | Programmatic |
| Summarisation | Faithfulness (0–1) | Completeness (0–1) | LLM-as-judge |
| Code generation | Test pass rate | Attempt count to pass | Execution-based |
| Tool-calling | Correct tool selected | Correct args provided | Programmatic |
| Multi-step reasoning | Final answer correctness | Step validity rate | LLM-as-judge |
| RAG / grounded Q&A | Faithfulness | Retrieval recall@k | LLM-as-judge + programmatic |
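For the programmatic rows of this table, the automated runner can be a single function. The sketch below scores by case-insensitive exact match, which suits classification-style tasks only; `run_eval`, the case fields (`id`, `input`, `expected`), and the scorecard keys are assumptions for illustration.

```python
from typing import Callable


def run_eval(system: Callable[[str], str], golden: list[dict], label: str) -> dict:
    """Run every golden case through `system` and score by exact match."""
    results = []
    for case in golden:
        try:
            output = system(case["input"])
        except Exception as exc:  # a crash counts as a failed case, not a failed run
            output = f"<error: {exc}>"
        results.append({
            "id": case["id"],
            "passed": output.strip().lower() == case["expected"].strip().lower(),
            "output": output,
        })
    passed = sum(r["passed"] for r in results)
    return {
        "label": label,
        "total": len(results),
        "passed": passed,
        "accuracy": passed / len(results) if results else 0.0,
        "results": results,
    }
```

Because `system` is just a callable, the same runner works for a bare model call, a full agent, or a stub, which is what makes the scorecards comparable across versions.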
Version your golden dataset alongside your code with a dataset_version field in each test case. When you add new test cases, record why — a regression you caught, a new task variant, an edge case from real usage. This changelog is itself a demonstration of engineering maturity that stands out in portfolios.
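A versioned test case can be as simple as one JSON object per line. The sketch below shows one possible shape: the `dataset_version` field comes from the advice above, while `GoldenCase`, `added_reason`, and the other field names are assumptions you should rename to fit your project.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input: str
    expected: str
    dataset_version: str  # version of the dataset in which this case was added
    added_reason: str     # why it exists: regression, new variant, edge case


def load_golden(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per line, skipping blank lines."""
    return [GoldenCase(**json.loads(line))
            for line in jsonl_text.splitlines() if line.strip()]


def cases_per_version(cases: list[GoldenCase]) -> dict[str, int]:
    """Count how many cases each dataset version introduced."""
    return dict(Counter(case.dataset_version for case in cases))
```

The per-version counts double as the changelog summary: a reader can see at a glance how the dataset grew and look up the `added_reason` for any case.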
For a deeper dive into building evaluation suites with golden sets and judge models, see LLM Evaluation Suite: Golden Sets and Judges.
How to Present Your Portfolio
Even the best portfolio projects can fail to land interviews because of poor presentation. Here is a battle-tested structure for every project README and walkthrough.
The five-question README structure
Every portfolio project README should answer these five questions, in this order, without requiring the reader to scroll more than one screen to get through them:
- What problem does this solve? — One sentence. "This agent automates competitive research by autonomously searching, synthesising, and evaluating sources — producing a cited report in under two minutes."
- What does the agent do, step by step? — A numbered list of five to eight steps. Include a simple architecture diagram if you can.
- How do I run it? — A copy-paste-able block of four or five commands that actually works on a clean environment. Test it.
- What do the eval results show? — A table of your key metrics with a timestamp. Even a small golden set (20 cases) with honest numbers is better than no numbers.
- What would you do next? — Two or three concrete extensions. Shows you are thinking beyond the tutorial.
The demo walkthrough
Record a two-to-three minute screen capture of your agent running. Use a tool like Loom or OBS. The walkthrough should show:
- A real, non-trivial input — not "hello world" or a toy example.
- The agent's intermediate steps — tool calls, search results, intermediate outputs, loop counts.
- An edge case where something goes slightly wrong and the agent handles it.
- The final output, with the eval metric for that run if applicable.
Link the recording prominently at the top of your README. Hiring managers will watch a two-minute video where they would not read a ten-paragraph description.
Publishing your eval results
Commit your eval scorecard as a JSON file and render it as a table in your README using a GitHub Actions workflow. This means every time you push a change, the scorecard updates automatically. A README with a live eval badge is a remarkably strong signal — it shows you treat AI development like software development, with automated quality gates. This is exactly the culture that well-run AI teams in London, Bengaluru, and beyond are trying to build.
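The script such a workflow runs can be short. The sketch below renders a scorecard as a Markdown table and swaps it into the README between two marker comments. The marker strings and function names are a hypothetical convention, not something GitHub provides, and the scorecard is assumed to be a flat mapping of metric name to score.

```python
def scorecard_to_markdown(scorecard: dict[str, float]) -> str:
    """Render a flat metric -> score mapping as a Markdown table."""
    lines = ["| Metric | Score |", "|---|---|"]
    for name, value in sorted(scorecard.items()):
        lines.append(f"| {name} | {value:.2f} |")
    return "\n".join(lines)


def update_readme(readme_text: str, table: str,
                  start: str = "<!-- eval:start -->",
                  end: str = "<!-- eval:end -->") -> str:
    """Replace whatever sits between the two markers with the new table."""
    before, found_start, rest = readme_text.partition(start)
    _, found_end, after = rest.partition(end)
    if not found_start or not found_end:
        raise ValueError("README is missing the eval markers")
    return f"{before}{start}\n{table}\n{end}{after}"
```

Raising when the markers are missing is deliberate: a workflow that silently skips the update leaves a stale scorecard in the README, which is worse than a failed build.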
See our Tips hub and the AI engineer career roadmap guide for further reading on how to position your skills for the 2026 market.
Turning Your Portfolio Into a Verified Builder Profile
Building the projects in this guide is the first step. Making sure the right people can find them is the second. A Verified Builder profile on AI Tech Connect functions as a public proof-of-work signal that works independently of your LinkedIn, your GitHub bio, and your CV.
Here is how it works in practice: hiring managers and technical leads at companies actively recruiting AI engineers use the Builder directory to shortlist candidates. They can browse by skill, project type, and location. They can select up to five builders and request contact details by email — without needing a LinkedIn account or a recruiter relationship. For candidates in Bengaluru, Hyderabad, and Mumbai targeting remote-first UK and US roles, and for UK-based builders targeting London fintech and healthtech teams, this is a direct channel that most job boards do not offer.
Your Verified Builder profile links your five portfolio projects, your skills stack, your work history, and a short builder bio in one place. The profile is public and indexable — it surfaces in search results for your name and your skill keywords. Hiring managers who encounter your portfolio projects on GitHub will often search for your name directly, and a well-structured AITC profile is one of the first things they find.
Founding Builder is the highest-visibility tier. As of June 2026, Founding Builder slots are still available, and the window closes once the directory reaches critical mass. Founding Builders appear first in search results, carry a permanent badge on their profiles, and are the cohort that the AITC team promotes to partner companies hiring in the next quarter.
If you have built even one of the five projects in this guide and can share a working demo or a public repository, that is sufficient to submit a Verified Builder profile. The bar is demonstrated output, not a complete five-project portfolio. Start with your strongest project, add the others over time, and let your public proof of work compound.
Browse existing Verified Builder profiles to see the format, and read the latest AI industry news to stay current on the skills and frameworks that are in demand. The talent gap is real — the question is whether you are positioned to be found when it matters.