When a company hires its first AI engineer, the rubric is ad hoc — often borrowed from software engineering hiring with a few AI keywords added. We covered that scenario separately in our guide to the first AI hire. This article is about a different and more systematic situation: what happens when a scale-up or established tech company — one that already has a functioning AI team — opens a new AI engineering role. At that point, the process is deliberate and repeatable, and it scores against dimensions that most candidates do not even know exist.
Understanding this rubric is the highest-leverage preparation a candidate can do. The AI talent shortage is real: as our coverage of AI engineer hiring and retention trends shows, demand for qualified AI engineers continues to outstrip supply significantly in both India and the UK. That means hiring managers are actively looking for ways to filter fast without losing good people. Knowing how they filter, and what signals they use at each stage, lets you present the right evidence at the right moment.
The 5-stage screening pipeline at established AI teams
Most scale-ups and mid-to-large tech companies with an existing AI function run a five-stage process. The stages are not always labelled clearly to candidates, and the criteria shift meaningfully at each one. The process typically looks like this: a GitHub and portfolio scan before a recruiter ever reaches out; a recruiter screen focused on fit signals; a hiring manager intro call that tests communication and systems thinking at a surface level; a take-home or technical challenge; and a system design round. Each stage has its own eliminating criteria, and understanding those criteria is the point of this guide.
One important meta-point: at companies with an established AI team, the hiring manager is usually an experienced practitioner — someone who has shipped real AI systems and knows what production looks like. They are not impressed by framework names or model benchmarks. They are looking for the texture of having actually done the work, and they have well-calibrated instincts for spotting the difference between a candidate who built something real and one who assembled a demo. Every stage is essentially a probe for that texture.
The 30-second GitHub scan: what hiring managers see first
At established AI teams, the process often starts before the job description is posted. A hiring manager or senior engineer on the team opens a candidate's GitHub profile and spends roughly 30 seconds deciding whether to pass it forward. That 30-second scan is not looking for impressive model names or a long list of technologies. It is looking for a specific cluster of signals that indicate production experience.
The first signal is project structure. Does the repository look like something that runs in a real environment, or like a notebook someone cleaned up and pushed? Production-oriented projects have a clear directory structure, a requirements or environment file, configuration separated from code, and usually some form of test coverage. A repository that is a single Jupyter notebook, however well-annotated, reads immediately as a demo.
The second signal is evaluation code. In 2026, an experienced AI hiring manager treats the presence or absence of an evaluation harness as one of the strongest signals in the entire profile. A project that ships with a structured eval — golden sets, scoring functions, a way to run the suite and see results — tells the reviewer that the candidate understood the problem well enough to define what "working" means. A project with no eval code, or worse with a comment that says "todo: add evaluation", signals that the candidate built something and assumed it worked.
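As a rough illustration of what a reviewer hopes to find, a minimal harness can be as small as the sketch below. It is a sketch under assumptions, not a standard: `GoldenCase`, `fact_coverage`, `run_suite`, and `toy_system` are hypothetical names, and substring matching is a deliberately crude scoring function that a real project would replace with a metric suited to its task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One hand-checked example: a query and the facts a good answer must contain."""
    query: str
    required_facts: tuple[str, ...]


def fact_coverage(answer: str, case: GoldenCase) -> float:
    """Score in [0, 1]: fraction of required facts that appear in the answer."""
    if not case.required_facts:
        return 1.0
    text = answer.lower()
    hits = sum(1 for fact in case.required_facts if fact.lower() in text)
    return hits / len(case.required_facts)


def run_suite(system: Callable[[str], str], golden_set: list[GoldenCase]) -> dict:
    """Run every golden case through the system and aggregate the scores."""
    scores = [fact_coverage(system(case.query), case) for case in golden_set]
    return {
        "cases": len(scores),
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "failures": [c.query for c, s in zip(golden_set, scores) if s < 1.0],
    }


def toy_system(query: str) -> str:
    """Hypothetical stand-in for the real system under test."""
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is available on all plans.",
    }
    return answers.get(query, "")


GOLDEN_SET = [
    GoldenCase("What is the refund window?", ("30 days",)),
    GoldenCase("Which plan includes SSO?", ("enterprise",)),
]

report = run_suite(toy_system, GOLDEN_SET)
```

With these two toy cases, `report["mean_score"]` is 0.5 and the SSO query is listed under `failures`. A mean score plus a list of failing cases is the kind of result a README can then quote.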
The third signal is README quality. A strong README for an AI project does not describe what the project does in general terms — it gives the specific numbers: retrieval accuracy, latency at the p95, cost per 1,000 requests, the model or models used, and usually a short paragraph about what the most important design decision was. A README that says "a RAG system for question answering using LangChain and Pinecone" tells the reviewer nothing a tutorial could not also say. A README that says "NDCG@10 of 0.71 on a 400-query golden set; p95 retrieval latency of 180 ms; average cost of $0.003 per query on GPT-4o-mini with context caching enabled" tells them a great deal.
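The two headline numbers in that example README are cheap to compute. The sketch below assumes you have graded relevance labels for each returned result and a list of measured latencies. The function names are illustrative, and the ideal ranking is approximated by re-sorting the returned results, whereas a fuller implementation would rank every judged document for the query.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG of the returned order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Relevance grades (0 to 3) for one query's results, in the order they were returned.
query_ndcg = ndcg_at_k([3, 0, 2, 1, 0], k=10)

# Measured retrieval latencies in milliseconds (made-up sample values).
p95_ms = percentile([120.0, 135.0, 150.0, 180.0, 900.0], 95)
```

Averaging `ndcg_at_k` over every query in the golden set gives the single figure a README would report. The p95 is only meaningful with far more samples than the five shown here.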
The fourth signal is commit hygiene. A repository whose entire history is a single initial commit, or whose commits are all "update" or "fix", suggests a candidate who cleaned up privately and pushed the result. That is not disqualifying, but it is a weaker signal than a history showing iterative work — experiments tried, benchmarks run, a direction changed when something did not perform, a regression fixed. The history of real work has texture that a polished dump does not.
Pin your three strongest repositories at the top of your GitHub profile, and make sure each one has a README that leads with the key performance numbers. Hiring managers open the pinned repos first. If the first README they read gives them NDCG, latency, and cost, you have already differentiated yourself from the majority of applicants they will see that week.
Red flags that end applications immediately
Certain signals will cause an experienced AI hiring manager to pass at the GitHub scan stage without reading further. It is worth being explicit about these, because some of them are counterintuitive — they involve things that look like effort but signal the wrong kind of effort.
| Red flag | Green flag (the alternative) |
|---|---|
| Notebooks-only repositories, even well-documented ones | Structured Python packages with modules, configs, and entry points |
| No tests on ML code ("it's hard to unit-test models") | Tests on data pipelines, eval harnesses, and deterministic logic even if model outputs are stochastic |
| "todo: add evaluation" or eval code is a single assert on one example | A proper eval suite with a golden set, scoring functions, and a README showing the scores |
| Every project uses MNIST, titanic.csv, or a Kaggle demo dataset | At least one project with a real, domain-specific dataset, even a small one the candidate collected and cleaned themselves |
| Copy-paste architecture from a tutorial (identifiable by characteristic variable names or comments) | Evidence of adaptation: the candidate changed something non-trivial and can explain why |
| No mention of cost or latency anywhere across the entire profile | At least one project with explicit cost-per-query or latency figures in the README |
| Agent projects with no failure handling or fallback logic | Agents with explicit retry logic, timeouts, fallback paths, and logged tool-call traces |
The notebooks-only flag does not mean Jupyter notebooks are bad. They are the right tool for exploration and communication. The problem is when the entire portfolio is notebooks, with no evidence that the candidate knows how to take something from exploration into a structured, deployable system. One well-structured project is enough to clear this concern.
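The "tests on deterministic logic" row is worth making concrete. Model outputs may be stochastic, but a chunker is not. The sketch below uses a hypothetical `chunk_words` helper and plain assertion-style tests, named so that a runner such as pytest would discover them.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def test_chunks_overlap_and_cover_all_words():
    assert chunk_words("a b c d e f g", size=3, overlap=1) == ["a b c", "c d e", "e f g"]


def test_empty_text_gives_no_chunks():
    assert chunk_words("", size=3, overlap=1) == []


def test_invalid_overlap_is_rejected():
    try:
        chunk_words("a b", size=2, overlap=2)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

None of these tests touches a model, which is the point: the data path around the model can be pinned down even when the model itself cannot.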
Green flags: production signals that make hiring managers stop scrolling
The flip side of the red flag list is a smaller set of signals that cause an experienced hiring manager to slow down and read carefully. These are the signals worth deliberately engineering into your public work.
Structured evaluation harnesses. A project that includes a full evaluation directory — golden queries, expected outputs or scores, a runner script, and results logged somewhere — is rare enough that it immediately marks a candidate as someone who treats AI engineering as engineering rather than alchemy. Even a modest harness covering 50 to 100 cases is meaningful. The presence of regression gates in a CI pipeline (catching score drops before they merge) is even stronger.
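A regression gate does not need to be elaborate. One possible shape, assuming the eval run writes its aggregate metrics to a JSON file and a committed baseline file holds the last accepted scores, is sketched below. The file layout, the metric names, and the 0.01 tolerance are all assumptions made for illustration.

```python
import json
from pathlib import Path


def check_regression(current: dict[str, float], baseline: dict[str, float],
                     tolerance: float = 0.01) -> list[str]:
    """Return one message per metric that fell more than `tolerance` below its baseline."""
    problems = []
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None:
            problems.append(f"{metric}: missing from current results")
        elif score < base - tolerance:
            problems.append(f"{metric}: {score:.3f} is below baseline {base:.3f}")
    return problems


def gate(current_path: str, baseline_path: str) -> int:
    """Return a process exit code: 0 lets the merge proceed, 1 fails the CI job."""
    current = json.loads(Path(current_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    problems = check_regression(current, baseline)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0
```

A CI step would call `gate(...)` on the two files and pass the result to `sys.exit`. The tolerance exists because small score movements are often noise, and a gate that fails on noise gets disabled.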
Production READMEs with latency and cost numbers. As described in the GitHub scan section, specific numbers in a README are one of the strongest possible green flags. They demonstrate that the project ran on real traffic or realistic load, that the candidate measured performance, and that they understand the business dimension of AI systems — cost and latency are not academic concerns, they are what determines whether a feature survives in production.
Real datasets, not tutorial data. A project built on a real, domain-specific dataset — clinical notes, product reviews, financial filings, code repositories, customer support tickets — demonstrates that the candidate can handle the messiness of real data: inconsistent formatting, missing values, domain-specific vocabulary, ambiguous ground truth. This is a significant differentiator because most tutorial projects use clean, pre-labelled data that bears no resemblance to what production AI actually processes.
Agent traces in logs or documentation. If a candidate has published an agent project, the difference between a demo agent and a production-oriented one is often visible in whether the project shows any logging or tracing of tool calls. A project that includes sample logs showing the agent's decision sequence, tool invocations and their outputs, and any retry or fallback events tells the reviewer that the candidate thought about observability — which is the foundation of debugging any agentic system that misbehaves in production.
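To make the observability point concrete, here is one way a tool call could be wrapped so that every attempt, retry, and fallback leaves a trace. It is a sketch, not a reference design: `call_tool` and `_trace` are hypothetical names, the event fields are an assumed schema, and timeouts are assumed to be enforced inside the tool itself (for example by an HTTP client setting).

```python
import json
import logging
import time
from typing import Any, Callable, Optional

logger = logging.getLogger("agent.trace")


def _trace(tool: str, args: dict, attempt: int, started: float,
           status: str, detail: str = "") -> None:
    """Emit one JSON line per tool-call event so a run can be reconstructed from the log."""
    logger.info(json.dumps({
        "tool": tool,
        "args": args,
        "attempt": attempt,
        "elapsed_ms": round((time.perf_counter() - started) * 1000, 1),
        "status": status,
        "detail": detail,
    }, default=str))


def call_tool(name: str, tool: Callable[..., Any], args: dict,
              fallback: Optional[Callable[..., Any]] = None,
              max_attempts: int = 3, backoff_s: float = 0.5) -> Any:
    """Call a tool with bounded retries, then fall back or raise once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        started = time.perf_counter()
        try:
            result = tool(**args)
        except Exception as exc:  # broad on purpose: every failure should be traced
            _trace(name, args, attempt, started, "error", repr(exc))
            if attempt < max_attempts:
                time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff
        else:
            _trace(name, args, attempt, started, "ok", repr(result)[:200])
            return result
    if fallback is None:
        raise RuntimeError(f"tool {name!r} failed after {max_attempts} attempts")
    _trace(name, args, max_attempts, time.perf_counter(), "fallback")
    return fallback(**args)
```

Committing a few lines of this log output alongside the code is one way to publish the sample traces described above.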
Evidence of iteration. A project that documents what did not work — a chunking strategy that hurt retrieval performance, an embedding model that collapsed two important concepts, a re-ranker that added latency without improving quality — demonstrates the iterative, evidence-driven mindset that separates builders who have shipped from those who have only assembled. This can live in a README section, a blog post, or even commit messages. The content matters more than the format.
The 5 evaluation dimensions: what hiring managers score across every stage
Hiring managers at established AI teams do not evaluate candidates on a single pass. They accumulate evidence across every stage of the process and score it against a consistent set of dimensions. Understanding these dimensions lets you know what you are being evaluated on even when the interview question sounds like something else entirely.
The following table shows how an experienced AI hiring manager at a scale-up typically weights the five dimensions and what constitutes strong evidence for each. The weights are illustrative, not universal (different teams adjust them), but the relative ordering is consistent across most established AI functions.
| Dimension | What counts as strong evidence | Illustrative weight |
|---|---|---|
| 1. Production evidence | Systems that ran on real traffic; cost and latency numbers; production bugs debugged and fixed; monitoring and alerting in place | 25% |
| 2. Evaluation discipline | Golden sets constructed from real data; offline eval harnesses; regression suites; LLM-as-judge with calibration; CI gates on eval score | 25% |
| 3. Systems thinking | Reasoning about scale, cost, failure modes, and observability before implementation; trade-off articulation; knowing when not to use a given approach | 20% |
| 4. Communication quality | Can explain technical decisions and trade-offs to a non-specialist; README and documentation clarity; concise answers that still contain the relevant depth | 15% |
| 5. Learning velocity | Evidence of having adopted new tooling rapidly and correctly; blog posts or projects on emerging techniques (agents, multi-model routing, MCP); reasoned opinions on when new approaches are and are not appropriate | 15% |
Two dimensions deserve particular emphasis because they are the ones candidates most commonly underestimate.
Evaluation discipline (25%) is the single largest differentiator. At a company with an existing AI team, the hiring manager has almost certainly been burned by someone who shipped without evals. They have debugged a regression that took three days to identify because there was no baseline. They have explained to a product manager why a feature that "worked in testing" is hallucinating in production. Evaluation discipline is not a nice-to-have for them; it is the thing they most want to hire and most struggle to find. A candidate who leads with their eval philosophy — how they construct golden sets, how they measure retrieval quality, how they catch regressions before they merge — is addressing the most acute pain point on the team.
Systems thinking (20%) is tested throughout, not just in the design round. Even in the GitHub scan, an experienced reviewer is asking: "Does this person understand what they are trying to accomplish well enough to make deliberate trade-offs?" A project that uses the most expensive model for every call, with no caching and no routing, and has never measured its cost, fails the systems thinking test even if the model quality is fine. Every stage of the interview is an opportunity to demonstrate that you reason about the full system — cost, latency, failure modes, monitoring — not just the ML component.
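The "no caching and no routing" failure is easy to picture in code. The sketch below shows the opposite: a thin client that caches repeated prompts, sends only the prompts that need it to the stronger model, and keeps a running spend total. Everything here is hypothetical, including the `RoutedClient` name, the per-call costs, and the routing rule, which in a real system would be something more principled than a word count.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RoutedClient:
    cheap: Callable[[str], str]          # stand-in for a small, inexpensive model
    strong: Callable[[str], str]         # stand-in for a larger, more expensive model
    cheap_cost_usd: float                # assumed cost per call (placeholder value)
    strong_cost_usd: float               # assumed cost per call (placeholder value)
    needs_strong: Callable[[str], bool]  # routing rule supplied by the caller
    cache: dict[str, str] = field(default_factory=dict)
    spend_usd: float = 0.0
    cache_hits: int = 0

    def ask(self, prompt: str) -> str:
        key = " ".join(prompt.lower().split())  # normalise case and whitespace
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        if self.needs_strong(prompt):
            answer, cost = self.strong(prompt), self.strong_cost_usd
        else:
            answer, cost = self.cheap(prompt), self.cheap_cost_usd
        self.spend_usd += cost
        self.cache[key] = answer
        return answer
```

The useful part for an interview is less the class than the two counters: `spend_usd` and `cache_hits` are what let you say what the system costs and how much the cache is saving.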
Before your hiring manager intro call, prepare one story for each of the five dimensions. These need not be five separate projects; ideally one or two projects illustrate all five, told through specific examples. "Here is the system I built, here is what it cost and how fast it was, here is the eval that caught the regression before it reached users, here is the trade-off I made on the retrieval stack and why, here is what I had to learn to build it." That narrative structure maps directly onto how hiring managers are scoring you.
For a deeper look at the specific questions that probe each dimension, see our breakdown of AI engineer interview question clusters. The clusters there align closely with the five dimensions above, and preparing for them in parallel is a good use of the same preparation time.
How the take-home challenge is scored in 2026
The take-home challenge has changed significantly since 2023. Back then, a typical AI engineering take-home asked candidates to build a simple retrieval or classification system, and evaluation meant checking whether it produced plausible outputs. In 2026, take-homes at established AI teams are designed around the assumption that candidates will use AI tools. The constraint is not "no AI assistance"; it is "demonstrate that you understand what you are building well enough to evaluate it, explain it, and extend it."
This shift reflects a real change in what the job requires. A hiring manager no longer needs to know whether you can write a tokeniser from scratch; they need to know whether you can design a system that works reliably when the underlying models behave unexpectedly. The take-home is calibrated to test exactly that.
The most common formats in 2026 at scale-ups with existing AI teams are:
Build or extend an agent with evaluation. The candidate is given a partial agent implementation and asked to extend it — add a tool, handle a new failure mode, improve the eval coverage. Scoring focuses on whether the candidate wrote evals for their changes, whether the evals are meaningful (not just checking that the agent produces output), and whether the code is structured clearly enough that another engineer could extend it. Candidates who submit a working extension with no eval coverage routinely score below candidates who submit a slightly incomplete extension with strong evals.
RAG system with evaluation harness. Build a retrieval system for a provided corpus, write an evaluation harness, and include a short document explaining the design decisions. Scoring rewards explicit trade-offs (why this chunking strategy, why this embedding model, what retrieval metric you chose and why), measured performance (actual eval scores on a held-out set), and documentation quality. The "short document explaining decisions" section is not optional decoration — it is scored as heavily as the code itself, because communication quality is one of the five dimensions.
Diagnosis and improvement challenge. The candidate is given a broken or underperforming AI system and asked to diagnose what is wrong and improve it. This format is increasingly common because it is harder for AI tools to solve end-to-end — the candidate needs to reason about failure modes, run targeted experiments, and form hypotheses before implementing. Strong answers identify the root cause correctly, propose a specific, testable improvement, and measure the improvement against a baseline. Weak answers make a change, note that outputs seem better, and do not measure anything.
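Measuring against a baseline can be very simple. A minimal sketch, assuming you have per-case scores from the same eval set before and after your change, in the same order (`compare_to_baseline` is a hypothetical helper name):

```python
from statistics import mean


def compare_to_baseline(baseline: list[float], candidate: list[float]) -> dict:
    """Paired comparison of per-case scores for the same cases in the same order."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    return {
        "baseline_mean": mean(baseline),
        "candidate_mean": mean(candidate),
        "mean_delta": mean(deltas),
        "improved": sum(d > 0 for d in deltas),
        "regressed": sum(d < 0 for d in deltas),
        "unchanged": sum(d == 0 for d in deltas),
    }


# Made-up scores for four eval cases, before and after a change.
summary = compare_to_baseline([0.5, 1.0, 0.0, 1.0], [1.0, 1.0, 0.5, 0.5])
```

Reporting the regressed count alongside the mean matters: a change that lifts the average while breaking previously passing cases is a different result from one that improves across the board, and a reviewer will want to see that you noticed.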
What has not changed is the importance of clean code and clear reasoning. A take-home that scores highly on evaluation discipline but is written in a way that no reasonable engineer would want to read or extend is still a below-average submission. The standard is not production-perfect, but it should be clear enough that a senior engineer on the team could understand the choices made and modify the code confidently. Our guide to structuring a take-home submission covers the specific formatting and documentation conventions that score well consistently.
System design expectations: the "build a RAG system" question decoded
The system design round at an established AI team is the stage most candidates prepare for least effectively. The common mistake is treating it as a technical knowledge test — studying RAG architectures, vector database options, and orchestration frameworks — when it is primarily a reasoning and trade-off articulation test. The hiring manager already knows the standard architectures; they are not asking you to recite them. They are asking whether you can reason clearly about a real engineering problem under constraints.
The canonical question — "design a RAG system for [domain]" — is not asking for the most sophisticated architecture. It is asking you to demonstrate the following five things, in roughly this order.
First: ask clarifying questions before proposing anything. Strong candidates immediately ask about scale (queries per second, document corpus size), latency requirements (what is an acceptable response time for the user?), cost constraints (is there a token budget?), and domain specifics (how heterogeneous is the document set, how often does it update?). Candidates who dive straight into architecture without establishing constraints are demonstrating the opposite of systems thinking.
Second: reason about the retrieval component explicitly. What is the chunking strategy and why? What embedding model and why (cost, latency, domain fit)? Is hybrid search appropriate? What is the re-ranking approach? Hiring managers want to hear you make a specific choice, justify it against the constraints you established in step one, and acknowledge what you are trading off. "It depends" without a follow-up is the most common red flag in system design rounds.
Third: address failure modes unprompted. A strong candidate names the ways the system can go wrong before being asked: stale documents, retrieval misses (the answer exists in the corpus but is not retrieved), hallucinations when context is insufficient, latency spikes under load, cost overruns from unbounded context. More importantly, they propose specific mitigations: refresh schedules or change detection for staleness, fallback responses for low-confidence retrievals, context budgets for cost, monitoring thresholds for all of the above.
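Two of those mitigations, the low-confidence fallback and the context budget, fit in a few lines. The sketch below is illustrative only: the `Hit` type, the 0.35 score floor, the 3,000-token budget, and the word-count token estimate are all assumptions, and a real system would use the model's own tokenizer and a threshold tuned on its eval set.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FALLBACK_REPLY = "I could not find a reliable source for that in the documents I have."


@dataclass(frozen=True)
class Hit:
    text: str
    score: float  # retriever similarity, assumed to lie in [0, 1]


def estimate_tokens(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(text.split())


def build_context(hits: list[Hit], min_score: float = 0.35,
                  max_tokens: int = 3000) -> Optional[list[str]]:
    """Keep the best hits above the score floor until the token budget would be exceeded.

    Returns None when no hit clears the floor or fits the budget.
    """
    confident = sorted((h for h in hits if h.score >= min_score),
                       key=lambda h: h.score, reverse=True)
    context, used = [], 0
    for hit in confident:
        cost = estimate_tokens(hit.text)
        if used + cost > max_tokens:
            break  # stop at the first chunk that would overrun the budget
        context.append(hit.text)
        used += cost
    return context or None


def answer(query: str, hits: list[Hit],
           generate: Callable[[str, list[str]], str]) -> str:
    """Generate only when there is trustworthy context; otherwise say so."""
    context = build_context(hits)
    if context is None:
        return FALLBACK_REPLY
    return generate(query, context)
```

Counting how often `answer` returns the fallback is itself a useful monitoring signal, since a rising rate usually points to stale documents or retrieval misses.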
Fourth: describe how you would evaluate and monitor the system. What retrieval metrics would you track (NDCG, MRR, context precision, context recall)? What generation metrics (faithfulness and answer relevance, for example as scored by a framework such as RAGAS)? How would you build an online evaluation loop as the system accumulates real queries? What alerts would you set up, and at what thresholds? Hiring managers at established AI teams have felt the cost of shipping systems with no monitoring; they weight this part of the answer heavily.
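It helps to be able to define the retrieval metrics you name. The sketch below gives label-based versions of MRR, recall, and precision over document IDs, assuming each query has a known set of relevant IDs. These are stand-ins for illustration, not the model-judged variants that evaluation frameworks may compute under similar names.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / len(top) if top else 0.0
```

Recall tells you whether the answer made it into the context at all, while precision tells you how much of the context is noise. Quoting both, rather than one blended score, makes the trade-off between them visible.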
Fifth: make an explicit cost and latency estimate. This does not need to be precise, but it needs to exist. A candidate who can say "at 100 queries per day with an average context of 3,000 tokens and GPT-4o-mini, model spend should be well under $1 per day, with a p95 latency of around 1.5 seconds including retrieval, which we can improve to under 1 second with a caching layer on repeated queries" is demonstrating production thinking. A candidate who cannot estimate at all has probably never operated anything at scale.
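The arithmetic behind such an estimate is short enough to do on a whiteboard. A sketch follows, with the caveat that the per-million-token prices and the 300-token output length are placeholder assumptions rather than quoted rates; substitute the figures from your provider's current price sheet.

```python
def daily_cost_usd(queries_per_day: int, input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Model spend per day, given average token counts per query and per-million-token prices."""
    per_query = (input_tokens * usd_per_m_input
                 + output_tokens * usd_per_m_output) / 1_000_000
    return queries_per_day * per_query


# Placeholder inputs, assumed for illustration only.
estimate = daily_cost_usd(
    queries_per_day=100,
    input_tokens=3_000,     # average context size per query
    output_tokens=300,      # assumed average answer length
    usd_per_m_input=0.15,   # placeholder price per million input tokens
    usd_per_m_output=0.60,  # placeholder price per million output tokens
)
```

In an interview, stating the inputs out loud matters more than the final figure: it shows which lever (volume, context size, or model price) dominates the bill, and it lets the interviewer correct an assumption without derailing the answer.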
The profile-and-network shortlist: how hiring managers find candidates before posting jobs
One of the most important things candidates applying to established AI teams do not know is how often the shortlisting process starts before, or instead of, a public job posting. This is not unique to AI; it has always been true in competitive technical hiring. But the tooling available to hiring managers in 2026 has made proactive sourcing significantly faster and more systematic.
Modern AI sourcing tools aggregate profiles from LinkedIn, GitHub, Stack Overflow, Kaggle, and niche professional communities simultaneously, enabling a hiring manager or their recruiter to build a qualified shortlist in hours rather than weeks. Tools in this category can search across 800 million profiles or more, filter by specific technical signals — contribution patterns on relevant open-source projects, publications in specific domains, badges for particular skills — and surface passive candidates who have never applied for a job in their field. The practical result is that a hiring manager at a fintech scale-up in Bangalore or a health-tech company in London may reach out to the best candidates in their target profile before they have even written a job description.
This has a direct and actionable implication for candidates: your public visibility is not a supplement to the job application process — it is an earlier and often more important part of it. A strong GitHub profile is the baseline. A LinkedIn profile that clearly signals your AI engineering specialisation — not just "software engineer" with AI keywords buried in the summary, but an explicit focus on the specific area you work in (RAG, agents, fine-tuning, evaluation) — gets surfaced by sourcing tools that match by technical signal. And a public profile on a specialist platform that serves the exact audience of AI hiring managers looking for exactly your skills is the most targeted visibility you can have.
As the talent shortage data in our coverage of the AI hiring landscape confirms, the supply of qualified candidates is genuinely constrained. That means the best candidates do not need to apply for most roles — they get found. The question is whether you are findable in the right places.
The best candidates don't apply — they get found.
AI hiring managers at scale-ups and established tech companies shortlist from LinkedIn, GitHub, and specialist profiles before posting jobs. A Verified Builder profile on AI Tech Connect puts your production work in front of exactly that audience — across India and the UK. Early profiles receive the Founding Builder badge, and spots in the founding cohort are deliberately limited. Once they are gone, they are gone.
The visibility dynamic extends to content as well. Hiring managers are active on the same platforms they use for sourcing, and a candidate who writes clearly about a technical problem they solved (a blog post, a detailed GitHub README, a thread on a developer forum) is simultaneously demonstrating communication quality (dimension four in the rubric) and increasing their findability. The candidates who get the most inbound interest in 2026 are those who have made their production experience visible in a format that search and sourcing tools can surface. Our guide to building in public as an AI engineer covers the specific formats and platforms that generate the best signal-to-noise for this purpose.
It is also worth noting that the network shortlist is not purely algorithmic. Hiring managers at established AI teams frequently ask their existing team members whether they know anyone good, and they check who has engaged with or starred relevant open-source projects. Being known — even slightly — within the AI engineering community in your target market (India or UK) provides a warm introduction signal that no cold application can replicate. This is a longer-term investment than polishing a README, but it compounds over time in a way that cold applications do not.
Making the rubric work for you: actionable 30-day plan
Understanding the rubric is only useful if it changes what you do. Here is a concrete 30-day plan for a candidate who wants to position themselves well against the five evaluation dimensions and the five-stage screening pipeline described in this guide. The plan assumes you have existing AI projects; if you are starting from scratch, the portfolio guide at AI engineer portfolio: proof of work is the right starting point, and the agentic specialisation guide at agentic AI role portfolio and interview prep covers the specific additions needed for agent-focused roles.
Days 1 to 7 — audit and repair your public profile. Open your GitHub profile as a stranger would. Look at the pinned repositories. Does each one have a README that leads with specific numbers (latency, cost, eval scores)? If not, write the numbers into the READMEs for your two or three strongest projects. This is the highest-leverage single change you can make because it directly addresses the 30-second GitHub scan. If any of your significant projects have no evaluation code, add a minimal eval harness — even 30 to 50 representative examples with a scoring script — and push it. "No evaluation" is the most common immediate elimination signal; addressing it takes a few hours.
Days 8 to 14 — build or extend one production-signal project. Take your strongest existing project and push it from demo quality to production quality. Add a proper directory structure if it is a notebook. Write tests on the deterministic parts (data pipelines, scoring functions, configuration loading). Add logging to agent tool calls if the project uses agents. Write a README section titled "Design decisions and trade-offs" that documents two or three non-obvious choices you made and why. This project becomes your primary evidence artefact for the production evidence and evaluation discipline dimensions.
Days 15 to 21 — prepare your system design narrative. Practice the RAG system design question out loud. Not as a recitation of architectures — as a guided walk through constraints, trade-offs, failure modes, evaluation, and cost estimation. Time yourself. A well-structured answer to a system design question should take 20 to 25 minutes and cover all five elements described in the section above. The goal is fluency, not memorisation: you want to be able to start from any entry point (a constraint, a failure mode, a cost question) and work outward to a complete picture. Record yourself if possible; the gaps and hesitations in a recording are more informative than any amount of reading.
Days 22 to 28: make yourself findable. Update your LinkedIn headline to reflect your specific AI engineering specialisation, not just your job title. Add a skills section that covers the technical areas the rubric probes: evaluation, RAG and retrieval, agents and tool use, cost and inference optimisation, and systems design. Create or update your profile on a specialist platform that serves AI engineering hiring in India or the UK. If you have a blog or a detailed technical post anywhere, ensure it links back to your primary profile. If you do not have any published technical writing, write a single short post about a specific problem you solved and the measurement that confirmed the solution worked; this directly demonstrates both communication quality and evaluation discipline.
Days 29 to 30 — final calibration. Ask a colleague or a peer to review your GitHub profile as a hiring manager would — 30 seconds per pinned repository, looking for the signals described in this guide. Their fresh-eyes reaction is more reliable than your own assessment because you cannot unsee what you know about your own projects. Incorporate their feedback on clarity and the visibility of key numbers. Then activate your Verified Builder profile on AI Tech Connect if you have not already done so — the founding cohort is the right time to do it, while the badge carries its scarcity value.