What is new in Gemini 3.1 Ultra

Two headline additions define the 3.1 release. The first is the 2-million token context window — doubled from the 1M maximum available in earlier Gemini Ultra builds, and twice the size of Claude Opus 4.7's 1M context. The second is a native sandboxed Code Execution tool: the model can write Python, execute it in an isolated environment, read the standard output, and iterate within the same conversation turn — without any external interpreter infrastructure on your side.

Both features were announced at Google Cloud Next '26 alongside Google's TPU 8i chip, which Google claims delivers 80% better performance-per-dollar over the prior generation. That hardware improvement matters for production economics: it translates directly into lower per-token serving costs, which are already falling at roughly 95% per year across the industry. The AI inference chip market, worth $13.7 billion in 2025, is projected to reach $56.9 billion by 2035 — and a significant share of that growth is driven by exactly the kind of workloads Gemini 3.1 Ultra is designed to handle.

Google also released Gemma 4 alongside Gemini 3.1 Ultra, targeting reasoning and agentic workflows. We cover Gemma 4 separately — see our Gemma 4 thinking modes article for that story.

Pro tip

The 2M-token context and the Code Execution tool are independent capabilities that compound. You can load a 1.5M-token codebase into context, ask the model to write an analysis script, have it execute that script, and read back computed results — all in one call. This removes an entire orchestration layer from many data pipelines.

The 2M-token cross-modal context: what changes for builders

Context size on its own is only meaningful if it is genuinely cross-modal and if recall quality holds at scale. On both counts, Gemini 3.1 Ultra is a material step forward. The 2-million token budget is shared across text, images, audio transcripts, and video frames — not separate per-modality limits. A single call can include a 90-minute earnings call transcript, the accompanying slide deck images, and several years of related financial filings, without splitting across sessions or building a retrieval layer.

For teams in India and the UK, this opens four categories of work that were previously impractical in a single model call.

Legal document analysis

Indian enterprise legal teams dealing with multi-volume commercial contracts, SEBI regulatory filings, or NCLT tribunal records can now load entire document sets — often running 600k to 1.2M tokens when formatted — into a single context and query cross-document clause relationships. UK solicitors handling complex M&A due diligence packs, where the document room runs to hundreds of PDFs, face a similar bottleneck. At 2M tokens, the bottleneck moves from "what fits" to "what's worth asking."

A practical note: legal document recall at very long context should be validated on your specific corpus before production use. Needle-in-a-haystack benchmarks from earlier Gemini Ultra builds showed strong recall to around 1.2M tokens, with performance variability beyond that. Google has not yet published Gemini 3.1 Ultra-specific recall curves at the 1.8–2M range — treat that band with caution until community benchmarks emerge.
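One way to run that validation is a small needle-in-a-haystack harness over your own documents. The sketch below is a minimal version under stated assumptions: `ask_model` is a hypothetical callable you would wire to whichever client you use (it is not a Gemini SDK function), and the needle is a synthetic fact you plant yourself.

```python
from typing import Callable

def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler paragraphs."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return "\n\n".join(filler[:position] + [needle] + filler[position:])

def recall_by_depth(
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    ask_model: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Plant the needle at each depth and record whether the answer contains `expected`."""
    results: dict[float, bool] = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask_model(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results
```

Use paragraphs from your real corpus as `filler`, and scale the filler until the context reaches the token range you plan to use in production. A substring match is a crude scoring rule; swap in something stricter if your answers are free-form.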

Large codebases

A mid-size production service with 600k lines of TypeScript, including generated types and test files, typically tokenises to 900k–1.4M tokens. That now fits within context. Builders can ask Gemini 3.1 Ultra to trace a data-flow path end-to-end, identify all callers of a deprecated interface, or propose a refactor plan — without chunking. The difference compared to a RAG-retrieval approach is qualitative: the model sees every file simultaneously rather than the retriever's ranked approximation of relevance.
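Before assuming a repository fits, it helps to get a rough token estimate. The following is a sketch, not a tokenizer: it assumes about four characters per token, which is a common rule of thumb and not a property of the Gemini tokenizer, and the function names and skipped directories are illustrative choices.

```python
import os

CONTEXT_BUDGET_TOKENS = 2_000_000  # window size quoted in this article
CHARS_PER_TOKEN = 4.0              # ASSUMPTION: rough heuristic, not the real tokenizer

def estimate_repo_tokens(root: str, extensions: tuple[str, ...] = (".ts", ".tsx", ".json")) -> int:
    """Walk `root` and estimate tokens from the character count of matching files."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune dependency and VCS directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in {"node_modules", ".git", "dist"}]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return int(total_chars / CHARS_PER_TOKEN)

def fits_in_context(estimated_tokens: int, reserve_for_output: int = 50_000) -> bool:
    """Check the estimate against the window, leaving headroom for prompt and output."""
    return estimated_tokens + reserve_for_output <= CONTEXT_BUDGET_TOKENS
```

Treat the result as a planning figure only. Code with long identifiers or heavy punctuation can tokenise quite differently, so confirm with the provider's own token counting before relying on a tight fit.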

See our companion article on Claude Managed Agents Beta for a contrasting approach to large-codebase agentic work. The orchestration patterns differ, but the core insight — that whole-repo context beats chunked retrieval for structural refactors — holds across both models.

Earnings calls and multi-quarter financial analysis

A full year of quarterly earnings call transcripts for a FTSE 100 or Nifty 50 company, including analyst Q&A, runs roughly 200k–400k tokens. Add three years of annual reports and the number approaches 1.2M. Gemini 3.1 Ultra can hold all of it and answer questions that require comparing guidance from Q1 2024 against delivery in Q3 2025. For fintech builders in Mumbai and London building investment research tools, this eliminates a painful chunking step that routinely lost cross-quarter context.

Multi-document research synthesis

Research teams at Indian pharmaceutical companies running regulatory submissions to CDSCO, or UK life-sciences firms preparing MHRA dossiers, regularly work with document sets that span clinical trial reports, prior art, and regulatory guidance. At 2M tokens, a synthesis agent can hold the full dossier in one context and produce structured comparisons without a retrieval approximation introducing noise. The same pattern applies to academic systematic reviews, policy analysis, and competitive intelligence work.

From a Builder

"We were splitting our contract review pipeline across four chunked calls and losing the clause cross-references that matter most in Indian joint-venture agreements — the kind where indemnity in schedule 3 contradicts the liability cap buried in schedule 11. One call with a 2M context window and the model catches the conflict on its own."

— Arjun, verified AI Builder · Pune, IN

The sandboxed Code Execution tool

The Code Execution capability is architecturally distinct from context size, and arguably more consequential for day-to-day builder work. The model writes Python code, submits it to a sandboxed runtime, receives the stdout, and can loop — amending its code in response to errors or unexpected output — all within a single conversation turn. From the caller's perspective, this is a single API call that returns a final answer; the code-write, execute, and read loop is internal to the model's reasoning.

What this removes from your stack:

  • An external code interpreter (no separate Python subprocess, no Jupyter kernel management)
  • The orchestration loop that catches interpreter errors and feeds them back to the model
  • The latency of multiple round-trips for iterative numerical debugging
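For reference, the caller-side loop that this replaces looks roughly like the sketch below. It is an illustration of the pattern, not of Google's internal implementation: `generate_code` is a hypothetical stand-in for a model call, and a plain subprocess is not a security sandbox.

```python
import subprocess
import sys
from typing import Callable, Optional

def run_python(code: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run code in a separate interpreter process; return (succeeded, stdout or stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s} seconds"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr

def write_execute_read(
    task: str,
    generate_code: Callable[[str, Optional[str]], str],  # hypothetical: (task, last_error) -> code
    max_attempts: int = 3,
) -> str:
    """Ask for code, run it, and feed any error back until it succeeds or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate_code(task, feedback)
        ok, output = run_python(code)
        if ok:
            return output
        feedback = output  # the error text becomes input to the next attempt
    raise RuntimeError(f"no working code after {max_attempts} attempts: {feedback}")
```

Every failed attempt in this design costs a full model round-trip plus a process launch, which is the latency the in-turn tool is meant to absorb.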

What remains your responsibility:

  • The sandbox is Google-managed; you cannot install arbitrary packages. Validate that your numerical work falls within the available library set before committing to this approach in production.
  • For tasks requiring access to your own data stores or APIs, Code Execution works alongside function calling — not as a replacement for it.
  • Output from Code Execution is text (stdout). If your pipeline needs structured JSON out of the executed code, write your Python accordingly.

Watch out

Code Execution is isolated — it cannot reach your databases, internal APIs, or file system. It is best suited for computation on data already in context (e.g., a CSV pasted as text, or numbers from a document). For pipelines requiring live data access, combine Code Execution with function-calling tools that fetch from your systems first.
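To make "computation on data already in context" concrete, here is the kind of script the model might write and run in the sandbox for a CSV pasted as text. It is a sketch with made-up figures and hypothetical column names, and it sticks to the Python standard library so it does not depend on which packages the sandbox provides.

```python
import csv
import io
import json
import statistics

# Data that would already be in context, pasted as text. The figures are made up.
CSV_TEXT = """quarter,revenue,operating_profit
Q1,1200,180
Q2,1350,216
Q3,1280,179.2
Q4,1500,255
"""

def operating_margins(csv_text: str) -> dict:
    """Compute operating margin per quarter and the mean margin across quarters."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    margins = {
        row["quarter"]: round(float(row["operating_profit"]) / float(row["revenue"]), 4)
        for row in rows
    }
    return {
        "margins": margins,
        "mean_margin": round(statistics.mean(list(margins.values())), 4),
    }

if __name__ == "__main__":
    # Print a single JSON line so the result is easy to read back from stdout.
    print(json.dumps(operating_margins(CSV_TEXT)))
```

The point of the pattern is that the ratios come from executed arithmetic, not from the model's estimate of what the arithmetic would produce.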

The practical applications are strongest wherever a model previously had to approximate arithmetic or statistical analysis through natural language. Earnings call ratio calculations, clinical trial statistical summaries, loan book risk aggregations: any domain where "number-heavy reasoning at long context" was a known failure mode benefits directly. Instead of relying on in-context arithmetic, the model now runs the calculation and reads the result.

UK healthcare builders working on NHS data analysis tools and Indian fintech teams running RBI regulatory stress tests both stand to benefit. The pattern is the same: load the source data into context, ask the model to write and run an analysis script, and receive a computed output rather than a language-model approximation.

Pricing and the economics of long context

Google has not published a standalone Gemini 3.1 Ultra pricing page at the time of writing. Pricing for earlier Gemini Ultra tiers via Google AI Studio and Vertex AI varied by region and usage tier. The TPU 8i announcement — 80% better performance-per-dollar — suggests that serving cost per token will be materially lower than Gemini 2.0 Ultra, even before the industry's 95%-per-year decline in inference costs is factored in.

For planning purposes, two cost dynamics dominate long-context work regardless of exact pricing:

  • Context caching is essential. At 2M tokens, the raw cost of re-sending your entire document set on every turn is prohibitive. Google's context caching (available on Gemini 1.5 Pro and expected to extend to 3.1 Ultra) stores a static prefix at a fraction of full-input cost. Design your system prompt and document loading to maximise the stable prefix length.
  • Output volume is typically a small fraction of input volume. For most long-context read tasks (document analysis, synthesis, question answering), output tokens are 1–5% of input tokens. Even where output tokens are priced higher per token, the 2M-token input budget dominates your cost equation; optimise there first.
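A cache only helps if the prefix is byte-identical from call to call. The sketch below shows one way to keep it stable, assuming a simple text-only prompt: documents are ordered deterministically, the per-request question goes last, and a fingerprint lets you detect accidental prefix drift. The function name and prompt layout are illustrative and not part of any Google API.

```python
import hashlib

def build_cacheable_prompt(
    system_prompt: str,
    documents: dict[str, str],
    question: str,
) -> tuple[str, str, str]:
    """Return (stable_prefix, volatile_suffix, prefix_fingerprint)."""
    # Sort by document name so insertion order cannot change the prefix bytes.
    parts = [system_prompt]
    for name in sorted(documents):
        parts.append(f"### {name}\n{documents[name]}")
    stable_prefix = "\n\n".join(parts)
    # Anything that varies per request stays out of the prefix.
    volatile_suffix = f"Question: {question}"
    fingerprint = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
    return stable_prefix, volatile_suffix, fingerprint
```

Logging the fingerprint alongside each request makes it easy to spot the common failure where a timestamp or request ID sneaks into the system prompt and silently invalidates the cache.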

For infrastructure context, see our article on Google TurboQuant and KV-cache compression — the 6x compression technique reduces the memory footprint of long-context serving, which directly affects what gets cached and at what cost.

Gemini 3.1 Ultra vs Claude Opus 4.7: a direct comparison

Feature                | Gemini 3.1 Ultra            | Claude Opus 4.7
Max context window     | 2,000,000 tokens            | 1,000,000 tokens
Cross-modal context    | Text, image, audio, video   | Text, image, audio
Native code execution  | Yes (sandboxed, in-turn)    | No (requires external tool)
Context caching        | Yes (Vertex AI / AI Studio) | Yes ($0.50/MTok cache read)
Prompt cache read cost | Not yet published           | $0.50/MTok
Output cost            | Not yet published           | $25/MTok
TPU / chip generation  | TPU 8i (80% perf/$ gain)    | Anthropic-custom silicon
Managed agent support  | Vertex AI Agent Builder     | Claude Managed Agents (beta)
Open-source companion  | Gemma 4                     | None (proprietary only)

The honest comparison for most builder workloads: if your use case is primarily read-and-analyse on large document sets and you need the full 2M window, Gemini 3.1 Ultra is currently the only option at that scale. If your use case involves agentic writes, extensive tool-calling chains, or you have already built prompt-caching infrastructure on Anthropic's API, Claude Opus 4.7 remains strong — particularly given its published, predictable pricing. Read our Claude Opus 4.7 deep dive for the production cost numbers.

What can you build now that you could not before?

The combination of 2M cross-modal context and native Code Execution unlocks a specific category of product that previously required significant custom infrastructure: stateful analytical agents that read, compute, and report within one API call.

Here are five concrete product patterns that become practical in 2026 with Gemini 3.1 Ultra:

  1. Full-corpus legal due diligence agents. Load an entire M&A data room — contracts, filings, correspondence — into a single context. Ask for a risks-and-reps summary with clause citations. Previously required a retrieval layer and lost cross-document coherence. Now a single call, with Code Execution available to run any numerical clause comparison needed.
  2. Whole-repository security audit tools. A 1M-token codebase (roughly 600k LoC) can be loaded with test files and configuration. Ask for a full OWASP vulnerability sweep. The model reads every file, traces data flows, and identifies injection points without a chunked retrieval approximation dropping relevant context.
  3. Multi-quarter financial narrative engines. Load five years of earnings transcripts, annual reports, and analyst notes for a company. Ask for an executive narrative comparing management guidance to delivery, with statistical trend lines computed via Code Execution. An Indian asset manager or UK wealth platform can generate this as a client-facing report without a data science team running the numbers separately.
  4. Multi-modal medical record summarisers. A patient record spanning years of clinical notes (text), diagnostic images (image tokens), and consultation recordings (audio tokens) can be loaded and summarised in one call. This is directly applicable to NHS secondary care triage tools or Apollo Hospitals-style digital health platforms, subject to appropriate data governance and regulatory review.
  5. Regulatory compliance diff engines. Load the current version and the proposed amended version of a regulatory framework — say, RBI master directions or FCA conduct rules — alongside your internal policy documents. Ask the model to identify every point where the amendment creates a policy gap, with Code Execution used to compute any quantitative threshold changes.

Getting started: a practical checklist

If you are evaluating Gemini 3.1 Ultra for a production workload, run through this checklist before committing engineering time:

  • Verify recall on your corpus. Do not assume 2M token recall quality holds for your document type. Run a needle-in-a-haystack test on a representative sample of your actual data before building around it.
  • Design for context caching from day one. Identify your stable prefix (system prompt, document corpus, instructions) and structure your API calls so that prefix is cacheable. At this context size, the cost difference between cached and uncached calls is the difference between viable and prohibitive.
  • Scope Code Execution tasks carefully. Code Execution is best for computation on data already in context. Map your data access pattern before deciding whether Code Execution replaces or complements your function-calling tools.
  • Access via Google AI Studio or Vertex AI. At time of writing, Gemini 3.1 Ultra is available through both. Vertex AI is the production path with enterprise SLAs; AI Studio is appropriate for prototyping and evaluation.
  • Plan for output format. If you need structured output (JSON, tables) from a Code Execution run, write your Python to print JSON to stdout and parse it in your calling code. The model generally follows explicit formatting instructions, but validate the parsed result instead of assuming it is well-formed.
  • Budget for latency. At 2M tokens, first-token latency on cold calls is material. For user-facing products, pre-warm context with a background call at session start. For batch pipelines, latency is less critical than throughput economics.
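For the output-format item in the checklist, the parsing side can be kept defensive. This is a minimal sketch that assumes you already have the executed code's stdout as a string; how that text is surfaced in the API response is not covered here, and `parse_stdout_json` is a hypothetical helper name.

```python
import json
from typing import Any

def parse_stdout_json(stdout: str) -> Any:
    """Return the last stdout line that parses as a JSON object or array."""
    for line in reversed(stdout.strip().splitlines()):
        candidate = line.strip()
        if candidate.startswith(("{", "[")):
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                continue  # looked like JSON but was not; keep scanning upward
    raise ValueError("no JSON object or array found in stdout")
```

Scanning from the last line tolerates scripts that print progress messages before the final result. It does assume the JSON is printed on a single line, which `json.dumps` does by default.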

Builders already using Gemini 1.5 Pro via Vertex AI will find the upgrade path straightforward — the API surface is consistent. Builders new to the Gemini ecosystem should start with Google AI Studio's free tier to validate that the 2M context window solves the specific bottleneck they are facing before moving to a paid Vertex AI deployment.