The launch — and why the launch is not the story
On 19 May 2026 at Google I/O, Sundar Pichai introduced Gemini 3.5 Flash and the Gemini Spark personal-agent layer. The headline specs are familiar: a 1M-token context window, full multimodal input across text, image, video and audio, and throughput of around 280 tokens per second. The model became generally available on day one through Google AI Studio, Vertex AI, and Antigravity, Google's agent-first dev platform. Coverage so far has focused on the capability story: the model reportedly beats the previous Gemini 3.1 Pro on coding and agentic benchmarks (note: 3.1 Pro, not the current 3.5 Pro tier).
That framing buries the lede. The interesting line on Google's pricing page is this: $1.50 per million input tokens, $9.00 per million output tokens. Compared with Gemini 3.1 Flash ($0.50 input, $3.00 output), that is roughly a 3× jump per token. The decision sitting on every CTO's desk is no longer "should we try the new model?", since Google has handed it to us free of charge in AI Studio. It is "should 3.5 Flash still be the default our application reaches for on every request, given it now costs three times as much as the model it replaces?"
For the original feature breakdown of the launch event itself, see our earlier piece on the 3.5 Flash and Spark debut at I/O 2026. This article is about what to do about the bill.
Before you change a single line of routing code, run last month's API traffic through a spreadsheet that multiplies token volume by the new prices. A B2B SaaS doing 100k Flash calls per day with average 4k-in / 1k-out tokens per call is looking at a difference of roughly $360,000 per year (about £280,000, or about 3 crore rupees) between staying on 3.1 Flash and migrating to 3.5 Flash. That is gross margin, not marketing budget.
The cost-per-task framework
Per-MTok pricing is the wrong unit for product decisions. The unit that matters is cost per completed task — a search, a classification, a summary, a generated email, a code-review comment. To get there, you need three numbers for every workflow in your product:
- Average input tokens per task — includes the system prompt, retrieved context, user turn, and any tool-call payloads.
- Average output tokens per task — the actual generation, not the maximum you set.
- Tasks per day — over a representative 7-day window, with weekend variance accounted for.
Multiply those by the per-MTok prices and you get a defensible monthly bill, per workflow, per candidate model. Doing this once for every workflow in your product is the single highest-leverage exercise an engineering organisation can run this quarter. Most teams discover that two or three workflows account for the overwhelming majority of their spend — and those workflows are almost always the wrong place for a premium model.
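As a minimal sketch of that multiplication, the three numbers and a price pair are enough to produce a monthly bill per workflow. The `Price` class and the function names are illustrative, not part of any vendor SDK, the 30-day month is an assumption, and the prices are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per million tokens (MTok)."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_task(price: Price, avg_in: float, avg_out: float) -> float:
    """USD cost of one task, given average input and output tokens."""
    return (avg_in * price.input_per_mtok + avg_out * price.output_per_mtok) / 1_000_000


def monthly_bill(price: Price, avg_in: float, avg_out: float,
                 tasks_per_day: float, days: int = 30) -> float:
    """USD bill for one workflow over `days` days (30 is an assumption)."""
    return cost_per_task(price, avg_in, avg_out) * tasks_per_day * days


# Prices as quoted in this article; check the vendor's pricing page before relying on them.
GEMINI_35_FLASH = Price(input_per_mtok=1.50, output_per_mtok=9.00)
GEMINI_31_FLASH = Price(input_per_mtok=0.50, output_per_mtok=3.00)

if __name__ == "__main__":
    for name, price in [("3.5 Flash", GEMINI_35_FLASH), ("3.1 Flash", GEMINI_31_FLASH)]:
        bill = monthly_bill(price, avg_in=4_000, avg_out=1_000, tasks_per_day=100_000)
        print(f"{name}: ${bill:,.0f} per month")
```

For the 100k-calls-per-day profile used earlier (4k in, 1k out), this works out to about $45,000 per month on 3.5 Flash and about $15,000 on 3.1 Flash. Swapping in averages from your own logs is the whole exercise.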
Pricing comparison — May 2026 frontier and value tiers
Here is where the major models sit today on per-MTok pricing. Prices are list, in US dollars, for hosted API access. Self-hosted DeepSeek V4 Pro is amortised at typical GPU rental rates on a representative workload. Figures for non-Gemini models are indicative based on the most recent public price sheets — always cross-check the vendor's pricing page before any procurement decision, since list prices move.
| Model | Input ($/MTok) | Output ($/MTok) | Context | Best for |
|---|---|---|---|---|
| Gemini 3.5 Flash | $1.50 | $9.00 | 1M | Agentic + multimodal default |
| Gemini 3.1 Flash | $0.50 | $3.00 | 1M | High-volume classification, extraction |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200k | Cheap reasoning, tool calling |
| GPT-5.5 Mini | $0.80 | $4.00 | 400k | Chat, drafting, summarisation |
| Claude Opus 4.7 | $5.00 | $25.00 | 1M | Long-context coding, compliance review |
| GPT-5.5 (frontier) | $10.00 | $30.00 | 400k | Hard reasoning, scientific workloads |
| DeepSeek V4 Pro (self-host) | ~$0.15 amortised | ~$0.60 amortised | 256k | Predictable steady-state workloads |
The Gemini 3.1 Flash row is the one that should make you sit up. For workloads that do not need agentic coding or multimodal reasoning (think classification, intent detection, named-entity extraction, lightweight summarisation), staying on 3.1 Flash costs one-third as much per token, for a quality drop most users will not detect. Do not migrate by default just because a new version shipped.
A worked example — May 2026 default-model picks by task type
The next table is the deliverable to take to your engineering leadership meeting. It maps common SaaS workflows to a recommended default model based on cost per 1k tasks. Assumptions: average input and output tokens are realistic estimates for production traffic, not toy benchmarks.
| Task type | Typical tokens (in / out) | Recommended default | Cost / 1k tasks |
|---|---|---|---|
| Support-ticket classification | 800 / 50 | Gemini 3.1 Flash | ~$0.55 |
| Product-catalogue enrichment | 1.5k / 300 | Claude Haiku 4.5 | ~$3.00 |
| Customer-email drafting | 2k / 600 | GPT-5.5 Mini | ~$4.00 |
| Multimodal product Q&A (image + text) | 4k / 400 | Gemini 3.5 Flash | ~$9.60 |
| Agentic coding task (mid-size) | 15k / 3k | Gemini 3.5 Flash or Claude Opus 4.7 | ~$49.50 (3.5 Flash) / ~$150 (Opus 4.7) |
| Compliance / contract review | 500k / 10k | Claude Opus 4.7 (cached) | ~$2,750 (cold) at the list prices above; cached input lowers this, with a floor of ~$250 from output tokens |
| Bulk steady-state classification (1M+/day) | 800 / 50 | DeepSeek V4 Pro self-host | ~$0.15 |
Read that table as a starting point, not gospel. Your traffic mix, latency requirements, and procurement constraints will shift recommendations. The exercise that matters is doing the same maths against your own log files.
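One way to redo that maths against your own numbers is a small comparison script. The sketch below assumes the hosted list prices from the pricing table above (which may have moved since), and the function names are illustrative. It ranks models on price only; whether a model is capable enough for the task is a separate judgement.

```python
# (input, output) list prices in USD per million tokens, copied from the
# pricing table in this article. Treat them as indicative, not authoritative.
PRICES = {
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Gemini 3.1 Flash": (0.50, 3.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.5 Mini": (0.80, 4.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def cost_per_1k_tasks(model: str, tokens_in: float, tokens_out: float) -> float:
    """USD cost of 1,000 tasks with the given average token counts."""
    price_in, price_out = PRICES[model]
    per_task = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return 1_000 * per_task


def compare(tokens_in: float, tokens_out: float) -> list:
    """Return (model, cost per 1k tasks) pairs, cheapest first."""
    rows = [(m, cost_per_1k_tasks(m, tokens_in, tokens_out)) for m in PRICES]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Support-ticket classification profile from the table: 800 in / 50 out.
    for model, cost in compare(800, 50):
        print(f"{model:<18} ${cost:,.2f} per 1k tasks")
```

Replace the token profile with the averages from your logs, one call per workflow, and you have the same table for your own product.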
Where 3.5 Flash genuinely earns the premium
The 3× price jump is real, but so is the capability lift. Three workloads where 3.5 Flash is straightforwardly the right call:
- Multimodal Q&A at scale — image, video and audio inputs in a single 1M-context window, cheaper than any other frontier model that can match the modality coverage. A retail catalogue assistant that needs to reason over product photos plus spec sheets plus a customer message has no obvious competitor at this price.
- Agentic coding inside Antigravity — the throughput, the tooling integration and the model's reported gains over 3.1 Pro on agentic benchmarks make 3.5 Flash a reasonable default for autonomous coding loops. For long-running agents over very large repositories, our coding agent leaderboard for May 2026 still shows Claude Opus 4.7 with the 1M-context edge on the hardest tasks.
- Tasks where 3.1 Flash genuinely fails — for a class of agentic and reasoning workloads, internal benchmarks from early adopters suggest 3.1 Flash plateaus while 3.5 Flash delivers a materially higher completion rate. If your product economics live or die on that completion-rate delta, run your own evals on a representative task set — and if 3.5 clears the bar your product needs, the price is no longer the deciding factor.
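The completion-rate argument in the last bullet can be made concrete with a rough expected-cost model. This is a sketch under stated assumptions, not a measured result: it assumes every failed task is handed to some fallback (a retry on another model, or a human) at a fixed cost, and the completion rates in the example are hypothetical. The per-attempt costs use the 15k-in / 3k-out agentic profile at the list prices quoted in this article.

```python
def expected_cost_per_task(model_cost: float, completion_rate: float,
                           fallback_cost: float) -> float:
    """Model spend per attempt plus the expected cost of handling failures."""
    if not 0.0 <= completion_rate <= 1.0:
        raise ValueError("completion_rate must be between 0 and 1")
    return model_cost + (1.0 - completion_rate) * fallback_cost


def break_even_fallback_cost(cheap_cost: float, cheap_rate: float,
                             premium_cost: float, premium_rate: float) -> float:
    """Fallback cost per failure above which the premium model is cheaper overall."""
    if premium_rate <= cheap_rate:
        raise ValueError("premium model must complete more tasks than the cheap one")
    return (premium_cost - cheap_cost) / (premium_rate - cheap_rate)


if __name__ == "__main__":
    # Per-attempt cost at 15k in / 3k out:
    #   3.1 Flash: (15_000 * 0.50 + 3_000 * 3.00) / 1e6 = $0.0165
    #   3.5 Flash: (15_000 * 1.50 + 3_000 * 9.00) / 1e6 = $0.0495
    cheap_cost, premium_cost = 0.0165, 0.0495
    cheap_rate, premium_rate = 0.70, 0.90  # hypothetical completion rates

    threshold = break_even_fallback_cost(cheap_cost, cheap_rate,
                                         premium_cost, premium_rate)
    print(f"Premium model pays off once a failure costs more than ${threshold:.3f}")
```

Under those assumed rates the threshold is about $0.165 per failed task. If a failure in your product means a human touches the ticket, you are almost certainly above it; if a failure is a silent retry, you may not be.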
"We have three workflows in our application. Two are summarisation and ticket classification — they stayed on 3.1 Flash, no user complaints. The third is an agentic loop that reads a customer's invoice PDF, queries our database, and drafts a refund decision. That one moved to 3.5 Flash on day one. Our overall Gemini bill went up about 18%, not 200%, and the agentic loop's completion rate jumped from 71% to 89%. That is a textbook split-routing win."
— Verified Builder · Bengaluru, IN
Gemini Spark — what it is and what it isn't
Spark is the consumer-facing news inside the launch. It is a persistent personal agent built on Gemini 3.5 Flash, available US-only at launch to Google AI Ultra subscribers. The Ultra tier sits at $100 per month; Spark is also bundled in the $200 tier (reduced from $250). Crucially, Spark is not a separate API model. Builders cannot buy Spark; they can buy 3.5 Flash on Vertex AI and build their own agent layer on top.
For IN and UK product teams, the practical takeaway from Spark is what it demonstrates: Google trusts 3.5 Flash enough to run its own consumer agent product on it. That is a useful credibility signal when you are pitching the same model to your own platform team.
The gross-margin lever for IN and UK product teams
At consumer-scale traffic this is not a footnote; it is a CFO conversation. A B2B SaaS organisation in Bengaluru making 100k API calls per day, averaging 4k-in / 1k-out tokens per call, spends roughly $45,000 per month at 3.5 Flash's $1.50/$9.00 pricing, versus roughly $15,000 per month on 3.1 Flash at indicative one-third pricing. The gap is about $30,000 per month; annualised, that is around $360,000, or close to 3 crore rupees, of pure infra-cost difference. For a UK fintech doing the same volume the comparable figure is around £280,000.
That number does not include caching, batching or output-token discipline — all three can compress it further. Our deep-dive on how to cut LLM API costs through caching, batching and routing is the operational companion to this strategic piece. The broader trend on inference costs and product profitability in 2026 is also worth reading before any large procurement decision.
For teams with truly steady-state high-volume workloads, the realistic alternative is no longer "go cheaper hosted" but "go self-hosted". DeepSeek V4 Pro on a small fleet of H100s amortises to a fraction of any hosted price, and the operational overhead has dropped meaningfully in the last six months. It will not replace Gemini 3.5 Flash for agentic multimodal workloads, but it does not need to. It needs to win on the boring 80% of your traffic.
Data residency and day-one access — the IN / UK angle
Indian product teams get Vertex AI access in the asia-south1 (Mumbai) region from day one. UK teams using Vertex AI's europe-west2 (London) region get the same day-one availability with UK data-residency guarantees that satisfy most enterprise procurement teams. Antigravity availability is global from the launch. There is no India- or UK-specific delay this time round, which is itself notable — historically Google's frontier model rollouts have lagged outside the US by weeks.
The practical upshot: there is no procurement excuse to delay your cost-per-task audit. Both pricing tiers (3.1 Flash and 3.5 Flash) are available to bill against today, in your region, with your data-residency story intact.
So — what should you actually do this week?
- Audit your top three API workflows. Pull last month's logs, calculate average tokens in and out per call, and build the cost-per-1k-tasks table for each candidate model.
- Split-route, do not switch wholesale. Migrate the workflows that genuinely need 3.5 Flash's agentic and multimodal lift; keep the rest on 3.1 Flash or a cheaper Haiku / GPT-5.5 Mini equivalent.
- Make caching and batching non-negotiable for any workflow that runs more than a few thousand times a day. The compounding savings here outweigh nearly every model-choice decision.
- Revisit the audit quarterly. Pricing tiers move; capability gaps close. The cost-per-task framework is the artefact you keep — the model picks underneath it will change.
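The split-routing step in the list above can be as small as a lookup table in front of your API client. The sketch below is illustrative only: the workflow names are hypothetical, and the model strings are placeholders rather than confirmed API identifiers, so substitute whatever your provider's documentation lists.

```python
from typing import Dict

# Placeholder model identifiers (assumptions, not real API model IDs).
CHEAP_MODEL = "gemini-3.1-flash"
PREMIUM_MODEL = "gemini-3.5-flash"

# Hypothetical workflows mapped to the model chosen by the cost-per-task audit.
ROUTES: Dict[str, str] = {
    "ticket_classification": CHEAP_MODEL,
    "summarisation": CHEAP_MODEL,
    "refund_agent": PREMIUM_MODEL,
}


def pick_model(workflow: str, needs_multimodal: bool = False) -> str:
    """Choose a model per request instead of one global default.

    Unknown workflows fall back to the cheap model, so a new feature has to
    justify the premium tier explicitly rather than inherit it.
    """
    if needs_multimodal:
        return PREMIUM_MODEL
    return ROUTES.get(workflow, CHEAP_MODEL)


if __name__ == "__main__":
    for wf in ("ticket_classification", "refund_agent", "brand_new_feature"):
        print(wf, "->", pick_model(wf))
```

Keeping the table in configuration rather than code makes the quarterly audit a one-line change per workflow.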
Primary sources: Google's Gemini 3.5 launch post, Simon Willison's May 19 write-up on the day-one developer experience, and Tech Times's coverage of the 3× pricing line that prompted this article.