The problem: KV cache is eating your GPU budget
When a large language model generates text, it performs attention — a mechanism that lets each new token attend to everything that came before it. To avoid recomputing the representations of previous tokens on every step, modern serving systems cache the intermediate key and value tensors for each layer. This is the KV cache, and it sits at the centre of virtually every inference cost discussion in 2026.
The memory footprint of the KV cache is not fixed. It scales with three dimensions simultaneously: sequence length, batch size, and the number of attention heads across all layers. For a 70B-parameter model serving 16k-token contexts at FP16 precision, the KV cache for a batch of concurrent requests can consume 80–100 GB of GPU memory, which is more than an entire H100 (80 GB) holds before the model weights are even counted. Operators deal with this by limiting concurrent requests, capping context lengths, or buying more hardware. All three options are expensive.
The deeper issue is that GPU memory is the binding constraint on throughput, not compute. You can add FLOPs relatively cheaply by clustering GPUs. Memory bandwidth and capacity do not scale as gracefully. Serving cost per token is therefore dominated by how efficiently you use GPU VRAM, and the KV cache is the largest movable target in that equation. Work on long-context retrieval systems compounds this — agentic pipelines that maintain multi-turn state across thousands of tokens make the problem worse, not better.
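A rough rule of thumb: for most production models at FP16, KV cache memory per request is approximately 2 × (number of layers) × (number of KV heads) × (head dimension) × (sequence length) × 2 bytes, where the leading 2 counts keys and values and the trailing 2 is the size of one FP16 element. At 32k tokens on Llama 3 70B (80 layers, 8 KV heads, head dimension 128), that is roughly 10 GB per concurrent request; a model of the same shape without grouped-query attention (64 KV heads) would need about 80 GB. Compress that by 6× and you serve roughly six requests in the KV memory of one.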
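The rule of thumb above translates directly into a few lines of Python. This is an illustrative sketch: the function name is made up for this example, and the Llama 3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) is taken from the published model configuration rather than from the TurboQuant paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for a single request, in bytes."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 2**30

# Llama 3 70B with grouped-query attention: 8 KV heads.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32 * 1024)
# Same shape if every one of the 64 attention heads kept its own keys and values.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=32 * 1024)

print(f"GQA: {gqa / GIB:.1f} GiB per 32k-token request")  # GQA: 10.0 GiB
print(f"MHA: {mha / GIB:.1f} GiB per 32k-token request")  # MHA: 80.0 GiB
```

The number of KV heads matters as much as the sequence length, so it is worth plugging in the real configuration of the model you serve before estimating savings.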
What TurboQuant actually is
TurboQuant (arXiv 2504.19874) is a post-hoc, training-free KV cache compression method presented by Google Research at ICLR 2026. It operates entirely at the serving layer — the model's weights are never touched. When the model generates a new KV pair during a forward pass, TurboQuant intercepts those tensors, compresses them before they enter the cache, and decompresses them on retrieval before the attention computation. From the model's perspective, attention proceeds as normal. From the hardware's perspective, the memory footprint is roughly one-sixth of the original.
The headline number, a 6× reduction, comes from compressing FP16 (16-bit floating-point) key and value tensors down to 3-bit integers. The raw bit-width ratio is 16/3 ≈ 5.33×, which Google reports as roughly 6×. Fixed overheads such as quantisation metadata pull the effective ratio slightly below the raw figure rather than above it, although they become negligible on long sequences. The 4-bit variant delivers a more conservative but still substantial 4× reduction, and is the configuration used for the 8× attention-computation speedup on H100 GPUs cited in the paper.
Two techniques combine to make this work without accuracy loss: PolarQuant and QJL. They address different failure modes of naive low-bit quantisation, and understanding them separately makes it clear why neither alone would be sufficient.
PolarQuant: solving the outlier problem
The fundamental obstacle to quantising KV tensors to very low bit-widths is the distribution of their values. Raw key and value vectors are not uniformly distributed — they contain a small number of dimensions with very large magnitudes (commonly called outliers), and the rest cluster tightly near zero. This is the same problem that made naive 4-bit weight quantisation fail for LLMs before techniques like GPTQ and AWQ were developed.
When you quantise a tensor with outliers to 3 bits, you have exactly 8 (2^3) distinct values to represent the full dynamic range. If a handful of dimensions span ±1000 and the rest span ±1, your quantisation grid is calibrated to the outliers and the common values all collapse to the same bin. The result is that attention scores computed from the decompressed tensors diverge substantially from the full-precision values, and model quality falls sharply.
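The collapse is easy to reproduce. The snippet below is a minimal sketch of plain symmetric round-to-nearest quantisation, not TurboQuant's quantiser; the function name and the example vector are invented for illustration. A symmetric signed grid uses the codes -3 to 3, so 7 of the 8 available 3-bit values.

```python
def quantize_symmetric(values, bits):
    """Round each value to the nearest point on a symmetric signed grid."""
    qmax = 2 ** (bits - 1) - 1                 # 3 for signed 3-bit codes in [-3, 3]
    scale = max(abs(v) for v in values) / qmax  # grid step is set by the largest value
    codes = [round(v / scale) for v in values]
    restored = [c * scale for c in codes]
    return codes, restored


ordinary = [0.9, -0.7, 0.3, -0.2, 0.5, -0.9, 0.1]

# Without an outlier the seven values land on seven different codes.
clean_codes, _ = quantize_symmetric(ordinary, bits=3)
print(clean_codes)

# One outlier stretches the grid so far that every ordinary value rounds to 0.
codes, restored = quantize_symmetric(ordinary + [1000.0], bits=3)
print(codes)      # [0, 0, 0, 0, 0, 0, 0, 3]
```

After dequantisation the seven ordinary coordinates are all exactly zero, so any dot product involving them loses their contribution entirely.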
PolarQuant's solution is a randomised rotation applied to the KV vectors immediately before quantisation. Concretely, it multiplies each key or value vector by a random orthogonal matrix (or an approximation thereof that can be applied efficiently). This rotation does not change the geometric relationship between vectors: as long as the same rotation is applied to the queries, dot products, and therefore attention scores, are preserved exactly in the full-precision case. What it does change is the distribution of energy across dimensions. The large outlier values get spread across all dimensions, so every coordinate of the rotated vector falls in a similar, tightly concentrated range and the tensor quantises cleanly at 3 bits.
The rotation matrix is generated once at initialisation from a fixed random seed and applied in-place during the forward pass. It adds a small matrix-multiply overhead, but this is negligible compared to the memory bandwidth saved by storing and loading 3-bit rather than 16-bit tensors across the long-range attention computation.
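Both properties of the rotation (dot products preserved, outliers spread out) can be checked numerically. The following is a minimal sketch in pure Python, assuming a dense orthogonal matrix built with Gram-Schmidt from a fixed seed; a production kernel would more likely use a fast structured transform, and none of the names below come from the TurboQuant code base.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_orthogonal(dim, seed):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian vectors (orthonormal rows)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:                       # strip components along earlier rows
            proj = dot(v, u)
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    return [dot(row, vec) for row in matrix]


dim = 64
R = random_orthogonal(dim, seed=42)

key = [0.5] * dim
key[3] = 100.0                               # a single outlier dimension
query = [0.25 * (-1) ** i for i in range(dim)]

rk, rq = rotate(R, key), rotate(R, query)

# Same rotation on both sides: the attention logit is unchanged up to rounding.
print(dot(query, key), dot(rq, rk))
# The largest coordinate shrinks because the outlier's energy is shared out.
print(max(abs(x) for x in key), max(abs(x) for x in rk))
```

Because the seed fully determines the matrix, the rotation can be regenerated at start-up instead of being stored alongside the cache.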
QJL: the Johnson-Lindenstrauss sketch
PolarQuant solves the distribution problem. QJL (Quantized Johnson-Lindenstrauss) solves a complementary problem: the effective dimensionality of KV representations is much lower than the raw tensor dimension suggests. Most of the information content can be captured in far fewer dimensions, and attention scores can be computed accurately on compressed representations if the compression is done correctly.
The Johnson-Lindenstrauss lemma is a classical result in high-dimensional geometry, widely used in dimensionality reduction and compressed sensing. It states that a set of high-dimensional points can be projected into a much lower-dimensional space using a random linear map, and pairwise distances (and therefore inner products, and therefore attention scores) are approximately preserved with high probability. The approximation quality is controlled by the target dimension, and for the typical head dimensions used in modern LLMs (64–128), the JL sketch can reduce dimensionality substantially while keeping the attention score error below the threshold that causes measurable accuracy degradation.
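A plain Gaussian JL projection shows the effect. This is a sketch of the textbook construction, not of the QJL kernel itself, and the dimensions (1024 down to 256) are deliberately larger than a real attention head so the behaviour is easy to see; all names are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(vec):
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]


def jl_matrix(src_dim, sketch_dim, seed):
    """Gaussian random map; the 1/sqrt(sketch_dim) scale keeps inner products unbiased."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(sketch_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(src_dim)]
            for _ in range(sketch_dim)]


def project(matrix, vec):
    return [dot(row, vec) for row in matrix]


src_dim, sketch_dim = 1024, 256
rng = random.Random(0)
x = unit([rng.gauss(0.0, 1.0) for _ in range(src_dim)])
noise = unit([rng.gauss(0.0, 1.0) for _ in range(src_dim)])
y = unit([a + 0.5 * b for a, b in zip(x, noise)])   # cosine similarity close to 0.9

S = jl_matrix(src_dim, sketch_dim, seed=1)
exact = dot(x, y)
approx = dot(project(S, x), project(S, y))
print(f"exact={exact:.3f}  sketched={approx:.3f}")
```

The error of the sketched inner product shrinks roughly as one over the square root of the sketch dimension, which is why the sketch dimension is the main accuracy knob.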
QJL applies this sketch to the compressed KV representations, targeting the attention computation specifically. Rather than decompressing fully to FP16 and computing attention in full precision, the attention scores are approximated using the JL-sketched representations. This is where the 8× attention speedup (4-bit config) on H100s comes from: the memory bandwidth reduction and reduced arithmetic together make attention dramatically faster when sequences are long.
Standard FP16 stores 16 bits per element. 3-bit integer quantisation stores 3 bits per element. The raw ratio is 16/3 ≈ 5.33×. On sequences long enough that the minor per-head fixed overhead (rotation matrix, dequantisation metadata) is amortised, typically above 2k tokens, the effective memory saving approaches that raw ratio, which is the basis for the quoted 6× headline figure. At 4-bit precision the ratio is 4×, and the reduction in memory traffic is the dominant contribution to the H100 benchmark numbers.
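The amortisation argument can be made concrete with a little arithmetic. The sketch below assumes a hypothetical fixed overhead of 4 KiB per head; that figure is invented for illustration and is not taken from the paper.

```python
def effective_ratio(seq_len, head_dim=128, bits=3, fixed_overhead_bytes=0):
    """FP16 bytes divided by compressed bytes for one head's cached keys (or values)."""
    fp16_bytes = seq_len * head_dim * 2
    packed_bytes = seq_len * head_dim * bits / 8 + fixed_overhead_bytes
    return fp16_bytes / packed_bytes


# No overhead: the raw bit-width ratio, independent of sequence length.
print(effective_ratio(4096))                 # 16/3, about 5.33

# Hypothetical 4 KiB of fixed metadata per head: the ratio climbs towards 16/3
# as the sequence grows, but never exceeds it.
for seq_len in (256, 2048, 32768):
    ratio = effective_ratio(seq_len, fixed_overhead_bytes=4096)
    print(seq_len, round(ratio, 2))
```

Under these assumptions the ratio is 4.0 at 256 tokens and 5.12 at 2,048 tokens, which illustrates why short prompts see noticeably less benefit than long ones.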
Benchmark performance: what the paper reports
The ICLR 2026 paper evaluates TurboQuant across a range of standard language-modelling and downstream benchmarks, including perplexity on WikiText-2 and PTB, and accuracy on MMLU, HellaSwag, and WinoGrande. The headline claim is zero measurable accuracy degradation at 3-bit precision compared to FP16 baseline. The paper's evaluation covers Llama 2 7B, 13B, and 70B, as well as Mistral 7B.
The comparison table below summarises the key trade-offs across KV cache precision configurations. Numbers reflect the TurboQuant paper's H100 benchmarks unless noted.
| Configuration | Bits per element | Memory reduction vs FP16 | Attention speedup | Accuracy | Retraining needed |
|---|---|---|---|---|---|
| FP16 baseline | 16 | 1× (reference) | 1× (reference) | Full | n/a |
| INT8 (naive) | 8 | 2× | ~1.5–2× | Near-lossless | No |
| INT4 (naive) | 4 | 4× | ~3–4× | Measurable degradation on long context | No |
| TurboQuant 4-bit | 4 | 4× | 8× | No measurable degradation | No |
| TurboQuant 3-bit | 3 | ~6× | ~6× | No measurable degradation | No |
The distinction between TurboQuant 4-bit and naive INT4 is instructive: both store 4 bits per element, but naive INT4 does not handle outliers and therefore degrades on long contexts, while TurboQuant's PolarQuant rotation neutralises that failure mode entirely. The speedup difference (3–4× naive versus 8× TurboQuant 4-bit) comes from the QJL attention kernel, which is optimised for the H100's memory hierarchy in a way that generic dequantisation is not.
Integration with vLLM: how to use it today
vLLM is one of the most widely adopted open-source LLM serving frameworks. An integration pull request for TurboQuant is currently open against the main vLLM repository. While it awaits full merge, you can already test it by checking out the TurboQuant fork and passing the relevant configuration flags to the engine.
The configuration is minimal. The fork exposes five parameters: the KV quantisation bit-width, a switch for the PolarQuant rotation, a switch for the QJL attention kernel, the JL sketch dimension, and the seed for the rotation matrix. For most models, the defaults from the paper work well without tuning.
```python
# vLLM server with TurboQuant 3-bit KV cache compression
# Requires: pip install vllm-turboquant (or the TurboQuant vLLM fork)
# Note: the KVCacheConfig fields below come from the fork and may change before merge.
from vllm import LLM, SamplingParams
from vllm.config import KVCacheConfig

# TurboQuant configuration
kv_cache_config = KVCacheConfig(
    # Compress KV cache to 3-bit integer precision
    kv_quant_bits=3,
    # Enable PolarQuant rotation (handles outliers)
    polar_quant=True,
    # Enable QJL attention kernel (the paper's 8x H100 figure is for the 4-bit config)
    qjl_attention=True,
    # JL sketch dimension: default 64 works for most models.
    # Increase to 96 or 128 for very long contexts (>32k tokens)
    # if you observe accuracy drift on your eval set.
    jl_sketch_dim=64,
    # Random seed for the rotation matrix.
    # Fix this for reproducible behaviour across restarts.
    rotation_seed=42,
)

llm = LLM(
    # Hugging Face repository id for Llama 3 70B Instruct
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    kv_cache_config=kv_cache_config,
    # With 6x KV cache compression, you can raise max_model_len
    # substantially relative to what FP16 would allow on the same GPU.
    max_model_len=65536,
    gpu_memory_utilization=0.90,
    tensor_parallel_size=2,  # 2 x H100 instead of the usual 4 for 70B
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Explain the Johnson-Lindenstrauss lemma in plain English."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
For llama.cpp, a similar integration PR is open. The flag is --kv-quant turboquant-3bit when launching the server. If you are on a CPU-only deployment or using GGUF models, the llama.cpp integration is the correct path — the QJL kernel is GPU-specific, but PolarQuant applies on CPU as well for the memory saving, with a more modest throughput benefit.
The integration PRs are not yet merged at time of writing (May 2026). Check the TurboQuant GitHub and the vLLM PR tracker before building a production dependency on specific version numbers. The API surface may change slightly between the fork and the merged implementation. The paper's open-source release is stable; the serving integration is moving fast.
What this means for Indian cloud GPU users
For builders in India running on shared GPU cloud providers — whether that is AWS ap-south-1, GCP asia-south1, or regional providers such as E2E Networks, Jio Cloud, or Yotta — GPU memory is both expensive and contended. Spot and reserved GPU instances in Indian regions carry a premium over US/EU equivalents, and VRAM-constrained configurations are often the only affordable tier.
A 6× KV cache compression translates directly into one of three operational improvements, and you choose which one you want:
- Serve 6× more concurrent users per GPU. If your current bottleneck is VRAM for the KV cache, you can increase batch size roughly sixfold without adding hardware. For a startup paying per-GPU-hour, this is a direct reduction in serving cost per request.
- Run 6× longer contexts at the same cost. If your application needs long-context capabilities — legal document analysis, multi-turn assistants, agentic RAG pipelines — you can extend your effective context window substantially without moving to a larger or more expensive GPU tier.
- Downsize the GPU tier. A model that previously required an 80 GB H100 for a given batch size and context length may now fit on a 40 GB A100, which is meaningfully cheaper at most cloud providers in the region.
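To see which of these options a given deployment can actually afford, it helps to turn the memory budget into a request count. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a hypothetical 40 GiB of VRAM left for the KV cache after weights, and 10 GiB of FP16 cache per 32k-token request on Llama 3 70B (2 × 80 layers × 8 KV heads × 128 × 32,768 tokens × 2 bytes). It ignores quantisation metadata and allocator fragmentation.

```python
GIB = 2**30


def max_concurrent_requests(kv_budget_bytes, fp16_bytes_per_request, bits):
    """Requests that fit when each cached element takes `bits` instead of 16 bits."""
    compressed = fp16_bytes_per_request * bits // 16
    return kv_budget_bytes // compressed


kv_budget = 40 * GIB      # hypothetical VRAM left for the KV cache after weights
per_request = 10 * GIB    # FP16 cache for one 32k-token request (see lead-in)

for bits in (16, 4, 3):
    print(f"{bits:>2}-bit: {max_concurrent_requests(kv_budget, per_request, bits)} requests")
# 16-bit: 4 requests
#  4-bit: 16 requests
#  3-bit: 21 requests
```

On bit-width alone the 3-bit configuration yields a little over five times the concurrency of FP16 in this example, so treat the sixfold figure as a ceiling rather than a guarantee when sizing hardware.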
The no-retraining property is particularly significant for builders working with open-weight models that have already been fine-tuned on domain-specific data. Custom checkpoints are often the most valuable artefact in a production AI system, and any optimisation that requires retraining them carries a hidden cost in compute, data, and iteration time. TurboQuant applies to any checkpoint without modification.
What this means for UK AIRR compute allocation holders
The UK AI Research Resource (AIRR) provides compute allocations to academic groups and some industry researchers. AIRR allocations are time-bounded and measured in node-hours, which creates a direct incentive to maximise the research output per allocated GPU-hour. TurboQuant changes the feasible experiment space in at least two ways.
First, longer-context experiments become affordable. If your research involves models attending over 32k, 64k, or 128k tokens — increasingly common in retrieval, reasoning, and document-processing research — the KV cache is likely your binding constraint rather than the model weights. A 6× cache reduction means experiments that previously required 4 or 8 GPUs may fit on 1 or 2, extending the effective reach of a fixed allocation.
Second, ablation studies over large models become more tractable. Research workflows often require running the same model at different context lengths or batch sizes across multiple seeds. With TurboQuant, the configuration space you can explore per allocation is substantially larger. The no-retraining property also means you can apply TurboQuant retroactively to existing checkpoints without consuming additional training compute.
"We had been using paged attention and prefix caching to manage KV memory, which helped at the 8k-token range. But at 32k we were still getting OOM on two A100s. TurboQuant's 3-bit config got us to 32k on a single A100 with no measurable quality change on our internal eval set — the legal clause extraction task we use as a proxy. The PolarQuant rotation adds maybe 2% overhead on the prefill step. That's a rounding error compared to what we saved on hardware."
-- Aarav, Senior Builder · Bengaluru, India
The trajectory: where inference costs are heading
TurboQuant is the most significant KV cache result since paged attention (introduced in vLLM in 2023), but it sits within a broader pattern. The last 18 months have seen a sustained campaign of inference-side optimisations — speculative decoding, continuous batching, weight quantisation, flash attention variants, and now KV quantisation — each compressing the cost-per-token curve independently. The cumulative effect is that serving frontier-class models is dramatically cheaper in 2026 than it was in 2023, even as the models themselves have grown.
The research trajectory suggests this is not slowing down. Several groups are working on mixed-precision KV caching — storing recent tokens at higher precision and older tokens at lower precision, since attention scores decay with distance in most tasks. Others are combining KV quantisation with KV eviction (selectively dropping less-attended cache entries) for multiplicative rather than additive savings. TurboQuant's open-source release and the active vLLM integration PR signal that these techniques are moving rapidly from research artefacts to production defaults.
For builders, the implication is that the serving cost envelope is still improving. Infrastructure decisions made today — particularly around GPU tier and context-length limits — should account for the likelihood that serving the same workload will be considerably cheaper in 12 months. Locking into over-provisioned hardware to handle today's memory constraints may not be the right call. The more durable investment is in the evaluation infrastructure and model quality rather than raw compute capacity.
Where to go from here
The paper is available at arXiv 2504.19874 and the open-source implementation is on GitHub under the Google Research organisation. If you want to integrate into vLLM, track the open PR against the main branch — it is actively being reviewed. The llama.cpp PR is at a slightly earlier stage but moving quickly.
Before deploying to production, run TurboQuant against your own evaluation set at the precision you intend to use. The paper's benchmarks are comprehensive but general-purpose. If your application involves specialised vocabulary, unusual token distributions, or very long output sequences, validate independently. In practice, the no-accuracy-loss claim holds broadly, but your specific use case is the ground truth that matters.
Start with the 4-bit configuration rather than 3-bit in production. The 4-bit variant gives you the 8× attention speedup and a 4× memory reduction with a larger safety margin. Switch to 3-bit only after you have validated on your own eval set and confirmed the compression ratio gain justifies the closer-to-the-limit operating point. For most use cases, 4-bit is the better default and 3-bit is the ceiling to target once you have confidence.