A question that comes up a lot: why does a 7B model that “fits in 14 GB” fall over when you serve it to a handful of users? The answer is almost always the KV cache.
What it caches
In a transformer, each token attends to every previous token via keys and values. Naively, generating token n recomputes the keys and values for tokens 1..n-1 every step — quadratic and wasteful. So we cache them: compute K and V once per token, keep them around, and each new token just attends against the stored cache.
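That cuts the K/V projection work over an n-token generation from O(n²) to O(n), which is what makes generation tractable. Each step still attends over every cached entry, so the attention read itself stays linear in the current length. And the cache has to live somewhere.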
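To make the saving concrete, here is a toy sketch in plain Python: one attention head, one layer, random vectors standing in for token embeddings. The names (`project`, `attend`, `decode_naive`, `decode_cached`) are made up for illustration and the sizes are arbitrary. It checks that the cached path produces the same outputs as the recompute-everything path while doing far fewer K/V projections.

```python
import math
import random

def project(x, w):
    """Return x @ w for a vector x and a matrix w given as a list of rows."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Softmax(q . k / sqrt(d)) weighted average of the value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_naive(xs, wq, wk, wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for n in range(1, len(xs) + 1):
        prefix = xs[:n]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        projections += 2 * n  # n key projections + n value projections
        outputs.append(attend(project(prefix[-1], wq), keys, values))
    return outputs, projections

def decode_cached(xs, wq, wk, wv):
    """Project each token once and append its K and V to a growing cache."""
    k_cache, v_cache = [], []
    outputs, projections = [], 0
    for x in xs:
        k_cache.append(project(x, wk))
        v_cache.append(project(x, wv))
        projections += 2  # one key projection + one value projection
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs, projections

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, n_tokens = 4, 8  # toy sizes, chosen arbitrarily
wq, wk, wv = (random_matrix(rng, d, d) for _ in range(3))
xs = random_matrix(rng, n_tokens, d)

naive_out, naive_proj = decode_naive(xs, wq, wk, wv)
cached_out, cached_proj = decode_cached(xs, wq, wk, wv)
print(naive_proj, cached_proj)  # 72 16
```

For 8 tokens the naive loop does 72 projections and the cached loop does 16, and the gap widens quadratically with length. Note what the cached version pays for that: `k_cache` and `v_cache` only ever grow.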
The memory math
Per token, the cache stores K and V for every layer and every KV head (with plain multi-head attention, n_kv_heads is just n_heads):
bytes per token ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × dtype_size
For a 7B-class model in fp16 (say 32 layers, 32 heads, head_dim 128, 2 bytes per value) that’s 512 KiB per token. A full 4k context is then about 2 GiB per request. Multiply by every concurrent request and the cache, not the weights, is what blows your budget. The weights are a fixed cost; the KV cache scales with batch × sequence_length.
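Here is that arithmetic as a small calculator. The model shape (32 layers, 32 K/V heads, head_dim 128) is an assumed, typical 7B-class configuration rather than a specific model's spec, and `n_kv_heads` means the number of K/V heads, which equals the head count under plain multi-head attention. The function names are illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_size):
    """Bytes of KV cache added per token; the factor 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_size

def kv_cache_bytes(per_token, seq_len, batch):
    """Total cache size when every request in the batch holds seq_len tokens."""
    return per_token * seq_len * batch

GIB = 1024 ** 3

# Assumed 7B-class shape in fp16 (2 bytes per value).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, dtype_size=2)
weights = 7_000_000_000 * 2  # 7B parameters in fp16, i.e. the "14 GB"

print(f"per token: {per_token / 1024:.0f} KiB")
for batch in (1, 4, 8):
    cache = kv_cache_bytes(per_token, seq_len=4096, batch=batch)
    print(f"batch {batch}: cache {cache / GIB:.0f} GiB, weights {weights / GIB:.1f} GiB")
```

Under these assumptions a single full-context request holds 2 GiB of cache, and eight of them hold 16 GiB, which is more than the weights. This is the worst case where every request fills the whole 4k window; shorter requests cost proportionally less.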
What people do about it
- Paged attention (vLLM): manage the cache in fixed-size blocks like OS virtual memory, so you stop wasting the tail of every sequence’s allocation.
- GQA / MQA: share K/V heads across query heads, shrinking the cache by the sharing factor.
- Quantized cache: store K/V in int8/fp8 and eat a small quality hit.
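A rough sketch of how much each of these buys you, under the same assumed 7B-class shape (32 layers, head_dim 128). The paged part is pure bookkeeping: it compares reserving a full max-length slab per request with rounding each request up to whole blocks. It is not vLLM's allocator, and the block size of 16 tokens, the request lengths, and the function names are all assumptions for illustration. The quantized figure ignores any per-block scale metadata.

```python
import math

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_size):
    """Bytes of KV cache added per token; the factor 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_size

# GQA / MQA: fewer K/V heads means a proportionally smaller cache.
mha = kv_bytes_per_token(32, n_kv_heads=32, head_dim=128, dtype_size=2)
gqa = kv_bytes_per_token(32, n_kv_heads=8, head_dim=128, dtype_size=2)
print(mha // gqa)  # 4: four query heads share each K/V head

# Quantized cache: 1 byte per value instead of 2.
int8 = kv_bytes_per_token(32, n_kv_heads=32, head_dim=128, dtype_size=1)
print(mha // int8)  # 2

def reserved_tokens_contiguous(seq_lens, max_len):
    """Token slots reserved if every request gets a max_len slab up front."""
    return len(seq_lens) * max_len

def reserved_tokens_paged(seq_lens, block_size):
    """Token slots reserved if each request only holds whole blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [100, 900, 2500]  # hypothetical current lengths of three requests
used = sum(seq_lens)
contiguous = reserved_tokens_contiguous(seq_lens, max_len=4096)
paged = reserved_tokens_paged(seq_lens, block_size=16)
print(used, contiguous, paged)  # 3500 12288 3536
```

With paging, the waste per request is always less than one block, so here 36 slots are idle instead of 8,788. The three techniques stack: they shrink different factors of the same product.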
Once you internalise that decoding is memory-bound, in capacity because of the cache and in bandwidth because every step rereads the weights and the cache, a lot of serving decisions stop being mysterious.
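The bandwidth side can be put into rough numbers too. The sketch below assumes each decode step has to read the weights once (shared across the batch) and each request's cache once, and divides by memory bandwidth to get a lower bound on step time. The 1 TB/s figure is a hypothetical round number, not a spec for any particular GPU, and the estimate ignores compute, activations, and kernel overheads.

```python
def min_step_seconds(weight_bytes, kv_bytes_per_seq, batch, bandwidth_bytes_per_s):
    """Lower bound on one decode step from memory traffic alone."""
    return (weight_bytes + batch * kv_bytes_per_seq) / bandwidth_bytes_per_s

WEIGHTS = 14e9              # 7B parameters in fp16
KV_PER_SEQ = 2 * 1024 ** 3  # one request at a full 4k context, assumed 7B shape
BANDWIDTH = 1e12            # hypothetical 1 TB/s of memory bandwidth

for batch in (1, 8, 32):
    step = min_step_seconds(WEIGHTS, KV_PER_SEQ, batch, BANDWIDTH)
    print(f"batch {batch}: >= {step * 1e3:.1f} ms/step, "
          f"<= {batch / step:.0f} tokens/s total")
```

Batching amortises the weight read across requests, which is why aggregate throughput climbs with batch size, but each extra request adds its own cache traffic, so the gain flattens as the cache comes to dominate the bytes moved per step.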