KV cache, and why LLM inference is memory-bound

A question that comes up a lot: why does a 7B model that “fits in 14 GB” fall over when you serve it to a handful of users? The answer is almost always the KV cache.

What it caches

In a transformer, each token attends to every previous token via keys and values. Naively, generating token n recomputes the keys and values for tokens 1..n-1 at every step, which is quadratic over the whole sequence and wasteful. So we cache them: compute K and V once per token, keep them around, and each new token just attends against the stored cache.

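That turns the total K/V projection work for a sequence from O(n²) into O(n), which is the whole reason generation is tractable. Attention itself still reads every cached entry at each step, so the cache is touched constantly. And it has to live somewhere.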
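
To make the saving concrete, here is a minimal sketch of a single attention head decoding with and without a cache. It is a toy under stated assumptions: one layer, one head, random weights, plain Python lists, and made-up helper names (`decode_without_cache`, `decode_with_cache`); real implementations batch this on a GPU. The point is only that both paths give the same outputs while the cached one projects each token's K and V once.

```python
import math
import random

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all given keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode_without_cache(xs, Wq, Wk, Wv):
    # Naive path: re-project K and V for every token seen so far, each step.
    outputs, projections = [], 0
    for n in range(1, len(xs) + 1):
        keys = [matvec(Wk, x) for x in xs[:n]]
        values = [matvec(Wv, x) for x in xs[:n]]
        projections += n
        outputs.append(attend(matvec(Wq, xs[n - 1]), keys, values))
    return outputs, projections

def decode_with_cache(xs, Wq, Wk, Wv):
    # Cached path: project K and V once per token and append to the cache.
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in xs:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        projections += 1
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, projections

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 4  # toy head dimension (assumption for illustration)
Wq, Wk, Wv = (random_matrix(rng, d, d) for _ in range(3))
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

full, full_proj = decode_without_cache(xs, Wq, Wk, Wv)
cached, cached_proj = decode_with_cache(xs, Wq, Wk, Wv)
print(full_proj, cached_proj)  # 21 K/V projections vs 6 for six tokens
```

Note that `attend` still loops over the whole cache at every step. The cache removes the repeated projections, not the reads.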

The memory math

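Per token, the cache stores one K vector and one V vector for every layer and every K/V head:

bytes per token ≈ 2 (K and V) × n_layers × n_heads × head_dim × dtype_size

Here n_heads counts K/V heads, which equals the number of query heads in plain multi-head attention.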

For a 7B-class model in fp16 that’s on the order of hundreds of KB per token: assuming 32 layers, 32 heads and head_dim 128, it comes to 2 × 32 × 32 × 128 × 2 bytes = 512 KiB. Multiply by a 4k context and by every concurrent request, and the cache, not the weights, is what blows your budget. The weights are a fixed cost; the KV cache scales with batch × sequence_length.
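
The arithmetic is easy to run yourself. The sketch below plugs an assumed shape into the formula above: 32 layers, 32 K/V heads, head_dim 128, fp16, which is roughly what a 7B-class model with plain multi-head attention looks like. The function name, the parameter name `n_kv_heads` and the batch of 8 are illustrative choices, not taken from any particular serving stack.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_size):
    # Factor 2: one K vector and one V vector per layer per K/V head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_size

# Assumed 7B-class shape in fp16 (2 bytes per element).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, dtype_size=2)

context_len = 4096  # the "4k context" from the text
batch = 8           # hypothetical number of concurrent requests

per_sequence = per_token * context_len
total = per_sequence * batch

KIB, GIB = 1024, 1024 ** 3
print(per_token // KIB, "KiB per token")        # 512 KiB per token
print(per_sequence // GIB, "GiB per sequence")  # 2 GiB per sequence
print(total // GIB, "GiB for the batch")        # 16 GiB for the batch
```

Under these assumptions, eight full-length requests need more memory for cache than the 14 GB of fp16 weights.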

What people do about it

Paged attention (vLLM): manage the cache in fixed-size blocks like OS virtual memory, so you stop wasting the tail of every sequence’s allocation.
GQA / MQA: share K/V heads across query heads, shrinking the cache by the sharing factor.
Quantized cache: store K/V in int8/fp8 and eat a small quality hit.
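
A rough sketch of what these buy you, using back-of-the-envelope accounting rather than any real allocator. The sequence lengths, the block size of 16, the 8 K/V heads for the GQA case and the function names are all hypothetical, and the int8 figure ignores the small overhead of quantization scales.

```python
import math

def contiguous_slots(seq_lens, max_len):
    # Reserve max_len token slots per sequence up front.
    return len(seq_lens) * max_len

def paged_slots(seq_lens, block_size):
    # Allocate fixed-size blocks on demand; only the last block of each
    # sequence can be partly empty.
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_size):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_size

# Paged vs contiguous: four hypothetical in-flight sequences.
seq_lens = [130, 900, 45, 2048]
used = sum(seq_lens)                              # 3123 tokens actually stored
contiguous = contiguous_slots(seq_lens, max_len=4096)
paged = paged_slots(seq_lens, block_size=16)
print(contiguous - used, "slots wasted, contiguous")  # 13261
print(paged - used, "slots wasted, paged")            # 29

# GQA and a quantized cache: shrink the per-token footprint itself.
mha_fp16 = kv_bytes_per_token(32, 32, 128, 2)  # 32 K/V heads, fp16
gqa_fp16 = kv_bytes_per_token(32, 8, 128, 2)   # 8 K/V heads shared 4:1
gqa_int8 = kv_bytes_per_token(32, 8, 128, 1)   # same, stored as int8
print(mha_fp16 // gqa_fp16, mha_fp16 // gqa_int8)  # 4 8
```

Paging attacks wasted reservation, while GQA and quantization attack the bytes per token, so the techniques stack.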

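Once you internalise that inference is memory-bound, in capacity because of the cache and in bandwidth because every decode step re-reads the weights and the cache to produce a single token per sequence, a lot of serving decisions stop being mysterious.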
