Systems

KV cache, and why LLM inference is memory-bound

The cache that makes autoregressive decoding fast is also the first thing to exhaust memory.

Why my OOM kills moved around after switching to the unified hierarchy, and the three knobs that matter.