KV cache, and why LLM inference is memory-bound
The cache that makes autoregressive decoding fast is also what exhausts memory first.
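To give a sense of scale, here is a minimal sketch of the cache-size arithmetic. It assumes a standard transformer that stores one key and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte fp16 elements) are illustrative assumptions, not figures from this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 counts one key tensor and one value
    tensor per layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative model shape: 32 layers, 32 KV heads, head_dim 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(f"per token: {per_token / 2**20} MiB")
print(f"4096 tokens x batch 8: {full / 2**30} GiB")
```

Under these assumed numbers the cache grows linearly with both sequence length and batch size, so doubling either one doubles the memory it occupies.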