Notes on cgroups v2: memory limits that actually hold

I spent a frustrating afternoon working out why a job that was fine under cgroups v1 started getting OOM-killed on a newer kernel. The short answer: the unified hierarchy (cgroups v2) accounts for memory differently, and the knobs are named differently.

The four knobs

Under v2 a memory controller exposes, among others:

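memory.min     # hard protection: usage below this is never reclaimed
memory.low     # best-effort protection: reclaimed below this only when nothing unprotected is left to reclaim
memory.high    # throttle point: above this the cgroup is slowed and reclaimed aggressively
memory.max     # hard limit: the kernel reclaims first, then OOM-kills if usage cannot be kept under it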
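
Before changing anything it helps to see what a cgroup is currently set to. The following is a minimal sketch, assuming a v2 hierarchy and a non-root cgroup (the root cgroup does not have these files); the function name show_memory_knobs is my own and not part of any tool:

```sh
# Print the four protection/limit knobs for one cgroup directory.
# $1: path to the cgroup directory (for example /sys/fs/cgroup/myjob).
# Values are byte counts, or the literal string "max" meaning "no limit".
show_memory_knobs() {
    cg=$1
    for knob in memory.min memory.low memory.high memory.max; do
        if [ -r "$cg/$knob" ]; then
            printf '%s=%s\n' "$knob" "$(cat "$cg/$knob")"
        else
            # File missing or unreadable (root cgroup, controller not enabled).
            printf '%s=unavailable\n' "$knob"
        fi
    done
}
```

Called as show_memory_knobs /sys/fs/cgroup/myjob, it prints one knob=value line per file.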

The one people miss is memory.high. It’s not a limit you hit and die at; it’s the point where the kernel starts throttling the cgroup’s allocations and reclaiming hard. Set memory.high a little below memory.max and you get back-pressure instead of a sudden kill:

echo 3G > /sys/fs/cgroup/myjob/memory.high
echo 4G > /sys/fs/cgroup/myjob/memory.max
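
To keep the gap between the two values consistent across jobs, the pair of writes can be wrapped in a small helper. This is only a sketch: set_memory_limits is a hypothetical name, and the 75% default simply mirrors the 3G/4G ratio above rather than any recommended value.

```sh
# Set memory.max and place memory.high at a percentage of it.
# $1: cgroup directory
# $2: hard limit in bytes
# $3: memory.high as a percentage of the hard limit (assumed default: 75)
set_memory_limits() {
    cg=$1
    max_bytes=$2
    pct=${3:-75}
    if [ "$pct" -le 0 ] || [ "$pct" -ge 100 ]; then
        echo "percent must be between 1 and 99" >&2
        return 1
    fi
    high_bytes=$((max_bytes * pct / 100))
    # Write the throttle point first so back-pressure is configured
    # before the hard limit is.
    echo "$high_bytes" > "$cg/memory.high" || return 1
    echo "$max_bytes" > "$cg/memory.max" || return 1
}
```

For example, set_memory_limits /sys/fs/cgroup/myjob 4294967296 75 reproduces the 3G/4G pair. The writes need root or a cgroup delegated to your user.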

Page cache counts

The thing that bit me: under v2, page cache is charged to the cgroup and counts toward the same memory.current that memory.high and memory.max are enforced against. A process that mmaps a big read-only dataset can push the cgroup toward max even though that memory is trivially reclaimable. Compare memory.current with the file field in memory.stat before you blame your own allocations.
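
A rough way to make that comparison is sketched below. It assumes the usual "key value" line layout of memory.stat; page_cache_share is a hypothetical helper name. Note that the file field also includes tmpfs and shared memory, which is not as cheaply reclaimable as clean page cache.

```sh
# Report how much of memory.current is file-backed memory.
# $1: cgroup directory
# Prints: current=<bytes> file=<bytes> file_pct=<integer percent>
page_cache_share() {
    cg=$1
    current=$(cat "$cg/memory.current") || return 1
    # Match the key exactly so file_mapped, file_dirty etc. are ignored.
    file=$(awk '$1 == "file" { print $2 }' "$cg/memory.stat") || return 1
    if [ -z "$file" ] || [ "$current" -eq 0 ]; then
        echo "current=$current file=${file:-0} file_pct=0"
        return 0
    fi
    echo "current=$current file=$file file_pct=$((file * 100 / current))"
}
```

A high file_pct on a cgroup that is close to its limit suggests cache, not your own allocations, is filling the budget.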

PSI is the real win

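memory.pressure (Pressure Stall Information) tells you how much time tasks spent stalled waiting on memory, which is a far better signal than “are we near the limit”. Its some line is the share of time at least one task was stalled; its full line is the share of time all non-idle tasks were stalled at once. I now alert on PSI, not on memory.current relative to memory.max. A cgroup can sit at 95% of its limit forever and be perfectly healthy.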
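
As an illustration of alerting on PSI, here is a small sketch that pulls one average out of memory.pressure and compares it with a threshold. It assumes the standard two-line format (some/full lines with avg10=, avg60=, avg300= and total= fields); the helper names and whatever threshold you pass in are my own assumptions, not recommended values.

```sh
# Extract one PSI average from a memory.pressure file.
# $1: path to memory.pressure, $2: "some" or "full", $3: avg10, avg60 or avg300
psi_avg() {
    awk -v kind="$2" -v key="$3" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; exit }
            }
        }' "$1"
}

# Succeed (exit 0) when the chosen average is above the threshold.
# $1: path, $2: threshold in percent,
# $3: kind (default some), $4: window (default avg10)
psi_over_threshold() {
    value=$(psi_avg "$1" "${3:-some}" "${4:-avg10}")
    # Exit 2 when the field could not be found at all.
    [ -n "$value" ] || return 2
    awk -v v="$value" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}
```

A check such as psi_over_threshold /sys/fs/cgroup/myjob/memory.pressure 10 && echo "memory pressure high" could then run from cron or a monitoring agent.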
