Distributed data-parallel training feels like magic until you look at when the gradients move.

The naive version

Each rank computes gradients on its shard of the batch, then all ranks allreduce those gradients and take the optimizer step. If you wait for the full backward pass before communicating, the network sits idle during compute and the GPUs sit idle during comms. You’ve serialised two things that could overlap.
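
To put numbers on that, here is a minimal sketch of the naive schedule's cost. The function names and the two timings are made up for illustration; the only inputs are how long backward takes and how long the full allreduce takes.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    total_s: float        # wall-clock time for backward + allreduce
    gpu_idle_frac: float  # share of the step the GPUs spend waiting on the wire
    net_idle_frac: float  # share of the step the wire spends waiting on the GPUs


def naive_step(backward_s: float, allreduce_s: float) -> StepCost:
    """Backward first, then one big allreduce, nothing overlapped."""
    total = backward_s + allreduce_s
    return StepCost(total, allreduce_s / total, backward_s / total)


def best_case_overlapped(backward_s: float, allreduce_s: float) -> float:
    """Lower bound if communication hid perfectly behind compute (or vice versa)."""
    return max(backward_s, allreduce_s)


# Hypothetical timings: 0.5 s of backward, 0.25 s of allreduce.
cost = naive_step(0.5, 0.25)
print(cost)
print(best_case_overlapped(0.5, 0.25))
```

The gap between `total_s` and the best case is what overlapping can win back. Real schedules land somewhere in between, because the last gradients cannot go on the wire until they exist.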

Bucketing

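for layer in reversed(model):     # backward runs last-layer-first
    grad = layer.backward()
    bucket.add(grad)
    if bucket.full():             # ~25 MB default in PyTorch DDP
        async_allreduce(bucket)   # overlaps with the next layer's backward
        bucket = Bucket()         # start filling a fresh bucket
if not bucket.empty():
    async_allreduce(bucket)       # flush the last, partially filled bucket
wait_for_allreduces()             # every reduction must land before the optimizer step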

Because backward produces gradients last-layer-first, you can start reducing the late layers’ gradients while the early layers are still computing theirs. Tune the bucket size: too small and you drown in tiny latency-bound messages; too large and nothing goes on the wire until most of backward is done, so you lose the overlap. 25 MB is a sane default for 10–100 GbE.

It’s the same idea all the way up

ZeRO and FSDP are this principle pushed further: don’t replicate what you can shard, and never let the wire sit idle while the GPUs work. Once you see training as a scheduling problem between compute and communication, the knobs make sense.
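
For the "don’t replicate what you can shard" half, a back-of-the-envelope memory calculation helps. The sketch below assumes mixed-precision Adam with the byte counts used in the ZeRO paper's accounting (2 bytes of fp16 parameters, 2 of fp16 gradients, 12 of fp32 optimizer state per parameter) and ignores activations, buffers, and fragmentation. `per_rank_bytes` is a hypothetical helper, not a DeepSpeed or PyTorch API.

```python
# Assumed bytes per parameter under mixed-precision Adam.
PARAM_B, GRAD_B, OPTIM_B = 2, 2, 12


def per_rank_bytes(n_params: int, n_ranks: int, stage: int) -> float:
    """Model-state memory per rank.

    stage 0: plain data parallel, everything replicated
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    optim = OPTIM_B / n_ranks if stage >= 1 else OPTIM_B
    grad = GRAD_B / n_ranks if stage >= 2 else GRAD_B
    param = PARAM_B / n_ranks if stage >= 3 else PARAM_B
    return n_params * (param + grad + optim)


# Hypothetical 1B-parameter model on 8 ranks.
for stage in range(4):
    gb = per_rank_bytes(1_000_000_000, 8, stage) / 1e9
    print(f"stage {stage}: {gb:.2f} GB per rank")
```

Each stage buys memory by adding communication, which is why the scheduling view matters: the extra gathers and scatters only pay off if they overlap with compute.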