Data-parallel training: gradient bucketing and overlap
Why DDP feels like magic until you look at the allreduce schedule.
Switching off loss-based congestion control on a long-fat path, and the fq gotcha for UDP.
Two ring buffers, one syscall, and the mental model that finally made it click.