Data-parallel training: gradient bucketing and overlap
Why DDP feels like magic until you look at the allreduce schedule.
Why DDP feels like magic until you look at the allreduce schedule.
Pin everything, mount data, never bake it. The boring setup that stopped me losing runs.
The cache that makes autoregressive decoding fast also makes it the thing that runs out of memory first.