Data-parallel training: gradient bucketing and overlap

Why DDP feels like magic until you look at the allreduce schedule.