Just a quick note. DDP expects every worker to run a backward pass on a given iteration (or every worker to skip it). If only some workers call backward, the gradient all-reduce never completes and training hangs.
So do not forget to use the grad scaler (`GradScaler`) with native PyTorch AMP.
In my particular case, DDP ran fine with AMP alone, but once I added the grad scaler, training stopped exploding / de-syncing and started converging even faster. If only I had GPUs with FP16 support =)
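
For reference, here is roughly what the AMP + grad scaler step looks like. This is only a minimal sketch, not my actual training code: `train_step`, the tiny `nn.Linear` model, the MSE loss and the random batch are all made up for illustration. In a real run `model` would be the DDP-wrapped module, and every rank would call the same step on every iteration. The scaler is disabled automatically when no GPU is available, so the snippet also runs on CPU.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, scaler, inputs, targets, device_type, use_amp):
    """One optimisation step with autocast + GradScaler (illustrative helper)."""
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in mixed precision only when use_amp is True.
    with torch.autocast(device_type=device_type, enabled=use_amp):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so small FP16 gradients do not underflow, then backprop.
    # With DDP, this backward() is where gradients are all-reduced, so every
    # rank has to reach it.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if it finds inf/NaN.
    scaler.step(optimizer)
    # update() adjusts the scale factor for the next iteration.
    scaler.update()
    return loss.detach()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # assumption: only use AMP on GPU

model = nn.Linear(4, 1).to(device)  # stand-in for the DDP-wrapped model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Newer PyTorch releases also expose this as torch.amp.GradScaler("cuda", ...).
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(8, 4, device=device)   # made-up batch
targets = torch.randn(8, 1, device=device)

weight_before = model.weight.detach().clone()
loss = train_step(model, optimizer, scaler, inputs, targets, device.type, use_amp)
```

With `enabled=False` the scaler is a pass-through, so the same loop works with and without AMP and you do not need two code paths.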