- torch.cuda.empty_cache() does not seem to do anything for networks with variable depth / sequence length / girth
- DDP + AMP ... seems 3x slower instead of 2x faster (lol) for some networks; we are still looking for the cause
- For some networks, 2x speed bump using AMP out of the box
- Now DDP prevents me from using 2 processes on 1 GPU, failing with:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
- Looks like they are much more efficient in parallelizing and keeping high utilization (80-100%), same networks train ~2x-3x faster compared to Titan X (Maxwell) and 1080 Ti without any tweaks to the code
- Same networks use more RAM with 3090 compared to 1080 Ti (?)
- I kind of was afraid that these cards would be under-utilized (50%), but they are just faster. Magic
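
For reference, "AMP out of the box" from the list above looks roughly like this. It is a minimal sketch assuming the stock `torch.cuda.amp` API (`autocast` + `GradScaler`). The toy model, the random data and the `train_steps` name are made up for illustration and are not the networks from these notes. AMP is switched off automatically when no GPU is found, so the same loop also runs on CPU.

```python
import torch
import torch.nn as nn


def train_steps(num_steps=3, use_amp=None):
    """Run a few optimizer steps of a toy model, with AMP if a GPU is present."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if use_amp is None:
        # Assumption: only enable mixed precision when CUDA is available.
        use_amp = device == "cuda"

    torch.manual_seed(0)
    # Hypothetical stand-in model; any nn.Module goes here.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # GradScaler rescales the loss so fp16 gradients do not underflow.
    # With enabled=False it becomes a pass-through.
    # Note: newer PyTorch releases may emit a deprecation warning for this namespace.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    x = torch.randn(64, 16, device=device)
    y = torch.randn(64, 1, device=device)

    losses = []
    for _ in range(num_steps):
        optimizer.zero_grad()
        # Forward pass and loss run in mixed precision inside autocast.
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = nn.functional.mse_loss(model(x), y)
        # Backward on the scaled loss, then unscale + step + adjust the scale.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(train_steps())
```

Whether this gives the 2x bump depends on the network, as noted above. Timing the same loop with `use_amp=False` vs `use_amp=True` is the quickest way to check.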
#deep_learning