Spark in me - Internet, data science, math, deep learning, philosophy (@snakers4)

Some More Observations About 3090

- torch.cuda.empty_cache() does not seem to do anything for networks with variable depth / sequence length / girth

- DDP + AMP seems to be 3x slower instead of 2x faster (lol) for some networks; we are looking for the cause

- For some networks, 2x speed bump using AMP out of the box

- DDP now prevents me from running 2 processes on 1 GPU, failing with:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

- Looks like these cards are much more efficient at parallelizing and keeping utilization high (80-100%): the same networks train ~2x-3x faster than on Titan X (Maxwell) and 1080 Ti without any tweaks to the code

- The same networks use more GPU RAM on the 3090 than on the 1080 Ti (?)

- I was kind of afraid that these cards would be under-utilized (50%), but they are just faster. Magic
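A rough sketch of how the memory observations above could be checked. torch.cuda.empty_cache() only returns cached blocks that no live tensor is using, so comparing allocated vs reserved memory before and after the call shows whether there was anything to release. The helper names (memory_snapshot_mb, empty_cache_effect) are made up for illustration and are not from our code base, and the guarded import is only there so the snippet also runs on a machine without PyTorch or a GPU.

```python
try:
    import torch
except ImportError:  # assumption: lets the sketch run as a no-op without PyTorch
    torch = None


def cuda_available():
    """True only if PyTorch is installed and sees a CUDA device."""
    return torch is not None and torch.cuda.is_available()


def memory_snapshot_mb(device=0):
    """Return (allocated, reserved) GPU memory in MB; zeros without CUDA."""
    if not cuda_available():
        return 0.0, 0.0
    mb = 1024 ** 2
    # allocated = memory held by live tensors
    # reserved  = memory held by the caching allocator (tools like nvidia-smi see roughly this)
    return (torch.cuda.memory_allocated(device) / mb,
            torch.cuda.memory_reserved(device) / mb)


def empty_cache_effect(device=0):
    """Report memory before and after torch.cuda.empty_cache()."""
    before = memory_snapshot_mb(device)
    if cuda_available():
        # Releases only cached blocks that are not occupied by any tensor
        torch.cuda.empty_cache()
    after = memory_snapshot_mb(device)
    return {
        "allocated_before_mb": before[0],
        "reserved_before_mb": before[1],
        "allocated_after_mb": after[0],
        "reserved_after_mb": after[1],
    }


if __name__ == "__main__":
    print(empty_cache_effect())
```

If reserved stays close to allocated after the call, there was simply nothing cached to free, which may be part of what is happening with the variable-size networks. Running the same check on both cards would also give actual numbers for the RAM difference.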

#deep_learning