When comparing two GPUs with Tensor Cores, one of the best indicators of each GPU's performance is its memory bandwidth;
Most computation time on GPUs is spent on memory access;
A100 compared to the V100 is 1.70x faster for NLP and 1.45x faster for computer vision;
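A rough sanity check of the bandwidth argument, as a minimal sketch. The bandwidth figures are spec-sheet values (A100 40 GB and V100) that are not part of these notes, and the function name is made up for illustration. It assumes the workload is purely memory-bound, which real training runs only approximate.

```python
# Peak memory bandwidth in GB/s. Spec-sheet values, an assumption not taken from the notes.
MEMORY_BANDWIDTH_GBPS = {"V100": 900, "A100": 1555}


def bandwidth_speedup(new_gpu, old_gpu, table=MEMORY_BANDWIDTH_GBPS):
    """Rough speedup estimate, assuming the workload is purely memory-bound."""
    return table[new_gpu] / table[old_gpu]


estimate = bandwidth_speedup("A100", "V100")
print(f"Bandwidth-based estimate: {estimate:.2f}x")  # prints: Bandwidth-based estimate: 1.73x
```

The ratio lands close to the 1.70x reported for NLP and above the 1.45x reported for computer vision, so bandwidth alone looks like an upper-end estimate rather than a guarantee.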
The 3-slot design of the RTX 3090 makes 4x GPU builds problematic. Possible solutions are 2-slot variants or PCIe extenders;
4x RTX 3090 will need more power than any standard power supply unit on the market can provide right now (this is BS, I have a 2000 W PSU, though power connectors may be an issue);
With BF16 precision, training might be more stable than with FP16 precision while providing the same speedups;
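The note does not say why BF16 might be more stable. The usual explanation is that BF16 keeps the 8 exponent bits of FP32 and so overflows far later than FP16, at the cost of fewer mantissa bits. A minimal pure-Python sketch of that trade-off follows; the helper names are made up for illustration.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable one is 2**e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def machine_epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


FP16 = {"exponent_bits": 5, "mantissa_bits": 10}
BF16 = {"exponent_bits": 8, "mantissa_bits": 7}

for name, fmt in (("FP16", FP16), ("BF16", BF16)):
    print(name, "max:", max_finite(**fmt), "eps:", machine_epsilon(fmt["mantissa_bits"]))
```

FP16 tops out at 65504, so large activations or gradients can overflow without loss scaling, while BF16 reaches roughly 3.4e38 with coarser resolution.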
The new fan design for the RTX 30 series features both a blower fan and a push/pull fan;
350W TDP;
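A back-of-the-envelope power budget for a 4x build, as a sketch only. It assumes the 350 W TDP applies per RTX 3090 card. The allowance for the rest of the system and the safety margin are hypothetical placeholders, not measured values.

```python
GPU_TDP_W = 350        # per-card TDP from the note above (assumed to refer to the RTX 3090)
REST_OF_SYSTEM_W = 300  # hypothetical allowance for CPU, drives and fans
HEADROOM = 1.10        # hypothetical 10% safety margin


def required_psu_watts(num_gpus, gpu_tdp=GPU_TDP_W, rest=REST_OF_SYSTEM_W, headroom=HEADROOM):
    """Estimated PSU rating for a build with num_gpus cards."""
    return (num_gpus * gpu_tdp + rest) * headroom


print(f"4x GPU build: about {required_psu_watts(4):.0f} W")
```

Under these assumptions the 4x build comes in below a 2000 W PSU, which agrees with the remark above that raw wattage is less of a problem than the number of power connectors.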
Compared to an RTX 2080 Ti, the RTX 3090 yields a speedup of 1.57x for convolutional networks and 1.5x for transformers while having a 15% higher release price. Thus the Ampere RTX 30 series delivers a pretty substantial improvement over the Turing RTX 20 series;
PCIe 4.0 and PCIe lanes do not matter in 2x GPU setups. For 4x GPU setups, they still do not matter much;
NVLink is not useful, except for GPU clusters;
No info about the power connector, but I believe the first gaming GPUs use 2x 6-pin plus maybe some adapter;
Despite heroic software engineering efforts, AMD GPUs + ROCm will probably not be able to compete with NVIDIA for at least 1-2 years, due to the lack of a community and of a Tensor Core equivalent;
You will need network cards of 50 Gbit/s or more to gain speedups if you want to parallelize across machines;
So if you expect to run deep learning models for longer than 300 days, it is better to buy a desktop instead of using AWS spot instances (also fuck off AWS and NVIDIA with their SLA about data centers);
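The break-even reasoning can be sketched as a small calculation. Every number in the example call is a hypothetical placeholder, not a price from these notes, and the function name is made up. It is not an attempt to reproduce the 300-day figure.

```python
def break_even_days(desktop_cost, cloud_rate_per_hour, hours_per_day=24.0,
                    desktop_power_cost_per_hour=0.0):
    """Days of use after which owning the desktop becomes cheaper than renting."""
    saving_per_hour = cloud_rate_per_hour - desktop_power_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("the cloud is never more expensive per hour under these inputs")
    return desktop_cost / (saving_per_hour * hours_per_day)


# Hypothetical inputs: 3000 desktop, 1.00 per cloud hour, 10 hours of use per day,
# 0.10 per hour of electricity for the desktop.
days = break_even_days(3000, 1.00, hours_per_day=10, desktop_power_cost_per_hour=0.10)
print(f"Break-even after about {days:.0f} days")
```

The result is very sensitive to daily utilization: halving the hours of use per day doubles the break-even time.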