We were digging deep into how to use AMP (automatic mixed precision) properly. Surprise, surprise:
- It works better with large, wide networks
- It works poorly with separable convolutions
- You need a bit more involved design considerations than just "have your channels divisible by 8":
For matrix multiplication:
On FP16 inputs, all three dimensions (M, N, K) must be multiples of 8.
For convolution:
On FP16 inputs, input and output channels must be multiples of 8.
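A quick way to sanity-check shapes against the two rules above, as a minimal sketch. The helper names (`is_aligned`, `gemm_is_aligned`, `conv_is_aligned`) are made up for illustration and are not part of any library; they only encode the multiple-of-8 rule for FP16 inputs.

```python
# Hypothetical helpers (not from any library): they only encode the
# "multiple of 8 on FP16 inputs" rule quoted above.

ALIGNMENT = 8  # required multiple for FP16 inputs, per the rule above


def is_aligned(dim: int, multiple: int = ALIGNMENT) -> bool:
    """True if a positive dimension is a multiple of `multiple`."""
    return dim > 0 and dim % multiple == 0


def gemm_is_aligned(m: int, n: int, k: int) -> bool:
    """Matrix multiplication: all three dimensions (M, N, K) must be aligned."""
    return all(is_aligned(d) for d in (m, n, k))


def conv_is_aligned(in_channels: int, out_channels: int) -> bool:
    """Convolution: both input and output channel counts must be aligned."""
    return is_aligned(in_channels) and is_aligned(out_channels)


if __name__ == "__main__":
    # GEMM dimensions 32, 4096 and 1024 are all multiples of 8.
    print(gemm_is_aligned(32, 4096, 1024))
    # One odd dimension (30) is enough to break the rule.
    print(gemm_is_aligned(30, 4096, 1024))
    # A first conv layer on RGB input has 3 input channels, not a multiple of 8.
    print(conv_is_aligned(3, 64))
```

Running a check like this over every layer of a model is a cheap way to spot the one dimension that is not a multiple of 8.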
Also:
Prefer dense math operations.
For example, vanilla convolutions have much higher arithmetic intensity than depth-wise separable convolutions.
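To put rough numbers on that comparison, here is a back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte moved). It is a simplified model, not a profiler: it assumes FP16 tensors, stride 1 with "same" padding, no bias, and that every tensor is read or written exactly once. All function names are illustrative.

```python
# Simplified arithmetic-intensity model (FLOPs per byte moved).
# Assumptions: FP16 (2 bytes per element), stride 1, "same" padding, no bias,
# each tensor touched exactly once, caching effects ignored.

BYTES_PER_ELEMENT = 2  # FP16


def conv_cost(h, w, c_in, c_out, k, groups=1):
    """Return (flops, bytes_moved) for one k x k convolution on an h x w map."""
    per_group_in = c_in // groups
    # Each output element needs k*k*per_group_in multiply-adds (2 FLOPs each).
    flops = 2 * h * w * c_out * k * k * per_group_in
    weights = k * k * per_group_in * c_out
    elements = h * w * c_in + weights + h * w * c_out  # input + weights + output
    return flops, elements * BYTES_PER_ELEMENT


def vanilla_intensity(h, w, c_in, c_out, k):
    """Intensity of a plain (dense) k x k convolution."""
    flops, nbytes = conv_cost(h, w, c_in, c_out, k)
    return flops / nbytes


def depthwise_intensity(h, w, c, k):
    """Intensity of the depthwise step alone (one k x k filter per channel)."""
    flops, nbytes = conv_cost(h, w, c, c, k, groups=c)
    return flops / nbytes


def separable_intensity(h, w, c_in, c_out, k):
    """Intensity of depthwise (k x k) followed by pointwise (1 x 1)."""
    dw_flops, dw_bytes = conv_cost(h, w, c_in, c_in, k, groups=c_in)
    pw_flops, pw_bytes = conv_cost(h, w, c_in, c_out, 1)
    return (dw_flops + pw_flops) / (dw_bytes + pw_bytes)


if __name__ == "__main__":
    shape = dict(h=56, w=56, c_in=128, c_out=128, k=3)
    print("vanilla  :", vanilla_intensity(**shape))
    print("separable:", separable_intensity(**shape))
    print("depthwise:", depthwise_intensity(56, 56, 128, 3))
```

Under these assumptions, a 3x3 convolution on a 56x56 map with 128 channels comes out at roughly 487 FLOPs/byte, the separable pair at roughly 34, and the depthwise step alone at about 4.5, which is why the depthwise part tends to be memory-bound rather than compute-bound.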
Also:
Choose the mini-batch size to be a multiple of 8
Choose linear layer dimensions to be a multiple of 8
Choose convolution layer channel counts to be a multiple of 8
For classification problems, pad the vocabulary to be a multiple of 8
For sequence problems, pad the sequence length to be a multiple of 8
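In practice, most of the choices above come down to rounding a size up to the next multiple of 8. A minimal sketch, with made-up helper names (`round_up`, `pad_sequence`) and an assumed padding token ID of 0:

```python
# Illustrative helpers (hypothetical names, not from any library) for rounding
# sizes up to a multiple of 8. pad_id=0 is an assumption; use your own pad token.


def round_up(n: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_sequence(tokens, pad_id: int = 0, multiple: int = 8):
    """Return a new list padded with pad_id up to a multiple-of-8 length."""
    tokens = list(tokens)
    target = round_up(len(tokens), multiple)
    return tokens + [pad_id] * (target - len(tokens))


if __name__ == "__main__":
    # A vocabulary of 50257 entries gets padded with 7 unused entries.
    print(round_up(50257))
    # A mini-batch of 30 would become 32.
    print(round_up(30))
    # A 13-token sequence is padded to length 16.
    print(len(pad_sequence(range(13))))
```

The padded vocabulary entries and sequence positions are dummies, so they need to be masked or ignored in the loss as usual.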
Please see:
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
#deep_learning