Spark in me - Internet, data science, math, deep learning, philosophy (@snakers4)

Getting The Most Out of AMP

We dug deep into how to use AMP (automatic mixed precision) properly. Surprise, surprise:

- It works better with large, wide networks
- It works poorly with separable convolutions
- You need somewhat more involved design considerations than just "have your channels divisible by 8":

For matrix multiplication:
On FP16 inputs, all three dimensions (M, N, K) must be multiples of 8.

For convolution:
On FP16 inputs, input and output channels must be multiples of 8.
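A quick way to sanity-check layer shapes against these two rules is sketched below. The helper names (`gemm_is_aligned`, `conv_is_aligned`) are made up for illustration and are not part of any library; the code only checks divisibility by 8 and says nothing about whether a given kernel will actually be selected on your hardware.

```python
ALIGNMENT = 8  # multiple required for FP16 inputs, per the rules above


def gemm_is_aligned(m: int, n: int, k: int) -> bool:
    """True if all three matmul dimensions (M, N, K) are multiples of 8."""
    return all(dim % ALIGNMENT == 0 for dim in (m, n, k))


def conv_is_aligned(in_channels: int, out_channels: int) -> bool:
    """True if both input and output channel counts are multiples of 8."""
    return in_channels % ALIGNMENT == 0 and out_channels % ALIGNMENT == 0


if __name__ == "__main__":
    # A linear layer with batch 64, 100 output features, 256 input features:
    # N = 100 is not a multiple of 8, so the check fails.
    print(gemm_is_aligned(64, 100, 256))
    # A first conv layer on RGB images has 3 input channels, so it fails too.
    print(conv_is_aligned(3, 64))
    print(conv_is_aligned(64, 128))
```

Note that the first convolution on RGB input (3 channels) fails the check by construction, which is one reason the rule is less trivial than it sounds.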

Also:

Prefer dense math operations.
For example, vanilla convolutions have much higher arithmetic intensity than depth-wise separable convolutions.
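To get a feel for the gap, here is a back-of-the-envelope estimate of arithmetic intensity (FLOPs per byte moved) for a vanilla convolution versus the two halves of a depth-wise separable one. This is a rough model under stated assumptions: stride 1, "same" padding, FP16 storage (2 bytes per element), and every tensor read or written exactly once. The function names and the example layer size are illustrative only.

```python
BYTES_PER_ELEMENT = 2  # FP16


def conv_intensity(h: int, w: int, c_in: int, c_out: int, k: int) -> float:
    """FLOPs per byte for a vanilla k x k convolution (rough model)."""
    flops = 2 * h * w * c_out * c_in * k * k  # one multiply + one add per MAC
    elements = h * w * c_in + k * k * c_in * c_out + h * w * c_out
    return flops / (elements * BYTES_PER_ELEMENT)


def depthwise_intensity(h: int, w: int, c: int, k: int) -> float:
    """FLOPs per byte for a depth-wise k x k convolution (one filter per channel)."""
    flops = 2 * h * w * c * k * k
    elements = h * w * c + k * k * c + h * w * c
    return flops / (elements * BYTES_PER_ELEMENT)


def pointwise_intensity(h: int, w: int, c_in: int, c_out: int) -> float:
    """FLOPs per byte for the 1 x 1 convolution that follows the depth-wise step."""
    return conv_intensity(h, w, c_in, c_out, k=1)


if __name__ == "__main__":
    h = w = 56
    c = 128
    print("vanilla 3x3:   ", round(conv_intensity(h, w, c, c, 3), 1))
    print("depth-wise 3x3:", round(depthwise_intensity(h, w, c, 3), 1))
    print("point-wise 1x1:", round(pointwise_intensity(h, w, c, c), 1))
```

In this simplified model the depth-wise step does very little math per byte moved, so it tends to be memory-bound and has little to gain from faster FP16 math.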

Also:

Choose the mini-batch size to be a multiple of 8
Choose linear layer dimensions to be a multiple of 8
Choose convolution layer channel counts to be a multiple of 8
For classification problems, pad vocabulary to be a multiple of 8
For sequence problems, pad the sequence length to be a multiple of 8
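The padding advice in this list boils down to rounding sizes up to the next multiple of 8. A minimal sketch follows; `round_up` and `pad_to_multiple` are hypothetical helper names, and the pad token id of 0 is an assumption that depends on your tokenizer.

```python
from typing import List


def round_up(n: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_to_multiple(tokens: List[int], pad_id: int = 0, multiple: int = 8) -> List[int]:
    """Right-pad a token sequence so its length is a multiple of `multiple`."""
    target = round_up(len(tokens), multiple)
    return tokens + [pad_id] * (target - len(tokens))


if __name__ == "__main__":
    # Pad a vocabulary size before building the output projection / embedding.
    vocab_size = 30522
    print(round_up(vocab_size))

    # Pad a sequence before batching it.
    print(pad_to_multiple([101, 2023, 2003, 102]))
```

The extra vocabulary entries are never used as targets, so they only cost a few unused rows in the output layer.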

Please see
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html

#deep_learning