Mixed precision:
Getting the most from Tensor Cores:
- Satisfy Tensor Core shape constraints
- Increase arithmetic intensity
- Decrease fraction of work in non-Tensor Core ops
- The unreasonable effectiveness of gradient descent
- Bugs in code for mixed precision steps often manifest as slightly worse training accuracy
- Common mistakes (a correct step ordering is sketched after this list):
  - Gradients not unscaled correctly before weight update (Adam will try to handle this!)
  - Gradient clipping or regularization improperly using scaled gradients
  - Incorrectly synchronizing master weight updates across multiple GPUs
  - Not running loss function in FP32
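A minimal sketch of a training step that avoids these mistakes, assuming PyTorch's torch.cuda.amp utilities; the model, loss, and hyperparameters below are illustrative only:

import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

def train_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        outputs = model(inputs)                      # forward runs in FP16 where safe
    loss = torch.nn.functional.mse_loss(outputs.float(), targets)  # compute loss in FP32
    scaler.scale(loss).backward()                    # backward produces scaled gradients
    scaler.unscale_(optimizer)                       # unscale BEFORE clipping/regularization
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                           # skips the update if gradients overflowed
    scaler.update()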
Three levels of optimization to best use Tensor Cores:
- Satisfy Tensor Core shape constraints
- Increase arithmetic intensity
- Decrease fraction of work in non-Tensor Core ops
- GEMMs = generalized (dense) matrix-matrix multiplies: all three dimensions (M, N, K) should be multiples of 8
- GEMMs in fully connected layers: batch size, input features, and output features should be multiples of 8
- GEMMs in RNNs: batch size, hidden size, embedding size, and dictionary size should be multiples of 8
- Convolution: number of channels (input, output) should be multiples of 8
- In practice (a padding sketch follows this list):
  - Choose minibatch size to be a multiple of 8
  - Choose layer dimensions to be multiples of 8
  - For classification, pad vocabulary size to a multiple of 8
  - For sequence problems, pad sequence length to a multiple of 8
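A minimal sketch of the padding rule above; the helper name and sizes are illustrative, not from the slides:

import math

def pad_to_multiple(n, multiple=8):
    # Round n up to the nearest multiple so GEMM dimensions map onto Tensor Core kernels
    return int(math.ceil(n / multiple)) * multiple

# Illustrative sizes only: pad vocabulary and sequence length before building the model and batches
vocab_size = pad_to_multiple(10003)   # -> 10008
seq_len = pad_to_multiple(70)         # -> 72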
Enabling PyTorch’s autotuner:
import torch
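# Let cuDNN benchmark candidate algorithms and cache the fastest one per convolution shape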
torch.backends.cudnn.benchmark = True
...
During the first iteration, it tests different cuDNN algorithms for each new convolution size it sees and caches the fastest choice to use in later iterations.
- Increase arithmetic intensity in model implementation (see the sketch below):
  - Concatenate weights and gate activations in recurrent cells
  - Concatenate activations across time in sequence models
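A minimal sketch of the first point, assuming an LSTM-style cell: the four gate projections are computed as one large GEMM by concatenating the gate weights, instead of four small GEMMs (shapes and dtypes are illustrative):

import torch

batch, input_size, hidden_size = 64, 1024, 1024

x = torch.randn(batch, input_size, device="cuda", dtype=torch.float16)
h = torch.randn(batch, hidden_size, device="cuda", dtype=torch.float16)
xh = torch.cat([x, h], dim=1)                      # (batch, input + hidden)

# Naive: four small GEMMs, one per gate (i, f, g, o)
w_gates = [torch.randn(hidden_size, input_size + hidden_size, device="cuda", dtype=torch.float16)
           for _ in range(4)]
gates_naive = [xh @ w.t() for w in w_gates]        # four (batch, hidden) results

# Fused: concatenate the gate weights once and run a single large GEMM, then split
w_fused = torch.cat(w_gates, dim=0)                # (4 * hidden, input + hidden)
gates_fused = (xh @ w_fused.t()).chunk(4, dim=1)   # same four (batch, hidden) results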
- Increase arithmetic intensity in model architecture:
  - Prefer dense math (vanilla convolutions vs. depthwise separable convolutions)
  - Prefer wider layers - often little speed cost
- Of course, always prefer accuracy first!
- Cutting-edge work on speeding up non-Tensor Core ops automatically with compiler tools:
- TensorFlow: XLA
- PyTorch: JIT / TorchScript (see the sketch below)
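A minimal sketch, as an illustration rather than material from the slides: PyTorch's JIT can fuse chains of pointwise, non-Tensor Core ops (here a bias add plus a tanh-approximated GELU) into fewer CUDA kernels; whether fusion actually happens depends on the PyTorch version and fuser settings.

import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Chain of pointwise ops; the JIT fuser can compile these into fewer CUDA kernels
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.7978845608 * (y + 0.044715 * y * y * y)))

x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
bias = torch.randn(4096, device="cuda", dtype=torch.float16)
out = fused_bias_gelu(x, bias)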