- Original paper: https://arxiv.org/pdf/2005.00928.pdf.
- Motivation: Visualize each type of attention block and isolate their impact to image classification accuracy for ViT model.
Note that d_model = embed_dim already where d_model = number of tokens, head_dim = d_model/num_heads
- Hydra Attention argues for num_heads = embed_dim to get linear complexity. Have 2 Hydra Attention-Encoder block at the back improved accuracy while reduced FLOPs and runtime. Reimplemented by robflynnyh. Unfortunately, visualize Hydra Attention needed a different math so we will rely on their (figure 3 + appendix) to discuss different pretrained model
- Dilated-Self Attention used for LongNet: Also linear complexity. Reimplemented by https://github.com/alexisrozhkov/dilated-self-attention
- Attention Rollout
- Gradient-based Attention Rollout
- ????