AttentionRollout ReImplementation

Original paper: https://arxiv.org/pdf/2005.00928.pdf.
Motivation: Visualize each type of attention block and isolate its impact on image classification accuracy for a ViT model.
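
For orientation, here is a minimal NumPy sketch of attention rollout as I understand it from the paper linked above: average each layer's attention over heads, mix in the identity to account for the residual connection, re-normalize the rows, and multiply the layers together. The function name `attention_rollout`, the equal 0.5/0.5 residual weighting, and the toy shapes are assumptions for illustration, not the code in this repository.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention out across layers.

    attentions: list of arrays shaped (num_heads, tokens, tokens),
    ordered from the first layer to the last. Each row is assumed to be
    a softmax distribution over tokens.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)                    # average over heads
        fused = 0.5 * fused + 0.5 * identity               # model the residual path (assumed equal weights)
        fused = fused / fused.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = fused @ rollout                          # compose with earlier layers
    return rollout

# Toy example: 3 layers, 2 heads, 1 CLS token + 4 patch tokens (a 2x2 grid).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 5, 5))
weights = np.exp(logits)
weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the last axis

rollout = attention_rollout(list(weights))
cls_map = rollout[0, 1:].reshape(2, 2)                     # CLS-to-patch relevance as a grid
```

For a ViT, the row of the CLS token (index 0 here, by assumption) reshaped to the patch grid gives the heatmap that is usually overlaid on the input image.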

Other attention variants in ViT:

Note that d_model = embed_dim is the per-token embedding width (not the number of tokens), and head_dim = d_model / num_heads.
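
A small shape check may make the note above concrete. The helper `split_heads` and the ViT-B/16-like sizes (197 tokens, embed_dim 768, 12 heads) are assumptions chosen for illustration.

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (batch, tokens, embed_dim) into (batch, num_heads, tokens, head_dim)."""
    batch, num_tokens, embed_dim = x.shape
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    x = x.reshape(batch, num_tokens, num_heads, head_dim)
    return x.transpose(0, 2, 1, 3)

# Assumed sizes: 196 patches + 1 CLS token, embed_dim = 768, 12 heads.
x = np.zeros((2, 197, 768))
heads = split_heads(x, num_heads=12)
print(heads.shape)
```

The number of tokens (197 here) and embed_dim (768) are independent axes; only embed_dim is divided among the heads.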

Hydra Attention: argues for num_heads = embed_dim (so head_dim = 1) to get complexity linear in the number of tokens. Placing 2 Hydra Attention encoder blocks at the back of the network improved accuracy while reducing FLOPs and runtime. Reimplemented by robflynnyh. Unfortunately, visualizing Hydra Attention requires different math from attention rollout, so we rely on the Hydra Attention paper (Figure 3 and appendix) to discuss the different pretrained models.
Dilated Self-Attention (used in LongNet): also linear complexity. Reimplemented at https://github.com/alexisrozhkov/dilated-self-attention
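
To illustrate why num_heads = embed_dim gives linear cost in the Hydra Attention item above, here is a minimal NumPy sketch. It assumes the cosine-similarity kernel (L2-normalized queries and keys) described in the Hydra Attention paper; it is not robflynnyh's implementation, and the name `hydra_attention` and the `eps` parameter are hypothetical.

```python
import numpy as np

def hydra_attention(q, k, v, eps=1e-12):
    """Hydra-style attention for a single sequence.

    q, k, v: arrays shaped (tokens, embed_dim). With one head per feature,
    attention reduces to elementwise products and one sum over tokens.
    """
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)  # L2-normalize queries
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)  # L2-normalize keys
    kv = (k * v).sum(axis=0, keepdims=True)                    # global summary, shape (1, embed_dim)
    return q * kv                                              # gate the summary per token

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = hydra_attention(q, k, v)
```

No tokens-by-tokens matrix is ever formed, so the cost grows as tokens × embed_dim. That missing matrix is also why attention rollout, which multiplies per-layer attention matrices, does not apply directly.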
