🔥🔥🔥 A paper list of recent works on token compression for ViT and VLM.
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model . [MustDrop;Github]
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization . [RLT;Video;NeurIPS 2024;Github]
- Inference Optimal VLMs Need Only One Visual Token but Larger Models . [QueCC;Github]
- Video Token Merging for Long-form Video Understanding . [Learnable VTM;Video]
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding . [LongVU;Video;Github]
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction . [PyramidDrop;Github]
- Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers . [Victor]
- VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models . [VidCompress]
- Retrieval Replace Reduction: An Effective Visual Token Reduction Method via Semantic Match . [TRSM]
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity . [AVG-LLaVA;Github]
- Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs . [TRIM]
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Consideration . [TC-LLaVA;Video]
- TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings . [TG-LLaVA]
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding . [mPLUG-DocOwl2;Github]
- TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval . [TempMe;Video;Github]
- Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information . [Recoverable Compression]
- HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments . [HiRED;Github]
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding . [Token-level;Github]
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models . [HiRes-LLaVA]
- TokenPacker: Efficient Visual Projector for Multimodal LLM . [TokenPacker;Github]
- VoCo-LLaMA: Towards Vision Compression with Large Language Models . [VoCo-LLaMA;Github]
- DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models . [DeCo;Github]
- Matryoshka Multimodal Models . [Matryoshka;M3;Github]
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites . [InternVL;Pixel-Shuffle;Github]
- CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference . [CATP]
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models . [LLaVA-PruMerge;Github]
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models . [FastV;ECCV 2024;Github] (sketched below)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model . [LDP-v2;Github]
- Honeybee: Locality-enhanced Projector for Multimodal LLM . [C-Abstractor;CVPR 2024;Github]
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models . [LLaMA-VID;ECCV 2024;Github]
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond . [Resampler;Github]
- CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers . [CrossGET;ICML 2024;Github]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models . [Q-former;Github]
- Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer . [Vote&Mix]
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning . [Token Compensator;ToCom;Github]
- Dynamic and Compressive Adaptation of Transformers From Images to Videos . [InTI]
- LookupViT: Compressing visual information to a limited number of tokens . [LookupViT;DeepMind]
- PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation . [PYRA;ECCV 2024;Github]
- PPT: Token Pruning and Pooling for Efficient Vision Transformers . [PPT;Github]
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers . [DiffRate;ICCV 2023;Github]
- Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers . [TPS;CVPR 2023;Github]
- Token Merging: Your ViT But Faster . [ToMe;Token Merging;ICLR 2023] (sketched below)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention . [Adaptive Sparse ViT]
- EViT: Expediting Vision Transformers via Token Reorganizations . [EViT;ICLR 2022;Github]
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space . [ViT-Slim;CVPR 2022;Github]
- A-ViT: Adaptive Tokens for Efficient Vision Transformer . [A-ViT]
- ATS: Adaptive Token Sampling For Efficient Vision Transformers . [ATS;ECCV 2022;Github]
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer . [Evo-ViT;AAAI 2022;Github]
- Patch Slimming for Efficient Vision Transformers . [Patch Slimming]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification . [DynamicViT;NeurIPS 2021;Github]
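
For readers new to the area, here is a minimal sketch of the idea behind FastV-style attention-guided dropping (the training-free VLM baseline above): at an early decoder layer, rank visual tokens by the attention they receive and keep only the top fraction for all later layers. This is a simplification under stated assumptions, not the paper's implementation: attention is averaged over all heads and query positions, and `vis_start`, `vis_end`, and `keep_ratio` are illustrative parameters rather than the paper's interface.

```python
# Hedged sketch of FastV-style visual-token dropping (not the official code).
import torch

def fastv_prune(hidden: torch.Tensor, attn: torch.Tensor,
                vis_start: int, vis_end: int, keep_ratio: float = 0.5) -> torch.Tensor:
    """hidden: (B, N, C) hidden states after the filtering layer;
    attn: (B, H, N, N) attention weights from that layer."""
    B, N, C = hidden.shape
    # Attention each visual token *receives*, averaged over heads and query positions.
    score = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_end]      # (B, N_vis)
    k = max(1, int(score.shape[-1] * keep_ratio))
    keep = score.topk(k, dim=-1).indices + vis_start                # global positions to keep

    mask = torch.zeros(B, N, dtype=torch.bool, device=hidden.device)
    mask[:, :vis_start] = True                                      # keep all pre-visual tokens
    mask[:, vis_end:] = True                                        # keep all post-visual tokens
    mask[torch.arange(B).unsqueeze(1), keep] = True                 # keep top-k visual tokens
    return hidden[mask].view(B, -1, C)                              # boolean mask preserves order

# Illustrative shapes: 16 system + 576 visual + 32 text tokens.
hidden = torch.randn(2, 624, 4096)
attn = torch.softmax(torch.randn(2, 32, 624, 624), dim=-1)
print(fastv_prune(hidden, attn, vis_start=16, vis_end=592).shape)   # torch.Size([2, 336, 4096])
```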
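
And a sketch of ToMe-style bipartite soft matching (Token Merging above), again simplified relative to the paper: similarity is computed on raw token features instead of attention keys, merging is an unweighted mean rather than a size-weighted average, and an even token count with no class token is assumed.

```python
# Hedged sketch of ToMe-style bipartite soft matching (not the official code).
import torch

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r tokens out of x with shape (B, N, C); returns (B, N - r, C)."""
    B, N, C = x.shape
    metric = x / x.norm(dim=-1, keepdim=True)          # cosine-normalised features
    a, b = metric[:, ::2], metric[:, 1::2]             # alternating split into sets A and B
    scores = a @ b.transpose(-1, -2)                   # (B, |A|, |B|) pairwise similarity

    node_max, node_idx = scores.max(dim=-1)            # best match in B for every A token
    order = node_max.argsort(dim=-1, descending=True)
    merged, kept = order[:, :r], order[:, r:]          # merge the r most similar A tokens

    xa, xb = x[:, ::2], x[:, 1::2]
    kept_a = xa.gather(1, kept.unsqueeze(-1).expand(-1, -1, C))
    src = xa.gather(1, merged.unsqueeze(-1).expand(-1, -1, C))
    dst = node_idx.gather(1, merged).unsqueeze(-1).expand(-1, -1, C)
    xb = xb.scatter_reduce(1, dst, src, reduce="mean", include_self=True)
    return torch.cat([kept_a, xb], dim=1)              # surviving A tokens + merged B tokens

tokens = torch.randn(2, 196, 768)                      # e.g. ViT-B/16 patch tokens
print(bipartite_soft_matching(tokens, r=16).shape)     # torch.Size([2, 180, 768])
```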