Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, Heng Tao Shen
This is the official code implementation of the paper "Text-Video Retrieval with Global-Local Semantic Consistent Learning". The checkpoints and features will be released soon.
We are continuously refactoring our code; please be patient and wait for the latest updates!
- Release the pre-trained weight and datasets.
- Release the training and evaluation code.
Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained alignment. Moreover, we propose an inter-consistency loss and an intra-diversity loss to ensure the similarity and diversity of these concepts across and within modalities, respectively.
Figure 1. Performance comparison of the retrieval results (R@1) and computational costs (FLOPs) for text-to-video retrieval models.
Figure 2. Overview of the proposed GLSCL for Text-Video retrieval.
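To make the idea of learnable queries and the two concept-level losses more concrete, here is a minimal pure-Python sketch. It is not the paper's formulation: the softmax attention pooling, the cosine-based loss definitions, and all function names (`attend`, `inter_consistency`, `intra_diversity`) are assumptions chosen only to illustrate "similar concepts across modalities, diverse concepts within a modality".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def attend(queries, features):
    """Each query pools the features with softmax attention,
    giving one concept vector per query (illustrative only)."""
    dim = len(features[0])
    concepts = []
    for q in queries:
        scores = [dot(q, f) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        concepts.append([
            sum(w / total * f[j] for w, f in zip(weights, features))
            for j in range(dim)
        ])
    return concepts

def inter_consistency(text_concepts, video_concepts):
    """Assumed form: mean (1 - cosine) between the k-th text concept
    and the k-th video concept."""
    pairs = list(zip(text_concepts, video_concepts))
    return sum(1.0 - cosine(t, v) for t, v in pairs) / len(pairs)

def intra_diversity(concepts):
    """Assumed form: mean positive cosine similarity between distinct
    concepts of the same modality (lower means more diverse)."""
    total, count = 0.0, 0
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            total += max(0.0, cosine(concepts[i], concepts[j]))
            count += 1
    return total / count if count else 0.0

# Toy data: two queries shared by both modalities, 2-d features.
queries = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[2.0, 0.1], [0.1, 2.0]]
frame_feats = [[1.5, 0.0], [0.0, 1.5], [0.2, 0.2]]

text_concepts = attend(queries, word_feats)
video_concepts = attend(queries, frame_feats)
print(inter_consistency(text_concepts, video_concepts))
print(intra_diversity(text_concepts), intra_diversity(video_concepts))
```

Sharing the same queries for both modalities is what lets the k-th text concept be compared directly with the k-th video concept in this sketch.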
The GLSCL framework depends on the following main requirements:
- torch==1.8.1+cu114
- Transformers 4.6.1
- OpenCV 4.5.3
- tqdm
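A quick way to compare an existing environment against the list above is sketched below. It uses only the standard library. The distribution names are assumptions (for instance, OpenCV is commonly installed from PyPI as `opencv-python`), and `check_requirements` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Assumed PyPI distribution names mapped to the version prefixes listed above.
EXPECTED = {
    "torch": "1.8.1",
    "transformers": "4.6.1",
    "opencv-python": "4.5.3",
    "tqdm": None,  # no version is pinned for tqdm
}

def check_requirements(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and (
            wanted is None or installed.startswith(wanted)
        )
        report[name] = (installed, ok)
    return report

if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        print(f"{name:15s} installed={installed} ok={ok}")
```

The prefix match means a local build tag such as `+cu...` after the version number does not cause a mismatch.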
We train our model on the MSR-VTT-9k, MSVD, DiDeMo, LSMDC, and ActivityNet datasets, respectively. Please refer to this repo for data preparation.
For simple training on MSR-VTT-9k with default hyperparameters:
bash train_msrvtt.sh
or run in the terminal directly:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2513 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 64 \
--anno_path ANNOTATION_PATH \
--video_path YOUR_RAW_VIDEO_PATH \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir YOUR_SAVE_PATH \
--center 1 \
--temp 3 \
--alpha 0.0001 \
--beta 0.005 \
--query_number 8 \
--base_encoder ViT-B/32 \
--cross_att_layer 3 \
--query_share 1 \
--cross_att_share 1 \
--loss2_weight 0.5
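For hyperparameter sweeps it can be easier to assemble the command above from a dictionary than to edit a long shell line by hand. The sketch below only builds and prints the command; `build_launch_command` is a hypothetical helper, and the flag names and placeholder paths are copied from the command above.

```python
import shlex

def build_launch_command(nproc, master_port, script, args):
    """Build the argv list for torch.distributed.launch from a dict of flags."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--master_port", str(master_port),
        f"--nproc_per_node={nproc}",
        script,
    ]
    for flag, value in args.items():
        cmd += [f"--{flag}", str(value)]
    return cmd

# Values mirror the MSR-VTT-9k training command; paths remain placeholders.
train_args = {
    "do_train": 1, "workers": 8, "n_display": 50, "epochs": 5,
    "lr": "1e-4", "coef_lr": "1e-3",
    "batch_size": 128, "batch_size_val": 64,
    "anno_path": "ANNOTATION_PATH", "video_path": "YOUR_RAW_VIDEO_PATH",
    "datatype": "msrvtt", "max_words": 32, "max_frames": 12,
    "video_framerate": 1, "output_dir": "YOUR_SAVE_PATH",
    "center": 1, "temp": 3, "alpha": 0.0001, "beta": 0.005,
    "query_number": 8, "base_encoder": "ViT-B/32",
    "cross_att_layer": 3, "query_share": 1, "cross_att_share": 1,
    "loss2_weight": 0.5,
}

cmd = build_launch_command(8, 2513, "main_retrieval.py", train_args)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to actually launch
```

Set `CUDA_VISIBLE_DEVICES` in the environment before launching, as in the shell version.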
For simple testing on MSR-VTT-9k with default hyperparameters:
bash train_msrvtt.sh
or
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2503 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 64 \
--anno_path ANNOTATION_PATH \
--video_path YOUR_RAW_VIDEO_PATH \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir YOUR_SAVE_PATH \
--center 1 \
--temp 3 \
--alpha 0.0001 \
--beta 0.005 \
--query_number 8 \
--base_encoder ViT-B/32 \
--cross_att_layer 3 \
--query_share 1 \
--cross_att_share 1 \
--loss2_weight 0.5 \
--init_model YOUR_CKPT_FILE
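Retrieval quality is reported with metrics such as R@1 (see Figure 1). As a reference for reading those numbers, here is a minimal sketch of Recall@K computed from a text-to-video similarity matrix. It assumes that text i matches video i, and `recall_at_k` is an illustrative helper rather than the repository's evaluation code.

```python
def recall_at_k(sim, k):
    """Fraction of text queries whose matching video ranks in the top k.

    sim[i][j] is the similarity of text i to video j; the ground-truth
    video for text i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scored strictly higher than the ground truth.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)

# Toy example: texts 0 and 2 retrieve their video first, text 1 second.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1))  # 2 of 3 queries hit at rank 1
print(recall_at_k(sim, 2))  # all 3 queries hit within the top 2
```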
@inproceedings{GLSCL,
author = {Haonan Zhang and
Pengpeng Zeng and
Lianli Gao and
Jingkuan Song and
Yihang Duan and
Xinyu Lyu and
Heng Tao Shen
},
title = {Text-Video Retrieval with Global-Local Semantic Consistent Learning},
year = {2024}
}