This repository provides the official PyTorch implementation of the paper Improving Foundation Model for Endoscopy Video Analysis via Representation Learning on Long Sequences by Zhao Wang, Chang Liu, Lingting Zhu, Tongtong Wang, Shaoting Zhang†, and Qi Dou†.
- First foundation model for learning from long endoscopy videos via self-supervised pre-training.
- A large-scale long endoscopy video dataset consisting of 6,469 long sequences with an average duration of 68.1 seconds.
- Promising performance on 4 different types of typical downstream endoscopic tasks, including classification, segmentation, detection, and workflow recognition.
Recent advancements in endoscopy video analysis have relied on the utilization of relatively short video clips extracted from longer videos or millions of individual frames. However, these approaches tend to neglect the domain-specific characteristics of endoscopy data, which is typically presented as a long stream containing valuable semantic spatial and temporal information. To address this limitation, we propose EndoFM-LV, a foundation model developed under a minute-level pre-training framework upon long endoscopy video sequences. To be specific, we propose a novel masked token modeling scheme within a teacher-student framework for self-supervised video pre-training, which is tailored for learning representations from long video sequences. For pre-training, we construct a large-scale long endoscopy video dataset comprising 6,469 long endoscopic video samples, each longer than 1 minute and totaling over 13 million frames. Our EndoFM-LV is evaluated on four types of endoscopy tasks, namely classification, segmentation, detection, and workflow recognition, serving as the backbone or temporal module. Extensive experimental results demonstrate that our framework outperforms previous state-of-the-art video-based and frame-based approaches by a significant margin on those various downstream tasks.
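The core pre-training idea above (masked token modeling within a teacher-student framework) can be sketched as below. This is a deliberately simplified, dependency-free illustration and not the actual EndoFM-LV implementation: the student encodes a sequence with some token positions masked out, the loss is computed against the teacher's features at the masked positions only, and the teacher is an exponential moving average (EMA) of the student. All helper names (`ema_update`, `encode`, `masked_mse`) are hypothetical.

```python
import random

def ema_update(teacher_w, student_w, momentum=0.99):
    """EMA teacher update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_w, student_w)]

def encode(weights, tokens):
    """Toy 'encoder': element-wise scaling of each token's features."""
    return [[w * x for w, x in zip(weights, tok)] for tok in tokens]

def masked_mse(student_out, teacher_out, mask_idx):
    """MSE between student and teacher features at masked positions only."""
    diffs = [(s - t) ** 2
             for i in mask_idx
             for s, t in zip(student_out[i], teacher_out[i])]
    return sum(diffs) / len(diffs)

random.seed(0)
dim, seq_len = 4, 16
tokens = [[random.random() for _ in range(dim)] for _ in range(seq_len)]
student_w = [1.0] * dim
teacher_w = [0.5] * dim

# Mask a random subset of token positions; the student must predict the
# teacher's features at these positions from the remaining context.
mask_idx = random.sample(range(seq_len), k=seq_len // 4)
masked = [([0.0] * dim if i in mask_idx else tok)
          for i, tok in enumerate(tokens)]

loss = masked_mse(encode(student_w, masked), encode(teacher_w, tokens), mask_idx)
teacher_w = ema_update(teacher_w, student_w)
```

In the real model the encoders are video transformers and the student can actually infer masked content from spatio-temporal context; this sketch only shows the shape of the objective and the EMA update.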
We utilize 4 public datasets and 1 private dataset for pre-training, and 4 datasets for the downstream tasks. Except for Cholec80, we provide our preprocessed data for both pre-training and the downstream tasks; you can download it directly via the following links:
- Pre-training
- Downstream: PolypDiag, CVC-12k, KUMC
Note: for the preprocessing of Cholec80 for workflow recognition, please refer to SV-RCNet.
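Since pre-training operates on minute-level sequences, each long video is typically subsampled into fixed-length clips before being fed to the model. Below is a minimal sketch of uniform temporal sampling; the repository's actual preprocessing may use a different strategy, and `sample_clip_indices` is a hypothetical helper.

```python
def sample_clip_indices(num_frames, clip_len):
    """Uniformly pick `clip_len` frame indices spanning a video of
    `num_frames` frames (e.g. a >1-minute endoscopy sequence)."""
    if num_frames < clip_len:
        # Pad short videos by repeating the last frame index.
        return list(range(num_frames)) + [num_frames - 1] * (clip_len - num_frames)
    stride = num_frames / clip_len
    return [int(i * stride) for i in range(clip_len)]

# A 68.1-second video at 25 fps has ~1702 frames; sample a 16-frame clip.
indices = sample_clip_indices(1702, 16)
```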
- torch==1.13.1
- torchvision==0.14.1
- pillow==10.0.1
- timm==0.9.7
We suggest using Anaconda to set up the environment on Linux. If you have already installed Anaconda, you can skip this step.
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh && bash Anaconda3-2020.11-Linux-x86_64.sh
Then, we can install the required packages using the provided environment.yaml:
cd EndoFM-LV
conda env create -f environment.yaml
conda activate endofm-lv
You can directly download our pre-trained EndoFM-LV via this link and put it under checkpoints/ for downstream fine-tuning.
Also, we provide the fine-tuned weights for the 4 downstream tasks for direct downstream testing.
| Dataset | PolypDiag | CVC-12k | KUMC | Cholec80 |
|---|---|---|---|---|
| Weights | link | link | link | link |
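When loading downloaded weights into your own model, checkpoint keys sometimes carry a wrapper prefix (e.g. `module.` added by DataParallel/DistributedDataParallel) that must be stripped before `load_state_dict` succeeds. A small pure-Python sketch of this remapping is shown below; the example keys are hypothetical and the actual EndoFM-LV checkpoint layout may differ.

```python
def strip_prefix(state_dict, prefix="module."):
    """Remove a wrapper prefix (e.g. from DataParallel) from checkpoint
    keys so they match the bare model's parameter names."""
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state_dict.items()}

# Hypothetical checkpoint keys for illustration only.
ckpt = {"module.patch_embed.weight": 1,
        "module.blocks.0.attn.qkv.weight": 2,
        "head.weight": 3}
clean = strip_prefix(ckpt)
```

With PyTorch, `clean` would then be passed to `model.load_state_dict(clean, strict=False)` so that task-specific heads missing from the checkpoint do not raise errors.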
cd EndoFM-LV
bash scripts/train_endofm_lv.sh
# PolypDiag (Classification)
cd EndoFM-LV
bash scripts/eval_finetune_polypdiag.sh
# CVC (Segmentation)
cd EndoFM-LV/TransUNet
python train.py
# KUMC (Detection)
cd EndoFM-LV/STMT
python setup.py build develop
python -m torch.distributed.launch \
--nproc_per_node=1 \
--master_port=$((RANDOM + 10000)) \
tools/train_net.py \
--config-file configs/STFT/kumc_R_50_STFT.yaml \
OUTPUT_DIR log_dir/kumc_finetune
# Cholec80 (Workflow Recognition)
cd EndoFM-LV/SV-RCNet
python train_singlenet_phase_1fc.py --exp endofm_lv
# PolypDiag (Classification)
cd EndoFM-LV
bash scripts/test_finetune_polypdiag.sh
# CVC (Segmentation)
cd EndoFM-LV/TransUNet
python train.py --test
# KUMC (Detection)
cd EndoFM-LV/STMT
python setup.py build develop
python -m torch.distributed.launch \
--nproc_per_node=1 \
--master_port=$((RANDOM + 10000)) \
tools/test_net.py \
--config-file configs/STFT/kumc_R_50_STFT.yaml \
MODEL.WEIGHT kumc.pth \
OUTPUT_DIR log_dir/kumc_finetune
# Cholec80 (Workflow Recognition)
cd EndoFM-LV/SV-RCNet
python train_singlenet_phase_1fc.py --exp endofm_lv --test
For further questions, please feel free to contact Zhao Wang.
This project is under the Apache License 2.0 license. See LICENSE for details.
Our code is based on DINO, TimeSformer, SVT, TransUNet, and STFT. Thanks to them for releasing their code.
If you find this code useful, please consider citing our paper:
@article{wang2025improving,
  title={Improving Foundation Model for Endoscopy Video Analysis via Representation Learning on Long Sequences},
  author={Zhao Wang and Chang Liu and Lingting Zhu and Tongtong Wang and Shaoting Zhang and Qi Dou},
  journal={IEEE Journal of Biomedical and Health Informatics},
  pages={},
  year={2025}
}