
[JBHI'25] Improving Foundation Model for Endoscopy Video Analysis via Representation Learning on Long Sequences


EndoFM-LV

This repository provides the official PyTorch implementation of the paper Improving Foundation Model for Endoscopy Video Analysis via Representation Learning on Long Sequences by Zhao Wang, Chang Liu, Lingting Zhu, Tongtong Wang, Shaoting Zhang†, and Qi Dou†.

Key Features

  • The first foundation model for learning from long endoscopy videos via self-supervised pre-training.
  • A large-scale long endoscopy video dataset consisting of 6,469 long sequences with an average duration of 68.1 seconds.
  • Promising performance on four typical types of downstream endoscopic tasks: classification, segmentation, detection, and workflow recognition.
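
As a quick sanity check, the dataset statistics above are mutually consistent: 6,469 sequences at an average of 68.1 s come to roughly 13 million frames at a typical video frame rate. The 30 fps figure below is our assumption, not stated in this README.

```python
# Back-of-the-envelope check of the dataset statistics:
# 6,469 sequences x 68.1 s average duration x an assumed 30 fps.
n_videos = 6469
avg_seconds = 68.1
fps = 30  # assumed frame rate; not stated in the README

total_frames = n_videos * avg_seconds * fps
print(f"~{total_frames / 1e6:.1f} million frames")  # → ~13.2 million frames
```

This matches the "over 13 million frames" figure quoted in the Details section.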

Details

Recent advancements in endoscopy video analysis have relied on relatively short video clips extracted from longer videos, or on millions of individual frames. However, these approaches tend to neglect the domain-specific characteristics of endoscopy data, which typically comes as a long stream containing valuable spatial and temporal semantic information. To address this limitation, we propose EndoFM-LV, a foundation model developed under a minute-level pre-training framework upon long endoscopy video sequences. Specifically, we propose a novel masked token modeling scheme within a teacher-student framework for self-supervised video pre-training, tailored for learning representations from long video sequences. For pre-training, we construct a large-scale long endoscopy video dataset comprising 6,469 long endoscopic video samples, each longer than 1 minute and totaling over 13 million frames. Our EndoFM-LV is evaluated on four types of endoscopy tasks, namely classification, segmentation, detection, and workflow recognition, serving as the backbone or temporal module. Extensive experimental results demonstrate that our framework outperforms previous state-of-the-art video-based and frame-based approaches by a significant margin across these downstream tasks.
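
The pre-training scheme described above combines two standard ingredients: a teacher network whose weights track an exponential moving average (EMA) of the student's, and random masking of video tokens that the student must predict from the teacher's output. The sketch below is a toy illustration of these two mechanics in plain Python, with function names of our own choosing; it is not the repository's actual model code.

```python
import random

def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the student's by an EMA step,
    as in DINO-style teacher-student self-supervision."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

def mask_token_indices(num_tokens, mask_ratio=0.5, rng=None):
    """Pick a random subset of token positions to mask; the student is
    trained to match the teacher's representations at these positions."""
    rng = rng or random.Random(0)
    n_mask = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), n_mask))

# One conceptual pre-training step on a 10-token clip:
teacher, student = [0.0], [1.0]
masked = mask_token_indices(num_tokens=10, mask_ratio=0.5)
teacher = ema_update(teacher, student)  # teacher drifts slowly toward student
```

With a momentum close to 1, the teacher changes slowly and provides stable targets for the student across long sequences.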

Datasets

We utilize 4 public and 1 private dataset for pre-training, and 4 datasets for the downstream tasks. Except for Cholec80, we provide our preprocessed data for pre-training and the downstream tasks, which you can download directly via the following links:

Note: for the preprocessing of Cholec80 for workflow recognition, please refer to SV-RCNet.

Get Started

Main Requirements

  • torch==1.13.1
  • torchvision==0.14.1
  • pillow==10.0.1
  • timm==0.9.7
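
If you prefer pip over conda, the pins above can be captured in a `requirements.txt` (our own sketch; the provided `environment.yaml` remains the supported route):

```
torch==1.13.1
torchvision==0.14.1
pillow==10.0.1
timm==0.9.7
```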

Installation

We suggest using Anaconda to set up the environment on Linux; if you have already installed Anaconda, you can skip this step.

wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh && bash Anaconda3-2020.11-Linux-x86_64.sh

Then, install the packages using the provided environment.yaml:

cd EndoFM-LV
conda env create -f environment.yaml
conda activate endofm-lv

Pre-trained Weights

You can directly download our pre-trained EndoFM-LV via this link and put it under checkpoints/ for downstream fine-tuning.
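
When loading a downloaded checkpoint into a model, keys saved from a DataParallel/DistributedDataParallel run often carry a `module.` prefix that must be stripped before they match a plain model's parameter names. The helper below is a hypothetical sketch of that common step (the checkpoint's actual key layout is an assumption, and `strip_module_prefix` is our own name), shown as plain dictionary manipulation:

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Remove a DataParallel-style prefix from checkpoint keys so they
    match a plain (non-wrapped) model's parameter names."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Example: mixed prefixed/unprefixed keys are normalized in one pass.
ckpt = {"module.blocks.0.attn.qkv.weight": 1, "head.bias": 2}
print(strip_module_prefix(ckpt))
# → {'blocks.0.attn.qkv.weight': 1, 'head.bias': 2}
```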

Also, we provide the trained weights for the 4 downstream tasks for direct downstream testing.

Dataset   PolypDiag   CVC-12k   KUMC   Cholec80
Weights   link        link      link   link

Pre-training

cd EndoFM-LV
bash scripts/train_endofm_lv.sh

Downstream Fine-tuning

# PolypDiag (Classification)
cd EndoFM-LV
bash scripts/eval_finetune_polypdiag.sh

# CVC (Segmentation)
cd EndoFM-LV/TransUNet
python train.py

# KUMC (Detection)
cd EndoFM-LV/STMT
python setup.py build develop
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    tools/train_net.py \
    --master_port=$((RANDOM + 10000)) \
    --config-file configs/STFT/kumc_R_50_STFT.yaml \
    OUTPUT_DIR log_dir/kumc_finetune
    
# Cholec80 (Workflow Recognition)
cd EndoFM-LV/SV-RCNet
python train_singlenet_phase_1fc.py --exp endofm_lv

Direct Downstream Testing

# PolypDiag (Classification)
cd EndoFM-LV
bash scripts/test_finetune_polypdiag.sh

# CVC (Segmentation)
cd EndoFM-LV/TransUNet
python train.py --test

# KUMC (Detection)
cd EndoFM-LV/STMT
python setup.py build develop
python -m torch.distributed.launch \
    --nproc_per_node=1 \
    tools/test_net.py \
    --master_port=$((RANDOM + 10000)) \
    --config-file configs/STFT/kumc_R_50_STFT.yaml \
    MODEL.WEIGHT kumc.pth \
    OUTPUT_DIR log_dir/kumc_finetune

# Cholec80 (Workflow Recognition)
cd EndoFM-LV/SV-RCNet
python train_singlenet_phase_1fc.py --exp endofm_lv --test

🙋‍♀️ Feedback and Contact

For further questions, please feel free to contact Zhao Wang.

🛡️ License

This project is licensed under the Apache License 2.0. See LICENSE for details.

🙏 Acknowledgement

Our code is based on DINO, TimeSformer, SVT, TransUNet, and STFT. We thank the authors for releasing their code.

📝 Citation

If you find this code useful, please cite our paper:

@article{wang2025improving,
    title={Improving Foundation Model for Endoscopy Video Analysis via Representation Learning on Long Sequences},
    author={Wang, Zhao and Liu, Chang and Zhu, Lingting and Wang, Tongtong and Zhang, Shaoting and Dou, Qi},
    journal={IEEE Journal of Biomedical and Health Informatics},
    year={2025}
}
