Welcome! This repository contains the code and checkpoints for the paper DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training. DNTextSpotter decomposes the queries of the denoising part into noised positional queries and noised content queries. It generates the noised positional queries from the four Bezier control points of the Bezier center curve. For the noised content queries, since emitting text in a fixed positional order hinders the alignment of position and content, it initializes them with a masked character sliding method, which helps align text content with position. In addition, to improve the model's perception of the background, it adds a loss function for background-character classification to the denoising training part.
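The noised positional queries described above can be sketched roughly as follows. This is a minimal illustration, not the training code: the function names, the uniform noise model, and the normalized coordinates are all simplifying assumptions made for this example.

```python
import random


def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bezier curve at parameter t from 4 control points."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = ctrl
    u = 1.0 - t
    x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
    y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
    return x, y


def noised_positional_queries(ctrl, n_points=8, noise_scale=0.02, seed=0):
    """Perturb the 4 center-curve control points with small random noise,
    then sample n_points along the noised curve (coords assumed in [0, 1])."""
    rng = random.Random(seed)
    noised = [(x + rng.uniform(-noise_scale, noise_scale),
               y + rng.uniform(-noise_scale, noise_scale)) for x, y in ctrl]
    return [cubic_bezier(noised, i / (n_points - 1)) for i in range(n_points)]
```

The sampled points along the noised center curve then serve as positional anchors that the denoising decoder learns to reconstruct back to the clean curve.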
- [2024/7/16] 🎉🎉🎉 DNTextSpotter is accepted by ACM'MM 2024!
## 1. Pre-trained Models for Total-Text & Inverse-Text & IC15
| Backbone | Training Data | Weights |
|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | Drive |
| ViTAEv2-S | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | Drive |
### Finetune on Total-Text

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K+MLT17+IC13+IC15+TextOCR | 91.5 | 87.0 | | | | Drive |
| ViTAEv2-S | Synth150K+MLT17+IC13+IC15+TextOCR | 92.9 | 88.6 | 90.7 | 85.0 | 90.5 | Drive |
### Finetune on ICDAR 2015 (IC15)

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-S | E2E-W | E2E-G | Weights |
|---|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13+TextOCR | 92.5 | 87.2 | 89.8 | | | | OneDrive |
| ViTAEv2-S | Synth150K+Total-Text+MLT17+IC13+TextOCR | 92.4 | 87.9 | 89.4 | 85.2 | 80.6 | | OneDrive |
### Inverse-Text (using the same weights as the model finetuned on Total-Text)

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K+MLT17+IC13+IC15+TextOCR | 94.3 | 77.2 | | | | Drive |
| ViTAEv2-S | Synth150K+MLT17+IC13+IC15+TextOCR | 95.4 | 79.2 | 86.4 | 78.1 | 83.8 | Drive |
## 2. Pre-trained Model for CTW1500

| Backbone | Training Data | Weights |
|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | Drive |
### Finetune on CTW1500

| Backbone | External Data | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
|---|---|---|---|---|---|---|---|
| Res-50 | Synth150K+Total-Text+MLT17+IC13+IC15+TextOCR | 93.5 | 87.1 | 90.2 | 67.0 | 84.2 | Drive |
## Requirements

- Python==3.8
- PyTorch>=2.0.1
- CUDA>=11.7
- Detectron2
```shell
git clone https://github.com/yyyyyxie/DNTextSpotter.git
cd DNTextSpotter
conda create -n dnts python=3.8 -y
conda activate dnts
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
cd detectron2
pip install -e .
cd ..
pip install -r requirements.txt
python setup.py build develop
```
You can find the datasets here.
```
|- ./datasets
   |- syntext1
   |  |- train_images
   |  └  annotations
   |     |- train_37voc.json
   |     └  train_96voc.json
   |- syntext2
   |  |- train_images
   |  └  annotations
   |     |- train_37voc.json
   |     └  train_96voc.json
   |- mlt2017
   |  |- train_images
   |  └  annotations
   |     |- train_37voc.json
   |     └  train_96voc.json
   |- totaltext
   |  |- train_images
   |  |- test_images
   |  |- train_37voc.json
   |  |- train_96voc.json
   |  |- weak_voc_new.txt
   |  |- weak_voc_pair_list.txt
   |  └  test.json
   |- ic13
   |  |- train_images
   |  |- train_37voc.json
   |  └  train_96voc.json
   |- ic15
   |  |- train_images
   |  |- test_images
   |  |- train_37voc.json
   |  |- train_96voc.json
   |  └  test.json
   |- CTW1500
   |  |- train_images
   |  |- test_images
   |  └  annotations
   |     |- train_96voc.json
   |     └  test.json
   |- textocr
   |  |- train_images
   |  |- train_37voc_1.json
   |  |- train_37voc_2.json
   |  |- train_96voc_1.json
   |  └  train_96voc_2.json
   |- inversetext
   |  |- test_images
   |  |- inversetext_lexicon.txt
   |  |- inversetext_pair_list.txt
   |- evaluation
      |- gt_*.zip
```
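Before launching training, a small helper can sanity-check that the expected files exist under the dataset root. This is a hypothetical convenience function, not part of the repository; `EXPECTED` lists only two datasets from the tree above as an example and can be extended with the rest.

```python
import os

# A subset of the dataset tree, as an example; extend with the other datasets.
EXPECTED = {
    "totaltext": ["train_images", "test_images", "train_37voc.json",
                  "train_96voc.json", "test.json"],
    "ic15": ["train_images", "test_images", "train_37voc.json",
             "train_96voc.json", "test.json"],
}


def missing_entries(root, expected=EXPECTED):
    """Return paths from the expected layout that are absent under root."""
    missing = []
    for dataset, entries in expected.items():
        for entry in entries:
            path = os.path.join(root, dataset, entry)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

Running `missing_entries("./datasets")` and checking the result is empty is a cheap way to catch path mistakes before an 8-GPU job fails at data-loading time.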
## Training

### Total-Text & ICDAR2015

1. Pre-train

For example, pre-train DNTextSpotter:

```shell
python tools/train_net.py --config-file configs/R_50/pretrain/150k_tt_mlt_13_15.yaml --num-gpus 8
```

2. Fine-tune

Fine-tune on Total-Text or ICDAR2015:

```shell
python tools/train_net.py --config-file configs/R_50/TotalText/finetune_150k_tt_mlt_13_15_textocr.yaml --num-gpus 8
python tools/train_net.py --config-file configs/R_50/IC15/finetune_150k_tt_mlt_13_15_textocr.yaml --num-gpus 8
```
### CTW1500

1. Pre-train

```shell
python tools/train_net.py --config-file configs/R_50/CTW1500/pretrain_96voc_50maxlen.yaml --num-gpus 8
```

2. Fine-tune

```shell
python tools/train_net.py --config-file configs/R_50/CTW1500/finetune_96voc_50maxlen.yaml --num-gpus 8
```
## Evaluation

```shell
python tools/train_net.py --config-file ${CONFIG_FILE} --eval-only MODEL.WEIGHTS ${MODEL_PATH}
```
## Demo

```shell
python demo/demo.py --config-file ${CONFIG_FILE} --input ${IMAGES_FOLDER_OR_ONE_IMAGE_PATH} --output ${OUTPUT_PATH} --opts MODEL.WEIGHTS ${MODEL_PATH}
```
🤩 If you encounter any difficulty using our code, please do not hesitate to open an issue or contact us directly!

😍 If you find our work helpful (or simply want to encourage us), please consider giving this repository a star and citing our paper.
```bibtex
@article{xie2024dntextspotter,
  title={DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training},
  author={Xie, Yu and Qiao, Qian and Gao, Jun and Wu, Tianxiang and Huang, Shaoyao and Fan, Jiaqing and Cao, Ziqiang and Wang, Zili and Zhang, Yue and Zhang, Jielei and others},
  journal={arXiv preprint arXiv:2408.00355},
  year={2024}
}
```
or
```bibtex
@inproceedings{qiao2024dntextspotter,
  title={DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training},
  author={Qiao, Qian and Xie, Yu and Gao, Jun and Wu, Tianxiang and Huang, Shaoyao and Fan, Jiaqing and Cao, Ziqiang and Wang, Zili and Zhang, Yue},
  booktitle={ACM Multimedia 2024},
  year={2024}
}
```
This project is based on AdelaiDet and DeepSolo. For academic use, this project is licensed under the 2-clause BSD License.