Here we present our winning solution, and its further development, for the MICCAI 2017 Endoscopic Vision Challenge on Angiodysplasia Detection and Localization. It addresses a binary segmentation problem, in which every pixel in an image is labeled as either angiodysplasia lesion or background. We then analyze the connected components of each predicted mask. Based on this analysis, we developed a classifier that predicts the presence of angiodysplasia lesions (a binary variable) and a detector for their localization (the center of each component).
Alexey Shvets, Vladimir Iglovikov, Alexander Rakhlin, Alexandr A. Kalinin
If you find this work useful for your publications, please consider citing:
@inproceedings{shvets2018angiodysplasia,
  title={Angiodysplasia Detection and Localization using Deep Convolutional Neural Networks},
  author={Shvets, Alexey A and Iglovikov, Vladimir I and Rakhlin, Alexander and Kalinin, Alexandr A},
  booktitle={2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
  pages={612--617},
  year={2018}
}
Angiodysplasias are degenerative lesions of previously healthy blood vessels, in which the bowel wall has microvascular abnormalities. These lesions are the most common source of small bowel bleeding in patients older than 50 years, and cause approximately 8% of all gastrointestinal bleeding episodes. The gold-standard examination for angiodysplasia detection and localization in the small bowel is performed using Wireless Capsule Endoscopy (WCE). The latest generation of this pill-like device can acquire more than 60,000 images with a resolution of approximately 520 × 520 pixels. According to the current state of the art, only 69% of angiodysplasias are detected by expert gastroenterologists when reading WCE videos, and blood indicator software (supplied by WCE providers such as Given Imaging) achieves sensitivity and specificity of only 41% and 67%, respectively, in the presence of angiodysplasias.
The dataset consists of 1200 color images obtained with WCE. The images are in 24-bit PNG format, with 576 × 576 pixel resolution. The dataset is split into two equal parts: 600 images for training and 600 for evaluation. Each subset is composed of 300 images with apparent angiodysplasia (AD) and 300 without any pathology. The training subset is annotated by a human expert and contains 300 binary masks in JPEG format at the same 576 × 576 pixel resolution. White pixels in the masks correspond to lesion localization.
The first row corresponds to images without pathology, the second row to images with several AD lesions in every image, and the last row contains the masks that correspond to the pathology images from the second row.
Most images contain 1 lesion. The distribution of AD lesion areas reaches a maximum of 12,000 pixels, with a median of 1,648 pixels.
We evaluate 4 different deep architectures for segmentation: U-Net (Ronneberger et al., 2015; Iglovikov et al., 2017a), 2 modifications of TernausNet (Iglovikov and Shvets, 2018), and AlbuNet34, a modification of LinkNet (Chaurasia and Culurciello, 2017; Shvets et al., 2018). As an improvement over the standard U-Net, we use similar networks with pre-trained encoders. TernausNet (Iglovikov and Shvets, 2018) is a U-Net-like architecture that uses the relatively simple pre-trained VGG11 or VGG16 (Simonyan and Zisserman, 2014) networks as an encoder. VGG11 consists of seven convolutional layers, each followed by a ReLU activation function, and five max pooling operations, each reducing the feature map by a factor of 2. All convolutional layers have 3 × 3 kernels. TernausNet16 has a similar structure and uses the VGG16 network as an encoder.
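We use the Jaccard index (Intersection over Union) as the evaluation metric. It can be interpreted as a similarity measure between a finite number of sets. For two sets A and B, it can be defined as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

Since an image consists of pixels, the expression can be adapted for discrete objects in the following way:

$$J = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i} \right)$$

where $y_i$ and $\hat{y}_i$ are a binary value (label) and a predicted probability for the pixel $i$, respectively, and $n$ is the number of pixels.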
Since the image segmentation task can also be considered a pixel classification problem, we additionally use common classification loss functions, denoted as H. For a binary segmentation problem H is the binary cross entropy, while for a multi-class segmentation problem H is the categorical cross entropy.
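The sketch below shows one way a classification loss H and a Jaccard term could be combined for the binary case. The weighting scheme is an assumption for illustration (it mirrors the name of the `--jaccard-weight` training flag) and is not a statement of the exact loss used in training. The soft Jaccard index is aggregated here as summed intersection over summed union, which is also an assumption, and `soft_jaccard`, `binary_cross_entropy`, `combined_loss`, and `eps` are hypothetical names.

```python
import numpy as np


def soft_jaccard(y_true, y_prob, eps=1e-7):
    """Soft Jaccard index computed from labels and predicted probabilities."""
    intersection = np.sum(y_true * y_prob)
    union = np.sum(y_true) + np.sum(y_prob) - intersection
    # eps keeps the ratio defined when both masks are empty
    return (intersection + eps) / (union + eps)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy H over all pixels."""
    p = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def combined_loss(y_true, y_prob, jaccard_weight=0.3):
    """Assumed weighting: (1 - w) * H - w * log(J)."""
    h = binary_cross_entropy(y_true, y_prob)
    j = soft_jaccard(y_true, y_prob)
    return (1.0 - jaccard_weight) * h - jaccard_weight * float(np.log(j))


# Toy 2 x 2 example: a prediction close to the labels and one far from them.
y_true = np.array([[0, 1], [1, 1]], dtype=np.float64)
good = np.array([[0.1, 0.9], [0.8, 0.9]])
bad = np.array([[0.9, 0.1], [0.2, 0.1]])
loss_good = combined_loss(y_true, good)
loss_bad = combined_loss(y_true, bad)
```

Under this weighting, the prediction that overlaps the labels more closely receives the lower loss, since both the cross entropy and the negative log of the Jaccard index decrease as the overlap grows.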
As an output of a model, we obtain an image in which each pixel value corresponds to a probability of belonging to the area of interest or a class. The size of the output image matches the input image size. For binary segmentation, we use 0.3 as a threshold value (chosen using the validation dataset) to binarize pixel probabilities. All pixel values below the specified threshold are set to 0, while all values above the threshold are set to 255 to produce the final prediction mask.
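A minimal sketch of the binarization step, assuming the model output is available as a NumPy array of probabilities. The names `binarize` and `THRESHOLD` are hypothetical, and treating a value exactly equal to the threshold as background is an assumption, since the text only covers values below and above it.

```python
import numpy as np

THRESHOLD = 0.3  # threshold value reported in the text


def binarize(prob_map, threshold=THRESHOLD):
    """Map probabilities to a uint8 mask: 255 above the threshold, 0 otherwise."""
    # Assumption: values exactly equal to the threshold are treated as background.
    return np.where(prob_map > threshold, 255, 0).astype(np.uint8)


probs = np.array([[0.05, 0.31], [0.29, 0.95]])
mask = binarize(probs)
```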
Following the segmentation step, we perform postprocessing to find the coordinates of angiodysplasia lesions in the image. In the postprocessing step we use the OpenCV implementation of the connected component labeling function, connectedComponentsWithStats. This function returns the number of connected components, their sizes (areas), and the centroid coordinates of each connected component. In our detector we use another threshold to discard all clusters smaller than 300 pixels. To establish the presence of lesions, the number of remaining components must be greater than 0; otherwise the image is classified as normal. Then, for localization of angiodysplasia lesions, we return the centroid coordinates of all remaining connected components.
The quantitative comparison of our models' performance is presented in Table 1. For the segmentation task, the best result is achieved by AlbuNet34, with IoU = 0.754 and Dice = 0.850. When compared by inference time, AlbuNet34 is also the fastest model due to its light encoder. In the segmentation task this network takes around 20 ms.
Prediction of our detector on a validation image. The left picture is the original image, the central one is the ground truth mask, and the right one is the predicted mask. Green dots correspond to the centroid coordinates that define the localization of the angiodysplasia.
Model | IOU, % | Dice, % | Inference time, ms |
---|---|---|---|
U-Net | 73.18 | 83.06 | 21 |
TernausNet-11 | 74.94 | 84.43 | 51 |
TernausNet-16 | 73.83 | 83.05 | 60 |
AlbuNet34 | 75.35 | 84.98 | 30 |
Pre-trained weights for all models can be found on Google Drive.
- Python 3.6
- PyTorch 0.3.1
- TorchVision 0.1.9
- numpy 1.14.0
- opencv-python 3.3.0.10
- tqdm 4.19.4
These dependencies can be installed by running:
pip install -r requirements.txt
The dataset is organized in the following way:
├── data
│   ├── test
│   └── train
│       ├── angyodysplasia
│       │   ├── images
│       │   └── masks
│       └── normal
│           ├── images
│           └── masks
│       .......................
The training dataset contains 2 sets of images, one with angiodysplasia and one without it. For training we used only the images with angiodysplasia, which were split into 5 folds.
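A fold split of this kind could be sketched as below. This is an illustration under stated assumptions, not the split used for the reported results: the seeded shuffle, the function names `split_into_folds` and `train_val_split`, and the file names are all hypothetical.

```python
import random


def split_into_folds(file_names, n_folds=5, seed=0):
    """Shuffle file names reproducibly and deal them into n_folds groups."""
    names = sorted(file_names)
    random.Random(seed).shuffle(names)
    return [names[i::n_folds] for i in range(n_folds)]


def train_val_split(folds, fold):
    """Use one fold for validation and the remaining folds for training."""
    val = folds[fold]
    train = [name for i, part in enumerate(folds) if i != fold for name in part]
    return train, val


# Hypothetical file names standing in for the 300 images with lesions.
names = ["{:03d}.jpg".format(i) for i in range(300)]
folds = split_into_folds(names)
train_names, val_names = train_val_split(folds, fold=0)
```

With 300 images and 5 folds, each validation fold holds 60 images and the corresponding training set holds 240.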
- Training
The main file used to train all models is train.py. Running python train.py --help will return the set of all possible input parameters.
To train all models we used the following bash script (the batch size was chosen depending on how many samples fit into the GPU RAM, and the limit was adjusted accordingly to keep the same number of updates for every network):
#!/bin/bash

for i in 0 1 2 3
do
    python train.py --device-ids 0,1,2,3 --limit 10000 --batch-size 12 --fold $i --workers 12 --lr 0.0001 --n-epochs 10 --jaccard-weight 0.3 --model UNet11
    python train.py --device-ids 0,1,2,3 --limit 10000 --batch-size 12 --fold $i --workers 12 --lr 0.00001 --n-epochs 15 --jaccard-weight 0.3 --model UNet11
done
- Mask generation.
The main file used to generate masks is generate_masks.py. Running python generate_masks.py --help will return the set of all possible input parameters. Example:
python generate_masks.py --output_path predictions/UNet16 --model_type UNet16 --model_path data/models/UNet16 --fold -1 --batch-size 4
- Evaluation.
The evaluation differs between binary and multi-class segmentation:
[a] For binary segmentation, it calculates Jaccard (Dice) per image / per video, and the results are then averaged.
[b] For multi-class segmentation, it calculates Jaccard (Dice) for every class independently, then averages them for each image and then for every video:
python evaluate.py --target_path predictions/UNet16 --train_path data/train/angyodysplasia/masks
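For the binary case, the per-image averaging can be illustrated with the sketch below. It is not the code of evaluate.py: the function names are hypothetical, mask loading is omitted (small arrays stand in for the mask files), and scoring two empty masks as a perfect match is an assumption.

```python
import numpy as np


def jaccard(y_true, y_pred):
    """Jaccard index between two binary masks (any nonzero pixel is foreground)."""
    y_true, y_pred = np.asarray(y_true) > 0, np.asarray(y_pred) > 0
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    # Assumption: two empty masks count as a perfect match
    return 1.0 if union == 0 else float(intersection) / float(union)


def dice(y_true, y_pred):
    """Dice coefficient between two binary masks."""
    y_true, y_pred = np.asarray(y_true) > 0, np.asarray(y_pred) > 0
    intersection = np.logical_and(y_true, y_pred).sum()
    total = y_true.sum() + y_pred.sum()
    return 1.0 if total == 0 else 2.0 * float(intersection) / float(total)


def evaluate(pairs):
    """Average per-image Jaccard and Dice over (ground truth, prediction) pairs."""
    jaccards = [jaccard(t, p) for t, p in pairs]
    dices = [dice(t, p) for t, p in pairs]
    return float(np.mean(jaccards)), float(np.mean(dices))


# Image 1: 4 true pixels, 4 predicted pixels, 2 shared. Image 2: perfect match.
true_1 = np.array([[255, 255, 255, 255, 0, 0]])
pred_1 = np.array([[0, 0, 255, 255, 255, 255]])
true_2 = np.array([[255, 0, 0, 0, 0, 0]])
pred_2 = np.array([[255, 0, 0, 0, 0, 0]])
mean_jaccard, mean_dice = evaluate([(true_1, pred_1), (true_2, pred_2)])
```

Here the first image scores Jaccard 2/6 and Dice 4/8, the second scores 1 on both, so the averages are 2/3 and 0.75.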
- Further Improvements.
Our results can be improved by a few percent using simple measures such as additional augmentation of the training images and training the models for longer. In addition, a cyclic learning rate or cosine annealing could be applied; our pre-trained weights can serve as initialization for this. To improve test predictions, test-time augmentation (TTA) could be used, as well as averaging predictions from all folds.
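A minimal sketch of TTA combined with fold averaging is given below, assuming each fold model is wrapped in a function that maps a 2D image array to a probability map of the same shape. The horizontal flip as the only augmentation, the names `predict_with_tta` and `ensemble_folds`, and the toy stand-in models are all assumptions for illustration.

```python
import numpy as np


def predict_with_tta(predict, image):
    """Average the prediction on the image and on its horizontal flip."""
    direct = predict(image)
    # Flip the input, predict, then flip the prediction back into place
    flipped = predict(image[:, ::-1])[:, ::-1]
    return (direct + flipped) / 2.0


def ensemble_folds(predict_fns, image):
    """Average TTA predictions from one model per fold."""
    return np.mean([predict_with_tta(fn, image) for fn in predict_fns], axis=0)


def make_toy_model(scale):
    """Toy stand-in for a trained fold model (hypothetical, not a real network)."""
    return lambda img: np.clip(img * scale, 0.0, 1.0)


image = np.array([[0.2, 0.8], [0.4, 0.6]])
models = [make_toy_model(0.5), make_toy_model(1.0)]
probs = ensemble_folds(models, image)
```

The averaged probability map can then be thresholded in the same way as a single-model prediction.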
You can start working with our models using the demonstration example: Demo.ipynb