Added COVIDNet CXR-S and COVIDxSev for airspace grading to scripts

lindawangg · Apr 16, 2021 · aeaf3d0
1 parent 74eaec6
commit aeaf3d0

Showing 13 changed files with 1,265 additions and 20 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,6 @@
 .ipynb_checkpoints
 data
+data_sev
 .DS_Store
 *.zip
 __pycache__

diff --git a/README.md b/README.md
@@ -4,6 +4,7 @@
 
 **Recording to webinar on [How we built COVID-Net in 7 days with Gensynth](https://darwinai.news/fny)**
 
+**Update 04/16/2021:** We released a new COVIDNet CXR-S [model](docs/models.md) and [COVIDxSev](create_COVIDxSev.ipynb) dataset for airspace severity grading in COVID-19 positive patient CXR images. For more information on training, testing and inference please refer to severity [docs](docs/covidnet_severity.md).\
 **Update 03/20/2021:** We released a new COVID-Net CXR-2 [model](docs/models.md) for COVID-19 positive/negative detection which was trained on the new COVIDx8B dataset with 16,352 CXR images from a multinational cohort of 15,346 patients from at least 51 countries. The test results are based on the new COVIDx8B test set of 200 COVID-19 positive and 200 negative CXR images.\
 **Update 03/19/2021:** We released updated datasets and dataset curation scripts. The COVIDx V8A dataset and create_COVIDx.ipynb are for detection of no pneumonia/non-COVID-19 pneumonia/COVID-19 pneumonia, and COVIDx V8B dataset and create_COVIDx_binary.ipynb are for COVID-19 positive/negative detection. Both datasets contain over 16000 CXR images with over 2300 positive COVID-19 images.\
 **Update 01/28/2021:** We released updated datasets and dataset curation scripts. The COVIDx V7A dataset and create_COVIDx.ipynb are for detection of no pneumonia/non-COVID-19 pneumonia/COVID-19 pneumonia, and COVIDx V7B dataset and create_COVIDx_binary.ipynb are for COVID-19 positive/negative detection. Both datasets contain over 15600 CXR images with over 1700 positive COVID-19 images.\
@@ -69,12 +70,13 @@ If you find our work useful, can cite our paper using:
 ## Quick Links
 1. COVIDNet-CXR models (COVID-19 detection using chest x-rays): https://github.com/lindawangg/COVID-Net/blob/master/docs/models.md
 2. COVIDNet-CT models (COVID-19 detection using chest CT scans): https://github.com/haydengunraj/COVIDNet-CT/blob/master/docs/models.md
-3. COVIDNet-S models (COVID-19 lung severity assessment using chest x-rays): https://github.com/lindawangg/COVID-Net/blob/master/docs/models.md
-4. COVIDx-CXR dataset: https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md
-5. COVIDx-CT dataset: https://github.com/haydengunraj/COVIDNet-CT/blob/master/docs/dataset.md
-6. COVIDx-S dataset: https://github.com/lindawangg/COVID-Net/tree/master/annotations
-7. COVIDNet-P inference for pneumonia: https://github.com/lindawangg/COVID-Net/blob/master/docs/covidnet_pneumonia.md
-8. CancerNet-SCa models for skin cancer detection: https://github.com/jamesrenhoulee/CancerNet-SCa/blob/main/docs/models.md
+3. COVIDNet-CXR-S models (COVID-19 airspace severity grading using chest x-rays): https://github.com/lindawangg/COVID-Net/blob/master/docs/models.md
+4. COVIDNet-S models (COVID-19 lung severity assessment using chest x-rays): https://github.com/lindawangg/COVID-Net/blob/master/docs/models.md
+5. COVIDx-CXR dataset: https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md
+6. COVIDx-CT dataset: https://github.com/haydengunraj/COVIDNet-CT/blob/master/docs/dataset.md
+7. COVIDx-S dataset: https://github.com/lindawangg/COVID-Net/tree/master/annotations
+8. COVIDNet-P inference for pneumonia: https://github.com/lindawangg/COVID-Net/blob/master/docs/covidnet_pneumonia.md
+9. CancerNet-SCa models for skin cancer detection: https://github.com/jamesrenhoulee/CancerNet-SCa/blob/main/docs/models.md
 
 Training, inference, and evaluation scripts for COVIDNet-CXR, COVIDNet-CT, COVIDNet-S, and CancerNet-SCa models are available at the respective repos
 

diff --git a/create_COVIDxSev.ipynb b/create_COVIDxSev.ipynb
@@ -0,0 +1,253 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 72,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import os\n",
+    "import random \n",
+    "from shutil import copyfile\n",
+    "import pydicom as dicom\n",
+    "import cv2\n",
+    "import mdai\n",
+    "import json\n",
+    "from collections import Counter\n",
+    "from pathlib import Path"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set data directory here\n",
+    "savepath = 'data_sev'\n",
+    "Path(os.path.join(savepath, 'test')).mkdir(parents=True, exist_ok=True)\n",
+    "Path(os.path.join(savepath, 'train')).mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "seed = 0\n",
+    "np.random.seed(seed) # Reset the seed so all runs are the same.\n",
+    "random.seed(seed)\n",
+    "MAXVAL = 255  # Range [0 255]\n",
+    "\n",
+    "# COVIDxSev also requires the RICORD annotations file to be downloaded; set its path here\n",
+    "ricord_annotations = 'create_ricord_dataset/1c_mdai_rsna_project_MwBeK3Nr_annotations_labelgroup_all_2021-01-08-164102.json'\n",
+    "\n",
+    "# path to ricord covid-19 images created by create_ricord_dataset/create_ricord_dataset.ipynb\n",
+    "# run create_ricord_dataset.ipynb before this notebook\n",
+    "ricord_imgpath = 'create_ricord_dataset/ricord_images'\n",
+    "ricord_txt = 'create_ricord_dataset/ricord_data_set.txt'\n",
+    "ricord_studyids = 'create_ricord_dataset/ricord_patientid_to_studyid_mapping.json'\n",
+    "\n",
+    "\n",
+    "\n",
+    "# parameters for COVIDxSev dataset\n",
+    "train = []\n",
+    "test = []\n",
+    "test_count = {'level1': 0,'level2': 0, 'NA': 0}\n",
+    "train_count = {'level1': 0,'level2': 0, 'NA': 0}\n",
+    "\n",
+    "\n",
+    "\n",
+    "# to avoid duplicates\n",
+    "patient_imgpath = {}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 74,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mapping = {}\n",
+    "mapping['Mild Opacities  (1-2 lung zones)'] = 'level1'\n",
+    "mapping['Moderate Opacities (3-4 lung zones)'] = 'level2'\n",
+    "mapping['Severe Opacities (>4 lung zones)'] = 'level2'\n",
+    "mapping['Invalid Study'] = 'NA'\n",
+    "\n",
+    "classification=[\"Typical Appearance\",\"Indeterminate Appearance\",\"Atypical Appearance\",\"Negative for Pneumonia\"]\n",
+    "airspace_Disease_Grading=[\"Mild Opacities  (1-2 lung zones)\",\"Moderate Opacities (3-4 lung zones)\",\"Severe Opacities (>4 lung zones)\",\"Invalid Study\"]\n",
+    "\n",
+    "        \n",
+    "        \n",
+    "def get_label_study(annotations_df, studyid):\n",
+    "    airspace_grading_labels = []\n",
+    "    labels = annotations_df[\"annotations\"].loc[annotations_df[\"annotations\"][\"StudyInstanceUID\"]==studyid][\"labelName\"]\n",
+    "#     print(labels)\n",
+    "    for label in list(labels):\n",
+    "        if label in mapping.keys():\n",
+    "            airspace_grading_labels.append(mapping[label])\n",
+    "    \n",
+    "    severity = Counter(airspace_grading_labels).most_common()[0][0] if airspace_grading_labels else 'NA'\n",
+    "    return severity\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 75,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Data distribution from covid datasets:\n",
+      "{'level1': 226, 'level2': 683, 'NA': 187}\n"
+     ]
+    }
+   ],
+   "source": [
+    "filename_label = {'level1': [],'level2': [], 'NA': []}\n",
+    "count = {'level1': 0,'level2': 0, 'NA':0}\n",
+    "covid_ds = {'ricord': []}\n",
+    "        \n",
+    "# get ricord file names \n",
+    "with open(ricord_txt) as f:\n",
+    "    ricord_file_names = [line.split()[0] for line in f]\n",
+    "    \n",
+    "# get study ids for every patientid\n",
+    "with open(ricord_studyids, 'r') as f:\n",
+    "    studyids = json.load(f)\n",
+    "    \n",
+    "# load ricord annotations\n",
+    "annotations = mdai.common_utils.json_to_dataframe(ricord_annotations)\n",
+    "\n",
+    "for imagename in ricord_file_names:\n",
+    "    patientid = imagename.split('-')[3] + '-' + imagename.split('-')[4]\n",
+    "    study_uuid = imagename.split('-')[-2]\n",
+    "    \n",
+    "    # get complete study id from ricord_studyids json file to match to labels stored in ricord annotations\n",
+    "    for studyid in studyids[patientid]:\n",
+    "        if studyid[-5:] == study_uuid:\n",
+    "            severity_level = get_label_study(annotations, studyid)\n",
+    "            break\n",
+    "    count[severity_level] += 1\n",
+    "    entry = [patientid, imagename, severity_level, 'ricord']\n",
+    "    filename_label[severity_level].append(entry)\n",
+    "    \n",
+    "    covid_ds['ricord'].append(patientid)\n",
+    "    \n",
+    "print('Data distribution from covid datasets:')\n",
+    "print(count)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "test count:  {'level1': 52, 'level2': 98, 'NA': 0}\n",
+      "train count:  {'level1': 174, 'level2': 585, 'NA': 0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Write images into train and test directories accordingly\n",
+    "\n",
+    "# get test patients from label file\n",
+    "with open('labels/test_COVIDxSev.txt', 'r') as f:\n",
+    "    test_patients = [line.split()[0] for line in f]\n",
+    "\n",
+    "for label in filename_label.keys():\n",
+    "    # Skip all studies that do not have an airspace grading\n",
+    "    if label != 'NA':\n",
+    "        for image in filename_label[label]:\n",
+    "            patientid = image[0]\n",
+    "            if patientid in test_patients:\n",
+    "                copyfile(os.path.join(ricord_imgpath, image[1]), os.path.join(savepath, 'test', image[1]))\n",
+    "                test.append(image)\n",
+    "                test_count[image[2]] += 1\n",
+    "            else:\n",
+    "                copyfile(os.path.join(ricord_imgpath, image[1]), os.path.join(savepath, 'train', image[1]))\n",
+    "                train.append(image)\n",
+    "                train_count[image[2]] += 1\n",
+    "\n",
+    "print('test count: ', test_count)\n",
+    "print('train count: ', train_count)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Final stats\n",
+      "Train count:  {'level1': 174, 'level2': 585, 'NA': 0}\n",
+      "Test count:  {'level1': 52, 'level2': 98, 'NA': 0}\n",
+      "Total length of train:  759\n",
+      "Total length of test:  150\n"
+     ]
+    }
+   ],
+   "source": [
+    "# final stats\n",
+    "print('Final stats')\n",
+    "print('Train count: ', train_count)\n",
+    "print('Test count: ', test_count)\n",
+    "print('Total length of train: ', len(train))\n",
+    "print('Total length of test: ', len(test))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# export to train and test files\n",
+    "# format as patientid, filename, label, source, separated by spaces\n",
+    "# where label is either \"level1\" for mild air space grading or \"level2\" for moderate and severe grading\n",
+    "with open(\"train_split.txt\",'w') as train_file:\n",
+    "    for sample in train:\n",
+    "        info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + ' ' + sample[3] + '\\n'\n",
+    "        train_file.write(info)\n",
+    "\n",
+    "with open(\"test_split.txt\", 'w') as test_file:\n",
+    "    for sample in test:\n",
+    "        info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + ' ' + sample[3] + '\\n'\n",
+    "        test_file.write(info)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
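Review note on `get_label_study`: the notebook collapses the per-reader RICORD labels for a study into one severity level by majority vote. The sketch below reproduces that idea on plain Python lists so it can be checked without `mdai` or the annotations file. The function name `majority_severity` and the example reader labels are hypothetical. Only the label strings are taken from the notebook's `mapping` dictionary.

```python
from collections import Counter

# Label names copied from the notebook's `mapping` dictionary
# (the double space in the "Mild" label is in the original).
MAPPING = {
    "Mild Opacities  (1-2 lung zones)": "level1",
    "Moderate Opacities (3-4 lung zones)": "level2",
    "Severe Opacities (>4 lung zones)": "level2",
    "Invalid Study": "NA",
}


def majority_severity(label_names):
    """Return the most common mapped level, or 'NA' if no grading label is present."""
    levels = [MAPPING[name] for name in label_names if name in MAPPING]
    if not levels:
        return "NA"
    return Counter(levels).most_common(1)[0][0]


# Hypothetical labels for one study: one appearance label plus three gradings.
example = [
    "Typical Appearance",
    "Mild Opacities  (1-2 lung zones)",
    "Moderate Opacities (3-4 lung zones)",
    "Severe Opacities (>4 lung zones)",
]
print(majority_severity(example))  # level2
```

Because moderate and severe gradings both map to `level2`, two readers who disagree on moderate versus severe still outvote a single mild grading. Appearance labels such as "Typical Appearance" are ignored because they are not keys of the mapping.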
diff --git a/create_ricord_dataset/ricord_patientid_to_studyid_mapping.json b/create_ricord_dataset/ricord_patientid_to_studyid_mapping.json
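Review note on how this mapping file is consumed: `create_COVIDxSev.ipynb` derives a patient id and a five-character study suffix from each image file name, then scans that patient's study ids for one ending in the suffix. A minimal sketch of that lookup follows. The file name and study ids are hypothetical stand-ins that only follow the split positions used in the notebook, and the `None` fallback is an addition for illustration (the notebook itself has no fallback when nothing matches).

```python
def parse_image_name(imagename):
    """Split an image file name into (patientid, study_suffix) as the notebook does."""
    parts = imagename.split("-")
    patientid = parts[3] + "-" + parts[4]
    study_suffix = parts[-2]
    return patientid, study_suffix


def find_study_id(studyids, patientid, study_suffix):
    """Return the first study id whose last five characters equal the suffix, else None."""
    for studyid in studyids.get(patientid, []):
        if studyid[-5:] == study_suffix:
            return studyid
    return None


# Hypothetical data shaped like the JSON mapping: patient id -> list of study ids.
studyids = {
    "419639-000025": ["1.2.3.4.111111146052", "1.2.3.4.222222290731"],
}
name = "MIDRC-RICORD-1C-419639-000025-46052-0.png"

patientid, suffix = parse_image_name(name)
print(patientid, suffix, find_study_id(studyids, patientid, suffix))
```

Returning `None` makes an unmatched image explicit, so a caller can decide whether to skip it or label it `'NA'`.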
diff --git a/data.py b/data.py
@@ -7,10 +7,6 @@
 
 from tensorflow.keras.preprocessing.image import ImageDataGenerator
 
-# To remove TF Warnings
-tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
-os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
-
 def crop_top(img, percent=0.15):
     offset = int(img.shape[0] * percent)
     return img[offset:]
@@ -131,7 +127,6 @@ def __init__(
                 datasets[l.split()[2]].append(l)
 
         if self.is_severity_model:
-            # For COVIDNet CXR-S upsample the severity level 1 cases to create balanced 50/50 batches
             self.datasets = [
                 datasets['level2'], datasets['level1']
             ]
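Review note: the comment removed in this hunk said that level 1 cases are upsampled so that batches are balanced 50/50. The generator code that does this is not part of the diff, so the following is only a sketch of the general technique, not the repository's implementation. `balanced_batch` is a hypothetical helper, and the label lines are synthetic, sized to match the train counts printed by `create_COVIDxSev.ipynb` (174 level 1, 585 level 2).

```python
import random


def balanced_batch(level2_lines, level1_lines, batch_size, rng):
    """Draw a batch with about half of its lines from each severity level."""
    n_level1 = batch_size // 2
    n_level2 = batch_size - n_level1
    # choices() samples with replacement, so the smaller class is upsampled.
    batch = rng.choices(level2_lines, k=n_level2) + rng.choices(level1_lines, k=n_level1)
    rng.shuffle(batch)
    return batch


# Synthetic label lines in the "patientid filename label source" layout,
# where the third field is the class key (as in `l.split()[2]` above).
level1_lines = [f"p{i} img_a{i}.png level1 ricord" for i in range(174)]
level2_lines = [f"q{i} img_b{i}.png level2 ricord" for i in range(585)]

batch = balanced_batch(level2_lines, level1_lines, batch_size=8, rng=random.Random(0))
n_level1 = sum(line.split()[2] == "level1" for line in batch)
print(n_level1, "of", len(batch), "lines are level1")
```

Sampling each class separately fixes the class ratio per batch regardless of how imbalanced the underlying lists are, at the cost of showing level 1 images more often per epoch.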

diff --git a/docs/COVIDx.md b/docs/COVIDx.md
@@ -1,4 +1,5 @@
 # COVIDx Dataset
+**Update 04/16/2021: Released COVIDxSev, a new airspace severity grading dataset of COVID-19 positive patients, for use with the COVIDNet CXR-S model.**\
 **Update 03/19/2021: Released new datasets, each with over 16,000 CXR images from a multinational cohort of over 15,100 patients from at least 51 countries. Each dataset contains over 2,300 positive COVID-19 images from over 1,500 patients. The COVIDx V8A dataset is for detection of no pneumonia/non-COVID-19 pneumonia/COVID-19 pneumonia, and COVIDx V8B dataset is for COVID-19 positive/negative detection.**\
 **Update 01/28/2021: Released new datasets with over 15600 CXR images and over 1700 positive COVID-19 images. The COVIDx V7A dataset is for detection of no pneumonia/non-COVID-19 pneumonia/COVID-19 pneumonia, and COVIDx V7B dataset is for COVID-19 positive/negative detection.**\
 **Update 01/05/2021: Released new dataset for binary classification (COVID-19 positive or COVID-19 negative). Train dataset contains 517 positive and 13794 negative samples. Test dataset contains 100 positive and 100 negative samples.**\
@@ -24,7 +25,7 @@ The current COVIDx dataset is constructed by the following open source chest rad
  * `git clone https://github.com/agchung/Actualmed-COVID-chestxray-dataset.git`
  * go to this [link](https://www.kaggle.com/tawsifurrahman/covid19-radiography-database/version/3) to download the COVID-19 Radiography database. Only the COVID-19 image folder and metadata file are required. The overlaps with covid-chestxray-dataset are handled in the dataset curation scripts. **Note:** for COVIDx versions 8 & 7 please use Version 3 of the dataset, and for versions COVIDx6 and below please use Version 1.
  * go to this [link](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data) to download the RSNA pneumonia dataset
- * go to this [link] (https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70230281) to download the RICORD COVID-19 dataset and clinical data csv
+ * go to this [link](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70230281) to download the RICORD COVID-19 dataset, clinical data csv, and annotations
 2. Create a `data` directory and within the data directory, create a `train` and `test` directory
 3. Use [create\_ricord\_dataset\\create\_ricord\_dataset.ipynb](../create_ricord_dataset/create_ricord_dataset.ipynb) to pre-process the RICORD dataset before handling.
 3. Use [create\_COVIDx\_binary.ipynb](../create_COVIDx_binary.ipynb) to combine the three datasets to create COVIDx for binary classification. Make sure to remember to change the file paths. Use [create\_COVIDx.ipynb](../create_COVIDx.ipynb) for datasets compatible with COVIDx5 and earlier models (not binary classification).

diff --git a/docs/covidnet_severity.md b/docs/covidnet_severity.md
@@ -1,3 +1,73 @@
+# COVIDNet CXR-S Air Space Severity Grading
+The COVIDNet CXR-S model takes as input chest x-ray images of shape (N, 480, 480, 3), where N is the batch size,
+and outputs the airspace severity of a SARS-CoV-2 positive patient. The airspace severity is grouped into two levels: 1) Level 1: opacities in 1-2 lung zones, and 2) Level 2: opacities in 3 or more lung zones.
+
+If using the TF checkpoints, here are some useful tensors:
+
+* input tensor: `input_1:0`
+* logit tensor: `norm_dense_2/MatMul:0`
+* output tensor: `norm_dense_2/Softmax:0`
+* label tensor: `norm_dense_1_target:0`
+* class weights tensor: `norm_dense_1_sample_weights:0`
+* loss tensor: `Mean:0`
+
+### Steps for training
+Training requires the COVIDxSev dataset. To create it, run [create_COVIDxSev.ipynb](../create_COVIDxSev.ipynb).
+To train with TensorFlow from a pretrained model:
+1. We provide you with the TensorFlow training script, [train_tf.py](../train_tf.py)
+2. Locate the tensorflow checkpoint files (location of pretrained model)
+3. To train from the COVIDNet-CXR-S pretrained model:
+```
+python3 train_tf.py \
+    --weightspath models/COVIDNet-CXR-S \
+    --metaname model.meta \
+    --ckptname model \
+    --n_classes 2 \
+    --datadir data_sev \
+    --trainfile labels/train_COVIDxSev.txt \
+    --testfile labels/test_COVIDxSev.txt \
+    --out_tensorname norm_dense_2/Softmax:0 \
+    --logit_tensorname norm_dense_2/MatMul:0 \
+    --is_severity_model
+```
+4. For more options and information, `python train_tf.py --help`
+
+### Steps for evaluation
+Evaluation requires the COVIDxSev dataset. To create it, run [create_COVIDxSev.ipynb](../create_COVIDxSev.ipynb).
+1. We provide you with the tensorflow evaluation script, [eval.py](../eval.py)
+2. Locate the tensorflow checkpoint files
+3. To evaluate a tf checkpoint:
+```
+python eval.py \
+    --weightspath models/COVIDNet-CXR-S \
+    --metaname model.meta \
+    --ckptname model \
+    --n_classes 2 \
+    --testfolder data_sev/test \
+    --testfile labels/test_COVIDxSev.txt \
+    --out_tensorname norm_dense_2/Softmax:0 \
+    --is_severity_model
+```
+4. For more options and information, `python eval.py --help`
+
+### Steps for inference
+**DISCLAIMER: Do not use this prediction for self-diagnosis. You should check with your local authorities for the latest advice on seeking medical assistance.**
+
+1. Download a model from the [pretrained models section](models.md)
+2. Locate the model and the x-ray image to run inference on
+3. To run inference:
+```
+python inference.py \
+    --weightspath models/COVIDNet-CXR-S \
+    --metaname model.meta \
+    --ckptname model \
+    --n_classes 2 \
+    --imagepath assets/ex-covid.jpeg \
+    --out_tensorname norm_dense_2/Softmax:0 \
+    --is_severity_model
+```
+4. For more options and information, `python inference.py --help`
+
 # COVIDNet Lung Severity Scoring
 COVIDNet-S-GEO and COVIDNet-S-OPC models take as input chest x-ray images of shape (N, 480, 480, 3), where N is the batch size, and output the SARS-CoV-2 severity scores for geographic extent and opacity extent, respectively. COVIDNet-S-GEO predicts the geographic severity. Geographic severity is based on the geographic extent score for right and left lung. For each lung: 0 = no involvement; 1 = <25%; 2 = 25-50%; 3 = 50-75%; 4 = >75% involvement, resulting in scores from 0 to 8. COVIDNet-S-OPC predicts the opacity severity. Opacity severity is based on the opacity extent score for right and left lung. For each lung, the score is from 0 to 4, with 0 = no opacity and 4 = white-out, resulting in scores from 0 to 8. For a detailed description of the COVIDNet lung severity scoring methodology, see the paper [here](https://arxiv.org/abs/2005.12855).
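
Review note: the geographic extent scale described above can be illustrated with a small worked example. This sketch only shows how per-lung involvement maps onto the 0 to 8 scale. It is not how the COVIDNet-S models compute their predictions, and both function names are hypothetical. The text does not say which score the exact boundaries (25%, 50%, 75%) belong to, so assigning them to the lower band here is an assumption.

```python
def extent_score(involvement_pct):
    """Map one lung's involvement percentage to a 0-4 geographic extent score."""
    if not 0 <= involvement_pct <= 100:
        raise ValueError("involvement_pct must be between 0 and 100")
    if involvement_pct == 0:
        return 0  # no involvement
    if involvement_pct < 25:
        return 1  # <25%
    if involvement_pct <= 50:
        return 2  # 25-50% (boundary handling is an assumption)
    if involvement_pct <= 75:
        return 3  # 50-75% (boundary handling is an assumption)
    return 4      # >75%


def geographic_severity(right_pct, left_pct):
    """Sum the right and left lung scores, giving a total from 0 to 8."""
    return extent_score(right_pct) + extent_score(left_pct)


# Example: 10% involvement in the right lung and 90% in the left lung.
print(geographic_severity(10, 90))  # 1 + 4 = 5
```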