
Medical paper classification using transformer-based pretrained language models

This repository contains the code and resources for text classification using transformer-based pretrained language models. The work aims to classify medical papers into six categories: bone, pitu/adrenal, diabetes, thyroid, others, and x.
The task has two stages: first, separate endocrinology papers from papers in other medical departments; second, classify the papers labeled as endocrinology into subgroups of endocrinological topics. The model decides the category of a paper based on the information in its title and abstract. We explore how model performance changes when the models are given abstracts as well as titles, compared to the case where only titles are given.

Data Format

Both train.py and inference.py take datasets in .csv format with the following columns:

ID        category      TI  abst
25729272  bone          Long Term Effect of High Glucose and Phosphate Levels on the OPG/RANK/RANKL/TRAIL System in the Progression of Vascular Calcification in rat Aortic Smooth Muscle Cells.  ...
25750573  pitu/adrenal  A giant carotid aneurysm with intrasellar extension: a rare cause of panhypopituitarism.  ...
25630229  thyroid       Salivary gland dysfunction after radioactive iodine (I-131) therapy in patients following total thyroidectomy: emphasis on radioactive iodine therapy dose.  ...
25561730  diabetes      Ligand Binding Pocket Formed by Evolutionarily Conserved Residues in the Glucagon-like Peptide-1 (GLP-1) Receptor Core Domain.  ...
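
For illustration, a minimal dataset file in this layout could be created as below. The data/ path, the comma separators, and the field quoting are assumptions based on the .csv extension and the columns above, not confirmed details of the repository; the abstracts are elided as in the sample rows.

mkdir -p data
cat > data/train.csv <<'EOF'
ID,category,TI,abst
25729272,bone,"Long Term Effect of High Glucose and Phosphate Levels on the OPG/RANK/RANKL/TRAIL System in the Progression of Vascular Calcification in rat Aortic Smooth Muscle Cells.","..."
25750573,pitu/adrenal,"A giant carotid aneurysm with intrasellar extension: a rare cause of panhypopituitarism.","..."
EOF

Quoting the TI and abst fields keeps commas inside titles and abstracts from being read as column separators.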

Environment

conda env create --file environment.yaml
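
Then activate the environment before training or inference. The environment's actual name is defined inside environment.yaml; `endo` below is only a placeholder:

conda activate endo   # placeholder; use the name field from environment.yaml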

Train

To fine-tune each of the transformer models on your dataset, execute the following bash file:

bash train.sh

Please note that before executing the bash file, you need to set the file paths in it (a sketch follows the option list below).

Options

--model                     bert-base-uncased, roberta-base, etc.
--type                      whether to use both title and abstract (`title_abst`) for classification, or the title only (`title`).
--source                    dataset dir.
--res                       path to result dir.
--log                       path to log dir.
--checkpoint                path to best-performed model dir.
--num_labels                number of classes.
--max_sequence_len          max length of sequence tokens.
--epoch                     number of epochs.
--train_batch_size          train batch size.
--valid_batch_size          evaluation batch size.
--lr                        learning rate.
--n_warmup_steps            warmup steps.
--local_rank                local rank.
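
Putting the options above together, train.sh might wrap a call like the one below. This is only a sketch: every path and hyperparameter value here is an illustrative placeholder rather than a value shipped with the repository (--num_labels 6 matches the six categories described above).

python train.py \
    --model bert-base-uncased \
    --type title_abst \
    --source ./data \
    --res ./results \
    --log ./logs \
    --checkpoint ./checkpoints/best \
    --num_labels 6 \
    --max_sequence_len 512 \
    --epoch 50 \
    --train_batch_size 16 \
    --valid_batch_size 32 \
    --lr 2e-5 \
    --n_warmup_steps 500 \
    --local_rank -1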

Inference

To perform inference with the fine-tuned models, execute the following bash file:

bash inference.sh

Options

--model                     bert-base-uncased, roberta-base, etc.
--type                      whether to use both title and abstract (`title_abst`) for classification, or the title only (`title`).
--test                      path to test dataset.
--res                       path to result dir.
--log                       path to log dir.
--checkpoint                path to best-performed model dir.
--num_labels                number of classes.
--max_sequence_len          max length of sequence tokens.
--test_batch_size           test batch size.
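
Analogously, inference.sh might contain a call like the following; again, all paths and values are placeholders, and the checkpoint directory should point to the model saved during fine-tuning.

python inference.py \
    --model dmis-lab/biobert-v1.1 \
    --type title_abst \
    --test ./data/test.csv \
    --res ./results \
    --log ./logs \
    --checkpoint ./checkpoints/best \
    --num_labels 6 \
    --max_sequence_len 512 \
    --test_batch_size 32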

Models

All models on the Hugging Face Hub are supported by this repository and can be used by setting the --model parameter of train.py to the corresponding model name. Below are some of the models used in the experiments.

Models = {
    "BERT base uncased": "bert-base-uncased",
    "BioBERT": "dmis-lab/biobert-v1.1",
    "SciBERT": "allenai/scibert_scivocab_uncased",
    "BlueBERT (PubMed)": "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12",
    "BlueBERT (PubMed + MIMIC)": "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12",
    "PubMedBERT (abstract)": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "PubMedBERT (abstract + full text)": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "ClinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "RoBERTa": "roberta-base",
    "Biomed RoBERTa": "allenai/biomed_roberta_base",
}

Results

Classification performance

Classifier                                                                  Type              Accuracy  Precision  Recall  F1
BERT (bert-base-uncased)                                                    title + abstract  0.85      0.84       0.85    0.85
BERT (bert-base-uncased)                                                    title             0.83      0.83       0.83    0.83
BioBERT (dmis-lab/biobert-v1.1)                                             title + abstract  0.88      0.88       0.88    0.88
SciBERT (allenai/scibert_scivocab_uncased)                                  title + abstract  0.87      0.87       0.87    0.87
BlueBERT (bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12)                   title + abstract  0.87      0.87       0.87    0.87
BlueBERT (bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12)             title + abstract  0.86      0.85       0.86    0.85
PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)           title + abstract  0.87      0.87       0.87    0.87
PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)  title + abstract  0.87      0.86       0.87    0.86
ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT)                              title + abstract  0.85      0.85       0.85    0.85
RoBERTa (roberta-base)                                                      title + abstract  0.84      0.85       0.84    0.84
Biomed RoBERTa (allenai/biomed_roberta_base)                                title + abstract  0.84      0.84       0.84    0.84

Visualization of embedding space at epoch 50
