This repository contains the code and resources for text classification using Transformer-based large language models. The goal is to classify medical papers into six categories: bone, pitu/adrenal, diabetes, thyroid, others, and x.
The task has two stages: first, separate endocrinology papers from papers belonging to other medical departments; second, assign the papers classified as endocrinology to subgroups of endocrinological topics. The model decides the category of a paper based on its title and abstract. We also explore how model performance changes when the abstract is provided in addition to the title, compared with using the title alone.
Both train.py and inference.py take datasets in .csv format with the following columns:
ID | category | TI | abst |
---|---|---|---|
25729272 | bone | Long Term Effect of High Glucose and Phosphate Levels on the OPG/RANK/RANKL/TRAIL System in the Progression of Vascular Calcification in rat Aortic Smooth Muscle Cells. | ... |
25750573 | pitu/adrenal | A giant carotid aneurysm with intrasellar extension: a rare cause of panhypopituitarism. | ... |
25630229 | thyroid | Salivary gland dysfunction after radioactive iodine (I-131) therapy in patients following total thyroidectomy: emphasis on radioactive iodine therapy dose. | ... |
25561730 | diabetes | Ligand Binding Pocket Formed by Evolutionarily Conserved Residues in the Glucagon-like Peptide-1 (GLP-1) Receptor Core Domain. | ... |
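For reference, a CSV in this format can be inspected and combined the same way the `title_abst` setting presumably does. The snippet below is a minimal pandas sketch, not the repository's own preprocessing code; the file path is a placeholder, and the column names `category`, `TI`, and `abst` are taken from the table above.

```python
import pandas as pd

# Minimal sketch: load a CSV in the format above (placeholder path).
df = pd.read_csv("data/train.csv")

# Build the model input text: title + abstract (title_abst) or title only (title).
df["text_title_abst"] = df["TI"].fillna("") + " " + df["abst"].fillna("")
df["text_title"] = df["TI"].fillna("")

# Class distribution across the six categories.
print(df["category"].value_counts())
```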
To set up the environment, create it from the provided environment file:

conda env create --file environment.yaml
To fine-tune each of the transformer models on your dataset, execute the following bash file:
bash train.sh
Please note that before executing the bash file, you need to set the file paths defined in it. The script passes the following arguments to train.py; an example is sketched after the list.
- `--model`: model name, e.g. `bert-base-uncased`, `roberta-base`, etc.
- `--type`: whether to use both title and abstract (`title_abst`) for classification or the title only (`title`).
- `--source`: path to the dataset directory.
- `--res`: path to the results directory.
- `--log`: path to the log directory.
- `--checkpoint`: path to the directory where the best-performing model is saved.
- `--num_labels`: number of classes.
- `--max_sequence_len`: maximum token sequence length.
- `--epoch`: number of epochs.
- `--train_batch_size`: training batch size.
- `--valid_batch_size`: evaluation batch size.
- `--lr`: learning rate.
- `--n_warmup_steps`: number of warmup steps.
- `--local_rank`: local rank for distributed training.
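For illustration, a train.sh along the following lines would call train.py with the arguments above. The paths and hyperparameter values are placeholders, not the settings used in the experiments:

```bash
#!/bin/bash
# Hypothetical example: adjust paths and values to your setup.
python train.py \
  --model bert-base-uncased \
  --type title_abst \
  --source ./data \
  --res ./results \
  --log ./logs \
  --checkpoint ./checkpoints \
  --num_labels 6 \
  --max_sequence_len 512 \
  --epoch 5 \
  --train_batch_size 16 \
  --valid_batch_size 32 \
  --lr 2e-5 \
  --n_warmup_steps 0 \
  --local_rank -1
```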
To perform inference with the fine-tuned models, execute the following bash file. As with training, set the file paths in it first; its arguments are listed below, and an example is sketched after the list.
bash inference.sh
- `--model`: model name, e.g. `bert-base-uncased`, `roberta-base`, etc.
- `--type`: whether to use both title and abstract (`title_abst`) for classification or the title only (`title`).
- `--test`: path to the test dataset.
- `--res`: path to the results directory.
- `--log`: path to the log directory.
- `--checkpoint`: path to the directory of the best-performing (fine-tuned) model.
- `--num_labels`: number of classes.
- `--max_sequence_len`: maximum token sequence length.
- `--test_batch_size`: test batch size.
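Similarly, a minimal inference.sh sketch with placeholder paths (not the repository's actual script) could look like this:

```bash
#!/bin/bash
# Hypothetical example: adjust paths to your setup.
python inference.py \
  --model bert-base-uncased \
  --type title_abst \
  --test ./data/test.csv \
  --res ./results \
  --log ./logs \
  --checkpoint ./checkpoints \
  --num_labels 6 \
  --max_sequence_len 512 \
  --test_batch_size 32
```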
All models on the Hugging Face Hub are supported by this repository and can be used by setting the `--model` parameter of train.py to the corresponding model name. Below are some of the models used in the experiments.
Models = {
    "BERT base uncased": "bert-base-uncased",
    "BioBERT": "dmis-lab/biobert-v1.1",
    "SciBERT": "allenai/scibert_scivocab_uncased",
    "BlueBERT (PubMed)": "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12",
    "BlueBERT (PubMed + MIMIC-III)": "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12",
    "PubMedBERT (abstracts)": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "PubMedBERT (abstracts + full text)": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "ClinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "RoBERTa": "roberta-base",
    "Biomed RoBERTa": "allenai/biomed_roberta_base",
}
The table below reports the performance of the fine-tuned models.

Classifier | Type | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|
BERT (bert-base-uncased) | title + abstract | 0.85 | 0.84 | 0.85 | 0.85 |
BERT (bert-base-uncased) | title | 0.83 | 0.83 | 0.83 | 0.83 |
BioBERT (dmis-lab/biobert-v1.1) | title + abstract | 0.88 | 0.88 | 0.88 | 0.88 |
SciBERT (allenai/scibert_scivocab_uncased) | title + abstract | 0.87 | 0.87 | 0.87 | 0.87 |
BlueBERT (bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12) | title + abstract | 0.87 | 0.87 | 0.87 | 0.87 |
BlueBERT (bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12) | title + abstract | 0.86 | 0.85 | 0.86 | 0.85 |
PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) | title + abstract | 0.87 | 0.87 | 0.87 | 0.87 |
PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) | title + abstract | 0.87 | 0.86 | 0.87 | 0.86 |
ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) | title + abstract | 0.85 | 0.85 | 0.85 | 0.85 |
RoBERTa (roberta-base) | title + abstract | 0.84 | 0.85 | 0.84 | 0.84 |
Biomed RoBERTa (allenai/biomed_roberta_base) | title + abstract | 0.84 | 0.84 | 0.84 | 0.84 |
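The precision, recall, and F1 columns are multi-class aggregates. The snippet below sketches how such scores can be computed with scikit-learn; weighted averaging is an assumption here, and the repository may aggregate differently.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true / y_pred: gold and predicted category labels for the test set (toy example).
y_true = ["bone", "thyroid", "diabetes", "others"]
y_pred = ["bone", "thyroid", "others", "others"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy {accuracy:.2f} | Precision {precision:.2f} | Recall {recall:.2f} | F1 {f1:.2f}")
```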