This file is best viewed in a Markdown reader (e.g., https://jbt.github.io/markdown-editor/).
This repository contains the code and data we used to test how NLP models perform when subjected to meaning-preserving yet lexically different sentences for the task of sentence entailment. Currently, the repository supports training and testing BiLSTM, CBOW, BERT, and RoBERTa models. It also supports running inference on GPT-3 through OpenAI's API and generating paraphrases for a given entailment dataset.
Follow these steps to set up your environment:

- Create a Conda environment with Python 3.6:

  ```
  conda create -n <env_name> python=3.6
  ```

- Activate the Conda environment. You will need to activate it in each terminal in which you want to use this code:

  ```
  conda activate <env_name>
  ```

- Install the requirements:

  ```
  pip install -r requirements.txt
  ```
The repository is organized as follows:

```
parte
├── analysis_notebooks/
├── data/
├── models/
│   ├── bilstm.py
│   ├── cbow.py
│   └── transformer.py
├── util/
│   ├── dataset_loader.py
│   ├── load_utils.py
│   ├── model_utils.py
│   ├── transformer_dataset_loader.py
│   └── vocab.py
├── paraphraser.py
├── README.md
├── transformer_test.py
├── transformer_train.py
├── test.py
└── train.py
```
The `paraphraser.py` file is responsible for generating the paraphrased data. To do this, it uses HuggingFace's T5 paraphraser. To ensure lexical diversity, we measure the Jaccard similarity between each paraphrased sentence and the original sentence.
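For reference, the Jaccard similarity of two sentences is the size of the intersection of their token sets divided by the size of their union. Below is a minimal sketch of the score and a filtering step; the whitespace tokenization and the keep-below-threshold direction are assumptions for illustration, not necessarily what `paraphraser.py` does:

```python
def jaccard_similarity(sentence_a: str, sentence_b: str) -> float:
    """Jaccard similarity between the word sets of two sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical filter: keep a paraphrase only if it is lexically
# different enough from the original (score at or below the threshold).
original = "A man is playing a guitar on stage."
paraphrase = "Someone on stage performs a song with a guitar."
if jaccard_similarity(original, paraphrase) <= 0.75:
    print("keep:", paraphrase)
```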
To run the paraphraser, the following options are available:

- `--data_path`: The path to the `.jsonl` file containing the data you wish to paraphrase. The file is expected to contain two columns, `sentence1` and `sentence2`, which are the sentences to be paraphrased.
- `--save_path`: The path at which to save the paraphrased data.
- `--jaccard_score`: The Jaccard similarity threshold to use. The default is 0.75.
Sample command:

```
python3 paraphraser.py --data_path RTE_test.jsonl --save_path RTE_test_paraphrased.jsonl
```
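For reference, the input is expected to be one JSON object per line with `sentence1` and `sentence2` fields, along these lines (the `label` field here is an illustrative extra; only the two sentence columns are mentioned as required):

```
{"sentence1": "A man is playing a guitar on stage.", "sentence2": "A man is performing music.", "label": "entailment"}
```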
To train the BiLSTM and CBOW models, use the `train.py` file. The `models/bilstm.py` and `models/cbow.py` files contain the implementations of the BiLSTM and CBOW models, respectively. The following options are available for training the models:
- `--model_type`: The model type you wish to train. There are two options available, `bilstm` and `cbow`. The default is `bilstm`.
- `--save_path`: Path to the directory in which to save the model. The path is created if it does not exist. The default is `./saved_model`.
- `--train_path`: Path to the file containing the training data. The default is `./data/multinli_1.0/multinli_1.0_train.jsonl`.
- `--val_path`: Path to the file containing the validation data. The default is `./data/multinli_1.0/multinli_1.0_dev_matched.jsonl`.
- `--batch_size`: The batch size for the model. The default is 32.
- `--emb_path`: Path to the GloVe embeddings. The default is `/data/glove.840B.300d.txt`.
- `--epochs`: Number of epochs to train the model.
- `--model_name`: The suffix given to your model; it is appended to the `model_type`. This parameter is required.
- `--hidden_size`: The number of hidden units in the LSTM.
- `--stacked_layers`: The number of stacked LSTM layers.
- `--seq_len`: The maximum sequence length allowed. The default is 50.
- `--vocab_size`: The vocabulary size to use for the model. The default is 50,000.
- `--num_classes`: The number of training classes. This repo involves training on the MNLI dataset with 3 classes (entailment, neutral, and contradiction) and validating/testing on the RTE dataset with 2 classes (entailment and non-entailment); a sketch of the 3-to-2 mapping follows this list. This parameter specifies the number of classes present in your training data. The default is 2.
- `--is_hypothesis_only`: Specifies whether the model should be trained on the hypothesis only.
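To compare a 3-class MNLI model against 2-class RTE labels, neutral and contradiction predictions are typically collapsed into non-entailment. A minimal sketch of that mapping (the label names and index order are assumptions; see the loaders in `util/` for the actual encoding):

```python
# Hypothetical label order for a model trained with --num_classes 3.
MNLI_LABELS = ["entailment", "neutral", "contradiction"]

def to_rte_label(predicted_index: int) -> str:
    """Collapse a 3-class MNLI prediction into a 2-class RTE label."""
    label = MNLI_LABELS[predicted_index]
    return "entailment" if label == "entailment" else "not_entailment"

print(to_rte_label(0))  # entailment
print(to_rte_label(2))  # not_entailment
```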
Sample training commands:

```
python3 train.py --save_path saved_model/cbow/ --val_path RTE_dev.jsonl --model_name 3_class --num_classes 3 --model_type cbow
python3 train.py --save_path saved_model/bilstm/ --val_path RTE_dev.jsonl --model_name 3_class --num_classes 3
```
The model is saved whenever it achieves the best validation accuracy seen up to the current epoch.
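In other words, a checkpoint is only written when validation accuracy improves. A minimal sketch of the pattern (the model, evaluation helper, and save path are illustrative placeholders, not the repo's actual training loop):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)      # placeholder model
save_path = "best_model.pt"  # hypothetical checkpoint path

def evaluate(model: nn.Module) -> float:
    """Placeholder: compute validation accuracy for the current model."""
    return 0.0

best_val_accuracy = 0.0
for epoch in range(5):
    # ... one epoch of training would run here ...
    val_accuracy = evaluate(model)
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        torch.save(model.state_dict(), save_path)  # checkpoint only on improvement
```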
To test the BiLSTM and CBOW models, use the `test.py` file. The following options are available for testing the models:
- `--model_type`: The model type you wish to test. There are two options available, `bilstm` and `cbow`. The default is `bilstm`.
- `--save_path`: Path to the directory where the trained model is saved. The default is `./saved_model`.
- `--test_path`: Path to the file containing the test data. The default is `./data/multinli_1.0/multinli_1.0_dev_mismatched.jsonl`.
- `--batch_size`: The batch size for the model. The default is 32.
- `--emb_path`: Path to the GloVe embeddings. The default is `/data/glove.840B.300d.txt`.
- `--model_name`: The suffix given to your model; it is appended to the `model_type` to locate the saved model. This parameter is required.
- `--predictions_save_path`: The file where the model predictions will be saved.
Sample testing commands:

```
python3 test.py --save_path saved_model/bilstm/ --test_path RTE_test.jsonl --model_name 3_class
python3 test.py --save_path saved_model/cbow/ --test_path RTE_test.jsonl --model_name 3_class --model_type cbow
```
To train transformer-based models, use the `transformer_train.py` file. The `models/transformer.py` file contains the code to load the model you want from HuggingFace. Currently, only `roberta`- and `bert`-family models are supported.
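Loading such a model typically looks like the sketch below, which uses the standard `transformers` Auto classes (an assumption about the approach; see `models/transformer.py` for the repo's actual loading code):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"  # or "bert-base-uncased", or a local checkpoint directory
num_classes = 3              # matches --num_classes

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes
)

# Premise/hypothesis pairs are encoded together as a sentence pair.
inputs = tokenizer(
    "A man is playing a guitar on stage.",  # sentence1 (premise)
    "A man is performing music.",           # sentence2 (hypothesis)
    return_tensors="pt",
    truncation=True,
)
logits = model(**inputs).logits             # one score per class
```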
The following options are available to fine-tune the models:
- `--save_path`: Path to the directory in which to save the model. The path is created if it does not exist. The default is `./saved_model`.
- `--train_path`: Path to the file containing the training data. The default is `./data/multinli_1.0/multinli_1.0_train.jsonl`.
- `--val_path`: Path to the file containing the validation data. The default is `./data/multinli_1.0/multinli_1.0_dev_matched.jsonl`.
- `--batch_size`: The batch size for the model. The default is 32.
- `--epochs`: Number of epochs to train the model.
- `--gradient_accumulation`: The number of batches over which to accumulate gradients. This was added to simulate large batches when the training GPU does not have enough memory; see the sketch after this list. The default is 0.
- `--model_name`: The name of, or path to the directory of, the model you wish to train. The default is `roberta-base`.
- `--num_classes`: The number of training classes. As with `train.py`, this repo involves training on the MNLI dataset with 3 classes (entailment, neutral, and contradiction) and validating/testing on the RTE dataset with 2 classes (entailment and non-entailment). This parameter specifies the number of classes present in your training data. The default is 2.
- `--is_hypothesis_only`: Specifies whether the model should be trained on the hypothesis only.
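Gradient accumulation runs several forward/backward passes before a single optimizer step, so the effective batch size becomes roughly `batch_size * gradient_accumulation`. A minimal sketch of the pattern (the toy model, data, and hyperparameters are illustrative placeholders, not the repo's training loop):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)  # toy stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 8    # corresponds to --gradient_accumulation 8

optimizer.zero_grad()
for step in range(32):
    inputs = torch.randn(4, 10)             # a small batch that fits in memory
    targets = torch.randint(0, 3, (4,))
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one update per 8 small batches
        optimizer.zero_grad()
```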
Sample training command:

```
python3 transformer_train.py --batch_size 3 --val_path RTE_dev.jsonl --epochs 5 --num_classes 3 --save_path saved_model/roberta_3_class --model_name roberta-large --gradient_accumulation 8
```
To test the trained model, use the `transformer_test.py` file. The following options are available for testing the models:
- `--test_path`: Path to the file containing the test data. The default is `./data/multinli_1.0/multinli_1.0_dev_mismatched.jsonl`.
- `--batch_size`: The batch size for the model. The default is 32.
- `--model_name`: The name of, or path to the directory of, the model you wish to test. The default is `roberta-large-mnli`.
- `--is_hypothesis_only`: Specifies whether the model should be tested on the hypothesis only.
- `--predictions_save_path`: The file where the model predictions will be saved.
Sample testing command:

```
python3 transformer_test.py --model_name saved_model/roberta-large
```