This repository contains the code for our submission to Kaggle's Quora Question Pairs competition, in which we ranked in the top 25%. A detailed report for the project can be found here.
train.csv contains ~400k question pairs along with the corresponding label (duplicate or not), and test.csv contains ~2300k question pairs. Both files can be found here.
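As a rough illustration of the data layout, the sketch below reads question pairs and their labels with Python's standard csv module. The column names (question1, question2, is_duplicate) are assumed from the competition's published file layout and are not stated in this README; the two sample rows are invented for the example, and read_pairs is a hypothetical helper, not a function from this repository.

```python
import csv
import io


def read_pairs(handle):
    """Yield (question1, question2, label) tuples from an open CSV file.

    The label is None when the is_duplicate column is absent or empty,
    as would be the case for an unlabeled test file.
    """
    for row in csv.DictReader(handle):
        raw_label = row.get("is_duplicate")
        label = int(raw_label) if raw_label not in (None, "") else None
        yield row["question1"], row["question2"], label


# Invented sample rows standing in for the first lines of train.csv.
sample = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    '0,1,2,"How do I learn Python?","What is the best way to learn Python?",1\n'
    '1,3,4,"What is a GRU?","How tall is Mount Everest?",0\n'
)
pairs = list(read_pairs(sample))
```

To read the real file, pass an open handle instead, for example open("input/train.csv", newline="", encoding="utf-8").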
We use a Siamese Neural Network architecture with Gated Recurrent Units in combination with traditional Machine Learning algorithms such as Random Forest, SVM and AdaBoost.
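The Siamese idea can be sketched in a few lines of NumPy: one GRU encoder, with a single set of weights, encodes both questions, and the two resulting vectors are compared. This is a minimal illustration and not the model in this repository: TinyGRUEncoder and pair_features are hypothetical names, the weights are random and untrained, the dimensions are arbitrary, and random vectors stand in for GloVe embeddings. Feeding the pair features to a classical classifier is one plausible way to combine the two approaches; the README does not say how the combination is actually done.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyGRUEncoder:
    """Illustrative GRU encoder (forward pass only, random weights)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(scale=0.1, size=shape)

        self.hidden_dim = hidden_dim
        # Update gate (z), reset gate (r) and candidate state (h) parameters.
        self.Wz, self.Uz, self.bz = init(input_dim, hidden_dim), init(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wr, self.Ur, self.br = init(input_dim, hidden_dim), init(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wh, self.Uh, self.bh = init(input_dim, hidden_dim), init(hidden_dim, hidden_dim), np.zeros(hidden_dim)

    def encode(self, embeddings):
        """Run the GRU over a (num_words, input_dim) array; return the last state."""
        h = np.zeros(self.hidden_dim)
        for x in embeddings:
            z = sigmoid(x @ self.Wz + h @ self.Uz + self.bz)
            r = sigmoid(x @ self.Wr + h @ self.Ur + self.br)
            h_candidate = np.tanh(x @ self.Wh + (r * h) @ self.Uh + self.bh)
            h = (1.0 - z) * h + z * h_candidate
        return h


def pair_features(encoder, emb_a, emb_b):
    """Encode both questions with the SAME encoder and compare the results."""
    h_a = encoder.encode(emb_a)
    h_b = encoder.encode(emb_b)
    return np.concatenate([np.abs(h_a - h_b), h_a * h_b])


# Random stand-ins for the word embeddings of two questions (6 and 9 words).
rng = np.random.default_rng(42)
question_a = rng.normal(size=(6, 50))
question_b = rng.normal(size=(9, 50))

encoder = TinyGRUEncoder(input_dim=50, hidden_dim=16)
features = pair_features(encoder, question_a, question_b)
same_features = pair_features(encoder, question_a, question_a)
```

Because both questions pass through the same weights, identical inputs produce identical encodings, so the absolute-difference half of same_features is all zeros. That weight sharing is what makes the network Siamese.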
First, place train.csv and test.csv (see the Data section above) and the pre-trained GloVe embeddings in the input folder. You can download the embeddings from here. Then run the bash script:
bash run_model.sh
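Before launching the script, it can help to confirm that the input folder is populated. The snippet below is a small pre-flight check, not part of this repository: missing_inputs is a hypothetical helper, and it checks only train.csv and test.csv because the README does not name the GloVe embeddings file.

```python
import tempfile
from pathlib import Path


def missing_inputs(input_dir, required=("train.csv", "test.csv")):
    """Return the names of required files that are absent from input_dir."""
    folder = Path(input_dir)
    return [name for name in required if not (folder / name).is_file()]


# Demonstration on a throwaway folder that contains only train.csv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("id\n")
    demo_missing = missing_inputs(tmp)
```

From the repository root, missing_inputs("input") should return an empty list once both CSV files are in place.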
- numpy
- pandas
- nltk
- sklearn
- TensorFlow
Install them using pip. Note that sklearn is published on PyPI as scikit-learn, and TensorFlow as tensorflow.