This repository contains the code for our submission to Kaggle's Quora Question Pairs competition, in which we ranked in the top 25%. A detailed report on the project can be found here.
train.csv contains ~400k question pairs along with the corresponding label (duplicate or not), and test.csv contains ~2300k question pairs. Both files can be found here.
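To make the data layout concrete, here is a minimal sketch of reading labelled question pairs. It assumes the column names question1, question2 and is_duplicate from the competition's train.csv; the helper name read_pairs and the two sample rows are made up for illustration and are not part of this repository.

```python
import csv
import io


def read_pairs(lines):
    """Yield (question1, question2, is_duplicate) tuples from CSV lines."""
    # Assumed column names: question1, question2, is_duplicate.
    for row in csv.DictReader(lines):
        yield row["question1"], row["question2"], int(row["is_duplicate"])


# Tiny in-memory sample standing in for train.csv (made-up rows).
sample = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    '0,1,2,"How do I learn Python?","What is the best way to learn Python?",1\n'
    '1,3,4,"What is a GRU?","How tall is Mount Everest?",0\n'
)
pairs = list(read_pairs(sample))
print(len(pairs), "pairs,", sum(label for _, _, label in pairs), "duplicates")
```

For the real file, the same function can be given an open file handle, for example one opened on input/train.csv with newline="" and encoding="utf-8".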
We use a Siamese neural network architecture with Gated Recurrent Units (GRUs) in combination with traditional machine learning algorithms such as Random Forest, SVM and AdaBoost.
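The idea behind a Siamese GRU can be sketched in a few lines of NumPy: one encoder, with one set of weights, turns each question into a fixed-size vector, and the two vectors are then compared. This is only an illustration under assumptions, not the model in this repository: the class GRUEncoder, the function siamese_features, the omission of bias terms, and the choice of pair features (absolute difference and element-wise product) are all hypothetical. Feeding such features to a classifier like a Random Forest is one possible way to combine the two families of models.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUEncoder:
    """Minimal GRU (no biases) mapping a sequence of word vectors to one vector."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        w_shape = (input_dim, hidden_dim)
        u_shape = (hidden_dim, hidden_dim)
        # Input and recurrent weights for the update gate (z),
        # reset gate (r) and candidate state (h).
        self.Wz, self.Wr, self.Wh = (rng.normal(0, 0.1, w_shape) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (rng.normal(0, 0.1, u_shape) for _ in range(3))
        self.hidden_dim = hidden_dim

    def encode(self, sequence):
        h = np.zeros(self.hidden_dim)
        for x in sequence:
            z = sigmoid(x @ self.Wz + h @ self.Uz)              # update gate
            r = sigmoid(x @ self.Wr + h @ self.Ur)              # reset gate
            h_tilde = np.tanh(x @ self.Wh + (r * h) @ self.Uh)  # candidate state
            h = (1 - z) * h + z * h_tilde
        return h


def siamese_features(encoder, seq_a, seq_b):
    """Encode both questions with the SAME encoder and build pair features."""
    h_a = encoder.encode(seq_a)
    h_b = encoder.encode(seq_b)
    return np.concatenate([np.abs(h_a - h_b), h_a * h_b])


# Random vectors stand in for the GloVe embeddings of two questions.
rng = np.random.default_rng(1)
question_a = rng.normal(size=(5, 4))  # 5 words, 4-dimensional embeddings
question_b = rng.normal(size=(7, 4))  # 7 words
encoder = GRUEncoder(input_dim=4, hidden_dim=3)
features = siamese_features(encoder, question_a, question_b)
print(features.shape)
```

Because both questions pass through the same weights, swapping them leaves the features unchanged, which matches the symmetric nature of the duplicate-question task.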
First, place train.csv, test.csv (see the Data section above) and the pre-trained GloVe embeddings in the input folder. You can download the embeddings from here. Then, simply run the bash script:
bash run_model.sh
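For orientation, pre-trained GloVe embeddings are distributed as plain text, one word per line followed by its vector components separated by spaces. The sketch below shows one way to read that format into a dictionary. The function name parse_glove and the sample lines are hypothetical, and the scripts in this repository may load the embeddings differently.

```python
def parse_glove(lines):
    """Map each word to its embedding vector (a list of floats)."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


# Made-up 3-dimensional vectors; real GloVe files use 50 to 300 dimensions.
sample = ["the 0.1 -0.2 0.3", "question 0.0 0.5 -0.7"]
vectors = parse_glove(sample)
print(len(vectors), "words loaded")
```

The function accepts any iterable of lines, so an open file handle on the downloaded embeddings file works as well as the in-memory list used here.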
- numpy
- pandas
- nltk
- sklearn
- TensorFlow
Install them using pip. Note that sklearn is published on pip under the name scikit-learn.
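Before running the script, it can help to check which of the dependencies above are already installed. The following is a small sketch using only the standard library; the names REQUIRED and missing_modules are illustrative and not part of this repository.

```python
import importlib.util

# Import names of the dependencies listed above. scikit-learn is imported
# as "sklearn" and TensorFlow as "tensorflow".
REQUIRED = ["numpy", "pandas", "nltk", "sklearn", "tensorflow"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any module reported as missing can then be installed with pip, keeping in mind that the pip package for sklearn is scikit-learn.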
- If there is any issue running the code, please post it in the issue tracker.
- If you like this repo and find it useful, please consider ★ starring it (on top right of the page) :)