This repository contains the source code used for reproducing the experiments conducted by D2KLab for the European Statistics Awards Deduplication Challenge 2023.
The source code in this repository is based on a Jupyter Notebook, which is available on Google Colab at this link.
- Python >=3.9
- Install required packages using `pip`: `pip install -r requirements.txt`
- Copy the dataset file `wi_dataset.csv` into the same directory as this source code.
- Run `main.py`: `python main.py`. After processing, this should create a new file named `duplicates.csv`.
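
The setup steps above can be checked before launching the run. The following is a minimal sketch, not part of the repository: `check_setup` is a hypothetical helper, and the only things it relies on are the file names and the Python version mentioned in the steps above.

```python
import sys
from pathlib import Path

# File names taken from the setup steps; check_setup is a hypothetical helper.
REQUIRED_FILES = ("requirements.txt", "wi_dataset.csv", "main.py")
MIN_PYTHON = (3, 9)


def check_setup(directory="."):
    """Return a list of problems; an empty list means the setup looks ready."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python >= 3.9 is required, found {found}")
    base = Path(directory)
    for name in REQUIRED_FILES:
        # Each file must sit directly in the source directory.
        if not (base / name).is_file():
            problems.append(f"Missing file: {name}")
    return problems


if __name__ == "__main__":
    issues = check_setup()
    for issue in issues:
        print(issue)
    if not issues:
        print("Setup looks complete; run `python main.py` next.")
```

The check only verifies that the files exist. It does not confirm that the packages in `requirements.txt` are installed or that the dataset has the expected contents.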
During the submission phase, our latest experiment obtained the following scores:
| Full F1 | Semantic F1 | Temporal F1 | Partial F1 | Non-Duplicate F1 | Macro F1 |
|---|---|---|---|---|---|
| 0.99 | 0.81 | 0.68 | 0.00 | 1.00 | 0.70 |
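
The Macro F1 value is consistent with an unweighted mean of the five per-class scores. The sketch below assumes that definition (the challenge's official scoring may be defined differently) and uses a hypothetical helper named `macro_f1` with the numbers copied from the table.

```python
# Per-class F1 scores copied from the results table.
f1_scores = {
    "Full": 0.99,
    "Semantic": 0.81,
    "Temporal": 0.68,
    "Partial": 0.00,
    "Non-Duplicate": 1.00,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores (assumed definition)."""
    values = list(scores.values())
    return sum(values) / len(values)


print(f"Macro F1: {macro_f1(f1_scores):.2f}")  # prints "Macro F1: 0.70"
```

Under this assumption, each class contributes one fifth of the total, so the Partial F1 of 0.00 alone lowers the macro score by a full class share regardless of how many partial duplicates the dataset contains.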