An AI approach to detect Semantic Data Types using NLP and Deep Learning.
Install Python. Make sure you are using Python version 3.7 or above.
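If you are unsure which interpreter is active, a quick check like the one below can help. This is only a minimal sketch: the helper name python_is_supported is illustrative and is not part of this repository.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 7)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = sys.version.split()[0]
    if python_is_supported():
        print("Python version OK:", current)
    else:
        print("Python 3.7 or above is required, found", current)
```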
Once the repository is cloned, go to the project's folder and create a virtual environment using an environment manager such as virtualenv, conda, or pipenv.
Activate the environment and install the required Python packages using:
pip install -r requirements.txt
The main dataset is downloaded from the website. Once you are on the website, download the first folder, extract it, and place the extracted folder in the directory named resources/data
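Before running anything, it can be worth confirming that the extracted folder really ended up in resources/data. The snippet below is a minimal sketch of such a check. It assumes it is run from the project's root folder, and the function name dataset_is_in_place is illustrative, not part of this repository.

```python
from pathlib import Path


def dataset_is_in_place(data_dir="resources/data"):
    """Return True if data_dir exists and contains at least one entry."""
    path = Path(data_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if dataset_is_in_place():
        print("Found data in resources/data")
    else:
        print("resources/data is missing or empty; download and extract the dataset first")
```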
The code has two parts: the data conversion part and the complete process execution part.
Data Conversion:
After the packages are installed and the dataset is placed in the right directory, let's look at the data conversion part. It takes command line arguments that control how the code runs. The command line arguments are:
- default=False, description=Choose whether to use a sample or not
- --no_of_tables, default=False, description=Choose the number of tables needed for Mycroft
Using the above command line arguments, we can run the code using the following commands after going to the
To run the data extractor code and extract 50000 web tables
python -num 50000
Process Execution:
This part also takes command line arguments that control how the code runs. The command line arguments are:
- -i, default=sherlock, description=Choose the type of data (options: sherlock, mycroft)
- --extract, default=False, description=Choose whether to generate features or use the saved features
- --split, default=False, description=Choose whether to split the data or not
- --train_split, default=0.7, description=Choose the fraction of the data used for training (Example: 0.7 -> 70% train)
- --no_of_tables, default=20000, description=Choose the files with the number of tables required for processing (options: "40000, 50000, 100000")
- --sample, default=False, description=Choose whether to use a sample or not
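To make the arguments above more concrete, here is a minimal argparse sketch that accepts the same flags. It is not the repository's actual parser. The long name --input_type is a hypothetical stand-in (the commands in this README only show the short form -i), and the str2bool helper is an assumption added because values such as True are passed as text on the command line.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' style text; note that bool('False') is True in Python."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser(description="Process execution (sketch)")
    # "--input_type" is a hypothetical long name; only -i appears in the README.
    parser.add_argument("-i", "--input_type", default="sherlock",
                        choices=["sherlock", "mycroft"],
                        help="type of data")
    parser.add_argument("-e", "--extract", type=str2bool, default=False,
                        help="generate features instead of using saved ones")
    parser.add_argument("-spt", "--split", type=str2bool, default=False,
                        help="split the data or not")
    parser.add_argument("-ts", "--train_split", type=float, default=0.7,
                        help="fraction of data used for training")
    parser.add_argument("-num", "--no_of_tables", type=int, default=20000,
                        help="number of tables in the files to process")
    parser.add_argument("--sample", type=str2bool, default=False,
                        help="use a sample or not")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing the boolean flags through a helper like str2bool avoids the common pitfall where any non-empty string, including "False", is treated as true.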
Using the above command line arguments, we can run the code using the following commands after going to the
To run Mycroft on the data with 50000 web tables using saved features
python -i mycroft -spt True -ts 0.8 -num 50000
To run Mycroft on the data with 50000 web tables with features being generated
python -i mycroft -e True -spt True -ts 0.8 -num 50000
Note: The feature extraction part takes a long time (8 seconds per data column)
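To get a rough idea of what this means for a full run, the per-column figure can be scaled up. The sketch below only does that arithmetic. The number of columns is a hypothetical input that you would have to supply for your own data, and the function name is illustrative.

```python
def estimate_extraction_hours(n_columns, seconds_per_column=8.0):
    """Estimate total feature extraction time in hours.

    seconds_per_column defaults to the figure quoted in the note above;
    n_columns is whatever your own data contains (hypothetical here).
    """
    return n_columns * seconds_per_column / 3600.0


if __name__ == "__main__":
    # Hypothetical example: 450 data columns at 8 seconds each.
    print(estimate_extraction_hours(450), "hours")
```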
In the same way, you can explore other values and compare the results.
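For reference, -ts 0.8 in the commands above means that 80% of the data is used for training and the rest for testing. The snippet below is a minimal sketch of such a split using only the standard library. It assumes a simple shuffled split over a list of items; the function name and the fixed seed are illustrative and may differ from how the project actually splits its data.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = train_test_split(range(10), train_fraction=0.8)
    print(len(train), "train items,", len(test), "test items")
```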
The dataset is available on the website. Go to the link, download the first folder, and extract it; the extracted folder will be called 0.
Place it in resources/data
- M. Hulsebos, K. Hu, M. Bakker, E. Zgraggen, A. Satyanarayan, T. Kraska, Ç. Demiralp, and C. Hidalgo, Sherlock: A deep learning approach to semantic data type detection, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2019.
- O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer, A large public corpus of web tables containing time and context metadata, in Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75–76, 2016.