This repository is used to publish all the code used for the following article:
The code and datasets were fully released as of January 2018, including all the code for crawling, preprocessing and training on the datasets. The documentation may not be complete yet. In the meantime, readers can refer to the doc directory for an example of reproducing all the results for the Dianping dataset, and extend it to the other datasets in similar ways.
For every number in our paper, there is a script one can execute to reproduce it. Users should not have to dig out any experimental parameter buried in the paper's text.
The data directory contains the preprocessing scripts for all the datasets used in the paper. The datasets themselves are released separately from their processing source code. See below for details.
The following table is a summary of the datasets. Most of them have millions of samples for training.
Dataset | Language | Classes | Train | Test |
---|---|---|---|---|
Dianping | Chinese | 2 | 2,000,000 | 500,000 |
JD full | Chinese | 5 | 3,000,000 | 250,000 |
JD binary | Chinese | 2 | 4,000,000 | 360,000 |
Rakuten full | Japanese | 5 | 4,000,000 | 500,000 |
Rakuten binary | Japanese | 2 | 3,400,000 | 400,000 |
11st full | Korean | 5 | 750,000 | 100,000 |
11st binary | Korean | 2 | 4,000,000 | 400,000 |
Amazon full | English | 5 | 3,000,000 | 650,000 |
Amazon binary | English | 2 | 3,600,000 | 400,000 |
Ifeng | Chinese | 5 | 800,000 | 50,000 |
Chinanews | Chinese | 7 | 1,400,000 | 112,000 |
NYTimes | English | 7 | 1,400,000 | 105,000 |
Joint full | Multilingual | 5 | 10,750,000 | 1,500,000 |
Joint binary | Multilingual | 2 | 15,000,000 | 1,560,000 |
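As a quick consistency check on the table above, the Joint rows match the sums of the JD, Rakuten, 11st and Amazon rows of the same kind. That the joint datasets are built by combining those four is an inference from the numbers, not something stated in this README; the sketch below only verifies the arithmetic. The names `SIZES` and `joint_size` are made up for illustration.

```python
# (train, test) sizes copied from the summary table above.
SIZES = {
    "JD full": (3_000_000, 250_000),
    "Rakuten full": (4_000_000, 500_000),
    "11st full": (750_000, 100_000),
    "Amazon full": (3_000_000, 650_000),
    "JD binary": (4_000_000, 360_000),
    "Rakuten binary": (3_400_000, 400_000),
    "11st binary": (4_000_000, 400_000),
    "Amazon binary": (3_600_000, 400_000),
    "Joint full": (10_750_000, 1_500_000),
    "Joint binary": (15_000_000, 1_560_000),
}

# Assumption: these four sources are the ones combined into the Joint rows.
SOURCES = ["JD", "Rakuten", "11st", "Amazon"]


def joint_size(kind):
    """Sum (train, test) over the four source datasets of the given kind."""
    train = sum(SIZES[f"{name} {kind}"][0] for name in SOURCES)
    test = sum(SIZES[f"{name} {kind}"][1] for name in SOURCES)
    return train, test


for kind in ("full", "binary"):
    # Print the computed sum next to the value listed in the table.
    print(kind, joint_size(kind), SIZES[f"Joint {kind}"])
```

For both kinds, the computed pair equals the listed Joint row.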
Datasets are released separately from the source code via Google Drive links. These datasets should be used for research purposes only.
Dataset | Train | Test |
---|---|---|
Dianping | Link | Link |
JD full | Link | Link |
JD binary | Link | Link |
Rakuten full | Link | Link |
Rakuten binary | Link | Link |
11st full | Link | Link |
11st binary | Link | Link |
Amazon full | Link | Link |
Amazon binary | Link | Link |
Ifeng | Link | Link |
Chinanews | Link | Link |
NYTimes | Link | Link |
Joint full | Link | Link |
Joint binary | Link | Link |
The glyphnet scripts require the GNU Unifont character images to run. The file unifont-8.0.01.t7b.xz can be downloaded via this link.
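The downloaded file is xz-compressed. Whether the glyphnet scripts expect the decompressed file, and in which directory, is not stated here, so the following is only a minimal sketch of one way to unpack it with Python's standard `lzma` module. The helper name `decompress_xz` is hypothetical; running `xz -d` on the command line would be an equivalent alternative.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress an .xz file and return the path of the output file.

    If dst is not given, the output is written next to src with the
    trailing ".xz" removed (an assumption about the desired file name).
    """
    src = Path(src)
    if dst is None:
        if src.suffix != ".xz":
            raise ValueError(f"cannot derive output name from {src}")
        dst = src.with_suffix("")  # e.g. foo.t7b.xz -> foo.t7b
    dst = Path(dst)
    # Stream the data so the whole file never has to sit in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling `decompress_xz("unifont-8.0.01.t7b.xz")` in the download directory would write `unifont-8.0.01.t7b` alongside the archive.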