- Russian: https://ilyagusev.github.io/tgcontest/ru/main.html
- English: https://ilyagusev.github.io/tgcontest/en/main.html
Prerequisites: CMake, Boost
$ sudo apt-get install cmake libboost-all-dev build-essential
If you got the code as a zip archive, skip straight to building the binary.
To download the code and models:
$ git clone https://github.com/IlyaGusev/tgcontest
$ cd tgcontest
$ git submodule init
$ git submodule update
$ bash download_models.sh
To build the binary (from the "tgcontest" dir):
$ mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release ..
$ make
To download datasets:
$ bash download_data.sh
Run on a sample (from the "tgcontest" dir):
$ ./build/tgnews top data --ndocs 10000
- Russian FastText vectors training: VectorsRu.ipynb
- Russian FastText category classifier training: CatTrainRu.ipynb
- Russian sentence embedder training: SimilarityRu.ipynb
- English FastText vectors training: VectorsEn.ipynb
- English FastText category classifier training: CatTrainEn.ipynb
- English sentence embedder training: SimilarityEn.ipynb
- PageRank rating calculation: PageRankRating.ipynb
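
For readers unfamiliar with the rating mentioned in the last item, here is a minimal sketch of PageRank by power iteration in plain Python. It is not the code from PageRankRating.ipynb: the `pagerank` function, the damping factor of 0.85, the iteration count, and the toy link graph between hypothetical hosts are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping a source node to the list of nodes it links to.
    Returns a dict mapping every node to its score; scores sum to 1.
    """
    # Collect every node that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each node passes an equal share of its rank along every outgoing link.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: which host links to which (not real data).
toy_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

ratings = pagerank(toy_links)
for host, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{host}\t{score:.4f}")
```

Hosts that receive links from many well-ranked hosts end up with higher scores, which is the general idea behind a PageRank-based rating; how the notebook builds its link graph is not shown here.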
- Language detection model: lang_detect.ftz
- Russian FastText vectors: ru_vectors_v2.bin
- Russian categories detection model: ru_cat_v2.ftz
- English FastText vectors: en_vectors_v2.bin
- English categories detection model: en_cat_v2.ftz
- PageRank-based agency rating: pagerank_rating.txt
- Russian news from 0107 and 0817 archives: ru_tg_train.tar.gz
- English news from 0107 and 0817 archives: en_tg_train.tar.gz
- Russian news from 1821, 2225, 29 and 09 archives: ru_tg_test.tar.gz
- English news from 1821, 2225, 29 and 09 archives: en_tg_test.tar.gz
- Data for training Russian vectors: ru_unsupervised_train.tar.gz
- Data for training English vectors: en_unsupervised_train.tar.gz
- Russian categories train markup: ru_cat_train_raw_markup.tsv
- Russian categories test markup: ru_cat_test_raw_markup.tsv
- Russian not_news additional markup: ru_not_news.txt
- English categories train markup: en_cat_train_raw_markup.tsv
- English categories test markup: en_cat_test_raw_markup.tsv
- Description in Russian: https://habr.com/ru/post/487324/
- Framework for complex NN
- Proper clustering markup
- Error analysis for categories classifiers
- Alternatives for PageRank
- "Ugly" titles