Russian-to-English machine translation using a Transformer model.
This project was done during the Practical Machine Learning and Deep Learning course in the Spring 2020 semester at Innopolis University (Innopolis, Russia).
First, install all the requirements from requirements.txt.
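For example, with pip:

    pip install -r requirements.txt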
Then prepare your dataset for the model by running preparation.py:
usage: preparation.py [-h] [--dataset_size [DATASET_SIZE]]
                      [--vocab_size [VOCAB_SIZE]]
                      [--temp_file_path [TEMP_FILE_PATH]]
                      dataset_path store_path

positional arguments:
  dataset_path          path to the dataset.
  store_path            path where to store the results.

optional arguments:
  -h, --help            show this help message and exit
  --dataset_size [DATASET_SIZE]
                        max number of samples in the dataset.
  --vocab_size [VOCAB_SIZE]
                        the size of the tokenizer's vocabulary.
  --temp_file_path [TEMP_FILE_PATH]
                        path where to save the temporary file for the
                        tokenizer.
It will create the train/test/val datasets and a BPE tokenizer for the model.
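For example (the paths and sizes below are placeholders; point them at your own corpus and output directory):

    python preparation.py data/corpus.txt data/prepared --dataset_size 100000 --vocab_size 30000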
After that, you can train the model using run_transformer.py:
usage: run_transformer.py [-h] [--model_save_path [MODEL_SAVE_PATH]]
                          [--batch_size [BATCH_SIZE]]
                          [--learning_rate [LEARNING_RATE]]
                          [--n_words [N_WORDS]] [--emb_size [EMB_SIZE]]
                          [--n_hid [N_HID]] [--n_layers [N_LAYERS]]
                          [--n_head [N_HEAD]] [--dropout [DROPOUT]]
                          data_path n_epochs tokenizer_path

positional arguments:
  data_path             path to the train/test/val sets.
  n_epochs              number of training epochs.
  tokenizer_path        path to the tokenizer.

optional arguments:
  -h, --help            show this help message and exit
  --model_save_path [MODEL_SAVE_PATH]
                        where to load/save the model.
  --batch_size [BATCH_SIZE]
                        batch size for training/validation.
  --learning_rate [LEARNING_RATE]
                        learning rate for training.
  --n_words [N_WORDS]   number of words to train on.
  --emb_size [EMB_SIZE]
                        embedding dimension.
  --n_hid [N_HID]       the dimension of the feedforward network model in
                        nn.TransformerEncoder.
  --n_layers [N_LAYERS]
                        the number of encoder/decoder layers in the
                        transformer.
  --n_head [N_HEAD]     the number of heads in the multi-head attention
                        layers.
  --dropout [DROPOUT]   dropout rate during training.
After training, the script will compute the BLEU score on the test dataset.
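A typical training run might look like this (the paths, tokenizer file name, and hyperparameter values are illustrative; use the outputs produced by preparation.py):

    python run_transformer.py data/prepared 10 data/prepared/tokenizer.json --batch_size 64 --model_save_path model.pt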
To translate text, use translate.py:
usage: translate.py [-h] [--encoding [ENCODING]] [--max_len [MAX_LEN]]
                    model_path tokenizer_path in_data_path out_data_path

positional arguments:
  model_path            path to the trained model.
  tokenizer_path        path to the tokenizer.
  in_data_path          path to the input data.
  out_data_path         path where to save the results.

optional arguments:
  -h, --help            show this help message and exit
  --encoding [ENCODING]
                        encoding for the files.
  --max_len [MAX_LEN]   maximum translation length.
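For instance (file names here are placeholders; the model and tokenizer paths should match those used during training):

    python translate.py model.pt data/prepared/tokenizer.json input_ru.txt output_en.txt --max_len 128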