An experimental study of Standard-Chinese to Cantonese translation models.
As a continuation of the previous project, this project focuses on Neural Machine Translation (NMT) between Standard Chinese and Cantonese, with the former as the source language and the latter as the target language.
Two sequence-to-sequence models were studied:
- A Transformer model, which follows the Encoder-Decoder architecture and uses stacked self-attention and point-wise, fully connected layers in both the encoder and the decoder (Vaswani et al., 2017).
- A vanilla RNN model, whose encoder and decoder are recurrent neural networks composed of Gated Recurrent Units (Chung et al., 2014), with Bahdanau attention (Bahdanau et al., 2015) letting the decoder attend over the encoder outputs (a sketch follows this list).
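For concreteness, below is a minimal PyTorch sketch of the second model: a GRU encoder-decoder with Bahdanau (additive) attention. All class names and dimensions are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch (assumed names/shapes): GRU encoder-decoder with
# Bahdanau additive attention, trained with teacher forcing.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_enc = nn.Linear(hidden_dim, hidden_dim)  # scores encoder states
        self.W_dec = nn.Linear(hidden_dim, hidden_dim)  # scores decoder state
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, hidden), enc_outputs: (batch, src_len, hidden)
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)
        ))                                              # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)
        context = (weights * enc_outputs).sum(dim=1)    # (batch, hidden)
        return context, weights

class Seq2SeqGRU(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRUCell(emb_dim + hidden_dim, hidden_dim)
        self.attention = BahdanauAttention(hidden_dim)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_outputs, h = self.encoder(self.src_emb(src_ids))
        dec_state = h.squeeze(0)                        # (batch, hidden)
        logits = []
        for t in range(tgt_ids.size(1)):                # teacher forcing
            context, _ = self.attention(dec_state, enc_outputs)
            step_in = torch.cat([self.tgt_emb(tgt_ids[:, t]), context], dim=-1)
            dec_state = self.decoder(step_in, dec_state)
            logits.append(self.out(dec_state))
        return torch.stack(logits, dim=1)               # (batch, tgt_len, vocab)
```

The Transformer model can be sketched analogously, e.g. with PyTorch's built-in `nn.Transformer` module.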
Preliminary results:
Abstract: Dialect as a Low-Resource Language: A Study on Standard-Chinese to Cantonese Translation with Movie Transcripts
Cantonese, a major spoken Chinese dialect, can be viewed as a low-resource language given that collections of its raw written form are scarce. This project develops a pipeline to accomplish the low-resource Cantonese translation task with the help of its closely related rich-resource counterpart, Standard Chinese (SC). The pipeline consists of two major translation methods: (1) the sequence-to-sequence neural-network approach suggested by Jhamtani et al. (2017), and (2) the translation-matrix approach suggested by Mikolov et al. (2013). Our implementation of machine translation from SC to Cantonese, in a simplified setting, does not achieve satisfying results nor outperform the baselines. This report describes the similarities and differences between our implementation and the original approaches, and discusses possible future improvements.
Two major approaches are included:
- Copy-Enriched Seq2Seq Models (Jhamtani et al., 2017)
- Dictionary table enriched via a Translation Matrix (Mikolov et al., 2013); see the sketch after this list.
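As a rough illustration of the second approach, the sketch below learns a linear map `W` from SC word embeddings to Cantonese word embeddings over a seed dictionary, then translates unseen words by nearest neighbour in the target space. Mikolov et al. (2013) fit `W` by stochastic gradient descent; the closed-form least-squares solution used here minimizes the same objective. All names and shapes are illustrative assumptions.

```python
# Minimal sketch (assumed names/shapes) of the translation-matrix approach.
import numpy as np

def fit_translation_matrix(X, Z):
    """Solve min_W ||X W - Z||_F^2 over seed dictionary pairs by least squares.

    X: (n_pairs, d_src) source-side embeddings of dictionary entries.
    Z: (n_pairs, d_tgt) target-side embeddings of the same entries.
    """
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W                                    # (d_src, d_tgt)

def translate(x, W, tgt_vectors, tgt_words, k=5):
    """Map a source vector into target space; return k nearest target words."""
    z = x @ W
    # Cosine similarity against every target word vector.
    sims = (tgt_vectors @ z) / (
        np.linalg.norm(tgt_vectors, axis=1) * np.linalg.norm(z) + 1e-9
    )
    top = np.argsort(-sims)[:k]
    return [tgt_words[i] for i in top]
```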
Check here for more instructions. This version of the code is based on the work by Jhamtani et al.