An experimental study on Standard-Chinese to Cantonese translation models.
Two major approaches are included:
- Copy-Enriched Seq2Seq Models (Jhamtani et al., 2017)
- Enriched dictionary table by Translation-Matrix (Mikolov et al., 2013)

This version of the code is initiated from the work by Jhamtani et al.
- Python 3.5
- Tensorflow 1.1.0 (framework)
- Change working directory to `/code/v1/`
- Run `python mt_main.py init` to download the pre-trained Cantonese and Chinese embeddings from fastText and to build the token dictionaries for the Chinese tokenizer (a rough sketch of this step follows below).
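For orientation, the sketch below shows roughly what the initialization step amounts to. The target directory and the download URLs are assumptions based on fastText's "Wiki word vectors" release, not a copy of `mt_main.py`.

```python
# Rough sketch of the initialization step (not the actual mt_main.py code):
# fetch the fastText Wiki word vectors for Chinese and Cantonese.
# The URLs follow fastText's "Wiki word vectors" naming and may change.
import os
import urllib.request

EMBED_DIR = "data/embedding"  # assumed location; the repository may use another path
BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki"

os.makedirs(EMBED_DIR, exist_ok=True)
for name in ("wiki.zh.vec", "wiki.zh_yue.vec"):  # Chinese / Cantonese text vectors
    target = os.path.join(EMBED_DIR, name)
    if not os.path.exists(target):
        print("Downloading", name)
        urllib.request.urlretrieve("{}/{}".format(BASE_URL, name), target)
```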
- First run initialization.
- Change working directory to `/code/v1/`
- Run `python mt_main.py preprocessing` to pre-process and save the train/valid/test data for the two models.
- The dictionaries used for tokenization can be changed in `prepro.py` (an illustrative tokenization sketch follows this list).
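Tokenization relies on Jieba with custom user dictionaries (see the notes section further below). A minimal illustration follows; the dictionary path here is hypothetical, and the dictionaries actually used are configured in `prepro.py`.

```python
# Minimal sketch of custom-dictionary tokenization with Jieba.
import jieba

# Hypothetical user-dictionary path for illustration only;
# the real dictionaries are set in prepro.py.
jieba.load_userdict("data/static/canto_tokens.dict")

sentence = "我哋聽日去睇戲"  # Cantonese: "We are going to watch a movie tomorrow."
tokens = list(jieba.cut(sentence))
print("/".join(tokens))
```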
- First run pre-processing.
- Change working directory to `/code/v1/`
- Run `python mt_main.py train <iter_num> <output_model_name>` or `python mt_main.py validation <saved_model_name>`.
- Training settings can be modified in `configuration.py`.
- Trained models and temporary results are saved in `/data/mt_model/`.
- Link to the original paper: https://arxiv.org/abs/1707.01161 (an illustrative sketch of the copy mechanism follows this list).
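The Copy-Enriched Seq2Seq model mixes the decoder's vocabulary distribution with an attention-based copy distribution over source tokens (Jhamtani et al., 2017). The numpy sketch below only illustrates that mixing step for a single decoding position; it is not the repository's TensorFlow implementation, and all array values are made up.

```python
# Illustration of the copy mechanism for one decoding step (toy numbers):
# the final distribution is a gated mixture of the generation distribution over
# the vocabulary and the attention-based copy distribution over source tokens.
import numpy as np

vocab = ["<unk>", "我", "哋", "聽", "日"]
source_tokens = ["我", "們"]                 # source sentence tokens (toy example)

p_gen = np.array([0.1, 0.5, 0.1, 0.2, 0.1])  # decoder softmax over the vocabulary
attn = np.array([0.7, 0.3])                  # attention weights over source tokens
g = 0.6                                      # generation gate in [0, 1]

# Scatter the copy probabilities onto the vocabulary positions of the source
# tokens (out-of-vocabulary source tokens would extend the vocabulary in practice).
p_copy = np.zeros(len(vocab))
for weight, tok in zip(attn, source_tokens):
    idx = vocab.index(tok) if tok in vocab else vocab.index("<unk>")
    p_copy[idx] += weight

p_final = g * p_gen + (1.0 - g) * p_copy
print(vocab[int(np.argmax(p_final))], p_final)
```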
- First run pre-processing.
- Change working directory to `/code/v1/`
- Download the Cantonese language model (wiki.zh_yue.bin) and put it under `data/embedding/`.
- Run `python mt_translation_matrix.py`; a linear projection matrix between the two embedding spaces will be learnt using stochastic gradient descent (a minimal sketch of this objective follows this list).
- The trained matrix and temporary results are saved in `/data/mt_translation_matrix/`.
- Link to the original paper: https://arxiv.org/pdf/1309.4168.pdf
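The translation-matrix idea (Mikolov et al., 2013) learns a linear map W that sends SC embeddings close to their Cantonese counterparts by minimizing the sum of ||W x_i - z_i||^2 over dictionary pairs. Below is a minimal numpy sketch with plain SGD; the vectors are random placeholders, and `mt_translation_matrix.py` may differ in its details.

```python
# Minimal sketch of learning a translation matrix W with SGD (Mikolov et al., 2013):
# minimize sum_i ||W x_i - z_i||^2 over dictionary pairs (x_i, z_i).
# X and Z are random placeholders; in the project they would be SC and Cantonese
# fastText vectors looked up from the bilingual dictionary.
import numpy as np

rng = np.random.RandomState(0)
n_pairs, dim = 1600, 300        # roughly the dictionary size / fastText dimension
X = rng.randn(n_pairs, dim)     # SC word vectors (placeholder)
Z = rng.randn(n_pairs, dim)     # Cantonese word vectors (placeholder)

W = np.zeros((dim, dim))
lr = 0.001
for epoch in range(5):
    for i in rng.permutation(n_pairs):
        x, z = X[i], Z[i]
        err = W.dot(x) - z              # residual for this pair
        W -= lr * np.outer(err, x)      # gradient of 0.5 * ||W x - z||^2
    loss = np.mean(np.sum((X.dot(W.T) - Z) ** 2, axis=1))
    print("epoch", epoch, "mean squared error", round(loss, 3))
```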
- (Assume trained models are ready.)
- Change working directory to `/code/v1/`
- Run `python baseline_as_it_is.py` or `python baseline_dictionary.py` to check the performance of the two baseline methods (results saved in `/code/eval/baselines/`).
- Run `python mt_main.py test <saved_model_name>` to check the performance of the Copy-Enriched Seq2Seq Model (results saved in `/code/eval/mt_model/MOVIE-transcript/`).
- Run `python mt_translation_matrix.py` to check the performance of the Translation-Matrix Model (results saved in `/code/eval/mt_translation_matrix/MOVIE-transcript/`). Scores are reported in BLEU; a character-level BLEU illustration follows this list.
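As noted in the data and implementation notes below, BLEU is computed on single-Chinese-character tokens. A minimal illustration follows, using NLTK's `corpus_bleu` only as a stand-in for the MOSES multi-bleu script that the project actually relies on; the sentences are toy examples.

```python
# Character-level BLEU illustration (NLTK used as a stand-in for MOSES multi-bleu).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def to_chars(sentence):
    # Score on single Chinese characters, ignoring whitespace.
    return [ch for ch in sentence if not ch.isspace()]

references = ["我哋聽日去睇戲"]   # Cantonese reference sentences (toy example)
hypotheses = ["我哋今日去睇戲"]   # model outputs (toy example)

refs = [[to_chars(r)] for r in references]   # corpus_bleu expects a list of reference lists
hyps = [to_chars(h) for h in hypotheses]
score = corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
print("character-level BLEU:", round(score, 3))
```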
- Movie transcripts are used to create a collection of sentence pairs {Standard Chinese (繁), Cantonese (粵)} as the parallel corpora. (`/data/transcript/`)
- A Cantonese-to-SC dictionary mapping consisting of ~1600 entries is created from the online database of A Comparative Study of Modern Chinese and Cantonese in the Development of Teaching Resource (`/data/static/canto2stdch_full.dict`; see the script `crawl_dict_map.py`).
- Tokenization is done with Jieba using customized dictionaries. Cantonese sentences are tokenized based on token words from PyCantonese and the words available in the embedding.
- Pre-trained embeddings for Chinese and Cantonese are downloaded from fastText (Wiki word vectors).
- Conversion between traditional and simplified Chinese characters is done with the Python wrapper OpenCC-Python on top of Open Chinese Convert (a minimal usage sketch follows this list).
- BLEU metric evaluation is provided by the MOSES toolkit.
- BLEU evaluation is adjusted to consider only single-Chinese-character tokenization.
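A minimal usage sketch of the traditional/simplified conversion with OpenCC-Python; depending on the installed wrapper version, the configuration name may need a `.json` suffix.

```python
# Minimal OpenCC-Python usage: convert simplified Chinese to traditional characters.
from opencc import OpenCC

s2t = OpenCC("s2t")                      # simplified -> traditional configuration
print(s2t.convert("我们明天去看电影"))    # -> 我們明天去看電影
```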
- OSError: [Errno 12] Cannot allocate memory: make sure you have enough RAM; restart the computer, or try adding a new swap file (e.g. extending swap to 4 GB in total).
- Keith Carlson, Allen Riddell, and Daniel Rockmore. 2018. "Evaluating Prose Style Transfer with the Bible." Royal Society Open Science 5: 171920. http://dx.doi.org/10.1098/rsos.171920
- Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. "The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English." arXiv:1902.01382v3, EMNLP 2019.
- Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. "Shakespearizing Modern Language Using Copy-Enriched Sequence-to-Sequence Models." Proceedings of the Workshop on Stylistic Variation, EMNLP 2017.
- G. Huang, A. Gorin, J.-L. Gauvain, and L. Lamel. 2016. "Machine Translation Based Data Augmentation for Cantonese Keyword Spotting." In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6020–6024. IEEE.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781.
- Jackson L. Lee, Litong Chen, and Tsz-Him Tsui. 2016. "PyCantonese: Developing Computational Tools for Cantonese Linguistics." Talk at the 3rd Workshop on Innovations in Cantonese Linguistics, The Ohio State University, March 12, 2016.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
- Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. "Exploiting Similarities among Languages for Machine Translation." arXiv preprint arXiv:1309.4168.
- Sudha Rao and Joel Tetreault. 2018. "Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, New Orleans, LA, pages 129–140. Association for Computational Linguistics. doi:10.18653/v1/n18-1012
- Tak-sum Wong and John Lee. 2018. "Register-sensitive Translation: a Case Study of Mandarin and Cantonese (Non-archival Extended Abstract)." Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers).
- Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. "Generalized Data Augmentation for Low-Resource Translation." arXiv:1906.03785, ACL 2019.
- Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, and Colin Cherry. 2012. "Paraphrasing for Style." Proceedings of COLING 2012, pages 2899–2914.
- Jia Xu, Richard Zens, and Hermann Ney. 2004. "Do We Need Chinese Word Segmentation for Statistical Machine Translation?" Proceedings of the Third SIGHAN Workshop on Chinese Language Processing.
- Chunting Zhou, Xuezhe Ma, Junjie Hu, and Graham Neubig. 2019. "Handling Syntactic Divergence in Low-resource Machine Translation." arXiv:1909.00040v1, EMNLP 2019.
- Xiaoheng Zhang. 1998. "Dialect MT: a Case Study between Cantonese and Mandarin." Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2. Association for Computational Linguistics.
- Piotr Bojanowski et al. 2017. "Enriching Word Vectors with Subword Information." Transactions of the Association for Computational Linguistics 5: 135–146.
- Added references and improved the README.
- Restarted the project. Renamed `/code/main` to `/code/v1`.
- Summarized the results in a report titled "Dialect as a Low-Resource Language: A Study on Standard-Chinese to Cantonese Translation with Movie Transcripts". Abstract:

  Cantonese, a major spoken Chinese dialect, can be viewed as a low-resource language given that collections of its raw written form are scarce. This project develops a pipeline to accomplish the low-resource Cantonese translation task using its closely related rich-resource counterpart, Standard Chinese (SC). The pipeline consists of two major translation methods: (1) the sequence-to-sequence neural-network approach suggested by Jhamtani et al. (2017), and (2) the translation-matrix approach suggested by Mikolov et al. (2013). Our implementation of machine translation from SC to Cantonese, in a simplified setting, does not yield satisfying results, nor does it perform better than the baselines. This report describes the similarities and differences between our implementation and the original approaches, and also discusses possible future improvements.
- Submitted the report for the UWaterloo course CS680 - Introduction to Machine Learning. Grade: 21/25.
- Initialized the repository.