
Cantonese, a major spoken Chinese dialect, can be viewed as a low-resource language because collections of raw written Cantonese are scarce. This project develops a pipeline for the low-resource Cantonese translation task that draws on its closely related, resource-rich counterpart, Standard Chinese, using Transformer and RNN models.


kiking0501/Cantonese-Chinese-Translation

Cantonese-Chinese-Translation

An experimental study of Standard Chinese to Cantonese translation models.

Abstract: Learning Cantonese from Standard-Chinese with Neural Machine Translation

As a continuation of the previous project, this project focuses on Neural Machine Translation (NMT) between Standard Chinese and Cantonese, with the former as the source language and the latter as the target language.

Two sequence-to-sequence models were studied:

  • A Transformer model, which follows the Encoder-Decoder architecture and uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder (Vaswani et al., 2017).
  • A vanilla RNN model, in which the encoder and decoder layers are recurrent neural networks composed of Gated Recurrent Units (Chung et al., 2014), with Bahdanau attention (Bahdanau et al., 2015) in the encoder layer.
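The two attention mechanisms cited above can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas from the cited papers, not the code used in this repository; the function names, argument names, and shapes are assumptions made for the example, and batching, masking, and multiple heads are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(queries, keys, values):
    """Transformer-style attention: softmax(Q K^T / sqrt(d_k)) V.

    queries: (n_queries, d_k), keys: (n_keys, d_k), values: (n_keys, d_v).
    Returns the attended values (n_queries, d_v) and the weights (n_queries, n_keys).
    """
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Bahdanau-style attention: score_j = v^T tanh(W_dec s + W_enc h_j).

    decoder_state: (d_dec,), encoder_states: (src_len, d_enc),
    w_dec: (d_att, d_dec), w_enc: (d_att, d_enc), v: (d_att,).
    Returns the context vector (d_enc,) and the weights (src_len,).
    """
    # Project the decoder state once and every encoder state, then combine.
    projected = np.tanh(w_dec @ decoder_state + encoder_states @ w_enc.T)
    scores = projected @ v  # one score per source position
    weights = softmax(scores)
    return weights @ encoder_states, weights
```

In both cases the weights form a probability distribution over source positions, so the output is a weighted average of the value (or encoder) vectors. The difference lies in how the scores are computed: a scaled dot product in the Transformer, a small feed-forward network in the Bahdanau formulation.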

Preliminary result:

Abstract: Dialect as a Low-Resource Language: A Study on Standard-Chinese to Cantonese Translation with Movie Transcripts

Cantonese, a major spoken Chinese dialect, can be viewed as a low-resource language because collections of raw written Cantonese are scarce. This project develops a pipeline for the low-resource Cantonese translation task that draws on its closely related, resource-rich counterpart, Standard Chinese (SC). The pipeline consists of two major translation methods: (1) the sequence-to-sequence neural-network approach suggested by Jhamtani et al. (2017), and (2) the translation-matrix approach suggested by Mikolov et al. (2013). Our implementation of machine translation from SC to Cantonese, in a simplified setting, neither produces satisfactory results nor outperforms the baselines. This report describes the similarities and differences between our implementation and the original approaches, and discusses possible future improvements.

Two major approaches are included:

  • Copy-Enriched Seq2Seq Models (Jhamtani et al., 2017)
  • Enriched dictionary table by Translation-Matrix (Mikolov et al., 2013)
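To make the translation-matrix idea concrete, the sketch below fits a linear map between two embedding spaces from a seed dictionary of paired word vectors and then looks up the nearest target word for a mapped source vector. It follows the general least-squares formulation of Mikolov et al. (2013) and is not taken from this repository; the function names, the use of `np.linalg.lstsq`, and the cosine-similarity lookup are assumptions made for the example.

```python
import numpy as np


def fit_translation_matrix(source_vecs, target_vecs):
    """Least-squares W minimising sum_i ||W x_i - z_i||^2.

    source_vecs: (n_pairs, d_src) embeddings of seed source words.
    target_vecs: (n_pairs, d_tgt) embeddings of their translations.
    Returns W with shape (d_tgt, d_src).
    """
    # lstsq solves source_vecs @ W.T ~= target_vecs for W.T.
    w_transposed, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return w_transposed.T


def translate(word_vec, w, target_matrix, target_words, top_k=1):
    """Map a source vector into the target space and return the
    top_k target words ranked by cosine similarity."""
    mapped = w @ word_vec
    norms = np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped)
    sims = target_matrix @ mapped / np.maximum(norms, 1e-12)
    best = np.argsort(-sims)[:top_k]
    return [target_words[i] for i in best]
```

With a fitted `w`, a call such as `translate(sc_vector, w, cantonese_matrix, cantonese_words, top_k=5)` (all hypothetical names) would return candidate Cantonese entries for an SC word, which is one way a dictionary table could be enriched.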

See here for more instructions. This version of the code is based on the work by Jhamtani.
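As a rough illustration of what "copy-enriched" means, the sketch below mixes the decoder's vocabulary distribution with a copy distribution built from attention weights over the source tokens. It assumes the common pointer-style mixture formulation and is not the code from Jhamtani's work or from this repository; the function name, argument names, and the scalar `p_copy` are hypothetical.

```python
import numpy as np


def copy_enriched_distribution(p_vocab, attention, source_ids, p_copy):
    """Mix a generation distribution with a copy distribution.

    p_vocab: (vocab_size,) decoder softmax over the vocabulary.
    attention: (src_len,) attention weights over source positions (sums to 1).
    source_ids: (src_len,) vocabulary id of each source token.
    p_copy: scalar in [0, 1], probability of copying rather than generating.
    Returns a (vocab_size,) distribution over the next output token.
    """
    copy_dist = np.zeros_like(p_vocab, dtype=float)
    # Scatter-add attention mass onto vocabulary ids; repeated tokens accumulate.
    np.add.at(copy_dist, source_ids, attention)
    return (1.0 - p_copy) * p_vocab + p_copy * copy_dist
```

A copy path of this kind is a natural fit when the source and target share much of their vocabulary, since many output tokens can be reproduced directly from the input.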
