
A Human Annotated Dataset for the Quality Assessment of Emotion Translation (HADQAET) for Chinese-English Machine Translation

This repository contains the HADQAET dataset presented in our paper published at EAMT 2023. For details of the dataset, please see our paper Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation or the preprint on arXiv. To use the dataset, please see our License.

Citation

  • Shenbin Qian, Constantin Orăsan, Félix do Carmo, Qiuliang Li and Diptesh Kanojia. 2023. Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 125–135, Tampere, Finland. European Association for Machine Translation.
@inproceedings{qian-etal-2023-evaluation,
    title = "Evaluation of {C}hinese-{E}nglish Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation",
    author = "Qian, Shenbin  and
      Orasan, Constantin  and
      Carmo, Felix Do  and
      Li, Qiuliang  and
      Kanojia, Diptesh",
    booktitle = "Proceedings of the 24th Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2023",
    address = "Tampere, Finland",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2023.eamt-1.13",
    pages = "125--135",
    abstract = "In this paper, we focus on how current Machine Translation (MT) engines perform on the translation of emotion-loaded texts by evaluating outputs from Google Translate according to a framework proposed in this paper. We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform detailed error analyses of the MT outputs. From our analysis, we observe that about 50{\%} of MT outputs are erroneous in preserving emotions. After further analysis of the erroneous examples, we find that emotion carrying words and linguistic phenomena such as polysemous words, negation, abbreviation etc., are common causes for these translation errors.",
}

Data

Translation Quality Evaluation Data

The annotated dataset for the quality assessment of emotion translation can be found in the "data" folder. The inter- and intra-annotator agreement data are in the "IAA" folder. Detailed annotation guidelines are provided in the "annotation_guidelines.txt" file.
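
For quick inspection, the annotation data can be loaded with pandas. The sketch below is illustrative only: the exact file name and column layout inside the "data" folder are assumptions and should be checked against the released files.

import pandas as pd

# Illustrative only: the file name "data/HADQAET.csv" and the column layout
# are assumptions; check the "data" folder for the actual file(s) and schema.
df = pd.read_csv("data/HADQAET.csv")

print(df.shape)               # number of annotated instances and fields
print(df.columns.tolist())    # available columns (source text, MT output, quality labels, ...)
print(df.head())              # first few annotated examples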

Post-editing Data

The post-edited reference translations are now available under the "data" folder. We hired a translation company to post-edit the MT outputs that were annotated as erroneous in terms of emotion preservation. The post-editing activity was funded by the European Association for Machine Translation (EAMT). Note that only about 70% of the full dataset is released at present (the rest will be released in a shared task). This released subset includes a total of 4038 instances, about 3000 of which have post-edited reference translations (see the counting sketch after the citation below). Details can be found in our project paper. To use the post-edited reference translations, please cite as follows:

@inproceedings{qian-etal-2024-evaluating,
    title = "Evaluating Machine Translation for Emotion-loaded User Generated Content ({T}rans{E}val4{E}mo-{UGC})",
    author = "Qian, Shenbin  and
      Orasan, Constantin  and
      Do Carmo, F{\'e}lix  and
      Kanojia, Diptesh",
    editor = "Scarton, Carolina  and
      Prescott, Charlotte  and
      Bayliss, Chris  and
      Oakley, Chris  and
      Wright, Joanna  and
      Wrigley, Stuart  and
      Song, Xingyi  and
      Gow-Smith, Edward  and
      Forcada, Mikel  and
      Moniz, Helena",
    booktitle = "Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)",
    month = jun,
    year = "2024",
    address = "Sheffield, UK",
    publisher = "European Association for Machine Translation (EAMT)",
    url = "https://aclanthology.org/2024.eamt-2.22",
    pages = "43--44",
    abstract = "This paper presents a dataset for evaluating the machine translation of emotion-loaded user generated content. It contains human-annotated quality evaluation data and post-edited reference translations. The dataset is available at our GitHub repository.",
}
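
As a rough check of the released subset described above, the following sketch counts how many instances carry a post-edited reference translation. It is only a sketch under assumptions: the file name and the column name "post_edited_reference" are placeholders, not the actual schema.

import pandas as pd

# Illustrative only: file and column names are assumptions; consult the
# files in the "data" folder for the actual schema.
df = pd.read_csv("data/HADQAET_post_edited.csv")

total = len(df)                                      # about 4038 instances in the released subset
with_pe = df["post_edited_reference"].notna().sum()  # roughly 3000 of these have post-edits
print(f"{with_pe} of {total} instances have post-edited reference translations")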

License

This resource is built on top of the dataset from the SMP2020-EWECT shared task, provided by the Social Computing and Information Retrieval Research Center of Harbin Institute of Technology and sourced from Sina Weibo. The MT output was generated by Google Translate on 30 May 2022 and is distributed under Google's Terms of Service.

Although the SMP2020-EWECT dataset does not come with license information, distribution of public Sina Weibo content must follow Weibo's Terms of Service (Clause 1.3), which allows Weibo and its developers to use and distribute any published posts, including texts, pictures, videos, etc. The source texts of this dataset therefore remain under Sina Weibo's developer license. The quality evaluation data and post-edited reference translations are licensed under a Creative Commons 4.0 licence.

Maintainer

Shenbin Qian
