A Human Annotated Dataset for the Quality Assessment of Emotion Translation (HADQAET) for Chinese-English Machine Translation
This repository contains the HADQAET dataset introduced in our EAMT 2023 paper. For details of the dataset, please see our paper Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation, or its version on arXiv. To use the dataset, please see our License.
- Shenbin Qian, Constantin Orăsan, Félix do Carmo, Qiuliang Li and Diptesh Kanojia. 2023. Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland. European Association for Machine Translation.
@inproceedings{qian-etal-2023-evaluation,
title = "Evaluation of {C}hinese-{E}nglish Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation",
author = "Qian, Shenbin and
Orasan, Constantin and
Carmo, Felix Do and
Li, Qiuliang and
Kanojia, Diptesh",
booktitle = "Proceedings of the 24th Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2023",
address = "Tampere, Finland",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2023.eamt-1.13",
pages = "125--135",
abstract = "In this paper, we focus on how current Machine Translation (MT) engines perform on the translation of emotion-loaded texts by evaluating outputs from Google Translate according to a framework proposed in this paper. We propose this evaluation framework based on the Multidimensional Quality Metrics (MQM) and perform detailed error analyses of the MT outputs. From our analysis, we observe that about 50{\%} of MT outputs are erroneous in preserving emotions. After further analysis of the erroneous examples, we find that emotion carrying words and linguistic phenomena such as polysemous words, negation, abbreviation etc., are common causes for these translation errors.",
}
The annotated dataset for the quality assessment of emotion translation can be found in the "data" folder. The inter- and intra-annotator agreement data are in the "IAA" folder, and the detailed annotation guidelines are in the "annotation_guidelines.txt" file.
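As a convenience, here is a minimal sketch of how the annotated data might be loaded and inspected with pandas. The file name "data/HADQAET.csv" and the "severity" column are assumptions for illustration only; check the files actually released in the "data" folder and adjust accordingly.

```python
import pandas as pd

# Hypothetical file name -- check the "data" folder for the file(s) actually released.
df = pd.read_csv("data/HADQAET.csv")

# Inspect the schema and a few rows before relying on any column names.
print(df.columns.tolist())
print(df.head())

# Example: distribution of annotated error severity, assuming a "severity" column exists.
if "severity" in df.columns:
    print(df["severity"].value_counts())
```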
The post-edited reference translations are now available in the "data" folder. We hired a translation company to post-edit the MT outputs that were annotated as erroneous in preserving emotion. The post-editing was funded by the European Association for Machine Translation (EAMT). Note that only about 70% of the whole dataset is released now (the rest will be released in a shared task). This released subset contains 4,038 instances in total, about 3,000 of which have post-edited reference translations; a minimal loading sketch is given after the citation below. Details can be found in our project paper. To use the post-edited reference translations, please cite as follows:
@inproceedings{qian-etal-2024-evaluating,
title = "Evaluating Machine Translation for Emotion-loaded User Generated Content ({T}rans{E}val4{E}mo-{UGC})",
author = "Qian, Shenbin and
Orasan, Constantin and
Do Carmo, F{\'e}lix and
Kanojia, Diptesh",
editor = "Scarton, Carolina and
Prescott, Charlotte and
Bayliss, Chris and
Oakley, Chris and
Wright, Joanna and
Wrigley, Stuart and
Song, Xingyi and
Gow-Smith, Edward and
Forcada, Mikel and
Moniz, Helena",
booktitle = "Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)",
month = jun,
year = "2024",
address = "Sheffield, UK",
publisher = "European Association for Machine Translation (EAMT)",
url = "https://aclanthology.org/2024.eamt-2.22",
pages = "43--44",
abstract = "This paper presents a dataset for evaluating the machine translation of emotion-loaded user generated content. It contains human-annotated quality evaluation data and post-edited reference translations. The dataset is available at our GitHub repository.",
}
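The sketch below illustrates one way the released subset could be filtered down to the instances that have a post-edited reference translation. The file name "data/HADQAET_post_edited.csv" and the column name "post_edited_reference" are assumptions; consult the files in the "data" folder for the real schema.

```python
import pandas as pd

# Hypothetical file name -- check the "data" folder for the actual file(s).
df = pd.read_csv("data/HADQAET_post_edited.csv")

# "post_edited_reference" is an assumed column name for the post-edited reference translation.
refs = df["post_edited_reference"].fillna("").str.strip()
with_reference = df[refs != ""]

print(f"{len(with_reference)} of {len(df)} instances have a post-edited reference")
```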
This resource is built on top of the dataset from the SMP2020-EWECT shared task, provided by the Social Computing and Information Retrieval Research Center of Harbin Institute of Technology and sourced from Sina Weibo. The MT output was generated by Google Translate on 30 May 2022 and is distributed under Google's Terms of Service.
Although the SMP2020-EWECT dataset does not come with license information, distribution of public Sina Weibo content must follow Sina Weibo's Terms of Service (Clause 1.3), which allows Sina Weibo and its developers to use and distribute any published posts, including texts, pictures, videos etc. The source texts of this dataset therefore remain under Sina Weibo's developer license. The quality evaluation data and post-edited reference translations are licensed under a Creative Commons 4.0 licence.