deepchar/entities

github.com/deepchar/entities: transliteration for translation of named entities

Transliteration for translation of named entities is a different task from transliterating informal variants back into their canonical forms: instead of many inputs mapping to one output, one input maps to many outputs.

Transliteration of named entities is useful in search engines, especially within social networks, and for legal discovery, criminal investigations and financial or governmental applications. The goal is to find all possible versions of a name, in all alphabets.

The same name referring to the same person or other entity could be indexed or queried in multiple forms:

Ершов, Yershov, Ershov, Jerszow,...

Шостакович, Shostakovich, Šostakovič, Xostakóvitx, Szostakowicz,...

Թագուշ, Tagoush, Tagoosh, Tagush, Tagusch, Taguš, Тагуш, ...

Alegre, Алегри, Ալեգրի, Ալեգրե, آلگری, Αλέγκρε...

These forms are not standardised, but they are also not typos or misspellings. Certain forms do not occur organically: Tagus, Тагусх, ...

A human literate in the relevant languages and alphabets generally maps them back to the canonical form correctly and effortlessly, subconsciously resolving ambiguity and discarding invalid candidates: Ершев, Ерщов, Թագոուշ, Տագուշ, Թաքուշ, Ершёв, Схостаковицх, ...

Our initial task is canonicalisation: given an informal child form and a target language and script, generate the canonical form.

Yershov + ru-Cyrl ➜ Ершов

Ershov + ru-Cyrl ➜ Ершов

Tagoush + hy-Armn ➜ Թագուշ

Tagush + hy-Armn ➜ Թագուշ

Tagoush + ru-Cyrl ➜ Тагуш

Алегри + pt-Latn ➜ Alegre

Алегри + en-Latn ➜ Alegre

Ալեգրի + pt-Latn ➜ Alegre
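One common way to feed such a task to a character-level model is to prepend a target-language tag token to the source character sequence. A minimal sketch, assuming this tagging scheme (the tag format is an illustration, not necessarily what our models use):

```python
# Hypothetical encoding: prepend a target-language tag token to the
# space-separated character sequence, a common trick in multilingual NMT.
def encode_example(source: str, target_lang: str) -> str:
    return " ".join([f"<2{target_lang}>"] + list(source))

print(encode_example("Yershov", "ru-Cyrl"))
# "<2ru-Cyrl> Y e r s h o v"  ->  expected target characters: Е р ш о в
```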

Our initial target languages are Russian, Armenian, Persian and Greek, and our source script is generally the Latin alphabet.

We evaluate the output with simple exact-match accuracy and character error rate (CER), which adapts word error rate (WER) to character-level sequence generation tasks.
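For reference, a minimal sketch of how CER can be computed, as edit distance divided by reference length (this helper is illustrative, not code from this repository):

```python
# Illustrative helper: character error rate (CER) = Levenshtein edit
# distance between hypothesis and reference, divided by reference length.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

print(cer("Тагусх", "Тагуш"))  # 0.4: two edits over five reference characters
```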

We could also formulate the task in reverse: given a canonical form, generate child forms.

Types of name transliteration

Firstly we should consider the processes by which canonical forms are converted into their various child forms.

Many cases are fairly simple.

Հովիկ Աբրահամյան ➜ en-Latn: Hovik Abrahamyan
Հովիկ Աբրահամյան ➜ pl-Latn: Hovik Abrahamian
Հովիկ Աբրահամյան ➜ de-Latn: Hovig Abrahamyan
Հովիկ Աբրահամյան ➜ ru-Cyrl: Овик Абраамян
Հովիկ Աբրահամյան ➜ uk-Cyrl: Овік Абрамян
Հրազդան ➜ en-Latn: Hrazdan
Հրազդան ➜ ru-Cyrl: Раздан
Tom Collins ➜ ru-Cyrl: Том Коллинз

Some names can have dozens of versions when transliterated into another language.

Armn: Շողիկ Հովհաննեսի Ցոլակյան ➜ Latn: Shoghik Hovhannes Tsolakyan, Shokhik Hovhanes Tsolakian, Shoghig Hovhaness Tzolakyan

There are of course multiple languages per script, and some of the variation is due to the specific target language.

Armn: Շողիկ Հովհաննեսի Ցոլակյան ➜ pl-Latn: Szochik Hovhaness Colakian

Armn: Հովիկ Աբրահամյան ➜ pl-Latn: Howik Abrahamian
Armn: Հովիկ Աբրահամյան ➜ de-Latn: Howik Abrahamjan

Armn: Սերժ Սարգսյան ➜ en-Latn: Serzh Sargsyan
Armn: Սերժ Սարգսյան ➜ fr-Latn: Serge Sarkissian
Armn: Սերժ Սարգսյան ➜ hr-Latn: Serž Sargsjan
Armn: Սերժ Սարգսյան ➜ nl-Latn: Serzj Sarkisian

Latn: John Smith ➜ hy-Armn: Ջոն Սմիթ
Latn: John Smith ➜ ru-Cyrl: Джон Смит, sr-Cyrl: Џон Смит, uk-Cyrl: Джон Сміт
Latn: John Smith ➜ ar-Arab: جون سميث, fa-Arab: جان اسمیت

Latn: SnapChat ➜ hy-Armn: Սնապչատ
Latn: SnapChat ➜ ru-Cyrl: Снепчат, bg-Cyrl: Снапчат
Latn: SnapChat ➜ tr-Latn: Snapçat, pl-Latn: Snapczat

Multiple conversions

Quite often, a name could be repeatedly transliterated:

Հովիկ Աբրահամյան ➜ ru-Cyrl: Овик Абраамян ➜ en-Latn: Ovik Abramyan
Հրազդան ➜ ru-Cyrl: Раздан ➜ en-Latn: Razdan
Howard Hughes ➜ ru-Cyrl: Говард Хьюз ➜ hy-Armn: Գովարդ Խյուզ
Шостакович ➜ it-Latn: Šostakovič ➜ en-Latn: Sostakovic

It could even come back into the original script mangled.

Latn: Howard Hughes ➜ ru-Cyrl: Говард Хьюз ➜ en-Latn: Govard Khyuz
Latn: Charles Aznavour ➜ ru-Cyrl: Шарль Азнавур ➜ en-Latn: Sharl Aznavur
Armn: Հասմիկ Կուրղինյան ➜ ru-Cyrl: Асмик Кургинян ➜ hy-Armn: Ասմիկ Կուրգինյան

Within alphabets

We can also consider variants within a single alphabet.

There are three common transcriptions for the German umlaut:

  1. Preserve the umlaut
    Müller ➜ Müller
  2. Decompose the diacritic
    Müller ➜ Mueller
  3. Simply omit the umlaut
    Müller ➜ Muller

Umlaut decomposition is specific to German. In many other orthographic traditions, the umlaut is usually simply omitted.
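A toy sketch of the three strategies (illustrative only; the decomposition map below covers just the German umlauts and ß):

```python
# Illustrative only: three ways to transcribe German umlauts.
DECOMPOSE = {"ä": "ae", "ö": "oe", "ü": "ue",
             "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}
OMIT = {"ä": "a", "ö": "o", "ü": "u",
        "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss"}

def preserve(name: str) -> str:
    return name                                        # Müller ➜ Müller

def decompose(name: str) -> str:
    return "".join(DECOMPOSE.get(c, c) for c in name)  # Müller ➜ Mueller

def omit(name: str) -> str:
    return "".join(OMIT.get(c, c) for c in name)       # Müller ➜ Muller
```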

Some Latin orthographies nativize according to pronunciation, not unlike non-Latin alphabet orthographies.

George Bush ➜ tr-Latn: Corc Buş
George Bush ➜ de-Latn: Schorsch Busch
George Bush ➜ sq-Latn: Xhorxh Bysh

Our approach

We use the approach commonly used for translation, and also used in our main transliteration project: training on parallel corpora.

We have trained initial models with two different architectures, seq2seq and tensor2tensor.
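For illustration, a hedged sketch of how a character-level transliteration problem could be registered in tensor2tensor; the class name and data file are hypothetical, and the actual training setup may differ:

```python
# Hypothetical sketch: a character-level transliteration problem for
# tensor2tensor. The class name and TSV path are made up for illustration.
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class TranslitLatnRuCyrl(text_problems.Text2TextProblem):
    """Latin-script names to Russian Cyrillic, character by character."""

    @property
    def vocab_type(self):
        return text_problems.VocabType.CHARACTER  # character-level vocabulary

    @property
    def is_generate_per_split(self):
        return False  # one sample file, split into train/dev automatically

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        with open("pairs.latn-ru.tsv", encoding="utf-8") as f:  # hypothetical file
            for line in f:
                source, target = line.rstrip("\n").split("\t")
                yield {"inputs": source, "targets": target}
```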

Datasets

Very limited datasets are available for transliteration, let alone for transliteration of named entities in all directions.

Our goal was to create an almost fully automatic data-harvesting pipeline that other researchers in the field of NMT could use. We were motivated by the fact that, at the time of this research, there were very few open sources. We extracted raw data, learned word alignment and created a dataset of multi-token name pairs.

We assumed that the pronunciation of a last name is independent of the previous tokens, so we represent parallel corpora as pairs of single tokens, as in Yuval Merhav and Stephen Ash's approach.
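A minimal sketch of this reduction, assuming the tokens of a name pair are already aligned one-to-one (the real pipeline learns the alignment):

```python
# Hypothetical helper: reduce a multi-token name pair to single-token
# training pairs, assuming a one-to-one token alignment.
def to_token_pairs(source_name: str, target_name: str):
    source_tokens = source_name.split()
    target_tokens = target_name.split()
    if len(source_tokens) != len(target_tokens):
        return []  # the real pipeline learns word alignment instead of skipping
    return list(zip(source_tokens, target_tokens))

print(to_token_pairs("Hovik Abrahamyan", "Овик Абраамян"))
# [('Hovik', 'Овик'), ('Abrahamyan', 'Абраамян')]
```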

| Dataset | Total size | Training size | Source alphabet size | Target alphabet size |
| --- | --- | --- | --- | --- |
| Latn ➜ hy-Armn | 39,707 | 25,412 | 152 | 107 |
| Latn ➜ ru-Cyrl | 179,853 | 143,882 | 278 | 124 |
| Latn ➜ el-Grek | 37,505 | 30,004 | 170 | 97 |
| Latn ➜ fa-Arab | 78,663 | 62,930 | 170 | 117 |

The data in the Latin script included not just English but also other European languages.

Baselines

As baselines we considered the transliteration module of Polyglot and the bidirectional transliteration library transliterate.

Training

TODO: describe hardware, training time, hyperparams

Results

We compared the character error rate (CER) of the Polyglot baseline and of our seq2seq and tensor2tensor implementations on the given parallel corpora.

The results of the transliterate library baseline are not shown because they were not competitive at all.

Interestingly, the seq2seq model consistently performed better than the tensor2tensor model. The only exception is Latn ➜ el-Grek, where seq2seq and tensor2tensor had equal results. Moreover, the tensor2tensor model failed to train on the Persian dataset.

TODO: add exact match accuracy

| Target language | Polyglot baseline | seq2seq | tensor2tensor |
| --- | --- | --- | --- |
| hy-Armn | 0.64 | 0.46 | 0.47 |
| ru-Cyrl | 0.56 | 0.24 | 0.37 |
| el-Grek | 0.90 | 0.52 | 0.52 |
| fa-Arab | - | 0.49 | - |

Sample outputs

For each target language, and thus for each model, we show a few examples where the model did well (the top candidate was the correct output) and a few where it did poorly (the correct output was not even among the top 3 candidates); in the latter case we also include the actual correct output.

hy-Armn

| Source text | Model outputs | Correct output |
| --- | --- | --- |
| fazlian | ֆազլյան, ֆասլյան, ֆացլյան | ֆազլյան |
| chukhajyan | չուխաջյան, չուխայան, ճուխաջյան | չուխաջյան |
| breslin | բրեսլին, բրեզլին, բրեսլեն | բրեսլին |
| gobelyan | գոբելյան, գոբիլյան, գոպելյան | կոպելյան |
| bizet | բիզեթ, բիզետ, բիսեթ | բիզե |
| chkheidze | չկխեյձե, չկխիձե, չխայձե | չխեիձե |

ru-Cyrl

| Source text | Model outputs | Correct output |
| --- | --- | --- |
| afanasyeva | афанасьева, афанасиева, афанасева | афанасьева |
| vishnevskiĭ | вишневский, вышневский, вишнёвский | вишневский |
| edward | эдвард, эдуард, эдуорд | эдвард |
| suzdal | суздал, сюздаль, сюздал | суздаль |
| fargère | фарджер, фаргер, фарджир | фаржер |
| wolkenstain | волкенштайн, волькенштайн, уолкенштайн | волькенштейн |

el-Grek

| Source text | Model outputs | Correct output |
| --- | --- | --- |
| kioussis | κιούσης, κιούσσης, κιούσις | κιούσης |
| papastathopoulos | παπασταθόπουλος, παπασθαθόπουλος, παπασταθώπουλος | παπασταθόπουλος |
| denzel | ντένζελ, ντένσελ, ντάνζελ | ντένζελ |
| nissiotis | νισιώτης, νισσιώτης, νυσιώτης | νησιώτης |
| dallas | ντάλλας, ντέιλας, ντόλας | ντάλας |
| håkan | χάκαν, χακάν, χέκαν | χόκαν |

fa-Arab

| Source text | Model outputs | Correct output |
| --- | --- | --- |
| momayez | ممیز, ممیظ, معمیز | ممیز |
| adineh | آدینه, ادینه, آدینیه | آدینه |
| appleby | اپلبی, آپلبی, اپلیبی | اپلبی |
| ereyahi | اریاهی, اریهای, اریهی | الریاحی |
| ligt | لیگت, لیجت, لیگ | لیخت |
| entezam | انتزام, انتجام, انتزم | انتظام |

Future work

More directions

Creating models for different combinations of source and target languages, e.g. Latn ➜ ru-Cyrl, Cyrl ➜ en-Latn, Armn ➜ ru-Cyrl, etc.

More script types

Creating models for scripts that use ideograms or syllabaries, such as Chinese, Japanese and Korean, or that often omit vowels, such as Hebrew and Arabic.

Learning curve

Test and graph how dataset size correlates with accuracy.

Quality and confidence

Research how model errors and quality correlate with token frequency and length.

Context

Train models that use context, from the previous or next word up to an entire sentence.

Architectures

Train character-based NMT (Ling et al., 2015) or convolutional seq2seq (Gehring et al., 2017) models.

Massive multilingual model

Train a single general model for all the languages and transliteration directions.

References

Yuval Merhav, Stephen Ash. Design Challenges in Named Entity Transliteration. 2018. arXiv. Amazon Alexa AI, Amazon AWS AI.
