Many NLP systems are built for one language first, but in the real world content comes in many languages.
Making your NLP system multilingual is conceptually simple, and it's getting easier and easier thanks to new tools.
What types of systems can we easily make multilingual? What are the approaches and technologies? How can we evaluate the results?
This repo was first created for a workshop at EPFL Applied Machine Learning Days 2021.
Nerses Nersesyan, Junior AI Engineer, Polixis
Adam Bittlingmayer, CEO, ModelFront
So you have an NLP system - a chat bot, a search engine, NER, a classifier... - working well for English.
And you want to make it work for other languages - or even for all languages.
Examples:
Giving the correct answer in a chat
Finding mentions of protein interactions in medical research papers
Detecting malicious comments on a social network (Facebook, Wikipedia) or malicious ads on an ads network (Google Ads)
Searching across products (eBay, Airbnb), person and company names (Facebook, Polixis), or places (Google Maps)
Common theme:
- Real-world input - The data are crawled or user-generated. We have no control over the input in production.
- Numeric output - Classification, regression, retrieval, recommendation... We are not required to generate text.
What's not a flavor of this problem?
Translation itself. Text generation like GPT-3. Grammar correction like Grammarly or LanguageTool.
We see a few common approaches:
- Manually create more labelled training data for each language ("$$$")
- Machine-translate at inference or query time ("Lazy"; see the sketch after this list)
- Fine-tune a multilingual pretrained model like BERT or LASER and hope for transfer learning ("Do nothing")
- Machine-translate the training data ("Eager")
Or combinations of multiple approaches.
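To make the "lazy" option concrete, here is a minimal sketch under two assumptions: you already have an English-only model (a public English sentiment classifier stands in for it here, just so the example runs), and an off-the-shelf open-source MT model from the Hugging Face Hub handles the translation.

```python
# Minimal sketch of the "lazy" approach: machine-translate each input to
# English at query time, then reuse an existing English-only model unchanged.

from transformers import pipeline

# Stand-in for your existing English-only model (a public English sentiment
# classifier, used here only so the example runs end to end).
english_model = pipeline("text-classification",
                         model="distilbert-base-uncased-finetuned-sst-2-english")

# Off-the-shelf open-source MT model (German -> English) from the Hugging Face Hub.
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def classify(text: str, source_lang: str) -> dict:
    """Translate non-English input, then score it with the English model."""
    if source_lang != "en":
        # In practice, pick an MT model or API per detected language.
        text = de_en(text)[0]["translation_text"]
    return english_model(text)[0]

print(classify("Das ist ein sehr hilfreicher Kommentar.", source_lang="de"))
```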
So which approach to use when?
Ease
Simplicity
Data requirements
Language support
Accuracy
Speed
Price
...
How accurate?
How many languages?
How much labelled data?
How much throughput/speed?
How much effort?
How often does the dataset update?
“The first workshop on CLIR was held in Zürich during the SIGIR-96 conference.”
Which machine translation system to use?
Open-source: Fairseq, OPUS, T5, ... They can be fine-tuned for customization and are available on Hugging Face.
Commercial APIs: Google Translate, Microsoft Translator, DeepL, ModernMT, Lingvanex, ...
Google Translate, Lingvanex
ModernMT
Lingvanex, ModernMT, unofficial APIs
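For the hosted services, the integration is typically a single API call. Here is a minimal sketch with the google-cloud-translate client library, assuming a Google Cloud project and credentials are already configured; the other providers expose similar REST or client APIs.

```python
# Minimal sketch: translating with a commercial API (Google Cloud Translation).
# Assumes the google-cloud-translate package is installed and
# GOOGLE_APPLICATION_CREDENTIALS is set; other providers work similarly.

from google.cloud import translate_v2 as translate

client = translate.Client()

result = client.translate("Ceci est un commentaire.", target_language="en")
print(result["translatedText"])           # e.g. "This is a comment."
print(result["detectedSourceLanguage"])   # e.g. "fr"
```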
For niche tasks
Schwiizertüütsch, Rumantsch, Patois arpitan
Pretrained on large monolingual datasets with hundreds of languages
...
Alemannic (Schwyzertüütsch), Rumantsch or Patois arpitan
Similar to transliteration:
- Monolingual data but no parallel data
- No standard orthography
We can bootstrap from monolingual data using back-translation, even if the only machine translation system we have for it is a bad one.
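A sketch of that bootstrapping loop, with hypothetical file names and a placeholder for whatever weak translator is available (for example the rule/dictionary mapping sketched further below):

```python
# Minimal sketch of bootstrapping synthetic parallel data from monolingual text.
# `rough_translate` stands in for any weak MT system (rules, a dictionary, a
# related-language model); file names are hypothetical.

from typing import Callable

def build_synthetic_parallel(monolingual_path: str, out_path: str,
                             rough_translate: Callable[[str], str]) -> None:
    """Pair each monolingual sentence with its rough translation.

    In standard back-translation, the rough-translated side is later used as
    the model input and the original monolingual sentence as the target.
    """
    with open(monolingual_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            sentence = line.strip()
            if sentence:
                out.write(f"{rough_translate(sentence)}\t{sentence}\n")

# Hypothetical usage: pair rough German translations with the original
# Alemannic sentences, then train a real model on the resulting TSV.
# build_synthetic_parallel("alemannic.mono.txt", "synthetic.de-als.tsv",
#                          rough_translate=my_weak_alemannic_to_german)
```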
- Using rules or dictionaries, e.g. for Alemannic:German
- Using existing systems that support languages like German and French, e.g. for Alemannic:English, which can also bridge through German
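As a toy illustration of the rules-or-dictionaries idea, here is a word-level mapping from Alemannic-like spellings to Standard German; the word list is made up for illustration, and a real system would need a far larger lexicon plus orthographic rules.

```python
# Toy sketch of dictionary/rule-based normalization from Alemannic-like text
# to Standard German. The word list is illustrative only, not a real lexicon.

ALEMANNIC_TO_GERMAN = {
    "isch": "ist",
    "e": "ein",
    "nöd": "nicht",
    "chli": "klein",
}

def normalize_to_german(sentence: str) -> str:
    """Replace known dialect tokens word by word; leave everything else as-is."""
    tokens = sentence.lower().split()
    return " ".join(ALEMANNIC_TO_GERMAN.get(tok, tok) for tok in tokens)

print(normalize_to_german("Das isch e chli Huus"))
# -> "das ist ein klein huus" (rough, but often enough for a downstream German model)
```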
We can apply these concepts to the Jigsaw Multilingual Toxic Comment Classification task, which uses Wikipedia comment datasets provided by Google.
See lab.md
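For a task like this, the "eager" approach can be sketched as follows; the file and column names are assumptions about the Kaggle CSV layout, not a prescription from lab.md.

```python
# Minimal sketch of the "eager" approach for a Jigsaw-style task:
# machine-translate the English labelled comments into a target language,
# keep the labels, and add the result to the training set.
# File and column names are assumptions about the CSV layout.

import pandas as pd
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

train = pd.read_csv("train.csv")              # assumed columns: comment_text, toxic
sample = train.sample(1000, random_state=0)   # translating everything is slow and costly

translated = sample.copy()
translated["comment_text"] = [
    en_fr(text, max_length=512)[0]["translation_text"]
    for text in sample["comment_text"]
]

# Concatenate original English and synthetic French rows; labels are unchanged.
augmented = pd.concat([train, translated], ignore_index=True)
augmented.to_csv("train.augmented.csv", index=False)
```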
It's easy and getting easier.
The translation quality does not need to be perfect.
Translating at inference or query time does not scale.
There is no optimal approach - there are trade-offs.