This repository contains Dhopadhola and English sentences that can be used for Machine Translation. The text comes from several domains and was scraped from different sources online and in print media.
I created it as part of my submission for the AI4D Language Dataset Challenge Round 2. My submission was not selected, but I have decided to make the data open source for anyone to use, as that was my initial goal and that of the challenge.
NLP, Machine Translation, Africa, Uganda
This dataset was created to provide Dhopadhola (ADH) to English parallel sentences, to help make services that require Natural Language Processing available to Dhopadhola speakers.
The dataset can be used for Machine Translation purposes. It consists of 2484 parallel (Dhopadhola and English) sentences from different domains and 3386 monolingual Dhopadhola sentences. Both Supervised and Semi-supervised MT can utilise this dataset.
The dataset can also be used to study transfer learning in related African languages as it is closely related to Dholuo spoken in Kenya & Tanzania, Acholi, Lango and Alur in Uganda and other Luo languages.
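As a rough illustration of the Machine Translation use, here is a minimal sketch of turning two aligned lists of lines into sentence pairs and holding some out for evaluation. It assumes the parallel data can be read as one sentence per line, with the Dhopadhola and English lines in the same order; check the datasheet for the actual layout. The function names and the sample strings are placeholders for illustration, not part of this repo.

```python
import random


def pair_sentences(adh_lines, en_lines):
    """Align two line-per-sentence lists into (Dhopadhola, English) pairs."""
    if len(adh_lines) != len(en_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = []
    for adh, en in zip(adh_lines, en_lines):
        adh, en = adh.strip(), en.strip()
        if adh and en:  # skip pairs where either side is empty
            pairs.append((adh, en))
    return pairs


def train_test_split(pairs, test_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and hold out a fraction for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder strings standing in for lines read from the data files.
adh_lines = ["adh sentence 1\n", "adh sentence 2\n", "\n", "adh sentence 4\n"]
en_lines = ["en sentence 1\n", "en sentence 2\n", "en sentence 3\n", "en sentence 4\n"]

pairs = pair_sentences(adh_lines, en_lines)
train, test = train_test_split(pairs, test_fraction=0.25)
```

With only 2484 parallel sentences, a small held-out set and a fixed seed make results easier to compare between runs.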
Dhopadhola is a very low-resourced language; very few resources are publicly available on the internet or even in print media. This dataset will help increase the availability of Dhopadhola in digital media: once the task it is intended for (Machine Translation) is implemented, more resources will be translated into the language, and native speakers will be incentivized to use it online, e.g. on social media, because non-speakers can get translations.
Get the most up-to-date information from [the datasheet](<./Clean Language Data/Ogayo_documentation_2.pdf>).
This repo contains 3 main folders of interest.
Contains all the text combined from the different source files. Datasheets expounding on the data are also available.
Contains sentences in their individual source files. The data is not entirely raw, as some cleaning has already been done. If you need the webpage or document without any form of manipulation, let me know.
Jupyter Notebooks that I used to scrape and clean the data. They need some clean-up though.
- Clone this repo to your local machine using
https://github.com/Pogayo/ADH-EN_MT_Dataset
To get started...
- Option 1
  - 🍴 Fork this repo!
- Option 2
  - 👯 Clone this repo to your local machine using
    https://github.com/Pogayo/ADH-EN_MT_Dataset
- HACK AWAY! 🔨🔨🔨
- 🔃 Create a new pull request
- We are a small team. Join us and let's put Africa on the NLP Map together!
I am in the process of setting up a wallet. Feel free to reach out to me so that I can give you other payment details in the meantime.
This work is licensed under a Creative Commons Attribution 4.0 International License.