Commit

Merge pull request #19 from common-voice/sentence-collector-update
Update Sentence Collector references to point to actual CV website
jessicarose authored May 21, 2024
2 parents 6c7d292 + 0d840ac commit a0bebea
Showing 2 changed files with 39 additions and 36 deletions.
36 changes: 17 additions & 19 deletions language/text-corpus/README.md
@@ -1,14 +1,14 @@
# 📝 Text Corpus

## Our purpose
## Our purpose

Collect or generate text corpus under public domain licence that can be read by people to facilitate their voice donations.

#### Who we are
### Who we are

We are a community of text collectors and creators, always looking for places with text corpora we can extract and process so it can be transformed into short and simple sentences for people to read.

#### What’s success
### What’s success

Generate as many sentences as possible in our languages. Having more sentences allows contributors to donate more hours of voice data.

@@ -19,53 +19,52 @@ Generate as many sentences as possible in our languages. Having more sentences a

⚠️ _You will need at least 5000 validated sentences to have your language enabled for voice contributions on our voice collection site._

#### How to join
### How to join

Anyone can join this community. Join our [discourse forums](https://discourse.mozilla.org/c/voice/) or our [matrix chat](https://chat.mozilla.org/#/room/#common-voice:mozilla.org), introduce yourself and jump into our sentence tools right away.

#### What we do
### What we do

**Sentence extraction**
#### Sentence extraction

We have developed [a tool to extract sentences](https://github.com/Common-Voice/cv-sentence-extractor) from large sources of public domain text, with a focus easy-to-read corpus and Wikipedia.

This is the easiest and fastest way to get more than a million sentences as soon as possible for your language.

ℹ️ _Please read_ [_the tool documentation_](https://github.com/Common-Voice/cv-sentence-extractor#common-voice-sentence-extractor) _on how to generate specific rules for your language._
ℹ️ _Please read [the tool documentation](https://github.com/Common-Voice/cv-sentence-extractor#common-voice-sentence-extractor) on how to generate specific rules for your language._

⚠️ _Important: Due to legal reasons Mozilla needs to be the one running the final extraction, so please don’t do any manual processing to the resulting extraction during your tests. We can apply manual clean-up after the final version is generated by Mozilla._

🔨 _Skills required to help: Command line usage and git, familiar with regular expressions._

**Sentence collection**
#### Sentence collection

We have also created a [sentence collection tool](https://commonvoice.mozilla.org/sentence-collector/#/) that allows contributors to collect and validate sentences created by the community. You can use this tool also to import and clean-up small-to-medium-sized public domain corpus you have found or collected.
We have also created a [sentence collection tool](https://commonvoice.mozilla.org/en/write) that allows contributors to collect and validate sentences created by the community.

ℹ️ _Please read_ [_the collector how-to_](https://commonvoice.mozilla.org/sentence-collector/#/how-to) _before using this tool and check the_ [_community guidelines on how to validate sentences_](https://discourse.mozilla.org/t/discussion-of-new-guidelines-for-uploaded-sentence-validation/37718).
ℹ️ _Please check the [community guidelines on how to validate sentences](https://discourse.mozilla.org/t/discussion-of-new-guidelines-for-uploaded-sentence-validation/37718)._

🔨 _Skills required to help: Strong grammar knowledge of the target language you are contributing to._

**Large corpus validation**
#### Large corpus validation

If you have found an existing public domain corpus bigger than 100K sentences, we have an independent process to handle it, since we understand that manual validation using the sentence collector is not ideal.
If you have found an existing public domain corpus bigger than 100K sentences, we have an independent process to handle it, since we understand that manual validation using the sentence collector is not ideal. You can use the [bulk upload submission](https://commonvoice.mozilla.org/en/write).

ℹ️ _Please create a new topic on_ [_our_ ](https://discourse.mozilla.org/c/voice/)_discourse, so we can evaluate if your corpus fits the licence and size requirements to run this process._
ℹ️ _Please create a new topic on [our](https://discourse.mozilla.org/c/voice/) Discourse, so we can evaluate if your corpus fits the licence and size requirements to run this process._

🔨 _Skills required to help: Expertise processing and cleaning up text, linguistics/language expertise to check the quality of the resulting sentences._

**Tooling development**
#### Tooling development

Contributors also develop, maintain and update the sentence extractor and collector code.
Contributors also develop, maintain and update the sentence extractor.

* Sentence Extractor: 🐞 [Open issues](https://github.com/Common-Voice/cv-sentence-extractor/projects/1?fullscreen=true) - 🔨 _Skills needed: Rust_
* Sentence Collector: 🐞 [Open issues](https://github.com/Common-Voice/sentence-collector/projects/2?fullscreen=true) - 🔨 _Skills needed: React, JavaScript, Node.js_

#### Roles
### Roles

These are some roles you can take as part of this community.

* Text searcher - Find and connect with sources and organizations that have or are willing to donate text corpus under public domain licence.
* Text processor - Cleaning up the raw text corpus to apply [our sentences requirements](https://common-voice.github.io/sentence-collector/#/how-to).
* Text processor - Cleaning up the raw text corpus to apply [our sentences requirements](https://commonvoice.mozilla.org/en/guidelines).
* Text creator - Generate your own sentences and release them under public domain.
* Validator - Help validate and review existing cleaned-up sentences.
* Mobilizer - Help people in the community to get started and keep contributing.
@@ -79,4 +78,3 @@ These are some roles you can take as part of this community.
* [Common Voice project announcements](https://discourse.mozilla.org/tags/c/voice/announcements).

💬 If your language already exists on Common Voice, make sure you [check and join the local discourse](https://voice.mozilla.org/about#get-involved) and matrix room. If that’s not the case, please create a new topic [on discourse](https://discourse.mozilla.org/c/voice/239) asking for one to be created.

39 changes: 22 additions & 17 deletions sub_pages/text.md
@@ -1,36 +1,41 @@
# 📝 Text Corpus
Contributors and collaborators help to develop text corpus from original and new sources that are licensed under creative commons zero (CC0).

You can use a variety of methods such as;
Contributors and collaborators help to develop text corpus from original and new sources that are licensed under creative commons zero (CC0).

* [sentence collector](https://commonvoice.mozilla.org/sentence-collector/#/how-to) to contribute CC0 licensed content
* [bulk submission](https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission) to contribute large files of sentences in txt format
You can use a variety of methods such as:

* [sentence collector](https://commonvoice.mozilla.org/en/write) to contribute CC0 licensed content
* [bulk submission](https://commonvoice.mozilla.org/en/write) to contribute large files of sentences in txt format
* [sentence extractor](https://github.com/Common-Voice/cv-sentence-extractor) from large sources of public domain text, with a focus easy-to-read corpus and Wikipedia.

⚠️ _ Mozilla Common Voice datasets are released under a CC0 “No Rights Reserved” License and are part of the public domain. This means that works subject to copyright cannot be added to Common Voice datasets. But some copyright owners are willing to make a [CC0 waiver](https://common-voice.github.io/community-playbook/sub_pages/cc0waiver_process.html), dedicating their work to the public domain so that it can be contributed to Common Voice.
⚠️ Mozilla Common Voice datasets are released under a CC0 “No Rights Reserved” License and are part of the public domain. This means that works subject to copyright cannot be added to Common Voice datasets. But some copyright owners are willing to make a [CC0 waiver](https://common-voice.github.io/community-playbook/sub_pages/cc0waiver_process.html), dedicating their work to the public domain so that it can be contributed to Common Voice.

## Why is Sentence collection important?

Currently, Common Voice requires voice donations to be tied to sentences, by sourcing more sentences people are able to donate more hours of voice data. [Sentence Collection Bands](https://discourse.mozilla.org/t/share-your-views-nuancing-sentence-collection-requirements-new-sentence-collection-bands/93134) were introduced to support the entry-level for the voice collection stage.

## Why is Sentence collection Important?
Currently, Common Voice requires voice donations to be tied to sentences, by sourcing more sentences people are able to donate more hours of voice data. Sentence Collection Bands](https://discourse.mozilla.org/t/share-your-views-nuancing-sentence-collection-requirements-new-sentence-collection-bands/93134) were introduced to support the entry-level for the voice collection stage.
* 5,000 sentences _allow_ 5,5 hrs of voice
* 9,000 sentences _allow_ 10 hrs of voice
* 90,000 sentences _allow_ 100 hrs of voice
* 1,800,000 sentences _allow_ 2000 hrs of voice
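
The bands above can be read as a simple lookup. The sketch below is illustrative only and is not part of the Common Voice codebase: the names `BANDS`, `hours_allowed` and `sentences_per_hour` are hypothetical, and treating the bands as step thresholds (a language unlocks a band once it reaches that sentence count) is an assumption rather than something the text states.

```python
# Sentence-count bands and the hours of voice they allow, copied from the list above.
# Treating them as step thresholds is an assumption made for this sketch.
BANDS = [
    (5_000, 5.5),
    (9_000, 10),
    (90_000, 100),
    (1_800_000, 2_000),
]


def hours_allowed(validated_sentences: int) -> float:
    """Return the hours of the highest band reached, or 0.0 below the first band."""
    allowed = 0.0
    for threshold, hours in BANDS:
        if validated_sentences >= threshold:
            allowed = float(hours)
    return allowed


def sentences_per_hour() -> list[float]:
    """Return the sentences-to-hours ratio implied by each band."""
    return [threshold / hours for threshold, hours in BANDS]


if __name__ == "__main__":
    for count in (4_999, 5_000, 100_000):
        print(count, "sentences ->", hours_allowed(count), "hrs")
    print([round(ratio) for ratio in sentences_per_hour()])
```

Dividing each threshold by its hours shows that the bands work out to roughly 900 sentences per hour of voice, which can serve as a rough planning figure when estimating how many sentences a target number of hours would need.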

## What should I consider when contributing Sentences?
It’s also important to ensure sentences are readable to speakers across all backgrounds.
## What should I consider when contributing sentences?

### Sentence Diversity
Phoneome, variant and domain diversity are crucial in ensuring that the dataset can understand the vastness of language; for example, some languages have Gramaitical Gender e.g Abogado and Abogada mean Male Lawyer and Female Lawyer respectively Spanish.
It’s also important to ensure sentences are readable to speakers across all backgrounds.

All sentences in the dataset can be viewed on the [Common Voice Github](https://github.com/common-voice/common-voice/tree/main/server/data). If you notice a gap regarding sources or types of content, we encourage you to add more sentences to help diversify the text corpus.
### Sentence Diversity

Phoneome, variant and domain diversity are crucial in ensuring that the dataset can understand the vastness of language; for example, some languages have Gramaitical Gender e.g Abogado and Abogada mean Male Lawyer and Female Lawyer respectively Spanish.

If you notice a gap regarding sources or types of content, we encourage you to add more sentences to help diversify the text corpus.

⚠️ _ As part of the [Common Voice 2022 Product Roadmap](https://docs.google.com/spreadsheets/d/137YOs41kbzXyai6_Kn_lu08EHAziPt4ioPUkuSFSSTc/edit?usp=sharing) we are scoping and delivering a domain-specific text corpus on the platform
⚠️ As part of the [Common Voice 2022 Product Roadmap](https://docs.google.com/spreadsheets/d/137YOs41kbzXyai6_Kn_lu08EHAziPt4ioPUkuSFSSTc/edit?usp=sharing) we are scoping and delivering a domain-specific text corpus on the platform

### Community Participation Guidelines (CPG)
It’s important that everyone and every language can have enjoyable experiences in contributing to Common Voice. Sentences that include harmful content or violations of the CPG, will be reviewed and subsequently deleted.

### Skills Needed
It’s important that everyone and every language can have enjoyable experiences in contributing to Common Voice. Sentences that include harmful content or violations of the CPG, will be reviewed and subsequently deleted.

### Skills Needed

**Sentence extraction**
🔨 _Skills required to help: Command line usage and git, familiar with regular expressions._
@@ -43,6 +48,6 @@ It’s important that everyone and every language can have enjoyable experiences

## Tooling development

Contributors also develop, maintain and update the sentence extractor and collector code.
Contributors also develop, maintain and update the sentence extractor code.

* Sentence Extractor: 🐞 [Open issues](https://github.com/Common-Voice/cv-sentence-extractor/projects/1?fullscreen=true) - 🔨 _Skills needed: Rust_
* Sentence Collector: 🐞 [Open issues](https://github.com/Common-Voice/sentence-collector/projects/2?fullscreen=true) - 🔨 _Skills needed: React, JavaScript, Node.js_
