Option to keep documents that can't be identified #88

Uinelj · 2023-02-03T11:44:01Z

We could add an option that enables keeping documents that are not identifiable (where the classifier can't infer a document language), for further inspection.

chris-ha458 · 2023-07-18T13:57:55Z

In the case of mC4 (also called c4/multilingual) 3.1, the undetermined portion ('und') holds the documents for which, according to their langID tool cld3, the highest confidence for any language is <0.95. Since Ungoliant works differently and uses different langID tools and models (fasttext with lid176.bin, though I hope to petition to change this to lid218), the specific processes and cutoffs might have to be different.
Seeing how Ungoliant records a per-sentence confidence score, many options could be explored.
The current average confidence weighted per byte seems a very good compromise, especially compared to a simple mean.
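
To make the difference between the two aggregations concrete, here is a minimal sketch in Python. It is not Ungoliant's actual code: the function names, the `(sentence, confidence)` input shape, and the reuse of mC4's 0.95 cutoff as a default are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical input shape: one (sentence text, langID confidence) pair per sentence.
Scored = List[Tuple[str, float]]


def mean_confidence(sentences: Scored) -> float:
    """Simple mean: every sentence counts equally, whatever its length."""
    if not sentences:
        return 0.0
    return sum(conf for _, conf in sentences) / len(sentences)


def byte_weighted_confidence(sentences: Scored) -> float:
    """Average confidence where each sentence is weighted by its UTF-8 byte length."""
    weights = [len(text.encode("utf-8")) for text, _ in sentences]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * conf for w, (_, conf) in zip(weights, sentences)) / total


def bucket(sentences: Scored, threshold: float = 0.95) -> str:
    """Route a document to 'und' when its aggregated confidence is under the cutoff.

    The 0.95 default only mirrors the mC4 cutoff quoted above; a suitable
    value for fasttext-based scores is an open question.
    """
    if byte_weighted_confidence(sentences) < threshold:
        return "und"
    return "identified"


# One long, confidently identified sentence and one short, uncertain one.
doc = [("a" * 90, 0.99), ("b" * 10, 0.50)]
print(mean_confidence(doc), byte_weighted_confidence(doc), bucket(doc))
```

With a simple mean, the short uncertain sentence pulls the document score down as much as the long confident one pulls it up. Weighting per byte lets the bulk of the text dominate, so the choice of aggregation directly changes which documents would end up in an 'und' bucket.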

In any case, this would be very useful. The 'und' portion of mC4 is second only to English in quantity or byte size, and it is ripe with opportunities for humans to get involved, either to salvage data or to understand langID behaviors.

Some others and I are actively doing such salvaging, and here is an example of such efforts.

Uinelj added the enhancement New feature or request label Feb 3, 2023

Uinelj self-assigned this Feb 3, 2023
