We could add an option that enables keeping documents that are not identifiable (where the classifier can't infer a document language), for further inspection.
In the case of mC4 (also called c4/multilingual), the undetermined portion ('und') of mC4 3.1 covers the cases where, according to their langID tool cld3, the highest confidence for any language is <0.95. Since Ungoliant works differently and uses different langID tools and models (fasttext with lid176.bin, though I hope to petition to change this to lid218), the specific processes and cutoffs might have to be different.
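To make the cutoff idea concrete, here is a minimal sketch of that kind of rule. It is not how cld3, mC4, or Ungoliant implement it: the function name `assign_language`, the `scores` mapping, and the default cutoff are hypothetical, and the right cutoff for fasttext models would probably need to be tuned separately.

```python
UND_LABEL = "und"  # label for undetermined documents, as in mC4


def assign_language(scores: dict[str, float], cutoff: float = 0.95) -> str:
    """Return the top-scoring language, or 'und' if it is below the cutoff.

    `scores` is a hypothetical mapping of language code -> confidence,
    standing in for whatever the langID model returns for a document.
    """
    if not scores:
        # No prediction at all: keep the document as undetermined.
        return UND_LABEL
    lang, confidence = max(scores.items(), key=lambda item: item[1])
    return lang if confidence >= cutoff else UND_LABEL


if __name__ == "__main__":
    print(assign_language({"en": 0.97, "fr": 0.02}))
    print(assign_language({"en": 0.60, "de": 0.30}))
```

With an option like this, documents that come back as 'und' could be written to their own output instead of being dropped.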
Since Ungoliant records a per-sentence confidence score, many options could be explored.
The current average confidence, weighted per byte, seems a very good compromise, especially compared to a simple mean.
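As an illustration of why byte weighting differs from a simple mean, here is a rough sketch. It assumes each sentence comes with one confidence score and that the weight is the sentence's UTF-8 byte length; the function names are hypothetical and this is not Ungoliant's actual code.

```python
def simple_mean(sentences: list[tuple[str, float]]) -> float:
    """Unweighted mean of per-sentence confidences."""
    if not sentences:
        return 0.0
    return sum(conf for _, conf in sentences) / len(sentences)


def byte_weighted_mean(sentences: list[tuple[str, float]]) -> float:
    """Mean of per-sentence confidences, weighted by UTF-8 byte length."""
    weights = [len(text.encode("utf-8")) for text, _ in sentences]
    total_bytes = sum(weights)
    if total_bytes == 0:
        return 0.0
    weighted = sum(w * conf for w, (_, conf) in zip(weights, sentences))
    return weighted / total_bytes


if __name__ == "__main__":
    # A short low-confidence line next to a long high-confidence one.
    doc = [
        ("ok", 0.20),
        ("This is a much longer sentence written in plain English.", 0.99),
    ]
    print("simple mean:       ", simple_mean(doc))
    print("byte-weighted mean:", byte_weighted_mean(doc))
```

In this sketch, short noisy lines (menus, single tokens) pull the simple mean down much more than the byte-weighted one, which is one reason the weighted score looks like the better basis for deciding what counts as unidentifiable.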
In any case, this would be very useful. The 'und' portion of mC4 is second only to English in quantity or byte size, and it is ripe with opportunities for humans to get involved to salvage data or understand langID behaviors.