wikipedia-data

/word-frequency/ contains word-frequency analysis from full Wikipedia dumps in different languages. We used cvstools to generate it.

  • word-frecuency.es.txt - Spanish Wikipedia (2955930 words)
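
The layout of the frequency files is not described here, so the following is only a rough sketch of how one might be loaded. It assumes each line holds a word and its count separated by whitespace, which may not match the real files. `parse_frequencies` and `load_frequencies` are hypothetical helper names, not part of this repository.

```python
from typing import Dict, Iterable


def parse_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word count' lines into a dict (assumed layout, not confirmed)."""
    freqs: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines and lines that do not have exactly two fields.
        if len(parts) != 2:
            continue
        word, raw_count = parts
        try:
            count = int(raw_count)
        except ValueError:
            # Second field is not an integer; ignore the line.
            continue
        # Sum counts if the same word appears on more than one line.
        freqs[word] = freqs.get(word, 0) + count
    return freqs


def load_frequencies(path: str) -> Dict[str, int]:
    """Read a frequency file from disk, assuming UTF-8 encoding."""
    with open(path, encoding="utf-8") as handle:
        return parse_frequencies(handle)
```

Under those assumptions, `load_frequencies("word-frequency/word-frecuency.es.txt")` would return a mapping from each word to its count.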

/complex-words/ contains analysis of the less common words used in different Wikipedias. These lists can be used as blacklists to filter out words that are complex, non-native, or contain unusual character combinations. Each language has a different word frequency limit.

  • complex.es.txt - Spanish Wikipedia, words with 80 or fewer occurrences (2827258 words)
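
As an illustration of the blacklist idea, here is a minimal sketch. It assumes the word is the first whitespace-separated token on each line of a complex-words file, and that matching should be case-insensitive; neither is confirmed by this repository. The function names are hypothetical, and `select_complex` only shows how a per-language limit (80 for Spanish) might be applied to a word-to-count mapping.

```python
from typing import Dict, Iterable, List, Set


def select_complex(freqs: Dict[str, int], limit: int) -> Set[str]:
    """Return words whose count is at or below the per-language limit."""
    return {word for word, count in freqs.items() if count <= limit}


def load_blacklist(lines: Iterable[str]) -> Set[str]:
    """Build a blacklist from lines, taking the first token as the word (assumed)."""
    blacklist: Set[str] = set()
    for line in lines:
        parts = line.split()
        if parts:  # ignore blank lines
            blacklist.add(parts[0].lower())
    return blacklist


def remove_complex(words: Iterable[str], blacklist: Set[str]) -> List[str]:
    """Keep only the words that are not in the blacklist (case-insensitive)."""
    return [word for word in words if word.lower() not in blacklist]
```

For example, with a blacklist built from the lines of `complex-words/complex.es.txt`, `remove_complex(tokens, blacklist)` would drop every token that appears in that file.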