Merge pull request #163 from eu9ene/ja_chars
Add Japanese characters
ZJaume authored Dec 17, 2024
2 parents 9004599 + aff9a83 commit c4f6b92
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions opuscleaner/filters/clean_common.py
@@ -29,6 +29,14 @@
'id': r'[a-z]',
'is': r'[abdefghijklmnoprstuvxyÁáðÐÉéÍíÓóÚúÝýÞþÆæÖö]',
'it': r'[a-zàÀèÈéÉìÌíÍîÎòÒóÓùÙúÚ]',
# http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
# Hiragana: \u3040-\u309F (Hiragana characters)
# Katakana: \u30A0-\u30FF (Katakana characters)
# Full-width roman characters and half-width katakana ( \uFF00-\uFFEF)
# Kanji: \u4E00-\u9FAF (CJK unified ideographs - Common and uncommon kanji)
# Japanese Punctuation and Symbols: \u3000-\u303F (CJK Symbols and Punctuation, including ideographic spaces, quotation marks, iteration marks, etc.)
# CJK unified ideographs Extension A - Rare kanji ( \u3400-\u4DBF )
'ja': r'[\u3040-\u309F\u30A0-\u30FF\uFF00-\uFFEF\u4E00-\u9FAF\u3000-\u303F\u3400-\u4DBF]',
'ko': r'[\uac00-\ud7af]|[\u1100-\u11ff]|[\u3130-\u318f]|[\ua960-\ua97f]|[\ud7b0-\ud7ff]',
'lt': r'[aąbcČčdeĘęĖėfghiĮįyjklmnoprsŠštuŲųŪūvzŽž]',
'lv': r'[aĀābcČčdeĒēfgĢģhiĪījkĶķlĻļmnŅņoprsŠštuŪūvzŽž]',