Text normalization

Yet another text normalization project for automatic speech recognition of the Croatian language.

This code is quick-and-dirty and a work in progress.

The general workflow is similar to Google Sparrowhawk, but it is written in pure Python:

text is tokenized by spaces
each token is split into subtokens, each containing a single character type (alpha, number, rest)
each subtoken is classified into one of the basic classes
tokens are grouped into higher-level classes containing several subtokens (e.g. a time contains two numbers and a separator)
every final token is verbalized into its spoken form
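
The first steps of the workflow above can be sketched roughly as follows. This is only an illustration under assumptions, not the code in `hrvatski.py`: the names `char_type`, `split_subtokens`, `tokenize` and `group_time` are hypothetical, the three character types are taken from the list above, and the only higher-level class shown is a time written as number, colon, number.

```python
from itertools import groupby


def char_type(ch):
    """Return the coarse character type used for subtoken splitting."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "number"
    return "rest"


def split_subtokens(token):
    """Split one token into (text, type) runs of equal character type."""
    return [("".join(run), kind) for kind, run in groupby(token, key=char_type)]


def tokenize(text):
    """Tokenize by whitespace, then split every token into subtokens."""
    return [split_subtokens(token) for token in text.split()]


def group_time(subtokens):
    """Merge number ':' number into one ('HH:MM', 'time') item (assumed rule)."""
    out, i = [], 0
    while i < len(subtokens):
        window = subtokens[i:i + 3]
        kinds = [kind for _, kind in window]
        if kinds == ["number", "rest", "number"] and window[1][0] == ":":
            out.append(("".join(text for text, _ in window), "time"))
            i += 3
        else:
            out.append(subtokens[i])
            i += 1
    return out


if __name__ == "__main__":
    for token in tokenize("Sastanak u 12:30h"):
        print(group_time(token))
```

For the token `12:30h`, `split_subtokens` yields four subtokens (`12`, `:`, `30`, `h`), and `group_time` merges the first three into a single `time` item while leaving `h` as an `alpha` subtoken. Verbalization is left out here because it depends on the Croatian word lists in the repository.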

Token classes

inflection (currently everything in nominative)
list of symbols
certain symbols can be either silent or pronounced (e.g. minus)
list of abbreviations
certain abbreviations can also be regular words (e.g. single letters followed by a period)
decimal numbers
