Yet another text normalization project for automatic speech recognition of the Croatian language.
This code is quick-and-dirty and a work in progress.
The general workflow is similar to Google Sparrowhawk, but it is written in pure Python:
- text is tokenized by spaces
- each token is split into subtokens, each containing a single character type (alpha, number, rest)
- each subtoken is classified into one of the basic classes
- tokens are grouped into higher-level classes containing several subtokens (e.g. time contains two numbers and a separator)
- every final token is verbalized to its final form
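
The first two steps above can be sketched roughly as follows. This is only an illustration of the idea, not the project's actual code: the names `char_type`, `split_subtokens` and `tokenize` are hypothetical, and the real implementation may treat character types differently.

```python
from itertools import groupby


def char_type(ch):
    """Return the coarse character type used for splitting (assumed scheme)."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "number"
    return "rest"


def split_subtokens(token):
    """Split one token into runs of characters that share a character type."""
    return [(kind, "".join(chars)) for kind, chars in groupby(token, key=char_type)]


def tokenize(text):
    """Split text on whitespace, then split every token into subtokens."""
    return [split_subtokens(token) for token in text.split()]


print(tokenize("u 12:30 sati"))
```

Here `12:30` becomes three subtokens (number, rest, number), which is the shape a later grouping step could recognize as a time.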
- unknown - symbol of unknown type
- ignore - silent symbol (e.g. punctuation)
- word - regular words
- number - numbers
- time - time string (number, separator, number)
- abbreviation - abbreviation from a list in the hrvatski.py file
- symbol - symbol that can be pronounced (e.g. percent, degree, @)
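
A minimal sketch of how subtokens might be mapped to these classes and then grouped into a time token is shown below. All names and lookup tables here (`ABBREVIATIONS`, `PRONOUNCEABLE_SYMBOLS`, `SILENT_SYMBOLS`, `TIME_SEPARATORS`, `classify`, `group`) are hypothetical placeholders; the real lists live in hrvatski.py and may look quite different.

```python
# Hypothetical sample tables, NOT the contents of hrvatski.py.
ABBREVIATIONS = {"dr", "npr"}
PRONOUNCEABLE_SYMBOLS = {"%", "°", "@"}
SILENT_SYMBOLS = {".", ",", "!", "?"}
TIME_SEPARATORS = {":"}


def classify(text):
    """Assign one subtoken to a basic class."""
    if text.isdigit():
        return "number"
    if text.isalpha():
        return "abbreviation" if text.lower() in ABBREVIATIONS else "word"
    if text in PRONOUNCEABLE_SYMBOLS:
        return "symbol"
    if text in SILENT_SYMBOLS:
        return "ignore"
    return "unknown"


def group(subtokens):
    """Merge number, separator, number into a single time token; otherwise
    return each subtoken paired with its basic class."""
    classes = [classify(s) for s in subtokens]
    if (
        len(subtokens) == 3
        and classes[0] == "number"
        and classes[2] == "number"
        and subtokens[1] in TIME_SEPARATORS
    ):
        return [("time", "".join(subtokens))]
    return list(zip(classes, subtokens))


print(group(["12", ":", "30"]))
print(group(["25", "°", "C"]))
```

Checking the higher-level pattern before falling back to per-subtoken classes keeps the separator in `12:30` from being reported as an unknown symbol.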
- inflection (currently everything in nominative)
- list of symbols
- certain symbols can be both silent and pronounceable (e.g. minus)
- list of abbreviations
- certain abbreviations can also be regular words (e.g. single letters followed by period)
- decimal numbers