Use this repository to split compound words in a line of text into a collection of dictionary words.
Dependencies are listed in requirements.txt. You can install all requirements using:
pip install -r requirements.txt
If you are using spacy, download its model like this:
python -m spacy download en_vectors_web_lg
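As a quick sanity check (purely illustrative, not part of the package), you can confirm that the model loads correctly:

import spacy  # assumes spacy is already installed via requirements.txt

nlp = spacy.load('en_vectors_web_lg')  # raises OSError if the model was not downloaded
print(nlp('together').vector.shape)    # a non-empty vector shape confirms the word vectors are available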
decompound.sentence_to_words(sentence: str, preferred_words: List[str] = [], translate: bool = False,
                             use_common: bool = True, top_limit: int = 0) -> dict
Use this function to tokenize sentences that contain compound words.
Parameters:
----------
sentence: str
input string to be tokenized
preferred_words: list
list of preferred words; useful when certain words are more likely to occur in the input text.
e.g. 'together' can also be split into 'to get her'; including 'together' in this list ensures it is not split further
(see the sketch after the usage examples below).
If you do not use this parameter, the spacy dependency can be removed.
translate: bool
set to True if the input contains non-English words, since they must be translated before splitting
use_common: bool
if True, returns only the word combinations that are common in English; otherwise returns all valid combinations
top_limit: int
if 0, returns all valid combinations found; otherwise returns only the top 'top_limit' combinations
Returns:
--------
result: dict mapping each word in the sentence to a list of its possible word combinations,
e.g. 'sorttimetogether kid' => {'sorttimetogether': [('sort', 'time', 'together'),
('sort', 'time', 'to', 'get', 'her')], 'kid': ['kid']}
import decompound
decompound.sentence_to_words('sorttimetogether kid!')
output: {'sorttimetogether': [('sort', 'time', 'together'), ('sort', 'time', 'to', 'get', 'her')], 'kid': ['kid']}
decompound.sentence_to_words('youaregreat', use_common=False)
decompound.sentence_to_words('forgetthatpage', top_limit=1)
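A minimal sketch combining the parameters described above (the exact contents and ordering of the returned combinations are assumptions based on the example output shown earlier, not verified results):

import decompound

# Keep 'together' whole instead of also allowing the 'to get her' split.
result = decompound.sentence_to_words('sorttimetogether kid!', preferred_words=['together'])

# Pick the first combination returned for each word and print the split.
for word, combos in result.items():
    best = combos[0]
    # combos may hold tuples of sub-words or plain strings (see the 'kid' example above).
    split = ' '.join(best) if isinstance(best, tuple) else best
    print(word, '->', split)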