What's the best way to get jieba to cut '是因为' into '是' and '因为'?
I was processing 影子的出现是因为有光 to tag rare words in the sentence, and it scored much rarer than expected because of the 是因为 token.
cut_for_search on 是因为 gives ['因为', '是因为'] -- how often do the jieba cut functions emit overlapping tokens like that? Is that by design? It was a little surprising, but maybe it's part of how that function is tailored for search engines; I'm not sure.
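For reference, here's a minimal repro from my session (default dictionary):

```python
import jieba

# Full sentence: '是因为' comes out as a single token.
print(jieba.lcut('影子的出现是因为有光'))
# ['影子', '的', '出现', '是因为', '有', '光']

# cut_for_search emits both the in-dictionary sub-word and the full word.
print(jieba.lcut_for_search('是因为'))
# ['因为', '是因为']
```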
Setting HMM to False still gives ['影子', '的', '出现', '是因为', '有', '光'], so the merged token doesn't seem to come from the HMM step.
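Same result with the HMM disabled:

```python
import jieba

print(jieba.lcut('影子的出现是因为有光', HMM=False))
# ['影子', '的', '出现', '是因为', '有', '光']
```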
I'm unsure whether this is a bug or by design. Is the right approach here to use a custom user dictionary limited to the top 20k words or so?
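For concreteness, here's a rough sketch of the options I'm weighing (the `suggest_freq` call is adapted from the README example, and `top20k_dict.txt` is a hypothetical trimmed dictionary file, so please correct me if I'm misusing either):

```python
import jieba

# Option A: tune frequencies so the compound loses to the split
# (adapted from the suggest_freq example in the README, which
# pairs the tuned frequency with HMM=False).
jieba.suggest_freq(('是', '因为'), tune=True)
print(jieba.lcut('影子的出现是因为有光', HMM=False))
# hoping for: ['影子', '的', '出现', '是', '因为', '有', '光']

# Option B: delete the compound entry from the dictionary outright.
jieba.del_word('是因为')

# Option C: replace the main dictionary with a trimmed one
# ('top20k_dict.txt' is hypothetical -- the top 20k words in the
# usual 'word freq [tag]' format).
# jieba.set_dictionary('top20k_dict.txt')
```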
Apologies if this is pure user error; I'm new to jieba and still figuring out all the features. Thanks for any recommendations.