Which segmentation mode does TF-IDF keyword extraction use? #1022

Open
haixingpai opened this issue Aug 1, 2024 · 3 comments

Comments

@haixingpai

Does it use the default mode (precise mode)? How can I change it to full mode myself?

@manother

manother commented Aug 1, 2024 via email

@haixingpai
Author

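Email received~

One more question. I tested segmentation on the sentence “我喜欢看电视,不喜欢看电影” ("I like watching TV, I don't like watching movies"). Segmenting it directly in default mode gives these tokens: 我 / 喜欢 / 看电视 / , / 不 / 喜欢 / 看 / 电影. But when I extract keywords with TF-IDF using topK=None (that is, without limiting the number of keywords), the result does not contain single-character words such as “我” and “不”. Why is that? How can I make the word list produced by TF-IDF keyword extraction include single characters as well?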

@wfs420100

During keyword extraction, TFIDF filters out any word shorter than 2 characters (stop words are dropped in the same check):

# jieba/analyse/tfidf.py

class TFIDF(KeywordExtractor):
    ...
    def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
            ...
            if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
                continue
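
If single-character words are needed, one option is to redo the counting and scoring step outside the library with the length threshold made configurable. The following is only a minimal sketch, not jieba's own code: `tfidf_tags`, `min_len`, and the flat IDF value are hypothetical names and assumptions for illustration, and the token list is typed in by hand from the segmentation quoted in the question rather than produced by jieba.

```python
from collections import Counter


def tfidf_tags(tokens, idf_freq, median_idf, stop_words=frozenset(),
               min_len=2, top_k=None):
    """Rank tokens by TF-IDF, skipping tokens shorter than min_len.

    Hypothetical helper: it mirrors the filter shown above, but with the
    length threshold exposed as a parameter instead of hard-coded to 2.
    """
    freq = Counter()
    for tok in tokens:
        # Same shape as the library check, with 2 replaced by min_len.
        if len(tok.strip()) < min_len or tok.lower() in stop_words:
            continue
        freq[tok] += 1.0

    total = sum(freq.values())
    # Term frequency times IDF; tokens missing from the table fall back
    # to the median IDF.
    scores = {tok: n * idf_freq.get(tok, median_idf) / total
              for tok, n in freq.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Tokens copied by hand from the default-mode segmentation in the question.
tokens = ["我", "喜欢", "看电视", ",", "不", "喜欢", "看", "电影"]

# Assumption: an empty IDF table, so every token gets the same IDF of 1.0.
default_like = tfidf_tags(tokens, idf_freq={}, median_idf=1.0)
with_single_chars = tfidf_tags(tokens, idf_freq={}, median_idf=1.0,
                               stop_words={","}, min_len=1)

print(default_like)       # single-character tokens are filtered out
print(with_single_chars)  # single-character tokens are kept
```

With real data, the tokens could come from `jieba.cut(sentence)` (or `jieba.cut(sentence, cut_all=True)` for full mode), and the IDF values from `jieba.analyse.default_tfidf.idf_freq` and `jieba.analyse.default_tfidf.median_idf`; I believe these attributes exist in current jieba, but please check against your installed version. Note that with `min_len=1` punctuation also passes the length check, so it has to be listed in `stop_words`.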
