Don't support Chinese? #52

Open
haha010508 opened this issue Nov 26, 2021 · 4 comments

Comments

@haha010508

No description provided.

@nikvaessen
Collaborator

Hi,

I made this example script, with references and predictions taken from https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn

from jiwer import wer, cer

ground_truths = [
    "宋朝末年年间定居粉岭围。",
    "渐渐行动不便",
    "二十一年去世。",
    "他们自称恰哈拉。",
    "局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。",
    "嘉靖三十八年,登进士第三甲第二名。",
    "这一名称一直沿用至今。",
    "同时乔凡尼还得到包税合同和许多明矾矿的经营权。",
    "为了惩罚西扎城和塞尔柱的结盟,盟军在抵达后将外城烧毁。",
    "河内盛产黄色无鱼鳞的鳍射鱼。",
]

hypothesis = [
    "宋朝末年年间定居分定为",
    "建境行动不片",
    "二十一年去世",
    "他们自称家哈",
    "菊物干寺的例子包括有口肝眼睛干照以及阴到干",
    "嘉靖三十八年登进士第三甲第二名",
    "这一名称一直沿用是心",
    "同时桥凡妮还得到包税合同和许多民繁矿的经营权",
    "为了曾罚西扎城和塞尔素的节盟盟军在抵达后将外曾烧毁",
    "合类生场环色无鱼林的骑射鱼",
]

wer_score = wer(truth=ground_truths, hypothesis=hypothesis)
print(f"{wer_score=}")

cer_score = cer(truth=ground_truths, hypothesis=hypothesis)
print(f"{cer_score=}")

Which outputs:

wer_score=1.0
cer_score=0.30201342281879195

What would the correct answer be? Does word error rate even make sense for zh-cn? As far as I know, each character is a word.

@haha010508
Author

Thanks for your reply. However, wer_score=1.0 means every word is wrong, and I don't think that is the case. For example, with ground_truths = "宋朝末年年间定居粉岭围" and hypothesis = "宋朝末年年间定居分定为", only the last 3 words are wrong, so the WER must be < 1.0. If you improve the result, I can test it again. Thanks!

@nikvaessen
Collaborator

Semantically, I think you need the character error rate rather than the word error rate, since the two should be equivalent for Chinese. If so, is the CER score of 0.3 correct?
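To illustrate why the two metrics diverge here, the following is a minimal sketch in plain Python, not jiwer's actual implementation. It assumes that a word-level metric tokenizes on whitespace, so a Chinese sentence written without spaces counts as a single word. The helpers `edit_distance` and `error_rate` are hypothetical names introduced only for this example; the sentence pair is the first one from the script above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h from the hypothesis
                prev[j - 1] + (r != h),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Number of edits divided by the number of reference tokens."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


reference = "宋朝末年年间定居粉岭围。"
hypothesis = "宋朝末年年间定居分定为"

# Assumption: word-level scoring splits on whitespace. Neither sentence
# contains a space, so each one becomes exactly one "word".
word_level = error_rate(reference.split(), hypothesis.split())

# Character-level scoring treats every character (including "。") as a token.
char_level = error_rate(list(reference), list(hypothesis))

print(f"{word_level=}")  # 1 wrong token out of 1
print(f"{char_level=}")  # 4 edits out of 12 reference characters
```

Under that assumption, a single differing character makes the whole sentence count as one wrong word, which is consistent with the reported wer_score=1.0. The character-level count includes the final "。", which the prediction lacks.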

@haha010508
Author

Yes, the CER is correct. I only focused on WER yesterday because CER is confusing for Chinese words.
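For anyone who wants to sanity-check a CER figure by hand, here is a rough sketch that pools edits over several sentences, using three pairs copied from the script above. It assumes the score is total character edits divided by total reference characters, which may differ in detail from what jiwer does internally. `pooled_cer`, `strip_punctuation`, and the `PUNCTUATION` set are hypothetical helpers for this example, not part of jiwer.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


# Three reference/prediction pairs taken from the example script in this thread.
pairs = [
    ("渐渐行动不便", "建境行动不片"),
    ("二十一年去世。", "二十一年去世"),
    ("他们自称恰哈拉。", "他们自称家哈"),
]

# Assumption: which characters count as punctuation is a choice made here
# for illustration only.
PUNCTUATION = "。,、,."


def strip_punctuation(text):
    """Drop the punctuation characters listed in PUNCTUATION."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)


def pooled_cer(pairs, normalise=None):
    """Total character edits divided by total reference characters."""
    edits = 0
    ref_chars = 0
    for ref, hyp in pairs:
        if normalise is not None:
            ref, hyp = normalise(ref), normalise(hyp)
        edits += edit_distance(ref, hyp)
        ref_chars += len(ref)
    return edits / ref_chars


raw_cer = pooled_cer(pairs)                                 # 7 edits / 21 characters
clean_cer = pooled_cer(pairs, normalise=strip_punctuation)  # 5 edits / 19 characters
print(f"{raw_cer=}")
print(f"{clean_cer=}")
```

The references in the script end with "。" while the predictions contain no punctuation, so each missing mark counts as one deletion. Whether to strip punctuation before scoring is a design choice; the second call only shows how much it shifts the number for these three pairs.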
