jiwer gives an error when passed a very long list of strings #83
Comments
Thanks for reporting! Would you be able to share the length of the vocabulary object when generated from your input?
Yep, it's 4484.
I cannot reproduce it with the following toy data :(

import random
import string
from typing import List

import jiwer


def random_word(low=2, high=10, rng=random.Random()) -> str:
    # random.randint is inclusive on both ends, so no "+ 1" is needed
    word = ""
    for _ in range(rng.randint(low, high)):
        word += rng.choice(string.ascii_lowercase)
    return word


def generate_sentence(vocabulary: List[str], low=1, high=12, rng=random.Random()) -> str:
    # join between `low` and `high` randomly chosen vocabulary words
    sentence = []
    for _ in range(rng.randint(low, high)):
        sentence.append(rng.choice(vocabulary))
    return " ".join(sentence)


NUM_SENTENCE = 500_000
NUM_WORDS = 5000

print("generating vocab...")
# duplicates are dropped, so the vocabulary may hold fewer than NUM_WORDS words
vocabulary = list(set(random_word() for _ in range(NUM_WORDS)))
print(len(vocabulary))
print("generating reference...")
ref = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("generating hypotheses...")
hyp = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("calculating wer...")
print(jiwer.wer(ref, hyp))

Can you share the word which fails to be included in the vocabulary?
The words which are not included are ordinary English words. Words from entire sentences are missing, like "Australia", "he", "run", etc. Some sentences in my list also include numbers like "1", "10", and so on, and can include non-English characters at times too. Could this be a potential cause of the issue?
I don't think the size is the issue; I think a specific sentence pairing fails. When you tested chunks of the dataset, did those chunks still span the entire range of reference/hypothesis pairs? Also, do you use a custom transform, or do you use the default?
@nikvaessen when I test chunks, the chunks do span the entire range of the pairs. I have also tried computing the WER by looping over one pair at a time, and that works too. I'm using the default transform.
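For anyone using the pair-by-pair workaround above: a corpus-level WER is normally the total number of word errors divided by the total number of reference words, which is not the same as averaging the per-pair scores. Below is a minimal, standard-library-only sketch of that accumulation. It is not jiwer's implementation; `word_edit_distance` and `corpus_wer` are hypothetical helper names, and plain whitespace splitting is assumed in place of jiwer's default transform.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Accumulate errors pair by pair, then divide once at the end."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_errors = 0
    total_ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_errors += word_edit_distance(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    return total_errors / total_ref_words


refs = ["he can run", "hello"]
hyps = ["he run", "hello world"]
score = corpus_wer(refs, hyps)
print(score)
```

In this toy input there are 2 errors over 4 reference words, whereas the mean of the two per-pair scores (1/3 and 1/1) would be noticeably higher, so the two ways of aggregating should not be mixed up when comparing against a single call on the full list.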
Issue
When passing a very long list of strings (>350k strings) as the reference and hypothesis, jiwer gives the following error:
chr() arg not in range(0x110000)
What's been tried:
The error only seems to happen when the entire long list is passed into jiwer.
Additional Context
It seems like the vocabulary in the _word2char function isn't built properly. After adding words from the first N sentences in the list, words from the rest of the sentences do not seem to be part of the vocabulary. This results in the chr() arg not in range(0x110000) error when these lines are executed.
jiwer version: v3.0.3
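To make the Additional Context easier to follow, here is a rough sketch of the general word-to-character technique: every distinct word gets an integer index, and each sentence is encoded as a string of chr(index) characters so that a character-level edit distance can be reused on words. This is an illustration only and not jiwer's actual _word2char code; `words_to_chars` and `MAX_CODE_POINT` are hypothetical names. The one firm fact it relies on is that Python's chr() only accepts values from 0 to 0x10FFFF.

```python
from itertools import chain
from typing import Dict, List, Tuple

MAX_CODE_POINT = 0x10FFFF  # largest value accepted by chr()


def words_to_chars(
    reference: List[List[str]], hypothesis: List[List[str]]
) -> Tuple[List[str], List[str], Dict[str, int]]:
    """Hypothetical helper: encode each distinct word as one character."""
    # Build the vocabulary from all sentences on both sides.
    vocabulary = sorted(set(chain(*reference, *hypothesis)))
    if len(vocabulary) > MAX_CODE_POINT + 1:
        raise ValueError("vocabulary too large to encode with chr()")

    word_to_index = {word: index for index, word in enumerate(vocabulary)}

    def encode(sentence: List[str]) -> str:
        return "".join(chr(word_to_index[word]) for word in sentence)

    return (
        [encode(sentence) for sentence in reference],
        [encode(sentence) for sentence in hypothesis],
        word_to_index,
    )


ref_chars, hyp_chars, mapping = words_to_chars(
    [["he", "can", "run"]], [["he", "run"]]
)

# Any index above MAX_CODE_POINT triggers the error quoted in this issue.
message = ""
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as error:
    message = str(error)
print(message)
```

Under this scheme a vocabulary of a few thousand words stays far below the limit, so an out-of-range chr() argument would point to the index values themselves rather than the vocabulary size.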