jiwer gives an error when passed a very long list of strings #83

yashk2000 · 2023-10-18T12:29:54Z

Issue

When passing a very long list of strings (>350k strings) as the reference and hypothesis, jiwer gives the following error:

chr() arg not in range(0x110000)

What's been tried:

Calculating wer on individual list elements - this works successfully with no error
Splitting the large lists into smaller chunks - this works successfully with no error
Passing the entire list to another library such as fastwer - this works successfully with no error

The error only seems to happen when the entire long list is passed into jiwer.
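For anyone hitting the same error, the chunking workaround can be sketched as below. This is a minimal illustration in plain Python, not jiwer's implementation: `word_errors`, `chunk_counts`, and `chunked_wer` are hypothetical helper names, and the edit distance is a simple stand-in. The sketch assumes the corpus-level WER is total word errors divided by total reference words, so summing the per-chunk counts should give the same value as one pass over the full list.

```python
from typing import List, Sequence, Tuple


def word_errors(ref_words: Sequence[str], hyp_words: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def chunk_counts(refs: List[str], hyps: List[str]) -> Tuple[int, int]:
    """Return (word errors, reference words) for one chunk of sentence pairs."""
    errors = 0
    ref_total = 0
    for ref, hyp in zip(refs, hyps):
        ref_words = ref.split()
        errors += word_errors(ref_words, hyp.split())
        ref_total += len(ref_words)
    return errors, ref_total


def chunked_wer(refs: List[str], hyps: List[str], chunk_size: int = 10_000) -> float:
    """Aggregate error counts chunk by chunk, then divide once at the end."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_errors = 0
    total_ref_words = 0
    for start in range(0, len(refs), chunk_size):
        stop = start + chunk_size
        errors, ref_total = chunk_counts(refs[start:stop], hyps[start:stop])
        total_errors += errors
        total_ref_words += ref_total
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_ref_words
```

Averaging the per-chunk WER values instead would weight short and long chunks equally, which is why the sketch sums the counts first. With jiwer itself, the per-chunk counts could presumably be taken from the output of `jiwer.process_words` rather than from the stand-in above.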

Additional Context

It seems like the vocabulary in the _word2char function isn't built properly. After words from the first N sentences in the list are added, words from the rest of the sentences do not seem to become part of the vocabulary. This results in the "chr() arg not in range(0x110000)" error when these lines are executed.
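To make the suspected mechanism concrete, here is a minimal sketch of the general technique of mapping each word to a single character so that a character-level edit distance can operate on words. It is an assumption about the approach, not a copy of jiwer's `_word2char`; the names `words_to_chars` and `chr_is_valid` are hypothetical. It also shows the limit behind the error message: `chr()` only accepts integers from 0 up to, but not including, 0x110000.

```python
from itertools import chain
from typing import List, Tuple


def words_to_chars(
    reference: List[List[str]], hypothesis: List[List[str]]
) -> Tuple[List[str], List[str]]:
    """Encode every distinct word as one character via its vocabulary index."""
    # Build the vocabulary from ALL sentences on both sides.
    all_words = chain.from_iterable(chain(reference, hypothesis))
    vocabulary = sorted(set(all_words))
    word_to_index = {word: index for index, word in enumerate(vocabulary)}

    # chr() raises ValueError if an index is outside range(0x110000).
    ref_chars = ["".join(chr(word_to_index[w]) for w in s) for s in reference]
    hyp_chars = ["".join(chr(word_to_index[w]) for w in s) for s in hypothesis]
    return ref_chars, hyp_chars


def chr_is_valid(index: int) -> bool:
    """Return True if chr() accepts this integer."""
    try:
        chr(index)
    except ValueError:
        return False
    return True
```

Under this sketch, an index would have to be negative or at least 0x110000 (1,114,112) for `chr()` to fail, so the size of the vocabulary and the way indices are assigned are the first things worth checking.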

Jiwer version - v3.0.3

nikvaessen · 2023-10-18T19:13:30Z

Thanks for reporting! Would you be able to share the length of the vocabulary object when generated from your input?

yashk2000 · 2023-10-19T06:26:31Z

Yep, it's 4484.

nikvaessen · 2023-10-19T22:29:15Z

I cannot reproduce it with the following toy data :(

import random
import string

from typing import List

import jiwer


def random_word(low=2, high=10, rng=random.Random()) -> str:
    word = ""

    for i in range(rng.randint(low, high + 1)):
        word += rng.choice(string.ascii_lowercase)

    return word


def generate_sentence(vocabulary: List[str], low=1, high=12, rng=random.Random()):
    sentence = []

    for i in range(rng.randint(low, high + 1)):
        sentence.append(rng.choice(vocabulary))

    return " ".join(sentence)


NUM_SENTENCE = 500_000
NUM_WORDS = 5000

print('generating vocab...')
vocabulary = list(set([random_word() for _ in range(NUM_WORDS)]))
print(len(vocabulary))

print("generating reference...")
ref = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("generating hypotheses...")
hyp = [generate_sentence(vocabulary) for _ in range(NUM_SENTENCE)]
print("calculating wer...")
print(jiwer.wer(ref, hyp))

Can you share the word which fails to be included in the vocabulary?

yashk2000 · 2023-10-20T05:03:26Z

The words which are not included are normal English words. Every word from some sentences is missing, for example "Australia", "he", "run", etc.

Some sentences in my list also include numbers like "1", "10", and so on, and can include non-English characters at times too. Could this be a potential cause of the issue?

nikvaessen · 2023-10-20T09:32:27Z

I think the size is not an issue, I think it's a specific sentence-pairing which fails. When you tested chunks of the dataset, did those chunks still span the entire range of reference/hypothesis pairs?

Also, do you use a custom transform, or do you use the default?
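One way to check whether a specific sentence pairing is responsible is to score every pair on its own and record the ones that raise. The sketch below is only an illustration: `find_failing_pairs` is a hypothetical helper, and `score` is any callable taking one reference and one hypothesis string, for example `jiwer.wer`.

```python
from typing import Callable, List, Sequence, Tuple


def find_failing_pairs(
    refs: Sequence[str],
    hyps: Sequence[str],
    score: Callable[[str, str], float],
) -> List[Tuple[int, str]]:
    """Return (index, error description) for every pair on which score() raises."""
    failures = []
    for index, (ref, hyp) in enumerate(zip(refs, hyps)):
        try:
            score(ref, hyp)
        except Exception as exc:  # broad on purpose: we only want to locate failures
            failures.append((index, repr(exc)))
    return failures
```

If this returns an empty list while the full-list call still fails, the problem is more likely in how the whole batch is processed than in any single pair.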

yashk2000 · 2023-10-30T04:20:05Z

@nikvaessen when I test chunks, the chunks do span the entire range of the pairs. I have also tried calculating WER by looping over one pair at a time, and that also works.

I'm using the default transform.
