get indecies of I,D,S #90

Open
hallelhel opened this issue Jul 1, 2024 · 3 comments
Comments

hallelhel commented Jul 1, 2024

I am trying to get the indices of the words that were substituted, inserted, and deleted. I found alignments.alignment_chunk, and it seems good, but the indices it gives always depend on the hypothesis text.
For example, if the reference is:
" I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it "
and the hypothesis is: "for words"

The number of deleted words is right, but the deletion indices depend on the length of the hypothesis. I mean, the word at index 7 in the reference was deleted, and I didn't get it in alignment_chunk.
Is there a way to get all the indices in the sentence that were deleted?

Collaborator

nikvaessen commented Jul 2, 2024

With the following code

import jiwer

ref = "I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it"
hyp = "for words"

r = jiwer.process_words(ref, hyp)

for a in r.alignments[0]:
    print(a)

you get these alignment chunks:

AlignmentChunk(type='delete', ref_start_idx=0, ref_end_idx=5, hyp_start_idx=0, hyp_end_idx=0)
AlignmentChunk(type='equal', ref_start_idx=5, ref_end_idx=7, hyp_start_idx=0, hyp_end_idx=2)
AlignmentChunk(type='delete', ref_start_idx=7, ref_end_idx=21, hyp_start_idx=2, hyp_end_idx=2)

meaning that in the reference, indices 0, 1, 2, 3 and 4 are deleted, as well as indices 7, ..., 20. Note that ref_end_idx is excluded from the range.
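
As a minimal sketch of that half-open range rule, the snippet below expands the three chunks printed above into a flat list of deleted reference indices. The `Chunk` dataclass is a hypothetical stand-in that only mirrors the printed field names, so the example runs without jiwer installed; the values are copied from the output above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in mirroring the fields printed above."""
    type: str
    ref_start_idx: int
    ref_end_idx: int
    hyp_start_idx: int
    hyp_end_idx: int


# Values copied from the three chunks printed above.
chunks = [
    Chunk("delete", 0, 5, 0, 0),
    Chunk("equal", 5, 7, 0, 2),
    Chunk("delete", 7, 21, 2, 2),
]

deleted = []
for c in chunks:
    if c.type == "delete":
        # range() excludes its stop value, matching the exclusive ref_end_idx.
        deleted.extend(range(c.ref_start_idx, c.ref_end_idx))

print(deleted)
```

The list contains 19 indices: 0 through 4 and 7 through 20. Index 7 is included, and none of the indices depend on the length of the hypothesis.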

This can also be observed with a call to jiwer.visualize_alignment:

import jiwer

ref = "I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it"
hyp = "for words"

r = jiwer.process_words(ref, hyp)
print(jiwer.visualize_alignment(r, show_measures=False))

which returns

sentence 1
REF: I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it
HYP: * ****** ** *** ******** for words *********** ******** *** ******* * ***** *** ************************** *** ***** *** *** *** **
     D      D  D   D        D                     D        D   D       D D     D   D                          D   D     D   D   D   D  D

@hallelhel
Author

Thanks :)
It looks like when the two sentences have only a minor match (only specific words appear in both sentences), the first word is always substituted.
For example, I have these two sentences:
ref = 'On Monday, the French newspaper Le Parisien reported that a couple who arrived with their three-year-old daughter at a hotel in Paris 15th arrondissement encountered a receptionist who refused to confirm their reservation, and even threw them out into the street while telling them: "You will not get a room in this . The family filed a complaint with the Paris police.'

hyp = 'why couple apple hotel banana with the Paris police'

The result:
substitution array - [0, 11, 20, 61]
deletion array - [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]

I want to get the indices of the words for every type (S, D, I), and it looks a bit tricky.

@nikvaessen
Collaborator

You can get the index arrays as follows:

import jiwer

ref = "On Monday, the French newspaper Le Parisien reported that a couple who arrived with their three-year-old daughter at a hotel in Paris 15th arrondissement encountered a receptionist who refused to confirm their reservation, and even threw them out into the street while telling them: You will not get a room in this. The family filed a complaint with the Paris police."
hyp = "why couple apple hotel banana with the Paris police"

r = jiwer.process_words(ref, hyp)

sub_idx = []
del_idx = []
ins_idx = []

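for a in r.alignments[0]:
    # The end index of each chunk is exclusive, matching range().
    ref_idx = range(a.ref_start_idx, a.ref_end_idx)
    if a.type == "substitute":
        sub_idx.extend(ref_idx)
    elif a.type == "delete":
        del_idx.extend(ref_idx)
    elif a.type == "insert":
        # Inserted words exist only in the hypothesis, so the reference
        # range of an insert chunk is empty; use the hypothesis indices.
        ins_idx.extend(range(a.hyp_start_idx, a.hyp_end_idx))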
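
One detail worth checking when collecting insertions: an insert chunk covers no reference words, so its reference range is empty and its position is carried by the hypothesis indices. The sketch below wraps the loop in a helper and runs it on a small made-up chunk list. `Chunk`, `collect_indices`, and the chunk values are hypothetical illustrations, not part of jiwer; they only reuse the field names shown in this thread.

```python
from collections import namedtuple

# Hypothetical stand-in with the same field names as the chunks in this thread.
Chunk = namedtuple(
    "Chunk", "type ref_start_idx ref_end_idx hyp_start_idx hyp_end_idx"
)


def collect_indices(chunks):
    """Group word indices by edit type.

    Substitutions and deletions are reported as reference indices,
    insertions as hypothesis indices.
    """
    out = {"substitute": [], "delete": [], "insert": []}
    for c in chunks:
        if c.type in ("substitute", "delete"):
            out[c.type].extend(range(c.ref_start_idx, c.ref_end_idx))
        elif c.type == "insert":
            out[c.type].extend(range(c.hyp_start_idx, c.hyp_end_idx))
    return out


# Made-up alignment for ref "a b c" against hyp "a x c d".
chunks = [
    Chunk("equal", 0, 1, 0, 1),
    Chunk("substitute", 1, 2, 1, 2),
    Chunk("equal", 2, 3, 2, 3),
    Chunk("insert", 3, 3, 3, 4),  # empty reference range
]

result = collect_indices(chunks)
print(result)
```

With real data, the same helper could be called on `r.alignments[0]`, since it only reads the five attributes named above.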

"It looks like when the two sentences have only a minor match (only specific words appear in both sentences), the first word is always substituted."

Yes, this seems expected, and I don't see the issue.
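
To illustrate why a substitution is expected here, the following is a minimal sketch of a word-level edit distance with unit costs. It is a generic textbook dynamic program written for this example, not jiwer's implementation, and the unit costs are an assumption. A hypothesis word that appears nowhere in the reference (such as "why" above) is cheaper to pair with a reference word as one substitution than to count as one insertion plus one extra deletion.

```python
def word_edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of word substitutions, deletions and insertions."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distance from an empty reference
    for i, rw in enumerate(r, 1):
        cur = [i]  # distance to an empty hypothesis: delete i words
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,               # delete rw
                cur[j - 1] + 1,            # insert hw
                prev[j - 1] + (rw != hw),  # match (0) or substitute (1)
            ))
        prev = cur
    return prev[-1]


# "x" is not in the reference: one substitution plus one deletion (cost 2)
# beats one insertion plus two deletions (cost 3).
d = word_edit_distance("a b c", "x c")
print(d)
```

Which reference word gets paired with the unmatched hypothesis word depends on how ties between equally cheap alignments are broken, so the exact substitution index can vary.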
