stitching audio samples to generate diverse positive dataset #59

Open · ljj7975 opened this issue Feb 25, 2021 · 10 comments

ljj7975 commented Feb 25, 2021

For some wake words, we only get a few dev/test samples, since a sample's transcript must exactly match the wake word.

For example, when I attempted to generate a dataset for "love you", the dev and test sets contained only two samples each.

Given that the train set contains samples whose transcripts contain at least one of the vocab words and are aligned with the audio (by MFA), we could stitch pieces of those samples together to generate synthetic wake word samples.

For example, stitching "hey baby, I saw fire" and "wow, there was a fox" to generate a sample for "hey firefox". A rough sketch of the idea is below.
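A minimal sketch of the stitching step, assuming 16 kHz mono clips and MFA word-level timestamps; the paths, timestamps, and the stitch helper itself are all hypothetical:

```python
import numpy as np
import soundfile as sf

# Cut word segments out of longer Common Voice clips using MFA
# word-level timestamps, then concatenate them into one synthetic
# wake word sample.
def stitch(segments, out_path, sample_rate=16000):
    pieces = []
    for path, start_sec, end_sec in segments:
        audio, sr = sf.read(path)
        assert sr == sample_rate  # resample upstream if rates differ
        pieces.append(audio[int(start_sec * sr):int(end_sec * sr)])
    sf.write(out_path, np.concatenate(pieces), sample_rate)

# "hey" and "fire" from "hey baby, I saw fire"; "fox" from
# "wow, there was a fox" -> one synthetic "hey firefox" sample
stitch([("cv/clip1.wav", 0.05, 0.40),
        ("cv/clip1.wav", 1.80, 2.30),
        ("cv/clip2.wav", 2.10, 2.55)],
       "stitched/hey_firefox_0.wav")
```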

ljj7975 commented Apr 4, 2021

With #66 it is possible to generate enough samples for train/dev/test.
The trained model generally reports high accuracy on the generated dev/test sets.
However, when I tested actual detection, the accuracy was not good enough: I didn't see many false positive cases, but the true positive rate was also quite low.

First of all, I need to set up a proper evaluation setting: keep the actual collected samples for dev/test, train a model on the stitched datasets, and evaluate on the real samples. I believe hey firefox might be a good choice.

To improve the true positive rate, restricting the stitching to smooth transitions between phonemes might be a good option. For example, suppose the words are hey, fire, and fox; then we pick hey from "hey F-" and fire from "-y fire f-". A sketch of this constraint follows.
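One way this constraint could be encoded, assuming MFA phone-level alignments are available; the Segment record, its fields, and the example values are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    start: float
    end: float
    first_phone: str  # first phone of the word itself
    next_phone: str   # phone right after the word in its source clip

# Join two word segments only when the phone following the first word
# (in its source utterance) matches the first phone of the next word,
# so the transition is naturally coarticulated.
def compatible(a: Segment, b: Segment) -> bool:
    return a.next_phone == b.first_phone

segments = {
    "hey":  [Segment("cv/clip1.wav", 0.05, 0.40, "HH", "F")],  # "hey F-"
    "fire": [Segment("cv/clip1.wav", 1.80, 2.30, "F", "F")],   # "-y fire f-"
    "fox":  [Segment("cv/clip2.wav", 2.10, 2.55, "F", "S")],
}
candidates = [(h, fi, fo)
              for h in segments["hey"]
              for fi in segments["fire"] if compatible(h, fi)
              for fo in segments["fox"] if compatible(fi, fo)]
```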

ljj7975 commented Apr 4, 2021

Tested both res8 and seq-lstm.
I suspected that seq-lstm would be better, but both models had accuracy so low that I couldn't tell whether one was actually better than the other.

ljj7975 commented Apr 4, 2021

Current experiment plan:
train a model with only the negative + stitched datasets, then evaluate on real samples.

ljj7975 commented Apr 4, 2021

hey_fire_fox should be able to give us initial results, but we might not have enough "real" samples for other wake words.

Unlike Common Voice, GSC is collected per word. Therefore, we can apply stitching to GSC to construct various real wake word samples (the clips must be grouped by speaker ID so each sample uses a single voice), then evaluate the models trained with Common Voice vocab stitching. A sketch of the grouping is below.
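A sketch of the speaker grouping, relying on the GSC filename convention where the prefix before _nohash_ identifies the speaker; the two-word vocab, paths, and helper name are illustrative:

```python
from collections import defaultdict

# Group GSC clips per speaker so each stitched "real" sample
# is produced from a single voice.
def group_by_speaker(paths):
    groups = defaultdict(dict)  # speaker_id -> {word: path}
    for path in paths:
        word = path.split("/")[-2]  # GSC stores one word per folder
        speaker_id = path.split("/")[-1].split("_")[0]
        groups[speaker_id][word] = path
    return groups

# Illustrative "wake word" built from two GSC words
vocab = ["go", "marvin"]
gsc_paths = ["speech_commands/go/0a7c2a8d_nohash_0.wav",
             "speech_commands/marvin/0a7c2a8d_nohash_0.wav"]
stitch_inputs = [[group[w] for w in vocab]
                 for group in group_by_speaker(gsc_paths).values()
                 if all(w in group for w in vocab)]
```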

ljj7975 commented Apr 6, 2021

Stitched datasets

Train ds - num_examples=27628, audio_length_seconds=116447.25187500067, vocab_counts=Counter({'fire': 5639, 'fox': 5211, 'hey': 5028})
Dev pos dataset - num_examples=2500, audio_length_seconds=3019.3278125000006, vocab_counts=Counter({'hey': 2500, 'fire': 2500, 'fox': 2500})
Dev neg dataset - num_examples=808, audio_length_seconds=4192.392000000006, vocab_counts=Counter({'fire': 28, 'fox': 6})
Test pos dataset - num_examples=2500, audio_length_seconds=2987.8693124999954, vocab_counts=Counter({'hey': 2500, 'fire': 2500, 'fox': 2500})
Test neg dataset - num_examples=840, audio_length_seconds=4076.9919999999975, vocab_counts=Counter({'fire': 25, 'fox': 6, 'hey': 5})

Real datasets

Train ds (unnecessary info) - num_examples=26634, audio_length_seconds=111928.04543750011, vocab_counts=Counter({'hey': 4157, 'fire': 2420, 'fox': 2124})

Dev pos dataset - num_examples=76, audio_length_seconds=126.97837499999999, vocab_counts=Counter({'hey': 76, 'fire': 76, 'fox': 76})
Dev neg dataset - num_examples=2531, audio_length_seconds=10552.52668750001, vocab_counts=Counter({'hey': 488, 'fire': 218, 'fox': 193})
Test pos dataset - num_examples=54, audio_length_seconds=94.43156250000001, vocab_counts=Counter({'hey': 54, 'fire': 54, 'fox': 54})
Test neg dataset - num_examples=2504, audio_length_seconds=10269.859437499998, vocab_counts=Counter({'hey': 411, 'fire': 244, 'fox': 211})

< seq-lstm >

Stitched datasets

Dev positive: ConfusionMatrix(tp=2498, fp=0, tn=0, fn=2)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=2496, fp=0, tn=0, fn=4)
Test negative: ConfusionMatrix(tp=0, fp=1, tn=839, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=27, fp=0, tn=0, fn=49)
Dev negative: ConfusionMatrix(tp=0, fp=46, tn=2485, fn=0)
Dev noisy positive: ConfusionMatrix(tp=23, fp=0, tn=0, fn=53)
Dev noisy negative: ConfusionMatrix(tp=0, fp=155, tn=2376, fn=0)
Test positive: ConfusionMatrix(tp=7, fp=0, tn=0, fn=47)
Test negative: ConfusionMatrix(tp=0, fp=28, tn=2476, fn=0)
Test noisy positive: ConfusionMatrix(tp=9, fp=0, tn=0, fn=45)
Test noisy negative: ConfusionMatrix(tp=0, fp=153, tn=2351, fn=0)

< res8 >

Stitched datasets

Dev positive: ConfusionMatrix(tp=1100, fp=0, tn=0, fn=1400)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=1095, fp=0, tn=0, fn=1405)
Test negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=29, fp=0, tn=0, fn=47)
Dev negative: ConfusionMatrix(tp=0, fp=3, tn=2528, fn=0)
Dev noisy positive: ConfusionMatrix(tp=23, fp=0, tn=0, fn=53)
Dev noisy negative: ConfusionMatrix(tp=0, fp=2, tn=2529, fn=0)
Test positive: ConfusionMatrix(tp=11, fp=0, tn=0, fn=43)
Test negative: ConfusionMatrix(tp=0, fp=5, tn=2499, fn=0)
Test noisy positive: ConfusionMatrix(tp=9, fp=0, tn=0, fn=45)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=2504, fn=0)
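To compare these splits at a glance (the positive-only splits carry tp/fn, the negative-only ones fp/tn), here is a small helper, assuming a namedtuple mirroring the printed fields; the real ConfusionMatrix class in the codebase may differ:

```python
from collections import namedtuple

ConfusionMatrix = namedtuple("ConfusionMatrix", ["tp", "fp", "tn", "fn"])

def true_positive_rate(cm):   # recall, for positive-only splits
    return cm.tp / (cm.tp + cm.fn)

def false_positive_rate(cm):  # false-alarm rate, for negative-only splits
    return cm.fp / (cm.fp + cm.tn)

# seq-lstm on the real dev split: near-perfect stitched accuracy
# does not transfer (tp rate ~0.36, fp rate ~0.018)
print(true_positive_rate(ConfusionMatrix(tp=27, fp=0, tn=0, fn=49)))
print(false_positive_rate(ConfusionMatrix(tp=0, fp=46, tn=2485, fn=0)))
```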

ljj7975 commented Apr 7, 2021

To improve the audio sample quality, I applied secondary filtering with keyword spotting: for hey_fire_fox, the filter dropped 574 invalid samples out of the 10000 generated. A sketch of the filtering is below.
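A minimal sketch of the secondary filter; kws_scores stands in for whatever pretrained keyword-spotting model is used and is not a real API, and the threshold is a placeholder:

```python
# Keep a stitched sample only if every vocab word is detected
# with enough confidence by an existing keyword-spotting model.
def verify(sample_paths, vocab, kws_scores, threshold=0.5):
    kept, dropped = [], 0
    for path in sample_paths:
        scores = kws_scores(path)  # assumed: dict of word -> confidence
        if all(scores.get(word, 0.0) >= threshold for word in vocab):
            kept.append(path)
        else:
            dropped += 1
    return kept, dropped

# usage (stitched_paths and kws_scores assumed to exist):
# kept, dropped = verify(stitched_paths, ["hey", "fire", "fox"], kws_scores)
```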

ljj7975 commented Apr 7, 2021

Keyword spotting verification definitely helped, but not enough.

< res8 >

Stitched datasets

Dev positive: ConfusionMatrix(tp=878, fp=0, tn=0, fn=1622)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Dev noisy positive: ConfusionMatrix(tp=767, fp=0, tn=0, fn=1733)
Dev noisy negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=853, fp=0, tn=0, fn=1647)
Test negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)
Test noisy positive: ConfusionMatrix(tp=761, fp=0, tn=0, fn=1739)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=35, fp=0, tn=0, fn=41)
Dev negative: ConfusionMatrix(tp=0, fp=10, tn=2521, fn=0)
Dev noisy positive: ConfusionMatrix(tp=22, fp=0, tn=0, fn=54)
Dev noisy negative: ConfusionMatrix(tp=0, fp=3, tn=2528, fn=0)
Test positive: ConfusionMatrix(tp=14, fp=0, tn=0, fn=40)
Test negative: ConfusionMatrix(tp=0, fp=8, tn=2496, fn=0)
Test noisy positive: ConfusionMatrix(tp=14, fp=0, tn=0, fn=40)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=2504, fn=0)

ljj7975 commented Apr 8, 2021

Found that detection got a little better after putting more weight on "hey" (real datasets; a sketch of the weighting follows the numbers):

Dev positive: ConfusionMatrix(tp=42, fp=0, tn=0, fn=34)
Dev negative: ConfusionMatrix(tp=0, fp=12, tn=2519, fn=0)
Dev noisy positive: ConfusionMatrix(tp=31, fp=0, tn=0, fn=45)
Dev noisy negative: ConfusionMatrix(tp=0, fp=4, tn=2527, fn=0)
Test positive: ConfusionMatrix(tp=18, fp=0, tn=0, fn=36)
Test negative: ConfusionMatrix(tp=0, fp=13, tn=2491, fn=0)
Test noisy positive: ConfusionMatrix(tp=17, fp=0, tn=0, fn=37)
Test noisy negative: ConfusionMatrix(tp=0, fp=4, tn=2500, fn=0)
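One plausible realization of the extra weight on "hey" (the comment does not say exactly how it was applied; the class order, weight value, and the model/batch names are illustrative, and oversampling "hey" would be an alternative):

```python
import torch
import torch.nn as nn

# 0 = negative, 1 = hey, 2 = fire, 3 = fox (illustrative ordering)
class_weights = torch.tensor([1.0, 3.0, 1.0, 1.0])  # upweight rare "hey"
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = model(batch)             # (batch_size, num_classes); model assumed
loss = criterion(logits, labels)  # labels: (batch_size,) class indices
```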

Since upweighting "hey" helped, I double-checked the number of samples for each vocab word used to generate the stitched dataset, and the weakness might simply be due to the small number of samples for "hey":

number of samples for vocab hey: 31
number of samples for vocab fire: 665
number of samples for vocab fox: 214

I believe phoneme-based stitching is inevitable.

ljj7975 commented Apr 8, 2021

Explored bigger datasets (20000 samples):

Wake word dataset (DatasetType.TRAINING) (num_examples=32628, audio_length_seconds=122408.37631250109, vocab_counts=Counter({'fire': 10639, 'fox': 10211, 'hey': 10028}))
Wake word dataset (DatasetType.DEV) (num_examples=5808, audio_length_seconds=10201.0685, vocab_counts=Counter({'fire': 5028, 'fox': 5006, 'hey': 5000}))
Wake word dataset (DatasetType.TEST) (num_examples=5840, audio_length_seconds=10085.474125000013, vocab_counts=Counter({'fire': 5025, 'fox': 5006, 'hey': 5005}))
Dev pos dataset (DatasetType.DEV) (num_examples=5000, audio_length_seconds=6008.676500000001, vocab_counts=Counter({'hey': 5000, 'fire': 5000, 'fox': 5000}))
Dev neg dataset (DatasetType.DEV) (num_examples=808, audio_length_seconds=4192.392000000006, vocab_counts=Counter({'fire': 28, 'fox': 6}))
Test pos dataset (DatasetType.TEST) (num_examples=5000, audio_length_seconds=6008.482124999997, vocab_counts=Counter({'hey': 5000, 'fire': 5000, 'fox': 5000}))
Test neg dataset (DatasetType.TEST) (num_examples=840, audio_length_seconds=4076.9919999999975, vocab_counts=Counter({'fire': 25, 'fox': 6, 'hey': 5}))

tl;dr: detection is slightly better, but dev/test accuracy on the stitched sets is still not great. I should focus on improving that first.

Stitched datasets

Dev positive: ConfusionMatrix(tp=1934, fp=0, tn=0, fn=3066)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Dev noisy positive: ConfusionMatrix(tp=1749, fp=0, tn=0, fn=3251)
Dev noisy negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=1900, fp=0, tn=0, fn=3100)
Test negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)
Test noisy positive: ConfusionMatrix(tp=1726, fp=0, tn=0, fn=3274)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=47, fp=0, tn=0, fn=29)
Dev negative: ConfusionMatrix(tp=0, fp=79, tn=2452, fn=0)
Dev noisy positive: ConfusionMatrix(tp=43, fp=0, tn=0, fn=33)
Dev noisy negative: ConfusionMatrix(tp=0, fp=72, tn=2459, fn=0)
Test positive: ConfusionMatrix(tp=27, fp=0, tn=0, fn=27)
Test negative: ConfusionMatrix(tp=0, fp=95, tn=2409, fn=0)
Test noisy positive: ConfusionMatrix(tp=28, fp=0, tn=0, fn=26)
Test noisy negative: ConfusionMatrix(tp=0, fp=92, tn=2412, fn=0)

icyda17 commented Nov 25, 2021

@ljj7975 What do you mean by secondary filtering?
