stitching audio samples to generate diverse positive dataset #59

Open · ljj7975 opened this issue Feb 25, 2021 · 10 comments

ljj7975 commented Feb 25, 2021

For some wake words, we only get a few dev/test samples, since a sample's transcript must exactly match the wake word.

For example, when I attempted to generate a dataset for "love you", the dev and test sets contained only two samples each.

Given that the train set contains samples whose transcripts contain at least one of the vocab words and are aligned with the audio (by MFA), we could stitch pieces of those samples together to generate synthetic wake word samples.

For example, stitching "hey baby, I saw fire" and "wow, there was a fox" to generate a sample for "hey firefox". A rough sketch of the idea is below.
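A minimal sketch of the stitching step, assuming 16 kHz mono clips and MFA word-level timestamps; the paths, timestamps, and the stitch helper itself are all hypothetical:

```python
import numpy as np
import soundfile as sf

# Cut word segments out of longer Common Voice clips using MFA
# word-level timestamps, then concatenate them into one synthetic
# wake word sample.
def stitch(segments, out_path, sample_rate=16000):
    pieces = []
    for path, start_sec, end_sec in segments:
        audio, sr = sf.read(path)
        assert sr == sample_rate  # resample upstream if rates differ
        pieces.append(audio[int(start_sec * sr):int(end_sec * sr)])
    sf.write(out_path, np.concatenate(pieces), sample_rate)

# "hey" and "fire" from "hey baby, I saw fire"; "fox" from
# "wow, there was a fox" -> one synthetic "hey firefox" sample
stitch([("cv/clip1.wav", 0.05, 0.40),
        ("cv/clip1.wav", 1.80, 2.30),
        ("cv/clip2.wav", 2.10, 2.55)],
       "stitched/hey_firefox_0.wav")
```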

ljj7975 commented Apr 4, 2021

With #66 it is possible to generate enough samples for train/dev/test.
The trained model generally reports high accuracy on the generated dev/test sets.
However, when I tested actual detection, the accuracy was not good enough: I didn't see many false positive cases, but the true positive rate was also quite low.

First of all, I need to set up a proper evaluation setting: keep the actual collected samples for dev/test, train a model on the stitched datasets, and evaluate on the real samples. I believe hey firefox might be a good choice.

To improve the true positive rate, restricting the stitching to smooth transitions between phonemes might be a good option. For example, suppose the words are hey, fire, and fox; then we pick hey from "hey F-" and fire from "-y fire f-". A sketch of this constraint follows.
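One way this constraint could be encoded, assuming MFA phone-level alignments are available; the Segment record, its fields, and the example values are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    start: float
    end: float
    first_phone: str  # first phone of the word itself
    next_phone: str   # phone right after the word in its source clip

# Join two word segments only when the phone following the first word
# (in its source utterance) matches the first phone of the next word,
# so the transition is naturally coarticulated.
def compatible(a: Segment, b: Segment) -> bool:
    return a.next_phone == b.first_phone

segments = {
    "hey":  [Segment("cv/clip1.wav", 0.05, 0.40, "HH", "F")],  # "hey F-"
    "fire": [Segment("cv/clip1.wav", 1.80, 2.30, "F", "F")],   # "-y fire f-"
    "fox":  [Segment("cv/clip2.wav", 2.10, 2.55, "F", "S")],
}
candidates = [(h, fi, fo)
              for h in segments["hey"]
              for fi in segments["fire"] if compatible(h, fi)
              for fo in segments["fox"] if compatible(fi, fo)]
```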

ljj7975 commented Apr 4, 2021

Tested both res8 and seq-lstm.
I suspected that seq-lstm would be better, but both models had accuracy so low that I couldn't tell whether one was actually better than the other.

ljj7975 commented Apr 4, 2021

Current experiment plan:
train a model with only the negative + stitched datasets, then evaluate on real samples.

ljj7975 commented Apr 4, 2021

hey_fire_fox should be able to give us initial results, but we might not have enough "real" samples for other wake words.

Unlike Common Voice, GSC is collected per word. Therefore, we can apply stitching to GSC to construct various real wake word samples (the clips must be grouped by speaker ID so each sample uses a single voice), then evaluate the models trained with Common Voice vocab stitching. A sketch of the grouping is below.
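A sketch of the speaker grouping, relying on the GSC filename convention where the prefix before _nohash_ identifies the speaker; the two-word vocab, paths, and helper name are illustrative:

```python
from collections import defaultdict

# Group GSC clips per speaker so each stitched "real" sample
# is produced from a single voice.
def group_by_speaker(paths):
    groups = defaultdict(dict)  # speaker_id -> {word: path}
    for path in paths:
        word = path.split("/")[-2]  # GSC stores one word per folder
        speaker_id = path.split("/")[-1].split("_")[0]
        groups[speaker_id][word] = path
    return groups

# Illustrative "wake word" built from two GSC words
vocab = ["go", "marvin"]
gsc_paths = ["speech_commands/go/0a7c2a8d_nohash_0.wav",
             "speech_commands/marvin/0a7c2a8d_nohash_0.wav"]
stitch_inputs = [[group[w] for w in vocab]
                 for group in group_by_speaker(gsc_paths).values()
                 if all(w in group for w in vocab)]
```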

ljj7975 commented Apr 6, 2021

Stitched datasets

Train ds - num_examples=27628, audio_length_seconds=116447.25187500067, vocab_counts=Counter({'fire': 5639, 'fox': 5211, 'hey': 5028})
Dev pos dataset - num_examples=2500, audio_length_seconds=3019.3278125000006, vocab_counts=Counter({'hey': 2500, 'fire': 2500, 'fox': 2500})
Dev neg dataset - num_examples=808, audio_length_seconds=4192.392000000006, vocab_counts=Counter({'fire': 28, 'fox': 6})
Test pos dataset - num_examples=2500, audio_length_seconds=2987.8693124999954, vocab_counts=Counter({'hey': 2500, 'fire': 2500, 'fox': 2500})
Test neg dataset - num_examples=840, audio_length_seconds=4076.9919999999975, vocab_counts=Counter({'fire': 25, 'fox': 6, 'hey': 5})

Real datasets

Train ds (unnecessary info) - num_examples=26634, audio_length_seconds=111928.04543750011, vocab_counts=Counter({'hey': 4157, 'fire': 2420, 'fox': 2124})

Dev pos dataset - num_examples=76, audio_length_seconds=126.97837499999999, vocab_counts=Counter({'hey': 76, 'fire': 76, 'fox': 76})
Dev neg dataset - num_examples=2531, audio_length_seconds=10552.52668750001, vocab_counts=Counter({'hey': 488, 'fire': 218, 'fox': 193})
Test pos dataset - num_examples=54, audio_length_seconds=94.43156250000001, vocab_counts=Counter({'hey': 54, 'fire': 54, 'fox': 54})
Test neg dataset - num_examples=2504, audio_length_seconds=10269.859437499998, vocab_counts=Counter({'hey': 411, 'fire': 244, 'fox': 211})

< seq-lstm >

Stitched datasets

Dev positive: ConfusionMatrix(tp=2498, fp=0, tn=0, fn=2)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=2496, fp=0, tn=0, fn=4)
Test negative: ConfusionMatrix(tp=0, fp=1, tn=839, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=27, fp=0, tn=0, fn=49)
Dev negative: ConfusionMatrix(tp=0, fp=46, tn=2485, fn=0)
Dev noisy positive: ConfusionMatrix(tp=23, fp=0, tn=0, fn=53)
Dev noisy negative: ConfusionMatrix(tp=0, fp=155, tn=2376, fn=0)
Test positive: ConfusionMatrix(tp=7, fp=0, tn=0, fn=47)
Test negative: ConfusionMatrix(tp=0, fp=28, tn=2476, fn=0)
Test noisy positive: ConfusionMatrix(tp=9, fp=0, tn=0, fn=45)
Test noisy negative: ConfusionMatrix(tp=0, fp=153, tn=2351, fn=0)

< res8 >

Stitched datasets

Dev positive: ConfusionMatrix(tp=1100, fp=0, tn=0, fn=1400)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=1095, fp=0, tn=0, fn=1405)
Test negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=29, fp=0, tn=0, fn=47)
Dev negative: ConfusionMatrix(tp=0, fp=3, tn=2528, fn=0)
Dev noisy positive: ConfusionMatrix(tp=23, fp=0, tn=0, fn=53)
Dev noisy negative: ConfusionMatrix(tp=0, fp=2, tn=2529, fn=0)
Test positive: ConfusionMatrix(tp=11, fp=0, tn=0, fn=43)
Test negative: ConfusionMatrix(tp=0, fp=5, tn=2499, fn=0)
Test noisy positive: ConfusionMatrix(tp=9, fp=0, tn=0, fn=45)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=2504, fn=0)
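To compare these splits at a glance (the positive-only splits carry tp/fn, the negative-only ones fp/tn), here is a small helper, assuming a namedtuple mirroring the printed fields; the real ConfusionMatrix class in the codebase may differ:

```python
from collections import namedtuple

ConfusionMatrix = namedtuple("ConfusionMatrix", ["tp", "fp", "tn", "fn"])

def true_positive_rate(cm):   # recall, for positive-only splits
    return cm.tp / (cm.tp + cm.fn)

def false_positive_rate(cm):  # false-alarm rate, for negative-only splits
    return cm.fp / (cm.fp + cm.tn)

# seq-lstm on the real dev split: near-perfect stitched accuracy
# does not transfer (tp rate ~0.36, fp rate ~0.018)
print(true_positive_rate(ConfusionMatrix(tp=27, fp=0, tn=0, fn=49)))
print(false_positive_rate(ConfusionMatrix(tp=0, fp=46, tn=2485, fn=0)))
```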

ljj7975 commented Apr 7, 2021

To improve the audio sample quality, I applied secondary filtering with keyword spotting: for hey_fire_fox, the filter dropped 574 invalid samples out of the 10000 generated. A sketch of the filtering is below.
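A minimal sketch of the secondary filter; kws_scores stands in for whatever pretrained keyword-spotting model is used and is not a real API, and the threshold is a placeholder:

```python
# Keep a stitched sample only if every vocab word is detected
# with enough confidence by an existing keyword-spotting model.
def verify(sample_paths, vocab, kws_scores, threshold=0.5):
    kept, dropped = [], 0
    for path in sample_paths:
        scores = kws_scores(path)  # assumed: dict of word -> confidence
        if all(scores.get(word, 0.0) >= threshold for word in vocab):
            kept.append(path)
        else:
            dropped += 1
    return kept, dropped

# usage (stitched_paths and kws_scores assumed to exist):
# kept, dropped = verify(stitched_paths, ["hey", "fire", "fox"], kws_scores)
```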

ljj7975 commented Apr 7, 2021

Keyword spotting verification definitely helped, but not enough.

< res8 >

Stitched datasets

Dev positive: ConfusionMatrix(tp=878, fp=0, tn=0, fn=1622)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Dev noisy positive: ConfusionMatrix(tp=767, fp=0, tn=0, fn=1733)
Dev noisy negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=853, fp=0, tn=0, fn=1647)
Test negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)
Test noisy positive: ConfusionMatrix(tp=761, fp=0, tn=0, fn=1739)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=35, fp=0, tn=0, fn=41)
Dev negative: ConfusionMatrix(tp=0, fp=10, tn=2521, fn=0)
Dev noisy positive: ConfusionMatrix(tp=22, fp=0, tn=0, fn=54)
Dev noisy negative: ConfusionMatrix(tp=0, fp=3, tn=2528, fn=0)
Test positive: ConfusionMatrix(tp=14, fp=0, tn=0, fn=40)
Test negative: ConfusionMatrix(tp=0, fp=8, tn=2496, fn=0)
Test noisy positive: ConfusionMatrix(tp=14, fp=0, tn=0, fn=40)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=2504, fn=0)

ljj7975 commented Apr 8, 2021

Found that detection got a little better after putting more weight on "hey" (real datasets; a sketch of the weighting follows the numbers):

Dev positive: ConfusionMatrix(tp=42, fp=0, tn=0, fn=34)
Dev negative: ConfusionMatrix(tp=0, fp=12, tn=2519, fn=0)
Dev noisy positive: ConfusionMatrix(tp=31, fp=0, tn=0, fn=45)
Dev noisy negative: ConfusionMatrix(tp=0, fp=4, tn=2527, fn=0)
Test positive: ConfusionMatrix(tp=18, fp=0, tn=0, fn=36)
Test negative: ConfusionMatrix(tp=0, fp=13, tn=2491, fn=0)
Test noisy positive: ConfusionMatrix(tp=17, fp=0, tn=0, fn=37)
Test noisy negative: ConfusionMatrix(tp=0, fp=4, tn=2500, fn=0)
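One plausible realization of the extra weight on "hey" (the comment does not say exactly how it was applied; the class order, weight value, and the model/batch names are illustrative, and oversampling "hey" would be an alternative):

```python
import torch
import torch.nn as nn

# 0 = negative, 1 = hey, 2 = fire, 3 = fox (illustrative ordering)
class_weights = torch.tensor([1.0, 3.0, 1.0, 1.0])  # upweight rare "hey"
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = model(batch)             # (batch_size, num_classes); model assumed
loss = criterion(logits, labels)  # labels: (batch_size,) class indices
```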

Since upweighting "hey" helped, I double-checked the number of samples for each vocab word used to generate the stitched dataset, and the weakness might simply be due to the small number of samples for "hey":

number of samples for vocab hey: 31
number of samples for vocab fire: 665
number of samples for vocab fox: 214

I believe phoneme-based stitching is inevitable.

ljj7975 commented Apr 8, 2021

Explored bigger datasets (20000 samples):

Wake word dataset (DatasetType.TRAINING) (num_examples=32628, audio_length_seconds=122408.37631250109, vocab_counts=Counter({'fire': 10639, 'fox': 10211, 'hey': 10028}))
Wake word dataset (DatasetType.DEV) (num_examples=5808, audio_length_seconds=10201.0685, vocab_counts=Counter({'fire': 5028, 'fox': 5006, 'hey': 5000}))
Wake word dataset (DatasetType.TEST) (num_examples=5840, audio_length_seconds=10085.474125000013, vocab_counts=Counter({'fire': 5025, 'fox': 5006, 'hey': 5005}))
Dev pos dataset (DatasetType.DEV) (num_examples=5000, audio_length_seconds=6008.676500000001, vocab_counts=Counter({'hey': 5000, 'fire': 5000, 'fox': 5000}))
Dev neg dataset (DatasetType.DEV) (num_examples=808, audio_length_seconds=4192.392000000006, vocab_counts=Counter({'fire': 28, 'fox': 6}))
Test pos dataset (DatasetType.TEST) (num_examples=5000, audio_length_seconds=6008.482124999997, vocab_counts=Counter({'hey': 5000, 'fire': 5000, 'fox': 5000}))
Test neg dataset (DatasetType.TEST) (num_examples=840, audio_length_seconds=4076.9919999999975, vocab_counts=Counter({'fire': 25, 'fox': 6, 'hey': 5}))

tl;dr: detection is slightly better, but dev/test accuracy on the stitched sets is still not great. I should focus on improving that first.

Stitched datasets

Dev positive: ConfusionMatrix(tp=1934, fp=0, tn=0, fn=3066)
Dev negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Dev noisy positive: ConfusionMatrix(tp=1749, fp=0, tn=0, fn=3251)
Dev noisy negative: ConfusionMatrix(tp=0, fp=0, tn=808, fn=0)
Test positive: ConfusionMatrix(tp=1900, fp=0, tn=0, fn=3100)
Test negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)
Test noisy positive: ConfusionMatrix(tp=1726, fp=0, tn=0, fn=3274)
Test noisy negative: ConfusionMatrix(tp=0, fp=0, tn=840, fn=0)

Real datasets

Dev positive: ConfusionMatrix(tp=47, fp=0, tn=0, fn=29)
Dev negative: ConfusionMatrix(tp=0, fp=79, tn=2452, fn=0)
Dev noisy positive: ConfusionMatrix(tp=43, fp=0, tn=0, fn=33)
Dev noisy negative: ConfusionMatrix(tp=0, fp=72, tn=2459, fn=0)
Test positive: ConfusionMatrix(tp=27, fp=0, tn=0, fn=27)
Test negative: ConfusionMatrix(tp=0, fp=95, tn=2409, fn=0)
Test noisy positive: ConfusionMatrix(tp=28, fp=0, tn=0, fn=26)
Test noisy negative: ConfusionMatrix(tp=0, fp=92, tn=2412, fn=0)

icyda17 commented Nov 25, 2021

@ljj7975 What do you mean by secondary filtering?
