
Dataset Generation Question #111

Open
bdytx5 opened this issue Feb 17, 2022 · 6 comments

Comments

@bdytx5
Contributor

bdytx5 commented Feb 17, 2022

First off, thanks for this awesome repo! Helping me a lot with my project!!!

Anyway, I'm a bit confused as to how the program is generating the samples that it does. For example, I chose a single wake word and generated a dataset from the speech commands dataset. For the positive set, I get

Generate training datasets: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 509/509 [01:03<0
"Number of speakers in corpus: 1, average number of utterances per speaker: 518.0."

However, when I follow the rest of the generation steps, I end up with a dataset of 10K examples. I'm just a bit confused about where these extra samples came from. Are they duplicates, or some sort of augmented versions of themselves? In the paper you mention:
"For improved robustness and better quality, we implement a set of popular augmentation routines: time stretching, time shifting, synthetic noise addition, recorded noise mixing, SpecAugment (no time warping; Park et al., 2019), and vocal tract length perturbation (Jaitly and Hinton, 2013). These are readily extensible, so practitioners may easily add new augmentation modules."

I am mainly using this repo for dataset generation, so I wasn't sure if this was just talking about your model preprocessing, or if you perhaps implemented this in your dataset generation code. I would dig through the code a bit more, but I figured this would be a pretty quick/straightforward question for you guys, and the answer could be useful for someone else down the line.

Thanks,
Brett

@ljj7975
Member

ljj7975 commented Feb 24, 2022

I haven't generated a dataset from google speech commands for a long time.

What were the commands you used, and what was the target word?

The augmentations are applied in memory at model training time. The dataset we generate is the original audio; the augmentations have not been applied to it.
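For reference, train-time augmentation of this kind is commonly structured like the sketch below. This is a hypothetical illustration of the pattern, not Howl's actual API; the function and class names are made up for this example.

```python
import random

def add_synthetic_noise(samples, scale=0.005):
    """Mix low-amplitude uniform noise into the waveform (illustrative)."""
    return [s + random.uniform(-scale, scale) for s in samples]

def time_shift(samples, max_shift=4):
    """Circularly shift the waveform by a random offset (illustrative)."""
    k = random.randint(-max_shift, max_shift)
    return samples[-k:] + samples[:-k] if k else samples

class AugmentedLoader:
    """Wraps raw clips; each access re-applies random transforms in memory,
    so the dataset on disk is never modified and every epoch sees a
    slightly different variant of the same clip."""
    def __init__(self, clips, transforms):
        self.clips = clips
        self.transforms = transforms

    def __getitem__(self, i):
        samples = list(self.clips[i])  # copy so the original stays intact
        for transform in self.transforms:
            samples = transform(samples)
        return samples
```

Because the transforms run per access, the on-disk sample count stays fixed; only the in-memory training views vary.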

@bdytx5
Contributor Author

bdytx5 commented Feb 26, 2022

Ahh sorry, I meant the Common Voice dataset, not Speech Commands.

I was generating a set for the word "pass". Here are the commands I used; these may not work for you, as I am using a previous version of the repo:

VOCAB='["pass"]' INFERENCE_SEQUENCE=[0] python -m training.run.generate_raw_audio_dataset -i /home/brett/datasets/common_voice/en --positive-pct 100 --negative-pct 0

mkdir -p datasets/pass/positive/alignment
pushd montreal-forced-aligner
./bin/mfa_align --num_jobs 12 ../datasets/pass/positive/audio librispeech-lexicon.txt pretrained_models/english.zip ../datasets/pass/positive/alignment
popd

(SKIP- I am going to reuse negative samples from other datasets)
DATASET_PATH=datasets/pass/negative python -m training.run.attach_alignment --align-type stub

VOCAB='["pass"]' DATASET_PATH=datasets/pass/positive python -m training.run.attach_alignment --align-type mfa -i /home/brett/Desktop/howl/datasets/pass/positive/alignment

VOCAB='["pass"]' INFERENCE_SEQUENCE=[0] python -m training.run.stitch_vocab_samples --aligned-dataset "datasets/pass/positive" --stitched-dataset "data/pass-stitched"

@ljj7975
Member

ljj7975 commented Apr 10, 2022

Would this script solve your issue?

https://github.com/castorini/howl/blob/master/generate_dataset.sh

Details can be found here: https://github.com/castorini/howl/tree/master/howl/dataset

@bdytx5
Contributor Author

bdytx5 commented Apr 10, 2022

I don't really have an issue; it just seemed strange that I had 20x more samples after stitching than before. Do you see what I am saying?

@boxabirds

Data augmentation generally results in a combinatorial increase in dataset size. You quoted the doc listing 6 augmentation methods, so the combinatorial expansion of that is 1 + 2 + 3 + 4 + 5 + 6 = 21. I can easily see from the docs how you'd get 21 additional variations.

I might be completely off piste here, but it makes sense at least intuitively to me. The doc also says they're configurable, so you could check specifically which ones are enabled and which you might want to exclude for your situation, though it looks like a pretty comprehensive set of manipulations (SpecAugment is particularly neat, as that's a relatively new technique).

@ljj7975
Member

ljj7975 commented May 28, 2022

Sorry for the delay. I have other stuff going on, so I haven't been able to spend much time on this project.

I think you just caught a bug.
@bdytx5 is right that it's coming from stitching.

Stitching is supposed to mix the audio for each word (in the vocab) across the dataset to create a new set of audio samples. If the wake word is "hello world" and we have two positive audio samples (sample 1 and sample 2), it's supposed to generate two more samples: "hello" from sample 1 stitched with "world" from sample 2, and "hello" from sample 2 stitched with "world" from sample 1. The new dataset is stored under the stitched folder.

The bug comes from not handling the base case where there is only one word in the vocab. If you listen to the generated audio files in the stitched folder, you will notice that they just contain the word you specified ("pass", based on your command).
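The intended recombination, and the degenerate one-word case behind the bug, can be sketched roughly as follows. This `stitch` function is a simplified illustration of the idea, not the actual WordStitcher implementation:

```python
from itertools import product

def stitch(vocab_words, samples_per_word):
    """For each word in the wake phrase, pick one source sample containing
    it, and record which word segment comes from which sample. Every
    combination of source samples yields one stitched clip (as a list of
    (word, sample_id) pairs)."""
    choices = [samples_per_word[word] for word in vocab_words]
    return [list(zip(vocab_words, combo)) for combo in product(*choices)]

# Two-word vocab, two source samples per word: 2 x 2 = 4 stitched clips,
# two of which recombine segments across samples.
two_word = stitch(["hello", "world"], {"hello": [1, 2], "world": [1, 2]})

# One-word vocab: there is nothing to recombine, so every "stitched" clip
# is just the original word segment again -- the degenerate case that can
# produce duplicate-sounding samples like the ones described above.
one_word = stitch(["pass"], {"pass": [1, 2]})
```

With an n-word vocab and k candidate samples per word, this yields k^n stitched clips, which is why the stitched set can be much larger than the source set.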

The detailed implementation of the stitching logic can be found in the WordStitcher class. You will probably find the stitch_vocab_samples.py script and the test case for WordStitcher useful as well.

I will fix the issue up sometime next week.
