Dataset Generation Question #111
Comments
I haven't generated a dataset from Google Speech Commands for a long time. What commands did you use, and what was the target word? The augmentations are applied in memory at model training time. The dataset we generate is the original one, with no augmentations applied yet.
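To illustrate what "applied at the time of model training in memory" can look like, here is a minimal sketch in plain Python. It is not howl's code: the function names (`time_shift`, `add_noise`, `augmented_batches`) and all parameter values are hypothetical, and waveforms are modeled as plain lists of floats. It only shows that the stored dataset keeps its size while each epoch sees freshly perturbed copies.

```python
import random


def time_shift(samples, max_shift, rng):
    """Shift a waveform left or right by a random offset, zero-padding the gap."""
    shift = rng.randint(-max_shift, max_shift)
    # Clamp so the output always has the same length as the input.
    shift = max(-len(samples), min(len(samples), shift))
    if shift > 0:
        return [0.0] * shift + samples[:len(samples) - shift]
    if shift < 0:
        return samples[-shift:] + [0.0] * (-shift)
    return list(samples)


def add_noise(samples, scale, rng):
    """Add zero-mean Gaussian noise to every sample (synthetic noise addition)."""
    return [s + rng.gauss(0.0, scale) for s in samples]


def augmented_batches(dataset, epochs, seed=0):
    """Yield an augmented copy of every clip once per epoch.

    The clips in `dataset` are never modified or written back to disk;
    the augmented versions exist only while they are being consumed.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        for samples in dataset:
            yield add_noise(time_shift(samples, 2, rng), 0.01, rng)


# Hypothetical stand-in for a dataset on disk: two short clips.
stored = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.4, 0.3, 0.2]]
seen = list(augmented_batches(stored, epochs=3))
print(len(stored), len(seen))
```

With this arrangement the model sees `len(stored) * epochs` distinct variants, but the sample count on disk never changes, so on-the-fly augmentation alone would not explain a larger generated dataset.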
Ahh sorry, I meant the Common Voice dataset, not Speech Commands... I was generating a set for the word "pass". Here are the commands I used. These may not work for you, as I am using a previous version of the repo...

VOCAB='["pass"]' INFERENCE_SEQUENCE=[0] python -m training.run.generate_raw_audio_dataset -i /home/brett/datasets/common_voice/en --positive-pct 100 --negative-pct 0

mkdir -p datasets/pass/positive/alignment

(SKIP- I am going to reuse negative samples from other datasets)

VOCAB='["pass"]' DATASET_PATH=datasets/pass/positive python -m training.run.attach_alignment --align-type mfa -i /home/brett/Desktop/howl/datasets/pass/positive/alignment

VOCAB='["pass"]' INFERENCE_SEQUENCE=[0] python -m training.run.stitch_vocab_samples --aligned-dataset "datasets/pass/positive" --stitched-dataset "data/pass-stitched"
Would this script solve your issue? https://github.com/castorini/howl/blob/master/generate_dataset.sh Details can be found here: https://github.com/castorini/howl/tree/master/howl/dataset
I don't really have an issue, it just seemed strange that I had 20x more samples after stitching than before. Do you see what I am saying?
Data augmentation generally results in a combinatorial increase in dataset size. You quoted the doc listing 6 augmentation methods, so the combinatorial expansion of that is 1 + 2 + 3 + 4 + 5 + 6 = 21. I can easily see from the docs that you'd get 21 additional variations. I might be completely off-piste here, but it makes sense to me, at least intuitively. It also says they're configurable, so you could check specifically which ones are enabled and which you might want to exclude for your situation, though it looks like a pretty comprehensive set of manipulations (SpecAugment is particularly neat, as it's a relatively new technique).
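As a quick check on the arithmetic in the comment above, the snippet below computes two different counts for six augmentation methods. The method labels are taken from the paper quote in the original post; the identifiers themselves are hypothetical and do not come from the howl codebase. Neither figure is claimed to be what howl actually produces.

```python
from itertools import combinations

# Labels for the six augmentations named in the paper quote (identifiers are made up).
methods = [
    "time_stretch",
    "time_shift",
    "synthetic_noise",
    "recorded_noise",
    "spec_augment",
    "vtlp",
]

# The running sum used in the comment: 1 + 2 + ... + 6.
running_sum = sum(range(1, len(methods) + 1))

# If each method could independently be on or off, the number of
# distinct non-empty combinations would be 2**6 - 1.
non_empty_subsets = sum(
    1
    for k in range(1, len(methods) + 1)
    for _ in combinations(methods, k)
)

print(running_sum, non_empty_subsets)  # prints "21 63"
```

The sum is indeed 21, but it is not the number of ways six independent augmentations can be combined, which would be 63. A roughly 20x growth matching 21 may therefore be a coincidence.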
Sorry for the delay. I have other stuff going on, so I haven't been able to spend much time on this project. I think you just caught a bug. Stitching is supposed to mix the audio for each word (in the vocab) in the dataset to create a new set of audio samples. The bug simply comes from not handling the base case where there is only one word in the vocab. The detailed implementation of the stitching logic can be found in the WordStitcher class. I will fix the issue sometime next week.
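A rough sketch of the idea described above, with the single-word base case handled explicitly. This is not the WordStitcher implementation: the function `stitch_samples`, the dictionary layout, and the choice to model "mixing" as simple concatenation of one clip per word are all assumptions made for illustration, and clips are represented as plain lists of numbers.

```python
from itertools import product


def stitch_samples(samples_by_word, sequence):
    """Build stitched clips by joining one clip per word in `sequence`.

    samples_by_word: dict mapping a word to a list of clips (lists of numbers).
    sequence: the ordered list of words that make up the wake phrase.
    """
    if len(sequence) == 1:
        # Base case: a one-word vocab has nothing to stitch, so the
        # existing clips are returned unchanged instead of being multiplied.
        return [list(clip) for clip in samples_by_word[sequence[0]]]

    stitched = []
    # Every choice of one clip per word yields one new stitched clip.
    for choice in product(*(samples_by_word[word] for word in sequence)):
        merged = []
        for clip in choice:
            merged.extend(clip)
        stitched.append(merged)
    return stitched


# One-word vocab: the output has as many clips as the input.
single = stitch_samples({"pass": [[1, 2], [3, 4]]}, ["pass"])

# Two-word vocab (hypothetical words): 2 clips x 3 clips gives 6 stitched clips.
multi = stitch_samples(
    {"hey": [[1], [2]], "fire": [[3], [4], [5]]},
    ["hey", "fire"],
)
print(len(single), len(multi))
```

Under these assumptions, a multi-word phrase legitimately grows the sample count, while a one-word vocab should come out the same size it went in, which is the behavior the reported 20x increase violates.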
First off, thanks for this awesome repo! Helping me a lot with my project!!!
Anyway, I'm a bit confused as to how the program is generating the samples that it does. For example, I chose a single wake word and generated a dataset from the speech commands dataset. For the positive set, I get
Generate training datasets: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 509/509 [01:03<0
"Number of speakers in corpus: 1, average number of utterances per speaker: 518.0."
However, when I follow the rest of the generation steps, I end up with a dataset of 10K examples. I'm just a bit confused as to where these extra samples came from. Are they duplicates or some sort of augmented version of the originals? In the paper you mention:
"For improved robustness and better quality, we implement a set of popular augmentation routines: time stretching, time shifting, synthetic noise addition, recorded noise mixing, SpecAugment (no time warping; Park et al., 2019), and vocal tract length perturbation (Jaitly and Hinton, 2013). These are readily extensible, so practitioners may easily add new augmentation modules."
I am mainly using this repo for dataset generation, so I wasn't sure if this was just talking about your model preprocessing, or if you perhaps implemented this in your dataset generation code. I would dig through the code a bit more, but I figured it would be a pretty quick, straightforward question for you guys and possibly useful for someone else down the line.
Thanks,
Brett