Dataset link: https://www.kaggle.com/datasets/chrisfilo/urbansound8k
This repo demonstrates a basic, widely used approach to classifying audio signals.
Step 1: Build a data preprocessing pipeline.
I resample the audio to 16 kHz and take 48,000 samples (3 s) to feed to the model, since 3 seconds is usually long enough to recognise a sound. Then I use torchaudio to extract mel-spectrogram features. You can gain more intuition from this post: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
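A minimal sketch of what this pipeline could look like with torchaudio; the STFT parameters (`n_fft`, `hop_length`, `n_mels`) and the pad/crop strategy are my assumptions, not taken from the repo:

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000
NUM_SAMPLES = 48_000  # 3 s at 16 kHz

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,      # assumed window size
    hop_length=512,  # assumed hop length
    n_mels=64,       # assumed number of mel bins
)
to_db = torchaudio.transforms.AmplitudeToDB()

def preprocess(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != SAMPLE_RATE:
        waveform = torchaudio.transforms.Resample(sr, SAMPLE_RATE)(waveform)
    if waveform.shape[1] < NUM_SAMPLES:            # zero-pad short clips
        waveform = torch.nn.functional.pad(
            waveform, (0, NUM_SAMPLES - waveform.shape[1]))
    waveform = waveform[:, :NUM_SAMPLES]           # crop long clips to 3 s
    return to_db(mel_transform(waveform))          # (1, n_mels, frames)
```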
Step 2: Visualize some random samples from the dataset (as always).
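A quick way to do this with matplotlib, reusing the `preprocess` helper sketched above; the `paths` and `labels` lists are assumed to come from the UrbanSound8K metadata CSV:

```python
import random
import matplotlib.pyplot as plt

def show_random_samples(paths, labels, n=4):
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 3))
    for ax, i in zip(axes, random.sample(range(len(paths)), n)):
        mel = preprocess(paths[i])  # (1, n_mels, frames)
        ax.imshow(mel[0], origin="lower", aspect="auto")
        ax.set_title(labels[i])
        ax.set_xlabel("frame")
        ax.set_ylabel("mel bin")
    plt.tight_layout()
    plt.show()
```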
Step 3: Build a mini VGG-19-like CNN (stacked 3x3 conv blocks where the feature maps gradually get smaller after each pooling stage).
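A rough sketch of such a model in PyTorch; the channel widths and block depths here are illustrative choices, not the repo's actual configuration:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """A VGG-style block: several 3x3 convs followed by 2x2 max pooling."""
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(2))  # halve the feature-map size
    return nn.Sequential(*layers)

class MiniVGG(nn.Module):
    def __init__(self, n_classes=10):  # UrbanSound8K has 10 classes
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32, 2),    # input is a 1-channel mel spectrogram
            conv_block(32, 64, 2),
            conv_block(64, 128, 2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```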
Step 4: Train the model with the Adam optimizer and cross-entropy loss.
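A bare-bones training loop under those choices; `train_loader`, the learning rate, and the epoch count are assumptions for illustration:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MiniVGG().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed lr
criterion = nn.CrossEntropyLoss()

for epoch in range(10):  # assumed number of epochs
    model.train()
    running_loss = 0.0
    for mels, targets in train_loader:  # assumed DataLoader of (mel, label)
        mels, targets = mels.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(mels), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * mels.size(0)
    print(f"epoch {epoch}: loss={running_loss / len(train_loader.dataset):.4f}")
```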
P/s: Due to limited computational resources, I'll stop here.