Speech recognition in ~300kB of code.
Small footprint, standalone, zero-dependency, offline keyword spotting (KWS) CLI tool. Might be useful in some automation tasks. Accepts audio on stdin and recognizes the following words: up, down, left, right, stop.
Here is an example invocation:
rec -q -t raw -c1 -e signed -b 16 -r16k - | ./kws_cli
Make sure you have a decent microphone and that the system input level is set appropriately.
Individual WAV files can be piped (e.g. for testing) using:
sox -S ../untitled.wav -t raw -c1 -e signed -b 16 -r16k - | ./kws_cli
In the demo subdirectory there is a Python script showing how to use kws_cli for simple automation.
(Demo video: kws_cli_demo_1.mov)
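A minimal sketch of such an automation script, assuming kws_cli prints each recognized keyword on its own line to stdout (the output format and the word-to-action mapping here are illustrative assumptions, not the exact demo script):

import subprocess

# Pipe microphone audio through rec into kws_cli, as in the invocation above.
pipeline = "rec -q -t raw -c1 -e signed -b 16 -r16k - | ./kws_cli"
proc = subprocess.Popen(pipeline, shell=True,
                        stdout=subprocess.PIPE, text=True)

# Placeholder actions -- replace with whatever your automation needs.
ACTIONS = {
    "up":    lambda: print("volume up"),
    "down":  lambda: print("volume down"),
    "left":  lambda: print("previous track"),
    "right": lambda: print("next track"),
    "stop":  lambda: print("pause"),
}

for line in proc.stdout:
    word = line.strip().lower()
    action = ACTIONS.get(word)
    if action:
        action()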
Speech recognition is based on this
model and examples from the same repository.
This is a simple model with three layers: 2x LSTM + 1x fully connected.
The model is trained in PyTorch and exported to ONNX.
Then onnx2c is used to convert the model to a bunch of C code.
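A hedged sketch of what this pipeline looks like in PyTorch, down to the ONNX export (layer sizes, feature dimensions, and names are illustrative assumptions; the actual model comes from the referenced repository):

import torch
import torch.nn as nn

class KwsModel(nn.Module):
    def __init__(self, n_features=40, hidden=64, n_classes=5):
        super().__init__()
        # 2x LSTM + 1x fully connected, matching the three-layer
        # structure described above. n_classes=5 covers the five
        # keywords; a real model may add silence/unknown classes.
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):           # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])  # classify from the last time step

model = KwsModel()
dummy = torch.randn(1, 100, 40)     # roughly 1 s of feature frames (assumed)
torch.onnx.export(model, dummy, "kws.onnx", opset_version=11)
# Then: onnx2c kws.onnx > kws_model.c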
LSTM layers have become mainstream in recent years and are well supported in different frameworks. The model is small, so it might be possible to run it on a Cortex-M4/M7 or ESP32 (future work); see below.
The usual CMake routine:
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
This model was run on RP2040 and ESP32-S3.
The model runs on a 1 s window of sound samples, so feature extraction and inference must take less than that to run continuously; preferably there should also be some overlap between successive windows. On the RP2040 the inference alone takes ~2.4 s at a 240 MHz clock, so real-time operation is not possible, and feature extraction also takes significant time. A smaller ("narrower") model was tested as well, and its inference still took ~1.2 s. This is still impressive considering that the RP2040 is a Cortex-M0+ without an FPU.
On the ESP32-S3 running at 240 MHz, inference with feature extraction takes ~0.5 s, so real-time operation is possible (e.g. a window every 750 ms with 250 ms overlap gives good results). A demo can be found here.
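A sketch of the windowing arithmetic described above (the 16 kHz rate comes from the invocation examples; the hop and overlap mirror the 750 ms / 250 ms figures; process_window is a hypothetical placeholder for feature extraction plus inference):

SAMPLE_RATE = 16000            # Hz, matches the rec/sox invocations
WINDOW = SAMPLE_RATE           # 1 s of samples per inference
HOP = int(0.75 * SAMPLE_RATE)  # new window every 750 ms -> 250 ms overlap

buf = []

def process_window(window):
    pass  # placeholder: feature extraction + model inference go here

def feed(samples):
    """Append incoming samples; run inference once a full window is ready."""
    global buf
    buf.extend(samples)
    while len(buf) >= WINDOW:
        process_window(buf[:WINDOW])
        buf = buf[HOP:]  # advance by the hop, keeping the 250 ms overlap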