Better Voice Activity Detection (VAD) (volume threshold) algorithm? #46
Comments
I was thinking about doing some frequency comparison, for example taking the ratio between the energy at frequencies inside the human voice range and the energy outside of it. Some research would be needed to find the right values, though. Also, non-instrumental music would probably cause some problems.
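To make the ratio idea concrete, here is a rough sketch using the Goertzel algorithm to measure energy at a few probe frequencies. The function names, the probe frequencies, and the assumption that speech sits roughly in the 300 to 3400 Hz telephony band are all mine for illustration; they are not tuned values and not anything the project currently does.

```typescript
/** Power of one frequency in a block of samples (Goertzel algorithm). */
function goertzelPower(samples: Float32Array, sampleRate: number, freq: number): number {
  const coeff = 2 * Math.cos((2 * Math.PI * freq) / sampleRate);
  let sPrev = 0;
  let sPrev2 = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] + coeff * sPrev - sPrev2;
    sPrev2 = sPrev;
    sPrev = s;
  }
  return sPrev * sPrev + sPrev2 * sPrev2 - coeff * sPrev * sPrev2;
}

/**
 * Share of the probed energy that falls inside the assumed voice band.
 * Returns a value in [0, 1]; 0 is returned for a block with no energy.
 * The default probe frequencies are illustrative assumptions only.
 */
function voiceBandRatio(
  samples: Float32Array,
  sampleRate: number,
  voiceFreqs: number[] = [300, 600, 1200, 2400],
  otherFreqs: number[] = [60, 6000, 10000],
): number {
  const sumPower = (freqs: number[]): number =>
    freqs.reduce((acc, f) => acc + goertzelPower(samples, sampleRate, f), 0);
  const inBand = sumPower(voiceFreqs);
  const outBand = sumPower(otherFreqs);
  const total = inBand + outBand;
  return total <= 0 ? 0 : inBand / total;
}
```

A block could then be treated as "probably speech" when the ratio is above some cutoff. Music with vocals would still score high, so this alone would not solve the problem mentioned above.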
Finding an existing algorithm would be cooler, even if it's written in another language or only described in maths form.
What exactly are we looking for? Would something like webvoicesdk be it? The demo seems to do what we are looking for.
What do you mean? The pros and cons in the issue description should give a rough idea.
I didn't look much into it.
People who use NewPipe say that it does its job well (also, it doesn't require the user to specify volumeThreshold), so maybe copying their algorithm would be enough. It uses ExoPlayer, where the silence skipping feature is implemented. Also see vantezzen/skip-silence#36 (adaptive (dynamic) volume threshold).
A few thoughts about OpenAI's Whisper (also mentioned in #164). I'm not an expert, but at a glance it looks to me like VAD with extra steps? I'm looking at FUTO Voice Input and they're using both Whisper and WebRTC VAD (check the "credits" section in their app).
Currently we're simply calculating the loudness of incoming audio and if it's below a certain value, we say it's silence. Would be cool to find a specialized algorithm that does this better, like what voice communication apps do.
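For reference, the current approach boils down to something like the sketch below. This is a simplified illustration, not the project's actual code: the helper names are hypothetical, and using RMS as the loudness measure is an assumption.

```typescript
/** Root-mean-square amplitude of a block of samples in the range [-1, 1]. */
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

/** A block counts as silence when its loudness is below the threshold. */
function isSilence(samples: Float32Array, volumeThreshold: number): boolean {
  return rms(samples) < volumeThreshold;
}
```

Anything quiet counts as silence and anything loud counts as sound, regardless of whether it is speech, which is why a specialized VAD algorithm could do better.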
FYI, in theory we can use an implementation written in a different language, such as C, C++, Rust, or Go, and compile it to WASM.
Also, it's not necessary to replace the current implementation: we can add an option to switch between different silence detection algorithms.
➕ Advantages:
➖ Disadvantages:
Where to start (I update the list from time to time):
A BiquadFilterNode that would cut out frequencies that human speech is usually not associated with. See this StackOverflow answer also.
Also see #164, there is a good collection of various VAD algorithms.
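To illustrate the band-pass idea outside of a browser, here is a plain TypeScript sketch of a second-order (biquad) band-pass filter using the widely used Audio EQ Cookbook coefficients. As far as I know this is the same kind of filter a BiquadFilterNode with type "bandpass" applies, but treat that as an assumption. The names are hypothetical and the center frequency and Q in the usage note are illustrative, not researched values.

```typescript
interface BiquadCoefficients {
  b0: number;
  b1: number;
  b2: number;
  a1: number;
  a2: number;
}

/** Band-pass coefficients (0 dB peak gain), normalized so that a0 = 1. */
function bandpassCoefficients(sampleRate: number, centerHz: number, q: number): BiquadCoefficients {
  const w0 = (2 * Math.PI * centerHz) / sampleRate;
  const alpha = Math.sin(w0) / (2 * q);
  const a0 = 1 + alpha;
  return {
    b0: alpha / a0,
    b1: 0,
    b2: -alpha / a0,
    a1: (-2 * Math.cos(w0)) / a0,
    a2: (1 - alpha) / a0,
  };
}

/** Runs the filter over a block of samples (direct form I, state starts at zero). */
function applyBiquad(input: Float32Array, c: BiquadCoefficients): Float32Array {
  const output = new Float32Array(input.length);
  let x1 = 0;
  let x2 = 0;
  let y1 = 0;
  let y2 = 0;
  for (let i = 0; i < input.length; i++) {
    const x0 = input[i];
    const y0 = c.b0 * x0 + c.b1 * x1 + c.b2 * x2 - c.a1 * y1 - c.a2 * y2;
    output[i] = y0;
    x2 = x1;
    x1 = x0;
    y2 = y1;
    y1 = y0;
  }
  return output;
}
```

In the extension itself, the browser equivalent would presumably be `new BiquadFilterNode(audioContext, { type: "bandpass", frequency: 1000, Q: 0.5 })` connected in front of whatever measures the loudness, so that low rumble and high-pitched noise contribute less to the measured volume.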
I would appreciate your advice (as always).
I also found this: https://developer.mozilla.org/en-US/docs/Web/API/MediaTrackConstraints/noiseSuppression
Idk what it is, but may help.
Also this: w3c/webrtc-extensions#76