Terminology
Caleb Bassi edited this page Feb 27, 2020 · 23 revisions
- ASR: Automatic Speech Recognition
- data sets
  - Common Voice: An initiative from Mozilla to create an open speech data set with audio sourced from the community. It's also the name of the data set. It currently has over 1,000 hours of audio in English.
  - LibriSpeech: A popular speech data set with about 1,000 hours of English audio.
- decoding/transcribing: The process of converting audio to text.
  - online decoding: Streaming audio in chunks to be decoded in real time with multiple intermediate results and one final result.
  - offline decoding: Decoding one chunk of audio and getting one result. If using offline decoding to transcribe a microphone stream, you have to use a VAD to segment the audio and then decode each segment.
- FST: Finite State Transducer
- grapheme: The smallest unit of a writing system, such as a letter or a group of letters that represents a single phoneme.
- modes
  - command mode: When dictating commands. Used for general computer usage and programming. This is the default mode.
  - speech mode: When dictating natural language like words, phrases, and sentences.
- NFA: Nondeterministic Finite Automaton
- phoneme: A phonetic unit of a language.
- RTF: Real Time Factor
- speech recognition engine: Transcribes speech based on a given model.
- STT: Speech To Text
- TER:
- types of models
  - acoustic model: Models the relationship between an audio signal and phonemes.
  - language model: A statistical model for a language that assigns probabilities to words and phrases based on the context.
    - BERT: A pre-trained language model from Google that can be applied to a large array of NLP tasks.
    - GPT-2: A pre-trained language model from OpenAI that can be used to predict and generate text.
  - pronunciation model: Maps words to the phoneme sequences that pronounce them, connecting phonemes together to form words.
  - speech recognition engine model: A model that is fed to a speech recognition engine. It consists of several files, including the models above. Each engine has its own model requirements and formats, and usually offers some pre-trained models. Models are created by training on data using the engine they are intended for.
- VAD: Voice Activity Detection
- vocabulary: The set of words that a speech recognition engine can transcribe with a given speech model.
- voice typing: Using your voice as a form of computer input.
- WER: Word Error Rate
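
The offline decoding entry above says that a microphone stream has to be segmented with a VAD before each segment is decoded. The following is a minimal sketch of that flow, not how any particular engine does it. It assumes a simple energy threshold as a stand-in for a real VAD, and the names `segment_by_energy`, `transcribe_offline`, and the `decode` callback are hypothetical, introduced only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def segment_by_energy(
    samples: Sequence[float],
    frame_len: int = 160,
    threshold: float = 0.01,
    max_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of voiced segments.

    Assumption: a frame counts as speech when its energy reaches
    `threshold`. Real VADs use more robust features than raw energy.
    """
    segments: List[Tuple[int, int]] = []
    start = None   # sample index where the current segment began
    silence = 0    # consecutive silent frames inside the current segment
    n_frames = len(samples) // frame_len

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i * frame_len
            silence = 0
        elif start is not None:
            silence += 1
            if silence > max_silence_frames:
                # Close the segment at the end of the last voiced frame.
                segments.append((start, (i - silence + 1) * frame_len))
                start, silence = None, 0

    if start is not None:
        # Audio ended mid-segment; trim any trailing silent frames.
        segments.append((start, (n_frames - silence) * frame_len))
    return segments


def transcribe_offline(
    samples: Sequence[float],
    decode: Callable[[Sequence[float]], object],
    **vad_kwargs,
) -> list:
    """Decode each voiced segment separately, one result per segment."""
    return [
        decode(samples[s:e])
        for s, e in segment_by_energy(samples, **vad_kwargs)
    ]
```

Here `decode` stands in for whatever offline decoding call a given speech recognition engine exposes. Online decoding would instead feed chunks to the engine as they arrive and read intermediate results, so it would not need this segmentation step.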