Terminology

Voice Typing

action: The callback that is executed when a transcript matches a rule.
CCR: Continuous Command Recognition.
choice: A grammar element and placeholder that can match one of several predefined words.
command: A pairing of a rule and an action.
grammar: The nested tree of grammar elements that a rule is compiled to. Also refers to the collection of all available rules that a transcript can be matched against.
grammar complexity: A measure of how large and intricate a grammar is, based on the number of rules and the complexity of their patterns.
grammar element: A node in a grammar's tree. Placeholders, including choices, are grammar elements.
keyword: A word literal that is specified in a rule.
match object: The result passed to an action, containing information about the rule that was matched and the transcript.
modes:
- command mode: The mode for dictating commands. Used for general computer use and programming. This is the default mode.
- speech mode: The mode for dictating natural language such as words, phrases, and sentences.
placeholder: A grammar element that acts as a variable for certain words. The value of a placeholder is added to the match object.
Osprey script: A Python file that includes user-specified commands and is loaded by Osprey at runtime.
rule: A pattern of words and placeholders that a transcript is matched against. Each rule maps to an action.
voice typing: Using your voice as a form of computer input.
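
To make the relationship between rule, keyword, choice, command, action, and match object concrete, here is a minimal toy sketch in plain Python. It is not Osprey's API: the names Command, Match, try_match, and dispatch, and the "<key>" placeholder syntax, are all hypothetical and exist only for this illustration. The rule is compiled to a regular expression rather than a real grammar tree.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Match:
    """Toy match object: the transcript plus the placeholder values."""
    transcript: str
    values: dict = field(default_factory=dict)


@dataclass
class Command:
    """Toy command: a rule (pattern) paired with an action (callback)."""
    rule: str
    action: Callable[[Match], str]
    # Hypothetical choice table: placeholder name -> words it may match.
    choices: dict = field(default_factory=dict)

    def compile(self) -> re.Pattern:
        parts = []
        for token in self.rule.split():
            if token.startswith("<") and token.endswith(">"):
                # Choice placeholder: matches one of several predefined words.
                name = token[1:-1]
                words = "|".join(re.escape(w) for w in self.choices[name])
                parts.append(f"(?P<{name}>{words})")
            else:
                # Keyword: a word literal that must appear as written.
                parts.append(re.escape(token))
        return re.compile(r"\s+".join(parts))

    def try_match(self, transcript: str) -> Optional[Match]:
        found = self.compile().fullmatch(transcript.strip().lower())
        if found is None:
            return None
        return Match(transcript, found.groupdict())


def dispatch(commands, transcript):
    """Run the action of the first command whose rule matches."""
    for command in commands:
        match = command.try_match(transcript)
        if match is not None:
            return command.action(match)
    return None


press = Command(
    rule="press <key>",
    action=lambda match: f"pressing {match.values['key']}",
    choices={"key": ["enter", "escape", "tab"]},
)

print(dispatch([press], "press tab"))
```

In this sketch "press" is a keyword, "<key>" is a choice, and the value the choice matched is handed to the action through the match object. A transcript such as "press banana" matches no rule, so dispatch returns None.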

Speech Recognition

ASR: Automatic Speech Recognition.
decoding/transcribing: The process of converting audio to text.
- online decoding: Streaming audio in chunks to be decoded in real time with multiple intermediate results and one final result.
- offline decoding: Decoding one chunk of audio and getting one result. To transcribe a microphone stream with offline decoding, first segment the audio with a VAD, then decode each segment.
enrollment/training: The process in which an individual speaker reads text or vocabulary to a speech recognition system to fine-tune the system for that speaker.
- speaker dependent: Systems that use enrollment/training.
- speaker independent: Systems that do not use enrollment/training.
model:
- language model:
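RTF: Real Time Factor. The time a speech recognition engine takes to decode some audio divided by the duration of that audio. Lower is faster, and a value below 1 means the engine decodes faster than real time.
SOTA: State Of The Art.
speech recognition engine: Software that transcribes the speech in some given audio based on a given model.
STT: Speech To Text.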
transcript: A sequence of words that a speech recognition engine generates based on some audio.
VAD: Voice Activity Detection.
vocabulary: The set of words that a speech recognition engine can transcribe with a given model.
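WER: Word Error Rate. The number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. A measurement of the accuracy of a speech recognition engine with a given model. Lower is better.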
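
The two measurements in this list, WER and RTF, can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: words are split on whitespace with no text normalization, and the function names word_error_rate and real_time_factor are made up for this example rather than taken from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Decoding time divided by audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# One substitution ("the" -> "a") and one deletion ("menu") over 4 reference words.
print(word_error_rate("open the file menu", "open a file"))  # 0.5
```

Because insertions count as errors, WER can exceed 1 when the transcript is much longer than the reference.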