Terminology

Voice Typing

action: The callback that gets executed when a transcript matches a certain rule.
audio:
- erroneous audio: Non-voice audio, such as breathing or background noise, that the microphone picks up and that should be filtered out.
- voice audio: Audio from speaking that should be picked up by the voice typing program and converted into a command.
CCR: Continuous Command Recognition. A feature of a voice typing program that allows users to speak several commands consecutively without having to pause between each.
choice: A grammar element and placeholder that can match one of several predefined words.
command: A pairing of a rule and an action.
dictation: Speaking natural language, such as words, phrases, or sentences, as part of a command.
grammar: A nested tree of grammar elements that a rule is compiled into. Also used to refer to the collection of all available rules that a transcript can be matched against.
grammar complexity: A measure of the complexity of a grammar based on the number of rules and how complex the patterns are.
grammar element: A building block of a grammar. There are different types of grammar elements that allow for building different patterns.
keyword: Any word that appears in a rule, including words in a choice element.
match object: A result passed to an action with information based on the rule that was matched and the transcript.
placeholder: A grammar element that acts as a variable for certain words. The value of a placeholder is added to the match object.
rule: A pattern of words and placeholders that a transcript is matched against. Each rule maps to an action.
voice typing: Using your voice to control your computer with key presses, commands, and dictation.
voice typing program: A desktop program that allows you to do voice typing.
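
To show how the terms above fit together, here is a minimal sketch of a command being matched against a transcript. It is not Osprey's API: the names `Choice`, `Command`, `match`, and `run` are hypothetical and exist only for illustration, and the matcher assumes a flat rule made of keywords and choice placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Choice:
    """Hypothetical placeholder that matches one of several predefined words."""
    name: str
    options: Dict[str, str]  # spoken word -> value stored in the match object


@dataclass
class Command:
    """A pairing of a rule and an action."""
    rule: List[object]  # sequence of keywords (str) and placeholders (Choice)
    action: Callable[[dict], object]


def match(rule: List[object], transcript: str) -> Optional[dict]:
    """Return a match object (a plain dict here) or None if the rule does not match."""
    words = transcript.lower().split()
    if len(words) != len(rule):
        return None
    result = {}
    for element, word in zip(rule, words):
        if isinstance(element, Choice):
            if word not in element.options:
                return None
            # The value of a placeholder is added to the match object.
            result[element.name] = element.options[word]
        elif element != word:
            # A keyword must match the spoken word exactly.
            return None
    return result


def run(commands: List[Command], transcript: str) -> object:
    """Execute the action of the first command whose rule matches the transcript."""
    for command in commands:
        match_object = match(command.rule, transcript)
        if match_object is not None:
            return command.action(match_object)
    return None


direction = Choice("direction", {"up": "Up", "down": "Down"})
press_arrow = Command(
    rule=["go", direction],
    action=lambda m: f"press {m['direction']}",
)
```

With this sketch, `run([press_arrow], "go up")` calls the action with the match object `{"direction": "Up"}`, while a transcript such as "go left" matches no rule and no action runs.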

Speech Recognition

ASR: Automatic Speech Recognition. Another term for speech recognition.
decoding/transcribing: The process of converting audio to text.
- online decoding: Streaming audio in chunks to be decoded in real time with multiple intermediate results and one final result.
- offline decoding: Decoding one chunk of audio and getting one result. If using offline decoding to transcribe a microphone stream, you have to use a VAD to segment the audio and then decode each segment.
enrollment/training: When an individual speaker reads text or vocabulary to a speech recognition system to fine-tune the system for that individual.
- speaker dependent: Systems that use enrollment/training.
- speaker independent: Systems that do not use enrollment/training.
model:
- language model:
RTF: Real Time Factor. The ratio of the time a speech recognition engine takes to process audio to the duration of that audio. An RTF below 1 means the engine returns results faster than real time.
SOTA: State Of The Art
speech recognition engine: Transcribes speech from some given audio based on a given model.
STT: Speech To Text
transcript: A sequence of words that a speech recognition engine generates based on some audio.
VAD: Voice Activity Detection
vocabulary: The set of words that a speech recognition engine can transcribe with a given model.
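WER: Word Error Rate. The number of word substitutions, deletions, and insertions in a transcript divided by the number of words in the reference text. A measurement of accuracy for a speech recognition engine with a given model; lower is better.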
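
The two metrics in this section, WER and RTF, can be sketched in a few lines. This is an illustrative sketch using the common definitions (word-level edit distance divided by reference length, and processing time divided by audio duration); the function names `word_error_rate` and `real_time_factor` are hypothetical, and real evaluation tools usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the texts, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_seconds / audio_seconds
```

For example, transcribing the reference "open the file" as "open a file" is one substitution in three words, so `word_error_rate` returns 1/3. Decoding 2 seconds of audio in 0.5 seconds gives a `real_time_factor` of 0.25.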

Microphones

cardioid:

Osprey

Osprey script: A Python file that includes user-specified commands and is loaded by Osprey at runtime.
