Terminology

Caleb Bassi edited this page Jan 3, 2021 · 23 revisions

Voice Typing

  • action: The callback that gets executed when a transcript matches a certain rule.
  • audio:
    • erroneous audio: Non-voice audio, such as breathing or background noise, that the microphone picks up and that should be filtered out.
    • voice audio: Audio from speaking that should be picked up by the voice typing program and converted into a command.
  • CCR: Continuous Command Recognition. A feature of a voice typing program that allows users to speak several commands consecutively without having to pause between each.
  • choice: A grammar element and placeholder that can match one of several predefined words.
  • command: A pairing of a rule and an action.
  • dictation: Speaking natural language, such as words, phrases, or sentences, as part of a command.
  • grammar: A nested tree of grammar elements that a rule is compiled into. Also used to refer to the collection of all available rules that a transcript can be matched against.
  • grammar complexity: A measure of how complex a grammar is, based on the number of rules and the complexity of their patterns.
  • grammar element: A building block of a grammar. There are different types of grammar elements that allow for building different patterns.
  • keyword: Any word that appears in a rule, including words in a choice element.
  • match object: A result passed to an action with information based on the rule that was matched and the transcript.
  • placeholder: A grammar element that acts as a variable for certain words. The value of a placeholder is added to the match object.
  • rule: A pattern of words and placeholders that a transcript is matched against. Each rule maps to an action.
  • voice typing: Using your voice to control your computer with key presses, commands, and dictation.
  • voice typing program: A desktop program that allows you to do voice typing.
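
The relationship between rule, choice, placeholder, match object, action, and command can be illustrated with a minimal sketch. Every name here (Choice, Command, match, dispatch) is hypothetical and chosen for illustration; this is not Osprey's API, and a real grammar would be a nested tree rather than a flat list of elements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Choice:
    """Hypothetical placeholder that matches one of several predefined words."""
    name: str
    options: Dict[str, str]  # spoken word -> value stored in the match object


# A rule element is either a literal keyword or a placeholder.
Element = Union[str, Choice]
MatchObject = Dict[str, str]


@dataclass
class Command:
    """A pairing of a rule and an action."""
    rule: List[Element]
    action: Callable[[MatchObject], str]


def match(rule: List[Element], transcript: str) -> Optional[MatchObject]:
    """Return a match object if the transcript fits the rule, else None."""
    words = transcript.split()
    if len(words) != len(rule):
        return None
    result: MatchObject = {}
    for element, word in zip(rule, words):
        if isinstance(element, Choice):
            if word not in element.options:
                return None
            # The placeholder's value is added to the match object.
            result[element.name] = element.options[word]
        elif element != word:
            return None
    return result


def dispatch(commands: List[Command], transcript: str) -> Optional[str]:
    """Run the action of the first command whose rule matches the transcript."""
    for command in commands:
        matched = match(command.rule, transcript)
        if matched is not None:
            return command.action(matched)
    return None


direction = Choice("direction", {"left": "Left", "right": "Right"})
press_arrow = Command(
    rule=["press", direction],
    action=lambda m: f"key:{m['direction']}",
)
```

With this sketch, dispatch([press_arrow], "press left") runs the action with the match object {"direction": "Left"}, while a transcript such as "press up" matches no rule and returns None.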

Speech Recognition

  • ASR: Automatic Speech Recognition. Another term for speech recognition.
  • decoding/transcribing: The process of converting audio to text.
    • online decoding: Streaming audio in chunks to be decoded in real time with multiple intermediate results and one final result.
    • offline decoding: Decoding one chunk of audio and getting one result. If using offline decoding to transcribe a microphone stream, you have to use a VAD to segment the audio and then decode each segment.
  • enrollment/training: When an individual speaker reads text or vocabulary to a speech recognition system to fine-tune the system for that individual.
    • speaker dependent: Systems that use enrollment/training.
    • speaker independent: Systems that do not use enrollment/training.
  • model:
    • language model:
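  • RTF: Real Time Factor. Measures how quickly a speech recognition engine is able to return results, expressed as the time taken to decode some audio divided by the duration of that audio. A value below 1 means the engine decodes faster than real time.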
  • SOTA: State Of The Art
  • speech recognition engine: Transcribes speech from some given audio based on a given model.
  • STT: Speech To Text
  • transcript: A sequence of words that a speech recognition engine generates based on some audio.
  • VAD: Voice Activity Detection. Detecting which parts of an audio stream contain speech.
  • vocabulary: The set of words that a speech recognition engine can transcribe with a given model.
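  • WER: Word Error Rate. The proportion of words that a speech recognition engine transcribes incorrectly with a given model, computed as the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. A measurement of accuracy, where lower is better.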
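
As a rough illustration of the two metrics in this list, the sketch below computes WER as the word-level edit distance between a reference and a transcript divided by the reference length, and RTF as decoding time divided by audio duration. The function names are hypothetical, and the WER function assumes plain whitespace tokenization with no text normalization, which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Decoding time divided by audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, transcribing the reference "open the file" as "open a file" has one substitution out of three reference words, and decoding 10 seconds of audio in 2 seconds gives an RTF of 0.2.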

Microphones

  • cardioid:

Osprey

  • Osprey script: A Python file that includes user-specified commands and is loaded by Osprey at runtime.