
Terminology

Caleb Bassi edited this page May 2, 2020 · 23 revisions

Voice Typing

  • action: The callback that is executed when a transcript matches a rule.
  • CCR: Continuous Command Recognition.
  • choice: A grammar element and placeholder that can match one of several predefined words.
  • command: A pairing of a rule and an action.
  • grammar: A nested tree of grammar elements that a rule is compiled to. Also used to refer to the collection of all available rules that a transcript can be matched against.
  • grammar complexity: A measure of how complex a grammar is, based on the number of rules and the complexity of their patterns.
  • grammar element: A node in a grammar's tree. Choices and other placeholders are grammar elements.
  • keyword: A word literal that is specified in a rule.
  • match object: A result passed to an action with information based on the rule that was matched and the transcript.
  • modes:
    • command mode: The mode for dictating commands. Used for general computer usage and programming. This is the default mode.
    • speech mode: The mode for dictating natural language such as words, phrases, and sentences.
  • placeholder: A grammar element that acts as a variable for certain words. The value of a placeholder is added to the match object.
  • Osprey script: A Python file that includes user-specified commands and is loaded by Osprey at runtime.
  • rule: A pattern of words and placeholders that a transcript is matched against. Each rule maps to an action.
  • voice typing: Using your voice as a form of computer input.
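
To make the relationship between these terms concrete, here is a minimal Python sketch of how a rule, a choice, an action, a match object, and a command could fit together. Every name in it (`Choice`, `Command`, `match`, `dispatch`) is hypothetical and chosen for illustration; this is not Osprey's actual API, and the matching is deliberately simplified to one transcript word per grammar element.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Choice:
    """Hypothetical placeholder that matches one of several predefined words."""
    name: str
    options: Dict[str, str]  # spoken word -> value stored in the match object


# In this sketch a rule is a flat list of elements: a plain str is a keyword.
Element = Union[str, Choice]
Match = Dict[str, str]  # stand-in for a match object: placeholder name -> value


@dataclass
class Command:
    """A pairing of a rule and an action."""
    rule: List[Element]
    action: Callable[[Match], str]


def match(rule: List[Element], transcript: str) -> Optional[Match]:
    """Return a match object if the transcript fits the rule, else None."""
    words = transcript.split()
    if len(words) != len(rule):
        return None
    values: Match = {}
    for element, word in zip(rule, words):
        if isinstance(element, Choice):
            if word not in element.options:
                return None
            values[element.name] = element.options[word]
        elif element != word:  # keyword: must match the word literally
            return None
    return values


def dispatch(commands: List[Command], transcript: str) -> Optional[str]:
    """Run the action of the first command whose rule matches the transcript."""
    for command in commands:
        result = match(command.rule, transcript)
        if result is not None:
            return command.action(result)
    return None


direction = Choice("direction", {"up": "Up", "down": "Down"})
commands = [
    Command(rule=["scroll", direction],
            action=lambda m: "scroll:" + m["direction"]),
]
```

With these definitions, `dispatch(commands, "scroll up")` runs the action with the match object `{"direction": "Up"}`, while a transcript such as `"scroll sideways"` matches no rule and returns `None`.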

Speech Recognition

  • ASR: Automatic Speech Recognition
  • decoding/transcribing: The process of converting audio to text.
    • online decoding: Streaming audio in chunks to be decoded in real time with multiple intermediate results and one final result.
    • offline decoding: Decoding one chunk of audio and getting one result. To transcribe a microphone stream with offline decoding, a VAD must first segment the audio, and each segment is then decoded separately.
  • enrollment/training: When an individual speaker reads text or vocabulary to a speech recognition system to fine-tune the system for that individual.
    • speaker dependent: Systems that use enrollment/training.
    • speaker independent: Systems that do not use enrollment/training.
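  • model: The trained data that a speech recognition engine uses to transcribe audio.
    • language model: A model that assigns probabilities to sequences of words, which lets the engine favor likely transcripts.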
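  • RTF: Real Time Factor. The ratio of the time a speech recognition engine takes to decode audio to the duration of that audio. Lower is faster; a value below 1 means the engine decodes faster than real time.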
  • SOTA: State Of The Art
  • speech recognition engine: Transcribes speech from some given audio based on a given model.
  • STT: Speech To Text
  • transcript: A sequence of words that a speech recognition engine generates based on some audio.
  • VAD: Voice Activity Detection
  • vocabulary: The set of words that a speech recognition engine can transcribe with a given model.
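  • WER: Word Error Rate. The proportion of words that a speech recognition engine transcribes incorrectly with a given model, computed as the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. A measurement of accuracy; lower is better.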
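
The two measurements in this list, WER and RTF, can be sketched in a few lines of Python. This assumes the common definitions: WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, and RTF as processing time divided by audio duration. The function names are illustrative and do not come from any particular engine.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, `word_error_rate("open the file", "open a file")` counts one substitution across three reference words, and `real_time_factor(2.0, 4.0)` describes an engine that needs two seconds to decode four seconds of audio. Note that WER can exceed 1 when the hypothesis contains many inserted words.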