Simple Khmer ASR Project

This repository contains a simple Khmer Automatic Speech Recognition (ASR) project built from scratch. Feel free to fork this repository, submit pull requests, or send suggestions on what should be improved! I did this just for fun and to celebrate my birthday :D

1. Data Collection and Preprocessing

Data Collection

YouTube Videos
- We crawl YouTube videos using yt_dlp and ffmpeg.
- Instructions: Just follow the links to download the tools and refer to the given channel names for crawling.
- yt_dlp: Download yt_dlp
- ffmpeg: Download ffmpeg
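
As a rough illustration of the crawling step, the sketch below downloads the audio track of a video or channel with the yt_dlp Python API and lets ffmpeg convert it to 16 kHz mono WAV, the sample rate Wav2Vec2 models are usually trained on. The helper names, the output folder, and the URL passed in are assumptions for illustration; the repository's own crawling script may work differently.

```python
import os


def build_audio_options(out_dir):
    """Build yt_dlp options that keep only the audio track as WAV."""
    return {
        "format": "bestaudio/best",
        # One file per video, named after the video id.
        "outtmpl": os.path.join(out_dir, "%(id)s.%(ext)s"),
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
        # Extra ffmpeg arguments: resample to 16 kHz, downmix to mono.
        "postprocessor_args": ["-ar", "16000", "-ac", "1"],
    }


def download_audio(url, out_dir="raw_audio"):
    """Download audio from a video or channel URL (hypothetical helper)."""
    # Imported here so the options helper can be inspected
    # even when yt-dlp is not installed.
    import yt_dlp

    os.makedirs(out_dir, exist_ok=True)
    with yt_dlp.YoutubeDL(build_audio_options(out_dir)) as ydl:
        ydl.download([url])
```

Calling `download_audio(channel_url)` with one of the channel URLs would then fill `raw_audio` with WAV files; ffmpeg must be available on the PATH for the conversion step.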

OpenSLR Dataset
- Alternatively, you can use the OpenSLR dataset, which is open-source.
- OpenSLR Dataset Link

Data Cleaning

Background Noise Removal
- We use Ultimate Vocal Remover for background noise removal.
- Ultimate Vocal Remover
- The code and the model files are provided separately in the Background_Noise folder.

Chunking
- Automated chunking is performed using Python. For non-stratified results, manual checking with Audacity is recommended.
- Download Audacity
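
One common way to automate chunking is to cut the audio wherever a long enough pause occurs. The sketch below is a minimal, standard-library-only version of that idea, not the repository's actual script: the function name, the amplitude threshold, and the frame and pause lengths are all assumptions that would need tuning on real recordings.

```python
def find_speech_segments(samples, sample_rate, frame_ms=20,
                         threshold=500, min_silence_ms=300):
    """Return (start, end) sample indices of regions separated by long pauses.

    samples is a sequence of integer PCM amplitudes (e.g. 16-bit values).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silence_frames = max(1, min_silence_ms // frame_ms)

    segments = []
    seg_start = None       # where the current segment began
    last_loud_end = 0      # index just past the last loud frame
    silent_run = 0         # consecutive quiet frames seen so far

    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        loud = max(abs(s) for s in frame) >= threshold
        if loud:
            if seg_start is None:
                seg_start = start
            last_loud_end = start + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            # A long enough pause closes the current segment.
            if silent_run >= min_silence_frames:
                segments.append((seg_start, last_loud_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, last_loud_end))
    return segments
```

Each returned pair can be used to slice the original sample array and write one chunk per segment. Short pauses inside a sentence are kept in the same chunk, which is why boundaries that look wrong are still worth checking by hand in Audacity.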

2. Transcription

Transcribe_New.py
- The script outputs three folders: 1_word, UNK (unknown), and non-transcript.
- Manual checking is recommended for perfect transcription accuracy.

3. Data Training

Wav2Vec2

About Wav2Vec2
- Wav2Vec2 is a self-supervised speech model developed by Facebook AI (now Meta AI). It learns representations directly from raw audio waveforms and, once fine-tuned for ASR, transcribes them into text.
- The model is fine-tuned using Connectionist Temporal Classification (CTC), so its output has to be decoded using Wav2Vec2CTCTokenizer.
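
To make the decoding step concrete, here is a minimal sketch of greedy CTC decoding: take the most likely token id per audio frame, collapse consecutive repeats, then drop the blank token. In practice Wav2Vec2CTCTokenizer does this for you; the tiny vocabulary below is made up for illustration, and the blank id and the `|` word delimiter are assumptions that mirror the tokenizer's usual defaults.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, remove blanks, and join tokens into text."""
    tokens = []
    prev = None
    for idx in frame_ids:
        # Keep a token only when it differs from the previous frame
        # and is not the CTC blank.
        if idx != prev and idx != blank_id:
            tokens.append(id_to_token[idx])
        prev = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical vocabulary: id 0 is the blank, id 1 the word delimiter.
vocab = {0: "<pad>", 1: "|", 2: "ក", 3: "ខ"}
text = ctc_greedy_decode([2, 2, 0, 2, 3, 3, 1, 1, 2], vocab)
```

Note that the blank between the two `ក` frames is what allows the same character to appear twice in a row in the output.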

Metrics

WER (Word Error Rate)
- WER is a metric used to evaluate the quality of transcriptions produced by ASR systems.
- In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript.
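
WER is usually defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It assumes both transcripts are already split into words by spaces, which for Khmer means running a word segmenter first; the function name is illustrative and not taken from the training code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. Because insertions are counted too, WER can exceed 1.0 when the hypothesis is much longer than the reference.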

4. Simple StreamLit Application

This application records your voice or takes an uploaded audio file and returns the transcript!

References

Problems

If you run into any problems with the model weights, these might help:
Seanghay Yath
Vituo Phy 1
Vituo Phy 2
