
serendipity24/emrai-synthetic-diarization-corpus

 
 


Synthetic Diarization Corpus

Introduction

This synthetic corpus of dialogs was constructed from the LibriSpeech corpus and is freely available for diarization research. It includes over 90 hours of training data, and over 9 hours each of development and test data. Both 2-person and 3-person dialogs, with and without overlap, are included. Timing information is provided in several formats and covers both speaker segmentations and phoneme segmentations. This makes the corpus a useful starting point for diarization system development, particularly at an early stage.

How to use

The corpus contains 4 top-level directories:
librispeech2: 2-person dialogs
librispeech2o: 2-person dialogs with overlap
librispeech3: 3-person dialogs
librispeech3o: 3-person dialogs with overlap


All sub-directories are "Kaldi table" data directories. Audio files are 16 kHz, 16-bit little-endian PCM, mono.
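To sanity-check a download against the audio format stated above, something like the following sketch may help. It assumes the audio is stored in RIFF/WAV containers, which is not stated here, and the function name `check_wav_format` is hypothetical.

```python
import wave


def check_wav_format(path):
    """Return True if the file is 16 kHz, 16-bit, mono PCM.

    Assumption: the file is a RIFF/WAV container that the standard
    library `wave` module can open.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000   # 16 kHz sampling rate
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16 bit
            and wav.getnchannels() == 1   # mono
        )
```

Calling `check_wav_format` on each waveform before feature extraction is a cheap way to catch a file that was resampled or converted by accident.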

Formats

ctm - each line is F C BT DUR word
Where:
F The waveform filename. NOTE: no pathnames or extensions are expected.
C The speaker.
BT The begin time (seconds) of the segment, measured from the start time of the file.
DUR The duration (seconds) of the segment.
word The word label for the segment.
labs - each line is a speaker id, or 0 for pauses. One line corresponds to 0.01 seconds of audio.
rttm0 - Rich Transcription Time Marked file format. The full specification can be found in Appendix A of NIST's "The 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation Plan".
rttm - merged rttm0, without pauses
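
As a rough illustration of the ctm and labs descriptions above, the sketch below parses one ctm line and collapses a labs sequence into speaker segments. The names `CtmEntry`, `parse_ctm_line` and `labs_to_segments` are hypothetical, and the sample values in the usage note are made up. It also assumes that each labs line holds a single label, which may not hold for the overlap directories.

```python
from typing import List, NamedTuple, Tuple

FRAME_SECONDS = 0.01  # one labs line covers 0.01 seconds of audio


class CtmEntry(NamedTuple):
    file_id: str     # F: waveform filename, no path or extension
    speaker: str     # C: speaker
    begin: float     # BT: begin time in seconds
    duration: float  # DUR: duration in seconds
    word: str        # last field on the line


def parse_ctm_line(line: str) -> CtmEntry:
    """Split one whitespace-separated ctm line into typed fields."""
    f, c, bt, dur, word = line.split()[:5]
    return CtmEntry(f, c, float(bt), float(dur), word)


def labs_to_segments(labels: List[str]) -> List[Tuple[str, float, float]]:
    """Collapse per-frame labels into (speaker, begin, duration) runs.

    Frames labelled "0" are pauses and are skipped.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of input or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "0":
                begin = round(start * FRAME_SECONDS, 2)
                duration = round((i - start) * FRAME_SECONDS, 2)
                segments.append((labels[start], begin, duration))
            start = i
    return segments
```

For example, `parse_ctm_line("rec1 spk1 0.50 0.25 hello")` yields an entry with `begin == 0.5` and `duration == 0.25`, and `labs_to_segments(["0", "0", "1", "1", "1", "0", "2"])` yields one three-frame segment for speaker `1` starting at 0.02 s and one single-frame segment for speaker `2` starting at 0.06 s.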

This corpus is licensed under CC BY 4.0, and any use of it must cite the following reference:

Edwards, E., Brenndoerfer, M., Robinson, A., Sadoughi, N., Finley, G. P., Korenevsky, M., Axtmann, N., Miller, M. & Suendermann-Oeft, D. (2018, September). A Free Synthetic Corpus for Speaker Diarization Research. In International Conference on Speech and Computer (pp. 113-122). Springer, Cham.

BibTeX

@inproceedings{edwards2018free,
  title={A Free Synthetic Corpus for Speaker Diarization Research},
  author={Edwards, Erik and Brenndoerfer, Michael and Robinson, Amanda and Sadoughi, Najmeh and Finley, Greg P and Korenevsky, Maxim and Axtmann, Nico and Miller, Mark and Suendermann-Oeft, David},
  booktitle={International Conference on Speech and Computer},
  pages={113--122},
  year={2018},
  organization={Springer}
}

Based on the LibriSpeech ASR corpus
