Synthetic Diarization Corpus

Introduction

A synthetic corpus of dialogs was constructed from the LibriSpeech corpus, and is made freely available for diarization research. It includes over 90 hours of training data, and over 9 hours each of development and test data. Both 2-person and 3-person dialogs, with and without overlap, are included. Timing information is provided in several formats, and includes not only speaker segmentations, but also phoneme segmentations. As such, it is a useful starting point for general, particularly early-stage, diarization system development.

How to use

The corpus contains 4 top-level directories:
librispeech2: 2-person dialogs
librispeech2o: 2-person dialogs with overlap
librispeech3: 3-person dialogs
librispeech3o: 3-person dialogs with overlap

All sub-directories are "Kaldi table" data directories. Audio files are encoded as 16 kHz, 16-bit, little-endian, mono PCM.
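As a quick sanity check of the audio encoding described above, the following is a minimal sketch using Python's standard `wave` module. It assumes the audio is stored in WAV containers, which this README does not state; the function name `check_wav_format` is hypothetical and not part of the corpus.

```python
import wave


def check_wav_format(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz, 16-bit, mono, uncompressed PCM."""
    with wave.open(path, "rb") as w:
        return (
            w.getframerate() == 16000   # 16 kHz sample rate
            and w.getsampwidth() == 2   # 2 bytes per sample = 16-bit
            and w.getnchannels() == 1   # mono
            and w.getcomptype() == "NONE"  # uncompressed PCM
        )
```

Calling `check_wav_format` on each audio file before training can catch files that were resampled or re-encoded by accident.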

Formats

ctm - each line is F C BT DUR word
Where:
F The waveform filename. NOTE: no pathnames or extensions are expected.
C Speaker.
BT The begin time (seconds) of the segment, measured from the start time of the file.
DUR The duration (seconds) of the segment.
labs - each line is a speaker id, or 0 for pauses. Each line corresponds to 0.01 seconds of audio.
rttm0 - Rich Transcription Time Marked file format. The full specification can be found in Appendix A of NIST's "The 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation Plan".
rttm - merged rttm0, without pauses
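
The formats above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `Segment`, `parse_ctm_line`, `labs_to_segments` and `parse_rttm_speaker` are hypothetical, the ctm fields are assumed to be whitespace-separated, and the rttm files are assumed to follow the standard RT-09 field order (type, file, channel, begin time, duration, ortho, stype, name, conf).

```python
from typing import List, NamedTuple, Optional, Tuple

FRAMES_PER_SEC = 100  # one labs line covers 0.01 s of audio


class Segment(NamedTuple):
    speaker: str
    begin: float  # seconds from the start of the file
    dur: float    # seconds


def parse_ctm_line(line: str) -> Tuple[str, str, float, float, str]:
    """Split one ctm line (F C BT DUR word) into typed fields."""
    f, c, bt, dur, word = line.split()
    return f, c, float(bt), float(dur), word


def labs_to_segments(labels: List[str]) -> List[Segment]:
    """Collapse per-frame labels into contiguous segments, dropping pauses ("0")."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of input or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "0":
                segments.append(
                    Segment(labels[start], start / FRAMES_PER_SEC, (i - start) / FRAMES_PER_SEC)
                )
            start = i
    return segments


def parse_rttm_speaker(line: str) -> Optional[Segment]:
    """Return a Segment for an RTTM SPEAKER record, or None for other record types."""
    fields = line.split()
    if not fields or fields[0] != "SPEAKER":
        return None
    # Assumed RT-09 order: type file chnl tbeg tdur ortho stype name conf
    return Segment(fields[7], float(fields[3]), float(fields[4]))
```

For example, `labs_to_segments` turns the frame labels read from a labs file into speaker segments with begin times and durations in seconds, which can then be compared against the segments parsed from the corresponding rttm file.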

This corpus is licensed under CC BY 4.0. Any use of the corpus must cite the following reference:

Edwards, E., Brenndoerfer, M., Robinson, A., Sadoughi, N., Finley, G. P., Korenevsky, M., Axtmann, N., Miller, M. & Suendermann-Oeft, D. (2018, September). A Free Synthetic Corpus for Speaker Diarization Research. In International Conference on Speech and Computer (pp. 113-122). Springer, Cham.

Bibtex

@inproceedings{edwards2018free,
  title={A Free Synthetic Corpus for Speaker Diarization Research},
  author={Edwards, Erik and Brenndoerfer, Michael and Robinson, Amanda and Sadoughi, Najmeh and Finley, Greg P and Korenevsky, Maxim and Axtmann, Nico and Miller, Mark and Suendermann-Oeft, David},
  booktitle={International Conference on Speech and Computer},
  pages={113--122},
  year={2018},
  organization={Springer}
}

Based on the LibriSpeech ASR corpus


serendipity24/emrai-synthetic-diarization-corpus
