awesome-speaker-recognition

A curated list of notable research on speaker recognition, identification, and verification.

Review/survey papers

Pre-Deep learning

  1. Speaker Verification Using Adapted Gaussian Mixture Models, Reynolds et al. 2000 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.338&rep=rep1&type=pdf)
  2. Front-end factor analysis for speaker verification, Dehak et al. 2010 (https://ieeexplore.ieee.org/document/5545402)
  3. Channel robust speaker verification via feature mapping, Reynolds 2003, ICASSP (https://ieeexplore.ieee.org/abstract/document/1202292/)

Speech features

  1. Multi-Channel Speaker Verification for Single and Multi-talker Speech, Kataria et al. 2021 (https://arxiv.org/abs/2010.12692)

Front-end

  1. Speaker recognition from raw waveform with SincNet, Ravanelli et al. 2018 (https://arxiv.org/abs/1808.00158)

Back-end

  1. Graph Attention Networks for Speaker Verification, Jung et al. 2020 (https://arxiv.org/abs/2010.11543)
  2. Ferrer, Luciana, Mitchell McLaren, and Niko Brummer. "A Speaker Verification Backend with Robust Performance across Conditions." arXiv preprint arXiv:2102.01760 (2021). (https://arxiv.org/abs/2102.01760)
  3. Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA?, Wang et al. 2022, Interspeech 2022 (https://www.isca-speech.org/archive/interspeech_2022/wang22r_interspeech.html)

Architectures

  1. AutoSpeech: Neural architecture search for speaker recognition, Ding et al. 2020 (https://arxiv.org/abs/2005.03215)
  2. Pushing the limits of raw waveform speaker recognition, Jung et al. 2022 (https://arxiv.org/abs/2203.08488)

Pooling

  1. Exploring the encoding layer and loss function in end-to-end speaker and language recognition system, Cai et al. 2018 (https://arxiv.org/abs/1804.05160)
  2. Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition, Xiang et al. 2019 (https://arxiv.org/abs/1906.07317)

(Towards?) End-to-end

  1. Garcia-Romero, Daniel, Gregory Sell, and Alan McCree. "Magneto: X-vector magnitude estimation network plus offset for improved speaker recognition." Proc. Odyssey 2020 The Speaker and Language Recognition Workshop. 2020. (https://www.isca-speech.org/archive/Odyssey_2020/pdfs/65.pdf)

Representation learning

  1. Deep Speaker: an End-to-End Neural Speaker Embedding System, Li et al. 2017 (https://arxiv.org/abs/1705.02304)

With self-supervised learning

  1. Learning Speaker Embedding with Momentum Contrast, Ding et al. 2020 (https://arxiv.org/abs/2001.01986)
  2. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, Chen et al. 2021 (https://arxiv.org/abs/2110.13900)

With speaker diarization

With speech enhancement

  1. VoiceID Loss: Speech Enhancement for Speaker Verification, Shon et al. 2019 (https://arxiv.org/abs/1904.03601)
  2. Feature enhancement with deep feature losses for speaker verification, Kataria et al. 2019 (https://arxiv.org/abs/1910.11905)

With domain adaptation

  1. Cycle-GANs for domain adaptation of acoustic features for speaker recognition, Nidadavolu et al. 2019 (https://ieeexplore.ieee.org/document/8683055)

Joint learning

Multi-modal

  1. A Study of Multimodal Person Verification Using Audio-Visual-Thermal Data, Abdrakhmanova et al. 2021 (https://arxiv.org/abs/2110.12136)
  2. Face-Mic: inferring live speech and speaker identity via subtle facial dynamics captured by AR/VR motion sensors, Shi et al. 2021 (https://dl.acm.org/doi/abs/10.1145/3447993.3483272)

Metrics

  1. The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF, Brummer et al., 2013 (https://arxiv.org/abs/1304.2865)

System Descriptions

  1. Beijing ZKJ-NPU Speaker Verification System for VoxCeleb Speaker Recognition Challenge 2021, Zhang et al., 2021 (https://arxiv.org/abs/2109.03568)

Miscellaneous

Datasets

  1. Fan, Yue, et al. "CN-CELEB: a challenging Chinese speaker recognition dataset." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020. (https://ieeexplore.ieee.org/abstract/document/9054017)

Theses

  1. Villalba, J. Advances on speaker recognition in non collaborative environments. Ph.D. dissertation, University of Zaragoza, 2014.
  2. Brummer, Niko. Measuring, refining and calibrating speaker and language information extracted from speech. Diss. Stellenbosch: University of Stellenbosch, 2010. (http://scholar.sun.ac.za/handle/10019.1/5139)

Books

  1. Mak, Man-Wai, and Jen-Tzung Chien. Machine learning for speaker recognition. Cambridge University Press, 2020. (http://www.eie.polyu.edu.hk/~mwmak/papers/spkver-book_toc.pdf)

Software

  1. Hyperion, Villalba et al., 2019 (https://github.com/jsalt2019-diadet/hyperion/tree/14a11436d62f3c15cd9b1f70bcce3eafbea2f753)
  2. SpeechBrain, Ravanelli et al., 2021 (https://github.com/speechbrain/speechbrain)
  3. Angular Prototypical Loss, Chung et al. 2020 (https://arxiv.org/abs/2003.11982)
  4. BOSARIS, multiple versions (https://github.com/bsxfan/PYLLR, https://projets-lium.univ-lemans.fr/sidekit/api/bosaris/index.html, https://gitlab.eurecom.fr/nautsch/pybosaris)