voice-type-classifier

A deep learning model that classifies audio frames into the classes [SPEECH, KCHI, CHI, MAL, FEM].


SincNet- and LSTM-based Voice Type Classifier

[Figure: architecture of our model]

In this repository, you'll find all the code needed to apply a pre-trained model that, given an audio recording, classifies each frame into [SPEECH, KCHI, CHI, MAL, FEM] (a short sketch of the frame-wise output follows the list below).

  • FEM stands for female speech
  • MAL stands for male speech
  • KCHI stands for key-child speech (the child wearing the recording device)
  • CHI stands for other child speech
  • SPEECH stands for speech in general, regardless of who produces it (so it can overlap with the other classes)
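
To make the frame-wise output concrete, here is a minimal sketch that turns per-frame class scores into label sets. The helper and the 0.5 threshold are hypothetical, not this repository's API; the sketch only assumes one score per class per frame and a multi-label output, since SPEECH overlaps the speaker classes.

```python
import numpy as np

CLASSES = ["SPEECH", "KCHI", "CHI", "MAL", "FEM"]

def frames_to_labels(scores, threshold=0.5):
    """Binarize per-frame scores of shape (num_frames, 5) into label sets.

    Classes may overlap (SPEECH co-occurs with KCHI/CHI/MAL/FEM),
    so a single frame can carry several labels at once.
    """
    return [[cls for cls, score in zip(CLASSES, frame) if score >= threshold]
            for frame in scores]

# Two frames: key-child speech, then a non-speech frame.
scores = np.array([[0.92, 0.81, 0.07, 0.03, 0.10],
                   [0.08, 0.04, 0.02, 0.01, 0.05]])
print(frames_to_labels(scores))   # [['SPEECH', 'KCHI'], []]
```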

Our model was developed during JSALT [1] and its architecture is based on SincNet [2]. The code relies mainly on pyannote-audio [3], an excellent Python toolkit providing neural building blocks that can be combined to tackle speaker diarization.
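For intuition, here is a minimal PyTorch sketch of the SincNet + LSTM design: a sinc-parameterised band-pass filterbank over the raw waveform, a recurrent layer, and per-frame sigmoid scores for the five (possibly overlapping) classes. All layer sizes, strides, and hyper-parameters below are illustrative assumptions, not the values used by the pre-trained model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """Sinc-based convolution in the spirit of SincNet [2]: each filter is a
    band-pass whose only learned parameters are its low cut-off frequency
    and its bandwidth. Sizes here are illustrative assumptions."""

    def __init__(self, out_channels=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        # Low cut-offs spread over the spectrum, plus learnable bandwidths.
        self.low_hz = nn.Parameter(
            torch.linspace(30.0, sample_rate / 2 - 300.0, out_channels).unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        half = kernel_size // 2
        self.register_buffer("t", torch.arange(-half, half + 1) / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                      # x: (batch, 1, samples)
        f1 = self.low_hz.abs()
        f2 = f1 + self.band_hz.abs()
        # A band-pass filter is the difference of two low-pass sinc
        # filters; a Hamming window smooths the truncated kernel.
        kernel = (2 * f2 * torch.sinc(2 * f2 * self.t)
                  - 2 * f1 * torch.sinc(2 * f1 * self.t)) * self.window
        return F.conv1d(x, kernel.unsqueeze(1), stride=80)

class VoiceTypeClassifier(nn.Module):
    """Toy SincNet + bi-LSTM frame classifier with one sigmoid per class,
    since classes (e.g. SPEECH and KCHI) may be active simultaneously."""

    def __init__(self, num_classes=5):
        super().__init__()
        self.sinc = SincConv()
        self.lstm = nn.LSTM(40, 128, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.frame_classifier = nn.Linear(2 * 128, num_classes)

    def forward(self, waveform):               # (batch, samples)
        x = torch.relu(self.sinc(waveform.unsqueeze(1)))  # (batch, 40, frames)
        x, _ = self.lstm(x.transpose(1, 2))               # (batch, frames, 256)
        return torch.sigmoid(self.frame_classifier(x))    # (batch, frames, 5)

scores = VoiceTypeClassifier()(torch.randn(1, 16000))     # one second of audio
print(scores.shape)                                       # torch.Size([1, 197, 5])
```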

How to use?

  1. Disclaimer /!\
  2. Installation
  3. Applying
  4. Evaluation
  5. Going further

References

[1] Paola Garcia et al., "Speaker detection in the wild: Lessons learned from JSALT 2019", arXiv.

[2] Mirco Ravanelli, Yoshua Bengio, "Speaker Recognition from Raw Waveform with SincNet", arXiv.

[3] Hervé Bredin et al., "pyannote.audio: neural building blocks for speaker diarization", arXiv.