Speaker Diarization

Introduction

Speaker Diarization is an important task in audio retrieval and processing. It is the process of partitioning an audio signal into segments according to the activity they contain. In its simplest form, this means separating speech regions from non-speech regions, where non-speech includes background music, silence, laughter, etc. A more advanced version assigns speaker labels to the speech regions: the number of speakers is determined through unsupervised learning, along with the time frames in which each speaker spoke over the course of the recording.
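The simplest case described above, separating active regions from silence, can be sketched with a short-time energy threshold. This is only an illustrative sketch and not a method taken from the report: the function names, the frame length of 160 samples, and the threshold of 0.01 are all assumptions. An energy threshold separates silence from sound only; telling speech apart from music or laughter would need a trained classifier.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def speech_regions(samples, frame_len=160, threshold=0.01):
    """Return (start_frame, end_frame) pairs, end exclusive, where energy > threshold.

    frame_len and threshold are illustrative values, not tuned ones.
    """
    energies = frame_energies(samples, frame_len)
    regions = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # an active region begins
        elif energy <= threshold and start is not None:
            regions.append((start, i))     # the active region ends
            start = None
    if start is not None:                  # signal ended while still active
        regions.append((start, len(energies)))
    return regions


# Synthetic signal: 2 silent frames, 3 frames of a 440 Hz tone, 2 silent frames.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
signal = [0.0] * 320 + tone + [0.0] * 320
print(speech_regions(signal))  # [(2, 5)]
```

Frame indices can be converted to seconds by multiplying by `frame_len / sample_rate`, assuming the sample rate of the recording is known.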

Some applications of Speaker Diarization are:

  • Speech-to-text Transcription / Rich Transcription (RT)
  • Broadcast News
  • Conference Meetings
  • Automatic caption generation for YouTube videos

Scope

In this project, we focus on the state-of-the-art methods and techniques used for Speaker Diarization and discuss their advantages and disadvantages.

Topics Discussed

Refer to the report: PDF

  • Dataset Used
  • Literature Survey of Methods Used
  • Feature Extraction Techniques
  • Signal Segmentation
  • Clustering Techniques
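
As a rough illustration of the clustering topic, the sketch below groups per-segment feature vectors into speaker labels without knowing the number of speakers in advance. It is not the report's method: it assumes centroid-linkage agglomerative clustering with a hypothetical `stop_distance` threshold, and the two-dimensional vectors are made-up stand-ins for real per-segment features.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def agglomerative_labels(features, stop_distance):
    """Merge the two closest clusters until none are within stop_distance.

    Returns one integer label per input vector. The number of clusters
    (speakers) is decided by stop_distance, an assumed tuning parameter.
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([features[i] for i in clusters[a]])
                cb = centroid([features[i] for i in clusters[b]])
                d = math.dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > stop_distance:
            break                          # remaining clusters are distinct
        _, a, b = best
        clusters[a].extend(clusters[b])    # merge the closest pair
        del clusters[b]
    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Five segments with made-up features; two well-separated groups.
segments = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.1, 4.9], [0.1, 0.2]]
print(agglomerative_labels(segments, stop_distance=1.0))  # [0, 1, 0, 1, 0]
```

Each label marks which segments are attributed to the same speaker. Combined with the segment time boundaries, this gives the "who spoke when" output of diarization.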

Conclusion

There has been tremendous progress in Speaker Diarization in recent years. It can be applied to phone conversations, broadcast news, and meeting recordings. Moreover, it has led to several by-products. Diarization techniques can further be applied to improve automatic rich transcription of videos. They can also identify the speakers in a video and index them, helping the user see who is speaking at any given moment. Overall, Speaker Diarization has broader potential than its current uses suggest, and there is scope for large improvements in the area, especially in the handling of overlapping speech, which needs to be attributed to multiple speakers.