
Primary language: Jupyter Notebook. License: Eclipse Public License 2.0 (EPL-2.0).

Whisper-Hindi-ASR-model-IIT-Bombay-Internship

The Whisper Hindi ASR (Automatic Speech Recognition) model is trained on the KathBath dataset, a collection of Hindi speech samples, to transcribe spoken Hindi into text. Its neural network architecture is intended to handle the nuances of spoken Hindi, including regional accents and dialects. Target applications range from transcription services to voice-enabled interfaces.

The KathBath dataset is the training data for this model. It contains speech samples sourced from different regions, covering a wide range of topics and conversational styles. This diversity helps the model generalize to real-world scenarios and perform robustly in different contexts.

ASR output is evaluated with the Word Error Rate (WER), which is computed by comparing the model's transcription to a reference transcription. Lower WER values indicate better performance.

Word Error Rate (WER)

  • Word Error Rate (WER) is a metric used to evaluate the performance of automatic speech recognition (ASR) systems. It measures the disparity between the transcribed output generated by the ASR system and the reference or ground truth transcription.
  • WER is the total number of errors (substitutions, deletions, and insertions) needed to turn the transcribed text into the reference text, divided by the total number of words in the reference text.
  • WER accounts for misrecognized, omitted, and spuriously inserted words. Lower values indicate higher accuracy: 0 means the transcription matches the reference exactly, and because insertions count as errors, WER can exceed 1 (100%).
  • WER is a widely used evaluation metric in the field of speech recognition, providing valuable insights into the performance and quality of ASR models across different languages and applications.
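The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in this repository: the function names `word_errors` and `wer` are hypothetical, words are split on whitespace only, and no text normalization (punctuation, numerals, Unicode forms) is applied, although real evaluations usually normalize both texts first.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the hypothesis into the reference (Levenshtein distance)."""
    # prev[j] = errors between the first i-1 reference words and first j hypothesis words
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level errors divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    reference = "मैं आज स्कूल जा रहा हूँ"
    hypothesis = "मैं आज स्कूल जा रही हूँ"  # one substituted word out of six
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

In the example, one of the six reference words is substituted, so the WER is 1/6. A hypothesis with many extra words can push the value above 1, since insertions are counted against the reference length.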

Whisper Model

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

Approach

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
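To make the role of the special tokens concrete, the sketch below assembles the token sequence that tells the decoder which language and task to handle. It is illustrative only: `decoder_prompt` is a hypothetical helper rather than part of the Whisper package, it works on token strings rather than the integer IDs the model actually consumes, and the token spellings follow the multitask format published with Whisper.

```python
from typing import List


def decoder_prompt(language: str, task: str = "transcribe",
                   timestamps: bool = False) -> List[str]:
    """Build the special-token prefix for one decoding request.

    language: language code such as "hi" for Hindi
    task: "transcribe" (same-language text) or "translate" (English text)
    timestamps: if False, ask the decoder to skip timestamp tokens
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    tokens = [
        "<|startoftranscript|>",  # marks the start of the prediction
        f"<|{language}|>",        # language specifier
        f"<|{task}|>",            # task specifier
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


if __name__ == "__main__":
    print("".join(decoder_prompt("hi")))
    print("".join(decoder_prompt("hi", task="translate", timestamps=True)))
```

Because the task is selected by tokens in the decoder input, switching from Hindi transcription to Hindi-to-English translation changes only the prefix, not the model.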

Setup

We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably OpenAI's tiktoken for their fast tokenizer implementation. You can download and install (or update to) the latest release of Whisper with the following command:

  pip install -U openai-whisper

Alternatively, the following command will pull and install the latest commit from the openai/whisper repository, along with its Python dependencies:

  pip install git+https://github.com/openai/whisper.git

To update the package to the latest version of the openai/whisper repository, please run:

  pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:

  # on Ubuntu or Debian
  sudo apt update && sudo apt install ffmpeg

  # on Arch Linux
  sudo pacman -S ffmpeg

  # on macOS using Homebrew (https://brew.sh/)
  brew install ffmpeg

  # on Windows using Chocolatey (https://chocolatey.org/)
  choco install ffmpeg

  # on Windows using Scoop (https://scoop.sh/)
  scoop install ffmpeg
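After installation, a quick check can confirm that the prerequisites described above are in place. The following is a minimal sketch using only the Python standard library. The helper name `check_environment` is hypothetical, and the default version bounds simply mirror the 3.8-3.11 range quoted above, which may not match what newer Whisper releases support.

```python
import shutil
import sys
from typing import Dict, Tuple


def check_environment(min_version: Tuple[int, int] = (3, 8),
                      max_version: Tuple[int, int] = (3, 11)) -> Dict[str, object]:
    """Report the Python version and whether ffmpeg is discoverable on PATH."""
    version = sys.version_info[:2]
    return {
        "python_version": f"{version[0]}.{version[1]}",
        "python_in_expected_range": min_version <= version <= max_version,
        # shutil.which returns the executable's full path, or None if not found
        "ffmpeg_path": shutil.which("ffmpeg"),
    }


if __name__ == "__main__":
    report = check_environment()
    for name, value in report.items():
        print(f"{name}: {value}")
    if report["ffmpeg_path"] is None:
        print("ffmpeg was not found on PATH; install it with one of the commands above.")
```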