Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR

Vistaar is a set of 59 benchmarks and training datasets covering various language and domain combinations, with domains such as news, education, literature and tourism. The training datasets are available for 12 Indian languages and amount to over 10,700 hours of labelled audio data. We also train IndicWhisper models by fine-tuning the Whisper models on the Vistaar training data and observe that they achieve the lowest WER on 39 out of 59 Vistaar benchmarks.

Benchmarks

Vistaar consists of benchmarks from several public datasets (Kathbath, FLEURS, CommonVoice, IndicTTS, MUCS and GramVaani) across 12 languages. We evaluate IndicWhisper on these benchmarks along with 3 publicly available ASR systems and 2 commercially available systems. The results are given below.

Datasets       bn    gu    hi    kn    ml    mr    or    pa    sa    ta    te    ur    avg
Kathbath       16.6  17.8  10.3  19.3  34.8  19.9  24.7  16.9  45.6  24.2  25.0  11.9  22.3
Kathbath Hard  19.4  20.6  12.0  22.2  38.4  22.1  29.1  19.7  50.5  27.5  27.8  14.7  25.3
CommonVoice    24.7  -     11.4  -     44.5  22.8  35.2  22.4  -     29.2  -     31.7  27.8
FLEURS         20.9  23.5  15.0  18.6  22.6  20.5  32.9  23.1  -     25.2  25.4  19.2  22.5
IndicTTS       18.8  19.1  7.6   13.2  21.4  11.4  15.0  -     -     17.2  33.8  -     17.5
MUCS           -     33.2  12.0  -     -     12.8  27.5  -     -     28.3  32.1  -     24.3
Gramvaani      -     -     26.8  -     -     -     -     -     -     -     -     -     26.8
Average        20.1  22.8  13.6  18.3  32.3  18.2  27.4  20.5  48.0  25.3  28.8  19.4  24.6

Word error rates (%) of IndicWhisper on the Vistaar benchmarks. Values in bold indicate benchmarks where IndicWhisper has the lowest WER.
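
For readers unfamiliar with the metric: WER is the word-level edit distance between the reference transcript and the predicted transcript, divided by the number of reference words. The snippet below is a minimal sketch of that definition only. It is not the scoring code behind the numbers above, the name `word_error_rate` is illustrative, and any text normalization applied before scoring is not shown.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words, reported as a percentage.
    print(f"{100 * word_error_rate('यह एक परीक्षण है', 'यह एक परीक्षा है'):.1f}")
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100%.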

Model          Kathbath  Kathbath-Hard  FLEURS  CommonVoice  IndicTTS  MUCS  Gramvaani  Average
Google STT     14.3      16.7           19.4    20.8         18.3      17.8  59.9       23.9
IndicWav2vec   12.2      16.2           18.3    20.2         15.0      22.9  42.1       21.0
Azure STT      13.6      15.1           24.3    14.6         15.2      15.1  42.3       20.0
Nvidia-medium  14.0      15.6           19.4    20.4         12.3      12.4  41.3       19.4
Nvidia-large   12.7      14.2           15.7    21.2         12.2      11.8  42.6       18.6
IndicWhisper   10.3      12.0           11.4    15.0         7.6       12.0  26.8       13.6

Comparison of publicly available and commercial ASR systems on the Hindi subset of the Vistaar benchmarks (WER, %).

Resources

Download Training Datasets and Benchmarks

Datasets       Training Datasets  Benchmarks
Kathbath       link               link
Kathbath Hard  link               link
CommonVoice    link               link
FLEURS         link               link
IndicTTS       link               link
MUCS           link               link
Gramvaani      link               link

Download Models

Language   Model Checkpoint
Bengali    hf
Gujarati   hf
Hindi      hf
Kannada    hf
Malayalam  hf
Marathi    hf
Odia       hf
Punjabi    hf
Sanskrit   hf
Tamil      hf
Telugu     hf
Urdu       hf

Tutorials

Setting up your environment

  • Setting up a Python virtual environment
    python -m venv <env_name>
    source <env_name>/bin/activate
    
  • Installing/Updating Libraries
    sudo apt install ffmpeg
    pip install -r requirements.txt
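
After the steps above, a quick sanity check can confirm that the virtual environment is active and that `ffmpeg` is reachable. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper and not part of this repository.

```python
import shutil
import sys


def check_environment() -> dict:
    """Report whether a virtual environment is active and ffmpeg is on PATH."""
    return {
        # Inside a venv, sys.prefix points at the environment, not the base install.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # shutil.which returns None when the executable cannot be found on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```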
    

Evaluating ASR models

  • Manifest creation
    • For each dataset, download and extract the benchmark data into a directory. Extract the data so that each folder inside contains the data for one language, i.e. each language-specific folder should contain train, valid and test folders, and within them the audio and transcript.txt
    • Sample structure of folder tree:
      kathbath
          ├── bengali
          │   ├── audio
          │   └── transcript.txt
          │
          └── gujarati
              ├── audio
              └── transcript.txt 
          .
          .
          .
          .
    
    • The manifest needs to be a JSON Lines file: each line is a JSON dictionary with the audio file path, the duration of the audio and the text transcript
     {"audio_filepath": <path to audio file 1>, "duration": <duration of audio file 1>, "text": <transcript of audio file 1>}
     {"audio_filepath": <path to audio file 2>, "duration": <duration of audio file 2>, "text": <transcript of audio file 2>}
     .
     .
     .
    
  • Running evaluation
    python evaluation.py --model_path=<model path> \
    --manifest_path=<manifest path in vistaar> \
    --manifest_name=<dataset name in vistaar> \
    --device=<gpu to use> \
    --batch_size=<batch size> \
    --language=<2 letter language code>
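
The manifest creation step described above could be scripted roughly as follows. This is a minimal sketch under assumptions that this README does not state: the audio files are WAV (so the duration can be read with the standard-library `wave` module), and each line of transcript.txt has the form `<audio file name><TAB><transcript>`. `build_manifest` is a hypothetical helper, not part of this repository, so adapt the parsing to the actual transcript layout of each dataset.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Duration of a WAV file, computed as frame count / sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def build_manifest(language_dir: Path, manifest_path: Path) -> int:
    """Write one JSON object per line and return the number of entries written."""
    audio_dir = language_dir / "audio"
    count = 0
    with open(language_dir / "transcript.txt", encoding="utf-8") as transcripts, \
            open(manifest_path, "w", encoding="utf-8") as manifest:
        for line in transcripts:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # ASSUMPTION: "<audio file name>\t<transcript>" on every line.
            file_name, text = line.split("\t", maxsplit=1)
            audio_path = audio_dir / file_name
            entry = {
                "audio_filepath": str(audio_path),
                "duration": wav_duration_seconds(audio_path),
                "text": text.strip(),
            }
            # ensure_ascii=False keeps Indic scripts readable in the manifest.
            manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `build_manifest(Path("kathbath/bengali"), Path("kathbath/bengali/manifest.json"))` would write one line per utterance; the output file name here is illustrative only.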
    

Inference using IndicWhisper

  • Sample structure of manifest file
{"audio_filepath":<path to audio file 1>}
{"audio_filepath":<path to audio file 2>}
.
.
.
  • Running batch inference
deepspeed --include localhost:<gpus to include> \
transcribe.py <manifest path> \
<model path> \
<current language> \
<batch size> \
<output path>
  • Running inference for a single audio file
from transformers import pipeline

model_path = "hindi_models/whisper-medium-hi_alldata_multigpu"
device = "cuda"
lang_code = "hi"

whisper_asr = pipeline(
    "automatic-speech-recognition", model=model_path, device=device,
)

# Special case for Odia ("or"), which the Whisper model does not support: do not force a language
if lang_code == 'or':
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=None, task="transcribe"
        )
    )
else:
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=lang_code, task="transcribe"
        )
    )

result = whisper_asr("audio.mp3")
print(result["text"])
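
The inference manifest shown at the start of this section (one `{"audio_filepath": ...}` object per line) can be generated from a folder of audio files. The following is a minimal sketch; `write_inference_manifest` and the default list of file extensions are assumptions for illustration, not part of this repository.

```python
import json
from pathlib import Path


def write_inference_manifest(audio_dir, manifest_path, extensions=(".wav", ".mp3", ".flac")):
    """Write one {"audio_filepath": ...} JSON object per line; return the entry count."""
    audio_files = sorted(
        path for path in Path(audio_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
    with open(manifest_path, "w", encoding="utf-8") as manifest:
        for path in audio_files:
            entry = {"audio_filepath": str(path)}
            manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(audio_files)
```

The files are sorted so that repeated runs produce the same manifest order, which makes the batch outputs easier to compare.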

Training on Vistaar Train Datasets

  • Manifest creation
  • Running training
deepspeed --include localhost:<gpus to include> training.py \
--deepspeed=<path to deepspeed config file> \
--model_name_or_path=<model path> \
--dataset_name=<dataset language directory path> \
--language=<language> \
--train_split_name=<train manifest name> \
--eval_split_name=<validation manifest name> \
--max_steps="5000" \
--output_dir=<output directory path> \
--cache_dir=<cache directory for downloaded models> \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="32" \
--gradient_accumulation_steps="1" \
--logging_steps="10" \
--learning_rate="1e-5" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--eval_steps="500" \
--save_strategy="steps" \
--save_steps="500" \
--generation_max_length="225" \
--length_column_name="input_length" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--freeze_feature_encoder="False" \
--report_to="tensorboard" \
--metric_for_best_model="wer" \
--greater_is_better="False" \
--load_best_model_at_end \
--gradient_checkpointing \
--fp16 \
--do_train \
--do_eval \
--predict_with_generate \
--do_normalize_eval="False" \
--streaming="True" \
--use_auth_token="True" \
--push_to_hub="True"
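
As a sanity check on the hyperparameters above, the number of examples consumed per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs. The sketch below works through that arithmetic. The GPU count of 4 is purely an assumed example, since the command leaves `<gpus to include>` unspecified, and the helper names are illustrative.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Examples consumed per optimizer step across all devices."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def examples_seen(max_steps: int, batch_size: int) -> int:
    """Total training examples consumed after max_steps optimizer steps."""
    return max_steps * batch_size


if __name__ == "__main__":
    num_gpus = 4  # ASSUMPTION: example value, not taken from this README
    batch = effective_batch_size(64, 1, num_gpus)
    print(batch)                       # 256
    print(examples_seen(5000, batch))  # 1280000
    print(500 / 5000)                  # 0.1, i.e. warmup covers 10% of training
```

If memory forces a smaller per-device batch size, raising `--gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.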

License

Vistaar is MIT-licensed. The license applies to all the fine-tuned language models.

Contributors

  • Kaushal Bhogale, (AI4Bharat)
  • Sai Narayan Sundaresan, (IITKGP, AI4Bharat)
  • Abhigyan Raman, (AI4Bharat)
  • Tahir Javed, (IITM, AI4Bharat)
  • Mitesh Khapra, (IITM, AI4Bharat, RBCDSAI)
  • Pratyush Kumar, (Microsoft, AI4Bharat)

Contact

Acknowledgements

We would like to thank EkStep Foundation for their generous grant, which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.