Vistaar is a set of 59 benchmarks and training datasets across various language and domain combinations such as news, education, literature and tourism. The training datasets are available for 12 Indian languages and amount to over 10,700 hours of labelled audio data. We also train IndicWhisper models by fine-tuning the Whisper models on the Vistaar training dataset and observe that they have the lowest WER on 39 out of 59 Vistaar benchmarks.
Vistaar consists of benchmarks from several public datasets (Kathbath, FLEURS, CommonVoice, IndicTTS, MUCS and GramVaani) across 12 languages. We evaluate IndicWhisper on these benchmarks along with 3 publicly available ASR systems and 2 commercially available systems. The results are given below.
Datasets | bn | gu | hi | kn | ml | mr | or | pa | sa | ta | te | ur | avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Kathbath | 16.6 | 17.8 | 10.3 | 19.3 | 34.8 | 19.9 | 24.7 | 16.9 | 45.6 | 24.2 | 25 | 11.9 | 22.3 |
Kathbath Hard | 19.4 | 20.6 | 12.0 | 22.2 | 38.4 | 22.1 | 29.1 | 19.7 | 50.5 | 27.5 | 27.8 | 14.7 | 25.3 |
CommonVoice | 24.7 | - | 11.4 | - | 44.5 | 22.8 | 35.2 | 22.4 | - | 29.2 | - | 31.7 | 27.8 |
FLEURS | 20.9 | 23.5 | 15.0 | 18.6 | 22.6 | 20.5 | 32.9 | 23.1 | - | 25.2 | 25.4 | 19.2 | 22.5 |
IndicTTS | 18.8 | 19.1 | 7.6 | 13.2 | 21.4 | 11.4 | 15.0 | - | - | 17.2 | 33.8 | - | 17.5 |
MUCS | - | 33.2 | 12.0 | - | - | 12.8 | 27.5 | - | - | 28.3 | 32.1 | - | 24.3 |
Gramvaani | - | - | 26.8 | - | - | - | - | - | - | - | - | - | 26.8 |
Average | 20.1 | 22.8 | 13.6 | 18.3 | 32.3 | 18.2 | 27.4 | 20.5 | 48 | 25.3 | 28.8 | 19.4 | 24.6 |
Word error rates (%) of IndicWhisper on the Vistaar benchmark. Values in bold indicate benchmarks where IndicWhisper has the lowest WER.
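For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the whitespace tokenization is an assumption; the repository's `evaluation.py` may normalize or tokenize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Under this definition a single substituted word in a four-word reference gives a WER of 25%, and the value can exceed 100% when the hypothesis contains many inserted words.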
Model | Kathbath | Kathbath-Hard | FLEURS | CommonVoice | IndicTTS | MUCS | Gramvaani | Average |
---|---|---|---|---|---|---|---|---|
Google STT | 14.3 | 16.7 | 19.4 | 20.8 | 18.3 | 17.8 | 59.9 | 23.9 |
IndicWav2vec | 12.2 | 16.2 | 18.3 | 20.2 | 15 | 22.9 | 42.1 | 21 |
Azure STT | 13.6 | 15.1 | 24.3 | 14.6 | 15.2 | 15.1 | 42.3 | 20 |
Nvidia-medium | 14 | 15.6 | 19.4 | 20.4 | 12.3 | 12.4 | 41.3 | 19.4 |
Nvidia-large | 12.7 | 14.2 | 15.7 | 21.2 | 12.2 | 11.8 | 42.6 | 18.6 |
IndicWhisper | 10.3 | 12.0 | 11.4 | 15.0 | 7.6 | 12 | 26.8 | 13.6 |
Comparison of publicly available and commercial ASR systems on the Hindi subset of the Vistaar benchmark (WER, %)
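The Average column in the table above appears to be consistent with an unweighted mean over the seven Hindi benchmarks, rounded to one decimal place. The sketch below checks this for the IndicWhisper row; the helper name `macro_average` is hypothetical, and the averaging rule is inferred from the numbers rather than stated by the authors.

```python
from statistics import mean

# Hindi WERs (%) copied from the IndicWhisper row of the comparison table.
indicwhisper_hi = {
    "Kathbath": 10.3,
    "Kathbath-Hard": 12.0,
    "FLEURS": 11.4,
    "CommonVoice": 15.0,
    "IndicTTS": 7.6,
    "MUCS": 12.0,
    "Gramvaani": 26.8,
}


def macro_average(wers: dict) -> float:
    """Unweighted mean over benchmarks, rounded to one decimal place."""
    return round(mean(list(wers.values())), 1)


print(macro_average(indicwhisper_hi))  # prints 13.6
```

Because every benchmark counts equally regardless of its size, a single hard benchmark such as Gramvaani has a large effect on the average.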
Datasets | Training Datasets | Benchmarks |
---|---|---|
Kathbath | link | link |
Kathbath Hard | link | link |
CommonVoice | link | link |
FLEURS | link | link |
IndicTTS | link | link |
MUCS | link | link |
Gramvaani | link | link |
Language | Model Checkpoint |
---|---|
Bengali | hf |
Gujarati | hf |
Hindi | hf |
Kannada | hf |
Malayalam | hf |
Marathi | hf |
Odia | hf |
Punjabi | hf |
Sanskrit | hf |
Tamil | hf |
Telugu | hf |
Urdu | hf |
- Setting up python virtual environment
python -m venv <env_name>
source <env_name>/bin/activate
- Installing/Updating Libraries
sudo apt install ffmpeg
pip install -r requirements.txt
- Manifest creation
- For each dataset, download and extract the benchmark data into a directory. Extract it so that each folder inside holds the data for one language, i.e. each language-specific folder should contain train, valid and test folders, each holding the audio files and a transcript.txt
- Sample structure of folder tree:
kathbath
├── bengali
│   ├── audio
│   └── transcript.txt
│
└── gujarati
    ├── audio
    └── transcript.txt
.
.
.
.
- The manifest must be a JSON Lines file: each line is a JSON object (dictionary) with the audio file path, the duration of the audio, and the text transcript
{"audio_filepath": <path to audio file 1>, "duration": <duration of audio file 1>, "text": <transcript of audio file 1>}
{"audio_filepath": <path to audio file 2>, "duration": <duration of audio file 2>, "text": <transcript of audio file 2>}
.
.
.
- Running evaluation
python evaluation.py --model_path=<model path> \
--manifest_path=<manifest path in vistaar> \
--manifest_name=<dataset name in vistaar> \
--device=<gpu to use> \
--batch_size=<batch size> \
--language=<2 letter language code>
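The manifest described in the steps above could be generated with a short script along these lines. This is a sketch under two assumptions that the text does not confirm: that each line of `transcript.txt` has the form `<audio file name><TAB><transcript>`, and that the audio files are WAV files readable by Python's standard `wave` module (durations are written in seconds). The function names are hypothetical.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Duration of a WAV file: number of frames divided by the sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_manifest(split_dir: Path, manifest_path: Path) -> int:
    """Write one JSON object per line; return the number of entries written."""
    count = 0
    with open(split_dir / "transcript.txt", encoding="utf-8") as transcripts, \
            open(manifest_path, "w", encoding="utf-8") as manifest:
        for line in transcripts:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Assumed line format: "<audio file name>\t<transcript>"
            file_name, text = line.split("\t", maxsplit=1)
            audio_path = split_dir / "audio" / file_name
            entry = {
                "audio_filepath": str(audio_path),
                "duration": round(wav_duration_seconds(audio_path), 3),
                "text": text.strip(),
            }
            # ensure_ascii=False keeps Indic scripts readable in the file.
            manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `build_manifest(Path("kathbath/bengali"), Path("bengali_manifest.json"))` would then produce one line per utterance; the paths here are placeholders. If the transcripts use a different layout, only the `line.split` step needs to change.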
- Sample structure of manifest file
{"audio_filepath":<path to audio file 1>}
{"audio_filepath":<path to audio file 2>}
.
.
.
- Running batch inference
deepspeed --include localhost:<gpus to include> \
transcribe.py <manifest path> \
<model path> \
<current language> \
<batch size> \
<output path>
- Running inference for a single audio file
from transformers import pipeline

model_path = "hindi_models/whisper-medium-hi_alldata_multigpu"
device = "cuda"
lang_code = "hi"

whisper_asr = pipeline(
    "automatic-speech-recognition", model=model_path, device=device,
)

# Special case to handle odia since odia is not supported by whisper model
if lang_code == 'or':
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=None, task="transcribe"
        )
    )
else:
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=lang_code, task="transcribe"
        )
    )

result = whisper_asr("audio.mp3")
print(result["text"])
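For batch inference, the manifest described above only needs an `audio_filepath` entry per line. A minimal sketch for generating it from a directory of audio files follows; the function name and the set of accepted file extensions are assumptions, not part of the repository.

```python
import json
from pathlib import Path

# Assumption: the audio formats to include; adjust to match your data.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}


def write_inference_manifest(audio_dir, manifest_path) -> int:
    """Write one {"audio_filepath": ...} object per line; return the entry count."""
    audio_files = sorted(
        path
        for path in Path(audio_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES
    )
    with open(manifest_path, "w", encoding="utf-8") as manifest:
        for audio_file in audio_files:
            manifest.write(json.dumps({"audio_filepath": str(audio_file)}) + "\n")
    return len(audio_files)
```

Sorting the paths keeps the manifest order stable across runs, which makes it easier to line up transcriptions with their input files.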
- Manifest creation
- Follow the manifest creation steps described under Evaluating ASR models, applied to the Vistaar training datasets
- Running training
deepspeed --include localhost:<gpus to include> training.py \
--deepspeed=<path to deepspeed config file> \
--model_name_or_path=<model path> \
--dataset_name=<dataset language directory path> \
--language=<language> \
--train_split_name=<train manifest name> \
--eval_split_name=<validation manifest name> \
--max_steps="5000" \
--output_dir=<output directory path> \
--cache_dir=<cache directory for downloaded models> \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="32" \
--gradient_accumulation_steps="1" \
--logging_steps="10" \
--learning_rate="1e-5" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--eval_steps="500" \
--save_strategy="steps" \
--save_steps="500" \
--generation_max_length="225" \
--length_column_name="input_length" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--freeze_feature_encoder="False" \
--report_to="tensorboard" \
--metric_for_best_model="wer" \
--greater_is_better="False" \
--load_best_model_at_end \
--gradient_checkpointing \
--fp16 \
--do_train \
--do_eval \
--predict_with_generate \
--do_normalize_eval="False" \
--streaming="True" \
--use_auth_token="True" \
--push_to_hub="True"
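To reason about the hyperparameters in the command above, the sketch below works out the effective batch size and the learning rate at a given step. It assumes that `training.py` uses the Hugging Face `Trainer` defaults, in particular the linear warmup and decay schedule, since the command does not set a scheduler; this is not confirmed by the text. The GPU count of 4 in the usage note is purely illustrative.

```python
def effective_batch_size(per_device_batch_size: int, num_gpus: int,
                         grad_accum_steps: int = 1) -> int:
    """Number of utterances contributing to each optimizer step."""
    return per_device_batch_size * num_gpus * grad_accum_steps


def linear_warmup_decay_lr(step: int, peak_lr: float = 1e-5,
                           warmup_steps: int = 500,
                           max_steps: int = 5000) -> float:
    """Learning rate under a linear warmup followed by linear decay to zero.

    Assumption: this mirrors the default "linear" scheduler of the
    Hugging Face Trainer; the actual script may configure something else.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))
```

With a hypothetical 4 GPUs, `effective_batch_size(64, 4, 1)` gives 256 utterances per step, so 5000 steps cover 1,280,000 training samples. Under the assumed schedule the learning rate reaches its peak of 1e-5 at step 500 and falls to zero at step 5000.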
Vistaar is MIT-licensed. The license applies to all the fine-tuned language models.
- Kaushal Bhogale, (AI4Bharat)
- Sai Narayan Sundaresan, (IITKGP, AI4Bharat)
- Abhigyan Raman, (AI4Bharat)
- Tahir Javed, (IITM, AI4Bharat)
- Mitesh Khapra, (IITM, AI4Bharat, RBCDSAI)
- Pratyush Kumar, (Microsoft, AI4Bharat)
- Mitesh Khapra (miteshk@cse.iitm.ac.in)
- Pratyush Kumar (pratyush@cse.iitm.ac.in)
We would like to thank EkStep Foundation for their generous grant which helped in setting up the Centre for AI4Bharat at IIT Madras to support our students, research staff, data and computational requirements. We would like to thank the Ministry of Electronics and Information Technology (NLTM) for its grant to support the creation of datasets and models for Indian languages under its ambitious Bhashini project. We would also like to thank the Centre for Development of Advanced Computing, India (C-DAC) for providing access to the Param Siddhi supercomputer for training our models. Lastly, we would like to thank Microsoft for its grant to create datasets, tools and resources for Indian languages.