Before running anything, download and install the Kaldi toolkit.
The following datasets are available for the Hindi language:
OpenSLR: Multilingual and code-switching ASR Challenge Dataset (Sub-task 1)
OpenSLR: Multilingual and code-switching ASR Challenge Dataset (Sub-task 2)
Speech Lab, IITM - Hindi Corpus
Replicate the following directory structure to develop the model.
dataset
├── ...
├── audio
|   ├── train: Contains all WAV files for training data
|   ├── dev: Contains all WAV files for development data
|   └── test: Contains all WAV files for testing data
├── data
|   ├── train: Contains all transcription files for training data
|   ├── dev: Contains all transcription files for development data
|   ├── test: Contains all transcription files for testing data
|   └── local
|       └── dictionary: Contains all phonetic dictionary files for the model
└── ...
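
The layout above can also be created with a small script instead of by hand. The following is a minimal sketch, not part of the original recipe: the function names `create_skeleton` and `missing_dirs` are hypothetical, and it assumes that `dictionary` sits inside `data/local`, as the tree suggests. It only creates empty directories. The WAV, transcription, and dictionary files still have to be copied in separately.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the dataset root.
# Assumption: "dictionary" is nested under "data/local".
SUBDIRS = [
    "audio/train",
    "audio/dev",
    "audio/test",
    "data/train",
    "data/dev",
    "data/test",
    "data/local/dictionary",
]


def create_skeleton(root):
    """Create the empty directory layout under `root` and return the paths."""
    root = Path(root)
    created = []
    for sub in SUBDIRS:
        path = root / sub
        # parents=True builds intermediate directories;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def missing_dirs(root):
    """Return the expected sub-directories that are absent under `root`."""
    root = Path(root)
    return [sub for sub in SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    # Hypothetical usage: build the layout in ./dataset and list it.
    for path in create_skeleton("dataset"):
        print(path)
```

Calling `missing_dirs("dataset")` before training is a quick way to confirm that no directory was left out. An empty list means the layout is complete.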