vagrawal/deepsphinx

What is the preferred dataset for training?

Closed this issue · 1 comments

Hi,
I have looked through your repository and would like to run some experiments with it. As you mention, I need to train the model first and then use it. If you have a well-trained model, please link it in your README file so that it is accessible to everyone.

Otherwise, could you share the training data so that I can do the training myself? I have found something in the tensorflow-speech-recognition repository, where the dataset sizes are 100 GB and 21 GB: https://github.com/pannous/tensorflow-speech-recognition#ultimate-goal
I want to confirm whether this is the training data you used, or whether you used something else. If you did use this training data, how did you utilize it?
Kindly share your thoughts on this.

Thanks for reaching out.

I have been experimenting with various models and hyperparameters, and I plan to publish a pretrained model for the best-performing one. If you want the last model I trained, you can download it from here:

https://s3.amazonaws.com/deepsphinx/batch-21937.data-00000-of-00001
https://s3.amazonaws.com/deepsphinx/batch-21937.index

I am not using the dataset you mentioned for training, but you could easily use it: you just need to create a transcription file in the required format and change the vocabulary if needed. I have been using the WSJ speech recognition corpus, since its accuracy has been reported by many papers using various models.
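
To illustrate what "create a transcription file and change the vocabulary" might look like, here is a minimal sketch. The exact format deepsphinx expects is an assumption here (one `audio-path<TAB>transcript` line per utterance, with a character-level vocabulary); check the repository's data-loading code for the real format before relying on this.

```python
# Hedged sketch: turning a dataset's (audio path, transcript) pairs into a
# flat transcription file plus a character vocabulary. The line format and
# the character-level vocabulary are assumptions, not deepsphinx's actual
# spec -- verify against the repo's data-loading code.
import os
import tempfile


def write_transcriptions(utterances, out_path):
    """Write one 'audio_path<TAB>TRANSCRIPT' line per utterance."""
    with open(out_path, "w") as f:
        for audio_path, transcript in utterances:
            # Upper-case the text, as WSJ-style transcripts typically are.
            f.write(f"{audio_path}\t{transcript.upper()}\n")


def build_vocab(utterances):
    """Return a sorted character vocabulary drawn from the transcripts."""
    chars = set()
    for _, transcript in utterances:
        chars.update(transcript.upper())
    return sorted(chars)


if __name__ == "__main__":
    # Hypothetical paths/transcripts for illustration only.
    sample = [("wav/utt1.wav", "hello world"),
              ("wav/utt2.wav", "open source")]
    out = os.path.join(tempfile.gettempdir(), "transcriptions.txt")
    write_transcriptions(sample, out)
    print(open(out).read().splitlines()[0])   # wav/utt1.wav	HELLO WORLD
    print("".join(build_vocab(sample)))
```

If you adapt another corpus this way, you would also point the training script's transcription and vocabulary options at the generated files; the option names depend on the repository's CLI.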