This is a repository for an end-to-end Automated Speech Recognition (ASR) model built for my Udacity project. It was built using TensorFlow 2/Keras and is GPU-enabled. The model uses a CTC loss function; the CTC algorithm was devised by Graves et al. to align the output text sequence with the audio input sequence. One distinguishing feature of the model is its use of Google Brain's SpecAugment data augmentation methodology, which helped improve generalization. A further enhancement that would improve the results is a language model pre-trained on a larger corpus, as in Baidu's DeepSpeech2.
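For reference, CTC loss in Keras is typically attached via the backend's `ctc_batch_cost`; the sketch below illustrates that wiring only (the argument names and wrapper function are illustrative, not taken from this repo's code):

```python
from tensorflow.keras import backend as K

def ctc_lambda_loss(args):
    """Illustrative CTC loss wrapper, e.g. for use inside a Lambda layer."""
    # y_pred: softmax outputs, shape (batch, time_steps, n_chars)
    # labels: padded label sequences, shape (batch, max_label_len)
    # input_length / label_length: true sequence lengths before padding
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
```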
The model is trained on the LibriSpeech dataset. JSON files for the train and test corpora were first generated, and spectrogram features were used as the audio representation. Much of the data processing functionality was provided by Udacity in the data_generator.py file, to which a data augmentation function was added (time and frequency masking from TensorFlow I/O), which significantly improved generalizability. train_utils.py contains the utility function for training the model. (Note that, while the file also contains a PyTorch-based training function, it was not ultimately used; it has been left in place to aid a future migration to PyTorch.)
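As a minimal sketch of the SpecAugment-style masking added to data_generator.py, the TensorFlow I/O routines can be applied directly to a spectrogram (the parameter values below are illustrative, not the ones tuned in this repo):

```python
import tensorflow_io as tfio

def augment_spectrogram(spectrogram, freq_param=10, time_param=20):
    """Apply SpecAugment-style masking to a (time, frequency) spectrogram."""
    # Mask a random band of frequency bins (width up to freq_param).
    masked = tfio.audio.freq_mask(spectrogram, param=freq_param)
    # Mask a random span of time steps (length up to time_param).
    masked = tfio.audio.time_mask(masked, param=time_param)
    return masked
```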
sample_models.py contains all the neural network models built and tested. The final architecture comprises a single CNN layer, two GRU layers followed by a time-distributed dense layer, and a final softmax layer that gives the probability of each character in the output sequence; a sketch of this architecture is shown below. The choice of CNN and bidirectional GRU layers was based on earlier models tested. As is common with neural nets, overfitting was a significant problem, in this case because the dataset is fairly small. A large number of architectures (with a few different choices of kernel size and stride) were tested using both regular dropout for the CNN layer and recurrent dropout for the RNN layers (regular dropout does not work for RNNs, as explained by Gal and Ghahramani) to control overfitting. However, even when overfitting could be controlled, generalization was not good. Causal dilation (as in Google's WaveNet) was also tested but did not help, possibly because only a single CNN layer was used. Finally, Google Brain's SpecAugment methodology was used, which applies time and frequency masking to the audio spectrogram and has been shown to prevent overfitting and improve generalization. Although this definitely improved the results, once the training loss became low enough, overfitting re-emerged. A good solution (using dropout-based regularizers) was eventually found, for which overfitting was well-controlled and both the training and validation losses were low. Note that, to get good convergence, the learning rate was lowered to 0.01 for the final 50 epochs (although a learning-rate schedule would have been a better solution). The model predictions were reasonable; beam search worked better than greedy search (via a change to ctc_decode).
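The sketch below illustrates the shape of the final architecture in Keras. All layer sizes, kernel parameters, and dropout rates here are illustrative placeholders, not the tuned values from sample_models.py:

```python
from tensorflow.keras.layers import (Input, Conv1D, Dropout, GRU,
                                     Bidirectional, TimeDistributed, Dense,
                                     Activation)
from tensorflow.keras.models import Model

def final_model(input_dim=161, filters=200, kernel_size=11, conv_stride=2,
                units=200, output_dim=29):
    """Sketch: one CNN layer, two bidirectional GRU layers, a
    time-distributed dense layer, and a softmax output."""
    input_data = Input(name='the_input', shape=(None, input_dim))
    # Single 1-D convolution over the spectrogram frames, with regular dropout.
    x = Conv1D(filters, kernel_size, strides=conv_stride, padding='same',
               activation='relu')(input_data)
    x = Dropout(0.3)(x)
    # Two bidirectional GRU layers with recurrent dropout (Gal & Ghahramani).
    for _ in range(2):
        x = Bidirectional(GRU(units, return_sequences=True,
                              recurrent_dropout=0.3))(x)
    # Per-time-step character scores, then softmax over the character set.
    x = TimeDistributed(Dense(output_dim))(x)
    y_pred = Activation('softmax', name='softmax')(x)
    return Model(inputs=input_data, outputs=y_pred)
```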
An evaluation metric like WER or Levenshtein distance still needs to be added to help with model evaluation. Results can be improved further by pairing the CTC output with a language model trained on a larger corpus, as in Baidu's DeepSpeech2. For a PyTorch implementation of that approach, see https://github.com/SeanNaren/deepspeech.pytorch.
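As a starting point for that metric, a plain word-level Levenshtein implementation of WER could look like the following (a minimal sketch; libraries such as jiwer provide the same metric out of the box):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # ~0.33
```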