This is a PyTorch implementation of Listen, Attend and Spell (LAS), published at ICASSP 2016 (Student Paper Award). Please feel free to use/modify it; any bug report or improvement suggestion will be appreciated.
This implementation achieves a 40% phoneme error rate on TIMIT (using the original settings in the paper, without hyperparameter tuning). It is not a remarkable score, but note that end-to-end ASR models without specially designed loss functions, such as LAS, require a larger corpus to achieve outstanding performance. For comparison, my other implementation, of CTC (coming soon), achieves about a 32% error rate with exactly the same input and experiment settings.
Be aware of some differences between this implementation and the originally proposed model:
- Smaller Dataset: Originally, LAS was trained on Google's private voice search dataset, representing 2000 hours of data plus additional data augmentation. Here the model was trained on TIMIT, a MUCH smaller dataset, without any data augmentation.
- Different Target: The evaluation criterion is the error rate over the output phoneme sequence (61 classes in TIMIT) rather than over sentences composed of real words.
- Simplified Speller: The speller contains a single-layer LSTM instead of the 2-layer LSTM proposed in the paper. According to the response I received from the author, a single layer achieves similar results (see the sketch below).
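For concreteness, here is a minimal sketch of a single-layer attend-and-spell step. The module names, feature sizes, and the dot-product attention are illustrative assumptions, not the exact design used in this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Speller(nn.Module):
    """Single-layer speller with dot-product attention (illustrative sketch)."""
    def __init__(self, vocab_size, hidden_dim, listener_dim):
        super().__init__()
        # Input at each step: previous character (one-hot) + previous context.
        self.rnn = nn.LSTMCell(vocab_size + listener_dim, hidden_dim)
        self.query_proj = nn.Linear(hidden_dim, listener_dim)
        self.char_dist = nn.Linear(hidden_dim + listener_dim, vocab_size)

    def forward(self, listener_feat, prev_char, state, context):
        # listener_feat: (B, T, listener_dim); prev_char: (B, vocab_size);
        # state: (h, c) from the previous step; context: (B, listener_dim).
        h, c = self.rnn(torch.cat([prev_char, context], dim=-1), state)
        # Attend: compare the decoder state against every listener time step.
        query = self.query_proj(h).unsqueeze(1)                    # (B, 1, D)
        energy = torch.bmm(query, listener_feat.transpose(1, 2))  # (B, 1, T)
        attn = F.softmax(energy.squeeze(1), dim=-1)                # (B, T)
        context = torch.bmm(attn.unsqueeze(1), listener_feat).squeeze(1)
        # Spell: predict the next character from decoder state + context.
        logits = self.char_dist(torch.cat([h, context], dim=-1))
        return logits, (h, c), context, attn
```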
- Python 3
- GPU computing is recommended for training efficiency
- A command line tool for converting the raw wave files in TIMIT from NIST to RIFF format (e.g., sox)
- A Python package for extracting MFCC features during preprocessing
- pytorch (0.3.0 or later): please use a version after 0.3.0, in which the softmax bug on 3D input is fixed (see the example after this list)
- A package for calculating edit distance (Levenshtein distance), used to compute the error rate (usage example after this list)
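To see why the 3D softmax matters here: attention energies in LAS are naturally batched 3D tensors. A minimal sketch, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# Attention energies for a batch of 4 utterances at one decoding step,
# over 50 listener time steps (shapes are illustrative).
energy = torch.randn(4, 1, 50)

# Normalize over the time axis to obtain attention weights; passing `dim`
# explicitly avoids the 3D-input softmax issue mentioned above.
attn = F.softmax(energy, dim=-1)
print(attn.sum(dim=-1))  # every entry is 1.0
```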
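And a quick sketch of how an edit-distance package yields the phoneme error rate; the `editdistance` package and the phoneme sequences below are illustrative assumptions:

```python
import editdistance

# Hypothetical reference and hypothesis phoneme sequences (TIMIT-style labels).
ref = ['sil', 'hh', 'ax', 'l', 'ow', 'sil']
hyp = ['sil', 'hh', 'ae', 'l', 'ow']

# Error rate = Levenshtein distance / reference length.
per = editdistance.eval(ref, hyp) / len(ref)
print('PER: {:.2%}'.format(per))  # 2 edits / 6 phonemes = 33.33%
```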
- TIMIT Dataset Preprocess

Please prepare the TIMIT dataset without modifying its file structure, then run the following command to preprocess the raw waveforms into 26-dimensional MFCC features before training.

```
./util/timit_preprocess.sh <TIMIT folder>
```
After the preprocessing step, `std_preprocess_26_ch.pkl` should be under your TIMIT folder. Add your data path to the config file.
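For reference, here is a minimal sketch of what the 26-dimensional MFCC step computes, assuming the `python_speech_features` package, a 13-static-plus-13-delta layout, and a placeholder file path (the repo's preprocessing script is the authoritative version):

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

# Read one (RIFF-converted) TIMIT utterance; the path is a placeholder.
rate, signal = wav.read('TIMIT/TRAIN/DR1/FCJF0/SA1.wav')

# 13 static MFCCs per frame plus their deltas -> 26 features per frame.
static = mfcc(signal, samplerate=rate, numcep=13)
deltas = delta(static, 2)
features = np.hstack([static, deltas])  # shape: (num_frames, 26)
print(features.shape)
```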
- LAS Model

```
mkdir -p checkpoint
mkdir -p log
python3 run_exp.py <config file path>
```
Training logs will be stored at `log/` while model checkpoints are saved at `checkpoint/`.
For a customized experiment, please read and modify `config/las_example_config.yaml`. For more information and a simple demonstration, please refer to `las_demo.ipynb`.
- Experiment on WSJ dataset
- Special thanks to William Chan, the first author of LAS, for answering my questions during the implementation.
- TIMIT preprocessing : https://github.com/Faur/TIMIT