/speech-to-text

mixlingual speech recognition system; hybrid (GMM+NNet) model; Kaldi + Keras

Mixlingual Speech Recognition

From the team:

As Chinese students studying in the States, we found that our speaking habits had morphed: English words and phrases easily slip into Chinese sentences. We strongly feel the need for messaging apps that can handle multilingual speech-to-text translation. So in this task, we are going to develop this function: build a model using deep learning architectures (DNN, CNN, LSTM) to correctly translate multilingual audio (Chinese and English in the same sentence) into text.

- Video Demo

Directory Description

codeswitch:

Contains scripts to build our system

description:

LDC2015S04, our dataset description

notes:

Our study notes on Kaldi-related recipes, including timit and librispeech

Resources to Build the System

Data Source:

Baseline Model Paper:

Other Code-switching related Paper:

Feature Improvement related Paper:

Interesting Python Kaldi Wrapper to be examined:

Kaldi recommended recipe to be examined:

Kaldi resources:

Data Preparation:

| | filename | pattern | format | path | source |
|---|---|---|---|---|---|
| acoustic data | spk2gender | `<speakerID> <gender>` | | /data/train, /data/test | handmade |
| | utt2spk | `<utteranceID> <speakerID>` | | /data/train, /data/test | handmade |
| | wav.scp | `<utteranceID> <full_path_to_audio_file>` | .scp: kaldi script file | /data/train, /data/test | handmade |
| | text | `<utteranceID> <text_transcription>` | .ark: kaldi archive file | /data/train, /data/test | exists |
| language data | lexicon.txt | `<word> <phone 1> <phone 2> ...` | .ark: kaldi archive file | data/local/dict | egs/voxforge |
| | nonsilence_phones.txt | `<phone>` | | data/local/dict | unknown |
| | silence_phones.txt | `<phone>` | | data/local/dict | unknown |
| | optional_silence.txt | `<phone>` | | data/local/dict | unknown |
| Tools | utils | | | / | kaldi/egs/wsj/s5 |
| | steps | | | / | kaldi/egs/wsj/s5 |
| | score.sh | | | / | kaldi/egs/voxforge/s5/local |
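
The acoustic data files listed above are plain text with one record per line, and Kaldi expects each of them to be sorted by its first field. A minimal sketch of generating them could look like the following. The function name and the `(utt_id, spk_id, gender, wav_path, transcript)` tuple layout are assumptions made for illustration; they are not part of this repository.

```python
import os

def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text, utt2spk and spk2gender into out_dir.

    `utterances` is a list of (utt_id, spk_id, gender, wav_path, transcript)
    tuples. This layout is hypothetical and only serves the example.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Sort by utterance ID; Python compares strings by code point,
    # which matches the byte order Kaldi expects for UTF-8 text.
    utts = sorted(utterances, key=lambda u: u[0])
    genders = {spk: gender for _, spk, gender, _, _ in utts}

    def dump(name, lines):
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write("".join(line + "\n" for line in lines))

    dump("wav.scp", [f"{u} {wav}" for u, _, _, wav, _ in utts])
    dump("text", [f"{u} {txt}" for u, _, _, _, txt in utts])
    dump("utt2spk", [f"{u} {spk}" for u, spk, _, _, _ in utts])
    dump("spk2gender", [f"{spk} {g}" for spk, g in sorted(genders.items())])
```

Prefixing each utterance ID with its speaker ID (for example `spk1_u1`) keeps the utterance order and the speaker order consistent, which is what `utils/validate_data_dir.sh` checks for.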

Language Model:

What is our language model:
3-grams trained from the transcripts of THCHS30 + LDC2015S04
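
To make "3-grams trained from transcripts" concrete, here is a minimal sketch of a maximum-likelihood trigram estimate over tokenized sentences. It is an illustration only, not the tool used to build the model above: real language model toolkits add smoothing and backoff, which this sketch omits, and the function name is hypothetical.

```python
from collections import Counter

def trigram_mle(sentences):
    """Estimate P(w3 | w1, w2) by relative frequency, without smoothing.

    `sentences` is an iterable of token lists.
    """
    tri, bi = Counter(), Counter()
    for words in sentences:
        # Pad so that the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(len(padded) - 2):
            w1, w2, w3 = padded[i:i + 3]
            tri[(w1, w2, w3)] += 1
            bi[(w1, w2)] += 1
    return {k: count / bi[k[:2]] for k, count in tri.items()}

probs = trigram_mle([["我", "想", "吃", "pizza"], ["我", "想", "睡觉"]])
print(probs[("我", "想", "吃")])  # 0.5: "我 想" is followed by "吃" in 1 of 2 cases
```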

directory structure taken from /egs/TIMIT/s5:

/data
  /local
    /nist_lm
      /lm_phone_bg.arpa.gz

How to build a language model:

Kaldi script utils/prepare_lang.sh

usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
options:
     --num-sil-states <number of states>             # default: 5, #states in silence models.
     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.
     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I
                                                     # markers on phones to indicate word-internal positions.
     --share-silence-phones (true|false)             # default: false; if true, share pdfs of
                                                     # all silence phones.
     --sil-prob <probability of silence>             # default: 0.5 [must have 0 < silprob < 1]

Setting the --share-silence-phones option to true was extremely helpful for the Cantonese data of IARPA's BABEL project, where the data is very messy and has long untranscribed portions that the Kaldi developers try to align to a special phone designated for that purpose. The --sil-prob option might be another potentially important one.

Preparation

  • lexicon.txt
    • The pronunciation dictionary, where every line is a word followed by its phonemic pronunciation. It only contains words (and their pronunciations) that are present in the corpus.
    • ENG: CMU dictionary
  • nonsilence_phones.txt
  • optional_silence.txt
  • silence_phones.txt
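
Since lexicon.txt should only contain words that occur in the corpus, one way to build it is to filter a larger pronunciation dictionary by the transcript vocabulary. The sketch below assumes the dictionary has already been loaded into a Python dict mapping each word to a list of pronunciations; that structure and the function name are hypothetical.

```python
def build_lexicon(transcripts, pron_dict):
    """Keep only dictionary entries whose word occurs in the transcripts.

    transcripts: iterable of whitespace-tokenized transcript strings
    pron_dict:   {word: [[phone, ...], ...]} (assumed, pre-loaded dictionary)
    Returns (lexicon_lines, missing_words).
    """
    vocab = {word for line in transcripts for word in line.split()}
    lines, missing = [], []
    for word in sorted(vocab):
        if word in pron_dict:
            # A word may have several pronunciations: one lexicon line each.
            for phones in pron_dict[word]:
                lines.append(f"{word} {' '.join(phones)}")
        else:
            missing.append(word)
    return lines, missing
```

The second return value lists corpus words without a pronunciation, which still need an entry (or an out-of-vocabulary mapping) before running `utils/prepare_lang.sh`.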

MFCC Feature Extraction:

   echo
   echo "===== FEATURES EXTRACTION ====="
   echo
 
   # Making feats.scp files
   mfccdir=mfcc
   # Uncomment and modify arguments in scripts below if you have any problems with data sorting
   # utils/validate_data_dir.sh data/train     # script for checking prepared data - here: for data/train directory
   # utils/fix_data_dir.sh data/train          # tool for data proper sorting if needed - here: for data/train directory
   steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
   steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/test exp/make_mfcc/test $mfccdir
  
   # Making cmvn.scp files
   steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
   steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test $mfccdir

MFCC-related documents

HMM - GMM

Reference

a: the transition probability from state i to state j
b: the emission probability of an observation x in state j

The forward-backward algorithm fine-tunes a

The GMM provides b

HMM solves the following three problems:

  1. overall likelihood (Forward algorithm): determine the likelihood of an observation sequence X=(x1, x2, ... xT) being generated by an HMM
  2. training (Forward-backward algorithm, EM): given an observation sequence, learn the best lambda (the HMM parameters, including a and b)
  3. decoding (Viterbi algorithm): given an observation sequence, determine the most probable hidden state sequence
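
Problems 1 and 3 can be illustrated with a small sketch in plain Python. It assumes an initial state distribution `pi`, a transition matrix `a`, and a precomputed table `b_obs[t][j]` holding the emission likelihood of frame t in state j (in a GMM-HMM system these values would come from the GMMs). All names are chosen for this example. Raw probabilities are used for readability; real decoders work in the log domain to avoid underflow.

```python
def forward_likelihood(pi, a, b_obs):
    """Problem 1: P(X | lambda), summing over all state sequences."""
    n = len(pi)
    alpha = [pi[j] * b_obs[0][j] for j in range(n)]
    for t in range(1, len(b_obs)):
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b_obs[t][j]
                 for j in range(n)]
    return sum(alpha)

def viterbi(pi, a, b_obs):
    """Problem 3: most probable state sequence and its probability."""
    n = len(pi)
    delta = [pi[j] * b_obs[0][j] for j in range(n)]
    backptr = []
    for t in range(1, len(b_obs)):
        # Best predecessor state for each current state j.
        prev = [max(range(n), key=lambda i: delta[i] * a[i][j]) for j in range(n)]
        delta = [delta[prev[j]] * a[prev[j]][j] * b_obs[t][j] for j in range(n)]
        backptr.append(prev)
    state = max(range(n), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for prev in reversed(backptr):  # trace the back-pointers to the start
        state = prev[state]
        path.append(state)
    return path[::-1], best_prob

pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b_obs = [[0.5, 0.1], [0.1, 0.6]]  # two frames, two states
print(forward_likelihood(pi, a, b_obs))
print(viterbi(pi, a, b_obs))
```

The Viterbi probability is never larger than the forward likelihood, because it keeps only the single best path instead of summing over all of them.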

CNN and MFSC features

To train a CNN, we need to extract MFSC features from the acoustic data instead of MFCC features, because the Discrete Cosine Transform (DCT) in MFCC destroys locality. MFSC features are also called filter banks (fbank). In Kaldi, the scripts look something like the following:

steps/make_fbank.sh --nj 3 $trainDir/train_clean_fbank exp/make_fbank/train_clean_fbank feat/fbank/ || exit 1;
steps/compute_cmvn_stats.sh $trainDir/train_clean_fbank exp/make_fbank/train_clean_fbank feat/fbank/ || exit 1;

Note that fbank features don't work well with GMMs: fbank features are highly correlated, while GMMs modelled with diagonal covariance matrices assume that the feature dimensions are independent. fbank/MFSC is okay with DNN and best for CNN.
Why MFSC+GMM produced high WER: see Kaldi discussion
Why DCT destroys locality: see post
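
The locality argument can be checked numerically with a small sketch. It uses a hand-written, unnormalized DCT-II and a made-up frame of 8 log filter-bank energies (both are assumptions for illustration, not Kaldi's exact MFCC pipeline). Changing a single frequency band changes one filter-bank value but every cepstral coefficient, so a convolution filter over neighbouring cepstral coefficients no longer looks at neighbouring frequencies.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a sequence."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * (k + 0.5) * i / n) for k in range(n))
            for i in range(n)]

flat = [1.0] * 8        # hypothetical log filter-bank energies for one frame
bumped = list(flat)
bumped[3] += 1.0        # perturb a single frequency band

diff_fbank = [b - f for f, b in zip(flat, bumped)]
diff_cepstra = [b - f for f, b in zip(dct2(flat), dct2(bumped))]

changed_bands = sum(abs(d) > 1e-9 for d in diff_fbank)
changed_cepstra = sum(abs(d) > 1e-9 for d in diff_cepstra)
print(changed_bands, changed_cepstra)  # 1 8
```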

Required Packages

tensorflow == 1.1.0
theano == 0.9.0.dev-c697eeab84e5b8a74908da654b66ec9eca4f1291
keras == 1.2

Run Kaldi on single GPU

This doesn't require Sun GridEngine. Simply download the [CUDA toolkit](https://developer.nvidia.com/cuda-downloads) and install it with

sudo sh cuda_8.0.61_375.26_linux.run

then go to kaldi/src and run

./configure

to check whether it detects CUDA; you should also find CUDA = true in kaldi/src/kaldi.mk. Then recompile Kaldi with

make depend -j 8 # 8 for 8-core cpu
make -j 8 # 8 for 8-core cpu

Note that GMM-based training and decoding are not supported on GPU; only nnet is. source

** If you are using an AWS g2.2xlarge instance launched before 2017-04-18 (when this note was written), its NVIDIA GPU may need a legacy 367.x driver; the default (latest) driver that comes with the CUDA 8 installer cuda_8.0.61_375.26_linux.run will fail. To list the driver versions available for installation on the instance, type

apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'

to install a version of your choice from the list, type

sudo apt-get install nvidia-367

You can also download a specific version from the web, for example NVIDIA-Linux-x86_64-367.18.run. Install it with

sudo sh NVIDIA-Linux-x86_64-367.18.run

Then, when you install cuda_8.0.61_375.26_linux.run, it will ask whether to install NVIDIA driver 375; make sure you choose no.

Install tensorflow-gpu

Required:

  1. install CUDA toolkit 8.0 as of 04-18-2017
  2. install cuDNN v5; as of 04-18-2017, TensorFlow performs best with cuDNN 5.x
    Follow the commands from the TensorFlow website carefully. After installation, you can test whether TensorFlow detects your GPU by typing the following:
# make sure you are outside the tensorflow git repo
python
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

A working tensorflow will output:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:04.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0
I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0

  1. During testing, if you run into an error like:
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcudnn.so.5. LD_LIBRARY_PATH: /usr/local/cuda/lib64
I tensorflow/stream_executor/cuda/cuda_dnn.cc:3517] Unable to load cuDNN DSO

In the writer's experience, this means LD_LIBRARY_PATH is not set correctly in the ~/.profile file. Find where libcudnn.so.5 is located and move it to a directory on LD_LIBRARY_PATH, most likely /usr/local/cuda/lib64 (the path shown in the error above). After you modify ~/.profile, make sure you type source ~/.profile to activate the change.

  2. If you are testing in a Python shell and you get the following error:
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

Very likely you are inside the actual tensorflow git repo (source); make sure you leave it before testing.
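
Before moving files around, it can help to check which directories on LD_LIBRARY_PATH actually contain the two libraries named in the errors above. The helper below is a small sketch using only the standard library; the function name is made up for this example.

```python
import os

def find_shared_library(name, search_path):
    """Return the directories of a colon-separated search path that contain `name`."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, name)):
            hits.append(directory)
    return hits

if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for lib in ("libcudnn.so.5", "libcudart.so.8.0"):
        found = find_shared_library(lib, path)
        print(lib, "->", found if found else "not found on LD_LIBRARY_PATH")
```

An empty result for a library means the dynamic loader will not find it through LD_LIBRARY_PATH, which matches the "Couldn't open CUDA library" and "cannot open shared object file" messages.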

Install Theano GPU

Keras-kaldi's LSTM training script breaks under the current TensorFlow (TensorFlow went through a series of API changes during the previous months), so we need to install Theano GPU and switch to the Theano backend to run run_kt_LSTM.sh.
After installing Theano GPU using miniconda, you can create .theanorc, the Theano config file, with the following command:

echo -e "\n[global]\nfloatX=float32\n" >> ~/.theanorc

and add device=gpu to this file. If Theano can't detect nvcc and gives you the following error:

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.

(but you are sure that you installed CUDA), you can solve it by adding the following lines to ~/.profile:

export PATH=/usr/local/cuda-8.0/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

Don't forget to source ~/.profile to enable the change.
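
The .theanorc settings described above (floatX=float32 and device=gpu in the [global] section) can also be written with a short script instead of echo plus manual editing. This is a sketch using the standard configparser module; the function name is hypothetical.

```python
import configparser
import os

def update_theanorc(path, floatx="float32", device="gpu"):
    """Set floatX and device in the [global] section, keeping other settings."""
    parser = configparser.ConfigParser()
    parser.optionxform = str   # keep option names such as floatX case-sensitive
    parser.read(path)          # a missing file is silently skipped
    if not parser.has_section("global"):
        parser.add_section("global")
    parser.set("global", "floatX", floatx)
    parser.set("global", "device", device)
    with open(path, "w") as f:
        parser.write(f)
    return dict(parser.items("global"))

# Usage (writes to the real config file, so it is left commented out):
# update_theanorc(os.path.expanduser("~/.theanorc"))
```

Unlike appending with `>>`, running this twice does not create duplicate sections.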
To change the Keras backend from TensorFlow to Theano, open the Keras config file and set "backend" to "theano":

vim $HOME/.keras/keras.json
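
The same change to keras.json can be scripted. The sketch below only touches the "backend" key and leaves every other setting in the file as it is; the function name is an assumption for this example.

```python
import json
import os

def set_keras_backend(path, backend="theano"):
    """Set the "backend" entry of a keras.json file, keeping other keys."""
    config = {}
    if os.path.exists(path):
        with open(path) as f:
            config = json.load(f)
    config["backend"] = backend
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return config

# Usage (modifies the real config file, so it is left commented out):
# set_keras_backend(os.path.expanduser("~/.keras/keras.json"))
```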

To test whether Theano is indeed using the GPU, run the following script:

from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Kaldi script to train nnet

  1. 3-4 hours to train, 3 hours to decode on GPU:
    local/online/run_nnet2_baseline.sh

Chinese CER (Character Error Rate)

  1. egs/hkust/s5/local/ext/score.sh
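
For reference, character error rate is the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below is a plain Python illustration of that definition, not a reimplementation of the hkust scoring script: it simply drops spaces and treats every remaining character (Chinese or Latin) as one token, which is an assumption of this example. The reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

print(cer("我 想 吃 饭", "我 想 吃 面"))  # 0.25: one substitution in four characters
```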

Keras-Kaldi

dspavankumar/keras-kaldi github repo
At the time we ran his code, the environment was still Keras 1.2.0. Make sure that the Keras version is the same across machines. To downgrade Keras from 2.0.3 to the older version, type

$ sudo pip3 install keras==1.2
or 
$ conda install keras==1.2.2 # if you are using conda

If the versions are inconsistent (for example, a model trained with 1.2.0 but decoded with 2.0.3), you will run into a problem when loading an existing model:

  File "steps_kt/nnet-forward.py", line 33, in <module>
    m = keras.models.load_model (model)
  File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 281, in load_model
    Error: “Optimizer weight shape (1024, ) not compatible with provided weight shape (429,1024)”

source
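
A cheap guard against this mismatch is to compare the installed Keras version with the one used for training before calling load_model. The sketch below compares only the major and minor components; treating 1.2.0 and 1.2.2 as compatible is an assumption of this example, and the function name is hypothetical.

```python
def same_minor_version(installed, trained_with):
    """True if two version strings share the same major.minor prefix."""
    return installed.split(".")[:2] == trained_with.split(".")[:2]

# Usage on a machine with Keras installed (left commented out):
# import keras
# if not same_minor_version(keras.__version__, "1.2.0"):
#     raise RuntimeError("Keras %s differs from the training version 1.2.0"
#                        % keras.__version__)

print(same_minor_version("1.2.2", "1.2.0"))  # True
print(same_minor_version("2.0.3", "1.2.0"))  # False
```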