AccentDetection

ASR Project


0. Download the LibriSpeech model from "http://www.kaldi-asr.org/downloads/build/6/trunk/egs/librispeech/s5/exp/". We used the tri2a model in our experiments. Extract the model and place the exp directory in the parent directory of this project. Also keep the lang directory in this folder; it can be downloaded from "http://www.kaldi-asr.org/downloads/build/6/trunk/egs/librispeech/s5/data/lang/".
1. Download the speech data from the GMU Speech Archive by running getFiles.py in the GMU_Archive directory (a sketch of this download loop is given after the step list).
2. Process the speech files in GMU_Archive: convert them to 44100 Hz WAV files (a conversion sketch is given after the step list).
3. Run process_data.sh. The first argument is the data directory path (GMU_Archive in our case). The second argument is the path of the file containing the speech transcript (the file "transcript" in our case). The script generates wav.scp, utt2spk, spk2utt (for Kaldi feature extraction) and text (for Kaldi alignment) in the data directory. It also filters out incompatible WAV files and calls fix_data_dir.sh to make the metadata files Kaldi-compatible. A sketch of the metadata generation is given after the step list.
4. Copy the timit transcript from the GMU archive to the file utt_dict.
5. Execute baseAudioGen.sh. The first argument is utt_dict. The script reads utt_dict word by word, capitalizes each word, and generates a machine-synthesized audio clip for each word in StandardWords. It also calls the Python script convert_mfcc.py to generate MFCC features for these clips and saves them in the same directory. Note that the formatted variable can be written to a file to generate the 'transcript' file required in step 3 above. A sketch of the synthesis and MFCC step is given after the step list.
6. The file nationalities contains a sorted list of the unique nationalities in the GMU dataset.
7. Run wordsnip.sh. The first argument is the data directory path (GMU_Archive); the second argument is the path of the transcript. In the script, set stage to 1 if running for the first time. This script extracts the features, generates the alignments, and snips each utterance into words. It also stretches or compresses each word file so that its length matches the length of its machine-synthesized reference clip. The new word snippets are saved in the directory "words_new". A sketch of the time-stretching step is given after the step list.
8. Then run getMFCC.py. It reads the word snippets from the words_new directory and, for each word, randomly selects one audio clip from each nationality. For the given corpus there are 55 words, so for each word we select one audio file per nationality (~193), giving roughly 55 * 193 ≈ 10k files in total. The MFCC features generated from these files are placed in "mfcc_new_2" (a sketch is given after the step list).
9. Finally, run train.py. Modify the file to try out different ML models. This concludes our pipeline. A sketch of a simple training setup is given below.
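
The sketches below illustrate individual steps from the list above. They are minimal examples under stated assumptions, not the repo's actual scripts.

For step 1, the archive download is handled by getFiles.py. A hypothetical sketch that simply fetches a prepared list of clip URLs (url_list.txt is an assumed helper file, not part of the repo):

```python
# Hypothetical downloader sketch for step 1.
# Assumes url_list.txt contains one audio URL per line (this file name is an
# assumption; the real getFiles.py may work differently).
import os
import urllib.request

OUT_DIR = "GMU_Archive"
os.makedirs(OUT_DIR, exist_ok=True)

with open("url_list.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        # Save each clip under its original file name inside GMU_Archive.
        dest = os.path.join(OUT_DIR, os.path.basename(url))
        if not os.path.exists(dest):
            print("downloading", url)
            urllib.request.urlretrieve(url, dest)
```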
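
For step 2, one way to do the conversion is to drive sox from Python (sox is assumed to be installed; the repo may convert the files differently):

```python
# Sketch of step 2: convert every .mp3/.wav in GMU_Archive to 44100 Hz mono WAV.
# Assumes the sox command-line tool is installed.
import os
import subprocess

DATA_DIR = "GMU_Archive"

for name in os.listdir(DATA_DIR):
    if not name.endswith((".mp3", ".wav")):
        continue
    src = os.path.join(DATA_DIR, name)
    dst = os.path.join(DATA_DIR, os.path.splitext(name)[0] + "_44k.wav")
    # -r 44100: target sample rate, -c 1: mono, -b 16: 16-bit PCM (Kaldi-friendly).
    subprocess.run(["sox", src, "-r", "44100", "-c", "1", "-b", "16", dst], check=True)
```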
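
For step 3, the Kaldi metadata files are plain line-oriented text. A sketch of how they could be generated (process_data.sh is the reference; the id conventions below are assumptions):

```python
# Sketch of step 3: build wav.scp, utt2spk, spk2utt and text for Kaldi.
# Assumptions (the real process_data.sh may differ): the utterance id is the
# WAV file stem, each file is its own speaker, and every recording shares the
# same transcript stored as one line of words in the file "transcript".
import os
from collections import defaultdict

DATA_DIR = "GMU_Archive"

with open("transcript") as f:
    shared_text = " ".join(f.read().split())

spk2utt = defaultdict(list)
with open(os.path.join(DATA_DIR, "wav.scp"), "w") as wav_scp, \
     open(os.path.join(DATA_DIR, "utt2spk"), "w") as utt2spk, \
     open(os.path.join(DATA_DIR, "text"), "w") as text:
    for name in sorted(os.listdir(DATA_DIR)):
        if not name.endswith(".wav"):
            continue
        utt_id = os.path.splitext(name)[0]
        spk_id = utt_id  # one speaker per recording (assumption)
        wav_scp.write(f"{utt_id} {os.path.abspath(os.path.join(DATA_DIR, name))}\n")
        utt2spk.write(f"{utt_id} {spk_id}\n")
        text.write(f"{utt_id} {shared_text}\n")
        spk2utt[spk_id].append(utt_id)

with open(os.path.join(DATA_DIR, "spk2utt"), "w") as f:
    for spk_id, utts in sorted(spk2utt.items()):
        f.write(f"{spk_id} {' '.join(utts)}\n")
```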
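
For step 5, the synthesis tool used by baseAudioGen.sh is not fixed here; as an illustration, espeak (if installed) can generate a reference clip per word, and librosa can compute MFCCs in the spirit of convert_mfcc.py:

```python
# Sketch of step 5: synthesize one clip per word and compute its MFCCs.
# Assumes the espeak TTS tool and the librosa/numpy Python packages are
# installed; baseAudioGen.sh / convert_mfcc.py may use other tools.
import os
import subprocess

import librosa
import numpy as np

WORD_DIR = "StandardWords"
os.makedirs(WORD_DIR, exist_ok=True)

with open("utt_dict") as f:
    words = f.read().split()

for word in words:
    word = word.upper()                      # capitalize, as in baseAudioGen.sh
    wav_path = os.path.join(WORD_DIR, word + ".wav")
    # espeak -w writes the synthesized speech to a WAV file.
    subprocess.run(["espeak", "-w", wav_path, word], check=True)

    # MFCC features for the reference clip, stored next to the audio.
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    np.save(os.path.join(WORD_DIR, word + ".npy"), mfcc)
```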
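
For step 7, the length matching amounts to a time-stretch. A sketch with librosa (wordsnip.sh may use sox or another tool; the paths are illustrative):

```python
# Sketch of the length-matching part of step 7: stretch or compress a word
# snippet so its duration matches the machine-synthesized reference clip.
# Assumes librosa and soundfile are installed.
import librosa
import soundfile as sf

def match_length(snippet_path, reference_path, out_path):
    snippet, sr = librosa.load(snippet_path, sr=None)
    reference, _ = librosa.load(reference_path, sr=None)
    # rate > 1 shortens the audio, rate < 1 lengthens it, so using
    # len(snippet) / len(reference) makes the output match the reference length.
    rate = len(snippet) / len(reference)
    stretched = librosa.effects.time_stretch(snippet, rate=rate)
    sf.write(out_path, stretched, sr)

# Illustrative call: align one snipped word with its reference clip.
match_length("snippet.wav", "StandardWords/WORD.wav", "words_new/WORD.wav")
```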
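
For step 8, a sketch of the per-nationality sampling done by getMFCC.py (the file-name convention used to recover word and nationality is an assumption):

```python
# Sketch of step 8: for each word, pick one random clip per nationality and
# store its MFCCs in mfcc_new_2.
# Assumes snippet names look like "<nationality>_<speaker>_<WORD>.wav"; the
# real getMFCC.py may parse the nationality differently.
import os
import random
from collections import defaultdict

import librosa
import numpy as np

IN_DIR, OUT_DIR = "words_new", "mfcc_new_2"
os.makedirs(OUT_DIR, exist_ok=True)

# Group clips by (word, nationality).
groups = defaultdict(list)
for name in os.listdir(IN_DIR):
    if not name.endswith(".wav"):
        continue
    nationality, speaker, word = os.path.splitext(name)[0].split("_", 2)
    groups[(word, nationality)].append(name)

# One random clip per (word, nationality) pair: ~55 words x ~193 nationalities.
for (word, nationality), clips in groups.items():
    name = random.choice(clips)
    y, sr = librosa.load(os.path.join(IN_DIR, name), sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    np.save(os.path.join(OUT_DIR, os.path.splitext(name)[0] + ".npy"), mfcc)
```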
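
For step 9, a minimal scikit-learn setup of the kind train.py can be modified into (the time-pooling and label parsing below are assumptions):

```python
# Sketch of step 9: train an accent classifier on the saved MFCC features.
# Assumes each .npy file in mfcc_new_2 is an (n_mfcc, frames) array and that
# the nationality label is the first "_"-separated token of the file name.
import os

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = [], []
for name in os.listdir("mfcc_new_2"):
    if not name.endswith(".npy"):
        continue
    mfcc = np.load(os.path.join("mfcc_new_2", name))
    # Pool over time (mean + std per coefficient) to get a fixed-length vector.
    X.append(np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]))
    y.append(name.split("_")[0])   # nationality label (assumed naming scheme)

X, y = np.array(X), np.array(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Swap in other models (RandomForestClassifier, MLPClassifier, ...) here.
clf = SVC(kernel="rbf", C=10).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```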