Digital Signal Processing: Character-Based Language Model
Introduction
A character-based language model built with the SRILM toolkit, plus a Viterbi-based decoding process for the language model implemented in C++.
Given ZhuYin-mixed sequences obtained from an imperfect acoustic model with phoneme loss, the task is to reconstruct and decode the correct sentence using a character-based language model. The language model is built with the SRILM toolkit, and decoding can be done either with SRILM's disambig or with the C++ implementation in this repo.
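The decoding idea described above can be illustrated with a minimal sketch. This is not the code in mydisambig.cpp; it is a toy bigram Viterbi search written in Python for readability. The function name `viterbi_decode`, the `<s>` start symbol, the flat `unseen_logp` penalty, and the tiny probability table are all assumptions made for illustration (a real SRILM model would use backoff weights rather than a flat penalty).

```python
import math

def viterbi_decode(observed, candidates, bigram_logp, unseen_logp=-10.0):
    """Return the most likely character sequence for `observed`.

    observed    : list of symbols (characters or ZhuYin initials)
    candidates  : dict mapping a symbol to the characters it may stand for
    bigram_logp : dict mapping (prev_char, char) to a log probability
    unseen_logp : flat penalty for bigrams missing from the table
                  (assumption; a real model would back off instead)
    """
    # best[c] = (score of the best path ending in c, that path)
    best = {"<s>": (0.0, [])}
    for symbol in observed:
        # Symbols without an entry are treated as standing for themselves.
        options = candidates.get(symbol, [symbol])
        new_best = {}
        for ch in options:
            # Choose the predecessor that maximises the total path score.
            score, path = max(
                ((prev_score + bigram_logp.get((prev, ch), unseen_logp), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[ch] = (score, path + [ch])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy example with made-up probabilities.
toy_candidates = {"ㄓ": ["中", "知"], "文": ["文"]}
toy_bigrams = {
    ("<s>", "中"): math.log(0.5),
    ("<s>", "知"): math.log(0.5),
    ("中", "文"): math.log(0.9),
    ("知", "文"): math.log(0.01),
}
decoded = viterbi_decode(["ㄓ", "文"], toy_candidates, toy_bigrams)
```

Each step keeps only the best-scoring path per candidate character, so the work grows with the sequence length times the square of the candidate count instead of with the number of possible sentences.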
.
├── src/
│   ├── Makefile -------------> Makefile (compiles with g++)
│   ├── corpus.txt -----------> Training corpus in Big5 encoding
│   ├── Big5-ZhuYin.map ------> Character to Zhu-Yin mapping in Big5 encoding
│   ├── mapping.py -----------> Creates the Zhu-Yin to character mapping from its inverse mapping
│   ├── mydisambig.cpp -------> My implementation of the Viterbi-based decoding process for the language model
│   ├── separator_big5.pl ----> Splits words into characters, inserting white space between characters
│   └── testdata/ ------------> Test data: 1.txt ~ 5.txt are the easy cases, 6.txt ~ 10.txt are the hard ones
├── image/
├── srilm-1.5.10.tar.gz ------> SRILM source code package
├── problem_description.pdf --> Work spec
└── Readme.md ----------------> This file
Usage
Compile code:
make all
Separate training and testing data into separate characters:
make separate
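As a rough illustration of what the separation step does, the following sketch inserts a single space between characters. It is not a port of separator_big5.pl: the Perl script works on Big5-encoded input, while this hypothetical `separate_characters` helper assumes the text has already been decoded into a Python string.

```python
def separate_characters(line):
    """Put one space between every non-whitespace character of `line`."""
    # Existing whitespace is dropped so the output spacing is uniform.
    return " ".join(ch for ch in line if not ch.isspace())

separated = separate_characters("中文 測試")
```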
Build Zhu-Yin to char mapping:
make map
This generates two files:
I) ZhuYin-Big5.map: the Zhu-Yin to Chinese character mapping in Big5 encoding
II) ZhuYin-Utf8.map: the same Zhu-Yin to Chinese character mapping in UTF-8 encoding, so that it can be inspected in an ordinary Linux environment
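The inversion performed by this step can be sketched as follows. This is not the contents of mapping.py; it is a minimal sketch under assumptions about the file format that this README does not state: each input line is assumed to hold one character, white space, and then its readings separated by `/`, and the output is assumed to key each character list on the first Zhu-Yin symbol of a reading, with every character also mapping to itself. The name `invert_mapping` is hypothetical.

```python
from collections import defaultdict

def invert_mapping(lines):
    """Build a Zhu-Yin-initial -> characters table from character -> readings lines.

    Assumed input format (not confirmed by the README): "<char> <reading>/<reading>..."
    """
    table = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        char, readings = parts
        for reading in readings.split("/"):
            if not reading:
                continue  # tolerate a trailing slash
            initial = reading[0]  # first Zhu-Yin symbol of the reading
            if char not in table[initial]:
                table[initial].append(char)
        table[char] = [char]  # a known character stands for itself
    return dict(table)

sample = ["中 ㄓㄨㄥ/ㄓㄨㄥˋ", "文 ㄨㄣˊ", "知 ㄓ"]
inverted = invert_mapping(sample)
```

When reading the real files, the lines would first need decoding, for example with `open(path, encoding="big5")`; writing the result with `encoding="utf-8"` would give the human-readable variant.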
Build language model:
make build_lm
Decode with SRILM disambig:
make run_disambig
Decode with my own disambig implementation (mydisambig.cpp):
make run
Decode with my own disambig implementation, but print the output to the screen instead of writing it to a file:
make run_cout
Clean executables:
make clean
Clean everything generated in the above steps:
make cleanest
The variables SRIPATH and MACHINE_TYPE can be specified by the user through the make command:
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 all
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 separate
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 map
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 build_lm
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 run_disambig
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 run
Default settings of SRIPATH and MACHINE_TYPE are:
SRIPATH can be obtained by running the command $ pwd under the srilm-1.5.10/ directory.
MACHINE_TYPE can be verified through the command: $ lscpu
Environment Setup
Install dependencies
Install csh if not already installed: $ sudo apt-get install csh
Install gawk if not already installed: $ sudo apt-get install gawk
Compile SRILM from source
The following instructions are for a 64-bit Ubuntu machine.
Use the SRILM source code provided in this repo, or download it here.
Untar the source code package: $ tar zxvf srilm-1.5.10.tar.gz
Enter the resulting SRILM directory: $ cd srilm-1.5.10/
Get the absolute path to the srilm-1.5.10/ directory: $ pwd
Modify srilm-1.5.10/Makefile and change the SRILM variable to the absolute path of srilm-1.5.10/, and change the MACHINE_TYPE variable to match the 64-bit Ubuntu architecture: