This project is a collection of scripts for language modeling. The scripts cover:
- text cleaning
- text normalization
- vocabulary counts and frequencies
- language model building
- testing language models
To prepare text, activate the Python environment and run the preparation scripts:
- source ~/py3env/bin/activate
- prepare/prepare_text_for_lm.sh
- prepare/normalize_months.sh
A vocabulary can be built in two ways:
- Based on a frequency threshold (build_vocab/get_vocabs_greater_than_n.sh)
- Based on the N most frequent terms (build_vocab/get_vocabs_most_freq.sh)
Both scripts use build_vocab/wordfreq2vocab.py:
usage: wordfreq2vocab.py [-h] -t TEXT -v VOCABULARY -f FREQUENCY
[-top TOP | -gt GT | -all]
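The two selection modes can be illustrated with a minimal sketch. This is not the code of wordfreq2vocab.py; the function names `count_words` and `select_vocab` are hypothetical, tokens are assumed to be whitespace-separated, and the threshold is assumed to be strict (count greater than N), as the script name get_vocabs_greater_than_n.sh suggests.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def select_vocab(counts, top=None, gt=None):
    """Return a sorted vocabulary chosen by one of three modes.

    top: keep the `top` most frequent words
    gt:  keep words whose count is strictly greater than `gt` (assumption)
    neither: keep every word
    """
    if top is not None and gt is not None:
        raise ValueError("choose either top or gt, not both")
    if top is not None:
        words = [w for w, _ in counts.most_common(top)]
    elif gt is not None:
        words = [w for w, c in counts.items() if c > gt]
    else:
        words = list(counts)
    return sorted(words)


# Toy corpus: "the" occurs 3 times, "cat" twice, every other word once.
corpus = ["the cat sat", "the cat ran", "the dog"]
counts = count_words(corpus)
vocab_top = select_vocab(counts, top=2)  # the two most frequent words
vocab_gt = select_vocab(counts, gt=1)    # words seen more than once
```

On this toy corpus both modes keep only "cat" and "the". On real data they differ: a threshold gives a vocabulary whose size depends on the corpus, while top-N fixes the size in advance.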
build_lm/build_lm.sh
- run_build_lm_v1.1.sh: build the LM
- test_LM_decoding.sh: decode using DMP
- sclite.sh: test the results
- formating.sh: reformat the utterance IDs and test the result
- mix_lm.sh: interpolate two language models
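As an illustration of what interpolating two language models means, here is a minimal sketch of linear interpolation over word probabilities. It is an assumption that mix_lm.sh uses this scheme; the function `interpolate`, the weight `lam`, and the toy tables are hypothetical. A real n-gram toolkit also has to handle contexts and back-off weights, which this sketch ignores.

```python
def interpolate(p_a, p_b, lam):
    """Linearly interpolate two word-probability tables.

    p_a, p_b: dicts mapping word -> probability
    lam: weight given to p_a; p_b gets 1 - lam
    Words missing from one table are treated as probability 0 there.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    words = set(p_a) | set(p_b)
    return {
        w: lam * p_a.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0)
        for w in words
    }


# Toy unigram models (hypothetical values, each sums to 1).
lm_a = {"the": 0.5, "cat": 0.5}
lm_b = {"the": 0.25, "dog": 0.75}
mixed = interpolate(lm_a, lm_b, lam=0.5)
```

Because both inputs sum to 1 and the weights sum to 1, the mixed table is still a valid probability distribution. The weight is usually tuned on held-out text.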