motazsaad/language-modeling

language modeling

Language: Shell. License: GPL-3.0

language-modeling

scripts for language modeling

This project is a collection of scripts that help with language modeling. The scripts cover:

text cleaning
text normalization
vocabulary counts and frequencies
building language models
testing language models

steps for preparing texts:

source ~/py3env/bin/activate
prepare/prepare_text_for_lm.sh
prepare/normalize_months.sh

steps for building the vocabulary:

A vocabulary can be built in two ways:

Based on a frequency threshold (build_vocab/get_vocabs_greater_than_n.sh)
Based on most frequent N terms (build_vocab/get_vocabs_most_freq.sh)

Both scripts use build_vocab/wordfreq2vocab.py

usage: wordfreq2vocab.py [-h] -t TEXT -v VOCABULARY -f FREQUENCY
                         [-top TOP | -gt GT | -all]
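
The two selection strategies can be illustrated with a minimal Python sketch. This is not the code of wordfreq2vocab.py, whose internals are not shown here. The function names (`count_words`, `vocab_greater_than`, `vocab_most_frequent`), the whitespace tokenization, and the strict "greater than" comparison are all assumptions made for illustration.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocab_greater_than(counts, n):
    """Keep words whose frequency is strictly greater than n (assumed semantics)."""
    return sorted(word for word, freq in counts.items() if freq > n)


def vocab_most_frequent(counts, top):
    """Keep the `top` most frequent words; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:top])


# Toy corpus standing in for the prepared text file.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts = count_words(corpus)

print(vocab_greater_than(counts, 1))
print(vocab_most_frequent(counts, 2))
```

A threshold keeps every word that is frequent enough, so the vocabulary size depends on the corpus. A top-N cut fixes the vocabulary size in advance, which can be easier to work with when a decoder or toolkit limits the number of words.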

steps for building LM:

build_lm/build_lm.sh

additional scripts:

run_build_lm_v1.1.sh: builds the LM
test_LM_decoding.sh: decodes using the DMP
sclite.sh: tests the results
formating.sh: reformats the utterance IDs and tests the result
mix_lm.sh: interpolates two language models
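
Interpolating two language models usually means taking a weighted average of their probabilities: P(w) = lam * P1(w) + (1 - lam) * P2(w). The sketch below shows this for two toy unigram models. It illustrates linear interpolation in general, not the contents of mix_lm.sh, and the names `interpolate` and `lam` are hypothetical.

```python
def interpolate(p1, p2, lam):
    """Linearly interpolate two unigram models given as word -> probability dicts.

    A word missing from one model is treated as having probability 0 there.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    words = set(p1) | set(p2)
    return {w: lam * p1.get(w, 0.0) + (1.0 - lam) * p2.get(w, 0.0) for w in words}


# Toy models; each one sums to 1.
model_a = {"the": 0.5, "cat": 0.5}
model_b = {"the": 0.25, "dog": 0.75}
mixed = interpolate(model_a, model_b, 0.5)

for word in sorted(mixed):
    print(word, mixed[word])
```

If both input models sum to 1, the mixture also sums to 1 for any `lam` in [0, 1]. In practice the weight is typically tuned on held-out text.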