Vietnamese orthographic normalization for preprocessing.
The file variant_dic/variants_without_wrong.tsv
is a dictionary of Vietnamese orthographic variants.
I extracted it from VNTQcorpus(big).txt using the pick_variant_from_corpus.py
Python script.
The file format is tab-separated:
- Odd-numbered columns: a syllable
- Even-numbered columns: the corpus frequency of the syllable in the column to its left
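
As a rough illustration of how such a file could be read, here is a minimal Python sketch. It is not taken from the scripts in this repository. It assumes that each line holds alternating tab-separated syllable and frequency fields, that frequencies are integers, and that all syllables on one line are variants of each other. The function names parse_variant_rows and build_normalizer are hypothetical, the sample frequencies are made up, and mapping every variant to the most frequent one is only one possible normalization policy.

```python
def parse_variant_rows(lines):
    """Yield one list of (syllable, frequency) pairs per input line.

    Assumes alternating tab-separated fields: syllable, frequency, syllable, ...
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Pair each syllable field with the frequency field to its right.
        pairs = [
            (fields[i], int(fields[i + 1]))
            for i in range(0, len(fields) - 1, 2)
        ]
        if pairs:  # skip blank lines
            yield pairs


def build_normalizer(groups):
    """Map every variant to the most frequent variant on its line.

    This policy is an assumption made for illustration only.
    """
    mapping = {}
    for pairs in groups:
        canonical = max(pairs, key=lambda pair: pair[1])[0]
        for syllable, _freq in pairs:
            mapping[syllable] = canonical
    return mapping


# Made-up sample lines in the assumed format (not real corpus counts).
sample = ["hòa\t120\thoà\t45\n", "thủy\t80\tthuỷ\t95\n"]
normalizer = build_normalizer(parse_variant_rows(sample))
```

To read the real dictionary, the same functions could be fed an open file handle, for example open("variant_dic/variants_without_wrong.tsv", encoding="utf-8"), provided the file matches the assumed layout.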
Parts of the scripts are borrowed from Luu Tuan Anh.