Clean Vietnamese Text - Wikipedia dump 08-2018
Alphabet: aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz
To get the full text, concatenate the split files in dataset/:
$ cat dataset/viwik18_* > viwik18.txt
To rebuild the text from the original Wikipedia dump:
$ wget https://dumps.wikimedia.org/viwiki/20180801/viwiki-20180801-pages-articles.xml.bz2
$ bzip2 -d viwiki-20180801-pages-articles.xml.bz2
$ python WikiExtractor.py --no-templates -s --lists viwiki-20180801-pages-articles.xml -q -o - | perl -CSAD -Mutf8 cleaner.pl > viwik18.txt
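However viwik18.txt is obtained, it can be useful to see which characters in it fall outside the alphabet listed above. The following is a minimal sketch, not part of this repository: the helper names `chars_outside_alphabet` and `scan_file` are hypothetical, and it assumes the file is UTF-8 and that the listed alphabet covers letters only, so spaces, digits, punctuation, or uppercase letters may show up in the report if the text contains them.

```python
import unicodedata
from collections import Counter

# Alphabet copied from this README (lowercase Vietnamese letters plus f, j, w, z).
ALPHABET = (
    "aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩị"
    "jklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz"
)


def chars_outside_alphabet(text, alphabet=ALPHABET):
    """Return {character: count} for characters of `text` not in `alphabet`.

    Both strings are normalized to NFC first, so a letter written as
    base character + combining mark is compared in its composed form.
    """
    allowed = set(unicodedata.normalize("NFC", alphabet))
    counts = Counter(unicodedata.normalize("NFC", text))
    return {ch: n for ch, n in counts.items() if ch not in allowed}


def scan_file(path, top=20):
    """Stream a UTF-8 text file and return its most common out-of-alphabet characters."""
    total = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # line by line, so the whole dump is never held in memory
            total.update(chars_outside_alphabet(line))
    return total.most_common(top)
```

Calling `scan_file("viwik18.txt")` returns a list of `(character, count)` pairs. An empty list would mean every character in the file belongs to the alphabet.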
Check out the new dataset, viwik19, at https://github.com/NTT123/viwik18/tree/viwik19