This project builds a plain text corpus from Wikipedia for NLP and speech recognition.
Download the 20130805 dump using get_enwiki_dump.sh. Downloading and decompressing take a few hours because of the dump's size (about 10 GB compressed, 43 GB decompressed).
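
For reference, the core of get_enwiki_dump.sh amounts to a download plus a decompression step. The sketch below is an assumption about what those commands look like; the mirror URL and dump filename follow Wikimedia's usual naming scheme but are not taken from the script itself.

```sh
#!/bin/bash
# Sketch of the download/decompress step. The URL and filename below are
# assumptions based on Wikimedia's standard dump layout; a dump this old may
# only be available from archive mirrors.
DUMP=enwiki-20130805-pages-articles.xml.bz2

# -c resumes a partially downloaded file if the transfer is interrupted.
wget -c "https://dumps.wikimedia.org/enwiki/20130805/$DUMP"

# -k keeps the .bz2 so the extraction can be redone without re-downloading.
bunzip2 -k "$DUMP"
```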
The processing pipeline consists of the following steps; a rough command sketch for each step follows the list.

- Split the enwiki dump into 215 smaller files (about 200 MB per file).
- Extract the text sections into a new file.
- Discard the wiki markup tags and extract the plain text into another new file.
- Write a loop to process all the split files into plain text files one by one.
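
A minimal sketch of the split step, assuming the decompressed dump is named enwiki-20130805-pages-articles.xml and GNU split is available (the chunk_ prefix is a placeholder). The -C option caps each output file at roughly 200 MB without breaking lines, which yields on the order of 215 chunks for a 43 GB dump:

```sh
#!/bin/bash
# Split the decompressed dump into ~200 MB chunks.
# -C 200m : at most 200 MB of complete lines per output file
# -d -a 3 : numeric suffixes wide enough for ~215 chunks (chunk_000, chunk_001, ...)
split -C 200m -d -a 3 enwiki-20130805-pages-articles.xml chunk_
```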
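
Extracting the text sections can be done with a sed range match over each chunk, since article bodies sit inside `<text ...> ... </text>` elements in the dump XML. The filenames below are placeholders:

```sh
#!/bin/bash
# Keep only the lines between <text ...> and the matching </text>.
sed -n '/<text/,/<\/text>/p' chunk_000 > chunk_000.text
```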
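
Discarding the wiki markup can be approximated with a chain of sed substitutions. This is only a rough sketch: it handles tags, simple templates, links, and bold/italic quotes, but not nested templates or tables, and the filenames are again placeholders:

```sh
#!/bin/bash
# Crude markup cleanup: drop XML/HTML tags and {{templates}}, reduce
# [[target|label]] and [[link]] to their visible text, remove ''/''' quotes.
sed -e 's/<[^>]*>//g' \
    -e 's/{{[^{}]*}}//g' \
    -e 's/\[\[[^]|]*|\([^]]*\)\]\]/\1/g' \
    -e 's/\[\[\([^]]*\)\]\]/\1/g' \
    -e "s/'''//g" \
    -e "s/''//g" \
    chunk_000.text > chunk_000.plain
```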
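
Finally, the per-chunk steps can be tied together in a loop over all split files. This sketch reuses the placeholder chunk_ naming from above and writes one .text and one .plain file per chunk:

```sh
#!/bin/bash
# Process all split chunks one by one: extract the <text> sections,
# then strip the wiki markup from each.
for f in chunk_*; do
    case "$f" in *.text|*.plain) continue ;; esac   # skip derived files

    # step 1: keep only the article text sections
    sed -n '/<text/,/<\/text>/p' "$f" > "$f.text"

    # step 2: strip the markup to get plain text
    sed -e 's/<[^>]*>//g' \
        -e 's/{{[^{}]*}}//g' \
        -e 's/\[\[[^]|]*|\([^]]*\)\]\]/\1/g' \
        -e 's/\[\[\([^]]*\)\]\]/\1/g' \
        -e "s/'''//g" \
        -e "s/''//g" \
        "$f.text" > "$f.plain"
done
```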