
creating-text-corpus-from-wikipedia

This project aims to create a plain text corpus from Wikipedia for NLP and Speech Recognition.

Get Dumps

Download the 20130805 dump using get_enwiki_dump.sh. Downloading and decompressing will take a few hours because of the file size (compressed: 10 GB, decompressed: 43 GB).
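
get_enwiki_dump.sh is a shell script; the Python sketch below is only a rough equivalent of what it does, assuming the standard dumps.wikimedia.org URL layout for pages-articles dumps. Older dump dates such as 20130805 may no longer be hosted, so adjust DATE to an available dump.

```python
# Rough Python equivalent of get_enwiki_dump.sh: download the pages-articles
# dump for a given date and decompress it. The URL below follows the usual
# dumps.wikimedia.org naming convention (an assumption, not taken from the repo).
import bz2
import shutil
import urllib.request

DATE = "20130805"  # dump date used in this project; change to an available dump
NAME = f"enwiki-{DATE}-pages-articles.xml.bz2"
URL = f"https://dumps.wikimedia.org/enwiki/{DATE}/{NAME}"

# Download the compressed dump (~10 GB); this can take hours.
with urllib.request.urlopen(URL) as resp, open(NAME, "wb") as out:
    shutil.copyfileobj(resp, out)

# Decompress to plain XML (~43 GB).
with bz2.open(NAME, "rb") as src, open(NAME[:-4], "wb") as dst:
    shutil.copyfileobj(src, dst)
```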

Extract Text from XML

  1. Split the enwiki dump into 215 smaller files (about 200 MB each).

  2. Extract the text sections into a new file.

  3. Discard wiki markup tags and write the plain text into another new file.

  4. Loop over all the split files to process them into plain text files one by one (see the sketch after this list).
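
The sketch below is a minimal, regex-based Python version of steps 1-4, assuming the decompressed dump is named enwiki-20130805-pages-articles.xml and that approximate markup cleanup is acceptable. The helper names (split_dump, clean_markup, extract_plain_text) are illustrative, not functions from this repository.

```python
# Minimal sketch of the split / extract / clean / loop pipeline.
import glob
import re

DUMP = "enwiki-20130805-pages-articles.xml"
CHUNK = 200 * 1024 * 1024  # ~200 MB per split file

# 1. Split the dump into ~200 MB pieces (line-aligned so XML tags stay intact).
def split_dump(path, chunk_size=CHUNK):
    part = 0
    size = 0
    out = open(f"{path}.part{part:03d}", "w", encoding="utf-8")
    with open(path, encoding="utf-8") as src:
        for line in src:
            if size >= chunk_size:
                out.close()
                part += 1
                size = 0
                out = open(f"{path}.part{part:03d}", "w", encoding="utf-8")
            out.write(line)
            size += len(line.encode("utf-8"))
    out.close()

# 2./3. Pull the <text>...</text> sections out of one split file and strip
# the most common wiki markup, leaving rough plain text.
TEXT_RE = re.compile(r"<text[^>]*>(.*?)</text>", re.DOTALL)

def clean_markup(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # wiki links
    text = re.sub(r"''+", "", text)                                # bold/italics
    text = re.sub(r"<[^>]+>", "", text)                            # leftover tags
    text = re.sub(r"&[a-z]+;", " ", text)                          # HTML entities
    return text

def extract_plain_text(part_path):
    with open(part_path, encoding="utf-8") as f:
        xml = f.read()
    with open(part_path + ".txt", "w", encoding="utf-8") as out:
        for section in TEXT_RE.findall(xml):
            out.write(clean_markup(section) + "\n")

# 4. Loop over all split files one by one.
if __name__ == "__main__":
    split_dump(DUMP)
    for part in sorted(glob.glob(DUMP + ".part[0-9][0-9][0-9]")):
        extract_plain_text(part)
```

Real wiki markup is much messier than these regexes cover (nested templates, tables, references); dedicated tools such as WikiExtractor handle those cases far more thoroughly.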