
Wikipedia text corpus for self-supervised NLP model training


Wikipedia 2 Corpus

Tools to extract and clean Wikipedia texts and transform them into a text corpus for self-supervised NLP model training. A prepared corpus for English and German is also included (see below).

We use WikiExtractor to extract the Wikipedia database dumps. The texts are split into sentences using SoMaJo. Each line of the text corpus contains exactly one sentence. Consecutive Wikipedia articles are separated by a blank line.
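The layout described above (one sentence per line, a blank line between articles) can be read back article by article. The following is a minimal sketch, not part of the repository; the function name `iter_articles` and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator, List


def iter_articles(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each article as a list of sentences; blank lines separate articles."""
    sentences: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            sentences.append(sentence)
        elif sentences:
            # A blank line closes the current article.
            yield sentences
            sentences = []
    if sentences:
        # The last article may not be followed by a blank line.
        yield sentences


# Hypothetical sample in the corpus layout (not real corpus content).
sample = ["First sentence.\n", "Second sentence.\n", "\n", "Another article.\n"]
articles = list(iter_articles(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so the multi-gigabyte corpus never has to be loaded into memory at once.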

Remove blank Lines

If you want to remove the blank lines in the text corpus you can use this command: sed -i '/^$/d' <filename>
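If GNU sed is not available, the same filtering can be approximated in Python. This is a sketch under assumptions: the function name `remove_blank_lines` and the demo file names are hypothetical, and unlike `sed -i` it writes to a separate output file instead of editing in place.

```python
import tempfile
from pathlib import Path


def remove_blank_lines(src: Path, dst: Path) -> int:
    """Copy src to dst without empty lines and return how many were dropped."""
    removed = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if line.rstrip("\n") == "":
                removed += 1  # empty line, i.e. an article separator
            else:
                fout.write(line)
    return removed


# Small demo on a temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "corpus.txt"
    dst = Path(tmp) / "corpus-no-blank.txt"
    src.write_text("A sentence.\n\nNext article.\n", encoding="utf-8")
    removed = remove_blank_lines(src, dst)
    result = dst.read_text(encoding="utf-8")
```

The file is streamed line by line, so memory use stays constant regardless of corpus size.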

Download the German text Corpus

  • size of the corpus (unzipped): 6.1G
  • number of lines: 59,475,915
  • download the single files:
  • combine the parts: cat dewiki-20220201-clean-part-* > dewiki-20220201-clean.zip
  • optional check: sha256sum dewiki-20220201-clean.zip should return 09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf
  • unzip the text file: unzip dewiki-20220201-clean.zip (to only test the archive's integrity without extracting it, use unzip -t dewiki-20220201-clean.zip)
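The combine and checksum steps above can also be done from Python, for example on systems without cat or sha256sum. This is a minimal sketch: the helper names `combine_parts` and `sha256_of_file` are hypothetical, and the demo uses tiny stand-in files instead of the real download parts.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def combine_parts(directory: Path, pattern: str, target: Path) -> None:
    """Concatenate all files matching pattern (sorted by name) into target."""
    parts = sorted(directory.glob(pattern))  # same order as shell globbing by name
    with target.open("wb") as fout:
        for part in parts:
            with part.open("rb") as fin:
                shutil.copyfileobj(fin, fout)


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with stand-in parts (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "demo-part-01").write_bytes(b"hello ")
    (tmp_dir / "demo-part-02").write_bytes(b"world")
    combined = tmp_dir / "demo.zip"
    combine_parts(tmp_dir, "demo-part-*", combined)
    combined_bytes = combined.read_bytes()
    digest = sha256_of_file(combined)
```

For the real corpus, the pattern would be dewiki-20220201-clean-part-* and the resulting digest would be compared against the checksum listed above.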

Download the English text Corpus

How you can replicate our work

  • download the raw Wikipedia dump and store it in the data directory:
    • German language: Select the most recent directory from https://dumps.wikimedia.org/dewiki/ and download the file named dewiki-<yyyymmdd>-pages-articles.xml.bz2. It is about 5.8 GB in size. We use dewiki-20220201-pages-articles.xml.bz2.
    • English language: Select the most recent directory from https://dumps.wikimedia.org/enwiki/ and download the file named enwiki-<yyyymmdd>-pages-articles.xml.bz2. It is about 18.1 GB in size. We use enwiki-20220201-pages-articles.xml.bz2.
  • create and activate a new Python environment (for example with conda)
  • install the dependencies: pip install -r requirements.txt
  • for de data run: python -m wikiextractor.WikiExtractor data/dewiki-20220201-pages-articles.xml.bz2 -o data/dewiki-20220201
  • for en data run: python -m wikiextractor.WikiExtractor data/enwiki-20220201-pages-articles.xml.bz2 -o data/enwiki-20220201
  • use the process_wiki_files.py script:
    • edit INPUT_DIR, OUTPUT_DIR and if needed LANGUAGE
    • run the script
  • concatenate the output files in OUTPUT_DIR by running cat <OUTPUT_DIR>/* > my_clean_wiki_corpus.txt
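To give an idea of what happens between the WikiExtractor run and the cleaned output, here is a rough sketch of reading the extracted files. It is not the code of process_wiki_files.py. It assumes WikiExtractor's default output, in which every article is wrapped in a `<doc ...>` ... `</doc>` block; the function name `iter_doc_texts` and the sample text are made up, and sentence splitting with SoMaJo is left out.

```python
import re
from typing import Iterator

# Assumption: each article is enclosed in <doc id=... url=... title=...> ... </doc>.
DOC_PATTERN = re.compile(r"<doc\b[^>]*>\n?(.*?)</doc>", re.DOTALL)


def iter_doc_texts(raw: str) -> Iterator[str]:
    """Yield the plain text of every non-empty <doc> block in raw."""
    for match in DOC_PATTERN.finditer(raw):
        text = match.group(1).strip()
        if text:
            yield text


# Hypothetical sample in the assumed WikiExtractor layout.
sample = (
    '<doc id="1" url="https://example.org/1" title="Alpha">\n'
    "Alpha\n\nAlpha is the first letter.\n"
    "</doc>\n"
    '<doc id="2" url="https://example.org/2" title="Beta">\n'
    "Beta\n\nBeta is the second letter.\n"
    "</doc>\n"
)
texts = list(iter_doc_texts(sample))
```

Each yielded text would then be split into sentences and written one sentence per line, followed by a blank line, to produce the corpus layout described at the top of this README.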

License

The Text Corpus

Like Wikipedia itself, the text corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

The Script

Copyright (c) 2022 Philip May

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.