WikimediaDumpExtractor

WikimediaDumpExtractor extracts pages from Wikimedia/Wikipedia database backup dumps.

Usage

Usage: java -jar WikimediaDumpExtractor.jar
 pages      <input XML file> <output directory> <categories> <search terms> <ids>
 categories <input SQL file> <output directory> [minimum category size, default 10000]
The values <categories> and <search terms> can contain multiple entries separated by '|'.
Website: https://github.com/EML4U/WikimediaDumpExtractor
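
For the categories mode, an invocation could look like the following sketch; the SQL file name is only a placeholder for a category dump file and the minimum category size is passed explicitly:

java -jar WikimediaDumpExtractor.jar categories enwiki-20080103-category.sql ./ 10000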

Example

Download the example XML file (enwiki-20080103-pages-articles-example.xml). It contains 4 pages extracted from the enwiki 20080103 dump. Then run the following command:

java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy" altruism ""

Afterwards, files similar to the example results will be created in the output directory.
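
To illustrate the '|' separator mentioned in the usage above, a pages run can also pass several categories and search terms at once; the additional values here ("Ethics", "egoism") are only illustrative:

java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy|Ethics" "altruism|egoism" ""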

Process large files

To process large XML files (e.g., enwiki 20080103 is 15 GB and enwiki 20210901 is 85 GB), set the following three JVM parameters:

java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar ...
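
A complete large-file run might look like the following sketch; the dump file name, the decompression step, the output directory, and the -Xmx heap setting are assumptions, not requirements of the tool:

bzip2 -dk enwiki-20210901-pages-articles.xml.bz2
java -Xmx4g -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar pages enwiki-20210901-pages-articles.xml ./output "Social philosophy" altruism ""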

How to get data

Get Wikimedia dumps from the Wikimedia Downloads site: https://dumps.wikimedia.org/
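
As a sketch, an English Wikipedia dump could be downloaded like this; the date in the URL is a placeholder and should be replaced with a dump date that is currently available on the index page:

wget https://dumps.wikimedia.org/enwiki/20210901/enwiki-20210901-pages-articles.xml.bz2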

Credits

Data Science Group (DICE) at Paderborn University

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.