The library crate provides functions (parse_from_file
and parse
) that parse the XML dumps of Wikimedia pages (for instance, pages-articles.xml.bz2
).
The binary crate converts the page information into formats that are easier to parse than XML:
CBOR sequence, JSONL, Bincode,
MessagePack.
Download an XML dump, such as pages-articles.xml.bz2
or pages-meta-current.xml.bz2
, for the Wikimedia project
that you are interested in, and run this command on the file, whether compressed (.xml.bz2
or .xml.7z
) or decompressed to .xml
:
cargo run --release -- --file xml-dump-path-here > cbor-file-name-here
Other formats are supported (JSONL, MessagePack, Bincode). For JSONL:
cargo run --release -- --file xml-dump-path-here --format jsonl > cbor-file-name-here
The resulting file contains all the fields in the XML. The format isn't documented, but it is fairly straightforward to figure out from the JSONL.
.xml.bz2
requires the bz2
feature and .xml.7z
requires the 7z
feature.
Both are enabled by the decompress
feature.