wikipedia-analysis

A (mostly) Rust tool to parse Wikipedia XML dumps and generate statistics


Overview

This project includes several parts:

  • A Wikipedia XML dump parser implemented in Rust that produces an intermediate link graph file (see the sketch after this list).
  • An analysis tool implemented in Rust that reads the intermediate link graph file and produces several outputs (see analyze.rs for documentation with more information).
  • A Python Jupyter notebook to process the output of the Rust analysis tool.
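
As a rough illustration, the intermediate link graph can be thought of as an adjacency list keyed by page ID. The sketch below is hypothetical; the actual on-disk format is defined by the parser source, not by this struct:

  use std::collections::HashMap;

  type PageId = u32;

  /// Hypothetical in-memory shape of the intermediate link graph.
  /// This only illustrates the idea of an ID-indexed adjacency list;
  /// see the parser code for the real representation.
  struct LinkGraph {
      /// Page title for each page ID.
      titles: HashMap<PageId, String>,
      /// Outgoing links: source page ID -> target page IDs.
      links: HashMap<PageId, Vec<PageId>>,
  }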

Running

The Rust parser and analysis tool are accessible through a CLI. Make sure to compile this project in release (optimized) mode, or it will be apocalyptically slow. Expect parsing to take 30 minutes to 1 hour depending on your disk I/O performance. 16 GB of RAM is recommended for parsing to avoid excessive swapping.
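
For example, assuming a standard Cargo project layout:

  cargo build --release
  ./target/release/wikipedia-analysis --help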

To access the CLI help, run wikipedia-analysis --help or wikipedia-analysis <subcommand> --help. Additional help/explanation is available as rustdoc in the code and may be compiled to HTML using cargo.
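
For example, to build the rustdoc and open it in a browser:

  cargo doc --open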

Motivation

I started this project for two reasons:

  • I was inspired by the Wiki Game and wanted to find out how connected the Wikipedia link graph is.
  • I thought this project was a good application for Rust. I had previously implemented it in Python and C++, so I wanted to try it in Rust to help learn the language.

Results

I will create a more thorough writeup with results and link it here when it's ready.