Experimenting with the MediaWiki API and network graphs

nbviewer Binder

This README is included in the pywikibot2gephi.ipynb Jupyter notebook, so you might as well head right over there.

TL;DR

These are my experiments in learning how to use any MediaWiki API through Pywikibot to collect data I am interested in processing as a network graph (originally in the convenient tool Gephi). Besides pretty pictures, network graphs offer powerful methods to visualize and elucidate data in ways that would otherwise be difficult. Two applications I have in mind are story narrative charts and corporate wiki management.

Overview

For the longest time, I've enjoyed playing with various types of network graphs, mainly using the revered open source tool Gephi. However, data acquisition and preparation are usually the challenge, so I've been wanting to get into programmatic ways of working with the data. So far I haven't explored lower-level tools like Neo4j or Wikibase (the software that, together with MediaWiki, powers Wikidata, itself the structured data storage behind Wikipedia).

From the beginning, I've usually fed Gephi by creating some type of CSV/spreadsheet node and edge lists or an adjacency matrix. With some learning effort, and assuming your data lives in, for instance, Wikidata, you can feed it from the SemanticWebImport Gephi plugin and SPARQL queries (see my previous tutorial on that). There is also the option to simply scrape the web (that example uses Beautiful Soup, Selenium, Pandas, NetworkX and Matplotlib), but that can easily get messy.
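To make the CSV hand-off concrete, here is a minimal sketch (the file names and toy data are placeholders) of writing node and edge lists with the column names Gephi's import wizard recognizes:

```python
# Write a toy node list and edge list in Gephi-friendly CSV form.
# "Id"/"Label" and "Source"/"Target" are the column names Gephi's
# import wizard looks for; the data itself is made up.
import csv

nodes = [("1", "Alice"), ("2", "Bob"), ("3", "Carol")]
edges = [("1", "2"), ("2", "3")]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    writer.writerows(nodes)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
```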

Thus follows my current toolchain: there is a lot of exciting data accessible through the MediaWiki APIs (not only Wikipedia but also sites in the Fandom/Wikia family, or perhaps your corporate wiki?). Pywikibot offers a convenient and well-used (though not as well documented?) Python wrapper for the MediaWiki API, handling authentication, parsing, caching and configurable throttling. igraph (n.b. python-igraph) seems sensible and provides an interface to the Gephi GraphStreaming plugin. Jupyter notebooks are practically a given (see also my advent-of-code solutions, also in Jupyter notebooks).
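As a minimal sketch of that toolchain (the seed page is an arbitrary example, and Pywikibot expects a user-config.py, or the PYWIKIBOT_NO_USER_CONFIG=1 environment variable, to be in place), one could fetch a page's outgoing wikilinks and turn them into an igraph graph:

```python
# Fetch the outgoing wikilinks of a seed page via Pywikibot and
# build a directed igraph graph with one edge per link.
import pywikibot
import igraph

site = pywikibot.Site("en", "wikipedia")
seed = pywikibot.Page(site, "Network science")

# One (source, target) tuple per outgoing wikilink; cap at 25 links
edges = [(seed.title(), linked.title())
         for linked in seed.linkedPages(total=25)]

g = igraph.Graph.TupleList(edges, directed=True)
print(g.summary())
```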

As for potential paths of development, for when you don't have Gephi running alongside the notebook, one could opt for NetworkX along with further use of IPython.display, or even try out something like pyvis (article here; see also their two previous articles).
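For instance, a minimal sketch of the NetworkX route (the edges are made up), rendering inline in the notebook via Matplotlib with no Gephi required:

```python
# Draw a tiny toy graph inline in the notebook with NetworkX and
# Matplotlib instead of streaming it to Gephi.
import networkx as nx
import matplotlib.pyplot as plt

g = nx.Graph()
g.add_edges_from([("Network science", "Graph theory"),
                  ("Network science", "Sociology")])
nx.draw(g, with_labels=True, node_color="lightblue")
plt.show()
```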

Wikimedians may expect more Wikidata or even PAWS, but that's not where I'm going currently.

For inspiration, I draw mainly on the article "Knowledge Graphs (extended version)" by Aidan Hogan et al., and on Alberto Cottica et al.'s presentation "People, words, data: Harvesting collective intelligence" and article "Semantic Social Networks: A Mixed Methods Approach to Digital Ethnography".

Usage

  • Have Python 3.x with Jupyter and preferably something like virtualenv installed
  • Define a virtualenv or similar to keep dependencies manageable for this project: `virtualenv ./venv`. Activate it with `source venv/bin/activate`. In it, run `pip install -r requirements.txt`
  • IDEs like VS Code may pick up on your virtualenv, or you may have to register it as a Jupyter kernel (e.g. via ipykernel)
  • Open the notebook in your IDE, or through Jupyter: `jupyter notebook pywikibot2gephi.ipynb`
  • Optionally, install Gephi (0.9.2), start it and activate the Graph Streaming "master server" (see the streaming sketch after this list)
  • Run, explore and modify the examples!
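
For the Gephi step, here is a minimal sketch of streaming a graph from the notebook to the Graph Streaming master server. It assumes python-igraph's igraph.remote.gephi module and the plugin's default port; the workspace name in the URL must match your Gephi session:

```python
# Push a built-in demo graph to a running Gephi instance through
# the Graph Streaming plugin's HTTP "master server".
import igraph
from igraph.remote.gephi import GephiConnection, GephiGraphStreamer

g = igraph.Graph.Famous("Zachary")  # well-known karate club demo graph

# Adjust "workspace1" to whichever workspace Gephi is serving
conn = GephiConnection(url="http://127.0.0.1:8080/workspace1")
streamer = GephiGraphStreamer()
streamer.post(g, conn)  # sends all nodes and edges in one batch
```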

Alternatively, just click the "launch binder" button, and mybinder.org together with the repo2docker GitHub Action should have all the requirements sorted for you in a jiffy (obviously a binder won't be able to connect to a local Gephi). Another option is to install the notebook in PAWS ("PAWS: A Web Shell"), but I haven't tried that.

Changelog

2021-10-16

Cleaned up the project and wrote down (and reconstructed) this README. Currently working in the dev3 branch, because I want things tidy before I set them in stone (a.k.a. main).

2021-10-10

Created this repository and started my experiments. Yesterday I confirmed that by hard-coding our corporate SSO session cookie (figure out a way to pull it from the browser?) into pywikibot.lwp (the "Set-Cookie3" format), the regular MediaWiki API of the corporate wiki is usable. So while this is playing around for personal education, my hope is that it could also be quite useful for various methods of corporate information and community management.
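
For reference, a pywikibot.lwp entry in the Set-Cookie3 format might look roughly like this (the cookie name, domain and value are placeholders, not the actual corporate setup):

```
#LWP-Cookies-2.0
Set-Cookie3: SSO_SESSION="<session-id-copied-from-browser>"; path="/"; domain=".wiki.example.com"; path_spec; secure; version=0
```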

See also