wtf_wikipedia

a pretty-committed wikipedia markup parser

by Spencer Kelly and contributors

wtf_wikipedia turns wikipedia's markup language into JSON,
so getting data from wikipedia is easier.

🏠 Try to have a good time. 🛀

seriously,
this is among the most-curious data formats you can find.
(then we buried our human-record in it)

Consider:

wtf_wikipedia supports many recursive shenanigans, deprecated and obscure template variants, and illicit 'wiki-esque' shorthands.

It will try its best, and fail in reasonable ways.

→ building your own parser is never a good idea →

← but this library aims to be a straight-forward way to get data out of wikipedia

... so don't be mad at me, be mad at this.

well ok then,

npm install wtf_wikipedia

var wtf = require('wtf_wikipedia');

wtf.fetch('Whistling').then(doc => {

  doc.categories();
  //['Oral communication', 'Vocal music', 'Vocal skills']

  doc.sections('As communication').plaintext();
  // 'A traditional whistled language named Silbo Gomero..'

  doc.images(0).thumb();
  // 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'

  doc.sections('See Also').links().map(l => l.page);
  //['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});

on the client-side:

<script src="https://unpkg.com/wtf_wikipedia@latest/builds/wtf_wikipedia.min.js"></script>
<script>
  //(follows redirect)
  wtf.fetch('On a Friday', 'en', function(err, doc) {
    var data = doc.infobox(0).data
    data['current_members'].links().map(l => l.page);
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  });
</script>

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parses infoboxes into a formatted key-value object
  • Handles recursive templates and links, like [[.. [[...]] ]]
  • Provides per-sentence plaintext and link resolution
  • Parses and formats internal links
  • Creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
  • Parses images, headings, and categories
  • Converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • Parses citation metadata
  • Eliminates xml, latex, css, and table-sorting cruft
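
for instance, the redirect-detection from that list can be tried on a one-liner. A minimal sketch, using the isRedirect() method documented below:

var wtf = require('wtf_wikipedia');

var doc = wtf('#REDIRECT [[Whistled language]]');
doc.isRedirect();
// true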

But what about...

Parsoid:

Wikimedia's Parsoid is the official javascript wikiscript parser. It reliably turns wikiscript into HTML, but not into valid XML.

To use it for data-mining, you'll need to:

parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping

which is fine,

but getting structured data this way (say, sentences or infobox values) is still a complex + weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the parsoid project, and thanks its contributors.

Full data-dumps:

wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole wikipedia dump on a laptop in a couple hours. It's definitely the way to go, instead of fetching many pages off the api.
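
if you go that route, usage looks something like this (a sketch; the file path and db name are placeholders, so check the dumpster-dive docs for the exact options):

var dumpster = require('dumpster-dive');

dumpster({ file: './enwiki-latest-pages-articles.xml', db: 'enwiki' }, function() {
  console.log('done!');
});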

API

  • wtf(wikiText, [options])
  • wtf.fetch(title, [lang_or_wikiid], [options], [callback])

outputs:

  • doc.plaintext()
  • doc.html()
  • doc.markdown()
  • doc.latex()
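
each renders the same parsed document into a different format. A rough sketch (the exact link targets vary by version):

var doc = wtf("[[Toronto]] is in [[Canada]].");
doc.plaintext();
// 'Toronto is in Canada.'
doc.markdown();
// roughly: '[Toronto](./Toronto) is in [Canada](./Canada).'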

Document methods:

  • doc.isRedirect() - boolean
  • doc.isDisambiguation() - boolean
  • doc.categories()
  • doc.sections()
  • doc.sentences()
  • doc.images()
  • doc.links()
  • doc.tables()
  • doc.citations()
  • doc.infoboxes()
  • doc.coordinates()
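
a quick sketch of a few of these, on a parsed string (assuming the same indexed-accessor style as images(0) above):

var doc = wtf("[[Paris]] is the capital of [[France]].");
doc.sentences(0).plaintext();
// 'Paris is the capital of France.'
doc.links().map(l => l.page);
// ['Paris', 'France']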

Section methods:

(a section is any content between ==these kinds== of headers)

  • sec.indentation()
  • sec.sentences()
  • sec.links()
  • sec.tables()
  • sec.templates()
  • sec.lists()
  • sec.interwiki()
  • sec.images()
  • sec.index()
  • sec.nextSibling()
  • sec.lastSibling()
  • sec.children()
  • sec.parent()
  • sec.remove()
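
for example, traversing a small document (a sketch, assuming sections() also accepts a numeric index, like images(0) does):

var doc = wtf('==Food==\n===Pizza===\nwith cheese\n==Drink==\nwater');
var food = doc.sections(0);
food.indentation();
// 0  (a top-level ==section==)
food.children().length;
// 1  (the ===Pizza=== sub-section)
doc.sections(1).parent();
// back up to the 'Food' section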

Examples

wtf(wikiText)

flip your wikimedia markup into a Document object

import wtf from 'wtf_wikipedia'
wtf("==In Popular Culture==\n*harry potter's wand\n* the simpsons fence");
// Document {plaintext(), html(), latex()...}

wtf.fetch(title, [lang_or_wikiid], [options], [callback])

fetches the raw contents of a mediawiki article from the wikipedia action API, and parses them into a Document.

This method supports an errback callback form, and returns a Promise if the callback is omitted.

to call non-english wikipedia apis, add its language name as the second parameter:

wtf.fetch('Toronto', 'de', function(err, doc) {
  doc.plaintext();
  //Toronto ist mit 2,6 Millionen Einwohnern..
});

you may also pass a wikipedia page id as a parameter, instead of the page title:

wtf.fetch(64646, 'de').then(console.log).catch(console.log)

the fetch method follows redirects.

doc.plaintext()

returns only the nice, readable text of the article

var wiki =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).plaintext();
//"Boston's baseball field has a 37ft wall."

CLI

if you're scripting this from the shell, or from another language, install with a -g, and then run:

$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

  • 1️⃣ pass an Api-User-Agent header, so they can easily identify and throttle misbehaving scripts
  • 2️⃣ bundle multiple pages into one request, as an array
  • 3️⃣ run requests serially, or at least slowly, as shown below:

wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
  'Api-User-Agent': 'spencermountain@gmail.com'
}).then((docList) => {
  let allLinks = docList.map(doc => doc.links());
  console.log(allLinks);
});

Contributing

projects like these are only done with many hands, and I try to be a friendly and easy maintainer. (promise!)

Join in!

Thank you to the cross-fetch and jshashes libraries.

MIT