wtf_wikipedia

a pretty-committed wikipedia markup parser

by Spencer Kelly and contributors

wtf_wikipedia turns wikipedia's markup language into JSON,
so getting data from wikipedia is easier.

🏠 Try to have a good time. 🛀

seriously,
this is among the most-curious data formats you can find.
(then we buried our human-record in it)

Consider:

wtf_wikipedia supports many recursive shenanigans, deprecated and obscure template variants, and illicit 'wiki-esque' shorthands.

It will try its best, and fail in reasonable ways.

→ building your own parser is never a good idea →

← but this library aims to be a straight-forward way to get data out of wikipedia

... so don't be mad at me, be mad at this.

well ok then,

npm install wtf_wikipedia

var wtf = require('wtf_wikipedia');

wtf.fetch('Whistling').then(doc => {

  doc.categories();
  //['Oral communication', 'Vocal music', 'Vocal skills']

  doc.sections('As communication').plaintext();
  // 'A traditional whistled language named Silbo Gomero..'

  doc.images(0).thumb();
  // 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'

  doc.sections('See Also').links().map(l => l.page);
  //['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});

on the client-side:

<script src="https://unpkg.com/wtf_wikipedia@latest/builds/wtf_wikipedia.min.js"></script>
<script>
  //(follows redirect)
  wtf.fetch('On a Friday', 'en', function(err, doc) {
    var data = doc.infobox(0).data
    data['current_members'].links().map(l => l.page);
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  });
</script>

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parses infoboxes into a formatted key-value object
  • Handles recursive templates and links, like [[.. [[...]] ]]
  • Provides per-sentence plaintext and link resolution
  • Parses and formats internal links
  • Creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
  • Parses images, headings, and categories
  • Converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • Parses citation metadata
  • Eliminates xml, latex, css, and table-sorting cruft
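
for instance, the redirect-detection from that list can be tried on a one-liner. A minimal sketch, using the isRedirect() method documented below:

var wtf = require('wtf_wikipedia');

var doc = wtf('#REDIRECT [[Whistled language]]');
doc.isRedirect();
// true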

But what about...

Parsoid:

Wikimedia's Parsoid is the official javascript wikiscript parser. It reliably turns wikiscript into HTML, but not into valid XML.

To use it for data-mining, you'll need to:

parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping

which is fine,

but getting structured data this way (say, sentences or infobox values) is still a complex + weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the parsoid project, and thanks its contributors.

Full data-dumps:

wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole wikipedia dump on a laptop in a couple hours. It's definitely the way to go, instead of fetching many pages off the api.
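
if you go that route, usage looks something like this (a sketch; the file path and db name are placeholders, so check the dumpster-dive docs for the exact options):

var dumpster = require('dumpster-dive');

dumpster({ file: './enwiki-latest-pages-articles.xml', db: 'enwiki' }, function() {
  console.log('done!');
});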

API

  • wtf(wikiText, [options])
  • wtf.fetch(title, [lang_or_wikiid], [options], [callback])

outputs:

  • doc.plaintext()
  • doc.html()
  • doc.markdown()
  • doc.latex()
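
each renders the same parsed document into a different format. A rough sketch (the exact link targets vary by version):

var doc = wtf("[[Toronto]] is in [[Canada]].");
doc.plaintext();
// 'Toronto is in Canada.'
doc.markdown();
// roughly: '[Toronto](./Toronto) is in [Canada](./Canada).'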

Document methods:

  • doc.isRedirect() - boolean
  • doc.isDisambiguation() - boolean
  • doc.categories()
  • doc.sections()
  • doc.sentences()
  • doc.images()
  • doc.links()
  • doc.tables()
  • doc.citations()
  • doc.infoboxes()
  • doc.coordinates()
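
a quick sketch of a few of these, on a parsed string (assuming the same indexed-accessor style as images(0) above):

var doc = wtf("[[Paris]] is the capital of [[France]].");
doc.sentences(0).plaintext();
// 'Paris is the capital of France.'
doc.links().map(l => l.page);
// ['Paris', 'France']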

Section methods:

(a section is any content between ==these kinds== of headers)

  • sec.indentation()
  • sec.sentences()
  • sec.links()
  • sec.tables()
  • sec.templates()
  • sec.lists()
  • sec.interwiki()
  • sec.images()
  • sec.index()
  • sec.nextSibling()
  • sec.lastSibling()
  • sec.children()
  • sec.parent()
  • sec.remove()
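
for example, traversing a small document (a sketch, assuming sections() also accepts a numeric index, like images(0) does):

var doc = wtf('==Food==\n===Pizza===\nwith cheese\n==Drink==\nwater');
var food = doc.sections(0);
food.indentation();
// 0  (a top-level ==section==)
food.children().length;
// 1  (the ===Pizza=== sub-section)
doc.sections(1).parent();
// back up to the 'Food' section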

Examples

wtf(wikiText)

flip your wikimedia markup into a Document object

import wtf from 'wtf_wikipedia'
wtf("==In Popular Culture==\n*harry potter's wand\n* the simpsons fence");
// Document {plaintext(), html(), latex()...}

wtf.fetch(title, [lang_or_wikiid], [options], [callback])

fetches the raw contents of a mediawiki article from the wikipedia action API, and parses them into a Document.

This method supports an errback callback form, and returns a Promise if the callback is omitted.

to call non-english wikipedia apis, add its language name as the second parameter:

wtf.fetch('Toronto', 'de', function(err, doc) {
  doc.plaintext();
  //Toronto ist mit 2,6 Millionen Einwohnern..
});

you may also pass a wikipedia page id as a parameter, instead of the page title:

wtf.fetch(64646, 'de').then(console.log).catch(console.log)

the fetch method follows redirects.

doc.plaintext()

returns only the nice, readable text of the article

var wiki =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).plaintext();
//"Boston's baseball field has a 37ft wall."

CLI

if you're scripting this from the shell, or from another language, install with a -g, and then run:

$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

  • 1️⃣ pass an Api-User-Agent header, so they can easily identify and throttle misbehaving scripts
  • 2️⃣ bundle multiple pages into one request, as an array
  • 3️⃣ run requests serially, or at least slowly, as shown below:

wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
  'Api-User-Agent': 'spencermountain@gmail.com'
}).then((docList) => {
  let allLinks = docList.map(doc => doc.links());
  console.log(allLinks);
});

Contributing

projects like these are only done with many hands, and I try to be a friendly and easy maintainer. (promise!)

Join in!

Thank you to the cross-fetch and jshashes libraries.

MIT