dumpster-dip is a script that parses a wikipedia dump into ad-hoc data files.
dumpster-dive is a sister script that loads the dump into mongodb, instead.
use whichever you prefer!
1. Download a dump

cruise the wikipedia dumps page and look for `${LANG}wiki-latest-pages-articles.xml.bz2`

2. Unpack it

```
bzip2 -d ./path/to/enwiki-latest-pages-articles.xml.bz2
```

3. Install and run

```
npm install dumpster-dip
```
```js
import dip from 'dumpster-dip'

const opts = {
  input: '/path/to/my-wikipedia-article-dump.xml',
  // return the first sentence of each page
  parse: (doc) => {
    return doc.sentences()[0].text()
  },
}

// this promise takes ~4hrs
dip(opts).then(() => {
  console.log('done!')
})
```
en-wikipedia takes about 4hrs on a macbook.
This tool is intended to be a clean way to pull random bits out of wikipedia, like 'all the birthdays of basketball players':
```js
await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  // only process pages in this category
  doPage: (doc) => doc.categories().find((cat) => cat === `American men's basketball players`),
  // return the birth date listed in each page's infobox
  parse: (doc) => doc.infobox()?.get('birth_date'),
})
```
It uses wtf_wikipedia as the wikiscript parser.
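the `doc` object handed to `doPage` and `parse` is a wtf_wikipedia Document, so its usual methods are available there. a rough sketch of the kind of thing you can pull out (the method names come from wtf_wikipedia; what they return depends on the page):

```js
import dip from 'dumpster-dip'

await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  parse: (doc) => {
    return {
      title: doc.title(), // the page title
      categories: doc.categories(), // array of category names
      firstSentence: doc.sentences()[0]?.text(), // lead sentence, if any
      infobox: doc.infobox() ? doc.infobox().json() : null, // first infobox, if any
    }
  },
})
```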
By default, it outputs an individual file for every wikipedia article. Operating systems don't like having ~6m files in one folder, though - so it nests them 2-deep, using the first 4 characters of the filename's hash:

```
/BE
  /EF
    /Dennis_Rodman.txt
    /Hillary_Clinton.txt
```
as a helper, this library exposes a function for navigating this directory scheme:

```js
import getPath from 'dumpster-dip/nested-path'

let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
```

This is the same scheme that wikipedia uses internally.
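so, to read one of the parsed results back later, you can join that path onto the output directory - a quick sketch, assuming the default `./results` folder and `.txt` files:

```js
import fs from 'fs'
import path from 'path'
import getPath from 'dumpster-dip/nested-path'

// resolve the nested path for a title, then read the parsed result back
// (assumes the default './results' output dir and a '.txt' file extension)
let file = getPath('Dennis Rodman')
let result = fs.readFileSync(path.join('./results', file), 'utf8')
console.log(result)
```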
if you want all files in one flat directory, you can do:

```js
let opts = {
  outputDir: './results',
  outputMode: 'flat',
}
```
if you want all results in one file, you can do:

```js
let opts = {
  outputDir: './results',
  outputMode: 'ndjson',
}
```
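each line of the resulting ndjson file is one JSON result, so you can stream it back with readline. a sketch (the filename here is a guess - check `./results` for the real one):

```js
import fs from 'fs'
import readline from 'readline'

// stream the ndjson output back, one JSON result per line
// (the filename below is an assumption - look in ./results for the actual file)
const reader = readline.createInterface({
  input: fs.createReadStream('./results/dip.ndjson'),
})
reader.on('line', (line) => {
  if (!line.trim()) {
    return
  }
  let row = JSON.parse(line)
  console.log(row)
})
```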
the full list of options, with their defaults:

```js
let opts = {
  // directory for all our new files
  outputDir: './results', // (default)
  // how we should write the results
  outputMode: 'nested', // (default)
  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, // (default - the article namespace)
  // how many concurrent workers to run
  workers: cpuCount, // (default is the cpu count)
  // interval to log status
  heartbeat: 5000, // every 5 seconds
  // parse redirects, too
  redirects: false, // (default)
  // parse disambiguation pages, too
  disambiguation: true, // (default)
  // what to return, for every page
  parse: (doc) => doc.json(), // (default)
  // should we return anything for this page?
  doPage: (doc) => true, // (default)
  // add plugins to wtf_wikipedia
  extend: (wtf) => {
    wtf.extend((models, templates, infoboxes) => {
      models.Doc.prototype.isPerson = function () {
        return this.categories().find((cat) => cat.match(/people/))
      }
    })
  },
}
```
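putting it together - once a plugin method is registered in `extend`, you can use it from `doPage` or `parse`. a sketch using the `isPerson()` helper from above (the output settings here are just the options spelled out, not requirements):

```js
import dip from 'dumpster-dip'

await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  outputDir: './results',
  outputMode: 'ndjson',
  // register the isPerson() helper shown above
  extend: (wtf) => {
    wtf.extend((models) => {
      models.Doc.prototype.isPerson = function () {
        return this.categories().find((cat) => cat.match(/people/))
      }
    })
  },
  // only keep biography-ish pages
  doPage: (doc) => doc.isPerson(),
  // return the title and first sentence of each one
  parse: (doc) => ({ title: doc.title(), text: doc.sentences()[0]?.text() }),
})
```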
MIT