dumpster-dip is a script that parses a wikipedia dump into ad-hoc data files.
dumpster-dive is a sister script that loads the dump into mongodb, instead.
use whichever you prefer!
1. Download a dump

cruise the wikipedia dumps page and look for `${LANG}wiki-latest-pages-articles.xml.bz2`

2. Unpack it

```
bzip2 -d ./path/to/enwiki-latest-pages-articles.xml.bz2
```

3. Install and run

```
npm install dumpster-dip
```
```js
import dip from 'dumpster-dip'

const opts = {
  input: '/path/to/my-wikipedia-article-dump.xml',
  // return the first sentence of each page
  parse: (doc) => {
    return doc.sentences()[0].text()
  },
}

// this promise takes ~4hrs
dip(opts).then(() => {
  console.log('done!')
})
```
en-wikipedia takes about 4hrs on a macbook.
This tool is intended to be a clean way to pull random bits out of wikipedia, like 'all the birthdays of basketball players':
```js
await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  // only process pages in this category
  doPage: (doc) => doc.categories().find((cat) => cat === `American men's basketball players`),
  // return the birth date listed in each page's infobox
  parse: (doc) => doc.infobox()?.get('birth_date'),
})
```
It uses wtf_wikipedia as the wikiscript parser.
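the `doc` object handed to `doPage` and `parse` is a wtf_wikipedia Document, so its usual methods are available there. a rough sketch of the kind of thing you can pull out (the method names come from wtf_wikipedia; what they return depends on the page):

```js
import dip from 'dumpster-dip'

await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  parse: (doc) => {
    return {
      title: doc.title(), // the page title
      categories: doc.categories(), // array of category names
      firstSentence: doc.sentences()[0]?.text(), // lead sentence, if any
      infobox: doc.infobox() ? doc.infobox().json() : null, // first infobox, if any
    }
  },
})
```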
By default, it outputs an individual file for every wikipedia article. Operating systems don't like having ~6m files in one folder, though - so it nests them 2-deep, using the first 4 characters of the filename's hash:

```
/BE
  /EF
    /Dennis_Rodman.txt
    /Hillary_Clinton.txt
```
as a helper, this library exposes a function for navigating this directory scheme:

```js
import getPath from 'dumpster-dip/nested-path'

let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
```

This is the same scheme that wikipedia uses internally.
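so, to read one of the parsed results back later, you can join that path onto the output directory - a quick sketch, assuming the default `./results` folder and `.txt` files:

```js
import fs from 'fs'
import path from 'path'
import getPath from 'dumpster-dip/nested-path'

// resolve the nested path for a title, then read the parsed result back
// (assumes the default './results' output dir and a '.txt' file extension)
let file = getPath('Dennis Rodman')
let result = fs.readFileSync(path.join('./results', file), 'utf8')
console.log(result)
```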
if you want all files in one flat directory, you can do:

```js
let opts = {
  outputDir: './results',
  outputMode: 'flat',
}
```
if you want all results in one file, you can do:

```js
let opts = {
  outputDir: './results',
  outputMode: 'ndjson',
}
```
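each line of the resulting ndjson file is one JSON result, so you can stream it back with readline. a sketch (the filename here is a guess - check `./results` for the real one):

```js
import fs from 'fs'
import readline from 'readline'

// stream the ndjson output back, one JSON result per line
// (the filename below is an assumption - look in ./results for the actual file)
const reader = readline.createInterface({
  input: fs.createReadStream('./results/dip.ndjson'),
})
reader.on('line', (line) => {
  if (!line.trim()) {
    return
  }
  let row = JSON.parse(line)
  console.log(row)
})
```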
the full list of options, with their defaults:

```js
let opts = {
  // directory for all our new files
  outputDir: './results', // (default)
  // how we should write the results
  outputMode: 'nested', // (default)
  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, // (default - the article namespace)
  // how many concurrent workers to run
  workers: cpuCount, // (default is the cpu count)
  // interval to log status
  heartbeat: 5000, // every 5 seconds
  // parse redirects, too
  redirects: false, // (default)
  // parse disambiguation pages, too
  disambiguation: true, // (default)
  // what to return, for every page
  parse: (doc) => doc.json(), // (default)
  // should we return anything for this page?
  doPage: (doc) => true, // (default)
  // add plugins to wtf_wikipedia
  extend: (wtf) => {
    wtf.extend((models, templates, infoboxes) => {
      models.Doc.prototype.isPerson = function () {
        return this.categories().find((cat) => cat.match(/people/))
      }
    })
  },
}
```
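putting it together - once a plugin method is registered in `extend`, you can use it from `doPage` or `parse`. a sketch using the `isPerson()` helper from above (the output settings here are just the options spelled out, not requirements):

```js
import dip from 'dumpster-dip'

await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  outputDir: './results',
  outputMode: 'ndjson',
  // register the isPerson() helper shown above
  extend: (wtf) => {
    wtf.extend((models) => {
      models.Doc.prototype.isPerson = function () {
        return this.categories().find((cat) => cat.match(/people/))
      }
    })
  },
  // only keep biography-ish pages
  doPage: (doc) => doc.isPerson(),
  // return the title and first sentence of each one
  parse: (doc) => ({ title: doc.title(), text: doc.sentences()[0]?.text() }),
})
```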
MIT