import a subset or a full Wikidata dump into a CouchDB database

import-wikidata-dump-to-couchdb

A tool to transfer an extract of a wikidata dump into a CouchDB database


2024 archive note

This tool was a somewhat naive implementation; if I wanted to do this today, I would do it differently, and make sure to use CouchDB bulk mode.
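For context, CouchDB's bulk mode refers to the `POST /{db}/_bulk_docs` endpoint, which accepts many documents in a single request instead of one PUT per entity. Below is a minimal sketch of batching documents for it. It is not code from this tool: the helper names, the database URL, and the batch size are illustrative assumptions, and it assumes Node.js 18+ for the global `fetch`.

```javascript
// Split an array of documents into batches of at most `size` items
function chunk (docs, size) {
  const batches = []
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size))
  }
  return batches
}

// Build the JSON body expected by CouchDB's POST /{db}/_bulk_docs endpoint
function bulkBody (docs) {
  return JSON.stringify({ docs })
}

// Send documents batch by batch.
// dbUrl and batchSize are assumptions for this sketch,
// e.g. 'http://localhost:5984/wikidata' and 1000
async function bulkImport (dbUrl, docs, batchSize = 1000, fetchFn = fetch) {
  for (const batch of chunk(docs, batchSize)) {
    const res = await fetchFn(`${dbUrl}/_bulk_docs`, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: bulkBody(batch)
    })
    if (!res.ok) throw new Error(`bulk request failed: ${res.status}`)
  }
}
```

Note that `_bulk_docs` reports conflicts per document in the response body rather than failing the whole request, so a real importer would also need to inspect the returned array.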


Dependency

  • NodeJS >= v6. If your distribution doesn't provide a recent version of NodeJS, you might want to uninstall NodeJS and reinstall it using NVM

Installation

git clone https://github.com/maxlath/import-wikidata-dump-to-couchdb
cd import-wikidata-dump-to-couchdb
npm install

Now you can customize ./config/default.js to your needs.

How to

Download dump

Download the latest Wikidata dump

Extract subset

Extract the subset of the dump fitting your needs, as you might not want to throw ~40 GB at your database's face.

For instance, for the needs of the authors-birthday bot, I wanted to keep only Wikidata entities of writers.

As each line of the dump is an entity, you could do something like this with grep:

cat dump.json | grep '36180\,' > isWriter.json

Here the trick is that every entity with occupation -> writer (P106 -> Q36180) will have 36180 somewhere in the line (as a claim numeric-id). And tadaa, you went from a 39 GB dump to a much nicer 384 MB subset. The match is approximate, though: any entity that mentions 36180 in another claim is kept too.

But now, we can do something cleaner using wikidata-filter:

cat dump.json | wikidata-filter --claim P106:Q36180 > isWriter.json
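To illustrate what filtering on the claim itself means, as opposed to matching `36180` anywhere in the line, here is a minimal sketch that checks the parsed entity. `hasClaim` is a hypothetical helper, not part of wikidata-filter, and it assumes the usual Wikidata JSON layout where the target item sits at `claims[property][i].mainsnak.datavalue.value['numeric-id']`.

```javascript
// Return true if the entity has a claim `property` whose value is the item `itemId`,
// e.g. hasClaim(entity, 'P106', 'Q36180') for occupation -> writer
function hasClaim (entity, property, itemId) {
  const numericId = parseInt(itemId.slice(1), 10)
  const statements = (entity.claims && entity.claims[property]) || []
  return statements.some(statement => {
    const datavalue = statement.mainsnak && statement.mainsnak.datavalue
    // Snaks of type "novalue" or "somevalue" have no datavalue
    return Boolean(
      datavalue && datavalue.value && datavalue.value['numeric-id'] === numericId
    )
  })
}
```

Unlike the grep approach, an entity that only references Q36180 under another property would not be kept.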

Import

This new file isn't valid JSON as a whole (it's line-delimited JSON), but every line is, once you remove the comma at the end of the line. So here is the plan: take every line, remove the comma, and PUT it in your database:

./import.js ./isWriter.json
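The per-line handling described above can be sketched as follows. This is an illustration, not the actual content of import.js: `parseDumpLine` and `toCouchDoc` are hypothetical names, and using the entity id as the CouchDB `_id` is an assumption of this sketch.

```javascript
// Turn one line of the dump (or of a subset) into an entity object,
// or return null for lines that do not hold an entity
function parseDumpLine (line) {
  const trimmed = line.trim()
  // The full dump starts with "[" and ends with "]": skip those and empty lines
  if (trimmed === '' || trimmed === '[' || trimmed === ']') return null
  // Remove the trailing comma, if any, so that the line is valid JSON
  const json = trimmed.endsWith(',') ? trimmed.slice(0, -1) : trimmed
  return JSON.parse(json)
}

// Map a Wikidata entity to a CouchDB document,
// using the entity id as document _id (an assumption of this sketch)
function toCouchDoc (entity) {
  return Object.assign({ _id: entity.id }, entity)
}
```

Each resulting document could then be sent with a `PUT /{db}/{_id}` request.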

Specify start and end line numbers:

startline=5
# line 10 will be included
endline=10
./import.js ./isWriter.json $startline $endline
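As a rough illustration of the inclusive range above, a line filter could look like the sketch below. `inRange` is a hypothetical helper, and counting lines from 1 is an assumption of this sketch rather than something confirmed by the tool.

```javascript
// Decide whether a line should be imported, given optional bounds.
// Both bounds are inclusive; line numbers are assumed to start at 1.
// Bounds may be strings, as they would be when read from the command line
function inRange (lineNumber, startLine, endLine) {
  const start = startLine == null ? 1 : Number(startLine)
  const end = endLine == null ? Infinity : Number(endLine)
  return lineNumber >= start && lineNumber <= end
}
```

With `startline=5` and `endline=10`, this keeps lines 5 through 10 and skips everything else.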

Behavior on conflict

In the config file (./config/default.js), you can set the behavior on conflict, that is, when the importer tries to add an entity that was already added to CouchDB:

  • update (default): update the document if there is a change, otherwise pass
  • pass: always pass, leaving the existing document as is
  • exit: exit the process at the first conflict
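The three behaviors above can be sketched as a small decision function. This is a hedged illustration, not the tool's actual code: `onConflict` and `stableStringify` are hypothetical helpers, and comparing documents while ignoring `_rev` is an assumption about what "a change" means.

```javascript
// Serialize a JSON value with sorted object keys, so that key order
// does not affect the comparison
function stableStringify (value) {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`
  if (value && typeof value === 'object') {
    const entries = Object.keys(value).sort()
      .map(key => `${JSON.stringify(key)}:${stableStringify(value[key])}`)
    return `{${entries.join(',')}}`
  }
  return JSON.stringify(value)
}

// Decide what to do when CouchDB already has a document with this _id.
// `mode` is one of 'update', 'pass', 'exit'
function onConflict (mode, existingDoc, newDoc) {
  if (mode === 'exit') return { action: 'exit' }
  if (mode === 'pass') return { action: 'pass' }
  if (mode !== 'update') throw new Error(`unknown conflict mode: ${mode}`)
  // Compare both documents without CouchDB's revision field
  const existingData = Object.assign({}, existingDoc)
  delete existingData._rev
  const newData = Object.assign({}, newDoc)
  delete newData._rev
  if (stableStringify(existingData) === stableStringify(newData)) {
    return { action: 'pass' }
  }
  // CouchDB requires the current revision to accept an update
  const doc = Object.assign({}, newData, { _rev: existingDoc._rev })
  return { action: 'update', doc }
}
```

The caller would then skip the entity, PUT the returned `doc`, or stop the process, depending on the returned `action`.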

License

MIT