# jq module to process Wikidata JSON format
This git repository contains a module for the jq data transformation language to process entity data from Wikidata or other Wikibase instances, serialized in their JSON format.

Several methods exist to get entity data from Wikidata. This module is designed to process entities in their JSON serialization, especially large numbers of entities. For small sets of entities, also consider using a dedicated client such as wikidata-cli instead.
## Installation

Installation requires jq version 1.5 or newer.

Put `wikidata.jq` in a place where jq can find it as a module. One way to do so is to check out this repository to the directory `~/.jq/wikidata/`:

```
mkdir -p ~/.jq && git clone https://github.com/nichtich/jq-wikidata.git ~/.jq/wikidata
```
## Usage

The shortest method to use functions of this jq module is to directly include the module. Try processing a single Wikidata entity (see below for details about per-item access):

```
wget http://www.wikidata.org/wiki/Special:EntityData/Q42.json
jq 'include "wikidata"; .entities[].labels|reduceLabels' Q42.json
```
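In the Wikidata JSON format each label is an object with `language` and `value` fields. Assuming `reduceLabels` collapses these objects into a plain language-to-string map, the transformation can be sketched in Python (the function name and exact output shape here are an assumption for illustration, not taken from the module):

```python
# Sketch of the label reduction, assuming reduceLabels maps each
# label object {"language": ..., "value": ...} to its plain value.
def reduce_labels(labels: dict) -> dict:
    """Collapse Wikidata label objects into a language -> string mapping."""
    return {lang: obj["value"] for lang, obj in labels.items()}

labels = {
    "en": {"language": "en", "value": "Douglas Adams"},
    "de": {"language": "de", "value": "Douglas Adams"},
}
print(reduce_labels(labels))
```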
It is recommended to put Wikidata entities in a newline-delimited JSON file:

```
jq -c .entities[] Q42.json > entities.ndjson
jq -c 'include "wikidata"; .labels|reduceLabels' entities.ndjson
```
More complex scripts are better put into a separate `.jq` file:

```
include "wikidata";
.labels|reduceLabels
```

The file can then be processed this way:

```
jq -f script.jq entities.ndjson
```
## Process JSON dumps

Wikidata JSON dumps are made available at <https://dumps.wikimedia.org/wikidatawiki/entities/>. The current dumps exceed 35 GB even in their most compressed form. Each dump contains one large JSON array, so it is best converted into a stream of JSON objects for further processing.

With a fast and stable internet connection it is possible to process the dump on the fly like this:

```
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 \
  | bzcat | jq -nc --stream 'include "wikidata"; ndjson' | jq .id
```
## Per-item access

JSON data for single entities can be obtained via the Entity Data URL. Examples:

- <https://www.wikidata.org/wiki/Special:EntityData/Q42.json>
- <https://www.wikidata.org/wiki/Special:EntityData/L3006.json>
- <https://www.wikidata.org/wiki/Special:EntityData/L3006-F1.json>
The module function `entity_data_url` creates these URLs from Wikidata identifier strings. The resulting data is wrapped in a JSON object; unwrap it with `.entities|.[]`:

```
curl $(echo Q42 | jq -rR 'include "wikidata"; entity_data_url') | jq '.entities|.[]'
```
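The URL construction itself is straightforward; a hypothetical Python counterpart of `entity_data_url` (not part of this module) could look like this:

```python
# Hypothetical Python counterpart of the module's entity_data_url filter:
# build the Special:EntityData URL for a given Wikidata identifier.
def entity_data_url(identifier: str) -> str:
    return f"https://www.wikidata.org/wiki/Special:EntityData/{identifier}.json"

print(entity_data_url("Q42"))
```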
As mentioned above, a dedicated client such as wikidata-cli is better suited for accessing small sets of items:

```
wd d Q42
```
To get sets of items that match given criteria, either use SPARQL or the MediaWiki API modules wbsearchentities and/or wbgetentities.
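As a sketch, a wbgetentities request URL for a set of ids can be assembled like this (parameter names follow the MediaWiki Wikibase API; consult its documentation for the full set of options):

```python
from urllib.parse import urlencode

# Build a wbgetentities API request URL for a set of entity ids.
# Multiple ids are separated by "|" (percent-encoded by urlencode).
def wbgetentities_url(ids):
    params = {"action": "wbgetentities", "ids": "|".join(ids), "format": "json"}
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

print(wbgetentities_url(["Q42", "Q5"]))
```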
## Reduce functions

Use the function `reduceEntity` or more specific functions (`reduceInfo`, `reduceItem`, `reduceProperty`, `reduceLexeme`) to reduce the JSON data structure without loss of essential information. Further select only some specific fields if needed:

```
jq '{id,labels}' entities.ndjson
```

### reduceEntity

Applies `reduceInfo` and one of `reduceItem`, `reduceProperty`, `reduceLexeme`.
### reduceItem

Simplifies labels, descriptions, aliases, claims, and sitelinks of an item.
### reduceProperty

Simplifies labels, descriptions, aliases, and claims of a property.
### reduceLabels

```
.labels|reduceLabels
```

### reduceDescriptions

```
.descriptions|reduceDescriptions
```

### reduceAliases

```
.aliases|reduceAliases
```

### reduceSitelinks

```
.sitelinks|reduceSitelinks
```
### reduceLexeme

Simplifies lemmas, forms, and senses of a lexeme entity.

```
.forms|reduceForms
.senses|reduceSenses
```
### reduceClaims

Removes the unnecessary fields `.id`, `.hash`, `.type`, and `.property` and simplifies values for each claim.

```
.claims|reduceClaims
```
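The field removal can be sketched in Python (a rough illustration only: it drops the listed keys everywhere in the claim structure and omits the value simplification; the sample claim follows the Wikidata JSON format):

```python
# Rough sketch of the field removal described above. Note that it drops
# the listed keys at every nesting level, including "type" inside
# datavalue, and does not simplify the values themselves.
DROP = {"id", "hash", "type", "property"}

def strip_fields(obj):
    """Recursively remove bookkeeping fields from a claim structure."""
    if isinstance(obj, dict):
        return {k: strip_fields(v) for k, v in obj.items() if k not in DROP}
    if isinstance(obj, list):
        return [strip_fields(v) for v in obj]
    return obj

claim = {
    "id": "Q42$some-statement-id",
    "type": "statement",
    "rank": "normal",
    "mainsnak": {
        "snaktype": "value",
        "property": "P1477",
        "hash": "deadbeef",
        "datavalue": {"value": "Douglas Noel Adams", "type": "string"},
    },
}
print(strip_fields(claim))
```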
### reduceClaim

Reduces a single claim value.

```
.claims.P26[]|reduceClaim
```
...
### reduceForms

Only lexemes have forms.

```
.forms|reduceForms
```
### reduceInfo

Removes the additional information fields `pageid`, `ns`, `title`, `lastrevid`, and `modified`.

To remove selected fields see the jq function `del`.
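The effect of `reduceInfo` amounts to simple field removal, which can be sketched in Python (field names as listed above; the sample entity is illustrative):

```python
# Sketch of reduceInfo: drop the administrative fields listed above
# and keep everything else.
INFO_FIELDS = ("pageid", "ns", "title", "lastrevid", "modified")

def reduce_info(entity: dict) -> dict:
    return {k: v for k, v in entity.items() if k not in INFO_FIELDS}

entity = {"id": "Q42", "type": "item", "pageid": 138, "ns": 0,
          "title": "Q42", "lastrevid": 123, "modified": "2020-01-01T00:00:00Z"}
print(reduce_info(entity))
# {'id': 'Q42', 'type': 'item'}
```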
### ndjson

The module function `ndjson` can be used to process a stream with an array of entities into a list of entities:

```
bzcat latest-all.json.bz2 | jq -n --stream 'include "wikidata"; ndjson'
```
An alternative, possibly more performant method is to strip the array brackets and trailing commas with standard text tools:

```
bzcat latest-all.json.bz2 | head -n-1 | tail -n+2 | sed 's/,$//'
```
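The same bracket-and-comma stripping can be expressed as a small Python filter, assuming (like the sed one-liner above) that the dump contains one entity per line:

```python
# Turn a JSON array with one entity per line (as in the Wikidata dumps)
# into newline-delimited JSON by skipping the enclosing brackets and
# stripping trailing commas.
def array_to_ndjson(lines):
    for line in lines:
        line = line.strip()
        if not line or line in ("[", "]"):
            continue
        yield line.rstrip(",")

dump = ["[", '{"id":"Q1"},', '{"id":"Q2"}', "]"]
for entity in array_to_ndjson(dump):
    print(entity)
```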
The source code is hosted at https://github.com/nichtich/jq-wikidata.
Bug reports and feature requests are welcome!
Made available under the MIT License by Jakob Voß.