Map strings to Wikidata QID's using various methods
This is a work-in-progress Python command-line tool to align strings to Wikidata items (QID's).
Clone this repo and use poetry to install dependencies:
poetry install
Then run poetry run wdreconcile
.
Create a text file with strings you want to reconcile, separated by newline. E.g.
museums.txt
Metropolitan Museum of Art
Centraal Museum
Jewish Historical Museum
By default wbsearch.py
uses the wdsearch
reconciler. This gives you back the very first result from the the wbsearchentities
Wikidata API. This is the same as what you get when using the autocomplete field on the website. You need to specify a language in ISO-code form (e.g. en
)
poetry run wdreconcile -i museums.txt -o museums.csv -l en
This will give you back a filed called museums.csv
that looks like this:
query | id | label | description | status |
---|---|---|---|---|
Metropolitan Museum of Art | Q160236 | Metropolitan Museum of Art | major art museum in New York City, United States | ok |
Centraal Museum | Q260913 | Centraal Museum | museum in Utrecht, Netherlands | ok |
Jewish Historical Museum | Q702726 | Jewish Historical Museum | Jewish history, culture, and religion museum in Amsterdam, Netherlands | ok |
Note that the output format
(-o
) can have any extension that dataknead supports, so to use json
, just run the command like this:
poetry run wdreconcile -i museums.txt -o museums.json -l en
If you want more than the first result you can use the -li
(limit) parameter to change the number of results.
poetry run wdreconcile -i museums.txt -o museums-3.csv -l en -li 3
You can also use the Wikidata fulltext search, which will give you the same results as the Special:Search page. Specify wdfullsearch
using the -rt
argument. The wdfullsearch
reconciler is about half as slow as the default wdsearch
reconciler.
poetry run wdreconcile -i museums.txt -o museums.csv -l en -rt wdfullsearch
And you can also use the Wikidata reconciler as used by OpenRefine, using the -rt
(reconciler type) parameter.
poetry run wdreconcile -i museums.txt -o museum-openrefine.csv -rt openrefine -l en
Another use of wdreconcile
is to map back QID's to labels and descriptions using the wdentity
reconciler. This will also check if the item exists and might be handy for batch checking of existing QID's.
poetry run wdreconcile -i museum-qids.csv -o museum-matched.csv -rt wdentity -l en
You can also go from Wikimedia project pages (like articles on Wikipedia) to QIDs using the wmentity
reconciler type. This will also check for existence of the page title. This requires the -s
(site) argument, which is the projectname like enwiki
, commonswiki
or eswikinews
. For a full list of all possible sites see the API docs.
poetry run wdreconcile -i museums.txt -o museum-articles-qids.csv -rt wmentity -s enwiki -l en -v
Here are all reconcilers in a handy table:
id | name | usage |
---|---|---|
openrefine |
OpenRefine reconcilation | This is the OpenRefine Wikidata reconcilation service |
wdentity |
Wikidata Entity | This will map QID's to their respective labels and descriptions and check for existence |
wdsearch |
Wikidata Search | This uses the first result from the wbsearchentities API, which is equivalent to the 'autocomplete' box on the Wikidata site. This is the default. |
wdfullsearch |
Wikidata Fullsearch | This uses the regular Wikidata [search engine]. About half as slow as wdsearch |
wmentity |
Wikimedia Entity | This maps page titles from a Wikimedia project to their respective QIDs. |
If you add the -v
(verbose) flag wdreconcile
will give much more debug information.
usage: wdreconcile [-h] -i INPUT -o OUTPUT
[-rt {openrefine,wdentity,wdsearch,wdfullsearch,wmentity}]
-l LANGUAGE [-li LIMIT] [-s SITE] [-v]
Reconcile a list of strings to Wikidata items
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input file (text, line based)
-o OUTPUT, --output OUTPUT
Output file
-rt {openrefine,wdentity,wdsearch,wdfullsearch,wmentity}, --reconciler_type {openrefine,wdentity,wdsearch,wdfullsearch,wmentity}
Reconciler type
-l LANGUAGE, --language LANGUAGE
ISO code of the language you're using to reconcile
-li LIMIT, --limit LIMIT
How many results to return
-s SITE, --site SITE Site (like enwiki)
-v, --verbose Display debug information
MIT © Hay Kranen