
RaiseWikibase

A tool for speeding up multilingual knowledge graph construction with Wikibase

[Camera-ready PDF preprint "RaiseWikibase: Fast inserts into the BERD instance" for ESWC 2021 P&D]

  • Fast inserts into a Wikibase instance: creates up to a million entities and wikitexts per hour.
  • Creates a mini Wikibase instance with Wikidata properties in a few minutes.
  • Creates the BERD knowledge graph with millions of entities in a few hours.

⚠️   This tool is experimental. There is an ongoing effort to move its functionality into the Wikibase API, see the ticket T287164 "Improve bulk import via API". If you are interested in improving bulk import into Wikibase, please contribute to that ticket.


How to use

Installation

Clone RaiseWikibase and install it via pip3:

git clone https://github.com/UB-Mannheim/RaiseWikibase
cd RaiseWikibase/
pip3 install .

Wikibase Docker

👀   Wikibase Docker is distributed under BSD 3-Clause License. Please fulfill the requirements.

RaiseWikibase is based solely on Wikibase Docker, developed by Wikimedia Germany. Wikibase Docker significantly simplifies the deployment of a Wikibase instance.

⚠️   Copy env.tmpl to .env and replace the default values with your own usernames and passwords.

Install Docker.

Run in the main RaiseWikibase folder:

docker-compose -f docker-compose.yml -f docker-compose.extra.yml up -d --scale wikibase_jobrunner=1

See more details at Wikibase Release Pipeline.

On the first run, it pulls the Wikibase Docker images. It then builds, creates, starts, and attaches to the containers for the services. Check whether everything is running using:

docker ps

If it's running, the output looks like this:

CONTAINER ID        IMAGE                                COMMAND                   CREATED              STATUS              PORTS                       NAMES
0cac985f00a5        wikibase/quickstatements:latest      "/bin/bash /entrypoi…"    About a minute ago   Up About a minute   0.0.0.0:9191->80/tcp        raisewikibase_quickstatements_1
2f277b599ea0        wikibase/wdqs:0.3.40                 "/entrypoint.sh /run…"    About a minute ago   Up About a minute                               raisewikibase_wdqs-updater_1
3d7e6462b290        wikibase/wdqs-frontend:latest        "/entrypoint.sh ngin…"    About a minute ago   Up About a minute   0.0.0.0:8282->80/tcp        raisewikibase_wdqs-frontend_1
ef945d05fc88        wikibase/wikibase:1.35-bundle        "/bin/bash /entrypoi…"    About a minute ago   Up About a minute   0.0.0.0:8181->80/tcp        raisewikibase_wikibase_1
10df54332657        wikibase/wdqs-proxy                  "/bin/sh -c \"/entryp…"   About a minute ago   Up About a minute   0.0.0.0:8989->80/tcp        raisewikibase_wdqs-proxy_1
37f34328b73f        wikibase/wdqs:0.3.40                 "/entrypoint.sh /run…"    About a minute ago   Up About a minute   9999/tcp                    raisewikibase_wdqs_1
9a1c8ddd8c89        wikibase/elasticsearch:6.5.4-extra   "/usr/local/bin/dock…"    About a minute ago   Up About a minute   9200/tcp, 9300/tcp          raisewikibase_elasticsearch_1
b640eaa556e3        mariadb:10.3                         "docker-entrypoint.s…"    About a minute ago   Up About a minute   127.0.0.1:63306->3306/tcp   raisewikibase_mysql_1

The logs can be viewed via:

docker-compose logs -f

Usually, within a minute of start-up you will see messages from wdqs-updater_1 in the logs: INFO o.w.q.r.t.change.RecentChangesPoller - Got no real changes and INFO org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs. The Wikibase front-end (http://localhost:8181) and the query service (http://localhost:8282) are then available, and data filling can be started.

To stop Wikibase Docker, remove all your uploaded data and run a fresh Wikibase instance, use:

docker-compose down
docker volume prune
docker-compose up -d

See also Wikibase/Docker.

Wikibase Extensions

"Extensions let you customize how MediaWiki looks and works" is written in Manual:Extensions. Note that Wikibase is itself an extension to the Mediawiki software.

To add the datatype Mathematical expression (or simply Math) to a Wikibase instance, install the extension Math. An example is the property defining formula.

See also Extending Wikibase.

Wikibase Data Model and RaiseWikibase functions

The Wikibase Data Model is an ontology describing the structure of the data in Wikibase. A non-technical summary of the Wikibase model is available at DataModel/Primer. The initial conceptual specification for the Data Model was created by Markus Krötzsch and Denny Vrandečić, with minor contributions by Daniel Kinzler and Jeroen De Dauw. The Wikibase Data Model has been implemented by Jeroen De Dauw and Thiemo Kreuz as Wikimedia Germany employees for the Wikidata project.

RaiseWikibase provides the functions for the Wikibase Data Model:

from RaiseWikibase.datamodel import label, alias, description, snak, claim, entity

The functions entity, claim, snak, description, alias and label return template dictionaries, so all basic operations on Python dictionaries can be used. You can merge two dictionaries X and Y using X | Y (since Python 3.9), {**X, **Y} (since Python 3.5) or X.update(Y).
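For illustration, all three merge variants behave identically. The snippet below uses plain dictionaries standing in for the output of label; their exact layout is an assumption here, following the Wikibase JSON format for labels:

```python
# Plain dictionaries standing in for the output of label();
# the layout follows the Wikibase JSON format for labels.
x = {'en': {'language': 'en', 'value': 'organization'}}
y = {'de': {'language': 'de', 'value': 'Organisation'}}

merged1 = x | y        # since Python 3.9
merged2 = {**x, **y}   # since Python 3.5
merged3 = dict(x)      # copy x, then update the copy in place
merged3.update(y)

assert merged1 == merged2 == merged3
```

Note that update modifies a dictionary in place and returns None, which is why the third variant updates a copy.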

Let's check the Wikidata entity Q43229 with an English label 'organization'. You can create both English and German labels for the entity in a local Wikibase instance using RaiseWikibase:

labels = {**label('en', 'organization'), **label('de', 'Organisation')}

Multiple English and German aliases can also be easily created:

aliases = alias('en', ['organisation', 'org']) | alias('de', ['Org', 'Orga'])

Multilingual descriptions can be added:

descriptions = description('en', 'social entity (not necessarily commercial)')
descriptions.update(description('de', 'soziale Struktur mit einem gemeinsamen Ziel'))

To add statements (claims), qualifiers and references, we need the snak function. To create a snak, we have to specify the property, datavalue, datatype and snaktype. For example, if a Wikibase instance has a property with ID P1, label 'Wikidata ID' and datatype external-id, we can create a mainsnak with that property and the value 'Q43229':

mainsnak = snak(datatype='external-id', value='Q43229', prop='P1', snaktype='value')

Just as an example of creating the qualifiers and references, let's add:

qualifiers = [snak(datatype='external-id', value='Q43229', prop='P1', snaktype='value')]
references = [snak(datatype='external-id', value='Q43229', prop='P1', snaktype='value')]

We now have a mainsnak, qualifiers and references. Let's create a claim for an item:

claims = claim(prop='P1', mainsnak=mainsnak, qualifiers=qualifiers, references=references)

If you need a claim with multiple values for one property, there are two options. The first is to use the extend method of lists:

claims1 = claim(prop='P1', mainsnak=mainsnak1, qualifiers=qualifiers1, references=references1)
claims2 = claim(prop='P1', mainsnak=mainsnak2, qualifiers=qualifiers2, references=references2)
claims1['P1'].extend(claims2['P1'])

The second option is to use the mainsnak and statement functions:

snak1 = snak(datatype='external-id', value='Q43229', prop='P1', snaktype='value')
snak2 = snak(datatype='external-id', value='Q5', prop='P1', snaktype='value')
mainsnak1 = mainsnak(prop='P1', snak=snak1, qualifiers=[], references=[])
mainsnak2 = mainsnak(prop='P1', snak=snak2, qualifiers=[], references=[])
statements = statement(prop='P1', mainsnaks=[mainsnak1, mainsnak2])

Note that the claim and statement functions return the same template dictionaries, but their input parameters are different. The claim function is useful when your claims have one value per property. Multiple values per property are easier to create using the statement function.
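As a rough sketch of why both routes are equivalent, plain dictionaries can stand in for real statements (the inner statement layout below is simplified, not the exact RaiseWikibase output): both routes end with the property ID mapped to a list of statement dictionaries.

```python
# Simplified stand-ins for full statement dictionaries.
statement1 = {'mainsnak': {'property': 'P1', 'value': 'Q43229'}}
statement2 = {'mainsnak': {'property': 'P1', 'value': 'Q5'}}

# First route: two single-value claims merged via list.extend
claims1 = {'P1': [statement1]}
claims2 = {'P1': [statement2]}
claims1['P1'].extend(claims2['P1'])

# Second route: one dictionary with both values at once
statements = {'P1': [statement1, statement2]}

assert claims1 == statements
```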

All ingredients for creating the JSON representation of an item are ready. The entity function does the job:

item = entity(labels=labels, aliases=aliases, descriptions=descriptions, claims=claims, etype='item')

where claims=claims can be replaced by claims=statements.

If a property is created, the corresponding datatype has to be additionally specified:

property = entity(labels=labels, aliases=aliases, descriptions=descriptions,
		  claims=claims, etype='property', datatype='string')

Note that these functions create only the dictionaries for the corresponding elements in the Wikibase Data Model. Writing into the database is performed using the page and batch functions.

Creating entities and texts

To create one thousand items with the already created JSON representation of an item, use:

from RaiseWikibase.raiser import batch
batch(content_model='wikibase-item', texts=[item for i in range(1000)])

Let wtext be a Python string representing a wikitext. Then wikitexts = [wtext for i in range(1000)] is a list of wikitexts and page_titles = ['wikitext' + str(i) for i in range(1000)] is a list of the corresponding page titles. To create one thousand wikitexts in the main namespace, use:

batch(content_model='wikitext', texts=wikitexts, namespace=0, page_title=page_titles)
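The lists wikitexts and page_titles used above can be built, for example, as follows (wtext is an arbitrary example string here):

```python
wtext = "== Example ==\nSome wikitext content."            # any wikitext string
wikitexts = [wtext for i in range(1000)]                   # 1000 identical wikitexts
page_titles = ['wikitext' + str(i) for i in range(1000)]   # wikitext0 ... wikitext999
```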

The dictionary of namespaces can be found here:

from RaiseWikibase.datamodel import namespaces

The ID for the main namespace namespaces['main'] is 0.

Alternatively, the page function can be used directly. First, a connection object is created. The page function executes the necessary inserts, then the changes are committed and the connection is closed:

from RaiseWikibase.dbconnection import DBConnection
from RaiseWikibase.raiser import page
connection = DBConnection()
page(connection=connection, content_model=content_model,
     namespace=namespace, text=text, page_title=page_title, new=True)
connection.conn.commit()
connection.conn.close()

The argument new specifies whether a page is created (new=True) or an existing page is edited (new=False). The new argument can be used in the batch function as well.

Testing all datatypes

This section has been moved to docs. It describes testing all datatypes in a Wikibase instance and checking what kinds of extensions they require.

Compatibility with WikidataIntegrator and WikibaseIntegrator

WikidataIntegrator and WikibaseIntegrator are wrappers around the Wikibase API. A bot account is needed to start data filling with them. RaiseWikibase can create a bot account for a local Wikibase instance, save the login and password to a configuration file and read them back into a config object:

from RaiseWikibase.raiser import create_bot
from RaiseWikibase.settings import Settings
create_bot()
config = Settings()

The config object can be used in WikibaseIntegrator for creating a login instance:

from wikibaseintegrator import wbi_login
login_instance = wbi_login.Login(user=config.username, pwd=config.password)

and in WikidataIntegrator:

from wikidataintegrator import wdi_login
login_instance = wdi_login.WDLogin(user=config.username, pwd=config.password)

You can also create the JSON representations of entities in WikidataIntegrator or WikibaseIntegrator and then fill them into a Wikibase instance using RaiseWikibase. In WikibaseIntegrator you can create a wbi_core.ItemEngine object and use the get_json_representation function:

from wikibaseintegrator import wbi_core
item = wbi_core.ItemEngine(item_id='Q1003030')
ijson = item.get_json_representation()

In WikidataIntegrator a wdi_core.WDItemEngine object can be created and the get_wd_json_representation function can be used:

from wikidataintegrator import wdi_core
item = wdi_core.WDItemEngine(wd_item_id='Q1003030')
ijson = item.get_wd_json_representation()

The JSON representation of an entity can be uploaded into a Wikibase instance using the batch function in RaiseWikibase:

from RaiseWikibase.raiser import batch
batch('wikibase-item', [ijson])

Getting data from Wikidata and filling it into a Wikibase instance

The Wikidata knowledge graph already has millions of items and thousands of properties. For many projects some of these entities can be reused. Let's create the multilingual items human, organization and location in a local Wikibase instance using RaiseWikibase.

The example below defines the function get_wd_entity. It takes a Wikidata ID as input, sends a request to Wikidata, gets the JSON representation of an entity, removes the keys unwanted in a local Wikibase instance, creates a claim and returns the JSON representation of the entity if no error has occurred. The function get_wd_entity is used to get the JSON representations for human, organization and location. These JSON representations are then filled into a local Wikibase instance using the batch function.

from RaiseWikibase.raiser import batch
from RaiseWikibase.datamodel import claim, snak
import requests

def get_wd_entity(wid=''):
    """Returns JSON representation of a Wikidata entity for the given WID"""
    # Remove the following keys to avoid a problem with a new Wikibase instance
    remove_keys = ['lastrevid', 'pageid', 'modified', 'title', 'ns']
    try:
        r = requests.get('https://www.wikidata.org/entity/' + wid + '.json')
        entity = r.json().get('entities').get(wid)
        for key in remove_keys:
            entity.pop(key)
        entity['claims'] = claim(prop='P1',
                                 mainsnak=snak(datatype='external-id',
                                               value=wid,
                                               prop='P1',
                                               snaktype='value'),
                                 qualifiers=[],
                                 references=[])
    except Exception:
        entity = None
    return entity

wids = ['Q5', 'Q43229', 'Q17334923'] # human, organization, location
items = [get_wd_entity(wid) for wid in wids]
batch('wikibase-item', items)

The lines where entity['claims'] is overwritten can be commented out. The created items then keep their original claims with the Wikidata property IDs. Just try it out.

If you filled the entities from Wikidata into a fresh Wikibase instance, but you cannot open a page at http://localhost:8181/entity/Q1, run in shell:

docker exec raisewikibase_wikibase_1 bash "-c" "php maintenance/update.php --quick --force"

We used the property with ID 'P1' in the claim. That property, with the label 'Wikidata ID', can be created using the script miniWikibase.py. It creates all 9000+ Wikidata properties in two minutes.

Performance analysis

The script performance.py runs two performance experiments for creating the wikitexts and items. Run:

python3 performance.py

The variable batch_lengths is set by default to [100]. This means that the length of a batch in each experiment is 100. Running both experiments in this case takes 80 seconds. You can set it to [100, 200, 300] in order to run multiple experiments with different batch lengths. In our experiments we used batch_lengths = [10000].

The script saves CSV files with the numeric results and creates PDF files with the figures in ./experiments/.

(1a) Wikitexts (1b) Items

The insert rates in pages per second are shown in Figure 1a for wikitexts and in Figure 1b for items. Every data point corresponds to a batch of ten thousand pages. In Figure 1a, six data points correspond to six repeated experiments. In Figure 1b, two colors correspond to two repeated experiments and three marker shapes correspond to three cases: 1) circle: each claim without a qualifier and without a reference, 2) x: each claim with one qualifier and without a reference, and 3) square: each claim with one qualifier and one reference.

To 'reproduce' Figures 1a and 1b, set batch_lengths to [10000]. Note that 'reproducibility' here does not mean that you will get the same values as in Figures 1a and 1b. It means that you can get similar plots with values specific to your hardware and software. Our analysis was performed on a workstation with a 6-core Intel i5-8500T CPU @ 2.10GHz, 16GB RAM and SSD storage, running Debian 10.

Creating a mini Wikibase instance with thousands of entities in a few minutes

The script miniWikibase.py fills a fresh Wikibase instance with some structured and unstructured data in roughly 30 seconds. The data include 8400+ properties from Wikidata, two templates, a page with SPARQL examples, a page with a sidebar, and modules. Check the folder texts containing the unstructured data and add your own data there. Information about the Wikidata properties is queried through the Wikidata endpoint, which takes a few seconds. Run:

python3 miniWikibase.py
(2a) Main page (2b) List of properties

Figure 2a shows the main page and Figure 2b shows a list of properties. If you run the script miniWikibase.py with line 156 commented out, you will see only the property identifiers instead of the labels. Either uncomment line 156, or run docker-compose down and docker-compose up -d in a shell.

Creating a mega Wikibase instance with millions of BERD entities in a few hours

The script megaWikibase.py creates a knowledge graph with millions of BERD (Business, Economic and Related Data) entities from scratch. Before running it, prepare the OpenCorporates dataset: download https://daten.offeneregister.de/openregister.db.gz, unzip it and run in a shell:

sqlite3 -header -csv handelsregister.db "select * from company;" > millions_companies.csv

Put millions_companies.csv into the main RaiseWikibase folder.

Run:

python3 megaWikibase.py

Deployment in production

The setup above runs on localhost.

A setup (and this) for deployment using Nginx is provided by Louis Poncet (personaldata.io).

Paper

@inproceedings{RaiseWikibase2021,
  author    = {Shigapov, Renat and Mechnich, J{\"o}rg and Schumm, Irene},
  title     = {RaiseWikibase: {F}ast inserts into the {BERD} instance},
  booktitle = {The {S}emantic {W}eb: {ESWC} 2021 {S}atellite {E}vents},
  year      = {2021},
  publisher = {Springer International Publishing},
  pages     = {60--64},
  doi       = {10.1007/978-3-030-80418-3\_11},
  url       = {https://doi.org/10.1007/978-3-030-80418-3\_11}
}

[DOI] [preprint] [poster]

Acknowledgments

This work was funded by the Ministry of Science, Research and Arts of Baden-Württemberg through the project Business and Economics Research Data Center Baden-Württemberg (BERD@BW).

We thank Jesper Zedlitz for his experiments explained at the FactGrid blog and for his open source code wikibase-insert.

See also

  • The official Wikibase website
  • Wikidata & Wikibase architecture documentation
  • Strategy for the Wikibase Ecosystem
  • The posts about Wikibase and Wikidata by Adam 'addshore' Shorland
  • A Wikibase tutorial by Dan Scott
  • Wikibase Install Basic Tutorial and Wikibase for Research Infrastructure by Matt Miller
  • Get your own copy of WikiData by Wolfgang Fahl
  • Transferring Wikibase data between wikis by Jeroen De Dauw
  • Putting Data into Wikidata using Software by Steve Baskauf
  • Vanderbilt Heard Library digital scholarship resources on Wikidata and Wikibase
  • Learning Wikibase
  • Wikibase Yearly Summary 2020 and Wikibase Yearly Summary 2021