Adds a reconciliation API endpoint to Datasette, based on the Reconciliation Service API specification.
The reconciliation API is used to match a set of strings to their correct identifiers, to help with disambiguation and consistency in large datasets. For example, the strings "United Kingdom", "United Kingdom of Great Britain and Northern Ireland" and "UK" could all be used to identify the country which has the ISO country code GB. It is notably implemented in OpenRefine.
The plugin adds a /-/reconcile endpoint to a table served by Datasette, which responds according to the Reconciliation Service API specification. To activate this endpoint you need to configure the reconciliation service, as described in the usage section.
Install this plugin in the same environment as Datasette.
$ datasette install datasette-reconcile
The plugin should be configured using Datasette's metadata.json file. The configuration can be put at the root, database or table layer of metadata.json; for most use cases it makes most sense to configure at the table level.
Add a datasette-reconcile object under plugins in metadata.json. This should look something like:
{
  "databases": {
    "sf-trees": {
      "tables": {
        "Street_Tree_List": {
          "plugins": {
            "datasette-reconcile": {
              "id_field": "id",
              "name_field": "name",
              "type_field": "type",
              "type_default": [{
                "id": "tree",
                "name": "Tree"
              }],
              "max_limit": 5,
              "service_name": "Tree reconciliation"
            }
          }
        }
      }
    }
  }
}
The only required item in the configuration is name_field. This refers to the field in the table which will be searched to match the query text.
The rest of the configuration items are optional, and are as follows:
id_field: The field containing the identifier for this entity. If not provided, and there is a primary key set, then the primary key will be used. A primary key of more than one field will give an error.
type_field: If provided, this field will be used to determine the type of the entity. If not provided, then the type_default setting will be used instead.
type_default: If provided, this value will be used as the type of every entity returned. If not provided, the default of Object will be used for every entity.
max_limit: The maximum number of records that a query can request to return. This is 5 by default. An individual query can request fewer results than this, but it cannot request more.
service_name: The name of the reconciliation service that will appear in the service manifest. If not provided it will take the form <database name> <table name> reconciliation.
identifierSpace: Identifier space given in the service manifest. If not provided, a default of http://rdf.freebase.com/ns/type.object.id is used.
schemaSpace: Schema space given in the service manifest. If not provided, a default of http://rdf.freebase.com/ns/type.object.id is used.
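The defaulting rules above can be summarised in a short Python sketch. This is not the plugin's own code: resolve_config is a hypothetical helper, the exact shape of the Object default type is an assumption, and so is the choice to raise an error when the table has no primary key at all (the text above only covers the multi-column case).

```python
DEFAULT_SPACE = "http://rdf.freebase.com/ns/type.object.id"


def resolve_config(config, database, table, primary_keys):
    """Hypothetical helper: apply the documented defaults to a plugin config."""
    if "name_field" not in config:
        raise ValueError("name_field is the only required setting")

    resolved = dict(config)

    if "id_field" not in resolved:
        # A single-column primary key is used as the identifier.
        # Assumption: having no primary key is treated as an error too.
        if len(primary_keys) != 1:
            raise ValueError(
                "set id_field unless the table has a single-column primary key"
            )
        resolved["id_field"] = primary_keys[0]

    # Assumption: the default "Object" type has this id/name shape.
    resolved.setdefault("type_default", [{"id": "object", "name": "Object"}])
    resolved.setdefault("max_limit", 5)
    resolved.setdefault("service_name", f"{database} {table} reconciliation")
    resolved.setdefault("identifierSpace", DEFAULT_SPACE)
    resolved.setdefault("schemaSpace", DEFAULT_SPACE)
    return resolved


example = resolve_config(
    {"name_field": "name"}, "sf-trees", "Street_Tree_List", ["id"]
)
print(example["service_name"])
```

With only name_field supplied, every other setting falls back to the default described in the list.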
Once the plugin is configured for a particular database or table, you can access the reconciliation endpoint at the URL /<db_name>/<table>/-/reconcile.
A simple GET request to /<db_name>/<table>/-/reconcile will return the Service Manifest as JSON, which reconciliation clients can use to determine how the service is set up.
A POST request to the same URL with the queries argument set will trigger the reconciliation process. The queries parameter should be a JSON object in the format described in the specification. An example set of two queries would look like:
{
  "q1": {
    "query": "Hans-Eberhard Urbaniak"
  },
  "q2": {
    "query": "Ernst Schwanhold"
  }
}
The query can optionally be encoded as a queries parameter in a GET request. For example:
/<db_name>/<table>/-/reconcile?queries={"q1":{"query":"Hans-Eberhard Urbaniak"},"q2":{"query": "Ernst Schwanhold"}}
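As a rough illustration of both request styles, the sketch below builds the GET URL and a POST request using only the Python standard library. The base URL http://localhost:8001 and the database and table names mydb and people are placeholders, the helper names are hypothetical, and sending the POST body form-encoded is an assumption based on common reconciliation client behaviour. Nothing is sent over the network here; pass the result to urllib.request.urlopen to actually query a running Datasette instance.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def reconcile_url(base_url, database, table):
    """Build the endpoint URL for one table."""
    return f"{base_url.rstrip('/')}/{database}/{table}/-/reconcile"


def build_get_url(endpoint, queries):
    """Encode the queries object as a URL-escaped `queries` parameter."""
    return endpoint + "?" + urlencode({"queries": json.dumps(queries)})


def build_post_request(endpoint, queries):
    """Prepare a POST request (assumption: form-encoded `queries` body)."""
    body = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


queries = {
    "q1": {"query": "Hans-Eberhard Urbaniak"},
    "q2": {"query": "Ernst Schwanhold"},
}

# Placeholder host, database and table names.
endpoint = reconcile_url("http://localhost:8001", "mydb", "people")
get_url = build_get_url(endpoint, queries)
post_request = build_post_request(endpoint, queries)
print(get_url)
```

Escaping the JSON with urlencode avoids problems with the quotes, braces and spaces that appear in the raw example URL.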
Various options are available in the query object. Currently the only ones implemented in datasette-reconcile are the mandatory query string and the limit option, which must be less than or equal to the value of the max_limit configuration option.
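A client may want to check the limit before sending a request. The following is a minimal sketch of that idea; build_queries is a hypothetical helper, and the max_limit value has to be kept in sync with the service configuration by hand. How the service itself reacts to a limit above max_limit is not covered here.

```python
def build_queries(names, limit=None, max_limit=5):
    """Build a queries object, refusing limits above the configured maximum."""
    if limit is not None and limit > max_limit:
        raise ValueError(f"limit {limit} exceeds max_limit {max_limit}")

    queries = {}
    for index, name in enumerate(names, start=1):
        query = {"query": name}
        if limit is not None:
            query["limit"] = limit
        queries[f"q{index}"] = query
    return queries


queries_with_limit = build_queries(
    ["Hans-Eberhard Urbaniak", "Ernst Schwanhold"], limit=2
)
print(queries_with_limit)
```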
All endpoints that start with /<db_name>/<table>/-/reconcile are configured to send an Access-Control-Allow-Origin: * CORS header to allow access as described in the specification.
JSONP output is not yet supported.
The result of the GET or POST queries requests described above is a JSON object describing potential reconciliation candidates for each of the queries specified. The result will look something like:
{
  "q1": {
    "result": [
      {
        "id": "120333937",
        "name": "Urbaniak, Regina",
        "score": 53.015232,
        "match": false,
        "type": [{
          "id": "person",
          "name": "Person"
        }]
      },
      {
        "id": "1127147390",
        "name": "Urbaniak, Jan",
        "score": 52.357353,
        "match": false,
        "type": [{
          "id": "person",
          "name": "Person"
        }]
      }
    ]
  },
  "q2": {
    "result": [
      {
        "id": "123064325",
        "name": "Schwanhold, Ernst",
        "score": 86.43497,
        "match": true,
        "type": [{
          "id": "person",
          "name": "Person"
        }]
      },
      {
        "id": "116362988X",
        "name": "Schwanhold, Nadine",
        "score": 62.04763,
        "match": false,
        "type": [{
          "id": "person",
          "name": "Person"
        }]
      }
    ]
  }
}
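A client would typically walk this structure and keep the candidate flagged as a match for each query. The sketch below shows one way to do that in Python; pick_matches is a hypothetical helper, and the sample response is a trimmed copy of the result above.

```python
def pick_matches(response):
    """Return {query_id: candidate or None}, keeping the first candidate
    whose "match" flag is true for each query."""
    chosen = {}
    for query_id, payload in response.items():
        candidates = payload.get("result", [])
        matched = [c for c in candidates if c.get("match")]
        chosen[query_id] = matched[0] if matched else None
    return chosen


# Trimmed copy of the example response shown above.
response = {
    "q1": {
        "result": [
            {"id": "120333937", "name": "Urbaniak, Regina",
             "score": 53.015232, "match": False},
            {"id": "1127147390", "name": "Urbaniak, Jan",
             "score": 52.357353, "match": False},
        ]
    },
    "q2": {
        "result": [
            {"id": "123064325", "name": "Schwanhold, Ernst",
             "score": 86.43497, "match": True},
            {"id": "116362988X", "name": "Schwanhold, Nadine",
             "score": 62.04763, "match": False},
        ]
    },
}

matches = pick_matches(response)
print(matches["q1"])        # None: no candidate was flagged as a match
print(matches["q2"]["id"])  # 123064325
```

Queries with no flagged candidate, such as q1 here, are left for manual review rather than being assigned the top-scoring result.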
The reconcile engine works by performing an SQL query against the name_field within the specified database table. Where that table has a full text search index implemented, the search will be performed against that index.
When a full text search index is present on the table, the SQL query takes the following form (based on the search query test; note that double quotes are added to facilitate searching, and are not present in the original query):
select <id_field>, <name_field>
from <table>
inner join (
select "rowid", "rank"
from <fts_table>
where <fts_table> MATCH '"test"'
) as "a" on <table>."rowid" = a."rowid"
order by a.rank
limit 5
If a full text search index is not present, the query looks like this (note that the wildcard % is added to either side of the query; these are not present in the original query):
select <id_field>, <name_field>
from <table>
where <name_field> like '%test%'
limit 5
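The fallback query can be tried out against an in-memory SQLite database. This is only a sketch of the pattern, not the plugin's implementation: like_search, the trees table and its rows are made up for illustration, and the search text is passed as a bound parameter instead of being written into the SQL string.

```python
import sqlite3


def like_search(conn, table, id_field, name_field, text, limit=5):
    """Run the LIKE-based fallback query with % wildcards around the text.

    Table and column names are interpolated into the SQL, so they must come
    from trusted configuration; the search text itself is a bound parameter.
    """
    sql = (
        f'select "{id_field}", "{name_field}" '
        f'from "{table}" '
        f'where "{name_field}" like ? '
        f"limit ?"
    )
    return conn.execute(sql, (f"%{text}%", limit)).fetchall()


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("create table trees (id integer primary key, name text)")
conn.executemany(
    "insert into trees (name) values (?)",
    [("Test oak",), ("Elm",), ("Contest winner",)],
)

rows = like_search(conn, "trees", "id", "name", "test")
print(sorted(rows))
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so the query text test matches both "Test oak" and "Contest winner". Note that the query has no order by clause, so the order of the returned rows is not guaranteed.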
To set up this plugin locally, first check out the code. Then create a new virtual environment:
cd datasette-reconcile
python3 -m venv venv
source venv/bin/activate
Or if you are using pipenv:
pipenv shell
Now install the plugin along with its test dependencies:
pip install -e '.[test]'
You'll need to fetch the git submodules for the tests too:
git submodule init
git submodule update
To run the tests:
pytest