LD Workbench is a command-line tool for transforming large RDF datasets using pure SPARQL.
This project is currently in a Proof-of-Concept phase.
The main design principles are scalability and extensibility.
LD Workbench is scalable due to its iterator/generator approach:
- the iterator component fetches URIs using a SPARQL SELECT query, paginating results using SPARQL `OFFSET` and `LIMIT` (binding each URI to a `$this` variable)
- the generator component then runs a SPARQL CONSTRUCT query for each URI (pre-binding `$this` to the URI), which returns the transformed result.
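The iterator/generator approach can be illustrated with a minimal TypeScript sketch. It is not LD Workbench's actual implementation: `pageQuery`, `bindThis`, `iterate`, and the injected `runSelect` function are hypothetical names, the SELECT query is assumed to carry no `LIMIT` or `OFFSET` of its own, and `runSelect` is a synchronous stand-in for what would be an asynchronous HTTP request in practice.

```typescript
// Hypothetical stand-in for a SPARQL client: takes a SELECT query and
// returns the URIs bound to $this for that page of results.
type SelectFn = (query: string) => string[];

// Append paging clauses to an iterator query.
// Assumption: the query has no LIMIT/OFFSET of its own.
function pageQuery(select: string, limit: number, offset: number): string {
  return `${select.trim()} LIMIT ${limit} OFFSET ${offset}`;
}

// Pre-bind $this by substituting the URI as an IRI reference.
// A replacer function is used so "$" sequences in the URI are taken literally.
function bindThis(construct: string, uri: string): string {
  return construct.replace(/\$this\b/g, () => `<${uri}>`);
}

// Yield URIs one page at a time, so only one page is held in memory.
// Stops when a page comes back with fewer results than requested.
function* iterate(
  select: string,
  runSelect: SelectFn,
  batchSize = 10
): Generator<string> {
  for (let offset = 0; ; offset += batchSize) {
    const uris = runSelect(pageQuery(select, batchSize, offset));
    yield* uris;
    if (uris.length < batchSize) return;
  }
}
```

Each URI yielded by `iterate` would then be passed to `bindThis` together with the CONSTRUCT query, and the resulting query sent to the generator's endpoint. Because results are paged and consumed lazily, memory use does not grow with the size of the dataset.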
LD Workbench is extensible because it uses pure SPARQL queries (instead of code) for configuring transformation pipelines. Each pipeline is a sequence of stages; each stage consists of an iterator and one or more generators.
An LD Workbench pipeline is defined with a YAML configuration file. The configuration is validated against a JSON Schema, which is part of this repository. The YAML and JSON Schema combination is tested to work in the VSCode editor.
A pipeline must have a name and one or more stages, and may optionally have a description. Multiple pipelines can be configured as long as they have unique names. See the example configuration file for a boilerplate configuration. A visualization of the schema, which gives more insight into required and optional properties, can be found here.
name: MyPipeline
description: Example pipeline configuration
destination: output/result.ttl
stages:
  - name: Stage1
    iterator:
      query: "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100"
      endpoint: "http://example.com/sparql-endpoint"
    generator:
      - query: "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"
        batchSize: 50
    destination: output/stage1-result.ttl
  - name: Stage2
    iterator:
      query: file://queries/iteratorQuery.rq
      endpoint: "http://example.com/sparql-endpoint-1"
      batchSize: 200
    generator:
      - query: file://queries/generator1Query.rq
        endpoint: "http://example.com/sparql-endpoint-1"
        batchSize: 200
      - query: file://queries/generator2Query.rq
        endpoint: "http://example.com/sparql-endpoint-2"
        batchSize: 100
    destination: output/stage2-result.ttl
Section | Variable | Description | Required |
---|---|---|---|
General Configuration File | name | The name of your pipeline; it must be unique across all your configurations. | Yes |
General Configuration File | description | An optional description for your pipeline. | No |
General Configuration File | destination | The file where the final result of your pipeline is saved. | No |
Stage | name | The name of your pipeline stage; it must be unique within one configuration. | Yes |
Stage | destination | The file where the results of the stage are saved. If omitted, a temporary file is created automatically. | No |
Iterator | query | Path (prefixed with "file://") to a SPARQL query .rq file, or a SPARQL query string, that defines the iterator using SPARQL SELECT. | Yes |
Iterator | endpoint | The SPARQL endpoint for the iterator. If it starts with "file://", a local RDF file is queried. If omitted, the result of the previous stage is used. | No |
Iterator | batchSize | Override the iterator's default of fetching 10 results per request, regardless of any limits in your query. | No |
Iterator | delay | Human-readable time delay for the iterator's SPARQL endpoint requests (e.g., '5ms', '100 milliseconds', '1s'). | No |
Generator | query | Path (prefixed with "file://") to a SPARQL query .rq file, or a SPARQL query string, that defines the generator using SPARQL CONSTRUCT. | Yes |
Generator | endpoint | The SPARQL endpoint for the generator. If it starts with "file://", a local RDF file is queried. If omitted, the endpoint of the iterator is used. | No |
Generator | batchSize | Override the generator's default of fetching results for 10 bindings of $this per request. | No |
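As a rough illustration of the properties in the table, the following TypeScript sketch models a configuration and checks two of the documented rules (at least one stage, unique stage names). The interface names, `effectiveBatchSize`, and `findProblems` are hypothetical; the project's real types are generated from its JSON Schema, and actual validation is done against that schema rather than by code like this.

```typescript
// Hypothetical shapes mirroring the configuration table.
// These names are illustrative only, not the project's generated types.
interface QueryConfig {
  query: string; // inline SPARQL, or a "file://" path to a .rq file
  endpoint?: string; // SPARQL endpoint URL, or a "file://" path to an RDF file
  batchSize?: number; // the table documents a default of 10
}

interface IteratorConfig extends QueryConfig {
  delay?: string; // human-readable, e.g. "5ms" or "1s"
}

interface StageConfig {
  name: string;
  destination?: string;
  iterator: IteratorConfig;
  generator: QueryConfig[];
}

interface PipelineConfig {
  name: string;
  description?: string;
  destination?: string;
  stages: StageConfig[];
}

const DEFAULT_BATCH_SIZE = 10;

// Return the configured batch size, falling back to the documented default.
function effectiveBatchSize(config: QueryConfig): number {
  return config.batchSize ?? DEFAULT_BATCH_SIZE;
}

// Collect human-readable problems instead of throwing on the first one.
function findProblems(pipeline: PipelineConfig): string[] {
  const problems: string[] = [];
  if (pipeline.stages.length === 0) {
    problems.push("a pipeline needs at least one stage");
  }
  const seen = new Set<string>();
  for (const stage of pipeline.stages) {
    if (seen.has(stage.name)) {
      problems.push(`duplicate stage name: ${stage.name}`);
    }
    seen.add(stage.name);
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a caller report every issue in a configuration at once.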
- Install Node.js 20.10.0 or later by going to https://nodejs.org and following the instructions for your OS.
  Run the following commands to test whether the installation succeeded:
      npm --version
      node --version
- Install LD Workbench:
      npx @netwerk-digitaal-erfgoed/ld-workbench --init
Your workbench is now ready for use.
Once installed, an example workbench is present that can be run with the following command:
npx @netwerk-digitaal-erfgoed/ld-workbench
To keep your workbench workspace clean, create a folder for each pipeline that contains the configuration and the SPARQL SELECT and CONSTRUCT queries. Use the `static` directory for this.
Here is an example of how your file structure may look:
ld-workbench
|-- static
| |-- my-pipeline
| | |-- configuration.yaml
| | |-- select.rq
| | |-- construct.rq
For local development, the following command should get you going:
git clone https://github.com/netwerk-digitaal-erfgoed/ld-workbench.git
cd ld-workbench
npm i
npm run compile
To start the CLI tool you can use this command:
npm run ld-workbench -- --configDir static/example
Since this project is written in TypeScript, your code needs to be transpiled to JavaScript before you can run it (using `npm run compile`). With `npm run dev`, the transpiler watches the TypeScript code for changes and transpiles on each change.
The configuration of this project is validated and defined by JSON Schema. The schema is located in `./static/ld-workbench-schema.json`. To create the types from this schema, run `npm run util:json-schema-to-typescript`. This will regenerate `./src/types/LDWorkbenchConfiguration.d.ts`; do not modify this file by hand.
This figure represents the workflow of the LD Workbench application:
A Pipeline can have multiple Stages, specified in the configuration file. A Stage has one Iterator and can have multiple Generators in its configuration. An Iterator has to be connected to a SPARQL endpoint. When no endpoint is specified for a Generator, it reuses the Iterator's SPARQL endpoint to generate linked data; when a different endpoint is specified in the Generator's configuration, that endpoint is used instead.
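The endpoint fallback rules can be sketched in TypeScript as follows. This is an illustration under assumptions, not the project's code: the `Source` type and both function names are hypothetical, and treating a Generator whose Iterator has no endpoint as reading from the previous stage is an inference from the configuration table rather than documented behaviour.

```typescript
// Hypothetical description of where a query is executed.
type Source =
  | { kind: "sparql"; url: string }
  | { kind: "file"; path: string }
  | { kind: "previousStage" };

const FILE_PREFIX = "file://";

// Iterator: explicit endpoint, local RDF file, or the previous stage's result.
function iteratorSource(endpoint?: string): Source {
  if (endpoint === undefined) return { kind: "previousStage" };
  if (endpoint.startsWith(FILE_PREFIX)) {
    return { kind: "file", path: endpoint.slice(FILE_PREFIX.length) };
  }
  return { kind: "sparql", url: endpoint };
}

// Generator: its own endpoint if configured, otherwise the Iterator's.
// Assumption: if neither is set, the Generator follows the Iterator
// and reads from the previous stage's result.
function generatorSource(
  generatorEndpoint: string | undefined,
  iteratorEndpoint: string | undefined
): Source {
  return iteratorSource(generatorEndpoint ?? iteratorEndpoint);
}
```

For instance, a Generator without an `endpoint` in a Stage whose Iterator uses `http://example.com/sparql-endpoint-1` would resolve to that same URL.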