Wikidata Subsetting

wdsub

This project is a Wikibase subsetting tool based on Shape Expressions (ShEx).

It processes Wikidata dumps and extracts the subset of entities described by a ShEx schema.

Usage

Command line options:

Usage:
  wdsub extract
  wdsub dump
Wikidata subsetting command line tool
Options and flags:
 --help
     Display this help text.
 --version, -v
     Print the version number and exit.
Subcommands:
  extract
    Show information about an entity.
  dump
    Process dump files

As an example, the following command:

wdsub dump -s examples/humans.shex -o target/outputFile.json examples/100lines.json.gz

processes the dump file examples/100lines.json.gz using the ShEx schema examples/humans.shex and writes the result to target/outputFile.json.
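
To give an idea of what this kind of subsetting involves, here is a minimal Python sketch. It is not how wdsub is implemented: wdsub validates entities against a ShEx schema, whereas this sketch uses a hard-coded "instance of (P31) human (Q5)" check as a stand-in. The function names (iter_entities, is_instance_of, extract_subset) and the one-entity-per-line output are assumptions made for illustration. The input layout is the standard Wikidata JSON dump: a JSON array with one entity per line.

```python
import gzip
import json
from typing import Iterator


def iter_entities(dump_path: str) -> Iterator[dict]:
    """Yield one entity per line of a gzipped Wikidata JSON dump."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            # Each entity line ends with a comma, except the last one.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # skip the brackets that wrap the array
            yield json.loads(line)


def is_instance_of(entity: dict, class_id: str) -> bool:
    """Return True if the entity has a P31 claim whose value is class_id."""
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" and "somevalue" snaks carry no datavalue
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and value.get("id") == class_id:
            return True
    return False


def extract_subset(dump_path: str, out_path: str, class_id: str = "Q5") -> int:
    """Write matching entities to out_path, one per line; return the count."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for entity in iter_entities(dump_path):
            if is_instance_of(entity, class_id):
                out.write(json.dumps(entity, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```

Calling extract_subset("examples/100lines.json.gz", "target/humans-sketch.json") would keep only the entities declared as humans. A ShEx schema can express much richer constraints (required properties, value types, cardinalities), which is the reason wdsub uses one instead of a fixed check like this.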

Installation and compilation

The tool is implemented in Scala and uses sbt for compilation. To create a standalone binary, run:

sbt universal:packageBin

Once it has been run, the binary will be available as a compressed file at:

target/universal/wdsubroot-version.zip

Once that file is uncompressed, the executable script is in the bin folder and is called wdsubroot.

Publish docker image

If you want to create a local Docker image, you can run:

sbt docker:publishLocal

In order to publish the Docker image to the remote registry (this requires the right credentials):

sbt docker:publish

The Docker image is published as wesogroup/wdsub.

In order to process dumps from docker, you can run:

docker run -d -v [folder-with-dumps]:/data -v [folder-with-schemas]:/shex -v [output-folder]:/dumps wesogroup/wdsub:0.0.9 dump -o /dumps/resultDump.json -s /shex/[shexFile].shex /data/[dumpFile].json.gz

Docs

The documentation of the project is generated with mdoc and Docusaurus.

Although the documentation is generated automatically with GitHub Actions, you can also generate it locally using:

> sbt docs/mdoc
> cd website && yarn install && yarn run build

More information

Another tool that creates subsets from Wikidata dumps is WDumper.

Publishing to OSS-Sonatype

This project uses the sbt-ci-release plugin for publishing to OSS Sonatype.

SNAPSHOT Releases

Open a PR and merge it; the CI then releases a -SNAPSHOT version.

Full Library Releases

Push a tag and watch the CI do a regular release:

git tag -a v0.1.0 -m "v0.1.0"
git push origin v0.1.0

Note that the tag version MUST start with v.

Author & contributors