This project is a Wikibase subsetting tool based on Shape Expressions (ShEx).
It processes Wikidata dumps and extracts a subset based on a Shape Expression.
Command line options:
Usage:
wdsub extract
wdsub dump
Wikidata subsetting command line tool
Options and flags:
--help
Display this help text.
--version, -v
Print the version number and exit.
Subcommands:
extract
Show information about an entity.
dump
Process dump files
As an example, the following command:
wdsub dump -s examples/humans.shex -o target/outputFile.json examples/100lines.json.gz
processes the dump file examples/100lines.json.gz
using the ShEx schema examples/humans.shex
and generates the file target/outputFile.json
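To give a rough idea of what subsetting a dump involves, here is a minimal Python sketch. It is not how wdsub works internally: wdsub validates entities against a ShEx schema, while this sketch only keeps entities that have an "instance of" (P31) statement pointing at Q5 (human), as a stand-in for a schema that selects humans. The function names (read_entities, is_instance_of, extract_subset) are hypothetical, and the sketch assumes the usual Wikidata JSON dump layout of one entity per line inside a JSON array.

```python
import gzip
import json
from typing import Iterator


def read_entities(path: str) -> Iterator[dict]:
    """Yield one entity per line from a gzipped Wikidata JSON dump."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # skip the array brackets that wrap the dump
            yield json.loads(line)


def is_instance_of(entity: dict, class_id: str) -> bool:
    """Return True if any P31 (instance of) statement points at class_id."""
    for statement in entity.get("claims", {}).get("P31", []):
        datavalue = statement.get("mainsnak", {}).get("datavalue", {})
        value = datavalue.get("value")
        if isinstance(value, dict) and value.get("id") == class_id:
            return True
    return False


def extract_subset(dump_path: str, output_path: str, class_id: str = "Q5") -> int:
    """Copy matching entities to output_path as a JSON array; return the count."""
    kept = 0
    with open(output_path, "w", encoding="utf-8") as out:
        out.write("[\n")
        for entity in read_entities(dump_path):
            if is_instance_of(entity, class_id):
                # Separate entities with a comma, one entity per line.
                out.write((",\n" if kept else "") + json.dumps(entity))
                kept += 1
        out.write("\n]\n")
    return kept
```

Under these assumptions, extract_subset("examples/100lines.json.gz", "target/humans.json") would write only the entities declared as humans. A real ShEx schema can express much richer constraints than this single check.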
The tool is implemented in Scala and uses sbt for compilation. To create a standalone binary, run:
sbt universal:packageBin
Once it has been run, the binary will be available as a compressed file at:
target/universal/wdsubroot-version.zip
After uncompressing that file, the executable script, called wdsubroot, is in the bin folder.
To create a local Docker image, run:
sbt docker:publishLocal
To publish the Docker image (this requires the right credentials), run:
sbt docker:publish
The Docker image is published as wesogroup/wdsub.
To process dumps with Docker, run:
docker run -d -v [folder-with-dumps]:/data -v [folder-with-schemas]:/shex -v [output-folder]:/dumps wesogroup/wdsub:0.0.9 dump -o /dumps/resultDump.json -s /shex/[shexFile].shex /data/[dumpFile].json.gz
The documentation of the project is generated with mdoc and Docusaurus.
Although the documentation is generated automatically by GitHub Actions, you can also generate it locally with:
> sbt docs/mdoc
> cd website && yarn install && yarn run build
Another tool that creates subsets from Wikidata dumps is WDumper.
This project uses the sbt-ci-release plugin for publishing to OSS Sonatype.
- Open a PR and merge it to watch the CI release a -SNAPSHOT version
- Push a tag and watch the CI do a regular release
git tag -a v0.1.0 -m "v0.1.0"
git push origin v0.1.0
Note that the tag version MUST start with v.
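As a small illustration of the tag rule, the following Python sketch checks a tag name before pushing it. The only requirement stated above is the leading v; the MAJOR.MINOR.PATCH shape is an assumption taken from the v0.1.0 example, and is_release_tag is a hypothetical helper, not part of this project or of the release plugin.

```python
import re

# Assumed pattern: "v" followed by MAJOR.MINOR.PATCH, as in v0.1.0.
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def is_release_tag(tag: str) -> bool:
    """Return True if the whole tag matches the assumed release pattern."""
    return RELEASE_TAG.fullmatch(tag) is not None
```

With this check, is_release_tag("v0.1.0") is accepted, while "0.1.0" is rejected because it lacks the leading v.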
- Author: Jose Emilio Labra Gayo