This repository contains a simple pipeline that extracts HTML+RDFa data from the pages of a website and combines it into a single Turtle file. Semantic gaps are filled by reasoning.
As a result, your website's data can be queried with SPARQL at 100% completeness and without worrying about vocabularies.
The article “Piecing the puzzle – Self-publishing queryable research data on the Web” explains in detail what the pipeline does and how it works.
$ ./extract-website-data https://example.org/ /var/www/example.org/
where https://example.org/ is the URL of your homepage and /var/www/example.org/ is the location of its HTML files.
Place the ontologies you want to reason on in the ontologies folder.
Rules for common RDFS and OWL constructs are available at the EYE website.
- Build the Docker image with
  docker build -t websitetordf .
- Run the container with
  docker run -v /path/to/site:/data -v /path/to/results/folder:/result -i --rm websitetordf http://url.of.website
- The RDF triples will be available in
  /path/to/results/folder/triples.nt
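Once the container has finished, a quick sanity check of the output can be useful before loading it into a SPARQL endpoint. The following is a minimal sketch and not part of the pipeline: the function names count_triples and list_predicates are hypothetical, and it assumes the result is valid N-Triples with one triple per line.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting an N-Triples file; not part of this repository.

# Count the triples in the file given as $1.
# Blank lines and comment lines (starting with '#') are not counted.
count_triples() {
  grep -c -v -E '^[[:space:]]*(#.*)?$' "$1"
}

# List the distinct predicates used in the file given as $1.
# Subjects in N-Triples contain no whitespace, so the predicate
# is the second whitespace-separated field of each triple line.
list_predicates() {
  awk '!/^[[:space:]]*(#|$)/ { print $2 }' "$1" | sort -u
}

# Example usage (the path is the placeholder from the Docker steps):
#   count_triples /path/to/results/folder/triples.nt
#   list_predicates /path/to/results/folder/triples.nt
```

A count of zero, or a predicate list that lacks the vocabularies you expect, suggests that the extraction or reasoning step did not see your pages or ontologies.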
©2017 Ruben Verborgh – MIT License.