lodcc

A software framework to prepare and perform large-scale graph-based analysis of the graph topology of RDF datasets.

A Software Framework for Graph-based Analysis of RDF Graphs

This framework makes it possible to prepare and perform a graph-based analysis of the graph topology of RDF datasets. One of its main goals is to do this at large scale and with a focus on performance, i.e., with large state-of-the-art RDF graphs (hundreds of millions of edges) and in parallel, handling many datasets at once.

A recent analysis of 280 datasets from the LOD Cloud 2017 has been conducted with this framework. Please find the results on 28 graph measures here, as a browsable version of the study. The results are also available as a citable resource at Zenodo.

| Domain | Datasets analyzed | Max. # of Vertices | Max. # of Edges | Avg. # of Vertices | Avg. # of Edges |
| --- | --- | --- | --- | --- | --- |
| Cross Domain | 15 | 614,448,283 | 2,656,226,986 | 57,827,358 | 218,930,066 |
| Geography | 11 | 47,541,174 | 340,880,391 | 9,763,721 | 61,049,429 |
| Government | 37 | 131,634,287 | 1,489,689,235 | 7,491,531 | 71,263,878 |
| Life Sciences | 32 | 356,837,444 | 722,889,087 | 25,550,646 | 85,262,882 |
| Linguistics | 122 | 120,683,397 | 291,314,466 | 1,260,455 | 3,347,268 |
| Media | 6 | 48,318,259 | 161,749,815 | 9,504,622 | 31,100,859 |
| Publications | 50 | 218,757,266 | 720,668,819 | 9,036,204 | 28,017,502 |
| Social Networking | 3 | 331,647 | 1,600,499 | 237,003 | 1,062,986 |
| User Generated | 4 | 2,961,628 | 4,932,352 | 967,798 | 1,992,069 |

Goodies

RDF data dumps are the preferred input (so far). The framework is capable of dealing with the following:

  • Automatic downloading of the RDF data dumps before preparation.
  • Packed data dumps. Various formats are supported, such as bz2, 7zip, and tar.gz. This is achieved by employing the unix tool dtrx.
  • Archives containing a hierarchy of files and folders are scanned for files with RDF data. Files not associated with RDF data are ignored, e.g., Excel, HTML, or plain-text files.
  • The list of supported RDF media types is currently limited to the most common ones for RDF data, namely N-Triples, RDF/XML, Turtle, N-Quads, and Notation3. Files in these formats are transformed into N-Triples during graph creation, by employing the cli-tool rapper (see the sketch after this list).
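
To illustrate that transformation step, here is a minimal sketch of shelling out to rapper from Python. The helper function and file names are illustrative assumptions, not the framework's actual API:

```python
import subprocess

def to_ntriples(in_path, in_format, out_path):
    """Convert an RDF file to N-Triples via the rapper CLI (raptor2-utils).

    in_format is one of rapper's input syntax names,
    e.g. 'rdfxml', 'turtle', 'ntriples', or 'nquads'.
    """
    with open(out_path, "wb") as out_file:
        subprocess.run(
            ["rapper", "-i", in_format, "-o", "ntriples", in_path],
            stdout=out_file,
            check=True,  # fail loudly if rapper reports an error
        )

# e.g., to_ntriples("dataset.rdf", "rdfxml", "dataset.nt")
```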

Further:

  • The framework is implemented in Python. The list of supported graph measures is extendable (see the sketch after this list).
  • There is a ready-to-go Docker image available, with all third-party libraries pre-installed.
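
Since a graph measure is essentially a function over a graph instance, extending the list could look like the following. This is a hypothetical sketch, assuming graph-tool as the underlying graph library (the --threads-openmp option further below hints at it, since graph-tool parallelizes via OpenMP); the functions are illustrative, not part of the framework's actual code:

```python
from graph_tool import Graph

def avg_out_degree(g: Graph) -> float:
    """Hypothetical measure: mean out-degree, i.e. |E| / |V|."""
    return g.num_edges() / g.num_vertices()

def fill(g: Graph) -> float:
    """Hypothetical measure: density of a directed graph
    without self-loops, |E| / (|V| * (|V| - 1))."""
    n = g.num_vertices()
    return g.num_edges() / (n * (n - 1))
```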

Currently in progress:

  • Query instantiation from graph representation, and
  • Edge- and vertex-based graph sampling.

Documentation

Installation

Installation instructions can be found in INSTALL.

Project Structure

In each of the subpackages you will find a detailed README file. The following table gives you an overview of the most important subpackages.

| Package | Description |
| --- | --- |
| constants | Contains files that hold static values. Some of them are configurable, e.g., datapackage.py and db.py. |
| datapackages | Contains code for (optional) pre-processing of datahub.io-related datapackage.json files. |
| db | Contains code to connect to an (optional) local database. A local database stores detailed information about dataset names, URLs, available RDF media types, etc. This is parsed by the datapackage.parser module. |
| graph | The main package. Contains code for RDF data transformation, edgelist creation for graph building, graph measure computation, etc. |
| query | Contains code for query generation from query templates. |
| util | Utility subpackage with helper modules, used by various other modules. |

Usage

Executable code can be found in each of the corresponding *.tasks.* subpackages, i.e.,

| Tasks Package | Task Description |
| --- | --- |
| datapackage/tasks/* | An optional preliminary step to acquire metadata for datasets from datahub.io. |
| graph/tasks/* | A preliminary preparation process which turns your RDF dataset into an edgelist. |
| graph/tasks/analysis/* | Graph-based measure computation on your graph instances. |

Please find more detailed instructions in the README files of the corresponding packages.

Example commands

The software is supposed to be run from the command line on a Unix-based system.

1. Prepare RDF datasets for graph analysis
$ python3 -m graph.tasks.prepare --from-db core education-data-gov-uk webisalod --threads 3

This command will (1) download (if not present), (2) transform (if necessary), and (3) prepare an RDF dataset as an edgelist, ready to be instantiated as a graph object. A sketch of the edgelist idea follows after the parameter list below.

  • --from-db loads dataset URLs and available formats from an SQLite database configured in db.sqlite.properties.
  • --threads indicates the number of datasets that are handled in parallel.
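
To sketch the idea behind step (3): conceptually, every triple (s, p, o) becomes a directed edge from subject to object, with the predicate kept as an edge label. The hashing scheme and output layout below are illustrative assumptions, not the framework's exact edgelist format:

```python
import hashlib

def vertex_id(term):
    """Map an RDF term to a short, fixed-length vertex id."""
    return hashlib.sha1(term.encode("utf-8")).hexdigest()[:16]

def ntriples_to_edgelist(nt_path, out_path):
    """Write one 'subject object predicate' line per triple."""
    with open(nt_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # naive split; a real N-Triples parser handles literals robustly
            parts = line.rstrip(" .").split(" ", 2)
            if len(parts) != 3:
                continue  # skip malformed lines
            s, p, o = parts
            dst.write("%s %s %s\n" % (vertex_id(s), vertex_id(o), p))
```
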
2. Run an analysis on the prepared RDF datasets in parallel
$ python3 -m graph.tasks.analysis.core_measures --from-file core education-data-gov-uk webisalod --threads 2 --threads-openmp 8 --features diameter --print-stats

This command instantiates the graph objects by loading the edgelists, or the binary graph objects if available. After that, the graph measure diameter is computed on the graphs (see the sketch after the parameter list).

  • --from-file is used here, so measure values will be printed to STDOUT.
  • --threads indicates the number of datasets that are handled in parallel.
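
For a rough idea of what this step does internally, here is a minimal sketch, again assuming graph-tool as the graph library and a hypothetical edgelist file name; the framework's own loading and measure code may differ:

```python
from graph_tool import load_graph_from_csv
from graph_tool.topology import pseudo_diameter

# Load a space-separated edgelist; string vertex ids are hashed to integers.
g = load_graph_from_csv(
    "webisalod.edgelist",  # hypothetical file name
    directed=True,
    hashed=True,
    csv_options={"delimiter": " "},
)

# pseudo_diameter estimates the diameter with a double-sweep BFS,
# a common approach for graphs with hundreds of millions of edges.
dist, ends = pseudo_diameter(g)
print("pseudo-diameter:", dist)
```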

License

This package is licensed under the MIT License.

How to Cite

Please refer to the DOI for citation. To cite all versions of this project, use the canonical DOI 10.5281/zenodo.2109469, which represents all versions and will always resolve to the latest one.