
Knowledge Graph Error Detection


Detection of Relation Assertion Errors in Knowledge Graphs

Implementation of the PaTyBRED error detection method from the paper "Detection of Relation Assertion Errors in Knowledge Graphs" in the proceedings of K-CAP 2017.

How to use

First, the dataset needs to be converted into the NPZ format supported by our system. This can be done with load_kb.py, which takes NT (N-Triples) files as input. If the dataset is in another format, RDF2RDF can be used to convert it into NT.

python load_kb.py dataset.nt
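Before converting, it can help to check that the input really is line-based N-Triples. The following is a minimal sketch of what one NT statement looks like and how it splits into subject, predicate, and object. The function `parse_nt_line` is a hypothetical helper for illustration, not part of load_kb.py, and it deliberately handles only the simple case of IRI subjects and predicates (no blank nodes, no escape sequences).

```python
import re

# One simple N-Triples statement: <subject> <predicate> object .
# Simplified on purpose: blank-node subjects and escaped characters are not handled.
NT_LINE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

def parse_nt_line(line):
    """Return (subject, predicate, object), or None for blank lines,
    comments, and syntax this sketch does not support."""
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return None
    match = NT_LINE.match(stripped)
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    # Strip the angle brackets from IRI objects; literals are returned unchanged.
    if obj.startswith('<') and obj.endswith('>'):
        obj = obj[1:-1]
    return subject, predicate, obj

# Example statement using made-up IRIs.
example = '<http://example.org/Berlin> <http://example.org/capitalOf> <http://example.org/Germany> .'
print(parse_nt_line(example))
```

Lines for which this returns None are worth inspecting by hand before running the conversion.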

Once the dataset has been converted, the triple scores can be computed and all facts in the data ranked with rank_facts.py. The KG model is selected with -m, the output path for the ranked data is set with -o, and the learned model can be saved to the path given by -sp.

python rank_facts.py dataset.npz -m patybred -o ranked_dataset.pkl -sp learned-model.pkl 

An evaluation can be performed by adding noise (wrong triples) to a dataset and then detecting it with the chosen method. Noise can be added with generate_errors.py. The parameter -pe sets the ratio of noise to generate (0.01 means 1% of the original number of triples). The parameter -ek sets the kind of noise, which is generated by corrupting correct triples through replacing the subject or object: 1 substitutes the original entity with a random entity of any type, and 2 substitutes it with a random entity of the same type as the original. An NPZ file containing the original data plus the generated errors is created as output.
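The two kinds of noise can be illustrated with a small sketch. This is not the code of generate_errors.py. The function `corrupt_triples`, its arguments, and the reading of "same type" as "shares at least one type with the original entity" are assumptions made for illustration.

```python
import random

def corrupt_triples(triples, entity_types, ratio, kind, seed=0):
    """Generate about ratio * len(triples) wrong triples by replacing
    the subject or the object of randomly chosen correct triples.

    triples:      list of (subject, predicate, object) tuples
    entity_types: dict mapping each entity to a set of type names (assumed layout)
    kind=1:       replacement is a random entity of any type
    kind=2:       replacement shares at least one type with the original entity
    """
    rng = random.Random(seed)
    known = set(triples)
    entities = sorted(entity_types)
    n_errors = round(ratio * len(triples))
    errors = []
    attempts = 0
    # The attempt cap avoids an endless loop when too few corruptions exist.
    while len(errors) < n_errors and attempts < 100 * max(n_errors, 1):
        attempts += 1
        s, p, o = rng.choice(triples)
        replace_subject = rng.random() < 0.5
        original = s if replace_subject else o
        if kind == 1:
            candidates = entities
        else:
            candidates = [e for e in entities
                          if entity_types[e] & entity_types[original]]
        if not candidates:
            continue
        replacement = rng.choice(candidates)
        corrupted = (replacement, p, o) if replace_subject else (s, p, replacement)
        # Keep only triples that differ from every known (correct or generated) triple.
        if replacement != original and corrupted not in known:
            known.add(corrupted)
            errors.append(corrupted)
    return errors
```

Kind 2 errors are typically harder to detect, because the corrupted triple still respects the type pattern of the relation.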

This file can then be used by detect_errors.py, which will learn a KG model on the noisy dataset, rank the facts, and evaluate how the erroneous facts are ranked. Evaluation results are shown with various performance measures.

python generate_errors.py dataset.npz -pe 0.01 -ek 1
python detect_errors.py dataset-ek1.npz -m patybred -o ranked_dataset.pkl -sp learned-model.pkl 
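The text above does not list which performance measures detect_errors.py reports. As a rough sketch of how such an evaluation can work, the function below computes three common ranking measures (mean rank, average precision, precision at k) for the injected errors. The name `ranking_metrics`, the choice of measures, and the assumption that the ranking lists the most suspicious facts first are all illustrative and not taken from the repository.

```python
def ranking_metrics(ranked_triples, injected_errors, k=10):
    """Evaluate how well injected errors are ranked.

    ranked_triples:  triples ordered from most to least suspicious (assumed order)
    injected_errors: the wrong triples that were added to the dataset
    """
    errors = set(injected_errors)
    # 1-based positions at which the injected errors appear in the ranking.
    ranks = [i for i, t in enumerate(ranked_triples, start=1) if t in errors]
    if not ranks:
        raise ValueError("none of the injected errors appear in the ranking")
    mean_rank = sum(ranks) / len(ranks)
    # Average precision: precision measured at the rank of each error, averaged.
    average_precision = sum((hit + 1) / rank
                            for hit, rank in enumerate(ranks)) / len(ranks)
    # Fraction of the top k positions occupied by injected errors.
    precision_at_k = sum(1 for rank in ranks if rank <= k) / k
    return {
        "mean_rank": mean_rank,
        "average_precision": average_precision,
        "precision_at_k": precision_at_k,
    }
```

A lower mean rank and a higher average precision both indicate that the method places the generated errors near the top of the ranking.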

Generating SHACL Constraints

Implementation of the generation of SHACL-SPARQL relation constraints from the paper "Automatic Detection of Relation Assertion Errors and Induction of Relation Constraints" submitted to the Semantic Web Journal.

To generate the SHACL constraints, a PaTyBRED model must first be learned with decision trees as local classifiers (-m patybred -clf dt). Generating the constraints takes two mandatory parameters: the first is the path to the learned model and the second is the path to the original KG dataset, which contains the relation and type names.

python shacl-sparql.py learned-model.pkl dataset.npz -c 0.99 -ms 10

The parameter -c specifies the minimum confidence and -ms the minimum support. These parameters are used when pruning the learned decision tree. To validate your dataset against the set of learned constraints, you can use the TopBraid implementation of SHACL, which is based on Jena.
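To give an intuition for the two thresholds, the sketch below filters decision-tree leaves by support and confidence. It is not the logic of shacl-sparql.py. The `LeafRule` class is a hypothetical data structure, and the definitions used here (support as the number of training examples reaching a leaf, confidence as the fraction of those that belong to the leaf's majority class) are the usual rule-mining ones, assumed rather than taken from the repository.

```python
from dataclasses import dataclass

@dataclass
class LeafRule:
    """Hypothetical summary of one decision-tree leaf."""
    condition: str   # human-readable path from the root to the leaf
    n_covered: int   # training examples that reach the leaf (support)
    n_majority: int  # of those, how many belong to the leaf's predicted class

    @property
    def confidence(self):
        # Guard against empty leaves to avoid division by zero.
        return self.n_majority / self.n_covered if self.n_covered else 0.0

def keep_rules(rules, min_confidence=0.99, min_support=10):
    """Keep only leaves that are both frequent enough and pure enough."""
    return [r for r in rules
            if r.n_covered >= min_support and r.confidence >= min_confidence]

# Made-up leaves for illustration.
rules = [
    LeafRule("subject is a Person", n_covered=1000, n_majority=995),
    LeafRule("object is a City", n_covered=5, n_majority=5),
    LeafRule("object is a Place", n_covered=200, n_majority=150),
]
for rule in keep_rules(rules, min_confidence=0.99, min_support=10):
    print(rule.condition, rule.n_covered, round(rule.confidence, 3))
```

Under these assumed definitions, raising -c yields fewer but more reliable constraints, while raising -ms discards constraints backed by only a handful of triples.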

Datasets

The datasets used in the paper's automatic evaluation (containing generated erroneous triples) can be downloaded here:

The datasets used in the paper's manual evaluation can be downloaded here: