Implementation of the PaTyBRED error detection method from the paper
"Detection of Relation Assertion Errors in Knowledge Graphs" in the proceedings of K-CAP 2017.
First, the dataset needs to be converted into the NPZ format supported by our system. This can be done with load_kb.py, which takes NT files as input. If the dataset is in another format, RDF2RDF can be used to convert it into NT.
python load_kb.py dataset.nt
Once the dataset has been converted, the triple scores can be computed and all facts in the data ranked with rank_facts.py. The KG model is selected with -m, the path of the ranked output is set with -o, and the learned model can be saved to the path given by -sp.
python rank_facts.py dataset.npz -m patybred -o ranked_dataset.pkl -sp learned-model.pkl
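PaTyBRED scores triples by training one local classifier per relation on path and type features of the subject and object; the classifier's confidence serves as the triple score, so low-scored facts are candidate errors. A rough illustration using type features only (the feature construction and function names here are simplified assumptions, not the tool's actual API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def score_relation_triples(pos_pairs, neg_pairs, type_matrix):
    """Train a local classifier for one relation and score its triples.

    pos_pairs/neg_pairs: lists of (subject, object) entity-index pairs.
    type_matrix: binary entity-by-type matrix (type features per entity).
    """
    def features(pairs):
        # Concatenate subject and object type vectors for each pair.
        return np.array([np.concatenate([type_matrix[s], type_matrix[o]])
                         for s, o in pairs])

    X = np.vstack([features(pos_pairs), features(neg_pairs)])
    y = np.array([1] * len(pos_pairs) + [0] * len(neg_pairs))
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    # Score = predicted probability that the triple is correct.
    return clf.predict_proba(features(pos_pairs))[:, 1]
```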
An evaluation can be performed by adding noise (wrong triples) to a dataset and subsequently detecting it with the chosen method.
To add noise, generate_errors.py can be used. The parameter -pe sets the ratio of noise to be generated (0.01 means 1% of the original number of triples). The parameter -ek selects the kind of noise, produced by corrupting correct triples through replacement of the subject or object (1 replaces the original entity with a random entity of any type, 2 with a random entity of the same type as the original). An NPZ file containing the original data plus the generated errors is created as output. This file can then be used by detect_errors.py, which learns a KG model on the noisy dataset, ranks the facts, and evaluates how the erroneous facts are ranked. Evaluation results are reported with various performance measures.
python generate_errors.py dataset.npz -pe 0.01 -ek 1
python detect_errors.py dataset-ek1.npz -m patybred -o ranked_dataset.pkl -sp learned-model.pkl
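The evaluation loop can be pictured as follows: corrupt a fraction of the triples, rank everything, and measure how strongly the corrupted triples concentrate at the bottom of the ranking. A simplified sketch of these two pieces (the actual scripts expose more options and more measures):

```python
import random

def corrupt_triples(triples, entities, ratio, seed=0):
    """Create wrong triples by replacing the subject or object with a
    random entity of any type (the ek1-style corruption)."""
    rng = random.Random(seed)
    errors = []
    for s, p, o in rng.sample(triples, max(1, int(ratio * len(triples)))):
        if rng.random() < 0.5:
            errors.append((rng.choice(entities), p, o))
        else:
            errors.append((s, p, rng.choice(entities)))
    return errors

def precision_at_k(ranked, is_error, k):
    """Fraction of the k worst-ranked facts that are actual errors.
    `ranked` is ordered from worst (most suspicious) to best."""
    return sum(is_error(t) for t in ranked[:k]) / k
```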
Implementation of the generation of SHACL-SPARQL relation constraints from the paper
"Automatic Detection of Relation Assertion Errors and Induction of Relation Constraints" submitted to the Semantic Web Journal.
To generate the SHACL constraints, the PaTyBRED model must be learned with decision trees as local classifiers (-m patybred -clf dt).
Constraint generation takes two mandatory parameters: the first is the path to the learned model, and the second is the path to the original KG dataset, which contains the relation and type names.
python shacl-sparql.py learned-model.pkl dataset.npz -c 0.99 -ms 10
The parameter -c specifies the minimum confidence and -ms the minimum support; both are used when pruning the learned decision trees.
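The pruning criterion can be read directly off a learned tree: a decision path becomes a constraint only if enough training triples reach its leaf (support) and a large enough fraction of them share the positive label (confidence). A hand-rolled illustration on a simple dict-based tree (an assumed toy representation, not the tool's internal one):

```python
def extract_constraints(node, min_conf, min_sup, path=()):
    """Collect decision paths whose leaves meet the confidence and
    support thresholds.

    node: dict tree with {'feature', 'left', 'right'} for splits,
    or {'pos', 'neg'} example counts for leaves.
    Yields (path, confidence, support) triples, where path is a tuple
    of (feature, branch-taken) pairs.
    """
    if "feature" in node:
        yield from extract_constraints(node["left"], min_conf, min_sup,
                                       path + ((node["feature"], False),))
        yield from extract_constraints(node["right"], min_conf, min_sup,
                                       path + ((node["feature"], True),))
        return
    support = node["pos"] + node["neg"]
    confidence = node["pos"] / support if support else 0.0
    if support >= min_sup and confidence >= min_conf:
        yield path, confidence, support
```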
To validate your dataset against the set of learned constraints, you can use the TopBraid implementation of SHACL, which is based on Jena.
The datasets used in the paper's automatic evaluation (containing generated erroneous triples) can be downloaded here:
- Semantic Bible: ek1, ek2
- ESWC2015: ek1, ek2
- ISWC2013: ek1, ek2
- WWW2012: ek1, ek2
- LREC2008: ek1, ek2
- Nobel Prize: ek1, ek2
- AIFB portal: ek1, ek2
- WN18: ek1
- FB15k: ek1
The datasets used in the paper's manual evaluation can be downloaded here: