Teide

We present Teide, a hybrid approach to improve the precision of linking resources from the Web of Data that relies on link rules (which are generally obtained by means of genetic programming algorithms). Teide receives a base link rule (the main link rule) and applies it to obtain a collection of candidate links. Then a set of supporting link rules is bootstrapped to analyse the neighbours of the resources involved in each candidate link (the neighbours are the resources that can be reached by means of their object properties and linked by a supporting link rule). Teide analyses how similar these sets of neighbours are in order to decide which candidate links must be kept and which must be discarded as false positives.
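The filtering idea can be pictured with a minimal Python sketch. It is not Teide's implementation: the similarity measure (the share of neighbours matched by a supporting link), the 0.5 threshold, and the names neighbour_similarity and filter_links are all assumptions made for illustration.

```python
def neighbour_similarity(source_neighbours, target_neighbours, supporting_links):
    """Share of neighbours, on the smaller side, matched by a supporting link.

    Assumed measure for illustration only; Teide's actual measure may differ.
    """
    if not source_neighbours or not target_neighbours:
        return 0.0
    matched = {(s, t) for (s, t) in supporting_links
               if s in source_neighbours and t in target_neighbours}
    matched_sources = {s for s, _ in matched}
    matched_targets = {t for _, t in matched}
    smaller_side = min(len(source_neighbours), len(target_neighbours))
    return min(len(matched_sources), len(matched_targets)) / smaller_side


def filter_links(candidates, neighbours, supporting_links, threshold=0.5):
    """Keep the candidate links whose neighbourhoods are similar enough."""
    kept = []
    for source, target in candidates:
        score = neighbour_similarity(neighbours.get(source, set()),
                                     neighbours.get(target, set()),
                                     supporting_links)
        if score >= threshold:  # hypothetical cut-off
            kept.append((source, target))
    return kept


# Two candidate links produced by a main link rule (made-up identifiers).
candidates = [("r1:a", "r2:a"), ("r1:a", "r2:b")]
neighbours = {"r1:a": {"r1:addr1"}, "r2:a": {"r2:addr1"}, "r2:b": {"r2:addr9"}}
# Links between neighbours produced by a supporting link rule.
supporting_links = {("r1:addr1", "r2:addr1")}

kept_links = filter_links(candidates, neighbours, supporting_links)
```

In this toy data only the first candidate has neighbours connected by a supporting link, so the second one is discarded as a likely false positive.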

Building Teide

Download this project and open it in your IDE, then export the class 'TeideMain' as a runnable jar named teide.jar.

Running the tests

We have included two scenarios with their respective datasets (as Jena TDB databases) and input files to test Teide. The scenarios are 'Restaurants' and 'Acm-Newcastle'. To run each:

java -jar teide.jar restaurants-file.json
java -jar teide.jar rae-newcastle-file.json

Getting Started

Once you have teide.jar, you only need to pass it an input file with the setup of the algorithm as its argument:

java -jar teide.jar input-file.json

The input file is a json file that must contain the following data:

  • resultsFile: file where the effectiveness achieved by the main rule and our techniques will be stored.
  • goldStandardFile: file that contains the gold standard used to compute the effectiveness.
  • linksFile: not required in these experiments, since Teide injects this value at run time.
  • searchForward: explore the RDF graphs (neighbours of the resources) following the direction of the object properties.
  • searchBackwards: explore the RDF graphs (neighbours of the resources) following the opposite direction of the object properties.
  • maxSerachDepth: maximum depth of an RDF graph to explore.
  • resultsFolder: an existing folder where the outcome of Teide will be stored.
  • outputMainRuleLinks: specifies if links generated by the main link rule must be stored in a file.
  • outputMainRuleLinksFile: specifies where the links generated by the main link rule will be stored.
  • filteredLinksFile: specifies where the links generated by Teide will be stored.
  • sourceDataset/targetDataset: specify where the source and target Jena TDB datasets are.
  • mainLinkRule: specifies a main link rule.
  • supportingRules: specifies an array of link rules to assist the main link rule to improve its precision when linking.
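As a rough illustration of how the search direction and depth options above interact, the following Python sketch runs a depth-bounded breadth-first traversal over a list of (subject, property, object) triples. The triples, the function name neighbours, and the use of native booleans and integers (the input file stores these values as strings) are assumptions; Teide's own exploration of the Jena TDB datasets may work differently.

```python
from collections import deque


def neighbours(triples, start, forward=True, backward=True, max_depth=10):
    """Resources reachable from start in at most max_depth property hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for subject, _prop, obj in triples:
            step = []
            if forward and subject == node:
                step.append(obj)      # follow the property direction
            if backward and obj == node:
                step.append(subject)  # follow the opposite direction
            for nxt in step:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    seen.discard(start)
    return seen


# Made-up graph: a restaurant with an address, a city and a review.
triples = [
    ("rest", "hasAddress", "addr"),
    ("addr", "inCity", "city"),
    ("review", "about", "rest"),
]

forward_one_hop = neighbours(triples, "rest", forward=True, backward=False, max_depth=1)
both_directions = neighbours(triples, "rest")
```

With forward search only and a depth of 1 the traversal reaches just the address; enabling both directions with the default depth also reaches the city and the review.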

For our 'Restaurants' test, the input file 'restaurants-input.json' is as follows:

{
	"links-evaluator" : {
		"resultsFile" : "restaurants/effectiveness.csv",
		"goldStandardFile" : "restaurants/restaurants-gold.nt",
		"linksFile" : "[Dynamically injected by Teide]"
	},
	"pathfinder" : {
		"searchForward" : "true",
		"searchBackwards" : "true",
		"maxSerachDepth" : "10"
	},
	"resultsFolder"  : "restaurants",
	"outputMainRuleLinks"  :"true",
	"outputMainRuleLinksFile" : "restaurants/mainLinks.nt",
	"filteredLinksFile" : "restaurants/filtered.nt",
	"sourceDataset"  : "tdb-data/restaurants1",
	"targetDataset"  : "tdb-data/restaurants2",
	"mainLinkRule"  : {
			"sourceClasses" : ["http://www.okkam.org/ontology_restaurant1.owl#Restaurant"],
			"targetClasses" : ["http://www.okkam.org/ontology_restaurant2.owl#Restaurant"],
			"restriction" : "agg:Mult(str:JaroWinklerTFIDFSimilarity(http://www.okkam.org/ontology_restaurant1.owl#name,http://www.okkam.org/ontology_restaurant2.owl#name,0.8),0.62)"
	},
	"supportingRules" : [
			{
				"sourceClasses" : ["http://www.okkam.org/ontology_restaurant1.owl#Address"],
				"targetClasses" : ["http://www.okkam.org/ontology_restaurant2.owl#Address"],
				"restriction" : "agg:Mult(str:JaroWinklerTFIDFSimilarity(http://www.okkam.org/ontology_restaurant1.owl#street,http://www.okkam.org/ontology_restaurant2.owl#street,0.8),0.62)"
			},
			{
				"sourceClasses" : ["http://www.okkam.org/ontology_restaurant1.owl#City"],
				"targetClasses" : ["http://www.okkam.org/ontology_restaurant2.owl#Address"],
				"restriction" : "agg:Mult(str:JaroWinklerTFIDFSimilarity(http://www.okkam.org/ontology_restaurant1.owl#name,http://www.okkam.org/ontology_restaurant2.owl#city,0.8),0.62)"
			}
		]
}
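Before launching Teide it can be handy to check that an input file has the expected keys. The Python sketch below does this for the key layout of the example above; treating every key as mandatory is an assumption of the sketch, and missing_keys is a hypothetical helper, not part of Teide.

```python
import json

# Key layout copied from the 'Restaurants' example. Whether Teide itself
# treats every key as mandatory is an assumption of this sketch.
SECTION_KEYS = {
    "links-evaluator": ["resultsFile", "goldStandardFile", "linksFile"],
    "pathfinder": ["searchForward", "searchBackwards", "maxSerachDepth"],
}
TOP_LEVEL_KEYS = [
    "resultsFolder", "outputMainRuleLinks", "outputMainRuleLinksFile",
    "filteredLinksFile", "sourceDataset", "targetDataset",
    "mainLinkRule", "supportingRules",
]
RULE_KEYS = ["sourceClasses", "targetClasses", "restriction"]


def missing_keys(config):
    """Return a sorted list of dotted key paths absent from config."""
    missing = []
    for section, keys in SECTION_KEYS.items():
        body = config.get(section, {})
        missing += [f"{section}.{key}" for key in keys if key not in body]
    missing += [key for key in TOP_LEVEL_KEYS if key not in config]
    rules = [("mainLinkRule", config.get("mainLinkRule"))]
    rules += [(f"supportingRules[{i}]", rule)
              for i, rule in enumerate(config.get("supportingRules", []))]
    for name, rule in rules:
        if rule is None:
            continue  # already reported as a missing top-level key
        missing += [f"{name}.{key}" for key in RULE_KEYS if key not in rule]
    return sorted(missing)


# A deliberately incomplete configuration, used only to exercise the check.
incomplete = json.loads("""
{
  "links-evaluator": {"resultsFile": "out.csv", "goldStandardFile": "gold.nt",
                      "linksFile": ""},
  "pathfinder": {"searchForward": "true", "searchBackwards": "true"},
  "resultsFolder": "out",
  "supportingRules": [{"sourceClasses": [], "targetClasses": []}]
}
""")
report = missing_keys(incomplete)
```

Here report lists, among others, pathfinder.maxSerachDepth, mainLinkRule and supportingRules[0].restriction.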

To build link rules Teide relies on the following functions:

Aggregate metrics:

  • agg:Avg
  • agg:Max
  • agg:Min
  • agg:Mult

String metrics:

  • str:CosineSimilarity
  • str:JaccardSimilarity
  • str:JaroSimilarity
  • str:JaroWinklerSimilarity
  • str:JaroWinklerTFIDFSimilarity
  • str:LevenshteinSimilarity
  • str:OverlapSimilarity
  • str:QGramsSimilarity
  • str:SoftTFIDFSimilarity
  • str:SubstringSimilarity
  • str:TrigramsSimilarity

Transformation metrics:

  • trn:LowercaseTransformation
  • trn:RemoveSymbolsTransformation
  • trn:StemTransformation
  • trn:StripUriPrefixTransformation
  • trn:TokenizeTransformation
  • trn:UpercaseTransformation
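The restriction strings in the example input file nest these functions, for instance an agg: function wrapping a str: function. To inspect such a string, the following Python sketch parses it into a tree of (name, arguments) tuples. It only assumes the syntax visible in the example (function names followed by comma-separated arguments in parentheses); parse_rule and function_names are hypothetical helpers, numeric arguments are kept as strings because this README does not state their meaning, and Teide's own parser may behave differently.

```python
def parse_rule(text):
    """Parse a restriction string into nested (name, [args]) tuples.

    Leaves (property URIs and numbers) are returned as plain strings.
    """
    pos = 0

    def parse_expr():
        nonlocal pos
        start = pos
        # Read a token up to the next structural character.
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        token = text[start:pos].strip()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            args = []
            while True:
                args.append(parse_expr())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    return (token, args)
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        return token

    tree = parse_expr()
    if pos != len(text):
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def function_names(tree):
    """List the function names used in a parsed rule, outermost first."""
    if isinstance(tree, str):
        return []
    name, args = tree
    return [name] + [n for arg in args for n in function_names(arg)]


rule = ("agg:Mult(str:JaroWinklerTFIDFSimilarity("
        "http://www.okkam.org/ontology_restaurant1.owl#name,"
        "http://www.okkam.org/ontology_restaurant2.owl#name,0.8),0.62)")
tree = parse_rule(rule)
used = function_names(tree)
```

For the main link rule of the 'Restaurants' example, used contains agg:Mult followed by str:JaroWinklerTFIDFSimilarity, which can then be checked against the lists above.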

Authors

License

This project is licensed under TDG's License - visit http://www.tdg-seville.info/license.html for details

Acknowledgments

  • Special thanks to Dr. Rafael Corchuelo
  • Spanish R&D programme (grants TIN2013-40848-R and TIN2013-40848-R)