- This project exploits DBpedia for disambiguating ambiguous entities and linking them to Wikipedia.
- To this end, a DBpedia graph is created and used for disambiguation, adapting the method of Navigli & Lapata (2010) (paper link).
- It is developed as a Maven project and integrated into a DBpedia Spotlight fork (see the DBpedia Spotlight Fork section for more details).
- Execute Maven as follows to create a jar with all dependencies in the `target/` folder: `mvn compile assembly:single`
- Configuration for creating a graph and for disambiguation is done using a properties file.
- The path to the actual properties file needs to be configured in `redirect.properties`.
- A sample properties file, which explains each property, is provided as `graphdb.properties`.
- The properties file is reread at each request if it has changed.
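The reload-on-change behavior can be pictured roughly as follows. This is a minimal sketch under the assumption that the file's modification timestamp is checked on each request; the class name is made up and does not correspond to the project's actual code.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/**
 * Minimal sketch only (not the project's actual implementation): rereading a
 * properties file whenever its modification timestamp has changed.
 */
public class ReloadingPropertiesSketch {
    private final Path propsPath;
    private long lastModified = -1;
    private Properties props = new Properties();

    public ReloadingPropertiesSketch(Path propsPath) {
        this.propsPath = propsPath;
    }

    /** Returns the current properties, reloading the file if it has changed. */
    public synchronized Properties get() throws IOException {
        long modified = Files.getLastModifiedTime(propsPath).toMillis();
        if (modified != lastModified) {
            try (FileInputStream in = new FileInputStream(propsPath.toFile())) {
                Properties fresh = new Properties();
                fresh.load(in);
                props = fresh;
                lastModified = modified;
            }
        }
        return props;
    }
}
```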
- Configure the `graph.*`, `blueprints.*`, and `loading.*` related properties in the configuration file.
- Build a single jar with dependencies.
- Run the `DBpediaGraphLoader` class. Multiple DBpedia dataset files or directories containing DBpedia datasets can be passed as arguments.
Exemplary command for running the DBpediaGraphLoader:

```
java \
  -Xmx20G \
  -Dlog4j.configuration=file:///data/spotlight/git/dbpedia-graph/src/test/resources/log4j.xml \
  -cp target/dbpedia-graph-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  de.unima.dws.dbpediagraph.loader.DBpediaGraphLoader \
  /data/dbpedia/dbpedia3.9/en/article_categories_en.nt \
  /data/dbpedia/dbpedia3.9/en/skos_categories_en.nt \
  /data/dbpedia/dbpedia3.9/en/topical_concepts_en.nt \
  /data/dbpedia/dbpedia3.9/en/instance_types_en.nt \
  /data/dbpedia/dbpedia3.9/en/mappingbased_properties_en.nt \
  /data/dbpedia/dbpedia3.9/en/persondata_en.nt \
  /data/dbpedia/dbpedia3.9/en/redirects_en.nt
```
The following DBpedia datasets are considered useful for disambiguation:
- Category: `article_categories_en.nt`, `skos_categories_en.nt`, `topical_concepts_en.nt`
- Infobox: `instance_types_en.nt` (ontology), `mappingbased_properties_en.nt`, `persondata_en.nt`
- Redirects (DBpedia Spotlight uses them for spotting and candidate generation): `redirects_en.nt`
- The DBpedia Graph project is integrated into a DBpedia Spotlight fork.
- The branch v0.6 contains all the modified code, based on DBpedia Spotlight version 0.6. This branch needs to be updated at some point to incorporate the Spotlight master code changes (see the TODOs).
- Integration with Spotlight is needed because the DBpedia Graph project is only used for disambiguation and does not perform spotting or candidate entity generation.
- The code changes for running the statistical disambiguators are done in the core module of DBpedia Spotlight. The affected classes are described in the core module section.
- For training the feature weights, the linear regression learners have been implemented in the eval module.
- `DBGraphDisambiguator`: Interface between Spotlight and the DBpedia Graph project.
  - Generates candidate entities for the document.
  - Converts back and forth between the Spotlight and DBpedia Graph models.
  - Prunes the candidate entity set using a `CandidateFilter`.
  - Creates a subgraph using an implementation of `SubgraphConstruction`.
  - Disambiguates the bestK entities using an implementation of `GraphDisambiguator`.
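As a rough illustration of the graph-based disambiguation step, the sketch below scores candidate entities by their degree in a toy subgraph, one simple connectivity measure in the spirit of Navigli & Lapata (2010). All entities, edges, and the class name are made up; the project's actual `SubgraphConstruction` and `GraphDisambiguator` implementations are more elaborate.

```java
import java.util.*;

// Illustrative, self-contained sketch (not the project's actual code): scoring
// candidate entities by their degree in a toy disambiguation subgraph.
public class DegreeDisambiguationSketch {

    public static void main(String[] args) {
        // Candidate entities per surface form (toy data).
        Map<String, Set<String>> candidates = Map.of(
                "Java", Set.of("dbpedia:Java_(programming_language)", "dbpedia:Java"),
                "Sun", Set.of("dbpedia:Sun_Microsystems", "dbpedia:Sun"));

        // Undirected subgraph over the candidate entities (toy edge).
        Map<String, Set<String>> subgraph = new HashMap<>();
        addEdge(subgraph, "dbpedia:Java_(programming_language)", "dbpedia:Sun_Microsystems");

        // Score each candidate by its degree in the subgraph and keep the best one.
        for (Map.Entry<String, Set<String>> entry : candidates.entrySet()) {
            String best = entry.getValue().stream()
                    .max(Comparator.comparingInt((String c) -> subgraph.getOrDefault(c, Set.of()).size()))
                    .orElseThrow();
            System.out.println(entry.getKey() + " -> " + best);
        }
    }

    private static void addEdge(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }
}
```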
- `DBMergedDisambiguator`: Federated disambiguator that combines DBpedia Graph and Spotlight disambiguation.
  - Two-feature combination approach: combines the bestK entity lists of Spotlight and DBpedia Graph, treating Spotlight as a black-box system.
  - Four-feature combination approach: combines the scores of all three Spotlight features with the DBpedia Graph scores for the bestK entities into a final score.
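The two-feature combination idea can be sketched as a weighted sum of the per-entity scores from both systems, as in the illustrative snippet below. The class name, weights, and scores are made up and do not reflect the project's actual code or values.

```java
import java.util.*;

// Illustrative sketch only: merging the bestK lists of the Spotlight and the
// DBpedia Graph disambiguators by a weighted sum of their per-entity scores.
public class MergedBestKSketch {

    static List<Map.Entry<String, Double>> merge(Map<String, Double> spotlightBestK,
                                                 Map<String, Double> graphBestK,
                                                 double wSpotlight, double wGraph) {
        Map<String, Double> merged = new HashMap<>();
        Set<String> entities = new HashSet<>(spotlightBestK.keySet());
        entities.addAll(graphBestK.keySet());
        for (String uri : entities) {
            double score = wSpotlight * spotlightBestK.getOrDefault(uri, 0.0)
                         + wGraph * graphBestK.getOrDefault(uri, 0.0);
            merged.put(uri, score);
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(merged.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical bestK scores for one surface form.
        Map<String, Double> spotlight = Map.of("dbpedia:Java_(programming_language)", 0.8, "dbpedia:Java", 0.2);
        Map<String, Double> graph = Map.of("dbpedia:Java_(programming_language)", 0.6, "dbpedia:Java_coffee", 0.4);
        System.out.println(merge(spotlight, graph, 0.5, 0.5));
    }
}
```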
- `SpotlightConfiguration`: Added the GraphBased and Merged disambiguation policies to the `DisambiguationPolicy` enum in line 61.
- `SpotlightModel`: Added the GraphBased and Merged disambiguators to the statistical system and mapped them to the respective disambiguator classes.
- `FeatureNormalizer`: Different approaches for normalizing the log-scaled features. Relevant for training the feature weights and for the `NormalizedLinearRegressionFeatureMixture`.
- `NormalizedLinearRegressionFeatureMixture`: Uses a list of weighted features and a feature normalizer to combine the normalized features in a weighted linear combination.
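Conceptually, the mixture computes `score = w_1 * norm(f_1) + ... + w_n * norm(f_n)` over the log-scaled features. The sketch below uses min-max scaling as one plausible normalizer; the actual normalizers, features, and weights are defined in the fork and may differ.

```java
// Illustrative sketch only: normalizing log-scaled feature values into [0, 1]
// and combining them in a weighted linear combination. All values are made up.
public class NormalizedMixtureSketch {

    /** One plausible normalizer: min-max scaling of a log-scaled feature to [0, 1]. */
    static double normalize(double logValue, double minLog, double maxLog) {
        return (logValue - minLog) / (maxLog - minLog);
    }

    static double mixtureScore(double[] logFeatures, double[] minLog, double[] maxLog, double[] weights) {
        double score = 0.0;
        for (int i = 0; i < logFeatures.length; i++)
            score += weights[i] * normalize(logFeatures[i], minLog[i], maxLog[i]);
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical log-scaled feature values, normalization bounds, and weights.
        double[] logFeatures = {-3.2, -7.5, -1.1};
        double[] minLog = {-20, -20, -20};
        double[] maxLog = {0, 0, 0};
        double[] weights = {0.5, 0.3, 0.2};
        System.out.println(mixtureScore(logFeatures, minLog, maxLog, weights));
    }
}
```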
- `CorpusLearner`: Generates training data that can be used for linear regression. To this end, an evaluation corpus and a disambiguator are specified; the corpus learner then generates the training data using the specified `TrainingDataHandler`s.
- `LinRegTrainingDataHandler`: Uses Breeze linear regression to learn the feature weights based on the training data generated from the corpus and the disambiguator.
- `DumpVowpalTrainingDataHandler`: Generates a text file that can be used by Vowpal Wabbit for learning the feature weights. Further documentation is provided in the code.
- `DumpTsvTrainingDataHandler`: Generates a TSV text file for feature-weight learning.
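As an illustration of what such a dump might look like, the sketch below writes one TSV row per candidate entity with its feature values and a 0/1 gold label. The column layout (three Spotlight features, a graph score, a label) and the class name are assumptions for illustration only; the actual handler defines its own format.

```java
import java.io.PrintWriter;
import java.util.List;

// Illustrative sketch only: dumping per-candidate training rows as TSV for
// feature-weight learning. The column layout is an assumption.
public class TsvDumpSketch {

    static void dump(List<double[]> featureRows, List<Boolean> goldLabels, PrintWriter out) {
        out.println("feature1\tfeature2\tfeature3\tgraphScore\tlabel");
        for (int i = 0; i < featureRows.size(); i++) {
            StringBuilder line = new StringBuilder();
            for (double f : featureRows.get(i))
                line.append(f).append('\t');
            line.append(goldLabels.get(i) ? 1 : 0); // 1 if this candidate is the gold entity
            out.println(line);
        }
    }

    public static void main(String[] args) {
        // One hypothetical row: three log-scaled Spotlight features plus a graph score.
        dump(List.of(new double[]{-2.3, -5.1, -0.7, 0.42}),
             List.of(true),
             new PrintWriter(System.out, true));
    }
}
```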
- Install the DBpedia Graph project into the local Maven repository using `mvn install` (make sure you have created a graph and configured the properties file).
- Package the Spotlight fork using `mvn package`. This creates a jar with dependencies in the `dist/target/` folder.
- Download and extract the English Spotlight model `en.tar.gz` for version 0.6 from the Downloads page.
- Run the Spotlight fork jar.
Exemplary command for running the Spotlight fat jar:

```
java -Xmx20G -jar dist/target/dbpedia-spotlight-0.6-jar-with-dependencies.jar /path/to/your/spotlight/model/directory/ http://localhost:2222/rest
```
- For running the Spotlight demo web application with DBpedia Graph, the demo fork can be used.
- The demo fork allows choosing the GraphBased or Merged disambiguator on the demo page.
- To run it, serve the files from an HTTP server such as Apache.
- For a quick demo, the Python built-in HTTP server module can be run from the demo directory using `python -m SimpleHTTPServer`.
- `//TODO`'s in the code
- Sync branch v0.6 with the Spotlight master
- Big TODO: Separate the DBpedia Graph and Spotlight projects so that both communicate via HTTP.
- Add a dedicated redirect-resolving module: redirect resolving is needed because Spotlight generates candidate entities that are redirects. Currently all redirect URIs are integrated into the graph, which bloats the graph unnecessarily.
- Improve the performance of subgraph construction, e.g. by unifying the traversal of identical candidate entities across different surface forms.
- ...