Tool for tweaking dbpedia spotlight's models

DBpedia Spotlight Model Editor

Prelude

The DBpedia Spotlight Model Editor was originally developed by Idio to tweak DBpedia Spotlight models up to version 0.6 - 0.7. Both repos have been archived and are no longer maintained.

As of 2018, DBpedia chose to move Spotlight's codebase to another repository, namely DBpedia Spotlight Model.

This repo is an attempt to resuscitate the model-editor tool and make it work with the new(ish) DBpedia Spotlight Model entity linking system.

Many thanks to Idio (and especially to @dav009)

Installation

In order to use the Model Editor, you will need:

  • Java 1.8
  • Sbt (> 1.0)
  • (Optional) Compiling and installing the DBpedia Spotlight Model tool (if you want to test against a development version of spotlight model)
  • A pre-computed language model (downloaded from here)

Install Java, Maven (mvn) and sbt on your system. If you are editing the latest models (i.e. version 1.1), you are all set.

Otherwise, you should:

  • git clone https://github.com/dbpedia-spotlight/dbpedia-spotlight-model
  • cd dbpedia-spotlight-model && mvn install (this will build a development version of spotlight model into your local maven repository)
  • clone this repo, cd into it, and change the spotlight model reference in its build.sbt file to the version you just installed
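
As a rough sketch of that last step, the build.sbt change might look like the fragment below. `Resolver.mavenLocal` is sbt's built-in resolver for the local Maven repository (`~/.m2/repository`), which is where `mvn install` puts its artifacts. The group id, artifact id and version are placeholders, not values taken from this repo: copy the real ones from the pom.xml of your dbpedia-spotlight-model checkout.

```scala
// build.sbt (fragment): a sketch, not this repo's actual build definition.

// Let sbt resolve artifacts that `mvn install` placed in ~/.m2/repository.
resolvers += Resolver.mavenLocal

// Placeholders only: replace them with the coordinates found in the pom.xml
// of the dbpedia-spotlight-model checkout you just installed.
libraryDependencies += "<groupId>" % "<artifactId>" % "<version>"
```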

In either case, run sbt test. If everything is green, you are ready to go.

Usage

This is a command line tool for editing a model, and there are essentially two ways to run it: from an assembled jar (Java) or through sbt (SBT).

Since models are usually big, your machine needs plenty of RAM (sometimes more than 16GB).

Java

You can compile a jar via sbt assembly, which will produce target/scala-2.10/dbpedia-model-editor.jar. You can use this as a cli in the following way:

java -Xmx15g -jar target/scala-2.10/dbpedia-model-editor.jar <command> <subcommand> <args> ...

SBT

A script that calls the tool via sbt runMain is provided in the model-editor.sh file. You can use it like:

./model-editor.sh <command> <subcommand> <args> ...

Note that you might need to tweak the -Xmx value set in .sbtopts depending on your machine or use case.

Commands and Subcommands

Commands and subcommands let you perform manual edits on a spotlight model, including:

  • Add new Surface Forms
  • Add new entity uris
  • Create associations between surface forms and dbpedia uris
  • Remove associations between surface forms and dbpedia uris
  • Make surface forms spottable or invisible
  • Modify the context vectors

Models are published on DBpedia's Databus here. Before running these operations, download the model for the language you are going to modify.

Once you have it all compiled and ready to go, you can test things by running:

./model-editor.sh explore <path-to-model-folder>/<lang>/model/ 20

This should print the stats for 20 surface forms from the model you just downloaded.

Start by freeing as much RAM as possible. Each of the sections below documents one command or subcommand, which you run by passing it with its arguments to the jar or the script.

Exploring a Model

  • command: explore
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: number of surface forms
  • result: outputs arg2 number of SurfaceForms with their respective candidates, priors and statistics

example:

./model-editor.sh explore <path-to-model>/<lang>/model/ 40

Entities

All topic related actions are carried out using the topic command followed by one of the following subcommands:

  • search : checking if a topic is in the stores
  • check-context : printing the context of a topic
  • clean-set-context : cleaning and setting the context of a topic

Searching an Entity
  • command: topic
  • subcommand: search
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: dbpediaURI
  • result: looks up the given DBpedia id and reports whether that topic exists in the model

example:

./model-editor.sh topic search <path-to-model> Michael_Schumacher

Check the Context words and counts of an entity
  • command: topic
  • subcommand: check-context
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: pipe-separated list of dbpediaUris

example:

./model-editor.sh topic check-context en/model Barack_Obama\|United_States

Set the Context Words of an Entity
  • command: topic
  • subcommand: clean-set-context
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: pathToFile
  • result: The context words and counts for the topics in the file will be cleared. The specified context Words will be stemmed and added with their respective counts to the context vector of the given topics.

Each line of the input file should look like:

dbpediaUri <tab> contextWordsSeparatedByPipe <tab> countsSeparatedByPipe

contextWordsSeparatedByPipe and countsSeparatedByPipe must contain the same number of items, so that every context word has exactly one count.

example:

./model-editor.sh topic clean-set-context en/model folder/fileWithContextChanges 
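
Lines in this format are easy to get subtly wrong by hand, for instance by giving three words and two counts. The sketch below is one possible way to generate them from Scala. `ContextFileLine` is a made-up helper, not part of the model editor, and the check that counts are positive is an assumption of this sketch rather than a documented rule of the tool.

```scala
// Hypothetical helper, not part of the model editor.
object ContextFileLine {
  // Formats one input line for `topic clean-set-context`:
  // dbpediaUri <tab> words joined by '|' <tab> counts joined by '|'
  def format(dbpediaUri: String, wordCounts: Seq[(String, Int)]): String = {
    require(wordCounts.nonEmpty, "at least one context word is needed")
    // Starting from (word, count) pairs guarantees both lists have the same size.
    val (words, counts) = wordCounts.unzip
    // The two separators must not appear inside a field.
    require((dbpediaUri +: words).forall(f => !f.contains("\t") && !f.contains("|")),
      "fields must not contain a tab or '|'")
    // Assumption of this sketch: counts are positive integers.
    require(counts.forall(_ > 0), "counts must be positive")
    Seq(dbpediaUri, words.mkString("|"), counts.mkString("|")).mkString("\t")
  }
}
```

For example, `ContextFileLine.format("Barack_Obama", Seq("president" -> 10, "senator" -> 5))` yields a single line with three tab-separated fields, ready to be written to the input file.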

Surface Forms

All surface forms related actions are carried out using the surfaceform command followed by one of the following subcommands:

  • stats : printing stats of a surface form
  • candidates : printing the list of candidates of a surface form
  • make-spottable : making surfaceforms spottable
  • make-unspottable : making surfaceforms unspottable
  • copy-candidates : adding all candidates of surfaceFormB to surfaceFormA

Stats of a surface form
  • command: surfaceform
  • subcommand: stats
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: surfaceForm
  • result: outputs statistics of the given surfaceForm

example :

./model-editor.sh surfaceform stats <path-to-model>/<lang>/model/ evrimleri

outputs statistics for the surface form evrimleri

Getting the candidate entities of a surface form
  • command: surfaceform
  • subcommand: candidates
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: surfaceForm
  • result: outputs the candidate topics of a surface form

example :

./model-editor.sh surfaceform candidates <path-to-model>/<lang>/model/ evrimleri

would check the candidate topics for the surface form evrimleri

Making a list of Surface Forms Unspottable
  • command: surfaceform
  • subcommand: make-unspottable
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2:
    • list of Surface Forms separated by |, e.g. how\|How\|Hello\ World
    • file containing one surfaceForm per line (if option -f is passed)
  • result: none of the given SFs will be spottable anymore

example:

./model-editor.sh surfaceform make-unspottable <path-to-model> surfaceForm1\|surfaceForm2\|
./model-editor.sh surfaceform make-unspottable <path-to-model> pathTo/File/withSF -f

Copy Candidates
  • command: surfaceform
  • subcommand: copy-candidates
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: path to a file containing pairs of surface forms. Each line should be:

     ```
      <originSurfaceForm> <tab> <destinySurfaceForm>
     ```

  • result: copies the candidate topics of each originSurfaceForm to destinySurfaceForm, as candidate topics

example:

./model-editor.sh surfaceform copy-candidates <path-to-model> pathToFile

Making a list of Surface Forms Spottable
  • command: surfaceform
  • subcommand: make-spottable
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2:
    • list of Surface Forms separated by |, e.g. how\|How\|Hello\ World
    • file containing one surfaceForm per line (if option -f is passed)
  • result: each of the given SFs will be spottable

example:

./model-editor.sh surfaceform make-spottable <path-to-model> surfaceForm1\|surfaceForm2\|
./model-editor.sh surfaceform make-spottable <path-to-model> pathTo/File/withSF -f
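
For long lists, the -f variant avoids shell escaping altogether. Below is a minimal sketch of how such a one-surface-form-per-line file could be produced from Scala. `SurfaceFormFile` is a hypothetical helper, and trimming whitespace and dropping duplicates are choices of this sketch, not something the tool is known to require.

```scala
import java.io.PrintWriter

// Hypothetical helper, not part of the model editor.
object SurfaceFormFile {
  // Trims each entry, drops empty entries and duplicates, keeps first-seen order.
  def clean(surfaceForms: Seq[String]): Seq[String] =
    surfaceForms.map(_.trim).filter(_.nonEmpty).distinct

  // Writes the cleaned surface forms to `path`, one per line, as UTF-8.
  def write(path: String, surfaceForms: Seq[String]): Unit = {
    val out = new PrintWriter(path, "UTF-8")
    try clean(surfaceForms).foreach(sf => out.print(sf + "\n"))
    finally out.close()
  }
}
```

The resulting file can then be passed as arg2 together with -f, as in the second example above.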

Associations

All association related actions are carried out using the association command followed by one of the following subcommands:

  • remove

Deleting Associations between SF and Topics

  • command: association
  • subcommand: remove
  • arg1: pathToSpotlightModel/model
  • arg2: pathToInputFile
  • result: All associations between SFs and Topics in the given input file will be deleted from the model.

Every line in the input file describes one association to delete and should follow the format:

dbpediaURI <tab> Surface Form

example:

./model-editor.sh association remove en/model /path/to/file/file_with_associations
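
Since this command deletes data from the model, it can be worth checking the input file before running it. The sketch below flags lines that do not consist of exactly two non-empty, tab-separated fields. `AssociationFileCheck` is a made-up name, and the strictness of the check is an assumption of this sketch; the tool itself may be more or less lenient.

```scala
// Hypothetical pre-flight check, not part of the model editor.
object AssociationFileCheck {
  // A line is well formed if it has exactly two non-empty fields:
  // dbpediaURI <tab> Surface Form
  def isWellFormed(line: String): Boolean = {
    // limit -1 keeps trailing empty fields, so "uri\t" still counts as two fields
    val fields = line.split("\t", -1)
    fields.length == 2 && fields.forall(_.trim.nonEmpty)
  }

  // Returns the 1-based numbers of the lines that are not well formed.
  def malformedLines(lines: Seq[String]): Seq[Int] =
    lines.zipWithIndex.collect {
      case (line, idx) if !isWellFormed(line) => idx + 1
    }
}
```

The lines of a file can be read with `scala.io.Source.fromFile(path, "UTF-8").getLines().toList` and passed to `malformedLines`; an empty result means every line has the expected shape.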

FSA

Checking if a SF is spottable via FSA
  • command: fsa
  • subcommand: find
  • arg1: path to dbpedia spotlight model (e.g. en/model)
  • arg2: pipe-separated list of surface forms
  • result: the FSA spots for each surface form

example:

./model-editor.sh fsa find en/model Nintendo\ Wii\|barack
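
The backslashes in the example are only there for the shell: the script must receive the whole list, spaces and pipes included, as a single argument. When the script is launched from a Scala program, no escaping is needed, because `scala.sys.process.Process` passes each element of a `Seq` to the process as-is. A minimal sketch, assuming model-editor.sh sits in the working directory; `FsaFindCommand` is a made-up name:

```scala
import scala.sys.process.Process

// Hypothetical helper, not part of the model editor.
object FsaFindCommand {
  // Builds the argument vector for `fsa find`.
  // The surface forms are joined with '|' into ONE argument. No shell is
  // involved, so neither the pipe nor the spaces need a backslash.
  def args(modelPath: String, surfaceForms: Seq[String]): Seq[String] = {
    require(surfaceForms.nonEmpty, "at least one surface form is needed")
    require(surfaceForms.forall(sf => !sf.contains("|")), "a surface form must not contain '|'")
    Seq("./model-editor.sh", "fsa", "find", modelPath, surfaceForms.mkString("|"))
  }

  // Runs the script and returns its exit code (requires ./model-editor.sh to exist).
  def run(modelPath: String, surfaceForms: Seq[String]): Int =
    Process(args(modelPath, surfaceForms)).!
}
```

`FsaFindCommand.args("en/model", Seq("Nintendo Wii", "barack"))` builds the same invocation as the shell example above.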

Updating Model From File

When updating the model with lots of SFs, topics and context words, it is best to do it from a file. Each line of the file should follow the format:

dbpedia_id <tab> surfaceForm1|surfaceForm2... <tab> contextW1|contextW2... <tab> contextW1Counts|ContextW2Counts
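
To make the four columns concrete, here is a small sketch that parses one such line into a value and rejects lines whose context words and counts do not line up. `ModelChange` and `ModelChangeParser` are illustrative names only. The sketch also assumes that all four fields are present and non-empty, which the tool itself may not require.

```scala
import scala.util.Try

// Hypothetical representation of one line of the update file.
case class ModelChange(dbpediaId: String, surfaceForms: List[String], context: List[(String, Int)])

// Hypothetical parser, not part of the model editor.
object ModelChangeParser {
  def parse(line: String): Either[String, ModelChange] = {
    // limit -1 keeps trailing empty fields so a missing column is detected
    val fields = line.split("\t", -1)
    if (fields.length != 4) Left("expected 4 tab-separated fields, found " + fields.length)
    else {
      val surfaceForms = fields(1).split('|').toList
      val words = fields(2).split('|').toList
      // Each count must be an integer; None marks one that is not.
      val counts = fields(3).split('|').toList.map(c => Try(c.toInt).toOption)
      if (words.length != counts.length) Left("context words and counts differ in length")
      else if (counts.exists(_.isEmpty)) Left("every count must be an integer")
      else Right(ModelChange(fields(0), surfaceForms, words.zip(counts.flatten)))
    }
  }
}
```

A result of `Left(...)` carries a short description of what is wrong with the line; `Right(...)` carries the parsed change.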

Insight

Before making actual changes to the model, it can be useful to see how many SFs, DBpedia topics and links between the two are missing:

./model-editor.sh file-update check path/to/en/model path_to_file/with/model/changes

Updating a model From File (All in One Go)

Make sure you have enough RAM to hold all the model stores at once, which should be around 15g. Then run:

./model-editor.sh file-update all path/to/en/model path_to_file/with/model/changes

Updating a model From File (Two Steps)

If you don't have enough RAM, you can update the SFs and DBpedia topics in one step and the context words in another. This requires less memory.

  1. go to the model folder and rename context.mem to context2.mem; this prevents the jar from loading the context store
  2. update the surface form store, resource store and candidate store by calling: ./model-editor.sh file-update all path/to/en/model path_to_file/with/model/changes
  3. the previous command generates a new file, path_to_file/with/model/changes_just_context. This file maps dbpediaIds (the model's internal indexes) to context words, and it is processed in step 5.
  4. rename context2.mem back to context.mem, and rename every other file in the model folder to something else (if this is not done, the stores will be loaded and they will consume all your RAM)
  5. update the context store by calling:
./model-editor.sh file-update context-only path/to/en/model path_to_file/with/model/changes_just_context
  6. rename all files back to their usual names and enjoy a freshly baked model

Steps 1-4 can be applied on their own, skipping 5 and 6, when:

  • wanting to add SFs
  • wanting to link SFs with an already existing DBpedia topic

Steps 5-6 can be applied on their own, skipping 1-4, when:

  • wanting to add context words to a DBpedia topic

Important:

  • steps 1-4 will only add SFs and DBpedia topics if they don't already exist.
  • steps 1-4 will make all specified SFs spottable.
  • steps 5-6 only ADD context words to the context of a DBpedia topic.
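
The renaming in the two-step update is the part most likely to go wrong by hand, since a store file left under the wrong name is either loaded when it should not be or missing when it is needed. One way to make it reversible is sketched below. `StoreHiding` and the `.hidden` suffix are inventions of this sketch (the steps above use context2.mem instead); it relies only on the behaviour described above, namely that a store file not found under its usual name is not loaded.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of the model editor.
object StoreHiding {
  // Arbitrary suffix chosen for this sketch.
  private val Suffix = ".hidden"

  // Renames `name` inside `modelDir` to `name + ".hidden"` so that it is
  // no longer found under its usual name. Returns the new path.
  def hide(modelDir: Path, name: String): Path =
    Files.move(modelDir.resolve(name), modelDir.resolve(name + Suffix))

  // Undoes `hide`: gives the file its usual name back. Returns the restored path.
  def restore(modelDir: Path, name: String): Path =
    Files.move(modelDir.resolve(name + Suffix), modelDir.resolve(name))
}
```

With this, step 1 becomes `StoreHiding.hide(modelDir, "context.mem")` and the first half of step 4 becomes the matching `restore` call, where `modelDir` is the `java.nio.file.Path` of the model folder.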

Using the scala console

One of the best ways to play with the models and modify them is to use the scala console. From the project root, you can run:

sbt console

Note: we provide a .sbtopts file with sensible defaults, but you might want to tweak it, giving the JVM more or less RAM depending on your circumstances.

Once the console is running, you can use it like ipython: create instances of the project's Scala classes, load the models, check whether DBpedia ids exist, add new DBpedia ids, add new surface forms, and so on.

Example:

import org.idio.dbpedia.spotlight.SpotlightModelReader

// load the model stores from the given folder
val spotlightModel = SpotlightModelReader.getSpotlightModel("<path-to-model>/model")
spotlightModel.showSomeSurfaceForms(10) // shows 10 surface forms
spotlightModel.getStatsForSurfaceForm("Barack Obama") // prints stats for the entities associated with that sf
spotlightModel.searchForDBpediaResource("Caetano_Veloso") // returns a boolean: whether the resource is in the model
spotlightModel.addNew("ikimono_gakari_sf", "ikimono_gakari_dbpedia_uri", 1, Array()) // adds a new entity
spotlightModel.exportModels("/new/path/of/folder/model/") // exports the modified model to the given folder

License

Copyright 2014 Idio

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0