Primary language: C. License: GNU General Public License v3.0 (GPL-3.0).

EasyESA


Easy Semantic Approximation with Explicit Semantic Analysis


1. Overview

EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code from Çağatay Çallı (https://github.com/faraday/wikiprep-esa). It runs as a JSON webservice which can be queried for the semantic relatedness measure, concept vectors or the context windows.

This manual describes the functionality, setup and usage of the EasyESA package.

2. Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) is a text representation technique that uses Wikipedia as a commonsense knowledge base, relying on the co-occurrence of words in the article text. Each Wikipedia article is treated as a concept, and the words of an article are associated with that concept using TF-IDF scoring. A word can then be represented as a vector of its associations to each concept, so the "semantic relatedness" between any two words can be measured by the cosine similarity of their vectors. A document containing a string of words is represented as the centroid of the vectors representing its words.
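
The two operations named above, cosine similarity and the centroid, can be illustrated with a minimal Python sketch. It is not EasyESA's code: the concept names and TF-IDF weights are invented toy values, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {concept: weight}."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for concept, weight in vec.items():
            total[concept] += weight
    count = len(vectors)
    return {c: w / count for c, w in total.items()} if count else {}


# Toy TF-IDF weights, invented for illustration only (not real ESA scores).
word_vectors = {
    "coffee": {"Espresso": 0.8, "Caffeine": 0.6},
    "tea": {"Caffeine": 0.6, "Camellia sinensis": 0.8},
    "sensor": {"Transducer": 1.0},
}

# Relatedness of two words: cosine of their concept vectors.
relatedness = cosine_similarity(word_vectors["coffee"], word_vectors["tea"])

# A two-word "document" is represented by the centroid of its word vectors.
document = centroid([word_vectors["coffee"], word_vectors["tea"]])
```

With non-negative weights such as TF-IDF scores, the cosine always falls in the [0,1] interval; words that share no concept, like "coffee" and "sensor" in this toy data, score 0.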

For more information on ESA, please refer to the paper by Evgeniy Gabrilovich and Shaul Markovitch: "Wikipedia-based semantic interpretation for natural language processing" (http://www.jair.org/media/2669/live-2669-4346-jair.pdf).

3. EasyESA

EasyESA provides the following functionalities:

  • Semantic relatedness measure: given two terms, returns their semantic relatedness, a real number in the [0,1] interval representing how semantically close the terms are. The more related the terms, the higher the value.
  • Concept vector: given a term, returns its concept vector, a list of Wikipedia article titles (concepts) with the associated score for the term.
  • Query explanation: given two terms, returns the overlap between their concept vectors and the "context windows" for both terms on each overlapping concept. A context window for a given pair (term, concept) is a short segment of the Wikipedia article represented by the concept that contains the term.

EasyESA was developed as an improvement over Wikiprep-ESA. The main differences are:

  • Easy setup package for quick deployment of a local ESA infrastructure.
  • Performance improvements.
  • Robust concurrent queries.

The setup package also facilitates the generation of new ESA databases from the latest Wikipedia dumps.

4. Downloads

Install MongoDB

EasyESA distribution package: easyEsa (includes binaries and source)

Database and indexes: English Wikipedia 2013 (Index).

5. Installation

The EasyESA package includes a setup script for Linux.

The setup script will perform the following steps:

  1. Download the latest Wikipedia dump.
  2. Download and install all the dependencies.
  3. Split the Wikipedia dump, using more than one thread.
  4. Preprocess the dump using Wikiprep (Zemanta's version) (http://www.tablix.org/~avian/git/wikiprep.git).
  5. Generate the ESA terms and concept vectors from the Wikipedia preprocessed dump.
  6. Generate the database and indexes.
  7. Start the EasyESA services.

The setup can be done in three ways, depending on the user's needs and the memory and storage available:

5.1. Simple run (Recommended)

You can download the EasyESA database and indexes for English Wikipedia 2013 (Index).

Simple procedure:

1. Extract easy_esa.tar.gz into INSTALL_DIR/.
2. Extract data*.tar.gz into INSTALL_DIR/mongodb/data.
3. Extract index*.tar.gz into INSTALL_DIR/index.
4. Start mongodb: mongod --dbpath mongodb/data/db
5. Start the EasyESA service: java -jar easy_esa.jar 8890 INSTALL_DIR/index &

On Linux, you can execute the run.sh script for steps 4 and 5:

  $./run.sh 
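
On platforms where run.sh is not available, steps 4 and 5 could be scripted by hand. The Python sketch below only rebuilds the two commands listed in the procedure; it is not a transcription of run.sh. It assumes that easy_esa.jar sits directly under INSTALL_DIR and that the MongoDB path is relative to INSTALL_DIR, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def service_commands(install_dir, port=8890):
    """Build the argument lists for step 4 (MongoDB) and step 5 (EasyESA)."""
    root = Path(install_dir)
    # Step 4: mongod --dbpath mongodb/data/db (assumed relative to INSTALL_DIR).
    mongod_cmd = ["mongod", "--dbpath", str(root / "mongodb" / "data" / "db")]
    # Step 5: java -jar easy_esa.jar 8890 INSTALL_DIR/index
    # (the jar location under INSTALL_DIR is an assumption).
    easyesa_cmd = [
        "java", "-jar", str(root / "easy_esa.jar"),
        str(port), str(root / "index"),
    ]
    return mongod_cmd, easyesa_cmd


def start_services(install_dir, port=8890):
    """Launch both processes in the background and return their handles."""
    mongod_cmd, easyesa_cmd = service_commands(install_dir, port)
    mongod = subprocess.Popen(mongod_cmd)
    easyesa = subprocess.Popen(easyesa_cmd)
    return mongod, easyesa
```

Calling start_services("/path/to/INSTALL_DIR") would start both processes. A more careful script would probably wait until MongoDB accepts connections before starting the EasyESA service.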

5.2. From setup script only (full setup)

Simply execute the setup_all.sh script:

  $./setup_all.sh  

The script takes two arguments:

  • the directory where the package will be installed.
  • the number of threads to be used in the preprocessing step. A value up to the number of processors/cores available is recommended. On processors with HyperThreading, up to 1.5 times the number of cores can be used with a noticeable performance gain.
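
The thread-count guideline above amounts to a small piece of arithmetic. The Python sketch below works it out; the function name is hypothetical, and rounding down to a whole number of threads is an assumption, since the text only gives an upper bound.

```python
import math


def recommended_threads(physical_cores, hyperthreading=False):
    """Upper bound on the thread count suggested by the guideline above.

    Without HyperThreading: one thread per core.
    With HyperThreading: up to 1.5 times the number of cores (rounded down).
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    factor = 1.5 if hyperthreading else 1.0
    return max(1, math.floor(physical_cores * factor))


# Example: a quad-core processor with HyperThreading.
threads = recommended_threads(4, hyperthreading=True)
```

For the quad-core example this gives 6 threads, against 4 without HyperThreading.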

Step 4 (Wikiprep preprocessing) will take about 3 days to complete on a modern computer (quad-core Intel Core i7) and use about 200GB of storage space for the early 2013 Wikipedia dump. Step 5 (generating the ESA terms and concept vectors) will take about 4 days to complete and use about 30GB on the same hardware and dump.

5.3. From setup script with previously downloaded Wikipedia dump

If you already have a Wikipedia dump and wish to use it, comment out line 5 of setup_all.sh, rename your enwiki-???-pages-articles.xml.bz2 file to enwiki-latest-pages-articles.xml.bz2, and place it in the destination directory. The setup script will then skip only the download step (step 1).
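
The renaming step can be scripted. The Python sketch below is only an illustration: the function name is hypothetical, and it moves the file rather than copying it, on the assumption that you do not want to keep two copies of a large dump.

```python
import shutil
from pathlib import Path

# File name the procedure above tells you to use for the dump.
EXPECTED_NAME = "enwiki-latest-pages-articles.xml.bz2"


def stage_dump(dump_path, destination_dir):
    """Move an existing dump into destination_dir under the expected name."""
    source = Path(dump_path)
    name = source.name
    # Accept only files that follow the enwiki-???-pages-articles.xml.bz2 pattern.
    if not (name.startswith("enwiki-") and name.endswith("-pages-articles.xml.bz2")):
        raise ValueError(f"unexpected dump file name: {name}")
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    target = destination / EXPECTED_NAME
    shutil.move(str(source), str(target))
    return target
```

Line 5 of setup_all.sh still has to be commented out by hand; the sketch only handles the file.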

6. Usage & Online Service

The EasyESA service can be used online at

  http://lasics.dcc.ufrj.br/esaservice 

or locally

  http://localhost:8890/esaservice

The service parameters are:

**task**
The query function to be called. The choices and their parameters are:
  • esa: semantic relatedness measure.
    • term1, term2: the two words to measure.
  • vector: concept vector.
    • source: the word for which the concept vector will be returned.
    • limit: maximum size of the concept vector. The concept vector is truncated if it is larger than the limit.
  • explain: concept vector overlap and context windows.
    • term1, term2: the two words to compare.
    • limit: maximum size of each concept vector. The overlap is computed after any truncation.
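
The parameters above map directly onto a URL query string. The Python sketch below builds such URLs; the helper and table names are hypothetical. Treating limit as optional is an assumption, because the text does not say whether it can be omitted.

```python
from urllib.parse import urlencode

LOCAL_ENDPOINT = "http://localhost:8890/esaservice"

# Parameters per task, as listed above. Marking "limit" as optional is an
# assumption made for this sketch.
TASK_PARAMS = {
    "esa": {"required": ("term1", "term2"), "optional": ()},
    "vector": {"required": ("source",), "optional": ("limit",)},
    "explain": {"required": ("term1", "term2"), "optional": ("limit",)},
}


def build_query_url(task, base_url=LOCAL_ENDPOINT, **params):
    """Return the request URL for a task after checking its parameter names."""
    if task not in TASK_PARAMS:
        raise ValueError(f"unknown task: {task}")
    spec = TASK_PARAMS[task]
    missing = [name for name in spec["required"] if name not in params]
    if missing:
        raise ValueError(f"missing parameters for {task}: {missing}")
    allowed = spec["required"] + spec["optional"]
    unknown = [name for name in params if name not in allowed]
    if unknown:
        raise ValueError(f"unexpected parameters for {task}: {unknown}")
    # "task" goes first, then the remaining parameters in call order.
    return base_url + "?" + urlencode({"task": task, **params})


url = build_query_url("vector", source="coffee", limit=50)
# url == "http://localhost:8890/esaservice?task=vector&source=coffee&limit=50"
```

urlencode also percent-encodes the values, so multi-word terms or non-ASCII characters are safe to pass.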

6.1. Examples

6.1.1. Semantic relatedness measure query

  http://lasics.dcc.ufrj.br/esaservice?task=esa&term1=computing&term2=sensor

Query for the semantic relatedness measure between the words computing and sensor.

6.1.2. Concept vector query

  http://lasics.dcc.ufrj.br/esaservice?task=vector&source=coffee&limit=50

Query for the concept vector of the word coffee with maximum length of 50 concepts.

6.1.3. Explain query

  http://lasics.dcc.ufrj.br/esaservice?task=explain&term1=computing&term2=sensor&limit=10000

Query for the concept vector overlapping between the words computing and sensor, and the context windows of both words for each concept in the overlap.
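
The same queries can be issued from a program. The Python sketch below uses only the standard library. The response schema is not documented in this manual, so the function returns whatever JSON value the service sends back. The function name and the injectable opener argument are hypothetical conveniences.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_easyesa(params, base_url="http://localhost:8890/esaservice",
                  opener=urlopen, timeout=30):
    """Send one query to the service and return the decoded JSON response.

    `params` holds the service parameters, e.g.
    {"task": "esa", "term1": "computing", "term2": "sensor"}.
    `opener` defaults to urllib's urlopen and can be replaced for testing.
    """
    url = base_url + "?" + urlencode(params)
    with opener(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    # The structure of the result is not assumed here; it is returned as parsed.
    return json.loads(body)
```

For example, query_easyesa({"task": "esa", "term1": "computing", "term2": "sensor"}) would run the first query above against a local installation; pass base_url="http://lasics.dcc.ufrj.br/esaservice" to target the online service instead.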

8. License

EasyESA is distributed under the GNU General Public License v3.0 (GPL-3.0).

10. Publication

Please cite the publication below if you use EasyESA in your experiments.

Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry. EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis. In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings)

11. Contact

Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry.

Insight Centre for Data Analytics
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway

contact: danilo | at | jaist [dot] ac [dot] jp, andre - dot - freitas | at | deri [dot] org