EasyESA is an implementation of Explicit Semantic Analysis (ESA) based on the Wikiprep-ESA code from Çağatay Çallı (https://github.com/faraday/wikiprep-esa). It runs as a JSON webservice which can be queried for the semantic relatedness measure, concept vectors or the context windows.
This manual provides information on the functionality, setup and usage of the EasyESA package.
Explicit Semantic Analysis (ESA) is a technique for text representation that uses Wikipedia as a commonsense knowledge base, relying on the co-occurrence of words in its articles. Each article is treated as a concept, and the words of an article are associated with that concept using TF-IDF scoring. A word can then be represented as a vector of its associations to each concept, so the "semantic relatedness" between any two words can be measured by the cosine similarity of their vectors. A document containing a string of words is represented as the centroid of the vectors representing its words.
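The representation described above can be illustrated with a minimal sketch. The concept weights below are invented toy values (real ESA vectors are derived from Wikipedia), and the names `WORD_VECTORS`, `cosine` and `centroid` are hypothetical helpers, not part of EasyESA.

```python
import math
from collections import defaultdict

# Hypothetical toy data: each word maps to {concept title: TF-IDF weight}.
# The numbers are invented for illustration only.
WORD_VECTORS = {
    "coffee": {"Coffee": 0.9, "Caffeine": 0.6, "Espresso": 0.5},
    "tea": {"Tea": 0.9, "Caffeine": 0.5},
    "sensor": {"Sensor": 0.8, "Transducer": 0.4},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(words):
    """Average the concept vectors of the known words in a text."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        return {}
    total = defaultdict(float)
    for vec in known:
        for concept, weight in vec.items():
            total[concept] += weight
    return {c: w / len(known) for c, w in total.items()}


# Words sharing a concept get a positive score; disjoint vectors score 0.
print(cosine(WORD_VECTORS["coffee"], WORD_VECTORS["tea"]))
print(cosine(WORD_VECTORS["coffee"], WORD_VECTORS["sensor"]))
# A two-word "document" compared against a single word.
print(cosine(centroid(["coffee", "tea"]), WORD_VECTORS["coffee"]))
```

Since TF-IDF weights are non-negative, the cosine similarity of two such vectors always falls in the [0,1] interval.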
For more information on ESA, please refer to the paper by Evgeniy Gabrilovich and Shaul Markovitch: "Wikipedia-based semantic interpretation for natural language processing" (http://www.jair.org/media/2669/live-2669-4346-jair.pdf).
EasyESA provides the following functionalities:
- Semantic relatedness measure
  - Given two terms, returns the semantic relatedness: a real number in the [0,1] interval representing how semantically close the terms are. The more related the terms are, the higher the value returned.
- Concept vector
  - Given a term, returns the concept vector: a list of Wikipedia article titles (concepts) with the associated score for the term.
- Query explanation
  - Given two terms, returns the overlap between their concept vectors and the "context windows" for both terms on each overlapping concept. A context window for a given (term, concept) pair is a short segment of the Wikipedia article represented by the concept that contains the term.
EasyESA was developed as an improvement over Wikiprep-ESA. The main differences are:
- Easy setup package for quick deployment of local ESA infrastructure.
- Performance improvements.
- Robust concurrent queries.
The setup package also facilitates the generation of new ESA databases from the latest Wikipedia dumps.
Install MongoDB
EasyESA distribution package: easyEsa (includes binaries and source)
Database and Indexes: English Wikipedia 2013 (Index).
The EasyESA package includes a setup script for Linux.
The setup script will perform the following steps:
- Download the latest Wikipedia dump.
- Download and install all the dependencies.
- Split the Wikipedia dump, using more than one thread.
- Preprocess the dump using Wikiprep (Zemanta's version) (http://www.tablix.org/~avian/git/wikiprep.git).
- Generate the ESA terms and concept vectors from the preprocessed Wikipedia dump.
- Generate the database and indexes.
- Start the EasyESA services.
The setup can be done in three ways, depending on the user needs and memory/storage availability:
You can download the EasyESA database and indexes for English Wikipedia 2013 (Index).
Simple procedure:
1. Extract easy_esa.tar.gz into INSTALL_DIR/.
2. Extract data*.tar.gz into INSTALL_DIR/mongodb/data.
3. Extract index*.tar.gz into INSTALL_DIR/index.
4. Start mongodb: mongod --dbpath mongodb/data/db
5. Start the EasyESA service: java -jar easy_esa.jar 8890 INSTALL_DIR/index &
On Linux, you can execute the run.sh script for steps 4 and 5:
$ ./run.sh
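After steps 4 and 5, it can be useful to confirm that both processes are accepting connections. The following is a minimal sketch, assuming mongod listens on its default port 27017 and EasyESA on port 8890 as in the command above; `port_is_open` is a hypothetical helper, not part of the package.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed ports: 27017 (mongod default) and 8890 (passed to easy_esa.jar).
    for name, port in [("MongoDB", 27017), ("EasyESA", 8890)]:
        state = "up" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening; it does not prove that the database and indexes were loaded correctly.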
Simply execute the setup_all.sh script:
$ ./setup_all.sh
where the script's arguments are:
- the directory where the package will be installed.
- the number of threads to be used in the preprocessing step. A value up to the number of processors/cores available is recommended. Processors with Hyper-Threading can use up to 1.5 times the number of cores with a noticeable performance gain.
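The thread-count guideline above amounts to simple arithmetic, sketched below. `suggested_threads` is a hypothetical helper; the physical core count is passed in explicitly because Python's `os.cpu_count()` reports logical CPUs, which already include Hyper-Threading siblings.

```python
import math


def suggested_threads(physical_cores, hyper_threading=False):
    """Upper bound on the thread count according to the guideline above."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    # Up to 1.5x the core count with Hyper-Threading, else one per core.
    factor = 1.5 if hyper_threading else 1.0
    return max(1, math.floor(physical_cores * factor))


print(suggested_threads(4))
print(suggested_threads(4, hyper_threading=True))
```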
Step 4 (preprocessing) will take about 3 days to complete on a modern computer (quad-core Intel Core i7) and use about 200GB of storage space for the early 2013 Wikipedia dump. Step 5 (vector generation) will take about 4 days to complete and use about 30GB on the same hardware and Wikipedia dump.
If you already have a Wikipedia dump and wish to use it, comment out line 5 of setup_all.sh, rename your enwiki-???-pages-articles.xml.bz2 file to enwiki-latest-pages-articles.xml.bz2 and place it in the destination directory. The setup script will then skip only the download step (step 1).
The EasyESA service can be used online at
http://lasics.dcc.ufrj.br/esaservice
or locally at
http://localhost:8890/esaservice
The service parameters are:
- **task**
  - The query function to be called. The choices and their parameters are:
    - esa: semantic relatedness measure.
      - term1, term2 (the two words to measure)
    - vector: concept vector.
      - source: the word for which the concept vector will be returned.
      - limit: maximum size of the concept vector. The concept vector will be truncated if larger than the limit.
    - explain: concept vector overlap and context windows.
      - term1, term2 (the two words to compare)
      - limit: maximum size of the concept vector. The overlap is calculated after any truncation.
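Because the overlap is computed after truncation, a small limit can hide concepts that both terms share. The sketch below illustrates this with invented scores; it assumes truncation keeps the highest-scoring concepts, which is not stated here, and `truncate` and `overlap` are hypothetical helpers rather than EasyESA functions.

```python
def truncate(vector, limit):
    """Keep at most `limit` concepts, highest score first (assumed ordering)."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:limit])


def overlap(vec1, vec2, limit):
    """Concepts present in both vectors after each is truncated to `limit`."""
    top1, top2 = truncate(vec1, limit), truncate(vec2, limit)
    return {c: (top1[c], top2[c]) for c in top1 if c in top2}


# Invented toy vectors: {concept title: score}.
computing = {"Computer": 0.9, "Algorithm": 0.7, "Embedded system": 0.4}
sensor = {"Sensor": 0.9, "Embedded system": 0.6, "Transducer": 0.5}

# With limit=2 the shared concept is cut from the first vector.
print(overlap(computing, sensor, limit=2))
# With limit=3 both full vectors are kept and the shared concept appears.
print(overlap(computing, sensor, limit=3))
```

This is one reason to pass a large limit to the explain task when a complete overlap is wanted.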
http://lasics.dcc.ufrj.br/esaservice?task=esa&term1=computing&term2=sensor
Query for the semantic relatedness measure between the words computing and sensor.
http://lasics.dcc.ufrj.br/esaservice?task=vector&source=coffee&limit=50
Query for the concept vector of the word coffee with maximum length of 50 concepts.
http://lasics.dcc.ufrj.br/esaservice?task=explain&term1=computing&term2=sensor&limit=10000
Query for the concept vector overlap between the words computing and sensor, and the context windows of both words for each concept in the overlap.
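The example queries above can also be assembled programmatically. This is a minimal sketch using only the Python standard library: `build_url` and `query_service` are hypothetical helper names, the parameter table mirrors the service parameters listed earlier, and treating every parameter as optional at this level is an assumption. The layout of the JSON reply is not documented here, so the decoded object is returned as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

LOCAL_SERVICE = "http://localhost:8890/esaservice"

# Parameters accepted by each task, in the order used by the examples above.
TASK_PARAMS = {
    "esa": ("term1", "term2"),
    "vector": ("source", "limit"),
    "explain": ("term1", "term2", "limit"),
}


def build_url(task, base=LOCAL_SERVICE, **params):
    """Build a query URL for the given task, rejecting unknown names."""
    if task not in TASK_PARAMS:
        raise ValueError(f"unknown task: {task!r}")
    unexpected = set(params) - set(TASK_PARAMS[task])
    if unexpected:
        raise ValueError(f"unexpected parameters: {sorted(unexpected)}")
    query = {"task": task}
    for name in TASK_PARAMS[task]:
        if name in params:
            query[name] = params[name]
    return f"{base}?{urlencode(query)}"


def query_service(task, base=LOCAL_SERVICE, **params):
    """Send the request and decode the JSON reply (requires a running service)."""
    with urlopen(build_url(task, base, **params), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_url("esa", term1="computing", term2="sensor"))
print(build_url("vector", source="coffee", limit=50))
```

With a local service running, `query_service("esa", term1="computing", term2="sensor")` would send the first request. `urlencode` also percent-encodes terms containing spaces or non-ASCII characters.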
EasyESA is distributed under the GPL.
Please refer to the publication below if you are using ESA in your experiments.
Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry, EasyESA: A Low-effort Infrastructure for Explicit Semantic Analysis, In Proceedings of the 13th International Semantic Web Conference (ISWC), Riva del Garda, 2014. (Demonstration Paper in Proceedings)
Danilo Carvalho, Çağatay Çallı, André Freitas, Edward Curry.
Insight Centre for Data Analytics
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway
contact: danilo | at | jaist [dot] ac [dot] jp, andre - dot - freitas | at | deri [dot] org