wikisim

Concept Representation (Embedding) and Semantic Relatedness


What is Wikisim?

Wikisim provides the following services:

  • Vector-Space Representation of Wikipedia Concepts
  • Semantic Relatedness between Wikipedia Concepts
  • Wikification: Entity Linking to Wikipedia

Publications:

A detailed description of the architecture and algorithms can be found in the following publications:

  • Armin Sajadi, Evangelos E. Milios, Vlado Keselj: "Vector Space Representation of Concepts Using Wikipedia Graph Structure". NLDB 2017: 393-405 (bib, pdf)
  • Armin Sajadi, Evangelos E. Milios, Vlado Keselj, Jeannette C. M. Janssen: "Domain-Specific Semantic Relatedness from Wikipedia Structure: A Case Study in Biomedical Text". CICLing (1) 2015: 347-360 (bib, pdf)
  • Armin Sajadi: "Graph-Based Domain-Specific Semantic Relatedness from Wikipedia". Canadian AI 2014, LNAI 8436, pp. 381–386, 2014 (bib, pdf)

Awards

  • Verifiability, Reproducibility, and Working Description Award, Computational Linguistics and Intelligent Text Processing, 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015

API

Webservice Address

Check the current address:

Single mode

The webservice provides three basic functions (tasks): Wikification, Similarity, and Embedding calculation. All requests can be processed in single or batch mode. A Python sketch of the single-mode calls follows the list below.

  • Wikification:
    parameters:
    modelparams: 0 to use CoreNLP, 1 for our high-precision trained model, and 2 for our high-recall trained model
    wikitext: the text to be wikified
    Example (using curl):
    curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-wikify.py' -F 'modelparams=0' -F 'wikitext=Lee Jun-fan known professionally as Bruce Lee, was founder of the martial art Jeet Kune Do'
  • Similarity Calculation:
    parameters:
    task: should be 'sim' for this task
    dir: 0 to use incoming links, 1 for outgoing links, and 2 for both. We recommend using only outgoing links, as they provide decent results and are significantly faster
    c1 and c2: the two concepts to be compared
    Example (using curl):
    curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-pairsim.py' -F 'task=sim' -F 'dir=1' -F 'c1=Bruce_Lee' -F 'c2=Arnold_Schwarzenegger'
  • Concept Representation (Embedding):
    parameters:
    task: should be 'emb' for this task
    dir: 0 to use incoming links, 1 for outgoing links, and 2 for both. We recommend using only outgoing links, as they provide decent results and are significantly faster
    cutoff: the dimensionality of the returned embedding. This parameter only affects the returned vector; the similarity calculation always uses all dimensions.
    c1: the concept to be processed
    Example (using curl):
    curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-pairsim.py' -F 'task=emb' -F 'dir=1' -F 'cutoff=10' -F 'c1=Bruce_Lee'
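The curl examples above carry over directly to any HTTP client. Below is a minimal Python sketch of the same three calls; it assumes the requests library, and the host address is simply copied from the curl examples, so check the current address before relying on it:

    import requests

    # Base address of the web service (taken from the curl examples above; may change)
    BASE = 'http://35.231.242.71/wikisim/cgi-bin'

    # Wikification: modelparams=0 selects CoreNLP (see the parameter list above).
    # files={field: (None, value)} mimics curl's -F multipart form fields.
    r = requests.post(BASE + '/cgi-wikify.py',
                      files={'modelparams': (None, '0'),
                             'wikitext': (None, 'Lee Jun-fan known professionally as Bruce Lee, '
                                                'was founder of the martial art Jeet Kune Do')})
    print(r.text)

    # Similarity between two concepts, using outgoing links only (dir=1)
    r = requests.post(BASE + '/cgi-pairsim.py',
                      files={'task': (None, 'sim'), 'dir': (None, '1'),
                             'c1': (None, 'Bruce_Lee'), 'c2': (None, 'Arnold_Schwarzenegger')})
    print(r.text)

    # Embedding of a single concept, returning cutoff=10 dimensions
    r = requests.post(BASE + '/cgi-pairsim.py',
                      files={'task': (None, 'emb'), 'dir': (None, '1'),
                             'cutoff': (None, '10'), 'c1': (None, 'Bruce_Lee')})
    print(r.text)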

Batch mode

We strongly recommend using batch mode, either by sending the file in a POST request or simply by uploading the file through the batch-mode input.

For Wikification, documents should be separated by new lines. For similarity calculation, the file should be tab-separated, each line containing a pair of Wikipedia concepts. For embedding calculation, each line of the file contains a single concept. A sketch of preparing and uploading such a file follows the list below.

The parameters are the same; however, the target CGI scripts are different:

  • Wikification: use cgi-batchwikify.py
    Example (using curl):
    curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchwikify.py' -F 'modelparams=0' -F 'file=@filename'
  • Similarity: use cgi-batchsim.py
    Example (using curl):
    curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchsim.py' -F 'task=sim' -F 'dir=1' -F 'file=@filename'
  • Embedding: use cgi-batchsim.py
    Example (using curl):
    curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchsim.py' -F 'task=emb' -F 'dir=1' -F 'cutoff=10' -F 'file=@filename'
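As a concrete illustration, the sketch below builds a small tab-separated pairs file and submits it to the batch similarity endpoint from Python. The file name and the concept pairs are only examples; the endpoint and parameters are the ones shown in the curl examples above:

    import requests

    # Each line of the similarity input file holds one tab-separated pair of concepts
    pairs = [('Bruce_Lee', 'Arnold_Schwarzenegger'),
             ('Bruce_Lee', 'Jeet_Kune_Do')]
    with open('pairs.tsv', 'w') as f:
        for c1, c2 in pairs:
            f.write(c1 + '\t' + c2 + '\n')

    # Upload the file in batch mode (outgoing links only, dir=1)
    with open('pairs.tsv', 'rb') as f:
        r = requests.post('http://35.231.242.71/wikisim/cgi-bin/cgi-batchsim.py',
                          files={'file': f},
                          data={'task': 'sim', 'dir': '1'})
    print(r.text)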

Downloading the embeddings

Current Version: enwiki20160305

You can download the embeddings; however, using our API has the following advantages:

  1. The embeddings are keyed by Wikipedia concept page_ids, so you need another table to map concept titles to ids. Moreover, redirect concepts are not included, so if, say, you want to find the embedding for "US", you have to follow these steps (scripted in the sketch after this list):
    1. From the page table, find "US"; its redirect field is 1, meaning it is a redirect page. Take its id: 31643
    2. Go to the redirect table and find that it is redirected to 3434750 (the id for United_States)
    3. Go to the embedding table and find the embedding for 3434750
  2. Only the nonzero dimensions are stored in each embedding, so the vectors need to be aligned efficiently before they can be compared
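The lookup chain above is straightforward to script once the page and redirect tables are loaded. The sketch below hard-codes the two table entries from the "US" example purely for illustration; in practice you would load them from the tables described in the next section:

    # page table: page_title -> (page_id, page_is_redirect)
    page = {'US': (31643, 1), 'United_States': (3434750, 0)}
    # redirect table: rd_from -> rd_to
    redirect = {31643: 3434750}

    def resolve(title):
        page_id, is_redirect = page[title]      # step 1: look the title up in the page table
        if is_redirect:
            page_id = redirect[page_id]         # step 2: follow the redirect
        return page_id                          # step 3: use this id in the embedding table

    print(resolve('US'))  # 3434750, the id for United_States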

But if you still want to use your own data structures, download the following tables:

  1. Page Table

    Layout:

    page_id , page_namespace (0: page, 14: Category) , page_title , page_is_redirect

  2. Redirect Table

    Layout:

    rd_from , rd_to

    As stated in the paper, out-link embeddings are shorter and lead to faster processing. If you want the full embedding for a concept, find both its in-embedding and out-embedding and add them up.

  3. Embeddings (in-links)

    • Layout:

      page_id , embedding as a pickled tuple(ids, values)

    • Note The second field is a binary string and needs to be properly unescaped. The following function, defined in the utils module in the Wikisim notebook, can read the embedding file.

      read_embedding_file(filename, records_number)

  4. Embeddings (Out-Links)

    • Layout:

      page_id , embedding

    • Note Similar to the situation explained above for the in-links file (pagelinksorderedin), you need to use read_embedding_file(filename, records_number) to read this file. A sketch of working with the parsed sparse embeddings follows this list.
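Once a file has been read with read_embedding_file, each concept's embedding is a sparse (ids, values) tuple, so the dimensions have to be aligned before two vectors can be combined or compared. The sketch below uses made-up toy tuples and plain cosine similarity purely as an example measure; it also shows the in-plus-out combination mentioned above:

    import math
    from collections import defaultdict

    def to_dict(emb):
        """Turn a sparse (ids, values) tuple into a dict keyed by dimension id."""
        ids, values = emb
        return dict(zip(ids, values))

    def combine(in_emb, out_emb):
        """Add the in-link and out-link embeddings of a concept (full embedding)."""
        acc = defaultdict(float)
        for emb in (in_emb, out_emb):
            for i, v in to_dict(emb).items():
                acc[i] += v
        return dict(acc)

    def cosine(a, b):
        """Cosine similarity between two dict-valued sparse vectors."""
        dot = sum(v * b.get(i, 0.0) for i, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Toy sparse embeddings in the (ids, values) layout described above
    bruce_in = ((3434750, 105), (0.8, 0.3))
    bruce_out = ((3434750, 99), (0.5, 0.1))
    arnold = {3434750: 0.7, 99: 0.2}

    print(cosine(combine(bruce_in, bruce_out), arnold))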

Hosting Wikisim

You can run Wikisim locally and freely modify the source code.

Step 1.

Prepare the environment
  • Install conda
  • Run:

    conda env create -f environment.yml

Step 2. Clone the Source Code

It is mostly written in Python. The repository contains several files, but the following two notebooks are the main entry points to the source code and contain all the core features of the system. Having prepared the conda environment, you have two options for Step 3:

Step 3. Preparing the MariaDB and Apache Solr servers

Option 1. Downloading the prepared servers

This option saves you a lot of time if it works! It requires the following two steps:

You can download our MariaDB server and Solr Cores. If you are using Linux, there is a good chance that the downloaded servers will work out of the box.

Option 2. Starting from scratch and importing a different version of Wikipedia

It requires downloading and preprocessing the Wikipedia dumps and extracting the graph-structure and textual information from them. The whole process can be done in two major steps:
  • Setting up a MariaDB server and preparing the graph structure.

    The full instructions are given in this Jupyter notebook

  • Processing the text and setting up the Apache Solr

    The full instructions are given in this Jupyter notebook

About

Wikisim was developed in the Machine Learning and Networked Information Spaces (MALNIS) Lab at Dalhousie University. This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Boeing Company, and Mitacs.

Contributors:

  • Armin Sajadi - Faculty of Computer Science
  • Ryan Amaral - Faculty of Computer Science

Contact

Armin Sajadi

We appreciate and value any questions, feature requests, or bug reports; just let us know at:
sajadi@cs.dal.ca
asajadi@gmail.com