API Address (current): http://cgm6.research.cs.dal.ca/~sajadi/wikisim/
Source code: https://github.com/asajadi/wikisim
Wikisim provides the following services:
- Vector-Space Representation of Wikipedia Concepts
- Semantic Relatedness between Wikipedia Concepts
- Wikification: Entity Linking to Wikipedia
Publications:
Detailed descriptions of the architecture and algorithms can be found in the following publications:
- Armin Sajadi, Evangelos E. Milios, Vlado Keselj: "Vector Space Representation of Concepts Using Wikipedia Graph Structure". NLDB 2017: 393-405 (bib, pdf)
- Armin Sajadi, Evangelos E. Milios, Vlado Keselj, Jeannette C. M. Janssen: "Domain-Specific Semantic Relatedness from Wikipedia Structure: A Case Study in Biomedical Text". CICLing (1) 2015: 347-360 (bib, pdf)
- Armin Sajadi: "Graph-Based Domain-Specific Semantic Relatedness from Wikipedia". Canadian AI 2014, LNAI 8436, pp. 381–386, 2014 (bib, pdf)
Awards
- Verifiability, Reproducibility, and Working Description Award, Computational Linguistics and Intelligent Text Processing, 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015
Check the current webservice address above.
The webservice provides three basic functions (tasks): Wikification, Similarity, and Embedding calculation. All requests can be processed in single or batch mode.
Single mode
- Wikification:
  Parameters:
  - modelparams: should be 0 for using CoreNLP, 1 for our high-precision trained model, and 2 for our high-recall trained method
  - wikitext: the text to be wikified
Example (using curl):
curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-wikify.py' -F 'modelparams=0' -F 'wikitext=Lee Jun-fan known professionally as Bruce Lee, was founder of the martial art Jeet Kune Do'
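A minimal Python sketch of the same request, assuming the requests library is available (the URL and form fields are taken from the curl call above; the (None, value) tuples mirror curl's multipart -F encoding):

    import requests

    # Wikification endpoint shown in the curl example above.
    url = 'http://35.231.242.71/wikisim/cgi-bin/cgi-wikify.py'
    fields = {
        'modelparams': (None, '0'),  # 0: CoreNLP, 1: high-precision model, 2: high-recall model
        'wikitext': (None, 'Lee Jun-fan known professionally as Bruce Lee, was founder '
                           'of the martial art Jeet Kune Do'),
    }
    response = requests.post(url, files=fields)  # files= forces multipart/form-data, like curl -F
    print(response.text)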
- Similarity Calculation:
  Parameters:
  - task: should be 'sim' for this task
  - dir: 0 for using incoming links, 1 for outgoing links, and 2 for both. We recommend using only outgoing links, as it provides decent results and is significantly faster.
  - c1 (and c2): the concepts to be processed
Example (using curl):
curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-pairsim.py' -F 'task=sim' -F 'dir=1' -F 'c1=Bruce_Lee' -F 'c2=Arnold_Schwarzenegger'
- Concept Representation (Embedding):
  Parameters:
  - task: should be 'emb' for this task
  - dir: 0 for using incoming links, 1 for outgoing links, and 2 for both. We recommend using only outgoing links, as it provides decent results and is significantly faster.
  - cutoff: the dimensionality of the embedding. This parameter only affects the returned embedding; the similarity calculation always uses all dimensions.
  - c1: the concept to be processed
Example (using curl):
curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-pairsim.py' -F 'task=emb' -F 'dir=1' -F 'cutoff=10' -F 'c1=Bruce_Lee'
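The pairsim endpoint can be called from Python in the same way; a minimal sketch with the requests library, mirroring the two curl calls above (field names and values are exactly those shown above):

    import requests

    # Endpoint shared by the similarity ('sim') and embedding ('emb') tasks.
    url = 'http://35.231.242.71/wikisim/cgi-bin/cgi-pairsim.py'

    # Similarity between two concepts, using outgoing links only (dir=1).
    sim_fields = {'task': (None, 'sim'), 'dir': (None, '1'),
                  'c1': (None, 'Bruce_Lee'), 'c2': (None, 'Arnold_Schwarzenegger')}
    print(requests.post(url, files=sim_fields).text)

    # 10-dimensional embedding of a single concept.
    emb_fields = {'task': (None, 'emb'), 'dir': (None, '1'),
                  'cutoff': (None, '10'), 'c1': (None, 'Bruce_Lee')}
    print(requests.post(url, files=emb_fields).text)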
Batch mode
We strongly recommend using batch mode, either by sending the file in a POST request or by simply uploading it through the batch-mode input.
For Wikification, documents should be separated by new lines. For similarity calculation, the file should be tab-separated, with each line containing a pair of Wikipedia concepts. For embedding calculation, each line of the file contains a single concept.
The parameters are the same; however, the target CGI scripts are different:
- Wikification: use cgi-batchwikify.py
Example (using curl):
curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchwikify.py' -F 'modelparams=0' -F 'file=@filename'
- Similarity: use cgi-batchsim.py
Example (using curl):
curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchsim.py' -F 'task=sim' -F 'dir=1' -F 'file=@filename'
- Embedding: use cgi-batchsim.py
Example (using curl):
curl --request POST 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchsim.py' -F 'task=emb' -F 'dir=1' -F 'cutoff=10' -F 'file=@filename'
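The batch endpoints accept the same kind of file upload from Python; a minimal sketch with the requests library that writes a small tab-separated pair file (the format described above) and posts it to cgi-batchsim.py:

    import requests

    # Build a small tab-separated pair file for the similarity batch endpoint.
    pairs = [('Bruce_Lee', 'Arnold_Schwarzenegger'),
             ('Bruce_Lee', 'Jeet_Kune_Do')]
    with open('pairs.tsv', 'w') as f:
        for c1, c2 in pairs:
            f.write(f'{c1}\t{c2}\n')

    # Upload the file; 'file' is the same form field used by curl's -F 'file=@filename'.
    url = 'http://35.231.242.71/wikisim/cgi-bin/cgi-batchsim.py'
    with open('pairs.tsv', 'rb') as f:
        fields = {'task': (None, 'sim'), 'dir': (None, '1'), 'file': f}
        print(requests.post(url, files=fields).text)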
You can download the embeddings; however, using our API has the following advantages:
- The embeddings are keyed by the page_ids of Wikipedia concepts, so you need another table to find the concept titles. Moreover, redirect concepts are not included, so if, say, you want to find the embedding for "US", you have to follow these steps (see the sketch after this list):
  - From the page table, find "US"; its redirect field is 1, meaning that it is a redirect page. Take its id: 31643.
  - Go to the redirect table and find that it redirects to 3434750 (the id for United_States).
  - Go to the embedding table and find the embedding for 3434750.
- Only the nonzero dimensions are stored in the embedding, so there is a need for efficient alignment.
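To make the lookup chain concrete, here is a small illustrative sketch (not part of the Wikisim code) that assumes the page, redirect, and embedding tables have already been loaded into Python dictionaries keyed as described in the comments:

    # Illustration only. Assumed dictionaries built from the downloaded tables:
    #   page_by_title: page_title -> (page_id, page_is_redirect)
    #   redirect_to:   rd_from    -> rd_to
    #   embedding_of:  page_id    -> sparse embedding as a (ids, values) tuple

    def embedding_for_title(title, page_by_title, redirect_to, embedding_of):
        page_id, is_redirect = page_by_title[title]  # e.g. "US" -> (31643, 1)
        if is_redirect:
            page_id = redirect_to[page_id]           # 31643 -> 3434750 (United_States)
        return embedding_of[page_id]                 # sparse (ids, values) embedding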
But if you still want to use your own data structures, download the following tables:
- Page table. Layout: page_id, page_namespace (0: page, 14: Category), page_title, page_is_redirect
- Redirect table. Layout: rd_from, rd_to
As stated in the paper, out-link embeddings are shorter and lead to faster processing. If you want to get the full embedding for a word, find both its in-embedding and out-embedding and add them up.
- Embedding table (pagelinksorderedin). Layout: page_id, embedding as a pickled tuple (ids, values)
  Note: the second field is a binary string and needs to be properly unescaped. The function read_embedding_file(filename, records_number), defined in the utils module in the Wikisim notebook, can read the embedding file.
- Layout: page_id, embedding
  Note: similar to the situation with pagelinksorderedin explained above, you need to use read_embedding_file(filename, records_number) to read the file.
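As a rough illustration only (this is not the project's read_embedding_file; the tab-separated layout and the backslash escaping of the binary field are assumptions), reading such a dump and summing an in-embedding with an out-embedding might look like this:

    import pickle
    from collections import defaultdict

    def load_embedding_dump(path):
        # Assumes one 'page_id<TAB>escaped pickled blob' record per line (assumption).
        embeddings = {}
        with open(path, 'rb') as f:
            for line in f:
                page_id, blob = line.rstrip(b'\n').split(b'\t', 1)
                # Undo backslash escaping of the binary field (assumed dump format).
                blob = blob.decode('unicode_escape').encode('latin-1')
                embeddings[int(page_id)] = pickle.loads(blob)  # (ids, values) tuple
        return embeddings

    def add_sparse(emb_in, emb_out):
        # Sum two sparse (ids, values) embeddings, aligning them by dimension id.
        total = defaultdict(float)
        for ids, values in (emb_in, emb_out):
            for i, v in zip(ids, values):
                total[i] += v
        dims = sorted(total)
        return dims, [total[d] for d in dims]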
Step 1. Prepare the environment
- Install conda
- Run conda env create -f environment.yml
Step 2. Clone the Source Code
It is mostly written in Python. The repository contains several files, but the following two notebooks are the main entries to the source code and contain all the core features of the system:
- wikisim.ipynb notebook: Contains the embedding and relatedness methods.
- wikify.ipynb notebook: Contains the word-sense disambiguation and entity linking methods.
Step 3. Preparing the MariaDB and Apache Solr servers
Option 1. Downloading the prepared servers
This option saves you a lot of time if it works! It requires the following two steps: downloading our MariaDB server and downloading our Solr cores. If you are using Linux, there is a chance that you can download the whole servers and they work out-of-the-box.
Option 2. Starting from scratch and importing a different version of Wikipedia
This requires downloading and preprocessing the Wikipedia dumps and extracting the graph structure and textual information from them. The whole process can be done in two major steps:
- Setting up a MariaDB server and preparing the graph structure.
  The full instructions are given in this jupyter notebook.
- Processing the text and setting up Apache Solr.
  The full instructions are given in this jupyter notebook.
Wikisim was developed in the Machine Learning and Networked Information Spaces (MALNIS) Lab at Dalhousie University. This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Boeing Company, and Mitacs.
Contributors:
- Armin Sajadi - Faculty of Computer Science
- Ryan Amaral - Faculty of Computer Science
Contact: Armin Sajadi
We appreciate and value any question, special feature request, or bug report; just let us know at sajadi@cs.dal.ca or asajadi@gmail.com.