Primary language: Python. License: MIT.

KGTK Semantic Similarity

The KGTK Semantic Similarity system provides a number of different similarity computations over Wikidata, a browser-based GUI to interactively explore them, and Web-based APIs to access them programmatically.

Similarity measures

The following similarity computations are currently supported:

  • class: an ontology-based measure based on the Jaccard similarity of the respective superclass sets of two nodes, inversely weighted by the instance counts of the classes. For this measure, classes high up in the ontology with very high transitive instance counts are weighted lower than more specific classes with lower counts (see below on how we count instances).

  • jc: an ontology-based measure using our interpretation of the Jiang-Conrath ontological distance (see https://arxiv.org/abs/cmp-lg/9709008). We use instance counts (the same as used for class) to compute probabilities and normalize to the distance through the entity node (Q35120) to get a similarity. If a node pair has multiple most-specific subsumers, the maximum similarity based on those will be used.

  • complex: an embedding-based measure using 100-dimensional ComplEx graph embeddings computed over the Wikidata knowledge graph. Similarity is computed as the cosine similarity between the embedding vectors of two nodes.

  • transe: an embedding-based measure using 100-dimensional TransE graph embeddings computed over the Wikidata knowledge graph. Similarity is computed as the cosine similarity between the embedding vectors of two nodes.

  • text: an embedding-based measure using 1024-dimensional text embeddings based on sentences generated by the KGTK lexicalize command using the following properties: P31 P279 P106 P39 P1382 P373 P452. The resulting sentences are then fed to RoBERTa-large to create embeddings. Similarity is computed as the cosine similarity between the embedding vectors of two nodes.

  • topsim: computes top-similar regions for each node by enumerating nearest neighbors from the embeddings and from exploring the ontology. These candidates are then ranked using an aggregate of the above similarity computations (by default, a simple average). Once the top-similar regions are available, similarity is computed as a weighted average of the similarities between the two nodes and their top-5 similars.
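As a rough illustration of the class measure, the sketch below computes a weighted Jaccard similarity over two superclass sets. The weighting function (1 / instance count), the function name, and the toy class names and counts are assumptions made for illustration; the description above only states that classes are weighted inversely by instance count, not the exact formula used by the service.

```python
def weighted_class_similarity(supers_a, supers_b, instance_counts):
    """Weighted Jaccard over superclass sets; rarer classes weigh more.

    Assumption: weight(c) = 1 / instance_count(c). The real weighting may differ.
    """
    def weight(cls):
        # Treat missing or zero counts as 1 to avoid division by zero.
        return 1.0 / max(instance_counts.get(cls, 1), 1)

    shared = supers_a & supers_b
    combined = supers_a | supers_b
    total = sum(weight(c) for c in combined)
    if total == 0:
        # Both superclass sets are empty: nothing to compare.
        return 0.0
    return sum(weight(c) for c in shared) / total


# Toy data (hypothetical class names and counts, not real Wikidata values).
counts = {"dog": 10, "cat": 10, "mammal": 100, "animal": 1000}
dog_supers = {"dog", "mammal", "animal"}
cat_supers = {"cat", "mammal", "animal"}
print(weighted_class_similarity(dog_supers, cat_supers, counts))
```

With this weighting, the shared but very general class "animal" contributes little, while the unshared specific classes dominate the denominator.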

All computations return values in the interval [0..1] with 0 being totally dissimilar and 1 meaning identical (negative cosine similarities are mapped to zero). However, similarities are not (yet) calibrated, that is, a value of 0.8 for class does not necessarily mean the same level of similarity as 0.8 for text.
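The mapping of cosine values into [0..1] can be sketched in a few lines. This is a minimal illustration, assuming plain Python lists as vectors; the function name is hypothetical and the service's own implementation is not shown in this document.

```python
import math


def clipped_cosine_similarity(u, v):
    """Cosine similarity of two vectors, with negative values mapped to zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Cosine is undefined for a zero vector; treat it as dissimilar here.
        return 0.0
    cosine = dot / (norm_u * norm_v)
    # Clamp to [0, 1]: negatives become 0, rounding overshoot becomes 1.
    return min(1.0, max(0.0, cosine))


print(clipped_cosine_similarity([1.0, 1.0], [1.0, 0.0]))
print(clipped_cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```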

Instance counts are computed using P31 (instance of), P39 (position held), P106 (occupation) and transitive P279 (subclass of) links. The reason for using positions and occupations in addition to P31 is that these more descriptive classes (e.g., actor) are usually not linked via P31 which generally only points to Q5 (human) in those cases.
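To show how instance counts could feed the jc measure, here is one possible reading of the normalization described above. It is an assumption, not the service's code: probabilities are taken as count / root count, information content as the negative log probability, and the distance through the root as IC(a) + IC(b), since the root has probability 1. The function name, arguments, and toy counts are hypothetical.

```python
import math


def jc_similarity(count_a, count_b, subsumer_counts, root_count):
    """Jiang-Conrath-style similarity, normalized by the distance via the root.

    subsumer_counts holds the instance counts of all most-specific common
    subsumers; the maximum similarity over them is returned.
    """
    def ic(count):
        # Information content: negative log of the class probability.
        return -math.log(count / root_count)

    ic_a, ic_b = ic(count_a), ic(count_b)
    # The root has IC 0, so the path through it costs IC(a) + IC(b).
    dist_via_root = ic_a + ic_b
    if dist_via_root == 0:
        return 1.0  # both nodes are the root itself
    best = 0.0
    for count_s in subsumer_counts:
        dist = ic_a + ic_b - 2 * ic(count_s)
        best = max(best, 1.0 - dist / dist_via_root)
    return best


# Toy counts (hypothetical): two classes with 10 instances each,
# one common subsumer with 100 instances, and a root with 1000.
print(jc_similarity(10, 10, [100], 1000))
```

Under these assumptions, a pair whose only common subsumer is the root gets a similarity of 0, and a more specific subsumer raises the value.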

Ontology-based measures such as class or jc generally capture the more restrictive ontological similarity between closely linked types, such as "film actor" and "television actor". However, they tend to fail for ontologically distant but semantically related concepts such as "cinema" and "actor". Embedding-based measures, in particular text, can capture such relatedness when desired. Aggregate measures such as topsim provide a mixture of ontological similarity and relatedness.

Deployed URLs:

KGTK similarity computations over DWD v2 (a subset of Wikidata developed for DARPA) are deployed at the following URLs:

Web API

Single pair similarity computation

The single pair API can be used to compute similarity for a single node pair with a single similarity measure. This API is deployed at the following URL:

It takes the following parameters:

  • q1: The first qnode for comparison, e.g., Q144 (dog)
  • q2: The second qnode for comparison, e.g., Q146 (house cat)
  • similarity_type: similarity type to be used, currently valid values are: [topsim, class, jc, complex, transe, text]

Examples

  1. https://kgtk.isi.edu/similarity_api?q1=Q144&q2=Q146&similarity_type=class

Result:

{ "similarity": 0.8480927733695294,
  "q1": "Q144",
  "q1_label": "dog",
  "q2": "Q146",
  "q2_label": "house cat" }

  2. https://kgtk.isi.edu/similarity_api?q1=Q144&q2=Q146&similarity_type=complex

Result:

{ "similarity": 0.6756600141525269,
  "q1": "Q144",
  "q1_label": "dog",
  "q2": "Q146",
  "q2_label": "house cat" }

  3. https://kgtk.isi.edu/similarity_api?q1=Q10800557&q2=Q16144339&similarity_type=text

Result:

{ "similarity": 0.8232139945030212,
  "q1": "Q10800557",
  "q1_label": "film actor",
  "q2": "Q16144339",
  "q2_label": "cinema" }

  4. https://kgtk.isi.edu/similarity_api?q1=Q40&q2=Q183&similarity_type=topsim

Result:

{ "similarity": 0.876457287986023,
  "q1": "Q40",
  "q1_label": "Austria",
  "q2": "Q183",
  "q2_label": "Germany" }
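The single pair requests shown in the examples can also be built programmatically. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, the base URL is the one used in the examples above, and the response is assumed to be a JSON object like the results shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://kgtk.isi.edu/similarity_api"  # URL taken from the examples
VALID_TYPES = {"topsim", "class", "jc", "complex", "transe", "text"}


def build_similarity_url(q1, q2, similarity_type, base_url=BASE_URL):
    """Return the request URL for one node pair and one similarity measure."""
    if similarity_type not in VALID_TYPES:
        raise ValueError(f"unknown similarity type: {similarity_type}")
    query = urlencode({"q1": q1, "q2": q2, "similarity_type": similarity_type})
    return f"{base_url}?{query}"


def fetch_similarity(q1, q2, similarity_type, timeout=30):
    """Call the API and return the decoded JSON result (needs network access)."""
    url = build_similarity_url(q1, q2, similarity_type)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Building the URL does not touch the network.
print(build_similarity_url("Q144", "Q146", "class"))
```

Calling fetch_similarity("Q144", "Q146", "class") would then send the first example request, provided the service is reachable.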

Bulk similarity computation

The bulk API can be used to compute similarities between multiple node pairs for one or more similarity measures. It is deployed at the following URL:

The bulk API requires a POST request that takes the following parameters:

  • an input file, which should be a TSV file with two columns, q1 and q2, listing the node pairs for which similarities should be computed (to limit CPU usage, at most 100 pairs are compared in a single request)
  • similarity_types: a comma-separated list of one or more valid similarity types (see above), or all, which computes all of them (the default).

Example input file test_file.tsv:

q1 q2
Q30 Q46
Q48352 Q30461
Q48352 Q132050
Q48352 Q14212
Q48352 Q30185
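An input file like the one above can be generated from a list of pairs. The sketch below is one way to do it, assuming tab-separated columns and splitting longer lists into batches that respect the 100-pair limit; the helper names are hypothetical.

```python
import csv
import io

MAX_PAIRS_PER_REQUEST = 100  # limit stated for the bulk API


def batch_pairs(pairs, batch_size=MAX_PAIRS_PER_REQUEST):
    """Split a sequence of (q1, q2) pairs into lists of at most batch_size."""
    pairs = list(pairs)
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


def pairs_to_tsv(pairs):
    """Render (q1, q2) pairs as TSV text with a q1/q2 header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["q1", "q2"])
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = [("Q30", "Q46"), ("Q48352", "Q30461"), ("Q48352", "Q132050")]
for batch in batch_pairs(pairs):
    # Each batch would be written to its own file and sent in its own request.
    print(pairs_to_tsv(batch), end="")
```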

Example Python code to send such a file to the API (pandas is not strictly required; it is used here only for output formatting):

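import json
import os

import pandas as pd
import requests

def call_semantic_similarity(input_file, url):
    """POST a TSV file of q1/q2 pairs to the bulk API and return a DataFrame."""
    file_name = os.path.basename(input_file)
    # Use a context manager so the file handle is closed after the upload.
    with open(input_file, mode='rb') as f:
        files = {'file': (file_name, f, 'application/octet-stream')}
        resp = requests.post(url, files=files, params={'similarity_types': 'all'})
    # Fail early on HTTP errors instead of trying to parse an error page.
    resp.raise_for_status()
    # resp.json() is expected to yield a JSON-encoded string here,
    # hence the second decoding step with json.loads.
    s = json.loads(resp.json())
    return pd.DataFrame(s)

url = 'https://kgtk.isi.edu/similarity_api'
df = call_semantic_similarity('test_file.tsv', url)
# Write the similarity table as tab-separated values.
df.to_csv('test_file_similarity.tsv', index=False, sep='\t')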

Example output file, which adds columns for the labels and the requested similarity types (minor edits for readability):

q1      q2       q1_label       q2_label        complex  transe  text   class  jc     topsim
Q30     Q46      United States  Europe          0.333    0.120   0.672  0.040  0.157  0.387
Q48352  Q30461   head of state  president       0.418    0.465   0.825  0.946  0.935  0.956
Q48352  Q132050  head of state  governor        0.651    0.701   0.809  0.486  0.654  0.794
Q48352  Q14212   head of state  prime minister  0.667    0.685   0.793  0.479  0.600  0.739
Q48352  Q30185   head of state  mayor           0.578    0.314   0.696  0.484  0.775  0.702

Nearest neighbor API

The nearest neighbor API can be used to compute the top-K most similar neighbors to a given node. It is deployed at the following URL:

It takes the following parameters:

  • qnode: the qnode to find nearest neighbors for
  • k: The number of nearest neighbors to return, default k = 5 (to limit CPU resources, at most 100 neighbors will be computed)
  • similarity_type: a valid similarity type to use for the nearest neighbor computation; currently only complex is supported, which is also the default

Examples

  1. https://kgtk.isi.edu/nearest-neighbors?qnode=Q41 # Q41 = Greece

Result:

[
  {
    "qnode": "Q35",
    "score": 7.171141147613525,
    "label": "Denmark",
    "sim": 0.8070391416549683
  },
  {
    "qnode": "Q414",
    "score": 6.726232528686523,
    "label": "Argentina",
    "sim": 0.8056117296218872
  },
  {
    "qnode": "Q37",
    "score": 7.542238712310791,
    "label": "Lithuania",
    "sim": 0.7704305648803711
  },
  {
    "qnode": "Q790",
    "score": 7.765163898468018,
    "label": "Haiti",
    "sim": 0.7616487145423889
  },
  {
    "qnode": "Q822",
    "score": 8.177531242370605,
    "label": "Lebanon",
    "sim": 0.7525420188903809
  }
]

Note that for complex, this API returns both the similarity to the source qnode (in the sim slot) and a score that comes from the FAISS nearest neighbor index. The score is the metric that was optimized when the index was trained, and it is only roughly inversely correlated with the computed similarity.

  2. https://kgtk.isi.edu/nearest-neighbors?qnode=Q42&k=3&similarity_type=complex # Q42 = Douglas Adams

Result:

[
  {
    "qnode": "Q202385",
    "score": 7.997315883636475,
    "label": "Arnold Wesker",
    "sim": 0.7826730012893677
  },
  {
    "qnode": "Q552025",
    "score": 8.355506896972656,
    "label": "John Christopher",
    "sim": 0.7731930613517761
  },
  {
    "qnode": "Q177984",
    "score": 9.097515106201172,
    "label": "Peter Sellers",
    "sim": 0.7573387622833252
  }
]
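The difference between the score and sim slots can be seen by ordering the Greece (Q41) result from the first example in two ways. This sketch copies the values from that result and assumes that a lower FAISS score means a closer neighbor, which is consistent with the "roughly inversely correlated" remark but is not stated explicitly.

```python
# Values copied from the Q41 (Greece) example result above.
neighbors = [
    {"qnode": "Q35", "score": 7.171141147613525, "sim": 0.8070391416549683},
    {"qnode": "Q414", "score": 6.726232528686523, "sim": 0.8056117296218872},
    {"qnode": "Q37", "score": 7.542238712310791, "sim": 0.7704305648803711},
    {"qnode": "Q790", "score": 7.765163898468018, "sim": 0.7616487145423889},
    {"qnode": "Q822", "score": 8.177531242370605, "sim": 0.7525420188903809},
]

# Rank by similarity, highest first.
by_sim = [n["qnode"] for n in sorted(neighbors, key=lambda n: n["sim"], reverse=True)]
# Rank by index score, lowest first (assumption: lower score = closer).
by_score = [n["qnode"] for n in sorted(neighbors, key=lambda n: n["score"])]

print(by_sim)    # ['Q35', 'Q414', 'Q37', 'Q790', 'Q822']
print(by_score)  # ['Q414', 'Q35', 'Q37', 'Q790', 'Q822']
```

The two orderings agree except for the first two entries, so the sim value is the safer one to use when ranking by similarity.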

Paths API

Important: the Paths API is not currently enabled.

Deployed URL: TBD

Parameters

  • source: The source qnode, start of the path
  • target: The target qnode, destination of the path
  • hops: Maximum number of hops between source and target qnodes. By default, 2. Maximum allowed: 4.

Examples

  1. https://dsbox02.isi.edu:8888/paths?source=Q76&target=Q30&hops=2 # Source: Obama, Target: United States

Result:

[
  [
    "Q76",
    "P102",
    "Q29552",
    "P17",
    "Q30"
  ],
  [
    "Q76",
    "P102",
    "Q29552",
    "P2541",
    "Q30"
  ],
  [
    "Q76",
    "P103",
    "Q1860",
    "P17",
    "Q30"
  ],
  [
    "Q76",
    "P1038",
    "Q2856335",
    "P27",
    "Q30"
  ],
  [
    "Q76",
    "P108",
    "Q131252",
    "P17",
    "Q30"
  ],
  [
    "Q76",
    "P108",
    "Q3483312",
    "P17",
    "Q30"
  ],
  [
    "Q76",
    "P108",
    "Q4537781",
    "P17",
    "Q30"
  ]
]
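Each path in the result above appears to alternate between qnodes and properties, starting and ending with a qnode. Under that assumption, a path can be split into (subject, property, object) triples as sketched below; the helper name is hypothetical.

```python
def path_to_triples(path):
    """Split an alternating node/property path into (subject, property, object) triples."""
    if len(path) < 3 or len(path) % 2 == 0:
        raise ValueError("expected node, property, node, ... with an odd length of at least 3")
    # Step over the path two items at a time: each hop reuses the previous object.
    return [(path[i], path[i + 1], path[i + 2]) for i in range(0, len(path) - 2, 2)]


# First path from the example result above.
first_path = ["Q76", "P102", "Q29552", "P17", "Q30"]
for triple in path_to_triples(first_path):
    print(triple)
```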

Docker Installation

To set up the KGTK Similarity service via Docker, please run the following commands.

  1. Make sure Docker and docker-compose are installed.

  2. Build the Docker image:

cd kgtk-similarity
docker build -t kgtk-similarity .

  3. Update this line in docker-compose.yaml:

- <LOCAL PATH TO KGTK RESOURCES DIR>:/src/resources

Replace <LOCAL PATH TO KGTK RESOURCES DIR> with the path to a local folder with KGTK Resources. If the resources are in /kgtk-similarity-resources, the above line becomes:

- /kgtk-similarity-resources:/src/resources

Please make sure that all the files are physically present in the folder. Symbolic links will not work.

  4. Create the Docker network named overlay:

docker network create overlay

  5. Run the Docker container:

docker-compose up

Please wait until a message similar to the following appears:

Loading FAISS index...
 * Serving Flask app 'application'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:6433
 * Running on http://172.18.0.3:6433
Press CTRL+C to quit

This can take a minute or two as the container loads the files required for the service. Press CTRL+C to stop the container.
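Instead of watching the log, a script can poll the service port until it accepts connections. This is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes that port 6433 from the log output above is reachable from the host. An open port is only a proxy for readiness and does not prove that all resources have finished loading.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=6433, timeout=180.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A call such as wait_for_port() after docker-compose up -d would block for up to three minutes and report whether the port became reachable.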

To run the container in the background:

docker-compose up -d

To stop the Docker container:

docker-compose down