The KGTK Semantic Similarity system provides a number of different similarity computations over Wikidata, a browser-based GUI to interactively explore them, and Web-based APIs to access them programmatically.
The following similarity computations are currently supported:
- `class`: an ontology-based measure based on the Jaccard similarity of the superclass sets of two nodes, inversely weighted by the instance counts of the classes. With this measure, classes high up in the ontology with very high transitive instance counts are weighted lower than more specific classes with lower counts (see below for how we count instances).
- `jc`: an ontology-based measure using our interpretation of the Jiang-Conrath ontological distance (see https://arxiv.org/abs/cmp-lg/9709008). We use instance counts (the same as used for `class`) to compute probabilities and normalize to the distance through the `entity` node (`Q351201`) to get a similarity. If a node pair has multiple most-specific subsumers, the maximum similarity based on those is used.
- `complex`: an embedding-based measure using 100-dimensional ComplEx graph embeddings computed over the Wikidata knowledge graph. Similarity is computed as the cosine similarity between the embedding vectors of two nodes.
- `transe`: an embedding-based measure using 100-dimensional TransE graph embeddings computed over the Wikidata knowledge graph. Similarity is computed as the cosine similarity between the embedding vectors of two nodes.
- `text`: an embedding-based measure using 1024-dimensional text embeddings based on sentences generated by the KGTK `lexicalize` command using the following properties: `P31 P279 P106 P39 P1382 P373 P452`. The resulting sentences are then fed to RoBERTa-large to create embeddings. Similarity is computed as the cosine similarity between the embedding vectors of two nodes.
- `topsim`: computes top-similar regions for each node by enumerating nearest neighbors from embeddings and by exploring the ontology; the candidates are then ranked using an aggregate of the above similarity computations (by default, a simple average). Once the top-similar regions are available, similarity is computed as a weighted average of the similarities between the two nodes and their top-5 similars.
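To make the idea behind `class` more concrete, here is a minimal sketch of a Jaccard similarity in which each class is weighted inversely to its instance count. The weighting function, the toy superclass sets and the counts are all assumptions made for illustration; the weights the service actually uses are not specified here.

```python
import math

def class_weight(instance_count):
    # Assumed weighting: the more instances a class has, the lower its weight.
    return 1.0 / math.log(2.0 + instance_count)

def weighted_jaccard(supers1, supers2, counts):
    """Weighted Jaccard similarity of two superclass sets."""
    union = supers1 | supers2
    if not union:
        return 0.0
    shared = supers1 & supers2
    total_weight = sum(class_weight(counts[c]) for c in union)
    shared_weight = sum(class_weight(counts[c]) for c in shared)
    return shared_weight / total_weight

# Toy data (made up): transitive instance counts and superclass sets.
counts = {"entity": 1_000_000, "animal": 50_000, "pet": 2_000, "dog": 500, "cat": 400}
dog_supers = {"entity", "animal", "pet", "dog"}
cat_supers = {"entity", "animal", "pet", "cat"}

sim = weighted_jaccard(dog_supers, cat_supers, counts)
print(round(sim, 3))
```

In this toy setup the very general class "entity" contributes the least to the result, while the specific classes "dog" and "cat", which are not shared, pull the similarity down the most.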
All computations return values in the interval [0..1], with 0 meaning totally dissimilar and 1 meaning identical (negative cosine similarities are mapped to zero). However, similarities are not (yet) calibrated; that is, a value of 0.8 for `class` does not necessarily mean the same level of similarity as 0.8 for `text`.
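The mapping of negative cosine similarities to zero can be sketched as follows. This is an illustrative implementation in plain Python, not the service's code; returning 0 for a zero-length vector is an assumption.

```python
import math

def clamped_cosine_similarity(v1, v2):
    """Cosine similarity of two vectors, clamped to the interval [0, 1]."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as totally dissimilar.
        return 0.0
    # Negative values are mapped to zero; min() guards against rounding above 1.
    return min(1.0, max(0.0, dot / (norm1 * norm2)))

print(clamped_cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```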
Instance counts are computed using `P31` (instance of), `P39` (position held), `P106` (occupation) and transitive `P279` (subclass of) links. The reason for using positions and occupations in addition to `P31` is that these more descriptive classes (e.g., actor) are usually not linked via `P31`, which generally only points to `Q5` (human) in those cases.
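The counting scheme described above can be illustrated with a small sketch. It assumes that each node is counted at most once per class and that a class inherits the instances of all its transitive subclasses; the node and class names are made up, and the actual counting code may differ in its details.

```python
from collections import defaultdict

def transitive_instance_counts(instance_edges, subclass_edges):
    """Count distinct instances per class, including instances of subclasses.

    instance_edges: (node, class) pairs taken from P31, P39 or P106 links.
    subclass_edges: (subclass, superclass) pairs taken from P279 links.
    """
    parents = defaultdict(set)
    for sub, sup in subclass_edges:
        parents[sub].add(sup)

    instances = defaultdict(set)
    for node, cls in instance_edges:
        # Walk up the subclass hierarchy and credit every ancestor class.
        stack, seen = [cls], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            instances[current].add(node)
            stack.extend(parents.get(current, ()))
    return {cls: len(nodes) for cls, nodes in instances.items()}

# Toy data (hypothetical): "human" comes from P31, the actor classes from P106.
instance_edges = [
    ("alice", "human"), ("alice", "film actor"),
    ("bob", "human"), ("bob", "television actor"),
]
subclass_edges = [
    ("film actor", "actor"), ("television actor", "actor"),
    ("actor", "person"), ("human", "person"),
]
instance_counts = transitive_instance_counts(instance_edges, subclass_edges)
print(instance_counts)
```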
Ontology-based measures such as `class` or `jc` generally capture more restrictive ontological similarity between closely linked types, such as "film actor" and "television actor". However, they tend to fail for ontologically distant but semantically related concepts such as "cinema" and "actor". Embedding-based measures, on the other hand, in particular `text`, can capture such relatedness when desired. Aggregate measures such as `topsim` provide a mixture of ontological similarity and relatedness.
KGTK similarity computations over DWD v2 (a subset of Wikidata developed for DARPA) are deployed at the following URLs:
- GUI: https://kgtk.isi.edu/similarity
- Similarity Web API: https://kgtk.isi.edu/similarity_api
- Nearest neighbor Web API: https://kgtk.isi.edu/nearest-neighbors
The single pair API can be used to compute similarity for a single node pair with a single similarity measure. This API is deployed at https://kgtk.isi.edu/similarity_api and takes the following parameters:
- `q1`: the first qnode for comparison, e.g., Q144 (dog)
- `q2`: the second qnode for comparison, e.g., Q146 (house cat)
- `similarity_type`: the similarity type to be used; currently valid values are `topsim`, `class`, `jc`, `complex`, `transe` and `text`
https://kgtk.isi.edu/similarity_api?q1=Q144&q2=Q146&similarity_type=class
Result:
{ "similarity": 0.8480927733695294,
"q1": "Q144",
"q1_label": "dog",
"q2": "Q146",
"q2_label": "house cat" }
https://kgtk.isi.edu/similarity_api?q1=Q144&q2=Q146&similarity_type=complex
Result:
{ "similarity": 0.6756600141525269,
"q1": "Q144",
"q1_label": "dog",
"q2": "Q146",
"q2_label": "house cat" }
https://kgtk.isi.edu/similarity_api?q1=Q10800557&q2=Q16144339&similarity_type=text
Result:
{ "similarity": 0.8232139945030212,
"q1": "Q10800557",
"q1_label": "film actor",
"q2": "Q16144339",
"q2_label": "cinema" }
https://kgtk.isi.edu/similarity_api?q1=Q40&q2=Q183&similarity_type=topsim
Result:
{ "similarity": 0.876457287986023,
"q1": "Q40",
"q1_label": "Austria",
"q2": "Q183",
"q2_label": "Germany" }
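The request URLs shown above can also be built and called from Python. The sketch below uses only the standard library; the helper names are hypothetical, and `get_similarity` assumes the response body is a JSON object like the ones shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SIMILARITY_API = "https://kgtk.isi.edu/similarity_api"
VALID_TYPES = {"topsim", "class", "jc", "complex", "transe", "text"}

def build_similarity_url(q1, q2, similarity_type, base_url=SIMILARITY_API):
    """Build the GET URL for a single-pair similarity request."""
    if similarity_type not in VALID_TYPES:
        raise ValueError(f"unknown similarity type: {similarity_type}")
    query = urlencode({"q1": q1, "q2": q2, "similarity_type": similarity_type})
    return f"{base_url}?{query}"

def get_similarity(q1, q2, similarity_type, timeout=30):
    """Call the API and return the decoded result (requires network access)."""
    url = build_similarity_url(q1, q2, similarity_type)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

print(build_similarity_url("Q144", "Q146", "class"))
```

Calling `get_similarity("Q144", "Q146", "class")` would then issue the first request shown above, provided the service is reachable.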
The bulk API can be used to compute similarities between multiple node pairs for one or more similarity measures. It is deployed at https://kgtk.isi.edu/similarity_api.
The bulk API requires a POST request that takes the following parameters:
- an input file, which should be a `tsv` file with 2 columns, `q1` and `q2`, listing the node pairs for which similarities should be computed (to limit CPU resources, at most 100 pairs will be compared in a single request)
- `similarity_types`: a comma-separated list of one or more valid similarity types (see above), or `all`, which generates all of them (the default)
Example input file `test_file.tsv`:
q1 | q2 |
---|---|
Q30 | Q46 |
Q48352 | Q30461 |
Q48352 | Q132050 |
Q48352 | Q14212 |
Q48352 | Q30185 |
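Example Python code to send such a file to the API (`pandas` is not really required and is used here only for output formatting):

    import json
    import os

    import pandas as pd
    import requests


    def call_semantic_similarity(input_file, url):
        """POST a TSV file of node pairs to the bulk API and return a DataFrame."""
        file_name = os.path.basename(input_file)
        # Open the file in a context manager so the handle is always closed.
        with open(input_file, mode='rb') as f:
            files = {'file': (file_name, f, 'application/octet-stream')}
            resp = requests.post(url, files=files, params={'similarity_types': 'all'})
        # Fail early on HTTP errors instead of trying to parse an error page.
        resp.raise_for_status()
        # resp.json() yields a JSON-encoded string here, so decode it a second time.
        s = json.loads(resp.json())
        return pd.DataFrame(s)


    url = 'https://kgtk.isi.edu/similarity_api'
    df = call_semantic_similarity('test_file.tsv', url)
    df.to_csv('test_file_similarity.tsv', index=False, sep='\t')
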
Example output file, which adds columns for the requested similarity types (minor edits for readability):
q1 | q2 | q1_label | q2_label | complex | transe | text | class | jc | topsim |
---|---|---|---|---|---|---|---|---|---|
Q30 | Q46 | United States | Europe | 0.333 | 0.120 | 0.672 | 0.040 | 0.157 | 0.387 |
Q48352 | Q30461 | head of state | president | 0.418 | 0.465 | 0.825 | 0.946 | 0.935 | 0.956 |
Q48352 | Q132050 | head of state | governor | 0.651 | 0.701 | 0.809 | 0.486 | 0.654 | 0.794 |
Q48352 | Q14212 | head of state | prime minister | 0.667 | 0.685 | 0.793 | 0.479 | 0.600 | 0.739 |
Q48352 | Q30185 | head of state | mayor | 0.578 | 0.314 | 0.696 | 0.484 | 0.775 | 0.702 |
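Because a single bulk request compares at most 100 pairs, longer pair lists have to be split into several input files. The following sketch shows one way to do that; the helper name is hypothetical, and it assumes the input file starts with a `q1`/`q2` header row as in the example above.

```python
import csv
import io

MAX_PAIRS = 100  # per-request limit of the bulk API

def pairs_to_tsv_chunks(pairs, max_pairs=MAX_PAIRS):
    """Yield TSV strings with a q1/q2 header and at most max_pairs rows each."""
    for start in range(0, len(pairs), max_pairs):
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
        writer.writerow(["q1", "q2"])
        writer.writerows(pairs[start:start + max_pairs])
        yield buf.getvalue()

pairs = [("Q48352", "Q30461"), ("Q48352", "Q132050"), ("Q48352", "Q14212")]
# A small chunk size is used here only to make the splitting visible.
chunks = list(pairs_to_tsv_chunks(pairs, max_pairs=2))
print(len(chunks))
```

Each chunk can then be written to its own file and sent in a separate POST request.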
The nearest neighbor API can be used to compute the top-K most similar neighbors of a given node. It is deployed at https://kgtk.isi.edu/nearest-neighbors and takes the following parameters:
- `qnode`: the qnode to find nearest neighbors for
- `k`: the number of nearest neighbors to return; the default is `k = 5` (to limit CPU resources, at most 100 neighbors will be computed)
- `similarity_type`: a valid similarity type to use for the nearest neighbor computation; currently only `complex` is supported, which is also the default
https://kgtk.isi.edu/nearest-neighbors?qnode=Q41 # Q41 = Greece
Result:
[
{
"qnode": "Q35",
"score": 7.171141147613525,
"label": "Denmark",
"sim": 0.8070391416549683
},
{
"qnode": "Q414",
"score": 6.726232528686523,
"label": "Argentina",
"sim": 0.8056117296218872
},
{
"qnode": "Q37",
"score": 7.542238712310791,
"label": "Lithuania",
"sim": 0.7704305648803711
},
{
"qnode": "Q790",
"score": 7.765163898468018,
"label": "Haiti",
"sim": 0.7616487145423889
},
{
"qnode": "Q822",
"score": 8.177531242370605,
"label": "Lebanon",
"sim": 0.7525420188903809
}
]
Note that for `complex` this API returns both the similarity to the source `qnode` in the `sim` slot and a `score` that comes from the FAISS nearest neighbor index. The returned scores are the metric that was optimized when the index was trained; they are only roughly inversely correlated with the computed similarity measures.
https://kgtk.isi.edu/nearest-neighbors?qnode=Q42&k=3&similarity_type=complex # Q42 = Douglas Adams
Result:
[
{
"qnode": "Q202385",
"score": 7.997315883636475,
"label": "Arnold Wesker",
"sim": 0.7826730012893677
},
{
"qnode": "Q552025",
"score": 8.355506896972656,
"label": "John Christopher",
"sim": 0.7731930613517761
},
{
"qnode": "Q177984",
"score": 9.097515106201172,
"label": "Peter Sellers",
"sim": 0.7573387622833252
}
]
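Since `score` and `sim` are only loosely related, a client that cares about similarity may want to order results by `sim` explicitly. The sketch below builds a request URL and sorts a result list; the helper names are hypothetical, and the sample records are copied from the Q42 example above in shuffled order.

```python
from urllib.parse import urlencode

NN_API = "https://kgtk.isi.edu/nearest-neighbors"

def build_nn_url(qnode, k=5, similarity_type="complex", base_url=NN_API):
    """Build the GET URL for a nearest neighbor request."""
    if not 1 <= k <= 100:
        raise ValueError("k must be between 1 and 100")
    query = urlencode({"qnode": qnode, "k": k, "similarity_type": similarity_type})
    return f"{base_url}?{query}"

def rank_by_sim(neighbors):
    """Return the neighbors ordered by decreasing similarity."""
    return sorted(neighbors, key=lambda n: n["sim"], reverse=True)

neighbors = [
    {"qnode": "Q177984", "score": 9.097515106201172, "label": "Peter Sellers", "sim": 0.7573387622833252},
    {"qnode": "Q202385", "score": 7.997315883636475, "label": "Arnold Wesker", "sim": 0.7826730012893677},
    {"qnode": "Q552025", "score": 8.355506896972656, "label": "John Christopher", "sim": 0.7731930613517761},
]
ranked = rank_by_sim(neighbors)
print([n["label"] for n in ranked])
print(build_nn_url("Q42", k=3))
```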
!!! note Important: the Paths API is not currently enabled
Deployed URL: TBD
Parameters
- `source`: the source qnode, the start of the path
- `target`: the target qnode, the destination of the path
- `hops`: the maximum number of hops between the source and target qnodes. Default: 2. Maximum allowed: 4.
https://dsbox02.isi.edu:8888/paths?source=Q76&target=Q30&hops=2
# Source: Obama, Target: United States
Result:
[
[
"Q76",
"P102",
"Q29552",
"P17",
"Q30"
],
[
"Q76",
"P102",
"Q29552",
"P2541",
"Q30"
],
[
"Q76",
"P103",
"Q1860",
"P17",
"Q30"
],
[
"Q76",
"P1038",
"Q2856335",
"P27",
"Q30"
],
[
"Q76",
"P108",
"Q131252",
"P17",
"Q30"
],
[
"Q76",
"P108",
"Q3483312",
"P17",
"Q30"
],
[
"Q76",
"P108",
"Q4537781",
"P17",
"Q30"
]
]
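Judging from the example output, each path alternates between qnodes and properties, starting and ending with a qnode. Under that assumption, a path can be split into (subject, property, object) triples as in this sketch; the helper name is hypothetical.

```python
def path_to_triples(path):
    """Split an alternating node/property path into (subject, property, object) triples."""
    if len(path) < 3 or len(path) % 2 == 0:
        raise ValueError("expected an odd-length path with at least 3 elements")
    # Step by 2 so that the object of one triple is the subject of the next.
    return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]

path = ["Q76", "P102", "Q29552", "P17", "Q30"]
triples = path_to_triples(path)
print(triples)
```

For the first path in the result above this yields two triples, one per hop.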
To set up the KGTK Similarity service via Docker, run the following commands.
- Make sure `docker` and `docker-compose` are installed.
- Build the Docker image:

      cd kgtk-similarity
      docker build -t kgtk-similarity .

- Update this line in `docker-compose.yaml`:

      - <LOCAL PATH TO KGTK RESOURCES DIR>:/src/resources

  Replace `<LOCAL PATH TO KGTK RESOURCES DIR>` with the path to a local folder containing the KGTK resources. If the resources are in `/kgtk-similarity-resources`, the above line becomes:

      - /kgtk-similarity-resources:/src/resources

  Make sure that all the files are physically present in the folder; symbolic links will not work.
- Create the Docker network `overlay`:

      docker network create overlay

- Run the Docker container:

      docker-compose up

Please wait until a message similar to the following appears:
Loading FAISS index...
* Serving Flask app 'application'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:6433
* Running on http://172.18.0.3:6433
Press CTRL+C to quit
This can take a minute or two as the container loads the files required for the service. Press `CTRL+C` to stop the container.

To run the container in the background:

    docker-compose up -d

To stop the container:

    docker-compose down
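
When the container runs in the background, the startup message is not visible, so a script may need another way to tell when the service is ready. One option is to poll the service port until it accepts connections, as sketched below. The port 6433 is taken from the log output above; whether it is reachable from the host depends on the port mapping in `docker-compose.yaml`, which is an assumption here, and the helper name is hypothetical.

```python
import socket
import time

def wait_for_port(host, port, timeout=180.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Example usage (assumes the service port is published on localhost):
# ready = wait_for_port("127.0.0.1", 6433)
```

An open port only shows that the server is listening; it does not guarantee that all resources have finished loading.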