[Similarity module] Add more similarity measurements

Question

FanwangM opened this issue 2 years ago · 8 comments

Implement the methods listed at https://vlachosgroup.github.io/AIMSim/implemented_metrics.html as a similarity module. Please add detailed documentation showing which similarity function corresponds to which distance function in scikit-learn or scipy.

One question I have: should we separate the similarity and distance measurements? Some measurements confuse me, e.g. the Tanimoto index of molecular fingerprints. I would see it as a distance, but that page treats it as a similarity (https://vlachosgroup.github.io/AIMSim/implemented_metrics.html). If we decide to distinguish them, we may need to split them into separate similarity and distance modules instead of one module.

@PaulWAyers @FarnazH

Answer 1 · 2023-05-30T12:21:37.000Z

As explained on Wikipedia there is a Tanimoto similarity and a Tanimoto distance. So both exist.

The easiest test is to compare an object to itself. Its similarity is greater than zero (often one) and the distance is zero.
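The self-comparison test can be sketched in a few lines. This is a minimal illustration for binary fingerprints, assuming the Tanimoto distance is taken as one minus the Tanimoto similarity (the Jaccard distance). The function names are hypothetical and not part of any package API.

```python
import numpy as np


def tanimoto_similarity(x, y):
    """Tanimoto (Jaccard) similarity of two binary vectors (hypothetical helper)."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    if union == 0:
        # Convention assumed here: two all-zero fingerprints count as identical.
        return 1.0
    return float(np.logical_and(x, y).sum() / union)


def tanimoto_distance(x, y):
    """Assumed distance form: 1 - similarity (the Jaccard distance)."""
    return 1.0 - tanimoto_similarity(x, y)


fp = np.array([1, 0, 1, 0, 1])
# Comparing an object to itself: similarity is one, distance is zero.
print(tanimoto_similarity(fp, fp), tanimoto_distance(fp, fp))
```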

I feel like it is better to add AIMSim as a dependency. Implementing 30+ methods is a lot of work.

We may wish to have a few basic methods implemented; the most common distance metrics and similarity measures are already available in scikit-learn:
(distances) sklearn.metrics.DistanceMetric
(similarities and divergences) sklearn.metrics.pairwise

I'd lead with interfacing to scikit-learn (I think we already did this in large part?) and then consider interfacing to AIMSim as a follow-up task.

I guess it is important to distinguish between similarities/affinities and distances/divergences. I'd suggest making sure that we have these distinguished, plus the "converter" between them.
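A "converter" of this kind could look roughly like the sketch below. Both functions and their names are hypothetical illustrations, not the package's actual API, and the two transformations shown are just common choices (they are not inverses of each other).

```python
import numpy as np


def similarity_to_distance(s, s_max=1.0):
    """Hypothetical converter: d = s_max - s, for similarities bounded by s_max."""
    s = np.asarray(s, dtype=float)
    return s_max - s


def distance_to_similarity(d):
    """Hypothetical converter: s = 1 / (1 + d), mapping d in [0, inf) to (0, 1]."""
    d = np.asarray(d, dtype=float)
    return 1.0 / (1.0 + d)


sim = np.array([[1.0, 0.5], [0.5, 1.0]])
dist = similarity_to_distance(sim)
# Self-similarity (the diagonal) maps to zero distance.
print(dist)
# Zero distance maps to similarity one under the second converter.
print(distance_to_similarity(np.array([0.0, 1.0, 3.0])))
```

Which transformation is appropriate depends on whether the similarity is bounded; the reversal form only makes sense when the maximum similarity is known.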

Answer 2 · 2023-06-02T11:23:58.000Z

Yes, we should make them distinguishable and as obvious as we can to avoid any ambiguity.

Answer 3 · 2023-06-20T00:49:46.000Z

Update: We decided not to include any wrappers around functionality in other packages, because they add overhead and an unnecessary dependency. Instead, we will showcase how our package works with other libraries in notebooks/tutorials.

Answer 4 · 2023-07-06T14:36:52.000Z

@ramirandaq will list the "key similarity measures" from https://vlachosgroup.github.io/AIMSim/implemented_metrics.html and we'll reimplement them.

Answer 5 · 2023-07-06T16:41:54.000Z

Of all the similarity indices we've tested, these are the "best ones". I'm including a sample implementation for the case in which they are calculated from binary fingerprints.

sim_indices.txt

Answer 6 · 2023-07-06T19:37:21.000Z

Thanks for sharing. I am copying @ramirandaq's code for readability.

import numpy as np

# Pairwise similarity indices calculated over binary fingerprints

def indicators(x, y):
    """Calculating base descriptors
    a : number of common on bits
    d : number of common off bits
    dis = b + c : 1-0 mismatches
    p : len of fingerprint
    Check Table S1 in the SI of https://link.springer.com/article/10.1186/s13321-021-00505-3#Sec21
    """
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p

# Indices
# BUB: Baroni-Urbani-Buser, Fai: Faith, Ja: Jaccard
# JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russell-Rao
# SM: Sokal-Michener, SSn: Sokal-Sneath n

x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

a, d, dis, p = indicators(x, y)

bub = ((a * d)**0.5 + a)/((a * d)**0.5 + a + dis)

fai = (a + 0.5 * d)/p

ja = (3 * a)/(3 * a + dis)

jt = a/(a + dis)

rt = (a + d)/(p + dis)

rr = a/p

sm = (a + d)/p

ss1 = a/(a + 2 * dis)

ss2 = (2 * (a + d))/(p + (a + d))
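To document which of these indices correspond to distance functions in scipy, a check along the following lines may help. It is a sketch that assumes scipy is installed and covers only the four indices for which I am confident of a direct counterpart in scipy.spatial.distance; the `pairs` dictionary is just an illustrative container.

```python
import numpy as np
from scipy.spatial import distance


def indicators(x, y):
    """Return (a, d, dis, p) for two binary fingerprints, as in the code above."""
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    return a, d, p - a - d, p


x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])
a, d, dis, p = indicators(x, y)

# scipy's boolean dissimilarities are meant for boolean input.
xb, yb = x.astype(bool), y.astype(bool)

# index name -> (similarity value, corresponding scipy dissimilarity)
pairs = {
    "JT": (a / (a + dis), distance.jaccard(xb, yb)),
    "RT": ((a + d) / (p + dis), distance.rogerstanimoto(xb, yb)),
    "RR": (a / p, distance.russellrao(xb, yb)),
    "SM": ((a + d) / p, distance.hamming(xb, yb)),
}

for name, (sim, dissim) in pairs.items():
    # For each pair, check that similarity and dissimilarity sum to one.
    print(name, sim, dissim, np.isclose(sim + dissim, 1.0))
```

For these four, the similarity is one minus the scipy dissimilarity, so either form can be recovered from the other.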

Answer 7 · 2023-07-09T18:35:51.000Z

Just to clarify, all of these are "bitwise". We have:
a = logical "and" between bitstrings; the intersection of the sets, if "1" or "on" means an element/feature is present.
d = logical "not or" (NOR) between bitstrings; {universe} - {union} of the sets, if "1" or "on" means an element is present. So these are "features that are not present in either set".
dis = logical "exclusive or" between bitstrings; {union} - {intersection}, if "1" or "on" means an element is present. So these are "features that are present in one item, but not present in the other".
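These definitions can be written directly with NumPy's bitwise operators on boolean arrays. The following is a small sketch using the same example fingerprints as the code earlier in the thread; it counts "off in both" positions as the complement of the union.

```python
import numpy as np

x = np.array([1, 0, 1, 0, 1], dtype=bool)
y = np.array([1, 1, 1, 0, 0], dtype=bool)

a = np.sum(x & y)      # on in both: intersection
d = np.sum(~(x | y))   # off in both: universe minus union
dis = np.sum(x ^ y)    # on in exactly one: union minus intersection
p = x.size

# The three counts partition the fingerprint positions.
assert a + d + dis == p
print(a, d, dis, p)
```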

As Ramon notes, most of these are just one-line formulas. For things that aren't "logical", there are obviously more complicated forms of similarity, though most will be (some sort of) Mahalanobis-distance-related function.

Answer 8 · 2024-08-20T21:35:02.000Z

@marco-2023, please:

Rename https://github.com/theochem/Selector/blob/main/selector/similarity.py to measures/similarity.py
Move diversity.py and convertor.py to the measures module.
Implement any similarity measures and tests your heart desires (thanks!)