theochem/Selector

[Similarity module] Add more similarity measurements

FanwangM opened this issue · 8 comments

Implement methods listed in as similarity module https://vlachosgroup.github.io/AIMSim/implemented_metrics.html. Please add detailed documentation to show which similarity functions is corresponding to which distance functions in scikit-learn or scipy.

One question I have shall we separate the similarity and distance measurements? I get confused by some measurements, e.g. Tanimoto index of molecule fingerprints. I would see it as a distance, but they treated it as a similarity, https://vlachosgroup.github.io/AIMSim/implemented_metrics.html. If we decide to distinguish them, we may need to make them into similarity and distance modules instead of one module.

@PaulWAyers @FarnazH

As explained on Wikipedia there is a Tanimoto similarity and a Tanimoto distance. So both exist.

The easiest test is to compare an object to itself. Its similarity is greater than zero (often one) and the distance is zero.

I feel like it is better to add AIMSim as a dependence. Implementing 30+ methods is a lot of work.

We may wish to have a few basic methods implemented; the most common distance metrics and similarity measures are already there in scikit-learn
(distances) sklearn.metrics.DistanceMetric
(similarities and divergences) sklearn.metrics.pairwise

I'd lead with interfacing to scikit-learn (I think we already did this in large part?) and then considering interfacing to AIMSim a follow-up task.

I guess it is important to distinguish between similarities/affinities and distances/divergences. I'd suggest making sure that we have these distinguished, plus the "converter" between them.

Yes, we should make them differentiable and be obvious as much as we can to avoid any ambiguity.

Update: We decided not to include any wrappers to support the functionality in other packages (reason: additional overhead and unnecessary dependency), instead, we showcase how our package works with other libraries in notebooks/tutorials.

@ramirandaq will list the "key similarity measures" from https://vlachosgroup.github.io/AIMSim/implemented_metrics.html and we'll reimplement them.

Of all the similarity indices we've tested, these are the "best ones". I'm including a sample implementation for the case in which they are calculated from binary fingerprints.

sim_indices.txt

Thanks for sharing. I am copying @ramirandaq 's code for readibility.

import numpy as np

# Pairwise similarity indices calculated over binary fingerprints

def indicators(x, y):
    """Calculating base descriptors
    a : number of common on bits
    d : number of common off bits
    dis = b + c : 1-0 mismatches
    p : len of fingerprint
    Check Table S1 in the SI of https://link.springer.com/article/10.1186/s13321-021-00505-3#Sec21
    """
    p = len(x)
    a = np.dot(x, y)
    d = np.dot(1 - x, 1 - y)
    dis = p - a - d
    return a, d, dis, p

# Indices
# BUB: Baroni-Urbani-Buser, Fai: Faith, Ja: Jaccard
# JT: Jaccard-Tanimoto, RT: Rogers-Tanimoto, RR: Russel-Rao
# SM: Sokal-Michener, SSn: Sokal-Sneath n

x = np.array([1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0])

a, d, dis, p = indicators(x, y)

bub = (a * d)**0.5 + a)/((a * d)**0.5 + a + dis)

fai = (a + 0.5 * d)/p

ja = (3 * a)/(3 * a + dis)

jt = a/(a + dis)

rt = (a + d)/(p + dis)

rr = a/p

sm =(a + d)/p

ss1 = a/(a + 2 * dis)

ss2 = (2 * (a + d))/(p + (a + d))

Just to clarify, all of these are "bitwise". We have:
a = logical "and" between bitstrings; intersection between sets if for each element, "1" or "on" means an element/feature is present.
d = logical "not and" between bitstrings; {universe} - {union} between sets if "1" or "on" means an element is present. So these are "features that are not present in either set"
dis = logical "exclusive or" between bitstrings. {union} - {intersection} if "1" or "on" means an element is present. So these are "features that are present in one item, but not present in the other".

As Ramon notes, most of these are just one-line formulas. For things that aren't "logical" obviously there are more complicated forms of similarity, though most will be (some sort of) mahalanobis distance-related function.

@marco-2023, please: