theislab/chemCPA

How to calculate the uncertainty

tuln128 opened this issue · 2 comments

Dear authors,
Thank you very much for sharing such a nice tool as chemCPA.
In the article, you describe the calculation of the uncertainty as follows:

[image: uncertainty formula from the article]

If possible, could you please explain a little bit more about the definition (and/or calculation) of X, which is mentioned as "the normalised pathway prediction from the neighbours of drug i"?

Thank you very much in advance
Kind regards,

Hi @tuln128,

We compute the entropy as follows:

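import numpy as np
import pandas as pd

def entropy(column, base=None):
    # Relative frequency of each distinct label in the column
    vc = pd.Series(column).value_counts(normalize=True, sort=False)
    base = np.e if base is None else base  # natural log unless a base is given
    return -(vc * np.log(vc) / np.log(base)).sum()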

So for a drug i, we take its 10 nearest neighbours in the latent space and use their pathway labels as an indication of embedding quality. If all neighbours come from the same pathway, the entropy will be low and the prediction good. If they come from multiple pathways, we assume that there is some uncertainty about the drug embedding.
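The neighbour-based entropy described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names `label_entropy` and `neighbour_entropy`, the toy data, the Euclidean distance, and the choice to exclude the drug itself from its own neighbours are all assumptions made for the example.

```python
import numpy as np

def label_entropy(labels, base=2):
    """Shannon entropy of the label frequencies (zero when all labels agree)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(base))

def neighbour_entropy(embeddings, pathways, i, k=10):
    """Entropy of the pathway labels of the k nearest neighbours of drug i."""
    # Euclidean distance from drug i to every drug (assumed metric)
    dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
    dists[i] = np.inf  # assumption: the drug is not its own neighbour
    nearest = np.argsort(dists)[:k]
    return label_entropy(pathways[nearest])

# Hypothetical toy data: two well-separated clusters of 11 drugs each
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(11, 2)),
    rng.normal(5.0, 0.1, size=(11, 2)),
])
pathways = np.array(["A"] * 11 + ["B"] * 11)

# All 10 neighbours of drug 0 share pathway "A", so the entropy is zero
pure = neighbour_entropy(embeddings, pathways, i=0)

# A 5/5 split between two pathways gives the two-label maximum of 1 bit
mixed = label_entropy(np.array(["A"] * 5 + ["B"] * 5))
```

With base 2 the entropy is measured in bits, so ten neighbours spread evenly over two pathways score 1, and a neighbourhood drawn from a single pathway scores 0.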

You can also check it here in this notebook:

# # Uncertainty

Hi @MxMstrmn,

Thank you very much for the detailed explanation. I could figure out how H(X) is calculated from the link you shared.

According to the referenced code, the sum of distances is calculated before taking the log:

adata.obs.loc[adata.obs.drug == adata.obs.drug.iloc[i], "uncertainty"] = (
    1 / np.log(distances[i].sum()) * entropy(pathways, base=2)
)

which seems to be the opposite of the definition mentioned above. Could you please explain this difference a bit more, or correct me if I have misunderstood?
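To make the question concrete, here is a small numerical sketch. It assumes the difference in question is the order of the log and the sum; the article's formula is not reproduced in this thread, so that reading is an assumption. The distance values and the entropy value are hypothetical.

```python
import numpy as np

# Hypothetical distances from one drug to its 10 nearest neighbours
distances = np.array([0.8, 0.9, 1.1, 1.2, 1.3, 1.5, 1.6, 1.8, 2.0, 2.2])

# What the quoted line computes: sum the distances, then take the log
log_of_sum = np.log(distances.sum())

# The other ordering: take the log of each distance, then sum
sum_of_logs = np.log(distances).sum()

# Placeholder entropy value (in bits), only to show how the scaling enters
entropy_bits = 1.0
uncertainty = 1 / log_of_sum * entropy_bits
```

The two orderings generally give different values, so they are not interchangeable. Note also that `1 / np.log(s)` is negative when the summed distance `s` is below 1 and undefined when it equals 1, so the scaling depends on the scale of the distances.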

Thank you very much in advance, and sorry for bothering so much!
Kind regards,