tpq/propr

Question about why not to use negative rho

Hi!

I have a quick question that I couldn't entirely figure out from the documentation/vignettes.

I created a propr object, as follows:

rho <- propr(counts.matrix, metric = "rho")

I was wondering what executing rho['>', 0.9] actually does. My interpretation is that it selects all pairs of features with abs(rho) > 0.9 and adds these features to the @pairs index. Is this correct?

In a naive implementation, I would interpret this as keeping all rows (or columns) in which at least one value has an absolute value > 0.9.

If this is correct, I might be running into a bug with the filtering. I'm running version 4.1.1 from CRAN. I just want to make sure I understand the behavior before going through the trouble of explaining the bug.

Thank you!
~Mauricio

tpq commented

Hi Mauricio,

Thanks for your interest in propr!

rho[">", 0.9] selects the pairs whose value of rho (not its absolute value) is > 0.9 and adds these pairs to the @pairs index.
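The difference between filtering on the signed value and on the absolute value can be illustrated with a toy example. This is a plain Python sketch that does not call propr; the pair names and rho values are made up for illustration.

```python
# Hypothetical rho values for three feature pairs (not output from propr).
rho = {("a", "b"): 0.95, ("a", "c"): -0.97, ("b", "c"): 0.10}

# Filtering on the signed value keeps only strongly positive pairs.
by_value = [pair for pair, r in rho.items() if r > 0.9]

# Filtering on the absolute value would also keep strongly negative pairs.
by_abs = [pair for pair, r in rho.items() if abs(r) > 0.9]

print(by_value)
print(by_abs)
```

Under the behavior described above, the signed filter corresponds to `by_value`, so the pair with rho = -0.97 is not selected.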

I would suggest being careful when studying negative values of rho. I have found that they are not always directly analogous to negative correlations, making their interpretation difficult.

PS: If you want more control of the analysis, I recently introduced some helper functions: getMatrix and getResults to extract simple matrices from the S4 object.

Please let me know if you still suspect a bug!!

Thanks,
Thom

Ahh, this explains it, thank you very much! I was just assuming that the filtering considered the absolute value, but it makes sense to have a more general select 👍

Could you please elaborate a little bit on what you mean by the negative values of rho and their relationship to negative correlations?

Also, following from this: does the bootstrapping for the choice of threshold consider only positive proportionality? Or is the thresholding valid for negative values of rho as well?

Thank you for the swift reply!
Mauricio

tpq commented

I'll try my best to explain this succinctly, but please note that I am still trying to understand it myself!

Let's start by looking at the formula:

[Screenshot: formula for rho_p]
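The screenshot itself is not reproduced in this text. Judging from the description that follows (a numerator equal to var(log(x / y)) and a denominator equal to the sum of the individual variances), the formula is presumably of the following form. The symbols A_x, A_y and r are notation assumed here, with r the per-sample reference (e.g., the geometric mean):

```latex
\rho_p(x, y) = 1 - \frac{\operatorname{var}(A_x - A_y)}{\operatorname{var}(A_x) + \operatorname{var}(A_y)},
\qquad A_x = \log\frac{x}{r}, \quad A_y = \log\frac{y}{r},
\qquad A_x - A_y = \log\frac{x}{y}
```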

For rho_p = 1, the numerator (which is var(log(x / y))) ought to approach zero. Only one thing causes this to happen: the ratio x / y is fixed across all samples (i.e., x and y are proportional). Proportional events are always correlated.
For rho_p = -1, the numerator ought to approach twice the denominator. This means that the variance of the log-ratio is twice the sum of the individual variances. It is very hard for me to imagine all events that satisfy this condition, but let us look empirically at the distribution of absolute correlations (y-axis) vs. proportionality (x-axis) (taken from Scientific Reports: 7(16252)):

[Figure: absolute correlation (y-axis) vs. proportionality (x-axis)]

We see on the right that proportional events (rho_p -> 1) are correlated events (rho -> 1). But, on the left, we see that the anti-proportional events (rho_p -> -1) are also correlated events (rho -> 1)!!!

This seems to happen when there is a strong compositional constraint on the data (e.g., we do not see it in Figure 5). Possibly, these strange events arise when the individual genes are correlated with the geometric mean center. This would shrink the denominator, driving rho_p -> -1, even if the numerator is quite small. These events seem to matter less for rho_p -> 1, because they would induce false negatives rather than false positives.
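The two limiting cases discussed above can be checked numerically. The following is a minimal sketch in plain Python (standard library only, not propr itself). It assumes rho_p = 1 - var(A_x - A_y) / (var(A_x) + var(A_y)) on log-ratio-transformed values; the helper `rho_p`, the simulated data and the constant 3.0 are all hypothetical.

```python
import math
import random
from statistics import variance


def rho_p(a, b):
    """Assumed form: 1 - var(a - b) / (var(a) + var(b)) for log-ratio vectors."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    return 1 - variance(diff) / (variance(a) + variance(b))


random.seed(0)
# Hypothetical per-sample log-ratios log(x / r) for one feature.
a_x = [random.gauss(0, 1) for _ in range(50)]

# Proportional pair: y = m * x, so log(y / r) = log(m) + log(x / r).
a_y_prop = [math.log(3.0) + v for v in a_x]

# Reciprocal pair: y / r = m * r / x, so log(y / r) = log(m) - log(x / r).
a_y_recip = [math.log(3.0) - v for v in a_x]

rho_prop = rho_p(a_x, a_y_prop)
rho_recip = rho_p(a_x, a_y_recip)
print(rho_prop, rho_recip)
```

Up to floating-point error, the proportional pair gives a value of 1 and the reciprocal pair gives -1. The sketch only covers these exact limiting cases; it does not reproduce the compositional effects seen in the figure.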

tpq commented

As for your other question, updateCutoffs is only valid for positive values of rho!! Sorry, I will clarify this in the documentation.

I'll also ping Ionas Erb to see if he has more to add about negative proportionality.

Hi Mauricio,

As Thom pointed out, for rho_p to be +1, the ratio between the variables has to be constant, i.e. y = m x for each sample, with m a positive constant (independent of the sample). Now for rho_p = -1 (i.e. numerator twice the denominator), one can show that y/r = m r/x, where r is the value of the reference (e.g., the geometric mean over the variables) in the sample. So y and x are reciprocal. As you can see, the reference does not cancel (as it does for rho_p being exactly +1), so these reciprocal events are not robust with respect to the choice of reference, which makes them less interesting perhaps.
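The step "one can show" can be sketched as follows, assuming rho_p = 1 - var(A_x - A_y) / (var(A_x) + var(A_y)) with A_x = log(x / r) and A_y = log(y / r) (this notation is an assumption, not taken from the thread):

```latex
\begin{aligned}
\rho_p = -1
  &\iff \operatorname{var}(A_x - A_y) = 2\left[\operatorname{var}(A_x) + \operatorname{var}(A_y)\right] \\
  &\iff \operatorname{var}(A_x) + \operatorname{var}(A_y) - 2\operatorname{cov}(A_x, A_y)
        = 2\operatorname{var}(A_x) + 2\operatorname{var}(A_y) \\
  &\iff \operatorname{var}(A_x) + \operatorname{var}(A_y) + 2\operatorname{cov}(A_x, A_y) = 0 \\
  &\iff \operatorname{var}(A_x + A_y) = 0 \\
  &\iff \log\frac{x}{r} + \log\frac{y}{r} = c \ \text{ in every sample} \\
  &\iff \frac{y}{r} = m\,\frac{r}{x}, \qquad m = e^{c} > 0 .
\end{aligned}
```

In the rho_p = +1 case the condition is var(A_x - A_y) = 0, where r cancels in the difference. Here the condition involves the sum A_x + A_y, where r does not cancel.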

Cheers,
Ionas

Thank you so much for the great replies!!! I need to sit down and mull over it 😄

Cheers,
Mauricio