Question about when to use CLR or ILR

Hi Thom,
I remember listening to your bioformatics chat podcast (https://bioinformatics.chat/propr) a while back and recently dove into the literature on several compositional data analysis reviews/methods/papers such as https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6755255/ and https://academic.oup.com/bioinformatics/article/34/16/2870/4956011 where you just so happened to be the leading author! First off, I just want to thank you for publishing on this topic because I have been actively trying to incorporate this in my everyday analysis of environmental/human metagenomics and convince my lab that non-CODA aware methods should not be used for most (all?) NGS analysis.

Anyways, I had a few questions:

In IQLR, is the "interquartile range" (IQR) based on: (case 1) a consensus set from all the samples or (case 2) is it feature/component agnostic per sample?
For example the latter would be sample_A having {OTU_X, OTU_Y, and OTU_Z} in the IQR but sample_B might have {OTU_A, OTU_X, and OTU_Y} in the IQR. Would the IQLR transformation use {OTU_X, Y, and Z} for sample_A and {OTU_A, OTU_X, OTU_Y} in sample_B (case 1)? Would the components used for reference be only {OTU_X, OTU_Y} (case 2)? Or something completely different?
When using CODA methods for metagenomics, can we use CLR for the ENTIRE metagenome or do we need to group it by some predefined category? For example, let's say we (tried to) assign taxonomy to each ORF then propagated that up to contig (perhaps MAG/BIN in the best case), would we need to use CLR/IQLR on each "group" (e.g. species mapping) (case 1)? Or would it be sufficient to use CLR/IQLR on the global metagenome and non subset by any grouping (case 2)?
Why does Aitichison distance use CLR instead of ILR if the individual components do not matter? For example, if you have a matrix X with n=samples rows and m=components columns and you end up with a distance matrix that is (n,n)...wouldn't ILR be a better option?

Best,
-Josh

Hey Josh! Thank you for the kind words. Yes that podcast was some time ago; it was good fun to do, though I've learned a lot since then and sometimes wish I could cut a second take ^_^.

But anyway, here is a very warm welcome to the exciting world of CoDa!

These are all good questions, and I'll try to do my best to answer them here.

(1) The IQLR is a construct introduced by ALDEx2. I think it's a very cool idea, although it's not, strictly speaking, an orthodox CoDa method. The IQLR attempts to normalize the data, not unlike an effective library size normalization used by edgeR/DESeq2 (we discuss the similarities between CLR/IQLR and ELSN in the supplement of our Bioinformatics review). Like ELSN, a single reference is computed for the whole data set. So I don't think it would be considered "sample agnostic" as you say. Below is my implementation of it (as used in the propr package), which may make it more clear how exactly it works.

counts.clr <- apply(log(counts), 1, function(x){ x - mean(x) })
counts.var <- apply(counts.clr, 1, var)
quart <- stats::quantile(counts.var) # use features with unextreme variance
use <- which(counts.var < quart[4] & counts.var > quart[2])

(2) I would be inclined to apply CLR to each sample as produced from a single experiment. You can then fraction off the features into sub-groups after the fact, e.g., by clustering the features. That said, if your features are related by a phylogenic tree, you could first add similar features together before transforming the data. For example, when I analyze 16s data, I will CLR transform all species and do an analysis. Meanwhile, separately, I will add the species-level features to get genus-level features, then CLR transform the genera and do an analysis. Adding before the CLR is called amalgamation. (Note that adding after is actually multiplication because of the log-transform.)

One exception here is if you have multi-omics data, in which case I strongly recommend you apply the CLR to each data modality separately before integrating the analysis. FYI we've recently written specifically about multi-omics data integration here if that's your cup of tea https://www.biorxiv.org/content/10.1101/847475v1

(3) Very good question. In both cases, you do end up with the same NxN distance matrix. I'm not enough of a pure mathematician to speak as why they are the same. However, the Aitchison distance is also equivalent to the total distance between all pairwise log-ratios. Apparently, it's possible to refactor each form (i.e., the pairwise LR distance, the CLR distance, and the ILR distance) to be equivalent, though I admit I never tried. My rough intuition is that since each transformation represents an equivalent view of the entire data, albeit from a different perspective, the distance remains the same regardless of the transformation. (Note that, on the other hand, the ALR distance is actually different...) FYI this is a short and sweet article on compositional distance that I enjoyed reading https://www.researchgate.net/publication/226002937_Logratio_Analysis_and_Compositional_Distance

I hope this helps, and do let me know if you have some more questions!

Thanks for intro into CoDa! This seems to be relatively new concepts to mainstream informaticians but I feel like the CoDa methods will be industry standard soon. Hopefully sooner than later. The biggest hurdle will be convincing some of the older generation to abandon the scripts they've used for analysis for 15+ years.

That really helped a lot in understanding how IQLR works. Reading code in Python and R is SO much easier for me to understand than some of the statistics equations unfortunately. Essentially, we are looking at the: 1) variance of the CLR transformed compositions; 2) the components in the IQR are then used for the reference when we take the geometric mean; 3) then use this vector when we center the components again.

I have been doing the same with my 16S data (i.e. summing up per taxonomic level and then performing CLR on those components). Have you used any of these on metatranscriptomics? These methods seem to make perfect sense with marker gene or single organism systems but I'm wondering if any of the edge cases have been worked out for extremely complex compositions? Theoretically, it should work right?
I ran PCA on ILR and CLR and it's the same although CLR is a million times faster than ILR. "However, the Aitchison distance is also equivalent to the total distance between all pairwise log-ratios." This definitely seems to be the case from the crude PCA analysis I did. I wonder if IQLR is equivalent as well? I'll need to check this out.

Extra)

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004075

I'm trying to understand the Rho code from

propr/src/backend.cpp

Line 202 in 2bd7c44

// Function for rho

. All I know is that it works and it's fast af haha. I want to be able to describe it a little better so I can explain it to my advisor. I'll need to reread the Lovell paper and your subsequent papers again. Rho seems to be the most useful metric to serve as a stand in replacement to existing association-based network analysis like WGCNA

I do hope CoDa will grow in popularity. The reason I like it so much is that I think it brings a coherent theoretical framework to a field currently relying on ad hoc procedures.

As for your questions,

(1) Yes, this is true for me too! Even when I encounter a maths equation now, I try to break it down into pseudo-code to understand it :-)

(2) I don't see why it wouldn't work for meta-transcriptomics. All these data are produced from the same assay after all. I started out looking at RNA-Seq data, comparing edgeR/DESeq2 with ALDEx2, and finding a surprising amount of convergence https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2261-8
(which has since been clarified by my colleague Ionas in the supplement to our Bioinformatics review)
I've also benchmarked logistic regression for several types of data here (including meta-genomics) -- https://msystems.asm.org/content/5/2/e00230-19
and found consistent performance for CLR

Note though that the CLR is not a golden bullet. It has some nice properties, but it is a transformation not a normalization. The new features you work worth are the counts of the gene relative to the sample mean. They are not absolute counts in any way, shape, or form!

(3) The IQLR distance is not the same as the CLR/ILR distance. I think of the IQLR as a compromise between the CLR and the edgeR/DESeq2 approach. It's useful because it is similar to traditional methods, and so easy to explain to reviewers. However, it sort of invokes the same problematic assumptions used by edgeR/DESeq2 approach. These days I prefer balances and pairwise log-ratios for my own analyses (though I often still use IQLR/CLR when working with collaborators)

(4) C++ is fast, but I agree it is hard to read. I have saved the original (slow) base R functions, which I still use in my unit tests. Have a look at propr:::proprRho() function at https://github.com/tpq/propr/blob/master/R/9-deprecated.R
I implement it using a different form of the equation described in the Erb paper https://link.springer.com/article/10.1007/s12064-015-0220-8 at proposition 1 (i). I use it because I find it easier to interpret...

I think rho is a safe choice (and was ranked #1 by a independent benchmark using single-cell data! -- https://www.ncbi.nlm.nih.gov/pubmed/30962620)
Just be aware that, while positive rho indicates some kind of proportional association, I'd advise against taking negative rho too seriously.

(2) Ok great! I fear that I've been overthinking this issue. Whenever I see 1,000,000+ ORFs I start thinking twice about everything haha. Good to know that it's still applicable.

(3) I just calculated PCA for IQLR and it is different from PCA for ILR and CLR (as you mentioned and as expected).

These days I prefer balances and pairwise log-ratios for my own analyses (though I often still use IQLR/CLR when working with collaborators)

Are you talking about ILR and then splitting out the balances downstream?

(4)Thanks for the links to more human readable versions (one day I'll take a crack at learning C++). I'm going to suggest presenting this paper in lab meeting soon because it'll introduce some people in my lab to fundamental concepts that have been ignored by most "traditional" (for lack of a better word) methods.

I implement it using a different form of the equation described in the Erb paper https://link.springer.com/article/10.1007/s12064-015-0220-8 at proposition 1 (i). I use it because I find it easier to interpret...

Are the results for between the two methods the same or comparable?

Last question, I've been using propr::propr(X, metric="rho") where X is the abundance. If I already have clr computed, X_clr, is there a way to compute the pairwise rho matrix from X_clr using propr?

"Are you talking about ILR and then splitting out the balances downstream?"

I view balances as a subset of the ILR. The ILR requires an "orthonormal basis" to perform a transformation. There are many such bases! Balances satisfy this definition too, so one can think of balances as a more interpretable ILR.

selbal is a very cool package, but quite slow when you have 100s+ features.

https://msystems.asm.org/content/3/4/e00053-18

We recently published in this area too with an alternative that is faster for big data.

https://msystems.asm.org/content/5/2/e00230-19

"Are the results for between the two methods the same or comparable?"

I assume here you are asking about the two ways of calculating rho. Yes, they are equivalent. One can be derived from the other!

"Last question, I've been using propr::propr(X, metric="rho") where X is the abundance. If I already have clr computed, X_clr"

I've actually implemented exactly this just a few weeks ago! I made it for multi-omics problems where you want to CLR each data set separately (see https://www.biorxiv.org/content/10.1101/847475v1 for a discussion). Sorry, it's not documented yet, but here is a code snippet. Basically just set ivar = NA.

library(propr)
data(iris)
met.rel <- iris[,1:2]
mic.rel <- iris[,3:4]

# Analyze multi-omics
clr <- function(x) sweep(log(x), 1, rowMeans(log(x)), "-")
REL <- cbind(clr(met.rel), clr(mic.rel))
pr <- propr(REL, ivar = NA)

This was a fun thread to read, thanks!

When I CLR two relative abundance data (shotgun metagenomics and flow cytometry) and cbind them, I get the following error pr <- propr(bounded, ivar = NA, metric = "rho" ) Error in propr(bounded, ivar = NA, metric = "rho") : Data may not contain negative measurements..

The clr transformed data does have negative values, how do you suggest I go ahead?

PS: running propr:::lr2phs(bounded) independently works though.

Could you update propr with devtools::install_github('tpq/propr') and confirm whether you still get the same error?

Oh, I think ivar=NA toggle from issue #9 was never implemented.

Ah, I think it is implemented on GitHub only, not CRAN...

Could you update propr with devtools::install_github('tpq/propr') and confirm whether you still get the same error?

Thanks for taking a look!

Updated and the error still persists.

Thanks for checking. Does this code work?

library(propr)
data(iris)
met.rel <- iris[,1:2]
mic.rel <- iris[,3:4]

# Analyze multi-omics
clr <- function(x) sweep(log(x), 1, rowMeans(log(x)), "-")
REL <- cbind(clr(met.rel), clr(mic.rel))
pr <- propr(REL, ivar = NA)

Also please let me know what packageVersion("propr") says :-)

Also please let me know what packageVersion("propr") says :-)

Version 4.2.8 and snippet and my own CLR transformed data work on my local renv but not on the cluster.

But this will do, I can run them locally. Thank you so much !

Hm, my guess is that the cluster is not loading 4.2.8 correctly for some reason. In an older version of R, ivar = NA would call CLR by default, hence the error. I think that older version is still on CRAN, so maybe the cluster is pointing to the CRAN version?

I think you are right. I'll look into this, thank you for your help!:)