ifaa
is a robust approach to make inference on the association of covariates with the absolute abundance (AA) of microbiome in an ecosystem. It can be also directly applied to relative abundance (RA) data to make inference on AA because the ratio of two RA is equal ratio of their AA. This algorithm can estimate and test the associations of interest while adjusting for potential confounders. High-dimensional covariates are handled with regularization. The estimates of this method have easy interpretation like a typical regression analysis. High-dimensional covariates are handled with regularization and it is implemented by parallel computing. False discovery rate is automatically controlled by this approach. Zeros do not need to be imputed by a positive value for the analysis. The IFAA package also offers the 'MZILN' function for estimating and testing associations of abundance ratios with covariates.
The following mixed effect model is used for the association:
where
-
$\mathcal{Y}_i^k$ is the AA of taxa$k$ in subject$i$ in the entire ecosystem. -
$X_i$ is the covariate of interest -
$W_i$ is the confounder. -
$Z_i$ is the design matrix for random effects. -
$\epsilon_i^k$ is the random error. -
$\beta^k$ is the regression coefficients that will be estimated and tested with theIFAA()
function.
The challenge in microbiome analysis is that we can not oberve
To install, type the following command in Python console:
pip install ifaa
The package could be also installed from GitHub using the following code:
pip install git+https://github.com/lovestat/ifaa
Most of the time, users just need to feed the first five inputs to the function: MicrobData
, CovData
, linkIDname
, testCov
and ctrlCov
. All other inputs can just take their default values. Below are all the inputs of the functions
-
MicrobData
: Microbiome data matrix containing microbiome absolute abundance or relative abundance with each row per sample and each column per taxon/OTU/ASV (or any other unit). The data matrix can have zero-valued data points. It should also contain an id variable to be linked with the id variable in the covariates data:CovData
. This argument can also take file directory path. For example,MicrobData="C://...//microbiomeData.tsv"
. -
CovData
: Covariates data matrix containing covariates and confounders with each row per sample and each column per variable. Any categorical variable should be converted into dummy variables in this data matrix unless it can be treated as a continuous variable. It should also contain an id variable to be linked with the id variable in the microbiome data:MicrobData
. This argument can also take file directory path. For example,CovData="C://...//covariatesData.tsv"
. -
linkIDname
: The common variable name of the id variable in bothMicrobData
andCovData
. The two data sets will be merged by this id variable. -
testCov
: Covariates that are of primary interest for testing and estimating the associations. It corresponds to$X_i$ in the equation. Default isNULL
which means all covariates aretestCov
. -
ctrlCov
: Potential confounders that will be adjusted in the model. It corresponds to$W_i$ in the equation. Default isNULL
which means all covariates except those intestCov
are adjusted as confounders. -
testMany
: This takes logical valueTrue
orFalse
. IfTrue
, thetestCov
will contain all the variables inCovData
providedtestCov
is set to beNULL
. The default value isTrue
which does not do anything iftestCov
is notNULL
. -
ctrlMany
: This takes logical valueTrue
orFalse
. IfTrue
, all variables excepttestCov
are considered as control covariates providedctrlCov
is set to beNULL
. The default value isFalse
. -
nRef
: The number of randomly picked reference taxa used in phase 1. Default number is40
. -
nRefMaxForEsti
: The maximum number of final reference taxa used in phase 2. The default is2
. -
refTaxa
: A vector of taxa names. These are reference taxa specified by the user to be used in phase 1 if the user believe these taxa are indepenent of the covariates. If the number of reference taxa is less than 'nRef', the algorithm will randomly pick extra reference taxa to make up 'nRef'. The default isNULL
since the algorithm will pick reference taxa randomly. -
adjust_method
The adjusting method used for p value adjustment. Default is "BY" for dependent FDR adjustment. It can take any adjustment method for p.adjust function in R. -
fdrRate
: The false discovery rate for identifying taxa/OTU/ASV associated withtestCov
. Default is0.15
. -
paraJobs
: IfsequentialRun
isFalse
, this specifies the number of parallel jobs that will be registered to run the algorithm. If specified asNULL
, it will automatically detect the cores to decide the number of parallel jobs. Default isNULL
. -
bootB
: Number of bootstrap samples for obtaining confidence interval of estimates in phase 2 for the high dimensional regression. The default is500
. -
standardize
This takes a logical valueTrue
orFalse
. IfTrue
, the design matrix for X will be standardized in the analyses and the results. Default isFalse
. -
sequentialRun
: This takes a logical valueTrue
orFalse
. Default isFalse
. This argument could be useful for debug. -
refReadsThresh
: The threshold of proportion of non-zero sequencing reads for choosing the reference taxon in phase 2. The default is0.2
which means at least 20% non-zero sequencing reads. -
taxDropThresh
: The threshold of number of non-zero sequencing reads for each taxon to be dropped from the analysis. The default is0
which means taxon without any sequencing reads will be dropped from the analysis. -
SDThresh
: The threshold of standard deviations of sequencing reads for been chosen as the reference taxon in phase 2. The default is0.05
which means the standard deviation of sequencing reads should be at least0.05
in order to be chosen as reference taxon. -
SDquantilThresh
: The threshold of the quantile of standard deviation of sequencing reads, above which could be selected as reference taxon. The default is0
. -
balanceCut
: The threshold of the proportion of non-zero sequencing reads in each group of a binary variable for choosing the final reference taxa in phase 2. The default number is0.2
which means at least 20% non-zero sequencing reads in each group are needed to be eligible for being chosen as a final reference taxon. -
seed
: Random seed for reproducibility. Default is1
. It can be set to be NULL to remove seeding.
The estimation results are saved in the following lists:
-
sig_results
: A list containing estimating results that are statistically significant. -
full_results
: A list containing all estimating results. NA denotes unestimable.
The covariates data used in the analyses including testCov
and ctrlCov
is saved in the following object:
covariatesData
: A dataset containing covariates and confounders used in the analyses
The example datasets dataM
and dataC
are included in the package. They could be accessed by:
import numpy as np
from loadData import *
Both the microbiome data dataM
and the covariates data dataC
contain 40 samples (i.e., 40 rows).
-
dataM
contains 60 taxa with absolute abundances and these are gut microbiome. -
dataC
contains 3 covariates.
Next we analyze the data to test the association between microbiome and the variable "v1"
while adjusting for the variables (potential confounders) "v2"
and "v3"
.
>>> res = IFAA(load_dataM().iloc[:,:],
... load_dataC().iloc[:,:],
... testCov = ['v1'],
... ctrlCov = ['v2', 'v3'],
... paraJobs = 4,
... linkIDname="id",
... refTaxa = ["rawCount" + str(i + 1) for i in range(40)],
... bootB = 100,
... sequentialRun=False)
In this example, we are only interested in testing the associations with "v1"
which is why testCov=c("v1")
. The variables "v2" and "v3"
are adjusted as potential confounders in the analyses. The final analysis results are saved in the list sig_results
:
res['sig_results']
# {'v1': estimate SE est CI low CI up adj p-value
# rawCount18 0.028144 0.005027 0.018292 0.037997 1.935964e-06
# rawCount36 0.029991 0.005009 0.020174 0.039809 2.873697e-07
# rawCount41 0.033031 0.005040 0.023152 0.042909 1.512894e-08}
The results found three taxa "rawCount18"
, "rawCount36"
, "rawCount41"
associated with "v1"
while adjusting for "v2" and "v3"
. The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to
The interpretation is that
- Every unit increase in
"v1"
is associated with approximately 2.8% increase in the absolute abundance of"rawCount18"
, approximately 2.9% increase in the absolute abundance of"rawCount36"
, and approximately 3.3% increase in the absolute abundance of"rawCount41"
in the entire gut ecosystem.
All the analyzed covariates including testCov
and ctrlCov
can be extracted using the object covariatesData
. The covariates data of the first 10 subjects can extracted as follows:
res_bin['covariatesData'].iloc[0:10, :]
# id x1 x2 x3 x4 x5 x6 x7
# 0 1 58.069691 0 0 0 0 -49.903757 -15.306430
# 1 2 25.965216 0 0 0 0 -68.588941 -23.109922
# 2 3 193.716251 0 0 0 0 124.401856 119.567468
# 3 4 72.156467 0 0 0 0 -98.485358 2.877972
# 4 5 98.062712 0 0 0 0 23.553576 -79.893161
# 5 6 83.094848 0 0 0 0 -116.958209 -107.641285
# 6 7 8.217154 0 0 0 0 -205.644800 -139.958481
# 7 8 36.169820 0 0 0 0 58.957085 26.890379
# 8 9 152.786131 0 0 0 0 162.609349 138.731954
# 9 10 41.621790 0 0 0 0 65.154267 59.974310
-
Zhigang Li, Lu Tian, A. James O'Malley, Margaret R. Karagas, Anne G. Hoen, Brock C. Christensen, Juliette C. Madan, Quran Wu, Raad Z. Gharaibeh, Christian Jobin, Hongzhe Li (2020) IFAA: Robust association identification and Inference For Absolute Abundance in microbiome analyses. arXiv:1909.10101v3
-
Zhigang Li, Katherine Lee, Margaret Karagas, Juliette Madan, Anne Hoen, James O’Malley and Hongzhe Li (2018 ) Conditional regression based on a multivariate zero-inflated logistic normal model for modeling microbiome data. Statistics in Biosciences 10(3):587-608