PhenoMeNal (Phenome and Metabolome aNalysis) - processing and analysis of molecular phenotype data generated by metabolomics applications

Permutation-based and closed-form-expression tools for the estimation of metabolome-wide significance level (MWSL) and corresponding effective number of tests for (correlated) metabolomics data.

Section 0: Import the R package

The MWSL R package allows for

Estimation of permutation-based MWSL and corresponding effective number of tests (ENT)
Estimation of closed-form-expression MWSL and corresponding effective number of tests (Meff)
Identification of differentially regulated metabolomics variates for a specific clinical outcome

devtools::install_github("AlinaPeluso/PhenoMeNal", subdir="MWSL")
library(MWSL)

Section 1: Exploratory analysis

We aim to investigate the association between human serum 1H NMR metabolic profiles and various clinical outcomes in the Multi-Ethnic Study of Atherosclerosis (MESA).

The reference paper considers three sets of NMR spectra data: (1) a standard water-suppressed one-dimensional spectrum (NOESY), and (2) a Carr-Purcell-Meiboom-Gill spectrum (CPMG), and (3) a lower resolution version of the CPMG data (BINNED). The BINNED version consists of M = 655 features, while the NOESY and CPMG contain M = 30,590 features. The BINNED data sample comprises of n = 3500 individuals, while the NOESY and CPMG data have n = 3,867 individuals.

To illustrate the package capabilities, here we reproduce the results for the BINNED version of the data.

data("MESA_binned")

Metabolomics variates

For this specific analysis the metabolomics variates are anonimised and simply referred to as V1,V2,...,V655.

features <- MESA_binned[,23:(ncol(MESA_binned)-1)]

Descriptive statistics of the metabolomics variates:

t(round(sapply(features, function(x) c(mean=mean(x),sd=sd(x),median=median(x),min=min(x),max=max(x))),2))

feature	mean	sd	median	min	max
V1	0.13	0.98	0.03	-2.44	5.33
V2	0.15	0.78	0.05	-1.66	3.73
V3	0.05	0.95	0.02	-3.17	3.83
V4	2.85	0.99	2.79	0.04	6.38
V5	1.75	0.99	1.73	-2.29	5.85
…	…	…	…	…	…
V651	0.04	1.01	0.03	-7.21	4.25
V652	0.01	1.01	0.04	-6.48	4.63
V653	0	1.01	0.03	-4.81	5.51
V654	0.03	1.01	0.03	-3.74	4.08
V655	0.02	1.01	0.03	-4.09	4.54

Fixed effects confounders

Briefly, the cohort includes participants (51% females, 49% males), aged 44-84 years, (mean age of 63 years) from four different ethnic groups: Chinese-American, African-American, Hispanic, and Caucasian, all recruited between 2000-2002 at clinical centres in the United States and free of symptomatic cardiovascular disease at baseline. Demographic, medical history, anthropometric, and lifestyle data, as well as serum samples were collected, together with information on diabetes, and lipid and blood pressure treatment.

MESA_binned$male <- ifelse(MESA_binned$sex<2,1,0)
confounders <- MESA_binned[,c("age","male","height","ethnicityH","ethnicityAA","ethnicityCA","smokingF","smokingC","ldl_chol","hdl_chol","sbp","bp_treatment","diabetes","lipids_treatment")]

Descriptive statistics of the clinical outcomes measures:

t(round(sapply(MESA_binned[,c(7,9:22,ncol(MESA_binned))], function(x) c(mean=mean(x),sd=sd(x),median=median(x),min=min(x),max=max(x))),3))

confounder	mean	sd	median	min	max
age	62.888	10.327	64	44	84
height	166.433	10.237	166.3	123.8	196.7
ethnicityC	0.385	0.487	0	0	1
ethnicityH	0.234	0.424	0	0	1
ethnicityAA	0.255	0.436	0	0	1
ethnicityCA	0.127	0.333	0	0	1
smokingN	0.503	0.5	1	0	1
smokingF	0.121	0.326	0	0	1
smokingC	0.376	0.485	0	0	1
ldl_chol	117.678	31.043	116	20	315
hdl_chol	51.3	14.42	49	21	133
sbp	126.922	21.54	124	77	218
bp_treatment	0.376	0.484	0	0	1
diabetes	0.135	0.342	0	0	1
lipids_treatment	0.167	0.373	0	0	1
male	0.489	0.5	0	0	1

Clinical outcomes measures

The outcomes of interest are glucose concentrations and the body mass index or BMI.

glucose <- MESA_binned[,1]; log_glucose <- MESA_binned[,2]; bmi <- MESA_binned[,7]; log_bmi <- MESA_binned[,8];
outcomes <- MESA_binned[,c(1:4)]

Descriptive statistics of the clinical outcomes measures:

t(round(sapply(MESA_binned[,c(1:4)], function(x) c(mean=mean(x),sd=sd(x),median=median(x),min=min(x),max=max(x))),2))

outcome	mean	sd	median	min	max
glucose	97.55	29.6	90	38	507
logGlucose	4.55	0.22	4.5	3.64	6.23
BMI	28.14	5.39	27.34	15.36	61.86
logBMI	3.32	0.18	3.31	2.73	4.12

Plot of the distributions of the clinical outcomes measures:

par(mfrow=c(2,4))
for (i in 1:ncol(outcomes)){hist(outcomes[,i],main=names(outcomes)[i],xlab=NULL)}
for (i in 1:ncol(outcomes)){boxplot(outcomes[,i],main=names(outcomes)[i],xlab=NULL)}

Section 2: Permutation-based MWSL and ENT estimation

MWSL::FWERperm performs the estimation of permutation-based metabolome-wide significance level (MWSL) and the corresponding effective number of tests (ENT). The procedure controls the FWER at the α level. The type I error rate (false-positive rate) is measured as the number of occurrences of having a p-value less or equal than the MWSL, that is when a true null hypothesis is being rejected.

Arguments:

outcome a vector of n data point values of a continuous (both symmetric and skewed), discrete binary (0/1) or count outcome, or a data frame with n observations and column variables time (or time1 and time2) and status for a time-to-event survival outcome.
features a data frame of n observations (rows) and M features e.g. metabolic profiles (columns).
confounders an optional data frame of n observations (rows) and P fixed effects confounders (columns). Default to confounders=NULL.
methods an optional string which can take values 'identity' if no transformation is applied to the data, or 'mN' (or 'mlogN') when the set of features is simulated via a multivariate Normal (or multivariate log-Normal) distribution. Default to methods='mN'.
n.permutation an optional numeric value. Default to n.permutation=10,000.
alpha an optional probability value. Default to alpha=0.05.
verbose an optional logical value which allows output some status messages while computing. Default to verbose=TRUE.

Outputs:

matPvals the matrix of p-values for the M features (columns) and the n.permutation (rows).
q the vector of minimum p-values of length n.permutation.
res the vector of result estimates:
- MWSL = metabolome-wide significance level (MWSL);
- MWSL_CI.up = upper value alpha%-confidence interval MWSL;
- MWSL_CI.low = lower value alpha%-confidence interval MWSL.
- ENT = effective number of tests (ENT);
- ENT_CI.up = upper value alpha%-confidence interval ENT;
- ENT_CI.low = lower value alpha%-confidence interval ENT;
- R.percent = ENT/M.
t1err.percent the estimated type I error (%).

Optimal performances in terms of computational time are achieved when the procedure runs on a multi-core computer as parallel computing is applied within the function to deal with the heaviest steps.

Run the function across the clinical outcomes measures:

methods <- c('identity','mN','mlogN')
mat <- matrix(NA,3,8)
colnames(mat) <- c('MWSL','MWSL_CI.up','MWSL_CI.low','ENT','ENT_CI.up','ENT_CI.low','R.percent','t1err.percent')
rownames(mat) <- methods
rmesa_FWERperm <- list(glucose=mat,log_glucose=mat,bmi=mat,log_bmi=mat)

rmesa_pval <- list(glucose=mat,log_glucose=mat,bmi=mat,log_bmi=mat)
rmesa_FWERperm <- list(glucose=mat,log_glucose=mat,bmi=mat,log_bmi=mat)
allres_mesa <- list()
for (j in 1:length(methods)){
  for (i in 1:ncol(outcomes)){
    rmesa <- FWERperm(outcome=outcomes[,i],
                      features=features,
                      confounders=confounders,
                      n.permutation=60,
                      method=methods[j],
                      verbose=F)
    allres_mesa[[3*(i-1)+j]] <- rmesa
    rmesa_FWERperm[[i]][j,1:7] <- rmesa$res
    rmesa_FWERperm[[i]][j,8] <- rmesa$t1err.percent
  }
}
df.names <- expand.grid(methods, names(outcomes))
names(allres_mesa) <- paste(df.names$Var1, df.names$Var2,sep='.')

Explore the results:

rmesa_FWERperm

glucose

	MWSL	MWSL_CI.up	MWSL_CI.low	ENT	ENT_CI.up	ENT_CI.low	R.percent	t1err.percent
identity	0.0000685	0.0000665	0.0000713	729.803	751.496	701.647	111.42	4.95
mN	0.0001430	0.0001380	0.0001508	349.737	362.19	331.658	53.39	4.97
mlogN	0.0001376	0.0001296	0.0001447	363.358	385.699	345.604	55.47	5.14

logGlucose

	MWSL	MWSL_CI.up	MWSL_CI.low	ENT	ENT_CI.up	ENT_CI.low	R.percent	t1err.percent
identity	0.0001018	0.0000982	0.0001053	491.344	508.917	474.867	75.01	5.06
mN	0.0001434	0.0001378	0.0001483	348.567	362.855	337.154	53.22	5.07
mlogN	0.0001392	0.0001310	0.0001442	359.09	381.818	346.816	54.82	4.98

BMI

	MWSL	MWSL_CI.up	MWSL_CI.low	ENT	ENT_CI.up	ENT_CI.low	R.percent	t1err.percent
identity	0.0001262	0.0001216	0.0001335	396.188	411.0372	374.544	60.49	5.18
mN	0.0001470	0.0001401	0.0001534	340.142	357.0032	325.845	51.93	5.20
mlogN	0.0001402	0.0001345	0.0001468	356.707	371.7413	340.712	54.46	5.00

logBMI

	MWSL	MWSL_CI.up	MWSL_CI.low	ENT	ENT_CI.up	ENT_CI.low	R.percent	t1err.percent
identity	0.0001313	0.0001266	0.0001377	380.688	394.8619	363.169	58.12	5.08
mN	0.0001512	0.0001447	0.0001567	330.715	345.6358	319.064	50.49	5.12
mlogN	0.0001356	0.0001296	0.0001410	368.722	385.7733	354.6	56.29	4.92

The permutation procedure provides strong control of the FWER at the α level set it to 5% while it also incorporates the joint dependence structure between the test statistics.

We can inspect the distribution of p-values corresponding to the case where the multivariate Normal distribution is employed to simulate the set of features

hist(allres_mesa[["mN.glucose"]][["matPvals"]],main="Plot p-values under the null",breaks=50,xlab=NULL)

The p-values have a uniform distribution as we have re-sampled under the true null hypothesis of no association.

We can also inspect the distribution of the minimum p-values corresponding to the case where the multivariate Normal distribution is employed to simulate the set of features

hist(allres_mesa[["mN.glucose"]][["q"]],main="Plot minimum p-values",breaks=150,xlab=NULL) 
op <- par(cex = 1.5); alpha=0.05
abline(v=rmesa_FWERperm$glucose[1,1],col="red",lwd=5)
abline(v=rmesa_FWERperm$glucose[2,1],col="blue",lwd=5)
abline(v=rmesa_FWERperm$glucose[3,1],col="brown",lwd=5)
abline(v=1-(1-alpha)^(1/ncol(features)),col="green",lwd=5)
abline(v=alpha/ncol(features),col="orange",lwd=5)
legend("topright",c('perm_id','perm_mN','perm_mlogN','Sidak','Bonferroni'),fill=c("red","blue","brown","green","orange"))

The procedure is less conservative than the Bonferroni or Sidak correction.

Plot of the ENT estimates from the permutation procedure:

df_rmesa_FWERperm <- do.call(rbind,rmesa_FWERperm)
df1_rmesa_FWERperm<- data.frame(
  outcome = c('glucose','glucose','glucose',
              'logGlucose','logGlucose','logGlucose',
              'BMI','BMI','BMI',
              'logBMI','logBMI','logBMI'),
  type = c('identity','multivariate Normal','multivariate log-Normal',
           'identity','multivariate Normal','multivariate log-Normal',
           'identity','multivariate Normal','multivariate log-Normal',
           'identity','multivariate Normal','multivariate log-Normal'),
  ENT = c(df_rmesa_FWERperm[,4]),
  ENT.ciUP = c(df_rmesa_FWERperm[,5]),
  ENT.ciLOW = c(df_rmesa_FWERperm[,6]))
(plot_res.MESA_co <- ggplot(data=df1_rmesa_FWERperm,aes(x=outcome,y=ENT)) +
    facet_grid(~ type) +
    geom_hline(yintercept=ncol(features)) +
    annotate("text",x='BMI',y=(ncol(features)+15),label='ANT=655') +
    geom_text(mapping=aes(label=round(ENT,0)),hjust=-.5)+
    geom_point(size=3) +
    geom_errorbar(aes(ymin=ENT.ciLOW,ymax=ENT.ciUP),size=1) +
    theme(text = element_text(size=20)) +
    theme(legend.position="bottom") +
    theme(axis.text.x = element_text(angle=30, hjust=1)) +
    theme(axis.text.y = element_blank()) +
    ggtitle("MESA_binned data - all clinical outcomes")
)

From the conventional permutation procedure applied to the BINNED data, when the real features are considered, there is instability in the estimation of the ENT across the different outcomes, and in particular the ENT estimate for glucose is above the ANT. When the feature data are simulated from a multivariate log-Normal or Normal distribution, the ENT estimates are stable across the different outcomes and remain bounded below the total number of features with an average ENT of 352 and an R ratio of 53.8%.

Section 3: Closed form expression eigenvalues-based MWSL and ENT estimation

The empirical method of computing the permutation test p-value is hampered by the fact that a very large number of permutation is required to correctly estimate small, and therefore interesting p-values. Thus, the MWSL::Meff allows for an efficient approximation alternative. To distinguish from the effective number of non-redundant variates from the permutation procedure which has been defined as ENT, here we refer to the estimate from this practical approximation approach as Meff. The approximation is based on the spectral decomposition of the correlation matrix of the metabolomics variates, and allows for the comparison of the proposed estimate with methods of interest from the genomics field which proposed a similar formulation based on the same concept.

Arguments:

features a data frame of n observations (rows) and M features e.g. metabolic profiles (columns).
methods a string with possible values ecorr (empirical correlation), or scorr (shrinkage correlation). Default to ecorr.
alpha an optional probability value. Default to alpha=0.05.
big.mat an optional logic value to be set to TRUE when dealing with a very large set of features. Default to FALSE.

Outputs:

Meff_Nyholt closed-form expression of the effective number of tests based on Nyholt (2004)
Meff_Liji closed-form expression of the effective number of tests based on Li & Ji (2005)
Meff_Gao closed-form expression of the effective number of tests based on Gao (2008)
Meff_Galwey closed-form expression of the effective number of tests based on Galwey (2009)
Meff_Bonferroni closed-form expression of the effective number of tests based on Bonferroni (1963)
Meff_Sidak closed-form expression of the effective number of tests based on Sidak (1967)
Meff_MWSL the proposed closed-form expression of the effective number of tests
res.Meff_MWSL the vector of result estimates:
- MWSL = metabolome-wide significance level (MWSL);
- MWSL_CI.up = upper value alpha%-confidence interval MWSL;
- MWSL_CI.low = lower value alpha%-confidence interval MWSL.
- ENT = effective number of tests (ENT);
- ENT_CI.up = upper value alpha%-confidence interval ENT;
- ENT_CI.low = lower value alpha%-confidence interval ENT;
- R.percent = ENT/M.

Run the function:

rmesa_Meff_ecorr <- Meff(features=features,
                         n.permutation=100000,
                         method='ecorr',
			 big.mat=FALSE,
                         alpha=0.05)

Explore the results:

mat.rmesa_Meff_ecorr <- rbind(Meff_Nyholt=rmesa_Meff_ecorr$Meff_Nyholt,
                              Meff_Liji=rmesa_Meff_ecorr$Meff_Liji,
                              Meff_Gao=rmesa_Meff_ecorr$Meff_Gao,
                              Meff_Galwey=rmesa_Meff_ecorr$Meff_Galwey,
                              Meff_Bonferroni=rmesa_Meff_ecorr$Meff_Bonferroni,
                              Meff_Sidak=rmesa_Meff_ecorr$Meff_Sidak,
                              Meff_MWSL=rmesa_Meff_ecorr$Meff_MWSL);
mat.rmesa_Meff_ecorr

Method	Estimate	R(%)
Meff_Nyholt	611	93%
Meff_Liji	226	35%
Meff_Gao	435	66%
Meff_Galwey	201	31%
Meff_Bonferroni	655	100%
Meff_Sidak	639	98%
Meff_MWSL	345	53%
ENT	352	54%

The closed-form Meff value is a good approximation of the ENT estimate from the permutation procedure.

The MWSL estimate can be access as

rmesa_Meff_ecorr$res.Meff_MWSL

MWSL	MWSL_CI.up	MWSL_CI.low	ENT_MWSL	ENT_MWSL_CI.up	ENT_MWSL_CI.low	R.percent
0.000145102	0.000147141	0.000142823	344.5863	350.0843	339.8103	52.61%

Section 4: Identification of outcome-specific differentially regulated metabolites

This section focus on identification of differentially regulated metabolomics variates directly linked to a specified outcome (e.g. disease risk). The procedure supports various outcome type i.e. continuous outcomes both symmetric and skewed, binary (0/1) outcome, countable outcome taking discrete values 0,1,2,3,$...$, as well as a survival time-to-event outcome.

Outcome type	Model	R function
continuous (symmetric)	OLS regression	stats::lm
continuous (skewed)	Median quantile regression	quantreg::rq(tau=.5)
binary (0/1)	Logistic regression	stats::glm(family="binomial")
count (equidispersed)	GLM Poisson regression	stats::glm(family="poisson")
count (overdispersed)	Negative Binomial regression	glm(family="negative.binomial")
survival (time-to-event)	Cox proportional hazards regression	survival::coxph

Once the appropriate analysis has been performed the metabolites of interest can be identified comparing the respective p-value to the the adjusted thresold (MWSL or Meff). When the raw p-value corresponding to a certain feature is smaller than the adjusted thresold we identify that metabolomic variate as significant. For the other metabolites we conclude that there is no association between the changes in the outcome variable and the shifts in these features.

Arguments:

outcome a vector of n data point values of a continuous (both symmetric and skewed), discrete binary (0/1) or count outcome, or a data frame with n observations and column variables time (or time1 and time2) and status for a time-to-event survival outcome.
features a data frame of n observations (rows) and M features e.g. metabolic profiles (columns).
confounders an optional data frame of n observations (rows) and P fixed effects confounders (columns). Default to confounders=NULL.
MWSL metabolome-wide significance level (MWSL) estimated via the permutation-based approach (see MWSL::FWERperm) or via the via the closed-form-expression approach (see MWSL::Meff)
alpha an optional probability value, default=0.05
vennPlot an optional logic value to be set to TRUE for the visualisation of the Venn-plot for the total count of differrentally regulated metabolomics variates when employing the Bonferroni, the Benjamini-Hochberg (FDR), and the MWSL (FWER) correction, respectively
verbose an optional logic value to suppress some output status messages. Default to TRUE.

Outputs:

res.DE_count total count of differrentally regulated metabolomics variates when employing the Bonferroni, the Benjamini-Hochberg (FDR), and the MWSL (FWER) correction, respectively.
res.DE_names names of the of differrentally regulated metabolomics variates when employing the Bonferroni, the Benjamini-Hochberg (FDR), and the MWSL (FWER) correction, respectively.
Venn_plot Venn-plot visualisation tool for the total count of differrentally regulated metabolomics variates when employing the Bonferroni, the Benjamini-Hochberg (FDR), and the MWSL (FWER) correction, respectively.

Run the function for all the clinical outcomes measures:

rmesa_DEtest <- rmesa_DEtest_count  <- list()
names(rmesa_DEtest) <- c('glucose','logGlucose','BMI','logBMI')
for (i in 1:ncol(outcomes)){
  res_DEtest <- DEtest(outcome=outcomes[,i],
                      features=features[,1:10],
                      confounders=confounders,
                      MWSL=0.000145102,
                      alpha=0.05,
                      vennPlot=TRUE)
  rmesa_DEtest[[i]]<- res_DEtest
  rmesa_DEtest_count[[i]]<- res_DEtest$res.DE_count
}

Explore the results:

A Venn plot is generated for the visualisation of the number of identified differentially regulated metabolomics variates when employing the Bonferroni, the Benjamini-Hochberg (BH, FDR), and the MWSL (FWER) correction.

The counts of the differentially regulated metabolomics variates for each clinical outcomes can be accessed as

res.mesa_DEtest_count <- do.call(rbind, rmesa_DEtest_count)
rownames(res.mesa_DEtest_count) <- c('glucose','logGlucose','BMI','logBMI')
res.mesa_DEtest_count

	FWER.Bonf	FWER.MWSL	FDR.BH
glucose	185	198	352
logGlucose	191	197	357
BMI	124	131	200
logBMI	124	130	210

The FWER approach based on the MWSL estimation allows for the identification of a number of differentially regulated variates less than the too liberal BH correction and greater than the highly conservative Bonferroni correction. The number of differentially regulated variates is stable when considering transformations of the clinical outcome measures.

The list of differentially regulated metabolomics variates identified via the MWSL approach for a specific clinical outcome e.g. glucose can be accessed as

rmesa_DEtest[["glucose"]][["res.DE_names"]][["FWER.MWSL"]]

...

V596

V597

V610

V618

V619