surh/pbi

montecarlo test issue

Opened this issue · 9 comments

Hi Sur,

I tried to use your montecarlo function and the results are confusing.

If I used the relative abundance table as input to your function, all of my pair-wise p-values are 1: 
image

If I used enrichment/depletion status as input table: 1 as enriched => if mutant has significant higher abundance than WT, -1 as depleted => if mutant has significant lower abundance than WT, 0 as no significant change. And I made all of the value for WT as 0. The results showed that all p-value are 0.

image

Suggestions?

Thanks!

surh commented

I am not fully understanding what are the different inputs. If I remember correctly, the input for the monte_carlo_test function is a data frame with columns "Enrichment", "OTU", and "Genotype".

A minimal example is below, where I set genotypes A and C to be kinf of similar (it is not actually significant because there are only 3 OTUs).

> set.seed(12345)
> dat <- data.frame(OTU = rep(letters[1:3], each = 3),
+                   Genotype = rep(LETTERS[1:3], lenth.out = 9),
+                   Enrichment = c(1,0,1, -1, 1, -1, 0, 1, 0))
> dat
  OTU Genotype Enrichment
1   a        A          1
2   a        B          0
3   a        C          1
4   b        A         -1
5   b        B          1
6   b        C         -1
7   c        A          0
8   c        B          1
9   c        C          0
> res <- monte_carlo_test(dat = dat)
> res
      A B     C
A 0.000 1 0.165
B 1.000 0 1.000
C 0.165 1 0.000

It is technically possible that all your groups are too similar, in which case you would get all zero p-values, but I cannot say without seeing the actual input. You can try using the doughnut plot to see if there are any obvious taxonomic patterns.

Let's say if we have 3 mutants(A,B,C) and 1 wild type(D), is this the correct input data? Which all of the OTUs have been labelled as '0' in genotype D?

OTU Genotype Enrichment
a A 1
a B 0
a C 1
b A -1
b B 1
b C -1
c A 0
c B 1
c C 0
a D 0
b D 0
c D 0

My actual data has 8730 OTUs and only ~1000 show significant difference between wild type and mutant. I'm wondering should I filter some low abundance OTUs before perform the monte carlo test?

The DESeq package gave me lots of NA(p-value, logFC) for significant test, should I remove all of the OTUs which has at least 1 NA?

surh commented

If your WT is not labeled "Col-0", you need to change the default via the wt option, if your WT is 'D' it would be something like: monte_carlo_test(dat = dat, wt = 'D').

Abundance itself doesn't matter, but you should filter out all OTUs that are never significant (i.e. those that have 0 on the Enrichment column for all genotypes).

If the NAs are because of parameter unidentifiability, then you must reconsider your design matrix and make sure you understand what the remaining parameters mean. If the NAs are caused by numerical convergence issues (though this never happened to me with DEseq2) then you might want to filter some OTUs by abundance before running DEseq2. The monte carlo function assumes that the estimates of enrichment/depletion are accurate so if there are problems earlier it won't give you accurate results.

Hi Sur,

My input file is exactly same format as your example file and here is my code:

pval <- monte_carlo_test(dat = a, wt="D")

If I used your example data file as input:

OTU Genotype Enrichment
1 a A 1
2 a B 0
3 a C 1
4 b A -1
5 b B 1
6 b C -1
7 c A 0
8 c B 1
9 c C 0

and here are my results:

options(digits = 10)
pval
A B C
A 0 1 1
B 1 0 1
C 1 1 0

Looks like my R version only show 0 or 1 for monte test function? It can show normal p-value with other functions.

I went back and checked the results with DESeq, the NAs came from those OTUs only present in a few samples. I did some filtering now.

Full code:
`library(devtools)
library(AMOR)
options(digits = 10)
library(reshape2)

monte_carlo_test <- function(dat, wt = "Col-0",dist.method = "manhattan",
nperm = 1000){
dat$Enrichment <- as.numeric(as.character(dat$Enrichment))
dat <- subset(dat, Genotype != wt)
dat$Genotype <- droplevels(dat$Genotype)
mat <- acast(data = dat, formula = OTU ~ Genotype,
value.var = "Enrichment")

all <- mat
#all <- all[ , -which(colnames(all) == wt) ]

all.d <- dist(t(all),method=dist.method)
all.d <- as.matrix(all.d)
D.reps <- array(dim=c(nrow(all.d),nrow(all.d),nperm),
dimnames = list(colnames(all),
colnames(all),
1:nperm))
for (i in 1:nperm){
#i <- 1
all.rep <- apply(all,2,sample)
all.rep.d <- dist(t(all.rep),method=dist.method)
all.rep.d <- as.matrix(all.rep.d)
D.reps[,,i] <- all.d < all.rep.d
}
pval <- as.matrix(as.dist(1 - apply(D.reps,MARGIN=c(1,2),sum) / nperm))

}

a=read.table("~/Downloads/1.txt",header=T)
pval <- monte_carlo_test(dat = a,wt="D")`

suggestions?

surh commented

hmm... I don't know what is causing the difference. I would guess some version problem. I tried the code in rdrr.io and it works. Try copying and pasting the code below there to confirm:

library(reshape2)

monte_carlo_test <- function(dat, wt = "Col-0",dist.method = "manhattan",
                             nperm = 1000){
  dat$Enrichment <- as.numeric(as.character(dat$Enrichment))
  dat <- subset(dat, Genotype != wt)
  dat$Genotype <- droplevels(dat$Genotype)
  mat <- acast(data = dat, formula = OTU ~ Genotype,
               value.var = "Enrichment")
  
  all <- mat
  #all <- all[ , -which(colnames(all) == wt) ]
  
  all.d <- dist(t(all),method=dist.method)
  all.d <- as.matrix(all.d)
  D.reps <- array(dim=c(nrow(all.d),nrow(all.d),nperm),
                  dimnames = list(colnames(all),
                                  colnames(all),
                                  1:nperm))
  for (i in 1:nperm){
    #i <- 1
    all.rep <- apply(all,2,sample)
    all.rep.d <- dist(t(all.rep),method=dist.method)
    all.rep.d <- as.matrix(all.rep.d)
    D.reps[,,i] <- all.d < all.rep.d
  }
  pval <- as.matrix(as.dist(1 - apply(D.reps,MARGIN=c(1,2),sum) / nperm))
  
}

# a=read.table("~/Downloads/1.txt",header=T)
a <- data.frame(OTU = rep(letters[1:3], each = 3),
                Genotype = rep(LETTERS[1:3], lenth.out = 9),
                Enrichment = c(1,0,1, -1, 1, -1, 0, 1, 0))

pval <- monte_carlo_test(dat = a,wt="D")
pval

Also make sure you have the right versions loaded. Here is my session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=es_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_US.UTF-8        LC_COLLATE=es_US.UTF-8     LC_MONETARY=es_US.UTF-8   
 [6] LC_MESSAGES=es_US.UTF-8    LC_PAPER=es_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.3

loaded via a namespace (and not attached):
[1] compiler_3.6.1 magrittr_1.5   plyr_1.8.4     tools_3.6.1    yaml_2.2.0     Rcpp_1.0.2     stringi_1.4.3  stringr_1.4.0 

Looks like this is an aggregate problem which caused by acast function. I had exactly same problem as this post https://stackoverflow.com/questions/30463591/r-reshape2-aggregation-function-missing-defaulting-to-length

Can you help me to modify your function?

surh commented

try with

monte_carlo_test <- function(dat, wt = "Col-0",dist.method = "manhattan",
                             nperm = 1000){
  
  dat$Enrichment <- as.numeric(as.character(dat$Enrichment))
  dat <- subset(dat, Genotype != wt)
  dat$Genotype <- droplevels(dat$Genotype)
  mat <- acast(data = dat, formula = OTU ~ Genotype,
               value.var = "Enrichment",
               fun.aggregate = identity,
               fill = 0)
  
  all <- mat
  #all <- all[ , -which(colnames(all) == wt) ]
  
  all.d <- dist(t(all),method=dist.method)
  all.d <- as.matrix(all.d)
  D.reps <- array(dim=c(nrow(all.d),nrow(all.d),nperm),
                  dimnames = list(colnames(all),
                                  colnames(all),
                                  1:nperm))
  for (i in 1:nperm){
    #i <- 1
    all.rep <- apply(all,2,sample)
    all.rep.d <- dist(t(all.rep),method=dist.method)
    all.rep.d <- as.matrix(all.rep.d)
    D.reps[,,i] <- all.d < all.rep.d
  }
  pval <- as.matrix(as.dist(1 - apply(D.reps,MARGIN=c(1,2),sum) / nperm))
  
}

it will fail if you have duplicated OTU-Genotype pairs

Your example file works well in my r now, not matter with your old or modified function.

However, my actual data still gave me a matrix with all 0s.

What I tried next is modify the example input data and here are the two examples:

monte_carlo_test <- function(dat, wt = "Col-0",dist.method = "manhattan"nperm = 1000){
dat$Enrichment <- as.numeric(as.character(dat$Enrichment))
dat <- subset(dat, Genotype != wt)
dat$Genotype <- droplevels(dat$Genotype)
mat <- acast(data = dat, formula = OTU ~ Genotype,value.var = "Enrichment",fun.aggregate =identity,fill = 0)
all <- mat
all.d <- dist(t(all),method=dist.method)
all.d <- as.matrix(all.d)
D.reps <- array(dim=c(nrow(all.d),nrow(all.d),nperm), dimnames = list(colnames(all),colnames(all), 1:nperm))
for (i in 1:nperm){
#i <- 1
all.rep <- apply(all,2,sample)
all.rep.d <- dist(t(all.rep),method=dist.method)
all.rep.d <- as.matrix(all.rep.d)
D.reps[,,i] <- all.d < all.rep.d
}
pval <- as.matrix(as.dist(1 - apply(D.reps,MARGIN=c(1,2),sum) / nperm))
}

a <- data.frame(OTU = rep(letters[1:3], each = 3), Genotype = rep(LETTERS[1:3], lenth.out = 9),Enrichment = c(0,-1,0, 0, 1, 0, 0, 1, 0))

pval <- monte_carlo_test(dat = a,wt="D")
pval
A B C
A 0 1 1
B 1 0 1
C 1 1 0

a <- data.frame(OTU = rep(letters[1:3], each = 3),
Genotype = rep(LETTERS[1:3], lenth.out = 9),
Enrichment = c(0,0,0, 0, 1, 0, 0, 0, 0))
pval <- monte_carlo_test(dat = a,wt="D")
pval
A B C
A 0 1 1
B 1 0 1
C 1 1 0

This is really wired to me! Let me know if it will help with my actual data. I can try to upload it.

I got the same result from rdrr.io with the two examples above, looks like the input file is super important to the test.