Multiple conditions for DMRs
Closed this issue · 18 comments
Dear Keegan,
I am working on whole genome bisulfit seq data from 5 different brain tumor types and I would like to investigate what these tumors have in common and what distinguishes them.
Does 'dmrseq' support the calculation of DMRs for multiple conditions, say Glioblastoma vs. Medulloblastoma vs. Normal Brain, or are only pairwise comparisons possible?
Since I was used to use DSS, I thought about simply adjusting this method and correct for multiple testing, but 'dmrseq' seems to be the way more appropriate method.
Thanks for the nice R-package!
-----------------------------------------------------
Edit: I checked your paper/ supp. material and code:
Concerning my question: It seems that #conditions > 1 is possible, but it is labeled with "experimental". Can it already be used?
Since I do not have the same amount of samples per condition, the code for 'dmrseq' does not seem to support it:
# check sampleSize is even in both conditions
sampleSize <- length(sampleIndex)/2
if (length(unique(table(testCov))) > 1 |
!(sum(table(testCov) == rep(sampleSize,
length(unique(testCov)))) == length(unique(testCov)))) {
stop(paste0("Error: testCov is not balanced. Need to specify an equal",
"number of samples at each level"))
}
Do you already have a version, which incorporates your enhancement?
-----------------------------------------------------
Best wishes from Heidelberg (Germany),
Chris
There is an open issue regarding your last question, so it seems to be unfinished: #6
Hi @ChrisAi89 and @LKremer,
Apologies for the late reply; seems my email notifications for issues were temporarily shut off.
Yes, support for multiple conditions is currently not fully implemented, but I will be working on testing out that functionality very soon. I'll do so first in the R-3.4
branch, and I'll leave a reply here once I've got something ready that may suit your needs.
Best,
Keegan
Hi @ChrisAi89 and @LKremer,
The functionality for multiple group comparisons has been added to dmrseq. You can obtain it immediately on the master branch and the R-3.4 branch (which exists to allow you to install without R-devel). This update will automatically propagate to the Bioconductor version next day or so.
Please try it out and let me know if you have any issues or suggestions!
Best,
Keegan
Hi @kdkorthauer,
I tried to run a multi-group comparison but unfortunately got a crash in the call to dmrseq::dmrseq():
...Chromosome chr1: Error in .subset_DelayedArray_by_1Dindex(x, user_Nindex[[1L]]) :
1D-style subsetting of a DelayedArray object only accepts a numeric subscript at the moment
Here is a minimal working example with traceback and sessionInfo that produces the error:
> require(bsseq)
> require(BiocGenerics)
> library(BiocParallel)
> library(dmrseq)
> # Read some example data from the bsseq package:
> infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
+ package = 'bsseq')
> # Use the data multiple times and pretend it's different samples:
> cpg <- BiocGenerics::combine(
+ bsseq::read.bismark(files = infile, sampleNames = "c1a", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c1b", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c1c", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c1d", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c2a", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c2b", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c2c", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c2d", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c3a", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c3b", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c3c", strandCollapse = F),
+ bsseq::read.bismark(files = infile, sampleNames = "c3d", strandCollapse = F)
+ )
Assuming file type is cov
[read.bismark] Reading file '/home/lukas/R/x86_64-pc-linux-gnu-library/3.4/bsseq/extdata/test_data.fastq_bismark.bismark.cov.gz' ... done in 0.1 secs
[read.bismark] Joining samples ... done in 0.1 secs
> # Define three groups/conditions
> pData(cpg)$celltype <- as.factor(c(rep("celltype1", 4),
+ rep("celltype2", 4),
+ rep("celltype3", 4)))
> pData(cpg)
DataFrame with 12 rows and 1 column
celltype
<factor>
c1a celltype1
c1b celltype1
c1c celltype1
c1d celltype1
c2a celltype2
... ...
c2d celltype2
c3a celltype3
c3b celltype3
c3c celltype3
c3d celltype3
> # Only keep loci that have at least 1 coverage in **all** samples
> covered_loci <- which( rowSums( getCoverage(cpg, type="Cov") == 0) == 0)
> cpg <- cpg[covered_loci, ]
> cpg
An object of type 'BSseq' with
2013 methylation loci
12 samples
has not been smoothed
All assays are in-memory
> # Run dmrseq
> dmrs <- dmrseq(bs = cpg, testCovariate = "celltype")
Performing a global test of H0: no difference among 3 groups (assuming the test covariate celltype is a factor).
Parallelizing using 2 workers/cores (backend: BiocParallel:MulticoreParam).
Computing on 1 chromosome(s) at a time.
Detecting candidate regions with coefficient larger than 0.1 in magnitude.
...Chromosome chr1: Error in .subset_DelayedArray_by_1Dindex(x, user_Nindex[[1L]]) :
1D-style subsetting of a DelayedArray object only accepts a numeric subscript at the moment
> traceback()
7: stop(wmsg("1D-style subsetting of a DelayedArray object only ",
"accepts a numeric subscript at the moment"))
6: .subset_DelayedArray_by_1Dindex(x, user_Nindex[[1L]])
5: mat[mat != 0]
4: mat[mat != 0]
3: estim(meth.mat = meth.mat, cov.mat = cov.mat, pos = pos, chr = chr,
design = design, coeff = coeff)
2: bumphunt(bs = bs, design = design, coeff = coeff, coeff.adj = coeff.adj,
minInSpan = minInSpan, minNumRegion = minNumRegion, cutoff = cutoff,
maxGap = maxGap, maxGapSmooth = maxGapSmooth, smooth = smooth,
bpSpan = bpSpan, verbose = verbose, parallel = parallel,
block = block, blockSize = blockSize, chrsPerChunk = chrsPerChunk,
fact = fact)
1: dmrseq(bs = cpg, testCovariate = "celltype")
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.4
[6] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
[11] dmrseq_1.1.1 BiocParallel_1.12.0 bsseq_1.14.0 SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[16] matrixStats_0.53.1 Biobase_2.38.0 GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0
[21] S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 XVector_0.18.0 rstudioapi_0.7
[4] bit64_0.9-7 lubridate_1.7.4 interactiveDisplayBase_1.16.0
[7] AnnotationDbi_1.40.0 xml2_1.2.0 codetools_0.2-15
[10] splines_3.4.4 R.methodsS3_1.7.1 mnormt_1.5-5
[13] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2 jsonlite_1.5 Rsamtools_1.30.0
[16] broom_0.4.4 R.oo_1.22.0 shiny_1.1.0
[19] compiler_3.4.4 httr_1.3.1 assertthat_0.2.0
[22] Matrix_1.2-14 lazyeval_0.2.1 cli_1.0.0
[25] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 limma_3.34.9 later_0.7.2
[28] org.Rn.eg.db_3.5.0 org.Mm.eg.db_3.5.0 htmltools_0.3.6
[31] prettyunits_1.0.2 tools_3.4.4 gtable_0.2.0
[34] glue_1.2.0 GenomeInfoDbData_1.0.0 annotatr_1.4.1
[37] reshape2_1.4.3 doRNG_1.6.6 Rcpp_0.12.17
[40] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2 TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0 TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
[43] cellranger_1.1.0 bumphunter_1.20.0 Biostrings_2.46.0
[46] nlme_3.1-137 rtracklayer_1.38.3 iterators_1.0.9
[49] DelayedMatrixStats_1.0.3 psych_1.8.4 TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[52] rvest_0.3.2 mime_0.5 rngtools_1.3.1
[55] gtools_3.5.0 XML_3.98-1.11 AnnotationHub_2.10.1
[58] org.Hs.eg.db_3.5.0 zlibbioc_1.24.0 scales_0.5.0
[61] BSgenome_1.46.0 BiocInstaller_1.28.0 hms_0.4.2
[64] promises_1.0.1 RColorBrewer_1.1-2 yaml_2.1.19
[67] memoise_1.1.0 TxDb.Rnorvegicus.UCSC.rn6.refGene_3.4.1 pkgmaker_0.22
[70] biomaRt_2.34.2 stringi_1.2.2 RSQLite_2.1.1
[73] foreach_1.4.4 RMySQL_0.10.15 permute_0.9-4
[76] GenomicFeatures_1.30.3 TxDb.Rnorvegicus.UCSC.rn5.refGene_3.4.2 rlang_0.2.0
[79] pkgconfig_2.0.1 bitops_1.0-6 lattice_0.20-35
[82] bindr_0.1.1 GenomicAlignments_1.14.2 bit_1.1-13
[85] tidyselect_0.2.4 plyr_1.8.4 magrittr_1.5
[88] R6_2.2.2 TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.4.1 DBI_1.0.0
[91] haven_1.1.1 foreign_0.8-70 pillar_1.2.2
[94] RCurl_1.95-4.10 crayon_1.3.4 modelr_0.1.2
[97] org.Dm.eg.db_3.5.0 progress_1.1.2 readxl_1.1.0
[100] locfit_1.5-9.1 grid_3.4.4 data.table_1.11.2
[103] blob_1.1.1 digest_0.6.15 xtable_1.8-2
[106] httpuv_1.4.3 regioneR_1.10.0 outliers_0.14
[109] R.utils_2.6.0 munsell_0.4.3 registry_0.5
I also tried it with my own data and got the same error. Am I missing something? Did you get it to work @ChrisAi89 ?
Greetings from Heidelberg,
Lukas
Hi Lukas,
Thanks for reporting this issue and for sending the minimal working example. I am unable to reproduce the error on my end. I am wondering if it could be due to incompatible versions of dmrseq/bsseq/DelayedArray/DelayedMatrixStats
. Can you try updating your Bioconductor packages to their latest devel versions and see if the problem persists? You can do this with:
library(BiocInstaller)
useDevel()
biocValid() # checks for out of date packages
biocLite() # (optional) updates out of date packages
Best,
Keegan
Hi @LKremer,
As a side note, when I was testing your example I found that the object created from the combine()
call resulted in slower downstream operations than expected. I raised this issue with bsseq maintainers and learned that it is recommended to construct BSseq
objects with cbind()
on a common set of loci, rather than using combine()
(details here: hansenlab/bsseq#70). Both the construction and downstream operations will be faster to carry out.
Best,
Keegan
Hi
I had this error when working in R 3.4. From my understanding it was due to the R version limiting the version of libraries like DelayedArray. After upgrading to R 3.5 it worked. I did not open an issue since the branch R 3.4 is tagged as "soon deprecated".
In case it can help you, I installed R 3.5 on my cluster without being root using a Conda environment and the channel rdonnellyr that provides R 3.5 and the needed dependencies like r-devtools r-rcpp r-openssl r-git2r etc
Thanks for this info, @jtomah.
@LKremer - I would advise you to switch to the current release version of R (3.5). Unfortunately, I cannot continue to support the latest versions of dmrseq on R 3.4, as all of the packages in Bioconductor now depend on R 3.5 (in release and in devel). Let me know if you have any questions or if this issue persists after upgrading to R 3.5.
Best,
Keegan
Hi @kdkorthauer and @jtomah,
thanks for your help. Unfortunately R 3.5 does not seem to be officially available for Ubuntu (see e.g. here. I'll see if there is another way to update R.
/edit: I managed to install R 3.5 manually from this ppa.
About the side note on combine() vs. cbind(): I believe cbind will only work if the samples to be combined have the same set of CpGs, right? If so, do I have to stick with combine when my samples differ in their CpGs?
To get a bit more off-topic while I'm at it, regarding the help of dmrseq::dmrseq():
minNumRegion | positive integer that represents the minimum number of nucleotides to consider for a candidate region. Default value is 5.
I think this should read "CpGs" instead of "nucleotides". Otherwise I'm confused :D
Hi @LKremer ,
Sorry for the trouble with R 3.5 on Ubuntu. Glad you were able to find a workaround.
Yes, you are correct about combine()
vs cbind()
. Unless you fist subset each sample on a common set of loci, you must stick with combine()
.
No, you are not confused; that's a typo. Thanks for catching that! :) I'll fix it straight away.
Best,
Keegan
Hi @kdkorthauer ,
sorry for reviving this issue once again. If I run dmrseq with the fake data as described above, I now get the expected result (no candidate regions are found). Now I tried to use dmrseq with multiple conditions on my own data and I get an error in normalizeSingleBracketSubscript2 (see below).
If I use the same files, but only define two conditions, dmrseq runs to completion without crashing. This would suggest that my input files are okay and that the crash is caused by using multiple conditions.
I suspected that something went wrong when I installed R 3.5 manually on ubuntu so I tested my script on another machine with a different OS and the regular R 3.5 from CRAN, but I got the exact same crash. I'll send you a mail with my input files + R script, I'm curious if it also crashes on your machine.
> dmrs <- dmrseq(bs = cpg, testCovariate = "celltype")
Performing a global test of H0: no difference among 3 groups (assuming the test covariate celltype is a factor).
Parallelizing using 2 workers/cores (backend: BiocParallel:MulticoreParam).
Computing on 1 chromosome(s) at a time.
Detecting candidate regions with coefficient larger than 0.1 in magnitude.
...Chromosome 19: Smoothed (0.06 min). Error in normalizeSingleBracketSubscript2(subscript, d, seed_dimnames[[along]]) :
subsetting a DelayedArray object with an array-like subscript is only supported if the subscript has a single dimension
This is my script:
require(BiocGenerics)
library(dmrseq)
cpg <- BiocGenerics::combine(
bsseq::read.bismark("B1.bedGraph", "b1", strandCollapse = F),
bsseq::read.bismark("B2.bedGraph", "b2", strandCollapse = F),
bsseq::read.bismark("S1.bedGraph", "s1", strandCollapse = F),
bsseq::read.bismark("S2.bedGraph", "s2", strandCollapse = F),
bsseq::read.bismark("O1.bedGraph", "o1", strandCollapse = F),
bsseq::read.bismark("O2.bedGraph", "o2", strandCollapse = F)
)
# Label the samples according to treatment & cell type:
pData(cpg)$celltype <- as.factor(c("b", "b", "s", "s", "o", "o"))
covered_loci <- which( rowSums( getCoverage(cpg, type="Cov") == 0) == 0)
cpg <- cpg[covered_loci, ]
cpg
# Detect differentially methylated regions (DMRs) with dmrseq:
dmrs <- dmrseq(bs = cpg,
testCovariate = "celltype")
The traceback looks like this:
> traceback()
19: stop(wmsg("subsetting a DelayedArray object with an array-like ",
"subscript is only supported if the subscript has a ", "single dimension"))
18: normalizeSingleBracketSubscript2(subscript, d, seed_dimnames[[along]])
17: FUN(X[[i]], ...)
16: lapply(seq_len(seed_ndim), function(along) {
subscript <- Nindex[[along]]
if (is.null(subscript))
return(NULL)
d <- seed_dim[[along]]
i <- normalizeSingleBracketSubscript2(subscript, d, seed_dimnames[[along]])
if (is_sequence(i, d))
return(NULL)
i
})
15: lapply(seq_len(seed_ndim), function(along) {
subscript <- Nindex[[along]]
if (is.null(subscript))
return(NULL)
d <- seed_dim[[along]]
i <- normalizeSingleBracketSubscript2(subscript, d, seed_dimnames[[along]])
if (is_sequence(i, d))
return(NULL)
i
})
14: new_DelayedSubset(x@seed, Nindex)
13: stash_DelayedSubset(x, Nindex)
12: cov.mat.cand[, design[, coeff] == l]
11: cov.mat.cand[, design[, coeff] == l]
10: DelayedMatrixStats::rowSums2(cov.mat.cand[, design[, coeff] ==
l] > 0)
9: replicateStatus(Indexes, design, coeff, fact)
8: regionScanner(meth.mat = meth.mat, cov.mat = cov.mat, pos = pos,
chr = chr, x = beta, y = rawBeta, maxGap = maxGap, cutoff = cutoff,
minNumRegion = minNumRegion, design = design, coeff = coeff,
coeff.adj = coeff.adj, verbose = verbose, parallel = parallel,
pDat = pData(bs), block = block, blockSize = blockSize, fact = fact)
7: eval(quote(list(...)), env)
6: eval(quote(list(...)), env)
5: eval(quote(list(...)), env)
4: standardGeneric("rbind")
3: rbind(tab, regionScanner(meth.mat = meth.mat, cov.mat = cov.mat,
pos = pos, chr = chr, x = beta, y = rawBeta, maxGap = maxGap,
cutoff = cutoff, minNumRegion = minNumRegion, design = design,
coeff = coeff, coeff.adj = coeff.adj, verbose = verbose,
parallel = parallel, pDat = pData(bs), block = block, blockSize = blockSize,
fact = fact))
2: bumphunt(bs = bs, design = design, coeff = coeff, coeff.adj = coeff.adj,
minInSpan = minInSpan, minNumRegion = minNumRegion, cutoff = cutoff,
maxGap = maxGap, maxGapSmooth = maxGapSmooth, smooth = smooth,
bpSpan = bpSpan, verbose = verbose, parallel = parallel,
block = block, blockSize = blockSize, chrsPerChunk = chrsPerChunk,
fact = fact)
1: dmrseq(bs = cpg, testCovariate = "celltype")
sessionInfo() and BiocValid():
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 22 (Twenty Two)
Matrix products: default
BLAS: /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dmrseq_1.1.8 bsseq_1.17.0
[3] SummarizedExperiment_1.11.3 DelayedArray_0.7.3
[5] BiocParallel_1.15.5 matrixStats_0.53.1
[7] Biobase_2.41.0 GenomicRanges_1.33.5
[9] GenomeInfoDb_1.17.1 IRanges_2.15.13
[11] S4Vectors_0.19.7 BiocGenerics_0.27.0
loaded via a namespace (and not attached):
[1] nlme_3.1-137 bitops_1.0-6
[3] bit64_0.9-7 RColorBrewer_1.1-2
[5] progress_1.1.2 httr_1.3.1
[7] doRNG_1.6.6 tools_3.5.0
[9] R6_2.2.2 HDF5Array_1.9.0
[11] DBI_1.0.0 lazyeval_0.2.1
[13] colorspace_1.3-2 permute_0.9-4
[15] withr_2.1.2 tidyselect_0.2.4
[17] prettyunits_1.0.2 bit_1.1-14
[19] compiler_3.5.0 pkgmaker_0.27
[21] rtracklayer_1.41.3 scales_0.5.0
[23] readr_1.1.1 stringr_1.3.1
[25] digest_0.6.15 Rsamtools_1.33.0
[27] R.utils_2.6.0 XVector_0.21.1
[29] pkgconfig_2.0.1 htmltools_0.3.6
[31] bibtex_0.4.2 limma_3.37.1
[33] BSgenome_1.49.0 regioneR_1.13.0
[35] rlang_0.2.1 RSQLite_2.1.1
[37] BiocInstaller_1.31.1 shiny_1.1.0
[39] DelayedMatrixStats_1.3.0 bindr_0.1.1
[41] gtools_3.5.0 dplyr_0.7.5
[43] R.oo_1.22.0 RCurl_1.95-4.10
[45] magrittr_1.5 GenomeInfoDbData_1.1.0
[47] Matrix_1.2-14 Rcpp_0.12.17
[49] munsell_0.4.3 Rhdf5lib_1.3.1
[51] R.methodsS3_1.7.1 stringi_1.2.2
[53] yaml_2.1.19 zlibbioc_1.27.0
[55] bumphunter_1.23.0 rhdf5_2.25.4
[57] plyr_1.8.4 AnnotationHub_2.13.1
[59] grid_3.5.0 blob_1.1.1
[61] promises_1.0.1 lattice_0.20-35
[63] splines_3.5.0 Biostrings_2.49.0
[65] GenomicFeatures_1.33.0 hms_0.4.2
[67] locfit_1.5-9.1 pillar_1.2.3
[69] rngtools_1.3.1 codetools_0.2-15
[71] reshape2_1.4.3 biomaRt_2.37.1
[73] XML_3.98-1.11 glue_1.2.0
[75] outliers_0.14 annotatr_1.7.0
[77] data.table_1.11.4 foreach_1.4.4
[79] httpuv_1.4.3 gtable_0.2.0
[81] purrr_0.2.5 assertthat_0.2.0
[83] ggplot2_2.2.1 mime_0.5
[85] xtable_1.8-2 later_0.7.2
[87] tibble_1.4.2 iterators_1.0.9
[89] registry_0.5 GenomicAlignments_1.17.1
[91] AnnotationDbi_1.43.1 memoise_1.1.0
[93] bindrcpp_0.2.2 interactiveDisplayBase_1.19.0
> library(BiocInstaller)
Bioconductor version 3.8 (BiocInstaller 1.31.1), ?biocLite for help
> useDevel()
Error: 'devel' version already in use
> biocValid()
[1] TRUE
Best,
Lukas
Hi Lukas,
Thanks for the thorough bug report! I'll dig into this and get back to you as soon as I can.
Best,
Keegan
Hi Lukas,
Thanks to your reproducible example, I was able to replicate the error and track down the issue (see commit message above for details). Hopefully the multi-group comparisons will run smoothly now.
Best,
Keegan
Hi @kdkorthauer ,
I tried it on my truncated dataset and dmrseq runs to completion, so the issue seems to be resolved. Thanks for the super quick fix!
May you give me some quick pointers on how to interpret the results? In my case I'm analysing three cell types: Stem cells (SC) that differentiate into either celltype A or celltype B. I have four WGBS replicates each for SC, A and B.
My aim is to identify DMRs that occur in the transition from SC to A, but not in the transition from SC to B (possible "switches" that may decide what the SC will differentiate into). Earlier, I used dmrseq to compare A vs B directly. That works fine, but adding the SC sample into the analysis would give me extra information for each DMR. For instance, if I have a DMR with low methylation in A and high methylation in B, it'd be interesting to know whether the region was high or low in stem cells.
I now ran dmrseq as described above, by specifying the cell type as a factor with 3 levels, releveled to "SC". The output gives me two beta-values and one q-value per DMR.
- What exactly does the q-value now refer to? Does a low q-value indicate a significant methylation change in both cell types, or only either of them?
- The two sets of beta values I receive are well-correlated, does this mean that many methylation-changes occur in parallel, or is does the method simply not look for DMRs that go e.g. up in A and down in B?
- Would it be valid to identify cell fate "switches" by selecting DMRs that have two very different betas, or is it better to just run dmrseq to compare A vs. B directly?
Best,
Lukas
Hi Lukas,
Great! Thanks for letting me know.
Happy to help with your questions regarding answering your biological questions with dmrseq.
-
The q-value represents the significance compared to the global null (no difference among the three groups). This is carried out by comparing ANOVA test statistics observed to those of null regions. So, a significant value could mean that all three groups are different from each other, or that one of the groups is significantly different from the other two.
-
The two sets of beta values represent an estimate of the effect size for that group compared to the reference group. Yes, if these values are similar to each other, this means that roughly both groups have a similar difference in comparison to the reference group. If they have opposite signs, then this means that the reference group is in between them. A beta value closer to zero means that group is not much different than the reference group. However, these interpretations are not proper inference (i.e. they don't allow you to say that the two betas are significantly different from each other). Due to the selection of candidate regions step, unfortunately we cannot simply obtain test statistics for the individual groups and compare them to a theoretical null (this is the reason a permutation test is used to obtain the q-values for the global null).
As a side note, you can interpret beta/pi as roughly on the same scale as methylation proportion (since the beta values are generated on the arcsin transformed values).
-
It sounds like you are not necessarily interested in regions where A and B are the same but different from SC, but instead would like to separate A vs B DMRs into different categories: (1) A and B are both different from SC, (2) A is different from B and SC (but B and SC are the same), and (3) B is different from A and SC (but A and SC are the same). At first glance, it sounds like carrying out all pairwise comparisons (and then adjusting for multiple comparisons) will get you the answer. However, since these are DMRs where the locations and sizes are determined from the data, this is a more complicated question. This is because regions defined in one comparison won't directly correspond to the regions defined in another.
I would probably start with the approach of first defining the set of A vs B DMRs (ignoring SC), and then splitting these into the 3 categories above based on whether or not they overlap a B-SC DMR and/or an A-SC DMR.
Hope that helps!
Best,
Keegan
Hi @kdkorthauer ,
Thanks for your thorough explanation, I understand the meaning of the betas and the q-value now. I'm currently trying out the approach you proposed at the bottom of your post.
On a different note, I ran into a new dmrseq crash when comparing multiple conditions. The crash looks like this:
Performing a global test of H0: no difference among 3 groups (assuming the test covariate celltype is a factor).
Parallelizing using 2 workers/cores (backend: BiocParallel:MulticoreParam).
Computing on 1 chromosome(s) at a time.
Detecting candidate regions with coefficient larger than 0.1 in magnitude.
...Chromosome 19: Smoothed (0.1 min). 222 CpG(s) excluded due to zero coverage. Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Calls: dmrseq ... standardGeneric -> eval -> eval -> eval -> rbind -> rbind
In addition: There were 20 warnings (use warnings() to see them)
Execution halted
The crash also happens with the test script + tiny test data set that I emailed you a few weeks ago, maybe this can help you track down the problem. I tried to run the test script on my laptop with dmrseq_1.1.10 and it completed successfully. Then I updated dmrseq to the latest github version and now I get the crash above.
The printout "222 CpG(s) excluded due to zero coverage." occurs in the latest dmrseq version, but not in the older one. Maybe this is the root of the problem somehow? I removed all zero-coverage loci as usual, so I'm not sure why dmrseq::dmrseq() prints this message.
Sorry for bumping this issue once again ;)
Warnings (shortened, its 20x the same message):
> warnings()
Warning messages:
1: In summary.lm(fit2) : essentially perfect fit: summary may be unreliable
2: In summary.lm(fit2) : essentially perfect fit: summary may be unreliable
3: In summary.lm(fit2) : essentially perfect fit: summary may be unreliable
...
Traceback (the beginning is shortened cause it's a huge and repetitive)
> traceback()
...
list(stat = 2.24799854460156, constant = FALSE, beta_o = 0.0157305414733781,
beta_s = -0.493673619501872), list(stat = 0.675200975689104,
constant = FALSE, beta_o = 0.161133159537485, beta_s = 0.013842060847119),
list(stat = 0.417794949634261, constant = FALSE, beta_o = -0.0271493851625042,
beta_s = 0.076191161499296), list(stat = 2.29487196144337,
constant = FALSE, beta_o = 0.100343816570622, beta_s = 0.0514045745416547),
list(stat = 1.84089886800191, constant = FALSE, beta_o = -0.23181632553409,
beta_s = -0.341310578521614), list(stat = 1.73848158557243,
constant = FALSE, beta_o = 0.245729121735164, beta_s = -0.386210894846448),
list(stat = 2.3818747379595, constant = FALSE, beta_o = 0.328271070874218,
beta_s = -0.173563337303421), list(stat = 2.03203381216141,
constant = FALSE, beta_o = 0.119053999435268, beta_s = -0.40537199257229),
list(stat = 1.88974766082771, constant = FALSE, beta_o = -0.0777069484347307,
beta_s = -0.101715412908339), list(stat = 0.506295570781879,
constant = FALSE, beta_o = 0.0288295944719724, beta_s = -0.0471265077611153))
10: do.call("rbind", bplapply(Indexes, function(Index) asin.gls.cov(ix = ind[Index],
design = design, coeff = coeff)))
9: do.call("rbind", bplapply(Indexes, function(Index) asin.gls.cov(ix = ind[Index],
design = design, coeff = coeff)))
8: regionScanner(meth.mat = meth.mat, cov.mat = cov.mat, pos = pos,
chr = chr, x = beta, y = rawBeta, maxGap = maxGap, cutoff = cutoff,
minNumRegion = minNumRegion, design = design, coeff = coeff,
coeff.adj = coeff.adj, verbose = verbose, parallel = parallel,
pDat = pData(bs), block = block, blockSize = blockSize, fact = fact)
7: eval(quote(list(...)), env)
6: eval(quote(list(...)), env)
5: eval(quote(list(...)), env)
4: standardGeneric("rbind")
3: rbind(tab, regionScanner(meth.mat = meth.mat, cov.mat = cov.mat,
pos = pos, chr = chr, x = beta, y = rawBeta, maxGap = maxGap,
cutoff = cutoff, minNumRegion = minNumRegion, design = design,
coeff = coeff, coeff.adj = coeff.adj, verbose = verbose,
parallel = parallel, pDat = pData(bs), block = block, blockSize = blockSize,
fact = fact))
2: bumphunt(bs = bs, design = design, coeff = coeff, coeff.adj = coeff.adj,
minInSpan = minInSpan, minNumRegion = minNumRegion, cutoff = cutoff,
maxGap = maxGap, maxGapSmooth = maxGapSmooth, smooth = smooth,
bpSpan = bpSpan, verbose = verbose, parallel = parallel,
block = block, blockSize = blockSize, chrsPerChunk = chrsPerChunk,
fact = fact)
1: dmrseq(bs = cpg, testCovariate = "celltype", BPPARAM = MulticoreParam(workers = 2))
sessionInfo()
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dmrseq_1.1.12 bsseq_1.17.0
[3] SummarizedExperiment_1.11.5 DelayedArray_0.7.19
[5] matrixStats_0.53.1 Biobase_2.41.1
[7] GenomicRanges_1.33.6 GenomeInfoDb_1.17.1
[9] IRanges_2.15.14 S4Vectors_0.19.17
[11] BiocParallel_1.15.7 BiocGenerics_0.27.1
loaded via a namespace (and not attached):
[1] nlme_3.1-137 bitops_1.0-6
[3] bit64_0.9-7 RColorBrewer_1.1-2
[5] progress_1.2.0 httr_1.3.1
[7] doRNG_1.7.1 tools_3.5.0
[9] R6_2.2.2 HDF5Array_1.9.5
[11] DBI_1.0.0 lazyeval_0.2.1
[13] colorspace_1.3-2 permute_0.9-4
[15] withr_2.1.2 tidyselect_0.2.4
[17] prettyunits_1.0.2 bit_1.1-14
[19] compiler_3.5.0 pkgmaker_0.27
[21] rtracklayer_1.41.3 scales_0.5.0
[23] readr_1.1.1 stringr_1.3.1
[25] digest_0.6.15 Rsamtools_1.33.2
[27] R.utils_2.6.0 XVector_0.21.3
[29] pkgconfig_2.0.1 htmltools_0.3.6
[31] bibtex_0.4.2 limma_3.37.2
[33] BSgenome_1.49.1 regioneR_1.13.0
[35] rlang_0.2.1 RSQLite_2.1.1
[37] BiocInstaller_1.31.1 shiny_1.1.0
[39] DelayedMatrixStats_1.3.4 bindr_0.1.1
[41] gtools_3.8.1 dplyr_0.7.6
[43] R.oo_1.22.0 RCurl_1.95-4.10
[45] magrittr_1.5 GenomeInfoDbData_1.1.0
[47] Matrix_1.2-14 Rcpp_0.12.17
[49] munsell_0.5.0 Rhdf5lib_1.3.1
[51] R.methodsS3_1.7.1 stringi_1.2.3
[53] yaml_2.1.19 zlibbioc_1.27.0
[55] bumphunter_1.23.0 rhdf5_2.25.4
[57] plyr_1.8.4 AnnotationHub_2.13.1
[59] grid_3.5.0 blob_1.1.1
[61] promises_1.0.1 crayon_1.3.4
[63] lattice_0.20-35 splines_3.5.0
[65] Biostrings_2.49.0 GenomicFeatures_1.33.0
[67] hms_0.4.2 locfit_1.5-9.1
[69] pillar_1.2.3 rngtools_1.3.1
[71] codetools_0.2-15 reshape2_1.4.3
[73] biomaRt_2.37.3 XML_3.98-1.11
[75] glue_1.2.0 outliers_0.14
[77] annotatr_1.7.1 data.table_1.11.4
[79] foreach_1.4.4 httpuv_1.4.4.2
[81] gtable_0.2.0 purrr_0.2.5
[83] assertthat_0.2.0 ggplot2_3.0.0
[85] mime_0.5 xtable_1.8-2
[87] later_0.7.3 tibble_1.4.2
[89] iterators_1.0.9 registry_0.5
[91] GenomicAlignments_1.17.2 AnnotationDbi_1.43.1
[93] memoise_1.1.0 bindrcpp_0.2.2
[95] interactiveDisplayBase_1.19.0
> BiocInstaller::biocValid()
[1] TRUE
> BiocInstaller::useDevel()
Error: 'devel' version already in use
Hi Lukas,
Thanks for reporting this issue! I was able to reproduce exactly the behavior you describe. It was puzzling at first since the code responsible is unchanged between devel and release, but I've been able to track it down to a change in the DelayedArray
package. There is an issue in DelayedArray
devel that affects subsetting and coercion to vectors - I've filed a report here: Bioconductor/DelayedArray#23.
I created a patch that will allow you to use devel again - just realizing the delayed matrices as actual matrices in the offending lines. I'll change it back to the more efficient version once the DelayedArray
issue is resolved (nothing should change on your end, just leaving this open as a reminder to myself).
Best,
Keegan
Hi Keegan,
Ah, this explains why downgrading dmrseq to an earlier version didn't fix the issue for me. Let's hope the DelayedArray issue gets fixed soon. I guess running into problems like this is one of the downsides of useDevel().
Thanks for the very quick patch! I'm testing it right now.
edit: It worked perfectly fine with the new patch, thanks!
Best,
Lukas