Only returning 2 columns for multiple conditions in a single testCovariate
jrhawley opened this issue · 2 comments
Hi @kdkorthauer,
I'm trying to use dmrseq
to find DMRs between multiple groups (7 in my case), but when the dmrseq
function finishes running, it only returns 2 beta_*
columns.
Here's an example:
> sites = data.table::fread("sites.bed")
> cov = as.matrix(data.table::fread("cov.tsv"))
> meth = as.matrix(data.table::fread("meth.tsv"))
> bs = BSseq(
chr = sites$chr,
pos = sites$start,
M = meth,
Cov = cov,
sampleNames = colnames(meth)
)
> dim(meth)
[1] 9133 14
> pData(bs)$CellType = rep(c("A", "B", "C", "D", "E", "F", "G"), each = 2)
> regions = dmrseq(bs, cutoff = 0.05, testCovariate = "CellType", block = TRUE)
> regions
GRanges object with 5 ranges and 13 metadata columns:
seqnames ranges strand | L area
<Rle> <IRanges> <Rle> | <integer> <numeric>
[1] chrY 10195606-10201158 * | 169 38.2301110753144
[2] chrY 10168805-10179039 * | 83 21.4140122389176
[3] chrY 20063027-20071947 * | 24 8.6088320479757
[4] chrY 20136566-20142275 * | 26 6.87157006072164
[5] chrY 20112930-20119704 * | 31 9.12274096963229
beta_B beta_C stat
<numeric> <numeric> <numeric>
[1] -0.0371691839226094 -0.0752807212251143 1.70573693696012
[2] -0.0441968981525388 -0.0861093245448863 1.33891191668975
[3] 0.0101195828552019 0.0454363657001892 1.01819060667095
[4] -0.0924929027767175 -0.0249519554741274 0.895781627374212
[5] -0.0482938427619002 -0.00452535560090753 0.781044756328963
pval qval index
<numeric> <numeric> <IRanges>
[1] 0.181818181818182 0.5 3497-3665
[2] 0.2 0.5 3384-3466
[3] 0.4 0.666666666666667 7818-7841
[4] 0.618181818181818 0.672727272727273 7919-7944
[5] 0.672727272727273 0.672727272727273 7871-7901
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
It's only showing the values for B
and C
instead of all 6 comparisons against CellType A
.
I think the issue is in this line in the dmrseq
function.
Instead of
coeff <- c(coeff, coeff + length(unique(testCov)) - 2)
I think it should be
coeff <- seq(coeff, coeff + length(unique(testCov)) - 2)
so that it includes all the non-intercept columns in the design matrix.
I tried running dmrseq
with that change, and I got the following result:
GRanges object with 5 ranges and 13 metadata columns:
seqnames ranges strand | L area
<Rle> <IRanges> <Rle> | <integer> <numeric>
[1] chrY 10195606-10201158 * | 169 38.2301110753144
[2] chrY 10168805-10179039 * | 83 21.4140122389176
[3] chrY 20063027-20071947 * | 24 8.6088320479757
[4] chrY 20136566-20142275 * | 26 6.87157006072164
[5] chrY 20112930-20119704 * | 31 9.12274096963229
beta_B beta_C beta_D
<numeric> <numeric> <numeric>
[1] -0.0371691839226094 -0.0752807212251143 -0.0371691839226094
[2] -0.0441968981525388 -0.0861093245448863 -0.0441968981525388
[3] 0.0101195828552019 0.0454363657001892 0.0101195828552019
[4] -0.0924929027767175 -0.0249519554741274 -0.0924929027767175
[5] -0.0482938427619002 -0.00452535560090753 -0.0482938427619001
beta_E beta_F beta_G
<numeric> <numeric> <numeric>
[1] -0.0481211935514686 -0.0565869713105525 0.0499099035745361
[2] -0.130865741892211 -0.167381024709723 -0.0805857625971508
[3] -0.199663427452862 -0.109570687948112 -0.0313615323168114
[4] -0.121608415537634 -0.0998756565588056 0.0193838796675397
[5] -0.215545198996335 -0.0252185442432093 -0.0444248610466189
stat pval qval
<numeric> <numeric> <numeric>
[1] 1.70573693696012 0.181818181818182 0.5
[2] 1.33891191668975 0.2 0.5
[3] 1.01819060667095 0.4 0.666666666666667
[4] 0.895781627374212 0.618181818181818 0.672727272727273
[5] 0.781044756328963 0.672727272727273 0.672727272727273
index
<IRanges>
[1] 3497-3665
[2] 3384-3466
[3] 7818-7841
[4] 7919-7944
[5] 7871-7901
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
I think the reason this didn't come up before in #6 is that there were 3 conditions, so only 2 test columns would have come up, anyway.
Let me know if you can replicate this result.
I think it's a simple fix, but I haven't tried it with multiple test covariates (instead of a single, multi-condition covariate), so I'm not completely sure that it's correct.
Thanks,
James
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS 10.14.2
Matrix products: default
BLAS/LAPACK: /Users/hawleyj/anaconda3/envs/DMRs/lib/R/lib/libRblas.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dmrseq_1.2.1 bsseq_1.18.0
[3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[5] BiocParallel_1.16.2 matrixStats_0.54.0
[7] Biobase_2.42.0 GenomicRanges_1.34.0
[9] GenomeInfoDb_1.18.1 IRanges_2.16.0
[11] S4Vectors_0.20.1 BiocGenerics_0.28.0
[13] argparse_1.1.1 data.table_1.12.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-0 XVector_0.22.0
[3] getopt_1.20.2 bit64_0.9-7
[5] interactiveDisplayBase_1.20.0 AnnotationDbi_1.44.0
[7] codetools_0.2-16 splines_3.5.1
[9] R.methodsS3_1.7.1 jsonlite_1.6
[11] Rsamtools_1.34.0 R.oo_1.22.0
[13] shiny_1.2.0 HDF5Array_1.10.1
[15] BiocManager_1.30.4 readr_1.3.1
[17] compiler_3.5.1 httr_1.4.0
[19] assertthat_0.2.0 Matrix_1.2-15
[21] lazyeval_0.2.1 limma_3.38.3
[23] later_0.7.5 htmltools_0.3.6
[25] prettyunits_1.0.2 tools_3.5.1
[27] bindrcpp_0.2.2 gtable_0.2.0
[29] glue_1.3.0 GenomeInfoDbData_1.2.0
[31] annotatr_1.8.0 reshape2_1.4.3
[33] dplyr_0.7.8 doRNG_1.7.1
[35] Rcpp_1.0.0 bumphunter_1.24.5
[37] Biostrings_2.50.1 nlme_3.1-137
[39] rtracklayer_1.42.1 iterators_1.0.10
[41] DelayedMatrixStats_1.4.0 stringr_1.3.1
[43] proto_1.0.0 mime_0.6
[45] rngtools_1.3.1 gtools_3.8.1
[47] XML_3.98-1.16 AnnotationHub_2.14.2
[49] zlibbioc_1.28.0 scales_1.0.0
[51] BSgenome_1.50.0 hms_0.4.2
[53] promises_1.0.1 rhdf5_2.26.2
[55] RColorBrewer_1.1-2 yaml_2.2.0
[57] memoise_1.1.0 ggplot2_3.1.0
[59] pkgmaker_0.27 biomaRt_2.38.0
[61] stringi_1.2.4 RSQLite_2.1.1
[63] foreach_1.4.4 permute_0.9-4
[65] GenomicFeatures_1.34.1 bibtex_0.4.2
[67] rlang_0.3.1 pkgconfig_2.0.2
[69] bitops_1.0-6 lattice_0.20-38
[71] purrr_0.2.5 Rhdf5lib_1.4.2
[73] bindr_0.1.1 GenomicAlignments_1.18.0
[75] bit_1.1-12 tidyselect_0.2.5
[77] plyr_1.8.4 magrittr_1.5
[79] R6_2.3.0 DBI_1.0.0
[81] pillar_1.3.1 findpython_1.0.3
[83] withr_2.1.2 RCurl_1.95-4.11
[85] tibble_2.0.1 crayon_1.3.4
[87] progress_1.2.0 locfit_1.5-9.1
[89] grid_3.5.1 blob_1.1.1
[91] digest_0.6.18 xtable_1.8-3
[93] httpuv_1.4.5.1 regioneR_1.14.0
[95] outliers_0.14 R.utils_2.7.0
[97] munsell_0.5.0 registry_0.5
Hi @jrhawley,
Thanks for reaching out and reporting this issue, as well as a fix. You are correct in that line should use seq
rather than c
- thanks for catching that! I've implemented the change in both devel and release, so you should see those propagate to Bioconductor in the next couple of days.
With regard to your comment
multiple test covariates (instead of a single, multi-condition covariate)
Multiple test covariates is actually not supported, and will throw an error if more than one is specified (see this line).
Thanks again and don't hesitate to reach out if you encounter any more issues.
Best,
Keegan
Great, glad that worked. I didn't catch that multiple testCovariate line, but that makes sense. That all sounds good to me