kdkorthauer/dmrseq

Only returning 2 columns for multiple conditions in a single testCovariate

jrhawley opened this issue · 2 comments

Hi @kdkorthauer,

I'm trying to use dmrseq to find DMRs between multiple groups (7 in my case), but when the dmrseq function finishes running, it only returns 2 beta_* columns.
Here's an example:

> sites = data.table::fread("sites.bed")
> cov = as.matrix(data.table::fread("cov.tsv"))
> meth = as.matrix(data.table::fread("meth.tsv"))

> bs = BSseq(
    chr = sites$chr,
    pos = sites$start,
    M = meth,
    Cov = cov,
    sampleNames = colnames(meth)
)
> dim(meth)
[1] 9133 14
> pData(bs)$CellType = rep(c("A", "B", "C", "D", "E", "F", "G"), each = 2)
> regions = dmrseq(bs, cutoff = 0.05, testCovariate = "CellType", block = TRUE)
> regions
GRanges object with 5 ranges and 13 metadata columns:
      seqnames            ranges strand |         L             area
         <Rle>         <IRanges>  <Rle> | <integer>        <numeric>
  [1]     chrY 10195606-10201158      * |       169 38.2301110753144
  [2]     chrY 10168805-10179039      * |        83 21.4140122389176
  [3]     chrY 20063027-20071947      * |        24  8.6088320479757
  [4]     chrY 20136566-20142275      * |        26 6.87157006072164
  [5]     chrY 20112930-20119704      * |        31 9.12274096963229
                   beta_B               beta_C               stat
                <numeric>            <numeric>          <numeric>
  [1] -0.0371691839226094  -0.0752807212251143   1.70573693696012
  [2] -0.0441968981525388  -0.0861093245448863   1.33891191668975
  [3]  0.0101195828552019   0.0454363657001892   1.01819060667095
  [4] -0.0924929027767175  -0.0249519554741274  0.895781627374212
  [5] -0.0482938427619002 -0.00452535560090753  0.781044756328963
                   pval              qval     index
              <numeric>         <numeric> <IRanges>
  [1] 0.181818181818182               0.5 3497-3665
  [2]               0.2               0.5 3384-3466
  [3]               0.4 0.666666666666667 7818-7841
  [4] 0.618181818181818 0.672727272727273 7919-7944
  [5] 0.672727272727273 0.672727272727273 7871-7901
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

It's only showing the values for B and C instead of all 6 comparisons against CellType A.

I think the issue is in this line in the dmrseq function.
Instead of

coeff <- c(coeff, coeff + length(unique(testCov)) - 2)

I think it should be

coeff <- seq(coeff, coeff + length(unique(testCov)) - 2)

so that it includes all the non-intercept columns in the design matrix.
I tried running dmrseq with that change, and I got the following result:

GRanges object with 5 ranges and 13 metadata columns:
      seqnames            ranges strand |         L             area
         <Rle>         <IRanges>  <Rle> | <integer>        <numeric>
  [1]     chrY 10195606-10201158      * |       169 38.2301110753144
  [2]     chrY 10168805-10179039      * |        83 21.4140122389176
  [3]     chrY 20063027-20071947      * |        24  8.6088320479757
  [4]     chrY 20136566-20142275      * |        26 6.87157006072164
  [5]     chrY 20112930-20119704      * |        31 9.12274096963229
                   beta_B               beta_C              beta_D
                <numeric>            <numeric>           <numeric>
  [1] -0.0371691839226094  -0.0752807212251143 -0.0371691839226094
  [2] -0.0441968981525388  -0.0861093245448863 -0.0441968981525388
  [3]  0.0101195828552019   0.0454363657001892  0.0101195828552019
  [4] -0.0924929027767175  -0.0249519554741274 -0.0924929027767175
  [5] -0.0482938427619002 -0.00452535560090753 -0.0482938427619001
                   beta_E              beta_F              beta_G
                <numeric>           <numeric>           <numeric>
  [1] -0.0481211935514686 -0.0565869713105525  0.0499099035745361
  [2]  -0.130865741892211  -0.167381024709723 -0.0805857625971508
  [3]  -0.199663427452862  -0.109570687948112 -0.0313615323168114
  [4]  -0.121608415537634 -0.0998756565588056  0.0193838796675397
  [5]  -0.215545198996335 -0.0252185442432093 -0.0444248610466189
                   stat              pval              qval 
              <numeric>         <numeric>         <numeric> 
  [1]  1.70573693696012 0.181818181818182               0.5 
  [2]  1.33891191668975               0.2               0.5 
  [3]  1.01819060667095               0.4 0.666666666666667 
  [4] 0.895781627374212 0.618181818181818 0.672727272727273 
  [5] 0.781044756328963 0.672727272727273 0.672727272727273 
          index
      <IRanges>
  [1] 3497-3665
  [2] 3384-3466
  [3] 7818-7841
  [4] 7919-7944
  [5] 7871-7901
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

I think the reason this didn't come up before in #6 is that there were 3 conditions, so only 2 test columns would have come up, anyway.

Let me know if you can replicate this result.
I think it's a simple fix, but I haven't tried it with multiple test covariates (instead of a single, multi-condition covariate), so I'm not completely sure that it's correct.

Thanks,
James

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS/LAPACK: /Users/hawleyj/anaconda3/envs/DMRs/lib/R/lib/libRblas.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] dmrseq_1.2.1                bsseq_1.18.0
 [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
 [5] BiocParallel_1.16.2         matrixStats_0.54.0
 [7] Biobase_2.42.0              GenomicRanges_1.34.0
 [9] GenomeInfoDb_1.18.1         IRanges_2.16.0
[11] S4Vectors_0.20.1            BiocGenerics_0.28.0
[13] argparse_1.1.1              data.table_1.12.0

loaded via a namespace (and not attached):
 [1] colorspace_1.4-0              XVector_0.22.0
 [3] getopt_1.20.2                 bit64_0.9-7
 [5] interactiveDisplayBase_1.20.0 AnnotationDbi_1.44.0
 [7] codetools_0.2-16              splines_3.5.1
 [9] R.methodsS3_1.7.1             jsonlite_1.6
[11] Rsamtools_1.34.0              R.oo_1.22.0
[13] shiny_1.2.0                   HDF5Array_1.10.1
[15] BiocManager_1.30.4            readr_1.3.1
[17] compiler_3.5.1                httr_1.4.0
[19] assertthat_0.2.0              Matrix_1.2-15
[21] lazyeval_0.2.1                limma_3.38.3
[23] later_0.7.5                   htmltools_0.3.6
[25] prettyunits_1.0.2             tools_3.5.1
[27] bindrcpp_0.2.2                gtable_0.2.0
[29] glue_1.3.0                    GenomeInfoDbData_1.2.0
[31] annotatr_1.8.0                reshape2_1.4.3
[33] dplyr_0.7.8                   doRNG_1.7.1
[35] Rcpp_1.0.0                    bumphunter_1.24.5
[37] Biostrings_2.50.1             nlme_3.1-137
[39] rtracklayer_1.42.1            iterators_1.0.10
[41] DelayedMatrixStats_1.4.0      stringr_1.3.1
[43] proto_1.0.0                   mime_0.6
[45] rngtools_1.3.1                gtools_3.8.1
[47] XML_3.98-1.16                 AnnotationHub_2.14.2
[49] zlibbioc_1.28.0               scales_1.0.0
[51] BSgenome_1.50.0               hms_0.4.2
[53] promises_1.0.1                rhdf5_2.26.2
[55] RColorBrewer_1.1-2            yaml_2.2.0
[57] memoise_1.1.0                 ggplot2_3.1.0
[59] pkgmaker_0.27                 biomaRt_2.38.0
[61] stringi_1.2.4                 RSQLite_2.1.1
[63] foreach_1.4.4                 permute_0.9-4
[65] GenomicFeatures_1.34.1        bibtex_0.4.2
[67] rlang_0.3.1                   pkgconfig_2.0.2
[69] bitops_1.0-6                  lattice_0.20-38
[71] purrr_0.2.5                   Rhdf5lib_1.4.2
[73] bindr_0.1.1                   GenomicAlignments_1.18.0
[75] bit_1.1-12                    tidyselect_0.2.5
[77] plyr_1.8.4                    magrittr_1.5
[79] R6_2.3.0                      DBI_1.0.0
[81] pillar_1.3.1                  findpython_1.0.3
[83] withr_2.1.2                   RCurl_1.95-4.11
[85] tibble_2.0.1                  crayon_1.3.4
[87] progress_1.2.0                locfit_1.5-9.1
[89] grid_3.5.1                    blob_1.1.1
[91] digest_0.6.18                 xtable_1.8-3
[93] httpuv_1.4.5.1                regioneR_1.14.0
[95] outliers_0.14                 R.utils_2.7.0
[97] munsell_0.5.0                 registry_0.5

Hi @jrhawley,

Thanks for reaching out and reporting this issue, as well as a fix. You are correct in that line should use seq rather than c - thanks for catching that! I've implemented the change in both devel and release, so you should see those propagate to Bioconductor in the next couple of days.

With regard to your comment

multiple test covariates (instead of a single, multi-condition covariate)

Multiple test covariates is actually not supported, and will throw an error if more than one is specified (see this line).

Thanks again and don't hesitate to reach out if you encounter any more issues.

Best,
Keegan

Great, glad that worked. I didn't catch that multiple testCovariate line, but that makes sense. That all sounds good to me