markrobinsonuzh/DAMEfinder

Having problem in loading full genome in reference_file

kashiff007 opened this issue · 8 comments

Hi,

I know from vignette how to upload one chr/scaffold into reference_file
reference_file <- DNAStringSet(genome[[19]], use.names = TRUE)

But having problem in uploading whole genome fasta file containing multiple chr/scaffolds.
My genome looks like:

Aiptasia genome:
# organism: Aiptasia CC7 (Aiptasia)
# provider: RSRC
# provider version: CC7
# release date: 2020
# release name: CC7.2
# 262 sequences:
#   CC7scaffold1   CC7scaffold2   CC7scaffold3   CC7scaffold4   CC7scaffold5  
#   CC7scaffold6   CC7scaffold7   CC7scaffold8   CC7scaffold9   CC7scaffold10 
#   CC7scaffold11  CC7scaffold12  CC7scaffold13  CC7scaffold14  CC7scaffold15 
#   CC7scaffold16  CC7scaffold17  CC7scaffold18  CC7scaffold19  CC7scaffold20 
#   CC7scaffold21  CC7scaffold22  CC7scaffold23  CC7scaffold24  CC7scaffold25 
#   ...            ...            ...            ...            ...           
#   CC7scaffold241 CC7scaffold242 CC7scaffold243 CC7scaffold244 CC7scaffold245
#   CC7scaffold246 CC7scaffold247 CC7scaffold248 CC7scaffold249 CC7scaffold250
#   CC7scaffold251 CC7scaffold252 CC7scaffold253 CC7scaffold254 CC7scaffold255
#   CC7scaffold256 CC7scaffold257 CC7scaffold258 CC7scaffold259 CC7scaffold260
#   CC7scaffold261 CC7scaffold262                                             
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)

I also looked into manual of extract_bams, but couldn't found the solution.

Can you provide a solution/command by which I can perform this operation.

Thank you,
Kashif

I solve the this problem by just assigning the genome to reference_file variable:
reference_file <- ("Genome_CC7_new/CC7_R6_locked.softmasked.scaffolds.fa")
But while running:
snp.list <- extract_bams(bam_files, vcf_files, sample_names, cores=10, reference_file, coverage = 2)

I got the following error:

Reading reference file
Running sample CC7_25_1
Reading VCF file
Scanning file to determine attributes.
File attributes:
  meta lines: 289
  header_line: 290
  variant count: 36668556
  column count: 10
Meta line 289 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 36668556
  Character matrix gt cols: 10
  skip: 0
  nrows: 36668556
  row_num: 0
Processed variant 30978000
 *** caught segfault ***
address 0x110, cause 'memory not mapped'

Traceback:
 1: .read_body_gz(file, stats = stats, nrows = nrows, skip = skip,     cols = cols, convertNA = as.numeric(convertNA), verbose = as.numeric(verbose))
 2: vcfR::read.vcfR(vcfFiles[samp], verbose = verbose)
 3: vcfR::getFIX(vcfR::read.vcfR(vcfFiles[samp], verbose = verbose))
 4: FUN(X[[i]], ...)
 5: lapply(sample_list, function(samp) {    message(sprintf("Running sample %s", sampleNames[samp]))    if (verbose)         message("Reading VCF file", appendLF = TRUE)    vcf <- vcfR::getFIX(vcfR::read.vcfR(vcfFiles[samp], verbose = verbose))    bam.file <- bamFiles[samp]    if (verbose)         message("Extracting methylation per SNP", appendLF = TRUE)    snp.table <- parallel::mcmapply(function(t, u, v) {        snp <- GRanges(seqnames = t, IRanges(start = as.integer(u),             width = 1))        if (length(grep(t, c(as.character(1:22), "X", "Y"))) ==             0) {            if (verbose)                 message("Bad chrom", appendLF = TRUE)            return(NULL)        }        chrom <- t        params <- Rsamtools::ScanBamParam(tag = c("MD", "XM",             "XR", "XG"), which = snp)        alns.pairs <- readGAlignmentPairs(bam.file, use.names = TRUE,             param = params)        alns <- unlist(alns.pairs)        split <- splitReads(alns, v, snp)        alt.reads <- split$alt.reads        ref.reads <- split$ref.reads        if (length(ref.reads) <= 1 | length(alt.reads) <= 1) {            if (verbose)                 message("0 or 1 read in one or both alleles",                   appendLF = TRUE)            return(NULL)        }        left <- min(start(alns))        right <- max(end(alns))        window <- GRanges(GenomeInfoDb::seqnames(snp), IRanges(left,             right))        dna <- Rsamtools::scanFa(fa, param = window)        cgsite <- stringr::str_locate_all(dna, "CG")[[1]][, 1]        if (length(cgsite) < 1) {            if (verbose)                 message("No CpG sites associated to this SNP",                   appendLF = TRUE)            return(NULL)        }        mepos <- cgsite/Biostrings::nchar(dna)        conversion <- vapply(alns.pairs, function(x) {            XGcondition <- mcols(x)$XG[1]            if (XGcondition == "GA") {                cgsite <- cgsite + 1            }            conversion <- matrix(NA, 1, length(cgsite))            read.start <- start(x) - left + 1            read.end <- end(x) - left + 1            for (pair in 1:2) {                something <- mcols(x[pair])$MD                tag <- getMD(something)                MDtag <- tag$MDtag                nucl.num <- tag$nucl.num                count <- 0                this.mepos <- cgsite - read.start[pair] + 1                for (p in seq_along(this.mepos)) {                  if (is.na(conversion[1, p])) {                    if (cgsite[p] > read.end[pair] || cgsite[p] <                       read.start[pair]) {                      conversion[1, p] <- NA                      next                    }                    else {                      for (i in seq_along(nucl.num)) {                        count <- count + nucl.num[i]                        if (count >= this.mepos[p]) {                          count <- 0                          break                        }                      }                      if (MDtag[i] %in% c("C", "G")) {                        conversion[1, p] <- 1                      }                      else {                        conversion[1, p] <- 2                      }                    }                  }                  else {                    next                  }                }            }            return(conversion)        }, double(length(cgsite)))        if (is.null(dim(conversion))) {            ref.conv <- conversion[names(conversion) %in% ref.reads]            alt.conv <- conversion[names(conversion) %in% alt.reads]            ref.cov <- sum(!is.na(ref.conv))            alt.cov <- sum(!is.na(alt.conv))            ref.meth <- sum(!is.na(ref.conv) & ref.conv == 2)            alt.meth <- sum(!is.na(alt.conv) & alt.conv == 2)        }        else {            ref.conv <- conversion[, colnames(conversion) %in%                 ref.reads]            alt.conv <- conversion[, colnames(conversion) %in%                 alt.reads]            ref.cov <- BiocGenerics::rowSums(!is.na(ref.conv))            alt.cov <- BiocGenerics::rowSums(!is.na(alt.conv))            ref.meth <- BiocGenerics::rowSums(!is.na(ref.conv) &                 ref.conv == 2)            alt.meth <- BiocGenerics::rowSums(!is.na(alt.conv) &                 alt.conv == 2)        }        GR <- GRanges(gsub("chr", "", chrom), IRanges(start = left +             cgsite, width = 1))        mcols(GR)$cov.ref <- ref.cov        mcols(GR)$cov.alt <- alt.cov        mcols(GR)$meth.ref <- ref.meth        mcols(GR)$meth.alt <- alt.meth        mcols(GR)$snp <- paste0(GenomeInfoDb::seqnames(snp),             ".", start(snp))        filt <- BiocGenerics::rowSums(cbind(GR$cov.ref, GR$cov.alt) >=             coverage) >= 2        if (sum(filt) < 2) {            if (verbose)                 message("No CpG sites sufficiently covered at this SNP",                   appendLF = TRUE)            return(NULL)        }        gr <- GR[filt]        return(gr)    }, t = vcf[, 1], u = vcf[, 2], v = vcf[, 4], SIMPLIFY = F,         USE.NAMES = F, mc.cores = cores)    message(sprintf("Done with sample %s", sampleNames[samp]))    return(snp.table)})
 6: extract_bams(bam_files, vcf_files, sample_names, reference_file,     coverage = 2)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 

What is the possible mistake I am doing?

Thanks in advance, Kashif

Hi @kashiff007, sorry for the late response, I wasn't getting any emails for some reason.

Yes for the first error, you basically have to provide the location of the reference file you used to do the alignment.

From the second error, it seems that you are using an older version of the package. Could you please install directly from github: BiocManager::install("markrobinsonuzh/DAMEfinder")? or wait until the release version on bioconductor is updated to 1.0.3 https://bioconductor.org/packages/release/bioc/html/DAMEfinder.html, which should happen today or tomorrow.

Thank for the reply @sorjuela
I have installed 1.0.3 version directly from github. For this, I have to update my R to 4.0.0 and bioconductor to 3.11.

All steps for SNP-based methods went well and after running:
snp.list <- extract_bams(bam_files, vcf_files, sample_names, reference_file, coverage = 2, cores=30)

following verbose printed:

Reading reference file
Running sample CC7_25_1
Reading VCF file
Extracting methylation per SNP

No CpG sites associated to this SNP
No CpG sites associated to this SNP
No CpG sites sufficiently covered at this SNP
No CpG sites sufficiently covered at this SNP
Bad chrom
Bad chrom
continues like this...

I am not sure whats the meaning of 'No CpG sites associated to this SNP', 'No CpG sites sufficiently covered at this SNP' and 'Bad chrom'. Because after this step completes snp.list$CC7_25_1[[1]] shows NULL (in every case its NULL). I am wondering whether I am directing towards right path. Could you comment on this?

I have made changes to attain same chromosome name style in my BAM, BAM index, vcf and genome.

Hi @kashiff007

Can you post the output from sessionInfo() please?

Here it is.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.0.2

And for DAMEfinder

> sessionInfo(package = "DAMEfinder")
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
character(0)

other attached packages:
[1] DAMEfinder_1.1.3

loaded via a namespace (and not attached):
 [1] pillar_1.4.6                compiler_4.0.2             
 [3] GenomeInfoDb_1.24.2         XVector_0.28.0             
 [5] methods_4.0.2               bitops_1.0-6               
 [7] utils_4.0.2                 tools_4.0.2                
 [9] grDevices_4.0.2             zlibbioc_1.34.0            
[11] lifecycle_0.2.0             tibble_3.0.3               
[13] gtable_0.3.0                lattice_0.20-41            
[15] pkgconfig_2.0.3             rlang_0.4.7                
[17] Matrix_1.2-18               DelayedArray_0.14.1        
[19] parallel_4.0.2              GenomeInfoDbData_1.2.3     
[21] stringr_1.4.0               dplyr_1.0.2                
[23] generics_0.0.2              Biostrings_2.56.0          
[25] vctrs_0.3.4                 graphics_4.0.2             
[27] S4Vectors_0.26.1            datasets_4.0.2             
[29] stats_4.0.2                 IRanges_2.22.2             
[31] stats4_4.0.2                grid_4.0.2                 
[33] tidyselect_1.1.0            glue_1.4.2                 
[35] base_4.0.2                  Biobase_2.48.0             
[37] R6_2.4.1                    BiocParallel_1.22.0        
[39] ggplot2_3.3.2               purrr_0.3.4                
[41] magrittr_1.5                Rsamtools_2.4.0            
[43] scales_1.1.1                ellipsis_0.3.1             
[45] matrixStats_0.57.0          BiocGenerics_0.34.0        
[47] GenomicAlignments_1.24.0    GenomicRanges_1.40.0       
[49] SummarizedExperiment_1.18.2 colorspace_1.4-1           
[51] stringi_1.5.3               RCurl_1.98-1.2             
[53] munsell_0.5.0               crayon_1.3.4               
Reading reference file
Running sample CC7_25_1
Reading VCF file
Extracting methylation per SNP

No CpG sites associated to this SNP
No CpG sites associated to this SNP
No CpG sites sufficiently covered at this SNP
No CpG sites sufficiently covered at this SNP
Bad chrom
Bad chrom
continues like this...

I think you are still running and old version of the package, this Bad chrom warning was removed in the latest version, so I'm a bit confused. If you load the package, and then run sessionInfo() is it the same?

I have again updated the DAMEfinder and now the 'Bad chrom' is gone. And now only 'No CpG sites associated to this SNP' and 'No CpG sites sufficiently covered at this SNP' are appearing on the screen. Does it mean no error and I am moving in right direction?

yes, you can ignore the messages, it's basically going through each SNP and checking if that SNP has a CpG in the same read pair. Might take some time to finish depending on the number of SNPs in your VCF file.