Having problem in loading full genome in reference_file
kashiff007 opened this issue · 8 comments
Hi,
I know from vignette how to upload one chr/scaffold into reference_file
reference_file <- DNAStringSet(genome[[19]], use.names = TRUE)
But having problem in uploading whole genome fasta file containing multiple chr/scaffolds.
My genome looks like:
Aiptasia genome:
# organism: Aiptasia CC7 (Aiptasia)
# provider: RSRC
# provider version: CC7
# release date: 2020
# release name: CC7.2
# 262 sequences:
# CC7scaffold1 CC7scaffold2 CC7scaffold3 CC7scaffold4 CC7scaffold5
# CC7scaffold6 CC7scaffold7 CC7scaffold8 CC7scaffold9 CC7scaffold10
# CC7scaffold11 CC7scaffold12 CC7scaffold13 CC7scaffold14 CC7scaffold15
# CC7scaffold16 CC7scaffold17 CC7scaffold18 CC7scaffold19 CC7scaffold20
# CC7scaffold21 CC7scaffold22 CC7scaffold23 CC7scaffold24 CC7scaffold25
# ... ... ... ... ...
# CC7scaffold241 CC7scaffold242 CC7scaffold243 CC7scaffold244 CC7scaffold245
# CC7scaffold246 CC7scaffold247 CC7scaffold248 CC7scaffold249 CC7scaffold250
# CC7scaffold251 CC7scaffold252 CC7scaffold253 CC7scaffold254 CC7scaffold255
# CC7scaffold256 CC7scaffold257 CC7scaffold258 CC7scaffold259 CC7scaffold260
# CC7scaffold261 CC7scaffold262
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
I also looked into manual of extract_bams, but couldn't found the solution.
Can you provide a solution/command by which I can perform this operation.
Thank you,
Kashif
I solve the this problem by just assigning the genome to reference_file variable:
reference_file <- ("Genome_CC7_new/CC7_R6_locked.softmasked.scaffolds.fa")
But while running:
snp.list <- extract_bams(bam_files, vcf_files, sample_names, cores=10, reference_file, coverage = 2)
I got the following error:
Reading reference file
Running sample CC7_25_1
Reading VCF file
Scanning file to determine attributes.
File attributes:
meta lines: 289
header_line: 290
variant count: 36668556
column count: 10
Meta line 289 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 36668556
Character matrix gt cols: 10
skip: 0
nrows: 36668556
row_num: 0
Processed variant 30978000
*** caught segfault ***
address 0x110, cause 'memory not mapped'
Traceback:
1: .read_body_gz(file, stats = stats, nrows = nrows, skip = skip, cols = cols, convertNA = as.numeric(convertNA), verbose = as.numeric(verbose))
2: vcfR::read.vcfR(vcfFiles[samp], verbose = verbose)
3: vcfR::getFIX(vcfR::read.vcfR(vcfFiles[samp], verbose = verbose))
4: FUN(X[[i]], ...)
5: lapply(sample_list, function(samp) { message(sprintf("Running sample %s", sampleNames[samp])) if (verbose) message("Reading VCF file", appendLF = TRUE) vcf <- vcfR::getFIX(vcfR::read.vcfR(vcfFiles[samp], verbose = verbose)) bam.file <- bamFiles[samp] if (verbose) message("Extracting methylation per SNP", appendLF = TRUE) snp.table <- parallel::mcmapply(function(t, u, v) { snp <- GRanges(seqnames = t, IRanges(start = as.integer(u), width = 1)) if (length(grep(t, c(as.character(1:22), "X", "Y"))) == 0) { if (verbose) message("Bad chrom", appendLF = TRUE) return(NULL) } chrom <- t params <- Rsamtools::ScanBamParam(tag = c("MD", "XM", "XR", "XG"), which = snp) alns.pairs <- readGAlignmentPairs(bam.file, use.names = TRUE, param = params) alns <- unlist(alns.pairs) split <- splitReads(alns, v, snp) alt.reads <- split$alt.reads ref.reads <- split$ref.reads if (length(ref.reads) <= 1 | length(alt.reads) <= 1) { if (verbose) message("0 or 1 read in one or both alleles", appendLF = TRUE) return(NULL) } left <- min(start(alns)) right <- max(end(alns)) window <- GRanges(GenomeInfoDb::seqnames(snp), IRanges(left, right)) dna <- Rsamtools::scanFa(fa, param = window) cgsite <- stringr::str_locate_all(dna, "CG")[[1]][, 1] if (length(cgsite) < 1) { if (verbose) message("No CpG sites associated to this SNP", appendLF = TRUE) return(NULL) } mepos <- cgsite/Biostrings::nchar(dna) conversion <- vapply(alns.pairs, function(x) { XGcondition <- mcols(x)$XG[1] if (XGcondition == "GA") { cgsite <- cgsite + 1 } conversion <- matrix(NA, 1, length(cgsite)) read.start <- start(x) - left + 1 read.end <- end(x) - left + 1 for (pair in 1:2) { something <- mcols(x[pair])$MD tag <- getMD(something) MDtag <- tag$MDtag nucl.num <- tag$nucl.num count <- 0 this.mepos <- cgsite - read.start[pair] + 1 for (p in seq_along(this.mepos)) { if (is.na(conversion[1, p])) { if (cgsite[p] > read.end[pair] || cgsite[p] < read.start[pair]) { conversion[1, p] <- NA next } else { for (i in seq_along(nucl.num)) { count <- count + nucl.num[i] if (count >= this.mepos[p]) { count <- 0 break } } if (MDtag[i] %in% c("C", "G")) { conversion[1, p] <- 1 } else { conversion[1, p] <- 2 } } } else { next } } } return(conversion) }, double(length(cgsite))) if (is.null(dim(conversion))) { ref.conv <- conversion[names(conversion) %in% ref.reads] alt.conv <- conversion[names(conversion) %in% alt.reads] ref.cov <- sum(!is.na(ref.conv)) alt.cov <- sum(!is.na(alt.conv)) ref.meth <- sum(!is.na(ref.conv) & ref.conv == 2) alt.meth <- sum(!is.na(alt.conv) & alt.conv == 2) } else { ref.conv <- conversion[, colnames(conversion) %in% ref.reads] alt.conv <- conversion[, colnames(conversion) %in% alt.reads] ref.cov <- BiocGenerics::rowSums(!is.na(ref.conv)) alt.cov <- BiocGenerics::rowSums(!is.na(alt.conv)) ref.meth <- BiocGenerics::rowSums(!is.na(ref.conv) & ref.conv == 2) alt.meth <- BiocGenerics::rowSums(!is.na(alt.conv) & alt.conv == 2) } GR <- GRanges(gsub("chr", "", chrom), IRanges(start = left + cgsite, width = 1)) mcols(GR)$cov.ref <- ref.cov mcols(GR)$cov.alt <- alt.cov mcols(GR)$meth.ref <- ref.meth mcols(GR)$meth.alt <- alt.meth mcols(GR)$snp <- paste0(GenomeInfoDb::seqnames(snp), ".", start(snp)) filt <- BiocGenerics::rowSums(cbind(GR$cov.ref, GR$cov.alt) >= coverage) >= 2 if (sum(filt) < 2) { if (verbose) message("No CpG sites sufficiently covered at this SNP", appendLF = TRUE) return(NULL) } gr <- GR[filt] return(gr) }, t = vcf[, 1], u = vcf[, 2], v = vcf[, 4], SIMPLIFY = F, USE.NAMES = F, mc.cores = cores) message(sprintf("Done with sample %s", sampleNames[samp])) return(snp.table)})
6: extract_bams(bam_files, vcf_files, sample_names, reference_file, coverage = 2)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:
What is the possible mistake I am doing?
Thanks in advance, Kashif
Hi @kashiff007, sorry for the late response, I wasn't getting any emails for some reason.
Yes for the first error, you basically have to provide the location of the reference file you used to do the alignment.
From the second error, it seems that you are using an older version of the package. Could you please install directly from github: BiocManager::install("markrobinsonuzh/DAMEfinder")
? or wait until the release version on bioconductor is updated to 1.0.3 https://bioconductor.org/packages/release/bioc/html/DAMEfinder.html, which should happen today or tomorrow.
Thank for the reply @sorjuela
I have installed 1.0.3 version directly from github. For this, I have to update my R to 4.0.0 and bioconductor to 3.11.
All steps for SNP-based methods went well and after running:
snp.list <- extract_bams(bam_files, vcf_files, sample_names, reference_file, coverage = 2, cores=30)
following verbose printed:
Reading reference file
Running sample CC7_25_1
Reading VCF file
Extracting methylation per SNP
No CpG sites associated to this SNP
No CpG sites associated to this SNP
No CpG sites sufficiently covered at this SNP
No CpG sites sufficiently covered at this SNP
Bad chrom
Bad chrom
continues like this...
I am not sure whats the meaning of 'No CpG sites associated to this SNP', 'No CpG sites sufficiently covered at this SNP' and 'Bad chrom'. Because after this step completes snp.list$CC7_25_1[[1]] shows NULL (in every case its NULL). I am wondering whether I am directing towards right path. Could you comment on this?
I have made changes to attain same chromosome name style in my BAM, BAM index, vcf and genome.
Hi @kashiff007
Can you post the output from sessionInfo()
please?
Here it is.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.2
And for DAMEfinder
> sessionInfo(package = "DAMEfinder")
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
character(0)
other attached packages:
[1] DAMEfinder_1.1.3
loaded via a namespace (and not attached):
[1] pillar_1.4.6 compiler_4.0.2
[3] GenomeInfoDb_1.24.2 XVector_0.28.0
[5] methods_4.0.2 bitops_1.0-6
[7] utils_4.0.2 tools_4.0.2
[9] grDevices_4.0.2 zlibbioc_1.34.0
[11] lifecycle_0.2.0 tibble_3.0.3
[13] gtable_0.3.0 lattice_0.20-41
[15] pkgconfig_2.0.3 rlang_0.4.7
[17] Matrix_1.2-18 DelayedArray_0.14.1
[19] parallel_4.0.2 GenomeInfoDbData_1.2.3
[21] stringr_1.4.0 dplyr_1.0.2
[23] generics_0.0.2 Biostrings_2.56.0
[25] vctrs_0.3.4 graphics_4.0.2
[27] S4Vectors_0.26.1 datasets_4.0.2
[29] stats_4.0.2 IRanges_2.22.2
[31] stats4_4.0.2 grid_4.0.2
[33] tidyselect_1.1.0 glue_1.4.2
[35] base_4.0.2 Biobase_2.48.0
[37] R6_2.4.1 BiocParallel_1.22.0
[39] ggplot2_3.3.2 purrr_0.3.4
[41] magrittr_1.5 Rsamtools_2.4.0
[43] scales_1.1.1 ellipsis_0.3.1
[45] matrixStats_0.57.0 BiocGenerics_0.34.0
[47] GenomicAlignments_1.24.0 GenomicRanges_1.40.0
[49] SummarizedExperiment_1.18.2 colorspace_1.4-1
[51] stringi_1.5.3 RCurl_1.98-1.2
[53] munsell_0.5.0 crayon_1.3.4
Reading reference file Running sample CC7_25_1 Reading VCF file Extracting methylation per SNP No CpG sites associated to this SNP No CpG sites associated to this SNP No CpG sites sufficiently covered at this SNP No CpG sites sufficiently covered at this SNP Bad chrom Bad chrom continues like this...
I think you are still running and old version of the package, this Bad chrom
warning was removed in the latest version, so I'm a bit confused. If you load the package, and then run sessionInfo()
is it the same?
I have again updated the DAMEfinder and now the 'Bad chrom' is gone. And now only 'No CpG sites associated to this SNP' and 'No CpG sites sufficiently covered at this SNP' are appearing on the screen. Does it mean no error and I am moving in right direction?
yes, you can ignore the messages, it's basically going through each SNP and checking if that SNP has a CpG in the same read pair. Might take some time to finish depending on the number of SNPs in your VCF file.