crisprVerse/screenCounter

unknown base "N" error.

Closed this issue · 6 comments

Hello,

Wonderful tool; it's been working very well with some of my subsetted FASTQ files.

I tried running it on my NGS data. However, I'm getting this error:

Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error:
Error in eval(expr, envir, enclos): unknown base 'N'

I'm thinking it has to do with some N bases in my sequences. Interestingly, a smaller FASTQ file with similar
sequences works.
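
One way to test the suspicion about N bases is to count them per read. This is only a minimal sketch: `count_n_bases` is a hypothetical helper, not part of screenCounter, and it relies on `readDNAStringSet` and `alphabetFrequency` from Biostrings. It loads the whole file into memory, so for a large FASTQ file it may be better to read a chunk at a time with the `nrec` and `skip` arguments of `readDNAStringSet`.

```r
library(Biostrings)

# Hypothetical helper (not part of screenCounter): counts the N's in each
# read of a FASTQ file. The whole file is read into memory at once.
count_n_bases <- function(path) {
    reads <- readDNAStringSet(path, format="fastq")
    n.per.read <- alphabetFrequency(reads)[, "N"]
    names(n.per.read) <- names(reads)
    n.per.read
}

# Small self-contained example: two clean reads and one read with two N's.
example.reads <- DNAStringSet(c(r1="ACGTACGT", r2="ACNNACGT", r3="TTTTACGT"))
tmp <- tempfile(fileext=".fastq")
writeXStringSet(example.reads, filepath=tmp, format="fastq")

n.counts <- count_n_bases(tmp)
which(n.counts > 0) # reads that contain at least one N
```

If the full file has reads with N's and the working subset has none, that would be consistent with the error coming from the N bases.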

Let me know if you have any thoughts on how to fix this. Thanks for developing this tool.

William

LTLA commented

Hm. I thought I fixed this in the last release cycle:

library(screenCounter)
library(Biostrings)

# Creating an example single barcode sequencing experiment.
known.pool <- c("AGAGAGAGA", "CTCTCTCTC",
    "GTGTGTGTG", "CACACACAC")

# Adding some N's to the constant regions of the sequence data.
N <- 1000
barcodes <- sprintf("CAGCTANNCGTACG%sCCAGCTCGANNTCG",
    sample(known.pool, N, replace=TRUE))
names(barcodes) <- seq_len(N)

tmp <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(barcodes), filepath=tmp, format="fastq")

# Counting the barcodes. The N's in the template mark the variable
# barcode region, not N bases in the reads.
countSingleBarcodes(tmp, choices=known.pool,
    template="CGTACGNNNNNNNNNCCAGCTC")
## DataFrame with 4 rows and 2 columns
##       choices    counts
##   <character> <integer>
## 1   AGAGAGAGA       270
## 2   CTCTCTCTC       224
## 3   GTGTGTGTG       262
## 4   CACACACAC       244

Make sure you're running the latest version (1.2.0) from Bioconductor.
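
To check which version is installed, something like the following sketch could be used. `has_min_version` is a hypothetical helper written for illustration; `requireNamespace`, `packageVersion` and `package_version` are standard R functions, and `BiocManager::install` is the usual way to install Bioconductor packages. Which version you get depends on the Bioconductor release that matches your R version.

```r
# Hypothetical helper: TRUE if `pkg` is installed at version `minimum` or later.
has_min_version <- function(pkg, minimum) {
    requireNamespace(pkg, quietly=TRUE) &&
        utils::packageVersion(pkg) >= package_version(minimum)
}

if (!has_min_version("screenCounter", "1.2.0")) {
    message("screenCounter is missing or older than 1.2.0; ",
        "consider BiocManager::install(\"screenCounter\")")
}
```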

LTLA commented

> One last question: is it possible to extract the read ID for each barcode?

Currently not, it's all aggregated in the underlying C++ libraries.

I suppose we could report the read names associated with each barcode, but that could use an awful lot of memory for a deeply sequenced experiment. There may or may not be a better way to do what you actually want to do.
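
In the meantime, a rough workaround outside screenCounter might look like the sketch below. It is not how the package counts barcodes: `read_ids_by_barcode` is a hypothetical helper that uses an exact regular-expression match on the forward strand only, and it holds every read in memory, which is exactly the cost mentioned above. The flanking sequences are taken from the template in the earlier example.

```r
library(Biostrings)

# Hypothetical helper, not part of screenCounter: returns the read names
# for each barcode found between two constant flanking sequences.
# Exact matching only, forward strand only, all reads held in memory.
read_ids_by_barcode <- function(path, left, right, barcode.len) {
    reads <- readDNAStringSet(path, format="fastq")
    seqs <- as.character(reads)
    pattern <- sprintf("%s([ACGT]{%d})%s", left, barcode.len, right)
    hits <- regmatches(seqs, regexec(pattern, seqs))
    found <- lengths(hits) == 2L # full match plus one captured barcode
    barcode <- vapply(hits[found], function(h) h[2], character(1))
    split(names(reads)[found], barcode)
}

# Small self-contained example with three reads and two barcodes.
known.pool <- c("AGAGAGAGA", "CTCTCTCTC")
example.reads <- sprintf("CAGCTANNCGTACG%sCCAGCTCGANNTCG", known.pool[c(1, 2, 1)])
names(example.reads) <- c("read1", "read2", "read3")
tmp <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(example.reads), filepath=tmp, format="fastq")

ids <- read_ids_by_barcode(tmp, left="CGTACG", right="CCAGCTC", barcode.len=9L)
ids
```

The result is a list with one character vector of read names per observed barcode, so its size grows with the number of reads rather than the number of barcodes.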

Hello,

Just to let you know that this fixed it for me.

Works like a charm.

William

LTLA commented

Ok, great.

As for the other question: when you have more clarity on the nature of the problem, make another issue and we can see what we can do. It may be possible to adapt the C++ code underlying the countCombinatorialBarcodes function so that it captures the combination of genotype with a random barcode (assuming that we're dealing with a simple SNP).