potential typo in index building
alfredsimkin opened this issue · 8 comments
During index building of DNA sequencing reads, I'm getting the following message:
symbol counts: ($, A, C, G, T, N) = (47280160, 898907741, 853208574, 870771069, 607635, 901457172).
If true, this would mean that T counts are 607635 (6 digits), whereas the other canonical nucleotides (e.g. 'A') are 9 digits, or roughly a 1,000 fold difference. I seem to get this depletion reliably, regardless of input dataset, and for downstream applications it seems to work fine (with no noticeable depletion of T's). If the numbers are sorted lexicographically (as I suspect), then things might make more sense, with symbol counts ($, A, C, G, N, T) (N and T reversed because N is before T in the alphabet). In that case, it would make sense that N (a noncanonical nucleotide) would occur much more rarely than the 4 canonical nucleotides.
I therefore suspect that "symbol counts: (
Alternatively, is there any other explanation for why normal DNA sequencing reads should be so greatly depleted in 'T' counts?
What is the command line you are using?
The code is correct. Have you checked if your input has many Ns?
Here is the attached file for reference
fake_reads.txt
Or typed out:
AATTCCC
CAGGACT
CNTAGGC
GACTAAC
TACGCAA
This is not a bug. When you run tr NT TN
, you have replaced N with T and replaced T with N. I should have noticed this earlier.