lh3/ropebwt2

potential typo in index building

alfredsimkin opened this issue · 8 comments

During index building of DNA sequencing reads, I'm getting the following message:

symbol counts: ($, A, C, G, T, N) = (47280160, 898907741, 853208574, 870771069, 607635, 901457172).

If true, this would mean that T counts are 607635 (6 digits), whereas the other canonical nucleotides (e.g. 'A') are 9 digits, or roughly a 1,000 fold difference. I seem to get this depletion reliably, regardless of input dataset, and for downstream applications it seems to work fine (with no noticeable depletion of T's). If the numbers are sorted lexicographically (as I suspect), then things might make more sense, with symbol counts ($, A, C, G, N, T) (N and T reversed because N is before T in the alphabet). In that case, it would make sense that N (a noncanonical nucleotide) would occur much more rarely than the 4 canonical nucleotides.

I therefore suspect that "symbol counts: ($, A, C, G, T, N)" should be amended to read "symbol counts: ($, A, C, G, N, T)"

Alternatively, is there any other explanation for why normal DNA sequencing reads should be so greatly depleted in 'T' counts?

lh3 commented

What is the command line you are using?

lh3 commented

The code is correct. Have you checked if your input has many Ns?

Here is the attached file for reference
fake_reads.txt

Or typed out:

AATTCCC
CAGGACT
CNTAGGC
GACTAAC
TACGCAA

lh3 commented

This is not a bug. When you run tr NT TN, you have replaced N with T and replaced T with N. I should have noticed this earlier.