potential typo in index building

Question

alfredsimkin opened this issue 4 years ago · 8 comments

During index building of DNA sequencing reads, I'm getting the following message:

symbol counts: ($, A, C, G, T, N) = (47280160, 898907741, 853208574, 870771069, 607635, 901457172).

If this is accurate, the T count is 607635 (6 digits), whereas the other canonical nucleotides (e.g. 'A') have 9-digit counts, roughly a 1,000-fold difference. I see this depletion reliably, regardless of input dataset, yet downstream applications seem to work fine (with no noticeable depletion of T's). If the counts are actually reported in alphabetical order (as I suspect), things would make more sense: the symbols would be ($, A, C, G, N, T), with N and T reversed because N comes before T in the alphabet. In that case N (a noncanonical nucleotide) would occur far more rarely than the four canonical nucleotides, as expected.

I therefore suspect that "symbol counts: ($, A, C, G, T, N)" should be amended to read "symbol counts: ($, A, C, G, N, T)".

Alternatively, is there any other explanation for why normal DNA sequencing reads should be so greatly depleted in 'T' counts?

Answer 1 · 2021-02-12T20:47:19.000Z

What is the command line you are using?

Answer 2 · 2021-02-12T20:57:01.000Z

The full command line I'm using (as part of an index building process for fmlrc2) is this:

cat all_reads_sorted.txt | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert comp_msbwt.npy

Input file is a bunch of alphabetically sorted reads (one read per line) saved with unix line endings and no other text. I adapted these commands from this website: https://github.com/HudsonAlpha/rust-fmlrc#msbwt-building

Answer 3 · 2021-02-12T21:00:47.000Z

The code is correct. Have you checked if your input has many Ns?

Answer 4 · 2021-02-12T23:01:12.000Z

I will write a python script to double check my A, C, G, T, and N counts. I will also check to see if a sample small input file with only a few reads generates correct numbers or not.

Answer 5 · 2021-02-14T19:43:45.000Z

I have verified that there does seem to be a bug in the code, and that the error behaves the way I described in my original bug report. When I run this command:

cat fake_reads.txt | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert indexed_fake_reads.npy

on the attached file, the terminal displays the following message:

[M::main_ropebwt2] symbol counts: ($, A, C, G, T, N) = (5, 11, 11, 6, 1, 6)

even though the attached text file actually has 5 reads (correct), 11 A's (correct), 11 C's (correct), 6 G's (correct), 6 T's (reported as 1) and 1 N (reported as 6).

Answer 6 · 2021-02-14T19:47:17.000Z

Here is the attached file for reference
fake_reads.txt

Or typed out:

AATTCCC
CAGGACT
CNTAGGC
GACTAAC
TACGCAA
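
For anyone who wants to double check the true counts independently of ropebwt2, here is a minimal Python sketch. It is not part of ropebwt2 or fmlrc2; the helper name `count_symbols` is made up for illustration, and the five reads are simply the ones typed out above.

```python
from collections import Counter

def count_symbols(lines):
    """Return (number of reads, per-symbol counts) for one-read-per-line input.

    Hypothetical helper for illustration only; not part of ropebwt2 or fmlrc2.
    """
    counts = Counter()
    n_reads = 0
    for line in lines:
        read = line.strip().upper()
        if not read:
            continue  # skip blank lines
        n_reads += 1
        counts.update(read)  # count every character of the read
    return n_reads, {symbol: counts.get(symbol, 0) for symbol in "ACGTN"}

# The five reads from fake_reads.txt, as typed out in this thread.
reads = ["AATTCCC", "CAGGACT", "CNTAGGC", "GACTAAC", "TACGCAA"]

n_reads, counts = count_symbols(reads)
print(n_reads, counts)
```

To run it on a real file, an open file handle can be passed instead of the list, since the function only iterates over lines.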

Answer 7 · 2021-02-14T20:58:18.000Z

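This is not a bug. When you run tr NT TN, you replace every N with T and every T with N before the reads reach ropebwt2. ropebwt2 therefore counts the swapped input, so the T and N counts it reports are exchanged relative to your original file. I should have noticed this earlier.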
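
The effect of the swap can be reproduced without ropebwt2. The sketch below assumes only that `tr NT TN` maps N to T and T to N character by character, and mimics that with Python's `str.translate`; the `tally` helper is hypothetical and the reads are the five from fake_reads.txt.

```python
from collections import Counter

# Same mapping as `tr NT TN`: N -> T and T -> N, one character at a time.
SWAP_N_T = str.maketrans("NT", "TN")

# The five reads from fake_reads.txt, as typed out in this thread.
reads = ["AATTCCC", "CAGGACT", "CNTAGGC", "GACTAAC", "TACGCAA"]

# What the index builder receives after the first `tr NT TN`.
swapped = [read.translate(SWAP_N_T) for read in reads]

# Applying the same swap a second time undoes it.
restored = [read.translate(SWAP_N_T) for read in swapped]

def tally(seqs):
    """Return counts in (A, C, G, T, N) order. Hypothetical helper."""
    counts = Counter("".join(seqs))
    return tuple(counts.get(symbol, 0) for symbol in "ACGTN")

print("original (A, C, G, T, N):", tally(reads))
print("swapped  (A, C, G, T, N):", tally(swapped))
print("round trip restores input:", restored == reads)
```

The swapped tally has 1 T and 6 N's, which matches the (5, 11, 11, 6, 1, 6) message once the 5 read terminators ($) are included.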

Answer 8 · 2021-02-14T22:49:49.000Z

Thank you! My apologies. My understanding of tr was incomplete. I will try to figure out why the pipeline I'm following (fmlrc2) incorporates this Unix command. It seems that they turn T's into N's and N's into T's, run ropebwt2, and then reverse the swap. https://github.com/HudsonAlpha/rust-fmlrc#usage Thank you for your help.
