Some random reads have the same read ID

Question

JustinChu opened this issue 8 years ago · 4 comments

It looks like random sequences are given a random number between 0 and 255, resulting in duplicate IDs. Some programs break on these sequences. I can parse these sequences to rename them, but wouldn't it make sense for the sequences to be given a larger random ID based on a count?
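For anyone needing the renaming workaround in the meantime, here is a minimal sketch. It assumes strict 4-line FASTQ records (no wrapped sequences) and an optional trailing `/1` or `/2` mate suffix. The function name `rename_duplicate_ids` and the `_dupN` suffix are made up for illustration and are not part of the tool.

```python
from collections import Counter


def rename_duplicate_ids(lines):
    """Yield FASTQ lines, making repeated read IDs unique.

    Assumes strict 4-line records. Headers are located by position
    rather than by a leading "@", because quality lines may also
    start with "@".
    """
    seen = Counter()
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # header line of each record
            base, sep, mate = line.rpartition("/")
            if not sep:  # no "/1" or "/2" suffix present
                base, mate = line, ""
            n = seen[line]
            seen[line] += 1
            if n:  # second and later occurrences get a suffix
                line = f"{base}_dup{n}{sep}{mate}"
        yield line


# Small in-memory example (hypothetical read names).
sample = [
    "@rand_c/1", "ACGT", "+", "IIII",
    "@rand_c/1", "TTGA", "+", "IIII",
]
renamed = list(rename_duplicate_ids(sample))
```

The suffix goes before the mate marker so that paired files stay matched, provided both files are renamed the same way and the duplicates appear in the same order in each.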

Answer 1 · 2016-10-17T04:10:29.000Z

I am not seeing it in the code. Could you print a couple of examples?

Answer 2 · 2016-10-19T00:49:10.000Z

Actually, the numbers may exceed 255, but I'm still seeing duplicates.

grep rand test.bwa.read1.fastq
@rand_0_0_0_0_1_1_0:0:0_0:0:0_c/1
@rand_0_0_0_0_1_1_0:0:0_0:0:0_46/1
@rand_0_0_0_0_1_1_0:0:0_0:0:0_47/1
@rand_0_0_0_0_1_1_0:0:0_0:0:0_4a/1
...

grep @rand_0_0_0_0_1_1_0:0:0_0:0:0_c/1 test.bwa.read1.fastq | wc -l
180
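To see every duplicated ID at once rather than grepping for one name at a time, something like the following could work. It is only a sketch: it assumes strict 4-line FASTQ records, and the function name `duplicate_read_ids` is hypothetical.

```python
from collections import Counter


def duplicate_read_ids(lines):
    """Return {header: count} for headers that occur more than once.

    Assumes strict 4-line FASTQ records, so every 4th line
    (starting with the first) is a header.
    """
    headers = (
        line.rstrip("\n") for i, line in enumerate(lines) if i % 4 == 0
    )
    counts = Counter(headers)
    return {name: n for name, n in counts.items() if n > 1}


# Small in-memory example (hypothetical read names).
sample = [
    "@rand_c/1", "ACGT", "+", "IIII",
    "@rand_46/1", "ACGT", "+", "IIII",
    "@rand_c/1", "ACGT", "+", "IIII",
]
dups = duplicate_read_ids(sample)
```

An open file handle can be passed in place of the list, since the function only iterates over lines.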

Answer 3 · 2016-10-19T02:48:51.000Z

@JustinChu I think I see it. It starts its counter over again for reads within the same chromosome, even random ones. Let me see if I can get to a fix after ASHG, so probably late next week.
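A toy illustration of why a restarting counter would produce these duplicates. This is not the tool's actual code: the function `name_random_reads`, its parameters, and the idea that the hex suffix is a plain counter are all assumptions made for the sketch.

```python
from itertools import count


def name_random_reads(reads_per_contig, restart_per_contig):
    """Name random reads generated while walking through contigs.

    reads_per_contig: number of random reads emitted for each contig.
    restart_per_contig: if True, the counter resets for each contig
    (the suspected bug); if False, one counter runs across all contigs.
    """
    # Random reads share the same prefix, so only the suffix differs.
    prefix = "rand_0_0_0_0_1_1_0:0:0_0:0:0"
    names = []
    counter = count()
    for n in reads_per_contig:
        if restart_per_contig:
            counter = count()  # restart at zero for this contig
        for _ in range(n):
            names.append(f"{prefix}_{next(counter):x}")
    return names


buggy = name_random_reads([3, 3], restart_per_contig=True)
fixed = name_random_reads([3, 3], restart_per_contig=False)
```

With the restart, the two contigs each produce suffixes 0, 1, 2, so `buggy` holds only three distinct names among six reads. With a single counter across contigs, all six names in `fixed` are distinct.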

Answer 4 · 2017-04-27T16:26:15.000Z

@JustinChu better late than never: #30