Base quality and consensus generation
nriddiford opened this issue · 7 comments
First off - thanks for all the great work on tracy. It's quite amazing to me how few tools there are for performing trace file assembly - so thanks for filling this void with a very nice tool!
I have been using tracy quite a bit recently to assemble trace files and perform variant calling relative to a reference sequence. Generally, this seems to work very well using tracy, but I have a question (related to a previous issue) on the interplay between base-call confidence (on the chromatogram) and consensus formation.
I'm seeing incorrect consensus calls being made for a particular base where one of the trace files contains a low-confidence call and the other a high-confidence call. From what I understand (based on your previous explanation) tracy does not use the base quality from the chromatogram, and I guess just chooses one base over the other when there's a disagreement?
Here's what I'm seeing:
This shows 2 trace files in Geneious. When I assemble these using tracy assemble --format fastq --inccons trace1.ab1 trace2.ab1 the resulting consensus contains insertions at both positions highlighted in red. This is strange to me - the base quality in trace 2 is clearly higher than in trace 1. Or is it the case that with insertions in one trace file, there is no base to compare to in the second trace file, so the insertion is included in the consensus, irrespective of quality?
Is this expected behaviour?
Thanks for any help!
Just to add, I've also been wondering about this!
@tobiasrausch forgive me for pinging you, but are you intending to respond to this?
For tracy assemble it's a simple majority vote. If you have, for instance, 3 traces and 2 support a gap (-) and 1 a nucleotide, then the gap (-) is chosen. Ties are arbitrarily broken and tracy assemble does not take into account qualities at the moment. For the pairwise case, tracy consensus does use the qualities but gaps don't have any to begin with. Therefore, for tracy consensus it depends on whether you use -i or not.
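To make the majority vote concrete, here is a minimal Python sketch of a column-wise vote over already-aligned traces. It is not tracy's code: the function names and the toy alignment are made up for illustration, and ties simply fall to whichever symbol was seen first, standing in for the arbitrary tie-breaking described above.

```python
from collections import Counter


def majority_column(column):
    """Return the most common symbol in one alignment column.

    `column` holds one symbol per trace: a nucleotide, or '-' for a gap.
    Base qualities are deliberately ignored, as in the description above.
    On a tie, Counter returns the symbol seen first (an assumption of
    this sketch, not tracy's actual rule).
    """
    return Counter(column).most_common(1)[0][0]


def majority_consensus(aligned):
    """Column-wise majority vote over equal-length aligned strings."""
    called = (majority_column(col) for col in zip(*aligned))
    # A gap that wins the vote contributes nothing to the consensus.
    return "".join(sym for sym in called if sym != "-")


# Hypothetical alignment: two traces support a gap, one an inserted 'A'.
traces = ["ACG-T", "ACGAT", "ACG-T"]
print(majority_consensus(traces))  # ACGT
```

With three traces the gap wins 2:1 and the insertion is dropped. With only two traces every disagreement is a tie, which is where the tie-breaking rule matters.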
@tobiasrausch Thanks for the clarification. How about in cases where you have 2 traces (like the image above)? Is it just a 50:50 chance to incorporate a low-quality insertion?
Indeed, it's a 50:50 chance in theory but in order to make the algorithm deterministic the code currently favours nucleotides over gaps.
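A deterministic tie-break of that kind could look roughly like the Python sketch below. This is only an illustration of the rule as stated in this thread, not tracy's implementation: `call_column` is a hypothetical name, and picking the alphabetically first nucleotide when two nucleotides tie is an extra assumption added here to keep the result reproducible.

```python
from collections import Counter


def call_column(column, gap="-"):
    """Majority vote for one column, preferring nucleotides over gaps on a tie."""
    counts = Counter(column)
    best = max(counts.values())
    tied = [sym for sym, n in counts.items() if n == best]
    # Among the top-scoring symbols, any nucleotide beats the gap.
    # Alphabetical order between tied nucleotides is an assumption of
    # this sketch, chosen only to make the outcome deterministic.
    nucleotides = sorted(sym for sym in tied if sym != gap)
    return nucleotides[0] if nucleotides else gap


print(call_column(["A", "-"]))       # A  (tie: nucleotide wins)
print(call_column(["-", "-", "A"]))  # -  (clear majority for the gap)
```

For two traces this means an insertion present in only one trace always ends up in the consensus, regardless of its quality, which matches the behaviour reported at the top of this issue.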
Is this likely to change? As @nriddiford says, it seems a shame to have a 50% chance to incorporate a low quality base when the information is available to make the better call.
I think tracy consensus in tracy v0.7.5 now properly handles the low-quality vs. high-quality base problem but low-quality insertions vs. gaps is still something I need to work on. Do you have some example traces that you can share with me where you think the insertion is incorrect? Thanks.