gear-genomics/tracy

Support for generating consensus sequence from forward and reverse sanger

blex-max opened this issue · 7 comments

Hi,

Thanks for maintaining Tracy!

There are remarkably few (reliable) programs available for generating consensus sequences from forward and reverse Sanger data! Luckily, Tracy exists; I'm generating consensus sequences of fungal ITS from Sanger forward and reverse seqs in the following way:

# basecall the forward seq for use as reference
tracy basecall -f fasta -o ref.fa forward.ab1

# assemble reverse using forward seq as reference
tracy assemble -t 4 -d 1 -r ref.fa -o con reverse.ab1

Then I'm extracting the gap free consensus sequence from the output JSON. I am assembling with -d 1 on the understanding that it will ensure the consensus sequence only includes bases where the reads match.

This works so far, but it feels a bit hacky and there are some niceties which would be extremely helpful. Most importantly, in situations such as A - N it would be nice to be able to take A into the consensus sequence rather than dropping it when -d 1. It would also be nice to be able to output the consensus sequence directly. What do you think?

Thanks for the suggestion. I think this forward - reverse case is indeed common enough to warrant a new subcommand in tracy that solves this issue. I will work on that for the next release.

Thanks! That would be a good enhancement I think. I've actually since realised that if I set -f 0.5 I (think) don't get Ns in the gap free consensus sequence and I get a longer (less end-gaps) sequence so that is what I am now doing. The documentation is a bit sparse here, on -f and -t etc., I think a little more doc would really help also. If you wanted any help with doc stuff I'd be happy to assist.

Just finished a beta release of a pairwise consensus mode which outputs the alignment (FASTA align and Clustal format), consensus sequence (FASTA and FASTQ) and the trace information (as for tracy basecall).

tracy consensus fwd.ab1 rev.ab1

Could you please do some testing and see what is missing? By default, the consensus is the union of the forward and reverse trace. With -i it's the intersection (only the overlapping part). Trimming can be done automatically or by specifying the left and right trimming parameters. By default, the consensus uses the best base call, with -a you get IUPAC codes (for best and 2nd best base call). For testing, I compiled a static binary that you can download temporarily from here.

Great, testing now. Thank you.

This seems to be exactly what I need so far. It also seems to have greater success generating consensus sequences compared to using the assemble subcommand - I don't know if that's to be expected.

Two suggestions:
i) it would be nice to have more control over output so different outputs can be requested/suppressed individually and directed to different locations
ii) I would like to be able to provide an ID for the consensus sequence; having them labelled as '@consensus' etc makes merging multiple outputs inconvenient.

I could probably do these on my own fork if needed, let me know how much time you have available for this.

Ah I suspect the increased success is because I now have finer control over trimming - previously trimming was quite opaque, now being able to specify (in my case just 0) with -q/u/r/s is really useful

Thanks, you can now specify a label instead of 'consensus' with -b. For the output files, it should be easy to script the renaming/moving/deleting and the files are tiny so I think I stay with the simple -o outprefix option. I released a new version tracy v0.7.1 with the consensus subcommand, just open a new issue if there are any problems.