BioJulia/BioSequences.jl

Translation Issue arising from `isambiguous(DNA_Gap) == false`

chelate opened this issue · 2 comments

So I would expect

isambiguous(DNA_Gap) == true

however, in fact "DNA_Gap" is NOT ambiguous. This has some surprising implications, such as that gaps code for alanine.

julia> translate(LongDNA{4}("---"))
1aa Amino Acid Sequence:
A

I'd strongly suggest either making gaps ambiguous, or possibly mapping to an amino acid gap (is there such a thing?). Maybe I'm just bad at bioinformatics, but I think some data does not distinguish between "gaps" (as in deletions) and missing coverage?

Dear @chelate

Thank you for the bug report. DNA_Gap is not ambiguous, since it does not correspond to one of multiple possible nucleotides. In fact, it corresponds to zero nucleotides. It's questionable whether we should even have added gap symbols.

However, the translation of gap nucleotides was definitely a bug. I've fixed this in #278 , after which trying to translate a sequence with gaps will throw an exception. There are two alternative behaviours:

  1. Translate to AA_Gap. However, this only works if the gap corresponds to whole codons, and this is not something we can assume. We could theoretically translate those gap codons to AA_Gap, then error on partial gapped codons, but I think this might be too confusing.
  2. Silently skip gaps when translating. However, this might be unexpected (for example, one would no longer be able to assume that the translated sequence is 1/3rd the size of the input sequence)

Implementing it as throwing an error is more conservative in the sense that it'll always be legal to change it to a non-throwing behaviour later.

Sorry if I'm annoying, but I have some input on this issue. I agree now that "-" is not ambiguous.

It would make much more sense to translate "---" to AA_Gap. I'd argue that "---" is how you'd express AA_Gap in standard code. If you are working with nt and align them in frame, these gaps will show up as "---" and there's really no other way to express them. These are the ubiquitous result of aligning to reference sequences for variable length proteins.

The only ambiguous thing is when your alignment assigns some nucleotides to dangling out of frame like "--G". In my case this has happened and that's how I found this bug. Erroring then is reasonable (, but conservative and annoying in practice). I'd prefer AA_X with a warning like "translating out-of-frame alignment as an ambiguous amino acid AA_X" it just means that somebody allowed out of frame alignments upstream in the pipeline from me, or that there was an insertion at the PCR level, depends what the pipeline looks like.

Sorry I don't know how to use git or I'd pr this myself. shame