BioJulia/SequenceVariation.jl

[Bug]: `Variation`s from different starting places give incorrect positions

MillironX opened this issue · 0 comments

This bug is best explained by an example. Consider a conserved substitution at a reference position on two reads. These reads start at different positions in the reference sequence. We would expect that the Variations comprising the Variants of these reads would be the same.

Steps to reproduce

using BioAlignments, BioSequences, SequenceVariation

refseq = dna"ACAACTTTATCT"
mutseq = dna"ACATCTTTATCT"

@show PairwiseAlignment(AlignedSequence(mutseq, refseq), refseq)

read01 = AlignedSequence(mutseq[1:10], Alignment("10M", 1, 1))
read02 = AlignedSequence(mutseq[3:12], Alignment("10M", 1, 3))

aln01 = PairwiseAlignment(read01, refseq)
aln02 = PairwiseAlignment(read02, refseq)

@show Variant(aln01)
@show Variant(aln02)

Results

PairwiseAlignment(AlignedSequence(mutseq, refseq), refseq) = PairwiseAlignment{LongDNASeq, LongDNASeq}:
  seq:  1 ACATCTTTATCT 12
          ||| ||||||||
  ref:  1 ACAACTTTATCT 12

Variant(aln01) = Variant{LongDNASeq, DNA} with 1 edit:
  A4T
Variant(aln02) = Variant{LongDNASeq, DNA} with 1 edit:
  C2T

Expected results

PairwiseAlignment(AlignedSequence(mutseq, refseq), refseq) = PairwiseAlignment{LongDNASeq, LongDNASeq}:
  seq:  1 ACATCTTTATCT 12
          ||| ||||||||
  ref:  1 ACAACTTTATCT 12

Variant(aln01) = Variant{LongDNASeq, DNA} with 1 edit:
  A4T
Variant(aln02) = Variant{LongDNASeq, DNA} with 1 edit:
  A4T

Possible fix

Variant(::PairwiseAlignment) assumes a starting reference and sequence position of 0.

https://github.com/jakobnissen/SequenceVariation.jl/blob/92a2d40553f4d7e5c6e7e757543b81526f7b79b4/src/SequenceVariation.jl#L245

If it instead pulled the refpos and seqpos of the first (starting) AlignmentAnchor, then that should fix the problem.

I'll get started on a PR for that.