naturalis/barcode-constrained-phylogeny

Non-flush alignments under certain conditions

rvosa opened this issue · 1 comments

The msa_hmm step combines two alignments into one file. The ingroup and the outgroup are each processed separately to figure out the reverse-complement (revcom) orientation and to do the alignment. The resulting files are then simply concatenated, on the assumption that this yields a flush alignment, i.e. one in which all rows have the same length. It turns out that this is not always the case. To address this, the following points need attention:

  • Firstly, the problem seems to manifest especially with short sequences. Hence, better curation under #85 might help mitigate this in part.
  • Secondly, the two data sets can be processed in one go, either by concatenating the inputs and then doing the orientation and alignment across the concatenation, or by reconciling them with --mapali.
  • But, thirdly, how could this happen in the first place? The idea was that hmmalign would yield alignments of the same length as long as they use the same HMM and are trimmed. What gives? Probably indels?
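
To make the symptom concrete, here is a minimal sketch of a flushness check on a concatenated alignment. It assumes the two alignments are available as aligned FASTA text; the helper names (`parse_fasta`, `row_length_counts`, `is_flush`) and the toy sequences are made up for illustration and are not part of the pipeline.

```python
from collections import Counter


def parse_fasta(text):
    """Parse (aligned) FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def row_length_counts(records):
    """Count how many rows have each aligned length."""
    return Counter(len(seq) for seq in records.values())


def is_flush(records):
    """An alignment is flush if all rows share one length."""
    return len(row_length_counts(records)) <= 1


# Hypothetical toy data: the ingroup rows are 9 columns wide,
# the outgroup row only 8, so the concatenation is not flush.
ingroup = ">in1\nACGT-ACGT\n>in2\nACGTTACGT\n"
outgroup = ">out1\nACGTACGT\n"

combined = parse_fasta(ingroup + outgroup)
print(is_flush(combined), dict(row_length_counts(combined)))
```

Running a check like this right after the concatenation would at least make the step fail early instead of passing a ragged file downstream.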

As of now, the Odonata branch realigns the exemplars. This solves the immediate problem, but we still need to understand what's happening in the Stockholm files.
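
One way to inspect the Stockholm files is to compare, per row, the full aligned length with the length that remains once insert columns are dropped. The sketch below assumes the usual hmmalign output convention, in which residues in insert columns are lowercase and padded with `.`, while match columns hold uppercase residues or `-`. If the files deviate from that convention, the numbers will be misleading. The function names and the toy alignments are hypothetical.

```python
def parse_stockholm(text):
    """Collect sequence rows from Stockholm text, joining interleaved blocks."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, markup lines (#...) and the terminator (//).
        if not line or line.startswith("#") or line == "//":
            continue
        name, chunk = line.split(None, 1)
        rows[name] = rows.get(name, "") + chunk.replace(" ", "")
    return rows


def strip_inserts(row):
    """Keep only match-column characters (assumed: uppercase or '-')."""
    return "".join(c for c in row if c.isupper() or c == "-")


def summarize(rows):
    """Map each row name to (full length, length without insert columns)."""
    return {name: (len(row), len(strip_inserts(row))) for name, row in rows.items()}


# Hypothetical toy files: the ingroup has a two-column insertion,
# the outgroup has none.
ingroup_sto = """# STOCKHOLM 1.0
in1  ACGTaaACGT
in2  ACGT..ACGT
//
"""
outgroup_sto = """# STOCKHOLM 1.0
out1  ACGTACGT
//
"""

for label, text in (("ingroup", ingroup_sto), ("outgroup", outgroup_sto)):
    print(label, summarize(parse_stockholm(text)))
```

If the second number agrees across both files while the first does not, the length difference comes from insert columns rather than from the match columns of the shared HMM, which would be consistent with the indel hypothesis above.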