nextstrain/seasonal-flu

Use nextalign instead of custom Python codon alignment script

huddlej opened this issue · 3 comments

Context
The workflow uses a custom Python script to perform codon-aware alignment, to resolve issues with variable sites in mafft's alignments. Since @rneher wrote this script, he and @ivan-aksamentov developed nextalign. Even though nextalign was developed with SARS-CoV-2 in mind, it works well for H3N2 HA sequences, too, and is way faster.

Description / proposed solution
We should try using nextalign for our flu builds. We'll need to set up the corresponding FASTA reference files for all of the lineages and segments, or maybe implement GenBank file support in nextalign (whichever is easier... I can guess which, though!). Then, we can take advantage of the codon-aware alignment functionality and also run alignments with multiple threads to speed up that step of our builds.
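As a rough sketch of the bookkeeping this implies, the snippet below enumerates one reference FASTA per lineage and segment and reports which ones are still missing. The `config/reference_{lineage}_{segment}.fasta` layout and the helper names are assumptions made for illustration, not the workflow's actual file structure.

```python
from itertools import product
from pathlib import Path

# Lineage and segment labels as commonly used for seasonal flu builds.
# The directory layout and file naming below are hypothetical.
LINEAGES = ["h3n2", "h1n1pdm", "vic", "yam"]
SEGMENTS = ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"]


def reference_path(lineage, segment, root="config"):
    """Return the (hypothetical) reference FASTA path for one lineage and segment."""
    return Path(root) / f"reference_{lineage}_{segment}.fasta"


def missing_references(root="config"):
    """List the (lineage, segment) pairs whose reference FASTA does not exist yet."""
    return [
        (lineage, segment)
        for lineage, segment in product(LINEAGES, SEGMENTS)
        if not reference_path(lineage, segment, root).is_file()
    ]
```

Calling `missing_references()` from the repository root would then give a checklist of the reference files still to be created before nextalign can run on every build.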

Once we have nextalign in place, we could start to do analyses that previously would have taken too long, like creating multiple sequence alignments of all amino acid sequences for HA and running the titer substitution model on all available sequences and titers.

Now that Nextclade has seasonal flu datasets, we should consider using Nextclade for alignment and clade annotations in our standard builds. This approach would quickly produce codon-aware nucleotide alignments, amino acid translations, and clade annotations for every sequence in the database.
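To give a sense of how the clade annotations could feed into a build, here is a minimal sketch that reads Nextclade's tab-separated output into a strain-to-clade lookup. It assumes the TSV has `seqName` and `clade` columns. The function name and the example records are made up for illustration.

```python
import csv
import io


def read_clades(tsv_handle):
    """Map each sequence name to its clade from a Nextclade-style TSV.

    Assumes the table has `seqName` and `clade` columns.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {row["seqName"]: row["clade"] for row in reader}


# Made-up records standing in for a real Nextclade TSV file.
example = io.StringIO(
    "seqName\tclade\n"
    "strain_1\tclade_A\n"
    "strain_2\tclade_B\n"
)
clades = read_clades(example)
```

In a real build the handle would come from opening the TSV file that Nextclade writes, and the lookup could be merged into the per-strain metadata.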

We'll need to set up the corresponding FASTA reference files for all of the lineages and segments

For the Nextalign part, there are many input files in the Nextclade repo already. You'll probably need something even more sophisticated than that, but this might be a partial solution or at least a starting point:

https://github.com/nextstrain/nextclade/tree/master/data/flu

We should consider using Nextclade for alignment and clade annotations in our standard builds

Nextclade would require more files to run than Nextalign, including a reference Auspice tree for every variation, every root sequence, etc. So it is much more involved on the science side, unless it can all piggyback on the existing trees somehow:

https://github.com/nextstrain/nextclade_data/tree/master/data/datasets

In either case, happy to help with the engineering part! Sadly I am almost entirely ignorant about the science of flu itself.

Closing this since we've used nextalign since the refactor.