nextstrain/seasonal-flu

Annotate Nextstrain clade names by default with official full clade names as alternate annotation

huddlej opened this issue · 2 comments

Context

We currently annotate full official clade names for all lineages, but for H3N2 and H1N1pdm the longer official clade names can be difficult to read in the Nextstrain view. We also often need to propose/follow new subclades before these are officially recognized.

A similar situation exists for SARS-CoV-2 where we support Nextstrain clades, PANGO lineages, and "emerging lineages" as alternate branch labels in Auspice.

Description

We should set up parallel "clade" names (i.e., Nextstrain clades) and "full clade" names (i.e., official WHO clade names) in an analogous fashion to how we have "clade" and "emerging lineage" in ncov. Each would be available both as a coloring and as a toggle under "branch labels".

We should default to showing the Nextstrain clade names, which we'll abbreviate to "2a.2", etc., but keep the full official clade names available. Nextclade will similarly have two fields for flu clades.

Possible solution

The full official clade names are defined in files named like clades_h3n2_ha.tsv while the Nextstrain clades are in files named like nextstrain_clades_h3n2_ha.tsv. We need to either define a separate clade annotation rule for the Nextstrain clades or modify the existing clades rule to be parameterized by the set of clade definitions to use.
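As a rough sketch of the second option, the choice of clade definitions could be reduced to a single parameter that selects the file prefix. The helper name `clade_definitions_filename` and the `clade_set` values `"full"` and `"nextstrain"` are hypothetical, introduced only for illustration; only the two file name patterns come from the description above. A function like this could serve as the basis for an input function on a parameterized rule.

```python
# Hypothetical helper: the function name and the "clade_set" values are
# illustrative assumptions, not part of the existing workflow.
CLADE_FILE_PREFIXES = {
    "full": "clades",                   # official full clade names
    "nextstrain": "nextstrain_clades",  # abbreviated Nextstrain clade names
}


def clade_definitions_filename(clade_set, lineage, segment):
    """Return the clade definitions file name for one set of clade names."""
    if clade_set not in CLADE_FILE_PREFIXES:
        raise ValueError(f"Unknown clade set: {clade_set!r}")
    return f"{CLADE_FILE_PREFIXES[clade_set]}_{lineage}_{segment}.tsv"


print(clade_definitions_filename("full", "h3n2", "ha"))
print(clade_definitions_filename("nextstrain", "h3n2", "ha"))
```

Keeping the mapping in one dictionary means a third set of definitions would only need one new entry rather than a new rule.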

The ncov workflow defines separate rules to annotate clades and emerging lineages. This workflow also needs an additional rule to add branch labels for emerging lineages to the Auspice JSON. However, if we could first merge @jameshadfield's excellent Augur PR to add support for arbitrary branch labels in augur clades, we could avoid this additional complexity in the seasonal flu workflow.

We also need to redefine Nextstrain clades as shortened versions of the full official clade names, keeping the official capitalization and abbreviating on dot. For example:

  • 3C.2a1b.2a.2 → 2a.2
  • 3C.2a1b.1b → 1b

Hi John,

If I may add a suggestion (which I hope is okay), the abbreviations would, in my opinion, be more useful and less ambiguous if they included everything to the right of the first dot. For example:

  • 3C.2a1b.1b → 2a1b.1b
  • 6B.1A.5a → 1A.5a
  • V1A.3a.1 → 3a.1

This method will reduce the chance of one abbreviation referring to multiple clades at once. For example, if 3C.2a1b.1b → 1b and a hypothetical but plausible future clade 3C.2a1b.2a.1b → 1b, then 1b refers to both.

Using the X.X form reduces this ambiguity.

Thanks,

Ammar
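The ambiguity concern in the comment above can be checked mechanically. The sketch below is illustrative only: the function names are made up, `abbreviate_last_segment` is a simplification that reproduces just the `3C.2a1b.1b → 1b` example rather than the full abbreviation scheme from the issue, and `3C.2a1b.2a.1b` is the hypothetical clade from the comment, not an existing one.

```python
from collections import defaultdict


def abbreviate_last_segment(name):
    """Keep only the text after the last dot (simplified illustration)."""
    return name.rsplit(".", 1)[-1]


def abbreviate_drop_first_segment(name):
    """Keep everything to the right of the first dot, as suggested above."""
    _head, sep, tail = name.partition(".")
    return tail if sep else name


def find_collisions(names, abbreviate):
    """Map each abbreviation shared by more than one full name to those names."""
    groups = defaultdict(set)
    for name in names:
        groups[abbreviate(name)].add(name)
    return {abbr: sorted(full) for abbr, full in groups.items() if len(full) > 1}


# The second name is the hypothetical future clade from the comment.
clades = ["3C.2a1b.1b", "3C.2a1b.2a.1b"]

print(find_collisions(clades, abbreviate_last_segment))
print(find_collisions(clades, abbreviate_drop_first_segment))
```

With these two names, the last-segment scheme reports a collision on `1b`, while dropping only the first segment keeps the two abbreviations distinct. A check like this could run whenever new clade definitions are added, whichever scheme is chosen.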

We now use abbreviated clade names for recent WHO clade names and have started to move toward an alternate influenza clade nomenclature that uses even shorter names through aliasing.