BioJulia/SequenceVariation.jl

Unambiguate type nomenclature

MillironX opened this issue · 2 comments

Expected behavior

TL;DR: I want a (mostly) unambiguous pair of terms to replace the type names Variant and Variation. I chose "Haplotype" and "Variant," but want feedback from other biologists before writing the code to change them.

First, let's clarify what Variant and Variation do in the context of the package.

| Variant | Variation |
| --- | --- |
| A set of modifications applied to a biological sequence | A single modification (substitution, deletion, or insertion) within a biological sequence |
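
To make the two roles concrete, here is a rough sketch of the distinction. It is written in Python purely for illustration and is not the package's Julia API. Every name in it (`Substitution`, `Deletion`, `Insertion`, `EditSet`, `apply_edits`) is a hypothetical placeholder, chosen to be neutral so that it does not prejudge the naming question. It also assumes 1-based positions and non-overlapping edits.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


# --- What the package currently calls a "Variation": ONE modification ---

@dataclass(frozen=True)
class Substitution:
    position: int  # 1-based position in the reference (assumption)
    base: str      # base that replaces the reference base


@dataclass(frozen=True)
class Deletion:
    position: int  # 1-based position of the first deleted base
    length: int    # number of reference bases removed


@dataclass(frozen=True)
class Insertion:
    position: int  # new bases are placed AFTER this reference position
    bases: str


SingleEdit = Union[Substitution, Deletion, Insertion]


# --- What the package currently calls a "Variant": a SET of modifications ---

@dataclass(frozen=True)
class EditSet:
    reference: str
    edits: FrozenSet[SingleEdit]


def apply_edits(edit_set: EditSet) -> str:
    """Rebuild the modified sequence from the reference plus its edits.

    Edits are applied from the highest position down, so earlier
    coordinates are not shifted. Overlapping edits are not handled.
    """
    seq = list(edit_set.reference)
    for edit in sorted(edit_set.edits, key=lambda e: e.position, reverse=True):
        start = edit.position - 1  # convert to a 0-based index
        if isinstance(edit, Substitution):
            seq[start] = edit.base
        elif isinstance(edit, Deletion):
            del seq[start:start + edit.length]
        else:  # Insertion
            seq[edit.position:edit.position] = list(edit.bases)
    return "".join(seq)


example = EditSet(
    reference="ACGTACGT",
    edits=frozenset({
        Substitution(2, "T"),
        Deletion(4, 2),
        Insertion(7, "GG"),
    }),
)
result = apply_edits(example)
print(result)  # ATGCGGGT
```

The naming question in this issue is what to call the single-edit types collectively and what to call the `EditSet`-like container.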

My point of view (veterinary diagnostics) is different, but aligns pretty closely with the Ensembl glossary. Here are some entries from the glossary that I think are pertinent:

Allele (variant): One of a number of alternative forms of the same genetic locus

Genotype (allele (variant)): The specific alleles that are present in an individual's genome

Haplotype (variation): A set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together.

Variant (Genome annotation): locus where the sequence differs between individuals of the same species

Variation: not defined

The big pain point for me comes from the fact that "Variant" refers to a single locus in most places, but in the package refers to a collection of loci. That disconnect even taints glue functions trying to parse Variations from VariantCallFormat.jl's VCF.Record.

I propose renaming Variant to Haplotype, and Variation to Variant. These seem like the least ambiguous terms that apply from the glossary.

I would like feedback from others on these terms. Specifically @jakobnissen, since I know you also work(ed) on viral genomes, and @rasmushenningsson, since there's some overlap between the terminology in VariantCallFormat and SequenceVariation. Anyone else with an opinion, please also jump in.

Current behavior

Why did I implement issue forms?

Possible implementation

Again, why?

I've never seen these terms actually defined (until now), so I can only share the way I've seen people use them. People who use "haplotype" tend to define haplotypes based on observed patterns of inheritance, i.e. a haplotype is a collection of alleles which empirically clusters in actual populations. To me, that does not align with the use here. I would expect a HaploType struct to be defined in PopGen.jl.

Part of the problem is that nomenclature is field-specific. "Variant" is pretty well understood among virologists[1], but I have no idea what e.g. botanists think it means. It's interesting (and sad) that Ensembl defined variant to mean something different.

I can't think of any terminology which is clear and unambiguous across fields. Perhaps "Genotype" for what is currently called "Variant"... but then again, I'm the one who originally settled on "Variant"/"Variation", so no wonder I can't come up with anything better. :)

[1]. Wikipedia:

a subtype of a microorganism that is genetically distinct from a main strain, but not sufficiently different to be termed a distinct strain

Well, I was hoping for more input than that, but...

I can see the issue with the name "haplotype," as it is typically associated with populations for mammals. My understanding is that a "haplotype" is the "genotype" of a single-ploidy organism (e.g. viruses), while the "genotype" of a multi-ploidy organism consists of multiple "haplotypes." Since this package only deals with single-ploidy references, it makes sense to use the more specific "haplotype," but I can see either term working.

In my vocabulary, "variant" will still mean a specific site (e.g. Single Nucleotide Variant), but enough papers refer to a single site as a "variation" that I would be content keeping "variation". Based on asking people in my department, it seems the term "variant" only came to have a strain-like connotation in the wake of SARS-CoV-2 (we use the term "lineage" or "clade" for what news anchors call "variant"), so it's clear to me that the term "variant" is ambiguous enough that it probably should be removed entirely.

Side note: similar workflows to the paper A beginner’s guide for FMDV quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-generation sequencing were what sparked the initial name choices.


I propose one of the following changes, then:

| Option 1 | Option 2 | Option 3 | Option 4 |
| --- | --- | --- | --- |
| Variant -> Haplotype | Variant -> Haplotype | Variant -> Genotype | Variant -> Genotype |
| Variation -> Variant | Variation -> Variation | Variation -> Variant | Variation -> Variation |

Feedback, anyone?