Unambiguate type nomenclature
MillironX opened this issue · 2 comments
Expected behavior
TL;DR: I want a (mostly) unambiguous pair of terms to rename the types `Variant` and `Variation` to. I chose "Haplotype" and "Variant," but want feedback from other biologists before writing the code to change it.
First, let's clarify what `Variant` and `Variation` do in the context of the package.
`Variant` | `Variation` |
---|---|
A set of modifications applied to a biological sequence | A single modification (substitution, deletion, or insertion) within a biological sequence |
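
To make the distinction in the table concrete, here is a minimal sketch in Python (used purely for illustration; this is not the package's Julia API). The names `Edit` and `EditSet` are hypothetical, chosen to be neutral so they do not prejudge the naming question: `Edit` plays the role of the current `Variation`, and `EditSet` the role of the current `Variant`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Edit:
    """A single modification (the role currently played by Variation)."""
    pos: int  # 1-based position in the reference
    ref: str  # reference bases being replaced ("" for a pure insertion)
    alt: str  # replacement bases ("" for a pure deletion)


@dataclass(frozen=True)
class EditSet:
    """A set of modifications against one reference (currently Variant)."""
    reference: str
    edits: Tuple[Edit, ...]

    def apply(self) -> str:
        """Return the sequence obtained by applying every edit.

        Assumes the edits do not overlap; that case is not handled here.
        """
        seq = self.reference
        # Apply right-to-left so earlier positions stay valid.
        for e in sorted(self.edits, key=lambda e: e.pos, reverse=True):
            start = e.pos - 1
            if seq[start:start + len(e.ref)] != e.ref:
                raise ValueError(f"reference mismatch at position {e.pos}")
            seq = seq[:start] + e.alt + seq[start + len(e.ref):]
        return seq


example = EditSet(
    reference="ACGTACGT",
    edits=(
        Edit(pos=2, ref="C", alt="T"),   # substitution
        Edit(pos=5, ref="A", alt=""),    # deletion
        Edit(pos=8, ref="", alt="GG"),   # insertion before position 8
    ),
)
```

Whatever the two types end up being called, the relationship is the same: one object describes a single site, and the other is a collection of such objects tied to a reference sequence.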
My point of view (veterinary diagnostics) is different, but aligns pretty closely with the Ensembl glossary. Here are some entries from the glossary that I think are pertinent:
Allele (variant): One of a number of alternative forms of the same genetic locus
Genotype (allele (variant)): The specific alleles that are present in an individual's genome
Haplotype (variation): A set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together.
Variant (Genome annotation): locus where the sequence differs between individuals of the same species
Variation: not defined
The big pain point for me comes from the fact that "Variant" refers to a single locus in most places, but in the package refers to a collection of loci. That disconnect even taints glue functions trying to parse `Variation`s from VariantCallFormat.jl's `VCF.Record`.
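
As a rough illustration of why the naming clashes in that glue code, the sketch below (Python, plain text parsing, not VariantCallFormat.jl's API) turns one VCF data line into single-site modifications. The names `SiteChange` and `site_changes_from_vcf_line` are hypothetical, and classifying by allele length is a simplifying assumption that ignores complex and symbolic alleles.

```python
from typing import List, NamedTuple


class SiteChange(NamedTuple):
    """One single-site modification (hypothetical, neutral name)."""
    pos: int
    ref: str
    alt: str
    kind: str


def site_changes_from_vcf_line(line: str) -> List[SiteChange]:
    """Split one tab-separated VCF data line into per-allele changes.

    VCF columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    """
    fields = line.rstrip("\n").split("\t")
    pos, ref, alts = int(fields[1]), fields[3], fields[4]
    changes = []
    for alt in alts.split(","):
        # Skip missing, spanning-deletion, and symbolic alleles in this sketch.
        if alt in (".", "*") or alt.startswith("<"):
            continue
        if len(ref) == len(alt):
            kind = "substitution"
        elif len(ref) > len(alt):
            kind = "deletion"
        else:
            kind = "insertion"
        changes.append(SiteChange(pos, ref, alt, kind))
    return changes


record = "chr1\t5\t.\tA\tG,AT\t.\tPASS\t."
changes = site_changes_from_vcf_line(record)
```

A VCF record describes one locus, so each record yields one or more single-site objects. In VCF vocabulary those are "variants," while the package currently calls them `Variation`s and reserves `Variant` for the whole collection.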
I propose renaming `Variant` to `Haplotype`, and `Variation` to `Variant`. These seem like the least ambiguous terms that apply from the glossary.
I would like feedback from others on these terms. Specifically @jakobnissen, since I know you also work(ed) on viral genomes, and @rasmushenningsson, since there's some overlap between the terminology in VariantCallFormat and SequenceVariation. Anyone else with an opinion, please also jump in.
Current behavior
Why did I implement issue forms?
Possible implementation
Again, why?
I've never seen these terms actually defined (until now), so I can only share the way I've seen people using them. People who use "haplotype" tend to define haplotypes based on observed patterns of inheritance, i.e. a haplotype is a collection of alleles which empirically clusters in actual populations. To me, that does not align with the use here. I would expect a HaploType struct to be defined in PopGen.jl.
Part of the problem is that nomenclature is field-specific. "Variant" is pretty well understood among virologists[1], but I have no idea what e.g. botanists think it means. It's interesting (and sad) that Ensembl defined variant to mean something different.
I can't think of any terminology which is clear and unambiguous across fields. Perhaps "Genotype" for what is currently called "Variant"... but then again, I'm the one who originally settled on "Variant"/"Variation", so no wonder I can't come up with anything better. :)
[1] Wikipedia: "a subtype of a microorganism that is genetically distinct from a main strain, but not sufficiently different to be termed a distinct strain"
Well, I was hoping for more input than that, but...
I can see the issue with the name "haplotype," as it is typically associated with populations for mammals. My understanding is that a "haplotype" is the "genotype" for a single-ploidy organism (e.g. viruses), while the "genotype" of a multi-ploidy organism consists of multiple "haplotypes." Since this package only deals with single-ploidy references, it makes sense to use the more specific "haplotype," but I can see either term working.
"Variant" will still, in my vocabulary, mean a specific site (e.g. Single Nucleotide Variant), but enough papers refer to a single site as a "variation" that I would be content keeping "variation". Based on asking people in my department, it seems the term "variant" only came to have a strain-like connotation in the wake of SARS-CoV-2 (we use the term "lineage" or "clade" for what news anchors call "variant"), so it's clear to me that the term "variant" is ambiguous enough that it probably should be removed entirely.
Side note: workflows similar to those in the paper "A beginner’s guide for FMDV quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-generation sequencing" were what sparked the initial name choices.
I propose one of the following changes, then:
Option 1 | Option 2 | Option 3 | Option 4 |
---|---|---|---|
`Variant` -> `Haplotype` | `Variant` -> `Haplotype` | `Variant` -> `Genotype` | `Variant` -> `Genotype` |
`Variation` -> `Variant` | `Variation` -> `Variation` | `Variation` -> `Variant` | `Variation` -> `Variation` |
Feedback, anyone?