Double mutation lines in gff file
The same variant gets written twice in the c_elegans.PRJNA13758.WS281.annotations.gff3.gz file. Maybe this is already fixed for WS282?
(py376) /Users/mz3 % gzcat ~/Desktop/Standard/c_elegans.PRJNA13758.WS281.annotations.gff3.gz | grep WBVar01840896
IV PCoF_Variation_project_Polymorphism SNP 17489019 17489019 . + . variation=WBVar01840896;public_name=WBVar01840896;other_name=cewivar00553473;strain=JU533,JU642,KR314;polymorphism=1;substitution=C/T;consequence=missense_variant;vep_impact=MODERATE;aachange=G/S;codon_change=Ggc/Agc;hgvsg=CHROMOSOME_IV:g.17489019C>T;hgvsc=4R79.1a.1:c.247G>A;sift=tolerated(0.21);cdna_position=529;cds_position=247;aa_position=83;exon_nr=4/11
IV Variation_project_Polymorphism SNP 17489019 17489019 . + . variation=WBVar01840896;public_name=WBVar01840896;other_name=cewivar00553473;strain=JU533,JU642,KR314;polymorphism=1;substitution=C/T;consequence=missense_variant;vep_impact=MODERATE;aachange=G/S;codon_change=Ggc/Agc;hgvsg=CHROMOSOME_IV:g.17489019C>T;hgvsc=4R79.1a.1:c.247G>A;sift=tolerated(0.21);cdna_position=529;cds_position=247;aa_position=83;exon_nr=4/11
(py376) /Users/mz3 % gzcat ~/Desktop/Standard/c_elegans.PRJNA13758.WS281.annotations.gff3.gz | grep WBVar01435159
II PCoF_Variation_project_Polymorphism SNP 4428 4428 . + . variation=WBVar01435159;public_name=WBVar01435159;strain=JU774;polymorphism=1;substitution=A/C;consequence=missense_variant;vep_impact=MODERATE;aachange=K/T;codon_change=aAa/aCa;hgvsg=CHROMOSOME_II:g.4428A>C;hgvsc=2L52.1b.1:c.428A>C;sift=deleterious_low_confidence(0.03);polyphen=benign(0.29);cdna_position=428;cds_position=428;aa_position=143;exon_nr=3/3
II Variation_project_Polymorphism SNP 4428 4428 . + . variation=WBVar01435159;public_name=WBVar01435159;strain=JU774;polymorphism=1;substitution=A/C;consequence=missense_variant;vep_impact=MODERATE;aachange=K/T;codon_change=aAa/aCa;hgvsg=CHROMOSOME_II:g.4428A>C;hgvsc=2L52.1b.1:c.428A>C;sift=deleterious_low_confidence(0.03);polyphen=benign(0.29);cdna_position=428;cds_position=428;aa_position=143;exon_nr=3/3
Also, some variants are written with SNP in column 3 and others with point_mutation. Not sure that distinction is needed?
The duplication is intended. The variation is duplicated with a 'PCoF_' prefix on the source (column 2) for putative change-of-function alleles. This is so that they can be displayed on a separate track in JBrowse.
The distinction between SNP and point_mutation is also intended. SNP is used for natural variants, point_mutation for other single-base substitutions.
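For anyone who needs one line per variant downstream, here is a minimal sketch of collapsing the 'PCoF_' copies. It assumes the two copies differ only by the 'PCoF_' prefix in column 2, as in the examples above, and that the file is tab-separated as GFF3 requires. The function name `collapse_pcof_duplicates` is made up for illustration and is not part of any WormBase tooling.

```python
from typing import Iterable, Iterator

PCOF_PREFIX = "PCoF_"


def collapse_pcof_duplicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield GFF3 lines, keeping one copy of each PCoF_/non-PCoF_ pair.

    Assumption: a duplicated pair is identical in every column except for
    the 'PCoF_' prefix on the source (column 2).
    """
    seen = set()
    for line in lines:
        # Pass comments, directives and blank lines through untouched.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            # Not a standard 9-column feature line; leave it alone.
            yield line
            continue
        # Normalise the source so both copies produce the same key.
        if cols[1].startswith(PCOF_PREFIX):
            cols[1] = cols[1][len(PCOF_PREFIX):]
        key = tuple(cols)
        if key in seen:
            continue
        seen.add(key)
        yield "\t".join(cols) + "\n"
```

To run it on the release file, open it with `gzip.open(path, "rt")` and pass the file object in. Note that the output no longer records which variants were flagged as putative change of function, and that the `seen` set grows with the number of feature lines, so on a whole-genome file it may be worth filtering to variation lines first.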
So, here's what I'll add about this: JBrowse doesn't need two lines, but the GBrowse processing pipeline does. I'm working on retiring GBrowse for good: once JBrowse 2 is up and running we can drop GBrowse, since JB2 provides the one missing piece that people really want from GBrowse (creating SVG images). So I would say keep the duplicate lines for now, and when we retire GBrowse we can retire them too.
Here I'll add a note, mostly for my own future reference. When formatting a JBrowse track, there is a command-line argument for type that looks like:
--type deletion:PCoF_CGH_allele_Polymorphism,deletion:PCoF_Variation_project_Polymorphism,insertion_site:PCoF_Variation_project_Polymorphism,SNP:PCoF_Variation_project_Polymorphism,substitution:PCoF_Variation_project_Polymorphism,complex_substitution:PCoF_Variation_project_Polymorphism,sequence_alteration:PCoF_Variation_project_Polymorphism
Obviously, the length of that list doesn't matter, so I can just make the "vanilla" polymorphism config longer so that it matches both the "PCoF_" version of the source (column 2) and the non-"PCoF_" version.
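As a rough illustration of that last point, the longer list could be generated rather than typed by hand. This sketch only builds the comma-separated `type:source` value in the format shown above; the helper name `type_argument` is hypothetical, and the set of type/source pairs the real config should contain is an assumption.

```python
from typing import Iterable, Tuple


def type_argument(pairs: Iterable[Tuple[str, str]], prefix: str = "PCoF_") -> str:
    """Build a --type value listing each type:source pair twice:
    once with the plain source and once with the prefixed source."""
    entries = []
    for feature_type, source in pairs:
        entries.append(f"{feature_type}:{source}")
        entries.append(f"{feature_type}:{prefix}{source}")
    return ",".join(entries)


# Example pairs taken from the list above, with the prefix stripped.
pairs = [
    ("deletion", "CGH_allele_Polymorphism"),
    ("SNP", "Variation_project_Polymorphism"),
]
print("--type " + type_argument(pairs))
```

With a single track matching both sources, each duplicated variant would presumably show up twice on that track until the duplicate GFF lines are retired.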