Missing systematic_id on some genes
Closed this issue · 4 comments
Hello @ValWood and @kimrutherford
I noticed that in some of the old genome versions, some features did not have a `systematic_id` qualifier; they only had `gene` qualifiers (sometimes more than one).
Checking this arbitrarily on `chromosome1` in the FTP folder `20030814`, I see that in the vast majority of cases (2392), one of the `gene` qualifiers corresponds to a current `systematic_id`. For some others, however, those qualifiers do not correspond to a `systematic_id` or synonym. Full list for that version below:
['5SrRNA'] rRNA
['snr52'] snRNA
['Sp-snR40-sno'] snRNA
['Z8 snoRNA'] snRNA
['Z7 snoRNA'] snRNA
['Z5 snoRNA'] snRNA
['z4 snoRNA'] snRNA
['Z3 snoRNA'] snRNA
['SPAC1D4.07c'] CDS
['SPAC4G8.15c'] CDS
['U3snRNA'] snRNA
['Sp-snR41-sno'] snRNA
['Sp-snR70-sno'] snRNA
['Sp-snR51b-sno'] snRNA
['Sp-Z16-sno'] snRNA
['SPAC1002.21'] CDS
['putative gene free region'] misc_feature
['Sp-U24b-sno'] snRNA
['SPAC823.02'] CDS
['Sp-snR02-sno'] snRNA
['SPAC6F6.18c'] CDS
['SPAC3F10.14'] misc_feature
['SPRG5SC-1'] rRNA
['meu2', 'SPAC1556.06b'] CDS
['SPAC1F12.03c'] CDS
['Sp-snR62-sno'] snRNA
['SPAC4H3.12c'] CDS
['plr48'] misc_RNA
['SPAPB18E9.03c'] CDS
['prl53', 'prl63', 'prl49'] misc_RNA
['prl53'] misc_RNA
['prl49'] misc_RNA
['prl63'] misc_RNA
['Sp-snR69b-sno'] snRNA
['Sp-snR54b-sno'] snRNA
['omt3'] misc_RNA
['SPAC27D7.10c'] CDS
['SPAC22E12.12'] CDS
['SPAC22E12.15'] CDS
['Sp-snR38-sno'] snRNA
['Sp-snR56-sno'] snRNA
['Sp-snR58-sno'] snRNA
['U24-sno'] snRNA
['Sp-snR69-sno'] snRNA
['pex7', 'SPAC1834.12', 'SPAP17D4.01'] CDS
Some are clearly temporary gene names, but some others are a bit weird. For example:
- `['pex7', 'SPAC1834.12', 'SPAP17D4.01'] CDS`. I think this is probably a typo, since pex7 is `SPAC17D4.01` (`SPAC` and not `SPAP`).
- `['SPAC27D7.10c'] CDS`. This one is currently listed as a synonym of `SPAC27D7.09c`, but at the time the two were different:
FT CDS complement(4519039..4520190)
FT /colour=12
FT /gene="SPAC27D7.09c"
FT /product="hypothetical protein; possibly S. pombe
FT specific; predicted N-terminal signal sequence; similar to
FT S. pombe SPAC27D7.10c and SPAC27D7.11c and SPBC3D6.02
FT (paralogs); tandem duplication"
FT /fasta_file="fasta/c27D7.tab.seq.00028.out"
FT CDS complement(4522076..4523227)
FT /colour=12
FT /gene="SPAC27D7.10c"
FT /product="hypothetical protein; possibly S. pombe
FT specific; predicted N-terminal signal sequence; similar to
FT S. pombe SPAC27D7.09c and SPAC27D7.11c and SPBC3D6.02
FT (paralogs); tandem duplication"
FT /fasta_file="fasta/c27D7.tab.seq.00029.out"
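A minimal sketch of how the `/gene` qualifiers could be collected per feature from such a file, using only the standard library. It assumes the whitespace-collapsed layout shown above (`FT <key> <location>` for a feature header, `FT /qualifier=...` for qualifiers) and tracks open quotes so that wrapped `/product` text is not mistaken for a new feature. The function name `parse_features` is made up for illustration; a real pipeline would more likely use an existing EMBL parser.

```python
import re

HEADER = re.compile(r"^(\S+)\s+(\S+)$")   # "<feature key> <location>"
GENE = re.compile(r'^/gene="([^"]*)"$')   # single-line /gene qualifier


def parse_features(lines):
    """Group feature-table lines into features with their /gene values."""
    features = []
    in_quote = False  # True while inside a multi-line quoted qualifier value
    for line in lines:
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if in_quote:
            # A wrapped value ends on the line that carries the closing quote.
            if body.count('"') % 2 == 1:
                in_quote = False
            continue
        if body.startswith("/"):
            # An odd number of quotes means the value continues on the next line.
            in_quote = body.count('"') % 2 == 1
            match = GENE.match(body)
            if match and features:
                features[-1]["genes"].append(match.group(1))
            continue
        header = HEADER.match(body)
        if header:
            features.append(
                {"type": header.group(1), "location": header.group(2), "genes": []}
            )
    return features


# Abridged copy of the two entries quoted above.
embl_lines = [
    "FT CDS complement(4519039..4520190)",
    "FT /colour=12",
    'FT /gene="SPAC27D7.09c"',
    'FT /product="hypothetical protein; possibly S. pombe',
    "FT specific; predicted N-terminal signal sequence; similar to",
    "FT S. pombe SPAC27D7.10c and SPAC27D7.11c and SPBC3D6.02",
    'FT (paralogs); tandem duplication"',
    "FT CDS complement(4522076..4523227)",
    'FT /gene="SPAC27D7.10c"',
]

features = parse_features(embl_lines)
for feature in features:
    print(feature["type"], feature["location"], feature["genes"])
```

On this excerpt the two names come out as two separate CDS features with different coordinates, which is the point made above: at the time they were distinct features.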
My questions here are:
- If a feature X has a `/gene` qualifier that matches a current `systematic_id`, is it safe to assume that feature X corresponds to the gene with that `systematic_id`?
- If a feature X has no `/gene` qualifier that matches a `systematic_id`, but has a `/gene` that matches a current unique `synonym` that only appears once in the tsv file (maybe some appear twice), is it safe to assume that feature X corresponds to the gene that currently has that `synonym`?
  - Is this true both for `/gene` values that start with `SP` and for those that are in underscore?
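If the answer to both questions were yes, the matching rule could be sketched roughly as below. This is only an illustration of the questions, not a confirmed procedure: the function names, the `(systematic_id, synonyms)` row layout standing in for the tsv file, and the `shared_name` and `SPXX0000.*` entries are all assumptions made up for the example. The `SPAC27D7.10c` / `SPAC27D7.09c` pair is the one discussed above.

```python
from collections import defaultdict


def build_synonym_index(rows):
    """Map each synonym to the set of systematic IDs that currently list it."""
    index = defaultdict(set)
    for systematic_id, synonyms in rows:
        for synonym in synonyms:
            index[synonym].add(systematic_id)
    return index


def resolve(gene_names, systematic_ids, synonym_index):
    """Return (systematic_id or None, rule used) for one feature's /gene values."""
    # Rule 1: a /gene value is itself a current systematic_id.
    direct = {name for name in gene_names if name in systematic_ids}
    if len(direct) == 1:
        return next(iter(direct)), "systematic_id"
    if len(direct) > 1:
        return None, "ambiguous"
    # Rule 2: a /gene value is a synonym listed under exactly one gene.
    candidates = set()
    for name in gene_names:
        owners = synonym_index.get(name, set())
        if len(owners) == 1:
            candidates |= owners
    if len(candidates) == 1:
        return next(iter(candidates)), "unique_synonym"
    return None, "unresolved"


# Hypothetical stand-in for the tsv file: (systematic_id, [synonyms]).
rows = [
    ("SPAC27D7.09c", ["SPAC27D7.10c"]),
    ("SPXX0000.01", ["shared_name"]),  # made-up IDs sharing one synonym
    ("SPXX0000.02", ["shared_name"]),
]
systematic_ids = {systematic_id for systematic_id, _ in rows}
synonym_index = build_synonym_index(rows)

print(resolve(["SPAC27D7.09c"], systematic_ids, synonym_index))
print(resolve(["SPAC27D7.10c"], systematic_ids, synonym_index))
print(resolve(["shared_name"], systematic_ids, synonym_index))
print(resolve(["5SrRNA"], systematic_ids, synonym_index))
```

Note that rule 2 maps `SPAC27D7.10c` onto `SPAC27D7.09c`, even though the old file shows them as two separate features. That is exactly the kind of case the second question is about.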
I managed to fix most cases either by going to `obsolete_name` or by co-occurrence (a feature had one `gene` qualifier with a known `systematic_id` and another that was unknown, so I made those synonyms). Here is the list of the ones that remained orphaned, but I suspect it's not worth going into.
I will close as "not planned"
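For reference, the co-occurrence step described above could look roughly like this. It is a sketch under assumptions: `infer_synonyms` is a made-up name, the input is a list of `/gene` value lists (one per feature), and the `SPXX0000.*` IDs and `oldname*` values are placeholders rather than real data.

```python
def infer_synonyms(features, systematic_ids, known_synonyms):
    """Propose new synonyms from features that pair a known ID with unknown names."""
    inferred = {}
    for gene_names in features:
        known = [name for name in gene_names if name in systematic_ids]
        unknown = [
            name
            for name in gene_names
            if name not in systematic_ids and name not in known_synonyms
        ]
        # Only act when exactly one known systematic_id anchors the feature.
        if len(known) == 1:
            for name in unknown:
                inferred.setdefault(name, set()).add(known[0])
    return inferred


# Placeholder data, not taken from any genome version.
current_ids = {"SPXX0000.01", "SPXX0000.02"}
current_synonyms = set()
old_features = [
    ["oldname1", "SPXX0000.01"],   # known ID plus an unknown name
    ["oldname2"],                  # nothing to anchor on: stays orphaned
    ["SPXX0000.02"],               # already resolved, nothing to infer
]

inferred = infer_synonyms(old_features, current_ids, current_synonyms)
print(inferred)
```

Names that map to more than one systematic ID in the result would need manual review rather than automatic assignment.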