pombase/genome_changelog

Missing systematic_id on some genes


Hello @ValWood and @kimrutherford

I noticed that in some of the old genome versions, some features did not have a systematic_id qualifier, and they only had gene qualifiers (sometimes more than one).

As an arbitrary check, I looked at chromosome1 in the ftp folder 20030814. For the vast majority of these features (2392), one of the gene qualifiers corresponds to a current systematic_id. For some others, however, none of the gene qualifiers corresponds to a current systematic_id or synonym. Full list for that version below:

['5SrRNA'] rRNA
['snr52'] snRNA
['Sp-snR40-sno'] snRNA
['Z8 snoRNA'] snRNA
['Z7 snoRNA'] snRNA
['Z5 snoRNA'] snRNA
['z4 snoRNA'] snRNA
['Z3 snoRNA'] snRNA
['SPAC1D4.07c'] CDS
['SPAC4G8.15c'] CDS
['U3snRNA'] snRNA
['Sp-snR41-sno'] snRNA
['Sp-snR70-sno'] snRNA
['Sp-snR51b-sno'] snRNA
['Sp-Z16-sno'] snRNA
['SPAC1002.21'] CDS
['putative gene free region'] misc_feature
['Sp-U24b-sno'] snRNA
['SPAC823.02'] CDS
['Sp-snR02-sno'] snRNA
['SPAC6F6.18c'] CDS
['SPAC3F10.14'] misc_feature
['SPRG5SC-1'] rRNA
['meu2', 'SPAC1556.06b'] CDS
['SPAC1F12.03c'] CDS
['Sp-snR62-sno'] snRNA
['SPAC4H3.12c'] CDS
['plr48'] misc_RNA
['SPAPB18E9.03c'] CDS
['prl53', 'prl63', 'prl49'] misc_RNA
['prl53'] misc_RNA
['prl49'] misc_RNA
['prl63'] misc_RNA
['Sp-snR69b-sno'] snRNA
['Sp-snR54b-sno'] snRNA
['omt3'] misc_RNA
['SPAC27D7.10c'] CDS
['SPAC22E12.12'] CDS
['SPAC22E12.15'] CDS
['Sp-snR38-sno'] snRNA
['Sp-snR56-sno'] snRNA
['Sp-snR58-sno'] snRNA
['U24-sno'] snRNA
['Sp-snR69-sno'] snRNA
['pex7', 'SPAC1834.12', 'SPAP17D4.01'] CDS

Some are clearly temporary gene names, but some others are a bit weird. For example:

  • ['pex7', 'SPAC1834.12', 'SPAP17D4.01'] CDS. I think this is probably a typo, since pex7 is SPAC17D4.01 (SPAC and not SPAP).
  • ['SPAC27D7.10c'] CDS. This one is currently listed as a synonym of SPAC27D7.09c, but at the time the two were different:
FT   CDS             complement(4519039..4520190)
FT                   /colour=12
FT                   /gene="SPAC27D7.09c"
FT                   /product="hypothetical protein; possibly S. pombe 
FT                   specific; predicted N-terminal signal sequence; similar to 
FT                   S. pombe SPAC27D7.10c and SPAC27D7.11c and SPBC3D6.02 
FT                   (paralogs); tandem duplication"
FT                   /fasta_file="fasta/c27D7.tab.seq.00028.out"

FT   CDS             complement(4522076..4523227)
FT                   /colour=12
FT                   /gene="SPAC27D7.10c"
FT                   /product="hypothetical protein; possibly S. pombe 
FT                   specific; predicted N-terminal signal sequence; similar to 
FT                   S. pombe SPAC27D7.09c and SPAC27D7.11c and SPBC3D6.02 
FT                   (paralogs); tandem duplication"
FT                   /fasta_file="fasta/c27D7.tab.seq.00029.out"
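For anyone wanting to reproduce this kind of check, here is a minimal sketch of how features lacking a systematic_id could be pulled out of a feature table like the excerpt above. It is not the code used in genome_changelog. The function name `features_without_systematic_id` is hypothetical, and the regular expressions assume the layout shown here (feature key after `FT` plus three spaces, one qualifier per line, and a `/systematic_id=` qualifier in files that have one).

```python
import re

# Assumed layout: "FT" + exactly three spaces + feature key + location.
FEATURE_RE = re.compile(r"^FT {3}(\S+)\s+\S")
# Qualifier lines are indented further; only /gene and /systematic_id matter here.
GENE_RE = re.compile(r'^FT\s+/gene="([^"]*)"')
SYSID_RE = re.compile(r"^FT\s+/systematic_id=")


def features_without_systematic_id(lines):
    """Return (feature_type, gene_qualifiers) for features lacking a systematic_id."""
    features = []
    for line in lines:
        feature_match = FEATURE_RE.match(line)
        if feature_match:
            # A new feature starts; qualifiers that follow belong to it.
            features.append(
                {"type": feature_match.group(1), "genes": [], "has_sysid": False}
            )
            continue
        if not features:
            continue  # header lines before the first feature
        gene_match = GENE_RE.match(line)
        if gene_match:
            features[-1]["genes"].append(gene_match.group(1))
        elif SYSID_RE.match(line):
            features[-1]["has_sysid"] = True
    return [(f["type"], f["genes"]) for f in features if not f["has_sysid"]]


excerpt = [
    "FT   CDS             complement(4519039..4520190)",
    "FT                   /colour=12",
    'FT                   /gene="SPAC27D7.09c"',
    "",
    "FT   CDS             complement(4522076..4523227)",
    "FT                   /colour=12",
    'FT                   /gene="SPAC27D7.10c"',
]
for feature_type, genes in features_without_systematic_id(excerpt):
    print(genes, feature_type)
```

The output format mirrors the list earlier in this issue (gene qualifiers followed by the feature type). Multi-line qualifier values such as /product are simply skipped, which is enough for this purpose.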

My questions here are:

  1. If a feature X has a /gene qualifier that matches a current systematic_id, is it safe to assume that feature X corresponds to the gene with that systematic_id?
  2. If a feature X has no /gene qualifier that matches a systematic_id, but has a /gene qualifier that matches a current synonym that appears only once in the tsv file (some may appear twice), is it safe to assume that feature X corresponds to the gene that currently has that synonym?
    • Is this true both for /gene values that start with SP and for those that contain an underscore?
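To make the two questions concrete, this is roughly the resolution rule I have in mind, as a sketch only. Whether these assumptions are safe is exactly what I am asking. The function name `resolve_feature` and the shape of the inputs (a set of current systematic_ids and a dict from synonym to the set of genes carrying it) are hypothetical, and the data in the example is a toy, not taken from the real tsv file.

```python
def resolve_feature(gene_qualifiers, systematic_ids, synonym_to_ids):
    """Return the current systematic_id for a feature, or None if unknown or ambiguous."""
    # Question 1: a gene qualifier that is itself a current systematic_id.
    direct = {g for g in gene_qualifiers if g in systematic_ids}
    if len(direct) == 1:
        return next(iter(direct))
    if len(direct) > 1:
        return None  # two different current ids on one feature: needs manual review

    # Question 2: fall back to synonyms, but only those that belong to exactly one gene.
    via_synonym = set()
    for g in gene_qualifiers:
        ids = synonym_to_ids.get(g, set())
        if len(ids) == 1:
            via_synonym |= ids
    if len(via_synonym) == 1:
        return next(iter(via_synonym))
    return None


# Toy data for illustration only; "SPXX0000.01" is a placeholder id.
current_ids = {"SPAC27D7.09c", "SPAC17D4.01"}
synonyms = {
    "SPAC27D7.10c": {"SPAC27D7.09c"},
    "SPAC1834.12": {"SPAC17D4.01", "SPXX0000.01"},  # shared synonym, so ambiguous
}
print(resolve_feature(["SPAC27D7.10c"], current_ids, synonyms))
print(resolve_feature(["SPAC1834.12"], current_ids, synonyms))
```

A synonym shared by two genes deliberately resolves to nothing, so those cases end up in a list for manual checking rather than being assigned silently.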

I see, SPAC1834.12 is a synonym of both of these genes:

Screenshot 2022-12-01 at 19 55 21

probably due to some mix-up in the contig boundary which was between these two genes.

@ValWood

I managed to fix most cases either by using the obsolete_name qualifier or by co-occurrence (when a feature had one gene qualifier that was a known systematic_id and another that was unknown, I made the unknown one a synonym of the known one). Here is the list of the ones that remained orphaned, but I suspect it's not worth going into them.

https://github.com/pombase/genome_changelog/blob/master/valid_ids_data/genes_starting_with_SP_no_match.txt
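For reference, the co-occurrence step can be sketched roughly like this. This is a simplified illustration, not the exact code in the repository. The function name `infer_synonyms_by_cooccurrence` is hypothetical and the ids in the example are placeholders.

```python
from collections import defaultdict


def infer_synonyms_by_cooccurrence(features, systematic_ids):
    """Map unknown gene qualifiers to the known systematic_id they co-occur with.

    `features` is an iterable of gene-qualifier lists, one list per feature.
    """
    candidates = defaultdict(set)
    for qualifiers in features:
        known = [q for q in qualifiers if q in systematic_ids]
        unknown = [q for q in qualifiers if q not in systematic_ids]
        # Only trust features that carry exactly one known systematic_id.
        if len(known) == 1:
            for name in unknown:
                candidates[name].add(known[0])
    # Keep names that point to a single gene across all features; the rest stay orphaned.
    return {
        name: next(iter(ids)) for name, ids in candidates.items() if len(ids) == 1
    }


# Placeholder ids for illustration only.
known_ids = {"SPTEST1.01", "SPTEST2.02c"}
features = [
    ["abc1", "SPTEST1.01"],        # abc1 co-occurs with one known id
    ["SPTEST2.02c", "SPOLD9.09"],  # SPOLD9.09 becomes a synonym candidate
    ["Sp-snR40-sno"],              # nothing known on this feature: stays orphaned
]
print(infer_synonyms_by_cooccurrence(features, known_ids))
```

Names that co-occur with different systematic_ids on different features are dropped rather than guessed, which is one way a name can end up in the no-match list.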

I will close as "not planned"