Missing systematic_id on some genes

Question

Closed this issue 2 years ago · 4 comments

I noticed that in some of the old genome versions, some features did not have a systematic_id qualifier, and they only had gene qualifiers (sometimes more than one).

As an arbitrary check, I looked at chromosome1 in the ftp folder 20030814. For the vast majority of features (2392), one of the gene qualifiers corresponds to a current systematic_id. For some others, however, none of the gene qualifiers corresponds to a current systematic_id or synonym. Full list for that version below:

['5SrRNA'] rRNA
['snr52'] snRNA
['Sp-snR40-sno'] snRNA
['Z8 snoRNA'] snRNA
['Z7 snoRNA'] snRNA
['Z5 snoRNA'] snRNA
['z4 snoRNA'] snRNA
['Z3 snoRNA'] snRNA
['SPAC1D4.07c'] CDS
['SPAC4G8.15c'] CDS
['U3snRNA'] snRNA
['Sp-snR41-sno'] snRNA
['Sp-snR70-sno'] snRNA
['Sp-snR51b-sno'] snRNA
['Sp-Z16-sno'] snRNA
['SPAC1002.21'] CDS
['putative gene free region'] misc_feature
['Sp-U24b-sno'] snRNA
['SPAC823.02'] CDS
['Sp-snR02-sno'] snRNA
['SPAC6F6.18c'] CDS
['SPAC3F10.14'] misc_feature
['SPRG5SC-1'] rRNA
['meu2', 'SPAC1556.06b'] CDS
['SPAC1F12.03c'] CDS
['Sp-snR62-sno'] snRNA
['SPAC4H3.12c'] CDS
['plr48'] misc_RNA
['SPAPB18E9.03c'] CDS
['prl53', 'prl63', 'prl49'] misc_RNA
['prl53'] misc_RNA
['prl49'] misc_RNA
['prl63'] misc_RNA
['Sp-snR69b-sno'] snRNA
['Sp-snR54b-sno'] snRNA
['omt3'] misc_RNA
['SPAC27D7.10c'] CDS
['SPAC22E12.12'] CDS
['SPAC22E12.15'] CDS
['Sp-snR38-sno'] snRNA
['Sp-snR56-sno'] snRNA
['Sp-snR58-sno'] snRNA
['U24-sno'] snRNA
['Sp-snR69-sno'] snRNA
['pex7', 'SPAC1834.12', 'SPAP17D4.01'] CDS

Some are clearly temporary gene names, but some others are a bit weird. For example:

['pex7', 'SPAC1834.12', 'SPAP17D4.01'] CDS. I think this is probably a typo, since pex7 is SPAC17D4.01 (SPAC and not SPAP).
['SPAC27D7.10c'] CDS. This one is currently listed as a synonym of SPAC27D7.09c, but at the time the two were different:

FT   CDS             complement(4519039..4520190)
FT                   /colour=12
FT                   /gene="SPAC27D7.09c"
FT                   /product="hypothetical protein; possibly S. pombe 
FT                   specific; predicted N-terminal signal sequence; similar to 
FT                   S. pombe SPAC27D7.10c and SPAC27D7.11c and SPBC3D6.02 
FT                   (paralogs); tandem duplication"
FT                   /fasta_file="fasta/c27D7.tab.seq.00028.out"

FT   CDS             complement(4522076..4523227)
FT                   /colour=12
FT                   /gene="SPAC27D7.10c"
FT                   /product="hypothetical protein; possibly S. pombe 
FT                   specific; predicted N-terminal signal sequence; similar to 
FT                   S. pombe SPAC27D7.09c and SPAC27D7.11c and SPBC3D6.02 
FT                   (paralogs); tandem duplication"
FT                   /fasta_file="fasta/c27D7.tab.seq.00029.out"
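For reference, the /gene values of records like the two above can be pulled out with a small parser. This is only a minimal sketch, assuming the EMBL-style layout shown here (feature key right after `FT` plus three spaces, qualifiers on indented lines); the function name `parse_gene_qualifiers` and the `SAMPLE_FT` text are made up for illustration and this is not the script that produced the list above.

```python
import re

# Shortened copy of the two records above (illustrative only).
SAMPLE_FT = """
FT   CDS             complement(4519039..4520190)
FT                   /colour=12
FT                   /gene="SPAC27D7.09c"
FT                   /product="hypothetical protein; similar to
FT                   S. pombe SPAC27D7.10c and SPAC27D7.11c"
FT   CDS             complement(4522076..4523227)
FT                   /colour=12
FT                   /gene="SPAC27D7.10c"
"""


def parse_gene_qualifiers(ft_text):
    """Return a list of (feature_type, [gene qualifier values]) tuples."""
    features = []
    for line in ft_text.splitlines():
        if not line.startswith("FT"):
            continue
        # A new feature starts when the key follows "FT" plus three spaces.
        key_match = re.match(r"FT {3}(\S+)\s+\S", line)
        if key_match:
            features.append((key_match.group(1), []))
            continue
        # Only lines that begin with /gene count; IDs mentioned inside
        # /product text are deliberately ignored.
        gene_match = re.match(r'FT\s+/gene="([^"]+)"', line)
        if gene_match and features:
            features[-1][1].append(gene_match.group(1))
    return features


print(parse_gene_qualifiers(SAMPLE_FT))
```

A feature with several /gene lines ends up with several values in its list, which is the shape used in the list above.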

My questions here are:

If a feature X has a /gene qualifier that matches a current systematic_id, is it safe to assume that feature X corresponds to the gene with that systematic_id?
If a feature X has no /gene qualifier that matches a systematic_id, but has a /gene qualifier that matches a current unique synonym that only appears once in the tsv file (maybe some appear twice), is it safe to assume that feature X corresponds to the gene that currently has that synonym?
- Is this true both for /gene qualifiers that start with SP and for those that are in underscore?
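To make the two rules in these questions concrete, here is a minimal sketch of the lookup order I have in mind. The function name `resolve_feature`, the two lookup tables, and the placeholder IDs `GENE_A` and `GENE_B` are hypothetical; in practice the tables would be built from the current gene list and the synonyms tsv. It only encodes the rules, it does not settle whether they are safe.

```python
def resolve_feature(gene_qualifiers, current_ids, synonym_to_ids):
    """Map a feature's /gene values to a current systematic_id, if possible.

    Returns (systematic_id or None, reason).
    """
    # Rule 1: a /gene value that is itself a current systematic_id.
    direct = [g for g in gene_qualifiers if g in current_ids]
    if len(direct) == 1:
        return direct[0], "systematic_id"
    if len(direct) > 1:
        return None, "ambiguous"

    # Rule 2: a /gene value that is a synonym of exactly one current gene.
    candidates = set()
    for g in gene_qualifiers:
        ids = synonym_to_ids.get(g, set())
        if len(ids) == 1:
            candidates |= ids
    if len(candidates) == 1:
        return next(iter(candidates)), "unique_synonym"
    return None, "unresolved"


# Toy lookup tables (assumptions for illustration, not real PomBase data).
current_ids = {"SPAC27D7.09c"}
synonym_to_ids = {
    "SPAC27D7.10c": {"SPAC27D7.09c"},   # synonym of a single gene
    "SPAC1834.12": {"GENE_A", "GENE_B"},  # synonym shared by two genes
}

print(resolve_feature(["SPAC27D7.10c"], current_ids, synonym_to_ids))
print(resolve_feature(["SPAC1834.12"], current_ids, synonym_to_ids))
```

A synonym shared by two genes is reported as unresolved rather than guessed, so those cases can be checked by hand.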

Answer 1 · 2022-12-01T19:13:10.000Z

https://www.pombase.org/status/new-and-removed-genes

Answer 2 · 2022-12-01T19:44:23.000Z

SPAC27D7.09c/10c: This one is a weird edge case. The assembly is incorrect and contains an extra copy of this genomic segment, so it looked like a gene duplication. You can see it reported here: https://www.pombase.org/status/sequencing-updates I removed one CDS and put a big misc feature across it with the note /note="This repeat region is caused by a misassembly and will be removed from the genomic sequence shortly." because I didn't know what else to do with it, but I didn't want to leave it looking as though there were two identical genes. There is only one copy (this has since been confirmed by others). This was in 2004; I didn't expect it would be this long before we had a final sequence.

Answer 3 · 2022-12-01T20:00:17.000Z

I see, SPAC1834.12 is a synonym of both of these genes:

probably due to some mix-up in the contig boundary which was between these two genes.

Answer 4 · 2022-12-08T18:18:33.000Z

@ValWood

I managed to fix most cases either by going to obsolete_name or by co-occurrence (a feature had one gene qualifier with a known systematic_id and another that was unknown, so I made the unknown one a synonym). Here is the list of the ones that remained orphaned, but I suspect it's not worth going into it.

https://github.com/pombase/genome_changelog/blob/master/valid_ids_data/genes_starting_with_SP_no_match.txt

I will close as "not planned"
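For anyone picking this up later, the co-occurrence step mentioned above can be sketched roughly as follows. This is a simplified illustration, not the actual code in the genome_changelog repository; the function name `synonyms_by_cooccurrence` and the IDs in the example are made up.

```python
def synonyms_by_cooccurrence(features, known_ids):
    """Propose synonyms from features that mix known and unknown /gene values.

    `features` is a list of /gene value lists, one list per feature.
    Returns {unknown_value: {systematic_id, ...}}.
    """
    proposed = {}
    for gene_qualifiers in features:
        known = [g for g in gene_qualifiers if g in known_ids]
        unknown = [g for g in gene_qualifiers if g not in known_ids]
        # Only trust features with exactly one known systematic_id.
        if len(known) == 1:
            for g in unknown:
                proposed.setdefault(g, set()).add(known[0])
    return proposed


# Made-up example data (hypothetical IDs, not taken from any genome version).
features = [
    ["abc1", "SPAC0000.01"],  # unknown name next to a known ID
    ["SPAC0000.02c"],         # nothing to learn here
    ["xyz9"],                 # stays orphaned: no known ID alongside it
]
known_ids = {"SPAC0000.01", "SPAC0000.02c"}

print(synonyms_by_cooccurrence(features, known_ids))
```

Values that never appear next to a known systematic_id are left out of the result, which is how the remaining orphans end up in a separate list.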