TranslatorSRI/Babel

Number of GeneProtein conflations and info-content values fell in 2024oct1


2024oct1 has fewer gene-protein conflations (19,701,538) than 2024aug18 (21,431,316), a drop of 1,729,778 or about 8%. It also has slightly fewer info-content values (3,345,015) than 2024aug18 (3,346,582), a drop of 1,567. We should figure out why this is.
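One way to find out which gene identifiers disappeared is to diff the two conflation files. The following is a minimal sketch, assuming each line of a GeneProtein conflation file is a JSON array of CURIEs (for example `["NCBIGene:9736071", "UniProtKB:E0SDS8"]`). The function names `conflated_ids` and `lost_ids` are hypothetical and not part of Babel.

```python
import json
from typing import Iterable, Set


def conflated_ids(lines: Iterable[str], prefix: str = "NCBIGene:") -> Set[str]:
    """Collect every CURIE with the given prefix from conflation lines.

    Assumption: each non-empty line is a JSON array of CURIE strings.
    """
    ids: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for curie in json.loads(line):
            if curie.startswith(prefix):
                ids.add(curie)
    return ids


def lost_ids(old_lines: Iterable[str], new_lines: Iterable[str],
             prefix: str = "NCBIGene:") -> Set[str]:
    """Return identifiers conflated in the old release but not in the new one."""
    return conflated_ids(old_lines, prefix) - conflated_ids(new_lines, prefix)
```

Both arguments can be open file handles for the 2024aug18 and 2024oct1 conflation files, since the functions only iterate over lines. The resulting set is the pool of candidates to trace back to `gene_info.gz`.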

I traced one example back and found that the gene identifier (NCBIGene:9736071) is no longer present in gene_info.gz (as downloaded from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz), presumably because it was "discontinued on 23-Aug-2024". We previously associated this gene with UniProtKB:E0SDS8, which is still present in our database (but trying a GeneProtein conflation on this will simply return the gene). More information about prokaryotic genes discontinued by NCBI: https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/faq/#FAQ1
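The presence check described above can be sketched roughly as follows. This assumes gene_info.gz is tab-delimited with a header line starting with `#` and the GeneID in the second column. The helper name `present_gene_ids` is hypothetical.

```python
from typing import Iterable, Set


def present_gene_ids(rows: Iterable[str], wanted: Set[str]) -> Set[str]:
    """Return the subset of `wanted` GeneIDs that appear in gene_info rows.

    Assumption: rows are tab-delimited, the header starts with '#',
    and the second column holds the numeric GeneID (no 'NCBIGene:' prefix).
    """
    found: Set[str] = set()
    for line in rows:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and fields[1] in wanted:
            found.add(fields[1])
    return found


# Example use against a local download (path is an assumption):
#
#   import gzip
#   with gzip.open("gene_info.gz", "rt", encoding="utf-8") as f:
#       missing = {"9736071"} - present_gene_ids(f, {"9736071"})
```

Streaming the file line by line avoids loading the whole of gene_info.gz into memory, which matters because the uncompressed file is large.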

I spot-checked about 10 of the deleted genes at random, and they were all bacterial genes discontinued in late August.

We could try to use a previous gene_info.gz file so that we don't lose this information, but that seems dumb. If these identifiers are mostly prokaryotic genes, they're unlikely to have an effect on Translator, and I assume UniProtKB will eventually update their IDs. But when we have a bit of free time, it might be worth looking into the gene history files to see if we can find new mappings for those identifiers for a future Babel release.
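A lookup against the gene history could look something like the sketch below. It assumes the history file is NCBI's gene_history.gz, tab-delimited with the columns tax_id, GeneID, Discontinued_GeneID, Discontinued_Symbol and Discontinue_Date, where GeneID is `-` when a discontinued gene has no replacement. The function name `replacement_map` is hypothetical.

```python
from typing import Dict, Iterable, Optional


def replacement_map(rows: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each discontinued GeneID to its replacement, or None if there is none.

    Assumed column order: tax_id, GeneID, Discontinued_GeneID, ...
    A GeneID of '-' is taken to mean "discontinued without replacement".
    """
    mapping: Dict[str, Optional[str]] = {}
    for line in rows:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        current, discontinued = fields[1], fields[2]
        mapping[discontinued] = None if current == "-" else current
    return mapping


# Example use against a local download (path is an assumption):
#
#   import gzip
#   with gzip.open("gene_history.gz", "rt", encoding="utf-8") as f:
#       history = replacement_map(f)
#   history.get("9736071")  # None would mean no replacement was recorded
```

Genes that map to `None` were discontinued outright, so only the entries with a replacement GeneID would yield new gene-protein mappings.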