SysBioChalmers/GECKO

Fix: id entry in databases can be duplicated (kegg.tsv and uniprot.tsv)

ae-tafur opened this issue · 1 comment

Description of the bug:

Recently, while working with a non-model organism, I noticed that it has two genes (same sequence) with different Gene Names but the same Entry (protein ID). This duplicates the usage_prot_* reaction as well as the reactions in which the protein is involved.

I propose that loadDatabases should validate that Gene Name and Entry are unique. If it finds a duplicated entry, it should issue a warning so that the user can decide how to proceed.
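To illustrate the kind of check proposed here, a minimal sketch in Python (not GECKO's MATLAB code, and not necessarily how loadDatabases handles it). The function name `find_duplicate_entries`, the column headers `Entry` and `Gene Name`, and the IDs `A0A000`, `geneA` and `geneB` are assumptions made up for the example; only P02309 and its gene names come from this thread.

```python
import csv
import io
import warnings
from collections import defaultdict


def find_duplicate_entries(tsv_text, id_column="Entry", name_column="Gene Name"):
    """Return {entry: [gene names]} for every entry found on more than one row.

    Hypothetical helper: the column headers are assumptions, not GECKO's format.
    """
    names_by_entry = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        names_by_entry[row[id_column]].append(row[name_column])

    # Keep only the entries that occur on two or more rows.
    duplicates = {
        entry: names for entry, names in names_by_entry.items() if len(names) > 1
    }
    # Warn instead of raising, so the user can decide how to proceed.
    for entry, names in duplicates.items():
        warnings.warn(
            f"Entry {entry} appears on {len(names)} rows "
            f"(gene names: {', '.join(names)})"
        )
    return duplicates


# Made-up table: A0A000 is listed twice under different gene names,
# while P02309 is a single row that carries two gene names.
example = (
    "Entry\tGene Name\n"
    "A0A000\tgeneA\n"
    "A0A000\tgeneB\n"
    "P02309\tYBR009C; YNL030W\n"
)
duplicates = find_duplicate_entries(example)
```

Only the entry that occurs on two separate rows is flagged. A single row that lists several gene names is left alone.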

I hereby confirm that I have:

  • Tested my code with all requirements for running GECKO
  • Done this analysis in the main branch of the repository
  • Checked that a similar issue does not exist already

edkerk commented

Resolved in #349 for uniprot.tsv.

A similar implementation will not be made for kegg.tsv, as there it is not uncommon to have gathered a correct species-specific KEGG database and still have duplicate IDs.

For instance, for S. cerevisiae: sce:YNL030W and sce:YBR009C both refer to P02309.

At the same time, uniprot.tsv would have just one P02309 entry, which is instead assigned the Gene Name YBR009C; YNL030W. One can argue that this is actually problematic, but if it conflicts with the model reconstruction, it would be reported as a gene in noUniprot when running makeEcModel. In contrast, the problem mentioned by @ae-tafur would go unnoticed. So "duplications" like these should not throw an error.

Going back to KEGG, then: having duplicate UniProt IDs is also not detrimental, because the KEGG database is only used for extracting EC numbers and not for defining usage_prot_* reactions as mentioned above.
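As a rough illustration of why duplicate IDs are harmless when the table is only consulted for EC numbers, here is a small Python sketch. It is not GECKO's implementation: the function `ec_numbers_by_uniprot`, the three-field row layout, and all IDs and EC numbers below are placeholders invented for the example.

```python
from collections import defaultdict


def ec_numbers_by_uniprot(kegg_rows):
    """Collect the set of EC numbers per UniProt ID.

    Hypothetical helper; each row is (uniprot_id, kegg_gene, ec_field),
    an assumed layout and not the actual kegg.tsv format.
    """
    ec_by_id = defaultdict(set)
    for uniprot_id, _kegg_gene, ec_field in kegg_rows:
        # A set absorbs repeated EC numbers, so rows that share the same
        # UniProt ID add nothing new.
        ec_by_id[uniprot_id].update(ec_field.split())
    return dict(ec_by_id)


# Placeholder rows: two KEGG genes point to the same UniProt ID.
kegg_rows = [
    ("PROT1", "org:geneA", "1.1.1.1"),
    ("PROT1", "org:geneB", "1.1.1.1"),
    ("PROT2", "org:geneC", "2.7.1.1"),
]
ec_lookup = ec_numbers_by_uniprot(kegg_rows)
```

Because the lookup is keyed by UniProt ID and stores sets, the two rows for `PROT1` collapse into a single result.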

tl;dr: with #349, this issue is resolved.