geneontology/gocamgen

Proteoforms shouldn't be split into separate models

Closed this issue · 14 comments

ukemi commented

In the recent conversion there is a mouse model for PR:Q9D828. PR:Q9D828 represents an isoform of TFG (MGI:1338041) and should be included in that model. The correct mouse gene for this isoform can be found in the GPI 1.1 file in column 8:

PR Q9D828 mTFG/iso:Q9D828 Tfg translation product isoform Q9D828 (mouse) mTFG/iso:Q9D828|mTFG/iso:Q9D828 protein taxon:10090 MGI:MGI:1338041 UniProtKB:Q9D828

So the bottom line is that the grouping of annotations in gene-centric models should be based on what is in column 8 of the GPI file and not based on column 2.
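As a rough illustration of grouping by column 8, here is a minimal sketch. It assumes the GPI 1.1 file is tab-separated and that the row quoted above splits into columns as shown. The function names `parent_lookup` and `group_by_gene` are made up for this example and are not gocamgen or ontobio functions.

```python
from collections import defaultdict

# The row quoted above, re-split on tabs. The split points are an assumption
# based on the GPI 1.1 column order, not copied from the actual file.
GPI_1_1_ROW = "\t".join([
    "PR", "Q9D828", "mTFG/iso:Q9D828",
    "Tfg translation product isoform Q9D828 (mouse)",
    "mTFG/iso:Q9D828|mTFG/iso:Q9D828", "protein", "taxon:10090",
    "MGI:MGI:1338041", "UniProtKB:Q9D828",
])


def parent_lookup(gpi_lines):
    """Map each entity ID (col 1 + ':' + col 2) to its col 8 parent, if any."""
    parents = {}
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        entity = f"{cols[0]}:{cols[1]}"
        parent = cols[7].strip() if len(cols) > 7 else ""
        if parent:
            parents[entity] = parent
    return parents


def group_by_gene(annotated_ids, parents):
    """Group annotated entities under their col 8 parent, else under themselves."""
    groups = defaultdict(list)
    for entity in annotated_ids:
        groups[parents.get(entity, entity)].append(entity)
    return dict(groups)


parents = parent_lookup([GPI_1_1_ROW])
models = group_by_gene(["MGI:MGI:1338041", "PR:Q9D828"], parents)
print(models)
```

With this lookup, annotations to the gene and to its isoform land under the same key, so they would end up in one model rather than two.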

ping @vanaukenk

Ah, another great reason to be using the GPI file in the translations. Sounds doable.

@ukemi I have a single example model loaded into my USC server (it's back up!) that packages MGI:MGI:1338041 and its proteoform PR:Q9D828 in the same model.

ukemi commented

Yay. This is correct. PR:Q9D828 is a proteoform of the Tfg gene.

@ukemi @vanaukenk Since we moved to GPI 2.0 there are now two fields, Encoded_By (col 7) and Parent_Protein (col 8), that I can use to connect proteoforms to their gene model. Currently unsure which field I should be using. I could do:

  1. Only use Encoded_By
  2. Only use Parent_Protein
  3. Check both (if they're both filled, then I just use the first one that has an ID, or scan all IDs until I find an entity with the gene SO type?)
  4. Check the SO type of the GPI entity to determine which field to look at: proteins and transcripts use Encoded_By; protein isoforms or modified proteins use Parent_Protein
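To make the four options easier to compare, here is a hedged sketch that applies each one to a single row. The row is hypothetical and only mirrors the column layout. The helper names (`first_id`, `pick_parent`), the placeholder type label, and the assumption that multiple IDs in a field are pipe-separated are all mine. Option 3 is shown only in its simpler "first filled field" form.

```python
# 0-based indexes for GPI 2.0 col 5 (object type), col 7 and col 8.
OBJECT_TYPE, ENCODED_BY, PARENT_PROTEIN = 4, 6, 7


def first_id(field):
    """Return the first ID in a pipe-separated field, or None if it is blank."""
    ids = [part.strip() for part in field.split("|") if part.strip()]
    return ids[0] if ids else None


def pick_parent(cols, option, proteoform_types=frozenset()):
    """Pick the grouping ID for one GPI 2.0 row under options 1-4."""
    encoded_by = first_id(cols[ENCODED_BY])
    parent_protein = first_id(cols[PARENT_PROTEIN])
    if option == 1:  # only Encoded_By
        return encoded_by
    if option == 2:  # only Parent_Protein
        return parent_protein
    if option == 3:  # check both, take the first field that is filled
        return encoded_by or parent_protein
    if option == 4:  # let the entity type decide which field to read
        if cols[OBJECT_TYPE] in proteoform_types:
            return parent_protein
        return encoded_by
    raise ValueError(f"unknown option: {option}")


# Hypothetical row: Encoded_By filled, Parent_Protein blank.
# "ISOFORM_TYPE" is a placeholder, not a real ontology ID.
row = ["PR:Q9D828", "mTFG/iso:Q9D828", "", "", "ISOFORM_TYPE",
       "NCBITaxon:10090", "MGI:MGI:1338041", ""]

results = {
    option: pick_parent(row, option, proteoform_types={"ISOFORM_TYPE"})
    for option in (1, 2, 3, 4)
}
print(results)
```

For a row shaped like this one, options 2 and 4 return nothing because Parent_Protein is blank, while options 1 and 3 both resolve to the gene.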

FYI, the MGI GPI 2.0 file here looks like it only ever uses col 7 Encoded_By. Col 8 Parent_Protein is always blank.

@ukemi and I discussed this a month or so ago but it came up again while copy/pasting slides from previous meetings into my talk tomorrow ;)

ukemi commented

Hey @dustine32 Since we want to sum this all up to the level of a gene, then column 7 would be the most appropriate. You are correct we don't use column 8 yet. At some point I would suspect that this would be populated with information from the protein ontology for mouse proteins.

@ukemi @dustine32
Just confirming that the answer to this may depend upon the group submitting the annotations and what annotation identifiers they typically use. For example, a group that annotates to UniProtKB accessions may want to sum up to the parent protein, or possibly even have separate models for each proteoform.
For MGI, you've chosen to sum up to the gene since that is your primary curation identifier, right?

ukemi commented

But even for UniProtKB, shouldn't it sum up to the GCRP that is a stand-in for the gene? In that case I would expect the GCRP to be in column 7. Maybe I'm mistaken, but yes for MGI we definitely want to sum up data to genes.

Hi - my main point was that there may not be a single rule that captures what all groups want to do wrt importing proteoforms and GO-CAMs. Am I missing something?

ukemi commented

If they want to make something other than gene-centric representations of their annotations, then yes. Maybe I'm missing something.

@ukemi @vanaukenk Thanks for discussing this! It sounds like, for now, we can just look in Encoded_By for the gene connection, but we'll remain aware that these rules could be more complicated for other MODs down the road.
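A minimal sketch of the "just look in Encoded_By" rule, assuming a tab-separated GPI 2.0 file with the entity ID in column 1 and Encoded_By in column 7. The sample rows are hypothetical, and `encoded_by_index` and `model_key` are illustrative names rather than the actual ontobio or gocamgen code. Entities with a blank Encoded_By simply act as their own grouping key.

```python
ENCODED_BY = 6  # 0-based index of GPI 2.0 column 7


def encoded_by_index(gpi_lines):
    """Map entity ID (col 1) to the first ID listed in Encoded_By (col 7)."""
    index = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        cols += [""] * (ENCODED_BY + 1 - len(cols))  # pad short rows
        genes = [g.strip() for g in cols[ENCODED_BY].split("|") if g.strip()]
        if genes:
            index[cols[0]] = genes[0]
    return index


def model_key(entity_id, index):
    """Return the gene a model should be built around for this entity."""
    return index.get(entity_id, entity_id)


# Hypothetical sample rows; most columns are left blank for brevity.
lines = [
    "!gpi-version: 2.0",
    "\t".join(["MGI:MGI:1338041", "Tfg", "", "", "", "NCBITaxon:10090", "", ""]),
    "\t".join(["PR:Q9D828", "mTFG/iso:Q9D828", "", "", "",
               "NCBITaxon:10090", "MGI:MGI:1338041", ""]),
]

index = encoded_by_index(lines)
print(model_key("PR:Q9D828", index))
print(model_key("MGI:MGI:1338041", index))
```

If other groups later need Parent_Protein or per-proteoform models, only `model_key` would have to change in a setup like this.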

I just remembered to check ZFIN's GPI 2.0 (my last version is dated 2021-01-25). @sierra-moxon @sabrinatoro Looks like neither Encoded_By nor Parent_Protein is used. Is this OK with you?

@dustine32 That is correct, we do not have any "encoded_by" or "parent_protein" values.
The ZFIN GPI contains only genes.

@sabrinatoro Thanks for clarifying!

Tested commit biolink/ontobio@309f077 by checking whether annotations to PR:Q769J6 appeared in model for MGI:MGI:2685556. Appears to work now.

This is now working for the new models in noctua-dev.