geneontology/gocamgen

Resolve Shex failures in MGI annotations due to invalid identifiers for binding input

Opened this issue · 38 comments

ukemi commented

When annotations are imported into GO-CAM, binding annotations are transformed. If the annotation has an IPI evidence code, the value of the 'with' field is converted to a has_input for the GO-CAM binding function.
The rationale for this is that historically, curators captured the binding partners using IPI and the partner in the 'with' field.
Since the 'with' field was traditionally entered as free text in MGI annotations, the values were never confirmed to be valid identifiers with respect to the MGI GPI file. Over the years, many of those identifiers have been entered as UniProtKB ids to represent proteins. These identifiers fail the Shex because valid identifiers for proteins for MGI come from PRO.
I agree with @goodb that this should be fixed in the MGI source file rather than having @dustine32 manipulate the file downstream. This allows for MGI curators to check the validity of the identifier with respect to the gene objects at MGI.

  1. QC will be done at MGI to be sure that all the identifiers in the 'with' field for binding annotations are valid objects.
  2. If a 'with' field identifier at MGI corresponds to a mouse marker and is a UniProt id, it will be converted to a PRO id and we will check that they are in the GPI file as annotatable objects. This should allow all of these to pass the Shex.
  3. If the PRO identifiers for the UniProt IDs are not available or are not in the MGI GPI file, we will investigate this and either change the 'with' field appropriately or modify the GPI file.
  4. In some cases, the value of the 'with' field represents an object from another species because the experiment was done with multiple species. In these cases we do not want the value of the 'with' field to be converted to an input for the GO-CAM binding function. The annotation should be imported without conversion: it keeps the IPI evidence code, and the value stays in the 'with' field. Note that these annotations are valid in this format because, in this case, the 'with' field is also supporting the evidence for the binding function.
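
The four steps above can be sketched as a small triage function. This is only an illustration, not code from gocamgen or the MGI pipeline: `gpi_ids` (identifiers present in the MGI GPI file), `mouse_uniprot` (UniProtKB accessions known to be mouse proteins), the function name, and the action labels are all hypothetical. It also assumes the simplest case, where the PRO id is the UniProtKB accession with a `PR:` prefix.

```python
from enum import Enum


class WithAction(Enum):
    CONVERT_TO_INPUT = "convert to has_input"
    INVESTIGATE = "fix the 'with' value or the GPI file"
    KEEP_IN_WITH = "leave in the with/from field"


def triage_with_value(with_id, gpi_ids, mouse_uniprot):
    """Decide what to do with one 'with' value of an IPI binding annotation.

    Returns a tuple of (possibly rewritten identifier, WithAction).
    """
    prefix, _, local_id = with_id.partition(":")

    # Step 1: the identifier is already a valid object in the GPI file.
    if with_id in gpi_ids:
        return with_id, WithAction.CONVERT_TO_INPUT

    if prefix == "UniProtKB":
        # Step 4: a protein from another species stays in the 'with' field.
        if local_id not in mouse_uniprot:
            return with_id, WithAction.KEEP_IN_WITH
        # Step 2: a mouse UniProt id is rewritten to a PRO id (assumed 1:1 here).
        pro_id = "PR:" + local_id
        if pro_id in gpi_ids:
            return pro_id, WithAction.CONVERT_TO_INPUT

    # Step 3: everything else needs a curator to look at it.
    return with_id, WithAction.INVESTIGATE
```

Keeping the outcome as an explicit action rather than silently rewriting the value makes it easy to count how many annotations fall into each bucket.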

In anticipation of a question I expect from @goodb: the reason we are not converting the annotations at source to make the 'with' fields inputs is that curators and users have informed us that they use the 'with' field values in legacy tools, and the change would break their procedures. Therefore we have decided to convert on import to make the GO-CAM models 'correct', and we will need to back-convert when we generate conventional annotations (GAF and GPAD) from the GO-CAM models.

ukemi commented

A few numbers from @mdolanme
Number of binding annotations using IPI and the 'with' column (non-chebi): 3968
Number whose values are already in the GPI file: 77
Number whose values will be in the GPI when we switch the 'UniProtKB:' prefix to 'PR:': 3111
Number that are either non-mouse, don't have a PRO id or have typos: 780
Number of these that are UniProt: 726/780
Number of these that are RefSeq: 37/780
Number of these that are PRO ids: 9/780
Number of these that are typos: 7/780
Number of these that are EMBL: 1/780
Number of these UniProt ids that are definitely non-mouse: 304/726
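
As a quick consistency check on the counts above, the buckets can be summed. The variable names are only labels chosen for this sketch; the numbers are the ones quoted in this comment.

```python
total = 3968                 # IPI binding annotations with a non-ChEBI 'with' value
already_in_gpi = 77
in_gpi_after_switch = 3111   # valid once UniProtKB: becomes PR:
remaining = 780              # non-mouse, no PRO id, or typos

remaining_by_type = {
    "UniProt": 726,
    "RefSeq": 37,
    "PRO": 9,
    "typo": 7,
    "EMBL": 1,
}
non_mouse_uniprot = 304

# The three top-level buckets account for every annotation.
assert already_in_gpi + in_gpi_after_switch + remaining == total
# The per-type breakdown accounts for every remaining value.
assert sum(remaining_by_type.values()) == remaining

fraction_in_gpi = (already_in_gpi + in_gpi_after_switch) / total
fraction_non_mouse = non_mouse_uniprot / remaining_by_type["UniProt"]
print(f"in GPI after the prefix switch: {fraction_in_gpi:.1%}")
print(f"remaining UniProt ids that are definitely non-mouse: {fraction_non_mouse:.1%}")
```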

non_mouse_uniprot_with_ids.txt
Attached this list from @ukemi of non-mouse UniProtKB identifiers that appear in with/from field in some MGI binding annotations. These are examples of task 4 in @ukemi's initial issue description, which should not be converted to has_input edges and instead live in the with/from field of the evidence individual.

04/29/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

ukemi commented

Hi @dustine32. In this version we have replaced the 'with' field values in binding annotations with PRO ids. They should now be valid for the Shex checks since they will be in our GPI file and therefore in NEO. @loricorbani replaced several thousand automatically and I replaced close to a thousand manually. I'd like to see how many models are still failing the Shex checks to determine where I have missed any. We can discuss in more detail on the GO-CAM specs call.

@ukemi @loricorbani Awesome! I tested a one-off model for MGI:MGI:1917115 with this binding GPAD line:

MGI     MGI:1917115     enables GO:0005515      MGI:MGI:5613094|PMID:24916387   ECO:0000353     PR:Q91WT8               20150211        MGI             contributor=http://orcid.org/0000-0001-7476-6306

And the with/from value PR:Q91WT8 at least appears to be in NEO, since the ID resolves to a label (screenshot not reproduced here).
I'll run the ShEx minerva validator on this model to see if it is valid. Currently having trouble getting my Noctua instance "Use reasoner" option to work so heads up in case you try it too.
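
For reference, the GPAD line above can be split into named fields as in the sketch below. It assumes the twelve-column, tab-separated GPAD 1.1 layout. The tabs are flattened to spaces in the line as rendered here, so the example rebuilds it with explicit tabs, and the positions of the empty columns are an assumption. The field names and `parse_gpad_line` are illustrative labels, not gocamgen's API.

```python
GPAD_1_1_COLUMNS = [
    "db", "db_object_id", "qualifier", "go_id", "references",
    "evidence_code", "with_from", "interacting_taxon", "date",
    "assigned_by", "annotation_extension", "annotation_properties",
]


def parse_gpad_line(line):
    """Split one tab-separated GPAD 1.1 line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GPAD_1_1_COLUMNS):
        raise ValueError(f"expected {len(GPAD_1_1_COLUMNS)} columns, got {len(values)}")
    return dict(zip(GPAD_1_1_COLUMNS, values))


# The line from the comment above, rebuilt with tabs (empty columns assumed).
line = "\t".join([
    "MGI", "MGI:1917115", "enables", "GO:0005515",
    "MGI:MGI:5613094|PMID:24916387", "ECO:0000353", "PR:Q91WT8", "",
    "20150211", "MGI", "",
    "contributor=http://orcid.org/0000-0001-7476-6306",
])
record = parse_gpad_line(line)

# ECO:0000353 is the ECO term behind the IPI evidence code,
# and GO:0005515 is protein binding.
is_ipi_binding = (
    record["evidence_code"] == "ECO:0000353" and record["go_id"] == "GO:0005515"
)
print(record["with_from"], is_ipi_binding)
```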

goodb commented

@dustine32 let me know if you need help with the noctua/minerva reasoner situation. It's been in flux, so if you are running from a dev branch you may need to adjust some command line params. Should be stabilizing soon...

Thanks @goodb ! Yeah, I figured I would eventually need to hit you up on gitter or something to sort that out. I can probably attack that after the noon call today.

ukemi commented

Thanks @loricorbani. @dustine32, in this set of files we have changed the way we are retrieving PRO identifiers for proteins that correspond to MGI genes. I suggest that for further testing we use the gpi files supplied at this location.

05/22/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz
http://www.informatics.jax.org/downloads/custom/mgi.gpi.gz

These contain MGI/Production from 05/21.
Using PRO/GPI file.
With converted "contributor" values.

ukemi commented

Hi @dustine32. In this version I have fixed the binding inputs for all of the MFs with the IPI evidence code that you are converting to inputs. Once you have updated to the complete cell and anatomy ontology, could you run these through the Shex validation again and I will clean up any stragglers. There still may be a few. Thanks!!!

non_mouse_uniprot_with_ids.txt

Attaching newest list of non-mouse identifiers described above in #79 (comment).

ukemi commented

Hi @dustine32. I just noticed another issue in the models with respect to item 4. If you look at the model titled MGI:MGI:1929601, you will notice that in some cases, human proteins have been converted to inputs and are valid presumably because they are in Neo. These should be treated like the ones above #79 (comment). Clearly, my trying to find these manually is not an optimal strategy. Maybe we should switch to converting only identifiers that are found in the mouse GPI file to inputs and leaving all the rest in the 'with' field. This will get around the hand-built list in the comment above and the few more that I found today. I am pretty convinced that I have caught most of the 'true' errors that existed and those identifiers have all been corrected as either valid MGI identifiers or PR identifiers from the mouse GPI file. We can chat about this if it's not clear.
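
The rule proposed above (convert only identifiers that appear in the mouse GPI file, leave everything else in the 'with' field) could look roughly like this. It is a sketch under assumptions: a tab-separated GPI layout with the namespace in column 1 and the object id in column 2, `!` header lines, and pipe-separated 'with' values. The function names and the sample GPI rows are hypothetical.

```python
import io


def load_gpi_ids(handle):
    """Collect 'namespace:id' identifiers from a GPI file (columns 1 and 2 assumed)."""
    ids = set()
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def split_with_values(with_field, gpi_ids):
    """Partition pipe-separated 'with' values into (has_input ids, ids kept in 'with')."""
    values = [v for v in with_field.split("|") if v]
    inputs = [v for v in values if v in gpi_ids]
    kept = [v for v in values if v not in gpi_ids]
    return inputs, kept


# Hypothetical two-row GPI excerpt; symbols are placeholders.
sample_gpi = io.StringIO(
    "!gpi-version: 1.2\n"
    "MGI\tMGI:1917115\tgene_symbol\n"
    "PR\tQ91WT8\tprotein_symbol\n"
)
gpi_ids = load_gpi_ids(sample_gpi)

# A mouse PRO id becomes an input; a human UniProtKB id stays in 'with'.
inputs, kept = split_with_values("PR:Q91WT8|UniProtKB:P04637", gpi_ids)
print(inputs, kept)
```

With this rule no hand-built exclusion list is needed: anything not in the mouse GPI file, whatever its species or prefix, simply stays in the with/from field.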

06/08/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

ukemi commented

This version should have fixed the Shex errors.

06/09/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

ukemi commented

It should be very special. It should pass both the logic and Shex checks.

06/15/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

ukemi commented

Fixed a typo.

06/29/2020 : for dustin
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi.gpa
http://www.informatics.jax.org/downloads/custom/mgi.gpa.gz

08/05/2020 : for dustin
This is GPI version 2 from MGI
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi2.gpi

ukemi commented

This version does not have values in column 8 yet and does not have protein complexes yet. Once we (GOC) decide exactly what is supposed to go into column 8, we (MGI) will populate it. Adding the complexes should also be straightforward and we can run a test with them at some point to make sure they are then available for curation in Noctua. Thanks @loricorbani !!!!!

08/12/2020 : for dustin
This is GPI version 2 from MGI
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi2.gpi

ukemi commented

This version has mouse protein complexes from PRO along with all the other changes from the previous version.

09/11/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version
http://www.informatics.jax.org/downloads/custom/mgi2.gpi
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

ukemi commented

This GPAD2.0 file has everything including all of the properties that we will use for the initial import. We can go over the details on the call next Tuesday.

09/17/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

http://www.informatics.jax.org/downloads/custom/mgi2.gpi

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

ukemi commented

This file has been filtered so it only contains the annotations made by MGI curators using the MGI editorial interface.

09/24/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

ukemi commented

Hi @dustine32 and @dougli1sqrd
In this version we fixed the bug where contributes_to (RO:0002326) wasn't being added to the file correctly.

10/07/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

ukemi commented

Note that both of these files are filtered for annotations made by MGI curators. The previous versions were filtered for annotations made by MGI, but included ones made automatically by our orthology pipeline.

10/23/2020 : for dustin
This is GPI version 2 and GPAD version 2 from MGI
David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/mgi2.gpad

ukemi commented

@dustine32 In this file I have (I hope) cleaned up all of the MGI annotations and @loricorbani has replaced all of the annotation extension relations with the new ones that we decided on. So this is a test for 'the real thing'.

New versions

!gaf-version: 2.2
!gpa-version: 2.0
!gpi-version: 2.0

David H : add any comments about what is special about this version

MGI-curated only
http://www.informatics.jax.org/downloads/custom/go_cam_mgi.gpad
http://www.informatics.jax.org/downloads/custom/go_cam_gene_association.mgi
http://www.informatics.jax.org/downloads/custom/mgi.gpi

ukemi commented

@dustine32. This version has fixed the bug where we were using commas instead of pipes. It should resolve a lot of the nesting issues that I saw with our annotations because those will now be separated. Thanks @loricorbani
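
To illustrate why the separator matters: in GAF/GPAD multi-valued fields such as the annotation extension, a pipe separates independent alternatives while a comma joins expressions that apply together. The sketch below shows the difference on made-up extension values; `parse_extensions` is a hypothetical helper, not part of gocamgen.

```python
def parse_extensions(field):
    """Return a list of groups; each group is a list of relation(target) strings.

    Pipes separate independent groups; commas join expressions within a group.
    """
    groups = []
    for group in field.split("|"):
        expressions = [e.strip() for e in group.split(",") if e.strip()]
        if expressions:
            groups.append(expressions)
    return groups


# Comma: both expressions belong to one group and are interpreted together.
joined = parse_extensions("has_input(PR:Q91WT8),occurs_in(CL:0000000)")

# Pipe: two independent groups, each interpreted on its own.
separate = parse_extensions("has_input(PR:Q91WT8)|occurs_in(CL:0000000)")

print(len(joined), len(separate))
```

A file that uses commas where pipes were intended therefore collapses statements that should stand alone into a single combined group.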

I have installed an automated script that will generate/copy the go_cam files from our production database to:

http://www.informatics.jax.org/downloads/custom

on a daily basis. You can pick them up at your convenience/when you need fresh copies.