geneontology/gocamgen

Regularly generate "bad extensions" reports of MGI GPAD

Opened this issue · 27 comments

Related to #36 in that this report will become a normal part of the test load process.

In the meantime, we can begin the regularly scheduled pulling of the MGI GPAD file (upstream of GO pipeline), generation of its bad_extensions.tsv report, and posting to either this ticket or just a google drive folder. I'm gonna propose sometime on Wednesdays?

This currently is achievable with:

$ wget http://www.informatics.jax.org/downloads/reports/mgi.gpa.gz
$ gunzip mgi.gpa.gz
$ python3 gpad_extensions_mapper.py -f mgi.gpa -m MGI -e

A bad_extensions_[YYYYMMDD].tsv file will then be created.
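For the scheduled version, the three manual steps above could be wrapped in a small script. This is only a sketch: the helper names (`report_name`, `gunzip`, `mapper_command`, `weekly_run`) are made up for illustration, and it assumes the date stamp in the report name is the run date and that `gpad_extensions_mapper.py` sits in the current working directory.

```python
import datetime
import gzip
import shutil
import subprocess
import urllib.request
from pathlib import Path

MGI_GPAD_URL = "http://www.informatics.jax.org/downloads/reports/mgi.gpa.gz"


def report_name(run_date):
    """Date-stamped name following the bad_extensions_[YYYYMMDD].tsv pattern.

    Assumption: the stamp is the date of the run.
    """
    return f"bad_extensions_{run_date:%Y%m%d}.tsv"


def gunzip(src, dest):
    """Decompress src into dest (unlike the gunzip CLI, src is left in place)."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def mapper_command(gpad_path):
    """Build the same mapper invocation as the manual step."""
    return ["python3", "gpad_extensions_mapper.py",
            "-f", str(gpad_path), "-m", "MGI", "-e"]


def weekly_run(workdir):
    """Download, decompress, run the mapper; return the expected report name."""
    gz_path = workdir / "mgi.gpa.gz"
    urllib.request.urlretrieve(MGI_GPAD_URL, gz_path)
    gpad_path = gunzip(gz_path, workdir / "mgi.gpa")
    subprocess.run(mapper_command(gpad_path), check=True)
    return report_name(datetime.date.today())
```

Calling `weekly_run(Path("."))` from a Wednesday cron job would reproduce the manual run; uploading the resulting TSV is left out here.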

@ukemi First round is in a google sheet here:
https://docs.google.com/spreadsheets/d/1sjwT9cnG31fpg1r_-aQEYoqM33o7IYPEjHhcYutfVRk/edit?usp=sharing

This was on the 2019-10-22 MGI file.

BTW @vanaukenk @ukemi I made a folder in the GO drive for these under:

GO-CAM-and-Noctua -> Model & Validations -> GPAD extension validation reports
https://drive.google.com/drive/folders/1sQAeWJDEk-joMCIvuoUt5A1ERahbjlv4?usp=sharing

ukemi commented

Sweet! Looks like I've got a lot of these cleaned up. Some of them still look ok to me. Let's discuss them on our next call.

ukemi commented

@dustine32, what does violates combo rule mean?

@ukemi "violates combo rule" is from code used to invalidate combinations of certain extensions. Example:

If an occurs_in(EMAPA) exists, the line can't also contain an occurs_in(UBERON)

I think this was originally meant to prevent multiple occurs_in to anatomy terms in the same extension. Like, having multiple anatomy terms could be redundant or conflicting. But looking at some of the examples in the google sheet I think it's catching things that should be valid according to our formatted_ext_patterns.tsv rules. Example from the google sheet:

MGI:MGI:88373 | GO:0000978 | RNA polymerase II proximal promoter sequence-specific DNA binding | violates_combo_rule | has_input(MGI:MGI:108392)|occurs_in(CL:0000359),has_input(MGI:MGI:108392),has_input(MGI:MGI:97851)
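To make the example easier to read, here is a rough sketch of how the extension field in that row breaks down, plus the occurs_in(EMAPA)/occurs_in(UBERON) check as described above. It is not the actual gocamgen code: `parse_extensions` and `violates_occurs_in_combo` are hypothetical names, and it assumes the usual GPAD convention that `|` separates independent extensions and `,` separates the parts of one extension, with the rule applied per extension.

```python
import re

# One "relation(ID)" part, e.g. occurs_in(CL:0000359).
PART_RE = re.compile(r"^(?P<relation>\w+)\((?P<target>[^()]+)\)$")


def parse_extensions(field):
    """Split on "|" (independent extensions), then "," (parts of one extension)."""
    extensions = []
    for ext in field.split("|"):
        parts = []
        for part in ext.split(","):
            m = PART_RE.match(part.strip())
            if m is None:
                raise ValueError(f"unparseable extension part: {part!r}")
            parts.append((m["relation"], m["target"]))
        extensions.append(parts)
    return extensions


def violates_occurs_in_combo(extension):
    """True if a single extension mixes occurs_in(EMAPA) with occurs_in(UBERON)."""
    prefixes = {
        target.split(":", 1)[0]
        for relation, target in extension
        if relation == "occurs_in"
    }
    return {"EMAPA", "UBERON"} <= prefixes


field = (
    "has_input(MGI:MGI:108392)|occurs_in(CL:0000359),"
    "has_input(MGI:MGI:108392),has_input(MGI:MGI:97851)"
)
for ext in parse_extensions(field):
    print(ext, violates_occurs_in_combo(ext))
```

Under these assumptions the row splits into two extensions, and neither trips the occurs_in check, since the only occurs_in target is a CL term.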

In addition to the occurs_in([anatomy term]) combo rule, there's also this:

has_input(geneID)
has_input(CHEBI)
has_direct_input(geneID)
has_direct_input(CHEBI)

If one of these extensions exists, the line can't contain any of the other extensions in this list.

In the above example, I think the has_input(MGI:MGI:108392),has_input(MGI:MGI:97851) bit at the end is getting marked invalid by this combo rule even though our tsv (and ShEx) allows multiple has_input(geneID)'s on MFs. I'll need to debug this to confirm and fix.
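One way the flag could arise, sketched below, is if the rule counts every matching part rather than distinct entries from the list. This is a guess for illustration only, not the actual code, and still needs the debugging mentioned above; `violates_strict`, `violates_distinct`, and the CHEBI-prefix test for telling chemicals from gene IDs are all assumptions.

```python
# The four mutually exclusive entries, as (relation, target kind).
INPUT_COMBO = {
    ("has_input", "geneID"),
    ("has_input", "CHEBI"),
    ("has_direct_input", "geneID"),
    ("has_direct_input", "CHEBI"),
}


def kind(relation, target):
    """Crude classification: CHEBI prefix means CHEBI, anything else is 'geneID'."""
    return (relation, "CHEBI" if target.startswith("CHEBI:") else "geneID")


def combo_hits(extension):
    """All parts of one extension that match an entry in INPUT_COMBO."""
    return [kind(r, t) for r, t in extension if kind(r, t) in INPUT_COMBO]


def violates_strict(extension):
    """Strict reading: more than one matching part, repeats included."""
    return len(combo_hits(extension)) > 1


def violates_distinct(extension):
    """Looser reading: only different entries from the list conflict."""
    return len(set(combo_hits(extension))) > 1


ext = [
    ("occurs_in", "CL:0000359"),
    ("has_input", "MGI:MGI:108392"),
    ("has_input", "MGI:MGI:97851"),
]
print(violates_strict(ext), violates_distinct(ext))
```

With the strict reading the two has_input(geneID) parts conflict with each other, which would match what the sheet shows; with the looser reading the same extension passes.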

Thanks @dustine32
Is there a corresponding WB file that I should look at? I think I've corrected or updated a lot of our extensions, but it'd be good to know if there are any left to fix.

ukemi commented

Thanks @dustine32. I think we should not have any restrictions on inputs other than they should be continuants. I don't think there should be cardinality restrictions. Generic processes can have any number of inputs.

@vanaukenk Right! I don't think I had sent this for the 2019-10-07 GO release wb.gpad file. Here it is:
https://docs.google.com/spreadsheets/d/1AI-g_HgM78exp-vX0TMY1kv94lX1Ip_myLYfbsnYKRY/edit#gid=0

Only 19 lines!

ukemi commented

I'm jealous.

@ukemi Yep, your suggestion agrees with the current ShEx spec for any MF or BP. I can change the TSV rules to reflect this and remove the combo rule for has_input. It would be slick to specify "only continuant terms" in the TSV but I'll first need to figure out how to trace up to BFO:0000002.
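For the "trace up to BFO:0000002" part, the basic idea would be a walk up the is_a edges from the input term. A minimal sketch, assuming the edges have already been loaded into a child-to-parents dict; the `EX:` terms and the `PARENTS` table are toy data, and `is_continuant` is a hypothetical helper, not something in the repo.

```python
from collections import deque

# Toy is_a edges, child -> parents. Real edges would come from the ontology files.
PARENTS = {
    "EX:gene_product": ["EX:material_entity"],
    "EX:material_entity": ["BFO:0000002"],  # continuant
    "EX:some_process": ["BFO:0000003"],  # occurrent
}


def is_continuant(term, parents=PARENTS, root="BFO:0000002"):
    """Breadth-first walk up the is_a edges; True if root is reachable from term."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current == root:
            return True
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


print(is_continuant("EX:gene_product"))
print(is_continuant("EX:some_process"))
```

The `seen` set keeps the walk from revisiting terms that have more than one path to the root.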

What about that occurs_in combo rule? Any combination of occurs_in(UBERON) and occurs_in(EMAPA) in the same extension is marked invalid. Should that remain?

ukemi commented

> What about that occurs_in combo rule? Any combination of occurs_in(UBERON) and occurs_in(EMAPA) in the same extension is marked invalid. Should that remain?

Yes. Those don't make sense to me. So let me look at them and try to see what the curator was trying to get at. I know that we have those coming from models made in Noctua, but I think that is a bug that needs to be fixed in the GPAD generation.

ukemi commented

There are also a bunch on the spreadsheet that I think should pass, but I'd like to look at them with you and @vanaukenk at our next meeting. They are the ones marked as 'OK?' in column 1.

Okay, I've looked through the WB annotations.

Some were already fixed, some I fixed just now, some I removed, and there are also a few that I thought should pass but didn't. We can discuss the latter on our next call, but they were mostly the 'adjacent to' relations with the value being a WBbt.

Thanks @dustine32 !

@ukemi OK here's your MGI file (from the 2019-10-30 upstream GPAD) with the has_input rules generalized according to PR #66:
https://docs.google.com/spreadsheets/d/1_JQq0wiLfuvim6Cr9-N0CmoW8mmvdZPOguhIXy7nk1o/edit#gid=0

Down to 264!

@ukemi Here's this week's MGI report from the 2019-11-05 upstream GPAD still using the updated rules from PR #66:
https://docs.google.com/spreadsheets/d/1ypubckq1ZulShZJiopTra6KpInXnqBIfTRg6IZaAhVU/edit#gid=0

Down to 186!

ukemi commented

See I'm working! There are still a few others that I need to fix, then we need to examine the ones I have tagged as OK.

@ukemi Here's the link to this week's MGI report using the 2019-11-12 upstream GPAD:
https://docs.google.com/spreadsheets/d/1o9_e4DDAXNJ57gNL0Jd9e_t_drrnvQDOPDpyjhq4__0/edit#gid=0

Note this was run w/o the validation rule changes in #68 since they haven't yet been merged to master. For the version of the MGI report run with those changes, see the link in #68.

@ukemi Sorry, I missed last Wednesday's report run for the MGI file. Here it is run today on the 2019-11-29 upstream GPAD: https://docs.google.com/spreadsheets/d/1cHyzVmw6svV8HNDVZgHnaDLQ2fca5Qhy5LanLL4kWwQ/edit?folder=1sQAeWJDEk-joMCIvuoUt5A1ERahbjlv4#gid=0

Only 6 (s-i-x) lines!

@vanaukenk Since the November GO release just went out I also ran the wb.gpad report:
https://docs.google.com/spreadsheets/d/141tqbVWTgDKiX-qKJgppB4EiuOptjrgYkZHYQfQT29k/edit?folder=1sQAeWJDEk-joMCIvuoUt5A1ERahbjlv4#gid=0

Note that the version of the rules TSV used for these reports hasn't yet been merged into master. The rule changes are detailed in issue #73.

ukemi commented

Wow @dustine32, this is great! The first one is just one that I missed. The next five all should be added as valid.

'stem cell population maintenance' (GO:0019827) should allow acts_on_population_of
and
'regulation of stem cell population maintenance' (GO:2000036) should allow regulates_o_acts_on_population_of.

Cool, thanks @ukemi! I added these rules and regenerated the report for this week under #75.

@ukemi Here's a brand new bad_extensions report using the upstream MGI GPAD from 2020-01-08:
https://docs.google.com/spreadsheets/d/1AaNq_4OUtooYhCq8Fpwqih2DjqRFU5alUth4V7PdJXo/edit#gid=0

This one has three lines so lemme know if I missed adding any new rules to the TSV.

ukemi commented

Hi @dustine32 and @vanaukenk . These three are from recent annotations. Line number 3 was a simple curation error where the curator picked an incorrect relation. Lines 2 and 4 are more interesting. In line 2 the curator wanted to express that the DNA methylation influenced negative regulation of gene expression and qualified it with 'causally upstream of or within'. Should this be valid? I notice that this is also co-annotated. In line 4 the curator wanted to say that the metabolism occurred in the lung of a female. Since EMAPA doesn't have a female organism, the curator used the UBERON term. This is an example of a curator trying to nest information in a format that doesn't allow for nesting. It could easily be expressed in GO-CAM as 'sphingolipid metabolic process' occurs_in some (lung part_of some 'female organism'). @vanaukenk, I am tempted to just remove these annotations from the MGI interface and redo them as production models. What do you think?

ukemi commented

OK @dustine32. After conferring with @vanaukenk, we decided that I should just go and make the valid GO-CAMs for the other two annotations and delete them from the MGI editorial interface. At the next MGI release, everything should be clean again.

@ukemi Works for me!

@ukemi Super sorry about how late this is but I'm now back from vacation! You're correct, the latest upstream MGI GPAD dated 2020-01-28 doesn't have any extension violations according to the bad_extensions report that I just ran. Congrats!

ukemi commented

Wooo hoooo! Onward. Hope you had a great vacation.