geneontology/gocamgen

Pipe-separated annotation extensions should result in separate GO-CAM annotations

Opened this issue · 11 comments

We have this in the Google doc but just to note an example here for testing purposes:

The inputs on the mec-3 contributes_to annotation to "RNA polymerase II regulatory region, sequence-specific DNA binding" are pipe-separated and should be split out into separate annotations.

For @ukemi cuz I found this example in the MGI file:

MGI     MGI:2159711     part_of GO:0044297      MGI:MGI:4361056|PMID:19684588   ECO:0000314                     20111103        MGI     part_of(EMAPA:16525),part_of(CL:0000678)|part_of(EMAPA:16525),part_of(CL:0000678)

It looks like the pipe-separated values are duplicates. Should I take the liberty of consolidating these dupes into one extension or should I leave them alone and emit them separately in the model?

Also, minor: when counting the occurrences of a pattern for reporting (like in our pattern spreadsheet) would I count this example as one or two occurrences of part_of(EMAPA),part_of(CL)?

ukemi commented

Hi @dustine32,
I have found a few of these too. It looks like the curator cut and pasted the same info twice. You should consolidate exact duplicates. If the values were not duplicated, this would count as two occurrences, because it would be split into two annotations: one with part_of(EMAPA1),part_of(CL1) and the other with part_of(EMAPA2),part_of(CL2).
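
A minimal Python sketch of the rule described here: split the extension column on `|`, drop groups that are exact duplicates, and count one pattern occurrence per remaining group. The function names (`parse_extension_field`, `pattern_of`, `count_patterns`) are hypothetical and not taken from gocamgen, and "exact duplicate" is assumed to mean the same relations and fillers in the same order.

```python
import re
from collections import Counter

# One extension looks like relation(filler), e.g. part_of(EMAPA:16525).
EXT_RE = re.compile(r"^\s*([^()\s]+)\(([^()]+)\)\s*$")


def parse_extension_field(field):
    """Split a GPAD extension field into groups of (relation, filler) pairs.

    '|' separates groups (each becomes its own annotation) and ','
    separates the extensions inside one group. Exact duplicate groups
    are consolidated, keeping the first occurrence.
    """
    groups = []
    for chunk in field.split("|"):
        if not chunk.strip():
            continue
        pairs = []
        for item in chunk.split(","):
            match = EXT_RE.match(item)
            if match is None:
                raise ValueError(f"unparseable extension: {item!r}")
            pairs.append((match.group(1), match.group(2).strip()))
        group = tuple(pairs)
        if group not in groups:  # assumption: order-sensitive comparison
            groups.append(group)
    return groups


def pattern_of(group):
    """Reduce a group to its reporting pattern, e.g. part_of(EMAPA),part_of(CL)."""
    return ",".join(f"{rel}({filler.split(':')[0]})" for rel, filler in group)


def count_patterns(fields):
    """Count pattern occurrences after duplicates have been consolidated."""
    counts = Counter()
    for field in fields:
        for group in parse_extension_field(field):
            counts[pattern_of(group)] += 1
    return counts


mgi_field = (
    "part_of(EMAPA:16525),part_of(CL:0000678)"
    "|part_of(EMAPA:16525),part_of(CL:0000678)"
)
print(parse_extension_field(mgi_field))
# [(('part_of', 'EMAPA:16525'), ('part_of', 'CL:0000678'))]
print(count_patterns([mgi_field]))
# Counter({'part_of(EMAPA),part_of(CL)': 1})
```

With this approach the MGI example counts as one occurrence of part_of(EMAPA),part_of(CL), while a field with two different groups would count as two.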

ukemi commented

Note that this would be nested in a GO-CAM where the cell (CL) would be a part of the anatomical structure (EMAPA).
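
The nesting could be sketched roughly as below, in Python. This assumes the part_of fillers can be ordered from finest to coarsest by ID prefix (GO term, then CL cell type, then EMAPA anatomy); the `GRANULARITY` list and the `nest_part_of` function are hypothetical illustrations, not gocamgen code.

```python
# Assumed ordering of ID prefixes from finest to coarsest granularity.
GRANULARITY = ["GO", "CL", "EMAPA"]


def nest_part_of(go_term, extensions):
    """Chain part_of fillers so each entity is part_of the next coarser one.

    `extensions` is a list of (relation, filler) pairs from one extension
    group. A filler with a prefix missing from GRANULARITY raises ValueError.
    """
    fillers = [filler for rel, filler in extensions if rel == "part_of"]
    fillers.sort(key=lambda f: GRANULARITY.index(f.split(":")[0]))
    chain = [go_term] + fillers
    return [(chain[i], "part_of", chain[i + 1]) for i in range(len(chain) - 1)]


triples = nest_part_of(
    "GO:0044297",  # cell body
    [("part_of", "EMAPA:16525"), ("part_of", "CL:0000678")],
)
for triple in triples:
    print(triple)
# ('GO:0044297', 'part_of', 'CL:0000678')
# ('CL:0000678', 'part_of', 'EMAPA:16525')
```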

@ukemi Exactly what I needed to know. Thank you!

This is mostly ready to test on noctua-dev. The two aspects of this ticket:

  1. Splitting pipe-separated extensions into multiple annotations. This model for WB:WBGene00003167 shows that the "has input" extensions are now separated.
  2. Condensing duplicated extension values. Our example of this, MGI:MGI:2159711, still has two annotation individuals for Usp33-part of->cell body in noctua-dev,
    though I've fixed it in my local instance at USC.

I'll try getting one more push into noctua-dev before the meeting (would like to get a start of has_regulation_target in too), hopefully today or tomorrow.

ukemi commented

It looks like the consolidation in the model above is for two annotations that are exact duplicates. Both of the evidence statements are exactly the same. Didn't we decide we wanted to only count these once? Are there exact duplicate annotations in the GPAD file?

ukemi commented

When I look in our editorial interface, I see two annotations to cell body that are identical except for an additional note that will eventually be loaded into a text field. It represents two different developmental stages. The cell type and anatomy extensions are still missing from the GO-CAM model. It should indicate that the cell body is part of a commissural neuron that is part of the future spinal cord.

ukemi commented

It might be best to look at this together along with the GPAD file. This is an interesting twist.

@ukemi Ahh, that explains a lot! Checking the GPAD file used,

source_path: http://www.informatics.jax.org/downloads/reports/mgi.gpa.gz
header_date: 04/03/2019

I only see the one line:

$ grep MGI:2159711 mgi.gpa | grep GO:0044297
MGI	MGI:2159711	part_of	GO:0044297	MGI:MGI:4361056|PMID:19684588	ECO:0000314			20111103	MGI	part_of(EMAPA:16525),part_of(CL:0000678)|part_of(EMAPA:16525),part_of(CL:0000678)

And here I see the "duplicated" extensions and no notes. This is the MGI GPAD upstream of the GO pipeline so my guess is the GPAD export process from MGI is doing this. Maybe this situation will be handled in the one-off import file?
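
For checking other lines for the same problem, something like this Python sketch could flag extension groups repeated within a single GPAD line. It assumes the tab-separated GPAD 1.1 column order with the annotation extension in column 11; `GPAD_COLUMNS`, `parse_gpad_line`, and `duplicated_extension_groups` are hypothetical names.

```python
# Assumed GPAD 1.1 column order (column 11 holds the annotation extensions).
GPAD_COLUMNS = [
    "db", "db_object_id", "qualifier", "go_id", "references", "evidence",
    "with_from", "interacting_taxon", "date", "assigned_by",
    "extensions", "properties",
]


def parse_gpad_line(line):
    """Map one tab-separated GPAD line onto named columns, padding short lines."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GPAD_COLUMNS) - len(values))
    return dict(zip(GPAD_COLUMNS, values))


def duplicated_extension_groups(line):
    """Return the pipe-separated extension groups that occur more than once."""
    field = parse_gpad_line(line)["extensions"]
    groups = [g.strip() for g in field.split("|") if g.strip()]
    seen, dupes = set(), []
    for group in groups:
        if group in seen and group not in dupes:
            dupes.append(group)
        seen.add(group)
    return dupes


line = "\t".join([
    "MGI", "MGI:2159711", "part_of", "GO:0044297",
    "MGI:MGI:4361056|PMID:19684588", "ECO:0000314", "", "",
    "20111103", "MGI",
    "part_of(EMAPA:16525),part_of(CL:0000678)"
    "|part_of(EMAPA:16525),part_of(CL:0000678)",
])
print(duplicated_extension_groups(line))
# ['part_of(EMAPA:16525),part_of(CL:0000678)']
```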

For closing this ticket, here's an updated model WB:WBGene00003167 showing the with/from annotation split from a binding descendant GPAD line:
[image: updated WB:WBGene00003167 model]
From this GPAD line:

WB      WBGene00003167  contributes_to  GO:0000977      PMID:9735371|WB_REF:WBPaper00003265     ECO:0000314                     20140910        WB      has_direct_input(WB:WBGene00003168)|has_direct_input(WB:WBGene00003171)|has_direct_input(WB:WBGene00036254)
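
As a rough Python illustration of the split (not the actual gocamgen code), one GPAD line can be expanded into one annotation record per pipe-separated extension group, with every record sharing the same references and evidence. The `Annotation` class and `expand_annotations` function are hypothetical, and the column positions assume tab-separated GPAD 1.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    subject: str
    qualifier: str
    go_id: str
    references: tuple
    evidence: str
    extensions: tuple  # (relation, filler) pairs for this annotation only


def expand_annotations(line):
    """Return one Annotation per distinct pipe-separated extension group."""
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (11 - len(cols))  # pad lines with no extension column
    groups = []
    for chunk in cols[10].split("|"):
        chunk = chunk.strip()
        if chunk and chunk not in groups:  # consolidate exact duplicates
            groups.append(chunk)
    annotations = []
    for chunk in groups or [""]:  # keep one annotation when there are no extensions
        pairs = []
        for item in filter(None, (i.strip() for i in chunk.split(","))):
            relation, _, rest = item.partition("(")
            pairs.append((relation, rest.rstrip(")")))
        annotations.append(Annotation(
            subject=f"{cols[0]}:{cols[1]}",
            qualifier=cols[2],
            go_id=cols[3],
            references=tuple(cols[4].split("|")),
            evidence=cols[5],
            extensions=tuple(pairs),
        ))
    return annotations


line = "\t".join([
    "WB", "WBGene00003167", "contributes_to", "GO:0000977",
    "PMID:9735371|WB_REF:WBPaper00003265", "ECO:0000314", "", "",
    "20140910", "WB",
    "has_direct_input(WB:WBGene00003168)"
    "|has_direct_input(WB:WBGene00003171)"
    "|has_direct_input(WB:WBGene00036254)",
])
for annotation in expand_annotations(line):
    print(annotation.subject, annotation.go_id, annotation.extensions)
# WB:WBGene00003167 GO:0000977 (('has_direct_input', 'WB:WBGene00003168'),)
# WB:WBGene00003167 GO:0000977 (('has_direct_input', 'WB:WBGene00003171'),)
# WB:WBGene00003167 GO:0000977 (('has_direct_input', 'WB:WBGene00036254'),)
```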

@vanaukenk @ukemi Feel free to close if this looks good.

@vanaukenk @ukemi Actually, looking at those "contributes to" relations while transforming into ShExCop, I don't see any mention of "contributes to" in the ShEx spec at all. Are we still using this relation in the imports?