geneontology/go-site

gorule-0000057 filter lines by provided_by

Closed this issue · 12 comments

For the MOD imports project, one requirement is that we filter the MOD GPAD to keep only lines where the Provided_by column (aka Assigned_by) equals the MOD. That is, keep only Provided_by=MGI lines in mgi.gpad and only Provided_by=WB lines in wb.gpad; Provided_by=UniProt lines would be filtered out.

GOOD: MGI     MGI:1920971     enables GO:0043014      MGI:MGI:3794006|PMID:18163442   ECO:0000314                     20081211        MGI
BAD: MGI     MGI:1920971     part_of GO:0002177      MGI:MGI:3794006|PMID:18163442   ECO:0000314                     20120221        UniProt

We can handle this by extending the existing filter_out pattern in the mgi.yaml and wb.yaml dataset files with a separate filter_for or filter_in (maybe required_attributes?) section:

filter_for:
  provided_by:
    - MGI

Accepting a list of provided_by values will allow flexibility if we later want to start importing some other non-MOD-source lines like UniProt. I'll update the datasets.schema.yaml, mgi.yaml and wb.yaml files in a test branch.
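To make the intended behavior concrete, here is a minimal sketch of what a filter_for on provided_by could do to the two example lines above. This is not the ontobio or pipeline implementation: the function name `filter_for_provided_by` is hypothetical, and it assumes the GPAD 1.x layout where Assigned_by (Provided_by) is the 10th tab-separated column.

```python
from typing import Iterable, Iterator, Set

# Assumption: GPAD 1.x layout, where Assigned_by (Provided_by) is the
# 10th tab-separated column, i.e. index 9.
PROVIDED_BY_COLUMN = 9


def filter_for_provided_by(lines: Iterable[str], allowed: Set[str]) -> Iterator[str]:
    """Yield header lines and annotation lines whose provider is in `allowed`."""
    for line in lines:
        # Header/comment lines start with "!" and are passed through untouched.
        if line.startswith("!"):
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        # Lines too short to have a provider column are dropped.
        if len(columns) > PROVIDED_BY_COLUMN and columns[PROVIDED_BY_COLUMN] in allowed:
            yield line


# The GOOD and BAD example lines from this ticket, with explicit tabs.
good = "MGI\tMGI:1920971\tenables\tGO:0043014\tMGI:MGI:3794006|PMID:18163442\tECO:0000314\t\t\t20081211\tMGI\t\t"
bad = "MGI\tMGI:1920971\tpart_of\tGO:0002177\tMGI:MGI:3794006|PMID:18163442\tECO:0000314\t\t\t20120221\tUniProt\t\t"

# Equivalent of "filter_for: provided_by: [MGI]" in mgi.yaml.
kept = list(filter_for_provided_by(["!gpa-version: 1.1", good, bad], {"MGI"}))
```

Passing a set such as `{"MGI", "UniProt"}` would keep both lines, which is the flexibility the list form in the YAML is meant to allow.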

Tagging @dougli1sqrd @kltm

ukemi commented

This looks correct. An alternative strategy would be to have MGI filter the file for only the annotations that will be imported, but it seems that having this general ability on the GOC end of things would be useful.

@ukemi Yeah, upstream filtering would be a sure way of handling this. I just automatically started porting over the pre-existing filtering functionality from gocamgen but we can use it as needed.

I agree with the upstream filtering. Otherwise we would need to restrict how Rule57 is applied, which seems to just move the problem elsewhere.

Hi @dustine32, to make it easier (or even just possible) to find any references to a rule, we've been rigorous about the format in tickets. Please use gorule-nnnnnnn.

Thanks, Pascale

ukemi commented

OK. @dustine32, we will create a GPAD2.0 file that only contains the annotations made by MGI curators using the MGI editorial interface. We will put it out on the test site for you.

Thanks @pgaudet for correcting the title!

@ukemi Yes, having the file already filtered at its upstream location would definitely do the job. I can start translating that GPAD2.0 file once we get the ontobio GPAD parser to consume 2.0 in a short while.

Hi @dustine32

What is the status of this? I suppose some version of this is done in the pipeline, but it is not documented here: https://github.com/geneontology/go-site/blob/master/metadata/rules/gorule-0000057.md

Noting that this rule is being reported in the reports: http://snapshot.geneontology.org/reports/assigned-by-gorule-report.html. Based on Dustin's answer below: should we suppress this? Or is this relevant for the production code, in which case we should include tests?

Other AI: clarify the formulation of the rule, mention 'filter_out' in the datasets.yaml files, and change status to implemented.

Hi @dustine32

Is this specifically applying to imports, and how is this triggered?

Thanks, Pascale

@pgaudet Yep, this was proposed for the imports project but not needed. I'll close but feel free to reopen.

Thanks, we'll just make sure to remove it from the reports (not sure why it's even coming up).

kltm commented

There is a reports filter list (variable in the pipeline), if something needs to be disappeared.