INCATools/dead_simple_owl_design_patterns

Support for regex transformations

dosumis opened this issue · 4 comments

If a major use of DOSDPs is to be the complete generation of ontology branches, we need ways to munge strings with regexes for label (and synonym?) generation.

Proposal: a regex substitution field that allows transformation of string vars.

definitions:
  regex_sub:
    type: object
    additionalProperties: False
    required: [in, out, match, sub]
    properties:
      in:
        type: string
        description: name of input var
      out:
        type: string
        description: >
          Name of output var.  If input var specified an OWL entity then
          readable identifier is used as input to substitution
      match:
        type: string
        description: perl style regex match
      sub:
        type: string
        description: perl style regex sub.  May include backreferences.

properties:
  substitutions:
    type: array
    items: { $ref: '#/definitions/regex_sub' }
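As a rough illustration of what this schema would accept, here is a hand-rolled Python check that mirrors its constraints. The function name `check_substitutions` is hypothetical, and this is not a JSON Schema validator. The regex compile step is an extra check that the schema itself does not express.

```python
import re

# Field names taken from the proposed regex_sub definition.
REQUIRED = {"in", "out", "match", "sub"}

def check_substitutions(substitutions):
    """Return a list of problems; an empty list means the input looks valid."""
    if not isinstance(substitutions, list):
        return ["substitutions must be an array"]
    problems = []
    for i, item in enumerate(substitutions):
        if not isinstance(item, dict):
            problems.append(f"item {i}: must be an object")
            continue
        keys = set(item)
        for missing in sorted(REQUIRED - keys):
            problems.append(f"item {i}: missing required field '{missing}'")
        # additionalProperties: False means unknown fields are rejected.
        for extra in sorted(keys - REQUIRED, key=str):
            problems.append(f"item {i}: unexpected field '{extra}'")
        for key in sorted(keys & REQUIRED):
            if not isinstance(item[key], str):
                problems.append(f"item {i}: '{key}' must be a string")
        # Extra check (not part of the schema): does 'match' compile?
        if isinstance(item.get("match"), str):
            try:
                re.compile(item["match"])
            except re.error as err:
                problems.append(f"item {i}: 'match' is not a valid regex ({err})")
    return problems

good = [{"in": "regulated_activity", "out": "regulated_activity_munged",
         "match": "(.+) activity", "sub": r"\1"}]
print(check_substitutions(good))  # []
```

Note that this compiles `match` with Python's `re` dialect, which is only an approximation of "perl style".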

Example

pattern_name: activator_activity

#...

vars:
  regulated_activity: "'catalytic activity'"

substitutions:
  - in: regulated_activity
    out: regulated_activity_munged
    match: "(.+) activity"
    sub: '\1'  # Don't need to escape backslash if using single quotes in YAML.

#...

name: 
  text: "%s activator activity"
  vars:
    - regulated_activity_munged

E.g. with regulated_activity: 'kinase activity', this pattern generates the label 'kinase activator activity' (not 'kinase activity activator activity').
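A minimal Python sketch of how the example might be evaluated. It assumes the input var has already been resolved to its readable label. The helper name `apply_substitution` is hypothetical, and Python's `re` stands in for whatever Perl-style engine an implementation actually uses.

```python
import re

def apply_substitution(bindings, spec):
    """Return a copy of bindings with spec['out'] bound to the munged input value."""
    value = bindings[spec["in"]]
    # re.sub expands \1-style backreferences in the replacement string.
    munged = re.sub(spec["match"], spec["sub"], value)
    result = dict(bindings)
    result[spec["out"]] = munged
    return result

spec = {
    "in": "regulated_activity",
    "out": "regulated_activity_munged",
    "match": "(.+) activity",
    "sub": r"\1",
}

bindings = apply_substitution({"regulated_activity": "kinase activity"}, spec)
label = "%s activator activity" % bindings["regulated_activity_munged"]
print(label)  # kinase activator activity
```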

Possible issues:

  • I favor Perl-style regex over POSIX, but as far as I can tell there is no formal standard for it. There is a risk of ambiguity or failure due to differences between Perl-style regex dialects, e.g. in whether quantifiers are lazy or greedy by default.

  • What to do if regex match fails.
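To make the greedy vs. lazy concern concrete, here is a small Python illustration. In Python's `re` module quantifiers are greedy unless followed by `?`. The input label is made up for the example, and other engines would need to be checked separately.

```python
import re

# Hypothetical label containing " activity" twice.
text = "protein kinase activity regulator activity"

# Greedy: (.+) grabs as much as possible before the final " activity".
greedy = re.match(r"(.+) activity", text).group(1)

# Lazy: (.+?) stops at the first " activity".
lazy = re.match(r"(.+?) activity", text).group(1)

print(greedy)  # protein kinase activity regulator
print(lazy)    # protein kinase
```

The same `match` string can therefore yield different labels if an engine disagrees on the default, so the spec may want to state the expected behaviour explicitly.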

CC @cmungall @balhoff @DoctorBud - comments most welcome.

If regex match fails, maybe default to just using the complete input value?
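A sketch of that fallback in Python; the function name `substitute_or_passthrough` is hypothetical. Python's `re.sub` already returns the input unchanged when nothing matches, but an explicit check makes the policy visible and leaves room for a warning.

```python
import re

def substitute_or_passthrough(value, match, sub):
    """Apply the substitution, or return value unchanged if match does not apply."""
    pattern = re.compile(match)
    if pattern.search(value) is None:
        # No match: fall back to the complete input value.
        return value
    return pattern.sub(sub, value)

print(substitute_or_passthrough("kinase activity", "(.+) activity", r"\1"))  # kinase
print(substitute_or_passthrough("transporter", "(.+) activity", r"\1"))      # transporter
```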

Can implementing code assume that the only thing special to handle in the sub text value is the presence of backslash-prefixed group numbers?

> Can implementing code assume that the only thing special to handle in the sub text value is the presence of backslash-prefixed group numbers?

Yep.
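One way an implementation could honour that answer, sketched in Python under the assumption that everything in `sub` other than a backslash followed by digits is literal text. The names `expand_sub` and `regex_sub` are hypothetical. A function replacement is used because Python's own replacement-string syntax treats further escapes (such as `\g<1>`) as special.

```python
import re

# Only a backslash followed by digits counts as a group reference.
GROUP_REF = re.compile(r"\\(\d+)")

def expand_sub(sub, match):
    """Expand \\N references in sub from match; all other characters are literal."""
    return GROUP_REF.sub(
        lambda ref: match.group(int(ref.group(1))) or "",  # unmatched group -> ""
        sub,
    )

def regex_sub(value, pattern, sub):
    # A function replacement bypasses re's own escape processing of sub.
    return re.sub(pattern, lambda m: expand_sub(sub, m), value)

print(regex_sub("kinase activity", "(.+) activity", r"\1"))  # kinase
print(regex_sub("a b", r"(\w) (\w)", r"\2 \1"))              # b a
```

A reference to a group number that does not exist in `match` raises `IndexError` here; whether that should be an error or pass through is left open.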