GrammaticalFramework/GF

Difficulties understanding Pattern Matching

Closed this issue · 11 comments

I understand Pattern Matching in GF is working in a non-greedy way instead of the usual greedy matching which is the default in POSIX regular expressions.

Now I have difficulties writing a greedy pattern matcher. For instance, if I have

  mkTüttö : Str -> Noun = \tüttö -> 
    case tüttö of {
      tüt + "t" + ö => mkTüttöConcrete tüt ö ; 
      _ => Predef.error "Unsuitable lemma for mkTüttö"
    } ;

It will match tüt with the empty string and ö will match üttö. What I want is that the constant "t" in the pattern will match the last t in tüttö (i.e greedy matching).

The GF book and slides only mention pattern composition with the + and | operators but doesn't tell how repetition can be expressed (except for the non-greedy wildcard _).

What would be the correct way to express a greedy regular expression such as POSIX .* or [a-z]* with the GF Pattern Matching mechanism?

You can use ? to match one character. Then you can bind the variable tüt into something that is non-empty:

 oper
  mkTüttö : Str -> Str = \tüttö -> 
    case tüttö of {
      tüt@(? + _)+ "t" + ö => tüt + ö ;
      _ => Predef.error "Unsuitable lemma for mkTüttö"
    } ;

Works for me at least when testing with mkTüttö "tüttö".

This gets over the first "t" in the string but it still matches the wrong "t" in the stem. What I want is to match and leave out the last t in the stem.

@Kaljurand have you had to solve this problem in the Estonian morphology?

Right, I didn't think of that :) If you know that the last syllable is just a CV pattern, then you can just bind one character to the last ö, like this:

oper
  mkTüttö : Str -> Str = \tüttö -> 
    case tüttö of {
      tüt + "t" + ö@? => "<" + tüt +">" + ö ;
      _ => Predef.error "Unsuitable lemma for mkTüttö"
    } ;

And testing:

> cc mkTüttö "tüttö"
"<tüt>ö"

If the last syllable can be longer than CV, then maybe you can just duplicate the line, like tüt + "t" + ö@(?+?).

I didn't encounter this when implementing the Estonian noun form generation (https://github.com/GrammaticalFramework/GF/blob/59d47dd75d2d749cbd9ca2d106a0a08293da3da6/lib/src/estonian/HjkEst.gf). It seems that one can usually find a solution using "?", "Predef.tk", "Predef.dp", etc.

Btw, there is also the repetition pattern p* (Book, page 283). Have you tried this?

Thank you Kaljurand, I now found that the book states it very clearly, that

Variables in regular expression patterns are always bound to the first match, which is the first in the sequence of binding lists. For example:
• x + "e" + y matches "peter" with x = "p", y = "ter"
• x + "er"* matches "burgerer" with x = "burg"

It doesn't seem to be overcomeable in any easy way. Thank you for all the help, I will re-open this issue when I have solved it, e.g found a hack around it.

Can you give an example where you specifically need greedy matching and cannot just match the end of the word? If the tüttö paradigm is anything like the Finnish corresponding paradigm, the last syllable shouldn't be anything but a single vowel, so the example from my second post should work in that case. Of course, some other paradigm may have different kinds of syllables as the last one (like vastus and naine in Estonian), but even then, you shouldn't need more than 2 or 3 variants to match out of CV, CVV, CVC, CVVC and CVCC. Here's a dummy example assuming that tüttö paradigm really accepted all these options:

v : pattern Str = #("a" | "e" | "i" | "o" | "u" | "õ" | "ä" | "ö" | "ü") ;

mkTüttö : (nom : Str) -> (gen : Str) = \tüttö -> 
    case tüttö of {
      tüt + "t" + ö@(#v) => tüt + ö ; -- CV syllable (or tV really)
      tüt + "t" + ö@(#v+?) => tüt + ö ; -- CVV or CVC syllable
      tüt + "t" + ö@(#v+#v+?) =>  tüt + ö ; -- CVVC syllable
      _ => Predef.error "Unsuitable lemma for mkTüttö"
    } ;

The way you do it is the correct way: by binding the end of the word to a pattern you effectively make the first variable parts greedy.

Since the paradigms I operate on is mechanically derived by the paradigm extract tool, I will need to induce the end-of-word-patterns from the data. I put stress on the above word mechanical in contrast to linguistical, but this should also make it feasible to simply mechanically convert the data into these end-of-word patterns. Thus the mechanical paradigm named "tüttö" need not be at all like our linguistic intuition tells us, but probably is and that makes it interesting for me. Thank you for all the help.

The behaviour of pattern matching with regular expressions in GF is explained in a section in reference manual. In particular it gives the following examples:

  • x + "e" + y matches "peter" with x = "p", y = "ter"
  • x + "er"* matches "burgerer" with x = "burg"

so the "non-greedy" behaviour is by design.

The right way to match things like "the last t", is to be explicit about it, e.g. like Inari suggests. In general, the pattern for what can come after "the last t" should be a pattern that matches strings that don't contain a "t". You could use a combination of negation and repetition to express this: the pattern (-"t")* matches any string that does not contain a "t". Adding this to your example

    case tüttö of {
      tüt + "t" + ö@((-"t")*) => ...
      ...
    }

would eliminate the ambiguity and give you the "greedy" match.

This is getting very strange, even the general pattern provided by @Thomas-H (thank you!) still produces the same matching.

I put preliminary testing code here and here's my GF shell output:

> i -retain ResVot.gf
- compiling ResVot.gf... /home/kristian/Projektid/Grammatical Framework/mini-vot/ResVot.gf:
    Warning: default encoding has changed from Latin-1 to UTF-8
  write file ResVot.gfo
16 msec
> cc -table -unqual mkTüttö "tüttö"
s . NF Sg nominative => -T-üttö
s . NF Sg genitive => -üttö
s . NF Pl nominative => -üttö-D
s . NF Pl genitive => -T-üttö-I

The dashes in the output basically means borders between variables and constants in the pattern. So the s . NF Sg genitive => -üttö shows that tüt matches the empty string and ö matches everything after the first constant "t".

Sorry, the solution I suggested doesn't work! While the pattern "t" matches a single character, the negation -"t" maches any other string, not just any other character, e.g. -"t" would match "tt" or "tö", so it doesn't actually exclude strings containing t...

But there is another solution: _+"t"+_ matches all strings containing a t. The negation -(_+"t"+_) matches all other strings, e.g. those that don't contain a t. So

    case tüttö of {
      tüt + "t" + ö@(-(_+"t"+_)) => ...
      ...
    }

should give you what you want! Hope I got it right this time :)

Now it works flawless. Thank you all for the discussion. I might write a short tutorial or alike about this topic, but I guess I'll wait untill its settled where this kind of things should be organized (wiki?).