apertium/apertium-recursive

Rule only works when commenting out unrelated rules?

unhammer opened this issue · 12 comments

gender = m f nt ut un fn mf GD ;
gender_adj_sg_ind = nt ut ;
number = sg pl sp ND ;
defnes = def ind ;
a_adj = sint ord pp pprs ;
a_cmp = cmp ;
a_det = dem qnt pos emph ;
a_comp = pst comp sup ;

adj:   _.a_adj.a_comp.gender.number.defnes.a_cmp;
n:     _.gender.number.defnes.a_cmp;
det:   _.a_det.gender.number;

N:     _.gender.number.defnes.a_cmp;
A:     _.a_adj.a_comp.gender.number.defnes.a_cmp;
NP:    _.gender.number.defnes;
DP:    _.gender.number.defnes;


N -> %n         { %1 } ;

NP ->      %N { %1 }
    |  adj %N { 1 _ %2 } !!!
    ;

DP ->
      "vennene mine ~> mina vänner"
      %NP det.pos
      { 2[gender=(if (1.number = pl) un else 1.gender), number=1.number]
        _
        1[defnes=ind]
      }
    | "en venn ~> en vänn" det %NP { 1[gender=(if (2.number = pl) un else 2.gender), number=2.number] _ 2 } !!!

      ;

got:

$ echo ' ^venn<n><m><pl><def>/vän<n><ut><pl><def>$ ^min<det><pos><un><pl>/min<det><pos><un><pl>$ ^virtuell<adj><pst><nt><sg><ind>/virtuell<adj><sint><pst><nt><sg><ind>$' |rtx-proc nor-swe.rtx.bin
 ^vän<n><ut><pl><def>$ ^min<det><pos><un><pl>$ ^virtuell<adj><sint><pst><nt><sg><ind>$

expected:

 ^min<det><pos><un><pl>$ ^vän<n><ut><pl><ind>$ ^virtuell<adj><sint><pst><nt><sg><ind>$

However, if I comment out either line 23 or line 33 (the ones marked !!!), it strangely works.

But the trace shows that those lines are not used (this is without commenting them out, which is where I get the bad result):

$ echo ' ^venn<n><m><pl><def>/vän<n><ut><pl><def>$ ^min<det><pos><un><pl>/min<det><pos><un><pl>$ ^virtuell<adj><pst><nt><sg><ind>/virtuell<adj><sint><pst><nt><sg><ind>$' |rtx-proc -r nor-swe.rtx.bin

Applying rule 1 (line 20): ^venn<n><m><pl><def>/vän<n><ut><pl><def>$

Applying rule 2 (line 22): ^vän<N><ut><pl><def>{^venn<n><m><pl><def>/vän<n><ut><pl><def>$}$

Applying rule 4 (vennene mine ~> mina vänner - line 27): ^vän<NP><ut><pl><def>{^vän<N><ut><pl><def>{^venn<n><m><pl><def>/vän<n><ut><pl><def>$}$}$ ^min<det><pos><un><pl>/min<det><pos><un><pl>$

Applying output rule 1 (line 22): vän<NP><ut><pl><def> -> ^vän<N><ut><pl><def>{^venn<n><m><pl><def>/vän<n><ut><pl><def>$}$

Applying output rule 0 (line 20): vän<N><ut><pl><def> -> ^venn<n><m><pl><def>/vän<n><ut><pl><def>$

No rule specified: ^vän<n><ut><pl><def>$
^vän<n><ut><pl><def>$
No rule specified: ^min<det><pos><un><pl>/min<det><pos><un><pl>$
^min<det><pos><un><pl>$
No rule specified: ^virtuell<adj><pst><nt><sg><ind>/virtuell<adj><sint><pst><nt><sg><ind>$
^virtuell<adj><sint><pst><nt><sg><ind>$

I'm probably missing something obvious but I can't see it?

With line 33 commented out, the trace shows not just "Applying rule 3 (line 27)" but also "Applying output rule 3 (line 27)".

Note also that if I leave out the last word, the rule applies fine.

So the lookahead is trying to figure out whether to keep branches alive in case more rules might apply. You have n det adj, which it thinks could be n DP{ det NP{ adj [n] } }, not realizing that this is actually det.pos, which it looks like you want treated differently.

So the solution is probably for the lookahead to get smarter and for the last rule to change from det to det.[notpos], for a suitable definition of notpos.

The tricky part of this is whether I can fully do that without implementing FST subtraction in lttoolbox (or maybe I should just go ahead and do that...).

So if I understand correctly, it's starting an analysis of n DP{ det NP{ adj [n] } } because there might be an n to the right. But the trace shows that it did find the right match at one point; wouldn't it be more robust to backtrack to that?

Also, I can't change the last rule to det.[notpos], because I do want it to match det.pos (in nob, mine venner and vennene mine are both possible, while in swe we want only the former).

My current workaround is to have a higher-level rewrite rule DP2 → DP Anyword, but it doesn't really make linguistic sense.

IRC:

[10:13:28] <popcorndude> the answer is that this actually is an annoyingly deep issue
[10:14:13] <popcorndude> at least in the reduced case, it reads in the adj
[10:14:50] <popcorndude> and then says DP{NP{N{n}} det} can't do anything with this, but NP{N{n}} det maybe can
[10:14:54] <popcorndude> so discard the first one
[10:14:58] <popcorndude> oh, oops, EOF
[10:17:44] <popcorndude> so I can write hacky rules to fix this in particular cases, but I have no idea how to solve this in general
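
To make the IRC explanation a bit more concrete, here is a toy sketch in Python of the failure mode as described there. It is not rtx-proc's real algorithm or data structures: the CAN_CONTINUE table, the filter_branches function and the list-of-labels representation of a branch are all made up for illustration.

```python
# Toy model only: NOT rtx-proc's real algorithm or data structures.
# A branch is the list of top-level node labels it has built so far.

# Hypothetical table of (last node, next tag) pairs that could still grow
# into a larger rule match, loosely based on the grammar in this issue.
CAN_CONTINUE = {
    ("det", "adj"),  # det adj ... might become DP{det NP{adj N}}
    ("adj", "n"),    # adj n might become NP{adj N}
}

def filter_branches(branches, next_tag):
    """Keep only the branches whose last node might combine with next_tag.

    If no branch can use next_tag, keep all of them.
    """
    alive = [b for b in branches if (b[-1], next_tag) in CAN_CONTINUE]
    return alive or branches

finished = ["DP"]        # DP{NP{N{n}} det}: the parse we actually want
pending = ["NP", "det"]  # NP{N{n}} det: the DP rule has not been applied

# Reading the adj: only the pending branch "maybe can" do something with it,
# so the finished DP branch is discarded.
survivors = filter_branches([finished, pending], "adj")
print(survivors)
```

In this toy model the finished DP branch is already gone when the input ends, and the n that the surviving branch was waiting for never arrives.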

Is there a way to show some information in the trace when this happens? It's quite hard to debug. E.g. I have rules that do

DP{NP{N{n.cmp n}} det}  →*   DP{det NP{N{n.cmp n}}}   ! vennene mine → mina vänner

and they work fine and then I add vcmp into the N rule so I can do

DP{NP{N{vblex.inf.cmp n}} det}  →*   DP{det NP{N{vblex.inf.cmp n}}} ! bakemesteren vår → vår bakmästare

and it works fine, but then I notice that the first rule stops working in certain contexts :(

Turns out, if there's any verb in the rest of the sentence (doesn't have to be tagged cmp), the rule doesn't apply any more. Again, the fix is just to ensure the wider context has a parse (a rule like S→DP VP), but I only learnt that by accident, and I had almost forgotten the fix when the problem showed up again.

Information about which parses are getting discarded and why is available from the -e debug option, though it prints out rather a lot of stuff and I don't guarantee that it all makes sense.

We're seeing this issue again in sme-smj, e.g. we have rules for
N→n
NP→NP N | N
PP→N p | p
and on seeing a sequence n n p, it gives a parse for the final two words, but doesn't then apply anything for the first word (I think. I'm not 100% sure about the details here). But the first noun does get a parse if I send it in alone.

Would it be possible to do a final pass after everything is done and just treat all the unmatched lexical units in isolation, so they're at least matched by some single-word rule?

With sme-smj.rtx.zip:

$ echo '^Jämtlánda<np><top><sg><gen><@→N>/Jämtlánnda<np><top><sg><gen><@→N>$ ^regiovdna<n><sem_plc><sg><gen><@→P>/regiåvnnå<n><sem_plc><sg><gen><@→P>$ ^dáfus<post><@ADVL>/gáktuj<post><@ADVL>$^.<sent>/.<sent>$' | rtx-proc -e sme-smj.rtx.bin
[…]
Branch 3: 3 nodes, weight = 0
[Chunk]:
^Jämtlánnda<Name><sg><gen><@→N>{
        ^Jämtlánda<np><top><sg><gen><@→N>/Jämtlánnda<np><top><sg><gen><@→N>$
}$
[Blank]:

[Chunk]:
^gáktuj<PP>{
        ^regiåvnnå<N><sg><gen><@→P>{
                ^regiovdna<n><sem_plc><sg><gen><@→P>/regiåvnnå<n><sem_plc><sg><gen><@→P>$
        }$
        ^dáfus<post><@ADVL>/gáktuj<post><@ADVL>$
}$
Branch 4: 3 nodes, weight = 0
[Chunk]:
^Jämtlánda<np><top><sg><gen><@→N>/Jämtlánnda<np><top><sg><gen><@→N>$
[Blank]:

[Chunk]:
^gáktuj<PP>{
        ^regiåvnnå<N><sg><gen><@→P>{
                ^regiovdna<n><sem_plc><sg><gen><@→P>/regiåvnnå<n><sem_plc><sg><gen><@→P>$
        }$
        ^dáfus<post><@ADVL>/gáktuj<post><@ADVL>$
}$

Filtering Branches:
No branch can accept further input.
Branch 3  has no active branch to compare to.
Branch 4  has fewer partial parses or a higher weight than branch 3.
[…]

Isn't this plain wrong? Or am I misunderstanding what "partial parses" means? (In branch 3, every word has at least one parent, while in branch 4, which is the one chosen, the first word has no parent node.)

EDIT: It seems the test is (cur->length < minNode->length || (cur->length == minNode->length && cur->weight >= minNode->weight))
and the values are

cur->length:3
minNode->length:3
cur->weight:0
minNode->weight:0

so they're just equal.

Yeah, I think it's >= since the branches later in the list have usually had more rules applied to them.

So I noticed that simply changing the file to have weights on each rule made it choose the branch in which more words get a parse, and when doing that across a real rule file for sme-smj, it removes some untranslated words from corpus runs.

Is there a good reason not to have some "initial" weight for every rule, so it can favour parses that cover more words? (Will it then favour deeper trees as well?)

Yes, it will slightly favor deeper trees, but given how reduce-reduce conflicts are handled, those are favored already.

Perhaps we could add another file-level directive to change the default weight to something positive, since that will indeed improve the situation in many cases.
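
For reference, a minimal Python sketch of the tie-break discussed above. Only the boolean test is taken from this thread. The rest is assumed for illustration: the Branch class, the pick loop (which assumes that a true test means the current branch replaces the preferred one, as the -e output suggests), a branch weight that is the sum of the weights of its applied rules, and a weight of 1 per rule.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    name: str
    length: int    # number of top-level nodes ("3 nodes" in the -e output)
    weight: float  # ASSUMED: sum of the weights of the rules applied

def replaces(cur, best):
    """Python transcription of the test quoted in this thread."""
    return cur.length < best.length or (
        cur.length == best.length and cur.weight >= best.weight
    )

def pick(branches):
    """ASSUMED selection loop: a later branch wins whenever the test is true."""
    best = branches[0]
    for cur in branches[1:]:
        if replaces(cur, best):
            best = cur
    return best

# No weights in the rule file: both branches have 3 nodes and weight 0,
# so the >= lets the later branch (4) replace branch 3.
unweighted = [Branch("branch 3", 3, 0.0), Branch("branch 4", 3, 0.0)]

# ASSUMED weight of 1 per rule: branch 3 applied three rules (Name, N, PP),
# branch 4 only two (N, PP), so branch 4 no longer wins the tie.
weighted = [Branch("branch 3", 3, 3.0), Branch("branch 4", 3, 2.0)]

print(pick(unweighted).name)  # branch 4
print(pick(weighted).name)    # branch 3
```

Under these assumptions, any positive default weight is enough to break the tie in favour of the branch that applied more rules, which matches what I saw after adding weights by hand.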