OOM in sme-smj, loops through same rules over and over again (not sure if it ever ends)
unhammer opened this issue · 3 comments
$ echo '– Lea stáhtahálddašeaddji rolla sihkkarastit ahte boazodoallit,
báikkálaš ja regionálalaš eiseválddit gulahallet ja lea maid
stáhtahálddašeaddji bargu oahpahit aktevrraide boazodoalo
areáladárbbu. Departemeanta lea 2021 vuosttaš jahkebeale gárveme
sierra bagadallama boazodoalu ja plána- ja huksenlobi birra mii galgá
nannet boazodolliid plána- ja huksenlobi gelbbolašvuođa ja mii galgá
nannet fylkkagielddaid ja gielddaid gelbbolašvuođa boazodoalus ja
boazodoallovuoigatvuođain, lohká Skogan.' | apertium -d . sme-smj_rtx
hangs.
Or with input-to-rtx.txt, since giella-smj doesn't have updated packages to build with:
$ cat input-to-rtx.txt | rtx-proc --anaphora sme-smj.rtx.bin
^–<punct>$ ^Liehket<vblex><indic><pres><p3><sg>$ ^stáhttaháldadiddje<n><nomag><sg><gen>$ ^roalla<n><sg><nom>$ ^sihkarasstet<vblex><inf>$ ^jut<cnjsub>$ ^ælloniehkke<n><pl><nom>$^,<cm>$
^bájkálasj<adj><attr>$ ^ja<cnjcoo>$ ^regiåvnålasj<adj><attr>$ ^oajválasj<n><pl><nom>$ ^guládallat<vblex><indic><pres><p3><pl>$ ^ja<cnjcoo>$ ^liehket<vblex><indic><pres><p3><sg>$ ^stáhttaháldadiddje<n><nomag><sg><nom>$
^aj<adv>$ ^barggo<n><sg><nom>$ ^åhpadit<vblex><supn>$ ^akterra<n><pl><ill>$ ^ællosujtto<n><sg><gen>$
^areálla<n><cmp_sgnom><cmp>+dárbbo<n><sg><acc>$
and then it hangs.
With --rules we see it go through the same rules over and over again.
(Could some sort of per-sentence memoisation / dynamic programming be useful?)
I had previously concluded that caching was impossible because of shared state (global variables, destructive updates, etc.), but now I think it might be possible for the compiler to flag which rules access or update that state, so that at runtime everything else can be cached once the input reaches some threshold.
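To make the caching idea concrete, here is a rough Python sketch. It is not how rtx-proc is implemented; `Rule`, `RuleRunner`, the `touches_shared_state` flag and the length-based `cache_threshold` are all hypothetical names invented for illustration. The assumption is that the compiler could mark each rule as either touching shared state or not, and that only unmarked rules get memoised, per sentence, once the input is long enough.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Nodes = Tuple[str, ...]


@dataclass(frozen=True)
class Rule:
    name: str
    apply: Callable[[Nodes], str]
    # Hypothetical flag a compiler could set when the rule reads or
    # writes shared state (variables, node insertion, <let>, ...).
    touches_shared_state: bool


class RuleRunner:
    def __init__(self, cache_threshold: int) -> None:
        # Assumption: caching only pays off for long inputs.
        self.cache_threshold = cache_threshold
        self.cache: Dict[Tuple[str, Nodes], str] = {}
        self.calls = 0  # how many times a rule body actually ran

    def new_sentence(self) -> None:
        # Memoisation is per sentence, so drop everything in between.
        self.cache.clear()

    def run(self, rule: Rule, nodes: Nodes, sentence_len: int) -> str:
        cacheable = (not rule.touches_shared_state
                     and sentence_len >= self.cache_threshold)
        key = (rule.name, nodes)
        if cacheable and key in self.cache:
            return self.cache[key]
        self.calls += 1
        result = rule.apply(nodes)
        if cacheable:
            self.cache[key] = result
        return result


# Two toy rules: one pure, one that mutates a shared counter.
counter = {"n": 0}


def bump(nodes: Nodes) -> str:
    counter["n"] += 1
    return f"{counter['n']}:" + " ".join(nodes)


pure = Rule("swap", lambda nodes: " ".join(reversed(nodes)), False)
stateful = Rule("count", bump, True)
```

With a runner created as `RuleRunner(cache_threshold=3)`, calling `run(pure, ("a", "b"), 5)` twice executes the rule body only once, while `stateful` is re-run on every call because its result depends on the counter.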
In the shorter term, does adding -F help at all? (Also I just realized that the long versions of -f and -F are identical, so I should fix that.)
Can you give examples of shared state? I'm not sure if we're using that or not.
EDIT: I see https://wiki.apertium.org/wiki/Apertium-recursive/Formalism#Global_Chunk_Variables is one such; pretty sure we're not using that at least. What's a destructive update?
But -F does help! Now that long sentence translates in half a second. I haven't checked tests yet for what effect it has though :)
Shared state and destructive updates include chunk variables, string variables, node insertion, and <let> (which only applies if you're writing rules in XML).
Though I think it's also entirely possible that the bytecode interpreter is not the bottleneck and our actual problem is allocating thousands of nodes to store the different paths.