apertium/lexd

optional disjointed lexicons referenced in pattern

jonorthwash opened this issue · 13 comments

Currently to indicate that a set of suffixational morphology is optional in a pattern, something like this would be a normal approach:

Roots [<pos>:] OptionalSuffixes?

To indicate that a set of prefixational morphology is optional in a pattern, something like this is needed:

Roots [<pos>:]
:OptionalPrefixes Roots [<pos>:] OptionalPrefixes:

The latter approach is generally tedious, but it also works for suffixational morphology. The former approach (which takes less code and is much simpler in complex cases) cannot be extended to prefixational morphology, or to any other matched lexicon references.

I can (kind of?) imagine cases where the current behaviour could make sense (each element is combinatorially optional when they both have ?, i.e., 0 or X or Y or (X and Y)), but it seems far more common to want them to operate together (i.e., 0 or (X and Y)). Or perhaps an additional symbol could be defined for this use?
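To make the two readings concrete, here is a small Python sketch that only enumerates which elements are present under each interpretation. It is a model of the behaviour described above, not lexd's implementation, and the function names are made up for illustration.

```python
from itertools import product

def independent_optional(x, y):
    """Current behaviour: each '?' is decided on its own (0, X, Y, or X and Y)."""
    return [
        tuple(e for e in combo if e is not None)
        for combo in product((None, x), (None, y))
    ]

def joint_optional(x, y):
    """Requested behaviour: both elements appear together or not at all."""
    return [(), (x, y)]

print(independent_optional("X", "Y"))
print(joint_optional("X", "Y"))
```

The independent reading yields four combinations and the joint reading only two, which is the difference the request is about.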

Under the current setup, the best way to do this is:

PATTERN PosStem
Roots [<pos>:]

PATTERN Pos
PosStem
:OptionalPrefixes PosStem OptionalPrefixes:

I'm having trouble coming up with reasonable syntax that would make that a single line, but if you have an idea for one, I'm open to implementing it.

The exact reason I created lexd was so that no one would ever have to write twoc again.

Yes, I agree that if all you want is the fst, then prefix tags are perfectly reasonable, but currently all Apertium tools assume suffix tags and I have no intention of being the one who rewrites everything for that. (I am willing to help make something that rearranges tags once between morph and disam, however.)

As for the particular suggestion, I like the idea, but it seems like it would introduce some ambiguity in parsing and I'm not sure how easy that would be to deal with. Like, the fact that 3?(3) would then have totally different behavior from 3? (3) bothers me. (And yes, completely numeric lexicon names are currently valid.)

Upon further reflection, 3(3) and 3 (3) are already different things, so actually I think @nlhowell's suggestion works.

So if all references to a particular lexicon in a pattern are [name]?([number]), two copies of the pattern will be compiled, one with all of them present and one with all of them absent.

If some are optional and some aren't, I don't think it would actually break anything, but it would probably be confusing, so yeah, probably best to require aliasing in that case.

If some are optional and some aren't [...] probably best to require aliasing in that case.

What's aliasing?

There's an ALIAS command that allows you to give a lexicon a second name, which is useful if for some reason you want independent copies of a lexicon in a single pattern.

LEXICON A
x
y

ALIAS A B

PATTERNS
A A # xx, yy
A B # xx, xy, yx, yy
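The matching behaviour in that example can be modelled in a few lines of Python. This is a sketch under the assumption that every reference to the same lexicon name in a pattern picks the same entry, while an alias is treated as a separate name with the same entries; `expand` and `lexicons` are hypothetical names, not part of lexd.

```python
from itertools import product

# LEXICON A with entries x and y; ALIAS A B gives B the same entries.
LEXICON_A = ["x", "y"]
lexicons = {"A": LEXICON_A, "B": LEXICON_A}

def expand(pattern, lexicons):
    """Expand a pattern (a list of lexicon names) into surface strings.

    References that share a name are matched: they all use the same entry.
    """
    names = list(dict.fromkeys(pattern))  # unique names, in first-seen order
    results = []
    for choice in product(*(range(len(lexicons[n])) for n in names)):
        picked = dict(zip(names, choice))  # one entry index per distinct name
        results.append("".join(lexicons[n][picked[n]] for n in pattern))
    return results

print(expand(["A", "A"], lexicons))  # matched references
print(expand(["A", "B"], lexicons))  # aliased, so chosen independently
```

In this model `A A` gives only the matched pairs and `A B` gives the full cross product, in line with the comments in the lexd snippet.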

So if all references to a particular lexicon in a pattern are [name]?([number]), two copies of the pattern will be compiled, one with all of them present and one with all of them absent.

Actually even simpler than that. The references to that lexicon can just have a temporary empty entry.

Ooh, nice. I see updates to code and tests, but not the documentation?

I forgot and added it in a separate commit.

Optional lexicons where sides are matched are not acting as expected:

PATTERNS
A:? B :A?

LEXICON A
a:<a>

LEXICON B
b
bb

Output:

ab:b
abb:bb
abb:bb<a>
ab:b<a>
b
bb
bb:bb<a>
b:b<a>

Expected output:

abb:bb<a>
ab:b<a>
b
bb

For a question mark to be interpreted as disjointed, it needs to be before parentheses and not the last character, so you need to write A?(1): B :A?(1). With that change it works fine.
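The difference between the two spellings can be reproduced with a Python model of the lexicons from the report. It assumes, following the earlier comment in this thread, that the disjointed form behaves as if lexicon A had a temporary empty entry and both references were matched; the function names are hypothetical and this is not how lexd is actually implemented.

```python
from itertools import product

# Entries are (left side, right side) pairs.
A = [("a", "<a>")]               # LEXICON A: a:<a>
B = [("b", "b"), ("bb", "bb")]   # LEXICON B: b, bb

def expand_independent(a_entries, b_entries):
    """Model of 'A:? B :A?': each reference to A is optional on its own."""
    pairs = set()
    for (a_left, a_right), (b_left, b_right) in product(a_entries, b_entries):
        for use_left, use_right in product((True, False), repeat=2):
            left = (a_left if use_left else "") + b_left
            right = b_right + (a_right if use_right else "")
            pairs.add((left, right))
    return pairs

def expand_disjoint(a_entries, b_entries):
    """Model of 'A?(1): B :A?(1)': A gets a temporary empty entry and both
    references use the same entry, so they are present or absent together."""
    a_with_empty = a_entries + [("", "")]
    return {
        (a_left + b_left, b_right + a_right)
        for (a_left, a_right), (b_left, b_right) in product(a_with_empty, b_entries)
    }

def show(pairs):
    """Format pairs the way the report does: 'left:right', or just one side if equal."""
    return sorted(l if l == r else f"{l}:{r}" for l, r in pairs)

print(show(expand_independent(A, B)))
print(show(expand_disjoint(A, B)))
```

Under this model the independent form produces the eight pairs listed under "Output", and the disjointed form produces the four pairs listed under "Expected output".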

PATTERNS
A:? B :A?

For a question mark to be interpreted as disjointed, it needs to be before parentheses and not the last character, so you need to write A?(1): B :A?(1). With that change it works fine.

We just tried this again today and had to go find this issue after consulting the documentation and remaining confused about why this wasn't working. Could the documentation be updated with an example?

Done in 3950b6f