Zero (phantom, unvoiced) word support.
linas opened this issue · 22 comments
Zero/phantom words: Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions.
Other examples, with the phantom word in parentheses, include:
- I ate all (of) the cookies.
- I taught him (how) to swim.
- I told him (that) it was gone.
- (It) looks good.
- (You) go home!
- You know you can (do something).
- I wish I had (done something).
- I will (do something), if you do.
- How often (does it happen)?
- How big (is it)?
- Room w/sea view (is) available. -- zero copula
- (some) thieves rob(-bed a) bank! -- newspaper headline
One possible solution to the unvoiced-word problem might be to allow the LG rules to insert alternatives during the early culling stages. This avoids the need to pre-insert all possible alternatives during tokenization...
The "(You) go home" variant is particularly important, as it provides a subject for directives/imperatives.
One way to implement this would be with certain special link types. For example:
    +---Wi--+-MVp-+
    |       |     |
LEFT-WALL go.v home.r
The Wi link could be interpreted as a phantom-expanding link, so that, after parsing, there would be a phantom-expansion stage, which would convert the Wi link to this:
    +--Wd---+-Sp-+-MVp-+
    |       |    |     |
LEFT-WALL (you) go.v home.r
that is, the Wi link gets split in two, into Wd- & Sp+, and the phantom (you) gets dropped into place. However, for proper compatibility, we also want the WV link:
    +---->WV---->+
    +--Wd---+-Sp-+-MVp-+
    |       |    |     |
LEFT-WALL (you) go.v home.r
so it looks as if Wi gets renamed to WV.
There are several practical difficulties with this proposal. Although it would be "trivial" in opencog, where graph rewriting is a built-in feature, it would be hard to do in LG itself, which does not support generic graph rewriting. That is,
- it requires a non-trivial post-parsing stage,
- it fails to actually simplify the dictionaries.
The second point is the more serious: one reason to entertain the idea of phantom words is the hope that the grammar would be simplified. However, in this proposal, the dicts would need to contain rules for Wi as well as for Wd, Sp and WV, so the post-parsing conversion does not simplify the grammar.
The post-parsing stage could be carried out using some graph re-write rules, e.g. in relex or with the opencog pattern matcher. Since this happens post-parsing, there is no particular point of putting it "inside" of LG itself.
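As a rough illustration of what such a post-parsing rewrite might look like, here is a minimal Python sketch. It is not LG, relex or opencog code: the representation of a linkage as a word list plus (left index, right index, label) tuples and the function name expand_phantom_you are assumptions made only for this example.

```python
from typing import List, Tuple

# Hypothetical linkage representation: (left word index, right word index, label).
Link = Tuple[int, int, str]


def expand_phantom_you(words: List[str], links: List[Link]) -> Tuple[List[str], List[Link]]:
    """Rewrite a LEFT-WALL --Wi--> verb link into Wd + Sp + WV around a phantom "(you)"."""
    wi_links = [l for l in links if l[0] == 0 and l[2] == "Wi"]
    if not wi_links:
        return words, links  # no imperative link, nothing to expand
    wi = wi_links[0]
    verb = wi[1]

    # Drop the phantom word into place right after LEFT-WALL (index 0).
    new_words = [words[0], "(you)"] + words[1:]

    def shift(i: int) -> int:
        # Every word after the wall moves one slot to the right.
        return i + 1 if i >= 1 else i

    # Keep all other links, re-indexed; the Wi link itself is removed.
    new_links = [(shift(a), shift(b), lab) for (a, b, lab) in links if (a, b, lab) != wi]
    v = shift(verb)
    # Wi is split into Wd (wall to phantom) and Sp (phantom to verb), plus the WV link.
    new_links += [(0, 1, "Wd"), (1, v, "Sp"), (0, v, "WV")]
    return new_words, sorted(new_links)


words, links = expand_phantom_you(
    ["LEFT-WALL", "go.v", "home.r"],
    [(0, 1, "Wi"), (1, 2, "MVp")],
)
print(words)
print(links)
```

The sketch also makes the objection above concrete: the rewrite rule has to know about Wi as well as Wd, Sp and WV, so nothing is removed from the dictionary.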
One possible solution is to perform a one-point compactification. The dictionary contains the phantom words, and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 't' as is currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line; it is not yet placed into sentence word sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if there are any phantom words that are linked, then all of the connectors on the disjunct must be satisfied (of course!) else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.
One possible solution is to perform a one-point compactification.
@linas,
I didn't understand this method.
Can you construct an example?
Also, please consider that it is not a problem to insert phantom words/alternatives as needed during/after tokenization. Moreover, inserted phantom words can be marked as "optional", which means their slot would be automatically removed from the linkage if none of its words get fully linked.
Implementing a concept of "optional alternatives" can maybe help to handle phantom words.
The idea is that "optional alternatives" are to be used only if the sentence cannot be fully parsed using another alternative.
An extension of this idea is assigning a cost to optional alternatives so only the lowest cost one is actually used. This can be another kind of cost than disjunct cost. You once mentioned a similar idea - I will try to find its location. What I suggest here is not to use the higher-cost alternatives at all, to prevent zillions of bogus parses in addition to the good ones.
This idea can also solve issue #224 "spell-guessing mis-handles capitalized words".
One obstacle in this idea is the case of more than one location of an "optional alternative" in a sentence when the lower-cost alternatives in different locations are mutually exclusive.
Another obstacle is how to handle "optional alternatives" if there is no full parse. A possible solution is just to omit them.
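The "optional alternatives" idea could be sketched roughly as below. This is only a toy in Python, not LG code: pick_optional_alternative and the has_full_parse callback are hypothetical names, and the callback stands in for an actual parse attempt. Alternatives are tried in order of increasing cost, only the first one that parses fully is used, and if nothing parses the optional alternatives are simply omitted.

```python
from typing import Callable, Iterable, List, Tuple


def pick_optional_alternative(
    sentence: List[str],
    alternatives: Iterable[Tuple[float, List[str]]],
    has_full_parse: Callable[[List[str]], bool],
) -> List[str]:
    """Return the token sequence to use: the original, or the cheapest alternative that parses."""
    # Optional alternatives are used only if the sentence cannot be fully parsed as-is.
    if has_full_parse(sentence):
        return sentence
    # Try alternatives from lowest to highest cost; stop at the first full parse,
    # so higher-cost alternatives are never used at all.
    for _cost, alt in sorted(alternatives, key=lambda a: a[0]):
        if has_full_parse(alt):
            return alt
    # No full parse anywhere: omit the optional alternatives.
    return sentence


# Toy stand-in for the parser: pretend only sentences with a subject "(you)" parse.
toy_parser = lambda tokens: "(you)" in tokens
print(pick_optional_alternative(
    ["go", "home"],
    [(2.0, ["(it)", "go", "home"]), (1.0, ["(you)", "go", "home"])],
    toy_parser,
))
```

Note that this sketch handles a single insertion point only; it does not address the first obstacle above, where lower-cost alternatives at different locations are mutually exclusive.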
At the end of parsing, if any phantom words are linked, then all of the connectors on the disjunct must be satisfied (of course!) else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.
The phantom word addition may be redundant, i.e. that the sentence will get parsed w/o it. Of course, this cannot be found in advance (before parsing).
My updated algo is:
1. In determin_word_expressions(), examine the expression of the candidate word to find out which phantom word (if any) needs to be inserted after it (or maybe before it for some languages or situations, but I would initially implement only after it).
2. For now, support only a single word-slot insertion instead of an insertion of a whole alternative (can be extended to whole alternative insertion if needed). Mark it with a boolean "phantom" in addition to "optional". It seems to me that a macro is to be inserted and not words (see below).
3. Prune, parse, extract links. No linkages are created yet.
4. If parsed with nulls, remove the phantom words and continue to step 6.
5. We now have a full linkage. Inspect the parse data structure (created by extract_links()) to see if there is still a complete parse w/o each phantom word. If there is, remove the phantom word.
6. Create linkages and continue as usual.
As an example, I tried to inspect this algo for the sentence "I will, if he goes to the store" from PR #1234.
This is said to be converted to: I will.v [context-verb] , if she goes first.
The question is what to insert as this [context-verb]. The list from en/words/words.v.1.1 is too big, and also contains relevant disjuncts that may cause unintended parses. So I thought to create a special macro for it, say <verb-infinitive>.
There is also the question of how we can know which phantom word to insert. I cannot see how a special z connector may help here.
There seem to be several plausible solutions (including inserting words during tokenization). I don't know which is best or easiest. Let's explore each in greater detail. So, this #224 (comment) is rather cryptic. Let me try to reconstruct what I was thinking.
Given the sentence "I will, if she goes first", the desired result is
+------------Xc------------+
+------->WV------------>+----MVs----+----CV-->+ |
+->Wd--+-Sp*i+-----I----+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | | | |
LEFT-WALL I.p will.v [context-verb] , if.r she goes.v first.a RIGHT-WALL
For now, I don't care about the specific format: besides [context-verb], perhaps <do-something> or {zero-verb} ... some way of indicating an unresolved reference to an action (cf. other discussions about reference resolution).
The existing 4.0.dict is modified to contain the following:
LEFT-WALL: (zWV+ & Wd+) or ...;
will.v: (Sp- & zI+) or ...;
[context-verb]: zI- & zWV- & zMVs+;
if.r: (Xd- & zMVs- & Cs+ & CV+ & Xc+) or ...;
During dictionary read, a special class of zero-words is created: these can be spotted in one of two ways: (1) they are surrounded by square brackets, e.g. [context-verb], or better yet, (2) all connectors in the disjunct start with a z. This class of zero-words is kept in a special list.
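A minimal Python sketch of this classification step, under stated assumptions: the dictionary is represented here as a plain mapping from word to a list of disjuncts (each a list of connector strings), which is not how LG stores it, and is_zero_word / collect_zero_words are hypothetical helper names.

```python
from typing import Dict, List

# Hypothetical in-memory form of a few dictionary entries:
# word -> list of disjuncts, each disjunct a list of connector strings.
Dictionary = Dict[str, List[List[str]]]


def is_zero_word(word: str, disjuncts: List[List[str]]) -> bool:
    """Apply the two tests above: bracketed name, or every connector starts with 'z'."""
    bracketed = word.startswith("[") and word.endswith("]")
    connectors = [c for d in disjuncts for c in d]
    all_z = bool(connectors) and all(c.startswith("z") for c in connectors)
    return bracketed or all_z


def collect_zero_words(dictionary: Dictionary) -> Dictionary:
    """Build the special out-of-line list of zero-words."""
    return {w: ds for w, ds in dictionary.items() if is_zero_word(w, ds)}


toy_dict: Dictionary = {
    "LEFT-WALL": [["zWV+", "Wd+"]],
    "will.v": [["Sp-", "zI+"]],
    "[context-verb]": [["zI-", "zWV-", "zMVs+"]],
    "if.r": [["Xd-", "zMVs-", "Cs+", "CV+", "Xc+"]],
}
print(sorted(collect_zero_words(toy_dict)))
```

Ordinary words such as will.v carry a z connector too, but they are not zero-words because their other connectors do not start with z.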
Counting is done as usual, with a slight modification: if a disjunct has a connector starting with z, that connector is ignored during pruning -- it's "invisible". As if it was never there. Linkage generation is done as usual, except that islands are allowed. For the above example, one of the linkages will be
+------------Xc------------+
+----CV-->+ |
+->Wd--+-Sp*i+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | | |
LEFT-WALL I.p will.v , if.r she goes.v first.a RIGHT-WALL
During linkage generation, look at the chosen disjuncts, and see if any of them contain z connectors. If so, then scan the list of zero words, to see if any of them can satisfy the "invisible" unconnected connectors.
That is, while generating the above linkage, it will be discovered that LEFT-WALL has an invisible zWV+ on it, and will.v has an invisible zI+ on it, and that if.r has an invisible zMVs-. These are conjoined: this linkage has a zero-disjunct of zWV+ & zI+ & zMVs-. Flip all the direction indicators, and search the list of zero-words for some word that has that disjunct. If so, then we are done.
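The conjoin, flip and search step might look like the following Python sketch. It rests on simplifying assumptions that are not part of the proposal: connectors are compared as exact strings (no subscript wildcards), the order of connectors within a disjunct is ignored, and all names (flip, find_zero_words, the data layout) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def flip(connector: str) -> str:
    """Flip the direction indicator: zWV+ becomes zWV-, zMVs- becomes zMVs+."""
    name, direction = connector[:-1], connector[-1]
    return name + ("-" if direction == "+" else "+")


def find_zero_words(chosen_disjuncts: List[List[str]],
                    zero_words: Dict[str, List[List[str]]]) -> List[str]:
    """Return the zero-words whose disjunct satisfies all invisible connectors."""
    # Conjoin the invisible (z) connectors of all chosen disjuncts.
    invisible = [c for d in chosen_disjuncts for c in d if c.startswith("z")]
    if not invisible:
        return []
    # Assumption: match as a multiset, ignoring connector order.
    wanted = Counter(flip(c) for c in invisible)
    return [w for w, disjuncts in zero_words.items()
            if any(Counter(d) == wanted for d in disjuncts)]


# Disjuncts chosen for LEFT-WALL, will.v and if.r in the island linkage above.
chosen = [
    ["zWV+", "Wd+"],
    ["Sp-", "zI+"],
    ["Xd-", "zMVs-", "Cs+", "CV+", "Xc+"],
]
zero_words = {"[context-verb]": [["zI-", "zWV-", "zMVs+"]]}
print(find_zero_words(chosen, zero_words))
```

Because the match requires every connector of the zero-word's disjunct to be used, a zero-word with an extra connector does not qualify, which corresponds to the earlier requirement that all connectors on a linked phantom word be satisfied.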
Well, almost done. Some open questions:
- Do we want to enforce the no-link crossing constraint on these invisible links? I don't know...
- How should it be drawn, in ascii-art? Conceptually, the zero-verb is kind of floating in outer space, not having any linear position in the sentence. It only has links tying it back down to earth. Thus, for example:
[context-verb]
^
+------->WV-------------+
^ +----I-----+ +------------Xc------------+
^ ^ +----MVs----+----CV-->+ |
+->Wd--+-Sp*i+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | | |
LEFT-WALL I.p will.v , if.r she goes.v first.a RIGHT-WALL
In the above, [context-verb] is not the third or the fourth word in the sentence, it's just in orbit around the sentence. Of course, this screws up the API, which wants to index the location of every word. More on this in the next comment.
The above comment points out that we have an API issue. The current API is driven by word indexes: all words have a fixed location in a sentence, which means that location-independent linkages are not currently possible. Here is a futuristic example of a desirable linkage diagram:
+==============VR================>[context-verb]
$ ^
$ +------->WV------+
$ ^ +-I-+
$ +---Js---+ ^ ^ +-MVs----+----CV-->+
+-->Wi--+---I--+-MVp+ +Ds**c+ +->Wd--+-Sp*i+ +-Xd+-Cs+--Ss-+--MVa-+
| | | | | | | | | | | | | |
LEFT-WALL let's jump.v off a cliff.n LEFT-WALL I.p will.v , if.r she goes.v first.a
Here, VR is a new link type. It stands for "verb reference". We are employing it for reference resolution.
I would like to do things like the above, but the current infrastructure isn't suitable for that. We can kind-of do multiple sentences, today, but it's ad hoc:
+-------Xp-------+------Xp-----+
+---->WV--->+ +-->WV-->+ |
+->Wd--+-Ss-+ +>Wd+-Ss-+ |
| | | | | | |
LEFT-WALL it walks.v ! it talks.v !
but there is no way to draw links to references:
+>========================NR==========>+
$ +>=====NR===============>+
$ $ +>===NR===>+
$ $ $ $
$ +----------------Xp----------$----+
$ +--------->WV-------->+ $ |
+-------Xp-------+------Xp-----+ | $ |
+---->WV--->+ +-->WV-->+ +-->WV->+ $ |
+->Wd--+-Ss-+ +>Wd+-Ss-+ +>Wd+-Ss+-Oste-+ |
| | | | | | | | | | |
LEFT-WALL it walks.v ! it talks.v ! it 's.v Wilber.m !
where the NR links to Wilber are "Named Reference" or "Named Entity" links. Note that these are crossing links. I'm wandering off-topic here; what I'm trying to say is that we need some kind of infrastructure for more general graphs. There are four possible directions to go in, for more general infrastructure:
1. Extend the LG API's in a more general direction.
2. Switch to some completely different API -- this is what the AtomSpace does.
3. Pick some existing NLP framework, e.g. NLTK, and work really hard to make sure LG inter-operates cleanly and elegantly with that framework.
4. Like 3. but with some top-o-the-pops Java library.
I have deep misgivings about 3 and 4 since it seems one must sacrifice huge amounts of representational power, in exchange for a grab-bag of mostly silly tools. However, there are vast numbers of programmers (and corporate executives) who love these frameworks, and view them as the answer to all problems.
The problem with 2. is that it is too abstract: it's not programmer-friendly, it's missing graph visualization tools (despite over a decade of failed efforts), there aren't any API utilities, e.g. no easy way to jam it into a computer-game non-player character chatbot. There are companies that make good money creating fancy chatbots for computer games... it's frustrating that we can't demo that.
The problem with 1. is that... is there a problem? Well, it's not 2, 3 or 4. I guess that's a problem. But I know that you personally would have a lot more fun working on 1. than on 2, 3, 4 and so .. that's a good thing. Let's push the boundary. See how far we can go.
Edited to complete it after a premature posting.
I need several more clarifications...
if a disjunct has a connector starting with z, that connector is ignored during pruning -- it's "invisible". As if it was never there.
It seems z connectors shouldn't be just totally invisible to the pruning algo, else their disjuncts could be discarded. Instead, maybe such disjuncts should just not be candidates for discarding during pruning. On the other hand, the non-z connectors on such disjuncts would retain other disjuncts that have a connector that may connect to them. I suspect this may result in a rather ineffective pruning (see also below).
Counting is done as usual [...]
Linkage generation is done as usual, except that islands are allowed.
The "islands" state should be the same for counting.
For the above example, one of the linkages will be
+------------Xc------------+
+----CV-->+ |
+->Wd--+-Sp*i+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | | |
LEFT-WALL I.p will.v , if.r she goes.v first.a RIGHT-WALL
But it is not so, the parsing with islands allowed is actually:
linkparser> !islands-ok=1
islands-ok set to 1
linkparser> I will, if she goes first
No complete linkages found.
Found 6 linkages (6 had no P.P. violations) at null count 1
Linkage 1, cost vector = (UNUSED=1 DIS= 1.06 LEN=13)
+-------Xx-------+
+---->WV---->+ +----->WV---->+
+->Wd--+-Sp*i+ +-->Wd--+--Ss-+--MVa-+
| | | | | | |
LEFT-WALL I.p will.v , [if] she goes.v first.a
The word "if" is unlinked here so no phantom word would be able to link to it. However, parsing it as you indicated is needed for applying your algo.
if.r: (Xd- & zMVs- & Cs+ & CV+ & Xc+) or ...;
How many words would need z connectors? If too many, the pruning may not be effective, with parsing speed implications.
I said above:
The phantom word addition may be redundant, i.e. that the sentence will get parsed w/o it.
It seems phantom words will get inserted in many places even though they are not needed.
To this I offered a solution:
Inspect the parse data structure (created by extract_links()) to see if there is still a complete parse w/o each phantom word. If there is, remove the phantom word.
Moreover, they may get inserted in places that make invalid sentences parsable. For example:
*I will it, if she goes first
I don't see what would prevent some phantom word from completing the sentence (or some other wrong ones) as follows:
I will [context-verb] it, if she goes first
Do we want to enforce the no-link crossing constraint on these invisible links? I don't know...
BTW, I noted that for the sentence "I will and he will, if she goes first", inserting a phantom word only after the second "will" is currently enough to make it fully parsable.
I pressed ENTER by mistake out of the comment box and the default button "comment" got triggered...
I will edit my post above to complete it. Please read it on the Web, as the mail copy is only a WIP-comment.
I have completed the said post... I have more things to ask but they may be redundant (or get changed) depending on the clarifications.
- Extend the LG API's in a more general direction.
It should be extended anyway so this seems a good direction.
- Switch to some completely different API -- this is what the AtomSpace does.
It may be an addition, not a replacement.
- Pick some existing NLP framework, e.g. NLTK and work really hard to make sure LG inter-operates cleanly and elegantly with that framework.
I guess it would be a good idea to learn the API of NLTK and mimic it when applicable.
if a disjunct has a connector starting with z, that connector is ignored during pruning -- it's "invisible". As if it was never there.
It seems z connectors shouldn't be just totally invisible to the pruning algo, else their disjuncts could be discarded. Instead, maybe such disjuncts should just not be candidates for discarding during pruning.
Why keep them? why not discard them? Just right now, I see no reason to keep them, but everything is a bit murky...
The "islands" state should be the same for counting.
Turn on "island" only if pruning left behind disjuncts with z in them.
parsing with islands
I think you did it wrong. Try again with this dict:
--- a/data/en/4.0.dict
+++ b/data/en/4.0.dict
@@ -13190,7 +13190,8 @@ LEFT-WALL:
or (hWl+ & {Xj+} & (RW+ or Xp+))
or (QUd+ & hWl+ & {Xj+} & (Xc+ or [()]) & QUc+)
or hCPa+
- or [[ZZZ+ & <sent-start>]];
+ or [[ZZZ+ & <sent-start>]]
+or Wd+;
% Cost on Xc- because Xc is intended for commas, not sentence-ends.
% Without this cost, the right wall gets used incorrectly with MX links.
@@ -13792,3 +13793,6 @@ LENGTH-LIMIT-1: YS+ & YP+ & PH+ & ZZZ+;
% Handy test
% grrr: (A- & B- & C+ & D+) or [(E- & @F+ & @G+ & H+)] or [[(I- & J- & @K- & @L+)]];
+
+will.z: Sp-;
+if.z: Xd- & Cs+ & CV+ & Xc+;
The first diff emulates LEFT-WALL: (zWV+ & Wd+) or ... with z ignored. Likewise, the second two. I get
linkparser> !is
Use of null-linked islands turned on.
linkparser> I will, if she goes first
No complete linkages found.
Found 19 linkages (19 had no P.P. violations) at null count 1
Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=4)
+------------Xc------------+
+----CV-->+ |
+-Sp*i+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | |
I.p will.z , if.z she goes.v first.a RIGHT-WALL
Note that all of the UNUSED=0 linkages are printed first, even if they have high cost. Only then are the UNUSED=1 linkages printed.
Moreover, they may get inserted in places that make invalid sentences parsable. For example:
*I will it, if she goes first
I don't see what would prevent some phantom word from completing the sentence (or some other wrong ones) as follows:
I will [context-verb] it, if she goes first
This will happen only if the dictionary contains a transitive phantom:
[context-verb-intrans]: zI- & zWV- & zMVs+;
[context-verb-trans]: zI- & zWV- & zMVs+ & zO+;
Without the second line, it won't parse. This is a generic dictionary maintenance headache: poorly structured disjuncts allow crazy parses; it is challenging to set them up so that they work well.
You can try it: dive is in words/words.v.5.1 and it's intransitive:
+------------Xc------------+
+-------->WV------->+-----MVs----+----CV-->+ |
+->Wd--+-Sp*i+---I--+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | | | |
LEFT-WALL I.p will.v dive.v [it] , if.r she goes.v first.a RIGHT-WALL
I guess it would be a good idea to learn the API of NLTK and mimic it when applicable.
Given that it seems to be immensely popular, I suppose so. I am spread far too thin, working on too many projects already, so this is not something I could undertake. But, sure, looking how other people do things, and then stealing the best ideas is usually a good thing.
if a disjunct has a connector starting with z, that connector is ignored during pruning -- it's "invisible". As if it was never there.
It seems z connectors shouldn't be just totally invisible to the pruning algo, else their disjuncts could be discarded. Instead, maybe such disjuncts should just not be candidates for discarding during pruning.
Why keep them? why not discard them? Just right now, I see no reason to keep them, but everything is a bit murky...
Since non-z cannot match z, I think that you are right.
Turn on "island" only if pruning left behind disjuncts with z in them.
Note that the current pruning looks at "Islands" for possible optimization (skipping parsing altogether in case there are more nulls than requested). This optimization can make a difference only when parsing with a null_count>0. So if any disjunct contains a z connector, we would need to disable this optimization. I introduced this optimization in the hope that an aggressive pruning of an unparsable sentence (unparsable with the requested null count) would detect a significant amount of such sentences already in the pruning stage and thus would make parsing unnecessary. Currently, it already skips the parsing of some sentences. (I have implemented a more aggressive power pruning but it needs an overhaul of the existing code to reduce the CPU consumption of the added code.)
What is the purpose of parsing with islands? Is this for finding the exact location of the island words?
The following just has a null word when using the modified dict:
linkparser> !sp=1
spell set to 1
linkparser> will you, if she goes first?
No complete linkages found.
Found 3 linkages (3 had no P.P. violations) at null count 1
Linkage 1, cost vector = (UNUSED=1 DIS= 2.06 LEN=12)
+--------------------Xp--------------------+
+-------Xx-------+----->WV---->+ |
+-->Qd---+-SIp+ +-->Wd--+--Ss-+--MVa-+ |
| | | | | | | |
LEFT-WALL will.v you , [if] she goes.v first.a ?
What is the purpose of parsing with islands?
It's a flag that dates back to the original code. It says, basically "I can't parse the whole thing, but here are a bunch of phrases I understand, I just can't join them together." It is an alternative to saying "I can't parse the whole thing, but if I ignore these words, then I can".
For this example, the two islands "make sense", in a way: "will you" and "if she goes first". For this example, the skipped word is terrible, because the Xx link says "there are two sentences here, and the first sentence is 'will you' and the second sentence is 'she goes first'", which is strictly worse than the island form.
That said, the historic default has been skipped words instead of islands; I have no idea why that's the default. I kind of like islands better. They're usually less crazy.
I wrote above:
spell set to 1
This was a strange typo - my intention was is=1. With it, it is as expected:
linkparser> !is=1
islands-ok set to 1
linkparser> will you, if she goes first?
No complete linkages found.
Found 7 linkages (7 had no P.P. violations) at null count 1
Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=12)
+----------Xc---------+
+----CV-->+ |
+-->Qd---+-SIp+ +-Xd+-Cs+--Ss-+--MVa-+ |
| | | | | | | | |
LEFT-WALL will.v you , if.z she goes.v first.a ?
Additional questions:
- Are the phantom words supposed to always fit in island boundaries?
- Is this method of finding phantom words supposed to give meaningful results also on ungrammatical sentences?
- Are the phantom words supposed to always fit in island boundaries?
It sure seems like it, doesn't it?
- Is this method of finding phantom words supposed to give meaningful results also on ungrammatical sentences?
Possibly! I have repeatedly noticed that, when I repair the English dict to handle some new case, that there is a matching version with a phantom word that does parse correctly. Having explicit phantom word support could lead to simplifications of the dictionary, or so it seems: I keep having to add complexity to handle those cases; this is hard to do, and it creates yet-more disjuncts. Obviously, having fewer disjuncts would be better.
The psychological lesson here is that "newspaper English" is well-written and articulate and precise. But when people talk, they are sloppy, imprecise, and drop words all the time. Non-native speakers drop words simply because they just don't know what they should be. It seems that phantom words restore these, or "fill in the blanks", literally. Interesting...
- How should it be drawn, in ascii-art? Conceptually, the zero-verb is kind of floating in outer space, not having any linear position in the sentence. It only has links tying it back down to earth.
Why not just actually insert it in the sentence (to show how the sentence got parsed)?
- How should it be drawn, in ascii-art? Conceptually, the zero-verb is kind of floating in outer space, not having any linear position in the sentence. It only has links tying it back down to earth.
Why not just actually insert it in the sentence (to show how the sentence got parsed)?
Not sure. I guess all of the above examples do have an explicit location for the phantom word. An interesting exception is #1240 where the missing word forces a subject-verb inversion.