r0ller/alice

Introduce context

r0ller opened this issue · 4 comments

Bringing it over here from #10.

Context handling must be reviewed, both as currently solved within one sentence and, later, for related sentences. A better solution could be to use constituent/phrase tagging (tags for question tests) to determine the referent of e.g. a pronoun.

Context handling outside the scope of one sentence can only be solved if the system responses (with the result of the script execution) are also recorded in the db. For example:

-List contacts with Alice!
-Alice:
1) 00 1 234 567 89
-Call that number!

The relative pronoun 'that' refers to something outside the scope of the sentence. In the tutorial, the 'Relative clauses' section describes how a reference can be set within a sentence in cases like 'List files that are executable!', where 'that' refers to 'files'. The way of creating a semantic rule for such references could be extended by adding a new root_type like 'C' (for Context) that can be specified in the fields main_lookup_root or dependency_lookup_root. When a semantic rule combines the relative pronoun with a symbol that is out of scope in order to set up a reference, its root_type would be set to 'C'. This could then be handled in the sparser's is_valid_combination() function, which would look up the lexeme of the symbol tagged with the relative pronoun in the db table analyses_deps, as close in time to the current utterance as possible.
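A minimal sketch of the "as close in time as possible" lookup, written in Python with an in-memory SQLite db only for illustration. The table name analyses_deps_sketch and all of its columns (lexeme, tag, tag_value, created_at) are assumptions made for this example and do not reflect the real analyses_deps schema:

```python
import sqlite3

def find_context_lexeme(conn, tag, tag_value, before_ts):
    """Return the most recent lexeme carrying tag=tag_value that was
    recorded before before_ts, or None if nothing matches."""
    row = conn.execute(
        "SELECT lexeme FROM analyses_deps_sketch "
        "WHERE tag = ? AND tag_value = ? AND created_at < ? "
        "ORDER BY created_at DESC LIMIT 1",
        (tag, tag_value, before_ts),
    ).fetchone()
    return row[0] if row else None

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analyses_deps_sketch "
    "(lexeme TEXT, tag TEXT, tag_value TEXT, created_at INTEGER)"
)
conn.executemany(
    "INSERT INTO analyses_deps_sketch VALUES (?, ?, ?, ?)",
    [
        ("00 9 876 543 21", "relative", "that", 50),
        ("00 1 234 567 89", "relative", "that", 100),
    ],
)

# An utterance at time 200 resolves 'that' to the latest tagged number.
latest = find_context_lexeme(conn, "relative", "that", 200)
```

In the real system this check would live in is_valid_combination(); the sketch only shows the ordering by time.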

Although this could work, it raises the problem of interpreting the results of the scripts generated by hi, which in the previous example look like this:

Alice
1) 00 1 234 567 89

One idea is to reuse the hi() API function with the system user id 'hi', passing in the result returned by the client system function that the script execution result indicated should be executed. The functor returning the result from the script shall also return a template to which the result of the client system function call can be mapped. The template shall be in a form that is detailed enough to be analysed by the interpreter. E.g. instead of returning just string(s) and number(s) like above, a json structure shall be returned, like:

{ "contacts": [ { "name": "Alice", "phone numbers": [ "00 1 234 567 89" ] } ] }

That can be analysed with a parser for the json language, just as is currently done for English or Hungarian, and the analysis would be put in the db tables as usual, with the relevant values (or keys as well?) like "Alice" and "00 1 234 567 89" being tagged during interpretation. No functor implementations seem to be necessary for the json interpreter currently.
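To make it concrete which items would be candidates for tagging, here is a small Python sketch that walks the json response and lists every leaf value with its key path. It uses Python's json module as a stand-in and is not the grammar-based json parser proposed here; the function name leaf_values is made up for the example:

```python
import json

def leaf_values(node, path=()):
    """Yield (path, value) for every scalar value in a parsed json tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_values(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_values(value, path + (index,))
    else:
        yield path, node

response = json.loads(
    '{ "contacts": [ { "name": "Alice", '
    '"phone numbers": [ "00 1 234 567 89" ] } ] }'
)
candidates = list(leaf_values(response))
for path, value in candidates:
    print(path, value)
```

The key path kept next to each value is one possible way to decide later whether a value is a name or a phone number when tags are assigned.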

Calling the hi() function this way means that the system response is not paired with the previous user utterance. However, its id (i.e. the analysis key) could be passed in if hi() had a corresponding overloaded interface, or at least the user name could be passed in to indicate whose latest utterance the system response shall be paired with.

Putting the analysis of the response in the db and tagging the numbers (not the word but the phone numbers) as c_values with relative pronouns like 'that' (besides the question word tags) would make it possible for a context search outlined above to find that number.

This approach has the advantage that it can search natural language contexts as well, not only system responses, in a unified way. So e.g. if the user states the number instead of the system, it shall also work:

-The phone number of Alice is 00 1 234 567 89.
-Call that number!

What needs to be done:
-create a json parser, also using foma fst, so that the lexical analyser can turn lexical items (keys and values as well) into lexemes
-set up the model (depolex, etc.) and add functor tags
-test if the parser, model and tagging work as expected
-implement context handling in is_valid_combination()

hints:
-when calling hi(), the source language can be 'JSON' and the target e.g. 'ENG', and the languages table could have a new column to indicate if the language is a 'natural language'.
-a new column may need to be added to analyses, failed_analyses and analyses_deps as well to indicate the target language, so that the system responses can be distinguished when looking for (English, Hungarian, etc.) context: their source language will be json, by which the context wouldn't be found.
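The two hints above could fit together roughly as in the following Python/SQLite sketch. The tables languages_sketch and analyses_sketch and every column name in them (lid, natural_language, source_lid, target_lid) are assumptions for this example, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages_sketch (lid TEXT PRIMARY KEY, natural_language INTEGER);
CREATE TABLE analyses_sketch (
    id INTEGER PRIMARY KEY, source_lid TEXT, target_lid TEXT, utterance TEXT);
INSERT INTO languages_sketch VALUES ('ENG', 1), ('HUN', 1), ('JSON', 0);
INSERT INTO analyses_sketch VALUES
    (1, 'ENG',  'ENG', 'List contacts with Alice!'),
    (2, 'JSON', 'ENG', '{"contacts": []}'),
    (3, 'JSON', 'HUN', '{"contacts": []}');
""")

def context_candidates(conn, lang):
    """Ids of analyses usable as context for lang: utterances in lang
    itself plus non-natural-language responses targeted at lang."""
    rows = conn.execute(
        "SELECT id FROM analyses_sketch "
        "WHERE source_lid = ? "
        "   OR (target_lid = ? AND source_lid IN "
        "       (SELECT lid FROM languages_sketch WHERE natural_language = 0)) "
        "ORDER BY id",
        (lang, lang),
    ).fetchall()
    return [r[0] for r in rows]

eng_context = context_candidates(conn, "ENG")
```

Without the target language column, the json response with id 2 could not be told apart from the one with id 3 when searching for English context.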

Managed to parse a basic json like:
{"anya" : {"darab" : 1}}
A preprocessor still needs to be implemented, so for now the input has to be separated manually into:
{"darab" : 1}
{"anya" : . }
Where the dot is currently the reference to the previous json object.
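A rough Python sketch of what such a preprocessor might do, assuming that each nested object is emitted as its own row right before the row that refers to it with a dot. The function name split_nested is hypothetical, arrays are passed through unchanged, and more than one nested object per level is rejected because a bare dot could not tell them apart:

```python
import json

def split_nested(obj):
    """Return one-level rows, innermost object first; a nested object
    is replaced by '.' (reference to the previously emitted row)."""
    rows = []

    def emit(node):
        parts = []
        nested_seen = False
        for key, value in node.items():
            if isinstance(value, dict):
                if nested_seen:
                    raise ValueError("'.' is ambiguous for several nested objects")
                nested_seen = True
                emit(value)  # the child row is written before its parent
                parts.append(f"{json.dumps(key)} : .")
            else:
                parts.append(f"{json.dumps(key)} : {json.dumps(value)}")
        rows.append("{" + ", ".join(parts) + "}")

    emit(obj)
    return rows

rows = split_nested(json.loads('{"anya" : {"darab" : 1}}'))
for row in rows:
    print(row)
```

The rows come out in the same order as in the manual separation above (spacing around the dot differs slightly).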
TODOS:
-JSON semantics is not yet complete
-JSON context reference handling works only for dependency chains (find_dependency_chain_with_tag_value()), haven't tried for individual node references (find_dependency_nodes_with_tag_value())
-reference handling for dep. chains resulted in an analysis where the semantics contained a morpheme id for which no morpheme was present in the analysis

After commit #595b090 to wip at least three known todos remain:
-Handling objects in json arrays (ref_id currently encodes only call stack level and row number but if a row has an array as value it may contain several objs)
-Handling all rows of the json obj referred to as value
-Handling main_ or dependency_lookup_root for context referenced nodes. Shall NC and HC be reused? Currently, they are used to look up relative (reference) nodes in the context, but it may be ok to reuse them if:
morphalytics->has_feature("Relative")==false
If not, another solution must be found, e.g. introducing new lookup root symbols to look up symbols only in the context nodes but not in the actual ones. If only N and H are used, the same symbols often appear in the subtree, which leads to erroneous interpretations (either false positives or false negatives).
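The false positive problem with the last todo can be illustrated with a small Python sketch. The Node class, the from_context flag and the symbol name CON_Num are invented for the example; the point is only that a lookup restricted to context nodes cannot accidentally match the same symbol in the current sentence:

```python
from dataclasses import dataclass

@dataclass
class Node:
    symbol: str
    lexeme: str
    from_context: bool  # True if the node was pulled in from an earlier analysis

def lookup(nodes, symbol, context_only):
    """Return nodes with the given symbol, taken either only from the
    context or only from the current sentence."""
    return [n for n in nodes
            if n.symbol == symbol and n.from_context == context_only]

subtree = [
    Node("CON_Num", "number", from_context=False),          # 'Call that number!'
    Node("CON_Num", "00 1 234 567 89", from_context=True),  # earlier response
]

in_context = lookup(subtree, "CON_Num", context_only=True)
in_sentence = lookup(subtree, "CON_Num", context_only=False)
```

A lookup that ignores the flag would return both nodes for the same symbol, which is the ambiguity described above.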