snipsco/snips-nlu

Non-deterministic behaviour for builtin entity

Garstig opened this issue · 4 comments

Hi,

For a project, I validated my slot extraction with 10-fold cross-validation. For some reason, the results for the builtin entities are not deterministic. Shouldn't they be, since they are rule-based?

Example:

"wo muss ich nach dem mittag hin" (English: "where do I have to go after noon"). Here I labelled "mittag" as snips/datetime. It was identified correctly only 1 time out of 10.

Also, the F1 score I calculated for the snips/datetime slot increased from 75% to 80% when I used synonyms for some other slots.

Am I missing something?

Hi @Garstig ,
The builtin entities rely on builtin parsers, which do indeed behave deterministically. However, these parsers do not directly produce the output of the Snips NLU engine. The core machine learning algorithm that extracts entities uses the builtin entity parsers to provide features, but at this stage they are only features.
This means that if a builtin entity is detected by the underlying builtin parser but doesn't correspond to any slot in the sentence, the NLU algorithm is able to recognize that and correctly discard it.
In practice, these features are very powerful (which is what we want) and help a lot in the extraction of builtin entities.

Parsing errors on builtin entities may have two causes:

  • a builtin entity is extracted, however it spans additional words at the beginning or at the end, resulting in a mismatch with the labelled data (e.g. in the sentence "Book a table at 9pm" the extracted entity is "at 9pm" but the labelled entity was "9pm")
  • the slots containing builtin entities are not properly labelled in the training data and because of that the NLU engine doesn't learn properly how to identify them.

If you want to make sure that you have properly labelled the builtin entities, you can check what is detected by the underlying builtin entity parser by running the following:

>>> from snips_nlu.entity_parser import BuiltinEntityParser
>>> parser = BuiltinEntityParser.build(language="de")
>>> parser.parse("wo muss ich nach dem mittag hin")
[{'value': 'mittag', 'resolved_value': {'kind': 'InstantTime', 'value': '2020-01-16 12:00:00 +01:00', 'grain': 'Hour', 'precision': 'Exact'}, 'entity_kind': 'snips/datetime', 'range': {'start': 21, 'end': 27}}]

Then adjust the training data accordingly if you find any mismatches.
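To spot such mismatches systematically, one option is to compare each labelled span with the ranges returned by the parser. The helper below is only a sketch and is not part of Snips NLU: `classify_span` is a hypothetical name, and it assumes nothing beyond the `entity_kind` and `range` fields visible in the parser output above. The English entry is written by hand to illustrate the "at 9pm" case and was not produced by a real run.

```python
def classify_span(labelled_range, parsed_entities, entity_kind):
    """Compare a labelled (start, end) span with builtin parser output.

    Returns "exact" if the parser found the same span, "boundary_mismatch"
    if it found an overlapping span with different boundaries, and
    "not_parsed" if no entity of that kind overlaps the labelled span.
    """
    start, end = labelled_range
    result = "not_parsed"
    for entity in parsed_entities:
        if entity["entity_kind"] != entity_kind:
            continue
        p_start = entity["range"]["start"]
        p_end = entity["range"]["end"]
        if (p_start, p_end) == (start, end):
            return "exact"
        # Two half-open ranges overlap when each starts before the other ends.
        if p_start < end and start < p_end:
            result = "boundary_mismatch"
    return result


# Parser output copied from the German example above (resolved_value omitted).
parsed_de = [{"value": "mittag", "entity_kind": "snips/datetime",
              "range": {"start": 21, "end": 27}}]

# Hand-written illustration for "Book a table at 9pm"; not a real parser run.
parsed_en = [{"value": "at 9pm", "entity_kind": "snips/datetime",
              "range": {"start": 13, "end": 19}}]

# "mittag" is labelled at characters 21-27, "9pm" at characters 16-19.
print(classify_span((21, 27), parsed_de, "snips/datetime"))  # exact
print(classify_span((16, 19), parsed_en, "snips/datetime"))  # boundary_mismatch
```

Any utterance reported as "boundary_mismatch" or "not_parsed" is a candidate for relabelling.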

I hope this helps.
Best

Hi @adrienball!

Thanks a lot for your help again!

Are any other features used for builtin entities? I don't think so, since in #863 I asked whether builtin entities are expandable and you said no. I just want to make sure I understood you correctly.

Do you have documentation that explains all the features you use for slot filling? The config file lists all the features used by default, but I can't work out what all of them do just from their names. I hope this is not too off-topic for this issue.

Bests
Garstig

The features used in the CRF are documented here.

Builtin entities extracted by Snips NLU are resolved (datetimes are returned with rich content, for instance). For this reason, they must correspond to something parsed by the underlying builtin entity parsers. However, as I said earlier in the thread, this is necessary but not sufficient.
This means that if a string is not parsed by one of our builtin entity parsers, it won't be output by the Snips NLU engine.
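The "necessary but not sufficient" relationship can be pictured with the toy filter below. It is a simplified illustration, not the actual Snips NLU resolution code: `filter_builtin_slots` is a hypothetical name, the candidate slot dictionaries are made up for the example, and matching on an identical character range is an assumption (the real engine may match differently).

```python
def filter_builtin_slots(candidate_slots, parsed_entities):
    """Keep candidate slots that could be resolved.

    A builtin slot (entity name starting with "snips/") is kept only if the
    builtin parser found an entity of the same kind on the same range.
    Custom slots are passed through untouched.
    """
    parsed = {
        (e["entity_kind"], e["range"]["start"], e["range"]["end"])
        for e in parsed_entities
    }
    kept = []
    for slot in candidate_slots:
        key = (slot["entity"], slot["range"]["start"], slot["range"]["end"])
        if not slot["entity"].startswith("snips/") or key in parsed:
            kept.append(slot)
    return kept


# Parser output for "wo muss ich nach dem mittag hin" (from the example above).
parsed = [{"value": "mittag", "entity_kind": "snips/datetime",
           "range": {"start": 21, "end": 27}}]

# Hypothetical candidates proposed by the slot filler.
candidates = [
    {"value": "mittag", "entity": "snips/datetime",
     "range": {"start": 21, "end": 27}},
    {"value": "hin", "entity": "snips/datetime",
     "range": {"start": 28, "end": 31}},
]

# Necessary: "hin" is dropped because the parser did not parse it.
print([s["value"] for s in filter_builtin_slots(candidates, parsed)])  # ['mittag']

# Not sufficient: if the slot filler proposes nothing, the parsed "mittag"
# is not output either.
print(filter_builtin_slots([], parsed))  # []
```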
Best

Thank you very much! This explains a lot.

Have a nice weekend :)