pelias/schema

Correct our use of synonyms for ES6


While testing ES6 support, I ran into the following error while running an OSM import:

type=illegal_argument_exception, reason=startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=9,endOffset=20,lastStartOffset=14 for field 'name.default'

After some digging it appears this error is related to token position offsets created by the Synonym token filter.

There is a very interesting Elastic blog post from 2017 discussing the solution: the new Synonym graph token filter and how to use it to improve how synonym expansion works. We'll need to figure out what the right solution is here for ES6 support.

Overall, the biggest takeaway appears to be this:

To make multi-token synonyms work correctly you must apply your synonyms at query time, not index-time, since a Lucene index cannot store a token graph.
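For reference, a minimal sketch of what the query-time setup described in the blog post could look like, written as Python dicts that serialize to index settings. The `synonym_graph` filter type and the `search_analyzer` mapping parameter are real Elasticsearch features; the names `example_synonym_graph`, `example_index` and `example_search`, and the synonym rule, are hypothetical and are not taken from the Pelias schema.

```python
import json


def build_settings(synonyms):
    """Index settings where synonyms are applied only by the search analyzer."""
    return {
        "analysis": {
            "filter": {
                # hypothetical filter name; "synonym_graph" is the ES filter type
                "example_synonym_graph": {
                    "type": "synonym_graph",
                    "synonyms": list(synonyms),
                }
            },
            "analyzer": {
                # index-time analyzer: no synonym expansion at all
                "example_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
                # query-time analyzer: synonyms expanded as a token graph
                "example_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "example_synonym_graph"],
                },
            },
        }
    }


def build_field_mapping():
    """Field mapping that uses a different analyzer at index and search time."""
    return {
        "type": "text",
        "analyzer": "example_index",
        "search_analyzer": "example_search",
    }


settings = build_settings(["ny, new york"])  # illustrative multi-token rule
mapping = build_field_mapping()
print(json.dumps({"settings": settings, "field": mapping}, indent=2))
```

This only illustrates the shape of the configuration. It does not address the autocomplete analyzers, which expand synonyms at index time.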

Connects pelias/pelias#719

Hmm.. very interesting, I don't think we could get away with doing our synonyms at query-time because of autocomplete.
e.g. if the source data was rd and the user entered roa, then the documents would not match using query-time synonym substitution.
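The rd / roa case can be modelled with a small simulation. This is a simplified sketch, not the actual Pelias analyzers: it assumes a single hypothetical synonym rule (rd, road) and an edge n-gram step that emits every prefix of a token.

```python
# Hypothetical synonym table: "rd" and "road" are interchangeable.
SYNONYMS = {"rd": {"rd", "road"}, "road": {"rd", "road"}}


def expand(token):
    """Return the token together with its synonyms (or just the token)."""
    return SYNONYMS.get(token, {token})


def edge_ngrams(token):
    """All prefixes of the token, as an edge n-gram filter would emit."""
    return {token[:i] for i in range(1, len(token) + 1)}


def index_time_match(source_token, user_input):
    """Synonyms expanded at index time, then every variant is edge n-grammed."""
    indexed = set()
    for variant in expand(source_token):
        indexed |= edge_ngrams(variant)
    return user_input in indexed


def query_time_match(source_token, user_input):
    """Only the source token is n-grammed; synonyms are applied to the query."""
    indexed = edge_ngrams(source_token)
    return any(variant in indexed for variant in expand(user_input))


print(index_time_match("rd", "roa"))   # True: "roa" is a prefix of the indexed "road"
print(query_time_match("rd", "roa"))   # False: "roa" has no synonym and is not a prefix of "rd"
print(query_time_match("rd", "road"))  # True: the complete word "road" maps to "rd"
```

In this model, query-time substitution only helps once the user has typed a complete synonym, which is the autocomplete problem described above.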

I suspect it would also increase the response-time significantly, and also potentially change some behaviour, so it's not a simple thing to change.

It looks like this bug exclusively affects multi-word synonyms, and we have relatively few of those?

I just looked and I couldn't find any multi-token synonyms listed in this repo, any idea which synonym is causing this?

[edit] there are multi-token synonyms in this repo after all! see https://github.com/pelias/schema/pull/388/files

Joxit commented

Query-time synonyms can be cool if we change our synonyms often. But that's not really the case here.
And as @missinglink says, I'm a bit worried about the response time and the CPU and IO usage that query-time synonyms would cause...

ohhh! you know what? I was testing using one of our geocode.earth client configurations. They have multi-token synonyms, so it's still important to consider, but it won't affect "stock" pelias.

okay, so I have tracked down a reproducible testcase: https://gist.github.com/missinglink/8f55271dcf4f5e7e8d0712b1f2c8d742

a simple way to trigger this error is with:

POST http://localhost:9200/pelias/_analyze
{
    "analyzer": "peliasIndexOneEdgeGram",
    "text": "set"
}

The synonym generation goes crazy:

{
    "tokens": [
        {
            "token": "s",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "se",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "set",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "sep",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sept",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septi",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septie",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiem",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiemb",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiembr",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiembre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "setb",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "setbr",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "setbre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepe",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepb",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepbr",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepbre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7b",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7br",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7bre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "br",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "bre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "7",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "7r",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "7re",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "r",
            "start_offset": 1,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "re",
            "start_offset": 1,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "7",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "s",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "se",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "sep",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "r",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "re",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "br",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "bre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}
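To see where this output breaks the rule quoted in the error message, here is a small sketch that scans a token list for the three conditions the message names (non-negative start, end not before start, start offsets never decreasing). The helper `find_offset_violations` is hypothetical, and the sample is a hand-trimmed slice of the response above around the position 1 to position 2 boundary.

```python
def find_offset_violations(tokens):
    """Return (index, token, reason) for each token that breaks the offset rules."""
    violations = []
    last_start = 0
    for i, tok in enumerate(tokens):
        start, end = tok["start_offset"], tok["end_offset"]
        if start < 0:
            reason = "negative start offset"
        elif end < start:
            reason = "end offset before start offset"
        elif start < last_start:
            reason = "offset went backwards"
        else:
            reason = None
        if reason:
            violations.append((i, tok["token"], reason))
        # keep scanning so that later violations are reported too
        last_start = start
    return violations


# Trimmed copy of five consecutive tokens from the _analyze response above.
sample = [
    {"token": "7re", "start_offset": 0, "end_offset": 3, "position": 1},
    {"token": "r", "start_offset": 1, "end_offset": 3, "position": 2},
    {"token": "re", "start_offset": 1, "end_offset": 3, "position": 2},
    {"token": "7", "start_offset": 0, "end_offset": 3, "position": 2},
    {"token": "s", "start_offset": 0, "end_offset": 3, "position": 2},
]

print(find_offset_violations(sample))
# [(3, '7', 'offset went backwards')]
```

The start offset drops from 1 back to 0 at the second "7" token, which is the same pattern as startOffset=9 following lastStartOffset=14 in the original error.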