lior-k/fast-elasticsearch-vector-scoring

Error: binaryEmbeddingReader can't be null

ltung-cit opened this issue · 13 comments

I'm running Elasticsearch in a Docker container with the binary-vector-scoring plugin installed, but I get an intermittent error when searching with the following query:

{
  "function_score": {
    "boost": 1,
    "score_mode": "avg",
    "boost_mode": "multiply",
    "min_score": 0,
    "script_score": {
      "script": {
        "source": "binary_vector_score",
        "lang": "knn",
        "params": {
          "cosine": true,
          "field": "image_embedding",
          "vector": "MY_VECTOR_HERE"
        }
      }
    }
  }
}

The search runs OK for a while (the first dozen or so requests) and then starts returning the following error:

Caused by: java.lang.IllegalStateException: binaryEmbeddingReader can't be null
elasticsearch    | 	at com.liorkn.elasticsearch.script.VectorScoreScript.setBinaryEmbeddingReader(VectorScoreScript.java:67) ~[?:?]
elasticsearch    | 	at com.liorkn.elasticsearch.service.VectorScoringScriptEngineService$1.getLeafSearchScript(VectorScoringScriptEngineService.java:65) ~[?:?]
elasticsearch    | 	at org.elasticsearch.common.lucene.search.function.ScriptScoreFunction.getLeafScoreFunction(ScriptScoreFunction.java:79) ~[elasticsearch-5.6.0.jar:5.6.0]
elasticsearch    | 	at org.elasticsearch.common.lucene.search.function.FunctionScoreQuery$CustomBoostFactorWeight.functionScorer(FunctionScoreQuery.java:140) ~[elasticsearch-5.6.0.jar:5.6.0]
...

Reindexing all documents is the only way to make the search work again. Has anybody faced the same problem?

This error happens when the field ("image_embedding" in your case) does not exist in all the documents you are searching on.

Same error.
I used the field "embedding_vector", and it exists in the document I'm searching on.

Hi @lior-k
The field (image_embedding) also exists in my document.

I have an index with 10 shards, and I noticed that even when the search does return hits, the response JSON includes a shards property reporting failures:

{
  "successful": 3,
  "failed": 7,
  "skipped": 0,
  "total": 10,
  "failures": [
    {
      "node": "ghr7DWYOSWa4tlvZ4kpsFQ",
      "index": "deckito",
      "reason": {
        "reason": "binaryEmbeddingReader can't be null",
        "type": "illegal_state_exception"
      },
      "shard": 0
    }
  ]
}

When setting shards to a low number (below 3), the error occurs more often.
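
Because a search can return hits while some shards fail, it may help to inspect the shard summary on every response. The following is a minimal sketch, assuming the response is the plain dict returned by the Python client and that the summary sits under the "_shards" key, as in the later responses in this thread. The helper name summarize_shard_failures is hypothetical.

```python
def summarize_shard_failures(response):
    """Return (ok, reasons) for an Elasticsearch search response dict.

    ok is True only when no shard failed; reasons is the sorted list of
    distinct failure messages reported under "_shards.failures".
    """
    shards = response.get("_shards", {})
    failures = shards.get("failures", [])
    reasons = sorted(
        {f.get("reason", {}).get("reason", "unknown") for f in failures}
    )
    ok = shards.get("failed", 0) == 0
    return ok, reasons


# Example input shaped like the response quoted above (node id omitted).
response = {
    "_shards": {
        "successful": 3,
        "failed": 7,
        "skipped": 0,
        "total": 10,
        "failures": [
            {
                "index": "deckito",
                "shard": 0,
                "reason": {
                    "reason": "binaryEmbeddingReader can't be null",
                    "type": "illegal_state_exception",
                },
            }
        ],
    },
    "hits": {"hits": []},
}

ok, reasons = summarize_shard_failures(response)
```

Treating any non-zero "failed" count as an error makes the partial failure visible instead of silently returning results from only some shards.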

nabas commented

I have the same problem: the document has the field, but the error still occurs.

Hi @lior-k

This is my mapping:

{
    "settings": {
        "number_of_shards": 10
    },
    "mappings": {
        "slide": {
            "properties": {
                "deck_id": {
                    "type": "keyword",
                    "index": true
                },                
                "number": {
                    "type": "integer",
                    "index": true
                },
                "image_embedding": {
                    "type": "binary",
                    "doc_values": true
                },
                "text": {
                    "type": "text",
                    "index": true
                }
            }
        },
        "searchResult": {
            "properties": {
                "deck_id": {
                    "type": "keyword",
                    "index": true
                },
                "search_timestamp": {
                    "type": "date",
                    "index": true
                }
            }
        }
    }
}

My query:

{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "boost": 1,
            "score_mode": "avg",
            "boost_mode": "multiply",
            "min_score": 0,
            "script_score": {
              "script": {
                "source": "binary_vector_score",
                "lang": "knn",
                "params": {
                  "cosine": true,
                  "field": "image_embedding",
                  "vector": "MY_VECTOR"
                }
              }
            }
          }
        }
      ]
    }
  }
}

MY_VECTOR is something like [0.20438875, 0.087035105, 0.41949105, ...]

I'm using the Python client to search only documents of type slide, which have the field "image_embedding" in all of them:

result = self.client.search(index='deckito', doc_type='slide', from_=0, size=3, body=query, version=True, _source_include=['deck_id', 'number', 'image_embedding'])
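
For reference, the query body passed to client.search above could be assembled as follows. This is only a sketch: the function name build_vector_score_query is hypothetical, and the structure simply mirrors the JSON posted in this comment, with the vector given as a list of floats.

```python
def build_vector_score_query(field, vector, cosine=True):
    """Build the function_score query body shown above as a Python dict."""
    script = {
        "source": "binary_vector_score",
        "lang": "knn",
        "params": {
            "cosine": cosine,
            "field": field,
            "vector": list(vector),  # plain list of floats
        },
    }
    function_score = {
        "boost": 1,
        "score_mode": "avg",
        "boost_mode": "multiply",
        "min_score": 0,
        "script_score": {"script": script},
    }
    return {"query": {"bool": {"should": [{"function_score": function_score}]}}}


# Truncated example vector; a real embedding has many more dimensions.
query = build_vector_score_query(
    "image_embedding", [0.20438875, 0.087035105, 0.41949105]
)
```

Taking the field name as a parameter keeps the query and the mapping in sync from one place, which matters here because a mismatched field name is one of the causes reported later in this thread.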

Please run the following query to check that all the documents have a value in this field.
It should return 0 documents:

GET <es-url>/<index>/_search
{
    "query": {
        "bool" : {
            "must" : {
                "script" : {
                    "script" : {
                        "inline": "doc.image_embedding == null || doc.image_embedding.value == null || doc.image_embedding.value == ''",
                        "lang": "painless"
                     }
                }
            }
        }
    }
}
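
The same check can be generated for any field name. The sketch below (build_missing_value_query is a hypothetical helper) uses bracket access, doc['...'], instead of dot access, on the assumption that some field names contain a hyphen, such as "embedding-vector", which Painless would otherwise parse as a subtraction.

```python
def build_missing_value_query(field):
    """Build the script query above for an arbitrary field name.

    Bracket access (doc['name']) is used so that field names containing
    a hyphen are not parsed as a subtraction by Painless.
    """
    ref = "doc['{}']".format(field)
    script = "{0} == null || {0}.value == null || {0}.value == ''".format(ref)
    return {
        "query": {
            "bool": {
                "must": {
                    "script": {
                        "script": {"inline": script, "lang": "painless"}
                    }
                }
            }
        }
    }


body = build_missing_value_query("embedding-vector")
```

With the Python client this body would be passed as client.search(index=..., body=body), and the check passes when the reported hit total is 0.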

Hi @lior-k

I am also getting the same error:

{
  "took" : 33,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 4,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 3,
        "index" : "indexvectors",
        "node" : "Q5VeFkIvQh6KLS6PQsUg2w",
        "reason" : {
          "type" : "illegal_state_exception",
          "reason" : "binaryEmbeddingReader can't be null"
        }
      }
    ]
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

My data looks like:

{
  "indexvectors" : {
    "aliases" : { },
    "mappings" : {
      "vectordocs" : {
        "properties" : {
          "embedding-vector" : {
            "type" : "binary",
            "doc_values" : true
          },
          "id" : {
            "type" : "text"
          },
          "vector" : {
            "type" : "text"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1524853637835",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "76m277CESNiYnovi6n6Q8A",
        "version" : {
          "created" : "5060099"
        },
        "provided_name" : "indexvectors"
      }
    }
  }
}

I have added just one record and used that record's vector field in the query to get KNN with k=1. Ideally the query should have returned the record present in the index, but instead I got the error above. Could you help me out here?

Hi @lior-k

I ran the query you posted in 3 different ways and it returned the following results (note I have 2 document types: slide and searchResult and the property image_embedding is only declared for type slide):

  • <es-url>/<index>/_search -> 0 documents, which is weird because none of the documents of type searchResult have the field image_embedding.

  • <es-url>/<index>/slide/_search -> 0 documents, makes sense because all documents of type slide have the field image_embedding populated.

  • <es-url>/<index>/searchResult/_search -> 0 documents, which is weird because none of the documents of type searchResult have the field image_embedding.

I was able to get the issue resolved by following lior-k's suggestion and making sure that 0 docs are returned for the query mentioned. I am able to get the KNN docs now using the plugin. Thanks @lior-k :-)

I fixed my templates and reindexed, and now it finally works.
Before the fix, the field names in my templates and in my documents were different; they must be the same.
I had also defined the "embedding_vector" field as "text", but it must be "binary".
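
Both mistakes (a field name that differs from the mapping, and a field mapped as "text") can be caught by inspecting the index mapping before querying. This is a rough sketch, assuming the input is a dict shaped like the index output posted earlier in this thread (index name, then "mappings", then document type, then "properties"). The function name check_vector_field is hypothetical, and the doc_values check simply follows the mappings shown above, which all set it to true.

```python
def check_vector_field(mapping_response, index, doc_type, field):
    """Return a list of problems with the vector field's mapping (empty if fine)."""
    properties = mapping_response[index]["mappings"][doc_type].get("properties", {})
    spec = properties.get(field)
    problems = []
    if spec is None:
        # Typical cause: template and documents use different field names.
        problems.append("field %r is not in the mapping" % field)
        return problems
    if spec.get("type") != "binary":
        problems.append(
            "field %r has type %r, expected 'binary'" % (field, spec.get("type"))
        )
    if spec.get("doc_values") is not True:
        problems.append("field %r does not set doc_values to true" % field)
    return problems


# Mapping excerpt copied from the "indexvectors" output earlier in the thread.
mapping = {
    "indexvectors": {
        "mappings": {
            "vectordocs": {
                "properties": {
                    "embedding-vector": {"type": "binary", "doc_values": True},
                    "id": {"type": "text"},
                    "vector": {"type": "text"},
                }
            }
        }
    }
}

good = check_vector_field(mapping, "indexvectors", "vectordocs", "embedding-vector")
wrong_name = check_vector_field(mapping, "indexvectors", "vectordocs", "embedding_vector")
wrong_type = check_vector_field(mapping, "indexvectors", "vectordocs", "vector")
```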

Good to hear. Closing the issue.

Also struggling with this problem. The plugin works in production, but when I use elasticdump to copy the data to a local server I start getting "binaryEmbeddingReader can't be null".

elasticdump --input=./account_mapping.json --output=http://localhost:9200/account --type=mapping
elasticdump --input=./account.json --output=http://localhost:9200/account --type=data

In this state my vector searches fail entirely. If I inspect the mapping, my field is mapped correctly. If I run the painless query above, I find 0 records. If I reindex my documents, things start working on most of the shards.

POST http://localhost:9200/_reindex
{
  "source": {
    "index": "account"
  },
  "dest": {
    "index": "tmp"
  }
}

Then I do a second _reindex to rename from tmp back to account. My queries start working now. However, I still see exceptions firing in the ES server, and the _shards section of my query response shows 3 successful and 2 failed shards:

"_shards": {
        "total": 5,
        "successful": 3,
        "skipped": 0,
        "failed": 2,
        "failures": [
            {
                "shard": 0,
                "index": "account",
                "node": "HlfEVuX_TbO8u6GXu47REQ",
                "reason": {
                    "type": "illegal_state_exception",
                    "reason": "binaryEmbeddingReader can't be null"
                }
            }
        ]
    },

Update:
After about 15 minutes and a few reboots, the two buggy shards started working and I am getting 5/5 successful now. So if anyone else has the same problem: import, reindex, and then wait a while for the shards to rebuild.
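
The "wait a while" step can be scripted rather than checked by hand. Below is a minimal sketch, assuming run_search is any zero-argument callable that executes the vector query and returns the response dict (for example, a lambda wrapping the Python client's search call). The function name, retry count, and delay are arbitrary illustrations, not values taken from this thread.

```python
import time


def wait_for_all_shards(run_search, attempts=10, delay_seconds=30.0, sleep=time.sleep):
    """Re-run a search until no shard reports a failure.

    Returns the number of attempts it took, or raises RuntimeError if
    shards are still failing after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        shards = run_search().get("_shards", {})
        if shards.get("failed", 0) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # give the shards time to recover
    raise RuntimeError("shards still failing after %d attempts" % attempts)
```

The sleep function is passed in as a parameter so that the loop can be exercised without real delays.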