koursaros-ai/nboost

Do I need to re-index an existing ElasticSearch index?

Closed this issue · 8 comments

I have an existing large index inside my ElasticSearch (~million documents, some of them pretty long).
I would like to use it with nboost, but avoid costly re-indexing and creating a csv file.

Is it possible, or do I need to use the nboost-index tool every time I want to work with new data?

No, you don't need to reindex when using nboost.
nboost works as a proxy that sits between the client requesting the data and Elasticsearch.

When the request is sent:

  • It's sent to nboost
  • Nboost redirects it to elasticsearch which generates the results
  • Nboost receives the results from elasticsearch
  • Then it re-ranks the results with the neural net of your choice
  • And then returns the re-ranked results to the user
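
The re-ranking step in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not nboost's actual code: `rerank_hits` and `overlap_score` are hypothetical names, the sample hits are made up, and the word-overlap scorer is a toy stand-in for the neural model.

```python
from typing import Callable, Dict, List


def rerank_hits(query: str, hits: List[Dict], text_field: str,
                score: Callable[[str, str], float], topk: int) -> List[Dict]:
    """Re-order the upstream hits by a relevance score and keep the top k."""
    # One candidate text per hit, taken from a single _source field.
    texts = [hit["_source"][text_field] for hit in hits]
    scores = [score(query, text) for text in texts]
    # Indices of the hits, best score first.
    ranks = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
    return [hits[i] for i in ranks[:topk]]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a neural model: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Hypothetical hits, shaped like the "hits.hits" array of a search response.
hits = [
    {"_id": "1", "_source": {"text": "hotels in paris"}},
    {"_id": "2", "_source": {"text": "test flight to rome"}},
    {"_id": "3", "_source": {"text": "a test of the test suite"}},
]

reranked = rerank_hits("test suite", hits, "text", overlap_score, topk=2)
print([hit["_id"] for hit in reranked])
```

The upstream search engine still decides which documents are candidates; the proxy only changes their order and truncates the list, which is why no re-indexing is involved.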

So the only change you need to make is to send your requests to nboost instead of sending them directly to Elasticsearch.

In that case I think there is an issue with my proxy. Should I open a new issue? I have an index with Polish Wikipedia (I want to use it with the default tinybert model, which probably doesn't support Polish, just to check whether it works at all), and when I query it directly:

curl "localhost:9200/wikipedia/_search?pretty&q=text:test&size=1"

it gives results as expected, however when I try to do it through nboost:

curl "localhost:8000/wikipedia/_search?pretty&q=text:test"

then all I get is an empty list of hits

{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 4, "successful": 4, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 6019, "relation": "eq" },
    "max_score": 10.853271,
    "hits": []
  },
  "nboost": { "scores": [] }
}

What could be happening? Is it possible that the model thinks none of the results are valid and thus returns none? It works just fine with the travel index provided with nboost. Here is the command I use to run nboost:

nboost \
  --uhost localhost \
  --uport 9200 \
  --query_path url.query.q \
  --topk_path url.query.size \
  --default_topk 10 \
  --choices_path body.hits.hits \
  --cvalues_path _source.passage \
  --search_route "/wikipedia/_search"

Are your Wikipedia texts located at _source.passage? You are providing --cvalues_path _source.passage but your texts might be located in a different path.
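
One quick way to check where the text lives is to look at the `_source` keys of a hit returned by Elasticsearch directly. Below is a minimal sketch: `source_text_fields` is a hypothetical helper (not part of nboost), and `sample_response` is made-up data shaped like a search response, so in practice you would pass in the parsed JSON from your own index.

```python
from typing import Dict, List


def source_text_fields(response: Dict) -> List[str]:
    """Return candidate --cvalues_path values: every string field of the first hit."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return []
    source = hits[0].get("_source", {})
    # Only string fields are useful as re-ranking text.
    return sorted("_source." + key for key, value in source.items()
                  if isinstance(value, str))


# Hypothetical response; replace with the JSON returned by your own index.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"title": "Test", "text": "A test is ...", "views": 12}}
        ]
    }
}

print(source_text_fields(sample_response))
```

Any path it prints is a candidate for `--cvalues_path`; a path that is not in the list (such as `_source.passage` for this sample) would not match anything in the hits.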

No, I have two fields in the _source attribute of each hit: title and text.
I changed the setting to the default, --cvalues_path _source.*, but it didn't help.
Now I get the following error:

Traceback (most recent call last):
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/proxy.py", line 123, in proxy_through
    plugin.on_response(response, db_row)
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/plugins/rerank/base.py", line 39, in on_response
    reranked_choices = [response.choices[rank] for rank in ranks]
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/plugins/rerank/base.py", line 39, in <listcomp>
    reranked_choices = [response.choices[rank] for rank in ranks]
IndexError: list index out of range

I think nboost should be configured to re-rank based on a single text field. I'd suggest changing the default --cvalues_path _source.* to --cvalues_path _source.text
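
A plausible explanation for the IndexError, sketched below under assumptions: if a wildcard path collects more than one text per hit, the rank indices are computed over more texts than there are hits, so indexing the hit list with them can overflow. This is a guess at the mechanism based on the `response.choices[rank] for rank in ranks` line in the traceback, not a reading of nboost's source; `extract_cvalues` and `rerank` are hypothetical names and the scoring is a toy.

```python
# Two hypothetical hits, each with two text fields, as in the thread.
hits = [
    {"_source": {"title": "A", "text": "alpha"}},
    {"_source": {"title": "B", "text": "beta"}},
]


def extract_cvalues(hits, field=None):
    """Collect candidate texts: one field per hit, or every field when field is None."""
    values = []
    for hit in hits:
        source = hit["_source"]
        values.extend(source.values() if field is None else [source[field]])
    return values


def rerank(hits, cvalues):
    """Rank positions in cvalues (toy score: longer text first), then index hits with them."""
    ranks = sorted(range(len(cvalues)), key=lambda i: len(cvalues[i]), reverse=True)
    return [hits[rank] for rank in ranks]


# Single field: 2 texts for 2 hits, so every rank is a valid hit index.
ok = rerank(hits, extract_cvalues(hits, "text"))

# Wildcard (assumed behaviour): 4 texts for 2 hits, so ranks 2 and 3 overflow.
try:
    rerank(hits, extract_cvalues(hits, None))
    wildcard_error = None
except IndexError as exc:
    wildcard_error = exc

print(len(ok), repr(wildcard_error))
```

Pointing the path at one field keeps the number of texts equal to the number of hits, which matches the fix suggested above.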

Yes that worked, thank you!! So does it mean that nboost doesn't support multi-match queries?

Yeah, it supports all the queries Elasticsearch does, because it forwards the queries to Elasticsearch, but the re-ranking can only be done on one text field.

Thank you very much, I'm closing :)