Computing BM25 Similarity for 30 Queries and 85,000 Documents
codingnoobneedshelp opened this issue · 7 comments
Hello,
the BM25 code from the book is not working for large datasets.
File "C:\Users\xxx\Anaconda2\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
It would be great if someone could change the code so that it works in my case. I'm currently trying this myself, but no success so far :(.
Thanks
What are you trying to do for 85K documents? You haven't mentioned much about the problem you are trying to solve.
The similarity chapter is all about showing how the algorithms are actually implemented, with the math behind them. If you want to scale this out for similarity, search, and information retrieval, consider using a more scalable solution like Elasticsearch, which uses BM25 in the backend, rather than writing it in raw Python.
Yes, this belongs to the information retrieval chapter. I want to do learning to rank, but first I need to calculate the features, and BM25 is one of them. So I just run each query against the document corpus to get the scores. It should still be doable to change the function somehow so that I don't get the error, right?
The BM25 code is just a mathematical function that has been converted into Python code based on the formula. It could be that the NumPy feature matrices are not fitting in your system's RAM. It's still not clear which line of the code is throwing the memory error, though.
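For reference, the standard Okapi form of BM25 (presumably what the chapter's code implements, possibly with minor variations) scores a document $D$ against a query $Q$ as

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ is the frequency of query term $q_i$ in $D$, $\mathrm{avgdl}$ is the average document length, and $k_1$ and $b$ are the usual free parameters.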
In general, the line `corpus_features = corpus_features.toarray()` can be moved outside the function to prevent it from eating up all the RAM on each query (basically, generate the dense matrix just once instead of regenerating it inside the function every time you make a query); see the sketch below.
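To make that concrete, here is a minimal self-contained sketch of the "densify once" pattern. This is not the book's exact function: the toy corpus, the queries, the `CountVectorizer` features, and the standard Okapi BM25 scorer below are all stand-ins for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the quick brown fox", "jumped over the lazy dog", "the dog barked"]
queries = ["quick fox", "lazy dog"]

vectorizer = CountVectorizer()
# Densify ONCE, up front, so the big allocation happens a single
# time instead of once per query inside the scoring function.
corpus_features = vectorizer.fit_transform(corpus).toarray()

doc_lengths = corpus_features.sum(axis=1)          # words per document
avgdl = doc_lengths.mean()                         # average document length
N = corpus_features.shape[0]                       # number of documents
df = (corpus_features > 0).sum(axis=0)             # document frequency per term
idf = np.log((N - df + 0.5) / (df + 0.5) + 1.0)    # BM25 idf (kept positive)
k1, b = 1.5, 0.75                                  # usual free parameters

def bm25_scores(query):
    # Score every document against one query, reusing the dense matrix.
    mask = vectorizer.transform([query]).toarray().ravel() > 0
    tf = corpus_features[:, mask]
    denom = tf + k1 * (1 - b + b * doc_lengths[:, None] / avgdl)
    return (idf[mask] * tf * (k1 + 1) / denom).sum(axis=1)

for q in queries:
    print(q, bm25_scores(q))
```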
But if you want to solve this problem in the real world, for ranking/querying 85K documents consider using Elasticsearch, which is more efficient and the right way to get similar documents and rankings (and you can even tune the algorithm in the backend based on constructs in the DSL queries).
Thanks for your help. Yes, the error is because of this code: `corpus_features = corpus_features.toarray()`
Could you maybe post how the code would look with the changes you suggest?
Big thanks
I'm actually working on the 2nd revision of this book for Python 3.x, so I'm a bit busy restructuring and reworking the code for the different chapters, since some things will change and all the code will also be ported over to Python 3.
You just need to put that line of code in your main code file/segment and not call it repeatedly in the function where BM25 is defined. Then, assuming you have enough RAM, it should work.
But like I said, prefer Elasticsearch for these kinds of problems.
ok...
If I put the code outside and just execute it, I still get the MemoryError, and I have 32 GB of RAM.
Is Elasticsearch easy to use?
It is better to build an index over the 85K documents instead of repeatedly materializing a matrix in Python for all the queries.
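A rough back-of-the-envelope shows why even 32 GB isn't enough here (the vocabulary size is an assumption; a corpus of 85K documents can easily have 50,000+ unique terms, and NumPy stores them as 8-byte float64 values by default):

$$85{,}000 \text{ docs} \times 50{,}000 \text{ terms} \times 8 \text{ bytes} \approx 34\ \text{GB}$$

That is the dense matrix alone, before the OS, Python itself, and any per-query copies are accounted for, so it simply cannot fit.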
Elasticsearch is very easy to learn and use: https://www.elastic.co/products/elasticsearch
There is also a Python client for it to use on top of Elasticsearch: https://elasticsearch-py.readthedocs.io/en/master/
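For a rough idea of the workflow, here is a minimal sketch using that client. The index name, field name, and toy data are made up for illustration; the `body=` call style shown matches the 7.x client (other versions differ slightly), and it assumes an Elasticsearch node running locally on the default port.

```python
from elasticsearch import Elasticsearch

corpus = ["the quick brown fox", "jumped over the lazy dog", "the dog barked"]
queries = ["quick fox", "lazy dog"]

es = Elasticsearch()  # assumes a node at localhost:9200

# Index the documents once; Elasticsearch scores matches with BM25 by default.
for i, doc in enumerate(corpus):
    es.index(index="docs", id=i, body={"text": doc})
es.indices.refresh(index="docs")  # make the new documents searchable immediately

# Run each query against the index and read back the BM25 scores.
for query in queries:
    res = es.search(index="docs", body={"query": {"match": {"text": query}}})
    for hit in res["hits"]["hits"]:
        print(query, hit["_id"], hit["_score"])
```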