Reduce the memory footprint associated with the ProcessedDocBatch

Question

Opened this issue 2 months ago · 0 comments

Right now, we run document processing and indexing on two different actors in order to take advantage of multithreading. (Side note: @PSeitz observed that multithreading does not actually happen because of the unstealable FIFO slot. The latter will be solved once we merge.)

The document processing takes a batch of docs in the form of JSON strings and parses each of them into a TantivyDocument.

The TantivyDocument is a structured object in which every string/object field is its own allocation.
This considerably inflates both the memory footprint (probably due to pointers and memory alignment) and the number of allocations.

In addition, the in-flight memory counter currently accounts for the memory of a TantivyDoc as if it were equal to the length of the JSON document, which is inaccurate.
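To make the gap between JSON length and real footprint concrete, here is a rough sketch of how the memory owned by a structured document could be estimated. `Value`, `estimated_bytes` and `allocation_count` are hypothetical names for illustration, not tantivy types or APIs, and the estimate ignores allocator bookkeeping, so it is only a lower bound.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a parsed document value (NOT tantivy's actual type).
pub enum Value {
    Str(String),
    U64(u64),
    Object(Vec<(String, Value)>),
}

/// Rough lower bound on the bytes owned by a value: inline size plus heap buffers.
/// Allocator overhead (headers, size-class rounding) is not counted.
pub fn estimated_bytes(value: &Value) -> usize {
    size_of::<Value>() + heap_bytes(value)
}

fn heap_bytes(value: &Value) -> usize {
    match value {
        Value::Str(text) => text.capacity(),
        Value::U64(_) => 0,
        Value::Object(entries) => {
            // The Vec buffer stores the inline part of every (key, value) slot.
            let buffer = entries.capacity() * size_of::<(String, Value)>();
            let nested: usize = entries
                .iter()
                .map(|(key, child)| key.capacity() + heap_bytes(child))
                .sum();
            buffer + nested
        }
    }
}

/// Number of separate heap allocations owned by a value.
/// Buffers with zero capacity do not allocate.
pub fn allocation_count(value: &Value) -> usize {
    match value {
        Value::Str(text) => usize::from(text.capacity() > 0),
        Value::U64(_) => 0,
        Value::Object(entries) => {
            let own = usize::from(entries.capacity() > 0);
            let nested: usize = entries
                .iter()
                .map(|(key, child)| usize::from(key.capacity() > 0) + allocation_count(child))
                .sum();
            own + nested
        }
    }
}
```

Even for a tiny document, the estimate is well above the length of the equivalent JSON string, because every key and every string value carries its own pointer, length and capacity on top of its bytes.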

We want a solution that reduces the memory footprint and, ideally, measures it more accurately.

A monkey patch that simply serializes the tantivy doc in the docprocessor and deserializes it in the indexer proved that the benefit was significant.

Tantivy now has a Document trait. It would be good to have an implementation that works over a serialized buffer.
Working off a zero-copy solution straight from JSON is interesting in the long term, but I think an intermediary solution where we replace the tantivy document with something that allows zero-copy (rkyv, something ad hoc, or whatever we can come up with) would be a simple and large win.
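As a minimal sketch of the "something ad hoc" option, the following stores all fields of a document in one contiguous buffer and reads them back as borrowed slices. `FlatDoc`, `FlatDocIter` and the byte layout are assumptions made up for illustration. The sketch handles text fields only and does not implement tantivy's Document trait, it just shows the buffer idea.

```rust
/// Hypothetical flat document: all fields live in a single allocation.
/// Layout per field: field id (u32 LE), value length (u32 LE), UTF-8 bytes.
#[derive(Default)]
pub struct FlatDoc {
    buffer: Vec<u8>,
}

impl FlatDoc {
    /// Appends a text field. The sketch assumes values shorter than 4 GiB.
    pub fn add_text(&mut self, field_id: u32, text: &str) {
        self.buffer.extend_from_slice(&field_id.to_le_bytes());
        self.buffer.extend_from_slice(&(text.len() as u32).to_le_bytes());
        self.buffer.extend_from_slice(text.as_bytes());
    }

    /// Exact payload size, which an in-flight memory counter could use directly.
    pub fn num_bytes(&self) -> usize {
        self.buffer.len()
    }

    /// Iterates over the fields, borrowing each string from the buffer.
    pub fn fields(&self) -> FlatDocIter<'_> {
        FlatDocIter { remaining: &self.buffer }
    }
}

pub struct FlatDocIter<'a> {
    remaining: &'a [u8],
}

impl<'a> Iterator for FlatDocIter<'a> {
    type Item = (u32, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        let data: &'a [u8] = self.remaining;
        if data.len() < 8 {
            return None;
        }
        let field_id = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
        let len = u32::from_le_bytes([data[4], data[5], data[6], data[7]]) as usize;
        let rest = &data[8..];
        if rest.len() < len {
            // Truncated buffer: stop iterating instead of panicking.
            return None;
        }
        let (value, tail) = rest.split_at(len);
        self.remaining = tail;
        // Invalid UTF-8 also ends the iteration in this sketch.
        std::str::from_utf8(value).ok().map(|text| (field_id, text))
    }
}
```

With this shape, a document costs one allocation regardless of its number of fields, and `num_bytes()` gives a figure that matches what is actually held in memory much more closely than the JSON length does. A real implementation would also need typed values, nested objects and proper error reporting.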