Findwise/Hydra

Bulk off-load processed documents


Solr (and other search engines) will accept documents in bulk. Wouldn't it be awesome if Hydra were able to offer that?

For performance reasons?

Although I agree, I think a good way to think about this is to set some sort of predefined performance goal. That way, we can see where the bottlenecks are.

One possible metric is that Hydra should be within an order of magnitude of the Solr indexing throughput for a "typical" pipeline.
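A minimal sketch of how such a goal could be checked, assuming "within an order of magnitude" means the pipeline reaches at least one tenth of the baseline rate. `ThroughputGoal` and its methods are hypothetical names for illustration, not part of Hydra or Solr, and the measurements themselves would have to come from real runs.

```java
/** Hypothetical helper (not part of Hydra): compares pipeline throughput to a baseline. */
final class ThroughputGoal {
    private ThroughputGoal() {}

    /** Documents per second for a run that handled `documents` in `elapsedNanos`. */
    static double docsPerSecond(long documents, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return documents * 1_000_000_000.0 / elapsedNanos;
    }

    /**
     * Assumption: "within an order of magnitude" is read as
     * pipelineRate >= baselineRate / 10.
     */
    static boolean withinOrderOfMagnitude(double pipelineRate, double baselineRate) {
        return pipelineRate * 10.0 >= baselineRate;
    }
}
```

Timing the same document set once through plain Solr indexing and once through the full pipeline (for example with `System.nanoTime()`) would give the two rates to compare.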

My hypothesis is that adding some sort of buffering to the Solr output in order to do bulk output would in fact do nothing for the overall throughput. But that would need to be proven.
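For reference, the kind of buffering under discussion could look roughly like the sketch below. It is only an illustration under assumptions: `BatchingBuffer` is a hypothetical class, not Hydra's output stage, and the `sink` stands in for whatever bulk call the search engine client offers (with SolrJ that could be an `add` taking a collection of documents).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Hydra): collects items and hands them off in batches. */
final class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // assumed bulk hand-off, e.g. one request per batch
    private final List<T> pending;

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.pending = new ArrayList<>(batchSize);
    }

    /** Buffers one item and flushes as soon as the batch is full. */
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /**
     * Sends a copy of everything buffered. The buffer is cleared only after the
     * sink returns, so a failing sink leaves the items pending for a retry.
     */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    /** Number of items currently held in memory. */
    synchronized int pendingCount() {
        return pending.size();
    }
}
```

A small `batchSize` bounds how much is held in memory, and `flush()` would need to be called on shutdown so that buffered documents are not left behind. Whether any of this improves end-to-end throughput is exactly what would have to be measured.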

This functionality previously existed in SolrOutputStage, but was removed because it added complexity without actually speeding things up, for the reasons Petter lists. The stage would end up hogging RAM for very chatty pipelines (I'm looking at you, TikaStage), and lots of documents would end up in bad states if it crashed.

Most of the problems with it were due more to poor coding and design than anything else, though, so a fresh look at it might be in order.