Using LeafReader only for first segment
SOLR4189 opened this issue · 4 comments
Hi,
I'm trying to use LUWAK as an UpdateProcessor in SOLR. I noticed that LUWAK only matches the first X documents (for example, from a DocumentBatch of 3000 docs it matched only 163). I debugged the LUWAK code and traced the problem to the following code:
    private static class MultiDocumentBatch extends DocumentBatch {
        . . .
        private LeafReader build(IndexWriter writer) throws IOException {
            . . .
            writer.forceMerge(1);
            LeafReader reader = DirectoryReader.open(directory).leaves().get(0).reader();
I changed this code to:
    LeafReader reader = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(directory));
and it works for all docs in the batch (but very slowly).
Has anyone else faced this problem? Do you have another solution?
I don't use MultiDocumentBatch, but did you try adding a writer.commit() after writer.forceMerge(1)?
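If it helps, the patched build() would look something like this (an untested sketch of the snippet above; only the commit() line is new):

    private LeafReader build(IndexWriter writer) throws IOException {
        // ... add the batch's documents to the writer ...
        writer.forceMerge(1);
        writer.commit(); // make the merged segment visible to newly opened readers
        return DirectoryReader.open(directory).leaves().get(0).reader();
    }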
You are right. Now it works. So it is a bug in LUWAK 1.5, the version I use.
This is interesting, because as I understand it, forceMerge should only return after the new segments are committed. I also can't reproduce this in a test - creating a batch of 10000 identical docs and then running a query over them returns all 10000 in the batched result. Do you think you could post a reproducible test case so that I can work out what's going on?
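Something along these lines would be ideal (a sketch using the 1.x batch API as I recall it; the field name, query, and analyzer are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import uk.co.flax.luwak.*;
    import uk.co.flax.luwak.matchers.SimpleMatcher;
    import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
    import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

    public void testLargeBatch() throws Exception {
        try (Monitor monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher())) {
            monitor.update(new MonitorQuery("q1", "test"));
            List<InputDocument> docs = new ArrayList<>();
            for (int i = 0; i < 3000; i++) {
                docs.add(InputDocument.builder("doc" + i)
                        .addField("text", "this is a test document", new StandardAnalyzer())
                        .build());
            }
            // every doc in the batch should match q1
            Matches<QueryMatch> matches = monitor.match(DocumentBatch.of(docs), SimpleMatcher.FACTORY);
            int matchedDocs = 0;
            for (DocumentMatches<QueryMatch> dm : matches) {
                matchedDocs++;
            }
            System.out.println("docs with matches: " + matchedDocs); // expect 3000
        }
    }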
I can't publish my code, but I'll try to explain how I use LUWAK. I wrapped LUWAK in a SOLR UpdateProcessor:
- In processAdd (called once for each document in the bulk), I convert the SolrInputDocument to a Lucene document and add the result to a list of Lucene documents.
- In finish (called once at the end of the bulk), I build a LUWAK DocumentBatch from that list, pass it to the monitor's match function, and write the results to a file in the format <docId,queryId>. A simplified sketch of the processor follows this list.
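Roughly like this (a simplified sketch, not my real code; the id field, analyzers, and output handling are placeholders):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import uk.co.flax.luwak.*;
    import uk.co.flax.luwak.matchers.SimpleMatcher;

    public class LuwakUpdateProcessor extends UpdateRequestProcessor {

        private final Monitor monitor; // configured and shared elsewhere
        private final List<InputDocument> docs = new ArrayList<>();

        public LuwakUpdateProcessor(Monitor monitor, UpdateRequestProcessor next) {
            super(next);
            this.monitor = monitor;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument sdoc = cmd.getSolrInputDocument();
            InputDocument.Builder builder = InputDocument.builder((String) sdoc.getFieldValue("id"));
            // ... add each Solr field to the builder with the appropriate analyzer ...
            docs.add(builder.build());
            super.processAdd(cmd);
        }

        @Override
        public void finish() throws IOException {
            // build one DocumentBatch for the whole bulk and match it against the monitor
            Matches<QueryMatch> matches = monitor.match(DocumentBatch.of(docs), SimpleMatcher.FACTORY);
            // ... write <docId,queryId> pairs to a file ...
            super.finish();
        }
    }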
So when does the bug happen? When I build the LUWAK DocumentBatch. I debugged this code and saw that DirectoryReader.open(directory).leaves().get(0).reader() got only 163 documents from the batch, and the remaining docs of the batch were in DirectoryReader.open(directory).leaves().get(1).reader(), i.e. writer.forceMerge(1) didn't merge the segments.
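For example, a quick check like this (a hypothetical debug snippet placed inside build(), reusing its directory) shows the segment split:

    // print how many segments the batch index has, and how many docs each holds
    try (DirectoryReader dr = DirectoryReader.open(directory)) {
        System.out.println("segments: " + dr.leaves().size());
        for (LeafReaderContext ctx : dr.leaves()) {
            System.out.println("leaf " + ctx.ord + ": " + ctx.reader().numDocs() + " docs");
        }
    }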
I don't know why you can't reproduce this in a test - maybe the issue is the document size? I have 3000 docs in the batch, 5-15 KB each.
P.S. mjustice3's solution works for me.