dbmdz/solr-ocrhighlighting

Using different fields for OCR file path and OCR text

krminta opened this issue · 6 comments

Hi,
I have an application that already has a Solr field holding the documents' text. I would like to integrate solr-ocrhighlighting with it without indexing the extracted OCR again.
Let's say I have a body field that stores text data. I extract the OCR text into that field for search purposes. A few components already use this field, so I can't change it. I defined another field, ocr_file_path, that is supposed to act as a source pointer for the OCR file.

Now I want to search in the body field and extract the coordinates via ocr_file_path. The problem is that when I do that, the OCR highlighting compares the field definitions and returns an empty result, because these are different fields. I tried a lot of different configurations and query definitions, but unfortunately it doesn't work the way I want it to.

Could you please consider introducing a feature that allows the plugin to be used that way?
Or, if this is already possible, could you show me how to do it?

That's unfortunately not possible. For the highlighting to work, the terms in your existing OCR field would have to carry their positions in the actual OCR markup (i.e. the byte offset where every token begins). I assume that you extract the text from the OCR before you index it into Solr, so the terms in your existing field will have positions that match the extracted text and not the underlying markup.
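
To make the offset mismatch concrete, here is a minimal Python sketch. It assumes a made-up, heavily simplified markup (the `<w box="...">` element is hypothetical and is not real hOCR, ALTO or MiniOCR), and it does not reproduce the plugin's actual analysis code. It only shows that the same token starts at a different byte offset in the markup than in the extracted text.

```python
import re

# Hypothetical, heavily simplified OCR markup (NOT a real OCR format).
OCR_MARKUP = (
    '<line><w box="10 20 50 40">Hello</w> '
    '<w box="60 20 110 40">world</w></line>'
)


def extract_text(markup: str) -> str:
    """Strip all tags, keeping only the plain text an external extractor would index."""
    return re.sub(r"<[^>]+>", "", markup)


def offsets_in_markup(markup: str) -> dict:
    """Map each word to the byte offset of its first character inside the markup."""
    data = markup.encode("utf-8")
    return {
        m.group(1).decode("utf-8"): m.start(1)
        for m in re.finditer(rb">([^<\s]+)<", data)
    }


def offsets_in_text(text: str) -> dict:
    """Map each word to the byte offset of its first character inside the plain text."""
    data = text.encode("utf-8")
    return {m.group().decode("utf-8"): m.start() for m in re.finditer(rb"\S+", data)}


plain = extract_text(OCR_MARKUP)
print(plain)                          # Hello world
print(offsets_in_markup(OCR_MARKUP))  # {'Hello': 27, 'world': 59}
print(offsets_in_text(plain))         # {'Hello': 0, 'world': 6}
```

A field indexed from the plain text only knows the second set of offsets, which say nothing about where a word sits in the OCR file.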

Your best bet is to store the OCR text twice: Once in your existing field and once in a field that works with the plugin. Assuming that the extracted text is the same, you can then search and highlight with the plugin-field, and use the stored text from the other field for your existing code.

Sorry for the late response, I missed the notification for the first reply.

I'm extracting the text from the OCR file using OcrCharFilterFactory. The extracted data goes into the body field. I have a source pointer to the OCR file in the other field, so shouldn't it be possible to use that source pointer to extract highlights directly from the file?
I'm not sure if I understood it correctly, but doesn't it actually say that I don't need to store the OCR in the index? You could say that this is how I'm doing it.
I have the text extracted from the OCR file stored in the body field, and the plugin doesn't even need it, because the plugin can use the source pointer to access the OCR file directly. And if it actually needs the indexed file content, then it can use body, as it is extracted the same way as ocr_text from the example (the field definition doesn't use the filters, but I apply them manually for text extraction, and the field attributes are nearly the same as well).

My current (working) configuration is the one you described (an additional plugin field with indexed OCR), but I would like to optimize the index a little bit.

Sorry, I seem to have misunderstood your setup!

So if I understand you correctly, this is your actual setup:
You have one field, body, that has an OcrCharFilterFactory in its analysis chain and that you feed the OCR directly to.

Question: Is that field configured to store the data (i.e. stored="true" in the schema)?

If that is the case, you don't actually need the whole source pointer setup. You can directly use your existing body field for highlighting, i.e. point hl.ocr.fl to it, and you should be good to go.

It is not required to store OCR files on disk (and use the ExternalUtf8ContentFilterFactory) to get highlighting for OCR fields, as long as the OCR data is stored in the index.
Using source pointers to files on disk in addition to that is not going to optimize much, since reading the data from disk is slower than getting it from the index itself. The main motivation for using external files is to keep the index size small when storing many documents, which is moot in your case since you're already storing the OCR in the index.

Yeah, the body field is both indexed and stored. But it doesn't have OcrCharFilterFactory in its analysis chain. I'm calling that filter externally in a kind of text extractor, and the extracted text is then stored in the body field. As I said in my first message, the body field has some other uses, so I can't insert additional filters.
From what I have seen, OcrCharFilterFactory strips the OCR data and returns the cleaned text, i.e. just the text without coordinates. This is exactly what I need for the body field. This is why I wanted to use an additional field with a source pointer for highlighting: to get the coordinates and width/height.

Sorry if I'm explaining it unclearly. I said "indexed file content" when I meant indexed text... So I don't have the OCR data stored in the index, just the text without the OCR data.
Right now I'm using the config from the example with the ocr_text field (so there is a body field with stored and indexed text, and ocr_text with text and OCR data). What I would like to have working is:
a body field with the extracted OCR text (without coordinates) for search purposes, plus an additional field, e.g. ocr_file_path, and that is all.
It will be slower to read the OCR file from disk instead of the index, but I don't want to increase the index size unnecessarily.

OK, then I understood correctly and my original answer still stands :-)

Your best bet is to store the OCR text twice: Once in your existing field and once in a field that works with the plugin. Assuming that the extracted text is the same, you can then search and highlight with the plugin-field, and use the stored text from the other field for your existing code.

You can only make use of the highlighting function in this plugin if you have the OcrCharFilterFactory in the analysis chain for the field you want to highlight, since the highlighting requires that the term offsets point to the original OCR and not to the extracted text.
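
As a rough illustration of why the offsets have to point into the original OCR, here is a small Python sketch. The markup format and the `box_at` helper are hypothetical and simplified, and this is not how the plugin is implemented internally. It only shows that an offset into the markup lands inside a word element that carries coordinates, while an offset into the extracted text does not.

```python
import re

# Hypothetical, simplified OCR markup; real OCR formats look different.
OCR = b'<line><w box="10 20 50 40">Hello</w> <w box="60 20 110 40">world</w></line>'

# One match per word element: four box coordinates plus the word text.
WORD = re.compile(rb'<w box="(\d+) (\d+) (\d+) (\d+)">([^<]*)</w>')


def box_at(markup: bytes, offset: int):
    """Return the box of the word whose text spans `offset`, or None if there is none."""
    for m in WORD.finditer(markup):
        if m.start(5) <= offset < m.end(5):
            return tuple(int(m.group(i)) for i in range(1, 5))
    return None


# Offset of "world" in the markup, as an OCR-aware analysis chain would record it.
markup_offset = OCR.index(b"world")
# Offset of "world" in the extracted plain text, as a plain text field records it.
text_offset = b"Hello world".index(b"world")

print(box_at(OCR, markup_offset))  # (60, 20, 110, 40)
print(box_at(OCR, text_offset))    # None
```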

For you this means: You will have to introduce a second field that uses ExternalUtf8ContentFilterFactory (to highlight from files on disk) and OcrCharFilterFactory. As a result, you then have two fields that index the same text (body and your new ocr_from_disk field). You then use the new one for searching and highlighting, and the old one for whatever other purposes you currently have for it. There'll be some overhead due to having to store twice the number of postings (depending on how your body field is set up), but that's effectively the most space-efficient way to use this highlighting plugin with your setup.
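
A query against such a two-field setup could be built roughly as follows. This is a sketch under assumptions: the host, port, core name `mycore` and the `id` field are made up for illustration, and `hl=true` is the generic Solr highlighting switch, which may or may not be required depending on the plugin version. Only `hl.ocr.fl` and the field names body and ocr_from_disk come from this thread.

```python
from urllib.parse import urlencode

# Assumption for illustration only: host, port and core name are made up.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_highlight_query(term: str, ocr_field: str = "ocr_from_disk") -> str:
    """Build a select URL that searches and highlights on the OCR-aware field."""
    params = {
        "q": f"{ocr_field}:{term}",  # search on the new OCR-aware field
        "fl": "id,body",             # still return the stored text of the old field
        "hl": "true",                # generic Solr highlighting switch
        "hl.ocr.fl": ocr_field,      # field the plugin should highlight
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_highlight_query("solr"))
```

The existing body field stays untouched and keeps serving the components that already depend on it.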

tl;dr: You cannot use your existing body field with the plugin (since it's missing said OcrCharFilterFactory in its analysis chain).

OK, so the current setup is the best one I can actually use in this case. Thanks for the clarification 🥇