opensemanticsearch/open-semantic-etl

If OCR runs in a later stage, only start (re)processing with OCR enabled if the document is an image or contains embedded images

opensemanticsearch opened this issue · 1 comment

For better performance and to avoid unnecessary tasks: if OCR runs in a later stage, only add the task that starts (re)processing with OCR if the content type is an image or the document contains embedded image(s).
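The check described above could look roughly like the following. This is only a minimal sketch, not the actual Open Semantic ETL code: the function names `needs_ocr` and `schedule_ocr_if_needed`, the `embedded_image_count` parameter, and the plain list used as a task queue are all assumptions made for illustration.

```python
def needs_ocr(content_type, embedded_image_count=0):
    """Return True if a later OCR stage could find anything to recognize.

    Hypothetical helper: OCR is only worthwhile if the file itself is an
    image or the document contains at least one embedded image.
    """
    is_image = (content_type or "").lower().startswith("image/")
    return is_image or embedded_image_count > 0


def schedule_ocr_if_needed(path, content_type, embedded_image_count, queue):
    """Append an OCR (re)processing task only when it is needed.

    `queue` is a plain list standing in for the real task queue (assumption).
    Returns True if a task was added.
    """
    if needs_ocr(content_type, embedded_image_count):
        queue.append(path)
        return True
    return False
```

With this shape, a plain text file or a PDF without embedded images never produces a second task, while an image file or a PDF with embedded images does.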

Implemented a fake Tesseract CLI wrapper in the repo tesseract-ocr-cache (which is now a submodule of Open Semantic ETL), so we get more status information before the real OCR runs.

The plugin enhance_extract_tika_server uses this status to set the OCR status / disable further OCR plugins.
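As a rough illustration of how such a status could be used to disable further OCR plugins, here is a sketch under stated assumptions. The status dictionary, its `images_found` key, the function name `disable_ocr_plugins`, and the plugin names in `ocr_plugins` are placeholders for illustration and are not taken from the actual implementation.

```python
def disable_ocr_plugins(plugins, ocr_status,
                        ocr_plugins=("enhance_ocr", "enhance_pdf_ocr")):
    """Return the plugin list with OCR plugins removed when OCR is pointless.

    `ocr_status` is a hypothetical dict reported by the first extraction
    pass; `images_found` (assumed key) says whether OCR was requested for
    any image at all. The names in `ocr_plugins` are placeholders.
    """
    if ocr_status.get("images_found", False):
        # Images were seen, so keep the OCR plugins for the later stage.
        return list(plugins)
    # Nothing to recognize: drop the OCR plugins, keep the original order.
    return [name for name in plugins if name not in ocr_plugins]
```

The input list is never modified, so the caller can still compare the original and the filtered configuration.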