If OCR runs in a later stage, only start (re)processing with OCR enabled if the document is an image or contains embedded images
opensemanticsearch opened this issue · 1 comment
opensemanticsearch commented
To improve performance and avoid unnecessary tasks: if OCR runs in a later stage, only add the task that starts (re)processing with OCR if the content type is an image or the document contains embedded image(s).
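The check proposed here could look roughly like the sketch below. It is only an illustration: the function name `needs_ocr_task` and the `embedded_image_count` parameter are hypothetical and not taken from the Open Semantic ETL code base, and it assumes the first extraction pass already knows the content type and how many embedded images it saw.

```python
def needs_ocr_task(content_type: str, embedded_image_count: int = 0) -> bool:
    """Return True only if a later OCR pass could find something to recognise.

    Hypothetical helper: both parameters are assumed to come from the
    first (non-OCR) extraction pass.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()

    # The file itself is an image (image/png, image/tiff, ...).
    if mime.startswith("image/"):
        return True

    # Otherwise OCR only makes sense if the document embeds images.
    return embedded_image_count > 0
```

With a check like this, a plain-text file or a PDF without images would never get a second (re)processing task queued, while a scanned PDF or a PNG still would.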
Mandalka commented
Implemented a fake Tesseract CLI wrapper in the tesseract-ocr-cache repo (now a submodule of Open Semantic ETL), so we get more status information before the real OCR runs.
The enhance_extract_tika_server plugin uses this status to set the OCR status and to disable further OCR plugins.
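For readers unfamiliar with the idea, here is a minimal sketch of how a fake Tesseract wrapper of this kind might work. It is not the code from tesseract-ocr-cache: the function names, the `status_dir` parameter and the `ocr_requested.log` marker file are all assumptions made for illustration. The only thing it relies on from the real CLI is the usual call form `tesseract <image> <outputbase> [options]`, where the recognised text is written to `<outputbase>.txt`.

```python
from pathlib import Path


def fake_tesseract(argv, status_dir):
    """Stand in for `tesseract <image> <outputbase> [options]` without doing OCR.

    Hypothetical sketch: records that OCR was requested for an image and
    writes an empty result file so the caller does not fail.
    """
    if len(argv) < 3:
        return 1  # not enough arguments for an OCR call

    image, output_base = argv[1], argv[2]

    # Write an empty text file where the real tool would put its result.
    Path(output_base + ".txt").write_text("", encoding="utf-8")

    # Remember that an image was handed over for OCR (assumed marker file).
    status_path = Path(status_dir)
    status_path.mkdir(parents=True, exist_ok=True)
    with open(status_path / "ocr_requested.log", "a", encoding="utf-8") as log:
        log.write(image + "\n")
    return 0


def ocr_needed(status_dir) -> bool:
    """True if the fake wrapper was called at least once, i.e. images exist."""
    marker = Path(status_dir) / "ocr_requested.log"
    return marker.exists() and marker.stat().st_size > 0
```

In this sketch, a plugin running after extraction could call `ocr_needed(...)` and skip the later OCR plugins when it returns `False`, because the extractor never asked for OCR on any image.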