openkm/document-management-system

PDF Text extraction fails on 6.3.12

aldemira opened this issue · 5 comments

I just reverted to 6.3.9 and it works flawlessly. On 6.3.12 I tried rebuilding the indexes, etc., but I see errors saying that text extraction failed, so search returns nothing at all.

I've checked and it works fine. Please provide a sample PDF to test.

OK, let's do this: I can't do a fresh install of 6.3.12 right now, so I'll close this issue. Whenever I can, I'll install a fresh copy and test it. Thanks.

Sorry, I have to reopen this issue. I've just installed 6.3.12 from scratch (with docker-compose), and here are the logs I'm getting:

2022-09-27 11:25:00,105 [Thread-181] INFO c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=d5a22248-29ae-4d42-aadc-551b810049e4, docPath=/okm:root/Video/intro-linux.pdf, docVerUuid=22de99e1-cb51-42e4-a67f-ff3da8064686, date=Tue Sep 27 11:22:47 UTC 2022}
2022-09-27 11:25:00,854 [Thread-181] WARN c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-09-27 11:25:00,855 [Thread-181] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Video/intro-linux.pdf': Too few text extracted
2022-09-27 11:30:00,067 [Thread-208] INFO com.openkm.core.UserMailImporter - *** User mail importer activated ***
2022-09-27 11:30:00,085 [Thread-209] INFO c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=8f5e2b68-cbd1-45b6-a5f9-68fa46855fce, docPath=/okm:root/14F-Intro to Python-3.3.pdf, docVerUuid=e08e3c13-43ab-449d-9b9a-1a3fa891f6ed, date=Tue Sep 27 11:27:57 UTC 2022}
2022-09-27 11:30:00,088 [Thread-209] WARN c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-09-27 11:30:00,089 [Thread-209] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/14F-Intro to Python-3.3.pdf': Too few text extracted

The files I've tested are:

https://www.tug.ca/tec/Sessions/Handouts/PDF/14F-Intro%20to%20Python-3.3.pdf
https://tldp.org/LDP/intro-linux/intro-linux.pdf

6.3.9 doesn't have this problem.
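
For anyone checking a longer log for the same symptom, here is a minimal Python sketch that pulls out the documents whose extraction failed. It assumes the WARN line from com.openkm.dao.NodeDocumentDAO always has the shape seen in the excerpt above; the function name failed_extractions is made up for this example and is not part of OpenKM.

```python
import re

# Assumption: the failure line always looks like the NodeDocumentDAO WARN
# entries in the log excerpt above. Other OpenKM versions may word it differently.
FAILURE = re.compile(
    r"WARN\s+com\.openkm\.dao\.NodeDocumentDAO - "
    r"There was a problem extracting text from '(?P<path>.+)': (?P<reason>.+)$"
)


def failed_extractions(lines):
    """Return (document path, reason) pairs, one per extraction failure."""
    failures = []
    for line in lines:
        match = FAILURE.search(line)
        if match:
            failures.append((match.group("path"), match.group("reason").strip()))
    return failures


# Two lines copied from the log excerpt above.
sample = [
    "2022-09-27 11:25:00,854 [Thread-181] WARN c.o.extractor.CuneiformTextExtractor - Undefined OCR application",
    "2022-09-27 11:25:00,855 [Thread-181] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Video/intro-linux.pdf': Too few text extracted",
]

for path, reason in failed_extractions(sample):
    print(f"{path}: {reason}")
```

The "Undefined OCR application" warning is skipped on purpose, since only the NodeDocumentDAO line names the document that ended up unindexed.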

I feel kind of ashamed, but I think I forgot to delete the local volume (tomcat), which was the issue this time. I reinstalled again, and now search and text extraction work. Sorry for spamming your inbox (yet again).

Anyway, if you have this kind of problem again, check the list of text extractors, because you may have collisions.

Best regards.