OCR

Question

Opened this issue 6 years ago · 5 comments

Do the documents need to be OCRed prior to uploading?

Answer 1 · 2019-03-11T22:26:32.000Z

No, they don't. We have Apache Tika embedded, which uses Google Tesseract under the hood for OCR.

Answer 2 · 2019-03-19T23:09:32.000Z

I just attempted a clean and reinstall and tried loading a doc that was not OCRed.

I got this error:

`Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Celery task id: fc37ca52-d218-4cdd-9a49-69bb95381e06

Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Start task "Load Documents", id=None
Kwargs: {'project': {'model': 'project.project', 'pk': 1}, 'source_data': '/', 'source_type': 'agreements', 'document_type': {'model': 'document.documenttype', 'pk': '68f992f1-dba3-4dc0-a815-4d868b23c5b4'}, 'detect_contract': True, 'delete': False, 'run_standard_locators': True, 'user_id': 1, 'metadata': {'result_links': [{'name': 'View Document List', 'link': 'document:document-list'}, {'name': 'View Text Unit List', 'link': 'document:text-unit-list'}]}, 'task_id': 'fc37ca52-d218-4cdd-9a49-69bb95381e06'}
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Parse / at NginxFileAccess: http://contrax-nginx:80/media/data/documents/
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Detected 1 files. Added 1 subtasks.
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Load Documents: starting 1 sub-tasks...
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:25 | End of main task "Load Documents", id=None. Sub-tasks may be still running.
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:25 | Trying TIKA for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
ERROR 2019-03-19 23:07:26 | TIKA returned too small text for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:26 | Trying Textract for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:26 | Caught exception while trying to parse file with Textract: JS#52732.PDF
Traceback (most recent call last):
  File "/contraxsuite_services/apps/task/tasks.py", line 597, in try_parsing_with_textract
    return textract2text(file_path, ext=ext), 'textract'
  File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 116, in textract2text
    text = process(path, ext=ext, method='tesseract', language=language)
  File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 99, in process
    filetype_module = importlib.import_module(rel_module, 'textract.parsers')
  File "/contraxsuite_services/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`
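
The log shows a two-stage extraction: Tika is tried first, and Textract is tried when Tika returns too little text. A minimal sketch of that fallback pattern follows. The names `extract_with_fallback` and `MIN_TEXT_LENGTH` and the threshold value are hypothetical, and the parsers are passed in as plain callables; this is not ContraxSuite's actual code.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical threshold: the log only says "too small text", not the real limit.
MIN_TEXT_LENGTH = 100

# A parser is a (name, callable) pair; the callable takes a path and returns text.
Parser = Tuple[str, Callable[[str], str]]


def extract_with_fallback(
    file_path: str,
    parsers: List[Parser],
    min_length: int = MIN_TEXT_LENGTH,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each parser in order and return (text, parser_name) for the first usable result.

    Returns (None, None) when every parser fails or returns too little text.
    """
    for name, parse in parsers:
        try:
            text = parse(file_path)
        except Exception as exc:
            # A failing parser is reported and skipped so the next one can run.
            print(f"Caught exception while trying to parse file with {name}: {exc!r}")
            continue
        if text and len(text.strip()) >= min_length:
            return text, name
        print(f"{name} returned too small text for file: {file_path}")
    return None, None
```

With this shape, a scanned PDF that yields no text from the first parser and an import error from the second ends with `(None, None)`, which matches the failure seen in the log.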

Answer 3 · 2019-03-20T15:50:56.000Z

It looks like there is an issue with Tesseract in the latest version. I did a full clean reinstall of 1.1.9 and keep getting `ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`, even on previously OCRed text.
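
The thread does not confirm a root cause. One possible reading of the traceback is that the parser module name is derived from the file extension without lowercasing it, so `JS#52732.PDF` resolves to `PDF_parser` rather than `pdf_parser`. The sketch below only illustrates that hypothesis; `parser_module_name` and `PARSER_SUFFIX` are hypothetical names, and the suffix is inferred from the module name in the error message.

```python
import os

# Assumption: inferred from 'textract.parsers.PDF_parser' in the traceback.
PARSER_SUFFIX = "_parser"


def parser_module_name(file_path: str, normalize: bool = True) -> str:
    """Return a relative module name such as '.pdf_parser' for the file's extension."""
    _, ext = os.path.splitext(file_path)
    if normalize:
        # Lowercase the extension so '.PDF' and '.pdf' map to the same module.
        ext = ext.lower()
    return ext + PARSER_SUFFIX


print(parser_module_name("JS#52732.PDF", normalize=False))  # .PDF_parser
print(parser_module_name("JS#52732.PDF"))                   # .pdf_parser
```

If this reading is right, uploading the same file with a lowercase `.pdf` extension would be a quick way to check it.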

Answer 4 · 2019-03-20T15:53:10.000Z

If it helps, here is the output of `docker service ls`:

ub5b48qsfg0s  contraxsuite_contrax-celery              global      1/1  lexpredict/lexpredict-contraxsuite:latest
ngb0mq80ze6g  contraxsuite_contrax-celery-beat         replicated  1/1  lexpredict/lexpredict-contraxsuite:latest
lzbuwjlkxfx4  contraxsuite_contrax-curator_filebeat    replicated  1/1  stefanprodan/es-curator-cron:latest
pn8w3ejqmsuf  contraxsuite_contrax-curator_metricbeat  replicated  0/0  stefanprodan/es-curator-cron:latest
p928pz2n09ym  contraxsuite_contrax-db                  replicated  1/1  postgres:9.6
tmpz5r4tkhcb  contraxsuite_contrax-elasticsearch       replicated  1/1  docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
w8nwy98y4rlj  contraxsuite_contrax-filebeat            global      1/1  docker.elastic.co/beats/filebeat:6.2.4
ir5yt9t1kg47  contraxsuite_contrax-flower              replicated  0/0  lexpredict/lexpredict-contraxsuite:latest
pock348z204w  contraxsuite_contrax-jupyter             replicated  1/1  lexpredict/lexpredict-contraxsuite:latest
seulb1l7wcya  contraxsuite_contrax-kibana              replicated  1/1  docker.elastic.co/kibana/kibana-oss:6.2.4
us12mggxpgz5  contraxsuite_contrax-logrotate           global      1/1  tutum/logrotate:latest
m3cwbg5xibfj  contraxsuite_contrax-metricbeat          replicated  0/0  docker.elastic.co/beats/metricbeat:6.2.4
l4d2wnujj4gw  contraxsuite_contrax-nginx               replicated  1/1  nginx:stable  *:80->8080/tcp, *:443->4443/tcp
lqo0l3ubbsz7  contraxsuite_contrax-rabbitmq            replicated  1/1  rabbitmq:3-management
uul2xgwxo17u  contraxsuite_contrax-tika                global      1/1  lexpredict/tika-server:latest
azlhtr3dv8nn  contraxsuite_contrax-uwsgi               replicated  1/1  lexpredict/lexpredict-contraxsuite:latest

Answer 5 · 2019-03-20T15:59:26.000Z

Sorry for the frequent updates. I did confirm that running OCR locally on the document and re-uploading it allowed the standard Load Documents task to function correctly. So something seems amiss with the Tesseract OCR process.