LexPredict/lexpredict-contraxsuite

OCR


Do the documents need to be OCRed prior to uploading?

I just did a clean reinstall and tried loading a document that had not been OCRed.

I got this error:

`Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Celery task id: fc37ca52-d218-4cdd-9a49-69bb95381e06

Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Start task "Load Documents", id=None
Kwargs: {'project': {'model': 'project.project', 'pk': 1}, 'source_data': '/', 'source_type': 'agreements', 'document_type': {'model': 'document.documenttype', 'pk': '68f992f1-dba3-4dc0-a815-4d868b23c5b4'}, 'detect_contract': True, 'delete': False, 'run_standard_locators': True, 'user_id': 1, 'metadata': {'result_links': [{'name': 'View Document List', 'link': 'document:document-list'}, {'name': 'View Text Unit List', 'link': 'document:text-unit-list'}]}, 'task_id': 'fc37ca52-d218-4cdd-9a49-69bb95381e06'}
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Parse / at NginxFileAccess: http://contrax-nginx:80/media/data/documents/
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Detected 1 files. Added 1 subtasks.
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:24 | Load Documents: starting 1 sub-tasks...
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06
INFO 2019-03-19 23:07:25 | End of main task "Load Documents", id=None. Sub-tasks may be still running.
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:25 | Trying TIKA for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
ERROR 2019-03-19 23:07:26 | TIKA returned too small text for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:26 | Trying Textract for file: JS#52732.PDF
Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda
INFO 2019-03-19 23:07:26 | Caught exception while trying to parse file with Textract: JS#52732.PDF
Traceback (most recent call last):
File "/contraxsuite_services/apps/task/tasks.py", line 597, in try_parsing_with_textract
return textract2text(file_path, ext=ext), 'textract'
File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 116, in textract2text
text = process(path, ext=ext, method='tesseract', language=language)
File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 99, in process
filetype_module = importlib.import_module(rel_module, 'textract.parsers')
File "/contraxsuite_services/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`
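
One detail in the traceback may be worth checking: the missing module is named `textract.parsers.PDF_parser`, with the uploaded file's uppercase `.PDF` extension, while the textract package names its PDF parser module `pdf_parser`. A possible cause, not confirmed against the ContraxSuite source, is that the extension is passed to `importlib.import_module` without being lowercased. The sketch below only illustrates that idea; `parser_module_name` and `load_parser` are hypothetical helpers, not functions from ContraxSuite or textract.

```python
import importlib
import os


def parser_module_name(file_path: str) -> str:
    """Return the relative parser module name for a file, e.g. '.pdf_parser'.

    Hypothetical helper: it mimics the naming scheme visible in the
    traceback (extension + '_parser') and lowercases the extension first.
    """
    _, ext = os.path.splitext(file_path)  # 'JS#52732.PDF' -> '.PDF'
    # Module lookup is case-sensitive, so '.PDF_parser' and '.pdf_parser'
    # are different names; normalize before building the module name.
    return ext.lower() + "_parser"


def load_parser(file_path: str, package: str = "textract.parsers"):
    """Import the parser module for file_path relative to `package`.

    Assumes the textract package is installed; raises ModuleNotFoundError
    if no parser module exists for the extension.
    """
    return importlib.import_module(parser_module_name(file_path), package)
```

If this guess is right, the same file uploaded with a lowercase `.pdf` extension would get past the import step, which would be a quick way to confirm or rule it out.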

It looks like there is an issue with the Tesseract OCR path in the latest version. I did a full clean reinstall of 1.1.9 and keep getting `ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`, even on previously OCRed text.

If it helps, here is the output of `docker service ls`:

| ID | NAME | MODE | REPLICAS | IMAGE | PORTS |
| --- | --- | --- | --- | --- | --- |
| ub5b48qsfg0s | contraxsuite_contrax-celery | global | 1/1 | lexpredict/lexpredict-contraxsuite:latest | |
| ngb0mq80ze6g | contraxsuite_contrax-celery-beat | replicated | 1/1 | lexpredict/lexpredict-contraxsuite:latest | |
| lzbuwjlkxfx4 | contraxsuite_contrax-curator_filebeat | replicated | 1/1 | stefanprodan/es-curator-cron:latest | |
| pn8w3ejqmsuf | contraxsuite_contrax-curator_metricbeat | replicated | 0/0 | stefanprodan/es-curator-cron:latest | |
| p928pz2n09ym | contraxsuite_contrax-db | replicated | 1/1 | postgres:9.6 | |
| tmpz5r4tkhcb | contraxsuite_contrax-elasticsearch | replicated | 1/1 | docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4 | |
| w8nwy98y4rlj | contraxsuite_contrax-filebeat | global | 1/1 | docker.elastic.co/beats/filebeat:6.2.4 | |
| ir5yt9t1kg47 | contraxsuite_contrax-flower | replicated | 0/0 | lexpredict/lexpredict-contraxsuite:latest | |
| pock348z204w | contraxsuite_contrax-jupyter | replicated | 1/1 | lexpredict/lexpredict-contraxsuite:latest | |
| seulb1l7wcya | contraxsuite_contrax-kibana | replicated | 1/1 | docker.elastic.co/kibana/kibana-oss:6.2.4 | |
| us12mggxpgz5 | contraxsuite_contrax-logrotate | global | 1/1 | tutum/logrotate:latest | |
| m3cwbg5xibfj | contraxsuite_contrax-metricbeat | replicated | 0/0 | docker.elastic.co/beats/metricbeat:6.2.4 | |
| l4d2wnujj4gw | contraxsuite_contrax-nginx | replicated | 1/1 | nginx:stable | `*:80->8080/tcp, *:443->4443/tcp` |
| lqo0l3ubbsz7 | contraxsuite_contrax-rabbitmq | replicated | 1/1 | rabbitmq:3-management | |
| uul2xgwxo17u | contraxsuite_contrax-tika | global | 1/1 | lexpredict/tika-server:latest | |
| azlhtr3dv8nn | contraxsuite_contrax-uwsgi | replicated | 1/1 | lexpredict/lexpredict-contraxsuite:latest | |

Sorry for the frequent updates. I confirmed that running OCR locally on the document and re-uploading it allowed the standard Load Documents task to complete correctly, so something seems amiss in the Tesseract OCR step.