How to install ocrmypdf?
I'm running Windows 10. The current directions are not detailed enough. Please try to install ocrmypdf and use it with marker without getting errors. Where the hell do I find the root folder of marker if I installed it with the command below and not through some kind of Visual Studio environment (#316)?
- Installed marker:
pip install marker-pdf
- Installed ocrmypdf, following https://github.com/VikParuchuri/marker/blob/master/docs/install_ocrmypdf.md:
winget install -e --id Python.Python.3.11
winget install -e --id UB-Mannheim.TesseractOCR
installed ghostscript
python3 -m pip install ocrmypdf
- set two variables for ocrmypdf to be used:
set OCR_ALL_PAGES=true
set OCR_ENGINE=ocrmypdf
- trying to launch marker_single:
marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
- Getting errors.
5.1 Trying to resolve:
Introduced a new variable:
set TESSDATA_PREFIX="C:\Program Files\Tesseract-OCR\tessdata"
- Errors again, frustration starts and hopefully ends here (with your help).
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
Detecting bboxes: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.75s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
  File "C:\Users\wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
    pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
    new_pages = tesseract_recognition(doc, ocr_idxs, langs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
    pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
    new_doc = pdfium.PdfDocument(f.name)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).
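A side note on the TESSDATA_PREFIX step: in cmd.exe, `set NAME="value"` stores the double quotes as part of the value, so the variable may not hold the path you expect. Whether that has anything to do with the error above is not established here. The following is only a rough sketch for checking the value from Python; `check_tessdata_prefix` is a made-up helper name, not part of marker or Tesseract.

```python
import os
from pathlib import Path


def check_tessdata_prefix(value):
    """Return a list of human-readable problems with a TESSDATA_PREFIX value.

    Hypothetical helper for illustration; not part of marker or Tesseract.
    """
    if value is None or value == "":
        return ["TESSDATA_PREFIX is not set"]
    problems = []
    # cmd.exe's `set NAME="value"` keeps the quotes inside the value.
    if value.startswith('"') or value.endswith('"'):
        problems.append("value contains literal double quotes")
    cleaned = value.strip('"')
    if not Path(cleaned).is_dir():
        problems.append(f"directory does not exist: {cleaned}")
    return problems


if __name__ == "__main__":
    for problem in check_tessdata_prefix(os.environ.get("TESSDATA_PREFIX")):
        print("Problem:", problem)
```

If the check reports literal quotes, the usual cmd.exe spelling is to quote the whole assignment instead: set "TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR\tessdata"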
In CMD I wrote (found it in #162):
pip show marker-pdf
Got this (the Location line is what I needed):
Name: marker-pdf
Version: 0.3.10
Summary: Convert PDF to markdown with high speed and accuracy.
Home-page: https://github.com/VikParuchuri/marker
Author: Vik Paruchuri
Author-email: github@vikas.sh
License: GPL-3.0-or-later
Location: C:\Users\Wasup\miniconda3\Lib\site-packages
Requires: filetype, ftfy, pdftext, Pillow, pydantic, pydantic-settings, python-dotenv, rapidfuzz, regex, surya-ocr, tabled-pdf, tabulate, texify, torch, tqdm, transformers
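The same lookup can probably be done from Python itself. This is a minimal sketch using only the standard library; `package_dir` is a hypothetical helper name, and it assumes the import name is `marker`, which is what the traceback paths suggest.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the folder an importable top-level package lives in, or None.

    Hypothetical helper for illustration only.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, origin points at its __init__.py.
    return Path(spec.origin).parent


if __name__ == "__main__":
    # Assumes marker-pdf is installed in the interpreter running this script.
    print(package_dir("marker"))
```

Run it with the same Python that `pip` installed into, otherwise it may report None or a different copy.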
Went to C:\Users\Wasup\miniconda3\Lib\site-packages\marker and found the undocumented settings.py file. Opened it with Notepad++, found the relevant lines, and filled in the TESSDATA_PREFIX value:
OCR_PARALLEL_WORKERS: int = 2 # How many CPU workers to use for OCR
TESSERACT_TIMEOUT: int = 20 # When to give up on OCR
TESSDATA_PREFIX: str = "C:\Program Files\Tesseract-OCR\tessdata"
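One caveat about the TESSDATA_PREFIX line: in an ordinary Python string literal a backslash starts an escape sequence. Python warns about unknown ones such as `\P`, and `\t` is silently read as a tab, so the value is not the path it looks like. A short sketch of the difference and of spellings that keep the path intact (whether Tesseract accepts the forward-slash form is an assumption I have not checked):

```python
# "\t" is a valid escape (a tab), so this tail silently changes meaning.
broken_tail = "\tessdata"

# Spellings that keep the path as written.
raw_path = r"C:\Program Files\Tesseract-OCR\tessdata"        # raw string
escaped_path = "C:\\Program Files\\Tesseract-OCR\\tessdata"  # doubled backslashes
forward_path = "C:/Program Files/Tesseract-OCR/tessdata"     # forward slashes

if __name__ == "__main__":
    print(repr(broken_tail))         # shows the tab escape, not a backslash + t
    print(raw_path == escaped_path)  # both spell the same string
```

Also note that edits made inside site-packages are likely to be lost when the package is upgraded or reinstalled.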
Now at least marker recognizes that there is some kind of TESSDATA_PREFIX:
C:\Users\Wasup>marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
C:\Users\Wasup\miniconda3\Lib\site-packages\marker\settings.py:59: SyntaxWarning: invalid escape sequence '\P'
  **TESSDATA_PREFIX: str = "C\Program Files\Tesseract-OCR\tessdata"**
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
Detecting bboxes: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.33s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
    pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
    new_pages = tesseract_recognition(doc, ocr_idxs, langs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
    pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
    new_doc = pdfium.PdfDocument(f.name)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).
Launched cmd as administrator and the problem (PdfiumError: Failed to load document (PDFium: File access error)) was fixed.
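For anyone who hits the same File access error and would rather not run everything as administrator, here is one thing that might be worth checking. The failing call is `pdfium.PdfDocument(f.name)`, which hints that a temporary file is being opened by name. I have not confirmed how marker creates that file, so treat this as a guess. What is documented for Python is that on Windows a file created with `tempfile.NamedTemporaryFile` generally cannot be opened a second time by name while it is still open. A minimal sketch of the usual workaround, with `write_then_reopen` as a made-up name:

```python
import os
import tempfile


def write_then_reopen(data: bytes) -> bytes:
    """Write data to a temp file, close it, then read it back by path.

    Hypothetical illustration; this is not marker's code.
    """
    # delete=False so the file survives close() and can be reopened by name.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release the handle before anything else opens the path
        with open(tmp.name, "rb") as reader:
            return reader.read()
    finally:
        if not tmp.closed:
            tmp.close()
        os.unlink(tmp.name)  # manual cleanup, since delete=False


if __name__ == "__main__":
    print(write_then_reopen(b"%PDF-1.4 example bytes"))
```

The point of the sketch is only the order of operations: write, close, then let the second reader open the path.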