MacOS uses Tesseract and not Tesseract-OCR
Closed · 2 comments
avigoen commented
Description of the bug
pymupdf/__init__.py in ?(tessdata)
17818 # Unix-like systems:
17819 cp = subprocess.run("whereis tesseract-ocr", shell=1, capture_output=1, check=0, text=True)
17820 response = cp.stdout.strip().split()
17821 if cp.returncode or len(response) != 2: # if not 2 tokens: no tesseract-ocr
> 17822 raise RuntimeError("No tessdata specified and Tesseract is not installed")
17823
17824 # search tessdata in folder structure
17825 dirname = response[1] # contains tesseract-ocr installation folder
RuntimeError: No tessdata specified and Tesseract is not installed
How to reproduce the bug
PyMuPDF installation command:
uv add pymupdf
Issue:
for page in doc:
    textPage = page.get_textpage_ocr()
    print(textPage.extract_text())
Running the above script raises the RuntimeError shown above.
On macOS, Tesseract is installed with brew install tesseract, and Homebrew has no tesseract-ocr package, so the whereis tesseract-ocr lookup finds nothing.
Tesseract Installation Proof:
tesseract: /opt/homebrew/bin/tesseract
tesseract-ocr:
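The whereis output above suggests that a lookup accepting either executable name would cover both naming conventions. The following is only a minimal sketch of that idea, not the code PyMuPDF uses or the fix that later shipped. The helper name find_tesseract and its which parameter are hypothetical, introduced here for illustration.

```python
import shutil


def find_tesseract(names=("tesseract", "tesseract-ocr"), which=shutil.which):
    """Return the path of the first executable found on PATH, or None.

    `names` lists the candidate executable names to try in order.
    `which` is injectable for testing; it defaults to shutil.which.
    Both the function and its parameters are hypothetical helpers,
    not part of PyMuPDF.
    """
    for name in names:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_tesseract()
    print(found if found else "Tesseract executable not found on PATH")
```

Using shutil.which instead of shelling out to whereis avoids parsing command output and also works on Windows, though it only checks PATH and says nothing about where tessdata lives.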
PyMuPDF version
1.26.1
Operating system
MacOS
Python version
3.12
JorjMcKie commented
You know that you can fix this by either directly providing the folder name of tessdata or setting the appropriate environment variable (before starting your script)?
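For readers on an affected version, the workaround described above could look roughly like this. It is a sketch under assumptions: resolve_tessdata is a hypothetical helper, the environment variable is assumed to be TESSDATA_PREFIX, and /opt/homebrew/share/tessdata is only an assumed Homebrew location that should be checked on the actual machine.

```python
import os
from pathlib import Path


def resolve_tessdata(explicit=None, env=os.environ):
    """Pick a tessdata folder: explicit argument first, then TESSDATA_PREFIX.

    Hypothetical helper, not part of PyMuPDF. Raises RuntimeError if no
    folder is given or the folder does not exist.
    """
    folder = explicit or env.get("TESSDATA_PREFIX")
    if not folder:
        raise RuntimeError("No tessdata folder given and TESSDATA_PREFIX is not set")
    if not Path(folder).is_dir():
        raise RuntimeError(f"tessdata folder does not exist: {folder}")
    return str(folder)


# Usage sketch (needs PyMuPDF and a real PDF, so it is left commented out):
# import pymupdf
# doc = pymupdf.open("scan.pdf")  # hypothetical file name
# tessdata = resolve_tessdata("/opt/homebrew/share/tessdata")  # assumed path
# for page in doc:
#     tp = page.get_textpage_ocr(tessdata=tessdata)
#     print(page.get_text(textpage=tp))
```

Passing tessdata explicitly skips the executable lookup that fails in the traceback, so it should work regardless of what the Tesseract binary is called.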
julian-smith-artifex-com commented
Fixed in PyMuPDF-1.26.4.