Unstructured-IO/unstructured-api

ModuleNotFoundError: No module named 'unstructured.partition.utils.ocr_models'

jashdalvi opened this issue · 2 comments

I am on the latest pull of the unstructured-api repo. The problem is specific to using paddle for OCR on GPU. These are the steps I followed:

  1. make install
  2. pip install onnxruntime-gpu
  3. pip install paddlepaddle-gpu
  4. pip install "unstructured.PaddleOCR"
  5. export ENTIRE_PAGE_OCR=paddle
  6. export TABLE_OCR=paddle
  7. make run-web-app

This was working fine with version 0.0.47.
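
For anyone hitting the same traceback, a quick way to narrow it down is to check whether the installed `unstructured` package actually ships the module named in the error. The snippet below is only a diagnostic sketch: `module_available` is a hypothetical helper (not part of unstructured), and it says nothing about why the module is missing, only whether the running interpreter can resolve it.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the interpreter can locate the given module.

    Hypothetical helper for diagnosis only. find_spec imports the parent
    packages, so a missing parent raises ModuleNotFoundError; that case
    is treated as "not available".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        return False


# Module name copied from the error message in this issue.
target = "unstructured.partition.utils.ocr_models"
print(f"{target}: {'found' if module_available(target) else 'missing'}")
```

If this prints `missing` in the same virtualenv that runs the web app, the installed `unstructured` version does not contain that module, which would point to a version mismatch between the API repo and the library rather than to the paddle install itself.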

How do I get paddle working?

export ENTIRE_PAGE_OCR=paddle
export TABLE_OCR=paddle

The request failed with:

{
	"detail": "tesseract is not installed or it's not in your PATH. See README file for more information."
}

Hi @crapthings thanks for reaching out!

Sorry about the confusion: the environment variables ENTIRE_PAGE_OCR and TABLE_OCR are being deprecated.

To make sure paddle is working, you might need to:

  • make sure paddle is installed in your environment; you can run make install-paddleocr from the unstructured repo
  • set the environment variable OCR_AGENT to paddle with export OCR_AGENT=paddle
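
As a rough sanity check of the two points above, something like the following can be run in the same shell before starting the app. It is a sketch under assumptions: `check_ocr_env` is a hypothetical helper, and it assumes that the plain value `paddle` for `OCR_AGENT` is what your installed version expects, as described in this thread.

```python
import os

# Variables named in this thread as being deprecated.
DEPRECATED_VARS = ("ENTIRE_PAGE_OCR", "TABLE_OCR")


def check_ocr_env(env=None):
    """Return a list of human-readable problems with the OCR settings.

    Hypothetical helper: it only inspects environment variables and does
    not verify that paddle itself is installed.
    """
    env = os.environ if env is None else env
    problems = []
    for name in DEPRECATED_VARS:
        if name in env:
            problems.append(f"{name} is set but is being deprecated; use OCR_AGENT")
    # Assumption from this thread: the expected value is the string "paddle".
    if env.get("OCR_AGENT") != "paddle":
        problems.append("OCR_AGENT is not set to 'paddle'")
    return problems


for problem in check_ocr_env():
    print(problem)
```

An empty result only means the environment matches the advice above. If the tesseract error still appears, the variable is probably not reaching the process that serves the request.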