mozilla/iris

Explore alternative solutions for OCR

Closed this issue · 7 comments

Our OCR support is currently limited: it works reasonably well for simple cases but can be unreliable for others. Making it work better would require more investment on our part. As a result, we don't make much use of it.

To provide the support we have now, we download, compile, and install the Tesseract project, which is then available through a Python interface, pytesseract. This process is expensive, adding at least 15 minutes to an install of Iris and to any build triggered via CI. It also adds overhead to our CI and Docker work as another dependency to manage.

We should evaluate alternatives.

  • There is a Tesseract library for JavaScript. Is that something we could take advantage of?
  • Is there another native Python replacement? Maybe textract?
  • Should we add flags to the project to selectively turn off support for OCR, in order to loosen this requirement?
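As a rough illustration of the last option, an opt-in command-line flag could look something like the sketch below. The flag name `--ocr` and the program name `iris` are assumptions made for this example, not existing Iris options.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical opt-in OCR flag (off by default)."""
    parser = argparse.ArgumentParser(prog="iris")  # program name is illustrative
    parser.add_argument(
        "--ocr",
        action="store_true",  # defaults to False, so OCR stays off unless requested
        help="enable OCR-based text search (requires Tesseract)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("OCR enabled" if args.ocr else "OCR disabled")
```

Keeping the flag off by default would mean a plain install never needs Tesseract, and only users who ask for OCR pay the build cost.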

If I'm not mistaken, Tesseract is only used in one place:

processed_data = pytesseract.image_to_data(stack_image)
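For anyone evaluating a replacement, it may help to see what that call hands back. With its default output type, `pytesseract.image_to_data` returns a tab-separated string with columns such as `left`, `top`, `width`, `height`, `conf`, and `text`. The sketch below parses a string in that layout using only the standard library; the function name `parse_ocr_words` and the `min_conf` parameter are hypothetical, and this is not how Iris itself processes the data.

```python
import csv
import io


def parse_ocr_words(tsv_text, min_conf=0.0):
    """Extract recognized words and their boxes from Tesseract-style TSV text.

    Hypothetical helper: assumes the TSV layout produced by image_to_data
    with its default (string) output type.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])  # structural (non-word) rows carry conf -1
        if not text or conf < min_conf:
            continue
        box = (int(row["left"]), int(row["top"]),
               int(row["width"]), int(row["height"]))
        words.append({"text": text, "conf": conf, "box": box})
    return words
```

Any alternative OCR backend would only need to produce the same word, confidence, and bounding-box triples for the calling code to stay unchanged.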

I think the JavaScript library is worth exploring. I have experience writing Python packages with npm dependencies, and Dependabot can also handle npm dependencies.

@tracywalker @mwobensmith What tests are currently using Tesseract? How many test cases would be affected, and what is the priority of these test cases?

It's not too bad: 20 or so tests use a helper that depends on OCR, and one test uses OCR directly. So this is very contained and could be fixed with a handful of pattern captures on each platform and syntax changes in each test that uses the helper.

Awesome. I'll begin trying to extract this. I agree with your sentiment that it would be best to drop support for now.

Work items:

  • Adjust bootstrap to make Tesseract/Leptonica install optional (off by default).
  • Remove requirement for Tesseract install from moziris/scripts/main.py, line 167.
  • Add property that indicates presence of Tesseract on system.
  • Use the above property in APIs: when we detect that a string is being used for a find operation and Tesseract is not present, raise an exception with a helpful warning message.
  • Adjust Docker file(s) to not install Tesseract by default.
  • Adjust Travis CI to not install Tesseract by default.
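The detection property and the guard in the find APIs could be sketched roughly as follows. All names here (`tesseract_available`, `OcrUnavailableError`, `check_find_target`) are hypothetical, and the check assumes that having a `tesseract` executable on the `PATH` is a good enough signal that OCR is usable.

```python
import shutil


class OcrUnavailableError(RuntimeError):
    """Raised when a text search is requested but Tesseract is missing."""


def tesseract_available():
    """Return True if a `tesseract` executable is found on the PATH."""
    return shutil.which("tesseract") is not None


def check_find_target(target, ocr_available=None):
    """Reject string targets when OCR is unavailable; pass others through.

    `ocr_available` can be supplied explicitly (useful in tests); when it is
    None, the PATH is probed.
    """
    if ocr_available is None:
        ocr_available = tesseract_available()
    if isinstance(target, str) and not ocr_available:
        raise OcrUnavailableError(
            "Text search for %r requires Tesseract, which was not found. "
            "Install Tesseract or use an image pattern instead." % target
        )
    return target
```

Image-based targets are unaffected by the guard, so tests that never search for text would keep working on systems without Tesseract.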

@kimberlythegeek We can discuss division of labor for the above tasks when we get to working on this.

Since this issue was created to find alternate OCR solutions, I will move the work to remove current support into another issue and keep this one open for now.