Roadmap for 2024
robertknight opened this issue · 5 comments
This issue exists to document what I think are the highest priorities in the short-medium term.
Models and training:
- Document how to repeat the model training process (robertknight/ocrs-models#6). A repeatable training process is IMO required for an ML project to really be considered open source. It is also needed to enable fine-tuning or training for new languages.
- Add benchmarks so the accuracy can be tracked over time
- Expand the training data sets for detection and recognition to improve accuracy
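As an illustration of what an accuracy benchmark could track, here is a minimal sketch of character error rate (CER), a common OCR metric. The thread does not say which metric the project would adopt, so the metric choice and the function names `edit_distance` and `char_error_rate` are assumptions, not part of ocrs.

```rust
/// Levenshtein edit distance between two strings, counted in Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // `prev[j]` is the distance between the processed prefix of `a` and the first `j` chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = Vec::with_capacity(b.len() + 1);
        cur.push(i + 1);
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Character error rate: edit distance divided by the length of the expected text.
/// Hypothetical helper for illustration; lower is better, 0.0 is a perfect match.
fn char_error_rate(expected: &str, predicted: &str) -> f64 {
    let len = expected.chars().count();
    if len == 0 {
        // Avoid dividing by zero when the ground truth is empty.
        return if predicted.is_empty() { 0.0 } else { 1.0 };
    }
    edit_distance(expected, predicted) as f64 / len as f64
}
```

A benchmark could call `char_error_rate(ground_truth, ocr_output)` for each test image and record the average per commit, so regressions show up as an increase in the number.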
ocrs library and CLI tool:
- Add the infrastructure to support multiple languages and model updates (#8, #4)
- Add end-to-end tests that actually check the output. There is a simple end-to-end test but it only verifies that the CLI tool can be built and runs, not the actual output (#25)
- Improve runtime performance and efficiency
Beyond the short term list, here are some themes for subsequent work:
- Continue expanding the datasets and test cases to improve accuracy
- Use machine learning for layout analysis
- Quantize the models to 8-bit to make the downloads smaller and execution faster
- Improve WebAssembly execution performance
- Add bindings for other languages (e.g. C, Python, Node)
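To make the 8-bit quantization item concrete, here is a rough sketch of one common scheme: per-tensor affine quantization of `f32` values to `u8`. The thread does not say which scheme the models would use, so this is an assumption, and `QuantParams`, `quant_params`, `quantize` and `dequantize` are hypothetical names rather than anything in ocrs.

```rust
/// Parameters mapping f32 values onto u8, such that real ≈ scale * (q - zero_point).
struct QuantParams {
    scale: f32,
    zero_point: u8,
}

/// Derive quantization parameters from the value range of a tensor.
fn quant_params(values: &[f32]) -> QuantParams {
    // Start both folds at 0.0 so that zero is always inside the range
    // and therefore exactly representable after quantization.
    let min = values.iter().copied().fold(0.0f32, f32::min);
    let max = values.iter().copied().fold(0.0f32, f32::max);
    let range = max - min;
    if range == 0.0 {
        // All values are zero; any scale works.
        return QuantParams { scale: 1.0, zero_point: 0 };
    }
    let scale = range / 255.0;
    let zero_point = (-min / scale).round().clamp(0.0, 255.0) as u8;
    QuantParams { scale, zero_point }
}

/// Map each f32 to the nearest representable u8 level.
fn quantize(values: &[f32], p: &QuantParams) -> Vec<u8> {
    values
        .iter()
        .map(|&v| ((v / p.scale).round() + p.zero_point as f32).clamp(0.0, 255.0) as u8)
        .collect()
}

/// Recover approximate f32 values from quantized levels.
fn dequantize(q: &[u8], p: &QuantParams) -> Vec<f32> {
    q.iter()
        .map(|&x| p.scale * (x as f32 - p.zero_point as f32))
        .collect()
}
```

Each weight then takes one byte instead of four, which is where the smaller download comes from. The round trip loses at most about one quantization step (`scale`) per value.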
And some longer term things:
- Support GPU inference. This will probably involve making the execution engine pluggable.
Disclaimer: I spent several years of my life building proprietary production OCR solutions around a tweaked Tesseract, with a lot of custom image preprocessing before OCR and some text post-processing after it. I want to add some ideas that you may find good enough to add to the roadmap.
> Add benchmarks so the accuracy can be tracked over time
I suggest also adding performance benchmarks to the continuous monitoring. With Tesseract we had a lot of issues around performance. This question is especially critical for resource-limited devices like smartphones (I had exactly this case in my practice). Knowing the actual performance in different cases can be valuable for users. A comparison with other OCR solutions (like Tesseract) is also good to have when choosing an OCR engine.
Another idea is extending pre-recognition image processing. I see that you already do some image processing. You will probably find some useful pre-recognition image processing algorithms in my repo: https://github.com/zamazan4ik/PRLib.
> Quantize the models to 8-bit to make the downloads smaller and execution faster
About performance: I am currently researching Profile-Guided Optimization (PGO) as a way to achieve better performance for different kinds of software. Maybe PGO can be useful for ocrs too. This needs to be investigated, and I cannot say more right now without actual benchmarks. According to my tests, PGO already helps in many real-life cases. However, if the current ocrs bottleneck is somewhere on the model inference side, I do not expect huge wins from PGO, since such code is usually already well optimized and PGO cannot do more.
Hello, thanks for the input.
> I suggest adding performance benchmarks too to the continuous monitoring. With Tesseract we had a lot of issues around performance.
I agree this would be useful. I suspect there may be problems with variability with the current GitHub Actions runners, as they are the free runners which don't provide any guarantees about isolation from other jobs.
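One way to damp run-to-run noise on shared runners is to time several iterations and report the median rather than a single measurement. The sketch below shows the idea using only the standard library. `median_runtime` is a hypothetical helper, not an ocrs API, and it does not remove noise from a shared machine, it only makes individual outliers matter less.

```rust
use std::time::{Duration, Instant};

/// Run `run` once as an untimed warm-up, then time it `iterations` times
/// and return the median duration.
fn median_runtime<F: FnMut()>(mut run: F, iterations: usize) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    // Warm-up call so one-off costs (cache misses, lazy initialization) are not measured.
    run();
    let mut samples: Vec<Duration> = (0..iterations)
        .map(|_| {
            let start = Instant::now();
            run();
            start.elapsed()
        })
        .collect();
    samples.sort();
    // For an even count this picks the upper of the two middle samples.
    samples[samples.len() / 2]
}
```

The closure passed in would wrap whatever is being measured, for example a full detection and recognition pass over a fixed test image.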
> Another idea is extending prerecognition image processing. I see that you already do some image-processing stuff. Probably, you will find some useful prerecognition image processing algorithms in my repo: https://github.com/zamazan4ik/PRLib.
Thanks for the link. One general goal of this project is to rely more on machine learning to handle noise and variability in inputs. So far I've found this to work well for things like handling low contrast or blurred input, but rotations need explicit handling.
> However, if the current ocrs bottleneck is somewhere on the model inference side I do not expect huge wins from PGO in this case since such code usually is already well-optimized and PGO cannot do more.
I haven't tried PGO yet, so it would be interesting to see if it has an effect. From what I've seen in a few recent profiles using samply, most of the time spent is indeed concentrated in a few hot spots in model inference. If the input image is unnecessarily large (that is, far larger than is needed to read the text) that can also lead to a lot of time spent moving memory around for the decompressed image, see #15.
Hey, I just wanted to encourage you a little on this project. Since you posted on Reddit about it 6 months ago I've used it in a few different contexts. I've pulled text out of digital comics, helped with rigging up a really janky testing setup for an emulated Android app, and I used it to build a bot to, well, frankly to play an idle game for me.
This library is my go-to for quick and "good enough" OCR. It's fast enough in a lot of circumstances that I can use it even for real-time feedback like in the idle game. And it's simple enough to set up that embedding it into another application is a matter of minutes.
Thanks, and I hope you continue to find fulfilling ways of improving this.
Thanks @TannerRogalsky, I appreciate the kind words!
Great stuff. Hoping for a Python binding soon.