Built in R using the Shiny package and version 4.0 of the [Tesseract OCR engine](https://github.com/tesseract-ocr/), provided through the tesseract R package.
This application lets you upload an image and render it in the app, where you can 'brush' (drag to select) over the parts of the image containing the text you want to extract.
The text extracted from the selected region is then displayed below the image.
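The brush-then-extract step can be sketched roughly as below. This is a minimal illustration and not necessarily how the app itself does it: `brush_to_geometry()` and `ocr_region()` are hypothetical helper names, and the sketch assumes the brush coordinates (`xmin`, `xmax`, `ymin`, `ymax`, as Shiny reports them) are already in the pixel space of the original image. If the image is scaled for display, the coordinates would need rescaling first.

```r
# Hypothetical helper: turn a Shiny brush (a list with xmin, xmax, ymin, ymax)
# into an ImageMagick geometry string of the form "WIDTHxHEIGHT+X+Y".
# Assumes the brush coordinates are in pixels of the original image.
brush_to_geometry <- function(brush) {
  x <- max(0, round(brush$xmin))
  y <- max(0, round(brush$ymin))
  width  <- round(brush$xmax) - x
  height <- round(brush$ymax) - y
  stopifnot(width > 0, height > 0)
  sprintf("%dx%d+%d+%d",
          as.integer(width), as.integer(height),
          as.integer(x), as.integer(y))
}

# Hypothetical helper: crop the brushed region with magick, then run
# Tesseract on the cropped image only.
ocr_region <- function(path, brush, language = "eng") {
  image   <- magick::image_read(path)
  cropped <- magick::image_crop(image, brush_to_geometry(brush))
  tesseract::ocr(cropped, engine = tesseract::tesseract(language))
}

# Example brush, shaped like the value of a Shiny brush input
brush <- list(xmin = 10.4, xmax = 110.4, ymin = 20, ymax = 70)
geometry <- brush_to_geometry(brush)
```

Inside a Shiny server function, something like `ocr_region(input$upload$datapath, input$brush)` would then return the text for the selected area, where `upload` and `brush` are assumed input IDs.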
Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy on document images than previous versions, in return for a significant increase in required compute power. For complex languages, however, it may actually be faster than the older base Tesseract engine.
A hosted example is available at jessevent.shinyapps.io/tesseract/
library(shiny)
# Easiest way is to use runGitHub
runGitHub("shiny-tesseract", "jessevent")
The following dependencies are required:
install.packages("shiny")
install.packages("shinydashboard")
install.packages("magick")
install.packages("tesseract")
shiny::runApp()
- Add PDF support
- Be able to brush multiple regions (needs help)
Happy for any other feedback or thoughts.