Shiny Tesseract

OCR (Optical Character Recognition) With R and Shiny

Introduction

This application is built in R with the Shiny package and version 4.0 of the [Tesseract OCR engine](https://github.com/tesseract-ocr/), accessed through the tesseract R package.

The application lets you upload an image, which is then rendered in the app so that you can 'brush' (drag to select) the parts of the image containing the text you want to extract.

The text extracted from the selection is then displayed below the image.
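The brush-then-extract flow described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code. The helper names `brush_to_geometry()` and `ocr_region()` are hypothetical. The sketch assumes the brush is a list with `xmin`, `xmax`, `ymin` and `ymax` (the shape Shiny uses for brush inputs), that those values are already in the pixel space of the uploaded image, and that the English language data for tesseract is available.

```r
# Hypothetical helper: turn brush bounds into an ImageMagick-style
# geometry string "WIDTHxHEIGHT+XOFFSET+YOFFSET".
# Assumes the bounds are in pixels of the original image.
brush_to_geometry <- function(brush) {
  # Clamp the top-left corner so a drag past the edge does not go negative
  x <- max(0L, as.integer(round(brush$xmin)))
  y <- max(0L, as.integer(round(brush$ymin)))
  w <- as.integer(round(brush$xmax)) - x
  h <- as.integer(round(brush$ymax)) - y
  sprintf("%dx%d+%d+%d", w, h, x, y)
}

# Hypothetical helper: crop the brushed region and run OCR on it.
# Requires the magick and tesseract packages to be installed.
ocr_region <- function(path, brush, language = "eng") {
  img     <- magick::image_read(path)
  cropped <- magick::image_crop(img, brush_to_geometry(brush))
  tesseract::ocr(cropped, engine = tesseract::tesseract(language))
}

# Example selection: 100 x 50 pixels, offset 10 px right and 20 px down
brush_to_geometry(list(xmin = 10, xmax = 110, ymin = 20, ymax = 70))
# returns "100x50+10+20"
```

In a Shiny server function, something like `ocr_region(input$upload$datapath, input$brush)` could feed a text output, where `upload` and `brush` are assumed input IDs. If the displayed image is scaled relative to the original, the brush values would need rescaling first.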

About Tesseract 4.0

Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy on document images than previous versions, in return for a significant increase in required compute power. For complex languages, however, it may actually be faster than base Tesseract.

Example

A hosted example is available at jessevent.shinyapps.io/tesseract/

To run the app locally, load it directly from GitHub:

library(shiny)

# Easiest way is to use runGitHub
runGitHub("shiny-tesseract", "jessevent")

Accuracy

Usage

The following packages are required:

install.packages("shiny")
install.packages("shinydashboard")
install.packages("magick")
install.packages("tesseract")

# Then launch the app from the project directory
shiny::runApp()
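As an optional variation on the install step, the sketch below installs only the packages that are not already present. The helper names `missing_packages()` and `install_missing()` are hypothetical and not part of this repository. Note that `requireNamespace()` loads each package's namespace as a side effect of checking for it.

```r
# Packages listed in the Usage section
required <- c("shiny", "shinydashboard", "magick", "tesseract")

# Hypothetical helper: return the subset of `pkgs` that is not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only what is missing,
# and invisibly return the names that were installed
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# Uncomment to install anything missing before running the app:
# install_missing(required)
```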

Next Steps

Add PDF support
Support brushing multiple regions (help needed)

Any other feedback or thoughts are welcome.

JesseVent/shiny-tesseract