Tesseract-OCR-5-Docker

Docker Image with latest Tesseract OCR Version 5.x.x built from sources.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

The sources are pulled from the latest main branch and latest releases of the Tesseract OCR project.

Docker Hub: https://hub.docker.com/r/franky1/tesseract

Usage

Pull Docker Image

Pull the docker image from Docker Hub:

docker pull franky1/tesseract

Run Docker Container

Mount your image data to the /tmp directory and run the Tesseract OCR container with the required command line options. For example, to run the container with a test image:

docker run -it -v ${PWD}/testdata:/tmp --rm franky1/tesseract \
  tesseract english.png output --oem 1 -l eng

For the Tesseract command line options, please refer to the Tesseract Manual.
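
To process several images in one go, the single-image command above can be wrapped in a loop. The following is only a minimal sketch and not part of this repository: the function name `ocr_dir` and the `DRY_RUN` variable are hypothetical, and it assumes the images are PNG files in one local folder. The `-it` flags are left out because the loop needs no interactive terminal.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): run OCR on every PNG in a folder.
# Usage: ocr_dir <local-dir> [lang]
# Set DRY_RUN=1 to print the docker commands instead of running them.
ocr_dir() {
  # Docker bind mounts need an absolute host path.
  dir=$(cd "$1" && pwd) || return 1
  lang=${2:-eng}
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue      # the glob matched nothing
    name=$(basename "$img")
    base=${name%.png}              # output base name; tesseract appends .txt
    set -- docker run --rm -v "$dir:/tmp" franky1/tesseract \
      tesseract "$name" "$base" --oem 1 -l "$lang"
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "$*"
    else
      "$@"
    fi
  done
}
```

Calling `ocr_dir testdata eng` from the repository folder should leave one text file per image next to the images in testdata, since /tmp is the working directory inside the container.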

Mount more languages

Check whether the languages mounted from your local subfolder ./tessdata are available in the Docker container. Be aware that the mounted folder replaces the languages installed in the Docker image, so only the mounted languages are available. Example with the French language:

docker run -it -v ${PWD}/testdata:/tmp \
  -v ${PWD}/tessdata:/usr/local/share/tessdata/ \
  --rm franky1/tesseract
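
Because the mounted folder takes the place of the image's own trained data, it can help to confirm that the local folder holds a traineddata file for every language you plan to use before starting the container. A minimal sketch follows; the function name `check_langs` is hypothetical and not part of this repository, and it assumes the usual `<lang>.traineddata` file naming.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): check that a local tessdata
# folder contains <lang>.traineddata for every requested language code.
# Usage: check_langs <tessdata-dir> <lang>...
# Returns 0 if all files exist, 1 otherwise (missing files go to stderr).
check_langs() {
  dir=$1
  shift
  missing=0
  for lang in "$@"; do
    if [ ! -f "$dir/$lang.traineddata" ]; then
      echo "missing: $dir/$lang.traineddata" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_langs "${PWD}/tessdata" fra eng` reports a missing eng.traineddata if only the French file was downloaded, which would otherwise surface as a Tesseract error inside the container.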

Test the mounted languages in the Docker container with a sample image. Example with the French language:

docker run -it -v ${PWD}/testdata:/tmp \
  -v ${PWD}/tessdata:/usr/local/share/tessdata/ \
  --rm franky1/tesseract \
  tesseract french.jpg output --oem 1 -l fra

Alternatively, you can build a new Docker image if you want other languages; see the next section.

Build Docker Image yourself

For details, have a look at the Dockerfile.

  1. Git clone this repo.
  2. Add your required languages to the languages.txt file.
  3. (a) Build the Docker image from scratch, if you want the latest sources from the main branch:
docker build --tag tesseract .
  3. (b) Or build the Docker image from scratch, if you want a specific release version:
docker build --tag tesseract --build-arg TESSERACT_VERSION=5.0.0 .
  4. Run the Tesseract OCR container with the test image:
docker run -it --name tesseract -v ${PWD}/testdata:/tmp --rm \
  tesseract tesseract english.png output --oem 1 -l eng
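
The two build variants above differ only in the optional TESSERACT_VERSION build argument, so they can be folded into one small wrapper. This is a sketch under assumptions, not something shipped with this repository: the function name `build_image` and the `DRY_RUN` variable are hypothetical, and it must be called from the cloned repo folder that contains the Dockerfile.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): build the image from the main
# branch, or from a specific release if a version is given.
# Usage: build_image [version]
# Set DRY_RUN=1 to print the docker command instead of running it.
build_image() {
  if [ -n "${1:-}" ]; then
    set -- docker build --tag tesseract --build-arg "TESSERACT_VERSION=$1" .
  else
    set -- docker build --tag tesseract .
  fi
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

With this, `build_image` corresponds to variant (a) and `build_image 5.0.0` to variant (b).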

Image conditions

  • The only supported target for this Docker image is currently linux/amd64.
  • The working directory for OCR images is /tmp inside the container. See example above.
  • The directory for trained data is /usr/local/share/tessdata/ inside the container. See example above.
  • This image was built without the Tesseract training tools.
  • This image currently includes only the following languages:
    • English: tessdata_best > eng.traineddata
    • German: tessdata_best > deu.traineddata
    • If you need other languages, you have to build your own image or mount trained data to the /usr/local/share/tessdata/ directory. See example above.

Tesseract Trained Data for all available languages

Further documentation

ToDo

  • Update README.md to latest Dockerfile and Usage
  • Add workflow_dispatch to GitHub workflows
  • Add dependabot on GitHub
  • Add vulnerability scanning in GitHub Actions with Snyk
  • Add GitHub Action to check container efficiency with Dive https://github.com/MartinHeinz/dive-action
  • Add badges to README.md
  • Add documentation for GitHub Actions Workflow
  • Add more inline comments in GitHub Actions related files
  • Build image for more targets
  • Building Tesseract with TensorFlow?
  • Building Tesseract with Training tools?
  • Change build in Dockerfile according to instructions in Compiling-GitInstallation.md

Issues

  • 27.07.2022: The build from the main source branch currently fails; the reason is unknown.

If you find any bugs or have requests regarding this Docker image, please post an issue in this GitHub repository.

Project status

27.07.2022: The Docker image is ready for use. Some slight improvements are still possible, and the build occasionally fails.