otiai10/gosseract

Gosseract is much slower than Pytesseract

frytoli opened this issue · 2 comments

Summary

Firstly, thank you for the great work on this repo! It made it really easy to incorporate Tesseract into my project.

I'm finding OCR-ing images with gosseract to be much slower than with Pytesseract, whereas I would expect the opposite. One image that I tested took 1m19.203891157s to OCR with gosseract and only 3.1102023124694824 seconds with Pytesseract. My testing has shown that, in the Go function I've provided below, the client.Text() call is the culprit (no real surprise there). Has anyone else come across this issue? Is this expected behavior? My code runs in a Docker container, and I'm wondering whether that might also be part of the puzzle.

Reproducible Dockerfile

There's a lot going on in my Dockerfile, but here are the relevant parts:

RUN apt-get install -y \
  libtesseract-dev \
  libleptonica-dev
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
RUN apt-get install -y tesseract-ocr-eng

My Go code:

func OCR(imgs []image.Image) string {
  // New Tesseract client
  client := gosseract.NewClient()
  defer client.Close()

  // Iterate over images and save text
  allText := []string{}
  for _, img := range imgs {
    // Convert image to byte array
    buf := new(bytes.Buffer)
    err := tiff.Encode(buf, img, nil)
    if err != nil {
        ErrorLogger.Printf("Error encoding tiff image and converting to byte array: %s\n", err)
        return ""
    }
    // Send bytes to Tesseract engine
    err = client.SetImageFromBytes(buf.Bytes())
    if err != nil {
        ErrorLogger.Printf("Error sending image byte array to Tesseract: %s\n", err)
        return ""
    }
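    // Get OCRed text, checking the error instead of discarding it
    text, err := client.Text()
    if err != nil {
        ErrorLogger.Printf("Error getting OCRed text from Tesseract: %s\n", err)
        return ""
    }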
    // Save
    allText = append(allText, text)
  }

  // Join text and return
  return strings.Join(allText, " ")
}

Environment

  • go version go1.17 linux/amd64
  • tesseract 4.0.0-beta.1
  • leptonica-1.75.3

Some things you can try:

  • Are you running gosseract in a Docker container and pytesseract without Docker? Check your memory limit with docker stats. Pytesseract may be running without resource limits while your container has limited resources.
  • Disable openmp. Since tesseract 4.1.0, openmp is disabled by default. See:

Thanks for your response! I'll look into this again when I get some.
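
For the resource-limit suggestion, the limits can also be read from inside the container itself. The following is a rough sketch, assuming a Linux container that exposes the usual cgroup files (memory.max for cgroup v2, memory/memory.limit_in_bytes for cgroup v1). parseMemLimit is a hypothetical helper name, and the paths may differ depending on how the container is set up.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseMemLimit is a hypothetical helper that interprets the contents of a
// cgroup memory limit file. cgroup v2 writes "max" when no limit is set;
// otherwise the file holds the limit in bytes.
func parseMemLimit(raw string) (limit uint64, limited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, false, nil
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return n, true, nil
}

func main() {
	// Assumed locations; they depend on the cgroup version and mount layout.
	paths := []string{
		"/sys/fs/cgroup/memory.max",                   // cgroup v2
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
	}

	fmt.Println("logical CPUs visible to Go:", runtime.NumCPU())

	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue // file not present on this system
		}
		limit, limited, err := parseMemLimit(string(raw))
		switch {
		case err != nil:
			fmt.Printf("%s: could not parse: %v\n", p, err)
		case !limited:
			fmt.Printf("%s: no memory limit set\n", p)
		default:
			// On cgroup v1 an unlimited container reports a very large number here.
			fmt.Printf("%s: %d bytes\n", p, limit)
		}
	}
}
```

Running this in the container and on the host gives a quick way to compare the environments that gosseract and Pytesseract actually run in.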