Gosseract is much slower than Pytesseract
frytoli opened this issue · 2 comments
Summary
Firstly, thank you for the great work on this repo! It made it really easy to incorporate Tesseract into my project.
I'm finding OCR-ing images with gosseract to be much slower than with Pytesseract, and I would expect the opposite. One image that I tested took 1m19.203891157s to OCR with gosseract and only 3.1102023124694824 seconds with Pytesseract. My testing has shown that in the Go function I've provided below, the client.Text() call is the culprit (no real surprise there). Has anyone else come across this issue? Is this expected behavior? My code is running in a Docker container, and I'm wondering if that might also be part of the puzzle.
Reproducible Dockerfile
There's a lot going on in my Dockerfile, but here are the relevant parts:
RUN apt-get install -y \
    libtesseract-dev \
    libleptonica-dev
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
RUN apt-get install -y tesseract-ocr-eng
My Go code:
func OCR(imgs []image.Image) string {
	// New Tesseract client, reused for every image
	client := gosseract.NewClient()
	defer client.Close()

	// Iterate over images and collect the recognized text
	allText := []string{}
	for _, img := range imgs {
		// Encode the image as TIFF into an in-memory buffer
		buf := new(bytes.Buffer)
		if err := tiff.Encode(buf, img, nil); err != nil {
			ErrorLogger.Printf("Error encoding tiff image and converting to byte array: %s\n", err)
			return ""
		}

		// Send bytes to Tesseract engine
		if err := client.SetImageFromBytes(buf.Bytes()); err != nil {
			ErrorLogger.Printf("Error sending image byte array to Tesseract: %s\n", err)
			return ""
		}

		// Get OCRed text (this is the slow call)
		text, err := client.Text()
		if err != nil {
			ErrorLogger.Printf("Error getting text from Tesseract: %s\n", err)
			return ""
		}
		allText = append(allText, text)
	}

	// Join text and return
	return strings.Join(allText, " ")
}
Environment
go version go1.17 linux/amd64
tesseract 4.0.0-beta.1
leptonica-1.75.3
Some things you can try:
- Are you running gosseract in a Docker container and pytesseract without Docker? Check your memory limit with docker stats. Maybe pytesseract is running without resource limits and your container has limited resources.
- Disable OpenMP. Since Tesseract 4.1.0, OpenMP is disabled by default. See:
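Both suggestions can also be checked from inside the container. The sketch below prints the CPU count Go sees, whether `OMP_THREAD_LIMIT` is set, and the container memory limit. It rests on a few assumptions: `OMP_THREAD_LIMIT` is the standard OpenMP environment variable for capping threads (setting it to 1 is a common way to keep an OpenMP-enabled Tesseract build single-threaded, though whether that helps here is untested), and the memory limit is read from `/sys/fs/cgroup/memory.max`, which only exists on cgroup v2 hosts. `parseCgroupLimit` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseCgroupLimit interprets the contents of a cgroup v2 memory.max file.
// The file holds either the literal "max" (no limit) or a byte count.
func parseCgroupLimit(raw string) (limit uint64, unlimited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, true, nil
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("unexpected cgroup limit %q: %w", s, err)
	}
	return n, false, nil
}

func main() {
	// Logical CPUs usable by this process.
	fmt.Println("CPUs visible to Go:", runtime.NumCPU())

	// Report the OpenMP thread cap, if any.
	if v, ok := os.LookupEnv("OMP_THREAD_LIMIT"); ok {
		fmt.Println("OMP_THREAD_LIMIT =", v)
	} else {
		fmt.Println("OMP_THREAD_LIMIT is not set")
	}

	// Assumption: cgroup v2 layout. On other systems this read simply fails.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("memory limit: unavailable:", err)
		return
	}
	limit, unlimited, err := parseCgroupLimit(string(raw))
	switch {
	case err != nil:
		fmt.Println("memory limit:", err)
	case unlimited:
		fmt.Println("memory limit: none")
	default:
		fmt.Printf("memory limit: %d MiB\n", limit>>20)
	}
}
```

If the variable turns out to matter, it is probably safest to set it in the container environment (for example with an `ENV` line in the Dockerfile) rather than from Go code, since the OpenMP runtime most likely reads it when the library is loaded, before `main` runs.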
Thanks for your response! I'll look into this again when I get some time.