Any plan to update to Tesseract 4.0?
chavenor opened this issue · 13 comments
Is there any plan to update to Tesseract 4.0?
Yes. Do you know if the Tesseract team is providing docker images?
I do not know if they are going to provide Docker images. I may be wrong on this one, but CPU usage is said to be 10x higher than with the current 3.x version. If that is true, will we see a severe slowdown when running it inside a Docker container?
I'm also wondering how the training processes will work inside a container if the hardware changes?
Thoughts?
It works on pre-trained data, so the training process shouldn't be an issue.
I think the best approach would be to be able to switch between either tesseract 3 or 4 and let the user specify it somehow.
Agreed. I think that would likely cover all past and foreseeable future use cases.
Hello.
In tesseract-ocr/tesseract#817, someone proposes a Docker image for 4.0.0:
https://hub.docker.com/r/xlight/docker-tesseract4/~/dockerfile/
Regards
ok thanks for the heads up
Just for information, I created an issue on Tesseract, tesseract-ocr/tesseract#893, because I was not able to get it working.
I will let you know once I have an answer.
I gave it a try with base64 and Tesseract 4.
Man, it rocks! It takes a little longer than Tesseract 3, but where I had approximately a 60% success rate before, I got 100% with the new version of Tesseract.
Here is the Dockerfile:
FROM ubuntu
RUN apt-get update && apt-get install -y \
autoconf \
automake \
libtool \
autoconf-archive \
pkg-config \
libpng12-dev \
libjpeg8-dev \
libtiff5-dev \
zlib1g-dev \
libicu-dev \
libpango1.0-dev \
libcairo2-dev \
git \
golang \
gcc \
curl && \
rm -rf /var/lib/apt/lists/*
RUN curl http://www.leptonica.org/source/leptonica-1.74.1.tar.gz -o leptonica-1.74.1.tar.gz && \
tar -zxvf leptonica-1.74.1.tar.gz && \
cd leptonica-1.74.1 && ./configure && make && make install && \
cd .. && rm -rf leptonica*
RUN git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git && \
cd tesseract && \
./autogen.sh && \
./configure && \
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make && \
make install && \
ldconfig && \
make training && \
make training-install && \
cd .. && rm -rf tesseract
# Get basic traineddata
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata && \
mv eng.traineddata /usr/local/share/tessdata/
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata && \
mv fra.traineddata /usr/local/share/tessdata/
# GOPATH is not set in the ubuntu base image; /go is an assumed location
ENV GOPATH /go
# go get open-ocr
RUN go get -u -v -t github.com/tleyden/open-ocr
# build open-ocr-httpd binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-httpd && go build -v -o open-ocr-httpd && cp open-ocr-httpd /usr/bin
# build open-ocr-worker binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-worker && go build -v -o open-ocr-worker && cp open-ocr-worker /usr/bin
If we want to have all the languages we can replace:
# Get basic traineddata
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata && \
mv eng.traineddata /usr/local/share/tessdata/
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata && \
mv fra.traineddata /usr/local/share/tessdata/
With:
RUN git clone https://github.com/tesseract-ocr/tessdata && \
mv -v tessdata/* /usr/local/share/tessdata/ && \
rm -rf tessdata
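A middle ground between two languages and the whole repository would be to download only a chosen list of languages. This is a minimal sketch, assuming the same URL pattern and tessdata directory as the Dockerfile above. The function names `traineddata_url` and `fetch_traineddata` are hypothetical helpers, not part of open-ocr or Tesseract.

```shell
#!/bin/sh
# Directory used by the Dockerfile above for traineddata files.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

# Build the download URL for one language code (same pattern as the Dockerfile).
traineddata_url() {
    echo "https://github.com/tesseract-ocr/tessdata/raw/master/$1.traineddata"
}

# Download each requested language into TESSDATA_DIR (hypothetical helper).
# Stops at the first failed download.
fetch_traineddata() {
    for lang in "$@"; do
        curl -fL "$(traineddata_url "$lang")" -o "$TESSDATA_DIR/$lang.traineddata" || return 1
    done
}

# Usage (not executed here): fetch_traineddata eng fra deu
```

In a Dockerfile this could be run from a single RUN step, so adding a language means changing one argument list instead of copying another curl/mv pair.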
Now we should find a way to tell docker-compose to use Tesseract 3 or Tesseract 4 based on the user's choice.
You could maybe create a Docker image named
tleyden5iwx/open-ocr-4
Something should change here:
https://github.com/tleyden/open-ocr/blob/master/docker-compose/docker-compose.yml#L25
https://github.com/tleyden/open-ocr/blob/master/docker-compose/docker-compose.yml#L35
With an environment variable like:
openocrworker:
  image: tleyden5iwx/${OCR_VERSION}
  volumes:
    - ./scripts/:/opt/open-ocr/
  dns: ["8.8.8.8"]
  depends_on:
    - rabbitmq
  command: "/opt/open-ocr/open-ocr-worker -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
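docker-compose fills in such a variable from the shell environment (or from an .env file next to the compose file), so the switch could be made by a small wrapper before starting the stack. This is only a sketch: the variable name OCR_VERSION follows the snippet above, the function `resolve_ocr_image` is hypothetical, "open-ocr-4" is the image name suggested in this thread, and "open-ocr" as the Tesseract 3 image name is an assumption.

```shell
#!/bin/sh
# Map a requested Tesseract major version to an image name.
# Both image names are assumptions for illustration (see the note above).
resolve_ocr_image() {
    case "$1" in
        3) echo "open-ocr" ;;
        4) echo "open-ocr-4" ;;
        *) echo "unsupported Tesseract version: $1" >&2; return 1 ;;
    esac
}

# Usage (not executed here):
#   OCR_VERSION="$(resolve_ocr_image 4)" && export OCR_VERSION
#   docker-compose up -d
```

Rejecting unknown versions up front avoids docker-compose trying to pull an image name that does not exist.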
If you want, I can give it a try. However, I don't know how to upload a Docker image to Docker Hub.
but where I had approximately a 60% success rate before, I got 100% with the new version of Tesseract.
Wow!!
I give you the dockerfile
Can you open a PR that adds that Dockerfile to this repo? It should be moved to this repo rather than its current location: https://github.com/tleyden/docker/blob/master/open-ocr/Dockerfile
Now we should find a way to tell docker-compose to use Tesseract 3 or Tesseract 4 based on the user's choice.
Yep that makes sense
I will do some tests and create a PR with everything (the switch + the Dockerfiles + a script to build the Docker images).
I am continuing my development. However, it seems that I have an issue with docker-compose now. I will let you know.