tleyden/open-ocr

Any plan to update to Tesseract 4.0?

chavenor opened this issue · 13 comments

Is there any plan to update to Tesseract 4.0?

Yes. Do you know if the Tesseract team is providing docker images?

I don't know whether they plan to provide Docker images. I may be wrong on this, but Tesseract 4 is said to use about 10x more CPU than the current 3.x version. If that is true, will we see a severe slowdown when running it inside a Docker container?

I'm also wondering how the training processes will work inside a container if the hardware changes?

Thoughts?

It works on pre-trained data, so the training process shouldn't be an issue.

I think the best approach would be to be able to switch between either tesseract 3 or 4 and let the user specify it somehow.

Agreed, I think that would likely cover all past and foreseeable future use cases.

ok thanks for the heads up

Just for information, I created an issue on the Tesseract repo, tesseract-ocr/tesseract#893, because I was not able to get it working.

I will let you know once I have an answer.

I gave it a try with base64 and Tesseract 4.
Man, it rocks (it takes a little longer than Tesseract 3), but where I had approximately 60% successful results, I had 100% with the new version of Tesseract.

Here is the Dockerfile:

FROM ubuntu
RUN apt-get update && apt-get install -y \
	autoconf \
	automake \
	libtool \
	autoconf-archive \
	pkg-config \
	libpng12-dev \
	libjpeg8-dev \
	libtiff5-dev \
	zlib1g-dev \
	libicu-dev \
	libpango1.0-dev \
	libcairo2-dev \
	git \
	golang \
	gcc \
	curl && \
	rm -rf /var/lib/apt/lists/*

RUN curl http://www.leptonica.org/source/leptonica-1.74.1.tar.gz -o leptonica-1.74.1.tar.gz && \
	tar -zxvf leptonica-1.74.1.tar.gz && \
	cd leptonica-1.74.1 && ./configure && make && make install && \
	cd .. && rm -rf leptonica*

RUN git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git && \
	cd tesseract && \
	./autogen.sh && \
	./configure && \
	LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make && \
	make install && \
	ldconfig && \
	make training && \
	make training-install && \
	cd .. && rm -rf tesseract

# Get basic traineddata
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata && \
	mv eng.traineddata /usr/local/share/tessdata/

RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata && \
	mv fra.traineddata /usr/local/share/tessdata/
	
# go get open-ocr
RUN go get -u -v -t github.com/tleyden/open-ocr

# build open-ocr-httpd binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-httpd && go build -v -o open-ocr-httpd && cp open-ocr-httpd /usr/bin

# build open-ocr-worker binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-worker && go build -v -o open-ocr-worker && cp open-ocr-worker /usr/bin

If we want to have all the languages we can replace:

# Get basic traineddata
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata && \
	mv eng.traineddata /usr/local/share/tessdata/

RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata && \
	mv fra.traineddata /usr/local/share/tessdata/

With:

RUN git clone https://github.com/tesseract-ocr/tessdata && \
	mv -v tessdata/* /usr/local/share/tessdata/ && \
	rm -rf tessdata
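
A middle ground between two languages and the whole tessdata repository is to fetch only a chosen list of languages. Here is a minimal shell sketch, assuming the same tessdata URL layout as the curl commands above. The helper names `traineddata_url` and `fetch_traineddata` and the `TESSDATA_DIR` variable are hypothetical, not part of open-ocr:

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of open-ocr.
# The URL layout matches the curl commands used for eng and fra.
TESSDATA_BASE_URL="https://github.com/tesseract-ocr/tessdata/raw/master"
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

# Print the download URL for one language code (e.g. eng, fra).
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_BASE_URL" "$1"
}

# Download the traineddata file for every language code passed as an argument.
# Stops at the first failed download.
fetch_traineddata() {
    for lang in "$@"; do
        curl -fL "$(traineddata_url "$lang")" \
            -o "$TESSDATA_DIR/$lang.traineddata" || return 1
    done
}
```

After sourcing the script, something like `fetch_traineddata eng fra` would download the same two files as the original RUN steps.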

Now we need a way to tell docker-compose to use Tesseract 3 or Tesseract 4, depending on the user's choice.

You could maybe create a Docker image named
tleyden5iwx/open-ocr-4

Something should change here:
https://github.com/tleyden/open-ocr/blob/master/docker-compose/docker-compose.yml#L25
https://github.com/tleyden/open-ocr/blob/master/docker-compose/docker-compose.yml#L35

With an environment variable like:

  openocrworker:
    image: tleyden5iwx/${OCR_VERSION}
    volumes:
      - ./scripts/:/opt/open-ocr/
    dns: ["8.8.8.8"]
    depends_on:
      - rabbitmq
    command: "/opt/open-ocr/open-ocr-worker -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
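
One possible way to set that variable is a small wrapper that maps the user's choice (3 or 4) to an image name before calling docker-compose. This is only a sketch under assumptions: `image_for_version` is a hypothetical helper, `open-ocr-4` is the name suggested above, and `open-ocr` as the Tesseract 3 image name is an assumption.

```shell
#!/bin/sh
# Sketch only: image_for_version is a hypothetical helper.
# "open-ocr-4" is the name proposed in this thread;
# "open-ocr" for Tesseract 3 is an assumption.
image_for_version() {
    case "$1" in
        3) echo "open-ocr" ;;
        4) echo "open-ocr-4" ;;
        *) echo "unsupported Tesseract version: $1 (expected 3 or 4)" >&2
           return 1 ;;
    esac
}
```

It could then be used as `OCR_VERSION="$(image_for_version 4)" docker-compose up`, since docker-compose substitutes environment variables in the compose file.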

If you want, I can give it a try. However, I don't know how to upload a Docker image to Docker Hub.

but where I had approximately 60% successful results, I had 100% with the new version of Tesseract.

Wow!!

Here is the Dockerfile

Can you open a PR that adds that Dockerfile to this repo? It should be moved to this repo rather than its current location: https://github.com/tleyden/docker/blob/master/open-ocr/Dockerfile

Now we need a way to tell docker-compose to use Tesseract 3 or Tesseract 4, depending on the user's choice.

Yep, that makes sense.

I will do some tests and create a PR with everything (the switch, the Dockerfiles, and a script to build the Docker images).

I'm continuing the development. However, it seems that I have an issue with docker-compose now. I will let you know.
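
As a rough idea of what such a build script might look like, here is a dry-run sketch. The helper name `print_build_commands` and the `docker/<name>/` directory layout are hypothetical and do not describe the script that ended up in the PR:

```shell
#!/bin/sh
# Sketch only: the helper name and the docker/<name>/ layout are hypothetical.
# Prints one "docker build" command per image instead of running it (dry run),
# so the commands can be reviewed first.
print_build_commands() {
    repo="$1"   # Docker Hub namespace, e.g. tleyden5iwx
    shift
    for name in "$@"; do
        # Assumes the Dockerfile for each image lives in docker/<name>/
        printf 'docker build -t %s/%s docker/%s\n' "$repo" "$name" "$name"
    done
}
```

For example, `print_build_commands tleyden5iwx open-ocr-4` prints a single build command, which could be piped to `sh` once it looks right.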

This was merged in #90.