ocropus/hocr-tools

Convert Google Cloud Vision OCR output to hocr.


I have a question.

I am trying to use the Google Cloud Vision API for OCR.

https://cloud.google.com/vision/

The OCR output includes the positions of the text.

I want to convert the Google OCR output to the hOCR format. Do you have any ideas?

Do you have an example of the output of the Google Cloud Vision API? I just tried the online interface, but the results only contain the words, each with a bounding polygon, which is not much to start with...

The output is JSON and should contain polygon boxes both for regions with text and for the single words (see https://cloud.google.com/vision/reference/rest/v1/images/annotate), so a conversion to hOCR seems to be possible although much interesting information is missing.
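As a rough illustration of what such a conversion has to start from, here is a minimal sketch that pulls the words and their polygons out of the JSON. It assumes the response has the `responses[0].textAnnotations` layout with `description` and `boundingPoly.vertices` fields, and that the first annotation covers the whole text while the remaining ones are single words. The function name `extract_words` and the sample data are made up for this example.

```python
import json


def extract_words(response_json):
    """Return a list of (text, polygon) pairs from an annotate response.

    Assumes the layout responses[0].textAnnotations[*] with
    'description' and 'boundingPoly.vertices' keys.
    """
    data = json.loads(response_json)
    annotations = data["responses"][0].get("textAnnotations", [])
    words = []
    # Assumption: the first annotation is the full text block,
    # so only the following entries are treated as single words.
    for ann in annotations[1:]:
        vertices = ann["boundingPoly"]["vertices"]
        # Default missing coordinates to 0 to be safe.
        polygon = [(v.get("x", 0), v.get("y", 0)) for v in vertices]
        words.append((ann["description"], polygon))
    return words


# Hand-written sample data, not real API output.
sample = json.dumps({
    "responses": [{
        "textAnnotations": [
            {"description": "Hello world",
             "boundingPoly": {"vertices": [
                 {"x": 10, "y": 10}, {"x": 120, "y": 10},
                 {"x": 120, "y": 32}, {"x": 10, "y": 32}]}},
            {"description": "Hello",
             "boundingPoly": {"vertices": [
                 {"x": 10, "y": 10}, {"x": 60, "y": 10},
                 {"x": 60, "y": 30}, {"x": 10, "y": 30}]}},
            {"description": "world",
             "boundingPoly": {"vertices": [
                 {"x": 70, "y": 12}, {"x": 120, "y": 12},
                 {"x": 120, "y": 32}, {"x": 70, "y": 32}]}},
        ]
    }]
})

words = extract_words(sample)
```

With the sample above, `words` holds two entries, one per word, each with its four polygon corners.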

The API supports suggesting languages (https://cloud.google.com/translate/v2/translate-reference#supported_languages), but I did not find support for old German (Fraktur). A short test with old text did not recognize many words.

Thank you for the comments.

I ran OCR on test.jpg via the Google Cloud Vision API; the result (test.jpg.json.txt) is the same as the one from the online interface.
I followed this web page:
http://blog.aimanbaharum.com/2016/04/21/ocr-with-google-cloud-vision-api/

Compared with the Tesseract hOCR output (out.hocr.txt), the Google Cloud Vision API result lacks a lot of information,
but it seems to contain polygon boxes both for regions with text and for the single words.

I think I need to generate the classes ocr_page, ocr_carea, ocr_par, and ocr_line from the Google Cloud Vision result. Is that enough?

test.jpg
test.jpg.json.txt
out.hocr.txt

Two larger differences are:

  1. There is no information about text lines, i.e. which words are on the same text line
  2. There are no rectangular boxes (bbox) but polygon (poly) for bounding the text

This means that you cannot obtain this information through a simple transformation of file formats. In general the hOCR format is quite flexible, i.e. you might just output what is there, and it may still be valid. But the tools based on hOCR normally expect ocr_line and also bbox.
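Both gaps can be bridged approximately. The sketch below is one possible heuristic, not something any of the tools mentioned here is known to do: it takes the axis-aligned rectangle enclosing each polygon as the bbox, and it puts a word on an existing line when the word's vertical center falls inside the vertical extent of that line's first word. The function names are hypothetical, and the heuristic will fail on skewed or multi-column pages.

```python
def poly_to_bbox(polygon):
    """Return (x0, y0, x1, y1), the rectangle enclosing a list of (x, y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def group_into_lines(words):
    """Group (text, bbox) pairs into text lines.

    Heuristic: a word joins a line if its vertical center lies within
    the vertical extent of the first word placed on that line.
    """
    def y_center(bbox):
        return (bbox[1] + bbox[3]) / 2

    lines = []
    # Process words from top to bottom.
    for text, bbox in sorted(words, key=lambda w: y_center(w[1])):
        for line in lines:
            ref = line[0][1]
            if ref[1] <= y_center(bbox) <= ref[3]:
                line.append((text, bbox))
                break
        else:
            # No existing line matched, so start a new one.
            lines.append([(text, bbox)])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda w: w[1][0]) for line in lines]


# Hand-written polygons for illustration only.
polygons = [
    ("world", [(70, 12), (120, 12), (120, 32), (70, 32)]),
    ("Hello", [(10, 10), (60, 10), (60, 30), (10, 30)]),
    ("Second", [(10, 50), (80, 50), (80, 70), (10, 70)]),
]
boxes = [(text, poly_to_bbox(poly)) for text, poly in polygons]
lines = group_into_lines(boxes)
```

Here "Hello" and "world" end up on one line and "Second" on another. Taking the enclosing rectangle loses the rotation information of the polygon, which matters little if the goal is only a searchable text layer.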

UB-Mannheim/ocr-fileformat might be a better place for a converter from the Google Cloud Vision JSON format to hOCR.

Thank you for your kind help.

My first goal is to make a searchable PDF from the Google Cloud Vision OCR output; the exact position and layout of the text are not so important.

I have to learn the hOCR format. Thank you for the information.

I will post this subject to UB-Mannheim/ocr-fileformat. Thank you again for the information.

Yeah, hocr-pdf expects ocr_line (see https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf#L65-L66) and bbox arguments (see https://github.com/tmbdev/hocr-tools/blob/master/hocr-pdf#L59). I don't think it will do anything without this information.
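For reference, a minimal sketch of hOCR output that carries both pieces, ocr_line elements and bbox values in the title attribute, could look like the following. The helper names, the page size, and the input data are made up for this example, and whether hocr-pdf accepts exactly this markup has not been tested here.

```python
from html import escape


def bbox_title(bbox):
    """Format (x0, y0, x1, y1) as an hOCR title value."""
    return "bbox {} {} {} {}".format(*bbox)


def union(bboxes):
    """Return the rectangle enclosing all given bboxes."""
    return (min(b[0] for b in bboxes), min(b[1] for b in bboxes),
            max(b[2] for b in bboxes), max(b[3] for b in bboxes))


def lines_to_hocr(lines, page_size):
    """Render lines (lists of (text, bbox)) as a minimal hOCR document."""
    width, height = page_size
    out = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"/>',
        '<meta name="ocr-capabilities" content="ocr_page ocr_line ocrx_word"/>',
        "</head><body>",
        '<div class="ocr_page" title="bbox 0 0 {} {}">'.format(width, height),
    ]
    for line in lines:
        # The line bbox is the union of the bboxes of its words.
        line_box = union([bbox for _, bbox in line])
        out.append('<span class="ocr_line" title="{}">'.format(bbox_title(line_box)))
        for text, bbox in line:
            out.append('<span class="ocrx_word" title="{}">{}</span>'.format(
                bbox_title(bbox), escape(text)))
        out.append("</span>")
    out.append("</div></body></html>")
    return "\n".join(out)


# Hand-written example input: one line with two words.
example_lines = [[("Hello", (10, 10, 60, 30)), ("A&B", (70, 12, 120, 32))]]
hocr = lines_to_hocr(example_lines, page_size=(200, 100))
```

The text is passed through `html.escape` so that characters such as `&` do not break the markup.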

I made gcv2hocr on GitHub:
https://github.com/dinosauria123/gcv2hocr

gcv2hocr converts Google Cloud Vision OCR output to hOCR in order to make a searchable PDF.
The code is far from complete, but it seems to work with my sample file.

Thank you for the valuable conversation!

This issue is continued on UB-Mannheim/ocr-fileformat#22, so I close it here.