Convert Google Cloud Vision OCR output to hocr.
I have a question.
I am trying to use the Google Cloud Vision API for OCR.
https://cloud.google.com/vision/
The OCR output includes the positions of the recognized texts.
I want to convert the Google OCR output to hOCR format. Do you have any ideas?
I have already discussed this subject here. Please check our previous discussions.
Hi, interesting use case but a tough problem.
As far as I can see, Google Cloud Vision detects the print space and individual words but not lines.
We can derive the lines from the word coordinates with some assumptions (see the sketch below):
a) All annotations whose coordinates fit into another boundingPoly are part of that page.
b) All words that share the same y for their top-left and top-right corners form a line.
But that is hard to tell from just one example. I'm sure we could come up with a prototype that converts the example JSON to hOCR/ALTO, but for a robust solution, non-trivial examples would be necessary.
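For illustration, here is a minimal Python sketch of checks a) and b), assuming each annotation carries a boundingPoly with four vertices holding x/y values as in the GCV JSON; the function names and the y tolerance are my own assumptions, not anything GCV defines:

```python
def contains(outer, inner):
    """Check a): the inner boundingPoly lies inside the outer one."""
    def box(poly):
        xs = [v.get("x", 0) for v in poly["vertices"]]
        ys = [v.get("y", 0) for v in poly["vertices"]]
        return min(xs), min(ys), max(xs), max(ys)
    ox0, oy0, ox1, oy1 = box(outer)
    ix0, iy0, ix1, iy1 = box(inner)
    return ox0 <= ix0 and ix1 <= ox1 and oy0 <= iy0 and iy1 <= oy1

def same_line(word_a, word_b, tolerance=5):
    """Check b): the top corners of both words share (roughly) the same y."""
    top_a = min(v.get("y", 0) for v in word_a["boundingPoly"]["vertices"])
    top_b = min(v.get("y", 0) for v in word_b["boundingPoly"]["vertices"])
    return abs(top_a - top_b) <= tolerance
```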
GCV outputs the text twice: first as a block of text, which I think we have to ignore, and second as single words. The word-level output is the information we need for hOCR.
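As an illustration of that split, assuming the stored response has the usual responses[0].textAnnotations shape, where the first element describes the whole block and the remaining elements are single words (the file name is only an example):

```python
import json

# "test.jpg.json" is only an example name for a stored GCV response.
with open("test.jpg.json") as f:
    response = json.load(f)

annotations = response["responses"][0]["textAnnotations"]

# The first entry covers the whole block of text; skip it and keep only
# the single-word entries, which are what we need for hOCR.
words = annotations[1:]
for word in words:
    print(word["description"], word["boundingPoly"]["vertices"])
```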
As we are looking for words on the same (horizontal) line, we have to find words with similar lower corners. The top corners determine the size of the characters but are not needed for line detection. Similar lower corners means that small deviations of the y value are acceptable (caused by skew, but also by characters that descend below the line, such as 'g'). If we assume that all words were written horizontally, we can restrict our search to the lower-left corner. So a first version of the algorithm could sort all lower-left corners by their y value and group them; those groups are the lines. A more advanced algorithm could improve the handling of skewed lines by only allowing small y deviations between neighboring words, but larger deviations between words that are far apart.
All words which we found to be on the same line then have to be sorted by the x value of their lower-left corner. Finally, hOCR needs a bounding box for the whole line, which can easily be computed from the minima and maxima of the x and y values.
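A minimal sketch of this procedure, under the assumption that all lines are horizontal; the pixel tolerance for y deviations is an arbitrary choice, not something prescribed by GCV or hOCR:

```python
def group_into_lines(words, tolerance=10):
    """Group GCV word annotations into lines via their lower-left corners.

    `words` is a list of (text, vertices) pairs, where vertices are the four
    boundingPoly vertices as dicts with "x"/"y" keys.
    Returns a list of (x0, y0, x1, y1, line_words) tuples.
    """
    def lower_left(vertices):
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        return min(xs), max(ys)

    # Sort all words by the y of their lower-left corner and start a new
    # line whenever the y value jumps by more than the tolerance.
    words = sorted(words, key=lambda w: lower_left(w[1])[1])
    lines, current, last_y = [], [], None
    for text, vertices in words:
        _, y = lower_left(vertices)
        if last_y is not None and abs(y - last_y) > tolerance:
            lines.append(current)
            current = []
        current.append((text, vertices))
        last_y = y
    if current:
        lines.append(current)

    # Within each line, sort by the x of the lower-left corner and compute
    # the line bounding box from the minima/maxima of all x/y values.
    result = []
    for line in lines:
        line.sort(key=lambda w: lower_left(w[1])[0])
        xs = [v.get("x", 0) for _, verts in line for v in verts]
        ys = [v.get("y", 0) for _, verts in line for v in verts]
        result.append((min(xs), min(ys), max(xs), max(ys), line))
    return result
```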
Words written on non-horizontal lines have to be ignored for now; I'm not sure how hOCR handles such lines.
IMO the technical approaches outlined here would not just be transformations between file formats; one would also have to do layout analysis. I tried a multi-column image, and the recognized words are simply ordered from top to bottom and from left to right. However, as said before, for some well-defined document classes (e.g. one column, not skewed, ...) the transformation might be doable.
Thank you for the replies.
I tried to extract the words and coordinates from the Google Cloud Vision output (test.jpg.json.txt).
I am not good at coding, and this program (test2.c.txt) is quite primitive.
I would appreciate any help.
Comparing with the Tesseract hOCR output (out.hocr.txt), many headers would need to be added, but I think this line is the most important:
<span class='ocrx_word' id='word_1_1' title='bbox 81 103 253 173; x_wconf 85' lang='eng' dir='ltr'>This</span>
This line contains the word and its coordinates and could replace the output from test2.c.txt.
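As a sketch, one such ocrx_word line could be produced from a single GCV word annotation like this; the id numbering is a placeholder, and x_wconf is left out because the classic word annotations carry no per-word confidence:

```python
def word_to_hocr(word, word_id):
    """Build one hOCR ocrx_word span from a single GCV word annotation."""
    xs = [v.get("x", 0) for v in word["boundingPoly"]["vertices"]]
    ys = [v.get("y", 0) for v in word["boundingPoly"]["vertices"]]
    return ("<span class='ocrx_word' id='word_1_{}' "
            "title='bbox {} {} {} {}'>{}</span>").format(
        word_id, min(xs), min(ys), max(xs), max(ys), word["description"])
```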
I made gcv2hocr on GitHub:
https://github.com/dinosauria123/gcv2hocr
gcv2hocr converts Google Cloud Vision OCR output to hOCR in order to make a searchable PDF.
The code is far from complete, but it seems to work on my sample file.
Thank you for the valuable conversations!
GCV now supports a new output format with much more detail. This should make the transformation to hOCR easier.
https://cloud.google.com/vision/docs/fulltext-annotations
https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate
https://cloud.google.com/vision/docs/detecting-fulltext
https://developers.google.com/resources/api-libraries/documentation/vision/v1/python/latest/vision_v1.images.html
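For example, walking that richer structure could look like the following sketch, assuming a stored annotate response with a fullTextAnnotation as documented in the links above (the file name is only an example):

```python
import json

# "test.jpg.json" is only an example name for a stored annotate response.
with open("test.jpg.json") as f:
    response = json.load(f)

annotation = response["responses"][0]["fullTextAnnotation"]

# fullTextAnnotation is a hierarchy of pages > blocks > paragraphs > words >
# symbols, each level with its own boundingBox, so lines and paragraphs no
# longer have to be guessed from word coordinates alone.
for page in annotation["pages"]:
    for block in page["blocks"]:
        for paragraph in block["paragraphs"]:
            for word in paragraph["words"]:
                text = "".join(s["text"] for s in word["symbols"])
                vertices = word["boundingBox"]["vertices"]
                print(text, vertices)
```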
We could now also support conversion of GCV to PAGE. Would that be useful?
And maybe we should also add an example file for GCV.
I think it would be useful.
If you want to add gcv2hocr to hocr-tools, that is fine; feel free to use it.
Have a nice new year!