Cannot figure out which word corresponds to which confidence value. I hope the output API can be as rich as the Tesseract C++ API.

Question

GoogleCodeExporter opened this issue 10 years ago · 1 comments

GoogleCodeExporter commented 10 years ago

What steps will reproduce the problem?
1. String recognizedText = baseApi.getUTF8Text();
   int[] wordConfidences = baseApi.wordConfidences();
   List<Rect> rect_lines = baseApi.getTextlines().getBoxRects();

What is the expected output? What do you see instead?

The number of words counted in recognizedText should equal the number of 
entries in wordConfidences, but the two counts differ.
So I do not know how to match each word with its confidence.

What version of the product are you using? On what operating system?
tesseract-android-tools (its documentation says it is built on Tesseract 3.02)
Ubuntu

Please provide any additional information below.
First, many thanks for this useful tool.
I want to examine each line of the recognized text, each word in it, and each 
word's confidence value.

For example, I am trying to recognize digits printed in a special font.
The recognized text is:
///////////////////////////////
\n 0 0 - - -\n
\n
 -0630000470 898005714972- -\n
\n
    - -    -       - - 5 -
/////////////////////////
And BTW, why are there "-" outputs? I only trained 0123456789, the ten digits.

I assume that spaces separate words. However, counted this way, the number of 
words differs from the length of baseApi.wordConfidences().
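
To make the mismatch concrete, here is a minimal sketch of the comparison in plain Java. It assumes that "word" means a whitespace-separated token, which may not be how Tesseract segments words internally. The class and method names are hypothetical, and the confidence values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class WordCountCheck {
    /** Splits OCR text on runs of whitespace (spaces, tabs, newlines), dropping empty tokens. */
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** True only when there is exactly one confidence per whitespace-separated token. */
    static boolean countsMatch(String text, int[] confidences) {
        return splitOnWhitespace(text).size() == confidences.length;
    }

    public static void main(String[] args) {
        // Text shaped like the output quoted above; in the app it would come from getUTF8Text().
        String recognizedText = "\n 0 0 - - -\n\n -0630000470 898005714972- -\n";
        // Hypothetical values; in the app they would come from wordConfidences().
        int[] wordConfidences = {80, 75, 60, 58, 61, 90};

        List<String> tokens = splitOnWhitespace(recognizedText);
        System.out.println("tokens: " + tokens.size()
                + ", confidences: " + wordConfidences.length);
        if (!countsMatch(recognizedText, wordConfidences)) {
            System.out.println("Counts differ, so index-based pairing is not safe.");
        }
    }
}
```

When the two counts differ, pairing tokens and confidences by index would silently attach confidences to the wrong words, so a check like this is only a guard and not a fix.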

I looked at hOCR.html, which clearly shows each line, each word in each line, 
and each word's confidence value and bounding box.

Is it possible to get the output in a similar structure, for example as an array or an ArrayList?
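
One possible shape for such output, sketched in plain Java by parsing an hOCR string into a list of lines, each holding its words. This is only an illustration under assumptions: it expects each word to be a span of class ocrx_word whose title attribute carries both bbox and x_wconf, and the markup produced by a given Tesseract version may differ, so the patterns would need adjusting. The class names HocrWords and Word are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HocrWords {
    /** One recognized word with its bounding box and confidence. */
    static final class Word {
        final String text;
        final int left, top, right, bottom, confidence;

        Word(String text, int left, int top, int right, int bottom, int confidence) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
            this.confidence = confidence;
        }
    }

    // Assumed markup: every text line opens with <span class='ocr_line' ...>.
    private static final Pattern LINE_START =
            Pattern.compile("<span class=['\"]ocr_line['\"]");

    // Assumed markup: <span class='ocrx_word' ... title='bbox L T R B; x_wconf C'>text</span>
    private static final Pattern WORD = Pattern.compile(
            "<span class=['\"]ocrx_word['\"][^>]*?title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+);"
            + " x_wconf (\\d+)['\"][^>]*>(.*?)</span>");

    /** Returns one inner list per hOCR line, in document order. */
    static List<List<Word>> parse(String hocr) {
        List<List<Word>> lines = new ArrayList<>();
        String[] chunks = LINE_START.split(hocr);
        // chunks[0] is whatever precedes the first line, so it is skipped.
        for (int i = 1; i < chunks.length; i++) {
            List<Word> words = new ArrayList<>();
            Matcher m = WORD.matcher(chunks[i]);
            while (m.find()) {
                // Drop inline tags such as <strong> that may wrap the word text.
                String text = m.group(6).replaceAll("<[^>]+>", "");
                words.add(new Word(text,
                        Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                        Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
                        Integer.parseInt(m.group(5))));
            }
            lines.add(words);
        }
        return lines;
    }
}
```

Empty or whitespace-only words are kept on purpose, so each confidence value stays attached to the entry it was reported for. A caller can then loop over parse(hocr) and read text, confidence, and the box fields of every word, line by line.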

Thanks a lot.
Best

Original issue reported on code.google.com by CodingPo...@gmail.com on 16 Nov 2012 at 2:39

Answer 1 · 2015-03-08T08:53:50.000Z

Sorry. After looking at the hOCR.html produced by the tesseract 3.02.02 
command, I understand why.
When there are spaces between two characters, hOCR shows that a space is 
sometimes treated as a word separator, sometimes as a plain space, and 
sometimes as an empty word. So it is very hard to tell which word corresponds 
to which line and which bounding box.

It seems it would be better for tesseract-android-tools to expose an API for 
structured output, so that we could know which words each line contains, and 
which confidence value and bounding box belong to each word.

P.S. I apologize for my mistake in claiming there should be no "-" outputs. I 
also trained "-" and forgot to exclude it.

Thanks.

Original comment by CodingPo...@gmail.com on 16 Nov 2012 at 8:08