Cannot figure out which word corresponds to which confidence value. Hope the output API can be as rich as the Tesseract C++ API.
GoogleCodeExporter opened this issue · 1 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
1. String recognizedText = baseApi.getUTF8Text();
int[] wordConfidences = baseApi.wordConfidences();
List<Rect> rect_lines = baseApi.getTextlines().getBoxRects();
What is the expected output? What do you see instead?
The number of words counted in recognizedText should equal the number of
entries in wordConfidences, but the two differ, so I do not know how to match
each word with its confidence.
What version of the product are you using? On what operating system?
tesseract-android-tools (its documentation says it is built on Tesseract 3.02)
Ubuntu
Please provide any additional information below.
First, many thanks for this useful tool.
I want to examine each line and each word of the recognized text, together
with its confidence value.
For example, I am trying to recognize digits in a special font.
The recognized text is:
///////////////////////////////
\n 0 0 - - -\n
\n
-0630000470 898005714972- -\n
\n
- - - - - 5 -
/////////////////////////
By the way, why does the output contain "-"? I only trained the ten digits,
0123456789.
I assume that spaces separate words. However, counted this way, the number of
words differs from the length of baseApi.wordConfidences().
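A minimal sketch of the count check described above, in plain Java with no
Android dependency. WordCountCheck, splitWords, and countsMatch are
hypothetical helper names, the sample string only approximates the output shown
above, and the confidence values are made up. In the real app the two inputs
would come from baseApi.getUTF8Text() and baseApi.wordConfidences().

```java
import java.util.ArrayList;
import java.util.List;

final class WordCountCheck {
    // Splits the recognized text on runs of whitespace (spaces and newlines)
    // and drops empty tokens, which is the "space separates words" assumption.
    static List<String> splitWords(String recognizedText) {
        List<String> words = new ArrayList<>();
        for (String token : recognizedText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // True only when every token could be paired with exactly one confidence.
    static boolean countsMatch(String recognizedText, int[] wordConfidences) {
        return splitWords(recognizedText).size() == wordConfidences.length;
    }

    public static void main(String[] args) {
        // Approximation of the recognized text; not the exact engine output.
        String text = "\n 0 0 - - -\n\n-0630000470 898005714972- -\n\n- - - - - 5 -";
        int[] confidences = {90, 85, 40}; // made-up values for illustration
        System.out.println(splitWords(text).size() + " tokens, "
                + confidences.length + " confidences");
        System.out.println("counts match: " + countsMatch(text, confidences));
    }
}
```

If countsMatch returns false, pairing tokens and confidences by index is not
safe, which is the situation reported here.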
I had a look at hOCR.html, which gives a clear view of each line, each word in
each line, and each word's confidence value and bounding box.
Is it possible to output a similar structure, for example as an array or an
ArrayList?
Thanks a lot.
Best
Original issue reported on code.google.com by CodingPo...@gmail.com
on 16 Nov 2012 at 2:39
GoogleCodeExporter commented
Sorry. After looking at the hOCR.html produced by the tesseract 3.02.02
command, I understand why.
When there are spaces between two characters, hOCR shows that a space is
sometimes treated as a separator, sometimes as a space, and sometimes as an
empty word. So it is very hard to know which word corresponds to which line and
which bounding box.
It would be better for tesseract-android-tools to provide an API for this
output, so that we could know which words each line contains, and which
confidence value and bounding box correspond to each word.
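As a rough illustration of the kind of output meant here, the sketch below
pulls each word, its confidence, and its bounding box out of an hOCR string
into an ArrayList. It is plain Java and not part of tesseract-android-tools.
HocrWords and Word are hypothetical names, and the markup layout it expects
(an ocrx_word span whose title holds "bbox x0 y0 x1 y1; x_wconf N") is an
assumption: hOCR markup differs between Tesseract versions, so the pattern
would need adjusting to the actual file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HocrWords {
    // One recognized word with its confidence and bounding box.
    static final class Word {
        final String text;
        final int confidence;
        final int left, top, right, bottom;

        Word(String text, int confidence, int left, int top, int right, int bottom) {
            this.text = text;
            this.confidence = confidence;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    // Assumed layout: <span class='ocrx_word' ... title='bbox l t r b; x_wconf c'>text</span>
    private static final Pattern WORD = Pattern.compile(
            "<span class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+); "
            + "x_wconf (\\d+)['\"][^>]*>(.*?)</span>");

    // Returns the words in document order. Empty words are kept (with empty
    // text) so that every entry still carries its own confidence and box.
    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop any inline tags (for example <strong>) inside the word span.
            String text = m.group(6).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(5)),
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }
}
```

Because each Word carries its own confidence and box, nothing has to be matched
by counting spaces, and an empty word is visible as an entry whose text is
empty.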
P.S. I apologize for my mistake in claiming there should be no "-" in the
output. I also trained "-" and forgot to exclude it.
Thanks.
Original comment by CodingPo...@gmail.com
on 16 Nov 2012 at 8:08