Blazemeter/CitrixPlugin

OCR recognition error

cnico opened this issue · 8 comments

cnico commented

Hi,

I have many numbers to recognize (they are variables in my scenario), and OCR recognition regularly fails on them.
Here is an example, shown in the attached screenshot:
[screenshot]

The OCR does not recognize the correct number of zeros:
30739070800013 is the expected value.
3073907080013 is the recognized value.

Another one:
[screenshot]
30761630000017 expected
3076163000017 recognized

Another one:
[screenshot]
30243218200012 expected
0243218200012 recognized

Hope this helps you to improve the OCR engine.

3dgiordano commented

Hi @cnico

That's strange; it hasn't happened to me. I would need more information to be able to reproduce it.

Check whether DPI scaling is the problem; there are some antialiasing effects that can affect the image hash.

Sometimes, if the selection is made against an edge and there is antialiasing, errors appear; you should analyze whether this is the case. You can test whether it is an "edge" problem: move the selection by one pixel and see if the error appears again.

A couple of recommendations:

Disable DPI scaling in Citrix
https://github.com/Blazemeter/CitrixPlugin/blob/master/TROUBLESHOOTING.md#image-hash-or-ocr-fails-on-different-machine-resolution

Disable some visual effects in the Windows Desktop Window Manager (dwm.exe)
https://github.com/Blazemeter/CitrixPlugin/blob/master/TROUBLESHOOTING.md#high-cpu-usage-on-dwmexe-process

Try disabling DPI scaling first. If it stops happening to you, please let us know.
Today we do not modify the configuration of the Citrix client, but it is something we are evaluating if DPI scaling generates this type of error in the image hash.

3dgiordano commented

Hi @cnico

The error you mentioned reminded me that I made some adaptations to try to handle those slight differences caused by DPI, and that logic is not used in the sample code of issue #52 (the default runtime probably handles that).

I updated the code in the gist from issue #52 to handle that, using the same method used in the runtime.
https://gist.github.com/3dgiordano/ce37b4e722911a0e1663f9ebbee0e9eb#file-beanshell_jmeter-java-L5

You need to add two new imports:

import com.blazemeter.jmeter.citrix.clause.CheckType;
import com.blazemeter.jmeter.citrix.clause.Clause;

and create a clause based on the hash to find.
https://gist.github.com/3dgiordano/ce37b4e722911a0e1663f9ebbee0e9eb#file-beanshell_jmeter-java-L25

and replace the equality comparison between hashes with the clause evaluator.
https://gist.github.com/3dgiordano/ce37b4e722911a0e1663f9ebbee0e9eb#file-beanshell_jmeter-java-L40

You can probably handle the small differences between hashes with that.
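
As an illustration only (this is not the plugin's Clause API; the gist above shows the actual calls), the idea of tolerating a small bit difference between two long hashes can be sketched in plain Java, with the threshold value purely hypothetical:

import java.math.BigInteger;

public class HashToleranceSketch {
    // Hamming distance: the number of bits that differ between the two hashes.
    static int bitDifference(BigInteger expected, BigInteger actual) {
        return expected.xor(actual).bitCount();
    }

    public static void main(String[] args) {
        // 40-digit example hash of the kind current plugin versions produce;
        // the flipped bit simulates a small DPI/compression artifact.
        BigInteger expected = new BigInteger("5441858075450394804047460577263589586942");
        BigInteger actual = expected.flipBit(3);
        int tolerance = 4; // hypothetical threshold, not the plugin's actual value
        boolean matches = bitDifference(expected, actual) <= tolerance;
        System.out.println("matches = " + matches); // prints: matches = true
    }
}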

In any case, it is good to know that DPI adjustments can generate these differences and may cause others.
I recommend that you use the changes in the sample gist, and that you also look into disabling DPI scaling.

Tell me if this solves the problem.

cnico commented

Hi @3dgiordano ,

First, thank you for your support.

I applied the configuration you mentioned in the links related to DPI disabling and dwm.exe: I re-ran my test cases and got the same results, with assertion errors.
For your information, I also use a very old Citrix client (installed by my company policy), version 13.4.300.10, dating from around 2012. I have not upgraded it since I fear breaking other tools on my laptop.

It is not a blocking problem for me because I will simply ignore those errors: the assertion will simply time out and the scenario will continue normally with the next steps.

Regarding the sample code related to issue #52, I do not use it in my JMeter scripts for now since it is quite complex.

I simply wanted to report this problem to you in case you think the Citrix plugin could be improved.
If not, feel free to close it.

3dgiordano commented

Thanks @cnico

For me it is important to determine whether I can reproduce it, to understand what the possible problem is.
Any information you can provide would help me improve the plugin.

I notice that the hashes you provided are short.

Your hash: 30761630000017 (14 characters)
A common hash: 5441858075450394804047460577263589586942 (40 characters)

Are you using a recording made with an old version of the plugin?
What version of the plugin are you using?

An old version of the plugin used very short hashes, with low precision and no bit-difference support, which prevented it from working correctly.
Since the plugin tries to preserve the behavior of what was recorded by old versions, it detects when a hash is from the legacy version and uses the legacy code to compare it, which has flaws, but maintains the behavior with which it was recorded.

To migrate something recorded with a legacy version, simply replace the hash being compared with one of the new length; the comparison result shows which hash would be expected, so you can take it as a reference when updating.

That way you stop using the hash from the old algorithm and start using the hash from the new versions.
The new hash is not only more precise but also more robust against slight differences in the images caused by the image compression mechanism used by Citrix.

Tell me whether that JMX or those recorded fragments are from an old version, and which version of the plugin you are using.

cnico commented

I started the project in October and I use the latest version of the Citrix plugin: 0.7.4.

The numbers I gave you are not hash values but text values to be recognized by the OCR engine. The images are the actual parts of the application screen, taken from the Citrix interaction's View Results Tree.

cnico commented

I had a look at the Citrix plugin implementation and found that its OCR engine is based on Tess4J with the Tesseract implementation.
I saw that the Tess4J version used in the pom.xml and the Tesseract DLL version could be upgraded to 5.0.0. I do not know if it would improve the result...
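
If anyone wants to experiment with that locally, a minimal Tess4J sketch (not the plugin's code; the tessdata path and image file are placeholders) restricting recognition to digits and single-line segmentation, which sometimes helps with misread zeros, could look like this:

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class DigitOcrCheck {
    public static void main(String[] args) throws TesseractException {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tessdata"); // placeholder: location of the Tesseract language data
        tesseract.setLanguage("eng");
        tesseract.setPageSegMode(7); // treat the image as a single text line
        tesseract.setTessVariable("tessedit_char_whitelist", "0123456789"); // digits only

        // Placeholder image: e.g. a crop of the field that should read 30739070800013
        String text = tesseract.doOCR(new File("number.png")).trim();
        System.out.println("Recognized: " + text);
        System.out.println("Matches expected: " + "30739070800013".equals(text));
    }
}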

3dgiordano commented

Thanks @cnico

You have clarified the situation for me: the issue is an OCR bug behind Tesseract.

Sorry for my confusion; I was following the same line of thought as the other issue.

Now that the problem is clear, I can try to see whether it is possible to improve Tesseract's prediction when the number has multiple zeros.

Thanks for reporting the problem.

3dgiordano commented

A few versions ago we updated Tesseract to the latest version.
Investigating the issue further, we found that it is a very common error in OCR routines and not something easy to manage.
The problem lies within Tesseract, and possible solutions would require specific implementations to support detecting that particular font.
Currently there is no solution to the problem that is feasible to implement in the plugin.