paddleOCR returns shuffled outputs after recognizing images. For us to output correct key-value in medical test sheets, I design simple algorithm derived from paddleOCR outputs reasonably.
We can see in the figure, to detect from '结果', we have to find coordinates related to '结果' and key.
As an example, let's use 平均血红蛋白浓度(MCHC) item, to detect 329 correctly.
First, we have to understand that if we find one item has coordinates
Then, we will filter results. We filter both coulmns that contain '结果' via choosing the closer horizontal coordinates, after that we choose results that have the closer vertical coordinates. But how do we define 'closer'? I set threshold as epsilon
in code, the larger value it is, the more results we get after filtering.
Also, look closely, for each type of test sheet, results are not aligned the same as '结果', to address this, for each type, I set different values of 'middle', 'left' and 'right' with param as side
in code, in this example, side
should be set to 'middle', which utilizes bilinear method to compress 4 coordinates.
Finally, to get correct result, I propose to filter one more step. After filtering, we may get different results in the same column and the same row. To find more correct one, for each result, we must align key's horizontal coordinate weight_for_hori = 0.4
, so final residual should be
I left all above parameters to be tunable.
Before running the recognition code, this will rotate image back if the rotated degree is between
Args :
--data_dir: data directory, default is 'origin_pics'
--template: which template, you can specify the some directory
--img_format: image format, default is 'jpg'
--recognize_all: True for returning all usable results, else return the specific file
--return_name: default is None, but if you set recognize_all equals False, you must set return_name to be a specific value
--epsilon: default is 40, if you set epsilon too high or too low, the results may be not promising, the LARGET the BETTER
--weight_for_hori: horizontal line weight, default is 0.4
--side: 'middle' or 'left' or 'right', to tell where are texts and 结果 aligned
In the project file origin_pics contains all templates images results must contain all template results templates contain TXT files for each template every files 'templates' names must be the same.
To run the code, you can run the following command in shell:
python process.py OCR --data_dir='origin_pics' --template='h5' --img_format='jpg' --recognize_all=True --return_name='bloodtestreport2.jpg' --epsilon=40 --weight_for_hori=0.4 --side='middle'
NOTE: Results are influenced a lot by paddleOCR recognition ability and image resolution.
-- If paddle cannot recognize '结果' correctly, the results can be confused.
-- My algorithm still has some limitations, e.g., I can not use OpenCV to rotate image if image is mirrored, because OpenCV uses Hough transformation to detect lines, mirrored image still has 0 angle of lines.
-- For different styles of test sheets, we still have to set templates separately.
Veni,vidi,vici --Caesar