OCR quality on `chi_sim`

Question

englianhu opened this issue 4 years ago · 2 comments

I tried to OCR an image with the chi_sim model, but the quality is not very good: some characters are not recognized. Is there any way to improve accuracy?

> if(!require('BBmisc')) {
+   install.packages('BBmisc', dependencies = TRUE, INSTALL_opts = '--no-lock')
+ }
Loading required package: BBmisc

Attaching package: ‘BBmisc’

The following object is masked from ‘package:base’:

    isFALSE

> 
> suppressPackageStartupMessages(library('BBmisc'))
> # suppressPackageStartupMessages(library('rmsfuns'))
> 
> pkgs <- c('devtools', 'knitr', 'kableExtra', 'tint', 
+           'devtools','readr',   'lubridate', 'data.table', 
+           'feather', 'purrr', 'quantmod', 'tidyquant', 
+           'tibbletime', 'furrr', 'flyingfox', 'tidyr', 
+           'timetk', 'plyr', 'dplyr', 'stringr', 'magrittr', 
+           'tidyverse', 'memoise', 'htmltools', 'formattable', 
+           'zoo', 'forecast', 'seasonal', 'seasonalview', 
+           'rugarch', 'rmgarch', 'mfGARCH', 'sparklyr', 
+           'microbenchmark', 'dendextend', 'lhmetools', 
+           'stringr', 'pacman', 'tesseract')
> # https://github.com/mpiktas/midasr
> # https://github.com/onnokleen/mfGARCH
> # devtools::install_github("business-science/tibbletime")
> # devtools::install_github("DavisVaughan/furrr")
> 
> suppressAll(lib(pkgs))
> 
> # https://stackoverflow.com/a/24521657/3806250
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Sys.setlocale(category = "LC_ALL", locale = "chs")
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
> ## https://stackoverflow.com/a/24521657/3806250
> lnk <- 'https://jinshuju.net/f/pVKgV3'
> 
> # NOT RUN { 
> # https://www.rdocumentation.org/packages/tesseract/versions/4.1/topics/tesseract_download
> if(is.na(match('chi_sim', tesseract_info()$available)))
+   tesseract_download('chi_sim') 
> if(is.na(match('chi_sim_vert', tesseract_info()$available)))
+   tesseract_download('chi_sim_vert')
> 
> 
> chi_sim <- tesseract('chi_sim')
> text <- tesseract::ocr('https://gd-pub.jinshujufiles.com/di/20180308130431_f4fead', engine = chi_sim)
> cat(text)
吕]                                                           要    本
四                   SB    外    良
全中通速递详情秘 人IIIIIIHIILIIIINIIINI|
)    砚更”通过ISO9001:2000国际质量体系认证
中 通       http://www.zto.cn           Ooo10256518135、*
全                     本
EEC， GO nm | |
人人详址，                      收件人详址:         示            中，
》   |  上 :重庆市永川区  人
AAA人人入           -                       二一              全            |
广东省广州市 白云区 东平村 _    南大街 兴南路 观南城3栋 和有。
单位名称:                        单位名称:                        本、
)   Company  Company   四
， 由风攻Frame:15880302646昨名: 510000 几电Bisam 13167939801必纺: :402181I   和
请在签字前阅读背书条款，贵 | 名化说明，         本          仙 和
》     重物品请保价，未保价物品的理赔                                   重 量            NE于4          了二
癸最高为次并的5入。        是 目 时         |
号 zw        配件
寄件人签名，         经办人签名:      备注;   Re     | 人
   Senderssign        Operators sign     Remarks       Charge       所RNX     有 多
寄件日期，    书市         加风上
ee    4月1日 时    月日 时| 遇     |人 站
和                  其让     FE        与 名       |
二 680102618135 wa           sn
   下  写!       !
器                             请用力正楷填写! PRESS HARD                                  器

Originally posted by @englianhu in #146 (comment)

Answer 1 · 2022-02-28T02:58:00.000Z

I believe the Chinese language model needs to be retrained. I know this reply comes after a long time. Have you found a way to fix it?
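Before retraining, it may be worth cleaning up the image first, since the tesseract package vignette suggests preprocessing with the magick package. The following is only a minimal sketch under that assumption: `preprocess_for_ocr` is a hypothetical helper name, and the resize geometry and fuzz value are illustrative guesses rather than tuned settings for this waybill image.

```r
# Sketch: clean up an image with magick before passing it to tesseract.
# Assumes the magick package is installed; the defaults are untuned guesses.
preprocess_for_ocr <- function(path, width = "2000x", fuzz = 40) {
  img <- magick::image_read(path)
  img <- magick::image_resize(img, width)                # enlarge small text
  img <- magick::image_convert(img, type = "Grayscale")  # drop colour noise
  img <- magick::image_trim(img, fuzz = fuzz)            # crop uniform borders
  # Write a PNG to a temporary file that tesseract::ocr() can read
  out <- tempfile(fileext = ".png")
  magick::image_write(img, path = out, format = "png", density = "300x300")
  out
}

# Usage (not run: needs network access and the chi_sim model):
# chi_sim <- tesseract::tesseract("chi_sim")
# png <- preprocess_for_ocr("https://gd-pub.jinshujufiles.com/di/20180308130431_f4fead")
# cat(tesseract::ocr(png, engine = chi_sim))
```

Whether this helps depends on the image. A form with ruled lines and mixed Chinese and Latin text, like the one above, may still need a different model or retraining.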

Answer 2 · 2022-08-30T16:34:52.000Z

Updated Data Files (September 15, 2017)
We have three sets of .traineddata files on GitHub in three separate repositories. These are compatible with Tesseract 4.0x+ and 5.0.0.Alpha.

| Repository | Trained models | Speed | Accuracy | Supports legacy | Retrainable |
|---|---|---|---|---|---|
| tessdata | Legacy + LSTM (integerized tessdata_best) | Faster than tessdata_best | Slightly less accurate than tessdata_best | Yes | No |
| tessdata_best | LSTM only (based on langdata) | Slowest | Most accurate | No | Yes |
| tessdata_fast | Integerized LSTM of a smaller network than tessdata_best | Fastest | Least accurate | No | No |

Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.
tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used for certain retraining scenarios for advanced users.
The third set in tessdata is the only one that supports the legacy recognizer. The 4.00 files from November 2016 have both legacy and older LSTM models. The current set of files in tessdata have the legacy models and newer LSTM models (integer versions of 4.00.00 alpha models in tessdata_best).
Note: When using the new models in the tessdata_best and tessdata_fast repositories, only the new LSTM-based OCR engine is supported. The legacy engine is not supported with these files, so Tesseract’s oem modes ‘0’ and ‘2’ won’t work with them.

I am trying to download a few different OCR models to analyse https://gd-pub.jinshujufiles.com/di/20180308130431_f4fead, but how do I download them?

## https://github.com/tesseract-ocr/tessdata
if(is.na(match('chi_sim.traineddata', tesseract_info()$available)))
   tesseract_download('tesseract-ocr/tessdata/chi_sim.traineddata') 
 Downloaded: 0.10 MB  
Error: Download failed: HTTP 404

if(is.na(match('chi_sim_vert.traineddata', tesseract_info()$available)))
   tesseract_download('tesseract-ocr/tessdata/chi_sim_vert.traineddata')
 Downloaded: 0.10 MB  
Error: Download failed: HTTP 404


## https://github.com/tesseract-ocr/tessdata_best
if(is.na(match('chi_sim.traineddata', tesseract_info()$available)))
   tesseract_download('tesseract-ocr/tessdata_best/chi_sim.traineddata') 
 Downloaded: 0.10 MB  
Error: Download failed: HTTP 404

if(is.na(match('chi_sim_vert.traineddata', tesseract_info()$available)))
   tesseract_download('tesseract-ocr/tessdata_best/chi_sim_vert.traineddata')
 Downloaded: 0.10 MB  
Error: Download failed: HTTP 404


## https://github.com/tesseract-ocr/tessdata_fast
if(is.na(match('chi_sim.traineddata', tesseract_info()$available)))
   tesseract_download('tesseract-ocr/tessdata_fast/chi_sim.traineddata') 
 Downloaded: 0.10 MB  
Error: Download failed: HTTP 404

if(is.na(match('chi_sim_vert.traineddata', tesseract_info()$available)))
   tesseract_download('tesseract-ocr/tessdata_fast/chi_sim_vert.traineddata')
 Downloaded: 0.10 MB  
Error: Download failed: HTTP 404
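A possible explanation for the 404 errors: `tesseract_download()` is documented as taking a language code such as `'chi_sim'`, so passing a repository path probably produces an invalid URL. One workaround is to fetch the `.traineddata` file directly and point `tesseract()` at its directory through the `datapath` argument. The sketch below assumes the three repositories serve raw files from a branch named `main`. `traineddata_url` and `download_traineddata` are hypothetical helper names, not part of the tesseract package.

```r
# Build the raw-download URL for a traineddata file.
# Assumption: each repository serves files from a branch named "main".
traineddata_url <- function(lang,
                            repo = c("tessdata_best", "tessdata_fast", "tessdata"),
                            branch = "main") {
  repo <- match.arg(repo)
  sprintf("https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata",
          repo, branch, lang)
}

# Download one model into its own directory, so it does not overwrite
# the default chi_sim model, and return that directory.
download_traineddata <- function(lang, repo = "tessdata_best",
                                 datapath = file.path(tempdir(), repo)) {
  dir.create(datapath, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(datapath, paste0(lang, ".traineddata"))
  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary file intact on Windows
    utils::download.file(traineddata_url(lang, repo), dest, mode = "wb")
  }
  datapath
}

# Usage (not run: needs network access):
# best_dir <- download_traineddata("chi_sim", "tessdata_best")
# chi_best <- tesseract::tesseract("chi_sim", datapath = best_dir)
# cat(tesseract::ocr("https://gd-pub.jinshujufiles.com/di/20180308130431_f4fead",
#                    engine = chi_best))
```

Keeping each model set in a separate directory makes it possible to build one engine per repository and compare their output on the same image.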
