tesseract-ocr/tesseract

tesstrain.sh doesn't support vertical languages

davidb1 opened this issue · 8 comments

Environment

  • Tesseract Version:
    tesseract 5.0.0-alpha-685-g3a3c4
     leptonica-1.75.3
      libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
  • Platform: Ubuntu 18.04

Current Behavior:

When passing a _vert language to tesstrain.sh in --lang

tesstrain.sh --lang lang_vert

it throws an error:

ERROR: Error: lang_vert is not a valid language code

as per this line

*) err_exit "Error: ${lang} is not a valid language code"

Expected Behavior:

Either throw a more specific error, such as ERROR: Error: vertical languages aren't supported,
or add a config to generate data for vertical languages.

Suggested Fix:

Add a case for _vert languages in https://github.com/tesseract-ocr/tesseract/blob/d8d2f6f48a8ddaf0b668eb1abf18fd6d08470041/src/training/language-specific.sh
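
A minimal sketch of what such a case could look like. The function name check_lang_code and the short list of language codes are hypothetical stand-ins, not the actual contents of language-specific.sh. The only point is that a *_vert pattern placed before the catch-all *) branch produces the more specific message.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the language switch in language-specific.sh.
# Prints an error to stderr and returns non-zero for unsupported codes.
check_lang_code() {
    local lang="$1"
    case "${lang}" in
        # Must come before the catch-all so *_vert codes get their own message.
        *_vert)
            echo "ERROR: Error: vertical languages aren't supported" >&2
            return 2 ;;
        # Illustrative subset of valid codes only (assumption).
        eng | jpn | kor | chi_tra | chi_sim)
            echo "ok: ${lang}" ;;
        # Existing behaviour: anything else is rejected as invalid.
        *)
            echo "ERROR: Error: ${lang} is not a valid language code" >&2
            return 1 ;;
    esac
}
```

In the real script the failing branches would presumably call err_exit rather than return, as the existing catch-all does.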

Vertical languages seem to be supported indirectly based on font names.

Please see:

# The following fonts will be rendered vertically in phase I.
VERTICAL_FONTS=( \
    "TakaoExGothic" \ # for jpn
    "TakaoExMincho" \ # for jpn
    "AR PL UKai Patched" \ # for chi_tra
    "AR PL UMing Patched Light" \ # for chi_tra
    "Baekmuk Batang Patched" \ # for kor
)

and

# add --writing_mode=vertical-upright to common_args if the font is
# specified to be rendered vertically.
for vfont in "${VERTICAL_FONTS[@]}"; do
  if [[ "${font}" == "${vfont}" ]]; then
    common_args+=" --writing_mode=vertical-upright "
    break
  fi
done

Try adding your font to the vertical fonts list as well as to the language's fonts list, then run the training again.
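
To make the mechanism concrete, here is a small self-contained sketch of the font check quoted above. The helper name writing_mode_args and the font "My Vertical Font" are hypothetical. Only VERTICAL_FONTS, the listed font names, and --writing_mode=vertical-upright come from the script excerpt.

```shell
#!/usr/bin/env bash
# Vertical fonts: the list from the excerpt plus one hypothetical addition.
VERTICAL_FONTS=(
    "TakaoExGothic"
    "TakaoExMincho"
    "AR PL UKai Patched"
    "AR PL UMing Patched Light"
    "Baekmuk Batang Patched"
    "My Vertical Font"   # hypothetical: the font you want rendered vertically
)

# Hypothetical helper: prints the extra rendering argument when the font is
# in VERTICAL_FONTS, and prints nothing otherwise.
writing_mode_args() {
    local font="$1" vfont
    for vfont in "${VERTICAL_FONTS[@]}"; do
        if [[ "${font}" == "${vfont}" ]]; then
            echo "--writing_mode=vertical-upright"
            return 0
        fi
    done
}

writing_mode_args "My Vertical Font"   # matches the list, so the flag is printed
writing_mode_args "Arial"              # not in the list, so nothing is printed
```

The match is an exact string comparison, so the name added to the list has to be spelled exactly as it is in the language's fonts list.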

@Shreeshrii So are the _vert languages made just by training on vertical fonts, or are there additional steps?

I have not trained any CJK languages or any other scripts requiring vertical fonts. I just pointed out what I found by searching for "vert" in the training script.

I suggest you give it a try. If you have more vertical fonts, they need to be added to both lists.

You can try contacting @zodiac3539 for pointers, see https://github.com/zodiac3539/jpn_vert

I actually did try a couple months back but he doesn't wanna part with his secrets :)

Having the same issue here. I tried adding the font to the vertical fonts list, but all I get is:

Warning in pixScaleSmooth: ridiculously small scaling factor 0.010464
Image too small to scale!! (1x1 vs min width of 3)
Line cannot be recognized!!
Image not trainable

Is it possible to train vertical languages? How was the jpn_vert.traineddata file in the tessdata_best repo made?

Please see comment by Ray at #707 (comment)

So it's possible that the current code handles vertical text through layout analysis rather than through a separate language.

Fixed via PR #3223

See #3001 for discussion