tmbdev/clstm

Can clstm recognize special characters well?

wanghaisheng opened this issue · 1 comments

There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including "?"), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release.

In general, an LSTM+CTC configuration can recognize anything that appears in its training data, including "special" symbols (I'm doing ancient Greek and playing around with Arabic here). You have to make sure the input you want to handle is included in the training data; its absence is the reason the default ocropy model doesn't deal well with these inputs.
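Since coverage in the training data is what matters, it can help to check the ground truth for the symbols you care about before training. The following is only a minimal sketch: it assumes the transcriptions are already loaded as a list of strings, and the function names `char_coverage` and `missing_chars` are hypothetical helpers, not part of clstm or ocropy.

```python
from collections import Counter


def char_coverage(transcriptions, required):
    """Count how often each required character occurs in the ground truth.

    `transcriptions` is assumed to be an iterable of text lines.
    """
    counts = Counter(ch for line in transcriptions for ch in line)
    return {ch: counts.get(ch, 0) for ch in required}


def missing_chars(transcriptions, required, min_count=1):
    """Return the required characters seen fewer than `min_count` times."""
    coverage = char_coverage(transcriptions, required)
    return sorted(ch for ch, n in coverage.items() if n < min_count)


if __name__ == "__main__":
    lines = ["Hello world", "Is it?"]
    # Characters reported here would need more training samples.
    print(missing_chars(lines, "?@"))
```

In practice a symbol probably needs many more than one occurrence to be learned reliably, so `min_count` would be set well above its default.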

The "tricky" parts right now are training models that perform well (<1% error) on multiple fonts, and RTL scripts, which will need some preprocessing to reorder the label sequence.
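To illustrate the kind of reordering meant for RTL scripts, here is a rough sketch. It assumes the goal is to turn a transcription stored in logical (typing) order into left-to-right visual order, so that the labels line up with the pixel columns of the line image. That direction is my assumption, and `logical_to_visual` is a hypothetical helper, not a clstm or ocropy function. It is also a deliberate simplification and not an implementation of the full Unicode Bidirectional Algorithm: it reverses the line and then restores runs of Latin letters and digits.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # Hebrew-like and Arabic letters
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters and digits keep LTR order


def is_rtl_line(text):
    """True if the line contains at least one strongly RTL character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def logical_to_visual(text):
    """Naively reorder a logical-order line into left-to-right visual order."""
    if not is_rtl_line(text):
        return text
    out, run = [], []
    for ch in reversed(text):
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            run.append(ch)             # collect an embedded LTR run
        else:
            out.extend(reversed(run))  # restore the run's original order
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)
```

For mixed or nested directional text, a complete bidi implementation would be the safer choice; this sketch only covers the simple case of an RTL line with embedded numbers or Latin words.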