Adnan Ul-Hasan's PhD thesis, Chapter 8: A Generalized OCR Framework for Multi-Script Documents
Multilingual documents are common in today's computer age. A plethora of such documents exists in the form of translations, books, operational manuals, and so on. Their abundance in everyday life can be attributed to two main reasons.
Firstly, owing to globalization, technological advancements are reaching every corner of the world, and international customers increasingly need to access technology in their native languages. This phenomenon has a two-fold impact: 1) operational manuals of electronic gadgets are required in multiple languages, and 2) access to knowledge available in other languages has become very easy; thereby, an increase in bilingual books and dictionaries has been witnessed.
Secondly, English has become an international language, and the effect of this internationalization is evident from its impact on many languages. Several languages have adopted words from English, and various documents, for instance newspapers, magazines, and articles, use many English words on a daily basis. The need to develop reliable Multilingual OCR (MOCR) systems to digitize these documents has therefore grown manifold.
Despite the increasing availability of multilingual documents, automatic recognition of multilingual text remains a challenge. Popat [Pop12] pointed out several challenges in the context of the Google Books project. Some of these unique challenges are:
• Multiple scripts/languages on a single page.
• Multiple languages in the same or similar scripts, like Arabic-Persian and English-German.
• The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
• Archaic and reformed orthographies, for example, 18th-century English and Fraktur (historical German).
One solution for handling multilingual documents is to develop an OCR methodology that can recognize all characters of all scripts. However, it is commonly believed that such a generic OCR framework would be very difficult to realize [PD14]. The alternative (shown in Figure 8.1) is to employ a script identification step before recognizing the text. This step separates the various scripts present in a document so that a unilingual OCR model can be applied to each script. This procedure, however, is unsatisfactory for several reasons, some of which are listed below:
• Script identification is itself quite a challenging feat. Traditionally, it involves finding suitable features for the given script(s). One has to either fine-tune these hand-crafted features or look for other features if the same script identification methodology is to be used for further scripts.
• The script identification process (see Chapter 7) is not perfect, so the scripts recognized by such a process cannot be separated reliably. This directly affects the recognition accuracy of the OCR system employed.
• Moreover, humans do not process multilingual documents using a script identification step. A person with multilingual prowess reads a multilingual document in the same manner as he/she would read a monolingual document.
Hence, the ultimate aim in OCRing multilingual documents is to develop a generalized OCR system that can recognize all scripts. An MOCR system must be able to handle various scripts and must also be robust against intraclass variations, that is, it should recognize letters despite slight variations in their shapes and sizes.
Although the idea of a generalized OCR system is not new, it has not been pursued seriously because of the lack of computational power and of suitable algorithms for recognizing all characters of multiple scripts. However, recent advances in the machine learning and pattern recognition fields have shown great promise on many tasks that were once considered very difficult. Moreover, these learning strategies are claimed to mimic the neural networks of the human brain, so they should be able to replicate human capabilities better than earlier neural network architectures.
The main contribution of this chapter is a generalized OCR framework that can be used to OCR multilingual and multiscript documents without the need for the traditional script identification step. A sub-goal of this work is to highlight the discriminative power and sequence learning capability of LSTM networks for OCR tasks involving a large number of classes. A trained LSTM network can successfully discriminate hundreds of classes when it is trained on multiple scripts/languages simultaneously.
The rest of this chapter is organized as follows. Section 8.1 reports the work done by other researchers to develop generalized OCR systems for multilingual documents. Our quest for a generalized OCR system starts with the development of a single OCR model that can recognize multilingual text in which all languages belong to a single script; Section 8.2 discusses this cross-language performance of LSTM networks. The next step is to extend the idea of a "single OCR model" from multilingual documents to multiscript documents; a single OCR model that can recognize text in multiple scripts is the first step towards realizing a generalized OCR system. Section 8.3 describes the design of the LSTM-based generalized OCR framework in detail. Section 8.4 concludes the chapter with a brief summary and outlines some directions in which the present work can be further extended.
8.1 Traditional Approaches for MOCR
The usual approach to the MOCR problem is to somehow combine two or more separate classifiers [OHBA11]. This stems from the common belief that reasonable OCR output for a single script cannot be obtained without sophisticated post-processing steps such as language modeling, dictionary-based correction of OCR errors, font adaptation, and so on. Natarajan et al. [NLS+01] proposed an HMM-based script-independent MOCR system. The feature extraction, training, and recognition components of this system are all language independent; however, they used language-specific word lexicons and language models for recognition.
There have also been efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [SAL09] is one such example. Character recognition in Tesseract is based on hierarchical shape classification. The character set is first reduced to a few basic characters, and at the last stage the test sample is matched against representatives of the reduced set. Although Tesseract can be used for a variety of languages, it cannot serve as an all-in-one solution when multiple scripts are present together in a single document. Similar to Tesseract, the BBN BYBLOS system [LBK+98] can be trained for multiple languages; however, it too is incapable of recognizing multiple languages and scripts simultaneously.
To the best of our knowledge, no method proposed for MOCR achieves very low error rates without using sophisticated post-processing techniques. However, experiments on many scripts using LSTM networks have demonstrated that very good OCR results can be obtained without such techniques. The details of the LSTM-based language independent OCR framework are presented in the next section.
8.2 Language Independent OCR with LSTM Networks
Language models or recognition dictionaries are usually considered essential for OCR. However, using a language model complicates the training of OCR systems and narrows the range of texts on which a system can be used. Recent results have shown that LSTM-based OCR yields low error rates even without language modeling. This leads us to explore to what extent LSTM models can be used for MOCR without language models. To this end, we measure the cross-language performance of LSTM models trained on different languages. These models have exhibited great capability for language independent OCR: the recognition errors are very low (around 1%) without any language model or dictionary correction.
Our hypothesis for language independent OCR is that if a single model can be obtained for a script that is common to many languages, e.g., Latin, Arabic, or Devanagari, then this single model can be used to recognize text of that particular family. In doing so, the effort of combining multiple classifiers can be saved.
The basic aim of this work is to benchmark to what extent LSTM networks rely on language modeling to predict the correct labels, or whether they can do as well without any language modeling or other post-processing steps. Additionally, we want to see how well LSTM networks use contextual information to recognize a particular character.
8.2.1 Experiment Setup
To explore the cross-language performance of LSTM networks, a number of experiments have been performed. We trained four separate LSTM networks: for English, German, French, and a Mixed-Data set combining all three languages. For testing, there are a total of 16 permutations: each of the four LSTM models is tested on its own language as well as on the other three, for example, the LSTM network trained on German is tested on a separate German corpus and on the French, English, and Mixed-Data corpora. These results are detailed in Section 8.2.5.
As the error metric, the ratio of insertions, deletions, and substitutions relative to the ground truth (the Character Error Rate, CER) has been used, and accuracy is measured at the character level. This error metric is termed the 'Levenshtein distance' in the literature and is given by Equation 5.1.
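For reference, a standard formulation consistent with this description (what Equation 5.1 denotes), where i, d, and s are the counts of insertions, deletions, and substitutions in the Levenshtein alignment and n is the number of ground-truth characters, is:

```latex
\mathrm{CER} = \frac{i + d + s}{n}
```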
This section is further organized as follows. The next sub-section describes binarization and text-line normalization, which form the first step of the LSTM-based approach. Details on preparing the datasets for training and evaluating the LSTM models are given next. After the details on the database, the LSTM network parameters are given, and the results are presented at the tail end of this section.
8.2.2 Preprocessing
Binarization and text-line normalization form the preprocessing step in this experiment. Since synthetically generated text-lines are used in this work, binarization is carried out at the text-line generation step. Text-line normalization, however, is done separately. The scale and relative position of a character are important features for distinguishing characters in the Latin script (and some other scripts). Moreover, 1D-LSTM networks are not translation invariant in the vertical direction; text-line normalization is therefore an essential step when applying such networks. In this work, we have used the normalization approach introduced in [BUHAAS13] (see Appendix B for details), namely text-line normalization based on a trainable, shape-based model. A token dictionary created from a collection of text lines contains information about the x-height, baseline, and shape of individual characters. These models are then used to normalize any text-line image.
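The trainable, shape-based normalization is detailed in Appendix B. As a minimal stand-in that only illustrates the interface of such a step, the following sketch rescales a binarized text-line to a fixed height (plain proportional scaling, not the x-height/baseline-aware model of the thesis):

```python
import numpy as np
from PIL import Image

TARGET_HEIGHT = 32  # fixed input height expected by the 1D-LSTM

def normalize_text_line(img: Image.Image) -> np.ndarray:
    """Rescale a binarized text-line image to a fixed height.

    NOTE: plain proportional rescaling only; the thesis uses a
    trainable, shape-based model (x-height/baseline aware) instead.
    """
    w, h = img.size
    new_w = max(1, round(w * TARGET_HEIGHT / h))  # keep aspect ratio
    resized = img.convert("L").resize((new_w, TARGET_HEIGHT), Image.BILINEAR)
    # Scale pixel values to [0, 1]; one column = one LSTM time step.
    return np.asarray(resized, dtype=np.float32) / 255.0
```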
8.2.3 Database
To evaluate the proposed methodology, a separate synthetic database for each language is developed using the approach described in Chapter 4. Separate corpora of text-line images in the German, English, and French languages are generated from freely available online literature, using commonly used typefaces (including bold, italic, and italic-bold variations). These images are degraded using some of the degradation models described in [Bai92] to reflect scanning artifacts. Four degradation parameters, namely elastic elongation, jitter, sensitivity, and threshold, have been selected. Sample text-line images from our database are shown in Figure 4.6. Each database is further divided into training and test subsets. Statistics on the number of text-line images in the training and test corpora of each language are given in Table 4.2.
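The generation pipeline itself is described in Chapter 4. As a rough illustration only, the following sketch renders a binarized text-line image with Pillow and applies a toy per-column 'jitter'; the font path is hypothetical, and the actual degradation models of [Bai92] (elastic elongation, sensitivity, threshold) are considerably more involved:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_text_line(text: str, font_path: str = "DejaVuSans.ttf",
                     size: int = 48) -> Image.Image:
    """Render a binarized text-line image for one line of GT text."""
    font = ImageFont.truetype(font_path, size)
    bbox = font.getbbox(text)  # measure, then draw black on white
    img = Image.new("L", (bbox[2] + 20, bbox[3] + 20), color=255)
    ImageDraw.Draw(img).text((10, 10), text, font=font, fill=0)
    return img.point(lambda p: 0 if p < 128 else 255)  # binarize

def add_jitter(img: Image.Image, sigma: float = 0.5) -> Image.Image:
    """Toy 'jitter': displace each column vertically by a random offset."""
    a = np.asarray(img)
    out = np.full_like(a, 255)
    for x in range(a.shape[1]):
        dy = int(round(np.random.normal(0, sigma)))
        out[max(0, dy):a.shape[0] + min(0, dy), x] = \
            a[max(0, -dy):a.shape[0] - max(0, dy), x]
    return Image.fromarray(out)
```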
8.2.4 LSTM Architecture and Parameters
For the experiments carried out in this work, the 1D-BLSTM architecture has been utilized. We have found that this architecture performs better than more complex LSTM architectures for printed OCR tasks (please refer to Appendix A for further details about LSTM networks and their different variants). 1D-LSTM networks require text-line images of a fixed height, as they are not translation invariant in the vertical dimension. Therefore, "normalization" is employed to make sure that the sequence depth remains consistent for all inputs.
The text lines are normalized to a height of 32 in the preprocessing step. Both the left-to-right and right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate is set to 1e-4 and the momentum to 0.9. The training is carried out for one million steps (roughly corresponding to 10 epochs, given the size of the training set). Training errors are averaged after every 10,000 training steps, and the network corresponding to the minimum training error is used for the test set evaluation.
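For concreteness, the stated configuration could be sketched as follows in PyTorch; this framing is an assumption for illustration (the thesis experiments used their own LSTM implementation), and NUM_CLASSES is a placeholder for the actual character-set size:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 100  # placeholder: character-set size + 1 (CTC blank)

class BLSTMLineRecognizer(nn.Module):
    """1D bidirectional LSTM over a normalized text-line image.

    Each column of the 32-pixel-high image is one time step;
    100 memory blocks per direction, as in the experiments.
    """
    def __init__(self, height: int = 32, hidden: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(input_size=height, hidden_size=hidden,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, width, height) -- columns form the sequence axis
        out, _ = self.lstm(x)
        return self.proj(out).log_softmax(dim=-1)  # CTC expects log-probs

model = BLSTMLineRecognizer()
criterion = nn.CTCLoss(blank=0)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```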
While most other approaches use language modeling, font adaptation, and dictionary corrections as means to improve their results, LSTM networks have been shown to yield comparable results without these techniques. It should therefore be noted that the reported results are obtained without the aid of any of the above-mentioned post-processing steps. Moreover, no handcrafted features are used in training the LSTM networks to recognize multilingual text.
8.2.5 Results
The experimental results are listed in Table 8.1, and some sample outputs are presented in Table 8.2. Since there are no umlauts (German) or accented letters (French) in English, words containing those special characters are omitted from the recognition results when testing the LSTM model trained for German on French, and the model trained for English on French and German. If such words were not removed, the resulting errors would also contain a proportion of errors due to the erroneous recognition of characters that were not present when training the LSTM model for that language. Removing them allows the true cross-language performance of the trained LSTM networks to be measured.

The LSTM model trained on Mixed-Data obtains similar recognition results (around 1% recognition error) when applied to English, German, and French individually.
The remaining results indicate a slight language dependence, in that LSTM models trained for a single language yielded lower error rates when tested on the same language than when evaluated on the other languages.
To gauge the magnitude of the effect of language modeling, we compared our results with those of the Tesseract (version 3.02) open-source OCR system [Smi07]. The available models for the English, French, and German languages were evaluated on the same test data. The Tesseract system yields very high error rates (CER) compared to the LSTM models. It appears that the Tesseract models were not trained on certain fonts, resulting in more recognition errors on those fonts. The Tesseract OCR model for English yields 7.7%, 9.1%, and 8.3% CER when applied to French, German, and Mixed-Data respectively. The model for French returns 7.14%, 7.3%, and 6.8% CER when applied to English, German, and Mixed-Data respectively, while the model for German returns 7.2%, 8.59%, and 7.4% recognition error when applied to English, French, and Mixed-Data respectively. These results show that the absence of a matching language model in Tesseract degrades recognition considerably. Since no Mixed-Data model is available for Tesseract, the effect of evaluating such a model on the individual languages could not be computed.
8.2.6 Error Analysis
The results reported in this work demonstrate that LSTM networks can be used for MOCR. LSTM networks do not learn a particular language model internally (nor do we need any such model as a post-processing step). Moreover, they show great promise in learning the various shapes of a character across different fonts and degradations (as evident from our highly versatile data). A language dependence is observable, but the effects are small compared to other contemporary OCR methodologies, where the absence of language models leads to very poor results. To gauge the language dependence more precisely, one could evaluate LSTM networks by training them on data generated randomly from n-gram statistics and testing those models on natural languages.
In the following, we analyze the errors produced by the LSTM networks when applied to other languages. The top five confusions for each case are tabulated in Table 8.3. The case of applying an LSTM network to the same language for which it was trained is not discussed here, as it is not relevant to the cross-language performance of LSTM networks.
Most of the errors caused by the LSTM network trained on Mixed-Data are due to its failure to recognize certain characters like 'l', 't', 'r', and 'i'. These errors may be removed by increasing the training data so that it contains these characters in sufficient quantity.
Looking at the first column of Table 8.3 (applying the LSTM network trained for English to the other three languages), most of the errors are due to confusions between characters of similar shapes, like 'I' with 'l' (and vice versa), 'Z' with '2', and 'c' with 'e'. Two confusions, namely 'Z' with 'A' and 'Z' with 'L', are interesting, as there is apparently no shape similarity between them. However, if the 'Z' gets noisy due to scanning artifacts, it may look similar to an 'L'. Another explanation may be that 'Z' is the least frequent letter in English, so there may not be many 'Z's in the training data.
For the LSTM network trained on the German language (second column in Table 8.3), most of the top errors are due to the inability of the LSTM to recognize a particular character. The top errors when applying the LSTM network trained for French to the other languages are shape confusions of w/W with v/V. An interesting observation, which could explain this behaviour, is that the relative frequency of 'v' is higher than that of 'w' in German and English, while it is lower in French. So this is a language dependent issue, which is not observable in the case of Mixed-Data.
8.2.7 Conclusion
The application of LSTM networks to language independent OCR demonstrates that these networks are capable of learning many character shapes simultaneously; they can therefore be utilized to recognize multiple scripts at once. The next section reports a generalized OCR framework in which LSTM networks are used to recognize multiscript documents without the aid of a separate script identification module. LSTM networks have demonstrated great potential as a universal OCR engine for recognizing text in multiple languages and scripts.
8.3 Generalized OCR with LSTM Networks
Generalized OCR is the term used for an OCR system that can recognize text in multiple scripts and languages simultaneously. Encouraged by the promising results obtained by LSTM networks on language independent OCR for the Latin script, this section reports the extension of the same idea to recognizing text comprising multiple scripts, such that the traditionally employed script identification step can be avoided (see Figure 8.2).
The proposed methodology for generalized OCR is essentially the same as that for single-script or unilingual LSTM-based OCR. The sequence learning methodology for a single-script or unilingual OCR system involves training LSTM networks on a large corpus of text-line images for which GT information is given. The GT information contains the character labels, or an equivalent encoding, of a single script. An LSTM network is trained to learn the sequence-to-sequence mapping between a given text-line image and the associated ground-truth sequence.
In the proposed technique for OCRing multilingual documents, the GT data contains class labels representing the character sets of all scripts. LSTM networks are used as sequence learners on text-line images whose target labels are the alphabets of multiple scripts; the LSTM-based line recognizer learns the sequence-to-sequence mapping between any given text-line image and its multiscript target sequence.
Salient features of the proposed approach are as follows:
• No handcrafted features are used; instead, the LSTM network learns the features from the raw pixel values of the text-line images.
• No post-processing is done to correct OCR errors through language modeling, dictionary correction, or other such operations.
• Text is recognized at the text-line level, thereby requiring only text-lines to be extracted by the layout analysis step.
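As noted above, the GT encoding simply pools the character sets of all scripts into a single label space. The following minimal sketch shows such a combined Latin-Greek codec (the alphabet subsets are illustrative, and reserving index 0 for the CTC blank is a common convention assumed here, not taken from the thesis):

```python
# Build a single label space covering both scripts (illustrative subsets).
LATIN = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
GREEK = "αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ"
PUNCT = " .,;:!?'\"()-"

CHARSET = sorted(set(LATIN + GREEK + PUNCT))
# Index 0 is reserved for the CTC blank label.
CHAR_TO_LABEL = {c: i + 1 for i, c in enumerate(CHARSET)}
LABEL_TO_CHAR = {i: c for c, i in CHAR_TO_LABEL.items()}

def encode(gt_text: str) -> list[int]:
    """Map a (possibly mixed-script) GT line to CTC target labels."""
    return [CHAR_TO_LABEL[c] for c in gt_text if c in CHAR_TO_LABEL]

# A mixed Latin/Greek line maps into one shared label sequence:
print(encode("OCR for λογος and logos"))
```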
Our hypothesis in this work is that LSTM networks can be utilized to OCR multiple scripts using a single OCR model. This hypothesis is based on results reported in the literature on the use of LSTM networks for various sequence learning tasks. To test it, a single LSTM-based line recognizer is trained on a corpus containing multiple scripts (in our case, Latin and Greek).
To gauge the accuracy, the standard metric of Levenshtein distance is used (see Equation 5.1). The accuracy is measured at the character level and reported as an error rate. The experimental evaluation of the LSTM-based solution for generalized OCR is presented in the following sections.
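A minimal sketch of this character-level evaluation, assuming the standard dynamic-programming Levenshtein distance normalized by the ground-truth length (which matches the description of Equation 5.1 above), could look as follows:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(gt: str, ocr_output: str) -> float:
    """CER: edit distance relative to the ground-truth length."""
    return levenshtein(gt, ocr_output) / max(1, len(gt))

print(character_error_rate("multiscript", "multiscrlpt"))  # one substitution
```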
8.3.1 Preprocessing
As mentioned earlier, normalization is an important preprocessing step when applying 1D-LSTM networks. The filter-based normalization method, explained in Appendix B, is used for the experiments reported in this section. This method of text-line normalization is script independent and has been shown to work for both printed and handwritten text [YSBS15].
8.3.2 Database
One of the main issues in developing MOCR technology is the unavailability of standardized databases. A lot of work has been reported on script identification in multilingual documents; however, as mentioned previously, the datasets used therein are either private or no longer available. Therefore, to evaluate our hypothesis about generalized OCR, synthetically generated multilingual text-lines are used. Though synthetic data cannot fully replace real data for training, LSTM networks have shown the capacity to tolerate small variations in character shapes.
To train the LSTM networks, we used 90,000 synthetically generated text-line images in Serif and Sans-Serif fonts with normal, bold, italic, and bold-italic styles (see Figure 8.3 for some example images). The process of generating artificial text-line images is explained in Chapter 4.
Since these text-lines are taken from natural documents, they contain the natural variation of the scripts: some text lines contain only one script, while others contain a good distribution of words from multiple scripts.
8.3.3 LSTM Architecture and Parameters
The architecture of the LSTM-based line recognizer used for the generalized OCR methodology is shown in Figure 8.4. It is basically the same architecture that has been used throughout this thesis. The 1D-LSTM-based OCR system has a small number of tunable parameters; the important ones are the number of hidden layers and the number of LSTM cells per layer. In this work, we used a single hidden layer with 100 LSTM memory cells in each of the right-to-left and left-to-right layers (corresponding to the bidirectional mode). The other parameters are the learning rate (set to 1e-04) and the momentum (set to 0.9). These values are the same as those used in [BUHAAS13], since the performance of an LSTM line recognizer is fairly insensitive to other choices.
The network was trained for 1 million iterations (please refer to Figure 8.5 to see how the training error evolves), and the intermediate models were saved periodically; the model with the minimum training error was used for evaluation.
8.3.4 Results
To test the generalization capability of the trained LSTM network, 9,900 synthetically generated text-line images (produced with the same methodology described in Section 8.3.2) are used. The LSTM-based generalized OCR model yields an error rate of 1.28% on this test data. A sample image that is correctly recognized by the best trained model is shown in Figure 8.6, and an image on which the same model fails to predict the correct labels is shown in Figure 8.7.
8.3.5 Error Analysis
Top confusions of the proposed generalized MOCR approach are tabulated in Table 8.4. It can be observed that many errors are due to the insertion or deletion of punctuation marks and the deletion of 'space'. This is understandable because of the small size of punctuation marks. The other source of errors is confusion between similar characters of the two scripts. Letter pairs such as o/ο, X/Χ, and O/Ο (the first character of each pair is Latin, the second Greek) are indistinguishable even to human eyes; knowing the context makes such a character easier to recognize. However, when these characters occur in conjunction with punctuation marks, recognition becomes difficult, because in that case context does not help the LSTM network. To elaborate this point further, consider Table 8.5, where confusions are shown together with their neighboring context. There are many instances of similar characters accompanied by punctuation marks. These pairs make the contextual processing of the neural network difficult, resulting in substitution errors. It must be noted, however, that these errors stem from the similarity of some characters in the two scripts: Latin and Greek share around 16 character shapes. Such errors would not be present for scripts that are markedly different in their grapheme structure, e.g., English and Devanagari, or Latin and Arabic.
8.4 Conclusion
This chapter validates the capability of LSTM networks for the OCR of multilingual (multiple languages and multiple scripts) documents. As a first step, a single LSTM model is trained on a mixture of three European languages of a single script, namely English, German, and French. The OCR model thus obtained produces very low error rates in recognizing text of these languages without using any post-processing techniques, such as language modeling or dictionary correction. A language dependence is observed as a reduction in character recognition accuracy compared to single-language OCR; however, the effect is small in comparison to other OCR methods.
The idea behind language independent OCR is then extended to a generalized OCR framework that can OCR multilingual documents comprising multiple scripts. The presented methodology is, by design, very similar to that of a single-script OCR system and does not employ the traditional script identification module. A single LSTM-based OCR model trained on Latin-Greek bilingual documents yields a very low Character Error Rate (CER) on a dataset consisting of 9,900 text-lines.
The results of our experiments underpin our claim that multiscript OCR can be done without a separate script identification step, so the effort spent on script identification and separation can be saved. The proposed system can be retrained for any new script by simply specifying the character set of that script during the training phase. The OCR errors observed are mainly due to the similarity of the two scripts. The algorithm could be tested further on multilingual documents containing other languages and scripts, such as Arabic and English, Devanagari and English, and many more.
Conclusions and Future Work
This thesis contributes to the field of Optical Character Recognition (OCR) for printed documents by extending the use of contemporary Recurrent Neural Networks (RNNs) in this domain. There is an increasing demand for reliable OCR systems for complex modern scripts like Devanagari and Arabic, whose combined user population exceeds 500 million people around the globe. Likewise, large-scale digitization efforts for historical documents require robust OCR systems to preserve the literary heritage of our world. Furthermore, the abundance of multilingual documents present today in various forms intensifies the need for usable OCR systems that can handle multiple scripts and languages.
This thesis contributes in two ways to solving some of these issues. Firstly, several datasets have been proposed to evaluate the performance of OCR systems for printed Devanagari and polytonic Greek scripts. Databases for OCR tasks related to multilingual documents, such as script identification and the cross-language performance of an OCR system, have also been proposed. Additionally, a bilingual database for evaluating script independent OCR systems has been developed.
Secondly, the Long Short-Term Memory (LSTM)-based OCR methodology has been assessed for several modern scripts, including English, Devanagari, and Nastaleeq. This methodology has also been evaluated for some historical scripts, including Fraktur, polytonic Greek, and the medieval Latin script of the 15th century. A generalized OCR framework for documents containing multiple languages and scripts has also been put forward. This framework allows the use of a single OCR model for multiple scripts and languages.
Several conclusions can be drawn from the work done in this thesis.
• The first and foremost is that the use of artificial data, if generated carefully to closely reflect the degradations of the scanning process, can replace the need for a large collection of real ground-truthed datasets. Experiments in which the LSTM-based line recognizer was trained on synthetic data and tested on real scanned documents justify this claim.
• The marriage of segmentation-based and segmentation-free approaches for OCRing documents for which GT data is not available results in a framework that can self-correct the ground-truthed data in an iterative manner.
• The powerful context-aware processing makes the LSTM-based OCR network a suitable sequence learning machine that performs well on monolingual as well as multiscript documents. 1D-LSTM networks require very few parameters to tune, and they outperform more complex MDLSTM networks if the input is normalized properly. Moreover, the performance is better when features are learned automatically than when handcrafted features are used.
There are multiple directions in which the work reported in this thesis can be further
extended. Some of the key future directions are listed below:
• The LSTM-based OCR methodology can be directly extended to camera-captured documents. The challenge there is the presence of curved text-lines. There are several techniques to address this issue, including image dewarping and text-line extraction directly from the camera-captured documents. Once these issues are taken care of, the application of LSTM-based OCR is straightforward.
• Hierarchical Subsampling LSTM (HSLSTM) networks have shown excellent results on Urdu Nastaleeq OCR. However, their performance could be tested more thoroughly on other scripts to better gauge their potential.
• The LSTM-based OCR reported for Urdu Nastaleeq can be extended further to
other similar scripts, e.g., Persian, Pushto, Sindhi, and Kashmiri.
• The OCR of Devanagari can be improved by employing a mechanism that addresses the issue of vertically stacked characters. MDLSTM networks may yield better results; alternatively, improved preprocessing may benefit 1D-LSTM networks.
• The generalized OCR framework has presently been evaluated for only two scripts (Greek and English). As more datasets containing multiple scripts emerge, this framework can be applied to them to further establish its performance.
It is hoped that the work in this thesis fills a gap that existed in the field of OCR for printed documents, and that it will serve as a stepping stone for future research endeavors.