wanghaisheng/awesome-ocr

Text-Line Normalization
The relative position and scale of individual characters in a text-line are important
features for Latin and many other scripts. Normalization of text-lines helps in making
this information consistent across all text-lines in a given database. There are many
normalization methods proposed in the literature. Normalization methods that have
been used for various experiments reported in this thesis are described in the sections
below.
B.1 Image Rescaling
Image rescaling is the simplest method to make the heights of all images in a database
equal. For a desired image height, a scale can be calculated as follows:
scale = target_height / actual_height
This scale is then used to determine the width of the “normalized” image by simply
multiplying it by the width of the actual image.
target_width = scale ∗ actual_width
This normalization is used in the current thesis for some of the OCR experiments reported
for Urdu Nastaleeq script.
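
As a minimal sketch, the rescaling above might be implemented as follows, assuming the text-line is a 2D NumPy array; the function name, the default target height of 48 pixels, and the use of scipy.ndimage.zoom are illustrative choices, not details from the thesis.

```python
import numpy as np
from scipy.ndimage import zoom

def rescale_line(image: np.ndarray, target_height: int = 48) -> np.ndarray:
    """Rescale a text-line image to a fixed height, preserving the aspect ratio."""
    actual_height, actual_width = image.shape
    scale = target_height / actual_height              # scale = target_height / actual_height
    target_width = int(round(scale * actual_width))    # target_width = scale * actual_width
    # zoom factors per axis: (rows, cols); order=1 gives bilinear interpolation
    return zoom(image, (target_height / actual_height, target_width / actual_width), order=1)
```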
B.2 Zone-Based Normalization
Characters in many scripts like Latin, Greek and Devanagari follow certain typographic
rules. A text-line in such scripts can be divided into three zones. A baseline passes
through the bottom of the majority of the characters, and a mean-line lies at the middle
height between the baseline and the top edge of a text-line. Most of the small characters,
such as ‘x’, ‘s’, and ‘o’, lie between these two lines. The portion of a character that
extends above the mean-line is termed the ‘ascender’, and the portion extending below the
baseline is termed the ‘descender’. The zone between the baseline and the mean-line
is the middle-zone, the zone below the baseline is the bottom-zone, and the zone
above the mean-line is called the top-zone. A sample text-line in Devanagari script with
these three zones is shown in Figure B.1.
Rashid et al. [Ras14] proposed a text-line normalization method that uses the
three zones mentioned above. A statistical analysis is carried out to estimate these
zones in an image, and each zone is then rescaled to a specific height using the simple
rescaling described in the previous section. This normalization method has been employed
for the experiments reported for Devanagari script in this thesis.
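
As a rough sketch of the zone-wise rescaling step, the following assumes the baseline and mean-line rows have already been estimated by some statistical analysis; the per-zone target heights, the function name, the fallback for empty zones, and rescaling only the vertical axis of each zone are illustrative assumptions, not details from [Ras14].

```python
import numpy as np
from scipy.ndimage import zoom

def normalize_zones(image: np.ndarray, meanline_row: int, baseline_row: int,
                    zone_heights=(16, 24, 16)) -> np.ndarray:
    """Rescale the top, middle and bottom zones of a text-line to fixed heights.

    image        -- 2D grayscale text-line image
    meanline_row -- estimated row index of the mean-line
    baseline_row -- estimated row index of the baseline (baseline_row > meanline_row)
    zone_heights -- target heights of the (top, middle, bottom) zones in pixels
    """
    top = image[:meanline_row]                  # ascender zone (above the mean-line)
    middle = image[meanline_row:baseline_row]   # x-height (middle) zone
    bottom = image[baseline_row:]               # descender zone (below the baseline)
    zones = []
    for zone, h in zip((top, middle, bottom), zone_heights):
        if zone.shape[0] == 0:                  # a zone may be empty for short lines
            zone = np.zeros((1, image.shape[1]), dtype=image.dtype)
        # rescale only the vertical axis; keep the width unchanged
        zones.append(zoom(zone, (h / zone.shape[0], 1.0), order=1))
    # widths could differ by a pixel after zooming; crop defensively to the smallest
    w = min(z.shape[1] for z in zones)
    return np.vstack([z[:, :w] for z in zones])
```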
B.3 Token-Dictionary-Based Normalization
This text-line normalization method is based on a dictionary composed of connected
component shapes and associated baseline and x-height information. This dictionary
is pre-computed from a large sample of text-lines whose baselines and x-heights are
derived by aligning the text-line images with textual ground-truth, together
with information about the position of Latin characters relative to the baseline and
x-height. Note that for some shapes (e.g., p/P, o/O), the baseline and x-height information
may be ambiguous; the information is therefore stored in the form of probability
densities given a connected component shape. The connected components do
not need to correspond to characters; they might be ligatures or frequently touching
character pairs like “oo” or “as”.
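
A highly simplified sketch of how such a token dictionary might be accumulated is given below; the shape key (a small binarized thumbnail), the list-of-offsets stand-in for the probability densities, and the helper names are assumptions made for illustration, not the data structures used in the original work.

```python
import numpy as np
from collections import defaultdict
from scipy.ndimage import label, find_objects, zoom

def shape_key(component: np.ndarray, size=(8, 8)) -> bytes:
    """Reduce a connected component to a tiny binary thumbnail usable as a dict key."""
    thumb = zoom(component.astype(float),
                 (size[0] / component.shape[0], size[1] / component.shape[1]), order=1)
    return (thumb > 0.5).tobytes()

def build_token_dictionary(lines):
    """Accumulate baseline/x-height offsets per connected-component shape.

    lines -- iterable of (binary_image, baseline_row, xheight_row) triples,
             with baseline and x-height known from aligning lines to ground-truth.
    """
    token_dict = defaultdict(lambda: {"baseline": [], "xheight": []})
    for image, baseline_row, xheight_row in lines:
        labels, n = label(image > 0)
        for i, sl in enumerate(find_objects(labels), start=1):
            component = (labels[sl] == i)
            top = sl[0].start                   # top row of the component's bounding box
            key = shape_key(component)
            # store the line's baseline and x-height relative to the component
            token_dict[key]["baseline"].append(baseline_row - top)
            token_dict[key]["xheight"].append(xheight_row - top)
    return token_dict
```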
To measure the baseline and x-height of a new text-line, the connected components
are extracted from the text-line and the associated probability densities for the
baseline and x-height locations are retrieved. These densities are then mapped onto
the line and locally averaged, resulting in a probability map for the baseline
and x-height across the entire text-line. Maps of the x-height and baseline of an example
text-line (Figure B.2-(a)) are shown in Figure B.2-(b) and (c), respectively. The resulting
densities are then fitted with curves, which are used as the baseline and x-height for
line-size normalization. In line-size normalization, the (possibly curved) baseline and
x-height lines are mapped onto two straight lines in a fixed-size output text-line image,
with the pixels between them rescaled using a spline transformation. This text-line
normalization method has been used in the experiments for English and Fraktur.
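
Continuing the hypothetical sketch above (it reuses shape_key and the dictionary built by build_token_dictionary), estimating the baseline of a new text-line might look roughly like the following; averaging the per-component votes and fitting a low-degree polynomial stand in for the probability-density mapping and curve fitting described in the text, and the x-height curve would be estimated analogously.

```python
import numpy as np
from scipy.ndimage import label, find_objects

def estimate_baseline(image: np.ndarray, token_dict, degree: int = 2):
    """Estimate a (possibly curved) baseline as a polynomial over the columns."""
    labels, n = label(image > 0)
    cols, rows = [], []
    for i, sl in enumerate(find_objects(labels), start=1):
        component = (labels[sl] == i)
        key = shape_key(component)                     # helper from the previous sketch
        votes = token_dict.get(key, {}).get("baseline", [])
        if not votes:
            continue                                   # unseen shape: contributes no vote
        top = sl[0].start
        center_col = (sl[1].start + sl[1].stop) / 2.0
        cols.append(center_col)
        rows.append(top + np.mean(votes))              # component's most likely baseline row
    if len(cols) <= degree:
        # too few recognized components: fall back to a flat baseline at the bottom
        return np.poly1d([float(image.shape[0] - 1)])
    coeffs = np.polyfit(cols, rows, deg=degree)        # smooth curve through the estimates
    return np.poly1d(coeffs)                           # callable: baseline_row = f(column)
```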
B.4 Filter-Based Normalization
The zone-based and token-dictionary methods work satisfactorily for scripts where
either the baseline and x-height information is easy to estimate or segmentation
into individual characters is feasible. They fail to perform reasonably for the Urdu
Nastaleeq script, where neither the baseline nor the segmentation is trivial to estimate. The
filter-based normalization method does not depend on estimating the baseline or individual
characters. It is based on simple filter operations and an affine transformation,
which makes it a script-independent normalization method, in contrast to
the normalization process described in the previous section, which is based on the
shapes of the Latin alphabet. The complete normalization process is shown in Figure
B.3. The input text-line image is first inverted and smoothed with a large Gaussian
filter. This smoothing captures the global structure of the underlying
contents. As shown in Figure B.3-(a), the smoothed image then has its maximum values
near the vertical center of the text. These maxima are fitted with
a straight line (in practice, the line passing through these points is smoothed as well).
This is the center line around which the whole text-line is rescaled using an affine transformation.
First, a zone is determined from the difference between the height of the input
image and the center line. Then, to ensure that the final normalized image contains
all the contents without clipping, the image is padded above and
below the center line by an amount equal to the height of the image. This padded
image is then cropped using the zone measurement found previously. Finally, the image
is scaled to the required height using an affine transformation. The width of the final
image is calculated by multiplying the original width by the ratio of the target height
to the height of the dewarped image. The only tunable parameter in this method is
the target height; all other parameters are calculated from the given image itself. This
text-line normalization is used for the work reported on Urdu Nastaleeq, historical Latin,
and multilingual documents. Some images normalized with this method are shown in
Figure B.4.
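
A sketch of this center-line based dewarping is given below, assuming a grayscale line image with dark text on a light background; the Gaussian sigma, the spread-based zone estimate, and the range factor are fixed illustrative constants chosen here for the sketch, whereas in the method as described the only tunable parameter is the target height.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def filter_normalize(line: np.ndarray, target_height: int = 48,
                     sigma=(12.0, 30.0), range_factor: float = 4.0) -> np.ndarray:
    """Center-line based text-line normalization (a sketch of the idea)."""
    img = np.amax(line) - line.astype(float)            # invert: ink becomes bright
    smoothed = gaussian_filter(img, sigma)              # large Gaussian captures global structure
    xs = np.arange(img.shape[1])
    maxima = np.argmax(smoothed, axis=0).astype(float)  # per-column row of maximum response
    # fit a straight line through the maxima; clip so it stays inside the image
    center = np.clip(np.polyval(np.polyfit(xs, maxima, 1), xs), 0, img.shape[0] - 1)
    # estimate the vertical spread of the ink around the center line
    # (assumption: this stands in for the "zone" measurement in the text)
    rows = np.arange(img.shape[0])[:, None]
    mass = img + 1e-6
    spread = np.sqrt(np.sum(mass * (rows - center[None, :]) ** 2) / np.sum(mass))
    half = min(max(1, int(range_factor * spread)), img.shape[0])
    # pad above and below by the image height so the crop never clips any content
    padded = np.pad(img, ((img.shape[0], img.shape[0]), (0, 0)))
    dewarped = np.empty((2 * half, img.shape[1]))
    for x in xs:
        c = int(round(center[x])) + img.shape[0]        # center row in padded coordinates
        dewarped[:, x] = padded[c - half:c + half, x]
    # rescale to the target height; the width scales by the same ratio automatically
    return zoom(dewarped, target_height / dewarped.shape[0], order=1)
```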