
2020 25th International Conference on Pattern Recognition (ICPR)

Milan, Italy, Jan 10-15, 2021

IBN-STR: A Robust Text Recognizer for Irregular Text in Natural Scenes

Xiaoqian Li*1,2, Jie Liu*1, Guixuan Zhang1, Shuwu Zhang1
1 Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
Email: {lixiaoqian2015, jie.liu, guixuan.zhang, shuwu.zhang}@ia.ac.cn

Abstract—Although text recognition methods based on deep neural networks have promising performance, there are still challenges due to the variety of text styles, perspective distortion, text with large curvature, and so on. To obtain a robust text recognizer, we have improved performance from two aspects: the data aspect and the feature representation aspect. In terms of data, we transform the input images into S-shape distorted images in order to increase the diversity of training data. Besides, we explore the effects of different training data. In terms of feature representation, the combination of instance normalization and batch normalization improves the model's capacity and generalization ability. This paper proposes a robust scene text recognizer, IBN-STR, which is an attention-based model. Through extensive experiments, model analysis and comparison have been carried out from the aspects of data and feature representation, and the effectiveness of IBN-STR on both regular and irregular text instances has been verified. Furthermore, IBN-STR is an end-to-end recognition system that can achieve state-of-the-art performance.

Fig. 1. Various text. (a) Real text; (b) Synthetic text.

I. INTRODUCTION

Text is a vital cue in natural scene images. Text recognition is a branch of computer vision which can help people understand the content of images, and it is widely used as an auxiliary aid in intelligent transportation, auxiliary translation, image retrieval, and so on. Research on text recognition has a long history. Traditional pattern classification provides many solutions for text recognition [1]–[3], and owing to the development of deep learning and computational power, text recognition has recently achieved great breakthroughs.

Although text recognition methods [4]–[6] based on deep learning perform well, there are still challenges in text recognition due to the large curvature of text instances, the variety of text styles, similar characters, occlusion, uneven illumination, and shooting environments. Therefore, a robust text recognizer is of great significance.

Most text recognizers [4]–[7] are trained on synthetic data and evaluated on real data. As we can see in Figure 1, the curvature of text in commonly used synthetic datasets varies little, but the curvature of text in real images is greater and the appearance of the text is more variable. This means that there is a gap between the distributions of training data and test data. In view of the above problems, we first consider improving from the data aspect: it is necessary to use data augmentation to increase the diversity of training data. Secondly, for the feature representation aspect, we attempt to mine a robust and efficient module to extract features.

For the data aspect, we utilize S-shape distortion to enrich the text curvature of the training data. For the feature representation aspect, considering the contribution of instance normalization (IN) in style transfer tasks [8], [9], where IN introduces appearance invariance while batch normalization (BN) preserves content information, a robust instance-batch normalization (IBN) module is proposed to introduce text appearance invariance and improve performance.

In this paper, we propose a Scene Text Recognizer with an Instance-Batch Normalization module (named IBN-STR) to achieve both regular and irregular text recognition in natural scenes. The contributions of this paper are as follows:
• In terms of data, we demonstrate the impact of data augmentation and of different training data on text recognition. The input images are S-shape distorted to increase the diversity of training data and further improve performance.
• In terms of feature representation, instance normalization is introduced into text recognition for the first time to improve the model's capacity and generalization ability. The IBN module combines instance normalization with batch normalization, which helps the model extract more effective feature maps and is effective for both regular and irregular text.
• With the rectification network, we propose the IBN-STR model for text recognition and achieve state-of-the-art performance.

* Authors contributed equally as first author.
† Corresponding author: Guixuan Zhang.

The remainder of this paper is organized as follows. Section II introduces the related works of text recognition. Section III illustrates the proposed method. Section IV demonstrates the experimental results to verify the effectiveness of the proposed method. Section V summarizes this paper.

[Figure 2 diagram: (a) original BN block: Conv(128, 1×1) → BN(128) → Conv(128, 3×3) → BN(128) → ReLU; (b) IBN-a module: Conv(128, 1×1) → IN(64) ∥ BN(64) → Conv(128, 3×3) → BN(128) → ReLU; (c) IBN-b module: Conv(128, 1×1) → BN(128) → Conv(128, 3×3) → BN(128) → ReLU → IN(128).]

Fig. 2. Instance-batch normalization (IBN) module.

Fig. 3. Overview of the text recognizer (input image → text rectification → rectified image → text recognition → "OPTIMUM"). The dashed lines indicate the direction of gradient propagation.

II. RELATED WORKS

Traditional text recognition methods mainly rely on manually designed features such as connected components [10] and the stroke width transform [11]–[13]. Recently, methods based on deep neural networks have shown advantages in text recognition. Jaderberg et al. [14] proposed a model combining a convolutional neural network (CNN) and a conditional random field, which is optimized by a structured output loss. Most CNN-based methods tend to treat text recognition as a sequence recognition task. Inspired by speech recognition, CRNN [4] introduced the CTC loss into text recognition. CRNN is an end-to-end system that utilizes a CNN and a recurrent neural network (RNN) to generate character sequence features. Vanilla CTC can only deal with 1D probability distributions, while 2D-CTC [36] can compute the conditional probability of labels from 2D distributions, which is suitable for irregular text. With the popularity of the attention mechanism, attention-based text recognition methods have been proposed [5], [6], [15]. RARE [5], ASTER [6], MORAN [7], and RCN [16] transform irregular text images into rectified images, and then use the attention-based recognition network of an encoder-decoder framework to achieve text recognition. In the absence of a rectification network, Lyu et al. [17] proposed a relation attention module and a parallel attention module to transform text images into character feature sequences, which is workable for irregular text recognition.

Our method is based on the attentional sequence-to-sequence (seq2seq) model. Different from previous methods, the proposed IBN-STR introduces IN for the first time to improve the capacity and generalization ability of the text recognizer.
III. METHOD

As shown in Figure 3, the proposed IBN-STR model consists of a rectification network and a text recognition network. The rectification network is based on the spatial transformer network and generates rectified images. The text recognition network follows the encoder-decoder framework which is widely used in seq2seq text recognition. It consists of a CNN-BLSTM encoder and an attention-based decoder. The encoder first extracts stacked convolutional features of input images and utilizes bidirectional long short-term memory (BLSTM) to convert the image features into feature sequences. The decoder is a sequence-to-sequence model that translates the feature sequence into a character sequence. The IBN module is embedded in the stacked convolutional modules to improve the capacity and generalization ability of the text recognizer.

A. IBN Module

Batch normalization [18] was proposed to normalize data while preserving the representations of the data. BN makes the model less sensitive to parameters and converge faster by limiting the input data to a certain range through the mean and variance. Given a feature map x \in \mathbb{R}^{N \times C \times H \times W} with N samples, C channels, height H, and width W, the normalized data can be formulated as

x' = \gamma \frac{x - \mu(x)}{\sigma(x)} + \beta,   (1)

where \gamma and \beta are scaling and shift factors. BN retains the channel dimension when calculating the mean and variance:

\mu_c(x) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw},   (2)

\sigma_c(x) = \sqrt{\frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_c(x))^2 + \epsilon}.   (3)
In the training phase, the mean/variance is calculated based on the samples of the mini-batch. All the means and variances of mini-batches during training are saved for testing. In the test phase, the means/variances of the mini-batches from the training phase are weighted-averaged to obtain the estimated values.

Instance normalization [19] is mainly used in the field of style transfer [8], [9] because IN can learn features that are invariant to styles or appearance. For style transfer, each image should be regarded as a domain. In order to maintain the independence between different image instances, IN retains the dimensions N and C, and computes the mean and standard deviation only over H and W within each channel. The mean and variance can be formulated as:

\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw},   (4)

\sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_{nc}(x))^2 + \epsilon}.   (5)
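To make the difference concrete, the following sketch computes the statistics of Equations (2)–(5) and applies the affine transform of Equation (1) in NumPy. It is an illustrative sketch only (a training-time BN additionally tracks running statistics, as described above); shapes follow the N × C × H × W convention.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Eqs. (1)-(3): normalize over (N, H, W), keeping the channel dim C."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)                   # mu_c,    Eq. (2)
    sigma = np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)  # sigma_c, Eq. (3)
    return gamma * (x - mu) / sigma + beta                       # Eq. (1)

def instance_norm(x, gamma, beta, eps=1e-5):
    """Eqs. (4)-(5): normalize over (H, W) only, keeping both N and C."""
    mu = x.mean(axis=(2, 3), keepdims=True)                      # mu_nc,    Eq. (4)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)     # sigma_nc, Eq. (5)
    return gamma * (x - mu) / sigma + beta

x = np.random.randn(8, 64, 32, 100)   # N x C x H x W feature map
gamma = np.ones((1, 64, 1, 1))
beta = np.zeros((1, 64, 1, 1))
print(batch_norm(x, gamma, beta).shape, instance_norm(x, gamma, beta).shape)
```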
Texts in natural scenes are highly variable in appearance. Inspired by the contribution of instance normalization to style transfer tasks [8], [9], we introduce IN to obtain a robust text recognizer for natural scenes. As shown in Figure 2, two types of IBN modules are provided. In the IBN-a module, the feature maps are divided into two parts and sent to IN and BN respectively, and the outputs of IN and BN are concatenated and sent to the next convolutional layer. The IBN-a module can integrate appearance-invariant features and content-related information to improve performance. To explore the generalization ability of different types of IBN modules, the IBN-b module is also proposed: its IN layer is placed before the residual block output.

The experimental results in Section IV verify the effectiveness of the proposed IBN module. In particular, most of the time the IBN-a module performs better than the IBN-b module in text recognition and improves both regular and irregular text recognition.
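A minimal PyTorch sketch of the two variants, assuming the channel split shown in Figure 2 (half of the channels to IN and half to BN in IBN-a; IN appended to the block output in IBN-b). The widths mirror the 128-channel blocks of Figure 2, and the exact placement of IN relative to ReLU in IBN-b follows our reading of the figure; this is not the authors' released code.

```python
import torch
import torch.nn as nn

class IBNa(nn.Module):
    """IBN-a (Fig. 2(b)): split channels after the 1x1 conv, send one half
    to IN and the other to BN, then concatenate."""
    def __init__(self, channels=128):
        super().__init__()
        half = channels // 2
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.inorm = nn.InstanceNorm2d(half, affine=True)   # IN(64)
        self.bnorm = nn.BatchNorm2d(half)                   # BN(64)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.conv1(x)
        a, b = torch.split(y, y.size(1) // 2, dim=1)
        y = torch.cat([self.inorm(a), self.bnorm(b)], dim=1)
        return self.relu(self.bn2(self.conv2(y)))

class IBNb(nn.Module):
    """IBN-b (Fig. 2(c)): a plain conv-BN block with IN applied to the output."""
    def __init__(self, channels=128):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.inorm = nn.InstanceNorm2d(channels, affine=True)  # IN(128)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.bn2(self.conv2(self.bn1(self.conv1(x))))
        return self.inorm(self.relu(y))

x = torch.randn(2, 128, 8, 25)
print(IBNa()(x).shape, IBNb()(x).shape)
```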
TABLE I
ARCHITECTURE OF THE TEXT RECOGNITION NETWORK. BLSTM MEANS BIDIRECTIONAL LONG SHORT-TERM MEMORY LAYER.

          Layers   Configurations                                              Outsize
encoder   Block 0  3 × 3 conv, s 1 × 1, bn                                     32 × 32 × 100
          Block 1  [1 × 1 conv, 32, bn; 3 × 3 conv, 32, bn] × 3, s 2 × 2       32 × 16 × 50
          Block 2  [1 × 1 conv, 64, ibn; 3 × 3 conv, 64, bn] × 4, s 2 × 2      64 × 8 × 25
          Block 3  [1 × 1 conv, 128, ibn; 3 × 3 conv, 128, bn] × 6, s 2 × 1    128 × 4 × 25
          Block 4  [1 × 1 conv, 256, ibn; 3 × 3 conv, 256, bn] × 6, s 2 × 1    256 × 2 × 25
          Block 5  [1 × 1 conv, 512, bn; 3 × 3 conv, 512, bn] × 3, s 2 × 1     512 × 1 × 25
          BLSTM1   256 hidden units                                            25 × 256
          BLSTM2   256 hidden units                                            25 × 256
decoder   GRU      256 hidden units                                            25 × 256

B. Encoder

The encoder aims to extract rich and discriminative features. As illustrated in Table I, the main structure of the encoder is the CNN-BLSTM framework. The encoder first extracts spatial feature maps from the input image through stacked convolutional layers with residual connections [20]. The proposed IBN module can be employed in the shallow layers to obtain strong spatial features. Based on ResNet45 [21], the proposed IBN-STR model uses the IBN module in residual Block 2 to residual Block 4.

The CNN of the encoder aims to capture the features of local regions. To capture the long-range dependencies of characters, a multi-layer bidirectional long short-term memory [22] is introduced. BLSTM can encode feature sequences bidirectionally, capture long-range dependencies in both directions, and model global context information, thereby leveraging richer context and improving performance.
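The data flow under the shapes of Table I can be sketched as follows: a convolutional backbone reduces a 32 × 100 input to a 512 × 1 × 25 map, which is squeezed into a length-25 sequence and fed to two BLSTM layers with 256 hidden units. The backbone below is a stand-in (a few strided conv-BN-ReLU stages with the strides of Table I), not the full ResNet45 with IBN modules, and the final projection folding the two LSTM directions back to 256 units is an assumption made to match the 25 × 256 outsize in the table.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in backbone; the strides mirror Table I (H: 32 -> 1, W: 100 -> 25).
        stages = [(3, 32, (1, 1)), (32, 32, (2, 2)), (32, 64, (2, 2)),
                  (64, 128, (2, 1)), (128, 256, (2, 1)), (256, 512, (2, 1))]
        layers = []
        for cin, cout, stride in stages:
            layers += [nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                       nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        self.cnn = nn.Sequential(*layers)
        self.blstm = nn.LSTM(512, 256, num_layers=2, bidirectional=True,
                             batch_first=True)
        self.proj = nn.Linear(512, 256)  # fold the two directions back to 256 (assumed)

    def forward(self, images):                # images: N x 3 x 32 x 100
        f = self.cnn(images)                  # N x 512 x 1 x 25
        seq = f.squeeze(2).permute(0, 2, 1)   # N x 25 x 512
        h, _ = self.blstm(seq)                # N x 25 x 512 (two directions)
        return self.proj(h)                   # N x 25 x 256, as in Table I

print(Encoder()(torch.randn(2, 3, 32, 100)).shape)  # torch.Size([2, 25, 256])
```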
C. Decoder

The decoder is attention-based and achieves sequence-to-sequence prediction. As shown in Table I, a GRU cell [23] is utilized to decode output dependencies. Through T iterations, the decoder generates a predicted symbol sequence (y_1, ..., y_T), where T is the number of characters. To generate a variable-length sequence, a special end-of-sequence symbol (EOS) is inserted at the end of the target sequence. At step t, the decoder produces a predicted output y_t, and the probability of y_t is p(y_t):

p(y_t) = \mathrm{Softmax}(W_{out} s_t + b_{out}), \quad y_t \sim p(y_t),   (6)

where s_t is the hidden state at the current time, and W_{out} and b_{out} are trainable parameters. In this paper, the embedding vector and s_{t-1} (the hidden state at the previous time) are fed into the GRU to update s_t:

s_t = \mathrm{GRU}(s_{t-1}, (g_t, f(y_{t-1}))),   (7)

g_t = \sum_{i=1}^{L} \alpha_{t,i} h_i,   (8)

where (g_t, f(y_{t-1})) is the concatenation of the glimpse vector g_t and the embedding vector f(y_{t-1}) of the previous output y_{t-1}. The glimpse vector focuses on a small part of the whole context. In Equation (8), L is the length of the feature maps, and \alpha_{t,i} is the attentional weight vector, which can be generated by

\alpha_{t,i} = \exp(e_{t,i}) / \sum_{j=1}^{L} \exp(e_{t,j}),   (9)

e_{t,i} = \mathrm{Tanh}(W s_{t-1} + V h_i + b),   (10)

where W, V, and b are trainable parameters. Given the predicted symbol sequence, the recognition loss can be formulated as

L_{rec} = -\frac{1}{T} \sum_{t=1}^{T} \left( \log p_{l2r}(y_t) + \log p_{r2l}(y_t) \right),   (11)

where p_{l2r}(y_t) and p_{r2l}(y_t) are the probabilities of the sequence decoded from left to right and from right to left, respectively.
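The following sketch walks through one decoding step of Equations (6)–(10): additive attention scores (Eq. (10)), attention weights (Eq. (9)), the glimpse vector (Eq. (8)), the GRU state update (Eq. (7)), and the output distribution (Eq. (6)). The hidden and embedding sizes are illustrative, and the extra projection that turns the Tanh vector of Eq. (10) into a scalar score is the standard additive-attention reading, which the printed equation leaves implicit. The bidirectional loss of Eq. (11) would average this step's log-probabilities over a left-to-right and a right-to-left decoder.

```python
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    def __init__(self, hidden=256, embed=128, num_classes=38):
        super().__init__()
        self.W = nn.Linear(hidden, hidden, bias=False)  # acts on s_{t-1}
        self.V = nn.Linear(hidden, hidden, bias=True)   # acts on h_i (b folded in)
        self.score = nn.Linear(hidden, 1, bias=False)   # Tanh vector -> scalar e_{t,i}
        self.embed = nn.Embedding(num_classes, embed)   # f(y_{t-1})
        self.gru = nn.GRUCell(hidden + embed, hidden)   # Eq. (7)
        self.out = nn.Linear(hidden, num_classes)       # W_out, b_out in Eq. (6)

    def forward(self, h, s_prev, y_prev):
        # h: N x L x 256 encoder states; s_prev: N x 256; y_prev: N token ids
        e = self.score(torch.tanh(self.W(s_prev).unsqueeze(1) + self.V(h)))  # Eq. (10)
        alpha = torch.softmax(e.squeeze(-1), dim=1)                          # Eq. (9)
        g = (alpha.unsqueeze(-1) * h).sum(dim=1)                             # Eq. (8)
        s = self.gru(torch.cat([g, self.embed(y_prev)], dim=1), s_prev)      # Eq. (7)
        return torch.log_softmax(self.out(s), dim=-1), s                     # Eq. (6)

step = AttentionStep()
h = torch.randn(2, 25, 256)
logp, s = step(h, torch.zeros(2, 256), torch.zeros(2, dtype=torch.long))
print(logp.shape)  # torch.Size([2, 38])
```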

D. Rectification

The rectification network is based on the spatial transformer network [24], similar to RARE [5]. First, fiducial points are predicted by the localization network, and then thin-plate-spline transformation [25] matrices are calculated to generate the sampling grid. Finally, the sampler uses bilinear interpolation to obtain the rectified image. Table II illustrates the architecture of the localization network. The input image is scaled to 32 × 100 and fed into convolutional layers and pooling layers. Each convolution layer is followed by a batch normalization layer and a ReLU layer. The adaptive average pooling layer is used to generate feature vectors, which then pass through two fully connected layers to generate the predicted fiducial points.

TABLE II
ARCHITECTURE OF THE LOCALIZATION NETWORK. CONV MEANS CONVOLUTION LAYER, MP MEANS MAXPOOLING LAYER, AND ADAPAVGP MEANS ADAPTIVE AVERAGE POOLING LAYER.

Layers     Configurations            Outsize
Conv 1     3 × 3 conv, 64, s 1 × 1   64 × 32 × 100
MP 1       2 × 2                     64 × 16 × 50
Conv 2     3 × 3 conv, 128, s 1 × 1  128 × 16 × 50
MP 2       2 × 2                     128 × 8 × 25
Conv 3     3 × 3 conv, 256, s 1 × 1  256 × 8 × 25
MP 3       2 × 2                     256 × 4 × 12
Conv 4     3 × 3 conv, 512, s 1 × 1  512 × 4 × 12
AdapAvgP   1 × 1                     512 × 1 × 1
Linear 1   512, 256                  256
Linear 2   256, 2K                   2K
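Table II transcribes almost directly into a module. In the sketch below, K (the number of fiducial points) is left as a parameter because the paper only specifies the output width as 2K; the TPS grid generation and bilinear sampler of [24], [25] are omitted.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    # Each conv in Table II is followed by BN and ReLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=1, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class LocalizationNet(nn.Module):
    """Predicts 2K fiducial-point coordinates from a 32 x 100 input (Table II)."""
    def __init__(self, K=20):  # K is an assumption; Table II only states "2K"
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(3, 64), nn.MaxPool2d(2),     # 64 x 16 x 50
            conv_bn_relu(64, 128), nn.MaxPool2d(2),   # 128 x 8 x 25
            conv_bn_relu(128, 256), nn.MaxPool2d(2),  # 256 x 4 x 12
            conv_bn_relu(256, 512),                   # 512 x 4 x 12
            nn.AdaptiveAvgPool2d(1))                  # 512 x 1 x 1
        self.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(inplace=True),
                                nn.Linear(256, 2 * K))

    def forward(self, x):                         # x: N x 3 x 32 x 100
        v = self.features(x).flatten(1)           # N x 512
        return self.fc(v).view(x.size(0), -1, 2)  # N x K x 2 fiducial points

print(LocalizationNet()(torch.randn(2, 3, 32, 100)).shape)  # torch.Size([2, 20, 2])
```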
E. Data augmentation

To enrich the diversity of training data, it is necessary to adopt data augmentation for the input images. Here, we utilize a trigonometric function to generate the S-shape distortion transformation. Given a position (i, j) in the original image and the corresponding position (i', j') in the distorted image, the correspondence between (i, j) and (i', j') is as follows:

i' = a_1 i + a_2 \,\mathrm{Sin}(\theta, j) + a_3, \quad j' = j,   (12)

where a_1, a_2, a_3 are scaling and shifting parameters, and \theta determines the distortion mode for the entire image. In this paper, the original image is S-shape distorted with a probability of 0.4. As shown in Figure 4, there are 16 distortion modes, one of which is randomly selected for the input image. The experimental results demonstrate the effectiveness of S-shape distortion.

Fig. 4. S-shape distortion. (a) origin image; (b) distorted images.
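One possible implementation of the mapping in Equation (12) uses inverse warping, sampling each output pixel from a sinusoidally shifted source row. The paper does not specify the amplitude, frequency, or phase that realize the 16 distortion modes, so the parameters below (a1, a2, a3, and the phase/periods standing in for the mode θ) are illustrative.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def s_shape_distort(img, a1=1.0, a2=4.0, a3=0.0, phase=0.0, periods=1.0):
    """Eq. (12): i' = a1*i + a2*Sin(theta, j) + a3, j' = j.
    a2 is the amplitude in pixels; phase/periods stand in for the mode theta."""
    h, w = img.shape[:2]
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))
    # Inverse map: for each output pixel (i', j'), find the source row i.
    src_i = (ii - a2 * np.sin(2 * np.pi * periods * jj / w + phase) - a3) / a1
    channels = [map_coordinates(img[..., c], [src_i, jj], order=1, mode='nearest')
                for c in range(img.shape[2])]
    return np.stack(channels, axis=-1)

img = (np.random.rand(32, 100, 3) * 255).astype(np.uint8)
print(s_shape_distort(img).shape)  # (32, 100, 3)
```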
The proposed method focuses on alphanumeric character recognition, but non-alphanumeric characters occur frequently in text images of natural scenes. Therefore, this paper also discusses whether to use non-alphanumeric text images in Section IV.

IV. EXPERIMENTS

In this section, we conduct extensive experiments to verify the effectiveness of the proposed method. The performance of all the methods is measured by word accuracy.

A. Benchmark Datasets

• Street View Text (SVT). The Street View Text dataset [1] has 350 images collected from Google Street View. The dataset has 647 word instances, and each instance has a 50-word lexicon.
• IIIT5K-Words (IIIT5K). The IIIT5K-Words dataset [26] has 3,000 cropped word instances for testing. The dataset provides a 50-word and a 1k-word lexicon for each word instance.
• ICDAR 2013 (IC13). ICDAR 2013 [27] has 1,095 word instances cropped from 233 scene images. After filtering words with non-alphanumeric characters, 1,015 cropped word instances are obtained for evaluation.
• ICDAR 2015 (IC15). ICDAR 2015 [28] provides 2,077 multi-oriented word instances for text recognition. The word instances are cropped from the test scene images. After removing words with non-alphanumeric characters, words with a length less than 3, and irregular text, 1,811 word instances are obtained.
• SVT-Perspective (SVT-P). The SVT-Perspective dataset [29] has 639 perspective text instances, and a 50-word lexicon is provided for each instance.
• CUTE80 (CUTE). The CUTE80 dataset [30] has 288 word instances cropped from 80 high-resolution images taken in natural scenes. The dataset has many examples of curved text.
• Total-Text. Total-Text [31] has 300 test images. The word instances are arbitrarily shaped text, including flipped text. 2,204 word instances are obtained after filtering words with non-alphanumeric characters.

The benchmarks consist of regular and irregular text. There are 4,662 regular text instances from the SVT, IIIT5K, and IC13 datasets, and 5,214 irregular text instances from the IC15, SVT-P, CUTE, and Total-Text datasets. The total number of text instances is 9,876.
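Word accuracy on these benchmarks conventionally counts a prediction as correct only when the whole predicted word matches the ground truth, compared case-insensitively over alphanumeric characters; the following is a small sketch of that metric under this assumption.

```python
def normalize(word: str) -> str:
    # Keep alphanumerics only and ignore case, the usual benchmark convention.
    return ''.join(ch for ch in word.lower() if ch.isalnum())

def word_accuracy(predictions, ground_truths):
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

print(word_accuracy(['Hello', 'W0rld'], ['hello', 'world']))  # 50.0
```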

B. Implementation Details

In this paper, we utilize Synth90k [32] and SynthText [33] as training data and evaluate on the standard benchmarks. The Synth90k dataset (denoted as SK) contains approximately 8.9 million synthetic word images, and the SynthText dataset (denoted as ST) has 6.9 million training instances, including 1.4 million non-alphanumeric instances. For SynthText, 5.5 million word instances (denoted as ST_a) are obtained by filtering words with non-alphanumeric characters, while the 1.4 million non-alphanumeric word instances are denoted as ST_e. The proposed model is trained using only synthetic data, without fine-tuning. The model only recognizes alphanumeric characters, including 26 letters, 10 digits, a symbol for non-alphanumeric characters, and a symbol standing for 'EOS'. The model is trained from scratch and optimized by the Adam optimizer with a learning rate of 5e-4. Iteration stops after 10 epochs. All the input images are resized to 32 × 100. The experiments are conducted with two NVIDIA Tesla K40 GPUs, and the batch size is 1024.
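The 38-way output alphabet and the optimizer settings above pin down most of the training configuration; a short sketch follows, where the tokens chosen for the non-alphanumeric class and for EOS are placeholders of our own (the paper does not name them).

```python
import string
import torch

# 26 letters + 10 digits + 1 non-alphanumeric symbol + 1 EOS symbol = 38 classes.
CHARSET = list(string.ascii_lowercase + string.digits) + ['<unk>', '<eos>']
assert len(CHARSET) == 38
CHAR_TO_ID = {c: i for i, c in enumerate(CHARSET)}

model = torch.nn.Linear(256, len(CHARSET))                 # stand-in for IBN-STR
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # learning rate from the paper
BATCH_SIZE, INPUT_SIZE, EPOCHS = 1024, (32, 100), 10
```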
C. Ablation Study

1) Data Augmentation: Here we attempt to display the effects of the different training datasets, the S-shape distortion, and the outputs (including a symbol for non-alphanumeric characters). As shown in Table III, BO means using the SK + ST_a datasets, while TO means using SK + ST (SK + ST_a + ST_e). 37 indicates an output without considering non-alphanumeric characters, while 38 indicates an output including a symbol for non-alphanumeric characters. The Base-* models are trained on images without S-shape distortion, while the inputs of the Data-base-* models are S-shape distorted. All the models with *-stn are trained with the rectification network.

TABLE III
THE RESULTS OF DATA AUGMENTATION.

Method                  Regular  Irregular  Total
Base(BO+37)             90.20    72.61      80.91
Base-stn(BO+37)         90.78    75.76      82.85
Base(TO+38)             92.32    74.41      82.87
Base-stn(TO+38)         93.07    76.89      84.53
Data-base(BO+37)        90.35    72.75      81.06
Data-base-stn(BO+37)    91.30    75.93      83.18
Data-base(TO+37)        92.94    75.07      83.51
Data-base-stn(TO+37)    93.35    77.94      85.22
Data-base(TO+38)        92.53    75.49      83.54
Data-base-stn(TO+38)    93.22    78.02      85.20

Improvement             Regular  Irregular  Total
S-shape(BO+37)          +0.15    +0.13      +0.15
S-shape-stn(BO+37)      +0.52    +0.17      +0.33
S-shape(TO+38)          +0.21    +1.07      +0.67
S-shape-stn(TO+38)      +0.15    +1.13      +0.67
Data(TO37-BO37)         +2.59    +2.32      +2.45
Data-stn(TO37-BO37)     +2.06    +2.01      +2.04
Char(TO38-TO37)         -0.41    +0.42      +0.03
Char-stn(TO38-TO37)     -0.13    +0.08      -0.02

The top half of Table III demonstrates the results of regular and irregular text recognition, and the bottom half shows the performance improvement. Obviously, the S-shape distortion and the ST_e dataset greatly promote performance. Outputting a non-alphanumeric symbol has a relatively small impact on the overall performance. The output of 38 classes damages text recognition on the regular datasets and promotes text recognition on the irregular datasets. According to the above analysis, we take the Base(TO+38) model as the base model in the following.

2) IBN Module: We discuss the effects of the different versions of the IBN module and of the number of IBN module layers on the text recognizer. We utilize ResNet45 [21] as the backbone, which consists of 5 residual modules with batch normalization. Following [9], the batch normalization layers are replaced by IBN in the shallow layers.

When we compare the effects of the two IBN modules (the IBN-a module and the IBN-b module), the first 4 residual blocks with batch normalization layers are replaced by IBN modules. As illustrated in Table IV, we use the models with only batch normalization as the baselines (denoted by Base, Base-stn, DataBase, and DataBase-stn). All the models are trained on the SK and ST datasets. Without S-shape distortion, the IBN-a module can always help improve performance, while the IBN-b module degrades performance. With S-shape distortion, the IBN-a module helps improve the performance of the DataBase-ibn-a model but makes the DataBasestn-ibn-a model slightly worse. As for the IBN-b module, it helps the DataBase*-ibn-b models improve performance on the overall data.

TABLE IV
THE RESULTS OF DIFFERENT IBN MODULES.

Method              Regular         Irregular       Total
Base                92.32           74.41           82.87
Base-stn            93.07           76.89           84.53
Base-ibn-a          92.40 (+0.08)   74.90 (+0.48)   83.16 (+0.29)
Base-ibn-b          92.15 (-0.17)   73.80 (-0.61)   82.46 (-0.41)
Basestn-ibn-a       92.96 (-0.11)   77.67 (+0.78)   84.89 (+0.36)
Basestn-ibn-b       92.60 (-0.47)   76.37 (-0.52)   84.03 (-0.50)
DataBase            92.53           75.49           83.54
DataBase-stn        93.22           78.02           85.20
DataBase-ibn-a      92.65 (+0.11)   76.39 (+0.90)   84.06 (+0.52)
DataBase-ibn-b      92.60 (+0.07)   75.72 (+0.23)   83.69 (+0.15)
DataBasestn-ibn-a   93.16 (-0.06)   77.50 (-0.52)   84.89 (-0.31)
DataBasestn-ibn-b   93.48 (+0.25)   77.87 (-0.16)   85.24 (+0.04)

In addition, we also compare the impact of the number of IBN layers. As shown in Table V, BN is the configuration of the DataBase model. Different IBN-a module layers all promote performance.

TABLE V
THE RESULTS OF DIFFERENT NUMBERS OF IBN LAYERS.

Method       Regular         Irregular       Total
BN           92.53           75.49           83.54
IBN a, 2     92.66 (+0.13)   76.04 (+0.55)   83.89 (+0.35)
IBN a, 1-2   93.03 (+0.50)   75.87 (+0.38)   83.97 (+0.43)
IBN a, 2-3   92.90 (+0.37)   75.85 (+0.36)   83.90 (+0.36)
IBN a, 2-4   92.92 (+0.39)   76.97 (+1.48)   84.50 (+0.96)
IBN a, 1-4   92.65 (+0.12)   76.39 (+0.90)   84.06 (+0.52)

According to the above analysis, the proposed IBN module can improve text recognition, and overall the IBN-a module is better than the IBN-b module in terms of performance improvement. The performance improvement on irregular text is larger than that on regular text. In addition, the IBN-a module does not increase the computational cost.
D. Comparisons with the State-of-the-arts

The proposed IBN-STR model is trained on the ST + SK datasets, and the outputs are 38 classes, including a non-alphanumeric symbol. The input images are S-shape distorted and fed into the IBN-STR model. In the final IBN-STR model, the IBN-a modules are utilized in Block 2 to Block 4 of Table I. We compare the performance of our model and other state-of-the-art methods in Table VI. IBN-STR(base) is trained without the rectification network, and IBN-STR(stn) is trained with the rectification network. Compared with the rectification-based methods ASTER [6] and MORAN [7], our method performs better on the IC13, IIIT5K, IC15, SVT-P, and CUTE datasets. In addition, on Total-Text with complex text instances, the performance of our model is significantly improved, being 10.3%-11.8% higher than the previous model.

TABLE VI
COMPARISON WITH OTHER TEXT RECOGNITION METHODS. * MEANS USING 1,811 IMAGES.

                          Regular                             Irregular
Method           Data   IC13  SVT         IIIT5K             IC15   SVT-P       CUTE  Total-text  Total
                        None  None  50    None  50    1k    None   None  50    None  None        None
CRNN [4]         SK     89.6  82.7  97.5  81.2  97.8  95.0  -      -     -     -     -           -
GCRNN [34]       SK     -     81.5  96.3  80.8  98.0  95.6  -      -     -     -     -           -
R2AM [15]        SK     90.0  80.7  96.3  78.4  96.8  94.4  -      -     -     -     -           -
Liao et al. [35] ST     91.4  82.1  98.5  92.0  99.8  98.9  -      -     -     78.1  -           -
Aster [6]        ST+SK  91.8  93.6  99.2  93.4  99.6  98.8  76.1*  78.5  -     79.5  -           -
2D-CTC [36]      ST+SK  93.9  90.6  97.2  94.7  99.8  98.9  75.2*  79.2  -     81.3  63.0        -
RCN [16]         ST+SK  93.2  88.6  97.7  94.0  99.6  98.9  77.1   80.6  95.0  88.5  -           -
MORAN [7]        ST+SK  92.4  88.3  96.6  91.2  97.9  96.2  68.8   76.1  94.3  77.4  -           -
Lyu et al. [17]  ST+SK  92.7  90.1  97.2  94.0  99.8  99.1  76.3   82.3  -     86.8  -           -
IBN-STR(base)    ST+SK  93.8  90.0  97.3  93.3  99.5  98.7  77.8   83.6  95.0  84.4  73.3        84.5
IBN-STR(stn)     ST+SK  94.7  91.0  98.0  94.0  99.8  98.6  79.1   85.1  94.6  85.4  74.8        85.6
V. CONCLUSION

In this paper, we consider the data aspect and the feature representation aspect to improve the generalization of the model. S-shape distortion is utilized to enrich the diversity of training data, and the effect of data augmentation on text recognition is analyzed. In addition, the combination of instance normalization and batch normalization improves the model's capacity and generalization ability. The IBN-STR model is proposed to achieve text recognition and can compete with the state-of-the-arts. Experimental results show the effectiveness of the proposed method. Although the proposed method performs well on regular and irregular text recognition, it cannot handle vertical or flipped text instances. Future research will focus on a flexible text recognizer that can process text from various perspectives.

ACKNOWLEDGMENT

This work is supported by the National Key R&D Program of China (2018YFB1403900) and the Science and Technology Program of Beijing (Z201100001820002). It was also the research achievement of the Key Laboratory of Digital Rights Services, which is one of the National Science and Standardization Key Labs for the Press and Publication Industry.

REFERENCES

[1] K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1457–1464.
[2] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “Photoocr: Reading text in uncontrolled conditions,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 785–792.
[3] L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3538–3545.
[4] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 11, pp. 2298–2304, 2016.
[5] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust scene text recognition with automatic rectification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4168–4176.
[6] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 9, pp. 2035–2048, 2018.
[7] C. Luo, L. Jin, and Z. Sun, “Moran: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019.
[8] H. Nam and H.-E. Kim, “Batch-instance normalization for adaptively style-invariant neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 2558–2567.
[9] X. Pan, P. Luo, J. Shi, and X. Tang, “Two at once: Enhancing learning and generalization capacities via ibn-net,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 464–479.
[10] L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3538–3545.
[11] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2963–2970.
[12] C. Yao, X. Bai, and W. Liu, “A unified framework for multioriented text detection and recognition,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4737–4749, 2014.
[13] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1083–1090.

[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep structured output learning for unconstrained text recognition,” arXiv preprint arXiv:1412.5903, 2014.
[15] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention
modeling for ocr in the wild,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
[16] Y. Gao, Y. Chen, J. Wang, Z. Lei, X.-Y. Zhang, and H. Lu, “Recur-
rent calibration network for irregular text recognition,” arXiv preprint
arXiv:1812.07145, 2018.
[17] P. Lyu, Z. Yang, X. Leng, X. Wu, R. Li, and X. Shen, “2d attentional
irregular scene text recognizer,” arXiv preprint arXiv:1906.05708, 2019.
[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[19] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks:
Maximizing quality and diversity in feed-forward stylization and texture
synthesis,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2017, pp. 6924–6932.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[21] A. Mishra, K. Alahari, and C. Jawahar, “Enhancing energy minimization
framework for scene text recognition with top-down cues,” Computer
Vision and Image Understanding, vol. 145, pp. 30–42, 2016.
[22] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and
J. Schmidhuber, “A novel connectionist system for unconstrained hand-
writing recognition,” IEEE transactions on pattern analysis and machine
intelligence, vol. 31, no. 5, pp. 855–868, 2008.
[23] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[24] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer
networks,” in Advances in neural information processing systems, 2015,
pp. 2017–2025.
[25] F. L. Bookstein, “Principal warps: Thin-plate splines and the decom-
position of deformations,” IEEE Transactions on pattern analysis and
machine intelligence, vol. 11, no. 6, pp. 567–585, 1989.
[26] A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using
higher order language priors,” in BMVC, 2012.
[27] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R.
Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras,
“Icdar 2013 robust reading competition,” in 2013 12th International
Conference on Document Analysis and Recognition. IEEE, 2013, pp.
1484–1493.
[28] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov,
M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al.,
“Icdar 2015 competition on robust reading,” in 2015 13th International
Conference on Document Analysis and Recognition (ICDAR). IEEE,
2015, pp. 1156–1160.
[29] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan, “Recognizing
text with perspective distortion in natural scenes,” in Proceedings of the
IEEE International Conference on Computer Vision, 2013, pp. 569–576.
[30] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust
arbitrary text detection system for natural scene images,” Expert Systems
with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
[31] C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for
scene text detection and recognition,” in 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR), vol. 1.
IEEE, 2017, pp. 935–942.
[32] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic
data and artificial neural networks for natural scene text recognition,”
arXiv preprint arXiv:1406.2227, 2014.
[33] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text
localisation in natural images,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp. 2315–2324.
[34] J. Wang and X. Hu, “Gated recurrent convolution neural network for
ocr,” in Advances in Neural Information Processing Systems, 2017, pp.
335–344.
[35] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, “Scene text recognition from two-dimensional perspective,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8714–8721.
[36] Z. Wan, F. Xie, Y. Liu, X. Bai, and C. Yao, “2d-ctc for scene text recognition,” arXiv preprint arXiv:1907.09705, 2019.

