[Figure: network module diagram built from Conv(128, 1×1) layers with BN(128), IN(64), and BN(64) normalization layers.]
Section II introduces the related works on text recognition. Section III illustrates the proposed method. Section IV demonstrates the experimental results that verify the effectiveness of the proposed method. Section V summarizes this paper.

[Figure: recognition pipeline — an input image (e.g., "OPTIMUM") passes through text rectification to produce a rectified image, which is then fed to recognition.]
In the training phase, the mean/variance is calculated based on the samples of the mini-batch. All means and variances of the mini-batches seen during training are saved for testing. In the test phase, the saved mini-batch means/variances are combined by a weighted average to obtain the estimates used for normalization.

Instance normalization (IN) [19] is mainly used in the field of style transfer [8], [9] because IN can learn features that are invariant to style or appearance. For style transfer, each image should be regarded as a domain. In order to maintain the independence between different image instances, IN retains the N and C dimensions and computes the mean and standard deviation only over H and W within each channel. The mean and variance can be formulated as:

\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw},    (4)

\sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x_{nchw} - \mu_{nc}(x)\right)^2 + \epsilon},    (5)

where \epsilon is a small constant for numerical stability.
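For concreteness, the following minimal PyTorch sketch computes the per-instance, per-channel statistics of Eqs. (4)-(5) and checks them against the library's instance normalization; the tensor shape and the value of ε are illustrative, not taken from the paper.

```python
import torch

# Eqs. (4)-(5): statistics are computed per sample n and per channel c,
# over the spatial dimensions H and W only (illustrative shape below).
x = torch.randn(8, 64, 32, 100)   # (N, C, H, W) feature map
eps = 1e-5                        # assumed small constant

mu = x.mean(dim=(2, 3), keepdim=True)                               # Eq. (4)
sigma = torch.sqrt(x.var(dim=(2, 3), unbiased=False,
                         keepdim=True) + eps)                       # Eq. (5)
x_in = (x - mu) / sigma           # instance-normalized features

# torch.nn.functional.instance_norm (without affine parameters) matches this.
ref = torch.nn.functional.instance_norm(x, eps=eps)
print(torch.allclose(x_in, ref, atol=1e-5))
```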
Texts in natural scenes vary greatly in appearance. Inspired by the contribution of instance normalization to style transfer tasks [8], [9], we introduce IN to obtain a robust text recognizer for natural scenes. As shown in Figure 2, two types of IBN modules are provided. In the IBN a module, the feature maps are divided into two parts and sent to IN and BN respectively, and the outputs of IN and BN are concatenated and sent to the next convolutional layer. The IBN a module can integrate appearance-invariant features and content-related information to improve performance. To explore the generalization ability of different types of IBN modules, the IBN b module is also proposed; there, the IN layer is placed before the residual block output.

The experimental results in Section IV verify the effectiveness of the proposed IBN module. In particular, the IBN a module performs better than the IBN b module most of the time and improves both regular and irregular text recognition.
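The IBN a idea described above can be summarized in a short PyTorch sketch: the channels produced by a convolution are split, one half normalized by IN and the other by BN, and the halves concatenated before the next layer. The equal split ratio and the layer sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class IBNa(nn.Module):
    """Illustrative IBN-a style normalization: split channels, normalize one
    half with IN and the other with BN, then concatenate (split ratio assumed)."""

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.half = half
        self.instance_norm = nn.InstanceNorm2d(half, affine=True)
        self.batch_norm = nn.BatchNorm2d(channels - half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.instance_norm(a), self.batch_norm(b)], dim=1)

# Example: drop-in replacement for the BN after a 1x1 conv in a residual block.
block = nn.Sequential(nn.Conv2d(32, 64, kernel_size=1, bias=False),
                      IBNa(64), nn.ReLU(inplace=True))
out = block(torch.randn(2, 32, 16, 50))
print(out.shape)  # torch.Size([2, 64, 16, 50])
```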
TABLE I
ARCHITECTURE OF TEXT RECOGNITION NETWORK. BLSTM MEANS BIDIRECTIONAL LONG SHORT-TERM MEMORY LAYER.

Layers  | Configurations                                   | Outsize
Block 0 | 3×3 conv, s 1×1, bn                              | 32×32×100
Block 1 | [1×1 conv, 32, bn; 3×3 conv, 32, bn] ×3, s 2×2   | 32×16×50
Block 2 | [1×1 conv, 64, ibn; 3×3 conv, 64, bn] ×4, s 2×2  | 64×8×25
Block 3 | 1×1 conv, 128, ibn; …                            | …

B. Encoder

The encoder aims to extract rich and discriminative features. As illustrated in Table I, the main structure of the encoder is a CNN-BLSTM framework. The encoder first extracts spatial feature maps from the input image through stacked convolutional layers with residual connections [20]. The proposed IBN module can be employed in the shallow layers to obtain strong spatial features. Based on ResNet45 [21], the proposed IBN-STR model uses the IBN module in residual Block 2 to residual Block 4.

The CNN of the encoder captures the features of local regions. To capture the long-range dependencies between characters, a multi-layer bidirectional long short-term memory (BLSTM) [22] is introduced. The BLSTM encodes feature sequences bidirectionally and models global context information, thereby leveraging richer context and improving performance.
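A minimal sketch of the CNN-BLSTM idea follows: convolutional features are collapsed into a width-wise sequence and fed to a multi-layer bidirectional LSTM. The toy backbone below merely stands in for ResNet45 with IBN modules; the channel sizes and sequence length are assumptions.

```python
import torch
import torch.nn as nn

class CNNBLSTMEncoder(nn.Module):
    """Sketch: CNN backbone -> (N, C, 1, W') feature map -> width-wise sequence
    -> multi-layer bidirectional LSTM. Not the paper's actual ResNet45+IBN."""

    def __init__(self, in_channels=3, feat_channels=512, hidden=256, num_layers=2):
        super().__init__()
        self.backbone = nn.Sequential(              # placeholder for ResNet45 + IBN
            nn.Conv2d(in_channels, feat_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(feat_channels), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 25)),          # collapse height, keep width
        )
        self.blstm = nn.LSTM(feat_channels, hidden, num_layers=num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)               # (N, C, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)     # (N, W', C)
        out, _ = self.blstm(seq)                    # (N, W', 2*hidden)
        return out

h = CNNBLSTMEncoder()(torch.randn(2, 3, 32, 100))
print(h.shape)  # torch.Size([2, 25, 512])
```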
C. Decoder

The decoder is attention-based and performs sequence-to-sequence prediction. As shown in Table I, a GRU cell [23] is utilized to decode the output dependencies. Through T iterations, the decoder generates a predicted symbol sequence (y_1, ..., y_T), where T is the number of characters. To generate a variable-length sequence, a special end-of-sequence symbol (EOS) is appended to the target sequence. At step t, the decoder produces a predicted output y_t, and the probability of y_t is p(y_t):

p(y_t) = \mathrm{Softmax}(W_{out} s_t + b_{out}), \quad y_t \sim p(y_t),    (6)

where s_t is the hidden state at the current time step, and W_{out} and b_{out} are trainable parameters. In this paper, the embedding vector of the previous output and s_{t-1} (the hidden state at the previous time step) are fed into the GRU to update s_t:

s_t = \mathrm{GRU}(s_{t-1}, (g_t, f(y_{t-1}))),    (7)

g_t = \sum_{i=1}^{L} \alpha_{t,i} h_i,    (8)

where (g_t, f(y_{t-1})) is the concatenation of the glimpse vector g_t and the embedding vector f(y_{t-1}) of the previous output y_{t-1}. The glimpse vector focuses on a small part of the whole context. In Eq. (8), L is the length of the feature sequence and \alpha_{t,i} is the attentional weight vector, which can be generated by
where p_{l2r}(y_t) and p_{r2l}(y_t) are the probabilities of the sequence decoded from left to right and from right to left, respectively.
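A hedged sketch of one attention-decoding step in the spirit of Eqs. (6)-(8) is given below. The additive attention used to produce α_{t,i} is a common choice inserted here because the paper's exact attention formula is not reproduced above; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGRUDecoderStep(nn.Module):
    """One decoding step: attention over encoder outputs h_i gives the glimpse
    g_t (Eq. 8), which is concatenated with the previous-symbol embedding and
    fed to a GRU cell (Eq. 7); a linear layer + softmax gives p(y_t) (Eq. 6).
    The additive attention form and all sizes are assumptions."""

    def __init__(self, num_classes, enc_dim=512, emb_dim=128, hidden=256, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, emb_dim)
        self.score = nn.Sequential(nn.Linear(enc_dim + hidden, attn_dim),
                                   nn.Tanh(), nn.Linear(attn_dim, 1))
        self.gru = nn.GRUCell(enc_dim + emb_dim, hidden)
        self.out = nn.Linear(hidden, num_classes)   # W_out, b_out in Eq. (6)

    def forward(self, y_prev, s_prev, enc_out):
        # enc_out: (N, L, enc_dim); s_prev: (N, hidden); y_prev: (N,) int labels
        L = enc_out.size(1)
        s_exp = s_prev.unsqueeze(1).expand(-1, L, -1)
        alpha = F.softmax(self.score(torch.cat([enc_out, s_exp], dim=-1)).squeeze(-1), dim=1)
        g_t = (alpha.unsqueeze(-1) * enc_out).sum(dim=1)                       # Eq. (8)
        s_t = self.gru(torch.cat([g_t, self.embed(y_prev)], dim=-1), s_prev)   # Eq. (7)
        p_t = F.softmax(self.out(s_t), dim=-1)                                 # Eq. (6)
        return p_t, s_t, alpha

step = AttnGRUDecoderStep(num_classes=38)
p, s, a = step(torch.zeros(2, dtype=torch.long), torch.zeros(2, 256), torch.randn(2, 25, 512))
print(p.shape, s.shape, a.shape)  # (2, 38) (2, 256) (2, 25)
```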
D. Rectification
The rectification network is based on the spatial transformer network [24], similar to RARE [5]. First, fiducial points are predicted by the localization network; then thin-plate-spline transformation [25] matrices are calculated to generate the sampling grid. Finally, the sampler uses bilinear interpolation to obtain the rectified image. Table II illustrates the architecture of the localization network. The input image is scaled to 32 × 100 and fed into convolutional and pooling layers. Each convolutional layer is followed by a batch normalization layer and a ReLU layer. An adaptive average pooling layer is used to generate feature vectors, which then pass through two fully connected layers to generate the predicted fiducial points.

TABLE II
ARCHITECTURE OF THE LOCALIZATION NETWORK. CONV MEANS CONVOLUTION LAYER, MP MEANS MAXPOOLING LAYER, AND ADAPAVGP MEANS ADAPTIVE AVERAGE POOLING LAYER.

Layers   | Configurations        | Outsize
Conv 1   | 3×3 conv, 64, s 1×1   | 64×32×100
MP 1     | 2×2                   | 64×16×50
Conv 2   | 3×3 conv, 128, s 1×1  | 128×16×50
MP 2     | 2×2                   | 128×8×25
Conv 3   | 3×3 conv, 256, s 1×1  | 256×8×25
MP 3     | 2×2                   | 256×4×12
Conv 4   | 3×3 conv, 512, s 1×1  | 512×4×12
AdapAvgP | 1×1                   | 512×1×1
Linear 1 | 512, 256              | 256
Linear 2 | 256, 2K               | 2K
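The localization network of Table II can be sketched in PyTorch as follows; the number of fiducial points K, the input channel count, and the activation between the two fully connected layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class LocalizationNet(nn.Module):
    """Sketch of Table II: four conv/BN/ReLU stages with 2x2 max-pooling,
    adaptive average pooling, and two FC layers regressing 2K coordinates
    for K fiducial points (K and the FC activation are assumed)."""

    def __init__(self, k_points=20, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(in_channels, 64),  nn.MaxPool2d(2),   # 64 x 16 x 50
            conv_bn_relu(64, 128),          nn.MaxPool2d(2),   # 128 x 8 x 25
            conv_bn_relu(128, 256),         nn.MaxPool2d(2),   # 256 x 4 x 12
            conv_bn_relu(256, 512),                            # 512 x 4 x 12
            nn.AdaptiveAvgPool2d(1),                           # 512 x 1 x 1
        )
        self.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(inplace=True),
                                nn.Linear(256, 2 * k_points))

    def forward(self, x):
        # x: a 32x100 input image batch; returns (N, 2K) fiducial coordinates.
        return self.fc(self.features(x).flatten(1))

pts = LocalizationNet()(torch.randn(2, 3, 32, 100))
print(pts.shape)  # torch.Size([2, 40])
```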
E. Data Augmentation

To enrich the diversity of the training data, it is necessary to adopt data augmentation for the input images. Here, we utilize a trigonometric function to generate an S-shape distortion transformation. Given a position (i, j) in the original image and the corresponding position (i', j') in the distorted image, the correspondence between (i, j) and (i', j') is as follows:

i' = a_1 i + a_2 \mathrm{Sin}(\theta, j) + a_3, \quad j' = j,    (12)

where a_1, a_2, a_3 are scaling and shifting parameters, and θ determines the distortion mode for the entire image. In this paper, the original image is S-shape distorted with a probability of 0.4. As shown in Figure 4, there are 16 distortion modes, one of which is randomly selected to produce the input image. The experimental results demonstrate the effectiveness of the S-shape distortion.
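A possible implementation of the S-shape distortion is sketched below. Since the exact form of Sin(θ, j) in Eq. (12) is not spelled out here, it is interpreted as a sine whose period along the image width is controlled by θ; this is one plausible reading, not the authors' exact augmentation code, and the parameter values are illustrative.

```python
import numpy as np
import cv2  # used only for bilinear remapping; any sampler would do

def s_shape_distort(img, a1=1.0, a2=4.0, a3=0.0, theta=2 * np.pi):
    """Illustrative S-shape distortion in the spirit of Eq. (12):
    i' = a1 * i + a2 * Sin(theta, j) + a3, j' = j.
    Sin(theta, j) is interpreted here as sin(theta * j / W), i.e. theta sets
    how many sine periods span the image width (an assumption)."""
    h, w = img.shape[:2]
    jj, ii = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_i = (a1 * ii + a2 * np.sin(theta * jj / w) + a3).astype(np.float32)  # rows
    map_j = jj                                                               # columns unchanged
    return cv2.remap(img, map_j, map_i, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)

# Usage sketch: distort a word image, e.g. with one of several parameter "modes".
# img = cv2.imread("word.png"); out = s_shape_distort(img, a2=3.0, theta=4 * np.pi)
```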
The proposed method focuses on alphanumeric character recognition, but non-alphanumeric characters occur frequently in text images of natural scenes. Therefore, this paper also discusses in Section IV whether to use non-alphanumeric text images.

Fig. 4. S-shape distortion. (a) Original image; (b) distorted images.

IV. EXPERIMENTS

In this section, we conduct extensive experiments to verify the effectiveness of the proposed method. The performance of all methods is measured by word accuracy.

A. Benchmark Datasets

• Street View Text (SVT). The Street View Text dataset [1] has 350 images collected from Google Street View. The dataset has 647 word instances, and each instance has a 50-word lexicon.
• IIIT5K-Words (IIIT5K). The IIIT5K-Words dataset [26] has 3,000 cropped word instances for testing. The dataset provides a 50-word and a 1k-word lexicon for each word instance.
• ICDAR 2013 (IC13). ICDAR 2013 [27] has 1,095 word instances cropped from 233 scene images. After filtering out words with non-alphanumeric characters, 1,015 cropped word instances are obtained for evaluation.
• ICDAR 2015 (IC15). ICDAR 2015 [28] provides 2,077 multi-oriented word instances for text recognition. The word instances are cropped from the test scene images. After removing non-alphanumeric characters, words shorter than 3 characters, and irregular text, 1,811 word instances are obtained.
• SVT-Perspective (SVT-P). The SVT-Perspective dataset [29] has 639 perspective text instances, and a 50-word lexicon is provided for each instance.
• CUTE80 (CUTE). The CUTE80 dataset [30] has 288 word instances cropped from 80 high-resolution images taken in natural scenes. The dataset contains many examples of curved text.
• Total-Text. Total-Text [31] has 300 test images. The word instances are arbitrarily shaped text, including flipped text. 2,204 word instances are obtained after filtering out words with non-alphanumeric characters.

The benchmarks consist of regular and irregular text. There are 4,662 regular text instances from the SVT, IIIT5K, and IC13 datasets, and 5,214 irregular text instances from the IC15, SVT-P, CUTE, and Total-Text datasets. The total number of text instances is 9,876.
B. Implementation Details

In this paper, we utilize Synth90k [32] and SynthText [33] as training data and evaluate on the standard benchmarks. The Synth90k dataset (denoted as SK) contains approximately 8.9 million synthetic word images, and the SynthText dataset (denoted as ST) has 6.9 million training instances, including 1.4 million non-alphanumeric instances. For SynthText, 5.5 million word instances (denoted as ST a) are obtained by filtering out words with non-alphanumeric characters, while the 1.4 million non-alphanumeric word instances are denoted as ST e. The proposed model is trained using only synthetic data, without fine-tuning. The model recognizes only alphanumeric characters: 26 letters, 10 digits, a symbol for non-alphanumeric characters, and a symbol standing for 'EOS'. The model is trained from scratch and optimized by the Adam optimizer with a learning rate of 5e-4. Training stops after 10 epochs. All input images are resized to 32 × 100. The experiments are conducted with two NVIDIA Tesla K40 GPUs, and the batch size is 1024.
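A hedged sketch of this training configuration in PyTorch is shown below; IBNSTRModel and synthetic_dataset are placeholders for the paper's model and the SK/ST data, and the loss interface is assumed.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Inputs resized to 32x100, Adam with lr 5e-4, batch size 1024, 10 epochs,
# as described above. Model and dataset objects below are placeholders.
preprocess = transforms.Compose([
    transforms.Resize((32, 100)),
    transforms.ToTensor(),
])

# model = IBNSTRModel(num_classes=38)                       # hypothetical model class
# loader = DataLoader(synthetic_dataset, batch_size=1024,   # SK + ST word images
#                     shuffle=True, num_workers=8)
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# for epoch in range(10):
#     for images, targets in loader:
#         loss = model(images, targets)    # assumed to return the training loss
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```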
TABLE III
THE RESULTS OF DATA AUGMENTATION.

Method               | Regular | Irregular | Total
Base(BO+37)          | 90.20   | 72.61     | 80.91
Base-stn(BO+37)      | 90.78   | 75.76     | 82.85
Base(TO+38)          | 92.32   | 74.41     | 82.87
Base-stn(TO+38)      | 93.07   | 76.89     | 84.53
Data-base(BO+37)     | 90.35   | 72.75     | 81.06
Data-base-stn(BO+37) | 91.30   | 75.93     | 83.18
Data-base(TO+37)     | 92.94   | 75.07     | 83.51
Data-base-stn(TO+37) | 93.35   | 77.94     | 85.22
Data-base(TO+38)     | 92.53   | 75.49     | 83.54
Data-base-stn(TO+38) | 93.22   | 78.02     | 85.20

Improvement          | Regular | Irregular | Total
S-shape(BO+37)       | +0.15   | +0.13     | +0.15
S-shape-stn(BO+37)   | +0.52   | +0.17     | +0.33
S-shape(TO+38)       | +0.21   | +1.07     | +0.67
S-shape-stn(TO+38)   | +0.15   | +1.13     | +0.67
Data(TO37-BO37)      | +2.59   | +2.32     | +2.45
Data-stn(TO37-BO37)  | +2.06   | +2.01     | +2.04
Char(TO38-TO37)      | -0.41   | +0.42     | +0.03
Char-stn(TO38-TO37)  | -0.13   | +0.08     | -0.02

C. Ablation Study

1) Data Augmentation: Here we examine the effects of the different training datasets, the S-shape distortion, and the output set (with or without a symbol for non-alphanumeric characters). As shown in Table III, BO means using the SK + ST a datasets, while TO means using SK + ST (SK + ST a + ST e). 37 indicates an output set without the non-alphanumeric symbol, while 38 indicates an output set including a symbol for non-alphanumeric characters. The Base-* models are trained on images without S-shape distortion, while the inputs of the Data-base-* models are S-shape distorted. All models with the *-stn suffix are trained with the rectification network.

The top half of Table III shows the results for regular and irregular text recognition, and the bottom half shows the corresponding performance improvements. The S-shape distortion and the ST e dataset clearly promote performance. Outputting the non-alphanumeric symbol has a relatively small impact on overall performance: the 38-class output slightly hurts recognition on the regular datasets but helps on the irregular datasets. According to the above analysis, we take the Base(TO+38) model as the base model in the following.

2) IBN Module: We discuss the effects of the different versions of the IBN module and of the number of IBN layers on the text recognizer. We utilize ResNet45 [21] as the backbone, which consists of 5 residual modules with batch normalization. Following [9], the batch normalization layers in the shallow layers are replaced by IBN.

TABLE IV
THE RESULTS OF DIFFERENT IBN MODULES.

Method             | Regular       | Irregular     | Total
Base               | 92.32         | 74.41         | 82.87
Base-stn           | 93.07         | 76.89         | 84.53
Base-ibn-a         | 92.40 (+0.08) | 74.90 (+0.48) | 83.16 (+0.29)
Base-ibn-b         | 92.15 (−0.17) | 73.80 (−0.61) | 82.84 (−0.41)
Basestn-ibn-a      | 92.96 (−0.11) | 77.67 (+0.78) | 84.89 (+0.36)
Basestn-ibn-b      | 92.60 (−0.47) | 76.37 (−0.52) | 84.03 (−0.50)
DataBase           | 92.53         | 75.49         | 83.54
DataBase-stn       | 93.22         | 78.02         | 85.20
DataBase-ibn-a     | 92.65 (+0.11) | 76.39 (+0.90) | 84.06 (+0.52)
DataBase-ibn-b     | 92.60 (+0.07) | 75.72 (+0.23) | 83.69 (+0.15)
DataBasestn-ibn-a  | 93.16 (−0.06) | 77.50 (−0.52) | 84.89 (−0.31)
DataBasestn-ibn-b  | 93.48 (+0.25) | 77.87 (−0.16) | 85.24 (+0.04)

TABLE V
THE RESULTS OF DIFFERENT NUMBERS OF IBN LAYERS.

Method     | Regular       | Irregular     | Total
BN         | 92.53         | 75.49         | 83.54
IBN a, 2   | 92.66 (+0.13) | 76.04 (+0.55) | 83.89 (+0.35)
IBN a, 1-2 | 93.03 (+0.50) | 75.87 (+0.38) | 83.97 (+0.43)
IBN a, 2-3 | 92.90 (+0.37) | 75.85 (+0.36) | 83.90 (+0.36)
IBN a, 2-4 | 92.92 (+0.39) | 76.97 (+1.48) | 84.50 (+0.96)
IBN a, 1-4 | 92.65 (+0.12) | 76.39 (+0.90) | 84.06 (+0.52)

When we compare the effects of the two IBN modules (the IBN a module and the IBN b module), the batch normalization layers in the first 4 residual blocks are replaced by IBN modules. As illustrated in Table IV, we use the models with only batch normalization as the baselines (denoted by Base, Base-stn, DataBase, and DataBase-stn). All models are trained on the SK and ST datasets. Without S-shape distortion, the IBN a module consistently improves performance, while the IBN b module degrades it. With S-shape distortion, the IBN a module improves the DataBase-ibn-a model but makes the DataBasestn-ibn-a model slightly worse. As for the IBN b module, it helps the DataBase*-ibn-b models improve the overall performance.

In addition, we also compare the impact of the number of IBN layers, with results reported in Table V.
TABLE VI
COMPARISON WITH OTHER TEXT RECOGNITION METHODS. * MEANS USING 1,811 IMAGES.

Method           | Data  | IC13 None | SVT None | SVT 50 | IIIT5K None | IIIT5K 50 | IIIT5K 1k | IC15 None | SVT-P None | SVT-P 50 | CUTE None | Total-Text None | Total
CRNN [4]         | SK    | 89.6 | 82.7 | 97.5 | 81.2 | 97.8 | 95.0 | -     | -    | -    | -    | -    | -
GCRNN [34]       | SK    | -    | 81.5 | 96.3 | 80.8 | 98.0 | 95.6 | -     | -    | -    | -    | -    | -
R2AM [15]        | SK    | 90.0 | 80.7 | 96.3 | 78.4 | 96.8 | 94.4 | -     | -    | -    | -    | -    | -
Liao et al. [35] | ST    | 91.4 | 82.1 | 98.5 | 92.0 | 99.8 | 98.9 | -     | -    | -    | 78.1 | -    | -
Aster [6]        | ST+SK | 91.8 | 93.6 | 99.2 | 93.4 | 99.6 | 98.8 | 76.1* | 78.5 | -    | 79.5 | -    | -
2D-CTC [36]      | ST+SK | 93.9 | 90.6 | 97.2 | 94.7 | 99.8 | 98.9 | 75.2* | 79.2 | -    | 81.3 | 63.0 | -
RCN [16]         | ST+SK | 93.2 | 88.6 | 97.7 | 94.0 | 99.6 | 98.9 | 77.1  | 80.6 | 95.0 | 88.5 | -    | -
MORAN [7]        | ST+SK | 92.4 | 88.3 | 96.6 | 91.2 | 97.9 | 96.2 | 68.8  | 76.1 | 94.3 | 77.4 | -    | -
Lyu et al. [17]  | ST+SK | 92.7 | 90.1 | 97.2 | 94.0 | 99.8 | 99.1 | 76.3  | 82.3 | -    | 86.8 | -    | -
IBN-STR(base)    | ST+SK | 93.8 | 90.0 | 97.3 | 93.3 | 99.5 | 98.7 | 77.8  | 83.6 | 95.0 | 84.4 | 73.3 | 84.5
IBN-STR(stn)     | ST+SK | 94.7 | 91.0 | 98.0 | 94.0 | 99.8 | 98.6 | 79.1  | 85.1 | 94.6 | 85.4 | 74.8 | 85.6
[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep structured output learning for unconstrained text recognition,” arXiv preprint arXiv:1412.5903, 2014.
[15] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention
modeling for ocr in the wild,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
[16] Y. Gao, Y. Chen, J. Wang, Z. Lei, X.-Y. Zhang, and H. Lu, “Recur-
rent calibration network for irregular text recognition,” arXiv preprint
arXiv:1812.07145, 2018.
[17] P. Lyu, Z. Yang, X. Leng, X. Wu, R. Li, and X. Shen, “2d attentional
irregular scene text recognizer,” arXiv preprint arXiv:1906.05708, 2019.
[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[19] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks:
Maximizing quality and diversity in feed-forward stylization and texture
synthesis,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2017, pp. 6924–6932.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[21] A. Mishra, K. Alahari, and C. Jawahar, “Enhancing energy minimization
framework for scene text recognition with top-down cues,” Computer
Vision and Image Understanding, vol. 145, pp. 30–42, 2016.
[22] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and
J. Schmidhuber, “A novel connectionist system for unconstrained hand-
writing recognition,” IEEE transactions on pattern analysis and machine
intelligence, vol. 31, no. 5, pp. 855–868, 2008.
[23] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[24] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer
networks,” in Advances in neural information processing systems, 2015,
pp. 2017–2025.
[25] F. L. Bookstein, “Principal warps: Thin-plate splines and the decom-
position of deformations,” IEEE Transactions on pattern analysis and
machine intelligence, vol. 11, no. 6, pp. 567–585, 1989.
[26] A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using
higher order language priors,” in BMVC, 2012.
[27] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R.
Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras,
“Icdar 2013 robust reading competition,” in 2013 12th International
Conference on Document Analysis and Recognition. IEEE, 2013, pp.
1484–1493.
[28] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov,
M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al.,
“Icdar 2015 competition on robust reading,” in 2015 13th International
Conference on Document Analysis and Recognition (ICDAR). IEEE,
2015, pp. 1156–1160.
[29] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan, “Recognizing
text with perspective distortion in natural scenes,” in Proceedings of the
IEEE International Conference on Computer Vision, 2013, pp. 569–576.
[30] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust
arbitrary text detection system for natural scene images,” Expert Systems
with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
[31] C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for
scene text detection and recognition,” in 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR), vol. 1.
IEEE, 2017, pp. 935–942.
[32] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic
data and artificial neural networks for natural scene text recognition,”
arXiv preprint arXiv:1406.2227, 2014.
[33] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text
localisation in natural images,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp. 2315–2324.
[34] J. Wang and X. Hu, “Gated recurrent convolution neural network for
ocr,” in Advances in Neural Information Processing Systems, 2017, pp.
335–344.
[35] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, “Scene text recognition from two-dimensional perspective,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8714–8721.
[36] Z. Wan, F. Xie, Y. Liu, X. Bai, and C. Yao, “2d-ctc for scene text recognition,” arXiv preprint arXiv:1907.09705, 2019.