[Figure: network module diagram built from Conv(128, 1×1) layers with BN(128), IN(64), and BN(64) normalization layers.]
Section II introduces the related works on text recognition. Section III illustrates the proposed method. Section IV demonstrates the experimental results that verify the effectiveness of the proposed method. Section V summarizes this paper.

[Figure: recognition pipeline — an input image (e.g., "OPTIMUM") passes through text rectification to produce a rectified image, which is then fed to recognition.]
In the training phase, the mean/variance is calculated based on the samples of the mini-batch. All means and variances of the mini-batches seen during training are saved for testing. In the test phase, the saved mini-batch means/variances are combined by a weighted average to obtain the estimates used for normalization.

Instance normalization (IN) [19] is mainly used in the field of style transfer [8], [9] because IN can learn features that are invariant to style or appearance. For style transfer, each image should be regarded as a domain. In order to maintain the independence between different image instances, IN retains the N and C dimensions and computes the mean and standard deviation only over H and W within each channel. The mean and variance can be formulated as:

\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw},    (4)

\sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x_{nchw} - \mu_{nc}(x)\right)^2 + \epsilon},    (5)

where \epsilon is a small constant for numerical stability.
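For concreteness, the following minimal PyTorch sketch computes the per-instance, per-channel statistics of Eqs. (4)-(5) and checks them against the library's instance normalization; the tensor shape and the value of ε are illustrative, not taken from the paper.

```python
import torch

# Eqs. (4)-(5): statistics are computed per sample n and per channel c,
# over the spatial dimensions H and W only (illustrative shape below).
x = torch.randn(8, 64, 32, 100)   # (N, C, H, W) feature map
eps = 1e-5                        # assumed small constant

mu = x.mean(dim=(2, 3), keepdim=True)                               # Eq. (4)
sigma = torch.sqrt(x.var(dim=(2, 3), unbiased=False,
                         keepdim=True) + eps)                       # Eq. (5)
x_in = (x - mu) / sigma           # instance-normalized features

# torch.nn.functional.instance_norm (without affine parameters) matches this.
ref = torch.nn.functional.instance_norm(x, eps=eps)
print(torch.allclose(x_in, ref, atol=1e-5))
```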
Texts in natural scenes vary greatly in appearance. Inspired by the contribution of instance normalization to style transfer tasks [8], [9], we introduce IN to obtain a robust text recognizer for natural scenes. As shown in Figure 2, two types of IBN modules are provided. In the IBN a module, the feature maps are divided into two parts and sent to IN and BN respectively, and the outputs of IN and BN are concatenated and sent to the next convolutional layer. The IBN a module can integrate appearance-invariant features and content-related information to improve performance. To explore the generalization ability of different types of IBN modules, the IBN b module is also proposed; there, the IN layer is placed before the residual block output.

The experimental results in Section IV verify the effectiveness of the proposed IBN module. In particular, the IBN a module performs better than the IBN b module most of the time and improves both regular and irregular text recognition.
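The IBN a idea described above can be summarized in a short PyTorch sketch: the channels produced by a convolution are split, one half normalized by IN and the other by BN, and the halves concatenated before the next layer. The equal split ratio and the layer sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class IBNa(nn.Module):
    """Illustrative IBN-a style normalization: split channels, normalize one
    half with IN and the other with BN, then concatenate (split ratio assumed)."""

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.half = half
        self.instance_norm = nn.InstanceNorm2d(half, affine=True)
        self.batch_norm = nn.BatchNorm2d(channels - half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.instance_norm(a), self.batch_norm(b)], dim=1)

# Example: drop-in replacement for the BN after a 1x1 conv in a residual block.
block = nn.Sequential(nn.Conv2d(32, 64, kernel_size=1, bias=False),
                      IBNa(64), nn.ReLU(inplace=True))
out = block(torch.randn(2, 32, 16, 50))
print(out.shape)  # torch.Size([2, 64, 16, 50])
```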
TABLE I
ARCHITECTURE OF TEXT RECOGNITION NETWORK. BLSTM MEANS BIDIRECTIONAL LONG SHORT-TERM MEMORY LAYER.

Layers  | Configurations                                   | Outsize
Block 0 | 3×3 conv, s 1×1, bn                              | 32×32×100
Block 1 | [1×1 conv, 32, bn; 3×3 conv, 32, bn] ×3, s 2×2   | 32×16×50
Block 2 | [1×1 conv, 64, ibn; 3×3 conv, 64, bn] ×4, s 2×2  | 64×8×25
Block 3 | 1×1 conv, 128, ibn; …                            | …

B. Encoder

The encoder aims to extract rich and discriminative features. As illustrated in Table I, the main structure of the encoder is a CNN-BLSTM framework. The encoder first extracts spatial feature maps from the input image through stacked convolutional layers with residual connections [20]. The proposed IBN module can be employed in the shallow layers to obtain strong spatial features. Based on ResNet45 [21], the proposed IBN-STR model uses the IBN module in residual Block 2 to residual Block 4.

The CNN of the encoder captures the features of local regions. To capture the long-range dependencies between characters, a multi-layer bidirectional long short-term memory (BLSTM) [22] is introduced. The BLSTM encodes feature sequences bidirectionally and models global context information, thereby leveraging richer context and improving performance.
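A minimal sketch of the CNN-BLSTM idea follows: convolutional features are collapsed into a width-wise sequence and fed to a multi-layer bidirectional LSTM. The toy backbone below merely stands in for ResNet45 with IBN modules; the channel sizes and sequence length are assumptions.

```python
import torch
import torch.nn as nn

class CNNBLSTMEncoder(nn.Module):
    """Sketch: CNN backbone -> (N, C, 1, W') feature map -> width-wise sequence
    -> multi-layer bidirectional LSTM. Not the paper's actual ResNet45+IBN."""

    def __init__(self, in_channels=3, feat_channels=512, hidden=256, num_layers=2):
        super().__init__()
        self.backbone = nn.Sequential(              # placeholder for ResNet45 + IBN
            nn.Conv2d(in_channels, feat_channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(feat_channels), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 25)),          # collapse height, keep width
        )
        self.blstm = nn.LSTM(feat_channels, hidden, num_layers=num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)               # (N, C, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)     # (N, W', C)
        out, _ = self.blstm(seq)                    # (N, W', 2*hidden)
        return out

h = CNNBLSTMEncoder()(torch.randn(2, 3, 32, 100))
print(h.shape)  # torch.Size([2, 25, 512])
```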
C. Decoder

The decoder is attention-based and performs sequence-to-sequence prediction. As shown in Table I, a GRU cell [23] is utilized to decode the output dependencies. Through T iterations, the decoder generates a predicted symbol sequence (y_1, ..., y_T), where T is the number of characters. To generate a variable-length sequence, a special end-of-sequence symbol (EOS) is appended to the target sequence. At step t, the decoder produces a predicted output y_t, and the probability of y_t is p(y_t):

p(y_t) = \mathrm{Softmax}(W_{out} s_t + b_{out}), \quad y_t \sim p(y_t),    (6)

where s_t is the hidden state at the current time step, and W_{out} and b_{out} are trainable parameters. In this paper, the embedding vector of the previous output and s_{t-1} (the hidden state at the previous time step) are fed into the GRU to update s_t:

s_t = \mathrm{GRU}(s_{t-1}, (g_t, f(y_{t-1}))),    (7)

g_t = \sum_{i=1}^{L} \alpha_{t,i} h_i,    (8)

where (g_t, f(y_{t-1})) is the concatenation of the glimpse vector g_t and the embedding vector f(y_{t-1}) of the previous output y_{t-1}. The glimpse vector focuses on a small part of the whole context. In Eq. (8), L is the length of the feature sequence and \alpha_{t,i} is the attentional weight vector, which can be generated by
where p_{l2r}(y_t) and p_{r2l}(y_t) are the probabilities of the sequence decoded from left to right and from right to left, respectively.
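A hedged sketch of one attention-decoding step in the spirit of Eqs. (6)-(8) is given below. The additive attention used to produce α_{t,i} is a common choice inserted here because the paper's exact attention formula is not reproduced above; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGRUDecoderStep(nn.Module):
    """One decoding step: attention over encoder outputs h_i gives the glimpse
    g_t (Eq. 8), which is concatenated with the previous-symbol embedding and
    fed to a GRU cell (Eq. 7); a linear layer + softmax gives p(y_t) (Eq. 6).
    The additive attention form and all sizes are assumptions."""

    def __init__(self, num_classes, enc_dim=512, emb_dim=128, hidden=256, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, emb_dim)
        self.score = nn.Sequential(nn.Linear(enc_dim + hidden, attn_dim),
                                   nn.Tanh(), nn.Linear(attn_dim, 1))
        self.gru = nn.GRUCell(enc_dim + emb_dim, hidden)
        self.out = nn.Linear(hidden, num_classes)   # W_out, b_out in Eq. (6)

    def forward(self, y_prev, s_prev, enc_out):
        # enc_out: (N, L, enc_dim); s_prev: (N, hidden); y_prev: (N,) int labels
        L = enc_out.size(1)
        s_exp = s_prev.unsqueeze(1).expand(-1, L, -1)
        alpha = F.softmax(self.score(torch.cat([enc_out, s_exp], dim=-1)).squeeze(-1), dim=1)
        g_t = (alpha.unsqueeze(-1) * enc_out).sum(dim=1)                       # Eq. (8)
        s_t = self.gru(torch.cat([g_t, self.embed(y_prev)], dim=-1), s_prev)   # Eq. (7)
        p_t = F.softmax(self.out(s_t), dim=-1)                                 # Eq. (6)
        return p_t, s_t, alpha

step = AttnGRUDecoderStep(num_classes=38)
p, s, a = step(torch.zeros(2, dtype=torch.long), torch.zeros(2, 256), torch.randn(2, 25, 512))
print(p.shape, s.shape, a.shape)  # (2, 38) (2, 256) (2, 25)
```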
D. Rectification
The rectification network is based on the spatial transformer network [24], similar to RARE [5]. First, fiducial points are predicted by the localization network; then thin-plate-spline transformation [25] matrices are calculated to generate the sampling grid. Finally, the sampler uses bilinear interpolation to obtain the rectified image. Table II illustrates the architecture of the localization network. The input image is scaled to 32 × 100 and fed into convolutional and pooling layers. Each convolutional layer is followed by a batch normalization layer and a ReLU layer. An adaptive average pooling layer is used to generate feature vectors, which then pass through two fully connected layers to generate the predicted fiducial points.

TABLE II
ARCHITECTURE OF THE LOCALIZATION NETWORK. CONV MEANS CONVOLUTION LAYER, MP MEANS MAXPOOLING LAYER, AND ADAPAVGP MEANS ADAPTIVE AVERAGE POOLING LAYER.

Layers   | Configurations        | Outsize
Conv 1   | 3×3 conv, 64, s 1×1   | 64×32×100
MP 1     | 2×2                   | 64×16×50
Conv 2   | 3×3 conv, 128, s 1×1  | 128×16×50
MP 2     | 2×2                   | 128×8×25
Conv 3   | 3×3 conv, 256, s 1×1  | 256×8×25
MP 3     | 2×2                   | 256×4×12
Conv 4   | 3×3 conv, 512, s 1×1  | 512×4×12
AdapAvgP | 1×1                   | 512×1×1
Linear 1 | 512, 256              | 256
Linear 2 | 256, 2K               | 2K
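The localization network of Table II can be sketched in PyTorch as follows; the number of fiducial points K, the input channel count, and the activation between the two fully connected layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class LocalizationNet(nn.Module):
    """Sketch of Table II: four conv/BN/ReLU stages with 2x2 max-pooling,
    adaptive average pooling, and two FC layers regressing 2K coordinates
    for K fiducial points (K and the FC activation are assumed)."""

    def __init__(self, k_points=20, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(in_channels, 64),  nn.MaxPool2d(2),   # 64 x 16 x 50
            conv_bn_relu(64, 128),          nn.MaxPool2d(2),   # 128 x 8 x 25
            conv_bn_relu(128, 256),         nn.MaxPool2d(2),   # 256 x 4 x 12
            conv_bn_relu(256, 512),                            # 512 x 4 x 12
            nn.AdaptiveAvgPool2d(1),                           # 512 x 1 x 1
        )
        self.fc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(inplace=True),
                                nn.Linear(256, 2 * k_points))

    def forward(self, x):
        # x: a 32x100 input image batch; returns (N, 2K) fiducial coordinates.
        return self.fc(self.features(x).flatten(1))

pts = LocalizationNet()(torch.randn(2, 3, 32, 100))
print(pts.shape)  # torch.Size([2, 40])
```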
E. Data Augmentation

To enrich the diversity of the training data, it is necessary to adopt data augmentation for the input images. Here, we utilize a trigonometric function to generate an S-shape distortion transformation. Given a position (i, j) in the original image and the corresponding position (i', j') in the distorted image, the correspondence between (i, j) and (i', j') is as follows:

i' = a_1 i + a_2 \mathrm{Sin}(\theta, j) + a_3, \quad j' = j,    (12)

where a_1, a_2, a_3 are scaling and shifting parameters, and θ determines the distortion mode for the entire image. In this paper, the original image is S-shape distorted with a probability of 0.4. As shown in Figure 4, there are 16 distortion modes, one of which is randomly selected to produce the input image. The experimental results demonstrate the effectiveness of the S-shape distortion.
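A possible implementation of the S-shape distortion is sketched below. Since the exact form of Sin(θ, j) in Eq. (12) is not spelled out here, it is interpreted as a sine whose period along the image width is controlled by θ; this is one plausible reading, not the authors' exact augmentation code, and the parameter values are illustrative.

```python
import numpy as np
import cv2  # used only for bilinear remapping; any sampler would do

def s_shape_distort(img, a1=1.0, a2=4.0, a3=0.0, theta=2 * np.pi):
    """Illustrative S-shape distortion in the spirit of Eq. (12):
    i' = a1 * i + a2 * Sin(theta, j) + a3, j' = j.
    Sin(theta, j) is interpreted here as sin(theta * j / W), i.e. theta sets
    how many sine periods span the image width (an assumption)."""
    h, w = img.shape[:2]
    jj, ii = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    map_i = (a1 * ii + a2 * np.sin(theta * jj / w) + a3).astype(np.float32)  # rows
    map_j = jj                                                               # columns unchanged
    return cv2.remap(img, map_j, map_i, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)

# Usage sketch: distort a word image, e.g. with one of several parameter "modes".
# img = cv2.imread("word.png"); out = s_shape_distort(img, a2=3.0, theta=4 * np.pi)
```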
The proposed method focuses on alphanumeric character recognition, but non-alphanumeric characters occur frequently in text images of natural scenes. Therefore, this paper also discusses in Section IV whether to use non-alphanumeric text images.

Fig. 4. S-shape distortion. (a) Original image; (b) distorted images.

IV. EXPERIMENTS

In this section, we conduct extensive experiments to verify the effectiveness of the proposed method. The performance of all methods is measured by word accuracy.

A. Benchmark Datasets

• Street View Text (SVT). The Street View Text dataset [1] has 350 images collected from Google Street View. The dataset has 647 word instances, and each instance has a 50-word lexicon.
• IIIT5K-Words (IIIT5K). The IIIT5K-Words dataset [26] has 3,000 cropped word instances for testing. The dataset provides a 50-word and a 1k-word lexicon for each word instance.
• ICDAR 2013 (IC13). ICDAR 2013 [27] has 1,095 word instances cropped from 233 scene images. After filtering out words with non-alphanumeric characters, 1,015 cropped word instances are obtained for evaluation.
• ICDAR 2015 (IC15). ICDAR 2015 [28] provides 2,077 multi-oriented word instances for text recognition. The word instances are cropped from the test scene images. After removing non-alphanumeric characters, words shorter than 3 characters, and irregular text, 1,811 word instances are obtained.
• SVT-Perspective (SVT-P). The SVT-Perspective dataset [29] has 639 perspective text instances, and a 50-word lexicon is provided for each instance.
• CUTE80 (CUTE). The CUTE80 dataset [30] has 288 word instances cropped from 80 high-resolution images taken in natural scenes. The dataset contains many examples of curved text.
• Total-Text. Total-Text [31] has 300 test images. The word instances are arbitrarily shaped text, including flipped text. 2,204 word instances are obtained after filtering out words with non-alphanumeric characters.

The benchmarks consist of regular and irregular text. There are 4,662 regular text instances from the SVT, IIIT5K, and IC13 datasets, and 5,214 irregular text instances from the IC15, SVT-P, CUTE, and Total-Text datasets. The total number of text instances is 9,876.
B. Implementation Details

In this paper, we utilize Synth90k [32] and SynthText [33] as training data and evaluate on the standard benchmarks. The Synth90k dataset (denoted as SK) contains approximately 8.9 million synthetic word images, and the SynthText dataset (denoted as ST) has 6.9 million training instances, including 1.4 million non-alphanumeric instances. For SynthText, 5.5 million word instances (denoted as ST a) are obtained by filtering out words with non-alphanumeric characters, while the 1.4 million non-alphanumeric word instances are denoted as ST e. The proposed model is trained using only synthetic data, without fine-tuning. The model recognizes only alphanumeric characters: 26 letters, 10 digits, a symbol for non-alphanumeric characters, and a symbol standing for 'EOS'. The model is trained from scratch and optimized by the Adam optimizer with a learning rate of 5e-4. Training stops after 10 epochs. All input images are resized to 32 × 100. The experiments are conducted with two NVIDIA Tesla K40 GPUs, and the batch size is 1024.
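A hedged sketch of this training configuration in PyTorch is shown below; IBNSTRModel and synthetic_dataset are placeholders for the paper's model and the SK/ST data, and the loss interface is assumed.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Inputs resized to 32x100, Adam with lr 5e-4, batch size 1024, 10 epochs,
# as described above. Model and dataset objects below are placeholders.
preprocess = transforms.Compose([
    transforms.Resize((32, 100)),
    transforms.ToTensor(),
])

# model = IBNSTRModel(num_classes=38)                       # hypothetical model class
# loader = DataLoader(synthetic_dataset, batch_size=1024,   # SK + ST word images
#                     shuffle=True, num_workers=8)
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# for epoch in range(10):
#     for images, targets in loader:
#         loss = model(images, targets)    # assumed to return the training loss
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```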
TABLE III
THE RESULTS OF DATA AUGMENTATION.

Method               | Regular | Irregular | Total
Base(BO+37)          | 90.20   | 72.61     | 80.91
Base-stn(BO+37)      | 90.78   | 75.76     | 82.85
Base(TO+38)          | 92.32   | 74.41     | 82.87
Base-stn(TO+38)      | 93.07   | 76.89     | 84.53
Data-base(BO+37)     | 90.35   | 72.75     | 81.06
Data-base-stn(BO+37) | 91.30   | 75.93     | 83.18
Data-base(TO+37)     | 92.94   | 75.07     | 83.51
Data-base-stn(TO+37) | 93.35   | 77.94     | 85.22
Data-base(TO+38)     | 92.53   | 75.49     | 83.54
Data-base-stn(TO+38) | 93.22   | 78.02     | 85.20

Improvement          | Regular | Irregular | Total
S-shape(BO+37)       | +0.15   | +0.13     | +0.15
S-shape-stn(BO+37)   | +0.52   | +0.17     | +0.33
S-shape(TO+38)       | +0.21   | +1.07     | +0.67
S-shape-stn(TO+38)   | +0.15   | +1.13     | +0.67
Data(TO37-BO37)      | +2.59   | +2.32     | +2.45
Data-stn(TO37-BO37)  | +2.06   | +2.01     | +2.04
Char(TO38-TO37)      | -0.41   | +0.42     | +0.03
Char-stn(TO38-TO37)  | -0.13   | +0.08     | -0.02

C. Ablation Study

1) Data Augmentation: Here we examine the effects of the different training datasets, the S-shape distortion, and the output set (with or without a symbol for non-alphanumeric characters). As shown in Table III, BO means using the SK + ST a datasets, while TO means using SK + ST (SK + ST a + ST e). 37 indicates an output set without the non-alphanumeric symbol, while 38 indicates an output set including a symbol for non-alphanumeric characters. The Base-* models are trained on images without S-shape distortion, while the inputs of the Data-base-* models are S-shape distorted. All models with the *-stn suffix are trained with the rectification network.

The top half of Table III shows the results for regular and irregular text recognition, and the bottom half shows the corresponding performance improvements. The S-shape distortion and the ST e dataset clearly promote performance. Outputting the non-alphanumeric symbol has a relatively small impact on overall performance: the 38-class output slightly hurts recognition on the regular datasets but helps on the irregular datasets. According to the above analysis, we take the Base(TO+38) model as the base model in the following.

2) IBN Module: We discuss the effects of the different versions of the IBN module and of the number of IBN layers on the text recognizer. We utilize ResNet45 [21] as the backbone, which consists of 5 residual modules with batch normalization. Following [9], the batch normalization layers in the shallow layers are replaced by IBN.

TABLE IV
THE RESULTS OF DIFFERENT IBN MODULES.

Method             | Regular       | Irregular     | Total
Base               | 92.32         | 74.41         | 82.87
Base-stn           | 93.07         | 76.89         | 84.53
Base-ibn-a         | 92.40 (+0.08) | 74.90 (+0.48) | 83.16 (+0.29)
Base-ibn-b         | 92.15 (−0.17) | 73.80 (−0.61) | 82.84 (−0.41)
Basestn-ibn-a      | 92.96 (−0.11) | 77.67 (+0.78) | 84.89 (+0.36)
Basestn-ibn-b      | 92.60 (−0.47) | 76.37 (−0.52) | 84.03 (−0.50)
DataBase           | 92.53         | 75.49         | 83.54
DataBase-stn       | 93.22         | 78.02         | 85.20
DataBase-ibn-a     | 92.65 (+0.11) | 76.39 (+0.90) | 84.06 (+0.52)
DataBase-ibn-b     | 92.60 (+0.07) | 75.72 (+0.23) | 83.69 (+0.15)
DataBasestn-ibn-a  | 93.16 (−0.06) | 77.50 (−0.52) | 84.89 (−0.31)
DataBasestn-ibn-b  | 93.48 (+0.25) | 77.87 (−0.16) | 85.24 (+0.04)

TABLE V
THE RESULTS OF DIFFERENT NUMBERS OF IBN LAYERS.

Method     | Regular       | Irregular     | Total
BN         | 92.53         | 75.49         | 83.54
IBN a, 2   | 92.66 (+0.13) | 76.04 (+0.55) | 83.89 (+0.35)
IBN a, 1-2 | 93.03 (+0.50) | 75.87 (+0.38) | 83.97 (+0.43)
IBN a, 2-3 | 92.90 (+0.37) | 75.85 (+0.36) | 83.90 (+0.36)
IBN a, 2-4 | 92.92 (+0.39) | 76.97 (+1.48) | 84.50 (+0.96)
IBN a, 1-4 | 92.65 (+0.12) | 76.39 (+0.90) | 84.06 (+0.52)

When we compare the effects of the two IBN modules (the IBN a module and the IBN b module), the batch normalization layers in the first 4 residual blocks are replaced by IBN modules. As illustrated in Table IV, we use the models with only batch normalization as the baselines (denoted by Base, Base-stn, DataBase, and DataBase-stn). All models are trained on the SK and ST datasets. Without S-shape distortion, the IBN a module consistently improves performance, while the IBN b module degrades it. With S-shape distortion, the IBN a module improves the DataBase-ibn-a model but makes the DataBasestn-ibn-a model slightly worse. As for the IBN b module, it helps the DataBase*-ibn-b models improve the overall performance.

In addition, we also compare the impact of the number of IBN layers, with results reported in Table V.
TABLE VI
COMPARISON WITH OTHER TEXT RECOGNITION METHODS. * MEANS USING 1,811 IMAGES.

Method           | Data  | IC13 None | SVT None | SVT 50 | IIIT5K None | IIIT5K 50 | IIIT5K 1k | IC15 None | SVT-P None | SVT-P 50 | CUTE None | Total-Text None | Total
CRNN [4]         | SK    | 89.6 | 82.7 | 97.5 | 81.2 | 97.8 | 95.0 | -     | -    | -    | -    | -    | -
GCRNN [34]       | SK    | -    | 81.5 | 96.3 | 80.8 | 98.0 | 95.6 | -     | -    | -    | -    | -    | -
R2AM [15]        | SK    | 90.0 | 80.7 | 96.3 | 78.4 | 96.8 | 94.4 | -     | -    | -    | -    | -    | -
Liao et al. [35] | ST    | 91.4 | 82.1 | 98.5 | 92.0 | 99.8 | 98.9 | -     | -    | -    | 78.1 | -    | -
Aster [6]        | ST+SK | 91.8 | 93.6 | 99.2 | 93.4 | 99.6 | 98.8 | 76.1* | 78.5 | -    | 79.5 | -    | -
2D-CTC [36]      | ST+SK | 93.9 | 90.6 | 97.2 | 94.7 | 99.8 | 98.9 | 75.2* | 79.2 | -    | 81.3 | 63.0 | -
RCN [16]         | ST+SK | 93.2 | 88.6 | 97.7 | 94.0 | 99.6 | 98.9 | 77.1  | 80.6 | 95.0 | 88.5 | -    | -
MORAN [7]        | ST+SK | 92.4 | 88.3 | 96.6 | 91.2 | 97.9 | 96.2 | 68.8  | 76.1 | 94.3 | 77.4 | -    | -
Lyu et al. [17]  | ST+SK | 92.7 | 90.1 | 97.2 | 94.0 | 99.8 | 99.1 | 76.3  | 82.3 | -    | 86.8 | -    | -
IBN-STR(base)    | ST+SK | 93.8 | 90.0 | 97.3 | 93.3 | 99.5 | 98.7 | 77.8  | 83.6 | 95.0 | 84.4 | 73.3 | 84.5
IBN-STR(stn)     | ST+SK | 94.7 | 91.0 | 98.0 | 94.0 | 99.8 | 98.6 | 79.1  | 85.1 | 94.6 | 85.4 | 74.8 | 85.6
[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep structured output learning for unconstrained text recognition,” arXiv preprint arXiv:1412.5903, 2014.
[15] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention
modeling for ocr in the wild,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
[16] Y. Gao, Y. Chen, J. Wang, Z. Lei, X.-Y. Zhang, and H. Lu, “Recur-
rent calibration network for irregular text recognition,” arXiv preprint
arXiv:1812.07145, 2018.
[17] P. Lyu, Z. Yang, X. Leng, X. Wu, R. Li, and X. Shen, “2d attentional
irregular scene text recognizer,” arXiv preprint arXiv:1906.05708, 2019.
[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[19] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks:
Maximizing quality and diversity in feed-forward stylization and texture
synthesis,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2017, pp. 6924–6932.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[21] A. Mishra, K. Alahari, and C. Jawahar, “Enhancing energy minimization
framework for scene text recognition with top-down cues,” Computer
Vision and Image Understanding, vol. 145, pp. 30–42, 2016.
[22] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and
J. Schmidhuber, “A novel connectionist system for unconstrained hand-
writing recognition,” IEEE transactions on pattern analysis and machine
intelligence, vol. 31, no. 5, pp. 855–868, 2008.
[23] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[24] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer
networks,” in Advances in neural information processing systems, 2015,
pp. 2017–2025.
[25] F. L. Bookstein, “Principal warps: Thin-plate splines and the decom-
position of deformations,” IEEE Transactions on pattern analysis and
machine intelligence, vol. 11, no. 6, pp. 567–585, 1989.
[26] A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using
higher order language priors,” in BMVC, 2012.
[27] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R.
Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras,
“Icdar 2013 robust reading competition,” in 2013 12th International
Conference on Document Analysis and Recognition. IEEE, 2013, pp.
1484–1493.
[28] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov,
M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al.,
“Icdar 2015 competition on robust reading,” in 2015 13th International
Conference on Document Analysis and Recognition (ICDAR). IEEE,
2015, pp. 1156–1160.
[29] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan, “Recognizing
text with perspective distortion in natural scenes,” in Proceedings of the
IEEE International Conference on Computer Vision, 2013, pp. 569–576.
[30] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust
arbitrary text detection system for natural scene images,” Expert Systems
with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
[31] C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for
scene text detection and recognition,” in 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR), vol. 1.
IEEE, 2017, pp. 935–942.
[32] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic
data and artificial neural networks for natural scene text recognition,”
arXiv preprint arXiv:1406.2227, 2014.
[33] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text
localisation in natural images,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp. 2315–2324.
[34] J. Wang and X. Hu, “Gated recurrent convolution neural network for
ocr,” in Advances in Neural Information Processing Systems, 2017, pp.
335–344.
[35] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, “Scene text recognition from two-dimensional perspective,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8714–8721.
[36] Z. Wan, F. Xie, Y. Liu, X. Bai, and C. Yao, “2d-ctc for scene text recognition,” arXiv preprint arXiv:1907.09705, 2019.