TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
Shangbang Long¹,², Jiaqiang Ruan¹,², Wenjie Zhang¹,², Xin He², Wenhao Wu², Cong Yao²
¹ Peking University, ² Megvii (Face++) Technology Inc.
{longlongsb, jiaqiang.ruan, zhang wen jie}@pku.edu.cn, {hexin, wwh}@megvii.com, [email protected]
1 Introduction
In recent years, the community has witnessed a surge of research interest and
effort regarding the extraction of textual information from natural scenes, a.k.a.
scene text detection and recognition. The driving factors stem from both ap-
plication prospect and research value. On the one hand, scene text detection
and recognition have been playing increasingly important roles in a wide
range of practical systems, such as scene understanding, product search, and
autonomous driving. On the other hand, the unique traits of scene text, for in-
stance, significant variations in color, scale, orientation, aspect ratio and pattern,
make it markedly different from general objects. Therefore, particular challenges
are posed and special investigations are required.
2 Related Work
In the past few years, the most prominent trend in the area of scene text de-
tection is the transfer from conventional methods [14,15] to deep learning based
methods [16,17,4,3,2]. In this section, we look back on relevant previous works.
For comprehensive surveys, please refer to [18,19]. Before the era of deep learning,
SWT [14] and MSER [15] were two representative algorithms that influenced
a variety of subsequent methods [20,21]. Modern methods are mostly based on
deep neural networks, which can be coarsely classified into two categories: re-
gression based and segmentation based.
Regression based text detection methods [4] mainly draw inspirations from
general object detection frameworks. TextBoxes [4] adopted SSD [22] and added
“long” default boxes and filters to handle the significant variation of aspect ratios
of text instances. Based on Faster-RCNN [23], Ma et al. [24] devised Rotation
Region Proposal Networks (RRPN) to detect arbitrary-oriented text in natural
images. EAST [3] and Deep Regression [25] both directly produce the rotated
boxes or quadrangles of text, in a per-pixel manner.
Segmentation based text detection methods cast text detection as a semantic
segmentation problem and FCN [26] is often taken as the reference framework.
Yao et al. [1] modified FCN to produce multiple heatmaps corresponding to various
properties of text, such as text region and orientation. Zhang et al. [27] first used
FCN to extract text blocks and then hunted character candidates from these blocks
with MSER [15]. To better separate adjacent text instances, the method of [6]
distinguishes each pixel into three categories: non-text, text border and text.
These methods mainly vary in the way they separate text pixels into different
instances.
The methods reviewed above have achieved excellent performances on various
benchmarks in this field. However, most works, except for [1,7,12], have not
paid special attention to curved text. In contrast, the representation proposed
in this paper is suitable for text of arbitrary shapes (horizontal, multi-oriented
and curved). It is primarily inspired by [1,7] and the geometric attributes of text
are also estimated via the multiple-channel outputs of an FCN-based model.
Unlike [1], our algorithm does not need character level annotations. In addition, it
also shares a similar idea with SegLink [2], by successively decomposing text into
local components and then composing them back into text instances. Analogous
to [28], we also detect linear symmetry axes of text instances for text localization.
Another advantage of the proposed method lies in its ability to reconstruct
the precise shape and regional course of text instances, which can largely facilitate
the subsequent text recognition process, because all detected text instances could
be conveniently transformed into a canonical form with minimal distortion and
background (see the example in Fig.9).
3 Methodology
In this section, we first introduce the new representation for text of arbitrary
shapes. Then we describe our method and training details.
3.1 Representation
Fig. 2. Illustration of the proposed TextSnake representation. Text region (in yellow) is
represented as a series of ordered disks (in blue), each of which is located at the center
line (in green, a.k.a. symmetric axis or skeleton) and associated with a radius r and an
orientation θ. In contrast to conventional representations (e.g., axis-aligned rectangles,
rotated rectangles and quadrangles), TextSnake is more flexible and general, since it
can precisely describe text of different forms, regardless of shapes and lengths.
A text instance t is modeled as a sequence of ordered, overlapping disks S(t), each
associated with a center c, a radius r and an orientation θ, where r is half of the
local width of t and θ is the tangential direction of the center line around the
center c. In this sense, the text region t can
be easily reconstructed by computing the union of the disks in S(t).
Note that the disks do not correspond to the characters belonging to t. How-
ever, the geometric attributes in S(t) can be used to rectify text instances of
irregular shapes and transform them into rectangular, straight image regions,
which are more friendly to text recognizers.
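To make the reconstruction step concrete, the following sketch (in Python with NumPy and OpenCV, not the authors' implementation) paints the union of a few hypothetical disks into a binary mask; the list of (center_x, center_y, radius) triples is an assumed input format.

```python
import numpy as np
import cv2

def reconstruct_region(disks, image_shape):
    """Paint the union of the given disks into a binary mask of size image_shape (h, w)."""
    mask = np.zeros(image_shape, dtype=np.uint8)
    for cx, cy, r in disks:
        # Each disk is centered on the text center line; r is half of the local width.
        cv2.circle(mask, (int(round(cx)), int(round(cy))), int(round(r)), 1, thickness=-1)
    return mask

# Toy usage: three overlapping disks along a short, slightly curved center line.
toy_disks = [(30, 40, 12), (45, 44, 13), (60, 50, 14)]
region = reconstruct_region(toy_disks, (100, 120))
print(region.sum())  # number of pixels covered by the reconstructed text region
```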
3.2 Pipeline
[Figure: overall pipeline. The network predictions are grouped into text instances with a disjoint-set structure and then reconstructed by the striding algorithm.]
[Figure: network architecture. A stem network with five convolutional stages f1–f5 (32, 64, 128, 256 and 512 channels, each downsampling by 2), followed by merging units h5–h1 and a final predictor producing the map P.]
h_1 = f_5    (1)
h_i = conv_3×3(conv_1×1([f_{6−i}; UpSampling_×2(h_{i−1})])),  for i = 2, 3, 4, 5    (2)
where fi denotes the feature maps of the i-th stage in the stem network and
hi is the feature maps of the corresponding merging units. In our experiments,
upsampling is implemented as deconvolutional layer as proposed in [33].
After the merging, we obtain a feature map whose size is 1/2 of the input
image. We apply an additional upsampling layer and 2 convolutional layers to
produce dense predictions:
h_final = UpSampling_×2(h_5)    (3)
P = conv_1×1(conv_3×3(h_final))    (4)
where P ∈ R^{h×w×7}, with 4 channels for the logits of TR/TCL, and the last 3
respectively for r, cosθ and sinθ of the text instance. As a result of the additional
upsampling layer, P has the same size as the input image. The final predictions
are obtained by taking softmax for TR/TCL and regularizing cosθ and sinθ so
that their squared sum equals 1.
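As an illustration, a minimal NumPy sketch of this post-processing is given below; the channel ordering inside P and the helper names are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def postprocess(P):
    """P: (h, w, 7) map; assumed layout: channels 0-1 = TR logits, 2-3 = TCL logits,
    4 = r, 5 = cos(theta), 6 = sin(theta)."""
    def softmax(logits):
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    tr_prob = softmax(P[..., 0:2])[..., 1]    # probability of text region
    tcl_prob = softmax(P[..., 2:4])[..., 1]   # probability of text center line
    r = P[..., 4]
    # Regularize cos(theta) and sin(theta) so that their squared sum equals 1.
    norm = np.sqrt(P[..., 5] ** 2 + P[..., 6] ** 2) + 1e-8
    cos_t, sin_t = P[..., 5] / norm, P[..., 6] / norm
    return tr_prob, tcl_prob, r, cos_t, sin_t

tr, tcl, r, cos_t, sin_t = postprocess(np.random.randn(8, 8, 7))
assert np.allclose(cos_t ** 2 + sin_t ** 2, 1.0)
```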
3.4 Inference
After the forward pass, the network produces the TCL, TR and geometry maps.
For TCL and TR, we apply thresholding with values Ttcl and Ttr respectively.
Then, the intersection of TR and TCL gives the final prediction of TCL. Using
disjoint-set, we can efficiently separate TCL pixels into different text instances.
Finally, a striding algorithm is designed to extract an ordered point list that
indicates the shape and course of the text instance, and also reconstruct the
text instance areas. Two simple heuristics are applied to filter out false positive
text instances: 1) The number of TCL pixels should be at least 0.2 times their
average radius; 2) At least half of pixels in the reconstructed text area should
be classified as TR.
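A hedged sketch of this thresholding and filtering step is shown below; it substitutes OpenCV connected-component labeling for the disjoint-set grouping described above, and the threshold values T_tr and T_tcl are illustrative assumptions.

```python
import numpy as np
import cv2

def candidate_instances(tr_prob, tcl_prob, T_tr=0.4, T_tcl=0.6):
    """Threshold TR/TCL, intersect them, and split TCL pixels into candidate instances."""
    tcl_mask = ((tcl_prob > T_tcl) & (tr_prob > T_tr)).astype(np.uint8)
    num, labels = cv2.connectedComponents(tcl_mask)
    return [(labels == k) for k in range(1, num)]  # one boolean mask per instance

def keep_instance(tcl_pixels, radii, reconstructed_mask, tr_mask):
    """Heuristic 1: enough TCL pixels relative to their average radius.
    Heuristic 2: at least half of the reconstructed area is classified as TR."""
    long_enough = tcl_pixels.sum() >= 0.2 * radii[tcl_pixels].mean()
    enough_overlap = reconstructed_mask.any() and tr_mask[reconstructed_mask].mean() >= 0.5
    return bool(long_enough and enough_overlap)
```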
[Fig. 5. The striding algorithm: starting from the segmented TCL prediction, the search expands towards both ends of the instance via the actions Act(a), Act(b) and Act(c) to produce the output.]
The procedure for the striding algorithm is shown in Fig. 5. It features 3 main
actions, denoted as Act(a), Act(b), and Act(c), as illustrated in Fig. 6. Firstly,
we randomly select a pixel as the starting point, and centralize it. Then, the
search process forks into two opposite directions, striding and centralizing until
it reaches the ends. This process generates two ordered point lists in the two
opposite directions, which can be combined to produce the final central axis list
that follows the course of the text and describes its shape precisely. Details of
the 3 actions are shown below.
Act(a) Centralizing As shown in Fig.6, given a point on the TCL, we can draw
the tangent line and the normal line, respectively denoted as dotted line and solid
line. This step can be done with ease using the geometry maps. The midpoint of
the intersection of the normal line and the TCL area gives the centralized point.
Act(b) Striding The algorithm takes a stride to the next point to search. With
the geometry maps, the displacement for each stride is computed and represented
as (½r × cosθ, ½r × sinθ) and (−½r × cosθ, −½r × sinθ), respectively for the two
directions. If the next step is outside the TCL area, we decrement the stride
gradually until it is inside, or until it hits the ends.
Act(c) Sliding The algorithm iterates through the central axis and draws circles
along it. Radii of the circles are obtained from the r map. The area covered by
the circles indicates the predicted text instance.
In conclusion, taking advantage of the geometry maps and the TCL that
precisely describes the course of the text instance, we can go beyond detecting
text and also predict its shape and course. Besides, the striding algorithm
spares our method from traversing all the related pixels.
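The following simplified Python sketch illustrates the two-direction striding loop; it omits the centralizing of Act(a) and the gradual stride decrement, and the way the geometry maps are sampled at rounded coordinates is an assumption made for the example.

```python
import numpy as np

def stride_from(start, tcl_mask, r_map, cos_map, sin_map, sign, max_steps=200):
    """Walk along the center line in one direction, collecting ordered points."""
    points, x, y = [], float(start[0]), float(start[1])
    for _ in range(max_steps):
        xi, yi = int(round(x)), int(round(y))
        inside = 0 <= yi < tcl_mask.shape[0] and 0 <= xi < tcl_mask.shape[1]
        if not inside or not tcl_mask[yi, xi]:
            break  # reached the end of the TCL region
        points.append((x, y))
        # Act(b): stride by (1/2 r cos(theta), 1/2 r sin(theta)) in the chosen direction.
        step = max(0.5 * r_map[yi, xi], 1.0)
        x += sign * step * cos_map[yi, xi]
        y += sign * step * sin_map[yi, xi]
    return points

def central_axis(start, tcl_mask, r_map, cos_map, sin_map):
    """Fork from the starting point into two opposite directions and merge the results."""
    forward = stride_from(start, tcl_mask, r_map, cos_map, sin_map, sign=+1)
    backward = stride_from(start, tcl_mask, r_map, cos_map, sin_map, sign=-1)
    return backward[::-1] + forward[1:]  # ordered list following the course of the text
```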
Extracting Text Center Line For triangles and quadrangles, it’s easy to
directly calculate the TCL with algebraic methods, since in this case, TCL is a
straight line. For polygons of more than 4 sides, it’s not easy to derive a general
algebraic method.
Instead, we propose a method based on the assumption that text
instances are snake-shaped, i.e. they do not fork into multiple branches. A
snake-shaped text instance has two end edges that respectively form its head and
tail, and the two edges adjacent to the head or tail run parallel but in opposite
directions.
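One possible way to realize this assumption, sketched below, is to score every polygon edge by how anti-parallel its two neighboring edges are and take the two best-scoring edges as head and tail; this scoring function is an illustrative choice, not necessarily the exact criterion used for label generation.

```python
import numpy as np

def head_tail_edges(poly):
    """poly: (n, 2) array of polygon vertices in order. Returns the indices of the two
    edges whose neighboring edges are most nearly parallel but opposite in direction."""
    poly = np.asarray(poly, dtype=np.float64)
    n = len(poly)
    edges = [poly[(i + 1) % n] - poly[i] for i in range(n)]
    scores = []
    for i in range(n):
        prev_e, next_e = edges[(i - 1) % n], edges[(i + 1) % n]
        cos = np.dot(prev_e, next_e) / (np.linalg.norm(prev_e) * np.linalg.norm(next_e) + 1e-8)
        scores.append(cos)  # close to -1 when the neighbors run anti-parallel
    order = np.argsort(scores)
    return int(order[0]), int(order[1])

# Toy usage on a hexagonal "snake": the two end edges (indices 2 and 5) are selected.
print(head_tail_edges([[0, 0], [10, 2], [20, 0], [20, 4], [10, 6], [0, 4]]))
```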
Fig. 7. Label Generation. (a) Determining text head and tail; (b) Extracting text
center line and calculating geometries; (c) Expanded text center line.
Calculating r and θ For each point on the TCL: (1) r is computed as the distance
to the corresponding point on sidelines; (2) θ is computed by fitting a straight line
on the TCL points in the neighborhood. For non-TCL pixels, their corresponding
geometry attributes are set to 0 for convenience.
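As a sketch of how such labels might be computed, the snippet below estimates θ from a local straight-line fit (here via PCA) over neighboring TCL points and r as the distance to a matching sideline point; the neighborhood size and the PCA-based fit are assumptions for illustration.

```python
import numpy as np

def local_theta(tcl_points, idx, k=2):
    """Fit a straight line to the TCL points within +/-k of index idx and return its angle."""
    pts = np.asarray(tcl_points, dtype=np.float64)
    nbr = pts[max(0, idx - k): idx + k + 1]
    centered = nbr - nbr.mean(axis=0)
    # The first right-singular vector gives the direction of the fitted line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    return float(np.arctan2(dy, dx))

def local_radius(tcl_point, sideline_point):
    """r is the distance from the TCL point to its corresponding sideline point."""
    return float(np.linalg.norm(np.asarray(sideline_point, float) - np.asarray(tcl_point, float)))

# Toy usage on a gently curved center line.
line = [(x, 0.05 * x ** 2) for x in range(0, 50, 5)]
print(local_theta(line, idx=5), local_radius(line[5], (25, 12)))
```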
4 Experiments
In this section, we evaluate the proposed algorithm on standard benchmarks
for scene text detection and compare it with previous methods. Analyses and
discussions regarding our algorithm are also given.
4.1 Datasets
The datasets used for the experiments in this paper are briefly introduced below:
SynthText [36] is a large-scale dataset that contains about 800K synthetic
images. These images are created by blending natural images with text rendered
with random fonts, sizes, colors, and orientations, thus these images are quite
realistic. We use this dataset to pre-train our model.
TotalText [12] is a newly-released benchmark for text detection. Besides hor-
izontal and multi-oriented text instances, the dataset specially features curved
text, which rarely appears in other benchmark datasets, but is actually quite
common in real environments. The dataset is split into training and testing sets
with 1255 and 300 images, respectively.
CTW1500 [13] is another dataset mainly consisting of curved text. It con-
sists of 1000 training images and 500 test images. Text instances are annotated
with polygons with 14 vertices.
ICDAR 2015 is proposed as the Challenge 4 of the 2015 Robust Reading
Competition [37] for incidental scene text detection. Scene text images in this
dataset are taken by Google Glass without taking care of positioning, image
quality, and viewpoint. This dataset features small, blurred, and multi-oriented text
instances. There are 1000 images for training and 500 images for testing. The
text instances in this dataset are labeled as word-level quadrangles.
MSRA-TD500 [38] is a dataset with multi-lingual, arbitrary-oriented and
long text lines. It includes 300 training images and 200 test images with
text-line-level annotations. Following previous works [3,10], we also include the images
from HUST-TR400 [39] as training data when fine-tuning on this dataset, since
its training set is rather small.
For experiments on ICDAR 2015 and MSRA-TD500, we fit a minimum
bounding rectangle based on the output text area of our method.
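A small sketch of this conversion is given below, using OpenCV's minAreaRect as one convenient way to obtain the minimum bounding rectangle of a detected text area; the binary-mask input format is an assumption made for the example.

```python
import numpy as np
import cv2

def to_min_area_quad(text_mask):
    """text_mask: binary (h, w) array for one detected text instance."""
    ys, xs = np.nonzero(text_mask)
    points = np.stack([xs, ys], axis=1).astype(np.float32)
    rect = cv2.minAreaRect(points)   # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)       # 4x2 array of the fitted rectangle's corners

# Toy usage: a tilted strip of "text" pixels.
mask = np.zeros((60, 80), dtype=np.uint8)
cv2.line(mask, (10, 40), (70, 20), 1, thickness=6)
print(to_min_area_quad(mask))
```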
Fig. 8. Qualitative results by the proposed method. Top: Detected text contours (in
yellow) and ground truth annotations (in green). Bottom: Combined score maps for
TR (in red) and TCL (in yellow). From left to right: images from ICDAR
2015, TotalText, CTW1500 and MSRA-TD500. Best viewed in color.
We train the model with a batch size of 32 on 2 GPUs in parallel and evaluate it on 1
GPU with the batch size set to 1. Hyper-parameters are tuned by grid search on the
training set.
As shown in Tab. 1, the proposed method achieves 82.7%, 74.5%, and 78.4%
in precision, recall and F-measure on Total-Text, significantly outperforming
Table 2. Quantitative results of different methods evaluated on CTW1500. Results
other than ours are obtained from [13].
previous methods. Note that the F-measure of our method is more than double
that of the baseline provided in the original Total-Text paper [12].
On CTW1500, the proposed method achieves 67.9%, 85.3%, and 75.6% in
precision, recall and F-measure, respectively. Compared with CTD+TLOC,
which was proposed together with the CTW1500 dataset in [13], the F-measure of
our algorithm is 2.2% higher (75.6% vs. 73.4%).
The superior performances of our method on Total-Text and CTW1500 verify
that the proposed representation can handle curved text in natural images well.
Table 3. Quantitative results of different methods on ICDAR 2015. ∗ stands for multi-
scale, † indicates that the base net of the model is not VGG16.
Table 4. Quantitative results of different methods on MSRA-TD500. † indicates mod-
els whose base nets are not VGG16.
Fig. 9. Text instances transformed to canonical form using the predicted geometries.
As shown in Tab. 5, our method still performs well on curved text and signifi-
cantly outperforms the three strong competitors SegLink, EAST and PixelLink,
without fine-tuning on curved text. We attribute this excellent generalization
ability to the proposed flexible representation. Instead of taking text as a whole,
the representation treats text as a collection of local elements and integrates
them together to make decisions. Local attributes are preserved when they are
assembled into a whole, and they remain independent of each other. Therefore, the final predic-
tions of our method can retain most information about the shape and course of the
text. We believe that this is the main reason for the capacity of the proposed text
detection algorithm to detect text instances of various shapes.
References
1. Yao, C., Bai, X., Sang, N., Zhou, X., Zhou, S., Cao, Z.: Scene text detection via
holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002 (2016)
2. Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking
segments. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). (July 2017)
3. Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: EAST: An
efficient and accurate scene text detector. In: The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). (July 2017)
4. Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: Textboxes: A fast text detector with
a single deep neural network. In: AAAI. (2017) 4161–4167
5. Huang, L., Yang, Y., Deng, Y., Yu, Y.: Densebox: Unifying landmark localization
with end to end object detection. arXiv preprint arXiv:1509.04874 (2015)
6. Wu, Y., Natarajan, P.: Self-organized text detection with minimal post-processing
via border learning. In: Proceedings of the IEEE Conference on CVPR. (2017)
5000–5009
7. He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A.G., Kifer, D., Giles, C.L.: Multi-
scale fcn with cascaded instance aware segmentation for arbitrary oriented word
spotting in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2017
IEEE Conference on, IEEE (2017) 474–483
8. Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., Ding, E.: Wordsup: Exploiting
word annotations for character based text detection. In: The IEEE International
Conference on Computer Vision (ICCV). (Oct 2017)
9. Tian, S., Lu, S., Li, C.: Wetext: Scene text detection under weak supervision.
arXiv preprint arXiv:1710.04826 (2017)
10. Lyu, P., Yao, C., Wu, W., Yan, S., Bai, X.: Multi-oriented scene text detection
via corner localization and region segmentation. In: Computer Vision and Pattern
Recognition (CVPR), 2018 IEEE Conference on. (2018)
11. Sheng, Z., Yuliang, L., Lianwen, J., Canjie, L.: Feature enhancement network: A
refined scene text detector. In: Proceedings of AAAI, 2018. (2018)
12. Kheng Chng, C., Chan, C.S.: Total-text: A comprehensive dataset for scene text
detection and recognition. arXiv preprint arXiv:1710.10400 (2017)
13. Yuliang, L., Lianwen, J., Shuaitao, Z., Sheng, Z.: Detecting curve text in the wild:
New dataset and new solution. arXiv preprint arXiv:1712.02170 (2017)
14. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke
width transform. In: Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on, IEEE (2010) 2963–2970
15. Neumann, L., Matas, J.: A method for text localization and recognition in real-
world images. In: Asian Conference on Computer Vision, Springer (2010) 770–783
16. Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In:
European conference on computer vision, Springer (2014) 512–528
17. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild
with convolutional neural networks. International Journal of Computer Vision
116(1) (2016) 1–20
18. Ye, Q., Doermann, D.: Text detection and recognition in imagery: A survey. IEEE
transactions on pattern analysis and machine intelligence 37(7) (2015) 1480–1500
19. Zhu, Y., Yao, C., Bai, X.: Scene text detection and recognition: Recent advances
and future trends. Frontiers of Computer Science 10(1) (2016) 19–36
20. Yin, X.C., Yin, X., Huang, K., Hao, H.W.: Robust text detection in natural scene
images. IEEE transactions on pattern analysis and machine intelligence 36(5)
(2014) 970–983
21. Huang, W., Qiao, Y., Tang, X.: Robust scene text detection with convolution
neural network induced mser trees. In: European Conference on Computer Vision,
Springer (2014) 497–511
22. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.:
SSD: Single shot multibox detector. In: European conference on computer vision,
Springer (2016) 21–37
23. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Advances in neural information processing
systems. (2015) 91–99
24. Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., Xue, X.:
Arbitrary-oriented scene text detection via rotation proposals. arXiv preprint
arXiv:1703.01086 (2017)
25. He, W., Zhang, X.Y., Yin, F., Liu, C.L.: Deep direct regression for multi-oriented
scene text detection. In: The IEEE International Conference on Computer Vision
(ICCV). (Oct 2017)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. (2015) 3431–3440
27. Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., Bai, X.: Multi-oriented text de-
tection with fully convolutional networks. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. (2016) 4159–4167
28. Zhang, Z., Shen, W., Yao, C., Bai, X.: Symmetry-based text line detection in
natural scenes. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. (2015) 2558–2567
29. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). (July 2017)
30. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomed-
ical Image Segmentation. Springer International Publishing (2015)
31. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
32. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). (2016)
33. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR). (2010) 2528–2535
34. Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors
with online hard example mining. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). (2016) 761–769
35. Girshick, R.: Fast r-cnn. In: The IEEE International Conference on Computer
Vision (ICCV). (December 2015)
36. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in
natural images. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. (2016) 2315–2324
37. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura,
M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: Icdar 2015
competition on robust reading. In: Document Analysis and Recognition (ICDAR),
2015 13th International Conference on, IEEE (2015) 1156–1160
38. Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations
in natural images. In: Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, IEEE (2012) 1083–1090
39. Yao, C., Bai, X., Liu, W.: A unified framework for multioriented text detection
and recognition. IEEE Transactions on Image Processing 23(11) (2014) 4737–4749
40. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghe-
mawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine
learning. In: OSDI. Volume 16. (2016) 265–283
41. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings
of ICLR. (2015)
42. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmen-
tation. In: The IEEE International Conference on Computer Vision (ICCV). (2015)
1520–1528
43. Liu, Y., Jin, L.: Deep matching prior network: Toward tighter multi-oriented text
detection. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). (2017)
44. Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image
with connectionist text proposal network. In: European Conference on Computer
Vision, Springer (2016) 56–72
45. He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., Li, X.: Single shot text detector with
regional attention. In: The IEEE International Conference on Computer Vision
(ICCV). (Oct 2017)
46. Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: Detecting scene text via instance
segmentation. AAAI (2018)
47. Kang, L., Li, Y., Doermann, D.: Orientation robust text line detection in natural
images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. (2014) 4034–4041
48. Wolf, C., Jolion, J.M.: Object count/area graphs for the evaluation of object de-
tection and segmentation algorithms. International Journal of Document Analysis
and Recognition (IJDAR) 8(4) (2006) 280–296
49. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman,
A.: The pascal visual object classes challenge: A retrospective. International journal
of computer vision 111(1) (2015) 98–136