TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
Shangbang Long¹,², Jiaqiang Ruan¹,², Wenjie Zhang¹,², Xin He², Wenhao Wu², Cong Yao²
¹ Peking University, ² Megvii (Face++) Technology Inc.
{longlongsb, jiaqiang.ruan, zhang wen jie}@pku.edu.cn, {hexin, wwh}@megvii.com, [email protected]
1 Introduction
In recent years, the community has witnessed a surge of research interest and
effort regarding the extraction of textual information from natural scenes, a.k.a.
scene text detection and recognition. The driving factors stem from both ap-
plication prospect and research value. On the one hand, scene text detection
and recognition have been playing increasingly important roles in a wide
range of practical systems, such as scene understanding, product search, and
autonomous driving. On the other hand, the unique traits of scene text, for in-
stance, significant variations in color, scale, orientation, aspect ratio and pattern,
make it markedly different from general objects. Therefore, particular challenges
are posed and special investigations are required.
2 Related Work
In the past few years, the most prominent trend in the area of scene text de-
tection is the transfer from conventional methods [14,15] to deep learning based
methods [16,17,4,3,2]. In this section, we look back on relevant previous works.
For comprehensive surveys, please refer to [18,19]. Before the era of deep learning,
SWT [14] and MSER [15] were two representative algorithms that influenced
a variety of subsequent methods [20,21]. Modern methods are mostly based on
deep neural networks, which can be coarsely classified into two categories: re-
gression based and segmentation based.
Regression based text detection methods [4] mainly draw inspirations from
general object detection frameworks. TextBoxes [4] adopted SSD [22] and added
“long” default boxes and filters to handle the significant variation of aspect ratios
of text instances. Based on Faster-RCNN [23], Ma et al. [24] devised Rotation
Region Proposal Networks (RRPN) to detect arbitrary-oriented text in natural
images. EAST [3] and Deep Regression [25] both directly produce the rotated
boxes or quadrangles of text, in a per-pixel manner.
Segmentation based text detection methods cast text detection as a semantic
segmentation problem and FCN [26] is often taken as the reference framework.
Yao et al. [1] modified FCN to produce multiple heatmaps corresponding to various
properties of text, such as text region and orientation. Zhang et al. [27] first used
FCN to extract text blocks and then hunted character candidates from these blocks
with MSER [15]. To better separate adjacent text instances, the method of [6]
distinguishes each pixel into three categories: non-text, text border and text.
These methods mainly vary in the way they separate text pixels into different
instances.
The methods reviewed above have achieved excellent performances on various
benchmarks in this field. However, most works, except for [1,7,12], have not
paid special attention to curved text. In contrast, the representation proposed
in this paper is suitable for text of arbitrary shapes (horizontal, multi-oriented
and curved). It is primarily inspired by [1,7] and the geometric attributes of text
are also estimated via the multiple-channel outputs of an FCN-based model.
Unlike [1], our algorithm does not need character level annotations. In addition, it
also shares a similar idea with SegLink [2], by successively decomposing text into
local components and then composing them back into text instances. Analogous
to [28], we also detect linear symmetry axes of text instances for text localization.
Another advantage of the proposed method lies in its ability to reconstruct
the precise shape and regional course of text instances, which can largely facilitate
the subsequent text recognition process, because all detected text instances could
be conveniently transformed into a canonical form with minimal distortion and
background (see the example in Fig.9).
3 Methodology
In this section, we first introduce the new representation for text of arbitrary
shapes. Then we describe our method and training details.
3.1 Representation
Fig. 2. Illustration of the proposed TextSnake representation. Text region (in yellow) is
represented as a series of ordered disks (in blue), each of which is located at the center
line (in green, a.k.a. symmetric axis or skeleton) and associated with a radius r and an
orientation θ. In contrast to conventional representations (e.g., axis-aligned rectangles,
rotated rectangles and quadrangles), TextSnake is more flexible and general, since it
can precisely describe text of different forms, regardless of shapes and lengths.
A text instance t is modeled as a sequence of ordered, overlapping disks S(t), each
associated with a center c, a radius r and an orientation θ, where r is half of the
local width of t and θ is the tangential direction of the center line around the
center c. In this sense, the text region t can
be easily reconstructed by computing the union of the disks in S(t).
Note that the disks do not correspond to the characters belonging to t. How-
ever, the geometric attributes in S(t) can be used to rectify text instances of
irregular shapes and transform them into rectangular, straight image regions,
which are more friendly to text recognizers.
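To make the reconstruction step concrete, the following sketch (in Python with NumPy and OpenCV, not the authors' implementation) paints the union of a few hypothetical disks into a binary mask; the list of (center_x, center_y, radius) triples is an assumed input format.

```python
import numpy as np
import cv2

def reconstruct_region(disks, image_shape):
    """Paint the union of the given disks into a binary mask of size image_shape (h, w)."""
    mask = np.zeros(image_shape, dtype=np.uint8)
    for cx, cy, r in disks:
        # Each disk is centered on the text center line; r is half of the local width.
        cv2.circle(mask, (int(round(cx)), int(round(cy))), int(round(r)), 1, thickness=-1)
    return mask

# Toy usage: three overlapping disks along a short, slightly curved center line.
toy_disks = [(30, 40, 12), (45, 44, 13), (60, 50, 14)]
region = reconstruct_region(toy_disks, (100, 120))
print(region.sum())  # number of pixels covered by the reconstructed text region
```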
3.2 Pipeline
[Figure: overall pipeline. The network predictions are grouped into text instances with a disjoint-set structure and then reconstructed by the striding algorithm.]
[Figure: network architecture. A stem network with five convolutional stages f1–f5 (32, 64, 128, 256 and 512 channels, each downsampling by 2), followed by merging units h5–h1 and a final predictor producing the map P.]
h_1 = f_5    (1)
h_i = conv_3×3(conv_1×1([f_{6−i}; UpSampling_×2(h_{i−1})])),  for i = 2, 3, 4, 5    (2)
where fi denotes the feature maps of the i-th stage in the stem network and
hi is the feature maps of the corresponding merging units. In our experiments,
upsampling is implemented as deconvolutional layer as proposed in [33].
After the merging, we obtain a feature map whose size is 1/2 of the input
image. We apply an additional upsampling layer and 2 convolutional layers to
produce dense predictions:
h_final = UpSampling_×2(h_5)    (3)
P = conv_1×1(conv_3×3(h_final))    (4)
where P ∈ R^{h×w×7}, with 4 channels for the logits of TR/TCL, and the last 3
respectively for r, cosθ and sinθ of the text instance. As a result of the additional
upsampling layer, P has the same size as the input image. The final predictions
are obtained by taking softmax for TR/TCL and regularizing cosθ and sinθ so
that their squared sum equals 1.
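As an illustration, a minimal NumPy sketch of this post-processing is given below; the channel ordering inside P and the helper names are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def postprocess(P):
    """P: (h, w, 7) map; assumed layout: channels 0-1 = TR logits, 2-3 = TCL logits,
    4 = r, 5 = cos(theta), 6 = sin(theta)."""
    def softmax(logits):
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    tr_prob = softmax(P[..., 0:2])[..., 1]    # probability of text region
    tcl_prob = softmax(P[..., 2:4])[..., 1]   # probability of text center line
    r = P[..., 4]
    # Regularize cos(theta) and sin(theta) so that their squared sum equals 1.
    norm = np.sqrt(P[..., 5] ** 2 + P[..., 6] ** 2) + 1e-8
    cos_t, sin_t = P[..., 5] / norm, P[..., 6] / norm
    return tr_prob, tcl_prob, r, cos_t, sin_t

tr, tcl, r, cos_t, sin_t = postprocess(np.random.randn(8, 8, 7))
assert np.allclose(cos_t ** 2 + sin_t ** 2, 1.0)
```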
3.4 Inference
After the forward pass, the network produces the TCL, TR and geometry maps.
For TCL and TR, we apply thresholding with values Ttcl and Ttr respectively.
Then, the intersection of TR and TCL gives the final prediction of TCL. Using
disjoint-set, we can efficiently separate TCL pixels into different text instances.
Finally, a striding algorithm is designed to extract an ordered point list that
indicates the shape and course of the text instance, and also reconstruct the
text instance areas. Two simple heuristics are applied to filter out false positive
text instances: 1) The number of TCL pixels should be at least 0.2 times their
average radius; 2) At least half of pixels in the reconstructed text area should
be classified as TR.
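A hedged sketch of this thresholding and filtering step is shown below; it substitutes OpenCV connected-component labeling for the disjoint-set grouping described above, and the threshold values T_tr and T_tcl are illustrative assumptions.

```python
import numpy as np
import cv2

def candidate_instances(tr_prob, tcl_prob, T_tr=0.4, T_tcl=0.6):
    """Threshold TR/TCL, intersect them, and split TCL pixels into candidate instances."""
    tcl_mask = ((tcl_prob > T_tcl) & (tr_prob > T_tr)).astype(np.uint8)
    num, labels = cv2.connectedComponents(tcl_mask)
    return [(labels == k) for k in range(1, num)]  # one boolean mask per instance

def keep_instance(tcl_pixels, radii, reconstructed_mask, tr_mask):
    """Heuristic 1: enough TCL pixels relative to their average radius.
    Heuristic 2: at least half of the reconstructed area is classified as TR."""
    long_enough = tcl_pixels.sum() >= 0.2 * radii[tcl_pixels].mean()
    enough_overlap = reconstructed_mask.any() and tr_mask[reconstructed_mask].mean() >= 0.5
    return bool(long_enough and enough_overlap)
```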
[Fig. 5. The striding algorithm: starting from the segmented TCL prediction, the search expands towards both ends of the instance via the actions Act(a), Act(b) and Act(c) to produce the output.]
The procedure for the striding algorithm is shown in Fig. 5. It features 3 main
actions, denoted as Act(a), Act(b), and Act(c), as illustrated in Fig. 6. Firstly,
we randomly select a pixel as the starting point, and centralize it. Then, the
search process forks into two opposite directions, striding and centralizing until
it reaches the ends. This process generates two ordered point lists in the two
opposite directions, which can be combined to produce the final central axis list
that follows the course of the text and describes its shape precisely. Details of
the 3 actions are shown below.
Act(a) Centralizing As shown in Fig.6, given a point on the TCL, we can draw
the tangent line and the normal line, respectively denoted as dotted line and solid
line. This step can be done with ease using the geometry maps. The midpoint of
the intersection of the normal line and the TCL area gives the centralized point.
Act(b) Striding The algorithm takes a stride to the next point to search. With
the geometry maps, the displacement for each stride is computed and represented
as (½r × cosθ, ½r × sinθ) and (−½r × cosθ, −½r × sinθ), respectively for the two
directions. If the next step is outside the TCL area, we decrement the stride
gradually until it is inside, or until it hits the ends.
Act(c) Sliding The algorithm iterates through the central axis and draws circles
along it. Radii of the circles are obtained from the r map. The area covered by
the circles indicates the predicted text instance.
In conclusion, taking advantage of the geometry maps and the TCL that
precisely describes the course of the text instance, we can go beyond detecting
text and also predict its shape and course. Besides, the striding algorithm
spares our method from traversing all the related pixels.
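The following simplified Python sketch illustrates the two-direction striding loop; it omits the centralizing of Act(a) and the gradual stride decrement, and the way the geometry maps are sampled at rounded coordinates is an assumption made for the example.

```python
import numpy as np

def stride_from(start, tcl_mask, r_map, cos_map, sin_map, sign, max_steps=200):
    """Walk along the center line in one direction, collecting ordered points."""
    points, x, y = [], float(start[0]), float(start[1])
    for _ in range(max_steps):
        xi, yi = int(round(x)), int(round(y))
        inside = 0 <= yi < tcl_mask.shape[0] and 0 <= xi < tcl_mask.shape[1]
        if not inside or not tcl_mask[yi, xi]:
            break  # reached the end of the TCL region
        points.append((x, y))
        # Act(b): stride by (1/2 r cos(theta), 1/2 r sin(theta)) in the chosen direction.
        step = max(0.5 * r_map[yi, xi], 1.0)
        x += sign * step * cos_map[yi, xi]
        y += sign * step * sin_map[yi, xi]
    return points

def central_axis(start, tcl_mask, r_map, cos_map, sin_map):
    """Fork from the starting point into two opposite directions and merge the results."""
    forward = stride_from(start, tcl_mask, r_map, cos_map, sin_map, sign=+1)
    backward = stride_from(start, tcl_mask, r_map, cos_map, sin_map, sign=-1)
    return backward[::-1] + forward[1:]  # ordered list following the course of the text
```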
Extracting Text Center Line For triangles and quadrangles, it’s easy to
directly calculate the TCL with algebraic methods, since in this case, TCL is a
straight line. For polygons of more than 4 sides, it’s not easy to derive a general
algebraic method.
Instead, we propose a method based on the assumption that text
instances are snake-shaped, i.e. they do not fork into multiple branches. A
snake-shaped text instance has two end edges that respectively form its head and
tail, and the two edges adjacent to the head or tail run parallel but in opposite
directions.
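One possible way to realize this assumption, sketched below, is to score every polygon edge by how anti-parallel its two neighboring edges are and take the two best-scoring edges as head and tail; this scoring function is an illustrative choice, not necessarily the exact criterion used for label generation.

```python
import numpy as np

def head_tail_edges(poly):
    """poly: (n, 2) array of polygon vertices in order. Returns the indices of the two
    edges whose neighboring edges are most nearly parallel but opposite in direction."""
    poly = np.asarray(poly, dtype=np.float64)
    n = len(poly)
    edges = [poly[(i + 1) % n] - poly[i] for i in range(n)]
    scores = []
    for i in range(n):
        prev_e, next_e = edges[(i - 1) % n], edges[(i + 1) % n]
        cos = np.dot(prev_e, next_e) / (np.linalg.norm(prev_e) * np.linalg.norm(next_e) + 1e-8)
        scores.append(cos)  # close to -1 when the neighbors run anti-parallel
    order = np.argsort(scores)
    return int(order[0]), int(order[1])

# Toy usage on a hexagonal "snake": the two end edges (indices 2 and 5) are selected.
print(head_tail_edges([[0, 0], [10, 2], [20, 0], [20, 4], [10, 6], [0, 4]]))
```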
Fig. 7. Label Generation. (a) Determining text head and tail; (b) Extracting text
center line and calculating geometries; (c) Expanded text center line.
Calculating r and θ For each point on the TCL: (1) r is computed as the distance
to the corresponding point on sidelines; (2) θ is computed by fitting a straight line
on the TCL points in the neighborhood. For non-TCL pixels, their corresponding
geometry attributes are set to 0 for convenience.
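As a sketch of how such labels might be computed, the snippet below estimates θ from a local straight-line fit (here via PCA) over neighboring TCL points and r as the distance to a matching sideline point; the neighborhood size and the PCA-based fit are assumptions for illustration.

```python
import numpy as np

def local_theta(tcl_points, idx, k=2):
    """Fit a straight line to the TCL points within +/-k of index idx and return its angle."""
    pts = np.asarray(tcl_points, dtype=np.float64)
    nbr = pts[max(0, idx - k): idx + k + 1]
    centered = nbr - nbr.mean(axis=0)
    # The first right-singular vector gives the direction of the fitted line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    return float(np.arctan2(dy, dx))

def local_radius(tcl_point, sideline_point):
    """r is the distance from the TCL point to its corresponding sideline point."""
    return float(np.linalg.norm(np.asarray(sideline_point, float) - np.asarray(tcl_point, float)))

# Toy usage on a gently curved center line.
line = [(x, 0.05 * x ** 2) for x in range(0, 50, 5)]
print(local_theta(line, idx=5), local_radius(line[5], (25, 12)))
```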
4 Experiments
In this section, we evaluate the proposed algorithm on standard benchmarks
for scene text detection and compare it with previous methods. Analyses and
discussions regarding our algorithm are also given.
4.1 Datasets
The datasets used for the experiments in this paper are briefly introduced below:
SynthText [36] is a large-scale dataset that contains about 800K synthetic
images. These images are created by blending natural images with text rendered
with random fonts, sizes, colors, and orientations, thus these images are quite
realistic. We use this dataset to pre-train our model.
TotalText [12] is a newly-released benchmark for text detection. Besides hor-
izontal and multi-oriented text instances, the dataset specially features curved
text, which rarely appears in other benchmark datasets, but is actually quite
common in real environments. The dataset is split into training and testing sets
with 1255 and 300 images, respectively.
CTW1500 [13] is another dataset mainly consisting of curved text. It con-
sists of 1000 training images and 500 test images. Text instances are annotated
with polygons with 14 vertices.
ICDAR 2015 is proposed as the Challenge 4 of the 2015 Robust Reading
Competition [37] for incidental scene text detection. Scene text images in this
dataset are taken by Google Glass without taking care of positioning, image
quality, and viewpoint. This dataset features small, blurred, and multi-oriented text
instances. There are 1000 images for training and 500 images for testing. The
text instances in this dataset are labeled as word-level quadrangles.
MSRA-TD500 [38] is a dataset with multi-lingual, arbitrary-oriented and
long text lines. It includes 300 training images and 200 test images with
text-line-level annotations. Following previous works [3,10], we also include the images
from HUST-TR400 [39] as training data when fine-tuning on this dataset, since
its training set is rather small.
For experiments on ICDAR 2015 and MSRA-TD500, we fit a minimum
bounding rectangle based on the output text area of our method.
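A small sketch of this conversion is given below, using OpenCV's minAreaRect as one convenient way to obtain the minimum bounding rectangle of a detected text area; the binary-mask input format is an assumption made for the example.

```python
import numpy as np
import cv2

def to_min_area_quad(text_mask):
    """text_mask: binary (h, w) array for one detected text instance."""
    ys, xs = np.nonzero(text_mask)
    points = np.stack([xs, ys], axis=1).astype(np.float32)
    rect = cv2.minAreaRect(points)   # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)       # 4x2 array of the fitted rectangle's corners

# Toy usage: a tilted strip of "text" pixels.
mask = np.zeros((60, 80), dtype=np.uint8)
cv2.line(mask, (10, 40), (70, 20), 1, thickness=6)
print(to_min_area_quad(mask))
```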
Fig. 8. Qualitative results by the proposed method. Top: Detected text contours (in
yellow) and ground truth annotations (in green). Bottom: Combined score maps for
TR (in red) and TCL (in yellow). From left to right: images from ICDAR
2015, TotalText, CTW1500 and MSRA-TD500. Best viewed in color.
We train the model with a batch size of 32 on 2 GPUs in parallel and evaluate it on 1
GPU with the batch size set to 1. Hyper-parameters are tuned by grid search on the
training set.
As shown in Tab. 1, the proposed method achieves 82.7%, 74.5%, and 78.4%
in precision, recall and F-measure on Total-Text, significantly outperforming
Table 2. Quantitative results of different methods evaluated on CTW1500. Results
other than ours are obtained from [13].
previous methods. Note that the F-measure of our method is more than double
that of the baseline provided in the original Total-Text paper [12].
On CTW1500, the proposed method achieves 67.9%, 85.3%, and 75.6% in
precision, recall and F-measure, respectively. Compared with CTD+TLOC,
which was proposed together with the CTW1500 dataset in [13], the F-measure of
our algorithm is 2.2% higher (75.6% vs. 73.4%).
The superior performances of our method on Total-Text and CTW1500 verify
that the proposed representation can handle curved text in natural images well.
Table 3. Quantitative results of different methods on ICDAR 2015. ∗ stands for multi-
scale, † indicates that the base net of the model is not VGG16.
Table 4. Quantitative results of different methods on MSRA-TD500. † indicates mod-
els whose base nets are not VGG16.
Fig. 9. Text instances transformed to canonical form using the predicted geometries.
As shown in Tab. 5, our method still performs well on curved text and signifi-
cantly outperforms the three strong competitors SegLink, EAST and PixelLink,
without fine-tuning on curved text. We attribute this excellent generalization
ability to the proposed flexible representation. Instead of taking text as a whole,
the representation treats text as a collection of local elements and integrates
them together to make decisions. Local attributes are preserved when they are
assembled into a whole, and they remain independent of each other. Therefore, the final predic-
tions of our method can retain most information about the shape and course of the
text. We believe that this is the main reason for the capacity of the proposed text
detection algorithm to detect text instances of various shapes.
References
1. Yao, C., Bai, X., Sang, N., Zhou, X., Zhou, S., Cao, Z.: Scene text detection via
holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002 (2016)
2. Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking
segments. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). (July 2017)
3. Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: EAST: An
efficient and accurate scene text detector. In: The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). (July 2017)
4. Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: Textboxes: A fast text detector with
a single deep neural network. In: AAAI. (2017) 4161–4167
5. Huang, L., Yang, Y., Deng, Y., Yu, Y.: Densebox: Unifying landmark localization
with end to end object detection. arXiv preprint arXiv:1509.04874 (2015)
6. Wu, Y., Natarajan, P.: Self-organized text detection with minimal post-processing
via border learning. In: Proceedings of the IEEE Conference on CVPR. (2017)
5000–5009
7. He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A.G., Kifer, D., Giles, C.L.: Multi-
scale fcn with cascaded instance aware segmentation for arbitrary oriented word
spotting in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2017
IEEE Conference on, IEEE (2017) 474–483
8. Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., Ding, E.: Wordsup: Exploiting
word annotations for character based text detection. In: The IEEE International
Conference on Computer Vision (ICCV). (Oct 2017)
9. Tian, S., Lu, S., Li, C.: Wetext: Scene text detection under weak supervision.
arXiv preprint arXiv:1710.04826 (2017)
10. Lyu, P., Yao, C., Wu, W., Yan, S., Bai, X.: Multi-oriented scene text detection
via corner localization and region segmentation. In: Computer Vision and Pattern
Recognition (CVPR), 2018 IEEE Conference on. (2018)
11. Sheng, Z., Yuliang, L., Lianwen, J., Canjie, L.: Feature enhancement network: A
refined scene text detector. In: Proceedings of AAAI, 2018. (2018)
12. Kheng Chng, C., Chan, C.S.: Total-text: A comprehensive dataset for scene text
detection and recognition. arXiv preprint arXiv:1710.10400 (2017)
13. Yuliang, L., Lianwen, J., Shuaitao, Z., Sheng, Z.: Detecting curve text in the wild:
New dataset and new solution. arXiv preprint arXiv:1712.02170 (2017)
14. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke
width transform. In: Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on, IEEE (2010) 2963–2970
15. Neumann, L., Matas, J.: A method for text localization and recognition in real-
world images. In: Asian Conference on Computer Vision, Springer (2010) 770–783
16. Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In:
European conference on computer vision, Springer (2014) 512–528
17. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild
with convolutional neural networks. International Journal of Computer Vision
116(1) (2016) 1–20
18. Ye, Q., Doermann, D.: Text detection and recognition in imagery: A survey. IEEE
transactions on pattern analysis and machine intelligence 37(7) (2015) 1480–1500
19. Zhu, Y., Yao, C., Bai, X.: Scene text detection and recognition: Recent advances
and future trends. Frontiers of Computer Science 10(1) (2016) 19–36
20. Yin, X.C., Yin, X., Huang, K., Hao, H.W.: Robust text detection in natural scene
images. IEEE transactions on pattern analysis and machine intelligence 36(5)
(2014) 970–983
21. Huang, W., Qiao, Y., Tang, X.: Robust scene text detection with convolution
neural network induced mser trees. In: European Conference on Computer Vision,
Springer (2014) 497–511
22. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.:
SSD: Single shot multibox detector. In: European conference on computer vision,
Springer (2016) 21–37
23. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Advances in neural information processing
systems. (2015) 91–99
24. Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., Xue, X.:
Arbitrary-oriented scene text detection via rotation proposals. arXiv preprint
arXiv:1703.01086 (2017)
25. He, W., Zhang, X.Y., Yin, F., Liu, C.L.: Deep direct regression for multi-oriented
scene text detection. In: The IEEE International Conference on Computer Vision
(ICCV). (Oct 2017)
26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. (2015) 3431–3440
27. Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., Bai, X.: Multi-oriented text de-
tection with fully convolutional networks. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. (2016) 4159–4167
28. Zhang, Z., Shen, W., Yao, C., Bai, X.: Symmetry-based text line detection in
natural scenes. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. (2015) 2558–2567
29. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). (July 2017)
30. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomed-
ical Image Segmentation. Springer International Publishing (2015)
31. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
32. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recogni-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). (2016)
33. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR). (2010) 2528–2535
34. Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors
with online hard example mining. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). (2016) 761–769
35. Girshick, R.: Fast r-cnn. In: The IEEE International Conference on Computer
Vision (ICCV). (December 2015)
36. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in
natural images. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. (2016) 2315–2324
37. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura,
M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: Icdar 2015
competition on robust reading. In: Document Analysis and Recognition (ICDAR),
2015 13th International Conference on, IEEE (2015) 1156–1160
38. Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations
in natural images. In: Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, IEEE (2012) 1083–1090
39. Yao, C., Bai, X., Liu, W.: A unified framework for multioriented text detection
and recognition. IEEE Transactions on Image Processing 23(11) (2014) 4737–4749
40. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghe-
mawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine
learning. In: OSDI. Volume 16. (2016) 265–283
41. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings
of ICLR. (2015)
42. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmen-
tation. In: The IEEE International Conference on Computer Vision (ICCV). (2015)
1520–1528
43. Liu, Y., Jin, L.: Deep matching prior network: Toward tighter multi-oriented text
detection. In: The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). (2017)
44. Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image
with connectionist text proposal network. In: European Conference on Computer
Vision, Springer (2016) 56–72
45. He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., Li, X.: Single shot text detector with
regional attention. In: The IEEE International Conference on Computer Vision
(ICCV). (Oct 2017)
46. Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: Detecting scene text via instance
segmentation. AAAI (2018)
47. Kang, L., Li, Y., Doermann, D.: Orientation robust text line detection in natural
images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. (2014) 4034–4041
48. Wolf, C., Jolion, J.M.: Object count/area graphs for the evaluation of object de-
tection and segmentation algorithms. International Journal of Document Analysis
and Recognition (IJDAR) 8(4) (2006) 280–296
49. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman,
A.: The pascal visual object classes challenge: A retrospective. International journal
of computer vision 111(1) (2015) 98–136