DOI 10.1007/s11263-015-0823-z
Received: 1 December 2014 / Accepted: 31 March 2015 / Published online: 7 May 2015
© Springer Science+Business Media New York 2015
Abstract In this work we present an end-to-end system for text spotting—localising and recognising text in natural scene images—and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.

Keywords Text spotting · Text recognition · Text detection · Deep learning · Convolutional neural networks · Synthetic data · Text retrieval

Communicated by Cordelia Schmid.

Max Jaderberg
[email protected]

Karen Simonyan
[email protected]

Andrea Vedaldi
[email protected]

Andrew Zisserman
[email protected]

Department of Engineering Science, University of Oxford, Oxford, UK

1 Introduction

The automatic detection and recognition of text in natural images, text spotting, is an important challenge for visual understanding.

Text, as the physical incarnation of language, is one of the basic tools for preserving and communicating information. Much of the modern world is designed to be interpreted through the use of labels and other textual cues, and so text finds itself scattered throughout many images and videos. Through the use of text spotting, an important part of the semantic content of visual media can be decoded and used, for example, for understanding, annotating, and retrieving the billions of consumer photos produced every day.

Traditionally, text recognition has been focussed on document images, where OCR techniques are well suited to digitise planar, paper-based documents. However, when applied to natural scene images, these document OCR techniques fail as they are tuned to the largely black-and-white, line-based environment of printed documents. The text that occurs in natural scene images is hugely variable in appearance and layout, being drawn from a large number of fonts and styles, suffering from inconsistent lighting, occlusions, orientations, and noise; in addition, the presence of background objects causes spurious false-positive detections. This places text spotting as a separate, far more challenging problem than document OCR.

The increase in powerful computer vision techniques and the overwhelming growth in the volume of images produced over the last decade have seen a rapid development of text spotting methods.
To efficiently perform text spotting, the majority of methods follow the intuitive process of splitting the task in two: text detection followed by word recognition (Chen and Yuille 2004). Text detection involves generating candidate character or word region detections, while word recognition takes these proposals and infers the words depicted. In this paper we advance text spotting methods, making a number of key contributions as part of this.

Our main contribution is a novel text recognition method—this is in the form of a deep convolutional neural network (CNN) (LeCun et al. 1998) which takes the whole word image as input to the network. Evidence is gradually pooled from across the image to perform classification of the word across a huge dictionary, such as the 90k-word dictionary evaluated in this paper. Remarkably, our model is trained purely on synthetic data, without incurring the cost of human labelling. We also propose an incremental learning method to successfully train a model with such a large number of classes. Our recognition framework is exceptionally powerful, substantially outperforming previous state of the art on real-world scene text recognition, without using any real-world labelled training data.

Our second contribution is a novel detection strategy for text spotting: the use of fast region proposal methods to perform word detection. We use a combination of an object-agnostic region proposal method and a sliding window detector. This gives very high recall coverage of individual word bounding boxes, resulting in around 98 % word recall on both ICDAR 2003 and Street View Text datasets with a manageable number of proposals. False-positive candidate word bounding boxes are filtered with a stronger random forest classifier and the remaining proposals adjusted using a CNN trained to regress the bounding box coordinates.

Our third contribution is the application of our pipeline for large-scale visual search of text in video. In a fraction of a second we are able to retrieve images and videos from a huge corpus that contain the visual rendering of a user given text query, at very high precision.

We expose the performance of each part of the pipeline in experiments, showing that we can maintain the high recall of the initial proposal stage while gradually boosting precision as more complex models and higher order information are incorporated. The recall of the detection stage is shown to be significantly higher than that of previous text detection methods, and the accuracy of the word recognition stage higher than all previous methods. The result is an end-to-end text spotting system that outperforms all previous methods by a large margin. We demonstrate this for the annotation task (localising and recognising text in images) across a large range of standard text spotting datasets, as well as in a retrieval scenario (retrieving a ranked list of images that contain the text of a query string) for standard datasets. In addition, the use of our framework for retrieval is further demonstrated in a real-world application—being used to instantly search through thousands of hours of archived news footage for a user-given text query.

The following section gives an overview of our pipeline. We then review a selection of related work in Sect. 3. Sections 4, 5, 6, and 7 present the stages of our pipeline. We extensively test all elements of our pipeline in Sect. 8 and include the details of datasets and the experimental setup. Finally, Sect. 9 summarises and concludes.

Our word recognition framework appeared previously as a tech report (Jaderberg et al. 2014) and at the NIPS 2014 Deep Learning and Representation Learning Workshop. This report includes the synthetic data renderer and word recognition CNN model used in this paper (Sect. 6), as well as some other non-dictionary word recognition models.

2 Overview of the Approach

The stages of our approach are as follows: word bounding box proposal generation (Sect. 4), proposal filtering and adjustments (Sect. 5), text recognition (Sect. 6) and final merging for the specific task (Sect. 7). The full process is illustrated in Fig. 1.

Our process loosely follows the detection/recognition separation—a word detection stage followed by a word recognition stage. However, these two stages are not wholly distinct, as we use the information gained from word recognition to merge and rank detection results at the end, leading to a stronger holistic text spotting system.

The detection stage of our pipeline is based on weak-but-fast detection methods to generate word bounding-box proposals. This draws on the success of the R-CNN object detection framework of Girshick et al. (2014), where region proposals are mapped to a fixed size for CNN recognition. The use of region proposals avoids the computational complexity of evaluating an expensive classifier with exhaustive multi-scale, multi-aspect-ratio sliding window search. We use a combination of Edge Box proposals (Zitnick and Dollár 2014) and a trained aggregate channel features detector (Dollár et al. 2014) to generate candidate word bounding boxes. Due to the large number of false-positive proposals, we then use a random forest classifier to filter the number of proposals to a manageable size—this is a stronger classifier than those found in the proposal algorithms. Finally, inspired by the success of bounding box regression in DPM (Felzenszwalb et al. 2010) and R-CNN (Girshick et al. 2014), we regress more accurate bounding boxes from the seeds of the proposal algorithms, which greatly improves the average overlap ratio of positive detections with groundtruth. However, unlike the linear regressors of Felzenszwalb et al. (2010) and Girshick et al. (2014), we train a CNN specifically for regression. We discuss these design choices in each section.
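The stage flow just described can be summarised in a few lines. The sketch below is purely illustrative—the stage functions are passed in as placeholder callables rather than taken from any released implementation—and the 0.5 filtering threshold is the operating point reported later in Sect. 8.2.2.

```python
# Illustrative sketch of the four-stage pipeline; all stage functions are
# supplied by the caller (placeholders, not the paper's released code).

def run_text_spotting(image, propose_edge_boxes, propose_acf, word_score,
                      regress_box, recognise_word, merge_and_rank, lexicon=None):
    # Sect. 4: high-recall word bounding box proposals from two complementary detectors.
    proposals = list(propose_edge_boxes(image)) + list(propose_acf(image))

    # Sect. 5: discard likely false positives, then refine the surviving boxes with
    # the bounding box regression CNN.
    proposals = [b for b in proposals if word_score(image, b) > 0.5]
    proposals = [regress_box(image, b) for b in proposals]

    # Sect. 6: whole-word dictionary recognition on each refined proposal.
    recognitions = [recognise_word(image, b, lexicon) for b in proposals]

    # Sect. 7: merge duplicates, rank by recognition score, and threshold.
    return merge_and_rank(proposals, recognitions)
```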
Fig. 1 The end-to-end text spotting pipeline proposed. a A combination of region proposal methods extracts many word bounding box proposals. b Proposals are filtered with a random forest classifier, reducing the number of false-positive detections. c A CNN is used to perform bounding box regression for refining the proposals. d A different CNN performs text recognition on each of the refined proposals. e Detections are merged based on proximity and recognition results and assigned a score. f Thresholding the detections results in the final text spotting result

The second stage of our framework produces a text recognition result for each proposal generated from the detection stage. We take a whole-word approach to recognition, providing the entire cropped region of the word as input to a deep convolutional neural network. We present a dictionary model which poses the recognition task as a multi-way classification task across a dictionary of 90k possible words. Due to the mammoth training data requirements of classification tasks of this scale, these models are trained purely from synthetic data. Our synthetic data engine is capable of rendering sufficiently realistic and variable word image samples that the models trained on this data translate to the domain of real-world word images, giving state-of-the-art recognition accuracy.

Finally, we use the information gleaned from recognition to update our detection results with multiple rounds of non-maximal suppression and bounding box regression.

3 Related Work

In this section we review the contributions of works most related to ours. These focus solely on text detection (Anthimopoulos et al. 2013; Chen et al. 2011; Epshtein et al. 2010; Gomez and Karatzas 2013, 2014; Huang et al. 2014; Yi and Tian 2011; Yin et al. 2013), text recognition (Almazán et al. 2014; Bissacco et al. 2013; Jaderberg et al. 2014; Mishra et al. 2012; Novikova et al. 2012; Rath and Manmatha 2007; Yao et al. 2014), or on combining both in end-to-end systems (Alsharif and Pineau 2014; Gordo 2014; Jaderberg et al. 2014; Mishra et al. 2013; Neumann and Matas 2010, 2011, 2012, 2013; Posner et al. 2010; Quack 2009; Wang et al. 2011, 2012; Weinman et al. 2014).

3.1 Text Detection Methods

Text detection methods tackle the first task of the standard text spotting pipeline (Chen and Yuille 2004): producing segmentations or bounding boxes of words in natural scene images. Detecting instances of words in noisy and cluttered images is a highly non-trivial task, and the methods developed to solve this are based on either character regions (Chen et al. 2011; Epshtein et al. 2010; Gomez and Karatzas 2013, 2014; Huang et al. 2014; Neumann and Matas 2010, 2011, 2012, 2013; Yi and Tian 2011; Yin et al. 2013) or sliding windows (Anthimopoulos et al. 2013; Jaderberg et al. 2014; Posner et al. 2010; Quack 2009; Wang et al. 2011, 2012).

Character region methods aim to segment pixels into characters, and then group characters into words. Epshtein et al. (2010) find regions of the input image which have constant stroke width—the distance between two parallel edges—by taking the stroke width transform (SWT). Intuitively, characters are regions of similar stroke width, so clustering pixels together forms characters, and characters are grouped together into words based on geometric heuristics. Neumann and Matas (2013) revisit the notion of characters represented as strokes and use gradient filters to detect oriented strokes in place of the SWT. Rather than regions of constant stroke width, Neumann and Matas (2010, 2011, 2012) use Extremal Regions (Matas et al. 2002) as character regions.
Huang et al. (2014) expand on the use of Maximally Stable Extremal Regions by incorporating a strong CNN classifier to efficiently prune the trees of Extremal Regions, leading to fewer false-positive detections.

Sliding window methods approach text detection as a classical object detection task. Wang et al. (2011) use a random ferns (Ozuysal et al. 2007) classifier trained on HOG features (Felzenszwalb et al. 2010) in a sliding window scenario to find characters in an image. These are grouped into words using a pictorial structures framework (Felzenszwalb and Huttenlocher 2005) for a small fixed lexicon. Wang et al. (2012) show that CNNs trained for character classification can be used as effective sliding window classifiers. In some of our earlier work (Jaderberg et al. 2014), we use CNNs for text detection by training a text/no-text classifier for sliding window evaluations, and also CNNs for character and bigram classification to perform word recognition. We showed that using feature sharing across all the CNNs for the different classification tasks resulted in stronger classifiers for text detection than training each classifier independently.

Unlike previous methods, our framework operates in a low-precision, high-recall mode—rather than using a single word location proposal, we carry a sufficiently high number of candidates through several stages of our pipeline. We use high recall region proposal methods and a filtering stage to further refine these. In fact, our "detection method" is only complete after performing full text recognition on each remaining proposal, as we then merge and rank the proposals based on the output of the recognition stage to give the final detections, complete with their recognition results.

3.2 Text Recognition Methods

Text recognition aims at taking a cropped image of a single word and recognising the word depicted. While there are many previous works focussing on handwriting or historical document recognition (Fischer et al. 2010; Frinken et al. 2012; Manmatha et al. 1996; Rath and Manmatha 2007), these methods don't generalise to generic scene text due to the highly variable foreground and background textures that are not present in documents.

For scene text recognition, methods can be split into two groups—character based recognition (Alsharif and Pineau 2014; Bissacco et al. 2013; Jaderberg et al. 2014; Posner et al. 2010; Quack 2009; Wang et al. 2011, 2012; Weinman et al. 2014; Yao et al. 2014) and whole word based recognition (Almazán et al. 2014; Goel et al. 2013; Jaderberg et al. 2014; Mishra et al. 2012; Novikova et al. 2012; Rodriguez-Serrano et al. 2013).

Character based recognition relies on an individual character classifier for per-character recognition which is integrated across the word image to generate the full word recognition. Yao et al. (2014) learn a set of mid-level features, strokelets, by clustering sub-patches of characters. Characters are detected with Hough voting, with the characters identified by a random forest classifier acting on strokelet and HOG features.

The works of Alsharif and Pineau (2014), Bissacco et al. (2013), Jaderberg et al. (2014), and Wang et al. (2012) all use CNNs as character classifiers. Bissacco et al. (2013) and Alsharif and Pineau (2014) over-segment the word image into potential character regions, either through unsupervised binarization techniques or with a supervised classifier. Alsharif and Pineau (2014) then use a complicated combination of segmentation-correction and character recognition CNNs together with an HMM with a fixed lexicon to generate the final recognition result. The PhotoOCR system (Bissacco et al. 2013) uses a neural network classifier acting on the HOG features of the segments as scores to find the best combination of segments using beam search. The beam search incorporates a strong N-gram language model, and the final beam search proposals are re-ranked with a further language model and shape model. Our own previous work (Jaderberg et al. 2014) uses a combination of a binary text/no-text classifier, a character classifier, and a bigram classifier densely computed across the word image as cues to a Viterbi scoring function in the context of a fixed lexicon.

As an alternative approach to word recognition, other methods use whole word based recognition, pooling features from across the entire word sub-image before performing word classification. The works of Mishra et al. (2012) and Novikova et al. (2012) still rely on explicit character classifiers, but construct a graph to infer the word, pooling together the full word evidence. Goel et al. (2013) use whole word sub-image features to recognise words by comparing to simple black-and-white font-renderings of lexicon words. Rodriguez-Serrano et al. (2013) use aggregated Fisher Vectors (Perronnin et al. 2010) and a Structured SVM framework to create a joint word-image and text embedding.

Almazán et al. (2014) further explore the notion of word embeddings, creating a joint embedding space for word images and representations of word strings. This is extended in Gordo (2014), where Gordo makes explicit use of character level training data to learn mid-level features. This results in performance on par with Bissacco et al. (2013) but using only a small fraction of the amount of training data.

While not performing full scene text recognition, Goodfellow et al. (2013) had great success using a CNN with multiple position-sensitive character classifier outputs to perform street number recognition. This model was extended to CAPTCHA sequences up to 8 characters long, where they demonstrated impressive performance using synthetic training data for a synthetic problem (where the generative model is known). In contrast, we show that synthetic training data can be used for a real-world data problem (where the generative model is unknown).
Our method for text recognition also follows a whole word image approach. Similarly to Goodfellow et al. (2013), we take the word image as input to a deep CNN; however, we employ a dictionary classification model. Recognition is achieved by performing multi-way classification across the entire dictionary of potential words.

In the following sections we describe the details of each stage of our text spotting pipeline. The sections are presented in order of their use in the end-to-end system.

4 Proposal Generation

The first stage of our end-to-end text spotting pipeline relies on the generation of word bounding boxes. This is word detection—in an ideal scenario we would be able to generate word bounding boxes with high recall and high precision, achieving this by extracting the maximum amount of information possible from each bounding box candidate. However, in practice a precision/recall tradeoff is required to reduce computational complexity. With this in mind we opt for a fast, high recall initial phase, using computationally cheap classifiers, and gradually incorporate more information and more complex models to improve precision by rejecting false-positive detections, resulting in a cascade. To compute recall and precision in a detection scenario, a bounding box is said to be a true-positive detection if it has overlap with a groundtruth bounding box above a defined threshold. The overlap for bounding boxes b1 and b2 is defined as the ratio of intersection over union (IoU): |b1 ∩ b2| / |b1 ∪ b2|.

Though never applied to word detection before, region proposal methods have gained a lot of attention for generic object detection. Region proposal methods (Alexe et al. 2012; Cheng et al. 2014; Uijlings et al. 2013; Zitnick and Dollár 2014) aim to generate object region proposals with high recall, but at the cost of a large number of false-positive detections. Even so, this still reduces the search space drastically compared to sliding window evaluation of the subsequent stages of a detection pipeline. Effectively, region proposal methods can be viewed as a weak detector.

In this work we combine the results of two detection mechanisms—the Edge Boxes region proposal algorithm (Zitnick and Dollár 2014) (Sect. 4.1) and a weak aggregate channel features detector (Dollár et al. 2014) (Sect. 4.2).

4.1 Edge Boxes

We use the formulation of Edge Boxes as described in Zitnick and Dollár (2014). The key intuition behind Edge Boxes is that, since objects are generally self contained, the number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object. Edges tend to correspond to object boundaries, and so if edges are contained inside a bounding box this implies objects are contained within the bounding box, whereas edges which cross the border of the bounding box suggest there is an object that is not wholly contained by the bounding box.

The notion of an object being a collection of boundaries is especially true when the desired objects are words—collections of characters with sharp boundaries.

Following Zitnick and Dollár (2014), we compute the edge response map using the Structured Edge detector (Dollár and Zitnick 2013, 2014) and perform Non-Maximal Suppression orthogonal to the edge responses, sparsifying the edge map. A candidate bounding box b is assigned a score s_b based on the number of edges wholly contained by b, normalised by the perimeter of b. The full details can be found in Zitnick and Dollár (2014).

The boxes b are evaluated in a sliding window manner, over multiple scales and aspect ratios, and given a score s_b. Finally, the boxes are sorted by score and non-maximal suppression is performed: a box is removed if its overlap with another box of higher score is more than a threshold. This results in a set of candidate bounding boxes for words B_e.

4.2 Aggregate Channel Feature Detector

Another method for generating candidate word bounding box proposals is to use a conventional trained detector. We use the aggregate channel features (ACF) detector framework of Dollár et al. (2014) for its speed of computation. This is a conventional sliding window detector based on ACF features coupled with an AdaBoost classifier. ACF based detectors have been shown to work well on pedestrian detection and general object detection, and here we use the same framework for word detection.

For each image I a number of feature channels are computed, such that channel C = Ω(I), where Ω is the channel feature extraction function. We use channels similar to those in Dollár et al. (2010): normalised gradient magnitude, histogram of oriented gradients (6 channels), and the raw greyscale input. Each channel C is smoothed, divided into blocks, and the pixels in each block are summed and smoothed again, resulting in aggregate channel features.

The ACF features are not scale invariant, so for multi-scale detection we need to extract features at many different scales—a feature pyramid. In a standard detection pipeline, the channel features for a particular scale s are computed by resampling the image and recomputing the channel features C_s = Ω(I_s), where C_s are the channel features at scale s and I_s = R(I, s) is the image resampled by s. Resampling and recomputing the features at every scale is computationally expensive. However, as shown in Dollár et al. (2014, 2010), the channel features at scale s can be approximated by resampling the features at a different scale, such that C_s ≈ R(C, s) · s^(−λ_Ω), where λ_Ω is a channel-specific scaling factor.
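The power-law approximation C_s ≈ R(C, s) · s^(−λ_Ω) can be sketched directly. The channel and λ values below are illustrative stand-ins, not the constants estimated in Dollár et al. (2014); this is a minimal sketch of the idea rather than the detector's actual feature code.

```python
# Fast feature-pyramid approximation: resample channels computed once at the
# native scale and correct them with a channel-specific power-law factor,
# instead of recomputing Omega at every scale.

import cv2
import numpy as np

def approx_channel_at_scale(C, s, lam):
    """Approximate C_s = Omega(R(I, s)) by R(C, s) * s**(-lam)."""
    h, w = C.shape[:2]
    resampled = cv2.resize(C, (max(1, int(round(w * s))), max(1, int(round(h * s)))),
                           interpolation=cv2.INTER_LINEAR)
    return resampled * (s ** -lam)

def build_feature_pyramid(channels, scales, lambdas):
    """channels: dict name -> HxW float array computed at scale 1.
    lambdas: per-channel power-law factor. Returns {scale: {name: array}}."""
    return {s: {name: approx_channel_at_scale(C, s, lambdas[name])
                for name, C in channels.items()}
            for s in scales}

if __name__ == "__main__":
    # Dummy gradient-magnitude channel purely for demonstration.
    img = np.random.rand(32, 100).astype(np.float32)
    gy, gx = np.gradient(img)
    channels = {"grad_mag": np.hypot(gx, gy)}
    pyramid = build_feature_pyramid(channels, scales=[1.0, 0.5, 0.25],
                                    lambdas={"grad_mag": 0.1})
```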
5.2.1 Discussion

Our bounding box coordinate regressor takes each proposed bounding box b ∈ B_f and produces an updated estimate of that proposal, b*. A bounding box is parametrised by its top-left and bottom-right corners, such that bounding box b = (x1, y1, x2, y2). The full image I is cropped to a rectangle centred on the region b, with the width and height inflated by a scale factor. The resulting image is resampled to a fixed size W × H, giving I_b, which is processed by the CNN to regress the four values of b*. We do not regress the absolute values of the bounding box coordinates directly, but rather encoded values. The top-left coordinate is encoded by the top-left quadrant of I_b, and the bottom-right coordinate by the bottom-right quadrant of I_b, as illustrated by Fig. 3. This normalises the coordinates to generally fall in the interval [0, 1], but allows the breaking of this interval if required.

In practice, we inflate the cropping region of each proposal by a factor of two. This gives the CNN enough context to predict a more accurate location of the proposal bounding box. The CNN is trained with example pairs of (I_b, b_gt) to regress the groundtruth bounding box b_gt from the sub-image I_b cropped from I by the estimated bounding box b. This is done by minimising the L2 loss between the encoded bounding boxes, i.e.

    min_Φ  Σ_{b ∈ B_train}  || g(I_b; Φ) − q(b_gt) ||_2^2    (1)

over the network parameters Φ on a training set B_train, where g is the CNN forward pass function and q is the bounding box coordinate encoder.
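The encoder q(·) is only described at the level above; the sketch below is one consistent interpretation of it (corner offsets expressed relative to the crop's quadrants), written in numpy, and is not the paper's implementation.

```python
# Quadrant-relative bounding box encoding for the regression targets of Eq. (1).

import numpy as np

def inflate(box, factor=2.0):
    """Inflate a proposal (x1, y1, x2, y2) about its centre; factor=2 as in the paper."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def encode(box_gt, crop):
    """q(b_gt): top-left corner relative to the crop's top-left quadrant,
    bottom-right corner relative to its bottom-right quadrant."""
    cx1, cy1, cx2, cy2 = crop
    w, h = (cx2 - cx1) / 2.0, (cy2 - cy1) / 2.0   # quadrant size
    x1, y1, x2, y2 = box_gt
    return np.array([(x1 - cx1) / w, (y1 - cy1) / h,
                     (x2 - (cx1 + w)) / w, (y2 - (cy1 + h)) / h])

def l2_regression_loss(pred_encoded, box_gt, crop):
    """Per-sample term of Eq. (1): squared L2 distance between encoded boxes."""
    return float(np.sum((pred_encoded - encode(box_gt, crop)) ** 2))
```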
6 Text Recognition

At this stage of our processing pipeline, a pool of accurate word bounding box proposals has been generated as described in the previous sections. We now turn to the task of recognising words inside these proposed bounding boxes. To this end we use a deep CNN to perform classification across a pre-defined dictionary of words—dictionary encoding—which explicitly models natural language. The cropped image of each of the proposed bounding boxes is taken as input to the CNN, and the CNN produces a probability distribution over all the words in the dictionary. The word with the maximum probability can be taken as the recognition result.

The model, described fully in Sect. 6.2, can scale to a huge dictionary of 90k words, encompassing the majority of the commonly used English language (see Sect. 8.1 for details of the dictionary used). However, to achieve this, many training samples of every different possible word must be amassed. Such a training dataset does not exist, so we instead use synthetic training data, described in Sect. 6.1, to train our CNN. This synthetic data is so realistic that the CNN can be trained purely on the synthetic data but still applied to real world data.

6.1 Synthetic Training Data

This section describes our scene text rendering algorithm. As our CNN models take whole word images as input instead of individual character images, it is essential to have access to a training dataset of cropped word images that covers the whole language or at least a target lexicon. While there are some publicly available datasets from ICDAR (Karatzas et al. 2013; Lucas 2005; Lucas et al. 2003; Shahab et al. 2011), the Street View Text (SVT) dataset (Wang et al. 2011), the IIIT-5k dataset (Mishra et al. 2012), and others, the number of full word image samples is only in the thousands, and the vocabulary is very limited.

The lack of full word image samples has caused previous work to rely on character classifiers instead (as character data is plentiful), or this deficit in training data has been mitigated by mining for data or having access to large proprietary datasets (Bissacco et al. 2013; Goodfellow et al. 2013; Jaderberg et al. 2014). However, we wish to perform whole word image based recognition and move away from character recognition, and aim to do this in a scalable manner without requiring human labelled datasets.

Following the success of some synthetic character datasets (de Campos et al. 2009; Wang et al. 2012), we create a synthetic word data generator, capable of emulating the distribution of scene text images. This is a reasonable goal, considering that much of the text found in natural scenes is restricted to a limited set of computer-generated fonts, and only the physical rendering process (e.g. printing, painting) and the imaging process (e.g. camera, viewpoint, illumination, clutter) are not controlled by a computer algorithm.

Fig. 4 a The text generation process after font rendering, creating and colouring the image-layers, applying projective distortions, and after image blending. b Some randomly sampled data created by the synthetic text engine

Figure 4 illustrates the generative process and some resulting synthetic data samples. These samples are composed of three separate image-layers—a background image-layer, foreground image-layer, and optional border/shadow image-layer—which are in the form of an image with an alpha channel. The synthetic data generation process is as follows:

1. Font rendering—a font is randomly selected from a catalogue of over 1400 fonts downloaded from Google Fonts. The kerning, weight, underline, and other properties are varied randomly from arbitrarily defined distributions. The word is rendered on to the foreground image-layer's alpha channel with either a horizontal bottom text line or following a random curve.
2. Border/shadow rendering—an inset border, outset border, or shadow with a random width may be rendered from the foreground.
3. Base colouring—each of the three image-layers is filled with a different uniform colour sampled from clusters over natural images. The clusters are formed by k-means clustering the RGB components of each image of the training datasets of Lucas et al. (2003) into three clusters.
4. Projective distortion—the foreground and border/shadow image-layers are distorted with a random, full projective transformation, simulating the 3D world.
5. Natural data blending—each of the image-layers is blended with a randomly-sampled crop of an image from the training datasets of ICDAR 2003 and SVT. The amount of blend and the alpha blend mode (e.g. normal, add, multiply, burn, max, etc.) are dictated by a random process, and this creates an eclectic range of textures and compositions. The three image-layers are also blended together in a random manner, to give a single output image.
6. Noise—elastic distortion similar to Simard et al. (2003), Gaussian noise, blur, resampling noise, and JPEG compression artefacts are introduced to the image.

This process produces a wide range of synthetic data samples, being drawn from a multitude of random distributions, mimicking real-world samples of scene text images. The synthetic data is used in place of real-world data, and the labels are generated from a corpus or dictionary as desired. By creating training datasets many orders of magnitude larger than what has been available before, we are able to use data-hungry deep learning algorithms to train a richer, whole-word-based model.
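The generation steps above can be condensed into a short rendering sketch. Everything below is illustrative: Pillow is used in place of the paper's rendering engine, the parameter ranges and projective coefficients are made-up stand-ins, and border/shadow rendering, curved baselines and the richer blend modes are omitted.

```python
# Simplified synthetic word renderer: font rendering, base colouring, a mild
# projective distortion, natural-image blending, and noise.

import random
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

W, H = 100, 32

def render_word(word, font_path, bg_crop):
    """bg_crop: a PIL 'L' image of size (W, H) cropped from a natural image."""
    # Step 1: render the word onto the foreground layer's alpha channel.
    font = ImageFont.truetype(font_path, size=random.randint(18, 28))
    fg_alpha = Image.new("L", (W, H), 0)
    ImageDraw.Draw(fg_alpha).text((5, 4), word, fill=255, font=font)

    # Step 3: uniform foreground/background grey levels (colour clusters omitted).
    fg_colour, bg_colour = random.randint(0, 255), random.randint(0, 255)

    # Step 4: mild random projective distortion of the foreground layer.
    coeffs = [1 + random.uniform(-0.05, 0.05), random.uniform(-0.05, 0.05), 0,
              random.uniform(-0.05, 0.05), 1 + random.uniform(-0.05, 0.05), 0,
              random.uniform(-1e-4, 1e-4), random.uniform(-1e-4, 1e-4)]
    fg_alpha = fg_alpha.transform((W, H), Image.PERSPECTIVE, coeffs, Image.BILINEAR)

    # Step 5: composite the coloured layers, then blend with a real-image crop.
    base = Image.new("L", (W, H), bg_colour)
    base.paste(Image.new("L", (W, H), fg_colour), mask=fg_alpha)
    out = Image.blend(base, bg_crop, alpha=random.uniform(0.1, 0.4))

    # Step 6: blur plus additive Gaussian noise.
    out = out.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1.5)))
    arr = np.asarray(out, dtype=np.float32) + np.random.normal(0, 4, (H, W))
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```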
6.2 CNN Model

This section describes our model for word recognition. We formulate recognition as a multi-class classification problem, with one class per word, where words w are constrained to be selected from a pre-defined dictionary W. While the dictionary W of a natural language may seem too large for this approach to be feasible, in practice an advanced English vocabulary, including different word forms, contains only around 90k words, which is large but manageable.

In detail, we propose to use a CNN classifier where each word w ∈ W in the lexicon corresponds to an output neuron. We use a CNN with five convolutional layers and three fully-connected layers, with the exact details described in Sect. 8.2. The final fully-connected layer performs classification across the dictionary of words, so has the same number of units as the size of the dictionary we wish to recognise.

The predicted word recognition result w* out of the set of all dictionary words W in a language L for a given input image x is given by

    w* = arg max_{w ∈ W} P(w|x, L).    (2)

Since P(w|x, L) can be written as

    P(w|x, L) = P(w|x) P(w|L) P(x) / (P(x|L) P(w))    (3)

and with the assumptions that x is independent of L and that prior to any knowledge of our language all words are equally probable, our scoring function reduces to

    w* = arg max_{w ∈ W} P(w|x) P(w|L).    (4)

The per-word output probability P(w|x) is modelled by the softmax output of the final fully-connected layer of the recognition CNN, and the language based word prior P(w|L) can be modelled by a lexicon or frequency counts. A schematic of the network is shown in Fig. 5.

Fig. 5 A schematic of the CNN used for text recognition by word classification. The dimensions of the featuremaps at each layer of the network are shown

One limitation of this CNN model is that the input x must be a fixed, pre-defined size. This is problematic for word images, as although the height of the image is always one character tall, the width of a word image is highly dependent on the number of characters in the word, which can range between one and 23 characters. To overcome this issue, we simply resample the word image to a fixed width and height. Although this does not preserve the aspect ratio, the horizontal frequency distortion of image features most likely provides the network with word-length cues. We also experimented with different padding regimes to preserve the aspect ratio, but found that the results are not quite as good as performing naive resampling.

To summarise, for each proposal bounding box b ∈ B_f for image I we compute P(w|x_b, L) by cropping the image to I_b = c(b, I), resampling to fixed dimensions W × H such that x_b = R(I_b, W, H), and computing P(w|x_b) with the text recognition CNN and multiplying by P(w|L) (task dependent) to give a final probability distribution over words P(w|x_b, L).

7 Merging & Ranking

At this point in the pipeline, we have a set of word bounding boxes for each image B_f with their associated word probability distributions P_{B_f} = {p_b : b ∈ B_f}, where p_b = P(w|b, I) = P(w|x_b, L). However, this set of detections still contains a number of false-positive and duplicate detections of words, so a final merging and ranking of detections must be performed depending on the task at hand: text spotting or text based image retrieval.

7.1 Text Spotting

The goal of text spotting is to localise and recognise the individual words in the image. Each word should be labelled by a bounding box enclosing the word and the bounding box should have an associated text label.

For this task, we assign each bounding box b ∈ B_f a label w_b and score s_b according to b's maximum word probability:

    w_b = arg max_{w ∈ W} P(w|b, I),   s_b = max_{w ∈ W} P(w|b, I)    (5)

To cluster duplicate detections of the same word instance, we perform a greedy non maximum suppression (NMS) on detections with the same word label, aggregating the scores of suppressed proposals. This can be seen as positional voting for a particular word. Subsequently, we perform NMS to suppress non-maximal detections of different words with some overlap.
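The labelling of Eqs. (4)–(5) and the same-word NMS with score aggregation can be sketched as follows. The IoU threshold is an illustrative assumption, and the second, cross-word NMS pass is analogous and omitted.

```python
# Proposal labelling (posterior times language prior) and word-level NMS with
# positional score voting, as described in Sect. 7.1.

import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def label_proposals(boxes, word_probs, prior):
    """word_probs: N x V softmax outputs P(w|x_b); prior: length-V P(w|L)."""
    scores = word_probs * prior[None, :]                    # Eq. (4), up to normalisation
    labels = scores.argmax(axis=1)                          # w_b
    return labels, scores[np.arange(len(boxes)), labels]    # (w_b, s_b), Eq. (5)

def nms_same_word(boxes, labels, scores, thr=0.3):
    """Greedy NMS per word label, summing the scores of suppressed duplicates."""
    keep = []
    for w in np.unique(labels):
        idx = list(np.where(labels == w)[0][np.argsort(-scores[labels == w])])
        while idx:
            best, rest = idx[0], idx[1:]
            dup = [j for j in rest if iou(boxes[best], boxes[j]) > thr]
            keep.append((boxes[best], w, scores[best] + sum(scores[j] for j in dup)))
            idx = [j for j in rest if j not in dup]
    return keep
```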
Fig. 6 An example of the improvement in localisation of the word detection pharmacy through multiple rounds of recurrent regression

Our text recognition CNN is able to accurately recognise text in very loosely cropped word sub-images. Because of this, we find that some valid text spotting results have less than 0.5 overlap with groundtruth, but we require greater than 0.5 overlap for some applications (see Sect. 8.3).

To improve the overlap of detection results, we additionally perform multiple rounds of bounding box regression as in Sect. 5.2 and NMS as described above to further refine our detections. This can be seen as a recurrent regressor network. Each round of regression updates the prediction of each word's localisation, giving the next round of regression an updated context window to perform the next regression, as shown in Fig. 6. Performing NMS between each regression causes bounding boxes that have become similar after the latest round of regression to be grouped as a single detection. This generally causes the overlap of detections to converge on a higher, stable value with only a few rounds of recurrent regression.

The refined results, given by the tuple (b, w_b, s_b), are ranked by their scores s_b and a threshold determines the final text spotting result. For the direct comparison of scores across images, we normalise the scores of the results of each image by the maximum score for a detection in that image.

7.2 Image Retrieval

For the task of text based image retrieval, we wish to retrieve the list of images which contain the given query words. Localisation of the query word is not required, only optional for giving evidence for retrieving that image.

This is achieved by, at query time, assigning each image I a score s_I^Q for the query words Q = {q1, q2, . . .}, and sorting the images in the database I in descending order of score. It is also required that the score for all images can be computed fast enough to scale to databases of millions of images, allowing fast retrieval of visual content by text search. While retrieval is often performed for just a single query word (Q = {q}), we generalise our retrieval framework to be able to handle multiple query words.

We estimate the per-image probability distribution across word space P(w|I) by averaging the word probability distributions across all detections B_f in an image:

    p_I = P(w|I) = (1 / |B_f|) Σ_{b ∈ B_f} p_b.    (6)

This distribution is computed offline for all I ∈ I.

At query time, we can simply compute a score for each image s_I^Q representing the probability that the image I contains any of the query words Q. Assuming independence between the presence of query words,

    s_I^Q = Π_{q ∈ Q} P(q|I) = Π_{q ∈ Q} p_I(q),    (7)

where p_I(q) is just a lookup of the probability of word q in the word distribution p_I. These scores can be computed very quickly and efficiently by constructing an inverted index of p_I ∀ I ∈ I.

After a one-time, offline pre-processing to compute p_I and assemble the inverted index, a query can be processed across a database of millions of images in less than a second.
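A sketch of the offline/online split just described: p_I is computed once per image (Eq. 6), an inverted index maps words to the images where they are probable, and a query is scored with the product of Eq. (7). The data structures are an assumption for illustration; the paper does not prescribe this exact layout.

```python
# Offline: per-image word distributions and an inverted index.
# Online: query scoring by index lookups only.

from collections import defaultdict
import numpy as np

def image_word_distribution(detection_probs):
    """detection_probs: |B_f| x V array of P(w|b, I); returns p_I (Eq. 6)."""
    return detection_probs.mean(axis=0)

def build_inverted_index(p_by_image, min_prob=1e-4):
    """p_by_image: dict image_id -> p_I vector. One-time offline step."""
    index = defaultdict(dict)                 # word_id -> {image_id: p_I(w)}
    for image_id, p in p_by_image.items():
        for w in np.flatnonzero(p > min_prob):
            index[int(w)][image_id] = float(p[w])
    return index

def query_score(index, query_word_ids, image_ids):
    """s_I^Q = prod_q p_I(q) (Eq. 7); images are returned sorted by score."""
    scores = {i: 1.0 for i in image_ids}
    for q in query_word_ids:
        postings = index.get(q, {})
        for i in image_ids:
            scores[i] *= postings.get(i, 0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```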
8 Experiments

In this section we evaluate our pipeline on a number of standard text spotting and text based image retrieval benchmarks. We introduce the various datasets used for evaluation in Sect. 8.1, give the exact implementation details and results of each part of our pipeline in Sect. 8.2, and finally present the results on text spotting and image retrieval benchmarks in Sects. 8.3 and 8.4 respectively.

8.1 Datasets

We evaluate our pipeline on an extensive number of datasets. Due to different levels of annotation, the datasets are used for a combination of text recognition, text spotting, and image retrieval evaluation. The datasets are summarised in Tables 1, 2, and 3. The smaller lexicons provided by some datasets are used to reduce the search space to just text contained within the lexicons.

The Synth dataset is generated by our synthetic data engine of Sect. 6.1. We generate 9 million 32 × 100 images, with equal numbers of word samples from a 90k word dictionary. We use 900k of these for a testing dataset, 900k for validation, and the remaining for training. The 90k dictionary consists of the English dictionary from Hunspell (http://hunspell.sourceforge.net/), a popular open source spell checking system. This dictionary consists of 50k root words, and we expand this to include all the prefixes and suffixes possible, as well as adding in the test dataset words from the ICDAR, SVT and IIIT datasets—90k words in total. This dataset is publicly available at http://www.robots.ox.ac.uk/~vgg/data/text/.

ICDAR 2003 (IC03) (http://algoval.essex.ac.uk/icdar/datasets.html), ICDAR 2011 (IC11) (Shahab et al. 2011), and ICDAR 2013 (IC13) (Karatzas et al. 2013) are scene text recognition datasets consisting of 251, 255, and 233 full scene images respectively. The photos consist of a range of scenes and word level annotation is provided. Much of the test data is the same between the three datasets. For IC03, Wang et al. (2011) defines per-image 50 word lexicons (IC03-50) and a lexicon of all test groundtruth words (IC03-Full). For IC11, Mishra et al. (2013) defines a list of 538 query words to evaluate text based image retrieval.

The Street View Text (SVT) dataset (Wang et al. 2011) consists of 249 high resolution images downloaded from Google StreetView of road-side scenes. This is a challenging dataset with a lot of noise, as well as suffering from many unannotated words. Per-image 50 word lexicons (SVT-50) are also provided.

The IIIT 5k-word dataset (Mishra et al. 2012) contains 3000 cropped word images of scene text and digital images obtained from Google image search. This is the largest dataset for natural image text recognition currently available. Each word image has an associated 50 word lexicon (IIIT5k-50) and 1k word lexicon (IIIT5k-1k).

IIIT Scene Text Retrieval (STR) (Mishra et al. 2013) is a text based image retrieval dataset also collected with Google image search. Each of the 50 query words has an associated list of 10–50 images that contain the query word. There are also a large number of distractor images with no text downloaded from Flickr. In total there are 10k images and word bounding box annotation is not provided.

The IIIT Sports-10k dataset (Mishra et al. 2013) is another text based image retrieval dataset constructed from frames of sports video.
The images are low resolution and often noisy or blurred, with text generally located on advertisements and signboards, making this a challenging retrieval task. 10 query words are provided with 10k total images, without word bounding box annotations.

BBC News is a proprietary dataset of frames from British Broadcasting Corporation (BBC) programmes that were broadcast between 2007 and 2012. Around 5000 h of video (approximately 12 million frames) were processed to select 2.3 million keyframes at 1024 × 768 resolution. The videos are taken from a range of different BBC programmes on news and current affairs, including the BBC's Evening News programme. Text is often present in the frames from artificially inserted labels, subtitles, news-ticker text, and general scene text. No labels or annotations are provided for this dataset.

8.2 Implementation Details

We train a single model for each of the stages in our pipeline, and hyper parameters are selected using the training datasets of ICDAR and SVT. Exactly the same pipeline, with the same models and hyper parameters, is used for all datasets and experiments. This highlights the generalisability of our end-to-end framework to different datasets and tasks. The progression of detection recall and the number of proposals as the pipeline progresses can be seen in Fig. 7.

Fig. 7 The recall and the average number of proposals per image on IC03 after each stage of the pipeline: (a) Edge Box proposals, (b) ACF detector proposals, (c) proposal filtering, (d) bounding box regression, (e) regression NMS round 1, (f) regression NMS round 2, (g) regression NMS round 3. The recall computed is detection recall across the dataset (i.e. ignoring the recognition label) at 0.5 overlap. The detection precision is 13 % at the end of the pipeline (g)

8.2.1 Edge Boxes & ACF Detector

The Edge Box detector has a number of hyper parameters, controlling the stride of evaluation and non maximal suppression. We use the default values of α = 0.65 and β = 0.75 (see Zitnick and Dollár 2014 for details of these parameters). In practice, we saw little effect of changing these parameters on combined recall.

For the ACF detector, we set the number of decision trees to be 32, 128, 512 for each round of bootstrapping. For feature aggregation, we use 4 × 4 blocks smoothed with a [1 2 1]/4 filter, with 8 scales per octave. As the detector is trained for a particular aspect ratio, we perform detection at multiple aspect ratios in the range [1, 1.2, 1.4, . . . , 3] to account for variable sized words. We train on 30k cropped 32 × 100 positive word samples amalgamated from a number of training datasets as outlined in Jaderberg et al. (2014), and randomly sample negative patches from 11k images which do not contain text.

Figure 8 shows the performance of our proposal generation stage. The recall at 0.5 overlap of groundtruth labelled words in the IC03 and SVT datasets is shown as a function of the number of proposal regions generated per image. The maximum recall achieved using Edge Boxes is 92 %, and the maximum recall achieved by the ACF detector is around 70 %. However, combining the proposals from each method increases the recall to 98 % at 6k proposals and 97 % at 11k proposals for IC03 and SVT respectively. The average maximum overlap of a particular proposal with a groundtruth bounding box is 0.82 on IC03 and 0.77 on SVT, suggesting the region proposal techniques produce some accurate detections amongst the thousands of false-positives.

Fig. 8 The 0.5 overlap recall of different region proposal algorithms. The recall displayed in the legend for each method gives the maximum recall achieved. The curves are generated by decreasing the minimum score for a proposal to be valid, and terminate when no more proposals can be found. Due to the large number of region proposals and the small number of words contained in each image, the precision is negligible at high levels of recall (Color figure online)

This high recall and high overlap gives a good starting point to the rest of our pipeline, and has greatly reduced the search space of word detections from the tens of millions of possible bounding boxes to around 10k proposals per image.
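A small utility in the spirit of Figs. 7 and 8: given the combined proposal sets and the groundtruth word boxes for a set of images, compute the 0.5-IoU detection recall and the mean best overlap. The data layout (lists of boxes per image) is assumed for illustration.

```python
# Proposal coverage evaluation: recall at an IoU threshold and mean best overlap.

import numpy as np

def iou_matrix(props, gts):
    """props: N x 4, gts: M x 4 boxes as (x1, y1, x2, y2); returns N x M IoU."""
    px1, py1, px2, py2 = [props[:, i:i + 1] for i in range(4)]
    gx1, gy1, gx2, gy2 = [gts[None, :, i] for i in range(4)]
    iw = np.clip(np.minimum(px2, gx2) - np.maximum(px1, gx1), 0, None)
    ih = np.clip(np.minimum(py2, gy2) - np.maximum(py1, gy1), 0, None)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    return inter / np.maximum(union, 1e-9)

def proposal_recall(proposals_per_image, gt_per_image, thr=0.5):
    """Fraction of groundtruth words covered by at least one proposal, plus the
    average best overlap of each groundtruth word."""
    covered, total, best_overlaps = 0, 0, []
    for props, gts in zip(proposals_per_image, gt_per_image):
        gts = np.asarray(gts, dtype=float)
        if len(gts) == 0:
            continue
        if len(props) == 0:
            best = np.zeros(len(gts))
        else:
            best = iou_matrix(np.asarray(props, dtype=float), gts).max(axis=0)
        covered += int((best >= thr).sum())
        total += len(gts)
        best_overlaps.extend(best.tolist())
    mean_best = float(np.mean(best_overlaps)) if best_overlaps else 0.0
    return covered / max(total, 1), mean_best
```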
8.2.2 Random Forest Word Classifier

The random forest word/no-word binary classifier acts on cropped region proposals. These are resampled to a fixed 32 × 100 size, and HOG features are extracted with a cell size of 4, resulting in h ∈ R^(8×25×36), a 7200-dimensional descriptor. The random forest classifier consists of 10 trees with a maximum depth of 64.

For training, region proposals are extracted as we describe in Sect. 4 on the training datasets of ICDAR and SVT, with positive bounding box samples defined as having at least 0.5 overlap with groundtruth, and negative samples as less than 0.3 with groundtruth. Due to the abundance of negative samples, we randomly sample an equal number of negative samples to positive samples, giving 300k positive and 400k negative training samples.

Once trained, the result is a very effective false-positive filter. We select an operating probability threshold of 0.5, giving 96.6 and 94.8 % recall on IC03 and SVT positive proposal regions respectively. This filtering reduces the total number of region proposals to on average 650 (IC03) and 900 (SVT) proposals per image.

8.2.3 Bounding Box Regressor

The bounding box regression CNN consists of four convolutional layers with stride 1, with {filter size, number of filters} of {5, 64}, {5, 128}, {3, 256}, {3, 512} for each layer from the input respectively, followed by two fully-connected layers with 4k units and 4 units (one for each regression variable). All hidden layers are followed by rectified linear non-linearities, the inputs to convolutional layers are zero-padded to preserve dimensionality, and the convolutional layers are followed by 2 × 2 max pooling. The fixed sized input to the CNN is a 32 × 100 greyscale image which is zero centred by subtracting the image mean and normalised by dividing by the standard deviation.

The CNN is trained with stochastic gradient descent (SGD) with dropout (Hinton et al. 2012) on the fully-connected layers to reduce overfitting, minimising the L2 distance between the estimated and groundtruth bounding boxes (Eq. 1). We used 700k training examples of bounding box proposals with greater than 0.5 overlap with groundtruth, computed on the ICDAR and SVT training datasets.

Before the regression, the average positive proposal region (with over 0.5 overlap with groundtruth) had an overlap of 0.61 and 0.60 on IC03 and SVT. The CNN improves this average positive overlap to 0.88 and 0.70 for IC03 and SVT.

8.2.4 Text Recognition CNN

The text recognition CNN consists of eight weight layers—five convolutional layers and three fully-connected layers. The convolutional layers have the following {filter size, number of filters}: {5, 64}, {5, 128}, {3, 256}, {3, 512}, {3, 512}. The first two fully-connected layers have 4k units and the final fully-connected layer has the same number of units as the number of words in the dictionary—90k words in our case. The final classification layer is followed by a softmax normalisation layer. Rectified linear non-linearities follow every hidden layer, and all but the fourth convolutional layer are followed by 2 × 2 max pooling. The inputs to convolutional layers are zero-padded to preserve dimensionality. The fixed sized input to the CNN is a 32 × 100 greyscale image which is zero centred by subtracting the image mean and normalised by dividing by the standard deviation.

We train the network on Synth training data, back-propagating the standard multinomial logistic regression loss. Optimisation uses SGD with dropout regularisation of the fully-connected layers, and we dynamically lower the learning rate as training progresses. With uniform sampling of classes in the training data, we found the SGD batch size must be at least a fifth of the total number of classes in order for the network to train.

For very large numbers of classes (i.e. over 5k classes), the SGD batch size required to train effectively becomes large, slowing down training a lot. Therefore, for large dictionaries, we perform incremental training to avoid requiring a prohibitively large batch size. This involves initially training the network with 5k classes until partial convergence, after which an extra 5k classes are added. The original weights are copied for the previously trained classes, with the extra classification layer weights being randomly initialised. The network is then allowed to continue training, with the extra randomly initialised weights and classes causing a spike in training error, which is quickly trained away. This process of allowing partial convergence on a subset of the classes, before adding in more classes, is repeated until the full number of desired classes is reached.

At evaluation-time we do not do any data augmentation. If a lexicon is provided, we set the language prior P(w|L) to be equal probability for lexicon words, otherwise zero. In the absence of a lexicon, P(w|L) is calculated as the frequency of word w in a corpus (we use the opensubtitles.org English corpus) with power law normalisation. In total, this model contains around 500 million parameters and can process a word in 2.2 ms on a GPU with a custom version of Caffe (Jia 2013).
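The layer specification above is enough to reconstruct the recognition network. The sketch below does so in PyTorch, together with the output-layer expansion used for incremental training; it is a reconstruction from the stated sizes (the paper used a custom Caffe implementation), not the released model, and the 32 × 100 input maps to a 512 × 2 × 6 featuremap before the fully-connected layers under this layout.

```python
# Dictionary-classification CNN (5 conv + 3 FC, pooling after all but the
# fourth conv layer) and the classifier expansion for incremental training.

import torch
import torch.nn as nn

def make_word_cnn(num_classes):
    features = nn.Sequential(
        nn.Conv2d(1, 64, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),   # no pooling here
        nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    )
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 2 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
        nn.Linear(4096, num_classes),      # softmax is applied in the loss
    )
    return nn.Sequential(features, classifier)

def expand_output_layer(model, extra_classes):
    """Incremental training step: copy the trained weights for the existing
    classes and randomly initialise the rows for the newly added ones."""
    old = model[1][-1]
    new = nn.Linear(old.in_features, old.out_features + extra_classes)
    with torch.no_grad():
        new.weight[: old.out_features].copy_(old.weight)
        new.bias[: old.out_features].copy_(old.bias)
    model[1][-1] = new
    return model

# Example: start with 5k classes, then grow in 5k increments as described above.
# model = make_word_cnn(5000); model = expand_output_layer(model, 5000)
```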
Recognition Results We evaluate the accuracy of our text recognition model over a wide range of datasets and lexicon sizes. We follow the standard evaluation protocol by Wang et al. (2011) and perform recognition on the words containing only alphanumeric characters and at least three characters.

The results are shown in Table 4, and highlight the exceptional performance of our deep CNN. Although we train on purely synthetic data, with no human annotation, our model obtains significant improvements on state-of-the-art accuracy across all standard datasets. On IC03-50, the recognition problem is largely solved with 98.7 % accuracy—only 11 mistakes out of 860 test samples—and we significantly outperform the previous state-of-the-art (Bissacco et al. 2013) […] dataset-specific lexicon, is lower at 80.7 %. This reflects the difficulty of the SVT dataset, as image samples can be of very low quality, noisy, and with low contrast. The Synth dataset accuracy shows that our model really can recognise word samples consistently across the whole 90k dictionary.

Table 4 Comparison to previous methods for text recognition accuracy—where the groundtruth cropped word image is given as input. Compared methods include the Baseline (ABBYY) (Wang et al. 2011; Yao et al. 2014), Wang et al. (2011), Mishra et al. (2012), and Yao et al. (2014), evaluated on benchmarks including Synth, SVT-50, IC13, IIIT5k-50, and IIIT5k-1k. The ICDAR 2013 results given are case-insensitive. Bold results outperform previous state-of-the-art methods. The baseline method is from a commercially available document OCR system. a Recognition is constrained to a dictionary of 50k words

[…] into the contribution that the various stages of the synthetic data generation engine in Sect. 6.1 make to real-world recognition accuracy. We define two reduced recognition models […]

Fig. 9 […] (a) Black text rendered on a white background with a single font, Droid Sans. (b) Incorporating all of Google fonts. (c) […] The final accuracies on IC03 and SVT are 98.1 and 87.0 % respectively (y-axis: Recognition Accuracy %)
We see a larger gain in accuracy through incorporating natural image blending on the SVT dataset compared to the IC03 dataset. This is most likely due to the fact that there are more varied and complex backgrounds to text in SVT compared to in IC03.

8.3 Text Spotting

In the text spotting task, the goal is to localise and recognise the words in the test images. Unless otherwise stated, we follow the standard evaluation protocol by Wang et al. (2011) and ignore all words that contain non-alphanumeric characters or are less than three characters long. A positive recognition result is only valid if the detection bounding box has at least 0.5 overlap (IoU) with the groundtruth.

Table 5 shows the results of our text spotting pipeline compared to previous methods. We report the global F-measure over all images in the dataset. Across all datasets, our pipeline drastically outperforms all previous methods. On SVT-50, we increase the state-of-the-art by +20 % to a P/R/F (precision/recall/F-measure) of 0.85/0.68/0.76 compared to 0.73/0.45/0.56 in Jaderberg et al. (2014). Similarly impressive improvements can be seen on IC03, where in all lexicon scenarios we improve F-measure by at least +10 %, reaching a P/R/F of 0.96/0.85/0.90. Looking at the precision/recall curves in Fig. 10, we can see that our pipeline manages to maintain very high recall, and the recognition score of our text recognition system is a strong cue to the suitability of a detection.

We also give results across all datasets when no lexicon is given. As expected, the F-measure suffers from the lack of lexicon constraints, though it is still significantly higher than that of other comparable work. It should be noted that the SVT dataset is only partially annotated. This means that the measured precision (and therefore F-measure) is much lower than the true precision would be if the dataset were fully annotated, since many words that are detected are not annotated and are therefore recorded as false-positives. We can however report recall on SVT-50 and SVT of 71 and 59 % respectively.

Interestingly, when the overlap threshold is reduced to 0.3 (last row of Table 5), we see a small improvement across all datasets.
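For concreteness, the sketch below illustrates the end-to-end evaluation criterion described above (a detection counts only if it overlaps a groundtruth word by at least 0.5 IoU and the recognised word matches) and the resulting precision/recall/F-measure. The greedy one-to-one matching and the case-insensitive comparison are our assumptions about details the protocol does not spell out; all names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def spotting_prf(detections, groundtruth, min_overlap=0.5):
    """Global precision/recall/F-measure over a set of (box, word) pairs.

    A detection is a true positive if it overlaps a still-unmatched
    groundtruth word by at least `min_overlap` IoU and the recognised
    word matches (case-insensitive); each groundtruth word is matched
    at most once.
    """
    matched = [False] * len(groundtruth)
    true_positives = 0
    for box, word in detections:
        for i, (gt_box, gt_word) in enumerate(groundtruth):
            if (not matched[i] and word.lower() == gt_word.lower()
                    and iou(box, gt_box) >= min_overlap):
                matched[i] = True
                true_positives += 1
                break
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(groundtruth) if groundtruth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure
```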
Fig. 11 Some example text spotting results from SVT-50 (top row) and IC11 (bottom row). Red dashed boxes show the groundtruth and green boxes show correctly localised and recognised results. P/R/F figures are given above each image (Color figure online)
Mishra et al. (2013) also report retrieval results on SVT for released implementations of other text spotting algorithms. The method from Wang et al. (2011) achieves 21.3 % mAP, the method from Neumann and Matas (2012) achieves 23.3 % mAP, and the method proposed by Mishra et al. (2013) itself achieves 56.2 % mAP, compared to our own result of 86.3 % mAP.

However, as with the text spotting results for SVT, our retrieval results suffer from incomplete annotations on the SVT and Sports datasets—Fig. 12 shows how precision is hurt by this problem. The consequence is that the true mAP on SVT is higher than the reported mAP of 86.3 %.

Depending on the image resolution, our algorithm takes approximately 5–20 s to compute the end-to-end results per image on a single CPU core and single GPU. We analyse the time taken for each stage of our pipeline on the SVT dataset, which has an average image size of 1260 × 860, showing the results in Table 7. Since we reduce the number of proposals throughout the pipeline, we can allow the processing time per proposal to increase while keeping the total processing time for each stage stable. This affords us the use of more computationally complex features and classifiers as the pipeline progresses. Our method can be trivially parallelised, meaning we can process 1–2 images per second on a high-performance workstation with 16 physical CPU cores and 4 commodity GPUs.

Table 7 The processing time for each stage of the pipeline evaluated on the SVT dataset on a single CPU core and single GPU

Stage                 # Proposals   Time (s)   Time/proposal (ms)
(a) Edge Boxes        >10⁷          2.2        <0.002
(b) ACF detector      >10⁷          2.1        <0.002
(c) RF filter         10⁴           1.8        0.18
(d) CNN regression    10³           1.2        1.2
(e) CNN recognition   10³           2.2        2.2

As the pipeline progresses from (a) to (e), the number of proposals is reduced (starting from all possible bounding boxes), allowing us to increase our computational budget per proposal while keeping the overall processing time for each stage comparable.
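The last column of Table 7 follows directly from each stage's total time and its proposal count. The snippet below simply reproduces that column from the quoted figures as an illustrative check; it is not part of the pipeline, and since the proposal counts of the first two stages are lower bounds, their derived per-proposal times are upper bounds.

```python
# Stage name, approximate number of proposals, and total stage time (s)
# as quoted in Table 7.
stages = [
    ("Edge Boxes",      1e7, 2.2),
    ("ACF detector",    1e7, 2.1),
    ("RF filter",       1e4, 1.8),
    ("CNN regression",  1e3, 1.2),
    ("CNN recognition", 1e3, 2.2),
]

for name, n_proposals, seconds in stages:
    # Convert the per-stage budget into milliseconds per proposal.
    per_proposal_ms = 1e3 * seconds / n_proposals
    print(f"{name:16s} {per_proposal_ms:10.5f} ms/proposal")
```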
Fig. 13 The top two retrieval results for three queries on our BBC News dataset—hollywood, boris johnson, and vision. The frames and associated videos are retrieved from 5k hours of BBC video. We give the precision at 100 (P@100) for these queries, equivalent to the first page of results of our web application
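The P@100 figure quoted in Fig. 13 and below is simply the fraction of the top 100 ranked frames that actually contain the query text, judged manually since the BBC News dataset has no groundtruth annotations. A minimal sketch follows; the function and its arguments are illustrative, not part of the released system.

```python
def precision_at_k(ranked_frames, relevant_frames, k=100):
    """Fraction of the top-k retrieved frames that contain the query
    text; `relevant_frames` is a set of manually judged positives."""
    top_k = ranked_frames[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for frame in top_k if frame in relevant_frames)
    return hits / len(top_k)
```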
The high precision and speed of our pipeline allows us to process huge datasets for practical search applications. We demonstrate this on a 5000 h BBC News dataset. Building a search engine and front-end web application around our image retrieval pipeline allows a user to instantly search for visual occurrences of text within the huge video dataset. This works exceptionally well, with Fig. 13 showing some example retrieval results from our visual search engine. While we do not have groundtruth annotations to quantify the retrieval performance on this dataset, we measure the precision at 100 (P@100) for the test queries in Fig. 13, showing a P@100 of 100 % for the queries hollywood and boris johnson, and 93 % for vision. These results demonstrate the scalable nature of our framework.

9 Conclusions

Our system is fast and scalable, and we have shown that it can be applied to huge image and video datasets for instant text based image retrieval without any perceivable degradation in accuracy. Additionally, the ability of our recognition model to be trained purely on synthetic data allows our system to be easily re-trained for recognition of other languages or scripts, without any human labelling effort.

We set a new benchmark for text spotting and image retrieval. Moving into the future, we hope to explore additional recognition models to allow the recognition of unknown words and arbitrary strings.

Acknowledgments This work was supported by the EPSRC and ERC Grant VisRec No. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. We thank the BBC and in particular Rob Cooper for access to data and video processing resources.
References

Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the international conference on computer vision.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Campos, D. T., Babu, B. R., & Varma, M. (2009). Character recognition in natural images. In A. Ranchordas & H. Araújo (Eds.), VISAPP 2009—Proceedings of the fourth international conference on computer vision theory and applications, Lisboa, Portugal, February 5–8, 2009 (Vol. 2, pp. 273–280). INSTICC Press.

Chen, H., Tsai, S., Schroth, G., Chen, D., Grzeszczuk, R., & Girod, B. (2011). Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In Proceedings of international conference on image processing (ICIP) (pp. 2609–2612).

Chen, X., & Yuille, A. L. (2004). Detecting and reading text in natural scenes. In Computer vision and pattern recognition, 2004. CVPR 2004 (Vol. 2, pp. II-366). Piscataway, NJ: IEEE.

Cheng, M. M., Zhang, Z., Lin, W. Y., & Torr, P. (2014). Bing: Binarized normed gradients for objectness estimation at 300fps. In 2014 IEEE conference on computer vision and pattern recognition, CVPR 2014, Columbus, OH, USA, June 23–28, 2014 (pp. 3286–3293). Piscataway, NJ: IEEE. doi:10.1109/CVPR.2014.414.

Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 1532–1545.

Dollár, P., Belongie, S., & Perona, P. (2010). The fastest pedestrian detector in the west. In F. Labrosse, R. Zwiggelaar, Y. Liu & B. Tiddeman (Eds.), British Machine Vision Conference, BMVC 2010, Aberystwyth, UK, August 31–September 3, 2010. Proceedings (pp. 1–11). British Machine Vision Association. doi:10.5244/C.24.68.

Dollár, P., & Zitnick, C. L. (2013). Structured forests for fast edge detection. In 2013 IEEE international conference on computer vision (ICCV) (pp. 1841–1848). IEEE.

Dollár, P., & Zitnick, C. L. (2014). Fast edge detection using structured forests. arXiv:1406.5549.

Epshtein, B., Ofek, E., & Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2963–2970). IEEE.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

Felzenszwalb, P., & Huttenlocher, D. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.

Felzenszwalb, P. F., Grishick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1627–1645.

Fischer, A., Keller, A., Frinken, V., & Bunke, H. (2010). HMM-based word spotting in handwritten documents using subword models. In 2010 20th International conference on pattern recognition (ICPR) (pp. 3416–3419). IEEE.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–407.

Frinken, V., Fischer, A., Manmatha, R., & Bunke, H. (2012). A novel word spotting method based on recurrent neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2), 211–224.

Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

Goel, V., Mishra, A., Alahari, K., & Jawahar, C. V. (2013). Whole is greater than sum of parts: Recognizing scene text words. In 2013 12th International conference on document analysis and recognition, Washington, DC, USA, August 25–28, 2013 (pp. 398–402). IEEE Computer Society. doi:10.1109/ICDAR.2013.87.

Gomez, L., & Karatzas, D. (2013). Multi-script text extraction from natural scenes. In 2013 12th International conference on document analysis and recognition (ICDAR) (pp. 467–471). IEEE.

Gomez, L., & Karatzas, D. (2014). A fast hierarchical method for multi-script and arbitrary oriented scene text extraction. arXiv:1407.7504.

Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., & Shet, V. (2013). Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv:1312.6082.

Gordo, A. (2014). Supervised mid-level features for word image representation. CoRR. arXiv:1410.5224.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR arXiv:1207.0580.

Huang, W., Qiao, Y., & Tang, X. (2014). Robust scene text detection with convolution neural network induced MSER trees. In D. J. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer vision ECCV 2014 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part IV (pp. 497–511). New York City: Springer.

Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227.

Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep features for text spotting. In European conference on computer vision.

Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Mestre, S. R., Mas, J., et al. (2013). ICDAR 2013 robust reading competition. In ICDAR (pp. 1484–1493). Piscataway, NJ: IEEE.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Lucas, S. (2005). ICDAR 2005 text locating competition results. In Proceedings of the eighth international conference on document analysis and recognition, 2005 (pp. 80–84). IEEE.

Lucas, S., Panaretos, A., Sosa, L., Tang, A., Wong, S., & Young, R. (2003). ICDAR 2003 robust reading competitions. In Proceedings of ICDAR.

Manmatha, R., Han, C., & Riseman, E. M. (1996). Word spotting: A new approach to indexing handwriting. In Proceedings CVPR'96, 1996 IEEE computer society conference on computer vision and pattern recognition, 1996 (pp. 631–637). IEEE.

Matas, J., Chum, O., Urban, M., & Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference (pp. 384–393).

Mishra, A., Alahari, K., & Jawahar, C. (2012). Scene text recognition using higher order language priors. In Proceedings of the British Machine Vision Conference (pp. 127.1–127.11). BMVA Press.

Mishra, A., Alahari, K., & Jawahar, C. (2013). Image retrieval using textual cues. In 2013 IEEE international conference on computer vision (ICCV) (pp. 3040–3047). IEEE.

Neumann, L., & Matas, J. (2010). A method for text localization and recognition in real-world images. In Proceedings of the Asian conference on computer vision (pp. 770–783). Springer.

Neumann, L., & Matas, J. (2011). Text localization in real-world images using efficiently pruned exhaustive search. In Proceedings of ICDAR (pp. 687–691). IEEE.

Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.

Neumann, L., & Matas, J. (2013). Scene text localization and recognition with oriented stroke detection. In Proceedings of the international conference on computer vision (pp. 97–104).

Novikova, T., Barinova, O., Kohli, P., & Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In Proceedings of the European conference on computer vision (pp. 752–765). Springer.

Ozuysal, M., Fua, P., & Lepetit, V. (2007). Fast keypoint recognition in ten lines of code. In Proceedings of the IEEE conference on computer vision and pattern recognition.

Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010). Large-scale image retrieval with compressed fisher vectors. In Proceedings of the IEEE conference on computer vision and pattern recognition.

Posner, I., Corke, P., & Newman, P. (2010). Using text-spotting to query the world. In 2010 IEEE/RSJ international conference on intelligent robots and systems, October 18–22, 2010, Taipei, Taiwan (pp. 3181–3186). Piscataway, NJ: IEEE. doi:10.1109/IROS.2010.5653151.

Quack, T. (2009). Large scale mining and retrieval of visual data in a multimodal context. Ph.D. Thesis, ETH Zurich.

Rath, T., & Manmatha, R. (2007). Word spotting for historical documents. IJDAR, 9(2–4), 139–152.

Rodriguez-Serrano, J. A., Perronnin, F., & Meylan, F. (2013). Label embedding for text recognition. In Proceedings of the British Machine Vision Conference.

Shahab, A., Shafait, F., & Dengel, A. (2011). ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Proceedings of ICDAR (pp. 1491–1496). IEEE.

Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. Piscataway, NJ: Institute of Electrical and Electronics Engineers, Inc.

Uijlings, J. R., van de Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition.

Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In Proceedings of the international conference on computer vision (pp. 1457–1464). IEEE.

Wang, T., Wu, D. J., Coates, A., & Ng, A. Y. (2012). End-to-end text recognition with convolutional neural networks. In ICPR (pp. 3304–3308). IEEE.

Weinman, J. J., Butler, Z., Knoll, D., & Feild, J. (2014). Toward integrated scene text reading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 375–387. doi:10.1109/TPAMI.2013.126.

Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In 2014 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4042–4049). IEEE.

Yi, C., & Tian, Y. (2011). Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9), 2594–2605.

Yin, X. C., Yin, X., & Huang, K. (2013). Robust text detection in natural scene images. CoRR arXiv:1301.2628.

Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In D. J. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer vision ECCV 2014 13th European conference, Zurich, Switzerland, September 6–12, 2014, proceedings, part IV (pp. 391–405). New York City: Springer.
123