License Plate Detection and Recognition in Unconstrained Scenarios
1 Introduction
Several traffic-related applications, such as detection of stolen vehicles, toll con-
trol and parking lot access validation, involve vehicle identification, which is
performed by Automatic License Plate Recognition (ALPR) systems. The re-
cent advances in Parallel Processing and Deep Learning (DL) have contributed
to improving many computer vision tasks, such as Object Detection/Recognition
and Optical Character Recognition (OCR), which clearly benefit ALPR sys-
tems. In fact, deep Convolutional Neural Networks (CNNs) have been the lead-
ing machine learning technique applied for vehicle and license plate (LP) de-
tection [18,28,19,3,2,9,31,17]. Along with academic papers, several commercial
ALPR systems have also been exploring DL methods. They are usually hosted in large data centers and operate through web services, processing thousands to millions of images per day while being constantly improved. Examples include Sighthound, OpenALPR and Amazon Rekognition.
2 Related Work
ALPR is the task of finding and recognizing license plates in images. It is com-
monly broken into four subtasks that form a sequential pipeline: vehicle detec-
tion, license plate detection, character segmentation and character recognition.
For simplicity, we refer to the combination of the last two subtasks as OCR.
Many different ALPR systems or related subtasks have been proposed in the
past, typically using image binarization or gray-scale analysis to find candidate
proposals (e.g. LPs and characters), followed by handcrafted feature extraction
methods and classical machine learning classifiers [1,4]. With the rise of DL,
the state of the art started moving in another direction, and nowadays many
works employ CNNs due to their high accuracy for generic object detection and
recognition [23,24,21,25,8,11].
Related to ALPR are Scene Text Spotting (STS) and number reading in the
wild (e.g. from Google Street View images [22]), whose goals are to
find and read text/numbers in natural scenes. Although ALPR could be seen as
a particular case of STS, the two problems present distinct characteristics: in
ALPR, we need to learn characters and numbers (without much font variabil-
ity) with no semantic information, while STS is focused on textual information
containing high font variability, and possibly exploring lexical and semantic in-
formation, as in [30]. Number reading does not present semantic information,
but dealing only with digits is simpler than the ALPR context, since it avoids
common digit/letter confusions such as B-8, D-0, 1-I and 5-S.
As the main contribution of this work is a novel LP detection network, we
start this section by reviewing DL-based approaches for this specific subtask, as
well as a few STS methods that can handle distorted text and could be used for
LP detection. Next, we move to complete ALPR DL-based systems.
The success of YOLO networks [23,24] inspired many recent works, targeting
real-time performance for LP detection [28,9,31,17]. Slightly modified versions of the YOLO [23] and YOLOv2 [24] networks were used by Hsu et al. [9], who enlarged the networks' output granularity to increase the number of detections and set the probabilities for two classes (LP and background). Their
network achieved a good compromise between precision and recall, but the paper
lacks a detailed evaluation over the bounding boxes extracted. Moreover, it is
known that YOLO networks struggle to detect small-sized objects, so further
evaluation in scenarios where the car is far from the camera is needed.
In [31], a setup of two YOLO-based networks was trained with the goal of
detecting rotated LPs. The first network is used to find a region containing
the LP, called “attention model”, and the second network captures a rotated
rectangular bounding-box of the LP. Nonetheless, they considered only on-plane
rotations, and not more complex deformations caused by oblique camera views,
such as the ones illustrated in Fig. 1. Also, as they do not present a complete
ALPR system, it is difficult to evaluate how well an OCR method would perform
on the detected regions.
License plate detectors using sliding window approaches or candidate filtering
coupled with CNNs can also be found in the literature [3,2,27]. However, they
tend to be computationally inefficient as a result of not sharing computations, in contrast to modern meta-architectures for object detection such as YOLO, SSD [21] and Faster R-CNN [25].
Although Scene Text Spotting (STS) methods focus mostly on large font
variations and lexical/semantic information, it is worth mentioning a few
approaches that deal with rotated/distorted text and could be explored for LP
detection in oblique views. Jaderberg and colleagues [13] presented a CNN-based
approach for text recognition in natural scenes using an entirely synthetic dataset
to train the model. Despite the good results, they strongly rely on N-grams,
which are not applicable to ALPR. Gupta et al. [7] also explored a synthetic dataset, realistically pasting text into real images and focusing mostly on text localization. The output is a rotated bounding box around the text, which is limited when handling the off-plane rotations common in ALPR scenarios.
More recently, Wang et al. [29] presented an approach to detect text in a
variety of geometric positions, called Instance Transformation Network (ITN).
It is basically a composition of three CNNs: a backbone network to compute
features, a transformation network to infer affine parameters at feature-map
locations where text supposedly exists, and a final classification network whose input
is built by sampling features according to the affine parameters. Although this
approach can (in theory) handle off-plane rotations, it is not able to correctly
infer the transformation that actually maps the text region to a rectangle, since
there is no physical (or clear psychological) bounding region around the text
that should map to a rectangle in an undistorted view. In ALPR, the LP is rect-
angular and planar by construction, and we explore this information to regress
the transformation parameters, as detailed in Section 3.2.
The works of Silva and Jung [28] and Laroca et al. [17] presented complete ALPR
systems based on a series of modified YOLO networks. Two distinct networks
were used in [28], one to jointly detect cars and LPs, and another to perform
OCR. A total of five networks were used in [17], basically one for each ALPR subtask, two of them for character recognition. Both reported real-time systems, but they are focused only on Brazilian license plates and were not trained to capture distortion, handling only frontal and nearly rectangular LPs.
Selmi et al. [27] used a series of pre-processing approaches based on mor-
phological operators, Gaussian filtering, edge detection and geometry analysis
to find LP candidates and characters. Then, two distinct CNNs were used to
(i) classify a set of LP candidates per image into one single positive sample;
and (ii) to recognize the segmented characters. The method handles a single LP
per image, and according to the authors, distorted LPs and poor illumination
conditions can compromise the performance.
Li et al. [19] presented a network based on Faster R-CNN [25]. Briefly, a
Region Proposal Network is assigned to find candidate LP regions, whose corre-
sponding feature maps are cropped by a RoI Pooling layer. Then, these candi-
dates are fed into the final part of the network, which computes the probability
of being/not being an LP, and performs OCR through a Recurrent Neural Net-
work. Despite being promising, the evaluation presented by the authors shows a lack of performance in the most challenging scenarios, which contain oblique LPs.
Commercial systems are good reference points for the state of the art. Although they usually provide only partial (or no) information about their architecture, we can still use them as black boxes to evaluate the final output.
As mentioned in Section 1, examples are Sighthound, OpenALPR (which is an
official NVIDIA partner in the Metropolis platform2 ) and Amazon Rekognition
(a general-purpose AI engine including a text detection and recognition module
that can be used for LP recognition, as informed by the company).
3.1 Vehicle Detection

Since vehicles are among the underlying objects present in many classical detection and recognition datasets, such as PASCAL-VOC [5], ImageNet [26], and COCO [20], we decided not to train a detector from scratch, and instead chose a known model to perform vehicle detection, considering a few criteria. On one hand, a high recall rate is desired, since any missed vehicle with a visible LP leads directly to an overall LP detection miss. On the other hand, high precision is also desirable to keep running times low, as each falsely detected vehicle must be verified by WPOD-NET.
² NVIDIA platform for video analysis in smart cities (https://www.nvidia.com/en-us/autonomous-machines/intelligent-video-analytics-platform/).
[Figure: the full ALPR pipeline — input image → car detection (YOLOv2) → license plate detection and rectification (WPOD-NET) → OCR (OCR-NET), producing LP strings such as MLC3534.]
Based on these considerations, we decided to use the YOLOv2 network due to its fast execution (around 70 FPS) and its good compromise between precision and recall (76.8% mAP on the PASCAL-VOC dataset). We did not change or refine YOLOv2: we used the network as a black box, merging the outputs related to vehicles (i.e. cars and buses) and ignoring the other classes.
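For illustration, a minimal sketch of this class-merging step follows; the detection-tuple format and the helper name are our assumptions (YOLOv2 itself is used unmodified):

```python
# Hypothetical detection format: (class_label, confidence, bounding_box).
VEHICLE_CLASSES = {"car", "bus"}   # YOLOv2 classes merged into one "vehicle" class
VEHICLE_THRESHOLD = 0.5            # acceptance threshold used in Section 4

def filter_vehicles(detections):
    """Keep confident car/bus detections, ignoring all other classes."""
    return [(conf, box) for label, conf, box in detections
            if label in VEHICLE_CLASSES and conf >= VEHICLE_THRESHOLD]
```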
The positive detections are then resized before being fed to WPOD-NET.
As a rule of thumb, larger input images allow the detection of smaller objects
but increase the computational cost [12]. In roughly frontal/rear views, the ratio
between the LP size and the vehicle bounding box (BB) is high. However, this
ratio tends to be much smaller for oblique/lateral views, since the vehicle BB
tends to be larger and more elongated. Hence, oblique views should be resized
to a larger dimension than frontal ones to keep the LP region still recognizable.
Although 3D pose estimation methods such as [32] might be used to deter-
mine the resize scale, this work presents a simple and fast procedure based on the
aspect ratio of the vehicle BB. When it is close to one, a smaller dimension can
be used, and it must be increased as the aspect ratio gets larger. More precisely,
the resizing factor fsc is given by
$$f_{sc} = \frac{1}{\min(W_v, H_v)}\,\min\!\left( D_{min}\,\frac{\max(W_v, H_v)}{\min(W_v, H_v)},\; D_{max} \right), \qquad (1)$$
where Wv and Hv are the width and height of the vehicle BB, respectively.
Note that Dmin ≤ fsc min(Wv , Hv ) ≤ Dmax , so that Dmin and Dmax delimit
the range for the smallest dimension of the resized BB. Based on experiments
and trying to keep a good compromise between accuracy and running times, we
selected Dmin = 288 and Dmax = 608.
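A minimal sketch of this resizing step, assuming OpenCV-style image arrays (the helper names are ours):

```python
import cv2

D_MIN, D_MAX = 288, 608  # bounds for the smallest dimension of the resized BB

def resize_factor(w, h):
    """f_sc from Eq. (1): grows with the aspect ratio, clamped by D_MAX."""
    small, ratio = min(w, h), max(w, h) / min(w, h)
    return min(D_MIN * ratio, D_MAX) / small

def resize_vehicle(crop):
    """Resize a detected vehicle crop before feeding it to WPOD-NET."""
    h, w = crop.shape[:2]
    f = resize_factor(w, h)
    return cv2.resize(crop, (round(w * f), round(h * f)))
```

For instance, an elongated 600×200 crop gives a ratio of 3 and f_sc = min(288·3, 608)/200 = 3.04, so its smallest dimension becomes 608, while a square 200×200 crop gets f_sc = 288/200 = 1.44 and stays at the lower bound of 288.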
Fig. 3: Fully convolutional detection of planar objects (cropped for better visualization): for each (m, n) cell of the M × N output volume, the WPOD network outputs object probabilities and the affine parameters defining T_mn.
[Figure: WPOD-NET building blocks — 3×3 convolutions (starting with 16 filters), ReLU activations, residual blocks RESBLOCK(N) (e.g. RESBLOCK(128)), and a final detection block.]
The affine parameters v3, . . . , v8 estimated at each cell define a local transformation applied to points q of the canonical square:

$$T_{mn}(q) = \begin{bmatrix} \max(v_3, 0) & v_4 \\ v_5 & \max(v_6, 0) \end{bmatrix} q + \begin{bmatrix} v_7 \\ v_8 \end{bmatrix}, \qquad (2)$$

where the max function used for v3 and v6 was adopted to ensure that the
diagonal is positive (avoiding undesired mirroring or excessive rotations).
To match the network output resolution, the points pi are re-scaled by the
inverse of the network stride, and re-centered according to each point (m, n) in
the feature map. This is accomplished by applying a normalization function
$$A_{mn}(p) = \frac{1}{\alpha}\left(\frac{1}{N_s}\,p - \begin{bmatrix} n \\ m \end{bmatrix}\right), \qquad (3)$$
where α is a scaling constant that represents the side of the canonical square. We
set α = 7.75, which is the midpoint between the maximum and minimum LP
dimensions in the augmented training data divided by the network stride N_s.
Assuming that there is an object (LP) at cell (m, n), the first part of the loss
function considers the error between a warped version of the canonical square
and the normalized annotated points of the LP, given by
$$f_{affine}(m,n) = \sum_{i=1}^{4} \left\lVert T_{mn}(q_i) - A_{mn}(p_i) \right\rVert_1 . \qquad (4)$$
The second part of the loss function handles the probability of having/not
having an object at (m, n). It is similar to the SSD confidence loss [21], being
essentially the sum of two log-loss functions over the object/non-object scores v1 and v2:

$$f_{probs}(m,n) = \mathrm{logloss}(\mathbb{I}_{obj}, v_1) + \mathrm{logloss}(1 - \mathbb{I}_{obj}, v_2), \qquad (5)$$

where $\mathbb{I}_{obj}$ is the object indicator function that returns 1 if there is an object at
point (m, n) or 0 otherwise, and logloss(y, p) = −y log(p). An object is considered
inside a point (m, n) if its rectangular bounding box presents an IoU larger than
a threshold γobj (set empirically to 0.3) w.r.t. another bounding box of the same
size and centered at (m, n).
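A sketch of this assignment rule (the box representation and the feature-map coordinate convention are our assumptions):

```python
GAMMA_OBJ = 0.3  # empirical IoU threshold from the text

def iou(a, b):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def object_indicator(lp_box, m, n):
    """I_obj: 1 if an equally-sized box centered at cell (m, n) overlaps
    the LP bounding box with IoU above GAMMA_OBJ, else 0."""
    w, h = lp_box[2] - lp_box[0], lp_box[3] - lp_box[1]
    centered = (n - w / 2, m - h / 2, n + w / 2, m + h / 2)
    return 1 if iou(lp_box, centered) > GAMMA_OBJ else 0
```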
The final loss function is given by a combination of the terms defined in
Eqs. (4) and (5):
$$\mathrm{loss} = \sum_{m=1}^{M} \sum_{n=1}^{N} \Big[ \mathbb{I}_{obj}\, f_{affine}(m,n) + f_{probs}(m,n) \Big]. \qquad (6)$$
Given the reduced number of annotated images in the training dataset, the
use of data augmentation is crucial. Several augmentation transforms are applied
to the training samples, as illustrated in Fig. 6.
Fig. 6: Different augmentations for the same sample. The red quadrilateral rep-
resents the transformed LP annotation.
3.3 OCR
The character segmentation and recognition over the rectified LP is performed
using a modified YOLO network, with the same architecture presented in [28].
However, the training dataset was considerably enlarged in this work by using
synthetic and augmented data to cope with LP characteristics of different regions
around the world (Europe, United States and Brazil)3 .
³ We also used Taiwanese LPs, but we could not find information in English about the font type used in this country to include it in the artificial data generation.
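As a toy illustration of how such artificial samples can be generated (the layout, colors and helper name are hypothetical; the actual generator also mimics region-specific fonts and layouts):

```python
import random
import string
from PIL import Image, ImageDraw, ImageFont  # assuming Pillow is available

def synth_plate(font_path, size=(240, 80)):
    """Render a random 7-character string on a plain background; a real
    pipeline would also randomize noise, blur and geometric distortion."""
    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=7))
    img = Image.new("RGB", size, "lightgray")
    draw = ImageDraw.Draw(img)
    draw.text((10, 15), text, fill="black",
              font=ImageFont.truetype(font_path, 48))
    return img, text  # sample image and its ground-truth label
```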
3.4 Datasets
One of our goals is to develop a technique that performs well in a variety of un-
constrained scenarios, but that should also work well in controlled ones (such as
mostly frontal views). Therefore, we chose four datasets available online, namely
OpenALPR (BR and EU)4 , SSIG and AOLP (RP), which cover many different
situations, as summarized in the first part of Table 1. We consider three distinct
variables: LP angle (frontal and oblique), distance from vehicles to the camera
(close, intermediate and far), and the region where the pictures were taken.
In terms of the distance from the camera to the vehicles, the SSIG dataset appears to be the most challenging one: it is composed of high-resolution images, so that LPs from distant vehicles are still readable. None of these datasets presents LPs from multiple (simultaneous) vehicles at once.
Although all these databases together cover numerous situations, to the best of our knowledge the literature lacks a more general-purpose dataset with challenging images. Thus, an additional contribution of this work is the manual annotation of a new set of 102 images (named CD-HARD) selected from the Cars Dataset [16], covering a variety of challenging situations. We selected mostly images with strongly distorted but still human-readable LPs. Some of
these images (crops around the LP region) are shown in Fig. 1, which was used
to motivate the problem tackled in this work.
4 Experimental Results
This section covers the experimental analysis of our full ALPR system, as well
as comparisons with other state-of-the-art methods and commercial systems.
Unfortunately, most academic ALPR papers focus on specific scenarios (e.g.
single country or region, environment conditions, camera position, etc.). As a
result, there are many scattered datasets available in the literature, each one
evaluated by a subset of methods. Moreover, many papers are focused only on
LP detection or character segmentation, which further limits the comparison
possibilities for the full ALPR pipeline. In this work, we used four independent
datasets to evaluate the accuracy of the proposed method in different scenarios
and region layouts. We also show comparisons with commercial products and
papers that present full ALPR systems.
The proposed approach presents three networks in the pipeline, for which we
empirically set the following acceptance thresholds: 0.5 for vehicle (YOLOv2)
and LP (WPOD-NET) detection, and 0.4 for character detection and recognition
(OCR-NET). Also, it is worth noting that the characters “I” and “1” are identical
for Brazilian LPs. Hence, they were considered as a single class in the evaluation
of the OpenALPR BR and SSIG datasets. No other heuristic or post-processing
was applied to the results produced by the OCR module.
We evaluate the system in terms of the percentage of correctly recognized
LPs, where an LP is considered correct if all characters were correctly recognized,
and no additional characters were detected. It is important to note that the exact
same networks were applied to all datasets: no specific training procedure was
used to tune the networks for a given type of LP (e.g. European or Taiwanese).
The only slight modification performed in the pipeline was for the AOLP Road
Patrol dataset. In this dataset, the vehicles are very close to the camera (causing
the vehicle detector to fail in several cases), so that we directly applied the LP
detector (WPOD-NET) to the input images.
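Returning to the evaluation criterion above, a minimal sketch of the correctness check (the function and flag names are ours):

```python
def lp_correct(pred, truth, merge_i1=False):
    """An LP is correct only if all characters match and nothing extra was
    detected; merge_i1 collapses 'I'/'1' into one class, as done when
    evaluating the OpenALPR BR and SSIG datasets."""
    if merge_i1:
        pred, truth = pred.replace("I", "1"), truth.replace("I", "1")
    return pred == truth
```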
To show the benefits of including fully synthetic data in the OCR-NET training procedure, we evaluated our system using two sets of training data: (i) real augmented data plus artificially generated samples; and (ii) only real augmented data.
These two versions are denoted by “Ours” and “Ours (no artf.)”, respectively, in
Table 2. As can be observed, the addition of fully synthetic data improved the
accuracy in all tested datasets (with a gain of ≈ 5% on the AOLP RP dataset).
Moreover, to highlight the improvements of rectifying the detection bounding
box, we also present the results of using a regular non-rectified bounding box,
identified as “Ours (unrect.)” in Table 2. As expected, the results do not vary
much in the mostly frontal datasets (being even slightly better for ALPR-EU),
but there was a considerable accuracy drop in datasets with challenging oblique
LPs (AOLP-RP and the proposed CD-HARD).
Table 2 also shows the results of competing (commercial and academic)
systems, indicating that our system achieved recognition rates comparable to
commercial ones in databases representing more controlled scenarios, where the
LPs are mostly frontal (OpenALPR EU and BR, and SSIG). More precisely, it
was the second best method in both OpenALPR datasets, and the top one in SSIG.
In the challenging scenarios (AOLP RP and the proposed CD-HARD dataset),
however, our system outperformed all compared approaches by a significant mar-
gin (over 7% accuracy gain when compared to the second best result).
It is worth mentioning that the works of Li et al. [18,19], Hsu et al. [10] and
Laroca et al. [17] are focused on a single region or dataset. By outperforming
them, we demonstrate a strong generalization capacity. It is also important to
note that the full LP recognition rate for the most challenging datasets (AOLP-
RP and CD-HARD) was higher than directly applying the OCR module to the
annotated rectangular LP bounding boxes (79.21% for AOLP-RP and 53.85%
for CD-HARD). This gain is due to the unwarping allowed by WPOD-NET,
which greatly helps the OCR task when the LP is strongly distorted. To illus-
trate this behavior, we show in Fig. 8 the detected and unwarped LPs for the
images in Fig. 1, as well as the final recognition result produced by OCR-NET.
The detection score of the top right LP was below the acceptance threshold,
illustrating a false negative example.
Fig. 8: Detected/unwarped LPs from the images in Fig. 1 and final ALPR results: ZCAA30, GNO6BGV, C24JBH, ACAC1350 and MXH4622, with the top-right LP missed.
References
12. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., Murphy, K.: Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3296–3297. IEEE (jul 2017). https://doi.org/10.1109/CVPR.2017.351, http://ieeexplore.ieee.org/document/8099834/
13. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. In: NIPS Deep Learning Workshop. pp. 1–10 (2014)
14. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer
networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R.
(eds.) Advances in Neural Information Processing Systems 28, pp. 2017–2025. Cur-
ran Associates, Inc. (2015)
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR
abs/1412.6980 (2014)
16. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-
grained categorization. In: 4th International IEEE Workshop on 3D Representation
and Recognition (3dRR-13). Sydney, Australia (2013)
17. Laroca, R., Severo, E., Zanlorensi, L.A., Oliveira, L.S., Gonçalves, G.R., Schwartz,
W.R., Menotti, D.: A robust real-time automatic license plate recognition based on
the YOLO detector. CoRR abs/1802.09567 (2018), http://arxiv.org/abs/1802.09567
18. Li, H., Shen, C.: Reading Car License Plates Using Deep Convolutional Neural
Networks and LSTMs. arXiv preprint arXiv:1601.05610 (jan 2016), http://arxiv.org/abs/1601.05610
19. Li, H., Wang, P., Shen, C.: Towards end-to-end car license plates detection and
recognition with deep neural networks. CoRR abs/1709.08828 (2017), http://arxiv.org/abs/1709.08828
20. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: Fleet, D., Pajdla,
T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755.
Springer International Publishing, Cham (2014)
21. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single Shot MultiBox Detector. In: Computer Vision – ECCV 2016. pp. 21–37 (2016). https://doi.org/10.1007/978-3-319-46448-0_2
22. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits
in natural images with unsupervised feature learning. In: NIPS workshop on deep
learning and unsupervised feature learning. vol. 2011, p. 5 (2011)
23. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look
Once: Unified, Real-Time Object Detection. In: 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). pp. 779–788. IEEE
(jun 2016). https://doi.org/10.1109/CVPR.2016.91, http://ieeexplore.ieee.org/document/7780460/
24. Redmon, J., Farhadi, A.: YOLO9000: Better, Faster, Stronger. In: 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6517–6525.
IEEE (jul 2017). https://doi.org/10.1109/CVPR.2017.690, http://arxiv.org/abs/1612.08242, http://ieeexplore.ieee.org/document/8100173/
25. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-
Time Object Detection with Region Proposal Networks. IEEE Transactions
on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (jun 2017).
https://doi.org/10.1109/TPAMI.2016.2577031
26. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
27. Selmi, Z., Ben Halima, M., Alimi, A.M.: Deep Learning System for Automatic
License Plate Detection and Recognition. In: 2017 14th IAPR International Con-
ference on Document Analysis and Recognition (ICDAR). pp. 1132–1138. IEEE
(nov 2017). https://doi.org/10.1109/ICDAR.2017.187, http://ieeexplore.ieee.org/document/8270118/
28. Silva, S.M., Jung, C.R.: Real-time brazilian license plate detection and recogni-
tion using deep convolutional neural networks. In: 2017 30th SIBGRAPI Con-
ference on Graphics, Patterns and Images (SIBGRAPI). pp. 55–62 (Oct 2017).
https://doi.org/10.1109/SIBGRAPI.2017.14
29. Wang, F., Zhao, L., Li, X., Wang, X., Tao, D.: Geometry-Aware Scene Text De-
tection with Instance Transformation Network. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 1381–1389. Salt Lake City (2018)
30. Weinman, J.J., Learned-Miller, E., Hanson, A.R.: Scene text recognition using
similarity and a lexicon with sparse belief propagation. IEEE Transactions on
pattern analysis and machine intelligence 31(10), 1733–1746 (2009)
31. Xie, L., Ahmad, T., Jin, L., Liu, Y., Zhang, S.: A New CNN-Based
Method for Multi-Directional Car License Plate Detection. IEEE Trans-
actions on Intelligent Transportation Systems 19(2), 507–517 (feb 2018).
https://doi.org/10.1109/TITS.2017.2784093
32. Zhou, X., Zhu, M., Leonardos, S., Daniilidis, K.: Sparse representation for 3d shape
estimation: A convex relaxation approach. IEEE transactions on pattern analysis
and machine intelligence 39(8), 1648–1661 (2017)