Deep Learning for Highway Driving
I. INTRODUCTION
Fig. 1: Sample output from our neural network capable of lane and vehicle detection.

Since the DARPA Grand Challenges for autonomous vehicles, there has been an explosion in applications and research for self-driving cars. Among the different environments for self-driving cars, highway and urban roads are on opposite ends of the spectrum. In general, highways tend to be more predictable and orderly, with road surfaces typically well-maintained and lanes well-marked. In contrast, residential or urban driving environments feature a much higher degree of unpredictability with many generic objects, inconsistent lane-markings, and elaborate traffic flow patterns. The relative regularity and structure of highways has facilitated some of the first practical applications of autonomous driving technology. Many automakers have begun pursuing highway auto-pilot solutions designed to mitigate driver stress and fatigue and to provide additional safety features; for example, certain advanced-driver assistance systems (ADAS) can both keep cars within their lane and perform front-view car detection. Currently, human drivers retain liability and, as such, must keep their hands on the steering wheel and be prepared to control the vehicle in the event of any unexpected obstacle or catastrophic incident. Financial considerations contribute to a substantial performance gap between commercially available auto-pilot systems and the fully self-driving cars developed by Google and others. Namely, today's self-driving cars are equipped with expensive but critical sensors, such as LIDAR, radar, and high-precision GPS coupled with highly detailed maps.

In today's production-grade autonomous vehicles, critical sensors include radar, sonar, and cameras. Long-range vehicle detection typically requires radar, while nearby car detection can be solved with sonar. Computer vision can play an important role in lane detection as well as redundant object detection at moderate distances. Radar works reasonably well for detecting vehicles, but has difficulty distinguishing between different metal objects and thus can register false positives on objects such as tin cans. Also, radar provides little orientation information and has a higher variance on the lateral position of objects, making localization difficult on sharp bends. The utility of sonar is both compromised at high speeds and, even at slow speeds, limited to a working distance of about 2 meters. Compared to sonar and radar, cameras generate a richer set of features at a fraction of the cost. By advancing computer vision, cameras could serve as a reliable redundant sensor for autonomous driving. Despite its potential, computer vision has yet to assume a significant role in today's self-driving cars. Classic computer vision techniques simply have not provided the robustness required for production-grade vehicles; these techniques require intensive hand-engineering, road modeling, and special case handling. Considering the seemingly infinite number of specific driving situations, environments, and unexpected obstacles, the task of scaling classic computer vision to robust, human-level performance would prove monumental and is likely unrealistic.
Deep learning, or neural networks, represents an alternative approach to computer vision. It shows considerable promise as a solution to the shortcomings of classic computer vision. Recent progress in the field has advanced the feasibility of deep learning applications for solving complex, real-world problems, and industry has responded with increasing uptake of the technology. Deep learning is data-centric, requiring heavy computation but minimal hand-engineering. In the last few years, an increase in available storage and compute capabilities has enabled deep learning to achieve success in supervised perception tasks, such as image detection. A neural network, after training for days or even weeks on a large data set, can be capable of inference in real time with a model size no larger than a few hundred MB [9]. State-of-the-art neural networks for computer vision require very large training sets coupled with extensive networks capable of modeling such immense volumes of data. For example, the ILSVRC data set, where neural networks achieve top results, contains 1.2 million images in over 1000 categories.

By using expensive existing sensors which are currently used for self-driving applications, such as LIDAR and mm-accurate GPS, and calibrating them with cameras, we can create a video data set containing labeled lane-markings and annotated vehicles with location and relative speed. By building a labeled data set covering all types of driving situations (rain, snow, night, day, etc.), we can evaluate neural networks on this data to determine whether they are robust in every driving environment and situation for which we have training data.

In this paper, we detail an empirical evaluation on the data set we collect. In addition, we explain the neural network that we applied for detecting lanes and cars, as shown in Figure 1.

II. RELATED WORK

Recently, computer vision has been expected to play a larger role within autonomous driving. However, due to its history of relatively low precision, it is typically used in conjunction with either other sensors or other road models [3], [4], [6], [7]. Cho et al. [3] use multiple sensors, such as LIDAR, radar, and computer vision, for object detection. They then fuse these sensors together in a Kalman filter using motion models on the objects. Held et al. [4] use only a deformable parts based model on images to get detections, then use road models to filter out false positives. Caraffi et al. [6] use a WaldBoost detector along with a tracker to generate pixel-space detections in real time. Jazayeri et al. [7] rely on temporal information of features for detection, and then filter out false positives with a front-view motion model.

In contrast to these object detectors, we do not use any road or motion-based models; instead we rely only on the robustness of a neural network to make reasonable predictions. In addition, we currently do not rely on any temporal features, and the detector operates independently on single frames from a monocular camera. To make up for the lack of other sensors, which estimate object depth, we train the neural network to predict depth based on labels extracted from radar returns. Although the model only predicts a single depth value for each object, Eigen et al. have shown how a neural network can predict entire depth maps from single images [12]. The network we train likely learns some model of the road for object detection and depth predictions, but it is never explicitly engineered and instead learns from the annotations alone.

Before the widespread adoption of Convolutional Neural Networks (CNNs) within computer vision, deformable parts based models were the most successful methods for detection [13]. After the popular CNN model AlexNet [9] was proposed, state-of-the-art detection shifted towards CNNs for feature extraction [1], [14], [10], [15]. Girshick et al. developed R-CNN, a two-part system which used Selective Search [16] to propose regions and AlexNet to classify them. R-CNN achieved state-of-the-art results on Pascal by a large margin; however, due to its nearly 1000 classification queries and inefficient re-use of convolutions, it remains impractical for real-time implementations. Szegedy et al. presented a more scalable alternative to R-CNN that relies on the CNN to propose higher quality regions compared to Selective Search. This reduces the number of region proposals down to as low as 79 while keeping the mAP competitive with Selective Search. An even faster approach to image detection, called Overfeat, was presented by Sermanet et al. [1]. By using a regular pattern of "region proposals", Overfeat can efficiently reuse convolution computations from each layer, requiring only a single forward pass for inference.

For our empirical evaluation, we use a straightforward application of Overfeat, due to its efficiency, and combine this with labels similar to the ones proposed by Szegedy et al. We describe the model and similarities in the next section.

III. REAL TIME VEHICLE DETECTION

Convolutional Neural Networks (CNNs) have had the largest success in image recognition in the past 3 years [9], [17], [18], [19]. From these image recognition systems, a number of detection networks were adapted, leading to further advances in image detection. While the improvements have been staggering, not much consideration had been given to the real-time detection performance required for some applications. In this paper, we present a detection system capable of operating at greater than 10Hz using nothing but a laptop GPU. Due to the requirements of highway driving, we need to ensure that the system can detect cars more than 100m away and can operate at speeds greater than 10Hz; this distance requires higher image resolutions than are typically used, in our case 640 × 480. We use the Overfeat CNN detector, which is very scalable and simulates a sliding window detector in a single forward pass through the network by efficiently reusing convolutional results on each layer [1]. Other detection systems, such as R-CNN, rely on selecting as many as 1000 candidate windows, where each is evaluated independently and does not reuse convolutional results.

In our implementation, we make a few minor modifications to Overfeat's labels in order to handle occlusions of cars, predict lanes, and accelerate performance during inference. We will first provide a brief overview of the original implementation and will then address the modifications. Overfeat converts an image recognition CNN into a "sliding window" detector by providing a larger resolution image and transforming the fully connected layers into convolutional layers.
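This conversion is what lets Overfeat share computation across windows: a fully connected head trained on fixed-size feature patches becomes a convolution, so a larger input yields a whole grid of window scores in one pass. A minimal numpy sketch of the equivalence, using toy sizes rather than Overfeat's actual dimensions:

```python
import numpy as np

# Toy illustration of the Overfeat trick: a fully connected layer trained on
# k x k feature patches is equivalent to a k x k convolution, so evaluating a
# larger feature map produces a grid of "sliding window" outputs in one pass.
# (Hypothetical sizes; not the real network's dimensions.)

rng = np.random.default_rng(0)
k = 3                                # spatial extent the FC head was trained on
W = rng.standard_normal((k * k,))    # FC weights for a single output unit

def fc_on_patch(patch):
    """Classifier head applied to one k x k window."""
    return float(patch.ravel() @ W)

def fc_as_conv(feat):
    """Same head applied convolutionally over a larger feature map."""
    H, Wd = feat.shape
    out = np.empty((H - k + 1, Wd - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fc_on_patch(feat[i:i + k, j:j + k])
    return out

feat = rng.standard_normal((6, 6))   # larger input -> 4 x 4 output grid
grid = fc_as_conv(feat)
assert grid.shape == (4, 4)
# Each grid cell matches evaluating the window detector independently:
assert np.isclose(grid[1, 2], fc_on_patch(feat[1:4, 2:5]))
```

The sketch only demonstrates the equivalence; in the real network the savings come from the convolutional layers below the head, whose feature maps are computed once and reused by every window, rather than recomputed per crop as in R-CNN-style pipelines.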
Fig. 3: overfeat-mask
A. Lane Detection
The CNN used for vehicle detection can be easily extended
for lane boundary detection by adding an additional class.
Whereas the regression for the vehicle class predicts a five-dimensional value (four for the bounding box and one for
depth), the lane regression predicts six dimensions. Similar to
the vehicle detector, the first four dimensions indicate the two
end points of a local line segment of the lane boundary. The
remaining two dimensions indicate the depth of the endpoints
with respect to the camera. Fig 4 visualizes the lane boundary
ground truth label overlaid on an example image. The green
tiles indicate locations where the detector is trained to fire, and the line segments represented by the regression labels are explicitly drawn.

Fig. 4: Example of lane boundary ground truth

The line segments have their ends connected to form continuous splines. The depths of the line segments are color-coded such that the closest segments are red and the furthest ones are blue. Due to our data collection methods for lane labels, we are able to obtain ground truth even for lane boundaries occluded by other objects. This forces the neural network to learn more than a simple paint detector; it must use context to predict lanes where there are occlusions.
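The label layout described above can be made concrete with a short sketch. The helper names and values here are illustrative assumptions, not the paper's exact label encoding:

```python
import numpy as np

# Illustrative sketch of the regression targets described above: each "on"
# tile regresses a 5-vector for vehicles (bounding box + one depth) and a
# 6-vector for lane boundaries (two segment endpoints + two depths).
# Hypothetical helpers; the paper's actual encoding is not reproduced here.

def vehicle_target(x1, y1, x2, y2, depth_m):
    """Five regression dimensions: bounding box corners and one depth."""
    return np.array([x1, y1, x2, y2, depth_m], dtype=np.float32)

def lane_target(p_near, p_far, d_near, d_far):
    """Six regression dimensions: the two endpoints of a local line
    segment plus the depth of each endpoint w.r.t. the camera."""
    (x1, y1), (x2, y2) = p_near, p_far
    return np.array([x1, y1, x2, y2, d_near, d_far], dtype=np.float32)

car = vehicle_target(120, 80, 180, 130, depth_m=42.0)
seg = lane_target((310, 400), (330, 360), d_near=8.0, d_far=14.5)
assert car.shape == (5,) and seg.shape == (6,)
```

Connecting consecutive segment endpoints then yields the continuous splines, and the per-endpoint depths allow the color-coded 3D visualization.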
Similar to the vehicle detector, we use L1 loss to train the regressor, with mini-batch stochastic gradient descent for optimization. The learning rate is controlled by a variant of the momentum scheduler [11]. To obtain semantic lane information, we use DBSCAN to cluster the line segments into lanes. Fig 5 shows our lane predictions after DBSCAN clustering; different lanes are represented by different colors. Since our regressor outputs depths as well, we can predict the lane shapes in 3D using inverse camera perspective mapping.

Fig. 5: Example output of lane detector after DBSCAN clustering

IV. EXPERIMENTAL SETUP

A. Data Collection

Our research vehicle is a 2014 Infiniti Q50. The car currently uses the following sensors: 6x Point Grey Flea3 cameras, 1x Velodyne HDL32E LIDAR, and 1x Novatel SPAN-SE receiver. We also have access to the Q50 built-in Continental mid-range radar system. The sensors are connected to a Linux PC with a Core i7-4770k processor.

Once the raw videos are collected, we annotate the 3D locations of vehicles and lanes as well as the relative speed of all the vehicles. To get vehicle annotations, we follow the conventional approach of using Amazon Mechanical Turk to get accurate bounding box locations within pixel space. Then, we match bounding boxes and radar returns to obtain the distance and relative speed of the vehicles.
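The box-to-radar matching step can be sketched as follows, assuming radar returns have already been projected into image coordinates via the camera calibration; the data layout and tie-breaking rule are illustrative assumptions, not the paper's exact procedure:

```python
# Simplified sketch of the annotation-matching step: each AMT bounding box
# is paired with a radar return whose projected image location falls inside
# the box, inheriting that return's distance and relative speed.
# Data structures and the nearest-return tie-break are assumptions.

def match_boxes_to_radar(boxes, radar_returns):
    """boxes: list of (x1, y1, x2, y2) in pixels; radar_returns: list of
    (u, v, range_m, rel_speed_mps) already projected into the image."""
    matches = []
    for (x1, y1, x2, y2) in boxes:
        inside = [r for r in radar_returns
                  if x1 <= r[0] <= x2 and y1 <= r[1] <= y2]
        if inside:
            # Prefer the closest return when several project into the box.
            u, v, range_m, speed = min(inside, key=lambda r: r[2])
            matches.append({"box": (x1, y1, x2, y2),
                            "distance_m": range_m,
                            "rel_speed_mps": speed})
    return matches

boxes = [(100, 50, 200, 120), (300, 60, 360, 110)]
returns = [(150, 90, 35.0, -2.1), (320, 80, 60.0, 1.4), (150, 85, 80.0, 0.0)]
out = match_boxes_to_radar(boxes, returns)
assert out[0]["distance_m"] == 35.0 and out[1]["distance_m"] == 60.0
```

Boxes with no return inside simply receive no depth label under this scheme; the radar/camera calibration errors discussed in the evaluation would surface here as returns projecting outside their true box.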
Fig. 8: Lane detection results on different lateral lanes. (a) Ego-lane left border. (b) Ego-lane right border. (c) Left adjacent lane left border. (d) Right adjacent lane right border.

and 30 mins of driving. The accuracy of the vehicle bounding box predictions was measured using Intersection Over Union (IOU) against the ground truth boxes from Amazon Mechanical Turk (AMT). A bounding box prediction matched with ground truth if IOU ≥ 0.5. The performance of our car detection as a function of depth can be seen in Fig 9. Nearby false positives pose the largest problems for ADAS systems, as they could cause the system to needlessly apply the brakes. In our system, we found overpasses and shading effects to cause the largest problems. Two examples of these situations are shown in Fig 10.

As a baseline for our car detector, we compared the detection results to the Continental mid-range radar within our data collection vehicle. While matching radar returns to ground truth bounding boxes, we found that although radar had nearly 100% precision, false positives were being introduced through errors in radar/camera calibration. Therefore, to ensure a fair comparison, we matched every radar return to a ground truth bounding box even if IOU < 0.5, giving our radar returns 100% precision. This comparison is shown in Fig 11; the F1 score for radar is simply the recall.

In addition to the bounding box locations, we measured the accuracy of the predicted depth by using radar returns as ground truth. The standard error in the depth predictions as a function of depth can be seen in Fig 12.

For a qualitative review of the detection system, we have uploaded a 1.5 hour video of the vehicle detector run on our test set. This may be found at [Link]/GJ0cZBkHoHc. A short video of our lane detector may also be found online at [Link]/__f5pqqp6aM. In these videos, we evaluate the detector on every frame independently and display the raw detections, without the use of any Kalman filters or road models. The red locations in the video correspond to the mask detectors that are activated. This network was only trained on the rear view of cars traveling in the same direction, which is why cars across the highway barrier are commonly missed.

Fig. 11: Radar Comparison to Vehicle Detector
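The IOU ≥ 0.5 matching criterion used in the evaluation above can be sketched directly for axis-aligned boxes given as (x1, y1, x2, y2):

```python
# Intersection-over-union matching for axis-aligned boxes, as used in the
# evaluation: a prediction matches ground truth when IOU >= 0.5.

def iou(a, b):
    """IOU of two boxes (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matches_ground_truth(pred, gt, threshold=0.5):
    return iou(pred, gt) >= threshold

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0
# Half-overlapping unit-height boxes: IOU = 2 / 6, below the threshold.
assert matches_ground_truth((0, 0, 2, 2), (1, 0, 3, 2)) is False
```

Under the relaxed radar comparison described above, this threshold is waived for radar returns, which are matched to ground truth boxes even when IOU < 0.5.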
We have open sourced the code for the vehicle and lane detector online at [Link]/brodyh/caffe. Our repository was forked from the original Caffe code base from the BVLC group [20].

Fig. 12: Car Detector Depth Performance

V. CONCLUSION

By using camera, LIDAR, radar, and GPS, we built a highway data set consisting of 17 thousand image frames with vehicle bounding boxes and over 616 thousand image frames with lane annotations. We then trained on this data using a CNN architecture capable of detecting all lanes and cars in a single forward pass. Using a single GTX 780 Ti, our system runs at 44Hz, which is more than adequate for real-time use. Our results show existing CNN algorithms are capable of good performance in highway lane and vehicle detection. Future work will focus on acquiring frame-level annotations that will allow us to develop new neural networks capable of using temporal information across frames.

ACKNOWLEDGMENT

This research was funded in part by Nissan, who generously donated the car used for data collection. We thank our colleague Yuta Yoshihata from Nissan, who provided technical support and expertise on vehicles that assisted the research. In addition, the authors would like to thank the author of Overfeat, Pierre Sermanet, for his helpful suggestions on image detection.

REFERENCES

[1] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).
[2] Rothengatter, Talib, and Enrique Carbonell Vaya, eds. "Traffic and transport psychology: Theory and application." International Conference of Traffic and Transport Psychology, May 1996, Valencia, Spain. Pergamon/Elsevier Science Inc, 1997.
[3] Cho, Hyunggi, et al. "A multi-sensor fusion system for moving object detection and tracking in urban driving environments." Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014.
[4] Held, David, Jesse Levinson, and Sebastian Thrun. "A probabilistic framework for car detection in images using context and scale." Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012.
[5] Levinson, Jesse, et al. "Towards fully autonomous driving: systems and algorithms." Intelligent Vehicles Symposium, 2011.
[6] Caraffi, Claudio, et al. "A system for real-time detection and tracking of vehicles from a single car-mounted camera." Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012.
[7] Jazayeri, Amirali, et al. "Vehicle detection and tracking in car video based on motion model." Intelligent Transportation Systems, IEEE Transactions on 12.2 (2011): 583-595.
[8] Bradski, Gary. "The OpenCV library." Dr. Dobb's Journal 25.11 (2000): 120-126.
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
[10] Szegedy, Christian, Alexander Toshev, and Dumitru Erhan. "Deep neural networks for object detection." Advances in Neural Information Processing Systems. 2013.
[11] Sutskever, Ilya, et al. "On the importance of initialization and momentum in deep learning." Proceedings of the 30th International Conference on Machine Learning (ICML-13). 2013.
[12] Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." Advances in Neural Information Processing Systems. 2014.
[13] Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained part-based models." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.9 (2010): 1627-1645.
[14] Szegedy, Christian, et al. "Scalable, High-Quality Object Detection." arXiv preprint arXiv:1412.1441 (2014).
[15] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[16] Uijlings, Jasper R. R., et al. "Selective search for object recognition." International Journal of Computer Vision 104.2 (2013): 154-171.
[17] Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).
[18] He, Kaiming, et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv preprint arXiv:1502.01852 (2015).
[19] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[20] Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the ACM International Conference on Multimedia. ACM, 2014.