GMR Institute of Technology, Rajam
Lane Detection using Deep Learning
Methodology Review
Jun 13, 2025
Lane Detection using Deep Learning
P BHANUPRASAD (22345A0518)
Under the guidance of
[Link] kumar
Assistant Professor
GMR Institute of Technology
Al Mamun, A., Em, P. P., Hossen, M. J., Tahabilder, A., & Jahan, B. (2022). Efficient lane marking
detection using deep learning technique with differential and cross-entropy loss. International
Journal of Electrical and Computer Engineering, 12(4), 4206.
The proposed deep learning approach provides a novel technique for lane marking determination for automated vehicles. It is a simple deep learning model derived from the basic encoder-decoder-based model inspired by E-Net. A customized ENet architecture has been trained on the processed TuSimple data with a combination of discriminative and cross-entropy losses.
Complete workflow approach
Input dataset and preprocessing:
The TuSimple dataset has been used to develop the model because it has adequate images with accurate annotation that meet the standards of the model. Images of this dataset are annotated with the full lane boundary instead of just the lane markings, and each image is of good quality at 720×1280 px. After the lane features are extracted, a line is fitted through every lane's data points. The lane pixels are then converted into binary values, with zeros for pixels that do not belong to the lanes, to create the binary and instance label images. Eventually, the image frames were resized to 224×224 px to maintain a constant aspect ratio and to reduce the computational cost. After these preprocessing steps, the output data is the original image together with an instance label and a binary label.
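The label-creation and resizing steps above can be sketched as follows; the lane points, image sizes, and the nearest-neighbour resize are illustrative stand-ins, not the authors' exact pipeline:

```python
import numpy as np

def make_binary_label(h, w, lane_points):
    """Binary label image: 1 for pixels on a lane, 0 elsewhere."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for r, c in lane_points:
        mask[r, c] = 1
    return mask

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize, standing in for the 720x1280 -> 224x224 step."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

label = make_binary_label(720, 1280, [(400, 640), (401, 642), (402, 644)])
small = resize_nearest(label, 224, 224)
```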
E-Net architecture
The original ENet architecture is an encoder-decoder network with three stages in the encoding section and two stages in the decoding section. The original ENet architecture has therefore been customized by dividing the encoding section into binary and instance segments. Binary segmentation provides information about which pixels belong to the lanes, whereas instance segmentation ensures the proper pixel position of each lane on the images. Several small bottlenecks are formed before the main encoder-decoder sections, namely the normal bottleneck (NB), encode bottleneck (EB), and decode bottleneck (DB). These reduce the number of features in every layer to reduce computational complexity, learn the relevant features more deeply, and find the best possible training loss.
The architecture of the customized E-Net model
Normal bottleneck (NB):
In the model, a 1×1 convolution is used to reduce the number of channels, enhancing computational efficiency. A dilated convolution with a 3×3 kernel maintains the data dimensions and resolution. An asymmetric convolution (5×1) minimizes overfitting and complexity by filtering input channels separately. A standard 3×3 convolution follows with the same dimensions, and a final 1×1 convolution restores the original channel count. Dropout layers are used for regularization, combating overfitting.
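Since a 1×1 convolution is just a per-pixel linear map across channels, the channel-reduction step can be illustrated in plain NumPy (the shapes below are toy values, not the model's actual layer sizes):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: mixes channels independently at every pixel.
    x: (H, W, C_in) feature map, w: (C_in, C_out) weights."""
    return np.einsum('hwc,cd->hwd', x, w)

x = np.random.rand(8, 8, 64)       # toy feature map
w = np.random.rand(64, 16) * 0.1   # reduce 64 channels to 16
y = conv1x1(x, w)
```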
Encode bottleneck (EB):
Max pooling is executed with a 2×2 kernel and 2×2 stride in the EB. In addition, a 2D convolution with the same kernel and stride is applied to reduce the number of channels. A 1×1 convolution and a dropout layer are also used, as in the NB.
Decode bottleneck (DB):
Max unpooling is executed at the same resolution as the EB. A transposed convolutional layer with a 3×3 kernel and 2×2 stride is used to upsample the encoded features. A 1×1 convolution and a dropout layer are also used, as in the NB.
Encoder stage:
The encoder processes the original, binary label, and instance label images, splitting them into binary and instance segmentation. It extracts lane features and instances. The encoder has two bottlenecks: the first has one EB and four NB, and the second has one EB, seven NB, three dilated convolutions, and two asymmetric convolution layers. In the bottleneck shared by binary and instance segmentation, there are eight NB with four dilated and two asymmetric convolutions.
Decode stage:
The information obtained from the encoder stages must be decoded to produce the final result from the network. One up-sampling layer and two NB are applied in the bottleneck for the binary segmentation part and for the instance segmentation part, respectively, to upsample the lane information.
Loss measurement:
The model is optimized with backpropagation using two loss types: cross-entropy for binary segmentation and discriminative loss for precise lane location. The discriminative loss clusters same-lane pixels while maintaining inter-lane separation through separation, neighborhood, and regularization terms. The cross-entropy loss is computed as:

L_CE = −( y log(p) + (1 − y) log(1 − p) )

The discriminative loss is computed as the sum of three terms:

Disc_loss = Loss_sep + Loss_neighb + Loss_Regu

Loss_neighb = (1/N_c) Σ_{c=1}^{N_c} (1/N_e) Σ_{i=1}^{N_e} [ ‖M_c − x_i‖ − δ_neighb ]₊²

Loss_sep = 1/(N_c (N_c − 1)) Σ_{c_a=1}^{N_c} Σ_{c_b≠c_a} [ δ_sep − ‖M_{c_a} − M_{c_b}‖ ]₊²

Loss_Regu = (1/N_c) Σ_{c=1}^{N_c} ‖M_c‖

where N_c is the lane cluster count, N_e the element count per cluster, M_c the cluster mean, and x_i the instances. The total loss accumulates the cross-entropy and discriminative losses, and backpropagation updates the model's weights based on this total loss.
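A toy NumPy version of the three-term discriminative loss illustrates the idea; the margin values d_neighb and d_sep below are arbitrary illustrative choices, not the paper's settings:

```python
import numpy as np

def discriminative_loss(x, labels, d_neighb=0.5, d_sep=1.5):
    """Sketch of the discriminative loss over pixel embeddings.
    x: (N, D) embeddings, labels: (N,) lane ids."""
    clusters = np.unique(labels)
    means = np.stack([x[labels == c].mean(axis=0) for c in clusters])

    # Loss_neighb: hinged pull of each pixel toward its own lane mean
    l_neighb = float(np.mean([
        np.mean(np.maximum(
            np.linalg.norm(x[labels == c] - means[i], axis=1) - d_neighb, 0.0) ** 2)
        for i, c in enumerate(clusters)]))

    # Loss_sep: hinged push between the means of different lanes
    nc = len(means)
    l_sep = 0.0
    for a in range(nc):
        for b in range(nc):
            if a != b:
                l_sep += max(d_sep - np.linalg.norm(means[a] - means[b]), 0.0) ** 2
    if nc > 1:
        l_sep /= nc * (nc - 1)

    # Loss_Regu: keeps the cluster means small
    l_regu = float(np.mean(np.linalg.norm(means, axis=1)))
    return l_neighb + l_sep + l_regu

x = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.1, 0.0]])
labels = np.array([0, 0, 1, 1])
loss = discriminative_loss(x, labels)
```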
Inferencing:
Our model segments lane pixels and overlays them on the input images using DBSCAN clustering, which is superior to K-means for noisy or irregular clusters. DBSCAN excels when lanes are close together and randomly positioned. A distance threshold of 0.05 determines whether points belong to the same lane or to different lanes.
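A minimal DBSCAN, written out to show the clustering idea (a real system would use a library implementation; min_pts is a toy value, while eps reuses the 0.05 threshold mentioned above):

```python
import numpy as np

def dbscan(points, eps=0.05, min_pts=3):
    """Minimal DBSCAN: groups points that lie within eps of a dense core;
    -1 marks noise. Sketch of the lane-pixel clustering step."""
    def region(i):
        return np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0]

    labels = np.full(len(points), -1)
    cid = 0
    for i in range(len(points)):
        if labels[i] != -1 or len(region(i)) < min_pts:
            continue                      # already assigned, or not a core point
        labels[i] = cid
        queue = list(region(i))
        while queue:                      # expand the cluster outward
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid
                nbrs = region(j)
                if len(nbrs) >= min_pts:  # j is itself a core point: keep going
                    queue.extend(nbrs)
        cid += 1
    return labels

pts = np.array([[0, 0], [0.01, 0], [0, 0.01],    # lane A
                [1, 1], [1.01, 1], [1, 1.01],    # lane B
                [5, 5]], dtype=float)            # noise
lab = dbscan(pts)
```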
Methods:
Encode-decode architecture
2D convolution
1x1 convolution
Normal bottlenecks (NB)
Transposed convolutional layers
Datasets:
TuSimple dataset
Alam, M. Z., Kelouwani, S., Boisclair, J., & Amamou, A. A. (2022). Learning light
fields for improved lane [Link], 11, 271-283.
Input: The network takes a sequence of perspective images extracted from the light field (LF)
data as input.
CNN: This approach is suggested because lane lines are the only color variation on the road, and it is assumed that, by using the correct RGB primaries, the color information can be adequately represented. For example, the red and green channels can represent a yellow line, while all three color channels are needed for a white line. Therefore, using the disparity map, the information from one color channel for one lane line can be traded off against the additional information the disparity map provides for all road lines.
Feature Extraction: The perspective images are passed through a convolutional neural
network (CNN) to extract relevant features.
Temporal Dependencies: The features from the CNN are then fed into an LSTM layer, which
captures the temporal dependencies between the perspectives and learns the inter-view angular
information.
Regression: The output of the LSTM layer is passed to a regression layer, which predicts the
coordinates of the lane lines.
Improved Lane Detection: By utilizing the angular information present in the LF data, the proposed network architecture improves the prediction of lane line coordinates and increases robustness against challenging conditions.
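The CNN→LSTM→regression pipeline can be sketched with a single hand-written LSTM step in NumPy; the feature size, hidden size, number of views, and random weights are all toy stand-ins for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W):
    """One LSTM step. W stacks the input/forget/cell/output gate weights."""
    z = W @ np.concatenate([x, h])
    i, f, g, o = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sig(f) * c + sig(i) * np.tanh(g)
    return sig(o) * np.tanh(c), c

D, H = 16, 8                               # toy CNN-feature and hidden sizes
W = rng.normal(size=(4 * H, D + H)) * 0.1  # random stand-in weights
W_reg = rng.normal(size=(4, H)) * 0.1      # regression head: 4 lane coordinates

views = rng.normal(size=(5, D))            # stand-in CNN features of 5 perspectives
h, c = np.zeros(H), np.zeros(H)
for x in views:                            # fuse inter-view (angular) information
    h, c = lstm_step(x, h, c, W)
coords = W_reg @ h                         # predicted lane-line coordinates
```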
Methods:
Light field imaging
Convolutional neural networks (CNNs)
Perspective images
Disparity map estimation
LSTM network
Angular information
Regression layer
Yuhao Huang, Shitao Chen, Yu Chen, Zhiqiang Jian, & Nanning Zheng. (2018). Spatial-Temporal Based Lane Detection Using Deep Learning. 14th IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI), pp. 143-154. 10.1007/978-3-319-92007-8_13.
We use an inverse perspective transform to get top-view images and use the temporal and spatial relevance of a lane to get its estimated location. We then cut the image into 10-20 sub-images centered on the estimated position of the lane boundary; each sub-image contains local lane boundary information. A CNN then classifies and regresses the sub-images to get the exact location and category of the local lane boundaries. Finally, we combine the local lane boundary locations with the linetype information and, through spline fitting, obtain the lane boundary information for the entire image.
Inverse Perspective Transform:
Images taken by a camera have perspective effects that distort the original geometry of lane boundaries. Therefore, the first step in our algorithm is to convert the front view to the top view by inverse perspective mapping, which eliminates the distortion caused by perspective effects. Through this transform we obtain the top-down projection of the road, and we select the top view as the basis for lane detection.
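Inverse perspective mapping amounts to applying a homography to image coordinates; a sketch with an illustrative matrix (in practice H is derived from the camera's calibration, and the numbers below are made up):

```python
import numpy as np

# Illustrative homography mapping front-view pixels to top-view coordinates.
H = np.array([[1.0, 0.2, -30.0],
              [0.0, 2.0, -100.0],
              [0.0, 0.004, 1.0]])

def warp_points(H, pts):
    """Map (x, y) pixel coordinates through the homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    out = (H @ pts_h.T).T
    return out[:, :2] / out[:, 2:3]                   # perspective divide

front = np.array([[640.0, 400.0], [640.0, 700.0]])    # two front-view pixels
top = warp_points(H, front)
```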
Coordinate Estimate: Both Lee et al. and Pan et al. used single-frame lane detection methods. The benefit is that these methods do not require any prior information during the lane detection task, which can be completed with only a single frame as input. However, single-frame algorithms have high time and space complexity and redundancy. In practice, lane boundaries have strong spatial and temporal constraints, especially in the top view.
A lane boundary is a continuous ground road marking dividing different lanes. In practice, the camera typically samples the pavement at 30 frames per second (FPS). Combining these two points, it is not difficult to conclude that the lane boundary position does not change much between consecutive frames, which means that the lane boundary in the next frame is near the lane boundary of the previous frame. Generally speaking, the lane width is the same for multi-lane roads, and the space constraints are obvious in the top view. Based on the above considerations, we propose to use the temporal and spatial constraints of lane boundaries to reduce the search area. The main difference between our detection algorithm and the single-frame algorithms above is that ours is based on multi-frame lane detection. We then use a CNN to accurately detect the position of the boundary based on the estimated lane boundary position under the temporal constraints discussed above. The mean value of the former 10 frames' lane boundary positions is used as the estimated position for the current frame.
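The 10-frame averaging rule can be written out directly; the positions below are hypothetical x-coordinates of a single boundary:

```python
import numpy as np
from collections import deque

class LanePositionEstimator:
    """Estimates the current frame's boundary position as the mean of the
    previous 10 frames' detected positions, as described above."""
    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, detected_x):
        self.history.append(detected_x)

    def estimate(self):
        return float(np.mean(self.history))

est = LanePositionEstimator()
for x in [100, 102, 101, 99, 100, 101, 100, 102, 101, 100]:  # toy detections
    est.update(x)
prior = est.estimate()   # search center for the next frame
```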
Sub-image Extraction: In our algorithm, we use control points to represent lane boundaries; a lane boundary is usually represented by five control points, and typically 10-20 control points are selected according to the boundary type classification result. We use the estimated position of the lane as a priori information and cut the sub-image centered on that position. After obtaining the exact location of the lane boundary through the CNN, we use it as the control point for the current frame and as a reference for the next frame. The sub-image size is determined by the camera's internal and external parameters, following two rules: first, each sub-image should contain only one lane boundary, which should be as large as possible within it. Second, sub-images should be long enough in the vertical direction to avoid misdetecting white dotted boundaries; however, excessive vertical length reduces accuracy on curved sections.
CNN for boundary type classification and lane boundary regression:
Network Structure: Inspired by existing network structures, we propose a novel multitask convolutional neural network that can accurately track the lane boundary position while classifying lane boundary types. For each of the classification and regression tasks, the network consists of 9 layers: 5 convolutional layers and 4 fully connected layers. Before entering the CNN, the input image is normalized. A feature of our network structure is that the sub-tasks share the 5 convolutional layers and one fully connected layer, compressing the parameter space. Our CNN is structured as follows: the first 3 convolutional layers use a 5×5 kernel with 2×2 stride, followed by 2 convolutional layers with 3×3 kernels and no stride. The last convolutional layer is connected to a fully connected (FC) layer with 4096 neurons. The network then divides into two parts (for the two sub-tasks of linetype classification and lane boundary regression), each containing 3 FC layers of the same scale. Our loss function is defined as follows:
loss = L_cls,                      if boundary type = Noneline
loss = α·L_cls + (1 − α)·L_loc,    if boundary type ≠ Noneline

where α is a factor balancing the classification and regression losses, and L_cls and L_loc are the classification and regression loss functions, respectively. In this paper, we use the cross-entropy loss as the classification loss.
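As a quick check of the described stack, the spatial size of a sub-image can be traced through the five convolutional layers; the 128×128 input size is hypothetical, and unpadded ("valid") convolutions are assumed:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

s = 128  # hypothetical square sub-image size
# three 5x5 / stride-2 layers, then two 3x3 / stride-1 layers
for k, st in [(5, 2), (5, 2), (5, 2), (3, 1), (3, 1)]:
    s = conv_out(s, k, st)
```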
Methods:
Dual-View Convolutional Neural Network (DVCNN)
VPGNet
Linetype Classification and Lane Boundary Regression: In autonomous vehicles, linetype classification is necessary since self-driving cars need to change lanes, sometimes at high frequency. Traditional methods are poor at boundary classification, while a CNN can realize it easily with its powerful classification capabilities. Based on lane categories, we classify lane boundaries into double yellow lines, white solid lines, white dashed lines, single yellow lines, Noneline, etc. The Noneline category is designed to improve the robustness of the system, as lane boundary information, especially for adjacent lanes, tends to be obstructed by moving vehicles in a realistic traffic environment. In this case, the control point at the corresponding position is set to the Noneline type, and the CNN coordinate regression value of that control point is not used.
Lane Fitting: The algorithm of the previous step gives information about the control points of the
current lane and its adjacent lane, including the boundary type and the exact location of the lane
boundary. In this step, we fit the lane boundary through control points with the Catmull-Rom (CR)
spline. The CR spline, also known as the Overhauser spline, is a cubic interpolation spline. It is calculated as follows:

P(u) = [1 u u² u³] · M(τ) · [P_{i−2} P_{i−1} P_i P_{i+1}]^T

M(τ) = |  0      1      0      0  |
       | −τ      0      τ      0  |
       |  2τ    τ−3   3−2τ    −τ  |
       | −τ    2−τ    τ−2      τ  |

where P_i, P_{i−1}, and P_{i+1} are the current, previous, and next points on the spline, respectively. τ is the tension parameter, which affects the spline's sharpness. The value of u lies in the interval (0, 1) and parameterizes the points between P_{i−1} and P_i.
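The CR spline evaluation can be coded directly from its matrix form; τ = 0.5 is a common default for the tension, not necessarily the paper's value:

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, u, tau=0.5):
    """Evaluate the Catmull-Rom spline between p1 and p2 at u in [0, 1]."""
    M = np.array([[0.0,     1.0,       0.0,         0.0],
                  [-tau,    0.0,       tau,         0.0],
                  [2 * tau, tau - 3.0, 3 - 2 * tau, -tau],
                  [-tau,    2.0 - tau, tau - 2.0,   tau]])
    U = np.array([1.0, u, u**2, u**3])
    P = np.array([p0, p1, p2, p3], dtype=float)
    return U @ M @ P

# u = 0 returns p1 and u = 1 returns p2, so consecutive segments join smoothly
mid = catmull_rom([0, 0], [1, 1], [2, 1], [3, 0], 0.5)
```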
Optimization:
If the adjacent lane cannot be entered, we cancel the lane boundary position estimation for that adjacent lane. In the next frame, if the current lane can be changed (e.g., the left boundary is a white dashed line), we initialize the corresponding adjacent lane based on the lane constraints.