CTLane: Advanced Lane Detection CNN
In advanced driving assistance systems and autonomous vehicles, lane detection plays a
crucial role in ensuring the safety and stability of the vehicle during driving. While deep
learning-based lane detection methods can provide accurate pixel-level predictions, they can
struggle to interpret lanes as a whole in the presence of interference. To address this issue, we
have developed a method that includes two components: a convolutional neural network
transformer and a fusion decoder. The CNN transformer extracts the overall semantics of the
lanes and speeds up convergence, while the fusion decoder combines high-level semantics
with low-level local features to improve accuracy and robustness. By using these two
components together, our method is able to effectively detect lanes in a variety of conditions,
even when interference is present. We tested our method on multiple lane datasets and
obtained superior results, with the best performance on the BDD100K dataset. Our method
has successfully addressed the challenge of accurately and completely detecting lanes in the
presence of interference, such as darkness, shadows, and strong light. The algorithm has been deployed on an edge computing device, an intelligent cart. The code has been made available at: [Link]
1. INTRODUCTION
Lane detection is a crucial component of the perception phase in Advanced Driver
Assistance Systems (ADAS) and autonomous driving systems, playing an essential role
in ensuring vehicle safety and guiding driving paths. Unlike humans who rely on vision
and experience to judge lanes, lane detection uses algorithms to accurately identify lane
markings and road boundaries [1]. Vehicles acquire information about the road and its surroundings through visual sensors (e.g., cameras) and feed this data into deep Convolutional Neural Networks (CNNs) to extract and analyse key features, providing reliable foundational data for subsequent path planning and driving decisions.
With the rapid advancement of deep learning, many research methods [2, 3, 4] treat lane
detection as a segmentation task and employ end-to-end neural network frameworks.
During segmentation, the model must focus on every pixel and predict its category.
However, this pixel-wise processing makes it challenging for the model to treat the lane
line as a whole, leading to a loss of lane semantics in feature maps. This problem is
exacerbated under unfavourable lighting conditions, such as when shadows are cast on
lanes, vehicles block the lanes, or during nighttime driving, significantly impacting model
performance. Analysing lane feature maps generated by deep learning models reveals
that the network does not always focus on the lane regions during feature extraction,
which limits its detection performance. To address this, we propose introducing an
attention mechanism to enhance the network’s focus on key regions, thereby improving
the overall performance of lane detection.
A transformer [5] is a deep learning technique that has been widely applied in natural language processing [6], speech processing [7], and vision tasks [8]. It excels at parsing deep semantic information from images [9], making it highly promising for lane detection tasks. In recent years, the Vision Transformer (ViT) [9, 10] has achieved remarkable results in image classification tasks. Compared to traditional CNNs, ViT has a better ability to parse deep semantic information from images [9]. However, since a transformer uses fully connected layers and weight matrices to propagate global information, directly embedding a transformer into existing CNN architectures is difficult. ResT [11] was the first to combine a transformer and a CNN into a unified model, and it has provided some inspiration for our work.

Traditional transformers use large matrices as training parameters, leading to a significant number of parameters and slow training convergence. Furthermore, the advantages of convolution are not fully exploited in hybrid models. To address these issues, we propose replacing matrices with convolutions for computing attention on feature maps, thereby reducing the number of parameters and overcoming the heavy training weights and slow training speeds associated with traditional transformers.

In this study, we propose a method called CTLane, which combines the powerful semantic extraction capability of transformers with the efficiency of traditional CNNs. By introducing CNNs into the structure of transformers, the model not only significantly improves training speed but also enhances generalization ability. The CTLane model incorporates a multi-head attention mechanism to effectively extract features in the image space, enabling the model to focus more precisely on global lane features in the feature map and thereby predict more complete, smoother, and continuous lane lines, as shown in Fig. 1.

We conducted a comprehensive evaluation of CTLane on several benchmark lane detection datasets; the results demonstrate that the model maintains high lane detection accuracy while achieving a higher F1 score and a lower false detection rate.

Our main contributions are summarized as follows:

• We propose a CNN transformer to extract high-level semantics of lanes and introduce a novel method to compute self-attention.
• The CNN transformer combines the advantages of a CNN and a transformer to better aggregate spatial features, and it can be easily deployed after the feature extraction stage of any CNN.
• We propose the fusion decoder, which aggregates high-level semantic features and low-level local features, effectively preserving the original image features and restoring information in the decoder.
• We achieve state-of-the-art accuracy on the BDD100K dataset and tier-1 performance on the TuSimple and CULane datasets.

Figure 1 – The left-side (a) and (b) show the self-attention maps of the CTLane method on different channels, while the right-side (c) and (d) display the corresponding original feature maps. The comparison demonstrates that, under the influence of our CNN-Transformer module, the lane features are significantly enhanced, leading to the prediction of more complete, smooth, and continuous lane lines.

2. RELATED WORK

Within autonomous driving and ADAS applications, lane departure is one of the main causes of traffic accidents, highlighting the importance of lane detection. With the rapid development of deep learning technology, lane detection has gradually shifted from traditional feature-based methods to deep learning-based methods [12]. Traditional lane detection methods predominantly depend on manually designed feature extraction techniques, such as color segmentation, texture analysis, and edge detection, and subsequently employ post-processing techniques such as the Hough transform or Kalman filtering to extract lane lines. However, these methods perform poorly in complex scenarios, such as changes in lighting, occlusion, and complex road structures.

Upon reviewing a substantial body of literature, we have observed that the transformer [13] is a novel deep learning technique commonly employed in natural language processing [14] and speech processing [15]. Unlike a traditional CNN, the transformer, through its self-attention mechanism, is capable of capturing global information within images, excelling in contextual modeling and the capture of long-range dependencies. Several studies [16, 17] have significantly enhanced the global modeling capabilities of lane detection by incorporating a transformer architecture; for example, by generating conditional convolutional kernel parameters and integrating a row-by-row classification strategy, they have achieved high-precision and efficient lane detection. ResT [18], the first attempt to combine a transformer and a CNN into a unified model, has provided valuable insights for our research.

Currently, the main methods in lane detection are deep learning-based. They can be divided into four categories: semantic segmentation, row-wise classification, anchor-based, and curve fitting.

2.1 Semantic segmentation-based methods

Lane detection methods based on semantic segmentation achieve precise differentiation between lanes and background by transforming the task into a pixel-level classification problem and leveraging deep neural networks. Typical approaches, such as UNet and its variants [19, 20], employ encoder-decoder architectures combined with multiscale feature fusion to enhance accuracy. ENet-21 [21] introduces lightweight convolutions and affinity field techniques, maintaining high performance while reducing model complexity. GANs further improve feature extraction capabilities through adversarial learning [22]. To address domain adaptation challenges, the MLDA framework [23] optimizes at the pixel, instance, and category levels, while the integration of spatiotemporal information [24] enhances detection stability through hybrid spatiotemporal architectures. Additionally, the combination of semantic segmentation and anchor-based detection [25] achieves superior generalization and real-time performance in multi-task perception.

2.2 Row-wise classification-based methods

Lane detection pipelines have been further streamlined by the classic CNN row-by-row classification method [26], which handles image features on a per-row basis. High-precision and efficient lane detection has been achieved by some studies [16, 27] through the generation of conditional convolutional kernel parameters and the integration of a row-by-row classification strategy. GroupLane [28] has implemented efficient 3D lane detection by employing a channel grouping strategy and a row classification head design, combined with Bird's-Eye View (BEV) features and a Self-Organizing Map (SOM) mechanism.

2.3 Anchor-based methods

Anchor-based methods extract lane features through predefined anchors and generate candidate lanes based on these anchors, significantly improving efficiency and accuracy. For instance, some studies utilize anchor-chain representations to model lanes and enhance the model's perception of lane instances through multi-reference deformable attention [29]. Furthermore, to reduce computational costs, the introduction of local and global polar coordinate modules decreases the number of anchors, while a triplet detection head enables end-to-end detection without NMS, enhancing performance in dense scenarios [17]. In 3D lane detection, defining 3D anchors combined with iterative regression and global optimization avoids the complexity of traditional bird's-eye view transformations [30]. Hybrid anchor-driven ordinal classification further reduces computational costs and improves localization accuracy [31].

2.4 Curve fitting-based methods

Curve fitting-based lane detection methods model lanes as continuous curves. For instance, PolyLaneNet [32] utilizes deep polynomial regression to directly output the polynomial coefficients of lane markings, achieving accuracy comparable to existing methods while maintaining a high frame rate (115 FPS). Another study [33] enhances global context modeling through a transformer network, directly regressing a parameterized lane shape model and thus avoiding the intermediate segmentation steps and post-processing involved in traditional methods. Furthermore, the selective focus framework [34] introduces a lane distortion score to quantify the impact of quantization errors, thereby further enhancing detection performance.
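As a toy illustration of the curve-fitting formulation (a sketch of the idea, not PolyLaneNet itself), a lane can be represented by a low-order polynomial fitted to its points; the sample coordinates below are invented for demonstration:

```python
import numpy as np

# Toy curve-fitting lane representation: fit x = f(y) as a quadratic polynomial,
# the kind of compact parameterization PolyLaneNet-style methods regress directly.
y = np.array([720., 650., 580., 510., 440.])   # image rows (bottom to top)
x = np.array([640., 610., 585., 565., 550.])   # lane x-coordinate per row
coeffs = np.polyfit(y, x, deg=2)               # three numbers describe the lane

x_hat = np.polyval(coeffs, y)
print(float(np.abs(x_hat - x).max()) < 2.0)    # True: quadratic fits within ~2 px
```

The appeal of this representation is that a whole lane collapses to a handful of coefficients, so no per-pixel post-processing is needed at inference time.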
Figure 2 – The overall structure of the network, which is mainly divided into three parts: encoder, CNN transformer, and decoder. The encoder extracts the image features, the CNN transformer extracts the deep semantic information, and finally the decoder restores the feature map to the original input size.
…of transformers through 1 × 1 convolution, ReLU, and softmax. In the CAttn block, the input M_{c×h'×w'} is transformed into the outputs Q_{c'×h'×w'}, K_{c×h'×w'}, and V_{c'×h'×w'} through convolution. We compute the dot product of the query with all keys, divide each dot product by √(h'·w'), introduce batch normalization to normalize the computed attention scores, and then apply the sigmoid function to obtain the weights for the values. We calculate the output matrix with the following formula:

CAttn(Q, K, V) = Sigmoid(BN(QKᵀ / √(h'·w'))) V,    (1)

where BN(·) is the batch normalization operation and Sigmoid(·) is the sigmoid function.

In the CAttn block, we take advantage of the fact that different channels represent different image features: a linear mapping of each channel's features is performed using convolution. This gives us a more significant speed-up than linearly mapping the entire information, and we then concatenate along the complete channel to effectively merge the multi-head attentions.

Figure 3 – The specific structure of the CNN transformer, in which the mapping matrix is replaced by different convolution blocks. The MHA does not scale the channels of each Q, K, and V, but instead concatenates all the channels back to the original channel count in the final stage. We also use sigmoid instead of softmax.

Our approach introduces convolution operations to generate Q, K, and V. In the multi-head attention mechanism, we utilize different convolution operations C^Q_i(·), C^K_i(·), and C^V_i(·) to generate the heads Q_i, K_i, and V_i, where i ∈ {1, 2, …, t} and t represents the number of attention heads. After obtaining Q_i, K_i, and V_i for each head, we perform parallel attention computations for each set and concatenate the results of all heads using the Concat(·) operation, ultimately producing the aggregated attention map. The computational process can be expressed by the following formula:

CMultiHead(M̂) = Concat(head_1, …, head_t),   head_i = CAttn(C^Q_i(M̂), C^K_i(M̂), C^V_i(M̂)),    (2)

where t is the number of attention heads.

3.2.3 Channels aggregation

To process the attention feature maps generated by the MHA, we use convolution to aggregate them for subsequent computation. It is simple and efficient to aggregate the attention maps obtained from the different attention heads using convolution. We concatenate the multi-head features M_{c'·t×h'×w'} obtained from the previous step using Concat(·) and output them as M_{c×h'×w'}. The function Aggr(·) represents this aggregation of the concatenated feature map, with the dimension c'·t × h' × w' reduced to c × h' × w' by a 1 × 1 convolution and layer normalization. The entire process can be described as:

CMHA(x) = Aggr(CMultiHead(M̂)).    (3)

In the MHA mapping stage, M_{n×d} is mapped to M_{n×d} with a k × k convolution kernel, and its total complexity is O(n²k²d). In CT, we use a 1 × 1 kernel instead of k × k for the mapping operation, so that the mapping complexity is O(n²d). Compressing the multi-head Concat matrix M_{nt×d} into M_{n×d} by convolution has complexity O(tn²d), so the total complexity is O(n²d + tn²d). In this case, where d > n, the complexity of CMHA(·) is lower than that of MHA(·). In comparison, the number of training parameters required by the traditional transformer to generate Q, K, and V is also substantially larger than that of our convolutional mapping.

…then resized to N × H_o × W_o, where C_i × H_i × W_i (0 < i < 4, i ∈ ℕ) denotes the feature shape of the ith layer, N denotes the number of output channels, H_o and W_o denote the height and width of the output image, and S denotes the scaling multiplier, satisfying S·H_3 = H_o and S·W_3 = W_o.

[Fusion decoder figure: each backbone feature (C_i, H_i, W_i) passes through Conv+BN+ReLU to (C_f, H_i, W_i), is resized via bilinear interpolation, and the maps are concatenated into (3 × C_f, H_3, W_3) before the output (N, S × H_3, S × W_3).]
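The convolutional attention of Eqs. (1)–(3) can be illustrated with a minimal NumPy sketch, writing each 1 × 1 convolution as a channel-mixing matrix over the flattened feature map. The shapes, head count, and the omission of layer normalization are illustrative assumptions, not the exact CTLane implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn(x, eps=1e-5):
    # Batch-normalization stand-in: normalize scores to zero mean, unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def cattn(Q, K, V):
    # Eq. (1): CAttn(Q, K, V) = Sigmoid(BN(Q K^T / sqrt(h'*w'))) V
    # Q, K, V: (c', n) with n = h' * w' flattened spatial positions.
    n = Q.shape[1]
    weights = sigmoid(bn(Q @ K.T / np.sqrt(n)))   # (c', c') channel attention weights
    return weights @ V                            # (c', n)

def cmultihead(M, heads):
    # Eq. (2): each head has its own 1x1-conv projections (channel-mixing
    # matrices here); head outputs are concatenated along the channel axis.
    outs = [cattn(Wq @ M, Wk @ M, Wv @ M) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs, axis=0)           # (c' * t, n)

def cmha(M, heads, W_aggr):
    # Eq. (3): Aggr(.) reduces c'*t channels back to c with a 1x1 convolution
    # (layer normalization omitted for brevity).
    return W_aggr @ cmultihead(M, heads)          # (c, n)

# Toy shapes (hypothetical): c = 8 input channels, c' = 4 per head, t = 2 heads,
# a feature map of h' = 3, w' = 5 flattened to n = 15 positions.
c, c_head, t, n = 8, 4, 2, 15
M = rng.standard_normal((c, n))
heads = [tuple(rng.standard_normal((c_head, c)) for _ in range(3)) for _ in range(t)]
W_aggr = rng.standard_normal((c, c_head * t))

out = cmha(M, heads, W_aggr)
print(out.shape)   # (8, 15): same shape as the input map
```

Note that, unlike standard scaled dot-product attention, the weights here come from a sigmoid rather than a softmax, so each channel is gated independently instead of competing across channels.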
To solve the imbalance between examples, the dice loss [39] is used for the segmentation output:

L_Dice = (1/N) Σ_{i=1}^{N} (1 − 2(1 − o_i) t_i / (o_i + t_i))    (6)

The weighted sum of the focal loss, dice loss, and exist loss is then used as the total loss function of CTLane, where α, β, γ ≥ 0:

L = α·L_Focal + β·L_Dice + γ·L_Exist    (7)

4. EXPERIMENTS AND RESULTS

4.2 Evaluation metrics

On TuSimple, there are three official assessment metrics: accuracy, False Positive (FP) rate, and False Negative (FN) rate. The accuracy is defined as

Accuracy = Σ_clip C_clip / Σ_clip S_clip,

where C_clip is the number of correctly predicted lane points (predicted points within a range of 20 pixels around the ground-truth points), and S_clip is the total number of ground-truth points in each clip. However, TuSimple seems to have become saturated for many modern methods; hence we add the F1-score, the harmonic mean of precision and recall, to evaluate the performance of the model.
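The accuracy metric can be sketched in a few lines; the data layout below (per-clip arrays of x-coordinates, with NaN marking missing points) and the sample values are our own illustrative assumptions, not the official evaluation script:

```python
import numpy as np

def tusimple_accuracy(preds, gts, thresh=20.0):
    """TuSimple-style accuracy: a predicted lane point counts as correct when it
    lies within `thresh` pixels of the ground-truth point at the same row.
    preds, gts: lists of per-clip arrays of x-coordinates (NaN = missing point)."""
    correct, total = 0, 0
    for pred, gt in zip(preds, gts):
        valid = ~np.isnan(gt)                     # rows where ground truth exists
        total += int(valid.sum())
        hit = valid & ~np.isnan(pred) & (np.abs(pred - gt) < thresh)
        correct += int(hit.sum())
    return correct / total

# Toy clip (hypothetical data): 5 ground-truth points, one prediction off by 30 px.
gt = [np.array([100., 120., 140., 160., 180.])]
pred = [np.array([102., 118., 141., 190., 181.])]
print(tusimple_accuracy(pred, gt))  # 0.8 — 4 of 5 points within 20 px
```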
Table 2 – Comparison with state-of-the-art results on the CULane dataset with IoU threshold = 0.5. For Crossroad, only FP counts are shown.

Method | BackBone | Total | Normal | Crowded | Dazzle | Shadow | No line | Arrow | Curve | CrossR | Night
UFLD[31] | ResNet34 | 72.30 | 90.70 | 70.20 | 59.50 | 69.30 | 44.40 | 85.70 | 69.50 | 2037 | 66.70
SCNN[12] | VGG16 | 71.60 | 90.60 | 69.70 | 58.50 | 66.90 | 43.40 | 84.10 | 64.40 | 1990 | 66.10
MCA-UFLD[42] | ResNet18 | 69.36 | 88.90 | 67.28 | 55.79 | 63.87 | 39.75 | 82.50 | 56.26 | 1741 | 63.30
STLNet[43] | Swin | 73.60 | 91.80 | 70.20 | 65.90 | 69.30 | 48.80 | 85.30 | 67.50 | – | 68.20
SAD[44] | ResNet101 | 71.80 | 90.70 | 70.00 | 59.90 | 67.00 | 43.50 | 84.40 | 65.70 | 2052 | 66.30
PINet[45] | Hourglass4 | 74.40 | 90.30 | 72.30 | 66.30 | 68.40 | 49.80 | 83.70 | 65.60 | 1427 | 67.70
E2E[26] | ERFNet | 74.00 | 91.00 | 73.10 | 64.50 | 74.10 | 46.60 | 85.80 | 71.90 | 2022 | 67.90
RESA[46] | ResNet50 | 75.31 | 92.10 | 73.10 | 69.20 | 72.80 | 47.70 | 88.30 | 70.30 | 1503 | 69.90
LaneATT[47] | ResNet18 | 75.13 | 91.17 | 72.71 | 65.82 | 68.03 | 49.13 | 87.82 | 63.75 | 1020 | 68.58
Ours:
CTLane | ResNet34 | 74.31 | 91.49 | 72.68 | 67.31 | 68.33 | 45.27 | 87.60 | 68.94 | 1542 | 69.69
CTLane | DLA34 | 75.39 | 92.24 | 73.48 | 66.87 | 74.18 | 46.46 | 88.20 | 69.23 | 1672 | 71.20
In some difficult scenes with more severe occlusions, such as Crowded, Shadow, Crossroad, and Night, our method can still successfully infer the correct lanes. It is noticeable that CTLane with DLA34 achieves a large improvement in the major scenarios, as shown in Fig. 5.

Our method shows that on some occasions where the lanes are crowded and close to each other, it still separates the lanes well. As shown in Fig. 5, even if the lanes in the images are very close, our method still distinguishes them successfully.

Table 3 – Comparison results on the TuSimple dataset.

Method | BackBone | F1(%) | Acc(%) | FP(%) | FN(%)
PolyLaneNet[32] | EfficientNetB0 | 90.62 | 93.36 | 9.42 | 9.33
FastDraw[49] | ResNet50 | 94.44 | 94.90 | 5.90 | 5.20
SCNN[12] | VGG16 | 95.97 | 96.53 | 6.17 | 1.80
RESA[50] | ResNet18 | 96.84 | 96.84 | 3.25 | 2.67
E2E[26] | ERFNet | 96.25 | 96.02 | 3.21 | 4.28
LaneATT[47] | ResNet34 | 96.06 | 96.10 | 5.64 | 2.17
FOLOLane[51] | ERFNet | 96.59 | 96.92 | 4.47 | 2.28
CondLaneNet[52] | ResNet101 | 97.24 | 96.54 | 2.01 | 3.50
Ours:
CTLane | ResNet34 | 97.54 | 96.49 | 2.01 | 2.90
CTLane | DLA34 | 97.45 | 96.50 | 2.14 | 2.96

Table 4 – Comparison results on the BDD100K dataset.

Method | BackBone | Acc(%) | IoU(%)
SCNN[12] | VGG16 | 35.79 | 15.84
SAD[44] | ENet | 36.56 | 16.02
HWLane[53] | ResNet34 | 73.93 | 33.25
YOLOPv8[54] | CSPDarknet | 84.90 | 28.80
HybridNets[55] | EfficientNetB3 | 85.40 | 31.60
Ours:
CTLane | ResNet34 | 84.64 | 26.12
CTLane | DLA34 | 85.55 | 26.68
BDD100K. The official lane segmentation labels are given for both sides of the lane. We draw the complete lane mask for training and keep the original lanes in the val set. We use pixel classification accuracy and lane IoU as evaluation metrics. The model is validated on the val set, and the results are shown in Table 4.

To explain the effectiveness of our method more visually, we show the qualitative results of our model and other models on the CULane dataset in Fig. 5. Traditional lane detection methods cannot identify lane markings well in dark-night, shadowed, and strong-light situations, so their final predictions break the continuity of the lanes. In contrast, our model solves this problem well through attention. The results of our model show more robustness, and introducing our convolutional attention rather than the traditional segmentation module gives the network a stronger ability to capture structured prior objects.

4.4 Ablation

We have designed a series of ablation experiments to analyse the effectiveness of the different components of our model.

Table 5 – Experiments of the proposed modules on the TuSimple dataset with a ResNet-34 backbone.

Baseline | CNN transformer | Fusion decoder | F1
✓ | – | – | 94.57
✓ | ✓ | – | 96.93
✓ | – | ✓ | 96.65
✓ | ✓ | ✓ | 97.54

Overall ablation study. We first investigated the effectiveness of the CNN transformer and the fusion decoder components. As a baseline, we choose ResNet-34 as the backbone to extract features, and then we use a feature pyramid to aggregate multiscale features to construct
Figure 5 – Example results from CULane dataset with ResNet50, SCNN and CTLane. It indicates that CTLane is immune to interference caused by
dark night, shadows, strong light, etc.
an encoder. We adopt deconvolution and bilinear interpolation in the decoder to up-sample the feature map and finally output the segmentation. We integrate the CNN transformer and the fusion decoder respectively. The F1-scores are summarized in Table 5. We can see that both components greatly improve the lane detection performance, which proves their effectiveness.

Ablation study on the generality of the CNN transformer. To validate the generalizability and stability of the CNN transformer, we implanted it into the Spatial Convolutional Neural Network (SCNN) and Ultra Fast structure-aware deep Lane Detection (UFLD). Fig. 6 shows that the models with the CNN transformer gain a significant improvement in accuracy. It also indicates that the CNN transformer helps convergence and is easily embedded into existing frameworks.

Figure 6 – CNN transformer applied to different models: F1(%) on the TuSimple Lane Detection Challenge over training epochs (40–100) for SCNN and UFLD, with and without the CNN transformer.

4.5 Edge computing deployment

To demonstrate the practicality and efficiency of CTLane in real-world automotive applications, we deployed the lane detection model on the iFlytek U-car (Fig. 7), an intelligent vehicle equipped with an NVIDIA Jetson Nano edge-computing device. The Jetson Nano, with its compact design and energy-efficient performance, is well suited for embedded systems in autonomous driving and Advanced Driver Assistance Systems (ADAS). This deployment aimed to evaluate the model's real-time performance, resource efficiency, and detection accuracy in a practical automotive environment.

Implementation details:

• Hardware setup: The iFlytek U-car is powered by an NVIDIA Jetson Nano, featuring a 128-core Maxwell GPU and a quad-core ARM CPU. This hardware configuration provides a balance of computational power and energy efficiency, making it ideal for edge-based lane detection tasks.
• Software environment: The model was optimized with TensorRT, NVIDIA's high-performance deep learning inference library, to maximize inference speed and minimize memory usage. The framework was implemented in PyTorch, and the model was quantized to FP16 precision to further enhance computational efficiency without compromising detection accuracy.
• Input resolution: To ensure real-time performance on the Jetson Nano, the input resolution was set to 352×640, striking a balance between detection accuracy and computational load.

Figure 7 – The iFlytek U-car, equipped with edge-computing capabilities powered by the NVIDIA Jetson Nano, used for real-time perception of lane marks.

Performance metrics. Inference speed: On the Jetson Nano, the model achieved an average inference speed of 8–10 FPS (frames per second), close to meeting the real-time requirements for lane detection in automotive applications.

Resource utilization: GPU utilization remained below 75%, indicating that the model is lightweight and leaves sufficient computational resources for other concurrent tasks, such as object detection or path planning.

Real-world testing. We conducted extensive real-world testing on urban roads and highways using the iFlytek U-car to evaluate the model's performance under diverse driving conditions. The results showed that CTLane performs exceptionally well in challenging scenarios, including low-light environments, shadows, and occlusions caused by other vehicles or road obstacles. The model consistently maintained lane continuity and accuracy, proving its reliability for real-world deployment in ADAS and autonomous driving systems.

5. CONCLUSION

In this paper, we present CTLane, a novel lane detection method that integrates two key components: a CNN transformer and a fusion decoder. The CNN transformer introduces a convolution-based self-attention mechanism, which significantly improves computational efficiency by leveraging 1 × 1 convolutions instead of traditional matrix multiplications. This approach not only accelerates training convergence but also addresses the challenges of integrating transformer and CNN architectures. The fusion decoder effectively combines low-level local features from shallow layers with high-level semantic information generated by the CNN transformer. This dual-level fusion enables the network to capture both fine-grained details and global lane structures, enhancing the robustness and accuracy of lane detection.

The proposed method demonstrates strong generalization capabilities and can be seamlessly integrated into existing frameworks. Extensive experiments on three benchmark datasets (CULane, TuSimple, and BDD100K) show that CTLane achieves close to state-of-the-art performance, particularly in challenging scenarios such as low-light conditions, shadows, and occlusions. The results highlight the method's ability to maintain lane continuity and accuracy even under adverse conditions. To further validate the practicality of CTLane, we deployed the model on the iFlytek U-car, an intelligent vehicle platform equipped with an NVIDIA Jetson Nano edge-computing device. The deployment demonstrated the model's close-to-real-time performance.

Future work will focus on further optimizing the model for real-time applications and exploring its potential in other computer vision tasks. To enhance real-time performance, we will investigate dynamic network pruning, hybrid-precision quantization, and hardware-customized acceleration strategies. Additionally, we aim to extend the framework to multi-task scenarios and cross-modal systems (e.g., fusing camera and LiDAR inputs). Improving robustness under extreme conditions (e.g., fog, heavy rain) through adversarial training or synthetic data augmentation will also be prioritized. The CTLane framework provides a robust foundation for advancing lane detection technologies, contributing to the development of safer and more reliable autonomous driving systems.
REFERENCES

[1] Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. "Recent progress in road and lane detection: a survey". In: Machine Vision and Applications 25.3 (2014), pp. 727–745.
[2] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. "Spatial As Deep: Spatial CNN for Traffic Scene Understanding". In: AAAI Conference on Artificial Intelligence. 2017.
[3] Tu Zheng, Hao Fang, Yi Zhang, Wenjian Tang, Zheng Yang, Haifeng Liu, and Deng Cai. "RESA: Recurrent Feature-Shift Aggregator for Lane Detection". In: arXiv:2008.13719 (2021).
[4] Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. "Towards End-to-End Lane Detection: an Instance Segmentation Approach". In: arXiv:1802.05591 (2018).
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need". In: arXiv:1706.03762 (2017).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, pp. 4171–4186.
[7] Linhao Dong, Shuang Xu, and Bo Xu. "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[8] Andrew Brown, Cheng-Yang Fu, Omkar Parkhi, Tamara L. Berg, and Andrea Vedaldi. "End-to-end visual editing with a generatively pre-trained artist". In: European Conference on Computer Vision. Springer, 2022, pp. 18–35.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". In: arXiv:2010.11929 (2021).
[10] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. "End-to-End Object Detection with Transformers". In: Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12346. Springer, 2020, pp. 213–229.
[11] Qinglong Zhang and Yubin Yang. "ResT: An Efficient Transformer for Visual Recognition". In: arXiv:2105.13677 (2021).
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, pp. 4171–4186.
[15] Linhao Dong, Shuang Xu, and Bo Xu. "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[16] Long Zhuang, Tiezhen Jiang, Meng Qiu, Anqi Wang, and Zhixiang Huang. "Transformer generates conditional convolution kernels for end-to-end lane detection". In: IEEE Sensors Journal (2024).
[17] Shengqi Wang, Junmin Liu, Xiangyong Cao, Zengjie Song, and Kai Sun. "Polar R-CNN: End-to-End Lane Detection with Fewer Anchors". In: arXiv:2411.01499 (2024).
[18] Qinglong Zhang and Yu-Bin Yang. "ResT: An Efficient Transformer for Visual Recognition". In: Advances in Neural Information Processing Systems 34 (2021), pp. 15475–15485.
[19] Der-Hau Lee and Jinn-Liang Liu. "End-to-end deep learning of lane detection and path prediction for real-time autonomous driving". In: Signal, Image and Video Processing 17.1 (2023), pp. 199–205.
[20] P. Santhiya, Immanuel JohnRaja Jebadurai, Getzi Jeba Leelipushpam Paulraj, A. Jenefa, S. Kiruba Karan, et al. "Deep Vision: Lane Detection in ITS: A Deep Learning Segmentation Perspective". In: 2024 Second International Conference on Inventive Computing and Informatics (ICICI). IEEE, 2024, pp. 21–26.
[21] Seyed Rasoul Hosseini, Hamid Taheri, and Mohammad Teshnehlab. "ENet-21: An Optimized Light CNN Structure for Lane Detection". In: arXiv:2403.19782 (2024).
[22] Swati Jaiswal and B. Chandra Mohan. "Deep learning-based path tracking control using lane detection and traffic sign detection for autonomous driving". In: Web Intelligence 22.2 (2024), pp. 185–207.
[23] Chenguang Li, Boheng Zhang, Jia Shi, and Guangliang Cheng. "Multi-level domain adaptation for lane detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 4380–4389.
[24] Yongqi Dong, Sandeep Patil, Bart van Arem, and Haneen Farah. "A hybrid spatial–temporal deep learning architecture for lane detection". In: Computer-Aided Civil and Infrastructure Engineering 38.1 (2023), pp. 67–86.
[25] Jiao Zhan, Jingnan Liu, Yejun Wu, and Chi Guo. "Multi-task visual perception for object detection and semantic segmentation in intelligent driving". In: Remote Sensing 16.10 (2024), p. 1774.
[26] Seungwoo Yoo, Hee Seok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, and Duck Hoon Kim. "End-to-end lane marker detection via row-wise classification". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020, pp. 1006–1007.
[27] Xinyu Zhang, Yan Gong, Jianli Lu, Zhiwei Li, Shixiang Li, Shu Wang, Wenzhuo Liu, Li Wang, and Jun Li. "Oblique convolution: A novel convolution idea for redefining lane detection". In: IEEE Transactions on Intelligent Vehicles 9.2 (2023), pp. 4025–4039.
[28] Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Xiangyu Zhang, and Hengshuang Zhao. "GroupLane: …
[12] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and End-to-end 3d lane detection with channel-wise grouping”. In:
Xiaoou Tang. “Spatial as deep: Spatial cnn for traffic scene IEEE Robotics and Automation Letters (2024).
understanding”. In: Proceedings of the AAAI conference on artificial [29] Zhongyu Yang, Chen Shen, Wei Shao, Tengfei Xing, Runbo Hu,
intelligence. Vol. 32. 1. 2018. Pengfei Xu, Hua Chai, and Ruini Xue. “LDTR: Transformer-based
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, lane detection with anchor-chain representation”. In: Computa-
Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. tional Visual Media 10.4 (2024), pp. 753–769.
“Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[30] Shaofei Huang, Zhenwei Shen, Zehao Huang, Zi-han Ding, Jiao Dai, Jizhong Han, Naiyan Wang, and Si Liu. “Anchor3dlane: Learning to regress 3D anchors for monocular 3D lane detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 17451–17460.

[31] Zequn Qin, Pengyi Zhang, and Xi Li. “Ultra Fast Deep Lane Detection With Hybrid Anchor Driven Ordinal Classification”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 46.5 (May 2024), pp. 2555–2568. issn: 0162-8828. doi: 10.1109/TPAMI.2022.3182097.

[32] Lucas Tabelini, Rodrigo Berriel, Thiago M Paixao, Claudine Badue, Alberto F De Souza, and Thiago Oliveira-Santos. “Polylanenet: Lane estimation via deep polynomial regression”. In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE. 2021, pp. 6150–6156.

[33] Ruijin Liu, Zejian Yuan, Tie Liu, and Zhiliang Xiong. “End-to-end lane shape prediction with transformers”. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021, pp. 3694–3702.

[34] Yunqian Fan, Xiuying Wei, Ruihao Gong, Yuqing Ma, Xiangguo Zhang, Qi Zhang, and Xianglong Liu. “Selective focus: investigating semantics sensitivity in post-training quantization for lane detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 11. 2024, pp. 11936–11943.

[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition”. In: arXiv:1512.03385 [cs] (Dec. 10, 2015).

[36] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature Pyramid Networks for Object Detection”. In: arXiv:1612.03144 [cs] (Apr. 19, 2017).

[37] Gang Li, Di Xu, Xing Cheng, Lingyu Si, and Changwen Zheng. “SimViT: Exploring a Simple Vision Transformer with sliding windows”. In: arXiv:2112.13085 [cs] (Dec. 24, 2021).

[38] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal Loss for Dense Object Detection”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 42.2 (Feb. 1, 2020), pp. 318–327. issn: 0162-8828. doi: 10.1109/TPAMI.2018.2858826.

[39] Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. “Dice Loss for Data-imbalanced NLP Tasks”. In: arXiv:1911.02855 [cs] (Aug. 29, 2020).

[40] TuSimple. TuSimple lane detection benchmark. 2017.

[41] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. “BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling”. In: CoRR abs/1805.04687 (2018). arXiv: 1805.04687.

[42] Lingyun Han, Kun Xu, Wensheng Hu, and Zhanwen Liu. “Lane Detection Method Based on MCA-UFLD”. In: 2023 IEEE 8th International Conference on Intelligent Transportation Engineering (ICITE). IEEE. 2023, pp. 146–152.

[43] Yufeng Du, Rongyun Zhang, Peicheng Shi, Linfeng Zhao, Bin Zhang, and Yaming Liu. “ST-LaneNet: lane line detection method based on Swin Transformer and LaneNet”. In: Chinese Journal of Mechanical Engineering 37.1 (2024), p. 14.

[44] Yuenan Hou, Zheng Ma, Chunxiao Liu, and Chen Change Loy. “Learning lightweight lane detection CNNs by self attention distillation”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 1013–1021.

[45] Yeongmin Ko, Younkwan Lee, Shoaib Azam, Farzeen Munir, Moongu Jeon, and Witold Pedrycz. “Key points estimation and point instance segmentation approach for lane detection”. In: IEEE Transactions on Intelligent Transportation Systems 23.7 (2021), pp. 8949–8958.

[46] Tu Zheng, Hao Fang, Yi Zhang, Wenjian Tang, Zheng Yang, Haifeng Liu, and Deng Cai. “RESA: Recurrent feature-shift aggregator for lane detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 4. 2021, pp. 3547–3554.

[47] Xu Cao, Weisheng Liu, and Zhijian Wang. “Adaptive ROI Optimization Pyramid Network: Lane Detection for FSD under Data Uncertainty”. In: Engineering Letters 33.2 (2025).

[48] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. “Deep Layer Aggregation”. Jan. 4, 2019. arXiv: 1707.06484 [cs].

[49] Jonah Philion. “FastDraw: Addressing the long tail of lane detection by adapting a sequential prediction network”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 11582–11591.

[50] Dan Zhang, Guolv Zhu, Shibo Lu, and Chang Li. “Lane Detection Based on Improved RESA in Power Plant”. In: 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA). IEEE. 2024, pp. 108–112.

[51] Zhan Qu, Huan Jin, Yang Zhou, Zhen Yang, and Wei Zhang. “Focus on local: Detecting lane marker from bottom up via key point”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 14122–14130.

[52] Lizhe Liu, Xiaohao Chen, Siyu Zhu, and Ping Tan. “CondLaneNet: a top-to-down lane detection framework based on conditional convolution”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 3773–3782.

[53] Jing Zhao, Zengyu Qiu, Huiqin Hu, and Shiliang Sun. “HWLane: HW-transformer for lane detection”. In: IEEE Transactions on Intelligent Transportation Systems (2024).

[54] Shuyan Wang, Ya Liu, and Feng Zhang. “A Multi-Task Autonomous Driving Environment Perception Network Based on CA-YOLOPv8”. In: 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). IEEE. 2024, pp. 1–9.

[55] Dat Vu, Bao Ngo, and Hung Phan. “HybridNets: End-to-End Perception Network”. In: arXiv e-prints (2022), arXiv–2203.

AUTHORS

Mian Zhou received a doctoral degree from the Department of Computer Science, University of Reading, UK. He is currently a senior associate professor at the School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University. His research interests include computer vision, image processing, and pattern recognition.

Guoqiang Zhu is a graduate student at Tianjin University of Technology, specializing in computer vision and machine learning. His research focuses on leveraging deep learning techniques for image recognition.