
ITU Journal on Future and Evolving Technologies, Volume 6, Issue 2, June 2025

Article

CTLane: An end-to-end lane detector by a CNN transformer and fusion decoder for edge computing
Mian Zhou 1, Guoqiang Zhu 3, Zhikun Feng 2, Haoyi Lian 1, Siqi Huang 1

1 School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University, Suzhou 215412, China; 2 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610000, Sichuan, China; 3 Tianjin University of Technology, Tianjin 300382, China

Corresponding author: Mian Zhou, [Link]@[Link]

In advanced driving assistance systems and autonomous vehicles, lane detection plays a
crucial role in ensuring the safety and stability of the vehicle during driving. While deep
learning-based lane detection methods can provide accurate pixel-level predictions, they can
struggle to interpret lanes as a whole in the presence of interference. To address this issue, we
have developed a method that includes two components: a convolutional neural network
transformer and a fusion decoder. The CNN transformer extracts the overall semantics of the
lanes and speeds up convergence, while the fusion decoder combines high-level semantics
with low-level local features to improve accuracy and robustness. By using these two
components together, our method is able to effectively detect lanes in a variety of conditions,
even when interference is present. We tested our method on multiple lane datasets and
obtained superior results, with the best performance on the BDD100K dataset. Our method
has successfully addressed the challenge of accurately and completely detecting lanes in the
presence of interference, such as darkness, shadows, and strong light. The algorithm has been
employed in an edge computing device, an intelligent cart. The code has been made available
at: [Link]

Keywords: Convolution, deep learning, lane detection, transformer

1. INTRODUCTION
Lane detection is a crucial component of the perception phase in Advanced Driver
Assistance Systems (ADAS) and autonomous driving systems, playing an essential role
in ensuring vehicle safety and guiding driving paths. Unlike human drivers, who rely on vision
and experience to judge lanes, lane detection systems use algorithms to accurately identify lane
markings and road boundaries [1]. Vehicles acquire information about the road and its
surroundings through visual sensors (e.g., cameras) and they feed this data into deep
Convolutional Neural Networks (CNNs) to extract and analyse key features, providing
reliable foundational data for subsequent path planning and driving decisions.

With the rapid advancement of deep learning, many research methods [2, 3, 4] treat lane
detection as a segmentation task and employ end-to-end neural network frameworks.
During segmentation, the model must focus on every pixel and predict its category.
However, this pixel-wise processing makes it challenging for the model to treat the lane
line as a whole, leading to a loss of lane semantics in feature maps. This problem is
exacerbated under unfavourable lighting conditions, such as when shadows are cast on
lanes, vehicles block the lanes, or during nighttime driving, significantly impacting model
performance. Analysing lane feature maps generated by deep learning models reveals
that the network does not always focus on the lane regions during feature extraction,
which limits its detection performance. To address this, we propose introducing an
attention mechanism to enhance the network’s focus on key regions, thereby improving
the overall performance of lane detection.

© International Telecommunication Union, 2025


Some rights reserved.
This work is available under the CC BY-NC-ND 3.0 IGO license: [Link]
More information regarding the license and suggested citation, additional permissions and disclaimers is available at:
[Link]

A transformer [5] is a deep learning technique that has been widely applied in natural language processing [6], speech processing [7], and vision tasks [8]. It excels at parsing deep semantic information from images [9], making it highly promising for lane detection tasks.

In recent years, the Vision Transformer (ViT) [9, 10] has achieved remarkable results in image classification tasks. Compared to traditional CNNs, ViT has a better ability to parse deep semantic information from images [9]. However, since a transformer uses fully connected layers and weight matrices to propagate and transfer global information, directly embedding a transformer into existing CNN architectures is difficult. ResT [11] was the first to combine a transformer and a CNN into a unified model, and it has provided some inspiration for our work.

Traditional transformers use large matrices as training parameters, leading to a significant number of parameters and slow training convergence. Furthermore, the advantages of convolution are not fully exploited in hybrid models. To address these issues, we propose replacing matrices with convolutions for computing attention on feature maps, thereby reducing the number of parameters and overcoming the heavy training weights and slow training speeds associated with traditional transformers.

In this study, we propose a method called CTLane, which combines the powerful semantic extraction capability of transformers with the efficiency of traditional CNNs. By introducing CNNs into the structure of transformers, the model not only significantly improves training speed but also enhances generalization ability. The CTLane model incorporates a multi-head attention mechanism to effectively extract features in the image space, enabling the model to focus more precisely on global lane features in the feature map, thereby predicting more complete, smoother, and continuous lane lines, as shown in Fig. 1. We conducted a comprehensive evaluation of CTLane on several benchmark lane detection datasets, and the results demonstrate that the model maintains high lane detection accuracy while achieving a higher F1 score and a lower false detection rate.

Our main contributions are summarized as follows:

• We propose a CNN transformer to extract high-level semantics of lanes and introduce a novel method to compute self-attention.
• The CNN transformer combines the advantages of a CNN and a transformer to better aggregate spatial features, and it can be easily deployed after the feature extraction stage of any CNN.
• We propose the fusion decoder, which aggregates high-level semantic features and low-level local features, effectively preserving the original features in images and restoring information in the decoder.
• We achieve state-of-the-art accuracy on the BDD100K dataset and tier-1 performance on the Tusimple and CULane datasets.

Figure 1 – The left-side (a) and (b) show the self-attention maps of the CTLane method on different channels, while the right-side (c) and (d) display the corresponding original feature maps. The comparison demonstrates that, under the influence of our CNN transformer module, the lane features are significantly enhanced, leading to the prediction of more complete, smooth, and continuous lane lines.

2. RELATED WORK

Within autonomous driving and ADAS applications, lane departure is one of the main causes of traffic accidents, highlighting the importance of lane detection. With the rapid development of deep learning technology, lane detection has gradually shifted from traditional feature-based methods to deep learning-based methods [12]. Traditional lane detection methodologies predominantly depend on manually designed feature extraction techniques, such as color segmentation, texture analysis, and edge detection, and they subsequently employ post-processing techniques like the Hough transform or Kalman filtering to extract lane lines. However, these methods perform poorly in complex scenarios, such as changes in lighting, occlusion, and complex road structures.

Upon reviewing a substantial body of literature, we observe that the transformer [13] is a novel deep learning technique commonly employed in natural language processing [14] and speech processing [15]. Unlike traditional CNNs, the transformer, through its self-attention mechanism, is capable of capturing global information within images, thereby excelling in contextual modeling and the capture of long-range dependencies. Several studies [16] [17] have significantly enhanced the global modeling capabilities of lane detection by incorporating a transformer architecture; for example, by generating conditional convolutional kernel parameters and integrating a row-by-row classification strategy, they have achieved high-precision and efficient lane detection. ResT [18], the first attempt to combine a transformer and a CNN into a unified model, has provided valuable insights for our research.

Currently, the main methods in lane detection are deep learning-based. They can be divided into four categories: semantic segmentation, row-wise classification, anchor-based, and curve fitting.

2.1 Semantic segmentation-based methods

Lane detection methods based on semantic segmentation achieve precise differentiation between lanes and background by transforming the task into a pixel-level classification problem and leveraging deep neural networks. Typical approaches, such as UNet and its variants [19] [20], employ encoder-decoder architectures combined with multiscale feature fusion to enhance accuracy. ENet-21 [21] introduces lightweight convolutions and affinity field techniques, maintaining high performance while reducing model complexity. GANs further improve feature extraction capabilities through adversarial learning [22]. To address domain adaptation challenges, the MLDA framework [23] optimizes at the pixel, instance, and category levels, while the integration of spatiotemporal information [24] enhances detection stability through hybrid spatiotemporal architectures. Additionally, the combination of semantic segmentation and anchor-based detection [25] achieves superior generalization and real-time performance in multi-task perception.

2.2 Row-wise classification-based methods

Lane detection has been further streamlined by the classic CNN row-by-row classification method [26], which handles image features on a per-row basis. High-precision and efficient lane detection has been achieved by some studies [16] [27] through the generation of conditional convolutional kernel parameters and the integration of a row-by-row classification strategy. GroupLane [28] has implemented efficient 3D lane detection by employing a channel grouping strategy and a row classification head design, combined with Bird's-Eye View (BEV) features and a Self-Organizing Map (SOM) mechanism.

2.3 Anchor-based methods

Anchor-based methods extract lane features through predefined anchors and generate candidate lanes based on these anchors, significantly improving efficiency and accuracy. For instance, some studies utilize anchor-chain representations to model lanes and enhance the model's perception of lane instances through multi-reference deformable attention [29]. Furthermore, to reduce computational costs, the introduction of local and global polar coordinate modules decreases the number of anchors, while a triplet detection head enables end-to-end detection without NMS, enhancing performance in dense scenarios [17]. In 3D lane detection, the definition of 3D anchors combined with iterative regression and global optimization avoids the complexity of traditional bird's-eye-view transformations [30]. Hybrid anchor-driven ordinal classification further reduces computational costs and improves localization accuracy [31].

2.4 Curve fitting-based methods

Lane detection methods based on curve fitting model lanes as continuous curves. For instance, PolyLaneNet [32] utilizes deep polynomial regression to directly output the polynomial coefficients of lane markings, achieving accuracy comparable to existing methods while maintaining a high frame rate (115 FPS). Another study [33] enhances global context modeling through a transformer network, directly regressing a parameterized lane shape model and thus avoiding the intermediate segmentation steps and post-processing involved in traditional methods. Furthermore, the selective focus framework [34] introduces a lane distortion score to quantify the impact of quantization errors, thereby further enhancing detection performance.
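As a minimal illustration of the curve-fitting formulation, a lane can be compressed into a handful of polynomial coefficients fit to detected lane points. The sketch below uses NumPy's least-squares polynomial fit; the sample points and the polynomial degree are illustrative assumptions, not taken from any cited method.

```python
import numpy as np

# Illustrative lane points in image coordinates: one x per sampled row y.
ys = np.array([300.0, 340.0, 380.0, 420.0, 460.0, 500.0])
xs = np.array([410.0, 398.0, 389.0, 383.0, 380.0, 379.0])

# Fit x = f(y) as a 2nd-degree polynomial; curve-fitting detectors regress
# a few coefficients per lane instead of a per-pixel mask.
coeffs = np.polyfit(ys, xs, deg=2)

# The compact representation: three numbers describe the whole lane,
# and the lane position can be queried at any row.
lane_x_at = np.poly1d(coeffs)
print(coeffs.shape, float(lane_x_at(400.0)))
```

The design trade-off this sketches is the one discussed above: a parametric curve is cheap to regress and post-process, but quantization and model mismatch introduce the distortion errors that frameworks such as [34] attempt to score.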


3. PROPOSED METHOD

The model is described as follows: given an input image I ∈ R^{C×H×W}, the goal of CTLane is to output lane maps O ∈ R^{N×H×W}, where N denotes the maximum number of scheduled lanes. Our overall model structure is shown in Fig. 2. The encoder consists of a backbone network used to extract lane features from the images. We use ResNet-34 [35] as the backbone; the output of its last convolution layer becomes the input of a Feature Pyramid Network (FPN) [36], which fuses the multiscale features from the backbone, and the output of the FPN becomes the input of the convolution attention (convolution transformer) module we designed, which further processes the high-level features from the FPN. In the decoder, the shallow semantics and the deeper information are fused together. The final lane segmentation feature map predicts the presence and probability distribution of each channel, followed by binary classification. This section presents the details of our model: first an overview of the model structure, followed by a detailed presentation of the convolution attention module and the decoder, respectively.

Figure 2 – The overall structure of the network, divided into three parts: encoder, CNN transformer, and decoder. The encoder extracts the image features, the CNN transformer extracts the deep semantic information, and the decoder restores the feature map to the original input size.

3.1 Encoder

The encoder consists of a backbone network used to extract lane features from the images. We use ResNet-34 [35] as the backbone, and the output of the last convolution layer becomes the input of the Feature Pyramid Network (FPN) [36], which fuses the multiscale features from the backbone; the output of the FPN becomes the input of the convolution attention (convolution transformer) module we designed. During encoding, we keep the features of the shallow layers in the feature extraction network to ensure that the decoder can receive the top-level feature information.

3.2 CNN transformer

Due to the limitation of convolution operations to local receptive fields, traditional feature representations make each pixel's features rely solely on its local region, lacking global contextual information. This limitation hinders the effective modeling of relationships between different parts of the image.

To address this, we propose a feature interaction mechanism. This mechanism fully integrates the features within each channel, transforming them into more compact and globally enriched representations, enabling each pixel to incorporate global contextual information. Feature interaction means the features are fully blended into a denser representation. In the attention module, we need each pixel to contain global information, so we need to blend all the features of each channel. In this submodule, the main task is to first scale the input so that it maintains a moderate computational scale, and then to use a Fully Connected (FC) layer and reshaping to reduce the feature into a new two-dimensional matrix.

In the feature interaction, we can map the feature M_{c×h×w} to M_{c×h′×w′} with one FC layer. To preserve location information, we create a positional code P_{c×h×w} and embed it into the original feature, M = M + P. The position encoding is constructed with sine or cosine functions and added to the original signal.

Most transformers applied in vision tasks cut the input image into a series of tokens that are linearly mapped to the dimension M_{n×d}, which breaks the original global spatial information. In contrast, we directly use the features extracted from the image and preserve their spatial structure as much as possible. Moreover, d is often larger than n when applying attention in vision, so we can effectively reduce the computational complexity of attention.

In the original transformer [5], to keep position information between word embeddings in the mapping process, position encoding is introduced by using sine or cosine functions to construct the value of each position, so that the positional relationships are encoded into the embeddings during computation. As its extension in vision, ViT [9] uses patches as image tokens, encoding the position of each token so that the corresponding position information is maintained when projected onto a 2D image.

As shown in Fig. 3, at the end of the CNN transformer structure, we use similar operations to map the features back to the original dimensions.

Figure 3 – The specific structure of the CNN transformer, in which the mapping matrices are replaced by different convolution blocks. The MHA does not scale the channels of each Q, K, V, but rather concatenates all the channels back to the original channels in the final stage. We also use sigmoid instead of softmax.

3.2.1 Convolution attention blocks

The traditional transformer typically divides the input image into a series of tokens and flattens them into one-dimensional vectors to compute the global relationships among all tokens. However, this approach ignores the two-dimensional spatial structure of the image, which can lead to the loss of spatial information [37]. In the self-attention operation, the input matrix M_{n×d_in} is first linearly transformed to generate the query matrix Q_{n×d_m}, key matrix K_{n×d_m}, and value matrix V_{n×d_m}, where d_q = d_k = d_v = d_m. In CTLane, however, we replace the linear mapping operations in the attention block with Convolution Attention (CAttn) blocks using C_Q(·), C_K(·), C_V(·), which compute query, key, and value matrices equivalent to the linear mapping version of transformers through 1 × 1 convolution, ReLU, and softmax. In the CAttn block, the input M_{c×h′×w′} is transformed into the outputs Q_{c′×h′×w′}, K_{c×h′×w′}, and V_{c′×h′×w′} through convolution. We compute the dot product of the query with all keys, divide each dot product by √(h′ · w′), introduce batch normalization to normalize the computed attention scores, and then apply the sigmoid function to obtain the weights for the values. We calculate the output matrix with the following formula:

    CAttn(Q, K, V) = Sigmoid(BN(QKᵀ / √(h′ · w′))) V,    (1)

where BN(·) is the batch normalization operation and Sigmoid(·) is the sigmoid function.

In the CAttn block, we take advantage of the fact that different channels represent different image features: a linear mapping of each channel's features is performed using convolution. This gives a more significant speed-up than linearly mapping the entire information.

3.2.2 Multi-head attention

In traditional multi-head self-attention mechanisms, the query, key, and value are typically mapped to different feature spaces through linear projections, and parallel attention computations are then performed to capture information from different subspaces. However, each head in MHA attends only to its own subspace, which makes the attention heads neglect each other. Our method obtains the attention map through convolution at reduced computational cost, enabling the use of the complete channel to effectively merge the multi-head attentions.

Our approach introduces convolution operations to generate Q, K, and V. In the multi-head attention mechanism, we utilize different convolution operations C_Q^i(·), C_K^i(·), and C_V^i(·) to generate the heads Q_i, K_i, and V_i, where i ∈ {1, 2, . . . , t} and t represents the number of attention heads. After obtaining Q_i, K_i, and V_i for each head, we perform parallel attention computations for each set and concatenate the results of all heads using the Concat(·) operation, ultimately producing the aggregated attention map. The computational process can be expressed by the following formula:

    CMultiHead(M̂) = Concat(head_1, . . . , head_t),
    head_i = CAttn(C_Q^i(M̂), C_K^i(M̂), C_V^i(M̂)),    (2)

where t is the number of attention heads.

3.2.3 Channels aggregation

To process the attention feature maps generated by MHA, we use convolution to aggregate them for subsequent computation. It is simple and efficient to aggregate the attention maps obtained from the different attention heads using convolution. We aggregate the multi-headed features M_{c′·t×h′×w′} obtained from the previous step using Concat(·) and output them as M_{c×h′×w′}. The function Aggr(·) represents the concatenated feature map, with the dimension c · t × h′ × w′ reduced to c × h′ × w′ by 1 × 1 convolution and layer normalization. The entire process can be described as:

    CMHA(x) = Aggr(CMultiHead(M̂)).    (3)

3.2.4 Time complexity analysis of MHA and CMHA

In the MHA mapping stage, M_{n×d} is mapped to M_{n×d} with a k × k convolution kernel, and its total complexity is O(n²k²d). In CTLane, we use a 1 × 1 kernel instead of k × k for the mapping operation, so the mapping complexity is O(n²d). The MHA and Concat matrix M_{nt×d} is compressed into M_{n×d}, and the complexity is O(tn²d). Hence, compressing the multi-headed attention Concat matrix from M_{nt×d} to M_{n×d} entirely by convolution has complexity O(tn²d), so the total complexity is O(n²d + tn²d). In this case, where d > n, the complexity of CMHA(·) is lower than that of MHA(·).

In terms of the number of training parameters, a traditional transformer requires 3 × l × d parameters to generate Q, K, V for the input features F = M_{1×l} in one attention operation, while generating Q, K, V with 1 × 1 convolutions requires 3 × n × 1 parameters, where n denotes the number of convolutions used for Q, K, V.

In addition, CMHA(·) is better suited to image inputs: an image is a 2D matrix, and convolution can extract the spatial features from it, which fits the attention mechanism well.

3.3 Fusion decoder

The decoder component mainly reorganizes multiple feature maps by deformation and reconstructs the segmentation output by reshaping. The detailed structure of the fusion decoder is shown in Fig. 4. Since deep networks may lose small or tiny targets during feature extraction, incorporating shallow features preserves global information and prevents losing these types of features.

We use two features from the lower layers of the backbone, C_f × H_1 × W_1 and C_f × H_2 × W_2, and one feature C_3 × H_3 × W_3 after the CNN transformer. The number of channels of all features is changed to C_f. The first and second lower-layer features are then scaled to the CNN transformer's output feature size H_3 × W_3 by linear interpolation. Finally, the features F are fused by the Concat function, with the shape 3C_f × H_3 × W_3.

To make the decoder arbitrarily scalable, we add one channel scale layer. The feature map F is scaled to S² · N × H_3 × W_3 and then resized to N × H_o × W_o, where C_i × H_i × W_i, 0 < i < 4, i ∈ N, denotes the number of feature channels in the ith layer, N denotes the number of output channels, H_o, W_o denote the height and width of the output image, and S denotes the scaling multiplier satisfying S · H_3 = H_o and S · W_3 = W_o.

Figure 4 – The main structure of the decoder: the shallow features obtained in the feature extraction layer and the deep features are resampled to the same size by interpolation, and then scaled to the target size by convolution.

3.4 Losses

To cope with real-life situations where there are intersections or partial turnoffs without lanes, we introduce an additional exist loss to determine whether lanes exist. A branch of the decoder outputs the possibility of lanes in the current image, and a Binary Cross-Entropy (BCE) style loss is used to calculate the loss, where l_i is the target value of the lane status and e_i is the softmax output:

    L_Exist = −(1/N) Σ_i [l_i · e_i + (1 − l_i) · (1 − e_i)]    (4)

For our binary segmentation branch, we use a weighted binary cross-entropy loss; to better handle the imbalanced examples of the segmentation task, we use a loss function improved on the basis of focal loss [38]. The loss is calculated as

    L_Focal = −(1/N) Σ_i [t_i (1 − o_i)^λ · log(o_i) + (1 − t_i) o_i^λ · log(1 − o_i)]    (5)

where t_i is the target value of pixel i, o_i is the softmax output, and the focal coefficient satisfies λ ≥ 0; the loss becomes the standard cross-entropy loss function when λ = 0.
scale layer. The feature map F is scaled to S2 · N × H3 × W3 standard cross-entropy loss function when λ = 0.


To solve imbalance between examples, the dice loss [39] 4.2 Evaluation metrics
is used for the segmentation output:
In Tusimple, there are three official assessment measure-
N
1 X 2(1 − oi )ti ment: accuracy, False Positive (FP), and False Negative
LDice = (1 − ) (6) (FN). The accuracy is defined as
N oi + ti
i=1
P
clip Cclip
Accuracy = .
The weighted sum focal loss, dice loss and exist loss is sumclip Sclip
then used as the total loss function of CTLane, where where Cclip is the number of correctly predicted lane
α, β, γ ≥ 0: points (predicated points within the range of 20 pixels
around ground truth points), and Sclip is the total number
L = αLFocal + βLDice + γLExist (7) of ground truth points in each clip. However, Tusimple
seems to become more saturated for many modern meth-
ods nowadays. Hence we add F1-score to evaluate the
4. EXPERIMENTS AND RESULTS performance of the model, which is defined as

To compare the method we proposed, three widely used 2 × precision × recall


F1score = ,
lane detection datasets are adopted in the experiment. precision + recall
They are CULane [2], Tusimple Lane [40], and BDD100K
in which precision and recall are defined
[41]. The CULane contains 55 hours of video, including
urban and highway scenes, with nine different scenes Cclip
including normal, crowd, curve, dazzling night, night, precision =
Cclip + Fclip
etc. The Tusimple Lane was collected with stable light
conditions on highways, where there are major differences between the various types of data. BDD100K contains a large number of different scenes; its road images vary in weather, scene, lighting and other factors. The details of the datasets are shown in Table 1.

4.1 Implementation

To augment the training data, we use random affine transforms, random horizontal flips, color shifts and other techniques to generate more examples. In the encoder, we use the pre-trained ResNet [35] and DLA [48] as the backbone to extract multiscale features. For TuSimple, all images are resized to 352 × 640 and feature maps are generated at four scales: 88 × 160, 44 × 80, 22 × 40 and 11 × 20. The latter two feature maps are fused by FPN into a 22 × 40 feature map, which is then passed into a CNN transformer with 3 heads and 6 attention layers. The fused feature map is fused again with the first two feature maps and passed into the decoder to obtain the segmentation output with a size of 352 × 640.

The optimizer we use is Stochastic Gradient Descent (SGD) with the learning rate set to 0.1, momentum set to 0.9, and weight decay set to 0.0001. The scheduler uses cosine annealing with the step set to 5 and warmup set to 3. TuSimple is trained for 200 epochs, CULane for 40 epochs, and BDD100K for 60 epochs. For the combined loss function, γ = 0.1, α = 1.0 and β = 0.5. We train the model on NVIDIA 1080 Ti and 4080 GPUs.

    recall = C_clip / (C_clip + M_clip)

F_clip is the number of lane points predicted incorrectly and M_clip is the number of ground-truth points missed in each clip.

For the CULane dataset, the official suggestion is to evaluate precision, F1 and recall. Each channel is treated as a 30-pixel-wide lane, and Intersection over Union (IoU) is computed between predictions and ground truths. Where the predicted IoU is greater than the 0.5 threshold, the prediction is marked as a True Positive (TP). The evaluation functions are defined as:

    F1 = (2 × Accuracy × Recall) / (Accuracy + Recall)

    Accuracy = TP / (TP + FP)

    Recall = TP / (TP + FN)

4.3 Comparison

TuSimple. We have conducted experiments several times on TuSimple to show its performance. Since accuracy on TuSimple is becoming saturated, we mainly focus on the F1 score. Table 3 shows that CTLane with a ResNet34 backbone achieves a significant improvement in F1 score compared to other methods, and its FP value also outperforms the others.
CULane. For the harsher CULane dataset, the quantitative results are shown in Table 2. They indicate that for
©International Telecommunication Union, 2025 155



Table 1 – Dataset type

Dataset    Scenario              Road Type                  Frames    Train    Resolution  Max Lanes
Tusimple   light traffic, day    highway                    6,408     3,236    1280×720    5
CULane     night, day, traffic   urban, rural and highway   133,235   88,880   1640×590    4
BDD100K    light traffic, day    highway                    100,042   58,269   1280×717    4
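As a concrete illustration of the training schedule described in Section 4.1 (base learning rate 0.1, 3 warmup epochs, cosine annealing, 200 epochs on TuSimple), the per-epoch learning rate can be sketched in plain Python. The linear-warmup shape and the annealing-to-zero floor are our assumptions; the paper does not state them explicitly:

```python
import math

# Hyperparameters taken from Section 4.1; the schedule shape is assumed.
BASE_LR = 0.1
WARMUP_EPOCHS = 3
TOTAL_EPOCHS = 200  # TuSimple

def learning_rate(epoch: int) -> float:
    """Linear warmup for the first epochs, then cosine annealing to zero."""
    if epoch < WARMUP_EPOCHS:
        # ramp 0 -> BASE_LR over the warmup epochs
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * progress))
```

The rate peaks at 0.1 when warmup ends and decays smoothly toward zero by the final epoch.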

Table 2 – Comparison with state-of-the-art results on CULane dataset with IoU threshold = 0.5. For crossroad, only FP are shown.

Type          BackBone    Total  Normal  Crowded  Dazzle  Shadow  No line  Arrow  Curve  CrossR  Night
UFLD[31]      ResNet34    72.30  90.70   70.20    59.50   69.30   44.40    85.70  69.50  2037    66.70
SCNN[12]      VGG16       71.60  90.60   69.70    58.50   66.90   43.40    84.10  64.40  1990    66.10
MCA-UFLD[42]  ResNet18    69.36  88.90   67.28    55.79   63.87   39.75    82.50  56.26  1741    63.30
STLNet[43]    Swin        73.60  91.80   70.20    65.90   69.30   48.80    85.30  67.50  –       68.20
SAD[44]       ResNet101   71.80  90.70   70.00    59.90   67.00   43.50    84.40  65.70  2052    66.30
PINet[45]     Hourglass4  74.40  90.30   72.30    66.30   68.40   49.80    83.70  65.60  1427    67.70
E2E[26]       ERFNet      74.00  91.00   73.10    64.50   74.10   46.60    85.80  71.90  2022    67.90
RESA[46]      ResNet50    75.31  92.10   73.10    69.20   72.80   47.70    88.30  70.30  1503    69.90
LaneATT[47]   ResNet18    75.13  91.17   72.71    65.82   68.03   49.13    87.82  63.75  1020    68.58
Ours
CTLane        ResNet34    74.31  91.49   72.68    67.31   68.33   45.27    87.60  68.94  1542    69.69
CTLane        DLA34       75.39  92.24   73.48    66.87   74.18   46.46    88.20  69.23  1672    71.20
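The F1 score behind these comparisons follows directly from the lane-level TP/FP/FN counts in the evaluation protocol above (a prediction counts as a TP when its IoU with the 30-pixel-wide ground-truth lane exceeds 0.5). A minimal sketch, keeping the paper's naming where "Accuracy" plays the role of precision:

```python
def culane_f1(tp: int, fp: int, fn: int) -> float:
    """F1 from lane-level counts. A predicted lane is a TP when its
    IoU with the 30-pixel-wide ground-truth mask is above 0.5."""
    accuracy = tp / (tp + fp) if tp + fp else 0.0  # precision, in the paper's notation
    recall = tp / (tp + fn) if tp + fn else 0.0
    if accuracy + recall == 0.0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)
```

For instance, 90 correct lanes with 10 false positives and 10 misses gives accuracy 0.9, recall 0.9, and therefore F1 = 0.9.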

Table 3 – Comparison results on TuSimple dataset.

Method           BackBone        F1(%)  Acc(%)  FP(%)  FN(%)
PolyLaneNet[32]  EfficientNetB0  90.62  93.36   9.42   9.33
FastDraw[49]     ResNet50        94.44  94.90   5.90   5.20
SCNN[12]         VGG16           95.97  96.53   6.17   1.80
RESA[50]         ResNet18        96.84  96.84   3.25   2.67
E2E[26]          ERFNet          96.25  96.02   3.21   4.28
LaneATT[32]      ResNet34        96.06  96.10   5.64   2.17
FOLOLane[51]     ERFNet          96.59  96.92   4.47   2.28
CondLaneNet[52]  ResNet101       97.24  96.54   2.01   3.50
Ours
CTLane           ResNet34        97.54  96.49   2.01   2.90
CTLane           DLA34           97.45  96.50   2.14   2.96

some difficult scenes with more severe occlusions, such as Crowded, Shadow, CrossRoad and Night, our method can still successfully infer the correct lanes. Notably, CTLane with DLA34 achieves a large improvement in the major scenarios, as shown in Fig. 5.

BDD100K. The official lane segmentation labels are given for both sides of each lane. We draw the complete lane mask for training and keep the original lanes in the val set. We use pixel classification accuracy and lane IoU as evaluation metrics. The model is finally validated on the val set, and the results are shown in Table 4.

Table 4 – Comparison results on BDD100K dataset.

Method          BackBone        Acc(%)  IoU(%)
SCNN[12]        VGG16           35.79   15.84
SAD[44]         ENet            36.56   16.02
HWLane[53]      ResNet34        73.93   33.25
YOLOPv8[54]     CSPDarknet      84.90   28.80
HybridNets[55]  EfficientNetB3  85.40   31.60
Ours
CTLane          ResNet34        84.64   26.12
CTLane          DLA34           85.55   26.68

To illustrate the effectiveness of our method more visually, we show the qualitative results of our model and other models on the CULane dataset in Fig. 5. Traditional lane detection methods cannot identify lane markings well at dark night, in shadow or under strong light, so their final predictions break the continuity of the lanes. In contrast, our model solves this problem well through attention. The results of our model show more robustness, and introducing our convolutional attention rather than a traditional segmentation module gives the network a stronger ability to capture structured prior objects. Even where the lanes are crowded and close to each other, our method still separates them well. As shown in Fig. 5, even when the lanes in an image are very close, our method distinguishes them successfully.

4.4 Ablation

We have designed a series of ablation experiments to analyse the effectiveness of the different components of our model.

Table 5 – Experiments of the proposed modules on TuSimple dataset with ResNet-34 backbone.

Baseline  CNN transformer  Fusion decoder  F1
   ✓                                       94.57
                ✓                          96.93
                                ✓          96.65
                ✓               ✓          97.54

Overall ablation study. We first investigated the effectiveness of the CNN transformer and the fusion decoder components. As a baseline, we choose ResNet-34 as the backbone to extract features, and then we use a feature pyramid to aggregate multiscale features to construct


Ground Truth          ResNet50          SCNN          CTLane

Figure 5 – Example results from CULane dataset with ResNet50, SCNN and CTLane. It indicates that CTLane is immune to interference caused by dark night, shadows, strong light, etc.

an encoder. We adopt deconvolution and bilinear interpolation in the decoder to up-sample the feature maps and finally output the segmentation. We then integrate the CNN transformer and the fusion decoder respectively. The F1 scores are summarized in Table 5. We can see that both components greatly improve lane detection performance, which proves their effectiveness.

Figure 6 – CNN transformer applied on different models: F1(%) against training epoch on the TuSimple Lane Detection Challenge for SCNN, SCNN with CNN transformer, UFLD, and UFLD with CNN transformer.

Ablation study on the generality of the CNN transformer. To validate the generalizability and stability of the CNN transformer, we implanted it into the Spatial Convolutional Neural Network (SCNN) and Ultra Fast structure-aware deep Lane Detection (UFLD). Fig. 6 shows that the models with the CNN transformer achieve a significant improvement in accuracy. It also indicates that the CNN transformer speeds up convergence and is easily embedded into existing frameworks.

4.5 Edge computing deployment

To demonstrate the practicality and efficiency of CTLane in real-world automotive applications, we deployed the lane detection model on the iFlytek U-car shown in Fig. 7, an intelligent vehicle equipped with an NVIDIA Jetson Nano edge-computing device. The Jetson Nano, with its compact design and energy-efficient performance, is well suited to embedded systems in autonomous driving and Advanced Driver Assistance Systems (ADAS). This deployment aimed to evaluate the model's real-time performance, resource efficiency, and detection accuracy in a practical automotive environment.

Implementation details:

• Hardware setup: The iFlytek U-car is powered by an NVIDIA Jetson Nano, featuring a 128-core Maxwell GPU and a quad-core ARM CPU. This hardware configuration provides a balance of computational power and energy efficiency, making it ideal for edge-based lane detection tasks.

• Software environment: The model was optimized using TensorRT, NVIDIA's high-performance deep learning inference library, to maximize inference speed and minimize memory usage. The framework was implemented in PyTorch, and the model was quantized to FP16 precision to further enhance computational efficiency without compromising detection accuracy.

• Input resolution: To ensure real-time performance on the Jetson Nano, the input resolution was set to 352 × 640, striking a balance between detection accuracy and computational load.

Figure 7 – The iFlytek U-car, equipped with edge-computing capabilities powered by the NVIDIA Jetson Nano, is used for real-time perception of lane marks.

Performance metrics. Inference speed: on the Jetson Nano, the model achieved an average inference speed of 8–10 FPS (frames per second), close to meeting the real-time requirements for lane detection in automotive applications.

Resource utilization: GPU utilization remained below 75%, indicating that the model is lightweight and leaves sufficient computational resources for other concurrent tasks, such as object detection or path planning.

Real-world testing. We conducted extensive real-world testing on urban roads and highways using the iFlytek U-car to evaluate the model's performance under diverse driving conditions. The results showed that CTLane performs exceptionally well in challenging scenarios, including low-light environments, shadows, and occlusions caused by other vehicles or road obstacles. The model consistently maintained lane continuity and accuracy, proving its reliability for real-world deployment in ADAS and autonomous driving systems.

5. CONCLUSION

In this paper, we present CTLane, a novel lane detection method that integrates two key components: a CNN transformer and a fusion decoder. The CNN transformer introduces a convolution-based self-attention mechanism, which significantly improves computational efficiency by leveraging 1 × 1 convolutions instead of traditional matrix multiplications. This approach not only accelerates training convergence but also addresses the challenges of integrating transformer and CNN architectures. The fusion decoder effectively combines low-level local features from shallow layers with high-level semantic information generated by the CNN transformer. This dual-layer fusion enables the network to capture both fine-grained details and global lane structures, enhancing the robustness and accuracy of lane detection.

The proposed method demonstrates strong generalization capabilities and can be seamlessly integrated into existing frameworks. Extensive experiments on three benchmark datasets (CULane, TuSimple and BDD100K) show that CTLane achieves close to state-of-the-art performance, particularly in challenging scenarios such as low-light conditions, shadows, and occlusions. The results highlight the method's ability to maintain lane continuity and accuracy even under adverse conditions. To further validate the practicality of CTLane, we deployed the model on the iFlytek U-car, an intelligent vehicle platform equipped with an NVIDIA Jetson Nano edge-computing device. The deployment demonstrated the model's close-to-real-time performance.

Future work will focus on further optimizing the model for real-time applications and exploring its potential in other computer vision tasks. To enhance real-time performance, we will investigate dynamic network pruning, hybrid precision quantization, and hardware-customized acceleration strategies. Additionally, we aim to extend the framework to multi-task scenarios and cross-modal systems (e.g., fusing camera and LiDAR inputs). Improving robustness under extreme conditions (e.g., fog, heavy rain) through adversarial training or synthetic data augmentation will also be prioritized. The CTLane framework provides a robust foundation for advancing lane detection technologies, contributing to the development of safer and more reliable autonomous driving systems.
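To make the convolution-based self-attention idea above concrete, the following is a minimal single-head sketch in NumPy in which the query/key/value projections are 1 × 1 convolutions (per-position linear maps) rather than dense layers over flattened tokens. This is our illustrative reconstruction, not the authors' implementation, which uses 3 heads and 6 stacked attention layers:

```python
import numpy as np

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """A 1x1 convolution is just a per-position linear projection.
    x: (C, H, W) feature map, w: (C_out, C) kernel."""
    c, h, w_ = x.shape
    return (w @ x.reshape(c, h * w_)).reshape(w.shape[0], h, w_)

def conv_self_attention(x, wq, wk, wv):
    """Single-head self-attention whose Q/K/V projections are 1x1
    convolutions instead of dense matrix multiplications over tokens.
    Illustrative sketch only; weights are plain arrays, nothing learned."""
    d = wq.shape[0]
    _, h, w_ = x.shape
    n = h * w_
    q = conv1x1(x, wq).reshape(d, n)   # one query per spatial position
    k = conv1x1(x, wk).reshape(d, n)
    v = conv1x1(x, wv).reshape(d, n)
    scores = q.T @ k / np.sqrt(d)      # (N, N) affinity between positions
    scores -= scores.max(axis=1, keepdims=True)  # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    out = v @ attn.T                   # weighted sum of values per query
    return out.reshape(d, h, w_)
```

With zero key weights the attention is uniform, so every position receives the mean value vector, which is a quick sanity check on the softmax normalization.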


REFERENCES

[1] Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. "Recent progress in road and lane detection: a survey". In: Machine Vision and Applications 25.3 (2014), pp. 727–745.
[2] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. "Spatial As Deep: Spatial CNN for Traffic Scene Understanding". In: AAAI Conference on Artificial Intelligence. 2017.
[3] Tu Zheng, Hao Fang, Yi Zhang, Wenjian Tang, Zheng Yang, Haifeng Liu, and Deng Cai. "RESA: Recurrent Feature-Shift Aggregator for Lane Detection". In: arXiv:2008.13719 (2021).
[4] Davy Neven, Bert De Brabandere, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. "Towards End-to-End Lane Detection: an Instance Segmentation Approach". In: arXiv:1802.05591 (2018).
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need". In: arXiv:1706.03762 (2017).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Proceedings of NAACL-HLT 2019. Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186.
[7] Linhao Dong, Shuang Xu, and Bo Xu. "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[8] Andrew Brown, Cheng-Yang Fu, Omkar Parkhi, Tamara L. Berg, and Andrea Vedaldi. "End-to-end visual editing with a generatively pre-trained artist". In: European Conference on Computer Vision. Springer, 2022, pp. 18–35.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". In: arXiv:2010.11929 (2021).
[10] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. "End-to-End Object Detection with Transformers". In: Computer Vision – ECCV 2020. LNCS 12346. Springer, 2020, pp. 213–229.
[11] Qinglong Zhang and Yubin Yang. "ResT: An Efficient Transformer for Visual Recognition". In: arXiv:2105.13677 (2021).
[12] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, and Xiaoou Tang. "Spatial as deep: Spatial CNN for traffic scene understanding". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems 30 (2017).
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers). 2019, pp. 4171–4186.
[15] Linhao Dong, Shuang Xu, and Bo Xu. "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
[16] Long Zhuang, Tiezhen Jiang, Meng Qiu, Anqi Wang, and Zhixiang Huang. "Transformer generates conditional convolution kernels for end-to-end lane detection". In: IEEE Sensors Journal (2024).
[17] Shengqi Wang, Junmin Liu, Xiangyong Cao, Zengjie Song, and Kai Sun. "Polar R-CNN: End-to-End Lane Detection with Fewer Anchors". In: arXiv:2411.01499 (2024).
[18] Qinglong Zhang and Yu-Bin Yang. "ResT: An efficient transformer for visual recognition". In: Advances in Neural Information Processing Systems 34 (2021), pp. 15475–15485.
[19] Der-Hau Lee and Jinn-Liang Liu. "End-to-end deep learning of lane detection and path prediction for real-time autonomous driving". In: Signal, Image and Video Processing 17.1 (2023), pp. 199–205.
[20] P. Santhiya, Immanuel JohnRaja Jebadurai, Getzi Jeba Leelipushpam Paulraj, A. Jenefa, S. Kiruba Karan, et al. "Deep Vision: Lane Detection in ITS: A Deep Learning Segmentation Perspective". In: 2024 Second International Conference on Inventive Computing and Informatics (ICICI). IEEE, 2024, pp. 21–26.
[21] Seyed Rasoul Hosseini, Hamid Taheri, and Mohammad Teshnehlab. "ENet-21: an optimized light CNN structure for lane detection". In: arXiv:2403.19782 (2024).
[22] Swati Jaiswal and B. Chandra Mohan. "Deep learning-based path tracking control using lane detection and traffic sign detection for autonomous driving". In: Web Intelligence 22.2 (2024), pp. 185–207.
[23] Chenguang Li, Boheng Zhang, Jia Shi, and Guangliang Cheng. "Multi-level domain adaptation for lane detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 4380–4389.
[24] Yongqi Dong, Sandeep Patil, Bart Van Arem, and Haneen Farah. "A hybrid spatial–temporal deep learning architecture for lane detection". In: Computer-Aided Civil and Infrastructure Engineering 38.1 (2023), pp. 67–86.
[25] Jiao Zhan, Jingnan Liu, Yejun Wu, and Chi Guo. "Multi-task visual perception for object detection and semantic segmentation in intelligent driving". In: Remote Sensing 16.10 (2024), p. 1774.
[26] Seungwoo Yoo, Hee Seok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, and Duck Hoon Kim. "End-to-end lane marker detection via row-wise classification". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020, pp. 1006–1007.
[27] Xinyu Zhang, Yan Gong, Jianli Lu, Zhiwei Li, Shixiang Li, Shu Wang, Wenzhuo Liu, Li Wang, and Jun Li. "Oblique convolution: A novel convolution idea for redefining lane detection". In: IEEE Transactions on Intelligent Vehicles 9.2 (2023), pp. 4025–4039.
[28] Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Xiangyu Zhang, and Hengshuang Zhao. "GroupLane: End-to-end 3D lane detection with channel-wise grouping". In: IEEE Robotics and Automation Letters (2024).
[29] Zhongyu Yang, Chen Shen, Wei Shao, Tengfei Xing, Runbo Hu, Pengfei Xu, Hua Chai, and Ruini Xue. "LDTR: Transformer-based lane detection with anchor-chain representation". In: Computational Visual Media 10.4 (2024), pp. 753–769.


[30] Shaofei Huang, Zhenwei Shen, Zehao Huang, Zi-han Ding, Jiao Dai, Jizhong Han, Naiyan Wang, and Si Liu. "Anchor3DLane: Learning to regress 3D anchors for monocular 3D lane detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 17451–17460.
[31] Zequn Qin, Pengyi Zhang, and Xi Li. "Ultra Fast Deep Lane Detection With Hybrid Anchor Driven Ordinal Classification". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 46.5 (2024), pp. 2555–2568.
[32] Lucas Tabelini, Rodrigo Berriel, Thiago M. Paixão, Claudine Badue, Alberto F. De Souza, and Thiago Oliveira-Santos. "PolyLaneNet: Lane estimation via deep polynomial regression". In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 6150–6156.
[33] Ruijin Liu, Zejian Yuan, Tie Liu, and Zhiliang Xiong. "End-to-end lane shape prediction with transformers". In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021, pp. 3694–3702.
[34] Yunqian Fan, Xiuying Wei, Ruihao Gong, Yuqing Ma, Xiangguo Zhang, Qi Zhang, and Xianglong Liu. "Selective focus: investigating semantics sensitivity in post-training quantization for lane detection". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 11. 2024, pp. 11936–11943.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition". In: arXiv:1512.03385 (2015).
[36] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. "Feature Pyramid Networks for Object Detection". In: arXiv:1612.03144 (2017).
[37] Gang Li, Di Xu, Xing Cheng, Lingyu Si, and Changwen Zheng. "SimViT: Exploring a Simple Vision Transformer with sliding windows". In: arXiv:2112.13085 (2021).
[38] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. "Focal Loss for Dense Object Detection". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 42.2 (2020), pp. 318–327.
[39] Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. "Dice Loss for Data-imbalanced NLP Tasks". In: arXiv:1911.02855 (2020).
[40] TuSimple. TuSimple lane detection benchmark. 2017.
[41] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. "BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling". In: arXiv:1805.04687 (2018).
[42] Lingyun Han, Kun Xu, Wensheng Hu, and Zhanwen Liu. "Lane Detection Method Based on MCA-UFLD". In: 2023 IEEE 8th International Conference on Intelligent Transportation Engineering (ICITE). IEEE, 2023, pp. 146–152.
[43] Yufeng Du, Rongyun Zhang, Peicheng Shi, Linfeng Zhao, Bin Zhang, and Yaming Liu. "ST-LaneNet: lane line detection method based on swin transformer and LaneNet". In: Chinese Journal of Mechanical Engineering 37.1 (2024), p. 14.
[44] Yuenan Hou, Zheng Ma, Chunxiao Liu, and Chen Change Loy. "Learning lightweight lane detection CNNs by self attention distillation". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 1013–1021.
[45] Yeongmin Ko, Younkwan Lee, Shoaib Azam, Farzeen Munir, Moongu Jeon, and Witold Pedrycz. "Key points estimation and point instance segmentation approach for lane detection". In: IEEE Transactions on Intelligent Transportation Systems 23.7 (2021), pp. 8949–8958.
[46] Tu Zheng, Hao Fang, Yi Zhang, Wenjian Tang, Zheng Yang, Haifeng Liu, and Deng Cai. "RESA: Recurrent feature-shift aggregator for lane detection". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 4. 2021, pp. 3547–3554.
[47] Xu Cao, Weisheng Liu, and Zhijian Wang. "Adaptive ROI Optimization Pyramid Network: Lane Detection for FSD under Data Uncertainty". In: Engineering Letters 33.2 (2025).
[48] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. "Deep Layer Aggregation". In: arXiv:1707.06484 (2019).
[49] Jonah Philion. "FastDraw: Addressing the long tail of lane detection by adapting a sequential prediction network". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 11582–11591.
[50] Dan Zhang, Guolv Zhu, Shibo Lu, and Chang Li. "Lane Detection Based on Improved RESA in Power Plant". In: 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA). IEEE, 2024, pp. 108–112.
[51] Zhan Qu, Huan Jin, Yang Zhou, Zhen Yang, and Wei Zhang. "Focus on local: Detecting lane marker from bottom up via key point". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 14122–14130.
[52] Lizhe Liu, Xiaohao Chen, Siyu Zhu, and Ping Tan. "CondLaneNet: a top-to-down lane detection framework based on conditional convolution". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 3773–3782.
[53] Jing Zhao, Zengyu Qiu, Huiqin Hu, and Shiliang Sun. "HWLane: HW-transformer for lane detection". In: IEEE Transactions on Intelligent Transportation Systems (2024).
[54] Shuyan Wang, Ya Liu, and Feng Zhang. "A Multi-Task Autonomous Driving Environment Perception Network Based on CA-YOLOPv8". In: 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). IEEE, 2024, pp. 1–9.
[55] Dat Vu, Bao Ngo, and Hung Phan. "HybridNets: End-to-End Perception Network". In: arXiv e-prints (2022).

AUTHORS

Mian Zhou received a Doctor's degree from the Department of Computer Science, University of Reading, UK. He is currently a senior associate professor at the School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University. His research interests include computer vision, image processing, and pattern recognition.

Guoqiang Zhu is a graduate student at Tianjin University of Technology, specializing in computer vision and machine learning. His research focuses on leveraging deep learning techniques for image recognition.


Zhikun Feng is pursuing a PhD at the University of Electronic Science and Technology of China. His research interests include machine learning and computer vision.

Haoyi Lian is an undergraduate student at the School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University. She is majoring in data science and big data technology and is particularly interested in the fields of machine learning, deep learning and computer vision.

Siqi Huang is an assistant professor in the School of AI and Advanced Computing at XJTLU Entrepreneur College (Taicang). He received his Ph.D. in electrical engineering from The University of North Carolina at Charlotte in 2022, and a BEng in software engineering from Sun Yat-sen University. His research interests include AI-driven video streaming system optimization; energy and latency analysis and optimization of AI (DNN) models on mobile devices (TinyML); real-time HD map generation and updates for autonomous driving and software OTA update services for autonomous vehicles; mobile edge computing with embedded AI devices; and mobile AR/VR and human-computer interaction systems.
