Adaptive Traffic Light Control Using Deep Reinforcement Learning Technique
https://doi.org/10.1007/s11042-023-16112-3
Abstract
Smart city growth needs information and communication technology to increase urban sustainability, but it faces critical traffic congestion and vehicle classification issues. It is crucial to dynamically change the traffic lights on the road network to reduce vehicle delay and avoid congestion in the smart city. Modifying the traffic light should be adaptive, considering the number of vehicles on the road and the options available to route the vehicles toward their destinations. Our scheme is the first proposed model based on deep learning to solve the problem of traffic congestion in the urban environment. The model classifies the vehicle types on the road and assigns different vehicle weights: 0.0 for no vehicle, and 1.0, 2.0, and 3.0 for light-weight, moderate-weight, and heavy-weight vehicles, respectively. The proposed model has been trained using experience replay and a target network based on a deep double-Q learning mechanism. The resulting model is applied in a real-time traffic network that uses the Dedicated Short Range Communication (DSRC) protocol for wireless communication. The work is simulated in SUMO (Simulation of Urban MObility), with the traffic data generated on SUMO using a random function. The results show that the traffic light of a given intersection becomes adaptive, aligning with the goals mentioned above, and that the proposed model efficiently reduces the average waiting time at the intersection by up to 91.7%, as shown in the graphs in the result section.
1 Introduction
The growth of vehicles worldwide is making the roads congested, resulting in a signifi-
cant delay for an average vehicle to commute from one place to another. The road traffic
system is still handled and managed in a static round-robin manner despite machine
learning and artificial intelligence techniques. Existing traditional methods such as round-robin and longest-queue-first do not take the classification of different types of vehicles into account, which increases the traffic and makes these algorithms inefficient. Traffic lights are typically used to regulate intersections along busy roads or major highways; however, their ineffective regulation results in a number of issues, including significant energy waste as well as traveler delays. Even worse, inefficient light control can result in auto accidents [17, 23]. Current traffic signal control either deploys fixed programs without taking real-time traffic into account or takes traffic into account only minimally [2]. The fixed programs set the traffic signals to have variable cycle times based on historical data rather than equal cycle times. To determine whether there are any vehicles in front of traffic signals, some control algorithms use input from sensors such as subterranean inductive-loop detectors, but the inputs are processed only very coarsely to estimate the length of the green/red lights [34]. SUMO is a tool for simulating real-time traffic and analyzing the results of the decisions made by the machine learning model. TraCI stands for "Traffic Control Interface"; it gives access to a running road traffic simulation and thereby allows the values of simulated objects to be retrieved and their behavior to be modified "online". Existing traffic signal control systems do function in ordinary conditions, albeit inefficiently; in several other situations, however, such as a sports event or the more typical peak hour of traffic, they become paralyzed. Instead, we frequently see a skilled police officer directly controlling the crossroads by waving signs: in high-traffic situations, such a controller observes the actual traffic on the crossing roads and, utilizing his or her extensive knowledge of the crossing, cleverly decides how long each direction is permitted to pass (Figs. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
This observation prompts us to propose an intelligent intersection traffic signal management system that can learn to operate the intersection like a human operator by taking real-time traffic conditions as input.
Many technologies can help a vehicle communicate with every other vehicle and with the infrastructure nodes placed on the roadside. However, automating traffic flow so that it is adaptive, dynamic, and able to reduce traffic congestion is still a challenging research problem. Machine learning approaches can be used to solve such problems, and several families of machine learning models can address the traffic problems of road networks: (1) supervised learning, (2) semi-supervised learning, (3) unsupervised learning, and (4) reinforcement learning [26].
Reinforcement learning is a type of machine learning in which a model learns from a sequence of actions and then determines which action to take depending on the current state so as to maximize the reward [15, 31]. In simpler terms, reinforcement learning is a goal-oriented algorithm that learns how to attain a complicated final objective (goal) or to maximize a target along a particular dimension. Deep learning algorithms are also available for developing adaptive traffic light control systems, but many existing solutions have not considered the randomness of the traffic and the impact that different kinds of vehicles have on traffic congestion. A DRL method is used in [33] for an adaptive traffic signal control framework that explicitly considers realistic traffic scenarios, sensors, and physical constraints, but it does not consider intersections with left and right turns. A four-phase scheme is therefore considered in [19], which uses PPO (Proximal Policy Optimization) to improve the convergence of the model but fails under multiple-intersection scenarios. Further, a deep actor-critic method is designed in [20] to provide efficient traffic signal plans using a series of temporally sequential images, but it too does not consider the implementation of multiple intersections. Moreover, continuous-valued states cannot be handled by tabular reinforcement learning; in that case, the states have to be passed as input to a function approximator derived from neural networks such as Convolutional Neural Networks (CNN), Artificial Neural Networks (ANN), etc. These machine learning models, in particular, are called Deep Reinforcement Learning.

Fig. 3 The proposed workflow diagram, in which the agent architecture is shown in Fig. 8

Fig. 4 Vehicle state (position) representation: bike (light-weight, 1.0), car (moderate-weight, 2.0), and truck (heavy-weight, 3.0) vehicles are represented in the form of a grid matrix
With the continuous and rapid growth of the automotive industry, tools to automate traffic junctions are still missing. Traffic congestion is a real problem because time has become more critical than ever, and since humans manage the traffic in the present situation, it is painstaking work. Therefore, an automatic and adaptive traffic light control system is required to reduce traffic congestion and vehicle waiting times at traffic crossings.

With the recent developments in artificial intelligence and machine learning, many automation applications are emerging and being utilized in automatic traffic light control. The DSRC protocol provides the security that is crucial for vehicle-to-vehicle (V2V) and vehicle-to-infrastructure communication, and when DSRC is combined with an RL algorithm it can efficiently reduce the average waiting time of vehicles at an intersection even with a low detection rate, thus reducing the travel time of vehicles [41]. The drawback of this method, however, is that it is less efficient with multiple intersections. Further,
other deep learning algorithms such as MADQN (multi-agent deep Q-network) work better and have been investigated to address the curse of dimensionality under traffic network scenarios with high traffic volume and disturbances [28], but they are not as good under abrupt traffic-collision disturbances. Hence, a deep Q-learning model with experience replay and a target network is used in [29] to combat the above problems and also tries to optimize the traffic light cycle using a Markov decision process, but the main drawback of this method is that it fails for complex road network scenarios. None of the above methods has taken advantage of the different weights of different vehicles, which is done in this paper.
The main contributions of our paper are as follows. 1) To the best of our knowledge, this paper is the first to implement an algorithm that extracts the different vehicles in the traffic of a specific intersection and classifies them into categories by considering their weights and their impact on traffic delays; different weights are assigned to the different vehicle types (light-weight = 1, moderate-weight = 2, heavy-weight = 3, and no vehicle = 0) for controlling the traffic lights. 2) The deep reinforcement learning algorithm, together with a target network and experience replay, improves efficiency and makes the simulation more realistic. 3) Further, this paper explores and solves the automation of dynamic and adaptive traffic lights for a traffic intersection using reinforcement learning, and the complete model is deployed into a simulation using SUMO and the Traffic Control Interface (TraCI). 4) A Markov decision process is used to make the traffic light adaptive and dynamic; the agent decides on the basis of the rewards it obtains through previous steps and of learning from the environment. 5) The last important contribution of this paper is that DSRC vehicle-to-infrastructure (V2I) communication is used to know the exact location of each vehicle, and that location data is then used for a substantial reduction in traffic congestion.

Fig. 9 The left part shows the deep reinforcement learning model for traffic light control; the right part shows the phase transition diagram. Taken from the multi-agent work [37]

Fig. 10 Average waiting time for the different methods used in the base paper [18] during training, where the x-axis represents the number of episodes and the y-axis represents the average waiting time (vehicles' staying time at the intersection) in seconds
The remainder of the paper is organized as follows. Section 2 describes related work and Section 3 the problem definition. The background and the proposed methodology are described in Sections 4 and 5, respectively. Section 6 describes the implementation and results, and Section 7 presents the conclusion. The declarations and acknowledgements are given in Sections 8 and 9, respectively.
2 Related work
With the recent developments in artificial intelligence and machine learning, many automation applications are emerging, and traffic light control is among them. Security concerns, a crucial blocker for V2V and V2I communication, have also been addressed through DSRC. Thus the data and the technical tools are available to automate adaptive traffic light management, through which less traffic congestion and a smaller time delay for vehicles waiting at a particular traffic intersection can be achieved.

Deep learning algorithms for adaptive traffic light control systems are available today. However, the randomness of the traffic and the presence of different kinds of vehicles still impact traffic congestion, and the methods and algorithms proposed until now have not considered these aspects; that is why they lag behind real-world traffic. This section discusses the problems that have been solved by deploying deep reinforcement learning, in which an adaptive agent is built that diminishes the traffic congestion of a particular traffic junction.
In [8], a state called DTSE (Discrete Traffic State Encoding) has been proposed: a vector holding, for each cell, whether the cell contains a vehicle, the speed of the vehicle if it does, and the current traffic light. Experience replay has been used in the deep Q-learning agent, with a CNN containing one hidden layer as the function approximator. It has been found that the agent can reduce the average cumulative delay by 82%, the average queue length by 66%, and the travel time by 20% [8, 9]. "Further, a deep Q-learning
algorithm is proposed to learn the Q-function of the traffic system state inputs and the corresponding traffic system outputs in" [14]. The benefits provided by the system over regular traffic system control are also analyzed. The queue length is used as the state and stacked autoencoders as the function approximator. The suggested reward is the difference between the flows in two directions, which is minimized over the sampled training data.
In [35], the position is also proposed as the state for the deep Q-learning algorithm, which can then be used to lessen traffic congestion and subsequently help the traffic lights to be adaptive and intelligent. The approach is to collect the data, divide the entire intersection into smaller grids, and quantify complex traffic scenarios into states. "A traffic light's timing adjustments are the acts that are modeled as a high-dimension Markov decision process" [16]. "Experience replay and a target network are used in the deep reinforcement learning algorithm in" [5, 6], in which the velocity and position of the vehicles in the traffic network are used as the input states. The reward used there is the change in the cumulative delay time of vehicles in the traffic network, and convolutional neural networks are used as function approximators. The result is compared with two other signal-control algorithms, the longest-queue-first algorithm and the fixed-time rotation algorithm.
In [41], the effect of partial detection of the vehicles present at the traffic junction is explored. The idea is that one cannot have every vehicle united under the same communication source, and hence only those vehicles technically equipped with vehicle-to-infrastructure communication are used. The system can reduce the vehicles' accumulated delay time at traffic intersections even if the detection rate is low.
A summary comparing the proposed work with previous related work, showing the improvement obtained by the proposed algorithm, is given in Tables 1, 2 and 3.
3 Problem definition
In recent years, there have been many advancements in the Q-learning approach, such as experience replay and the target network. This paper addresses the following problem: the vehicles on the road are first classified by type and assigned corresponding weights; the agent is then trained, based on a deep double-Q learning mechanism with experience replay, so that it can be deployed in a real-time traffic network that uses DSRC for communication between vehicles and infrastructure, making the traffic light of a given intersection adaptive in line with the agent's goal.
4 Background
In this section, the background of the proposed solution is discussed, along with the action, state, and reward function.
Table 1 Comparison of related work (author and reference; objective; technique; simulation tool; drawback)

Seyed Sajad Mousavi (2017) [23]. Objective: traffic light control using DRL. Technique: policy-gradient and value-function based. Simulation tool: SUMO. Drawback: not applicable for multi-agent cases.

Rusheng Zhang (2020) [41]. Objective: traffic light control using partial detection. Technique: DSRC. Simulation tool: SUMO. Drawback: not applicable for more than 5 intersections.

Li, Li (2016) [14]. Objective: traffic light control within appropriate time. Technique: deep reinforcement learning algorithm. Simulation tool: SUMO. Drawback: does not work in an unstructured traffic format.

Xiaorong Hu (2020) [10]. Objective: dynamic traffic light control using GNN. Technique: GPlight technique. Simulation tool: SUMO. Drawback: not effective in long and heavy traffic.

Hua Wei (2018) [36]. Objective: traffic light control on real-time data. Technique: deep reinforcement learning tested with real-time data. Simulation tool: SUMO. Drawback: not suitable for multi-phase traffic lights.

Bingquan Yu (2020) [38]. Objective: traffic light control. Technique: DDPG-based DRL technique. Simulation tool: SUMO. Drawback: not suitable for large-scale road networks.

Deepeka Garg (2018) [7]. Objective: autonomous traffic light control. Technique: policy-based gradient. Simulation tool: SUMO. Drawback: does not work in dynamic and diverse traffic.

Dongfang Ma (2021) [20]. Objective: to develop a deep actor-critic method that can provide efficient traffic signal plans. Technique: deep reinforcement learning with a series of temporally sequential images. Simulation tool: DRL. Drawback: does not explicitly consider the implementation on multiple intersections.

Kai Liang Tan (2019) [33]. Objective: to propose a DRL-based adaptive traffic signal control framework that explicitly considers realistic traffic scenarios, sensors, and physical constraints. Technique: deep reinforcement learning in low- and high-traffic scenarios. Simulation tool: DRL. Drawback: extending the DRL framework towards intersections with left and right turns and arterial corridors is still needed.

Zibo Ma (2021) [19]. Objective: urban intersection traffic light timing optimization. Technique: Proximal Policy Optimization (PPO) is used to improve the convergence speed of the model. Simulation tool: SUMO. Drawback: the designed traffic light scheme uses the classic four-phase scheme and does not design multiple-phase schemes for tidal traffic flow.

Muhammad Saleem (2022) [30]. Objective: to provide innovative services to drivers that enable a remote view of the traffic flow and the volume of vehicles available on the road, with the intention of avoiding traffic jams. Technique: a fusion-based intelligent traffic congestion control system for VNs (FITCCS-VN) using ML techniques that collects traffic data and routes traffic on available routes to alleviate traffic congestion. Simulation tool: FITCCS-VN using ML techniques. Drawback: the system accuracy may be improved further by using federated learning and AlexNet.
Table 2 Comparison of related work (author and reference; objective; technique; simulation tool; drawback)

Saeed Maadi (2022) [21]. Objective: to develop a real-time RL (reinforcement learning)-based adaptive traffic signal control that optimizes a signal plan to minimize the total queue length. Technique: an RL technique for CAVs (Connected and Automated Vehicles). Simulation tool: PTV VISSIM microsimulation platform. Drawback: offset optimization could be added to the signal timing optimization to reduce the computational time of the RL training.

Zahra Zeinaly (2023) [40]. Objective: to develop a reliable controller for such an environment and investigate the resilience of these controllers to a variety of environmental disruptions, such as accidents, weather conditions, or special events. Technique: deep Q-learning with experience replay. Simulation tool: deep Q-learning and SUMO. Drawback: not suitable for a complex network, nor for automated vehicles.

Alfonso Navarro-Espinoza (2022) [24]. Objective: to predict the traffic flow at an intersection. Technique: machine-learning (ML) and deep-learning (DL) algorithms are used for predicting traffic flow at an intersection, thus laying the groundwork for adaptive traffic control, either by remote control of the traffic lights or by applying an algorithm that adjusts the timing according to the predicted flow. Simulation tool: Multilayer Perceptron Neural Network (MLP-NN). Drawback: does not give much better results on a four-lane cross-section road.
4.1 Markov decision process

A Markov decision process (MDP) is a mathematical model that helps make better decisions based on the environmental situation; it models the environment through which an agent is driven toward a desired state. An MDP is built on a set of environment states, a set of actions among which the agent has to choose, a reward function that determines a reward for a given state and action, and a transition function that determines how the environment changes when a specified action is taken in a specified state. The Markov property is satisfied only if the transition function T depends solely on the current state s and the action a taken in it; in simple terms, the probability of a transition from a state s to s′ must depend only on s and a. This can be written mathematically as [32]:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) \quad (1)

Mathematically, a Markov decision process can be formulated as a four-tuple <S, A, R, T>, where S is the set of states, A the set of actions, R the reward function, and T the transition function.
"The goal of the agent is to prioritize the short-term goal at first, compared to the long-
term goal. While time progresses, it has discounted the reward for every next step by a fac-
tor of 𝛾 . This can be represented by the following mathematical expression" [32].
∞
∑
Rt = 𝛾 k rt + k + 1 (2)
k=0
where 𝛾 is a discount factor such that 0 < 𝛾 ≤ 1, meaning that future rewards are discounted exponentially. "To align with the plan, the agent finds a policy 𝜋, which is a strategy to select an action a with the state s as input. Now there are two types of policies" [35], deterministic and stochastic.
"Now a V function, is defined as the expected return of reaching a state s’ from a state s
following a policy 𝜋 , mathematically V function can be expressed as" [32].
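A standard form of this definition, consistent with Eq. (2) and following [32], is:

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right]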
Here the prospect of dynamic programming comes into the picture, which helps achieve the optimal policy 𝜋*, given that the transition function T and the reward function R are known. Value iteration is one of the planning algorithms that finds the optimal policy.
4.1.1 Partial observability
In some cases, the agent cannot determine s′ given a state s and an action a, but can only observe a proxy of the state, called an observation o. This is called partial observability [11].
4.2 Tabular Q‑learning
When the states and actions are discrete, every state-action pair can be mapped to an entry of a table. Every state-action pair has a so-called Q-value, which the agent/model consults whenever it needs to act according to the policy 𝜋. In traditional tabular Q-learning [3], the model uses this look-up table to maximize the reward function, keeping in mind the long-term reward aligned with the policy 𝜋. Since the Q-values in the look-up table are not available upfront, the agent iteratively updates the Q-value estimates in the look-up table, and these estimates converge after enough samples.
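A standard form of this tabular update, following [3] and with learning rate 𝛼, is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]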
As the state and action domains grow, there are scenarios where state-action pairs cannot be enumerated as in tabular Q-learning, because the states are not discrete everywhere. Hence the problem arises of how the model should choose the exact action for a given input state; in this case, the Q-value cannot be determined through a look-up table.
Here a function approximator helps to determine the Q-function with the help of learned weights 𝜃. The weights of the function approximator can be updated so that they converge to a specific value while following a policy 𝜋. Generally, the mean squared error between the current estimate and the target, which is defined from the true Q-value of the state-action pair under policy 𝜋, Q^𝜋, is minimized.
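A common form of this objective, in which the target r + 𝛾 max_{a′} Q(s′, a′; 𝜃) is held fixed during each update, is:

L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^{2} \right]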
There are some issues regarding the convergence of the function approximator in this use case, which are [35]:

Correlation between consecutive samples: while sampling in this kind of uncertain environment, a lot of correlation occurs because successive data inputs are sampled, which prevents the distribution of samples from being mutually independent.

Sampling data distribution: since Qt is continuously changing, the sampling distribution changes as well, which leads to a sampling bias that needs to be examined and solved. Because the sampled data are biased and not independent and identically distributed, the training of the agent is hampered and the samples remain correlated.

Batch gradient descent takes the whole training data at once at every step and then iteratively optimizes the weights according to the error, which makes training the model very slow.

Stochastic gradient descent addresses this by calculating the gradient using a random training-data instance at each step, making it significantly quicker than batch gradient descent. Since training the model is still tedious and time-consuming, a further optimization method is used:
4.3.4 RMSProp
"The problem solved by Adagrad as adaptively tuning the learning rate per parameter. But
this diminishes the learning rate with time because of the growing sum with time" [35].
4.3.5 Back propagation
After using gradient descent, the weights need to be updated according to the errors. Back-propagation is a method to send the error back from the output layer; the chain rule calculates the derivative of the error function with respect to the neural network weights.
Deep reinforcement learning is the branch of machine learning in which a deep neural network coupled with reinforcement learning is used as the function approximator. In this paper, a CNN is used as the function approximator for the Q-learning model, making it a deep Q-learning model [22]. The convergence issues that arise while training a neural network also come into play when a CNN is coupled as the function approximator.
The solutions that help address these convergence issues are discussed below:
4.5.1 Experience replay
Experience replay stores each experience as a tuple (s, a, r, s′) in a replay memory 𝔻. There are two versions: one stores all experience tuples, and the other stores the last N transitions in a sliding window. After each iteration, the agent samples a small batch from the replay memory D and uses this mini-batch to update the weights of the value network Qt [35]. Since the samples are drawn randomly from the memory, the correlation between samples is broken and the sampling-bias problem is also mitigated; because a whole batch is used at once to update Qt, the sampling distribution becomes more uniform too. The drawback of experience replay is that if the data change following a particular pattern over time, the agent keeps updating Qt with outdated experiences, which can lead it to wrong interpretations.
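As an illustration, a minimal Python sketch of such a sliding-window replay memory is given below; the class name and the capacity value are illustrative rather than taken from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Sliding-window replay memory storing (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once the capacity is reached,
        # which corresponds to the sliding-window variant described above.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: push a transition after every simulation step, then draw a mini-batch for training.
memory = ReplayMemory(capacity=10000)  # the capacity value here is illustrative
# memory.push(s, a, r, s_next)
# batch = memory.sample(32)            # mini-batch of 32, as in Section 5.5
```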
4.5.2 Target network
The problem of moving targets, i.e., the rapidly changing target values discussed above, is solved by using a target network. Two different networks are kept: one is trained after every step, and the other is updated slowly from the former and is used to compute the targets.
5 Proposed methodology
The main aim of this work is to classify the vehicles in order to reduce traffic congestion; traffic light management is a key challenge on the way to urbanization in smart cities. The problems were discussed in Section 1, and our proposed methodology is divided into five phases to address them.
5.1 States
The model takes two input states of the current intersection: a <position> matrix and a <velocity> matrix. Three classes of vehicles are considered: light-weight, moderate-weight, and heavy-weight. Each vehicle is assigned to a cell of the position grid; if a cell of the position matrix does not contain any vehicle, it is given a weight of 0, while weights of 1, 2, and 3 are assigned for a light-weight, moderate-weight, and heavy-weight vehicle, respectively. The vehicle types and their corresponding assigned weights are listed in Table ??. The velocity matrix holds the normalized vehicle velocity at the intersection, V_current/10.
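As an illustration, a minimal Python/TraCI sketch of how such a position/velocity state could be assembled is shown below; the cell-index computation and the mapping from SUMO vehicle type IDs to weights are assumptions made for this example, not details reported in the paper.

```python
import numpy as np
import traci  # SUMO Traffic Control Interface; requires a running SUMO simulation

GRID = 60  # the agent input is a 60 x 60 grid (see Section 5.4)

# Assumed mapping from vehicle type IDs (as defined in the SUMO route file) to weights.
TYPE_WEIGHT = {"bike": 1.0, "car": 2.0, "truck": 3.0}

def build_state(x_min, y_min, cell_length=8.0):
    """Return a GRID x GRID x 2 array: channel 0 holds the vehicle weight, channel 1 holds v/10."""
    state = np.zeros((GRID, GRID, 2), dtype=np.float32)
    for veh_id in traci.vehicle.getIDList():
        x, y = traci.vehicle.getPosition(veh_id)
        col = int((x - x_min) // cell_length)
        row = int((y - y_min) // cell_length)
        if 0 <= row < GRID and 0 <= col < GRID:
            # Empty cells keep weight 0.0; occupied cells get the class weight.
            state[row, col, 0] = TYPE_WEIGHT.get(traci.vehicle.getTypeID(veh_id), 0.0)
            state[row, col, 1] = traci.vehicle.getSpeed(veh_id) / 10.0  # normalized velocity
    return state
```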
5.2 Action
The agent takes actions according to the Q-learning algorithm. An action allows a green light that enables one particular lane to pass and blocks the three remaining lanes, making the vehicles in those lanes stop propagating. At a given time, the agent allows only one lane to pass, and a vehicle standing in front of the green light has three ways to pass through the intersection. This gives the agent four options to choose from.

A two-bit representation of this action is considered: 00, 01, 10, and 11 represent a green light for the north lane, the west lane, the south lane, and the east lane, respectively, with all other lanes having a red light in every possible configuration.
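A minimal sketch of how such a two-bit action could be applied through TraCI is shown below; the traffic light ID and the green-phase state strings are hypothetical and depend on how the intersection is defined in the SUMO network file.

```python
import traci

TLS_ID = "center"  # hypothetical traffic light ID taken from the SUMO network file

# Two-bit action -> signal state string: green for one approach, red for the rest.
# The length and ordering of the state string must match the tlLogic of the actual network,
# so these strings are placeholders rather than values from the paper.
ACTION_TO_STATE = {
    0b00: "GGGrrrrrrrrr",  # north lane green
    0b01: "rrrGGGrrrrrr",  # west lane green
    0b10: "rrrrrrGGGrrr",  # south lane green
    0b11: "rrrrrrrrrGGG",  # east lane green
}

def apply_action(action):
    # Override the current signal plan with the state string chosen by the agent.
    traci.trafficlight.setRedYellowGreenState(TLS_ID, ACTION_TO_STATE[action])
```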
5.3 Reward function
The reward function for the deep Q-learning algorithm is discussed in this section. The agent's goal is to reduce the time that vehicles spend travelling through the intersection. The traffic delay of a vehicle waiting at the intersection can be calculated as t_s − t_min, where t_s is the time the vehicle takes to complete its journey from its starting point to its destination and t_min is the time the same trip would take at the maximum speed V_max. Mathematically, the reward function is represented as

r_t = t_s - t_{\min} \quad (8)
5.4 Q‑learning
The signal control problem is solved using Q-learning. Because the state is not discrete, a function approximator is necessarily utilized, which makes this deep reinforcement learning; the function approximator used is a convolutional neural network [25]. If the agent knew the optimal Q-values for all state-action pairs, its only task would be to choose the optimal action according to the optimal policy 𝜋* for any state of the traffic intersection. To obtain the optimal value Q*, the following dynamic-programming-based recursive expression is used:
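A standard Bellman optimality form of this recursion, following [32], is:

Q^{*}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s,\, a_t = a \right]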
A grid of 60 × 60 is used. The first convolutional layer of the CNN has 32 filters, each of size 4 × 4, applied with a stride of 2 × 2 over the input and followed by a ReLU (Rectified Linear Unit) activation. Because both the position and the velocity of the vehicles are taken as input states, the agent's input is a 60 × 60 × 2 grid. The second layer has 64 filters, each of size 2 × 2 with a stride of 2 × 2. The output of the third convolutional layer is a 15 × 15 × 128 tensor, which is flattened into a fully connected layer of 128 units followed by a ReLU activation. This is then separated into two components, one used to compute the value and the other to compute the advantage; here the advantage expresses how much better an action is than the other options available [15]. The convolutional network parameters (number of filters, grid sizes, and layers) are listed in Table 4 [15].
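A minimal Keras sketch consistent with Table 4 is shown below; the padding choice and the kernel size and stride of the third convolutional layer are assumptions needed to reproduce the stated 15 × 15 × 128 output, not values reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_cnn(num_actions=4):
    # Input: 60 x 60 x 2 grid (position-weight channel and normalized-velocity channel).
    inputs = layers.Input(shape=(60, 60, 2))
    # First layer: 32 filters of 4 x 4 with stride 2 x 2 and ReLU (Table 4).
    x = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(inputs)
    # Second layer: 64 filters of 2 x 2 with stride 2 x 2.
    x = layers.Conv2D(64, 2, strides=2, padding="same", activation="relu")(x)
    # Third layer: 128 filters producing the stated 15 x 15 x 128 tensor
    # (kernel size and stride here are assumptions).
    x = layers.Conv2D(128, 2, strides=1, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    # Dueling split: state value V(s) and per-action advantage A(s, a).
    value = layers.Dense(1)(x)
    advantage = layers.Dense(num_actions)(x)
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q_values = layers.Lambda(
        lambda va: va[0] + va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True)
    )([value, advantage])
    return Model(inputs, q_values)
```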
5.5 Architecture
The proposed deep Q-learning uses two additional components: experience replay and the target network. Here M is the replay memory, which stores each observed experience E_t = {S_t, A_t, R_t, S_{t+1}} into M = {E_1, E_2, ..., E_n}; this replay memory is implemented as a queue data structure and is randomly sampled during training. Exactly 32 samples are taken at random and used as a mini-batch for training the Q-network. To make the agent learn the deep neural network (DNN) parameters 𝜃 such that the optimal Q*(s, a) is approximated by the output Q(s, a; 𝜃), the agent needs training data. The input data (S_t, A_t) is retrieved from the aforementioned replay memory M; because Q*(S_t, A_t) is not known, it is estimated with the help of the target network, whose parameters 𝜃′ are updated as

\theta' = \beta\,\theta + (1 - \beta)\,\theta' \quad (12)

where the update rate 𝛽 is always << 1.
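To make the training loop concrete, a sketch of one update step combining the 32-sample mini-batch, a double-Q target, and the soft update of Eq. (12) is given below; TensorFlow is used purely for illustration, and q_net, target_net, memory, and optimizer are assumed to be created elsewhere (for example, with the sketches shown earlier).

```python
import numpy as np
import tensorflow as tf

GAMMA = 0.90   # discount factor (Section 6.1)
BETA = 0.001   # soft-update rate of the target network (Eq. 12)

def train_step(q_net, target_net, memory, optimizer, batch_size=32):
    """One deep double-Q update on a random mini-batch drawn from the replay memory."""
    states, actions, rewards, next_states = map(np.array, zip(*memory.sample(batch_size)))

    # Double Q-learning target: the online network selects the next action,
    # while the target network evaluates it.
    next_actions = np.argmax(q_net.predict(next_states, verbose=0), axis=1)
    next_q = target_net.predict(next_states, verbose=0)
    targets = (rewards + GAMMA * next_q[np.arange(batch_size), next_actions]).astype(np.float32)

    with tf.GradientTape() as tape:
        q_values = q_net(states)                                   # Q(s, a; theta)
        one_hot = tf.one_hot(actions, q_values.shape[1])
        chosen_q = tf.reduce_sum(q_values * one_hot, axis=1)       # Q of the taken actions
        loss = tf.reduce_mean(tf.square(chosen_q - targets))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

    # Soft target-network update: theta' = beta * theta + (1 - beta) * theta'  (Eq. 12)
    target_net.set_weights([BETA * w + (1.0 - BETA) * w_t
                            for w, w_t in zip(q_net.get_weights(), target_net.get_weights())])
    return float(loss)
```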
6 Implementation and results

In this section, we discuss the implementation details and the results of the experiments with the proposed solution.
Table 4 CNN parameters and corresponding layers

I/O            No. of filters   Grid size        Layer
Input state    32               60 × 60 × 2      First layer
Input state    64               60 × 60 × 2      Second layer
Output state   128              15 × 15 × 128    Third layer
6.1 Implementation details
The traffic is simulated using SUMO, and the details of the simulation are as follows. A four-way intersection is simulated, with every road 500 m long. Every cell has a length of 8 m, the velocity limit is 40 km/h for each vehicle at the traffic intersection, each vehicle is 5 m long, and every vehicle is separated by at least 2 m.

Every vehicle has three routes to choose from at the intersection. Once a lane gets a green light, the vehicles of that lane can opt for any of the three options open to them, while all the vehicles in the other lanes have to stop and wait for the traffic light of their lane to turn from red to green.

For traffic generation, SUMO has a tool called random.py, which randomizes the traffic as much as possible to match real-time traffic. Some additional parameters are set for the traffic generation while using this tool: the same Poisson process is followed but tweaked, and different rates are set for the different kinds of vehicles to match real-time traffic, e.g., P_light_weight = 1/2, P_moderate_weight = 4/5, and P_heavy_weight = 1/5.
The agent has been trained for N = 2000 episodes, each equating to 0.25 h of traffic. The 𝜀-greedy method is discussed below; the value used for 𝜀 is 0.2 for all episodes. The discount factor 𝛾 is 0.90, the value of 𝛽 used to update the target network is 0.001, and the capacity of the replay memory is set to 100 episodes.

The agent has four options to choose from and must decide which action to take as input to attain an optimal result in the long run. It can either choose the best action available at the moment, called exploitation, or try something that is not the best action available now but could give more optimized results in the long run, called exploration. In the initial stages, the agent does not care about the optimal action and explores more; as training progresses, the agent increasingly chooses the exploitative action. The exploration schedule is as follows:

\varepsilon = 1 - \frac{h}{H} \quad (13)

where 𝜀 is the tendency of the agent to choose an explorative action, h is the current episode number, and H is the total number of episodes [39]. Table 5 shows the parameters used during the experiment and their corresponding values.
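A minimal sketch of this exploration schedule and the resulting action choice is shown below; the function and variable names are illustrative.

```python
import random
import numpy as np

def select_action(q_values, episode, total_episodes):
    """Epsilon-greedy selection with the linearly decaying epsilon of Eq. (13)."""
    epsilon = 1.0 - episode / total_episodes    # eps = 1 - h/H
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: pick a random phase
    return int(np.argmax(q_values))             # exploit: best current Q-estimate
```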
6.2 Results
In this section, the results collected after generating the different classes of vehicles and then applying the deep Q-learning algorithm are shown in Fig. 11 below. The average waiting time is calculated for the Double Dueling Deep Q-Network, the Deep Q-Network, adaptive traffic signal control (ATSC), and the fixed-time approach in Fig. 10. The result is compared with a traffic simulation in which the same kind of traffic is generated but the lights follow the traditional round-robin method and stay green for 10 s for every lane; the traffic rules are the same for that simulation too.
The resulting model is trained through deep double Q-learning using a target network and experience replay, and the average cumulative waiting time of every vehicle is reduced substantially; hence traffic congestion is minimized, which is the primary goal of this work. A green light enables one particular lane to pass while the three remaining lanes are blocked, stopping the vehicles in those lanes. At a given time, the agent allows only one lane to pass, and a vehicle standing in front of the green light has three ways to pass through the intersection. This provides the agent with four options to choose from, i.e., a green light for the north, west, south, or east lane. After learning, the agent evaluates the reward function by calculating the vehicles' waiting time at the intersection, its goal being to reduce the time vehicles spend in the intersection. The work is simulated in SUMO (Simulation of Urban MObility), with the traffic data generated on SUMO using a random function. The results show that the traffic light of the studied intersection becomes adaptive, aligning with the goals mentioned above, and that the proposed model efficiently reduces the average waiting time at the intersection by up to 91.7%, as shown in the graphs in this section.
This methodology can also be scaled to a larger extent by using multiple agents that synchronize among themselves and minimize the traffic of a whole area through vector-minimization techniques. Even though there is a drastic drop in the cumulative waiting time at the traffic intersection, DSRC units are still not deployed in every vehicle in the traffic network. With the rapid development of communication protocols and technology, a mechanism must be designed that is adaptive, readily available, and cheap, making it accessible to the general population's vehicles so that it can be deployed in more and more vehicles. The effects of the disruptions and congestion caused by unforeseeable accidents, and how the trained model deals with them, still need to be studied, and more research is required on how much the reinforcement learning algorithm depends on time. A communication network that is more reliable than DSRC also has to be achieved.
Acknowledgements I am highly thankful to my co-author Mr. Nistala Venkata Kameshwer Sharma, for his
important contribution. After that, I thank my supervisor, Dr. Vijay K Chaurasiya, for guiding me in this
research work. At last, I am also thankful to Dr. Shishupal Kumar for his direction from time to time.
Declarations
Conflicts of interest/Competing interests No conflict.
References
1. Bellman R, Kalaba R (1957) Dynamic programming and statistical communication theory. Proc Natl
Acad Sci USA 43(8):749
2. Casas N (2017) Deep deterministic policy gradient for urban traffic light control. arXiv preprint arXiv:1703.09035
3. Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8(3):279–292
4. Coşkun M, Baggag A, Chawla S (2018) Deep reinforcement learning for traffic light optimization. In:
2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 564–571
5. Farazi NP, Ahamed T, Barua L, Zou B (2020) Deep reinforcement learning and transportation
research: A comprehensive review. https://doi.org/10.48550/arXiv.2010.06187
6. Gao J, Shen Y, Liu J, Ito M, Shiratori N (2017) Adaptive traffic signal control: Deep reinforcement
learning algorithm with experience replay and target network. arXiv preprint arXiv:1705.02755
7. Garg D, Chli M, Vogiatzis G (2018) Deep reinforcement learning for autonomous traffic light con-
trol. In: 2018 3rd IEEE international conference on intelligent transportation engineering (ICITE),
IEEE, Singapore, pp 214–218. https://doi.org/10.1109/ICITE.2018.8492537
8. Genders W, Razavi S (2016) Using a deep reinforcement learning agent for traffic signal con-
trol. https://doi.org/10.48550/arXiv.1611.01142
9. Gong Y, Abdel-Aty M, Cai Q, Rahman MS (2019) Decentralized network level adaptive signal control
by multi-agent deep reinforcement learning. Transp Res Interdiscip Perspect 1:100020
10. Hu X, Zhao C, Wang G (2020) A traffic light dynamic control algorithm with deep reinforcement
learning based on GNN Prediction. https://doi.org/10.48550/arXiv.2009.14627
11. Kaelbling LP, Littman ML, Cassandra AR (1998) Planning and acting in partially observable stochas-
tic domains. Artif Intell 101(1–2):99–134
12. Kumar N, Rahman SS, Dhakad N (2020) Fuzzy inference enabled deep reinforcement learning-based traffic light control for intelligent transportation system. IEEE Trans Intell Transp Syst
13. Li C, Ma X, Xia L, Zhao Q, Yang J (2020) Fairness control of traffic light via deep reinforcement
learning. In: 2020 IEEE 16th International Conference on Automation Science and Engineering
(CASE). IEEE, Hong Kong, China, pp 652–658. https://doi.org/10.1109/CASE48305.2020.9216899
14. Li L, Lv Y, Wang F-Y (2016) Traffic signal timing via deep reinforcement learning. IEEE/CAA J
Automat Sin 3(3):247–254
15. Liang X (2019) Applied deep learning in intelligent transportation systems and embedding explora-
tion, Ph.D. thesis, New Jersey Institute of Technology
16. Liang X, Du X, Wang G, Han Z (2018) Deep reinforcement learning for traffic light control in vehicu-
lar networks. arXiv preprint arXiv:1803.11115
17. Liang X, Yan T, Lee J, Wang G (2018) A distributed intersection management protocol for safety, effi-
ciency, and driver’s comfort. IEEE Internet Things J 5(3):1924–1935
18. Liang X, Du X, Wang G, Han Z (2019) A deep reinforcement learning network for traffic light cycle
control. IEEE Trans Veh Technol 68(2):1243–1253
19. Ma Z, Cui T, Deng W, Jiang F, Zhang L (2021) Adaptive optimization of traffic signal timing via deep
reinforcement learning. J Adv Transp 2021
20. Ma D, Zhou B, Song X, Dai H (2021) A deep reinforcement learning approach to traffic signal control
with temporal traffic pattern mining. IEEE Trans Intell Transp Syst
21. Maadi S, Stein S, Hong J, Murray-Smith R (2022) Real-time adaptive traffic signal control in a con-
nected and automated vehicle environment: optimisation of signal planning with reinforcement learn-
ing under vehicle speed guidance. Sensors 22(19):7501
22. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fid-
jeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature
518(7540):529–533
23. Mousavi SS, Schukat M, Howley E (2017) Traffic light control using deep policy‐gradient and value‐
function‐based reinforcement learning. IET Intel Transport Syst 11(7):417–423
24. Navarro-Espinoza A, López-Bonilla OR, García-Guerrero EE, Tlelo-Cuautle E, López-Mancilla D,
Hernández-Mejía C, Inzunza-González E (2022) Traffic flow prediction for smart traffic lights using
machine learning algorithms. Technologies 10(1):5
25. Pang H, Gao W (2019) Deep Deterministic policy gradient for traffic signal control of single inter-
section. In: 2019 Chinese Control And Decision Conference (CCDC), Nanchang, China, pp 5861–
5866. https://doi.org/10.1109/CCDC.2019.8832406
26. Prosper HB (2017) Deep learning and Bayesian methods. EPJ Web Conf 137:9. https://doi.org/10.1051/epjconf/201713711007
27. Raeisi M, Mahboob AS (2021) Intelligent control of urban intersection traffic light based on reinforce-
ment learning algorithm. In: 2021 26th International Computer Conference, Computer Society of Iran
(CSICC). IEEE, 1–5
28. Rasheed F, Yau K-LA, Low Y-C (2020) Deep reinforcement learning for traffic signal control under
disturbances: A case study on Sunway city, Malaysia. Futur Gener Comput Syst 109:431–445
29. Sahu SP, Dewangan DK, Agrawal A, Priyanka TS (2021) Traffic light cycle control using deep rein-
forcement technique. In: 2021 International Conference on Artificial Intelligence and Smart Systems
(ICAIS). IEEE, Coimbatore, India, pp 697–702. https://doi.org/10.1109/ICAIS50930.2021.9395880
30. Saleem M, Abbas S, Ghazal TM, Khan MA, Sahawneh N, Ahmad M (2022) Smart cities: Fusion-
based intelligent traffic congestion control system for vehicular networks using machine learning tech-
niques. Egypt Inf J 23(3):417–426
31. Schneider C (2020) Intelligent signalized intersection management for mixed traffic using Deep Q-Learning
32. Sutton RS, Barto AG et al (1998) Introduction to reinforcement learning, volume 135. MIT press
Cambridge
33. Tan KL, Poddar S, Sarkar S, Sharma A (2019) Deep reinforcement learning for adaptive traffic signal
control. In: Dynamic Systems and Control Conference, volume 59162. American Society of Mechani-
cal Engineers, V003T18A006
34. Tong W, Hussain A, Bo WX, Maharjan S (2019) Artificial Intelligence for Vehicle-to-Everything: A
Survey. IEEE Access 7:10823–10843. https://doi.org/10.1109/ACCESS.2019.2891073
35. van der Pol E (2016) Deep reinforcement learning for coordination in traffic light control. Master’s
thesis, University of Amsterdam
36. Wei H, Zheng G, Yao H, Li Z (2018) Intellilight: A reinforcement learning approach for intelligent
traffic light control. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowl-
edge Discovery & Data Mining, 2496–2505
37. Wu T, Zhou P, Liu K, Yuan Y, Wang X, Huang H, Wu DO (2020) Multi-agent deep reinforcement
learning for urban traffic light control in vehicular networks. IEEE Trans Veh Technol 69(8):8243–8256
38. Yu B, Guo J, Zhao Q, Li J, Rao W (2020) Smarter and safer traffic signal controlling via deep rein-
forcement learning. In: Proceedings of the 29th ACM International Conference on Information &
Knowledge Management, 3345–3348
39. Yuan X (2021) Faster Finding of Optimal Path in Robotics Playground Using Q-Learning with
“Exploitation-Exploration Trade-Off”. J Phys Conf Ser 1748(2):022008
40. Zeinaly Z, Sojoodi M, Bolouki S (2023) A resilient intelligent traffic signal control scheme for acci-
dent scenario at intersections via deep reinforcement learning. Sustainability 15(2):1329
41. Zhang R, Ishikawa A, Wang W, Striner B, Tonguz OK (2020) Using reinforcement learning with par-
tial vehicle detection for intelligent traffic signal control. IEEE Trans Intell Transp Syst 22(1):404–415