Enhancing Traffic Flow Through Multi-Agent Reinforcement Learning For Adaptive Traffic Light Duration Control
Corresponding Author:
Nada Faqir
Department of Computer Science, Faculty of Sciences Dhar El Mehraz
Sidi Mohammed Ben Abdellah University of Fes
Fes 30000, Morocco
Email: [email protected]
1. INTRODUCTION
Traffic congestion is a pervasive issue in urban areas worldwide, causing significant economic losses
due to the total time lost by commuters. These losses are estimated to reach billions of dollars annually in the
countries most affected. As a result, alleviating traffic congestion and its economic impact has become a crucial
priority, yet managing congestion remains a pressing and largely unsolved problem in many cities around the world.
The traditional static traffic light control system (TLCS) has long been the conventional solution for
managing traffic flow. However, with the increasing complexity and scale of urban transportation networks,
such static systems are no longer adequate for effectively mitigating traffic congestion. To
overcome these limitations, researchers have turned to deep reinforcement learning frameworks, which
leverage the power of reinforcement learning algorithms and neural networks such as deep neural networks
(DNN) and convolutional neural networks (CNN) [1]. Notable advancements in this field include systems like
intellilight, which utilize deep reinforcement learning for adaptive traffic signal control (ATSC), demonstrating
significant improvements in traffic efficiency [2]. In addressing urban traffic congestion, our focus lies in
optimizing traffic flow efficiency within a two-intersection scenario. We emphasize the use of CNN in our
deep Q-network (DQN) model and employ a centralized training and distributed execution (CTDE) approach
within a multi-agent reinforcement learning (MARL) framework [3].
This article proposes a novel multi-agent perspective to enhance the performance of traffic light
control systems, recognizing the importance of agent collaboration in supporting efficient traffic flows [4]. By
adopting a MARL approach, our goal is to develop a system where multiple agents regulate traffic at adjacent
intersections, collaborating to optimize traffic flow. We acknowledge the potential for broader applications in
more complex settings, and our work aims to address scalability and partial observability challenges in large-
scale multi-agent deep reinforcement learning systems [5]. To validate the effectiveness of our approach, we
conducted several experiments using traffic simulations involving two intersecting roads. These experiments
aimed to observe how agents achieve collaboration and improve traffic flow through independent deep Q
network (IDQN) techniques.
By addressing the economic impact of traffic congestion and introducing a novel multi-agent
perspective, our proposed approach offers a promising solution for optimizing traffic flow in complex urban
environments. Through the integration of real-time data, scalability, and considerations for mixed traffic
scenarios, we aim to improve overall traffic efficiency and create smarter, more sustainable cities. This research
provides valuable insights into the challenges and potential solutions for managing traffic congestion using
deep reinforcement learning techniques.
Numerous studies have demonstrated the effectiveness of reinforcement learning in adaptive traffic
light control, with a particular focus on single traffic light controllers [6], [7]. However, when addressing multi-
agent systems, the challenge arises in determining whether traffic control should be approached as a
collaborative or competitive problem [8]. In such environments, the conflict between local and global optima,
coupled with the interdependence of agent actions, adds complexity to the dynamics of transport networks.
Additionally, factors such as driver behavior and co-evolution further complicate the challenges of traffic light
control in urban environments.
Our work contributes to this landscape by delving into the complexities of traffic signal control (TSC)
in networks with multiple intersections. Introducing a novel MARL framework, our approach emphasizes
collaboration among agents to enhance traffic flow. It capitalizes on the advantages of independent traffic light
control, ensuring stable operations even in the event of failures in other components. To enable efficient
decision-making, we employ CNNs to model our agents and incorporate a comprehensive representation of
the state of adjacent intersections. While existing studies [6], [7] have predominantly focused on single traffic
light controllers, our work extends the scope to multi-intersection scenarios, presenting a promising approach
for addressing traffic congestion in urban environments. The outlined comparisons aim to underscore the
innovative aspects of our proposed system within the broader landscape of traffic light control methodologies.
In the domain of deep reinforcement learning, a DNN typically represents the Q function. Value-based
methods, such as DQN [9], utilize DNN to extract state representations directly from the environment. The
DQN outputs Q values for all possible actions, enabling the approximation of optimal strategies. To enhance
the performance, stability, and effectiveness of the DQN agent, two commonly employed mechanisms, namely
the target network and replay memory, are integrated during training.
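As a concrete illustration of these two mechanisms, the sketch below shows a minimal replay memory; the class and method names are ours, while the 50,000-sample capacity and 100-sample minibatch match the training setup described later in this paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal experience replay buffer storing (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=50_000):
        # Once the buffer is full, the oldest experiences are discarded automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=100):
        # Uniform random sampling breaks the temporal correlation of consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```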
Our exploration aligns with this approach, and we further contribute by examining various
DQN-based models, including double deep Q-network (DDQN) [10], dueling DQN [11], DDQN with
proportional prioritization based on DDQN [12], and dynamic frame skip deep Q-network (DFDQN) [13].
While artificial neural networks (ANN) have been widely used as model architectures [14], [15], there is a
growing utilization of CNN in this context [16]–[19]. By integrating CNN-based models and examining their
impact on the learning quality of agents, our paper contributes to advancing the understanding of deep
reinforcement learning in traffic light control. The findings shed light on the strengths and limitations of
different neural network architectures, providing insights into their effectiveness in optimizing traffic flow.
The reinforcement learning problem is formalized as a Markov decision process (MDP) defined by the tuple (S, A, R, P), where $s_t \in S$ is the state observed at time step t, $a_t \in A$ the action taken at time step t by the agent, $r_t \in R$ the immediate reward received by the agent for performing action $a_t$ in state $s_t$, and P the state transition probability.
In traffic signal problems, the agent interacts with the intersection at discrete time steps t = 0, 1, 2, ...,
first observing the state of the intersection St at the start of time step t, then selecting an action At depending
on a policy π. After the vehicle moves under the active traffic light, the state of the intersection changes to the
new state of St+1. At the end of the time step t, the agent receives a reward Rt following its decision to choose
a signal. Assuming that the reward at each time step t must be multiplied by the discount factor γ ϵ [0, 1] to
determine the importance given to immediate and future rewards, the cumulative reward from time step t to
the end of time T is expressed as follows: $R_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k}$. The value function $Q^{\pi}(s,a)$ represents the action
At taken in the current state St, followed by the application of policy π until the episode concludes. During this
process, the cumulative rewards earned by the agents can be expressed as (1).
The agent's objective is to identify the optimal policy for selecting actions, represented as π∗ in (2), which
maximizes the cumulative future reward or Q-values and enables the agent to accomplish its objective.
The Q-values that correspond to the optimal policy $\pi^*$ are denoted as $Q^*(s,a) = Q^{\pi^*}(s,a)$ and can
be approximated using linear function approximators. The primary objective is to identify the optimal policy
$\pi^*$ and, subsequently, to determine the optimal Q-values $Q^*(s,a)$. The recursive relation for the optimal
Q-values $Q^*(s,a)$ is provided by Bellman's optimality equation.
Linear function approximators, as well as nonlinear function approximators such as DNNs, can be employed to
estimate action-value functions or policies.
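Since equations (1)-(3) themselves are not reproduced in this excerpt, the standard forms they presumably correspond to, written in our notation, are the action-value function, the optimal policy, and Bellman's optimality equation:

```latex
% Hedged reconstruction (our notation) of the standard definitions referred to as (1)-(3):
% action-value function, optimal policy, and Bellman's optimality equation.
\begin{align}
  Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\!\left[\textstyle\sum_{k=0}^{\infty}\gamma^{k} R_{t+k} \,\middle|\, S_t = s,\ A_t = a\right] \\
  \pi^{*} &= \arg\max_{\pi} Q^{\pi}(s,a) \\
  Q^{*}(s,a) &= \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \,\middle|\, s,a \right]
\end{align}
```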
Stochastic games can be categorized into three types: fully cooperative, fully competitive, and mixed
games. In fully cooperative games, all agents share the same reward function, aiming to maximize the total
reward collectively. Cooperation among agents is essential to reach this objective. Approaches based on the
Q-learning algorithm have been developed to tackle coordination challenges, focusing on a single optimal joint
action [20]. Lauer's distributed Q-learning algorithm [21] addresses the issue while maintaining low
computational complexity. However, this algorithm only applies to deterministic scenarios with non-negative
reward functions.
When dealing with a perfectly competitive game, where there are two agents with opposing goals and
R1 = -R2, most of the literature on reinforcement learning focuses on the two-agent scenario. The minimax
principle is applied in such a situation, which aims to maximize an agent's profit while considering the
worst-case scenario, where the opponent will always try to minimize it. To achieve this, adversary-independent
algorithms are used, such as the minimax-Q algorithm [20], [22], which computes stage game strategies and
values using the minimax principle and propagates them across state-action pairs using a temporal-difference rule
similar to Q-learning.
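In its usual formulation (our notation, not reproduced from the cited works), the minimax-Q update for a two-agent zero-sum stage game maintains a state value obtained from the minimax strategy and propagates it with a Q-learning-style rule:

```latex
% Usual form of the minimax-Q update, where o is the opponent's action,
% pi(s,.) the agent's mixed strategy, and alpha the learning rate.
\begin{align}
  V(s) &= \max_{\pi(s,\cdot)} \min_{o} \sum_{a} \pi(s,a)\, Q(s,a,o) \\
  Q(s,a,o) &\leftarrow (1-\alpha)\, Q(s,a,o) + \alpha \left[\, r + \gamma\, V(s') \,\right]
\end{align}
```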
Mixed stochastic games involve agents with different and correlated rewards, where no constraints
are imposed on the agents' reward functions. This model is suitable for self-interested agents, but even
cooperating agents may encounter conflicts of interest, such as when competing for resources. Game theory
concepts such as equilibrium play a significant role in algorithms for mixed stochastic games. Methods ranging
from single-agent reinforcement learning applied independently to dedicated MARL algorithms make no
assumptions about the task and can be applied to mixed stochastic games, along with other techniques such as
agent-independent, agent-tracking, and agent-aware methods. However, there is no guarantee of success in these scenarios.
Setting learning goals for agents can be challenging since it is difficult to define good general
objectives. The two primary aspects of learning goals are stability and adaptability. Stability refers to an agent's
ability to converge to a stable policy, while adaptability ensures that an agent's performance does not decline
when other agents change their strategies [23], [24].
3. METHOD
In this study, we introduce an innovative approach to address the persistent challenge of urban traffic
congestion using advanced artificial intelligence techniques. The primary objective is to optimize traffic flow
at complex intersections by integrating CNN into a DQN model within a MARL framework. Our approach
leverages a CTDE architecture, crucial for scalability and effective coordination among multiple agents
controlling traffic signals.
To formulate the problem, the primary aim is to enhance traffic flow optimization in a scenario with
two intersections, and we benchmark our results against baseline approaches. Each intersection's state is
discretized into irregularly shaped cells using discrete traffic state encoding (DTSE). These cells represent
different zones monitored by the agents, capturing vehicle presence and traffic dynamics.
The agent's state representation leverages data such as vehicle positions, queue lengths, and current traffic
light states. This detailed situational awareness enables agents to make informed decisions to optimize traffic
flow.
The action space consists of four distinct actions per agent: adjusting the traffic lights to prioritize
north/south or east/west traffic, encompassing both straight and turning movements. Actions are selected based
on the agent's observation of the current state, with the aim of minimizing cumulative waiting times. The reward
function is defined as the reduction in cumulative waiting time, summed over all vehicles at the intersection,
from step t-1 to step t, emphasizing the agent's role in improving traffic efficiency over time.
The CTDE framework facilitates collaborative action selection among agents by sharing their last
chosen actions, enhancing coordination and traffic synchronization between adjacent intersections. This
sharing mechanism optimizes traffic flow across
interconnected road networks, crucial for mitigating congestion hotspots and improving overall urban mobility.
To evaluate the effectiveness of our approach, we simulate realistic traffic scenarios using the Weibull
distribution: high-traffic (4,000 cars), low-traffic (600 cars), NS-traffic (2,000 cars, 90% north/south), and
EW-traffic (2,000 cars, 90% east/west). These scenarios simulate diverse traffic conditions to robustly test the
adaptability and performance of our CTDE-based MARL framework under varying load and directional traffic
distributions. Through rigorous experimentation and simulation, we validate the effectiveness and efficiency
of our CTDE-based MARL framework in enhancing traffic
management strategies. Performance metrics such as average waiting times, intersection throughput, and
overall traffic flow efficiency provide quantitative insights into the framework's capability to optimize urban
traffic operations effectively.
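As an illustration of how such scenarios can be generated, the sketch below draws vehicle departure times from a Weibull distribution; the shape parameter, the simulation horizon, and the function name are assumptions not taken from the paper, while the vehicle counts follow the four scenarios described above.

```python
import numpy as np

def generate_departure_times(n_cars, sim_duration_s=5400, shape=2.0, seed=0):
    """Sample vehicle departure times from a Weibull distribution and rescale
    them to the simulation horizon. The shape parameter and horizon are
    illustrative assumptions, not values taken from the paper."""
    rng = np.random.default_rng(seed)
    samples = rng.weibull(shape, size=n_cars)
    # Rescale to [0, sim_duration_s] and sort so vehicles are inserted in order.
    samples = samples / samples.max() * sim_duration_s
    return np.sort(samples)

# The four scenarios described above (vehicle counts from the paper).
scenarios = {"high": 4000, "low": 600, "ns": 2000, "ew": 2000}
departures = {name: generate_departure_times(n) for name, n in scenarios.items()}
```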
Our environment is composed of an integrated traffic light system in which each 4-arm intersection
(shown in Figure 4) has eight lanes per arm (four inbound and four outbound). Each arm is 750
meters long from the vehicle access point to the intersection traffic lights. Each vehicle chooses its incoming
lane according to its destination. The possible directions for each vehicle are explained as follows:
− Go straight using the two central lanes and the one on the far right.
− Turn right only using the rightmost lane.
− Turn left only using the leftmost lane.
The traffic light phases regulating each intersection are shown in Figure 5. As in real life, the traffic
lights are either red, yellow, or green, and at each time step the active phase is either green or yellow. The
phase transition is cyclical (red-green-yellow-red); the yellow phase is mandatory to clear the center of the
intersection and avoid probable accidents. The duration of each phase is fixed: 10 seconds for the green time
and 4 seconds for the yellow time. The red phase therefore lasts for the time elapsed since the previous phase
change.
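The phase logic described above can be summarized by the small sketch below; only the 10-second green and 4-second yellow durations come from the text, and the action encoding and function name are illustrative assumptions.

```python
# Illustrative encoding of the fixed phase durations described above
# (only the 10 s green and 4 s yellow come from the text; names are ours).
GREEN_DURATION_S = 10
YELLOW_DURATION_S = 4

def phase_schedule(new_action, previous_action):
    """Return the (phase, duration) steps to apply when an agent picks an action.
    A yellow transition is inserted only when the chosen green phase changes,
    mirroring the mandatory clearing phase described above."""
    steps = []
    if previous_action is not None and new_action != previous_action:
        steps.append(("yellow", YELLOW_DURATION_S))
    steps.append((f"green_{new_action}", GREEN_DURATION_S))
    return steps
```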
Figure 5. Traffic lights for each inbound lane of each intersection in the simulated environment
3.2. Agents
In the context of MARL, ATSC involves coordinating the actions of multiple intersection traffic light
controllers to achieve network-wide traffic optimization. Each intersection traffic light controller acts as an
agent within the MARL framework, perceiving traffic conditions at their respective intersections and making
decisions on phase control signals accordingly. Through interaction and collaboration, agent controllers adapt
their policies to changing traffic dynamics, leveraging collective network intelligence to optimize traffic flow
across the entire system. This collaborative approach allows agents to learn from each other's experiences,
improve decision-making processes, and ultimately improve the efficiency of the TSC system. By using MARL
in ATSC, the network-wide control loop shown in Figure 6 facilitates the coordination and synchronization of
agent actions, leading to a more efficient and adaptive traffic management solution. This approach considers
not only local optimization at each intersection but also the overall impact of signal control decisions,
resulting in improved traffic flow, reduced congestion, and better overall performance of the transport network.
Figure 6. Network-wide MARL control loop: Policy optimization in ATSC as an interactive MDP process
with 2 agents
Using deep Q-Learning, each agent in our proposed method retrieves the discretized representation of
the environment state, denoted as St = (𝑃𝑡 , 𝐶𝑡𝑐 , ∆𝐶𝑡𝑐 , 𝑉𝑡𝑐 ) (detailed in the next section). As our approach is based
on the CTDE design of a multi-agent environment, knowledge sharing between the agents of the two
intersections is crucial in a partially observable environment. To enable this knowledge sharing, at each time
step t, each agent shares the last executed action at time step t-1 with the other agent. This information is
provided as inputs to the CNN network, which estimates the Q-values 𝑄(𝑆𝑡 , 𝑎, 𝜃) for all possible actions in the
action space, as shown in Figure 7.
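The exact CNN architecture is given in Figure 7, which is not reproduced in this excerpt; the sketch below is therefore only an assumed PyTorch-style network illustrating how the per-cell features, the local phase, and the other agent's last action could be combined into Q-value estimates for the four actions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Hypothetical CNN Q-network: the real architecture is defined in Figure 7;
    the layer sizes here are assumptions. Input: per-cell features (C, dC, V)
    over 80 cells, plus the local phase and the other agent's last action."""

    def __init__(self, n_cells=80, n_cell_features=3, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_cell_features, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        # +1 for the current phase P_t, +1 for the neighbour's last action.
        self.head = nn.Sequential(
            nn.Linear(32 * n_cells + 2, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, cell_features, phase, neighbour_last_action):
        # cell_features: (batch, 3, 80); phase and neighbour_last_action: (batch, 1)
        x = self.conv(cell_features).flatten(start_dim=1)
        x = torch.cat([x, phase, neighbour_last_action], dim=1)
        return self.head(x)  # Q-values for the four possible actions
```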
Furthermore, using the replay memory mechanism, each agent records the observed interaction
experience 𝐸𝑡 = (𝑆𝑡 , 𝐴𝑡 , 𝑅𝑡 , 𝑆𝑡+1 ) into a replay memory M = {𝐸1 , 𝐸2 ,· · · , 𝐸𝑡 }. To address the issue of
oscillation during the training phase, we employ the target network mechanism, which utilizes the same
architecture as the CNN network depicted in Figure 7. In addition to updating the CNN parameters, the agent
also updates the target network parameters using a method called soft updating, as explained in section 3.6.
The agent system structure of our proposed method and its training process are depicted in Figure 7, illustrating
the integration of these components and the overall flow of information and updates within the system.
During training, the CNN parameters θ are updated to minimize the loss $L = \mathbb{E}_{(s,a,r,s')}\big[(r' - Q(s,a;\theta))^2\big]$, where $r' = r + \gamma \max_{a'} Q(s',a';\theta')$ is the target computed with the target network parameters θ′.
This detailed state representation provides enough information for the agent to learn how to optimize traffic
effectively. This choice is also motivated by the quantity and quality of the information provided to the agent.
Vidali et al. [19] used a discretized representation of each arm of a junction, in which the cells are irregular
and based on the DTSE formulation [14], except that less information is encoded. Not all cells are the same size, as shown
in Figure 8. At each intersection, there are 20 cells per arm, with 10 cells allocated to the left lane and the
remaining 10 cells covering the other three lanes, resulting in 80 cells per intersection. These cells are not of
the same size and their length is determined by their distance from the stop line. It is important to strike a
balance between cell length and computational complexity. If a cell is too long, cars may not be detected, while
if it is too short, the computation required to cover the road length increases. The paper proposes a cell length
that is 2 meters longer than the car's length for the shortest cell closest to the stop line. Whenever the agent
observes the environment, it receives a set of cells that indicate the presence of vehicles in each approach road
zone.
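A possible mapping from a vehicle's distance to the stop line onto these cells is sketched below; only the principles come from the text (cells grow with distance, and the shortest cell is about 2 meters longer than a car), while the exact boundaries and names are illustrative assumptions.

```python
# Hypothetical DTSE cell boundaries for one 750 m arm (10 cells per lane group).
# Only the principles are from the text: cells grow with distance from the stop
# line and the shortest cell is about 2 m longer than a car; the exact
# boundaries below are illustrative assumptions.
CAR_LENGTH_M = 5
CELL_BOUNDARIES_M = [7, 14, 21, 28, 40, 60, 100, 160, 400, 750]

def cell_index(distance_to_stop_line_m):
    """Map a vehicle's distance to the stop line onto a DTSE cell index (0-9)."""
    for idx, boundary in enumerate(CELL_BOUNDARIES_M):
        if distance_to_stop_line_m <= boundary:
            return idx
    return len(CELL_BOUNDARIES_M) - 1
```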
In our approach, for each intersection, we retrieve the state from SUMO in the form of four metrics, as follows:
− The phase 𝑃𝑡 of the traffic signal at time step t;
− The congestion level 𝐶𝑡𝑐 of each cell c, at time step t;
− The variation in congestion level ∆𝐶𝑡𝑐 for each cell c during a single step;
− The mean speed 𝑉𝑡𝑐 of the vehicles of each cell c, at time step t.
The congestion level $C_t^c$ lies in the range [0, 1]: it approaches 1 when cell c of an incoming lane is filled
up to its maximum occupancy and is 0 when no vehicle is in the cell. We define the congestion level as (6).
$C_t^c = \dfrac{N_{veh}^c}{L^c/(S_{veh} + g_{min})}$ (6)

where $N_{veh}^c$ is the number of vehicles in cell c, $L^c$ the length of cell c, $S_{veh}$ the size of a vehicle,
and $g_{min}$ the minimum distance maintained between vehicles (in SUMO, $S_{veh}$ = 5 meters and $g_{min}$ = 2.5 meters).
The congestion level is used to identify congested lanes: if a lane is highly congested, the controller should
shift to the traffic signal phase that allows its traffic to proceed and reduces congestion. The status of
neighboring traffic signals is captured through the change in congestion, denoted $\Delta C_t^c$. If the congestion
levels on the lanes of an adjacent intersection increase, the signaling phase that permits that traffic to enter the
intersection should be adopted. Each traffic signal can therefore achieve cooperative control with neighboring
signals by considering the change in congestion level as part of the state. $\Delta C_t^c$ at time step t is the
difference between $C_t^c$ and $C_{t-1}^c$, so it lies in the range [-1, 1]. We normalize $\Delta C_t^c$ as (7).
$\Delta C_t^c = \dfrac{1 + C_t^c - C_{t-1}^c}{2}$ (7)
The speed 𝑉𝑡𝑐 is determined by normalizing the speed of each vehicle against the maximum speed allowed for
the edge. In summary, the traffic signal agent in this environment establishes the state at a specific discrete
time step t as St=(𝑃𝑡 , 𝐶𝑡𝑐 , ∆𝐶𝑡𝑐 , 𝑉𝑡𝑐 ).
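The per-cell features can be computed along the lines of the following sketch, which implements (6) and (7) together with the normalized mean speed; the function names are ours, while the 5-meter vehicle size and 2.5-meter minimum gap are the SUMO values quoted above.

```python
S_VEH_M = 5.0  # vehicle length used by SUMO (as stated above)
G_MIN_M = 2.5  # minimum gap between vehicles (as stated above)

def congestion_level(n_veh_in_cell, cell_length_m):
    """Equation (6): ratio of vehicles present to the cell's maximum occupancy, capped at 1."""
    max_occupancy = cell_length_m / (S_VEH_M + G_MIN_M)
    return min(n_veh_in_cell / max_occupancy, 1.0)

def congestion_change(c_t, c_prev):
    """Equation (7): normalise the raw change (range [-1, 1]) into [0, 1]."""
    return (1.0 + c_t - c_prev) / 2.0

def mean_speed(cell_speeds_mps, max_speed_mps):
    """Mean speed of the vehicles in a cell, normalised by the edge speed limit."""
    if not cell_speeds_mps:
        return 0.0
    return sum(cell_speeds_mps) / len(cell_speeds_mps) / max_speed_mps
```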
Our approach is based on the CTDE design of a multi-agent environment, where knowledge sharing
between the agents of the two intersections in a partially observable environment means that, at time step t,
each agent shares with the other the last action it executed at time step t-1. The detailed state of the
environment we have adopted also takes into account the partially observable nature of this environment,
which is reflected in the congestion change metric $\Delta C_t^c$ added to the state representation.
Where $t_{i,t}$ denotes the waiting time of vehicle i at time step t; the reward can then be defined as (9).
The waiting time of a vehicle is the time (in seconds), counted from the vehicle's appearance in the
environment, during which its speed is below 0.1 m/s at agent step t. As a metric, each agent considers the
total cumulative waiting time at step t over all vehicles ($Cumul\_Total_{T_t}$). If a vehicle manages to move
forward but fails to pass the intersection, this metric does not reset the total cumulative waiting time (10),
which prevents misleading statistics during agent training.
Where $cumul\_t_{i,t}$ is the accumulated waiting time of vehicle i at time step t. Our reward function is then
based on the accumulated total waiting time of all the vehicles at the intersection, captured at agent steps t and
t-1, as presented in (11).
Both (9) and (11) produce negative values. As more vehicles accumulate in the queues at the agent's step t, the
reward becomes increasingly negative, prompting the agent to assess its action more carefully. Consequently,
in our results, we focus on visualizing the negative reward at each time step to evaluate the behavior of each
agent.
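A minimal sketch of this reward computation is given below; the function names and the per-vehicle bookkeeping are ours, and only the use of accumulated waiting times and the sign convention follow the definitions above.

```python
def cumulative_total_waiting_time(accumulated_wait_by_vehicle):
    """Equation (10): sum of the accumulated waiting times of all vehicles
    currently at the intersection. The per-vehicle value is not reset when a
    vehicle creeps forward without clearing the intersection, as described above."""
    return sum(accumulated_wait_by_vehicle.values())

def reward(cumul_total_wait_t, cumul_total_wait_prev):
    """Equation (11): the reward is negative whenever the total waiting time
    grows between agent steps t-1 and t."""
    return cumul_total_wait_prev - cumul_total_wait_t
```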
Algorithm 1: DQN with experience replay and target network for ATSC
1: Initialize CNN network with random weights θ0
2: Initialize target network with weights θ′ = θ0
3: Initialize hyperparameters ϵ, γ, α, and the number of episodes N
4: for each episode episode = 1,2,...,N do
5: Initialize action A0
6: Start a new time step
7: for each time step time = 1,2,...,T do
8: if a new time step t begins then
9: The agent observes the current intersection state St ;
10: The agent selects action At = argmaxa Q(St,a;θ) with probability 1 − ϵ,
and selects a random action At with probability ϵ;
11: if At == At−1 then
12: No transition phase for traffic signals;
13: else
14: Apply a transition phase for traffic signals;
15: end if
16: end if
17: Execute the selected action At;
18: Vehicles move according to the current traffic signals;
19: Increment time time = time + 1;
20: if the time step t ends then
21: The agent observes the reward Rt and the current intersection state St+1;
22: Store the observed experience (St,At,Rt,St+1) in replay memory M;
23: Randomly sample a minibatch of 100 experiences (Si,Ai,Ri,Si+1) from M;
24: Form the training data: input set X and targets y;
25: Update θ with the training data;
26: Update the target network θ′ according to equation (12);
27: end if
28: end for
29: end for
The replay memory M has a finite capacity of 50,000 samples, and once full, the oldest data is
discarded. To learn CNN parameters that best approximate $Q^*(s,a)$, the agent needs training data: an input
set X = {$(S_t, A_t)$ : t ≥ 1}, which can be retrieved from the replay memory M, and the corresponding targets
y = {$Q^*(S_t, A_t)$ : t ≥ 1}. However, the target $Q^*(S_t, A_t)$ is unknown. We therefore use its estimated value
$R_t + \gamma \max_{a'} Q(S_{t+1}, a'; \theta')$ as the target instead, as in [16], [26]–[28], where
$Q(S_{t+1}, a'; \theta')$ is the output of a separate target network with parameters θ′. These parameters are softly
updated using the soft-update equation (12), and the target network's input is the corresponding $S_{t+1}$ from the
interaction experience $E_t = (S_t, A_t, R_t, S_{t+1})$. The target network employs the same architecture as the CNN
network depicted in Figure 7. Thus, the targets are y = {$R_t + \gamma \max_{a'} Q(S_{t+1}, a'; \theta')$ : t ≥ 1}.
We use the stochastic gradient descent algorithm root mean squared propagation (RMSProp) [29]
with a minibatch size of 100 to reduce computational costs. During the training process of the CNN network,
the agent utilizes the experience replay technique by randomly selecting 100 samples from the replay memory
M to create input-target pairs. These pairs are then used to update the CNN parameters θ with the
RMSProp algorithm. Along with updating the CNN parameters, the agent also updates the target
network parameters θ′ using soft updating, as illustrated in (12) [27].
$\theta' = \beta\theta + (1 - \beta)\theta'$ (12)
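The sketch below illustrates this update step, reusing the hypothetical Q-network interface from the earlier sketch; the values of β, γ, and the learning rate are not given in this excerpt and are placeholders, while the minibatch of 100 samples and the RMSProp optimizer follow the text.

```python
import torch
import torch.nn.functional as F

BETA = 0.001   # soft-update rate; the actual value of beta is not given in this excerpt
GAMMA = 0.95   # discount factor; illustrative assumption

def soft_update(target_net, online_net, beta=BETA):
    """Equation (12): theta' <- beta * theta + (1 - beta) * theta'."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), online_net.parameters()):
            tgt.copy_(beta * src + (1.0 - beta) * tgt)

def train_step(online_net, target_net, optimizer, minibatch):
    """One update on a minibatch of 100 experiences sampled from the replay memory M."""
    states, actions, rewards, next_states = minibatch
    q_values = online_net(*states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target computed by the separate target network, as in the text.
        targets = rewards + GAMMA * target_net(*next_states).max(dim=1).values
    loss = F.mse_loss(q_values, targets)  # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    soft_update(target_net, online_net)

# optimizer = torch.optim.RMSprop(online_net.parameters(), lr=1e-3)  # lr assumed
```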
4.2. Results
Our study distinguishes itself by addressing the more complex task of optimizing traffic light control
in a network with multiple intersections. While earlier studies have explored fixed configurations of traffic
lights and the use of ANN for traffic management, they have not explicitly addressed the influence of
knowledge sharing between agents in a multi-intersection environment. We compare our results with two
existing approaches in the literature. The first approach considers a fixed configuration of traffic lights, a
method that fails to adapt to dynamic traffic conditions and thus often leads to suboptimal traffic flow. The
second approach builds upon the work of [19], which models the agents using an ANN and employs
a basic, straightforward representation of the environment's state. This state is discretized for each arm of an
intersection, where each cell follows an irregular pattern based on the DTSE formula [14].
However, the work in [19] is limited to an environment based on an isolated intersection. This
limitation prevents its extension to a more complex multi-agent environment with a decentralized training and
decentralized execution (DTDE) design, in which no knowledge sharing between the agents is feasible. As a result, these
approaches fall short when applied to real-world scenarios involving multiple interconnected intersections,
where coordinated decision-making is crucial for effective traffic management.
Our approach, guided by a CTDE framework, allows for knowledge sharing between agents,
addressing scalability issues often associated with MARL approaches. Despite using only two intersections in
this study, our CTDE design lays the groundwork for extending the model to larger networks, ensuring the
relevance and generalization of our approach. Our study suggests that the use of CNNs and a detailed
DTSE-based state representation enhances traffic management efficiency without compromising scalability. Unlike previous
approaches limited to isolated intersections, our CTDE framework facilitates effective coordination between
multiple intersections, leading to significant improvements in traffic flow dynamics.
For example, the approach of Vidali et al. [19], which relies on an ANN and a basic environmental
state representation, is constrained by its application to isolated intersections. This lack of inter-agent
communication hinders its scalability and efficiency in managing urban traffic. In contrast, our CTDE
framework's ability to enable knowledge sharing among agents significantly improves traffic synchronization
and overall network performance.
Our contribution demonstrates significant improvements in traffic flow while effectively minimizing
the total waiting time for vehicles at intersections. This is evident as our agents were able to accumulate fewer
negative rewards, reflecting remarkable robustness. Figure 10 illustrates that our approach outperforms the
neural network method in terms of accumulated negative rewards.
Figure 10. Cumulative negative reward over 100 episodes for both agents using CNN and neural network
architectures, with basic and advanced state representations
In Table 2, we present a comparison of our results against the neural network approach and a basic
representation of the environmental state. For the four traffic scenarios examined, the neural network-based
agent only surpassed the classical traffic light management model in the east-west (E-W) scenario. In contrast,
our approach, utilizing a CNN with an advanced and detailed representation of traffic intersection conditions,
excelled across all traffic scenarios.
Figures 11 to 14 illustrate the performance of each agent at the two intersections, allowing us to
visualize the impact of our contribution on agent behavior in various traffic situations. Figure 15 further
complements this analysis by presenting the management of vehicle queue lengths at both intersections for
both approaches across all proposed traffic scenarios. Notably, our method achieved optimal queue
management, particularly in the HIGH scenario, leading to reductions in total waiting times of 7%, 70%, 58%,
and 97% for the high, low, north-south (NS), and east-west (EW) traffic scenarios, respectively, compared to
the fixed traffic light configuration. Conversely, the neural network approach, despite knowledge sharing
between agents, showed only a marginal reduction in vehicle waiting times, primarily limited to the E-W
scenario.
While our study focused on a limited scale with two intersections, future research should explore the
scalability of our CTDE approach to larger networks. Additionally, further investigation is warranted to assess
the adaptability of our model under varying environmental conditions and traffic dynamics. The current study,
although promising, represents an initial step towards comprehensive urban traffic management.
Our findings highlight the potential for extending our CTDE framework to more complex urban
networks. Future studies could explore adaptive learning mechanisms within the MARL framework to
dynamically adjust traffic management strategies based on real-time traffic conditions and environmental
changes. Moreover, investigating the integration of other advanced AI techniques, such as reinforcement
learning algorithms with dynamic state representations, could further enhance the adaptability and efficiency
of traffic management systems.
In conclusion, our study demonstrates the effectiveness of integrating CNNs and a detailed DTSE-based state
representation within a CTDE framework for optimizing traffic flow at multiple intersections. By significantly
reducing vehicle waiting times across diverse traffic scenarios, our approach contributes to advancing
intelligent traffic management systems and lays a foundation for scalable applications in urban environments.
This work not only addresses current limitations in traffic management research but also opens avenues for
future exploration and development in the field.
Figure 11. Agent 1 and Agent 2 queue management for HIGH traffic scenarios in both approaches
Figure 12. Agent 1 and Agent 2 queue management for LOW traffic scenarios in both approaches
Figure 13. Agent 1 and Agent 2 queue management for EAST-WEST traffic scenarios in both approaches
Figure 14. Agent 1 and Agent 2 queue management for NORTH-SOUTH traffic scenarios in both approaches
Figure 15. Vehicle queue length management at both intersections provided by both approaches across all
proposed traffic scenarios
5. CONCLUSION
This paper explored the feasibility of utilizing reinforcement learning for adaptive traffic signal
management. Using a validated traffic simulator, we established a robust framework for evaluating and training
reinforcement learning agents in a controlled environment. Key investigations included detailed environmental
state representation, knowledge sharing among agents in a partially observable setting, and the development
of a refined reward mechanism focused solely on negative value accumulation to assess agent performance.
Future research endeavors will build upon these foundations to enhance the outcomes achieved in this study.
Specifically, we intend to further refine reinforcement learning algorithms and investigate their implementation
across larger, more complex road networks. Additionally, we aim to explore the coordination potential among
multiple reinforcement learning agents to achieve global traffic flow improvements rather than local
optimizations alone. These analyses are pivotal in understanding both the potential advantages and unintended
consequences associated with deploying self-adaptive systems in real-world settings. By continuing to refine
our approaches and methodologies, we aim to contribute meaningfully to the ongoing evolution of intelligent
transportation systems.
REFERENCES
[1] L. A. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2011, doi: 10.1109/TITS.2010.2091408.
[2] H. Wei, H. Yao, G. Zheng, and Z. Li, "IntelliLight: a reinforcement learning approach for intelligent traffic light control," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2496–2505, doi: 10.1145/3219819.3220096.
[3] T. Wu et al., "Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 8, pp. 8243–8256, 2020, doi: 10.1109/TVT.2020.2997896.
[4] .W ,“ - RL ,” Physical Review D, vol. 96, no. 1, 2017.
[5] K. J. Prabuchandran, A. N. Hemanth Kumar, and S. Bhatnagar, "Multi-agent reinforcement learning for traffic signal control," in 2014 17th IEEE International Conference on Intelligent Transportation Systems (ITSC 2014), 2014, pp. 2529–2534, doi: 10.1109/ITSC.2014.6958095.
[6] B. Abdulhai and L. Kattan, "Reinforcement learning: introduction to theory and potential for transport applications," Canadian Journal of Civil Engineering, vol. 30, no. 6, pp. 981–991, 2003, doi: 10.1139/l03-014.
[7] D. D. Oliveira et al., "Reinforcement learning-based control of traffic lights in non-stationary environments: a case study in a microscopic simulator," in CEUR Workshop Proceedings, 2006, vol. 223.
[8] A. L. C. Bazzan, "Opportunities for multiagent systems and multiagent reinforcement learning in traffic control," Autonomous Agents and Multi-Agent Systems, vol. 18, no. 3, pp. 342–375, 2009, doi: 10.1007/s10458-008-9062-9.
[9] V. Mnih et al., "Playing Atari with deep reinforcement learning," arXiv-Computer Science, pp. 1–9, 2013.
[10] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in 30th AAAI Conference on Artificial Intelligence, AAAI 2016, 2016, pp. 2094–2100, doi: 10.1609/aaai.v30i1.10295.
[11] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling network architectures for deep reinforcement learning," in 33rd International Conference on Machine Learning, ICML 2016, vol. 4, 2016, pp. 2939–2947.
[12] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, 2016, pp. 1–21.
[13] A. S. Lakshminarayanan, S. Sharma, and B. Ravindran, "Dynamic action repetition for deep reinforcement learning," in 31st AAAI Conference on Artificial Intelligence, AAAI 2017, 2017, pp. 2133–2139, doi: 10.1609/aaai.v31i1.10918.
[14] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv-Computer Science, pp. 1–9, 2016.
[15] W. Genders and S. Razavi, "Evaluating reinforcement learning state representations for adaptive traffic signal control," Procedia Computer Science, vol. 130, pp. 26–33, 2018, doi: 10.1016/j.procs.2018.04.008.
[16] N. Faqir, N. En-N , J. B , “D - b ,” 4th International Conference on Intelligent Computing in Data Sciences, ICDS 2020, 2020, doi: 10.1109/ICDS50568.2020.9268709.
[17] N. F q , . L q , J. B , “D - b NN XGB ,” International Journal of Advanced Computer Science and Applications, vol. 13, no. 9, pp. 529–536, 2022, doi: 10.14569/IJACSA.2022.0130961.
[18] Y. Gong, M. Abdel-Aty, Q. Cai, and M. S. Rahman, "Decentralized network level adaptive signal control by multi-agent deep reinforcement learning," Transportation Research Interdisciplinary Perspectives, vol. 1, 2019, doi: 10.1016/j.trip.2019.100020.
[19] A. Vidali, L. Crociani, G. Vizzari, and S. Bandini, "A deep reinforcement learning approach to adaptive traffic lights management," in CEUR Workshop Proceedings, 2019, pp. 42–50.
[20] M. L. Littman, "Value-function reinforcement learning in Markov games," Cognitive Systems Research, vol. 2, no. 1, pp. 55–66, 2001, doi: 10.1016/S1389-0417(01)00015-8.
[21] M. Lauer and M. Riedmiller, "An algorithm for distributed reinforcement learning in cooperative multi-agent systems," in Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 535–542.
[22] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the 11th International Conference on Machine Learning, ICML 1994, 1994, pp. 157–163, doi: 10.1016/B978-1-55860-335-6.50027-1.
[23] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," arXiv-Computer Science, pp. 1–14, 2016.
[24] L. Buşoniu, R. Babuška, and B. De Schutter, "Multi-agent reinforcement learning: an overview," Studies in Computational Intelligence, vol. 310, pp. 183–221, 2010, doi: 10.1007/978-3-642-14435-6_7.
[25] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO - Simulation of Urban MObility: an overview," in Proceedings of SIMUL 2011, The Third International Conference on Advances in System Simulation, 2011, pp. 1–6.
[26] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015, doi: 10.1038/nature14236.
[27] J. Gao, Y. Shen, J. Liu, M. Ito, and N. Shiratori, "Adaptive traffic signal control: deep reinforcement learning algorithm with experience replay and target network," arXiv-Computer Science, pp. 1–10, 2017.
[28] S. S. Mousavi, M. Schukat, and E. Howley, "Deep reinforcement learning: an overview," in Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016, vol. 16, 2018, pp. 426–440, doi: 10.1007/978-3-319-56991-8_32.
[29] G. Hinton, "Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
BIOGRAPHIES OF AUTHORS