Deep Reinforcement Learning Based Charging Scheduling For Household Electric Vehicles in Active Distribution Network

Abstract—With the booming of electric vehicles (EVs) across the world, their increasing charging demands pose challenges to urban distribution networks. Particularly, due to the further implementation of time-of-use prices, the charging behaviors of household EVs are concentrated on low-cost periods, thus generating new load peaks and affecting the secure operation of the medium- and low-voltage grids. This problem is particularly acute in many old communities with relatively poor electricity infrastructure. In this paper, a novel two-stage charging scheduling scheme based on deep reinforcement learning is proposed to improve the power quality and achieve optimal charging scheduling of household EVs simultaneously in the active distribution network (ADN) during the valley period. In the first stage, the optimal charging profiles of charging stations are determined by solving the optimal power flow with the objective of eliminating peak-valley load differences. In the second stage, an intelligent agent based on the proximal policy optimization algorithm is developed to dispatch the household EVs sequentially within the low-cost period considering their discrete nature of arrival. Through the powerful approximation of the neural network, the challenge of imperfect knowledge is tackled effectively during the charging scheduling process. Finally, numerical results demonstrate that the proposed scheme exhibits great improvement in relieving peak-valley differences as well as improving voltage quality in the ADN.

Index Terms—Household electric vehicles, deep reinforcement learning, proximal policy optimization, charging scheduling, active distribution network, time-of-use prices.

I. INTRODUCTION

EV sales in China reached 3.3 million in 2021 [2], and their soaring charging demands pose new challenges to the reliable operation of the distribution network, i.e., dramatic peak-valley load differences [3], power congestion [4], and undervoltage [5]. The active distribution network (ADN) enables a mass of flexible loads to deliver various regulation services to solve the above problems, which puts forward efficient and economic solutions for power systems [6]. Conventional flexible loads with promising regulation capacity, such as air conditioners [7], have been fully investigated in demand response for supporting system balancing. However, the limited response capacity and duration time of residential loads restrict their further implementation in long-time-scale dispatch considering users' comfort.

EVs are regarded as ideal alternatives to provide various regulation services in the ADN [8]. The significant regulation potential of EVs has attracted considerable research interest, and their prominent contributions to peak shaving, accommodation of renewable energy resources (RESs), and voltage regulation have been validated in [9], [10], and [11], respectively. Charging scheduling is fundamental for the ADN to utilize the flexibility of EVs due to the elasticity of departure time. A coordinated charging strategy for EVs in the distribution network is presented in [12] to manage power congestion. An intelligent charging scheduling algorithm is proposed in [13] to choose the suitable charging station (CS) and charging period with the goal of minimizing charging costs.
Under time-of-use (TOU) prices, the charging demands of household EVs are transferred from the peak period to the valley period. The fixed valley price means that the charging costs are minimum as long as the whole charging process is finished within the valley period. Consequently, owners instinctively decide to start charging at the beginning of the valley period for convenience, which leads to new congestions in the ADN [17]. The intensive charging demands are likely to result in equipment failures and severely threaten the secure operation of the ADN, accounting for the large-scale integration of EVs. Therefore, developing a promising charging scheduling method for household EVs is of great importance to tackle these challenges.

Apart from the above regulation obstacles under TOU prices, previous approaches to the charging scheduling problem are not suitable for household EVs with private charging piles, accounting for their sequential arrivals and uncertain charging demands. For instance, offline and online scheduling algorithms are proposed in [18] for EVs to save the charging cost, which is formulated as a mixed integer programming (MIP) problem assuming full knowledge of the charging demands. The bi-level optimal dispatching model proposed in [19] is also solved by converting it into an MIP problem. These studies assume that all charging demands are collected before optimization so as to convert them into solvable MIP problems. However, it is very difficult or even impossible to acquire all charging information in advance. In reality, EVs arrive sequentially and the charging demands can only be obtained precisely after arrival.

Under these circumstances, the charging process of a household EV can only be regarded as an uninterruptable process and the charging demands cannot be obtained in advance. The charging scheduling problem of household EVs can be formulated as a Markov decision process (MDP) [20], which aims to dispatch EVs sequentially with finite information and achieve the global optimum for all EVs in the end. Therefore, how to determine the specific charging start time of an EV when it arrives is one of the key priorities. On the basis of the charging reservation function, household EVs can be adopted appropriately in the charging scheduling of the ADN without extra equipment investments.

With the rapid development of deep learning (DL) and reinforcement learning (RL), deep reinforcement learning (DRL), which combines the advantages of DL and RL, has been proposed to overcome the curse of dimensionality and solve MDP problems with continuous action spaces [21]. Based on the powerful function approximation of neural networks and big data technology, DRL has emerged as an interesting alternative to address the sequential charging scheduling problem without full knowledge of charging demands [22]. First of all, the decision for the current EV only depends on the real-time environment states, i.e., arrival time, charging power, charging duration, and departure time, and it is natural for a DRL agent to address such a problem due to its sequential feature. Moreover, through interacting repeatedly with the dynamic environment, the agent can learn from experience and investigate an excellent control policy in the absence of models, which is more applicable in uncertain environments.

In the field of charging scheduling problems, DRL has been implemented in various optimizations. Reference [23] proposes a novel DRL method based on the prioritized deep deterministic policy gradient method, so as to solve the bi-level optimization of the examined EV pricing problem. For the EV CS, an energy management based on DRL is proposed in [24] to tackle varying input data and reduce the cumulative operation costs. However, the above literature mostly focuses on minimizing the operation costs of CSs by dispatching EVs, while the contributions to the improvement of power quality in the ADN are not fully accounted for. Under TOU prices, EVs can be further dispatched to relieve the congestion and shorten the peak-valley differences without extra charging costs.

To address the above problems and take full use of substantial household EVs during the valley period under TOU prices, this paper proposes a two-stage charging scheduling scheme for household EVs in the ADN. In the first stage, to relieve the power congestions and shorten the peak-valley differences, the optimal power flow (OPF) of the ADN is solved to determine the optimal charging profiles of CSs during the valley period. In the second stage, DRL based on the proximal policy optimization (PPO) algorithm is employed to dispatch the household EVs sequentially within the low-cost period according to the optimal charging profiles. The PPO algorithm was proposed by OpenAI in 2017 [25]; it combines the advantages of trust region policy optimization and advantage actor-critic to prevent the performance collapse caused by a large update of the policy. Besides, most decisions are finished by the distributed agents in the proposed scheme with lower communication requirements and computational burden, which makes it easily applicable in ADNs with numerous EVs.

The main contributions of this paper are as follows.

1) A two-stage charging scheduling scheme for household EVs is proposed to improve the power quality of the ADN and achieve the optimal charging scheduling of EVs simultaneously during the valley period, which consists of the OPF of the ADN and the charging dispatch of EVs. On this basis, the contributions of household EVs to power congestion management and peak-valley difference elimination are further exploited.

2) The realistic characteristics of household EVs are taken into consideration, including the limited controllability and uncertain charging demands. The charging process of an EV is regarded as an uninterruptable procedure with constant power, and the charging scheduling process is modelled as a sequential MDP problem, thereby the owner can make a charging reservation to achieve charging scheduling without extra equipment investments.

3) An intelligent DRL agent based on the PPO algorithm is developed to schedule the charging processes of EVs. Through the remarkable approximation function of the neural network, the agent can accumulate rich experience when interacting with various environments repeatedly to break the limitations of imperfect information. Hence, numerous household EVs are dispatched effectively to formulate the optimal charging profile even when lacking full knowledge of charging demands in advance.

The remainder of this paper is organized as follows.
Section II establishes the two-stage charging scheduling scheme of household EVs. Section III introduces the MDP model of EV charging scheduling and the intelligent DRL agent based on the PPO algorithm. Case studies are conducted in Section IV using real-world data of residential and EV loads, which proves the effectiveness of the proposed scheme. Section V concludes this paper.

II. TWO-STAGE CHARGING SCHEDULING SCHEME OF HOUSEHOLD EVS

In this section, an overview of the charging scheduling scheme is introduced first to illustrate the coordination between the first and second stages. Then, the detailed sequential decision problem in the second stage is formulated to describe the charging scheduling process.

A. Overview of Charging Scheduling Scheme

As important flexible loads of the ADN, household EVs are not fully exploited for further regulation potential under TOU prices. Generally, most charging durations of household EVs are much shorter than their sojourn time [26]. Substantial charging demands of EVs concentrate on the prophase of the valley period lacking effective guidance, which results in extra power congestions and wastes the regulation potential of EVs to a large extent.

At the same time, the ADN is suffering from power quality issues including dramatic peak-valley differences, power congestions, and voltage limit violations. Consequently, the managers of the ADN, i.e., the distribution system operator (DSO) and the energy supplier, are motivated to further dispatch household EVs to improve the power quality under TOU prices without extra equipment investments and charging costs, even earning profits through delivering ancillary services for power systems. Apart from the DSOs, estates or community administrators are also encouraged to implement such a charging scheduling, so as to satisfy increasing charging demands accounting for the limited carrying capacities of ADNs.

The schematic diagram of the two-stage charging scheduling scheme of household EVs is shown in Fig. 1, assuming that the ADN at the residential side is operated by the DSO and consists of several residential loads and EV charging loads. Considering the relatively centralized installations of private charging piles, e.g., in underground parking spaces, nearby charging piles are aggregated as a CS and managed by the aggregator. Assume that only EVs can provide flexible regulation service while other residential loads are regarded as fixed loads. To relieve the power congestion caused by intensive charging demands and transfer them to appropriate time periods, a two-stage problem is formulated.

In the first stage, determining the optimal charging profiles of CSs is the key point. Because of the various operating characteristics of different ADNs, it is of great importance for the DSO to choose favorable optimization objectives at first. In this paper, charging scheduling of household EVs is employed to flatten the tie-line power to provide ancillary services to power systems. Considering the mutual impacts between different nodes, the optimal charging profiles of CSs are not appropriate to be determined simply according to their electricity sectors. Therefore, the OPF algorithm is used to solve the problem with regard to the secure and stable operation. The optimal charging power is calculated with the goal of shortening the peak-valley differences, based on historical and forecasted load data during the valley period.

Fig. 1. Schematic diagram of two-stage charging scheduling scheme of household EVs.

In the second stage, to overcome the obstacle of limited charging information, the aggregator control center based on DRL is used to make decisions with imperfect knowledge and dispatch the charging processes of EVs in terms of the determined charging profile. Before the valley period, when the kth EV arrives home, the user needs to plug in the EV and submit the charging demands to the control center, including the charging power, charging duration time, and departure time. Then, according to the optimal charging profile and the previously scheduled EV power, the aggregator will determine the most suitable charging time period for the kth EV immediately. At last, the owner makes a charging reservation to realize the charging scheduling without extra charging costs, and even gets some incentives for the contributions to the operation of the ADN.

B. First-stage Problem: OPF Model of ADN

The first-stage problem aims to involve EVs in shortening peak-valley differences and managing congestions of the ADN. Considering the fact that an ADN typically features a radial topology, as shown in Fig. 2, the complex power flow at each node can be described by the classic DistFlow model [27].

P_{i+1} = P_i - r_i (P_i^2 + Q_i^2)/V_i^2 - p_{i+1}    (1)

Q_{i+1} = Q_i - x_i (P_i^2 + Q_i^2)/V_i^2 - q_{i+1}    (2)

V_{i+1}^2 = V_i^2 - 2(r_i P_i + x_i Q_i) + (r_i^2 + x_i^2)(P_i^2 + Q_i^2)/V_i^2    (3)

p_i = p_i^D - p_i^g,  q_i = q_i^D - q_i^g    (4)
where P_i and Q_i are the active and reactive power flows from node i to node i+1, respectively; p_i and q_i are the active and reactive power demands at node i, respectively, which are determined by the load demands (with superscript D) and generator outputs (with superscript g); V_i is the voltage at node i; and r_i and x_i are the resistance and reactance of the branch from node i to node i+1, respectively.

Fig. 2. Diagram of ADN with radial topology.

The DistFlow equations above are nonlinear and difficult to solve. Ignoring network losses, the DistFlow equations can be converted to the linearized power flow equations in (5), which have been widely used in distribution network analysis [28].

P_{i+1} = P_i - p_{i+1}
Q_{i+1} = Q_i - q_{i+1}
V_{i+1} = V_i - (r_i P_i + x_i Q_i)/V_1    (5)
p_i = p_i^D - p_i^g
q_i = q_i^D - q_i^g

The OPF model proposed in this subsection aims to flatten the tie-line power, as well as maintain the node voltages within the acceptable range. The objective tie-line power should be determined in advance, and the power profiles of all nodes can be calculated using the OPF model with the goal of minimizing the differences between the real tie-line power and the objective power. Considering the limited penetration of EVs in the ADN at present, it is difficult to eliminate the peak-valley differences completely without abundant regulation capacity. Hence, the objective tie-line power, which is related to the residential consumption and the charging electricity, can be computed as:

P_obj(t) = P_u(t) + (E_EV / E_dev)(P_u,max - P_u(t))    (6)

E_dev = ∫_{t_s}^{t_e} (P_u,max - P_u(t)) dt    (7)

E_EV = ∫_{t_s}^{t_e} P_EV(t) dt    (8)

where P_obj(t) is the objective tie-line power of the ADN at time t; P_EV(t) and P_u(t) are the power of EVs and residential loads at time t, respectively; P_u,max is the maximum power of residential loads during the valley period; E_EV is the total electricity consumption of EVs; E_dev is the electricity deviation between the residential loads and the maximum power; and t_s and t_e are the start time and end time of the valley period, respectively.

Therefore, the objective function of the OPF model can be represented by:

min |ΔP| = ∫_{t_s}^{t_e} |P_sub(t) - P_obj(t)| dt    (9)

where P_sub(t) is the real tie-line power at time t. The OPF problem aims to optimize the power profiles of all nodes to minimize the differences between P_sub(t) and P_obj(t).

Assume F and R represent the set of EV nodes and the set of residential nodes, respectively. Considering the continuous characteristic of the charging process, it is difficult to regulate the charging power of a CS dramatically in a short period, thereby the ramp rate of the CS needs to be limited within λ.

|p_i(t+1) - p_i(t)| ≤ λ p_i(t)    ∀i ∈ F    (10)

In addition, the constraints of the ADN mainly include the nodal voltage and feeder ampacity, as shown in (11) and (12), respectively.

V_i^min ≤ V_i(t) ≤ V_i^max    ∀i ∈ F ∪ R, t ∈ [t_s, t_e]    (11)

P_i^min ≤ P_i(t) ≤ P_i^max    ∀i ∈ F ∪ R, t ∈ [t_s, t_e]    (12)

where V_i^min and V_i^max are the minimum and maximum nodal voltages at node i, respectively; and P_i^min and P_i^max are the minimum and maximum ampacities of the branch from node i to node i+1, respectively.

C. Second-stage Problem: Scheduling Model of Household EVs

After calculating the OPF of the ADN, the optimal charging profiles of CSs are determined. Then, the agent based on DRL will dispatch EVs to approach the optimal charging power.

The charging process of EVs can be divided into three parts, which are trickle charging, constant current charging, and constant voltage charging, where the constant current charging process accounts for 80% of the duration and has relatively constant power [29]. On the other hand, considering the actual situations where household charging piles are installed, it is difficult for charging piles to achieve continuous power regulation due to the lack of communication conditions. Therefore, the charging process of a household EV is regarded as a continuous process with constant power [30], and the charging demands of the kth EV, CD_k, can be represented using a tuple as:

CD_k = (t_arr,k, P_c,k, t_c,k, t_dep,k)    (13)

where t_arr,k is the arrival time of the kth EV; P_c,k and t_c,k are the constant charging power and charging time duration, respectively; and t_dep,k is the departure time, which means that the charging process needs to be finished before the departure of the EV to satisfy the owner's traveling energy requirements.

EVs arrive sequentially and the specific charging demands can only be obtained precisely when an EV is plugged in. The aggregator control center aims to transfer the charging demands to formulate a redistribution scheme of charging demands based on the objective charging power. Through the charging scheduling of EVs, not only can the power congestions at the prophase of the valley period be alleviated, but also the ancillary service for shortening the peak-valley differences can be delivered.
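For illustration only, the first-stage problem described by (5)-(12) and objective (9) can be sketched as a small convex program. The sketch below uses cvxpy, assumes a single radial feeder with hourly steps, active power only, and illustrative placeholder data (network size, impedances, load shapes, and EV energy budgets are not the values used in this paper); it is not the authors' implementation.

# Minimal sketch of the first-stage OPF (5)-(12): track the objective tie-line
# profile (6)-(9) by choosing the CS charging power at every step. Lossless
# linearized DistFlow, active power only; all numbers are placeholders.
import numpy as np
import cvxpy as cp

T, n = 10, 5                       # valley-period steps (22:00-08:00), nodes 1..n
r = np.full(n, 0.002)              # branch resistances (p.u., illustrative)
V1 = 1.0                           # reference voltage at the feeder head (p.u.)
p_res = 0.5 + 0.3 * np.cos(np.linspace(0, np.pi, T))   # residential load per node (MW)
ev_nodes = [2, 4]                  # indices of CS (EV) nodes, set F
E_ev = {2: 3.0, 4: 2.0}            # charging energy required per CS over the window (MWh)
lam = 0.15                         # ramp-rate limit of a CS, as in (10)

# Objective tie-line power (6)-(8), computed from the residential profile alone.
P_u = n * p_res                                    # aggregate residential load (MW)
E_dev = np.sum(P_u.max() - P_u)                    # discretized (7)
P_obj = P_u + (sum(E_ev.values()) / E_dev) * (P_u.max() - P_u)   # (6)

# Decision variables: charging power of each CS at every step.
p_ev = {i: cp.Variable(T, nonneg=True) for i in ev_nodes}

# Nodal injections and lossless downstream flows, analogous to (5).
p_node = [p_res + (p_ev[i] if i in ev_nodes else 0) for i in range(n)]
P_flow = [cp.sum(cp.vstack(p_node[i:]), axis=0) for i in range(n)]

# Linearized voltages along the feeder, V_{i+1} = V_i - r_i P_i / V_1.
V = [V1]
for i in range(n):
    V.append(V[-1] - r[i] * P_flow[i] / V1)

cons = []
for i in ev_nodes:
    cons += [cp.sum(p_ev[i]) == E_ev[i]]                                 # deliver required energy
    cons += [cp.abs(p_ev[i][1:] - p_ev[i][:-1]) <= lam * p_ev[i][:-1]]   # ramp limit (10)
cons += [v >= 0.95 for v in V[1:]] + [v <= 1.05 for v in V[1:]]          # voltage limits (11)
cons += [P_flow[0] <= 1.2 * P_u.max()]                                   # ampacity proxy (12)

# Objective (9): track the target tie-line profile over the valley period.
prob = cp.Problem(cp.Minimize(cp.sum(cp.abs(P_flow[0] - P_obj))), cons)
prob.solve()
print("deviation:", prob.value)
print("optimal CS profiles:", {i: np.round(p_ev[i].value, 2) for i in ev_nodes})

The resulting p_ev profiles play the role of the optimal charging profiles of CSs passed to the second stage.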
The charging start time t_b,k selected for the kth EV must keep the whole charging process within the valley period and before the departure time:

t_s ≤ t_b,k ≤ min(t_e, t_dep,k) - t_c,k    (16)

After dispatching the kth EV, the control center will update the scheduled charging power of the first k EVs as follows:

P_sum,k^EV(t) = Σ_{j=1}^{k} P_EV,j(t),  t ∈ [t_s, t_e]    (17)

where P_sum,k^EV(t) is the scheduled power of the first k EVs at time t.

After the kth EV is dispatched, the deviation between the optimal charging power and the scheduled charging power is evaluated as:

Dev_k = ∫_{t_s}^{t_e} |P_opt,i(t) - P_sum,k^EV(t)| dt    (21)

where r_k is the reward gained by the agent after taking the action a_k; Dev_k is the deviation between the optimal charging power and the scheduled charging power of the first k EVs; P_opt,i(t) is the optimal power of node i at time t; and ρ is the coefficient of reward, which is used to normalize the reward between different nodes with various EVs.
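The bookkeeping behind one dispatch step, i.e., the feasible start times in (16), the scheduled-power update (17), and the deviation (21), can be sketched as follows. This is an illustrative sketch on a 1-hour grid with a rectangular constant-power charging block; the greedy enumeration below only demonstrates the computation, whereas in the proposed scheme the start time is selected by the DRL agent.

# Evaluate one dispatch step of the kth EV: enumerate feasible start slots (16),
# update the scheduled power (17), and measure the discretized deviation (21).
import numpy as np

def dispatch_kth_ev(P_opt, P_sum, P_c, t_c, t_dep, t_s=0, t_e=10):
    """Pick a start slot for the kth EV and return the updated schedule."""
    best = None
    for t_b in range(t_s, min(t_e, t_dep) - t_c + 1):    # feasible starts, (16)
        trial = P_sum.copy()
        trial[t_b:t_b + t_c] += P_c                       # add the rectangular charging block
        dev = np.sum(np.abs(P_opt - trial))               # discretized deviation (21)
        if best is None or dev < best[0]:
            best = (dev, t_b, trial)
    dev_k, t_b_k, P_sum_k = best
    return t_b_k, P_sum_k, dev_k

# Example: a 10-hour valley period, a target profile to fill, and one 7 kW, 3-hour EV.
P_opt = np.array([20, 30, 45, 60, 60, 55, 40, 25, 15, 10], dtype=float)  # kW, placeholder
P_sum = np.zeros(10)                                                      # nothing scheduled yet
t_b, P_sum, dev = dispatch_kth_ev(P_opt, P_sum, P_c=7.0, t_c=3, t_dep=10)
print("chosen start slot:", t_b, "remaining deviation:", dev)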
Moreover, the reward function can also reveal how much the charging demand is not satisfied or the charging costs have increased. Figure 4 illustrates the specific penalty when charging demands are satisfied or not. If the charging process of the kth EV is beyond the boundary of the valley period, the power deviation Dev_k^(a) will be smaller than Dev_k^(b) after dispatching the kth EV, and the agent will obtain a smaller reward r_k.

Fig. 4. Specific penalty when charging demands are satisfied or not satisfied. (a) Demands are satisfied. (b) Demands are not satisfied.

Theoretically, the total reward with optimal charging scheduling can be represented as:

max E(Σ_k r_k) = ∫_{t_s}^{t_e} P_opt,i(t) dt    (22)

ρ = R_norm / max E(Σ_k r_k)    (23)

where R_norm is a constant value for normalizing the reward.

From the perspective of the whole scheduling process, the agent will schedule all EVs to approach the optimal charging profiles so as to maximize the total reward. Nevertheless, considering the indivisibility of EV charging processes, it is tricky to realize the global optimum through the optimal decision of every single step. To be specific, the present decision has durable effects on the later charging scheduling processes, which are difficult to involve in the optimization problem and solve using conventional methods. Based on the outstanding approximation ability of the neural network, DRL can take these subsequent effects into consideration. For example, the DRL agent may take an action that cannot gain the maximum reward at present, but it contributes to obtaining more rewards in the future and achieving the maximum total reward.

B. PPO Algorithm

Policy gradient is an essential method for training the DRL agent to maximize the cumulative reward, which works by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm [25]. The most commonly used gradient estimator ĝ can be represented as:

ĝ = Ê_t[∇_θ log π_θ(a_t|s_t) Â_t]    (24)

where π_θ is the stochastic policy function with parameter θ; Â_t is the estimator of the advantage function at time t; a_t and s_t are the action and state, respectively; and Ê_t is the empirical average over finite samples.

As a result, the loss function is defined as:

L^PG(θ) = Ê_t[log π_θ(a_t|s_t) Â_t]    (25)

However, traditional policy gradient methods have low utilization efficiency of sampling data and have to spend too much time on sampling new data once the policy is updated. Besides, it is difficult to determine appropriate steps for updating the policy so as to prevent large differences between the new policy and the old policy.

Therefore, the PPO algorithm was proposed in 2017 to address the above shortcomings. The detailed training workflow of the DRL agent with the PPO algorithm is demonstrated in Fig. 5. The PPO algorithm consists of three networks, including two actor networks with the new policy π_θ and the old policy π_θ′ (parameterized by θ and θ′, respectively) and a critic network V_ϕ (parameterized by ϕ).
[Fig. 5: detailed training workflow of the DRL agent with the PPO algorithm]
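As a rough illustration of the three networks just described, the sketch below defines a stochastic policy (actor) and a state-value critic as multi-layer perceptrons with two hidden layers of 64 neurons, matching the layer sizes reported later in this section. The state dimension, the treatment of the charging start time as a choice among discrete slots, and the activation functions are illustrative assumptions, not the authors' implementation.

# Sketch of the actor (policy pi_theta over candidate charging start slots) and
# the critic V_phi, each a two-hidden-layer MLP with 64 neurons per layer.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=4, n_actions=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        # Returns a categorical distribution over feasible charging start slots.
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)   # estimated state value V_phi(s)

# Sampling an action for one arriving EV (state values are placeholders:
# arrival time, charging power, charging duration, departure time).
actor, critic = Actor(), Critic()
state = torch.tensor([[18.0, 7.0, 3.0, 8.0]])
dist = actor(state)
action = dist.sample()                        # index of the chosen start slot
print(action.item(), dist.log_prob(action).item(), critic(state).item())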
To increase the sample efficiency, π_θ′ is used to interact with environments and sample N trajectory sets with T timesteps, while π_θ is the actual network that needs to be trained according to the demonstrations of π_θ′. Utilizing importance sampling technology, the same trajectory sets can be used multiple times although there are differences between π_θ′ and π_θ. The probability ratio of the new policy and the old policy can be expressed as:

r_t(θ) = π_θ(a_t|s_t) / π_θ′(a_t|s_t)    (26)

Another point of the PPO algorithm is that the new policy should avoid significant deviation from the old policy after every update, so as to maintain the accuracy of importance sampling and avoid accidental performance collapse. Hence, a clipped surrogate function is used to remove the incentive for moving r_t(θ) outside of the interval [1-ε, 1+ε], so the loss function of the PPO algorithm can be represented as [25]:

L(θ) = Ê_t[min(r_t(θ)Â_t, clip(r_t(θ), 1-ε, 1+ε)Â_t)]    (27)

where ε is the clipping parameter, which aims to clip the probability ratio. For instance, the objective will increase if the advantage function Â_t is positive, but the increase is maintained within 1+ε by the limit set by the clipping function.

Consequently, the network parameters θ of the new policy are updated using:

θ = arg max_θ Ê_{s_t, a_t ~ π_θ′}[L(s_t, a_t, θ′, θ)]    (28)

Apart from the actor network, a critic network is used to estimate the state value function and the advantage function. The advantage function describes how much an action is better than other actions on average, which is defined as:

Â_t = Q_t(s_t, a_t) - V_t(s_t)    (29)

Q_t(s_t, a_t) = E[Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a]    (30)

V_t(s_t) = E[Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s]    (31)

where γ is the discounting factor, which aims to balance the importance between immediate and future rewards; and V_t and Q_t are the value function and the action-value function, respectively. Therefore, V_t(s_t) is the expected value on average at state s_t, which covers all optional actions, and Q_t(s_t, a_t) is the expected value at state s_t when taking action a_t.

The critic network V_ϕ is updated using regression to minimize a mean-squared-error objective [22]:

ϕ = arg min_ϕ (V_ϕ(s_t) - R̂_t)^2    (32)

R̂_t = Σ_{t′=t}^{T} r_{t′}(s_{t′}, a_{t′}, s_{t′+1})    (33)

where R̂_t is the reward-to-go, which is the sum of rewards after a point in the trajectory.

The DRL agent with the PPO algorithm tries to schedule the EV charging processes according to the optimal charging profile, with the goal of maximizing the total expected rewards. The training workflow of the PPO algorithm is summarized in Algorithm 1. The corresponding parameters are shown in Table I, where lr is the learning rate and MB is the minibatch size.

The discounting factor γ and the clipping parameter ε are important hyperparameters that observably influence the performance of the agent. The importance of the current action depends on the discounting factor γ, and a larger γ means that the agent is more long-sighted so as to take full consideration of future uncertainties to achieve the maximum cumulative rewards. Thus, γ is set to be 0.99 [14].

Algorithm 1: training workflow of PPO algorithm
1: Initialize policy network π_θ′ and value function network V_ϕ
2: for i = 0; i < N; i++ do
3:   Run policy π_θ′ to interact with the environment for T timesteps and obtain the trajectory samples (s_t, a_t, r_t, s_{t+1})
4:   Calculate the reward-to-go R̂_t
5:   Use V_ϕ to estimate the advantage function Â_t
6:   Compute the loss function L(θ) with regard to θ with K epochs of gradient descent
7:   π_θ′ ← π_θ; V_ϕ′ ← V_ϕ
8: end for

TABLE I
PARAMETERS OF PPO ALGORITHM

Parameter   Value      Parameter   Value
λ           0.15       N           2048
ε           0.2        K           10
γ           0.99       R_norm      1000
lr          3×10^-4    MB          64

Both the convergence speed and performance stability depend on ε [22]; hence, ε is set to be 0.2 to balance the training speed and the total reward of the agent [25], [32]. The multi-layer perceptrons of the policy network are composed of two hidden layers and the number of neurons in each layer is 64. The number of training episodes is set to be 250.

IV. CASE STUDY

To evaluate the performance of the proposed two-stage DRL-based charging scheduling scheme, case studies are conducted in this section.

A. Parameter Setting of ADN

An ADN for simulation is established based on the IEEE 14-node test feeder, as shown in Fig. 6, where 10 nodes are set as residential loads without regulation flexibility and 3 nodes are set as CSs. The ADN is operated by the DSO, with the goal of relieving the power congestion and shortening the peak-valley differences of the tie-line power.

Fig. 6. ADN based on IEEE 14-node test feeder.
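Before turning to the numerical results, the update rules in (26)-(28) and (32)-(33) summarized in Algorithm 1 can be made concrete with the following minimal sketch of one PPO update over a sampled batch. It reuses the Actor/Critic classes from the earlier sketch, takes the hyperparameters from Table I (ε = 0.2, K = 10, lr = 3×10^-4), and assumes the batch tensors are given; it is a sketch under those assumptions, not the authors' code.

# One PPO update step: clipped surrogate objective (27) for the actor and
# regression of the critic to the reward-to-go (32)-(33).
import torch

def ppo_update(actor, critic, states, actions, old_log_probs, rewards_to_go,
               eps=0.2, K=10, lr=3e-4):
    opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=lr)
    for _ in range(K):
        dist = actor(states)                                   # pi_theta(.|s_t)
        log_probs = dist.log_prob(actions)
        values = critic(states)
        advantages = (rewards_to_go - values).detach()         # advantage estimate, cf. (29)
        ratio = torch.exp(log_probs - old_log_probs)           # r_t(theta), (26)
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()  # -(27)
        critic_loss = (values - rewards_to_go).pow(2).mean()   # (32)
        opt.zero_grad()
        (actor_loss + critic_loss).backward()
        opt.step()
    # After the K epochs, the old policy pi_theta' would be synchronized with
    # pi_theta, as in step 7 of Algorithm 1.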
In accordance with the realistic situations in Zhejiang, China, the valley period of the TOU price is set from 22:00 to 08:00 of the next day. Meanwhile, the residential load data during the valley period are obtained from a housing estate in Hangzhou, Zhejiang, as shown in Fig. 7. Most residential loads have similar features, and the summits of the electricity consumption appear at around 22:00. Then, the power demands continue to decline and reach the nadir at 03:00 of the next day. Finally, the electricity demands recover gradually as residents wake up. Therefore, there are significant peak-valley differences in residential distribution networks.

[Fig. 7: residential load power (MW) of different nodes during the valley period, 22:00-08:00]

Fig. 8. Original and optimal charging profiles of EVs at different CSs.

The numbers of household EVs are set to be 179, 225, and 146 at node 5, node 9, and node 13, respectively. It is assumed that the charging power obeys the uniform distribution and the starting SOC obeys the normal distribution [16], whose parameters can be found in Table II. Moreover, the charging efficiency and battery capacity are set to be 90% and 50 kWh, respectively. Detailed charging data of household EVs are obtained from 550 individual meters which are equipped for household EVs. To simplify the charging scheduling, the arrival time is distributed uniformly from 16:00 to 22:00, and the departure time is set consistently at 08:00 of the next day.

TABLE II
PARAMETERS OF EVS AT DIFFERENT NODES

Parameter        Description          Value
P_c,k (kW)       Charging power       U(5, 25)
η (%)            Charging efficiency  90
C_EV,k (kWh)     Battery capacity     50
SOC_s,k (%)      Starting SOC         N(50, 20)
SOC_e,k (%)      Expected SOC         100
t_arr,k (hour)   Arrival time         U(16:00, 22:00)
t_dep,k (hour)   Departure time       08:00

Note: the normal distribution with the mean value of μ and the standard deviation of σ is denoted as N(μ, σ).

B. Optimal Charging Profiles in First-stage Problem

Based on the first-stage model introduced in Section II, the OPF of the ADN is calculated with the goal of flattening the tie-line power, and the optimal charging profiles are shown in Fig. 8.

It can be observed that the main charging demands are transferred to 01:00-05:00, during which other electricity consumptions are the lowest. Moreover, the regulation targets are not allocated simply according to the total electricity demands of CSs; nodal voltages and impacts from other nodes are also taken into account to realize the multidimensional optimum. Therefore, the CSs are coordinated and the optimal charging profiles at different nodes are various, as shown in Fig. 8. For example, the charging summit of node 9 appears at 01:00 while that of node 13 appears at 03:00.

C. Charging Scheduling Results in Second-stage Problem

On the basis of the optimal charging profiles calculated in the first stage, the DRL agent needs to schedule the charging processes of EVs sequentially to approach the optimal profiles. The charging scheduling results of household EVs at different nodes are shown in Fig. 9, where the deviation represents its absolute value. During the scheduling process, the agent makes decisions based on probability, which is calculated through massive pieces of training. All feasible actions are possible to be taken by the agent, although the probability of making a bad decision is very low. Therefore, it is inevitable for the agent to take bad actions that will cause deviations in a series of decision processes. It can be observed that the real power profiles are very close to the optimal power profiles except for some points, which proves the effectiveness of the DRL agent with the PPO algorithm on charging scheduling.

[Fig. 9: charging scheduling results at different nodes, showing deviation, optimal power, and real power during the valley period; panels (a), (b), (c)]

During the whole charging scheduling processes, the DRL agent makes efforts to maximize the reward and obtains a total reward of 939.5, 923.1, and 946.6 for node 5, node 9, and node 13 in the end, respectively. Similar to the indexes of the average deviation and the maximum deviation, the total reward indicates that the agent performs better with a smoother objective charging profile.

Moreover, the average SOC and median SOC at specific hours are further analyzed, as shown in Fig. 10. The median SOC reflects the charging completion result of every EV. It can be observed that more than 50% of EVs have finished their charging before 03:00 in the original charging scheduling, even though only half of the valley period has passed. The results also indicate there is a significant regulation potential to be exploited for household EVs. The average SOC represents the overall charging progress of all EVs. It can be observed that the charging speed of the original charging is much faster than that of the proposed charging scheduling in the first half of the valley period, when the electricity demand is decreasing towards the nadir. Hence, the original charging aggravates the peak-valley differences in the ADN. On the contrary, the proposed charging scheduling takes full use of the shiftability of charging demands to reshape the charging curve with the goal of eliminating peak-valley differences in the ADN.

Fig. 10. Average SOC and median SOC during valley period.
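For reference, an EV population consistent with Table II and the demand tuple in (13) can be generated along the following lines. The distribution parameters come from Table II, while the sampling code itself and the SOC clipping are illustrative assumptions; the actual demands in this paper come from 550 individual meters rather than from synthetic sampling.

# Sample an EV fleet consistent with Table II: charging power ~ U(5, 25) kW,
# starting SOC ~ N(50, 20) %, expected SOC 100 %, 50 kWh battery, 90 % charging
# efficiency, arrival ~ U(16:00, 22:00), departure fixed at 08:00 next day.
import numpy as np

rng = np.random.default_rng(0)

def sample_fleet(n_ev, capacity=50.0, eta=0.9, soc_exp=100.0):
    p_c = rng.uniform(5.0, 25.0, n_ev)                         # charging power (kW)
    soc_s = np.clip(rng.normal(50.0, 20.0, n_ev), 5.0, 95.0)   # starting SOC (%), clipped
    energy = (soc_exp - soc_s) / 100.0 * capacity / eta        # energy drawn from the grid (kWh)
    t_c = np.ceil(energy / p_c)                                # charging duration (hours)
    t_arr = rng.uniform(16.0, 22.0, n_ev)                      # arrival time (hour of day)
    return [dict(t_arr=a, P_c=p, t_c=d, t_dep=32.0)            # CD_k tuple, (13); 32.0 = 08:00 next day
            for a, p, d in zip(t_arr, p_c, t_c)]

fleet = sample_fleet(179)   # e.g., the EVs aggregated at node 5
print(fleet[0])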
The A2C algorithm has a sharp increase at the beginning of the training process, which appears much faster than that of the PPO algorithm. The results prove that the clipped function of the PPO algorithm has worked and limited the drastic change of the policy network, so as to effectively avoid performance collapse and local optimum. As illustrated in Fig. 11, the A2C algorithm experiences an oscillation period after the aggressive policy update and performance improvement, and then it converges to a total reward around 908.

[Fig. 11: total reward of different algorithms during the training process (reward versus episode)]

The case studies are implemented in Python 3.7 on an Intel(R) i7 12700kf, 32 GB RAM desktop. In terms of test results, the PPO algorithm outperforms the A2C algorithm, DQN algorithm, and random policy, although the PPO algorithm has the lowest training speed with the same timesteps. To be specific, the PPO algorithm can obtain a total reward of 937 when scheduling EV charging processes, which is 29, 112, and 337 more than that of the A2C algorithm, DQN algorithm, and random policy, respectively.

Then, the loss function performance of the PPO algorithm is presented, as shown in Fig. 12. The value loss and loss represent the performance of the PPO algorithm on training sets and test sets, respectively. It can be observed that the value loss and loss share similar trends during the training process, which demonstrates the remarkable adaptability to various data sets.

Hence, the PPO algorithm is suited for addressing the charging scheduling problem and can be adopted to handle the uncertainty of the environment.

Fig. 12. Loss function performance of PPO algorithm during training process. (a) Value loss. (b) Loss.

Fig. 13. Comparison between original and optimized tie-line power profiles.

Different from the transmission network, the distribution network possesses much higher resistance, and the active power has more significant effects on voltages. As a result, apart from the great contributions to the elimination of peak-valley differences, the voltage quality of the ADN is also improved through scheduling household EVs during the valley period. As shown in Fig. 14, due to the overlap of the peak of residential electricity consumption and the intensive charging demands, some nodal voltages are extremely low at the prophase of the valley period.
Fig. 14. Original nodal voltages.

Utilizing the proposed two-stage charging scheduling scheme, the voltage violation problem is addressed effectively, as shown in Fig. 15.

Fig. 15. Nodal voltages after charging scheduling.

Besides, the oscillations of the nodal voltages are also limited with the smoother tie-line power, which is beneficial for reducing the operating times of voltage regulation equipment such as on-load tap changers. For example, the voltage variation of node 4 decreases from 0.0051 p.u. to 0.0028 p.u. during the whole valley period. Meanwhile, the contributions to voltage regulation are not restricted to the nodes of CSs. As shown in Table IV, all nodal voltages have different degrees of improvement.

Because the OPF is involved in the first-stage optimization problem, all nodal voltages are taken into consideration when determining the optimal charging profiles of CSs. Specifically, the voltage nadir of node 8 has a 0.39% improvement. For the old communities with relatively poor electricity infrastructure, the proposed scheme can also satisfy residential power consumption and charging demands simultaneously with limited carrying capacity.

Therefore, under the existing TOU price circumstances, the proposed two-stage charging scheduling scheme can take full use of the regulation potential of household EVs during valley periods to improve the power quality of the ADN without extra equipment investments and charging costs, including peak-valley difference elimination, congestion management, and nodal voltage regulation.

TABLE IV
IMPROVEMENT OF NODAL VOLTAGES

Node No.   Voltage before scheduling (p.u.)   Voltage after scheduling (p.u.)   Improvement (%)
7          0.971                              0.974                             0.28
8          0.972                              0.976                             0.39
9          0.969                              0.973                             0.35
10         0.974                              0.976                             0.21
11         0.976                              0.978                             0.23
12         0.976                              0.978                             0.23
13         0.974                              0.977                             0.29
14         0.973                              0.975                             0.28

V. CONCLUSION

In the context of taking full use of the regulation potential of household EVs under TOU prices, this paper proposes a two-stage charging scheduling scheme to dispatch household EVs. The first-stage problem aims to involve the charging scheduling of household EVs in the operation and optimization of the ADN, and the optimal charging power profiles of CSs are determined by calculating the OPF so as to relieve the power congestions and shorten the peak-valley differences. Furthermore, a PPO-based DRL agent is developed to dispatch the charging processes of EVs in terms of the optimal charging power. Case studies with realistic data are conducted to illustrate the multidimensional performance of the proposed scheme. It is demonstrated that the PPO-based DRL agent can be adopted in different CSs with various objective charging profiles and EV amounts. Besides, the charging scheduling of EVs contributes to significant improvement in power quality, including decreasing the peak-valley differences and stabilizing the nodal voltages.

Moreover, the proposed scheme can be adopted properly in substantial distributed communities with the combination of edge computing technology. On this basis, numerous flexible loads, e.g., thermostatic loads, energy storage, and RESs, can be involved in the proposed scheme to be managed efficiently, so as to activate their flexibility and enhance the regulation capacity of ADNs in the near future.

REFERENCES

[1] T. Chen, X.-P. Zhang, J. Wang et al., "A review on electric vehicle charging infrastructure development in the UK," Journal of Modern Power Systems and Clean Energy, vol. 8, no. 2, pp. 193-205, Mar. 2020.
[2] IEA. (2022, May). Global EV outlook 2022. [Online]. Available: https://www.iea.org/reports/global-ev-outlook-2022
[3] H. Liu, P. Zeng, J. Guo et al., "An optimization strategy of controlled electric vehicle charging considering demand side response and regional wind and photovoltaic," Journal of Modern Power Systems and Clean Energy, vol. 3, no. 2, pp. 232-239, Jun. 2015.
[4] Fco. J. Zarco-Soto, J. L. Martínez-Ramos, and P. J. Zarco-Periñán, "A novel formulation to compute sensitivities to solve congestions and voltage problems in active distribution networks," IEEE Access, vol. 9, pp. 60713-60723, Apr. 2021.
[5] B. Wei, Z. Qiu, and G. Deconinck, "A mean-field voltage control approach for active distribution networks with uncertainties," IEEE Transactions on Smart Grid, vol. 12, no. 2, pp. 1455-1466, Mar. 2021.
[6] Y. Luo, Q. Nie, D. Yang et al., "Robust optimal operation of active distribution network based on minimum confidence interval of distributed energy beta distribution," Journal of Modern Power Systems and Clean Energy, vol. 9, no. 2, pp. 423-430, Mar. 2021.
[7] K. Xie, H. Hui, Y. Ding et al., "Modeling and control of central air conditionings for providing regulation services for power systems," Applied Energy, vol. 315, p. 119035, Jun. 2022.
[8] H. Wei, J. Liang, C. Li et al., "Real-time locally optimal schedule for electric vehicle load via diversity-maximization NSGA-II," Journal of Modern Power Systems and Clean Energy, vol. 9, no. 4, pp. 940-950, Jul. 2021.
[9] E. Hadian, H. Akbari, M. Farzinfar et al., "Optimal allocation of electric vehicle charging stations with adopted smart charging/discharging schedule," IEEE Access, vol. 8, pp. 196908-196919, Oct. 2020.
[10] H.-M. Chung, S. Maharjan, Y. Zhang et al., "Intelligent charging management of electric vehicles considering dynamic user behavior and renewable energy: a stochastic game approach," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 12, pp. 7760-7771, Jul. 2021.
[11] J. Hu, C. Ye, Y. Ding et al., "A distributed MPC to exploit reactive power V2G for real-time voltage regulation in distribution networks," IEEE Transactions on Smart Grid, vol. 13, no. 1, pp. 576-588, Sept. 2022.
[12] S. Deb, A. K. Goswami, P. Harsh et al., "Charging coordination of plug-in electric vehicle for congestion management in distribution system integrated with renewable energy sources," IEEE Transactions on Industry Applications, vol. 56, no. 5, pp. 5452-5462, Sept. 2020.
[13] S. Das, P. Acharjee, and A. Bhattacharya, "Charging scheduling of electric vehicle incorporating grid-to-vehicle and vehicle-to-grid technology considering in smart grid," IEEE Transactions on Industry Applications, vol. 57, no. 2, pp. 1688-1702, Mar. 2021.
[14] L. Yan, X. Chen, J. Zhou et al., "Deep reinforcement learning for continuous electric vehicles charging control with dynamic user behaviors," IEEE Transactions on Smart Grid, vol. 12, no. 6, pp. 5124-5134, Jul. 2021.
[15] F. L. D. Silva, C. E. H. Nishida, D. M. Roijers et al., "Coordination of electric vehicle charging through multiagent reinforcement learning," IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2347-2356, May 2020.
[16] L. Gan, X. Chen, K. Yu et al., "A probabilistic evaluation method of household EVs dispatching potential considering users' multiple travel needs," IEEE Transactions on Industry Applications, vol. 56, no. 5, pp. 5858-5867, Sept. 2020.
[17] E. L. Karfopoulos and N. D. Hatziargyriou, "A multi-agent system for controlled charging of a large population of electric vehicles," IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 1196-1204, May 2013.
[18] A.-M. Koufakis, E. S. Rigas, N. Bassiliades et al., "Offline and online electric vehicle charging scheduling with V2V energy transfer," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 5, pp. 2128-2138, May 2020.
[19] Y. Li, M. Han, Z. Yang et al., "Coordinating flexible demand response and renewable uncertainties for scheduling of community integrated energy systems with an electric vehicle charging station: a bi-level approach," IEEE Transactions on Sustainable Energy, vol. 12, no. 4, pp. 2321-2331, Oct. 2021.
[20] S. Li, W. Hu, D. Cao et al., "Electric vehicle charging management based on deep reinforcement learning," Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 719-730, May 2022.
[21] D. Silver, A. Huang, C. J. Maddison et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
[22] B. Huang and J. Wang, "Deep-reinforcement-learning-based capacity scheduling for PV-battery storage system," IEEE Transactions on Smart Grid, vol. 12, no. 3, pp. 2272-2283, May 2021.
[23] D. Qiu, Y. Ye, D. Papadaskalopoulos et al., "A deep reinforcement learning method for pricing electric vehicles with discrete charging levels," IEEE Transactions on Industry Applications, vol. 56, no. 5, pp. 5901-5912, Sept. 2020.
[24] M. Shin, D.-H. Choi, and J. Kim, "Cooperative management for PV/ESS-enabled electric vehicle charging stations: a multiagent deep reinforcement learning approach," IEEE Transactions on Industrial Informatics, vol. 16, no. 5, pp. 3493-3503, May 2020.
[25] J. Schulman, F. Wolski, P. Dhariwal et al. (2017, Jul.). Proximal policy optimization algorithms. [Online]. Available: https://arxiv.org/abs/1707.06347
[26] S. Yoon and E. Hwang, "Load guided signal-based two-stage charging coordination of plug-in electric vehicles for smart buildings," IEEE Access, vol. 7, pp. 144548-144560, Oct. 2019.
[27] M. E. Baran and F. F. Wu, "Network reconfiguration in distribution systems for loss reduction and load balancing," IEEE Transactions on Power Delivery, vol. 4, no. 2, pp. 1401-1407, Apr. 1989.
[28] S. Tan, J.-X. Xu, and S. K. Panda, "Optimization of distribution network incorporating distributed generators: an integrated approach," IEEE Transactions on Power Systems, vol. 28, no. 3, pp. 2421-2432, Aug. 2013.
[29] C. Zhang, Y. Liu, F. Wu et al., "Effective charging planning based on deep reinforcement learning for electric vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 1, pp. 542-554, Jan. 2021.
[30] S. Han, S. Han, and K. Sezaki, "Development of an optimal vehicle-to-grid aggregator for frequency regulation," IEEE Transactions on Smart Grid, vol. 1, no. 1, pp. 65-72, Jun. 2010.
[31] Z. Zhao and C. K. M. Lee, "Dynamic pricing for EV charging stations: a deep reinforcement learning approach," IEEE Transactions on Transportation Electrification, vol. 8, no. 2, pp. 2456-2468, Jun. 2022.
[32] W. Zhu and A. Rosendo, "A functional clipping approach for policy optimization algorithms," IEEE Access, vol. 9, pp. 96056-96063, Jul. 2021.

Taoyi Qi received the B.S. degree in electrical engineering from Zhejiang University, Hangzhou, China, in 2020, where he is currently pursuing the M.S. degree in electrical engineering. His research interests mainly focus on demand response, flexible loads, and deep reinforcement learning.

Chengjin Ye received the B.E. and Ph.D. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 2010 and 2015, respectively. From 2015 to 2017, he served as a Distribution System Engineer with the Economics Institute, State Grid Zhejiang Electric Power Company Ltd., Hangzhou, China. From 2017 to 2019, he was an Assistant Research Fellow with the College of Electrical Engineering, Zhejiang University. Since 2020, he has been a Tenure-track Professor there. His research interests mainly include resilience enhancement of power grids and integrated energy systems, as well as market mechanism and control strategy towards the integration of demand resources into power system operation.

Yuming Zhao received the B.S. and Ph.D. degrees from the Department of Electrical Engineering, Tsinghua University, Beijing, China, in 2001 and 2006, respectively. He is currently a Senior Engineer (Professor level) with Shenzhen Power Supply Co., Ltd., Shenzhen, China. His main research interest includes the DC distribution power grid.

Lingyang Li is currently pursuing the Ph.D. degree in electrical engineering at Zhejiang University, Hangzhou, China. His research interests include the optimal operation of distribution networks and mobile energy storage system optimization scheduling.

Yi Ding received the bachelor's degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2002, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 2007. He is currently a Professor at the College of Electrical Engineering, Zhejiang University, Hangzhou, China. His current research interests include power systems reliability analysis incorporating renewable energy resources, smart grid performance analysis, and engineering systems reliability modeling and optimization.