
JOURNAL OF MODERN POWER SYSTEMS AND CLEAN ENERGY, VOL. 11, NO. 6, November 2023

Deep Reinforcement Learning Based Charging Scheduling for Household Electric Vehicles in Active Distribution Network

Taoyi Qi, Chengjin Ye, Yuming Zhao, Lingyang Li, and Yi Ding

Abstract—With the booming of electric vehicles (EVs) across the world, their increasing charging demands pose challenges to urban distribution networks. Particularly, due to the further implementation of time-of-use prices, the charging behaviors of household EVs are concentrated in low-cost periods, thus generating new load peaks and affecting the secure operation of the medium- and low-voltage grids. This problem is particularly acute in many old communities with relatively poor electricity infrastructure. In this paper, a novel two-stage charging scheduling scheme based on deep reinforcement learning is proposed to improve the power quality and achieve optimal charging scheduling of household EVs simultaneously in the active distribution network (ADN) during the valley period. In the first stage, the optimal charging profiles of charging stations are determined by solving the optimal power flow with the objective of eliminating peak-valley load differences. In the second stage, an intelligent agent based on the proximal policy optimization algorithm is developed to dispatch the household EVs sequentially within the low-cost period considering their discrete nature of arrival. Through the powerful approximation of the neural network, the challenge of imperfect knowledge is tackled effectively during the charging scheduling process. Finally, numerical results demonstrate that the proposed scheme exhibits great improvement in relieving peak-valley differences as well as improving voltage quality in the ADN.

Index Terms—Household electric vehicles, deep reinforcement learning, proximal policy optimization, charging scheduling, active distribution network, time-of-use prices.

Manuscript received: July 27, 2022; revised: December 28, 2022; accepted: January 20, 2023. Date of CrossCheck: January 20, 2023. Date of online publication: February 24, 2023. This work was supported by the National Key R&D Program of China (No. 2021ZD0112700) and the Key Science and Technology Project of China Southern Power Grid Corporation (No. 090000k52210134). This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). T. Qi, C. Ye (corresponding author), L. Li, and Y. Ding are with the College of Electrical Engineering, Zhejiang University, Hangzhou, China. Y. Zhao is with Shenzhen Power Supply Bureau Co., Ltd., Shenzhen, China. DOI: 10.35833/MPCE.2022.000456

I. INTRODUCTION

In recent years, electric vehicles (EVs) are widely used to reduce air pollution and emissions of greenhouse gases with the appeal for sustainable development goals [1]. According to the estimation of the International Energy Agency, EV sales in China reached 3.3 million in 2021 [2], whose soaring charging demands pose new challenges to the reliable operation of the distribution network, i.e., dramatic peak-valley load differences [3], power congestions [4], and undervoltage [5]. The active distribution network (ADN) enables a mass of flexible loads to deliver various regulation services to solve the above problems, which puts forward efficient and economic solutions for power systems [6]. Conventional flexible loads with promising regulation capacity, such as air conditioners [7], have been fully investigated in demand response for supporting system balancing. However, the limited response capacity and duration time of residential loads restrict their further implementation in long-time-scale dispatch considering users' comfort.

EVs are regarded as ideal alternatives to provide various regulation services in the ADN [8]. The significant regulation potential of EVs has attracted lots of research interest, and their prominent contributions to peak shaving, accommodation of renewable energy resources (RESs), and voltage regulation have been validated in [9], [10], and [11], respectively. Charging scheduling is fundamental for the ADN to utilize the flexibility of EVs due to the elasticity of departure time. A coordinated charging strategy for EVs in the distribution network is presented in [12] to manage power congestion. An intelligent charging scheduling algorithm is proposed in [13] to choose the suitable charging station (CS) and charging period with the goal of minimizing charging costs.

Private charging is currently the dominant charging mode for household EVs in many countries [2]. However, the lower charging power and utilization rate as well as the wide-area distribution make it uneconomic to retrofit household charging piles to achieve the flexible regulation used in commercial direct current (DC) piles, such as the on/off charging control strategy and continuous charging power adjustment mentioned in [14] and [15]. As a result, time-of-use (TOU) prices are implemented for household EVs to transfer charging demands from the peak period to the valley period in some areas, e.g., Zhejiang Province of China. Household EVs possess competitive charging flexibility without sharing charging piles with others [16]. Considering the fixed electricity price during the valley period, it is tricky for TOU prices to dispatch EVs adequately with regard to the realistic operating characteristics of the ADN, except for transferring charging loads from
peak period to valley period. The fixed valley price means that the charging costs are minimum as long as the whole charging process is finished within the valley period. Consequently, owners instinctively decide to start charging at the beginning of the valley period for convenience, which leads to new congestions in the ADN [17]. The intensive charging demands are likely to result in equipment failures and severely threaten the secure operation of the ADN, accounting for the large-scale integration of EVs. Therefore, developing a promising charging scheduling method for household EVs is of great importance to tackle these challenges.

Apart from the above regulation obstacles under TOU prices, previous approaches to solve the charging scheduling problem are not suitable for household EVs with private charging piles, accounting for their sequential arrivals and uncertain charging demands. For instance, offline and online scheduling algorithms are proposed in [18] for EVs to save the charging cost, which is formulated as a mixed integer programming (MIP) problem assuming full knowledge of the charging demands. The bi-level optimal dispatching model proposed in [19] is also solved by converting it into an MIP problem. These studies assume that all charging demands are collected before optimization so as to convert them to solvable MIP problems. However, it is very difficult or even impossible to acquire all charging information in advance. In reality, EVs arrive sequentially and the charging demands can only be obtained precisely after arrival. Under these circumstances, the charging process of the household EV can only be regarded as an uninterruptable process and the charging demands cannot be obtained in advance. The charging scheduling problem of household EVs can be formulated as a Markov decision process (MDP) [20], which aims to dispatch EVs sequentially with finite information and achieve the global optimum for all EVs in the end. Therefore, how to determine the specific charging start time of the EV upon arrival is one of the key priorities. On the basis of the charging reservation function, household EVs can be adopted appropriately in the charging scheduling of the ADN without extra equipment investments.

With the rapid development of deep learning (DL) and reinforcement learning (RL), deep reinforcement learning (DRL), which combines the advantages of DL and RL, is proposed to overcome the curse of dimensionality and solve the MDP problem with continuous action spaces [21]. Based on the powerful function approximation of neural networks and big data technology, DRL has emerged as an interesting alternative to address the sequential charging scheduling problem without full knowledge of charging demands [22]. First of all, the decision for the current EV only depends on the real-time environment states, i.e., arrival time, charging power, charging duration, and departure time, and it is natural for a DRL agent to address such a problem due to the sequential feature. Moreover, through interacting repeatedly with the dynamic environment, the agent can learn from experience and investigate an excellent control policy in the absence of models, which is more applicable in uncertain environments.

In the field of charging scheduling problems, DRL has been implemented in various optimizations. Reference [23] proposes a novel DRL method based on the prioritized deep learning deterministic policy gradient method, so as to solve the bi-level optimization of the examined EV pricing problem. For the EV CS, an energy management based on DRL is proposed in [24] to tackle varying input data and reduce the cumulated operation costs. However, the above literature mostly focuses on minimizing the operation costs of CSs by dispatching EVs, while the contributions to the improvement of power quality in the ADN are not fully accounted for. Under TOU prices, EVs can be further dispatched to relieve the congestion and shorten the peak-valley differences without extra charging costs.

To address the above problems and make full use of substantial household EVs during the valley period under TOU prices, this paper proposes a two-stage charging scheduling scheme for household EVs in the ADN. In the first stage, to relieve the power congestions and shorten the peak-valley differences, the optimal power flow (OPF) of the ADN is solved to determine the optimal charging profiles of CSs during the valley period. In the second stage, DRL based on the proximal policy optimization (PPO) algorithm is employed to dispatch the household EVs sequentially within the low-cost period according to the optimal charging profiles. The PPO algorithm was proposed by OpenAI in 2017 [25], which combines the advantages of trust region policy optimization and advantage actor-critic, to prevent the performance collapse caused by a large update of the policy. Besides, most decisions are finished by the distributed agents in the proposed scheme with lower communication requirements and computational burden, which makes it easily applicable in ADNs with numerous EVs.

The main contributions of this paper are as follows.

1) A two-stage charging scheduling scheme for household EVs is proposed to improve the power quality of the ADN and achieve the optimal charging scheduling of EVs simultaneously during the valley period, which consists of the OPF of the ADN and the charging dispatch of EVs. On this basis, the contributions of household EVs to power congestion management and peak-valley difference elimination are further exploited.

2) The realistic characteristics of household EVs are taken into consideration, including the limited controllability and uncertain charging demands. The charging process of an EV is regarded as an uninterruptable procedure with constant power, and the charging scheduling process is modelled as a sequential MDP problem, thereby the owner can make a charging reservation to achieve charging scheduling without extra equipment investments.

3) An intelligent DRL agent based on the PPO algorithm is developed to schedule the charging process of EVs. Through the remarkable approximation function of the neural network, the agent can accumulate rich experience when interacting with various environments repeatedly to break the limitations of imperfect information. Hence, numerous household EVs are dispatched effectively to formulate the optimal charging profile even when lacking full knowledge of charging demands in advance.

The remainder of this paper is organized as follows.
Section II establishes the two-stage charging scheduling scheme of household EVs. Section III introduces the MDP model of EV charging scheduling and the intelligent DRL agent based on the PPO algorithm. Case studies are conducted in Section IV using real-world data of residential and EV loads, which prove the effectiveness of the proposed scheme. Section V concludes this paper.

II. TWO-STAGE CHARGING SCHEDULING SCHEME OF HOUSEHOLD EVS

In this section, an overview of the charging scheduling scheme is introduced first to illustrate the coordination between the problems in the two stages. Then, the first-stage problem, which considers the mutual impacts of different nodes, is put forward to determine the optimal operation of the ADN with household EVs. On the basis of the optimal charging profiles provided by the first-stage problem, the detailed sequential decision problem in the second stage is formulated to describe the charging scheduling process.

A. Overview of Charging Scheduling Scheme

As important flexible loads of the ADN, household EVs are not fully exploited for further regulation potential under TOU prices. Generally, most charging durations of household EVs are much shorter than their sojourn time [26]. Substantial charging demands of EVs concentrate on the prophase of the valley period lacking effective guidance, which results in extra power congestions and wastes the regulation potential of EVs to a large extent.

At the same time, the ADN is suffering from power quality issues including dramatic peak-valley differences, power congestions, and voltage limit violations. Consequently, the managers of the ADN, i.e., the distribution network operator (DSO) and energy supplier, are motivated to further dispatch household EVs to improve the power quality under TOU prices without extra equipment investments and charging costs, and even to earn profits through delivering ancillary services for power systems. Apart from the DSOs, estates or community administrators are also encouraged to implement such a charging scheduling, so as to satisfy increasing charging demands accounting for the limited carrying capacities of ADNs.

The schematic diagram of the two-stage charging scheduling scheme of household EVs is shown in Fig. 1, assuming that the ADN at the residential side is operated by the DSO and consists of several residential loads and EV charging loads. Considering the relatively centralized installations of private charging piles, e.g., underground parking spaces, nearby charging piles are aggregated as a CS and managed by the aggregator. Assume that only EVs can provide flexible regulation service while other residential loads are regarded as fixed loads. To relieve the power congestion caused by intensive charging demands and transfer them to appropriate time periods, a two-stage problem is formulated.

In the first stage, determining the optimal charging profiles of CSs is the key point. Because of the various operating characteristics of different ADNs, it is of great importance for the DSO to choose favorable optimization objectives at first. In this paper, charging scheduling of household EVs is employed to flatten the tie-line power to provide ancillary services to power systems. Considering the mutual impacts between different nodes, the optimal charging profiles of CSs are not appropriate to be determined simply according to their electricity sectors. Therefore, the OPF algorithm is used to solve the problem with regard to the secure and stable operation. The optimal charging power is calculated with the goal of shortening the peak-valley differences, based on historical and forecasted load data during the valley period.

Fig. 1. Schematic diagram of two-stage charging scheduling scheme of household EVs (first stage: the DSO solves the OPF of the ADN for peak-valley difference elimination; second stage: the aggregator control center based on DRL schedules the submitted charging demands, and the EV owner makes a charging reservation).

In the second stage, to overcome the obstacle of limited charging information, the aggregator control center based on DRL is used to make decisions with imperfect knowledge and dispatch the charging processes of EVs in terms of the determined charging profile. Before the valley period, when the kth EV arrives home, the user needs to plug in the EV and submit the charging demands to the control center, including the charging power, charging duration time, and departure time. Then, according to the optimal charging profile and the previously scheduled EV power, the aggregator will determine the most suitable charging time period for the kth EV immediately. At last, the owner makes a charging reservation to realize the charging scheduling without extra charging costs, and even gets some incentives for the contributions to the operation of the ADN.

B. First-stage Problem: OPF Model of ADN

The first-stage problem aims to involve EVs in shortening peak-valley differences and managing congestions of the ADN. Considering the fact that an ADN typically features a radial topology, as shown in Fig. 2, the complex power flow at each node can be described by the classic DistFlow model [27].

$P_{i+1} = P_i - r_i (P_i^2 + Q_i^2)/V_i^2 - p_{i+1}$   (1)

$Q_{i+1} = Q_i - x_i (P_i^2 + Q_i^2)/V_i^2 - q_{i+1}$   (2)

$V_{i+1}^2 = V_i^2 - 2(r_i P_i + x_i Q_i) + (r_i^2 + x_i^2)(P_i^2 + Q_i^2)/V_i^2$   (3)

$p_i = p_i^D - p_i^g, \quad q_i = q_i^D - q_i^g$   (4)
where $P_i$ and $Q_i$ are the active and reactive power flows from node $i$ to node $i+1$, respectively; $p_i$ and $q_i$ are the active and reactive power demands at node $i$, respectively, which are determined by the load demands (with superscript D) and generator outputs (with superscript g); $V_i$ is the voltage at node $i$; and $r_i$ and $x_i$ are the resistance and reactance of the branch from node $i$ to node $i+1$, respectively.

Fig. 2. Diagram of ADN with radial topology (nodes $0, 1, \dots, i, i+1, \dots, n$ with branch power flows $P_i + jQ_i$ and nodal demands $p_i + jq_i$).

The DistFlow equations above are nonlinear and difficult to solve. Ignoring network losses, the DistFlow equations can be converted to the linearized power flow equations in (5), which have been widely used in distribution network analysis [28].

$P_{i+1} = P_i - p_{i+1}, \quad Q_{i+1} = Q_i - q_{i+1}, \quad V_{i+1} = V_i - (r_i P_i + x_i Q_i)/V_1, \quad p_i = p_i^D - p_i^g, \quad q_i = q_i^D - q_i^g$   (5)
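To make the linearized model in (5) concrete, the minimal sketch below propagates branch flows and voltage drops along a radial feeder. The feeder size, per-unit impedances, and load values are illustrative assumptions rather than data from the case study.

```python
import numpy as np

# Illustrative radial feeder (assumed data): three branches in per unit
r = np.array([0.01, 0.01, 0.02])      # branch resistances r_i
x = np.array([0.02, 0.02, 0.03])      # branch reactances x_i
p = np.array([0.30, 0.20, 0.40])      # net active demands p_i = p_i^D - p_i^g at nodes 1..n
q = np.array([0.10, 0.05, 0.10])      # net reactive demands q_i
V1 = 1.0                              # reference (substation) voltage magnitude

n = len(r)
P = np.zeros(n)                       # P[i]: active flow on the branch from node i to i+1
Q = np.zeros(n)
V = np.zeros(n + 1)
V[0] = V1

# With losses neglected, each branch carries the sum of its downstream demands (eq. (5))
for i in range(n):
    P[i] = p[i:].sum()
    Q[i] = q[i:].sum()

# Linearized voltage drop of eq. (5): V_{i+1} = V_i - (r_i P_i + x_i Q_i) / V_1
for i in range(n):
    V[i + 1] = V[i] - (r[i] * P[i] + x[i] * Q[i]) / V1

print(P, Q, V)
```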
The OPF model proposed in this subsection aims to flatten the tie-line power, as well as maintain the node voltages within the acceptable range. The objective tie-line power should be determined in advance, and the power profiles of all nodes can be calculated using the OPF model with the goal of minimizing the differences between the real tie-line power and the objective power. Considering the limited penetration of EVs in the ADN at present, it is difficult to eliminate the peak-valley differences completely without abundant regulation capacity. Hence, the objective tie-line power, which is related to the residential consumption and the charging electricity, can be computed as:

$P_{obj}(t) = P_u(t) + \dfrac{E_{EV}}{E_{dev}} (P_u^{max} - P_u(t))$   (6)

$E_{dev} = \int_{t_s}^{t_e} (P_u^{max} - P_u(t)) \, dt$   (7)

$E_{EV} = \int_{t_s}^{t_e} P_{EV}(t) \, dt$   (8)

where $P_{obj}(t)$ is the objective tie-line power of the ADN at time $t$; $P_{EV}(t)$ and $P_u(t)$ are the power of EVs and residential loads at time $t$, respectively; $P_u^{max}$ is the maximum power of residential loads during the valley period; $E_{EV}$ is the total electricity consumption of EVs; $E_{dev}$ is the electricity deviation between the residential loads and the maximum residential power; and $t_s$ and $t_e$ are the start time and end time of the valley period, respectively.

Therefore, the objective function of the OPF model can be represented by:

$\min |\Delta P| = \int_{t_s}^{t_e} |P_{sub}(t) - P_{obj}(t)| \, dt$   (9)

where $P_{sub}(t)$ is the real tie-line power at time $t$. The OPF problem aims to optimize the power profiles of all nodes to minimize the differences between $P_{sub}(t)$ and $P_{obj}(t)$.
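The construction of the objective tie-line power in (6)-(8) translates directly into a discrete-time computation. The sketch below assumes the residential load and the total EV charging energy are given on a uniform 1-min grid; the profile shape and numbers are made up for illustration, not taken from the case study.

```python
import numpy as np

dt_h = 1.0 / 60.0                                  # time step in hours (1-min resolution)
n_steps = 10 * 60                                  # 10-hour valley period, e.g., 22:00-08:00
t = np.arange(n_steps) * dt_h

# Illustrative residential load P_u(t) in MW (assumption)
P_u = 1.5 + 0.8 * np.cos(2 * np.pi * (t - 5.0) / 24.0)

# Total EV charging energy E_EV in MWh (assumption)
E_EV = 3.0

# Eq. (7): energy headroom between the residential peak and the residential load
P_u_max = P_u.max()
E_dev = np.sum((P_u_max - P_u) * dt_h)

# Eq. (6): the objective tie-line power fills the valley in proportion to the headroom
P_obj = P_u + (E_EV / E_dev) * (P_u_max - P_u)

# Eq. (9): deviation of a candidate tie-line profile from the objective,
# here with the EV energy spread uniformly over the valley period as an example
P_sub = P_u + E_EV / (n_steps * dt_h)
dev = np.sum(np.abs(P_sub - P_obj) * dt_h)
print(f"E_dev = {E_dev:.2f} MWh, deviation = {dev:.2f} MWh")
```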
Assume $F$ and $R$ represent the set of EV nodes and the set of residential nodes, respectively. Considering the continuous characteristic of the charging process, it is difficult to regulate the charging power of a CS dramatically in a short period, thereby the ramp rate of the CS needs to be limited within $\lambda$.

$|p_i(t+1) - p_i(t)| \le \lambda p_i(t), \quad \forall i \in F$   (10)

In addition, the constraints of the ADN mainly include the nodal voltage and feeder ampacity, as shown in (11) and (12), respectively.

$V_i^{min} \le V_i(t) \le V_i^{max}, \quad \forall i \in F \cup R, \; t \in [t_s, t_e]$   (11)

$P_i^{min} \le P_i(t) \le P_i^{max}, \quad \forall i \in F \cup R, \; t \in [t_s, t_e]$   (12)

where $V_i^{min}$ and $V_i^{max}$ are the minimum and maximum nodal voltages at node $i$, respectively; and $P_i^{min}$ and $P_i^{max}$ are the minimum and maximum ampacities of the branch from node $i$ to node $i+1$, respectively.

C. Second-stage Problem: Scheduling Model of Household EVs

After calculating the OPF of the ADN, the optimal charging profiles of CSs are determined. Then, the agent based on DRL will dispatch EVs to approach the optimal charging power.

The charging process of EVs can be divided into three parts, which are trickle charging, constant current charging, and constant voltage charging, where the constant current charging process accounts for 80% of the duration and has relatively constant power [29]. On the other hand, considering the actual situations where household charging piles are installed, it is difficult for charging piles to achieve continuous power regulation due to the lack of communication conditions. Therefore, the charging process of a household EV is regarded as a continuous process with constant power [30], and the charging demand of the kth EV, $CD_k$, can be represented using a tuple as:

$CD_k = (t_{arr,k}, P_{c,k}, t_{c,k}, t_{dep,k})$   (13)

where $t_{arr,k}$ is the arrival time of the kth EV; $P_{c,k}$ and $t_{c,k}$ are the constant charging power and charging time duration, respectively; and $t_{dep,k}$ is the departure time, which means that the charging process needs to be finished before the departure of the EV to satisfy the owner's traveling energy requirements.

EVs arrive sequentially and the specific charging demands can only be obtained precisely when an EV is plugged in. The aggregator control center aims to transfer the charging demands to formulate a redistribution scheme of charging demands based on the objective charging power. Through the charging scheduling of EVs, not only can the power congestions at the prophase of the valley period be alleviated, but also the ancillary service for shortening the peak-valley differences
can be delivered to power systems.

EVs can be divided into adjustable groups and non-adjustable groups. The non-adjustable EV, whose charging time duration is longer than its sojourn time, will not be regulated. The start charging time of non-adjustable EVs needs to be set as their arrival time to satisfy their charging demands, and there is no need to involve them in the proposed charging scheduling. Therefore, the following charging scheduling focuses on adjustable EVs. When dispatching EVs to formulate the optimal charging profile, the charging demand can be described using a rectangle, as demonstrated in Fig. 3, whose length and height indicate the charging time duration and charging power, respectively. The valley period is from $t_s$ of day $\sigma$ to $t_e$ of day $\sigma+1$. Once the kth EV arrives, the charging demand $CD_k$ is submitted to the control center. Then the control center determines the charging start time of the kth EV according to its charging demands and the optimal charging profile.

Fig. 3. Diagram of charging scheduling process (the kth EV with charging power $P_{c,k}$ and duration $t_{c,k}$ is placed at start time $t_{b,k}$ between its arrival time $t_{arr,k}$ and departure time $t_{dep,k}$, within the valley period $[t_s, t_e]$ spanning day $\sigma$ and day $\sigma+1$).

Denoting $t_{b,k}$ as the optimized charging start time of the kth EV, the real-time charging power of the kth EV during the valley period can be represented as:

$P_{EV,k}(t) = \begin{cases} 0, & t_s \le t < t_{b,k} \\ P_{c,k}, & t_{b,k} \le t < t_{b,k} + t_{c,k} \\ 0, & t_{b,k} + t_{c,k} \le t < t_e \end{cases}$   (14)

$t_{c,k} = \dfrac{C_{EV,k} (SOC_{e,k} - SOC_{s,k})}{\eta P_{c,k}}$   (15)

where $C_{EV,k}$ is the rated battery capacity of the kth EV; $SOC_{e,k}$ and $SOC_{s,k}$ are the expected SOC and the starting SOC of the kth EV, respectively; and $\eta$ is the charging efficiency, regarded as a fixed value.

These adjustable EVs have already decided to charge during the valley period with the lower electricity price even though they have arrived earlier, and most of them instinctively choose to start charging at the beginning of the valley period $t_s$ due to the lack of effective guidance. Therefore, the range of the actionable space is set from $t_s$ to $t_e$, which aims to determine their optimal charging periods. Besides, to satisfy the charging demands and save the charging costs of EVs, $t_{b,k}$ is also constrained as:

$t_s \le t_{b,k} \le \min(t_e, t_{dep,k}) - t_{c,k}$   (16)

After dispatching the kth EV, the control center will update the scheduled charging power of the first $k$ EVs as follows:

$P_{sum,k}^{EV}(t) = \sum_{j=1}^{k} P_{EV,j}(t), \quad t \in [t_s, t_e]$   (17)

where $P_{sum,k}^{EV}(t)$ is the scheduled power of the first $k$ EVs at time $t$.
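As a concrete illustration of (14)-(17), the sketch below builds the rectangular charging profile of one EV on a 1-min grid and accumulates it into the scheduled power of the EVs dispatched so far. The helper names and the example numbers are assumptions made for this sketch.

```python
import numpy as np

STEP_MIN = 1                                    # scheduling resolution in minutes
T = 10 * 60 // STEP_MIN                         # number of steps in a 10-hour valley period

def charging_profile(t_b, P_c, t_c):
    """Eq. (14): rectangular profile of one EV; t_b and t_c in minutes from t_s."""
    profile = np.zeros(T)
    start = t_b // STEP_MIN
    end = min((t_b + t_c) // STEP_MIN, T)
    profile[start:end] = P_c
    return profile

def charging_duration(C_ev, soc_e, soc_s, eta, P_c):
    """Eq. (15): charging duration in minutes of a constant-power process."""
    return int(round(60 * C_ev * (soc_e - soc_s) / (eta * P_c)))

# Example EV (illustrative values): 50 kWh battery, 40% -> 100% SOC, 7 kW pile, 90% efficiency
t_c = charging_duration(C_ev=50, soc_e=1.0, soc_s=0.4, eta=0.9, P_c=7)

# Eq. (16): the latest feasible start keeps the whole process inside [t_s, min(t_e, t_dep)]
t_dep = 10 * 60                                 # departure at t_e in this example
latest_start = min(T * STEP_MIN, t_dep) - t_c

# Eq. (17): accumulate the new rectangle into the scheduled power of the first k EVs
P_sum = np.zeros(T)                             # scheduled power of previously dispatched EVs
P_sum += charging_profile(t_b=180, P_c=7, t_c=t_c)   # start 3 hours after t_s (within the limit)
print(t_c, latest_start, P_sum.max())
```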


III. MDP MODEL OF EV CHARGING SCHEDULING AND INTELLIGENT DRL AGENT BASED ON PPO ALGORITHM

In this section, the MDP model of EV charging scheduling is developed at first. Then, the intelligent DRL agent based on the PPO algorithm is introduced, followed by the training workflow of the PPO algorithm.

A. MDP of EV Charging Scheduling Process

The charging scheduling of household EVs can be modelled as an MDP due to the discrete arrivals of EVs and the randomness of charging demands, which can be appropriately solved using the DRL algorithm. An MDP can be represented as a tuple $(S, A, R, T)$ [31], where $S$ is the state space; $A$ is the action space; $R$ is the reward function; and $T$ is the state transition function, which is determined by (14) and (17). The specific illustrations of the MDP are as follows.

1) The state space observed by the agent is represented as:

$S = (CD_k, P_{dev,k-1}(t))$   (18)

$P_{dev,k-1}(t) = P_{opt,i}(t) - P_{sum,k-1}^{EV}(t)$   (19)

where $CD_k$ is the charging demand of the kth EV, including the arrival time $t_{arr,k}$, the charging power $P_{c,k}$, the charging time duration $t_{c,k}$, and the departure time $t_{dep,k}$; and $P_{dev,k-1}(t)$ is the deviation between the optimal charging power of node $i$, $P_{opt,i}(t)$, and the scheduled charging power $P_{sum,k-1}^{EV}(t)$ of the first $k-1$ EVs. The state space $S$ contains all knowledge of the current environment which can be obtained by the agent when scheduling the kth EV at time $t$. The properties of the MDP decide that the future state only depends on the present state and the action taken by the agent. To be specific, the agent can only determine the charging start time of the current kth EV, and the scheduled charging power only depends on the scheduling result of the kth EV.

2) The action space is represented as $A = (t_{b,k})$ because the actions are taken sequentially, which determines the specific charging profile of the kth EV combined with its charging demands. In other words, the agent needs to make a decision when an EV arrives instead of scheduling all EVs together in the end. Due to the fixed electricity price, the feasible charging start time should be limited within the valley period to prevent extra charging costs. Considering the discrete feature of charging reservations, the action space is set as a discrete space with a 1-min interval.

3) Every action taken by the agent will obtain a reward, which describes the performance of this action and contributes to improving the agent to achieve the maximum cumulative rewards. The reward function is defined as:

$r_k = \rho (Dev_{k-1} - Dev_k)$   (20)

$Dev_k = \int_{t_s}^{t_e} |P_{opt,i}(t) - P_{sum,k}^{EV}(t)| \, dt$   (21)

where $r_k$ is the reward gained by the agent after taking the action $a_k$; $Dev_k$ is the deviation between the optimal charging power and the scheduled charging power of $k$ EVs; $P_{opt,i}(t)$ is the optimal power of node $i$ at time $t$; and $\rho$ is the coefficient of reward, which is used to normalize the reward
between different nodes with various EVs.
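Pulling (13)-(21) together, a gym-style environment sketch of this sequential decision process is given below. The class and method names follow the common Gym-like interface, and the arrival data, node profile, and reward coefficient are placeholders; this is an illustration of the MDP, not the authors' implementation.

```python
import numpy as np

class EVChargingEnv:
    """Sequential EV charging MDP: one step = dispatching one arriving EV (eqs. (13)-(21))."""

    def __init__(self, P_opt, ev_demands, rho):
        self.P_opt = P_opt                  # optimal charging profile of the node, kW per minute
        self.ev_demands = ev_demands        # list of tuples (t_arr, P_c, t_c, t_dep), minutes and kW
        self.rho = rho                      # reward-normalization coefficient of eq. (23)
        self.T = len(P_opt)

    def reset(self):
        self.k = 0
        self.P_sum = np.zeros(self.T)                         # scheduled power, eq. (17)
        self.dev = np.sum(np.abs(self.P_opt - self.P_sum))    # Dev_0 of eq. (21) on the grid
        return self._obs()

    def _obs(self):
        cd = self.ev_demands[self.k]                          # charging demand CD_k, eq. (13)
        return np.concatenate([np.array(cd, float), self.P_opt - self.P_sum])   # eqs. (18)-(19)

    def step(self, t_b):
        t_arr, P_c, t_c, t_dep = self.ev_demands[self.k]
        t_c = int(round(t_c))
        latest = int(min(self.T, t_dep)) - t_c                # feasibility bound of eq. (16)
        t_b = int(np.clip(t_b, 0, max(latest, 0)))
        self.P_sum[t_b:t_b + t_c] += P_c                      # rectangular profile, eqs. (14), (17)
        new_dev = np.sum(np.abs(self.P_opt - self.P_sum))
        reward = self.rho * (self.dev - new_dev)              # eq. (20)
        self.dev = new_dev
        self.k += 1
        done = self.k >= len(self.ev_demands)
        return (None if done else self._obs()), reward, done, {}
```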
Moreover, the reward function can also reveal how much the charging demand is not satisfied or the charging costs have increased. Figure 4 illustrates the specific penalty when charging demands are satisfied or not. If the charging process of the kth EV goes beyond the boundary of the valley period, the remaining power deviation $Dev_k^{(b)}$ will be larger than $Dev_k^{(a)}$ after dispatching the kth EV, and the agent will obtain a smaller reward $r_k$.

Fig. 4. Specific penalty when charging demands are satisfied or not satisfied. (a) Demands are satisfied. (b) Demands are not satisfied.

Theoretically, the total reward with optimal charging scheduling can be represented as:

$\max E\left(\sum_k r_k\right) = \int_{t_s}^{t_e} P_{opt,i}(t) \, dt$   (22)

$\rho = \dfrac{R_{norm}}{\max E\left(\sum_k r_k\right)}$   (23)

where $R_{norm}$ is a constant value for normalizing the reward.

From the perspective of the whole scheduling process, the agent will schedule all EVs to approach the optimal charging profiles so as to maximize the total reward. Nevertheless, considering the indivisibility of EV charging processes, it is tricky to realize the global optimum through the optimal decision of every single step. To be specific, the present decision has durable effects on the later charging scheduling processes, which are difficult to incorporate into the optimization problem and solve using conventional methods. Based on the outstanding approximation ability of the neural network, DRL can take these subsequent effects into consideration. For example, the DRL agent may take an action that cannot gain the maximum reward at present, but it contributes to obtaining more rewards in the future and achieving the maximum total reward.

B. PPO Algorithm

Policy gradient is an essential method for training the DRL agent to maximize the cumulative reward, which works by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm [25]. The most commonly used gradient estimator $\hat{g}$ can be represented as:

$\hat{g} = \hat{E}_t\left(\nabla_\theta \lg \pi_\theta(a_t|s_t) \hat{A}_t\right)$   (24)

where $\pi_\theta$ is the stochastic policy function with parameter $\theta$; $\hat{A}_t$ is the estimator of the advantage function at time $t$; $a_t$ and $s_t$ are the action and state, respectively; and $\hat{E}_t$ is the empirical average with finite samples.

As a result, the loss function is defined as:

$L^{PG}(\theta) = \hat{E}_t\left(\lg \pi_\theta(a_t|s_t) \hat{A}_t\right)$   (25)

However, traditional policy gradient methods have low utilization efficiency of the sampled data and have to spend too much time on sampling new data once the policy is updated. Besides, it is difficult to determine appropriate steps for updating the policy so as to prevent large differences between the new policy and the old policy.

Therefore, the PPO algorithm was proposed in 2017 to address the above shortcomings. The detailed training workflow of the DRL agent with the PPO algorithm is demonstrated in Fig. 5. The PPO algorithm consists of three networks, including two actor networks with the new policy $\pi_\theta$ and the old policy $\pi_{\theta'}$ (parameterized by $\theta$ and $\theta'$, respectively) and a critic network $V_\phi$ (parameterized by $\phi$).

Fig. 5. Training workflow of DRL agent with PPO algorithm (sampling loops store trajectories $(s_t, a_t, r_t, s_{t+1})$ in an experience replay buffer; the importance-sampling ratio $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta'}(a_t|s_t)$ is clipped to $[1-\epsilon, 1+\epsilon]$ in the actor loss, the critic network $V_\phi$ estimates the advantage function, and the parameter update loops minimize the actor loss $L(\theta)$ and the critic regression loss $\arg\min(V_\phi - \hat{R}_t)^2$).

To increase the sample efficiency, $\pi_{\theta'}$ is used to interact with environments and sample $N$ trajectory sets with $T$ timesteps, while $\pi_\theta$ is the actual network that needs to be trained according to the demonstrations of $\pi_{\theta'}$. Utilizing the importance sampling technique, the same trajectory sets can be used multiple times although there are differences
between $\pi_{\theta'}$ and $\pi_\theta$. The probability ratio of the new policy and the old policy can be expressed as:

$r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$   (26)

Another point of the PPO algorithm is that the new policy should avoid a significant deviation from the old policy after every update, so as to maintain the accuracy of importance sampling and avoid accidental performance collapse. Hence, a clipped surrogate function is used to remove the incentive for moving $r_t(\theta)$ outside of the interval $[1-\epsilon, 1+\epsilon]$, so the loss function of the PPO algorithm can be represented as [25]:

$L(\theta) = \hat{E}_t\left(\min\left(r_t(\theta)\hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right)$   (27)

where $\epsilon$ is the clipping parameter, which aims to clip the probability ratio. For instance, the objective will increase if the advantage function $\hat{A}_t$ is positive, but the increase is maintained within $1+\epsilon$ by the limit set by the clipping function.

Consequently, the network parameters $\theta$ of the new policy are updated using:

$\theta = \arg\max_\theta \hat{E}_{s_t, a_t \sim \pi_{\theta'}}\left(L(s_t, a_t, \theta', \theta)\right)$   (28)

Apart from the actor network, a critic network is used to estimate the state value function and the advantage function. The advantage function describes how much an action is better than other actions on average, which is defined as:

$\hat{A}_t = Q_t(s_t, a_t) - V_t(s_t)$   (29)

$Q_t(s_t, a_t) = E\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s, a_t = a\right)$   (30)

$V_t(s_t) = E\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s\right)$   (31)

where $\gamma$ is the discounting factor, which aims to balance the importance between immediate and future rewards; and $V_t$ and $Q_t$ are the value function and the action-value function, respectively. Therefore, $V_t(s_t)$ is the expected value on average at state $s_t$, which covers all optional actions, while $Q_t(s_t, a_t)$ is the expected value at state $s_t$ when taking action $a_t$.

The critic network $V_\phi$ is updated using regression to minimize a mean-squared-error objective [22]:

$\phi = \arg\min_\phi \left(V_\phi(s_t) - \hat{R}_t\right)^2$   (32)

$\hat{R}_t = \sum_{t'=t}^{T} r_{t'}(s_{t'}, a_{t'}, s_{t'+1})$   (33)

where $\hat{R}_t$ is the reward-to-go, which is the sum of rewards after a point in the trajectory.
the EV charging process according to the optimal charging Tie-line
12 11 10
profile, with the goal of maximizing the total expected re‐
wards. The training workflow of PPO algorithm is summa‐ 1 3
8
rized in Algorithm 1. The corresponding parameters are
shown in Table I, where lr is the learning rate; and MB is 6
CS 9
the minibatch size.
The discounting factor γ and the clipping parameter є are Residential load
important hyperparameters that influence the performance 2 5 7
agent observably. The importance of current action depends Fig. 6. ADN based on IEEE 14-node test feeder.
In accordance with the realistic situations in Zhejiang, China, the valley period of the TOU price is set from 22:00 to 08:00 of the next day. Meanwhile, the residential load data during the valley period are obtained from a housing estate in Hangzhou, Zhejiang, as shown in Fig. 7. Most residential loads have similar features, and the summits of the electricity consumption appear around 22:00. Then, the power demands continue to decline and reach the nadir at 03:00 of the next day. Finally, the electricity demands recover gradually as residents wake up. Therefore, there are significant peak-valley differences in residential distribution networks.

Fig. 7. Residential load data profiles during valley period (nodes 2, 3, 4, 6, 7, 8, 10, 11, 12, and 14).

B. OPF Results in First-stage Problem

The original and optimal charging profiles of EVs at different CSs are shown in Fig. 8.

Fig. 8. Original and optimal charging profiles of EVs at different CSs (nodes 5, 9, and 13).

The numbers of household EVs are set to be 179, 225, and 146 at node 5, node 9, and node 13, respectively. It is assumed that the charging power obeys a uniform distribution and the starting SOC obeys a normal distribution [16], whose parameters can be found in Table II. Moreover, the charging efficiency and battery capacity are set to be 90% and 50 kWh, respectively. Detailed charging data of household EVs are obtained from 550 individual meters which are equipped for household EVs. To simplify the charging scheduling, the arrival time is distributed uniformly from 16:00 to 22:00, and the departure time is set consistently at 08:00 of the next day.

TABLE II
PARAMETERS OF EVS AT DIFFERENT NODES

Parameter           Description           Value
$P_{c,k}$ (kW)      Charging power        U(5, 25)
$\eta$ (%)          Charging efficiency   90
$C_{EV,k}$ (kWh)    Battery capacity      50
$SOC_{s,k}$ (%)     Starting SOC          N(50, 20)
$SOC_{e,k}$ (%)     Expected SOC          100
$t_{arr,k}$ (hour)  Arrival time          U(16:00, 22:00)
$t_{dep,k}$ (hour)  Departure time        08:00

Note: a normal distribution with mean value $\mu$ and standard deviation $\sigma$ is abbreviated as N($\mu$, $\sigma^2$); a uniform distribution with minimum and maximum values $a$ and $b$ is abbreviated as U($a$, $b$).
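The distributions in Table II translate directly into a sampling routine for generating synthetic charging demands, with each charging duration derived from (15). In the sketch below, the random seed and array layout are implementation choices, the starting SOC is interpreted as N(50, 20) with a standard deviation of 20, and the clipping of sampled SOC values is an added safeguard rather than part of the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_ev_demands(n_ev):
    """Sample (t_arr, P_c, t_c, t_dep) tuples per Table II; times in minutes after 00:00."""
    P_c = rng.uniform(5.0, 25.0, n_ev)                      # charging power, U(5, 25) kW
    soc_s = np.clip(rng.normal(50.0, 20.0, n_ev), 0, 95)    # starting SOC in %, N(50, 20)
    soc_e = 100.0                                           # expected SOC in %
    C_ev, eta = 50.0, 0.9                                   # battery capacity (kWh), efficiency
    t_c = 60.0 * C_ev * (soc_e - soc_s) / 100.0 / (eta * P_c)   # eq. (15), minutes
    t_arr = rng.uniform(16 * 60, 22 * 60, n_ev)             # arrival, U(16:00, 22:00)
    t_dep = np.full(n_ev, 32 * 60.0)                        # departure at 08:00 of the next day
    return np.column_stack([t_arr, P_c, t_c, t_dep])

demands_node5 = sample_ev_demands(179)    # fleet sizes of the case study: 179, 225, 146
print(demands_node5[:3])
```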
TOU prices make great contributions to transferring charging demands from the peak period to the valley period. However, the charging processes cannot be dispatched effectively because TOU prices are unable to describe the various demands of the ADN precisely during different time periods. Therefore, EV owners instinctively decide to start charging at the beginning of the valley period. As shown in Fig. 8, most charging processes start at 22:00, but the charging durations are much shorter than the valley period. The charging demands overlap with the residential peak, resulting in new power congestions at the beginning of the valley period, which threatens the secure and stable operation of the ADN.

To alleviate the power congestions and schedule the charging demands according to distribution network operations, the DSO needs to determine the optimal charging profiles of EV CSs by solving the OPF. Utilizing the DistFlow model introduced in Section II, the OPF of the ADN is calculated with the goal of flattening the tie-line power, and the optimal charging profiles are shown in Fig. 8.

It can be observed that the main charging demands are transferred to 01:00-05:00, during which other electricity consumptions are the lowest. Moreover, the regulation targets are not allocated simply according to the total electricity demands of CSs; nodal voltages and impacts from other nodes are also taken into account to realize the multidimensional optimum. Therefore, the CSs are coordinated and the optimal charging profiles at different nodes are various, as shown in Fig. 8. For example, the charging summit of node 9 appears at 01:00 while that of node 13 appears at 03:00.

C. Charging Scheduling Results in Second-stage Problem

On the basis of the optimal charging profiles calculated in the first stage, the DRL agent needs to schedule the charging processes of EVs sequentially to approach the optimal profiles. The charging scheduling results of household EVs at different nodes are shown in Fig. 9, where the deviation represents its absolute value. During the scheduling process, the agent makes decisions based on probability, which is calculated through massive pieces of training. All feasible actions are possible to be taken by the agent, although the
probability of making a bad decision is very low. Therefore, it is inevitable for the agent to take bad actions that will cause deviations in a series of decision processes. It can be observed that the real power profiles are very close to the optimal power profiles except for some points, which proves the effectiveness of the DRL agent with the PPO algorithm on charging scheduling.

Fig. 9. Charging scheduling results of household EVs at different nodes (optimal power, real power, and deviation). (a) Node 5. (b) Node 9. (c) Node 13.

As shown in Table III, the average deviations of node 5, node 9, and node 13 are 0.101 MW, 0.067 MW, and 0.044 MW, respectively, which are restricted at a relatively low level. Besides, it should be noted that significant deviations are likely to appear at the turning points of the optimal charging profile, e.g., the deviation of node 5 at 04:30 reaches 0.246 MW.

TABLE III
PERFORMANCE OF CHARGING SCHEDULING

Node No.   Average deviation (MW)   Maximum deviation (MW)   Total reward
5          0.101                    0.246                    939.5
9          0.067                    0.277                    923.1
13         0.044                    0.127                    946.6
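The indices reported in Table III follow directly from the scheduled and optimal profiles; a small sketch of their computation is given below, under the assumption that the two profiles are sampled on the same time grid.

```python
import numpy as np

def scheduling_metrics(P_opt, P_sched, rho):
    """Average/maximum deviation and total reward of one CS over the valley period."""
    dev = np.abs(P_opt - P_sched)            # pointwise deviation |P_opt - P_sum|
    avg_dev = dev.mean()                     # average deviation (MW)
    max_dev = dev.max()                      # maximum deviation (MW)
    # The total reward of eqs. (20)-(21) telescopes to rho * (Dev_0 - Dev_final),
    # where Dev_0 is the deviation before any EV is scheduled (P_sched = 0)
    total_reward = rho * (np.sum(np.abs(P_opt)) - np.sum(dev))
    return avg_dev, max_dev, total_reward
```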
During the whole charging scheduling process, the DRL agent makes efforts to maximize the reward and obtains a total reward of 939.5, 923.1, and 946.6 for node 5, node 9, and node 13 in the end, respectively. Similar to the indices of average deviation and maximum deviation, the total reward indicates that the agent performs better with a smoother objective charging profile.

Moreover, the average SOC and median SOC at specific hours are further analyzed, as shown in Fig. 10. The median SOC reflects the charging completion result of every EV. It can be observed that more than 50% of EVs have finished their charging before 03:00 in the original charging scheduling, even though only half of the valley period has passed. The results also indicate that there is a significant regulation potential to be exploited for household EVs. The average SOC represents the overall charging progress of all EVs. It can be observed that the charging speed of the original charging is much faster than that of the proposed charging scheduling in the first half of the valley period, when the electricity demand is decreasing towards the nadir. Hence, the original charging scheduling cannot match the regulation demand of the ADN. On the contrary, the proposed charging scheduling takes full advantage of the shiftability of charging demands to reshape the charging curve with the goal of eliminating peak-valley differences in the ADN.

Fig. 10. Average SOC and median SOC during valley period (original charging vs. optimal charging).

D. Performance of PPO Algorithm

To verify the advantages of the PPO algorithm, the advantage actor critic (A2C) and deep Q-network (DQN) algorithms are implemented as the benchmarks. All training timesteps are set to be 512200 to analyze the total reward and the convergence speed. The cumulative reward is regarded as an index to evaluate the performance of the agents trained by different algorithms. Figure 11 illustrates the reward evolution curves of the PPO, A2C, and DQN algorithms during the training process.

The reward at the start point of the reward curve is regarded as the performance of the random policy, which is around 600. The PPO algorithm achieves the highest reward of about 937. The PPO algorithm reaches a relatively stable state after 50 episodes (102400 timesteps). In the following 250 episodes, the PPO algorithm keeps exploring the optimal
strategy and stabilizes its policy networks. Finally, the agent comprehensively reaches convergence with lower reward variances.

The A2C algorithm has a sharp increase at the beginning of the training process, which appears much faster than that of the PPO algorithm. The results prove that the clipped function of the PPO algorithm has worked and limited the drastic change of the policy network, so as to effectively avoid performance collapse and local optima. As illustrated in Fig. 11, the A2C algorithm experiences an oscillation period after the aggressive policy update and performance improvement, then it converges to a total reward around 908.

Fig. 11. Reward evolution curves of PPO, A2C, and DQN algorithms.

The DQN algorithm converges to a total reward around 825 and the PPO algorithm outperforms it by more than 13%, which proves the advantages of actor-critic networks. Besides, DQN spends much time collecting abundant samples and filling up its replay buffer, so there is no improvement at the beginning of the training process.

It takes a total time of 782.75 s, 586.83 s, and 542.21 s for the PPO, A2C, and DQN algorithms in the entire training process, respectively. Besides, the decision time of the proposed DRL agent with the PPO algorithm is also tested, and the results indicate that the average decision time per EV is about 2.5 ms. These tests have been carried out using Python 3.7 on an Intel(R) i7 12700kf, 32 GB RAM desktop.

In terms of the test results, the PPO algorithm outperforms the A2C algorithm, DQN algorithm, and random policy, although the PPO algorithm has the lowest training speed with the same timesteps. To be specific, the PPO algorithm can obtain a total reward of 937 when scheduling EV charging processes, which is 29, 112, and 337 more than that of the A2C algorithm, DQN algorithm, and random policy, respectively.

Then, the loss function performance of the PPO algorithm is presented, as shown in Fig. 12. The value loss and loss represent the performance of the PPO algorithm on the training sets and test sets, respectively. It can be observed that the value loss and loss share similar trends during the training process, which demonstrates the remarkable adaptability on various data sets.

Hence, the PPO algorithm is suited for addressing the charging scheduling problem and can be adopted to handle the uncertainty of the environment.

Fig. 12. Loss function performance of PPO algorithm during training process. (a) Value loss. (b) Loss.

E. Improvement for ADN

The original and optimized tie-line power profiles are demonstrated in Fig. 13. At first, the power congestions caused by intensive charging demands at the beginning of the valley period are eliminated effectively. Moreover, these charging demands are allocated to smooth the tie-line power. Thus, the regulation potential of household EVs is further exploited by the ADN without extra costs, which benefits both the power system and EV owners. It can be observed that the peak-valley difference of the ADN is dramatically reduced from 6.61 MW to 2.76 MW, and the curtailment of peak load will save remarkable investments in electric power facilities. With further integration of household EVs in the ADN, the proposed scheme can also be used to formulate a smooth tie-line power during the peak period under TOU prices.

Fig. 13. Comparison between original and optimized tie-line power profiles.

Different from the transmission network, the distribution network possesses much higher resistance, and the active power has more significant effects on voltages. As a result, apart from the great contributions to the elimination of peak-valley differences, the voltage quality of the ADN is also improved through scheduling household EVs during the valley period. As shown in Fig. 14, due to the overlap of the peak of residential electricity consumption and the intensive charging demands, some nodal voltages are extremely low at the
beginning of the valley period, especially at node 9, whose nadir reaches 0.969 p.u.. In the near future, with more penetration of household EVs into the ADN, the voltage limit violation problem will become more serious if no strategies are taken.

Fig. 14. Original nodal voltages.

Utilizing the proposed two-stage charging scheduling scheme, the voltage violation problem is addressed effectively, as shown in Fig. 15.

Fig. 15. Nodal voltages after charging scheduling.

Besides, the oscillations of nodal voltages are also limited with the smoother tie-line power, which is beneficial for reducing the operating times of voltage regulation equipment such as on-load tap changers. For example, the voltage variation of node 4 decreases from 0.0051 p.u. to 0.0028 p.u. during the whole valley period. Meanwhile, the contributions to voltage regulation are not restricted to the nodes of CSs. As shown in Table IV, all nodal voltages have different degrees of improvement.

Because the OPF is involved in the first-stage optimization problem, all nodal voltages are taken into consideration when determining the optimal charging profiles of CSs. Specifically, the voltage nadir of node 8 has a 0.39% improvement. For the old communities with relatively poor electricity infrastructure, the proposed scheme can also satisfy residential power consumption and charging demands simultaneously with limited carrying capacity.

Therefore, under the existing TOU price circumstances, the proposed two-stage charging scheduling scheme can make full use of the regulation potential of household EVs during valley periods to improve the power quality of the ADN without extra equipment investments and charging costs, including peak-valley difference elimination, congestion management, and nodal voltage regulation.

TABLE IV
NODAL VOLTAGE IMPROVEMENT IN ADN

Node No.   Original voltage nadir (p.u.)   Optimized voltage nadir (p.u.)   Voltage improvement (%)
2          0.982                           0.984                            0.17
3          0.979                           0.981                            0.19
4          0.981                           0.983                            0.14
5          0.973                           0.975                            0.26
6          0.975                           0.976                            0.11
7          0.971                           0.974                            0.28
8          0.972                           0.976                            0.39
9          0.969                           0.973                            0.35
10         0.974                           0.976                            0.21
11         0.976                           0.978                            0.23
12         0.976                           0.978                            0.23
13         0.974                           0.977                            0.29
14         0.973                           0.975                            0.28

V. CONCLUSION

In the context of taking full advantage of the regulation potential of household EVs under TOU prices, this paper proposes a two-stage charging scheduling scheme to dispatch household EVs. The first-stage problem aims to involve the charging scheduling of household EVs in the operation and optimization of the ADN, and the optimal charging power profiles of CSs are determined by calculating the OPF so as to relieve the power congestions and shorten the peak-valley differences. Furthermore, a PPO-based DRL agent is developed to dispatch the charging processes of EVs in terms of the optimal charging power. Case studies with realistic data are conducted to illustrate the multidimensional performance of the proposed scheme. It is demonstrated that the PPO-based DRL agent can be adopted in different CSs with various objective charging profiles and EV amounts. Besides, the charging scheduling of EVs contributes to significant improvement in power quality, including decreasing the peak-valley differences and stabilizing the nodal voltages.

Moreover, the proposed scheme can be adopted properly in substantial distributed communities with the combination of edge computing technology. On this basis, numerous flexible loads, e.g., thermostatic loads, energy storage, and RESs, can be involved in the proposed scheme to be managed efficiently, so as to activate their flexibility and enhance the regulation capacity of ADNs in the near future.

REFERENCES

[1] T. Chen, X.-P. Zhang, J. Wang et al., "A review on electric vehicle charging infrastructure development in the UK," Journal of Modern Power Systems and Clean Energy, vol. 8, no. 2, pp. 193-205, Mar. 2020.
[2] IEA. (2022, May). Global EV outlook 2022. [Online]. Available: https://www.iea.org/reports/global-ev-outlook-2022
[3] H. Liu, P. Zeng, J. Guo et al., "An optimization strategy of controlled electric vehicle charging considering demand side response and regional wind and photovoltaic," Journal of Modern Power Systems and Clean Energy, vol. 3, no. 2, pp. 232-239, Jun. 2015.
[4] Fco. J. Zarco-Soto, J. L. Martínez-Ramos, and P. J. Zarco-Periñán, "A novel formulation to compute sensitivities to solve congestions and voltage problems in active distribution networks," IEEE Access, vol. 9, pp. 60713-60723, Apr. 2021.
Taoyi Qi received the B.S. degree in electrical engineering from Zhejiang University, Hangzhou, China, in 2020, where he is currently pursuing the M.S. degree in electrical engineering. His research interests mainly focus on demand response, flexible loads, and deep reinforcement learning.

Chengjin Ye received the B.E. and Ph.D. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 2010 and 2015, respectively. From 2015 to 2017, he served as a Distribution System Engineer with the Economics Institute, State Grid Zhejiang Electric Power Company Ltd., Hangzhou, China. From 2017 to 2019, he was an Assistant Research Fellow with the College of Electrical Engineering, Zhejiang University. Since 2020, he has been a Tenure-track Professor there. His research interests mainly include resilience enhancement of power grids and integrated energy systems, as well as market mechanisms and control strategies towards the integration of demand resources into power system operation.

Yuming Zhao received the B.S. and Ph.D. degrees from the Department of Electrical Engineering, Tsinghua University, Beijing, China, in 2001 and 2006, respectively. He is currently a Senior Engineer (Professor level) with Shenzhen Power Supply Co., Ltd., Shenzhen, China. His main research interest is the DC distribution power grid.

Lingyang Li is currently pursuing the Ph.D. degree in electrical engineering at Zhejiang University, Hangzhou, China. His research interests include the optimal operation of distribution networks and the optimal scheduling of mobile energy storage systems.

Yi Ding received the bachelor's degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2002, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 2007. He is currently a Professor at the College of Electrical Engineering, Zhejiang University, Hangzhou, China. His current research interests include power system reliability analysis incorporating renewable energy resources, smart grid performance analysis, and engineering systems reliability modeling and optimization.
