A Deep Reinforcement Learning Based Charging and Discharging Scheduling Strategy For Electric Vehicles
Energy Reports
journal homepage: www.elsevier.com/locate/egyr
Research paper
A R T I C L E  I N F O

Keywords: Electric vehicles; Markov decision process; Deep reinforcement learning; Soft actor-critic; Charging and discharging scheduling

A B S T R A C T

Grid security is threatened by the uncontrolled access of large-scale electric vehicles (EVs) to the grid. The scheduling problem of EVs charging and discharging is described as a Markov decision process (MDP) to develop an efficient charging and discharging scheduling strategy. Furthermore, a deep reinforcement learning (DRL)-based model-free method is suggested to address this issue. The proposed method aims to enhance EVs charging and discharging profits while reducing drivers' electricity anxiety and ensuring grid security. Drivers' electricity anxiety is described by fuzzy mathematical theory, and the effects of the EV's current power and remaining charging time on it are analyzed. Variable electricity pricing is calculated from the real-time residential load. A dynamic charging environment is constructed considering the stochasticity of electricity prices, driver's behavior, and residential load. A soft actor-critic (SAC) framework is used to train the agent, which learns the optimal charging and discharging scheduling strategies by interacting with the dynamic charging environment. Finally, simulation with actual data is used to verify that the suggested approach can reduce drivers' charging costs and electricity anxiety while avoiding transformer overload.
* Corresponding author.
E-mail address: [email protected] (Q. Xiao).
https://doi.org/10.1016/j.egyr.2024.10.056
Received 13 November 2023; Received in revised form 10 September 2024; Accepted 28 October 2024; Available online 6 November 2024
2352-4847/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

EVs have been heavily pushed as a greener substitute for conventional fossil fuel cars due to their advantages in reducing emissions and energy consumption (Chen et al., 2021). National policies are driving a significant increase in the quantity of EVs and charging stations. However, because of the highly stochastic nature of EV behavior, improper charging and discharging of large-scale EVs may cause distribution transformer overload as well as an increase in the peak-to-valley difference of the grid load (Muratori, 2018). Making an effective scheduling strategy for EVs charging and discharging is essential to reducing this detrimental effect on the distribution network.

Model-based approaches have been employed in the majority of earlier initiatives to address this problem. The scheduling of EVs charging is framed as a binary optimization issue (Sun et al., 2018; Yao et al., 2017), which is analogous to linear programming. An enhanced binary grey wolf optimizer is used to address the EVs charging optimization problem, which is framed as a cost minimization problem (Jiang and Zhen, 2019). To solve the issues with EVs charging in the workplace, a two-stage energy management strategy is suggested (Wu et al., 2017). An information gap decision theory-based strategy is presented for coordinating the day-ahead charging and discharging scheduling strategies of EV aggregators, taking into account the risks associated with the unpredictability of electricity prices (Zhao et al., 2017). To coordinate EVs without jeopardizing the reliability of the distribution network, a decentralized charging management system based on game theory is suggested (Li et al., 2018). A decentralized charging strategy for EVs based on the mean-field game is developed by Tajeddini and Kebriaei (2019). In Alsabbagh et al. (2021), the authors present a charging control approach for EVs that takes into account driver behavior variations and time anxiety by framing the EV charging control as a Nash equilibrium game. A model predictive control-based strategy is created in (Ghotge et al., 2020; Shi et al., 2019; Zheng et al., 2019) to deal with the scheduling of EV charging. For coordinated EV charging, a hierarchical coordination system with grid-supporting reactive power is suggested (Wang et al., 2019).
The alternating direction method of multipliers can be used to address the EV charging problem, which is characterized as an efficient scheduling issue subject to the feeder capacity restrictions of the distribution network (Zhou et al., 2021). The assignment problem for EV charging or discharging is framed as a multi-service queuing model with a cut-off priority rule (Said and Mouftah, 2020). A real-time multi-objective optimization approach is used to schedule EVs charging and discharging with the aim of easing the pressure on the power grid while meeting the charging needs of EV owners (Das et al., 2021). In Li and Hu (2021), EVs charging and discharging scheduling is modeled as a mixed-integer programming problem and solved by an improved krill swarm algorithm, which can effectively smooth the load fluctuations. In Saber et al. (2024), the authors propose a two-phase coordination algorithm for charging stations, where the first phase determines the charging station day-ahead scheduling plan and the second phase determines the energy management of the EVs. The impact of unexpected departure of EV owners on the energy scheduling of charging stations is considered, and a communication-free policy multiplier algorithm based on the alternating direction method is proposed to coordinate EV charging (Shi et al., 2024). However, the unpredictability of real-time power pricing and driver behaviors makes these previously described approaches difficult to solve. They also depend on an accurate dynamic system model that is difficult to obtain.

Recently, model-free methods have been widely used to develop EVs charging and discharging scheduling strategies; such methods can learn optimal policies in complex uncertain environments based on DRL and do not need any detailed information about the system model. A machine learning-based smart charging approach is proposed for making charging plans that reduce the total energy cost of EVs (Lopez et al., 2019). The MDP formulation of the EV charging scheduling issue and the building of a Q network for approximating the most suitable action-value function are presented in Wan et al. (2019). A second-order cone programming-based optimal EV charging framework is proposed in Ding et al. (2020), and a deep deterministic policy gradient (DDPG) method is used to address it. To determine the best EV charging plan, the fitted Q-iteration with an approximate function is used in Sadeghianpourhamami et al. (2020). In Jin and Xu (2021), the authors propose a combined optimal control policy characterization and model-free SAC framework for EV charging scheduling in the distribution network. A feature-based linear function is suggested in Wang et al. (2021) as a way of approximating the state-value equation. To find the best EVs charging scheduling plans, a charging control deep deterministic policy gradient technique is presented in Zhang et al. (2021). An attention-based federated DRL method may be used to address the EVs charging issue, which is defined as a decentralized partially observable MDP (Chu et al., 2022a). In Jiang et al. (2022), the authors propose a deep policy gradient method for coordinated EV charging. To facilitate the computing of the EVs scheduling process, a communication neural network framework has been suggested (Zhang et al., 2022). A decentralized framework is presented based on an efficient federated DRL approach for plug-in EV fleet charging scheduling in a distributed way (Chu et al., 2022b). In Li et al. (2022), the MDP is solved by the DDPG method, which has continuous action spaces. A multi-agent DRL is applied to schedule EV charging for urban consumer communities to maximize social welfare and EV charging success in Zou et al. (2022). In Lee and Choi (2024), the authors propose a three-stage privacy- and security-aware DRL framework to coordinate smart charging stations while protecting data privacy. In Liu et al. (2024), the multi-agent DRL is used to train the agents and a fully distributed scheduling strategy is proposed for mobile charging stations to charge underpowered EVs, which can increase the profit of mobile charging stations and the proportion of EVs successfully charged.

The aforementioned strategies mainly focus on the construction of Q-networks with the most appropriate action-value function, the protection of EV owners' data privacy, the characterization of uncertainty in owners' charging demand and electricity price, and the efficiency improvement of charging and discharging scheduling algorithms. But the characterization of drivers is not detailed enough: most drivers are only portrayed by arrival time, departure time and charging power, with little consideration of the driver's electricity satisfaction degree and electricity anxiety. Furthermore, the problem of transformer overload caused by the superposition of EVs charging demand and residential load is given less consideration. We propose a DRL-based EV charging and discharging scheduling approach, which further considers electricity anxiety to characterize the driver and the transformer overload problem during EV charging. The following are the key contributions of this paper.

1) The MDP is used to define the EV charging scheduling issue with the driver's unpredictable actions. Stochasticity of real-time electricity prices, residential load and driver behavior is considered in a dynamic charging environment for EVs. The reward takes into account EVs charging and discharging profits, the driver's electricity anxiety and transformer overload.
2) The electricity satisfaction degree is applied to quantify the driver's anxiety about electricity, and the effects of current power and remaining charging time on the driver's electricity anxiety are analyzed in depth. A load-based real-time electricity price model is constructed, which adjusts electricity prices in real time to induce drivers to participate in charging and discharging scheduling.
3) A SAC framework is utilized to design an EV charging and discharging scheduling method based on DRL. The approach employs the value network and the policy network to achieve good performance. The proposed strategy can improve driver satisfaction and avoid transformer overloading.

The remainder of the paper is arranged as follows. The MDP description of the scheduling issue for EV charging and discharging is presented in Section 2. Then Section 3 offers the solution to this issue based on DRL. Section 4 shows simulation results to illustrate the efficacy of the suggested approach. Section 5 concludes and provides a summary of the whole paper.

2. Problem formulation

The charging and discharging scheduling problem for EV users is a sequential decision problem. This kind of problem can be expressed as an MDP model. Specifically, the tuple {S, A, P, R, γ} is used to represent the MDP model. Among these components, S denotes the state of the environment, A represents the action, P is the probability of a state transition, R is the reward, and γ is the discount factor that weights future rewards against the present reward. Fig. 1 displays the MDP model for the scheduling problem of EV charging and discharging. The following is a detailed description of each component.

Fig. 1. The MDP model for the scheduling problem of EV charging and discharging.

State: The agent makes strategies for EV charging and discharging scheduling according to the state. The state st consists of residential load, EV information and historical electricity prices. The state st can be described as

$s_t = \left(\lambda_{day,t},\ \lambda_{day-1,t+1},\ \ldots,\ \lambda_{day-7,t+1},\ p_{day,t},\ p_{day-1,t+1},\ \ldots,\ p_{day-7,t+1},\ b_{i,t},\ t_{lea},\ soc_t\right)$  (1)

where λday,t is the electricity price at timeslot t; (λday-1,t+1, ..., λday-7,t+1) are the electricity prices at timeslot t+1 on each day of the past week; pday,t is the residential load at timeslot t; (pday-1,t+1, ..., pday-7,t+1) are the residential loads at timeslot t+1 on each day of the past week; if bi,t = 1, the EV is at home, and if bi,t = 0, the EV is not at home; tlea is the remaining dwell time; soct is the state of charge (SOC) of the EV battery at timeslot t.
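To make the layout of the state concrete, the minimal sketch below assembles the 19-entry vector of Eq. (1). The function name and argument layout are illustrative and not from the paper; only the ordering of the features follows Eq. (1).

```python
import numpy as np

def build_state(price_now, price_week, load_now, load_week, at_home, t_lea, soc):
    """Assemble the state vector of Eq. (1); names and layout are illustrative.

    price_week / load_week : price / residential load at timeslot t+1 on each of the past 7 days
    at_home                : b_i,t, 1 if the EV is at home, otherwise 0
    t_lea, soc             : remaining dwell time (hours) and battery state of charge
    """
    return np.concatenate([[price_now], price_week,
                           [load_now], load_week,
                           [at_home, t_lea, soc]]).astype(np.float32)

s_t = build_state(0.8, np.full(7, 0.6), 4.1, np.full(7, 3.9), 1, 5.0, 0.62)
print(s_t.shape)   # (19,) -- one entry per term in Eq. (1)
```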
Considering that the electricity price correlates with the load, the real-time load-based electricity price model can be constructed as

$C_t = \alpha C_t^{0} + \left(1 - \alpha\right) C_t^{f}$  (2)

where Ct denotes the real-time electricity price, Ct0 is the base electricity price, Ctf is the floating electricity price, and α is the ratio of the base price to the real-time price.

Considering that the amount of transferable load varies from one time period to another, the variance of the same period on different dates is used to measure the amount of transferable load for that period as

$s^{2} = \frac{\sum_{i=1}^{n}\left(p_{i,t} - p_{i,ave}\right)^{2}}{n}$  (3)

where s² is the load variance, pi,t denotes the load at timeslot t on day i, pi,ave is the average load on day i, and n is the number of days included in the statistics.

To capture sudden load changes in time, the load time difference is used to express the trend and amount of change of the load as

$\Delta p_t = p_t - p_{t-1}$  (4)

where Δpt is the load time difference at timeslot t, pt is the load at timeslot t, and pt-1 is the load at timeslot t-1.

To improve the comparability of the actual load, the variance and the load time difference, all three quantities are standardized to eliminate magnitudes using the z-score as

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^{2}}, \qquad z(x) = \frac{x - \mu}{\sigma}$  (5)

where xi is the i-th data point; σ is the standard deviation of the data x; μ is the mean value of x; N is the total number of data points in x; z(·) denotes the z-score normalization of the data.

The above processed data of actual load, variance and load time difference are accumulated. Then 0-1 normalization is used to limit the data to between 0 and 1, which gives the floating electricity price as

$C_t^{f0} = z\left(p_t\right) + z\left(s^{2}\right) + z\left(\Delta p_t\right), \qquad C_t^{f} = \frac{C_t^{f0} - \min\left(C_t^{f0}\right)}{\max\left(C_t^{f0}\right) - \min\left(C_t^{f0}\right)}$  (6)

Action: The action at denotes the charging and discharging power controlled by the agent according to the state st. When the EV is charging, at is positive; when the EV is discharging, at is negative. The action at is defined as

$a_t = p_{t+1}, \qquad -p_{max} \le p \le p_{max}$  (7)

where pt+1 is the charging and discharging power of the home charging pile at timeslot t+1. The charging and discharging power p is a continuous variable whose upper and lower bounds are pmax and -pmax, respectively.

State transition: The probability pt that the state changes from st to st+1 is given by

$p_t = p\left(s_{t+1}\,|\,s_t, a_t\right) = P\left(S = s_{t+1}\,|\,S = s_t, A = a_t\right)$  (8)

where the input of pt is the state st and action at at timeslot t, and the output is the probability of reaching state st+1.

Reward: The agent gets the reward rt when the system moves from state st to the next state st+1. The reward consists of the EV profits for charging and discharging, the penalty for transformer overload and the driver's electricity anxiety. The driver's electricity anxiety indicates the driver's worry about the vehicle's electricity level.

The EV profit of charging and discharging rp,t at timeslot t is calculated as

$r_{p,t} = -p_t \cdot \lambda_t, \qquad t_a \le t < t_d$  (9)

where ta is the EV arrival time, td is the EV departure time, and pt is the charging and discharging power at timeslot t. In this reward, pt·λt represents the charging cost of the EV at timeslot t. When the EV is charging, pt·λt is positive; when the EV is discharging to the grid, pt·λt is negative.

A driver is more anxious when the EV battery is low and the departure time is close. In this paper, fuzzy mathematical theory is used to describe the driver's electricity anxiety ra,t, which can be calculated as

$U_f = \frac{\eta_{ch}\, p_m\, t_{char}}{\left(1 - soc\right) c_b}$  (10)

$r_{a,t} = F_{cd}\left(U_f\right) = \begin{cases} 0, & U_f \ge 1 \\ 0.5 + 0.5\cos\left(\pi U_f\right), & U_f < 1 \end{cases}$  (11)

where Uf is the driver's electricity satisfaction degree; ηch is the charging efficiency; pm is the EV slow charging power; tchar is the remaining charging time; soc is the current SOC of the EV battery; cb is the EV battery capacity; Fcd(Uf) is the membership function of Uf, which represents the subjective value of the driver's electricity anxiety in the range [0, 1].

When Uf ≥ 1, the EV can be fully charged within the remaining time and the driver's electricity anxiety is 0. When Uf < 1, the EV cannot be fully charged within the remaining time, and the driver's electricity anxiety decreases as the electricity satisfaction Uf increases. Assuming that the remaining charging time is from one to six hours and the current SOC is from 60 % to 95 %, the distribution of the driver's electricity anxiety is shown in Fig. 2.

Fig. 2. Distribution of driver's electricity anxiety with different SOC and remaining charging times.
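The two signal models defined above can be collected into a short Python sketch: the load-based real-time price of Eqs. (2)-(6) and the fuzzy electricity anxiety of Eqs. (10)-(11). The array layout, the battery capacity and the charging efficiency are illustrative assumptions; α = 0.5 and the 7 kW pile power follow the simulation setup used later in Section 4.1 and Table 1.

```python
import numpy as np

def real_time_price(load_days, base_price, alpha=0.5):
    """Load-based real-time price, following Eqs. (2)-(6).

    load_days  : (n_days, n_slots) array of historical residential load; the last
                 row is assumed to be the day being priced (illustrative layout).
    base_price : (n_slots,) array with the time-of-use base price C_t^0.
    alpha      : ratio of the base price to the real-time price (0.5 in Section 4.1).
    """
    p_t = load_days[-1]
    day_avg = load_days.mean(axis=1, keepdims=True)          # p_i,ave in Eq. (3)
    s2 = ((load_days - day_avg) ** 2).mean(axis=0)           # Eq. (3): per-slot variance across days
    dp = np.diff(p_t, prepend=p_t[0])                        # Eq. (4): load time difference
    z = lambda x: (x - x.mean()) / x.std()                   # Eq. (5): z-score normalization
    f0 = z(p_t) + z(s2) + z(dp)                              # Eq. (6): accumulated features
    c_f = (f0 - f0.min()) / (f0.max() - f0.min())            # 0-1 normalized floating price
    return alpha * base_price + (1 - alpha) * c_f            # Eq. (2)

def electricity_anxiety(soc, t_char, p_m=7.0, c_b=60.0, eta_ch=0.95):
    """Fuzzy electricity anxiety, following Eqs. (10)-(11).

    soc, t_char : current SOC and remaining charging time in hours
    p_m         : slow charging power (7 kW pile limit of Table 1)
    c_b, eta_ch : battery capacity and charging efficiency (illustrative values)
    """
    u_f = eta_ch * p_m * t_char / ((1.0 - soc) * c_b)        # Eq. (10): satisfaction degree
    return 0.0 if u_f >= 1.0 else 0.5 + 0.5 * np.cos(np.pi * u_f)   # Eq. (11)
```

Under these assumed values, electricity_anxiety(0.60, 1.0) is roughly 0.82 while electricity_anxiety(0.90, 6.0) is 0, reproducing the qualitative trend discussed above: anxiety grows as the SOC drops and the remaining dwell time shrinks.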
We consider the medium and low voltage transformer capacity constraints and establish the transformer power constraint as

$-P_{tran}^{max} \le P_{load,t} + P_{EV,t}^{ch} - P_{EV,t}^{dis} \le P_{tran}^{max}$  (12)

where Pload,t is the residential load at timeslot t, which can be called the non-EV load; PchEV,t is the EV charging power at timeslot t; PdisEV,t is the EV discharging power at timeslot t; PchEV,t - PdisEV,t denotes the EV load; Pmaxtran is the maximum transmitted power of the medium and low voltage transformer.

The EV charging and discharging power is constrained as

$0 \le P_{EV,t}^{ch} \le b_{EV,t}\, P_{pile}^{max}, \qquad 0 \le P_{EV,t}^{dis} \le \left(1 - b_{EV,t}\right) P_{pile}^{max}$  (13)

where bEV,t denotes the EV charging and discharging state at timeslot t: if bEV,t = 1, the EV is charging, and if bEV,t = 0, the EV is discharging. Pmaxpile is the maximum charging and discharging power of the charging pile.

Neglecting the power loss of each transformer, the power balance constraint for the residents, EVs and transformer is established as

$P_{tran,t} = P_{load,t} + P_{EV,t}^{ch} - P_{EV,t}^{dis}$  (14)

The charging and discharging scheduling strategy requires avoiding transformer overload. The penalty for transformer overload is calculated as

$r_{l,t} = \begin{cases} 1, & \left|P_{load,t} + P_{EV,t}^{ch} - P_{EV,t}^{dis}\right| > P_{tran}^{max} \\ 0, & \left|P_{load,t} + P_{EV,t}^{ch} - P_{EV,t}^{dis}\right| \le P_{tran}^{max} \end{cases}$  (15)

Combining the above three components, the reward rt can be defined as

$r_t = \sum_{t=1}^{T}\left(r_{p,t} - \alpha\, r_{a,t} - \beta\, r_{l,t}\right)$  (16)

where α and β are the weight coefficients for the driver's electricity anxiety and the transformer overload penalty, respectively, and T is the length of the discrete-time series.

3. Proposed approach

3.1. Preliminaries

Based on the maximum entropy reinforcement learning paradigm, SAC is an off-policy actor-critic DRL algorithm that aims to maximize the expected reward and the entropy (Haarnoja et al., 2020). SAC can be used to address the high sample complexity and poor stability of DRL methods (Yan et al., 2021). SAC outperforms the DDPG (Lillicrap et al., 2019), twin delayed deep deterministic policy gradient (Fujimoto et al., 2018) and proximal policy optimization (Schulman et al., 2017) algorithms by exploiting off-policy updates and the maximum entropy framework. The optimization objective of the SAC algorithm is to maximize reward and entropy as follows

$\max_{\pi} J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[r\left(s_t, a_t\right) + \alpha\, \mathcal{H}\left(\pi\left(\cdot\,|\,s_t\right)\right)\right]$  (17)

where T is the length of the discrete-time series of EVs scheduling times; E is the mathematical expectation; ρπ is the trajectory distribution of (st, at) under the policy π; r(st, at) is the reward rt for taking action at in the state st; α refers to the temperature parameter, which indicates the relative importance of the entropy to the reward; H(π(·|st)) denotes the policy entropy, which is defined as

$\mathcal{H}(s) \triangleq \mathrm{Entropy}\left(\pi\left(\cdot\,|\,s\right)\right) = -\sum_{a\in A} \pi\left(a\,|\,s\right)\log\pi\left(a\,|\,s\right), \qquad \mathrm{Entropy}(p) = -\sum_{i=1}^{N} p_i \log p_i$  (18)

where N denotes the number of all actions; pi refers to the probability of the i-th action; H(s) is the entropy of state s.

With the introduction of maximum entropy, the action reward r(st, at) is updated to the soft action reward rsoft(st, at) as

$r_{soft}\left(s_t, a_t\right) = r\left(s_t, a_t\right) + \gamma\alpha\, \mathbb{E}_{s_{t+1}\sim\rho_\pi}\left[\mathcal{H}\left(\pi\left(\cdot\,|\,s_{t+1}\right)\right)\right]$  (19)

The state value function Vπ(st) represents the expected return gt obtained by following the policy π from state st:

$V_\pi\left(s_t\right) = \mathbb{E}_{a_t\sim\pi}\left[Q_\pi\left(s_t, a_t\right)\right]$  (20)

with

$Q_\pi\left(s_t, a_t\right) = \mathbb{E}\left[g_t\,|\,S = s_t, A = a_t\right]$  (21)

where Qπ(st, at) is the action value function that represents the expected return for taking action at in the state st; the cumulative reward gt can be written as

$g_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \ldots = \sum_{k=0}^{T} \gamma^{k} r_{t+k+1}$  (22)

where rt+k+1 is the reward at timeslot t+k+1; γ is the discount factor that indicates how much a future reward is worth at present; T is the length of the discrete-time series of EVs scheduling times.

Based on (19) and (20), the soft action value function Qsoftπ is obtained as

$Q_{soft,\pi}\left(s_t, a_t\right) = r\left(s_t, a_t\right) + \gamma\, \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft,\pi}\left(s_{t+1}, a_{t+1}\right) - \alpha\log\pi\left(a_{t+1}\,|\,s_{t+1}\right)\right]$  (23)

The state value function Vπ(st) in (20) can be updated to the soft state value function Vsoftπ(st) as

$V_{soft,\pi}\left(s_t\right) = \mathbb{E}_{a_t\sim\pi}\left[Q_{soft,\pi}\left(s_t, a_t\right) - \alpha\log\pi\left(a_t\,|\,s_t\right)\right]$  (24)

Assuming that the policy π converges to the optimal policy, the proof of convergence of SAC is given in Haarnoja et al. (2020). The policy π can be defined as

$\pi\left(s_t, a_t\right) = \exp\left(\frac{1}{\alpha}\left(Q_{soft,\pi}\left(s_t, a_t\right) - V_{soft,\pi}\left(s_t\right)\right)\right)$  (25)
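Before moving to the policy improvement step, the quantities in (18), (24) and (25) can be checked on a toy discrete example. All numbers below are made up purely for illustration and do not come from the paper.

```python
import numpy as np

alpha = 0.2                                   # temperature parameter (illustrative value)
pi = np.array([0.7, 0.2, 0.1])                # toy policy pi(.|s) over three discrete actions
q_soft = np.array([1.0, 0.5, -0.2])           # toy soft action values Q_soft(s, a)

entropy = -np.sum(pi * np.log(pi))                     # Eq. (18): policy entropy
v_soft = np.sum(pi * (q_soft - alpha * np.log(pi)))    # Eq. (24): soft state value

pi_new = np.exp((q_soft - v_soft) / alpha)             # Eq. (25): energy-based policy
pi_new /= pi_new.sum()                                 # normalize so the probabilities sum to one
print(entropy, v_soft, pi_new)
```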
Considering that π(at|st) ∝ exp[Qsoftπ(st, at)] in (25), the policy π can be updated by using the Kullback-Leibler (KL) divergence method as

$\pi_{new} = \arg\min_{\pi'} D_{KL}\left(\pi'\left(\cdot\,|\,s_t\right)\,\middle\|\, \frac{\exp\left(Q_{soft,\pi_{old}}\left(s_t, \cdot\right)\right)}{Z_{\pi_{old}}\left(s_t\right)}\right)$  (26)

with

$Z_\pi\left(s_t\right) = \exp\left(\frac{1}{\alpha} V_{soft,\pi}\left(s_t\right)\right) = \int \exp\left(\frac{1}{\alpha} Q_{soft,\pi}\left(s_t, a\right)\right) da$  (27)

where Zπ denotes the partition function that normalizes the distribution.

To achieve the above optimization results, the parameters of the soft state value network Vψ(st), the soft action value network Qθ(st, at) and the policy network πϕ(at|st) are set to be ψ, θ and ϕ, respectively. The transition tuples of the MDP model stored in the replay pool buffer (RPB) are sampled by prioritized experience replay. The trust region policy optimization (TRPO) algorithm is used to optimize each network parameter based on the sampled quadruples until the results converge. The optimization methods for the above three network parameters within the SAC algorithm are detailed in Haarnoja et al. (2020).

3.2. The proposed approach

The goal of the scheduling problem for EV charging and discharging is to maximize the reward. Since the reward involves multi-objective optimization and is difficult to converge, the model begins by optimizing the first two terms in Eq. (16), namely the EV profits for charging and discharging and the penalty for the driver's electricity anxiety, and then optimizes all of the terms after the algorithm converges. The training process of the DRL-based EV charging and discharging scheduling model is shown in Fig. 3. The training objects are the soft action value network Qθ, the soft state value network Vψ and the policy network πϕ of SAC, and the overall simulation steps are as follows.

1) First, obtain the initial state s0. Randomly select the residential load of the day and add a 5 % random value. Calculate the real-time electricity price at the current moment based on the load.
2) Then schedule the charging and discharging power of the home charging pile.
3) Next, obtain the reward rt based on the first two terms in Eq. (16) and the next state st+1.
4) Then deposit the quadruple (st, st+1, at, rt) into the replay buffer D for the agent to update and learn.
5) The above four steps are repeated until the optimization iteration converges. After the first reinforcement learning iteration converges, the agent parameters are inherited by the second agent.
6) Repeat the above steps until the optimization iteration converges, where the reward rt is now calculated by the full Eq. (16).

Fig. 3. Training process of the DRL-based EV charging and discharging scheduling model.

The steps for updating the parameters θ of the soft action value network Qθ are as follows. Input the state st and action at to the soft action value network Qθ to get the soft action value Qθ(st, at). Input the next state st+1 to the soft state value network Vψ to get the soft state value Vψ(st+1). Update the parameters θ of the soft action value network Qθ using Eq. (1) in Fig. 3.

The steps for updating the parameters ψ of the soft state value network Vψ are as follows. Input the current state st to the soft state value network Vψ to get the soft state value Vψ(st). Input the state st and action at to the policy network πϕ to get the policy entropy log π(at|st). Update the parameters ψ of the soft state value network Vψ using Eq. (3) in Fig. 3.

The steps for updating the parameters ϕ of the policy network πϕ are as follows. Input the state st and action at to the policy network πϕ to get the policy entropy log π(at|st). Input the state st and action at to the soft action value network Qθ to get the soft action value Qθ(st, at). Update the parameters ϕ of the policy network πϕ using Eq. (2) in Fig. 3.
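The loss expressions referred to as Eqs. (1)-(3) in Fig. 3 are not reproduced in the text, so the PyTorch sketch below follows the standard SAC losses of Haarnoja et al. (2020) that the three update descriptions above correspond to. The network sizes, the temperature α, the discount γ and the random batch are illustrative assumptions and not the paper's settings; the 19-dimensional state matches Eq. (1) and the single action is the charging/discharging power.

```python
import torch
import torch.nn as nn

state_dim, action_dim, alpha, gamma = 19, 1, 0.2, 0.99
v_net  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_net  = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
pi_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))

def sample_action(state):
    """Squashed Gaussian policy: action in [-1, 1] and its log-probability."""
    mean, log_std = pi_net(state).chunk(2, dim=-1)
    std = log_std.clamp(-20, 2).exp()
    dist = torch.distributions.Normal(mean, std)
    u = dist.rsample()                                    # reparameterized sample
    a = torch.tanh(u)
    log_prob = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return a, log_prob

def sac_losses(s, a, r, s_next):
    """Critic, value and policy losses for one sampled batch of transitions."""
    with torch.no_grad():
        q_target = r + gamma * v_net(s_next)              # soft Bellman target, cf. Eq. (23)
    q_loss = ((q_net(torch.cat([s, a], dim=-1)) - q_target) ** 2).mean()

    a_new, log_pi = sample_action(s)                      # fresh action from the current policy
    q_new = q_net(torch.cat([s, a_new], dim=-1))
    v_loss = ((v_net(s) - (q_new - alpha * log_pi).detach()) ** 2).mean()   # cf. Eq. (24)

    pi_loss = (alpha * log_pi - q_new).mean()             # KL projection of Eq. (26)
    return q_loss, v_loss, pi_loss

# Toy batch only to show the call pattern
s, s_next = torch.randn(32, state_dim), torch.randn(32, state_dim)
a, r = torch.rand(32, action_dim) * 2 - 1, torch.randn(32, 1)
print([round(l.item(), 3) for l in sac_losses(s, a, r, s_next)])
```

In practice each loss is minimized only with respect to its own network's parameters (separate optimizers), and the two-stage training described above simply switches the reward fed into this routine from the first two terms of Eq. (16) to the full expression.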
4. Simulation results

This section uses simulation analysis to illustrate the efficacy of the suggested EV charging and discharging scheduling approach. First, the simulation setup is described in detail. The model training process is then explained. Finally, the scheduling results based on the suggested approach are displayed.

4.1. Simulation setup

In this paper, a household with an electric private car is used as the simulation object. The simulation time is from 18:00 to 10:00 the next day, and the simulation step is one hour. The EV dynamic charging environment includes EV stochastic behavior, residential load, and dynamic electricity price, all of which are from real data. The probability distribution of EV trips is based on 2017 National Household Travel Survey (NHTS2017) data (National Household Travel Survey, 2020), including arrival times, departure times, and the electric quantity on arrival. Considering the increase in power consumption of EVs due to traffic congestion or weather factors, as well as the decay of battery life with usage time, the SOC on arrival is set to 0.70-0.95. According to the analysis of Yan et al. (2022), the household load overload penalty weight β and the electricity anxiety penalty weight α were set to 5 and 8, respectively. The parameter settings are shown in Table 1.

Table 1
Input parameters of the charging and discharging scheduling model.

Input parameters                              Value
Arrival time ta                               U(17, 20)
Departure time td                             U(6, 10)
Initial SOC socta                             U(0.70, 0.95)
Transformer power limit Pmaxtran              10 kW
Charging pile power limit Pmaxpile            7 kW
Household load overload penalty weight β      5
Electricity anxiety penalty weight α          8

In this paper, the total UK grid load from September 1 to October 31, 2022 (GRIDWATCH, 2023) is selected as the residential load, and the time interval of the raw load is changed from five minutes to one hour. The z-score normalization process is used to eliminate the magnitude of the raw load data and retain only the fluctuating trend of the load. The processed time distribution of residential load is shown in Fig. 4. As can be seen in Fig. 4, the residential load is largest in the period 16:00-20:00 and smallest in the period 22:00-6:00 at night. This distribution of residential load prepares for the calculation of real-time electricity prices.

A random noise of five percent is added to the processed residential load as a randomized user load. The time-of-use electricity pricing uses Chengdu's most recent method for setting prices (Sichuan Development and Reform Commission, 2021), and the ratio α of the base electricity price to the floating electricity price is set to 0.5. The real-time electricity prices can be calculated by Eqs. (2)-(6) in Section 2, as shown in Fig. 5. From Fig. 5, it can be found that the daytime electricity price is higher than the nighttime electricity price; it is highest at 16:00-20:00 and lowest at 0:00-4:00 at night. This real-time electricity price will be used to construct the dynamic charging environment for EVs.

4.2. Training performance

The training performance of the DRL-based EV charging and discharging scheduling model is shown in Fig. 6 and Fig. 7. It can be seen that the first training of the agent converges after five hundred and twenty iterations, optimizing the charging and discharging profits and the drivers' electricity anxiety penalty. The overall reward fluctuates between negative ten and two. At the beginning of the second training of the agent, the electricity anxiety and the charging/discharging profits have almost converged. This proves that the experience from the first learning stage was utilized in the second stage. The second training converges after five hundred and eighty optimization iterations, and the overall reward converges in a process almost equal to the transformer overload penalty. This proves that the agent learns to schedule EV charging and discharging by maximizing the overall reward while keeping the transformer safe.

4.3. EV charging and discharging scheduling performance

To visualize the agent scheduling process after the two trainings, the agent scheduling of EV charging and discharging is observed for one week of continuous simulation, and the results are shown in Fig. 8. From Fig. 8(a), it is evident that the agent devises a plan that leads the EV to discharge at high electricity prices and charge at low electricity prices. This suggests that the agent is trained to minimize charging costs based on changes in electricity prices. In Fig. 8(b), the EV leaves the home with almost a hundred percent battery during the week, which indicates that the proposed method can alleviate the driver's electricity anxiety when leaving home.

To demonstrate whether the proposed method can alleviate the load and avoid transformer overload, the two SAC agents trained above are simulated 1000 times. The simulation results are then compared with the original residential load profile and the household load profile before scheduling, as shown in Fig. 9. It can be seen that the original residential load level is higher from 18:00-22:00, the load decreases from 23:00-4:00 the next day, and the load gradually increases after 5:00 the next day. The average load is 4.0593 kW with a variance of 1.1356. After adding the unscheduled EV charging load, the peak-valley difference of the raw residential load increases, with a maximum load of 13.6628 kW, which is much larger than the maximum load limit of 10 kW, an average load of 6.4551 kW and a variance of 4.0638. After the first training of the scheduling model, the raw peak-to-valley differences during the night are reversed: the load from 22:00-8:00 the next day all exceeded the maximum load limit of 10 kW, with a variance of 3.4423. After the second training of the scheduling model, the reversal of the raw peak-valley difference during the nighttime weakened, with a maximum load of 10.2678 kW and a variance of 2.3034. There were only two moments when the loads were slightly above the maximum load limit. This indicates that the developed EV charging and discharging scheduling strategy can significantly reduce the occurrence of transformer overloading.

Setting 1000 electric private cars to simulate, the probability distribution of household load before and after the charging/discharging scheduling is analyzed, as shown in Fig. 10. The household load reaches the maximum value of 1219.340 kW at 20:00 before scheduling, and reaches the maximum value of 915.214 kW at 20:00 after scheduling. It can be seen that the maximum household load reduces by 24.94 % after scheduling, and the peak of the charging load is shifted to the period from 24:00 to 7:00 the next day, which shows that the proposed method can achieve peak shaving of the household load at night. All the above simulation results show that the developed charging and discharging scheduling strategy can reduce charging costs and electricity anxiety while avoiding transformer overload and achieving peak shaving of the load.

4.4. Comparison with benchmarks

In this case, the proposed approach is evaluated and compared with benchmarks, including the day-ahead and uncontrolled strategies. For the day-ahead solution, we assume the EV arrival time, departure time and initial SOC are known in advance. For the uncontrolled strategy, the EV is immediately charged at maximum constant power when arriving home. 200 test days are used to compare the performance of the different methods. The cumulative costs are the negative value of the cumulative reward. A comparison of the cumulative costs of the proposed and other benchmark methods is shown in Fig. 11. After 200 days of testing, the cumulative costs of the three methods are shown in Table 2. The cumulative cost reduction rate is used to evaluate the degree of reduction, which is the percentage reduction in the cumulative cost of a method compared to the uncontrolled method. It can be found that the proposed approach can reduce the cumulative costs by 82.46 % compared to the uncontrolled strategy, while the day-ahead solution can only reduce the cumulative costs by 57.58 %. This result demonstrates that the proposed method is effective in reducing the cumulative costs of EV charging.

Table 2
Comparison of cumulative costs of the three methods.

Methods        Cumulative costs ($)    Reduction rate (%)
Uncontrolled   1769.75                 -
Day-ahead      750.66                  57.58
Proposed       310.42                  82.46
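As a quick arithmetic check of Table 2, the reduction rate is the percentage decrease of each method's cumulative cost relative to the uncontrolled strategy; the two-line sketch below, with values copied from Table 2, reproduces the reported 57.58 % and 82.46 %.

```python
uncontrolled, day_ahead, proposed = 1769.75, 750.66, 310.42
rate = lambda cost: round(100 * (uncontrolled - cost) / uncontrolled, 2)
print(rate(day_ahead), rate(proposed))   # 57.58 82.46
```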
Fig. 8. EV charging and discharging scheduling results for one week: (a) Charging and discharging power and hourly electricity price; (b) SOC of EV.
Fig. 9. Comparison of household load profile before and after scheduling: (a) Original residential load profile; (b) Household load profile before scheduling; (c)
Household load profile after the first training of the model; (d) Household load profile after the second training of the model.
Fig. 10. Probability distribution of household load before and after charging/discharging scheduling.
5. Conclusion

The EV charging and discharging scheduling problem is formulated as an MDP in this work. Then, we present a DRL-based method for learning a charging and discharging scheduling strategy that optimizes rewards by interacting with the dynamic environment. A DRL-based EV charging and discharging scheduling model is designed with the SAC framework and trained twice. The proposed method calculates the real-time electricity price based on load, taking into account driver electricity anxiety and transformer overloading. Several simulation examples were performed, and the results show that the strategy can minimize charging costs and electricity anxiety while preventing transformer overloading. In the future, the impact of EVs charging and discharging on electricity prices will be further analyzed to develop charging and discharging scheduling strategies for large-scale EVs.
CRediT authorship contribution statement

Qin Xiao: Conceptualization, Methodology, Validation, Writing - original draft. Runtao Zhang: Validation, Investigation, Writing - original draft. Yongcan Wang: Validation, Investigation, Writing - original draft. Peng Shi: Formal analysis, Writing - review & editing, Supervision. Xi Wang: Formal analysis, Writing - review & editing, Project administration. Baorui Chen: Formal analysis, Writing - review & editing, Funding acquisition. Chengwei Fan: Formal analysis, Data curation, Writing - review & editing. Gang Chen: Formal analysis, Data curation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by Sichuan Science and Technology Program under Grant 2023YFG0107.

Data availability

I have shared a link to my usage data in the References.

References

Alsabbagh, A., Wu, B., Ma, C., 2021. Distributed electric vehicles charging management considering time anxiety and customer behaviors. IEEE Trans. Ind. Inf. 17, 2422–2431. https://doi.org/10.1109/TII.2020.3003669.
Chen, Xinyu, Liu, Y., Wang, Q., Lv, J., Wen, J., Chen, Xia, Kang, C., Cheng, S., McElroy, M.B., 2021. Pathway toward carbon-neutral electrical systems in China by mid-century with negative CO2 abatement costs informed by high-resolution modeling. Joule 5, 2715–2741. https://doi.org/10.1016/j.joule.2021.10.006.
Chu, Y., Wei, Z., Fang, X., Chen, S., Zhou, Y., 2022a. A multiagent federated reinforcement learning approach for plug-in electric vehicle fleet charging coordination in a residential community. IEEE Access 10, 98535–98548. https://doi.org/10.1109/ACCESS.2022.3206020.
Chu, Y., Wei, Z., Fang, X., Chen, S., Zhou, Y., 2022b. A multiagent federated reinforcement learning approach for plug-in electric vehicle fleet charging coordination in a residential community. IEEE Access 10, 98535–98548. https://doi.org/10.1109/ACCESS.2022.3206020.
Das, R., Wang, Y., Busawon, K., Putrus, G., Neaimeh, M., 2021. Real-time multi-objective optimisation for electric vehicle charging management. J. Clean. Prod. 292, 126066. https://doi.org/10.1016/j.jclepro.2021.126066.
Ding, T., Zeng, Z., Bai, J., Qin, B., Yang, Y., Shahidehpour, M., 2020. Optimal electric vehicle charging strategy with Markov decision process and reinforcement learning technique. IEEE Trans. Ind. Appl. 56, 5811–5823. https://doi.org/10.1109/TIA.2020.2990096.
Fujimoto, S., van Hoof, H., Meger, D., 2018. Addressing function approximation error in actor-critic methods.
Ghotge, R., Snow, Y., Farahani, S., Lukszo, Z., Van Wijk, A., 2020. Optimized scheduling of EV charging in solar parking lots for local peak reduction under EV demand uncertainty. Energies 13, 1275. https://doi.org/10.3390/en13051275.
GRIDWATCH. Available online: http://www.gridwatch.templar.co.uk (accessed on 14 February 2023).
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S., 2020. Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:2002.02829.
Jiang, W., Zhen, Y., 2019. A real-time EV charging scheduling for parking lots with PV system and energy store system. IEEE Access 7, 86184–86193. https://doi.org/10.1109/ACCESS.2019.2925559.
Jiang, Y., Ye, Q., Sun, B., Wu, Y., Tsang, D.H.K., 2022. Data-driven coordinated charging for electric vehicles with continuous charging rates: a deep policy gradient approach. IEEE Internet Things J. 9, 12395–12412. https://doi.org/10.1109/JIOT.2021.3135977.
Jin, J., Xu, Y., 2021. Optimal policy characterization enhanced actor-critic approach for electric vehicle charging scheduling in a power distribution network. IEEE Trans. Smart Grid 12, 1416–1428. https://doi.org/10.1109/TSG.2020.3028470.
Lee, S., Choi, D.-H., 2024. Three-stage deep reinforcement learning for privacy- and safety-aware smart electric vehicle charging station scheduling and Volt/VAR control. IEEE Internet Things J. 11, 8578–8589. https://doi.org/10.1109/JIOT.2023.3319588.
Li, J., Li, C., Xu, Y., Dong, Z.Y., Wong, K.P., Huang, T., 2018. Noncooperative game-based distributed charging control for plug-in electric vehicles in distribution networks. IEEE Trans. Ind. Inf. 14, 301–310. https://doi.org/10.1109/TII.2016.2632761.
Li, S., Hu, W., Cao, D., Dragicevic, T., Huang, Q., Chen, Z., Blaabjerg, F., 2022. Electric vehicle charging management based on deep reinforcement learning. J. Mod. Power Syst. Clean Energy 10, 719–730. https://doi.org/10.35833/MPCE.2020.000460.
Li, Y., Hu, B., 2021. A consortium blockchain-enabled secure and privacy-preserving optimized charging and discharging trading scheme for electric vehicles. IEEE Trans. Ind. Inf. 17, 1968–1977. https://doi.org/10.1109/TII.2020.2990732.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2019. Continuous control with deep reinforcement learning.
Liu, L., Huang, Z., Xu, J., 2024. Multi-agent deep reinforcement learning based scheduling approach for mobile charging in internet of electric vehicles. IEEE Trans. Mob. Comput. 1–16. https://doi.org/10.1109/TMC.2024.3373410.
Lopez, K.L., Gagne, C., Gardner, M.-A., 2019. Demand-side management using deep learning for smart charging of electric vehicles. IEEE Trans. Smart Grid 10, 2683–2691. https://doi.org/10.1109/TSG.2018.2808247.
Muratori, M., 2018. Impact of uncoordinated plug-in electric vehicle charging on residential power demand. Nat. Energy 3, 193–201. https://doi.org/10.1038/s41560-017-0074-z.
National Household Travel Survey. Available online: https://nhts.ornl.gov/ (accessed on November 2020).
Saber, H., Ranjbar, H., Hajipour, E., Shahidehpour, M., 2024. Two-stage coordination scheme for multiple EV charging stations connected to an exclusive DC feeder considering grid-tie converter limitation. IEEE Trans. Transp. Electrific. https://doi.org/10.1109/TTE.2024.3357217.
Sadeghianpourhamami, N., Deleu, J., Develder, C., 2020. Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning. IEEE Trans. Smart Grid 11, 203–214. https://doi.org/10.1109/TSG.2019.2920320.
Said, D., Mouftah, H.T., 2020. A novel electric vehicles charging/discharging management protocol based on queuing model. IEEE Trans. Intell. Veh. 5, 100–111. https://doi.org/10.1109/TIV.2019.2955370.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms.
Shi, M., Wang, H., Lyu, C., Dong, Q., Li, X., Jia, Y., 2024. An optimal regime of energy management for smart building clusters with electric vehicles. IEEE Trans. Ind. Inf. 1–11. https://doi.org/10.1109/TII.2024.3363059.
Shi, Y., Tuan, H.D., Savkin, A.V., Duong, T.Q., Poor, H.V., 2019. Model predictive control for smart grids with multiple electric-vehicle charging stations. IEEE Trans. Smart Grid 10, 2127–2136. https://doi.org/10.1109/TSG.2017.2789333.
Sichuan Development and Reform Commission. Available online: http://fgw.sc.gov.cn//sfgw/c106088/2021/12/8/c333f275605d487d92c153832496f627.shtml (accessed on 8 December 2021).
Sun, B., Huang, Z., Tan, X., Tsang, D.H.K., 2018. Optimal scheduling for electric vehicle charging with discrete charging levels in distribution grid. IEEE Trans. Smart Grid 9, 624–634. https://doi.org/10.1109/TSG.2016.2558585.
Tajeddini, M.A., Kebriaei, H., 2019. A mean-field game method for decentralized charging coordination of a large population of plug-in electric vehicles. IEEE Syst. J. 13, 854–863. https://doi.org/10.1109/JSYST.2018.2855971.
Wan, Z., Li, H., He, H., Prokhorov, D., 2019. Model-free real-time EV charging scheduling based on deep reinforcement learning. IEEE Trans. Smart Grid 10, 5246–5257. https://doi.org/10.1109/TSG.2018.2879572.
Wang, J., Bharati, G.R., Paudyal, S., Ceylan, O., Bhattarai, B.P., Myers, K.S., 2019. Coordinated electric vehicle charging with reactive power support to distribution grids. IEEE Trans. Ind. Inf. 15, 54–63. https://doi.org/10.1109/TII.2018.2829710.
Wang, S., Bi, S., Zhang, Y.A., 2021. Reinforcement learning for real-time pricing and scheduling control in EV charging stations. IEEE Trans. Ind. Inf. 17, 849–859. https://doi.org/10.1109/TII.2019.2950809.
Wu, D., Zeng, H., Lu, C., Boulet, B., 2017. Two-stage energy management for office buildings with workplace EV charging and renewable energy. IEEE Trans. Transp. Electrific. 3, 225–237. https://doi.org/10.1109/TTE.2017.2659626.
Yan, L., Chen, X., Zhou, J., Chen, Y., Wen, J., 2021. Deep reinforcement learning for continuous electric vehicles charging control with dynamic user behaviors. IEEE Trans. Smart Grid 12, 5124–5134. https://doi.org/10.1109/TSG.2021.3098298.
Yan, L., Chen, X., Chen, Y., Wen, J., 2022. A cooperative charging control strategy for electric vehicles based on multiagent deep reinforcement learning. IEEE Trans. Ind. Inform. 18, 8765–8775. https://doi.org/10.1109/TII.2022.3152218.
Yao, L., Lim, W.H., Tsai, T.S., 2017. A real-time charging scheme for demand response in electric vehicle parking station. IEEE Trans. Smart Grid 8, 52–62. https://doi.org/10.1109/TSG.2016.2582749.
Zhang, F., Yang, Q., An, D., 2021. CDDPG: a deep-reinforcement-learning-based approach for electric vehicle charging control. IEEE Internet Things J. 8, 3075–3087. https://doi.org/10.1109/JIOT.2020.3015204.
Zhang, Z., Wan, Y., Qin, J., Fu, W., Kang, Y., 2022. A deep RL-based algorithm for coordinated charging of electric vehicles. IEEE Trans. Intell. Transp. Syst. 23, 18774–18784. https://doi.org/10.1109/TITS.2022.3170000.
Zhao, J., Wan, C., Xu, Z., Wang, J., 2017. Risk-based day-ahead scheduling of electric vehicle aggregator using information gap decision theory. IEEE Trans. Smart Grid 8, 1609–1618. https://doi.org/10.1109/TSG.2015.2494371.
Zheng, Y., Song, Y., Hill, D.J., Meng, K., 2019. Online distributed MPC-based optimal scheduling for EV charging stations in distribution systems. IEEE Trans. Ind. Inf. 15, 638–649. https://doi.org/10.1109/TII.2018.2812755.
Zhou, X., Zou, S., Wang, P., Ma, Z., 2021. ADMM-based coordination of electric vehicles in constrained distribution networks considering fast charging and degradation. IEEE Trans. Intell. Transp. Syst. 22, 565–578. https://doi.org/10.1109/TITS.2020.3015122.
Zou, L., Munir, Md.S., Tun, Y.K., Kang, S., Hong, C.S., 2022. Intelligent EV charging for urban prosumer communities: an auction and multi-agent deep reinforcement learning approach. IEEE Trans. Netw. Serv. Manag. 19, 4384–4407. https://doi.org/10.1109/TNSM.2022.3160210.