Learning To Operate An Electric Vehicle Charging Station Considering Vehicle-Grid Integration
Abstract—The rapid adoption of electric vehicles (EVs) calls for the widespread installation of EV charging stations. To maximize the profitability of charging stations, intelligent controllers that provide both charging and electric grid services are in great need. However, it is challenging to determine the optimal charging schedule due to the uncertain arrival time and charging demands of EVs. In this paper, we propose a novel centralized allocation and decentralized execution (CADE) reinforcement learning (RL) framework to maximize the charging station's profit. In the centralized allocation process, EVs are allocated to either the waiting or charging spots. In the decentralized execution process, each charger makes its own charging/discharging decision while learning the action-value functions from a shared replay memory. This CADE framework significantly improves the scalability and sample efficiency of the RL algorithm. Numerical results show that the proposed CADE framework is both computationally efficient and scalable, and significantly outperforms the baseline model predictive control (MPC). We also provide an in-depth analysis of the learned action-value function to explain the inner working of the reinforcement learning agent.

Index Terms—Electric vehicle, charging station, vehicle-grid integration, reinforcement learning.

NOMENCLATURE

β_it   A binary variable indicating if the ith EV is in the charging station at time t.
Δa   Interval between neighboring values in action set A.
Δt   Length of one time step.
η_it   A binary variable indicating if the ith EV is connected to a charger at time t.
A   The set of all discrete actions of a reinforcement learning agent.
a   a ∈ A, one of the actions, i.e., the charging/discharging power selected by a charger for an EV.
a^lower/a^upper   The lower/upper bound of charging power constrained by battery energy level.
a^min/a^max   The minimal/maximum charging power. The minimal charging power corresponds to the maximum discharging power.
B   Net revenue of the charging station by charging and discharging EV batteries.
C^l   Demand charge of the charging station.
C^p   Penalty paid by the charging station for not charging EVs to their target energy levels.
e^r   Remaining energy to be charged for an EV, i.e., the difference between e and e^tgt.
e^ini/e^fnl/e^tgt   The initial/final/target energy level of an EV.
e^min/e^max   The minimum/maximum energy level of an EV.
E^{r,w}   Sum of e^r of the EVs in the waiting area.
e_it   Energy level of the ith EV at time t.
H   The set of time-of-use periods.
h   h ∈ H, one of the time-of-use periods.
I   The set of EVs that have arrived at the charging station in T.
J   The set of all chargers in the charging station.
L_ht   Recorded peak power of a charging station in time-of-use period h at time t.
m_t   The net revenue of charging/discharging an EV with 1 kWh of electricity.
N   The total number of parking spots in a charging station.
N^c/N^w   The total number of chargers/waiting spots.
p^c   The price paid by an EV customer to the charging station for 1 kWh of electricity.
p^d   The price the charging station pays to EV customers for discharging 1 kWh of electricity.
p^e   The price of electricity charged by the electric utility.
p^l   The price the charging station pays to the electric utility for 1 kW of recorded peak power.
Q(s, a)   The state-action value function of taking action a at state s.
r   The reward received by a charger.
r^b   The reward received by a charger through performing charging/discharging of EVs.
r^l   The reward received by a charger for increasing the recorded peak power.
r^p   The reward received by a charger for not charging an EV to its target energy level.
s   The state of environment of a charger.
T   A discrete time horizon.
t   t ∈ T, one of the time steps.
t^a/t^d   The arrival/departure time of an EV.
t^r   The remaining dwelling time of an EV.
T_B   The billing period of demand charge C^l.
Z   Profit of the charging station.

Manuscript received November 8, 2021; revised February 15, 2022; accepted March 24, 2022. Date of publication April 6, 2022; date of current version June 21, 2022. This work was supported by the University of California Institute of Transportation Studies from the State of California through the Public Transportation Account and the Road Repair and Accountability Act of 2017 (Senate Bill 1). Paper no. TSG-01793-2021. (Corresponding author: Nanpeng Yu.)
The authors are with the Department of Electrical and Computer Engineering, University of California at Riverside, Riverside, CA 92521 USA (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TSG.2022.3165479.
Digital Object Identifier 10.1109/TSG.2022.3165479
I. INTRODUCTION

IN THE past decade, the global electric vehicle (EV) market has been growing exponentially thanks to the rapid advancement of battery technologies. To support further penetration of EVs, it is critical to develop smart charging stations that can satisfy the charging needs in a cost-effective manner.

The topic of charging station operation optimization has been widely researched with model-based algorithms. The pricing and scheduling of EV charging are jointly considered in the charging station profit maximization problem and solved by an enhanced Lyapunov optimization algorithm [1]. By jointly considering the admission control, pricing, and charging scheduling, a charging station profit maximization problem is formulated and solved by the combination of multi-sub-process based admission control and a gradient-based optimization algorithm [2]. The planning and operation problem of single output multiple cables charging spots is formulated and solved by two-stage stochastic programming [3]. The charging station scheduling problem that considers vehicle-to-vehicle charging is formulated as a constrained mixed-integer linear program and solved by dual and Benders decomposition [4]. The scheduling of battery replacement in charging stations is solved by a multistage stochastic programming algorithm [5]. By considering renewable generation and energy storage systems in the design [6]-[8] and operation of charging stations [9], [10], significant economic benefits can be gained.

Model-based control approaches have also been applied to schedule EV charging considering vehicle-grid integration (VGI). A decentralized control algorithm to optimally schedule EV charging is developed to fill the valleys in electric load profiles [11]. To mitigate the impact on distribution networks, the EV charging and discharging coordination problem is solved with a coalition formation strategy [12]. To smooth out the power fluctuations caused by distributed renewable energy sources, a joint iterative optimization method is developed to schedule the charging of EVs and minimize the cost of load regulation [13]. The problem of routing and charging scheduling of EVs considering VGI is formulated as an optimization problem solved by an approximation algorithm [14]. VGI, vehicle-to-vehicle energy transfer, and demand-side management in vehicle-to-building settings have also been studied with mixed integer programming [15] and a distributed control algorithm [16].

The aforementioned model-based methods rely heavily on sophisticated algorithm designs that are tailored for specific charging scenarios. Most of the model-based methods cast the charging station scheduling problem as a complex optimization problem, which is time-consuming to solve when the scale of the charging station increases. To address the scalability issues of model-based methods, researchers are turning to reinforcement learning (RL), which excels at solving sequential decision-making problems in real time. The RL agent aims at learning a good control policy by interacting with the environment or a simulation system to achieve the best long-term rewards. Combining RL algorithms with deep neural networks has led to breakthroughs in solving complex problems with large-scale state and action spaces [17].

The application of RL to schedule the charging of a single EV is widely covered in the literature. An early attempt leverages the tabular Q-learning method [18], which can only deal with small-scale state and action spaces. To overcome the limitations of tabular methods, kernel averaging regression functions are used to approximate the action-value functions in scheduling the charging of an individual EV [19]. Reference [20] further extends the charging scheduling problem for a single EV by considering bidirectional energy transfer and utilizing deep neural networks to approximate state-action values. Apart from reducing the charging cost, drivers' anxiety about the EVs' range [21] and photovoltaic self-consumption [22] can also be quantified and included as training targets. To ensure the EV is sufficiently charged upon departure, [23] introduces an exponential penalty term to encourage the charger to finish the charging task on time, while [24] utilizes constrained policy optimization to learn a safe policy without the need to manually design a penalty term. Long short-term memory (LSTM) is also widely used to predict the electricity price as an input to the RL algorithm [20], [25]-[27].

Compared to the single EV algorithms, it is much more challenging to apply RL to coordinate the charging of multiple EVs in a charging station. This is because the dimensionality of the state space changes with the stochastic arrival and departure of EVs. To address this problem, a two-stage heuristic RL algorithm is proposed [28]. In the first stage, an RL agent learns the collective EV fleet charging strategy. In the second stage, a simple heuristic is used to translate the collective control action back to individual charger control actions. The drawback of this approach is that the simple heuristic cannot guarantee optimal charging scheduling. An alternative approach to tackle the difficulty of dimension-varying state and action spaces is to represent the state-action function using a linear combination of a set of carefully designed feature functions [29]. However, designing good feature functions is a manual process that can be challenging. To overcome this challenge, [30] develops a novel grid representation of EVs' required and remaining charging time such that the dimensions of state and action are constants. The problem with this grid approach is that the charger can only select to charge with full power or to not charge, without any intermediate power outputs. While in most of the research papers RL agents are deployed to determine output charging powers, [31] instead uses RL to predict future conditions, and a model-based method is then used to determine charging powers. The drawback of this approach is that the prediction can hardly be accurate. Researchers have also studied the problem of charging scheduling of multiple individual EVs in a local area to reduce the load on the grid [32], [33]. In these works, each charger is an RL agent and collective features (e.g., the total load) are usually introduced to the state vector. These pioneering studies demonstrate the potential of RL in charging scheduling and pave the way for further research.
Other interesting applications of RL related to charging include the scheduling of multiple charging stations where each charging station is treated as a unit [34], [35], the dynamic pricing for charging services [36]-[38], and the scheduling of battery swapping stations [39], [40].

In this paper, we propose an innovative centralized allocation and decentralized execution (CADE) framework to schedule the charging activities at the charging station. Compared to previous work [28]-[31], the problem formulation in this paper considers waiting area, VGI, and demand charge, making it more realistic and comprehensive. By synergistically combining the merits of RL and advanced optimization, our proposed CADE framework is highly scalable and yields competitive charging station profit compared to state-of-the-art baseline algorithms. The centralized allocation module leverages an optimization technique to determine which EVs should be connected to the chargers. To address the challenge of the dimension-varying state and action space of the charging station, we propose a decentralized execution scheme, where each charger is represented by an RL agent and determines its own charging/discharging actions. The individual chargers will submit operational experiences to a shared replay memory, which provides training samples to update the policies of the RL agents. Most of the existing algorithms neglect the demand charge cost as it is extremely difficult to predict the peak demand of a charging station ahead of time. In our proposed framework, an online algorithm is designed to assign the charging station-wide penalty associated with demand charge to the individual chargers.

The contributions of this paper are summarized below.
1) We propose a scalable centralized allocation and decentralized execution (CADE) framework to operate an EV charging station considering waiting area, vehicle-grid integration, and demand charge. By synergistically combining advanced optimization with reinforcement learning algorithms, our proposed approach identifies both the EV assignment and the chargers' outputs that yield high charging station profit.
2) By modeling each charger as an intelligent RL agent, our proposed algorithm tackles the challenge of dealing with dimension-varying state and action space associated with stochastic EV arrivals and charging needs. To improve the learning efficiency, we allow the chargers to exploit a shared replay memory of operational experience in learning decentralized charger control policies.
3) Comprehensive numerical studies demonstrate that our proposed CADE framework is not only extremely scalable but also outperforms state-of-the-art baseline algorithms in terms of charging station profit.

The rest of this paper is structured as follows. Section II formulates the charging scheduling problem as an optimization problem. Then, Section III reformulates the charging scheduling problem as a Markov Decision Process and proposes the CADE framework to solve the sequential decision making problem. Case studies are included in Section IV to demonstrate the effectiveness of our proposed method. Section V provides a brief conclusion.

II. PROBLEM FORMULATION

We will first formulate the charging station scheduling problem as an optimization problem. Let us consider a charging station with N parking spots, among which N^c are charging spots and N^w are waiting spots, such that N = N^w + N^c. When the number of EVs in the charging station is larger than the number of chargers, the charging station needs to determine which EVs shall be connected to the chargers. Here, we assume that the time of switching an EV from a waiting spot to a charger or vice versa is negligible.

The goal of the charging station's scheduling system is to maximize its profit while satisfying EV customers' charging needs. The operating profit of a charging station Z consists of three components: the net revenue of charging and discharging EVs, B, the penalty of failing to satisfy the customers' charging needs, C^p, and the demand charge of the station, C^l, as shown in (1).

Z = B - C^p - C^l    (1)

We assume that Z is measured over a time horizon T, which consists of intervals of equal length Δt. All of the incoming EVs during T are denoted by a set I. We will describe each component of Z in detail in the next few subsections.

A. Net Revenue From Charging and Discharging EVs

The EV owners pay the charging station p_t^c for each kWh of electricity charged and receive p_t^d for each kWh of electricity discharged from the batteries, which is sold to the electric grid. The charging station pays or receives p_t^e for each kWh of energy bought from or sold to the electric utility.

Suppose at time t, the ith EV is charged/discharged at a power level a_it, where a_it > 0 means charging and a_it < 0 means discharging. The net revenue from charging and discharging EVs, B, can be calculated as:

B = \sum_{i \in I} \sum_{t \in T} m_t |a_{it}| \Delta t,    (2)

where m_t is the net revenue of charging/discharging an EV with 1 kWh of energy at time t and can be calculated by:

m_t = \begin{cases} p_t^c - p_t^e & \text{if } a_{it} \ge 0 \\ p_t^e - p_t^d & \text{if } a_{it} < 0 \end{cases}    (3)

Note that the price paid to the customer for discharging the EV battery is higher than the EV charging cost, i.e., p_t^d > p_t^c, ∀t. This is necessary to cover the degradation cost of the EV battery [41], [42].
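To make (2) and (3) concrete, the sketch below computes m_t and the net revenue B for a given charging/discharging profile of one EV. It is only an illustration: the price values, the quarter-hour step, and the helper names (net_margin, net_revenue) are assumptions made for the example, not part of the paper.

```python
# Illustrative sketch of (2)-(3); prices and the power profile are hypothetical.
def net_margin(p_c: float, p_d: float, p_e: float, a: float) -> float:
    """m_t: net revenue of moving 1 kWh at power a (>0 charging, <0 discharging)."""
    return (p_c - p_e) if a >= 0 else (p_e - p_d)

def net_revenue(powers, p_c, p_d, p_e, dt=0.25):
    """B = sum_t m_t * |a_t| * dt for one EV; powers in kW, dt in hours."""
    return sum(net_margin(p_c[t], p_d[t], p_e[t], a) * abs(a) * dt
               for t, a in enumerate(powers))

# Example: charge at 6 kW for two steps, then discharge at 6 kW for one step.
print(net_revenue([6.0, 6.0, -6.0],
                  p_c=[0.30] * 3, p_d=[0.35] * 3, p_e=[0.10] * 3))
```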
B. Penalty of Failing to Satisfy Customers' Charging Needs

Upon arrival, an EV customer i will specify the departure time t_i^d and the desired target energy level e_i^tgt at departure. The charging station will try to ensure that the EV's energy level is not lower than e_i^tgt at time t_i^d. If the target energy level is not met, a penalty will be imposed on the charging station to compensate the customer. The penalty is denoted as c_i^p. Such a penalty shall reflect the gap between the final energy level e_i^fnl and the target energy level e_i^tgt. For simplicity, we use a linear function to calculate c_i^p as suggested by [33]:

c_i^p = \mu \left( e_i^{tgt} - e_i^{fnl} \right)^+,    (4)

where μ is a constant coefficient and x^+ = max(x, 0). Note that a penalty is triggered only when e_i^fnl < e_i^tgt. e_i^fnl can be derived by (5):

e_i^{fnl} = e_i^{ini} + \sum_{t=t_i^a}^{t_i^d} a_{it} \Delta t,    (5)

where e_i^ini is the initial energy level of the ith EV and t_i^a is its arrival time. Also note that an EV customer cannot specify an e^tgt higher than a threshold that is physically achievable with the charging infrastructure. We explicitly express this requirement as shown in (6):

e_i^{tgt} \le \min \left( e^{max},\; e_i^{ini} + a^{max} \left( t_i^d - t_i^a \right) \right),    (6)

where e^max is the maximum capacity of the battery, and a^max is the maximal charging power.

The total penalty for a charging station, C^p, is the sum of c_i^p over all EVs in I.

D. Summary of the Charging Station Optimization Problem

The charging station scheduling problem can be formulated as the following optimization problem, which aims at maximizing the profit Z (10) while satisfying the operational constraints (11a)-(11g):

\max_{a_{it}, \eta_{it}} Z = \sum_{i \in I} \sum_{t \in T} m_t |a_{it}| \Delta t - \sum_{i \in I} \mu \left[ e_i^{tgt} - \left( e_i^{ini} + \sum_{t=t_i^a}^{t_i^d} a_{it} \Delta t \right) \right]^+ - \frac{T}{T_B} \sum_{h \in H} p_h^l \cdot \max_{t \in T_h} \sum_{i \in I} a_{it},    (10)

subject to:

a^{min} \le a_{it} \le a^{max}, \quad \forall i \in I, \forall t \in T    (11a)

a_{it} = 0, \text{ if } \eta_{it} = 0, \quad \forall i \in I, \forall t \in T    (11b)

\eta_{it} \in \{0, 1\}, \quad \forall i \in I, \forall t \in T    (11c)

\sum_{i \in I} \eta_{it} \le N^c, \quad \forall t \in T    (11d)

\sum_{i \in I} \beta_{it} - \sum_{i \in I} \eta_{it} \le N^w, \quad \forall t \in T    (11e)
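The next sketch evaluates the objective (10) for a fixed charging schedule, assembling the revenue term from (2)-(3), the penalty term from (4)-(5), and the demand-charge term. It is a simplified illustration: a single time-of-use period is assumed, and the per-EV data structure and all numeric values are hypothetical.

```python
# Hypothetical data: 2 EVs, 4 time steps of dt hours, a single time-of-use period.
dt, mu, p_l, horizon_over_billing = 0.25, 0.5, 20.0, 1.0 / 30.0  # last value plays the role of T/T_B
p_c, p_d, p_e = 0.30, 0.35, 0.10                                  # $/kWh
evs = [  # each EV: initial/target energy (kWh) and its charging profile (kW)
    {"e_ini": 20.0, "e_tgt": 30.0, "a": [7.0, 7.0, 7.0, 7.0]},
    {"e_ini": 35.0, "e_tgt": 38.0, "a": [7.0, 0.0, -7.0, 7.0]},
]

def margin(a):            # m_t from (3), assuming time-invariant prices here
    return p_c - p_e if a >= 0 else p_e - p_d

revenue = sum(margin(a) * abs(a) * dt for ev in evs for a in ev["a"])          # B in (2)
penalty = sum(mu * max(ev["e_tgt"] - (ev["e_ini"] + sum(ev["a"]) * dt), 0.0)   # (4)-(5)
              for ev in evs)
station_load = [sum(ev["a"][t] for ev in evs) for t in range(4)]
demand_charge = horizon_over_billing * p_l * max(station_load)                 # last term of (10)
print("Z =", revenue - penalty - demand_charge)                                # (1)/(10)
```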
However, the effectiveness of the MPC-based algorithm heavily relies on accurate predictions of future EV arrivals and does not scale well with the size of the problem. This motivates us to develop model-free methods, such as reinforcement learning-based algorithms, which are capable of making real-time decisions in complex environments.

III. TECHNICAL METHODS

We propose a centralized allocation and decentralized execution (CADE) framework to solve the charging station scheduling problem formulated in Section II. The overall framework of CADE for charging station scheduling is shown in Fig. 1. EVs stochastically arrive at the charging station. Immediately after observing the actual EV arrivals at time step t, a centralized allocation algorithm determines which EVs shall be connected to chargers or assigned to the waiting area. Then each charger acts as an individual agent and makes charging/discharging decisions based on the state of the EV connected with it. The sequential decision-making problem of each charger is formulated as a Markov Decision Process (MDP). The operational experiences of all chargers will be collected and used in a learning process. In this process, the parameters of the state-action value function's neural networks will be updated and then broadcast to all chargers.

Fig. 1. Overview of the centralized allocation and decentralized execution (CADE) framework.

A. Overview of Markov Decision Process

In this subsection, we briefly introduce the MDP, which is the mathematical foundation for sequential decision-making problems. In an MDP, an agent interacts with an environment over a sequence of discrete time steps t = {1, 2, . . . , T}. At each time step t, the agent senses the state s ∈ S of the environment and takes an action a ∈ A accordingly. Once the action is taken, the environment transitions to a new state s' with probability Pr(s'|s, a) and provides a reward r to the agent. The agent's goal is to maximize the expected sum of discounted rewards R_t = E[\sum_{\tau=t}^{T} \gamma^{\tau-t} r_\tau], where γ ∈ [0, 1] is a discount factor that balances the relative importance of near-term and future rewards.

Suppose an agent follows a policy π to make decisions. We can define the state-action value function of the policy in (12) to quantify how good it is to take an action a in a state s:

Q^\pi(s, a) = E[R_t \mid s_t = s, a_t = a, \pi]    (12)

The state-action value function must satisfy the following Bellman equation (13):

Q^\pi(s, a) = E_{s'}\left[ r + \gamma\, E_{a' \sim \pi(s')} Q^\pi(s', a') \mid s, a, \pi \right]    (13)

We define the optimal state-action value function Q^*(s, a) as Q^*(s, a) = \max_\pi Q^\pi(s, a). Then the optimal deterministic policy π^* is the one that chooses actions a^* = \arg\max_{a \in A} Q^*(s, a).

B. Formulate the Charger Control Problem as an MDP

In the individual charger control problem, each charger is treated as an agent that interacts with the environment, which consists of the EV connected to the charger and the charging station.

Suppose the set of chargers is J. For an individual charger j ∈ J, its state of environment at time t is defined as s_jt = {δ_jt, t, t_j^r, e_jt, e_jt^r, N_t^{EV,w}, E_t^{r,w}, h̃_t, L_ht}. δ_jt ∈ {0, 1} indicates if there is an EV connected to charger j. t ∈ T denotes the global time. t_j^r = t_j^d - t is the remaining dwelling time of the EV connected to the charger. e_jt denotes the current energy level of the EV battery. e_jt^r = e_j^tgt - e_jt is the remaining energy to be charged for the EV battery. N_t^{EV,w} is the total number of EVs in the waiting area at time t. E_t^{r,w} is the sum of remaining energy to be charged for all EVs in the waiting area, which can be calculated as E_t^{r,w} = \sum_{k=1}^{N_t^{EV,w}} e_{kt}^r. h̃_t is a one-hot encoded vector indicating the current time-of-use period. The dimension of h̃_t is dim(H). For example, if there are three time-of-use periods and we are currently in the second period, then h̃_t = [0, 1, 0]. L_ht is the maximum load that the charging station has seen so far for time-of-use period h, which corresponds to global time t.

Note that t_j^r and e_jt^r quantify the urgency of the need to charge the EV connected to charger j. N_t^{EV,w} and E_t^{r,w} quantify the urgency of the need to charge the EVs in the waiting area. The agent can obtain the current time-of-use pricing for electricity given the global time t.
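As an illustration of the charger state defined above, the snippet below packs s_jt into a fixed-length numeric vector that a Q-network could consume. The function name, argument names, and the three-period time-of-use encoding are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def charger_state(connected, t, t_remain, e, e_remain,
                  n_wait, e_remain_wait, tou_period, peak_load, n_tou=3):
    """Assemble s_jt = {delta, t, t^r, e, e^r, N^{EV,w}, E^{r,w}, h_tilde, L_h}."""
    h_tilde = np.zeros(n_tou)          # one-hot time-of-use indicator
    h_tilde[tou_period] = 1.0
    return np.concatenate(([float(connected), t, t_remain, e, e_remain,
                            n_wait, e_remain_wait], h_tilde, [peak_load]))

# Example: an EV with 8 kWh still to charge, leaving in 8 steps, during period 1.
s = charger_state(connected=True, t=36, t_remain=8, e=22.0, e_remain=8.0,
                  n_wait=3, e_remain_wait=21.5, tou_period=1, peak_load=45.0)
print(s.shape)   # (11,) for three time-of-use periods
```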
The action for a charger j at time t is its output power a_jt. In this paper, we discretize the action space using discrete charging rates as suggested by [45], [46]. The upper bound of the charger power output action a_jt^upper can be calculated as the minimum of the physical maximum charger output and the maximum charging rate allowed to fill up the battery of the EV: a_jt^upper = min(a^max, (e^max - e_jt)/Δt). The lower bound of the charger power output action a_jt^lower can be calculated as the maximum of the physical minimum charger output and the maximum discharging rate allowed to deplete the battery of the EV: a_jt^lower = max(a^min, (e^min - e_jt)/Δt). Finally, the feasible action space for a charger is defined as an ordered set of discrete and increasing charger outputs A_jt = {a_jt^lower, . . . , a_jt^upper} with uniform difference Δa between adjacent actions.

The design of the reward function for the chargers shall be consistent with the charging station operation objective in (1), which is maximizing the net revenue B from charging/discharging EVs and minimizing the charging station demand charge C^l and the penalty of not satisfying customers' charging needs C^p. The reward function of a charger equals the sum of three components: r_jt = r_jt^b + r_jt^p + r_jt^l. The first component r_jt^b = m_t |a_jt| Δt represents the net revenue of a charger for charging/discharging an EV in the period of Δt.

The second component r_jt^p = -(c_jt^p + c_jt^{p,w}) accounts for the penalty of failing to satisfy the customer's charging needs. It has two parts. The first part c_jt^p is a penalty assigned to charger j when the EV connected to it does not reach its desired energy level upon leaving, which can be calculated as follows:

c_{jt}^{p} = \begin{cases} \mu \left( e_j^{tgt} - e_{jt} \right)^+ & \text{if } t^r = 0 \\ 0 & \text{if } t^r > 0 \end{cases}    (14)

The second part c_jt^{p,w} is a penalty assigned to charger j when EVs in the waiting area do not reach their desired energy level upon leaving. Since EVs in the waiting area are not directly associated with a specific charger, the corresponding penalty will be split among all chargers in the charging station. The aggregated penalty R_t^{p,w} for all EVs in the waiting area can be calculated as R_t^{p,w} = \sum_{k=1}^{N_t^{EV,w}} c_{kt}^p. To encourage chargers to increase charging power when the EVs in the waiting area have not reached their desired energy level, we propose splitting R_t^{p,w} in a way that assigns chargers fewer penalties if their power output levels are higher:

c_{jt}^{p,w} = \frac{a^{max} - a_{jt}}{\sum_{j \in J} \left( a^{max} - a_{jt} \right)} R_t^{p,w}    (15)

The third component of the reward function, r_jt^l, is associated with the demand charge of the charging station. The aggregated additional demand charge to be paid by the charging station, R_t^l, takes a non-zero value at time t when the total charging power \sum_{j \in J} a_jt exceeds the maximum load of the charging station L_ht up to time t. R_t^l can be calculated by following Algorithm 1.

Algorithm 1: Online Calculation Procedure for R_t^l
1 Obtain maximum charging station load L_ht for time-of-use period h up to time t;
2 Initialize R_t^l = 0;
3 if \sum_{j \in J} a_jt > L_ht then
4   R_t^l ← -(T/T_B) p_h^l (\sum_{j \in J} a_jt - L_ht);
5   L_ht ← \sum_{j \in J} a_jt;
6 end

If a charger j selects a higher charging power a_jt, then it contributes more to the increase in demand charge. Thus, we design r_jt^l in a way that assigns a charger with higher power output a greater penalty, as shown in (16):

r_{jt}^{l} = \frac{a_{jt}}{\sum_{j \in J} a_{jt}} R_t^l,    (16)

Note that if the study time horizon T is different from the electricity billing period T_B, then the increase in demand charge is scaled accordingly in Algorithm 1.

C. Decentralized Execution at the Charger Level

The chargers in the charging station determine the charging/discharging power for each time step by following an RL-based decentralized execution framework. Individual chargers collect historical operational experiences, which include the current state, the action, the next state, and the reward, and store them in a shared replay memory M. The action value functions corresponding to the chargers will be learned through a deep Q-learning (DQN) algorithm [47] using the shared replay memory. DQN is selected due to its ability to handle large state and action spaces. The learned parameters of the Q-networks will be periodically broadcast to the individual chargers. In real-time operations, the chargers will take charging/discharging actions based on their own observations of the state in a decentralized manner.

In the DQN algorithm, a deep neural network parameterized by θ is used to approximate the action value function Q. To stabilize the learning process, a second neural network parameterized by θ^-, called the target network, is introduced to reduce the strong correlations between the function approximator and its target. The parameters of the Q-network are iteratively updated by stochastic gradient descent on the following loss function:

L(\theta) = E\left[ \left( r + \gamma (1 - d) \max_{a' \in A} Q\left(s', a'; \theta^-\right) - Q(s, a; \theta) \right)^2 \right],    (17)

where s and s' are the current state and the next state of the environment, respectively. d ∈ {0, 1} is an indicator of the terminal state for each episode, and d = 1 indicates that s' is the terminal state. The target network's parameters θ^- are replaced by θ after every certain number of training episodes. We design the Q-network with the following structure. The input layer of the neural network includes the state and action variables (s, a) of the individual chargers. The output layer has a single neuron representing the value of Q(s, a). This design is suitable for the charger decision-making problem as the charging powers are ordinal variables.
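Returning to the reward design, the sketch below implements the demand-charge bookkeeping of Algorithm 1 together with the penalty splits in (15) and (16). It is a minimal illustration; the price, the T/T_B scaling value, and the convention of passing station-wide penalties as negative reward contributions are assumptions made for the example.

```python
# Minimal sketch of Algorithm 1 and the splits (15)-(16); all values are hypothetical.
def demand_charge_step(powers, peak_load, p_l, horizon_over_billing):
    """Return (R^l_t, updated recorded peak) for one step given per-charger powers (kW)."""
    total = sum(powers)
    if total > peak_load:
        r_l_total = -horizon_over_billing * p_l * (total - peak_load)
        return r_l_total, total          # new recorded peak L_ht
    return 0.0, peak_load

def split_rewards(powers, r_l_total, r_pw_total, a_max):
    """Split station-wide terms across chargers: (16) by power share, (15) by power slack.
    Here both totals are passed as negative reward contributions."""
    total = sum(powers)
    slack = sum(a_max - a for a in powers)
    r_l = [(a / total) * r_l_total if total else 0.0 for a in powers]
    r_pw = [((a_max - a) / slack) * r_pw_total if slack else 0.0 for a in powers]
    return r_l, r_pw

powers = [7.0, 3.0, 0.0]                 # current charger outputs
r_l_total, peak = demand_charge_step(powers, peak_load=8.0, p_l=20.0,
                                     horizon_over_billing=1.0 / 30.0)
print(split_rewards(powers, r_l_total, r_pw_total=-1.2, a_max=7.0))
```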
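A compact sketch of the update in (17) is given below, written with PyTorch. The single-output Q(s, a) network follows the structure described above, but the hidden-layer sizes, the discrete action list, and the toy data are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q(s, a; theta): a single output neuron over the concatenated (state, action)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, s, a):                       # a has shape (batch, 1)
        return self.net(torch.cat([s, a], dim=1)).squeeze(-1)

def dqn_loss(q, q_target, batch, actions, gamma=0.99):
    """Loss (17): (r + gamma*(1-d)*max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2."""
    s, a, r, s_next, d = batch                     # tensors sampled from the replay memory
    with torch.no_grad():                          # evaluate the target net on every discrete action
        q_next = torch.stack([q_target(s_next, torch.full_like(a, act))
                              for act in actions], dim=1)
        target = r + gamma * (1.0 - d) * q_next.max(dim=1).values
    return nn.functional.mse_loss(q(s, a), target)

# Toy usage with random data: 5-dimensional state, actions in {-6, 0, 6} kW.
q, q_target = QNet(5), QNet(5)
q_target.load_state_dict(q.state_dict())
batch = (torch.randn(32, 5), torch.zeros(32, 1), torch.randn(32),
         torch.randn(32, 5), torch.zeros(32))
loss = dqn_loss(q, q_target, batch, actions=[-6.0, 0.0, 6.0])
loss.backward()
```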
D. Centralized Allocation at the Charging Station Level

At each time step, the charging station scheduler needs to allocate the EVs to the chargers and waiting spots to maximize its own profit. Instead of directly solving the optimization problem formulated in Section II-D, we propose a centralized EV allocation algorithm that leverages the action value function Q learned for the chargers in the previous time step.

Suppose at time t, we have N_t^EV EVs in the charging station. An EV can either be connected to one of the chargers or parked in the waiting area. Let a binary variable α_k denote whether the kth EV is connected to a charger. α_k = 1 indicates that the kth EV is connected to a charger, and such a connection will create an action value q_k^c = max_a Q(s(α_k = 1), a), where Q is the action value function learned in the previous time step and s(α_k = 1) denotes the state of the charger when it is connected to the kth EV. On the other hand, when the kth EV is parked on a waiting spot, it is equivalent to being connected to a charger with zero power output, in which case the connection will create an action value q_k^w = Q(s(α_k = 1), a = 0). In summary, the action value created by the connection of the kth EV is α_k q_k^c + (1 - α_k) q_k^w. Note that for the state vector s(α_k = 1), ∀k ∈ {1, 2, . . . , N_t^EV}, the remaining energy to be charged for all EVs in the waiting area can be approximated by its value in the previous time step.

The charging station's scheduling problem can be solved by finding the EV allocation that maximizes the summation of the action values created by the connections of all EVs while satisfying the space limitations, as shown in (18), (19a), and (19b):

\max_{\alpha_k} \sum_{k=1}^{N_t^{EV}} \alpha_k q_k^c + (1 - \alpha_k) q_k^w    (18)

subject to:

\sum_{k=1}^{N_t^{EV}} \alpha_k = \min\left(N^c, N_t^{EV}\right),    (19a)

\alpha_k \in \{0, 1\}, \quad \forall k \in \{1, 2, \ldots, N_t^{EV}\},    (19b)

Equation (19a) enforces the constraint that at most N^c EVs are connected to chargers if N_t^EV > N^c. If N_t^EV ≤ N^c, then all EVs should be allocated to the charging area.

E. Summary of CADE Framework

By synergistically combining the centralized allocation and decentralized execution modules, the CADE framework can be summarized by Algorithm 2, in which line 10 corresponds to the centralized allocation module and lines 11-25 correspond to the decentralized execution module in the training phase. It is worth noting that in the deployment phase, when no more exploration or training is required, the decentralized execution module only includes lines 11, 17, and 21. Since we assume that the chargers in the charging station are homogeneous, they all share the same parameter vector θ. The experiences from different chargers are stored in the same replay memory. The ε-greedy technique is deployed during the training process to ensure sufficient exploration.

Algorithm 2: Summary of the CADE Framework
1 Initialize exploration rate ε = 1;
2 Initialize replay memory M;
3 Initialize state-action value function network Q(s, a; θ) with random weights θ;
4 Initialize target network Q(s, a; θ^-) with θ^- = θ;
5 for episode = 1 : n do
6   Initialize t = 0 and N_t^EV = 0;
7   Initialize maximum loads L_ht = 0, ∀h ∈ H;
8   while t < T do
9     If N_t^EV < N, admit arriving EVs to the charging station;
10    Allocate EVs to chargers and the waiting area by solving optimization problem (18)-(19b), where Q(s, a; θ) is utilized to generate q_k^c and q_k^w, ∀k ∈ {1, 2, . . . , N_t^EV};
11    for j ∈ J do
12      Obtain s_jt for charger j;
13      Draw x from uniform distribution U(0, 1);
14      if x < ε then
15        Randomly select a_jt ∈ A_jt;
16      else
17        Select a_jt^* = arg max_{a_jt} Q(s_jt, a_jt; θ);
18      end
19      Calculate r_jt^b;
20      The environment transitions to the next state s_j(t+1);
21    end
22    for j ∈ J do
23      Calculate r_jt^p, r_jt^l, and finally r_jt;
24      Add (s_jt, a_jt, s_j(t+1), r_jt) to M;
25    end
26    Sample a mini-batch m from M;
27    Update θ by using the gradient descent method to minimize the loss function in (17);
28    Decrease exploration rate ε;
29    Set θ^- = θ every K episodes;
30    t += Δt;
31  end
32 end
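Because each α_k enters (18) independently and (19a) only fixes how many EVs are connected, one way to solve the allocation problem of Section III-D is to rank the EVs by the gain q_k^c - q_k^w and connect the top min(N^c, N_t^EV) of them, as sketched below. The Q values in the example are hypothetical numbers, and the sketch is an illustrative solution rather than the authors' code.

```python
def allocate(q_charge, q_wait, n_chargers):
    """Solve (18)-(19b): connect the EVs with the largest gain q^c_k - q^w_k."""
    n_connect = min(n_chargers, len(q_charge))
    order = sorted(range(len(q_charge)),
                   key=lambda k: q_charge[k] - q_wait[k], reverse=True)
    alpha = [0] * len(q_charge)
    for k in order[:n_connect]:
        alpha[k] = 1
    return alpha

# Hypothetical action values for 5 EVs and 3 chargers.
q_c = [4.2, 1.1, 3.8, 0.9, 2.5]   # max_a Q(s(alpha_k = 1), a)
q_w = [0.5, 0.8, 3.6, 0.7, 0.3]   # Q(s(alpha_k = 1), a = 0)
print(allocate(q_c, q_w, n_chargers=3))   # -> [1, 1, 0, 0, 1]
```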
IV. NUMERICAL STUDY

In this section, we evaluate the performance of the proposed CADE framework in managing the operations of charging stations. A comprehensive comparison between the proposed and baseline algorithms in terms of optimality and scalability is conducted. We also examine the learned action value function under different operation scenarios. The training of the proposed RL algorithm is performed on a GPU (Nvidia GeForce RTX 2080 Ti). We leveraged a desktop with an AMD Ryzen 7 8-core CPU and the Gurobi solver to execute the MPC-based baseline algorithm.

A. Setup of the Numerical Study

It is assumed that the arrival of EVs at the charging station follows the Poisson process. The EV arrival patterns in four different locations are evaluated: office zone, residential area, highway, and retail stores. The arrival rates per hour λ of the four locations are shown in Fig. 2. For instance, a charging station near the residential area may expect higher EV arrival rates in the afternoon. Highway charging stations may see several peaks of arrival rates in a typical day [48] and the retail

TABLE I: ELECTRICITY PRICE AND DEMAND CHARGE

TABLE II: CHARGING STATION AND ENVIRONMENT PARAMETERS
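For the arrival model described in Section IV-A, the snippet below draws one day of EV arrivals from a nonhomogeneous Poisson process with piecewise-constant hourly rates. The rate values are placeholders, not the rates shown in Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder hourly arrival rates (EVs/hour) for a 24-hour day.
hourly_rates = [1, 1, 1, 1, 2, 3, 5, 8, 9, 7, 5, 4,
                4, 4, 5, 6, 8, 9, 8, 6, 4, 3, 2, 1]

def sample_arrivals(rates, dt=0.25):
    """Number of arriving EVs per time step; Poisson(lambda_h * dt) within hour h."""
    steps_per_hour = int(round(1.0 / dt))
    return [rng.poisson(lam * dt)
            for lam in rates for _ in range(steps_per_hour)]

arrivals = sample_arrivals(hourly_rates)
print(len(arrivals), sum(arrivals))   # 96 steps, total EVs for the day
```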
C. Profit Comparison With Baseline Methods

We compare the charging station profit obtained by the proposed CADE framework and four baseline methods. The first baseline method is the MPC-based approach, where the one-hour and two-hour ahead EV arrivals are predicted based on the ground-truth Poisson distribution. The second baseline is the MPC-ideal method, which assumes the charging station has perfect prediction of two-hour-ahead EV arrivals. For both the MPC and MPC-ideal methods, at each time step, the optimization problem (10)-(11g) is solved by replacing the original time horizon T with the prediction horizon, and the solution for this time step is executed. The third and fourth baselines are greedy methods. The GRD method assigns EVs to chargers based on the urgency of their charging needs. The urgency is measured by e^r/t^r, the minimum charging power required to fulfill the energy demand upon an EV's departure. If an EV has a higher required minimum charging power, it has a higher urgency, which leads to a higher priority to be allocated to a charger. The chargers then charge EVs greedily at maximal power during off-peak hours, and discharge EVs evenly during on-peak hours to a level (no less than e^tgt) that avoids penalty. If discharging the EV battery is not allowed, then the method is denoted as GRD-noVGI. All results are obtained with N^c = 10 and N^w = 5. The proposed and baseline methods are evaluated using 30 randomly generated trajectories from each of the 4 different EV arrival patterns. As shown in Fig. 5, the proposed CADE framework outperforms all baseline methods.

D. Scalability Analysis

In this subsection, we analyze the scalability of the proposed method. We evaluate the performance of CADE under 4 different charging station sizes with the number of chargers N^c ∈ {10, 20, 50, 100}. For each case, N^w is kept at 0.5N^c. The office zone EV arrival pattern is used and scaled linearly with the number of chargers. The average charging station and per-charger profit under the 4 different cases are reported in Fig. 6. As the charging station size increases, the average daily profit per charger stays in a stable range from $7.8 to $8.4. This result shows that the proposed CADE framework can efficiently manage the operations of a large group of chargers.

The proposed CADE framework not only outperforms baseline methods in terms of profit but also achieves high computation efficiency. The computation times of the proposed and MPC-based methods are listed in Table III. As shown in the table, the computation time of the proposed CADE framework is much shorter than that of the MPC-based method. The decentralized execution scheme of CADE enables its computation time to increase linearly with the number of chargers rather than exponentially.

TABLE III: COMPARISON OF COMPUTATION TIME PER STEP

E. Interpretation of the Learned State-Action Values

In our proposed CADE framework, the state-action value functions are represented by neural networks, which are difficult to interpret. To better understand how the trained RL-based chargers make decisions under different scenarios, we analyze the learned state-action values. Specifically, we change one component in the state vector at a time while fixing the rest to quantify how this component impacts the distribution of Q values.
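The sensitivity analysis described above can be reproduced with a short sweep: vary one state component over a grid, evaluate Q(s, a) for every discrete action, and record the greedy action. The sketch below uses a toy stand-in for the trained Q-network and the state layout from the earlier example; both are illustrative assumptions.

```python
import numpy as np

def q_sensitivity(q_fn, base_state, component, values, actions):
    """For each value of one state component, return (Q values, greedy action)."""
    results = []
    for v in values:
        s = np.array(base_state, dtype=float)
        s[component] = v
        q_vals = [q_fn(s, a) for a in actions]
        results.append((q_vals, actions[int(np.argmax(q_vals))]))
    return results

def toy_q(s, a):
    """Toy stand-in Q: prefer the power that exactly finishes charging on time."""
    needed = s[4] / max(s[2] * 0.25, 0.25)   # remaining energy / remaining hours
    return -(a - needed) ** 2

base = [1.0, 36.0, 8.0, 22.0, 8.0, 3.0, 21.5, 0.0, 1.0, 0.0, 45.0]
sweep = q_sensitivity(toy_q, base, component=2, values=[1, 4, 8, 16],
                      actions=[-6.0, -3.0, 0.0, 3.0, 6.0])
for t_r, (_, best_a) in zip([1, 4, 8, 16], sweep):
    print(f"t^r = {t_r:>2}: greedy action = {best_a} kW")
```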
Fig. 7. Q values of four scenarios with different (a) time-of-use period h, (b) remaining charging time t^r, (c) number of EVs in the waiting area N_t^{EV,w}, and (d) historical maximum loads L_h.

Fig. 7(a)-(d) illustrate the Q values of four scenarios with different (a) time-of-use period h, (b) remaining charging time t^r, (c) number of EVs in the waiting area N_t^{EV,w} = N_t^EV - N^c, and (d) historical maximum loads L_h. The vertical blue lines indicate the maximum Q values and the associated actions. As shown in Fig. 7(a), the RL-based charger prefers discharging in on-peak hours and charging in off-peak hours to increase operating profit. Fig. 7(b) shows that when the remaining charging time t^r is close to 0, i.e., the EV is about to leave, the RL agent prefers higher charging powers in order to meet the target battery energy level. In Fig. 7(c), we can see that higher charging powers are preferred when there are more EVs in the waiting area. As shown in Fig. 7(d), if a higher historical maximum load L_h is recorded, the chargers are less concerned about incurring an even higher demand charge, which leads to a preference for larger charging power.

V. CONCLUSION

In this paper, a centralized allocation and decentralized execution (CADE) framework is developed to operate a charging station considering vehicle-grid integration. The objective of the proposed algorithm is to maximize the profit of the charging station by considering the net revenue of charging and discharging, as well as the operational costs associated with demand charge and penalty. A centralized optimization module handles the allocation of EVs among waiting and charging spots. Enabled by the reinforcement learning algorithm, the chargers collect and share operational experiences, which are used to learn charger control policies. Comprehensive numerical studies with different EV arrival patterns show that the proposed CADE framework outperforms state-of-the-art baseline algorithms. The scalability analysis shows that the CADE framework is more computationally efficient than the baseline model-based control algorithm. Detailed analysis of the learned state-action value function provides insights into how an RL-based charger makes charging and discharging decisions under different operational scenarios.

REFERENCES

[1] Y. Kim, J. Kwak, and S. Chong, "Dynamic pricing, scheduling, and energy management for profit maximization in PHEV charging stations," IEEE Trans. Veh. Technol., vol. 66, no. 2, pp. 1011-1026, Feb. 2017.
[2] S. Wang, S. Bi, Y.-J. A. Zhang, and J. Huang, "Electrical vehicle charging station profit maximization: Admission, pricing, and online scheduling," IEEE Trans. Sustain. Energy, vol. 9, no. 4, pp. 1722-1731, Oct. 2018.
[3] H. Zhang, Z. Hu, Z. Xu, and Y. Song, "Optimal planning of PEV charging station with single output multiple cables charging spots," IEEE Trans. Smart Grid, vol. 8, no. 5, pp. 2119-2128, Sep. 2017.
[4] P. You and Z. Yang, "Efficient optimal scheduling of charging station with multiple electric vehicles via V2V," in Proc. IEEE Int. Conf. Smart Grid Commun. (SmartGridComm), 2014, pp. 716-721.
[5] Q. Dong, D. Niyato, P. Wang, and Z. Han, "The PHEV charging scheduling and power supply optimization for charging stations," IEEE Trans. Veh. Technol., vol. 65, no. 2, pp. 566-580, Feb. 2016.
[6] Q. Huang, Q.-S. Jia, Z. Qiu, X. Guan, and G. Deconinck, "Matching EV charging load with uncertain wind: A simulation-based policy improvement approach," IEEE Trans. Smart Grid, vol. 6, no. 3, pp. 1425-1433, May 2015.
[7] J. Domínguez-Navarro, R. Dufo-López, J. Yusta-Loyo, J. Artal-Sevil, and J. Bernal-Agustín, "Design of an electric vehicle fast-charging station with integration of renewable energy and storage systems," Int. J. Electr. Power Energy Syst., vol. 105, pp. 46-58, Feb. 2019.
[8] Y. Deng, Y. Zhang, F. Luo, and Y. Mu, "Operational planning of centralized charging stations utilizing second-life battery energy storage systems," IEEE Trans. Sustain. Energy, vol. 12, no. 1, pp. 387-399, Jan. 2021.
[9] Q. Yan, B. Zhang, and M. Kezunovic, "Optimized operational cost reduction for an EV charging station integrated with battery energy storage and PV generation," IEEE Trans. Smart Grid, vol. 10, no. 2, pp. 2096-2106, Mar. 2019.
[10] A. S. Awad, M. F. Shaaban, T. H. El-Fouly, E. F. El-Saadany, and M. M. Salama, "Optimal resource allocation and charging prices for benefit maximization in smart PEV-parking lots," IEEE Trans. Sustain. Energy, vol. 8, no. 3, pp. 906-915, Jul. 2017.
[11] L. Gan, U. Topcu, and S. H. Low, "Optimal decentralized protocol for electric vehicle charging," IEEE Trans. Power Syst., vol. 28, no. 2, pp. 940-951, May 2013.
[12] R. Yu, J. Ding, W. Zhong, Y. Liu, and S. Xie, "PHEV charging and discharging cooperation in V2G networks: A coalition game approach," IEEE Internet Things J., vol. 1, no. 6, pp. 578-589, Dec. 2014.
[13] C. Wei, J. Xu, S. Liao, and Y. Sun, "Aggregation and scheduling models for electric vehicles in distribution networks considering power fluctuations and load rebound," IEEE Trans. Sustain. Energy, vol. 11, no. 4, pp. 2755-2764, Oct. 2020.
[14] X. Tang, S. Bi, and Y.-J. A. Zhang, "Distributed routing and charging scheduling optimization for Internet of Electric Vehicles," IEEE Internet Things J., vol. 6, no. 1, pp. 136-148, Feb. 2019.
[15] A.-M. Koufakis, E. S. Rigas, N. Bassiliades, and S. D. Ramchurn, "Towards an optimal EV charging scheduling scheme with V2G and V2V energy transfer," in Proc. IEEE Int. Conf. Smart Grid Commun. (SmartGridComm), 2016, pp. 302-307.
[16] H. K. Nguyen and J. B. Song, "Optimal charging and discharging for multiple PHEVs with demand side management in vehicle-to-building," J. Commun. Netw., vol. 14, no. 6, pp. 662-671, 2012.
[17] G. Dulac-Arnold et al., "Deep reinforcement learning in large discrete action spaces," 2015, arXiv:1512.07679.
[18] S. Dimitrov and R. Lguensat, "Reinforcement learning based algorithm for the maximization of EV charging station revenue," in Proc. Int. Conf. Math. Comput. Sci. Ind., 2014, pp. 235-239.
[19] A. Chiş, J. Lundén, and V. Koivunen, "Reinforcement learning-based plug-in electric vehicle charging with forecasted price," IEEE Trans. Veh. Technol., vol. 66, no. 5, pp. 3674-3684, May 2017.
[20] Z. Wan, H. Li, H. He, and D. Prokhorov, "Model-free real-time EV charging scheduling based on deep reinforcement learning," IEEE Trans. Smart Grid, vol. 10, no. 5, pp. 5246-5257, Sep. 2019.
[21] L. Yan, X. Chen, J. Zhou, Y. Chen, and J. Wen, "Deep reinforcement learning for continuous electric vehicles charging control with dynamic user behaviors," IEEE Trans. Smart Grid, vol. 12, no. 6, pp. 5124-5134, Nov. 2021.
[22] M. Dorokhova, Y. Martinson, C. Ballif, and N. Wyrsch, "Deep reinforcement learning control of electric vehicle charging in the presence of photovoltaic generation," Appl. Energy, vol. 301, Nov. 2021, Art. no. 117504.
[23] F. Chang, T. Chen, W. Su, and Q. Alsafasfeh, "Charging control of an electric vehicle battery based on reinforcement learning," in Proc. 10th Int. Renew. Energy Congr. (IREC), 2019, pp. 1-63.
[24] H. Li, Z. Wan, and H. He, "Constrained EV charging scheduling based on safe deep reinforcement learning," IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 2427-2439, May 2020.
[25] F. Zhang, Q. Yang, and D. An, "CDDPG: A deep-reinforcement-learning-based approach for electric vehicle charging control," IEEE Internet Things J., vol. 8, no. 5, pp. 3075-3087, Mar. 2021.
[26] F. Wang, J. Gao, M. Li, and L. Zhao, "Autonomous PEV charging scheduling using Dyna-Q reinforcement learning," IEEE Trans. Veh. Technol., vol. 69, no. 11, pp. 12609-12620, Nov. 2020.
[27] S. Li et al., "Electric vehicle charging management based on deep reinforcement learning," J. Mod. Power Syst. Clean Energy, early access, Jun. 25, 2021, doi: 10.35833/MPCE.2020.000460.
[28] S. Vandael, B. Claessens, D. Ernst, T. Holvoet, and G. Deconinck, "Reinforcement learning of heuristic EV fleet charging in a day-ahead electricity market," IEEE Trans. Smart Grid, vol. 6, no. 4, pp. 1795-1805, Jul. 2015.
[29] S. Wang, S. Bi, and Y. J. A. Zhang, "Reinforcement learning for real-time pricing and scheduling control in EV charging stations," IEEE Trans. Ind. Informat., vol. 17, no. 2, pp. 849-859, Feb. 2021.
[30] N. Sadeghianpourhamami, J. Deleu, and C. Develder, "Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning," IEEE Trans. Smart Grid, vol. 11, no. 1, pp. 203-214, Jan. 2020.
[31] T. Ding, Z. Zeng, J. Bai, B. Qin, Y. Yang, and M. Shahidehpour, "Optimal electric vehicle charging strategy with Markov decision process and reinforcement learning technique," IEEE Trans. Ind. Appl., vol. 56, no. 5, pp. 5811-5823, Sep./Oct. 2020.
[32] F. L. Da Silva, C. E. H. Nishida, D. M. Roijers, and A. H. R. Costa, "Coordination of electric vehicle charging through multiagent reinforcement learning," IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 2347-2356, May 2020.
[33] F. Tuchnitz, N. Ebell, J. Schlund, and M. Pruckner, "Development and evaluation of a smart charging strategy for an electric vehicle fleet based on reinforcement learning," Appl. Energy, vol. 285, Mar. 2021, Art. no. 116382.
[34] M. Shin, D.-H. Choi, and J. Kim, "Cooperative management for PV/ESS-enabled electric vehicle charging stations: A multiagent deep reinforcement learning approach," IEEE Trans. Ind. Informat., vol. 16, no. 5, pp. 3493-3503, May 2019.
[35] S. Lee and D.-H. Choi, "Dynamic pricing and energy management for profit maximization in multiple smart electric vehicle charging stations: A privacy-preserving deep reinforcement learning approach," Appl. Energy, vol. 304, Dec. 2021, Art. no. 117754.
[36] Z. Zhao and C. K. M. Lee, "Dynamic pricing for EV charging stations: A deep reinforcement learning approach," IEEE Trans. Transp. Electrific., early access, Dec. 30, 2021, doi: 10.1109/TTE.2021.3139674.
[37] A. Abdalrahman and W. Zhuang, "Dynamic pricing for differentiated PEV charging services using deep reinforcement learning," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 2, pp. 1415-1427, Feb. 2022.
[38] V. Moghaddam, A. Yazdani, H. Wang, D. Parlevliet, and F. Shahnia, "An online reinforcement learning approach for dynamic pricing of electric vehicle charging stations," IEEE Access, vol. 8, pp. 130305-130313, 2020.
[39] Y. Gao, J. Yang, M. Yang, and Z. Li, "Deep reinforcement learning based optimal schedule for a battery swapping station considering uncertainties," IEEE Trans. Ind. Appl., vol. 56, no. 5, pp. 5775-5784, Sep./Oct. 2020.
[40] X. Wang, J. Wang, and J. Liu, "Vehicle to grid frequency regulation capacity optimal scheduling for battery swapping station using deep Q-network," IEEE Trans. Ind. Informat., vol. 17, no. 2, pp. 1342-1351, Feb. 2021.
[41] B. Foggo and N. Yu, "Improved battery storage valuation through degradation reduction," IEEE Trans. Smart Grid, vol. 9, no. 6, pp. 5721-5732, Nov. 2018.
[42] N. Yu and B. Foggo, "Stochastic valuation of energy storage in wholesale power markets," Energy Econ., vol. 64, pp. 177-185, May 2017.
[43] P. Kou, D. Liang, L. Gao, and F. Gao, "Stochastic coordination of plug-in electric vehicles and wind turbines in microgrid: A model predictive control approach," IEEE Trans. Smart Grid, vol. 7, no. 3, pp. 1537-1551, May 2016.
[44] W. Tang and Y. J. Zhang, "A model predictive control approach for low-complexity electric vehicle charging scheduling: Optimality and scalability," IEEE Trans. Power Syst., vol. 32, no. 2, pp. 1050-1063, Mar. 2017.
[45] G. Binetti, A. Davoudi, D. Naso, B. Turchiano, and F. L. Lewis, "Scalable real-time electric vehicles charging with discrete charging rates," IEEE Trans. Smart Grid, vol. 6, no. 5, pp. 2211-2220, Sep. 2015.
[46] B. Sun, Z. Huang, X. Tan, and D. H. K. Tsang, "Optimal scheduling for electric vehicle charging with discrete charging levels in distribution grid," IEEE Trans. Smart Grid, vol. 9, no. 2, pp. 624-634, Mar. 2018.
[47] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[48] M. B. Arias, M. Kim, and S. Bae, "Prediction of electric vehicle charging-power demand in realistic urban traffic networks," Appl. Energy, vol. 195, pp. 738-753, Jun. 2017.

Zuzhao Ye (Graduate Student Member, IEEE) received the B.E. degree in thermal energy and power engineering from the University of Science and Technology of China, Hefei, China, in 2015. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of California at Riverside, Riverside, CA, USA. His research interests include big data analytics, machine learning, and optimization, particularly in their applications to the planning and operation of electric-vehicle charging infrastructure.

Yuanqi Gao (Member, IEEE) received the B.E. degree in electrical engineering from Donghua University, Shanghai, China, in 2015, and the Ph.D. degree in electrical engineering from the University of California at Riverside, Riverside, CA, USA, in 2020, where he is currently a Postdoctoral Scholar with the Department of Electrical and Computer Engineering. His research interests include big data analytics and machine learning applications in smart grids.

Nanpeng Yu (Senior Member, IEEE) received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2006, and the M.S. and Ph.D. degrees in electrical engineering from Iowa State University, Ames, IA, USA, in 2007 and 2010, respectively. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, University of California at Riverside, Riverside, CA, USA. His current research interests include machine learning in smart grid, electricity market design and optimization, and smart energy communities. He is an Editor of IEEE TRANSACTIONS ON SMART GRID and IEEE TRANSACTIONS ON SUSTAINABLE ENERGY.