0% found this document useful (0 votes)
35 views14 pages

Multi-Objective Interval Optimization Dispatch of Microgrid Via Deep Reinforcement Learning

This paper introduces an improved deep reinforcement learning (DRL) algorithm for the optimal dispatch of microgrids under uncertainties, specifically focusing on a multi-objective interval optimization dispatch (MIOD) model. The proposed triplet-critics comprehensive experience replay soft actor-critic (TCSAC) algorithm aims to enhance performance by reducing estimation bias and improving decision-making in the face of fluctuating renewable energy outputs. Simulation results demonstrate the effectiveness of the TCSAC algorithm in optimizing microgrid operations compared to existing DRL methods.

Uploaded by

hywang100
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views14 pages

Multi-Objective Interval Optimization Dispatch of Microgrid Via Deep Reinforcement Learning

This paper introduces an improved deep reinforcement learning (DRL) algorithm for the optimal dispatch of microgrids under uncertainties, specifically focusing on a multi-objective interval optimization dispatch (MIOD) model. The proposed triplet-critics comprehensive experience replay soft actor-critic (TCSAC) algorithm aims to enhance performance by reducing estimation bias and improving decision-making in the face of fluctuating renewable energy outputs. Simulation results demonstrate the effectiveness of the TCSAC algorithm in optimizing microgrid operations compared to existing DRL methods.

Uploaded by

hywang100
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO.

3, MAY 2024 2957

Multi-Objective Interval Optimization Dispatch of


Microgrid via Deep Reinforcement Learning
Chaoxu Mu , Senior Member, IEEE, Yakun Shi, Na Xu, Xinying Wang , Senior Member, IEEE, Zhuo Tang,
Hongjie Jia , Senior Member, IEEE, and Hua Geng , Fellow, IEEE

Abstract—This paper presents an improved deep reinforcement PMTj Active output of the jth MT.
learning (DRL) algorithm for solving the optimal dispatch of PFCk Active output of the kth FC.
microgrids under uncertaintes. First, a multi-objective interval
Lk Branch stability factor of the branch.
optimization dispatch (MIOD) model for microgrids is con-
structed, in which the uncertain power output of wind and PG,i Active power injected by node i.
photovoltaic (PV) is represented by interval variables. The QG,i Reactive power injected by node i.
economic cost, network loss, and branch stability index for Pload,i Active load demand of node i.
microgrids are also optimized. The interval optimization is Qload,i Reactive load demand of node i.
modeled as a Markov decision process (MDP). Then, an improved
SOC State of charge of the ESS.
DRL algorithm called triplet-critics comprehensive experience
replay soft actor-critic (TCSAC) is proposed to solve it. Finally, Pct Charging power of the ESS.
simulation results of the modified IEEE 118-bus microgrid Pdt Discharging power of the ESS.
validate the effectiveness of the proposed approach.
Index Terms—Deep reinforcement learning, microgrid, uncer- Parameters
tainty, interval optimization, experience replay. ND The number of buses.
NG The number of generators.
NT The number of transformer branches.
N OMENCLATURE M The number of WT.
N The number of PV.
Variables
NDE The number of DEs.
PGESS Regulated power of the ESS. a, b, c The cost coefficients of the constant and the
VL Voltage amplitude of the bus. primary term and secondary term, respectively.
QG Reactive power output of the generator. NMT The number of MTs.
PG Active power output of the generator. NFC The number of FCs.
VG Generator node voltage cng The price of gas.
T Transformer tap ratio. ηMTj Efficiency of the jth MT.
PWT Uncertain power output of WT. QLHV low heating value of natural gas.
PPV Uncertain power output of PV. ηFCk The efficiency of the kth FC.
CDE Cost of power generation from diesel generators. Nbranch The total number of network branches.
CMT Cost of power generation from micro turbines. Gij The conductance of the branches.
CFC Cost of power generation from fuel cells. Bij The capacitance of the branches.
PDEi Active output of the ith DE. Ui , Uj The voltage at nodes i and j, respectively.
Manuscript received 12 April 2023; revised 28 August 2023; accepted θi , θj The voltage phase angles at nodes i and j,
8 October 2023. Date of publication 6 December 2023; date of current respectively.
version 23 April 2024. This work was supported in part by the National X The reactance of the branch.
Natural Science Foundation of China under Grant 62022061, and in part by
the Science and Technology Project of the State Grid Corporation of China Pi The active power injected at the first node.
(SGCC) under Grant 5700-202212197A-1-1-ZN. Paper no. TSG-00540-2023. Qj The reactive power output at the end node.
(Corresponding author: Chaoxu Mu.) ηc The charging efficiency.
Chaoxu Mu, Yakun Shi, Na Xu, Zhuo Tang, and Hongjie Jia are with the
School of Electrical and Information Engineering, Tianjin University, Tianjin ηd The discharging efficiency.
300072, China (e-mail: cxmu@[Link]; syk@[Link]; naxu@[Link]; α Temperature coefficient of entropy
tangzhuo@[Link]; hjjia@[Link]). γ Discount factor that penalizes future rewards
Xinying Wang is with the Artificial Intelligence Application Research
Department, China Electric Power Research Institute, Beijing 100000, China λ The weight of a single critic.
(e-mail: wangxinying@[Link]). τ Soft update parameter
Hua Geng is with the Department of Automation and the Beijing
National Research Center for Information Science and Technology, Tsinghua
University, Beijing 100084, China (e-mail: genghua@[Link]) I. I NTRODUCTION
Color versions of one or more figures in this article are available at
HE WORLDWIDE energy crisis and environmen-
[Link]
Digital Object Identifier 10.1109/TSG.2023.3339541 T tal deterioration are both gaining prominence. New
1949-3053 
c 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See [Link] for more information.
Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2958 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

power generation technologies geared toward economical to fall into local optimal solutions, which do not necessarily
and efficient renewable energy sources are rapidly devel- guarantee the feasibility and optimality of the resulting solu-
oping. Microgrid, as a key component of the smart grid, tion. Moreover, the traditional optimization algorithms have a
may effectively integrate renewable energy sources, enhance long computation time, the optimization results obtained from
their penetration, and maximize their use in the power repeated calculations are not unique and less robust, and they
grid [1], [2], [3]. However, because of the natural environ- need to be re-computed every time a scheduling decision is
mental elements such as wind speed, light, and temperature, made. Therefore, it remains a great challenge to optimally
wind and solar power generation are prone to sporadic and dispatch microgrids for their efficient operation under the
unpredictable fluctuations [4]. uncertainty of renewable energy generation [23], [24], [25].
In fact, the optimal dispatch of microgrids is a complex Reinforcement learning (RL) provides an excellent idea
multi-objective optimization problem, and a complete descrip- for solving the optimal operation of microgrids in complex
tion of the operation of microgrids cannot be achieved by simply state spaces, and RL is considered a control scheme that can
considering operating costs or environmental benefits [5], [6], significantly improve current conventional methods [26]. DRL
[7], [8]. Microgrid multi-objective optimal dispatch (MOD) can extract more data features that are not considered in the
solutions generally include operating costs, network losses, numerical model due to the powerful data processing and rep-
and system security. It is critical to develop a multi-objective resentation capabilities of deep neural networks, and it can also
optimal dispatch model that takes into account multiple better deal with high-dimensional data in microgrid schedul-
uncertainties based on random fluctuations in renewable energy ing, overcoming the “dimensionality catastrophe” caused by
output and effectively solves this multi-objective optimization the large decision space for optimization problem solving. In
problem [9], [10]. addition, the DRL approach does not require repetitive training
The focus or difficulty in solving uncertainty optimization of the agent, and the trained agent is able to make decisions
problems is how to deal with uncertainties. Fuzzy optimization, in seconds, thus allowing real-time online decision-making
stochastic optimization, robust optimization, and interval based on the current state of the microgrid [27]. DRL is a
optimization are often used in the optimal dispatch of microgrids branch of machine learning that aims to interact in uncertain
to deal with uncertainties [11], [12]. The objective function, environments and make decisions by continuously revising
constraints, and model coefficients in fuzzy optimization need strategies guided by rewards. DRL collects data from the
to be fuzzified, which is more difficult to achieve [13], [14]. environment to learn control strategies without relying on an
Stochastic optimization may have a bias in the empirical accurate model of the environment [28].
distribution due to the limited amount of historical data [15]. It is Many applications of reinforcement learning already exist
difficult to obtain accurate and reliable probability distributions in power systems, including microgrids, demand response,
for some of the uncertainty parameters, which affect the electrical equipment fault detection, power system analysis and
multi-objective optimal scheduling of microgrids [16]. Robust control, network security, and renewable energy generation
optimization leads to conservatism in its solutions because it forecasting [29]. Two algorithms were developed based on
provides optimal planning solutions that preserve the worst-case learning automata to find Nash equilibrium by establishing a
scenario in the set of uncertainties [17], [18]. relationship between average utility maximization and optimal
The main idea of interval optimization is to represent data policies [30]. A multi-agent reinforcement learning algo-
with uncertainty as intervals. The goal is to find the interval rithm introducing graph convolutional networks was proposed,
variables that optimize the objective function, provided that thereby facilitating the generation of agent voltage control
the interval variables satisfy their respective constraints [19]. strategies [31]. An improved DRL method was proposed
When dealing with uncertainty in random variables such as to achieve dynamic and optimal energy allocation [32].
wind power and PV, interval optimization does not require In [33], a multi-objective optimal dispatch model for a
knowledge of its affiliation function or probability distribution, microgrid cluster was developed based on the asynchronous
but only the upper and lower bounds of the random variables, advantage actor-critic (A3C) algorithm. A multiagent-based
which are expressed as intervals, and then the model is model was used for the distributed energy management of
transformed into a deterministic model [20]. In [21], a non- microgrids [34].
linear interval optimization (NIO) model was proposed for Although there is a lot of current work that has made
solving optimal power system dispatch with uncertain wind significant contributions to microgrid optimal dispatch, there
power integration. In [22], a multi-objective optimal schedul- is still a lot of room for DRL to be explored in the field of
ing model for microgrids with uncertainty was developed using microgrid optimal dispatch. On the one hand, to the best of
interval optimization, which has extended interval optimization our knowledge, there is no relevant work on applying DRL
to multi-objective microgrid scheduling for the first time. algorithms to multi-objective interval optimal scheduling for
The uncertainty of wind and solar energy poses many microgrids. On the other hand, DRL algorithms originating
challenges to power systems. Multi-objective optimal dispatch from the Q-learning class suffer from value estimation bias,
problems for microgrids usually contain a large state space and such as deep deterministic policy gradient (DDPG), twin
an action space. Traditional intelligent optimization algorithms delayed deep deterministic policy gradient (TD3), and soft
deal with such a large search space that the computational actor-critic (SAC), and the standard DRL algorithms suffer
effort grows exponentially with uncertainty fluctuations. As a from sampling inefficiencies and poor feature extraction capa-
result, many intelligent algorithms converge slowly and tend bilities during the training process of the network.

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
MU et al.: MIOD OF MICROGRID VIA DRL 2959

To address the limitations of the existing work, this paper


proposes an improved DRL algorithm and applies it to solve a
multi-objective interval optimal dispatch model for microgrids.
The algorithm introduces a triple critic mechanism to reduce
the estimation bias and a novel priority ranking scheme to
improve the algorithm’s performance. In this paper, we name it
the TCSAC algorithm. It is also a DRL algorithm based on the
Maximum Entropy Reinforcement Learning (MERL) frame-
work, which encourages policy exploration through maximum
entropy and avoids falling into a local optimum by repeatedly Fig. 1. Schematic diagram of a typical microgrid system.
choosing the same action. The main contributions of the paper
are threefold.
1) A multi-objective interval optimization dispatching photovoltaics (PVs), diesel engines (DEs), micro turbines
model is constructed for microgrids with renewable energy (MTs), fuel cells (FCs), an energy storage system (ESS), and
uncertainty. For the first time, a microgrid interval optimization loads. DEs, MTs and FCs are dispatchable units; WTs and PVs
dispatching framework based on DRL is established, and an are non-dispatchable units, ESS is responsible for smoothing
MDP is established from the microgrid interval optimization. out fluctuations caused by renewable energy sources and
Not only can it flexibly adapt to the stochastic fluctuations of absorbing unbalanced power in the microgrid. In this paper,
renewable energy, but it can also enhance the reliability and we aim to adapt the output of the dispatchable units to wind
economy of microgrid operation. outputs, PV outputs and load requirements by adjusting the
2) An improved DRL algorithm, TCSAC, is proposed. output of the dispatchable units.
Based on the maximum entropy reinforcement learning frame- All variables of the MOD can be divided into three cat-
work, a triple-critics mechanism is introduced to reduce egories: state variables, decision variables, and uncertainty
estimation bias, and a novel prioritization scheme with com- variables. The state variables consist of the regulated power
prehensive experience replay is introduced to improve the of the energy storage system PGESS , the voltage amplitude VL
performance of the algorithm. of the bus, and the reactive power output QG of the generator.
3) The proposed MIODMG is tested and analyzed in a The state variable X T is defined by
 
modified IEEE 118-bus system. The proposed algorithm is
X T = PGESS , VL1 , . . . , VLND , QG1 , . . . , QGNG , (3)
compared with three state-of-the-art DRL algorithms: SAC,
TD3, DDPG, as well as multi-objective particle swarm where ND is the number of buses and NG is the number of
optimization (MOPSO). The effectiveness of the proposed generators. The decision variables consist of the active power
multi-objective interval optimization dispatch model for output PG of the generator, the generator node voltage VG ,
microgrids and the TCSAC algorithm is verified through four and the transformer tap ratio T. The decision variable U T is
different simulation cases. defined by
The rest of the paper is organized as follows. Section II  
presents the multi-objective interval optimization dispatch U T = PG2 , . . . , PGNG , VG1 , . . . , VGNG , T1 , . . . , TNT , (4)
model of microgrid. Section III formulates the MIODMG
based on DRL. Section IV discusses a simulation study of the where NT is the number of transformer branches. The uncer-
modified IEEE 118-bus microgrid system. Finally, conclusions tain interval variables representing wind and PV are
and future work are given in Section V.  
W = PWT1 , . . . , PWTM , PPV1 , . . . , PPVN , (5)
II. M ULTI -O BJECTIVE I NTERVAL O PTIMAL where PWT is the uncertain power output of wind power,
D ISPATCH OF M ICROGRID PPV is the uncertain power output of PV, and M and N
A. Problem Formulation are the number of WT and PV connected to the microgrid,
The main task of a multi-objective optimal dispatch model respectively.
for a microgrid is to establish the objective function and con-
straints based on the operational characteristics. The problem B. Objective Function of MOD
can be formulated as
  The goal of multi-objective optimal dispatch of microgrids
min F(x) = f1 (x, w), f2 (x, w), . . . , fM (x, w) , (1) in this paper is to optimize the operating costs, network losses,
s.t. g(x, w) ≥ 0 and system security of the microgrid. Three objectives are
proposed accordingly, namely economic cost, network loss,
h(x, w) = 0, (2)
and branch stability index.
where g(x, w) and h(x, w) are inequality constraints and 1) Economic cost (EC): Economic dispatch is essential to
equality constraints, fi (x, w) is the ith objective function on the the operation of microgrids, with the objective of developing
decision variable x and the interval variable w. the output of the units at the lowest cost. We include economic
We consider a typical microgrid structure, as shown in cost as one of goal of multi-objective optimal dispatch for
Fig. 1, in which this structure consists of wind turbines (WTs), microgrids. The economic cost includes the cost of power

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2960 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

generation from diesel generators, micro turbines, and fuel power injected at the first node, and Qj is the reactive power
cells in the typical structures, and hence it can be given by output at the end node. The branch stability index is
f1 = CDE + CMT + CFC , (6) f3 = max L1 , L2 , . . . , LNL . (12)
where f1 is the economic cost, CDE , CMT , and CFC denote the
operating costs of diesel generators, micro turbines, and fuel C. Constraints for MOD
cells, respectively. The objective function is subject to constraints, composed
The economic cost of a diesel generator is expressed by of the equality constraint and the inequality constraint. The
equation constraint mainly refers to the power balance of the

NDE
CDE = (ai + bi PDEi + ci P2DEi ), (7) system, i.e., that the total power input of the system is equal
i=1 to the total power output of the system. The power balance
constraints are
where PDEi is the active output of the ith DE and NDE is   
the number of DEs. a, b, and c are the cost coefficients PG,i = Pload,i + Ui Uj Gij cos θij + Bij sin θij (13)
of the constant and the primary term and secondary term, j∈i
  
respectively. QG,i = Qload,i + Ui Uj Gij sin θij − Bij cos θij , (14)
The economic cost of a micro turbine is expressed by j∈i


NMT
cng PMTj where PG,i and QG,i denote the active and reactive power
CMT = , (8) injected by node i, respectively. Ui and Uj denote the voltage
ηMT j QLHV
j=1 magnitudes of node i and node j; θij denotes the voltage phase
where PMTj is the active output of the jth MT and NMT is the angle difference between nodes i and j; Gij and Bij denote the
number of MTs. cng is the price of gas, ηMTj is the efficiency real and imaginary parts of the elements in row i and column
j of the node derivative matrix, respectively. Pload,i and Qload,i
of the jth MT, and QLHV is the low heating value of natural
denote the active load demand and reactive load demand of
gas.
node i, respectively.
The economic cost of a fuel cell is computed by
Inequality constraints consist of capacity constraints on the

NFC
cng PFCk generating unit, upper and lower voltage constraints, and safe
CFC = , (9) system operation constraints, given by
ηFCk QLHV
k=1
G,i ≤
Pmin PG,i ≤ Pmax
G,i
where PFCk is the active output of the kth FC and NFC is the
number of FCs. ηFCk is the efficiency of the kth FC. G,i ≤
Qmin QG,i ≤ Qmax
G,i
2) Network loss (NL): The network loss of the power system Tjmin ≤ Tj ≤ Tjmax
is another indicator reflecting its efficient and stable operation. Vkmin ≤ Vk ≤ Vkmax
The sum of the losses of all branches is given by
SOCmin ≤ SOCt ≤ SOCmax , (15)
N
branch   
where Pmin max
f2 = Gij Ui2 + Uj2 − 2Ui Uj cos θi − θj , (10) G,i and PG,i are the lower and upper limits of the
k=1 active power output of the ith generator, respectively. Qmin
G,i and
where f2 is the network loss, Nbranch is the total number of Qmax
G,i are the lower and upper limits of the reactive power
th min
output of the i generator, respectively. Tj and Tj max are the
network branches, Gij is the conductance of the branches, Ui
is the voltage at node i at the head of the line, and Uj is the lower and upper limits of the tap ratio of the jth transformer.
voltage at node j at the end of the line, θi and θj are the voltage Vkmin and Vkmax are the lower and upper limits of the voltage
phase angles at nodes i and j, respectively. magnitude of the kth bus, respectively. SOCt is the state of
3) Branch stability index (BSI): When the system is stable, charge of the ESS at time t, SOCmin is the minimum state of
the BSI corresponding to all branches must be less than the charge of ESS and SOCmax is the maximum state of charge
voltage critical run-out point, and when a run-out occurs in the of ESS. The state of charge of ESS is the stored energy, and
system, it must start from the weakest branch in the system. the state of charge is calculated as
Therefore, our goal is to find the weakest branch in the system, 
Pd
which is also the branch with the largest branch stability factor, SOCt = SOCt−1 + Pct ηc − t t, (16)
and keep each branch away from the run-out point. The branch ηd
stability factor is given by where ηc is the charging efficiency, ηd is the discharging
efficiency, Pct is the charging power, Pdt is the discharging
X P2i X power, and t is the time step.
Lk = 4 Qj − , (11)
Ui2 Ui2
where Lk is the branch stability factor of the branch between D. Model of MIOD
nodes i and j at the ends of branch k, X is the reactance of Interval optimization is used to deal with the uncertainty of
the branch, Ui is the voltage at the first node, Pi is the active wind and PV, and the upper and lower bounds of wind and PV

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
MU et al.: MIOD OF MICROGRID VIA DRL 2961

are obtained using interval prediction. As a result, the MIOD ranking of Pareto solutions can be obtained by fuzzy decision-
can be expressed as making, and finally, the extraction of the optimal compromise
  solution from the Pareto frontier is completed.
min F(x) = f1I (x, w), f2I (x, w), . . . , fMI (x, w) , (17) The other category is to transform multi-objectives into a
s.t. gm (x, w) ≥ am single objective for optimization. This paper uses DRL to
hn (x, w) = bn , (18) solve the multi-objective optimization problem for microgrids,
which involves the problem that reward values need to be set
where gm (x, w) is the constrained interval inequality, am ∈ in DRL to guide the agent to seek the optimal solution. In this
[am , am ], m = 1, 2, . . . , M, hn (x, w) is the constrained paper, we can set multiple reward values for the agent, such
interval equation, bn ∈ [bn , bn ], n = 1, 2, . . . , N, fiI (x, w) is as economic cost reward, system active network loss reward,
the ith objective function on the decision variable x and the branch stability index reward, unit output crossing the limit
interval variable w ∈ [wL , wU ], which can be expressed as the reward, and node voltage crossing reward. However, in MDP,
number of intervals as the final R only needs a definite value, which means that the
 I
 set reward values need to be weighted into a composite reward
fiI (x, w) = f Ii (x, w), f i (x, w) , (19)
value in the end. Therefore, the treatment of multi-objective
I problems in this thesis belongs to the second category.
where f Ii (x, w) and f i (x, w) are the upper and lower bounds of In our opinion, the difference between DRL and heuristic
the ith objective function fiI (x, w), which can be determined by algorithms in the process of solving multi-objective problems
lies in the sequential difference in setting weights for different
f Ii (x, w) = min fi (x, w)
y∈w objectives. The process of the heuristic algorithm is to solve
I the Pareto frontier of the multi-objective optimization problem,
f i (x, w) = max fi (x, w). (20)
y∈w then construct the affiliation function of each objective, assign
weights to each objective, and use fuzzy decision-making to
As a consequence, the maximum and minimum values of the
rank each alternative solution in the Pareto frontier to get
objective function of the dispatch solution x in the uncertainty
the optimal compromise solution. The process of the DRL
interval w can be obtained. Since fi (x, w) is an interval number,
algorithm is to set the reward value that guides the agent to
in order to compare the objective function values correspond-
explore, and according to the relative weights of the objectives,
ing to different dispatch solutions, the interval number needs to
it can assign the weights to the corresponding sub-reward
be further processed by converting it to a deterministic value,
values. Instead of searching for many alternative solutions, the
which can be computed by
agent searches for the solution that has the highest combined
I reward value under the current weights; that is, it searches for
f Ii (x, w) + f i (x, w)
fiO (x, w) = . (21) a set of values in the solution space that are closest to the ideal
2
solutions of all objective functions under the current weights.
We represent the MIOD model as follows: Therefore, DRL-based multi-objective solving does not need
  to deal with Pareto frontiers.
min F(x) = f1O (x, w), f2O (x, w), . . . , fMO (x, w) , (22)
I III. D EEP R EINFORCEMENT L EARNING FOR MIODMG
f Ii (x, w) + f i (x, w)
s.t. fiO (x, w) = A. Reinforcement Learning Framework
2
gm (x, w) ≥ am In this section, we improve a DRL algorithm to solve the
hn (x, w) = bn . (23) multi-objective interval optimization dispatch for microgrids.
Reinforcement learning is performed by an agent interacting
From this, it can be seen that the MIOD is essentially a with its environment and maximizing cumulative rewards,
two-level optimization, where the decision vector for exter- eventually learning how to choose the best action based on
nal optimization is x and the decision vector for internal different states. The problem can be viewed as a discrete-
optimization is w. time stochastic control process, or MDP, defined by a tuple
(S, A, R, P). In this MDP problem, the decision maker is the
E. Multi-Objective Optimization Solution Discussion system operator, and the agent’s function is equivalent to that
of a dispatcher, which usually wants to ensure system security
Two categories exist for the handling of multiple objectives
while keeping the system network loss and operation cost
in a multi-objective optimization solution. One is to consider
lower. The basic components of reinforcement learning under
optimizing multiple objectives at the same time, and the
the framework of MIODMG mainly include:
heuristic algorithm is used to perform the non-dominated
1) State: The state of the MIODMG includes the regulated
sorting, screen out the non-dominated solutions during the
power of the energy storage system, the voltage amplitude
execution of the algorithm, and finally give the Pareto-
at the bus, the reactive power output of the generator, and
optimal solution and the Pareto-frontier. Since different Pareto
the active power and reactive power demand. The state is
solutions have different focuses on different objectives, fuzzy
represented as
decision-making methods are generally used to better evaluate
the Pareto optimal solution under multiple objectives. The st = {PGESS ,t , VL,t , QG,t , Pload,t , Qload,t }. (24)

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2962 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

2) Action: The actions of the MIODMG are the relevant where π(·|st ) is the strategy of an agent that maps the
decision variables, including the active power output of the distribution from the state space to the action space. α is
generator, the generator node voltage, and the transformer tap the temperature coefficient that determines the importance of
ratio setting. The action is represented as entropy relative to the reward value, which in turn ensures the
stochasticity of the optimal strategy. γ is the discount factor
at = {PG,t , VG,t , TN,t }. (25) that penalizes future rewards.
The Bellman equation Qπ is
3) Reward: The reward is the optimization objective of steer-
        
ing agent, and the reward values included are the economic Qπ (s, a) = E R s, a, s + γ Qπ s , a + αH π ·|s .
cost reward r1 , the system active network loss reward r2 , the s ∼P
a ∼π
branch stability index reward r3 , the reactive output crossing (34)
limit reward r4 , and the node voltage crossing limit reward r5 .
These rewards are designed as follows: The SAC algorithm consists of two neural networks: the
strategy network and the Q network. SAC learns both a
r1 = −(CDE + CMT + CFC ), (26) strategy πφ and two Q functions, Qθ1 and Qθ2 . The Q-network
N
branch    parameterized by θ is used to approximate the soft Q-function
r2 = − Gij Ui2 + Uj2 − 2Ui Uj cos θi − θj , (27) Qθ (s, a). The policy network parameterized by φ outputs
k=1 the mean and variance of the action probability distribution
r3 = − max{L1 , L2 , . . . , LNL },   (28) according to the different states.

⎪  QG,i −Qmin 
⎨ G,i 
 QG,i ≤ QG,i
NG min
 min
r4 = − QG,i , QG,i =  QG,i max  (29) C. Estimation Bias in SAC
⎩  QG,i −Q
⎪ G,i 
j=1 max
QG,i QG,i > Qmax ,
G,i RL algorithms originating from the Q-learning class all
⎧  suffer from value estimation bias. It starts with Q-learning
⎪  
⎨  Vk −V
min
 k 
 Vkmin Vk ≤ Vk
ND min
and uses greedy estimation of the target value: y = r +
r5 = − Vk , Vk =  max 
(30) γ max Q(s , a ). However, when there is an error in the target,
⎩  Vk −V
⎪ 
k=1 V max  k
k
V > Vkmax . a
k the target estimate will be larger than the true target value,
The final reward is subject to a combination of factors that and this overestimation bias is then propagated through the
may affect the operation of the microgrid, and the total reward Bellman equation. This overestimation bias therefore carries
r is expressed as over to DQN and DDPG, and it is demonstrated in [37] that
this theoretical overestimation occurs in practice. TD3 is an
r = a1 r1 + a2 r2 + a3 r3 + a4 r4 + a5 r5 , (31) extension of DDPG that uses the minimum between two critics
as the target value estimate, thereby limiting the overestimation
where a is the relative weighting factor for each reward. phenomenon. A clipped double Q-learning model for actor-
critic is thus proposed:
B. Soft Actor-Critic   
y = r + γ min Qθi s , πφ1 s . (35)
SAC is a DRL algorithm of the off-policy class based on i=1,2
the maximum entropy reinforcement learning framework [35]. The use of clipped double Q-learning, while potentially
It is a combination of stochastic policy optimization and a leading to an underestimation bias, would be far more effective
DDPG-like deterministic policy optimization algorithm that than an overestimation bias, since the value of the underesti-
not only incorporates some of the techniques of DDPG [36] mation behavior would not be propagated through the strategy
and TD3 [37], but also retains the inherent stochasticity of update. However, the underestimation phenomenon occurring
their policies. At the heart of SAC is entropy regularization, in the TD3 algorithm was investigated in [38], its existence
where the agent needs to maximize both the reward value and was theoretically demonstrated, and it was experimentally
the entropy value during training. Entropy is a measure of the observed that this underestimation bias did affect performance
randomness in a strategy that can be adjusted to allow the in experiments.
agent to explore more or less and prevent the strategy from Since SAC uses clipped double Q-learning to reduce over-
converging too early to a local optimum. estimation bias, underestimation may also exist, and very little
From the probability distribution P of its state transfer, the has been done to address the underestimation present in SAC.
entropy can be determined by Based on the above observations, in combination with the
  approach of [38], we propose a “triple critic” framework of
H(P) = E − log P(x) . (32)
x∼P SAC for estimating bias. Combining the overestimation bias
of DDPG with the underestimation bias of TD3 allows the
The agent receives a reward proportional to the entropy
estimated bias to fall somewhere in between. Assume that
of the strategy at each time step, and the maximum entropy
the parameterized functions Qθ (s, a) and πφ (a|s) are used to
reinforcement learning formulation is
estimate the soft Q value and the policy, respectively. Set the

∞  corresponding single-critic network Q value of the DDPG as
π ∗ = arg max E γ t (R(st , at ) + αH(π(·|st ))) , (33)    
π τ ∈π Q1 = Qθ1 s , a ,  a ∼ πφ ·|s . (36)
t=0

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
MU et al.: MIOD OF MICROGRID VIA DRL 2963

The double-critic network Q value of the TD3 is D. Comprehensive Experience Replay


    Experience replay deposits the gathered samples into an
Q2 = min Qθi s , a ,  a ∼ πφ ·|s . (37) experience replay buffer and randomly selects samples from
i=2,3
it to train the network. Although experience replay makes
The target values for the triple critics are updated as the samples independent of each other, they are all sam-
   pled uniformly by the same probability, ignoring the relative
y = r + γ λQ1 + (1 − λ)Q2 − α log πφ  a |s , (38) importance of the different samples. Therefore, [39] proposed
prioritized experience replay with TD-error to determine prior-
where λ ∈ (0, 1) is the weight of a single critic. ity and then importance sampling. Reference [40] proposed an
The update to the policy network can be expressed as fol- emphasis on recent experience replay. Reference [41] preferred
lows: to select data from those collected. Inspired by the above ideas,
  in this paper, a new prioritization scheme is introduced for
Jπ (φ) = Est ∼D α log πφ (at |st ) − Qθ (st , at ) . (39) comprehensive prioritized experience replay in order to select
better samples from the experience replay buffer.
The nature of SAC stochastic policy optimization dictates
In contrast to existing prioritization methods, instead of
that its actions are not directly output by the policy network,
sampling data from within the experience replay buffer based
but rather the policy is re-parameterized by a neural network
on priority, we prioritize data based on the collected data and
transformation. The action function is
also consider that samples closer to the current moment should
μ have a higher value and should be sampled more aggressively
at = fφ (εt ; st ) = fφ (st ) + εt fφσ (st ), (40)
from the most recent experience. We use ρe to denote the most
where εt is the noise vector following a normal distribution, recent score of an episode in which all individual transitions
μ
and fφ (st ) and fφσ (st ) are the mean and variance of the heavy (ce1 , . . . , cet ) have the same most recent score, where cet =
parameter sampling output, is Hadamard product. Then, (set , aet , rte , set+1 ). At the current stage of E episodes, in a small
Jπ (φ) can be rewritten as batch of updates, episode e has a recent score of
 
E 1000
     ρe = max N · ηe e
, ρmin , (47)
Jπ (φ) = E α log πφ fφ (εt ; st )|st − Qθ st , fφ (εt ; st ) .
st ∼D
εt ∼N
where N is the capacity of the experience replay buffer. ηe ∈
(41) (0, 1) is a hyperparameter that determines the recent score of
the data. ρmin is the minimum allowed value of ρe . It is clear
The strategy gradient can be approximated by the follow-
that the more recent the sample, the higher its recent score.
ing equation: Then, we use the reward value ρf for the termination state
 π (φ) = ∇φ log πφ (at |st ) + (∇at log πφ (at |st ) of an episode as the value score for all transitions within that
∇J
episode. The episode’s total priority score is ρ, ρ = ρe + ρf .
− ∇at Q(st , at ))∇φ fφ (εt ; st ). (42) When the mini-batch within the experience replay buffer is
collected to update the agent, the transitions within the mini-
Update the Q-network according to the minimized Bellman batch need to be prioritized according to the priority score.
residuals This is done as follows: add the total priority score ρ to the
1 2  tuple cet to obtain the transition tuple  cet = (set , aet , rte , set+1 , ρ).
JQ (θ ) = E(st ,at )∼D Qθ (st , at ) − 
Qθ (st , at ) (43) Two batches, H1 = { e1
ct1 , . . . ,em
ctm } and H2 = {
em+1
ctm+1 , . . . ,
e2m
ct2m },
2
are extracted from within the experience replay buffer, where
with m is the size of the batch. Then the two extracted batches
  H1 ∪H2 are merged, and a G of size m is set. The 2m transition

Qθ (st , at ) = rt + γ Qθ (st+1 , at+1 ) − α log(πφ (at+1 |st+1 )) . tuples in H1 and H2 are sorted according to the priority score
(44) ρ of each transition tuple, and the top m transition tuples are
selected to form a new batch G for network training.
The gradient is Till now, a TCSAC-based framework for optimal dispatch-
  ing of microgrids is shown in Fig. 2. It consists of an
 Q (θ ) = ∇θ Qθ (st , at ) Qθ (st , at ) − rt − γ Q (st+1 , at+1 )
∇J θ actor network with three critic networks and three target
 
− α log πφ (at+1 |st+1 ) . critic networks. The agent interacts with the environment,
(45) which provides the agent with information about the current
operational state of the system, including the regulated power
The target value network performs a soft update, expressed of the energy storage system, the voltage magnitude of the
as follows: load bus, the reactive power output of the generator, and
the active and reactive power demand of the load. The
θ ← (1 − τ )θ + τ θ, (46) actor network in the agent outputs the mean and variance
of an action-dependent Gaussian distribution based on the
where τ is the soft update parameter. data provided by the environment and the experience replay

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2964 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

Fig. 2. A TCSAC-based framework for optimal dispatch of microgrids.

buffer through reparameter sampling. Then a sample from the Algorithm 1 TCSAC (Triplet-Critics Comprehensive
policy distribution is taken to generate an action value and Experience Replay Soft Actor-Critic)
applied to the environment, which then gets a reward value that Input: Initial policy parameters φ; Q-function parameters θ1 , θ2 ,
reflects the reward of the current action. After the environment θ3 ; experience replay buffer D; Initial target networks θtarg,1 ← θ1 ,
gets the action command, it runs to the next state and gets θtarg,2 ← θ2 , θtarg,3 ← θ3
a set of transitions (st , at , rt , st+1 ), which are stored in the Ouput: φ, θ1 , θ2 , θ3
experience replay buffer. Then mini-batches are selected from for t = 1, ..., T do
the experience replay buffer to train the actor network and the Select and execute action a ∼ πφ (·|s), observe next state
s and reward r  
critic network, and then the estimation bias is reduced by the
Store transition tuple s, a, r, s , ρ in D
triple-critic network, and TD-error is obtained, which is fed if it’s time to update then
back to the actor network. This process is repeated until the H1 ∼ D, H2 ∼ D
agent is able to achieve the optimal decision, and then G = (H1 ∪ H2 )prior
the final dispatch strategy is obtained. The detailed steps of The following network updates are computed on the
the algorithm are shown in Algorithm 1. mini-batch G
Set Q1 = Qθtarg,1 (s , ã ), ã ∼ πφ (·|s )
IV. S IMULATION S TUDIES Set Q2 = mini=2,3 Qθtarg,i (s , ã ), ã ∼ πφ (·|s )
Compute the target Q value: 
A. Simulation Settings y = r + γ λQ1 + (1 − λ)Q2 − α log πφ ( a |s )
To verify the effectiveness of the proposed DRL- Update Q-functions by one step of gradient descent
using:
based multi-objective interval optimal dispatch method for 1   2
microgrids, the modified IEEE 118-bus system is used as an ∇θi |G| Qθi (s, a) − y for i = 1, 2
(s,a,r,s )∈G
arithmetic example for simulation studies. The system consists Update policy by one step of gradient ascent using:
of 53 distributed generators, including 15 DEs, 10 MTs, and 8 1     
∇φ |G| min Qθi s,aφ (s) − α log πφ  aφ (s)|s
FCs. The operating parameters of the microgrid are provided in s∈G i=1,2
Table I. Uncertain renewable energy output includes 10 WTs Update target critic network parameters:
and 10 PVs, whose power output interval values are obtained θtarg,i ← τ θtarg,i + (1 − τ )θi for i = 1, 2, 3
end if
by the interval prediction method, as shown in Fig. 3. The
end for
basic power value of the system is 100 kVA. Bus information
and branch circuit parameters can be found in [42], bus
modified for the purpose of analysis.
All case studies were conducted on the same computer algorithm are provided in Table II. Our simulation study of
device with an Nvidia GeForce RTX 3090, Intel Xeon W-2255 the three objectives considered, divided according to two-
CPU, and 64 GB of RAM. The hyperparameters of the DRL objective optimization and three-objective optimization, can

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
MU et al.: MIOD OF MICROGRID VIA DRL 2965

Fig. 4. Learning curve of the TCSAC and baseline algorithms in Case 1.


Fig. 3. Renewable energy power output interval values.

TABLE I The DRL training process is shown in Fig. 4. The solid


O PERATING PARAMETERS OF THE M ICROGRID curve indicates the average reward for each episode, and the
shaded area indicates the standard deviation of the reward for
each episode. It can be seen that the four classes of algorithms
start to converge around 2000 episodes, with TCSAC show-
ing better performance with the highest cumulative rewards
and DDPG performing worse as the value overestimates the
problem into a local optimum.
In addition, four indicators are used to quantify the learning
performance: the ultimate average reward IUAR , the ultimate
standard deviation IUSD , the maximum episode reward IMER ,
TABLE II and the maximum cumulative reward IMCR . The performance
H YPERPARAMETERS OF THE A LGORITHM
of the algorithm is evaluated in different aspects according to
the selected indicators, which are calculated as

Ne
IUAR = N1e Ri
 i=1

Ne
ISD = 1
(Ri − IUAR )2
Ne
i=1 , (48)
IMER = max Ri
Ne
1 
ne
IMAR = max Ri
ne ≤Ne ne i=1

where Ri is the reward for the ith episode, Ne is the total


number of episodes trained by the algorithm, and ne is the
total number of episodes trained up to the current time.
produce the following four cases: (1) Optimizing EC and NL; Table III gives the different indicator values for the four
(2) Optimizing EC and BSI; (3) Optimizing NL and BSI; and types of algorithms. Compared to SAC, the final cumulative
(4) Optimizing EC, NL, and BSI simultaneously. reward and standard deviation of TCSAC have improved. This
shows that the final performance and stability of the algorithm
have been improved accordingly. In addition, TCSAC has the
B. Simulation Results and Discussion highest values for both the maximum episode reward and
1) Case 1: Optimizing both EC and NL objectives. Finding the maximum cumulative reward, indicating that it is able to
a dispatching strategy that simultaneously optimizes EC and explore dispatch solutions that outperform other algorithms.
NL objectives is the aim of this case. Not only is the validity To better visualize the advantages of the proposed algo-
of our proposed TCSAC algorithm verified, but also other rithm, we sampled from the renewable energy power output
DRL algorithms are selected for comparison, namely the interval in Fig. 3 to generate 300 sets of samples. Then, two
original SAC algorithm, the TD3 algorithm, and the DDPG DRL algorithms, TCSAC and SAC, and the MOPSO algorithm
algorithm. Next, we further compared this with the traditional are selected for testing. The results of the economic cost and
optimization algorithm, MOPSO. EC is generally considered network loss test comparison are given in Fig. 5. In terms of
to be relatively more important than NL, so in this case their optimizing EC, TCSAC has the advantage, SAC is better than
weights are set to 0.6 and 0.4, respectively. MOPSO but worse than TCSAC. And in terms of optimizing

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2966 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

TABLE III TABLE IV


P ERFORMANCE C OMPARISON OF D IFFERENT A LGORITHMS AVERAGE VALUE OF O PTIMIZATION FOR E ACH O BJECTIVE
U NDER D IFFERENT A LGORITHMS

Fig. 6. Learning curve of the TCSAC and baseline algorithms in Case 2.

again due to the high value estimation problem entering a local


optimum, resulting in a lower average reward value.
As shown in Table III, both the ultimate cumulative reward
and the standard deviation of TCSAC were improved com-
pared to SAC. TCSAC has the highest value of the maximum
cumulative reward, but SAC has the highest value of the
maximum episode reward. This indicates that it is able to
Fig. 5. Comparison of TCSAC, SAC, and MOPSO algorithms for EC and explore better dispatching solutions than the other algorithms
NL samples.
but eventually converges to a value lower than that of TCSAC.
Similarly, 300 sets of samples were generated for testing to
compare the performance differences between the different
NL, TCSAC and SAC are almost on par, with MOPSO algorithms. The comparative test results for economic cost and
optimizing at a higher level. Table IV gives the average of line stability index are shown in Fig. 7. The differences in EC
the test results for 300 samples of different algorithms, with and BSI values between the TCSAC and SAC optimizations
TCSAC obtaining the lowest average test results in Case 1. are very small, but both are significantly better than MOPSO.
2) Case 2: Optimizing both EC and BSI objectives. Finding As shown in Table IV, TCSAC has the lowest mean test results
a dispatching strategy that simultaneously optimizes EC and in Case 2.
BSI objectives is the aim of this case. Similarly, the BSI is 3) Case 3: Optimizing both NL and BSI objectives. Finding a
usually considered to be relatively more important than the EC, dispatching strategy that simultaneously optimizes NL and BSI
so in this case their weights are set to 0.6 and 0.4, respectively. objectives is the aim of this case. The BSI is usually considered
The DRL training process is shown in Fig. 6. It can be seen to be relatively more important than the NL, and their weights
that TCSAC, SAC, and DDPG start to converge around 2000 are set to 0.7 and 0.3, respectively. The DRL training process
episodes, and TD3 starts to converge around 4000 episodes, is shown in Fig. 8. It can be seen that TCSAC, SAC, and
with TCSAC showing better performance with the highest TD3 start to converge around 2000 episodes, and DDPG starts
cumulative reward but almost the same as SAC, and DDPG to converge around 3000 episodes, with TCSAC showing

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
MU et al.: MIOD OF MICROGRID VIA DRL 2967

Fig. 7. Comparison of TCSAC, SAC, and MOPSO algorithms for EC and Fig. 9. Comparison of TCSAC, SAC, and MOPSO algorithms for NL and
BSI samples. BSI samples.

Fig. 8. Learning curve of the TCSAC and baseline algorithms in Case 3. Fig. 10. Learning curve of the TCSAC and baseline algorithms in Case 4.

better performance with the highest cumulative reward and and DDPG start to converge around 2000 episodes, while
DDPG with the lowest average reward value. TCSAC starts to converge around 4000 episodes, but TCSAC
As shown in Table III, both the ultimate cumulative reward has the highest cumulative reward and DDPG has the lowest
and the standard deviation of TCSAC are improved compared average reward value. As shown in Table III, both the ultimate
to SAC. TCSAC has the highest values for both the maximum cumulative reward and the standard deviation of the TCSAC
episode reward and the maximum cumulative reward. This were increased compared to the SAC. The TCSAC has the
indicates that it is able to both explore dispatching solutions highest values for both the maximum episode reward and the
that outperform other algorithms and eventually converge maximum cumulative reward.
to lower values than TCSAC. Similarly, 300 samples are Similarly, 300 samples are generated for testing to compare
generated for testing to compare the performance differences the performance differences of the different algorithms. The
of the different algorithms. The results of the network loss results of the test comparison of network loss and line stability
and line stability index tests are shown in Fig. 9. The differ- index are shown in Fig. 11. It can be seen that TCSAC has
ence between the NL and BSI values for TCSAC and SAC a greater advantage in optimizing EC and BSI; TCSAC is
optimization is very small, but both are significantly better almost on par with SAC in optimizing NL; and the value of
than MOPSO. As shown in Table IV, TCSAC has the lowest NL optimized by MOPSO is at a higher level. The comparison
mean test result. of the training time of four DRL algorithms (TSCAC, SAC,
4) Case 4: Simultaneous optimization of the EC, NL, and BSI TD3, and DDPG) is shown in Table V. The training time of
triple objectives. Finding an optimal dispatching strategy while all four DRL algorithms is in order of magnitude, with the
simultaneously optimizing the EC, NL, and BSI objectives TD3 algorithm in Case 1 taking the least amount of time
is the aim of this case. The weights of EC, NL, and BSI at 59,888.62s, with an average of 6s per episode, and the
are set to 0.3, 0.3, and 0.4, respectively. The DRL training DDPG algorithm in Case 4 taking the most amount of time at
process is shown in Fig. 10. It can be seen that SAC, TD3, 81,362.43s, with an average of 8s per episode. The comparison

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2968 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

Fig. 12. DEs operating results for each unit.

Fig. 11. Comparison of TCSAC, SAC, and MOPSO algorithms for EC, NL
and BSI samples.

TABLE V
T RAINING T IME C OMPARISON (10,000 E PISODES ,
MAX 50 S TEPS PER E PISODE , SEC )
Fig. 13. MTs operating results for each unit.

TABLE VI
T EST T IME C OMPARISON (300 T EST S AMPLES , SEC )

of the testing time of the three algorithms, TCSAC, SAC, and Fig. 14. FCs operating results for each unit.
MOPSO, for 300 sets of samples is shown in Table VI. DRL
takes 2s to test a set of samples on average, and MOPSO takes
330s to test a set of samples on average. This is one of the high, so the power of MTs and FCs is relatively low. The
major advantages of the DRL algorithm. Although it takes a ESS is responsible for absorbing the imbalanced power of the
long time for the agent to be trained, a trained and mature system, which is always kept in a stable range.
agent can make decisions in seconds and has the ability to
make decisions in real time. V. C ONCLUSION
Finally, the results of the different operations of the In this paper, a multi-objective interval optimization dis-
microgrid in the four cases are given. The operational results patch model for microgrids considering renewable energy
for DEs are shown in Fig. 12, for MTs in Fig. 13, for FCs in uncertainty was proposed. Economic cost, network loss, and
Fig. 14, and for ESS in Fig. 15. It can be seen that in Case 2, branch stability index were considered the optimization objec-
the power of DEs is relatively low, so the power of MTs and tives of the microgrid. The interval variables were used to
FCs is relatively high; in Case 4, the power of DEs is relatively represent the uncertain power output of wind and PV, and

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
MU et al.: MIOD OF MICROGRID VIA DRL 2969

[10] M. Sedighizadeh, S. S. Fazlhashemi, H. Javadi, and M. Taghvaei, “Multi-


objective day-ahead energy management of a microgrid considering
responsive loads and uncertainty of the electric vehicles,” J. Clean.
Prod., vol. 267, Sep. 2020, Art. no. 121562.
[11] A. Ehsan and Q. Yang, “State-of-the-art techniques for modelling of
uncertainties in active distribution network planning: A review,” Appl.
Energy, vol. 239, pp. 1509–1523, Apr. 2019.
[12] A. R. Jordehi, “How to deal with uncertainties in electric power
systems? A review,” Renew. Sustain. Energy Rev., vol. 96, pp. 145–155,
Nov. 2018.
[13] B. Cao, W. Dong, Z. Lv, Y. Gu, S. Singh, and P. Kumar, “Hybrid
microgrid many-objective sizing optimization with fuzzy decision,”
IEEE Trans. Fuzzy Syst., vol. 28, no. 11, pp. 2702–2710, Nov. 2020.
[14] R. W. Jiang, J. H. Wang, and Y. P. Guan, “Robust unit commitment
with wind power and pumped storage hydro,” IEEE Trans. Power Syst.,
vol. 27, no. 2, pp. 800–810, May 2012.
[15] R. J. Yan et al., “A two-stage stochastic-robust optimization for a hybrid
renewable energy CCHP system considering multiple scenario-interval
Fig. 15. ESS operation results. uncertainties,” Energy, vol. 247, May 2022, Art. no. 123498.
[16] J. Lee, S. Lee, and K. Lee, “Multistage stochastic optimization for
microgrid operation under islanding uncertainty,” IEEE Trans. Smart
Grid, vol. 12, no. 1, pp. 56–66, Jan. 2021.
an MDP was established for the interval optimization. An [17] B. Y. Zhang, Q. Q. Li, L. H. Wang, and W. Feng, “Robust optimization
improved DRL algorithm, TCSAC, was further proposed to for energy transactions in multi-microgrids under uncertainty,” Appl.
solve it. To verify the effectiveness of the proposed method, Energy, vol. 217, pp. 346–360, May 2018.
[18] K. Wang and C. X. Mu, “Learning-based control with decentral-
case studies were conducted using the modified IEEE 118- ized dynamic event-triggering for vehicle systems,” IEEE Trans. Ind.
bus microgrid system. In four cases, the proposed algorithm Informat., vol. 19, no. 3, pp. 2629–2639, Mar. 2023.
TCSAC was able to obtain better convergence solutions [19] D. Saez, F. Avila, D. Olivares, C. Canizares, and L. Marin, “Fuzzy
prediction interval models for forecasting renewable resources and loads
compared to the state-of-the-art DRL algorithms SAC, TD3, in microgrids,” IEEE Trans. Smart Grid, vol. 6, no. 2, pp. 548–556,
and DDPG. In addition, further sampling analysis was done Mar. 2015.
to compare the dispatching results of TCSAC, SAC, and [20] L. Wu, M. Shahidehpour, and Z. Y. Li, “Comparison of scenario-based
and interval optimization approaches to stochastic SCUC,” IEEE Trans.
MOPSO, and the proposed algorithm was able to achieve Power Syst., vol. 27, no. 2, pp. 913–921, May 2012.
better dispatching results. In the future, we will further extend [21] Y. Z. Li, Q. H. Wu, L. Jiang, J. B. Yang, and D. L. Xu, “Optimal power
this work to multi-timescale optimal dispatch of microgrids, system dispatch with wind power integrated using nonlinear interval
optimization and evidential reasoning approach,” IEEE Trans. Power
and the efficiency of renewable energy generation will be Syst., vol. 31, no. 3, pp. 2246–2254, May 2016.
maximized for the power system. [22] Y. Z. Li, P. Wang, H. B. Gooi, J. Ye, and L. Wu, “Multi-objective optimal
dispatch of microgrid under uncertainties via interval optimization,”
IEEE Trans. Smart Grid, vol. 10, no. 2, pp. 2046–2058, Mar. 2019.
R EFERENCES [23] C. X. Mu, W. Q. Liu, and W. Xu, “Hierarchically adaptive frequency
control for an EV-integrated smart grid with renewable energy,” IEEE
[1] M. H. Rehmani, M. Reisslein, A. Rachedi, M. Erol-Kantarci, and Trans. Ind. Informat., vol. 14, no. 9, pp. 4254–4263, Sep. 2018.
M. Radenkovic, “Integrating renewable energy resources into the smart [24] J. R. Vazquez-Canteli and Z. Nagy, “Reinforcement learning for demand
grid: Recent developments in information and communication tech- response: A review of algorithms and modeling techniques,” Appl.
nologies,” IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 2814–2825, Energy, vol. 235, pp. 1072–1089, Feb. 2019.
Jul. 2018. [25] B. G. Kim, Y. Zhang, M. van der Schaar, and J. W. Lee, “Dynamic pric-
[2] C. Lamnatou, D. Chemisana, and C. Cristofari, “Smart grids and ing and energy consumption scheduling with reinforcement learning,”
smart technologies in relation to photovoltaics, storage systems, build- IEEE Trans. Smart Grid, vol. 7, no. 5, pp. 2187–2198, Sep. 2016.
ings and the environment,” Renew. Energy, vol. 185, pp. 1376–1391, [26] P. C. Dai, W. W. Yu, G. H. Wen, and S. Baldi, “Distributed reinforce-
Feb. 2022. ment learning algorithm for dynamic economic dispatch with unknown
[3] G. Dileep, “A survey on smart grid technologies and applications,” generation cost functions,” IEEE Trans. Ind. Informat., vol. 16, no. 4,
Renew. Energy, vol. 146, pp. 2589–2625, Feb. 2020. pp. 2258–2267, Apr. 2020.
[4] A. Zakaria, F. B. Ismail, M. S. H. Lipu, and M. A. Hannan, “Uncertainty [27] G. K. Venayagamoorthy, R. K. Sharma, P. K. Gautam, and A. Ahmadi,
models for stochastic optimization in renewable energy applications,” “Dynamic energy management system for a smart microgrid,” IEEE
Renew. Energy, vol. 145, pp. 1543–1571, Jan. 2020. Trans. Neural Netw. Learn. Syst., vol. 27, no. 8, pp. 1643–1656,
[5] X. H. Lu, K. L. Zhou, S. L. Yang, and H. Z. Liu, “Multi- Aug. 2016.
objective optimal load dispatch of microgrid with stochastic access [28] C. X. Mu, K. Wang, and C. Y. Sun, “Learning control supported by
of electric vehicles,” J. Clean. Prod., vol. 195, pp. 187–199, dynamic event communication applying to industrial systems,” IEEE
Sep. 2018. Trans. Ind. Informat., vol. 17, no. 4, pp. 2325–2335, Apr. 2021.
[6] V. Sarfi and H. Livani, “An economic-reliability security-constrained [29] D. X. Zhang, X. Q. Han, and C. Y. Deng, “Review on the research and
optimal dispatch for microgrids,” IEEE Trans. Power Syst., vol. 33, practice of deep learning and reinforcement learning in smart grids,”
no. 6, pp. 6777–6786, Nov. 2018. CSEE J. Power Energy Syst, vol. 4, no. 3, pp. 362–370, Sep. 2018.
[7] C. X. Mu, Y. Zhang, H. J. Jia, and H. B. He, “Energy-storage- [30] H. W. Wang, T. W. Huang, X. F. Liao, H. Abu-Rub, and
based intelligent frequency control of microgrid with stochastic model G. Chen, “Reinforcement learning in energy trading game among smart
uncertainties,” IEEE Trans. Smart Grid, vol. 11, no. 2, pp. 1748–1758, microgrids,” IEEE Trans. Ind. Electron., vol. 63, no. 8, pp. 5109–5119,
Mar. 2020. Aug. 2016.
[8] K. Ullah, G. Hafeez, I. Khan, S. D. Jan, and N. Javaid, “A multi-objective [31] C. X. Mu, Z. Y. Liu, J. Yan, H. J. Jia, and X. Y. Zhang, “Graph
energy optimization in smart grid with high penetration of renewable multi-agent reinforcement learning for inverter-based active voltage
energy sources,” Appl. Energy, vol. 299, Oct. 2021, Art. no. 117104. control,” IEEE Trans. Smart Grid, early access, Aug. 17, 2023,
[9] H. Y. Dong, Y. B. Fu, Q. Q. Jia, and X. Y. Wen, “Optimal dis- doi: 10.1109/TSG.2023.3298807.
patch of integrated energy microgrid considering hybrid structured [32] Y. T. Zhou, Z. J. Ma, J. H. Zhang, and S. L. Zou, “Data-driven stochastic
electric-thermal energy storage,” Renew. Energy, vol. 199, pp. 628–639, energy management of multi energy system using deep reinforcement
Nov. 2022. learning,” Energy, vol. 261, Dec. 2022, Art no. 125187.

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
2970 IEEE TRANSACTIONS ON SMART GRID, VOL. 15, NO. 3, MAY 2024

[33] H. C. Hua, Y. C. Qin, C. T. Hao, and J. W. Cao, “Optimal energy Xinying Wang (Senior Member, IEEE) received
management strategies for energy Internet via deep reinforcement the Ph.D. degree from the Dalian University
learning approach,” Appl. Energy, vol. 239, pp. 598–609, Apr. 2019. of Technology, Dalian, China, in 2015. He has
[34] E. Foruzan, L. K. Soh, and S. Asgarpoor, “Reinforcement learning been working with China Electric Power Research
approach for optimal distributed energy management in a microgrid,” Institute, Beijing, China, where he is the Director of
IEEE Trans. Power Syst., vol. 33, no. 5, pp. 5749–5758, Sep. 2018. Artificial Intelligence Application Research Section.
[35] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- His main fields of interest are artificial intelligence
policy maximum entropy deep reinforcement learning with a stochastic and its application in energy Internet. He is a Senior
actor,” in Proc. 35th Int. Conf. Mach. Learn. (ICML), Stockholm, Member of CSEE.
Sweden, 2018, pp. 1861–1870.
[36] T. P. Lillicrap et al., “Continuous control with deep reinforcement
learning,” in Proc. 4th Int. Conf. Learn. Represent. (ICLR), 2016,
pp. 1–14.
[37] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing function approx-
imation error in actor-critic methods,” in Proc. 35th Int. Conf. Mach.
Learn. (ICML), Stockholm, Sweden, 2018, pp. 2587–2601.
[38] D. Wu, X. Dong, J. Shen, and S. C. Hoi, “Reducing estimation bias via
triplet-average deep deterministic policy gradient,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 31, no. 11, pp. 4933–4945, Nov. 2020. Zhuo Tang received the B.S. degree in automa-
[39] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience tion from Tianjin University, Tianjin, China, in
replay,” 2015, arXiv:1511.05952. 2022, where she is currently pursuing the M.S.
[40] C. Wang and K. Ross, “Boosting soft actor-critic: Emphasizing recent degree in control engineering with the School of
experience without forgetting the past,” 2019, arXiv:1906.04009. Electrical and Information Engineering. Her research
[41] C. Banerjee, Z. Chen, and N. Noman, “Improved soft actor-critic: interests include adaptive dynamic programming,
Mixing prioritized off-policy samples with on-policy experiences,” neural networks, and related applications.
IEEE Trans. Neural Netw. Learn. Syst., early access, May 19, 2022,
doi: 10.1109/TNNLS.2022.3174051.
[42] “Power systems test case archive.” Accessed: Jul. 15, 2016. [Online].
Available: [Link]

Chaoxu Mu (Senior Member, IEEE) received the


Ph.D. degree in control science and engineering from
the School of Automation, Southeast University,
Nanjing, China, in 2012. She was a visiting Ph.D.
student with the Royal Melbourne Institute of Hongjie Jia (Senior Member, IEEE) received the
Technology University, Melbourne, VIC, Australia, B.S., M.S., and Ph.D. degrees in electrical engineer-
from 2010 to 2011. She was a Postdoctoral Fellow ing from Tianjin University, Tianjin, China, in 1996,
with the Department of Electrical, Computer and 1998, and 2001, respectively, where he is currently
Biomedical Engineering, The University of Rhode a Professor. His research interests include power
Island, Kingston, RI, USA, from 2014 to 2016. She system stability analysis and control, distribution
is currently a Professor with the School of Electrical network planning, renewable energy integration, and
and Information Engineering, Tianjin University, Tianjin, China. Her current smart grids.
research interests include nonlinear system control and optimization, and
adaptive and learning systems.

Yakun Shi received the B.S. degree in process


equipment and control engineering from the North
University of China, Taiyuan, China, in 2021. He
is currently pursuing the M.S. degree in control
engineering with Tianjin University, Tianjin, China.
His current research interests include deep reinforce- Hua Geng (Fellow, IEEE) received the B.S.
ment learning and multiobjective optimal dispatch degree in electrical engineering from the Huazhong
of power systems considering large-scale access to University of Science and Technology, Wuhan,
renewable energy. China, in 2003, and the Ph.D. degree in control
theory and application from Tsinghua University,
Beijing, China, in 2008. From 2008 to 2010, he was
a Postdoctoral Research Fellow with the Department
of Electrical and Computer Engineering, Ryerson
University, Toronto, ON, Canada. In June 2010,
Na Xu received the B.S. degree in automation from
he joined the Department of Automation, Tsinghua
the College of Information Science and Engineering,
University, where he is currently a Full Professor.
Shanxi University in 2013, and the M.S. degree in
His current research interests include advanced control on power electronics
control science and engineering from the School
and renewable energy conversion systems. He was the recipient of second
of Electrical and Information Engineering, Taiyuan
prize of the National Science and Technology Progress Award. He is an
University of Science and Technology, Taiyuan,
Editor of the IEEE T RANSACTIONS ON E NERGY C ONVERSION and IEEE
China, in 2018. she is currently pursuing the Ph.D.
T RANSACTIONS ON S USTAINABLE E NERGY, and an Associate Editor of the
degree with Tianjin University. Her research interests
IEEE T RANSACTIONS ON I NDUSTRY A PPLICATIONS, IET Renewable Power
include reinforcement learning, deep learning, and
Generation, and Control Engineering Practice. He is a Convener of IEC SC
optimal dispatch.
8 A WG 8, and the Standing Director of China Power Supply Society.

Authorized licensed use limited to: Tsinghua University. Downloaded on June 04,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.

You might also like