Multi-Objective Interval Optimization Dispatch of Microgrid Via Deep Reinforcement Learning
Abstract—This paper presents an improved deep reinforcement learning (DRL) algorithm for solving the optimal dispatch of microgrids under uncertainties. First, a multi-objective interval optimization dispatch (MIOD) model for microgrids is constructed, in which the uncertain power output of wind and photovoltaic (PV) is represented by interval variables. The economic cost, network loss, and branch stability index of the microgrid are optimized. The interval optimization is modeled as a Markov decision process (MDP). Then, an improved DRL algorithm called triplet-critics comprehensive experience replay soft actor-critic (TCSAC) is proposed to solve it. Finally, simulation results on the modified IEEE 118-bus microgrid validate the effectiveness of the proposed approach.

Index Terms—Deep reinforcement learning, microgrid, uncertainty, interval optimization, experience replay.

NOMENCLATURE

Variables
PGESS     Regulated power of the ESS.
VL        Voltage amplitude of the bus.
QG        Reactive power output of the generator.
PG        Active power output of the generator.
VG        Generator node voltage.
T         Transformer tap ratio.
PWT       Uncertain power output of WT.
PPV       Uncertain power output of PV.
CDE       Cost of power generation from diesel generators.
CMT       Cost of power generation from micro turbines.
CFC       Cost of power generation from fuel cells.
PDEi      Active output of the ith DE.
PMTj      Active output of the jth MT.
PFCk      Active output of the kth FC.
Lk        Branch stability factor of the branch.
PG,i      Active power injected by node i.
QG,i      Reactive power injected by node i.
Pload,i   Active load demand of node i.
Qload,i   Reactive load demand of node i.
SOC       State of charge of the ESS.
Pct       Charging power of the ESS.
Pdt       Discharging power of the ESS.

Parameters
ND        The number of buses.
NG        The number of generators.
NT        The number of transformer branches.
M         The number of WTs.
N         The number of PVs.
NDE       The number of DEs.
a, b, c   The cost coefficients of the constant, primary, and secondary terms, respectively.
NMT       The number of MTs.
NFC       The number of FCs.
cng       The price of gas.
ηMTj      Efficiency of the jth MT.
QLHV      Lower heating value of natural gas.
ηFCk      The efficiency of the kth FC.
Nbranch   The total number of network branches.
Gij       The conductance of the branches.
Bij       The susceptance of the branches.
Ui, Uj    The voltages at nodes i and j, respectively.
θi, θj    The voltage phase angles at nodes i and j, respectively.
X         The reactance of the branch.
Pi        The active power injected at the first node.
Qj        The reactive power output at the end node.
ηc        The charging efficiency.
ηd        The discharging efficiency.
α         Temperature coefficient of entropy.
γ         Discount factor that penalizes future rewards.
λ         The weight of a single critic.
τ         Soft update parameter.

Manuscript received 12 April 2023; revised 28 August 2023; accepted 8 October 2023. Date of publication 6 December 2023; date of current version 23 April 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62022061, and in part by the Science and Technology Project of the State Grid Corporation of China (SGCC) under Grant 5700-202212197A-1-1-ZN. Paper no. TSG-00540-2023. (Corresponding author: Chaoxu Mu.)
Chaoxu Mu, Yakun Shi, Na Xu, Zhuo Tang, and Hongjie Jia are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: cxmu@[Link]; syk@[Link]; naxu@[Link]; tangzhuo@[Link]; hjjia@[Link]).
Xinying Wang is with the Artificial Intelligence Application Research Department, China Electric Power Research Institute, Beijing 100000, China (e-mail: wangxinying@[Link]).
Hua Geng is with the Department of Automation and the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China (e-mail: genghua@[Link]).
Digital Object Identifier 10.1109/TSG.2023.3339541

I. INTRODUCTION

THE worldwide energy crisis and environmental deterioration are both gaining prominence.
New power generation technologies geared toward economical and efficient renewable energy sources are developing rapidly. The microgrid, as a key component of the smart grid, can effectively integrate renewable energy sources, enhance their penetration, and maximize their use in the power grid [1], [2], [3]. However, because of natural environmental factors such as wind speed, solar irradiance, and temperature, wind and solar power generation are prone to sporadic and unpredictable fluctuations [4].

In fact, the optimal dispatch of microgrids is a complex multi-objective optimization problem, and a complete description of microgrid operation cannot be achieved by considering only operating costs or environmental benefits [5], [6], [7], [8]. Microgrid multi-objective optimal dispatch (MOD) solutions generally account for operating costs, network losses, and system security. It is therefore critical to develop a multi-objective optimal dispatch model that captures the random fluctuations of renewable energy output and to solve this multi-objective optimization problem effectively [9], [10].

The central difficulty in solving uncertainty optimization problems is how to deal with the uncertainties themselves. Fuzzy optimization, stochastic optimization, robust optimization, and interval optimization are often used in the optimal dispatch of microgrids to handle uncertainty [11], [12]. The objective function, constraints, and model coefficients in fuzzy optimization need to be fuzzified, which is difficult to achieve in practice [13], [14]. Stochastic optimization may suffer from bias in the empirical distribution due to the limited amount of historical data [15], and it is difficult to obtain accurate and reliable probability distributions for some of the uncertain parameters, which affects the multi-objective optimal scheduling of microgrids [16]. Robust optimization leads to conservative solutions because it provides optimal plans that guard against the worst-case scenario in the uncertainty set [17], [18].

The main idea of interval optimization is to represent uncertain data as intervals. The goal is to find the decision variables that optimize the objective function while the interval variables satisfy their respective constraints [19]. When dealing with random variables such as wind power and PV, interval optimization does not require knowledge of their membership functions or probability distributions, but only the upper and lower bounds of the random variables, which are expressed as intervals; the model is then transformed into a deterministic one [20]. In [21], a nonlinear interval optimization (NIO) model was proposed for solving optimal power system dispatch with uncertain wind power integration. In [22], a multi-objective optimal scheduling model for microgrids with uncertainty was developed using interval optimization, extending interval optimization to multi-objective microgrid scheduling for the first time.

The uncertainty of wind and solar energy poses many challenges to power systems. Multi-objective optimal dispatch problems for microgrids usually involve a large state space and a large action space. For traditional intelligent optimization algorithms, the computational effort over such a large search space grows rapidly as the uncertainty fluctuates. As a result, many intelligent algorithms converge slowly and tend to fall into local optima, which do not necessarily guarantee the feasibility and optimality of the resulting solution. Moreover, traditional optimization algorithms have long computation times, the results obtained from repeated runs are not unique and are less robust, and they must be re-computed every time a scheduling decision is made. Therefore, it remains a great challenge to dispatch microgrids optimally and efficiently under the uncertainty of renewable energy generation [23], [24], [25].

Reinforcement learning (RL) provides a promising approach to the optimal operation of microgrids in complex state spaces and is considered a control scheme that can significantly improve current conventional methods [26]. Owing to the powerful data processing and representation capabilities of deep neural networks, DRL can extract data features that are not captured by numerical models, and it can better handle the high-dimensional data in microgrid scheduling, mitigating the "curse of dimensionality" caused by the large decision space of the optimization problem. In addition, the DRL approach does not require repetitive retraining of the agent: a trained agent can make decisions in seconds, allowing real-time online decision-making based on the current state of the microgrid [27]. DRL is a branch of machine learning in which an agent interacts with an uncertain environment and makes decisions by continuously revising its strategy under the guidance of rewards. DRL collects data from the environment to learn control strategies without relying on an accurate model of the environment [28].

Many applications of reinforcement learning already exist in power systems, including microgrids, demand response, electrical equipment fault detection, power system analysis and control, network security, and renewable energy generation forecasting [29]. Two algorithms were developed based on learning automata to find a Nash equilibrium by establishing a relationship between average utility maximization and optimal policies [30]. A multi-agent reinforcement learning algorithm incorporating graph convolutional networks was proposed to facilitate the generation of agent voltage control strategies [31]. An improved DRL method was proposed to achieve dynamic and optimal energy allocation [32]. In [33], a multi-objective optimal dispatch model for a microgrid cluster was developed based on the asynchronous advantage actor-critic (A3C) algorithm. A multiagent-based model was used for the distributed energy management of microgrids [34].

Although much existing work has made significant contributions to microgrid optimal dispatch, there is still considerable room for DRL to be explored in this field. On the one hand, to the best of our knowledge, there is no prior work applying DRL algorithms to multi-objective interval optimal scheduling for microgrids. On the other hand, DRL algorithms originating from the Q-learning class, such as deep deterministic policy gradient (DDPG), twin delayed deep deterministic policy gradient (TD3), and soft actor-critic (SAC), suffer from value estimation bias, and standard DRL algorithms suffer from sampling inefficiency and poor feature extraction during network training.
generation from diesel generators, micro turbines, and fuel cells in the typical structures, and hence it can be given by

$$f_1 = C_{DE} + C_{MT} + C_{FC}, \quad (6)$$

where $f_1$ is the economic cost, and $C_{DE}$, $C_{MT}$, and $C_{FC}$ denote the operating costs of diesel generators, micro turbines, and fuel cells, respectively.

The economic cost of a diesel generator is expressed by

$$C_{DE} = \sum_{i=1}^{N_{DE}} \left(a_i + b_i P_{DE_i} + c_i P_{DE_i}^2\right), \quad (7)$$

where $P_{DE_i}$ is the active output of the $i$th DE and $N_{DE}$ is the number of DEs. $a$, $b$, and $c$ are the cost coefficients of the constant, primary, and secondary terms, respectively.

The economic cost of a micro turbine is expressed by

$$C_{MT} = \sum_{j=1}^{N_{MT}} \frac{c_{ng} P_{MT_j}}{\eta_{MT_j} Q_{LHV}}, \quad (8)$$

where $P_{MT_j}$ is the active output of the $j$th MT and $N_{MT}$ is the number of MTs. $c_{ng}$ is the price of gas, $\eta_{MT_j}$ is the efficiency of the $j$th MT, and $Q_{LHV}$ is the lower heating value of natural gas.

The economic cost of a fuel cell is computed by

$$C_{FC} = \sum_{k=1}^{N_{FC}} \frac{c_{ng} P_{FC_k}}{\eta_{FC_k} Q_{LHV}}, \quad (9)$$

where $P_{FC_k}$ is the active output of the $k$th FC, $N_{FC}$ is the number of FCs, and $\eta_{FC_k}$ is the efficiency of the $k$th FC.

2) Network loss (NL): The network loss of the power system is another indicator reflecting its efficient and stable operation. The sum of the losses of all branches is given by

$$f_2 = \sum_{k=1}^{N_{branch}} G_{ij}\left(U_i^2 + U_j^2 - 2 U_i U_j \cos\left(\theta_i - \theta_j\right)\right), \quad (10)$$

where $f_2$ is the network loss, $N_{branch}$ is the total number of network branches, $G_{ij}$ is the conductance of the branch, $U_i$ is the voltage at node $i$ at the head of the line, $U_j$ is the voltage at node $j$ at the end of the line, and $\theta_i$ and $\theta_j$ are the voltage phase angles at nodes $i$ and $j$, respectively.

3) Branch stability index (BSI): When the system is stable, the BSI of every branch must be less than the critical voltage collapse point, and when a collapse occurs in the system, it starts from the weakest branch. Therefore, our goal is to find the weakest branch in the system, which is the branch with the largest branch stability factor, and to keep each branch away from the collapse point. The branch stability factor is given by

$$L_k = \frac{4X}{U_i^2}\left(Q_j - \frac{P_i^2 X}{U_i^2}\right), \quad (11)$$

where $L_k$ is the branch stability factor of branch $k$ with end nodes $i$ and $j$, $X$ is the reactance of the branch, $U_i$ is the voltage at the first node, $P_i$ is the active power injected at the first node, and $Q_j$ is the reactive power output at the end node. The branch stability index is

$$f_3 = \max\left\{L_1, L_2, \ldots, L_{N_L}\right\}. \quad (12)$$
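To make the three dispatch objectives concrete, the following Python sketch evaluates Eqs. (6)-(12) on a toy two-branch example. All numerical values, the unit lists, and the branch tuple layout are illustrative assumptions and are not taken from the paper's test system.

import math

# Illustrative unit data (assumed values, not the paper's parameters).
des = [  # diesel engines: (a, b, c, P_DE) for C_DE = a + b*P + c*P^2, Eq. (7)
    (10.0, 0.25, 0.004, 40.0),
    (12.0, 0.22, 0.003, 55.0),
]
mts = [(0.90, 30.0), (0.88, 25.0)]  # micro turbines: (efficiency, P_MT), Eq. (8)
fcs = [(0.60, 15.0)]                # fuel cells: (efficiency, P_FC), Eq. (9)
c_ng, q_lhv = 2.5, 9.7              # gas price and lower heating value (assumed units)

# Branches: (G_ij, X, U_i, U_j, theta_i, theta_j, P_i, Q_j), assumed per-unit data.
branches = [
    (4.0, 0.05, 1.02, 1.00, 0.02, -0.01, 0.8, 0.3),
    (3.5, 0.06, 1.01, 0.99, 0.01, -0.02, 0.6, 0.2),
]

def economic_cost():
    """f1 = C_DE + C_MT + C_FC, Eqs. (6)-(9)."""
    c_de = sum(a + b * p + c * p ** 2 for a, b, c, p in des)
    c_mt = sum(c_ng * p / (eta * q_lhv) for eta, p in mts)
    c_fc = sum(c_ng * p / (eta * q_lhv) for eta, p in fcs)
    return c_de + c_mt + c_fc

def network_loss():
    """f2: sum of branch losses, Eq. (10)."""
    return sum(g * (ui ** 2 + uj ** 2 - 2 * ui * uj * math.cos(ti - tj))
               for g, _x, ui, uj, ti, tj, _pi, _qj in branches)

def branch_stability_index():
    """f3 = max_k L_k, Eqs. (11)-(12)."""
    return max(4 * x / ui ** 2 * (qj - pi ** 2 * x / ui ** 2)
               for _g, x, ui, uj, ti, tj, pi, qj in branches)

print(economic_cost(), network_loss(), branch_stability_index())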
C. Constraints for MOD

The objective function is subject to constraints composed of equality constraints and inequality constraints. The equality constraint mainly refers to the power balance of the system, i.e., the total power input of the system is equal to the total power output of the system. The power balance constraints are

$$P_{G,i} = P_{load,i} + U_i \sum_{j \in i} U_j\left(G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij}\right), \quad (13)$$
$$Q_{G,i} = Q_{load,i} + U_i \sum_{j \in i} U_j\left(G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij}\right), \quad (14)$$

where $P_{G,i}$ and $Q_{G,i}$ denote the active and reactive power injected by node $i$, respectively; $U_i$ and $U_j$ denote the voltage magnitudes of nodes $i$ and $j$; $\theta_{ij}$ denotes the voltage phase angle difference between nodes $i$ and $j$; $G_{ij}$ and $B_{ij}$ denote the real and imaginary parts of the element in row $i$ and column $j$ of the nodal admittance matrix, respectively; and $P_{load,i}$ and $Q_{load,i}$ denote the active and reactive load demand of node $i$, respectively.

Inequality constraints consist of capacity constraints on the generating units, upper and lower voltage constraints, and safe system operation constraints, given by

$$P_{G,i}^{\min} \le P_{G,i} \le P_{G,i}^{\max}$$
$$Q_{G,i}^{\min} \le Q_{G,i} \le Q_{G,i}^{\max}$$
$$T_j^{\min} \le T_j \le T_j^{\max}$$
$$V_k^{\min} \le V_k \le V_k^{\max}$$
$$SOC^{\min} \le SOC_t \le SOC^{\max}, \quad (15)$$

where $P_{G,i}^{\min}$ and $P_{G,i}^{\max}$ are the lower and upper limits of the active power output of the $i$th generator, $Q_{G,i}^{\min}$ and $Q_{G,i}^{\max}$ are the lower and upper limits of the reactive power output of the $i$th generator, $T_j^{\min}$ and $T_j^{\max}$ are the lower and upper limits of the tap ratio of the $j$th transformer, and $V_k^{\min}$ and $V_k^{\max}$ are the lower and upper limits of the voltage magnitude of the $k$th bus. $SOC_t$ is the state of charge of the ESS at time $t$, and $SOC^{\min}$ and $SOC^{\max}$ are its minimum and maximum values. The state of charge of the ESS reflects its stored energy and is calculated as

$$SOC_t = SOC_{t-1} + \left(P_c^t \eta_c - \frac{P_d^t}{\eta_d}\right)\Delta t, \quad (16)$$

where $\eta_c$ is the charging efficiency, $\eta_d$ is the discharging efficiency, $P_c^t$ is the charging power, $P_d^t$ is the discharging power, and $\Delta t$ is the time step.
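The state-of-charge recursion of Eq. (16), together with the SOC limits of Eq. (15), can be checked with a few lines of code; the efficiencies, limits, and charging schedule below are assumed placeholder values, not the paper's parameters.

# ESS dynamics, Eq. (16): SOC_t = SOC_{t-1} + (P_c*eta_c - P_d/eta_d) * dt
eta_c, eta_d, dt = 0.95, 0.95, 1.0   # charge/discharge efficiencies, time step (assumed)
soc_min, soc_max = 0.2, 0.9          # SOC limits from Eq. (15) (assumed)

def soc_step(soc_prev, p_charge, p_discharge):
    """One step of the SOC recursion; at most one of p_charge/p_discharge is nonzero."""
    return soc_prev + (p_charge * eta_c - p_discharge / eta_d) * dt

soc = 0.5
schedule = [(0.1, 0.0), (0.0, 0.05), (0.0, 0.2)]  # (charging, discharging) power per step
for p_c, p_d in schedule:
    soc = soc_step(soc, p_c, p_d)
    feasible = soc_min <= soc <= soc_max          # inequality check from Eq. (15)
    print(f"SOC = {soc:.3f}, within limits: {feasible}")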
D. Model of MIOD

Interval optimization is used to deal with the uncertainty of wind and PV, and the upper and lower bounds of wind and PV output are obtained using interval prediction. As a result, the MIOD can be expressed as

$$\min F(x) = \left\{f_1^I(x, w), f_2^I(x, w), \ldots, f_M^I(x, w)\right\}, \quad (17)$$
$$\text{s.t.}\quad g_m(x, w) \ge a_m,\qquad h_n(x, w) = b_n, \quad (18)$$

where $g_m(x, w)$ is the constrained interval inequality with $a_m \in [\underline{a}_m, \overline{a}_m]$, $m = 1, 2, \ldots, M$; $h_n(x, w)$ is the constrained interval equation with $b_n \in [\underline{b}_n, \overline{b}_n]$, $n = 1, 2, \ldots, N$; and $f_i^I(x, w)$ is the $i$th objective function of the decision variable $x$ and the interval variable $w \in [w^L, w^U]$, which can be expressed as an interval number

$$f_i^I(x, w) = \left[\underline{f}_i^I(x, w), \overline{f}_i^I(x, w)\right], \quad (19)$$

where $\underline{f}_i^I(x, w)$ and $\overline{f}_i^I(x, w)$ are the lower and upper bounds of the $i$th objective function $f_i^I(x, w)$, which can be determined by

$$\underline{f}_i^I(x, w) = \min_{y \in w} f_i(x, w),\qquad \overline{f}_i^I(x, w) = \max_{y \in w} f_i(x, w). \quad (20)$$

As a consequence, the maximum and minimum values of the objective function of the dispatch solution $x$ over the uncertainty interval $w$ can be obtained. Since $f_i(x, w)$ is an interval number, in order to compare the objective function values corresponding to different dispatch solutions, the interval number needs to be further processed by converting it to a deterministic value, which can be computed by

$$f_i^O(x, w) = \frac{\underline{f}_i^I(x, w) + \overline{f}_i^I(x, w)}{2}. \quad (21)$$

We represent the MIOD model as follows:

$$\min F(x) = \left\{f_1^O(x, w), f_2^O(x, w), \ldots, f_M^O(x, w)\right\}, \quad (22)$$
$$\text{s.t.}\quad f_i^O(x, w) = \frac{\underline{f}_i^I(x, w) + \overline{f}_i^I(x, w)}{2},\qquad g_m(x, w) \ge a_m,\qquad h_n(x, w) = b_n. \quad (23)$$

From this, it can be seen that the MIOD is essentially a two-level optimization, where the decision vector of the outer optimization is $x$ and the decision vector of the inner optimization is $w$.
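A minimal sketch of the interval treatment in Eqs. (19)-(21): the inner optimization over $w \in [w^L, w^U]$ is approximated here by a simple grid search, and the resulting interval is collapsed to its midpoint so that different dispatch solutions can be compared. The grid-based bounding and the toy cost function are assumptions made for illustration only; they do not reproduce the paper's own inner optimization.

import numpy as np

def interval_to_deterministic(objective, x, w_low, w_high, n_grid=50):
    """Approximate Eqs. (19)-(21): bound f_i(x, w) over w in [w_low, w_high]
    by grid evaluation, then return the midpoint used for comparison."""
    grid = np.linspace(w_low, w_high, n_grid)      # realizations of the interval variable w
    values = np.array([objective(x, w) for w in grid])
    f_lower, f_upper = values.min(), values.max()  # Eq. (20)
    return (f_lower + f_upper) / 2.0               # Eq. (21)

# Toy objective: dispatch cost rises when renewable output w deviates from what x assumes.
toy_cost = lambda x, w: (x - w) ** 2 + 0.1 * x
print(interval_to_deterministic(toy_cost, x=0.6, w_low=0.3, w_high=0.8))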
E. Multi-Objective Optimization Solution Discussion

Two categories exist for the handling of multiple objectives in a multi-objective optimization solution. One is to optimize multiple objectives simultaneously: a heuristic algorithm is used to perform non-dominated sorting, screen out the non-dominated solutions during the execution of the algorithm, and finally give the Pareto-optimal solutions and the Pareto frontier. Since different Pareto solutions emphasize different objectives, fuzzy decision-making methods are generally used to evaluate the Pareto-optimal solutions under multiple objectives. The ranking of Pareto solutions can be obtained by fuzzy decision-making, and finally the optimal compromise solution is extracted from the Pareto frontier.

The other category is to transform the multiple objectives into a single objective for optimization. This paper uses DRL to solve the multi-objective optimization problem for microgrids, which requires reward values to be set in DRL to guide the agent toward the optimal solution. In this paper, we can set multiple reward values for the agent, such as an economic cost reward, a system active network loss reward, a branch stability index reward, a unit output limit-violation reward, and a node voltage limit-violation reward. However, in an MDP, the final reward R must be a single value, which means that the set of reward values must ultimately be weighted into a composite reward value. Therefore, the treatment of multi-objective problems in this paper belongs to the second category.

In our view, the difference between DRL and heuristic algorithms in solving multi-objective problems lies in when the weights of the different objectives are set. The heuristic algorithm first solves for the Pareto frontier of the multi-objective optimization problem, then constructs the membership function of each objective, assigns weights to each objective, and uses fuzzy decision-making to rank the alternative solutions in the Pareto frontier to obtain the optimal compromise solution. The DRL algorithm instead sets the reward value that guides the agent's exploration and, according to the relative weights of the objectives, assigns those weights to the corresponding sub-reward values. Instead of searching for many alternative solutions, the agent searches for the solution with the highest combined reward value under the current weights; that is, it searches for a set of values in the solution space that are closest to the ideal solutions of all objective functions under the current weights. Therefore, DRL-based multi-objective solving does not need to deal with Pareto frontiers.

III. DEEP REINFORCEMENT LEARNING FOR MIODMG

A. Reinforcement Learning Framework

In this section, we improve a DRL algorithm to solve the multi-objective interval optimization dispatch for microgrids. Reinforcement learning is performed by an agent interacting with its environment and maximizing cumulative rewards, eventually learning how to choose the best action for different states. The problem can be viewed as a discrete-time stochastic control process, or MDP, defined by a tuple $(S, A, R, P)$. In this MDP, the decision maker is the system operator, and the agent's function is equivalent to that of a dispatcher, who usually wants to ensure system security while keeping the system network loss and operation cost low. The basic components of reinforcement learning under the MIODMG framework are as follows:

1) State: The state of the MIODMG includes the regulated power of the energy storage system, the voltage amplitude at the bus, the reactive power output of the generator, and the active and reactive power demand. The state is represented as

$$s_t = \left\{P_{GESS,t}, V_{L,t}, Q_{G,t}, P_{load,t}, Q_{load,t}\right\}. \quad (24)$$

2) Action: The actions of the MIODMG are the relevant decision variables, including the active power output of the generator, the generator node voltage, and the transformer tap ratio setting. The action is represented as

$$a_t = \left\{P_{G,t}, V_{G,t}, T_{N,t}\right\}. \quad (25)$$

3) Reward: The reward is the optimization objective steering the agent, and the reward values included are the economic cost reward $r_1$, the system active network loss reward $r_2$, the branch stability index reward $r_3$, the reactive output limit-violation reward $r_4$, and the node voltage limit-violation reward $r_5$. These rewards are designed as follows:

$$r_1 = -\left(C_{DE} + C_{MT} + C_{FC}\right), \quad (26)$$
$$r_2 = -\sum_{k=1}^{N_{branch}} G_{ij}\left(U_i^2 + U_j^2 - 2 U_i U_j \cos\left(\theta_i - \theta_j\right)\right), \quad (27)$$
$$r_3 = -\max\left\{L_1, L_2, \ldots, L_{N_L}\right\}, \quad (28)$$
$$r_4 = -\sum_{i=1}^{N_G} \Delta Q_{G,i},\qquad \Delta Q_{G,i} = \begin{cases} \dfrac{Q_{G,i} - Q_{G,i}^{\min}}{Q_{G,i}^{\min}}, & Q_{G,i} \le Q_{G,i}^{\min} \\ \dfrac{Q_{G,i} - Q_{G,i}^{\max}}{Q_{G,i}^{\max}}, & Q_{G,i} > Q_{G,i}^{\max}, \end{cases} \quad (29)$$
$$r_5 = -\sum_{k=1}^{N_D} \Delta V_k,\qquad \Delta V_k = \begin{cases} \dfrac{V_k - V_k^{\min}}{V_k^{\min}}, & V_k \le V_k^{\min} \\ \dfrac{V_k - V_k^{\max}}{V_k^{\max}}, & V_k > V_k^{\max}. \end{cases} \quad (30)$$

The final reward reflects the combination of factors that may affect the operation of the microgrid, and the total reward $r$ is expressed as

$$r = a_1 r_1 + a_2 r_2 + a_3 r_3 + a_4 r_4 + a_5 r_5, \quad (31)$$

where $a_i$ is the relative weighting factor of each reward.
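Because the MDP requires a single scalar reward, the sub-rewards of Eqs. (26)-(30) are weighted into the composite reward of Eq. (31). The sketch below mirrors that composition; the weight vector and all numerical inputs are placeholder assumptions, not the values used in the paper.

def limit_penalty(value, v_min, v_max):
    """Relative limit violation in the style of Eqs. (29)-(30); zero inside the limits."""
    if value < v_min:
        return (value - v_min) / v_min
    if value > v_max:
        return (value - v_max) / v_max
    return 0.0

def composite_reward(cost, loss, bsi, q_gen, q_limits, voltages, v_limits,
                     weights=(0.3, 0.3, 0.2, 0.1, 0.1)):  # placeholder weights a1..a5
    r1 = -cost                                             # Eq. (26)
    r2 = -loss                                             # Eq. (27)
    r3 = -bsi                                              # Eq. (28)
    r4 = -sum(limit_penalty(q, *lim) for q, lim in zip(q_gen, q_limits))     # Eq. (29)
    r5 = -sum(limit_penalty(v, *lim) for v, lim in zip(voltages, v_limits))  # Eq. (30)
    a1, a2, a3, a4, a5 = weights
    return a1 * r1 + a2 * r2 + a3 * r3 + a4 * r4 + a5 * r5  # Eq. (31)

print(composite_reward(cost=120.0, loss=3.2, bsi=0.41,
                       q_gen=[0.5, 1.2], q_limits=[(0.0, 1.0), (0.0, 1.0)],
                       voltages=[1.03, 0.97], v_limits=[(0.95, 1.05), (0.95, 1.05)]))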
B. Soft Actor-Critic

SAC is an off-policy DRL algorithm based on the maximum entropy reinforcement learning framework [35]. It combines stochastic policy optimization with a DDPG-like deterministic policy optimization algorithm: it not only incorporates some of the techniques of DDPG [36] and TD3 [37], but also retains the inherent stochasticity of its policy. At the heart of SAC is entropy regularization, where the agent needs to maximize both the reward value and the entropy value during training. Entropy is a measure of the randomness of a policy; adjusting it allows the agent to explore more or less and prevents the policy from converging too early to a local optimum.

For a probability distribution $P$ over state transitions, the entropy is

$$H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right]. \quad (32)$$

The agent receives a reward proportional to the entropy of the policy at each time step, and the maximum entropy reinforcement learning formulation is

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + \alpha H\left(\pi(\cdot|s_t)\right)\right)\right], \quad (33)$$

where $\pi(\cdot|s_t)$ is the agent's policy, which maps the state space to a distribution over the action space; $\alpha$ is the temperature coefficient that determines the importance of entropy relative to the reward value, which in turn ensures the stochasticity of the optimal policy; and $\gamma$ is the discount factor that penalizes future rewards.

The Bellman equation for $Q^{\pi}$ is

$$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\left[R(s, a, s') + \gamma\left(Q^{\pi}(s', a') + \alpha H\left(\pi(\cdot|s')\right)\right)\right]. \quad (34)$$

The SAC algorithm consists of two kinds of neural networks: the policy network and the Q network. SAC learns both a policy $\pi_{\phi}$ and two Q functions, $Q_{\theta_1}$ and $Q_{\theta_2}$. The Q network parameterized by $\theta$ is used to approximate the soft Q-function $Q_{\theta}(s, a)$, while the policy network parameterized by $\phi$ outputs the mean and variance of the action probability distribution according to the state.

C. Estimation Bias in SAC

RL algorithms originating from the Q-learning class all suffer from value estimation bias. It starts with Q-learning, which uses a greedy estimate of the target value, $y = r + \gamma \max_{a'} Q(s', a')$. When there is an error in the target, the target estimate becomes larger than the true target value, and this overestimation bias is then propagated through the Bellman equation. The overestimation bias therefore carries over to DQN and DDPG, and it is demonstrated in [37] that this theoretical overestimation occurs in practice. TD3 is an extension of DDPG that uses the minimum of two critics as the target value estimate, thereby limiting the overestimation phenomenon. A clipped double Q-learning model for actor-critic is thus proposed:

$$y = r + \gamma \min_{i=1,2} Q_{\theta_i}\left(s', \pi_{\phi_1}(s')\right). \quad (35)$$

Clipped double Q-learning can lead to an underestimation bias, but this is far less harmful than an overestimation bias, since an underestimated value is not propagated through the policy update. However, the underestimation phenomenon occurring in the TD3 algorithm was investigated in [38]: its existence was theoretically demonstrated, and it was experimentally observed that this underestimation bias does affect performance.

Since SAC uses clipped double Q-learning to reduce overestimation bias, underestimation may also exist, and very little has been done to address the underestimation present in SAC. Based on the above observations and in combination with the approach of [38], we propose a "triple-critic" framework for handling the estimation bias of SAC. Combining the overestimation bias of DDPG with the underestimation bias of TD3 allows the estimation bias to fall somewhere in between. Assume that the parameterized functions $Q_{\theta}(s, a)$ and $\pi_{\phi}(a|s)$ are used to estimate the soft Q value and the policy, respectively. The Q value of the corresponding DDPG-style single-critic network is set as

$$Q_1 = Q_{\theta_1}\left(s', a'\right),\qquad a' \sim \pi_{\phi}\left(\cdot|s'\right). \quad (36)$$
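The triple-critic idea can be illustrated by the target-value computation: the DDPG-style single-critic estimate $Q_1$ of Eq. (36) is mixed with the clipped minimum of the remaining pair of target critics, weighted by $\lambda$ as in Algorithm 1, together with the SAC entropy term. The PyTorch-style sketch below is a minimal illustration under these assumptions; the stub policy and critics are placeholders, not the paper's network implementation.

import torch

def tcsac_target(reward, next_state, policy, target_critics, gamma=0.99,
                 alpha=0.2, lam=0.5, done=None):
    """Mix one single-critic estimate (overestimation-prone, Eq. (36)) with the clipped
    minimum of the other two target critics (underestimation-prone, cf. Eq. (35)),
    plus the SAC entropy bonus; lam weighs the two estimates as in Algorithm 1."""
    with torch.no_grad():
        next_action, log_prob = policy(next_state)            # a' ~ pi_phi(.|s'), log pi(a'|s')
        q1 = target_critics[0](next_state, next_action)       # single-critic estimate
        q2 = torch.min(target_critics[1](next_state, next_action),
                       target_critics[2](next_state, next_action))  # clipped pair
        mixed = lam * q1 + (1.0 - lam) * q2 - alpha * log_prob
        not_done = 1.0 if done is None else (1.0 - done)
        return reward + gamma * not_done * mixed

# Tiny stubs standing in for the actual networks (illustrative only):
policy = lambda s: (torch.tanh(s), torch.zeros(s.shape[0], 1))
critics = [lambda s, a, k=k: (s * a).sum(dim=1, keepdim=True) - 0.1 * k for k in range(3)]
y = tcsac_target(torch.ones(4, 1), torch.randn(4, 3), policy, critics)
print(y.shape)

Setting lam = 1 recovers a DDPG-style single-critic target, while lam = 0 recovers the clipped double-Q target of Eq. (35); intermediate values trade off the two biases.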
buffer through reparameterized sampling. Then a sample is drawn from the policy distribution to generate an action value, which is applied to the environment; the environment returns a reward value that reflects the quality of the current action. After the environment receives the action command, it transitions to the next state, producing a transition $(s_t, a_t, r_t, s_{t+1})$ that is stored in the experience replay buffer. Mini-batches are then selected from the experience replay buffer to train the actor network and the critic networks, the estimation bias is reduced by the triple-critic network, and the TD error is obtained and fed back to the actor network. This process is repeated until the agent achieves the optimal decision, and the final dispatch strategy is obtained. The detailed steps of the algorithm are shown in Algorithm 1.

Algorithm 1 TCSAC (Triplet-Critics Comprehensive Experience Replay Soft Actor-Critic)
Input: Initial policy parameters φ; Q-function parameters θ1, θ2, θ3; experience replay buffer D; initial target networks θtarg,1 ← θ1, θtarg,2 ← θ2, θtarg,3 ← θ3
Output: φ, θ1, θ2, θ3
for t = 1, ..., T do
    Select and execute action a ∼ πφ(·|s), observe next state s′ and reward r
    Store transition tuple (s, a, r, s′, ρ) in D
    if it is time to update then
        Sample mini-batches H1 ∼ D, H2 ∼ D
        G = (H1 ∪ H2)_prior
        The following network updates are computed on the mini-batch G
        Set Q1 = Qθtarg,1(s′, ã′), ã′ ∼ πφ(·|s′)
        Set Q2 = min_{i=2,3} Qθtarg,i(s′, ã′), ã′ ∼ πφ(·|s′)
        Compute the target Q value:
            y = r + γ [λQ1 + (1 − λ)Q2 − α log πφ(ã′|s′)]
        Update the Q-functions by one step of gradient descent using
            ∇θi (1/|G|) Σ_{(s,a,r,s′)∈G} (Qθi(s, a) − y)², for i = 1, 2
        Update the policy by one step of gradient ascent using
            ∇φ (1/|G|) Σ_{s∈G} [ min_{i=1,2} Qθi(s, aφ(s)) − α log πφ(aφ(s)|s) ]
        Update the target critic network parameters:
            θtarg,i ← τ θtarg,i + (1 − τ)θi, for i = 1, 2, 3
    end if
end for
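Algorithm 1 draws two mini-batches H1 and H2 from the replay buffer D and keeps a prioritized subset G = (H1 ∪ H2)_prior before computing the updates. The exact prioritization rule is described in a part of the paper not reproduced here, so the sketch below simply assumes a stored per-transition priority (e.g., a TD-error magnitude) and keeps the higher-priority half of the union; it illustrates the sampling step only, not the paper's precise rule.

import random
from collections import deque

class ComprehensiveReplayBuffer:
    """Illustrative buffer: stores (transition, priority) pairs and, at update time,
    merges two sampled mini-batches and keeps the higher-priority half, mimicking the
    G = (H1 ∪ H2)_prior step of Algorithm 1 (priority rule assumed, not the paper's)."""

    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)

    def store(self, transition, priority=1.0):
        self.data.append((transition, priority))

    def sample_merged(self, batch_size):
        h1 = random.sample(list(self.data), batch_size)   # H1 ~ D
        h2 = random.sample(list(self.data), batch_size)   # H2 ~ D
        union = h1 + h2                                   # H1 ∪ H2
        union.sort(key=lambda item: item[1], reverse=True)
        return [t for t, _p in union[:batch_size]]        # keep highest-priority transitions

buffer = ComprehensiveReplayBuffer()
for i in range(64):
    buffer.store(("s%d" % i, "a", 0.0, "s%d" % (i + 1)), priority=random.random())
print(len(buffer.sample_merged(8)))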
IV. SIMULATION STUDIES

A. Simulation Settings

To verify the effectiveness of the proposed DRL-based multi-objective interval optimal dispatch method for microgrids, the modified IEEE 118-bus system is used as a test example for the simulation studies. The system consists of 53 distributed generators, including 15 DEs, 10 MTs, and 8 FCs. The operating parameters of the microgrid are provided in Table I. The uncertain renewable energy output includes 10 WTs and 10 PVs, whose power output interval values are obtained by the interval prediction method, as shown in Fig. 3. The base power of the system is 100 kVA. Bus information and branch parameters can be found in [42] and are modified for the purpose of this analysis.

All case studies were conducted on the same computer with an Nvidia GeForce RTX 3090, an Intel Xeon W-2255 CPU, and 64 GB of RAM. The hyperparameters of the DRL algorithm are provided in Table II. Our simulation study of the three objectives considered, divided according to two-objective optimization and three-objective optimization, can
Fig. 7. Comparison of TCSAC, SAC, and MOPSO algorithms for EC and BSI samples.
Fig. 8. Learning curve of the TCSAC and baseline algorithms in Case 3.
Fig. 9. Comparison of TCSAC, SAC, and MOPSO algorithms for NL and BSI samples.
Fig. 10. Learning curve of the TCSAC and baseline algorithms in Case 4.

better performance with the highest cumulative reward and DDPG with the lowest average reward value.

As shown in Table III, both the ultimate cumulative reward and the standard deviation of TCSAC are improved compared to SAC. TCSAC has the highest values for both the maximum episode reward and the maximum cumulative reward. This indicates that it is able to explore dispatching solutions that outperform those of the other algorithms, which eventually converge to lower values than TCSAC. Similarly, 300 samples are generated for testing to compare the performance differences of the different algorithms. The results of the network loss and line stability index tests are shown in Fig. 9. The difference between the NL and BSI values of the TCSAC and SAC optimizations is very small, but both are significantly better than MOPSO. As shown in Table IV, TCSAC has the lowest mean test result.

4) Case 4: Simultaneous optimization of the EC, NL, and BSI triple objectives. Finding an optimal dispatching strategy while simultaneously optimizing the EC, NL, and BSI objectives is the aim of this case. The weights of EC, NL, and BSI are set to 0.3, 0.3, and 0.4, respectively. The DRL training process is shown in Fig. 10. It can be seen that SAC, TD3, and DDPG start to converge around 2000 episodes, while TCSAC starts to converge around 4000 episodes; nevertheless, TCSAC has the highest cumulative reward and DDPG has the lowest average reward value. As shown in Table III, both the ultimate cumulative reward and the standard deviation of TCSAC are improved compared to SAC. TCSAC has the highest values for both the maximum episode reward and the maximum cumulative reward.

Similarly, 300 samples are generated for testing to compare the performance differences of the different algorithms. The results of the test comparison of network loss and line stability index are shown in Fig. 11. It can be seen that TCSAC has a greater advantage in optimizing EC and BSI; TCSAC is almost on par with SAC in optimizing NL; and the value of NL optimized by MOPSO is at a higher level. The comparison of the training times of the four DRL algorithms (TCSAC, SAC, TD3, and DDPG) is shown in Table V. The training times of all four DRL algorithms are of the same order of magnitude, with the TD3 algorithm in Case 1 taking the least time at 59,888.62 s, an average of 6 s per episode, and the DDPG algorithm in Case 4 taking the most time at 81,362.43 s, an average of 8 s per episode.
Fig. 11. Comparison of TCSAC, SAC, and MOPSO algorithms for EC, NL, and BSI samples.
Fig. 13. MTs operating results for each unit.
Fig. 14. FCs operating results for each unit.
TABLE V. Training Time Comparison (10,000 Episodes, Max 50 Steps per Episode, sec).
TABLE VI. Test Time Comparison (300 Test Samples, sec).

The comparison of the testing times of the three algorithms TCSAC, SAC, and MOPSO over 300 sets of samples is shown in Table VI. DRL takes 2 s on average to test a set of samples, while MOPSO takes 330 s on average. This is one of the major advantages of the DRL algorithm: although it takes a long time to train the agent, a trained and mature agent can make decisions in seconds and therefore has the ability to make decisions in real time.

Finally, the operating results of the microgrid in the four cases are given. The operational results for the DEs are shown in Fig. 12, for the MTs in Fig. 13, for the FCs in Fig. 14, and for the ESS in Fig. 15. It can be seen that in Case 2, the power of the DEs is relatively low, so the power of the MTs and FCs is relatively high; in Case 4, the power of the DEs is relatively high, so the power of the MTs and FCs is relatively low. The ESS is responsible for absorbing the imbalanced power of the system, which is always kept within a stable range.

V. CONCLUSION

In this paper, a multi-objective interval optimization dispatch model for microgrids considering renewable energy uncertainty was proposed. Economic cost, network loss, and branch stability index were considered as the optimization objectives of the microgrid. Interval variables were used to represent the uncertain power output of wind and PV, and
[33] H. C. Hua, Y. C. Qin, C. T. Hao, and J. W. Cao, "Optimal energy management strategies for energy Internet via deep reinforcement learning approach," Appl. Energy, vol. 239, pp. 598–609, Apr. 2019.
[34] E. Foruzan, L. K. Soh, and S. Asgarpoor, "Reinforcement learning approach for optimal distributed energy management in a microgrid," IEEE Trans. Power Syst., vol. 33, no. 5, pp. 5749–5758, Sep. 2018.
[35] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proc. 35th Int. Conf. Mach. Learn. (ICML), Stockholm, Sweden, 2018, pp. 1861–1870.
[36] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," in Proc. 4th Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[37] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in Proc. 35th Int. Conf. Mach. Learn. (ICML), Stockholm, Sweden, 2018, pp. 2587–2601.
[38] D. Wu, X. Dong, J. Shen, and S. C. Hoi, "Reducing estimation bias via triplet-average deep deterministic policy gradient," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4933–4945, Nov. 2020.
[39] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952.
[40] C. Wang and K. Ross, "Boosting soft actor-critic: Emphasizing recent experience without forgetting the past," 2019, arXiv:1906.04009.
[41] C. Banerjee, Z. Chen, and N. Noman, "Improved soft actor-critic: Mixing prioritized off-policy samples with on-policy experiences," IEEE Trans. Neural Netw. Learn. Syst., early access, May 19, 2022, doi: 10.1109/TNNLS.2022.3174051.
[42] "Power systems test case archive." Accessed: Jul. 15, 2016. [Online]. Available: [Link]

Xinying Wang (Senior Member, IEEE) received the Ph.D. degree from the Dalian University of Technology, Dalian, China, in 2015. He has been working with China Electric Power Research Institute, Beijing, China, where he is the Director of Artificial Intelligence Application Research Section. His main fields of interest are artificial intelligence and its application in energy Internet. He is a Senior Member of CSEE.

Zhuo Tang received the B.S. degree in automation from Tianjin University, Tianjin, China, in 2022, where she is currently pursuing the M.S. degree in control engineering with the School of Electrical and Information Engineering. Her research interests include adaptive dynamic programming, neural networks, and related applications.