Sensors 20 03039 v2
Article
A Deep Reinforcement Learning-Based MPPT Control
for PV Systems under Partial Shading Condition
Bao Chau Phan 1 , Ying-Chih Lai 1, * and Chin E. Lin 1,2
1 Department of Aeronautics and Astronautics, National Cheng Kung University, Tainan 701, Taiwan;
[email protected] (B.C.P.); [email protected] (C.E.L.)
2 UAV Center, Chang Jung Christian University, Tainan 701, Taiwan
* Correspondence: [email protected]; Tel.: +886-6-275-7575 (ext. 63648)
Received: 26 April 2020; Accepted: 22 May 2020; Published: 27 May 2020
Abstract: With growing concern for global environmental protection, renewable energy systems have
been widely considered. The photovoltaic (PV) system converts solar power into electricity and
significantly reduces the consumption of fossil fuels and the resulting environmental pollution. Besides introducing
new materials for the solar cells to improve the energy conversion efficiency, the maximum power
point tracking (MPPT) algorithms have been developed to ensure the efficient operation of PV
systems at the maximum power point (MPP) under various weather conditions. The integration of
reinforcement learning and deep learning, named deep reinforcement learning (DRL), is proposed in
this paper as a future tool to deal with the optimization control problems. Following the success of
deep reinforcement learning (DRL) in several fields, the deep Q network (DQN) and deep deterministic
policy gradient (DDPG) are proposed to harvest the MPP in PV systems, especially under a partial
shading condition (PSC). Different from the reinforcement learning (RL)-based method, which is
only operated with discrete state and action spaces, the methods adopted in this paper are used
to deal with continuous state spaces. In this study, DQN solves the problem with discrete action
spaces, while DDPG handles the continuous action spaces. The proposed methods are simulated
in MATLAB/Simulink for feasibility analysis. Further tests under various input conditions with
comparisons to the classical perturb and observe (P&O) MPPT method are carried out for validation.
Based on the simulation results in this study, the performance of the proposed methods is outstanding
and efficient, showing their potential for further applications.
Keywords: solar PV; maximum power point tracking (MPPT); partial shading condition (PSC); deep
Q network (DQN); deep deterministic policy gradient (DDPG)
1. Introduction
Energy demand has been continuously increasing and is predicted to rise at a significant rate in
the future [1]. It leads to the rapid development of renewable energy resources like solar, wind, tidal,
geothermal, etc., for reducing the consumption of fossil fuels and protecting the global environment
from pollution. Besides wind power, solar energy is the most commonly used energy source with
a high energy market share in the energy industry around the world [2]. Due to the continuous decline
in price and the increasing concern of greenhouse gas emissions, lots of photovoltaic (PV) systems
have been intensively constructed, especially in areas with rich solar radiation.
Besides the efforts of improving the production process of the PV module and converter power
electronics for better performance of the system, it is essential to enhance the system throughput with
an efficient maximum power point tracking (MPPT) controller. The MPPT algorithm is employed
in conjunction with a DC/DC converter or inverter to ensure that the PV system always operates at the MPP
under different weather conditions of solar radiation and temperature. Over the years, numerous
MPPT methods have been employed, which can be classified into various categories according to
sensor requirements, robustness, response speed, effectiveness, and memory as shown in these review
papers [2–4]. The conventional MPPT methods [5] have been widely adopted in practice due to their
simplicity and easy implementation. Among them, Perturbation and Observation (P&O) and Incremental
Conductance (IC) are the best-known algorithms. Moreover, many other traditional algorithms have
been introduced by Karami [6], such as Open Circuit Voltage (OV), Ripple Correlation Control (CC),
Short Circuit Current (SC), One-Cycle Control (OCC). Mohapatra [7] has confirmed that conventional
methods can usually perform efficiently under a uniform solar radiation condition. However, their
considerable drawback is that they can become trapped at a local MPP under a partial shading
condition (PSC), resulting in low energy conversion. In addition, a small step size of the duty cycle
causes a longer tracking time, while a large step size causes oscillation around the MPP. Ahmed [8] tried to modify the P&O
method with variable step size to eliminate its drawbacks of slow tracking speed, weak convergence,
and high oscillation. In this scenario, the controller can choose a large step size when the MPP is still far
away. As it approaches the MPP, the small step size is used to reduce the oscillation. Other modified
methods can be found in [2–5].
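The step-size trade-off of the P&O method can be illustrated with a minimal hill-climbing loop. The sketch is an illustration under stated assumptions, not the implementation used in any of the cited works: the function name `perturb_and_observe`, the toy single-peak power curve `toy_curve`, and the choice to perturb the duty cycle directly are all hypothetical.

```python
def perturb_and_observe(power_fn, duty=0.2, step=0.03, n_iter=50):
    """Hill-climb the duty cycle with a fixed step (illustrative P&O sketch).

    power_fn is a hypothetical stand-in for a power measurement taken at a
    given duty cycle; a real controller would read voltage and current sensors.
    """
    direction = 1.0
    prev_power = power_fn(duty)
    history = [duty]
    for _ in range(n_iter):
        # Perturb: move the duty cycle one step and keep it inside [0, 1].
        duty = min(1.0, max(0.0, duty + direction * step))
        power = power_fn(duty)
        # Observe: if the power dropped, the last move was wrong, so reverse.
        if power < prev_power:
            direction = -direction
        prev_power = power
        history.append(duty)
    return history


def toy_curve(d):
    # Toy single-peak power curve with its maximum at duty = 0.5 (assumption).
    return 1.0 - (d - 0.5) ** 2


trace = perturb_and_observe(toy_curve)
```

With a fixed step, the trace keeps oscillating within one step of the peak once it gets there. A larger step reaches the peak sooner but oscillates more widely, which is the behaviour that variable-step variants try to reduce. On a curve with several peaks, the same loop would settle around whichever peak it climbs first.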
Another class of MPPT control is based on soft computing techniques as summarized by
Rezk [4], such as fuzzy logic control (FLC) [9], artificial neural network (ANN) [10], and neuro-fuzzy
(ANFIS) [11,12]. Other methods are based on evolutionary algorithms, such as the genetic
algorithm (GA) [13], cuckoo search (CS) [14], ant colony optimization (ACO) [15], bee colony algorithm
(BCA) [16], bat-inspired optimization (BAT) [17], and the bio-inspired memetic salp swarm algorithm [18].
Jiang [19] has shown that these methods, based on both soft computing techniques and evolutionary
algorithms, can efficiently deal with the nonlinear problem and obtain global solutions or are able to
track the global MPP under PSCs. However, they have two significant disadvantages: they generally
require an expensive microprocessor to keep the computational time low, and knowledge of the specific
PV system to reduce the randomness of convergence. Rezk et al. [4] have shown that the method based on
particle swarm optimization (PSO) is currently popular in the application of MPPT control [20]. It can
uniquely combine with other algorithms to create a new approach for efficiently solving the MPPT
control problems, such as PSO with P&O by Suryavanshi [21], and PSO with GA by Garg [22], etc.
Recently, extensive studies have focused on reinforcement learning (RL) with various successful
applications due to its superior learning ability from environmental-interacting historical data,
instead of the requirement of complex mathematical models of the control system in conventional
approaches [23,24]. As summarized by Kofinas et al. [25], RL has higher convergence stability with
shorter computational time compared to meta-heuristic methods, thus making it a potential tool
for optimally solving the problem of MPPT control. To date, a few studies have focused on this
field, in which Q-learning is the most-used algorithm. In [26], Wei has applied MPPT control for
a variable-speed wind energy system based on Q-learning. The authors in [27] also developed an
MPPT controller for a tidal energy conversion system. Additionally, the works that try to implement
RL for the MPPT control of a solar energy conversion system can be found in [25,28,29]. However,
these approaches have the drawback of small state and action spaces. Kofinas et al. [25] used
a combination of 800 states with five actions to form a space of 4000 state-action pairs, while Hsu
et al. [28] and Youssef [29] used only four states. On the other hand, a system with large state
and action spaces results in longer computational time. Phan and Lai [30] proposed a combination
of Q-learning and P&O methods. Each control area, which is divided based on the temperature and
solar radiation, is handled by a Q-learning controller for learning the optimal duty cycle. These
optimal duty cycles are then forwarded to the P&O controller, which allows a smaller step size to be used. Chou [31]
has developed two MPPT algorithms based on RL: one uses a Q table and the other adopts a Q
network. However, the problems under PSCs are not addressed in the above studies. Instead of using
a trained agent, the approaches [32,33] deal with the MPPT control problem by using multiple agents.
A novel memetic reinforcement learning-based MPPT control for PV systems under partial shading
condition was developed [32] while a transfer reinforcement learning approach was studied to deal
Sensors 2020, 20, 3039 3 of 23
with the problem of global maximum power point tracking [33]. Generally, the major drawback of
the methods, as mentioned above, is the use of small discrete state and action space.
The recent development of machine learning has led to the integration of reinforcement learning
and deep learning, named deep reinforcement learning (DRL), which is considered a powerful
and promising tool to deal with the optimization control problem [34–36]. The successful performance
of the DRL method in playing Atari and Go games is described in the study [37]. DRL is a powerful
method for handling complex control problems with large state spaces. The advantage of DRL is that it
can manage the problem with continuous state and action spaces. To date, DRL has been successfully
applied to several fields, including games [37], robotics [35,38], natural language processing [39],
computer vision [38], healthcare [40], smart grid [41], etc. Zhang [42] has provided a brief overview of
DRL for the power system. A similar concept with deep reinforcement learning has been developed for
MPPT control of the wind energy conversion system, in which a neural network is used as a function
approximation to replace the Q-value table [43,44].
A review of related works and of the achievements of reinforcement learning (RL) shows
that there is a gap in the application of the DRL algorithm for MPPT control. Therefore,
this paper proposes MPPT controllers based on DRL algorithms to harvest the maximum power and
improve the efficient and robust operation of the PV energy conversion systems. In this study, two
model-free DRL algorithms, including deep Q network (DQN) and deep deterministic policy gradient
(DDPG), are introduced to the MPPT controllers. Different from the RL-based method, which can
only operate with discrete state and action spaces, both proposed methods can deal with continuous
state spaces. DQN works with a discrete action space, while DDPG uses a continuous action
space. Rather than using a lookup table to store and learn all possible states and
their values in the RL-based method, which is impossible with large discrete state and action spaces,
the DRL-based method uses neural networks to approximate a value function or a policy function.
The main contributions of this paper are as follows:
• Two efficient and robust DRL-based MPPT controllers for PV systems, using DQN and DDPG, are
proposed and simulated in MATLAB/Simulink.
• Eight scenarios under different weather conditions are considered for testing the performances of
the two proposed methods. They are divided into four scenarios under uniform conditions and
four other scenarios under partial shading conditions, as shown in Table 3.
• A comparison between the proposed method and the P&O method is also investigated.
In this paper, the descriptions of a PV mathematical model and the influence of partial shading
conditions to the location of MPP are introduced in Section 2. The proposed methods based on two
different reinforcement learning algorithms, including DQN and DDPG, are described and formulated
in Section 3. Based on the simulation and the comparison results in Section 4, the performance of
the proposed methods appears very outstanding and efficient in PV operation. Finally, the conclusion
and future work are presented in Section 5.
$$I_{sh} = \frac{V + I R_s}{R_p} \tag{2}$$
where Rs is the series resistance because of all the components that come in the path of the current
which is desired as low as possible, and Rp is parallel resistance which is desired as high as possible.
Additionally, Iph is the light generated current, which is proportional to the light intensity. It is
calculated by
$$I_{ph} = \left[I_{sc} + K_I (T_c - T_r)\right] \times \frac{G}{G_{STC}} \tag{3}$$
where $I_{sc}$ is the short-circuit current at standard testing condition (STC) (T = 25 °C, $G_{STC}$ = 1000 W/m²)
and $K_I$ is the cell short-circuit current temperature coefficient. $T_c$ is the cell operating temperature,
while $T_r$ is the reference temperature, and G is the solar irradiation. $I_d$ is the diode current, which is
given by
$$I_d = I_0\left[\exp\left(\frac{q V_d}{A k T_c}\right) - 1\right] \tag{4}$$
where $q = 1.6 \times 10^{-19}$ C is the electronic charge, $k = 1.38 \times 10^{-23}$ J/K is the Boltzmann constant, and A is
the ideality factor of the diode. $I_0$ is the reverse saturation current of the diode, while $V_d$ is the voltage of
the equivalent diode. The latter is calculated by
$$V_d = V + I R_s \tag{5}$$
PV cells are usually connected in series to become a PV module. A simple mathematical model
for calculating the current of a PV module, which is simultaneously dependent on the solar irradiation
and temperature, is given by
$$I_{pv} = I_{ph} - I_0\left[\exp\left(\frac{q(V + I R_s)}{A k T_c N_s}\right) - 1\right] \tag{6}$$
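Because Eq. (6) is implicit in the current (I appears on both sides through the I·Rs term), it has to be solved numerically. The following is a minimal sketch that does so by bisection. The short-circuit current and its temperature coefficient are taken from the module specifications (0.09 %/°C of 9 A is 0.0081 A/°C); the values of `i_0`, `r_s`, `a`, and `n_s` are illustrative guesses rather than values from the paper, and the function names are hypothetical. Treating the current as zero beyond open circuit is also an assumption.

```python
import math

Q = 1.6e-19   # electronic charge (C), value quoted in the text
K = 1.38e-23  # Boltzmann constant (J/K), value quoted in the text


def photo_current(g, t_c, i_sc=9.0, k_i=0.0081, t_r=298.15, g_stc=1000.0):
    """Eq. (3): light-generated current; temperatures in kelvin."""
    return (i_sc + k_i * (t_c - t_r)) * g / g_stc


def module_current(v, g=1000.0, t_c=298.15, i_0=9e-9, r_s=0.3, a=1.3, n_s=72):
    """Solve the implicit Eq. (6) for the module current by bisection.

    i_0, r_s, a and n_s are illustrative guesses, not values from the paper.
    Intended for v >= 0.
    """
    i_ph = photo_current(g, t_c)
    v_t = a * K * t_c * n_s / Q  # the term A*k*Tc*Ns/q, in volts

    def residual(i):
        # Right-hand side of Eq. (6) minus the current itself.
        return i_ph - i_0 * (math.exp((v + i * r_s) / v_t) - 1.0) - i

    if residual(0.0) <= 0.0:
        return 0.0  # past open circuit; assume reverse current is blocked
    lo, hi = 0.0, i_ph  # residual(lo) > 0 >= residual(hi)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Sweeping `v` from zero to the open-circuit voltage and multiplying by the returned current gives a P–V curve for a single module at the chosen irradiation and temperature.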
Specifications                           Value
Maximum power (W)                        334.905
Voltage at MPP (V)                       41.5
Current at MPP (A)                       8.07
Open-circuit voltage, Voc (V)            49.9
Short-circuit current, Isc (A)           9
Temperature coefficient of Voc (%/°C)    −0.36
Temperature coefficient of Isc (%/°C)    0.09
$$D = 1 - \frac{V_{in}}{V_{out}} \tag{7}$$
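As a quick numerical illustration of Eq. (7), with values chosen only for the example (an input held at the module MPP voltage of 41.5 V and an assumed output voltage of 83 V):

$$D = 1 - \frac{41.5}{83} = 0.5$$

Rearranged as $V_{in} = (1 - D)\,V_{out}$, the same relation shows that, for a fixed output voltage, raising the duty cycle lowers the input voltage seen by the PV source, which is how a perturbation of D moves the operating point.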
In this paper, two deep reinforcement learning algorithms are applied for MPPT control, including
DQN and DDPG. The principles of these two algorithms, applied for MPPT control of a PV system, are
introduced in the next section.
Over the long run, the maximum cumulative reward is achieved by an optimal policy π∗ . At this
time the best value function and action-value function are given by [23]
One of the most interesting areas of AI today is the deep reinforcement learning (DRL) algorithm,
where an agent can learn on its own based upon the interacting results with a specific environment.
DRL, which is the combination of RL and deep learning, has significantly achieved great success in
various fields, such as robotics, games, natural language processing, and the management of finance
and business. One of the major disadvantages of RL is its use of a look-up table to store and index values, which is
sometimes impossible for real-world problems with large state-and-action spaces. Hence, a neural
network can be adopted to approximate a value function, or a policy function [37,51]. That is, neural
nets can learn how to map states or state-action pairs to Q values.
As shown in Figure 1, there are two types of solution methods: model-based and model-free.
In model-based DRL, the model is known or learned. The strong advantage of the model-based
method is that it requires few samples to learn. However, it becomes far more computationally complex
when the model is difficult to learn. Model-free RL, on the other hand, is more favorable to deal with:
it needs no accurate representation of the environment to be effective and is also less
computationally complex. Model-free DRL is divided into value-based and policy-based methods.
Value-based methods try to improve the value function at every iteration until the convergence condition is reached.
Here, the objective function and updating method are given below [36,42]:
$$J(\theta) = \mathbb{E}\left[\left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a_{t+1} \mid \theta) - Q(s_t, a_t \mid \theta)\right)^{2}\right] \tag{14}$$
$$\theta_{t+1} = \theta_t + \alpha\left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a_{t+1} \mid \theta) - Q(s_t, a_t \mid \theta)\right)\nabla_{\theta} Q(s_t, a_t \mid \theta) \tag{15}$$
where $\alpha$ is the learning rate and $\theta$ denotes the weights of the neural network.
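The value-based update in Eq. (15) can be sketched for the simplest possible approximator. The example swaps the neural network for a linear function Q(s, a | θ) = θ · φ(s, a), so that the gradient with respect to θ is just the feature vector φ; this substitution, the feature vectors, and the function names are assumptions made for illustration only.

```python
def q_value(theta, features):
    """Linear approximation Q(s, a | theta) = theta . phi(s, a)."""
    return sum(w * x for w, x in zip(theta, features))


def td_update(theta, phi_sa, reward, phi_next_all, alpha=0.1, gamma=0.9):
    """One step of Eq. (15) for a linear Q, whose gradient w.r.t. theta is phi.

    phi_next_all holds one feature vector per action available in the next
    state, so the max runs over the next-state actions.
    """
    target = reward + gamma * max(q_value(theta, phi) for phi in phi_next_all)
    td_error = target - q_value(theta, phi_sa)
    new_theta = [w + alpha * td_error * x for w, x in zip(theta, phi_sa)]
    return new_theta, td_error


theta, err = td_update([0.0, 0.0], [1.0, 0.0], 1.0, [[0.0, 1.0], [1.0, 0.0]])
```

Starting from zero weights, the temporal-difference error equals the reward, and only the weight attached to the active feature moves. A deep Q network applies the same rule, with the gradient obtained by backpropagation instead of being read off the features.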
Figure 1. Introduction of deep reinforcement learning (DRL) algorithms.
In the policy-based methods, the quantity of interest is directly optimized by updating the policy
at each time step and computing the value based on this new policy until the policy converges.
Firstly, the gradient of the objective function is defined, and the weight matrix is then updated as
below [36,42]:
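$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \sum_{t=0}^{T} r(s_t, a_t)\right] \tag{16}$$
$$\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta) \tag{17}$$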
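A single-episode estimate of Eqs. (16) and (17) can be sketched for a toy softmax policy. To keep the example short, the policy ignores the state and holds one preference per action, so the gradient of log π(a) with respect to preference k is 1 if k = a and 0 otherwise, minus π(k). The state-independent policy, the function names, and the sample episode are assumptions for illustration.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, actions, rewards, alpha=0.1):
    """Eqs. (16)-(17) for a state-independent softmax policy (toy case)."""
    probs = softmax(theta)
    episode_return = sum(rewards)
    grad = [0.0] * len(theta)
    for a in actions:
        # d/d theta_k of log pi(a) is 1[k == a] - pi_k for a softmax policy.
        for k in range(len(theta)):
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * episode_return
    # Gradient ascent step of Eq. (17).
    return [t + alpha * g for t, g in zip(theta, grad)]


theta = reinforce_update([0.0, 0.0, 0.0], actions=[0, 0], rewards=[1.0, 1.0])
```

After one rewarded episode in which action 0 was taken twice, the probability of action 0 rises above its initial value of one third, while the other two actions become less likely.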
3.2. Markov Decision Process Model of a PV System
To implement an RL or DRL approach on MPPT control of a PV system, a Markov Decision
Process (MDP) model of the PV system behavior needs to be defined. Almost all RL problems can be
considered as MDPs. Before starting a detailed description of the deep reinforcement learning, this
part provides short background information on the concept of MDP models applied for the MPPT
control problem.
Formally, an MDP is considered as a tuple (S, A, T, R). S is a finite set of states that describes
all the operating points of the PV system, while A is a finite set of actions, namely the perturbations
of the duty cycle applied to the converter to change the operating state of the PV source. T is
the transition function, while R is the reward function, representing how much immediate reward we
expect to get at the moment when an action is performed from a current state. They are given by [23]
The agent learns a strategy, or policy, that maximizes the total reward obtained over an episode.
Thus, we reinforce the agent with positive rewards for choosing a correct
action with good performance, and with negative rewards for poor performance [23]. For
the implementation of RL and DRL on MPPT control, the state and
action spaces, as well as the reward, are defined. The observation is represented by the combination of
voltage, current, duty cycle, and its perturbation as below [25,28]:
$$S = \left\{V_{pv}, I_{pv}, D, \Delta D\right\} \tag{20}$$
Action-spaces are the perturbations of duty cycle ∆D, including negative, positive, and no change:
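$$r = r_1 + r_2 + r_3 \tag{22}$$
$$r_1 = \frac{P_{t+1}}{P_{MPP,STC}} \tag{23}$$
$$r_2 = \begin{cases} \left(\dfrac{P_{t+1}}{P_{MPP,STC}}\right)^{2} & \text{if } \Delta P \geq -\delta_1 \\ 0 & \text{if } \Delta P < -\delta_1 \end{cases} \tag{24}$$
$$r_3 = \begin{cases} 0 & \text{if } 0 \leq D \leq 1 \\ -1 & \text{otherwise} \end{cases} \tag{25}$$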
where ∆P = Pt+1 − Pt , δ1 stands for the small number considered as the small area around the maximum
power point and used for preventing the, PMPP,STC is the MPP at STC. Here, the reward function
includes three components. First, r1 is the reward received at every time step in an episode. It
helps the agent to distinguish local and global MPPs, since higher rewards are received if the agent
stays at the global MPP. Second, through r2, the agent obtains a positive
reward if the power increases (or drops by no more than δ1) and zero reward otherwise. Finally, through r3, the agent receives a penalty if the
duty cycle leaves its boundary.
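The three reward components can be sketched as a small function. The sketch reads r2 as the square of the normalized power, and it uses a placeholder of 1.0 W for δ1 because its value is not given here; both are assumptions, as is the function name `mppt_reward`.

```python
def mppt_reward(p_next, p_prev, duty, p_mpp_stc, delta_1=1.0):
    """Reward r = r1 + r2 + r3 following Eqs. (22)-(25).

    delta_1 = 1.0 W is a placeholder; the text does not give its value here.
    r2 is taken as the squared normalized power (an assumed reading).
    """
    ratio = p_next / p_mpp_stc
    # Eq. (23): normalized power, received at every time step.
    r1 = ratio
    # Eq. (24): bonus unless the power dropped by more than delta_1.
    r2 = ratio ** 2 if (p_next - p_prev) >= -delta_1 else 0.0
    # Eq. (25): penalty when the duty cycle leaves [0, 1].
    r3 = 0.0 if 0.0 <= duty <= 1.0 else -1.0
    return r1 + r2 + r3
```

For example, with `p_mpp_stc=200.0`, a step from 90 W to 100 W at a duty cycle of 0.5 gives 0.5 + 0.25 + 0 = 0.75, while the same powers with a duty cycle of 1.2 give −0.25 because of the boundary penalty.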
Figure 2. A diagram of the deep Q network (DQN) algorithm.
$$Q_{target} = r + \gamma Q'\left(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \tag{31}$$
$$Q_{predict} = Q(s, a \mid \theta^{Q}) \tag{32}$$
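The target and predicted values in Eqs. (31) and (32) can be sketched with plain callables standing in for the networks. The mean squared error used to combine them is the usual DDPG critic loss and is an assumption here, as are the function names and the transition tuple layout (state, action, reward, next state).

```python
def critic_target(reward, next_state, target_actor, target_critic, gamma=0.9):
    """Eq. (31): bootstrap target built from the two target networks."""
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def critic_loss(batch, critic, target_actor, target_critic, gamma=0.9):
    """Mean squared gap between Eq. (31) and Eq. (32) over a mini-batch.

    The squared-error form is the standard DDPG choice (assumed here).
    """
    errors = []
    for state, action, reward, next_state in batch:
        q_target = critic_target(reward, next_state,
                                 target_actor, target_critic, gamma)
        q_predict = critic(state, action)  # Eq. (32)
        errors.append((q_target - q_predict) ** 2)
    return sum(errors) / len(errors)
```

In a full agent, the callables would be neural networks and the loss would be minimized with respect to the critic weights only, while the target networks are held fixed during the step.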
The actor is updated by maximizing the expected return $J$ with the sampled policy gradient as
follows [24]:
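$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[\nabla_{\theta^{\mu}} Q\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right)\right] = \mathbb{E}\left[\nabla_{\mu(s)} Q\left(s, \mu(s) \mid \theta^{Q}\right) \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\right] \tag{33}$$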
Figure 3 shows a diagram of the DDPG method, while Supplementary Figure S7 describes the steps
in the DDPG algorithm. As in DQN and many other RL algorithms, DDPG also uses a replay
buffer to sample experience to update the neural network parameters. In addition, a mini-batch, randomly
sampled from the replay buffer, is used to update the value and policy networks. These help
the learning process to be more stable [42]. Compared to DQN, where the target network is updated
every few time steps by directly copying the weights from the prediction network, in DDPG, the target
networks are updated every time step, following the soft update as given below [24,50]:
$$\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1 - \tau)\theta^{\mu'} \tag{35}$$
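The soft update in Eq. (35) amounts to an exponential moving average of the online weights. A minimal sketch, with the weights held in flat lists and the function name `soft_update` chosen for illustration:

```python
def soft_update(target_weights, online_weights, tau=0.001):
    """Eq. (35): move each target weight a fraction tau toward the online one."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]
```

With a small τ, the target network trails the online network slowly, which keeps the bootstrap target from shifting abruptly. Setting `tau=1.0` reduces to the hard copy used for the DQN target network.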
the system was operated with a total time of 0.5 s in an episode and a time step of 0.01 s. The simulation
was conducted over 1000 episodes for both methods. The network layout and number of layers used
in this study are those recommended by MathWorks. The deep neural networks used to approximate
the critic, shown in Supplementary Figure S8, have the same settings for both DQN and DDPG; the critic
net approximates the action-value function. Supplementary Figure S9 shows
the actor net, which is used to select the action that maximizes the discounted reward. A fully
connected layer applies a linear function that multiplies the input by a weight matrix. The ReLU layer
applies the rectified linear unit, the most popular activation function in deep neural networks.
The hyperbolic tangent activation function, marked as a tanh layer, is used to constrain the output action
to the range (−1, 1). A linear layer is then applied to scale the output of
the tanh layer to the desired magnitude. In addition, the Adam optimization method is applied for
the training of the neural networks. The learning rate (α) is 0.001 for the critic
networks in both proposed algorithms, while the learning rate of the actor network is 0.0001. The action
space of DQN is [−0.03, −0.01, −0.005, −0.001, 0, 0.001, 0.005, 0.01, 0.03], while that of DDPG is the range
(−0.03, 0.03). Moreover, the step of the duty cycle used in the P&O method is 0.03.
Finally, the other setting parameters are indicated in Table 2.
Specifications                          Value
Replay memory size                      10^6
Batch size                              512
Discount factor (γ)                     0.9
DQN
  Exploration rate (ε)                  1
  Decay of exploration rate             0.0001
  Exploration rate minimum (εmin)       0.001
DDPG
  Initial variance                      0.4
  Decay of initial variance             0.0001
  Smoothing factor (τ)                  0.001
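The DQN exploration settings can be read as an epsilon-greedy rule over the nine discrete duty-cycle perturbations. The sketch assumes a multiplicative decay, ε ← ε(1 − decay), floored at the minimum; the text does not state how the decay is applied, so this schedule and the function names are assumptions.

```python
import random

# Discrete duty-cycle perturbations used as the DQN action space.
DQN_ACTIONS = [-0.03, -0.01, -0.005, -0.001, 0.0, 0.001, 0.005, 0.01, 0.03]


def decay_epsilon(epsilon, decay=0.0001, epsilon_min=0.001):
    """Shrink the exploration rate multiplicatively (assumed schedule)."""
    return max(epsilon_min, epsilon * (1.0 - decay))


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice of a duty-cycle perturbation.

    q_values holds one estimated action value per entry of DQN_ACTIONS.
    """
    if rng.random() < epsilon:
        # Explore: pick any perturbation uniformly at random.
        index = rng.randrange(len(DQN_ACTIONS))
    else:
        # Exploit: pick the perturbation with the highest estimated value.
        index = max(range(len(DQN_ACTIONS)), key=lambda i: q_values[i])
    return DQN_ACTIONS[index]
```

Starting from ε = 1, every action is random at first; as ε decays toward its minimum, the agent increasingly follows the action with the largest estimated Q value.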
Figure 4. Training process of the DQN method.
Figure 5. Training process of the DDPG method.
In the following part, the first set of simulations under the standard test condition (STC,
G = 1000 W/m² and T = 25 °C) is carried out to validate the implementation and evaluate
the DRL-based MPPT controllers under different operating conditions. In this scenario,
the performances of the two proposed methods are tested under a standardized operating condition and
compared with the traditional P&O MPP tracking method. The simulated results of this
scenario are illustrated in Figure 6. As can be seen in the figure, the MPP is tracked after just about
0.07 s by the DQN-based method, while DDPG and P&O have almost the same tracking speed. On
the other hand, the DQN and DDPG methods are more stable. Whereas the P&O method oscillates with
high magnitude, the two proposed methods operate with a constant duty cycle at about 0.5, which
results in low oscillation of the P–V curve. Based on the results in this scenario, the power tracking
efficiencies of the DQN and DDPG methods are about 5.83% and 3.21% higher, respectively,
than that of the P&O method.
Next, the DRL-based methods are tested under the change of both irradiation and temperature
as shown in Figure 11. The irradiation starts at 1000 W/m² and gradually decreases to
600 W/m², while the temperature is set to 40 °C at the beginning and declines to
20 °C at the end. The performances of the three methods are demonstrated in Figure 12.
In the graph, the red line is for the DQN method, while the blue and green lines indicate
DDPG and P&O, respectively. The graphs on the left-hand side illustrate the output power, while
the right-hand side graphs show the duty cycle. Under the step change of weather conditions, as
shown in the first second and the last second of the graphs, the DQN method has the best performance,
resulting in the lowest oscillation of the duty cycle and output power. It is followed by DDPG and
P&O, respectively. However, under the gradual change of both temperature and irradiation, as shown
from 1–4 s, DDPG follows the power path better than the other methods, so its duty cycle curve
oscillates less. Thus, the DDPG method has the highest efficiency, followed by the DQN and P&O
methods. Compared to the P&O method, the power tracking efficiency of the DDPG method
increases by 1.62%, while that of the DQN method increases by just about 1.58%.
Figure 12. PV power under the change of both irradiation and temperature.
4.4. Performance under PSC
In this section, different partial shading conditions are applied for the testing and validation of
the proposed methods. There are three PV modules in the PV system and they are connected in series.
Firstly, a uniform weather condition at 900 W/m² is applied and the tracking results are displayed
in Figure 13. Then, the scenario with one shaded PV module is tested, followed by two shaded PV
modules and three shaded PV modules. Under this uniform condition, the theoretical value of the MPP
is equal to about 902.8 W. As can be seen from the graph, the DQN method, marked as the red line, has
the best tracking speed with the lowest oscillation around the MPP, resulting in the flat duty cycle curve.
In contrast, the P&O method, marked as the green line, has the poorest performance with the highest
oscillation of the duty cycle. When compared to the P&O method in terms of power tracking efficiency,
the DQN is higher with a 3.35% increase, while that of the DDPG method is just 3.17%.
Figure 14. PV power under partial shading condition (PSC) with one shaded PV module.
Figure 15. PV power under PSC with two shaded PV modules.
A summary of the power tracking efficiency under the different scenarios simulated in this study is
given in Table 3. In most cases, the proposed methods track the MPP better than the P&O method;
however, they cannot always reach the global MPP. For example, scenario 8 illustrates a case where
the proposed methods fail to track the global MPP. Figure 17 shows the P–V curves of the PV array
under a uniform condition and under a PSC with two shaded PV modules (900, 300, 250 W/m²). There
are three peaks on the graph, consisting of two local MPPs and one global MPP. In this scenario, the
global MPP drops significantly, from about 902.8 W to around 288.3 W. As can be seen from the
tracking results in Figure 18, the DQN and DDPG methods track more power than the P&O method, with
power increases of 17.9% and 15.4%, respectively. However, instead of settling at the global MPP of
about 288.3 W, they only detect a local MPP of around 270 W. Thus, further study should be
conducted to improve these promising and efficient methods.
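To make the local-versus-global distinction in scenario 8 concrete, the following minimal sketch shows how a tracker that only explores its immediate neighbourhood can settle on a local peak of a multi-peak P–V curve. It is not the authors' PV model or their DRL controllers. The curve is synthetic: only the peak powers of 270 W and 288.3 W are taken from the text, while the third peak, the peak voltages, the Gaussian shape, and the names `pv_power` and `hill_climb` are assumptions made for illustration.

```python
import math

# Hypothetical three-peak P-V curve. The 270 W and 288.3 W peak powers come
# from the text; the 200 W peak, the voltages and the width are assumptions.
PEAKS = [(20.0, 200.0), (50.0, 270.0), (80.0, 288.3)]  # (voltage in V, power in W)
WIDTH = 5.0  # assumed Gaussian width of each peak, in volts


def pv_power(v):
    """Synthetic array power (W) at operating voltage v (V)."""
    return sum(p * math.exp(-((v - v0) ** 2) / (2 * WIDTH ** 2)) for v0, p in PEAKS)


def hill_climb(v_start, step=0.5, max_iters=1000):
    """Greedy local search: move one step at a time while the power increases."""
    v = v_start
    for _ in range(max_iters):
        best = max((v - step, v, v + step), key=pv_power)
        if best == v:  # neither neighbour is better, so a peak has been reached
            break
        v = best
    return v, pv_power(v)


v_local, p_local = hill_climb(45.0)    # starts in the basin of the 270 W peak
v_global, p_global = hill_climb(75.0)  # starts in the basin of the 288.3 W peak
shortfall = 1.0 - p_local / p_global   # fraction of the global MPP left unharvested

print(f"start 45 V -> settles at {v_local:.1f} V, {p_local:.1f} W")
print(f"start 75 V -> settles at {v_global:.1f} V, {p_global:.1f} W")
print(f"shortfall at the local peak: {shortfall:.1%}")
```

With these assumed numbers, settling at the 270 W peak leaves roughly 6% of the 288.3 W global MPP unharvested, which matches the ratio 270/288.3 implied by the values quoted above.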
Figure 18. PV power under PSC with local maximum power point (MPP) tracking.

Table 3. MPPT tracking efficiency of the proposed methods under various weather conditions compared
to P&O.

Scenario   Weather Conditions     DQN      DDPG
1          Uniform 1000 W/m²      5.83%    3.21%
2          G changes              1.24%    0.96%
3          T changes              2.74%    2.55%
4          Both T and G change    1.62%    1.58%
5          900, 900, 350 W/m²     38.3%    44.6%
6          900, 350, 300 W/m²     25.9%    22.1%
7          500, 800, 600 W/m²     0.56%    0.92%
8          900, 300, 250 W/m²     17.9%    15.4%

5. Conclusions
Besides the development of materials for PV cells to improve the power conversion efficiency, it is
essential to develop a new MPPT method which can accurately extract the MPP with high tracking
speed under various weather conditions, especially under PSCs. In this study, two robust MPPT
controllers based on DRL are proposed, namely DQN and DDPG. Both algorithms can handle
problems with continuous state spaces. DQN is applied with discrete action spaces, while
DDPG can deal with continuous action spaces. The advantage of these two methods is that no prior
model of the control system is needed. The controllers learn how to act after being trained based
on the reward received through continuous interaction with the environment.
Rather than using a look-up table as in the RL-based method, DRL uses neural networks to
approximate a value function or a policy, so that the high memory requirement for sizeable discrete
state and action spaces can be significantly reduced. Here, the environment is the PV system, that is,
the object that the agent acts on. The agent represents the DRL algorithm, while the action is
the perturbation of the duty cycle. The interaction starts with the environment sending a state to the
agent, which, based on its knowledge, takes an action in response to that state. Then, the environment
responds with the next state and a reward. The agent learns how to take actions
based on the reward and the current state received from the environment. After being trained
on the historical data collected by direct interaction with the power system, the proposed MPPT
methods autonomously regulate the perturbation of the duty cycle to extract the best MPP.
To sum up, compared to the traditional P&O method, the DRL-based MPPT methods applied in
this study have a better performance. They can accurately detect the MPP with a high tracking
speed, especially the global MPP under partial shading conditions. In most of the cases, the DQN
method overtakes the DDPG method. However, when the partial shading condition happens, the DDPG
method slightly outstrips the DQN method. The simulated results show the outstanding performance
of the proposed MPPT controllers. However, the limitation of this study is that the proposed methods
cannot always detect the global MPP. Thus, further study will be conducted in the future to improve
the tracking ability of DRL-based methods. Furthermore, real-time experiments will be carried out
for validation.
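The agent and environment interaction summarized in the conclusions can be sketched as follows. This is only an illustrative skeleton under stated assumptions, not the authors' MATLAB/Simulink implementation: `ToyPVEnv`, its single-peak power curve, the reward defined as the change in power, the action set `DISCRETE_ACTIONS`, and the random placeholder policy are all hypothetical. The sketch shows the loop structure (state, duty-cycle perturbation, next state and reward) and the difference between a discrete action choice, as in DQN, and a clipped continuous action, as in DDPG.

```python
import random


class ToyPVEnv:
    """Hypothetical stand-in for the PV system (not the authors' Simulink model).

    Power is an assumed single-peak function of the converter duty cycle.
    """

    def __init__(self, duty=0.2):
        self.duty = duty
        self.power = self._power(duty)

    @staticmethod
    def _power(duty):
        # Assumed curve: 100 W peak at duty = 0.6, falling off quadratically.
        return max(0.0, 100.0 * (1.0 - ((duty - 0.6) / 0.6) ** 2))

    def state(self):
        return (self.duty, self.power)

    def step(self, delta_duty):
        """Apply a duty-cycle perturbation and return (next_state, reward)."""
        self.duty = min(1.0, max(0.0, self.duty + delta_duty))
        new_power = self._power(self.duty)
        reward = new_power - self.power  # assumed reward: change in power
        self.power = new_power
        return self.state(), reward


# DQN-style action space: the agent picks one index from a finite set.
DISCRETE_ACTIONS = [-0.05, -0.01, 0.0, 0.01, 0.05]


def discrete_action(index):
    return DISCRETE_ACTIONS[index]


def continuous_action(raw, limit=0.05):
    """DDPG-style action: any real-valued perturbation, clipped to +/- limit."""
    return max(-limit, min(limit, raw))


env = ToyPVEnv()
state = env.state()
rng = random.Random(0)
transitions = []
for _ in range(5):
    # Placeholder policy: a trained agent would choose the action from `state`.
    action = discrete_action(rng.randrange(len(DISCRETE_ACTIONS)))
    next_state, reward = env.step(action)
    # A DRL agent would store this transition and update its networks here.
    transitions.append((state, action, reward, next_state))
    state = next_state
```

In a real controller the random choice would be replaced by the output of the trained network, and the stored transitions would feed the training updates.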
module under various temperatures; Figure S3. Diagram of three PV modules in series; Figure S4. P–V curve
under uniform condition and PSC; Figure S5. Diagram of a typical PV system; Figure S6. DQN algorithm; Figure
S7. DDPG algorithm; Figure S8. The structure of critic network in both DQN and DDPG algorithms; Figure S9.
The structure of actor network in DDPG algorithm.
Author Contributions: Conceptualization, B.C.P., Y.-C.L., and C.E.L.; methodology, Y.-C.L.; software, B.C.P.;
validation, Y.-C.L.; formal analysis, B.C.P.; investigation, B.C.P.; resources, B.C.P. and Y.-C.L.; data curation,
B.C.P.; writing—original draft preparation, B.C.P.; writing—review and editing, C.E.L. and Y.-C.L.; visualization,
B.C.P. and Y.-C.L.; supervision, C.E.L., and Y.-C.L. All authors have read and agreed to the published version of
the manuscript.
Funding: This work is supported by the Ministry of Science and Technology of Taiwan (MOST) under contract
MOST 108-2622-E-309-001-CC1.
Conflicts of Interest: The authors declare no conflict of interest with any institutes or organizations.
Nomenclature
G Irradiation (W/m²)
T Temperature (°C)
q Electronic charge (C)
k Boltzmann’s constant
π A policy
Vπ Value function
J Objective function
L Loss function
Qπ Action-value function
a Action
r Reward
s State
θ Weight matrix
γ Discount factor
ε Exploration rate
I Current (A)
V Voltage (V)
P Power (W)
References
1. Lin, C.E.; Phan, B.C. Optimal Hybrid Energy Solution for Island Micro-Grid. In Proceedings of the 2016 IEEE
International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking
(SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom),
Atlanta, GA, USA, 8–10 October 2016; pp. 461–468.
2. Belhachat, F.; Larbes, C. A review of global maximum power point tracking techniques of photovoltaic
system under partial shading conditions. Renew. Sustain. Energy Rev. 2018, 92, 513–553. [CrossRef]
3. Ramli, M.A.M.; Twaha, S.; Ishaque, K.; Al-Turki, Y.A. A review on maximum power point tracking for
photovoltaic systems with and without shading conditions. Renew. Sustain. Energy Rev. 2017, 67, 144–159.
[CrossRef]
4. Rezk, H.; Fathy, A.; Abdelaziz, A.Y. A comparison of different global MPPT techniques based on meta-heuristic
algorithms for photovoltaic system subjected to partial shading conditions. Renew. Sustain. Energy Rev. 2017,
74, 377–386. [CrossRef]
5. Danandeh, M.A.; Mousavi, G.S.M. Comparative and comprehensive review of maximum power point
tracking methods for PV cells. Renew. Sustain. Energy Rev. 2018, 82, 2743–2767. [CrossRef]
6. Karami, N.; Moubayed, N.; Outbib, R. General review and classification of different MPPT Techniques.
Renew. Sustain. Energy Rev. 2017, 68, 1–18. [CrossRef]
7. Mohapatra, A.; Nayak, B.; Das, P.; Mohanty, K.B. A review on MPPT techniques of PV system under partial
shading condition. Renew. Sustain. Energy Rev. 2017, 80, 854–867. [CrossRef]
Sensors 2020, 20, 3039 21 of 23
8. Ahmed, J.; Salam, Z. An improved perturb and observe (P&O) maximum power point tracking (MPPT)
algorithm for higher efficiency. Appl. Energy 2015, 150, 97–108.
9. Al-Majidi, S.D.; Abbod, M.F.; Al-Raweshidy, H.S. A novel maximum power point tracking technique based
on fuzzy logic for photovoltaic systems. Int. J. Hydrogen Energy 2018, 43, 14158–14171. [CrossRef]
10. Kassem, A.M. MPPT control design and performance improvements of a PV generator powered DC
motor-pump system based on artificial neural networks. Int. J. Electr. Power Energy Syst. 2012, 43, 90–98.
[CrossRef]
11. Belhachat, F.; Larbes, C. Global maximum power point tracking based on ANFIS approach for PV array
configurations under partial shading conditions. Renew. Sustain. Energy Rev. 2017, 77, 875–889. [CrossRef]
12. Mumtaz, S.; Ahmad, S.; Khan, L.; Ali, S.; Kamal, T.; Hassan, S. Adaptive Feedback Linearization Based
NeuroFuzzy Maximum Power Point Tracking for a Photovoltaic System. Energies 2018, 11, 606. [CrossRef]
13. Shaiek, Y.; Smida, M.B.; Sakly, A.; Mimouni, M.F. Comparison between conventional methods and GA
approach for maximum power point tracking of shaded solar PV generators. Sol. Energy 2013, 90, 107–122.
[CrossRef]
14. Ahmed, J.; Salam, Z. A Maximum Power Point Tracking (MPPT) for PV system using Cuckoo Search with
partial shading capability. Appl. Energy 2014, 119, 118–130. [CrossRef]
15. Titri, S.; Larbes, C.; Toumi, K.Y.; Benatchba, K. A new MPPT controller based on the Ant colony optimization
algorithm for Photovoltaic systems under partial shading conditions. Appl. Soft Comput. 2017, 58, 465–479.
[CrossRef]
16. Benyoucef, A.S.; Chouder, A.; Kara, K.; Silvestre, S.; Sahed, O.A. Artificial bee colony based algorithm for
maximum power point tracking (MPPT) for PV systems operating under partial shaded conditions. Appl.
Soft Comput. 2015, 32, 38–48. [CrossRef]
17. Kaced, K.; Larbes, C.; Ramzan, N.; Bounabi, M.; Dahmane, Z.E. Bat algorithm based maximum power point
tracking for photovoltaic system under partial shading conditions. Sol. Energy 2017, 158, 490–503. [CrossRef]
18. Yang, B.; Zhong, L.; Zhang, X.; Chun, H.; Yu, T.; Li, H.; Jiang, L.; Sun, L. Novel bio-inspired memetic salp
swarm algorithm and application to MPPT for PV systems considering partial shading condition. J. Cleaner
Prod. 2019, 215, 1203–1222. [CrossRef]
19. Jiang, L.L.; Srivatsan, R.; Maskell, D.L. Computational intelligence techniques for maximum power point
tracking in PV systems: A review. Renewable Sustainable Energy Rev. 2018, 85, 14–45. [CrossRef]
20. Koad, R.B.A.; Zobaa, A.F.; El-Shahat, A. A Novel MPPT Algorithm Based on Particle Swarm Optimization
for Photovoltaic Systems. IEEE Trans. Sustain. Energy 2017, 8, 468–476. [CrossRef]
21. Suryavanshi, R.; Joshi, D.R.; Jangamshetti, S.H. PSO and P&O based MPPT technique for SPV panel under
varying atmospheric conditions. In Proceedings of the 2012 International Conference on Power, Signals,
Controls and Computation, Thrissur, Kerala, India, 3–6 January 2012.
22. Garg, H. A hybrid PSO-GA algorithm for constrained optimization problems. Appl. Math. Comput. 2016,
274, 292–305. [CrossRef]
23. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA,
2018.
24. Glavic, M. (Deep) Reinforcement learning for electric power system control and related problems: A short
review and perspectives. Annu. Rev. Control 2019, 48, 22–35. [CrossRef]
25. Kofinas, P.; Doltsinis, S.; Dounis, A.I.; Vouros, G.A. A reinforcement learning approach for MPPT control
method of photovoltaic sources. Renew. Energy 2017, 108, 461–473. [CrossRef]
26. Wei, C.; Zhang, Z.; Qiao, W.; Qu, L. Reinforcement-Learning-Based Intelligent Maximum Power Point
Tracking Control for Wind Energy Conversion Systems. IEEE Trans. Ind. Electron. 2015, 62, 6360–6370.
[CrossRef]
27. Nambiar, A.; Anderlini, E.; Payne, G.; Forehand, D.; Kiprakis, A.; Wallace, A. Reinforcement Learning Based
Maximum Power Point Tracking Control of Tidal Turbines. In Proceedings of the 12th European Wave and
Tidal Energy Conference, Cork, Ireland, 27 August–September 2017.
28. Hsu, R.; Liu, C.T.; Chen, W.Y.; Hsieh, H.-I.; Wang, H.L. A Reinforcement Learning-Based Maximum Power
Point Tracking Method for Photovoltaic Array. Int. J. Photoenergy 2015, 2015. [CrossRef]
29. Youssef, A.; Telbany, M.E.; Zekry, A. Reinforcement Learning for Online Maximum Power Point Tracking
Control. J. Clean Energy Technol. 2016, 4, 245–248. [CrossRef]
Sensors 2020, 20, 3039 22 of 23
30. Phan, B.C.; Lai, Y.-C. Control Strategy of a Hybrid Renewable Energy System Based on Reinforcement
Learning Approach for an Isolated Microgrid. Appl. Sci. 2019, 9, 4001. [CrossRef]
31. Chou, K.-Y.; Yang, S.-T.; Chen, Y.-P. Maximum Power Point Tracking of Photovoltaic System Based on
Reinforcement Learning. Sensors 2019, 19, 5054. [CrossRef]
32. Zhang, X.; Li, X.; He, T.; Yang, B.; Yu, T.; Li, H.; Jiang, L.; Sun, L. Memetic reinforcement learning based
maximum power point tracking design for PV systems under partial shading condition. Energy 2019, 174,
1079–1090. [CrossRef]
33. Dong, M.; Li, D.; Yang, C.; Li, S.; Fang, Q.; Yang, B.; Zhang, X. Global Maximum Power Point Tracking of PV
Systems under Partial Shading Condition: A Transfer Reinforcement Learning Approach. Appl. Sci. 2019, 9,
2769. [CrossRef]
34. Lapan, M. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value
Iteration, Policy Gradients, TRPO, AlphaGo Zero and More; Packt Publishing Ltd.: Birmingham, UK, 2018.
35. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with
asynchronous off-policy updates. In Proceedings of the 2017 IEEE International Conference on Robotics and
Automation (ICRA), Marina Bay Sands, Singapore, 29 May–2 June 2017; pp. 3389–3396.
36. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control
with deep reinforcement learning. arXiv Preprint 2015, arXiv:1509.02971.
37. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, L.; Wierstra, D.; Riedmiller, M. Playing atari
with deep reinforcement learning. arXiv Preprint 2013, arXiv:1312.5602.
38. Kahn, G.; A Villaflor, B.D.; Abbeel, P.; Levine, S. Self-Supervised Deep Reinforcement Learning with
Generalized Computation Graphs for Robot Navigation. In Proceedings of the 2018 IEEE International
Conference on Robotics and Automation (ICRA), Brisbane, Australia, 20–25 May 2018; pp. 1–8.
39. He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; Ostendorf, M. Deep reinforcement learning with a natural
language action space. arXiv Preprint 2015, arXiv:1511.04636.
40. Mohamed Shakeel, P.; Baskar, S.; Dhulipala, V.R.S.; Mishra, S.; Jaber, M.M. Maintaining Security and Privacy
in Health Care System Using Learning Based Deep-Q-Networks. J. Med. Syst. 2018, 42, 186. [CrossRef]
[PubMed]
41. Zhang, D.; Han, X.; Deng, C. Review on the research and practice of deep learning and reinforcement learning
in smart grids. CSEE J. Power Energy Syst. 2018, 4, 362–370. [CrossRef]
42. Zhang, Z.; Zhang, D.; Qiu, R.C. Deep reinforcement learning for power system: An. overview. CSEE J. Power
Energy Syst. 2019, 1–12.
43. Wei, C.; Zhang, Z.; Qiao, W.; Qu, L. An Adaptive Network-Based Reinforcement Learning Method for
MPPT Control of PMSG Wind Energy Conversion Systems. IEEE Trans. Power Electron. 2016, 31, 7837–7848.
[CrossRef]
44. Saenz-Aguirre, A.; Zulueta, E.; Fernandez-Gamiz, U.; Lozano, J.; Lopez-Guede, J. Artificial Neural Network
Based Reinforcement Learning for Wind Turbine Yaw Control. Energies 2019, 12, 436. [CrossRef]
45. Ram, J.P.; Babu, T.S.; Rajasekar, N. A comprehensive review on solar PV maximum power point tracking
techniques. Renewable Sustainable Energy Rev. 2017, 67, 826–847. [CrossRef]
46. Bendib, B.; Belmili, H.; Krim, F. A survey of the most used MPPT methods: Conventional and advanced
algorithms applied for photovoltaic systems. Renewable Sustainable Energy Rev. 2015, 45, 637–648. [CrossRef]
47. Mirza, A.F.; Ling, Q.; Javed, M.Y.; Mansoor, M. Novel MPPT techniques for photovoltaic systems under
uniform irradiance and Partial shading. Sol. Energy 2019, 184, 628–648. [CrossRef]
48. Prasanth Ram, J.; Rajasekar, N. A new global maximum power point tracking technique for solar photovoltaic
(PV) system under partial shading conditions (PSC). Energy 2017, 118, 512–525. [CrossRef]
49. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Venes, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.;
Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015,
518, 529–533. [CrossRef] [PubMed]
50. Casas, N. Deep deterministic policy gradient for urban traffic light control. arXiv Preprint 2017,
arXiv:1703.09035.
51. Li, Y. Deep Reinforcement Learning: An Overview. arXiv Preprint 2018, arXiv:1810.06339.
52. Wu, J.; He, H.; Peng, J.; Li, Y.; Li, Z. Continuous reinforcement learning of energy management with deep Q
network for a power split hybrid electric bus. Appl. Energy 2018, 222, 799–811. [CrossRef]
Sensors 2020, 20, 3039 23 of 23
53. Fan, J.; Wang, Z.; Xie, Y.; Yang, Z. A Theoretical Analysis of Deep Q-Learning. arXiv Preprint 2019,
arXiv:1901.00137v3.
54. Wu, Y.; Tan, H.; Peng, J.; Zhang, H.; He, H. Deep reinforcement learning of energy management with
continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus. Appl.
Energy 2019, 247, 454–466. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).