Applied Energy
journal homepage: www.elsevier.com/locate/apenergy
Keywords: Deep reinforcement learning; Deep deterministic policy gradient algorithm; Smart buildings; Control systems; Distributed energy resources; Load flexibility; Energy efficiency

Abstract: Behind-the-meter distributed energy resources (DERs), including building solar photovoltaic (PV) technology and electric battery storage, are increasingly being considered as solutions to support carbon reduction goals and increase grid reliability and resiliency. However, dynamic control of these resources in concert with traditional building loads, to effect efficiency and demand flexibility, is not yet commonplace in commercial control products. Traditional rule-based control algorithms do not offer integrated closed-loop control to optimize across systems, and most often, PV and battery systems are operated for energy arbitrage and demand charge management, and not for the provision of grid services. More advanced control approaches, such as MPC control, have not been widely adopted in industry because they require significant expertise to develop and deploy. Recent advances in deep reinforcement learning (DRL) offer a promising option to optimize the operation of DER systems and building loads with reduced setup effort. However, there are limited studies that evaluate the efficacy of these methods to control multiple building subsystems simultaneously. Additionally, most of the research has been conducted in simulated environments as opposed to real buildings. This paper proposes a DRL approach that uses a deep deterministic policy gradient algorithm for integrated control of HVAC and electric battery storage systems in the presence of on-site PV generation. The DRL algorithm, trained on synthetic data, was deployed in a physical test building and evaluated against a baseline that uses the current best-in-class rule-based control strategies. Performance in delivering energy efficiency, load shift, and load shed was tested using price-based signals. The results showed that the DRL-based controller can produce cost savings of up to 39.6% as compared to the baseline controller, while maintaining similar thermal comfort in the building. The project team has also integrated the simulation components developed during this work as an OpenAI Gym environment and made it publicly available so that prospective DRL researchers can leverage this environment to evaluate alternate DRL algorithms.
1. Introduction

Behind-the-meter distributed energy resources (DERs) include local electricity generation (e.g., solar photovoltaic [PV]), energy storage devices (e.g., battery units), and thermostatically controlled loads (TCLs) such as heating, ventilation and air conditioning (HVAC) systems [1]. DERs are transforming the demand side of the grid from traditionally passive to active systems [2], and they are increasingly being considered as solutions to support carbon emission reduction and energy saving goals, and to maximize grid reliability and resiliency [3]. However, coordinating these resources with traditional building loads to improve load flexibility and grid response is not common practice in commercial control systems [4]. The challenge is how to implement a DER energy management system such that the control strategy can fulfill the grid flexibility commitment without affecting building services (e.g., occupant comfort) [3].

Traditional rule-based control algorithms do not offer integrated control to coordinate across systems, and they typically operate sub-optimally [4]. On the other hand, several advanced control algorithms have been proposed in academia in the last two decades, including Model Predictive Control [5–7], Particle Swarm Optimization [8,9], and Neural Networks [10,11] for similar energy management problems. Nevertheless, these algorithms have not yet been adopted by the controls industry at scale. For instance, model-predictive control (MPC) is currently the state of the art for building controls, particularly those
∗ Corresponding author.
E-mail address: [email protected] (M. Pritoni).
¹ These authors contributed equally to this work.
https://doi.org/10.1016/j.apenergy.2021.117733
Received 12 March 2021; Received in revised form 17 August 2021; Accepted 26 August 2021
Available online 13 September 2021
0306-2619/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
to on-policy ones, have a greedy learning approach. In on-policy learning, the RL controller records rewards based on the current strategy/action performed, while in off-policy learning, the RL agent greedily selects the actions that gave the best performance from memory. Online learning means the RL agent is deployed in the real-world environment and is trained by interacting with that environment. Offline learning means the RL agent is trained without directly interacting with the environment; instead, it learns from historical data or in a virtual training environment. In offline learning, the controller is deployed in the real environment once it is well trained. Given this nature, offline learning is "less risky" because the controller does not need to interact with the environment while it is not yet well trained. In model-based learning, the controller learns the system dynamics first and then uses the learned dynamics for planning, while a model-free algorithm learns the optimal control without learning the system dynamics. The model-based approach is usually more computationally expensive, because the algorithm first needs to learn an accurate environment model (usually a difficult task in real applications) and then needs to find an optimal policy. Thus, model-free algorithms are more popular, as they are usually less computationally expensive.

Action value function. The action-value function represents the expected cumulative reward of the RL agent starting from state s and following control policy π. Formally, the action-value function is defined as:

$Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \mid S_t = s, A_t = a\right]$    (2)

where $\mathbb{E}_\pi$ is the expected value given that the RL agent follows the control policy π, and T is the final time step of the episode. Depending on the considered environment, the value of the discount factor γ needs to be tuned to balance current and future rewards.
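To make Eq. (2) concrete, the short Python sketch below (ours, not part of the paper) computes the discounted cumulative reward of a single observed episode; the reward list and the discount factor value are hypothetical.

# Illustrative sketch (not from the paper): the discounted return of Eq. (2)
# for one observed episode, with a hypothetical discount factor.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * r_{t+k+1} over one episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example with three steps of reward:
print(discounted_return([1.0, 0.5, 2.0], gamma=0.9))  # 1.0 + 0.45 + 1.62 = 3.07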
Q-Learning. The action-value function, which is also known as the Q-value, gave its name to one of the major off-policy and model-free RL algorithms: Q-learning. The Q-learning algorithm computes and updates, at each iteration, the Q-value for each state–action pair at time t, in order to achieve the maximum cumulative reward. The optimal $Q^*_\pi(s_t, a_t)$ for action $a_t$ taken at state $s_t$ can be expressed using a recursive relation known as the Bellman equation, which is the sum of the present reward and the maximum discounted future reward:

$Q^*_\pi(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a_{t+1}} Q_\pi(s_{t+1}, a_{t+1})$    (3)

When used in a discrete state–action space, the RL agent policy is determined by a state–action lookup table called a Q-table, which is used to select an action for a given state. The Q-table is updated according to the following equation:

$Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha\left[r(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_\pi(s_{t+1}, a') - Q_\pi(s_t, a_t)\right]$    (4)

where α ∈ [0, 1] is the learning rate.
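The tabular update of Eq. (4) can be illustrated with a few lines of Python; the sketch below is not the paper's implementation, and the state/action sets and hyper-parameter values are placeholders.

# Illustrative sketch of the tabular update in Eq. (4); states, actions,
# and hyper-parameter values are hypothetical.
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)                      # Q-table: (state, action) -> value

def q_table_update(s, a, r, s_next, actions):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def act(s, actions):
    # epsilon-greedy action selection from the current Q-table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])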
Deep Q-Learning. The Q-table approach can work well for environments where the state–action space is relatively small, but when the state and action spaces grow or become infinite (i.e., environments with continuous states/actions), the size of the table becomes intractable. A solution to this problem is to replace the Q-table with a nonlinear representation (i.e., a function approximation) that maps a state and action onto a Q-value, which can be seen as a supervised learning problem (i.e., regression). Recently, approximating the action-value function with a deep neural network has become one of the most popular options, due to the network's capacity to accurately approximate high-dimensional and complex systems. A simplified illustration of the Q-learning algorithm using an approximation function is provided by the following pseudo-code.

Algorithm 1: Q-learning algorithm using an approximation function
1. Initialize Q(s_t, a_t) (i.e., initialize the neural networks with random weights).
2. Obtain an observation of the transition (s_t, a_t, r_t, s_{t+1}) by making the RL agent interact with the environment.
3. Compute the loss function:
   L = (Q_π(s_t, a_t) − [r(s_t, a_t) + γ max_{a'∈A} Q_π(s_{t+1}, a')])²
4. Update Q (e.g., using the stochastic gradient descent algorithm) by minimizing L with respect to the neural network parameters.
5. Repeat from step 2 until some convergence criterion is satisfied.
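The following PyTorch sketch illustrates steps 3 and 4 of Algorithm 1 for a single transition. It is an illustrative reading of the pseudo-code, not the authors' released code; the network size, optimizer, learning rate, and discount factor are assumed values.

# Illustrative PyTorch sketch of steps 3-4 of Algorithm 1 (not the authors' code);
# network size, learning rate, and discount factor are assumed values.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def q_update(s, a, r, s_next):
    """One stochastic gradient step on (Q(s,a) - [r + gamma * max_a' Q(s',a')])^2."""
    q_sa = q_net(s)[a]                              # Q(s_t, a_t)
    with torch.no_grad():                           # bootstrapped target
        target = r + gamma * q_net(s_next).max()
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with dummy data:
# q_update(torch.randn(state_dim), 2, 1.0, torch.randn(state_dim))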
Several papers have used neural networks to approximate action-value functions [18,19]. However, these early methods were often unstable. This instability was due to: (1) the correlation between sequential observations, (2) the correlation between the action-values Q_π(s_t, a_t) and r(s_t, a_t) + γ max_{a'} Q_π(s_{t+1}, a'), and (3) the fact that a small change to the action-value function may significantly change the policy.

In 2015, Mnih et al. [20] introduced the Deep Q-Network (DQN), which was the first successful combination of a deep neural network and the Q-learning algorithm. This work is responsible for the rapid growth of the field of deep reinforcement learning (DRL). In addition to the innovation of using neural networks, DQN proposed the use of a replay buffer and two neural networks to approximate the action-value function and to overcome the instability issues of previous approaches. The replay buffer is used to store a relatively large number of observed transitions (s_t, a_t, r_t, s_{t+1}). Mini-batches of these transitions are sampled randomly (using a uniform distribution), which allows the algorithm to update the neural network from a set of uncorrelated transitions at each iteration. To reduce the correlation between the action-values Q_π(s_t, a_t) and r(s_t, a_t) + γ max_{a'} Q_π(s_{t+1}, a'), DQN uses two neural networks. The weights of the first one are updated at each iteration, and this network directly interacts with the environment. The weights of the second neural network, called a target network, are updated after a fixed number of iterations by simply copying the weights of the first network.
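The two stabilization mechanisms described above can be sketched as follows. This is a generic illustration rather than the authors' implementation; the buffer capacity, batch size, network shape, and synchronization interval are assumed values.

# Illustrative sketch of the two DQN stabilization ideas described above:
# a replay buffer and a periodically copied target network (sizes assumed).
import copy
import random
from collections import deque
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))   # online network
target_net = copy.deepcopy(q_net)                                      # target network

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)       # oldest transitions dropped first
    def add(self, transition):                     # transition = (s, a, r, s_next)
        self.buffer.append(transition)
    def sample(self, batch_size=64):
        # uniform sampling decorrelates the transitions used in each update
        return random.sample(self.buffer, batch_size)

SYNC_EVERY = 1000
def maybe_sync(step):
    # copy the online weights into the target network every SYNC_EVERY updates
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())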
While DQN can handle environments with high-dimensional state spaces, it can only be used with discrete and low-dimensional action spaces. In fact, DQN relies on finding the action that maximizes the action-value function, which requires an iterative optimization at every step if it is used with an environment that has continuous actions. In theory, one can discretize the action space; however, this solution is likely to be intractable for problems that require fine control of actions (i.e., finer-grained discretization), since large discrete action spaces are difficult to explore efficiently. Moreover, a naive discretization of the action space can exclude important information about the action domain, which can be essential for finding an optimal control policy in several problems. To overcome this issue of discrete action spaces, [21] introduced the Deep Deterministic Policy Gradient (DDPG) algorithm, which is a model-free, off-policy algorithm that can learn control policies in high-dimensional, continuous action spaces.

2.1.2. Deep deterministic policy gradient
DDPG [21] is an actor–critic framework based on the deterministic policy gradient (DPG) method [22], and it borrows methodological advances (i.e., the replay buffer and target networks) developed for the DQN algorithm [20]. An actor–critic algorithm is an RL algorithm that has two agents: an actor and a critic. The actor makes decisions based on the observation of the environment and the current control policy. The role of the critic is to observe the state of the environment and the reward obtained by applying an action, and to return the value-function estimate to the actor.

DDPG learns two networks that approximate the actor function μ(s_t|θ^μ) and the critic function Q(s_t, a_t|θ^Q), where s_t is the state of the environment at time t, a_t is the control action (i.e., the HVAC setpoints and the battery setpoint), and θ^μ and θ^Q are respectively the weights of the actor and critic networks.
The actor function μ(s_t|θ^μ) deterministically maps states to a specific action and represents the current policy. The critic function Q(s_t, a_t|θ^Q) (i.e., the action-value function) maps each state–action pair to a value in ℝ that is the expected cumulative future reward obtained by taking action a_t at state s_t and then following the policy. During the training process, the actor and critic networks are iteratively updated using stochastic gradient descent (SGD) on two different losses, L_μ and L_Q, which are computed using mini-batches of transition samples (s_t, a_t, r_t, s_{t+1}), where s_{t+1} is the state that results from taking action a_t at state s_t, and r_t is the subsequent reward. Using SGD assumes that the transition samples in the mini-batches are independently and identically distributed, which may not be the case when the observations are generated sequentially. Similarly to DQN, DDPG addresses this issue by sampling the transitions uniformly from a relatively large, fixed-size replay buffer. This allows the algorithm to update the actor and critic networks from a set of uncorrelated transitions at each training iteration. The replay buffer is populated by sampling transitions from the environment using the current policy, and after the buffer is fully populated, the oldest transitions are replaced by the new ones. Another method that DDPG borrows from DQN to improve the stability of the training process is the use of two target networks, μ'(s_t|θ^{μ'}) and Q'(s_t, a_t|θ^{Q'}), whose weights θ^{μ'} and θ^{Q'} slowly track the learned weights θ^μ and θ^Q.

As for most DRL algorithms, DDPG needs to explore the state space in order to avoid converging to a local minimum that produces non-optimal policies. Usually, this exploration is performed by adding noise sampled from a predefined random process N to the actions produced by the actor network:

$a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$    (5)

Algorithm 2: DDPG algorithm
Initialize the Q(s_t, a_t|θ^Q) and μ(s_t|θ^μ) networks with random weights θ^Q and θ^μ.
Initialize the target networks μ'(s_t|θ^{μ'}) and Q'(s_t, a_t|θ^{Q'}) with θ^{μ'} ← θ^μ, θ^{Q'} ← θ^Q.
Initialize the replay buffer B.
for episode = 1, ..., M do
    Initialize an OU process N_t for exploration
    Obtain the initial observation of state s_1
    for t = 1, ..., T do
        Obtain action a_t = μ(s_t|θ^μ) + N_t
        Execute action a_t, compute the reward r_t, and observe the new state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer B
        Randomly sample a mini-batch of N transitions (s_i, a_i, r_i, s_{i+1}) from B
        Set y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'}) | θ^{Q'})
        Update θ^Q by minimizing the loss: $L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - Q(s_i, a_i|\theta^Q)\right)^2$
        Update θ^μ using the sampled policy gradient:
        $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i, a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s_i}$
        Update the target networks:
        $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$;  $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
    end for
end for
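A condensed PyTorch sketch of one training iteration of Algorithm 2 is given below. It is an independent illustration of the algorithm, not the released FlexDRL implementation; the network architectures, dimensions, learning rates, γ, and τ are assumed values. In the setting of this paper, s would hold the building state variables and a the HVAC and battery setpoints.

# Condensed PyTorch sketch of one DDPG iteration (Algorithm 2); architectures,
# tau, gamma, and dimensions are illustrative, not the authors' implementation.
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 10, 3, 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())        # mu(s | theta_mu)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                           # Q(s, a | theta_Q)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next):
    """s, s_next: (N, state_dim); a: (N, action_dim); r: (N, 1) mini-batch tensors."""
    # Critic update: minimize (y_i - Q(s_i, a_i))^2 with y_i from the target networks
    with torch.no_grad():
        y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: follow the sampled deterministic policy gradient
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'
    for net, net_t in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)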
Fig. 1. (a) Outside view of FLEXLAB, (b) mechanical drawing of FLEXLAB envelope and HVAC (top view) [51].
2.3. Research gaps and contribution

The literature clearly shows that DRL research in the built environment has great potential, and it is a fast-growing field [13,15,25,26]. However, a clear gap identified in the literature is that most DRL-based building energy optimization methods are still not implemented in practice [52], nor do they use physical equipment [13]. This observation is also echoed by various reviews such as [25]: "[...] The vast majority of studies that implement RL for problems relating to building energy management are in simulation only" and [26]: "[...] the research works about applications of RL in sustainable energy and electric systems are largely still in the laboratory research stage, which have not been practically implemented in sustainable energy and electric systems". A second gap is that very few RL studies have co-optimized DERs with HVAC control to achieve load flexibility at the building level. The few field tests that employ DRL-based algorithms only optimize the operation of one system at a time (mostly the HVAC system).

Given these gaps in the literature, the goal of this paper is to implement a price-responsive DRL-based controller and deploy it in an actual building that includes on-site distributed energy resources (DERs). The DRL algorithm aims at minimizing energy cost while maintaining acceptable indoor thermal comfort. The performance of the DRL-based controller is tested against a best-in-class rule-based controller. The paper makes the following contributions to the existing literature:

• An application of a DRL algorithm for integrated optimal control of a DER system composed of PV, electric storage, and thermal loads in a commercial building
• A field demonstration of the DRL controller in a well-instrumented single-zone commercial building test facility, controlling multiple pieces of equipment along with on-site DERs
• An open-source OpenAI Gym environment that can be reused by other researchers, in order to both train and evaluate the performance of their algorithms
• A discussion of the challenges and lessons learned in the field deployment of the DRL controller in an actual building, particularly with respect to integrating multiple building systems

3. Methodology

This section discusses the experimental design and test setup. It also outlines the performance metrics used to monitor and evaluate the outputs of the different tests and to assess the efficacy of the DRL controller. Finally, it discusses the baseline strategy used to control the battery and the building loads during the tests.

3.1. Experimental design and setup

The evaluation of the controller developed in this work was conducted in FLEXLAB [51], a well-instrumented experimental test facility at Lawrence Berkeley National Laboratory in Berkeley, California, United States. FLEXLAB allows accurate measurement of the behavior of integrated systems (e.g., HVAC and batteries) in providing grid services (e.g., shift/shed load when requested) and building services (e.g., visual and thermal comfort). To evaluate the performance of the DRL controller compared to a baseline, we used a pair of side-by-side identical cells (called Cell A and Cell B in the paper), a unique feature of FLEXLAB (Fig. 1). Each cell is representative of a small office space with a floor area of 57 m² and includes a large south-facing window. FLEXLAB can be reconfigured with different types of equipment, end-uses, and envelope features. As HVAC, we selected a single-zone variable-capacity air-handling unit (AHU) for each cell, a very common system installed in many U.S. buildings [53]. The AHU had a single variable-speed fan and two dampers for outdoor air and return air mixing. The air side of the AHU was monitored with sensors to measure the air volume flow rate as well as the temperatures and humidities in each section of the ducts. The water side of the two AHUs was served by a small chiller and boiler, shared by the two cells (Fig. 1). The flow rates and the inlet and outlet temperatures in each AHU coil were measured to calculate the heat flows transferred to the thermal zones. Power consumption of the heating and cooling equipment was estimated using these measured heat flows multiplied by a fixed efficiency (chiller COP = 3.0 and boiler efficiency = 0.95). Each room had six ceiling-mounted light fixtures, six plug-load stations (i.e., desks with desktop computers), six heat-generating mannequins that emulate occupants, and multiple environmental sensors measuring temperature, indoor light levels, and relative humidity (Fig. 2). The DERs comprised a 3.64 kilowatt (kW) photovoltaic system and a Tesla Powerwall battery with a capacity of 7.2 kWh and a peak power output of 3.3 kW (Fig. 3) for each cell. Each sensor in FLEXLAB collected data every 10 s or less, and the data were aggregated to 1 min for analysis. The DRL controller was deployed in Cell B, while the baseline controller was deployed in Cell A. The FLEXLAB control system allowed users to change the setpoint of the supply air temperature (SATsp) and the supply air flow rate (SAFsp) of the AHU directly, or to influence them by setting the zone air temperature heating and cooling setpoints (ZAThsp, ZATcsp) using a standard control sequence. In addition, the charge/discharge setpoint of the battery was controllable (batterysp).

Three different scenarios were tested, based on demand-side management strategies commonly referred to as: (1) Energy Efficiency (EE), (2) Shed, and (3) Shift [3]. In the EE scenario, the controllers were compared based on the amount of energy purchased from the grid; in the Shed scenario, the comparison was based on the ability to reduce energy purchases in a specific time window; and in the Shift scenario,
the two controllers were evaluated based on their ability to shift energy purchases from one time window to another. Different electricity prices were set to produce a desired effect on the electricity purchase (i.e., higher prices would drive reduced purchase from the grid).

Fig. 2. Room set up inside FLEXLAB: heat-emitting mannequins or equivalent thermal generators, desktop computers, and overhead light-emitting diode (LED) lights with multiple environmental sensors [51].

In the first experiment, which ran for seven days in July, the electricity price was kept constant (0.13 $/kWh). The Shed experiment involved two tiers of prices: a low price (0.13 $/kWh) throughout the day, and a high price between 4 pm and 9 pm (1.33 $/kWh). The Shift experiment included an additional price tier: a high price (4 pm–9 pm) at 0.16 $/kWh; a medium price (2 pm–4 pm and 9 pm–11 pm) at 0.13 $/kWh; and a low price (11 pm–2 pm) at 0.11 $/kWh. These prices were selected based on current commercial tariffs used in California [54] at the time of the experiment. New research on demand response [55,56] has suggested using prices to reflect grid requirements and to initiate responses from building equipment, instead of using traditional demand-response events. The constant prices used in the EE experiment, along with the objective to minimize total energy costs, discourage any shifting of loads from one period to another and incentivize an overall reduction in energy consumption by optimizing the operation of the building equipment. The tiered prices used in the Shed and Shift experiments were used to trigger load reduction during the high-price times and, in the case of Shift, to encourage a corresponding additional increase in load during the lower-price periods as well, thereby achieving a load shed and a load shift.

For simplicity, the tariff only accounted for the variable cost of electricity (i.e., kWh) and not for maximum demand (i.e., kW). Both the baseline and the DRL algorithm were tasked to minimize the energy cost while maintaining a zone air temperature deadband of 21 °C–24 °C during the HVAC operational hours (7 am–7 pm) and 15 °C–29 °C during the remaining hours of the day. The experiments were conducted during the cooling season, at the end of July and early August of 2020, with each experiment spanning 7–8 days, for a total testing period of slightly more than three weeks.

1. ENERGY CONSUMPTION
Since the DER system is composed of local generation and storage, the energy consumption is defined in terms of the energy purchased from the grid, which corresponds to the difference between the PV generation and the building load, adjusted by the change in state of charge of the battery. The building load is the sum of the power consumption of the HVAC systems (chiller, fans, pumps), lighting, and plug loads. We call this net load.

$E_{net} = \int_{0}^{T} \left(P_{bldg} + P_{batt} - P_{pv}\right) dt$    (6)

(a) ENERGY SAVINGS
The energy savings are the difference between the net energy consumption in Cell A (baseline) and Cell B (DRL). They are expressed in kWh for the whole testing period of each scenario:

$E_{saving} = E_{baseline} - E_{DRL}$    (7)

(b) % ENERGY SAVINGS
The % savings correspond to the energy savings divided by the baseline:

$\%\,E_{saving} = \frac{E_{baseline} - E_{DRL}}{E_{baseline}} \times 100\%$    (8)

2. ENERGY COST
The energy cost is the product of E_net and the energy price, which differ by scenario and time period.

(a) COST SAVINGS
Similar to energy savings, cost savings are defined as the difference between the cost in Cell A (baseline) and Cell B (DRL). They are expressed in $ for the whole testing period of each scenario.

(b) % COST SAVINGS
The % cost savings correspond to the cost savings divided by the baseline:

$\%\,\$_{savings} = \frac{\$(E_{baseline}) - \$(E_{DRL})}{\$(E_{baseline})} \times 100\%$    (9)
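As an illustration of how Eqs. (6)–(9) can be computed from logged interval data, the following pandas sketch is provided; the column names, the 1-minute interval length, and the price series are hypothetical placeholders, not the FLEXLAB data schema.

# Illustrative pandas sketch of Eqs. (6)-(9); column names, the 1-min interval,
# and the price series are hypothetical placeholders.
import pandas as pd

def net_energy_kwh(df, dt_hours=1 / 60):
    """Eq. (6): integrate (P_bldg + P_batt - P_pv), in W, over the test period."""
    net_power_w = df["P_bldg"] + df["P_batt"] - df["P_pv"]
    return (net_power_w * dt_hours).sum() / 1000.0        # Wh -> kWh

def savings(e_baseline, e_drl):
    """Eqs. (7)-(8): absolute (kWh) and percentage energy savings."""
    return e_baseline - e_drl, 100.0 * (e_baseline - e_drl) / e_baseline

def energy_cost(df, price_per_kwh, dt_hours=1 / 60):
    """Energy cost: net energy per interval multiplied by the interval price series."""
    net_kwh = (df["P_bldg"] + df["P_batt"] - df["P_pv"]) * dt_hours / 1000.0
    return (net_kwh * price_per_kwh).sum()                # price indexed like df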
to train the DRL controller. A training framework that uses a physics-based simulation component was implemented to learn an end-to-end optimal control policy to manage the charge and discharge of the battery storage and the HVAC setpoints. We calibrated the simulation model using operational data from the testbed, and we are confident that this model can emulate the real environment (FLEXLAB) reasonably well.

4.1. Simulation environment

4.1.1. Software stack and modeling choices
The EnergyPlus™ simulation tool [59] was used to model the building and HVAC system. The physical parameters of the FLEXLAB envelope were selected based on the materials in the construction drawings. For the HVAC system, we developed a water-based primary loop and air-based secondary loop HVAC model, as shown in Fig. 4, which represents the FLEXLAB configuration used for the test. We assumed a constant boiler efficiency (0.95) and chiller COP (3.0) because the climate at the test site is mild.

The battery and PV panels were simulated using the Smart Control of Distributed Energy Resources (SCooDER) library [60]. SCooDER is a Modelica library developed to facilitate the simulation and optimization of photovoltaics, battery storage, smart inverters, electric vehicles, and electric power systems [60]. SCooDER depends on the Modelica Standard Library and the Modelica Buildings Library. The interaction between the DRL controller and the four simulated components (i.e., building envelope, HVAC system, battery, PV generation) is a typical co-simulation problem, since multiple software packages are involved. After a review of existing simulation frameworks, we selected the Functional Mock-up Interface (FMI) [59], a standard that defines a container and an interface to exchange dynamic models using a combination of XML files, binaries, and C code zipped into a single file [59]. As an open-source standard, the FMI is used in both academia and industry, and is currently supported by more than 100 tools. Using the FMI, the building envelope, PV, and battery models were all exported in accordance with the FMI standard as Functional Mock-up Units (FMUs). The DRL controller was implemented in a Python environment using the package PyFMI [61], which enabled the interaction with the exported FMUs. The FMUs for the different models were further wrapped into an OpenAI Gym environment, to run the building simulation in parallel with the DRL algorithm, developed in PyTorch.² To allow others to reproduce the analysis or test other DRL algorithms, the simulation and control framework were packaged into a Docker container including all the needed dependencies. The software is distributed as an open-source library and can be found at https://github.com/LBNL-ETA/FlexDRL.

² https://pytorch.org/
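The co-simulation pattern described above (an FMU driven through PyFMI and exposed as an OpenAI Gym environment) is implemented in the released FlexDRL library; the following minimal sketch only illustrates the general pattern. The FMU file name, variable names, bounds, and the 15-minute step are placeholders, and the reward is left as a stub.

# Minimal sketch of the co-simulation pattern described above: an FMU loaded with
# PyFMI wrapped as an OpenAI Gym environment. File name, variable names, bounds,
# and the 15-min step are placeholders, not the released FlexDRL code.
import gym
import numpy as np
from gym import spaces
from pyfmi import load_fmu

class FlexEnvSketch(gym.Env):
    def __init__(self, fmu_path="building.fmu", step_size=900):   # 900 s = 15 min
        self.fmu = load_fmu(fmu_path)
        self.step_size = step_size
        self.time = 0.0
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(10,),
                                            dtype=np.float32)

    def reset(self):
        self.fmu.reset()
        self.fmu.initialize()
        self.time = 0.0
        return self._observe()

    def step(self, action):
        # write the three setpoints, then advance the FMU by one control interval
        self.fmu.set(["SATsp", "SAFsp", "batterysp"], list(map(float, action)))
        self.fmu.do_step(current_t=self.time, step_size=self.step_size)
        self.time += self.step_size
        obs = self._observe()
        reward = 0.0           # e.g., Eq. (17) would be evaluated here
        return obs, reward, False, {}

    def _observe(self):
        names = ["Tzone", "Tout", "SOC", "Ppv"]        # placeholder FMU outputs
        vals = [float(self.fmu.get(n)) for n in names]
        return np.array(vals + [0.0] * 6, dtype=np.float32)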
4.1.2. Model calibration
The simulation model was calibrated with experimental data gathered during a calibration test performed before the experiment. We first calibrated the envelope model by comparing the indoor temperatures of the physical building to the indoor temperatures reported by the simulated model. To accurately measure the stratification and nonuniform distribution of temperatures, we deployed four sensor trees with seven temperature sensors in Cell A and Cell B. In a detailed physics-based model with hundreds of parameters, such as the EnergyPlus model developed in this paper, different combinations of parameter values can produce the same results, which makes calibration challenging. To select the parameters needed for calibration, we followed best practices in the literature. Zhang et al. (2019) [62] used the sensitivity analysis proposed by Morris (1991) [63] to identify four key parameters that significantly influence the simulation errors and need to be calibrated. The identified parameters are: (1) thermal insulation; (2) total area of radiant heating/cooling surfaces; (3) internal thermal mass multiplier, which is defined as the ratio of the total interior thermal mass to the thermal capacitance of the air in the volume of the specified zone; and (4) infiltration rate, i.e., air changes per hour. We tuned the three parameters that were found to be the most effective in Zhang et al. (2019) [62] and relevant to our experiment settings (thermal insulation, internal thermal mass multiplier, and infiltration rate) to minimize the difference between the measured and simulated temperature. We measured the indoor temperature at 1-minute intervals and up-sampled it to 15 min to match the simulation time step. We then calculated the Coefficient of Variation of the Root Mean Square Error (CV-RMSE) between the 15-min interval measured and simulated temperatures. By adjusting the internal thermal mass multiplier and infiltration rate, we achieved a CV-RMSE of less than 3%, with an internal thermal mass multiplier of 6 and an infiltration rate of 2. The improvements of the CV-RMSE during the model calibration process are summarized in Table 4.

Table 4
Summary of calibration measures and CV-RMSE.

Cell    Calibration measures                                                     CV-RMSE
Cell A  Raw model, developed from design document                                11.2%
        Tuning thermal insulation                                                 7.6%
        Tuning thermal insulation & infiltration rate                             4.0%
        Tuning thermal insulation, infiltration rate & thermal mass multiplier    2.8%
Cell B  Raw model, developed from design document                                10.9%
        Tuning thermal insulation                                                 6.3%
        Tuning thermal insulation & infiltration rate                             2.9%
        Tuning thermal insulation, infiltration rate & thermal mass multiplier    2.1%
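The CV-RMSE used as the calibration metric can be computed as in the sketch below (a simplified formulation, with the measured and simulated 15-minute temperature series as assumed inputs).

# Illustrative CV-RMSE computation for the calibration check described above;
# `measured` and `simulated` are aligned 15-min temperature arrays (assumed).
import numpy as np

def cv_rmse(measured, simulated):
    measured, simulated = np.asarray(measured), np.asarray(simulated)
    rmse = np.sqrt(np.mean((measured - simulated) ** 2))
    return 100.0 * rmse / np.mean(measured)      # expressed in percent

# Example with dummy data: cv_rmse([22.1, 22.4, 23.0], [22.3, 22.2, 22.8])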
In addition, we calibrated the power consumption of the fan, which is another major energy consumer in air-based systems. We used a cubic polynomial curve to model the fan energy behavior and achieved a CV-RMSE of 7% for the fan in Cell A and 8% for the fan in Cell B when comparing the air flow values recorded at 15-min intervals. We then used the regressed coefficients to develop the fan model. Since the chiller and boiler serve both the baseline and the DRL cell, and they use a secondary loop with storage, it was impossible to calibrate the power curve of each cell independently, and default values were used. In this study, we set the COP to 2.5.

4.2. Design of the DRL algorithm

4.2.1. State and actions
The variables of the state vector s_t were selected as: time of day, outdoor temperature, outdoor relative humidity, solar irradiation, zone temperature, net power, net energy, battery state of charge, PV generation power, and electricity price look-ahead. The DRL controller ran every 15 min and generated three actions (i.e., three setpoints): the supply air temperature of the AHU, the supply air flow rate of the AHU, and the charge/discharge rate of the battery. These actions were mapped to model variables in the simulation, as well as to actual controllable points in FLEXLAB (respectively SATsp, SAFsp, and batterysp).

4.2.2. Reward function
The goal of the DRL control algorithm was to find an optimal policy that minimizes the cost of the electricity purchased from the grid while maintaining a good level of thermal comfort. In other words, the objective was to efficiently consume the electricity locally produced by the PV panels and to manage the battery while maintaining the indoor zone temperature within a desired range. The considered reward function was composed of three parts: (1) the penalty due to the energy consumption E_cost (replaced by the energy cost in the Shed and Shift scenarios), (2) the penalty due to the violation of the temperature comfort zone C, and (3) the penalty due to the violation of the battery physical limits λ. This reward function is defined as:
$r_t = -\alpha E_{cost}(a_t, s_{t+1}) - \beta C(a_t, s_{t+1}) - \lambda(a_t, s_{t+1})$    (17)
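A possible way to evaluate a reward of the form of Eq. (17) at each control step is sketched below; the weights, the comfort band, and the shape of the battery-limit penalty are placeholder choices, since the exact definitions of E_cost, C, and λ used in the paper are not reproduced here.

# Illustrative evaluation of the reward in Eq. (17); weights, comfort band, and
# battery-limit penalty are placeholders, not the paper's exact definitions.
def reward(energy_kwh, price, t_zone, t_low=21.0, t_high=24.0,
           soc=0.5, alpha=1.0, beta=1.0):
    e_cost = price * max(energy_kwh, 0.0)                            # grid purchase cost
    comfort = max(t_low - t_zone, 0.0) + max(t_zone - t_high, 0.0)   # deg C violation
    batt_violation = 1.0 if (soc < 0.0 or soc > 1.0) else 0.0        # physical limits
    return -alpha * e_cost - beta * comfort - batt_violation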
Fig. 4. HVAC system schematics in the test facility and EnergyPlus model.
• OU parameters: θ, the reversion rate, and σ, the standard deviation, which are two parameters of the OU process that control the level of state-space exploration.
• Reward function weights: weights that control the importance of the electricity cost vs. the thermal comfort.
• Electricity price look-ahead: the number of following hours of electricity price.

                      μ_ν (°C)   σ_ν (°C)   ζ_ν per day (°C-h/day)
EE baseline      4    0.3        0.21       0.33
EE DRL          13    0.3        0.20       0.99
Shed baseline   25    1.0        0.98       4.92
Shed DRL        15    1.3        0.81       3.68
Shift baseline  15    0.7        0.53       2.48
Shift DRL       14    0.3        0.29       1.27
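The θ and σ listed above parameterize the Ornstein–Uhlenbeck (OU) exploration noise added to the actor's actions (Eq. (5)). A standard discretized OU process is sketched below; the default values and the time step are assumptions.

# Standard discretized Ornstein-Uhlenbeck process used for action exploration;
# theta (reversion rate), sigma (standard deviation), and dt are assumed values.
import numpy as np

class OUNoise:
    def __init__(self, size=3, theta=0.15, sigma=0.2, dt=1.0, mu=0.0):
        self.size, self.theta, self.sigma, self.dt, self.mu = size, theta, sigma, dt, mu
        self.reset()

    def reset(self):
        self.x = np.full(self.size, self.mu)

    def sample(self):
        # mean-reverting drift toward mu plus Gaussian diffusion scaled by sigma
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.size))
        self.x = self.x + dx
        return self.x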
Table 7
Shed test: comparison of the DRL vs. baseline controllers' demand and total cost of energy consumed during different price periods.

Metric                              Baseline      DRL
Mean demand in low price            594 W         655 W
Mean demand in high price           242 W         121 W
Mean daily energy consumption       12.5 kWh      13.1 kWh
Cost of energy consumed             $3.13         $2.46

Table 8
Shift test: comparison of the DRL vs. baseline controllers' load demand and total cost during different price periods.

Metric                                          Baseline      DRL
Mean demand in low price                        870 W         412 W
Mean demand in medium price                     207 W         129 W
Mean demand in high price                       170 W         334 W
Mean daily energy use across all test days      14.21 kWh     8.21 kWh
Cost of energy consumed                         $1.62         $1.03
Fig. 6. EE test: (a) PV generation (gold), SOC of the battery in the baseline cell (light green) and SOC of the battery in the DRL cell (dark green); (b) chilled water power in the
baseline cell (light blue) and the DRL cell (dark blue), dry bulb outdoor air temperature (yellow, right axis); (c) hot water power in the baseline cell (light red) and the DRL cell
(dark red); (d) indoor dry bulb temperature in the baseline cell (light purple) and the DRL cell (dark purple).
Fig. 7. Daily energy consumption for different prices for the baseline and DRL controller for the SHED control test.
the lower price periods. This result is interesting because the DRL algorithm achieved its objective of reducing cost, but it failed the implicit objective of shifting energy out of the peak period. These findings are further discussed in Section 6.3.

Fig. 9 shows details about the experiment. The vertical gray bars in each panel represent the mid peak (half gray bar) and high peak (full gray bar) periods. Both controllers discharged the batteries during the mid peak and peak periods (Fig. 9a), but the baseline controller discharged the battery by a larger amount. This is the first reason that explains the reduced shift in energy by the DRL controller, described above. The baseline controller used more chilled water and produced higher chilled water peaks than the DRL controller (Fig. 9b). On average, the DRL controller started cooling earlier, compared to the baseline, and continued cooling throughout the high peak period, causing an increase of demand during this period, but saving total energy and cost (Table 8). This is the second effect that limited the amount of energy shifted by the DRL algorithm. As it pertains to heating, the baseline controller created sharp spikes every early morning (Fig. 9c). In comparison, the DRL algorithm never used heating during the testing period. The DRL controller also provided better thermal comfort compared to the baseline (Fig. 9d), with 3.2 °C-h of temperature violations per day compared to 4.3 °C-h for the baseline. In addition, most of the comfort violations happened during the high price periods for the baseline cell, while they happened outside of that period for the DRL
Fig. 8. Shed test: (a) PV generation (gold), SOC of the battery in the baseline cell (light green), and SOC of the battery in the DRL cell (dark green); (b) chilled water power in
the baseline cell (light blue) and the DRL cell (dark blue), dry bulb outdoor air temperature (yellow, right axis); (c) hot water power in the baseline cell (light red) and the DRL
cell (dark red); (d) indoor dry bulb temperature in the baseline cell (light purple) and the DRL cell (dark purple).
controller (Fig. 9d). Overlaying the comfort and chilled water profiles, it is evident that DRL engaged in cooling during high price periods to maintain thermal comfort, in contrast to the baseline controller. This behavior suggests that DRL places a higher value on thermal comfort than on cost savings during this period.

6. Discussion

6.1. Instability and variance of trained policies

Variation in trained policies across different random seeds is a common problem in DRL. Policies trained with different random seeds may generate a significant gap in performance and produce unstable results [66]. To investigate this variability, the performances of eight DRL agents, each trained with the same combination of hyper-parameters but with eight different seeds, were evaluated for each experiment. As described in Sections 4.2.3 and 4.2.4, the hyper-parameters were identified using the random search grid approach. For the Shift and Shed experiments, even out of these eight, only four converged to a realistic solution. While this in itself provides evidence of the significant impact of the random seeds, all the other controllers that had converged during training were evaluated further. The controllers were deployed in a simulated model of the building for (1) the duration of the actual FLEXLAB tests and (2) a whole year.

Fig. 10 shows the net energy consumption from the grid for the duration of the three experiments (EE, Shift, and Shed). While the green dots represent the actual energy consumed during the field experiment in FLEXLAB, the red dots are the energy consumption values estimated in the simulation environment using the same controller that was deployed in FLEXLAB (the best performer identified in the tuning process). Finally, the black dots represent the simulated energy consumption values using the policies that were generated using different random seeds during the training process. The actual energy consumption (in green) is significantly different from the energy consumption of the simulated model used for training the DRL controller (in red). Additionally, the random seeds seem to have a much higher impact in the EE experiment than in the Shift and Shed experiments. Our hypothesis for this effect is that it is likely due to the fixed price of electricity in the EE test, which makes the algorithm converge more easily to some local optimal solution. In the EE test, the difference between the best model (i.e., the policy selected for the testbed) and the worst model is significant (≈ 400 kWh). However, the small variability of the results for Shift and Shed may be misleading, as it is important to remember that only half of the experiments had converged. Fig. 11 shows the annual net energy purchased from the grid using the same controllers identified above, performing the EE, Shed, and Shift experiments respectively. While the differences in values are not as prominent as in Fig. 10, it is clear that the seed has a significant impact on the actions taken by a DRL controller.

6.2. Mismatch between modeled and actual control behavior

When the trained DRL controller was deployed in the building, the actions generated by the controller differed from those taken during the simulation. Some of these differences occurred due to using the wrong sensors, issues with mismatched units of measurement, or other factors, and they were able to be diagnosed and fixed immediately. However, there were more variations in the actions that produced significant
Fig. 9. Shift test: (a) PV generation (gold), SOC of the battery in the baseline cell (light green), and SOC of the battery in the DRL cell (dark green); (b) chilled water power in
the baseline cell (light blue) and the DRL cell (dark blue), dry bulb outdoor air temperature (yellow, right axis); (c) hot water power in the baseline cell (light red) and the DRL
cell (dark red); (d) indoor dry bulb temperature in the baseline cell (light purple) and the DRL cell (dark purple).
Fig. 10. Net energy consumption from the grid for simulated and test models.
differences in the relevant metrics (e.g., energy consumption, costs incurred), and upon further inspection it was determined that these happened due to inconsistencies between the simulated building energy model and the actual building. Such variations in actions were not present when the trained DRL controller was being trained and tested in the simulation environment, and hence they were noticed only when it was deployed in the actual building. For example, the power consumption pattern of the supply air fan during cooling and the actual charge/discharge rate of the battery given a certain setpoint proved to be wrongly represented in the simulated building model. Once this issue was identified, the supply air fan was better characterized and the energy model was updated in the simulation (this, of course, meant retraining the DRL controller). However, characterizing the stochastic battery charge/discharge behavior with respect to a given setpoint turned out to be quite challenging, and the DRL controller was used with this particular flaw built in. A more detailed evaluation of the impacts of these embedded inconsistencies has been presented in [67].

Similar inconsistencies that were overlooked during the model calibration also affected the results of our experiments. For example, in the EE test, we noticed that the DRL controller was unnecessarily heating the cell during early morning hours. Similarly, during the Shed test, while the DRL controller performed well during the test, the actions
Fig. 11. Energy purchased from the grid for different simulated models for the entire year.
system or reward functions that are hard to optimize. Future work should develop more stable DRL algorithms that improve the robustness to hyper-parameter tuning, as well as investigate the effectiveness of the random search grid as an approach for auto-tuning DRL parameters. In addition, including the random seed in the tuning process can enhance the selection of optimal policies.

Third, a proper definition of the reward function is key to successfully finding a good control policy. Inadequate reward functions can result in unstable training and/or an inappropriate policy. Therefore, more research is needed to analyze the impact of different definitions of reward functions (that optimize the same objective). This will help provide a more standardized definition of the reward function that is well adapted for DRL frameworks.

Fourth, to promote adoption, the policies and behavior of the algorithm need to be easy to explain, particularly in the cases where the policy provides a nonstandard and unexpected solution to control the system. In addition, understanding the reasons behind control errors is important, to be able to create control algorithms that are robust to failures and thus reassure stakeholders, which is an essential step to increase the adoption of this technology.

Fifth, standardized building benchmarks are needed to properly evaluate the developed DRL methods. It is crucial to have standard environments where the DRL algorithms can be trained, to have well-defined metrics to quantify the performance of trained policies, and to have baseline DRL models against which new algorithms can be evaluated. These benchmarks will help to speed incremental improvements and to evaluate the generalization of the proposed solutions to different building systems. By releasing and open-sourcing the environment and the DRL algorithm developed for this work, our aim is to provide the research community with a benchmark for building systems that involves small commercial buildings with PV and battery storage. In the future, more of these benchmarks will be needed to cover more applications (i.e., building systems and control objectives).

Lastly, future research should overcome the limitations of this study and extend its results by comparing different DRL algorithms and testing different control variables (e.g., controlling ZAT setpoints or chilled water valve position) in addition to the setpoints controlled in this experiment. Better comfort models should be explored, to take advantage of more realistic assumptions about comfort adaptation. Additional field testing should be performed in a variety of building systems to make sure the results are generalizable.
7. Conclusion

This paper explores the use of a DRL approach to control a behind-the-meter DER system that consists of a proxy of a small office building with local PV solar generation and a battery unit. A simulation-based training framework was developed and used to train a DRL algorithm. This trained controller was then deployed in a testbed to evaluate its performance under three load flexibility modes (EE, Shift, and Shed). The results of this work and prior published efforts show that DRL can be a promising solution for DER systems. As the research community transitions from simulation-based DRL research to field demonstrations, the results and the challenges described in this paper can be used to accelerate the efforts of future researchers and practitioners. The next steps in this research include improving the DRL training methods, evaluating different reward functions, and testing alternative algorithms. These can help develop portable and scalable solutions that enable easier deployment of DRL in different buildings.

CRediT authorship contribution statement

Samir Touzani: Conceptualization, Methodology, Software, Validation, Investigation, Writing – original draft, Writing – review & editing, Supervision, Project administration. Anand Krishnan Prakash: Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Zhe Wang: Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Shreya Agarwal: Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Marco Pritoni: Methodology, Formal analysis, Writing – original draft, Writing – review & editing, Supervision. Mariam Kiran: Methodology, Writing – review & editing. Richard Brown: Writing – review & editing. Jessica Granderson: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Assistant Secretary for Energy Efficiency and Renewable Energy, Building Technologies Office, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank Heather Goetsch, Cedar Blazek, the FLEXLAB team, Callie Clark and Christoph Gehbauer for their support. This research used the Lawrencium computational cluster resource provided by the IT Division at the Lawrence Berkeley National Laboratory (supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231).

References

[1] Bayram IS, Ustun TS. A survey on behind the meter energy management systems in smart grid. Renew Sustain Energy Rev 2017;72:1208–32.
[2] Foruzan E, Soh L-K, Asgarpoor S. Reinforcement learning approach for optimal distributed energy management in a microgrid. IEEE Trans Power Syst 2018;33(5):5749–58.
[3] Neukomm M, Nubbe V, Fares R. Grid-interactive efficient buildings. 2019, http://dx.doi.org/10.2172/1508212, URL https://www.osti.gov/biblio/1508212.
[4] Granderson J, Kramer H, Rui T, Brown R, Curtin C. Market brief: Customer-sited distributed energy resources. 2020, URL https://escholarship.org/uc/item/1xv3z9jb.
[5] Krishnan Prakash A, Zhang K, Gupta P, Blum D, Marshall M, Fierro G, Alstone P, Zoellick J, Brown R, Pritoni M. Solar+ optimizer: A model predictive control optimization platform for grid responsive building microgrids. Energies 2020;13(12). http://dx.doi.org/10.3390/en13123093, URL https://www.mdpi.com/1996-1073/13/12/3093.
[6] Bruno S, Giannoccaro G, La Scala M. A demand response implementation in tertiary buildings through model predictive control. IEEE Trans Ind Appl 2019;55(6):7052–61. http://dx.doi.org/10.1109/TIA.2019.2932963.
[7] Kim D, Braun J, Cai J, Fugate D. Development and experimental demonstration of a plug-and-play multiple RTU coordination control algorithm for small/medium commercial buildings. Energy Build 2015;107:279–93. http://dx.doi.org/10.1016/j.enbuild.2015.08.025, URL https://www.sciencedirect.com/science/article/pii/S0378778815302097.
[8] Bonthu RK, Pham H, Aguilera RP, Ha QP. Minimization of building energy cost by optimally managing PV and battery energy storage systems. In: 2017 20th international conference on electrical machines and systems (ICEMS). 2017, p. 1–6. http://dx.doi.org/10.1109/ICEMS.2017.8056442.
[9] Wang Z, Yang R, Wang L. Intelligent multi-agent control for integrated building and micro-grid systems. In: ISGT 2011. 2011, p. 1–7. http://dx.doi.org/10.1109/ISGT.2011.5759134.
[10] Reynolds J, Rezgui Y, Kwan A, Piriou S. A zone-level, building energy optimisation combining an artificial neural network, a genetic algorithm, and model predictive control. Energy 2018;151:729–39. http://dx.doi.org/10.1016/j.energy.2018.03.113, URL https://www.sciencedirect.com/science/article/pii/S036054421830522X.
[11] Macarulla M, Casals M, Forcada N, Gangolells M. Implementation of predictive control in a commercial building energy management system using neural networks. Energy Build 2017;151:511–9. http://dx.doi.org/10.1016/j.enbuild.2017.06.027, URL https://www.sciencedirect.com/science/article/pii/S0378778817300907.
[12] Drgoňa J, Arroyo J, Cupeiro Figueroa I, Blum D, Arendt K, Kim D, Ollé EP, Oravec J, Wetter M, Vrabie DL, Helsen L. All you need to know about model predictive control for buildings. Annu Rev Control 2020;50:190–232. http://dx.doi.org/10.1016/j.arcontrol.2020.09.001, URL https://www.sciencedirect.com/science/article/pii/S1367578820300584.
[13] Vázquez-Canteli JR, Nagy Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl Energy 2019;235:1072–89. http://dx.doi.org/10.1016/j.apenergy.2018.11.002, URL http://www.sciencedirect.com/science/article/pii/S0306261918317082.
[14] Sutton RS, Barto AG. Reinforcement learning: An introduction. USA: A Bradford Book; 2018.
[15] Wang Z, Hong T. Reinforcement learning for building controls: The opportunities and challenges. Appl Energy 2020;269:115036. http://dx.doi.org/10.1016/j.apenergy.2020.115036, URL http://www.sciencedirect.com/science/article/pii/S0306261920305481.
[16] White CC, White DJ. Markov decision processes. European J Oper Res 1989;39:1–16, URL https://doi.org/10.1016/0377-2217(89)90348-2.
[17] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, et al. Mastering the game of go without human knowledge. Nature 2017;550(7676):354–9.
[18] Mocanu E, Mocanu DC, Nguyen PH, Liotta A, Webber ME, Gibescu M, Slootweg JG. On-line building energy optimization using deep reinforcement learning. IEEE Trans Smart Grid 2019;10(4):3698–708.
[19] Wei T, Wang Y, Zhu Q. Deep reinforcement learning for building HVAC control. In: Proceedings of the 54th annual design automation conference 2017. 2017, p. 1–6.
[20] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529–33. http://dx.doi.org/10.1038/nature14236.
[21] Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D. Continuous control with deep reinforcement learning. 2015, arXiv preprint arXiv:1509.02971.
[22] Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. Deterministic policy gradient algorithms. In: Proceedings of the 31st international conference on machine learning - Volume 32. ICML'14, JMLR.org; 2014, p. I–387–I–395.
[23] Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. Deep reinforcement learning that matters. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 32. 2018, (1). URL https://ojs.aaai.org/index.php/AAAI/article/view/11694.
[24] Liang E, Liaw R, Nishihara R, Moritz P, Fox R, Goldberg K, Gonzalez J, Jordan M, Stoica I. RLlib: Abstractions for distributed reinforcement learning. In: Dy J, Krause A, editors. Proceedings of the 35th international conference on machine learning. Proceedings of machine learning research, vol. 80, Stockholmsmässan, Stockholm, Sweden: PMLR; 2018, p. 3053–62, URL http://proceedings.mlr.press/v80/liang18b.html.
[25] Mason K, Grijalva S. A review of reinforcement learning for autonomous building energy management. Comput Electr Eng 2019;78:300–12. http://dx.doi.org/10.1016/j.compeleceng.2019.07.019, URL http://www.sciencedirect.com/science/article/pii/S0045790618333421.
[26] Yang T, Zhao L, Li W, Zomaya AY. Reinforcement learning in sustainable energy and electric systems: a survey. Annu Rev Control 2020;49:145–63. http://dx.doi.org/10.1016/j.arcontrol.2020.03.001, URL https://www.sciencedirect.com/science/article/pii/S1367578820300079.
[33] Azuatalam D, Lee W-L, de Nijs F, Liebman A. Reinforcement learning for whole-building HVAC control and demand response. Energy AI 2020;2:100020. http://dx.doi.org/10.1016/j.egyai.2020.100020, URL http://www.sciencedirect.com/science/article/pii/S2666546820300203.
[34] Nagy A, Kazmi H, Cheaib F, Driesen J. Deep reinforcement learning for optimal control of space heating. In: 4th building simulation and optimization conference. 2018, URL http://www.ibpsa.org/proceedings/BSO2018/1C-4.pdf.
[35] Lee S, Choi D-H. Energy management of smart home with home appliances, energy storage system and electric vehicle: A hierarchical deep reinforcement learning approach. Sensors 2020;20(7). http://dx.doi.org/10.3390/s20072157, URL https://www.mdpi.com/1424-8220/20/7/2157.
[36] Chen B, Cai Z, Bergés M. Gnu-RL: A precocial reinforcement learning solution for building HVAC control using a differentiable MPC policy. In: Proceedings of the 6th ACM international conference on systems for energy-efficient buildings, cities, and transportation. 2019, p. 316–25.
[37] Zhang Z, Lam KP. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In: Proceedings of the 5th conference on systems for built environments. BuildSys '18, New York, NY, USA: Association for Computing Machinery; 2018, p. 148–57. http://dx.doi.org/10.1145/3276774.3276775.
[38] Moriyama T, De Magistris G, Tatsubori M, Pham T-H, Munawar A, Tachibana R. Reinforcement learning testbed for power-consumption optimization. In: Li L, Hasegawa K, Tanaka S, editors. Methods and applications for modeling and simulation of complex systems. Singapore: Springer Singapore; 2018, p. 45–59.
[39] Zhang Z, Chong A, Pan Y, Zhang C, Lu S, Lam KP. A deep reinforcement learning approach to using whole building energy model for HVAC optimal control. In: ASHRAE and IBPSA-USA SimBuild building performance modeling conference. 2018, URL https://www.researchgate.net/publication/326711617_A_Deep_Reinforcement_Learning_Approach_to_Using_Whole_Building_Energy_Model_For_HVAC_Optimal_Control.
[40] Zhang Z, Zhang C, Lam K. A deep reinforcement learning method for model-based optimal control of HVAC systems. 2018, p. 397–402. http://dx.doi.org/10.14305/ibpc.2018.ec-1.01.
[41] Zhang Z, Chong A, Pan Y, Zhang C, Lam KP. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build 2019;199:472–90. http://dx.doi.org/10.1016/j.enbuild.2019.07.029, URL http://www.sciencedirect.com/science/article/pii/S0378778818330858.
[42] Li Y, Wen Y, Tao D, Guan K. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Trans Cybern 2020;50(5):2002–13.
[43] Kazmi H, Mehmood F, Lodeweyckx S, Driesen J. Gigawatt-hour scale savings on a budget of zero: Deep reinforcement learning based optimal control of hot water systems. Energy 2018;144:159–68. http://dx.doi.org/10.1016/j.energy.2017.12.019, URL http://www.sciencedirect.com/science/article/pii/S0360544217320388.
[44] Wu P, Partridge J, Bucknall R. Cost-effective reinforcement learning energy management for plug-in hybrid fuel cell and battery ships. Appl Energy 2020;275:115258. http://dx.doi.org/10.1016/j.apenergy.2020.115258, URL http://www.sciencedirect.com/science/article/pii/S0306261920307704.
[45] Qiu S, Li Z, Li Z, Li J, Long S, Li X. Model-free control method based on reinforcement learning for building cooling water systems: Validation by measured data-based simulation. Energy Build 2020;218:110055. http://dx.doi.org/10.1016/j.enbuild.2020.110055, URL http://www.sciencedirect.com/science/article/pii/S0378778819339945.
[46] Soares A, Geysen D, Spiessens F, Ectors D, De Somer O, Vanthournout K. Using reinforcement learning for maximizing residential self-consumption – Results from a field test. Energy Build 2020;207:109608. http://dx.doi.org/10.1016/j.enbuild.2019.109608, URL http://www.sciencedirect.com/science/article/pii/
[27] Wang Y, Velswamy K, Huang B. A long-short term memory recurrent neural
S037877881930934X.
network based reinforcement learning controller for office heating ventilation
[47] Yu L, Xie W, Xie D, Zou Y, Zhang D, Sun Z, Zhang L, Zhang Y, Jiang T. Deep
and air conditioning systems. Processes 2017;5(3). http://dx.doi.org/10.3390/
reinforcement learning for smart home energy management. IEEE Internet Things
pr5030046, URL https://www.mdpi.com/2227-9717/5/3/46.
J 2020;7(4):2751–62. http://dx.doi.org/10.1109/JIOT.2019.2957289.
[28] Jia R, Jin M, Sun K, Hong T, Spanos C. Advanced building control via deep
[48] Alfaverh F, Denaï M, Sun Y. Demand response strategy based on reinforce-
reinforcement learning. Energy Procedia 2019;158:6158–63. http://dx.doi.org/
ment learning and fuzzy reasoning for home energy management. IEEE Access
10.1016/j.egypro.2019.01.494, Innovative Solutions for Energy Transitions. URL
2020;8:39310–21. http://dx.doi.org/10.1109/ACCESS.2020.2974286.
http://www.sciencedirect.com/science/article/pii/S187661021930517X.
[49] Schreiber T, Eschweiler S, Baranski M, Müller D. Application of two
[29] Wei T, Wang Y, Zhu Q. Deep reinforcement learning for building HVAC control. promising Reinforcement Learning algorithms for load shifting in a cool-
In: 2017 54th ACM/EDAC/IEEE design automation conference (DAC). 2017, p. ing supply system. Energy Build 2020;229:110490. http://dx.doi.org/10.1016/
1–6. j.enbuild.2020.110490, URL http://www.sciencedirect.com/science/article/pii/
[30] Brandi S, Piscitelli MS, Martellacci M, Capozzoli A. Deep reinforcement S0378778820320922.
learning to optimise indoor temperature control and heating energy consump- [50] Hua H, Qin Y, Hao C, Cao J. Optimal energy management strategies
tion in buildings. Energy Build 2020;224:110225. http://dx.doi.org/10.1016/ for energy Internet via deep reinforcement learning approach. Appl Energy
j.enbuild.2020.110225, URL http://www.sciencedirect.com/science/article/pii/ 2019;239:598–609.
S0378778820308963. [51] BL Energy Technologies Area. FLEXLAB: Advanced integrated building & grid
[31] Yoon YR, Moon HJ. Performance based thermal comfort control (PTCC) technologies testbed. 2020, URL https://flexlab.lbl.gov/.
using deep reinforcement learning for space cooling. Energy Build [52] Yu L, Qin S, Zhang M, Shen C, Jiang T, Guan X. Deep reinforcement learning
2019;203:109420. http://dx.doi.org/10.1016/j.enbuild.2019.109420, URL for smart building energy management: A survey. 2020, arXiv:2008.05074.
http://www.sciencedirect.com/science/article/pii/S0378778819310692. [53] EI Administration. 2012 CBECS survey data - microdata. 2020, https://www.eia.
[32] Gao G, Li J, Wen Y. Energy-efficient thermal comfort control in smart buildings gov/consumption/commercial/data/2012/index.php?view=microdata. [Online;
via deep reinforcement learning. 2019, URL arXiv:1901.04693. accessed 17-December-2020].
[54] Pacific Gas and Electric Company. Electric schedule B-19. 2020, https://www.pge.com/tariffs/assets/pdf/tariffbook/ELEC_SCHEDS_B-19.pdf. [Online; accessed 16-December-2020].
[55] Vasudevan J, Swarup KS. Price based demand response strategy considering load priorities. In: 2016 IEEE 6th international conference on power systems (ICPS). 2016, p. 1–6. http://dx.doi.org/10.1109/ICPES.2016.7584019.
[56] Asadinejad A, Tomsovic K. Optimal use of incentive and price based demand response to reduce costs and price volatility. Electr Power Syst Res 2017;144:215–23. http://dx.doi.org/10.1016/j.epsr.2016.12.012, URL https://www.sciencedirect.com/science/article/pii/S0378779616305259.
[57] Liu J, Yin R, Pritoni M, Piette MA, Neukomm M. Developing and evaluating metrics for demand flexibility in buildings: Comparing simulations and field data. 2020, URL https://escholarship.org/uc/item/2d93p636.
[58] ASHRAE. New guideline on standardized advanced sequences of operation for common HVAC systems. 2018, URL https://www.ashrae.org/news/esociety/new-guideline-on-standardized-advanced-/sequences-of-operation-for-common-hvac-systems.
[59] Nouidui T, Wetter M, Zuo W. Functional mock-up unit for co-simulation import in EnergyPlus. J Build Perform Simul 2014;7(3):192–202.
[60] Gehbauer C, Mueller J, Swenson T, Vrettos E. Photovoltaic and behind-the-meter battery storage: Advanced smart inverter controls and field demonstration. 2020.
[61] Andersson C, Åkesson J, Führer C. PyFMI: A Python package for simulation of coupled dynamic models with the functional mock-up interface. Centre for Mathematical Sciences, Lund University, Lund; 2016.
[62] Zhang Z, Chong A, Pan Y, Zhang C, Lam KP. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build 2019;199:472–90.
[63] Morris MD. Factorial sampling plans for preliminary computational experiments. Technometrics 1991;33(2):161–74.
[64] Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13(2).
[65] Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014, arXiv preprint arXiv:1412.6980.
[66] Islam R, Henderson P, Gomrokchi M, Precup D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. 2017, arXiv preprint arXiv:1708.04133.
[67] Prakash AK, Touzani S, Kiran M, Agarwal S, Pritoni M, Granderson J. Deep reinforcement learning in buildings: Implicit assumptions and their impact. In: Proceedings of the 1st international workshop on reinforcement learning for energy management in buildings & cities. RLEM’20, New York, NY, USA: Association for Computing Machinery; 2020, p. 48–51. http://dx.doi.org/10.1145/3427773.3427868.
[68] Tan Y, Yang J, Chen X, Song Q, Chen Y, Ye Z, Su Z. Sim-to-real optimization of complex real world mobile network with imperfect information via deep reinforcement learning from self-play. 2018, arXiv:1802.06416.
[69] Sohlberg B, Jacobsen E. Grey box modelling - branches and experiences. IFAC Proc Vol 2008;41(2):11415–20. http://dx.doi.org/10.3182/20080706-5-KR-1001.01934, 17th IFAC World Congress. URL http://www.sciencedirect.com/science/article/pii/S1474667016408025.