
Applied Energy 304 (2021) 117733


Controlling distributed energy resources via deep reinforcement learning for load flexibility and energy efficiency

Samir Touzani 1, Anand Krishnan Prakash 1, Zhe Wang 1, Shreya Agarwal, Marco Pritoni ∗, Mariam Kiran, Richard Brown, Jessica Granderson
Lawrence Berkeley National Laboratory, United States of America

ARTICLE INFO

Keywords: Deep reinforcement learning; Deep deterministic policy gradient algorithm; Smart buildings; Control systems; Distributed energy resources; Load flexibility; Energy efficiency

ABSTRACT

Behind-the-meter distributed energy resources (DERs), including building solar photovoltaic (PV) technology and electric battery storage, are increasingly being considered as solutions to support carbon reduction goals and increase grid reliability and resiliency. However, dynamic control of these resources in concert with traditional building loads, to effect efficiency and demand flexibility, is not yet commonplace in commercial control products. Traditional rule-based control algorithms do not offer integrated closed-loop control to optimize across systems, and most often, PV and battery systems are operated for energy arbitrage and demand charge management, and not for the provision of grid services. More advanced control approaches, such as model predictive control (MPC), have not been widely adopted in industry because they require significant expertise to develop and deploy. Recent advances in deep reinforcement learning (DRL) offer a promising option to optimize the operation of DER systems and building loads with reduced setup effort. However, there are limited studies that evaluate the efficacy of these methods in controlling multiple building subsystems simultaneously. Additionally, most of the research has been conducted in simulated environments as opposed to real buildings. This paper proposes a DRL approach that uses a deep deterministic policy gradient algorithm for integrated control of HVAC and electric battery storage systems in the presence of on-site PV generation. The DRL algorithm, trained on synthetic data, was deployed in a physical test building and evaluated against a baseline that uses the current best-in-class rule-based control strategies. Performance in delivering energy efficiency, load shift, and load shed was tested using price-based signals. The results showed that the DRL-based controller can produce cost savings of up to 39.6% as compared to the baseline controller, while maintaining similar thermal comfort in the building. The project team has also integrated the simulation components developed during this work into an OpenAI Gym environment and made it publicly available so that prospective DRL researchers can leverage this environment to evaluate alternate DRL algorithms.

1. Introduction

Behind-the-meter distributed energy resources (DERs) include local electricity generation (e.g., solar photovoltaic [PV]), energy storage devices (e.g., battery units), and thermostatically controlled loads (TCLs) such as heating, ventilation and air conditioning systems (HVAC) [1]. DERs are transforming the demand side of the grid from traditionally passive to active systems [2] and they are increasingly being considered as solutions to support carbon emissions and energy saving goals, and to maximize grid reliability and resiliency [3]. However, coordinating these resources with traditional building loads to improve load flexibility and grid response is not common practice in commercial control systems [4]. The challenge is how to implement a DER energy management system such that the control strategy can fulfill the grid flexibility commitment without affecting building services (e.g., occupant comfort) [3].

Traditional rule-based control algorithms do not offer integrated control to coordinate across systems and they typically operate sub-optimally [4]. On the other hand, several advanced control algorithms have been proposed in academia in the last two decades, including Model Predictive Control [5–7], Particle Swarm Optimization [8,9], and Neural Networks [10,11], for similar energy management problems. Nevertheless, these algorithms have not been adopted by the controls industry at scale yet. For instance, model-predictive control (MPC) is currently the state of the art for building controls, particularly those

∗ Corresponding author.
E-mail address: [email protected] (M. Pritoni).
1 These authors contributed equally to this work.

https://doi.org/10.1016/j.apenergy.2021.117733
Received 12 March 2021; Received in revised form 17 August 2021; Accepted 26 August 2021
Available online 13 September 2021
0306-2619/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

that coordinate diverse end-use systems [12]. However, MPC requires a dynamic model (typically based on first principles or a mix of physics and data) of the system under control. Developing these models needs significant expertise and is time consuming, making this approach difficult to scale. Since buildings are heterogeneous, different models are required for each building. Without addressing these scalability issues, MPC algorithms for DERs cannot be broadly implemented.

With the development of machine learning algorithms and the availability of inexpensive computing power, Deep Reinforcement Learning (DRL) has emerged as a promising alternative to MPC for managing DERs and building loads [13]. Hence, this work explores the development of a DRL controller and its deployment in an actual building to study the benefits and challenges of such an approach and to evaluate potential scalability issues. While there are multiple ways to develop a DRL controller, the process generally involves formulating an agent that is in a partially observable environment and learning the best decisions through interactions with the environment. The agent observes environment snapshots and chooses an action, receiving a reward value for this action in the current state. It continues to receive feedback on its actions until a terminal state is reached. The objective is to maximize the cumulative reward over all actions in the time the agent is active [14]. Nevertheless, allowing a DRL controller to train directly in a real building (the environment) by actively changing setpoints (online learning) could result in discomfort to the occupants and damage to the equipment. Hence, in the literature, the agent is sometimes trained using historical operational data before deploying it to real buildings (offline learning). However, as operational data from an existing building might be insufficient to represent all possible system states, oftentimes simulation models of the building (e.g., using simulation software such as MATLAB and EnergyPlus™) are used to train the DRL controller [15].

This paper proposes a DRL-based approach to control DERs and tests it in a real building that includes building PV and electric battery storage. The rest of the paper is organized as follows: in Section 2 we review the key concepts of Reinforcement Learning (RL) (Section 2.1), present a summary of the existing research that has applied RL in building controls (Section 2.2), and identify the research gaps in the literature and state the contributions of this paper (Section 2.3). In Section 3 we present the methodology, including the details about the field test and experimental setup, and in Section 4 we describe the design and training of the DRL algorithm. In Section 5 we present the results of the experiment, and Section 6 dives into detail about some of the interesting topics of discussion, including future directions of research. The paper is summarized and concluded in Section 7.

Nomenclature

$saving (%)            Cost savings %
$E_baseline            Cost of net energy consumption by baseline
$E_DRL                 Cost of net energy consumption by DRL
%E_saving              Energy savings %
abbreviation           Explanation for the abbreviation
E_baseline             Net energy consumption by baseline
E_DRL                  Net energy consumption by DRL
E_net                  Net energy consumption
E_saving               Energy savings
P_batt                 Net battery power
P_bld                  Building load
P_pv                   PV generation
Δshed_hp               Differential shed
Δtake_low+med p        Differential take
η_overshoot            Number of timestamps when ZAT overshot the thermal comfort band
η_total                Total number of occupancy timestamps
μ_ν                    Mean of comfort violation
ν                      Comfort violation
σ_ν                    Standard deviation of comfort violation
ζ_ν                    Comfort violation in °C-h
a                      Total floor area
P_baseline,hp          Baseline demand during high price
P_baseline,low+med p   Baseline demand during low+medium price
P_DRL,hp               DRL demand during high price
P_DRL,low+med p        DRL demand during low+medium price

2. Related work

This section details the reinforcement learning background, specifically the Markov Decision Process. It also discusses the Deep Deterministic Policy Gradient (DDPG) method and how DRL can be used for building controls. Finally, it highlights the research gaps that were identified through the literature review and the contributions of this paper.

2.1. Learning via Markov decision processes

2.1.1. Reinforcement learning overview

Reinforcement learning (RL) is a class of machine learning algorithms based on a trial-and-error learning approach. An RL agent interacts with the environment and learns its dynamics by directly trying different control actions and observing the consequences through some notion of reward. This typically involves the agent trying a significant number of actions (e.g., HVAC setpoints, battery setpoints) from a possible action space in the environment (e.g., building, DER system) that is in one of many possible states (such as a specific time of the day when the building is occupied or when the outdoor temperature is above a certain limit), and receiving a reward. The rewards indicate to the RL agent how well a particular action performed with respect to some environmental condition (e.g., thermal comfort, energy cost) [14].

Assuming that the environment is a fully observed collection of states, such that the observation at time t is equal to the environment state at time t (i.e., s_t), the sequential interaction between the RL agent and the environment can be modeled as a Markov Decision Process (MDP), which means the future state s_{t+1} of the environment is only dependent on the current state s_t (i.e., given the present state, a future state is independent of the past states) [16]. Formally, for an MDP the state transition probability is defined as:

P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …) = P(s_{t+1} | s_t, a_t)    (1)

An MDP is defined as a tuple ⟨S, A, P, R, γ⟩, where S is the set of states of the environment, A is the set of possible actions, P is the state transition probability that describes the probability distribution of the next state (s_{t+1}) given the current state (s_t) and action (a_t), R is the reward function that provides the reward obtained by taking action a_t at state s_t, and finally γ ∈ [0, 1] is a parameter called the discount factor, which determines the importance of future rewards. If γ = 0 the agent is concerned only with maximizing immediate rewards, which means it will learn a control policy that selects actions to maximize R_{t+1}. As γ increases, the RL agent becomes increasingly focused on future rewards. Given an MDP, the RL agent learns, by mapping the environment states to the actions, a control policy (π(a_t | s_t) : S_t → A_t) that maximizes the expected cumulative reward at each time step (i.e., maximizing the expected cumulative reward it will receive in the future) [17].

Further, RL algorithms can be subcategorized into different groups: on-policy versus off-policy, online versus offline learning, and model-based versus model-free algorithms. Off-policy algorithms, as compared


to on-policy ones, have a greedy learning approach. In on-policy learning, the RL controller records rewards based on the current strategy/action performed, while in off-policy learning, the RL agent greedily selects the actions that gave the best performance from memory. Online learning means the RL agent is deployed in the real-world environment and is trained by interacting with the environment. Offline learning means the RL agent is trained without directly interacting with the environment; instead, it learns from historical data or in a virtual training environment. In offline learning, the controller is deployed in the real environment once it is well trained. Given this nature, offline learning is ''less risky'' because the controller does not need to interact with the environment when it is not well trained. In model-based learning, the controller learns the system dynamics first and then uses the learned system dynamics for planning, while a model-free algorithm learns the optimal control without learning the system dynamics. The model-based algorithm is usually more computationally expensive, because the algorithm first needs to learn an accurate environment model (usually a difficult task in real applications) and then needs to find an optimal policy. Thus, model-free algorithms are more popular, as they are usually less computationally expensive.

Action value function. The action-value function represents the expected cumulative reward of the RL agent starting from state s and following control policy π. Formally, the action-value function is defined as:

Q_π(s, a) = E_π[ Σ_{k=0}^{T} γ^k r_{t+k+1} | S_t = s, A_t = a ]    (2)

where E_π is the expected value given that the RL agent follows the control policy π, and T is the final time step of the episode. Depending on the considered environment, the value of the discount factor γ needs to be tuned to balance current and future rewards.

Q-Learning. The action-value function, which is also known as the Q-value, gave its name to one of the major off-policy and model-free RL algorithms, Q-learning. The process of Q-learning algorithms is to compute and update at each iteration the Q-value for each state–action pair at time t, in order to achieve the maximum cumulative reward. The optimal Q*_π(s_t, a_t) for action a_t taken at state s_t can be expressed using a recursive relation known as the Bellman equation, which is a summation of the present reward and the maximum discounted future rewards:

Q*_π(s_t, a_t) = r(s_t, a_t) + γ max Q_π(s_{t+1}, a_{t+1})    (3)

When used in a discrete state–action space, the RL agent policy is determined by a state–action lookup table called a Q-table, which is used to select an action for a given state. The Q-table is updated according to the following equation:

Q_π(s_t, a_t) ← Q_π(s_t, a_t) + α [ r(s_t, a_t) + γ max_{a′∈A} Q_π(s_{t+1}, a′) − Q_π(s_t, a_t) ]    (4)

where α ∈ [0, 1] is the learning rate.
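To make the tabular update in Eq. (4) concrete, the following minimal sketch maintains a Q-table for a small discrete problem; the state and action counts and the values of α, γ, and ε are illustrative assumptions, not values used in this work.

```python
import numpy as np

# Illustrative sketch of the tabular Q-learning update in Eq. (4).
# n_states, n_actions, alpha and gamma are assumptions for demonstration.
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.95
Q = np.zeros((n_states, n_actions))          # Q-table, one row per state

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the Bellman target."""
    td_target = r + gamma * np.max(Q[s_next])    # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])     # Eq. (4)

def epsilon_greedy(s, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))
```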
Deep Q-Learning. The Q-table approach can work well for environments where the state–action space is relatively small, but when the state and action spaces increase or become infinite (i.e., environments with continuous states/actions), the size of the table becomes intractable. A solution to this problem is to replace the Q-table with a nonlinear representation (i.e., a function approximation) that maps state and action onto a Q-value, which can be seen as a supervised learning problem (i.e., regression). Recently, approximating the action-value function with a deep neural network has become one of the most popular options, due to the network's capacity to accurately approximate high dimensional and complex systems. A simplified illustration of the Q-learning algorithm using an approximation function is provided by the following pseudo-code.

Algorithm 1: Q-learning algorithm using an approximation function
1. Initialize Q(s_t, a_t) (i.e., initialize the neural network with random weights).
2. Obtain an observation of the transition (s_t, a_t, r_t, s_{t+1}) by making the RL agent interact with the environment.
3. Compute the loss function: L = ( Q_π(s_t, a_t) − [ r(s_t, a_t) + γ max_{a′∈A} Q_π(s_{t+1}, a′) ] )².
4. Update Q (e.g., using the stochastic gradient descent algorithm) by minimizing L with respect to the neural network parameters.
5. Repeat from step 2 until some convergence criterion is satisfied.

Several papers have used neural networks to approximate action-value functions [18,19]. However, these early methods were often unstable. This instability was due to: (1) the correlation between sequential observations, (2) the correlation between the action-values Q_π(s_t, a_t) and the targets r(s_t, a_t) + γ max_{a′} Q_π(s_{t+1}, a′), and (3) the fact that a small change to the action-value function may significantly change the policy.

In 2015, Mnih et al. [20] introduced the Deep Q-Network (DQN), which was the first successful combination of a deep neural network and the Q-learning algorithm. This work is responsible for the rapid growth of the field of deep reinforcement learning (DRL). In addition to the innovation of using neural networks, DQN proposed the use of a replay buffer and two neural networks to approximate the action-value function and to overcome the instability issues of previous approaches. The replay buffer is used to store a relatively large number of observed transitions (s_t, a_t, r_t, s_{t+1}). Mini-batches of these transitions are sampled randomly (using a uniform distribution), which allows the algorithm to update the neural network from a set of uncorrelated transitions at each iteration. To reduce the correlation between the action-values Q_π(s_t, a_t) and the targets r(s_t, a_t) + γ max_{a′} Q_π(s_{t+1}, a′), DQN uses two neural networks. The weights of the first one are updated at each iteration, and this network directly interacts with the environment. The weights of the second neural network, called a target network, are updated after a fixed number of iterations by simply copying the weights of the first network.

While DQN can handle environments with high dimensional state spaces, it can only be used with discrete and low-dimensional action spaces. In fact, DQN relies on finding the action that maximizes the action-value function, which requires an iterative optimization at every step if used with an environment that has continuous actions. In theory, one can discretize the action space; however, this solution is likely to be intractable for problems that require fine control of actions (i.e., finer grained discretization). A high number of discrete actions is difficult to explore efficiently. Moreover, a naive discretization strategy of the action space can exclude important information about the action domain, which can be essential for finding the optimal control policy in several problems. To overcome this issue of discrete action spaces, [21] introduced the Deep Deterministic Policy Gradient (DDPG) algorithm, a model-free, off-policy algorithm that can learn control policies in high-dimensional, continuous action spaces.

2.1.2. Deep deterministic policy gradient

DDPG [21] is an actor–critic framework based on the deterministic policy gradient (DPG) method [22], and it borrows methodological advances (i.e., the replay buffer and target networks) developed for the DQN algorithm [20]. An actor–critic algorithm is an RL algorithm that has two agents: an actor and a critic. The actor makes decisions based on observations of the environment and the current control policy. The role of the critic is to observe the state of the environment and the reward obtained by using an action, and to return the value-function estimate to the actor.

DDPG learns two networks that approximate the actor function μ(s_t | θ^μ) and the critic function Q(s_t, a_t | θ^Q), where s_t is the state of the environment at time t, a_t is the control action (i.e., the HVAC setpoints and the battery setpoint), and θ^μ and θ^Q are respectively the weights of the actor and critic networks.
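For illustration, a minimal PyTorch sketch of such an actor and critic pair is given below; the hidden-layer sizes, activation functions, and the tanh output scaling are assumptions for demonstration rather than the exact architecture trained in this work.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu): maps a state to one action vector."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions bounded in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Action-value function Q(s, a | theta_Q): maps a state-action pair to a scalar."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```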


The actor function μ(s_t | θ^μ) deterministically maps states to a specific action and represents the current policy. The critic function Q(s_t, a_t | θ^Q) (i.e., the action-value function) maps each state–action pair to a value in R that is the expected cumulative future reward obtained by taking action a_t at state s_t and following the policy. During the training process the actor and critic networks are iteratively updated using stochastic gradient descent (SGD) on two different losses, L_μ and L_Q, which are computed using mini-batches of transition samples (s_t, a_t, r_t, s_{t+1}), where s_{t+1} is the state resulting from taking action a_t at state s_t, and r_t is the subsequent reward. Using SGD assumes that the transition samples in the mini-batches are independently and identically distributed, which may not be the case when the observations are generated sequentially. Similarly to DQN, DDPG addresses this issue by sampling the transitions uniformly from a relatively large, fixed-size replay buffer. This allows the algorithm to update the actor and critic networks from a set of uncorrelated transitions at each training iteration. The replay buffer is populated by sampling transitions from the environment using the current policy, and after the buffer is fully populated, the oldest transitions are replaced by the new ones. Another method that DDPG borrows from DQN to improve the stability of the training process is the use of two target networks: μ′(s_t | θ^μ′) and Q′(s_t, a_t | θ^Q′). Their weights θ^μ′ and θ^Q′ slowly track the learned weights θ^μ and θ^Q.

As for most DRL algorithms, DDPG needs to explore the state space in order to avoid converging to a local minimum that produces non-optimal policies. Usually, this exploration is performed by adding noise sampled from a predefined random process N to the actions produced by the actor network:

a_t = μ(s_t | θ^μ) + N    (5)

where N can be either a Gaussian process or, as proposed in [21], an Ornstein–Uhlenbeck (OU) process that generates temporally correlated exploration. This solution makes the changes in action noise less abrupt from one time step to another.

The pseudo-code of DDPG is given below. This algorithm is the DRL approach used in this work.

Algorithm 2: DDPG algorithm
Initialize the Q(s_t, a_t | θ^Q) and μ(s_t | θ^μ) networks with random weights θ^Q and θ^μ.
Initialize the target networks μ′(s_t | θ^μ′) and Q′(s_t, a_t | θ^Q′) with θ^μ′ ← θ^μ, θ^Q′ ← θ^Q.
Initialize the replay buffer B.
for episode = 1, …, M do
    Initialize an OU process N_t for exploration
    Obtain the initial observation of state s_1
    for t = 1, …, T do
        Obtain action a_t = μ(s_t | θ^μ) + N_t
        Execute action a_t, compute the reward r_t, and observe the new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in replay buffer B
        Randomly sample a mini-batch of N transitions (s_i, a_i, r_i, s_{i+1}) from B
        Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
        Update θ^Q by minimizing the loss: L = (1/N) Σ_{i=1}^{N} (y_i − Q(s_i, a_i | θ^Q))²
        Update θ^μ using the sampled policy gradient:
            ∇_{θ^μ} J ≈ (1/N) Σ_{i=1}^{N} ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
        Update the target networks:
            θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′;  θ^μ′ ← τ θ^μ + (1 − τ) θ^μ′
    end for
end for
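To make two of the mechanisms in Algorithm 2 concrete, the sketch below implements an Ornstein–Uhlenbeck exploration process (used in Eq. (5)) and the soft target-network update from the last step of the inner loop; the parameter values θ, σ, and τ are illustrative assumptions, not the tuned values used in this work.

```python
import numpy as np
import torch

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(action_dim)

    def reset(self):
        self.x[:] = 0.0

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

def soft_update(target_net, net, tau=0.001):
    """Target-network update: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```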
2.1.3. Hyperparameter tuning

As seen in Henderson et al. [23], the sensitivity of various DRL techniques can impact the reward scale, environment dynamics, and reproducibility of the experiments. Finding benchmarks for a fair comparison is a challenge, but OpenAI stable baselines can help provide an initial comparison across the chosen gym environments. Complex hyperparameters can affect how quickly the DRL solutions reach consensus or produce a robust solution that achieves the maximum reward. A popular technique to find optimal hyperparameters is the grid search, as illustrated by Liang et al. [24].

2.2. DRL for building and DER control

The literature on advanced control algorithms for optimizing a building's energy operations is large, as mentioned in Section 1. Given the scope of this work, in this section we focus on reviewing the growing body of research using RL-based algorithms to control buildings and DERs [13,15,25,26]. Researchers have used DRL controllers to achieve different objectives: the authors of [27–36] aimed at improving the thermal comfort of building occupants while minimizing energy consumption, whereas others [37–47] aimed at improving energy efficiency and reducing cost. Some recent work has also looked at using DRL for better load management through peak reduction, load shifting, and better scheduling [18,48,49]. The actions taken by DRL controllers include generating supervisory control setpoints for the HVAC systems (e.g., variable air volume [VAV] boxes, chiller plants, room thermostats), such as temperature [27,28,30–33,37–39,41,42,49], flow rate [29,31,42,49], and fan speed setpoints [45]. Supervisory control of electrical batteries by setting the charge/discharge rate [35,44,46,47] is also common. This approach has the advantage of leaving the lower-level control of the equipment to local controllers that ensure quick response and equipment safety. For simpler systems, such as residential appliances, the authors of [18,35,43,46,48] have also generated actions to turn them on and off directly. With a focus more on the supply side of the grid, Hua et al. also developed an asynchronous advantage actor–critic (A3C)-based DRL controller to manage grid-level renewable energy sources (subgrid-level PV, wind turbine generator) along with the building load, and trained the algorithm in simulation [50].

The reward functions used in the literature typically track the objective of the DRL algorithm, hence penalizing occupant discomfort and rewarding energy savings (and corresponding cost savings) [18,27–35,37–43,45–49]. Control of on-site energy storage often includes an additional penalty to discourage battery depreciation [18,44,47].

Researchers have developed their DRL controllers using a wide variety of algorithms: Actor–Critic [27,35], Asynchronous Actor–Critic [37,39–41], Policy Gradient [18,28], Deep Deterministic Policy Gradient [32,42,47,49], Trust Region Policy Optimization [38], Double Deep Neural Fitted Q Iteration [34], Q-learning [45,46,48], Deep Q-learning [18,29–31,44,49], and Proximal Policy Optimization [33]. Python (using the OpenAI Gym framework) and MATLAB have been the most popular programming environments used to develop these DRL controllers. Due to the difficulty and cost of field testing, a large portion of the existing work has been performed in simulation. In the realm of simulation-based research, DRL-based controllers have been developed for residential buildings [18,34,35,47,48], commercial buildings (e.g., single-zone, multi-zone, multi-storey) [27–29,31–33,39,49], and data centers [38,42]. In the few instances of field tests, DRL controllers have been deployed in well-instrumented facilities, such as the Intelligent Workspace at Carnegie Mellon University, Pittsburgh, United States [36,37,40,41], with unconventional HVAC systems, or in simpler residential buildings [43,46].

Fig. 1. (a) Outside view of FLEXLAB, (b) mechanical drawing of FLEXLAB envelope and HVAC (top view) [51].

2.3. Research gaps and contribution

The literature clearly shows that DRL research in the built environment has great potential, and it is a fast growing field [13,15,25,26]. However, a clear gap identified in the literature is that most DRL-based building energy optimization methods are still not implemented in practice [52] nor use physical equipment [13]. This observation is also echoed by various reviews such as [25]: ''[...] The vast majority of studies that implement RL for problems relating to building energy management are in simulation only'' and [26]: ''[...] the research works about applications of RL in sustainable energy and electric systems are largely still in the laboratory research stage, which have not been practically implemented in sustainable energy and electric systems''. A second gap is that very few RL studies have co-optimized DERs with HVAC control to achieve load flexibility at the building level. The few field tests that employ DRL-based algorithms only optimize the operation of one system at a time (mostly the HVAC system).

Given these gaps in the literature, the goal of this paper is to implement a price-responsive DRL-based controller and deploy it in an actual building that includes on-site distributed energy resources (DERs). The DRL algorithm will aim at minimizing energy cost while maintaining acceptable indoor thermal comfort. The performance of the DRL-based controller will be tested against a best-in-class rule-based controller. The paper makes the following contributions to the existing literature:

• An application of a DRL algorithm for integrated optimal control of a DER system composed of PV, electric storage, and thermal loads in a commercial building
• A field demonstration of the DRL controller in a well-instrumented single-zone commercial building test facility, controlling multiple pieces of equipment, along with on-site DERs
• An open-source OpenAI Gym environment that can be reused by other researchers, in order to both train and evaluate the performance of their algorithms
• A discussion about the challenges and lessons learned in the field deployment of the DRL controller in an actual building, particularly with respect to integrating multiple building systems

3. Methodology

This section discusses the experimental design and test setup. It also outlines the performance metrics used to monitor and evaluate the outputs of different tests and assess the efficacy of the DRL controller. Finally, it discusses the baseline strategy used to control the battery and the building loads during the tests.

3.1. Experimental design and setup

The evaluation of the controller developed in this work was conducted in FLEXLAB [51], a well-instrumented experimental test facility at Lawrence Berkeley National Laboratory in Berkeley, California, United States. FLEXLAB allows accurate measurement of the behavior of integrated systems (e.g., HVAC and batteries) in providing grid services (e.g., shifting/shedding load when requested) and building services (e.g., visual and thermal comfort). To evaluate the performance of the DRL controller compared to a baseline, we used a pair of side-by-side identical cells (called Cell A and Cell B in the paper), a unique feature of FLEXLAB (Fig. 1). Each cell is representative of a small office space with a floor area of 57 m² and includes a large south-facing window. FLEXLAB can be reconfigured with different types of equipment, end-uses, and envelope features. As HVAC we selected a single-zone variable-capacity air-handling unit (AHU) for each cell, a very common system installed in many U.S. buildings [53]. The AHU had a single variable-speed fan and two dampers for outdoor air and return air mixing. The air side of the AHU was monitored with sensors to measure air volume flow rate as well as the temperatures and humidities in each section of the ducts. The water side of the two AHUs was served by a small chiller and boiler, shared by the two cells (Fig. 1). The flow rates and inlet and outlet temperatures in each AHU coil were measured to calculate the heat flows transferred to the thermal zones. Power consumption of the heating and cooling equipment was estimated using these measured heat flows multiplied by a fixed efficiency (chiller COP = 3.0 and boiler efficiency = 0.95). Each room had six ceiling-mounted light fixtures, six plug-load stations (i.e., desks with desktop computers), six heat-generating mannequins that emulate occupants, and multiple environmental sensors measuring temperature, indoor light levels, and relative humidity (Fig. 2). The DERs comprised a 3.64 kilowatt (kW) photovoltaic system and a Tesla Powerwall battery storage with a capacity of 7.2 kWh and a peak power output of 3.3 kW (Fig. 3) for each cell. Each sensor in FLEXLAB collected data every 10 s or less, and the data were aggregated to 1 min for analysis. The DRL controller was deployed in Cell B, while the baseline controller was deployed in Cell A. The FLEXLAB control system allowed users to change the setpoint of the supply air temperature (SATsp) and the supply air flow rate (SAFsp) of the AHU directly, or to influence them by setting the zone air temperature heating and cooling setpoints (ZAThsp, ZATcsp) using a standard control sequence. In addition, the charge/discharge setpoint of the battery was controllable (batterysp).

Three different scenarios were tested, based on demand-side management strategies commonly referred to as: (1) Energy Efficiency (EE), (2) Shed, and (3) Shift [3]. In the EE scenario, the controllers were compared based on the amount of energy purchased from the grid; in the Shed scenario, the comparison was based on the ability to reduce energy purchases in a specific time window; and in the Shift scenario,
the building loads during the tests. energy purchases in a specific time window; and in the Shift scenario,


the two controllers were evaluated based on their ability to shift energy purchases from one time window to another. Different electricity prices were set to produce a desired effect on the electricity purchase (i.e., higher prices would drive reduced purchase from the grid).

In the first experiment, which ran for seven days in July, the electricity price was kept constant (0.13 $/kWh). The Shed experiment involved two tiers of prices: a low price (0.13 $/kWh) throughout the day, and a high price between 4 pm and 9 pm (1.33 $/kWh). The Shift experiment included an additional price tier: high price (4 pm–9 pm) at 0.16 $/kWh; medium price (2 pm–4 pm and 9 pm–11 pm) at 0.13 $/kWh; and low price (11 pm–2 pm) at 0.11 $/kWh. These prices were selected based on the commercial tariffs used in California [54] at the time of the experiment. New research on demand response [55,56] has suggested using prices to reflect grid requirements and to initiate responses from building equipment, instead of using traditional demand-response events. The constant prices used in the EE experiment, along with the objective to minimize total energy costs, discourage any shifting of loads from one period to another and incentivize an overall reduction in energy consumption by optimizing the operations of the building equipment. The tiered prices used in the Shed and Shift experiments were used to trigger load reduction during the high-price times and, in the case of Shift, to also encourage a corresponding additional increase in load during the lower-price periods, thereby achieving a load shed and a load shift.

For simplicity, the tariff only accounted for the variable cost of electricity (i.e., kWh) and not for maximum demand (i.e., kW). Both the baseline and the DRL algorithm were tasked to minimize the energy cost while maintaining a zone air temperature deadband of 21 °C–24 °C during the HVAC operational hours (7 am–7 pm) and 15 °C–29 °C during the remaining hours of the day. The experiments were conducted during the cooling season, at the end of July and early August of 2020, with each experiment spanning 7–8 days, for a total testing period of slightly more than three weeks.

Fig. 2. Room set up inside FLEXLAB: heat-emitting mannequins or equivalent thermal generators, desktop computers, and overhead light-emitting diode (LED) lights with multiple environmental sensors [51].

3.2. Performance metrics

In each scenario, the performance of the controllers was evaluated using the following metrics:

1. ENERGY CONSUMPTION
Since the DER system is composed of local generation and storage, the energy consumption is defined in terms of the energy purchased from the grid, which corresponds to the difference between the PV generation and the building load, adjusted by the change in state of charge of the battery. The building load is the sum of the power consumption of the HVAC systems (chiller, fans, pumps), lighting, and plug loads. We call this the net load.

E_net = ∫_0^T (P_bldg + P_batt − P_pv) dt    (6)

(a) ENERGY SAVINGS
The energy savings are the difference between the net energy consumption in Cell A (baseline) and Cell B (DRL). They are expressed in kWh for the whole testing period of each scenario:

E_saving = E_baseline − E_DRL    (7)

(b) % ENERGY SAVINGS
The % savings correspond to the energy savings divided by the baseline:

% E_saving = (E_baseline − E_DRL) / E_baseline × 100%    (8)

2. ENERGY COST
The energy cost is the product of E_net and the energy price, which differ by scenario and time period.

(a) COST SAVINGS
Similar to energy savings, cost savings are defined as the difference between the cost in Cell A (baseline) and Cell B (DRL). They are expressed in $ for the whole testing period of each scenario.

(b) % COST SAVINGS
The % cost savings correspond to the cost savings divided by the baseline:

$saving % = ($(E_baseline) − $(E_DRL)) / $(E_baseline) × 100%    (9)

Fig. 3. Distributed Energy Resources installed in FLEXLAB [51].


3. THERMAL COMFORT
Thermal comfort is measured in terms of violations of the indoor or zone air temperature (ZAT) from the comfort band during building occupancy (between 8 am–6 pm). The thermal comfort range is defined as 21 °C–24 °C.

(a) % TIME OUTSIDE COMFORT BAND
This metric captures the relative duration of violations compared to the testing period.

η_overshoot % = η_overshoot / η_total × 100%    (10)

(b) MEAN AND STANDARD DEVIATION OUTSIDE COMFORT BAND
The mean and standard deviation of the comfort violations measure the intensity and variability of such events.

μ_ν = Σ_{0}^{η_total} [ max(0, ZAT − 24 °C) + max(0, 21 °C − ZAT) ] / η_total    (11)

σ_ν = sqrt( Σ ( [ max(0, ZAT − 24 °C) + max(0, 21 °C − ZAT) ] − μ_ν )² / η_total )    (12)

(c) DEGREE HOURS OUTSIDE COMFORT BAND
Degree hours are used to evaluate the performance of the controllers. This metric is defined as the integral of the ZAT overshoot beyond 24 °C over the duration of the overshoot, plus the undershoot below 21 °C. The factor of 0.25 in Eq. (13) represents the 15-minute time interval of each timestamp.

ζ_ν = Σ_{0}^{η_total} ( max(0, ZAT − 24 °C) + max(0, 21 °C − ZAT) ) × 0.25 t    (13)

where t = number of timestamps when ZAT is outside the comfort band

ζ_ν per day = ζ_ν / # test days    (14)

4. SHED AND SHIFT METRICS
The differential load shed (W/ft²) during the peak price time is defined as the difference in the average demand intensity between the baseline and DRL controller during the high price period [57].

Δshed_hp = (P_baseline,hp − P_DRL,hp) / a    (15)

Demand shifted during the test can be evaluated using two metrics: a demand reduction metric during the high price period (identical to the demand shed metric) and a demand increase metric during the rest of the day (the take period). Demand increase is considered negative and demand decrease is considered positive, as per conventions from traditional demand response programs.

Δtake_low+med p = (P_baseline,low+med p − P_DRL,low+med p) / a    (16)
suitable for continuous control problems. While the controller does
3.3. Baseline control strategy not need a model, it needs data to be trained on. Sometimes, building
operational data from an existing building is insufficient to represent all
Lighting, plugs, and occupants (i.e., mannequins) were set on a possible system states. In our case, the historical building operational
time-based schedule and operated identically in the two cells. The AHU data available was generated using a single control logic and did not
in the baseline cell implemented Section 5.18 of ASHRAE Guideline explore all possible control actions. In contrast, since DRL needs a trial-
36 (Single Zone VAV Air-Handling Unit) [58]. However, a supervisory and-error approach, i.e. testing different control logic, to figure out
rule-based controller that generated the zone temperature setpoints which one performs the best, a building simulation model was used


to train the DRL controller. A training framework that uses a physics-based simulation component was implemented to learn an end-to-end optimal control policy to manage the charge and discharge of the battery storage and the HVAC setpoints. We calibrated the simulation model using operational data from the testbed, and we are confident that this model can emulate the real environment (FLEXLAB) reasonably well.

4.1. Simulation environment

4.1.1. Software stack and modeling choices

The EnergyPlus™ simulation tool [59] was used to model the building and HVAC system. The physical parameters of the FLEXLAB envelope were selected based on the materials in the construction drawings. For the HVAC system, we developed a water-based primary loop and air-based secondary loop HVAC model, as shown in Fig. 4, which represents the FLEXLAB configuration used for the test. We assumed constant boiler efficiency (0.95) and chiller COP (3.0) because the climate at the test site is mild.

The battery and PV panels were simulated using the Smart Control of Distributed Energy Resources (SCooDER) library [60]. SCooDER is a Modelica library developed to facilitate the simulation and optimization of photovoltaics, battery storage, smart inverters, electric vehicles, and electric power systems [60]. SCooDER depends on the Modelica Standard Library and the Modelica Buildings Library. The interaction between the DRL controller and the four simulated components (i.e., building envelope, HVAC system, battery, PV generation) is a typical co-simulation problem, since multiple software packages are involved. After a review of existing simulation frameworks, we selected the Functional Mock-up Interface (FMI) [59], a standard that defines a container and an interface to exchange dynamic models using a combination of XML files, binaries, and C code zipped into a single file [59]. As an open-source standard, the FMI is used in both academia and industry, and is currently supported by more than 100 tools. Using the FMI, the building envelope, PV, and battery models were all exported in accordance with the FMI standard as Functional Mock-up Units (FMUs). The DRL controller was implemented in a Python environment using the package PyFMI [61], which enabled the interaction with the exported FMUs. The FMUs for the different models were further wrapped into an OpenAI Gym environment, to run the building simulation in parallel with the DRL algorithm, developed in PyTorch (https://pytorch.org/). To allow others to reproduce the analysis or test other DRL algorithms, the simulation and control framework were packaged into a Docker container including all the dependencies needed. The software is distributed as an open-source library and can be found at https://github.com/LBNL-ETA/FlexDRL.
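The snippet below sketches how an exported FMU can be wrapped as an OpenAI Gym environment with PyFMI, in the spirit of the released FlexDRL package; the FMU file name, the input/output variable names, the observation layout, and the initialization calls are hypothetical placeholders rather than the actual interface of the published library, and the exact FMU setup calls depend on the FMI version of the export.

```python
import numpy as np
import gym
from gym import spaces
from pyfmi import load_fmu

class FlexLabFMUEnv(gym.Env):
    """Minimal co-simulation wrapper: one Gym step = one 15-min FMU step (sketch)."""

    def __init__(self, fmu_path="flexlab.fmu", step_s=900.0):
        self.fmu_path, self.step_s, self.t = fmu_path, step_s, 0.0
        # Actions: SATsp, SAFsp, battery setpoint (normalized to [-1, 1]).
        self.action_space = spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        # Observations: placeholder 10-dimensional state vector.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)
        self.model = None

    def reset(self):
        self.model = load_fmu(self.fmu_path)     # reload the FMU for a fresh episode
        self.model.setup_experiment(start_time=0.0)   # initialization details vary by FMU
        self.model.initialize()
        self.t = 0.0
        return self._observe()

    def step(self, action):
        # Hypothetical FMU input names; a real model exposes its own variables.
        self.model.set("SATsp", float(action[0]))
        self.model.set("SAFsp", float(action[1]))
        self.model.set("battery_sp", float(action[2]))
        self.model.do_step(current_t=self.t, step_size=self.step_s)
        self.t += self.step_s
        obs = self._observe()
        reward = self._reward(obs)               # e.g., Eq. (17) in Section 4.2.2
        done = self.t >= 365 * 24 * 3600.0       # one simulated year per episode
        return obs, reward, done, {}

    def _observe(self):
        names = ["zone_T", "net_power"]          # hypothetical output names
        vals = [float(self.model.get(n)[0]) for n in names]
        return np.array(vals + [0.0] * (10 - len(vals)), dtype=np.float32)

    def _reward(self, obs):
        return 0.0                               # placeholder; see Section 4.2.2
```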
4.1.2. Model calibration

The simulation model was calibrated with experimental data gathered during a calibration test performed before the experiment. We first calibrated the envelope model by comparing the indoor temperatures of the physical building to the indoor temperatures reported by the simulated model. To accurately measure the stratification and nonuniform distribution of temperatures, we deployed four sensor trees with seven temperature sensors in Cell A and Cell B. In a detailed physics-based model with hundreds of parameters, such as the one developed in EnergyPlus used in this paper, different combinations of variables can produce the same results, therefore their calibration is challenging. To select the best parameters needed for calibration, we followed best practices in the literature. Zhang et al. (2019) [62] used the sensitivity analysis proposed by Morris (1991) [63] to identify four key parameters that significantly influence the simulation errors and need to be calibrated. The identified parameters include (1) the thermal insulation, (2) the total area of radiant heating/cooling surfaces, (3) the internal thermal mass multiplier, which is defined as the ratio of the total interior thermal mass to the thermal capacitance of the air in the volume of the specified zone, and (4) the infiltration rate, i.e., air changes per hour. We tuned the three parameters found to be the most effective in Zhang et al. (2019) [62] and relevant to our experiment settings (thermal insulation, internal thermal mass multiplier, and infiltration rate) to minimize the difference between the measured and simulated temperature. We measured the indoor temperature at 1-minute intervals and up-sampled it to 15 min, to match the simulation time step. We calculated the Coefficient of Variation of the Root Mean Square Error (CV-RMSE) between the 15-min interval measured and simulated temperatures. By adjusting the internal thermal mass multiplier and infiltration rate, we achieved a CV-RMSE of less than 3%, with an internal thermal mass multiplier of 6 and an infiltration rate of 2. We summarized the improvements of CV-RMSE during the model calibration process in Table 4.

Table 4
Summary of calibration measures and CV-RMSE.
Cell | Calibration measures | CV-RMSE
Cell A | Raw model, developed from design document | 11.2%
Cell A | Tuning thermal insulation | 7.6%
Cell A | Tuning thermal insulation & infiltration rate | 4.0%
Cell A | Tuning thermal insulation, infiltration rate & thermal mass multiplier | 2.8%
Cell B | Raw model, developed from design document | 10.9%
Cell B | Tuning thermal insulation | 6.3%
Cell B | Tuning thermal insulation & infiltration rate | 2.9%
Cell B | Tuning thermal insulation, infiltration rate & thermal mass multiplier | 2.1%

In addition, we calibrated the power consumption of the fan, which is another major energy consumer in air-based systems. We used a cubic polynomial curve to model the fan energy behavior and achieved a CV-RMSE of 7% for the fan in Cell A and 8% for the fan in Cell B when comparing the air flow values recorded at 15-min intervals. We then used the regressed coefficients to develop the fan model. Since the chiller and boiler serve both the baseline and the DRL cell, and they use a secondary loop with storage, it was impossible to calibrate the power curve of each cell independently, and default values were used. In this study, we set the COP to 2.5.
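For reference, the CV-RMSE used above can be computed directly from the paired 15-minute series; the short sketch below implements the standard definition (RMSE divided by the mean of the measured values, expressed in %), with made-up sample values in the comment.

```python
import numpy as np

def cv_rmse(measured, simulated):
    """Coefficient of Variation of the RMSE, in %: RMSE / mean(measured) * 100."""
    measured = np.asarray(measured, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    rmse = np.sqrt(np.mean((measured - simulated) ** 2))
    return 100.0 * rmse / np.mean(measured)

# Example with made-up 15-min samples (not data from the experiment):
# cv_rmse([22.1, 22.4, 23.0], [22.0, 22.6, 22.8])
```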
4.2. Design of the DRL algorithm

4.2.1. State and actions

The variables of the state vector s_t were selected as: time of the day, outdoor temperature, outdoor relative humidity, solar irradiation, zone temperature, net power, net energy, battery state of charge, PV generation power, and an electricity price look-ahead. The DRL controller ran every 15 min and generated three actions (i.e., three setpoints): the supply air temperature of the AHU, the supply air flow rate of the AHU, and the charge/discharge rate of the battery. These actions were mapped to model variables in the simulation, as well as to actual controllable points in FLEXLAB (respectively SATsp, SAFsp, and batterysp).

4.2.2. Reward function

The goal of the DRL control algorithm was to find an optimal policy that minimizes the cost of the electricity purchased from the grid while maintaining a good level of thermal comfort. In other words, the objective was to efficiently consume the electricity locally produced by the PV panels and manage the battery while maintaining the indoor zone temperature within a desired range. The considered reward function was composed of three parts: (1) the penalty due to the energy consumption E_cost (replaced by the energy cost in the Shed and Shift scenarios), (2) the penalty due to the violation of the temperature comfort zone C, and (3) the penalty due to the violation of the battery physical limits λ. This reward function is defined as:

r_t = −α E_cost(a_t, s_{t+1}) − β C(a_t, s_{t+1}) − λ(a_t, s_{t+1})    (17)


Fig. 4. HVAC system schematics in the test facility and EnergyPlus model.

with

C(a_t, s_{t+1}) = 1 − exp(−0.5 (T_{t+1} − T_m)²) + 0.2 ( [T_{t+1} − T_U]_+ + [T_L − T_{t+1}]_+ )    (18)

and

λ(a_t, s_{t+1}) =
    −20, if E_b(t+1) < 0 and |E_b(t+1)| > SOC_min
    −20, if E_b(t+1) > 0 and (E_b(t+1) + SOC(t)) > SOC_max
    0, otherwise    (19)

where T_{t+1} is the zone temperature at time t+1 (i.e., after action a_t was applied), T_m is the average desired temperature, and T_L and T_U are respectively the desired lower and upper bounds for the zone temperature. E_b(t+1) is the total energy that the battery has been charged (E_b(t+1) > 0) or discharged (E_b(t+1) < 0) between t and t+1. SOC(t) is the state of charge (SOC) of the battery at time t, and SOC_min and SOC_max are respectively the minimum and maximum battery state of charge that is allowed. To ensure that the equation is homogeneous, the units of α, β, and λ are set accordingly.

Using an approach introduced in [38], the penalty due to the violation of the temperature comfort zone C was defined as shown in Fig. 5. Within the red bars (the temperature comfort boundaries), the reward is shaped like a bell curve that has a maximum reward (i.e., equal to 0) when T_t is equal to T_m, and rapidly decreases as the temperature moves away from T_m. Outside the red band we imposed a linear decay (i.e., trapezoid shape) of the reward. We found that adding the bell curve component to the thermal penalty, as shown in Fig. 5, significantly improved the stability of the control algorithm, especially when the cost of energy was equal to 0 (i.e., during the periods when all the load was covered by the PV generation).

Fig. 5. Thermal comfort penalty function. The vertical axis corresponds to the penalty due to the violation of the temperature comfort zone C and the horizontal axis corresponds to the zone temperature in degrees Celsius.

The reward component that penalizes the violation of the battery physical limits, λ(a_t, s_{t+1}), was designed to keep the battery from making charging/discharging decisions that violate the system's physical constraints. This is achieved by adding a high penalty on actions that violate the constraints. This approach allows the actor to easily limit the space of actions where the optimal value is searched. However, our tests showed that it is important to select a proper penalty value (i.e., −20 in this work), because if the penalty is too high the actor may avoid any control action on the battery (i.e., the controller converges to a local maximum), because it considers these actions too risky (i.e., the cumulative reward can be too negative).
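A direct transcription of Eqs. (17)–(19) into code is sketched below; the weights α and β, the comfort bounds, and the state-of-charge limits are placeholders chosen for illustration (with SOC quantities expressed as fractions of battery capacity), not the tuned values used in the experiments.

```python
import math

# Illustrative constants (assumptions, not the paper's tuned values).
ALPHA, BETA = 1.0, 1.0            # weights on energy cost and comfort penalty
T_L, T_U = 21.0, 24.0             # comfort band (deg C)
T_M = 0.5 * (T_L + T_U)           # average desired temperature
SOC_MIN, SOC_MAX = 0.10, 0.90     # allowed SOC range, as fractions of capacity

def comfort_penalty(t_zone):
    """Eq. (18): bell-shaped penalty inside the band, linear decay outside."""
    bell = 1.0 - math.exp(-0.5 * (t_zone - T_M) ** 2)
    linear = 0.2 * (max(0.0, t_zone - T_U) + max(0.0, T_L - t_zone))
    return bell + linear

def battery_penalty(e_batt, soc):
    """Eq. (19) conditions, returned as a positive magnitude that is subtracted
    in reward(), so a violation contributes -20 to r_t."""
    if e_batt < 0 and abs(e_batt) > SOC_MIN:
        return 20.0
    if e_batt > 0 and (e_batt + soc) > SOC_MAX:
        return 20.0
    return 0.0

def reward(energy_cost, t_zone, e_batt, soc):
    """Eq. (17): r_t = -alpha * E_cost - beta * C - lambda."""
    return (-ALPHA * energy_cost
            - BETA * comfort_penalty(t_zone)
            - battery_penalty(e_batt, soc))
```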


4.2.3. DRL algorithm hyperparameter tuning Table 5


Metrics: DRL vs. baseline controller.
The simulation framework developed was used to tune several
hyperparameters of the DRL controller, to achieve a good control Test Metric Value

policy: EE % Energy savings 0.6%


% Cost savings 0.6%
• mini-batch size: number of transitions (i.e., (𝑠𝑡 , 𝑎𝑡 , 𝑟𝑡 , 𝑠𝑡+1 )) sam- Shed Differential shed during high price 0.25 W/sq.ft. (50%)
ples used by the gradient-based optimizer. % Cost savings 29.4%
• actor and critic learning rate: optimization parameter that con- Differential shed during high price −0.34 W/sq.ft. (−97%)
trols how strongly the network weights are updated while moving Shift Differential take during low + medium price 1.1 W/sq.ft. (49%)
toward the minimum of the loss functions. % Cost savings 39.6%
• neurons number: the size of the two layers that constitute each
network. Table 6
• discount factor 𝛾: parameter used in critic loss function to con- ZAT Overshoot % and parameters for DRL vs. Baseline controlled cells.
trol up to what extent future rewards have an impact on the Controller % of time intervals During the overshoot
return at time 𝑡. In other words, how many future time steps the outside comfort band period (above 24 ◦ C)
agent is considering to select an action. and violation rate

• OU parameters: 𝜃 the reversion rate and 𝜎 the standard devia- 𝜇𝜈 (◦ C) 𝜎𝜈 (◦ C) 𝜁𝜈 per day
(◦ C-h/day)
tion, which are two parameters of the OU process that will control
level of the state space exploration. EE baseline 4 0.3 0.21 0.33
EE DRL 13 0.3 0.20 0.99
• reward function weights: weights that control the importance
Shed baseline 25 1.0 0.98 4.92
of electricity cost vs. the thermal comfort. Shed DRL 15 1.3 0.81 3.68
• electricity price look ahead: the number of following hours of Shift baseline 15 0.7 0.53 2.48
electricity price. Shift DRL 14 0.3 0.29 1.27

The performance of the DRL policy depends significantly on these parameters. A manual selection approach is extremely time consuming, requires significant DRL expertise, and the interaction of the hyperparameters can be counterintuitive. Thus, we adopted a random search grid [64] approach to tune the parameters. This method randomly selects combinations of hyper-parameters from a uniformly distributed search space. To limit the computational cost, we fixed the number of combinations selected to 25 for each test batch. The hyper-parameters that obtained the highest reward, estimated using the data from the test year (see next section), were selected as the best parameters. A common problem with DRL approaches is the high variability of the resulting policies when different random seeds are used for the same combination of hyper-parameters. To deal with this challenge, seven additional policies using a different random seed were trained for each best tuned combination (one for each test batch). The policies that were deployed in FLEXLAB were selected from among those eight policies as the ones that produced the highest reward using the test year. Note that a more robust approach would have been to include the random seeds as hyper-parameters to be tuned and to test all the considered seeds for each hyper-parameter combination. However, this approach is very costly in terms of computational complexity, therefore it was not pursued in this work.
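The tuning procedure described above can be sketched as follows: draw 25 hyper-parameter combinations uniformly at random, train a policy for each, keep the combination with the highest test-year reward, then retrain that combination with additional random seeds and deploy the best-scoring seed. The search ranges below and the train_and_evaluate callable are placeholders for the simulation-based training loop, not values or an API from the paper.

```python
import random

# Illustrative search space; the ranges are assumptions, not the paper's exact bounds.
SEARCH_SPACE = {
    "batch_size": [64, 128, 256],
    "actor_lr": [1e-5, 5e-5, 1e-4],
    "critic_lr": [5e-5, 1e-4, 5e-4],
    "gamma": [0.95, 0.98, 0.99],
    "layer_sizes": [(350, 350), (400, 400), (450, 400)],
    "ou_theta": [0.15, 0.2, 0.3],
    "ou_sigma": [0.2, 0.3, 0.4],
    "beta_comfort": [1, 3, 5],
    "price_lookahead_h": [1, 2, 3],
}

def random_search(train_and_evaluate, n_trials=25, n_extra_seeds=7, rng=None):
    """train_and_evaluate(params, seed) is a placeholder for the simulation-based
    training loop; it must return the policy's cumulative reward on the test year."""
    rng = rng or random.Random(0)

    # Step 1: sample hyper-parameter combinations uniformly and keep the best one.
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        trials.append((train_and_evaluate(params, seed=0), params))
    best_reward, best_params = max(trials, key=lambda t: t[0])

    # Step 2: retrain the best combination with additional random seeds and keep
    # the seed whose policy scores highest on the test year for field deployment.
    best_seed_reward, best_seed = max(
        (train_and_evaluate(best_params, seed=s), s) for s in range(n_extra_seeds + 1))
    return best_params, best_seed, best_seed_reward
```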
4.2.4. DRL controller training

Given the availability of five years of local weather data, four years were used to train the DRL controller, using the simulation framework described in Section 4, and one year was used for validation. For each test scenario (i.e., EE, Shed, and Shift) we trained DRL models separately using the corresponding price signal. Since the state vector consisted of various physical features that have different units with significant differences in value ranges, we scaled the state observations to the range [−1, 1]. Adam was used as the gradient-based optimizer [65]. The replay buffer size was fixed to 1.5 million transitions. Both the actor and critic networks had the same size and had two layers. The energy cost reward function weight 𝛼 was fixed to 1. For all three scenarios the selected hyperparameters were: batch size = 128; weight of thermal comfort 𝛽 = 3; actor learning rate = 0.00005; critic learning rate = 0.0001; discount factor 𝛾 = 0.98; and OU 𝜎 parameter = 0.3. Additionally, the first and second network layer sizes were respectively: 450 and 400 for EE; 400 and 400 for Shed; and 400 and 350 for Shift. The OU 𝜃 parameter was 0.2 for EE and Shed, and 0.15 for Shift. The electricity price look ahead was two hours for Shift and three hours for Shed. Note that we defined each episode as one calendar year, thus one episode contains 35,040 time steps. The Lawrencium computational cluster (http://scs.lbl.gov/home) was used for training the controllers; more specifically, we used nodes with an Intel Xeon Gold 5218 central processing unit (32 cores). Each training experiment (i.e., a specific hyperparameter set) took about 3 days to converge.
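For reference, the action exploration noise in DDPG is commonly generated with an Ornstein–Uhlenbeck (OU) process parameterized by the reversion rate 𝜃 and standard deviation 𝜎 listed above. The sketch below shows a standard discrete-time form of that process together with min–max scaling of the state to [−1, 1]; the time step and the state bounds are illustrative assumptions, not values from the paper.

```python
import numpy as np

class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process, a common choice for DDPG action
    exploration. theta is the reversion rate and sigma the noise scale."""
    def __init__(self, size, theta=0.2, sigma=0.3, mu=0.0, dt=1.0, seed=None):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(size, mu, dtype=float)

    def sample(self):
        # x_{t+1} = x_t + theta*(mu - x_t)*dt + sigma*sqrt(dt)*N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x

def scale_state(obs, obs_min, obs_max):
    """Min-max scale each state feature to [-1, 1]; the bounds would be taken from
    the training data (illustrative, the paper does not give the exact bounds)."""
    return 2.0 * (obs - obs_min) / (obs_max - obs_min) - 1.0
```

During training, the sampled noise is typically added to the actor's output before the action is clipped to its allowed range.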
5. Results

The performance of the controllers was evaluated using the following metrics: reduction in cost (all tests), differential load shed during peak price time (Shed and Shift tests) or differential load take during low and medium price periods (Shift test), and thermal comfort violations (all tests). Equations for these metrics are detailed in Section 3.2. The DRL algorithm reduced energy costs across all three tests as compared to the baseline control strategy, and the results are summarized in Table 5. These costs are based on the tariffs described in Section 3.1. In the Shed test, the DRL shed 0.25 W/sq.ft. (50%) more than the baseline controller during the high price period. However, during the Shift test, the DRL algorithm shed 0.34 W/sq.ft. (97%) less than the baseline strategy during the high price time and drew more power, 1.1 W/sq.ft. (49%) more than the baseline, during the remaining day, while still reducing cost. This behavior was contrary to what we expected and is discussed in detail in Section 6.3.

Table 5
Metrics: DRL vs. baseline controller.
Test    Metric                                        Value
EE      % Energy savings                              0.6%
        % Cost savings                                0.6%
Shed    Differential shed during high price           0.25 W/sq.ft. (50%)
        % Cost savings                                29.4%
Shift   Differential shed during high price           −0.34 W/sq.ft. (−97%)
        Differential take during low + medium price   1.1 W/sq.ft. (49%)
        % Cost savings                                39.6%

In terms of thermal comfort, neither the DRL nor the baseline control strategy was able to maintain thermal comfort in the cells at all times. Due to high internal and external loads and limited HVAC capacity, the temperature exceeded the upper threshold (i.e., 24 °C) several times during each test. During the EE test, the overshoot during the occupied hours (8 am–6 pm), averaged over the test days, was higher for the DRL (0.99 °C-h) than the baseline (0.33 °C-h), thus pointing to a poorer performance by the DRL controller. However, for both Shift and Shed tests, the DRL controller performed significantly better than the baseline controller, as summarized in Table 6.

Table 6
ZAT overshoot % and parameters for DRL vs. baseline controlled cells.
Controller       % of time intervals outside comfort band   𝜇𝜈 (°C)   𝜎𝜈 (°C)   𝜁𝜈 per day (°C-h/day)
EE baseline      4                                          0.3       0.21      0.33
EE DRL           13                                         0.3       0.20      0.99
Shed baseline    25                                         1.0       0.98      4.92
Shed DRL         15                                         1.3       0.81      3.68
Shift baseline   15                                         0.7       0.53      2.48
Shift DRL        14                                         0.3       0.29      1.27
(𝜇𝜈, 𝜎𝜈 and the violation rate 𝜁𝜈 are computed over the overshoot periods above 24 °C.)
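As a reading aid for the 𝜁𝜈 values in Tables 6 and 9, the °C-h/day violation metric can be computed from the zone air temperature time series roughly as sketched below: degree-hours above the 24 °C upper threshold, averaged over the test days. The exact definitions used in the paper are those of Section 3.2; the 15-minute interval length is an assumption consistent with the 35,040-steps-per-year episode definition above.

```python
import numpy as np

def violation_degree_hours_per_day(zat_c, threshold_c=24.0, interval_h=0.25, n_days=1):
    """Approximate comfort-violation metric: degree-hours above the upper comfort
    threshold, averaged per test day. zat_c is the zone air temperature series (degC)
    sampled at fixed intervals of length interval_h hours."""
    zat = np.asarray(zat_c, dtype=float)
    overshoot = np.clip(zat - threshold_c, 0.0, None)   # degC above threshold, else 0
    return overshoot.sum() * interval_h / n_days        # degC-hours per day

# Example: two hours at 25 degC with 15-minute sampling -> 2.0 degC-h/day
print(violation_degree_hours_per_day([25.0] * 8, n_days=1))
```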


Table 7
Shed test comparison of the DRL vs. baseline controllers' demand and total cost of energy consumed during different price periods.
Metric                          Baseline    DRL
Mean demand in low price        594 W       655 W
Mean demand in high price       242 W       121 W
Mean daily energy consumption   12.5 kWh    13.1 kWh
Cost of energy consumed         $3.13       $2.46

Table 8
Shift test comparison of the DRL vs. baseline controllers' load demand and total cost during different price periods.
Metric                                          Baseline     DRL
Mean of demand in low price                     870 W        412 W
Mean of demand in medium price                  207 W        129 W
Mean of demand in high price                    170 W        334 W
Mean of daily energy use across all test days   14.21 kWh    8.21 kWh
Cost of energy consumed                         $1.62        $1.03

5.1. Energy efficiency test

Fig. 6 shows the details of the EE test. As the cost of energy is fixed throughout the day, the expectation on DRL during the EE test is to minimize the energy consumption. In our experimental set up, this would mean that the DRL-controlled cell should either consume less energy than the baseline cell while providing similar thermal comfort, or consume a similar level of energy while providing better thermal comfort. While the aggregated energy consumption and cost of the two controllers were similar, the baseline controller used more energy for heating (Fig. 6c), while the DRL controller used more energy for cooling, especially during the days with mild outdoor temperatures (Fig. 6b). This also impacted the internal temperature profile (Fig. 6d). The baseline controller kept consistently lower temperatures in the test cell, but that strategy caused more heating in the morning, after the room was unoccupied and exposed to relatively cold outdoor temperatures at night (between 13 °C and 19 °C). The temperature violations, all above the comfort band, were slightly higher in the DRL cell (Table 6). The DRL controller seemed unable to reduce the indoor temperature fast enough at the peak of internal and external heat gains. The batteries were not significantly utilized by either controller, remaining close to their minimum charging level for most of the time (Fig. 6a). Since the cost of electricity was constant in the EE test, the energy and cost performance were aligned.

5.2. Shed test

The objectives of the DRL controller are to minimize both energy costs and thermal discomfort. The anticipated behavior of the DRL controller during the high price period (4pm–9pm) of the Shed test was to reduce consumption by relaxing the comfort band. While there was no expectation to consume more energy at any other time, we also expected the controller to pre-cool the zone before this period, since this strategy would allow the HVAC system to idle during the high price period, leading to cost savings and a reduction in thermal discomfort. Additionally, we expected the battery to be charged as much as possible before the high price period, to be available for discharge when the cost of electricity is higher. Fig. 7 shows that DRL reduced cost significantly during the Shed test, while it reduced energy consumption only marginally. As noted earlier in Table 5, the differential load shed during the high price time is 0.25 W/sq.ft. In Fig. 7, each set of bars represents a day during the testing period, and their height measures the total energy consumption for that day. The lighter color of each bar depicts the amount of energy consumed during off-peak hours, while the darker color represents energy consumption during peak hours. Even though on some days the total energy consumed by the DRL controller is higher than the baseline, the energy consumed during high price periods is significantly lower (50%). This behavior leads to a large reduction in overall cost (29%). The demand and cost values are summarized in Table 7. It is important to note that the baseline strategy is also shedding load using a rule-based algorithm, therefore the difference between DRL and the baseline can be read as extra-shed/differential shed.

Fig. 8 shows the detailed experimental results for the Shed experiment. The vertical gray bars in each panel represent the period with peak prices. Both controllers discharged the batteries during the peak price period, though the DRL controller discharged the battery more rapidly and by a larger amount. On average, at the end of the high price period, the DRL battery was left with less than 20% of its maximum charge, while the baseline battery was still at 50% (Fig. 8a). Further analysis revealed that the battery operation was largely responsible for the cost reduction of the DRL algorithm compared to the baseline.

Both controllers managed to reduce the chilled water consumption to zero for the duration of the peak prices, but the DRL engaged in more pre-cooling between 2pm–4pm (12.4 kWh for the whole test) before the onset of the peak price period compared to the baseline (8 kWh) (Fig. 8b). The baseline controller also produced very sharp heating spikes, which occurred all but one day around 7am, with different magnitude (Fig. 8c). In comparison, the DRL controller produced smaller morning heating spikes but caused some unexpected heating to happen during the interval corresponding to the peak price (Fig. 8c). This unexpected phenomenon is explored further in the Discussion in Section 6.2. Overall, both controllers behaved poorly when it came to maintaining thermal comfort in the building during the peak price period (Fig. 8d), despite the relatively mild outdoor conditions (temperature between 10.5 °C and 25.5 °C). The degree hours of temperature violations averaged per day were much higher for the baseline (≈4.92 °C-h) than for the DRL controller (≈3.68 °C-h). Further, when no heating was supplied by the DRL controller during the peak price period (on August 5, Fig. 8d), the indoor temperature was kept within the comfort band, suggesting that eliminating the unexpected heating during the Shed window would also improve comfort in the DRL cell. Active pre-cooling by DRL before the onset of the high price signal manifested as a sharp dip in the cell temperature right before the high price period.

An interesting behavior of the DRL controller emerged during the second day of the test (August 5). Since the day was foggy and PV generation was low, the DRL controller decided to charge the battery more than average, just before the Shed event. Hence, the additional availability of battery power allowed the DRL controller to reduce grid purchases during the high price window. In contrast, the baseline did not foresee the issue and did not engage in any additional charging. Also, from Fig. 7, we can see that the DRL controller consumed less energy than the baseline cell on August 9. This can be attributed to the higher outdoor air temperatures and the strategy pursued by each controller. Both cells required a significant amount of cooling in the pre-event period, however the DRL compromised more on comfort and saved more energy compared to the baseline cell, relative to what happened during the other days. On these two days (08/05 and 08/09) the DRL controller was able to make better decisions than a rule-based algorithm, consistently reducing energy costs as well as power throughout the day.

5.3. Shift test

In the Shift experiment, the DRL controller was expected to increase the energy consumption and store thermal and electrical energy during the low (preferred) and medium price periods via pre-cooling and battery charging. Table 5 shows that, in the Shift experiment, the DRL controller reduced both cost and energy consumption compared to the baseline strategy, as anticipated. Table 8 summarizes the mean daily cost and average demand by the two controllers for different price periods. However, contrary to the expected behavior, the DRL controller consumed more energy during the peak period (i.e., it shifted less energy than the baseline strategy) but still reduced overall energy cost, because it significantly reduced the energy consumption during the lower price periods.


Fig. 6. EE test: (a) PV generation (gold), SOC of the battery in the baseline cell (light green) and SOC of the battery in the DRL cell (dark green); (b) chilled water power in the
baseline cell (light blue) and the DRL cell (dark blue), dry bulb outdoor air temperature (yellow, right axis); (c) hot water power in the baseline cell (light red) and the DRL cell
(dark red); (d) indoor dry bulb temperature in the baseline cell (light purple) and the DRL cell (dark purple).

Fig. 7. Daily energy consumption for different prices for the baseline and DRL controller for the SHED control test.

This result is interesting because the DRL algorithm achieved its objective of reducing cost, but it failed the implicit objective of shifting energy out of the peak period. These findings are further discussed in Section 6.3.

Fig. 9 shows details about the experiment. The vertical gray bars in each panel represent the mid peak (half gray bar) and high peak (full gray bar) periods. Both controllers discharged the batteries during the mid peak and peak periods (Fig. 9a), but the baseline controller discharged the battery by a larger amount. This is the first reason that explains the reduced shift in energy by the DRL controller, described above. The baseline controller used more chilled water and produced higher chilled water peaks than the DRL controller (Fig. 9b). On average, the DRL controller started cooling earlier, compared to the baseline, and continued cooling throughout the high peak period, causing an increase of demand during this period, but saving total energy and cost (Table 8). This is the second effect that limited the amount of energy shifted by the DRL algorithm. As it pertains to heating, the baseline controller created sharp spikes every early morning (Fig. 9c). In comparison, the DRL algorithm never used heating during the testing period. The DRL controller also provided better thermal comfort compared to the baseline (Fig. 9d), with 3.2 °C-h of temperature violations per day compared to 4.3 °C-h for the baseline. In addition, most of the comfort violations happened during the high price periods for the baseline cell, while they happened outside of that period for the DRL controller (Fig. 9d).


Fig. 8. Shed test: (a) PV generation (gold), SOC of the battery in the baseline cell (light green), and SOC of the battery in the DRL cell (dark green); (b) chilled water power in
the baseline cell (light blue) and the DRL cell (dark blue), dry bulb outdoor air temperature (yellow, right axis); (c) hot water power in the baseline cell (light red) and the DRL
cell (dark red); (d) indoor dry bulb temperature in the baseline cell (light purple) and the DRL cell (dark purple).

Overlaying the comfort and chilled water profiles, it is evident that DRL engaged in cooling during high price periods to maintain thermal comfort, in contrast to the baseline controller. This behavior suggests that DRL places a higher value on thermal comfort than on cost savings during this period.

6. Discussion

6.1. Instability and variance of trained policies

Variation in trained policies across different random seeds is a common problem in DRL. Policies trained with different random seeds may generate a significant gap in performance and produce unstable results [66]. To investigate this variability, the performances of eight DRL agents, each trained with the same combination of hyper-parameters but with eight different seeds, were evaluated for each experiment. As described in Sections 4.2.3 and 4.2.4, the hyper-parameters were identified using the random search grid approach. For the Shift and Shed experiments, even out of these eight, only four converged to a realistic solution. While this in itself provides evidence of the significant impact of the random seeds, all the other controllers that had converged during training were evaluated further. The controllers were deployed in a simulated model of the building for (1) the duration of the actual FLEXLAB tests and (2) a whole year.

Fig. 10 shows the net energy consumption from the grid for the duration of the three experiments (EE, Shift, and Shed). While the green dots represent the actual energy consumed during the field experiment in FLEXLAB, the red dots are the energy consumption values estimated in the simulation environment using the same controller that was deployed in FLEXLAB (the best performer identified in the tuning process). Finally, the black dots represent the simulated energy consumption values using the policies that were generated using different random seeds during the training process. The actual energy consumption (in green) is significantly different from the energy consumption of the simulated model used for training the DRL controller (in red). Additionally, the random seeds seem to have a much higher impact in the EE experiment than in the Shift and Shed experiments. Our hypothesis for this effect is that it is likely due to the fixed price of electricity in the EE test, which makes the algorithm converge more easily to some local optimal solution. In the EE test the difference between the best model (i.e., the policy selected for the testbed) and the worst model is significant (≈400 kWh). However, the small variability of the results for Shift and Shed may be misleading, as it is important to remember that only half of the experiments had converged. Fig. 11 shows the annual net energy purchased from the grid using the same controllers identified above, performing the EE, Shed, and Shift experiments respectively. While the differences in values are not as prominent as in Fig. 10, it is clear that the seed has a significant impact on the actions taken by a DRL controller.

6.2. Mismatch between modeled and actual control behavior

When the trained DRL controller was deployed in the building, the actions generated by the controller differed from those taken during the simulation. Some of these differences occurred due to using the wrong sensors, issues with mismatched units of measurement, or other factors, and they were able to be diagnosed and fixed immediately.


Fig. 9. Shift test: (a) PV generation (gold), SOC of the battery in the baseline cell (light green), and SOC of the battery in the DRL cell (dark green); (b) chilled water power in
the baseline cell (light blue) and the DRL cell (dark blue), dry bulb outdoor air temperature (yellow, right axis); (c) hot water power in the baseline cell (light red) and the DRL
cell (dark red); (d) indoor dry bulb temperature in the baseline cell (light purple) and the DRL cell (dark purple).

Fig. 10. Net energy consumption from the grid for simulated and test models.

However, there were more variations in the actions that produced significant differences in the relevant metrics (e.g., energy consumption, costs incurred), and upon further inspection it was determined that these happened due to inconsistencies between the simulated building energy model and the actual building. Such variations in actions were not present when the trained DRL controller was being trained and tested in the simulation environment, and hence they were noticed only when it was deployed in the actual building. For example, the power consumption pattern of the supply air fan during cooling and the actual charge/discharge rate of the battery given a certain setpoint proved to be wrongly represented in the simulated building model. Once this issue was identified, the supply air fan was better characterized and the energy model was updated in the simulation (this, of course, meant retraining the DRL controller). However, characterizing the stochastic battery charge/discharge behavior with respect to a given setpoint turned out to be quite challenging, and the DRL controller was used with this particular flaw built in. A more detailed evaluation of the impacts of these embedded inconsistencies has been presented in [67].

Similar inconsistencies that were overlooked during the model calibration also affected the results of our experiments. For example, in the EE test, we noticed that the DRL controller was unnecessarily heating the cell during early morning hours.


Fig. 11. Energy purchased from the grid for different simulated models for the entire year.

Similarly, during the Shed test, while the DRL controller performed well overall, the actions it generated produced diminished savings as the AHU started heating the zone during the high price period, reducing thermal comfort and increasing energy costs (Section 5.2). Upon further investigation (conducted after the testing), it was hypothesized that the modeling inconsistencies could be the most likely reason for this abnormal behavior. To gather evidence supporting this train of thought, the trained DRL agent was used to generate actions for the FLEXLAB simulated model used for training, subjected to the same external conditions (i.e., weather, occupancy, internal loads) of the actual Shed test. During the high price period, the agent generated similar supply air temperature and supply air flow rate setpoints as it did in the actual FLEXLAB test. However, the heating power required in the simulation, close to zero, was significantly lower than what was actually required in the real test. In other words, there was an error in mapping the setpoints to the actual heating supplied by the equipment, probably due to simplifications in modeling the actual control sequences implemented in the underlying FLEXLAB infrastructure. The DRL agent expected the HVAC system to idle during the Shed period with the setpoints it generated, but that action, instead, caused significant heating to happen. This inconsistency in the heating energy model that the DRL agent was trained on negatively impacted the DRL controller's performance. The mismatch between HVAC and control operation in simulation models and real systems is common, therefore it is advisable to carefully map and calibrate the simulation against the actual behavior of the control system. In our case, we realized that the set of conditions we calibrated the building on did not adequately cover what was subsequently experienced during the tests. However, such calibration issues have major implications for the practical scalability of DRL-based controllers.
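The diagnostic replay described above amounts to feeding the trained actor the observations produced by the calibrated simulator under the logged test-period boundary conditions and comparing the resulting heating power against the field measurements. The sketch below outlines that procedure; the env object and its reset/step interface are hypothetical placeholders for the FMU-based simulation framework, not a published API.

```python
import numpy as np

def replay_policy_in_simulation(actor, env, logged_conditions, measured_heating_w):
    """Replay a trained (deterministic) actor in the training simulator under the
    boundary conditions logged during the field test, and compare the simulated
    heating power with the measured one. `env` is assumed to expose reset/step
    methods driven by the logged weather, occupancy and internal loads."""
    obs = env.reset(boundary_conditions=logged_conditions)
    simulated_heating_w = []
    done = False
    while not done:
        action = actor(obs)                  # no exploration noise during replay
        obs, _, done, info = env.step(action)
        simulated_heating_w.append(info["heating_power_w"])

    sim = np.array(simulated_heating_w)
    meas = np.asarray(measured_heating_w[: len(sim)], dtype=float)
    gap = meas - sim                         # positive where the real building heated more
    return sim, gap
```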
6.3. Relative weight of cost and comfort and magnitude of price signal

During the Shift test, the DRL controller reduced the HVAC load significantly during the medium price period, which resulted in additional cooling (albeit for short periods) in the high price period, to reduce the thermal comfort penalty. This counterintuitive behavior was the result of the specific combination of the weights penalizing cost and comfort violations defined in the reward function (Section 4.2.2). The behavior was also indirectly affected by the prices selected for each period, since they determined cost. The DRL controller was successful in minimizing the total energy costs ($1.03 vs. $1.62 for the baseline controller) and obtained better comfort; however, it did not shift energy from the high price period to the other two periods, compared to the baseline Shed strategy. Since its behavior was driven by the magnitude of the prices, it is clear that the absolute value of the high price signal was not high enough to trigger the expected response, resulting in extra cooling load during the peak hours. As some utilities are moving toward dynamic pricing, this example highlights the challenges in designing price schemes that produce the desired grid response in a target building, particularly when the building employs algorithms that are unknown or unpredictable to the grid. In our example the DRL algorithm was successful in reducing cost for the building owner, but did not shift energy compared to the rule-based system.

To understand the impact of the weight on thermal comfort, we conducted a test with a higher penalty placed on thermal comfort violation, using the Shed test pricing. Results of this test are presented in Table 9. On this day, the DRL-controlled cell saw a lower ZAT overshoot (1.88 °C-h) as compared to the 3.68 °C-h overshoot during regular Shed test days. However, this caused an increase in the cost of energy purchased ($2.65) compared to the cost of energy during regular Shed days ($2.46). The experiment suggests that different trade-offs between comfort and cost can be achieved by tweaking the hyperparameters of the model.

Table 9
Comparison of metrics between the Shed test and the Shed test with a higher penalty on thermal comfort.
Test                                      𝜇𝜈     𝜎𝜈     % of overshoot   𝜁𝜈 per day   Cost of daily energy
DRL (regular shed)                        1.3    0.81   15               3.68         $2.46
DRL (more emphasis on thermal comfort)    1.25   0.58   6                1.88         $2.65

6.4. Challenges and future work for deploying DRL-based controllers in buildings

The recent advances in DRL algorithms offer a unique opportunity to optimize the operation of complex building systems. However, there are still several challenges in developing, deploying, evaluating, and scaling DRL control algorithms in building systems.

First, training DRL controllers is a difficult task. Training a DRL algorithm by directly interacting with real building systems may cause comfort issues and equipment damage, and may take an unreasonable amount of time [68]. At the same time, a simulation model that adequately approximates the system is time consuming to develop, difficult to calibrate, and not scalable, and simpler models, such as the gray-box models used in MPC [69], may be unable to fully capture the complex dynamics of real building systems. Thus, it is essential to develop DRL methods that require less training data and that use real historical data, to ensure the scalability of this approach. Additional studies should investigate the minimum amount of data that needs to be collected to train such algorithms before deployment. Future research should also look into using transfer learning, imitation learning and online learning techniques to address this challenge.

Second, as discussed in Section 6.1, the high variability of trained policies with different random seeds for the same combination of hyper-parameters is a common problem for DRL algorithms.


This issue is common when training neural networks, and it is particularly relevant when dealing with incomplete observations of the states of a complex system or with reward functions that are hard to optimize. Future work should develop more stable DRL algorithms that improve the robustness to hyper-parameter tuning, as well as investigate the effectiveness of the random search grid as an approach for auto-tuning DRL parameters. In addition, including the random seed in the tuning process can enhance the selection of optimal policies.

Third, a proper definition of the reward function is key to successfully finding a good control policy. Inadequate reward functions can result in unstable training and/or an inappropriate policy. Therefore, more research is needed to analyze the impact of different definitions of reward functions (that optimize the same objective). This will help provide a more standardized definition of the reward function that is well adapted for DRL frameworks.

Fourth, to promote adoption, the policies and behavior of the algorithm need to be easy to explain, particularly in the cases where the policy provides a nonstandard and unexpected solution to control the system. In addition, understanding the reasons behind control errors is important, to be able to create control algorithms that are robust to failures and thus reassure stakeholders, which is an essential step to increase the adoption of this technology.

Fifth, standardized building benchmarks are needed to properly evaluate the developed DRL methods. It is crucial to have standard environments where the DRL algorithms can be trained, to have well-defined metrics to quantify the performance of trained policies, and to have baseline DRL models against which the new algorithms can be evaluated. These benchmarks will help to speed incremental improvements and to evaluate the generalization of the proposed solutions to different building systems. By releasing and open-sourcing the environment and the DRL algorithm that were developed for this work, our aim is to provide the research community a benchmark for building systems that involves small commercial buildings with PV and battery storage. In the future, more of these benchmarks will be needed to cover more applications (i.e., building systems and control objectives).

Lastly, future research should overcome the limitations of this study and extend its results by comparing different DRL algorithms and testing different control variables (e.g., controlling ZAT setpoints or chilled water valve position) in addition to the setpoints controlled in this experiment. Better comfort models should be explored, to take advantage of more realistic assumptions about comfort adaptation. Additional field testing should be performed in a variety of building systems to make sure the results are generalizable.

7. Conclusion

This paper explores the use of a DRL approach to control a behind-the-meter DER system that consists of a proxy of a small office building with local PV solar generation and a battery unit. A simulation-based training framework was developed and used to train a DRL algorithm. This trained controller was then deployed in a testbed to evaluate its performance under three load flexibility modes (EE, Shift, and Shed). The results of this work, and prior published efforts, show that DRL can be a promising solution for DER systems. As the research community transitions from simulation-based DRL research to field demonstrations, the results and the challenges described in this paper can be used to accelerate the efforts of future researchers and practitioners. The next steps in this research include improving the DRL training methods, evaluating different reward functions and testing alternative algorithms. These can help develop portable and scalable solutions that enable easier deployment of DRL in different buildings.

CRediT authorship contribution statement

Samir Touzani: Conceptualization, Methodology, Software, Validation, Investigation, Writing – original draft, Writing – review & editing, Supervision, Project administration. Anand Krishnan Prakash: Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Zhe Wang: Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Shreya Agarwal: Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Marco Pritoni: Methodology, Formal analysis, Writing – original draft, Writing – review & editing, Supervision. Mariam Kiran: Methodology, Writing – review & editing. Richard Brown: Writing – review & editing. Jessica Granderson: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Assistant Secretary for Energy Efficiency and Renewable Energy, Building Technologies Office, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank Heather Goetsch, Cedar Blazek, the FLEXLAB team, Callie Clark and Christoph Gehbauer for their support. This research used the Lawrencium computational cluster resource provided by the IT Division at the Lawrence Berkeley National Laboratory (supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231).

References

[1] Bayram IS, Ustun TS. A survey on behind the meter energy management systems in smart grid. Renew Sustain Energy Rev 2017;72:1208–32.
[2] Foruzan E, Soh L-K, Asgarpoor S. Reinforcement learning approach for optimal distributed energy management in a microgrid. IEEE Trans Power Syst 2018;33(5):5749–58.
[3] Neukomm M, Nubbe V, Fares R. Grid-interactive efficient buildings. 2019, http://dx.doi.org/10.2172/1508212, URL https://www.osti.gov/biblio/1508212.
[4] Granderson J, Kramer H, Rui T, Brown R, Curtin C. Market brief: Customer-sited distributed energy resources. 2020, URL https://escholarship.org/uc/item/1xv3z9jb.
[5] Krishnan Prakash A, Zhang K, Gupta P, Blum D, Marshall M, Fierro G, Alstone P, Zoellick J, Brown R, Pritoni M. Solar+ optimizer: A model predictive control optimization platform for grid responsive building microgrids. Energies 2020;13(12). http://dx.doi.org/10.3390/en13123093, URL https://www.mdpi.com/1996-1073/13/12/3093.
[6] Bruno S, Giannoccaro G, La Scala M. A demand response implementation in tertiary buildings through model predictive control. IEEE Trans Ind Appl 2019;55(6):7052–61. http://dx.doi.org/10.1109/TIA.2019.2932963.
[7] Kim D, Braun J, Cai J, Fugate D. Development and experimental demonstration of a plug-and-play multiple RTU coordination control algorithm for small/medium commercial buildings. Energy Build 2015;107:279–93. http://dx.doi.org/10.1016/j.enbuild.2015.08.025, URL https://www.sciencedirect.com/science/article/pii/S0378778815302097.
[8] Bonthu RK, Pham H, Aguilera RP, Ha QP. Minimization of building energy cost by optimally managing PV and battery energy storage systems. In: 2017 20th international conference on electrical machines and systems (ICEMS). 2017, p. 1–6. http://dx.doi.org/10.1109/ICEMS.2017.8056442.
[9] Wang Z, Yang R, Wang L. Intelligent multi-agent control for integrated building and micro-grid systems. In: ISGT 2011. 2011, p. 1–7. http://dx.doi.org/10.1109/ISGT.2011.5759134.
[10] Reynolds J, Rezgui Y, Kwan A, Piriou S. A zone-level, building energy optimisation combining an artificial neural network, a genetic algorithm, and model predictive control. Energy 2018;151:729–39. http://dx.doi.org/10.1016/j.energy.2018.03.113, URL https://www.sciencedirect.com/science/article/pii/S036054421830522X.
[11] Macarulla M, Casals M, Forcada N, Gangolells M. Implementation of predictive control in a commercial building energy management system using neural networks. Energy Build 2017;151:511–9. http://dx.doi.org/10.1016/j.enbuild.2017.06.027, URL https://www.sciencedirect.com/science/article/pii/S0378778817300907.


[12] Drgoňa J, Arroyo J, Cupeiro Figueroa I, Blum D, Arendt K, Kim D, Ollé EP, Oravec J, Wetter M, Vrabie DL, Helsen L. All you need to know about model predictive control for buildings. Annu Rev Control 2020;50:190–232. http://dx.doi.org/10.1016/j.arcontrol.2020.09.001, URL https://www.sciencedirect.com/science/article/pii/S1367578820300584.
[13] Vázquez-Canteli JR, Nagy Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl Energy 2019;235:1072–89. http://dx.doi.org/10.1016/j.apenergy.2018.11.002, URL http://www.sciencedirect.com/science/article/pii/S0306261918317082.
[14] Sutton RS, Barto AG. Reinforcement learning: An introduction. USA: A Bradford Book; 2018.
[15] Wang Z, Hong T. Reinforcement learning for building controls: The opportunities and challenges. Appl Energy 2020;269:115036. http://dx.doi.org/10.1016/j.apenergy.2020.115036, URL http://www.sciencedirect.com/science/article/pii/S0306261920305481.
[16] White CC, White DJ. Markov decision processes. European J Oper Res 1989;39:1–16, URL https://doi.org/10.1016/0377-2217(89)90348-2.
[17] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, et al. Mastering the game of go without human knowledge. Nature 2017;550(7676):354–9.
[18] Mocanu E, Mocanu DC, Nguyen PH, Liotta A, Webber ME, Gibescu M, Slootweg JG. On-line building energy optimization using deep reinforcement learning. IEEE Trans Smart Grid 2019;10(4):3698–708.
[19] Wei T, Wang Y, Zhu Q. Deep reinforcement learning for building HVAC control. In: Proceedings of the 54th annual design automation conference 2017. 2017; p. 1–6.
[20] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529–33. http://dx.doi.org/10.1038/nature14236.
[21] Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D. Continuous control with deep reinforcement learning. 2015, arXiv preprint arXiv:1509.02971.
[22] Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M. Deterministic policy gradient algorithms. In: Proceedings of the 31st international conference on international conference on machine learning - Volume 32. ICML'14, JMLR.org; 2014, p. I–387–I–395.
[23] Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. Deep reinforcement learning that matters. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 32. 2018, (1). URL https://ojs.aaai.org/index.php/AAAI/article/view/11694.
[24] Liang E, Liaw R, Nishihara R, Moritz P, Fox R, Goldberg K, Gonzalez J, Jordan M, Stoica I. RLlib: Abstractions for distributed reinforcement learning. In: Dy J, Krause A, editors. Proceedings of the 35th international conference on machine learning. Proceedings of machine learning research, vol. 80, Stockholmsmässan, Stockholm Sweden: PMLR; 2018, p. 3053–62, URL http://proceedings.mlr.press/v80/liang18b.html.
[25] Mason K, Grijalva S. A review of reinforcement learning for autonomous building energy management. Comput Electr Eng 2019;78:300–12. http://dx.doi.org/10.1016/j.compeleceng.2019.07.019, URL http://www.sciencedirect.com/science/article/pii/S0045790618333421.
[26] Yang T, Zhao L, Li W, Zomaya AY. Reinforcement learning in sustainable energy and electric systems: a survey. Annu Rev Control 2020;49:145–63. http://dx.doi.org/10.1016/j.arcontrol.2020.03.001, URL https://www.sciencedirect.com/science/article/pii/S1367578820300079.
[27] Wang Y, Velswamy K, Huang B. A long-short term memory recurrent neural network based reinforcement learning controller for office heating ventilation and air conditioning systems. Processes 2017;5(3). http://dx.doi.org/10.3390/pr5030046, URL https://www.mdpi.com/2227-9717/5/3/46.
[28] Jia R, Jin M, Sun K, Hong T, Spanos C. Advanced building control via deep reinforcement learning. Energy Procedia 2019;158:6158–63. http://dx.doi.org/10.1016/j.egypro.2019.01.494, Innovative Solutions for Energy Transitions. URL http://www.sciencedirect.com/science/article/pii/S187661021930517X.
[29] Wei T, Wang Y, Zhu Q. Deep reinforcement learning for building HVAC control. In: 2017 54th ACM/EDAC/IEEE design automation conference (DAC). 2017, p. 1–6.
[30] Brandi S, Piscitelli MS, Martellacci M, Capozzoli A. Deep reinforcement learning to optimise indoor temperature control and heating energy consumption in buildings. Energy Build 2020;224:110225. http://dx.doi.org/10.1016/j.enbuild.2020.110225, URL http://www.sciencedirect.com/science/article/pii/S0378778820308963.
[31] Yoon YR, Moon HJ. Performance based thermal comfort control (PTCC) using deep reinforcement learning for space cooling. Energy Build 2019;203:109420. http://dx.doi.org/10.1016/j.enbuild.2019.109420, URL http://www.sciencedirect.com/science/article/pii/S0378778819310692.
[32] Gao G, Li J, Wen Y. Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. 2019, URL arXiv:1901.04693.
[33] Azuatalam D, Lee W-L, de Nijs F, Liebman A. Reinforcement learning for whole-building HVAC control and demand response. Energy AI 2020;2:100020. http://dx.doi.org/10.1016/j.egyai.2020.100020, URL http://www.sciencedirect.com/science/article/pii/S2666546820300203.
[34] Nagy A, Kazmi H, Cheaib F, Driesen J. Deep reinforcement learning for optimal control of space heating. In: 4th building simulation and optimization conference. 2018, URL http://www.ibpsa.org/proceedings/BSO2018/1C-4.pdf.
[35] Lee S, Choi D-H. Energy management of smart home with home appliances, energy storage system and electric vehicle: A hierarchical deep reinforcement learning approach. Sensors 2020;20(7). http://dx.doi.org/10.3390/s20072157, URL https://www.mdpi.com/1424-8220/20/7/2157.
[36] Chen B, Cai Z, Bergés M. Gnu-rl: A precocial reinforcement learning solution for building hvac control using a differentiable mpc policy. In: Proceedings of the 6th ACM international conference on systems for energy-efficient buildings, cities, and transportation. 2019; p. 316–25.
[37] Zhang Z, Lam KP. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In: Proceedings of the 5th conference on systems for built environments. BuildSys '18, New York, NY, USA: Association for Computing Machinery; 2018, p. 148–57. http://dx.doi.org/10.1145/3276774.3276775.
[38] Moriyama T, De Magistris G, Tatsubori M, Pham T-H, Munawar A, Tachibana R. Reinforcement learning testbed for power-consumption optimization. In: Li L, Hasegawa K, Tanaka S, editors. Methods and applications for modeling and simulation of complex systems. Singapore: Springer Singapore; 2018, p. 45–59.
[39] Zhang Z, Chong A, Pan Y, Zhang C, Lu S, Lam KP. A deep reinforcement learning approach to using whole building energy model for HVAC optimal control. In: ASHRAE and IBPSA-USA simbuild building performance modeling conference. 2018, URL https://www.researchgate.net/publication/326711617_A_Deep_Reinforcement_Learning_Approach_to_Using_Whole_Building_Energy_Model_For_HVAC_Optimal_Control.
[40] Zhang Z, Zhang C, Lam K. A deep reinforcement learning method for model-based optimal control of HVAC systems. 2018, p. 397–402. http://dx.doi.org/10.14305/ibpc.2018.ec-1.01.
[41] Zhang Z, Chong A, Pan Y, Zhang C, Lam KP. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build 2019;199:472–90. http://dx.doi.org/10.1016/j.enbuild.2019.07.029, URL http://www.sciencedirect.com/science/article/pii/S0378778818330858.
[42] Li Y, Wen Y, Tao D, Guan K. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Trans Cybern 2020;50(5):2002–13.
[43] Kazmi H, Mehmood F, Lodeweyckx S, Driesen J. Gigawatt-hour scale savings on a budget of zero: Deep reinforcement learning based optimal control of hot water systems. Energy 2018;144:159–68. http://dx.doi.org/10.1016/j.energy.2017.12.019, URL http://www.sciencedirect.com/science/article/pii/S0360544217320388.
[44] Wu P, Partridge J, Bucknall R. Cost-effective reinforcement learning energy management for plug-in hybrid fuel cell and battery ships. Appl Energy 2020;275:115258. http://dx.doi.org/10.1016/j.apenergy.2020.115258, URL http://www.sciencedirect.com/science/article/pii/S0306261920307704.
[45] Qiu S, Li Z, Li Z, Li J, Long S, Li X. Model-free control method based on reinforcement learning for building cooling water systems: Validation by measured data-based simulation. Energy Build 2020;218:110055. http://dx.doi.org/10.1016/j.enbuild.2020.110055, URL http://www.sciencedirect.com/science/article/pii/S0378778819339945.
[46] Soares A, Geysen D, Spiessens F, Ectors D, De Somer O, Vanthournout K. Using reinforcement learning for maximizing residential self-consumption – Results from a field test. Energy Build 2020;207:109608. http://dx.doi.org/10.1016/j.enbuild.2019.109608, URL http://www.sciencedirect.com/science/article/pii/S037877881930934X.
[47] Yu L, Xie W, Xie D, Zou Y, Zhang D, Sun Z, Zhang L, Zhang Y, Jiang T. Deep reinforcement learning for smart home energy management. IEEE Internet Things J 2020;7(4):2751–62. http://dx.doi.org/10.1109/JIOT.2019.2957289.
[48] Alfaverh F, Denaï M, Sun Y. Demand response strategy based on reinforcement learning and fuzzy reasoning for home energy management. IEEE Access 2020;8:39310–21. http://dx.doi.org/10.1109/ACCESS.2020.2974286.
[49] Schreiber T, Eschweiler S, Baranski M, Müller D. Application of two promising Reinforcement Learning algorithms for load shifting in a cooling supply system. Energy Build 2020;229:110490. http://dx.doi.org/10.1016/j.enbuild.2020.110490, URL http://www.sciencedirect.com/science/article/pii/S0378778820320922.
[50] Hua H, Qin Y, Hao C, Cao J. Optimal energy management strategies for energy Internet via deep reinforcement learning approach. Appl Energy 2019;239:598–609.
[51] BL Energy Technologies Area. FLEXLAB: Advanced integrated building & grid technologies testbed. 2020, URL https://flexlab.lbl.gov/.
[52] Yu L, Qin S, Zhang M, Shen C, Jiang T, Guan X. Deep reinforcement learning for smart building energy management: A survey. 2020, arXiv:2008.05074.
[53] EI Administration. 2012 CBECS survey data - microdata. 2020, https://www.eia.gov/consumption/commercial/data/2012/index.php?view=microdata. [Online; accessed 17-December-2020].


[54] P Gas, E Company. Electric schedule B-19. 2020, https://www.pge.com/tariffs/assets/pdf/tariffbook/ELEC_SCHEDS_B-19.pdf. [Online; accessed 16-December-2020].
[55] Vasudevan J, Swarup KS. Price based demand response strategy considering load priorities. In: 2016 IEEE 6th international conference on power systems (ICPS). 2016, p. 1–6. http://dx.doi.org/10.1109/ICPES.2016.7584019.
[56] Asadinejad A, Tomsovic K. Optimal use of incentive and price based demand response to reduce costs and price volatility. Electr Power Syst Res 2017;144:215–23. http://dx.doi.org/10.1016/j.epsr.2016.12.012, URL https://www.sciencedirect.com/science/article/pii/S0378779616305259.
[57] Liu J, Yin R, Pritoni M, Piette MA, Neukomm M. Developing and evaluating metrics for demand flexibility in buildings: Comparing simulations and field data. 2020, URL https://escholarship.org/uc/item/2d93p636.
[58] ASHRAE. New guideline on standardized advanced sequences of operation for common HVAC systems. 2018, URL https://www.ashrae.org/news/esociety/new-guideline-on-standardized-advanced-/sequences-of-operation-for-common-hvac-systems.
[59] Nouidui T, Wetter M, Zuo W. Functional mock-up unit for co-simulation import in EnergyPlus. J Build Perform Simul 2014;7(3):192–202.
[60] Gehbauer C, Mueller J, Swenson T, Vrettos E. Photovoltaic and behind-the-meter battery storage: Advanced smart inverter controls and field demonstration. 2020.
[61] Andersson C, Åkesson J, Führer C. Pyfmi: A Python package for simulation of coupled dynamic models with the functional mock-up interface. Centre for Mathematical Sciences, Lund University Lund; 2016.
[62] Zhang Z, Chong A, Pan Y, Zhang C, Lam KP. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build 2019;199:472–90.
[63] Morris MD. Factorial sampling plans for preliminary computational experiments. Technometrics 1991;33(2):161–74.
[64] Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13(2).
[65] Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014, arXiv preprint arXiv:1412.6980.
[66] Islam R, Henderson P, Gomrokchi M, Precup D. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. 2017, arXiv preprint arXiv:1708.04133.
[67] Prakash AK, Touzani S, Kiran M, Agarwal S, Pritoni M, Granderson J. Deep reinforcement learning in buildings: Implicit assumptions and their impact. In: Proceedings of the 1st international workshop on reinforcement learning for energy management in buildings & cities. RLEM'20, New York, NY, USA: Association for Computing Machinery; 2020, p. 48–51. http://dx.doi.org/10.1145/3427773.3427868.
[68] Tan Y, Yang J, Chen X, Song Q, Chen Y, Ye Z, Su Z. Sim-to-real optimization of complex real world mobile network with imperfect information via deep reinforcement learning from self-play. 2018, arXiv:1802.06416.
[69] Sohlberg B, Jacobsen E. Grey box modelling - branches and experiences. IFAC Proc Vol 2008;41(2):11415–20. http://dx.doi.org/10.3182/20080706-5-KR-1001.01934, 17th IFAC World Congress. URL http://www.sciencedirect.com/science/article/pii/S1474667016408025.
