Robust Adaptive Control for Linear Systems
Edouard Leurent
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL, Renault
F-59000 Lille, France
arXiv:2002.10816v2 [[Link]] 21 Oct 2020
[Link]@[Link]
Denis Efimov
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL
F-59000 Lille, France
[Link]@[Link]
Odalric-Ambrym Maillard
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL
F-59000 Lille, France
[Link]@[Link]
Abstract
We consider the problem of robust and adaptive model predictive control (MPC) of
a linear system, with unknown parameters that are learned along the way (adaptive),
in a critical setting where failures must be prevented (robust). This problem has
been studied from different perspectives by different communities. However, the
existing theory deals only with the case of quadratic costs (the LQ problem), which
limits applications to stabilisation and tracking tasks only. In order to handle
more general (non-convex) costs that naturally arise in many practical problems,
we carefully select and bring together several tools from different communities,
namely non-asymptotic linear regression, recent results in interval prediction, and
tree-based planning. Combining and adapting the theoretical guarantees at each
layer is non trivial, and we provide the first end-to-end suboptimality analysis for
this setting. Interestingly, our analysis naturally adapts to handle many models and
combines with a data-driven robust model selection strategy, which enables to relax
the modelling assumptions. Last, we strive to preserve tractability at every stage of
the method, which we illustrate on two challenging simulated environments.¹
1 Introduction
Despite the recent successes of Reinforcement Learning [e.g. 33, 40], it has hardly been applied to
real industrial problems. This could be attributed to two undesirable properties which limit its practical
applications. First, it depends on a tremendous amount of interaction data that cannot always be
simulated. This issue can be alleviated by model-based methods – which we consider in this work –
that often benefit from better sample efficiency than their model-free counterparts. Second, it relies
on trial-and-error and random exploration. In order to overcome these shortcomings, and motivated by
the path planning problem for a self-driving car, in this paper we consider the problem of controlling
an unknown linear system x(t) so as to maximise an arbitrary bounded reward function R, in a
¹Code and videos available at [Link]
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
critical setting where mistakes are costly and must be avoided at all times. This choice of rich reward
space is crucial to have sufficient flexibility to model non-convex and non-smooth functions that
naturally arise in many practical problems involving combinatorial optimisation, branching decisions,
etc., while quadratic costs are mostly suited for tracking a fixed reference trajectory [e.g. 23]. Since
experiencing failures is out of the question, the only way to prevent them from the outset is to rely on
some sort of prior knowledge. In this work, we assume that the system dynamics are partially known,
in the form of a linear differential equation with unknown parameters and inputs. More precisely, we
consider a linear system with state x ∈ Rp , acted on by controls u ∈ Rq and disturbances ω ∈ Rr ,
and following dynamics in the form:
ẋ(t) = A(θ)x(t) + Bu(t) + Dω(t), t ≥ 0, (1)
where the parameter vector θ in the state matrix A(θ) ∈ Rp×p belongs to a compact set Θ ⊂ Rd .
The control matrix B ∈ Rp×q and disturbance matrix D ∈ Rp×r are known. We also assume having
access to the observation of x(t) and to a noisy measurement of ẋ(t) in the form y(t) = ẋ(t) + Cν(t),
where ν(t) ∈ Rs is a measurement noise and C ∈ Rp×s is known. Assumptions over the disturbance
ω and noise ν will be detailed further, and we denote η(t) = Cν(t) + Dω(t). We argue that this
structure assumption is realistic given that most industrial applications to date have been relying on
physical models to describe their processes and well-engineered controllers to operate them, rather
than machine learning. Our framework relaxes this modelling effort by allowing some structured
uncertainty around the nominal model. We adopt a data-driven scheme to estimate the parameters
more accurately as we interact with the true system. Many model-based reinforcement learning
algorithms rely on the estimated dynamics to derive the corresponding optimal controls [e.g. 24, 29],
but suffer from model bias: they ignore the error between the learned and true dynamics, which can
dramatically degrade control performances [39].
To address this issue, we turn to the framework of robust decision-making: instead of merely
considering a point estimate of the dynamics, for any N ∈ N, we build an entire confidence region
CN,δ ⊂ Θ, illustrated in Figure 1, that contains the true dynamics parameter with high probability:
P (θ ∈ CN,δ ) ≥ 1 − δ, (2)
where δ ∈ (0, 1). In Section 2, having observed a history $D_N = \{(x_n, y_n, u_n)\}_{n\in[N]}$ of transitions,
our first contribution extends the confidence ellipsoid of Abbasi-Yadkori et al. [2] for the least-square
estimator, from feature vectors to our setting of feature matrices.
The robust control objective V r [8, 9, 18] aims to maximise the worst-case outcome with respect to
this confidence region CN,δ :
$$\sup_{\mathbf{u}\in(\mathbb{R}^q)^{\mathbb{N}}} V^r(\mathbf{u}), \quad\text{where}\quad V^r(\mathbf{u}) \stackrel{\text{def}}{=} \inf_{\substack{\theta\in C_{N,\delta} \\ \boldsymbol{\omega}\in[\underline{\omega},\overline{\omega}]^{\mathbb{R}}}} \left[\sum_{n=N+1}^{\infty} \gamma^n R(x_n(\mathbf{u},\boldsymbol{\omega}))\right], \qquad (3)$$
γ ∈ (0, 1) is a discount factor, and xn (u, ω) is the state reached at step n under controls u and
disturbances ω(t) within the given admissible bounds [ω(t), ω(t)]. Maximin problems such as (3)
are notoriously hard unless the reward R has a simple form. However, without a restriction on the shape
of functions R, we cannot hope to derive an explicit solution. In our second contribution, we propose
a robust MPC algorithm for solving (3) numerically. In Section 3, we leverage recent results from
the uncertain system simulation literature to derive an interval predictor [x(t), x(t)] for the system
(1), illustrated in Figure 2. For any N ∈ N, this predictor takes the information on the current state
xN , the confidence region CN,δ , planned control sequence u and admissible disturbance bounds
[ω(t), ω(t)]; and must verify the inclusion property:
$$\underline{x}(t) \le x(t) \le \overline{x}(t), \qquad \forall t \ge t_N. \qquad (4)$$
Since R is generic, potentially non-smooth and non-convex, solving the optimal – not to mention
the robust – control objective is intractable. In Section 4, facing a sequential decision problem with
continuous states, we turn to the literature of tree-based planning algorithms. Although there exist
works addressing continuous actions [10, 43], we resort to a first approximation and discretise the
continuous decision (Rq )N space by adopting a hierarchical control architecture: at each time, the
agent can select a high-level action a from a finite space A. Each action a ∈ A corresponds to the
selection of a low-level controller $\pi_a$, that we take affine: $u(t) = \pi_a(x(t)) \stackrel{\text{def}}{=} -K_a x(t) + u_a$. For
Figure 1: The model estimation procedure, running on the obstacle avoidance problem of Section 6. The confidence region $C_{N,\delta}$ shrinks with the number of samples N.
Figure 2: The state prediction procedure running on the obstacle avoidance problem of Section 6. At each time step (red to green), we bound the set of reachable states under model uncertainty (2).
instance, tracking a subgoal $x_g$ can be achieved with $\pi_g = K(x_g - x)$. This discretisation induces
a suboptimality, but it can be mitigated by diversifying the controller basis. The robust objective (3)
becomes $\sup_{\mathbf{a}\in A^{\mathbb{N}}} V^r(\mathbf{a})$, where $x_n(\mathbf{a},\boldsymbol{\omega})$ stems from (1) with $u_n = \pi_{a_n}(x_n)$. However, tree-based
planning algorithms are designed for a single known generative model rather than a confidence
region for the system dynamics. Our third contribution adapts them to the robust objective (3) by
approximating it with a tractable surrogate V̂ r that exploits the interval predictions (4) to define
a pessimistic reward. In our main result, we show that the best surrogate performance achieved
during planning is guaranteed to be attained on the true system, and provide an upper bound for the
approximation gap and suboptimality of our framework in Theorem 3. This is the first result of this
kind for maximin control with generic costs to the best of our knowledge. Algorithm 1 shows the full
integration of the three procedures of estimation, prediction and control.
In Section 5, our forth contribution extends the proposed framework to consider multiple modelling
assumptions, while narrowing uncertainty through data-driven model rejection, and still ensuring
safety via robust model-selection during planning.
Finally, in Section 6 we demonstrate the applicability of Algorithm 1 in two numerical experiments:
a simple illustrative example and a more challenging simulation for safe autonomous driving.
Notation The system dynamics are described in continuous time, but sensing and control happen
in discrete time with time-step dt > 0. For any variable z, we use subscript to refer to these discrete
times: zn = z(tn ) with tn = ndt and n ∈ N. We use bold symbols to denote temporal sequences
z = (zn )n∈N . We denote z + = max(z, 0), z − = z + − z, |z| = z + + z − and [n] = {1, . . . , n}.
Related work The control of uncertain systems is a long-standing problem, to which a vast body of literature
is dedicated. Existing work is mostly concerned with the problem of stabilisation around a fixed
reference state or trajectory, including approaches such as H∞ control [7], sliding-mode control
[32] and system-level synthesis [11, 12]. This paper fits in the popular MPC framework, for which
adaptive data-driven schemes have been developed to deal with model uncertainty [38, 41, 5], but
these often lack guarantees.
lack guarantees. The family of tube-MPC algorithms seeks to derive theoretical guarantees of robust
constraint satisfaction: the state x is constrained in a safe region X around the origin, often chosen
convex [17, 4, 6, 42, 30, 22, 31, 28]. Yet, many tasks cannot be framed as stabilisation problems (e.g.
obstacle avoidance) and are better addressed with the minimax control objective, which allows more
flexible goal formulations. Minimax control has mostly been studied in two particular instances.
Finite states Minimax control of finite Markov Decision Processes with uncertain parameters was
studied in [21, 34, 44], who showed that the main results of Dynamic Programming can be extended
to their robust counterparts only when the dynamics ambiguity set verifies a certain rectangularity
property. Since we consider continuous states, these methods do not apply.
Linear dynamics and quadratic costs Several approaches have been proposed for cumulative
regret minimisation in the LQ problem. In the Optimism in the Face of Uncertainty paradigm, the best
possible dynamics within a high-confidence region is selected under a controllability constraint, to
compute the corresponding optimal control in closed-form by solving a Riccati equation. The results
of [1, 20, 16] show that this procedure achieves a $\tilde{O}(N^{1/2})$ regret. Posterior sampling algorithms
[35, 3] select candidate dynamics randomly instead, and obtain the same result. Other works use noise
injection for exploration such as [11, 12]. However, neither optimism nor random exploration fit a
critical setting, where ensuring safety requires instead to consider pessimistic outcomes. The work of
Dean et al. [11] is close to our setting: after an offline estimation phase, they estimate a suboptimality
between a minimax controller and the optimal performance. Our work differs in that it addresses
costs of generic shape. Another work of interest is [37], where worst-case generic costs are considered.
However, they assume the knowledge of the dynamics, and their rollout-based solution only produces
inner-approximations and does not yield any guarantee. In this paper, interval prediction is used to
produce oversets, while a near-optimal control is found using a tree-based planning procedure.
2 Model Estimation
To derive a confidence region (2) for θ, the functional relationship A(θ) must be specified.
Assumption 1 (Structure). There exists a known feature tensor φ ∈ Rd×p×p such that for all θ ∈ Θ,
$$A(\theta) = A + \sum_{i=1}^{d} \theta_i \phi_i. \qquad (5)$$
We slightly abuse notation and include additional known terms in the measurement signal $y(t) = \dot{x}(t) + C\nu(t) - Ax(t) - Bu(t)$, to obtain a linear regression system $y_n = \Phi_n\theta + \eta_n$. The regularised least-square estimator $\theta_{N,\lambda}$ solves
$$\min_{\theta\in\mathbb{R}^d} \sum_{n=1}^{N} \|y_n - \Phi_n\theta\|^2_{\Sigma_p^{-1}} + \lambda\|\theta\|^2, \qquad (6)$$
whose solution is
$$\theta_{N,\lambda} = G_{N,\lambda}^{-1}\sum_{n=1}^{N} \Phi_n^T \Sigma_p^{-1} y_n, \quad\text{with}\quad G_{N,\lambda} \stackrel{\text{def}}{=} \sum_{n=1}^{N} \Phi_n^T \Sigma_p^{-1} \Phi_n + \lambda I_d. \qquad (7)$$
Substituting $y_n$ into (7) yields the regression error:
$$\theta_{N,\lambda} - \theta = G_{N,\lambda}^{-1}\sum_{n=1}^{N}\Phi_n^T\Sigma_p^{-1}\eta_n - \lambda G_{N,\lambda}^{-1}\theta.$$
To bound this error, we need the noise ηn to concentrate.
Assumption 2 (Noise Model). We assume that
1. at each time $t \ge 0$, the combined noise $\eta(t)$ is an independent sub-Gaussian noise with covariance proxy $\Sigma_p \in \mathbb{R}^{p\times p}$:
$$\forall u\in\mathbb{R}^p,\quad \mathbb{E}\left[\exp\left(u^T\eta(t)\right)\right] \le \exp\left(\frac{1}{2}u^T\Sigma_p u\right);$$
2. the parameter norm is bounded: $\|\theta\|_2 \le S$ for a known constant $S > 0$.
Theorem 1 (Confidence ellipsoid, a matrix version of Abbasi-Yadkori et al. [2]). Under Assumption 2, it holds with probability at least $1-\delta$ that
$$\|\theta_{N,\lambda} - \theta\|_{G_{N,\lambda}} \le \beta_N(\delta), \quad\text{with}\quad \beta_N(\delta) \stackrel{\text{def}}{=} \sqrt{2\ln\frac{\det(G_{N,\lambda})^{1/2}}{\delta\det(\lambda I_d)^{1/2}}} + (\lambda d)^{1/2} S. \qquad (8)$$
We convert this confidence ellipsoid $C_{N,\delta}$ from (8) into a polytope for $A(\theta)$. For simplicity, we present here a simple but coarse strategy: bound the ellipsoid by its enclosing axis-aligned hypercube:
$$A(\theta) \in \left\{ A_N + \sum_{i=1}^{2^d} \alpha_i \Delta A_{N,i} \;:\; \alpha \ge 0,\; \sum_{i=1}^{2^d} \alpha_i = 1 \right\} \qquad (9)$$
where $A_N = A(\theta_{N,\lambda})$ and, for each corner $h_i \in \{-1,1\}^d$ of the hypercube, $\Delta A_{N,i} = \frac{\beta_N(\delta)}{\sqrt{\lambda_{\min}(G_{N,\lambda})}} \sum_{k=1}^{d} (h_i)_k \phi_k$. A tighter polytope derivation is presented in the Supplementary Material.
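The estimation step of this section can be sketched in a few lines. Below is a minimal illustration, not the authors' implementation: the function name and default arguments are ours, and the noise in the test is Gaussian, which matches the sub-Gaussian proxy of Assumption 2.

```python
import numpy as np

def fit_confidence_ellipsoid(Phis, ys, Sigma_p, lam=1e-2, delta=0.05, S=1.0):
    """Regularised least squares with matrix features, y_n = Phi_n theta + eta_n,
    together with the ellipsoid radius beta_N(delta) of Theorem 1.
    Illustrative sketch; S is the assumed bound on the norm of theta."""
    d = Phis[0].shape[1]
    Sigma_inv = np.linalg.inv(Sigma_p)
    G = lam * np.eye(d)                 # Gram matrix G_{N,lambda}
    b = np.zeros(d)
    for Phi, y in zip(Phis, ys):
        G += Phi.T @ Sigma_inv @ Phi
        b += Phi.T @ Sigma_inv @ y
    theta_hat = np.linalg.solve(G, b)   # least-square estimate of (6)
    # beta_N(delta) = sqrt(2 ln(det(G)^{1/2} / (delta det(lam I)^{1/2}))) + sqrt(lam d) S
    half_log_det_ratio = 0.5 * (np.linalg.slogdet(G)[1] - d * np.log(lam))
    beta = np.sqrt(2.0 * (half_log_det_ratio + np.log(1.0 / delta))) + np.sqrt(lam * d) * S
    return theta_hat, G, beta
```

With well-excited features, the estimate concentrates and the ellipsoid $\{\theta' : \|\theta' - \theta_{N,\lambda}\|_{G_{N,\lambda}} \le \beta_N(\delta)\}$ shrinks, as in Figure 1.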
3 State Prediction
A simple solution to (4) is proposed in [14]: given element-wise bounds $\underline{A} \le A(\theta) \le \overline{A}$ obtained from $C_{N,\delta}$, they use matrix interval arithmetic to derive the predictor:
Proposition 2 (Simple predictor of Efimov et al. [14]). Assuming that (2) is satisfied for the system (1), then the interval predictor initialised with $\underline{x}(t_N) = \overline{x}(t_N) = x(t_N)$ and following
$$\begin{aligned}
\dot{\underline{x}}(t) &= \underline{A}^+\underline{x}^+(t) - \overline{A}^+\underline{x}^-(t) - \underline{A}^-\overline{x}^+(t) + \overline{A}^-\overline{x}^-(t) + Bu(t) + D^+\underline{\omega}(t) - D^-\overline{\omega}(t), \\
\dot{\overline{x}}(t) &= \overline{A}^+\overline{x}^+(t) - \underline{A}^+\overline{x}^-(t) - \overline{A}^-\underline{x}^+(t) + \underline{A}^-\underline{x}^-(t) + Bu(t) + D^+\overline{\omega}(t) - D^-\underline{\omega}(t),
\end{aligned} \qquad (10)$$
ensures the inclusion property (4) with confidence level $\delta$.
However, Leurent et al. [27] showed that this predictor can have unstable dynamics, even for stable
systems, which causes a fast build-up of uncertainty. They proposed an enhanced predictor which
exploits the polytopic structure (9) to produce more stable predictions, at the price of a requirement:
Assumption 3. There exists an orthogonal matrix Z ∈ Rp×p such that Z T AN Z is Metzler2 .
In practice, this assumption is often verified: it is for instance the case whenever AN is diagonalisable.
The similarity transformation of [15] provides a method to compute such a $Z$ when the system is
observable. To simplify the notation, we will further assume that $Z = I_p$. Denote $\Delta A_+ = \sum_{i=1}^{2^d} \Delta A_{N,i}^+$ and $\Delta A_- = \sum_{i=1}^{2^d} \Delta A_{N,i}^-$.
Proposition 3 (Enhanced predictor of Leurent et al. [27]). Assuming that (9) and Assumption 3 are satisfied for the system (1), then the interval predictor initialised with $\underline{x}(t_N) = \overline{x}(t_N) = x(t_N)$ and following
$$\begin{aligned}
\dot{\underline{x}}(t) &= A_N\underline{x}(t) - \Delta A_+\underline{x}^-(t) - \Delta A_-\overline{x}^+(t) + Bu(t) + D^+\underline{\omega}(t) - D^-\overline{\omega}(t), \\
\dot{\overline{x}}(t) &= A_N\overline{x}(t) + \Delta A_+\overline{x}^+(t) + \Delta A_-\underline{x}^-(t) + Bu(t) + D^+\overline{\omega}(t) - D^-\underline{\omega}(t),
\end{aligned} \qquad (11)$$
ensures the inclusion property (4) with confidence level $\delta$.
Figure 3 compares the performance of the predictors (10) and (11) in a simple example. It suggests
to always prefer (11) whenever Assumption 3 is verified, and only fallback to (10) as a last resort.
²We say that a matrix is Metzler when all its non-diagonal coefficients are non-negative.
Figure 3: Comparison of (10) and (11) for a simple system ẋ(t) = −θx(t) + ω(t), with θ ∈ [1, 2]
and ω(t) ∈ [−0.05, 0.05].
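In discrete time, the simple predictor (10) amounts to interval arithmetic on the positive and negative parts of the uncertain matrix and of the state bounds. The following Euler-discretised step is an illustrative sketch under simplifying assumptions (no control, $B = 0$, and $D = I$); the function name is ours.

```python
import numpy as np

def interval_step(x_lo, x_hi, A_lo, A_hi, w_lo, w_hi, dt):
    """One Euler step of the simple interval predictor (10): interval
    arithmetic over the state matrix A in [A_lo, A_hi], the state bounds
    [x_lo, x_hi] and the disturbance bounds [w_lo, w_hi]."""
    pos = lambda M: np.maximum(M, 0.0)   # positive part M^+
    neg = lambda M: pos(M) - M           # negative part M^-
    dx_lo = (pos(A_lo) @ pos(x_lo) - pos(A_hi) @ neg(x_lo)
             - neg(A_lo) @ pos(x_hi) + neg(A_hi) @ neg(x_hi) + w_lo)
    dx_hi = (pos(A_hi) @ pos(x_hi) - pos(A_lo) @ neg(x_hi)
             - neg(A_hi) @ pos(x_lo) + neg(A_lo) @ neg(x_lo) + w_hi)
    return x_lo + dt * dx_lo, x_hi + dt * dx_hi
```

Iterating this step on the scalar example of Figure 3, with $A(\theta) \in [-2, -1]$, keeps any true trajectory enclosed, while also reproducing the fast build-up of uncertainty discussed above.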
4 Robust Control
To evaluate the robust objective V r (3), we approximate it thanks to the interval prediction [x(t), x(t)].
Definition 1 (Surrogate objective). Let $\underline{x}_n(\mathbf{u}), \overline{x}_n(\mathbf{u})$ follow the dynamics defined in (11), and let
$$\hat{V}^r(\mathbf{u}) \stackrel{\text{def}}{=} \sum_{n=N+1}^{\infty} \gamma^n \underline{R}_n(\mathbf{u}), \quad\text{where}\quad \underline{R}_n(\mathbf{u}) \stackrel{\text{def}}{=} \min_{x\in[\underline{x}_n(\mathbf{u}),\,\overline{x}_n(\mathbf{u})]} R(x). \qquad (12)$$
Such a substitution makes the pessimistic reward $\underline{R}_n$ non-Markovian, since the worst case is assessed
over the whole past trajectory.
Theorem 2 (Lower bound). The surrogate objective (12) is a lower bound of the objective (3).
$$\hat{V}^r(\mathbf{u}) \le V^r(\mathbf{u}) \le V(\mathbf{u}).$$
Consequently, since all our approximations are conservative, if we manage to find a control sequence
such that no “bad event” (e.g. a collision) happens according to the surrogate objective $\hat{V}^r$, it is
guaranteed that none will happen when the controls are executed on the true system.
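For rewards with a known structure, the inner minimisation in (12) can sometimes be computed in closed form. As an illustration (our own simplification of the goal-distance reward of Section 6, ignoring its collision term), the minimum over a predicted box is attained at the box corner farthest from the goal:

```python
import numpy as np

def pessimistic_reward(x_lo, x_hi, x_goal):
    """Worst-case reward (12) over a predicted box [x_lo, x_hi] for the
    reward R(x) = 1 / (1 + ||x - x_goal||): since R decreases with the
    distance to the goal, the minimum is at the coordinate-wise farthest
    box corner. Illustrative sketch; the function name is hypothetical."""
    farthest = np.where(np.abs(x_lo - x_goal) > np.abs(x_hi - x_goal), x_lo, x_hi)
    return 1.0 / (1.0 + np.linalg.norm(farthest - x_goal))
```

The coordinate-wise choice is valid because the Euclidean distance to a point is maximised over a box separately in each coordinate.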
To maximise $\hat{V}^r$, we cannot use dynamic programming algorithms, since the state space is continuous and the pessimistic
rewards are non-Markovian. Rather, we turn to tree-based planning algorithms, which optimise a
sequence of actions based on the corresponding sequence of rewards, without requiring Markovian rewards
or state enumeration.
nor state enumeration. In particular, we consider the Optimistic Planning of Deterministic Systems
(OPD) algorithm [19] tailored for the case when the relationship between actions and rewards is
deterministic. Indeed, the stochasticity of the disturbances and measurements is encased in V̂ r :
given the observations up to time N both the predictor dynamics (11) and the pessimistic rewards in
(12) are deterministic. At each planning iteration $k \in [K]$, OPD progressively builds a tree $T_{k+1}$ by
forming upper bounds $B_a(k)$ on the value of sequences of actions $a$, and expanding³ the leaf $a_k$
with the highest upper bound:
$$a_k = \arg\max_{a\in L_k} B_a(k), \qquad B_a(k) = \sum_{n=0}^{h(a)-1} \gamma^n \underline{R}_n(a) + \frac{\gamma^{h(a)}}{1-\gamma}, \qquad (13)$$
where $L_k$ is the set of leaves of $T_k$, $h(a)$ is the length of the sequence $a$, and $\underline{R}_n(a)$ the pessimistic
reward (12) obtained at time $n$ by following the controls $u_n = \pi_{a_n}(x_n)$.
Lemma 1 (Planning performance of Hren & Munos [19]). The suboptimality of the OPD algorithm
(13) applied to the surrogate objective (12) after $K$ planning iterations is
$$\hat{V}^r(a^\star) - \hat{V}^r(a_K) = O\left(K^{-\frac{\log 1/\gamma}{\log\kappa}}\right),$$
where $\kappa \stackrel{\text{def}}{=} \limsup_{h\to\infty} \left|\left\{a\in A^h : \hat{V}^r(a) \ge \hat{V}^r(a^\star) - \frac{\gamma^{h+1}}{1-\gamma}\right\}\right|^{1/h}$ is a problem-dependent measure of the proportion of near-optimal paths.
³The expansion of a leaf node $a$ refers to the simulation of its children transitions $aA = \{ab,\; b \in A\}$.
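The OPD loop described above can be sketched as follows. This is a simplified illustration, not the authors' code: names are hypothetical, rewards are assumed to lie in [0, 1], and `step_fn`/`reward_fn` stand for the deterministic predictor dynamics and pessimistic reward.

```python
import numpy as np

def opd_plan(step_fn, reward_fn, s0, n_actions, gamma, budget):
    """Optimistic Planning of Deterministic systems (OPD), sketched:
    repeatedly expand the leaf with the highest upper bound B_a of (13)."""
    # each leaf: (state, value sum_{n<h} gamma^n r_n, depth h, first action)
    leaves = [(s0, 0.0, 0, None)]
    best_value, best_first_action = -np.inf, None
    for _ in range(budget):
        # optimistic bound: accumulated value + gamma^h / (1 - gamma)
        i = max(range(len(leaves)),
                key=lambda j: leaves[j][1] + gamma ** leaves[j][2] / (1 - gamma))
        s, v, h, a0 = leaves.pop(i)
        for a in range(n_actions):  # expand: simulate all children transitions
            s2 = step_fn(s, a)
            v2 = v + gamma ** h * reward_fn(s2)
            first = a if a0 is None else a0
            if v2 > best_value:
                best_value, best_first_action = v2, first
            leaves.append((s2, v2, h + 1, first))
    return best_first_action  # first action of the best sequence found
```

In the MPC loop, only this first action is executed before re-planning at the next time step.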
Hence, by using enough computational budget K for planning we can get as close as we want to the
optimal surrogate value V̂ r (a? ), at a polynomial rate. Unfortunately, there exists a gap between V̂ r
and the true robust objective V r , which stems from three approximations: (i) the true reachable set
was approximated by an enclosing interval in (4); (ii) the time-invariant dynamics uncertainty
$A(\theta)\in C_{N,\delta}$ was handled by the interval predictor (11) as if it were a time-varying uncertainty
$A(\theta(t))\in C_{N,\delta},\ \forall t$; and (iii) the lower bound $\min\sum \le \sum\min$ used to define the surrogate
objective (12) is not tight. However, this gap can be bounded under additional assumptions.
Theorem 3 (Suboptimality bound). Under two conditions:
1. a Lipschitz regularity assumption on the reward function $R$;
2. a stability condition: there exist $P > 0$, $Q_0 \in \mathbb{R}^{p\times p}$, $\rho > 0$, and $N_0 \in \mathbb{N}$ such that
$$\forall N > N_0,\quad \begin{bmatrix} A_N^T P + P A_N + Q_0 & P|D| \\ |D|^T P & -\rho I_r \end{bmatrix} < 0;$$
the suboptimality of Algorithm 1 is bounded with probability at least $1-\delta$, where $V(a)$ is the optimal expected return when executing an
action $a \in A$, $a^\star$ is an optimal action, and $\Delta\omega$ is a constant which corresponds to an irreducible
suboptimality suffered from being robust to instantaneous disturbances $\omega(t)$.
It is difficult to check the validity of the stability condition 2, since it applies to matrices $A_N$ produced
by the algorithm rather than to the system parameters. A stronger but easier-to-check condition is that
the polytope (9) at some iteration becomes included in a set where this property is uniformly satisfied.
For instance, if the features are sufficiently excited, the estimation converges to a neighbourhood of
the true dynamics $A(\theta)$. This also allows us to further bound the input-dependent estimation error term.
Corollary 1 (Asymptotic near-optimality). Under an additional persistent excitation (PE) assumption
$$\exists\, \underline{\phi}, \overline{\phi} > 0 : \quad \forall n \ge n_0,\quad \underline{\phi}^2 \le \lambda_{\min}\left(\Phi_n^T \Sigma_p^{-1} \Phi_n\right) \le \overline{\phi}^2, \qquad (14)$$
the stability condition 2 of Theorem 3 can be relaxed to apply to the true system: there exist $P, Q_0, \rho$
such that
$$\begin{bmatrix} A(\theta)^T P + P A(\theta) + Q_0 & P|D| \\ |D|^T P & -\rho I_r \end{bmatrix} < 0;$$
and the suboptimality bound takes the more explicit form
$$V(a^\star) - \hat{V}^r(a_K) \le \Delta\omega + O\left(\frac{\log\left(N^{d/2}/\delta\right)}{N}\right) + O\left(K^{-\frac{\log 1/\gamma}{\log\kappa}}\right),$$
which ensures asymptotic near-optimality when N → ∞ and K → ∞.
5 Multi-model Selection
The procedure we developed in Sections 2 to 4 relies on strong modelling assumptions, such as the
linear dynamics (1) and Assumption 1. But what if they are wrong?
Model adequacy One of the major benefits of using the family of linear models, compared to
richer model classes, is that they provide strict conditions allowing us to quantify the adequacy of the
modelling assumptions to the observations. Given N − 1 observations, Section 2 provides a polytopic
confidence region (9) that contains A(θ) with probability at least 1 − δ. Since the dynamics are linear,
we can propagate this confidence region to the next observation: $y_N$ must belong to the Minkowski
sum of a polytope representing model uncertainty, $\mathcal{P}(A_0 x_N + Bu_N,\ \Delta A_1 x_N, \ldots, \Delta A_{2^d} x_N)$, and a
polytope $\mathcal{P}(0_p, \underline{\eta}, \overline{\eta})$ bounding the disturbance and measurement noises. Delos & Teissandier [13]
provide a way to test this membership in polynomial time using linear programming. Whenever
it is not verified, we can confidently reject the $(A,\phi)$-modelling Assumption 1. This enables us to
consider a rich set of potential features (A1 , φ1 ), . . . , (AM , φM ) rather than relying on a single
assumption, and only retain those that are consistent with the observations so far. Then, every
remaining hypothesis must be considered during planning.
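The membership test reduces to a linear feasibility problem. The following sketch assumes a vertex description of the model-uncertainty polytope and SciPy's `linprog`; the function name and argument layout are hypothetical, not the exact formulation of [13].

```python
import numpy as np
from scipy.optimize import linprog

def consistent(y, center, vertices, eta_lo, eta_hi):
    """LP feasibility test: is y in the Minkowski sum of the polytope
    conv{center + v_i} and the noise box [eta_lo, eta_hi]?
    If infeasible, the corresponding modelling assumption is rejected."""
    p, k = len(y), len(vertices)
    # variables: alpha in R^k (convex weights), eta in R^p (noise)
    # constraints: y = center + sum_i alpha_i v_i + eta, sum_i alpha_i = 1
    A_eq = np.hstack([np.column_stack(vertices), np.eye(p)])
    A_eq = np.vstack([A_eq, np.hstack([np.ones(k), np.zeros(p)])])
    b_eq = np.concatenate([y - center, [1.0]])
    bounds = [(0, None)] * k + list(zip(eta_lo, eta_hi))
    res = linprog(np.zeros(k + p), A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.status == 0  # status 0: a feasible point was found
```

A zero objective is used since only feasibility matters; the solver then simply searches for any point satisfying the constraints.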
7
Robust selection We temporarily ignore the parametric uncertainty on θ to simply consider several
candidate dynamics models, which all correspond to different modelling assumptions. We also restrict
to deterministic dynamics, which is the case of (11).
Assumption 4 (Multi-model ambiguity). The dynamics f lie in a finite set of candidates (f m )m∈[M ] .
We adapt our planning algorithm to balance these concurrent hypotheses in a robust fashion, i.e.
maximise a robust objective with discrete ambiguity:
$$V^r = \sup_{\mathbf{a}\in A^{\mathbb{N}}} \min_{m\in[M]} \sum_{n=N+1}^{\infty} \gamma^n R_n^m, \qquad (15)$$
where Rnm is the reward obtained by following the action sequence a up to step n under the dynamics
f m . This objective could be optimised in the same way as in Section 4, but this would result in a
coarse and lossy approximation. Instead, we exploit the finite uncertainty structure of Assumption 4
to asymptotically recover the true V r by modifying the OPD algorithm in the following way:
Definition 2 (Robust UCB). We replace the upper bound (13) on sequence values in OPD by:
$$B_a^r(k) \stackrel{\text{def}}{=} \min_{m\in[M]} \sum_{n=0}^{h-1} \gamma^n R_n^m + \frac{\gamma^h}{1-\gamma}. \qquad (16)$$
Note that it is not equivalent to solving each control problem independently and following the action
with highest worst-case value, as we show in the Supplementary Material. We analyse the sample
complexity of the corresponding robust planning algorithm.
Proposition 4 (Robust planning performance). The robust version of OPD (16) enjoys the same regret
bound as OPD in Lemma 1, with respect to the multi-model objective (15).
This result is of independent interest: the solution of a robust objective (15) with discrete ambiguity
f ∈ {f m }m∈[M ] can be recovered exactly, asymptotically when the planning budget K goes to
infinity, which Robust DP algorithms do not allow. This also contrasts with the results obtained
in Section 4 for the robust objective (3) with continuous ambiguity A(θ) ∈ CN,δ , for which OPD
only recovers the surrogate approximation V̂ r , as discussed in Theorem 3. Note that here the regret
depends on the number K of node expansions, but each expansion now requires M times more
simulations than in the single-model setting. Finally, the two approaches of Sections 4 and 5 can be
merged by using the pessimistic reward (12) in (16).
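The modified bound (16) is straightforward to compute from the per-model reward sequences of a node; a minimal sketch (function name hypothetical, rewards assumed in [0, 1]):

```python
import numpy as np

def robust_b_value(reward_seqs, gamma):
    """Robust upper bound (16): minimum over candidate models of the
    discounted reward sum along an action sequence, plus the optimistic
    tail term gamma^h / (1 - gamma). reward_seqs[m] holds the rewards
    obtained along the sequence under model f^m."""
    h = len(reward_seqs[0])
    discounts = gamma ** np.arange(h)
    worst = min(float(discounts @ np.asarray(r)) for r in reward_seqs)
    return worst + gamma ** h / (1 - gamma)
```

Expanding a node thus requires simulating its children under all $M$ models before taking the minimum, which is the source of the $M$-fold cost noted above.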
6 Experiments
Videos and code are available at [Link]
Obstacle avoidance with unknown friction We first consider a simple illustrative example, shown
in Figure 2: the control of a 2D system moving by means of a force $(u_x, u_y)$ in a medium with
anisotropic linear friction with unknown coefficients $(\theta_x, \theta_y)$. The reward encodes the task of
navigating to reach a goal state $x_g$ while avoiding collisions with obstacles: $R(x) = \delta(x)/(1 + \|x - x_g\|_2)$,
where $\delta(x)$ is 0 whenever $x$ collides with an obstacle and 1 otherwise. The actions $A$ are constant
controls in the up, down, left and right directions. For the reasons mentioned above, no robust baseline
applies to our setting. We compare Algorithm 1 to a non-robust adaptive control approach that plans
with the estimated dynamics $\theta_{N,\lambda}$, and thus enjoys the same prior knowledge of the dynamics structure
and reward. This highlights that the benefits stem from the robust formulation itself rather than from
other aspects of algorithm design. We show in Table 1(a) the results of 100 simulations of a single episode: the robust
agent performs worse than the nominal agent on average, but manages to ensure safety while the
nominal agent collides with obstacles in 4% of simulations. We also compare to a standard model-free
approach, DQN, which does not benefit from the prior knowledge on the system dynamics, and is
instead trained over multiple episodes. The reported performance is that of the final policy obtained
after training for 3000 episodes, during which $897 \pm 64$ collisions occurred ($29.9 \pm 2.1\%$). We
study the evolution of the suboptimality $V(x_N) - \sum_{n>N} \gamma^{n-N} R(x_n)$ with respect to the number
of samples $N$, by comparing the empirical returns from a state $x_N$ to the value $V(x_N)$ that the agent
would get by acting optimally from xN with knowledge of the dynamics. Although the assumptions
Figure 4: The mean (solid), 95% CI for the mean (shaded) and maximum (dashed) suboptimality with respect to N.
Figure 5: The intersection crossing task. Trajectory intervals show behavioural uncertainty for each vehicle, with a multi-model assumption over their route.
Table 1: Frequency of collision, minimum and average return achieved on a single episode, repeated
with 100 random seeds. In both tasks, the robust agent performs worse than the nominal agent on
average, but manages to ensure safety and attains a better worst-case performance.
(a) Performances on the obstacle task

Performance     failures   min    avg ± std
Oracle          0%         11.6   14.2 ± 1.3
Nominal         4%         2.8    13.8 ± 2.0
Algorithm 1     0%         10.4   13.0 ± 1.5
DQN (trained)   6%         1.7    12.3 ± 2.5

(b) Performances on the driving task

Performance     failures   min    avg ± std
Oracle          0%         6.9    7.4 ± 0.5
Nominal 1       4%         5.2    7.3 ± 1.5
Nominal 2       33%        3.5    6.4 ± 0.3
Algorithm 1     0%         6.8    7.1 ± 0.3
DQN (trained)   3%         5.4    6.3 ± 0.6
of Theorem 3 are not satisfied (e.g. non-smooth reward), the mean suboptimality of the robust agent,
shown in Figure 4, still decreases polynomially with N: Algorithm 1 becomes more efficient as it
grows more confident, while ensuring safety at all times. In comparison, the nominal agent enjoys a smaller
suboptimality on average, but a higher one in the worst case.
Motion planning for an autonomous vehicle We consider the highway-env environment [25]
for simulated driving decision problems. An autonomous vehicle with state χ0 ∈ R4 is approach-
ing an intersection among V other vehicles with states χi ∈ R4 , resulting in a joint traffic state
x = [χ0 , . . . , χV ]> ∈ R4V +4 . These vehicles follow parametrized behaviours χ̇i = fi (x, θi ) with
unknown parameters $\theta_i \in \mathbb{R}^5$. We note a first advantage of the structure imposed in Assumption 1:
the uncertainty space of $\theta$ is $\mathbb{R}^{5V}$. In comparison, the traditional LQ setting, where the whole
state matrix $A$ is estimated, would have resulted in a much larger parameter space $\theta \in \mathbb{R}^{16V^2}$. The
system dynamics f , which describes the interactions between vehicles, can only be expressed in the
form of Assumption 1 given the knowledge of the desired route for each vehicle, with features φ
expressing deviations to the centerline of the followed lane. Since these intentions are unknown to
the agent, we adopt the multi-model perspective of Section 5 and consider one model per possible
route for every observed vehicle before an intersection. In Table 1(b), we compare Algorithm 1 to
a nominal agent planning with two different modelling assumptions: Nominal 1 has access to the
true followed route for each vehicle, while Nominal 2 does not and picks the model with minimal
prediction error. Again, we also compare to a DQN baseline trained over 3000 episodes, causing
1058 ± 113 collisions during training (35 ± 4%). As before, the robust agent has a higher worst-case
performance and avoids collisions at all times, at the price of a decreased average performance.
Conclusion
We present a framework for the robust estimation, prediction and control of a partially known linear
system with generic costs. Leveraging tools from linear regression, interval prediction, and tree-based
planning, we guarantee the predicted performance and provide a suboptimality bound. The method's
applicability is further improved by a multi-model extension and demonstrated on two simulations.
Broader Impact
The motivation behind this work is to enable the development of Reinforcement Learning solutions
for industrial applications, when it has been mainly limited to simulated games so far. In particular,
many industries already rely on non-adaptive control systems and could benefit from an increased
efficiency, including Oil and Gas, robotics for industrial automation, Data Center cooling, etc. But
more often than not, safety-critical constraints proscribe the use of exploration, and practitioners are
reluctant to turn to learning-based methods that lack accountability. This work addresses these
concerns by focusing on risk-averse decisions and by providing worst-case guarantees. Note however
that these guarantees are only as good as the validity of the underlying hypotheses, and Assumption 1
in particular should be submitted to a comprehensive validation procedure; otherwise, decisions
formed on a wrong basis could easily lead to dramatic consequences in such critical settings. Beyond
industrial perspectives, this work could be of general interest for risk-averse decision-making. For
instance, parametrized epidemiological models have been used to represent the propagation of
Covid-19 and study the impact of lockdown policies. These model parameters are estimated from
observational data and corresponding confidence intervals are often available, but rarely used in
the decision-making loop. In contrast, our approach would enable evaluating and optimising the
worst-case outcome of such public policies.
References
[1] Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic
systems. In Kakade, S. M. and von Luxburg, U. (eds.), Proceedings of the 24th Annual
Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pp.
1–26, Budapest, Hungary, 09–11 Jun 2011. PMLR.
[2] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits.
In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.),
Advances in Neural Information Processing Systems 24, pp. 2312–2320. Curran Associates,
Inc., 2011.
[3] Abeille, M. and Lazaric, A. Improved regret bounds for thompson sampling in linear quadratic
control problems. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International
Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research,
pp. 1–9, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[4] Adetola, V., DeHaan, D., and Guay, M. Adaptive model predictive control for constrained
nonlinear systems. Systems and Control Letters, 2009. ISSN 01676911. doi: 10.1016/[Link].
2008.12.002.
[5] Amos, B., Rodriguez, I. D. J., Sacks, J., Boots, B., and Zico Kolter, J. Differentiable MPC for
end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018.
[6] Aswani, A., Gonzalez, H., Sastry, S. S., and Tomlin, C. Provably safe and robust learning-based
model predictive control. Automatica, 2013. ISSN 00051098. doi: 10.1016/[Link].2013.
02.003.
[7] Basar, T. and Bernhard, P. H∞-Optimal Control and Related Minimax Design Problems:
A Dynamic Game Approach, volume 41. 1996.
[8] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization, volume 28. Princeton
University Press, 2009.
[9] Bertsimas, D., Brown, D. B., and Caramanis, C. Theory and applications of robust optimization.
SIAM review, 53(3):464–501, 2011.
[10] Busoniu, L., Pall, E., and Munos, R. Continuous-action planning for discounted infinite-horizon
nonlinear optimal control with Lipschitz values. Automatica, 92:100–108, 06 2018.
[11] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear
quadratic regulator. ArXiv, abs/1710.01688, 2017.
[12] Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control
of the linear quadratic regulator. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K.,
Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems
31, pp. 4188–4197. Curran Associates, Inc., 2018.
[13] Delos, V. and Teissandier, D. Minkowski Sum of Polytopes Defined by Their Vertices. Journal
of Applied Mathematics and Physics (JAMP), 3(1):62–67, January 2015.
[14] Efimov, D., Fridman, L., Raïssi, T., Zolghadri, A., and Seydou, R. Interval estimation for LPV
systems applying high order sliding mode techniques. Automatica, 48:2365–2371, 2012.
[15] Efimov, D., Raïssi, T., Chebotarev, S., and Zolghadri, A. Interval state observer for nonlinear
time varying systems. Automatica, 49(1):200–205, 2013.
[16] Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Finite time analysis of optimal adaptive
policies for linear-quadratic systems. CoRR, abs/1711.07230, 2017.
[17] Fukushima, H., Kim, T. H., and Sugie, T. Adaptive model predictive control for a class of
constrained linear systems based on the comparison model. Automatica, 2007. ISSN 00051098.
doi: 10.1016/[Link].2006.08.026.
[18] Gorissen, B. L., İhsan Yanıkoğlu, and den Hertog, D. A practical guide to robust optimization.
Omega, 53:124 – 137, 2015.
[19] Hren, J.-F. and Munos, R. Optimistic planning of deterministic systems. In European Workshop
on Reinforcement Learning, pp. 151–164, France, 2008.
[20] Ibrahimi, M., Javanmard, A., and Roy, B. Efficient reinforcement learning for high dimensional
linear quadratic systems. Advances in Neural Information Processing Systems, 4, 03 2013.
[21] Iyengar, G. N. Robust Dynamic Programming. Mathematics of Operations Research, 30:
257–280, 2005.
[22] Köhler, J., Andina, E., Soloperto, R., Müller, M. A., and Allgöwer, F. Linear robust adaptive
model predictive control: Computational complexity and conservatism. In 2019 IEEE 58th
Conference on Decision and Control (CDC), pp. 1383–1388, 2019.
[23] Kumar, E. V. and Jerome, J. Robust LQR controller design for stabilizing and trajectory tracking
of inverted pendulum. Procedia Engineering, 64:169 – 178, 2013. International Conference on
Design and Manufacturing.
[24] Lenz, I., Knepper, R. A., and Saxena, A. DeepMPC: Learning deep latent features for model
predictive control. In Robotics: Science and Systems, 2015.
[25] Leurent, E. An environment for autonomous driving decision-making. [Link]
eleurent/highway-env, 2018.
[26] Leurent, E. and Mercat, J. Social attention for autonomous decision-making in dense traffic. In
Machine Learning for Autonomous Driving Workshop at NeurIPS 2019, 2019.
[27] Leurent, E., Efimov, D., Raïssi, T., and Perruquetti, W. Interval prediction for continuous-time
systems with parametric uncertainties. In Proc. IEEE Conference on Decision and Control
(CDC), Nice, 2019.
[28] Leurent, E., Efimov, D., and Maillard, O.-A. Robust-Adaptive Interval Predictive Control for
Linear Uncertain Systems. In 2020 IEEE 59th Conference on Decision and Control (CDC),
Jeju Island, Republic of Korea, 8–11 Dec 2020.
[29] Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies.
CoRR, abs/1504.00702, 2015.
[30] Lorenzen, M., Allgöwer, F., and Cannon, M. Adaptive model predictive control with robust
constraint satisfaction. IFAC-PapersOnLine, 50(1):3313–3318, Jul 2017. ISSN 2405-8963.
[31] Lu, X. and Cannon, M. Robust adaptive tube model predictive control. In Proceedings of the
American Control Conference, 2019. ISBN 9781538679265.
[32] Lu, X.-Y. and Spurgeon, S. K. Robust sliding mode control of uncertain nonlinear systems.
Systems & Control Letters, 32(2):75–90, Nov 1997. ISSN 0167-6911.
[33] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,
Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou,
I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through
deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
[34] Nilim, A. and El Ghaoui, L. Robust Control of Markov Decision Processes with Uncertain
Transition Matrices. Operations Research, 53:780–798, 2005.
[35] Ouyang, Y., Gagrani, M., and Jain, R. Learning-based control of unknown linear systems with
thompson sampling. CoRR, abs/1709.04047, 2017.
[36] Peña, V. H., Lai, T. L., and Shao, Q.-M. Self-Normalized Processes: Limit Theory and Statistical
Applications. Springer Science & Business Media, 2008.
[37] Rosolia, U. and Borrelli, F. Sample-Based Learning Model Predictive Control for Linear
Uncertain Systems. In Proceedings of the IEEE Conference on Decision and Control, 2019.
ISBN 9781728113982. doi: 10.1109/CDC40024.2019.9030270.
[38] Sastry, S., Bodson, M., and Bartram, J. F. Adaptive Control: Stability, Convergence, and
Robustness. The Journal of the Acoustical Society of America, 1990. ISSN 0001-4966. doi:
10.1121/1.399905.
[39] Schneider, J. Exploiting model uncertainty estimates for safe dynamic control learning. Advances
in Neural Information Processing Systems, pp. 1047–1053, 1997.
[40] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L.,
Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement
learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):
1140–1144, 2018.
[41] Tanaskovic, M., Fagiano, L., Smith, R., and Morari, M. Adaptive receding horizon control for
constrained MIMO systems. Automatica, 2014. ISSN 00051098. doi: 10.1016/[Link].
2014.10.036.
[42] Turchetta, M., Berkenkamp, F., and Krause, A. Safe exploration in finite Markov decision
processes with Gaussian processes. In Advances in Neural Information Processing Systems,
2016.
[43] Weinstein, A. and Littman, M. Bandit-based planning and learning in continuous-action Markov
decision processes. Proceedings of the 22nd International Conference on Automated Planning
and Scheduling, pp. 306–314, 01 2012.
[44] Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov Decision Processes. Mathematics
of Operations Research, pp. 1–52, 2013.
Supplementary Material
Outline In Appendix A, we provide a proof for every novel result introduced in this paper. Appendix B provides additional details on our experiments. Appendix C gives a better method of conversion from an ellipsoid to a polytope than that of (9). Finally, Appendix D highlights the fact that robustness cannot be recovered by aggregating independent solutions to many optimal control problems.
A Proofs
A.1 Proof of Proposition 1
Proof. We differentiate $J(\theta) = \sum_{n=1}^{N} \|y_n - \Phi_n\theta\|^2_{\Sigma_p^{-1}} + \lambda\|\theta\|^2$ as in (6) with respect to $\theta$:
\begin{align*}
\nabla_\theta J(\theta) &= \sum_{n=1}^{N} \nabla_\theta (y_n - \Phi_n\theta)^\top \Sigma_p^{-1} (y_n - \Phi_n\theta) + \nabla_\theta \lambda\|\theta\|^2 \\
&= -2\sum_{n=1}^{N} y_n^\top \Sigma_p^{-1} \Phi_n + 2\sum_{n=1}^{N} \theta^\top (\Phi_n^\top \Sigma_p^{-1} \Phi_n) + 2\lambda\theta^\top
\end{align*}
Hence,
\[
\nabla_\theta J(\theta) = 0 \iff \left(\sum_{n=1}^{N} \Phi_n^\top \Sigma_p^{-1} \Phi_n + \lambda I_d\right)\theta = \sum_{n=1}^{N} \Phi_n^\top \Sigma_p^{-1} y_n
\]
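As a sanity check, these normal equations can be verified numerically. A minimal NumPy sketch (function name and data shapes are illustrative, not from the paper):

```python
import numpy as np

# Sketch: solve the regularised weighted least-squares of Proposition 1,
# (sum_n Phi_n^T Sigma_p^{-1} Phi_n + lambda I_d) theta = sum_n Phi_n^T Sigma_p^{-1} y_n.
def ridge_estimate(Phis, ys, Sigma_p, lam):
    d = Phis[0].shape[1]
    Sp_inv = np.linalg.inv(Sigma_p)
    G = lam * np.eye(d)          # regularised Gram matrix G_{N,lambda}
    b = np.zeros(d)
    for Phi, y in zip(Phis, ys):
        G += Phi.T @ Sp_inv @ Phi
        b += Phi.T @ Sp_inv @ y
    return np.linalg.solve(G, b), G
```

At the returned estimate, the gradient of the objective $J$ vanishes, which can be checked directly.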
Proof. Let
\[
G_t = \sum_{s=1}^{t} \Phi_s^\top \Sigma_p^{-1} \Phi_s \in \mathbb{R}^{d\times d}
\]
and, for any $z \in \mathbb{R}^d$,
\[
M_t^z = \exp\left(\langle z, S_t\rangle - \frac{1}{2}\|z\|^2_{G_t}\right), \qquad
D_t^z = \exp\left(\langle \Phi_t z, \eta_t\rangle_{\Sigma_p^{-1}} - \frac{1}{2}\|\Phi_t z\|^2_{\Sigma_p^{-1}}\right)
\]
Then,
\begin{align*}
M_t^z &= \exp\left(\sum_{s=1}^{t} z^\top \Phi_s^\top \Sigma_p^{-1}\eta_s - \frac{1}{2}(\Phi_s z)^\top \Sigma_p^{-1} (\Phi_s z)\right) = \prod_{s=1}^{t} D_s^z \\
\mathbb{E}\left[M_t^z \mid \mathcal{F}_{t-1}\right] &= \left(\prod_{s=1}^{t-1} D_s^z\right) \mathbb{E}\left[D_t^z \mid \mathcal{F}_{t-1}\right] \le M_{t-1}^z
\end{align*}
showing that $(M_t^z)_{t=1}^{\infty}$ is indeed a supermartingale, and in fact $\mathbb{E}[M_t^z] \le 1$. It then follows by Doob's upcrossing lemma for supermartingales that $M_\infty^z = \lim_{t\to\infty} M_t^z$ is almost surely well-defined, and so is $M_\tau^z$ for any random stopping time $\tau$.

Next, we consider the stopped martingale $M_{\min(\tau,t)}^z$. Since $(M_t^z)_{t=1}^{\infty}$ is a non-negative supermartingale and $\tau$ is a random stopping time, we deduce by Doob's decomposition that
\begin{align*}
\mathbb{E}\left[M_{\min(\tau,t)}^z\right] &= \mathbb{E}\left[M_0^z\right] + \mathbb{E}\left[\sum_{s=0}^{t-1}\left(M_{s+1}^z - M_s^z\right)\mathbb{I}\{\tau > s\}\right] \\
&\le 1 + \mathbb{E}\left[\sum_{s=0}^{t-1}\mathbb{E}\left[M_{s+1}^z - M_s^z \mid \mathcal{F}_s\right]\mathbb{I}\{\tau > s\}\right] \\
&\le 1
\end{align*}
Finally, an application of Fatou's lemma shows that $\mathbb{E}[M_\tau^z] = \mathbb{E}[\liminf_{t\to\infty} M_{\min(\tau,t)}^z] \le \liminf_{t\to\infty}\mathbb{E}[M_{\min(\tau,t)}^z] \le 1$.

This result allows us to apply a result from [36]:

Lemma 2 (Theorem 14.7 of [36]). If $Z$ is a random vector and $B$ is a symmetric positive definite matrix such that
\[
\forall \gamma \in \mathbb{R}^d, \quad \ln \mathbb{E}\exp\left(\gamma^\top Z - \frac{1}{2}\gamma^\top B \gamma\right) \le 0,
\]
then for any positive definite non-random matrix $C$, it holds that
\[
\mathbb{E}\left[\sqrt{\frac{\det(C)}{\det(B+C)}}\exp\left(\frac{1}{2}\|Z\|^2_{(B+C)^{-1}}\right)\right] \le 1.
\]
Here, by using $Z = \sum_{s=1}^{t} \Phi_s^\top \Sigma_p^{-1}\eta_s$, $B = G_t$, $C = G$,
\[
\mathbb{P}\left(\|S_t\|_{(G_t+G)^{-1}} \ge \sqrt{2\ln\frac{\det(G_t+G)^{1/2}}{\delta\det(G)^{1/2}}}\right) \le \delta
\]
Moreover, by the regularised least-squares identity $G_{N,\lambda}(\theta_{N,\lambda} - \theta) = \sum_{n=1}^{N}\Phi_n^\top\Sigma_p^{-1}\eta_n - \lambda\theta$, we have for any $x \in \mathbb{R}^d$:
\[
\langle x, G_{N,\lambda}(\theta_{N,\lambda}-\theta)\rangle_{G_{N,\lambda}^{-1}} = \sum_{n=1}^{N}\langle x, \Phi_n^\top\Sigma_p^{-1}\eta_n\rangle_{G_{N,\lambda}^{-1}} - \lambda\langle x, \theta\rangle_{G_{N,\lambda}^{-1}}
\]
In particular, for $x = G_{N,\lambda}(\theta_{N,\lambda}-\theta)$, we get after simplifying with $\|\theta_{N,\lambda}-\theta\|_{G_{N,\lambda}}$:
\[
\|\theta_{N,\lambda}-\theta\|_{G_{N,\lambda}} \le \left\|\sum_{n=1}^{N}\Phi_n^\top\Sigma_p^{-1}\eta_n\right\|_{G_{N,\lambda}^{-1}} + \lambda\|\theta\|_{G_{N,\lambda}^{-1}}
\]
And since $\|\theta\|^2_{G_{N,\lambda}^{-1}} \le \|\theta\|_2^2/\lambda_{\min}(G_{N,\lambda}) \le \|\theta\|_2^2/\lambda$ and $\|\theta\|_2^2 \le d\|\theta\|_\infty^2 \le dS^2$,
\[
\|\theta_{N,\lambda}-\theta\|_{G_{N,\lambda}} \le \sqrt{2\ln\frac{\det(G_{N,\lambda})^{1/2}}{\delta\det(\lambda I_d)^{1/2}}} + (\lambda d)^{1/2} S
\]
Proof. The predictor designed in Section 3 verifies the inclusion property (4). Thus, for any sequence of controls $u$, any dynamics $A(\theta) \in C_{N,\delta}$, and disturbances $\underline\omega \le \omega \le \overline\omega$, the corresponding state at time $t_n$ is bounded by $\underline{x}_n \le x_n \le \overline{x}_n$, which implies that $R(x_n) \ge \min_{x\in[\underline{x}_n(u),\overline{x}_n(u)]} R(x) = \underline{R}_n(u)$. Thus, by taking the min over $C_{N,\delta}$ and $[\underline\omega, \overline\omega]$, we also have for any sequence of controls $u$:
\begin{align*}
V^r(u) &= \min_{\substack{A(\theta)\in C_{N,\delta} \\ \underline\omega\le\omega\le\overline\omega}} \sum_{n=N+1}^{\infty}\gamma^n R(x_n) \\
&\ge \sum_{n=N+1}^{\infty}\gamma^n \underline{R}_n(u) \\
&= \hat{V}^r(u)
\end{align*}
And $V^r(u) \le V(u) = \mathbb{E}_\omega \sum_{n=N+1}^{\infty}\gamma^n R(x_n)$ by definition.
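The pessimistic surrogate can be illustrated on a toy scalar example. A sketch, where the minimum over each state interval is approximated on a grid and the sum starts at $n = 1$ (both simplifications for illustration, not the paper's implementation):

```python
import numpy as np

# Sketch: pessimistic surrogate value, sum_n gamma^n min_{x in [lo_n, up_n]} R(x),
# for a scalar state with interval bounds (lower, upper) at each step.
def surrogate_value(lower, upper, reward, gamma, grid=101):
    value = 0.0
    for n, (lo, up) in enumerate(zip(lower, upper), start=1):
        xs = np.linspace(lo, up, grid)          # discretise the interval [lo, up]
        value += gamma ** n * min(reward(x) for x in xs)
    return value
```

Widening the state intervals can only decrease this surrogate, consistent with it being a lower bound on the value.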
A.4 Proof of Theorem 3
Proof. We have
\[
\|\theta - \theta_{N,\lambda}\|^2_{G_{N,\lambda}} \ge \lambda_{\min}(G_{N,\lambda})\|\theta - \theta_{N,\lambda}\|_2^2
\]
And (8) gives
\[
\|\theta - \theta_{N,\lambda}\|^2_{G_{N,\lambda}} = O(\beta_N(\delta)^2)
\]
Moreover, $A(\theta)$ belongs to a linear image of this $L^2$-ball. Writing the $j$-th column of a matrix $M$ as $M_j$, and its coefficient $(i,j)$ as $M_{i,j}$,
\begin{align*}
\left((A(\theta) - A(\theta_{N,\lambda}))^\top(A(\theta) - A(\theta_{N,\lambda}))\right)_{i,j} &= (\theta - \theta_{N,\lambda})^\top\phi_i^\top\phi_j(\theta - \theta_{N,\lambda}) \\
&\le \lambda_{\max}(\phi_i^\top\phi_j)\|\theta - \theta_{N,\lambda}\|_2^2
\end{align*}
Thus, $\|A(\theta) - A(\theta_{N,\lambda})\|_F^2 = \operatorname{Tr}\left[(A(\theta) - A(\theta_{N,\lambda}))^\top(A(\theta) - A(\theta_{N,\lambda}))\right] = O\left(\frac{\beta_N(\delta)^2}{\lambda_{\min}(G_{N,\lambda})}\right)$
Moreover, it holds that $-x^+ - x^- \le e \le x^+ + x^-$, which implies $|x^+ + x^-| \le 2|e|$. Hence,
\[
\dot{V} \le X^\top \underbrace{\begin{bmatrix}
A_N^\top P + P A_N + Q + 4\alpha I_p & P|D| & P|\Delta A| \\
|D|^\top P & -\rho I_r & 0 \\
|\Delta A|^\top P & 0 & -\alpha I_p
\end{bmatrix}}_{\Upsilon} X - e^\top Q e + \rho\|\overline\omega - \underline\omega\|_2^2
\]
We now examine the condition $\Upsilon \le 0$. We resort to its Schur complement: given $\alpha > 0$, $\Upsilon \le 0$ if and only if $R \ge S$, where $S = \alpha^{-1}\begin{bmatrix}|\Delta A|^\top P & 0\end{bmatrix}^\top\begin{bmatrix}|\Delta A|^\top P & 0\end{bmatrix}$ and $R$ is the top-left block of $-\Upsilon$:
\[
R = \begin{bmatrix}
-A_N^\top P - P A_N - Q - 4\alpha I_p & -P|D| \\
-|D|^\top P & \rho I_r
\end{bmatrix}
\]
Choose $Q = \frac{1}{2}Q_0 - 4\alpha I_p$. Assume that $P$ is fixed and satisfies the conditions of the lemma. We have
Finally, we propagate the state prediction error bound to the pessimistic rewards and surrogate objective to get our final result.

Proof. For any sequence of controls $u$, dynamical parameters $\theta \in C_{N,\delta}$ and disturbances $\underline\omega \le \omega \le \overline\omega$, we clearly have
\[
V^r(u) \le V(u) = \mathbb{E}_\omega \sum_n \gamma^n R(x_n)
\]
Moreover, by the inclusion property (4), we have that $\underline{x}_n \le x_n \le \overline{x}_n$, which implies that $R(x_n) \le \max_{x\in[\underline{x}_n(u),\overline{x}_n(u)]} R(x)$. Assuming $R$ is $L$-Lipschitz,
\begin{align*}
V(u) - \hat{V}^r(u) &\le \sum_{n=N+1}^{\infty}\gamma^n \left(\max_{x\in[\underline{x}_n(u),\overline{x}_n(u)]} - \min_{x\in[\underline{x}_n(u),\overline{x}_n(u)]}\right) R(x) \\
&\le \sum_{n=N+1}^{\infty}\gamma^n L\,\|\overline{x}_n(u) - \underline{x}_n(u)\|_2 \\
&\le L\left(C_0 + O\left(\frac{\beta_N(\delta)^2}{\lambda_{\min}(G_{N,\lambda})}\right)\right)\sum_{n>N}\gamma^n C_\omega(t_n) \\
&= \Delta_\omega + O\left(\frac{\beta_N(\delta)^2}{\lambda_{\min}(G_{N,\lambda})}\right)
\end{align*}
with $\Delta_\omega = LC_0\sum_{n>N}\gamma^n C_\omega(t_n)$, which is finite by Assumption 2.

Finally, we use the result of Lemma 1 to account for planning with a finite budget, and relate $\hat{V}^r(a^\star)$ to $\hat{V}^r(a_K)$.
and by (8),
\begin{align*}
\beta_N(\delta) &= \sqrt{2\log\frac{\det(G_{N,\lambda})^{1/2}}{\delta\det(\lambda I_d)^{1/2}}} + (\lambda d)^{1/2} S \\
&\le \sqrt{\log\left(N^{d/2}\|\phi\|^d/(\delta\lambda^{d/2})\right)} + O(1)
\end{align*}
Thus,
\[
\frac{\beta_N(\delta)^2}{\lambda_{\min}(G_{N,\lambda})} = O\left(\frac{\log(N^{d/2}/\delta)}{N}\right)
\]
Stability condition 2. By Lemma 3 and the above, the sequence $(A_N)_N$ converges to $A(\theta)$ in Frobenius norm. Thus,
\[
M_N \overset{\text{def}}{=} \begin{bmatrix} A_N^\top P + P A_N + Q_0 & P|D| \\ |D|^\top P & -\rho I_r \end{bmatrix}
\text{ also converges to }
M \overset{\text{def}}{=} \begin{bmatrix} A(\theta)^\top P + P A(\theta) + Q_0 & P|D| \\ |D|^\top P & -\rho I_r \end{bmatrix},
\]
which is assumed to be negative definite.
Moreover, the two functions that map a matrix to its characteristic polynomial and a polynomial to its roots are both continuous. Thus, by continuity, the largest eigenvalue of $M_N$ converges to that of $M$, which is strictly negative. Hence, there exists some $N_0 \in \mathbb{N}$ such that for all $N > N_0$, $M_N$ is negative definite, as required in condition 2. of Theorem 3.
we also define the robust value of a sequence of actions $a$
\[
V_a^r \overset{\text{def}}{=} \max_{u\in aA^\infty}\min_{m\in[M]}\sum_{n=h(a)+1}^{\infty}\gamma^n R_n^m \quad (19)
\]
Then, the robust values, U-values and B-values exhibit similar properties as the optimal values, U-values and B-values, that is: for all $0 < k < K$ and $a \in \mathcal{T}_T$,
\[
U_a^r(k) \le U_a^r(K) \le V_a^r \le B_a^r(K) \le B_a^r(k) \quad (21)
\]
Proof. By definition, when starting with sequence $a$, the value $U_a^m(k)$ represents the minimum admissible reward, while $B_a^m(k)$ corresponds to the best admissible reward achievable with respect to the possible continuations of $a$. Thus, for all $a \in A^*$, $U_a^m(k)$ and $U_a^r(k)$ are non-decreasing functions of $k$ and $B_a^m(k)$ and $B_a^r(k)$ are non-increasing functions of $k$, while $V_a^m$ and $V_a^r$ do not depend on $k$.

Moreover, since the reward function $R$ is assumed to be bounded in $[0,1]$, the sum of discounted rewards from a node of depth $d$ is at most $\gamma^d + \gamma^{d+1} + \cdots = \frac{\gamma^d}{1-\gamma}$. As a consequence, for all $k \ge 0$, $a \in L_k$ of depth $d$, and any sequence of rewards $(R_n)_{n\in\mathbb{N}}$ obtained from following a path in $aA^\infty$ with any dynamics $m \in [M]$:
\[
U_a^m(k) = \sum_{n=0}^{d-1}\gamma^n R_n^m \le \sum_{n=0}^{\infty}\gamma^n R_n^m \le \sum_{n=0}^{d-1}\gamma^n R_n^m + \frac{\gamma^d}{1-\gamma} = B_a^m(k)
\]
Hence,
\[
\min_{m\in[M]} U_a^m(k) \le \min_{m\in[M]}\sum_{n=0}^{\infty}\gamma^n R_n^m \le \min_{m\in[M]} B_a^m(k) \quad (22)
\]
And as the left-hand and right-hand sides of (22) are independent of the particular path that was followed in $aA^\infty$, it also holds for the robust path:
\[
\min_{m\in[M]} U_a^m(k) \le \max_{a'\in aA^\infty}\min_{m\in[M]}\sum_{n=0}^{\infty}\gamma^n R_n^m \le \min_{m\in[M]} B_a^m(k)
\]
that is,
\[
U_a^r(k) \le V_a^r \le B_a^r(k) \quad (23)
\]
Finally, (23) is extended to the rest of $\mathcal{T}_k$ by recursive application of (19), (20) and (18).
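The leaf-level sandwich $U_a^m(k) \le \sum_n \gamma^n R_n^m \le B_a^m(k)$ and its robust (min-over-models) counterpart can be checked numerically. A toy sketch with rewards in $[0,1]$ (data and names are illustrative, not the paper's planner):

```python
import numpy as np

def leaf_bounds(rewards, gamma):
    # U: discounted return of the traversed prefix; B: add the best possible tail gamma^d / (1 - gamma).
    u = sum(gamma ** n * r for n, r in enumerate(rewards))
    return u, u + gamma ** len(rewards) / (1.0 - gamma)

def robust_bounds(prefix_rewards_per_model, gamma):
    # Robust U/B-values at a leaf: minimum of the per-model bounds.
    bounds = [leaf_bounds(r, gamma) for r in prefix_rewards_per_model]
    return min(u for u, _ in bounds), min(b for _, b in bounds)
```

Any continuation of the prefix with rewards in $[0,1]$ yields a robust return within these bounds.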
Proof. Hren & Munos [19] first show in Theorem 2 that the simple regret $r_K$ of their optimistic planner is bounded by $\frac{\gamma^{d_K}}{1-\gamma}$, where $d_K$ is the depth of $\mathcal{T}_K$. This property relies on the fact that the returned action belongs to the deepest explored branch, which we can show likewise by contradiction using Lemma 5. This directly yields that the returned action $a = i_0$, where $i$ is some node of maximal depth $d_K$ expanded at round $k \le K$, which by the selection rule verifies $B_a^r(k) = B_i^r(k) = \max_{x\in A} B_x^r(k)$, and:
\begin{align*}
V^r - V_a^r = V_{a^\star}^r - V_a^r &\le B_{a^\star}^r(k) - V_a^r \le B_a^r(k) - U_a^r(k) \\
&= B_i^r(k) - U_i^r(k) \\
&= \frac{\gamma^{d_K}}{1-\gamma}.
\end{align*}
Secondly, they bound the depth $d_K$ of $\mathcal{T}_K$ with respect to $K$. To that end, they show that the expanded nodes always belong to the sub-tree $\mathcal{T}_\infty$ of all the nodes of depth $d$ that are $\frac{\gamma^d}{1-\gamma}$-optimal. Indeed, if a node $i$ of depth $d$ is expanded at round $k$, then $B_i^r(k) \ge B_j^r(k)$ for all $j \in L_k$ by the selection rule; thus the max-backups of (16) up to the root yield $B_i^r(k) = B_\emptyset^r(k)$. Moreover, by Lemma 5 we have that $B_\emptyset^r(k) \ge V_\emptyset^r = V^r$, and so $V_i^r \ge U_i^r(k) = B_i^r(k) - \frac{\gamma^d}{1-\gamma} \ge V^r - \frac{\gamma^d}{1-\gamma}$, thus $i \in \mathcal{T}_\infty$.
Then, from the definition of $\kappa$ applied to nodes in $\mathcal{T}_\infty$, there exist $d_0$ and $c$ such that the number $n_d$ of nodes of depth $d \ge d_0$ in $\mathcal{T}_\infty$ is bounded by $c\kappa^d$. As a consequence,
\[
K = \sum_{d=0}^{d_K} n_d = n_0 + \sum_{d=d_0+1}^{d_K} n_d \le n_0 + c\sum_{d=d_0+1}^{d_K}\kappa^d.
\]
B Experimental details
In both experiments, we used $\gamma = 0.9$, $\delta = 0.9$ and a planning budget $K = 100$. The disturbances were sampled uniformly in $[-0.1, 0.1]^r$, while the measurements are Gaussian with covariance $\Sigma_s = 0.1^2 I_s$.
States The system is described by its position $(p_x, p_y)$ and velocity $(v_x, v_y)$:
\[
x = [p_x \;\; p_y \;\; v_x \;\; v_y]^\top
\]
Actions It is acted upon by means of horizontal and vertical forces $u = (u_x, u_y) \in [-1, 1]^2$. We discretise the action space into four constant controls, one for each direction:
\[
A = \{(-1,-1), (-1,1), (1,-1), (1,1)\}
\]
Reward The reward encodes the task of navigating to reach a goal state $x_g$ while avoiding collisions with obstacles:
\[
R(x) = \delta(x)/(1 + \|x - x_g\|_2),
\]
where $\delta(x)$ is 0 whenever $x$ collides with an obstacle, and 1 otherwise.
Dynamics The system dynamics consist of a double integrator with friction parameters $(\theta_x, \theta_y)$:
\[
\begin{bmatrix}\dot{p}_x \\ \dot{p}_y \\ \dot{v}_x \\ \dot{v}_y\end{bmatrix} =
\begin{bmatrix}0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & -\theta_x & 0 \\ 0 & 0 & 0 & -\theta_y\end{bmatrix}
\begin{bmatrix}p_x \\ p_y \\ v_x \\ v_y\end{bmatrix} +
\begin{bmatrix}0 \\ 0 \\ u_x \\ u_y\end{bmatrix}.
\]
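A minimal simulation sketch of these double-integrator dynamics (the friction parameters and Euler step size below are illustrative values, not the experiment's):

```python
import numpy as np

# Sketch of the double-integrator dynamics with friction parameters (theta_x, theta_y).
def dynamics_matrix(theta_x, theta_y):
    return np.array([[0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0],
                     [0.0, 0.0, -theta_x, 0.0],
                     [0.0, 0.0, 0.0, -theta_y]])

def step(x, u, theta=(0.5, 0.5), dt=0.05):
    # One explicit Euler step of xdot = A(theta) x + B u, state x = (p_x, p_y, v_x, v_y).
    A = dynamics_matrix(*theta)
    B = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    return x + dt * (A @ x + B @ u)
```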
DQN baseline In addition to the state, knowledge of the obstacles is encoded in the observation as
an angular grid of laser-like distance measurements, as well as the goal location relative to the system
position. As a model for the Q-function, we used a Multi-Layer Perceptron with two hidden layers of
size 100. An ε-greedy strategy was used for exploration.
In the following, we describe the structure of the dynamical system f representing the couplings and
interactions between several vehicles.
20
States In addition to the ego-vehicle, the scene contains $V$ other vehicles. Any vehicle $i \in [0, V]$ is represented by its position $(x_i, y_i)$, its forward velocity $v_i$ and its heading $\psi_i$. The resulting joint state is the traffic description: $x = (x_i, y_i, v_i, \psi_i)_{i\in[0,V]} \in \mathbb{R}^{4V+4}$.
Actions The ego-vehicle follows a fixed path, and the task consists in adapting its velocity by means of three actions $A = \{\text{faster}, \text{constant velocity}, \text{slower}\}$. They are achieved by a longitudinal linear controller that tracks the desired velocity $v_0$, as described below in the system dynamics.
Dynamics The kinematics of any vehicle $i \in [V]$ are represented by the Kinematic Bicycle Model:
\begin{align*}
\dot{x}_i &= v_i\cos(\psi_i), \\
\dot{y}_i &= v_i\sin(\psi_i), \\
\dot{v}_i &= a_i, \\
\dot{\psi}_i &= \frac{v_i}{l}\sin(\beta_i),
\end{align*}
where $(x_i, y_i)$ is the vehicle position, $v_i$ is its forward velocity, $\psi_i$ is its heading, $l$ is the vehicle half-length, $a_i$ is the acceleration command and $\beta_i$ is the slip angle at the centre of gravity, used as a steering command.
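A minimal Euler-integration sketch of this model (the half-length and step size are illustrative values):

```python
import numpy as np

# Sketch: one Euler step of the Kinematic Bicycle Model, state = (x, y, v, psi),
# with acceleration command a and slip angle beta.
def bicycle_step(state, a, beta, l=2.5, dt=0.1):
    x, y, v, psi = state
    return np.array([x + dt * v * np.cos(psi),
                     y + dt * v * np.sin(psi),
                     v + dt * a,
                     psi + dt * (v / l) * np.sin(beta)])
```

With zero slip angle the vehicle drives straight, which gives a quick sanity check.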
Lateral dynamics The lane $L_i$, with lateral position $y_{L_i}$ and heading $\psi_{L_i}$, is tracked by a cascade controller of lateral position and heading $\beta_i$, which is selected in a way that the closed-loop dynamics take the form:
\begin{align}
\dot{\psi}_i &= \theta_{i,5}\left(\psi_{L_i} + \sin^{-1}\left(\frac{\tilde{v}_{i,y}}{v_i}\right) - \psi_i\right), \quad (24) \nonumber\\
\tilde{v}_{i,y} &= \theta_{i,4}(y_{L_i} - y_i). \nonumber
\end{align}
We assume that the drivers choose their steering command $\beta_i$ such that (24) is always achieved: $\beta_i = \sin^{-1}\left(\frac{l}{v_i}\dot{\psi}_i\right)$.
LPV formulation The system presented so far is non-linear and must be cast into the LPV form. We approximate the non-linearities induced by the trigonometric operators through equilibrium linearisation around $y_i = y_{L_i}$ and $\psi_i = \psi_{L_i}$.
This yields the following longitudinal dynamics:
\begin{align*}
\dot{x}_i &= v_i, \\
\dot{v}_i &= \theta_{i,1}(v_0 - v_i) + \theta_{i,2}(v_{f_i} - v_i) + \theta_{i,3}(x_{f_i} - x_i - d_0 - v_i T),
\end{align*}
where $\theta_{i,2}$ and $\theta_{i,3}$ are set to 0 whenever the corresponding features are not active.
It can be rewritten in the form
\[
\dot{X} = A(\theta)(X - X_c) + \omega.
\]
For example, in the case of two vehicles only:
\[
X = \begin{bmatrix} x_i \\ x_{f_i} \\ v_i \\ v_{f_i} \end{bmatrix}, \quad
X_c = \begin{bmatrix} -d_0 - v_0 T \\ 0 \\ v_0 \\ v_0 \end{bmatrix}, \quad
\omega = \begin{bmatrix} v_0 \\ v_0 \\ 0 \\ 0 \end{bmatrix},
\]
\[
A(\theta) = \begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
-\theta_{i,3} & \theta_{i,3} & -\theta_{i,1} - \theta_{i,2} - \theta_{i,3} & \theta_{i,2} \\
0 & 0 & 0 & -\theta_{f_i,1}
\end{bmatrix},
\]
where the rows and columns are ordered as $(x_i, x_{f_i}, v_i, v_{f_i})$.
Change of coordinates In both cases, the obtained polytope centre $A_N$ is non-Metzler. We use the similarity transformation of coordinates of Efimov et al. [15]. Precisely, we choose $\Theta$ such that for any $\theta \in \Theta$, $A(\theta)$ is always diagonalisable with real eigenvalues, and perform an eigendecomposition to compute its change-of-basis matrix $Z$. The transformed system $X' = Z^{-1}(X - X_c)$ verifies (2) with $A_N$ Metzler, as required to apply the interval predictor of Proposition 3. Finally, the obtained predictor is transformed back to the original coordinates through $Z$ by using the following lemma:
Lemma 6 (Interval arithmetic of Efimov et al. [14]). Let $x \in \mathbb{R}^n$ be a vector variable, $\underline{x} \le x \le \overline{x}$ for some $\underline{x}, \overline{x} \in \mathbb{R}^n$.
1. If $A \in \mathbb{R}^{m\times n}$ is a constant matrix, then
\[
A^+\underline{x} - A^-\overline{x} \le Ax \le A^+\overline{x} - A^-\underline{x}. \quad (25)
\]
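This interval arithmetic is direct to implement, writing $A^+ = \max(A, 0)$ and $A^- = A^+ - A$ elementwise. A sketch of (25):

```python
import numpy as np

# Sketch: interval matrix-vector product. Whenever xl <= x <= xu componentwise,
# A x is bounded by [A+ xl - A- xu, A+ xu - A- xl] with A+ = max(A, 0), A- = max(-A, 0).
def interval_matvec(A, xl, xu):
    Ap, Am = np.maximum(A, 0.0), np.maximum(-A, 0.0)
    return Ap @ xl - Am @ xu, Ap @ xu - Am @ xl
```

Sampling states inside the box confirms that their images stay inside the computed interval.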
DQN baseline In order to avoid discontinuities in the vehicles' headings, the state is encoded as $x = (x_i, y_i, v_i^x, v_i^y, \cos\psi_i, \sin\psi_i)_{i\in[0,V]} \in \mathbb{R}^{6V+6}$, with the ego-vehicle always in the first position. As a model for the Q-function, we used the Social Attention architecture from [26], which supports an arbitrary number of vehicles as input and enforces invariance to their order.
Proof. The ellipsoid in (8) is described by:
\begin{align*}
\theta \in C_\delta &\implies (\theta - \theta_{N,\lambda})^\top G_{N,\lambda}(\theta - \theta_{N,\lambda}) \le \beta_N(\delta)^2 \\
&\implies (\theta' - \theta'_{N,\lambda})^\top D (\theta' - \theta'_{N,\lambda}) \le \beta_N(\delta)^2 \\
&\implies \sum_{i=1}^{d} D_{i,i}(\theta'_i - \theta'_{N,\lambda,i})^2 \le \beta_N(\delta)^2 \\
&\implies \forall i, \; |\theta'_i - \theta'_{N,\lambda,i}| \le D_{i,i}^{-1/2}\beta_N(\delta)
\end{align*}
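This conversion is easy to implement with an eigendecomposition $G_{N,\lambda} = V D V^\top$, working in the rotated coordinates $\theta' = V^\top\theta$. A sketch (names are illustrative):

```python
import numpy as np

# Sketch: enclose the confidence ellipsoid (theta - c)^T G (theta - c) <= beta^2
# in an axis-aligned box in the rotated coordinates theta' = V^T theta.
def ellipsoid_to_box(G, center, beta):
    D, V = np.linalg.eigh(G)              # G = V diag(D) V^T, D > 0
    half_widths = beta / np.sqrt(D)       # per-coordinate bound beta / sqrt(D_ii)
    c = V.T @ center
    return V, c - half_widths, c + half_widths
```

Points sampled on the boundary of the ellipsoid stay inside the box, as the proof above requires.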
This would not recover the robust policy, as we show in Figure 6 with a simple counter-example.
Figure 6: From left to right: two simple models and corresponding U-values with optimal sequences in blue; the naive version of the robust values returns sub-optimal paths in red; our robust U-value properly recovers the robust policy in green.