Convergence of Stochastic Iterative
Dynamic Programming Algorithms

Tommi Jaakkola*, Michael I. Jordan, Satinder P. Singh
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong.

1 INTRODUCTION
Learning to predict the future and to find an optimal way of controlling it are the
basic goals of learning systems that interact with their environment. A variety of
algorithms are currently being studied for the purposes of prediction and control
in incompletely specified, stochastic environments. Here we consider learning algorithms defined in Markov environments. There are actions or controls (u) available to the learner that affect both the state transition probabilities and the probability distribution for the immediate, state-dependent costs c_i(u) incurred by the learner. Let p_ij(u) denote the probability of a transition to state j when control u is executed in state i. The learning problem is to predict the expected cost of a

* E-mail: tommi@[Link]


fixed policy p (a function from states to actions), or to obtain the optimal policy
(p*) that minimizes the expected cost of interacting with the environment.
If the learner were allowed to know the transition probabilities as well as the imme-
diate costs the control problem could be solved directly by Dynamic Programming
(see e.g., Bertsekas, 1987). However, when the underlying system is only incompletely known, algorithms such as Q-learning (Watkins, 1989) for prediction and control, and TD(λ) (Sutton, 1988) for prediction, are needed.
One of the central problems in developing a theoretical understanding of these
algorithms is to characterize their convergence; that is, to establish under what
conditions they are ultimately able to obtain correct predictions or optimal control
policies. The stochastic nature of these algorithms immediately suggests the use
of stochastic approximation theory to obtain the convergence results. However, there exist no directly applicable stochastic approximation techniques for problems involving the maximum norm, which plays a crucial role in learning algorithms based on DP.
In this paper, we extend Dvoretzky's (1956) formulation of the classical Robbins-Monro (1951) stochastic approximation theory to obtain a class of converging processes involving the maximum norm. In addition, we show that Q-learning and both the on-line and batch versions of TD(λ) are realizations of this new class. This approach keeps the convergence proofs simple and does not rely on constructions specific to particular algorithms. Several other authors have recently presented results that are similar to those presented here: Dayan and Sejnowski (1993) for TD(λ), Peng and Williams (1993) for TD(λ), and Tsitsiklis (1993) for Q-learning. Our results appear to be closest to those of Tsitsiklis (1993).

2 Q-LEARNING
The Q-learning algorithm produces values, called "Q-values", by which an optimal action can be determined at any state. The algorithm is based on DP by rewriting Bellman's equation such that there is a value assigned to every state-action pair instead of only to a state. Thus the Q-values satisfy

Q(s,u) = c̄_s(u) + γ Σ_{s'} p_{ss'}(u) max_{u'} Q(s',u')    (1)

where c̄ denotes the mean of c. The solution to this equation can be obtained by updating the Q-values iteratively, an approach known as the value iteration method. In the learning problem the values for the mean of c and for the transition probabilities are unknown. However, the observable quantity

c_{s_t}(u_t) + γ max_u Q(s_{t+1}, u)    (2)

where s_t and u_t are the state of the system and the action taken at time t, respectively, is an unbiased estimate of the update used in value iteration. The Q-learning algorithm is a relaxation method that uses this estimate iteratively to update the current Q-values (see below).
The Q-learning algorithm converges mainly due to the contraction property of the value iteration operator.
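As a concrete illustration of this relaxation scheme (a sketch, not part of the original paper), the following Python fragment runs tabular Q-learning on a small invented two-state MDP and compares the result against the fixed point of equation (1) computed by value iteration. The transition probabilities, costs, and step-size schedule are all assumptions made for the example; costs are taken deterministic at their means for simplicity.

```python
import random

random.seed(0)
gamma = 0.9

# Hypothetical 2-state, 2-action MDP (invented for illustration; not from the paper).
# P[s][u][j] = transition probability p_sj(u); c[s][u] = mean immediate cost.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
c = [[1.0, 0.5], [2.0, 0.3]]

# Q* from value iteration on equation (1), for comparison.
Qstar = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(1000):
    Qstar = [[c[s][u] + gamma * sum(P[s][u][j] * max(Qstar[j]) for j in range(2))
              for u in range(2)] for s in range(2)]

# Q-learning: relax the current Q-values toward the sampled target of equation (2).
Q = [[0.0, 0.0], [0.0, 0.0]]
s = 0
for t in range(200000):
    u = random.randrange(2)                        # exploring policy: uniform actions
    s2 = 0 if random.random() < P[s][u][0] else 1  # sampled transition
    target = c[s][u] + gamma * max(Q[s2])          # unbiased estimate, equation (2)
    alpha = 1000.0 / (1000.0 + t)                  # step sizes: sum diverges, sum of squares converges
    Q[s][u] = (1 - alpha) * Q[s][u] + alpha * target
    s = s2

err = max(abs(Q[s][u] - Qstar[s][u]) for s in range(2) for u in range(2))
```

With this step-size schedule the learned table settles near Q*, illustrating the contraction toward the fixed point of equation (1).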

2.1 CONVERGENCE OF Q-LEARNING

Our proof is based on the observation that the Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic approximation are generally applicable. Due to the lack of a formulation of stochastic approximation for the maximum norm, however, we need to slightly extend the standard results. This is accomplished by the following theorem, the proof of which can be found in Jaakkola et al. (1993).

Theorem 1 A random iterative process Δ_{n+1}(x) = (1 − α_n(x)) Δ_n(x) + β_n(x) F_n(x) converges to zero w.p.1 under the following assumptions:

1) The state space is finite.

2) Σ_n α_n(x) = ∞, Σ_n α_n²(x) < ∞, Σ_n β_n(x) = ∞, Σ_n β_n²(x) < ∞, and E{β_n(x) | P_n} ≤ E{α_n(x) | P_n} uniformly w.p.1.

3) ‖E{F_n(x) | P_n}‖_W ≤ γ ‖Δ_n‖_W, where γ ∈ (0,1).

4) Var{F_n(x) | P_n} ≤ C(1 + ‖Δ_n‖_W)², where C is some constant.

Here P_n = {Δ_n, Δ_{n−1}, ..., F_{n−1}, ..., α_{n−1}, ..., β_{n−1}, ...} stands for the past at step n. F_n(x), α_n(x) and β_n(x) are allowed to depend on the past insofar as the above conditions remain valid. The notation ‖·‖_W refers to some weighted maximum norm.
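To make the statement concrete, here is a toy simulation (not from the paper) of a process satisfying the conditions of Theorem 1 with the unweighted maximum norm: the conditional mean of F_n(x) is a contraction of Δ_n in the max norm (condition 3), the noise variance is bounded (condition 4), and taking β_n = α_n = n^(−0.7) satisfies condition 2. The particular form of F_n below is invented for illustration.

```python
import random

random.seed(0)
gamma = 0.9
delta = [1.0, 2.0, 3.0, 4.0, 5.0]       # arbitrary nonzero starting values Delta_0(x)

for n in range(1, 50001):
    alpha = n ** -0.7                    # sum of alpha_n diverges, sum of squares converges
    norm = max(abs(d) for d in delta)    # ||Delta_n|| in the maximum norm
    for x in range(5):
        # F_n(x): conditional mean has magnitude 0.75*gamma*||Delta_n|| <= gamma*||Delta_n||
        # (condition 3); the added Gaussian noise keeps the variance within condition 4.
        f = gamma * norm * (1 if delta[x] >= 0 else -1) * random.uniform(0.5, 1.0) \
            + random.gauss(0.0, 0.1)
        delta[x] = (1 - alpha) * delta[x] + alpha * f

final = max(abs(d) for d in delta)
```

Running this drives max_x |Δ_n(x)| toward zero, as the theorem guarantees.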

In applying the theorem, the Δ_n process will generally represent the difference between a stochastic process of interest and some optimal value (e.g., the optimal value function). The formulation of the theorem therefore requires knowledge to be available about the optimal solution to the learning problem before it can be applied to any algorithm whose convergence is to be verified. In the case of Q-learning the required knowledge is available through the theory of DP and Bellman's equation in particular.
The convergence of the Q-learning algorithm now follows easily by relating the algorithm to the converging stochastic process defined by Theorem 1.¹

Theorem 2 The Q-learning algorithm given by

Q_{t+1}(s_t, u_t) = (1 − α_t(s_t, u_t)) Q_t(s_t, u_t) + α_t(s_t, u_t)[c_{s_t}(u_t) + γ V_t(s_{t+1})]

where V_t(s) = max_u Q_t(s, u), converges to the optimal Q*(s,u) values if

1) The state and action spaces are finite.

2) Σ_t α_t(s,u) = ∞ and Σ_t α_t²(s,u) < ∞ uniformly w.p.1.

3) Var{c_s(u)} is bounded.
¹ We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its generality, however, allows it to be applied to other algorithms as well (see the following section on TD(λ)).

4) If γ = 1, all policies lead to a cost-free terminal state w.p.1.

Proof. By subtracting Q*(s,u) from both sides of the learning rule and by defining Δ_t(s,u) = Q_t(s,u) − Q*(s,u) together with

F_t(s,u) = c_s(u) + γ V_t(s') − Q*(s,u)    (3)

where s' is the sampled successor state of s under control u, the Q-learning algorithm can be seen to have the form of the process in Theorem 1 with β_t(s,u) = α_t(s,u).
To verify that F_t(s,u) has the required properties we begin by showing that it is a contraction mapping with respect to some maximum norm. This is done by relating F_t to the DP value iteration operator for the same Markov chain. More specifically,

max_u |E{F_t(i,u)}| ≤ γ max_u Σ_j p_ij(u) max_v |Q_t(j,v) − Q*(j,v)| = γ max_u Σ_j p_ij(u) V_Δ(j) = T(V_Δ)(i)

where we have used the notation V_Δ(j) = max_v |Q_t(j,v) − Q*(j,v)| and T is the DP value iteration operator for the case where the costs associated with each state are zero. If γ < 1 the contraction property of E{F_t(i,u)} can be obtained by bounding Σ_j p_ij(u) V_Δ(j) by max_j V_Δ(j) and then including the γ factor. When the future costs are not discounted (γ = 1) but the chain is absorbing and all policies lead to the terminal state w.p.1 there still exists a weighted maximum norm with respect to which T is a contraction mapping (see e.g. Bertsekas & Tsitsiklis, 1989), thereby forcing the contraction of E{F_t(i,u)}. The variance of F_t(s,u) given the past is within the bounds of Theorem 1 as it depends on Q_t(s,u) at most linearly and the variance of c_s(u) is bounded.
Note that the proof covers both the on-line and batch versions. □
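The contraction property invoked in the proof can also be checked numerically. The sketch below (an illustration under assumed random data, not the paper's construction) builds an arbitrary MDP and verifies that the value iteration operator of equation (1) shrinks the maximum-norm distance between any two Q-tables by at least the factor γ.

```python
import random

random.seed(1)
gamma = 0.9
S, U = 3, 2  # hypothetical 3-state, 2-action MDP with random rows (invented)

def random_dist(n):
    # A random probability distribution over n outcomes.
    w = [random.random() for _ in range(n)]
    t = sum(w)
    return [x / t for x in w]

P = [[random_dist(S) for _ in range(U)] for _ in range(S)]
c = [[random.uniform(0.0, 2.0) for _ in range(U)] for _ in range(S)]

def bellman(Q):
    # (TQ)(i,u) = c_i(u) + gamma * sum_j p_ij(u) * max_v Q(j,v), as in equation (1).
    return [[c[i][u] + gamma * sum(P[i][u][j] * max(Q[j]) for j in range(S))
             for u in range(U)] for i in range(S)]

def max_norm_diff(Q1, Q2):
    return max(abs(Q1[i][u] - Q2[i][u]) for i in range(S) for u in range(U))

# Check ||TQ1 - TQ2|| <= gamma * ||Q1 - Q2|| for many random pairs of Q-tables.
ok = True
for _ in range(100):
    Q1 = [[random.uniform(-5, 5) for _ in range(U)] for _ in range(S)]
    Q2 = [[random.uniform(-5, 5) for _ in range(U)] for _ in range(S)]
    ok = ok and (max_norm_diff(bellman(Q1), bellman(Q2))
                 <= gamma * max_norm_diff(Q1, Q2) + 1e-9)
```

The inequality holds for every pair, reflecting that the γ-contraction is a property of the operator itself, not of any particular Q-table.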

3 THE TD(λ) ALGORITHM


TD(λ) (Sutton, 1988) is also a DP-based learning algorithm that is naturally defined in a Markov environment. Unlike Q-learning, however, TD does not involve decision-making tasks but rather predictions about the future costs of an evolving system. TD(λ) converges to the same predictions as a version of Q-learning in which there is only one action available at each state, but the algorithms are derived from slightly different grounds and their behavioral differences are not well understood.
The algorithm is based on the estimates

V_t^λ(i) = (1 − λ) Σ_{n=1}^∞ λ^{n−1} V_t^{(n)}(i)    (4)

where V_t^{(n)}(i) are n-step look-ahead predictions. The expected values of the V_t^λ(i) are strictly better estimates of the correct predictions than the V_t(i) are (see

Jaakkola et al., 1993) and the update equation of the algorithm

V_{t+1}(i_t) = V_t(i_t) + α_t [V_t^λ(i_t) − V_t(i_t)]    (5)

can be written in a practical recursive form as is seen below. The convergence of the algorithm is mainly due to the statistical properties of the V_t^λ(i) estimates.
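A minimal sketch of the recursive (eligibility-trace) form alluded to above, not taken from the paper: on-line TD(λ) on an invented absorbing chain with unit costs, where the trace e(i) accumulates the factors (γλ)^{t−k} that weight past visits to state i. The chain, the step-size schedule, and the value of λ are all assumptions made for the example.

```python
import random

random.seed(0)
gamma, lam = 1.0, 0.8

# Hypothetical absorbing chain (invented): from state i in {0,1,2} move to i+1
# with probability 0.7, else stay; state 3 is terminal. Each step costs 1, so the
# correct prediction V*(i) is the expected number of steps to absorption.
Vstar = [3 / 0.7, 2 / 0.7, 1 / 0.7]

V = [0.0, 0.0, 0.0, 0.0]            # V[3] is the terminal state, fixed at 0
visits = [0, 0, 0]
for episode in range(20000):
    e = [0.0, 0.0, 0.0]             # eligibility traces: e[i] holds sum_k (gamma*lam)^(t-k)
    s = 0
    while s != 3:
        s2 = s + 1 if random.random() < 0.7 else s
        td = 1.0 + gamma * V[s2] - V[s]      # one-step TD error with unit cost
        visits[s] += 1
        e[s] += 1.0                          # mark the current visit
        for i in range(3):
            alpha = 100.0 / (100.0 + visits[i])   # per-state, asynchronous step sizes
            V[i] += alpha * td * e[i]             # recursive form of update (5)
            e[i] *= gamma * lam                   # decay all traces by gamma*lambda
        s = s2

err = max(abs(V[i] - Vstar[i]) for i in range(3))
```

The predictions settle near V*(i) = (3−i)/0.7, the expected cost-to-absorption of this toy chain.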

3.1 CONVERGENCE OF TD(λ)

As we are interested in strong forms of convergence we need to impose some new constraints, but due to the generality of the approach we can dispense with some others. Specifically, the learning rate parameters α_n are replaced by α_n(i), which satisfy Σ_n α_n(i) = ∞ and Σ_n α_n²(i) < ∞ uniformly w.p.1. These parameters allow asynchronous updating and they can, in general, be random variables. The convergence of the algorithm is guaranteed by the following theorem, which is an application of Theorem 1.

Theorem 3 For any finite absorbing Markov chain, for any distribution of starting states with no inaccessible states, and for any distributions of the costs with finite variances the TD(λ) algorithm given by

1) V_{n+1}(i) = V_n(i) + α_n(i) Σ_{t=1}^m [c_{i_t} + γ V_n(i_{t+1}) − V_n(i_t)] Σ_{k=1}^t (γλ)^{t−k} χ_i(k),

with Σ_n α_n(i) = ∞ and Σ_n α_n²(i) < ∞ uniformly w.p.1, or

2) V_{t+1}(i) = V_t(i) + α_t(i) [c_{i_t} + γ V_t(i_{t+1}) − V_t(i_t)] Σ_{k=1}^t (γλ)^{t−k} χ_i(k),

with Σ_t α_t(i) = ∞ and Σ_t α_t²(i) < ∞ uniformly w.p.1 and, within sequences, α_t(i)/max_{t∈S} α_t(i) → 1 uniformly w.p.1,

converges to the optimal predictions w.p.1 provided γ, λ ∈ [0,1] with γλ < 1. Here χ_i(k) indicates whether state i was visited at step k, m is the length of the sequence, and (1) and (2) are the batch and on-line versions of the algorithm, respectively.

Proof for (1): We use here a slightly different form for the learning rule (cf. the previous section):

V_{n+1}(i) = V_n(i) + α_n(i) [G_n(i) − (m(i)/E{m(i)}) V_n(i)]

with

G_n(i) = (1/E{m(i)}) Σ_{k=1}^{m(i)} V_n^λ(i; k)

where V_n^λ(i; k) is an estimate calculated at the k-th occurrence of state i in a sequence and for mathematical convenience we have made the transformation α_n(i) → E{m(i)} α_n(i), where m(i) is the number of times state i was visited during the sequence.

To apply Theorem 1 we subtract V*(i), the optimal predictions, from both sides of the learning equation. By identifying α_n(i) := α_n(i) m(i)/E{m(i)}, β_n(i) := α_n(i), and F_n(i) := G_n(i) − V*(i) m(i)/E{m(i)} we need to show that these satisfy the conditions of Theorem 1. For α_n(i) and β_n(i) this is obvious. We begin here by showing that F_n(i) indeed is a contraction mapping. To this end,

max_i |E{F_n(i) | V_n}| = max_i (1/E{m(i)}) |E{(V_n^λ(i;1) − V*(i)) + (V_n^λ(i;2) − V*(i)) + ··· | V_n}|

which can be bounded above by using the relation

|E{V_n^λ(i;k) − V*(i) | V_n}|
≤ E{ |E{V_n^λ(i;k) − V*(i) | m(i) ≥ k, V_n}| Θ(m(i) − k) | V_n }
≤ P{m(i) ≥ k} |E{V_n^λ(i) − V*(i) | V_n}|
≤ γ P{m(i) ≥ k} max_i |V_n(i) − V*(i)|

where Θ(x) = 0 if x < 0 and 1 otherwise. Here we have also used the fact that V_n^λ(i) is a contraction mapping independent of possible discounting. As Σ_k P{m(i) ≥ k} = E{m(i)} we finally get

max_i |E{F_n(i) | V_n}| ≤ γ max_i |V_n(i) − V*(i)|

The variance of F_n(i) can be seen to be bounded by C E{m⁴} max_i |V_n(i)|², where C is a constant. For any absorbing Markov chain the convergence to the terminal state is geometric and thus for every finite k, E{m^k} ≤ C(k), implying that the variance of F_n(i) is within the bounds of Theorem 1. As Theorem 1 is now applicable we can conclude that the batch version of TD(λ) converges to the optimal predictions w.p.1. □

Proof for (2): The proof for the on-line version is achieved by showing that the effect of the on-line updating vanishes in the limit, thereby forcing the two versions to be equal asymptotically. We view the on-line version as a batch algorithm in which the updates are made after each complete sequence but are made in such a manner as to be equal to those made on-line.
Define G′_n(i) = G_n(i) + Ĝ_n(i) to be a new batch estimate taking into account the on-line updating within sequences. Here G_n(i) is the batch estimate with the desired properties (see the proof for (1)) and Ĝ_n(i) is the difference between the two. We take the new batch learning rate parameters to be the maxima over a sequence, that is, α_n(i) = max_{t∈S} α_t(i). As all the α_t(i) satisfy the required conditions uniformly w.p.1 these new learning parameters satisfy them as well.
To analyze the new batch algorithm we divide it into three parallel processes: the batch TD(λ) with α_n(i) as learning rate parameters, the difference between this and the new batch estimate, and the change in the value function due to the updates made on-line. Under the conditions of the TD(λ) convergence theorem rigorous

upper bounds can be derived for the latter two processes (see Jaakkola et al., 1993). These results enable us to write

‖E{G′_n − V*}‖ ≤ ‖E{G_n − V*}‖ + ‖Ĝ_n‖ ≤ (γ′ + C¹_n) ‖V_n − V*‖ + C²_n

where C¹_n and C²_n go to zero w.p.1. This implies that for any ε > 0 and ‖V_n − V*‖ ≥ ε there exists Γ < 1 such that

‖E{G′_n − V*}‖ ≤ Γ ‖V_n − V*‖

for n large enough. This is the required contraction property of Theorem 1. In addition, it can readily be checked that the variance of the new estimate falls under the conditions of Theorem 1.
Theorem 1 now guarantees that for any ε the value function in the on-line algorithm converges w.p.1 into some ε-bounded region of V* and therefore the algorithm itself converges to V* w.p.1. □

4 CONCLUSIONS
In this paper we have extended results from stochastic approximation theory to cover asynchronous relaxation processes that have a contraction property with respect to some maximum norm (Theorem 1). This new class of converging iterative processes is shown to include both the Q-learning and TD(λ) algorithms in either their on-line or batch versions. We note that the convergence of the on-line version of TD(λ) has not been shown previously. We also wish to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD(λ) utilize only high-level statistical properties of the estimates used in these algorithms and do not rely on constructions specific to the algorithms. Our approach also sheds additional light on the similarities between Q-learning and TD(λ).
Although Theorem 1 is readily applicable to DP-based learning schemes, the theory of Dynamic Programming is important only for its characterization of the optimal solution and for a contraction property needed in applying the theorem. The theorem can be applied to iterative algorithms of different types as well.
Finally, we note that Theorem 1 can be extended to cover processes that do not show the usual contraction property, thereby increasing its applicability to algorithms of possibly more practical importance.

References

Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice-Hall.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362.
Dayan, P., & Sejnowski, T. J. (1993). TD(λ) converges with probability 1. CNL, The Salk Institute, San Diego, CA.
Dvoretzky, A. (1956). On stochastic approximation. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1993). On the convergence of stochastic iterative dynamic programming algorithms. Submitted to Neural Computation.
Peng, J., & Williams, R. J. (1993). TD(λ) converges with probability 1. Department of Computer Science preprint, Northeastern University.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400-407.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Tsitsiklis, J. N. (1993). Asynchronous stochastic approximation and Q-learning. Submitted to Machine Learning.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.
