RL in Continuous Spaces
(with slides by Ali Nouri)
Assumptions for Continuous Spaces
- Transitions can be written in the following form (illustrated below):
  s_{t+1} = f(s_t, a_t) + ω
  where ω is drawn from a known distribution. [Note: a recent work shows that learning the noise is not very detrimental to algorithms [BLLLR08].]
- Transition and reward functions are Lipschitz continuous [CT91]:
  ||f(s_1, a) − f(s_2, a)|| ≤ C_T ||s_1 − s_2||
  ||R(s_1) − R(s_2)|| ≤ C_R ||s_1 − s_2||
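A minimal sketch of this assumption: a hypothetical Lipschitz dynamics function f plus noise ω drawn from a known (here, Gaussian) distribution. All names and constants are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, a):
    # Hypothetical deterministic dynamics; Lipschitz in s because the
    # gradient with respect to s is bounded (C_T = 0.9 here).
    return 0.9 * s + 0.1 * a

def step(s, a, noise_std=0.05):
    # Next state = deterministic part + noise from a known distribution.
    w = rng.normal(0.0, noise_std, size=s.shape)
    return f(s, a) + w

s_next = step(np.zeros(2), np.ones(2))
```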
MDP Properties (cont'd)
- The optimal value function satisfies the following Bellman equation [P94]:
- Finite state space:
  V*(s) = max_a [ R(s) + γ Σ_{s'} T(s'|s, a) V*(s') ]
- Continuous state space:
  V*(s) = max_a [ R(s) + γ ∫ T(s'|s, a) V*(s') ds' ]
Solving MDPs
- Finite state space [P94]:
  - Value iteration (see the sketch below)
  - Policy iteration
  - Linear programming
- Continuous state space:
  - Convert to a finite MDP by discretization and solve accordingly (relatively very expensive)
  - Fitted value iteration [G95]
  - Forward search: sparse sampling, UCT [KMN02, KS06]
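The finite-state solvers above are standard; here is a minimal value-iteration sketch for a tabular MDP, assuming hypothetical arrays T[s, a, s'] (transition probabilities) and R[s] (rewards).

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    # T: (S, A, S) transition tensor; R: (S,) reward vector.
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s] + gamma * sum_{s'} T[s, a, s'] * V[s']
        Q = R[:, None] + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)  # values and greedy policy
        V = V_new
```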
RL in Continuous Spaces
- All the algorithms we have discussed so far use tables to store the problem's parameters.
- That is impossible in continuous spaces, with infinitely many states and/or actions.
- Some generalization method must be used.
- We can replace the tables with a function approximator, although this may bring problems.
Value-Function Approximation
- Several researchers tried to apply function approximation to state values a long time ago [T95, BM95].
- Boyan and Moore reported that some function approximators are not stable and result in divergence [BM95].
- Gordon showed that a very restricted set of function approximators is stable [G95].
Model Approximation
- Metric E3 [KKL03] provides an algorithm schema for how generalization can be done in a model-based setting without losing convergence guarantees.
- A few realizations of metric E3 exist for a subset of environments [SL07, JS06].
An Experimental Domain
- Factored state spaces:
  S = S_1 × S_2 × … × S_m
  S_1 = (s_11, s_12, …, s_1m1)
  S_2 = (s_21, s_22, …, s_2m2)
- State variables: dog x, dog y, dog angle, ball x, ball y (illustrated below)
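As a small illustration, a factored state is just a named vector of its component variables. The class below is ours; only the variable names come from the slide.

```python
from dataclasses import dataclass

@dataclass
class DogBallState:
    # One field per state variable of the factored space.
    dog_x: float
    dog_y: float
    dog_angle: float
    ball_x: float
    ball_y: float

s = DogBallState(dog_x=0.0, dog_y=0.0, dog_angle=90.0, ball_x=1.5, ball_y=2.0)
```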
Model Parameter Approximation (MPA)
- A factored learner for continuous spaces
Factored Learner in Continuous Domains
- Estimate the environment with a delta model [JS06]: instead of learning a mapping from S to S, learn a mapping from S to R^m such that s_{t+1} = s_t + f(s_t, a_t) + ω.
- Input the dependency graph of the variables in the form of a DBN.
- Construct a function approximator for each state variable.
- Train each function approximator with samples in the delta transition model.
- Allow function approximators to return an "I don't know" value if there is not enough support (see the sketch below).
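A minimal sketch of one state variable's delta learner with an "I don't know" answer. The nearest-neighbor regressor is a stand-in for whatever approximator is actually used (e.g., LWPR); the radius and support threshold are hypothetical parameters.

```python
import numpy as np

IDK = None  # sentinel for "I don't know"

class DeltaApproximator:
    """Nearest-neighbor stand-in for one state variable's delta model."""
    def __init__(self, radius=0.5, min_support=5):
        self.X, self.y = [], []
        self.radius, self.min_support = radius, min_support

    def train(self, x, delta):
        # x: the variable's parents under the DBN; delta: s_{t+1}(i) - s_t(i)
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(float(delta))

    def predict(self, x):
        if len(self.X) < self.min_support:
            return IDK
        dists = np.linalg.norm(np.asarray(self.X) - np.asarray(x, dtype=float), axis=1)
        near = dists < self.radius
        if near.sum() < self.min_support:
            return IDK  # not enough support in the neighborhood
        return float(np.asarray(self.y)[near].mean())
```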
[Diagram: per-action DBN dependency graphs over the state variables (dog x, dog y, dog angle, ball x, ball y), showing which current variables each next-state variable depends on.]
MPA
- Input dependency graphs d_{i,j} for each action i and target state variable j.
- Create a transition function approximator Φ_{i,j} for each action i and target state variable j.
- Create a reward function approximator ρ.
- While learning continues (sketched below):
  - Observe an interaction in the form (s_t, a_t, s_{t+1}, r_{t+1}).
  - Train Φ_{a_t,i} with ( d_{a_t,i}(s_t), s_{t+1}(i) − s_t(i) ).
  - Train ρ with (s_{t+1}, r_{t+1}).
  - If t ≡ 0 (mod intervalTime), let π = plan(Φ, ρ, Rmax).
  - Return π(s_{t+1}).
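A sketch of the learning loop above, assuming per-(action, variable) approximators like the DeltaApproximator sketched earlier, a reward approximator rho, and a plan() routine as on the next slide. The environment interface (reset/step/sample_action) and the dependency selectors d are placeholders, not the authors' code.

```python
def mpa_learn(env, d, phi, rho, plan, Rmax, interval_time):
    # d[a][i]: selects the DBN parents of variable i under action a
    # phi[a][i]: delta-model approximator for variable i under action a
    s, t = env.reset(), 0
    pi = lambda state: env.sample_action()  # initial policy: act randomly
    while True:
        a = pi(s)
        s_next, r = env.step(a)
        for i in range(len(s)):  # train each target variable's model
            phi[a][i].train(d[a][i](s), s_next[i] - s[i])
        rho.train(s_next, r)
        if t % interval_time == 0:  # replan periodically
            pi = plan(phi, rho, Rmax)
        s, t = s_next, t + 1
```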
MPA Planning
- Input Φ, ρ, and Rmax.
- Partition the state space using non-overlapping cells with resolution δ.
- Create a fictitious state s_f with reward Rmax and self-looping actions.
- For each cell in the partition, generate k uniformly distributed samples (O_1, …, O_k).
- Let R(O_i) = ρ(O_i).
- For each action a_i, let P_ij = Φ(O_j, a_i) and let cell(s) be the cell containing s.
  - Using the P_ij's, construct a maximum-likelihood transition function T(·, a_i).
  - If P_ij = IDK, then P_ij = s_f.
- Construct the finite MDP M = ⟨O, A, T, R, γ⟩.
- Solve M using the conventional value iteration algorithm (sketched below).
- Return the best policy π.
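A sketch of the planning step in one dimension, under the reconstruction above: k uniform samples per cell, IDK predictions routed to the optimistic fictitious state s_f, and value iteration on the induced finite MDP. The 1-D grid and the deterministic nearest-sample transition are our simplifications of the maximum-likelihood construction.

```python
import numpy as np

def mpa_plan_1d(phi, rho, Rmax, actions, lo, hi, n_cells, k, gamma=0.95):
    rng = np.random.default_rng(0)
    edges = np.linspace(lo, hi, n_cells + 1)
    # k uniform samples per cell; each sample O_j is a state of the finite MDP.
    O = np.concatenate([rng.uniform(edges[c], edges[c + 1], k)
                        for c in range(n_cells)])
    sf = len(O)                              # index of fictitious state s_f
    n = len(O) + 1
    R = np.array([rho(o) for o in O] + [Rmax])
    T = np.zeros((n, len(actions), n))
    T[sf, :, sf] = 1.0                       # s_f self-loops under every action
    for j, o in enumerate(O):
        for ai, a in enumerate(actions):
            pred = phi(o, a)                 # predicted next state, or None (IDK)
            if pred is None:
                T[j, ai, sf] = 1.0           # unknown -> optimistic s_f
            else:
                T[j, ai, np.abs(O - pred).argmin()] = 1.0
    V = np.zeros(n)
    for _ in range(500):                     # conventional value iteration
        V = (R[:, None] + gamma * T @ V).max(axis=1)
    Q = R[:, None] + gamma * T @ V
    return lambda s: actions[int(Q[np.abs(O - s).argmin()].argmax())]
```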
Results for Robot Navigation
- Results when the BumbleBall is NOT in the environment.
- LWPR used as the function approximator [VSS00].
- Results averaged over 10 runs.
Multi-resolution Exploration (MRE)
A Discretization-based Exploration
- Discretize the state space and use the samples in each cell to tag that region as known or unknown (see the sketch below).
- We can use it for implementing metric E3.
- It is computationally faster than maintaining hyper-spheres.
- Trade-off: finer cells are more accurate but generalize less; coarser cells generalize more but are less accurate.
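A minimal sketch of the cell bookkeeping: count the samples falling in each fixed-resolution cell and tag a cell as known once it holds enough of them. The resolution h and threshold m are hypothetical parameters.

```python
import numpy as np
from collections import defaultdict

class KnownnessGrid:
    def __init__(self, h=0.25, m=10):
        self.h, self.m = h, m
        self.counts = defaultdict(int)

    def cell(self, s):
        # Map a continuous state to the integer index of its grid cell.
        return tuple(np.floor(np.asarray(s) / self.h).astype(int))

    def add(self, s):
        self.counts[self.cell(s)] += 1

    def known(self, s):
        # A region is "known" once it holds at least m samples.
        return self.counts[self.cell(s)] >= self.m
```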
Multi-resolution Discretization
- Allow different levels of generalization depending on how many samples exist in the neighborhood.
- Get more accurate estimates in the parts of the state space we care about most.
- Allow more generalization in places where we do not need much accuracy.
Kd-tree Structure
- It partitions the state space into variable-sized regions.
- The root covers the whole space; each node selects one of the dimensions and splits the space into two half-spaces, producing two children.
- We split at the median of the points along that direction.
- We stop once the number of points inside a region is less than a threshold (see the sketch below).
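A sketch of this construction, assuming the common rule of cycling through dimensions by depth (the slide leaves the dimension-selection rule open).

```python
import numpy as np

class KDNode:
    def __init__(self, points, depth=0, threshold=10):
        self.points, self.left, self.right = points, None, None
        if len(points) < threshold:
            return                              # leaf: region has few points
        dim = depth % points.shape[1]           # cycle through dimensions
        median = np.median(points[:, dim])      # split at the median
        mask = points[:, dim] <= median
        if mask.all() or not mask.any():
            return                              # degenerate split: stay a leaf
        self.dim, self.split = dim, median
        self.left = KDNode(points[mask], depth + 1, threshold)
        self.right = KDNode(points[~mask], depth + 1, threshold)

tree = KDNode(np.random.rand(200, 2), threshold=10)
```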
Knownness for MRE
- Make knownness a function of the size of, and the number of points in, the cell:
- Define a target resolution, σ, and a number of desired points per cell, ν, based on ε, δ, and the smoothness parameters.
- Define the smallness of a cell χ to be a function that goes from 0 to 1 as the size of the cell decreases from ||S|| to σ.
- Define knownness to be knownness(χ) = (|O| / ν) · smallness(χ), as sketched below.
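A sketch of these definitions. The linear shape of smallness and the cap at 1 are our assumptions; the slide only fixes the endpoints (0 at ||S||, 1 at σ) and the product form.

```python
def smallness(cell_size, space_size, sigma):
    # 0 when the cell spans the whole space ||S||, 1 once it shrinks to sigma.
    if cell_size <= sigma:
        return 1.0
    return (space_size - cell_size) / (space_size - sigma)

def knownness(n_samples, nu, cell_size, space_size, sigma):
    # knownness(cell) = (|O| / nu) * smallness(cell), capped at 1.
    return min(1.0, n_samples / nu) * smallness(cell_size, space_size, sigma)
```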
Model-based MRE
- Let O_{sa} be the set of samples for the (s, a) pair.
- Let s_f be a fictitious state with value = Vmax.
- Build the transition function as follows (sketched below):
  T(s'|s, a) = k(s, a) · T̂(s'|s, a)
  T(s_f|s, a) = 1 − k(s, a)
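A sketch of the mixed transition row: the learned estimate T̂ is scaled by the knownness k(s, a), and the remaining probability mass is sent to the optimistic fictitious state s_f.

```python
import numpy as np

def mre_transition(T_hat_row, k_sa):
    # T_hat_row: estimated next-state distribution over the real states;
    # the returned row appends s_f as its last entry.
    row = k_sa * np.asarray(T_hat_row, dtype=float)
    return np.append(row, 1.0 - k_sa)  # remaining mass goes to s_f
```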
Properties of Model-based MRE
- It is independent of the function approximator used.
- It is independent of the planning algorithm.
- Result: it is PAC.
Results for Mountain Car
- MountainCar is a 2D environment.
- Results averaged over 20 runs.