Asada 94 B
A Reinforcement Learning
1 Introduction
Due to its globally perceptive capability, vision seems indispensable for autonomous agents to acquire reactive and purposive behaviors through interactions with the environment. The existing deliberative and incremental approaches in computer vision, however, do not seem to have made great advances in this context: these methods often need a huge amount of computation time, which is fatal for real-time execution of robot tasks, and they offer general descriptions of the scene that may need still more time to be transformed into the specific descriptions required to accomplish the task at hand. Moreover, such general descriptions are hard to evaluate properly unless the task or purpose of the agent is specified. From this viewpoint, a purposive, task-oriented, or so-called behavior-based approach seems promising for evaluating the role of vision (when, where, and what kind of information is necessary, and how accurate it should be) and, finally, for realizing autonomous agents.

Figure 1: The basic model of robot-environment interaction (the agent and the environment exchange state s, action a, and reward r).
Reinforcement learning has recently been receiving increased attention as a method for robot learning that requires little or no a priori knowledge and provides a high capability for reactive and adaptive behaviors [6]. Fig.1 shows the basic model of robot-environment interaction, in which the robot and the environment are modeled as two synchronized finite state automata interacting in a discrete-time cyclical process. The robot senses the current state of the environment and selects an action. Based on the state and the action, the environment makes a transition to a new state and generates a reward that is passed back to the robot. Through these interactions, the robot learns a purposive behavior to achieve a given goal.
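To make the interaction model of Fig.1 concrete, the following is a minimal sketch of the sense-act-reward loop in Python; the Environment class and all names here are illustrative assumptions, not part of the original system.

```python
import random

class Environment:
    """Illustrative stand-in for the world: maps (state, action) to (next state, reward)."""
    def __init__(self, states):
        self.states = states
        self.state = random.choice(states)

    def step(self, action):
        # A task-dependent stochastic transition and reward would go here.
        self.state = random.choice(self.states)
        reward = 1.0 if self.state == "goal" else 0.0
        return self.state, reward

def run_episode(env, policy, max_steps=100):
    """One discrete-time cycle per step: sense the state, select an action, receive a reward."""
    s = env.state
    for _ in range(max_steps):
        a = policy(s)              # the robot selects an action from the sensed state
        s_next, r = env.step(a)    # the environment transitions and passes back a reward
        # a learning rule would update the behavior from (s, a, r, s_next) here
        s = s_next
```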
Although the role of reinforcement learning is very important for realizing autonomous systems, the prominence of that role is largely determined by the extent to which it can be scaled to larger and more complex robot learning tasks. Many theoretical works have addressed the convergence time of the learning, how to speed it up with heuristics, and how to extend these techniques from single-goal tasks to multiple-goal tasks [7]. However, almost all of them have shown only computer simulations, and the few real robot applications that have been reported are simple and not very dynamic [8, 9]. In particular, the use of vision in reinforcement learning is very rare.
To the best of our knowledge, only Whitehead and Ballard [10] have argued this problem. Their task is a simple manipulation of blocks on a conveyer belt. Although each block is colored so that it can be easily discriminated, the state space is still large. To cope with this problem, they assumed that the observer could direct its gaze to an attended object so as to reduce the size of the state space. However, this causes the so-called “perceptual aliasing” problem: both the observer’s own motion and actual changes in the environment produce changes in the image captured by the observer, so it is difficult to discriminate between the two from the image alone. They therefore proposed a method to cope with this problem by adopting internal states and separating action commands into “Action frame” and “Attention frame” commands. Thus, they encoded a priori world knowledge into the state and action spaces.
In order to make the role of reinforcement learning evident in realizing autonomous agents, we need more applications in more dynamic and complex environments. In this paper, we propose a method by which a mobile robot acquires a purposive behavior, shooting a ball into a goal, through vision-based reinforcement learning. We apply the Q-learning method [11], one of the most widely used reinforcement learning schemes, to our problem. The robot is expected to learn a shooting behavior without world knowledge such as the 3-D locations and sizes of the goal and the ball or the kinematics and dynamics of the robot itself. All the information the robot can capture is the image positions of the ball and the goal, from which it can infer the changes in the world caused by its own actions.

The remainder of this article is structured as follows: In the next section, we give a brief overview of Q-learning. We then explain the task and the learning scheme of our method. Next, we show experiments with computer simulations and a real robot system. Finally, we give concluding discussions.

2 Q-learning

Before getting into the details of our system, we briefly review the basics of Q-learning. For a more thorough treatment, see [12]. We follow the explanation of Q-learning by Kaelbling [13].

We assume that the robot can discriminate the set S of distinct world states and can take the set A of actions on the world. The world is modeled as a Markov process, making stochastic transitions based on its current state and the action taken by the robot. Let T(s, a, s') be the probability that the world will make a transition to the next state s' from the current state-action pair (s, a). For each state-action pair (s, a), the reward r(s, a) is defined.

The general reinforcement learning problem is typically stated as finding a policy that maximizes the discounted sum of the rewards received over time. A policy f is a mapping from S to A. This sum is called the return and is defined as:

\[
\sum_{n=0}^{\infty} \gamma^{n} r_{t+n}, \tag{1}
\]

where r_t is the reward received at step t, given that the agent started in state s and executed policy f. γ is the discounting factor; it controls the degree to which rewards in the distant future affect the total value of a policy, and it is usually just slightly less than 1.
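As a concrete illustration of the return in Eq. (1) and of the role of γ, the short sketch below evaluates the discounted sum for a finite reward sequence; the example reward sequences are ours, chosen only to show the effect of the discount factor.

```python
def discounted_return(rewards, gamma):
    """Return of Eq. (1), truncated to a finite reward sequence r_t, r_{t+1}, ..."""
    return sum((gamma ** n) * r for n, r in enumerate(rewards))

# A single reward of 1 received after 5 steps versus after 10 steps:
for gamma in (0.6, 0.8, 0.999):
    short_path = discounted_return([0, 0, 0, 0, 1], gamma)   # goal reached on the 5th step
    long_path = discounted_return([0] * 9 + [1], gamma)      # goal reached on the 10th step
    print(gamma, round(short_path, 3), round(long_path, 3))
# With gamma near 1 the two returns are almost equal; with a smaller gamma
# the shorter path is clearly preferred.
```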
Given definitions of the transition probabilities and the reward distribution, we can solve for the optimal policy using methods from dynamic programming [14]. A more interesting case occurs when we wish to simultaneously learn the dynamics of the world and construct the policy. Watkins' Q-learning algorithm gives us an elegant method for doing this.

Let Q*(s, a) be the expected return, or action-value function, for taking action a in situation s and continuing thereafter with the optimal policy. It can be recursively defined as:

\[
Q^{*}(s, a) = r(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a' \in A} Q^{*}(s', a'). \tag{2}
\]
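When T and r are known, Eq. (2) can be solved by iterating it as a fixed-point update over all state-action pairs (value iteration on the action-value function), which is one of the dynamic-programming methods alluded to above. The sketch below is a minimal illustration under that assumption; the dictionary-based representation and the parameter values are ours.

```python
def q_value_iteration(states, actions, T, r, gamma=0.9, sweeps=1000):
    """Solve Eq. (2) by repeated sweeps of the recursion, given known T and r.

    T: dict mapping (s, a) to a dict {s_next: probability}
    r: dict mapping (s, a) to the immediate reward
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        Q = {
            (s, a): r[(s, a)] + gamma * sum(
                p * max(Q[(s_next, a_next)] for a_next in actions)
                for s_next, p in T[(s, a)].items()
            )
            for s in states for a in actions
        }
    return Q  # the optimal policy is then f(s) = argmax_a Q[(s, a)]
```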
Because we do not know T and r initially, we construct incremental estimates of the Q values online. Starting with Q(s, a) at an arbitrary value (usually 0), every time an action is taken the Q value is updated incrementally toward the received reward plus the discounted value of the resulting state.
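A minimal sketch of this incremental update in its standard (Watkins) form follows; the learning rate α and the ε-greedy action selection are assumptions introduced here for illustration.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.8):
    """One incremental Q-learning step after observing (s, a, r, s_next)."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def select_action(Q, s, actions, epsilon=0.1):
    """Epsilon-greedy exploration around the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```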
Figure 4: Number of goals in terms of γ (accumulated number of goals plotted against time step ×1000, for γ = 0.600, 0.800, and 0.999).
Fig.4 shows the accumulated number of shooting goals in terms of the temporal discount factor γ; the number of goals with the larger γ (0.999) is lower than that with the smaller ones (0.6 and 0.8). The reason is as follows. When the temporal discount factor γ is very close to 1 (almost no discount), the reward received after the goal is almost the same whichever path is selected, whereas if γ is small, the robot tries to take a shorter path, which yields a larger return. However, for too small a γ, the robot loses its way to the goal. Fig.5 shows some of the behaviors observed during the learning process. (a) and (b) show the difference between the shooting behaviors obtained with different γs. In (a),

Figure 6: A configuration of the real system (robot-mounted TV camera and transmitter; UHF receiver; Datacube MaxVideo 200 and DigiColor; MC68040 CPUs with parallel I/O, A/D, and D/A boards in a VME box; host Sun workstations on an Ethernet; radio controller and receiver for the vehicle).

Fig.6 shows a configuration of the real mobile robot system. The image taken by a TV camera mounted on the robot is transmitted to a UHF receiver and processed by a Datacube MaxVideo 200, a real-time pipeline video image processor. In order to simplify and speed up the image processing, we painted the ball red and the goal blue. We constructed the radio control system of the vehicle following the remote-brain project by Profs. Inaba and Inoue at the University of Tokyo [16]. The image processing and the vehicle control systems run under the VxWorks OS on MC68040 CPUs, which are connected with the host Sun workstations via Ethernet. A picture of the real robot with a TV camera (Sony Handycam TR-3) and a video transmitter is shown in Fig.3.
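Since the ball and the goal are color-coded (red and blue), their image positions and apparent sizes can be extracted by simple color thresholding. The following is an illustrative sketch of that idea in Python with NumPy; it is not the MaxVideo 200 pipeline used in the real system, and the threshold values are placeholder assumptions.

```python
import numpy as np

def detect_colored_region(rgb_image, channel, threshold=150, margin=30):
    """Centroid (x, y) and pixel area of the region where one color channel dominates.

    channel: 0 for the red ball, 2 for the blue goal (illustrative convention).
    """
    img = rgb_image.astype(np.int16)           # avoid uint8 overflow in the comparisons
    others = [c for c in range(3) if c != channel]
    mask = (img[:, :, channel] > threshold) & \
           (img[:, :, channel] > img[:, :, others[0]] + margin) & \
           (img[:, :, channel] > img[:, :, others[1]] + margin)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None                            # the object is not visible in this frame
    return (xs.mean(), ys.mean()), xs.size     # image position and apparent size
```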
Figure 7: Detection of the ball and the goal. (a) input image; (b) detected image.

Table 1: State-Action data

time step   state step   ball state   goal state     action (L R)   error
    1            1         (C,F)       (C,F,Fo)          F F
    2            2         (R*,F)      (C,F,Fo)          F F           1
    3            3         (D*,D*)     (C,F,Ro*)         B B           3
    4            4         (C,F)       (C,F,Lo*)         B S           1
    5            5         (C,F)       (C,F,Fo)          F F
    6                      (C,F)       (C,F,Fo)          F F
    7                      (C,F)       (C,F,Fo)          F F
    8                      (C,F)       (C,F,Fo)          F F
    9            6         (C,F)       (C,F,Ro*)         B S           1
   10            7         (C,F)       (C,F,Fo)          F F
   11            8         (C,F)       (R,M,Fo)          F F
   12            9         (R,F)       (R,M,Fo)          F F
   13           10         (R,M*)      (R,F*,Lo*)        F B           3
   14           11         (L*,F)      (R,M,Ro*)         F S           2
   15           12         (L*,F)      (R,M,Fo)          F S           1
   16           13         (R,M)       (R,M,Fo)          S B
   17           14         (C,M)       (C,M,Fo)          F F
   18           15         (L,M)       (L,M,Fo)          S F
   19           16         (L,N)       (L,M,Fo)          B S
   20                      (L,N)       (L,M,Fo)          B S
   21           17         (L,M*)      (L,M,Fo)          S F           1
   22           18         (L,N)       (L,M,Fo)          B S
   23                      (L,N)       (L,M,Fo)          B S
   24           19         (C,N)       (C,M,Fo)          F B
   25           20         (C,M)       (C,M,Fo)          F F
   26                      (C,M)       (C,M,Fo)          F F
   27           21         (C,M)       (C,N,Fo)          F S
   28           22         (C,M)       (C,M*,Lo*)        F S           2
   29           23         (C,M)       (C,M*,Ro*)        S B           2
   30           24         (C,F)       (D,D,D)           F S