
Vision-Based Behavior Acquisition For A Shooting Robot By Using A Reinforcement Learning

Minoru Asada, Shoichi Noda, Sukoya Tawaratsumida, and Koh Hosoda


Dept. of Mech. Eng. for Computer-Controlled Machinery
Osaka University, 2-1, Yamadaoka, Suita, Osaka 565, Japan
[email protected]

Abstract

We propose a method which acquires a purposive behavior for a mobile robot to shoot a ball into the goal by using vision-based reinforcement learning. The mobile robot (an agent) does not need to know any parameters of the 3-D environment or of its own kinematics/dynamics. The only information about changes in the environment is the image captured by a single TV camera mounted on the robot. An action-value function in terms of the state is to be learned; the image positions of the ball and the goal are used as the state variables, which show the effect of the action previously taken. After the learning process, the robot tries to carry the ball near the goal and to shoot it. Both computer simulation and real robot experiments are shown, and the role of vision in the context of vision-based reinforcement learning is discussed.

1 Introduction
Due to its globally perceptive capability, vision seems indispensable for autonomous agents to acquire the reactive and purposive behaviors that can be obtained through interactions with the environment. The existing deliberative and incremental approaches in computer vision, however, do not seem to have made great advances in this context, because these methods often need a huge amount of computation time, which is fatal for real-time execution of robot tasks, and they offer general descriptions of the scene which might need more time to be transformed into the specific descriptions needed to accomplish the tasks at hand. Moreover, these general descriptions are hard to evaluate properly unless the task or purpose of the agent is specified. From this viewpoint, the purposive, task-oriented, or so-called behavior-based approach seems promising for evaluating the role of vision (when, where, and what kind of information is necessary, and how accurate it should be) and, finally, for realizing autonomous agents.

Since Brooks [1, 2] proposed the behavior-based approach, his group has built several kinds of behavior-based robots, such as [3] and [4]. Although these robots can take reflexive actions against the environment, we still lack the capability of generating purposive behaviors, each of which consists of a sequence of actions to achieve a goal. We are studying the feasibility of providing this capability to our robot by using vision. Arkin [5] proposed a hybrid approach with reactive and deliberative methods for a navigation task; he encoded a priori world knowledge into motor schemas in order to generate purposive behaviors. Here, we intend to solve this problem with less world knowledge and expect our robot to learn a behavior through interactions with a dynamic environment.

[Figure 1: The basic model of robot-environment interaction: the agent sends an action a to the environment, which returns the new state s and a reward r.]

Reinforcement learning has recently been receiving increased attention as a method for robot learning with little or no a priori knowledge and a higher capability of reactive and adaptive behaviors [6]. Fig.1 shows the basic model of robot-environment interaction, where the robot and the environment are modeled as two synchronized finite state automata interacting in a discrete-time cyclical process. The robot senses the current state of the environment and selects an action. Based on the state and action, the environment makes a transition to a new state and generates a reward that is passed back to the robot. Through these interactions, the robot learns a purposive behavior to achieve a given goal.
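As a minimal illustration of the interaction model of Fig.1 (our sketch, not code from the paper), the environment can be treated as a black box that, given an action, returns the next state and a reward. The Environment class and its reset/step methods below are illustrative assumptions that only fix an interface reused in the later sketches.

# A minimal sketch, in Python, of the agent-environment interaction cycle of
# Fig.1. The Environment interface is an illustrative assumption, not the
# authors' code.

class Environment:
    """The world modeled as a finite state automaton (Fig.1)."""

    def reset(self):
        """Return an initial state (an element of the discrete state set S)."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (next_state, reward)."""
        raise NotImplementedError


def run_episode(env, policy, max_steps=1000):
    """One episode: sense the state, select an action, receive the reward."""
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)        # the robot selects an action for the current state
        s, r = env.step(a)   # the environment transitions and returns a reward
        if r > 0:            # in the shooting task, reward 1 marks a goal
            break
    return s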
Although the role of reinforcement learning is very important for realizing autonomous systems, the prominence of that role is largely determined by the extent to which it can be scaled to larger and more complex robot learning tasks. Many theoretical works have discussed the convergence time of the learning, how to speed it up by using heuristics, and how to extend these techniques from a single-goal task to multiple ones [7]. However, most of them have only shown computer simulations, and the few real robot applications reported are simple and less dynamic [8, 9]. In particular, the use of vision in reinforcement learning is very rare.

To the best of our knowledge, only Whitehead and Ballard [10] have addressed this problem. Their task is a simple manipulation of blocks on a conveyer belt. Although each block is colored so as to be easily discriminated, they still have a large state space. To cope with this problem, they assumed that the observer could direct its gaze to an attended object so as to reduce the size of the state space. However, this causes the so-called "perceptual aliasing" problem: both the observer's motion and actual changes in the environment cause changes inside the image captured by the observer, and it therefore seems difficult to discriminate between the two from the image alone. They then proposed a method to cope with this problem by adopting internal states and separating action commands into "Action frame" and "Attention frame" commands. Thus, they encoded a priori world knowledge into the state and action spaces.

In order to make the role of reinforcement learning evident in realizing autonomous agents, we need more applications in more dynamic and complex environments. In this paper, we propose a method which acquires a purposive behavior for a mobile robot to shoot a ball into the goal by using vision-based reinforcement learning. We apply the Q-learning method [11], one of the most widely used reinforcement learning schemes, to our problem. The robot is expected to learn a shooting behavior without world knowledge such as the 3-D locations and sizes of the goal and the ball or the kinematics and dynamics of the robot itself. All the information the robot can capture is the image positions of the ball and the goal, from which we can infer the changes in the world caused by the robot's actions.

The remainder of this article is structured as follows: In the next section, we give a brief overview of Q-learning. We then explain the task and the learning scheme of our method. Next, we show the experiments with computer simulations and a real robot system. Finally, we give concluding discussions.
2 Q-learning

Before getting into the details of our system, we briefly review the basics of Q-learning. For a more thorough treatment, see [12]. We follow the explanation of Q-learning by Kaelbling [13].

We assume that the robot can discriminate the set S of distinct world states, and can take the set A of actions on the world. The world is modeled as a Markov process, making stochastic transitions based on its current state and the action taken by the robot. Let T(s, a, s') be the probability that the world will transit to the next state s' from the current state-action pair (s, a). For each state-action pair (s, a), the reward r(s, a) is defined.

The general reinforcement learning problem is typically stated as finding a policy that maximizes the discounted sum of the reward received over time. A policy f is a mapping from S to A. This sum is called the return and is defined as:

    Σ_{n=0}^{∞} γ^n r_{t+n},                                              (1)

where r_t is the reward received at step t given that the agent started in state s and executed policy f. γ is the discounting factor; it controls to what degree rewards in the distant future affect the total value of a policy and is just slightly less than 1.

Given definitions of the transition probabilities and the reward distribution, we can solve for the optimal policy using methods from dynamic programming [14]. A more interesting case occurs when we wish to simultaneously learn the dynamics of the world and construct the policy. Watkins' Q-learning algorithm gives us an elegant method for doing this.

Let Q*(s, a) be the expected return, or action-value function, for taking action a in a situation s and continuing thereafter with the optimal policy. It can be recursively defined as:

    Q*(s, a) = r(s, a) + γ Σ_{s'∈S} T(s, a, s') max_{a'∈A} Q*(s', a').    (2)

Because we do not know T and r initially, we construct incremental estimates of the Q values on line. Starting with Q(s, a) at any value, usually 0, every time an action is taken we update the Q value as follows:

    Q(s, a) ⇐ (1 − α) Q(s, a) + α (r(s, a) + γ max_{a'∈A} Q(s', a')),     (3)

where r is the actual reward value received for taking action a in situation s, s' is the next state, and α is a learning rate (between 0 and 1). The following is a simple version of the 1-step Q-learning algorithm we used here.

Initialization: Q ← a set of initial values for the action-value function (e.g., all zeros).

Repeat forever:
1. s ← the current state.
2. Select an action a that is usually consistent with the policy f but occasionally an alternate.
3. Execute action a, and let s' and r be the next state and the reward received, respectively.
4. Update Q(s, a):

    Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'∈A} Q(s', a')).           (4)

5. Update the policy f:

    f(s) ← a such that Q(s, a) = max_{b∈A} Q(s, b).                       (5)
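The procedure above translates almost line by line into code. The sketch below (ours, not the authors' implementation) assumes the discrete environment interface sketched in the introduction; an ε-greedy rule is one common way to realize step 2 ("usually consistent with the policy f but occasionally an alternate"), and the constants are illustrative except for γ, which the paper later sets to 0.8.

# A sketch of 1-step Q-learning (equations (3)-(5)) for discrete state and
# action sets, assuming the env.reset()/env.step() interface sketched earlier.
# ALPHA and EPSILON are illustrative; the paper reports gamma = 0.8.

import random
from collections import defaultdict

ALPHA = 0.25      # learning rate alpha, between 0 and 1
GAMMA = 0.8       # discounting factor gamma, slightly less than 1
EPSILON = 0.1     # probability of taking an exploratory (off-policy) action


def q_learning(env, actions, num_steps=100_000):
    Q = defaultdict(float)              # Q(s, a), initialized to all zeros

    def policy(s):
        # f(s): an action a such that Q(s, a) = max_b Q(s, b)   (equation (5))
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    for _ in range(num_steps):
        # Step 2: usually follow the policy f, occasionally pick an alternate.
        a = random.choice(actions) if random.random() < EPSILON else policy(s)
        # Step 3: execute the action; observe the next state and the reward.
        s_next, r = env.step(a)
        # Step 4: move Q(s, a) toward r + gamma * max_a' Q(s', a')  (equation (4)).
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
        s = s_next
    return Q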
3 Task and Assumptions

[Figure 2: The task is to shoot a ball into the goal.]

The task for a mobile robot is to shoot a ball into the goal, as shown in Fig.2. The problem we are attacking here is to develop a method which automatically acquires strategies for doing this. As a first step, we simplify the environment so that it consists of only a ball the robot can kick and a goal fixed on the ground.

If we knew the exact three-dimensional parameters of the environment and the kinematics and dynamics of the robot, we might be able to develop several methods to control it to shoot a ball into a goal. This is not our intention. We intend to start with the visual information only, that is, the image positions of the ball and the goal. That is all the robot captures from the environment. In order for the robot to take an action against the environment, it has several motion commands such as forward and turn left (see Fig.2). Note that the robot does not even know the physical meanings of these motion commands. The effects of an action on the environment are conveyed to the robot only through the visual information. To make this possible, the robot has to track the ball and/or the goal in the image continuously. Fig.3 shows the mobile robot, the ball, and the goal used in the real experiments.

[Figure 3: A picture of the radio-controlled vehicle.]

4 Construction of State and Action Sets

In order to apply the Q-learning scheme to the task, we define a number of sets and parameters. Many existing applications of reinforcement learning schemes have constructed the state and action spaces in such a way that each action causes a state transition (e.g., one action is forward, backward, left, or right, and states are encoded by the locations of the agent) in order to make the quantization problem (the structural credit assignment problem) easy. This creates a gap between computer simulations and real robot systems. Each space should reflect the corresponding physical space in which a state (an action) can be perceived (taken). However, such a construction of state and action spaces sometimes causes a "state-action deviation" problem. In the following, we describe how to construct the state and action spaces, and then how to cope with the state-action deviation problem.
4.1 Construction of Each Space

(a) A state set S

The only information the robot can obtain about the environment is the image, which is supposed to capture the ball and/or the goal. The ball image is quantized into 9 sub-states, combinations of three positions (left, center, and right) and three sizes (large (near), medium, and small (far)). The goal image has 27 sub-states, combinations of three parameters, each of which is quantized to three levels. Each sub-state corresponds to one posture of the robot toward the goal, that is, a position and orientation of the robot in the field. In addition to these 243 (27 × 9) states, we add other states, such as those in which only the ball or only the goal is captured in the image.

After some simulations, we realized that as long as the robot captures the ball and the goal positions in the image, it succeeds in shooting the ball. However, once it loses the ball, it moves randomly because it does not know in which direction it should move to find the ball again. This happens because there is only a single ball-lost state, so the robot cannot determine in which direction the ball was lost. We therefore separate the ball-lost state into two states, ball-lost-into-right and ball-lost-into-left, and likewise set up goal-lost-into-right and goal-lost-into-left states. This made the robot's behavior much better. As a result, we have 319 states in total in the set S.
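To make the size of S concrete, the sketch below (our reconstruction, not the authors' code) enumerates the sub-states just described. Reading the two lost sub-states as additional ball sub-states and goal sub-states gives (9 + 2) × (27 + 2) = 319, matching the count reported above; the three goal parameters (position, size, orientation) follow the notation used later in Table 1.

# A sketch that enumerates the state set S of Section 4.1(a). The exact
# composition of the "lost" sub-states is our reading of the text; it is
# consistent with the reported total of 319 states.

from itertools import product

POSITIONS = ["left", "center", "right"]
SIZES = ["small (far)", "medium", "large (near)"]
ORIENTATIONS = ["left-oriented", "front-oriented", "right-oriented"]

# Ball image: 3 positions x 3 sizes, plus the two ball-lost sub-states.
BALL_SUBSTATES = [f"ball {p}, {s}" for p, s in product(POSITIONS, SIZES)]
BALL_SUBSTATES += ["ball lost into right", "ball lost into left"]

# Goal image: 3 positions x 3 sizes x 3 orientations, plus two goal-lost sub-states.
GOAL_SUBSTATES = [f"goal {p}, {s}, {o}"
                  for p, s, o in product(POSITIONS, SIZES, ORIENTATIONS)]
GOAL_SUBSTATES += ["goal lost into right", "goal lost into left"]

# A full state combines one ball sub-state with one goal sub-state.
STATES = list(product(BALL_SUBSTATES, GOAL_SUBSTATES))

assert len(BALL_SUBSTATES) == 11 and len(GOAL_SUBSTATES) == 29
assert len(STATES) == 319        # the 319 states of the set S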
(b) An action set A

The robot can select an action to take against the environment. In the real system, the robot moves around the field by a PWS (Power Wheeled Steering) system with two independent motors. Since we can send a motor control command to each of the two motors independently, we quantized the action set in terms of the two motor commands ωl and ωr, each of which has 3 sub-actions (forward, stop, and back). In total, we have 9 actions in the action set A.

The peculiarity of visual information, namely that a small change near the observer may cause a large change in the image and vice versa, causes a state-action deviation problem, because each action produces almost the same amount of motion in the environment. In our case, the resolution of the robot's actions is much higher than that of the state space, so the robot frequently makes a transition back into the same state. This is highly undesirable because the variance of the state transitions becomes very large, and the learning therefore does not converge correctly. We therefore reconstruct the action space as follows. Each action defined above is called an action primitive. The robot continues to take one action primitive until the current state changes; this sequence of action primitives is called an action. The number of action primitives needed for a state change has no meaning. Once the state has changed, we apply the update equation (3) of the action-value function.
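The action set and the action-primitive rule can be sketched as follows (our sketch, not the authors' code); robot.send and discriminate_state stand in for the real motor interface and the image-based state discrimination.

# A sketch of the action set A of Section 4.1(b): the 3 x 3 combinations of the
# two wheel commands, and an action primitive repeated until the discriminated
# state changes. The robot/discriminate_state helpers are illustrative.

from itertools import product

WHEEL_COMMANDS = ["forward", "stop", "back"]

# 9 actions, one per pair (omega_l, omega_r) of left/right motor commands.
ACTIONS = list(product(WHEEL_COMMANDS, WHEEL_COMMANDS))


def execute_action(robot, action, discriminate_state, max_primitives=1000):
    """Repeat one action primitive until the current state changes.

    Only when the state has changed is the Q-update of equation (3) applied;
    the number of primitives needed for the change carries no meaning.
    """
    s0 = discriminate_state()
    omega_l, omega_r = action
    for _ in range(max_primitives):
        robot.send(omega_l, omega_r)     # one action primitive
        s1 = discriminate_state()
        if s1 != s0:
            return s1                    # state changed: the action is complete
    return s0                            # safety cap for the sketch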
(c) A reward and a discounting factor γ

We assign a reward value of 1 when the ball enters the goal and 0 otherwise. This makes the learning very time-consuming. Although adopting a reward function in terms of the distance to the goal state would make the learning time much shorter, it then seems difficult to avoid local maxima of the action-value function Q.

A discounting factor γ is used to control to what degree rewards in the distant future affect the total value of a policy. In our case, we set the value slightly less than 1 (γ = 0.8).

5 Experiments

The experiment consists of two parts: first, learning the optimal policy f through computer simulation, and then applying the learned policy to the real situation. The merit of the computer simulation is not only to check the validity of the algorithm but also to save the running cost of the real robot during the learning process. Still, real experiments are necessary because the computer simulation cannot completely simulate the real world [15].

5.1 Simulation

We performed the computer simulation with the following specifications (the unit is an arbitrarily scaled length). The field is a square with side length 200. The goal is located at the center of one side line of the square (see Fig.2), and its height and width are 10 and 50, respectively. The robot is 16 wide and 20 long and kicks a ball of diameter 6. The camera is mounted horizontally on the robot (no tilt), and its visual angle is 36 degrees. The mass of the ball is negligible compared to that of the robot. Other parameters, such as a bounding factor between the ball and the robot and the viscous friction between the ball and the field, are chosen properly to simulate the real world.

First, we placed the ball and the robot at arbitrary positions. In almost all cases, the robot crossed over the field line without shooting the ball into the goal. This means that the learning had not converged even after many trials (three days of running on an SGI Elan with an R4000). The situation resembles that of a small child who tries to shoot a ball into the goal but cannot imagine in which direction and how far away the goal is, because a reward is received only after the ball has entered the goal; further, he or she does not know how to choose an action from the several action commands. This is the famous delayed reinforcement problem, due to the absence of an explicit teacher signal indicating the correct output at each time step. We therefore constructed a learning schedule such that the robot learns in easy situations at the early stages and in more difficult situations at later stages.

We began the learning of the shooting behavior by placing the ball and the robot near the goal. Once the robot succeeds in a shot, it begins to learn (the sum of Q increases), but afterwards it wanders around the field again. After many iterations of these successes and failures, the robot learned to shoot the ball into the goal when the ball is near the goal. After that, we placed the ball and the robot slightly farther from the goal and repeated the learning.
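The schedule just described can be sketched as a simple curriculum (our sketch, not the authors' code): the ball and robot are first placed close to the goal, and the placement distance is increased once shooting at the current distance has become reliable. The distance levels, batch size, and success threshold are illustrative assumptions, not values from the paper.

# A sketch of the staged learning schedule of Section 5.1: easy placements
# first, harder placements later. env.place(d) is assumed to put the ball and
# the robot roughly a distance d from the goal; learn_episode(env) runs one
# Q-learning episode and returns True if the robot scored.

def scheduled_training(env, learn_episode, distances=(20, 50, 100, 150),
                       episodes_per_batch=1000, success_threshold=0.3):
    for d in distances:                       # easy situations at early stages
        rate = 0.0
        while rate < success_threshold:       # stay at this stage until reliable
            scored = 0
            for _ in range(episodes_per_batch):
                env.place(d)                  # place the ball and robot for this stage
                if learn_episode(env):
                    scored += 1
            rate = scored / episodes_per_batch
            print(f"distance {d}: success rate {rate:.2f}")
        # a real schedule would also cap the number of batches per stage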

[Figure 4: Number of goals in terms of γ. Accumulated number of goals versus time step (×1000), for γ = 0.600, 0.800, and 0.999.]

Fig.4 shows the accumulated number of shooting goals in terms of the temporal discount factor γ; the number of goals with the larger γ (0.999) is lower than with the smaller ones (0.6 and 0.8). The reason is as follows. When the temporal discount factor γ is very close to 1 (almost no discount), the reward received after the goal is almost the same whichever path is selected, whereas if γ is small, the robot tries to take a shorter path, which means more reward. However, for too small a γ, the robot loses its way to the goal. Fig.5 shows some kinds of behaviors during the learning process. (a) and (b) show the difference between the shooting behaviors with different γs. In (a), γ = 0.999, and the robot shifted its body to a better position before shooting, while in (b), γ = 0.6, and the robot tried to shoot immediately. (c) shows a series of behaviors: first the robot lost the ball, then tried to find it by rotating itself, and finally it dribbled and shot.

[Figure 5: Some kinds of behaviors during the learning process: (a) shooting (γ = 0.999), (b) shooting (γ = 0.6), (c) finding.]
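A back-of-envelope calculation (not from the paper) makes this concrete: with a single reward of 1 at the goal, a path of N steps is worth γ^N, which barely distinguishes short from long paths when γ = 0.999 but strongly favors short paths for γ = 0.8 and makes distant goals almost invisible for γ = 0.6.

# Discounted value of the single goal reward after N steps, for the three
# discount factors compared in Fig.4 (a rough illustration, not paper data).

for gamma in (0.999, 0.8, 0.6):
    print(gamma, {n: round(gamma ** n, 4) for n in (5, 10, 20)})

# expected output (approximately):
# 0.999 {5: 0.995, 10: 0.99, 20: 0.9802}     short and long paths look alike
# 0.8   {5: 0.3277, 10: 0.1074, 20: 0.0115}  shorter paths clearly preferred
# 0.6   {5: 0.0778, 10: 0.006, 20: 0.0}      reward from distant goals vanishes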
5.2 Real System

[Figure 6: A configuration of the real system: the on-board TV camera and UHF transmitter; a UHF receiver feeding MaxVideo 200 and DigiColor boards and MC68040 CPUs in a VME box, connected to Sun workstations via Ethernet; and a radio controller driving the vehicle through a transmitter/receiver pair via parallel I/O and A/D-D/A interfaces.]

Fig.6 shows the configuration of the real mobile robot system. The image taken by the TV camera mounted on the robot is transmitted to a UHF receiver and processed by a Datacube MaxVideo 200, a real-time pipeline video image processor. In order to simplify and speed up the image processing, we painted the ball red and the goal blue. We constructed the radio control system of the vehicle following the remote-brain project by Profs. Inaba and Inoue at the University of Tokyo [16]. The image processing and the vehicle control system run under the VxWorks OS on MC68040 CPUs, which are connected with host Sun workstations via Ethernet. A picture of the real robot with its TV camera (Sony handy-cam TR-3) and video transmitter was shown in Fig.3.
[Figure 7: Detection of the ball and the goal: (a) input image, (b) detected image.]

[Figure 8: The robot succeeded in shooting a ball into the goal.]

Fig.7 shows a result of the image processing in which the ball and the goal are detected and their positions are calculated in real time (1/30 of a second). Fig.8 shows a sequence of images in which the robot succeeded in shooting the ball into the goal.
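The detection itself is color segmentation. As a rough modern stand-in (our sketch using OpenCV 4, not the original Datacube pipeline, with illustrative HSV thresholds), one could extract the position and apparent size of the red ball and the blue goal and quantize them into the coarse sub-states of Section 4.

# A rough modern stand-in for the color-based detection of Fig.7 (the paper
# used a Datacube MaxVideo 200 pipeline). Assumes OpenCV 4; the HSV ranges are
# illustrative, not the original settings.

import cv2
import numpy as np

RED_LO, RED_HI = np.array([0, 120, 80]), np.array([10, 255, 255])
BLUE_LO, BLUE_HI = np.array([100, 120, 80]), np.array([130, 255, 255])


def detect(frame_bgr, lo, hi):
    """Return (center_x, area) of the largest region in the given color range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lo, hi)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                    # object lost (the ball-lost / goal-lost states)
    c = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(c)
    return x + w / 2, w * h            # image position and apparent size


def coarse_position(center_x, image_width):
    """Quantize an image x-coordinate into the left / center / right sub-states."""
    return ("left", "center", "right")[min(int(3 * center_x / image_width), 2)]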

Table 1 shows the result of state discrimination for the scene shown in Fig.8. The columns give the time step (1/30 of a second), the state step discriminated by the robot, the ball state (Left, Right, Center, or Disappeared, and Near, Medium, or Far), the goal state (in addition to the same states as the ball, Front-oriented, Left-oriented, or Right-oriented), the control commands to the left (L) and right (R) motors (Forward, Stop, or Backward), and the number of state discrimination failures. Misdiscriminated states are marked with an asterisk. Although the error ratio of the state discrimination was rather high (about 15%) due to image processing noise, the robot succeeded in shooting the ball into the goal as long as the errors did not occur continuously, because the robot has the optimal action-value function for all states.

Table 1: State-Action data

time  state  ball     goal        L  R  errors
step  step
  1     1    (C,F)    (C,F,Fo)    F  F
  2     2    (R*,F)   (C,F,Fo)    F  F   1
  3     3    (D*,D*)  (C,F,Ro*)   B  B   3
  4     4    (C,F)    (C,F,Lo*)   B  S   1
  5     5    (C,F)    (C,F,Fo)    F  F
  6          (C,F)    (C,F,Fo)    F  F
  7          (C,F)    (C,F,Fo)    F  F
  8          (C,F)    (C,F,Fo)    F  F
  9     6    (C,F)    (C,F,Ro*)   B  S   1
 10     7    (C,F)    (C,F,Fo)    F  F
 11     8    (C,F)    (R,M,Fo)    F  F
 12     9    (R,F)    (R,M,Fo)    F  F
 13    10    (R,M*)   (R,F*,Lo*)  F  B   3
 14    11    (L*,F)   (R,M,Ro*)   F  S   2
 15    12    (L*,F)   (R,M,Fo)    F  S   1
 16    13    (R,M)    (R,M,Fo)    S  B
 17    14    (C,M)    (C,M,Fo)    F  F
 18    15    (L,M)    (L,M,Fo)    S  F
 19    16    (L,N)    (L,M,Fo)    B  S
 20          (L,N)    (L,M,Fo)    B  S
 21    17    (L,M*)   (L,M,Fo)    S  F   1
 22    18    (L,N)    (L,M,Fo)    B  S
 23          (L,N)    (L,M,Fo)    B  S
 24    19    (C,N)    (C,M,Fo)    F  B
 25    20    (C,M)    (C,M,Fo)    F  F
 26          (C,M)    (C,M,Fo)    F  F
 27    21    (C,M)    (C,N,Fo)    F  S
 28    22    (C,M)    (C,M*,Lo*)  F  S   2
 29    23    (C,M)    (C,M*,Ro*)  S  B   2
 30    24    (C,F)    (D,D,D)     F  S

6 Discussion and Future Works

As a first step towards an autonomous agent capable of generating purposive behavior, we have studied the feasibility of realizing a shooting behavior with vision. Although a very long learning time is needed, the robot has learned to generate a shooting behavior consisting of a series of actions including finding, dribbling, and shooting a ball.

In the following, we discuss the role of vision in the context of vision-based reinforcement learning.

• During the learning process, the visual information plays an important role in state discrimination and, eventually, in action evaluation. Most previous work on reinforcement learning applications assumes perfect sensors that are often too idealized to apply to real situations. Vision is the most suitable sensor because of its global scope on the environment. The only problem is how to extract the information necessary to accomplish the task.
• The action-value function obtained after the reinforcement learning includes the information necessary for the robot to take the optimal action in each individual state of the environment, and therefore it could be called an environment map.

• Two environments which differ in appearance could be found similar to each other if we can find a similarity between their action-value functions. A desk or a chair in an office scene and a large rock in an outdoor scene need not be discriminated by an obstacle finder and avoider.

• How to construct a state space is one of the issues in the reinforcement learning scheme; this is called the structural credit assignment problem. Many existing works on reinforcement learning construct the state space in such a way that their simulations work well. This sometimes causes unnatural segmentation of the sensory information and of its mapping to the state space. At least, this can be avoided if the robot uses visual information, because the spatial resolution and the dynamic range of the observed intensities are limited to some extent, and therefore the robot cannot discriminate states beyond these physical constraints.

• In the vision-based reinforcement learning scheme, there are two issues related to each other: coarse segmentation of the state space and real-time processing (state mapping and control). In our work, the red ball and the blue goal are easily extracted, and their sizes and positions are mapped very coarsely to the state space, that is, to "right," "center," or "left" and so on, in order to reduce the size of the state space. This helps absorb small errors in measuring the positions and sizes of the ball and the goal in the image captured by the robot. To cover this coarseness of the state space, control of the robot's actions must be done in real time (1/30 of a second). Since the robot has the action-value function obtained after the learning, it can take the optimal action in any situation of the environment. Therefore, even if mis-mapping of the state occurs due to image noise, or an action fails due to slip between the floor and the crawler of the robot, the robot succeeds in shooting the ball into the goal as long as the frequency of these mistakes is low (see Table 1).

Although we still have other problems, such as the temporal credit assignment problem (when to give a reward so as to speed up the learning) and the scaling problem (applying the learned policy to similar tasks in different environments), we found that the vision-based reinforcement learning method seems promising for realizing autonomous agents in the real world. We are planning to extend the method to coordination and competition among multiple players.

References

[1] R. A. Brooks. "A robust layered control system for a mobile robot". IEEE J. Robotics and Automation, RA-2:14-23, 1986.
[2] R. A. Brooks. "Elephants don't play chess". In P. Maes, editor, Designing Autonomous Agents, pages 3-15. MIT/Elsevier, 1991.
[3] M. J. Mataric. "Integration of representation into goal-driven behavior-based robots". IEEE J. Robotics and Automation, RA-8, 1992.
[4] P. Maes. "The dynamics of action selection". In Proc. of IJCAI-89, pages 991-997, 1989.
[5] R. C. Arkin. "Integrating behavioral, perceptual, and world knowledge in reactive navigation". In P. Maes, editor, Designing Autonomous Agents, pages 105-122. MIT/Elsevier, 1991.
[6] J. H. Connel and S. Mahadevan, editors. Robot Learning. Kluwer Academic Publishers, 1993.
[7] R. S. Sutton, editor. Machine Learning, volume 8: special issue on reinforcement learning. Kluwer Academic Publishers, 1992.
[8] P. Maes and R. A. Brooks. "Learning to coordinate behaviors". In Proc. of AAAI-90, pages 796-802, 1990.
[9] J. H. Connel and S. Mahadevan. "Rapid task learning for real robot". In J. H. Connel and S. Mahadevan, editors, Robot Learning, chapter 5. Kluwer Academic Publishers, 1993.
[10] S. D. Whitehead and D. H. Ballard. "Active perception and reinforcement learning". In Proc. of Workshop on Machine Learning 1990, pages 179-188, 1990.
[11] C. J. C. H. Watkins and P. Dayan. "Technical note: Q-learning". Machine Learning, 8:279-292, 1992.
[12] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, May 1989.
[13] L. P. Kaelbling. "Learning to achieve goals". In Proc. of IJCAI-93, pages 1094-1098, 1993.
[14] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
[15] R. A. Brooks and M. J. Mataric. "Real robot, real learning problems". In J. H. Connel and S. Mahadevan, editors, Robot Learning, chapter 8. Kluwer Academic Publishers, 1993.
[16] M. Inaba. "Remote-brained robotics: Interfacing AI with real world behaviors". In Preprints of ISRR'93, Pittsburgh, 1993.
