Learning from Delayed Rewards
Christopher John Cornish Hellaby Watkins
King’s College
Thesis Submitted for Ph.D.
May, 1989

Preface
As is required in the university regulations, this thesis is all my own work,
and no part of it is the result of collaborative work with anybody else.
I would like to thank all the people who have given me encouragement
and advice, and the many people I have had conversations with about this
work. Other people’s interest and encouragement have helped me enormously.
In the early stages of my research, Doug Frye introduced me to the work
of Piaget, and I had a number of conversations with David Spiegelhalter on
inductive inference.
I would like to thank Richard Young, my supervisor, for patiently giving
me good advice over a number of years, and for spending a lot of time talking
to me about ideas that led up to this work.
Alex Kacelnik has encouraged me greatly, as well as providing many fas-
cinating examples from animal behaviour and foraging theory.
Rich Sutton gave me his technical reports, and pointed out a difficulty
with Q-learning.
I have had a number of helpful conversations with Andy Barto, who has
raised a number of issues and possibilities that I have not been able to discuss
in the thesis. I have also had interesting conversations with, and comments
from, Graeme Mitchison, and John McNamara. Andy Barto, Alex Kacelnik,
Alastair Houston, and Robert Zimmer have read and made comments on early
drafts of some parts of the thesis, and the final version is considerably clearer
and more correct as a result.
I am most grateful for the financial support I have received from King’s
College. I would also like to thank Tony Weaver, my group leader at Philips
Research Laboratories, for his unfailing support which has enabled me to com-
plete this work.
Finally, I would like to dedicate this thesis to my wife Lesley, because she
is the happiest to see it finished!
Chris Watkins, May 1st, 1989.

Summary
In behavioural ecology, stochastic dynamic programming may be used as a
general method for calculating animals’ optimal behavioural policies. But how
might the animals themselves learn optimal policies from their experience? The
aim of the thesis is to give a systematic analysis of possible computational
methods of learning efficient behaviour.
First, it is argued that it does follow from the optimality assumption that
animals should learn optimal policies, even though they may not always follow
them. Next, it is argued that Markov decision processes are a general formal
model of an animal's behavioural choices in its environment. The conventional
methods of determining optimal policies by dynamic programming are then
described. It is not plausible that animals carry out calculations of this type.
However, there is a range of alternative methods of organising the
dynamic programming calculation, in ways that are plausible computational
models of animal learning. In particular, there is an incremental Monte-Carlo
method that enables the optimal values (or ‘canonical costs’) of actions to be
learned directly, without any requirement for the animal to model its environ-
ment, or to remember situations and actions for more than a short period of
time. A proof is given that this learning method works. Learning methods of
this type are also possible for hierarchical policies. Previously suggested learn-
ing methods are reviewed, and some even simpler learning methods are
presented without proof. Demonstration implementations of some of the learn-
ing methods are described.

Corrigenda
Page 90, lines 13-21 should be replaced by:
Note that

$$\bigl(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\bigr) - U(x_t) \;=\; \sum_{\tau \ge t} \gamma^{\tau - t}\,\bigl[\, r_\tau + \gamma U(x_{\tau+1}) - U(x_\tau) \,\bigr]$$

A learning method can be implemented by, at each time step, adding appropriate fractions of the
current prediction difference to previously visited states.
Page 91, insert after line 16:
The total change in U(x) that results from a visit to x at time t is

$$\alpha\,\bigl[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - U(x_t) \,\bigr] \;+\; \alpha^2 C_{x,t}$$

Note that $C_{x,t}$ is bounded for all x and t. If U is exactly correct then the average value of the
first term on the RHS above will be zero; however, if there is any error in U, then the second
term on the RHS above will become negligible for sufficiently small $\alpha$.
Page 98, in line 21:
‘Michie (1967)’ should be deleted; ‘Widrow et al (1972)’ should be replaced by
‘Widrow et al (1973)’.
Page 227, lines 7 to 13 should be replaced by:
Sufficient conditions are that:
• For each observation, x, a, and α may be chosen with knowledge of previous observations,
but r and y are sampled, independently of other observations, from a joint distribution that
depends only on x and a.
• For all x and a, the rewards should have finite mean and finite variance.
• For each state-action pair x, a, the subsequence of learning factors for observations of the
form [x, a, y, r] is monotonically decreasing, tends to zero, and sums to infinity.
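For concreteness, a minimal sketch of a one-step Q-learning update whose learning factors meet these conditions is given below (my own illustration, not code from the thesis; the table representation and the 1/n(x, a) schedule are assumptions).

```python
# Illustrative sketch only: a one-step Q-learning update with per-pair learning
# factors alpha = 1/n(x, a), a schedule that is monotonically decreasing, tends
# to zero, and sums to infinity, as the conditions above require.
from collections import defaultdict

def make_learner(gamma=0.9):
    Q = defaultdict(float)       # action values, keyed by (state, action)
    n = defaultdict(int)         # visit counts for each (state, action) pair

    def observe(x, a, y, r, actions_in_y):
        """Apply one observation <x, a, y, r> to the value table."""
        n[(x, a)] += 1
        alpha = 1.0 / n[(x, a)]                          # learning factor for this pair
        target = r + gamma * max(Q[(y, b)] for b in actions_in_y)
        Q[(x, a)] += alpha * (target - Q[(x, a)])        # move towards the one-step return
        return Q[(x, a)]

    return Q, observe
```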
Page 228, replace lines 14 to 23 by:
between each value of n in the sequence.
To show this, consider the subsequence of observations in which action a is performed in
state x. Let the ith observation in this subsequence have index $m_i$ in the main
sequence of observations. The replay probabilities when performing a in state x in the ARP
may be written explicitly as follows. Let $\beta_{is}$ be the probability that the $m_s$-th observation will be
replayed when action a is performed in $\langle x, m_i \rangle$. Then

$$\beta_{is} = 0 \;\; \text{for } s > i, \qquad \beta_{is} = \alpha_{m_s} \prod_{j=s+1}^{i} \bigl(1 - \alpha_{m_j}\bigr) \;\; \text{for } s \le i,$$

taking $\alpha_{m_0} = 1$. If for some $j > 0$

$$\sum_{s=j}^{i} \alpha_{m_s} = D$$

then it may be shown that

$$\prod_{s=j}^{i} \bigl(1 - \alpha_{m_s}\bigr) < e^{-D}$$

so

$$\sum_{s=j}^{i} \beta_{is} > 1 - e^{-D}$$

That is, if $d(i,a) - d(j,a) = D$, then if action a is performed in $\langle x, m_i \rangle$ the probability
that the observation replayed has index less than $m_j$ is less than $e^{-D}$.
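A small numerical check of these replay probabilities is sketched below (my own illustration; the particular learning-factor schedule is an assumption). It verifies that the β's form a probability distribution and satisfy the bound above.

```python
# A numerical check of the replay probabilities.  beta[s] is the probability that
# the s-th observation of this state-action pair is replayed; beta[0] is the
# probability that no observation is replayed at all.
import math

def replay_probabilities(alphas):
    """alphas[0..i-1] are the learning factors of observations 1..i; alpha_0 = 1."""
    a = [1.0] + list(alphas)
    i = len(a) - 1
    beta = [0.0] * (i + 1)
    for s in range(i + 1):
        prod = 1.0
        for j in range(s + 1, i + 1):
            prod *= (1.0 - a[j])      # observations s+1 .. i all fail to be replayed
        beta[s] = a[s] * prod
    return beta

alphas = [1.0 / k for k in range(1, 21)]     # a decreasing schedule summing to infinity
beta = replay_probabilities(alphas)
assert abs(sum(beta) - 1.0) < 1e-9           # the betas form a probability distribution
j = 10
D = sum(alphas[j - 1:])                      # sum of learning factors from observation j on
assert sum(beta[j:]) > 1.0 - math.exp(-D)    # the bound used in the argument above
```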
Hence, the depth construction given previously on page 228 shows that, for any chosen
number of replays k and any chosen value ε of the learning factor and for any small probability δ,
it is possible to choose a level n in the ARP so large that, starting at any state and follow-
ing any sequence of actions, the number of replays will be greater than k and the maximal value
of α encountered during any of the first k replayed moves will be less than ε, with probability
greater than 1−δ. k and δ may be chosen so that the expected k-step truncated returns of the ARP
are as close as desired to the expected returns of the ARP.
It remains to show that the transition probabilities and expected rewards in the ARP con-
verge to those of the RP. There is a delicate problem here: the ARP is constructed from a
sequence of observations, and some observation sequences will be unrepresentative samples.
Furthermore, $x_n$, $a_n$, and $\alpha_n$ may all be chosen with knowledge of the previous observations 1
to n−1. So to give a convergence result, it is necessary to regard the observed rewards and transi-
tions as random variables, and to argue that, for any RP, the transition probabilities and expected
rewards in the ARP will converge to those in the RP with probability 1, the probability being
taken over the set of possible sequences of observations for each state-action pair.
Consider once more the subsequence of observations of action a in state x, and let the ith
observation in this subsequence be observation $m_i$ in the main sequence, and let $R_i$ be the random
variable denoting the reward observed on this $m_i$-th observation. Let the states be numbered from
1 to S, and let $T_i^1, \ldots, T_i^S$ be random variables such that if the observed destination state at the
$m_i$-th observation is the kth state, then $T_i^k = 1$ and $T_i^j = 0$ for $j \ne k$. Note that $E[\,T_i^k\,] = P_{xk}(a)$ and
$E[\,R_i\,] = \rho_x(a)$ for all i.
The expected reward and the transition probabilities in the ARP (which are now random vari-
ables, since they depend on the observation sequence) are:

$$\rho^{ARP}_{\langle x, m_i \rangle}(a) = \sum_{s} \beta_{is} R_s$$

and

$$P^{ARP}_{\langle x, m_i \rangle\, k}(a) = \sum_{s} \beta_{is} T_s^k$$

Note that $E[\, P^{ARP}_{\langle x, m_i \rangle\, k}(a) \,] = P_{xk}(a)$, and observe that
$\max_s \{\beta_{is}\} \to 0$ as $i \to \infty$. Since, by hypothesis, the means and variances of all rewards are
finite, and the $T_s^k$ are bounded, the strong law of large numbers implies that, as $i \to \infty$,
$\rho^{ARP}_{\langle x, m_i \rangle}(a) \to \rho_x(a)$ and $P^{ARP}_{\langle x, m_i \rangle\, k}(a) \to P_{xk}(a)$, both with probability 1. Since there is
only a finite number of state-action pairs, all transition probabilities and expected rewards in the
ARP converge uniformly with probability 1 to the corresponding values in the RP as the level in
the ARP tends to infinity. This completes the proof.
Table of Contents

Chapter 1: Introduction
Chapter 2: Learning Problems
Chapter 3: Markov Decision Processes
Chapter 4: Policy Optimisation by Dynamic Programming
Chapter 5: Modes of Control of Behaviour
Chapter 6: Model-Based Learning Methods
Chapter 7: Primitive Learning
Chapter 8: Possible Forms of Innate Knowledge
Chapter 9: Learning in a Hierarchy of Control
Chapter 10: Function Representation
Chapter 11: Two Demonstrations
Chapter 12: Conclusion
Appendix 1: Convergence of One-Step Q-Learning
References

Chapter 1
Introduction
Learning to act in ways that are rewarded is a sign of intelligence. It is,
for example, natural to train a dog by rewarding it when it responds appropri-
ately to commands. That animals can learn to obtain rewards and to avoid pun-
ishments is generally accepted, and this aspect of animal intelligence has been
studied extensively in experimental psychology. But it is strange that this type
of learning has been largely neglected in cognitive science, and I do not know
of a single paper on animal learning published in the main stream of literature
on ‘artificial intelligence’.
This thesis will present a general computational approach to learning from
rewards and punishments, which may be applied to a wide range of situations
in which animal learning has been studied, as well as to many other types of
learning problem. The aim of the thesis is not to present specific computational
models to explain the results of specific psychological experiments. Instead, I
will present systematically a family of algorithms that could in principle be
used by animals to optimise their behaviour, and which have potential applica-
tions in artificial intelligence and in adaptive control systems.
In this introduction I will discuss how animal learning has been studied,
and what the requirements for a computational theory of learning are.
1. Classical and Instrumental Conditioning
I will not give any comprehensive review of the enormous literature on
the experimental study of animal leaming. Instead I will describe the essential
aims of the experimental research, the nature of the phenomena studied, and
some of the main conclusions.
There is a long history of research into conditioning and associative learn-
ing, as described by Mackintosh (1983). Animals’ ability to learn has been stu-
died by keeping them in controlled artificial environments, in which events and
contingencies are under the control of the experimenter. The prototypical
artificial environment is the Skinner box, in which an animal may be con-
fronted with stimuli, such as the sound of a buzzer or the sight of an
illuminated lamp, and the animal may perform responses such as pressing a
lever in the case of a rat, or pecking at a light in the case of a pigeon. The
animal may be automatically provided with reinforcers. In behavioural terms, a
positive reinforcer is something that may increase the probability of a preceding
response; a positive reinforcer might be a morsel of food for a hungry animal,
for instance, or a sip of water for a thirsty animal. Conversely, a negative rein-
forcer, such as an electric shock, is something that may reduce the probability of
a preceding response. In a typical experiment, the animal's environment may be
controlled automatically for a long period of time, and the delivery of rein-
forcers may be made contingent upon the stimuli presented and on the
responses of the animal. The events and contingencies that are specified for the
artificial environment are known as the reinforcement schedule.
Two principal types of experimental procedure have been used: instrumen-
tal and classical conditioning schedules.
In instrumental schedules, the reinforcement that the animal receives
depends on what it does. Instrumental learning is learning to perform actions to
obtain rewards and to avoid punishments: the animal learns to behave in a cer-
tain way because behaving in that way leads to positive reinforcement. The
adaptive function of instrumental conditioning in animals is clear: blue tits will
obtain more food in winter if they can learn to visit well-stocked bird-tables.
In classical (or ‘Pavlovian’) experiments, the animal is exposed to
sequences of events and reinforcers. The reinforcers are contingent on the
events, not on the animal’s own behaviour: a rat may be exposed to a light, and
then given an electric shock regardless of what it does, for example. Experi-
ments of this type are often preferred (Dickinson 1980) because the correlations
between events and reinforcers may be controlled by the experimenter, whereas
the animal’s actions may not.
Classical conditioning experiments depend on the fact that an animal may
naturally respond to certain stimuli without any previous learning: a man will
withdraw his hand from a pin-prick; a dog will salivate at the sight of food.
The stimulus that elicits the response is termed the unconditioned stimulus or
US. If the animal is placed in an environment in which another stimulus—the
conditioned stimulus or CS—tends to occur before the US, so that the
occurrence of the CS is correlated with the occurrence of the US, then an
animal may produce the response after the CS only. It is as if the animal learns
to expect the US as a result of the CS, and responds in anticipation.
Whether there are in fact two types of learning in instrumental and classi-
cal conditioning is much disputed, and complex and difficult issues are
involved in attempting to settle this question by experiment. However, I will be
concerned not with animal experiments but with learning algorithms. I will
therefore give only a brief discussion of one interpretation of the experimental
evidence from animal experiments, as part of the argument in favour of the
type of learning algorithm I will develop later.
Mackintosh (1983) discusses the relationship between classical and instru-
mental conditioning at length.
First, it might seem tempting to regard classical conditioning as a form of
instrumental conditioning: might a dog not learn to salivate in anticipation of
food because the dog found that if it salivated the food was more palatable?
Mackintosh argues that classical conditioning cannot be explained as instrumen-
tal conditioning in this way. One of the neatest experiments is that of Browne
(1976), in which animals were initially prevented from giving any response while
they observed a correlation between a stimulus and a subsequent reward.
When the constraint that prevented the animals giving the response was
removed, they immediately gave the response; there was no possibility of any
instrumental learning because the animals had been prevented from responding.
Mackintosh also notes that, perhaps more surprisingly, much apparently
instrumental conditioning may be explainable as classical conditioning. In an
instrumental experiment in which animals learn to perform some action in
response to a conditioned stimulus, the animal must inevitably observe a corre-
lation between the conditioned stimulus and the reward that occurs as a result
of its action. This correlation, produced by the animal itself, may give rise to a
classically conditioned response: if this classically conditioned response is the
same as the instrumental response, then each response will strengthen the corre-
lation between the CS and the reward, thus strengthening the conditioning of
the CS. Learning would thus be a positive feedback process: the more reliably
the animal responds, the greater the correlation it observes, and the greater the
correlation it observes, the more reliably the animal will respond.
But, as Mackintosh argues, not all instrumental learning may be explained
in this way. A direct and conclusive argument that not all instrumental learning
is explainable as classical conditioning is the common observation that animals
can learn to perform different responses in response to the same stimuli to
obtain the same rewards.
Mackintosh suggests, however, that no instrumental conditioning experi-
ment is totally free of classically conditioned effects, and vice versa. For
instrumental learning to occur, an animal must produce at least one response, or
sequence of responses, that results in a reward—why does the animal produce
the first such response? Instrumental learning consists of attempting to repeat
previous successes; the achievement of the first success requires a different
explanation. One possibility is that an animal performs a totally random
exploration of its environment: but this cannot be accepted as a complete expla-
nation. A reasonable hypothesis is that classical conditioning is the expression
of innate knowledge of what actions are usually appropriate when certain types
of events are observed to be correlated in the environment. The roughly
appropriate innate behaviour released by classical conditioning may then be
fine-tuned by instrumental learning. The question of the relationship between
classical and instrumental conditioning is, therefore, one aspect of a more fun-
damental question: what types of innate knowledge do animals have, and in
what ways does this innate knowledge contribute to learning?
Conditioning theory seeks to explain animals’ behaviour in detail: to
explain, for example, just how the time interval between a response and a rein-
forcer affects the rate at which the response is learned. As a consequence of
this level of detail, conditioning theory cannot readily be used to explain or to
predict animal learning under more natural conditions: the relationships between
stimuli, responses, and reinforcers become too complex for models of instru-
mental conditioning to make predictions. It is clear in a general way that instru-
mental conditioning could enable an animal to learn to obtain what it needs, but
conditioning theory cannot in practice be used to predict the results of learning
quantitatively under most natural conditions. The difficulty is that conditioning
theory has tended to be developed to explain the results of certain types of
experiment, rather than to predict the effect of learning on behaviour overall.
The optimality argument as used in behavioural ecology can provide, I
believe, a clear and well motivated description of what instrumental learning
should ideally be. I will later consider systematically the different ways in
which this type of instrumental learning may be achieved.
2. The Optimality Argument
Behavioural ecologists seek to explain animal behaviour by starting from a
different direction. They argue that animals need to behave efficiently if they
are to survive and breed: selective pressure should, therefore, lead to animals
adopting behavioural strategies that ensure maximal reproductive success. Just
as evolution has provided animals with bodies exquisitely adapted to survival in
their ecological niches, should not evolution also lead to similarly exquisite
adaptations of behaviour? On this view, it should be possible to explain natural
animal behaviour in terms of its contribution to reproductive success. This
approach to the analysis of animal behaviour is known as the optimality argu-
ment.
The optimality argument is controversial: Gould and Lewontin (1979)
attack its uncritical use, and Stephens and Krebs (1986) give an extended dis-
cussion of when the use of the optimality argument can be justified. It is clear
that there are both practical and theoretical difficulties with the optimality argu-
ment.
One difficulty that Gould and Lewontin point out is that the optimality
argument must be applied to an animal and its behaviour as a whole, and not to
each aspect of the animal separately. Further, optimality can only be assessed
relative to the animal’s ‘choice’ of overall design (or ‘bauplan’) and mode of
life. One cannot recommend a whale to become a butterfly.
A potential weakness of the optimality argument is that evolution is not a
perfect optimiser. It is likely, therefore, that there are some aspects of
behaviour which have no adaptive value, just as there are some parts of the
body that might benefit from being re-designed. Nevertheless, there are many
examples where optimality arguments can be applied convincingly to explain
aspects of animals’ natural behaviour.
And, of course, optimality arguments may often be difficult to provide in
practice because it may be difficult to establish what the optimal behavioural
strategy actually is for an animal in the wild. The difficulty may be either that
it is difficult to determine what an animal's intermediate goals should be if it is
to leave as many surviving descendants as possible, or else the difficulty may
be that although the goals of optimal behaviour are reasonably clear, it is
difficult for a behavioural ecologist to know how animals could best go about
achieving them. What is the best way for a squirrel to look for nuts?
To apply the optimality argument to any particular example of animal
behaviour is fraught with subtle difficulties, and a substantial amount of investi-
gation of the animal’s behaviour and habitat is necessary. But I am not going to
do this—all I need to assume is a rather limited form of the optimality argu-
ment, which is set out below.
3. Optimality and Efficiency
The optimality argument as applied to behaviour suggests that the function
of instrumental learning is to learn to behave optimally. Some aspects of
behaviour are innate, other aspects are learned, but all behaviour should be
optimal. But this is much too simple.
There is a basic difficulty: animals cannot learn how to leave as many
descendants as possible. It is not possible for an animal to live its life many
times and to determine from its experience the optimal strategy for perpetuating
its genes. All that an animal can learn to do is to achieve certain intermediate
objectives. To ensure maximal reproductive success, animals may need to
achieve a variety of intermediate goals: finding food and water, finding shelter,
defending territory, attracting a mate, raising offspring, avoiding predators, rest-
ing, grooming, and so on. Animals may learn to achieve these goals, but they
cannot learn optimal fitness directly.
It is often possible to identify certain skills that animals need to have—
one such skill that many animals need is the ability to forage so as to gain
energy at the maximum possible rate. To describe an intermediate objective
quantitatively, it is necessary to specify a performance criterion, which can be
used to ‘score’ different possible behavioural strategies. The maximally efficient
behavioural strategy is the one that leads to the best score according to the per-
formance criterion. If animals can represent suitable performance criteria inter-
nally, so that they can score their current behaviour, then it becomes possible
for them to learn efficient behaviour. This is the type of learning I will exam-
ine.
But an animal will not always need to achieve all its intermediate objec-
tives with maximal efficiency. A plausible view is that, for each species of
animal, there are certain critical objectives, in that the levels of performance an
animal achieves in these areas strongly affect its reproductive fitness. In other
areas of behaviour, provided that the animal's performance is above some
minimum level of efficiency, further improvements do not greatly affect fitness.
For example, a bird may have ample leisure during the autumn when food is
plentiful and it has finished rearing its young, but its survival in the winter may
depend critically on its ability to forage efficiently. There is, therefore, an
important distinction between optimal behaviour in the sense of behaviour that
ensures maximal reproductive success, and behaviour that is maximally efficient
in achieving some intermediate objective.
It is possible that some animals need to learn to optimise their behaviour
overall by learning to choose to devote appropriate amounts of time to different
activities; but it is likely that it is more usual that animals need to learn certain
specific skills, such as how to hunt efficiently. It is unlikely that an animal will
have to learn to seek food when it is hungry: it is more likely to need to learn
how to find food efficiently, so that it can exercise this skill when it is hungry.
Animals may, therefore, need to learn skills that they do not always need
to use. Learning of this type is to some extent incidental: an animal may learn
how to forage efficiently while actually foraging somewhat inefficiently, so that
the animal's true level of skill may only become evident when the animal needs
to use it. Furthermore, it is usually necessary to make mistakes in order to
learn: animals must necessarily behave inefficiently sometimes in order to learn
how to behave efficiently when they need to. This view of learning is rather
different from traditional views of reinforcement learning, as presented by, for
example, Bush and Mosteller (1955).
4. Learning and the Optimality Argument
The optimality argument does predict that animals should have the ability
to learn—to adapt their behaviour to the environment they find. The reason for
this is that
• The same genotype may encounter circumstances in which different
behavioural strategies are optimal.
That is, either the same individual may need to adapt its behaviour, or different
individuals may encounter different circumstances. This is not an entirely trivial
point: one reason why we do not learn to beat our hearts is that the design of
the heart, the circulatory system, and the metabolism is encoded in the genes,
so that the optimal strategy for beating the heart may be genetically coded as
well. It is possible, however, that there could be a mechanism for fine-tuning
the control system for the heart-beat in response to experience, because physi-
cal development is not entirely determined by the genotype.
Naturally, optimality theory predicts optimality in learning, but there are
two notions of optimality in learning: optimal learning, and learning of efficient
strategies. ‘Optimal learning’ is a process of collecting and using information
during learning in an optimal manner, so that the learner makes the best possi-
ble decisions at all stages of learning: learning itself is regarded as a multi-
stage decision process, and learning is optimal if the learner adopts a strategy
that will yield the highest possible return from actions over the whole course of
learning. ‘Learning of an efficient strategy’ or ‘asymptotically optimal’ learning
(Houston et al 1987) is a much weaker notion—all that is meant is that after
sufficient experience, the learner will eventually acquire the ability to follow
the maximally efficient strategy.
The difference between these two notions may be made clear by consider-
ing the ‘two-armed bandit’ problem. In this problem, a player is faced with two
levers. On each turn, the player may pull either lever A or lever B, but not
both. After pulling a lever, the player receives a reward. Let us suppose that,
for each lever, the rewards are generated according to a different probability
distribution. Successive rewards are independent of each other, given the
choice of lever. The average rewards given by the two levers are different.
The reward the player obtains, therefore, depends only on the lever he pulls.
Now, suppose that the player is allowed only 10 turns; at each turn, the player
may decide which lever to pull based on the rewards he has received so far in
the session.
If the player knows that lever A gives a higher reward than lever B, then
clearly his maximally efficient strategy is to pull lever A 10 times. But if the
player is uncertain about the relative mean rewards offered by the two levers,
and his aim is to maximise his total reward over n turns, then the problem
becomes interesting. The point is that the player should try pulling both levers
alternately at first, to determine which lever appears to give higher rewards;
once the player has sampled enough from both levers, he may choose to pull
one of the levers for the rest of the session. Other sampling strategies are possi-
ble.
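To make the two-armed bandit concrete, the following toy simulation (my own illustration; the Bernoulli reward probabilities and the ‘sample each lever, then commit’ rule are assumptions, not taken from the thesis) plays a 10-turn session with a simple sampling strategy.

```python
# A toy two-armed bandit session: pull each lever a few times, then commit to
# the lever that has paid out more.  All numbers here are illustrative.
import random

def play_session(p_a=0.7, p_b=0.4, turns=10, samples_each=2):
    """Bernoulli rewards; explore both levers briefly, then exploit the better one."""
    means = {"A": p_a, "B": p_b}
    totals = {"A": 0.0, "B": 0.0}
    counts = {"A": 0, "B": 0}
    reward = 0.0
    for t in range(turns):
        if t < 2 * samples_each:
            lever = "A" if t % 2 == 0 else "B"                        # exploration phase
        else:
            lever = max(totals, key=lambda l: totals[l] / counts[l])  # exploitation phase
        r = 1.0 if random.random() < means[lever] else 0.0
        totals[lever] += r
        counts[lever] += 1
        reward += r
    return reward

print(sum(play_session() for _ in range(1000)) / 1000.0)  # average return of this policy
```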
The difference between optimal learning and learning an efficient strategy
is clear for this problem. Learning an efficient strategy is learning which lever
gives the higher rewards on average; a learning method learns the efficient stra-
tegy if it always eventually finds out which lever gives the higher rewards.
However, a learning method is optimal for a session of length n if it results in
the player obtaining the highest possible expected reward over the n turns,
‘highest possible’ taking into account the player's initial uncertainty about the
reward distributions of the levers.
Optimal learning is the optimal use of information to inform behaviour. It
is learning that is optimal when considered over the whole course of learning,
taking into account both early mistakes and later successes. Optimality in this
sense refers to the learning method itself, not to the final behaviour attained. In
the two-armed bandit problem, for example, if only a few turns are allowed, it
may be optimal for the player to perform very little initial sampling before
choosing one lever to pull for the rest of the session. If the player does not
perform enough sampling, then he may easily choose the wrong lever: if many
turns are allowed, therefore, the optimal strategy may be to do more sampling.
Note that optimal learning does not necessarily lead to the acquisition of the
maximally efficient strategy: if learning the maximally efficient skill is costly, it
may not be worthwhile for the animal to learn it.
The two-armed bandit problem is perhaps the simplest learning problem
which involves a trade-off between exploration of the possibilities of the
environment, and exploitation of the strategies that have been discovered so far.
This is a dilemma that arises in almost any instrumental learning problem. If an
animal performs too much exploration, it may not spend enough time in
exploiting to advantage what it has learned: conversely, if an animal is incuri-
ous and does too little exploration, it may miss discovering some alternative
behaviours that would bring much higher returns, and it may spend all its time
exploiting an initial mediocre strategy. This is known as the exploration-
exploitation trade-off. During its life time, an animal must continually choose
whether to explore or whether to exploit what it knows already. One prediction
of optimality theory, therefore, is that an animal should make an optimal choice
in the exploration-exploitation trade-off. It may happen that, in following the
optimal strategy, the animal will not necessarily perform enough exploration to
achieve maximally efficient performance: it may be better to be incurious and
so avoid making too many mistakes during exploration.
Houston and McNamara (1988) and Mangel and Clark (1988) propose
explaining natural animal learning as optimal learning in this sense. This
approach is surely correct for learning in the sense of collecting and using
information, but it is in practice impossible to apply for learning in the sense of
the gradual improvement of skill.
Rather confusingly, ‘learning’ is used in the operational research and
dynamic programming literature (for example, de Groot (1970) or Dreyfus and
Law (1977)) to refer to the short-term collection of information for immediate
use. This is the sense of ‘learning’ as in ‘In the darkness of the room there
came a slow rustling sound and then the sound of the sofa being pushed across
the floor, so I learned I was dealing with a snake of remarkable size’, as
opposed to the sense of learning in ‘It took him three months of continuous
practice to learn to ride that unicycle.’ The short-term collection and use of
information (learning in the first sense) is a skill that can itself be gradually
improved by practice (learning in the second sense).
Krebs, Kacelnik, and Taylor (1978) performed one of the first experiments
to determine whether animals could learn to collect and use information in an
optimal way—indeed, one of the first experiments to determine whether
animals could learn an optimal strategy, where the optimality of the strategy
was determined by reference to a dynamic model. They kept birds (great tits)
in an artificial environment in which they were fed in a series of short sessions.
For the duration of each feeding session, the bird was presented with two
feeders, one of which would yield food more readily than the other. The birds
could only tell which feeder was better by trial and error, so that each session
was in effect a two-armed bandit problem, and successive sessions were
independent problems of the same type. A reasonable strategy for a bird—and
one that is very nearly optimal—is to sample from both feeders for a short time
at the start of each session, and then to feed for the rest of the session
exclusively from the feeder that gave out food most readily during the sampling
period. Over many sessions, the birds did indeed acquire near optimal strategies
of this type.
But the type of learning that I will be interested in is the improvement in
performance over many feeding sessions. In this example, the birds gradually
learned a maximally efficient strategy for the problem. Was this gradual learn-
ing also optimal? That is a very difficult question to answer, for two reasons.
The first reason is that it is exceedingly difficult to devise demonstrably
optimal learning strategies for any but the simplest of formal problems. Even
for the two-armed bandit problem, finding the optimal strategy is a formidable
computation. There is a straightforward general method, as explained in
de Groot (1970), for constructing optimal learning strategies, but the strategies
and the computation become impractically complex for any but small problems,
or problems for which simplifying assumptions are possible. The problem of
learning the optimal strategy in repeated two-armed bandit problems is far too
complex for it to be possible to determine the optimal learning strategy.
But a second and more fundamental difficulty is that an optimal learning
strategy is optimal only with respect to some prior assumptions conceming the
probabilities of encountering various possible environments. In an experiment
similar to that of Krebs et al (1978), the optimal strategy within a feeding ses-
sion depends on the distribution of yields from the feeders in different sessions.
After the great tits have experienced many feeding sessions, they have enough
information to ‘know’ the statistical distribution of the yields from each feeder,
and it makes sense to ask whether they can acquire the optimal strategy for the
distribution of yields that they have experienced. But to ask whether the birds’
learning is optimal over the whole experiment is a different matter: the optimal
strategy from the birds’ point of view depends on the birds’ prior expecta-
tions, and we have no means of knowing what these expectations are or what
they should optimally be.
In other words, to show that some particular learning method is optimal, it
is necessary to specify a probability distribution over the environments that the
animal may encounter, as noted by McNamara and Houston (1985). In prac-
tice, this is likely to be an insuperable difficulty in providing convincing quanti-
tative optimality explanations for any type of skill acquisition.
But although quantitative arguments on optimal learning may be difficult
to provide, some qualitative explanations involving optimal learning are com-
mon sense. Animals that are physically specialised so that they are adapted to a
particular way of life, for example, should in general be less curious and
exploratory than animals that are physically adapted to eat many different
foods. The reason for this is that a highly specialised animal is unlikely to dis-
cover viable alternative sources of food, while an omnivore lives by adapting
its behaviour to exploit whatever is most available.
I am not going to consider computational models of optimal learning, both
because of the technical difficulty of constructing optimal learning methods,
and because of the need to introduce explicit assumptions about a probability
distribution over possible environments. In any case, optimal learning will sel-
dom be a practical quantitative method of explaining animal learning.
Let us return to the second type of learning—learning of efficient stra-
tegies. By the learning of efficient strategies, I mean the acquisition of the abil-
ity to follow a strategy that is maximally efficient according to an intermediate
criterion. Note that this is learning of the ability to follow a maximally efficient
strategy: an animal with this ability need not always actually follow the
efficient strategy, but it can do so if it chooses to.
An example where optimality theory would predict that an animal should
learn an efficient strategy is that a prey animal should learn how to return to its
burrow as fast as possible from any point in its territory. Of course, the animal
need not always return to its burrow as fast as possible—but it is vitally neces-
sary that it should be able to do so if danger threatens. Similarly, it is not
important that an animal should follow an efficient strategy in searching for
food if it has just eaten, but it is advantageous for an animal to be able to fol-
low an efficient strategy in searching for food if it needs to.
Optimal learning will require the learning of an efficient strategy if the
following three conditions hold:
• The capacity for maximally efficient performance is valuable.
• Exploration is cheap.
• The time taken to learn the behaviour is short compared to the period of
time during which the behaviour will be used.
The third condition implies that the final level of performance reached is more
important than the time taken to learn it—hence optimal learning will consist of
learning the efficient strategy.
Animals need to be able to survive adverse conditions that are more
extreme than those they usually encounter. It is likely, therefore, that under nor-
mal circumstances most animals have some leisure for exploration; in other
words, the opportunity cost of exploration is usually small. Animals may, there-
fore, normally perform with slightly less than maximum efficiency in order to
be able to learn: maximally efficient performance is only occasionally neces-
sary. Efficient performance may be valuable for the animal to acquire either
because it is occasionally vital (as in avoiding predators), or else because it
continuously ensures a small competitive advantage (as in searching for food).
Even if these assumptions are not entirely satisfied, it is still plausible that
animals should learn efficient strategies. The point is that optimal learning will
entail learning an efficient strategy unless learning is expensive. Learning may
be expensive if mistakes are costly: prey animals would be unwise to attempt to
learn which animals were their predators by experience, for example. If animals
have innate behaviours that prevent them from making disastrous mistakes,
there is no reason why these behaviours should not be fine-tuned by instrumen-
tal learning. If an innately feared predator never behaves in a threatening way,
for example, the prey animal may lose some of its fear, and so cease spending
time and energy in avoiding the predators. Animals may have innate knowledge
or behaviours that prevent them from making costly initial mistakes, and these
innate behaviours may be progressively modified to become maximally efficient
behaviours through instrumental learning. In other words, innate knowledge
may take the pain out of learning.
In conclusion, the optimality assumption leads to the hypothesis that
animals will be able to learn efficient behavioural strategies. That is, after
sufficient experience in an environment, an animal should acquire the ability to
exploit that environment with maximal efficiency. Most of the thesis is devoted
to investigating what algorithms animals might use to learn in this way.
4.1. Learning Efficient Strategies and Conditioning
Instrumental conditioning and the learning of efficient strategies are related
concepts, but they are not at all the same. The motivation for studying instru-
mental conditioning is that it is a possible mechanism for a type of learning that
could be useful to an animal in the wild. However, operant conditioning theory
does not explicitly consider efficiency of strategy, and many aspects of instru-
mental conditioning experiments are not directly interpretable from the
viewpoint of optimality theory. Conversely, many experiments that test whether
animals can learn an efficient behavioural strategy are not easy to interpret as
instrumental conditioning experiments.
Although they are superficially similar to instrumental conditioning experi-
ments, experiments to test whether animals can learn maximally efficient
behavioural strategies are designed quite differently. A well designed ‘learning
of efficiency’ experiment should give animals both incentive and opportunity to
learn the maximally efficient strategy.
• Animals should be placed in an artificial environment for which the exper-
imenter can determine the maximally efficient behavioural strategy.
• The animals should be left in the artificial environment for long enough
for them to have ample opportunity of optimising their strategies. The
environment should not be changed during this time.
• The animals should have an adequate motivation to acquire the optimal
strategy, but the incentive should not be so severe that exploration of alter-
native behaviours is too costly.
• Control groups should be placed in artificial environments that differ in
chosen respects from the environment of the experimental group. Control
groups should be given the same opportunities of optimising their
behaviour as the experimental group.
Experiments designed in this way have two considerable advantages. First, it is
possible to devise experiments that simulate directly certain aspects of natural
conditions. Second, optimality theory ‘can be used to make quantitative predic-
tions about what strategy the animals will eventually learn.
Some conditioning experiments satisfy these design requirements; others
do not. For example, the phenomenon known as ‘extinction’ in conditioning
theory, in which a leamed response gradually extinguishes when the stimulus is
repeatedly presented without the reinforcer, is not directly interpretable in terms
of optimality. This is because in a typical instrumental conditioning experiment,
the purpose of testing animals under extinction is to determine the persistence
or ‘strength’ of the animal’s expectation of a reward following the stimulus.
This concept of ‘strength’ is difficult to interpret in terms of optimality. There
is often no ‘correct’ behaviour during extinction: whether an animal should
continue to respond for a long time or not depends entirely upon what types of
regularity it should expect to find in its environment. Since the extinction con-
dition occurs only once during the experiment, the animal is not given enough
data for it to work out what it ought to do.
If, on the other hand, extinction were to occur repeatedly in the course of
an experiment, the animal has the chance to learn how to react in an optimal
way. Kacelnik and Cuthill (1988) report an experiment in which starlings
repeatedly obtain food from a feeder. Each time it is used, the feeder will sup-
ply only a limited amount of food, so that as the birds continue to peck at the
feeder they obtain food less and less often, until eventually the feeder gives out
no more food at all. To obtain more food, they must then leave the feeder and
hop from perch to perch in their cage until a light goes on that indicates that
the feeder has been reset. In terms of conditioning theory, this experiment is
(roughly) a sequence of repeated extinctions of reinforcement that is contingent
upon pecking at the feeder: however, because the birds have the opportunity to
accumulate sufficient experience over many days, they have the necessary
information to find an optimal strategy for choosing when to stop pecking the
feeder.
I do not want at all to suggest that conditioning experiments are uninter-
pretable: they ask different questions and, perhaps, provide some more detailed
answers than optimality experiments do. However, the optimality approach is
both quantitative and strongly motivated, and I will argue in the rest of this
thesis that it is possible to classify and to implement a range of algorithms for
learning optimal behaviour.
5. Special-Purpose Learning Methods
McNamara and Houston (1980) describe how decision theory may be used
to analyse the choices that animals face in some simple tasks and to calculate
the efficient strategy. They point out that it is in principle possible that animals
might learn by statistical estimation of probabilities, and then use decision
theory to calculate their strategies, but they suggest that it is more likely that
animals learn by special-purpose, ad hoc methods. McNamara and Houston
give two main arguments in favour of this conclusion.
First, they point out that the calculations using decision theory are quite
complex even for simple problems, and that, in order to perform them, the
animals would need to collect a considerable amount of information that they
would not otherwise need. Their second argument is that animals do not face
the problem of determining optimal strategies in general: each species of animal
has evolved to face a limited range of learning problems in a limited range of
environments. Animals, therefore, should only need simple, special-purpose
heuristic learning methods for tuning their behaviour to the optimum. These
special-purpose strategies may break down in artificial environments that are
different from those in which the animals evolved.
The classic example of a heuristic, special-purpose, fallible learning
method of this type is the mechanism of imprinting as described by Lorenz. In
captivity, the ducklings may become imprinted on their keeper rather than on
their mother. There can be no doubt that many other special-purpose learning
methods exist, of exactly the type that McNamara and Houston describe.
But I do not think that McNamara and Houston’s arguments are convinc-
ing in general. Although some ‘innate special-purpose’ adaptive mechanisms
demonstrably exist, it is implausible that all animal learning can be described in
this way. Many species such as rats or starlings are opportunists, and can learn
to invade many different habitats and to exploit novel sources of food. Animals
can be trained to perform many different tasks in conditioning experiments, and
different species appear to learn in broadly similar ways. Is it not more plausi-
ble that there are generally applicable learning mechanisms, common to many
species, that can enable animals to learn patterns of behaviour that their ances-
tors never needed?
The next section presents a speculative argument that special-purpose
learning methods may sometimes evolve from general learning methods applied
to particular tasks. But the most convincing argument against the hypothesis of
special-purpose learning methods will be to show that simple general learning
methods are possible, which I will attempt to do later on.
6. Learning Optimal Strategies and Evolution
Evolution may speed up learning. If the learning of a critical skill is slow
and expensive, then there will be selective pressure to increase the efficiency of
learning. The efficiency of learning may be improved by providing what might
be called ‘innate knowledge’. By this, I do not necessarily mean knowledge in
the ordinary sense of knowing how to perform a task, or of knowing facts or
information. Instead, I mean by ‘innate knowledge’ any innate behavioural ten-
dency, desire, aversion, or area of curiosity, or anything else that influences the
course of learning. An animal that has evolved to have innate knowledge
appropriate for learning some skill does not necessarily know anything in the
normal sense of the word, but in normal circumstances it is able to learn that
skill more quickly than another animal without this innate knowledge.
A plausible hypothesis, therefore, is that useful behaviours and skills are
initially learnt by some individuals at some high cost: if that behaviour or skill
is sufficiently useful, the effect of selective pressure will be to make the
learning of it quicker, less costly, and more reliable. One origin of special-
purpose learning methods, therefore, may be as innate characteristics that have
evolved to speed up learning by a general purpose method.
7. How Can a Learning Method be General?
A ‘general learning mechanism’ is an intuitively appealing idea, but it is
difficult to pin down the sense in which a learning mechanism can be general,
because all learning must start from some innate structure. It has become a
commonplace in philosophy that learning from a tabula rasa is necessarily
impossible. Any form of learning or empirical induction consists of combining
a finite amount of data from experience with some prior structure. No learning
method, therefore, can be completely general, in the sense that it depends on no
prior assumptions at all.
However, there is another, more restricted sense in which a learning
method can be ‘general’. An animal has sensory abilities that enable it to dis-
tinguish certain aspects of its surroundings, it can remember a certain amount
about the recent past, and it has a certain range of desires, needs, and internal
sensations that it can experience. It can perform a variety of physical actions.
A behavioural strategy is a method of deciding what action to take on the basis
of the surroundings, of the recent past, and of the animal’s internal sensations
and needs. A strategy might be viewed as a set of situation-action rules, or as a
set of stimulus-response associations, where the situations or ‘stimuli’ consist of
the appearance of the surroundings, memories of the recent past, and internal
sensations and desires, and the ‘responses’ are the actions the animal can take. I
do not wish to imply that a strategy is actually represented as a set of
stimulus-response associations, although some strategies can be: the point is
merely that a strategy is a method of choosing an action in any situation.
Learning is a process of finding better strategies. Now, a given animal will be
able to distinguish a certain set of situations, and to perform a certain set of
actions, and it will have the potential ability to construct a certain range of
situation-action strategies. A general learning method is a method of using
experience to improve the current strategy, and, ideally, to find the strategy that
is the best one for the current environment, given the situations the animal can
recognise and the actions the animal can perform. As will be shown, there are
general methods of improving and optimising behavioural strategies in this
sense.
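As a minimal sketch of this picture (my own illustration, not code from the thesis), a strategy can be held as nothing more than a table from recognisable situations to actions, and one general way of improving it is to prefer, in each situation, the action with the highest learned value.

```python
# A strategy as a plain situation -> action table, improved from learned action
# values.  The value table Q and the improvement rule are illustrative assumptions.
from collections import defaultdict

Q = defaultdict(float)     # learned value of taking an action in a situation
policy = {}                # the current strategy: situation -> action

def improve(situation, available_actions):
    """Reset the strategy in one situation to the action with the best learned value."""
    policy[situation] = max(available_actions, key=lambda a: Q[(situation, a)])
    return policy[situation]
```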
8. Conclusion
I have argued that the optimality argument of behavioural ecology does
indicate an analysis of associative instrumental learning, but the connection
between the optimality argument and associative instrumental leaming is
indirect. Animals cannot directly learn to optimise their fitness, because they
cannot live their lives many times and learn to perpetuate their genes as much
as possible. Instead, animals may learn critical skills that improve their fitness.
By a ‘critical skill’, I mean a skill for which improvements in performance
result directly in increases in fitness. For example, the rate at which a bird can
bring food to its nest directly affects the number of chicks it can raise. In
many cases, an animal need not always perform its critical skills with maximal
efficiency; it is the capacity to perform with maximal efficiency when necessary
that is valuable. A bird may need to obtain food with maximal efficiency all the
time during the breeding season, but at other times it may have leisure.
One role of instrumental learning, therefore, is in acquiring the ability to
perform critical skills as efficiently as possible. This learning may be to some
extent incidental, in that performance does not always have to be maximally
efficient during learning: indeed, sub-optimal performance may be a necessary
part of learning.
According to the optimality argument, if an animal has many opportunities
to practise a critical skill, and if it is able to try out some alternative strategies
without disaster, then the animal should ultimately acquire a capacity for maxi-
mally efficient performance.
In the next chapter, I will describe how a wide range of learning problems
may be posed as problems of learning how to obtain delayed rewards. I will
argue that it is plausible that animals may represent tasks subjectively in this
way. After that, I will describe the established method for calculating an
optimal strategy, assuming that complete knowledge of the environment is
available. Then I will consider systematically what learning methods are possi-
ble. The learning methods will be presented as alternative algorithms for per-
forming dynamic programming. After that, I will describe computer implemen-
tations of some of these learning methods.
From now on, I will speak more often about hypothetical ‘agents’ or
‘learners’ rather than about ‘animals’, because the discussion will not be related
to specific examples of animal learning. The learning algorithms are strong
candidates as computational models of some types of animal learning, but they
may also have practical applications in the construction of learning machines.
Chapter 2
Learning Problems
In this chapter I will describe several problems to which the learning
methods are applicable, and I will indicate how the problems are related.
1. The Pole-Balancing Problem
A well-known example of a procedural learning problem, studied by
Michie (1967), and Barto, Sutton, and Anderson (1983), is the ‘pole-balancing’
problem, illustrated overleaf.
The cart is free to roll back and forth on the track between the two end-
blocks. The pole is joined to the cart by a hinge, and is free to move in the
vertical plane aligned with the track. There are two possible control actions,
which are to apply a constant force to the cart, pushing it either to the right or
to the left. The procedural skill to be acquired is that of pushing the cart to left
and right so as to keep the pole balanced more or less vertically above the cart,
and to keep the cart from bumping against the ends of the track. This skill
might be represented as a rule for deciding whether to push the cart to the right
or to the left, the decision being made on the basis of the state of the cart-pole
system.
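For reference, a rough sketch of the cart-pole state and one Euler step of its dynamics is given below (typical textbook constants; these particular values and limits are assumptions, not necessarily those used in the demonstrations described later).

```python
# A rough sketch of the cart-pole state and one Euler step of its dynamics.
# Constants are typical textbook values, chosen here only for illustration.
import math

GRAVITY, CART_MASS, POLE_MASS, POLE_HALF_LEN, FORCE, DT = 9.8, 1.0, 0.1, 0.5, 10.0, 0.02

def step(state, push_right):
    """state = (x, x_dot, theta, theta_dot); push_right chooses the +F or -F action."""
    x, x_dot, th, th_dot = state
    f = FORCE if push_right else -FORCE
    total_m = CART_MASS + POLE_MASS
    temp = (f + POLE_MASS * POLE_HALF_LEN * th_dot**2 * math.sin(th)) / total_m
    th_acc = (GRAVITY * math.sin(th) - math.cos(th) * temp) / (
        POLE_HALF_LEN * (4.0 / 3.0 - POLE_MASS * math.cos(th)**2 / total_m))
    x_acc = temp - POLE_MASS * POLE_HALF_LEN * th_acc * math.cos(th) / total_m
    return (x + DT * x_dot, x_dot + DT * x_acc, th + DT * th_dot, th_dot + DT * th_acc)

def failed(state, x_limit=2.4, theta_limit=12 * math.pi / 180):
    """The skill fails when the cart hits an end-block or the pole falls too far."""
    x, _, th, _ = state
    return abs(x) > x_limit or abs(th) > theta_limit
```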
There are several ways of posing this as a learning problem. If an ‘expert’
is available, who knows how to push the cart to balance the pole, then one
approach would be to train an automatic system to imitate the expert's
behaviour. If the learner is told which action would be correct at each time, the
learning problem becomes one of constructing a mapping from states of the cart