
CUFX212-16 CUFX212-Sun 978 0 521 85741 3 April 2, 2008 17:14

CHAPTER 16

Computational Models of Developmental Psychology

Shultz and Sirois

Downloaded from https://www.cambridge.org/core. New York University, on 26 Jun 2017 at 15:14:05, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511816772.020

1. Introduction

This chapter provides a comparative survey of computational models of psychological development. Because it is impossible to cover everything in this active field in such limited space, the chapter focuses on compelling simulations in domains that have attracted a range of competing models. This has the advantage of allowing comparisons between different types of models. After outlining some important developmental issues and identifying the main computational techniques applied to developmental phenomena, modeling in the areas of the balance scale, past tense, object permanence, artificial syntax, similarity-to-correlation shifts in category learning, discrimination-shift learning, concept and word learning, and abnormal development is discussed. This is followed by preliminary conclusions about the relative success of various types of models.

2. Developmental Issues

To understand how computational modeling can contribute to the study of psychological development, it is important to appreciate the enduring issues in developmental psychology. These include issues of how knowledge is represented and processed at various ages and stages, how children make transitions from one stage to another, and explanations of the ordering of psychological stages. Although many ideas about these issues have emerged from standard psychological research, these ideas often lack sufficient clarity and precision. Computational modeling forces precision because models that are not clearly specified will either not run or will produce inappropriate results.

3. Computational Techniques

The most common computational techniques applied to psychological development

are production systems, connectionist networks, dynamic systems, robotics, and Bayesian inference. Production systems represent long-term knowledge in the form of condition-action rules that specify actions to be taken or conclusions to be drawn under particular conditions (see Chapter 6 in this volume). Conditions and actions are composed of symbolic expressions containing constants as well as variables that can be bound to particular values. The rules are processed by matching problem conditions (contained in a working-memory buffer) against the condition side of rules. Ordinarily, one rule with satisfied conditions is selected and then fired, meaning that its actions are taken or its conclusions drawn. Throughout matching and firing, variable bindings must be consistently maintained so that the identities of particular objects referred to in conditions and actions are not confused.

Although first-generation production system models involved programmers writing rules by hand (Klahr & Wallace, 1976; Siegler, 1976), it is more interesting for understanding developmental transitions if rules can be acquired by the model in realistic circumstances. Several such rule-learning systems have been developed, including Soar (Newell, 1990), which learns rules by saving the results of look-ahead search through a problem space; ACT-R (Anderson, 1993), which learns rules by analogy to existing rules or by compiling less efficient rules into more efficient ones; and C4.5 (Quinlan, 1993), which learns rules by extracting information from examples of objects or events. Rule learning is a challenging computational problem because an indefinitely large number of rules can be consistent with a given data set, and because it is often unclear which rules should be modified and how they should be modified (e.g., by changing existing conditions, adding new conditions, or altering the certainty of conclusions).

Connectionism represents knowledge in a subsymbolic fashion via activation patterns on neuron-like units (see Chapter 2 in this volume). Connectionist networks process information by passing activation among units. Although some networks, including connection weight values, are designed by hand, it is more common in developmental applications for programmer-designed networks to learn their connection weights (roughly equivalent to neuronal synapses) from examples. Some other neural networks also construct their own topology, typically by recruiting new hidden units. The neural learning algorithms most commonly applied to development include back-propagation (BP) and its variants, cascade-correlation (CC) and its variants, simple recurrent networks (SRNs), encoder networks, auto-association (AA), feature-mapping, and contrastive Hebbian learning.

A dynamic system is a set of quantitative variables that change continually, concurrently, and interdependently over time in accordance with differential equations (see Chapter 4 in this volume). Such systems can be understood geometrically as changes of position over time in a space of possible system states. Dynamic systems overlap connectionism in that neural networks are often dynamic systems. In recurrent networks, activation updates depend in part on current activation values; and in learning networks, weight updates depend in part on current weight values. However, it is also common for dynamic-system models to be implemented without networks, in differential equations where a change in a dependent variable depends in part on its current value.

Another relatively new approach is developmental robotics, a seemingly unlikely marriage of robotics and developmental psychology (Berthouze & Ziemke, 2003). A principal attraction for roboticists is to create generic robots that begin with infant skills and learn their tasks through interacting with adults and possibly other robots. The primary hook for developmentalists is the challenge of placing their computational models inside of robots operating in real environments in real time.

Bayesian inference, which is rapidly gaining ground in modeling a variety of cognitive phenomena, is starting to be applied to


developmental problems (see Chapter 3 in this volume). At its heart is the use of Bayes' rule to infer posterior probabilities (of a hypothesis given some data) from products of prior and likelihood probabilities divided by the sum of such products across all known hypotheses. The CC and C4.5 algorithms are discussed here in some detail because they are not treated elsewhere in this volume, but have been used in a variety of developmental simulations.

4. Cascade-Correlation

CC networks begin with just input and output units, ordinarily fully connected. They are feed-forward networks trained in a supervised fashion with patterns representing particular input and target output values. Any internal, hidden units required to deal with nonlinearities in the training patterns are recruited one at a time, as needed. The CC algorithm alternates between output and input phases to reduce error and recruit helpful hidden units, respectively (Fahlman & Lebiere, 1990). The function to minimize during output phase is error at the output units:

E = \sum_o \sum_p (A_{op} - T_{op})^2    (16.1)

where A is actual output activation and T is target output activation for unit o and pattern p. Error minimization is typically accomplished with the Quickprop algorithm (Fahlman, 1988), a fast variant of the generalized delta rule that uses curvature as well as slope of the error surface to compute weight changes. When error can no longer be reduced by adjusting weights entering the output units, CC switches to input phase to recruit a hidden unit to supply more computational power.

In input phase, a pool of usually eight candidate hidden units with typically sigmoid activation functions have random trainable weights from the input units and any existing hidden units. These weights are trained by attempting to maximize a covariance C between candidate-hidden-unit activation and network error:

C = \frac{\sum_o \left| \sum_p (h_p - \langle h \rangle)(e_{op} - \langle e_o \rangle) \right|}{\sum_o \sum_p (e_{op} - \langle e_o \rangle)^2}    (16.2)

where h_p is activation of the candidate hidden unit for pattern p, <h> is mean activation of the candidate hidden unit for all patterns, e_op is residual error at output o for pattern p, and <e_o> is mean residual error at output o for all patterns. C represents the absolute covariance between hidden-unit activation and network error summed across patterns and output units and standardized by the sum of squared error deviations. The same Quickprop algorithm used for output training is used here, but with the goal of maximizing these correlations rather than reducing network error. When the correlations stop increasing, the candidate with the highest absolute covariance is installed into the network, with its just-trained input weights frozen, and a random set of output weights with the negative of the sign of the covariance C. The other candidates are discarded.

The basic idea of input phase is to select a candidate whose activation variations track current network error. Once a new recruit is installed, CC returns to output phase to resume training of weights entering output units to decide how to best use the new recruit to reduce network error. Standard CC networks have a deep topology with each hidden unit occupying its own layer.

A variant called sibling-descendant CC (SDCC) dynamically decides whether to install each new recruit on the current highest layer of hidden units (as a sibling) or on its own new layer (as a descendant; Baluja & Fahlman, 1994). SDCC creates a wider variety of network topologies, normally with less depth, but otherwise performs much the same as standard CC on simulations (Shultz, 2006).

CC and SDCC are constructivist algorithms that are theoretically compatible with verbally formulated constructivist


theories of development (Piaget, 1954). Qualitative changes in cognition and behavior potentially can be attributed to qualitative changes in underlying computational resources, namely recruited hidden units and their connectivity.

5. C4.5

In some ways, C4.5 is the symbolic analog of CC and accordingly has been applied to several developmental domains. To learn to classify examples, C4.5 builds a decision tree that can be transformed into production rules (Quinlan, 1993). C4.5 processes a set of examples in attribute-value format and learns how to classify them into discrete categories using information on the correct category of each example. A decision tree contains leaves, each indicating a class, and branching nodes, each specifying a test of a single attribute with a subtree for each attribute value. C4.5 learning proceeds as follows:

1. If every example has the same predicted attribute value, return it as a leaf node.
2. If there are no attributes, return the most common attribute value.
3. Otherwise, pick the best attribute, partition the examples by values, and recursively learn to grow subtrees below this node after removing the best attribute from further consideration.

C4.5 creates smaller and more general trees by picking the attribute that maximizes information gain. Symbolic rules can be derived from a decision tree by following the branches (rule conditions) out to the leaves (rule actions). C4.5 is a reasonable choice for modeling development because it learns rules from examples, just as connectionist models do, and without the background knowledge that Soar and ACT-R often require. Like CC, C4.5 grows as it learns, building its later knowledge on top of existing knowledge as a decision tree is being constructed. The two algorithms thus make an interesting and parallel contrast between constructive neural and symbolic systems.

A hybrid learning system that does Connectionist Learning with Adaptive Rule Induction Online (CLARION; Sun, Slusarz, & Terry, 2005) may also be worth trying in a developmental context. CLARION works on two levels: BP networks learn from examples, and explicit rules can be extracted from these networks.

6. Balance Scale

One of the first developmental tasks to attract a wide range of computational models was the balance scale. This task presents a child with a rigid beam balanced on a fulcrum (Siegler, 1976). This beam has pegs spaced at regular intervals to the left and right of the fulcrum. An experimenter places a number of identical weights on a peg on the left side and a number of weights on a peg on the right side. While supporting blocks prevent the beam from tipping, the child predicts which side of the beam will descend, or whether the scale will balance, if the supporting blocks are removed.

Children are typically tested with six types of problems on this task. Two of these problems are simple because one cue (weight or distance from the fulcrum) perfectly predicts the outcome, whereas the other cue is constant on both sides. A third is even simpler, with identical cues on each side of the fulcrum. The other three problem types are more complex because the two cues conflict, weight predicting one outcome and distance predicting a different outcome. The pattern of predictions across these problem types helps to diagnose how a child solves these problems.

Despite ongoing debate about details, it is generally agreed that there are two major psychological regularities in the balance-scale literature: stage progressions and the torque-difference effect. In stage 1, children use weight information to predict that the side with more weights will descend or


that the scale will balance when the two sides have equal weights (Siegler, 1976). In stage 2, children start to use distance information when the weights are equal on each side, predicting that in such cases the side with greater distance will descend. In stage 3, weight and distance information are emphasized equally, and the child guesses when weight and distance information conflict on complex problems. In stage 4, children respond correctly on all problem types. The torque-difference effect is that problems with large torque differences are easier for children to solve than problems with small torque differences (Ferretti & Butterfield, 1986). Torque is the product of weight × distance on a given side; torque difference is the absolute difference between the torque on one side and the torque on the other side.

The ability of several different computational models to capture these phenomena is summarized in Table 16.1. The first four rows in Table 16.1 describe symbolic, rule-based models, and the last two rows describe connectionist models.

Table 16.1: Characteristics of balance-scale models

Author                   Model   All 4 stages   Torque-difference effect
Langley (1987)           Sage    no (a)         no
Newell (1990)            Soar    no (b)         no
Schmidt & Ling (1996)    C4.5    yes (c)        yes (d)
van Rijn et al. (2003)   ACT-R   yes (e)        yes (f)
McClelland (1989)        BP      no (g)         yes
Shultz et al. (1994)     CC      yes            yes

(a) Sage only learned stage 3, not stages 1, 2, and 4.
(b) Soar learned stages 1-3, but not stage 4.
(c) C4.5 learned all four stages, but to get the correct ordering of the first two stages, it was necessary to list the weight attributes before the distance attributes because C4.5 breaks ties in information gain by picking the first-listed attribute with the highest information gain.
(d) To capture the torque-difference effect, C4.5 required a redundant coding of weight and distance differences between one side of the scale and the other. In addition to doing a lot of the work that a learning algorithm should be able to do on its own, this produced strange rules that children never show.
(e) The ordering of stages 1 and 2 and the late appearance of addition and torque rules in ACT-R were engineered by programmer settings; they were not a natural result of learning or development. The relatively modern ACT-R model is the only balance-scale simulation to clearly distinguish between an addition rule (comparing the weight + distance sums on each side) and the torque rule.
(f) ACT-R showed a torque-difference effect only with respect to differences in distance but not weight, and only in the vicinity of stage transitions, not throughout development as children apparently do.
(g) BP oscillated between stages 3 and 4, never settling in stage 4.

In one of the first developmental connectionist simulations, McClelland (1989) found that a static BP network with two groups of hidden units segregated for either weight or distance information developed through the first three of these stages and into the fourth stage. However, these networks did not settle in stage 4, instead continuing to cycle between stages 3 and 4. The first CC model of cognitive development naturally captured all four balance-scale stages, without requiring segregation of hidden units (Shultz, Mareschal et al., 1994).
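Siegler's stage progression and the torque comparison described above can be sketched as simple decision rules. This is an illustrative Python sketch with invented function and argument names, not a reimplementation of any of the models in Table 16.1:

```python
import random

def predict(stage, lw, ld, rw, rd):
    """Predict a balance-scale outcome: 'left', 'right', or 'balance'.

    lw/rw: number of weights on the left/right peg
    ld/rd: distance of that peg from the fulcrum
    stage: Siegler's stage 1-4, as summarized in the text
    """
    def compare(left, right):
        if left > right:
            return "left"
        if right > left:
            return "right"
        return "balance"

    if stage == 1:  # stage 1: weight information only
        return compare(lw, rw)
    if stage == 2:  # stage 2: distance breaks ties when weights are equal
        return compare(lw, rw) if lw != rw else compare(ld, rd)
    if stage == 3:  # stage 3: both cues weighted equally; guess on conflict
        w, d = compare(lw, rw), compare(ld, rd)
        if w == d:
            return w
        if w == "balance":
            return d
        if d == "balance":
            return w
        return random.choice(["left", "right", "balance"])
    # stage 4: compare torques (weight x distance on each side)
    return compare(lw * ld, rw * rd)
```

On a conflict problem with three weights at distance 1 versus two weights at distance 2, stage 1 predicts that the left side will descend (more weights), whereas stage 4 compares torques (3 versus 4) and predicts the right side.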


The ability of these network models to capture stages 1 and 2 is due to a bias toward equal-distance problems in the training set, that is, problems with weights placed equally distant from the fulcrum on each side. This bias, justified by noting that children rarely place objects at differing distances from a fulcrum but have considerable experience lifting differing numbers of objects, forces a network to emphasize weight information first because weight information is more relevant to reducing network error. When weight-induced error has been reduced, then the network can turn its attention to distance information. A learning algorithm needs to find a region of connection-weight space that allows it to emphasize the numbers of weights on the scale before moving to another region of weight space that allows correct performance on most balance-scale problems. A static network such as BP, once committed to using weight information in stage 1, cannot easily find its way to a stage-4 region by continuing to reduce error. In contrast, a constructive algorithm such as CC has an easier time with this move because each newly recruited hidden unit changes the shape of connection-weight space by adding a new dimension.

Both of the connectionist models readily captured the torque-difference effect. Such perceptual effects are natural for neural models that compute a weighted sum of inputs when updating downstream units. This ensures that larger differences on the inputs create clearer activation patterns downstream at the hidden and output units. In contrast, crisp symbolic rules care more about direction of input differences than about input amounts, so the torque-difference effect is more awkward to capture in rule-based systems.

7. Past Tense

The morphology of the English past tense has generated considerable psychological and modeling attention. Most English verbs form the past tense by adding the suffix -ed to the stem, but about 180 have irregular past-tense forms. Seven psychological regularities have been identified:

1. Children begin to overregularize irregular verbs after having correctly produced them.
2. Frequent irregulars are more likely to be correct than infrequent ones.
3. Irregular and regular verbs that are similar to frequent verbs are more likely to be correct. (Two regularities are combined here into one sentence.)
4. Past-tense formation is quicker with consistent regulars (e.g., like) than with inconsistent regulars (e.g., bake), which are in turn quicker than irregulars (e.g., make).
5. Migrations occurred over the centuries from Old English such that some irregulars became regular and some regulars became irregular.
6. Double-dissociations exist between regulars and irregulars in neurological disorders, such as specific language impairment and Williams syndrome.

The classical rule-rote hypothesis holds that the irregulars are memorized and the add -ed rule is applied when no irregular memory is retrieved (Pinker, 1999), but this has not resulted in a successful published computational model. The ability of several different computational models to capture past-tense phenomena is summarized in Table 16.2. All of the models were trained to take a present-tense verb stem as input and provide the correct past-tense form.

One symbolic model used ID3, a predecessor of the C4.5 algorithm that was discussed in Section 5, to learn past-tense forms from labeled examples (Ling & Marinov, 1993). Like C4.5, ID3 constructs a decision tree in which the branch nodes are attributes, such as a particular phoneme in a particular position, and the leaves are suffixes, such as the phoneme -t. Each example describes a verb stem, for example, talk, in terms of its phonemes and their positions, and is labeled with a particular past-tense ending, for example, talk-t. Actually, because there are several such endings, the


model used a small grove of trees. Instead of a single rule for the past tense as in the rule-rote hypothesis, this grove implemented several rules for regular verbs and many rules for irregular verbs. Coverage of over-regularization in the ID3 model was due to inconsistent and arbitrary use of the m parameter that was originally designed to control the depth of decision trees (Quinlan, 1993). Although m was claimed by the authors to implement some unspecified mental capacity, it was decreased here to capture development, but increased to capture development in other simulations, such as the balance scale (Schmidt & Ling, 1996).

The ACT-R models started with three handwritten rules: a zero rule, which does not change the verb stem; an analogy rule, which looks for analogies to labeled examples and thus discovers the -ed rule; and a retrieval rule, which retrieves the past-tense form from memory (Taatgen & Anderson, 2002). Curiously, this model rarely applied the -ed rule because it mostly used retrieval; the -ed rule was reserved for rare words, novel words, and nonsense words.

Table 16.2: Coverage of English past-tense acquisition

                       Ling &    Taatgen &   Plunkett &   Daugherty &   Hare &   Westermann
                       Marinov   Anderson    Marchman     Seidenberg    Elman    (1998)
Authors                (1993)    (2002)      (1996)       (1992)        (1995)
Model                  ID3       ACT-R       BP           BP            BP       CNN (a)
Over-regularization    yes       yes         yes          –             –        yes
Frequency              yes       yes         –            yes           –        yes
Similarity-irregulars  no        no          –            yes           –        yes
Similarity-regulars    no        no          –            yes           –        –
Reaction time          no        no          –            yes           –        –
Migration              no        no          –            –             yes      –
Double-dissociation    no        no          –            –             –        yes

(a) CNN is a Constructivist Neural Network model with Gaussian hidden units.

As shown in Table 16.2, none of the computational models covers many past-tense phenomena, but collectively, a series of neural-network models does fairly well. Several of these phenomena naturally emerge from neural models, where different past-tense patterns are represented in common hidden units. In these neural models, less frequent and highly idiosyncratic verbs cannot easily resist the pull of regularization and other sources of error. Being similar to other verbs can substitute for high frequency. These effects occur because weight change is proportional to network error, and frequency and similarity effects create more initial error. In symbolic models, memory for irregular forms is searched before the regular rule is applied, thus slowing responses to regular verbs and creating reaction times opposite to those found in people.

8. Object Permanence

A cornerstone acquisition in the first two years is belief in the continued existence of hidden objects. Piaget (1954) found that object permanence was acquired through six stages, that the ability to find hidden objects emerged in the fourth stage between the ages of eight and twelve months, and that a full-blown concept of permanent objects independent of perception did not occur until about two years. Although data collected using Piaget's object-search methods were robust, recent work using different methodologies suggested that he might have underestimated the cognitive abilities of very young infants.


[Figure 16.1 diagram: an object recognition network (inputs to hiddens) and a trajectory prediction network (visual memory hiddens to shared hiddens to tracking outputs) feed a response integration network with reaching outputs.]

Figure 16.1. Modular network topology for looking and reaching. The network in the lower left learns to recognize objects based on their static features. The network on the right learns to follow the trajectory of moving objects. The network in the upper left learns to reach for objects by integrating visual tracking and object recognition. (Adapted, with permission, from Mareschal et al., 1999.)
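A minimal sketch of the modular topology in Figure 16.1 can be written with numpy. The layer sizes, variable names, and untrained random weights here are illustrative assumptions; the published model's architecture details and training procedure are not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    # Random untrained weights; the real model learns these from experience.
    return rng.normal(scale=0.1, size=(n_in, n_out))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes (assumptions, not from the chapter).
n_retina, n_memory, n_shared, n_track = 25, 30, 15, 25
n_obj_hidden, n_object, n_reach = 8, 10, 20

W_vis_to_memory = layer(n_retina + n_memory, n_memory)   # recurrent visual memory
W_memory_to_shared = layer(n_memory, n_shared)
W_shared_to_track = layer(n_shared, n_track)             # trajectory-prediction module
W_retina_to_objhid = layer(n_retina, n_obj_hidden)       # object-recognition network
W_objhid_to_object = layer(n_obj_hidden, n_object)
W_integrate = layer(n_shared + n_object, n_reach)        # response-integration module

def step(retina, memory):
    """One time step: returns (tracking output, reaching output, new memory)."""
    # 'Where' route: recurrent visual memory feeds the shared hidden units.
    memory = sigmoid(np.concatenate([retina, memory]) @ W_vis_to_memory)
    shared = sigmoid(memory @ W_memory_to_shared)
    tracking = sigmoid(shared @ W_shared_to_track)  # looking uses only this route
    # 'What' route: static object features.
    obj = sigmoid(sigmoid(retina @ W_retina_to_objhid) @ W_objhid_to_object)
    # Reaching requires integrating both routes, hence the developmental lag.
    reaching = sigmoid(np.concatenate([shared, obj]) @ W_integrate)
    return tracking, reaching, memory

tracking, reaching, memory = step(rng.random(n_retina), np.zeros(n_memory))
```

The design point the sketch illustrates is structural: the tracking output depends only on the where route, while the reaching output cannot be computed without combining both routes.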

An influential series of experiments suggested that infants as young as 3.5 months understand the continued existence of hidden objects if tested by where they look rather than where they reach (Baillargeon, 1987). Infants were familiarized to a simple perceptual sequence and then shown two different test events: one that was perceptually more novel but consistent with the continued existence of objects, and one that was perceptually more familiar but that violated the notion that hidden objects continue to exist. Infants looked longer at the impossible event, which was interpreted as evidence that they understand the continued existence of occluded objects.

Computational modeling has clarified how infants could reveal an object concept with looking but not by reaching. In one model, perceptual input about occluded objects fed a hidden layer with recurrent connections, which in turn fed two distinct output systems: a looking system and a reaching system (Munakata et al., 1997). Both systems learned to predict the location of an input object, but the reaching system lagged developmentally behind the looking system because of differential learning rates. The same underlying competence (understanding where an object should be) thus led to different patterns of performance, depending on which system was used to assess that competence.

A different model of the lag between looking and reaching (Mareschal, Plunkett, & Harris, 1999) used a modular neural network system implementing the dual-route hypothesis of visual processing (i.e., that visual information is segregated into a what ventral stream and a where dorsal stream). Like the previous model, this one had a shared bank of hidden units receiving input from a recurrent bank of visual-memory inputs. As shown in Figure 16.1, these hidden units fed two output modules: a trajectory-prediction module and a response-integration module. The former was trained to predict the position of an object on a subsequent time step. The latter was trained to combine hidden-unit activations in the trajectory module with object-recognition inputs. Here, the time lag between looking and reaching was explained by the what and where streams needing to be integrated in reaching tasks, but not in looking tasks, which can rely solely on the where stream. The model uniquely predicted developmental lags for any task requiring integration of the two visual routes.

Although both models simulated a lag between looking and reaching, some embodied simulations integrated the what and where


functions into a how function (Schlesinger, 2004; Schlesinger, Parisi, & Langer, 2000). Here, the problem of reaching was constrained by mechanical and kinematic properties of the task that facilitate learning.

A novel experiment used a primitive robot to produce looking-time data in an object-permanence experiment measuring looking (Baillargeon, 1986). This robot knew nothing about objects and their permanence, but was designed to habituate to visual stimuli that it captured through video cameras (Lovett & Scassellati, 2004). Its behavior as a surrogate participant in the infant experiment suggested that looking-time differences between possible and impossible events could be due to mere habituation to stimuli, having nothing to do with knowledge of objects or their permanence.

A major subtopic in development of the object concept concerns the so-called A-not-B error. Between seven and twelve months of age, infants search for a hidden object in one location (conventionally called location A), but when it is moved to another hiding place (location B), they persevere in searching at A (Piaget, 1954). Ten major regularities have been noted in the extensive psychological literature on the A-not-B error.

1. Age. Before seven to eight months, infants do not search for a hidden object. Between seven and twelve months, they perseverate in searching A. After twelve months, they search in B.
2. Delay. No error with no delay between hiding and search; error increases with amount of delay.
3. Décalage. Well before twelve months, infants look longer at an event in which the hidden object is retrieved from a different place than where they last saw it.
4. Distinctiveness. Making the hiding places more distinctive reduces error, for example, by using distinctive covers, landmarks, or familiar environments. Conversely, using identical covers increases error.
5. Multiple locations. Multiple hiding locations decrease error.
6. Reaching to A. The more reaches to A, the more likely the error.
7. Interestingness. Less error the more interesting the toy; hiding a cookie reduces error.
8. Objectless. Cuing a cover is sufficient to elicit the error without any hidden object.
9. Object helpful. Less error in covers-only condition when a toy is hidden at B.
10. Adult error. Even adults can make this error under certain conditions.

Two quite different computational models addressed these regularities, one using a feed-forward neural network with self-recurrent excitatory connections within hidden and output layers to maintain representations over time and inhibitory connections within these layers to implement competition (Munakata, 1998). The network received sequential input about three hiding locations, two types of cover, and two toys. These inputs fed a bank of self-recursive hidden units representing the three locations. These hidden units fed two separate banks of outputs: gaze/expectation units, representing where infants look, and reach units, representing where infants reach. Because reaching was permitted only near the end of a trial, these units were updated less frequently than the gaze/expectation units, which produced earlier looking than reaching. The network was trained with a few standard A-not-B observations. Learning of feed-forward weights was done with a zero-sum Hebbian rule that increased a weight when its sending unit's activation was higher than the mean of its layer and decreased a weight when its sending unit's activation was lower than the mean of its layer. Recurrent and inhibitory weights were fixed. Age effects were covered by increasing the recurrent weights from .3 to .5, that is, stronger recurrence with increasing age.

The other model of the A-not-B error was a dynamic-system model (Thelen et al., 2001). A decision to reach in a particular direction (A or B) was modeled as activation in a dynamic field, expressed in a differential


460 shultz and sirois

equation, the details of which can be found elsewhere (Thelen et al., 2001). Somewhat informally,

    activation = −decay + cooperativity + h + noise + task + cue + reach.memory    (16.3)

where decay was a linear decrease in activation, cooperativity included both local excitation and distant inhibition integrated over field positions, h was the resting activation level of the field, Gaussian noise ensured that activations were probabilistic rather than deterministic, task reflected persisting features of the task environment, cue reflected the location of the object or attention to a specific cover, and reach memory reflected the frequency and recency of reaching in a particular direction. Development relied on the resting activation level of the field. When h was low, strong inputs predominated and activation was driven largely by inputs and less by local interactions, a condition known as noncooperation. When h was large, many field sites interacted, and local excitation was amplified by neighboring excitation and distant inhibition, allowing cooperation and self-sustained excitation, even without continual input. Parameter h was set to −6 for cooperation and −12 for noncooperation. All parameters but h were differential functions of field position and time. Variation of the cue parameter implemented different experimental conditions. Other parameters were held constant, but estimated to fit psychological data.

The differential equation simulated up to 10 sec of delay in steps of 50 msec. An above-threshold activation peak indicated a perseverative reach when centered on the A location or a correct reach when centered on the B location. This idea was supported by findings that activity in populations of neurons in monkey motor and premotor cortex became active in the 150 msec between cue and reach, and predicted the direction of ordinary reaching (Amirikian & Georgopoulos, 2003). See Figure 4.3 in this volume, which shows that neighboring excitation sustains local activation peaks whereas global inhibition prevents diffusion of peaks and stabilizes against competing inputs.

In simulation of younger infants, implemented by noncooperation (h = −12), a cue to location B initially elicited activation, which then decayed rapidly, allowing memory of previous reaching to A to predominate. But in simulations of young infants allowed to reach without delay, the initial B activation tended to override memory of previous A reaches. In simulation of older infants, implemented by cooperation (h = −6), the ability to sustain initial B activation across delays produced correct B reaches despite memory of reaching to A. This model suggested that the A-not-B error has more to do with the dynamics of reaching for objects than with the emergence of a concept of permanence.

Table 16.3: Model coverage of the A-not-B error in object permanence

Regularity            Munakata (1998)   Thelen et al. (2001)
Age                   yes               yes
Delay                 yes               yes
Décalage              yes               no
Distinctiveness       yes               yes
Multiple locations    yes               no
Reaching to A         yes               no
Interestingness       no                no
Objectless            yes               yes
Object helpful        yes               no
Adult error           no                no

Comparative coverage of the psychological regularities by these two models is indicated in Table 16.3. The neural-network model covered almost all of these regularities, and it is possible that the dynamic-system model could also achieve this by manipulation of its existing parameters. It would be interesting to see if the dynamic-system model could be implemented in a neurally plausible way. Both models were highly designed by fixing weights in the


neural model and by writing equations and fitting parameters in the dynamic-system model. Modeler-designed parameter changes were used to implement age-related development in both models.

9. Artificial Syntax

An issue that attracted considerable simulation activity concerns whether cognition should be interpreted in terms of symbolic rules or subsymbolic neural networks. For example, it was argued that infants' ability to distinguish one syntactic pattern from another could only be explained by a symbolic rule-based account (Marcus et al., 1999). After being familiarized to sentences in an artificial language with a particular syntactic pattern (such as ABA), infants preferred to listen to sentences with an inconsistent syntactic form (such as ABB). The claim about the necessity of rule-based processing was contradicted by a number of neural-network models showing more interest in novel than familiar syntactic patterns (Altmann & Dienes, 1999; Elman, 1999; Negishi, 1999; Shultz, 1999; Shultz & Bale, 2001; Sirois, Buckingham, & Shultz, 2000). This principal effect from one simple experiment is rather easy for a variety of connectionist learning algorithms to cover, probably due to their basic ability to learn and generalize. In addition to this novelty preference, there were a few infants who exhibited a slight familiarity preference, as evidenced by slightly more recovery to consistent novel sentences than to familiar sentences.

One of the connectionist simulations (Shultz & Bale, 2001) was replicated (Vilcu & Hadley, 2005) using batches of CC encoder networks, but it was claimed that this model did not generalize well and merely learned sound contours rather than syntax. Like other encoder networks, these networks learned to reproduce their inputs on their output units. Discrepancy between inputs and outputs is considered as error, which networks learn to reduce. Infants are thought to construct an internal model of stimuli to which they are being exposed and then differentially attend to novel stimuli that deviate from their representations (Cohen & Arthur, 1983). Because neural learning is directed at reducing the largest sources of error, network error can be considered as an index of future attention and learning.

The CC simulations captured the essentials of the infant data: more interest in sentences inconsistent with the familiar pattern than in sentences consistent with that pattern and occasional familiarity preferences (Shultz & Bale, 2001). In addition, CC networks showed the usual exponential decreases in attention to a repeated stimulus pattern that are customary in habituation experiments and generalized both inside and outside of the range of training patterns. Follow-up simulations clarified that CC networks were sensitive to both phonemic content and syntactic structure, as infants probably are (Shultz & Bale, 2006).

A simple AA network model contained a single layer of interconnected units, allowing internal circulation of unit activations over multiple time cycles (Sirois et al., 2000). After learning the habituation sentences with a delta rule, these networks needed more processing cycles to learn inconsistent than consistent test sentences. The mapping of processing cycles to recovery from habituation seems particularly natural in this model.

A series of C4.5 models failed to capture any essential features of the infant data (Shultz, 2003). C4.5 could not simulate familiarization because it trivially learned to expect the only category to which it was exposed. When trained instead to discriminate the syntactic patterns, it did not learn the desired rules except when these rules were virtually encoded in the inputs.

Three different SRN models covered the principal finding of a novelty preference, but two of these models showed such a strong novelty preference that they would not likely show any familiarity preference. Two of these SRN models also were not replicated by other researchers (Vilcu & Hadley, 2001, 2005). Failure to replicate seems surprising with computational models and probably deserves further study.
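The encoder-network logic can be made concrete with a minimal, hypothetical sketch rather than any of the published models: a linear auto-associator trained with the delta rule on ABA-patterned sentences, with syllables represented as random feature vectors (all sizes and learning rates here are arbitrary choices for illustration). After familiarization, reconstruction error, the model's stand-in for attention, is larger for inconsistent ABB test sentences than for consistent ABA ones.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.standard_normal((8, 5))  # 8 artificial syllables, 5 features each

def sentence(pattern):
    """Concatenate three syllable vectors in an ABA or ABB arrangement."""
    a, b = rng.choice(len(vocab), size=2, replace=False)
    order = {"ABA": (a, b, a), "ABB": (a, b, b)}[pattern]
    return np.concatenate([vocab[i] for i in order])

def reconstruction_error(W, x):
    """Mean squared discrepancy between an input and its reconstruction."""
    return float(np.mean((x - W @ x) ** 2))

# Familiarization: delta-rule learning drives outputs toward the inputs.
W = np.zeros((15, 15))
for _ in range(300):
    for x in (sentence("ABA") for _ in range(10)):
        W += 0.02 * np.outer(x - W @ x, x)

consistent = np.mean([reconstruction_error(W, sentence("ABA")) for _ in range(100)])
inconsistent = np.mean([reconstruction_error(W, sentence("ABB")) for _ in range(100)])
print(f"ABA error {consistent:.4f}  ABB error {inconsistent:.4f}")
```

Because error indexes attention in these accounts, the larger ABB error corresponds to the novelty preference measured in looking time; stopping the familiarization loop earlier leaves more residual error on all items, which is the kind of shallower training under which a slight familiarity preference might also surface.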


Table 16.4: Coverage of artificial syntax phenomena

Author                     Model   Novelty preference   Familiarity preference   Simulation replicated
Altmann & Dienes (1999)    SRN     yes                  –                        no
Elman (1999)               SRN     yes                  no                       no
Negishi (1999)             SRN     yes                  no                       –
Shultz & Bale (2001)       CC      yes                  yes                      yes
Sirois et al. (2000)       AA      yes                  yes                      –
Shultz (2003)              C4.5    no                   no                       –
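The AA model in the table above measured recovery from habituation as processing cycles needed to relearn a test sentence. A rough, hypothetical sketch of that measure (not the published Sirois et al. implementation; sizes, rates, and the error criterion are arbitrary): after delta-rule habituation to ABA sentences, relearning an inconsistent ABB item to criterion takes more cycles than relearning a consistent ABA item.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = rng.standard_normal((8, 5))  # hypothetical syllable feature vectors

def sentence(pattern):
    a, b = rng.choice(len(vocab), size=2, replace=False)
    order = {"ABA": (a, b, a), "ABB": (a, b, b)}[pattern]
    return np.concatenate([vocab[i] for i in order])

def delta_step(W, x, lr=0.02):
    W += lr * np.outer(x - W @ x, x)  # delta rule on the reconstruction error

def cycles_to_learn(W, x, criterion=0.05, max_cycles=2000):
    """Processing cycles needed to relearn one test sentence to criterion."""
    W = W.copy()  # start from the habituated weights
    for cycle in range(1, max_cycles + 1):
        if np.mean((x - W @ x) ** 2) < criterion:
            return cycle
        delta_step(W, x)
    return max_cycles

# Habituation phase: learn the ABA regularity.
W = np.zeros((15, 15))
for _ in range(2000):
    delta_step(W, sentence("ABA"))

consistent = np.mean([cycles_to_learn(W, sentence("ABA")) for _ in range(20)])
inconsistent = np.mean([cycles_to_learn(W, sentence("ABB")) for _ in range(20)])
print(f"cycles to criterion: ABA {consistent:.1f}, ABB {inconsistent:.1f}")
```

Mapping cycles to looking time treats slower relearning as greater recovery of attention, the reading the text describes as particularly natural for this model.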

Comparative performance of each of these models is summarized in Table 16.4. Dashes in the table indicate uncertainty. It is possible that the SRN models and the AA model might be able to show some slight familiarity preference if not so deeply trained.

10. Similarity-to-Correlation Shift in Category Learning

Research on category learning with a familiarization paradigm showed that four-month-olds process information about independent features of visual stimuli, whereas ten-month-olds additionally abstract relations among those features (Younger & Cohen, 1986). These results addressed a classic controversy about whether perceptual development involves integration or differentiation of stimulus information, integration being favored by developing the ability to understand relations among already discovered features. Following repetitions of visual stimuli with correlated features, four-month-olds recovered attention to stimuli with novel features more than to stimuli with either correlated or uncorrelated familiar features. However, ten-month-olds recovered attention both to stimuli with novel features and to stimuli with familiar uncorrelated features more than to stimuli with familiar correlated features. Uncorrelated test items violated the correlations in the training items. The four-month-old finding is termed a similarity effect because the uncorrelated test item was most similar to those in the familiarization set. The ten-month-old finding is termed a correlation effect. Both groups learned about individual stimulus features, but older infants also learned how these features correlate.

These effects, including the shift from similarity to correlation, were simulated with three different neural-network algorithms: BP networks with shrinking receptive fields (Shrink-BP; Westermann & Mareschal, 2004), Sustain networks (Gureckis & Love, 2004), and CC encoder networks (Shultz & Cohen, 2004). Westermann and Mareschal (2004) used more Gaussian hidden units than inputs and shrank the width of Gaussian receptive fields for older infants to mimic developing visual acuity. Increased acuity arising from decreased field size presumably enhances selective response to unique conjunctions of feature values. The Sustain algorithm tries to assimilate new stimuli to existing prototypes and recruits a new prototype when stimuli are sufficiently novel. A parameter controlled the number of prototypes that could be recruited and was set higher for older infants. Alternatively, Sustain could capture the infant data if the inputs were randomized to mimic poor visual acuity in younger infants. In CC networks, age was implemented by setting the score-threshold parameter higher for four-month-olds than for ten-month-olds, an approach that has been used successfully to model other age-related changes in learning tasks (Shultz, 2003). Training continues until all output activations are within a score threshold of their


target values for all training patterns. Thus, a lower score-threshold parameter produces deeper learning.

In contrast to these successful models, a wide range of topologies of ordinary BP networks failed to capture any of these effects (Shultz & Cohen, 2004). Comparative patterns of data coverage are summarized in Table 16.5.

Table 16.5: Coverage of the shift from features to correlations in category learning

Authors                          Model       Similarity   Correlation   Shift   Habituation
Westermann & Mareschal (2004)    Shrink-BP   yes          yes           yes     no
Gureckis & Love (2004)           Sustain     yes          yes           yes     no
Shultz & Cohen (2004)            CC          yes          yes           yes     yes
Shultz & Cohen (2004)            BP          no           no            no      no

The three successful models shared several commonalities. They all employed unsupervised (or self-supervised in the case of encoder networks) connectionist learning, and they explained apparent qualitative shifts in learning by quantitative variation in learning parameters. Also, the Shrink-BP and Sustain (in the randomized-input version) models both emphasized increased visual acuity as an underlying cause of learning change. Finally, both the Sustain and CC models grew in computational power.

When CC networks with a low score threshold were repeatedly tested over the familiarization phase, they predicted an early similarity effect followed by a correlation effect. Tests of this habituation prediction found that ten-month-olds who habituated to training stimuli looked longer at uncorrelated than correlated test stimuli, but those who did not habituate did the opposite, looking longer at correlated than uncorrelated test stimuli (Cohen & Arthur, 2003).

CC might be preferred over Sustain because: (a) the effects in Sustain are smaller than in infants, necessitating 10,000 networks to reach statistical significance, whereas the number of CC networks matched the nine infants run in each condition, (b) parameter values had to be optimally fit in Sustain, but not in CC, and (c) like the well-known ALCOVE concept-learning algorithm, Sustain employs an attention mechanism that CC does not require. One could say that Sustain attends to learn, whereas CC learns to attend. Likewise, CC seems preferable over Shrink-BP because CC learns much faster (tens vs. thousands of epochs), thus making a better match to the few minutes of familiarization in the infant experiments. CC is so far the only model to capture the habituation effect across an experimental session at a single age level. Shrink-BP and Sustain would not capture this effect because their mechanisms operate over ages, not over trials; CC mechanisms operate over both trials and ages.

11. Discrimination-Shift Learning

Discrimination-shift learning tasks stretch back to early behaviorism (Spence, 1952) and have a substantial human literature with robust, age-related effects well suited to learning models. In a typical discrimination-shift task, a learner is shown pairs of stimuli with mutually exclusive attributes along two perceptual dimensions (e.g., a black square and a white circle or a white square and a black circle, creating four stimulus pairs when left-right position is counterbalanced). The task involves learning to pick the consistently rewarded stimulus in each pair, where reward is linked to an attribute (e.g., black). When the learner consistently picks the target stimulus (usually eight times or more in ten consecutive trials), various shifts in reward contingencies


are introduced. A within-dimension shift involves an attribute from the initially relevant dimension as the new learning target (e.g., shifting from black to white). Conversely, a between-dimensions shift involves a new learning target from the previously irrelevant dimension (e.g., from black to circle).

Children above ten years and adults typically exhibit dimensional transfer on these tasks, whereby within-dimension shifts are easier (i.e., require fewer learning trials) than between-dimension shifts, and generalization of shift learning is observed on untrained stimuli. In contrast, preschoolers do not show dimensional transfer, and their learning is often explained as stimulus-response association. A popular interpretation of this age-related change in performance was that development involves a change from associative to mediated information processing during childhood (Kendler, 1979).

A simulation of these tasks with BP networks (Raijmakers, van Koten, & Molenaar, 1996) found that performance was comparable to that of preschoolers because networks failed to show dimensional transfer and generalization. Even though these networks included hidden units, which might mediate between inputs and outputs, they failed to exhibit mediated processing typical of older children and adults. The authors concluded that feed-forward neural networks were unable to capture the rule-like behavior of adults who abstract the relevant dimensions of variation in a problem.

However, research with CC networks showed that these shift-learning tasks are linearly separable problems and that multilayered BP nets make the problem more complicated than necessary (Sirois & Shultz, 1998). These authors argued, based on psychological evidence, that preschoolers and adults differ in depth of processing rather than on qualitatively different representational structures. They further suggested that adults acquire more focused representations through extensive iterative processing, which can be simulated in neural networks by lowering the score-threshold parameter.

These CC networks provided successful coverage of a wide range of discrimination-shift phenomena, capturing the performance of both preschoolers and adults. Two empirical predictions were made. One was that adults would perform like preschoolers if iterative processing were blocked. Overtraining research had already shown that preschoolers would perform like adults through additional learning trials. New research using a cognitive load during discriminative learning confirmed that adults do perform like preschoolers (Sirois & Shultz, 2006), revealing a continuity of representations between preschoolers and adults.

The second prediction concerned associative learning in preschoolers and was counterintuitive. Researchers had suggested that preschoolers do not respond on the basis of perceptual attributes but rather treat stimuli as perceptual compounds. The idea was that preschoolers are under object control, whereas adults are under dimensional control. But networks were sometimes correct on stimulus pairs and incorrect for each stimulus in a pair (Sirois & Shultz, 1998). The prediction here was that preschoolers would be under pair control rather than object control, such that they would have difficulty categorizing individual stimuli following successful pairwise learning. This has since been confirmed in experiments with children (Sirois, 2002).

12. Concept and Word Learning

The classic developmental problem of how children learn concepts and words is attracting a flurry of interest from both Bayesian and connectionist modelers. Four-year-olds and adults were reported to be consistently Bayesian in the way they generalized a novel word beyond a few provided examples (Xu & Tenenbaum, 2007). When three examples of a novel word were generated by a teacher, learners were justified in assuming that these examples represented a random sample from the word's extension. As a result, they restricted generalization to a specific, subordinate meaning. In contrast,


when three examples were provided but only one of them was a randomly selected instance of the word meaning (the other two examples having been chosen by the learner), subjects generalized more broadly, to the basic level, just as in previous studies that had provided only one example of the word's extension. The authors argued that other theories and models that do not explicitly address example sampling could not naturally account for these results.

Other, related phenomena that are attracting modeling concern shape and material biases in generalizing new word meanings. Six regularities from the psychological literature deserve model coverage:

1. Shape and material bias. When shown a single novel solid object and told its novel name, 2.5-year-olds generalized the name to objects with the same shape. In contrast, when shown a single novel nonsolid substance and told its novel name, children of the same age typically generalized the name to instances of the same material (Colunga & Smith, 2003; Imai & Gentner, 1997; Soja, Carey, & Spelke, 1992). These biases are termed overhypotheses because they help to structure a hypothesis space at a more specific level (Goodman, 1955/1983).

2. Development of shape bias and material bias. The foregoing biases emerge only after children have learned some names for solid and nonsolid things (Samuelson & Smith, 1999). One-year-old infants applied a novel name to objects identical to the trained object but not to merely similar objects. Furthermore, the training of infants on object naming typically requires familiar categories and multiple examples. This is in contrast to 2.5-year-olds' attentional shifts being evoked by naming a single novel example.

3. Shape bias before material bias. At two years, children exhibit shape bias on these tasks, but not material bias (Imai & Gentner, 1997; Kobayashi, 1997; Landau, Smith, & Jones, 1988; Samuelson & Smith, 1999; Soja, Carey, & Spelke, 1991; Subrahmanyam, Landau, & Gelman, 1999).

4. Syntax. Name generalization in these tasks is influenced by syntactic cues marking the noun as a count noun or mass noun (Dickinson, 1988; Imai & Gentner, 1997; Soja, 1992). If an English noun is preceded by the article a or the, it yields a shape bias, but if preceded by some or much it shows a material bias.

5. Ontology bias. Names for things tend to not refer to categories that span the boundary between solids and nonsolids, for example, water versus ice (Colunga & Smith, 2005). This underscores greater complexity than a mere shape bias for solids and material bias for nonsolids. Solid things do not typically receive the same name as nonsolid stuff does.

6. Material-nonsolid bias. In young children, there is an initial material bias for nonsolids (Colunga & Smith, 2005).

All six of these phenomena were covered by a constraint-satisfaction neural network trained with contrastive Hebbian learning (see Figure 2.2 in this volume) that adjusts weights on the basis of correlations between unit activations (Colunga & Smith, 2005). Regularities 5 and 6 were actually predicted by the network simulations before being documented in children. Each word and the solidity and syntax of each example were represented locally by turning on a particular unit. Distributed activation patterns represented the shape and material of each individual object or substance. Hidden units learned to represent the correlations between shape, material, solidity, syntax, and words. After networks learned a vocabulary via examples that paired names with perceptual instances, they were tested on how they would categorize novel things. Statistical distributions of the training patterns matched adult judgments (Samuelson & Smith, 1999). The recurrent connection scheme is illustrated by the arrows in Figure 16.2.

A hierarchical Bayesian model covered the mature shape bias and material bias described in regularity 1 and probably could


Figure 16.2. Topology of the network used by Colunga and Smith (2005). (Adapted with permission.) [The figure shows Word, Syntax, Shape, Material, and Solidity units recurrently connected through a bank of hidden units.]

be extended to cover regularities 4 and 5 (Kemp, Perfors, & Tenenbaum, 2007). Such models include representations at several levels of abstraction and show how knowledge can be acquired at levels remote from experiential data, thus providing for both top-down and bottom-up learning. Hypotheses at some intermediate level are conditional on both data at a lower level and overhypotheses at a higher level (see Chapter 3 in this volume).

This model also generated several predictions that may prove to be somewhat unique. For example, the optimal number of examples per category is two, assuming a fixed number of total examples. Also, learning is sometimes faster at higher than lower levels of abstraction, thus explaining why abstract knowledge might appear to be innate even when it is learnable. This is likely to happen in situations when a child encounters sparse or noisy observations such that any individual observation is difficult to interpret, although the observations taken together might support some hypothesis.

As is typical, this Bayesian model is pitched at a computational level of analysis, whereas connectionist models operate at more of an implementation level. As such, a computational-level Bayesian model may apply to a variety of implementations. The other side of this coin is that Bayes' rule does not generate representations – it instead computes statistics over structures designed by the modelers. In contrast, connectionist approaches sometimes are able to show how structures emerge.

Kemp et al. (2007) note that a common objection is that the success of Bayesian models depends on the modeler's skill in choosing prior probabilities. Interestingly, hierarchical Bayesian models can solve this problem because abstract knowledge can be learned rather than specified in advance.

A final point is that the correlations between syntax, solidity, shape, and material that underlie learning and generalization in this domain are far from perfect (Samuelson & Smith, 1999). For example, according to adult raters, bubble names a nonsolid but shape-based category; soap names a solid but material-based category; crayon, key, and nail name categories that are based on both shape and material. The many exceptions to these statistical regularities suggest that symbolic rule-based models would be nonstarters in this domain.

Particularly challenging to learn are so-called deictic words, such as personal pronouns, whose meaning shifts with point of view. Although most children acquire personal pronouns such as me and you without notable errors (Charney, 1980; Chiat, 1981; Clark, 1978), a small minority of children show persistent pronoun errors before getting them right (Clark, 1978; Oshima-Takane, 1992; Schiff-Meyers, 1983). The correct semantics are such that me refers to the person using the pronoun and you refers to the person who is addressed by the pronoun (Barwise & Perry, 1983). Unlike most words, the referent of these pronouns is not fixed, but instead shifts with conversational role. Although a mother calls herself me and calls her child you, these pronouns must be reversed when the child utters them. Because the referent of a personal pronoun shifts with conversational role, an imitative model for correct usage can be difficult to find. If children simply imitated what they heard in speech that was directly addressed to them, they would incorrectly refer to themselves as you and to the mother as me. These are indeed the typical errors made by a few children before sorting out the shifting references. A challenge for computational modelers is to explain both this rare sequence and the


virtually errorless acquisition observed in most children.

The most coherent explanation and evidence focused on the extent to which children were exposed to speech directly addressed to them versus speech that they overheard (Oshima-Takane, 1988). In overheard speech, children can observe that you refers to a person other than themselves and that me and you reciprocate each other with shifts of speaker, addressee, and referent. But in directly addressed speech, children would observe that you always refers to themselves and that me refers to the speaker. Thus, the correct semantics are better understood as children listen to others addressing each other.

This idea was supported by a training experiment using the so-called me-you game for several weeks with nineteen-month-olds who were just starting to acquire personal pronouns (Oshima-Takane, 1988). Correct pronoun production benefited more from listening to overheard speech (for example, the mother looks at the father, points to herself, and says me) than from listening to directly addressed speech (e.g., the father looks at the child, points to the child, and says you). Only those children assigned to the overheard speech condition could produce pronouns without errors. Also supportive was a naturalistic study in which second-borns acquired these pronouns earlier than did first-borns, even though these two groups of children did not differ on other measures of language development (Oshima-Takane, Goodz, & Derevensky, 1996). The explanation is that second-born children have relatively more opportunities to hear pronouns used in speech not addressed to them in conversations between a parent and an older sibling.

There are some theoretically interesting, albeit extreme, conditions of pronoun experience that cannot be found with children, such as exclusive exposure to either directly addressed speech or overheard speech. One advantage of simulation work is that such variations can be systematically explored. Several simulations with CC networks manipulated the relative amounts of directly addressed speech and overheard speech (Oshima-Takane, Takane, & Shultz, 1999; Shultz, Buckingham, & Oshima-Takane, 1994). As in the psychology experiment (Oshima-Takane, 1988), the networks had input information on speaker, addressee, and referent, and learned to predict the correct pronoun. As with children, error-free pronoun acquisition by networks was achieved with a high proportion of overheard speech patterns, whereas persistent reversal errors resulted from a high proportion of directly addressed speech patterns. Thus, both errorless acquisition and a progression from reversal errors to correct usage can be achieved, depending on the relative proportions of directly addressed and overheard speech. In an attempt to find effective therapeutic techniques for persistent reversal errors, simulations pointed to the benefits of massive amounts of overheard speech. Attempts to correct pronoun reversal errors using directly addressed speech are notoriously difficult because the child misunderstands the correction (Oshima-Takane, 1992), and this difficulty was verified with simulations.

Simulation of first- and second-person pronoun acquisition was also implemented within a developmental robotics approach (Gold & Scassellati, 2006). Instead of learning a pronoun-production function of speaker, addressee, and referent, p = f(s, a, r), as in Oshima-Takane's psychology experiment and the CC network simulations, a humanoid robot learned a pronoun-comprehension function, referent as a function of speaker, addressee, and pronoun, r = f(s, a, p). This comprehension function was learned in a game of catch with a ball between two humans and the robot. The robot's video camera captured both visual and auditory information, the latter being processed by a speech-recognition system. The humans tossed the ball back and forth and occasionally to the robot, while commenting on the action with utterances such as, "I got the ball" or "You got the ball."
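The contrast between the production mapping p = f(s, a, r) and the comprehension mapping r = f(s, a, p) can be written out directly. The sketch below is illustrative (the function and role names are mine, not from either model); it simply encodes the deictic rule that me refers to the speaker and you to the addressee:

```python
# Deictic semantics of first- and second-person pronouns:
# "me" refers to the speaker, "you" to the addressee.

def produce(speaker, addressee, referent):
    """Production p = f(s, a, r): which pronoun names the referent."""
    if referent == speaker:
        return "me"
    if referent == addressee:
        return "you"
    raise ValueError("first- and second-person pronouns cover only speech roles")

def comprehend(speaker, addressee, pronoun):
    """Comprehension r = f(s, a, p): whom the pronoun picks out."""
    return speaker if pronoun == "me" else addressee

# The referent shifts with conversational role: each party says "me" of itself.
print(produce("mother", "child", "mother"))   # -> me
print(produce("child", "mother", "child"))    # -> me
print(comprehend("mother", "child", "you"))   # -> child
```

A child who simply imitated directly addressed speech would keep the surface form and reuse you for itself; overheard speech is what reveals that the mapping runs over conversational roles, not particular people.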


468 shultz and sirois

Once reference was established, word counts for each pronoun were updated by the robot's computer in 2 × 2 tables for each pronoun-property pair. The highest significant chi-square value indicated the meaning of the utterance. Results revealed that you as addressee was acquired first and more strongly than I as speaker. Although the robot's distinction between I and you captures the correct semantics, it is not generally true that children acquire second- before first-person pronouns. If anything, there is a tendency for children to show the reverse order: first-person pronouns before second-person pronouns (Oshima-Takane, 1992; Oshima-Takane et al., 1996).

In a simulation of blind children, the robot in another condition could not see which of the humans had the ball, but could sense whether it (the robot) had the ball. This blind robot fell into the reversal error of interpreting you as the robot, as do young blind children (Andersen, Dunlea, & Kekelis, 1984).

The game of catch seems like an interesting and natural method to facilitate personal pronoun acquisition. Although tabulating word-property counts in 2 × 2 tables and then analyzing these tables with chi-square tests is a common technique in the study of computational semantics, it is unclear whether this could be implemented with neural realism.

A major difference between the psychology experiment and neural-network model on the one hand and the robotic model on the other hand concerns the use of gestures. Both the psychology experiment and the neural-network model liberally used pointing gestures to convey information about the referent, as well as eye-gaze information, to convey information about the addressee. In contrast, the robotic model eschewed gestures on the grounds that pointing is rude, unnecessary, and difficult for robots to understand. Paradoxically then, even though developmental robotics holds the promise of understanding embodied cognition, this robotic model ignored both gestural and eye-gaze information. The game of catch, accompanied by appropriate verbal commentary, nicely compensated for the absence of such information in the robot. However, humans are well known to both use and interpret gestures to complement verbal communication (Goldin-Meadow, 1999), and deictic (or pointing) gestures (McNeill, 1992) are among the first to appear in young children, as early as ten months of age (Bates, 1979). Hence, future humanoid robotic modelers might want to incorporate gesture production and interpretation in an effort to more closely follow human strategies.

It would seem interesting to explore computational systems that could learn all the functions relating speaker, addressee, referent, and pronoun to each other as well as extended functions that included third-person pronouns. Could learning some of these functions afford inferences about other functional relations? Or would every function have to be separately and explicitly learned? Bayesian methods and neural networks with recurrent connections might be good candidates for systems that could make inferences in various directions.

13. Abnormal Development

One of the promising ideas supported by connectionist modeling of development is that developmental disorders might emerge from early differences that lead to abnormal developmental trajectories (Thomas & Karmiloff-Smith, 2002). One such simulation was inspired by evidence that larger brains favor local connectivity and smaller brains favor long-distance connections (Zhang & Sejnowski, 2000) and that children destined to become autistic show abnormally rapid brain growth in the months preceding the appearance of autistic symptoms (Courchesne, Carper, & Akshoomoff, 2003). Neural networks modeled the computational effects of such changes in brain size (Lewis & Elman, 2008). The networks were feed-forward pattern associators, trained with backpropagation of error. As pictured in Figure 16.3, each of two hemispheres of ten units was fed by a bank of five input units. Units within a hemisphere were

Downloaded from https:/www.cambridge.org/core. New York University, on 26 Jun 2017 at 15:14:05, subject to the Cambridge Core terms of use, available at
https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9780511816772.020
P1: IBE
CUFX212-16 CUFX212-Sun 978 0 521 85741 3 April 2, 2008 17:14

computational models of developmental psychology 469

Figure 16.3. Topology of the autism network. Each of two hemispheres of ten units was fed by a bank of five input units. Units within a hemisphere were recurrently connected, and two units in each hemisphere were fully connected across hemispheres. Each hemisphere, in turn, fed a bank of five output units.
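The conduction delays described in the surrounding text, implemented by chains of copy units, amount to first-in-first-out buffers on each connection. A minimal sketch, with invented chain lengths (an illustrative stand-in, not the published model's code):

```python
from collections import deque

# Conduction delay via a transmission chain of copy units: an activation
# sent across a connection re-emerges only after `links` time steps.
# Growing the network in spatial extent corresponds to lengthening chains.

class DelayLine:
    def __init__(self, links):
        # one slot per copy unit in the chain, initialized to resting activation
        self.chain = deque([0.0] * links, maxlen=links)

    def step(self, value):
        """Push a new activation in; return the activation leaving the chain."""
        out = self.chain[0]
        self.chain.append(value)  # maxlen discards chain[0] automatically
        return out

short = DelayLine(links=2)  # normal growth: short inter-hemispheric delay
grown = DelayLine(links=5)  # overgrowth: the same signal arrives later

signal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print([short.step(v) for v in signal])  # -> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print([grown.step(v) for v in signal])  # -> [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```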

recurrently connected, and two units in each hemisphere were fully connected across hemispheres. Each hemisphere, in turn, fed a bank of five output units. Both inter- and intra-hemispheric connections exhibited conduction delays, implemented by periodically adding or subtracting copy units forming a transmission chain – the more links in the chain, the longer the conduction delay. The networks simulated inter-hemispheric interaction by growing in spatial extent, with consequent transmission delays, at the rate of either typically developing children or those in the process of becoming autistic.

Those networks that simulated autistic growth (marked by rapid increases in the space taken up by the network) were less affected by removal of inter-hemispheric connections than those networks that grew at a normal rate, indicating a reduced reliance on long-distance connections in the autistic networks. As these differences accelerated, they were reflected in declining connectivity and deteriorating performance. The simulation offers a computational demonstration of how brain overgrowth could produce neural reorganization and behavioral deficits.

In a similar vein, researchers have examined the role of initial conditions in developmental dyslexia (Harm & Seidenberg, 1999), specific language impairments (Hoeffner & McClelland, 1994), and Williams syndrome (Thomas & Karmiloff-Smith, 2002).

14. Conclusions

14.1. Computational Diversity

Computational modeling of development is now blessed with several different techniques. It started in the 1970s with production systems, which were joined in the late 1980s by neural networks. But by the early twenty-first century, there were also dynamic system, robotic, and Bayesian approaches to development. This diversity is welcome because each approach has already made valuable contributions to the study of development, just as they have in other areas of psychology. All of these approaches have contributed to the notion that an understanding of development can be facilitated by making theoretical ideas precise and systematic, covering various phenomena of interest, linking several different findings together, explaining them, and predicting new phenomena. Such activities significantly accelerate the scientific process.

Production systems are to be admired for their precision and clarity in specifying both knowledge representations and processes that operate on these representations to produce new knowledge. Connectionist systems have the advantage of graded


knowledge representations and relative closeness to biological neural systems in terms of activation functions, connectivity, and learning processes. Dynamic systems illustrate how the many different features of a complex computational system may interact to yield emergent behavior and ability. Developmental robotics forces modelers to deal with the complexities of real environments and the constraints of operating in real time. Bayesian methods contribute tools for making inferences despite incomplete and uncertain knowledge.

14.2. Complementary Computation

These different modeling approaches tend to complement each other, partly by being pitched at various levels. Marr's (1982) levels of computational analysis, imperfect as they are, can be used to explore this. Marr argued that explanations of a complex system can be found in at least three levels: analysis of the system's competence, design of an algorithm and representational format, and implementation. Analyzing a system's competence has been addressed by task analysis in symbolic approaches, differential equations in a dynamic system approach, and Bayesian optimization.

Every computational model must cope with the algorithmic level. Symbolic rule-based models do this with the mechanics of production systems: the matching and firing of rules, and the consequent updating of working memory. In neural networks, patterns of unit activations represent active memory, whereas weight updates represent the learning of long-term memory. Activation fields geometrically represent the changing positions of a dynamic system. Bayesian approaches borrow structures from symbolic approaches and compute statistics over these structures to identify the most probable hypothesis or structure given current evidence.

The implementation level can be taken to refer to the details of how a particular model is instantiated. In this context, higher-level approaches, such as production systems (Lebiere & Anderson, 1993) and dynamic systems (van Gelder, 1998), have sometimes been implemented as neural networks. One can also be concerned with how a system is implemented in neural tissue. Of the approaches considered in this chapter, neural networks come closest to simulating how this might be done because these networks were largely inspired by principles of how the brain and its neurons work. Growth of brain structure within a network and integration of brain structures across networks have both been stressed in this and other reviews (Westermann et al., 2006). As noted, dynamic systems can also be inspired by neuroscience discoveries. There is, of course, a continuum of neural realism in such implementations.

If the different modeling approaches do exist at different levels, wouldn't it make sense to use the lowest level to obtain the finest grain of detail, perhaps to a biologically realistic model of actual neural circuits? Remembering the reductionist cruncher argument (Block, 1995), the answer would be negative because different levels may be better for different purposes. It is preferable to work at the most appropriate level for one's goals and questions, rather than always trying to reduce to some lower level. Nonetheless, one of the convincing rationales for cognitive science was that different levels of analysis can constrain each other, as when psychologists try to build computational models that are biologically realistic.

14.3. Computational Bakeoffs

Even if computational algorithms exist at somewhat different levels, so-called bakeoff competitions are still possible and interesting, both between and within various approaches. This is because different approaches and models sometimes make different, and even conflicting, predictions about psychological phenomena. Focusing on phenomena that have attracted a lot of modeling attention, as in this chapter, provides some ready-made bakeoff scenarios.


Symbolic and connectionist models were sharply contrasted here in the cases of the balance scale, past tense, artificial syntax, and pronouns. In balance-scale simulations, rule-based models, but not connectionist models, had difficulty with the torque-difference effect. This is a graded, perceptual effect that is awkward for crisp symbolic rules but natural for neural systems with graded representations and update processes that propagate these gradations. Past-tense formation was likewise natural for neural approaches, which can implement regularities and exceptions in a homogeneous system and thus capture various phenomena by letting them play off against each other, but awkward for symbolic systems that isolate rule processing from other processes. Several connectionist models captured the novelty preference in learning an artificial syntax, but the one rule-based approach that was tried could not do so. Although no rule-based models have yet been applied to pronoun acquisition, the graded effects of variation in amount of experience with overheard versus directly addressed speech would pose a challenge to rule-based models.

Attractive modeling targets, such as the balance scale, artificial syntax, similarity-to-correlation shift, and discrimination shift also afforded some bakeoff competitions within the neural approach in terms of static (BP) versus constructive (CC) network models. On the balance scale, CC networks uniquely captured final, stage-4 performance and did so without having to segregate inputs by weight and distance. CC also captured more phenomena than did static BP models in simulations of the similarity-to-correlation shift. This was probably because CC naturally focused first on identifying stimulus features while underpowered and only later with additional computational power abstracted correlations among these features. In discrimination-shift learning, the advantage of CC over static BP was a bit different. Here, BP modelers were led to incorrect conclusions about the inability of neural networks to learn a mediated approach to this problem by virtue of trying BP networks with designed hidden units. Because CC networks only recruit hidden units as needed, they were able to verify that these simple learning problems were actually linearly separable, suggesting that hidden units were making learning more difficult than it needed to be. Other constructive versus static network competitions have also favored constructive networks on developmental problems (Shultz, 2006). To simulate stages and transitions between stages, there is an advantage in starting small and increasing in computational power as needed.

The notion of underlying qualitative changes causing qualitative changes in psychological functioning differs from the idea of underlying small quantitative changes causing qualitative shifts in behavior, as in mere weight adjustment in static neural networks or quantitative changes in dynamic-system parameters. There are analogous qualitative structural changes at the neurological level in terms of synaptogenesis and neurogenesis, both of which have been demonstrated to be under the control of pressures to learn in mature as well as developing animals (Shultz, Mysore, & Quartz, 2007). The CC algorithm is neutral with respect to whether hidden-unit recruitment implements synaptogenesis or neurogenesis, depending on whether the recruit already exists in the system or is freshly created. But it is clear that brains do grow in structure and there seem to be computational advantages in such growth, particularly for simulating qualitative changes in behavioral development (Shultz, 2006).

This is not to imply that static connectionist models do not occupy a prominent place in developmental modeling. On the contrary, this review highlights several cases in which static networks offered compelling and informative models of developmental phenomena. Static networks may be particularly appropriate in cases for which evolution has prepared organisms with either network topologies or a combination of connection weights and topologies (Shultz & Mareschal, 1997). When relevant biological constraints are known, as in a model of object permanence (Mareschal et al., 1999), they can guide design of static network topologies.
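The linear-separability point above can be checked directly: if a problem is linearly separable, a network with no hidden units at all (a simple perceptron) masters it. A minimal sketch on an invented two-feature discrimination problem where category membership depends only on the first dimension:

```python
# A single-layer perceptron (no hidden units) learning a linearly separable
# discrimination: respond to the first stimulus dimension, ignore the second.

def train_perceptron(samples, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - out             # classic perceptron error correction
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(samples)
preds = [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0 for x, _ in samples]
print(preds)  # -> [0, 0, 1, 1]: mastered without any hidden units
```

Were the assignment not separable (e.g., an exclusive-or mapping), no setting of w and b would succeed, which is exactly the situation in which recruiting hidden units pays off.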


In some studies, the process of network evolution itself has been modeled (Schlesinger et al., 2000). Ultimately, models showing how networks evolve, develop, and learn would be a worthy target.

The more recently applied modeling techniques (dynamic systems, developmental robotics, Bayesian) do not yet have enough bakeoff experience to draw firm conclusions about their relative effectiveness in modeling development. For example, in the A-not-B error, the dynamic system approach seemed promising but did not cover as many phenomena as BP networks did. However, as noted, this dynamic system has several parameters whose variation could be explored to cover additional phenomena. Likewise, although Bayesian approaches are only just starting to be applied to development, they have already made some apparently unique predictions in the domain of word learning: inferences allowed by random sampling of examples and estimates of the optimal number of examples for Bayesian inference. Also in the word-learning domain, the Bayesian approach covered only a portion of the shape-and-material-bias phenomena covered by the neural-network model. Nonetheless, the hierarchical Bayesian model employed there seems to have the potential to integrate phenomena across different explanatory levels. Before leaving these biases, it is perhaps worth remembering, in a bakeoff sense, that rule-based methods would likely be bothered by the many exceptions that exist in this domain.

In the domain of pronoun acquisition, the robotics model did not address the same psychology experiment as did the CC model, so the robot could not realistically cover the findings of that experiment. A blind catch-playing robot did simulate the reversal errors made by blind children. However, a sighted robot developed you before I, something that is not true of children. The domain of syntax learning proved to be too easy for a variety of connectionist models, so it was difficult to discriminate among them – they all captured the main infant finding of a novelty preference. This problem was not so easy for C4.5, though, which could not capture any phenomena from the infant experiment. Moving to realistically complex syntactic patterns will likely prove challenging for all sorts of models.

Simulation of abnormal development has a number of promising connectionist models, but it is too early to tell which particular approaches will best capture which particular developmental disorders.

14.4. Development via Parameter Settings?

Some of the models reviewed in this chapter simulated development with programmer-designed parameter changes. Variations in such parameter settings were used to implement age-related changes in both connectionist and dynamic-systems models of the A-not-B error, the CC model of discrimination-shift learning, all three models of the similarity-to-correlation shift, and the autism model. Granted, this technique captured developmental effects and arguably could be justified on various grounds, but does it really constitute a good explanation of developmental change? Or is this a case of divine intervention, manually implementing changes that should occur naturally and spontaneously? ACT-R simulations of development also have this character as programmers change activation settings to allow different rules to come to the fore. Perhaps such parameter settings could be viewed as a preliminary step in identifying those changes a system needs to advance. One hopes that this could be followed by model improvements that would allow for more natural and spontaneous development.

Acknowledgments

This work was supported by a grant from the Natural Sciences and Engineering Research Council of Canada to the first author. Frédéric Dandurand, J-P Thivierge, and Yuriko Oshima-Takane provided helpful comments.


References

Altmann, G. T. M., & Dienes, Z. (1999). Rule learning by seven-month-old infants and neural networks. Science, 284, 875.
Amirikian, B., & Georgopoulos, A. P. (2003). Modular organization of directionally tuned cells in the motor cortex: Is there short-range order? Proceedings of the National Academy of Sciences U.S.A., 100, 12474–12479.
Andersen, E. S., Dunlea, A., & Kekelis, L. S. (1984). Blind children's language: Resolving some differences. Journal of Child Language, 11, 645–664.
Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Lawrence Erlbaum.
Baillargeon, R. (1986). Representing the existence and the location of hidden objects: Object permanence in 6- and 8-month-old infants. Cognition, 23, 21–41.
Baillargeon, R. (1987). Object permanence in 3 1/2- and 4 1/2-month-old infants. Developmental Psychology, 23, 655–664.
Baluja, S., & Fahlman, S. E. (1994). Reducing network depth in the cascade-correlation learning architecture (Technical Report No. CMU-CS-94-209). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Barwise, J., & Perry, J. (1983). Situations and attitudes. Cambridge, MA: MIT Press.
Bates, E. (1979). The emergence of symbols: Cognition and communication in infancy. New York: Academic Press.
Berthouze, L., & Ziemke, T. (2003). Epigenetic robotics – modelling cognitive development in robotic systems. Connection Science, 15, 147–150.
Block, N. (1995). The mind as the software of the brain. In E. E. Smith & D. N. Osherson (Eds.), Thinking: An invitation to cognitive science (Vol. 3, 2nd ed.). Cambridge, MA: MIT Press.
Charney, R. (1980). Speech roles and the development of personal pronouns. Journal of Child Language, 7, 509–528.
Chiat, S. (1981). Context-specificity and generalization in the acquisition of pronominal distinctions. Journal of Child Language, 8, 75–91.
Clark, E. V. (1978). From gesture to word: On the natural history of deixis in language acquisition. In J. S. Bruner & A. Garton (Eds.), Human growth and development (pp. 85–120). Oxford, UK: Oxford University Press.
Cohen, L. B., & Arthur, A. E. (1983). Perceptual categorization in the infant. In E. Scholnick (Ed.), New trends in conceptual representation (pp. 197–220). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, L. B., & Arthur, A. E. (2003). The role of habituation in 10-month-olds' categorization. Unpublished manuscript.
Colunga, E., & Smith, L. B. (2003). The emergence of abstract ideas: Evidence from networks and babies. Philosophical Transactions of the Royal Society, 358, 1205–1214.
Colunga, E., & Smith, L. B. (2005). From the lexicon to expectations about kinds: A role for associative learning. Psychological Review, 112, 347–382.
Courchesne, E., Carper, R., & Akshoomoff, N. (2003). Evidence of brain overgrowth in the first year of life in autism. Journal of the American Medical Association, 290, 337–344.
Daugherty, K., & Seidenberg, M. S. (1992). Rules or connections? The past tense revisited. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society (pp. 259–264). Hillsdale, NJ: Lawrence Erlbaum.
Dickinson, D. K. (1988). Learning names for materials: Factors constraining and limiting hypotheses about word meaning. Cognitive Development, 3, 15–35.
Elman, J. L. (1999). Generalization, rules, and neural networks: A simulation of Marcus et al. Retrieved April 27, 1999, from http://www.crl.ucsd.edu/~elman/Papers/MVRVsim.html
Fahlman, S. E. (1988). Faster-learning variations on back-propagation: An empirical study. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38–51). Los Altos, CA: Morgan Kaufmann.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 524–532). Los Altos, CA: Morgan Kaufmann.
Ferretti, R. P., & Butterfield, E. C. (1986). Are children's rule-assessment classifications invariant across instances of problem types? Child Development, 57, 1419–1428.
Gold, K., & Scassellati, B. (2006). Grounded pronoun learning and pronoun reversal. In Proceedings of the Fifth International Conference on Development and Learning ICDL 2006. Bloomington: Department of Psychological and Brain Sciences, Indiana University.
Goldin-Meadow, S. (1999). The role of gesture in communication and thinking. Trends in Cognitive Sciences, 3, 419–429.


Goodman, N. (1955/1983). Fact, fiction, and forecast. New York: Bobbs-Merrill.
Gureckis, T. M., & Love, B. C. (2004). Common mechanisms in infant and adult category learning. Infancy, 5, 173–198.
Hare, M., & Elman, J. L. (1995). Learning and morphological change. Cognition, 56, 61–98.
Harm, M. W., & Seidenberg, M. S. (1999). Phonology, reading acquisition, and dyslexia: Insights from connectionist models. Psychological Review, 106, 491–528.
Hoeffner, J. H., & McClelland, J. L. (1994). Can a perceptual processing deficit explain the impairment of inflectional morphology in developmental dysphasia – a computational investigation? In Proceedings of the Twenty-Fifth Annual Child Language Research Forum (pp. 38–49). Center for the Study of Language and Information, Stanford University, Stanford, CA.
Imai, M., & Gentner, D. (1997). A cross-linguistic study of early word meaning: Universal ontology and linguistic influence. Cognition, 62, 169–200.
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10, 307–321.
Kendler, T. S. (1979). The development of discrimination learning: A levels-of-functioning explanation. In H. Reese (Ed.), Advances in child development and behavior (Vol. 13, pp. 83–117). New York: Academic Press.
Klahr, D., & Wallace, J. G. (1976). Cognitive development: An information processing view. Hillsdale, NJ: Lawrence Erlbaum.
Kobayashi, H. (1997). The role of actions in making inferences about the shape and material of solid objects among 2-year-old children. Cognition, 63, 251–269.
Landau, B., Smith, L. B., & Jones, S. S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3, 299–321.
Langley, P. (1987). A general theory of discrimination learning. In D. Klahr, P. Langley, & R. Neches (Eds.), Production systems models of learning and development (pp. 99–161). Cambridge, MA: MIT Press.
Lebiere, C., & Anderson, J. R. (1993). A connectionist implementation of the ACT-R production system. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society (pp. 635–640). Hillsdale, NJ: Lawrence Erlbaum.
Lewis, J. D., & Elman, J. L. (2008). Growth-related neural reorganization and the autism phenotype: A test of the hypothesis that altered brain growth leads to altered connectivity. Developmental Science, 11, 135–155.
Ling, C. X., & Marinov, M. (1993). Answering the connectionist challenge: A symbolic model of learning the past tenses of English verbs. Cognition, 49, 235–290.
Lovett, A., & Scassellati, B. (August 2004). Using a robot to reexamine looking time experiments. Paper presented at the Fourth International Conference on Development and Learning, San Diego, CA.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283, 77–80.
Mareschal, D., Plunkett, K., & Harris, P. (1999). A computational and neuropsychological account of object-oriented behaviours in infancy. Developmental Science, 2, 306–317.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: W. H. Freeman.
McClelland, J. L. (1989). Parallel distributed processing: Implications for cognition and development. In R. G. M. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology (pp. 8–45). Oxford, UK: Oxford University Press.
McNeill, D. (1992). Hand and mind. Chicago: University of Chicago Press.
Munakata, Y. (1998). Infant perseveration and implications for object permanence theories: A PDP model of the AB task. Developmental Science, 1, 161–184.
Munakata, Y., McClelland, J. L., Johnson, M. H., & Siegler, R. S. (1997). Rethinking infant knowledge: Toward an adaptive process account of successes and failures in object permanence tasks. Psychological Review, 104, 686–713.
Negishi, M. (1999). Do infants learn grammar with algebra or statistics? Science, 284, 433.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Oshima-Takane, Y. (1988). Children learn from speech not addressed to them: The case of personal pronouns. Journal of Child Language, 15, 95–108.
Oshima-Takane, Y. (1992). Analysis of pronominal errors: A case study. Journal of Child Language, 19, 111–131.
Oshima-Takane, Y., Goodz, E., & Derevensky, J. L. (1996). Birth order effects on early language development: Do secondborn children

learn from overheard speech? Child Development, 67, 621–634.
Oshima-Takane, Y., Takane, Y., & Shultz, T. R. (1999). The learning of first and second pronouns in English: Network models and analysis. Journal of Child Language, 26, 545–575.
Piaget, J. (1954). The construction of reality in the child. New York: Basic Books.
Pinker, S. (1999). Words and rules: The ingredients of language. New York: Basic Books.
Plunkett, K., & Marchman, V. (1996). Learning from a connectionist model of the acquisition of the English past tense. Cognition, 61, 299–308.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Raijmakers, M. E. J., van Koten, S., & Molenaar, P. C. M. (1996). On the validity of simulating stagewise development by means of PDP networks: Application of catastrophe analysis and an experimental test of rule-like network performance. Cognitive Science, 20, 101–136.
Samuelson, L., & Smith, L. B. (1999). Early noun vocabularies: Do ontology, category structure and syntax correspond? Cognition, 73, 1–33.
Schiff-Meyers, N. (1983). From pronoun reversals to correct pronoun usage: A case study of a normally developing child. Journal of Speech and Hearing Disorders, 48, 385–394.
Schlesinger, M. (2004). Evolving agents as a metaphor for the developing child. Developmental Science, 7, 158–164.
Schlesinger, M., Parisi, D., & Langer, J. (2000). Learning to reach by constraining the movement search space. Developmental Science, 3, 67–80.
Schmidt, W. C., & Ling, C. X. (1996). A decision-tree model of balance scale development. Machine Learning, 24, 203–229.
Shultz, T. R. (1999). Rule learning by habituation can be simulated in neural networks. In M. Hahn & S. C. Stoness (Eds.), Proceedings of the Twenty-first Annual Conference of the Cognitive Science Society (pp. 665–670). Mahwah, NJ: Lawrence Erlbaum.
Shultz, T. R. (2003). Computational developmental psychology. Cambridge, MA: MIT Press.
Shultz, T. R. (2006). Constructive learning in the modeling of psychological development. In Y. Munakata & M. H. Johnson (Eds.), Processes of change in brain and cognitive development: Attention and performance XXI (pp. 61–86). Oxford, UK: Oxford University Press.
Shultz, T. R., & Bale, A. C. (2001). Neural network simulation of infant familiarization to artificial sentences: Rule-like behavior without explicit rules and variables. Infancy, 2, 501–536.
Shultz, T. R., & Bale, A. C. (2006). Neural networks discover a near-identity relation to distinguish simple syntactic forms. Minds and Machines, 16, 107–139.
Shultz, T. R., Buckingham, D., & Oshima-Takane, Y. (1994). A connectionist model of the learning of personal pronouns in English. In S. J. Hanson, T. Petsche, M. Kearns, & R. L. Rivest (Eds.), Computational learning theory and natural learning systems, Vol. 2: Intersection between theory and experiment (pp. 347–362). Cambridge, MA: MIT Press.
Shultz, T. R., & Cohen, L. B. (2004). Modeling age differences in infant category learning. Infancy, 5, 153–171.
Shultz, T. R., & Mareschal, D. (1997). Rethinking innateness, learning, and constructivism: Connectionist perspectives on development. Cognitive Development, 12, 563–586.
Shultz, T. R., Mareschal, D., & Schmidt, W. C. (1994). Modeling cognitive development on balance scale phenomena. Machine Learning, 16, 57–86.
Shultz, T. R., Mysore, S. P., & Quartz, S. R. (2007). Why let networks grow? In D. Mareschal, S. Sirois, G. Westermann, & M. H. Johnson (Eds.), Neuroconstructivism: Perspectives and prospects (Vol. 2). Oxford, UK: Oxford University Press.
Siegler, R. S. (1976). Three aspects of cognitive development. Cognitive Psychology, 8, 481–520.
Sirois, S. (September 2002). Rethinking object compounds in preschoolers: The case of pairwise learning. Paper presented at the British Psychological Society Developmental Section Conference, University of Sussex, UK.
Sirois, S., Buckingham, D., & Shultz, T. R. (2000). Artificial grammar learning by infants: An auto-associator perspective. Developmental Science, 4, 442–456.
Sirois, S., & Shultz, T. R. (1998). Neural network modeling of developmental effects in discrimination shifts. Journal of Experimental Child Psychology, 71, 235–274.
Sirois, S., & Shultz, T. R. (2006). Preschoolers out of adults: Discriminative learning with a cognitive load. Quarterly Journal of Experimental Psychology, 59, 1357–1377.

Soja, N. N. (1992). Inferences about the meaning of nouns: The relationship between perception and syntax. Cognitive Development, 7, 29–45.
Soja, N. N., Carey, S., & Spelke, E. S. (1991). Ontological categories guide young children's inductions of word meaning: Object terms and substance terms. Cognition, 38, 179–211.
Soja, N. N., Carey, S., & Spelke, E. S. (1992). Perception, ontology, and word meaning. Cognition, 45, 101–107.
Spence, K. W. (1952). The nature of the response in discrimination learning. Psychological Review, 59, 89–93.
Subrahmanyam, K., Landau, B., & Gelman, R. (1999). Shape, material, and syntax: Interacting forces in children's learning in novel words for objects and substances. Language & Cognitive Processes, 14, 249–281.
Sun, R., Slusarz, P., & Terry, C. (2005). The interaction of the explicit and the implicit in skill learning: A dual-process approach. Psychological Review, 112, 159–192.
Taatgen, N. A., & Anderson, J. R. (2002). Why do children learn to say "broke"? A model of learning the past tense without feedback. Cognition, 86, 123–155.
Thelen, E., Schoener, G., Scheier, C., & Smith, L. (2001). The dynamics of embodiment: A field theory of infant perseverative reaching. Behavioral and Brain Sciences, 24, 1–33.
Thomas, M. S. C., & Karmiloff-Smith, A. (2002). Are developmental disorders like cases of adult brain damage? Implications from connectionist modelling. Behavioral and Brain Sciences, 25, 727–787.
van Gelder, T. J. (1998). The dynamical hypothesis in cognitive science. Behavioral and Brain Sciences, 21, 1–14.
van Rijn, H., van Someren, M., & van der Maas, H. (2003). Modeling developmental transitions on the balance scale task. Cognitive Science, 27, 227–257.
Vilcu, M., & Hadley, R. F. (2001). Generalization in simple recurrent networks. In J. B. Moore & K. Stenning (Eds.), Proceedings of the Twenty-third Annual Conference of the Cognitive Science Society (pp. 1072–1077). Mahwah, NJ: Lawrence Erlbaum.
Vilcu, M., & Hadley, R. F. (2005). Two apparent "counterexamples" to Marcus: A closer look. Minds and Machines, 15, 359–382.
Westermann, G. (1998). Emergent modularity and U-shaped learning in a constructivist neural network learning the English past tense. In M. A. Gernsbacher & S. J. Derry (Eds.), Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (pp. 1130–1135). Mahwah, NJ: Lawrence Erlbaum.
Westermann, G., & Mareschal, D. (2004). From parts to wholes: Mechanisms of development in infant visual object processing. Infancy, 5, 131–151.
Westermann, G., Sirois, S., Shultz, T. R., & Mareschal, D. (2006). Modeling developmental cognitive neuroscience. Trends in Cognitive Sciences, 10, 227–232.
Xu, F., & Tenenbaum, J. B. (2007). Sensitivity to sampling in Bayesian word learning. Developmental Science, 10, 288–297.
Younger, B. A., & Cohen, L. B. (1986). Developmental change in infants' perception of correlations among attributes. Child Development, 57, 803–815.
Zhang, K., & Sejnowski, T. J. (2000). A universal scaling law between gray matter and white matter of cerebral cortex. Proceedings of the National Academy of Sciences USA, 97, 5621–5626.
