Francis Kaganovich Huber 2008

This study investigates the differential weighting of two acoustic cues, voice onset time (VOT) and onset fundamental frequency (onset F0), in the categorization of English consonant voicing. Equating perceptual distance along the two dimensions eliminated the expected pre-training bias toward VOT, but learning to base decisions more on VOT and less on onset F0 was easier than the reverse, and only VOT-trained listeners showed a decrease in Garner interference. The results lend qualified support to an attentional model of phonetic learning in which learning involves strategic redeployment of selective attention across acoustic cues.


Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English

Alexander L. Francis a)
Department of Speech, Language and Hearing Sciences and Program in Linguistics, Purdue University, Heavilon Hall, 500 Oval Drive, West Lafayette, Indiana 47906

Natalya Kaganovich b)
Program in Linguistics, Purdue University, Heavilon Hall, 500 Oval Drive, West Lafayette, Indiana 47906

Courtney Driscoll-Huber
Department of Speech, Language and Hearing Sciences, Purdue University, Heavilon Hall, 500 Oval Drive, West Lafayette, Indiana 47906

(Received 16 May 2007; revised 17 March 2008; accepted 21 May 2008)


In English, voiced and voiceless syllable-initial stop consonants differ in both fundamental frequency at the onset of voicing (onset F0) and voice onset time (VOT). Although both correlates, alone, can cue the voicing contrast, listeners weight VOT more heavily when both are available. Such differential weighting may arise from differences in the perceptual distance between voicing categories along the VOT versus onset F0 dimensions, or it may arise from a bias to pay more attention to VOT than to onset F0. The present experiment examines listeners’ use of these two cues when classifying stimuli in which perceptual distance was artificially equated along the two dimensions. Listeners were also trained to categorize stimuli based on one cue at the expense of another. Equating perceptual distance eliminated the expected bias toward VOT before training, but successfully learning to base decisions more on VOT and less on onset F0 was easier than vice versa. Perceptual distance along both dimensions increased for both groups after training, but only VOT-trained listeners showed a decrease in Garner interference. Results lend qualified support to an attentional model of phonetic learning in which learning involves strategic redeployment of selective attention across integral acoustic cues. © 2008 Acoustical Society of America. [DOI: 10.1121/1.2945161]

PACS number(s): 43.71.An, 43.71.Es, 43.71.Rt [PEI] Pages: 1234–1251

a) Author to whom correspondence should be addressed. Tel.: (765) 494-3815. Electronic mail: [email protected]
b) Also at: Department of Speech, Language and Hearing Sciences, Purdue University, Heavilon Hall, 500 Oval Drive, West Lafayette, Indiana 47906.

I. INTRODUCTION

The acoustic patterns of speech sounds are highly multidimensional, in the sense that multiple acoustic properties typically correlate with the production of a particular phonetic category. Most, if not all, of these correlates have the potential to function as perceptual cues to categorization under appropriate circumstances, but not all cues are weighted equally in a given contrast. There are at least two major reasons that listeners might prefer to make a particular phonetic judgment on the basis of one cue over another. On the one hand, the perceived difference between two phonetic categories might be greater along one contrastive dimension than the other. Alternatively, some cues may be privileged (for particular phonetic decisions) because of learned or innate biases in the way they are processed.

The multiplicity of cues to phonetic contrasts is well documented. For example, Lisker (1986) describes a wide variety of acoustic correlates that differ systematically between productions of intervocalic /p/ and /b/ in English. Most or all of these correlates have been shown to be sufficient to cue the perception of this contrast in syllable-initial position, even in the absence of other cues (Lisker, 1978), but we will focus on four that have been more intensively studied: Voice onset time (VOT; Abramson and Lisker, 1970), the fundamental frequency at the onset of voicing (onset F0; Haggard et al., 1970; Haggard et al., 1981), the degree of delay in the onset of the first formant (F1 cutback or voiced transition duration; Stevens and Klatt, 1974) and the relative amplitude of any aspiration noise in the period between the burst release and the onset of voicing (Repp, 1979). Despite the multiplicity of sufficient cues to the English stop-consonant voicing contrast, when more than one of these cues are presented to listeners, a pattern of dominance appears that suggests that some correlates are better able to serve as cues (often called primary cues) than others (secondary cues), at least in specific phonetic contexts. For the purposes of this study, the most relevant observation is that VOT appears to dominate other cues to voicing of syllable-initial stop consonants in English (Raphael, 2005). In particular, a variety of studies have shown that, in this context, VOT is preferred over onset F0 (Abramson and Lisker, 1985; Gordon et al., 1993; Lisker, 1978; Whalen et al., 1993; see Francis and Nusbaum, 2002 for discussion). However, although such patterns of relative dominance are generally agreed upon, there is little consensus regarding the

1234 J. Acoust. Soc. Am. 124 (2), August 2008 0001-4966/2008/124(2)/1234/18/$23.00 © 2008 Acoustical Society of America

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms.
psychological basis for such apparent prioritization of one acoustic cue over another.

One factor of note in this regard is that the results of group studies on this topic (including the present one) may obscure the presence of real individual differences in the relative weighting of these two cues. For example, Haggard et al. (1970) found that onset F0 “can be of some importance, but the wide differences in performance between subjects show that it is unimportant for some listeners” (p. 616). Similarly, Massaro and Cohen (1976, 1977) found a range of individual differences in reliance on onset F0 as compared to VOT and fricative duration in a series of studies on the perception of voicing in syllable-initial fricatives. Such differences in individual listeners’ weighting of normally covarying acoustic cues are consistent with other studies showing similar differences even in the perception of nonspeech cues (e.g., Lutfi and Liu, 2007), and clearly invite further study. However, the observation of individual differences in weighting still does not address the question of what might motivate the prioritization of one cue over another and to what degree such weighting might be changed by experience.

A. Perceptual weighting

One possible reason for the relative dominance of one cue over another is that the perceptual distance between two categories may be different along two different dimensions of contrast. For example, the perceptual distance between two prototypical exemplars of English /b/ and /p/ is quite large according to VOT and may be somewhat smaller according to onset F0.¹ In this case, listeners would be expected to give more weight to VOT than to onset F0, if only because the VOT differences are more easily distinguished. On the other hand, it is also possible that one dimension might be intrinsically better at attracting listeners’ attention to it than another, such that, when given a choice between the two dimensions, listeners prefer to make decisions on the basis of one rather than another, even when the two contrasts are equated in terms of perceptual distance in isolation. That is, some acoustic properties may be privileged, at least with respect to their use in distinguishing a given phonetic contrast.

There seem to be at least two or three possible explanations of how such an intrinsic bias might arise. On the one hand, biases might arise as a function of (possibly innate) biological mechanisms, for example, as a consequence of differences in the efficiency of neural systems for processing different kinds of features, e.g., differences in neural systems specialized for processing temporally versus spectrally defined properties, see Zatorre and Belin (2001). Alternatively, such biases might derive from auditory/acoustic interactions between features that result in one feature enhancing the perception of another (Diehl and Kluender, 1989; Kingston and Diehl, 1994) or the two features together contributing to a higher-order, combinatoric perceptual feature (Kingston et al., 2008). Finally, such biases might be explicitly learned, developing through years of experience listening to a language in which linguistically salient differences are more frequently made on the basis of one feature rather than another (a pattern whose origins might itself ultimately have a socio-historical as well as or instead of a psychophysiological basis) (see Holt et al., 2001 for discussion). That these kinds of explanations need not be mutually exclusive is supported by recent evidence suggesting that listeners’ native language experience affects the efficiency of neural encoding of pitch properties at the brainstem level (Xu et al., 2006).

One of the most recent and thorough discussions of the idea that listeners may be predisposed to use certain acoustic properties rather than others in a categorization task was presented by Holt and Lotto (2006). They trained adult listeners to categorize unfamiliar nonspeech sounds that differed according to two orthogonal dimensions, the center frequency (CF) of the carrier sine wave and the frequency of a modulating sine wave. They found that listeners showed a consistent preference for the CF cue, even when the perceptual distances between the two categories were equal along the two dimensions. This suggests that there may be intrinsic biases favoring the ability to learn (and therefore use) certain acoustic dimensions rather than others (see also Lutfi and Liu, 2007), but it is not known whether this is the case for dimensions that are relevant to perceiving speech sounds.

If English speakers’ preference for using VOT over onset F0 in determining a syllable-initial stop-consonant voicing contrast results from a privileged status for VOT, then we would expect VOT to be given more weight than onset F0 when perceiving a voicing contrast even when the perceptual distance between tokens is equalized along the onset F0 and VOT dimensions. Thus, the first goal of the present study is to determine whether VOT and onset F0 exhibit different weighting in a voicing decision when perceptual distance is not a factor. These two commonly studied acoustic correlates of the phonetic voicing contrast were chosen because of the extensive literature on the perception of these two features and because previous research strongly suggests that VOT is more heavily weighted than onset F0 for perceiving the English voicing contrast in syllable-initial stops, yet it is not known whether this pattern still obtains after equating the two distances perceptually.

B. Dimensional integrality

Another consequence of the multidimensionality of speech sounds is that many acoustically independent correlates covary consistently with one another in the speech signal. The covariance of onset F0 and VOT has been argued to arise from a variety of sources. Abramson (1977) and Lisker (1978) suggest that the two features share a common origin in the unfolding of the same laryngeal timing gesture, while Hombert (1978) links the two via aerodynamic demands (higher airflow following the release of voiceless stops leading to a greater onset F0 and longer VOT).² In contrast, others ascribe the covariance to perceptual factors. For example, Kingston and Diehl (1994) and Kingston et al. (2008) argue that the two cues contribute to the perception of an overarching property of low frequency energy continuing into the stop closure (near short VOT/low onset F0 consonants) or its absence (in long VOT/high onset F0 consonants), while Holt et al. (2001) claim that the covariance is learned simply because the two cues are reliably associated in the ambient language (without specifying a basis for this association).

In all cases, however, we might expect covarying cues to be highly integral in the sense of Garner (1974). Listeners who are accustomed to hearing that two cues covary in a consistent manner might be expected to have difficulty ignoring irrelevant variability in one of the cues when making a decision based on the properties of the other, especially if the two cues are integrated into a distinct “intermediate perceptual property” (Kingston et al., 2008). When perceptual distances along the two covarying dimensions are not equal, variability along the more distinctive dimension tends to interfere more with classification along the less distinctive one in a pattern of performance known as asymmetric integrality (see Garner, 1974, 1983; Melara and Mounts, 1994). Thus, in the case of the covarying cues of onset F0 and VOT, if the perceptual distance between long- and short-lag VOT categories is naturally greater than that between falling and rising onset F0 categories, then this would be sufficient to explain the primacy of VOT as a cue to voicing, but artificially equating the perceptual distances along both dimensions should result in a symmetrical pattern of interference.

On the other hand, if VOT is intrinsically more attention-demanding than onset F0, then variability in VOT should interfere more with classification according to onset F0 than vice versa. Moreover, this dominance should be maintained even when the perceptual distances between stimuli are equated (that is, even when stimuli are selected such that their perceptual distance is equivalent along each of two dimensions tested in isolation), because trial-to-trial changes along a more attention-demanding dimension should attract attention more than those along a less demanding one (see Tong et al., 2008, for a review of some such cases).

In support of the possibility that VOT may simply be a more attention-demanding dimension of contrast, Gordon et al. (1993) argue that VOT is a “stronger” phonetic feature than onset F0, in the sense that VOT is more closely linked to the phenomenal quality of voicing than is onset F0. They suggest that under ideal listening conditions onset F0 is more likely to be ignored as a cue to voicing if VOT is unambiguous than vice versa (cf. Abramson and Lisker, 1985). Moreover, Gordon et al. (1993) showed that the primacy of VOT over onset F0 as a cue to stop-consonant voicing was mitigated by attentional demands. Under conditions of high cognitive load, listeners showed a decreased reliance on VOT and a corresponding increase in the relative weight given to onset F0, suggesting that, all else being equal, the use of VOT as a cue to voicing attracts or demands greater attentional commitment than using onset F0. However, in the study of Gordon et al. (1993) no attempt was made to equate the perceptual distance along the two dimensions. Thus, the second goal of this study was to investigate the symmetry of dimensional interference between onset F0 and VOT when making a voicing decision after equating perceptual distances along both dimensions. In this case, any observation of asymmetric integrality, such that variability in VOT interferes more with classification according to onset F0 than vice versa, would support the hypothesis that VOT is an intrinsically more attention-demanding dimension of phonetic contrast.

C. Perceptual learning

If, in fact, VOT is a privileged dimension for voicing (as compared to onset F0), then listeners might be expected to be better at learning new categories distinguished in terms of VOT than ones distinguished according to onset F0. A variety of studies (e.g., Holt et al., 2004; Pisoni et al., 1982) have shown that listeners are able to learn new VOT-based categories with relatively little training, while Francis and Nusbaum (2002) showed that a few hours of laboratory training with Korean speech stimuli were sufficient to induce English listeners to make use of onset F0. However, due to methodological differences it is difficult to compare results across studies. Thus, the third goal of the present study was to determine whether training to identify categories differing only along one of these two dimensions (VOT or onset F0) would have comparable effects, or whether there would be differences in the effects of training based on the dimension being learned.

D. Enhancement and inhibition

A final question concerned the mechanism or mechanisms by which training affected perception of the two dimensions. A few theories of general perceptual learning (Gibson, 1969; Goldstone, 1994; Nosofsky, 1986) have been applied to perceptual learning of speech, primarily to explain the results of first- and second-language learning (Francis and Nusbaum, 2002; Iverson et al., 2003). According to such theories, category learning requires increasing the similarity of tokens within the same category (acquired similarity), while increasing the perceived differences between tokens in different categories (acquired distinctiveness) (see Liberman, 1957, for what is probably the first application of these terms in speech research, and Jusczyk, 1993, for a comprehensive model of first language acquisition that explicitly incorporates these concepts). Such changes are argued to result from changing the relative weighting of different dimensions: Dimensions that are good at distinguishing categories are given more weight (enhanced), while those that do not differentiate categories well are given less weight (inhibited). Existing research provides tentative support for the hypothesis that both enhancement and inhibition of specific dimensions of contrast may operate in perceptual learning of speech. For example, Francis et al. (2000) trained two groups of listeners to use one of two competing cues to syllable-initial stop-consonant place of articulation: The slope of the formant transitions or the spectrum of the burst release. While listeners in the formant-trained condition learned to give increased weight to the formant cue, results from those in the burst-trained group were more suggestive of their having learned to give less weight to the formant cue rather than more weight to the burst cue. However, because the perceptual distance between tokens was not equated across the two cues, we cannot tell whether training caused listeners to adjust the weight given to formant transitions because the

stimuli differed more along this dimension of contrast (formant transitions) or because formant transitions are a privileged cue compared to the spectrum of the burst release. Thus, the final goal of the present study was to provide additional data relevant to determining whether training-related changes in the relative weight given to a specific dimension result from inhibition of the uninformative dimension or enhancement of the more informative one.

E. Summary

In the present investigation listeners were trained to hear a familiar consonantal contrast (voiceless aspirated versus voiceless unaspirated stops, e.g., [p] and [b]) according to either onset F0 or VOT while ignoring variability in the other cue. We used acoustic differences that were within a single category (voiceless aspirated) with the goal of ensuring that our stimuli were located within a region of perceptual space that did not contain any already-known discontinuities in auditory sensitivity such as the well-known discontinuity around 20–30 ms along the VOT dimension (cf. Holt et al., 2004) or the probable discontinuity between falling and rising frequency transitions (Schouten, 1985).

We used a variety of training stimuli, incorporating aspects of “high variability” training which has been argued by some researchers to be more effective than other common types of laboratory training (see discussion by Iverson et al., 2005), in an attempt to improve learning over what is often observed in short-term laboratory training studies. We included stimuli produced at a variety of places of articulation of the initial consonant, with a variety of vowels, and produced by two different talkers. However, because the pretest and post-test results we report here derive from stimuli that were identical to (some of) those used in training, we cannot make any strong assumptions about what listeners were actually learning because there is no possibility to measure generalization, e.g., to a novel talker, place of articulation, or vowel context.

We measured the perceptual distance between tokens differing according to these two dimensions both before and after training and compared it to the distribution of selective attention between the two dimensions at the same times. All measurements were made from listeners who exhibited a high degree of success in learning. Our focus is on the performance of these successful learners because we were interested in the effect of successful learning on the distribution of weight to acoustic cues. By focusing on learners who showed clear improvement in performance, we also increase the validity of any comparison between the effects of learning observed here and those observed in more natural learning tasks (Francis and Nusbaum, 2002) and in actual cases of native language acquisition (e.g., Iverson et al., 2003). We expected that training would increase perceptual distances along the trained dimension while possibly also decreasing distance along the (task-irrelevant) untrained dimension. Corresponding to these changes, following the results of Melara and Mounts (1994), we expected to see an increase in Garner interference when classifying according to the untrained dimension, and a similar decrease in interference when classifying according to the trained dimension.

II. METHOD

A. Subjects

A total of 42 young adults between the ages of 18 and 36 were initially enrolled in this experiment. All of them were undergraduate or graduate students or staff of Purdue University, or residents of the surrounding community. All participants underwent a standard hearing screening [pure tone audiometry at octave intervals between 500 and 4000 Hz at 20 (500 Hz) or 25 dB HL] and a linguistic background questionnaire designed to identify individuals with strongly monolingual perceptual experience. No applicant was enrolled if they failed the hearing screening, had lived for more than two weeks in a non-English speaking environment, grew up speaking any language other than English, or had lived in a household where the predominant language was anything other than English.

Participants were initially randomly assigned to one of two training conditions, VOT training or onset F0 training. However, as the experiment progressed and it became apparent that the VOT training condition was easier than the F0 condition, more participants were assigned to the onset F0 training group to increase the probability of ending up with relatively balanced numbers of successful learners in both conditions. Of the 42 initial participants, 34 completed all phases of the experiment (producing analyzable data), and 24 of these showed evidence of some learning (improvement of at least five percentage points). In all, 16 of these learners (11 women, 5 men) showed evidence of progressing toward expert perception of the contrast on which they were trained, defined as improvement of at least five percentage points above pretest level as well as a final proportion correct of at least 0.70. There were nine such expert learners in the VOT-trained condition (six women, three men) and seven in the F0-trained condition (five women, two men) (see Sec. III B, below).

B. Design

The goal of this study was to investigate the relationship between changes in perceptual distance and the distribution of selective attention before and after successful training to make phonetic decisions based on one acoustic cue as opposed to another. Thus, in addition to the usual pretest-training-post-test structure commonly used in phonetic training studies (e.g., Francis et al., 2000; Francis and Nusbaum, 2002; Guenther et al., 1999; Guion and Pederson, 2007), three kinds of measures were needed, one to assess degree of learning (in order to identify successful learners), one to determine the distribution of selective attention, and one to evaluate perceptual distance. It was also important that this last measure be obtainable even on the pretest, when listeners were expected to be close to chance when using cues on which they had not been trained. To assess learning, the measure of proportion correct responses was used, calculated over the first and last sessions of training. For measuring the distribution of selective attention, a set of related tasks often referred to as a Garner paradigm (Garner, 1974) was used.
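As a minimal sketch of how the learner criteria stated in Sec. II A could be applied to proportion-correct scores from the first and last training sessions, the Python function below labels a listener as a non-learner, a learner (improvement of at least five percentage points), or an expert learner (that improvement plus a final proportion correct of at least 0.70). The function name, the argument names, and the small floating-point tolerance are illustrative assumptions and are not taken from the original analysis; the example scores are hypothetical.

```python
# Illustrative sketch only: names and the tolerance are assumptions,
# not the authors' analysis code.
MIN_GAIN = 0.05      # at least five percentage points of improvement
EXPERT_FLOOR = 0.70  # minimum final proportion correct for "expert" learners
EPS = 1e-9           # guards against floating-point rounding at the thresholds


def classify_learner(first_pc: float, last_pc: float) -> str:
    """Label a listener from proportion correct in the first and last sessions."""
    for pc in (first_pc, last_pc):
        if not 0.0 <= pc <= 1.0:
            raise ValueError("proportion correct must lie in [0, 1]")
    gain = last_pc - first_pc
    if gain < MIN_GAIN - EPS:
        # Too little improvement, regardless of the final score.
        return "non-learner"
    if last_pc >= EXPERT_FLOOR - EPS:
        # Enough improvement and a high enough final score.
        return "expert"
    return "learner"


if __name__ == "__main__":
    # Hypothetical scores, not data from the study.
    for first, last in [(0.55, 0.72), (0.50, 0.60), (0.68, 0.71)]:
        print(first, last, classify_learner(first, last))
```

Under these criteria a listener who ends above 0.70 without gaining five percentage points is not counted as a learner, which reflects the requirement in the text that both conditions hold for expert status.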

Finally, to measure perceptual distance, two quantities were obtained: Sensitivity (d′) computed from a speeded target monitoring (STM) task and response time (RT) computed from the baseline component of the Garner paradigm (see below). Sensitivity in an STM task was used in addition to the Garner baseline task (which was collected in the course of evaluating selective attention, see below) for two reasons. First, the validity of response-time measures may be less reliable when participants are close to chance, as there will be fewer correct responses on which to base average scores, but the stimuli used in this experiment necessarily sounded quite similar to listeners (prior to training) to increase the likelihood of observing training-related improvement, meaning that performance on the initial Garner task would likely be close to chance. Second, since the primary goal of this study was to compare changes in perceptual distance with changes in selective attention, it was thought desirable to obtain a measure of perceptual distance through methods independent of, though similar in task structure to, the methods used to measure selective attention.

A final aspect of the experimental design that may play a role in interpreting the results is the choice of response categories in the Garner paradigm. In a typical Garner paradigm, stimuli differ along dimensions that are consciously identifiable to listeners, e.g., pitch and loudness, or hue and brightness. In such cases, participants can be instructed to identify stimuli according to a value along either dimension (e.g., is the sound “loud or soft” or “high or low pitched”?). However, in the present case the dimensions are expressly not accessible to conscious processing (Allen et al., 2000). In such cases, researchers frequently first train listeners on novel, arbitrarily labeled categories (e.g., “type 1” versus “type 2”), but this was not an option in the present experiment because one of our research questions involved the effects of training and therefore we did not want to train listeners on the stimuli before we could establish a baseline measure of their performance. Instead, listeners were asked to identify stimuli as belonging to one of two categories (e.g., “B” or “P”) when the decision was made along the dimension they were (to be) trained on, or according to one of two alternative categories when the decision was made along the untrained (to be ignored) dimension. The identity of the alternative categories, stressed and unstressed, was chosen based on the correspondence between both VOT and onset F0 with stress in English: Stressed syllables typically exhibit both a higher overall F0 and longer VOT than unstressed syllables, and a sharply falling F0 contour is associated with emphatic stress (as in the final syllable of the response “You don’t believe that story, do you?” “Yes, I do”). However, listeners were not necessarily expected to be as facile with this classification as with the voicing classification, so it was used only for the untrained dimension.

C. Stimuli

Six sets of 100 stimuli varying in two dimensions (onset F0 and VOT) were generated from naturally recorded tokens using PSOLA resynthesis (PRAAT 4.2, Boersma and Weenink, 2006).

1. Recording

Initially, multiple productions of each of the nine syllables [pʰi], [pʰa], [pʰu], [tʰi], [tʰa], [tʰu], [kʰi], [kʰa], and [kʰu] were recorded by one adult male and one adult female native speaker of a Midwestern dialect of American English. Recordings were made to digital audio tape using a hypercardioid microphone (Audio-Technica D1000HE) and digital audio tape-recorder (Sony TCD-D8) in a sound-isolated booth (IAC, model No. 403A), and redigitized to disk for analysis and resynthesis at 22.05 kHz sampling rate and 16-bit quantization using PRAAT 4.2 via a SoundBlaster Live! sound card on a Dell Optiplex running Windows XP. Speakers recorded multiple instances of three repetitions of each syllable. For example, two or three utterances of [pʰa pʰa pʰa] were recorded by each speaker. Only the second token of each group was digitized to maintain similar intonational properties across tokens. The resulting set of 54 tokens (three repetitions of each of nine syllables by two speakers) was carefully analyzed to identify the acoustically cleanest recording of each syllable. Tokens with a comparatively high degree of line noise or breathiness, irregularities in voicing during vowel production, or other acoustic artifacts that could be compounded by the resynthesis process were discarded. In the end, six tokens were selected for each speaker, creating two mostly overlapping sets (with the lack of complete overlap due to acoustic artifacts in specific recordings). For the female speaker, [pʰi], [pʰu], [tʰi], [tʰu], [kʰi], and [kʰa] were selected, and for the male talker [pʰi], [pʰa], [tʰi], [tʰ], [kʰi], and [kʰa]. Stimuli derived from the male [pʰa] tokens were used for testing, and stimuli derived from all tokens (including the male [pʰa]) were used in training.

2. Resynthesis

Starting with each of the 12 base syllables, a set of 100 tokens was resynthesized using the PSOLA methods implemented in PRAAT 4.2, creating a grid varying in ten steps along each of two phonetically relevant acoustic dimensions, onset F0 and VOT, for a total of 1200 tokens (100 tokens for each of 12 starting syllables). Along the VOT dimension, stimuli ranged from 35 to 65 ms VOT in approximately 3 ms steps.³ Variation in onset F0 ranged from a starting frequency of 1.21 times the starting frequency of the unmodified (base) syllable to 0.91 times (125 Hz for the male [pa]), in steps of about 4 Hz (i.e., for the male [pa] stimulus, the starting frequency ranged from 165 to 125 Hz). All onset F0 contours were linear interpolations starting at the defined initial value and decreasing to the original F0 contour over the first 100 ms of the token (ending at 118 Hz). Thus, all onset F0 contours ranged from sharply falling to nearly flat. There were no rising contours in any stimuli. Slopes ranged from −0.07 Hz/ms (in the shortest VOT, lowest slope stimulus) to −0.47 Hz/ms for the most sharply falling contour.

3. Nomenclature

The goal was to identify four stimuli that differed orthogonally according to two dimensions to perceptually equivalent degrees [forming a square in perceptual space, as

1238 J. Acoust. Soc. Am., Vol. 124, No. 2, August 2008 Francis et al.: Phonetic learning as cue reweighting

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 35.1.233.149 On: Fri, 21 Oct 2016 19:34:53
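As a minimal sketch of the ten-by-ten stimulus grid described under Resynthesis, the parameters for the male [pa] series could be enumerated as below. The function names, the direction of step numbering, and the assumption that each onset slope is a straight line to 118 Hz over the full 100 ms are illustrative choices, not details taken from the resynthesis scripts.

```python
from itertools import product


def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def build_grid(vot_ms=(35.0, 65.0), onset_f0_hz=(125.0, 165.0),
               steps=10, f0_target_hz=118.0, ramp_ms=100.0):
    """Enumerate a steps-by-steps grid of (VOT, onset F0) parameters.

    Defaults are the values reported for the male [pa] series. Numbering
    steps from low to high VOT and F0 is an assumption of this sketch.
    """
    vots = linspace(vot_ms[0], vot_ms[1], steps)
    f0s = linspace(onset_f0_hz[0], onset_f0_hz[1], steps)
    grid = []
    for (i, vot), (j, f0) in product(enumerate(vots), enumerate(f0s)):
        grid.append({
            "vot_step": i,
            "f0_step": j,
            "vot_ms": vot,
            "onset_f0_hz": f0,
            # Assumed: linear fall from the onset value to the target
            # frequency over the whole ramp duration.
            "slope_hz_per_ms": (f0_target_hz - f0) / ramp_ms,
        })
    return grid


grid = build_grid()
slopes = [g["slope_hz_per_ms"] for g in grid]
print(len(grid), round(min(slopes), 2), round(max(slopes), 2))
```

Under these assumptions the grid holds 100 tokens per base syllable, the VOT step is 30/9 ms (about 3.3 ms), the onset-F0 step is 40/9 Hz (about 4.4 Hz), and the slopes span the reported range of −0.47 to −0.07 Hz/ms.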
[Figure 1: three panels (a, b, c), each showing tokens A and B above tokens C and D, with VOT on the horizontal axis and F0 on the vertical axis.]

FIG. 1. Hypothetical illustration of changes in perceptual space from equally balanced performance on pretest (1a) to increased attention to VOT/decreased attention to F0 (1b) or decreased attention to VOT/increased attention to F0 (1c). Axes are measured in arbitrary units of perceptual distance.

shown in Fig. 1(a)]. For the purposes of testing and training, stimuli were identified differently to each group, based on the dimension on which each group was trained. For participants in the VOT-trained group, tokens A and C were both treated as exemplars of B while B and D were categorized as P. Conversely, A and B were both considered stressed while C and D were unstressed. In contrast, for participants in the F0-trained group, A and C were both considered unstressed and B and D were stressed, while A and B were labeled as P and C and D were labeled as B.

D. Procedure

Participants completed a total of 11 to 12 sessions, each about an hour in duration, over the course of three to four weeks (one session per day, usually with no more than three days between any two sessions).

The first three sessions and last three sessions constituted the pretest and post-test, respectively, with six sessions of training between them. In the first pretest session participants completed the hearing test, language background questionnaire, and initial assessment of perceptual distance to identify subject-specific, perceptually equal distances along the two dimensions. In the second and third pretest sessions, participants completed the tasks associated with the Garner selective attention paradigm using both male and female stimuli (one talker in each session). The post-test was accomplished in the reverse order of the pretest, but consisted of the same tests (Garner paradigm followed by perceptual distance measurement). When time permitted, the last two sessions of the post-test were conducted on the same day. Training was carried out in the intervening sessions.

1. Perceptual distance measurement (STM)

The goal of this stage of the pretest was to identify four tokens whose pairwise perceptual distances were approximately equal in each of the two dimensions, roughly forming a square in the VOT-by-onset-F0 space, as shown in Fig. 1(a). Sensitivity, d′, was used as a measure of perceptual distance because, with listeners expected to be close to chance on the pretest, such a measure would be more informative than response time for correct responses, which might be highly variable due to a high incidence of guessing.

Testing always proceeded in the same order. Starting with the B token (step 7 along both the VOT and onset-F0 dimensions, indicating a token close to but not quite prototypical for [ph]), a corresponding A token was selected having the same onset-F0 value (step 7), but a more [b]-like (shorter) VOT (generally step 3 or 4). Participants then completed a series of eight repetitions of a speeded target monitoring task (STM, see below) using these two stimuli, and sensitivity (d′) was calculated as the difference between the z-score transformed proportion of hits and false alarms [z(H) − z(FA)] (Macmillan and Creelman, 2004), where hits were counted as correct responses to presented targets, while false alarms were incorrect responses to distractors (nontargets). If the listener’s sensitivity to the initial A-B pair was less than 1, a more distant candidate for the A token was selected (e.g., step 2 or 1) and the STM task was repeated. Conversely, if the listener’s sensitivity to the initial A-B pair was greater than 1, a closer candidate for the A token (e.g., step 4 or 5) was selected and the STM task was repeated. This process was repeated until either (1) a VOT step value was identified that was approximately 1 d′ distant from the B token along the VOT dimension or (2) the perceptual distance between the B token and the most distant possible A token (VOT step 0) was determined. At this point the A token was fixed and the selection of a D token began. If the most distant A token was selected (i.e., if the maximum distance between the B and A tokens was still less than 1 d′), then the d′ value calculated between this A and the B token was used as the critical value (instead of 1) for the next leg of the square. A similar quasi-iterative process was used to select a D token located approximately the same distance away from the B token along the onset-F0 dimension (typically close to 1 d′, but sometimes less if the step-0 A token was used). This process took between one and five repetitions for the AB distance (mean = 2.2, SD = 0.81) and between one and four repetitions for the BD distance (mean = 2.2, SD = 0.76). After A and D tokens had been identified through these iterative procedures, a C token was automatically selected having the onset-F0 step value of the D token and the VOT step value of the A token. Once all tokens were selected, the perceptual distances between the remaining adjacent pairs (DC and AC) as well as the diagonals (AD and BC) were computed using the same STM task (see Sec. III). In this way, a set of four tokens were selected that were approximately equidistant in perceptual space for each individual listener. Step values identified in this session were then used for all stimuli, both in testing and training. Note that, since the order of presentation of each pair was the same for all listeners, some effect of order of presentation may have occurred.

The task used to determine d′ for a given pair of stimuli was STM. For every pair of tokens, listeners completed one set of eight trials with each trial consisting of a total of 20 stimulus presentations. In each trial, participants were shown a type of sound to monitor for (e.g., B or P for tokens differing only along the trained dimension or stressed or unstressed for tokens differing only along the untrained dimension). The stimulus corresponding to this identifier was considered a target for this trial, while the other stimulus was

considered the distractor. For example, if a member of the VOT-trained group was being tested on the distance between the C and D tokens, in a trial specified as monitoring for B, the C token (more [b] like) would be the target while the D token (more [p] like) would be the distractor. If the trial involved monitoring for P then the D token would be the target and the C token would be the distractor. The category identifier (e.g., B) remained on the screen for the duration of the trial. Beginning 1 s after the target identifier appeared on the screen, listeners heard a series of 20 tokens, presented with 1250 ms stimulus onset asynchrony. There were an equal number of target and distractor tokens, and these could appear in any order within the trial with the constraint that a target token could not appear first or last in the trial. Participants were instructed to press a response key every time they heard a syllable starting with the sound shown on the screen and not to respond if the syllable began with a sound different from the symbol shown. They were asked to respond as quickly as possible, but also to be as accurate as possible. Responses were scored as hits (responses to targets) or false alarms (responses to distractors) and combined over all eight trials (total of 80 target presentations and 80 distractors) and used to calculate d′.

Before each trial, listeners were familiarized with the two tokens to be used, and their respective labels for the particular contrast being tested (e.g., for a participant in the VOT-trained group, the A versus B stimulus contrast would be presented as exemplars of B (paired with the A token) and P (paired with the B token)). Familiarization consisted of presentation of a stimulus label (e.g., B) with instructions to click on the mouse button in order to hear an example (the A token). After one presentation, listeners were instructed to click the mouse again to hear the sound again. Then the task proceeded to the next stimulus/label pair. Thus, each stimulus was presented a total of 16 times with its associated label in a given block (twice per each of eight trials).

2. Garner paradigm

A complete Garner selective attention paradigm consists of three kinds of tasks, each using stimuli drawn from a set of four stimuli, arranged in a square in perceptual space. The tasks are typically referred to as baseline, correlation, and orthogonal or filtering (Garner, 1974; Pomerantz et al., 1989). Each task involves classifying two or four stimuli as exemplars of two categories, e.g., B or P. In this experiment participants completed two baseline tasks, two correlation tasks, and one filtering task for each dimension of classification. Because our focus is on Garner interference, only results from the baseline and filtering tasks will be discussed in detail, although responses to some of the stimuli in the correlated condition (specifically, the A and D tokens) are informative with respect to the question of the relative weighting of the two cues in a directly conflicting condition analogous to that used by Francis et al. (2000). Moreover, although both male and female voices were used, only results for the male stimuli will be discussed because performance was noticeably better for this talker, especially among the F0-trained listeners. Tasks were blocked by talker (in different sessions) and by dimension: All tasks involving classification by the trained dimension were grouped together, as were all involving classification according to the untrained dimension. Furthermore, the order of labels on the screen (e.g., B and P) and their associated response keys was counterbalanced within blocks for each listener, such that the first half of each block of trials used one order (e.g., B on the left, P on the right) while the second half used the other order. Other than this, tasks were randomized.

In each of the baseline and correlation tasks, listeners heard repetitions of only two different stimuli, e.g., the A and B tokens or the A and the C tokens, and classified them according to the appropriate categories by pressing a button on a button box corresponding to the category label shown on that side of the screen. For example, A and B would be classified as B and P, respectively, by participants in the VOT-trained group classifying stimuli along the trained dimension, but as unstressed and stressed by participants in the F0-trained group classifying stimuli along the untrained dimension. In the correlation condition stimuli were classified according to both dimensions. For example, the contrast between A and D would be classified as “B and stressed” versus “P and unstressed” by listeners in the VOT-trained condition, and as “P and unstressed” versus “B and stressed” by listeners in the F0-trained condition. In the filtering condition listeners still made a binary decision, e.g., B or P, but all four stimuli were presented in random order (see Table I for a complete description of the distribution of stimuli in each task).

TABLE I. Structure of Garner paradigm experiment showing stimuli and tasks for both groups in all conditions.

VOT-trained group
Task            Trained dimension      Untrained dimension
                (“Is it B or P?”)      (“Is it stressed or unstressed?”)
Baseline 1      A, B                   A, C
Baseline 2      C, D                   B, D
Filtering       A, B, C, D             A, B, C, D
Correlation 1   A, D                   A, D
Correlation 2   B, C                   B, C

F0-trained group
Task            Trained dimension      Untrained dimension
                (“Is it B or P?”)      (“Is it stressed or unstressed?”)
Baseline 1      A, C                   A, B
Baseline 2      B, D                   C, D
Filtering       A, B, C, D             A, B, C, D
Correlation 1   A, D                   A, D
Correlation 2   B, C                   B, C

In the baseline and correlated conditions there were a total of 64 trials with each pair of sounds (32 trials per stimulus, in random order within blocks). Response choice location and corresponding button were counterbalanced within each block (e.g., half of the trials showed the order “B” “P” and the other half showed “P” “B” from left to right), for a

total of 128 stimulus presentations for both dimensions of contrast (trained and untrained). In the filtering condition there were also 128 total trials (32 per stimulus) and response choice location was similarly counterbalanced. Before the Garner paradigm began, listeners completed a minisession consisting of two trials of each of the two baseline tasks (in random order). Before every block (practice, each baseline condition, each correlated condition, and the filtering task) listeners were also familiarized with the stimuli and their respective labels to be used in the current block, in the same manner as for the STM task. However, unlike the STM task, familiarization was carried out before each block of the Garner task, not before each trial.

Response times for each correct response were averaged according to dimension of classification (either trained or untrained) and task (baseline, filtering) for each subject, and Garner interference was calculated as (filtering RT − baseline RT) for each dimension. Response times were measured from the beginning of the stimulus and no response times less than 350 ms (the maximum duration of the longest male stimulus) were recorded.

3. Training

The six sessions between the pretest and post-test consisted of training. In each session, listeners heard six blocks of trials, three with the male voice and three with the female one. Each block of trials consisted of stimuli with a different place of articulation (bilabial, alveolar, and velar). Possible responses were always appropriate to the place of articulation (e.g., “P” or “B” for the bilabial blocks, “T” or “D” for the alveolar blocks, and “K” or “G” for the velar blocks). In each block, listeners heard eight different stimuli, presented in random order, ten times each. As in the Garner tasks, the trials in the first and second halves of each block used a different response order left to right. The stimuli consisted of the tokens corresponding to those identified in the initial perceptual distance measurement, but with the appropriate consonant place of articulation and vowel quality for the given block. For example, once a given participant demonstrated roughly equal perceptual distances between four /pa/ stimuli, then in the velar blocks of trials that participant would have heard /ka/ and /ki/ syllables with onset F0 and VOT values corresponding to the same steps along their respective continua.

III. RESULTS

A. Training

Overall, training was successful. Looking at performance on the first and last (sixth) days of training, across all training stimuli (male and female, at all places of articulation and in all vowel contexts included in the experiment), listeners in the VOT group improved from 68% to 81% correct, while those in the F0 group improved from 60% to 67% correct. Results of a repeated measures ANOVA with the two factors of group (VOT trained and F0 trained) and training session (days 1 and 6) showed a significant effect of session, F(1, 32) = 4.40, p = 0.001, and of group, F(1, 32) = 9.31, p = 0.005, but no interaction, F(1, 32) = 3.18, p = 0.08. Planned comparisons of means (significance reported for tests at p < 0.05 or better) on the pretest showed a significant difference between participants assigned to VOT training and those assigned to F0 training, and this difference remained significant on the post-test. However, both groups improved significantly from day 1 of training to day 6. A t-test of difference scores showed no significant difference between the improvement from day 1 to day 6 for the VOT group (13%) and that shown by the F0-trained group (8%). However, this may be a result of the large amount of variance in changes in performance, since 13 out of the 14 participants (93%) in the VOT-trained group showed an improvement from pretest to post-test, as compared to only 15 out of 20 (75%) in the F0-trained group, despite the equalization of perceptual distance along each dimension on a participant-specific basis. This suggests that listeners who were able to learn the F0 contrast were comparatively few, but showed relatively large improvements, while those who learned the VOT contrast were more common, but did not generally show such extreme improvements.

Because we were interested in understanding the effects of learning (successful training), we restricted subsequent analyses to results only from those participants who both achieved at least 70% correct on the final day of training and showed at least 5% improvement in token identification from the first to the last day of training. Repeating the same analysis on only these 16 participants (7 in the F0 group, 9 in the VOT group) showed the expected significant effect of test, F(1, 14) = 69.75, p < 0.001, but no effect of group, F(1, 14) = 3.23, p = 0.09, and no interaction, F(1, 14) = 0.10, p = 0.76 (Fig. 2). Planned comparisons of means showed again that both groups improved significantly (VOT, from 72% to 88% correct; F0 from 64% to 82% correct), but there was no significant difference between the groups on either the pretest or post-test. This suggests that successful learners from both groups showed comparable improvements in performance along the dimension on which they were trained.

B. Perceptual distance (STM)

Responses to targets in the go/no-go STM task were scored as hits while responses to distractors were scored as false alarms. Perceptual distances between each pair of tokens are shown in Table II. Results of a mixed factorial ANOVA of the pretest distances with the between-groups factor of group (VOT-trained, F0-trained) and within-groups factor of pair (AB, CD, AC, BD, and the diagonals AD and BC) showed a significant effect of pair, F(5, 70) = 8.05, p < 0.001, but no effect of group, F(1, 14) = 3.79, p = 0.07, and no interaction, F(5, 70) = 0.62, p = 0.68. Post hoc (Tukey HSD, p = 0.05) tests showed a significant difference only between the pairs that make up the sides of the square (AB, CD, AC, BD) and those making up the diagonals (AD and BC), as Euclidean geometry would predict for a square. There were no significant differences between any two sides of the square, and none between the two diagonals, suggesting that the stimuli selected were perceptually “square” (all sides equal, and both diagonals equal). A similar analysis of the post-test data showed comparable results: A significant

[Figure 2: proportion correct (vertical axis, 0.50 to 1.00) by test day (Day 1, Day 6), plotted separately for the F0 and VOT trained dimensions.]

FIG. 2. Proportion of correct consonant identification responses on the first and last days of training for both training groups (successful learners only, see text). Error bars indicate standard error of the mean.
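The perceptual distances analyzed in this section are d′ values of the form z(H) − z(FA), and per-dimension sensitivity is the mean of the two parallel legs of the square. A minimal sketch of both computations follows. The clamping of hit and false-alarm rates away from 0 and 1 is an assumption of this sketch (the text does not say how extreme rates were handled), and the function names are illustrative.

```python
from statistics import NormalDist


def d_prime(hits, n_targets, false_alarms, n_distractors):
    """Sensitivity as z(H) - z(FA).

    Rates of exactly 0 or 1 are nudged by 1/(2N) so that the inverse
    normal is defined; this correction is assumed, not taken from the text.
    """
    def clamped_rate(count, total):
        floor = 1 / (2 * total)
        return min(max(count / total, floor), 1 - floor)

    z = NormalDist().inv_cdf
    return (z(clamped_rate(hits, n_targets))
            - z(clamped_rate(false_alarms, n_distractors)))


def dimension_sensitivity(pair_d):
    """Average parallel legs: AB and CD differ in VOT, AC and BD in onset F0."""
    return {
        "VOT": (pair_d["AB"] + pair_d["CD"]) / 2,
        "F0": (pair_d["AC"] + pair_d["BD"]) / 2,
    }


# Hypothetical listener: 55 hits on 80 targets, 25 false alarms on 80 distractors.
example = d_prime(55, 80, 25, 80)

# Group means for the VOT-trained pretest, taken from Table II.
vot_pretest = dimension_sensitivity({"AB": 1.59, "CD": 2.28, "AC": 1.54, "BD": 1.48})
print(round(example, 2), vot_pretest)
```

Applied to the VOT-trained pretest means in Table II, the averaging step gives 1.935 for VOT and 1.51 for onset F0, in line with the values of 1.93 and 1.51 quoted in the text. The analysis itself averaged the legs for each subject; using group means here is only for illustration.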

effect of pair, F(5, 70) = 9.88, p < 0.001, but no effect of group, F(1, 14) = 0.18, p = 0.68, and no interaction, F(5, 70) = 1.78, p = 0.13. Again, post hoc analyses showed no significant differences between any two sides of the square, and no difference between the two diagonals, although the diagonals were again significantly longer than the sides.

In order to compare performance from pretest to post-test, parallel legs of each square were averaged (e.g., AB and CD were averaged, as were AC and BD) to derive a measure of sensitivity to each dimension for each subject. Results of a mixed factorial ANOVA with between-groups factor of group (VOT-trained, F0-trained) and repeated measures of test (pretest, post-test) and dimension (VOT, onset F0) showed a significant effect of test, F(1, 14) = 36.39, p < 0.001, but no main effects of group, F(1, 14) = 1.02, p = 0.33, or of dimension, F(1, 14) = 0.30, p = 0.59. There was a significant interaction between dimension and group, F(1, 14) = 5.35, p = 0.04, but no significant interactions between dimension and test, F(1, 14) = 0.45, p = 0.51, group and test, F(1, 14) = 0.29, p = 0.60, or between group, dimension and test, F(1, 14) = 1.30, p = 0.27. Planned comparisons of means (all values reported as significant at p < 0.05 or better) showed that, for the VOT group, there was a significant increase in sensitivity to the VOT dimension (from a d′ of 1.93 to 3.30) and the F0 dimension (from a d′ of 1.51 to 2.72). Similarly, for the F0-trained group, d′ for the VOT dimension increased significantly from 1.28 to 2.51, while for the F0 dimension it increased significantly from 1.29 to 3.12. This suggests that the effect of training on perceptual distance was robust and not constrained to the dimension on which listeners were trained. Overall, these results suggest that the perceptual distances between tokens along each dimension were successfully equated on the pretest, and remained equal after training. Thus, with respect to measures of perceptual distance based on accuracy of speeded target monitoring, training primarily served to increase perceptual distances, and did so to an equivalent degree along both the trained and untrained dimensions.

TABLE II. Perceptual distance, in d′ units, for all pairs of stimuli for both groups on pretest and post-test.

        VOT-trained                     F0-trained
        Pretest        Post-test        Pretest        Post-test
Pair    Mean   SD      Mean   SD        Mean   SD      Mean   SD
AB      1.59   0.45    2.83   1.65      1.06   0.55    2.20   0.75
CD      2.28   0.84    3.77   1.59      1.49   0.79    2.87   1.81
AC      1.54   0.90    2.88   1.20      1.32   0.70    3.68   1.91
BD      1.48   0.42    2.55   1.15      1.25   0.43    2.56   0.75
AD      2.86   1.59    4.32   1.65      1.74   0.89    4.37   0.82
BC      3.32   1.73    4.52   1.00      2.49   1.26    3.98   1.29

C. Perceptual distance (Garner baseline RT)

Although perceptual sensitivity can be measured in terms of response sensitivity (hit rate and false alarm rate), measures based on response time may be better at differentiating subtle training-related differences between groups. Thus, response times for correct responses in the baseline Garner task were averaged for each learner and dimension of classification to provide another measure of perceptual distance between tokens before and after training. Responses made when classifying according to the trained dimension reflect correct responses to the question “Is this B or P?” while those made when classifying according to the untrained dimension reflect response times for classifying according to the other dimension, in response to the question “Is this sound stressed or unstressed?”

A repeated measures ANOVA with one factor between groups (training group, either VOT or onset F0) and two factors within group (test and dimension) showed no significant effects of group, F(1, 14) = 0.23, p = 0.64, test, F(1, 14) = 0.50, p = 0.49, or dimension, F(1, 14) = 0.53, p = 0.48, and no significant interactions between test and group, F(1, 14) = 0.33, p = 0.57, or between dimension and group, F(1, 14) = 0.53, p = 0.48. However, the interaction between group, test, and dimension was significant, F(1, 14) = 13.35, p = 0.003, as shown in Fig. 3. Post hoc (Tukey HSD) analysis with a significance threshold of p = 0.05 showed that the only significant pairwise comparison in the three-way interaction was the 116 ms decrease in baseline response time from pretest (810 ms) to post-test (694 ms) for the VOT-trained group classifying tokens according to the trained (VOT) dimension. The observation that none of the pairwise comparisons for pretest response times showed a significant difference corroborates the findings from the STM task, supporting the claim that stimuli were indeed a perceptual square prior to training. However, the pattern of change in RT, unlike the pattern of change in sensitivity, suggests that only the VOT-trained group showed any appreciable change in perceptual distance between tokens as a result of training, specifically an increase in the distance between tokens along the VOT dimension.

[Figure 3: response time in ms (vertical axis, 500 to 1000) by test (Pre, Post), plotted separately for the F0-trained and VOT-trained groups on the trained and untrained dimensions of classification.]

FIG. 3. Pretest and post-test response times on the Garner baseline task, classifying stimuli as either [b] or [p] (trained dimension) or “stressed” or “unstressed” (untrained dimension) for both training groups, separated by dimension of classification. Error bars indicate standard error of the mean.

D. Cue weighting

On the pretest, in the correlated task, learners showed no strong evidence in favor of one dimension over another. In the correlated condition involving the A and D tokens, stimuli exhibited conflicting values of VOT and onset F0. The A token had a short VOT (similar to a [b]) but a falling F0 contour (like a [p]), while the feature values were reversed for the D token (long VOT like [p] but level F0 onset, more like [b]). Thus, a response of B to the A token or P to the D token would indicate a decision made according to VOT, while a P response to A or a B response to D would indicate a decision made according to onset F0. Overall, learners showed no preference for either cue: 49% of responses to the A token and 51% of those to the D token were consistent with the F0 cue, and this pattern remained even on the post-test (51% and 48%, respectively). This lack of a preference for one cue over another suggests that the bias toward using VOT under normal circumstances (when other cues do not conflict) is not due to something about the VOT dimension per se, but rather has to do with the relative size of the interstimulus differences in VOT as compared to those in onset F0.

There was also a very large difference in response patterns between the two training conditions, even on the pretest. The F0-trained group made 88% of pretest and 96% of post-test responses to both the A and D tokens based on onset F0 (responding P and B, respectively), while the VOT group made only 10% and 7% of their responses to the A and D tokens based on F0, respectively (again, responding P to A and B to D). This suggests that the small amount of familiarization that listeners received prior to beginning the pretest was already sufficient to induce them to make phonetic decisions on the basis of the trained rather than the untrained cue. These results suggest, in turn, that listeners’ use of a particular cue may be strongly influenced by even short-term experience with a talker or context.
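The scoring logic for the conflicting-cue (A and D) tokens described in this section can be sketched as follows. The trial list is hypothetical data, and the function name is illustrative rather than taken from the original analysis.

```python
# Response patterns on the conflicting-cue tokens (Sec. III D):
# A has short VOT ([b]-like) but falling F0 ([p]-like); D is the reverse.
F0_CONSISTENT = {("A", "P"), ("D", "B")}
VOT_CONSISTENT = {("A", "B"), ("D", "P")}


def proportion_f0_consistent(trials):
    """Return the share of A/D responses that follow the onset-F0 cue.

    `trials` is an iterable of (token, response) pairs. Pairs involving
    other tokens or other responses are ignored.
    """
    scored = [t for t in trials if t in F0_CONSISTENT or t in VOT_CONSISTENT]
    if not scored:
        raise ValueError("no responses to the A or D tokens")
    return sum(t in F0_CONSISTENT for t in scored) / len(scored)


# Hypothetical responses from one listener.
trials = [("A", "P"), ("A", "B"), ("D", "B"), ("D", "P"), ("A", "P")]
share = proportion_f0_consistent(trials)
print(share)
```

A share near 0.5 corresponds to the absence of a cue preference reported for the pooled learners, whereas values near 1 or 0 correspond to the F0-dominated and VOT-dominated patterns of the two training groups.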

[Figure 4: Garner interference in ms (vertical axis, −30 to 150) by test (Pre, Post), plotted separately for the VOT and F0 dimensions of classification.]

FIG. 4. Garner interference (difference between RT on the Garner filtering task and RT on the Garner baseline task, see text for description of tasks) showing significant interaction between test and dimension of classification. Error bars indicate standard error of the mean.
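The interference measure plotted in Fig. 4 can be sketched as below: mean correct-response RT in the filtering task minus the average of the two baseline-task means for the same dimension. The trial records are hypothetical, the field and task names are illustrative, and averaging the two baseline task means (rather than pooling their trials) is an assumption of this sketch.

```python
from statistics import mean

MIN_RT_MS = 350  # responses faster than this were not recorded


def trial(task, dimension, rt_ms, correct=True):
    """Build one hypothetical trial record."""
    return {"task": task, "dimension": dimension, "rt_ms": rt_ms, "correct": correct}


def mean_correct_rt(trials, task, dimension):
    """Mean RT over correct, valid responses for one task and dimension."""
    return mean(
        t["rt_ms"] for t in trials
        if t["task"] == task and t["dimension"] == dimension
        and t["correct"] and t["rt_ms"] >= MIN_RT_MS
    )


def garner_interference(trials, dimension):
    """Filtering RT minus the average of the two baseline RTs, in ms."""
    baseline = mean(
        mean_correct_rt(trials, task, dimension)
        for task in ("baseline1", "baseline2")
    )
    return mean_correct_rt(trials, "filtering", dimension) - baseline


trials = [
    trial("baseline1", "VOT", 700), trial("baseline1", "VOT", 740),
    trial("baseline2", "VOT", 760), trial("baseline2", "VOT", 780),
    trial("filtering", "VOT", 800), trial("filtering", "VOT", 830),
    trial("filtering", "VOT", 840),
    trial("filtering", "VOT", 600, correct=False),  # error: excluded
    trial("filtering", "VOT", 300),                 # too fast: excluded
]
interference = garner_interference(trials, "VOT")
print(round(interference, 1))
```

A positive value means that irrelevant variation along the other dimension slowed classification, which is the pattern taken in the text as evidence that the two dimensions are integral.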

E. Garner interference

Comparison of learners’ pretest baseline RT with their corresponding filtering RT using a three-way mixed factorial ANOVA with repeated measures of task (baseline, filtering) and dimension of classification (VOT and onset F0), and between-groups factor of training group (VOT and onset F0) showed a significant effect of task, F(1, 14) = 9.51, p = 0.008, but no effects of group, F(1, 14) = 0.54, p = 0.48, or dimension, F(1, 14) = 2.47, p = 0.14, and no interactions. Filtering performance was overall slower (817 ms) than baseline (743 ms) by 74 ms, suggesting that the two dimensions are indeed integral.

Garner interference was computed as the difference in response time between classification according to a given dimension in the filtering task and the average response time for classifying stimuli according to the same dimension in the two baseline tasks using that dimension. These values were submitted to a repeated measures ANOVA with one factor between groups (training group) and two factors within group (test and dimension). Results showed no significant effect of group, F(1, 14) = 0.08, p = 0.78, test, F(1, 14) = 0.62, p = 0.45, or dimension, F(1, 14) = 1.69, p = 0.21, and no interactions between group and test, F(1, 14) = 0.08, p = 0.78, or group and dimension, F(1, 14) = 1.86, p = 0.19, and the three-way interaction between test, group, and dimension was not significant, F(1, 14) = 1.27, p = 0.28. However, there was a significant interaction between test and dimension, F(1, 14) = 8.26, p = 0.01, suggesting that training had a different effect on the degree of interference of each dimension (Fig. 4). After training, irrelevant variation in F0 no longer interfered with classification according to VOT, but irrelevant variation in VOT continued to interfere with classification according to onset F0.

Although the overall three-way interaction (group by test by dimension) was not significant (Fig. 5), the theoretical basis for the study, namely, the question of whether different kinds of training induce different changes in the processing of the two different dimensions, justified closer examination of some of the contrasts within this interaction. Therefore, a series of planned comparisons were carried out to compare, for each group, the amount of interference for each of the two dimensions on the pretest and on the post-test, as well as the amount of interference for each dimension on the pretest versus the post-test. Significance was set at p < 0.05. Results showed that, for the F0-trained listeners, there was no significant difference between the degree to which F0 interfered with VOT classification and vice versa on either the pretest or the post-test, and there was no significant difference from pretest to post-test in either the interference of F0 on VOT or vice versa. For the VOT-trained listeners there was no significant difference between VOT or F0 interference on the pretest, but a significant increase from pretest to post-test in interference of VOT on classification according to onset F0 resulted in there being a significant difference on the post-test between the interference of VOT on F0 as compared to vice versa. There were no significant differences in F0 interference from pretest to post-test for this group either.

IV. DISCUSSION

A. Training

Although training can be considered successful for both groups, the degree of learning was unexpectedly low as measured in terms of change in proportion of correct responses from first to last day of training and in terms of the number of trained listeners who exhibited the requisite improvement

1244 J. Acoust. Soc. Am., Vol. 124, No. 2, August 2008 Francis et al.: Phonetic learning as cue reweighting

[Figure 5. Axes: Interference (ms) by Test (Pre, Post). Legend (Group, Dimension of Classification): F0, Trained; VOT, Trained; F0, Untrained; VOT, Untrained.]

FIG. 5. Differential effects of training on Garner interference (difference between RT on the Garner filtering task and RT on the Garner baseline task) for VOT- and F0-trained groups, separated by dimension of classification. Error bars indicate standard error of the mean.

in performance. Previous studies training listeners to develop new categories based on non-native VOT differences (e.g., Holt et al., 2004; Pisoni et al., 1982) gave less training and yet showed noticeably better improvement than was found in the present experiment, even for the VOT-trained listeners. The training results of Pisoni et al. (1982) may have been better than those observed here because of their use of a different location in the VOT continuum (they trained listeners to distinguish between a prevoiced category with negative VOT and a short-lag category). However, the intended category boundary for the “inconsistent” group in experiment 1 of Holt et al. (2004) was quite similar to the VOT difference in the present experiment, yet the listeners of Holt et al. (2004) achieved an identification rate of 90% correct or better within about eight blocks of training (about 380 stimulus presentations).

One possible explanation for the poor rate of learning in the present experiment is that, by using very similar VOT and onset F0 values for all of the training stimuli, regardless of place of articulation (POA), we provided less variability than would be found in natural speech. More significantly, this lack of variability is contrary to the typical correlation between VOT and POA, in which VOT increases as POA moves back in the oral cavity (from bilabial to alveolar to velar) (Lisker and Abramson, 1964). The lack of an expected correspondence of this sort between POA and VOT may have made the additional (non-[pa]) tokens less effective for training, and might conceivably have interfered with learning in some way.

Another major factor that probably contributed significantly to the comparatively low learning rate for listeners in the present experiment is the inconsistent mapping between response category and response button in both testing and training. Although this was done intentionally in an attempt to encourage listeners to develop more abstract categories less closely associated with a specific response key, it almost certainly made the task considerably more difficult. Shiffrin and Schneider (1977) have shown that it is much harder to learn an inconsistent mapping between stimulus and response, in which the assignment of stimulus to response changes, than a consistent mapping in which stimuli have the same response across trials. Although in the present case the mapping was, at one level, consistent (i.e., shorter VOT values always mapped onto the response B for listeners in the VOT-trained condition), the mapping between the category label B and the response key (left or right) was inconsistent, and this presumably contributed to poorer performance on this task (see footnote 4).

B. Perceptual weighting

In the present study, perceptual distance was successfully equated along the two dimensions of VOT and onset F0, as indicated by the results of the pretest STM (d′) and Garner baseline (RT) tasks. This suggests that the typically observed pattern of using VOT in preference to onset F0 as a cue to voicing in syllable-initial stops (e.g., Abramson and Lisker, 1985; Francis and Nusbaum, 2002; Gordon et al., 1993; Lisker, 1978) can apparently be eliminated at least at the level measurable by discrimination and classification (and at least for tokens that lie within the onset F0 and VOT range of voiceless aspirated stops). In addition, overall performance on the conflicting-cue tokens in the correlated task suggested that listeners showed no a priori preference for using VOT over F0, and just a few instances of familiarization were sufficient to induce listeners from both groups to rely heavily on one cue instead of the other. This further supports the hypothesis that preference for VOT is based
strongly on unequal perceptual distance, and does not derive from any special intrinsic property of VOT as a dimension of perceptual contrast.

C. Dimensional integrality

With respect to the question of integrality, results from the Garner interference task on the pretest suggest that the two dimensions of VOT and onset F0 are integral in the sense of Garner (1974). This is consistent with other research on the integrality of speech dimensions (Kingston and Macmillan, 1995; Kingston et al., 1997; Macmillan et al., 1999). Interference was symmetrical on the pretest, such that there was no significant difference in magnitude between the interference of irrelevant variability in onset F0 on classification according to VOT and vice versa, for either of the two groups of learners. This pattern of results is consistent with the hypothesis that any preference for using VOT over onset F0 in classifying voicing in syllable-initial stop consonants derives from unequal perceptual distances along the two dimensions, and not from any preferred quality of VOT. When the perceptual distances were equated along both dimensions in the present experiment, integrality was symmetrical. However, after training, asymmetry increased, at least for the learners in the VOT group, such that there was significantly less interference from irrelevant variability in the untrained dimension (onset F0) on classification according to the trained dimension (VOT) than vice versa. These results (for the VOT-trained listeners), in turn, are consistent with the hypothesis that training served primarily to increase perceptual distance along the trained dimension (VOT). As demonstrated by Melara and Mounts (1994), unequal perceptual distances between tokens along two different dimensions result in increased interference from the larger dimension. Results of the present experiment suggest that, after successfully learning to rely more heavily on VOT and to better ignore onset F0, the perceptual distance between tokens along the VOT dimension was increased with respect to that along the onset F0 dimension for successful VOT-trained listeners, resulting in the observed pattern of increased interference. As discussed below in Sec. IV E, other results together suggest that this change resulted primarily from increased distance along the VOT dimension, and not decreased distance along onset F0.

D. Perceptual learning

In this experiment, perceptual distance was calculated in two ways, using d′ (sensitivity) in a STM task, and using response time on a Garner speeded classification task. Results were somewhat contradictory, in that the STM task indicated that both groups of learners showed significantly increased perceptual distance along both their untrained and trained dimensions as a result of training, while the classification task indicated that only the VOT-trained group showed an increase in perceptual distance as a result of training, and that occurred only along the trained dimension (see below for a discussion of possible reasons for these differences between monitoring sensitivity and classification response time). At the least, the results from the Garner baseline task lend tentative support to the hypothesis that there may be something special about VOT, as a phonetic cue, that makes it easier to learn than onset F0 (though not easier to use as a cue when perceptual distances are equated): Both groups of listeners were given the same number of trials with the same stimuli, but the VOT-trained group showed, overall, more evidence of stronger learning, including (1) a greater improvement as a result of training (for the entire training group), (2) a greater proportion of listeners showing evidence of learning (an increase of more than five percentage points, with a final score above 70% correct), and (3) the significant changes in Garner interference discussed in the previous section.

While the present results suggest that it may be easier to direct (even) more attention to VOT than to either divert attention from VOT or distribute more attention to onset F0, it is only possible to speculate broadly about the reasons for such asymmetry in learnability. The most obvious explanation is that American English listeners are simply more used to directing attention to VOT than to onset F0 (cf. Francis and Nusbaum, 2002; Gordon et al., 1993), and thus increasing attention to an already dominant dimension of contrast comes relatively easily. In contrast, inhibiting such a cue may be considerably more difficult, especially since listeners in these studies spend relatively little time in training compared to the amount of time they spend speaking their native language outside the laboratory (where giving greater weight to VOT is clearly a beneficial strategy).

This possibility may be further compounded by the fact that, in testing, listeners were not directed to make judgments about the specific dimensions in question, as would occur in a typical Garner paradigm (e.g., “classify the sounds according to the pitch dimension, as either high or low”). Rather, because the dimensions of VOT and onset F0 are not usually thought of as being consciously accessible to untrained listeners, linguistically plausible contrasts were chosen ([b]/[p] for voiced/voiceless, and stressed/unstressed) with the intent that each of these two dimensions should map sufficiently well onto either of the two acoustic cue contrasts (VOT or onset F0). That is, the goal was to use two dimensions such that the mapping between a short VOT stimulus and the response B would be equally acceptable to naïve listeners as that between a short VOT stimulus and the response “unstressed” (and similarly for mappings between shallow onset F0 declines and B and unstressed responses, as well as for long VOT/sharp onset F0 declines and P or “stressed” responses). However, although all expected mappings are plausible a priori (stressed syllables do have longer VOT and higher F0 than unstressed ones, and voiced sounds do have shorter VOT and a less negative slope of onset F0 than do voiceless ones), these linguistic dimensions do not, in fact, map equally well onto each respective response for native speakers of English. Not only are English speakers more accustomed to making voicing distinctions based on VOT, not onset F0 (as discussed in the previous paragraph), but they are also more accustomed to making stress distinctions on the basis of F0 than on the basis of VOT. Thus, testing conditions, in terms of the mappings between response items and acoustic dimensions, were much more natural for the
VOT-trained listeners, who were tested with the P/B contrast mapping onto the VOT difference and stressed/unstressed mapping onto the onset F0 difference, than for the onset F0-trained listeners, who were tested with the P/B contrast mapping onto the onset F0 difference and stressed/unstressed mapping onto VOT. In other words, our indices of perceptual distance and the distribution of selective attention may be confounded, for the onset F0 group, with experiment design-specific factors, and this might explain why the onset F0 group showed a comparable degree of improvement to the VOT-trained group on the training task (measured in terms of proportion correct identification), but failed to show any evidence of a differential change in the processing of onset F0 as opposed to VOT that might explain this improvement.

On the other hand, it is also possible that there is something intrinsically more learnable about the acoustic properties that comprise VOT as opposed to onset F0 (i.e., an advantage for learning temporal as opposed to spectral contrasts), but to test this hypothesis would require eliminating the bias induced by native language experience, for example, by identifying and training listeners whose native language weighted onset F0 equally with VOT (one such possible example might be Korean, cf. Francis and Nusbaum, 2002). Finally, it may also be noted that training of this sort served primarily to improve the speed with which listeners were able to make a decision, and such an improvement was disproportionately advantageous for decisions based on VOT, which is fundamentally temporal in nature and occurs earlier in the syllable, as opposed to onset F0, which involves both spectral and temporal properties and occurs later in the syllable.

E. Enhancement versus inhibition

Although the two methods used to measure perceptual distance (sensitivity in speeded target monitoring versus response time in speeded classification) provided somewhat discrepant results (see below), it is important to note that both methods provided strong evidence that training served only to increase the perceptual distance between tokens (acquired distinctiveness), not to decrease it (acquired similarity). Only the VOT group showed a change in interference, and this was only in terms of the decrease in interference of the untrained on the trained. The (expected) corresponding increase in interference of the trained on the untrained was not significant, although the trend was definitely in the expected direction. Given that the dimensions of VOT and onset F0 are highly integral, these results are entirely consistent with results from previous research. In particular, Goldstone (1994) also found evidence for increased perceptual distance along a variety of trained dimensions in a visual category learning experiment, but only found evidence of decreased perceptual distance along a to-be-ignored dimension when the two dimensions were perceptually separable in the sense of Garner (1974). Indeed, cases of true acquired similarity seem to be relatively rare in the perceptual learning literature [cf. Guenther et al. (1999) for discussion, and Francis and Nusbaum (2002) for an example of acquired similarity with more natural stimuli].

There are at least two ways to characterize the difference between acquired similarity and acquired distinctiveness. Iverson and co-workers (Iverson and Kuhl, 2000; Iverson et al., 2003) have argued that acquired similarity arises from properties of the statistical distribution of input stimuli in perceptual space in a manner independent of attention, while acquired distinctiveness results from the operation of an attentionally demanding process. Although there is now evidence that even passive statistical learning depends on the availability of attentional resources (Toro et al., 2005), there is also evidence that the development of acquired similarity can be facilitated by certain distributional properties of the training stimuli. Thus, the Iverson argument may still be valid, despite the almost certain involvement of attention in the process of phonetic cue learning. In support of a role for distributional factors, Guenther et al. (1999) found that in order to induce increased similarity, it was necessary to provide not only categorization training (as in the present experiment) but also multiple exemplars of each category. They argued that experience with multiple exemplars encouraged listeners to ignore small (noncategorical) differences between stimuli within a single category, an effect impossible to achieve when training with only a single exemplar [see also Iverson et al. (2005) for similar arguments related to a test of the efficacy of high variability training].

On the other hand, Goldstone (1994) and Francis and Nusbaum (2002) argued that the processes of acquired distinctiveness and acquired similarity may be employed at different stages in the learning process, and/or under different conditions of stimulus properties. In cases such as the present experiment and those of Goldstone (1994) in which stimuli are perceptually highly similar (located within a single native category in the present case, or within one just noticeable difference (JND) of one another in the Goldstone case), acquired distinctiveness is the most effective strategy for significantly improving categorization quickly. In contrast, under conditions in which stimuli are already relatively easy to categorize (e.g., certain contrasts in the Korean stimuli used by Francis and Nusbaum, 2002), acquired similarity, especially along an irrelevant dimension of contrast, leads to a more significant improvement in categorization than would simply further increasing the already salient difference between the two categories along an already contrastive dimension.

Of course, the two accounts are not necessarily mutually exclusive, in the sense that the presence of multiple exemplars within each category increases the probability that variance within the category is relatively high, which in turn increases the likely benefit of applying a process of acquired similarity to reduce within-category variability. In the present case, however, listeners were trained with multiple exemplars, but these exemplars were acoustically extremely similar to the test stimuli along the critical dimensions of onset F0 and VOT, and yet the two categories represented by these exemplars (and by the test stimuli) were extremely close to one another in perceptual space. Thus, in this case, although listeners received multiple training exemplars, one might argue that they were not distributed in a manner that would be expected to promote acquired similarity on the basis of either
of these two hypotheses. The distribution of training exemplars was not sufficiently broad to engage a Guenther/Iverson type of mechanism, and the overall similarity of the two categories was sufficiently great to engage a mechanism of acquired distinctiveness over one of acquired similarity in a Francis/Goldstone type of model. Further research is clearly necessary to explore the basis for these two kinds of processes.

F. Differences between monitoring sensitivity and classification response time

One curious finding in the present results is the apparent disagreement between the two measures of perceptual distance employed, sensitivity in speeded target monitoring and response time in speeded classification. While the sensitivity results indicated that listeners in both groups showed equivalently increased perceptual distances along both their trained and untrained dimensions of contrast, the response-time data suggested that only the VOT-trained listeners showed a change in perceptual distance, and this increase occurred only along VOT, the dimension on which they were trained.

This finding is particularly curious given the commonly accepted assumption that response time and accuracy tasks measure more or less the same thing (perceptual distance between tokens). Ashby and Maddox (1994) discuss the widespread nature of this assumption as they develop an explicit model relating RT performance to perceptual distance between tokens and decision (category) boundaries, based on general recognition theory (GRT) (Ashby and Townsend, 1986). Specifically, they propose that RT should decrease monotonically as a function of the perceptual distance between the stimulus and the decision bound. Furthermore, the GRT as well as other theories of similar phenomena (e.g., Luce, 1986) clearly demonstrate that difficult discriminations are associated with longer response times. Thus, we have every reason to expect a correspondence between RT and accuracy measures: As stimuli become more distant from one another in perceptual space, they should become both easier to identify (in the STM task) and correct identifications should be faster (in the Garner baseline task). However, it is possible that, in the present case, specific details of the experiment design unintentionally predisposed listeners to treat the two tasks differently with respect to the type of memory or attentional mechanisms they employed, resulting in a divergence between the results of the two tasks.

One potentially important difference between the two tasks in the present experiment is that, in the STM task, listeners received much more frequent familiarization with exemplars of the two categories they were using than they did on the classification task. In the STM task, listeners heard two presentations of each of the two stimuli in a given trial (e.g., the A and B tokens), accompanied by visual presentation of their associated category label, before every trial. On the other hand, in the classification task, listeners were familiarized with the stimulus-symbol pairing only three times, once before each block of trials (baseline, correlated, and filtering). Thus, performance in the STM task may better reflect listeners’ ability to compare each test stimulus with short-term memory traces of the familiarization stimuli, while performance on the Garner baseline task better reflects listeners’ ability to compare test stimuli with long(er)-term category representations [see Xu et al. (2006) for a model of memory for phonetic categorization].

Macmillan (1987) distinguishes between sensory (or trace) and context modes of processing. In the trace mode, processing is dominated by comparison of (temporary) sensory traces of stimuli, while in context coding processing involves comparison between sensory traces of stimuli and (longer-term) perceptual anchors, including category representations. In this sense, the different familiarization protocols for the two types of tasks may have encouraged a greater degree of reliance on sensory coding in the STM task and on context coding in the classification (Garner baseline) task. That is, performance measured in terms of accuracy on the STM task may serve mainly to indicate listeners’ ability to retain and make use of short-term memory traces of the familiarization stimuli. As listeners learned which properties of the signal (VOT and onset F0) varied across the training stimuli, they may have become better able to encode and retrieve these properties as short-term memory traces (i.e., when exposed to the tokens during familiarization). Since both properties varied equally across the training set, listeners showed an equal degree of improvement in encoding and retrieving memory traces of these properties.

On the other hand, RT performance on the Garner baseline task may better reflect listeners’ ability to access stored long-term representations of phonetic categories (context coding). It has been argued that perceptual learning based on categorization training (as used here) primarily affects categorization at the level of context coding (Guenther et al., 1999). According to this hypothesis, training was successful in changing the long-term representations of the categories that listeners were learning (e.g., B versus P), but this only became obvious in the Garner baseline (RT) task because there was sufficient time between the presentation of the familiarization stimuli and the actual test trials that listeners were not able to rely solely on trace memories of the familiarization stimuli and instead had to depend on their long-term memories of the different (learned) category representations. Thus, it may be argued that the results of the Garner baseline task are more indicative of the overall phonetic consequences of this kind of training than are those of the STM task, because they better reflect changes in listeners’ attention to features encoded in long-term memory representations of the learned categories, while the results of the STM task reflect instead an increase in overall sensitivity to those acoustic properties that varied during training as a result of increased attention to the speech signal under conditions of higher uncertainty (Nusbaum and Magnuson, 1997; Nusbaum and Schwab, 1986; Wong et al., 2004).

G. The role of attention in phonetic learning

Gordon et al. (1993) showed that, under conditions of (comparatively) unlimited attentional load, American English listeners gave more weight to VOT than to onset F0 in a voicing decision. In contrast, under conditions of more limited attentional availability, listeners showed a greater reduction in the weight given to VOT than in that given to onset F0. They argued that weak acoustic cues (e.g., onset F0) require comparatively little attention to make their full contribution to a phonetic decision (thus benefiting little from an increased availability of attention), while stronger cues (e.g., VOT) benefit more from increased availability of attentional resources. We elaborate on this hypothesis by proposing that using any cue requires some commitment of attention, but that attention is allocated dynamically depending on the current diagnosticity of specific cues. Under normal circumstances those cues that have proven to be most diagnostic (e.g., over the course of prior experience) receive the lion’s share. Under conditions of limited attentional availability, the capacity devoted to each cue is reduced proportionally, with strong cues continuing to receive proportionally more of the smaller pool of available resources. In new contexts or under conditions of uncertainty (i.e., multiple talkers, high noise, etc.), the distribution of attention to individual cues may vary as the speech perception mechanism begins to seek out cues that are potentially more diagnostic under those conditions (Nusbaum and Magnuson, 1997; Nusbaum and Schwab, 1986; Wong et al., 2004). Such reallocation may result in a more even distribution of resources across cues as attention is withdrawn from cues that are typically stronger but fail to be sufficiently diagnostic in the present context, and reallocated toward cues that, though typically weaker, might potentially be more diagnostic in the present case.

In this dynamic redistribution of attention we see a reconciliation between the effects of training and the effects of experimental task observed in the present experiment. On the one hand, perceptual training may alter the baseline distribution of attention to specific cues, increasing the weight given to cues that are sufficient for identifying the newly learned categories, and reducing that given to less diagnostic cues. That it does so preferentially for VOT and less so for onset F0 suggests that there is something special about VOT, at least as a cue to the perception of syllable-initial stop-consonant voicing by native speakers of English. On the other hand, frequent presentations of representative stimuli differing along two dimensions (as in the STM task) may encourage listeners to maintain a high level of attention to both cues to facilitate the use of trace coding. Thus, the ability of training to accomplish the redistribution of attention among acoustic cues may only become obvious under conditions in which listeners are not constantly reminded of the multiple dimensions (diagnostic and nondiagnostic) along which stimuli differ, and instead are forced to focus on

tems that process VOT as compared to onset F0, or from experience-dependent development of such systems is a question beyond the scope of the present paper. However, by considering such neural commitment in terms of the distribution of attentional resources we are able to link the role of attention in perceptual learning (Guion and Pederson, 2007; Strange, 2006) to processes of online speech perception (Gordon et al., 1993), making a connection that is obviously necessary, but thus far only occasionally discussed (Nusbaum and Goodman, 1994; Stevens et al., 2006; Toro et al., 2005).

ACKNOWLEDGMENTS

This work was supported by a grant from the National Institute on Deafness and other Communication Disorders (NIH-NIDCD R03DC006811) to A.L.F. We would like to thank Bob Melara, Howard Nusbaum, John Kingston, and an anonymous reviewer for suggestions on earlier drafts of this article. Some of these results were presented at the fourth Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, Honolulu, HI, November 28–December 2, 2006.

1. Although the dimension of VOT has been explored in considerable depth, the dimension of onset F0 is less well investigated, and to our knowledge there are no studies that provide quantitative data on listeners’ sensitivity to onset F0 differences comparable to the wealth of information available regarding VOT (see Holt et al., 2004 for discussion).

2. Note that subsequent research (e.g., Löfqvist et al., 1989) supports a physiological origin of the onset F0 property of stop consonants in the degree of tension of the cricothyroid muscle, suggesting that there is no direct physiological link between onset F0 and VOT cues. This physiological dissociation is further supported by the patterning of these two cues in three-way stop consonant systems such as those of Korean and Thai, in which stop categories are distinguished by independent onset F0 and VOT properties (Thai: Gandour, 1974; Korean: Francis and Nusbaum, 2002; see Francis et al., 2006 for discussion).

3. While step size was maintained as closely as possible across tokens and talkers, when specific values are given here they refer to the test stimuli based on [pʰa]. Other stimuli varied slightly from these specific values to preserve some degree of interstimulus variability, but never by more than 5 ms or two percentage points (for frequency modifications) from the values given here.

4. In an ongoing study using nonspeech sounds in a similar testing/training paradigm, we have found that simply eliminating this inconsistent mapping between response label and response key improves learning considerably both in terms of the number of listeners who are able to reach criterion, and in terms of the magnitude of the overall change in proportion correct identification from the first to the last day of training.

Abramson, A. S. (1977). “Laryngeal timing in consonant distinctions,” Phonetica 34, 295–303.
Abramson, A. S., and Lisker, L. (1970). “Discrimination along the voicing
stimulus differences that have been encoded in the long-term continuum: Cross-language tests,” Proceedings of the 6th International
Congress on Phonetic Science, Prague, 1967, Academia, Prague, pp. 569–
mental representations of the learned categories. 573.
Ultimately, this perspective is compatible with Kuhl’s Abramson, A. S., and Lisker, L. 共1985兲. “Relative power of cues: F0 shift
neural commitment theory 共Kuhl et al., 2006兲, in the sense versus voice timing,” in Linguistic Phonetics, edited by V. Fromkin 共Aca-
that English listeners appear to have committed to VOT to a demic New York兲, pp. 25–33.
Allen, J., Kraus, N., and Bradlow, A. 共2000兲.. “Neural representation of
greater degree than to onset F0 共at least as a cue to the consciously imperceptible speech sound differences,” Percept.
phonetic property of voicing in syllable-initial stops兲, and Psychophys. 62, 1383–1393.
reducing that commitment, or increasing their commitment Ashby, F. G., and Maddox, W. T. 共1994兲. “A response time theory of sepa-
rability and integrality in speeded classification,” J. Math. Psychol. 38,
to onset F0, seems to require more training, or different kinds
423–466.
of training, than we have employed here. Whether this com- Ashby, F. G., and Townsend, J. T. 共1986兲. “Varieties of perceptual indepen-
mitment derives from innate differences in the neural sys- dence,” Psychol. Rev. 93, 154–179.

J. Acoust. Soc. Am., Vol. 124, No. 2, August 2008 Francis et al.: Phonetic learning as cue reweighting 1249

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 35.1.233.149 On: Fri, 21 Oct 2016 19:34:53
Boersma, P., and Weenink, D. (2006). Praat: doing phonetics by computer (Version 4.2) (computer program). http://www.praat.org/ (last accessed March 17, 2008).
Diehl, R. L., and Kluender, K. R. (1989). “On the objects of speech perception,” Ecological Psychol. 1, 121–144.
Francis, A. L., and Nusbaum, H. C. (2002). “Selective attention and the acquisition of new phonetic categories,” J. Exp. Psychol. Hum. Percept. Perform. 28, 349–366.
Francis, A. L., Baldwin, K., and Nusbaum, H. C. (2000). “Effects of training on attention to acoustic cues,” Percept. Psychophys. 62, 1668–1680.
Francis, A. L., Ciocca, V., Wong, V. K. M., and Chan, J. K. L. (2006). “Is fundamental frequency a cue to aspiration in initial stops?,” J. Acoust. Soc. Am. 120, 2884–2895.
Gandour, J. (1974). “Consonant types and tone in Siamese,” J. Phonetics 2, 337–350.
Garner, W. R. (1974). The Processing of Information and Structure (Erlbaum, Hillsdale, NJ).
Garner, W. R. (1983). “Asymmetric interactions of stimulus dimensions in perceptual information processing,” in Perception, Cognition, and Development: Interactional Analyses, edited by T. J. Tighe and B. E. Shepp (Erlbaum, Hillsdale, NJ), pp. 1–37.
Gibson, E. J. (1969). Principles of Perceptual Learning and Development (Appleton-Century-Crofts, New York).
Goldstone, R. (1994). “Influences of categorization on perceptual discrimination,” J. Exp. Psychol. Gen. 123, 178–200.
Gordon, P. C., Eberhardt, J. L., and Rueckl, J. G. (1993). “Attentional modulation of the phonetic significance of acoustic cues,” Cogn. Psychol. 25, 1–42.
Guenther, F. H., Husain, F. T., Cohen, M. A., and Shinn-Cunningham, B. G. (1999). “Effects of categorization and discrimination training on auditory perceptual space,” J. Acoust. Soc. Am. 106, 2900–2912.
Guion, S. G., and Pederson, E. (2007). “Investigating the role of attention in phonetic learning,” in Language Experience in Second Language Speech Learning: In Honor of James Emil Flege, edited by O.-S. Bohn and M. Munro (Benjamins, Amsterdam), pp. 57–77.
Haggard, M., Ambler, S., and Callow, M. (1970). “Pitch as a voicing cue,” J. Acoust. Soc. Am. 47, 613–617.
Haggard, M. P., Summerfield, Q., and Roberts, M. (1981). “Psychoacoustical and cultural determinants of phoneme boundaries: Evidence from trading F0 cues in the voiced-voiceless distinction,” J. Phonetics 9, 49–62.
Holt, L. L., and Lotto, A. J. (2006). “Cue weighting in auditory categorization: Implications for first and second language acquisition,” J. Acoust. Soc. Am. 119, 3059–3071.
Holt, L. L., Lotto, A. J., and Diehl, R. L. (2004). “Auditory discontinuities interact with categorization: Implications for speech perception,” J. Acoust. Soc. Am. 116, 1763–1773.
Holt, L. L., Lotto, A. J., and Kluender, K. R. (2001). “Influence of fundamental frequency on stop-consonant voicing perception: A case of learned covariation or auditory enhancement?,” J. Acoust. Soc. Am. 109, 764–774.
Hombert, J. M. (1978). “Consonant types, vowel quality, and tone,” in Tone: A Linguistic Survey, edited by V. A. Fromkin (Academic, New York), pp. 77–111.
Iverson, P., and Kuhl, P. K. (2000). “Perceptual magnet and phoneme boundary effects in speech perception: Do they arise from a common mechanism?,” Percept. Psychophys. 62, 874–886.
Iverson, P., Hazan, V., and Bannister, K. (2005). “Phonetic training with acoustic cue manipulations: A comparison of methods for teaching English /r/-/l/ to Japanese adults,” J. Acoust. Soc. Am. 118, 3267–3278.
Iverson, P., Kuhl, P. K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., and Siebert, C. (2003). “A perceptual interference account of acquisition difficulties for non-native phonemes,” Cognition 87, B47–B57.
Jusczyk, P. W. (1993). “From general to language-specific capacities: The WRAPSA model of how speech perception develops,” J. Phonetics 21, 3–28.
Kingston, J., and Diehl, R. L. (1994). “Phonetic knowledge,” Language 70, 419–494.
Kingston, J., Diehl, R. L., Kirk, C. J., and Castleman, W. A. (2008). “On the internal perceptual structure of distinctive features: The [voice] contrast,” J. Phonetics 36, 28–54.
Kingston, J., and Macmillan, N. A. (1995). “Integrality of nasalization and F1 in vowels in isolation and before oral and nasal consonants: A detection-theoretic application of the Garner paradigm,” J. Acoust. Soc. Am. 97, 1261–1285.
Kingston, J., Macmillan, N. A., Dickey, L. W., Thorburn, R., and Bartels, C. (1997). “Integrality in the perception of tongue root position and voice quality in vowels,” J. Acoust. Soc. Am. 101, 1696–1709.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). “Infants show a facilitation effect for native language phonetic perception between 6 and 12 months,” Dev. Sci. 9, F13–F21.
Liberman, A. M. (1957). “Some results of research on speech perception,” J. Acoust. Soc. Am. 29, 117–123.
Lisker, L. (1978). “In qualified defense of VOT,” Lang Speech 21, 375–383.
Lisker, L. (1986). ““Voicing” in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees,” Lang Speech 29, 3–11.
Lisker, L., and Abramson, A. S. (1964). “A cross-language study of voicing in initial stops: Acoustical measurements,” Word 20, 384–422.
Löfqvist, A., Baer, T., McGarr, N. S., and Seider Story, R. (1989). “The cricothyroid muscle in voicing control,” J. Acoust. Soc. Am. 85, 1314–1321.
Luce, R. D. (1986). Response Times: Their Role in Inferring Elementary Mental Organization (Oxford University Press, Oxford).
Lutfi, R. A., and Liu, C.-J. (2007). “Individual differences in source identification from synthesized impact sounds,” J. Acoust. Soc. Am. 122, 1017–1028.
Macmillan, N. A. (1987). “Beyond the categorical/continuous distinction: A psychophysical approach to processing modes,” in Categorical Perception, edited by S. Harnad (Cambridge University Press, New York), pp. 53–85.
Macmillan, N. A., and Creelman, C. D. (2004). Detection Theory: A User’s Guide, 2nd ed. (Lawrence Erlbaum Associates, Hillsdale, NJ).
Macmillan, N. A., Kingston, J., Thorburn, R., Dickey, L. W., and Bartels, C. (1999). “Integrality of nasalization and F1. II. Basic sensitivity and phonetic labeling measure distinct sensory and decision-rule interactions,” J. Acoust. Soc. Am. 106, 2913–2932.
Massaro, D. W., and Cohen, M. M. (1976). “The contribution of fundamental frequency and voice onset times to the /zi/-/si/ distinction,” J. Acoust. Soc. Am. 60, 704–717.
Massaro, D. W., and Cohen, M. M. (1977). “Voice onset time and fundamental frequency as cues to the /zi/-/si/ distinction,” Percept. Psychophys. 22, 373–382.
Melara, R. D., and Mounts, J. R. W. (1994). “Contextual influences on interactive processing: Effects of discriminability, quantity, and uncertainty,” Percept. Psychophys. 56, 73–90.
Nosofsky, R. M. (1986). “Attention, similarity, and the identification-categorization relationship,” J. Exp. Psychol. Gen. 115, 39–57.
Nusbaum, H. C., and Goodman, J. C. (1994). “Learning to hear speech as spoken language,” in The Development of Speech Perception, edited by J. C. Goodman and H. C. Nusbaum (MIT Press, Cambridge, MA), pp. 299–338.
Nusbaum, H. C., and Magnuson, J. (1997). “Talker normalization: Phonetic constancy as a cognitive process,” in Talker Variability in Speech Processing, edited by K. Johnson and J. W. Mullennix (Academic, San Diego, CA), pp. 109–132.
Nusbaum, H. C., and Schwab, E. C. (1986). “The role of attention and active processing in speech perception,” in Pattern Recognition by Humans and Machines, edited by E. C. Schwab and H. C. Nusbaum (Academic, San Diego), Vol. 1, pp. 113–157.
Pisoni, D. B., Aslin, R. N., Perey, A. J., and Hennessy, B. L. (1982). “Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants,” J. Exp. Psychol. Hum. Percept. Perform. 8, 297–314.
Pomerantz, J. R., Pristach, E. A., and Carson, C. E. (1989). “Attention and object perception,” in Object Perception: Structure and Process, edited by B. Shepp and S. Ballesteros (Lawrence Erlbaum Associates, Hillsdale, NJ), pp. 53–89.
Raphael, L. J. (2005). “Acoustic cues to the perception of segmental phonemes,” in The Handbook of Speech Perception, edited by D. B. Pisoni and R. E. Remez (Blackwell, Malden, MA), pp. 182–206.
Repp, B. H. (1979). “Relative amplitude of aspiration noise as a voicing cue for syllable-initial stop consonants,” Lang Speech 22, 173–189.
Schouten, M. E. (1985). “Identification and discrimination of sweep tones,” Percept. Psychophys. 37, 369–376.
Shiffrin, R. M., and Schneider, W. (1977). “Controlled and automatic human information processing: II. Perceptual learning, automatic attending, and a general theory,” Psychol. Rev. 84, 127–190.
Stevens, K. N., and Klatt, D. H. (1974). “Role of formant transitions in the voiced-voiceless distinction for stops,” J. Acoust. Soc. Am. 55(3), 653–659.

Stevens, C., Sanders, L., and Neville, H. (2006). “Neurophysiological evidence for selective auditory attention deficits in children with specific language impairment,” Brain Res. 1111, 143–152.
Strange, W. (2006). “Second-language speech perception: The modification of automatic perceptual routines. Paper presented at the Fourth Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, November-December, 2006, Honolulu, HI. [Abstract],” J. Acoust. Soc. Am. 120, 3137.
Tong, Y., Francis, A. L., and Gandour, J. T. (2008). “Processing dependencies between segmental and suprasegmental features in Mandarin Chinese,” Lang. Cognit. Processes 23, 689–708.
Toro, J. M., Sinnett, S., and Soto-Faraco, S. (2005). “Speech segmentation by statistical learning depends on attention,” Cognition 97, B25–B34.
Whalen, D. H., Abramson, A. S., Lisker, L., and Mody, M. (1993). “F0 gives voicing information even with unambiguous voice onset times,” J. Acoust. Soc. Am. 93, 2152–2159.
Wong, P. C. M., Nusbaum, H. C., and Small, S. L. (2004). “Neural bases of talker normalization,” J. Cogn. Neurosci. 16, 1173–1184.
Xu, Y., Gandour, J. T., and Francis, A. L. (2006). “Effects of language experience and stimulus complexity on the categorical perception of pitch direction,” J. Acoust. Soc. Am. 120, 1063–1074.
Xu, Y., Krishnan, A., and Gandour, J. (2006). “Specificity of experience-dependent pitch representation in the brainstem,” NeuroReport 17, 1601–1605.
Zatorre, R., and Belin, P. (2001). “Spectral and temporal processing in human auditory cortex,” Cereb. Cortex 11, 946–953.
