
Fpsyg 06 00316

The research investigates how German learners of English perceive and produce speech rhythm during second language acquisition, particularly focusing on the developmental changes in timing patterns. It finds that as proficiency increases, their English speech becomes more stress-timed and faster, but native speakers tend to focus on speech rate rather than rhythmic differences when classifying the learners' speech. The study highlights the importance of rhythmic patterns in language acquisition and processing, suggesting that sensitivity to these patterns persists into adulthood.

ORIGINAL RESEARCH

published: 25 March 2015


doi: 10.3389/fpsyg.2015.00316

Perception of speech rhythm in second language: the case of rhythmically similar L1 and L2
Mikhail Ordin * and Leona Polyanskaya *

Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Bielefeld, Germany

We investigated the perception of developmental changes in timing patterns that happen
in the course of second language (L2) acquisition, provided that the native and the target
languages of the learner are rhythmically similar (German and English). It was found that
speech rhythm in L2 English produced by German learners becomes increasingly stress-
timed as acquisition progresses. This development is captured by the tempo-normalized
rhythm measures of durational variability. Advanced learners also deliver speech at
a faster rate. However, when native speakers have to classify the timing patterns
characteristic of L2 English of German learners at different proficiency levels, they attend
to speech rate cues and ignore the differences in speech rhythm.
Keywords: speech rhythm, rhythm metrics, durational variability, rhythm acquisition, rhythm perception, timing patterns, rhythm development, second language

Edited by: Judit Gervain, Université Paris Descartes, France
Reviewed by: Juan M. Toro, Universitat Pompeu Fabra, Spain; Pilar Prieto, Universitat Pompeu Fabra, Spain
*Correspondence: Mikhail Ordin and Leona Polyanskaya, Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Bielefeld 33615, Germany; [email protected]; [email protected]
Specialty section: This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology
Received: 08 November 2014
Accepted: 05 March 2015
Published: 25 March 2015
Citation: Ordin M and Polyanskaya L (2015) Perception of speech rhythm in second language: the case of rhythmically similar L1 and L2. Front. Psychol. 6:316. doi: 10.3389/fpsyg.2015.00316

Introduction

The differences between languages and linguistic varieties are manifested in the acoustic components of the signal that are perceived by the auditory system and cognitively processed to extract linguistic structures. Some minute acoustic differences are perceived by native speakers, while some gross acoustic changes in the speech stream may be ignored, either not perceived or not attended to. In this study we concentrated on the perceptual relevance of the changes in speech rhythm in second language (L2) that happen in the course of L2 acquisition, provided that the native and the target languages of the learner are rhythmically similar.

We start by introducing the notion of rhythm. We then discuss why perception of rhythmic patterns might be linguistically relevant. Then we report how speech rhythm develops in L2 English spoken by German learners, and why it is worth studying whether people are sensitive to the changes in L2 speech rhythm when the target and native languages of the learner are rhythmically similar. A brief overview of empirical studies of German and English speech rhythm is provided to highlight rhythmic similarities between the languages. Later, we report the results of the perception experiment that aimed to answer the main question of the research: are the rhythmic changes that happen in the course of acquisition perceptually relevant if the L1 and L2 of the learner are rhythmically similar? Finally, we show the theoretical implications of our findings.

Speech Rhythm and Rhythm Measures

The word rhythm implies the idea of periodicity. Based on the auditory impression that certain events or certain speech constituents reoccur periodically in the speech stream, languages were classified into stress-timed (in which stressed syllables were perceived to be distributed at roughly equal intervals, e.g., German, English, Dutch, Russian) and syllable-timed (in which all syllables were perceived to be of roughly equal duration, e.g., French, Italian, Spanish). Later, a new

Frontiers in Psychology | www.frontiersin.org 1 March 2015 | Volume 6 | Article 316


Ordin and Polyanskaya Perception of speech rhythm in L2

rhythmic class of mora-timed languages (in which moras are supposedly perceived as roughly equal in duration, e.g., Japanese, West Greenlandic) was added. Experimental studies, however, failed to find empirical evidence to support this impression (Roach, 1982; Dauer, 1983; Pamies Bertran, 1999). Nevertheless, adults (Ramus et al., 1999) and even infants (Nazzi et al., 1998; Ramus and Mehler, 1999; Nazzi and Ramus, 2003) are able to differentiate between rhythmic patterns of languages that are traditionally classified as stress- and syllable-timed. Therefore, researchers continued looking for the acoustic correlates of auditorily perceived differences in speech rhythm.

A new concept of speech rhythm has been introduced in an attempt to find the perceptually relevant acoustic correlates of rhythmic patterns (Ramus et al., 1999). It rests on the assumption that consonantal and vocalic intervals in the speech signal can exhibit language-specific patterns of durational variability. Languages that are traditionally classified as "stress-timed" exhibit a higher degree of durational variability compared to "syllable-timed" languages. That is, stress-timing is characterized by more substantial differences in duration of vowels and consonantal clusters within the same utterance produced by the same speaker. To capture the variability in duration of speech intervals, a number of so-called rhythm metrics have been proposed. Among the most commonly used interval-based rhythm metrics are the pairwise variability index (PVI) (Grabe and Low, 2002), the standard deviation in duration of speech intervals (Δ) and the percentage of vocalic material in an utterance (%V) (Ramus et al., 1999), and the coefficient of variation in duration of speech intervals (Varco) (Dellwo and Wagner, 2003). Conventionally these metrics are applied to vocalic (V) and consonantal (C) intervals (i.e., sequences of consecutive vowels or consonantal clusters that can straddle the syllabic and word boundaries within an utterance). Yet the metrics have also been applied to capture the durational variability of other speech intervals in order to investigate multiple rhythms on multiple timescales, e.g., on the timescale of feet (Nolan and Asu, 2009) or syllables (Ordin et al., 2011).

Some of these metrics are influenced by the speech rate to a higher degree than the others (Dellwo and Wagner, 2003; Dellwo, 2006; Wiget et al., 2010). For example, ΔV depends on the mean duration of vowels in an utterance. That is, if speech is delivered at a faster rate, mean durations become smaller and ΔV tends to decrease. Dellwo (2006) suggested the Varco measure to normalize for the tempo differences and to capture the differences in durational variability irrespective of the differences in speech rate between the languages. Grabe and Low (2002) also suggested a normalized version of PVI in an attempt to neutralize the influence of the speech rate on the measures of the local durational variability. Formulas 1 and 2 show how the normalized (nPVI) and the raw (rPVI) versions are calculated. White and Mattys (2007) and Wiget et al. (2010) reported that Varco measures, %V and nPVI are more robust to the fluctuations of the speech rate compared to non-normalized metrics.

$$\mathrm{nPVI} = 100 \times \left[ \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right| \Big/ (m-1) \right] \tag{1}$$

$$\mathrm{rPVI} = \left[ \sum_{k=1}^{m-1} \left| d_k - d_{k+1} \right| \Big/ (m-1) \right] \tag{2}$$

where $m$ is the number of intervals in an utterance for which PVI is calculated, and $d_k$ is the duration of the $k$th interval.

Higher values of %V and lower values of the other metrics correspond to the languages that are traditionally defined as syllable-timed (Ramus et al., 1999; Low et al., 2000; Grabe and Low, 2002; Dellwo and Wagner, 2003; White and Mattys, 2007), and possibly provide the necessary cues to differentiate the durational patterns of the rhythmically contrastive languages (Ramus and Mehler, 1999; Ramus et al., 1999).

Dauer (1983, 1987) and Schiering (2007) analyzed the phonological structure of languages which give the impression of stress-timing or syllable-timing. Those languages that produce the effect of stress-timing display vowel reduction, more complex C clusters, a greater number of different syllable types, and oppositions between phonologically long and short vowels and between geminate and non-geminate consonants, and they are less likely to exhibit vowel harmony and fixed stress. Rhythm metrics reflect these language-specific phonological properties. To name a few examples, ΔC is thought to be indicative of the syllabic structure, syllable complexity and consonantal phonotactic constraints. ΔV is supposed to be indicative of the degree of vowel reduction. VarcoV and VarcoC reflect the same properties as ΔC and ΔV do, but Varco measures are supposed to neutralize the effect of the tempo differences, and thus to reduce the effect of idiosyncrasies in speech production. %V indicates the syllabic structure and inventory. Languages with a more restricted syllabic inventory use less complex syllables, usually of the CV structure. The more types of syllables there are in the language inventory, the more consonants are added to the onset or coda of the syllables. This reduces the proportion of vocalic intervals to the overall duration of the utterance, and %V decreases. Prieto et al. (2012) demonstrated that prosodic edges and heads are marked by manipulating durational ratios in a language-specific way, and this may also account for small differences in rhythm measures between languages. The analysis of the surface durational variability of V and C intervals indeed allows spreading the languages on a continuous scale with respect to their rhythmic properties and saying that one language is more or less stress-timed than another, or that two languages are rhythmically similar. However, it has not yet provided unambiguous support for the Rhythm Class Hypothesis, which suggests that languages are split into distinct rhythm categories.

Importance of Rhythmic Patterns for Speech Processing

People are sensitive to the timing patterns which are captured by rhythm metrics. Mehler et al. (1996) hypothesized that pre-linguistic infants perceive incoming continuous speech as a succession of vocalic and consonantal segments: vocalic segments are processed as informative harmonic signals of variable duration and intensity, which alternate with unanalyzed noise (consonantal intervals). This


Time-Intensity-Grid-Representation of incoming continuous speech is based on innate perception mechanisms that help them to construct the first representation of their language. Ramus et al. (1999) observed that languages with similar rhythmic properties tend to share more typological characteristics of grammatical and phonological structure. This led to the hypothesis that, alongside constructing the first representation of their native language, babies use rhythmic patterns to bootstrap on the syntactic properties of the language and on the lexicon (Christophe and Dupoux, 1996; Mazuka, 1996; Mehler et al., 1996, 2004; Nespor et al., 1996). Rhythmic patterns are also used to develop strategies for segmentation of continuous speech and consequent word extraction and learning (Christophe et al., 2003; Thiessen and Saffran, 2007). In light of these considerations, we could suggest that the ability to recognize the durational cues pertaining to the speech rhythm is of the utmost importance for language acquisition and speech processing (e.g., for development and implementation of language-specific segmentation strategies). Therefore, sensitivity to timing differences, which is already observed in infancy, also persists in adulthood (Ramus and Mehler, 1999; White et al., 2012). Adults also use rhythmic cues to recognize the foreign accent in L2 speech and to detect the linguistic origin of the speaker (Kolly and Dellwo, 2014), to evaluate the degree of accentedness in L2 speech (Polyanskaya et al., 2013), and to extract discrete linguistic units from continuous speech (Christophe et al., 2003).

Rhythm Changes in Second Language Acquisition

Papers focussed on acquisition of speech rhythm in L2 are rare. Most of these studies concentrate on comparing rhythm in L2 speech with the target represented by an adult native speaker. The examined L2 speech is usually produced by rather advanced learners. The results showed that the rhythm scores in L2 speech are intermediate between those in the native and the target language of the learners (White and Mattys, 2007). This is usually interpreted as the influence of the native language of the learner on his speech production in the L2. Low et al. (2000) showed that nPVI-V in L2 Singaporean English is influenced by the L1 Chinese language. Rhythm in L2 English was shown to be affected by L1 Chinese, French, Spanish, Romanian and Italian (White and Mattys, 2007; Gut, 2009; Mok, 2013, etc.).

The studies with the emphasis on development of rhythmic patterns in the course of L2 acquisition are even rarer. One of the few exceptions is the study by Ordin and Polyanskaya (2014), who compared how speech rhythm develops in L1 and in L2 acquisition. They found that speech rhythm develops from more syllable-timed toward more stress-timed patterns both in child L1 and in adult L2 speech. The authors showed that both vocalic and consonantal variability in duration in L2 English increases as a function of the length of residence in the UK in adult speech when the target (English) and the native (Italian or Punjabi) languages of the learners are rhythmically contrastive.

Ordin et al. (2011) showed that durational variability in speech of L2 learners also increases with proficiency growth when the target (English) and the native (German) languages of the learners exhibit similar rhythmic properties. English and German share phonological parameters that are known to affect the rhythm metrics. Both of these languages are classified as stress-timed in terms of phonetic timing patterns captured by metric scores (Grabe and Low, 2002) and exhibit the phonological characteristics typical of stress-timed languages (Dauer, 1987; Schiering, 2007). Therefore German learners of English do not have to acquire phonological characteristics like production of complex syllables and complex consonantal clusters, opposition of long and short vowels, etc. Table 1 provides the metric scores in monolingual speech delivered by adult native speakers of either German or English, as reported in various studies. No unambiguous tendency is evident as to which of these languages has the higher durational variability. %V seems a bit lower in German, which can be explained by a slightly higher syllabic complexity and a higher number of C clusters in German than in English (Delattre, 1965 cited in Gut, 2009). Comparison of the metric scores for German and English with those reported for traditional syllable-timed languages (Ramus et al., 1999; Grabe and Low, 2002; White and Mattys, 2007) shows that both German and English exhibit higher durational variability and lower %V.

Research Question

Previous studies have shown that rhythmic patterns change as language acquisition progresses even when the native and the target languages of the learners are rhythmically similar (Ordin et al., 2011). In our study, we were interested in whether these developmental changes in speech rhythm are perceptually relevant. It is already known that listeners are sensitive to the rhythmic differences between rhythmically contrastive languages (Ramus and Mehler, 1999) as well as between German and English, i.e., between rhythmically similar languages (Vicenik and Sundara, 2013). Listeners are also able to distinguish rhythmic patterns of utterances from the same language (White et al., 2012; Arvaniti and Rodriquez, 2013). Therefore, we think that the fine distinctions between rhythmic patterns typical of L2 English of adult learners at different proficiency levels might be detected.

These important findings regarding sensitivity of listeners to rhythmic differences have been obtained using discrimination tests. We know that people can discriminate between utterances even with small differences in durational variability. However, certain functions attributed to speech rhythm are not based on discrimination, but rather on classification (segmentation, evaluation of accentedness, detection of linguistic origin of the speaker, etc.). Classification is different from discrimination. The listener may be able to perceive some acoustic differences when attending to them, but nevertheless ignore these differences when attributing an acoustic signal to a certain group, or when deciding whether an acoustic signal is representative of a certain class. In this particular study we focused not merely on whether the differences in L2 rhythm between utterances delivered by learners at different proficiency levels are detected. We were rather interested in whether listeners are able to reliably classify the utterances of L2 learners into distinct classes based on timing differences between utterances, and if so, which timing patterns listeners use to form the classes.


TABLE 1 | Metric scores for German and English as reported in various studies.

Language   Rhythm metrics   References

%V VarcoV nPVI-V rPVI-C ΔC VarcoC

German
41.7 52.5 68.7 65.0   Russo and Barry, 2008
42.8 71.7   Dellwo and Wagner, 2003
46.4 59.7 55.3 52.6   Grabe and Low, 2002
39.8 51.5 53.6 67.0 62.0 54.0   Arvaniti (2012), overall score
36 44 55 73 62 51   Arvaniti (2012), scores obtained on read sentences that were deliberately designed to enhance durational variability
41 52 56 60 54 50   Arvaniti (2012), scores obtained on read sentences that were deliberately designed to inhibit durational variability
41 52 53 56 55 50   Arvaniti (2012), scores obtained on sentences uncontrolled for phonotactics
42 55 52 72 55 50   Arvaniti (2012), scores obtained on spontaneous speech

English
38.0 64.0 73.0 70.0 59.0   White and Mattys, 2007
42.0 55.7   Dellwo and Wagner, 2003
41.1 57.2 64.1 56.7   Grabe and Low, 2002
40.1 53.5   Ramus et al., 1999
45.7 54.8 59.9 68.9 60.0   Arvaniti (2012), overall score
41 48 55 83 68 57   Arvaniti (2012), scores obtained on read sentences that were deliberately designed to enhance durational variability
50 46 51 57 49 53   Arvaniti (2012), scores obtained on read sentences that were deliberately designed to inhibit durational variability
44 50 56 61 55 55   Arvaniti (2012), scores obtained on sentences uncontrolled for phonotactics
48 66 66 77 68 59   Arvaniti (2012), scores obtained on spontaneous speech
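
The global interval-based metrics listed in Table 1 (%V, ΔC or ΔV, and the Varco measures) can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the durations are invented, the function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether a population or sample estimate was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def percent_v(vocalic: Sequence[float], consonantal: Sequence[float]) -> float:
    """%V: share of the utterance duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100 * sum(vocalic) / total


def delta(durations: Sequence[float]) -> float:
    """Delta (ΔV or ΔC): standard deviation of interval durations.

    Assumption: population SD; statistics.stdev would give the sample SD.
    """
    return pstdev(durations)


def varco(durations: Sequence[float]) -> float:
    """Varco: standard deviation divided by the mean duration, times 100."""
    return 100 * pstdev(durations) / mean(durations)


# Hypothetical interval durations (ms) from one utterance.
vocalic = [80.0, 120.0, 100.0]
consonantal = [150.0, 50.0, 100.0]
# The same utterance at double tempo: all intervals halved.
vocalic_fast = [d / 2 for d in vocalic]
consonantal_fast = [d / 2 for d in consonantal]

print(percent_v(vocalic, consonantal), percent_v(vocalic_fast, consonantal_fast))
print(delta(consonantal), delta(consonantal_fast))  # halves with tempo
print(varco(consonantal), varco(consonantal_fast))  # unchanged by tempo
```

Under a uniform change of tempo, ΔC shrinks in proportion to the durations, while VarcoC and %V are unchanged, which matches the distinction between raw and tempo-normalized metrics drawn in the text.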

Speech Material

Participants

Piske et al. (2001) analyzed a range of factors that influence pronunciation of L2 learners. These factors, among others, included the age and the length of exposure to the L2, amount of L2 use, language learning aptitude and motivation, and learning mode. In our study we controlled these factors by collecting the relevant information in a detailed language-background questionnaire (see Appendix 1 in online Supplementary Materials). Based on the questionnaire, we selected only those speakers who formed a homogeneous group and varied only in the degree of L2 mastery. The relevant information gleaned from the questionnaire was further verified in an informal interview during the recording sessions.

We recorded 51 German learners of L2 English (17–35 years old, M = 21; 27 females). We selected for participation only those people who grew up in or near the city of Bielefeld in North Rhine-Westphalia. The variety of German spoken in that region closely resembles what is understood as a Northern standard variety of German (Hochdeutsch). The selected participants did not exhibit features of regional varieties of German. All the participants were monolingual native speakers of German without speech or hearing disorders.

We also recorded 10 native speakers of English (southern British variety, 25–40 years, M = 30, 6 females) to compare the metric scores of the L2 learners of English with those of the L1 English speakers. The English speakers were residents in Germany at the time of the recordings. However, they reported having little to no command of German, lived in a closed English-speaking community at the UK military bases in North Rhine-Westphalia, worked in an English-only environment, had English as their home and neighborhood language, came from monolingual English-speaking families and were raised in a monolingual environment.

Elicitation Procedure

The selected learners of English first underwent a pronunciation test so that we could assess the learners' mastery of pronunciation. The test was devised by the authors and consisted of two parts: perception and production. The perception part was compiled from Vaughan-Rees (2002) and included phoneme recognition, emotion recognition, and intention recognition tasks. The production part included sentence reading. The sentences for production were composed to evaluate segmental realizations and prosodic control of the participants in the second language. The test ran for approximately 20 min. The test and the details on the controlled pronunciation features and assessment criteria can be found in Appendix 2 in online Supplementary Materials.

Further on, a 5-min phonetic aptitude test (PAT) was administered. The authors devised this test based on the oral mimicry tests described by Pike (1959), Suter (1976), and Thompson (1991). The test aims to predict the general phonetic ability by asking the participant to imitate novel sounds that do not exist in their native or target language and to mimic novel prosodic phenomena (e.g., lexical tones, tonal contours with accents which are


not aligned according to the convention of the learner's target or native language, etc.). The test and the details on the assessment criteria can be found in Appendix 3 in online Supplementary Materials. The sounds to imitate were presented by the holder of the IPA certificate confirming his proficiency in production and perception of sounds existing in world languages. The performance of participants in the PAT did not correlate with their L2 proficiency (we had both high and low proficiency learners with both high and low phonetic aptitude). Neither did the performance of the L2 learners in the PAT correlate with their performance in the English pronunciation test or with any of the metrics calculated on their speech. This shows that the ability to imitate rhythmic patterns of the target language is not related to the general phonetic aptitude, and we can eliminate a potential alternative explanation that the differences in rhythmic patterns between learners at different proficiency levels pertain to the phonetic aptitude rather than to the overall proficiency.

At the next stage, an informal interview was conducted by the first author. General questions about preferences in reading and music, lifestyle, career choice, biography, and childhood were asked (Appendix 4 in online Supplementary Materials). The interviews were recorded and lasted approximately 12 min with each participant.

Following the interview, we ran a sentence elicitation task, similar to the one used by Bunta and Ingram (2007). Thirty three sentences were elicited from each speaker. We used 33 picture prompts for the elicitation procedure. The participants viewed picture slides in a PowerPoint presentation. Each slide was accompanied by a descriptive sentence. The participants were instructed to remember the sentences. The participants could move to the next image or go back to the previous slide at their own pace. When they had viewed all the slides, they were asked to look at the images again, without the accompanying text, and to recall and say the sentences that they had been asked to remember. In the very rare cases (<5%) when the speaker could not remember the sentence or retrieved a modified sentence from memory, verbal prompts were used to help the speaker to produce the correct sentence. For example, the participant said "The dog is running after the cat," and the expected sentence was "The dog is chasing the cat." The researcher responded to the participant: "Yes, it is. You could also say chasing, which means running after. Can you say what you see at this picture once again?" One verbal prompt was sufficient to elicit the expected sentence when there was a mismatch in the first trial. The recording ran continuously throughout the sentence elicitation procedure. The list of elicited sentences and the examples of picture prompts can be found in Appendix 5 in online Supplementary Materials.

The tests and recordings were made individually with every participant in a sound-treated booth of the audio-visual studio at Bielefeld University in Germany. The recordings were made in WAV PCM at 44 kHz, 16 bit, mono.

Assessment of Learners' Proficiency

Three experienced teachers of English as a foreign language listened to the recorded interviews and evaluated learners' fluency, grammatical accuracy, and vocabulary resources. They used a 10-point scale for each parameter. To estimate the consistency of ratings between the teachers, we used Cronbach alpha, which is 0.90 for vocabulary, 0.89 for fluency and 0.92 for grammatical accuracy. This shows high agreement between the raters and confirms the reliability of their assessments. We averaged three ratings across the parameters for each rater and each interview, and thus got three mean ratings per learner.

The teachers' assessments and the results of the pronunciation tests were used to place the learner into one of the following proficiency groups: beginners (12 speakers with ratings between 4 and 6), intermediate (9 speakers with mean ratings between 6 and 8), and advanced learners (22 speakers with ratings above 8).¹ We used the results of the pronunciation test to assess the pronunciation skills of the learners. Eight speakers were not attributed to any group, either because the teachers did not agree with each other in their assessments (2 speakers were excluded for this reason) or because of the discrepancy between the results of the pronunciation tests and the teachers' assessment of accuracy, fluency and vocabulary resources. Pronunciation skills do not always agree with the general assessment of the learner's reading, writing, listening and speaking skills, vocabulary size, grammar accuracy, etc. That is why we deemed it necessary to combine the tutors' assessment of fluency, accuracy and vocabulary on the one hand and the mastery of pronunciation on the other hand. When pronunciation lagged far behind the general L2 mastery or exceeded the expected level, the learner was not attributed to any of the proficiency groups.

¹ According to the evaluators' opinion, we did not have true beginners, and our participants better correspond to the lower-intermediate (B1.1 according to the Common European Framework for Languages), upper-intermediate (B2.1) and advanced (C1.1 and C1.2) levels. CEFL specifies skills the learner should achieve at each of the six levels: A1, A2, B1, B2, C1, C2. However, as each level is usually covered in language schools during two intensive courses, teachers split each level into two sublevels, e.g., C1.1 and C1.2. The official CEFL guidelines do not split the six levels into sublevels, and division into C1.1 and C1.2 is done, rather arbitrarily, by the teachers. We did not want to resort to commonly used placement tests to evaluate the learners' proficiency level because standard placement tests are designed to make an initial assessment to place the student into the course that fits his level. Placement tests are not designed to estimate the proficiency level for certification of the achieved proficiency level. Therefore, using several human evaluators (to avoid human bias) is the best methodological option, in our opinion.

Segmentation

Thirty three elicited sentences per speaker were annotated in Praat (Boersma and Weenink, 2010). Each sentence was divided into V and C intervals. The segmentation was carried out manually by the second author based on the criteria outlined in Peterson and Lehiste (1960) and Stevens (2002) for V and C intervals.

The burst of energy corresponding to the release of the closure was taken as the starting point of a consonantal interval with the initial voiceless plosive sound after a pause and at the beginning of a sentence. Either the stop release, or apparent beginning of a voice bar, or other cues indicating apparent vibration of the vocal folds (whichever came first) were considered as the beginning of a consonantal interval with the initial voiced plosive. The markers of the turbulent noise were taken as the beginning of fricative consonants. The beginning of the first formant was taken as the beginning of a sonorant consonant. Consonantal intervals in the


middle of a sentence were considered to start after the vowel finishes, and to stretch until the onset of the following vowel. The end of the consonantal interval in the final position was marked at the end of the acoustic energy. The consonantal intervals in the final positions were considered to start immediately after the vowel and finish at the end of the fricative noise (for obstruents) or at the end of the first formant (for sonorants). Conventional procedure based on the analysis of the waveform and the spectral characteristics of the speech signal was used to identify the boundaries of the vocalic intervals. The end of the vowel was identified by the abrupt change in the vowel formant structure or by termination of the formants, and by the significant drop in the waveform amplitude. The onset of the vowel was marked at the beginning of the voicing identified as the start of the regular vertical stripes on the spectrogram in the region of the second and higher formants. The marker indicating the vowel onset or offset was placed at the point closest to the zero crossing on the waveform.

In difficult cases where it was necessary to place the boundary between the consonantal interval represented by a sonorant consonant with a clear formant structure and a vowel, the decision was based on the amplitude of the first formant. Such difficult cases were associated with locating the boundaries of, or categorizing, allophones of /l/ (e.g., in the words girl, ball, table). We based our segmentation on purely phonetic criteria, therefore /l/ was sometimes marked as a vowel (in case of a vocalized [l]), and sometimes as a consonant. The decision was based on (1) auditory analysis by an experienced phonetician, and (2) amplitude of the first formant. If the amplitude did not drop after the preceding vowel and the segment was perceived by a phonetician as a vocalized [l], then the segment was segmented as a vocalic interval. We did not want to pre-define certain types of segments either as consonantal or vocalic. We adopted a phonetic approach to speech rhythm. Within the adopted framework, speech rhythm is represented by the surface timing patterns, which are purely phonetic, and phonetic properties are not discrete and cannot be pre-assigned to a certain phonological category a priori.

Pauses and hesitations were not included into V or C intervals and were discarded. If the same type of the interval was annotated prior and following the pause, we treated them as two separate intervals because they are likely to be perceived as such. Final syllables were included into analysis.

Calculating the Rhythm Metrics

The sentences elicited using the picture prompts were used to calculate the rhythm metrics. The sentence elicitation procedure helped us to avoid the reading mode and made speech material more similar to natural spontaneous speech. Besides, we obtained lexically identical sentences from every participant, which is necessary to analyze the development of speech rhythm per se, not affected by the differences between the sentences in phonotactics, number of syllables in polysyllabic words, syntactic structures and phrasing.

Traditional rhythm metrics were calculated on each sentence. The overview of the selected metrics is given in Table 2. We also calculated the mean duration of V and C intervals for each sentence to account for possible developmental changes in speech tempo in the course of L2 acquisition, and for the interaction of speech rhythm and speech tempo.

Although some rhythm metrics were claimed to be better than others at quantifying rhythm, there is no consensus on which metrics have more discriminative power. White and Mattys (2007), for example, advocated for pairwise metrics, while Ramus et al. (1999) favored ΔC and %V. Loukina et al. (2011) performed the analysis of 15 rhythm metrics and in experiments separating pairs of languages by rhythmic properties showed that a rhythm measure that is successful at separating one pair often performs poorly at separating another pair. Considering the lack of consensus on the optimal set of metrics, we decided not to limit our investigation to the metrics which were found more useful in certain studies. Instead we tested all the metrics in order to see which ones better capture the differences in rhythm between sentences produced by L2 learners at different proficiency levels.

A series of by-sentence ANOVA tests (Table 3) with the values of the metrics as the dependent variables and proficiency level as the factor shows that non-normalized rhythm metrics (ΔV, ΔC, rPVI-v, rPVI-c) and %V do not differ between the proficiency levels. As the raw metrics do not differ between the proficiency levels, we are not including them into further statistical tests.

TABLE 2 | Metrics used in this study.

Metric    Description
%V        Percentage of vocalic intervals
ΔV        Standard deviation of vocalic intervals duration
ΔC        Standard deviation of consonantal intervals duration
nPVI-V    Average of the mean-normalized differences in duration between successive vocalic intervals
nPVI-C    Average of the mean-normalized differences in duration between successive consonantal intervals
rPVI-V    Averaged difference in duration of successive vocalic intervals
rPVI-C    Averaged difference in duration of successive consonantal intervals
VarcoV    Coefficient of variation of vocalic intervals, i.e., standard deviation divided by the mean
VarcoC    Coefficient of variation of consonantal intervals, i.e., standard deviation divided by the mean
MeanV     Mean duration of vocalic intervals
MeanC     Mean duration of consonantal intervals

TABLE 3 | Non-significant ANOVA tests for the rhythm metrics between proficiency groups.

Metric    Significance of Levene's test    Significance of Welch test (if Levene's test is significant) or of the F statistic of the analysis of variance
rPVI-v    0.513                            0.4
rPVI-c    0.007                            0.267
ΔV        0.748                            0.692
ΔC        < 0.0005                         0.154
%V        0.691                            0.068
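The metrics listed in Table 2 can be illustrated with a short sketch. This is not the authors' implementation: it assumes the formulations commonly used in the rhythm literature (%V as the vocalic share of total duration, Δ as a standard deviation, Varco as 100 × SD / mean, rPVI as the mean absolute difference between successive intervals, and nPVI as 100 × the mean of those differences, each divided by the mean of its pair). The function names, the choice of the population standard deviation, and the example durations are illustrative assumptions.

```python
from statistics import mean, pstdev


def rpvi(durations):
    """Raw PVI: mean absolute difference between successive intervals."""
    return mean(abs(a - b) for a, b in zip(durations, durations[1:]))


def npvi(durations):
    """Normalized PVI: each difference is divided by the mean of its pair."""
    return 100 * mean(
        abs(a - b) / ((a + b) / 2) for a, b in zip(durations, durations[1:])
    )


def rhythm_metrics(v, c):
    """Compute Table 2 style metrics from vocalic (v) and consonantal (c) durations."""
    return {
        "%V": 100 * sum(v) / (sum(v) + sum(c)),
        "deltaV": pstdev(v),  # assumption: population SD; some studies use sample SD
        "deltaC": pstdev(c),
        "VarcoV": 100 * pstdev(v) / mean(v),
        "VarcoC": 100 * pstdev(c) / mean(c),
        "rPVI-V": rpvi(v),
        "rPVI-C": rpvi(c),
        "nPVI-V": npvi(v),
        "nPVI-C": npvi(c),
        "meanV": mean(v),
        "meanC": mean(c),
    }


# Hypothetical interval durations (ms) for one sentence.
vocalic = [80, 120, 60, 140]
consonantal = [90, 110, 70, 130]
slow = rhythm_metrics(vocalic, consonantal)

# The same sentence spoken twice as fast: every interval is halved.
fast = rhythm_metrics([d / 2 for d in vocalic], [d / 2 for d in consonantal])

for name in slow:
    print(f"{name:7s} slow={slow[name]:8.2f} fast={fast[name]:8.2f}")
```

In this toy example, halving every duration halves the raw metrics (ΔV, ΔC, rPVI) and the mean durations but leaves %V, Varco, and nPVI unchanged, which is the sense in which the latter are normalized for speech rate.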


ANOVAs on the rate-normalized rhythm measures revealed the significant difference between proficiency levels at p < 0.0005 for each metric. These metrics were included into multivariate model. The MANOVA test with nPVI metrics, Varco metrics and mean durations of V and C intervals as the dependent variables and proficiency level as the factor revealed a significant effect of proficiency level on the rhythm measures, Λ = 0.856, F(12, 2822) = 19.06, p < 0.0005, η² = 0.075. Figures 1–3 show that the metric scores increase as L2 acquisition progresses, which indicates that German learners of English deliver L2 speech at a higher rate and with higher degree of stress-timing as their L2 mastery grows. The differences between the proficiency levels pairwise for each metric are mostly significant (significance values are given in Table 4).

FIGURE 1 | VarcoV and VarcoC in the sentences produced by native English speakers and by German learners of English at beginning, intermediate and advanced proficiency levels. Error bar shows 95% confidence interval.

FIGURE 2 | nPVI-V and nPVI-C in the sentences produced by native English speakers and by German learners of English at beginning, intermediate and advanced proficiency levels. Error bar shows 95% confidence interval.

FIGURE 3 | meanV and meanC in the sentences produced by native English speakers and by German learners of English at beginning, intermediate and advanced proficiency levels. Error bar shows 95% confidence interval.

The MANOVA was followed up with the discriminant analysis. We used only those metrics that were found to differ significantly between proficiency levels in our previous tests. The analysis revealed two discriminant functions. The first function explained 96.9% of variance, canonical R² = 0.14, and the second explained only 3.1% of variance, R² = 0.005. In combination these functions significantly differentiated the proficiency levels, Λ = 0.856, χ²(12) = 220.318, p < 0.005. The second function alone did not significantly differentiate between the proficiency levels, Λ = 0.995, χ²(5) = 7.232, p = 0.204. This can also be seen on the discriminant function plot (Figure 4). Classification results (Table 5) show that the model classifies correctly 57% of cases (chance is 33%).

The correlations between the outcomes and discriminant functions revealed that the measures of local (pairwise) variability and of speech rate loaded on the first function, and global measures of variability loaded more highly on the second function (see Table 6). As the first function explains substantially more variance than the second function, we can conclude that the pairwise durational variability and speech rate discriminate between the proficiency levels much better than utterance-wise variability.

We also wanted to see how close the advanced German learners of English are to their target in regard to acquisition of rhythmic patterns. For this, we compared the metric scores calculated on the sentences produced by the advanced learners of English with those calculated on the sentences spoken by native English speakers. T-tests (Table 7) reveal that the metric scores do not


differ between sentences spoken by advanced German learners and native speakers of English, with the exception of meanC (overall shorter C intervals in the utterances of L2 speakers) and rPVI-C (raw pairwise variability of consonantal intervals is higher in speech of learners of English). The difference in ΔC is on the verge of significance (p = 0.069), and the scores are again higher in sentences produced by L2 learners. Significant and marginally significant difference in consonantal variability is easily accounted for by the differences in articulation rate of C intervals: longer C intervals in L2 speech result in larger standard deviations and pairwise durational differences. What is important is that pairwise durational variability of consonantal intervals per se, i.e., when the differences in speech rate are normalized, is also significantly higher in speech of advanced L2 learners. Advanced learners overshoot with increasing durational variability of C intervals, although they successfully acquire variability of V intervals. This can be explained by less assimilation of consonants in clusters within syllables in L2 speech (i.e., tendency to clearly produce all the consonants in the clusters) and by incomplete mastery of fine modifications in prosodic timing for the purposes of marking edges and heads of prosodic constituents (see Prieto et al., 2012).

TABLE 4 | Significance for comparisons of rhythm metrics between proficiency levels pairwise (with Hochberg's correction).

Comparison               VarcoV     VarcoC     nPVI-V     nPVI-C     meanV    meanC
Beginner–Intermediate    <0.0005    0.855      <0.0005    0.17       0.004    0.315
Intermediate–Advanced    0.328      <0.0005    0.096      <0.0005    0.004    0.011

TABLE 5 | Classification Results based on the Discriminant Analysis.

                                      Predicted Group Membership (in %)
Original Group Membership (in %)      Beginners    Intermediate    Advanced
Beginners                             38.4         0               55.1
Intermediate                          26.6         0.3             73.1
Advanced                              9.8          0               90.2

TABLE 6 | Structure matrix of the discriminant function coefficients.

Metric    Function 1    Function 2
meanV     −0.5**        0.07
nPVI-C    −0.472**      −0.404
nPVI-V    −0.469**      −0.444
meanC     −0.347**      −0.224
VarcoC    0.436         0.79**
VarcoV    0.409         −0.552**

The stars indicate larger correlation between each variable and one of the discriminant functions.

FIGURE 4 | Discriminant function plot.

The analysis shows that speech rate and the degree of stress-timing increase as a function of proficiency growth. This tendency, however, can only be captured by normalized rhythm metrics. Raw metrics do not differ between the proficiency levels. The values of the raw metrics are influenced by the speech tempo, i.e., by the mean durations of speech intervals: The faster one talks, the shorter the V and C intervals become; the shorter speech intervals result in smaller durational differences in pairs of consecutive intervals and in smaller standard deviation in duration of speech intervals. As the mean durations of speech intervals significantly differ between the proficiency levels, we should also expect significant differences in the values of the raw metrics. However, this was not confirmed. We believe that the values of the raw metrics are influenced by two conflicting forces: The tendency to deliver speech at a faster rate and with higher durational variability at high proficiency levels. This conflict prevents the emergence of significant differences in the values of raw metrics between proficiency levels. Normalization, i.e., removing the influence of speech tempo, allows us to notice the trend to enhance durational variability in L2 speech with proficiency. The lack of significant differences in %V between the proficiency levels presents an interesting case. In earlier studies %V has been reported to be robust to fluctuations in speech tempo (Wiget et al., 2010) and to discriminate between L1 and L2 speech (White and Mattys, 2007). However, in our experiment %V was not informative. %V is the proportion of the vocalic material in a sentence, and that is determined by phonotactic differences. The proportion of vocalic material will be lower in the languages that allow complex consonantal clusters and reduction of vowels in unstressed positions (e.g., German, Russian, English). These languages are traditionally classified as stress-timed (Dauer, 1983). The languages on the opposite end of the spectrum impose strong phonotactic constraints, prefer simple CV syllables and feature


less vowel shortening (e.g., French, Japanese). These factors increase the proportion of vocalic material in speech. Therefore, we assume that %V is a powerful predictor to discriminate between rhythmically contrastive languages. %V can also reflect the differences in lexical material, i.e., whether the utterances per se differ in phonotactic characteristics. In our study, we used the same set of sentences elicited from different speakers, thus the lexical differences that could potentially influence %V were eliminated. The target and the native languages of the L2 learners were similar in terms of phonotactic and phonological properties, and the learners did not have problems with producing the clusters of consonants in English sentences. %V captures phonotactic and phonological differences, but the sentences spoken by learners at different proficiency levels in our study manifested only phonetic differences in timing patterns; phonotactics and phonological characteristics were the same. Therefore, it is not surprising that %V was not found to differ between sentences produced by L2 learners at different proficiency levels.

TABLE 7 | t-tests comparing metric scores in English speech of native English speakers and advanced L2 learners of English.

Metric    t(1054)    p
meanV     1.216      0.224
meanC     3.371      0.001
%V        −1.314     0.189
ΔV        −0.064     0.949
ΔC        1.818      0.069
VarcoV    −1.433     0.152
VarcoC    0.131      0.896
rPVI-V    1.014      0.311
rPVI-C    4.506      < 0.0005
nPVI-V    1.412      0.158
nPVI-C    4.506      < 0.0005

The discriminant analysis also reveals that the advanced learners are more consistent in realization of timing patterns compared to lower-proficient learners. Inspection of the discriminant function plot (Figure 4) reveals that the variate scores for the advanced learners are more compact, while the variate scores for the beginners are spread more evenly along the first discriminant function. The discriminant function plot also showed that the variate scores for different groups of acquirers overlap (see overlapping circles on Figure 4). This means that beginners produced sentences sometimes with high degree of durational variability, and sometimes with lower degree of durational variability. Advanced learners constantly produced the sentences with high degree of durational variability. In other words, the productions of beginners varied greatly between stress-timed and syllable-timed rhythm patterns, but productions of advanced learners were more consistently stress-timed.

We can draw the same conclusion if we look at Table 5. Rhythm and tempo measures correctly predict the speaker's proficiency level for 57% of sentences. The overall accuracy is significantly above chance (33%), but the accuracy for the sentences produced by speakers on different proficiency levels varies substantially. Sentences produced by advanced speakers were classified correctly in 90.2% of cases, while sentences produced by beginners were classified correctly only in 38.4% of cases. This means that about half of the sentences spoken by beginners exhibit the higher degree of variability that is typical of stress-timed rhythm in 90.2% of the sentences spoken by advanced learners. On the other hand, only 9.8% of sentences spoken by advanced learners exhibit lower durational variability overlapping with 38.4% of sentences from beginners. The analysis of the discriminant function plot and the classification accuracy indicates that the timing patterns become more stable and consistent as a result of the acquisition progress.

To conclude, the analysis confirms significant differences in rhythmic patterns between proficiency levels in L2. Rhythm measures are more consistently stress-timed at higher proficiency levels. Raw metrics are influenced by conflicting tendencies to deliver speech at a faster rate and with higher durational variability at higher proficiency levels, and thus do not increase with proficiency. The developmental tendency to increase the degree of stress-timing in L2 speech has been observed even when both the native and the target languages of the learner are rhythmically similar. The main research question of our study was to investigate the perceptual relevance of the rhythmic differences between proficiency levels. Based on the literature review, we assumed that listeners are sufficiently sensitive to the durational variability of C and V intervals to discriminate timing patterns of L2 utterances delivered by learners at different proficiency levels. We wanted to find out whether the detected differences in timing patterns between proficiency levels are used to classify utterances into discrete categories. To address this question, we set up the perception experiment.

Experiment

Methods
Participants
We recruited 25 native English speakers to act as listeners in the perception study (age range 21–24 years, M = 22; 13 females). Care was taken to form a socially homogeneous group of listeners with the same language background. All participants were students of Ulster University, monolingual English speakers (see our criteria for monolinguality in the description of the participants for Experiment 1). All listeners grew up in or around Belfast and were speaking the same regional variety of English (verified by a native speaker of English, phonetician and Belfast resident). We ensured that the participants did not differ in age, educational level, social status, language background, experience with foreign languages, and all had equal exposure to educated standard British English.

Stimuli
We selected sentences elicited from seven speakers per proficiency group in the first experiment to prepare the stimuli. The selected speakers from the advanced group had the highest mean ratings given by the evaluators (see description of the first experiment, Section Procedure). The selected speakers from the beginners had the lowest mean ratings from the evaluators. We also randomly selected seven speakers from the group of intermediate learners.


Eighteen out of thirty-three elicited sentences per speaker were selected for stimuli preparation. Six sentences had three stressed syllables (e.g., the ‘dog is ‘eating the ‘bone), six sentences included two stressed syllables (e.g., the ‘book is on the ‘table) and six sentences had only one stressed syllable (e.g., it’s ‘raining outside). The selected sentences produced by the selected speakers were listened to in order to make sure that the sentences were indeed pronounced with the expected number of stressed syllables. The selected sentences are marked with asterisk in Appendix 5 in online Supplementary Materials. We selected 378 sentences in total for the perception experiment (21 speakers × 18 sentences).

We used the speech resynthesis technique (Ramus and Mehler, 1999) to prepare the stimuli. We replaced all consonantal intervals in the selected sentences with “s” and all vocalic intervals with “a” and resynthesized the sentences with constant fundamental frequency in MBROLA. The durations of “s” and “a” intervals were equal to the duration of C and V intervals in the original sentences. This technique degraded segmental and most of the prosodic information from the sentences. The only preserved differences between the identical sentences spoken by learners at different proficiency levels were the differences in durational ratios of C and V intervals. Regardless of the recent criticism of this technique (Arvaniti and Rodriquez, 2013), its usefulness has been demonstrated in a number of studies (Ramus et al., 1999; Ramus and Mehler, 1999; Vicenik and Sundara, 2013; Kolly and Dellwo, 2014, etc.), and we found this delexicalization method to be optimal for the purposes of our study.

Procedure
The experiment was carried out with each participant individually in the phonetic laboratory of Ulster University. The stimuli were presented to the listeners in two sessions: Training and testing. The listeners were not informed that the stimuli were derived from L2 English speech because we did not want the listeners to use linguistic expectations regarding what the stimuli in L2 English might sound like. This might have created a bias that would be difficult to control. Instead, the listeners were told that the stimuli were derived from three rare exotic African languages. We coined these languages Burabah (sentences of the advanced L2 learners converted into “sasasa” stimuli), Losto (stimuli based on durations in sentences of intermediate learners of English), and Mahutu (resynthesized sentences of beginners).

We chose 108 stimuli for the training session (18 stimuli per speaker, 2 speakers per proficiency group). Before the session, each listener was exposed to nine stimuli, randomly selected from those used later in the training session, 3 stimuli per proficiency group, i.e., per “exotic language.” The listener had 1 min to listen to these stimuli by clicking with a mouse on nine buttons on the computer screen. Each button had a caption with the “language” name. After 1-min familiarization, the stimuli were presented to the listener one by one. The listener had to identify from which language (Mahutu, Losto, or Burabah) each stimulus originated. The listener was expected to click one of the three buttons on the computer screen with a mouse pointer. Each button had a caption with the “language” name. On response, the listener was provided with the feedback which “language” it really was, and the next stimulus was played. When all 108 stimuli were presented, the participant had a 2-min break before the stimuli were played again. The training procedure was repeated three times. Supposedly, during the training session the participants formed new perception categories for further discrimination between the stimuli from different groups. Then the testing session began.

For the testing session, we prepared 270 stimuli (different from those used in the training session, 5 speakers per proficiency group, 18 sentences per speaker). The procedure was the same as in the training session, but the listeners received no feedback, and all the stimuli were played only once.

The duration of the experiment varied between participants and usually exceeded 90 min. The participants could take a short break and have a rest pause during the training session and between the training and the testing session, but not during the testing session. During the experiment the participants were offered hot and cold drinks and sweet snacks to help them cope with possible fatigue. The participants could have their drinks and snacks during the rest pauses as well as during the training session. The order of stimuli presentation was randomized using the internal Praat algorithm in an attempt to counterbalance for possible fatigue effect.

Results

We calculated rhythm metrics on the stimuli that were classified by the majority of listeners as Burabah, Losto, and Mahutu. The metrics were calculated on V and C intervals. We performed the discriminant analysis to test whether rhythm metrics statistically discriminate between the stimuli classified into three groups. The analysis revealed two discriminant functions. The first function explained 94.8% of variance, R² = 0.52, and the second function explained 5.2% of variance, R² = 0.05. These functions in combination significantly differentiate between the groups, Λ = 0.457, χ²(20) = 149.9, p < 0.0005. The second function alone is not significant, Λ = 0.945, χ²(9) = 11.101, p = 0.282. The overall accuracy of the model is 69% (chance is 33.3%); accuracy is 91% for Burabah, 52% for Losto, and 58.5% for Mahutu (chance level is 33.3% for each category). See Table 8 for the details on the classification accuracy.

TABLE 8 | Classification Results (prior probabilities: all groups equal).

            Predicted Group membership
Original    Mahutu (%)    Losto (%)    Burabah (%)
Mahutu      58.5          26.4         15.1
Losto       28.4          52.2         19.4
Burabah     1.3           7.6          91.1

The structure matrix (Table 9) reveals that the first function is loaded with the raw metrics and mean durations of V and C intervals, while the second function is loaded with the normalized metrics. This means that the normalized metrics cannot discriminate between the groups, but mean durations and raw metrics discriminate between the stimuli identified as Burabah, Losto, and Mahutu with probability significantly above chance.
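As a rough consistency check on the discriminant analyses reported in this paper, Wilks' lambda for a set of discriminant functions equals the product of (1 − R²) over their squared canonical correlations. The sketch below applies this standard identity to the rounded R² values quoted in the text; the function name is illustrative, and small discrepancies from the published Λ values are expected because the inputs are rounded.

```python
from math import prod


def wilks_lambda(canonical_r2):
    """Wilks' lambda as the product of (1 - R^2) over the given functions."""
    return prod(1.0 - r2 for r2 in canonical_r2)


# Perception experiment: R^2 = 0.52 and 0.05 (reported lambda 0.457 and 0.945).
perception_both = wilks_lambda([0.52, 0.05])
perception_second = wilks_lambda([0.05])

# Production data: R^2 = 0.14 and 0.005 (reported lambda 0.856 and 0.995).
production_both = wilks_lambda([0.14, 0.005])
production_second = wilks_lambda([0.005])

print(round(perception_both, 3), round(perception_second, 3))
print(round(production_both, 3), round(production_second, 3))
```

The recomputed values agree with the reported ones to within rounding of the R² inputs, which suggests that the extracted statistics are internally consistent.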

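The "sasasa" resynthesis described under Stimuli can be sketched as the generation of an MBROLA input (.pho) file, in which each line gives a phoneme, its duration in milliseconds, and optional (position %, pitch Hz) targets. This is only an illustration under stated assumptions: the function name and the 200 Hz fundamental frequency are hypothetical (the text says only that F0 was constant), and the phoneme symbols "s" and "a" must exist in whichever MBROLA voice is used.

```python
def to_pho(intervals, f0_hz=200):
    """Build MBROLA .pho text from (kind, duration_ms) pairs, kind 'C' or 'V'.

    f0_hz is a placeholder value; the source does not state the F0 used.
    """
    lines = ["; delexicalized stimulus with flat F0"]
    for kind, dur_ms in intervals:
        dur = round(dur_ms)
        if kind == "V":
            # Pitch targets at 0% and 100% of the vowel keep F0 constant.
            lines.append(f"a {dur} 0 {f0_hz} 100 {f0_hz}")
        elif kind == "C":
            # Voiceless "s" carries no pitch targets.
            lines.append(f"s {dur}")
        else:
            raise ValueError(f"unknown interval type: {kind!r}")
    return "\n".join(lines) + "\n"


# Hypothetical C and V durations (ms) measured in one sentence.
example = [("C", 120), ("V", 95), ("C", 80), ("V", 140)]
pho_text = to_pho(example)
print(pho_text)
```

Because only the interval type and duration survive this conversion, stimuli built this way differ solely in the timing of C and V intervals, as the Stimuli section describes.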

TABLE 9 | Structure matrix of the discriminant function coefficients.

Metrics    Function I    Function II
meanV      0.719*        0.404
meanC      0.607*        −0.181
rPVI-v     0.458*        0.351
rPVI-c     0.371*        −0.097
ΔC         0.352*        −0.064
ΔV         0.421         0.559
VarcoV     −0.002        0.341*
nPVI-c     0.102         0.208*
nPVI-v     0.002         0.203*
VarcoC     0.035         0.127*

The stars indicate larger correlation between each variable and one of the discriminant functions.

FIGURE 5 | rPVI-V and rPVI-C for the stimuli identified as Mahutu, Losto, or Burabah. Error bar shows 95% confidence interval.

FIGURE 6 | ΔV and ΔC for the stimuli identified as Mahutu, Losto, or Burabah. Error bar shows 95% confidence interval.

FIGURE 7 | meanV and meanC for the stimuli identified as Mahutu, Losto, or Burabah. Error bar shows 95% confidence interval.

Figures 5–7 show the differences in the rhythm metrics that significantly differ between stimuli classified into three groups. Only mean durations and non-normalized metrics (rPVI and the standard deviation) differ significantly between the stimuli identified as Burabah, Mahutu, and Losto and statistically discriminate between the groups. Rhythm metrics normalized for tempo and %V do not differ between the stimuli classified into three different groups, and do not discriminate between stimuli attributed to different classes.

Figures 5–7 show that the stimuli identified as Burabah exhibit shorter mean durations and smaller standard deviations in duration of speech intervals and smaller durational differences in pairs of consecutive intervals. As the raw metrics are influenced by the speech rate, we cannot say what the participants were listening for to make their judgments: speech tempo, durational variability, or both. As meanV and meanC display the highest correlations in the structure matrix (Table 9), we could conclude that tempo is probably more important than durational variability for the listeners performing the classification task. However, we still do not know the relative contribution of the rhythm compared to tempo in classification. To address this issue, we calculated the frequency for Burabah, Losto, and Mahutu response for each stimulus, i.e., how many listeners out of 25 identified each stimulus as Burabah (Frequency_Burabah), Losto (Frequency_Losto), or Mahutu (Frequency_Mahutu). After that we performed stepwise multiple regression to assess the ability of meanV, meanC, rPVI-C, rPVI-V, ΔV, and ΔC to predict Frequency_Burabah. Stepwise regression was chosen because we wanted to evaluate whether both the mean durations and the variability measures were necessary to predict Frequency_Burabah. The constructed model included only two steps. At the first step, meanV was entered into equation as the most powerful predictor. At the second step, meanC was


added, and the model was significantly improved. Adding raw rhythm metrics as predictors did not improve the model further. Table 10 summarizes the main details of the regression model.

TABLE 10 | Coefficients and parameters of the regression model with Frequency_Burabah as the dependent variable.

Step    Metrics    β         t          B          p          R²       R² change    Significance of R² change
1       meanV      −0.638    −13.574    −102.93    <0.0005    0.407    0.407        <0.0005
2       meanV      −0.489    −10.506    −78.832    <0.0005    0.519    0.111        <0.0005
        meanC      −0.365    −7.853     −64.49     <0.0005

The results show that the most important predictors are mean durations of V and C intervals, which are negatively correlated with the frequency of “Burabah” response. This means that the shorter the speech intervals (i.e., the faster the tempo), the more likely the listener will classify the stimulus as Burabah.

We also performed stepwise multiple regressions with Frequency_Mahutu and with Frequency_Losto as dependent variable (details of the regression models are in Tables 11, 12 respectively). The analyses show that the most influential predictors for both Frequency_Mahutu and Frequency_Losto are meanV and meanC. The predictors are positively correlated with the frequency of “Losto” and “Mahutu” responses, which means that the stimuli with longer C and V intervals (i.e., slower speech rate) are more likely to be identified as Losto or Mahutu.

TABLE 11 | Coefficients and parameters of the regression model with Frequency_Mahutu as the dependent variable.

Step    Metrics    β        t        B         p          R²       R² change    Significance of R² change
1       meanV      0.525    10.1     61.108    <0.0005    0.276    0.276        <0.0005
2       meanV      0.377    7.163    43.814    <0.0005    0.386    0.110        <0.0005
        meanC      0.363    6.914    46.28     <0.0005

Discussion

The results show that listeners classify the stimuli based on speech tempo and ignore the differences in the durational variability between the “sasasa” sequences. Figures 5–7 also show that there is no difference between the stimuli identified as Losto and Mahutu for ΔV, ΔC, rPVI-V, rPVI-C, meanC, and meanV measures. Faster stimuli with both low and high variation in duration of V and C intervals were classified as Burabah, and slower stimuli were almost randomly attributed to either Losto or Mahutu. We conclude that the listeners formed only two categories: one for faster stimuli that were classified as Burabah, and the other for slower stimuli that were randomly identified either as Mahutu or Losto.

This result agrees with psychoacoustic data in tempo perception. Quene (2007) and Thomas (2007) studied just-noticeable differences in tempo and found that 5–8% change in tempo (expressed as beats per minute for non-speech stimuli and syllables-per-minute for speech stimuli) is easily detected by the subjects. We analyzed the tempo differences between the stimuli: tempo equals 5.62 syl/s for the stimuli identified as Burabah, 4.41 syl/s for the stimuli identified as Losto, and 4.4 syl/s for the stimuli identified as Mahutu. ANOVA analysis showed that the difference in tempo between the groups is significant, F(2, 196) = 64.077, p < 0.0005. Pairwise comparisons (with the Bonferroni correction) reveal that the difference lies between “Losto” and “Burabah” stimuli, while the difference between “Losto” and “Mahutu” groups is not significant. Speech tempo in the stimuli identified as Burabah is 25.7% higher than in the stimuli identified as Losto. This increase is above the threshold for just noticeable tempo difference (Quene, 2007; Thomas, 2007). Speech tempo in the stimuli classified as Mahutu is 6.6% slower than in the stimuli identified as Losto, and this difference is below the just noticeable threshold.

Listeners’ sensitivity to speech tempo can be explained by a number of studies in physiology of hearing. Schreiner and Urbas (1986, 1988) showed that auditory neurons fire in response to a sharp increase in intensity that usually coincides with the vowel onset. Consequently, the rate at which “s” and “a” alternate in the stimuli determines the rate at which the neurons fire. Moreover, some studies suggest a direct relation between a syllable-length unit (“sa” unit in our stimuli) and the neural response in the auditory cortex (Viemeister, 1988; Greenberg, 1997; Wong and Schreiner, 2003; Greenberg and Ainsworth, 2004). Besides, the auditory system imposes certain limitations on the speech tempo. If the assumptions to the speech rate and to the length of the syllable-like units are violated, speech processing and decoding of speech at the cortical level is compromised (Ghitza and Greenberg, 2009; Ghitza, 2011). Therefore, there is a physiological basis for discriminating fast and slow stimuli, or stimuli with longer and shorter syllable-like units.

We are not aware of any evidence of direct physiological correlates for the ability to differentiate fine distinctions in durational variability. Thus, we assume that differentiation of fine distinctions in rhythmic patterns involves cognitive processing. Peculiarities of predominant rhythmic patterns in a certain language correlate with grammatical, morphological and other structural characteristics. Rhythmic patterns guide the way the language is acquired. They influence the strategies of
which were classified as Losto, Mahutu, and Burabah. Average segmentation of continuous speech. Rhythmic cues are exploited


TABLE 12 | Coefficients and parameters of the regression model with Frequency_Losto as the dependent variable.

Step  Metrics  β      t      B       p        R²     R² change  Significance of R² change
1     meanV    0.427  7.741  41.821  <0.0005  0.183  0.183      <0.0005
2     meanV    0.358  5.993  35.018  <0.0005  0.207  0.024      =0.005
      meanC    0.170  2.847  18.207  =0.005

FIGURE 8 | Splitting the “sasasa” stimuli into two categories based on durational variability of vocalic and consonantal intervals and speech rate.
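
The two dimensions named in the caption of Figure 8, durational variability and speech rate, can be made concrete with a small sketch. This is not the authors' analysis code: the interval durations are invented for illustration, the function names are hypothetical, and the formulas follow the common definitions in the rhythm-metrics literature (Ramus et al., 1999; Grabe and Low, 2002; Dellwo, 2006), with the population standard deviation assumed for the Δ measure.

```python
from statistics import mean, pstdev


def raw_pvi(d):
    """rPVI: mean absolute difference between successive interval durations."""
    return mean(abs(a - b) for a, b in zip(d, d[1:]))


def norm_pvi(d):
    """nPVI: each successive difference is divided by the mean of the pair."""
    return 100 * mean(abs(a - b) / ((a + b) / 2) for a, b in zip(d, d[1:]))


def delta(d):
    """Delta measure: standard deviation of interval durations (population SD assumed)."""
    return pstdev(d)


def varco(d):
    """Varco: standard deviation expressed as a percentage of the mean duration."""
    return 100 * pstdev(d) / mean(d)


# Hypothetical vocalic interval durations in ms (illustrative, not data from the study).
slow_v = [120.0, 80.0, 150.0, 90.0, 130.0, 70.0]
# The same pattern delivered 25% faster: every interval shortened by the same factor.
fast_v = [x / 1.25 for x in slow_v]

for label, d in (("slow", slow_v), ("fast", fast_v)):
    print(
        label,
        "meanV=%.1f" % mean(d),
        "deltaV=%.1f" % delta(d),
        "rPVI-V=%.1f" % raw_pvi(d),
        "VarcoV=%.1f" % varco(d),
        "nPVI-V=%.1f" % norm_pvi(d),
    )
```

In this sketch, shortening every interval by the same factor lowers meanV, the Δ measure and rPVI, while Varco and nPVI stay the same. Under these assumptions, a pure change in speech rate moves the raw metrics even when the relative durational variability is untouched.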

differently by listeners with different native languages for purposes of speech processing (Christophe et al., 2003; Murty et al., 2007; Thiessen and Saffran, 2007; Kim et al., 2008). However, their importance in processing non-linguistic stimuli, when cognitive mechanisms are less intensely employed, might be low. When the presented stimuli are not processed as speech-like, rhythmic differences between stimuli are not used for discrimination or classification (Ramus et al., 2000). Thus, in our experiment listeners rely more on those patterns in the acoustic signal that have direct physiological correlates than on the patterns that are processed through the relay of cognitive filters.

We would like to emphasize that our results do not indicate an inability of the participants to hear the rhythmic differences. To test the ability to detect the differences in L2 speech rhythm between proficiency levels, a discrimination test would have to be carried out, and a number of studies have shown that such small differences are detected. Using a classification task, we can determine which timing patterns are used to classify the utterances into groups. It is possible that larger differences in durational variability (e.g., between rhythmically contrastive languages that are traditionally defined as stress-timed and syllable-timed) might become more linguistically relevant and be used in classification. Smaller differences, such as those revealed between L2 varieties, are not sufficiently large to be processed as linguistically relevant, and timing patterns are classified based on direct physiological correlates. Further research is necessary to address which rhythmic differences could be processed as linguistically relevant.

Conclusion

We have shown significant differences in speech timing between the sentences produced by German learners of English at different proficiency levels. As L2 acquisition progresses, L2 English is delivered at a faster rate and with a higher degree of stress-timing. Further analysis revealed that the realization of timing is more stable in L2 speech produced by advanced L2 learners. Advanced learners tend to speak consistently with a higher degree of stress-timing. Lower proficiency speakers randomly vary the degree of durational variability in their speech, sometimes delivering L2 speech with high durational variability, and sometimes with a more syllable-timed rhythm. We suggest that timing control in L2 speech production improves as acquisition progresses, and rhythm becomes more stable.

Although rhythmic changes in L2 acquisition can be easily profiled with normalized rhythm metrics, raw metrics do not exhibit a clear uni-directional development. Faster speech rate at higher proficiency levels lowers the values of raw metrics, while the need to enhance durational variability pushes the metrics up. These conflicting forces did not allow raw metrics to reveal a clear developmental change as a function of acquisition progress.

The perception experiment was set up to investigate whether monolingual English speakers use the differences in L2 speech timing between proficiency levels to group the utterances with different timing patterns into the same class. Although L2 speech indeed becomes increasingly stress-timed with proficiency, native speakers of English, when asked to classify different timing


patterns into separate groups, paid attention to the differences in speech rate and ignored the differences in speech rhythm between the utterances produced by the L2 learners at different proficiency levels. Faster utterances were grouped separately from slower utterances. Both groups included utterances with high and low durational variability of speech intervals. This trend is schematically illustrated in Figure 8. The sensitivity of the listeners to speech tempo is physiologically determined. The fact that listeners ignore rhythmic differences in classification can be explained by the non-linguistic nature of the stimuli. Processing of the “sasasa” stimuli in our experiment presumably does not involve the cognitive mechanisms that are employed in processing linguistic material, and listeners pay attention to those features of the acoustic signal that have direct physiological correlates. Further research is necessary to understand whether the cognitive filter is not applied to processing these stimuli because they are not perceived as speech, or because the differences in rhythm between the stimuli are not sufficiently large to be linguistically relevant.

Acknowledgments

We acknowledge the financial support of the German Research Foundation (DFG) and the Open Access Publication Fund of Bielefeld University for the article processing charge. This work was supported by the Alexander von Humboldt foundation. We are also thankful to Ferenc Bunta and David Ingram for sharing the picture prompts that we used for the sentence elicitation task.

Supplementary Material

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fpsyg.2015.00316/abstract

References

Arvaniti, A. (2012). The usefulness of metrics in the quantification of speech rhythm. J. Phonet. 40, 351–373. doi: 10.1016/j.wocn.2012.02.003
Arvaniti, A., and Rodriquez, T. (2013). The role of rhythm class, speaking rate and F0 in language discrimination. Lab. Phonol. 4, 7–38. doi: 10.1515/lp-2013-0002
Boersma, P., and Weenink, D. (2010). Praat: Doing Phonetics by Computer (Version 5.1.22). Retrieved: December 15, 2010. Available online at: http://www.praat.org/
Bunta, F., and Ingram, D. (2007). The acquisition of speech rhythm by bilingual Spanish- and English-speaking four- and five-year-old children. J. Speech Lang. Hear. Res. 50, 999–1014. doi: 10.1044/1092-4388(2007/070)
Christophe, A., and Dupoux, E. (1996). Bootstrapping lexical acquisition: the role of prosodic structure. Linguist. Rev. 13, 383–412. doi: 10.1515/tlir.1996.13.3-4.383
Christophe, A., Gout, A., Peperkamp, S., and Morgan, J. L. (2003). Discovering words in the continuous speech stream: the role of prosody. J. Phonet. 31, 585–598. doi: 10.1016/S0095-4470(03)00040-8
Dauer, R. (1983). Stress-timing and syllable-timing reanalyzed. J. Phonet. 11, 51–62.
Dauer, R. (1987). “Phonetic and phonological components of language rhythm,” in Proceedings of the 11th International Congress of Phonetic Sciences (Tallinn, Estonia), 447–450.
Dellwo, V. (2006). “Rhythm and speech rate: a variation coefficient for deltaC,” in Language and Language-Processing, eds P. Karnowski and I. Szigeti (Frankfurt am Main: Peter Lang), 231–241.
Dellwo, V., and Wagner, P. (2003). “Relations between language rhythm and speech rate,” in Proceedings of the 15th International Congress of Phonetics Sciences (Barcelona), 471–474.
Ghitza, O. (2011). Linking speech perception and neurophysiology: speech decoding guided by cascaded oscillators locked to the input rhythm. Front. Psychol. 2:130. doi: 10.3389/fpsyg.2011.00130
Ghitza, O., and Greenberg, S. (2009). On the possible role of brain rhythms in speech perception: intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica 66, 113–126. doi: 10.1159/000208934
Grabe, E., and Low, L. (2002). “Durational variability in speech and the rhythm class hypothesis,” in Papers in Laboratory Phonology, Vol. 7, eds C. Gussenhoven and N. Warner (New York, NY: Mouton de Gruyter), 515–546.
Greenberg, S. (1997). “Auditory function,” in Encyclopedia of Acoustics, ed M. Crocker (New York, NY: Wiley), 1301–1323.
Greenberg, S., and Ainsworth, W. (2004). “Speech processing in the auditory system: An Overview,” in Speech Processing in the Auditory System, eds S. Greenberg, W. Ainsworth, A. Popper, and R. Fay (New York, NY: Springer-Verlag), 1–62.
Gut, U. (2009). Non-native Speech. A Corpus-Based Analysis of the Phonetic and Phonological Properties of L2 English and L2 German. Frankfurt: Peter Lang.
Kim, J., Davis, C., and Cutler, A. (2008). Perceptual tests of rhythmic similarity: II. Syllable rhythm. Langu. Speech 51, 343–359. doi: 10.1177/0023830908099069
Kolly, M.-J., and Dellwo, V. (2014). Cues to linguistic origin: the contribution of speech temporal information in foreign accent recognition. J. Phonet. 42, 12–23. doi: 10.1016/j.wocn.2013.11.004
Loukina, A., Kochanski, G., Rosner, B., Shih, C., and Keane, E. (2011). Rhythm measures and dimensions of durational variation in speech. J. Acoust. Soc. Am. 129, 3258–3270. doi: 10.1121/1.3559709
Low, L., Grabe, E., and Nolan, F. (2000). Quantitative characterizations of speech rhythm: syllable-timing in Singapore English. Langu. Speech 43, 377–401. doi: 10.1177/00238309000430040301
Mazuka, R. (1996). “How can a grammatical parameter be set before the first word?,” in Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition, eds J. L. Morgan and K. Demuth (Mahwah, NJ: Lawrence Erlbaum Associates Inc), 313–330.
Mehler, J., Dupoux, E., Nazzi, T., and Dehaene-Lambertz, G. (1996). “Coping with linguistic diversity: the infant’s viewpoint,” in From Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition, eds J. Morgan and K. D. Demuth (Hillsdale, NJ: Erlbaum), 101–116.
Mehler, J., Sebastian-Galles, N., and Nespor, M. (2004). “Biological foundations of language: language acquisition, cues for parameter setting and the bilingual infant,” in The New Cognitive Neuroscience, ed M. Gazzaniga (Cambridge, MA: MIT Press), 825–836.
Mok, P. (2013). Speech rhythm of monolingual and bilingual children at 2;06: Cantonese and English. Bilingualism: Language and Cognition 16, 693–703. doi: 10.1017/S1366728910000453
Murty, L., Otake, T., and Cutler, A. (2007). Perceptual tests of rhythmic similarity: I. Mora Rhythm. Langu. Speech 50, 77–99. doi: 10.1177/00238309070500010401
Nazzi, T., Bertoncini, J., and Mehler, J. (1998). Language discrimination by newborns: toward an understanding of the role of rhythm. J. Exp. Psychol. Hum. Percept. Perform. 24, 756–766. doi: 10.1037/0096-1523.24.3.756
Nazzi, T., and Ramus, F. (2003). Perception and acquisition of linguistic rhythm by infants. Speech Commun. 41, 233–243. doi: 10.1016/S0167-6393(02)00106-1
Nespor, M., Guasti, M., and Christophe, A. (1996). “Selecting word order: the rhythmic activation principle,” in Interfaces in Phonology, ed U. Kleinhenz (Berlin: Akademie Verlag), 1–26.
Nolan, F., and Asu, E. (2009). The pairwise variability index and coexisting rhythms in language. Phonetica 66, 64–77. doi: 10.1159/000208931
Ordin, M., and Polyanskaya, L. (2014). Development of timing patterns in first and second languages. System 42, 244–257. doi: 10.1016/j.system.2013.12.004
Ordin, M., Polyanskaya, L., and Ulbrich, C. (2011). “Acquisition of timing patterns in second language,” in Proceedings of Interspeech 2011 (Florence), 1129–1132.


Pamies Bertran, A. (1999). Prosodic typology: on the dichotomy between stress-timed and syllable-timed languages. Lang. Design 2, 103–130.
Peterson, G., and Lehiste, I. (1960). Duration of syllable nuclei in English. J. Acoust. Soc. Am. 32, 693–703. doi: 10.1121/1.1908183
Pike, E. V. (1959). A test for predicting phonetic ability. Lang. Learn. 9, 35–41. doi: 10.1111/j.1467-1770.1959.tb01127.x
Piske, T., McKay, I., and Flege, J. (2001). Factors affecting degree of foreign accent in an L2: a review. J. Phonet. 29, 191–215. doi: 10.1006/jpho.2001.0134
Polyanskaya, L., Ordin, M., and Ulbrich, C. (2013). “Contribution of timing patterns into perceived foreign accent,” in Elektronische Sprachsignalverarbeitung 2013, ed P. Wagner (Dresden: TUDpress), 71–79.
Prieto, P., del Mar Vanrell, M., Astruc, L., Payne, E., and Post, B. (2012). Phonotactic and phrasal properties of speech rhythm. Evidence from Catalan, English, and Spanish. Speech Commun. 54, 681–702. doi: 10.1016/j.specom.2011.12.001
Quene, H. (2007). On the just noticeable difference for tempo in speech. J. Phonet. 35, 353–362. doi: 10.1016/j.wocn.2006.09.001
Ramus, F., Hauser, M., Miller, C., Morris, D., and Mehler, J. (2000). Language discrimination by human newborns and by cotton-top tamarin monkeys. Science 288, 349–351. doi: 10.1126/science.288.5464.349
Ramus, F., and Mehler, J. (1999). Language identification with suprasegmental cues: a study based on speech resynthesis. J. Acoust. Soc. Am. 105, 512–521. doi: 10.1121/1.424522
Ramus, F., Nespor, M., and Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition 73, 265–292. doi: 10.1016/S0010-0277(99)00058-X
Roach, P. (1982). “On the distinction between ‘stress-timed’ and ‘syllable-timed’ languages,” in Linguistic Controversies, ed D. Crystal (London: Edward Arnold), 73–79.
Russo, R., and Barry, W. J. (2008). “Isochrony reconsidered. Objectifying relations between rhythm measures and speech tempo,” in Proceedings of Speech Prosody 2008 (Campinas, Brazil), 419–422.
Schiering, R. (2007). The phonological basis of linguistic rhythm. Cross-linguistic data and diachronic interpretation. Sprachtypologie Universalienforschung 60, 337–359. doi: 10.1524/stuf.2007.60.4.337
Schreiner, C., and Urbas, J. (1986). Representation of amplitude modulation in the auditory cortex of the cat. I. The anterior auditory field (AAF). Hear. Res. 21, 227–241. doi: 10.1016/0378-5955(86)90221-2
Schreiner, C., and Urbas, J. (1988). Representation of amplitude modulation in the auditory cortex of the cat. II. Comparison between cortical fields. Hear. Res. 32, 49–64. doi: 10.1016/0378-5955(88)90146-3
Stevens, K. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. J. Acoust. Soc. Am. 111, 1872–1891. doi: 10.1121/1.1458026
Suter, R. W. (1976). Predictors of pronunciation accuracy in second language learning. Lang. Learn. 26, 233–253. doi: 10.1111/j.1467-1770.1976.tb00275.x
Thiessen, E., and Saffran, J. (2007). Learning to learn: infants’ acquisition of stress-based strategies for word segmentation. Lang. Learn. Dev. 3, 73–100. doi: 10.1080/15475440709337001
Thomas, K. (2007). Just noticeable differences and tempo change. J. Sci. Psychol. 14–20.
Thompson, I. (1991). Foreign accents revisited: the English pronunciation of Russian immigrants. Lang. Learn. 41, 177–204. doi: 10.1111/j.1467-1770.1991.tb00683.x
Vaughan-Rees, M. (2002). Test Your Pronunciation: Book With Audio CD. London: Longman.
Vicenik, C., and Sundara, M. (2013). The role of intonation in language and dialect discrimination by adults. J. Phonet. 41, 297–306. doi: 10.1016/j.wocn.2013.03.003
Viemeister, N. (1988). “Psychophysical aspects of auditory intensity coding,” in Auditory Function, eds G. Edelman, W. Gall, and W. Cowan (New York, NY: Wiley), 213–241.
White, L., and Mattys, S. (2007). Calibrating rhythm: first language and second language studies. J. Phonet. 35, 501–522. doi: 10.1016/j.wocn.2007.02.003
White, L., Mattys, S., and Wiget, L. (2012). Language categorization by adults is based on sensitivity to durational cues, not rhythmic class. J. Mem. Lang. 66, 665–679. doi: 10.1016/j.jml.2011.12.010
Wiget, L., White, L., Schuppler, B., Grenon, I., Rauch, O., and Mattys, S. (2010). How stable are acoustic metrics of contrastive speech rhythm? J. Acoust. Soc. Am. 127, 1559–1569. doi: 10.1121/1.3293004
Wong, S., and Schreiner, C. (2003). Representation of stop-consonants in cat primary auditory cortex: intensity dependence. Speech Commun. 41, 93–106. doi: 10.1016/S0167-6393(02)00096-1

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Ordin and Polyanskaya. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
