Recurrent Word Combinations in Academic Writing by Native and Non-Native
Article info

Article history:
Available online 1 October 2011

Keywords:
Corpus-based research
Academic writing
Native and non-native speakers of English
Formulaic language
Lexical bundles

Abstract

In order for discourse to be considered idiomatic, it needs to exhibit features like fluency and pragmatically appropriate language use. Advances in corpus linguistics make it possible to examine idiomaticity from the perspective of recurrent word combinations. One approach to capture such word combinations is by the automatic retrieval of lexical bundles. We investigated the use of English-language lexical bundles in advanced learner writing by L1 speakers of Swedish and in comparable native-speaker writing, all produced by undergraduate university students in the discipline of linguistics. The material was culled from a new corpus of university student writing, the Stockholm University Student English Corpus (SUSEC), amounting to over one million words. The investigation involved a quantitative analysis of the use of four-word lexical bundles and a qualitative analysis of the functions they serve. The results show that the native speakers have a larger number of types of lexical bundles, which are also more varied, such as unattended ‘this’ bundles, existential ‘there’ bundles, and hedging bundles. Other lexical bundles which were found to be more common and more varied in the native-speaker data involved negations. The findings are shown to be largely similar to those of the phraseological research tradition in SLA.

© 2011 Elsevier Ltd. All rights reserved.
1. Formulaic language
It is notoriously difficult to achieve idiomaticity, that is, the knowledge of conventionalized combinations of words, in
academic discourse, whether from the perspective of the beginning university student coming to grips with formal academic
style or from the perspective of the advanced doctoral student being socialised into a specific discipline. One way in which
idiomaticity is realised is through the successful use of recurrent word combinations that are typical of a specific academic
register and discipline. Recurrent word combinations do not only contribute to idiomaticity, but also contribute to demon-
strating membership in a specific discourse community: ‘‘when we speak, we select particular turns of phrase that we per-
ceive to be associated with certain values, styles and groups’’ (Wray, 2006, p. 593). In the case of a student who is an
academic novice, this means learning to use specific word combinations as a ‘badge of identity’ (Wray, 2006). For any student
who is also a language learner in this context, yet another dimension of difficulty is naturally added.
Combinations of words that fulfill specific functions and that are called up more or less automatically by native speakers
have come to be known by the term ‘formulaic language’ (Schmitt & Carter, 2004). Research in second language acquisition
(SLA) shows that native speakers rely more on formulaic language, especially collocations, than non-native users. Further-
more, the degree of proficiency correlates significantly with the proportion and/or types of formulaic language used
0889-4906/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.esp.2011.08.004
82 A. Ädel, B. Erman / English for Specific Purposes 31 (2012) 81–92
(Forsberg, 2008; Lewis, 2009; Wiktorsson, 2003). For example, formulaic markers of vagueness (e.g., sort of, kind of) are
underrepresented in non-native compared to native speech (de Cock, 2004). For a subgroup within the formulaic family, col-
locations, there are mixed results, which can partly be explained by different studies applying different methodologies.
Although the majority of studies of collocations within SLA have found that non-native speakers underuse collocations com-
pared to native speakers in writing (Bolly, 2009; Erman, 2009; Granger, 1998; Howarth, 1998), some have shown that non-
native speakers use the same quantity of collocations as native speakers but overuse high-frequency collocations, which
makes type/token measures differ significantly between native and non-native writers (Durrant & Schmitt, 2009). Further-
more, it has been shown that non-natives have poorer intuitions about (a)typical collocations, and take 30% longer to make
judgements regarding collocational frequencies (Siyanova & Schmitt, 2008). Collocations are an elusive group and constitute
an area within formulaic language that ‘‘many teachers can identify as a problem though cannot describe’’, and that learners
‘‘are only dimly aware of’’ (Howarth, 1998, pp. 161–162). Although non-native academic writers can handle demanding
grammatical structures, they may fail to use the appropriate verb with a specific noun, which may hamper their communi-
cation (Howarth, 1998, p. 161).
Phenomena that fall under the rubric of formulaic language are not only difficult to acquire by language learners or novice
academics, but they are also somewhat difficult to identify and measure in naturally-occurring discourse. However, the last
couple of decades have seen an increasing use of corpora to explore large quantities of spoken and written language in search
of patterns (e.g., Hunston & Francis, 2000). Sinclair (1987) was among the first to demonstrate through corpus data that
words co-occur in specific patterns carrying specific meanings. This insight led him to formulate his ‘idiom principle’ to
the effect that words are co-selected rather than selected on an item-by-item basis.
The basic assumption of the present study is that the notion of idiomaticity must be expanded to encompass any native-
like selection of expression (cf. Pawley & Syder, 1983), be it a ‘phraseological unit’ or a ‘lexical bundle’. As the terms suggest,
these are approached and analyzed using different methodologies. Phraseological units coincide with traditional grammat-
ical units (e.g., verb + noun collocations, such as make a contribution), typically extracted from part-of-speech-tagged cor-
pora. A phraseological unit is composed of at least two words and is identified, usually manually, through at least one of
its members being used in a specialized, restricted sense, precluding the substitutability of a synonymous word. Collocations
are thus characterized by the restricted combinability of members. SLA-oriented phraseological research has used relatively
small corpora of native and non-native material (Granger, 1998; Howarth, 1998; Nesselhauf, 2003). Lexical bundles are ex-
tracted automatically from raw data—typically using sizable corpora—disregarding any pre-defined linguistic categories. De-
spite the frequency with which they occur, lexical bundles are ‘‘not idiomatic in meaning and not perceptually salient’’ (Biber
& Barbieri, 2007, p. 269). They typically do not coincide with traditional grammatical units, but instead represent clause or
phrase fragments, such as it is possible to or at the beginning of. In fact, Biber, Johansson, Leech, Conrad, and Finegan (1999)
report that less than 5% of the lexical bundles in academic prose represent complete structural units.
The analytical framework adopted for this study is captured in the notion of lexical bundles: multi-word sequences that
recur frequently and are distributed widely across different texts (Biber, 2010, p. 170). Specific, albeit varying, cut-off points
are used for the two criteria of frequency and dispersion. The frequency cut-off point used to identify lexical bundles is
‘‘somewhat arbitrary’’, but it tends to be called ‘‘conservative’’ or ‘‘relatively high’’ if set at 40 times per million words for
four-word bundles (Biber & Barbieri, 2007, p. 267). Especially in the case of spoken data, it may be useful to set the bar high,
since spoken data generally have been shown to rely more extensively on lexical bundles (Biber, Conrad, & Cortes, 2004;
Biber et al., 1999). Studies of written corpus data have used cut-offs of 25 times per million words (e.g., Chen & Baker,
2010) and 20 times per million words (e.g., Cortes, 2004). The dispersion criterion is also arbitrary, leading to varying prac-
tices in the literature. A criterion of three to five texts is often used for four-word bundles (e.g., Biber & Barbieri, 2007; Biber
et al., 2004; Chen & Baker, 2010; Cortes, 2004), but percentages are also sometimes used (Hyland, 2008b). In order to avoid
overly context-dependent expressions, ‘content bundles’ are also sometimes excluded, for example, if they are present in an
essay question or if they incorporate proper nouns (e.g., Chen & Baker, 2010).
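The two selection criteria described above, a normalised frequency cut-off and a dispersion cut-off, can be illustrated with a minimal Python sketch. It is not the procedure of any of the studies cited: the tokenizer is a deliberate simplification, and the names tokenize and extract_bundles, as well as the default thresholds, are chosen here for illustration only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep word-internal
    # apostrophes and hyphens, drop all other punctuation.
    return re.findall(r"[a-z0-9]+(?:['’-][a-z0-9]+)*", text.lower())


def extract_bundles(texts, n=4, per_million=25, min_texts=3):
    """Return {bundle: (raw_frequency, number_of_texts)} for contiguous
    n-word sequences meeting both the frequency and dispersion cut-offs."""
    freq = Counter()
    spread = defaultdict(set)  # bundle -> ids of the texts it occurs in
    total_tokens = 0
    for text_id, text in enumerate(texts):
        tokens = tokenize(text)
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram] += 1
            spread[gram].add(text_id)
    bundles = {}
    for gram, count in freq.items():
        normalised = count * 1_000_000 / total_tokens
        if normalised >= per_million and len(spread[gram]) >= min_texts:
            bundles[gram] = (count, len(spread[gram]))
    return bundles
```

On a toy input of a few sentences almost every repeated sequence passes the frequency cut-off, so in such a case it is the dispersion criterion that does the filtering; with corpus-sized input both criteria matter.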
The framework is appealing in that it is essentially data-driven and that the retrieval can be fully automatized, if we dis-
regard some manual manipulation involving content-based bundles and overlaps. Note that lexical bundles refer to contig-
uous word relations only; the method does not capture syntagmatic relations that are variable in terms of the position of the
individual words included, such as cases in which modification occurs (to a /very/ large extent) or various types of word order
variation (a quantitative study of the two genitive constructions; this part of the study was mainly quantitative). This has been
pointed out as one of the main drawbacks of the framework (Durrant, 2009).
Despite the fact that lexical bundles do not represent complete structural units, they are still seen as ‘‘important building
blocks in discourse’’ (Biber & Barbieri, 2007, p. 270) which serve important functions—not least in academic discourse, where
lexical bundles have been found to be pervasive (Biber & Barbieri, 2007). Starting with Biber et al. (2004), different attempts
have been made to pin down the functions of lexical bundles, typically resulting in three primary ones: expressions of stance
(e.g., I don’t know what the voltage is here), discourse organisers (e.g., What I want to do is quickly run through the exercise. . .)
and referential expressions (e.g., . . .students must define and constantly refine the nature of the problem. . .) (Biber & Barbieri,
2007, p. 270ff). The distribution of these functions has been shown to vary on the basis of register: for example, similar
to face-to-face conversation, classroom teaching draws heavily on stance bundles, but, like academic writing, it also draws
heavily on referential bundles (Biber et al., 2004).
Much of the original interest in lexical bundles stemmed from a concern with register, but recent work also considers
learner/expert production. Since its introduction in the Longman Grammar (Biber et al., 1999), the framework has been used
in a range of studies, including comparison of the characteristics of different registers, such as textbooks versus classroom
discourse (Biber et al., 2004; see also Biber & Barbieri, 2007), or the behaviour of different populations, such as native versus
non-native speakers (e.g., Chen & Baker, 2010; see also de Cock, 2000 and Römer, 2009 for similar approaches) and student
versus expert writers (e.g., Cortes, 2004; Hyland, 2008a). In addition, there is a small number of studies of lexical bundles
from the perspective of academic disciplines (e.g., Cortes, 2004; Hyland, 2008b), but it is as yet unclear to what extent
disciplinary variation plays a role in lexical bundles (cf. Hyland, 2008b).
If we compare the lexical bundle approach (represented by Chen & Baker, 2010) to the phraseological approach, we will
see that the fundamental findings concerning the use of formulaic language in native and non-native speaker populations
converge. Recurrent word combinations are more frequent overall in native than non-native production. Not only do native
speakers have a broader repertoire of types, but they also tend to display greater variety in form. Furthermore, certain groups
of recurrent word combinations are typically found to be underused by non-native speakers (e.g., conventionalized
adverb + adjective combinations, such as acutely aware/painfully clear, reported in Granger, 1998, p. 150 and bundles, such
as in the context of, reported in Chen & Baker, 2010, p. 30), while others are found to be overused (I/We framed constructions,
e.g., I claim that/we could say that, reported in Granger, 1998, pp. 154–155 and bundles, such as all over the world, reported in
Chen & Baker, 2010, p. 30).
The aim of the present study is to investigate the use of English-language lexical bundles in advanced learner writing by
L1 speakers of Swedish and in comparable native-speaker writing, all produced by undergraduate students of linguistics.
Comparing the two groups, we carry out a quantitative analysis of the use of lexical bundles and a qualitative analysis of
the functions served by lexical bundles. Furthermore, we relate our findings to similar data by comparing our results to those
of Chen and Baker (2010), which are based on writing from many different disciplines by native-speaker students and non-
native students with L1 Chinese. On the basis of previous research, it is hypothesized that the non-native students will pro-
duce fewer bundles overall (cf. Erman, 2009; Howarth, 1998), and less varied ones (e.g., Granger, 1998; Lewis, 2009). We
believe this to be the first systematic study of lexical bundles used by undergraduate EFL students in a European setting.
Previous studies exist for ESL contexts (Chen & Baker, 2010) and for EFL settings in Asia at master’s and doctoral levels
(Hyland, 2008a,b). Römer (2009) includes graduate EFL students who are L1 speakers of German, but considers only the
20 most frequent bundles (regardless of dispersion) and does not filter out content-related bundles (such as Quirk et al.,
1985 and the University of Michigan).
2. Material and method
The material used for the study is from a new corpus of essays by university students of linguistics, both native and non-
native speakers of English. The method follows that of Chen and Baker (2010), which is based on the pioneering work of Biber
and colleagues.
The material is from the Stockholm University Student English Corpus (SUSEC) and includes 325 essays, amounting to
over one million words. The writing represented in the corpus is by students of linguistics at different levels who are
non-native speakers (L1 Swedish) and native speakers of (British) English. The non-native material consists of writing by stu-
dents of English linguistics from the first to the fourth term of study in the Department of English at Stockholm University,
while the native-speaker material for comparison consists of writing by students of linguistics from the second and third
year of study at the Department of Linguistics at King’s College in London. Table 1 gives an overview of the data used for
the learner subcorpus, while Table 2 details the native-speaker subcorpus.
Table 1
Material for the non-native-speaker subcorpus.
Table 2
Material for the native-speaker subcorpus.
The two subcorpora are largely comparable, although there are important differences between them, one of which con-
cerns the year of study of the two student groups. The learner material spans 2 years of study, from the first term through to
the fourth (and final term) of undergraduate study. The native-speaker material does not include any first-year data, but only
data from the second and third years of study. However, in both groups, around 70% of the material is produced by students
in the final stages of the program. The two subcorpora are also different with respect to revision and the number of students
represented. The third- and fourth-term essays of the non-native-speaker data have been revised by the student writers
themselves and, to some extent, by their supervisors, while the native-speaker essays have not undergone any revision.
The number of students represented is considerably larger in the non-native material, with almost all of the 243 essays hav-
ing different authors. Many of the native speakers, by contrast, have two or more of their texts represented (although written
for different courses), so the total number of individual students is just over 30, which is a drawback from the perspective of
dispersion. Although the native-speaker subcorpus is considerably smaller, it still offers a relatively large collection,
especially considering that it represents writing from one single discipline (topics covering phonetics, psycholinguistics,
semantics, sociolinguistics and discourse analysis). Further differences between the subcorpora have to do with text
length—varying from approximately 1000 to 11000 words—and also, to some extent, differences in task—with the native-
speaker material involving discussions of previous research to a far greater extent than reports of students’ own small-scale
research projects. See Section 3 for further discussion on the possible effects these differences could have on the use of lexical
bundles.
Only four-word lexical bundles were considered for the study, in order to make the analysis more manageable and com-
parable to that of Chen and Baker. The four-word scope is ‘‘the most researched length for writing studies, probably because
the number of 4-word bundles is often within a manageable size (around 100) for manual categorization and concordance
checks’’ (Chen & Baker, 2010, p. 32). However, we can only concede that bringing in lexical bundles of other scopes would
have made for a richer study. A word of caution is called for in this context also, from the perspective of a recent study by
Simpson-Vlach and Ellis (2010, p. 509), which found that many important recurrent word combinations are actually three-
word bundles. That said, we can make the point (with Cortes, 2004, p. 401) that three-word bundles are often subsumed in
four-word bundles (e.g., as a result of contains as a result).
In order to retrieve the four-word lexical bundles, the WordList cluster function in WordSmith (Scott, 2007) was used. The cut-off
frequency was set at 25 times per million words¹; since the two subcorpora differ in size, the cut-off point was equivalent
to a raw frequency of 22 in the non-native material and 6 in the native material. Following Chen and Baker (2010), the dispersion
criterion was that a word combination had to occur in at least three texts in the native-speaker data in order to be included.
However, since the non-native material is considerably larger, the dispersion criterion was set at nine texts in an attempt to
make the results as comparable as possible. The purpose of the dispersion criterion is to guard against idiosyncrasies introduced
by individual writers (cf. Biber et al., 2004).
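The conversion from a normalised cut-off to a raw frequency can be sketched as follows. The subcorpus sizes used in the example are illustrative assumptions, chosen only so that they reproduce the raw cut-offs of 22 and 6 reported above; they are not the actual sizes of the SUSEC subcorpora. Rounding to the nearest integer is likewise an assumption about how the raw cut-off was derived.

```python
def raw_cutoff(corpus_size, per_million=25.0):
    # Raw frequency corresponding to a normalised cut-off, rounded to the
    # nearest whole occurrence (rounding rule assumed, not stated in the text).
    return round(per_million * corpus_size / 1_000_000)


def normalised(raw_count, corpus_size):
    # Frequency per million words for a given raw count.
    return raw_count * 1_000_000 / corpus_size


# Hypothetical subcorpus sizes, for illustration only.
NON_NATIVE_SIZE = 880_000
NATIVE_SIZE = 245_000

non_native_raw = raw_cutoff(NON_NATIVE_SIZE)
native_raw = raw_cutoff(NATIVE_SIZE)
native_per_million = normalised(native_raw, NATIVE_SIZE)
```

Under these assumed sizes, a raw count of 6 in the smaller subcorpus normalises to just under 25 per million, which shows how rounding can leave a bundle at the raw cut-off with a normalised frequency of 24.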
The retrieved bundles were checked manually for context-dependent content bundles and overlapping bundles, following
Chen and Baker (2010). We excluded content bundles involving proper nouns (the New York Times, at King’s College London),
terms that were directly related to the topic (e.g., between men and women used in essays on the topic of language and gen-
der, and of the English language used in essays on specific linguistic phenomena in English) and, thus, terms and expressions
specific to the discipline (e.g., as a second language, in the present tense), but we included terms and expressions used in re-
search in general (e.g., the majority of the informants, the age of #). The omission of proper nouns resulted in three exclusions
from each list. The omission of topic- and discipline-specific terms resulted in nine exclusions from the non-native list and
28 from the native list. The reason for excluding topic-specific bundles is to guard against idiosyncrasies introduced by those
topics that happen to be represented in the material. Furthermore, the presence of content bundles specific to linguistics
would have affected the comparison of our results to the multidisciplinary material of Chen and Baker (2010). Nevertheless,
there were in fact not many discipline-specific terms to exclude, presumably because a range of subfields of linguistics were
represented and possibly also because specialist terms are often shorter than four words, or occur as acronyms. The proce-
dure for dealing with overlapping bundles involved merging examples such as due to the fact and to the fact that into due to
the fact + that, the justification being to guard against inflated numbers (cf. Chen & Baker, 2010, p. 33). These were checked
manually by concordancing. Table S1 marks extensions of a four-word bundle with a plus sign, which is placed in parentheses
if the four-word bundle does not co-occur with the extension in all cases.
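The merging of overlapping bundles can be sketched as below. This is an illustration under stated assumptions, not the procedure actually applied: in the study the overlaps were checked manually by concordancing, which the code does not replicate. The function name merge_overlap, the always flag and the exact output format are hypothetical.

```python
def merge_overlap(first, second, always=True):
    """Merge two equal-length bundles when the tail of `first` equals the
    head of `second`, e.g. 'due to the fact' and 'to the fact that'.

    `always` records whether the extension follows the bundle in every
    case; if not, the extension is put in parentheses. Returns None when
    the two bundles do not overlap in this way.
    """
    a, b = first.split(), second.split()
    if len(a) == len(b) and a[1:] == b[:-1]:
        extension = f"+ {b[-1]}"
        if not always:
            extension = f"({extension})"
        return f"{first} {extension}"
    return None


merged = merge_overlap("due to the fact", "to the fact that")
```

Whether an extension co-occurs with the bundle in all cases has to be established from the concordance lines; here it is simply passed in as a parameter.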
¹ Note that, because of rounding used in the normalisation, many native bundles have a frequency of 24.
3. Results
The overall quantitative comparison between our two groups is presented first, followed by a discussion of those bundles
that are shared and not shared between the native and non-native speakers. The comparison to Chen and Baker’s results is
presented thereafter, followed by the functional classification of the lexical bundles found.
Table S1 (see Supplementary material) provides a list of the lexical bundles, divided into those that are found in only one
group and those that are shared by both groups.² Note that the fact that a given lexical bundle made it onto the non-native
list does not necessarily mean that it was not used at all by the native group (or vice versa); it simply means that the frequency
and dispersion criteria were not met in the other group’s material. In the second column of the table, the first number refers to
the non-native group and the second number to the native group. The hash symbol (#) represents any number represented by
digits in the subcorpora (thus representing a minor solution to the variability problem); the actual frequencies of some of these
bundles would have been higher if numerals written out in full had also been included.
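The treatment of numbers can be sketched with a single substitution applied before bundle retrieval. This is an assumed implementation for illustration only; the text does not say how the hash symbol was produced, and the function name mask_numbers is hypothetical. As in the description above, numerals written out in full are left untouched.

```python
import re


def mask_numbers(text):
    # Replace any number written in digits (including forms such as 1,000
    # or 3.5) with a single hash symbol; spelled-out numerals are not matched.
    return re.sub(r"\d+(?:[.,]\d+)*", "#", text)


masked = mask_numbers("as shown in table 3 and table 12")
```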
The native-speaker writing shows a considerably wider range of lexical bundles than the learner writing, with a total of
130, as compared to 60. This general pattern verifies our hypothesis.
The frequency differences across subcorpora were tested for statistical significance, using the log-likelihood statistic.³
Applying statistical tests goes against the tradition in the literature on lexical bundles, which is characterized primarily by sim-
ple descriptive statistics. Some, however, such as Simpson-Vlach and Ellis (2010, p. 492), have argued that statistics such as log-
likelihood are ‘‘useful for comparing the relative frequency of words or phrases’’ across corpora. The bundles in Table S1 are
marked with a symbol if they occur in the list for only one subcorpus and if the difference in frequency between the two subcorpora
(not shown in the table) does not reach statistical significance. This symbol is also used below when a bundle is discussed for
which statistical significance was not found. The dispersion cut-offs have been taken into account in that only when a given
bundle meets the dispersion criterion has it been tested for statistical significance. The tests show that as many as 70% of
the lexical bundles occurring in only one list (43 types in the non-native data and 89 in the native data) do so with a significance
level of p < 0.01. While 70% is a large proportion, it is still the case that 30% of the bundle types do not reach statistical
significance, despite the initial frequency and dispersion cut-offs.⁴ While our study does not depart from the established procedure
for selecting bundles based on simple descriptive statistics (we merely mark those that are not significant), these results suggest
that future research should consider augmenting the procedures used for bundle selection with more sophisticated inferential
statistics.
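For readers unfamiliar with the statistic, a log-likelihood comparison of one bundle across two corpora can be sketched as follows, using the commonly used two-term formulation in which expected frequencies are derived from the pooled corpus. The frequencies and corpus sizes in the example are invented for illustration and are not taken from our data; the threshold of 6.63 is the chi-square critical value for p < 0.01 with one degree of freedom.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one item observed freq1 times in a corpus of
    size1 words and freq2 times in a corpus of size2 words."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


CRITICAL_P01 = 6.63  # chi-square critical value, 1 d.f., p < 0.01

# Invented figures, for illustration only.
g2 = log_likelihood(30, 1000, 10, 1000)
significant = g2 > CRITICAL_P01
```

An item with identical relative frequency in both corpora yields a value of zero, and the value grows as the relative frequencies diverge.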
² Here we follow Erman and Lewis’ study of multiword structures in native and non-native speech (2011).
³ The frequencies of all of the bundle types in either list were checked against the frequencies in the other subcorpus, using Paul Rayson’s online calculator
(http://ucrel.lancs.ac.uk/llwizard.html).
⁴ We also tested the frequency differences of the shared bundles for statistical significance (suggested by Stefan Gries, p.c.). This analysis showed that most of
the shared bundles (87%) were not used differently by the two groups, but that 13% of the shared bundles were significantly overused by either group at the
level of p < 0.01.
of introducing the topic, considering that the native speakers’ essays mostly involve expository discussion. This suggests that
the non-native speakers’ overuse of this type of metadiscourse (such as the aim of this) is less strong, but also that they use
different wordings from the native speakers. Finally, it is unclear why the bundle can be used to is used more extensively by
the native speakers. Testing the statistical significance of these shared bundles, we find that, with the exception of the results
from the and can be used to, they are used more often by one group or the other with a statistical significance of p < 0.01.
Previous research has shown that unattended ‘this’ is not a rare phenomenon in published academic writing, but rather is
even more common than attended ‘this’ in some disciplines (Swales, 2005, p. 10). We can note that attended ‘this’ bundles
are rare in our native-speaker list (in this essay I), but common in our non-native-speaker list, predominantly with the meta-
discursive head nouns ‘study’ and ‘essay’. There are no bundles including unattended ‘this’ in the non-native list.
Like unattended ‘this’, existential ‘there’ constructions display a striking pattern in that a large group of such bundles oc-
cur in the native list, while none appear in the non-native list. As many as seven types are included:
Some of these serve as ‘‘a springboard in developing the text’’ (Biber et al., 1999, p. 952), by, for example, introducing a series
of elements:
(3) However, there are a number of limitations to the data that may have affected my analysis of them and so my
account of Amaljeet’s language use. Firstly, it does not seem necessary to... [KC_2_024]
It should be noted that the shared list also includes four ‘there’ bundles (there is a difference; there seems to be; that there is a;
that there is no), so the overall picture is not that the non-native writers do not use these at all, but that the native writers
draw on such structures to a much greater extent, and with more variation.
Another way in which the two groups differ is in the use of hedges. While there are some shared types of hedging,
expressed by ‘seem’ (seems to be a; there seems to be) or ‘can’ (e.g., as can be seen), the native writers appear to use more
hedges than the non-native writers. Indeed, the native list includes over 20 hedges, while the non-native list has only four
(two involving ‘can’ and two involving ‘seem’).⁵ In addition to ‘can’ and ‘seem’, the native writers frequently use expressions
such as to a certain extent as well as hedges involving ‘may’, ‘appear’ and ‘could’. Thus, the native speakers not only use a
larger number of recurrent hedges, but also draw on a wider variety of lexical resources for expressing uncertainty and doubt.
This is in line with previous research; for example, Chen and Baker (2010, p. 43) found that their non-native group did not
demonstrate control of hedging ‘‘as diversely and robustly as native writers do’’. This appears to be the case also in spoken
discourse, where vagueness tags (e.g., or something, sort of, kind of) are underused by learners (de Cock, Granger, Leech, &
McEnery, 1998).
Five passive bundles appear in the shared list, showing that the students have adopted, to some extent, this highly char-
acteristic feature of academic writing (cf. Biber et al., 1999). Despite the existence of shared passive strings, however, the
distribution of passive constructions is another area of discrepancy between the groups, where, again, the native writers
make greater use of these complex structures than the non-native writers (25 versus 8 types).⁶ The verbs found in the native
list include see (nine bundles), refer to (five bundles), find (two bundles), say, note, attribute to, relate to, define, support, suggest,
assume and use. The list of verbs found in the non-native list is shorter, including only four of the verbs above, plus show (shown
in table #).
‘It’-clauses, especially those followed by an extraposed ‘to’-clause and, to a lesser extent, those followed by an extraposed
‘that’-clause, have been found to be unusually frequent in academic writing as opposed to other registers (Biber et al., 1999).
The fact that anticipatory ‘it’ structures are seen as useful devices by our student writers is evident from the list of shared
bundles, which contains no fewer than six of these structures:
(4) it is [difficult/important/interesting/necessary/possible] to
it would be interesting + to
It has been suggested that anticipatory ‘it’ patterns are sometimes exploited by learners and student writers to ‘‘state
propositions more forcefully than is appropriate’’ (Hewings & Hewings, 2002, p. 381), although there is no clear evidence
for this in the current data. However, the lists indicate that the learners make use of these structures to a greater extent,
in that it is easy to, it is hard to and it is clear that are unique to the non-native list. We have already noted the potentially
inappropriate use of ‘easy’ and ‘hard’, which are often avoided in formal registers. Nevertheless, we suggest that it is a useful
strategy for L1 Swedish students to use this structure, simply for the reason that it helps the writer to project a detached
writing persona; impersonalized evaluative statements have been found to be underused by learner writers (e.g., Milton,
1998).
There are a few ways in which our material is different from that of Chen and Baker (2010): our corpus is discipline-spe-
cific rather than covering many disciplines; it is larger in size by approximately 70% than the BAWE subcorpora used by Chen
and Baker; the L1 of our non-native group is Swedish, not Chinese; and our corpus material has been marked up for quoted
material, such that any bundles captured definitely represent the students’ own writing (cf. Ädel, 2010). Table S2 (see Sup-
plementary material) provides an alphabetically sorted list of the lexical bundles found by Chen and Baker (2010), which we
have divided into those that are found in either group and those that are shared by both groups.
Comparing our results (Table S1) to Chen and Baker’s, we find that the overall pattern is the same, but that there is greater
discrepancy within our data set. Chen and Baker (2010, p. 44) come to the conclusion that ‘‘the use of lexical bundles in non-native
and native student essays is surprisingly similar’’. They find 54 bundles unique to the non-native student list (compared
to 60 in our data), 78 bundles unique to the native student list (compared to 130), and 26 bundles that are shared between
the two groups (compared to 55). Thus, the range in their data is from 54 to 78, but from 60 to 130 in our data. Expressed in
proportions, the difference in the number of types is found in the non-native-speaker groups (34% in Chen & Baker versus
24% in SUSEC) and in the shared bundles (16% in Chen & Baker versus 22% in SUSEC), but not in the native-speaker groups,
which produce half of the types in both corpora (49% in Chen & Baker versus 53% in SUSEC).
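The percentages above follow directly from the type counts given in the text. As a quick arithmetic check, a minimal sketch in Python (the variable and function names are illustrative and not part of either study):

```python
# Type counts reported in the text: bundles unique to the non-native list,
# unique to the native list, and shared by both lists.
counts = {
    "Chen & Baker": {"non-native": 54, "native": 78, "shared": 26},
    "SUSEC": {"non-native": 60, "native": 130, "shared": 55},
}

def proportions(type_counts):
    """Return each category's share of all bundle types, in whole percent."""
    total = sum(type_counts.values())
    return {name: round(100 * n / total) for name, n in type_counts.items()}

for corpus, type_counts in counts.items():
    # Prints the corpus name, the total number of types, and the shares.
    print(corpus, sum(type_counts.values()), proportions(type_counts))
```

The totals are 158 types for Chen and Baker and 245 for SUSEC, which yields the rounded shares of 34%, 49% and 16% versus 24%, 53% and 22%.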
This raises the question of why the difference is greater between our student groups. There are several possibilities.
One is that our non-native material represents a context in which English is a foreign language, while Chen and Baker’s
non-native material represents a context in which English is a second language, with the student writers being international
students in an English-speaking country. On the basis of this EFL–ESL distinction, we would expect greater differences within our
groups. Another possible source of discrepancy is in the method itself, as ‘‘larger corpora will generate fewer recurrent
word combinations with the same cut-off normalized frequency, when compared with smaller corpora, because large
5 The majority of these occur with statistical significance in the native data, while three of four in the non-native list do not.
6 The majority of these occur with statistical significance in the native data, while three of eight in the non-native list do not.
88 A. Ädel, B. Erman / English for Specific Purposes 31 (2012) 81–92
corpora will elicit higher converted raw frequencies’’ (Chen & Baker, 2010, p. 43; see also Biber & Barbieri, 2007, p.
269fn). This could have resulted in lower frequencies in the non-native subcorpus, since it is larger than the native-
speaker subcorpus by two thirds, thus increasing the difference between our two groups. Yet another possible factor
is that only 30 different writers are represented in the native-speaker material. According to the principle that ‘the fewer
speakers represented, the greater the likelihood of uniform behaviour’, this could have boosted the number of bundles.
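The effect of corpus size on a normalized cut-off can be made concrete with a small calculation. The sketch below assumes a cut-off of 25 occurrences per million words and two subcorpus sizes that differ by two thirds; these figures are hypothetical and chosen only for illustration, not taken from SUSEC or BAWE.

```python
import math

def raw_cutoff(per_million_cutoff, corpus_size):
    """Convert a normalized cut-off (occurrences per million words)
    into the minimum raw count a bundle needs in a corpus of this size."""
    return math.ceil(per_million_cutoff * corpus_size / 1_000_000)

# Hypothetical values for illustration only.
CUTOFF_PER_MILLION = 25
sizes = {"smaller subcorpus": 300_000, "larger subcorpus": 500_000}

for name, size in sizes.items():
    print(name, raw_cutoff(CUTOFF_PER_MILLION, size))
```

Under these assumptions a bundle needs 8 occurrences to qualify in the smaller subcorpus but 13 in the larger one, which is one way in which the same normalized threshold can yield fewer bundle types in a larger corpus.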
There are also factors likely to work toward closing the gap between the subcorpora. One is the fact that the L1 of our non-
native students (Swedish) is closely related to English, while the L1 of the Chen and Baker group (Chinese) is not. Another
factor is discipline-specificity; since our material is restricted to one single discipline, unlike Chen and Baker’s multidisciplin-
ary material, we would expect stronger convergence within our groups. That said, it seems that different disciplines may
vary considerably in the overall number of lexical bundles yielded; the results for research articles in history and biology
given in Cortes (2004, pp. 404, 407) show that history only has half as many (54) as biology (109). We eagerly await further
research into the distribution of bundles to account for the various factors that may have an effect on studies of this type.
Finally, we can note that the differences would have been greater had we not excluded content bundles (which was also
done by Chen and Baker).
Our results clearly converge with Chen and Baker’s, demonstrating that non-native speakers produce fewer and less
varied lexical bundles than native speakers. Even in the case of advanced learners, this has been a robust finding in previous
research in SLA. There are two studies of lexical bundles, however, in which it could be said that the opposite
pattern is found: the non-native groups in Römer (2009) and Hyland (2008a) produce a larger number of bundles than
the native groups. The results in the first case could have to do with study design. For example, the cut-off points are
highly restricted (to a top 20 list), which is likely to favour the learners who have a more restricted repertoire but tend
to use their favourite bundles unusually often (the use of ‘lexical teddy bears’ by learners has been noted in the liter-
ature, referring to the reliance on that which is familiar, by ‘‘choosing words and phrases closely resembling their first
language or those learnt early or widely used’’ (Hasselgren, 1994, p. 237)). The results in the second case showed dif-
ferences between three different registers produced by different populations: research articles by native-speaking profes-
sionals and Master’s and PhD theses by L1 Cantonese speakers that had been awarded high passes. The fact that the
research articles had the smallest number of lexical bundles, however, is explained by reference to register—nativeness
is not considered relevant.
A comparison of these two lists of shared bundles in student writing to Chen and Baker’s list of bundles in published writ-
ing from the FLOB corpus reveals that as many as 14 of the 15 shared bundles are also found in professional academic writ-
ing, the one exception being as well as the. These widely shared core bundles must be highly useful, hence their abundance.
Corpus-based investigations of native and non-native speaker production often reveal patterns of over- and underuse.
Chen and Baker (2010) found that the non-native student writers overused (over-)generalizing expressions such as all over
the world, which were rarely used by native-speaker academics. No similar expressions were found in our data. We do find
some patterns of underuse by our non-native students, one of which involves negations with not and no, which are fewer in
number in the non-native groups, as illustrated in Table 3.
This can be explained by the fact that such negated structures are rather complex, and thus likely to be learned later.
Another complex structure which is more common in the native groups is ‘fact’-headed bundles. The Swedish group shares
six of these with the native speakers (due to the fact + that, is the fact that, [to/of] the fact that, the fact that [the/they]), who also
have an additional three (by the fact that; despite the fact that; in the fact that). The Chinese group only has one (to the fact
that), while the native-speaker group has four ((due) + to the fact that, is the fact that, the fact that [the/they]). Both the
negation and the ‘fact’ patterns are also confirmed by Chen and Baker’s data from published writing.
Table 3
Bundles involving negation from the non-native and native lists.
In an attempt to examine the functions served by the lexical bundles, Chen and Baker’s classification, based on Biber et al.
(2004), was applied. The purpose was to investigate the extent to which the natives and the non-natives use lexical bundles
for different functions.
Biber et al. (2004, p. 384) describe the three main categories in the functional classification of lexical bundles as follows:
Referential bundles make direct reference to physical or abstract entities, or to the textual context itself, either to identify the
entity or to single out some particular attribute of the entity. Stance bundles express attitudes or assessments of certainty that
frame some other proposition. Discourse organisers reflect relationships between prior and coming discourse. A category of
interactional bundles is sometimes also used if spoken data is considered (e.g., Biber et al., 1999). Within these main catego-
ries, which reflect the Hallidayan metafunctions of language, finer distinctions are made, involving three to four subcatego-
ries for each category. The subcategories used by Chen and Baker (2010, pp. 37–38) are shown in Fig. 1.
We have several reservations about this classification, despite the fact that it is presented as unproblematic in Chen and
Baker (2010). There, the classification is introduced as fully established, even though it was called ‘‘preliminary’’ when first
presented in Biber et al. (2004, p. 383). The main problem is that no clear criteria are given for how to decide which (sub)cat-
egory a given bundle should belong to. While some of the subcategories are somewhat well-defined by previous research or
are intuitively clear (e.g., topic introduction, quantifying), others are vague (e.g., identification/focusing, framing). In fact, this
vagueness has led to several inconsistencies in previous research. For example, Focusing is labelled Discourse organising in
Chen and Baker, but Referential in Biber et al. (2004) and Simpson-Vlach and Ellis (2010). Framing is labelled Referential in
Chen and Baker and Biber et al. (2004), but Discourse organising in Cortes (2004). An additional problem is the multifunc-
tionality of many lexical bundles. When classifying a given type, as emphasized in Biber et al. (2004), it is therefore necessary
to consider the extended context to determine what the predominant function is.
Despite considerable difficulty, we were able to improve the initial 66% agreement rate through extensive discussion,
checks of bundle classifications in the literature, and contextual analysis of concordance lines, and to finalise the classification
with almost 100% agreement, albeit with some 10% of the labels marked with question marks. It is not clear, however, to
what extent our understanding of some of the categories matches that of other researchers. The results are shown in Fig. 2.
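An agreement rate of this kind can be computed as the share of bundles to which two coders assign the same functional label. The sketch below assumes simple percent agreement; the text does not specify how the rate was calculated, and the labels are invented for illustration rather than taken from the study.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items (in percent) to which two coders assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both coders must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100 * matches / len(labels_a)

# Hypothetical labels for six bundles (not data from the study).
coder_1 = ["referential", "stance", "discourse", "stance", "referential", "discourse"]
coder_2 = ["referential", "stance", "referential", "stance", "discourse", "discourse"]
print(round(percent_agreement(coder_1, coder_2)))
```

Simple percent agreement does not correct for agreement expected by chance, so a chance-corrected measure may be preferable when the categories are few.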
What we find are rather similar proportions for referential expressions in the two groups, but a greater proportion of
stance bundles and a smaller proportion of discourse organisers among the native speakers. This confirms a pattern already
spotted: the native speakers’ greater reliance on, and greater variation in, stance bundles. One possible explanation for the
differences between referential expressions and discourse organisers is found in the highly frequent PP-bundles, especially
the in-bundles. These are similar in both native-speaker student groups, notably with abstract nouns as complements (e.g.,
purpose, variety, attempt). In contrast, non-native in-bundles in our material have concrete nouns as complements (essay,
study, table), adding to the high numbers for discourse organisers in this group.
Given our reservations about the classification, we have not tested the differences for statistical significance. This is in contrast
to Chen and Baker (2010), where the statistical comparison is strongly emphasized. Our perspective, however, is that
clearer definitions and a more generally agreed-upon classification are needed before inferential statistics could really make
a contribution.
4. Conclusion
The results presented in this study confirm a general pattern found by research in both the phraseological tradition (e.g.,
Erman, 2009; Howarth, 1998) and the lexical bundles tradition (Chen & Baker, 2010), which is that non-native speakers ex-
hibit a more restricted repertoire of recurrent word combinations than native speakers. This was found to be the case despite
the fact that our specific learner group is highly advanced, studying English linguistics at university level in a country where
the general proficiency level of English is relatively high. The range of the number of types in our data was found to be con-
siderable, with 60 bundles unique to the non-native speakers, 55 bundles shared between both groups, and as many as 130
bundles unique to the native speakers. The hypothesis that the non-native student writers would produce not only fewer
types overall, but also less varied ones, was also verified. For example, more varied means of expression among the native
speakers were found in unattended ‘this’ constructions, existential ‘there’ constructions, hedges, and passive constructions.
Other complex structures which were found to be more common and more varied in the native-speaker data were negated
patterns (e.g., not be able to, there is no evidence) and ‘fact’-headed bundles (e.g., the fact that they).
Several issues in design and methodology emerged which ought to be considered in future work. One issue is to do with
the relationship between descriptive and inferential statistics in the consideration of lexical bundles. In testing the frequency
difference between bundles in the native and non-native lists for statistical significance, which is rarely (if ever) done in the
literature, we found that 70% of the bundles in each list occurred with statistical significance. While currently not applied in
the definition and/or selection of lexical bundles, some measurement of statistical significance should perhaps be considered
in future work. Another issue concerns corpus comparability. Even though the corpus material used was largely comparable,
certain aspects of the context of writing appeared to have had an effect on the use of lexical bundles. For example, a larger
number of bundles referring to results was evident in the non-native speaker data, which we took to be due to a difference in
genre or task, made manifest in the non-native students typically reporting on an empirical study and the native speakers
typically discussing different approaches to linguistic issues. This can be seen as a confirmation of previous research showing
that there is not just ‘‘one single pool of lexical bundles’’ that speakers or writers draw on, but that ‘‘each register employs a
distinct set of lexical bundles, associated with the typical communicative purposes of that register’’ (Biber & Barbieri, 2007,
p. 265).
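To illustrate what testing a frequency difference for statistical significance can involve, the sketch below computes the log-likelihood statistic (G2), a measure commonly used in corpus linguistics for comparing the frequency of an item in two corpora. This is an assumption made for illustration: the passage above does not name the test that was applied, and the counts and corpus sizes in the example are hypothetical.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one item's frequency in two corpora."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    g2 = 0.0
    for observed, size in ((freq_a, size_a), (freq_b, size_b)):
        # Expected count if the item were equally frequent in both corpora.
        expected = size * total_freq / total_size
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

CRITICAL_P05 = 3.84  # chi-square critical value, 1 degree of freedom, p < 0.05

# Hypothetical counts and corpus sizes, for illustration only.
g2 = log_likelihood(40, 400_000, 20, 600_000)
print(round(g2, 2), g2 > CRITICAL_P05)
```

A value above the critical threshold suggests that the difference in relative frequency is unlikely to be due to chance alone; it says nothing about how the bundle is used.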
It needs to be stressed that only four-word bundles have been considered in this study; had three- and two-word bundles
also been covered, a vastly larger number of recurrent patterns would have been retrieved—and a fuller picture of the use of
formulaic language among our populations could have been given. By way of evaluating the types of structures that were and
were not captured by the four-word scope, we can point out that a large number of PPs and NPs (involving many abstract
nouns) were retrieved. This is in line with previous research showing that lexical bundles in academic writing are predom-
inantly phrasal rather than clausal: Biber (2010, p. 172) calculates that ‘‘70% of the common bundles in academic prose con-
sist of a noun phrase with an embedded prepositional phrase fragment (e.g., the nature of the) or a sequence that bridges
across two prepositional phrases (e.g., as a result of).’’ What the four-word scope did not capture, however, was adjective
phrases (e.g., statistically significant, most important), adverbial phrases (e.g., not necessarily, more specifically) and verb
phrases other than passives (e.g., draw attention to).
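The notion of a four-word scope can be illustrated with a minimal sketch of automatic bundle retrieval: every contiguous sequence of n words is counted, and only sequences that meet a frequency threshold and occur in a minimum number of texts are kept. The tokenization, the thresholds and the toy texts below are assumptions for illustration and do not reproduce the settings or software used in this study.

```python
import re
from collections import Counter, defaultdict

def extract_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-word sequences that recur at least min_freq times
    and occur in at least min_texts different texts."""
    freq = Counter()
    spread = defaultdict(set)  # bundle -> indices of texts containing it
    for index, text in enumerate(texts):
        # Simplified tokenization: punctuation is dropped, so sequences
        # may span sentence boundaries within a text.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - n + 1):
            bundle = " ".join(tokens[i:i + n])
            freq[bundle] += 1
            spread[bundle].add(index)
    return {b: c for b, c in freq.items()
            if c >= min_freq and len(spread[b]) >= min_texts}

# Toy texts invented for illustration.
texts = [
    "This is due to the fact that the data are limited.",
    "The difference is due to the fact that learners vary.",
    "In terms of the results, the pattern is clear.",
]
print(extract_bundles(texts))
```

With n set to 4, a two-word adjective phrase such as statistically significant can only surface as part of a longer recurrent sequence, which is why structures of this kind are not captured on their own.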
We compared our results to Chen and Baker’s (2010) specifically to be able to relate our results to similar data and better
evaluate our numbers. It would be a great step forward if future research were to develop clear criteria for comparing fre-
quencies across different studies, taking potential confounding factors into account. Ensuring the feasibility of comparison is
also important with respect to the functional classification of bundles. In fact, we found the functional classification to be the
most challenging part of the study, primarily because the various (sub)categories need to be better defined and more gen-
erally agreed upon in the literature.
If we were to single out only one future direction in which to continue this line of research, we would choose to focus on
qualitative analyses of context. By way of illustration, consider the expression in terms of. While in our material it is drawn on
equally frequently by the two groups, contextual analysis reveals rather different usage patterns. For example, the native
speakers’ uses co-occur with items such as define, vary/variable and similar, while the non-native speakers’ uses co-occur
with items such as differ/difference/different, describe and view. The non-natives appear to have identified the expression
as a marker of formal writing (and possibly as a useful information-structuring device), but they are not using it in a fully
target-like manner. A qualitative perspective would be likely to prove especially fruitful when comparing native and non-
native populations. It could prove quite revealing to examine those lexical bundles that are more or less equally frequent
in both groups, because even though a given lexical bundle is drawn on to the same extent, it does not necessarily follow
that it is used in the same way by native and non-native speakers.
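A first step toward this kind of contextual analysis is to list the words that co-occur with a bundle within a fixed window. The sketch below is one simple way of doing so; the window size and the example sentences are invented for illustration and are not drawn from SUSEC.

```python
import re
from collections import Counter

def collocates(texts, phrase, window=5):
    """Count words occurring within `window` tokens of each occurrence of phrase."""
    target = phrase.lower().split()
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + len(target):i + len(target) + window]
                counts.update(left + right)
    return counts

# Invented sentences, for illustration only.
sample = [
    "The two dialects differ in terms of vowel length.",
    "Politeness is defined in terms of face.",
]
print(collocates(sample, "in terms of").most_common(5))
```

Running the function separately on the native and the non-native material and comparing the resulting lists would give a rough view of differing usage patterns, which would still need to be checked against the concordance lines themselves.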
Acknowledgements
We are grateful to two anonymous reviewers for their valuable comments on the manuscript. We also gratefully
acknowledge support from The Bank of Sweden Tercentenary Foundation for the compilation of SUSEC.
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.esp.2011.08.004.
References
Hasselgren, A. (1994). Lexical teddy bears and advanced learners: A study into the ways Norwegian students cope with English vocabulary. International
Journal of Applied Linguistics, 4, 237–260.
Hewings, M., & Hewings, A. (2002). ‘It is interesting to note that’: A comparative study of anticipatory ‘it’ in student and published writing. English for Specific
Purposes, 21(4), 367–383.
Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19(1), 24–44.
Hunston, S., & Francis, G. (2000). Pattern grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins.
Hyland, K. (2008a). Academic clusters: Text patterning in published and postgraduate writing. International Journal of Applied Linguistics, 18(1), 41–62.
Hyland, K. (2008b). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27, 4–21.
Lewis, M. (2009). The idiom principle in L2 English: Assessing elusive formulaic sequences as indicators of idiomaticity, fluency, and proficiency. Saarbrücken,
Germany: VDM Verlag.
Milton, J. (1998). Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment. In S. Granger (Ed.),
Learner English on computer (pp. 186–198). London: Longman.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2), 223–242.
Pawley, A., & Syder, F. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. Richards & R. Schmidt (Eds.), Language and
communication (pp. 191–226). London: Longman.
Römer, U. (2009). English in academia: Does nativeness matter? Anglistik: International Journal of English Studies, 20(2), 89–100.
Schmitt, N., & Carter, R. (2004). Formulaic sequences in action: An introduction. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use.
Amsterdam: John Benjamins.
Scott, M. (2007). WordSmith tools (Version 4.0) [Computer software]. Oxford: Oxford University Press.
Simpson-Vlach, R., & Ellis, N. C. (2010). An academic formulas list (AFL). Applied Linguistics, 31, 487–512.
Sinclair, J. McH. (1987). The nature of the evidence. In J. McH. Sinclair (Ed.), Looking up: An account of the COBUILD project in lexical computing (pp. 150–159).
London: Collins.
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study perspective. The Canadian Modern Language Review,
64(3), 429–458.
Swales, J. (2005). Attended and unattended ‘‘this’’ in academic writing: A long and unfinished story. ESP Malaysia, 11, 1–15.
Wiktorsson, M. (2003). Learning idiomaticity: A corpus-based study of idiomatic expressions in learners’ written production. Lund Studies in English (Vol. 105).
Lund, Sweden: Lund University.
Wray, A. (2006). Formulaic language. In K. Brown (Ed.), Encyclopedia of language and linguistics (pp. 590–597). Oxford: Elsevier.
Annelie Ädel’s main research areas are discourse analysis, corpus linguistics and English for Academic Purposes. She has been affiliated with the University
of Michigan’s English Language Institute as a post-doctoral fellow and as Director of Applied Corpus Linguistics. She is currently a research fellow at
Stockholm University, Sweden.
Britt Erman’s main research fields are conversation analysis, discourse analysis, grammaticalization and the mental lexicon. She is
currently involved in a large-scale project at the Centre of Bilingualism, Stockholm University, with a focus on ‘multiword expressions’
in L1 and L2 English, aimed at establishing ultimate attainment in highly advanced and immersed L2 users.