Forensic Phonetics.2

The document discusses the work of forensic phoneticians in analyzing speech evidence from crime scenes. It provides two examples: (1) British authorities attempting to identify a suspected British jihadist from a beheading video based on his voice and accent. (2) A 911 call in the George Zimmerman murder case where a forensic phonetician was asked to analyze a scream in the background and determine if it was Zimmerman or Trayvon Martin. The document then discusses challenges forensic phoneticians face in accurately transcribing disputed utterances from recordings due to issues like unclear audio, active listener perception, and the influence of contextual information on what is heard.


7

Forensic phonetics

Police and security services are trying to identify a suspected British jihadist who appeared in footage of the killing of a US journalist. […] Unconfirmed reports suggest the man in the video […] is from London or south-east England and may have guarded Islamic State captives.
(BBC News, 21 August 2014)

The screams are clearly coming from a distraught male, whose repeated cries for help end abruptly with a gunshot. What is not clear from a recording of a 911 call […] is the identity of the screamer: George Zimmerman, the volunteer community watchman, or Trayvon Martin, the unarmed 17-year-old he killed…
(New York Times, 22 June 2013)

The work of the forensic phonetician


The forensic phonetician is concerned with all aspects of speech as evidence. This can involve deriving information about a speaker’s social and regional background on the basis of their voice. This is what was asked of linguists by the media when videos emerged online of a member of Islamic State beheading a man who had been held hostage, as reported by the BBC in the first epigraph. Forensic phoneticians are also asked by police or legal teams to offer an opinion on whether the speaker in two or more separate recordings is the same, such as in the Zimmerman case, in the second epigraph, in which an unidentified scream was heard in the background of a telephone call and a phonetician was asked whether it was possible to classify that scream as belonging to either George Zimmerman or Trayvon Martin. Both of these cases are discussed in more detail in this chapter. In addition, forensic speech scientists can help police forces with the transcription and interpretation of disputed recordings, and offer advice in the design of ‘voice line-ups’, also known as ‘voice parades’. These are similar to identity parades, but involve victims and witnesses who have heard, but not seen, the perpetrator of a crime. When the police arrest someone as a suspect in such a crime, recordings of the suspect’s voice, along with a set of similar voices, are played to the witness, and the witness is asked whether they can identify the voice that they heard at the scene of the crime.

BK-DEP-COULTHARD-160143-Chp07.indd 129 5/25/2016 6:44:32 PM


Transcription and disputed utterances
Many court cases involve the provision and presentation of transcriptions of tape- or video-recorded evidence. The recording(s) concerned may be of people talking about future or past criminal activity, or of them actually committing a crime, as in the case of bomb threats, ransom demands, hoax emergency calls or negotiating the buying or selling of drugs. Very few of the transcriptions presented in court have been made by someone with a qualification in phonetics, although occasionally a forensic phonetician is called in, typically when there is a dispute over a small number of specific items, which could be single words or even a single phoneme. Such recordings can come from a variety of sources, including recorded face-to-face interactions, recorded telephone and emergency service calls, as in the Zimmerman case, or ‘covert’ undercover recordings made without the knowledge of the speaker(s), all of which can include voices of native and non-native speakers. The expert is tasked by either prosecution or defence to provide an accurate and reliable account of what was said in the recordings. However, any researcher or student who has transcribed recordings of any kind will know that this is not a straightforward task, but one that demands considerable time and effort, and one that presents many challenges, even to the trained ear of the professional linguist. Unlike in writing, the sounds produced in speech are continuous and non-discrete, often difficult to distinguish when uttered at speed and with particular stress and rhythm, even in the clearest of recordings. Furthermore, Fraser (2003: 204–5) highlights the challenges posed by human perception when we hear sounds in recordings. She argues that:

Although we generally don’t notice our own contribution to perception, speech perception is an active, rather than a passive, process, with the hearer actively constructing, rather than passively picking up, the speaker’s message.

In other words, the listener does not necessarily hear what was said, but rather hears their construction of what they think was said; they subconsciously combine the speech signal (the sounds) with prior knowledge of speech, language and context in their own heads (Fraser 2003: 206). This is exemplified by Fraser (2014: 13–14) as she describes an earlier study by Bruce (1958), in which participants listened to a number of sentences, partially ‘masked’ with a hissing noise, after being given a key word as to the content of the recording, such as ‘sport’ or ‘weather’. Listeners heard the same masked sentences, but with different key words. What was found was that their perceptions of what they heard changed in response to the key words, despite the fact that they had listened to the same sentences. Such difficulties are compounded when the recordings are unclear or of a low quality, most commonly a result of poor recording equipment or extensive background noise (Fraser 2014: 9).
Fraser et al. (2011) exemplify the challenges posed by listener perceptions of unclear recordings with experimental results using a recording from a real forensic case in New Zealand. The case itself is that of David Bain, who was convicted of five counts of murder after the deaths of his parents and siblings at the family home in 1994. In one of the appeals against the verdict, a detective listening to the crisis call Bain made to the police claimed to hear Bain utter the words ‘I shot the prick’ under his breath, and it was alleged by the prosecution that this barely audible utterance constituted a previously unheard confession. The defence, meanwhile, argued that the speech was uninterpretable, and perhaps not even an utterance at all, but actually Bain gasping for breath. After lengthy legal disputes, a re-trial was granted, but it was decided that the emergency call was to be played to the jury with the disputed utterance removed. Innes (2011) provides a detailed account of the case itself. Although it is impossible to know how the jury in the trial would have interpreted the utterance, Fraser and her colleagues investigated what their participants ‘heard’ in the recording after they had received additional information about the case, which they were given at various ‘evidence points’ throughout the experiment (Fraser et al. 2011: 266). At the first evidence point (of six), after the listeners had already heard the call and been asked about their immediate impression of the caller, they were told that the speaker (Bain) had returned home to find his family shot dead, and were asked whether this ‘evidence’ changed their initial impression of the call and caller. At this first stage, the most frequent responses from the 200 participants were that the disputed utterance was either ‘I can’t breathe’ or was not speech at all (Fraser et al. 2011: 274), and nobody perceived the utterance to be ‘I shot the prick’ (though ‘shot’ and ‘prick’ were actually heard by some participants as isolated words). At the third evidence point, the participants were randomly assigned to one of two groups and were given systematically different information in order to observe whether the information affected what they heard. Group A was given a story in which suspicion fell on the caller, while Group B was told that the police suspected the caller’s father had killed his family and then shot himself. Notably, the groups were also given different possible interpretations of the disputed section of the call. Group A was told it was alleged to contain the words ‘I shot the prick’, while Group B was told it was ‘he shot them all’. At this point, ‘I shot the prick’ became by far the most common interpretation response in Group A participants, while Group B’s interpretations were relatively unaffected by the possibility of the utterance being ‘he shot them all’. While there was an increase in responses that included the words ‘shot’ and ‘killed’, some of which also included the pronoun ‘he’, nobody heard the alleged phrase (Fraser et al. 2011: 276).
Findings such as these provide strong support for the argument that listeners construct speakers’ messages through a combination of the sound signals themselves and their own contextual knowledge of the talk. That is, if the sounds alone are not enough to extract meaning, as in the David Bain case and many others, listeners use other background contextual information to interpret utterances. In turn, we are ‘primed’ to interpret unclear utterances in a particular way; different contextual knowledge provides different primings for listeners’ perceptions.

Such findings pose obvious challenges for those expected to produce accurate and reliable transcriptions for the court, particularly of unclear and questionable utterances. Therefore, transcribers must be aware of the ways in which these influences can produce errors in their perceptions of what is said, and their subsequent transcriptions. Fraser (2003: 221) argues that such skills and knowledge are ‘only gained through considerable study of linguistics, phonetics and psycholinguistics’. In cases where forensic evidence is central to a judicial decision, and where such decisions are made by judges or juries on the basis of spoken linguistic evidence, accuracy and reliability are paramount, as mis-transcriptions can have serious legal consequences. For example, in one case in which Coulthard was involved, an indistinct word, in a clandestine recording of a man later accused of manufacturing the designer drug Ecstasy, was mis-heard by a police transcriber as ‘hallucinogenic’:

… but if it’s as you say it’s hallucinogenic, it’s in the Sigma catalogue

whereas what he actually said was ‘German’:

… but if it’s as you say it’s German, it’s in the Sigma catalogue.

In another case, a murder suspect with a very strong West Indian accent was transcribed as saying, in a police interview, that he ‘got on a train’ and then ‘shot a man to kill’; in fact what he said was the completely innocuous and contextually much more plausible ‘show[ed] a man ticket’ (Peter French, personal communication).
French (in Baldwin and French 1990) reports a much more difficult case, which appeared to turn on the presence or absence of a single phoneme, the one that distinguishes can from can’t. Most readers, if they record themselves reading these two words aloud, will notice not one but two phonemic differences between their pronunciations of the words – the absence/presence of a /t/ and a different vowel phoneme. In a received pronunciation (RP) or near-RP British accent, at least when the words are produced as citation forms, the vowel in can’t is also longer. This contrasts with many North American accents of English in which the vowels in can and can’t are more similar. However, in an ordinary speech context, as in the phrase ‘I can’t refuse’, the /t/ often disappears and the vowel is shortened, so that the phonetic difference between the two words is very significantly reduced. In French’s case a doctor, who spoke English with a strong Greek accent, had been surreptitiously tape-recorded apparently saying, whilst prescribing tablets to someone he thought was a drug addict, ‘you can inject those things’. He was prosecuted for irresponsibly suggesting that the patient could grind up the pills and then inject them. His defence was that he had actually said just the opposite, ‘you can’t inject those things’. An auditory examination of the tape-recording showed that there was certainly no hint of a /t/ at the end of the ‘can’ word and thus confirmed the phonetic accuracy of the police transcription. However, the question remained: was the transcription morphologically incorrect; that is, was the doctor intending to say and actually producing his version of can’t? Auditory analysis of a taped sample of the doctor’s speech showed that there was usually an absence of final /t/ in his production of can’t. Also, even a trained phonetician found it virtually impossible to distinguish the doctor’s /a/ vowels, when they were produced in words which, it was possible to deduce from the context, were unambiguously intended as either can or can’t. So, whichever the doctor’s intended meaning on any particular occasion, it had to be determined by the untrained listener from the context and not auditorily. There would, therefore, be occasions when there was genuine ambiguity.
Forensic phoneticians transcribing recordings for court are not simply asked ‘what was said?’ As well as questions of content, they can also be asked ‘who said that?’ A difficulty facing anyone attempting to transcribe spoken language in which there are overlapping voices is to ensure that they correctly attribute utterances to speakers. As Bartle and Dellwo (2015: 230) note, when transcribing recordings for use in police investigations, utterances may be attributed to named individuals or ‘Speaker 1, Speaker 2, etc.’, and this may in turn constitute incriminating evidence. In such cases, the transcriber must be able to differentiate between the voices on recordings in order to reliably attribute any speech to a speaker. If this cannot be done reliably, then ‘the validity of any attribution is clearly compromised’ (Bartle and Dellwo 2015: 230). Bartle and Dellwo (2015: 230) give details of a case in which the UK Court of Appeal overturned a conviction after police officers’ attributions of utterances to speakers in covert audio recordings were ruled inadmissible, phoneticians having argued that it was impossible to reliably distinguish between the different voices in the recording.

Analysing the human voice


Acoustically, speech is a very complex and constantly changing combination of multiple and simultaneously produced noises and resonances or frequencies ranging across much of the audible spectrum. These sounds are produced by restricting and sometimes momentarily stopping the stream of exhaled air as it passes from the lungs, through the vocal tract to exit through the mouth or nose.

At this point a brief consideration of the physiology of speech might help. As we breathe normally, air passes freely to and from the lungs through the glottis, which is a gap between two small muscular folds in the larynx, which are popularly called the vocal cords but which phoneticians call the vocal folds. When we start to speak, the position of the vocal folds is altered to narrow the gap between them and the pressure of the escaping air now causes them to vibrate and in so doing to create sound.
Any vibrating object emits a sound, or note, whose perceived pitch is directly related to the frequency of the vibrations – thus anything, be it vocal folds, piano or guitar strings, vibrating 262 times or cycles a second will produce the sound we have learned, at least in the English-speaking world, to call middle C. Cycles per second, or ‘cps’, is now universally referred to as hertz (Hz). The frequency at which an object vibrates, and therefore the perceived pitch of the sound it emits, is a function of both its physical composition and its length, and thus an alteration in either or both of these will affect the vibration rate and therefore the perceived pitch. If one were to take a piano wire and cut it in half it would vibrate exactly twice as fast and produce a note exactly an octave higher; cut it in half again and it would vibrate four times as fast and produce a note two octaves higher. However, whereas each note on the piano has its own wire, speakers have only one set of vocal folds, and so variations in the pitch of the voice have to be achieved by tightening and slackening the muscles and thereby altering both the length and the thickness of the folds and therefore the frequency at which they vibrate.
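
The relationship between string length, vibration rate and octaves described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it treats the wire as an idealised string whose tension and material stay the same, so that frequency is inversely proportional to length. The function names are hypothetical, and 262 Hz is the middle C value quoted in the text.

```python
import math

MIDDLE_C_HZ = 262.0  # value quoted in the text, used here as a starting pitch


def frequency_after_halving(base_hz, halvings):
    """Frequency of an idealised string after its length is halved `halvings` times.

    Assumption: tension and material are unchanged, so each halving of the
    length doubles the vibration rate.
    """
    return base_hz * (2 ** halvings)


def octaves_between(low_hz, high_hz):
    """Number of octaves separating two frequencies (one octave = one doubling)."""
    return math.log2(high_hz / low_hz)


if __name__ == "__main__":
    for n in range(3):
        f = frequency_after_halving(MIDDLE_C_HZ, n)
        print(f"{n} halving(s): {f:.0f} Hz, "
              f"{octaves_between(MIDDLE_C_HZ, f):.0f} octave(s) above middle C")
```

The vocal folds do not behave like a cut wire, since their length, thickness and tension change together, so the sketch illustrates only the frequency and octave arithmetic, not voice production itself.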
What we call vowels are literally multi-note chords, that is, combinations of several separate pitches, which are produced simultaneously by modifications of the vocal tract, which thereby allow separate sections of the vocal tract to amplify multiples, or harmonics, of the underlying base frequency vibration of the vocal folds. These notes or pitches are called formants. Formants can be detected by acoustic analysis using electronic equipment. They are only one of a range of features that forensic phoneticians can focus on when analysing speech recordings. These features can be classified as either ‘segmental’ features, which means they manifest in individual phonemes, or ‘suprasegmental’ features, which extend over more than one phoneme. They are summarised in Table 7.1 below.
These phonetic features are often supplemented by relevant lexico-grammatical choices which may provide useful information about a speaker. Some of the features in Table 7.1 are related to an individual’s anatomical and physiological characteristics. Pitch and voice quality, and the production of vowels and consonants, are determined by the length, thickness and movement of the vocal folds, and the composition of and interaction between articulators such as lips, teeth and the hard and soft palates. At the same time, regional and social factors influence a person’s speech, resulting in the acquisition of particular accent features.

Such features can be used by forensic phoneticians to distinguish between groups of speakers. An obvious example is that people living in the same dialect area share similar vowel and consonant pronunciations. Similarly, pitch can be used to distinguish between sexes. Whereas boys and girls have similarly pitched voices, the male vocal folds thicken and lengthen at puberty and thus adult male voices have, on average, a significantly lower pitch than female voices. However, even within groups there is still considerable individual variation. For example, some female voices are naturally lower in pitch than some male voices. Indeed, the power of individual variation in human voices has been revealed in studies of twins. As Watt (2010: 79) notes, it has been found that identical twins who have very similar vocal tracts, who have lived in the same region and have received the same education and parental input, still exhibit differences in speech production. For example, speakers have been found who have consistently more fronted vowels than their twins (Loakes 2008) and have different pronunciations of sibilant and stop consonants (Weirich 2011).

Experts exploit group and individual variation within the parameters of speech in Table 7.1 when they are addressing two main types of forensic phonetic problem: speaker profiling and speaker comparison.

Table 7.1  Features used by forensic phoneticians in the analysis of speech

Parameter                    Description
Pitch                        Caused by the vibration of the vocal folds. The faster the vibration, the higher the pitch (e.g. 124 Hz).
Voice quality                A general set of characteristics which are a product of the configuration of the speaker’s vocal folds and vocal tract (e.g. breathy, nasal, creaky, shimmer or vocal fry) (Jessen 2010: 391).
Articulatory setting         The medium- to long-term setting of all the articulators (e.g. tongue, hard/soft palate, teeth, lips) in relation to one another, which results in different pronunciations of sounds across speakers (O’Grady 2013: 13).
Intonation and prosody       Patterns or melody of pitch changes or stress across stretches of connected speech (e.g. rising rather than falling intonation at the end of declarative sentences).
Rhythm                       Relates to the timing and length of stress and syllables in speech.
Speaking rate/tempo          The speed with which someone talks, measured by words per minute or syllables per second.
Vowels                       Distinctive realisations of sounds in which there is unimpeded airflow (e.g. the pronunciation of [aɪ] in <time>).
Consonants                   Distinctive realisations of sounds which involve restriction or closure of the vocal tract (e.g. the pronunciation of [ɫ] in <milk>).
Connected speech processes   Presence or absence of phonological processes across word boundaries (e.g. assimilation, linking, elision) (Knight 2012: 197).
Pathological features        Medical conditions which have long-term effects on speech production, such as stuttering and sigmatism (Jessen 2010: 382).

Speaker profiling
There are times when the police have a recording of a criminal’s voice, either committing or confessing to a crime, but have no suspect, and are thus anxious to glean any information at all that might enable them to narrow down the group of potential suspects. Examples might include an obscene phone call, a ransom demand, a bomb threat, extortion, or an audio recording of an attack or murder. In such cases, the forensic phonetician may be asked to undertake ‘speaker profiling’. We have already mentioned in the Introduction one of the earliest high-profile cases, dating from 1979, that of the Yorkshire Ripper where the forensic phonetician was amazingly successful in placing the speaker regionally, but such cases are not uncommon.
The first quotation in the epigraph of this chapter is a BBC News report of a video that emerged online and through mainstream news networks of the beheading of journalist James Foley at the hands of an Islamic State terrorist. After this video, a series of others were released apparently showing the same man murdering British and American journalists and aid workers. Upon the emergence of these videos, media coverage focused on the killer’s British accent, and the reporting included quotations from renowned linguists. On the basis of his British accent the killer became widely known as ‘Jihadi John’. In August 2014, phonetician Professor Paul Kerswill from the University of York was quoted in The Guardian as identifying the man’s accent as Multicultural London English, probably with a foreign language background (Chulov and Halliday 2014). A few months later, in February 2015, The Washington Post identified the man as Mohammed Emwazi, who was born in Kuwait, but grew up in West London (Mekhennet and Goldman 2015), a profile which corresponds to the linguistic background offered by Kerswill.
In forensic cases, phoneticians work hard to derive as much information as possible about the speaker from the sample(s) of speech made available to them, using both their expertise in phonetics and specialist software. The characteristics of a speaker that forensic phoneticians may be able to make a judgement on range from biological features such as age and sex, to socio-cultural factors such as ethnicity, geographical region and first language. The identification of some of these is more straightforward and reliable than that of others.
One of the most addressable questions relates to where a person is from, as was the case with the Yorkshire Ripper hoax. In these cases, the expert closely examines the recording for identifiable features of accent, namely vowel and consonant realisations, but also for lexico-grammatical features of dialect. Once they have identified a pool of variables that may give an indication that the person is speaking with a particular accent or in a particular dialect, the expert can then compare, verify and confirm their findings with those in authoritative descriptions of language varieties, and the research literature in language variation and sociolinguistics. Ultimately, the forensic phonetician is then able to offer a geographical profile of the speaker, localising them to where they are from and where they may have lived. This is not always straightforward, however, and Schilling and Marsters (2015) describe a fascinating case in which forensic phoneticians were asked to create a speaker profile for a woman who claimed to be a girl who had been missing for 20 years. In this case, the ‘unusual’ combination (and lack) of accent and dialect features in the woman was such that locating her geographically was very difficult. Schilling and Marsters’ discussion highlights the implications that dialect acquisition, contact and mixing have for regional profiling of individuals through voice. That said, research by Köster et al. (2012) found that German experts could accurately identify a speaker’s region on the basis of their voice with a success rate of 85 per cent, with the majority of errors occurring as experts selected accents which were neighbours of the one in question. Therefore, forensic speaker profiling can provide useful evidence for tracking down criminals or suspects in police investigations.
As well as region, forensic phoneticians can include other social characteristics in their profiles, such as the age and gender of the speaker. Both of these can be estimated through an analysis of the pitch of the voice. As Jessen (2010: 383) points out, under normal circumstances the decision as to whether a speaker is male or female is straightforward for experts and laypeople alike. This is because women generally have a much higher average pitch level than men due to physiological differences in vocal fold length, and this difference can be identified either by ear or by quantitative fundamental frequency (f0) analysis, measuring the average rate of vibration in the vocal folds.
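
To give a rough sense of what a quantitative f0 measurement involves, the following sketch estimates the fundamental frequency of a synthetic signal by autocorrelation. Everything here is an assumption made for illustration: the chapter does not say which method or software practitioners use, autocorrelation is simply one common textbook approach, the function names are hypothetical, and the test signal is an artificial tone with two harmonics rather than real speech.

```python
import math


def synth_voice(f0_hz, sample_rate, n_samples):
    """Synthetic 'voiced' signal: a fundamental plus two weaker harmonics."""
    signal = []
    for n in range(n_samples):
        t = n / sample_rate
        signal.append(
            math.sin(2 * math.pi * f0_hz * t)
            + 0.5 * math.sin(2 * math.pi * 2 * f0_hz * t)
            + 0.25 * math.sin(2 * math.pi * 3 * f0_hz * t)
        )
    return signal


def estimate_f0(signal, sample_rate, min_hz=60.0, max_hz=400.0):
    """Estimate f0 as sample_rate / lag, where lag maximises the autocorrelation.

    Only lags corresponding to pitches between min_hz and max_hz are searched;
    these bounds are illustrative, not taken from the chapter.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation: the signal multiplied by a shifted copy.
        score = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    rate = 8000
    for true_f0 in (125.0, 200.0):  # hypothetical lower and higher pitched voices
        estimate = estimate_f0(synth_voice(true_f0, rate, 800), rate)
        print(f"true f0 {true_f0:.0f} Hz, estimated {estimate:.1f} Hz")
```

Real recordings are much harder than this: f0 varies continuously through an utterance, and noise, telephone transmission and unvoiced stretches all complicate the measurement, so the sketch shows only the core idea of finding the period at which the waveform repeats.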
Age, however, is less straightforward. Jessen (2010: 383) points out that most of the changes in speech patterns occur in childhood, puberty and old age, and so these age groups have been the focus of research. However, he states that criminal offenders tend to be between 20 and 40, an age range for which there is less research. French and Stevens (2013: 186) comment that speaker age ‘lies only marginally within the range of addressable profiling questions’ because features such as pitch can only be used in determining age if they are compared with the pitch in earlier speech from the same speaker. Such earlier speech is rarely available in forensic contexts. Kelly and Harte (2015), however, show that in experimental conditions for which there are recordings of the same speakers over time, lay listeners can correctly detect ‘vocal ageing’ in a speaker. Participants in their experiments were asked to decide whether the same speaker was older or younger in two sets of recordings. Listeners were able to answer correctly 64 per cent of the time when the difference in age was ten years, and 86 per cent when the difference was 30 years. Furthermore, ageing was found to be more easily detectable in female speakers than male speakers (Kelly and Harte 2015: 175).

Speaker profiling can also be undertaken in collaboration with clinical linguists to help identify whether speakers have any medical conditions and speech disorders that have long-term effects on speech, such as stammering, cleft palate and dysphonia, which are impairments in the ability to produce sounds (Jessen 2010; French and Stevens 2013; Schilling and Marsters 2015: 199). Such information can be helpful in police investigations if the speaker has been subject to clinical attention and their condition is traceable through medical records (French and Stevens 2013: 186). Some argue that a speaker’s body size (height and weight) can also be discernible through measurement of their vowel formants (Jessen 2010: 382), while others argue that such physical characteristics ‘cannot be realistically addressed’ by forensic speaker profiling (French and Stevens 2013: 186).
Finally, speaker profiling is also central in cases where linguists, sometimes in combination with native speakers, perform language analysis for the determination of origin (LADO) to draw reasonable conclusions about the nationality – or more specifically the language of socialisation – of asylum seekers. LADO is becoming its own specialist sub-field of speaker profiling, and has a substantial and growing literature surrounding the practice, methods and cases involved (e.g. Eades and Arends 2004; Patrick 2010, 2012; Cambier-Langeveld 2010, 2014; Fraser 2009, 2011); see Chapter 6 for more details.
Ultimately, forensic speaker profiling can help narrow down the number of potential perpetrators responsible for committing a crime when the police do not have any suspects. Such cases are rare, however, in comparison with those in which the police have an incriminating speech sample and do have a suspect in mind. In such cases, the role of the forensic phonetician is one of speaker comparison.

Speaker comparison
2 The vast majority of the cases undertaken by forensic phoneticians involve
3 speaker comparison. These are cases where there is a voice recording of a person
4 committing a crime, and the police have identified one or more suspects, and the
5 phonetician is asked to express an opinion as to whether any of the suspect voices
6 is consistent with, or shares close similarities with, that of the criminal. A basic
7 problem to overcome is that there will always be differences between any two
8 speech samples, even when they come from the same speaker and are recorded
9 on the same machine and on the same occasion. So, the task for the forensic
10 phonetician involves being able to tell whether the inevitable differences between
11 samples are more likely to be within-speaker differences or between-speaker
12 differences (Rose 2002: 10). They measure the similarity and difference between
13 the samples, and estimate the relative likelihood of obtaining these measures in
14 the context of two competing hypotheses: that the samples were produced by the
15 same speaker, versus two different speakers. Or, in other words, the expert
16 considers the extent to which the evidence supports the prosecution (same-
17 speaker) versus defence (different-speaker) hypotheses (see also Chapter 10).
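The weighing of the two competing hypotheses can be illustrated with a likelihood ratio. The following is a minimal Python sketch, not the procedure of any particular laboratory. It assumes that the difference between two samples on a single parameter (here, mean pitch in Hz) is normally distributed around zero under both hypotheses, with a smaller spread for same-speaker pairs than for different-speaker pairs. The function names and the spread values (5 Hz and 20 Hz) are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def likelihood_ratio(difference, within_sd, between_sd):
    """Return p(difference | same speaker) / p(difference | different speakers).

    Assumption: under both hypotheses the measured difference is modelled as
    a zero-mean normal variable; only the spread differs between them.
    """
    same = normal_pdf(difference, 0.0, within_sd)
    different = normal_pdf(difference, 0.0, between_sd)
    return same / different


# Hypothetical spreads of mean-pitch differences (Hz), for illustration only.
WITHIN_SD = 5.0    # same speaker, different occasions
BETWEEN_SD = 20.0  # two different speakers

small_gap = likelihood_ratio(2.0, WITHIN_SD, BETWEEN_SD)   # above 1
large_gap = likelihood_ratio(30.0, WITHIN_SD, BETWEEN_SD)  # below 1
```

A ratio above 1 means the observed difference is more probable under the same-speaker hypothesis, and a ratio below 1 means it is more probable under the different-speaker hypothesis. In practice many parameters would be considered together, not a single measure.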
18 The criminal, ‘disputed’ or ‘questioned’ speech samples may be recordings of
19 telephone calls, or covert recordings made by police or witnesses. These criminal
20 recordings are then compared against known recordings of the suspect’s speech,
which most often consist of recorded police interviews. In comparing speech
22 samples and attempting to identify speakers, forensic phoneticians can draw upon
23 any and all of the speech parameters listed in Table 7.1 above, from voice quality
and pitch to vowel and consonant production. As Gold and French (2011: 302–3) emphasise, rather than one individual feature of speech being sufficient to distinguish between voices, it is the overall combination of features that is crucial in discriminating speakers.

Auditory and acoustic approaches


29 There are two major traditions for analysing and comparing speech samples, the
30 auditory and the acoustic. Auditory techniques consist of the forensic phonetician
31 listening to the speech samples and producing a narrow phonetic transcription
32 using the International Phonetic Alphabet (IPA). In doing so, they identify
33 features of speech that appear to be consistent in the voice of the offender in ques-
34 tion. Such analysis focuses predominantly on segmental features such as vowel
35 and consonant realisations and connected speech processes, but notes on intona-
36 tion, prosody, rhythm and voice quality may also be made from auditory analysis.
37 Foulkes and French (2012) detail a case in which a criminal recording comprised
38 a mere four seconds of speech transmitted across a building’s intercom system,
39 which was to be compared against known non-criminal recordings of a suspect in
40 the case. The transcription of the disputed sample is:

41 Text: I’ve come to see the lady at number two [operator’s turn removed] (I’m
42 fro)m the Home Care I’ve come to collect her sheet(s).

IPA: av ˈkʰʊm tsiːʔ ˈɫɛɪdjəʔ nʊmbə ˈ↑\tʰəʉuːː […] (…)mʔ ˈʌʊm kʰɛːɹ av ˈkʰʊm tʰə ˈkʰɫɛktʰ ə ˈʃɪiːːʔ
3 (Foulkes and French 2012: 564)

4 Despite being only four seconds in length, an IPA transcription of the disputed
5 recording identified as many as 12 pronunciation features characteristic of the
6 perpetrator’s voice. These included vowel realisations such as the reduction of
7 the diphthong /aɪ/ to a monophthong /a/ in the word I’ve, and the northern English
vowel /ʊ/ in the words number and come, among others. Observable consonant
9 pronunciations included glottalised /t/ in at and sheet and /h/-dropping in home
10 and her. As well as vowels and consonants, the experts also used the IPA tran-
11 scription to identify patterns in the speaker’s intonation (as in two), and found that
12 she used word-final linking /r/ between care and I’ve. Even though it was short,
13 the criminal recording proved to be a rich source of data. Furthermore, every one
14 of the potentially useful features identified in it was also found in the police
15 recordings of the suspect’s voice, providing evidence in support of the assertion
16 that the known and disputed samples were spoken by the same person.
17 In contrast to auditory analysis, where the expert’s focus is primarily on their
18 aural perception of segmental vowel and consonant production, acoustic analysis
19 involves the use of specialised computer software to quantify and measure
20 elements of speech. One such parameter which is commonly analysed using
acoustic methods is voice pitch. Such analyses can be presented visually, in spectrograms such as those in Figure 7.1 (see overleaf), which is a visualisation of two separate utterings of ‘what time’s the train?’ The spectrogram has the words written in ordinary orthography, but the actual pronunciation can be better represented using the International Phonetic Alphabet and removing word spaces, which are obviously not articulated. The first one may sound like /wɒttɑɪmzðətreɪn/, while the second seems to be a less formal pronunciation, without the ‘t’ at the end of ‘what’ pronounced, /wɒʔ/, and with the first consonant of the assimilated to the end of times, resulting in an initial /z/, /tɑɪmz zə/, giving a complete version of /wɒʔtɑɪmzzətreɪn/. Readers will note a thin line below the pitch printout, which indicates the carrying pressure of the air expelled from the lungs to create the individual sounds.
33 In the image, one can see all the different component pitches on the vertical
34 axis and how they change over time along the horizontal axis. Intensity, or
35 perceived loudness, is represented by a darkness scale – the darker the print, the
36 louder the sound. The average pitch of someone’s voice, their fundamental
37 frequency (abbreviated f0), can also be expressed as a numerical value in cycles
38 per second, as for instance in the observation that, say, the average f0 of a given
39 voice over time is 124 Hz. This number can then be compared with population
40 data to identify whether the person in question has a higher or lower pitched voice
41 than average. Vowel formants are also frequently measured in acoustic analysis,
42 correlating with how the vowels are articulated, and similar analysis can be
43 conducted to measure the duration and articulation of consonants. The results
produced by such computational methods are then subjected to human examination and evaluation by the phonetician when comparing those of the known and disputed speech samples.

Figure 7.1  Comparison of spectrograms of two utterings of ‘What time’s the train?’

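The comparison of an average f0 with population data can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the pitch track has already been extracted by acoustic software as one f0 value per frame, with unvoiced frames marked as 0 or None, and the population mean and standard deviation used here (115 Hz and 15 Hz) are hypothetical placeholders, not published reference values.

```python
from statistics import mean


def mean_f0(f0_track_hz):
    """Average fundamental frequency over voiced frames only.

    Unvoiced frames are assumed to be marked as 0 or None and are skipped.
    """
    voiced = [f for f in f0_track_hz if f]
    if not voiced:
        raise ValueError("no voiced frames in the track")
    return mean(voiced)


def z_score(value, population_mean, population_sd):
    """Number of standard deviations between a value and the population mean."""
    return (value - population_mean) / population_sd


# Hypothetical pitch track in Hz; 0.0 and None mark unvoiced frames.
track = [120.0, 0.0, 126.0, 124.0, None, 126.0]
speaker_f0 = mean_f0(track)  # 124.0 Hz

# Hypothetical population figures, for illustration only.
relative_pitch = z_score(speaker_f0, population_mean=115.0, population_sd=15.0)
```

A positive score indicates a voice that is higher pitched than the assumed population average, and a negative score a lower pitched one.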
3 Some practitioners (e.g. Hollien et al. 2014) argue that auditory analyses,
4 which they refer to as ‘aural-perceptual’, performed by humans are the most
5 accurate of all available methods. At the same time, even in the early stages of
6 technological developments, others observed that ‘in principle … the ear may be
7 inherently ill-equipped to pick up some differences between speakers, which
8 show up clearly in an acoustic analysis’ (Nolan 1994: 341). Indeed there is a
9 consensus, supported by the majority of the members of the International
10 Association for Forensic Phonetics and Acoustics (IAFPA), that forensic phoneti-
11 cians should use a mixed method, with the detailed type of auditory analysis and
a rigorous instrumental acoustic analysis reinforcing each other. The procedure and process of such mixed methods are explained in Watt (2010), Foulkes and French (2012), Eriksson (2012) and French and Stevens (2013). In their survey of
15 international forensic speaker comparison practices, in which they consulted 36
16 experts from 13 countries, Gold and French (2011) found that the mixed
17 ‘Auditory Phonetic cum Acoustic Phonetic analysis’ is the one most routinely
18 employed by practitioners, and is used in Australia, Austria, Brazil, China,
Germany, Netherlands, Spain, Turkey, UK and USA, as well as in universities, research institutes, and government/agency laboratories. Similar results were found in a survey of speaker identification practices used by global law enforcement agencies (Morrison et al. 2016).

‘Voiceprints’
24 At this point, a discussion of voiceprints is necessary. ‘Voiceprinting’ essentially
25 involves the visual matching of pairs of spectrograms, such as those in Figure 7.1,
26 showing the known and suspect speakers uttering the same word(s). In the USA
27 in the 1960s the dominant tradition was the ‘voiceprint’, a label deliberately
28 formed to echo and thereby borrow prestige from ‘fingerprint’:

closely analogous to fingerprint identification, which uses the unique features found in people’s fingerprints, voiceprint identification uses the unique features found in their utterances.
(Kersta 1962: 1253, as quoted in Rose 2002)

However, this method never achieved the same level of reliability as fingerprinting. The attraction of the spectrogram for this kind of ‘voiceprint analysis’ is that
35 it gives a ‘picture’ of the sounds spoken, but the fatal flaw of the voiceprinting
36 method was that it involved checking the degree of similarity between two spec-
37 trograms by eye. A major problem with this approach is that, as observed above,
there is always significant within-speaker variation. For example, if a speaker uttered ‘the train’ one hundred times in quick succession, no two utterings would be identical. You might like to spend a few moments trying to decide visually whether the two prints of utterings of ‘What time’s the train?’ in Figure 7.1 are from the
1 same or from different speakers. You will of course quickly realise that you don’t
2 know which bits to focus on nor what weight to give to dissimilarities. Both were
3 in fact produced by the same speaker, but using different accents. However,
4 neither of them is a disguise in the accepted sense, because both fall within the
5 speaker’s ‘active natural repertoire, [that is] he may shift quite unconsciously
6 between the two [accents] in response to perceived differences in the communica-
7 tive situation’ (French 1994: 172). Nevertheless, the two prints do look very
8 different.
9 Critics of the voiceprint approach note that its practitioners failed to publish an
10 explanation of the methodology (even when they later added an auditory compar-
11 ison as an integral component of the analysis), and asserted that this was because
12 there was no firm scientific basis to either of the components. They further
13 observed that for the auditory part of the comparison there was no evidence that
14 the analysts were performing any better than an experienced layperson – they
15 certainly didn’t have any professional training in descriptive or acoustic
16 phonetics.
17 Despite this, Koenig (1986), after reviewing 2,000 FBI cases stretching over a
18 15-year period, where voiceprints had been analysed, found that there was an
19 error rate of less than 1 per cent. Hollien, by contrast, claimed error rates of
20 between 20 per cent and 78 per cent in voiceprint analyses and reports that he has
21 testified in court that voiceprint evidence is ‘a fraud being perpetrated on the
22 American public and the Courts’ (1990: 210). In 1985, a Californian court
23 enquiry into voiceprint analysis concluded that ‘there exists no foundation for its
24 admissibility into evidence …’ (Rose 2002: 121). Despite this, voiceprint
25 evidence is still admissible in some American States and indeed it was presented,
26 although its admissibility was contested, during the 2013 Trayvon Martin case,
27 when George Zimmerman, a member of the local Community Watch, was
28 charged with shooting and murdering the unarmed black teenager, Trayvon
Martin (this case is discussed in some detail below). Interestingly, the FBI has a dedicated voiceprinting unit and uses voiceprints for investigative purposes, but does not permit the use of voiceprint evidence in court.
32 A distinction should be drawn between this ‘voiceprint’ analysis and the acous-
33 tic analysis used in combined auditory cum acoustic methods regularly used
34 today. ‘Mixed’ does not mean simply adding voiceprints to narrow phonetic
35 transcriptions, but rather using tools such as spectrograms in a very different way,
36 to focus not on the overall pattern, but on the acoustic make-up of (parts of) indi-
37 vidual sounds and the transitions between them. The auditory analysis identifies
38 possibly evidential linguistic features and these can then be probed quantitatively
39 by acoustic analysis. For example, particular realisations of vowels and conso-
40 nants may be revealed in an auditory analysis, such as that described with the
41 four-second intercom clip above, and then an acoustic analysis can be used to
42 systematically examine the specific way in which these sounds are articulated, for
43 example, by measuring and comparing vowel formants across samples. French
44 (1994: 177) exemplifies the use of spectrograms in such a way in a case involving
a stammerer. He notes that there are two kinds of stammer, one called prolongation, typically co-occurring with fricatives, when the consonant is lengthened, and the other called block, typically associated with plosives, when the consonant is arrested. Spectrograms allow the length of individual sounds to
4 be measured easily and so are ideal for such purposes. In French’s case the
5 suspect and the known sample not only shared the same two stammer phenom-
6 ena, associated with two particular fricatives, /s/ and /f/ and two particular
7 plosives, /t/ and /d/, but also shared the same average stammer durations.

Potential difficulties in forensic speech comparison


9 A number of factors can complicate forensic speaker comparison. The quality of
10 criminal recordings available might be very poor due to low quality recording
11 equipment or considerable background noise. In such cases, there are procedures
12 in place to clean or ‘enhance’ the recording before subjecting it to analysis
13 (Hollien 2002: 8). If the speaker was originally disguising their voice in some
14 way, either simply by using a different accent or by temporarily changing the
15 pitch of their voice, this can have ‘considerable detrimental effect on speaker
16 identification’ (Eriksson 2010: 87). Other potentially problematic factors include
17 when a suspect’s speech was affected by alcohol in the criminal recording (Schiel
18 and Heinrich 2015; Hollien et al. 2014), when the voice was originally transmit-
19 ted over the telephone (Yarmey 2003; Byrne and Foulkes 2004; Nolan et al.
20 2013) and when the suspect was shouting (Blatchford and Foulkes 2006) or
21 whispering (Bartle and Dellwo 2015). All these factors influence a speaker’s
22 voice, or at least the recording of their voice, in such a way as to complicate the
23 comparison of disputed with known interview recordings.
24 The trial of George Zimmerman, who was charged with, but acquitted of,
25 second-degree murder for shooting 17-year-old Trayvon Martin in Florida, USA
26 in February 2012, involved a particularly difficult and controversial case of foren-
27 sic voice identification. The evidence in the trial included an emergency 911 call
28 made by a local resident who observed the fatal encounter between Zimmerman
29 and Martin. In the background to the call a scream is clearly heard, and forensic
30 phoneticians were asked to determine whether the scream was that of Zimmerman,
31 who argued he was acting in self-defence, or Martin. In a pre-trial hearing, pros-
32 ecution evidence was submitted from forensic audio consultants who compared
33 the scream with recordings of ordinary speech of both Zimmerman and Martin
34 and judged that the screaming voice was Martin’s, not Zimmerman’s. This
35 speech evidence, therefore, would have been damaging to Zimmerman’s case of
36 self-defence. Zimmerman’s defence lawyers called four expert witnesses to
comment on the evidence put forward by the prosecution. The experts contested
38 the methods used by the prosecution experts, arguing they were not reliable and
39 that the evidence should not be admitted. One of the defence experts, Peter
40 French, emphasised the difficulty (or impossibility) of comparing screaming with
41 recordings of normal speech, given that screams do not include any of the speech
42 parameters required for voice comparison. Therefore, he argued that the recorded
evidence in this case was not ‘remotely suitable for speaker comparison purposes’. At the conclusion of the pre-trial hearing the judge excluded the testimonies of the prosecution expert witnesses.

Automatic Speaker Recognition (ASR)


4 The desire to automate forensic speaker comparison is growing ever stronger, and
5 Automatic Speaker Recognition (ASR) systems are becoming increasingly popu-
6 lar with practitioners in Europe and the US (Gold and French 2011: 296). As the
7 name suggests, these methods rely on automated computational approaches in
8 comparing voice samples. ASR systems work by:

taking a known (suspect) recording, performing complex mathematical transformations on it and reducing it to a statistical model […]. The recording of the questioned voice (the criminal recording) is similarly processed and a set of features is extracted. The system then compares the extracted features with the statistical model of the suspect’s voice and produces a measure of similarity/difference (distance) between the two.
(French and Stevens 2013: 188)
16 The main way in which these automated methods differ from the combined audi-
17 tory and acoustic techniques is the amount of human input in the comparison of
18 voices. Although there are still decisions to be made regarding the excerpts of a
19 recording that are to be analysed automatically, the human interpretation of
20 features and judgements about degrees of similarity and difference between
samples is removed. This objectivity and replicability are attractive to courts, and as French and Stevens (2013: 188) note, ASR can perform in seconds a comparison that would take many hours using the combined auditory-acoustic approach.
24 In addition, it is well established that automated techniques are very accurate in
25 comparing and recognising voices when operating under ideal conditions (Rose
26 2002: 95). However, forensic speech evidence is rarely ideal. The influence of
27 such issues as poor quality recordings and speaker disguise on the ability to iden-
28 tify voices is exacerbated when relying solely on automatic systems (e.g.
29 Eriksson 2010). Similarly, as pointed out by Foulkes and French (2012: 565), the
30 four-second intercom recording analysed above would be far too short for any
31 automated analysis, despite being a rich source of features of the suspect’s
32 speech. The features extracted and considered in ASR systems are related to the
33 acoustic signal produced by ‘vocal tract resonances arising from the geometry of
34 individual cavities’ (French and Stevens 2013: 189) and are not easily translata-
35 ble into the segmental and suprasegmental features that phoneticians are well
36 trained in analysing and interpreting. As a result automated methods are unable
37 to draw on the distinctive speech features of individuals which are central to
38 auditory-acoustic methods. It is for this reason that it is generally agreed that ASR
39 approaches used alone cannot replace the valuable evidence obtained through
40 close phonetic analysis performed by humans (e.g. Eriksson 2012: 46). However,
while the use of ASR approaches alone is generally rejected by forensic speech scientists, it is now considered a valuable part of the expert’s analytical toolkit when approaching a forensic speaker comparison problem, and can be (and most commonly is) used in combination with auditory and/or acoustic approaches
4 (e.g. Cambier-Langeveld 2007: 240; Gold and French 2011: 296; Eriksson 2012:
5 49; French and Stevens 2013: 191; Morrison et al. 2016).

Naïve speaker recognition, earwitnesses and voice parades


In 1932, the baby son of the American aviator Charles Lindbergh, famous as the
8 first man to fly solo across the Atlantic, was kidnapped and later found murdered,
9 but not before a ransom had been demanded and paid. Eventually the police
10 arrested and charged a suspect. Lindbergh had talked to the kidnapper twice, once
11 on the telephone, which in those days would not have provided a very good repro-
12 duction, and once in person, briefly and at night, while handing over the ransom
13 money. Almost three years later, when the case came to trial, Lindbergh testified
14 that he recognised the voice of the accused as being that of the man he had talked
15 to. The defence set out to challenge his testimony and employed a psychologist
16 to discover what was and what was not possible in terms of memory for voices.
17 This Lindbergh case is an example of what we now call ‘Naïve Speaker
18 Recognition’ which involves laypeople, untrained in forensic speech science
19 techniques, making judgements about voices in legal cases. There is now a vast
20 literature on how to evaluate such judgements. In the process of collecting and
21 preparing evidence, the largest role played by naïve speakers is in ‘voice line-
22 ups’ or ‘voice parades’, where people act as ‘earwitnesses’ to a crime. Hollien
23 et al. (2014: 91) define earwitness line-ups or voice parades as:

24 a process where a person who has heard, but not seen, a perpetrator attempts
25 to pick his or her voice from a group of voices. […] the witness listens to the
26 suspect’s exemplar embedded in a group of four to six similar samples
27 produced by other people.

28 Nolan (2003) reports an earwitness case in which a voice parade and naïve
29 speaker recognition contributed significantly. In November 2001, a woman died
30 in a house fire in London, which police suspected to have been an arson attack by
31 a man who had previously had a relationship with the woman. After the fire, a
32 lodger in the man’s house told police that he had overheard his landlord commis-
33 sioning a young man to carry out the arson attack on the evening it happened
34 (Nolan 2003: 277). The lodger claimed to recognise the voice of the unidentified
35 young man from previous visits. Shortly afterwards, the lodger’s landlord and the
36 young man became defendants in a murder investigation. A voice parade was
37 carried out using voice samples of the suspect taken from police interviews and
38 samples from police interviews with other young men from the same London
39 Asian community as ‘foils’ (non-suspects used for comparison purposes). The
40 witness not only identified the suspect correctly from the voice parade, but also
41 from an identity parade. Both men were eventually convicted of murder.

1 Voice line-ups should be fair to both sides of a case and should only be used if
2 they are likely to produce reliable results. However, Nolan (2003: 187) warns that
3 ‘it is very difficult to achieve a voice parade whose fairness cannot be called into
4 question for one reason or another’. Early work on voice parades produced prom-
5 ising results. For example, Künzel (1994: 55) noted that ‘[Speaker Identification]
by non-experts may attain a high degree of reliability under favourable circumstances’. Two years later, Nolan and Grabe (1996), drawing on earlier work by Broeders and Rietveld (1995), concluded that ‘a carefully carried out voice parade
9 should … be capable of contributing usefully to the balance of evidence’. Recent
10 commentary has focused on what constitutes a ‘carefully carried out voice
11 parade’, and has examined the accuracy with which laypeople can recognise
12 voices.

Set-up of voice parades


14 In the arson case discussed above, Nolan was initially contacted by Detective
15 Sergeant John McFarlane of the Metropolitan Police. As a result of the voice line-up
method developed and applied in that case, the procedure and guidelines McFarlane implemented became an example of good practice, and since 2003 have been presented by the Home Office to all police forces in England and Wales as ‘Advice on the use of voice identification parades’. The advice can be summarised as:

1 The officer in charge should obtain a detailed statement from the witness, containing as much detail and description of the voice as is possible. All descriptions of the voice given by the witness must be included in the material supplied to the relevant forensic phonetics/linguistics expert, the suspect and solicitors.
2 Under no circumstances should an attempt be made to conduct a live voice identification procedure, using live suspect and foils.
3 The identification officer should obtain a representative sample of the suspect’s voice. Such samples might include police recorded interview tapes, during which the suspect is speaking naturally and responding to questions. Under no circumstances should the suspect be invited to read any set text.
4 The identification officer should obtain no less than 20 samples of speech, from persons of similar age and ethnic, regional and social background as the suspect. A suitable source of such material may be other police recorded interview tapes from unconnected cases.
5 The officer should ensure that all the work can be undertaken and completed within 4-6 weeks of the incident in question, as memory degradation or ‘fade’ on the part of the witness has been identified as a critical factor by experts in the field.
6 The identification officer should request the services of a force-approved expert witness in phonetics/linguistics, for example, a Member of the International Association for Forensic Phonetics and Acoustics, to ensure that the final selection and compilation of sample voices and the match with the suspect’s is as accurate and balanced as possible.

(These guidelines can be accessed from: https://webarchive.nationalarchives.gov.uk/20130125102358/http://www.homeoffice.gov.uk/about-us/corporate-publications-strategy/home-office-circulars/circulars-2003/057-2003, or in Nolan 2003)
6 These procedures are offered as recommendations, rather than being manda-
7 tory. Nonetheless, they offer very clear and helpful guidelines for the successful
8 implementation of voice parades, and they were clearly developed in close
9 communication with experts in the field. As Nolan (2003: 288) notes, however,
practical difficulties remain in the setting up of voice parades. In light of contemporaneous and more recent research, there are particular points in these guidelines which require further discussion.
13 First, people’s descriptions of voices in the first instance may be unhelpful or
14 unreliable. In contrast with identity parades, lay witnesses find it much harder to
15 describe voices than they do faces, and ‘exhibit a wide variety of subjective
16 categories’ (Künzel 1994: 48). Künzel notes that, while in some cases witnesses’
17 descriptions are so precise that the expert only needs to convert them into scien-
18 tific terminology, other subjects are unable to indicate any categories for their
19 judgements. Second, it may not be an easy enterprise to source representative
20 samples of the suspect’s voice. The guidelines suggest that police interviews
21 should be used, but it may be difficult to extract samples which do not include
22 any incriminating utterances. Furthermore, a question remains over what a ‘repre-
23 sentative’ speech sample is. It might be that the context in which the witness
24 initially heard the suspect’s voice is dramatically different from the controlled
25 situation of the voice parade. For example, there is research demonstrating the
26 effect of voice disguise and imitation on naïve speaker recognition (Schlichting
27 and Sullivan 1997; Eriksson et al. 2010), suggesting that if the suspect disguised
28 their voice during a crime, witnesses would be less likely to identify them in a
29 voice parade. Also, while the guidelines suggest that ‘under no circumstances
30 should the suspect be invited to read any set text’, recent experimental research
31 has found that naïve speaker recognition accuracy is actually better when speak-
32 ers read from a text book than if they are involved in a dialogue (Sarwar et al.
33 2014).
34 The guidelines are clear that the foils used in voice parades should come from
35 people who are of similar age and ethnic, regional and social background to the
36 suspect. In sociolinguistics it is commonly held that social factors such as age,
37 ethnicity and gender are not determinants of how people will talk, but are
38 resources that speakers use when creating voices (Johnstone 1996: 11). Therefore,
although people may share a whole range of social characteristics, that does not
40 mean their voices will sound the same. This problem will be partially overcome
41 by the consultation of a forensic phonetician in deciding the appropriateness of
foil samples (point 6 in the guidelines), and Nolan and Grabe (1996) outline a pre-parade experiment they ran with a group of listeners to judge the similarity of foils and suspects and to control for bias. Nolan and Grabe (1996: 92) state that
1 the standard assumption has been that foils should be similar to the actual suspect.
2 This process involves some decision regarding how similar the foils should be to
3 the suspect, and could result in an ‘ideal’ line-up consisting of a set of voices so
4 close that identification would be virtually impossible (e.g. Hollien 2002: 100).
5 However, they point out that an alternative, derived from suggestions for eyewit-
6 ness line-ups by Wells (1993), is to match foils to the witness’s description of the
voice of the offender. The issue here is the accuracy and usefulness of laypeople’s descriptions of voices.
9 Finally, the guidelines recommend that the voice line-ups should take place
within four to six weeks of the incident in question, because a witness’s memory of the voice fades, or ‘lapses’. Early research, however, suggests that
12 memory lapsing occurs very quickly; McGehee (1937) reported 87 per cent
13 correct identification after two days, falling to 13 per cent after five months. While
14 sooner is always better for voice parades (Hollien 2002: 102), very soon is not
15 always possible in the forensic context. Furthermore, time pressure could exacer-
16 bate the other challenges posed by the selection of speaker samples and foils.
17 Nevertheless, these guidelines developed by Nolan (2003) and the police offer a
18 very useful benchmark against which practice can be measured

Ability of laypeople to recognise voices


Rose (2002: 97) states, ‘common experience suggests that untrained human
listeners can and do make successful judgements about voices’. He continues:
‘the natural human ability to recognise and identify voices has been accepted in
courts for several centuries’. Others, however, are less optimistic. Solan and
Tiersma (2005: 119) argue that ‘the unreliability of earwitness identification has
gone virtually unnoticed’, at least by the US legal system. Hollien et al. (2014:
174) accept that if competent personnel are used, ‘results can be both robust and
reasonably accurate’. Elsewhere, however, Hollien points out that ‘some listeners are
simply better at identification than others’ (2002: 102), and research has
found that untrained listeners perform significantly worse than trained listeners in
speaker identification (Schiller and Köster 1998).
The reason for this variability in success is that, as well as the factors affecting
the organisation of voice parades discussed above, there is a wide range of
factors which affect individuals’ abilities to recognise voices. Firstly, it is well
established that there are significant differences in recognition success depending
on whether it is a familiar or an unfamiliar voice. Rose (2002: 98–99), for example,
reports experiments which show listeners being twice as successful in
correctly recognising familiar voices. Secondly, even with familiar voices, listeners
make mistakes roughly one-third of the time. Thirdly, one cannot extrapolate
from these scores for average success to the likely success of a given witness
being able to recognise a known voice, because there is massive individual variation;
listener success in one experiment which was testing the ability to recognise
25 famous voices ranged ‘all the way from totally correct (100%) to chance
(46%)’ (Rose 2002: 100). Witness emotion and stress can also influence their
memory and recollection of events; Hollien (2002: 101) reports that subjects who
have been stressed or aroused do better at recognising speakers than those who
have not. Exposure to the suspect voice has also been found to affect people’s
abilities to recognise that voice. Listeners are better able to recognise voices they
have heard for longer periods (e.g. Yarmey 1991), and those which they have
been exposed to more frequently (Deffenbacher et al. 1989). Biological characteristics
of the witness have also been found to influence their success in naïve
speaker recognition. Generally speaking, women outperform men when identifying
the voices of other women, people aged between 21 and 40 years are superior
to older listeners, and witnesses are better at identifying the voices of people of
the same ethnicity or race (Yarmey 2012: 549–50). Finally, some of the most
recent research in naïve speaker recognition and earwitness accuracy has found
that some types of voices are easier to identify than others. For example, Sørensen
(2012) found that in a group of young men between 20 and 35 years of age, ‘less
common’ voices, so called because their average voice pitch (fundamental
frequency) is markedly higher or lower than the average of the group, are easier
to identify in voice line-ups than ‘common’ voices.
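
The idea of a ‘less common’ voice can be made concrete with a small calculation. The sketch below is only an illustration: the voice labels, the fundamental frequency values and the one-standard-deviation threshold are all invented for the example, and none of them is taken from Sørensen (2012), whose own criterion for ‘markedly higher or lower’ is not given here. It simply measures how far each voice’s average pitch lies from the group average.

```python
from statistics import mean, stdev

# Hypothetical mean fundamental frequency (F0, in Hz) for six voices in a line-up.
# These values are invented for illustration only.
mean_f0_hz = {
    "voice_A": 104.0,
    "voice_B": 118.0,
    "voice_C": 121.0,
    "voice_D": 115.0,
    "voice_E": 152.0,
    "voice_F": 119.0,
}


def f0_z_scores(f0_by_voice):
    """Return each voice's distance from the group mean F0, in standard deviations."""
    values = list(f0_by_voice.values())
    group_mean = mean(values)
    group_sd = stdev(values)  # sample standard deviation of the group
    return {name: (f0 - group_mean) / group_sd for name, f0 in f0_by_voice.items()}


def less_common_voices(f0_by_voice, threshold=1.0):
    """List voices whose mean F0 lies more than `threshold` SDs from the group mean.

    The threshold is an assumption made for this sketch, not a published criterion.
    """
    scores = f0_z_scores(f0_by_voice)
    return sorted(name for name, z in scores.items() if abs(z) > threshold)


print(less_common_voices(mean_f0_hz))
```

With these invented figures, the lowest-pitched and highest-pitched voices are the ones flagged. A stricter or looser threshold would change which voices count as ‘less common’, which is one reason the cut-off should be stated explicitly in any experiment of this kind.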
Given the vast array of factors that can influence the accuracy and reliability
of naïve speaker recognition and voice parades, some of which are beyond the
control of the police officers and phoneticians, it is generally agreed that evidence
obtained through earwitness testimony should be treated with caution by the
courts (e.g. Nolan 2003: 228; Yarmey 2012: 556). Voice line-ups are used,
however, and can produce critical evidence in cases where they are carefully
designed and controlled, and reasonable speech samples and foils are available,
as demonstrated by Nolan (2003).

Conclusion
You may be wondering how you could contribute to the field of forensic phonetics.
Transcription and voice parade research can be undertaken by students with a
working knowledge of phonetics, and careful engagement with the frameworks and
cases outlined in this chapter. To a certain extent, this is also true of speaker profiling,
as work in areas such as perceptual dialectology has found that people with no
background in studying linguistics at all can make accurate and nuanced geographical
observations about the voices that they hear. The field of forensic speaker
comparison may seem harder to penetrate, however, especially when expertise in
phonetics needs to be combined with skills in computing and programming to operate
acoustic analysis software. Nonetheless, for the student of linguistics, at least,
forensic speaker identification offers a real-world, high-stakes application of your
knowledge and understanding of phonetics and the human voice.

Further reading
Rose (2002) (chapters 2, 6, 5, 7 and 10 in that order); Hollien (2002); Jessen (2010); Watt
(2010); Foulkes and French (2012); and French and Stevens (2013).

Research tasks
1 Record a friend or family member speaking with you in natural conversation. Gather
some participants (who don’t know the speaker) for an experiment, and have them listen
to the recording a number of times. Ask them to deduce as much as they can about the
speaker’s ‘profile’ from that recording, including sex, gender, ethnicity, region and body
size. How accurate are they? How confident are they? On what basis are they making
their decisions? You might like to record a number of sections of different lengths, and
observe whether your participants’ predictions are more accurate with longer recordings.
Be careful that the content of the recordings doesn’t give anything away!
2 Find or create a recording of a professional mimic or impersonator producing the voice
of a famous person. Then, collect authentic samples of the famous person along with
other, non-professional impersonators imitating the voice. Construct a voice parade, in
which you organise for participants to listen to short extracts of recordings you have for
(i) the actual famous person, (ii) the professional impersonator and (iii) non-professional
impersonators. Ask them to identify which recording is of the actual famous person. Are
they able to tell? How confident are they, and on what basis are they making their
identification? Try it with different famous people. Are some, more familiar, people easier
to identify? Why?
3 Record a natural conversation between yourself and a friend or family member. Now,
with that same person, perform and record a structured interview that you have
designed. The interview can be about anything, but it is important that they are answering
questions that you ask them. Now, find ten participants for an experiment. Choose
short sections of both the natural conversation and the interview, ideally parts that don’t
include your own voice. Give these sections to your participants, and ask them to
compare the person’s voice across the two different types of recording. Are they able to
identify the voices they hear as being from the same speaker? How confident are they?
What are they basing their comparison on? You might like to also run a similar test, but
in which the speakers in the natural conversation and the interview have similar voices
but are actually different speakers. Are your participants able to distinguish between the
voices?