Forensic phonetics
Police and security services are trying to identify a suspected British jihadist who appeared in footage of the killing of a US journalist. […] Unconfirmed reports suggest the man in the video […] is from London or south-east England and may have guarded Islamic State captives.
(BBC News, 21 August 2014)

The screams are clearly coming from a distraught male, whose repeated cries for help end abruptly with a gunshot. What is not clear from a recording of a 911 call […] is the identity of the screamer: George Zimmerman, the volunteer community watchman, or Trayvon Martin, the unarmed 17-year-old he killed…
(New York Times, 22 June 2013)
In other words, the listener does not necessarily hear what was said, but rather hears their construction of what they think was said; they subconsciously combine the speech signal (the sounds) with prior knowledge of speech, language and context in their own heads (Fraser 2003: 206). This is exemplified by Fraser (2014: 13–14) as she describes an earlier study by Bruce (1958), in which participants listened to a number of sentences, partially ‘masked’ with a hissing noise, after being given a key word as to the content of the recording, such as ‘sport’ or ‘weather’. Listeners heard the same masked sentences, but with different key words. What was found was that their perceptions of what they heard changed in response to the key words, despite the fact that they had listened to the same sentences. Such difficulties are compounded when the recordings are unclear or of a low quality, most commonly a result of poor recording equipment or extensive background noise (Fraser 2014: 9).

Fraser et al. (2011) exemplify the challenges posed by listener perceptions of unclear recordings with experimental results using a recording from a real
… but if it’s as you say it’s hallucinogenic, it’s in the Sigma catalogue
… but if it’s as you say it’s German, it’s in the Sigma catalogue.
In another case, a murder suspect with a very strong West Indian accent was transcribed as saying, in a police interview, that he ‘got on a train’ and then ‘shot a man to kill’; in fact what he said was the completely innocuous and contextually much more plausible ‘show[ed] a man ticket’ (Peter French, personal communication).

French (in Baldwin and French 1990) reports a much more difficult case, which appeared to turn on the presence or absence of a single phoneme, the one that distinguishes can from can’t. Most readers, if they record themselves reading these two words aloud, will notice not one but two phonemic differences between their pronunciations of the words – the absence/presence of a /t/ and a different vowel phoneme. Using a received pronunciation (RP) or near-RP British accent, at least when the words are produced as citation forms, the vowel in can’t is also longer. This contrasts with many North American accents of English, in which the vowels in can and can’t are more similar. However, in an ordinary speech context, as in the phrase ‘I can’t refuse’, the /t/ often disappears and the vowel is shortened, so that the phonetic difference between the two words is very significantly reduced. In French’s case a doctor, who spoke English with a strong Greek accent, had been surreptitiously tape-recorded apparently saying, whilst prescribing tablets to someone he thought was a drug addict, ‘you can inject those things’. He was prosecuted for irresponsibly suggesting that the patient could grind up the pills and then inject them. His defence was that he had actually said just the opposite, ‘you can’t inject those things’. An auditory examination of the tape-recording showed that there was certainly no hint of a /t/ at the end of the ‘can’ word and thus confirmed the phonetic accuracy of the police transcription. However, the question remained: was the transcription morphologically incorrect; that is, was the doctor intending to say and actually producing his version of can’t? Auditory analysis of a taped sample of the doctor’s speech showed that
Parameter – Description
Pitch – Caused by the vibration of the vocal folds; the faster the vibration, the higher the pitch (e.g. 124 Hz).
Voice quality – A general set of characteristics which are a product of the configuration of the speaker’s vocal folds and vocal tract (e.g. breathy, nasal, creaky, shimmer or vocal fry) (Jessen 2010: 391).
Articulatory setting – The medium- to long-term setting of all the articulators (e.g. tongue, hard/soft palate, teeth, lips) in relation to one another, which results in different pronunciations of sounds across speakers (O’Grady 2013: 13).
Intonation and prosody – Patterns or melody of pitch changes or stress across stretches of connected speech (e.g. rising rather than falling intonation at the end of declarative sentences).
Rhythm – Relates to the timing and length of stress and syllables in speech.
Speaking rate/tempo – The speed with which someone talks, measured in words per minute or syllables per second.
Vowels – Distinctive realisations of sounds in which there is unimpeded airflow (e.g. the pronunciation of [aɪ] in <time>).
Consonants – Distinctive realisations of sounds which involve restriction or closure of the vocal tract (e.g. the pronunciation of [ɫ] in <milk>).
Connected speech processes – Presence or absence of phonological processes across word boundaries (e.g. assimilation, linking, elision) (Knight 2012: 197).
Pathological features – Medical conditions which have long-term effects on speech production, such as stuttering and sigmatism (Jessen 2010: 382).
Speaker profiling
There are times when the police have a recording of a criminal’s voice, either committing or confessing to a crime, but have no suspect, and are thus anxious to glean any information at all that might enable them to narrow down the group of potential suspects. Examples might include an obscene phone call, a ransom demand, a bomb threat, extortion, or an audio recording of an attack or murder. In such cases, the forensic phonetician may be asked to undertake ‘speaker profiling’. We have already mentioned in the Introduction one of the earliest high-profile cases, dating from 1979, that of the Yorkshire Ripper, where the forensic phonetician was amazingly successful in placing the speaker regionally, but such cases are not uncommon.

The first quotation in the epigraph of this chapter is a BBC News report of a video that emerged online and through mainstream news networks of the beheading of journalist James Foley at the hands of an Islamic State terrorist. After this video, a series of others were released apparently showing the same man murdering British and American journalists and aid workers. Upon the emergence of
Text: I’ve come to see the lady at number two [operator’s turn removed] (I’m fro)m the Home Care I’ve come to collect her sheet(s).
Despite being only four seconds in length, an IPA transcription of the disputed recording identified as many as 12 pronunciation features characteristic of the perpetrator’s voice. These included vowel realisations such as the reduction of the diphthong /aɪ/ to a monophthong /a/ in the word I’ve, and the northern English vowel /ʊ/ in the words number and come, among others. Observable consonant pronunciations included glottalised /t/ in at and sheet and /h/-dropping in home and her. As well as vowels and consonants, the experts also used the IPA transcription to identify patterns in the speaker’s intonation (as in two), and found that she used word-final linking /r/ between care and I’ve. Even though it was short, the criminal recording proved to be a rich source of data. Furthermore, every one of the potentially useful features identified in it was also found in the police recordings of the suspect’s voice, providing evidence in support of the assertion that the known and disputed samples were spoken by the same person.
In contrast to auditory analysis, where the expert’s focus is primarily on their aural perception of segmental vowel and consonant production, acoustic analysis involves the use of specialised computer software to quantify and measure elements of speech. One such parameter which is commonly analysed using acoustic methods is voice pitch. Such analyses can be presented visually, in spectrograms such as those in Figure 7.1 (see overleaf), which is a visualisation of two separate utterings of ‘what time’s the train?’ The spectrogram has the words written in ordinary orthography, but the actual pronunciation can be better represented using the International Phonetic Alphabet and removing word spaces which are obviously not articulated. The first uttering may sound like /wɒttɑɪmzðətreɪn/, while the second seems to be a less formal pronunciation in which the ‘t’ at the end of ‘what’ is not released, /wɒʔ/, and the first consonant of the is assimilated to the end of time’s, resulting in an initial /z/ (/tɑɪmz zə/) and giving a complete version of /wɒʔtɑɪmzzətreɪn/. Readers will note a thin line below the pitch printout, which indicates the carrying pressure of the air expelled from the lungs to create the individual sounds.

In the image, one can see all the different component pitches on the vertical axis and how they change over time along the horizontal axis. Intensity, or perceived loudness, is represented by a darkness scale – the darker the print, the louder the sound. The average pitch of someone’s voice, their fundamental frequency (abbreviated f0), can also be expressed as a numerical value in cycles per second, as for instance in the observation that, say, the average f0 of a given voice over time is 124 Hz. This number can then be compared with population data to identify whether the person in question has a higher or lower pitched voice than average. Vowel formants are also frequently measured in acoustic analysis, correlating with how the vowels are articulated, and similar analysis can be conducted to measure the duration and articulation of consonants. The results produced by such computational methods are then subjected to human examination and evaluation by the phonetician when comparing those of the known and disputed speech samples.
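To make the kind of measurement just described concrete, the sketch below estimates a speaker’s mean f0 from a recording and reads off a first-formant value. It is a minimal illustration only, written in Python using the parselmouth library (an interface to the Praat software widely used in phonetics); the file name and the 120 Hz reference figure are purely illustrative assumptions, not values taken from any case discussed in this chapter.

# Minimal sketch: mean f0 and one formant measurement with parselmouth
# (pip install praat-parselmouth). 'questioned.wav' and REFERENCE_F0 are
# illustrative placeholders, not data from this chapter.
import numpy as np
import parselmouth

REFERENCE_F0 = 120.0  # illustrative population value in Hz

snd = parselmouth.Sound("questioned.wav")       # load the recording
pitch = snd.to_pitch()                          # Praat's pitch tracker
f0_values = pitch.selected_array['frequency']   # one f0 estimate (Hz) per analysis frame
voiced = f0_values[f0_values > 0]               # unvoiced frames are returned as 0

mean_f0 = np.mean(voiced)
comparison = "above" if mean_f0 > REFERENCE_F0 else "below"
print(f"Mean f0: {mean_f0:.1f} Hz ({comparison} the illustrative reference of {REFERENCE_F0} Hz)")

# First formant (F1) at the temporal midpoint, as one example of a formant measurement
formants = snd.to_formant_burg()
midpoint = (snd.xmin + snd.xmax) / 2
print(f"F1 at midpoint: {formants.get_value_at_time(1, midpoint):.0f} Hz")

In a genuine comparison the phonetician would, as described above, interpret such figures against population data and against measurements from the known sample, rather than reading anything into a single number.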
Some practitioners (e.g. Hollien et al. 2014) argue that auditory analyses, which they refer to as ‘aural-perceptual’, performed by humans are the most accurate of all available methods. At the same time, even in the early stages of technological developments, others observed that ‘in principle … the ear may be inherently ill-equipped to pick up some differences between speakers, which show up clearly in an acoustic analysis’ (Nolan 1994: 341). Indeed there is a consensus, supported by the majority of the members of the International Association for Forensic Phonetics and Acoustics (IAFPA), that forensic phoneticians should use a mixed method, with the detailed type of auditory analysis and a rigorous instrumental acoustic analysis reinforcing each other. The procedure and process of such mixed methods are explained in Watt (2010), Foulkes and French (2012), Eriksson (2012) and French and Stevens (2013). In their survey of international forensic speaker comparison practices, in which they consulted 36 experts from 13 countries, Gold and French (2011) found that the mixed ‘Auditory Phonetic cum Acoustic Phonetic analysis’ is the one most routinely employed by practitioners, and is used in Australia, Austria, Brazil, China, Germany, the Netherlands, Spain, Turkey, the UK and the USA, as well as in universities, research institutes and government/agency laboratories. Similar results were found in a survey of speaker identification practices used by global law enforcement agencies (Morrison et al. 2016).

‘Voiceprints’
At this point, a discussion of voiceprints is necessary. ‘Voiceprinting’ essentially involves the visual matching of pairs of spectrograms, such as those in Figure 7.1, showing the known and suspect speakers uttering the same word(s). In the USA in the 1960s the dominant tradition was the ‘voiceprint’, a label deliberately formed to echo and thereby borrow prestige from ‘fingerprint’:

However, this method never achieved the same level of reliability as fingerprinting. The attraction of the spectrogram for this kind of ‘voiceprint analysis’ is that it gives a ‘picture’ of the sounds spoken, but the fatal flaw of the voiceprinting method was that it involved checking the degree of similarity between two spectrograms by eye. A major problem with this approach is that, as observed above, there is always significant within-speaker variation. For example, if a speaker uttered ‘the train’ one hundred times in quick succession, no two utterings would be identical. You might like to spend a few moments trying to decide visually whether these two prints of utterings of ‘What time’s the train?’ are from the
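For readers curious how such prints are produced, the short sketch below generates grey-scale spectrograms of two recordings side by side, of the kind that voiceprint examiners compared by eye. It is a minimal illustration using the Python libraries librosa and matplotlib; the file names are placeholder assumptions and the display settings are only one of many possible choices.

# Minimal sketch: side-by-side spectrograms for visual comparison.
# 'utterance1.wav' and 'utterance2.wav' are placeholder file names.
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, path in zip(axes, ["utterance1.wav", "utterance2.wav"]):
    y, sr = librosa.load(path, sr=None)                                 # keep original sampling rate
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max) # magnitude spectrogram in dB
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax, cmap="gray_r")
    ax.set_title(path)
plt.tight_layout()
plt.savefig("spectrogram_comparison.png")

Even when the same speaker repeats the same words, the two images will differ in their details; this is exactly the within-speaker variation that makes purely visual matching so unreliable.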
The main way in which these automated methods differ from the combined auditory and acoustic techniques is the amount of human input in the comparison of voices. Although there are still decisions to be made regarding the excerpts of a recording that are to be analysed automatically, the human interpretation of features and judgements about degrees of similarity and difference between samples are removed. This objectivity and replicability are attractive to courts, and, as French and Stevens (2013: 188) note, ASR can perform in seconds a comparison that would take many hours using the combined auditory-acoustic approach. In addition, it is well established that automated techniques are very accurate in comparing and recognising voices when operating under ideal conditions (Rose 2002: 95). However, forensic speech evidence is rarely ideal. The influence of such issues as poor quality recordings and speaker disguise on the ability to identify voices is exacerbated when relying solely on automatic systems (e.g. Eriksson 2010). Similarly, as pointed out by Foulkes and French (2012: 565), the four-second intercom recording analysed above would be far too short for any automated analysis, despite being a rich source of features of the suspect’s speech. The features extracted and considered in ASR systems are related to the acoustic signal produced by ‘vocal tract resonances arising from the geometry of individual cavities’ (French and Stevens 2013: 189) and are not easily translatable into the segmental and suprasegmental features that phoneticians are well trained in analysing and interpreting. As a result, automated methods are unable to draw on the distinctive speech features of individuals which are central to auditory-acoustic methods. It is for this reason that it is generally agreed that ASR approaches used alone cannot replace the valuable evidence obtained through close phonetic analysis performed by humans (e.g. Eriksson 2012: 46). However, while the use of ASR approaches alone is generally rejected by forensic speech
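To give a concrete sense of the kind of features such automated systems operate on, the sketch below extracts mel-frequency cepstral coefficients (MFCCs), one feature type commonly used in automatic speaker recognition, and compares two recordings with a simple cosine similarity. This is a toy illustration only, written in Python with the librosa library; the file names are placeholder assumptions, and a real ASR system models the distribution of such features statistically rather than comparing averaged vectors.

# Toy sketch of the acoustic features underlying automatic speaker recognition.
# 'known.wav' and 'disputed.wav' are placeholder file names.
import numpy as np
import librosa

def mean_mfcc(path, n_mfcc=13):
    """Load a recording and return its MFCC vector averaged over all frames."""
    y, sr = librosa.load(path, sr=None)                      # keep original sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)

known = mean_mfcc("known.wav")
disputed = mean_mfcc("disputed.wav")

# Cosine similarity between the averaged feature vectors (1.0 = identical direction)
similarity = np.dot(known, disputed) / (np.linalg.norm(known) * np.linalg.norm(disputed))
print(f"Cosine similarity of mean MFCC vectors: {similarity:.3f}")

As the passage above stresses, such a score says nothing about identity on its own; it would have to be evaluated against the degree of similarity found between different speakers and within the same speaker.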
a process where a person who has heard, but not seen, a perpetrator attempts to pick his or her voice from a group of voices. […] the witness listens to the suspect’s exemplar embedded in a group of four to six similar samples produced by other people.
Nolan (2003) reports an earwitness case in which a voice parade and naïve speaker recognition contributed significantly. In November 2001, a woman died in a house fire in London, which police suspected to have been an arson attack by a man who had previously had a relationship with the woman. After the fire, a lodger in the man’s house told police that he had overheard his landlord commissioning a young man to carry out the arson attack on the evening it happened (Nolan 2003: 277). The lodger claimed to recognise the voice of the unidentified young man from previous visits. Shortly afterwards, the lodger’s landlord and the young man became defendants in a murder investigation. A voice parade was carried out using voice samples of the suspect taken from police interviews and samples from police interviews with other young men from the same London Asian community as ‘foils’ (non-suspects used for comparison purposes). The witness not only identified the suspect correctly from the voice parade, but also from an identity parade. Both men were eventually convicted of murder.
1 The officer in charge should obtain a detailed statement from the witness, containing as much detail and description of the voice as is possible. All descriptions of the voice given by the witness must be included in the material supplied to the relevant forensic phonetics/linguistics expert, the suspect and solicitors.
2 Under no circumstances should an attempt be made to conduct a live voice identification procedure, using live suspect and foils.
3 The identification officer should obtain a representative sample of the suspect’s voice. Such samples might include police recorded interview tapes, during which the suspect is speaking naturally and responding to questions. Under no circumstances should the suspect be invited to read any set text.
4 The identification officer should obtain no less than 20 samples of speech, from persons of similar age and ethnic, regional and social background as the suspect. A suitable source of such material may be other police recorded interview tapes from unconnected cases.
5 The officer should ensure that all the work can be undertaken and completed within 4–6 weeks of the incident in question, as memory degradation or ‘fade’ on the part of the witness has been identified as a critical factor by experts in the field.
6 The identification officer should request the services of a force-approved expert witness in phonetics/linguistics, for example, a Member of the International Association for Forensic Phonetics and Acoustics, to ensure
Conclusion
You may be wondering how you could contribute to the field of forensic phonetics. Transcription and voice parade research can be undertaken by students with a working knowledge of phonetics, and careful engagement with the frameworks and cases outlined in this chapter. To a certain extent, this is also true of speaker profiling, as work in areas such as perceptual dialectology has found that people with no background in studying linguistics at all can make accurate and nuanced geographical observations about the voices that they hear. The field of forensic speaker comparison may seem harder to penetrate, however, especially when expertise in phonetics needs to be combined with skills in computing and programming to operate acoustic analysis software. Nonetheless, for the student of linguistics at least, forensic speaker identification offers a real-world, high-stakes application of your knowledge and understanding of phonetics and the human voice.
Further reading
Rose (2002) (chapters 2, 6, 5, 7 and 10 in that order); Hollien (2002); Jessen (2010); Watt (2010); Foulkes and French (2012); and French and Stevens (2013).