
Digging Numbers: Elementary Statistics for Archaeologists, 2nd Edition, Mike Fletcher

The document is a promotional description for the book 'Digging Numbers: Elementary Statistics for Archaeologists, 2nd Edition' by Mike Fletcher and Gary Lock, which provides an introduction to statistical techniques specifically for archaeologists. It outlines the book's structure, including sections on descriptive statistics, inferential statistics, and statistical computing, and highlights updates made in the second edition. Additionally, it mentions other related statistical texts available for download.



DIGGING NUMBERS
ELEMENTARY STATISTICS FOR
ARCHAEOLOGISTS
(Second Edition)

Mike Fletcher and Gary Lock

Oxford University Committee for Archaeology


2005
Published by
Oxford University School of Archaeology
Institute of Archaeology
Beaumont Street
Oxford

© Mike Fletcher and Gary Lock 2005

First published 1991
reprinted 1994, 2001, 2004
Second edition 2005

ISBN 0 947816 69 0

A CIP record for this book is available from the British Library

This book is available direct from
Oxbow Books, Park End Place, Oxford OX1 1HN
(Phone: 01865-241249; Fax: 01865-794449)

and

The David Brown Book Company
PO Box 511, Oakville, CT 06779, USA
(Phone: 860-945-9329; Fax: 860-945-9468)

and

via our website

www.oxbowbooks.com

Printed in Great Britain by
Antony Rowe Ltd, Chippenham, Wiltshire


CONTENTS

Preface

Section 1: Techniques for describing and presenting archaeological data

1. An introduction to data
   1.1 The example data set
   1.2 Levels of measurement
   1.3 Coding
   1.4 Transforming variables

2. A statistical approach - signposting the way

3. Tabular and pictorial display
   3.1 Basic aims and rules
   3.2 Tabulating measurements
   3.3 Tabulating frequencies
       3.3.1 One variable
       3.3.2 Two variables
   3.4 Pictorial displays for nominal and ordinal data
       3.4.1 The bar chart
       3.4.2 The pie chart
   3.5 Pictorial displays for continuous data
       3.5.1 The histogram
       3.5.2 The stem-and-leaf plot - an alternative histogram?
       3.5.3 The ogive - the total so far
       3.5.4 The scatterplot - displaying two variables

4. Measures of position - the average
   4.1 Introduction
   4.2 The mode
   4.3 The median
   4.4 The mean
   4.5 Comparing the mode, median and mean

5. Measures of variability - the spread
   5.1 Introduction
   5.2 The range
   5.3 The quartiles
   5.4 The mean deviation
   5.5 The standard deviation
   5.6 The coefficient of variation
   5.7 Standardisation
   5.8 Boxplots

Section 2: Techniques for drawing inferences from archaeological data

6. An introduction to probability and inference - drawing conclusions
   6.1 Introduction
   6.2 Probability - measuring chance and risk
       6.2.1 The concept of probability
       6.2.2 The concept of independence - are two events related?
   6.3 Probability distributions - predicting results
   6.4 The logic of hypothesis testing - is it significant?

7. Sampling theory and sampling design
   7.1 Introduction
   7.2 Sampling strategies - which measurements to take
   7.3 A statistical background to sampling
       7.3.1 The central-limit theorem - the law of averages
       7.3.2 Confidence limits - the reliability of results
   7.4 Conclusions

8. Tests of difference
   8.1 Introduction
   8.2 One sample tests - comparing an observed measurement with an
       expected measurement
       8.2.1 Test for sample mean
       8.2.2 Test for median
       8.2.3 Test for proportions
   8.3 Two sample tests - comparing two observed measurements
       8.3.1 Test for variation
       8.3.2 Difference in means for paired data - assuming a normal distribution
       8.3.3 Difference in means for paired data - no assumption of normality
       8.3.4 Difference in means for two independent samples - assuming a
             normal distribution
       8.3.5 Difference in means for two independent samples - no assumption
             of normality
       8.3.6 Difference of two proportions

9. Tests of distribution
   9.1 Introduction
   9.2 Tests for randomness
   9.3 Tests for normality
   9.4 Tests between two distributions

10. Measures of association for continuous or ordinal data - are two
    variables related?
   10.1 Introduction
   10.2 Product-moment correlation coefficient
        10.2.1 Testing the significance of the product-moment correlation
               coefficient
   10.3 Spearman's rank correlation coefficient
        10.3.1 Testing the significance of Spearman's rank correlation
               coefficient
   10.4 Predicting using regression

11. Measures of association for categorical data - are two characteristics
    related?
   11.1 Introduction
   11.2 The Chi-squared test
   11.3 Guttman's lambda
   11.4 Kendall's tau

12. An introduction to multivariate analysis
   12.1 Reduction and grouping
        12.1.1 Cluster Analysis
        12.1.2 Correspondence Analysis
   12.2 Prediction
        12.2.1 Multiple Regression
        12.2.2 Discriminant Analysis
Section 3: Books and software

13. A few recommended books

14. SPSS for Windows

Appendix. Statistical tables
   Table A. Random digits from a uniform distribution
   Table B. Percentage points of the t-distribution
   Table C. 5% points of the F distribution
   Table D. Kolmogorov-Smirnov single sample test
            (uniform and other completely specified distributions)
   Table E. Kolmogorov-Smirnov single sample test
            (normal distribution)
   Table F. Kolmogorov-Smirnov two sample test
   Table G. Critical values for correlation coefficient (pmcc)
   Table H. Critical values for Spearman's rank correlation coefficient
   Table I. Percentage points of the χ² distribution

Index


Preface (First Edition)

Note on the contents list: we have intentionally tried to produce user friendly headings
to try and overcome the problems inherent in statistical beginners being faced with a
list of technical names. This has resulted in a considerable amount of simplification
which may offend some statistical purists. We beg understanding in advance.

Digging Numbers comprises four sections:

Section 1. Simple techniques for describing and presenting archaeological data,
Section 2. Techniques for drawing inferences from archaeological data,
Section 3. An introduction to statistical computing,
Section 4. A catalogue of selected statistical packages.

The first two sections are sequential in the sense that Section 2 assumes familiarity
with the concepts and techniques covered in Section 1.

Section 1 starts with a discussion of the structure and organisation of archaeological
data-sets which are suitable for statistical analysis. It introduces a hypothetical data-set
which describes measurements and other aspects of forty bronze and iron spearheads.
This data-set is used throughout Sections 1, 2 and 3 to demonstrate the different
statistical techniques and concepts. Chapter 2 outlines a statistical approach to
analyzing such a data-set. It assumes familiarity with Chapter 1 and is meant to act as
a guide through the rest of Sections 1 and 2.

The rest of Section 1 is concerned with what are usually called Descriptive Statistics.
These include several methods of displaying the distribution of a single variable in
tabular and pictorial form as well as simple ways of displaying the relationship
between two variables. Measures of position (usually thought of as 'the average') and
measures of dispersion or variation (the 'spread' around the average) are also
described. All of these are applied to the spearhead data-set.

Section 2 outlines the main types of Inferential Statistics. These involve the concepts
of Sampling, Probability, Hypothesis Testing and Statistical Significance. Some of the
more commonly used Tests of Difference, Tests of Distribution and Tests of
Association are described and illustrated with examples from the spearhead data-set.

In Sections 1 and 2 statistical formulae are stated and used without derivation or
proofs. This is due to limited space and mainly to the fact that this book is aimed at
people with little or no statistical knowledge. It is felt that most users will be prepared
to accept a formula as stated. If statistical derivation is required appropriate books are
recommended in Chapter 12. References throughout the text are to the few recommended
books described in Chapter 12; there is no formal bibliography.
The emphasis here is on using the appropriate technique and understanding the results
in both statistical and archaeological terms. Where applicable, this includes working
through examples by hand (with a calculator). This may seem a little old-fashioned in
today's world of computers but we feel that the benefits in understanding are well
worth the effort.

Even so, many people will have access to a computer and this is where Sections 3 and
4 come in. Most of the techniques described in Sections 1 and 2 have one or two
corresponding computer programs listed in Section 3. These are written in SPSS and
Minitab.

Section 4 is a catalogue of commercially available statistical packages. This gives
details of hardware requirements, the software's contents and availability and includes
general packages as well as specific archaeological software.

This combination should allow the user of Digging Numbers to approach statistical
analysis either by calculation by hand, or by using a commercial software package.
Although the two packages we have chosen to demonstrate are relatively expensive
ones (SPSS and Minitab), many of those listed in Section 4 are inexpensive with some
being available as Shareware.

It is probably already apparent that this book provides only an introduction to the
complex world of statistics. There are whole areas of statistical reasoning and analysis
which are not even mentioned. Many different methods of multivariate analysis, for
example, have proved to be of importance to archaeologists. Even so, statistics seems
to be one of those subjects that can cause instant mental paralysis in many otherwise
competent archaeologists. If this book can give someone enough confidence to
approach a more advanced text then our aim will have been achieved.

Throughout the book three icons are used to quickly highlight either a reference in
Chapter 12, a link with another chapter or a program number from Section 3.

Acknowledgements.
We would like to thank Clive Orton of University College London for his meticulous
reading of an earlier draft of this book. His detailed comments were of great help to
us. Thank you also to Hazel Dodge for being a guinea-pig, to all the suppliers of
software for Section 4 and to H.R. Neave for permission to use some of his statistical
tables. Any mistakes or misunderstandings that remain in the text are the
responsibility of the authors.

Simon Pressey drew the cover illustration.

Technical note: camera-ready copy for this book has been produced by the authors
using Timeworks Desktop Publisher. Figures have been produced using Gem Graph
and Gem Draw and integrated electronically. We would be happy to discuss this
process with any interested parties.

GRL and MF, February 1991.


Preface (Second Edition)

Over the last thirteen years or so we have been pleased by the continuing popularity of
Digging Numbers as an introductory text for archaeologists wanting to get started with
statistics. Several people and organizations have requested that we update it and so,
after a series of reprints of the First Edition, here is the Second Edition.

The underlying philosophy and much of the text remains the same. This is still an
introductory book that is meant to get people doing statistics for themselves within a
basic understanding of the strengths and limitations of various techniques. There are,
however, several important changes within the Second Edition:

1. A new chapter (12) has been added which provides an introduction to
   multivariate techniques. The emphasis of the book is still on descriptive and
   inferential techniques but this new chapter gives a taste of what can be done
   using more than one or two variables.
2. Section 4 of the First Edition, the catalogue of statistical software, has been
   omitted as it is no longer relevant.
3. Chapter 13, recommended books, has been re-written and updated.
4. Chapter 14, computer programs, has been completely re-written. This Second
   Edition uses SPSS for PC and many of the figures within the text are SPSS
   output so that text and figures are more closely linked.

We would like to thank the many people who have contacted us about Digging
Numbers over the years, especially those who have pointed out errors and typos, not
least Philip Balcombe. We have attempted to rectify them all but please do get in
touch if any remain. Thank you also to Barry Cunliffe for encouragement and Val
Lamb of Oxbow Books for guidance.

GRL and MF, January 2005.


CHAPTER 1

AN INTRODUCTION TO DATA

1.1 The Example Data Set


It is important that any data set to be used for statistical analysis be well organised and
properly defined. This often results in a rectangular block of numbers which is called a
data matrix. Table 1.1 is a data matrix with 40 rows and 14 columns, or a 40 by 14
matrix (downloadable from http://www.soc.staffs.ac.uk/mf4/spears.zip).

Table 1.1 describes forty spearheads. Each horizontal row represents one spearhead
and is, therefore, one item in archaeological terms. In statistical terms this is referred
to as one case (often one record in database terminology). Each vertical column
represents one observation on the item and is, archaeologically speaking, one
attribute. In statistical terms this is a variable (equivalent to a field in many database
applications).
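
The same layout can be sketched in a few lines of Python. This is only an illustration, not part of the book's own programs: the three rows are copied from Table 1.1, the first column is assumed to be called NUM, and the names `VARIABLES`, `cases` and `column` are invented for the example.

```python
# Variable names follow the column headings of Table 1.1 (NUM is the case label).
VARIABLES = ["NUM", "MAT", "CON", "LOO", "PEG", "COND", "DATE",
             "MAXLE", "SOCLE", "MAXWI", "UPSOC", "LOSOC", "MAWIT", "WEIGHT"]

# Three cases (rows) copied from Table 1.1: spearheads 2, 18 and 32.
rows = [
    [2, 2, 3, 1, 2, 4, 450, 22.6, 7.8, 4.3, 1.3, 1.6, 11.3, 342.1],
    [18, 2, 2, 1, 2, 3, 450, 10.2, 3.0, 2.7, 1.4, 1.5, 5.8, 90.9],
    [32, 1, 1, 1, 2, 2, 700, 12.8, 3.5, 2.8, 1.5, 2.1, 5.9, 165.6],
]

# One dict per case, keyed by variable name.
cases = [dict(zip(VARIABLES, row)) for row in rows]


def column(name):
    """Return every case's value for one variable (one column of the matrix)."""
    return [case[name] for case in cases]


print(len(cases), "cases by", len(VARIABLES), "variables")
print(column("MAXLE"))
```

Reading across a dict gives one case; calling `column` gives one variable, which is the unit most of the later techniques work on.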

There are fourteen variables in Table 1.1. The first one is a label in the form of a
unique number for each case; this is essential for any form of cross-referencing with
other information about the spearheads. Each variable has a variable name which is
displayed at the top of the column.

From now on variable names that refer to the spearhead data-set are enclosed in < >.

1.2 Levels of measurement


Variables can be measured at one of four levels. This classification was first
introduced in 1946 and has become universally accepted by statisticians. As will
become apparent during this and the next section, it is important to know at what level
variables are measured. Many statistical techniques can only be applied to variables at
a certain level of measurement or higher. The four levels are, in ascending order;
nominal, ordinal, interval and ratio.
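
The idea of 'a certain level of measurement or higher' can be sketched as an ordered type. The Python below is a hedged illustration: the class `Level`, the table `MINIMUM_LEVEL` and the function `allowed` are invented names, and the minimum levels shown for the mode, median and mean follow the usual textbook convention rather than anything stated in this section.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of measurement, in ascending order."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Assumed minimum level for three common summaries (conventional, not from the text).
MINIMUM_LEVEL = {
    "mode": Level.NOMINAL,
    "median": Level.ORDINAL,
    "mean": Level.INTERVAL,
}


def allowed(technique, level):
    """True if a variable measured at `level` is high enough for `technique`."""
    return level >= MINIMUM_LEVEL[technique]


print(allowed("median", Level.NOMINAL))  # False
print(allowed("mode", Level.RATIO))      # True
```

Because the levels are ordered, a technique that is valid at one level is also valid at every level above it.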

NOMINAL (in name only). Nominal variables consist of categories which have no
inherent ordering or numeric value. Each category is assigned an arbitrary name. In
Table 1.1 the following variables are nominal;

<MAT>, <CON>, <LOO> and <PEG>


M. Fletcher and G. R. Lock An introduction to data

NUM  MAT  CON  LOO  PEG  COND  DATE  MAXLE  SOCLE  MAXWI  UPSOC  LOSOC  MAWIT  WEIGHT
  1    2    3    1    2     3   300   12.4    3.1    3.6    1.0    1.7    6.2   167.0
  2    2    3    1    2     4   450   22.6    7.8    4.3    1.3    1.6   11.3   342.1
  3    2    3    1    2     4   400   17.9    5.2    4.1    1.7    2.0    7.5   322.9
  4    2    3    1    0     4   350      *      *      *    1.4    2.0      *   154.8
  5    2    3    1    1     3   350   16.8    6.6    5.7    1.1    1.7    7.0   358.1
  6    2    3    1    2     3   400   13.3    3.1    4.1    1.6    1.9    5.6   227.9
  7    2    3    1    2     2   450   14.1    5.8    5.8    1.2    1.8    6.8   323.8
  8    2    2    1    2     4   600      *    6.1    5.9    1.3    1.7    7.1   285.2
  9    2    2    1    2     4   150   22.5    9.2    6.2    1.3    2.0   13.1   613.8
 10    2    1    1    2     3   300   16.9    4.5    3.6    1.4    1.9    [?]   254.3
 11    2    1    1    2     2    50   19.1    4.6    4.1    1.5    1.8   10.6   310.1
 12    2    1    1    2     3   100   25.8    8.6    4.7    1.4    1.6   12.7   426.8
 13    2    1    1    2     2   600   22.5    8.4    3.9    1.7    2.7   18.0   521.2
 14    2    1    1    2     3   300   27.6    8.7    6.0    1.5    2.1   14.4   765.1
 15    2    1    1    2     2   350   38.0    9.6    5.6    2.0    2.6   13.6  1217.2
 16    2    1    1    2     2   350   72.4   14.4    6.4    2.0    2.4   17.6  2446.5
 17    2    1    1    2     2   350   37.5   10.2    3.9    1.8    2.1   14.1   675.7
 18    2    2    1    2     3   450   10.2    3.0    2.7    1.4    1.5    5.8    90.9
 19    2    2    1    2     2   200   11.6    4.6    2.0    0.9    1.7    5.6    86.8
 20    2    2    1    1     3   400   10.8    3.1    2.7    1.9    1.7    5.4   109.1
 21    1    1    2    1     2   900   11.4    4.2    1.8    0.8    1.5    6.1    67.7
 22    1    1    1    2     2   900   16.6    7.2    2.8    1.6    2.0    9.5   204.5
 23    1    1    2    1     1  1000   10.2    3.4    3.3    1.9    2.3    5.4   170.3
 24    1    1    2    1     1  1200   18.6    6.6    2.7    1.4    1.6    8.5   176.8
 25    1    1    2    1     2  1200   24.4    7.5    4.4    1.7    2.3   11.3   543.2
 26    1    1    2    1     1  1000   23.5    8.0    4.5    2.0    2.7    8.7   628.2
 27    1    1    2    1     2  1200   24.8    8.1    3.5    2.0    2.1   11.1   401.0
 28    1    1    1    2     1   800   14.1    3.4    3.9    1.7    2.5    6.1   302.4
 29    1    1    1    2     2   800   24.6    6.0    4.8    2.1    2.4    8.6   623.5
 30    1    1    2    1     2   800   30.9    5.1    6.0    1.5    2.4    8.0   978.9
 31    1    1    1    2     1   700   20.2    5.9    5.7    1.7    2.4    9.4   607.9
 32    1    1    1    2     2   700   12.8    3.5    2.8    1.5    2.1    5.9   165.6
 33    1    1    1    2     1   800   16.9    5.5    3.6    1.6    2.3    8.2   307.9
 34    1    1    1    2     1   800   14.2    4.3    2.8    1.3    2.2    6.0   192.4
 35    1    2    1    2     2   700   18.0    4.5    5.3    1.6    2.5    9.9   524.7
 36    1    1    2    1     2  1000   11.7    3.6    2.4    2.2    1.8    6.6   111.2
 37    1    1    1    2     1   800   14.1    5.4    2.4    1.5    2.4    8.4   118.1
 38    1    1    2    1     2  1200   17.7    4.8    3.9    1.2    1.8    9.6   273.4
 39    1    1    2    2     3  1200   36.6   13.5    6.0    1.6    2.7   18.1  1304.4
 40    1    1    2    1     2   800   12.3    2.4    5.4    1.1    1.6    7.2   233.8

Table 1.1. The spearhead data-set.

Column  Variable Name  Description  Values
2       <MAT>          Material     1 = Bronze
                                    2 = Iron
3       <CON>          Context      1 = Stray find (inc. hoards)
                                    2 = Settlement
                                    3 = Burial
4       <LOO>          Loop         1 = No
                                    2 = Yes
5       <PEG>          Peghole      1 = No
                                    2 = Yes

For each of these variables there is no significance in the values '1', '2' and '3' that
have been assigned to the categories (i.e. '2' is not twice the value of '1'), any other
numbers or names would do. Note that it is good practice to avoid the use of 1 for
'yes' and 0 for 'no' as this can confuse the distinction which often needs to be made
between 'no' and 'no information' (or 'missing data').

ORDINAL (forming a sequence). Ordinal variables also consist of categories but this
time they have an inherent ordering or ranking. There is, however, no fixed distance
between the categories. The only ordinal variable in Table 1.1 is column 6 <COND>
which has the following values;

1 = Excellent    2 = Good    3 = Fair    4 = Poor.

Here we can state the relationship of '2' as being between '1' and '3' but it is wrong to
assume equal distance between categories as is implied by the numeric values.

It is possible that a nominal variable could become ordinal if an ordering is imposed
by a typology although this will be based on some external criteria and is not inherent
within the data.

Some statistical tests will accept a dichotomous nominal variable (one with only two
categories) as being ordinal. Many dichotomous variables are presence/absence
variables; they record whether the attribute is there or not. Care must be taken when
dealing with missing values which can be frequent in archaeological data. Spearhead
Number 4 has missing values (indicated by *) for variables 8, 9, 10 and 13 because it
is badly damaged and those measurements can not be taken. It can be confusing to
represent missing values with a numeric value such as 0.0 or 99.9, choose something
obvious such as *. Missing values can cause complications in presence/absence data.
The value 'absent' is different to 'not known' (if the relevant piece of information can
not be measured) and a third category may have to be introduced; Present, Absent and
Missing.

INTERVAL (a sequence with fixed distances). An interval variable has the properties
of an ordinal variable with the added property that the distances between the values
can be interpreted. A popular way of explaining this concept is to look at methods of
measuring temperature. The values 'hot', 'warm', 'cool', and 'cold' are ordinal
because the difference between 'hot' and 'warm' and the difference between 'warm'
and 'cool' are not defined. A temperature of 30°C is not only higher than one of 20°C
but it is 10°C higher. The interval is meaningful, therefore temperature Celsius is an
interval scale.

The only interval variable in Table 1.1 is column 7 <DATE>. If we take spearhead
numbers 9, 10 and 18 they have the dates 150BC, 300BC and 450BC respectively.
The difference in years between Number 18 and Number 10 is the same as between
Number 10 and Number 9. It is obviously incorrect, however, to interpret Number 10
as being twice as old as Number 9 even though this is implied by the numeric values
of '300' and '150'. With interval variables there is no meaningful datum or zero.

RATIO (fixed distances with a datum point). This is the highest level of measurement
with the properties of interval data plus a fixed zero point. If the dates mentioned
above were converted to a new variable <AGE> so that a value of '1,000' was twice
as old as '500' this would then be a ratio variable. Returning to the measuring of
temperature, 20°C is not twice as hot as 10°C because 0°C is not a datum point, it does
not imply no heat. Temperatures in degrees Kelvin, on the other hand, are measured on
a ratio scale because 0°K does mean no heat and 20°K is twice as hot as 10°K.

In Table 1.1 columns 8 to 14 inclusive are all ratio variables. They are metric
measurements as shown in Figure 1.1.

It is also quite common to refer to nominal and ordinal variables as categorical (or
discrete) variables and to interval and ratio as continuous variables. The variable
values of categorical variables are usually chosen by the analyst and because this can
be a fairly arbitrary process these are sometimes referred to as qualitative variables.
The values of continuous variables tend to be more objectively arrived at and these are
sometimes called quantitative variables.

Just because nominal variables are classified as the lowest level of measurement their
importance within archaeology must not be underestimated. Some fundamental
archaeological concepts involve the use of nominal data, the processes of
classification and typology are important examples.

 8. Maximum length (cm) <MAXLE>
 9. Length of socket (cm) <SOCLE>
10. Maximum width (cm) <MAXWI>
11. Width of upper socket (cm) <UPSOC>
12. Width of lower socket (cm) <LOSOC>
13. Distance between maximum width and lower socket (cm) <MAWIT>
14. Weight (g) <WEIGHT>

Figure 1.1 The seven quantitative variables.

All observations involve a level of accuracy, especially on continuous variables. The
level of accuracy decided on must be adequate as a basis for sensible decisions and
interpretations during analysis. Variables 8 to 13 in Table 1.1 are all recorded to the
nearest millimetre, to be any more accurate is unnecessary although not physically
impossible. Once data have been collected no amount of statistical manipulation will
improve their accuracy.

1.3 Coding

With categorical variables it is necessary to represent the values of the categories in a
standardised way by using a coding system. It is common in statistical analysis to use
a numeric coding system, in fact, using letters rather than numbers can cause problems
with some statistical software. All of the categorical variables in Table 1.1 have values
represented by a unique integer number. This is easy to process but is obscure because
the meanings of the code have to be remembered or looked up, if the data set is large
and/or complex this can be very time consuming and become a major drawback with
numeric coding. Another problem with this method is the potential for a higher error
rate in the data and the associated problem of error coding.
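
One hedged way to ease the look-up burden is to keep the code book alongside the data. The Python sketch below decodes the numeric codes given for <CON> and <COND>, treats '*' as a missing value and rejects any code that is not in the coding system. The names `CODE_BOOK`, `MISSING` and `decode` are invented for the illustration.

```python
# Code book for two categorical variables, using the values given for Table 1.1.
CODE_BOOK = {
    "CON": {1: "stray find", 2: "settlement", 3: "burial"},
    "COND": {1: "excellent", 2: "good", 3: "fair", 4: "poor"},
}

MISSING = "*"  # the marker suggested in the text for missing values


def decode(variable, value):
    """Translate a numeric code to its meaning.

    Returns None for a missing value and raises ValueError for a code
    that the coding system does not define.
    """
    if value == MISSING:
        return None
    try:
        return CODE_BOOK[variable][value]
    except KeyError:
        raise ValueError(f"{value!r} is not a valid code for <{variable}>") from None


print(decode("CON", 3))
print(decode("COND", MISSING))
```

Keeping 'missing' separate from every real category, and failing loudly on an unknown code, catches keying errors at the point of entry instead of during analysis.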

Obscurity (and thus many errors!) can be reduced by using an abbreviated keyword
coding system. In such a system the values of the variable <CON>, for example, could
be represented by the code 'str', 'set' and 'bur'. For complete clarity a full keyword
code would use the values 'stray find', 'settlement' and 'burial'. Both keyword
systems can create more work during data recording although the extra time spent
typing can be offset by not having to look up codes. Codes containing letters
(alphanumeric) can cause problems with some software; make sure to check first!

Whatever coding system is used it must be exhaustive and exclusive. Exhaustive in
that every possible data value is catered for and exclusive because every value will
only fit into one category. Each observation must fit into one and only one category of
the coding system (even if it is a category called 'miscellaneous' for those values that
don't fit elsewhere).

1.4 Transforming variables

Table 1.1 shows the observations as recorded, these are the raw data. It is sometimes
useful to transform one or more of the original variables to create new variables for
analysis. Transformations can involve a single variable or be a relationship between
two variables.

GROUPING. Values of a continuous variable can be grouped to create a new
categorical variable. The variable <DATE> could be chopped up into the three values
'1200 to 650', '649 to 100' and 'after 99' to create the new variable <PERIOD>. The
values of <PERIOD> would be 'Later Bronze Age', 'Earlier Iron Age' and 'Late Iron
Age' and could be used for the basis of establishing changes in the spearheads through
time. Performing statistical analyses on each of the three groups of <PERIOD> could
identify temporal trends.

The grouping of continuous variables is flexible in that new groups can be created to
suit a particular analysis. This is a useful technique for exploring a data-set. If data for
many more spearheads became available it may prove interesting to divide
<PERIOD> into more than three categories for finer temporal investigations.

Although grouping of continuous variables can be very useful it must be remembered
that it involves a loss of information. It is always better to record data as a continuous
variable and then group, rather than to record initially as a categorical variable.

RATIOS. Sometimes the relationship between the values of two variables can express
a new attribute of interest. By performing a calculation on the two values the new
attribute can be stored as an extra variable. This usually applies to continuous
variables.

As an example, the ratio between the two variables <MAXLE> and <MAXWI> will
express something of the overall shape of the spearhead. Short, wide spearheads will
have a different value to long narrow ones. The ratio can be calculated by dividing
<MAXLE> by <MAXWI> as follows;

Spearhead    <MAXLE>    <MAXWI>    Ratio
number                             <LE/WIRAT>
 1           12.4       3.6        3.4
39           36.6       6.0        6.1

The difference in overall 'shape' is expressed in the two values of the ratio for
spearheads 1 and 39.

It is now possible to use the two new variables <PERIOD> and <LE/WIRAT> to
investigate temporal trends in the shape of spearheads. It is often the case that as
exploration of a data-set progresses so new ways of expanding the original variables
by creating new ones are thought of. It can be informative to 'play' with the data, to
explore relationships and see if the results are interesting.

Another measure of some aspect of shape could be a proportion stated as a
percentage. A good example is to take the length of socket as a proportion of the
maximum length by dividing <SOCLE> by <MAXLE> and multiplying by 100 as
follows:

Spearhead    <SOCLE>    <MAXLE>    Proportion
number
 1           3.1        12.4       0.25 (25%)
22           7.2        16.6       0.43 (43%)

Percentages are often used to measure frequencies or counts but can be deceptive
unless the raw counts are also given.

Points to remember:
However a variable is measured, mm, g, %, years etc. it is essential to state clearly the
units used for this measurement and, as far as possible, to use a consistent set of units.
Do not mix mm. with inches!

Always keep a copy of the original data. As an analysis progresses the data being used
can change in form. If a computer is being used it is very easy to overwrite old
versions of data with new versions.
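
As a hedged illustration of the transformations in this section, and of leaving the original data untouched, the Python sketch below derives <PERIOD>, the length/width ratio and the socket proportion for three spearheads from Table 1.1. The function names, the key `LE_WIRAT` (standing in for <LE/WIRAT>) and the key `SOCLE_PCT` are invented for the example; the period boundaries are the three groups quoted in the text, read as years BC.

```python
def period(date_bc):
    """Group <DATE> (years BC) into the three <PERIOD> values given in the text."""
    if date_bc >= 650:
        return "Later Bronze Age"   # '1200 to 650'
    if date_bc >= 100:
        return "Earlier Iron Age"   # '649 to 100'
    return "Late Iron Age"          # 'after 99'


# Raw values copied from Table 1.1 (spearheads 1, 22 and 39).
raw = [
    {"NUM": 1, "DATE": 300, "MAXLE": 12.4, "SOCLE": 3.1, "MAXWI": 3.6},
    {"NUM": 22, "DATE": 900, "MAXLE": 16.6, "SOCLE": 7.2, "MAXWI": 2.8},
    {"NUM": 39, "DATE": 1200, "MAXLE": 36.6, "SOCLE": 13.5, "MAXWI": 6.0},
]


def transform(case):
    """Return a copy of the case with the derived variables added."""
    derived = dict(case)  # copy, so the record stays exactly as recorded
    derived["PERIOD"] = period(case["DATE"])
    derived["LE_WIRAT"] = round(case["MAXLE"] / case["MAXWI"], 1)
    derived["SOCLE_PCT"] = round(100 * case["SOCLE"] / case["MAXLE"])
    return derived


for row in map(transform, raw):
    print(row["NUM"], row["PERIOD"], row["LE_WIRAT"], row["SOCLE_PCT"])
```

Because `transform` works on a copy, the raw records can be regrouped later with different period boundaries without re-entering any data.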

Keep a record of any changes made to data. It is very easy to lose track of how an
analysis has developed. If the results are to be published it is important for other
workers to have access to the original data and to be aware of how the data have been
altered.


CHAPTER 2

A STATISTICAL APPROACH - SIGNPOSTING THE WAY

We are now in a position to be able to record data in a suitable format for statistical
analysis. This chapter outlines a general statistical approach which can be applied to
any data-set while, at the same time, it attempts to guide the reader through the
following chapters. It is useful to preserve the two stages implied by the structuring of
this book: the initial descriptive and exploratory stage and then the inferential stage
when hypotheses can be formally tested. Going beyond these relatively simple
techniques it may then be suitable to apply multivariate techniques to try and
understand more complex patterns within the data.

The descriptive and exploratory stage (Chapters 1, 3, 4 and 5).


The suggested approach is meant to emphasise the exploratory nature of statistical
analysis. The aim is not to perform 'an analysis' to produce 'the answer' but rather to
execute successive passes through the data gradually identifying trends and patterns that
look interesting and can be followed up by further investigation. A series of sequential
steps can be recommended, of which the first two have already been described.

Step 1 (Chapter 1).


Establish the structure of the data.
- Assign variable names, identify the level of measurement for each variable.
- Assign a case identifier if there is not one.
- Decide on the coding of nominal and ordinal variables.
- Decide how to code missing values.

Step 2 (Chapter 1).


- Produce a rectangular data matrix aligning the columns.
- Visually scan the matrix for any obvious errors.

Step 3 (Chapter 3).


- Investigate the gross values of each variable individually (i.e. univariate analysis).
This is still primarily screening for errors. It is important to be sure that the data
are absolutely error free.
- The minimum and maximum values of each variable can be initially important in
identifying possible errors. For categorical variables using a numeric or
alphanumeric code this can show cases of gross misclassification. For
continuous variables this can show errors of measurement (although it could be
a genuine outlier).
- Correct any errors and repeat this step.
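
The screening described in Step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the function name `screen` and the plausible range for <MAXLE> are invented here, and only the variable names and the four condition codes come from the text.

```python
# Hypothetical records: the variable names follow the text, the values are invented.
records = [
    {"id": 1, "COND": 2, "MAXLE": 12.4},
    {"id": 2, "COND": 7, "MAXLE": 18.9},   # 7 is not a valid condition code
    {"id": 3, "COND": 1, "MAXLE": 124.0},  # possibly a misplaced decimal point
    {"id": 4, "COND": 4, "MAXLE": 9.6},
]

VALID_COND = {1, 2, 3, 4}        # the four condition categories
PLAUSIBLE_MAXLE = (2.0, 60.0)    # assumed plausible range in cm, not from the text


def screen(records):
    """Return the (min, max) of each variable and a list of suspect cases."""
    ranges = {}
    for name in ("COND", "MAXLE"):
        values = [r[name] for r in records]
        ranges[name] = (min(values), max(values))

    suspects = []
    low, high = PLAUSIBLE_MAXLE
    for r in records:
        if r["COND"] not in VALID_COND:
            suspects.append((r["id"], "COND", r["COND"]))
        if not low <= r["MAXLE"] <= high:
            suspects.append((r["id"], "MAXLE", r["MAXLE"]))
    return ranges, suspects


ranges, suspects = screen(records)
print(ranges)
print(suspects)
```

A flagged continuous value is only a candidate error. As noted above, it could be a genuine outlier, so each suspect case should be checked against the original record before it is changed.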


Step 4 (Chapters 3, 4 and 5).

Investigate the distribution and parameters of each variable (still univariate) using the
full range of descriptive statistics.

- For categorical variables the most useful will be frequency tables, bar charts and
the modal value(s).
- For continuous variables the mean, median, range and standard deviation together
with histograms, stem-and-leaf plots, boxplots and ogives will probably be the
most productive.
- Use pictures and graphical techniques wherever possible, these can be much more
informative than numbers alone.
- Investigate the same variable several times over - don't just produce one result
and claim it is 'the answer'. For example, if a continuous variable is being
analyzed by a histogram or stem-and-leaf plot, use several values for class
intervals and midpoints and compare the results.
- Create new variables by transformations (Chapter 1) and repeat step 4.
- Anomalies and errors in the data can still be identified at this stage. Correct any
and return to step 3.

This is the end of the basic analysis and error checking procedures.

Step 5 (Chapters 3, 4 and 5).

Certain simple, albeit often important, archaeological questions will have been
answered during step 4, these will have been univariate in nature i.e. concerning the
distribution and other characteristics of a single variable. The minimum, maximum
and average weight of spearheads, the numbers of spearheads from different context
types are such questions. The next stage of archaeological questioning will involve
some kind of comparison of two variables: bivariate analysis.

It is here that the intuitive nature of statistical analysis becomes more important
because control is in the hands of the analyst: the analysis should be archaeologically
driven. On the one hand statistics are just a tool capable of providing answers to
archaeological questions but the real power of statistics is that they can be more than
that - statistics can trigger new approaches to a data-set, generate new questions, and
it is this that makes the intuitive, iterative nature of a statistical analysis important.

Bivariate questions involve comparison of some kind and form the basis of much
archaeological analysis. Comparisons will probably be one or more of the following:

- Comparison using two categorical variables (including grouped continuous
variables). This is a contingency table approach (Chapter 3), an example being
material of spearhead by find context.
- Comparison using one categorical variable (including grouped continuous
variables) and one continuous variable (Chapters 3, 4 and 5). This approach
produces statistics for the continuous variable, using the techniques as in Step 4,
for each category of the categorical variable and compares them. A simple
example would be a histogram, mean and standard deviation for the maximum
length of spearheads from each category of find context - how do they
compare?

It is quite common in both of the above comparisons for one of the categorical
variables to be either time or position related. This results in the investigation of
temporal and spatial trends respectively - the two most important lines of enquiry in
archaeology.

- Comparison using two continuous variables. A scatterplot of the maximum
length by the maximum width of the spearheads is an example.

It is important to realise that:

- All three methods of comparison could include data from another data-set,
comparing data from two different sites or areas for example. How does the
maximum length of our spearheads compare to the maximum length of those
from a different area?
- All three methods of comparison can be developed to include techniques of
formal inference and hypothesis testing. Is there a statistically significant
association between the material of the spearheads and their find context or
could it have happened by chance? Is the relationship between the maximum
length and width of the spearheads significant?

The answering of such questions involves the concepts of probability theory and
statistical significance and moves us into the second stage.

The inferential stage (Chapters 6 to 11).

Chapters 6 and 7 provide the underlying theory for the techniques involved in drawing
inferences from the data. Both should be read before attempting anything described in
Chapters 8 to 11.

It is important to realise that moving into the inferential stage is not an essential step,
the methods described above form the basis of many an excavation report or research
paper. The difficult part of a statistical analysis is often the initial posing of the
archaeological question in statistical terms. This has been compared with translating
between two different languages: archaeology has its own theoretical language and
statistics has an operational language. Once the translation has been done, and it is
clear just what relationship between which variables represents the archaeological
question to be answered together with which statistical technique is needed, it could be
that one of the descriptive methods will provide enough information.

There is a general move in statistics away from rigid confirmatory approaches (i.e. one
analysis produces 'the answer') towards a much more flexible exploratory approach,
this applies to inferential techniques as well as descriptive methods.

It has already been stated that some of the descriptive techniques mentioned above
form the basis for inferential statistics. Distribution characteristics such as the variance
and the mean can be tested (Chapters 8 and 9), as can relationships displayed by
scatterplots (Chapter 10) and contingency tables (Chapter 11). In every situation,
however, it is important to remember just what it is that is being tested, i.e. it is the
statistical significance. This is a very different thing to archaeological significance
and the two should not become conflated. We may identify patterns within the data
that are statistically significant at the 95% level but unless this can be translated back
into the theoretical language of archaeology, and be given meaning in archaeological
terms, it will not be archaeologically significant. Another problem, which again can
only be answered in archaeological terms, is that of which level of statistical
significance is really meaningful. If something is statistically significant at the 90%
level but not at the 95% level what does this mean in archaeological terms? Is it
important? Statistical analysis of archaeological data should not be reduced to a search
for statistical significance (see Chapter 6 for more on this).

Statistical significance, then, is formally defined and involves testing that is repeatable
(i.e. any two people could apply the same test to the same data and get the same
result). Archaeological significance is much more difficult to pin down. Almost any
identifiable pattern within a data-set can be subjectively analyzed and declared to be
significant either because it is the same as some existing pattern or because it is
different to some existing pattern, this involves testing that is often not repeatable.

While an increasing use and understanding of statistical techniques by archaeologists
will not close this rift between the two different methodologies, it should provide
alternative ways of approaching data.

Multivariate analysis (Chapter 12)

Some archaeological questions can be answered (and many more thought about) by
using the relatively simple univariate and bivariate techniques described above. For
many people and for many analyses these will be adequate. It has to be acknowledged,
however, that human beings and their resulting material and social worlds are
multi-dimensional and complex. That complexity can not always be reduced to single
variables or the relationship between two variables and this has resulted in a long
history of applying multivariate statistical techniques in archaeology.

We would still suggest that, as for the simpler techniques, multivariates are used in an
exploratory way. Many multivariate techniques produce some kind of graphical output
(together with statistics) which is descriptive in the sense that it simplifies and
presents patterns within the data. Chapter 12 offers a simple introduction to the two
main areas of multivariate techniques that have been used in archaeology. The first is
the general theme of clustering or grouping, the techniques of Principal Components
and Factor Analysis, Correspondence Analysis and Cluster Analysis. Given several
measurements on each of a set of objects can the objects be placed in groups so that
within each group the objects are similar but between the groups there are
interpretable differences? Secondly, given several measurements on a set of objects is
it possible to predict a variable of interest from the others, and if so which variables
are important in this prediction? These are the techniques of Multiple Regression and
Discriminant Analysis.

Our argument for multivariate techniques being used in an exploratory way is a simple
one. Because the statistics underlying these techniques are more complex than for
descriptive and inferential techniques they are in more danger of being seen as a
'black box'. It is essential to use a computer and 'answers' are always provided
whether or not you understand the manipulations being performed on the input data
you have provided. In one sense the process is 'objective' in that the same result will
always be attained from the same data whoever performs the analysis. In reality,
however, it is a deeply 'subjective' process because firstly, we decide on which
characteristics to measure and input as variables and, secondly, all of these techniques
involve making decisions during the process. For example, there are several different
methods of cluster analysis involving different ways of measuring the 'similarity'
between objects and then displaying them. So, just as when using a histogram it can be
enlightening to change the interval width and centre points, when using cluster
analysis it can be interesting to experiment with different methods and settings.


CHAPTER 3

TABULAR AND PICTORIAL DISPLAY

3.1 Basic aims and rules.
Descriptive statistics involve the display and summary of data. Tables, diagrams and
individual summary statistics enable a rapid understanding of the main characteristics
of a raw data set. The parameters of individual variables, different relationships
between two variables and trends and peaks within the data can all be recognised and
quantified with these simple techniques.

This chapter describes tabular and pictorial descriptive statistics. The next two
concentrate on the individual summary statistics usually classified as measures of
central tendency and measures of dispersion. The techniques in all three chapters are
exploratory in nature. They can be used together, several times over in different
combinations, to draw out salient points from a data set.

For a table or picture to convey the maximum information, in a clear and
unambiguous way, several simple rules should be followed:

1. Include a title.
2. All units of measurement must be clearly stated. If percentages are used try and
include actual counts as well, or at least a total so that counts can be calculated.
3. State the source of the data if it is not obvious.
4. Use footnotes to define ambiguous or non-standard terms and to help clarification
generally.
5. Use a key for all symbols and shadings.
6. Keep it simple so that the important information is not swamped by unnecessary
detail.
7. Diagrams must be sufficiently large for any detail to be clear.

3.2 Tabulating measurements.
The starting point for any statistical analysis is a data set. This will consist of a table of
observations which will be measurements at different levels as defined in Chapter 1. A
table is made up of horizontal rows and vertical columns with a cell at each
intersection which contains a value. Each row represents an item, sometimes called a
case, and each column is an attribute of the item, usually called a variable. It is
obvious that in Table 1.1 each row is one spearhead and each column is a variable as
described in Chapter 1.

Table 1.1 shows typical archaeological data in the form of tabulated measurements. It
consists of a mixture of categorical and continuous variables.

It is important that each row has a unique identifier. If this is not included within the
list of variables (some kind of catalogue number, for example) then a new variable
<ROW NUMBER> should be created. Each column should also be labelled with a
<VARIABLE NAME>. Try to use meaningful variable names, even if abbreviated,
rather than something like V1, V2, V3 etc.

The standardisation of units within a table can avoid confusion. This applies to all
measurements for a single variable and to all variables within a table. All of the
observations on the variable <MAXLE>, for example, are in centimetres. It would be
unacceptable to have some recorded in centimetres and others in inches, or even in
millimetres. All six variables in Table 1.1 that record a distance measurement are in
centimetres. Again, it would be confusing if different units were used for different
variables.

All six are also recorded to one decimal place with the decimal points aligned
vertically. It is advisable to standardise the number of decimal places, certainly within
the values of one variable and, if possible, within the whole table. This not only makes
the table easier to understand visually but can also simplify future analysis.

3.3 Tabulating frequencies.
3.3.1 One variable.
It is usually the case in archaeology that a data set consists of a large number of items
(rows). The tabulation of measurements, therefore, is of little use in trying to analyze
the whole data set at any level. The usual way around this is to work with frequencies
instead of measurements. A frequency (usually abbreviated to f) is the number of
times a particular value (measurement) occurs, these are displayed in a frequency
table.

If the variable is categorical, a convenient way of building up a frequency table is to
use a system of tally marks. It can be seen from Table 3.1 that a tally mark is made
for each measurement alongside the category into which it falls.

Condition   Tally                      Frequency
1           ||||/ |||                   8
2           ||||/ ||||/ ||||/ |||      18
3           ||||/ ||||                  9
4           ||||/                       5
Total                                  40 spearheads

Table 3.1. A univariate frequency table showing the condition of spearheads,
<COND>.
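
The tallying behind a table such as Table 3.1 can be sketched in Python. This is a minimal illustration, not the authors' procedure: the list of condition codes is a hypothetical batch of ten spearheads, the function name `tally_table` is introduced here, and a bundle of five marks is written as `||||/` purely as a plain-text convention.

```python
from collections import Counter

# Condition codes for a hypothetical batch of ten spearheads
# (1 = excellent, 2 = good, 3 = fair, 4 = poor).
codes = [2, 1, 2, 3, 2, 4, 1, 2, 3, 2]


def tally_table(values):
    """Return (category, tally marks, frequency) rows in category order."""
    counts = Counter(values)
    rows = []
    for category in sorted(counts):
        frequency = counts[category]
        bundles, remainder = divmod(frequency, 5)
        groups = ["||||/"] * bundles          # each bundle stands for five marks
        if remainder:
            groups.append("|" * remainder)    # loose marks not yet bundled
        rows.append((category, " ".join(groups), frequency))
    return rows


rows = tally_table(codes)
for category, marks, frequency in rows:
    print(f"{category:<10}{marks:<12}{frequency:>3}")
print(f"{'Total':<22}{sum(f for _, _, f in rows):>3}")
```

Counting with `Counter` gives the frequencies directly. The tally strings are kept only to mirror the hand method, where bundling into fives makes the row totals quick to check.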


The tally marks are bundled into fives using the 'five-bar gate' method. At the end of
each row is a row total and at the bottom is a table total. In this example the variable is
<COND>; it is ordinal with four categories. Because the table only describes one
variable it is a univariate frequency table.

It is often of interest in archaeology to comment on proportions as well as counts
hence it is common to convert the figures to percentages. Table 3.2 shows the same
data as the last table but in a slightly different form.

Condition   Frequency   Percent   Cumulative   Cumulative
                                  Frequency    Percent
Excellent        8        20.0         8          20.0
Good            18        45.0        26          65.0
Fair             9        22.5        35          87.5
Poor             5        12.5        40         100.0
Total           40       100.0

Table 3.2. A univariate frequency table showing the condition of spearheads,
<COND>.

Column one has category labels (sometimes called 'value labels') rather than the
meaningless values 1, 2, 3 and 4. The second column shows category frequencies and
the third shows category percentages. The fourth and fifth columns are cumulative
frequencies and percentages established by adding consecutive category values. The
cumulative figures can be of value if the table has many rows or if categories need to
be combined. From the table above, for example, we could deduce that 65% (26) of
the spearhead sample were in at least good condition and 35% (14) were less than
good.

If the variable of interest is a continuous rather than a categorical variable a grouped
frequency table should be used. This involves dividing the range of the values for the
variable into classes and then proceeding as above (treating the variable as
categorical).

There are no firm rules about how many classes to use although it is normal to have
between five and fifteen of equal size. Less than five would lose too much information
and more than fifteen would make the table too complicated. Classes of equal size
give a better idea of the distribution of the variable, as shown in Table 3.3.

Interval (cm)    Frequency (f)
1.25 - 1.75           0
1.75 - 2.25           2
2.25 - 2.75           5
2.75 - 3.25           3
3.25 - 3.75           5
3.75 - 4.25           7
4.25 - 4.75           4
4.75 - 5.25           1
5.25 - 5.75           5
5.75 - 6.25           6
6.25 - 6.75           1
Total                39

Table 3.3. A grouped univariate frequency table for maximum width, <MAXWI>.

Table 3.3 shows the continuous variable <MAXWI> grouped into classes of 0.50 cm.
It is important that class intervals do not overlap and have no gaps that could contain a
value. With these data, recorded to one decimal place, an interval 1.3 to 1.7 would
contain all true measurements from 1.250000... to 1.749999... This is replaced by
the interval 1.25 to 1.75 as in Table 3.3. Although the value 1.75 occurs twice (in two
different intervals) this will not cause a problem since the actual data value of 1.75
will never be recorded because of the accuracy of the data (it will be either 1.7 or 1.8).
The precision of the class intervals will depend on the level of accuracy of the
variable. If the coding is such that 'boundary values' do occur (values that could be
assigned to two intervals) it must be decided whether to always put them into the
lower or higher interval.

3.3.2 Two variables.
It is also possible to produce a frequency table for two variables at once. This is a
bivariate frequency table, more usually called a two-way contingency table.
Contingency tables are the basis of a group of statistical tests of significance which are
described in Chapter 11. They are, however, also important in their own right as a
means of rapidly assessing the relationship between two variables as shown in Table
3.4. Here, each spearhead has been assigned to one of the six cells according to its
values on the two categorical variables <MAT> and <CON>. A tallying procedure
similar to that described above is used to produce the six cell frequencies.
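
Tallying two categorical variables at once can also be sketched in Python. The example is a hedged illustration: the list `cases` is rebuilt from the cell counts of Table 3.4 (the order of individual cases is invented), and the names `cells`, `row_totals`, `col_totals` and `percentages` are assumptions introduced here rather than anything defined in the text.

```python
from collections import Counter

# One (material, context) pair per spearhead, rebuilt from the cell counts of Table 3.4.
cases = (
    [("Bronze", "Stray find")] * 19
    + [("Bronze", "Settlement")] * 1
    + [("Iron", "Stray find")] * 8
    + [("Iron", "Settlement")] * 5
    + [("Iron", "Burial")] * 7
)

cells = Counter(cases)                               # cell frequencies
row_totals = Counter(material for material, _ in cases)   # marginal totals, <MAT>
col_totals = Counter(context for _, context in cases)     # marginal totals, <CON>
n = len(cases)                                       # table total


def percentages(material, context):
    """Cell count as a % of its row (material), its column (context) and the table."""
    count = cells[(material, context)]
    return (
        round(100 * count / row_totals[material], 1),
        round(100 * count / col_totals[context], 1),
        round(100 * count / n, 1),
    )


print(cells[("Bronze", "Stray find")], percentages("Bronze", "Stray find"))
```

For the bronze stray finds this gives 95.0% of the row (all bronze spearheads), 70.4% of the column (all stray finds) and 47.5% of the whole table, which is why it matters to say which total a percentage refers to.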


                       Context
                       Stray find   Settlement   Burial   Group Total
Material   Bronze          19            1                     20
           Iron             8            5          7          20
Group Total                27            6          7          40

Table 3.4. A bivariate frequency table (two-way contingency table), context, <CON>,
by material, <MAT>.

If the variables are nominal or ordinal then the categories to be used will be their
values (as in this case). If one or both of the variables are on a continuous scale then
decisions about grouping the values will have to be taken. Sometimes the grouping of
a variable will produce a contingency table with many empty cells (a sparse table).
Such a sparse table can cause problems if statistical tests are to be performed so a
common solution is to redefine the grouping to produce fewer groups with higher
frequencies. This is discussed in more detail in Chapter 11.

Table 3.4 shows a 2 by 3 contingency table since it has 2 rows and 3 columns. The
row and column sub-totals (20, 20, 27, 6 and 7) are called the marginal frequencies
and the table total (40) is also shown. As with univariate frequency tables,
proportions can be shown by converting the frequencies to percentages. Three
different percentages can be calculated as shown in Table 3.5.

                                Context
                                Stray find   Settlement   Burial    Group Total
Material      Bronze   Count       19            1                      20
                       Row %       95.0%         5.0%                  100.0%
                       Col %       70.4%        16.7%                   50.0%
                       Table %     47.5%         2.5%                   50.0%
              Iron     Count        8            5           7          20
                       Row %       40.0%        25.0%       35.0%      100.0%
                       Col %       29.6%        83.3%      100.0%       50.0%
                       Table %     20.0%        12.5%       17.5%       50.0%
Group Total            Count       27            6           7          40
                       Row %       67.5%        15.0%       17.5%      100.0%
                       Col %      100.0%       100.0%      100.0%      100.0%
                       Table %     67.5%        15.0%       17.5%      100.0%

Table 3.5. A contingency table showing row, column and table percentages.

Each cell contains its frequency as a percentage of the row, the column and the whole
table as indicated by the 'within cell order'. This shows the power of contingency
tables in being able to present a lot of information quickly and simply. For example,
Table 3.5 shows us amongst many other things that 67.5% of all our spearheads are
stray finds, 70.4% of all stray finds are bronze and that 95% of all bronze spearheads
are stray finds!

3.4 Pictorial displays for nominal and ordinal data.
3.4.1 The bar chart.
The bar chart is the most popular method of representing categorical data, it is
sometimes called a 'bar diagram' or a 'block diagram'. The categories of the variable
are positioned along the horizontal axis and a measure of popularity is the scale of the
vertical axis.

A bar chart is a graphic version of a frequency table. The vertical scale can be in
frequencies or percentages (in which case it is a percentage bar chart). If percentages
are used the frequency for each bar should also be shown.

[Figure 3.1: bar chart. Horizontal axis: Condition (Excellent, Good, Fair, Poor).
Vertical axis scaled 0 to 50. Bars labelled with the frequencies 8, 18, 9 and 5.]

Figure 3.1. A vertical percentage bar chart for condition, <COND>.

The bars should be of the same width with each one separated by a gap to show that
the variable is categorical and not continuous. It is quite acceptable to reverse the two


axes and produce a horizontal bar chart with the bars horizontal. Figure 3.1 shows a
vertical percentage bar chart with frequencies stated.

There are two variations of bar charts that allow the representation of two variables in
one diagram. These are graphical equivalents to bivariate frequency tables. If we
wanted to see how the condition of spearheads varied according to material we could
use either a multiple bar chart or a compound bar chart.

Figure 3.2 shows a percentage multiple bar chart. Notice that the bars for the two
categories of <MAT> are drawn together for each category of <COND>.

[Figure 3.2: multiple bar chart. Horizontal axis: Condition (Excellent, Good, Fair,
Poor). Vertical axis: Percent, 10.0% to 60.0%. Key: Material (Bronze, Iron).]

Figure 3.2. A vertical percentage multiple bar chart for condition, <COND> and
material, <MAT>.

Figure 3.3 shows a compound bar chart. The bar for each category of <COND> is a
total proportion (the same as Figure 3.1) divided according to the values of <MAT>.
Both multiple and compound bar charts require some form of shading to represent the
categories of the second variable together with an appropriate key.

If there are many categories (more than three or four) for the second variable,
compound bar charts can become difficult to interpret and the multiple version may
well be superior. Both Figures 3.2 and 3.3 are relatively simple and show the better
preservation of bronze spearheads compared to iron.

[Figure 3.3: compound bar chart. Horizontal axis: Condition (Excellent, Good, Fair,
Poor). Vertical axis: Percent, 0 to 100. Key: Material (Iron, Bronze).]

Figure 3.3. A vertical percentage compound bar chart for condition, <COND> and
material, <MAT>.

3.4.2 The pie chart.
A pie chart is a circular diagram divided into sectors where each sector represents a
value of a categorical variable. Each sector is proportional in size corresponding to the
frequency (or percentage) value of that category. Figure 3.4a shows a percentage pie
chart for the condition of the spearheads (equivalent to Figure 3.1).

Calculation:
When drawing a pie chart it is the angle at the centre which is proportional for each
category. The proportion of 360 degrees can be calculated in the following way using
the figures for Figure 3.4:


Excellent = 20%   = (20/100) x 360   =  72 degrees.
Good      = 45%   = (45/100) x 360   = 162 degrees.
Fair      = 22.5% = (22.5/100) x 360 =  81 degrees.
Poor      = 12.5% = (12.5/100) x 360 =  45 degrees.

A protractor can then be used to draw the pie chart. If some of the sectors are so small
in size that the labelling will not fit within them a system of shading and a key can be
used.

[Figure 3.4: exploded pie chart of <COND>. The one legible sector label is Good,
18 (45.0%).]

Figure 3.4. A pie chart showing proportion categories of condition, <COND>,
exploded to emphasise 'excellent'.

If the purpose of the pie chart is to emphasise one particular sector this can be
achieved by pulling out or 'exploding'. Figure 3.4 shows an exploded pie chart
which focuses attention on the proportion of spearheads in excellent condition.

3.5 Pictorial displays for continuous data.
3.5.1 The histogram.
A histogram is the pictorial equivalent of the grouped frequency table; it displays a
continuous variable that has been divided into classes. As with bar charts, histograms
can be horizontal or vertical although the latter is much more usual (as in Figures 3.5
to 3.7).

There is a fundamental difference between a bar chart and a histogram. In a bar chart
each block represents a category and block widths are equal so that frequency is
measured by the height of each block. In a histogram the width of each block is
proportional to the class interval (which need not be constant) and it is the area of each
block that measures the frequency. It is usual to have equal class intervals but the
choice of width can affect the appearance.

[Figure 3.5: histogram. Horizontal axis: Length of socket (cm), class midpoints
marked from 2.0 to 15.0. Vertical axis scaled 0 to 10.]

Figure 3.5. A histogram of socket length, <SOCLE>, with a 1 cm interval.

Figure 3.5 shows a histogram of <SOCLE> with a 1.0 cm class width and class
midpoints marked. Notice that adjacent blocks touch to indicate a continuous variable.

Notice also the relationship between the accuracy of measurement and class width.
The interval 4.6 to 5.5, for example, is strictly from 4.55 to 5.55 since any socket
length whose true value is 5.53 would have been recorded as 5.5 and a true value of
4.56 as 4.6. Because of this the class width is 1.0 exactly.
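
The link between recording accuracy and class width, as in the interval 4.6 to 5.5 being strictly 4.55 to 5.55, can be checked with a short calculation. The Python sketch below is illustrative only: the function name `true_boundaries` and its `precision` parameter are assumptions introduced here. It uses exact decimal arithmetic so that the boundaries are not blurred by binary floating point.

```python
from decimal import Decimal


def true_boundaries(lowest, highest, precision="0.1"):
    """Return the true lower and upper limits of a class interval.

    lowest and highest are the smallest and largest recorded values in the
    class; precision is the recording accuracy (a hypothetical parameter
    name, not taken from the text).
    """
    half_step = Decimal(precision) / 2
    return Decimal(lowest) - half_step, Decimal(highest) + half_step


lower, upper = true_boundaries("4.6", "5.5")
width = upper - lower
print(lower, upper, width)
```

The same function applied to the recorded interval 1.3 to 1.7 gives limits of 1.25 and 1.75, the 0.50 cm classes used for the grouped frequency table of <MAXWI>.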


Figure 3.6 shows the same data as Figure 3.5 but with the class widths changed so that
they are each of width 4 cm. Some of the details have been hidden but the overall
shape is still clear.

[Figure 3.6: histogram. Horizontal axis: Length of socket (cm), class midpoints
marked at 2.0, 6.0, 10.0 and 14.0. Vertical axis scaled 0 to 30.]

Figure 3.6. Figure 3.5 redrawn with different class intervals.

Because the values of the class width and the class midpoint for a histogram are under
the control of the analyst, the use of histograms must be approached with caution.
They are an exploratory tool which can produce many different results from the same
data set simply by varying the class midpoint and/or the class width.

Figure 3.7 shows a histogram of <SOCLE> as in Figure 3.5 but with a class width of
3.0 cm instead of 1.0 cm and different midpoints as marked.

There are differences between Figures 3.5, 3.6 and 3.7 reinforcing the exploratory
nature of histograms. It is probably a little naive and can certainly be misleading to
produce just one histogram and accept it as the only interpretation of the data.

[Figure 3.7: histogram. Horizontal axis: Length of socket (cm), legible class
midpoints at 1.5, 4.5, 7.5 and 10.5. Vertical axis scaled 0 to 30.]

Figure 3.7. Figure 3.5 redrawn again with different class intervals and different
midpoints.

3.5.2 The stem-and-leaf plot - an alternative histogram?
The stem-and-leaf plot (or stem-and-leaf display or stemplot) is a relatively new type
of diagram which forms part of the approach known as Exploratory Data Analysis
(EDA). It is similar in many ways to the histogram but has one important advantage.
The stem-and-leaf plot displays the actual data values whereas a histogram displays
only the frequencies of each class. Stem-and-leaf plots are designed for interval and
ratio data.

Using the same data as Figure 3.5 (<SOCLE> which is column 9 in Table 1.1) we can
see in Figure 3.8 how a stem-and-leaf plot is built up.

Each data value is split into two parts: a stem and a leaf, in this case the digits before
the decimal point are the stem and those after the point are the leaf. The stem values
are listed once only to the left of the vertical line and the leaves are added to their


appropriate stems. Figure 3.8 shows the first ten values in column 9 of Table 1.1
plotted as stems and leaves.

stem | leaves
   2 |
   3 | 11
   4 | 56
   5 | 28
   6 | 61
   7 | 8
   8 |
   9 | 2
  10 |
  11 |
  12 |
  13 |
  14 |

Figure 3.8. The beginnings of a stem-and-leaf plot for socket length, <SOCLE>.

Figure 3.9 shows the diagram completed with all 39 values plotted. Notice that the leaf
values have also been ordered within each stem to convey extra information.

stem | leaves
   2 | 4
   3 | 01114456
   4 | 2355668
   5 | 124589
   6 | 0166
   7 | 258
   8 | 01467
   9 | 26
  10 | 2
  11 |
  12 |
  13 | 5
  14 | 4

Figure 3.9. The completed stem-and-leaf plot for socket length, <SOCLE>.

The advantage over histograms is immediately apparent because the original data
values are recoverable from the stem-and-leaf plot. It can be seen that the distribution
is biased towards the lower end of the scale, the distribution has gaps (stems without
leaves) and that the two values of 13.5 and 14.4 are high outliers.

One decision to be made when constructing a stem-and-leaf plot is the size of the leaf
unit. In Figures 3.8 and 3.9 the leaf unit is 0.1 and this indicates the units of the data
values. Decimal points are not used in stem-and-leaf plots which means that the
numbers 3500, 350, 35, 3.5 and 0.35 would all be split into a stem of 3 and a leaf of 5.
The differences are indicated in the Leaf Unit statement as follows:

3,500    Leaf Unit = 100
350      Leaf Unit = 10
35       Leaf Unit = 1
3.5      Leaf Unit = 0.1
0.35     Leaf Unit = 0.01

The choice of leaf unit will depend on the particular application and on the range of
values to be displayed. Complications can arise if a leaf contains more than two digits.
The number 583, for example, may end up as a stem = 5 and a leaf = 8 with the three
being dropped (leaf unit = 10).

Stem-and-leaf plots can become unwieldy with large data sets although it is possible
to increase the number of horizontal lines per stem. For example, one stem value could
have the leaf values 0 to 4 and 5 to 9 on a line each.

3.5.3 The ogive - the total so far.
The ogive (or cumulative frequency graph) is a graphical technique for showing
cumulative frequencies as described in section 3.3.1. An ogive takes the form of a
graph with the values on the horizontal axis representing the stated value and all
values below. The vertical axis can be scaled in frequencies or percentages (or both).
Ogives can be used for grouped data and for actual values of continuous data.

Figure 3.10 shows the same data as used in Figure 3.5, socket length, <SOCLE>, with
frequencies as follows:

Interval (cm)     Frequency   Cumulative
                              Frequency
 1.55 -  2.55         1            1
 2.55 -  3.55         6            7
 3.55 -  4.55         4           11
 4.55 -  5.55         8           19

 5.55 -  6.55          5             24
 6.55 -  7.55          3             27
 7.55 -  8.55          5             32
 8.55 -  9.55          3             35
 9.55 - 10.55          2             37
10.55 - 11.55          0             37
11.55 - 12.55          0             37
12.55 - 13.55          0             37
13.55 - 14.55          2             39

For any value along the horizontal axis the corresponding point on the vertical axis shows how many are less than or equal to that value. In Figure 3.10, for example, 7 spearheads have a socket length of less than 3.55 cm and 37 less than 11.55 cm.

Figure 3.10. An ogive of socket length, <SOCLE>. (Horizontal axis: socket length (cm), 0 to 16; vertical axis: cumulative frequency, 0 to 40.)

The shape of the curve betrays the nature of the distribution of the variable. The curve of an ogive never falls: it either rises or stays level from left to right. Large class frequencies (large jumps between consecutive cumulative frequencies) will produce a steep section of curve whereas small accumulations will result in a flat section. It is conventional to join the points of the curve on an ogive with straight lines.

The ogive allows the rapid assessment of some useful characteristics of a distribution. In Figure 3.10, for example, we can see that 50% of all the spearheads have a socket length between 2.0 and 5.0 cm and the other 50% between 5.0 and 14.0 cm. More formally, the Median, Quartiles and Percentiles of a distribution can be calculated from the ogive; these are described in Chapters 4 and 5.

3.5.4 The scatterplot - displaying two variables.
Most of the techniques described so far in this chapter refer to the display of a single variable; the exceptions are the bivariate frequency table and the multiple and compound bar charts. The scatterplot (or scattergram, scattergraph or scatter diagram) allows the plotting of the values of one variable against another variable.

Scatterplots provide a quick and easy visual estimate of the relationship (or correlation) between the two variables. It is essential that the two variables are paired; they must be two attributes of the same item or case. We can go further than this and state that they must be paired in an archaeologically meaningful way. For spearheads, the <MAXLE> and <AGE OF FINDER> (if available!) are paired variables which may be correlated although the relationship would be difficult to explain in archaeological terms, whilst <MAXLE> and <MAXWI> could well have a meaningful relationship.

Scatterplots can be drawn for variables measured at the ordinal, interval and ratio levels. A scatterplot takes the form of a graph where a horizontal (x) axis and a vertical (y) axis define an area of two-dimensional space. The axes are scaled according to the range of values for the variable each represents. It is standard practice for the points of lowest measurement to meet in the bottom left-hand corner. Each axis should be labelled with the variable name and unit of measurement.

As the two variables are paired the items will be positioned in the two-dimensional space according to their values on the two axes. Points are marked with an appropriate symbol and not joined. Figure 3.11 shows a scatterplot of <MAXLE> and <MAXWI> from Table 1.1. This scatterplot shows a positive association or correlation between the width and length of spearheads suggesting that longer ones tend to be wider.

Figure 3.11. Scatterplot of maximum width, <MAXWI>, and maximum length, <MAXLE>, by material, <MAT>. (Horizontal axis: maximum length (cm), 0 to 80; iron and bronze spearheads are plotted with different symbols.)

As in Figure 3.11, it is possible to introduce a third variable into a scatterplot. This must be categorical so that each category can be represented by a different symbol; in Figure 3.11 the two categories of <MAT> are shown. If too many different symbols are displayed on the same plot it can become confusing and difficult to interpret; three or four is the maximum. If two points fall on exactly the same position they are shown by a 2 on the diagram (or a 3 for three points etc.).

It is also possible to label each item on the plot for identification. In Figure 3.11 the unique value in column 1 of Table 1.1 could be displayed next to each point. Again, though, care must be taken not to overcrowd the diagram.

From a scatterplot it is possible to get a quick visual estimate of the correlation between the two variables displayed. This could be a positive or negative linear correlation, a non-linear correlation or a zero correlation. This visual estimate often forms the first stage of a more formal test of correlation and significance using one of the correlation coefficients that are available. These are explained in detail in Chapter 10.

Outliers such as the one large bronze spearhead in Figure 3.11 are immediately visible in a scatterplot.

The distributions could break down into different size groups which will often show as clusters of points in a scatterplot suggesting a classificatory line of enquiry. If we decided that the maximum width and the ratio of socket length to maximum length were significant enough variables to base a simple typology of spearheads on, a clustered result from a scatterplot would indicate classes or 'types'.

Points to remember:
Methods of tabular and pictorial display are some of the most important ways of presenting archaeological data and results, but only if they are capable of interpretation by the reader! Keep them simple, clear and uncluttered. Include information on the raw data where possible.

All of these methods are EXPLORATORY in nature. Use them in different ways on different variables to extract information from the data which could be of interest. It is often dangerous to just do one analysis and present the result as 'THE ANSWER'.

CHAPTER 4

MEASURES OF POSITION - THE AVERAGE

4.1 Introduction.
One of the less contentious uses of statistics is to condense and describe large bodies of data in a precise manner. Looking at the raw data in Table 1.1 it is impossible to get an immediate understanding of the spearheads because there is too much detail. What is an average spearhead? How many are larger or smaller than average? The tabular and pictorial displays of the last chapter go some way towards summarising and making sense of the data-set but it is possible to do more, and to be yet more precise.

Although the term 'average' is often used it is, in fact, very imprecise. When most people talk of the average (add up all the values and divide by the number of values) they are actually referring to the mean. There are two other common measures of position or average which are useful in archaeology: the mode and the median. It is important to use the correct term for the particular type of 'average' being used. All three measures have different advantages and disadvantages; the most suitable can depend on the level of measurement of the variable being used (see Chapter 1).

4.2 The Mode.
The mode is the only measure of position that can be used for nominal data. It can be used for variables measured at any level although interval and ratio variables are usually grouped.

The mode of a distribution is that value that occurs the most, i.e. it is the most popular, the most fashionable, it has the highest frequency.

Figure 4.1 shows a barchart of the ordinal variable <COND>: there are 8 spearheads in excellent condition, 18 are good, 9 fair and 5 poor. Value 2 (good) is the modal class; it is simply the most popular.

Figure 4.1. The condition, <COND>, of the spearheads showing the modal class. (Bar chart of percentage by condition: Excellent, Good, Fair, Poor.)

Figure 4.2 shows a histogram of the ratio variable <LOSOC>. The values have been grouped with a class interval of 0.1 cm.

Note that there are two classes with the highest frequency of 5, 1.65 to 1.75 and 2.35 to 2.45. This distribution is, therefore, bimodal and the two modes can be estimated to be 1.7 and 2.4. If there had been three modes it would be trimodal, etc.

Because the mode is a relatively simple statistic there are problems with it. It is an unstable measure and can swing wildly by the alteration of only a few values. Figure 4.3 shows a histogram of <DATE>: 300 to 400 BC is the modal class with a frequency of 8. It would only take one more in the 800-900 group to make the mode very different. Also, the mode is not sensitive to frequencies in any of the other class intervals. They could all have values of 1 or they could all have values of 7; the mode would not alter.

It must be remembered when using a grouped interval or ratio variable that class intervals and midpoints can drastically influence the mode.

Despite these problems with the mode it is still often useful to know the 'typical' or 'most popular' value in a distribution. If the variable is nominal then there is no alternative; to speak of the 'average' is to use the mode.

The mode is also a useful measure if the distribution is asymmetrical (skewed) rather than symmetrical (see section 4.5).
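The influence of the class interval on the mode can be checked with a short Python sketch. It uses the 40 <DATE> values listed in section 4.3; the function name modal_classes and the convention that a class includes its lower bound but not its upper bound are assumptions made for this illustration, not something taken from the text.

```python
from collections import Counter

# The 40 <DATE> values (years BC) as listed in section 4.3.
DATES = [50, 100, 150, 200, 300, 300, 300, 350, 350, 350, 350, 350,
         400, 400, 400, 450, 450, 450, 600, 600, 700, 700, 700,
         800, 800, 800, 800, 800, 800, 800, 900, 900,
         1000, 1000, 1000, 1200, 1200, 1200, 1200, 1200]


def modal_classes(values, width):
    """Group values into classes [k*width, (k+1)*width) and return
    (lower bounds of the modal class or classes, their frequency).
    Hypothetical helper; the half-open class convention is an assumption."""
    counts = Counter((v // width) * width for v in values)
    top = max(counts.values())
    return sorted(lo for lo, c in counts.items() if c == top), top


# A class interval of 100 years reproduces the modal class quoted above.
print(modal_classes(DATES, 100))
# Doubling the class interval changes the answer.
print(modal_classes(DATES, 200))
```

With an interval of 100 years the single modal class starts at 300 with a frequency of 8, as stated above. With an interval of 200 years the classes starting at 200 and at 800 tie with a frequency of 9, so the same data would be described as bimodal.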

Figure 4.2. The bimodal distribution of socket width, <LOSOC>. (Histogram; horizontal axis: width of lower socket (cm); vertical axis: frequency.)

4.3 The median.
The median (from the Latin for 'middle') of a distribution is that value which cuts the distribution in half. One half of the values will be larger than the median and the other half smaller.

The median can be calculated for variables that are ordinal or higher but not for nominal variables. It is most suitable for ordinal variables.

Calculation:
List the values in order, for example the variable <DATE> as shown in Figure 4.3:

50, 100, 150, 200, 300, 300, 300, 350, 350, 350, 350, 350, 400, 400, 400, 450, 450, 450, 600, 600, 700, 700, 700, 800, 800, 800, 800, 800, 800, 800, 900, 900, 1000, 1000, 1000, 1200, 1200, 1200, 1200, 1200.

Figure 4.3. A histogram of date, <DATE>, showing the modal class. (Horizontal axis: date BC, 0 to 1200; vertical axis: frequency.)

If the number of values is even (40 in this case) the median (abbreviated to Md) is halfway between the middle two,

$\mathrm{Md} = \dfrac{\text{20th value} + \text{21st value}}{2} = \dfrac{600 + 700}{2} = \dfrac{1300}{2} = 650$

If the number of values is odd the median will be an actual value. Suppose the first spear in the list was not dated leaving only 39 values, the median is now the 20th value 600. There are 19 values above and 19 below the 20th value.

Thirty-eight of the spearheads have a measurement for the variable <MAXLE>; the median is 17.8. It will be seen from Table 1.1 that although 17.8 is not an actual value there are 19 above and 19 below it.
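The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name median is chosen here, and the three-value list used for the odd case is hypothetical rather than taken from Table 1.1.

```python
# The 40 <DATE> values (years BC) as listed above.
DATES = [50, 100, 150, 200, 300, 300, 300, 350, 350, 350, 350, 350,
         400, 400, 400, 450, 450, 450, 600, 600, 700, 700, 700,
         800, 800, 800, 800, 800, 800, 800, 900, 900,
         1000, 1000, 1000, 1200, 1200, 1200, 1200, 1200]


def median(values):
    """Middle value of the ordered list, or halfway between the
    middle two values when the number of values is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd: an actual value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: mean of middle two


print(median(DATES))      # even case: halfway between the 20th and 21st values
print(median([3, 1, 2]))  # odd case on a hypothetical list
```

For the <DATE> values the function returns 650, agreeing with the hand calculation.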

The median can also be calculated from the ogive (see Chapter 3.5.3) by reading off the value of the variable (horizontal axis) that corresponds to half of the total frequency (vertical axis).

As with the mode, changes in just one or two values can have an effect on the median. In the <DATE> example above changing just the 20th value to 700 would cause the median to also change to 700.

The median, however, has the advantage of not being sensitive to occasional extreme values (outliers) which can seriously influence the mean (see below).

4.4 The mean.
Strictly speaking the mean described here is the arithmetic mean. There are other means such as the 'harmonic mean' and the 'geometric mean' but they are infrequently used in the social sciences and will not be detailed here.

The mean is the most common form of average and can be used on interval or ratio data but not nominal or ordinal.

The mean is the most important measure of position because a lot of further statistical analyses are based on it. Much standard statistical theory is devoted to testing means and the variation about the mean.

Calculation:
The usual notation for the mean of a variable x is $\bar{x}$ (x bar).
Sum the values and divide by the number of values.

Formula: $\bar{x} = \dfrac{\sum x}{n}$

where:
$\sum$ (sigma) = the sum of
x = the individual value
n = the total number of values.

Using the variable x = <MAXLE>:

$\bar{x} = 785.46 / 38 = 20.67$

Note that calculating a mean usually produces an answer to several decimal places, especially when using a calculator. Always round the answer to a sensible level of accuracy when quoting it. In this instance the level of accuracy is meaningful but it may not always be so. Values that are recorded solely as integers could produce means to two or three decimal places; check that this is archaeologically sensible.

The mean is truly representative of a distribution if the values are grouped closely around a central value. It is sensitive to all values in the distribution, however, and can be very misleading. If the distribution is widely spread, unevenly distributed, has groups towards the extremes or even just occasional outliers, the mean on its own may not be a good measure of position or average.

4.5 Comparing the mode, median and mean.
It is important to understand that the mode, median and mean are three quite different measures of position which can give three different values when applied to the same data-set. The logic behind their calculation is different as they are measuring different qualities of the same distribution.

Figure 4.4. Symmetrical and asymmetrical distributions. (Three panels: a symmetrical distribution with the mean, mode and median marked; b) positive skew and c) negative skew, each with the median marked.)
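As a hedged illustration of three measures giving three different values, the sketch below applies Python's standard statistics module to the 40 <DATE> values listed in section 4.3. Taking the mode from the ungrouped values is an assumption of this sketch, so it differs from the modal class of 300 to 400 BC quoted in section 4.2.

```python
import statistics

# The 40 <DATE> values (years BC) as listed in section 4.3.
DATES = [50, 100, 150, 200, 300, 300, 300, 350, 350, 350, 350, 350,
         400, 400, 400, 450, 450, 450, 600, 600, 700, 700, 700,
         800, 800, 800, 800, 800, 800, 800, 900, 900,
         1000, 1000, 1000, 1200, 1200, 1200, 1200, 1200]

mean = statistics.mean(DATES)        # sum of the values divided by n
median = statistics.median(DATES)    # halfway between the 20th and 21st values
modes = statistics.multimode(DATES)  # most frequent ungrouped value(s)

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```

The three results are a mean of 635, a median of 650 and a mode of 800, so quoting any one of them as 'the average' date would give a different impression of the same data.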

All three measurements are sensitive to the symmetry (or skewness) of the distribution. Figure 4.4 shows three hypothetical distributions: a) is symmetrical whereas both b) and c) are asymmetrical; b) is positively skewed and c) is negatively skewed.

All three measurements are read from the horizontal axis.

In the symmetrical distribution the mode, median and the mean all have the same value. Note that in both of the skewed distributions the three values are different with the mode at the 'highest' point, the mean towards the tail of the distribution and the median in between.

The mean can be affected by a few low scores in a negative skew or by a few high scores in a positive skew. In both cases it is not a good measure of position and if used alone would not accurately describe the distribution. For skewed distributions it is advisable to use all three measures as shown in Figures 4.5 and 4.6.

Width of lower socket
Class interval (cm)     f     cf
1.5 - 1.7              11     11
1.7 - 1.9               6     17
1.9 - 2.1               4     21
2.1 - 2.3               8     29
2.3 - 2.5               5     34
2.5 - 2.7               3     37
2.7 - 2.9               3     40

Figure 4.5. Frequency table, histogram and ogive for socket width, <LOSOC>. (Histogram and ogive; horizontal axis: width of lower socket (cm); the mode and median are marked.)

Figure 4.5 shows the frequency table, histogram and ogive for the variable <LOSOC> with a class interval of 0.2 cm (1.5-1.7 is really 1.5-1.699 etc.).

The distribution is fairly symmetrical with the following measures:

modal class    1.5 to 1.7 (really 1.5-1.699)
mode           1.64
median         2.05
mean           2.05

The modal class is 1.5 to 1.7 and the mode could be taken to be the middle of this interval. A more accurate estimate of the mode is obtained by using the simple 'cross' construction shown in Figures 4.5 and 4.6.

Maximum length
Class interval (cm)     f     cf
10 - 20                23     23
20 - 30                10     33
30 - 40                 4     37
40 - 50                 0     37
50 - 60                 0     37
60 - 70                 0     37
70 - 80                 1     38

Figure 4.6. Frequency table, histogram and ogive for maximum length, <MAXLE>. (Histogram and ogive; horizontal axis: maximum length (cm); the mode and median are marked.)

Notice that when <LOSOC> was discussed earlier (Figure 4.2) an interval of 0.1 cm was used and the distribution was bimodal. There are no absolutely definitively correct
methods of describing data; different approaches may produce different results, which is why it is important to always state or define the techniques being used.

In contrast to Figure 4.5, Figure 4.6 shows a histogram of the <MAXLE> of the spearheads which is strongly positively skewed.

There is one very high value and several quite high values which are stretching the distribution in one direction to produce the following measures:

modal class    10-20 (really 10-19.999)
mode           16.0
median         17.8
mean           20.7

In these circumstances it could be misleading to quote just one measure as the 'average' length of the spearheads; all three together give a more accurate description.

Points to remember:
Be precise about which 'average' you are using. Depending on the level of measurement of the variable under investigation, try as many of the three methods as possible. Compare the results.

It is always useful to 'visualise' the data by using the simple graphical methods, as demonstrated here, rather than just looking at numbers.

CHAPTER 5

MEASURES OF VARIABILITY - THE SPREAD

5.1 Introduction.
Using methods described in Chapters 3 and 4 we can now display the distribution of a variable and give a measure of its central tendency in the form of a single value, its 'average' value. These alone are not enough to adequately describe a distribution as is shown in Figure 5.1:

Figure 5.1. The spread of a distribution. (Two hypothetical distributions with the same mean but different spreads.)

These two hypothetical distributions both have the same mean (and median and mode) but it is immediately obvious that they are very different. One has a wide spread of values while the other has values which are much more clustered around the mean. This chapter describes the main ways of quantifying the spread, or variation of a distribution, called measures of dispersion or measures of variability.

Measures of dispersion only apply to interval or ratio data.

5.2 The range.
The range measures the total spread of the distribution. It is a simple measure and is of limited use.

Calculation:
The range is calculated by subtracting the minimum value from the maximum value.

Example:
The variable <MAXLE> has a maximum value of 72.4, a minimum of 10.2 and a range of 62.2.

Because the range is such a simple measure there are problems with it. Like the mean, it is seriously affected by outliers (single extreme values). The <MAXLE> of the spearheads has an outlier with a value of 72.4 (look back to Figure 4.6). If this one value was removed the range now becomes 38.0 - 10.2 = 27.8. The removal of this one value has altered the range from 62.2 to 27.8, a drop of 34.4 points.

The range can clearly only be used as a sensible measure of dispersion when all the values are clustered together. It gives the impression of an evenly spread distribution despite the presence of outliers.

5.3 The quartiles.
Another form of range, and one that eliminates the problems associated with outliers, is the inter-quartile range and its associated statistic the quartile deviation.

The quartiles are the three values in a distribution that partition it into four parts with an equal number of values in each part. They are usually referred to as Q1, Q2 and Q3 so that 25% of the values are less than Q1, 50% are less than Q2 and 75% are less than Q3 (Q2 is also the median).

Using this same concept, the points that divide a distribution up into one hundred equal parts are called the percentiles. If a value falls on the 73rd percentile, for example, we know that 73% of the distribution is less than that value. During the discussion of probability and hypothesis testing in Chapters 6 and 7 the importance of the 5th and 95th percentiles will be shown. Occasionally deciles are used, with an obvious interpretation. The median, Q2, 50th percentile and 5th decile are all different ways of describing the same value.

The inter-quartile range is the difference between Q1 and Q3 and the quartile deviation is half of this, i.e. the deviation around the median.

Calculation:
The quartiles can be calculated in two ways:

1. Referring back to the method for calculating the median (Chapter 4.3), the data values are listed in increasing order and the list is then quartered. For the variable <DATE> this will produce the following quartiles:

Q1 = 350
Q2 = 650 (the median)
Q3 = 850

Inter-quartile range = 850 - 350 = 500

Quartile deviation = 500 / 2 = 250

2. Draw an ogive (method as in Chapter 3.5.3). Draw horizontal lines from the 25%, 50% and 75% points on the vertical axis; when they hit the ogive line drop vertically and read off the values on the horizontal axis. Figure 5.2 shows the same ogive as in Figure 3.10 with quartiles calculated.

Figure 5.2. Calculating quartiles from an ogive for socket length, <SOCLE>. (Horizontal axis: socket length (cm); vertical axis: cumulative frequency; the quartiles are read off as Q1 = 4.3, Q2 = 5.6 and Q3 = 8.0.)

The inter-quartile range is then found by subtracting Q1 from Q3. The quartile deviation is found by subtracting Q1 from Q3 and dividing by 2.
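The first method, quartering the ordered list, can be sketched in Python as follows. The sketch takes Q1 and Q3 as the medians of the lower and upper halves of the ordered values, which is one common convention and an assumption here; other conventions, including some used by statistical packages, can give slightly different quartiles. The function names are illustrative.

```python
# The 40 <DATE> values (years BC) as listed in section 4.3.
DATES = [50, 100, 150, 200, 300, 300, 300, 350, 350, 350, 350, 350,
         400, 400, 400, 450, 450, 450, 600, 600, 700, 700, 700,
         800, 800, 800, 800, 800, 800, 800, 900, 900,
         1000, 1000, 1000, 1200, 1200, 1200, 1200, 1200]


def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as the medians of the
    lower and upper halves (the middle value is left out of both
    halves when the number of values is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (median_of_sorted(lower),
            median_of_sorted(ordered),
            median_of_sorted(upper))


q1, q2, q3 = quartiles(DATES)
print("Q1, Q2, Q3:", q1, q2, q3)
print("inter-quartile range:", q3 - q1)
print("quartile deviation:", (q3 - q1) / 2)
```

For the <DATE> values this gives quartiles of 350, 650 and 850, an inter-quartile range of 500 and a quartile deviation of 250, matching the figures quoted in the text.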

Formulae:

Inter-quartile range = $Q_3 - Q_1$

Quartile deviation = $\dfrac{Q_3 - Q_1}{2}$

The relationship between the quartiles and the range is summarised in Figure 5.3.

Figure 5.3. The quartiles and the range. (The ranked values run from the minimum to the maximum, which define the range; Q1, Q2 (the median) and Q3 divide them into four groups, each containing 25% of the values.)

Example:
If we calculate the above for <MAXLE> of the spearheads categorised by the two values of <MAT> we get the following:

Bronze:
Min     Max     Range    Q1       Q3       Int-Quart Range    Quart Dev
10.2    36.6    26.4     13.12    24.17    11.05              5.53

Iron:
Min     Max     Range    Q1       Q3       Int-Quart Range    Quart Dev
10.2    72.4    62.2     13.07    26.25    13.18              6.59

The effects of the one large iron spearhead are obvious when the two values for the range are compared. The similarity between the two inter-quartile ranges, however, shows that there is not much difference between the variation in the maximum length of bronze and iron spearheads when all of the values are considered rather than just the two extremes. Figure 5.4 shows histograms of the two distributions with class intervals of 5 cm and class midpoints as marked.

Figure 5.4. Distributions of the maximum length, <MAXLE>, by the two categories of material, <MAT>. (Two frequency histograms, one for iron and one for bronze; horizontal axis: maximum length (cm).)

5.4 The mean deviation.
Another measure of dispersion is the mean deviation. This is also more reliable than the range because it is calculated using every value rather than just the two extremes. It also differs from the inter-quartile range and the quartile deviation because it uses every value rather than just the values at certain rank positions.

The mean deviation is a measure of how much each value deviates from the mean; it would, in fact, be more accurate to call it the mean of the deviations from the mean.

Calculation:
Calculate the difference between each value and the mean (ignoring the sign of the difference). Total the differences and divide by the number of values. This gives the mean of the differences which is the Mean Deviation.

Formula:

Mean deviation = $\dfrac{\sum |x - \bar{x}|}{n}$

where:
x = the individual value.
| | = absolute value, i.e. ignore minus signs so that |3| = |-3| = 3.
$\bar{x}$ = the mean of all the values.
$\sum$ = the sum of.
n = the number of values in the list.

Example:
For the two material categories of <MAXLE> of the spearheads the mean deviations are:

For Bronze:                          For Iron:
|11.4 - 18.68| = 7.28                |12.4 - 22.89| = 10.49
|16.6 - 18.68| = 2.08                |22.6 - 22.89| = 0.29
|10.2 - 18.68| = 8.48                |17.9 - 22.89| = 4.99
|18.6 - 18.68| = 0.08                |16.8 - 22.89| = 6.09
|24.4 - 18.68| = 5.72                |13.3 - 22.89| = 9.59
Etc. for all 20 values               Etc. for all 18 values

Total of differences = 108.52        Total of differences = 173.72

Mean deviation = 108.52 / 20         Mean deviation = 173.72 / 18
               = 5.43                               = 9.65

Comparing these results with the range and the inter-quartile range above it is the most sensitive of the three if all the values are to be taken into account.

5.5 The standard deviation.
You may have realised that in calculating the Mean Deviation each deviation from the mean is treated as being positive even though half of them are negative! This is rather inelegant and is overcome in the calculation of the Standard Deviation by squaring each deviation from the mean. This also has the effect of weighting in favour of the larger deviations thus giving a more realistic measure of the dispersion.

The Standard Deviation is the most used measure of dispersion. It is important, as is the mean, because it forms the basis of further statistical tests. This also applies to the variance (the Standard Deviation squared) although this measure tends not to be used as a measure of dispersion because it can be a very large number.

Calculation:
The Standard Deviation can be abbreviated to: S.D., s, or $\sigma$ (sigma, small s in Greek). Calculate the difference between each value and the mean. Square each difference. Total the squared differences to obtain the sum of squares. Divide the sum of squares by the number of values to obtain the variance. Square root the variance to find the Standard Deviation.

Formula:

S.D. (s) = $\sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$

The variance is usually called $s^2$ and has the following formula:

$s^2 = \dfrac{\sum (x - \bar{x})^2}{n}$

where:
x = the individual value.
$\sum$ = the sum of.
$\bar{x}$ = the mean of the values.
n = the number of values.
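The calculations of the mean deviation, the variance and the Standard Deviation can be sketched in Python as below. The eight values in SAMPLE are hypothetical and chosen only because they give round answers; they are not spearhead measurements. As in the formulae above, the sum of squares is divided by n, the number of values (the same convention as statistics.pstdev in the Python standard library), not by n - 1.

```python
from math import sqrt

# Hypothetical values, chosen only to give round results.
SAMPLE = [2, 4, 4, 4, 5, 5, 7, 9]


def mean(values):
    return sum(values) / len(values)


def mean_deviation(values):
    """Mean of the absolute deviations from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))


print("mean:", mean(SAMPLE))
print("mean deviation:", mean_deviation(SAMPLE))
print("variance:", variance(SAMPLE))
print("standard deviation:", standard_deviation(SAMPLE))
```

For these values the mean is 5, the mean deviation 1.5, the variance 4 and the Standard Deviation 2. The Standard Deviation is larger than the mean deviation because squaring gives extra weight to the two values furthest from the mean.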

Example:
For Bronze <MAXLE>:

$(\bar{x} - x)$                    $(\bar{x} - x)^2$
18.68 - 11.4 = 7.28        52.998 (7.28 squared)
18.68 - 16.6 = 2.08        4.326 (etc.)
18.68 - 10.2 = 8.48        71.910
18.68 - 18.6 = 0.08        0.006
18.68 - 24.4 = -5.72       32.718
Etc. for all 20 values     Etc.

Total of the squared differences = 968.832

Variance = 968.832 / 20 = 48.442

S.D. = $\sqrt{48.442}$ = 6.96

If this is applied to both of the material categories of <MAXLE> we get the following:

S.D. (Bronze) = 6.96
S.D. (Iron) = 14.85

Compared to the Mean Deviations calculated in the last section the Standard Deviations are quite different. The main difference is in the much greater value of the Standard Deviation for iron spearheads. This is because of the weighting that the Standard Deviation gives to values with larger deviations from the mean. The three large iron spearheads indicated in the histogram of Figure 5.4 are responsible for most of the variation in the maximum length of all iron spearheads.

It is worth remembering that the Standard Deviation never approaches anywhere near the range. As a rough rule of thumb when n = 10 the Standard Deviation will be about one third of the range and when n = 100 it will drop to about one fifth.

5.6 The coefficient of variation.
It is sometimes difficult to compare the spread of two or more distributions by just looking at the means and standard deviations. The actual values of these statistics could be of very different orders of magnitude. The coefficient of variation provides a comparative measure which, for measurements of this kind, usually falls on a scale from 0 to 1 (values remain positive, and V exceeds 1 only when the S.D. is larger than the mean) where:

towards 0 = a very narrow spread (small S.D.), and
towards 1 = a very wide spread (large S.D.)

Calculation:
The coefficient of variation is usually denoted as V. It is found by dividing the Standard Deviation by the mean.

Formula:

$V = \dfrac{\text{S.D.}}{\bar{x}}$

Example:
The following table shows the means, standard deviations and coefficients of variation for the two variables <MAXLE> and <MAWIT> by material category:

           Maximum length <MAXLE>          Socket end to maximum width <MAWIT>
           Mean     S.D.     V             Mean    S.D.    V
Bronze     18.68    6.96     0.37          8.63    2.82    0.33
Iron       22.89    14.85    0.65          9.87    4.34    0.44

One standard deviation of <MAXLE> for bronze spearheads is approximately one third of the mean (reflected in V = 0.37) whereas for iron spearheads the relative spread is greater at over one half (V = 0.65). The coefficient of variation, therefore, is a convenient way of expressing this comparison.

For iron spearheads the difference between the variability in the distance from the end of the socket to the maximum width and the variability in the maximum length is noticeable compared to the figures for bronze (V = 0.44 and V = 0.65, against V = 0.33 and V = 0.37). This suggests that iron spearheads have similar sized socket lengths regardless of their overall length; the half of the blade towards the tip is responsible for most of the variation in maximum length.

5.7 Standardisation.
Standardising values in a distribution is a similar concept to that just introduced with the coefficient of variation. It allows the comparison of values for different variables on a fixed scale. This is done by converting any value to a z-value (or z-score).

Consider a comparison between <MAXLE> and <MAWIT> for a bronze spearhead with a maximum length 20.0 cm and socket end to maximum width 13.0 cm. Comparing these values with the results given in the table above it can be seen that the
48 49
M. Fletcher and G.R. Lock Measures of variability

20.0 cm is a fairly typical length while 13 .0 cm is an unusually high measurement. An Minimum Q1 Median Q3 Maximum
objective way of measuring the 'typicality' of the two measurements is to convert
them to z-scores. It is convenient to think of z-scores as units of Standard Deviation in
relation to the position of the mean. A z-score of 1.0, therefore, is one S.D. away from
the mean. For most distributions an interval of 3 SDs each side of the mean contains
nearly all of the values, so z-scores usually fall within the range -3.0 to +3.0.

Formula:

$$z = \frac{x - \bar{x}}{s}$$

where $x$ is the raw or unstandardised value with mean $\bar{x}$ and standard deviation $s$.
Thus the standardised value for a maximum length of 20.0 is

$$z = \frac{20.00 - 18.68}{6.96} = 0.19$$

while for a socket end to maximum width of 13.0

$$z = \frac{13.00 - 8.63}{2.82} = 1.55$$

These results show that the value of 13.0 is 'relatively' larger or more extreme than
the value of 20.0. The latter is very close to the mean of the distribution whereas the
former, 13.0, is over one and a half S.D.s away from the mean of 8.63.

5.8 Boxplots.
Boxplots (sometimes called Box-and-Whisker plots) are a graphical representation of
a distribution using some of the concepts described above. A boxplot divides a
distribution according to the value of the inter-quartile range. Figure 5.5 illustrates a
hypothetical boxplot with the appropriate terminology.

Calculation:
On an appropriate scale for the distribution the median is plotted as are Q1 and Q3.
These are called the lower hinge and upper hinge respectively and the difference
between them is the h-spread; together they form the box. 50% of the values are within
the box. The whiskers run from each side of the box to the minimum and maximum
values as defined below.

Figure 5.5. A boxplot.

At a distance of 1.5 x h-spread above and below the edges of the box are the inner
fences. At a distance of 3 x h-spread either side of the edges of the box are the outer
fences.

Any values that fall between the inner and outer fences are considered to be possible
outliers, and any values that fall beyond the outer fences are probable outliers. The
minimum and maximum points are found by ignoring any outliers of either type.

Example:
Figure 5.6 shows boxplots for the variable <MAXLE> categorised by material.

The boxplots show the larger spread of the iron spearheads (longer whiskers) despite
the similarity in the size and position of the central parts of the two distributions (the
median and the boxes which contain 50% of the values). The distribution of bronze
spearheads is fairly symmetrical as shown by the central position of the median within
the box. The asymmetry of the iron distribution is shown by the off-centre position of
the median within the box. Remember that the S.D.s are 6.96 and 14.85 for bronze and
iron respectively and the difference between these is reflected in the boxplots. Note
that the iron distribution has one probable outlier at the higher end.
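The hinge and fence rules described for boxplots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `boxplot_summary` and the data are hypothetical, and the quartiles are taken from the standard library's `statistics.quantiles`, which may differ slightly from a hand calculation of Q1 and Q3.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Summarise a distribution using hinges, h-spread and fences."""
    data = sorted(values)
    # Assumption: quartiles come from statistics.quantiles (inclusive method);
    # a hand calculation of Q1 and Q3 may give slightly different hinges.
    lower_hinge, median, upper_hinge = quantiles(data, n=4, method="inclusive")
    h_spread = upper_hinge - lower_hinge

    # Inner fences lie 1.5 x h-spread beyond the box, outer fences 3 x h-spread.
    inner = (lower_hinge - 1.5 * h_spread, upper_hinge + 1.5 * h_spread)
    outer = (lower_hinge - 3.0 * h_spread, upper_hinge + 3.0 * h_spread)

    probable = [v for v in data if v < outer[0] or v > outer[1]]
    possible = [v for v in data
                if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])]
    # Whiskers end at the most extreme values that are not outliers of either type.
    kept = [v for v in data if inner[0] <= v <= inner[1]]

    return {
        "median": median,
        "hinges": (lower_hinge, upper_hinge),
        "h_spread": h_spread,
        "whiskers": (min(kept), max(kept)),
        "possible_outliers": possible,
        "probable_outliers": probable,
    }


# Hypothetical lengths in cm, invented purely for illustration.
lengths = [10, 12, 13, 14, 15, 16, 17, 18, 30, 60]
summary = boxplot_summary(lengths)
print(summary)
```

With these invented values the whiskers stop at 10 and 18, the value 30 falls between the fences (a possible outlier) and 60 lies beyond the outer fence (a probable outlier).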

M. Fletcher and G.R. Lock

Figure 5.6. Boxplots for maximum length, <MAXLE>, by material, <MAT>. (Horizontal axis: Maximum length (cm).)

Without a proper investigation of the dispersion or variability of a distribution no
meaningful comparisons or inferences can be made. Of all the different measures of
variation the standard deviation is certainly the most used and together with its close
relative the variance (SD = √variance) forms the basis for a great deal of statistical
inference (see Section 2). We must repeat the emphasis on exploring the data: use
different techniques and compare the results, and use graphical representations whenever
possible.

CHAPTER 6

AN INTRODUCTION TO PROBABILITY AND INFERENCE: DRAWING
CONCLUSIONS

6.1 Introduction.

Section 1 was concerned with methods of descriptive statistics; describing, presenting
and condensing data. These alone will often be enough to isolate trends and patterns
within the data enabling certain archaeological questions to be answered and
generating new questions to be asked.

This section goes one step further and introduces methods of formally testing patterns
within data. These statistical tests are referred to as 'inferential statistics' because they
are performed within a framework of hypothesis (or theory) testing and something is
inferred from the result. In a nutshell, a certain pattern within the data is tested and
found to be significant or not. Of course there are many different tests which can be
applied and different levels of significance.

Important note: This chapter, together with the next, covers the three important areas
of probability, inference and sampling. Both inference and sampling depend upon the
concepts of probability and although sampling comes before inference in practice,
here we discuss probability and inference together because of their close logical
relationship.

At the heart of all inferential statistics are the concepts of randomness and probability.
Many natural and artificial phenomena (including many archaeological data) are
random in the sense that they are not predictable in advance although they do exhibit
long term patterning. It is the study of these patterns (statistical distributions) which
will involve probabilistic (also called stochastic or random) models.

The mathematical theory of probability was started by the two French mathematicians
Blaise Pascal (1623-1662) and Pierre Fermat (1601-1665). Probability theory also
owes a great deal to the work in 1933 of the modern day Russian A.N. Kolmogorov.

6.2 Probability - measuring chance and risk
6.2.1 The concept of probability

Three important definitions:

Definition 1. Often called 'a priori' or Classical.
For a complete set of n equally likely outcomes of which r are favourable, the
probability of a favourable outcome is:


P(Favourable outcome) = r/n

Note the notation, P( ) simply means the probability of whatever is in the brackets, r
and n are as defined above.

Definition 2. Often called 'a posteriori' or Frequentist.
For n past outcomes of which r were favourable, the probability of a favourable
outcome is

P(Favourable outcome) = r/n

Both Definitions 1 and 2 will clearly produce a probability which is a measure within
the range 0.0 to 1.0 inclusive. This is the standard way of stating a probability, so that:

a probability of 0.0 implies the event is impossible (eg. the probability of an iron
spearhead being made in the Neolithic)

a probability of 1.0 implies the event is certain (eg. the probability of finding
something important protruding from the baulk of an excavation on the last afternoon.
JOKE!)

a probability of 0.3 implies the event has a reasonable chance of occurring but is not as
likely as an event with a probability of 0.8.

Definition 3. Often called Subjective.
For a particular event, give a personal estimate of its probability using a scale of 0.0 to
1.0.

All three definitions have their place in archaeological analysis although Definition 2
is the most often used, assigning probabilities to past events. If a particular
archaeological theory is to be tested (compared to the observed data) it is often
necessary to use statistical theory based on Definition 1. In some cases, when
investigating a new phenomenon or characteristic, Definition 3 is used to provide a
first estimate of the probabilities.

Example 1
In the spear data-set of 40 cases (observations, trials or experiments using statistical
language) there are 27 with a peg hole (see Table 1.1, variable number 5). Using
Definition 2 with n = 39 (spearhead number 4 is not counted as it has missing values)
and r = 27 we can conclude that the probability of a randomly chosen spear from
among the 39 having a peg hole is:

P(peg hole) = r/n
            = 27/39
            = 0.692 (this is the same as 69.2%)

Note that probabilities are often stated as percentages. If a new source of spears is
found and the assumption is that they are the same as the existing group, 69.2% of the
new spears will have a peg hole. If this turns out not to be true the difference could
yield interesting archaeological conclusions (testing such differences is discussed in
Chapter 8).

Example 2
Of the 40 spears, only one has a socket length, <SOCLE>, less than 3cm, and so:

P(socket length < 3cm) = 1/40
                       = 0.025

Any probability less than 0.05 (5%) is usually considered to be so low that, in this
instance, it is improbable that a spear chosen at random would have a socket length of
less than 3cm. Equally, any probability greater than 0.95 (95%) is usually considered
to be very high (this introduces the concept of significance and is discussed below in
section 6.4). It is highly probable, therefore, that a spear chosen at random will have a
socket length of 3cm or more.

6.2.2 The concept of independence - are two events related?
For a spear chosen at random the probability that it is made of bronze and iron is
clearly 0.0 since all of these spears are made of either bronze or iron but not both. The
two events 'choose a bronze spear' and 'choose an iron spear' are said to be Mutually
Exclusive (M.E.) - they cannot occur together. Other examples of such a dichotomy
are Male/Female or Pig/Sheep etc.

A very important concept in probability theory, both generally and in archaeology, is
that of independence. To illustrate this consider the classification of the 40 spearheads
according to their material and whether or not they have a peg hole. We have already
seen that

P(PH) = 27/39        P($\overline{PH}$) = 12/39

where:
PH denotes does have a peg hole, and
$\overline{PH}$ denotes does not have a peg hole.
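The relative-frequency calculations in Examples 1 and 2 amount to dividing a count of favourable cases by the number of cases. A minimal sketch follows; the helper name `probability` is hypothetical, and exact fractions are used only so that the complement can be checked without rounding error.

```python
from fractions import Fraction


def probability(favourable, total):
    """Relative-frequency probability r/n (Definition 2)."""
    if total <= 0 or not 0 <= favourable <= total:
        raise ValueError("need 0 <= favourable <= total and total > 0")
    return Fraction(favourable, total)


# Counts quoted in the text: 27 of 39 spearheads have a peg hole,
# and 1 of 40 has a socket length under 3 cm.
p_peg_hole = probability(27, 39)
p_no_peg_hole = 1 - p_peg_hole          # complement of the same event
p_short_socket = probability(1, 40)

print(round(float(p_peg_hole), 3))      # 0.692
print(round(float(p_no_peg_hole), 3))   # 0.308
print(float(p_short_socket))            # 0.025
```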


(Notice that P(PH) + P($\overline{PH}$) = 1.0 as expected).

Since 20 out of the 40 are made of Iron and 20 out of the 40 are made of Bronze, we
also have

P(I) = 20/40        P(B) = 20/40

If a spear has a peg hole, what is the probability that it is made of bronze? Does having
a peg hole make it more likely or less likely to be made of bronze? If having a peg
hole makes no difference to the probability of the material then the two variables (Peg
Hole v Material) are independent, otherwise they are dependent. The following
contingency table (see Chapter 3.3.2 for an introduction to contingency tables)
illustrates these ideas:

Material      Peg Hole
              Yes    No    Total
Iron           17     2     19
Bronze         10    10     20
Total          27    12     39

Using this table the following probabilities can be found:

P(Bronze and Peg hole)        = 10/39       = 0.256
P(Iron or Peg hole or both)   = (27 + 2)/39 = 0.744
P(Iron and no Peg hole)       = 2/39        = 0.051

Returning to the earlier question, if a spear has a peg hole what is the probability that
it is made of Bronze? The above shows that the proportion of all spearheads that have
a peg hole and are made of bronze is 0.256 but the following calculates the proportion
of those spearheads which have a peg hole which are made of bronze. Of the 27 spears
that have a peg hole, 10 are bronze, so:

P(Bronze given it has a peg hole) = 10/27 = 0.37

whilst:

P(Bronze) = 20/40 = 0.50

Since these two probabilities are not equal we can conclude that Material and Peg
Hole are dependent. In fact, if a spearhead has a peg hole it is less likely to be made of
bronze and so more likely to be of iron.

These ideas are often referred to as conditional probability and the example above
would be written as:

P(B | PH) = 0.37
P(B)      = 0.50

where:
B | PH denotes bronze given it has a peg hole
B denotes bronze.

Two important rules.
There are two simple rules which are fundamental to the understanding of
manipulating probabilities:

Rule 1: If the two events A and B are mutually exclusive, then

P(A or B) = P(A) + P(B)

ie. when using 'or' the probabilities are added.

Example. Since 8 spearheads are classified as Condition 1 and 18 as Condition 2
(Table 1.1, variable number 6),

P(C1) = 8/40  = 0.20
P(C2) = 18/40 = 0.45

and so:
P(C1 or C2) = 8/40 + 18/40 = 26/40 = 0.65

Rule 2: If the two events A and B are independent, then

P(A and B) = P(A) x P(B)

ie. when using 'and' the probabilities are multiplied.

Example. If we assume that half of all spears are Bronze, then the probability of
choosing two spears and both of them being bronze is,

P(1st B and 2nd B) = P(B) x P(B)
                   = 0.5 x 0.5
                   = 0.25 (or 25%)
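The conditional-probability argument of this section can be checked directly from the peg-hole contingency table. The sketch below is only an illustration: the dictionary layout and function names are hypothetical, and every probability is computed from the 39 tabulated spearheads, so P(Bronze) comes out as 20/39 rather than the 20/40 quoted in the text for the full data-set.

```python
from fractions import Fraction

# Counts from the peg-hole contingency table (39 spearheads, number 4 excluded).
table = {
    ("iron", "peg"): 17, ("iron", "no_peg"): 2,
    ("bronze", "peg"): 10, ("bronze", "no_peg"): 10,
}
total = sum(table.values())
materials = ("iron", "bronze")
holes = ("peg", "no_peg")


def p_joint(material, hole):
    """P(material and hole): one cell divided by the grand total."""
    return Fraction(table[(material, hole)], total)


def p_material(material):
    """P(material): a row total divided by the grand total."""
    return sum(p_joint(material, h) for h in holes)


def p_hole(hole):
    """P(hole): a column total divided by the grand total."""
    return sum(p_joint(m, hole) for m in materials)


def p_material_given_hole(material, hole):
    """Conditional probability P(material | hole)."""
    return p_joint(material, hole) / p_hole(hole)


p_bronze_given_peg = p_material_given_hole("bronze", "peg")    # 10/27
# 'Or' for events that are NOT mutually exclusive: subtract the overlap.
p_iron_or_peg = p_material("iron") + p_hole("peg") - p_joint("iron", "peg")
# Independence would require P(A and B) == P(A) x P(B).
independent = p_joint("bronze", "peg") == p_material("bronze") * p_hole("peg")

print(round(float(p_bronze_given_peg), 2))   # 0.37
print(round(float(p_iron_or_peg), 3))        # 0.744
print(independent)                           # False
```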


Note: To be strictly accurate this result should be (20/40) x (19/39) = 0.244 since
having chosen one bronze spearhead from our forty there now remains only 39 of
which 19 are bronze. These ideas are better illustrated with a second contingency
table, this time showing the relationship between Condition and Material.

Material      Condition
               1     2     3     4    Total
Iron           8    11     1     0     20
Bronze         0     7     8     5     20
Total          8    18     9     5     40

Since Condition 3 and Condition 4 are M.E., it is true that:

P(C3 or C4) = P(C3) + P(C4)
            = 9/40 + 5/40
            = 14/40
            = 0.35

However, Iron and Condition 2 are not M.E. since it is possible for an iron spearhead
to be in good condition. From the table we have:

P(I or C2 or Both) = (8 + 11 + 1 + 0 + 7)/40
                   = 27/40
                   = 0.675

Compare this result with the incorrect (although easily done) way of doing it

P(I or C2 or Both) = P(I) + P(C2)
                   = 20/40 + 18/40
                   = 38/40
                   = 0.95 (WRONG!)

This result is wrong because the score of 11 in the Iron/C2 cell of the table has been
counted twice.

Notice that the table also yields:

P(I and C2) = 11/40
            = 0.275

whilst:

P(I) x P(C2) = 20/40 x 18/40
             = 0.5 x 0.45
             = 0.225

Since these two results are not the same, Rule 2 does not hold, Material and Condition
are not independent. This means that the material of the spearhead does have an effect
on its condition.

6.3 Probability distributions - predicting results
The data on spearhead condition, <COND>, can be transformed from a simple
frequency count into a probability distribution by replacing each frequency with a
probability, as shown below:

Condition    Frequency    Probability
                          (or relative frequency)
1             8           0.200
2            18           0.450
3             9           0.225
4             5           0.125
Total        40           1.00

The variable <COND> is discrete (a condition of 2.3 is meaningless) and ordinal
(condition 2 is better than condition 4). There are many different models (or
theoretical distributions) which can be suggested as fits for the distributions of discrete
random variables. The commonest two are the Binomial and Poisson distributions
which provide good models for answering such questions as:

1. Of all graves in a cemetery, 23% contain beads. What is the probability that in a
random sample of 12 graves more than five will contain beads? (Use the Binomial
distribution).

2. The average number of sherds per sq.m. is three. What is the probability that in an
area of 4 sq.m. there are less than five sherds? (Use the Poisson distribution).

Further discussion of these and other distributions is beyond the scope of this book,
but see Chapter 13 for further reading.

For continuous random variables which are measured on a ratio, interval or ordinal
scale (see Chapter 1.2 for levels of measurement) the most important model is the
Normal distribution. It has many applications in archaeology and also plays a very


important role in sampling theory. The normal distribution (recognised by the
bell-shaped curve) is the most useful of all distributions because many naturally occurring
distributions are very similar to it. Its mathematical derivation was first presented by
De Moivre in 1733 but it is often referred to as the Gaussian distribution after Carl
Gauss (1777-1855) who also derived its equation from a study of errors in repeated
measures of the same quantity.

Consider Figure 6.1 which shows the distribution of the upper socket width,
<UPSOC> for the 40 spearheads. The mean and standard deviation for this variable
are 1.535cm and 0.331cm respectively (see Chapters 4.4 and 5.5 for calculating the
mean and standard deviation). It can be seen that the widths vary more or less
symmetrically about the mean with the more extreme values being less probable. This
is typical of a sample from the normal distribution. Using statistical theory and tables
(which are not relevant here) it is possible to produce a curve showing what an exact
normal distribution with this mean and standard deviation would look like. This is
shown as the smooth line in Figure 6.1. Figure 6.2 shows a normal distribution with a
mean of 0 and a standard deviation of 1.

Figure 6.1. The distribution of the upper socket width, <UPSOC>. (Horizontal axis: Width of upper socket (cm); vertical axis: Frequency.)

Figure 6.2. A normal distribution with mean 0 and standard deviation 1.

It can be shown that:

(i) 50% of the values are less than 0
(ii) 50% of the values are more than 0
(iii) approximately 68% are between -1.0 and +1.0
(iv) approximately 95% are between -2.0 and +2.0
(v) exactly 95% are between -1.96 and +1.96
(vi) exactly 90% are between -1.645 and +1.645
(vii) exactly 99% are between -2.576 and +2.576

These results allow the following statements to be made for any variable with a
normal or near normal distribution:

(a) approximately 95% of all values should be within two standard deviations of the
mean
(b) practically all values should be within 3 SD of the mean

Example:
For the variable upper socket width, <UPSOC>, (Table 1.1, variable number 11)

Mean = 1.535, SD = 0.331

Therefore, we would expect about 95% of the values to be within 1.535 ± 2(0.331)
ie. within the limits (0.873, 2.197)

In fact there are two (5%) outside these limits with values of 0.8 and 2.2.
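The percentages quoted for the standard normal distribution, and the two-S.D. limits for <UPSOC>, can be reproduced with Python's standard library. This is a sketch only: the function names `coverage` and `expected_limits` are hypothetical, and `statistics.NormalDist` stands in for the printed tables mentioned in the text.

```python
from statistics import NormalDist


def coverage(k):
    """Proportion of a normal distribution lying within k S.D.s of the mean."""
    standard = NormalDist(mu=0.0, sigma=1.0)
    return standard.cdf(k) - standard.cdf(-k)


def expected_limits(mean, sd, k=2.0):
    """Limits mean - k*sd and mean + k*sd, rounded to three decimals."""
    return (round(mean - k * sd, 3), round(mean + k * sd, 3))


for k in (1.0, 1.645, 1.96, 2.0, 2.576, 3.0):
    print(f"within ±{k} S.D.: {coverage(k):.4f}")

# Mean and S.D. of upper socket width as quoted in the text.
print(expected_limits(1.535, 0.331))   # (0.873, 2.197)
```

The loop shows why 'approximately 95%' is attached to two S.D.s while 1.96 gives 95% to three decimal places.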


The variable maximum length, <MAXLE>, on the other hand, has a distribution
which is clearly not symmetrical (Figure 6.3) and here the normal distribution is a
poor model.

Figure 6.3. The asymmetric distribution of the maximum length, <MAXLE>. (Horizontal axis: Maximum length (cm).)

The smooth curve shown in Figure 6.3 indicates the theoretical or expected shape of a
normal distribution which has the same mean and standard deviation as those for
<MAXLE>. If the underlying distribution of <MAXLE> really was normal, we would
still expect small discrepancies between the expected and the actual (or observed)
results from a sample, with such differences getting smaller for larger samples. In
Figure 6.3 the differences between the observed frequencies and those expected from a
normal distribution (or a model which assumes a normal distribution) are large, and so
the normal distribution can be considered a poor model. There are formal ways of
testing whether a normal distribution is a good model which are discussed in Chapter
9.

It is important here to mention the Central Limit Theorem (this is explained in detail
in Chapter 7) because many statistical applications rely upon it. The Central Limit
Theorem provides a rationale for the use of the Normal distribution which is why it is
the most important of all distributional models.

6.4 The logic of hypothesis testing - is it significant?
Most of the rest of this section, and indeed most applications of inferential statistics in
archaeology, are concerned with hypothesis testing (sometimes called tests of
significance).

It is important to understand just what hypothesis testing in this formal context means.
Theories of one kind or another abound in archaeology although many of them cannot
be tested in any way let alone in the formal way to be described below. A hypothesis,
therefore, must represent a quantifiable relationship and it is this relationship which is
tested formally. We could say that all hypotheses are theories whereas not all theories
are hypotheses.

In order to illustrate the logic of a hypothesis test consider testing the hypothesis that
at least 40% of all bronze spearheads come from burials (this is the quantifiable
association between the number of bronze spearheads and the variable <CON>).

The first step:
is to formulate two hypotheses, one is called the null hypothesis (denoted by H0) and
the other is the alternative hypothesis (H1). This must be done so that one and only
one must be true. In this case we would have:

H0: Proportion of bronze spearheads from burials is 40% or more
H1: Proportion of bronze spearheads from burials is less than 40%.

The second step:
is to take suitable measurements or observations from which a test statistic and its
associated probability (described in step 3) can be calculated. Here we have a sample
of twenty bronze spearheads seven of which have been found in burials (this is the
observed result).

So far so good!

The third step:
is more difficult. Calculate a test statistic which can then be tested for significance in
step 4. The test statistic allows for the calculation of the probability of the observed
result which is often called the p-value. If H0 is true and at least 40% of all bronze
spearheads do come from burials what is the probability of a sample of 20 containing
seven from burials? Using the ideas from Section 6.2.2 of this chapter we have:

P(Burial) = 0.40 and so P(Not burial) = 0.60
P(Not burial for 1st and 2nd) = (0.60)(0.60) = (0.60)^2
Hence P(Not burial for 13) = (0.60)^13 = 0.0013


The p-value (probability of the observed result) is 0.0013 or 0.13% (very small!)

Step 4 (and last!):
Remember that the null hypothesis is being tested. The significance of the test statistic
will determine whether the Null Hypothesis is accepted or rejected. There are set
conventions for significance testing which allow this decision to be made.

The p-value calculated above is very small which means either H0 is true and we have
been very unlucky in drawing an unrepresentative sample, or else H0 is false. The
convention for significance testing is as follows:

If p<0.05 (5%) reject H0 at the 5% level and conclude that there is significant evidence
to show that the percentage of bronze spearheads from burials is less than 40% (in
other words if H0 is rejected H1 must be accepted). It is also valid to conclude that we
are 95% certain that the percentage of bronze spearheads from burials is less than
40%.

If the p-value had turned out to be greater than 0.05, the conclusion would have been
that there is insufficient evidence to reject H0 at the 5% level and so H0 is accepted. A
somewhat philosophical point here: it is impossible to prove a hypothesis in this
formal way, it is either accepted or rejected at certain levels of significance. In a way
this is in line with the falsificationist views of Karl Popper and has interesting
implications for archaeology by suggesting that the advancement of knowledge is not
based on proving things but on disproving things. If, for example, we reconstruct an
Iron Age roundhouse and it falls down the first time the wind blows, it is reasonable to
argue that this disproves that building hypothesis. If, however, it stands for many years
it doesn't prove that that is how roundhouses were built in the Iron Age. Which
advances knowledge more?

Although the 5% significance level has been used above other levels can also be
applied, those often used in the social sciences are:

p<0.10 reject at the 10% level
p<0.05 reject at the 5% level
p<0.01 reject at the 1% level
p<0.001 reject at the 0.1% level.

Important note:
having said this about significance levels it is important to emphasise the arbitrary
nature of the whole concept of significance (we thank Clive Orton for forcing this
issue). We have explained the statistical reasoning behind significance levels but it is
up to the archaeologist to justify the choice of a certain significance level in
archaeological terms. Is the decision to accept or reject the null hypothesis important
enough to warrant a 1% level or will 5% do, or why not 7%? What are the
archaeological costs involved in accepting or rejecting the null hypothesis? If lives
depended on the outcome, as in testing drugs or parts of aircraft, we could easily
justify using the 0.1% level but the situation is not so clear cut in archaeology.

Returning to the p-value of 0.0013 we can now see that H0 can be rejected at the 1%
level concluding that we are at least 99% certain (this is not absolutely true but will
do in simple terms) that the percentage of spearheads from burials is less than 40% (to
be precise we are 99.87% certain but it is convention to stick to the 10%, 5%, 1% and
0.1% levels).

In the four steps described above it is the calculation of the test statistic and its
associated probability that can be difficult. In various situations and under various
assumptions there are a number of accepted methods of doing this, Chapters 8 to 11
describe some of them.
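The decision convention of step 4 can be written as a small lookup against the conventional significance levels. This is a minimal sketch: the names `CONVENTIONAL_LEVELS`, `rejected_levels` and `strictest_level` are hypothetical, and the p-value of 0.0013 is simply taken as given from the worked example rather than recalculated.

```python
# Conventional levels quoted in the text: 10%, 5%, 1% and 0.1%.
CONVENTIONAL_LEVELS = (0.10, 0.05, 0.01, 0.001)


def rejected_levels(p_value, levels=CONVENTIONAL_LEVELS):
    """Return the significance levels at which H0 would be rejected (p < level)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0.0 and 1.0")
    return [level for level in levels if p_value < level]


def strictest_level(p_value, levels=CONVENTIONAL_LEVELS):
    """Smallest conventional level at which H0 is rejected, or None if none apply."""
    rejected = rejected_levels(p_value, levels)
    return min(rejected) if rejected else None


p_value = 0.0013                       # taken from the worked example
print(rejected_levels(p_value))        # [0.1, 0.05, 0.01]
print(strictest_level(p_value))        # 0.01
print(strictest_level(0.2))            # None: insufficient evidence to reject H0
```

The choice of which level to act on remains an archaeological judgement; the code only reports where the p-value falls.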


CHAPTER 7

SAMPLING THEORY AND SAMPLE DESIGN

7.1 Introduction.

Archaeologists are permanently working with samples. An area of excavation is a
sample of the complete site which in itself is a sample of all sites of that type. The
same goes for artefact assemblages which represent samples of a larger, often
unknown, group. For many years such samples have been selected by a variety of ad
hoc methods and have served archaeology well, indeed, virtually all of our current
archaeological knowledge rests upon the results from thousands of judgement
samples, so called because they are chosen in a mathematically non-rigorous manner.

This chapter is about a different type of sampling, usually called random sampling in
the UK or probability sampling in North America because it uses ideas from
probability theory. Strictly speaking, judgement samples do not allow any statements
to be made about the material that was not included in the sample (although
archaeologists do this all the time) whereas random sampling provides a means of
making such statements within a stated confidence limit. If a 20% sample of the
interior of a hillfort was excavated, and the sample was a random one, it would be
possible to estimate the number of sherds, pits or any other characteristic for the whole
interior within stated confidence limits. If the sample was a judgement sample, as
most excavations are, any statements made concerning the whole would be informed
guesswork.

The essence of all sampling is to gain the maximum amount of information by
measuring or testing just a part of the available material. Because of the ability of
random sampling to enable extrapolation from the sample taken it can often provide
more information than a judgement sample although, of course, the procedure of
random sampling is more difficult to set up and perform. Before going any further,
some formal definitions need to be established:

Population: the whole group or set of objects (or other items) about which inference
is to be made. This could be all Bronze Age spearheads or all spearheads in some
subset (e.g. a particular county).

Sampling frame: a list of the items, units or objects that could be sampled. Often, and
ideally, the sampling frame will contain all of the population, but it need not.

Variable: a characteristic which is to be measured for the units, such as weight of
spearheads.

Sample: the subset or part of the population that is selected.

Sample size (n): the number in the sample. A sample size of 5 is considered small,
while, formally, a sample size of 50 is large. The sample size may be stated as a
percentage of the sampling frame, e.g. a 10% sample.

Clearly, the larger the sample size, the more reliable will be any inferences made from
the sample (and, usually, the more costly in time and resources it will be to collect). A
smaller sample will be less expensive although the resulting information will be less
reliable. Faced with a cemetery of 127 graves, excavating a sample of 100 should
allow reliable inferences about the whole group, whereas inferences from a sample of
only 10 would be very unreliable. As with judgement sampling, the size of the sample
is often a product of many constraining factors.

The rest of this chapter is concerned with two important aspects of sampling:

(i) How to make the sample as representative of the population as possible so that it
yields the maximum information.

(ii) Having taken a sample, how to measure the precision or reliability of the results it
produces.

7.2 Sampling strategies - which measurements to take.
The following are the more common and useful sampling strategies for drawing a
random sample. It should be emphasised, however, that it is often difficult to apply
these rigidly in many archaeological situations. They are not claimed to be a substitute
for archaeological intuition and experience, but a useful tool to be used when and
where appropriate.

Each method will be applied to designing a sample of 10 spearheads from our
population of 40 (a 25% sample) in order to estimate the mean weight. We know that
the mean weight of all 40 is 442.4 g and the S.D. 436.0 g.

A Simple random sample.
Each unit in the population has the same chance of being selected as any other unit. To
take a simple random sample of 10 spearheads we could:

(a) Stick a pin into the list ten times without looking (not professional and open to
abuse, fiddling and criticism!)


(b) Put the 40 spearheads into a large box, shake and withdraw 10 (not a good idea,
think of at least three faults!)

(c) Number the spearheads 1 to 40 and using random number tables identify 10 by
choosing numbers from the tables. This is the usual method which is sometimes
improved upon by using a computer to select the random numbers. Table A in the
Appendix is a typical table of random digits between 0 and 9. They are grouped into
blocks of 5 just for ease of reading. If we were to start reading at the top left and read
across (we could start reading anywhere but should then read from the table in a
steady sequence either down, up or across), the first number would be 72, the second
31, then 02, 85, 27 etc. This will give a sample of 10 spears to be numbers:

31, 2, 27, 33, 8, 26, 23, 29, 22, 21.

Notice that 00 and any number larger than 40 has been ignored, and if we had the
same number twice this would also have been ignored. Using this sample of 10 the
mean weight is 349.51 g which is rather low compared to the true mean of 442.4 g.

A Systematic sample.
To take a systematic sample of 10 from our population of 40 spearheads take every
fourth one. Although in this example we start with number one, strictly speaking the
start number (between one and four) should be chosen at random. This method has the
advantage of being easy to design although if the units have inherent patterning in
their ordering systematic sampling could be inappropriate. Our sample starts at
number one and ends with number 37 (numbers 1, 5, 9, 13, 17, 21, 25, 29, 33 and 37)
producing a sample mean weight of 405.68 g (true mean is 442.4 g).

A Stratified sample.
In order to get a representative sample the structure of that sample should reflect the
structure of the population in terms of characteristics that are thought to be influential.
For example, the spearhead population consists of 20 iron and 20 bronze items which
could influence the weight (if one group was a lot heavier than the other), in order to
reflect this our sample should contain 5 iron and 5 bronze spearheads. Using the
random number tables again (Table A in the Appendix), we select numbers in the
range 1 to 40 in order to give the first 5 iron and 5 bronze:

Iron: 2, 8, 13, 17, 4
Bronze: 31, 27, 33, 26, 23

These give a sample mean of 409.43 g; still a little low. Perhaps there is no real
relationship or association between material and weight, in which case stratifying
using material as a stratum has no advantage (to test this claim see the measures of
association for categorical data described in Chapter 11).

If the context the spearhead was found in is suspected to have a relationship to weight,
then we should stratify our sample using context as a stratum. The 40 spearheads are
spread between the three contexts as follows:

Context                   1        2        3
Number of spearheads     27        6        7
Percentage of total    67.5%    15.0%    17.5%

This means that to design a sample of 10 that is stratified proportionally according to
context, we would need to take 67.5% of the 10 from context 1 etc. Using nearest
whole numbers (in order to save having to saw spearheads into bits!) the sample of 10
would consist of the following:

context 1        context 2       context 3
67.5%            15%             17.5%
7 spearheads     1 spearhead     2 spearheads

Again using the random number table and only taking numbers between 1 and 40, the
following proportional stratified random sample is taken (starting top left in the table
and going down in pairs):

Context 1: 38, 14, 26, 31, 40, 13, 34
Context 2: 8
Context 3: 7, 4

The mean weight of this sample is 398.6 g, again a little low. By designing such a
stratified sample there is no guarantee that the results will be more accurate or better
but we are less likely to produce unrepresentative results based on some bias in the
sample.

A Cluster sample.
Rather than select individual items, select clusters or groups of items that are close
together. This would be better illustrated by a spatial application where, for example, a
survey of a large area is being conducted. To save on travelling time, groups of sites
are selected at random; if each group is then sampled individually this could be called
a two stage sampling design.
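Where a computer replaces the random number table, the simple random, systematic and proportionally stratified designs can be sketched as follows. All function names are hypothetical, the seed is fixed only to make the illustration repeatable, and the largest-remainder rule used to turn percentages into whole spearheads is an assumption made here because it reproduces the 7, 1 and 2 allocation given for the three contexts.

```python
import random


def simple_random_sample(ids, n, rng):
    """Every unit has the same chance of selection; no unit is chosen twice."""
    return sorted(rng.sample(ids, n))


def systematic_sample(ids, n, rng=None):
    """Take every k-th unit; the start is random if an rng is supplied, else the first unit."""
    step = len(ids) // n
    start = rng.randrange(step) if rng is not None else 0
    return ids[start::step][:n]


def proportional_allocation(strata_sizes, n):
    """Share n units between strata in proportion to size (largest-remainder rounding)."""
    total = sum(strata_sizes.values())
    exact = {name: n * size / total for name, size in strata_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = n - sum(allocation.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


spearheads = list(range(1, 41))            # spearheads numbered 1 to 40
rng = random.Random(0)                     # fixed seed, for a repeatable illustration

print(simple_random_sample(spearheads, 10, rng))
print(systematic_sample(spearheads, 10))   # starts at number one, as in the text
print(proportional_allocation({"context 1": 27, "context 2": 6, "context 3": 7}, 10))
```

The simple random sample drawn this way will not match the numbers read from Table A, since a different source of random numbers is used.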


A Convenience Sample.
This sampling design does not use random methods to select the sample, and
consequently can produce poor results. If, for example, the first 10 spearheads were
taken as the sample (simply because this is convenient) the mean would be 304.99 g
which is very low. If the next 10 are taken their mean is 664.94 g which is very high.

There are other sampling strategies, although the first three described above are the
main ones and certainly adequate for most archaeological applications.

7.3 A statistical background to sampling.

Once a sample has been taken various statistics can be calculated from it (using
methods described in Section 1 of the book). The commonest of these sample statistics
are:

$\bar{x}$, the sample mean
s, the sample standard deviation, and
p, the sample proportion, i.e. the proportion of the sample having a particular
characteristic.

The true or population values for these statistics are usually unknown, and formally
denoted by Greek letters so that:

$\bar{x}$, the sample mean, is an estimate of $\mu$, the population mean
s, the sample standard deviation, is an estimate of $\sigma$, the population standard deviation
p, the sample proportion, is an estimate of $\pi$, the population proportion.

Figure 7.1. The distribution of weight, <WEIGHT>. (Horizontal axis: Weight (g); vertical axis: Frequency.)

7.3.1 The central-limit theorem - the law of averages. 9 spearheads. (We have taken a sample size of 9 for reasons that will become clear
later, but these ideas hold whatever the sample size).
In order to comment on how good an estimate the sample statistics are, the nature of
their distribution needs to be known. To illustrate this, consider trying to estimate the
Comparing the two distributions shown in Figures 7.1 and 7.2, original weights and
mean weight of the spearheads from a sample of only 9 randomly chosen ones. We
sample mean weights, the differences in shape can be seen. This demonstrates two
know that for the population of 40, p 442.4 and a 436.0. Figure 7.1 shows the
important concepts of sampling which together form the Central Limit Theorem, (here
distribution of weight for the whole population which is not symmetrical (it is skewed)
this theorem is only demonstrated then stated but it can be proved formally),
with a large amount of variation (a 436.0).
(i) The distribution of the sample means is reasonably symmetrical and can be well
If a simple random sample of 9 spearheads is taken, and the sample mean calculated, modelled by the normal distribution (see Chapter 6 for the importance of this). As the
we would expect it to be a reasonable estimate of the truth (the population mean). If
sample size increases so the closeness of the approximation to the n?rmal distributio!1
many such random samples are taken each will produce an estimate of the population increases. For most distributions a sample size as low as four will produce a fair
mean and we would expect most of them to be near to the truth. Figure 7.2 shows the
approximation, while a sample of 20 or 30 will give a very good approximation, (this
distribution of 40 sample means each one calculated from a separate random sample of is why a sample of 50 is considered to be large). The sequence in Figure 7.3 (a to e)
shows how the distribution of sample means becomes more symmetrical as the sample
size increases.
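
The behaviour shown in Figure 7.3 can be explored by simulation. The sketch below is an illustration only: the 40 weights are a synthetic right-skewed stand-in generated from an exponential distribution, not the spearhead data, and the number of repeats (2000) is an arbitrary choice. For each sample size it draws many simple random samples, then reports the skewness and the spread of the resulting sample means.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: close to 0 for a symmetrical distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def sample_means(population, n, repeats, rng):
    """Means of `repeats` simple random samples of size n (without replacement)."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

rng = random.Random(7)  # fixed seed so the run is repeatable

# Synthetic, right-skewed stand-in for 40 weights (NOT the book's data).
population = [rng.expovariate(1 / 440) for _ in range(40)]

results = {}
for n in (1, 4, 9, 16, 25):
    means = sample_means(population, n, 2000, rng)
    results[n] = (skewness(means), statistics.pstdev(means))
    print(f"n={n:2d}  skewness={results[n][0]:5.2f}  SD of means={results[n][1]:6.1f}")
```

With a skewed population of this kind the skewness of the sample means should shrink towards zero as n grows, which is the numerical counterpart of the histograms becoming more symmetrical.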


[Figure 7.2. The distribution of mean weight, <WEIGHT>, for samples of size nine. Horizontal axis: Sample mean weight (n=9), 0 to 2,500.]

(ii) These distributions also show that the variation is reducing as the sample size
increases. Variation is usually measured by the standard deviation and the following
table shows how the SD changes as the sample size increases from 1 to 25 individuals:

Sample size (n)    Standard Deviation
 1                 436.0
 4                 182.6
 9                 129.7
16                  75.9
25                  54.1

[Figure 7.3. The change in the distribution of sample means as sample size increases. Panels: Weight (g), then sample mean weight for n=4, n=9, n=16 and n=25.]

It can be proved that if the standard deviation of the individuals in a population is $\sigma$,
then the standard deviation of the means of samples of size n, often called the
standard error of the mean (or simply the SE of the mean), is

$$\mathrm{SE(mean)} = \frac{\sigma}{\sqrt{n}}$$

The results in the table above confirm this since by taking samples of size 4, the
original SD has been approximately halved and for n = 9 the SD is reduced to
approximately one third.

Before discussing the implications of this any further, the Central Limit Theorem can
be stated (without proof):

If a random sample of size n is taken from a population with mean $\mu$ and SD $\sigma$, then:


(1) the sample means are approximately normally distributed with this approximation
becoming closer as n increases,

(2) the mean of the sampling distribution is $\mu$,

(3) the SD of the sampling distribution is $\sigma/\sqrt{n}$.

The important consequence of this theorem is that whenever a population mean is to
be estimated we are able to use the known characteristics of the normal distribution
(see Chapter 6) to evaluate how precise or reliable our estimate is. For example, if we
had been trying to estimate the population mean for weight from a sample of just 9
spearheads, by looking at Figure 7.3(c) we can see that the truth is between
approximately 300 g and 700 g. If we had a larger sample of 25 we could estimate the
true mean to be between 400 and 500 g (Figure 7.3(e)).

7.3.2 Confidence limits - the reliability of results.
Although the estimates for the true mean get more accurate as the sample size
increases they are still rather vague. It is statistical convention to calculate a
confidence interval (or confidence limits) for an estimate based on a particular
probability value, usually 90%, 95% or 99%. This is a formal way of stating
confidence in the estimate. Given a sample of n observations which lead to a sample
mean of $\bar{x}$, from a population with unknown mean but standard deviation $\sigma$,
confidence limits are calculated as below.

In practice, when trying to estimate the mean, the standard deviation is also unknown
but for large samples the calculated standard deviation is often taken to be the actual
true standard deviation. When the sample size is small so that the sample standard
deviation is a poor estimate for the true standard deviation, the following theory
should be modified by using the Student's t distribution to calculate the confidence
limits. Confidence intervals are calculated using:

Confidence level    Confidence intervals
90%                 $\bar{x} \pm 1.645\,\sigma/\sqrt{n}$
95%                 $\bar{x} \pm 1.960\,\sigma/\sqrt{n}$
99%                 $\bar{x} \pm 2.576\,\sigma/\sqrt{n}$

If we had taken a sample of 16 spearheads and calculated $\bar{x} = 402.1$, then using
$\sigma = 436.0$ would give 95% confidence limits of:

$$402.1 \pm 1.960 \times \frac{436.0}{\sqrt{16}}$$
$$= 402.1 \pm 1.960(109.0)$$
$$= 402.1 \pm 213.64$$

i.e. from $402.1 - 213.64$ to $402.1 + 213.64$
from 188.46 to 615.74

We could conclude from this that it is 95% certain that the true mean is between
188.46 g and 615.74 g. If these limits are too wide, as they probably are, then clearly a
larger sample needs to be taken to reduce them (the importance of $\sqrt{n}$ in the
equation). In fact these very broad limits reflect the high variability (standard
deviation) in the weight of the spearheads. If we consider our 40 spearheads as a
sample of some larger population (all spearheads perhaps), then we can calculate the
95% Confidence Limit for the mean weight to be:

$$442.4 \pm 1.96 \times \frac{436.0}{\sqrt{40}}$$
$$442.4 \pm 135.1$$

i.e. we are 95% sure that the mean weight of all spearheads is between 307.3 g and
577.5 g. This is still rather a poor estimate because of the large standard deviation
which we can see is mainly due to one spearhead having an unusually high weight
(number 16 in Table 1.1, at 2446.5 g). We could, of course, omit this one spearhead
from our calculations considering it to be an unrepresentative outlier or freak. This
would produce a better (narrower) confidence interval but such editing of the data is
archaeologically dangerous without sound justification.

If, instead of estimating the population mean, we were interested in estimating the
proportion of the population with a particular characteristic, then similar ideas are used


to establish confidence limits for this proportion. If a sample of size n gives a
proportion of size p, then the various confidence limits are given by:

Confidence level    Confidence limits
90%                 $p \pm 1.645\sqrt{p(1-p)/n}$
95%                 $p \pm 1.960\sqrt{p(1-p)/n}$
99%                 $p \pm 2.576\sqrt{p(1-p)/n}$

Consider that our 40 spearheads are a sample from a much larger unknown population,
and we need to estimate the true proportion of bronze spearheads that have loops. Of
the 20 bronze spearheads in our sample, 9 have loops giving a sample proportion of 9
in 20, 9/20 = 0.45 or 45%.

If we wanted to compare this proportion to another population (perhaps another site,
area or assemblage) we need to know just how reliable this figure of 45% is. Given
that n = 20 and p = 0.45, the 95% confidence limits for the true proportion are:

$$0.45 \pm 1.96\sqrt{\frac{0.45(1-0.45)}{20}}$$
$$0.45 \pm 1.96\sqrt{0.012375}$$
$$0.45 \pm 0.218$$

i.e. from 0.232 to 0.668

This means that with a sample of 20 we can be 95% certain that the true proportion of
bronze spearheads with loops is somewhere between 0.232 and 0.668 (23% and 67%).
Again this is rather vague, but this in itself serves to emphasise the need to quote
confidence limits to show the uncertainty.

The previous formulae for confidence limits can also be used to estimate the required
sample size to obtain a given level of reliability. Suppose that we really wanted to
estimate the true proportion of bronze spearheads with loops to within ± 0.05 (± 5%)
with 95% certainty. Given that p = 0.45, what sample size is needed?

The 95% confidence limits are:

$$0.45 \pm 1.96\,\frac{\sqrt{(0.45)(1-0.45)}}{\sqrt{n}}$$

i.e.

$$0.45 \pm 1.96\,\frac{\sqrt{(0.45)(0.55)}}{\sqrt{n}}$$

We need to substitute the 'error' part of the above with the required level of ± 0.05,
so:

$$0.05 = \frac{1.96\sqrt{(0.45)(0.55)}}{\sqrt{n}}$$

i.e.

$$\sqrt{n} = \frac{1.96\sqrt{(0.45)(0.55)}}{0.05}$$
$$\sqrt{n} = 19.5$$
$$n = 19.5^2$$
$$n = 380.3$$

Hence we need a sample of 380 bronze spearheads to give an estimate of the
proportion with loops to within ± 0.05 with a confidence level of 95%. To save doing
the previous analysis there are simpler formulae to find the required sample size, n, to
estimate either the mean or proportion to within ± D (here with 95% confidence):

Mean: $n = \left[\frac{1.96\,\sigma}{D}\right]^2$


Proportion: $n = \left[\frac{1.96}{D}\right]^2 p(1-p)$

So, for the previous example where D = 0.05 and p = 0.45

$$n = \left[\frac{1.96}{0.05}\right]^2 0.45(1 - 0.45) = 380.3$$

To estimate the true mean weight to within 10 g with a probability of 95%, would
require a sample size of n, where:

$$n = \left[\frac{1.96 \times 436.0}{10}\right]^2 = 7303$$

What a big sample!

7.4 Conclusions
The techniques discussed in this chapter are about making generalising statements
describing an unknown population based on information gleaned from a sample of that
population. Providing that the sample is collected using one of several mathematically
random strategies the formulae given above can be used. The reliability of estimates of
characteristics of the population will vary according to the size of the sample and the
variability within the sample. A small sample containing a lot of variability will never
yield precise estimates of the population.

These concepts are of potential interest to archaeologists. If for example an excavation
produced a very large assemblage of flints and available resources did not allow for a
full analysis, a randomly drawn sample would allow general statements to be made
about characteristics such as means of measurements and proportions of types etc. It is
obvious that the emphasis of this approach is on establishing general parameters
describing the population and not on identifying and describing unusual and 'special'
elements within the population. Of course there is no reason why judgement samples
containing obviously special cases cannot be taken as well as the random sample so
long as it is realised that the aims of the two are different.

It is also not difficult to interpret sampling strategies spatially. An area to be excavated
could be gridded into sampling units and the appropriate number selected randomly.
Areas of different interest could be built into a stratified design. General parameters of
certain aspects of the site can be established using random strategies and this does not
preclude the use of the much favoured judgement sampling of areas which look
particularly interesting for some reason.
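
The confidence-limit and sample-size formulae of Section 7.3.2 are simple enough to check numerically. The sketch below is only an illustration: the function names are hypothetical, the multipliers are the 90%, 95% and 99% values quoted in the chapter, and the population standard deviation is assumed known, as in the worked examples.

```python
import math

# Normal-distribution multipliers quoted in Section 7.3.2.
Z = {90: 1.645, 95: 1.960, 99: 2.576}

def mean_limits(xbar, sigma, n, level=95):
    """Confidence limits for a mean: xbar +/- z * sigma / sqrt(n)."""
    half = Z[level] * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def proportion_limits(p, n, level=95):
    """Confidence limits for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    half = Z[level] * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def n_for_mean(sigma, d, level=95):
    """Sample size needed to estimate a mean to within +/- d."""
    return (Z[level] * sigma / d) ** 2

def n_for_proportion(p, d, level=95):
    """Sample size needed to estimate a proportion to within +/- d."""
    return (Z[level] / d) ** 2 * p * (1 - p)

# The chapter's worked examples.
print(mean_limits(402.1, 436.0, 16))    # 16 spearheads, mean 402.1 g
print(mean_limits(442.4, 436.0, 40))    # all 40 spearheads
print(proportion_limits(0.45, 20))      # 9 of 20 bronze spearheads with loops
print(n_for_proportion(0.45, 0.05))     # proportion to within +/- 0.05
print(n_for_mean(436.0, 10))            # mean weight to within +/- 10 g
```

The two sample-size functions return unrounded values; in practice the result would be rounded up to a whole number of items so that the required margin is not exceeded.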

Mr. Alger became famous with the
publication of that undying book,
“Ragged Dick, or Street Life in New
York.” It was his first book for young
people, and its success was so great that
he immediately devoted himself to that
kind of writing. It was a new and fertile
field for a writer then, and Mr. Alger’s
treatment of it at once caught the fancy
of the boys. “Ragged Dick” first appeared
in 1868, and ever since then it has been
selling steadily, until now it is estimated
that about 200,000 copies of the series
have been sold.
—Pleasant Hours for Boys and Girls.
THE JOHN C. WINSTON CO.’S POPULAR JUVENILES.

LUCK AND PLUCK SERIES—Second Series.


4 vols. $4.00
Risen from the
Try and Trust.
Ranks.
Herbert Carter’s
Bound to Rise.
Legacy.

BRAVE AND BOLD SERIES.


By Horatio
4 vols. $4.00
Alger, Jr.
Brave and Bold. Shifting for Himself.
Jack’s Ward. Wait and Hope.

NEW WORLD SERIES.


By Horatio
3 vols. $3.00
Alger, Jr.
Facing the
Digging for Gold. In a New World.
World.

VICTORY SERIES.
By Horatio
3 vols. $3.00
Alger, Jr.
Only an Irish Boy. Adrift in the City.

Victor Vane, or the Young Secretary.

FRANK AND FEARLESS SERIES.


By Horatio
3 vols. $3.00
Alger, Jr.
Frank Hunter’s Peril. Frank and Fearless.

The Young Salesman.

GOOD FORTUNE LIBRARY.


By Horatio
3 vols. $3.00
Alger, Jr.
Walter Sherwood’s
A Boy’s Fortune.
Probation.

The Young Bank Messenger.

RUPERT’S AMBITION.
By Horatio
1 vol. $1.00
Alger, Jr.

JED, THE POOR-HOUSE BOY.


By Horatio
1 vol. $1.00
Alger, Jr.
A writer for boys should have an
abundant sympathy with them. He
should be able to enter into their plans,
hopes, and aspirations. He should learn
to look upon life as they do. Boys object
to be written down to. A boy’s heart
opens to the man or writer who
understands him.
—From Writing Stories for Boys, by
Horatio Alger, Jr.

RAGGED DICK SERIES.


By Horatio Alger,
6 vols. $6.00
Jr.
Ragged Dick. Rough and Ready.
Fame and
Ben the Luggage Boy.
Fortune.
Mark the Match
Rufus and Rose.
Boy.

TATTERED TOM SERIES—First Series.


By Horatio Alger,
4 vols. $4.00
Jr.
Tattered Tom. Phil the Fiddler.
Paul the Peddler. Slow and Sure.

TATTERED TOM SERIES—Second Series.


4 vols. $4.00
Julius. Sam’s Chance.
The Young
The Telegraph Boy.
Outlaw.

CAMPAIGN SERIES.
By Horatio Alger,
3 vols. $3.00
Jr.
Charlie Codman’s
Frank’s Campaign.
Cruise.

Paul Prescott’s Charge.

LUCK AND PLUCK SERIES—First Series.


By Horatio Alger,
4 vols. $4.00
Jr.
Luck and Pluck. Strong and Steady.
Sink or Swim. Strive and Succeed.
EDWARD S. ELLIS.

Edward S. Ellis, the popular writer of


boys’ books, is a native of Ohio, where
he was born somewhat more than a half-
century ago. His father was a famous
hunter and rifle shot, and it was
doubtless his exploits and those of his
associates, with their tales of adventure
which gave the son his taste for the
breezy backwoods and for depicting the
stirring life of the early settlers on the
frontier.
Mr. Ellis began writing at an early age
and his work was acceptable from the
first. His parents removed to New Jersey
while he was a boy and he was
graduated from the State Normal School
and became a member of the faculty
while still in his teens. He was afterward
principal of the Trenton High School, a
trustee and then superintendent of
schools. By that time his services as a
writer had become so pronounced that
he gave his entire attention to literature.
He was an exceptionally successful
teacher and wrote a number of text-
books for schools, all of which met with
high favor. For these and his historical
productions, Princeton College conferred
upon him the degree of Master of Arts.
The high moral character, the clean,
manly tendencies and the admirable
literary style of Mr. Ellis’ stories have
made him as popular on the other side of
the Atlantic as in this country. A leading
paper remarked some time since, that no
mother need hesitate to place in the
hands of her boy any book written by Mr.
Ellis. They are found in the leading
Sunday-school libraries, where, as may
well be believed, they are in wide
demand and do much good by their
sound, wholesome lessons which render
them as acceptable to parents as to their
children. All of his books published by
Henry T. Coates & Co. are re-issued in
London, and many have been translated
into other languages. Mr. Ellis is a writer
of varied accomplishments, and, in
addition to his stories, is the author of
historical works, of a number of pieces of
popular music and has made several
valuable inventions. Mr. Ellis is in the
prime of his mental and physical powers,
and great as have been the merits of his
past achievements, there is reason to
look for more brilliant productions from
his pen in the near future.

DEERFOOT SERIES.
By Edward S.
3 vols. $3.00
Ellis.
Hunters of the
The Last War Trail.
Ozark.
Camp in the Mountains.

LOG CABIN SERIES.


By Edward S.
3 vols. $3.00
Ellis.
Footprints in the
Lost Trail.
Forest.

Camp-Fire and Wigwam.

BOY PIONEER SERIES.


By Edward S.
3 vols. $3.00
Ellis.
Ned in the Block-
Ned on the River.
House.

Ned in the Woods.

THE NORTHWEST SERIES.


By Edward S.
3 vols. $3.00
Ellis.
Two Boys in
Cowmen and Rustlers.
Wyoming.

A Strange Craft and its Wonderful Voyage.

BOONE AND KENTON SERIES.


By Edward S.
3 vols. $3.00
Ellis.
In the Days of the
Shod with Silence.
Pioneers.

Phantom of the River.

IRON HEART, WAR CHIEF OF THE IROQUOIS.


By Edward S.
1 vol. $1.00
Ellis.
THE NEW DEERFOOT SERIES.
By Edward S.
3 vols. $3.00
Ellis.
Deerfoot in the Deerfoot on the
Forest. Prairie.

Deerfoot in the Mountains.

JACK HAZARD SERIES.


By J. T.
6 vols. $7.25
Trowbridge.
Jack Hazard and
Doing His Best.
His Fortunes.
The Young
A Chance for Himself.
Surveyor.
Lawrence’s
Fast Friends.
Adventures.

International Bibles
Are known the world over for their clear
print, scholarly Helps and absolutely
flexible bindings. They comprise every
variety of readable type in every style of
binding and include Text Bibles,
Reference Bibles, Teachers’ Bibles,
Testaments, Psalms, Illustrated Bibles;
also the “International” Red Letter
Testaments and Red Letter Bibles with
the prophetic types and prophecies
relating to Christ in the Old Testament
printed in red, and the words of Christ in
the New Testament printed in red; also
Christian Workers’ Testament and
Christian Workers’ Bible in which all
subjects or the Theme of Salvation are
indexed and marked in red.
For sale by all booksellers. Catalog of
Books and Bibles mailed on application
to the publishers.

THE JOHN C. WINSTON CO.


Winston Building
PHILADELPHIA, PA.
Transcriber’s Notes

The List of Illustrations at the beginning of the book was created by the
transcriber.
Inconsistencies in spelling and hyphenation such as “bulkhead”/“bulk-head” and
“doorway”/“door-way” have been maintained.
Minor punctuation and spelling errors have been silently corrected and, except
for those changes noted below, all misspellings in the text, and inconsistent or
archaic usage, have been retained.

1. Page 213: “expect to bear a good account” changed to “expect to hear a good
account”.
2. Page 262: “Miller visited Tom in his domitory” changed to “Miller visited Tom in his
dormitory”.
3. Page 300: “waiving his hat to them” changed to “waving his hat to them”.
4. Page 341: “affairs looked rather dubius again” changed to “affairs looked rather
dubious again”.
*** END OF THE PROJECT GUTENBERG EBOOK TOM NEWCOMBE;
OR, THE BOY OF BAD HABITS ***

Updated editions will replace the previous one—the old editions will
be renamed.

Creating the works from print editions not protected by U.S.


copyright law means that no one owns a United States copyright in
these works, so the Foundation (and you!) can copy and distribute it
in the United States without permission and without paying
copyright royalties. Special rules, set forth in the General Terms of
Use part of this license, apply to copying and distributing Project
Gutenberg™ electronic works to protect the PROJECT GUTENBERG™
concept and trademark. Project Gutenberg is a registered trademark,
and may not be used if you charge for an eBook, except by following
the terms of the trademark license, including paying royalties for use
of the Project Gutenberg trademark. If you do not charge anything
for copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such as
creation of derivative works, reports, performances and research.
Project Gutenberg eBooks may be modified and printed and given
away—you may do practically ANYTHING in the United States with
eBooks not protected by U.S. copyright law. Redistribution is subject
to the trademark license, especially commercial redistribution.

START: FULL LICENSE


T H E F U L L P R OJ E C T G U T E N B E R G L I C E N S E
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the free


distribution of electronic works, by using or distributing this work (or
any other work associated in any way with the phrase “Project
Gutenberg”), you agree to comply with all the terms of the Full
Project Gutenberg™ License available with this file or online at
www.gutenberg.org/license.

Section 1. General Terms of Use and


Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand, agree
to and accept all the terms of this license and intellectual property
(trademark/copyright) agreement. If you do not agree to abide by all
the terms of this agreement, you must cease using and return or
destroy all copies of Project Gutenberg™ electronic works in your
possession. If you paid a fee for obtaining a copy of or access to a
Project Gutenberg™ electronic work and you do not agree to be
bound by the terms of this agreement, you may obtain a refund
from the person or entity to whom you paid the fee as set forth in
paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only be


used on or associated in any way with an electronic work by people
who agree to be bound by the terms of this agreement. There are a
few things that you can do with most Project Gutenberg™ electronic
works even without complying with the full terms of this agreement.
See paragraph 1.C below. There are a lot of things you can do with
Project Gutenberg™ electronic works if you follow the terms of this
agreement and help preserve free future access to Project
Gutenberg™ electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright law
in the United States and you are located in the United States, we do
not claim a right to prevent you from copying, distributing,
performing, displaying or creating derivative works based on the
work as long as all references to Project Gutenberg are removed. Of
course, we hope that you will support the Project Gutenberg™
mission of promoting free access to electronic works by freely
sharing Project Gutenberg™ works in compliance with the terms of
this agreement for keeping the Project Gutenberg™ name associated
with the work. You can easily comply with the terms of this
agreement by keeping this work in the same format with its attached
full Project Gutenberg™ License when you share it without charge
with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside the
United States, check the laws of your country in addition to the
terms of this agreement before downloading, copying, displaying,
performing, distributing or creating derivative works based on this
work or any other Project Gutenberg™ work. The Foundation makes
no representations concerning the copyright status of any work in
any country other than the United States.

1.E. Unless you have removed all references to Project Gutenberg:

1.E.1. The following sentence, with active links to, or other


immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project Gutenberg™
work (any work on which the phrase “Project Gutenberg” appears,
or with which the phrase “Project Gutenberg” is associated) is
accessed, displayed, performed, viewed, copied or distributed:
This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it away
or re-use it under the terms of the Project Gutenberg License
included with this eBook or online at www.gutenberg.org. If you
are not located in the United States, you will have to check the
laws of the country where you are located before using this
eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is derived


from texts not protected by U.S. copyright law (does not contain a
notice indicating that it is posted with permission of the copyright
holder), the work can be copied and distributed to anyone in the
United States without paying any fees or charges. If you are
redistributing or providing access to a work with the phrase “Project
Gutenberg” associated with or appearing on the work, you must
comply either with the requirements of paragraphs 1.E.1 through
1.E.7 or obtain permission for the use of the work and the Project
Gutenberg™ trademark as set forth in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is posted


with the permission of the copyright holder, your use and distribution
must comply with both paragraphs 1.E.1 through 1.E.7 and any
additional terms imposed by the copyright holder. Additional terms
will be linked to the Project Gutenberg™ License for all works posted
with the permission of the copyright holder found at the beginning
of this work.

1.E.4. Do not unlink or detach or remove the full Project


Gutenberg™ License terms from this work, or any files containing a
part of this work or any other work associated with Project
Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute this


electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the Project
Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if you
provide access to or distribute copies of a Project Gutenberg™ work
in a format other than “Plain Vanilla ASCII” or other format used in
the official version posted on the official Project Gutenberg™ website
(www.gutenberg.org), you must, at no additional cost, fee or
expense to the user, provide a copy, a means of exporting a copy, or
a means of obtaining a copy upon request, of the work in its original
“Plain Vanilla ASCII” or other form. Any alternate format must
include the full Project Gutenberg™ License as specified in
paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,


performing, copying or distributing any Project Gutenberg™ works
unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or providing


access to or distributing Project Gutenberg™ electronic works
provided that:

• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who


notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of


any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™


electronic work or group of works on different terms than are set
forth in this agreement, you must obtain permission in writing from
the Project Gutenberg Literary Archive Foundation, the manager of
the Project Gutenberg™ trademark. Contact the Foundation as set
forth in Section 3 below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend


considerable effort to identify, do copyright research on, transcribe
and proofread works not protected by U.S. copyright law in creating
the Project Gutenberg™ collection. Despite these efforts, Project
Gutenberg™ electronic works, and the medium on which they may
be stored, may contain “Defects,” such as, but not limited to,
incomplete, inaccurate or corrupt data, transcription errors, a
copyright or other intellectual property infringement, a defective or
damaged disk or other medium, a computer virus, or computer
codes that damage or cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for


the “Right of Replacement or Refund” described in paragraph 1.F.3,
the Project Gutenberg Literary Archive Foundation, the owner of the
Project Gutenberg™ trademark, and any other party distributing a
Project Gutenberg™ electronic work under this agreement, disclaim
all liability to you for damages, costs and expenses, including legal
fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR
NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR
BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK
OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL
NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT,
CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF
YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you


discover a defect in this electronic work within 90 days of receiving
it, you can receive a refund of the money (if any) you paid for it by
sending a written explanation to the person you received the work
from. If you received the work on a physical medium, you must
return the medium with your written explanation. The person or
entity that provided you with the defective work may elect to provide
a replacement copy in lieu of a refund. If you received the work
electronically, the person or entity providing it to you may choose to
give you a second opportunity to receive the work electronically in
lieu of a refund. If the second copy is also defective, you may
demand a refund in writing without further opportunities to fix the
problem.

1.F.4. Except for the limited right of replacement or refund set forth
in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO
OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied


warranties or the exclusion or limitation of certain types of damages.
If any disclaimer or limitation set forth in this agreement violates the
law of the state applicable to this agreement, the agreement shall be
interpreted to make the maximum disclaimer or limitation permitted
by the applicable state law. The invalidity or unenforceability of any
provision of this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation,
the trademark owner, any agent or employee of the Foundation,
anyone providing copies of Project Gutenberg™ electronic works in
accordance with this agreement, and any volunteers associated with
the production, promotion and distribution of Project Gutenberg™
electronic works, harmless from all liability, costs and expenses,
including legal fees, that arise directly or indirectly from any of the
following which you do or cause to occur: (a) distribution of this or
any Project Gutenberg™ work, (b) alteration, modification, or
additions or deletions to any Project Gutenberg™ work, and (c) any
Defect you cause.

Section 2. Information about the Mission of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new computers.
It exists because of the efforts of hundreds of volunteers and
donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project Gutenberg™’s
goals and ensuring that the Project Gutenberg™ collection will
remain freely available for generations to come. In 2001, the Project
Gutenberg Literary Archive Foundation was created to provide a
secure and permanent future for Project Gutenberg™ and future
generations. To learn more about the Project Gutenberg Literary
Archive Foundation and how your efforts and donations can help,
see Sections 3 and 4 and the Foundation information page at
www.gutenberg.org.

Section 3. Information about the Project Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the laws of the
state of Mississippi and granted tax exempt status by the Internal
Revenue Service. The Foundation’s EIN or federal tax identification
number is 64-6221541. Contributions to the Project Gutenberg
Literary Archive Foundation are tax deductible to the full extent
permitted by U.S. federal laws and your state’s laws.

The Foundation’s business office is located at 809 North 1500 West,
Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up
to date contact information can be found at the Foundation’s website
and official page at www.gutenberg.org/contact

Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission of
increasing the number of public domain and licensed works that can
be freely distributed in machine-readable form accessible by the
widest array of equipment including outdated equipment. Many
small donations ($1 to $5,000) are particularly important to
maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws regulating
charities and charitable donations in all 50 states of the United
States. Compliance requirements are not uniform and it takes a
considerable effort, much paperwork and many fees to meet and
keep up with these requirements. We do not solicit donations in
locations where we have not received written confirmation of
compliance. To SEND DONATIONS or determine the status of
compliance for any particular state visit www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where
we have not met the solicitation requirements, we know of no
prohibition against accepting unsolicited donations from donors in
such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.

Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.

Section 5. General Information About Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose network of
volunteer support.

Project Gutenberg™ eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.