Statistical Analysis
D G Rossiter, Department of Earth Systems Analysis, International Institute for Geo-information Science & Earth Observation (ITC), <http://www.itc.nl/personal/rossiter>, January 9, 2006
Topic: Motivation
Why is statistics important? It is part of the quantitative approach to knowledge: "In physical science the first essential step in the direction of learning any subject is to find principles of numerical reckoning and practicable methods for measuring some quality connected with it. I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; . . . but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; . . . it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be." — Lord Kelvin (William Thomson), Popular Lectures and Addresses 1:73
A simple definition
Statistics: "The determination of the probable from the possible" (Davis, Statistics and data analysis in geology, p. 6) . . . which implies the rigorous definition, and then quantification, of "probable": the probable causes of past events or observations; the probable occurrence of future events or observations. This is a definition of inferential statistics: Observations → Inferences
What is statistics?
Two common uses of the word: 1. Descriptive statistics: numerical summaries of samples (what was observed); 2. Inferential statistics: from samples to populations (what could have been, or will be, observed). Example: Descriptive: "The adjustments of 14 GPS control points for this orthorectification ranged from 3.63 to 8.36 m, with an arithmetic mean of 5.145 m." Inferential: "The mean adjustment for any set of GPS points used for orthorectification is no less than 4.3 and no more than 6.1 m; this statement has a 5% probability of being wrong."
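The two statements can be reproduced in R; a minimal sketch with invented adjustment values (only the range endpoints match the slide's figures):

```r
# Hypothetical GPS control-point adjustments (m); values invented for illustration,
# chosen so the range matches the slide (3.63 .. 8.36 m)
adj <- c(3.63, 4.2, 4.5, 4.8, 5.0, 5.1, 5.2, 5.3, 5.5, 5.7, 5.9, 6.1, 6.4, 8.36)
mean(adj)             # descriptive: arithmetic mean of this sample
t.test(adj)$conf.int  # inferential: 95% CI for the population mean
```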
Topic: Introduction
1. Outline of statistical analysis 2. Types of variables 3. Statistical inference 4. Data analysis strategy 5. Univariate analysis 6. Bivariate analysis; correlation; linear regression 7. Analysis of variance 8. Non-parametric methods
Texts
There are hundreds of texts at every level and for every application. Here are a few I have found useful. Elementary: Bulmer, M.G., 1979. Principles of statistics. Dover Publications, New York. Dalgaard, P., 2002. Introductory Statistics with R. Springer-Verlag. Advanced: Venables, W.N. and Ripley, B.D., 2002. Modern applied statistics with S. Springer-Verlag. Fox, J., 1997. Applied regression, linear models, and related methods. Sage, Newbury Park.
Applications: Davis, J.C., 2002. Statistics and data analysis in geology. John Wiley & Sons, New York. * Website: http://www.kgs.ku.edu/Mathgeo/Books/Stat/index.html Webster, R. and Oliver, M.A., 1990. Statistical methods in soil and land resource survey. Oxford University Press, Oxford.
Step 1: Outliers
Three uses of this word: 1. An observation that is some defined distance away from the sample mean (an empirical outlier); 2. An extreme member of a population; 3. An observation in the sample that is not part of the population of interest. Example: in a set of soil samples, one has an order of magnitude greater level of heavy metals (Cd, Pb, Cu etc.) than all the others. 1. The sample is an empirical outlier because it is more than 1.5 times the inter-quartile range above the 3rd quartile; 2. It is an extreme value, but is included in our analysis of soil contamination; 3. It comes from an industrial site and is not part of our target population of agricultural soils.
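The 1.5 × IQR convention in sense 1 can be checked directly in R; a sketch with invented Cd-like values:

```r
# Flag empirical outliers with the boxplot convention: more than 1.5 times the
# inter-quartile range above the 3rd quartile (values invented for illustration)
x <- c(0.2, 0.8, 1.2, 2.1, 2.6, 3.1, 5.6, 8.3, 18.1)
q <- quantile(x, c(0.25, 0.75))
upper <- q[2] + 1.5 * (q[2] - q[1])   # upper fence
x[x > upper]                          # empirical outlier(s): 18.1
```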
Step 2: Understand
If there is an underlying process of which the sampled data are a representative sample . . . then the data allow us to infer the nature of the process. Example: the distribution of heavy metals in soil is the result of: * parent material; * pollutants transported by wind, water, or humans; * transformations in the soil since deposition; * movement of materials within and through the soil; * . . .
Step 3: Prove
A further step is to prove, in some sense, a statement about nature, e.g. "Soil pollution in this area is caused by river flooding; pollutants originate upstream in industrial areas." The model must be plausible evidence of causation. With what confidence can we state that our understanding (model) is correct? Nothing can be proved absolutely; statistics allows us to accumulate evidence. We can design sampling strategies to achieve a given confidence level. Underlying assumptions may not be provable, only plausible.
Step 4: Predict
The model can be applied to unsampled entities in the underlying population: * Interpolation: within the range of the original sample; * Extrapolation: outside this range. The model can be applied to future events; this assumes that future conditions (the context in which the events will take place) are the same as past conditions (cf. the uniformitarianism of Hutton and Playfair). A geo-statistical model can be applied to unsampled locations; this assumes that the process at these locations is the same as at the sample locations. Key point: we must assume that the sample on which the model is based is representative of the population in which the predictions are made. We argue for this with meta-statistical analysis (outside of statistics itself).
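Interpolation and extrapolation can be illustrated with a fitted linear model; a sketch on simulated data (the model form and values are assumptions, not from the lecture):

```r
# Fit a model on sampled x in [1, 10], then predict at unsampled points
set.seed(2)
x <- 1:10
y <- 3 + 2 * x + rnorm(10, sd = 0.5)   # simulated "population" process
m <- lm(y ~ x)
pred.in  <- predict(m, newdata = data.frame(x = 5.5))  # interpolation: inside [1, 10]
pred.out <- predict(m, newdata = data.frame(x = 25))   # extrapolation: outside the range
c(pred.in, pred.out)  # the second is trustworthy only if the process extends beyond x = 10
```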
Nominal variables
Values are from a set of classes with no natural ordering. Example: land uses (agriculture, forestry, residential . . . ). Can determine equality, but not rank. Meaningful sample statistics: mode (class with most observations); frequency distribution (how many observations in each class). Numbers may be used to label the classes, but these are arbitrary and have no numeric meaning (the first class could just as well be the third); ordering is by convenience (e.g. alphabetic). R: unordered factors
Ordinal variables
Values are from a set of naturally ordered classes with no meaningful units of measurement. Example: soil structural grade (0 = structureless, 1 = very weak, 2 = weak, 3 = medium, 4 = strong, 5 = very strong). N.b. this ordering is an intrinsic part of the class definition. Can determine rank (greater than, less than). Meaningful sample statistics: mode; frequency distribution. Numbers may be used to label the classes; their order is meaningful, but the intervals between adjacent classes are not defined (e.g. the interval from 1 to 2 vs. that from 2 to 3). R: ordered factors
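The two variable types map onto R's factor machinery; a small sketch (class labels taken from the examples above):

```r
# Nominal: unordered factor -- equality and counts are meaningful, rank is not
landuse <- factor(c("agriculture", "forestry", "residential", "forestry"))
table(landuse)                       # mode / frequency distribution
# Ordinal: ordered factor -- rank comparisons become meaningful
grade <- ordered(c("weak", "strong", "medium"),
                 levels = c("structureless", "very weak", "weak",
                            "medium", "strong", "very strong"))
grade[1] < grade[2]                  # TRUE: weak < strong
```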
Interval variables
Values are measured on a continuous scale with well-defined units of measurement but no natural origin, i.e. the zero is arbitrary, so that differences are meaningful but ratios are not. Example: temperature in °C. "Yesterday it was twice as warm as today" is meaningless, even though "Today it is 20°C and yesterday it was 10°C" may be true. * (To see this, try the same statement with Fahrenheit temperatures.) Meaningful statistics: quantiles, mean, variance
Ratio variables
Values are measured on a continuous scale with well-defined units of measurement and a natural origin, i.e. the zero is meaningful. Examples: temperature in K; concentration of a chemical in solution. "There is twice as much heat in this system as in that one" is meaningful if one system is at 300 K and the other at 150 K. Meaningful statistics: quantiles, mean, variance; also the coefficient of variation. (Recall: CV = SD / mean; this is a ratio.)
Statistical inference
Using the sample to infer facts about the underlying population of which (we hope) it is representative. Example: the true value of a population mean, estimated from the sample mean and its standard error. * Confidence intervals: having a known probability of containing the true value. * For a sample from a normally-distributed variate, with 95% probability (α = 0.05): x̄ − 1.96 s_x̄ ≤ μ ≤ x̄ + 1.96 s_x̄ * The standard error is estimated from the sample variance: s_x̄ = √(s²_x / n)
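The interval can be computed step by step; a sketch on simulated data (the true mean of 10 is an assumption of the simulation):

```r
# 95% confidence interval for a mean, built from the standard error
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)    # sample from a normal population
se <- sqrt(var(x) / length(x))        # standard error from the sample variance
ci <- mean(x) + c(-1.96, 1.96) * se   # normal-theory 95% interval
ci
```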
Research questions
What research questions are supposed to be answered with the help of these data?
Non-spatial modelling
Univariate descriptions: normality tests, summary statistics. Transformations as necessary and justified. Bivariate relations between variables (correlation). Multivariate relations between variables. Analysis of Variance (ANOVA) on predictive factors (confirms subpopulations).
Spatial modelling
If the data were collected at known points in geographic space, it may be possible to model this. Model the spatial structure * Local models (spatial dependence) * Global models (geographic trends, feature space predictors) * Mixed models
Prediction
Values at points or blocks Summary values (e.g. regional averages) Uncertainty of predictions
Source
Rikken, M.G.J. & Van Rijn, R.P.G., 1993. Soil pollution with heavy metals: an inquiry into spatial variation, cost of mapping and the risk evaluation of copper, cadmium, lead and zinc in the floodplains of the Meuse west of Stein, the Netherlands. Doctoraalveldwerkverslag, Dept. of Physical Geography, Utrecht University. This data set is also used as an example in gstat and in the GIS text of Burrough & McDonnell.
Variables
155 samples taken on a support of 10x10 m from the top 0-20 cm of alluvial soils in a 5x2 km part of the floodplain of the Maas (Meuse) near Stein (NL).

id: point number
x, y: coordinates E and N in Dutch national grid coordinates, in meters
cadmium, copper, lead, zinc: concentration in the soil, in mg kg-1
elev: elevation above local reference level, in meters
om: organic matter loss on ignition, in percent
ffreq: flood frequency class, 1: annual, 2: 2-5 years, 3: every 5 years
soil: soil class, coded
lime: has the land here been limed? 0 or 1 = F or T
landuse: land use, coded
dist.m: distance from main River Maas channel, in meters
Topic: Probability
1. probability 2. discrete and continuous probability distributions 3. normality, transformations
Probability
A very controversial topic, with a deep relation to philosophy. Two major concepts: Bayesian and frequentist; the second can model the first, but not vice-versa. Most elementary statistics courses and computer programs take the frequentist point of view. The probability of an event is: Bayesian: the degree of rational belief that the event will occur, from 0 (impossible) to 1 (certain); Frequentist: the proportion of time the event would occur, should the experiment that gives rise to the event be repeated a large number of times.
Probability distributions
A complete account of the probability of each possible outcome . . . assuming some underlying process. N.b. the sum of the probabilities of all events is by definition 1 (it's certain that something will happen!). Examples: Number of radioactive decays in a given time period: Poisson * assuming exponential decay with constant half-life, independent events. Number of successes in a given number of binary (Bernoulli) trials (e.g. finding water within a fixed depth): Binomial * assuming constant probability of success, independent trials
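Both distributions are built into R; a sketch for the Poisson case (the rate λ = 3 is an invented example):

```r
# Poisson: probability of 0..5 decay events in an interval with mean rate lambda = 3
dpois(0:5, lambda = 3)
# the probabilities of all possible outcomes sum (essentially) to 1
sum(dpois(0:100, lambda = 3))
```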
f(x) = (n choose x) p^x (1 − p)^(n−x)

where

(n choose x) = n! / (x! (n − x)!)

is the binomial coefficient, i.e. the number of different ways of selecting x distinct items out of n total items. Mean and variance: μ = np; σ² = np(1 − p)
Example computation in R
> # number of distinct ways of selecting 2 from 16 > (f2 <- factorial(16)/(factorial(2)*factorial(16-2))) [1] 120 > # direct computation of a single binomial density > # for prob(success) = 0.2 > p <- 0.2; n <- 16; x <- 2 > f2 * p^x * (1-p)^(n-x) [1] 0.21111 > # probability of 0..16 productive wells if prob(success) = 0.2 > round(dbinom(0:16, 16, 0.2),3) [1] 0.028 0.113 0.211 0.246 0.200 0.120 0.055 0.020 [9] 0.006 0.001 0.000 0.000 0.000 0.000 0.000 0.000 [17] 0.000 > # simulate 20 drilling campaigns of 16 wells, prob(success) = 0.2 > trials <- rbinom(20, 16, .2) > summary(trials) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 2.75 3.00 3.45 4.00 8.00 > # compare with theoretical mean and variance > (mu <- n * p) [1] 3.2 > (var <- n * p * (1-p)); var(trials) [1] 2.56 [1] 2.2605 > sort(trials) [1] 1 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 8
[Figure: histogram of 1000 simulated Binomial(32, 0.2) counts with the theoretical density overlaid]
> hist(rbinom(1000, 32, .2), breaks=(0:32), right=F, freq=F, + main="Binomial distribution, p=0.2, 32 trials") > points(cbind((0:32)+0.5,dbinom(0:32, 32, 0.2)), col="blue", + pch=20, cex=2)
f(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)), −∞ < x < +∞

F(z) = ∫_{x = −∞}^{z} f(x) dx
> # 8 normal variates with mean 1.6, var .2 > rnorm(8, 1.6, .2) [1] 1.771682 1.910130 1.518092 1.712963 1.365242 1.837332 1.777395 1.749878 > # z-values for some common probabilities > qnorm(seq(0.80,0.95, by=.05),1.6,.2) [1] 1.768324 1.807287 1.856310 1.928971
[Figure: normal probability density curve, mu = 16, sigma = 5]
> range <- seq(0,32, by=.1) > plot(range, dnorm(range, 16, 2), type="l") # etc.
Standardization
All normally-distributed variates can be directly compared by standardization: subtract μ, divide by σ. Standardized normal: all variables have the same scale: μ = 0, σ = 1.

f(x) = (1 / √(2π)) e^(−x² / 2)
> sdze<-function(x) { (x-mean(x))/sd(x) }
Evaluating Normality
Graphical * Histograms * Quantile-quantile plots (normal probability plots) Numerical * Various tests, including Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilk * These all work by comparing the observed distribution with the theoretical normal distribution having parameters estimated from the observed, and computing the probability that the observed is a realisation of the theoretical
> qqnorm(cadmium); qqline(cadmium) > shapiro.test(cadmium) Shapiro-Wilk normality test W = 0.7856, p-value = 8.601e-14
[Figure: normal Q-Q plot of cadmium: sample quantiles vs. theoretical quantiles]
[Figure: histograms of four samples (Sample 1-4) drawn from the same population, x-range 120-240; reported means 182.75 and 182.77, standard deviations 19.55 and 17.44]
x′ = sin⁻¹ √x: arcsine, for proportions x ∈ [0 . . . 1]; spreads the distribution near the tails. x′ = ln[x / (1 − x)]: logit (logistic), for proportions x ∈ [0 . . . 1]; note: must add a small adjustment to zeroes.
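The zero adjustment for the logit can be sketched in R; using half the smallest nonzero proportion as the adjustment is an assumption here (other conventions exist):

```r
# Logit transform of proportions, nudging 0 and 1 away from the boundaries
p <- c(0, 0.02, 0.10, 0.50, 0.90)
eps <- min(p[p > 0]) / 2              # assumed adjustment: half the smallest nonzero value
p.adj <- pmin(pmax(p, eps), 1 - eps)
logit <- log(p.adj / (1 - p.adj))
logit                                 # finite for all inputs; logit(0.5) = 0
```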
[Figure: histograms of log(cadmium) at two bin widths, and a normal Q-Q plot of log(cadmium)]
One population or several? Outliers? Centered or skewed (mean vs. median)? Heavy or light tails (kurtosis)?
stem(cadmium) boxplot(cadmium) boxplot(cadmium, horizontal = T) points(mean(cadmium),1, pch=20, cex=2, col="blue") hist(cadmium) #automatic bin selection hist(cadmium, n=16) #specify the number of bins hist(cadmium, breaks=seq(0,20, by=1)) #specify breakpoints plot(ecdf(cadmium))
[Figure: boxplot and histograms (two bin choices) of cadmium, 0-20 mg kg-1]
[Figure: empirical cumulative distribution of cadmium: proportion vs. Cd mg kg-1]
x̄ = (1/n) Σ_{i=1}^{n} x_i

> summary(cadmium)
   Min. 1st Qu.  Median
  0.200   0.800   2.100
> var(cadmium)
[1] 12.41678
> sd(cadmium) [1] 3.523746 > sqrt(var(cadmium)) [1] 3.523746 > round((sqrt(var(cadmium))/mean(cadmium))*100,0) [1] 109
Cautions
The quantiles, including the median, are always meaningful. The mean and variance are mathematically meaningful, but not so useful unless the sample is approximately normal; this implies one population (unimodal).
> quantile(cadmium, probs=seq(0, 1, .1)) 0% 10% 20% 30% 40% 50% 60% 0.20 0.20 0.64 1.20 1.56 2.10 2.64 70% 3.10 80% 5.64 90% 100% 8.26 18.10
Test whether the mean is less than a target value; the user must set α (the confidence level):
> t.test(cadmium, alt="less", mu=3, conf.level = .99) t = 0.8685, df = 154, p-value = 0.8068 alternative hypothesis: true mean is less than 3 99 percent confidence interval: -Inf 3.91116 sample estimates: mean of x 3.24581
Note that in this case the confidence interval is one-sided: from −∞ to 3.91116; we don't care what the mean is if it's less than 3.
Bivariate scatterplot
Shows the relation of two variates in feature space (a plane made up of the two variables' ranges). Display two ways: * Non-standardized: with original values on the axes (and the same zero); shows relative magnitudes. * Standardized to zero sample means and unit variances: shows relative spreads. * Note: some displays automatically scale the axes, so that non-standardized looks like standardized.
Scatterplots of two heavy metals; automatic vs. same scales; also log-transformed; standardized and not.
> plot(lead, zinc)
> abline(v=mean(lead)); abline(h=mean(zinc))
> lim <- c(min(min(lead,zinc)), max(max(lead,zinc)))
> plot(lead, zinc, xlim=lim, ylim=lim)
> abline(v=mean(lead)); abline(h=mean(zinc))
> plot(log(lead), log(zinc))
> abline(v=mean(log(lead))); abline(h=mean(log(zinc)))
> plot(log(lead), log(zinc), xlim=log(lim), ylim=log(lim))
> abline(v=mean(log(lead))); abline(h=mean(log(zinc)))
> sdze <- function(x) { (x-mean(x))/sd(x) }
> plot(sdze(lead), sdze(zinc)); abline(h=0); abline(v=0)
> plot(sdze(log(lead)), sdze(log(zinc))); abline(h=0); abline(v=0)
[Figure: scatterplots of zinc vs. lead, on original and log10 scales]
Topic: Regression
A general term for modelling the distribution of one variable (the response, or dependent variable) from (on) another (the predictor, or independent variable). This is only logical if we have a priori (non-statistical) reasons to believe in a causal relation. Correlation makes no assumptions about causation; both variables have the same logical status. Regression assumes one variable is the predictor and the other the response.
[Figure: scatterplots of the four Anscombe datasets, y1-y4 vs. x1-x4]
Sums of squares
The regression partitions the variability in the sample into two parts: 1. explained by the model; 2. not explained, left over, i.e. residual. Note that we always know the mean, so the total variability refers to the variability around the mean. Question: how much of the total variability is explained by the model? Total SS = Regression SS + Residual SS
Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)²
The least squares estimate maximizes the Regression SS and minimizes the Residual SS
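The partition can be verified numerically; a sketch on simulated data:

```r
# Check Total SS = Regression SS + Residual SS for a least-squares fit
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
m <- lm(y ~ x)
tss   <- sum((y - mean(y))^2)          # total variability around the mean
regss <- sum((fitted(m) - mean(y))^2)  # explained by the model
rss   <- sum(residuals(m)^2)           # left over
all.equal(tss, regss + rss)            # TRUE (up to numerical error)
```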
[Figure: scatterplots of log(cadmium) vs. organic matter (om), with flood frequency class distinguished]
Note the additional information we get from visualising the flood frequency class.
(fragment of the lm summary) Max residual: 2.0503; coefficient t values: −5.437 (Pr(>|t|) = 2.13e-07 ***) and 9.202 (Pr(>|t|) = 2.70e-16 ***); residual standard error on 151 degrees of freedom; Adjusted R-squared: 0.3551; model p-value: 2.703e-16
A highly-significant model, but organic matter content explains only about 35% of the variability of log(Cd).
Regression diagnostics
Objectives: to see if the regression truly represents the presumed relation, and to see if the computational methods are adequate. Main tool: plot of standardized residuals vs. fitted values. Numerical measures: leverage, large residuals.
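A minimal version of the main diagnostic plot, using base R's rstandard() rather than the studres() that appears later in the slides; the simulated model is an assumption:

```r
# Standardized residuals vs. fitted values for a linear model
set.seed(5)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
m <- lm(y ~ x)
r.std <- rstandard(m)                # standardized residuals
plot(fitted(m), r.std); abline(h = 0)
which(abs(r.std) > 2)                # candidate large residuals
head(hatvalues(m))                   # leverage of the first few observations
```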
We can see problems at the low metal concentrations. This is probably an artifact of the measurement precision at these levels (near or below the detection limit). These are almost all in flood frequency class 3 (rarely flooded).
[Figure: log(cadmium) vs. om with fitted values; studentized residuals vs. fitted values; normal Q-Q plot of residuals]
Much higher R² and better diagnostics. Still, there is a lot of spread at any value of the predictor (organic matter).
[Figure: log(cdx) vs. om with fitted values; studentized residuals vs. fitted values; normal Q-Q plot of residuals]
Still higher R² and excellent diagnostics. There is still a lot of spread at any value of the predictor (organic matter), so OM is not an efficient predictor of Cd.
[Figure: log(cadmium) vs. om with fitted values; studentized residuals vs. fitted values; normal Q-Q plot of residuals]
Categorical ANOVA
Model the response by a categorical variable (nominal); ordinal variables are treated as nominal. Model: y = β₀ + β_j x_j + ε, where x_j is a 0/1 indicator for the class j (of n classes) to which the observation belongs; each observation's indicator is multiplied by the corresponding β_j. The β_j represent the deviations of each class mean from the grand mean.
Categorical EDA
> boxplot(cadmium ~ ffreq,xlab="Flood frequency class",ylab="Cadmium (ppm)")
[Figure: boxplots of cadmium (ppm) by flood frequency class]
Example ANOVA
> m<-lm(log(cadmium) ~ ffreq) > summary(m) Residuals: Min 1Q Median 3Q Max -1.8512 -0.7968 -0.1960 0.7331 1.9354 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.32743 0.09351 14.196 < 2e-16 *** ffreq2 -1.95451 0.15506 -12.605 < 2e-16 *** ffreq3 -1.08566 0.20168 -5.383 2.72e-07 *** Residual standard error: 0.857 on 152 degrees of freedom Multiple R-Squared: 0.5169, Adjusted R-squared: 0.5105 F-statistic: 81.31 on 2 and 152 DF, p-value: < 2.2e-16
All per-pair class differences are significant (the confidence interval does not include zero).
Non-parametric statistics
A non-parametric statistic is one that does not assume any underlying data distribution. For example: a mean is an estimate of a parameter of location of some assumed distribution (e.g. the mid-point of a normal, the expected proportion of successes in a binomial, . . . ); a median is simply the value at which half the samples are smaller and half larger, without knowing anything about the distribution underlying the process which produced the sample. So non-parametric inferential methods are those that make no assumptions about the distribution of the data values, only their order (rank).
Then the sample Pearson's correlation coefficient is computed as: r_XY = Cov(X, Y) / (s_X s_Y)
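The definition can be checked against R's built-in cor(); a sketch on simulated data:

```r
# Pearson's r from the definition, compared with cor()
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30)
r <- cov(x, y) / (sd(x) * sd(y))
all.equal(r, cor(x, y))   # TRUE
```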
[Figure: scatterplots illustrating correlation: top row r = 0.099, 0.157, 0.114; bottom row r = 0.978, 0.978, 0.98]
Non-parametric correlation
The solution here is to use a method such as Spearman's correlation, which correlates the ranks, not the values; therefore the distribution (gaps between values) has no influence. From numbers to ranks:
> n<-10 > (x<-rnorm(n, 20, 4)) [1] 15.1179 23.7801 21.2801 21.5191 23.0096 18.5065 19.1448 24.9254 29.3211 [10] 14.1453 > (ix<-(sort(x, index=T)$ix)) [1] 10 1 6 7 3 4 5 2 8 9
If we change the largest of these to any large value, the rank does not change:
> x[ix[n]]<-120; x [1] 15.1179 23.7801 21.2801 [9] 120.0000 14.1453 > (ix<-(sort(x, index=T)$ix)) [1] 10 1 6 7 3 4 5 2 8 21.5191 23.0096 18.5065 19.1448 24.9254
The Pearson (parametric) coefficient is completely changed by the one high-valued pair, whereas the Spearman coefficient is unaffected.
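The same effect can be reproduced directly with cor(); a sketch on simulated data (the value 120 echoes the rank example above):

```r
# One inflated value changes Pearson's r but leaves Spearman's untouched,
# because the ranks are unchanged
set.seed(3)
x <- rnorm(20, 20, 4)
y <- x + rnorm(20)
rp1 <- cor(x, y); rs1 <- cor(x, y, method = "spearman")
x[which.max(x)] <- 120               # inflate the largest x; its rank stays the same
rp2 <- cor(x, y); rs2 <- cor(x, y, method = "spearman")
c(pearson.change = rp1 - rp2, spearman.change = rs1 - rs2)
```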