LHC Physicists' Statistics Guide
Harrison B. Prosper
Florida State University
Lecture 1
Descriptive Statistics
Probability & Likelihood
Lecture 2
Frequentist Inference
Lecture 3
Bayesian Inference
2
Descriptive Statistics
Descriptive Statistics: Samples
Definition: A statistic is any function of the data,
x = x₁, x₂, …, xₙ. Here are some simple examples:
the sample average x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
the sample moments m_r = (1/n) Σᵢ₌₁ⁿ xᵢʳ
and the sample variance S² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²
4
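These definitions are easy to make concrete. Here is a minimal Python sketch (our own illustration, not from the slides) that computes the sample average, the second sample moment, and the sample variance:

```python
def sample_stats(x):
    """Sample average, second sample moment, and sample variance S^2,
    using the 1/n convention from the slide (not 1/(n-1))."""
    n = len(x)
    xbar = sum(v for v in x) / n                # sample average
    m2 = sum(v * v for v in x) / n              # second sample moment
    s2 = sum((v - xbar) ** 2 for v in x) / n    # sample variance S^2
    return xbar, m2, s2
```

Note the identity S² = m₂ − x̄², which is useful when computing the expectation value of the sample variance.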
Descriptive Statistics: Samples
It is often useful to order the data so that
x(1) < x(2) < … < x(n)
5
Descriptive Statistics: Populations
Now consider an infinitely large sample, called a population.
Mean               µ = E[x]
Error              ε = x − µ
Mean Square Error  MSE = E[ε²]
Bias               b = E[x] − µ
Variance           V[x] = E[(x − E[x])²]
7
Descriptive Statistics – 3
RMS = √MSE
8
Descriptive Statistics – 4
Consider the expected value of the sample variance:
E[S²] = E[(1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²]
      = E[(1/n) Σᵢ₌₁ⁿ xᵢ² − x̄²]
      = (1 − 1/n) V[x],
i.e., S² is a biased estimate of the population variance.
13
Probability – 3
A and B are mutually exclusive if
P(AB) = 0
A and B are exhaustive if
P(A) + P(B) = 1
Theorem
P(A ∪ B) = P(A) + P(B) − P(AB)
[Venn diagram: sets A and B with overlap AB]
15
Bayes Theorem: Are You Doomed?
Diagnostic Example (Michael Goldstein)
You are Diseased (event D)
You are Healthy (event H)
A test result is either positive (event +) or negative (event –)
Let P(+ | D) = 0.99 and P(+ | H) = 0.01.
Your test result is positive. Are you doomed? It all depends…
E[f] = Σᵢ f(xᵢ) P(xᵢ)
17
Probability: Some Definitions
Suppose we have potential observations (random variables) x
and y, then their covariance is the functional
Cov[f, g] = ∫∫ f(x) g(y) p(x, y) dx dy
where
f (x) = x – E[x] and g(y) = y – E[y] and
p(x, y) is the joint probability density of x and y.
18
Probability: What Exactly Is It?
There are at least two interpretations of probability:
19
Binomial & Poisson Distributions
Binomial & Poisson Distributions – 1
A Bernoulli trial has two outcomes:
S = success or F = failure.
21
Binomial & Poisson Distributions – 2
Let p be the probability of a success, which is assumed to be
the same at each trial. Since S and F are exhaustive, the
probability of a failure is 1 – p.
For a given order O of n trials, the probability P(k, O, n) of
exactly k successes and n − k failures is P(k, O, n) = pᵏ (1 − p)ⁿ⁻ᵏ
22
Binomial & Poisson Distributions – 3
If the order O of successes and failures is assumed to be
irrelevant, we can eliminate the order from the problem by
summing over all possible orders
P(k, n) = Binomial(k, n, p) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ
23
Binomial & Poisson Distributions – 4
We can prove that the mean number of successes a is
a = p n.  Exercise 4: Prove it.
Some common probability distributions:
Uniform(x, a) = 1/a
Gaussian(x, µ, σ) = exp[−(x − µ)²/(2σ²)] / (σ√(2π))
LogNormal(x, µ, σ) = exp[−(ln x − µ)²/(2σ²)] / (xσ√(2π))
Chisq(x, n) = x^(n/2 − 1) exp(−x/2) / [2^(n/2) Γ(n/2)]
Gamma(x, a, b) = x^(b−1) a^b exp(−ax) / Γ(b)
Exp(x, a) = a exp(−ax)
Binomial(k, n, p) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ
Poisson(k, a) = aᵏ exp(−a) / k!
Multinomial(k, n, p) = [n!/(k₁! ⋯ k_K!)] Πᵢ₌₁ᴷ pᵢ^(kᵢ),  with Σᵢ₌₁ᴷ pᵢ = 1,  Σᵢ₌₁ᴷ kᵢ = n
25
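Exercise 4 can also be checked numerically. This short Python sketch (our own, for illustration) evaluates the binomial probabilities directly and confirms that the mean number of successes is a = pn:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Binomial(k, n, p) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(probs)                                         # should be 1
mean = sum(k * pk for k, pk in zip(range(n + 1), probs))   # should be n*p
```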
Likelihood
Likelihood – 1
The likelihood function is simply the probability, or
probability density function (pdf), evaluated at the observed
data.
27
Likelihood – 2
Example 2:
(CMS, Phys. Rev. D 87, 052017 (2013))
Observed counts Dᵢ
p(D | p) = Multinomial(D, N, p)
D = D₁, …, D_K,  p = p₁, …, p_K
Σᵢ₌₁ᴷ Dᵢ = N,  Σᵢ₌₁ᴷ pᵢ = 1
28
Likelihood – 3
Example 3: (Union2.1 Compilation, SCP)
Red shift and distance modulus measurements of
N = 580 Type Ia supernovae
p(D | Ω_M, Ω_Λ, Q) = Πᵢ₌₁ᴺ Gaussian(xᵢ, µ(zᵢ, Ω_M, Ω_Λ, Q), σᵢ)
D = {zᵢ, xᵢ, σᵢ}
This is an example of
an un-binned likelihood for
heteroscedastic data.
29
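The structure of such an un-binned, heteroscedastic likelihood is easy to sketch in Python. The model µ(z, θ) below is a deliberately trivial stand-in (a straight line), not the supernova distance-modulus model, and the data are invented:

```python
from math import log, pi

def neg_log_like(theta, data, mu):
    """-ln of an un-binned Gaussian likelihood with per-event sigma_i.
    data = list of (z_i, x_i, sigma_i); mu(z, theta) = model prediction."""
    nll = 0.0
    for z, x, sigma in data:
        r = (x - mu(z, theta)) / sigma
        nll += 0.5 * r * r + log(sigma) + 0.5 * log(2 * pi)
    return nll

# toy usage: straight-line "model" with data generated exactly on it
toy = [(1.0, 2.0, 0.1), (2.0, 4.0, 0.3), (3.0, 6.0, 0.2)]
line = lambda z, theta: theta * z
```

Minimizing neg_log_like over θ recovers the slope; note that each event carries its own σᵢ, which is what "heteroscedastic" means here.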
Likelihood – 4
Example 4: Higgs to γγ (CMS & ATLAS, 2012 – 15)
The analyses of the di-photon final states use an un-binned
likelihood of the form,
p(x | s, m, w, b) = exp[−(s + b)] Πᵢ₌₁ᴺ [s f_s(xᵢ, m, w) + b f_b(xᵢ)]
31
Example 1: W±W±jj Production
(ATLAS)
Evidence for electroweak production of W±W±jj (2014)
PRL 113, 141803 (2014)
knowns:
D = 12 observed events (μ±μ± mode)
B = 3.0 ± 0.6 background events
unknowns:
b expected background count
s expected signal count
d=b+s expected event count
where
Q = (B/δB)² = (3.0/0.6)² = 25.0,  B = Q/q
q = B/δB² = 3.0/0.6² = 8.33,  δB = √Q / q
33
Example 4: Higgs to γγ (CMS)
Eur. Phys. J. C74 (2014) 3076
background model
fb (x,a), x m
signal model
fs (x | m,w)
p(x | s, m, w, b, a) = exp[−(s + b)] Πᵢ₌₁ᴺ [s f_s(xᵢ, m, w) + b f_b(xᵢ, a)]
34
Example 4: Higgs to γγ (CMS)
Tomorrow, we shall study a
toy version of the likelihood:
background model
f_b(x, a) = A exp[−(a₁x + a₂x²)]
signal model
f_s(x, m, w) = Gaussian(x, m, w)
p(x | s, m, w, b, a) = exp[−(s + b)] Πᵢ₌₁ᴺ [s f_s(xᵢ, m, w) + b f_b(xᵢ, a)]
35
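This toy likelihood can be written down almost verbatim in Python. The numerical values used below are invented for illustration, and the constant A is left to the user since it must normalize the background over the chosen fit range:

```python
from math import exp, log, sqrt, pi

def f_s(x, m, w):
    """Signal model: Gaussian(x, m, w)."""
    return exp(-0.5 * ((x - m) / w) ** 2) / (w * sqrt(2 * pi))

def f_b(x, a1, a2, A):
    """Toy background model: A exp[-(a1 x + a2 x^2)]."""
    return A * exp(-(a1 * x + a2 * x * x))

def nll(s, b, m, w, a1, a2, A, xs):
    """-ln of the extended un-binned likelihood p(x | s, m, w, b, a)."""
    result = s + b                       # from the exp[-(s + b)] factor
    for x in xs:
        result -= log(s * f_s(x, m, w) + b * f_b(x, a1, a2, A))
    return result
```

Minimizing nll over (s, b, m, w, a₁, a₂) is then a job for a numerical optimizer such as Minuit.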
Summary
Statistic
A statistic is any calculable function of potential observations
Probability
Probability is an abstraction that must be interpreted
Likelihood
The likelihood is the probability (or probability density)
of potential observations evaluated at the observed data
36
Practical Statistics for LHC Physicists
Frequentist Inference
Harrison B. Prosper
Florida State University
Confidence Intervals
Hypothesis Tests
2
The Frequentist Principle
The Frequentist Principle
The Frequentist Principle (FP) (Neyman, 1937)
4
The Frequentist Principle
Points to Note:
1. The frequentist principle applies to real ensembles, not
just the ones we simulate on a computer. Moreover, the
statements need not all be about the same quantity.
6
The Frequentist Principle
Example continued
Suppose each mean count θ is randomly sampled from
Uniform(θ, 5), and suppose we know these numbers.
Exercise 7:
Show that the coverage f is 0.62.
7
Confidence Intervals
8
Confidence Intervals – 1
Consider an experiment that observes N events with expected
count s.
9
Confidence Intervals – 2
Suppose we know s. We could then find a region in the
sample space with probability f ≥ p = confidence level (C.L.)
[Figure: a region of the sample space with probability content f ≥ p at the given s; the left and right tail probabilities αL and αR satisfy αL + f + αR = 1]
10
Confidence Intervals – 3
But, in reality we do not know s! So, we must repeat this
procedure for every s that is possible, a priori.
11
Confidence Intervals – 5
Through this procedure we build two curves l(D) and u(D)
that define lower and upper limits, respectively.
13
Confidence Intervals – 6
Suppose, the s shown is the true value for one experiment. The
probability to get an interval [l(D), u(D)] that includes s is ≥ p.
14
Confidence Intervals – 7
There are many ways to create a region, in the sample space,
with probability f. Here is the classic way (Neyman, 1937):
For every D solve
αL = P(x ≤ D | u),  αR = P(x ≥ D | l),
with αL + f + αR = 1 and f ≥ p.
[Figure: the resulting upper curve u(D) and lower curve l(D) in the parameter space]
15
Confidence Intervals – 8
Here are a few ways to construct sample space intervals
1. Central Intervals (Neyman, 1937)
Solve αL = P(x ≤ D | u) and αR = P(x ≥ D | l)
with αR = αL = (1 – CL)/2
[Plot: central and mode-centered intervals, e.g. [D – √D, D + √D]]
17
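The central construction can be coded directly for a Poisson observation. This sketch (our own) solves the two tail equations by bisection; for n = 3 at 90% C.L. it reproduces the standard interval [0.818, 7.754]:

```python
from math import exp

def pois_cdf(n, mu):
    """P(x <= n) for a Poisson distribution with mean mu."""
    term = total = exp(-mu)
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def bisect(f, lo, hi):
    """Root of a function that is positive at lo and negative at hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def central_interval(n, cl=0.90):
    alpha = 0.5 * (1.0 - cl)
    # upper limit u: solve P(x <= n | u) = alpha
    u = bisect(lambda mu: pois_cdf(n, mu) - alpha, 0.0, n + 30.0)
    # lower limit l: solve P(x >= n | l) = alpha
    l = 0.0 if n == 0 else bisect(
        lambda mu: alpha - (1.0 - pois_cdf(n - 1, mu)), 0.0, float(n))
    return l, u
```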
The Profile Likelihood
Nuisance Parameters are a Nuisance!
All models are “wrong”! But,…
…to the degree that the probability models are accurate
models of the data generation mechanisms, the Neyman
construction, by construction, satisfies the FP exactly.
21
Nuisance Parameters are a Nuisance!
One way or another, we have to rid our probability models of
all nuisance parameters if we wish to make inferences
about the parameters of interest, such as the expected
signal.
Example 1:
Evidence for electroweak production of W±W±jj
(ATLAS, 2014)
PRL 113, 141803 (2014)
22
Example 1: W±W±jj Production
(ATLAS)
First, let’s be clear about knowns and (known) unknowns:
knowns:
D = 12 observed events (μ±μ± mode)
B = 3.0 ± 0.6 background events
unknowns:
b expected background count
s expected signal count
23
Example 1: W±W±jj Production
(ATLAS)
Probability:
P(D | s, b) = Poisson(D, s + b) × Poisson(Q, bq)
            = [(s + b)^D e^−(s+b) / D!] × [(bq)^Q e^−bq / Γ(Q + 1)]
Likelihood:
L(s, b) = P(12 | s, b)
24
Example 1: W±W±jj Production
(ATLAS)
Now that we have a likelihood, we can estimate its
parameters, for example, by maximizing the likelihood:
∂lnL(s,b)/∂s = 0 and ∂lnL(s,b)/∂b = 0  →  ŝ, b̂
ŝ = D − B,  b̂ = B
with D = 12 observed events (μ±μ± mode)
B = 3.0 ± 0.6 background events
25
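The analytic result ŝ = D − B, b̂ = B can be confirmed with a brute-force grid scan of ln L(s, b), using the numbers from the slide (a quick numerical sketch):

```python
from math import log

D, B, dB = 12, 3.0, 0.6
Q = round((B / dB) ** 2)      # 25
q = B / dB**2                 # 8.33...

def lnL(s, b):
    """ln of Poisson(D, s+b) * Poisson(Q, b q), up to constants."""
    return D * log(s + b) - (s + b) + Q * log(b * q) - b * q

# scan a grid around the analytic solution (s in [7, 11], b in [2, 4])
best = max((lnL(0.01 * i, 0.01 * j), 0.01 * i, 0.01 * j)
           for i in range(700, 1101) for j in range(200, 401))
s_hat, b_hat = best[1], best[2]
```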
Maximum Likelihood – An Aside
The Good
Maximum likelihood estimates are consistent: the
RMS goes to zero as more and more data are acquired.
If an unbiased estimate for a parameter exists, the
maximum likelihood procedure will find it.
Given the MLE for s, the MLE for y = g(s) is just ŷ = g(ŝ).
Exercise 8: Show this. Hint: perform a Taylor expansion
about the MLE and consider its ensemble average.
The Bad (according to some!)
In general, MLEs are biased.
The Ugly (according to some!)
Correcting for bias, however, can waste data and
sometimes yield absurdities. (See Seriously Ugly)
26
The Profile Likelihood – 1
In order to make an inference about the W±W±jj signal, s, the
2-parameter problem
p(D | s, b) = [(s + b)^D e^−(s+b) / D!] × [(bq)^Q e^−bq / Γ(Q + 1)]
must be reduced to one involving the parameter of interest, s, only.
29
The Profile Likelihood – 4
If ŝ does not occur on the boundary of the parameter space, and
if the data sample is large enough so that the density of ŝ is
approximately
Gaussian(ŝ, s, σ),
then
t(s) = (s − ŝ)² / σ²
has a χ² density of one degree of freedom, where σ² = 2 / t″(ŝ).
30
The Profile Likelihood – 5
The CMLE of b is
b̂(s) = [g + √(g² + 4(1 + q)Qs)] / [2(1 + q)]
with
g = D + Q − (1 + q)s.
Note that s = D − B, b = B is
the mode (peak) of the
likelihood.
31
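The quadratic-formula solution for b̂(s) can be checked by verifying that it zeroes the derivative ∂lnL/∂b = D/(s+b) − 1 + Q/b − q (a small numerical sketch using the slide's numbers):

```python
from math import sqrt

D, Q, q = 12, 25, 3.0 / 0.6**2     # q = B/dB^2 = 8.33...

def b_hat(s):
    """Conditional MLE of b at fixed s, from the quadratic dlnL/db = 0."""
    g = D + Q - (1 + q) * s
    return (g + sqrt(g * g + 4 * (1 + q) * Q * s)) / (2 * (1 + q))

def dlnL_db(s, b):
    """Derivative of ln L(s, b) with respect to b."""
    return D / (s + b) - 1 + Q / b - q
```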
The Profile Likelihood – 6
By solving t(s) = 1, we obtain an approximate interval [ŝ − σ, ŝ + σ]
@ ~ 68% C.L.
32
Hypothesis Tests
Hypothesis Tests
The basic idea is simple:
1. Decide which hypothesis you may end up rejecting. This
is called the null hypothesis. At the LHC, this is typically
the background-only hypothesis.
p-value = P(x ≥ x₀ | H₀), where
x₀ is the observed value of the test statistic.
35
Example 1: W±W±jj Production
(ATLAS)
Background, B = 3.0 events (ignoring uncertainty)
p(D | H₀) = Poisson(D | B)
D = 12 is the observed count
p-value = Σ_{D≥12} Poisson(D | 3.0) = 7.1 × 10⁻⁵
37
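This tail sum takes a few lines of Python (stdlib only):

```python
from math import exp

def poisson_tail(n, mu):
    """p-value: P(N >= n) for a Poisson distribution with mean mu."""
    if n == 0:
        return 1.0
    term = cdf = exp(-mu)
    for k in range(1, n):
        term *= mu / k
        cdf += term                 # cdf now holds P(N <= n-1)
    return 1.0 - cdf

p = poisson_tail(12, 3.0)           # about 7.1e-5, as on the slide
```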
The Neyman-Pearson Test
In Neyman’s approach,
hypothesis tests are
a contest between
significance and
power, i.e., the probability
to accept a true alternative:
significance α = ∫_{xc}^∞ p(x | H₀) dx
power p = ∫_{xc}^∞ p(x | H₁) dx
where xc is the cut on the test statistic x.
[Plot: p(x | H₀) and p(x | H₁) versus x]
39
Hypothesis Tests
This is all well and good, but what do we do when we are
bedeviled with nuisance parameters?
…well, we’ll talk about that tomorrow and also talk about
Bayesian inference.
40
Summary
Frequentist Inference
1) Uses the likelihood.
41
The Seriously Ugly
The moment generating function of a probability
distribution P(k) is the average M(t) = E[e^{tk}].
For the binomial, e.g., the second moment is
M₂ = (np)² + np − np²
42
The Seriously Ugly
Given that k events out of n pass a set of cuts, the MLE of the
event selection efficiency is
p̂ = k / n
and the obvious estimate of p² is
k²/n².
But
E[k²/n²] = p² + p(1 − p)/n ≠ p²,
so k²/n² is a biased estimate of p². Exercise 8b: Show this.
Harrison B. Prosper
Florida State University
2
Outline
Bayesian Inference
3
Hypothesis Tests
In order to perform a realistic hypothesis test we need first to
rid ourselves of nuisance parameters.
4
Example 1: W±W±jj Production
(ATLAS)
Recall that for B = 3.0 events (ignoring the uncertainty)
p(D | H₀) = Poisson(D | B)
D = 12 is the observed count
we found
p-value = Σ_{D≥12} Poisson(D | 3.0) = 7.1 × 10⁻⁵
5
Example 1: W±W±jj Production
(ATLAS)
Method 1: We eliminate b from the problem by marginalizing
over it with respect to a flat prior*.
6
Example 1: W±W±jj Production
(ATLAS)
Background, B = 3.0 ± 0.6 events
Marginalizing over b (flat prior) gives, with x = 1/(1 + q),
p(D | H₀) = p(D | s = 0) = C(D + Q, D) x^D (1 − x)^(Q+1)
D = 12 is the observed count
p-value = Σ_{D≥12} p(D | H₀) = 2.1 × 10⁻⁴
This is equivalent to 3.5 σ, which may be compared with
the 3.8 σ obtained earlier.
Exercise 11: Verify this calculation.
7
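Exercise 11 can be attempted numerically. The sketch below uses the negative-binomial form that results from marginalizing the Poisson background over b with a flat prior (our reconstruction of the slide's Beta-function expression; the exact normalization conventions may differ):

```python
from math import comb

D, Q, q = 12, 25, 3.0 / 0.6**2         # Q = (B/dB)^2, q = B/dB^2
x = 1.0 / (1.0 + q)

def p_bkg(n):
    """P(n | H0): Poisson(n, b) averaged over the flat-prior posterior for b."""
    return comb(n + Q, n) * x**n * (1 - x) ** (Q + 1)

norm = sum(p_bkg(n) for n in range(200))          # should be 1
p_value = sum(p_bkg(n) for n in range(12, 200))   # roughly 2e-4
```

The tail is larger than the pure-Poisson 7.1 × 10⁻⁵ because the background uncertainty broadens the distribution.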
An Aside on s / √b
The quantity s / √b is often used as a rough measure of
significance on the “n-σ” scale. But, it should be used with
caution.
Beware of s / √b!
8
The Profile Likelihood Revisited
Recall that the profile likelihood is just the likelihood with all
nuisance parameters replaced by their conditional
maximum likelihood estimates (CMLE).
In our example,
9
The Profile Likelihood Revisited
t(s) can also be used to test hypotheses, in particular, s = 0.
11
Example 1: W±W±jj Production
(ATLAS)
Background, B = 3.0 ± 0.6 events. For this example,
t_obs(0) = 12.65,
therefore Z = √t_obs(0) = 3.6
D = 12
15
Bayesian Inference
Bayesian Inference – 1
Definition:
A method is Bayesian if
1. it is based on the degree of belief interpretation of
probability and if
2. it uses Bayes’ theorem
p(θ, ω | D) = p(D | θ, ω) π(θ, ω) / p(D)
for all inferences.
D observed data
θ parameter of interest
ω nuisance parameters
π prior density
17
Bayesian Inference – 2
Nuisance parameters are removed by marginalization:
p(θ | D) = ∫ p(θ, ω | D) dω
         = ∫ p(D | θ, ω) π(θ, ω) dω / p(D)
18
Bayesian Inference – 3
Bayes’ theorem can be used to compute the probability of a
model. First compute the posterior density:
p(θ_H, ω, H | D) = p(D | θ_H, ω, H) π(θ_H, ω, H) / p(D)
D observed data
H model or hypothesis
θH parameters of model H
ω nuisance parameters
π prior density
19
Bayesian Inference – 4
1. Factorize the priors: π(θ_H, ω, H) = π(θ_H, ω | H) π(H)
2. Marginalize over the parameters:
p(D | H) = ∫ p(D | θ_H, ω, H) π(θ_H, ω | H) dθ_H dω
3. Apply Bayes’ theorem:
p(H | D) = p(D | H) π(H) / Σ_H p(D | H) π(H)
20
Bayesian Inference – 5
In order to compute p(H |D), however, two things are needed:
1. Proper priors over the parameter spaces
∫ π(θ_H, ω | H) dθ_H dω = 1
2. The priors (H).
21
Example 1: W±W±jj Production
(ATLAS)
Step 1: Construct a probability model for the observations
P(D | s, b) = [e^−(s+b) (s + b)^D / D!] × [e^−kb (kb)^Q / Γ(Q + 1)]
and insert the data
D = 12 events
B = 3.0 ± 0.6 background events
Q = (B/δB)² = 25,  B = Q/k
k = B/δB² = 8.33,  δB = √Q / k
23
Example 1: W±W±jj Production
(ATLAS)
Step 2: Write down Bayes’ theorem:
24
Example 1: W±W±jj Production
(ATLAS)
The Prior: What do π(b | s) and π(s) represent?
25
Example 1: W±W±jj Production
(ATLAS)
For simplicity, we shall take π(b | s) = 1*.
26
Example 1: W±W±jj Production
(ATLAS)
L(s) = P(12 | s, H₁) is the
marginal likelihood for
the expected signal s.
27
Example 1: W±W±jj Production
(ATLAS)
Given the likelihood P(D | s, H₁), we can compute
p(D | H₁) = ∫₀^∞ P(D | s, H₁) π(s | H₁) ds
28
Example 1: W±W±jj Production
(ATLAS)
Assuming a flat prior for the signal, π(s | H₁) = 1, the
posterior density is given by
p(s | D, H₁) = P(D | s, H₁) / p(D | H₁),
which can be written as a sum over r = 0 … D of Beta-function terms.
30
Example 1: W±W±jj Production
(ATLAS)
As noted, the number
p(D | H₁) = ∫₀^∞ p(D | s, H₁) π(s | H₁) ds
is the evidence for H₁.
31
Example 1: W±W±jj Production
(ATLAS)
From
p(D | H₁) = 1.13 × 10⁻¹ and
p(D | H₀) = 2.23 × 10⁻⁴
we obtain the Bayes factor B₁₀ = p(D | H₁) / p(D | H₀) ≈ 507.
33
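From these two numbers the Bayes factor follows immediately (a trivial sketch; the choice of prior probabilities π(H₀), π(H₁) is a further assumption, here taken equal):

```python
pH1 = 1.13e-1     # p(D | H1), from the slide
pH0 = 2.23e-4     # p(D | H0), from the slide

B10 = pH1 / pH0                     # Bayes factor, about 507
posterior_H1 = B10 / (1.0 + B10)    # P(H1 | D) assuming pi(H0) = pi(H1) = 1/2
```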
Summary – 1
Probability
Two main interpretations:
1. Degree of belief
2. Relative frequency
Likelihood Function
Main ingredient in any full scale statistical analysis
Frequentist Principle
Construct statements such that a fraction f ≥ C.L. of them
will be true over a specified ensemble of statements.
34
Summary – 2
Frequentist Approach
1. Use likelihood function only.
2. Eliminate nuisance parameters by profiling.
3. Decide on a fixed threshold α for rejection and reject null
if p-value < α, but do so only if rejecting the null makes
scientific sense, e.g.: the probability of the alternative is
judged to be high enough.
Bayesian Approach
1. Model all uncertainty using probabilities and use Bayes’
theorem to make all inferences.
2. Eliminate nuisance parameters through marginalization.
35
The End
36
Likelihoods
1) Introduction .
2) Do’s & Dont’s
What it is
How it works: Resonance
Uncertainty estimates
Detailed example: Lifetime
Several Parameters
Extended maximum L
Start with pdf = prob density fn for data, given param values:
y = N (1 + β cos²θ)
yᵢ = N (1 + β cos²θᵢ)
   = probability density of observing θᵢ, given β
L(β) = Πᵢ yᵢ
     = probability density of observing the data set {θᵢ}, given β
Best estimate of β is that which maximises L
Values of β for which L is very small are ruled out
Precision of estimate for β comes from width of L distribution
[Plots: y versus cos θ for, e.g., β = −1 (large L) and other values of β]
3
How it works: Resonance
First write down the pdf (Breit-Wigner):
y ~ (Γ/2) / [(m − M₀)² + (Γ/2)²]
[Plots: y versus m as M₀ is varied, and as Γ is varied]
N.B. Can make use of individual events
4
Conventional to consider
l = ln L = Σᵢ ln yᵢ
Better numerically, and
has some nice properties
5
Maximum likelihood uncertainty
Range of likely values of param μ from width of L or l dists.
If L(μ) is Gaussian, following definitions of σ are equivalent:
1) RMS of L(µ)
2) 1/√(-d2lnL / dµ2) (Mnemonic)
3) ln L(μ₀ ± σ) = ln L(μ₀) − 1/2
If L(μ) is non-Gaussian, these are no longer the same
7
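Definition 3) is easy to try on a toy exponential-lifetime fit, where ln L(τ) = −n ln τ − n t̄/τ and the MLE is τ̂ = t̄ (an illustrative sketch; the data are invented):

```python
from math import log

ts = [1.2, 0.4, 2.5, 0.7, 1.9, 0.3, 1.1, 0.8]   # toy decay times
n = len(ts)
tbar = sum(ts) / n                               # MLE of the lifetime

def lnL(tau):
    """ln L(tau) for exponential decay times, up to constants."""
    return -n * log(tau) - n * tbar / tau

# scan for the two points where ln L has dropped by 0.5 from its maximum
target = lnL(tbar) - 0.5
grid = [tbar * (0.5 + 0.001 * i) for i in range(1501)]
inside = [t for t in grid if lnL(t) >= target]
lo, hi = min(inside), max(inside)                # the interval [lo, hi]
```

Because L(τ) is non-Gaussian here, the interval is asymmetric: the upward uncertainty exceeds the downward one.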
8
Several Parameters
PROFILE L
L_prof(β) = L(β, ν_best(β)), where
β = param of interest
ν = nuisance param(s)
Uncertainty on β from
decrease in ln(L_prof) by 0.5
9
Profile L
[Plot: −2lnL versus s for different values of the nuisance param υ]
11
Extended Maximum Likelihood
12
ML and EML
13
Relation between Poisson and Binomial
N people in lecture, m males and f females (N = m + f).
Assume these are representative of basic rates: ν people, νp males, ν(1 − p) females.
Probability of observing N people = P_Poisson = e^–ν ν^N / N!
Prob of given male/female division = P_Binom = [N! / (m! f!)] p^m (1 − p)^f
The product is P_Poisson × P_Binom = [e^–νp (νp)^m / m!] × [e^–ν(1−p) (ν(1−p))^f / f!],
i.e. two independent Poissons for males and females.
15
DO’S AND DONT’S WITH L
16
ΔlnL = -1/2 rule
If L(μ) is Gaussian, following definitions of σ are
equivalent:
1) RMS of L(µ)
2) 1/√(-d2lnL/dµ2)
3) ln L(μ₀ ± σ) = ln L(μ₀) − 1/2
If L(μ) is non-Gaussian, these are no longer the same
“Procedure 3) above still gives interval that contains the
true value of parameter μ with 68% probability”
[Plot: coverage C(µ) versus µ; hope for C = nominal value for all µ]
21
COVERAGE
If C(µ) = nominal for all µ: “correct coverage”
C(µ) < nominal for some µ: “undercoverage”
(this is serious!)
23
Neyman central intervals, NEVER undercover
(Conservative at both ends)
24
Feldman-Cousins Unified intervals
25
Unbinned Lmax and Goodness of Fit?
Not necessarily: in the pdf(data, params), the params are fixed and
the data vary; in the likelihood L(data, params), the data are fixed
and the params vary.
[Plots: pdf versus t at fixed λ; L versus λ for fixed data]
27
Example 1
Fit exponential to times t₁, t₂, t₃ ……. [Joel Heinrich, CDF 5639]
L = Πᵢ λ exp(−λtᵢ)
Same average t → same Lmax, whatever the distribution of the
individual times, so Lmax carries no goodness-of-fit information.
[Plot: pdf versus t]
28
Example 2
dN/d cos θ = (1 + β cos²θ) / (1 + β/3)
L(β) = Πᵢ (1 + β cos²θᵢ) / (1 + β/3)
[Plot: data versus cos θ]
29
Example of general principle
Lmax and Goodness of Fit?
Conclusion:
31
Binned data and Goodness of Fit using L-ratio
ni L=
p (µ )
i
P nnii (ii )
μi Lbest
p (µ
i
P n i (i ,best))
ni i,best
x
p (n )
i
Pni ni(n i )i
ln[L-ratio] = ln[L/Lbest]
32
Baker and Cousins, NIM A221 (1984) 437
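The Poisson L-ratio statistic of Baker and Cousins can be sketched in a few lines; for large bin contents it approaches the usual χ²:

```python
from math import log

def lratio_stat(n_obs, mu):
    """-2 ln(L/Lbest) summed over bins, with mu_best,i = n_i."""
    stat = 0.0
    for n, m in zip(n_obs, mu):
        stat += 2.0 * (m - n)
        if n > 0:
            stat += 2.0 * n * log(n / m)
    return stat
```

For a perfectly fitting model the statistic is zero, and for a single well-populated bin it is close to (n − µ)²/µ.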
Conclusions
40
Getting L wrong: Punzi effect
Giovanni Punzi @ PHYSTAT2003
“Comments on L fits with variable resolution”
Separate two close signals, when resolution σ varies event
by event, and is different for 2 signals
e.g. 1) Signal 1 ~ 1 + cos²θ; Signal 2 ~ isotropic;
and different parts of detector give different σ
2) M (or τ): different numbers of tracks → different σ_M (or σ_τ)
41
41
Events characterised by xi and σi
A events centred on x = 0
B events centred on x = 1
L(f)wrong = Π [f * G(xi,0,σi) + (1-f) * G(xi,1,σi)]
L(f)right = Π [f*p(xi,σi;A) + (1-f) * p(xi,σi;B)]
43
Explanation of Punzi bias
A events with σ_A = 1, centred on x = 0
B events with σ_B = 2, centred on x = 1
[Plots: actual distribution versus fitting function]
[N_A/N_B variable, but same for A and B events]
Fit gives upward bias for N_A/N_B because (i) that is much better
for A events; and (ii) it does not hurt too much for B events.
44
Another scenario for Punzi problem: PID
[Plots: mass M and TOF distributions for the π and K hypotheses]
Originally: positions of peaks = constant; K-peak → π-peak at large momentum
Where else??
MORAL: Beware of event-by-event variables whose pdf’s do not appear in L.
45
Avoiding Punzi Bias
BASIC RULE:
Write pdf for ALL observables, in terms of parameters
1
Least squares best fit
What is σ?
Resume of straight line
Correlated errors
Goodness of fit with χ2
Number of Degrees of Freedom
Other G of F methods
Errors of first and second kind
Combinations
THE paradox 2
Least Squares Straight Line Fitting
S ~ χ2 ? More plausible
(For â ~ai, and both much larger than σi, 2 methods are very similar) 5
Straight Line Fit
(Fixed σi )
[Plot: straight-line fit y = a + b x to points (xᵢ, yᵢ)]
The second derivatives of S:
( ∂²S/∂a²   ∂²S/∂a∂b )        ( Σ1/σᵢ²    Σxᵢ/σᵢ²  )
( ∂²S/∂a∂b  ∂²S/∂b²  )  = 2 × ( Σxᵢ/σᵢ²   Σxᵢ²/σᵢ² )
8
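The normal equations implied by that matrix solve the straight-line fit in closed form. A compact sketch (y = a + b x, fixed σᵢ), with the covariance matrix obtained by inverting the weighted-sums matrix:

```python
def line_fit(xs, ys, sigmas):
    """Weighted least-squares fit of y = a + b*x; returns a, b, covariance."""
    w = [1.0 / s**2 for s in sigmas]
    S   = sum(w)
    Sx  = sum(wi * x for wi, x in zip(w, xs))
    Sy  = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = S * Sxx - Sx * Sx
    a = (Sxx * Sy - Sx * Sxy) / det
    b = (S * Sxy - Sx * Sy) / det
    cov = [[Sxx / det, -Sx / det],       # inverse of [[S, Sx], [Sx, Sxx]]
           [-Sx / det, S / det]]
    return a, b, cov
```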
Measurements with correlated errors e.g. systematics?
9
Comments on Least Squares method
1) Need to bin
Beware of too few events/bin
2) Extends to n dimensions
but needs lots of events for n larger than 2 or 3
3) No problem with correlated uncertainties
4) Can calculate Smin “on line”, i.e. single pass through data:
at the minimum, Σ (yᵢ – a – bxᵢ)²/σ² = [yᵢ²] – b[xᵢyᵢ] – a[yᵢ]
5) For theory linear in params, analytic solution
y
6) Goodness of Fit
x
12
Goodness of Fit: χ2 test
1) Construct S and minimise wrt free parameters
2) Determine ν = no. of degrees of freedom
ν=n–p
n = no. of data points
p = no. of FREE parameters
3) Look up probability that, for ν degrees of freedom, χ2 ≥ Smin
13
Properties of mathematical χ² distribution:
⟨χ²⟩ = ν
σ²(χ²) = 2ν
14
Cf: Area in tails of Gaussian 15
χ² with ν degrees of freedom?
ν = data – free parameters?
χ² is very general, but needs binning and is not sensitive
to the sign of the deviations. Alternatives:
Run Test
Kolmogorov-Smirnov
etc 17
Goodness of Fit:
Kolmogorov-Smirnov
Compares data and model cumulative plots
(or 2 sets of data)
Uses largest discrepancy between dists.
Model can be analytic or MC sample
18
Goodness of fit: ‘Energy’ test
Assign +ve charge to data ; -ve charge to M.C.
Calculate ‘electrostatic energy E’ of charges
If distributions agree, E ~ 0
If distributions don’t overlap, E is positive v2
Assess significance of magnitude of E by MC
N.B. v1
1) Works in many dimensions
2) Needs metric for each variable (make variances similar?)
3) E ~ Σ qiqj f(Δr = |ri – rj|) , f = 1/(Δr + ε) or –ln(Δr + ε)
Performance insensitive to choice of small ε
See Aslan and Zech’s paper at:
http://www.ippp.dur.ac.uk/Workshops/02/statistics/program.shtml
19
Binned data and Goodness of Fit using L-ratio
For histogram, uses Poisson prob P(n;µ) for n
ni observed events when expect µ.
20
Wrong Decisions
Error of First Kind
Reject H0 when true
Should happen x% of tests
1) Result of experiment
e.g Is spin of resonance = 2?
Get answer WRONG
Where to set cut?
Small cut: reject H₀ even when correct
Large cut: never reject anything
Depends on nature of H0 e.g.
Does answer agree with previous expt?
Is expt consistent with special relativity?
3) Track finding
22
Combining: Uncorrelated exptl results
Simple Example of Minimising S
N.B. Better to
combine data rather
than results
23
Difference between weighted and simple averaging
Total = 180 ± 30 K
Wtd average = 99 ± 5 K CONTRAST
Total = 198 ± 10 K
Extrapolation is sensible:
[Plot: extrapolating from measurements v₁, v₂ towards V_true]
Beware extrapolations because
BLUE (Best Linear Unbiased Estimate) ≡ χ²:
S(v_best) = Σᵢⱼ (vᵢ – v_best) E⁻¹ᵢⱼ (vⱼ – v_best), and minimise S wrt v_best
S_min distributed like χ², so measures Goodness of Fit
But BLUE gives weights for each vᵢ
Can be used to see contributions to σ_best from each source of
uncertainty, e.g. statistical and different systematics
29
Uncertainty on Ωdark energy
When combining pairs of
variables, the uncertainties on the
combined parameters can be
much smaller than any of the
individual uncertainties
e.g. Ωdark energy
30
THE PARADOX
Histogram with 100 bins
Fit with 1 parameter
Smin: χ2 with NDF = 99 (Expected χ2 = 99 ± 14)
33
KINEMATIC FITTING
Tests whether observed event is consistent
with specified reaction
34
Kinematic Fitting: Why do it?
35
Kinematic Fitting: Why do it?
1) Check whether event consistent with hypothesis [Goodness of Fit]
Use Smin and ndf
36
How do we perform kinematic fitting?
Observed event: 4 outgoing charged tracks
Assumed reaction: pp → ppπ⁺π⁻
Measured variables: 4-momenta of each track, vimeas
(i.e. 3-momenta & assumed mass)
Then test hypothesis:
Observed event = example of assumed reaction
37
‘KINEMATIC’ FITTING
Angles of triangle: θ₁ + θ₂ + θ₃ = 180
            θ₁   θ₂   θ₃
Measured    50   60   73   (each ±1)   Sum = 183
Fitted      49   59   72               180
χ² = (50−49)²/1² + (60−59)²/1² + (73−72)²/1² = 3
Prob{χ²₁ > 3} = 8.3%
ALTERNATIVELY:
Sum = 183 ± 1.7, while expect 180
Prob{Gaussian 2-tail area beyond 1.73σ} = 8.3%
38
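With equal uncertainties the triangle fit has a closed-form solution: spread the constraint violation equally over the three angles. A sketch reproducing the slide's numbers:

```python
meas = [50.0, 60.0, 73.0]      # measured angles, each +- 1 degree
sigma = 1.0
excess = sum(meas) - 180.0     # constraint violation: 3 degrees

# equal sigmas: subtract one third of the excess from each angle
fitted = [m - excess / 3.0 for m in meas]                           # [49, 59, 72]
chi2 = sum(((m - f) / sigma) ** 2 for m, f in zip(meas, fitted))    # 3.0
```

The fitted angles satisfy the constraint exactly, and χ² measures how much the measurements had to move.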
Toy example of Kinematic Fit
39
i.e. KINEMATIC FIT
REDUCED UNCERTAINTIES
40
BAYES and FREQUENTISM:
The Return of an Old Controversy
Louis Lyons and Lorenzo Moneta
Imperial College & Oxford University
CERN
3
It is possible to spend a lifetime
analysing data without realising that
there are two very different
fundamental approaches to statistics:
Bayesianism and Frequentism.
6
How can textbooks not even mention
Bayes / Frequentism?
8
WHAT IS PROBABILITY?
MATHEMATICAL
Formal
Based on Axioms
FREQUENTIST
Ratio of frequencies as n → infinity
Repeated “identical” trials
Not applicable to single event or physical constant
LEGAL PROBABILITY 9
Bayesian versus Classical
Bayesian
P(A and B) = P(A;B) x P(B) = P(B;A) x P(A)
e.g. A = event contains t quark
B = event contains W boson
or A = I am in Spanish Pyrenees
B = I am giving a lecture
P(A;B) = P(B;A) x P(A) /P(B)
Completely uncontroversial, provided…. 10
Bayes’ Theorem: P(A; B) = P(B; A) × P(A) / P(B)
14
Even more important for UPPER LIMITS
Mass-squared of neutrino
17
UL includes 0; LL excludes 0; Central usually excludes 0; Shortest is metric dependent
P (Data; Theory) ≠ P (Theory; Data)
HIGGS SEARCH at CERN
Is data consistent with Standard Model?
or with Standard Model + Higgs?
End of Sept 2000: Data not very consistent with S.M.
Prob (Data ; S.M.) < 1% valid frequentist statement
Turned by the press into: Prob (S.M. ; Data) < 1%
and therefore Prob (Higgs ; Data) > 99%
i.e. “It is almost certain that the Higgs has been seen”
19
P (Data; Theory) ≠ P (Theory; Data)
P (pregnant ; female) ~ 3%
21
P (Data; Theory) ≠ P (Theory; Data)
P (pregnant ; female) ~ 3%
but
P (female ; pregnant) >>>3%
22
Peasant and Dog
River x =0 River x =1 km
25
Given that: a) Dog d has 50% probability of
being within 100 m of Peasant p,
is it true that: b) Peasant p has 50% probability of
being within 100m of Dog d ?
Additional information
• Rivers at zero & 1 km. Peasant cannot cross them.
0 h 1 km
• Dog can swim across river - Statement a) still true
28
Classical (Neyman) Confidence Intervals
[Plot: confidence belt of theoretical parameter µ versus observation x]
Frequentist: µl and µu known, but random;
µtrue unknown, but fixed.
Probability statement about µl and µu.
Bayesian: µl and µu known, and fixed;
µtrue unknown, and random.
Probability statement about µtrue.
* What it is:
For given statistical method applied to many sets of data to extract
confidence intervals for param µ, coverage C is fraction of ranges that
contain true value of param. Can vary with µ
35
Frequentist central intervals NEVER undercover
(Conservative at both ends)
36
Feldman-Cousins Unified intervals
37
FELDMAN - COUSINS
Wants to avoid empty classical intervals
No ‘Flip-Flop’ problem
39
.
Feldman-Cousins
90% Confidence
Interval for
Gaussian
46
Taking Systematics into account
48
Reminder of PROFILE L
[Plot: −2lnL versus s, profiled over the nuisance param υ]
50
Bayesian versus Frequentism
                     Bayesian                   Frequentist
Basis of method:     Bayes Theorem;             Uses pdf for data,
                     posterior probability      for fixed parameters
                     distribution
Meaning of           Degree of belief           Frequentist definition
probability:
Prob of              Yes                        Anathema
parameters?
Needs prior?         Yes                        No
Choice of            Yes                        Yes (except F+C)
interval?
Data                 Only data you have         …+ other possible data
considered:
Likelihood           Yes                        No
principle?                                                    53
Bayesian versus Frequentism
                     Bayesian                   Frequentist
Ensemble of          No                         Yes (but often not
experiments?                                    explicit)
55
Approach used at LHC
56
Goodness of Fit:
Kolmogorov-Smirnov
Compares data and model cumulative plots
(or 2 sets of data)
Uses largest discrepancy between dists.
Model can be analytic or MC sample
57
Is there evidence for a peak in this
data?
1
CERN Academic Training Course Dec 2016
Theme: Using data to make judgements about H1 (New Physics) versus
H0 (S.M. with nothing new)
Why?
Experiments are expensive and time-consuming
so
Worth investing effort in statistical analysis
better information from data
Topics:
p-values
What they mean
Combining p-values
Significance
Blind Analysis
LEE = Look Elsewhere Effect
Why 5σ for discovery?
Wilks’ Theorem
Background Systematics
p0 v p1 plots
Higgs search: Discovery, mass and spin
Conclusions
5
Examples of Hypotheses
1) Event selector (Event = particle interaction)
Events produced at CERN LHC at enormous rate
Online ‘trigger’ to select events for recording (~1 kiloHertz)
e.g. events with many particles
Offline selection based on required features
e.g. H0: Event contains top H1: No top
Possible outcomes: Events assigned as H0 or H1
2) Result of experiment
e.g. H0 = nothing new
H1 = new particle produced as well
(Higgs, SUSY, 4th neutrino,…..)
Possible outcomes:     H₀    H₁
Exclude H₁             ✗
Discovery                    ✗
No decision            ✗     ✗
WRONG DECISIONS
E1: Reject H0 when H0 true (Loss of effic in 1))
E2: Fail to reject H0 when H1 true (Contamination)
7
H0 or H0 versus H1 ?
H0 = null hypothesis
e.g. Standard Model, with nothing new
H1 = specific New Physics e.g. Higgs with MH = 125 GeV
H0: “Goodness of Fit” e.g. χ2, p-values
H0 v H1: “Hypothesis Testing” e.g. L-ratio
Measures how much data favours one hypothesis wrt other
8
Choosing between 2 hypotheses
Possible methods:
Δχ2
p-value of statistic
lnL–ratio
Bayesian:
Posterior odds
Bayes factor
Bayes information criterion (BIC)
Akaike …….. (AIC)
Minimise “cost”
[Plot: distributions of the test statistic t under H₀ and H₁; p₀ and p₁ are the tail areas beyond t_obs]
10
p-values
Concept of pdf:
y = probability density for measurement x,
e.g. Gaussian: y = 1/(√(2π)σ) exp{−(x − μ)²/(2σ²)}
p-value: probability that x ≥ x₀.
Gives probability of “extreme” values of data (in interesting direction).
(x₀−μ)/σ    1      2      3       4        5
p           16%    2.3%   0.13%   0.003%   0.3×10⁻⁶
Assumes:
Specific pdf for x (e.g. Gaussian, no long tails)
Data is unbiassed
σ is correct
If so, and x is from that pdf → uniform p-distribution in [0, 1]
12
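The table can be reproduced with the Gaussian one-sided tail integral:

```python
from math import erfc, sqrt

def p_value(nsigma):
    """One-sided Gaussian tail probability beyond nsigma."""
    return 0.5 * erfc(nsigma / sqrt(2.0))
```

p_value(1) ≈ 0.16, p_value(3) ≈ 1.3 × 10⁻³ and p_value(5) ≈ 2.9 × 10⁻⁷, matching the table.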
p-values for non-Gaussian distributions
[Plot: Poisson probabilities P(n) for b = 2.9, n = 0 … 10]
For n = 7, p = Prob(at least 7 events) = P(7) + P(8) + P(9) + … = 0.03
13
Significance
CONCLUSION:
Calculate p properly (and allow for LEE if necessary)
14
p-values and σ
15
What p-values are (and are not)
[Plot: pdfs of the test statistic t under H₀ and H₁; tcrit divides them, with
tail areas α = p₀ (beyond tcrit under H₀) and β (below tcrit under H₁)]
1) Misunderstood. So ban relativity, matrices…?
2) Incorrect statements
Procedure: Obtain expected distributions of the data statistic (e.g. L-ratio) for H₀ and H₁.
Choose α (e.g. 95%, 3σ, 5σ?) and the CL for p₁ (e.g. 95%).
Given b, α determines tcrit.
b+s defines β. For s > smin, separation of curves → discovery or exclusion.
1−β = power of the test.
Now data: If tobs ≥ tcrit (i.e. p₀ ≤ α), discovery at level α.
If tobs < tcrit, no discovery. If p₁ < 1 – CL, exclude H₁ (or use CLs = p₁/(1 − p₀)).
19
For the event selector, 1−α = efficiency for signal events; β = mis-ID prob for other events.
BLIND ANALYSES
Why blind analysis? Data statistic, selections, corrections, method
Methods of blinding
Add random number to result *
Study procedure with simulation only
Look at only first fraction of data
Keep the signal box closed
Keep MC parameters hidden
Keep unknown fraction visible for each bin
Disadvantages
Takes longer time
Usually not available for searches for unknown
INFORMAL CONSENSUS:
Quote local p, and global p according to a) above.
Explain which global p 21
Example of LEE: Stonehenge
22
23
Are alignments significant?
• Atkinson replied with his article "Moonshine on Stonehenge"
in Antiquity in 1966, pointing out that some of the pits which ….. had used
for his sight lines were more likely to have been natural depressions, and
that he had allowed a margin of error of up to 2 degrees in his alignments.
Atkinson found that the probability of so many alignments being visible
from 165 points to be close to 0.5 rather than the "one in a million"
possibility which ….. had claimed.
• ….. had been examining stone circles since the 1950s in search of
astronomical alignments and the megalithic yard. It was not until 1973
that he turned his attention to Stonehenge. He chose to ignore alignments
between features within the monument, considering them to be too close
together to be reliable. He looked for landscape features that could have
marked lunar and solar events. However, one of …..'s key sites, Peter's
Mound, turned out to be a twentieth-century rubbish dump.
24
Why 5σ for Discovery?
Statisticians ridicule our belief in extreme tails (esp. for systematics)
Our reasons:
1) Past history (Many 3σ and 4σ effects have gone away)
2) LEE
3) Worries about underestimated systematics
4) Subconscious Bayes calculation
p(H₁|x) / p(H₀|x) = [p(x|H₁) / p(x|H₀)] × [π(H₁) / π(H₀)]
Posterior prob ratio = Likelihood ratio × Prior ratio
“Extraordinary claims require extraordinary evidence”
1) H₀ = polynomial of degree 3
H₁ = polynomial of degree 5
YES: ΔS distributed as χ² with ndf = (d−4) − (d−6) = 2
2) H₀ = background only
H₁ = bgd + peak with free M₀ and cross-section
NO: H₀ and H₁ nested, but M₀ undefined when H₁ → H₀. ΔS not χ²₂
(but not too serious for fixed M)
N.B. 1: Even when Wilks’ Theorem does not apply, it does not mean that ΔS
is irrelevant, but you cannot use Wilks’ Theorem for its expected distribution.
N.B. 2: For large ndf, better to use ΔS, rather than S₁ and S₀ separately
Is difference in S distributed as χ²?
Demortier:
H₀ = quadratic bgd
H₁ = quadratic bgd + Gaussian of fixed width,
variable location & amplitude
[Plot of the ΔS distribution] What is the peak at zero?
Why not half the entries?
29
Is difference in S distributed as χ2 ?, contd.
N.B.
31
Background systematics, contd
Signif from comparing χ2’s for H0 (bgd only) and for H1 (bgd + signal)
Typically, bgd = functional form fa with free params
e.g. 4th order polynomial
Uncertainties in params included in signif calculation
But what if functional form is different ? e.g. fb
Typical approach:
If fb best fit is bad, not relevant for systematics
If fb best fit is ~comparable to fa fit, include contribution to systematics
But what is ‘~comparable’?
Other approaches:
Profile likelihood over different bgd parametric forms
http://arxiv.org/pdf/1408.6865v1.pdf?
Background subtraction
sPlots
Non-parametric background
Bayes
etc
33
Reminder of Profile L
υ
s
35
Red curve: Best value of nuisance param υ
Blue curves: Other values of υ
Horizontal line: intersection with red curve
→ statistical uncertainty
36
Point of controversy!
Two types of ‘other functions’:
a) Different function types e.g.
Σai xi versus Σai/xi
b) Given fn form but different number of terms
DDKW deal with b) by −2lnL → −2lnL + kn
n = number of extra free params wrt best
k = 1, as in AIC (= Akaike Information Criterion)
41
CLs = p₁/(1 − p₀): the diagonal line in the p₀–p₁ plot.
Provides protection against excluding H₁ when there is little or no sensitivity
(small Δµ: H₀ and H₁ nearly indistinguishable).
β = value of p₁ when p₀ = α
= prob of not rejecting H₀ when H₁ true = E2
= mis-ID of ‘no top’ events
Power = prob of rejecting H₀ when H₁ true = 1 − β
Contamination in signal sample depends on β, and relative frequencies for
H₀ and H₁ events.
HEP experiments:
If the UL on the expected rate for the new particle is below the
predicted rate, exclude the particle.
Do this as a function of M_X → excluded mass range below M_e.
[Plot: predicted σ versus M_X; the measured UL crosses the prediction
at M_e; compare with the expected limit from the expt’s sensitivity]
CERN CLW (Jan 2000)
FNAL CLW (March 2000)
Heinrich, PHYSTAT-LHC, “Review of Banff Challenge”
46
Methods (no systematics)
Bayes (needs priors e.g. const, 1/μ, 1/√μ, μ, …..)
Frequentist (needs ordering rule,
possible empty intervals, F-C)
CLs
Likelihood (DON’T integrate your L)
χ2 (σ2 = μ)
χ2 (σ2 = n)
48
Search for Higgs:
H → γγ: low S/B, high statistics
55
H → ZZ → 4ℓ: high S/B, low statistics
56
p-value for ‘No Higgs’ versus mH
57
Mass of Higgs:
Likelihood versus mass
58
Comparing 0+ versus 0- for Higgs
(like Neutrino Mass Hierarchy)
http://cms.web.cern.ch/news/highlights-cms-results-presented-hcp
59
Conclusions
Resources:
Software exists: e.g. RooStats
Books exist: Barlow, Cowan, James, Lista, Lyons, Roe,…..
New: `Data Analysis in HEP: A Practical Guide to
Statistical Methods’ , Behnke et al.
PDG sections on Prob, Statistics, Monte Carlo
CMS and ATLAS have Statistics Committees (and BaBar and CDF
earlier) – see their websites
“Good luck” 60