LHC Physicists' Statistics Guide

The document provides an overview of descriptive statistics, probability, and common probability distributions relevant for physics analyses at the LHC. It defines key descriptive statistics like mean, variance, and quantiles. It also covers probability, conditional probability, Bayes' theorem, and common discrete and continuous probability distributions like the binomial, Poisson, uniform, Gaussian, log-normal, and chi-squared distributions. Examples are given throughout to illustrate key concepts for analyzing experimental data in particle physics.


Practical Statistics for LHC Physicists

Descriptive Statistics, Probability and Likelihood

Harrison B. Prosper
Florida State University

CERN Academic Training Lectures


7 April, 2015
Outline

• Lecture 1
  Descriptive Statistics
  Probability & Likelihood

• Lecture 2
  Frequentist Inference

• Lecture 3
  Bayesian Inference

2
Descriptive Statistics
Descriptive Statistics: Samples
Definition: A statistic is any function of the data,
x = x1, x2, …, xn. Here are some simple examples:

the sample average      x̄ = (1/n) ∑_{i=1}^{n} x_i

the sample moments      m_r = (1/n) ∑_{i=1}^{n} x_i^r

and the sample variance S² = (1/n) ∑_{i=1}^{n} (x_i − x̄)²
4
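These definitions translate directly into code. A minimal Python sketch (mine, not from the lecture), using the 1/n convention above:

```python
def sample_stats(x):
    """Sample average, second sample moment, and sample variance (1/n convention)."""
    n = len(x)
    mean = sum(x) / n                            # the sample average
    m2 = sum(xi**2 for xi in x) / n              # the second sample moment m_2
    S2 = sum((xi - mean)**2 for xi in x) / n     # the sample variance S^2
    return mean, m2, S2

mean, m2, S2 = sample_stats([1.0, 2.0, 3.0, 4.0])
# note the identity S^2 = m_2 - mean^2
```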
Descriptive Statistics: Samples
It is often useful to order the data so that

x(1) < x(2) < … < x(n)

x(k) is called the kth order statistic.

x(k) is also the α-quantile, where α = k / n.

If α = 0.5, then x(k) is called the median.

All of these quantities, and many more, can be computed
because the sample is known.
5
Descriptive Statistics: Populations
Now consider an infinitely large sample, called a population.

This is clearly an abstraction, which exists only in the sense
that the set of real numbers exists.

Like many abstractions, however, we can study this one
mathematically.

But, since all we have is a sample, we need a way to connect
it to its associated population. One goal of a theory of
statistical inference is to use a sample to say something
about its associated population.
6
Descriptive Statistics: Populations
Expected value     E[x]

Mean (true value)  μ

Error              ε = x − μ

Mean square error  MSE = E[ε²]

Bias               b = E[x] − μ

Variance           V[x] = E[(x − E[x])²]
7
Descriptive Statistics – 3

MSE = E[ε²] = V + b²        Exercise 1: Show this

The MSE is the most widely used measure of how close an
ensemble of statistics {x} is to the mean (or true value) μ.

The root mean square (RMS) is

RMS = √MSE
8
Descriptive Statistics – 4
Consider the expected value of the sample variance

E[S²] = E[(1/n) ∑_{i=1}^{n} (x_i − x̄)²]

      = E[(1/n) ∑ x_i² − (2/n) ∑ x_i x̄ + x̄²]

      = (1/n) ∑ E[x_i²] − E[x̄²]

      = E[x²] − E[x̄²]
9
Descriptive Statistics – 5
The expected value of the sample variance (as we have
defined it) is biased:

E[S²] = E[x²] − E[x̄²]

      = E[x²] − (1/n) E[x²] − ((n−1)/n) E[x]²

      = ((n−1)/n) V[x]

The bias is −V / n.        Exercise 2: Show this
10
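A quick Monte Carlo check of this result (my own sketch; the population here is Uniform(0, 1), whose variance is V = 1/12):

```python
import random

def sample_var(x):
    """S^2 with the 1/n convention used in the lecture."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar)**2 for xi in x) / n

rng = random.Random(42)
n, trials = 5, 200_000
V = 1.0 / 12.0                                   # variance of Uniform(0, 1)
mean_S2 = sum(sample_var([rng.random() for _ in range(n)])
              for _ in range(trials)) / trials
# expect E[S^2] = (n-1)/n * V, i.e. a bias of -V/n
```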
Probability
Probability – 1
Objects
1. Sample space: the set S of outcomes of an experiment
2. Event: a subset E of S*
3. Function: P associates a real number with each event E

Rules (Kolmogorov axioms)
1. P(E) ≥ 0
2. P(S) = 1
3. P(E1 + E2 + …) = P(E1) + P(E2) + …  provided Ei Ej = Ø for i ≠ j

together with the rules of Boolean algebra.

* With a technical restriction on the collection of subsets of S
12
Probability – 2
By definition, the conditional probability of A given B is

P(A | B) = P(AB) / P(B)

P(B) is the probability of B without the restriction imposed
by A. P(A | B) is the probability of A when we restrict to the
conditions under which B is true.

[Venn diagram: events A and B with overlap AB]
13
Probability – 3
A and B are mutually exclusive if

P(AB) = 0

A and B are exhaustive if

P(A) + P(B) = 1

Theorem

P(A ∪ B) = P(A) + P(B) − P(AB)

Exercise 3: Prove the theorem
14
Probability – 4
By definition:   P(AB) = P(A | B) P(B)
                 P(BA) = P(B | A) P(A)

But, since AND commutes, i.e., AB = BA, we immediately
deduce Bayes' theorem:

P(B | A) = P(A | B) P(B) / P(A)
15
Bayes Theorem: Are You Doomed?
Diagnostic Example (Michael Goldstein)
You are diseased (event D) or healthy (event H).
A test result is either positive (event +) or negative (event –).
Let P(+ | D) = 0.99 and P(+ | H) = 0.01.
Your test result is positive. Are you doomed? It all depends…

Suppose the incidence of the disease is 1 in a 1000, i.e.,
P(D) = 0.001. Then Bayes' theorem yields

P(D | +) = P(+ | D) P(D) / P(+)
         = P(+ | D) P(D) / [P(+ | D) P(D) + P(+ | H) P(H)]
         ≈ 0.09
16
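The arithmetic of this example is easy to check (a sketch; the numbers are those quoted above):

```python
p_pos_given_D, p_pos_given_H = 0.99, 0.01   # P(+|D), P(+|H)
p_D = 0.001                                 # incidence: 1 in 1000
p_H = 1.0 - p_D

# Bayes' theorem, with P(+) expanded over the exhaustive events D and H
p_D_given_pos = (p_pos_given_D * p_D
                 / (p_pos_given_D * p_D + p_pos_given_H * p_H))
```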
Probability: Some Definitions
Suppose we have some function f(x), for example,
f(x) = x, or
f(x) = (x – μ)²;
then its expected value is the functional

E[f] = ∑_i f(x_i) P(x_i)

If x is continuous, this becomes

E[f] = ∫ f(x) dP(x) = ∫ f(x) p(x) dx

p(x) = dP / dx is called a probability density function (pdf).
17
Probability: Some Definitions
Suppose we have potential observations (random variables) x
and y; then their covariance is the functional

Cov[f, g] = ∫∫ f(x) g(y) p(x, y) dx dy

where
f(x) = x – E[x] and g(y) = y – E[y], and
p(x, y) is the joint probability density of x and y.

If we can write p(x, y) = p(x) p(y), then x and y are said to be
statistically independent, in which case Cov[f, g] = 0. But
note that, in general, Cov[f, g] = 0 does not imply statistical
independence.
18
Probability: What Exactly Is It?
There are at least two interpretations of probability:

1. Degree of belief in, or assigned to, a proposition, e.g.,


“A tsunami will flood Geneva tomorrow”

2. Relative frequency of outcomes in an infinite


sequence of trials, e.g.,
proton-proton collisions at the LHC with
outcome the creation of Higgs bosons.

19
Binomial & Poisson Distributions
Binomial & Poisson Distributions – 1
A Bernoulli trial has two outcomes:
S = success or F = failure.

Example: Each collision between protons at the LHC is a


Bernoulli trial in which either something interesting
happens (S) or does not (F).

21
Binomial & Poisson Distributions – 2
Let p be the probability of a success, which is assumed to be
the same at each trial. Since S and F are exhaustive, the
probability of a failure is 1 – p.
For a given order O of n trials, the probability P(k, O, n) of
exactly k successes and n – k failures is

P(k, O, n) = p^k (1 – p)^(n–k)
22
Binomial & Poisson Distributions – 3
If the order O of successes and failures is assumed to be
irrelevant, we can eliminate the order from the problem by
summing over all possible orders:

P(k, n) = ∑_O P(k, O, n) = ∑_O p^k (1 – p)^(n–k)

This yields the binomial distribution

P(k, n) = Binomial(k, n, p) = C(n, k) p^k (1 – p)^(n–k)

where C(n, k) = n! / [k! (n – k)!] is the binomial coefficient.
23
Binomial & Poisson Distributions – 4
We can prove that the mean number of successes a is
a = p n.        Exercise 4: Prove it

Suppose that the probability, p, of a success is very small;
then, in the limit p → 0 and n → ∞, such that a is constant,

Binomial(k, n, p) → Poisson(k, a)

The Poisson distribution is generally regarded as a good
model for a counting experiment.

Exercise 5: Show that Binomial(k, n, p) → Poisson(k, a)
24
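Exercise 5 can be checked numerically (a sketch; n = 10,000 and a = 3 are my choices):

```python
import math

def binomial(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson(k, a):
    return a**k * math.exp(-a) / math.factorial(k)

a, n = 3.0, 10_000
# with p = a/n, so that a = pn is held constant, the two pmfs converge
max_diff = max(abs(binomial(k, n, a / n) - poisson(k, a)) for k in range(20))
```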
Common Densities and Distributions

Uniform(x, a)        = 1 / a
Gaussian(x, μ, σ)    = exp[–(x – μ)² / (2σ²)] / (σ√(2π))
LogNormal(x, μ, σ)   = exp[–(ln x – μ)² / (2σ²)] / (xσ√(2π))
Chisq(x, n)          = x^(n/2 – 1) exp(–x/2) / [2^(n/2) Γ(n/2)]
Gamma(x, a, b)       = x^(b–1) a^b exp(–ax) / Γ(b)
Exp(x, a)            = a exp(–ax)

Binomial(k, n, p)    = C(n, k) p^k (1 – p)^(n–k)
Poisson(k, a)        = a^k exp(–a) / k!
Multinomial(k, n, p) = [n! / (k_1! ⋯ k_K!)] ∏_{i=1}^{K} p_i^{k_i},
                       ∑_{i=1}^{K} p_i = 1,   ∑_{i=1}^{K} k_i = n
25
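As a sanity check on the table, the continuous densities should integrate to one. A sketch verifying two of them with a crude midpoint rule (the integration ranges are my choices):

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def chisq(x, n):
    return x**(n / 2 - 1) * math.exp(-x / 2) / (2**(n / 2) * math.gamma(n / 2))

def integral(f, lo, hi, steps=200_000):
    """Midpoint-rule integral of f on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

norm_gauss = integral(lambda x: gaussian(x, 0.0, 1.0), -10.0, 10.0)
norm_chisq = integral(lambda x: chisq(x, 3), 1e-9, 60.0)
# both should be very close to 1
```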
Likelihood
Likelihood – 1
The likelihood function is simply the probability, or
probability density function (pdf), evaluated at the observed
data.

Example 1: Evidence for electroweak production of W±W±jj
(ATLAS, PRL 113, 141803 (2014))

p(D | d) = Poisson(D | d)     probability to observe a count D
p(12 | d) = Poisson(12 | d)   likelihood of observation D = 12

where d = E[D] is the expected count.
27
Likelihood – 2
Example 2:
(CMS, Phys. Rev. D 87, 052017 (2013))

Observed counts D_i:

p(D | p) = Multinomial(D, N, p)

D = D_1, …, D_K,   p = p_1, …, p_K

∑_{i=1}^{K} D_i = N,   ∑_{i=1}^{K} p_i = 1

This is an example of a binned likelihood.
28
Likelihood – 3
Example 3: (Union2.1 Compilation, SCP)
Redshift and distance modulus measurements of
N = 580 Type Ia supernovae:

p(D | Ω_M, Ω_Λ, Q) = ∏_{i=1}^{N} Gaussian(x_i, μ(z_i, Ω_M, Ω_Λ, Q), σ_i)

D = { z_i, x_i ± σ_i }

This is an example of an un-binned likelihood for
heteroscedastic data.
29
Likelihood – 4
Example 4: Higgs to γγ (CMS & ATLAS, 2012 – 15)
The analyses of the di-photon final states use an un-binned
likelihood of the form

p(x | s, m, w, b) = exp[–(s + b)] ∏_{i=1}^{N} [ s f_s(x_i, m, w) + b f_b(x_i) ]

where x   = measured di-photon masses
      m   = mass of particle
      w   = expected width
      s   = expected signal
      b   = expected background
      f_s = signal model
      f_b = background model

Exercise 6: Show that a binned multi-Poisson likelihood
yields an un-binned likelihood of this form as the bin
widths go to zero.
30
Likelihood – 5
Given the likelihood function, we can answer several
questions, including:

1. How do I estimate a parameter?
2. How do I quantify its accuracy?
3. How do I test a hypothesis?
4. How do I quantify the significance of a result?

Writing down the likelihood function requires:

1. Identifying all that is known, e.g., the observations
2. Identifying all that is unknown, e.g., the parameters
3. Constructing a probability model for both
31
Example 1: W±W±jj Production
(ATLAS)
Evidence for electroweak production of W±W±jj (2014)
PRL 113, 141803 (2014)

knowns:
D = 12 observed events (μ±μ± mode)
B = 3.0 ± 0.6 background events

unknowns:
b          expected background count
s          expected signal count
d = b + s  expected event count

Note: we are uncertain about the unknowns, so 12 ± 3.5 is a
statement about d, not about the observed count 12!
32
Example 1: W±W±jj Production
(ATLAS)
Probability:

p(D | s, b) = Poisson(D, s + b) Poisson(Q, bq)
            = [(s + b)^D e^(–(s+b)) / D!] × [(bq)^Q e^(–bq) / Γ(Q + 1)]

Likelihood:

p(12 | s, b)

where the background estimate B ± δB is modeled as an
effective count Q with scale factor q:

B = Q / q          Q = (B / δB)² = (3.0 / 0.6)² = 25.0
δB = √Q / q        q = B / δB² = 3.0 / 0.6² = 8.33
33
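The effective-count arithmetic is worth a one-liner (a sketch of the numbers quoted above):

```python
B, dB = 3.0, 0.6        # background estimate B ± δB
Q = (B / dB)**2         # effective count, 25.0
q = B / dB**2           # scale factor, 8.33

# consistency checks: B = Q/q and δB = √Q / q
```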
Example 4: Higgs to γγ (CMS)
Eur. Phys. J. C74 (2014) 3076

background model:  f_b(x, a), where x is the di-photon mass

signal model:      f_s(x | m, w)

p(x | s, m, w, b, a) = exp[–(s + b)] ∏_{i=1}^{N} [ s f_s(x_i, m, w) + b f_b(x_i, a) ]
34
Example 4: Higgs to γγ (CMS)
Tomorrow, we shall study a toy version of the likelihood:

background model:
f_b(x, a) = A exp[–(a₁x + a₂x²)]

signal model:
f_s(x, m, w) = Gaussian(x, m, w)

p(x | s, m, w, b, a) = exp[–(s + b)] ∏_{i=1}^{N} [ s f_s(x_i, m, w) + b f_b(x_i, a) ]
35
Summary
Statistic
A statistic is any calculable function of potential observations

Probability
Probability is an abstraction that must be interpreted

Likelihood
The likelihood is the probability (or probability density)
of potential observations evaluated at the observed data

36
Practical Statistics for LHC Physicists
Frequentist Inference

Harrison B. Prosper
Florida State University

CERN Academic Training Lectures


8 April, 2015
Outline

• The Frequentist Principle

• Confidence Intervals

• The Profile Likelihood

• Hypothesis Tests

2
The Frequentist Principle
The Frequentist Principle
The Frequentist Principle (FP) (Neyman, 1937)

Construct statements such that a fraction f ≥ p of them are
true over an ensemble of statements.

The fraction f is called the coverage probability (or coverage
for short) and p is called the confidence level (C.L.).

An ensemble of statements that obeys the FP is said to cover.

4
The Frequentist Principle
Points to Note:
1. The frequentist principle applies to real ensembles, not
just the ones we simulate on a computer. Moreover, the
statements need not all be about the same quantity.

Example: all published measurements x, since 1897, of


the form l(x) ≤ θ ≤ u(x), where θ is the true value.

2. Coverage f is an objective characteristic of ensembles of


statements. However, in order to verify whether an
ensemble of statements covers, we need to know which
statements are true and which ones are false. Alas, in the
real world, we are typically not privy to this knowledge.
5
The Frequentist Principle
Example
Consider an ensemble of different experiments, each with a
different mean count θ, and each yielding a count N. Each
experiment makes a single statement of the form
N + √N > θ,
which is either true or false.

Obviously, some fraction of these statements are true.

But if we don’t know which ones, we have no operational


way to compute the coverage.

6
The Frequentist Principle
Example continued
Suppose each mean count θ is randomly sampled from
Uniform(θ, 5), and suppose we know these numbers.

Now, of course, we can compute the coverage probability f ,


i.e., the fraction of true statements.

Exercise 7:
Show that the coverage f is 0.62

7
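A Monte Carlo estimate of the coverage in Exercise 7 (my own sketch; Poisson sampling is done by CDF inversion):

```python
import math, random

rng = random.Random(1)

def poisson_sample(theta):
    """Sample N ~ Poisson(theta) by inverting the CDF."""
    u, k = rng.random(), 0
    p = math.exp(-theta)
    c = p
    while u > c:
        k += 1
        p *= theta / k
        c += p
    return k

trials, covered = 100_000, 0
for _ in range(trials):
    theta = 5.0 * rng.random()            # theta ~ Uniform(0, 5)
    N = poisson_sample(theta)
    if N + math.sqrt(N) > theta:          # the statement whose coverage we want
        covered += 1
f = covered / trials
# expect f close to the quoted 0.62
```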
Confidence Intervals

8
Confidence Intervals – 1
Consider an experiment that observes N events with expected
count s.

Neyman (1937) devised a way to make statements of the form


s [l(N), u(N)]

such that a fraction f ≥ p of them are true. Note, again, that


the expected count s may not be, and extremely unlikely to be,
exactly the same for every experiment.

Neyman’s brilliant invention is called a Neyman construction.

9
Confidence Intervals – 2
Suppose we know s. We could then find a region R in the
sample space with probability f ≥ p = confidence level (C.L.):

f = ∑_{D ∈ R} P(D | s)

[Figure: parameter space (vertical) vs. sample space
(horizontal); for a given s, a region of counts D with left
tail probability α_L, central probability f, and right tail
probability α_R, where α_L + f + α_R = 1]
10
Confidence Intervals – 3, 4
But, in reality we do not know s! So, we must repeat this
procedure for every s that is possible, a priori.

[Figure: the same construction repeated for each value of s
in the parameter space, again with α_L + f + α_R = 1]
12
Confidence Intervals – 5
Through this procedure we build two curves, l(D) and u(D),
that define lower and upper limits, respectively.

[Figure: the regions traced out over all s define the curves
l(D) and u(D) in the (sample space, parameter space) plane]
13
Confidence Intervals – 6
Suppose the s shown is the true value for one experiment. The
probability to get an interval [l(D), u(D)] that includes s is ≥ p.

[Figure: a horizontal line at the true s crosses the confidence
belt; the fraction of intervals [l(D), u(D)] containing s is ≥ p]
14
Confidence Intervals – 7
There are many ways to create a region, in the sample space,
with probability f. Here is the classic way (Neyman, 1937):
For every D solve

α_L = P(x ≤ D | u)
α_R = P(x ≥ D | l)

where α_L + f + α_R = 1.

[Figure: the resulting curves u (upper limit) and l (lower
limit) form a confidence belt in the (sample space,
parameter space) plane]
15
Confidence Intervals – 8
Here are a few ways to construct sample space intervals:

1. Central Intervals (Neyman, 1937)
   Solve α_L = P(x ≤ D | u) and α_R = P(x ≥ D | l)
   with α_L = α_R = (1 – C.L.) / 2

2. Feldman & Cousins (1997)
   Find intervals with the largest values of the ratio
   λ(s) = P(D | s) / P(D | s*), where s* is an estimate of s.

3. Mode Centered (HBP, some time in the late 20th century)
   Find intervals with the largest values of P(D | s).

By construction, all of these yield intervals that satisfy the FP.
16
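A sketch of the central (Neyman, 1937) construction for a single Poisson count, solving the two tail conditions by bisection (my own implementation):

```python
import math

def poisson_cdf(D, mu):
    """P(x <= D | mu) for a Poisson distribution with mean mu."""
    term = math.exp(-mu)
    total = term
    for k in range(1, D + 1):
        term *= mu / k
        total += term
    return total

def central_interval(D, cl=0.683):
    """Central confidence interval [l, u] for a Poisson mean, observed count D."""
    alpha = (1.0 - cl) / 2.0
    # upper limit: solve P(x <= D | u) = alpha (CDF is decreasing in mu)
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if poisson_cdf(D, mid) > alpha else (lo, mid)
    u = 0.5 * (lo + hi)
    # lower limit: solve P(x >= D | l) = alpha, i.e. P(x <= D-1 | l) = 1 - alpha
    if D == 0:
        return 0.0, u
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if poisson_cdf(D - 1, mid) > 1.0 - alpha else (lo, mid)
    return 0.5 * (lo + hi), u

l, u = central_interval(12)
# for D = 12 this roughly tracks the simple interval D ± √D
```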
Confidence Intervals – 9, 10, 11

[Figures: coverage plots comparing the Central,
Feldman & Cousins, and Mode Centered constructions with
the simple interval [D – √D, D + √D]]
19
The Profile Likelihood
Nuisance Parameters are a Nuisance!
All models are “wrong”! But,…
…to the degree that the probability models are accurate
models of the data generation mechanisms, the Neyman
construction, by construction, satisfies the FP exactly.

However, to achieve this happy state, we must construct


confidence regions for all the parameters simultaneously.

But, what if we are not interested in the expected background,


b, but only in the expected signal s? The expected
background count is an example of a nuisance
parameter…for once, here’s jargon that says it all!

21
Nuisance Parameters are a Nuisance!
One way or another, we have to rid our probability models of
all nuisance parameters if we wish to make inferences
about the parameters of interest, such as the expected
signal.

We’ll show how this works in practice, using

Example 1:
Evidence for electroweak production of W±W±jj
(ATLAS, 2014)
PRL 113, 141803 (2014)

22
Example 1: W±W±jj Production
(ATLAS)
First, let’s be clear about knowns and (known) unknowns:

knowns:
D = 12 observed events (μ±μ± mode)
B = 3.0 ± 0.6 background events

unknowns:
b expected background count
s expected signal count

Next, we construct a probability model.

23
Example 1: W±W±jj Production
(ATLAS)
Probability:

P(D | s, b) = Poisson(D, s + b) Poisson(Q, bq)
            = [(s + b)^D e^(–(s+b)) / D!] × [(bq)^Q e^(–bq) / Γ(Q + 1)]

Likelihood:

L(s, b) = P(12 | s, b)

We model the background estimate as an effective count:

B = Q / q          Q = (B / δB)² = (3.0 / 0.6)² = 25.0
δB = √Q / q        q = B / δB² = 3.0 / 0.6² = 8.33

24
Example 1: W±W±jj Production
(ATLAS)
Now that we have a likelihood, we can estimate its
parameters, for example, by maximizing the likelihood:

∂ln L(s, b)/∂s = 0 and ∂ln L(s, b)/∂b = 0  ⇒  ŝ, b̂

ŝ = D – B,   b̂ = B

with D = 12 observed events (μ±μ± mode)
     B = 3.0 ± 0.6 background events

Estimates found this way (first done by Carl Friedrich Gauss)
are called maximum likelihood estimates (MLE).
25
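A brute-force check of these MLEs (a sketch; constants in ln L are dropped and the grid resolution is my choice):

```python
import math

D, B, dB = 12, 3.0, 0.6
Q = (B / dB)**2                 # 25.0
q = B / dB**2                   # 8.33...

def log_like(s, b):
    """ln of Poisson(D, s+b) * Poisson(Q, bq), up to constants."""
    return D * math.log(s + b) - (s + b) + Q * math.log(b * q) - b * q

# grid search over (s, b); the maximum should sit at (D - B, B) = (9, 3)
grid = ((0.05 * i, 0.05 * j) for i in range(1, 401) for j in range(1, 201))
s_mle, b_mle = max(grid, key=lambda sb: log_like(sb[0], sb[1]))
```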
Maximum Likelihood – An Aside
The Good
• Maximum likelihood estimates are consistent: the
  RMS goes to zero as more and more data are acquired.
• If an unbiased estimate for a parameter exists, the
  maximum likelihood procedure will find it.
• Given the MLE for s, the MLE for y = g(s) is just ŷ = g(ŝ)
  Exercise 8: Show this
  Hint: perform a Taylor expansion about the MLE and
  consider its ensemble average.

The Bad (according to some!)
• In general, MLEs are biased

The Ugly (according to some!)
• Correcting for bias, however, can waste data and
  sometimes yield absurdities. (See Seriously Ugly)
26
The Profile Likelihood – 1
In order to make an inference about the W±W±jj signal, s, the
2-parameter problem

p(D | s, b) = [(s + b)^D e^(–(s+b)) / D!] × [(bq)^Q e^(–bq) / Γ(Q + 1)]

must be reduced to one involving s only by getting rid of the
nuisance parameter b.

In principle, this must be done while respecting the
frequentist principle: coverage prob. ≥ confidence level.

In general, this is difficult to do exactly.
27


The Profile Likelihood – 2
In practice, what we do is replace all nuisance parameters by
their conditional maximum likelihood estimates (CMLE),
which yields a function called the profile likelihood, L_P(s).

For the W±W±jj evidence example, we find an estimate of b
as a function of s,

b̂(s)

Then, in the likelihood L(s, b), b is replaced with this estimate.

Since this is an approximation, the frequentist principle is not
guaranteed to be satisfied exactly. But this procedure has a
sound justification, as we shall now see…
28
The Profile Likelihood – 3
Consider the profile likelihood ratio

λ(s) = L_P(s) / L_P(ŝ)

where ŝ is the MLE of s. Taylor expand the associated quantity

t(s) = –2 ln λ(s)

about ŝ:

t(s) = t(ŝ) + t′(ŝ)(s – ŝ) + t″(ŝ)(s – ŝ)²/2 + …
     = (s – ŝ)² / [2 / t″(ŝ)] + O(1/√N)

since t(ŝ) = 0 and t′(ŝ) = 0. The result is called the Wald
approximation (1943).
29
The Profile Likelihood – 4
If ŝ does not occur on the boundary of the parameter space, and
if the data sample is large enough so that the density of ŝ is
approximately

Gaussian(ŝ, s, σ)

then

t(s) ≈ (s – ŝ)² / σ²

has a χ² density of one degree of freedom, where σ² = 2 / t″(ŝ).

This result, Wilks' Theorem (1938) and its generalization, is the
basis of formulae popular in ATLAS and CMS.
(Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells, "Asymptotic formulae for
likelihood-based tests of new physics," Eur. Phys. J. C71:1554, 2011)

30
The Profile Likelihood – 5
The CMLE of b is

b̂(s) = [g + √(g² + 4(1 + q)Q s)] / [2(1 + q)]

g = D + Q – (1 + q)s

[Figure: contour plot of the likelihood L(s, b), with
s = D – B and b = B at the mode (peak) of the likelihood]
31
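The closed form above can be checked against a direct scan of the likelihood in b (a sketch; s = 5 is an arbitrary test point of mine):

```python
import math

D, Q, q = 12, 25.0, 25.0 / 3.0    # q = B/δB² = 8.33…

def log_like(s, b):
    return D * math.log(s + b) - (s + b) + Q * math.log(b) - q * b

def b_hat(s):
    """Conditional MLE of b at fixed s, from the quadratic solved above."""
    g = D + Q - (1 + q) * s
    return (g + math.sqrt(g * g + 4 * (1 + q) * Q * s)) / (2 * (1 + q))

s_test = 5.0
b_scan = max((0.001 * j for j in range(1, 20_001)),
             key=lambda b: log_like(s_test, b))
# at s = ŝ = 9 the conditional estimate should return b̂ = B = 3
```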
The Profile Likelihood – 6
By solving

t(s) = –2 ln λ(s) = 1

for s, we can make the statement

s ∈ [5.8, 12.9]  @ ~68% C.L.

Exercise 9: Show this
32
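Exercise 9 can be done numerically; a sketch that bisects t(s) = 1 on either side of ŝ (my own implementation of the model above):

```python
import math

D, Q, q = 12, 25.0, 25.0 / 3.0

def log_like(s, b):
    return D * math.log(s + b) - (s + b) + Q * math.log(b) - q * b

def b_hat(s):
    g = D + Q - (1 + q) * s
    return (g + math.sqrt(g * g + 4 * (1 + q) * Q * s)) / (2 * (1 + q))

s_hat, b_mle = 9.0, 3.0            # MLEs: ŝ = D - B, b̂ = B

def t(s):
    """t(s) = -2 ln λ(s) built from the profile likelihood."""
    return -2.0 * (log_like(s, b_hat(s)) - log_like(s_hat, b_mle))

def solve_t_equals_1(lo, hi):
    """Bisect t(s) = 1 on [lo, hi]; assumes one sign change of t - 1."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (t(mid) - 1.0) * (t(lo) - 1.0) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

s_lo = solve_t_equals_1(0.1, s_hat)
s_hi = solve_t_equals_1(s_hat, 25.0)
# expect roughly s ∈ [5.8, 12.9]
```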
Hypothesis Tests
Hypothesis Tests
The basic idea is simple:
1. Decide which hypothesis you may end up rejecting. This
is called the null hypothesis. At the LHC, this is typically
the background-only hypothesis.

2. Construct a number, called a test statistic that depends on


data, such that large values of the test statistic would cast
doubt on the veracity of the null hypothesis.

3. Decide on a threshold above which you are prepared to


reject the null hypothesis. Do the experiment, compute the
statistic, compare it to the agreed upon rejection threshold
and reject the null if the threshold is breached.
34
Hypothesis Tests
Fisher’s Approach: Null hypothesis (H0), say, background-only

p-value = P(x ≥ x0 | H0)

where x0 is the observed value of the test statistic.

The null hypothesis is rejected if the p-value is judged to be
small enough.
35
Example 1: W±W±jj Production
(ATLAS)
Background, B = 3.0 events (ignoring its uncertainty)

p(D | H0) = Poisson(D | B)

D = 12 observed count

p-value = ∑_{D=12}^{∞} Poisson(D | 3.0) = 7.1 × 10⁻⁵

This is equivalent to a 3.8σ excess above
background if the density were a Gaussian.
36
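This tail sum and its Gaussian-equivalent Z are easy to reproduce with the standard library (a sketch):

```python
import math
from statistics import NormalDist

def poisson(k, a):
    return a**k * math.exp(-a) / math.factorial(k)

B, D = 3.0, 12
p_value = 1.0 - sum(poisson(k, B) for k in range(D))   # P(x >= 12 | 3.0)
Z = NormalDist().inv_cdf(1.0 - p_value)                # one-sided Gaussian equivalent
```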
Hypothesis Tests – 2
Neyman’s Approach: Null hypothesis (H0) + alternative (H1)

Neyman argued that it is necessary to consider alternative
hypotheses H1.

[Figure: densities p(x | H0) and p(x | H1) with a threshold x_α]

α = p-value(x_α). Choose a fixed value of α before data are
analyzed. α is called the significance (or size) of the test.
37
The Neyman-Pearson Test
In Neyman’s approach, hypothesis tests are a contest between
significance and power, i.e., the probability to accept a true
alternative:

α = ∫_{x_α}^{∞} p(x | H0) dx        significance of test

p = ∫_{x_α}^{∞} p(x | H1) dx        power
38
The Neyman-Pearson Test
Power curve: power vs. significance.

[Figure: two power curves; blue is the more powerful below
the cross-over point and green is the more powerful above it]

Note: in general, no analysis is uniformly the most powerful.

α = ∫_{x_α}^{∞} p(x | H0) dx        significance of test

p = ∫_{x_α}^{∞} p(x | H1) dx        power
39
Hypothesis Tests
This is all well and good, but what do we do when we are
bedeviled with nuisance parameters?

…well, we’ll talk about that tomorrow and also talk about
Bayesian inference.

40
Summary
Frequentist Inference
1) Uses the likelihood.

2) Ideally, respects the frequentist principle.

3) In practice, nuisance parameters are eliminated through


the approximate procedure of profiling.

4) A hypothesis test reduces to a comparison between an


observed p-value and a p-value threshold called the
significance α. Should the p-value < α, the null hypothesis
is rejected.

41
The Seriously Ugly
The moment generating function of a probability
distribution P(k) is the average:

M(t) = E[e^(tk)] = ∑_k e^(tk) P(k)

For the binomial, this is

M(t) = (1 – p + p e^t)^n        Exercise 8a: Show this

which is useful for calculating moments, e.g.,

M₂ = E[k²] = (np)² + np – np²
42
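A numerical check of these moment formulae (a sketch; n = 10 and p = 0.3 are my choices):

```python
import math

n, p = 10, 0.3

def M(t):
    """Binomial moment generating function (1 - p + p e^t)^n."""
    return (1.0 - p + p * math.exp(t))**n

# E[k^2] computed directly from the pmf...
E_k2 = sum(k**2 * math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
# ...and from the quoted formula
M2 = (n * p)**2 + n * p - n * p**2
# M''(0) also equals E[k^2]; a crude central finite difference:
M2_fd = (M(1e-4) - 2.0 * M(0.0) + M(-1e-4)) / 1e-8
```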
The Seriously Ugly
Given that k events out of n pass a set of cuts, the MLE of the
event selection efficiency is

p̂ = k / n

and the obvious estimate of p² is k² / n². But

E[k² / n²] = p² + V / n        Exercise 8b: Show this

(with V = p(1 – p)), so k²/n² is a biased estimate of p². The
best unbiased estimate of p² is

k(k – 1) / [n(n – 1)]        Exercise 8c: Show this

Note: for a single success in n trials, p̂ = 1/n, but the
estimate of p² is 0!
43
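These two claims can be verified as exact expectations over the binomial pmf (a sketch; n = 10 and p = 0.3 are my choices):

```python
import math

n, p = 10, 0.3
V = p * (1 - p)

def pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

E_naive    = sum((k / n)**2 * pmf(k) for k in range(n + 1))              # E[k²/n²]
E_unbiased = sum(k * (k - 1) / (n * (n - 1)) * pmf(k) for k in range(n + 1))
# E_naive = p² + V/n (biased); E_unbiased = p² exactly
```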
Practical Statistics for LHC Physicists
Bayesian Inference

Harrison B. Prosper
Florida State University

CERN Academic Training Lectures


9 April, 2015
Confidence Intervals – Recap
Any questions about this figure?

α_L = P(x ≤ D | u)     α_R = P(x ≥ D | l)     α_L + f + α_R = 1,  f ≥ p

[Figure: the Neyman confidence belt, with curves u and l in
the (sample space, parameter space) plane]
2
Outline

• Frequentist Hypothesis Tests Continued…

• Bayesian Inference

3
Hypothesis Tests
In order to perform a realistic hypothesis test we need first to
rid ourselves of nuisance parameters.

Here are the two primary ways:

1. Use a likelihood averaged over the nuisance parameters.

2. Use the profile likelihood.

4
Example 1: W±W±jj Production
(ATLAS)
Recall that for B = 3.0 events (ignoring the uncertainty)

p(D | H0) = Poisson(D | B)

D = 12 observed count

we found

p-value = ∑_{D=12}^{∞} Poisson(D | 3.0) = 7.1 × 10⁻⁵

Z ~ 3.8 (sometimes called the Z-value)

5
Example 1: W±W±jj Production
(ATLAS)
Method 1: We eliminate b from the problem as follows*:

P(D | s) = ∫₀^∞ P(D | s, b) d(kb)

         = (1/Q) (1 – x)² ∑_{r=0}^{D} Beta(x, D – r + 1, Q) Poisson(r, s)

Exercise 10: Show this

where

x = 1 / (1 + k),   Beta(x, n, m) = [Γ(n + m) / (Γ(n) Γ(m))] x^(n–1) (1 – x)^(m–1)

L(s) = P(12 | s) is the average likelihood.
(*R.D. Cousins and V.L. Highland, Nucl. Instrum. Meth. A320:331–335, 1992)

6
Example 1: W±W±jj Production
(ATLAS)
Background, B = 3.0 ± 0.6 events

p(D | H0) = p(D | s = 0)
          = (1/Q) (1 – x)² Beta(x, D + 1, Q)

D = 12 is the observed count

p-value = ∑_{D=12}^{∞} p(D | H0) = 2.1 × 10⁻⁴

This is equivalent to 3.5σ, which may be compared with
the 3.8σ obtained earlier.

Exercise 11: Verify this calculation
7
An Aside on s / √b
The quantity s / √b is often used as a rough measure of
significance on the “n-σ” scale. But it should be used with
caution.

In our example, s ~ 12 – 3.0 = 9.0 events.

So according to this measure, the ATLAS W±W±jj result is a
9.0/√3 ~ 5.2σ effect, which is to be compared with 3.8σ!

Beware of s / √b!

8
The Profile Likelihood Revisited
Recall that the profile likelihood is just the likelihood with all
nuisance parameters replaced by their conditional
maximum likelihood estimates (CMLE).

In our example,

L_P(s) = L(s, b̂(s))

We also saw that the quantity

t(s) = –2 ln[L_P(s) / L_P(ŝ)]

can be used to compute approximate confidence intervals.
9
The Profile Likelihood Revisited
t(s) can also be used to test hypotheses, in particular, s = 0.

Wilks’ theorem, applied to our example, states that for large
samples the density of the signal estimate will be
approximately

Gaussian(ŝ, s, σ)

if s is the true expected signal count. Then

t(s) = –2 ln[L_P(s) / L_P(ŝ)] ≈ (s – ŝ)² / σ²

will be distributed approximately as a χ² density of one degree
of freedom, that is, as a density that is independent of all the
parameters of the problem!
10
The Profile Likelihood Revisited
Since we now know the analytical form of the probability density
of t, we can calculate the observed

p-value = P[t(0) ≥ t_obs(0)]

where t_obs(0) is the observed value of t(0) for the s = 0 hypothesis.

Then if we find that the


p-value < α,

the significance of our test, we reject the s = 0 hypothesis.


Furthermore, Z = √tobs(0), so we can skip the p-value calculation!

11
Example 1: W±W±jj Production
(ATLAS)
Background, B = 3.0 ± 0.6 events. For this example,

t_obs(0) = 12.65

therefore, Z = √t_obs(0) = 3.6

D = 12

Exercise 12: Verify this calculation
12
Example 4: Higgs to γγ (CMS)
This example mimics part of the CMS Run 1 Higgs → γγ data.
We simulate 20,000 background di-photon masses and 200
signal masses. The signal is chosen to be a Gaussian bump
with standard deviation 1.5 GeV and mean 125 GeV.

background model:
f_b(x, a) = A exp[–(a₁x + a₂x²)]

signal model:
f_s(x, m, w) = Gaussian(x, m, w)

p(x | s, m, w, b, a) = exp[–(s + b)] ∏_{i=1}^{N} [ s f_s(x_i, m, w) + b f_b(x_i, a) ]
14
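A self-contained toy along these lines (my own sketch: the mass window [100, 180] GeV, the background slope, and the sampling method are assumptions, not the CMS analysis):

```python
import math, random

rng = random.Random(7)

lo, hi = 100.0, 180.0                 # assumed di-photon mass window, GeV
m_true, w_true = 125.0, 1.5           # Gaussian bump as in the text
n_b, n_s = 20_000, 200
a1 = 0.04                             # toy background slope (a2 = 0 here)

def sample_background():
    """Rejection-sample the falling exponential background shape."""
    while True:
        xcand = lo + (hi - lo) * rng.random()
        if rng.random() < math.exp(-a1 * (xcand - lo)):
            return xcand

data = [sample_background() for _ in range(n_b)]
data += [rng.gauss(m_true, w_true) for _ in range(n_s)]   # signal bump

def f_s(xi, m, w):
    return math.exp(-(xi - m)**2 / (2 * w * w)) / (w * math.sqrt(2 * math.pi))

A = a1 / (1.0 - math.exp(-a1 * (hi - lo)))   # normalizes f_b on [lo, hi]
def f_b(xi):
    return A * math.exp(-a1 * (xi - lo))

def nll(s, b, m=m_true, w=w_true):
    """Negative log of the extended un-binned likelihood (constants dropped)."""
    return (s + b) - sum(math.log(s * f_s(xi, m, w) + b * f_b(xi)) for xi in data)
```

With the generated sample in hand, nll(s, b, m, w) can be handed to any minimizer (the lectures use Minuit via RooFit) to fit the signal and background parameters.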
Example 4: Higgs to γγ (CMS)
Fitting using
Minuit (via RooFit)
yields:

15
Bayesian Inference
Bayesian Inference – 1
Definition:
A method is Bayesian if
1. it is based on the degree of belief interpretation of
   probability, and
2. it uses Bayes’ theorem

p(θ, ω | D) = p(D | θ, ω) π(θ, ω) / p(D)

for all inferences.

D   observed data
θ   parameter of interest
ω   nuisance parameters
π   prior density

17
Bayesian Inference – 2
Nuisance parameters are removed by marginalization:

p(θ | D) = ∫ p(θ, ω | D) dω
         = ∫ p(D | θ, ω) π(θ, ω) dω / p(D)

in contrast to profiling, which can be viewed as marginalization
with the data-dependent prior π(θ, ω) = δ[ω – ω̂(θ, D)]:

p(θ | D) = ∫ p(D | θ, ω) π(θ, ω) dω / p(D)
         = ∫ p(D | θ, ω) δ[ω – ω̂(θ, D)] dω / p(D)
         = p(D | θ, ω̂(θ, D)) / p(D)
18
Bayesian Inference – 3
Bayes’ theorem can be used to compute the probability of a
model. First compute the posterior density:

p(θ_H, ω, H | D) = p(D | θ_H, ω, H) π(θ_H, ω, H) / p(D)

D    observed data
H    model or hypothesis
θ_H  parameters of model H
ω    nuisance parameters
π    prior density
19
Bayesian Inference – 4
1. Factorize the priors: π(θ_H, ω, H) = π(θ_H, ω | H) π(H)

2. Then, for each model H, compute the function

   p(D | H) = ∫ p(D | θ_H, ω, H) π(θ_H, ω | H) dθ_H dω

3. Then, compute the probability of each model H:

   p(H | D) = p(D | H) π(H) / ∑_H p(D | H) π(H)
20
Bayesian Inference – 5
In order to compute p(H | D), however, two things are needed:

1. Proper priors over the parameter spaces:

   ∫ π(θ_H, ω | H) dθ_H dω = 1

2. The priors π(H).

In practice, we compute the Bayes factor:

p(H1 | D) / p(H0 | D) = [ p(D | H1) / p(D | H0) ] × [ π(H1) / π(H0) ]

in which the first bracket is the Bayes factor, B10.
21
Example 1: W±W±jj Production
(ATLAS)
Step 1: Construct a probability model for the observations

P(D | s, b) = [e^(–(s+b)) (s + b)^D / D!] × [e^(–kb) (kb)^Q / Γ(Q + 1)]

and insert the data

D = 12 events
B = 3.0 ± 0.6 background events
Q = (B / δB)² = 25        B = Q / k
k = B / δB² = 8.33        δB = √Q / k

to arrive at the likelihood.
23
Example 1: W±W±jj Production
(ATLAS)
Step 2: Write down Bayes’ theorem:

p(s, b | D) = p(D, s, b) / p(D) = p(D | s, b) π(s, b) / p(D)

and specify the prior:

π(s, b) = π(b | s) π(s)

It is often convenient first to compute the marginal likelihood
by integrating over the nuisance parameters:

p(D | s) = ∫ p(D | s, b) π(b | s) db
24
Example 1: W±W±jj Production
(ATLAS)
The Prior: What do π(b | s) and π(s) represent?

They encode what we know, or assume, about the expected
background and signal in the absence of new observations.
We shall assume that s and b are non-negative and finite.

After a century of argument, the consensus today is that there
is no unique way to represent such vague information.
25
Example 1: W±W±jj Production
(ATLAS)
For simplicity, we shall take π(b | s) = 1*.

We may now eliminate b from the problem:

p(D | s, H1) = ∫₀^∞ p(D | s, b) π(b | s) d(kb)

             = (1/Q) (1 – x)² ∑_{r=0}^{D} Beta(x, r + 1, Q) Poisson(D – r | s)

which, of course, is exactly the same function we found
earlier! H1 represents the background + signal hypothesis.

*Luc Demortier, Supriya Jain, Harrison B. Prosper,
Reference priors for high energy physics, Phys. Rev. D82:034002, 2010

26
Example 1: W±W±jj Production
(ATLAS)
L(s) = P(12 | s, H1) is the marginal likelihood for the
expected signal s.

Here we compare the marginal and profile likelihoods. For
this problem they are found to be almost identical.

But, this happy thing does not always happen!

27
Example 1: W±W±jj Production
(ATLAS)
Given the likelihood

P(D | s, H1)

we can compute the posterior density

p(s | D, H1) = P(D | s, H1) π(s | H1) / P(D | H1)

where

p(D | H1) = ∫₀^∞ P(D | s, H1) π(s | H1) ds

28
Example 1: W㼼W㼼jj Production
(ATLAS)
Assuming a flat prior for the signal π (s | H1) = 1, the
posterior density is given by
p(s | D, H1) = [ Σ_{r=0}^{D} Beta(x; r + 1, Q) Poisson(D − r | s) ]
               / [ Σ_{r=0}^{D} Beta(x; r + 1, Q) ]

The posterior density of the parameter (or parameters) of


interest is the complete answer to the inference problem
and should be made available. Better still, publish the
likelihood and the prior.

Exercise 13: Derive an expression for p(s | D, H1) assuming
a gamma prior Gamma(qs, U + 1) for π(s | H1)
29
Example 1: W±W±jj Production
(ATLAS)
By solving

    ∫_l^u p(s | D, H1) ds = 0.68

we obtain

    s ∈ [6.3, 13.5] @ 68% C.L.

Since this is a Bayesian calculation, this statement means:

the probability (that is, the degree of belief) that


s lies in [6.3, 13.5] is 0.68.

30
Example 1: W±W±jj Production
(ATLAS)
As noted, the number

p(D | H1) = ∫₀^∞ p(D | s, H1) π(s | H1) ds

can be used to perform a hypothesis test. But, to do so, we


need to specify a proper prior for the signal, that is, a prior
π(s| H1) that integrates to one.

The simplest such prior is a δ-function, e.g.:


π(s | H1) = δ(s − 9), which yields

p(D | H1) = p(D | 9, H1) = 1.13 × 10⁻¹

31
Example 1: W±W±jj Production
(ATLAS)
From
p(D | H1) = 1.13 × 10⁻¹ and
p(D | H0) = 2.23 × 10⁻⁴

we conclude that the odds in favor of the hypothesis s = 9 have
increased by a factor of 506 relative to whatever prior odds you
started with.

It is useful to convert this Bayes factor into a (signed) measure


akin to the “n-sigma” (Sezen Sekmen, HBP)
Z  sign(ln B10 ) 2 | ln B10 |  3.6, B10  p(D | H1 ) / p(D | H0 )
Exercise 14: Verify this number
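Exercise 14 can be checked in a couple of lines; note that the rounded marginal likelihoods quoted above give Z ≈ 3.5 (the quoted 3.6 presumably comes from unrounded values):

```python
import math

B10 = 1.13e-1 / 2.23e-4     # Bayes factor p(D | H1) / p(D | H0), from the slides
Z = math.copysign(math.sqrt(2.0 * abs(math.log(B10))), math.log(B10))
print(B10, Z)               # B10 ~ 507, Z ~ 3.5
```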
32
Example 4: Higgs to γγ (CMS)
Here is a plot of Z vs. mH
as we scan through
different hypotheses about
the expected signal s.

For simplicity, the signal


width and background
parameters have been fixed
to their maximum likelihood
estimates.

33
Summary – 1
Probability
Two main interpretations:
1. Degree of belief
2. Relative frequency

Likelihood Function
Main ingredient in any full scale statistical analysis

Frequentist Principle
Construct statements such that a fraction f ≥ C.L. of them
will be true over a specified ensemble of statements.

34
Summary – 2
Frequentist Approach
1. Use likelihood function only.
2. Eliminate nuisance parameters by profiling.
3. Decide on a fixed threshold α for rejection and reject null
if p-value < α, but do so only if rejecting the null makes
scientific sense, e.g.: the probability of the alternative is
judged to be high enough.
Bayesian Approach
1. Model all uncertainty using probabilities and use Bayes’
theorem to make all inferences.
2. Eliminate nuisance parameters through marginalization.

35
The End

“Have the courage to use your own understanding!”

Immanuel Kant

36
Likelihoods
1) Introduction .
2) Do's & Don'ts

Louis Lyons and Lorenzo Moneta


Imperial College & Oxford
CERN

CERN Academic Training Course


Nov 2016 1
Topics
(mainly Parameter Determination)

What it is
How it works: Resonance
Uncertainty estimates
Detailed example: Lifetime
Several Parameters
Extended maximum L

Do's and Don'ts with L


2
Simple example: Angular distribution

Start with pdf = Prob density fn for data, given param values:
y(θ) = N (1 + β cos²θ)
y_i = N (1 + β cos²θ_i)
     = probability density of observing θ_i, given β
L(β) = Π y_i
     = probability density of observing the data set {θ_i}, given β
Best estimate of β is that which maximises L
Values of β for which L is very small are ruled out
Precision of estimate for β comes from width of L distribution

CRUCIAL to normalise y:  N = 1/{2(1 + β/3)}

(Information about parameter β comes from shape of exptl distribution of cosθ)

[Figure: cosθ distributions, e.g. β = −1 → large L]

3
How it works: Resonance
First write down pdf:
y ~ (Γ/2) / [(m − M0)² + (Γ/2)²]

[Figures: y vs m, varying M0;  y vs m, varying Γ]

N.B. Can make use of individual events

4
Conventional to consider
l = lnL = Σ ln y_i
Better numerically, and
has some nice properties

5
Maximum likelihood uncertainty
Range of likely values of param μ from width of L or l dists.
If L(μ) is Gaussian, following definitions of σ are equivalent:
1) RMS of L(µ)
2) 1/√(-d2lnL / dµ2) (Mnemonic)
3) ln L(μ0 ± σ) = ln L(μ0) − 1/2
If L(μ) is non-Gaussian, these are no longer the same

“Procedure 3) above still gives interval that contains the


true value of parameter μ with 68% probability”
Return to ‘Coverage’ later
Uncertainties from 3) usually asymmetric, and asym uncertainties are
messy. So choose param sensibly
e.g. 1/p rather than p; τ or λ

6
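A minimal sketch of the Δln L = 1/2 recipe for a single exponential decay time (t1 = 1 is an illustrative value, not from the slides); note how asymmetric the resulting interval on τ is:

```python
import math

t1 = 1.0                              # single observed decay time (illustrative)
lnL = lambda tau: -math.log(tau) - t1 / tau
tau_hat = t1                          # ML estimate for one event
target = lnL(tau_hat) - 0.5           # lnL drops by 1/2 at the interval ends

def cross(lo, hi):
    # bisect for lnL(tau) = target between lo and hi
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (lnL(mid) - target) * (lnL(lo) - target) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

lower, upper = cross(0.05, tau_hat), cross(tau_hat, 50.0)
print(lower, upper)   # roughly [0.43, 3.3]: strongly asymmetric about tau_hat = 1
```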
Lifetime Determination
Realistic analyses are more
complicated than this

7
8
Several Parameters

PROFILE L
Lprof =L(β,νbest(β)), where
β = param of interest
ν = nuisance param(s)
Uncertainty on β from
decrease in ln(Lprof) by 0.5
9
Profile L
υ

A method for dealing with


systematics
Stat uncertainty on s from
width of L fixed at υbest

Total uncertainty on s from width


s of L(s,υprof(s)) = Lprof
υprof(s) is best value of υ at that s
Contours of lnL(s,υ) υprof(s) as fn of s lies on green line
s = physics param
υ = nuisance param Total uncert ≥ stat uncertainty 10
Blue curves = different
values of ν

-2lnL

s
11
Extended Maximum Likelihood

Maximum Likelihood uses shape  parameters


Extended Maximum Likelihood uses shape and normalisation
i.e. EML uses prob of observing:
a) sample of N events; and
b) given data distribution in x,……
 shape parameters and normalisation.

Example 1: Angular distribution


Observe N events total e.g 100
F forward 96
B backward 4
Rate estimates     ML         EML
Total              -----      100 ± 10
Forward            96 ± 2     96 ± 10
Backward           4 ± 2      4 ± 2

12
ML and EML

ML uses fixed (data) normalisation


EML has normalisation as parameter

Example 2: Cosmic ray experiment


See 96 protons and 4 heavy nuclei
ML estimate:   96 ± 2% protons,  4 ± 2% heavy nuclei
EML estimate:  96 ± 10 protons,  4 ± 2 heavy nuclei

Example 3: Decay of resonance


Use ML for Branching Ratios
Use EML for Partial Decay Rates

13
Relation between Poisson and Binomial
N people in lecture, m males and f females (N = m + f )
Assume these are representative of basic rates: ν people νp males ν(1-p) females
Probability of observing N people:      P_Poisson = e^−ν ν^N / N!

Prob of given male/female division:     P_Binom = [N! / (m! f!)] p^m (1−p)^f

Prob of N people, m male and f female:  P_Poisson × P_Binom
    = [e^−νp (νp)^m / m!] × [e^−ν(1−p) (ν(1−p))^f / f!]
    = Poisson prob for males × Poisson prob for females

People            Male        Female
Patients          Cured       Remain ill
Decaying nuclei   Forwards    Backwards
Cosmic rays       Protons     Other particles

14
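The Poisson × binomial factorisation claimed above is easy to verify numerically (ν = 10, p = 0.3 and the counts below are arbitrary illustrative values):

```python
import math

def poisson(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

nu, p = 10.0, 0.3          # overall rate and male fraction (illustrative)
m, f = 4, 8                # observed males and females; N = m + f
N = m + f

lhs = poisson(N, nu) * math.comb(N, m) * p**m * (1 - p)**f
rhs = poisson(m, nu * p) * poisson(f, nu * (1 - p))
print(lhs, rhs)            # identical, as the algebra above requires
```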
                     Moments            Max Like               Least squares
Easy?                Yes, if…           Normalisation,         Minimisation
                                        maximisation messy
Efficient?           Not very           Usually best           Sometimes = Max Like
Input                Separate events    Separate events        Histogram
Goodness of fit      Messy              No (unbinned)          Easy
Constraints          No                 Yes                    Yes
N dimensions         Easy if….          Norm, max messier      Easy
Weighted events      Easy               Errors difficult       Easy
Bgd subtraction      Easy               Troublesome            Easy
Inverse covariance   Observed spread,   −∂²lnL/∂pi∂pj          ∂²S/(2∂pi∂pj)
matrix               or analytic
Main feature         Easy               Best                   Goodness of Fit

15
DO'S AND DON'TS WITH L

• NORMALISATION FOR LIKELIHOOD

• JUST QUOTE UPPER LIMIT

• Δ(ln L) = 0.5 RULE

• Lmax AND GOODNESS OF FIT


• ∫_{pL}^{pU} L dp = 0.90
• BAYESIAN SMEARING OF L

• USE CORRECT L (PUNZI EFFECT)

16
ΔlnL = -1/2 rule
If L(μ) is Gaussian, following definitions of σ are
equivalent:
1) RMS of L(µ)
2) 1/√(-d2lnL/dµ2)
3) ln L(μ0 ± σ) = ln L(μ0) − 1/2
If L(μ) is non-Gaussian, these are no longer the same
“Procedure 3) above still gives interval that contains the
true value of parameter μ with 68% probability”

Heinrich: CDF note 6438 (see CDF Statistics


Committee Web-page)
Barlow: Phystat05
20
COVERAGE

How often does quoted range for


parameter include param’s true value? µtrue µ
N.B. Coverage is a property of METHOD, not of a particular exptl result

Coverage can vary with μ

Study coverage of different methods of Poisson parameter μ, from


observation of number of events n

100%
Hope for: Nominal
value
C ( )

 21
COVERAGE
If true for all  : “correct coverage”
P<  for some  “undercoverage”
(this is serious !)

P>  for some  “overcoverage”


Conservative
Loss of rejection
power
Some Bayesians regard
Coverage as irrelevant 22
Coverage : L approach (Not Neyman construction)
P(n,μ) = e-μμn/n! (Joel Heinrich CDF note 6438)
−2 lnλ < 1, where λ = P(n, μ)/P(n, μbest):  UNDERCOVERS

23
Neyman central intervals, NEVER undercover
(Conservative at both ends)

24
Feldman-Cousins Unified intervals

Neyman construction so NEVER undercovers

25
Unbinned Lmax and Goodness of Fit?

Find params by maximising L


So larger L better than smaller L
So Lmax gives Goodness of Fit??

[Figure: Monte Carlo distribution (frequency) of unbinned Lmax,
with the observed value falling in a region labelled
Bad / Good? / Great?]

26
Not necessarily:
L(data; params):    data fixed, params vary
Contrast
pdf(data; params):  params fixed, data vary

e.g. p(t; λ) = λ exp(−λt)
As pdf in t:           max at t = 0
As likelihood in λ:    max at λ = 1/t

[Figures: p vs t;  L vs λ]

27
Example 1

Fit exponential to times t1, t2 ,t3 ……. [ Joel Heinrich, CDF 5639 ]

L = Π_i λ exp(−λt_i)

lnLmax = -N(1 + ln tav)


i.e. Depends only on AVERAGE t, but is
INDEPENDENT OF DISTRIBUTION OF t (except for……..)
(Average t is a sufficient statistic)
Variation of Lmax in Monte Carlo is due to variations in samples’ average t , but
NOT TO BETTER OR WORSE FIT

[Figure: different pdfs with the same average t give the same Lmax]

28
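The point that Lmax depends only on the average t can be seen directly (the two toy samples below are illustrative; both have mean 1):

```python
import math

def lnLmax(times):
    # exponential fit: lambda_hat = 1/mean(t), so lnLmax = -N(1 + ln t_mean)
    lam = len(times) / sum(times)
    return sum(math.log(lam) - lam * t for t in times)

sample_a = [0.1, 0.3, 0.7, 1.2, 2.7]       # roughly exponential-looking
sample_b = [1.0, 1.0, 1.0, 1.0, 1.0]       # terrible fit to an exponential
print(lnLmax(sample_a), lnLmax(sample_b))  # identical: both equal -5
```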
Example 2

dN/d cosθ = (1 + β cos²θ) / {2(1 + β/3)}

L(β) = Π_i (1 + β cos²θ_i) / {2(1 + β/3)}

[Figure: distribution in cosθ]

pdf (and likelihood) depends only on cos²θ_i

Insensitive to sign of cosθ_i
So data can be in very bad agreement with expected distribution
e.g. all data with cosθ < 0
and Lmax does not know about it.

29
Example of general principle
Lmax and Goodness of Fit?

Conclusion:

L has sensible properties with respect to parameters


NOT with respect to data

Lmax within Monte Carlo peak is NECESSARY


not SUFFICIENT

(‘Necessary’ doesn’t mean that you have to do it!)

31
Binned data and Goodness of Fit using L-ratio

L      = Π_i P_{n_i}(µ_i)

L_best = Π_i P_{n_i}(µ_{i,best}) = Π_i P_{n_i}(n_i)

[Figure: histogram of n_i vs x, with fitted µ_i]

ln[L-ratio] = ln[L/L_best]
            → −χ²/2 for large µ_i,  i.e. Goodness of Fit

L_best is independent of parameters of fit,
and so same parameter values from L or L-ratio

32
Baker and Cousins, NIM A221 (1984) 437
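A sketch of the Baker–Cousins binned likelihood-ratio statistic; the toy bin contents below are illustrative, and the algebraic simplification in the comment follows from cancelling the n! factors:

```python
import math

def neg2_ln_lratio(observed, expected):
    # -2 ln prod_i [ P(n_i; mu_i) / P(n_i; mu = n_i) ]
    #   = 2 sum_i [ mu_i - n_i + n_i ln(n_i / mu_i) ]  (log term absent if n_i = 0)
    s = 0.0
    for n, mu in zip(observed, expected):
        s += 2.0 * (mu - n + (n * math.log(n / mu) if n > 0 else 0.0))
    return s

obs, exp = [105, 95, 0], [100.0, 100.0, 0.5]
print(neg2_ln_lratio(obs, exp))   # ~1.5; the two large bins contribute ~0.25 each
print(neg2_ln_lratio(obs, [float(n) for n in obs]))   # 0: best possible fit
```

For large μ_i this statistic behaves like a χ², but unlike the Neyman or Pearson χ² it remains well defined for bins with few (or zero) entries.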
Conclusions

How it works, and how to estimate uncertainties


Likelihood or Extended Likelihood
Several Parameters
Likelihood does not guarantee coverage
Unbinned Lmax and Goodness of Fit

40
Getting L wrong: Punzi effect
Giovanni Punzi @ PHYSTAT2003
“Comments on L fits with variable resolution”
Separate two close signals, when resolution σ varies event
by event, and is different for 2 signals
e.g. 1) Signal 1: 1 + cos²θ
     Signal 2: isotropic
and different parts of detector give different σ

2) Mass M (or lifetime τ):
   different numbers of tracks → different σ_M (or σ_τ)

41
Events characterised by xi and σi
A events centred on x = 0
B events centred on x = 1
L(f)wrong = Π [f · G(xi, 0, σi) + (1−f) · G(xi, 1, σi)]
L(f)right = Π [f · p(xi, σi | A) + (1−f) · p(xi, σi | B)]

p(S,T) = p(S|T) * p(T)


p(xi,σi|A) = p(xi|σi,A) * p(σi|A)
= G(xi,0,σi) * p(σi|A)
So
L(f)right = Π[f * G(xi,0,σi) * p(σi|A) + (1-f) * G(xi,1,σi) * p(σi|B)]

If p(σ|A) = p(σ|B), Lright = Lwrong


but NOT otherwise
42
Punzi's Monte Carlo for   A : G(x, 0, σ_A)
                          B : G(x, 1, σ_B)
with f_A = 1/3:

  σ_A    σ_B      Lwrong:  f_A      σ_f      Lright:  f_A      σ_f
  1.0    1.0               0.336(3) 0.08     Same
  1.0    1.1               0.374(4) 0.08              0.333(0) 0
  1.0    2.0               0.645(6) 0.12              0.333(0) 0
  1–2    1.5–3             0.514(7) 0.14              0.335(2) 0.03
  1.0    1–2               0.482(9) 0.09              0.333(0) 0
1) Lwrong OK for p(A)  p(B) , but otherwise BIASSED
2) Lright unbiassed, but Lwrong biassed (enormously)!
3) Lright gives smaller σf than Lwrong

43
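A toy version of the σ_A = 1, σ_B = 2, f_A = 1/3 row of the table above. The sample size, seed and grid scan are arbitrary choices; and since σ here tags the component exactly, the "right" likelihood reduces to counting:

```python
import math, random

random.seed(42)
G = lambda x, mu, s: math.exp(-0.5 * ((x - mu) / s)**2) / (s * math.sqrt(2 * math.pi))

fA, N = 1.0 / 3.0, 1000
events = []                                            # (x_i, sigma_i) pairs
for _ in range(N):
    if random.random() < fA:
        events.append((random.gauss(0.0, 1.0), 1.0))   # A: centred on 0, sigma 1
    else:
        events.append((random.gauss(1.0, 2.0), 2.0))   # B: centred on 1, sigma 2

def fit(ln_term):
    # crude maximum-likelihood fit for f by grid scan
    grid = [i / 200 for i in range(1, 200)]
    return max(grid, key=lambda f: sum(ln_term(x, s, f) for x, s in events))

# "wrong": ignores that sigma itself discriminates A from B
f_wrong = fit(lambda x, s, f: math.log(f * G(x, 0, s) + (1 - f) * G(x, 1, s)))
# "right": includes p(sigma|A), p(sigma|B) (delta functions in this toy)
f_right = fit(lambda x, s, f: math.log(f * G(x, 0, 1.0)) if s == 1.0
                              else math.log((1 - f) * G(x, 1, 2.0)))
print(f_wrong, f_right)   # f_wrong biased far above 1/3; f_right close to 1/3
```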
Explanation of Punzi bias
σA = 1 σB = 2

A events with σ = 1

B events with σ = 2

[Figures: actual distribution vs fitting function in x]
[N_A/N_B variable, but same for A and B events]
Fit gives upward bias for N_A/N_B because (i) that is much better
for A events; and (ii) it does not hurt too much for B events

44
Another scenario for Punzi problem: PID
[Figures: mass M peaks for A and B;  TOF peaks for π and K]

Original Punzi scenario:               PID scenario:
Positions of peaks = constant          K-peak → π-peak at large momentum
σi variable, (σi)A = (σi)B             σi ~ constant, pK = pπ

COMMON FEATURE: Separation/Error = Constant

Where else??
MORAL: Beware of event-by-event variables whose pdf's do not
appear in L

45
Avoiding Punzi Bias
BASIC RULE:
Write pdf for ALL observables, in terms of parameters

• Include p(σ|A) and p(σ|B) in fit


(But then, for example, particle identification may be determined more
by momentum distribution than by PID)
OR
• Fit each range of σi separately, and add: Σ(NA)i →
(NA)total, and similarly for B

Incorrect method using Lwrong uses weighted average


of (fA)j, assumed to be independent of j

Talk by Catastini at PHYSTAT05


46
χ 2 and Goodness of Fit

Louis Lyons and Lorenzo Moneta


Imperial College & Oxford
CERN

CERN Academic Training Course


Nov 2016

1
Least squares best fit
What is σ?
Resume of straight line
Correlated errors
Goodness of fit with χ2
Number of Degrees of Freedom
Other G of F methods
Errors of first and second kind
Combinations
THE paradox

2
Least Squares Straight Line Fitting

Data = {xi, yi ±δyi}

1) Does it fit straight line? (Goodness of Fit)


(Goodness of Fit)

2) What are gradient and intercept?


(Parameter Determination)
Do 2) first

N.B.1 Can be used for non “a+bx”


e.g. a + b/x + c/x2
N.B.2 Least squares is not the only method
3
If theory and data OK:
y_th ~ y_obs → S small
Minimise S → best line
Value of Smin → how good fit is

4
Which σ should we use?

                         Exptl σ (Neyman)            Theory σ (Pearson)

Ease of algebra          Easier, so this version
                         is used more

If Th = 0.01, Exp = 1    Contributes 1 to S          Contributes 98 to S

S ~ χ²?                                              More plausible

S = (â − a1)²/σ² +       Biassed down because        Biassed up because
    (â − a2)²/σ²         smaller ai → smaller σ      larger â → larger σ

(For â ~ ai, and both much larger than σi, 2 methods are very similar)

5
Straight Line Fit
(Fixed σi )

N.B. L.S.B.F. passes through (<x>, <y>)

6


Correlated intercept and gradient?
2 × (Inverse covariance matrix) =

    | ∂²S/∂a²     ∂²S/∂a∂b |          | Σ 1/σi²     Σ xi/σi²  |
    |                      |  =  2 ×  |                       |
    | ∂²S/∂a∂b    ∂²S/∂b²  |          | Σ xi/σi²    Σ xi²/σi² |

Invert Covariance matrix


Covariance ~ -Σxi/σi2 = [x]
If measure intercept at weighted c. of g. of x for
data points, cov = 0
i.e. gradient and intercept there are uncorrelated

So track params are usually specified at centre


7
of track.
Covariance(a,b) ~ -<x>

[Figures: data y vs x with <x> positive and with <x> negative,
and the corresponding correlation between intercept a and gradient b]

8
Measurements with correlated errors e.g. systematics?

9
Comments on Least Squares method
1) Need to bin
Beware of too few events/bin
2) Extends to n dimensions 
but needs lots of events for n larger than 2 or 3
3) No problem with correlated uncertainties
4) Can calculate Smin “on line” i.e. single pass through data
Σ (yi – a –bxi)2 /σ2 = [yi2] – b [xiyi] –a [yi]
5) For theory linear in params, analytic solution
6) Goodness of Fit     [Figure: y vs x]

                    Individual events       yi ± σi vs xi
                    (e.g. in cosθ)          (e.g. stars)
1) Need to bin?     Yes                     No need
4) χ² on line?      First histogram         Yes

10


11
                     Moments            Max Like               Least squares
Easy?                Yes, if…           Normalisation,         Minimisation
                                        maximisation messy
Efficient?           Not very           Usually best           Sometimes = Max Like
Input                Separate events    Separate events        Histogram
Goodness of fit      Messy              No (unbinned)          Easy
Constraints          No                 Yes                    Yes
N dimensions         Easy if….          Norm, max messier      Easy
Weighted events      Easy               Errors difficult       Easy
Bgd subtraction      Easy               Troublesome            Easy
Inverse covariance   Observed spread,   −∂²lnL/∂pi∂pj          ∂²S/(2∂pi∂pj)
matrix               or analytic
Main feature         Easy               Best                   Goodness of Fit

12
Goodness of Fit: χ2 test
1) Construct S and minimise wrt free parameters
2) Determine ν = no. of degrees of freedom
ν=n–p
n = no. of data points
p = no. of FREE parameters
3) Look up probability that, for ν degrees of freedom, χ2 ≥ Smin

Uses i) Poisson ~ Gaussian if expected number not too small


ii) For N yi distributed as Gaussian N(0,1), Σyi2 ~ χ2 with ndf = N

So works ASYMPTOTICALLY. Otherwise use MC for dist of S (or binned L)

13
Properties of mathematical χ² distribution:
<χ²> = ν
σ²(χ²) = 2ν

So Smin > ν + 3√(2ν) is LARGE

e.g. Smin = 2200 for ν = 2000?

14
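Since <χ²> = ν and σ² = 2ν, the example above can be put in Gaussian terms (a sketch using the normal approximation, which is reasonable at ν = 2000):

```python
import math

def chi2_tail_gauss(s, ndf):
    # Gaussian approximation: chi2 ~ N(ndf, 2*ndf) for large ndf
    z = (s - ndf) / math.sqrt(2.0 * ndf)
    return z, 0.5 * math.erfc(z / math.sqrt(2.0))

z, p = chi2_tail_gauss(2200, 2000)
print(z, p)    # z ~ 3.16 sigma, p ~ 8e-4: Smin = 2200 for nu = 2000 IS large
```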
Cf: Area in tails of Gaussian 15
χ2 with ν degrees of freedom?
ν = data – free parameters ?

Why asymptotic (apart from Poisson  Gaussian) ?


a) Fit flatish histogram with
y = N {1 + 10^−6 cos(x − x0)},   x0 = free param

b) Neutrino oscillations: almost degenerate parameters
y ~ 1 − A sin²(1.27 Δm² L/E)        2 parameters
  ≈ 1 − A (1.27 Δm² L/E)²           1 parameter (small Δm²)

16
Goodness of Fit

χ²    Very general
Needs binning
Not sensitive to sign of deviation

Run Test

Kolmogorov-Smirnov

Aslan and Zech `Energy Test’


Durham IPPP Stats Conf (2002)

Binned Likelihood ( = Baker-Cousins}

etc 17
Goodness of Fit:
Kolmogorov-Smirnov
Compares data and model cumulative plots
(or 2 sets of data)
Uses largest discrepancy between dists.
Model can be analytic or MC sample

Uses individual data points


Not so sensitive to deviations in tails
(so variants of K-S exist)
Not readily extendible to more dimensions
Distribution-free conversion to p; depends on n
(but not when free parameters involved – needs MC)

18
Goodness of fit: ‘Energy’ test
Assign +ve charge to data ; -ve charge to M.C.
Calculate ‘electrostatic energy E’ of charges
If distributions agree, E ~ 0
If distributions don’t overlap, E is positive v2
Assess significance of magnitude of E by MC

N.B. v1
1) Works in many dimensions
2) Needs metric for each variable (make variances similar?)
3) E ~ Σ qiqj f(Δr = |ri – rj|) , f = 1/(Δr + ε) or –ln(Δr + ε)
Performance insensitive to choice of small ε
See Aslan and Zech’s paper at:
http://www.ippp.dur.ac.uk/Workshops/02/statistics/program.shtml
19
Binned data and Goodness of Fit using L-ratio
For histogram, uses Poisson prob P(n; µ) for
n_i observed events when expect µ_i.

Construct L-ratio = Π_i {P(n_i; µ_i) / P(n_i; µ = n_i)}
P(n_i; µ = n_i) uses the best possible µ for that n_i

Need denominators because P(100; 100.0) is
very different from P(1; 1.0)

[Figure: histogram of n_i vs x, with fitted µ_i]

−2 ln(L-ratio) → χ² when µ_i large and n_i ~ µ_i

Better than Neyman or Pearson χ² when µ_i small

Baker and Cousins, NIM 221 (1984) 437

20
Wrong Decisions
Error of First Kind
Reject H0 when true
Should happen x% of tests

Errors of Second Kind


Accept H0 when something else is true
Frequency depends on ………
i) How similar other hypotheses are
e.g. H0 = μ
Alternatives are: e π K p
ii) Relative frequencies:  10^−4  10^−4  1  0.1  0.1

Aim for maximum efficiency Low error of 1st kind


maximum purity Low error of 2nd kind
As χ² cut tightens, efficiency falls and purity rises
Choose compromise
21
How serious are errors of 1st and 2nd kind?

1) Result of experiment
e.g Is spin of resonance = 2?
Get answer WRONG
Where to set cut?
Small cut Reject when correct
Large cut Never reject anything
Depends on nature of H0 e.g.
Does answer agree with previous expt?
Is expt consistent with special relativity?

2) Class selector e.g. b-quark / galaxy type / γ-induced cosmic shower


Error of 1st kind: Loss of efficiency
Error of 2nd kind: More background
Usually easier to allow for 1st than for 2nd

3) Track finding
22
Combining: Uncorrelated exptl results
Simple Example of Minimising S

N.B. Better to
combine data rather
than results

So â = Σwiai/Σwi , where wi=1/σi2

23
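The weighted-average formula in code, applied to the married men/women numbers of the next slide (100 ± 5 and 80 ± 30):

```python
def weighted_average(values, sigmas):
    # a_hat = sum(w_i a_i)/sum(w_i), w_i = 1/sigma_i^2, sigma_hat = 1/sqrt(sum w_i)
    ws = [1.0 / s**2 for s in sigmas]
    a = sum(w * v for w, v in zip(ws, values)) / sum(ws)
    return a, sum(ws) ** -0.5

a, s = weighted_average([100.0, 80.0], [5.0, 30.0])
print(a, s)    # ~99.5 +- 4.9: the more precise measurement dominates
```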
Difference between weighted and simple averaging

Isolated island with conservative inhabitants


How many married people ?

Number of married men   = 100 ± 5 K
Number of married women =  80 ± 30 K

Simple sum:  Total = 180 ± 30 K

CONTRAST: weighting, and using married men = married women:
Wtd average = 99 ± 5 K  →  Total = 198 ± 10 K

GENERAL POINT: Adding (uncontroversial) theoretical input can


improve precision of answer
Compare “kinematic fitting”
24
Best Linear Unbiassed Estimate
Combine several possibly correlated estimates of same quantity
e.g. v1, v2, v3
Covariance matrix:   | σ1²     cov12   cov13 |
                     | cov12   σ2²     cov23 |
                     | cov13   cov23   σ3²   |

Uncorrelated Positive correlation Negative correlation

covij = ρij σi σj with -1 ≤ ρ ≤ 1


Lyons, Gibault + Clifford
NIM A270 (1988) 42
vbest = w1v1 + w2v2 + w3v3 Linear
with w1 + w2 + w3 =1 Unbiassed
to give σbest = min (wrt w1, w2, w3) Best
For uncorrelated case, wi ~ 1/σi2
For correlated pair of measurements with σ1 < σ2
vbest = α v1 + β v2 β=1-α
β = 0 for ρ = σ1/σ2
β < 0 for ρ > σ1/σ2 i.e. extrapolation! e.g. vbest = 2v1 – v2

Extrapolation is sensible:
[Figure: v1, v2 and Vtrue on an axis V]
Beware extrapolations because

[b] σbest tends to zero, for ρ = +1 or -1

[a] vbest sensitive to ρ and σ1/σ2

N.B. For different analyses of ~ same data,


ρ ~ 1, so choose ‘better’ analysis, rather than
combining
27
N.B. σbest depends on σ1, σ2 and ρ, but not on v1 – v2
e.g. Combining 0±3 and x±3 gives x/2 ± 2
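A sketch of BLUE for two correlated measurements, showing the β = 0 and β < 0 (extrapolation) behaviour noted above; the numerical values are illustrative:

```python
import math

def blue2(v1, s1, v2, s2, rho):
    # weights minimise the variance of w1*v1 + w2*v2 subject to w1 + w2 = 1
    c = rho * s1 * s2
    w1 = (s2**2 - c) / (s1**2 + s2**2 - 2.0 * c)
    w2 = 1.0 - w1
    var = w1**2 * s1**2 + w2**2 * s2**2 + 2.0 * w1 * w2 * c
    return w1 * v1 + w2 * v2, math.sqrt(var), w2

# rho = sigma1/sigma2 = 0.5: the less precise measurement gets zero weight
val_a, sig_a, w2_a = blue2(1.0, 1.0, 3.0, 2.0, 0.5)
# rho > sigma1/sigma2: negative weight on v2, i.e. extrapolation beyond v1
val_b, sig_b, w2_b = blue2(1.0, 1.0, 3.0, 2.0, 0.8)
print(val_a, sig_a, w2_a)   # 1.0, 1.0, 0.0
print(val_b, sig_b, w2_b)   # w2_b < 0, and sig_b is smaller than sigma1
```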

BLUE ≡ χ²:
S(vbest) = Σij (vi − vbest) E⁻¹ij (vj − vbest), and minimise S wrt vbest
Smin distributed like χ2, so measures Goodness of Fit
But BLUE gives weights for each vi
Can be used to see contributions to σbest from each source of
uncertainties e.g. statistical and systematics
different systematics

For combining two or more possibly correlated measured quantities


{e.g. intercepts and gradients of a straight line), use χ2 approach.
Alternatively. Valassi has extended BLUE approach
Covariance(a,b) ~ -<x>

[Figures: data y vs x with <x> positive and with <x> negative,
and the corresponding correlation between intercept a and gradient b]

29
Uncertainty on Ωdark energy
When combining pairs of
variables, the uncertainties on the
combined parameters can be
much smaller than any of the
individual uncertainties
e.g. Ωdark energy

30
THE PARADOX
Histogram with 100 bins
Fit with 1 parameter
Smin: χ2 with NDF = 99 (Expected χ2 = 99 ± 14)

For our data, Smin(p0) = 90


Is p2 acceptable if S(p2) = 115?

1) YES. Very acceptable χ2 probability

2) NO. σp from S(p0 +σp) = Smin +1 = 91


But S(p2) – S(p0) = 25
So p2 is 5σ away from best value
31
32
Next time:
Bayes and Frequentism:
the return of an old controversy

The ideologies, with examples


Upper limits
Feldman and Cousins
Summary

33
KINEMATIC FITTING
Tests whether observed event is consistent
with specified reaction

34
Kinematic Fitting: Why do it?

1) Check whether event consistent with hypothesis [Goodness of Fit]

2) Can calculate missing quantities [Param detn.]

3) Good to have tracks conserving E-P [Param detn.]

4) Reduces uncertainties [Param detn.]

35
Kinematic Fitting: Why do it?
1) Check whether event consistent with hypothesis [Goodness of Fit]
Use Smin and ndf

2) Can calculate missing quantities [Param detn.]


e.g. Can obtain |P| for short/straight track, neutral beam; px,py,pz of outgoing ν, n, K0

3) Good to have tracks conserving E-P [Param detn.]


e.g. identical values for resonance mass from prodn or decay

4) Reduces uncertainties [Param detn.]


Example of “Including theoretical input reduces uncertainties”

36
How we perform Kinematic Fitting ?
Observed event: 4 outgoing charged tracks
Assumed reaction: ppppπ+π-
Measured variables: 4-momenta of each track, vimeas
(i.e. 3-momenta & assumed mass)
Then test hypothesis:
Observed event = example of assumed reaction

i.e. Can tracks be wiggled “a bit” to do so?


Tested by:
Smin = Σ(vifitted - vimeas)2 / σ2
where vifitted conserve 4-momenta
(Σ over 4 components of each track)
N.B. Really need to take correlations into account

i.e. Minimisation subject to constraints (involves Lagrange Multipliers)

37
‘KINEMATIC’ FITTING
Angles of triangle: θ1 + θ2 + θ3 = 180

            θ1     θ2     θ3
Measured    50     60     73    (each ±1)    Sum = 183
Fitted      49     59     72                 Sum = 180

χ² = (50−49)²/1² + 1 + 1 = 3
Prob {χ²(1) > 3} = 8.3%

ALTERNATIVELY:
Sum = 183 ± 1.7, while expect 180
Prob{Gaussian 2-tail area beyond 1.73σ} = 8.3%

38
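The triangle example in full. With equal errors the constrained least-squares fit simply shares the excess equally among the three angles, and the χ² route and the Gaussian-sum route give the same p-value:

```python
import math

meas = [50.0, 60.0, 73.0]                  # each +- 1 degree
excess = sum(meas) - 180.0                 # = 3
fitted = [m - excess / 3.0 for m in meas]  # constrained least-squares fit
chi2 = sum((m - f)**2 for m, f in zip(meas, fitted))   # sigma = 1 each

p_chi2 = math.erfc(math.sqrt(chi2 / 2.0))                        # P(chi2_1 > 3)
p_gauss = math.erfc((excess / math.sqrt(3.0)) / math.sqrt(2.0))  # 2-tail beyond 1.73 sigma
print(fitted, chi2, p_chi2, p_gauss)       # [49, 59, 72], 3.0, ~8.3%, ~8.3%
```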
Toy example of Kinematic Fit

39
i.e. KINEMATIC FIT → REDUCED UNCERTAINTIES
40
BAYES and FREQUENTISM:
The Return of an Old Controversy
Louis Lyons and Lorenzo Moneta
Imperial College & Oxford University
CERN

CERN Academic Training Course Nov 2016 1


2
Topics
• Who cares?
• What is probability?
• Bayesian approach
• Examples
• Frequentist approach
• Summary

. Will discuss mainly in context of PARAMETER


ESTIMATION. Also important for GOODNESS of
FIT and HYPOTHESIS TESTING

3
It is possible to spend a lifetime
analysing data without realising that
there are two very different
fundamental approaches to statistics:
Bayesianism and Frequentism.

6
How can textbooks not even mention
Bayes / Frequentism?

For simplest case (m ± σ, Gaussian)
with no constraint on µtrue, then

    m − kσ ≤ µtrue ≤ m + kσ
at some probability, for both Bayes and Frequentist
(but different interpretations)
7
See Bob Cousins “Why isn’t every physicist a Bayesian?” Amer Jrnl Phys 63(1995)398
We need to make a statement about
Parameters, Given Data

The basic difference between the two:

Bayesian : Prob(parameter, given data)


(an anathema to a Frequentist!)

Frequentist : Prob(data, given parameter)


(a likelihood function)

8
WHAT IS PROBABILITY?
MATHEMATICAL
Formal
Based on Axioms

FREQUENTIST
Ratio of frequencies as n infinity
Repeated “identical” trials
Not applicable to single event or physical constant

BAYESIAN Degree of belief


Can be applied to single event or physical constant
(even though these have unique truth)
Varies from person to person ***
Quantified by “fair bet”

LEGAL PROBABILITY 9
Bayesian versus Classical
Bayesian
P(A and B) = P(A;B) x P(B) = P(B;A) x P(A)
e.g. A = event contains t quark
B = event contains W boson
or A = I am in Spanish Pyrenees
B = I am giving a lecture
P(A;B) = P(B;A) x P(A) /P(B)
Completely uncontroversial, provided…. 10
Bayesian

P(A; B) = P(B; A) × P(A) / P(B)        Bayes' Theorem

p(param | data) ∝ p(data | param) × p(param)
   posterior          likelihood        prior

Problems:  p(param)    Has particular value
                       "Degree of belief"
                       Credible intervals
           Prior       What functional form?
           Coverage

11
Prior: What functional form?
Uninformative prior: Flat?
Cannot be normalised
Ranges 0-1 and 1089-1090 equally probable
In which variable? e.g. m, m2, ln m,….?
dp/dm = dp/d(ln m) x d(ln m)/dm = (1/m) x dp/d(ln m)

Even more problematic with more params

Unimportant if “data overshadows prior”


Important for limits
Subjective or Objective prior?

Priors might be OK for parametrising prior knowledge,


but not so good for prior ignorance.
12
Mass of Z boson (from LEP)

Data overshadows prior


13
Prior

14
Even more important for UPPER LIMITS
Mass-squared of neutrino

Prior = zero in unphysical region

Fred James: “Is it a reindeer?”


15
Bayes: Specific example
Particle decays exponentially: dn/dt = (1/τ) exp(-t/τ)
Observe 1 decay at time t1: L(τ) = (1/τ) exp(-t1/τ)
Choose prior π(τ) for τ
e.g. constant up to some large τ
Then posterior p(τ) = L(τ) × π(τ)
has almost same shape as L(τ)
Use p(τ) to choose interval for τ
in usual way
Sensitivity study: Compare with using different prior
e.g. Prior constant in decay rate λ= 1/τ  different range

Contrast frequentist method for same situation later.


16
Bayesian posterior  intervals

[Figures: posterior ppost with four interval choices:
Upper limit, Lower limit, Central interval, Shortest]

17
UL includes 0; LL excludes 0; Central usually excludes 0; Shortest is metric dependent
P (Data;Theory)  P (Theory;Data)
HIGGS SEARCH at CERN
Is data consistent with Standard Model?
or with Standard Model + Higgs?
End of Sept 2000: Data not very consistent with S.M.
Prob (Data ; S.M.) < 1% valid frequentist statement
Turned by the press into: Prob (S.M. ; Data) < 1%
and therefore Prob (Higgs ; Data) > 99%
i.e. “It is almost certain that the Higgs has been seen”

19
P (Data;Theory)  P (Theory;Data)

Theory = Murderer or not


Data = Eats bread for breakfast or not

P (eats bread ; murderer) ~ 99%


but
P(murderer; eats bread) ~ 10-6
20
P (Data;Theory)  P (Theory;Data)

Theory = male or female


Data = pregnant or not pregnant

P (pregnant ; female) ~ 3%

21
P (Data;Theory)  P (Theory;Data)

Theory = male or female


Data = pregnant or not pregnant

P (pregnant ; female) ~ 3%
but
P (female ; pregnant) >>>3%
22
Peasant and Dog

1) Dog d has 50% probability of being
within 100 m of Peasant p

2) Peasant p has 50% probability of being
within 100 m of Dog d ?

[Figure: dog d and peasant p on a line; rivers at x = 0 and x = 1 km]

25
Given that: a) Dog d has 50% probability of
being within 100 m of Peasant,
is it true that: b) Peasant p has 50% probability of
being within 100 m of Dog d ?

Additional information
• Rivers at zero & 1 km. Peasant cannot cross them.
0  h  1 km
• Dog can swim across river - Statement a) still true

If dog is at x = −101 m, Peasant cannot be within 100 m of dog
→ Statement b) untrue
26
Classical Approach
Neyman "confidence interval" avoids pdf for µ
Uses only P(x; µ)
Confidence interval µ1 → µ2:
P(µ1 → µ2 contains µtrue) = α.   True for any µtrue

Varying intervals from ensemble of experiments; µtrue fixed

Gives range of µ for which observed value x0 was "likely" (≥ α)

Contrast Bayes: Degree of belief = α that µtrue is in µ1 → µ2

28
Classical (Neyman) Confidence Intervals

Uses only P(data|theory)

Theoretical
Parameter
µ

Observation x 

μ≥0 No prior for μ 29


90% Classical interval for Gaussian
σ=1 μ≥0
e.g. m2(νe), length of small object

xobs = 3    Two-sided range
xobs = 1    Upper limit
xobs = −1   No region for µ

Other methods have different behaviour at negative x
30
   
µl ≤ µ ≤ µu  at 90% confidence

Frequentist:  µl and µu known, but random
              µ unknown, but fixed
              Probability statement about µl and µu

Bayesian:     µl and µu known, and fixed
              µ unknown, and random
              Probability/credible statement about µ
31
Frequentism: Specific example

Particle decays exponentially: dn/dt = (1/τ) exp(-t/τ)


Observe 1 decay at time t1: L(τ) = (1/τ) exp(-t1/τ)
Construct 68% central interval

[Figure: dn/dt vs t, with t = 0.17τ and t = 1.8τ marked]

68% conf. int. for τ from t1:
    t1/1.8 ≤ τ ≤ t1/0.17
32
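The claimed 68% coverage of this recipe can be checked by simulation (seed, τ_true and number of trials below are arbitrary; with the rounded endpoints 0.17 and 1.8 the true coverage is e^−0.17 − e^−1.8 ≈ 67.8%):

```python
import math, random

random.seed(1)
tau_true, trials = 2.0, 20000
covered = 0
for _ in range(trials):
    t1 = random.expovariate(1.0 / tau_true)      # one observed decay time
    covered += (t1 / 1.8 <= tau_true <= t1 / 0.17)
print(covered / trials)    # ~0.68, independent of tau_true
```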
Coverage

μtrue μ
* What it is:
For given statistical method applied to many sets of data to extract
confidence intervals for param µ, coverage C is fraction of ranges that
contain true value of param. Can vary with µ

* Does not apply to your data:


It is a property of the statistical method used
It is NOT a probability statement about whether µtrue lies in your
confidence range for µ
* Coverage plot for Poisson counting expt
[Figure: C(µ) vs µ, with the ideal coverage plot at 68%]
Observe n counts
Estimate µbest from maximum of likelihood
L(µ) = e^−µ µ^n / n!  and range of µ from ln{L(µbest)/L(µ)} ≤ 0.5


For each µtrue calculate coverage C(µtrue), and compare with nominal 68%
33
Coverage : L approach
(Not Neyman construction)
P(n,μ) = e-μμn/n! (Joel Heinrich CDF note 6438)
−2 lnλ < 1, where λ = P(n, μ)/P(n, μbest):  UNDERCOVERS

35
Frequentist central intervals, NEVER undercovers
(Conservative at both ends)

36
Feldman-Cousins Unified intervals

Neyman construction, so NEVER undercovers

37
FELDMAN - COUSINS
Wants to avoid empty classical intervals 

Uses “L-ratio ordering principle” to resolve


ambiguity about “which 90% region?” 
[Neyman + Pearson say L-ratio is best for
hypothesis testing]

No ‘Flip-Flop’ problem

39
.

Feldman-Cousins
90% Confidence
Interval for
Gaussian

Xobs = -2 now gives upper limit 40


Flip-flop

Black lines Classical 90% central interval


Red dashed: Classical 90% upper limit 41
FLIP-FLOP
If xobs < 3, Upper Limit
If xobs > 3, 2-sided interval

Not good to let xobs determine how result will be presented.


F-C: Move smoothly from 1-sided to 2-sided interval
42
Features of Feldman-Cousins
Almost no empty intervals
Unified 2-sided and 1-sided intervals y
Eliminates flip-flop
No arbitrariness of interval
Less over-coverage than ‘x% at both ends’
‘Readily’ extends to several dimensions x
‘x% at each end’ or ‘Max prob density’ problematic

Neyman construction time-consuming (esp in n-dimensions)


Minor pathologies: Occasional disjoint intervals
Wrong behaviour wrt background
Tight limits when b > nobs, e.g.:
    nobs    bgd    90% UL
    0       3.0    1.08
    0       0.0    2.44
Exclusion of s=0 at lower x

46
Taking Systematics into account

48
Reminder of PROFILE L
υ

Stat uncertainty on s from


width of L fixed at υbest

Total uncertainty on s from width


of L(s,υprof(s)) = Lprof
υprof(s) is best value of υ at that s
s υprof(s) as fn of s lies on green line

Contours of lnL(s,υ) Total uncert ≥ stat uncertainty


s = physics param
υ = nuisance param Contrast with MARGINALISE
Integrate over ν 49
L(s,ν) for
different fixed ν
-2lnL

s
50
Bayesian versus Frequentism
                        Bayesian                    Frequentist
Basis of method         Bayes Theorem →             Uses pdf for data,
                        posterior probability       for fixed parameters
                        distribution
Meaning of probability  Degree of belief            Frequentist definition
Prob of parameters?     Yes                         Anathema
Needs prior?            Yes                         No
Choice of interval?     Yes                         Yes (except F+C)
Data considered         Only data you have          …+ other possible data
Likelihood principle?   Yes                         No

53
Bayesian versus Frequentism
                         Bayesian                     Frequentist
Ensemble of experiment   No                           Yes (but often not explicit)
Final statement          Posterior probability        Parameter values →
                         distribution                 data is likely
Unphysical/empty ranges  Excluded by prior            Can occur
Systematics              Integrate over prior         Extend dimensionality of
                                                      frequentist construction
Coverage                 Unimportant                  Built-in
Decision making          Yes (uses cost function)     Not useful            54
Bayesianism versus Frequentism

“Bayesians address the question everyone is


interested in, by using assumptions no-one
believes”

“Frequentists use impeccable logic to deal


with an issue of no interest to anyone”

55
Approach used at LHC

Recommended to use both Frequentist and Bayesian


approaches for parameter determination

If agree, that’s good

If disagree, see whether it is just because of different


approaches

56
Goodness of Fit:
Kolmogorov-Smirnov
Compares data and model cumulative plots
(or 2 sets of data)
Uses largest discrepancy between dists.
Model can be analytic or MC sample

Uses individual data points


Not so sensitive to deviations in tails
(so variants of K-S exist)
Not readily extendible to more dimensions
Distribution-free conversion to p; depends on n
(but not when free parameters involved – needs MC)

57
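A minimal stdlib-only sketch of the two ingredients — the largest ECDF-model gap D and its distribution-free conversion to a p-value. The conversion uses the asymptotic Kolmogorov series with Stephens' finite-n correction factor (an approximation, not exact for small n); the uniform test sample is invented:

```python
import math
import random

def ks_statistic(sample, cdf):
    """Largest gap D between the sample's empirical CDF and the model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        d = max(d, (i + 1) / n - fx, fx - i / n)
    return d

def ks_pvalue(d, n):
    """Distribution-free p-value: asymptotic Kolmogorov series, with
    Stephens' finite-n correction (sqrt(n) + 0.12 + 0.11/sqrt(n)) * d."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    return 2.0 * sum((-1.0) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                     for k in range(1, 101))

random.seed(2)
sample = [random.random() for _ in range(200)]    # toy data, actually uniform
d = ks_statistic(sample, lambda x: x)             # model: U(0,1), cdf(x) = x
print(d, ks_pvalue(d, len(sample)))               # D small, p not small
```

Note that the conversion is only distribution-free when the model has no parameters fitted to the same data, exactly as the slide warns; with free parameters the p must come from Monte Carlo.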
Is there evidence for a peak in this
data?

“Observation of an Exotic S=+1

Baryon in Exclusive Photoproduction from the Deuteron”


S. Stepanyan et al, CLAS Collab, Phys.Rev.Lett. 91 (2003) 252001

“The statistical significance of the peak is 5.2 ± 0.6 σ”

2
Is there evidence for a peak in this
data?

“Observation of an Exotic S=+1


Baryon in Exclusive Photoproduction from the Deuteron”
S. Stepanyan et al, CLAS Collab, Phys.Rev.Lett. 91 (2003) 252001
“The statistical significance of the peak is 5.2 ± 0.6 σ”

“A Bayesian analysis of pentaquark signals from CLAS data”


D. G. Ireland et al, CLAS Collab, Phys. Rev. Lett. 100, 052001 (2008)
“The ln(RE) value for g2a (-0.408) indicates weak evidence in favour
of the data model without a peak in the spectrum.”

“Comment on ‘Bayesian Analysis of Pentaquark Signals from CLAS Data’”
Bob Cousins, http://arxiv.org/abs/0807.1330
3
Statistical Issues in Searches for
New Physics

Louis Lyons and Lorenzo Moneta


Imperial College, London & Oxford
CERN

4
CERN Academic Training Course Dec 2016
Theme: Using data to make judgements about H1 (New Physics) versus
H0 (S.M. with nothing new)

Why?
Experiments are expensive and time-consuming
so
Worth investing effort in statistical analysis
→ better information from data

Topics:
p-values
What they mean
Combining p-values
Significance
Blind Analysis
LEE = Look Elsewhere Effect
Why 5σ for discovery?
Wilks’ Theorem
Background Systematics
p0 v p1 plots
Higgs search: Discovery, mass and spin

Conclusions
5
Examples of Hypotheses
1) Event selector (Event = particle interaction)
Events produced at CERN LHC at enormous rate
Online ‘trigger’ to select events for recording (~1 kiloHertz)
e.g. events with many particles
Offline selection based on required features
e.g. H0: Event contains top H1: No top
Possible outcomes: Events assigned as H0 or H1

2) Result of experiment
e.g. H0 = nothing new
H1 = new particle produced as well
(Higgs, SUSY, 4th neutrino,…..)
Possible outcomes:   H0   H1
                     ✓    ✗    Exclude H1
                     ✗    ✓    Discovery
                     ✓    ✓    No decision
                     ✗    ✗    ?
WRONG DECISIONS
E1: Reject H0 when H0 true (Loss of effic in 1))
E2: Fail to reject H0 when H1 true (Contamination)
7
H0 or H0 versus H1 ?
H0 = null hypothesis
e.g. Standard Model, with nothing new
H1 = specific New Physics e.g. Higgs with MH = 125 GeV
H0: “Goodness of Fit” e.g. χ2, p-values
H0 v H1: “Hypothesis Testing” e.g. L-ratio
Measures how much data favours one hypothesis wrt other

H0 v H1 likely to be more sensitive for H1

8
Choosing between 2 hypotheses
Possible methods:
Δχ2
p-value of statistic
lnL–ratio
Bayesian:
Posterior odds
Bayes factor
Bayes information criterion (BIC)
Akaike …….. (AIC)
Minimise “cost”

See ‘Comparing two hypotheses’


http://www-cdf.fnal.gov/physics/statistics/notes/H0H1.pdf 9
p-values

[Sketches (a) and (b): pdfs of the statistic t under H0 and H1, with the observed
value tobs marked; p0 is the tail of the H0 pdf beyond tobs, p1 the tail of the H1 pdf]

With 2 hypotheses, each with own pdf, p-values are
defined as tail areas, pointing in towards each other

[Sketch (c): both pdfs overlaid, with tobs marked]
10
p-values

Concept of pdf
Example: Gaussian
[Sketch: Gaussian pdf y versus x, with mean μ and observed value x0 marked]
y = probability density for measurement x
y = 1/(√(2π)σ) exp{-(x-μ)²/(2σ²)}
p-value: probability that x ≥ x0
Gives probability of “extreme” values of data (in interesting direction)

(x0-μ)/σ   1     2      3       4        5
p          16%   2.3%   0.13%   0.003%   0.3*10^-6

i.e. Small p = unexpected 11
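The table above follows directly from the Gaussian tail integral, which the standard library's erfc gives:

```python
import math

def p_one_sided(z):
    """One-sided Gaussian tail: p = P(x >= mu + z*sigma) = erfc(z/sqrt(2))/2."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for z in range(1, 6):
    print(z, p_one_sided(z))   # 0.159, 0.0228, 0.00135, 3.2e-05, 2.9e-07
```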


p-values, contd

Assumes:
Specific pdf for x (e.g. Gaussian, no long tails)
Data is unbiassed
σ is correct
If so, and x is from that pdf → uniform p-distribution
(Events at large x give small p)

[Sketch: flat distribution of p between 0 and 1; the “interesting region” is at small p]
12
p-values for non-Gaussian distributions

e.g. Poisson counting experiment, bgd = b
P(n) = e^-b b^n / n!
{P = probability, not prob density}

[Histogram: P(n) versus n for b = 2.9, n = 0…10]
For n=7, p = Prob( at least 7 events) = P(7) + P(8) + P(9) +…….. = 0.03
13
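The same tail probability for the Poisson example, summing the complement of P(n ≥ 7):

```python
import math

def poisson_p(n_obs, b):
    """p = P(n >= n_obs) for n ~ Poisson(b), via 1 - P(n < n_obs)."""
    return 1.0 - sum(math.exp(-b) * b ** k / math.factorial(k) for k in range(n_obs))

print(poisson_p(7, 2.9))   # ~0.03, as on the slide
```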
Significance

Significance = S/√B or similar?


Potential Problems:
•Uncertainty in B
•Non-Gaussian behaviour of Poisson, especially in tail
•Number of bins in histogram, no. of other histograms [LEE]
•Choice of cuts, bins (Blind analyses)

For future experiments:


• Optimising: Could give S = 0.1, B = 10^-4, S/√B = 10

CONCLUSION:
Calculate p properly (and allow for LEE if necessary)
14
p-values and σ

p-values often converted into equivalent Gaussian σ


e.g. 3*10^-7 is “5σ” (one-sided Gaussian tail)
Does NOT imply that pdf = Gaussian
(Simply easier to remember number of σ’s than p-value.)

15
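Going the other way — converting a p-value back to an equivalent number of σ — just inverts the one-sided tail, e.g. by bisection (the Python stdlib has no inverse erfc):

```python
import math

def z_from_p(p):
    """Equivalent one-sided Gaussian sigmas: solve erfc(z/sqrt(2))/2 = p by bisection."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > p:
            lo = mid           # tail still too big: need larger z
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(z_from_p(3e-7))          # ~5 sigma
```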
What p-values are (and are not)

[Sketch: pdf of statistic t under H0, with critical value tcrit; shaded tail area p0 = α]

Reject H0 if t > tcrit (p < α)
p-value = prob that t ≥ tobs
Small p → data and theory have poor compatibility
Small p-value does NOT automatically imply that theory is unlikely
Bayes prob(Theory;data) related to prob(data;Theory) = Likelihood
by Bayes Th, including Bayesian prior
P(A;B) ≠ P(B;A)
p-values are misunderstood. e.g. Anti-HEP jibe:
“Particle Physicists don’t know what they are doing, because half their
p ˂ 0.05 exclusions turn out to be wrong”
Demonstrates lack of understanding of p-values
[All results rejecting energy conservation with p ˂ α = .05 cut will turn out to
be ‘wrong’]
16
Criticisms of p-values
(p-values banned by journal Basic and Applied Social Psychology )

1) Misunderstood
So ban relativity, matrices…..?

2) Incorrect statements

3) p-values smaller than L-ratios


Measure different quantities
p is only for one hypothesis
L-ratio compares two hypotheses
(Is length or mass ‘better’ for comparing mouse and elephant?)
17
Combining different p-values
Several results quote independent p-values for same effect:
p1, p2, p3….. e.g. 0.9, 0.001, 0.3 ……..
What is combined significance? Not just p1*p2*p3…..
If 10 expts each have p ~ 0.5, product ~ 0.001 and is clearly NOT correct
combined p
n 1

S = z *  (-ln z)j j! , z = p1p2p3…….


/
j 0
(e.g. For 2 measurements, S = z * (1 - lnz) ≥ z )
Problems:
1) Recipe is not unique (Uniform dist in n-D hypercube → uniform in 1-D)
2) Formula is not associative
Combining {{p1 and p2}, and then p3} gives different answer
from {{p3 and p2}, and then p1} , or all together
Due to different options for “more extreme than x1, x2, x3”.
3) Small p’s due to different discrepancies

******* Better to combine data ************


18
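The combination formula above is easy to check numerically; note how ten mediocre p-values of 0.5 combine to an unremarkable ~0.84, not the naive product ~0.001:

```python
import math

def combine_p(pvals):
    """S = z * sum_{j=0}^{n-1} (-ln z)^j / j!  with  z = p1*p2*...*pn."""
    z = math.prod(pvals)
    lnz = -math.log(z)
    return z * sum(lnz ** j / math.factorial(j) for j in range(len(pvals)))

print(combine_p([0.5] * 10))   # ~0.84, while the naive product is ~0.001
z = 0.9 * 0.001                # two measurements: closed form S = z*(1 - ln z)
print(combine_p([0.9, 0.001]), z * (1.0 - math.log(z)))
```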
Procedure for choosing between 2 hypotheses
[Sketches: pdfs of the statistic t under H0 and H1 in three cases — 1) No sensitivity,
2) Maybe, 3) Easy separation — with tcrit marked and the tail areas α (under H0)
and β (under H1) shaded]
Procedure: Obtain expected distributions for data statistic (e.g. L-ratio) for H0 and H1
Choose α (e.g. 95%, 3σ, 5σ ?) and CL for p1 (e.g. 95%)
Given b, α determines tcrit
b+s defines β. For s > smin, separation of curves → discovery or excln
1-β = Power of test
Now data: If tobs ≥ tcrit (i.e. p0 ≤ α), discovery at level α
If tobs < tcrit, no discovery. If p1 < 1– CL, exclude H1 (or CLs = p1/(1-p0))
19
For event selector, 1-α = efficiency for signal events; β = mis-ID prob from other events
BLIND ANALYSES
Why blind analysis? Data statistic, selections, corrections, method

Methods of blinding
Add random number to result *
Study procedure with simulation only
Look at only first fraction of data
Keep the signal box closed
Keep MC parameters hidden
Keep unknown fraction visible for each bin
Disadvantages
Takes longer time
Usually not available for searches for unknown

After analysis is unblinded, don’t change anything unless ……..

* Luis Alvarez suggestion re “discovery” of free quarks


20
Look Elsewhere Effect (LEE)

Prob of bgd fluctuation at that place = local p-value


Prob of bgd fluctuation ‘anywhere’ = global p-value
Global p > Local p
Where is `anywhere’?
a) Any location in this histogram in sensible range
b) Any location in this histogram
c) Also in histogram produced with different cuts, binning, etc.
d) Also in other plausible histograms for this analysis
e) Also in other searches in this PHYSICS group (e.g. SUSY at CMS)
f) In any search in this experiment (e.g. CMS)
g) In all CERN expts (e.g. LHC expts + NA62 + OPERA + ASACUSA + ….)
h) In all HEP expts
etc.
d) relevant for graduate student doing analysis
f) relevant for experiment’s Spokesperson

INFORMAL CONSENSUS:
Quote local p, and global p according to a) above.
Explain which global p 21
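If the N places one could have looked are treated as independent (a strong simplifying assumption — real analyses use toy MC or asymptotic trials-factor formulae), the global p is just:

```python
import math

def p_global(p_local, n_places):
    """Trials-factor approximation for n independent places to look."""
    return 1.0 - (1.0 - p_local) ** n_places

p_loc = 0.5 * math.erfc(3.0 / math.sqrt(2.0))   # local 3-sigma fluctuation
print(p_loc, p_global(p_loc, 20))               # ~0.00135 -> ~0.027
```

A local 3σ effect is thus only a ~2σ effect globally, even for a modest 20 independent bins.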
Example of LEE: Stonehenge

22
23
Are alignments significant?
• Atkinson replied with his article "Moonshine on Stonehenge"
in Antiquity in 1966, pointing out that some of the pits which ….. had used
for his sight lines were more likely to have been natural depressions, and
that he had allowed a margin of error of up to 2 degrees in his alignments.
Atkinson found that the probability of so many alignments being visible
from 165 points to be close to 0.5 rather than the "one in a million"
possibility which ….. had claimed.

• ….. had been examining stone circles since the 1950s in search of
astronomical alignments and the megalithic yard. It was not until 1973
that he turned his attention to Stonehenge. He chose to ignore alignments
between features within the monument, considering them to be too close
together to be reliable. He looked for landscape features that could have
marked lunar and solar events. However, one of …..'s key sites, Peter's
Mound, turned out to be a twentieth-century rubbish dump.

24
Why 5σ for Discovery?
Statisticians ridicule our belief in extreme tails (esp. for systematics)
Our reasons:
1) Past history (Many 3σ and 4σ effects have gone away)
2) LEE
3) Worries about underestimated systematics
4) Subconscious Bayes calculation
p(H1|x) / p(H0|x) = [p(x|H1) / p(x|H0)] * [π(H1) / π(H0)]
Posterior prob ratio = Likelihood ratio × Prior ratio
“Extraordinary claims require extraordinary evidence”

N.B. Points 2), 3) and 4) are experiment-dependent


Alternative suggestion:
L.L. “Discovering the significance of 5σ” http://arxiv.org/abs/1310.1284 25
How many σ’s for discovery?

SEARCH            SURPRISE    IMPACT      LEE            SYSTEMATICS   No. σ
Higgs search      Medium      Very high   M              Medium        5
Single top        No          Low         No             No            3
SUSY              Yes         Very high   Very large     Yes           7
Bs oscillations   Medium/Low  Medium      Δm             No            4
Neutrino osc      Medium      High        sin²2ϑ, Δm²    No            4
Bs → μμ           No          Low/Medium  No             Medium        3
Pentaquark        Yes         High/V.high M, decay mode  Medium        7
(g-2)μ anom       Yes         High        No             Yes           4
H spin ≠ 0        Yes         High        No             Medium        5
4th gen q, l, ν   Yes         High        M, mode        No            6
Dark energy       Yes         Very high   Strength       Yes           5
Grav Waves        No          High        Enormous       Yes           8

Suggestions to provoke discussion, rather than ‘carved in stone on Mt. Sinai’

26
Bob Cousins: “2 independent expts each with 3.5σ → better than one expt with 5σ”
Wilks’ Theorem
Data = some distribution e.g. mass histogram
For H0 and H1, calculate best fit weighted sum of squares S0 and S1
Examples: 1) H0 = polynomial of degree 3
H1 = polynomial of degree 5
2) H0 = background only
H1 = bgd+peak with free M0 and cross-section
3) H0 = normal neutrino hierarchy
H1 = inverted hierarchy

If H0 true, S0 distributed as χ2 with ndf = ν0


If H1 true, S1 distributed as χ2 with ndf = ν1
If H0 true, what is distribution of ΔS = S0 – S1? Expect not large. Is it χ2?

Wilks’ Theorem: ΔS distributed as χ2 with ndf = ν0 – ν1 provided:


a) H0 is true
b) H0 and H1 are nested
c) Params for H1 → H0 are well defined, and not on boundary
d) Data is asymptotic
27
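Wilks' theorem is easy to verify by toy Monte Carlo in a case where all the conditions hold, e.g. Gaussian data with H0: μ = 0 nested in H1: μ free (σ known), where ΔS = n·x̄² should follow a χ² with 1 dof. The sample sizes here are invented:

```python
import math
import random

random.seed(1)

def delta_S(n):
    """x_i ~ N(0,1).  H0: mu = 0 (nested in) H1: mu free, sigma known.
    Delta S = S0 - S1 = n * xbar^2, expected to follow chi2 with 1 dof."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    s0 = sum(x * x for x in xs)               # best fit under H0 (mu fixed at 0)
    s1 = sum((x - xbar) ** 2 for x in xs)     # best fit under H1 (mu = xbar)
    return s0 - s1

toys = [delta_S(50) for _ in range(2000)]
mean = sum(toys) / len(toys)
frac = sum(t > 3.84 for t in toys) / len(toys)   # chi2_1: P(> 3.84) = 5%
print(mean, frac)                                # expect ~1 and ~0.05
```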
Wilks’ Theorem, contd
Examples: Does Wilks’ Th apply?

1) H0 = polynomial of degree 3
H1 = polynomial of degree 5
YES: ΔS distributed as χ2 with ndf = (d-4) – (d-6) = 2

2) H0 = background only
H1 = bgd + peak with free M0 and cross-section
NO: H0 and H1 nested, but M0 undefined when H1 → H0. ΔS ≠ χ2
(but not too serious for fixed M)

3) H0 = normal neutrino hierarchy


H1 = inverted hierarchy
NO: Not nested. ΔS ≠ χ2 (e.g. can have ΔS negative)

N.B. 1: Even when W. Th. does not apply, it does not mean that ΔS
is irrelevant, but you cannot use W. Th. for its expected distribution.

N.B. 2: For large ndf, better to use ΔS, rather than S1 and S0 separately
Is difference in S distributed as χ2 ?
Demortier:
H0 = quadratic bgd
H1 = ……………… + Gaussian of fixed width,
variable location & ampl
What is peak at zero?
Why not half the entries?
Protassov, van Dyk, Connors, ….


H0 = continuum
(a) H1 = narrow emission line
(b) H1 = wider emission line
(c) H1 = absorption line

Nominal significance level = 5%

29
Is difference in S distributed as χ2 ?, contd.

So need to determine the ΔS distribution by Monte Carlo

N.B.

1) For mass spectrum, determining ΔS for hypothesis H1


when data is generated according to H0 is not trivial,
because there will be lots of local minima

2) If we are interested in 5σ significance level, needs lots of


MC simulations (or intelligent MC generation)

3) Asymptotic formulae may be useful (see K. Cranmer, G. Cowan,


E. Gross and O. Vitells, 'Asymptotic formulae for likelihood-based tests of new
physics', http://link.springer.com/article/10.1140%2Fepjc%2Fs10052-011-1554-0 ) 30
Background systematics

31
Background systematics, contd
Signif from comparing χ2’s for H0 (bgd only) and for H1 (bgd + signal)
Typically, bgd = functional form fa with free params
e.g. 4th order polynomial
Uncertainties in params included in signif calculation
But what if functional form is different ? e.g. fb
Typical approach:
If fb best fit is bad, not relevant for systematics
If fb best fit is ~comparable to fa fit, include contribution to systematics
But what is ‘~comparable’?
Other approaches:
Profile likelihood over different bgd parametric forms
http://arxiv.org/pdf/1408.6865v1.pdf
Background subtraction
sPlots
Non-parametric background
Bayes
etc

No common consensus yet among experiments on best approach


{Spectra with multiple peaks are more difficult}
32
“Handling uncertainties in background
shapes: the discrete profiling method”
Dauncey, Kenzie, Wardle and Davies (Imperial College, CMS)
arXiv:1408.6865v1 [physics.data-an]
Has been used in CMS analysis of H → γγ

Problem with ‘Typical approach’: Alternative functional


forms do or don’t contribute to systematics by hard cut, so
systematics can change discontinuously wrt ∆χ2

Method is like profile L for continuous nuisance params


Here ‘profile’ over discrete functional forms

33
Reminder of Profile L

[Sketch: contours of lnL(s,υ) in the (s,υ) plane; υprof(s) as a function of s lies on the green line]

Stat uncertainty on s from width of L with υ fixed at υbest
Total uncertainty on s from width of L(s,υprof(s)) = Lprof
υprof(s) is best value of υ at that s
Total uncert ≥ stat uncertainty

s = physics param
υ = nuisance param
34
[Plot: -2lnL versus s for different values of the nuisance param υ]
35
Red curve: Best value of nuisance param υ
Blue curves: Other values of υ
Horizontal line: Intersection with red curve
statistical uncertainty

‘Typical approach’: Decide which blue curves have small enough ∆


Systematic is largest change in minima wrt red curves’.

Profile L: Envelope of lots of blue curves


Wider than red curve, because of systematics (υ)
For L = multi-D Gaussian, agrees with ‘Typical approach’

Dauncey et al use envelope of finite number of functional forms

36
Point of controversy!
Two types of ‘other functions’:
a) Different function types e.g.
Σai xi versus Σai/xi
b) Given fn form but different number of terms
DDKW deal with b) by -2lnL → -2lnL + kn
n = number of extra free params wrt best
k = 1, as in AIC (= Akaike Information Criterion)

Opposition claim choice k=1 is arbitrary.


DDKW agree but have studied different values, and say k =1
is optimal for them.
Also, any parametric method needs to make such a choice
37
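The envelope mechanics can be shown with two invented parabolic -2lnL curves standing in for two background parametrisations; the curve shapes, offsets, and parameter counts are all toy numbers (not the CMS implementation), with the AIC-like penalty k·n applied before taking the pointwise minimum. Note how the alternative form widens the 1σ interval:

```python
def nll_form_a(s):
    """Hypothetical background form A: parabola with minimum 0 at s = 1.0."""
    return (s - 1.0) ** 2 / 0.25

def nll_form_b(s):
    """Hypothetical form B: 2 extra parameters, slightly deeper minimum at s = 1.4."""
    return (s - 1.4) ** 2 / 0.36 - 1.8

def envelope(s, k=1.0):
    """Pointwise minimum over forms, after the -2lnL -> -2lnL + k*n penalty."""
    penalised = (nll_form_a(s) + k * 0,   # form A: no extra free parameters
                 nll_form_b(s) + k * 2)   # form B: 2 extra free parameters
    return min(penalised)

s_grid = [0.01 * i for i in range(-100, 400)]
env_min = min(envelope(s) for s in s_grid)
inside = [s for s in s_grid if envelope(s) - env_min <= 1.0]
print(min(inside), max(inside))   # ~[0.5, 1.93]: wider than form A alone, [0.5, 1.5]
```

The envelope's 1σ region is the union of what each penalised form allows, so the systematic from the choice of functional form enters smoothly rather than through a hard include/exclude cut.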
p0 v p1 plots
Preprint by Luc Demortier and LL,
“Testing Hypotheses in Particle Physics:
Plots of p0 versus p1”
http://arxiv.org/abs/1408.6123

For hypotheses H0 and H1, p0 and p1


are the tail probabilities for data
statistic t

Provide insights on:


CLs for exclusion
Punzi definition of sensitivity
Relation of p-values and Likelihoods
Probability of misleading evidence
Jeffreys-Lindley paradox

41
CLs = p1/(1-p0) → diagonal line
Provides protection against excluding H1 when little or no sensitivity

Punzi definition of sensitivity:
Enough separation of pdf’s for no chance of ambiguity

[Sketch: pdfs of the statistic under H0 and H1, separated by Δµ]
Can read off power of test


e.g. If H0 is true, what is
prob of rejecting H1?

N.B. p0 = tail towards H1


p1 = tail towards H0 42
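The protection CLs provides is visible in a toy where the statistic t is Gaussian under both hypotheses (means μ0, μ1 are invented numbers): with zero separation, p1 alone could still "exclude" H1 on a downward fluctuation, but CLs is exactly 1:

```python
import math

def tail(z):
    """One-sided Gaussian tail probability."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def cls(t_obs, mu0, mu1, sigma=1.0):
    """p0: tail towards H1 (above t_obs, under H0); p1: tail towards H0 (below t_obs, under H1)."""
    p0 = tail((t_obs - mu0) / sigma)
    p1 = tail((mu1 - t_obs) / sigma)
    return p1 / (1.0 - p0), p1

# Well-separated pdfs: raw p1 and CLs both exclude H1 comfortably
print(cls(t_obs=0.5, mu0=0.0, mu1=5.0))
# Zero separation (no sensitivity): a downward fluctuation gives p1 < 0.05,
# but CLs = 1 exactly, so H1 is not excluded
print(cls(t_obs=-1.8, mu0=0.0, mu1=0.0))
```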
α, β, Errors of 1st and 2nd Kind, etc.
e.g. H0 = event with top H1 = no top

α = prob of rejecting H0 when H0 true = E1


p0 < α, reject as top event
p0 > α, accept as top event
Effic for H0 = 1-α

β = value of p1 when p0 = α
= prob of not rejecting H0 when H1 true = E2
= mis-ID of ‘no top’ events
Power = prob of rejecting H0 when H1 true = 1-β
Contamination in signal sample depends on β, and relative frequencies for
H0 and H1 events.

ROC curves plot ‘1- Bgd Mis-ID’ versus ‘Signal Efficiency’


= ‘1- p1’ versus ‘1-p0’ (Cf p1 v p0 plots)
43
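These definitions are quick to evaluate for a pair of Gaussian pdfs (the means and the choice α = 0.05 are toy numbers for illustration):

```python
import math

def tail(z):
    """One-sided Gaussian tail probability."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mu0, mu1, alpha = 0.0, 3.0, 0.05   # toy separation and error of the 1st kind

# t_crit from alpha: P(t > t_crit | H0) = alpha, found by bisection
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if tail(mid - mu0) > alpha else (lo, mid)
t_crit = 0.5 * (lo + hi)

beta = 1.0 - tail(t_crit - mu1)    # P(t < t_crit | H1): error of the 2nd kind
power = 1.0 - beta                 # prob of rejecting H0 when H1 is true
print(t_crit, beta, power)         # ~1.645, ~0.088, ~0.912
```

Scanning t_crit instead of fixing α traces out the ROC curve of '1 - p1' versus '1 - p0' mentioned above.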
Why p ≠ Likelihood ratio

Measure different things:


p0 refers just to H0; L01 compares H0 and H1

Depends on amount of data:


e.g. Poisson counting expt, little data:
For H0, μ0 = 1.0. For H1, μ1 = 10.0
Observe n = 10: p0 ~ 10^-7, L01 ~ 10^-5
Now with 100 times as much data, μ0 = 100.0, μ1 = 1000.0
Observe n = 160: p0 ~ 10^-7, L01 ~ 10^+14

N.B. In HEP, data statistic is typically L01


Can think of method as:
p-value, where data statistic just happens to be L01; or
L01 method where p-values are just used for calibration.
44
Jeffreys-Lindley Paradox
H0 = simple, H1 has μ free
p0 can favour H1, while B01 can favour H0
B01 = L0 / ∫ L1(s) π(s) ds

Likelihood ratio depends on signal :


e.g. Poisson counting expt, small signal s:
For H0, μ0 = 1.0. For H1, μ1 = 10.0
Observe n = 10: p0 ~ 10^-7, L01 ~ 10^-5, and favours H1
Now with 100 times as much signal s, μ0 = 100.0, μ1 = 1000.0
Observe n = 160: p0 ~ 10^-7, L01 ~ 10^+14, and favours H0

B01 involves integration over s in denominator, so a wide enough range


will result in favouring H0
However, for B01 to favour H0 when p0 is equivalent to 5σ, integration
range for s has to be O(10^6) times Gaussian widths
45
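The prior-range dependence can be reproduced numerically for a Poisson observation with a flat prior on s over [0, s_max] (the counts and ranges below are invented): widening s_max dilutes the prior and drives B01 towards H0, roughly in proportion to s_max:

```python
import math

def poisson(n, mu):
    """Poisson probability, computed in log space for stability."""
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))

def b01(n, mu0, s_max, steps=20000):
    """B01 = L(mu0) / int_0^{s_max} L(mu0 + s) * (1/s_max) ds  (flat prior on s)."""
    ds = s_max / steps
    l1 = sum(poisson(n, mu0 + (i + 0.5) * ds) for i in range(steps)) * ds / s_max
    return poisson(n, mu0) / l1

for s_max in (20.0, 200.0, 2000.0):
    print(s_max, b01(10, 1.0, s_max))   # B01 grows roughly proportionally to s_max
```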
WHY LIMITS?
Michelson-Morley experiment → death of aether

HEP experiments:
If UL on rate for new particle < expected rate, exclude particle
Do as function of MX → excluded mass range below Me

[Plot: predicted σ and upper limit on σ versus MX; they cross at Me, so masses
below Me are excluded. Compare with expt’s expected sensitivity.]
CERN CLW (Jan 2000)
FNAL CLW (March 2000)
Heinrich, PHYSTAT-LHC, “Review of Banff Challenge”
46
Methods (no systematics)
Bayes (needs priors e.g. const, 1/μ, 1/√μ, μ, …..)
Frequentist (needs ordering rule,
possible empty intervals, F-C)
CLs
Likelihood (DON’T integrate your L)
χ2 (σ2 = μ)
χ2 (σ2 = n)

Recommendation 7 from CERN CLW: “Show your L”


1) Not always practical
2) Not sufficient for frequentist methods
47
Ilya Narsky, FNAL CLW 2000
[Plot: upper limits from the various methods for a Poisson counting expt with b = 3.0]
48
Search for Higgs:
H → γγ: low S/B, high statistics

55
H → ZZ → 4 l: high S/B, low statistics

56
p-value for ‘No Higgs’ versus mH

57
Mass of Higgs:
Likelihood versus mass

58
Comparing 0+ versus 0- for Higgs
(like Neutrino Mass Hierarchy)

http://cms.web.cern.ch/news/highlights-cms-results-presented-hcp
59
Conclusions
Resources:
Software exists: e.g. RooStats
Books exist: Barlow, Cowan, James, Lista, Lyons, Roe,…..
New: `Data Analysis in HEP: A Practical Guide to
Statistical Methods’ , Behnke et al.
PDG sections on Prob, Statistics, Monte Carlo
CMS and ATLAS have Statistics Committees (and BaBar and CDF
earlier) – see their websites

Before re-inventing the wheel, try to see if Statisticians have already


found a solution to your statistics analysis problem.
Don’t use a square wheel if a circular one already exists.

“Good luck” 60
