
24ASD641 Pattern Recognition

Dr. Jayakrishnan Anandakrishnan


Assistant Professor
Amrita School of Computing
Amrita Vishwa Vidyapeetham, Coimbatore

Google Scholar ORCID Web of Science

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 1 / 104


Course Objectives and Outcomes

Objectives
Identify patterns, regularities, or structures in data to make informed decisions.
Focus on computational properties of patterns and algorithms used to process
them.

Outcomes
CO01: To get an idea about pattern recognition with suitable examples
CO02: To gain knowledge about parametric classification methods using
Bayesian decision making approach
CO03: To apply nonparametric techniques such as nearest neighbor, adaptive
discriminant functions, and decision regions based on minimum squared error
CO04: To gain knowledge about nonmetric methods, classification trees, and
some resampling methods
CO05: To study and apply various clustering methods

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 2 / 104


Course Units and References
Unit I: Introduction – pattern recognition systems, design cycle, learning/adaptation, applications, statistical decision theory, examples in pattern recognition and image processing.
Unit II: Parametric methods – Bayes theorem, Bayesian decision making, Gaussian case, discriminant functions, decision boundaries/regions, dimensionality problems, ROC curves, ML classification.
Unit III: Nonparametric methods – histograms, density estimation, mixture densities, kernel/window estimators, nearest neighbor techniques, adaptive discriminant functions, minimum squared error methods.
Unit IV: Nonmetric methods – decision trees, CART, algorithm-independent ML, bias-variance, jackknife and bootstrap resampling.
Unit V: Clustering – unsupervised learning, criterion functions, hierarchical (single, complete, average, Ward's), partitional (Forgy's, k-means).
References:
1. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd Ed., Wiley, 2003.
2. Earl Gose, Richard Johnsonbaugh, Steve Jost, Pattern Recognition and Image Analysis, PHI, 2002.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 3 / 104
Outline

1 Linear Discriminant Functions

2 Minimum Squared Error Discriminant Functions

3 Non-metric Methods

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 4 / 104


Linear Discriminant Functions

In parametric estimation, we assumed that the forms of the underlying probability densities were known and used the training samples to estimate the values of their parameters.
Instead, assume that the proper forms for the discriminant functions are known, and use the samples to estimate the values of the parameters of the classifier.
None of the various procedures for determining discriminant functions requires knowledge of the forms of the underlying probability distributions; this is the so-called nonparametric approach.
Linear discriminant functions are relatively easy to compute, and their parameters can be estimated from training samples.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 5 / 104


Linear Discriminant Functions

Parametric vs. Nonparametric


Parametric: Assume the data follows a probability distribution
(Gaussian/Normal), and just estimate parameters (mean, variance).
Nonparametric: No assumption of any distribution. Instead, directly
learn from the training data itself.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 6 / 104


Parametric Case

Training Data:

Fruit Weight (grams)


Apple 160
Apple 170
Banana 120
Banana 130

Estimated Parameters:
Apple: Mean µ = 165, Variance σ² = 25
Banana: Mean µ = 125, Variance σ² = 25

New fruit weight: x = 150 grams

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 7 / 104


Classification Using Gaussian Likelihood

P(x | class) = 1/√(2πσ²) · exp( −(x − µ)² / (2σ²) )

For Apple:

P(150 | Apple) = 1/√(2π·25) · exp( −(150 − 165)² / (2·25) ) ≈ 8.9 × 10⁻⁴

For Banana:

P(150 | Banana) = 1/√(2π·25) · exp( −(150 − 125)² / (2·25) ) ≈ 3.0 × 10⁻⁷

Conclusion: The fruit is more likely to be an Apple.


Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 8 / 104
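A short Python sketch of this parametric rule, using the fruit parameters estimated above (the function and variable names are illustrative):

import math

def gaussian_likelihood(x, mu, var):
    # Evaluate the univariate Gaussian density N(mu, var) at x
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Parameters estimated from the training weights above
classes = {"Apple": (165.0, 25.0), "Banana": (125.0, 25.0)}

x = 150.0  # new fruit weight in grams
likelihoods = {c: gaussian_likelihood(x, mu, var) for c, (mu, var) in classes.items()}
print(likelihoods)                            # Apple ~ 8.9e-4, Banana ~ 3.0e-7
print(max(likelihoods, key=likelihoods.get))  # -> Apple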
Non-parametric Case

No assumption of a Gaussian or any other probability distribution.

Instead, we can use a function g(x), i.e., a linear discriminant function.

Simple Rule:
g(x) = ax + b

If g(x) > 0, then classify as Apple
If g(x) < 0, then classify as Banana

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 9 / 104


Discriminant Function for Classification

A discriminant function helps in decision-making.


For example: Given a point X , assign it to class ω1 or class ω2
depending on the value of a function.
If function > 0, assign to class ω1
If function < 0, assign to class ω2

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 10 / 104


Linear Classifier

w is the weight vector
w0 is the bias or threshold weight
g(x) = 0 defines the decision surface
The decision surface separates the classes
Two-category case
Multi-category case

g(x) = wᵀx + w0

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 11 / 104


Linear Classifier

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 12 / 104


Two-Category Classification
A two-category case can be defined as a decision between class ω1 and
class ω2.

g(x) > 0 ⇒ Class ω1
g(x) < 0 ⇒ Class ω2

Given:
g(x) = wᵀx + w0 > 0 ⇒ Class ω1
This implies:
wᵀx > −w0
So the decision rule becomes:

wᵀx > −w0 ⇒ Class ω1
wᵀx < −w0 ⇒ Class ω2

If g (x) = 0, the point lies on the decision boundary.


Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 13 / 104
Linear Classifier

y = f(g(x))
Where:

y = +1 if g(x) > 0
y = −1 if g(x) < 0

The equation g(x) = 0 defines the decision surface that separates points assigned to the categories ω1 and ω2.

If x1 and x2 are both on the decision surface, then...

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 14 / 104


Geometry of the Decision Surface

If g(x1) = g(x2) = 0, then both x1 and x2 lie on the decision surface.

wᵀx1 + w0 = wᵀx2 + w0 = 0

Subtracting the two equations:

wᵀ(x1 − x2) = 0

This indicates that w is orthogonal (normal) to any vector lying in the hyperplane defined by the decision surface.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 15 / 104


How g (x) is Related to Distance Measure
xp is the projection of x on the hyperplane H
r is positive if x is on the positive side, negative if on the negative side

x = xp + r · w/∥w∥

g(x) = wᵀx + w0 = wᵀ( xp + r · w/∥w∥ ) + w0
     = wᵀxp + r · (wᵀw)/∥w∥ + w0
     = wᵀxp + r ∥w∥ + w0

If xp lies on the decision surface, then wᵀxp + w0 = 0, so:

g(x) = r ∥w∥,  i.e.,  r = g(x)/∥w∥

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 16 / 104
How g (x) is Related to Distance Measure

The distance from the origin to the hyperplane H is:

w0 / ∥w∥

If w0 > 0, the origin lies on the positive side of H
If w0 < 0, the origin lies on the negative side of H
If w0 = 0, then g(x) = wᵀx is homogeneous and the hyperplane passes through the origin

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 17 / 104
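A small numeric sketch of these relations in Python; the weight vector, bias, and test point below are made-up values for illustration:

import numpy as np

w = np.array([3.0, 4.0])   # weight vector (illustrative)
w0 = -5.0                  # bias / threshold weight (illustrative)
x = np.array([2.0, 1.0])   # test point

g = w @ x + w0             # g(x) = w^T x + w0
r = g / np.linalg.norm(w)  # signed distance from x to the hyperplane g(x) = 0

print(g, r)                         # 5.0 1.0 -> x lies one unit on the positive side of H
print(abs(w0) / np.linalg.norm(w))  # distance from the origin to the hyperplane: 1.0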


Multi Class Case

Two common reductions:
c two-class problems (each class vs. the rest)
c(c − 1)/2 linear discriminants, one for every pair of classes

Both lead to ambiguous regions: if a sample falls in such a region, it is difficult to decide its class.
How can this be solved?

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 18 / 104


Multi Class Case: Solution

Can be solved by considering a linear discriminant function for each individual class.
Let us consider a linear discriminant function for the i-th class:

gi(x) = wiᵀx + wi0,  i = 1, . . . , C

We assign x to ωi if gi(x) > gj(x) for all j ≠ i.

In this case, the resulting classifier is known as a Linear Machine.
The linear machine divides the feature space into C decision regions:

R1, R2, . . . , RC

with gi(x) being the largest discriminant if x lies in the region Ri.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 19 / 104
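A minimal Python sketch of a linear machine; the three weight vectors and biases are made-up values for illustration:

import numpy as np

# One weight vector w_i and bias w_i0 per class (illustrative values for C = 3 classes)
W = np.array([[ 1.0,  0.5],
              [-0.3,  1.2],
              [ 0.8, -1.0]])
w0 = np.array([0.1, -0.2, 0.4])

def linear_machine(x):
    # Assign x to argmax_i g_i(x), where g_i(x) = w_i^T x + w_i0
    g = W @ x + w0
    return int(np.argmax(g)), g

label, scores = linear_machine(np.array([1.0, 2.0]))
print(label, scores)   # class 0 has the largest discriminant for this point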


Multi Class Case: Point on Decision Boundary

For two contiguous regions Ri and Rj, if a point lies on the decision boundary between them, then:

gi(x) = gj(x)
⇒ wiᵀx + wi0 = wjᵀx + wj0
⇒ (wi − wj)ᵀx + (wi0 − wj0) = 0

This shows that (wi − wj) is normal to Hij (the separating hyperplane between classes i and j).
So the algebraic distance from x to Hij is

r = ( gi(x) − gj(x) ) / ∥wi − wj∥

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 20 / 104


Multi Class Case: Regions

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 21 / 104


Minimum Squared Error Discriminant Functions

Possibilities exist to extract the weight vector all at once, without any iterative process.
A neural network learns the weights a through an iterative process.
Can you learn a in a single step?

[ x11 x12 · · · x1d ] [ a1 ]   [ b1 ]
[ x21 x22 · · · x2d ] [ a2 ] = [ b2 ]
[  ⋮    ⋮        ⋮  ] [  ⋮ ]   [  ⋮ ]
[ xn1 xn2 · · · xnd ] [ ad ]   [ bn ]

Let X be an n × d matrix.
Now apply augmentation. Why augmentation?

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 22 / 104


Minimum Squared Error Discriminant Functions
Because a bias (intercept) term needs to be added to the linear discriminant function.
The linear discriminant function for a single data point xi:

g(xi) = a0 + a1 xi1 + a2 xi2 + · · · + ad xid

What is a0? It is the augmentation of the weight vector a.
Augmented matrix Y, in which each data point yi has an extra 1 at the beginning:

    [ 1 x11 x12 · · · x1d ]
    [ 1 x21 x22 · · · x2d ]
Y = [ ⋮  ⋮   ⋮        ⋮  ]
    [ 1 xn1 xn2 · · · xnd ]

Now it is n × (d + 1).
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 23 / 104
Minimum Squared Error Discriminant Functions

[ 1 x11 x12 · · · x1d ] [ a0 ]   [ b1 ]
[ 1 x21 x22 · · · x2d ] [ a1 ] = [ b2 ]
[ ⋮  ⋮   ⋮        ⋮  ] [  ⋮ ]   [  ⋮ ]
[ 1 xn1 xn2 · · · xnd ] [ ad ]   [ bn ]

Now you can write this as a linear equation Ya = b, and a = Y⁻¹b.

a = Y⁻¹b is only possible if Y is a square, invertible matrix.
Is that true in the real world?
In the real world, you usually have more data points n than features d.
How do we get the first linear equation?
Multiply the first row of Y by the weight column a; the result should equal the first element of b.
Since we generally cannot find a weight vector a that satisfies all equations exactly, there will be an error.
Error: e = Ya − b
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 24 / 104
Minimum Squared Error Discriminant Functions

If there is an error e, we can write the criterion as the sum of squared errors:

Js(a) = ∥Ya − b∥² = Σᵢ (aᵀyᵢ − bᵢ)²

Note: a is the weight vector (including the bias), yᵢ is an input sample with a bias term, and bᵢ is the target.
To minimize this error, take the gradient and equate it to zero; this gives a simple closed-form solution.
Differentiate w.r.t. a:

∇Js(a) = Σ_{i=1}^{n} 2(aᵀyᵢ − bᵢ) yᵢ = 2Yᵀ(Ya − b)

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 25 / 104


Minimum Squared Error Discriminant Functions

When you equate the gradient of the error function to zero, you are
finding the point where the function’s slope is flat. For a convex func-
tion like the squared error, this flat point is the global minimum.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 26 / 104


Minimum Squared Error Discriminant Functions
Equate it to zero to get a simple solution:

2Yᵀ(Ya − b) = 0
YᵀYa = Yᵀb
a = (YᵀY)⁻¹Yᵀb

Now you can write a as

a = Y†b, where Y† = (YᵀY)⁻¹Yᵀ

Y† is called the pseudoinverse. (Is YᵀY always a square matrix? Yes: it is (d + 1) × (d + 1).)
This indicates that if we know b, we can compute the solution weights a from the pseudoinverse.
The regularized pseudoinverse, where ϵ is a small positive constant, is given as

Y† ≡ lim(ϵ→0) (YᵀY + ϵI)⁻¹Yᵀ

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 27 / 104


Minimum Squared Error Discriminant Functions
Suppose we have the following two-dimensional points for two categories:
ω1: (1, 2)ᵗ, (2, 0)ᵗ and ω2: (3, 1)ᵗ, (2, 3)ᵗ. Construct a classifier with a pseudoinverse.

ω1: (1, 2)ᵗ, (2, 0)ᵗ        ω2: (3, 1)ᵗ, (2, 3)ᵗ

Augmented (and, for ω2, sign-normalized) vectors are:
y1 = [1, 1, 2], y2 = [1, 2, 0], y3 = [−1, −3, −1], y4 = [−1, −2, −3]

Matrix Y is:

    [  1   1   2 ]
    [  1   2   0 ]
Y = [ −1  −3  −1 ]
    [ −1  −2  −3 ]

The pseudoinverse of Y is given by:

Y⁺ = (YᵀY)⁻¹Yᵀ
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 28 / 104
Minimum Squared Error Discriminant Functions

     [  5/4   13/12   3/4   7/12 ]
Y⁺ = [ −1/2   −1/6   −1/2   −1/6 ]
     [   0    −1/3     0    −1/3 ]

We arbitrarily let all the margins/labels be equal, i.e.,

b = (1, 1, 1, 1)ᵗ

Our solution is:

a = Y⁺b = ( 11/3, −4/3, −2/3 )ᵗ

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 29 / 104
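The worked example above can be checked with a few lines of NumPy; np.linalg.pinv computes the pseudoinverse directly:

import numpy as np

# Augmented, sign-normalized samples: rows are y_i (the omega_2 rows are negated)
Y = np.array([[ 1,  1,  2],
              [ 1,  2,  0],
              [-1, -3, -1],
              [-1, -2, -3]], dtype=float)
b = np.ones(4)

a = np.linalg.pinv(Y) @ b    # or: np.linalg.solve(Y.T @ Y, Y.T @ b)
print(a)                     # [ 3.6667 -1.3333 -0.6667 ], i.e. (11/3, -4/3, -2/3)
print(np.sign(Y @ a))        # all positive -> every training sample is classified correctly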


Introduction to Non-Metric Methods

Previous pattern recognition methods involved real-valued feature vectors with clear metrics.
Are there instances of data without clear metrics?
Nominal data: data that are discrete and without any natural notion of similarity or even ordering.
For example, eye color. It can be brown, blue, or green. You can count how many people have brown eyes, but you can't perform a mathematical operation like finding the average eye color.
Solution: decision trees and string grammars.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 30 / 104


Introduction to Non-Metric Methods: Decision Trees

I am thinking of a fruit. Ask me up to 20 yes/no questions to determine which fruit I am thinking of.
How did you ask the questions?
What underlying measure led you to the questions, if any?
Most importantly, iterative yes/no questions of this sort require no metric and are well suited for nominal data.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 31 / 104


Decision Tree

A decision tree starts with a root node at the very top. This is where the classification of a particular pattern begins by checking a specific property that was chosen during the tree's learning phase.
The tree then branches out from the root node. Each branch corresponds to a different possible value of the property being checked.
You follow the branch that matches the value of your pattern, which leads you to a new node where the process is repeated.
This step-by-step process continues, with the tree checking one property after another, until you reach a leaf node.
A leaf node signifies that a final decision has been reached and the pattern has been classified.
This logical, easy-to-follow structure is what makes decision trees highly interpretable.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 40 / 104


Decision Tree-CART

Consider a labeled dataset D with a set of features chosen for classification.
The goal is to arrange these features into a decision tree that achieves high classification accuracy.
A decision tree works by recursively splitting the data into smaller subsets.
When all samples in a subset belong to the same class, the node is said to be pure, and further splitting is unnecessary.
In practice, complete purity is rare, so we must decide whether to stop with an imperfect split or continue growing the tree with additional features.
CART adopts a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 41 / 104


CART Strategy

The basic CART strategy for recursively defining a tree is:


Given the data at a node, either declare it a leaf or choose a property
to split the data into subsets.
In this process, six key questions arise:
1 How many branches should be created from a node?
2 Which property should be tested at a node?
3 When should a node be declared a leaf?
4 How can we prune a tree that has grown too large?
5 If a leaf node remains impure, how should its category be assigned?
6 How should missing data be handled?

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 42 / 104


Number of Splits

The number of splits at a node, called the branching factor B, is


usually determined by the designer based on the test selection
method.
The branching factor can vary across different parts of the tree.
Any split with more than two branches can be represented as a
sequence of binary splits.
For this reason, tree-learning methods primarily focus on binary trees.
However, in some cases, using 3- or 4-way splits may be preferred, as
binary tests or inferences can be computationally expensive.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 43 / 104


Principle of Tree Creation

The core principle in building decision trees is simplicity: prefer small,


compact trees with fewer nodes.
At each node N, we select a property test T that makes the
descendant nodes as pure as possible.
Let i(N) represent the impurity of node N:
i(N) = 0 if all samples in the node belong to one category (pure).
i(N) is large when categories are equally represented (impure).
A widely used impurity measure is Entropy:

i(N) = − Σⱼ P(ωⱼ) log P(ωⱼ)

Entropy reaches its minimum when the node contains only one class.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 44 / 104


Principle of Tree Creation
For the two-class case, Variance Impurity is defined as:

i(N) = P(ω1 )P(ω2 )

For multi-class problems, the Gini Impurity is commonly used:

i(N) = Σ_{i≠j} P(ωᵢ)P(ωⱼ) = 1 − Σⱼ P²(ωⱼ)

This represents the expected error rate if the class label is chosen randomly according to the class distribution at node N.
Another measure is the Misclassification Impurity:

i(N) = 1 − maxⱼ P(ωⱼ)

This gives the minimum probability that a training sample at node N


will be misclassified.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 45 / 104
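A short Python sketch of these impurity measures, given the class probabilities at a node (a base-2 logarithm is assumed for the entropy; the helper functions are illustrative, not part of CART itself):

import math

def entropy_impurity(p):
    # i(N) = -sum_j P(w_j) log2 P(w_j); zero-probability classes contribute nothing
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def variance_impurity(p):
    # two-class case only: i(N) = P(w1) * P(w2)
    return p[0] * p[1]

def gini_impurity(p):
    # i(N) = sum_{i != j} P(w_i) P(w_j) = 1 - sum_j P(w_j)^2
    return 1.0 - sum(pj ** 2 for pj in p)

def misclassification_impurity(p):
    # i(N) = 1 - max_j P(w_j)
    return 1.0 - max(p)

p = [0.5, 0.5]   # maximally impure two-class node
print(entropy_impurity(p), variance_impurity(p), gini_impurity(p), misclassification_impurity(p))
# -> 1.0 0.25 0.5 0.5
p = [1.0, 0.0]   # pure node: every impurity measure is zero
print(entropy_impurity(p), variance_impurity(p), gini_impurity(p), misclassification_impurity(p))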
Principle of Tree Creation

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 46 / 104


Feature Selection at a Node

Key Question: Given a partial tree down to node N, which feature


should be selected for the property test T ?
Heuristic: choose the feature that produces the largest decrease in
impurity.
The impurity gradient is defined as:

∆i(N) = i(N) − PL i(NL ) − (1 − PL )i(NR )

where:
NL , NR : left and right child nodes.
PL : fraction of samples directed to the left subtree by test T .
Strategy: select the feature that maximizes ∆i(N).
If entropy impurity is used, this is equivalent to selecting the feature
with the highest Information Gain.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 47 / 104
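A minimal Python sketch of choosing the split that maximizes ∆i(N), using entropy impurity; the one-dimensional dataset and candidate thresholds are made up for illustration:

import math
from collections import Counter

def entropy(labels):
    # entropy impurity of a node holding these class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_drop(x, y, threshold):
    # Delta_i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R) for the test "x < threshold"
    left = [yi for xi, yi in zip(x, y) if xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold]
    if not left or not right:
        return 0.0
    p_left = len(left) / len(y)
    return entropy(y) - p_left * entropy(left) - (1 - p_left) * entropy(right)

# Toy one-feature node: small values belong to class 0, large values to class 1
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 1, 1, 1, 1]
best = max((impurity_drop(x, y, t), t) for t in (0.25, 0.5, 0.75))
print(best)   # (1.0, 0.5): the split at 0.5 separates the classes perfectly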


Binary and Multi-Class Splits

In the binary case, feature selection reduces to a one-dimensional


optimization problem (which may have multiple optima).
For higher branching factors, the problem becomes
higher-dimensional.
In multi-class binary tree construction, the twoing criterion is often
used:
Goal: find a split that best separates the c categories into two groups.
Define a candidate “supercategory” C1 (subset of categories) and C2
(the remainder).
Search must consider both features and category groupings.
This approach follows a local, greedy optimization strategy:
No guarantee of achieving the global optimum in accuracy.
No guarantee of obtaining the smallest possible tree.
In practice, the specific choice of impurity function has little effect on
the final classifier’s accuracy.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 48 / 104


When to Stop Splitting?

If the tree grows until each leaf has only one sample (minimum
impurity), it will be overfitted and fail to generalize.
If the tree is stopped too early, training error remains high and
performance suffers.
Common strategies to decide when to stop splitting:
1 Use cross-validation to determine optimal stopping.
2 Set a threshold on the impurity gradient.
3 Add a tree-complexity penalty term and minimize.
4 Apply a statistical test on the impurity gradient.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 49 / 104


Stopping Criterion Using Threshold β

Splitting is stopped if the best candidate split at a node reduces


impurity by less than a preset threshold β:

max_s ∆i(s) ≤ β

Benefits:
Unlike cross-validation, the tree is trained on the full dataset.
Leaf nodes can occur at different depths, which adapts to varying data
complexity.
Drawback:
Choosing an appropriate value for β is non-trivial.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 50 / 104


Stopping with a Complexity Term

Define a global criterion function that balances complexity and accuracy:

α · size + Σ_{leaf nodes} i(N)

where:
size: number of nodes or links in the tree.
α: positive constant controlling the trade-off.
Splitting continues until this global criterion is minimized.
With entropy impurity, this measure is related to the Minimum
Description Length (MDL) principle.
The sum of leaf node impurities represents the uncertainty of the
training data given the tree model.
Drawback: The challenge is how to appropriately set the constant α.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 51 / 104


Stopping with a Complexity Term

Imagine trying to divide apples (class 1) and oranges (class 2).


A good split: All apples on one side, oranges on the other (very
informative).
Candidate split x1 < 0.25
A random split: Apples and oranges scattered left and right by chance
(not useful).

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 52 / 104


Statistical Testing for Stopping Splits

During tree construction, estimate the distribution of impurity


gradients ∆i across the current nodes.
For any candidate split, test if ∆i is significantly different from zero.
Possible approaches:
Use a Chi-squared test.
More generally, apply a hypothesis testing framework to check
whether a split is better than a random split.
Example: Suppose there are n samples at node N.
A candidate split s sends Pn samples to the left and (1 − P)n samples
to the right.
Under a random split:
Pn1 of ω1 samples go left.
Pn2 of ω2 samples go left.
(1 − P)n1 of ω1 samples go right.
(1 − P)n2 of ω2 samples go right.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 53 / 104


Chi-Squared Test for Splitting

The Chi-squared statistic measures how much a candidate split s deviates from a random split:

χ² = Σ_{i=1}^{2} (n_iL − n_ie)² / n_ie

where:
n_iL = number of ωᵢ patterns sent to the left under split s.
n_ie = P nᵢ = expected number sent left under a random split.
Interpretation:
Larger χ² ⇒ greater deviation from random splitting.
If χ² exceeds a critical value (based on the significance level), reject the null hypothesis of randomness.
In this case, accept split s as meaningful.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 54 / 104
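A small Python sketch of this χ² statistic for a two-class candidate split; the node counts below are hypothetical:

def chi_squared_split(n1, n2, n1_left, n2_left, p_left):
    # chi^2 = sum_i (n_iL - n_ie)^2 / n_ie, with n_ie = P * n_i the expected
    # number of class-i samples sent left by a random split
    chi2 = 0.0
    for n_i, n_i_left in ((n1, n1_left), (n2, n2_left)):
        n_ie = p_left * n_i
        chi2 += (n_i_left - n_ie) ** 2 / n_ie
    return chi2

# Hypothetical node: 10 samples of omega_1 and 10 of omega_2; the split sends
# 10 samples left, 9 of them from omega_1 and 1 from omega_2.
print(chi_squared_split(n1=10, n2=10, n1_left=9, n2_left=1, p_left=0.5))
# -> 6.4, well above 3.84 (the 5% critical value for 1 degree of freedom),
#    so the split is accepted as non-random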


Pruning in Decision Trees

Stopping-split criteria bias trees toward large impurity reductions near


the root.
These methods do not account for possible future splits deeper in the
tree.
Pruning is an alternative strategy:
First, grow the tree fully (exhaustive construction).
Then, consider all pairs of neighboring leaf nodes for elimination.
If eliminating the pair only slightly increases impurity, replace them
with their common ancestor node as a leaf.
Characteristics of pruning:
Often produces unbalanced trees.
Avoids the local nature of early stopping.
Uses the full training dataset.
Involves higher computational cost.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 55 / 104


Pruning in Decision Trees

Is Warm-blooded?
  Yes → Is it a Pet?
    Yes → Dog (Leaf)
    No → Wolf (Leaf)
  No → Snake (Leaf)

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 56 / 104


Pruning in Decision Trees

Is Warm-blooded?
  Yes → Warm-blooded Animal (Leaf)
  No → Snake (Leaf)

Will we be able to distinguish Dog and Wolf then?

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 57 / 104


Decision Tree Effect of Noise

Consider 16 data points. Suppose the x2 value of the last point in the red
class is affected by some noise. As a result, two possible decision trees are
generated, with x2 values of 0.36 and 0.32.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 58 / 104


Decision Tree Effect of Noise

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 59 / 104


Decision Tree Effect of Noise

Note how the decision tree changed drastically due to the change in a single point.
Hence, decision trees are highly susceptible to noise.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 60 / 104
Decision Tree Variable Selection

As we know from the Ugly Duckling theorem and various empirical evidence, the selection of features will ultimately play a major role in accuracy, generalization, and complexity.
Furthermore, the use of multiple variables in selecting a decision rule may greatly improve accuracy and generalization.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 61 / 104


Decision Tree Variable Selection

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 62 / 104


Decision Tree Variable Selection

Note how the decision tree changed drastically and the separability improved when multivariate decision criteria were used.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 63 / 104
Decision Tree Complexity

Training Time Complexity:


Root node: O(d · n log n) for sorting and evaluating splits
Balanced binary tree: O(d · n(log n)2 )
Prediction Time Complexity: O(log n) for balanced trees
Space Complexity:
Number of nodes ≈ 2n − 1 ⇒ O(n)
Includes memory for features, thresholds, and tree structure
Factors Affecting Complexity:
Number of features d → more candidate splits
Number of samples n → longer training, deeper tree
Tree depth → more nodes, higher memory, risk of overfitting
Split type → binary splits are simpler; multi-way splits increase branching
Key Points:
Decision trees are fast for prediction but can be expensive to train
Complexity depends on samples, features, and tree depth

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 64 / 104


Decision Tree Points to Note

Problem Link + Additional Learning:


YouTube Video Link

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 65 / 104


Algorithm-Independent Machine Learning

Many pattern recognition algorithms exist, so people often ask: “Which one is best?”
Some select the algorithm with lower computational cost (faster).
Some select an algorithm based on the data (discrete vs. continuous).
What if there are datasets for which these factors do not matter?
William of Ockham (or Occam) was a 14th-century English philosopher and theologian.
Occam's Razor says that simpler models often generalize better.
If two models perform equally well on training data, Occam's razor suggests the simpler model will likely perform better on unseen data.
In physics, some laws (e.g., conservation of energy) hold regardless of conditions. What about Pattern Recognition? Do we have general principles that hold regardless of the algorithm?
We look for algorithm-independent rules that guide learning and classification.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 66 / 104
Algorithm-Independent Machine Learning

Bayes error is the theoretical minimum error a classifier can achieve.
Even the best classifier has to face this unavoidable classification error.
Example: apples weigh 150–200 g and oranges weigh 180–230 g. There is overlap in the range 180–200 g, so even the best classifier will make mistakes there.
If your classifier's error is close to the Bayes error, it is nearly optimal.
Some techniques or principles in machine learning work regardless of which algorithm you use.
Instead of focusing on one algorithm, you can evaluate, validate, or improve models in a way that applies to all classifiers.
Examples are k-fold cross-validation, bagging, and boosting, which can be applied with any classifier.
Algorithm-independent methods let you measure and improve models reliably, no matter which learning algorithm you use.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 67 / 104
Algorithm-Independent Machine Learning

No classifier is inherently best: performance depends on the type of problem, the data and its distribution, and prior knowledge.
There is no inherent superiority of any classifier.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 68 / 104


No Free Lunch Theorem

No learning algorithm is inherently superior to any other when


averaged over all possible problems.
Apparent superiority arises from:
The nature of the specific problem
Data distribution
Amount of training data
Cost/reward functions
No Free Lunch (NFL) says not to trust universal claims that one algorithm is better than another.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 69 / 104


Implications of No Free Lunch

NFL emphasizes that no algorithm is better for generalization in every case.
Error on unseen data (generalization error) is a better measure than training error.
Many algorithms can fit the training data perfectly, but their generalization differs.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 70 / 104


No Free Lunch Theorem – Formal Equations
Consider learning algorithms P(h|D) trained on a dataset D. Let F be the true target function and h(x) the hypothesis produced.
Many target functions, many hypotheses.
Expected generalization error:

E[E | D] = Σ_{h,F} Σ_{x∉D} P(x) [1 − δ(F(x), h(x))] P(h | D) P(F | D)

δ(F(x), h(x)) = 1 if F(x) = h(x), else 0.
Measures the alignment between the algorithm's hypothesis and the true target.

Off-training-set error for a candidate algorithm Pₖ(h|D):

Eₖ(E | F, n) = Σ_{x∉D} P(x) [1 − δ(F(x), h(x))] Pₖ(h(x) | D)

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 71 / 104


No Free Lunch Theorem
For any two learning algorithms P1 (h|D) and P2 (h|D), the following are
true, independent of the sampling distribution P(x) and the number of
training points n:
1 Uniformly averaged over all target functions F :

E1 (E | F , n) − E2 (E | F , n) = 0
2 For any fixed training set D, uniformly averaged over F :
E1 (E | F , D) − E2 (E | F , D) = 0
3 Uniformly averaged over all priors P(F ):
E1 (E | n) − E2 (E | n) = 0
4 For any fixed training set D, uniformly averaged over P(F ):
E1 (E | D) − E2 (E | D) = 0
P1 (h|D), P2 (h|D): Probability of hypotheses produced by the algorithms after training on D.
D: Training dataset (inputs and outputs).
h(x): Hypothesis (model) produced by algorithm.
F : True target function mapping inputs to outputs.
E1 (E |F , n), E2 (E |F , n): Expected generalization error of the algorithms on F with n training points.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 72 / 104
Understanding

Averaged over all target functions:

Σ_F Σ_D P(D | F) [E1(E | F, n) − E2(E | F, n)] = 0

For a fixed training set D:

Σ_F [E1(E | F, D) − E2(E | F, D)] = 0

Algorithm performance depends on the problem and data distribution.


Focus on evaluation metrics and unseen data performance rather than
training accuracy.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 73 / 104


No Free Lunch Theorem – Illustration

D is the training set; the remaining points form the test set.
In this case the test errors of h1 and h2 are 0.4 and 0.6, respectively.
Clearly h1 is better.
Now think of 25 target functions

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 74 / 104


Ugly Duckling Theorem
Importance of domain knowledge
Without prior assumptions, all patterns are equally similar.
No feature set or representation is inherently “better.”
Feature importance or similarity depends on assumptions about the
problem or prior knowledge.
Helps avoid bias when designing classifiers.
Example:
Attributes: f1 = blind in right eye, f2 = blind in left eye
Person A: {1, 0}, Person B: {0, 1}, Person C: {1, 1}
Without assumptions, mathematically all pairs share the same
number of predicates
Hence, no principled reason to say A&B are more similar than A&C
Theorem Statement:
For a finite set of predicates, the number of shared predicates
between any two distinct patterns is constant.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 75 / 104
Ugly Duckling Theorem Example

Three people A, B, C. Features are f1 = blind in right eye and f2 = blind in left eye.
Feature vectors:
A: {1, 0}, B: {0, 1}, C: {1, 1}
Intuition (considering only f1 and f2):
A&C and B&C seem closer than A&B.
Let's add more predicates (logical statements):
p3: Blind in at least one eye
p4: Blind in exactly one eye
p5: Not blind in both eyes
Shared predicates count:
A&B: 3, A&C: 2, B&C: 2
Without assuming which features matter, no pair is inherently “closer”: all pairs are equally similar when considering all predicates.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 76 / 104


Bias-Variance Trade-off: Overview

There is no universally best classifier; performance depends on the


problem distribution.
Two measures of match between algorithm and problem:
Bias: Accuracy of the model. High bias = poor match.
Variance: Precision of the model. High variance = weak match.
Bias and variance are interdependent (bias-variance trade-off).

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 77 / 104


Bias and Variance in Regression

Setup: Let F(x) be the true function, observed with noise. Estimate it from a dataset D using a model g(x; D).
The expected mean-square error over all training sets D is:

E_D[ (g(x; D) − F(x))² ] = (E_D[g(x; D)] − F(x))²  +  E_D[ (g(x; D) − E_D[g(x; D)])² ]
                                   Bias²                          Variance

With noisy targets, the expected error is Bias² + Variance + Noise.
Explanation of terms:
E_D[·]: Expectation over all training sets D.
Bias²: Square of the difference between the average prediction and the true function.
Variance: How much predictions vary across different training sets.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 78 / 104
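A quick Monte Carlo sketch of this decomposition in Python: fit a model of fixed form on many training sets drawn from a noisy F(x) and estimate bias² and variance at a test point (the target function, noise level, and polynomial degree are all made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)    # "true" function (illustrative)
x_test, sigma, n, trials, degree = 0.3, 0.2, 20, 2000, 1

preds = []
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = F(x) + rng.normal(0, sigma, n)    # one training set D
    coeffs = np.polyfit(x, y, degree)     # g(x; D): least-squares polynomial fit
    preds.append(np.polyval(coeffs, x_test))

preds = np.array(preds)
bias_sq = (preds.mean() - F(x_test)) ** 2   # (E_D[g] - F)^2
variance = preds.var()                      # E_D[(g - E_D[g])^2]
print(bias_sq, variance)   # a rigid (degree-1) model: noticeable bias, small variance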


Bias-Variance Dilemma in Regression
(Figure: four regression models fit to three training sets D1, D2, D3: (a) a fixed linear g(x); (b) a better-matched fixed g(x); (c) a learned cubic g(x) = a0 + a1x + a2x² + a3x³; (d) a learned linear g(x) = a0 + a1x. The bottom row shows the resulting bias and variance for each model.)

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 79 / 104


Bias-Variance Dilemma in Regression

Low bias: Model fits the data well on average.
Low variance: Model predictions do not change much across datasets.
Trade-off: Flexible models → lower bias but higher variance; rigid models → higher bias, lower variance.
Example models:
a) Fixed linear model → high bias, zero variance
b) Better fixed model → lower bias, zero variance
c) Cubic model, trainable → low bias, moderate variance
d) Linear model, trainable → intermediate bias and variance

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 80 / 104


Bias and Variance in Classification

Two-class problem: y ∈ {0, 1}, with true discriminant function:

F(x) = Pr[y = 1 | x] = 1 − Pr[y = 0 | x]

Estimate g(x; D) by minimizing the mean-square error:

E_D[ (g(x; D) − y)² ]

Boundary error: the probability that the prediction differs from the Bayes classifier:

Pr[g(x; D) ≠ y_B] = ∫_{−∞}^{1/2} p(g(x; D)) dg,   if F(x) ≥ 1/2
Pr[g(x; D) ≠ y_B] = ∫_{1/2}^{∞} p(g(x; D)) dg,    if F(x) < 1/2

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 81 / 104


Boundary Bias and Variance in Classification

Assume p(g(x; D)) is Gaussian:

Pr[g(x; D) ≠ y_B] = Φ( sgn[F(x) − 1/2] · (1/2 − E_D[g(x; D)]) / √Var[g(x; D)] )

Explanation of terms:
Φ[t]: Standard normal cumulative distribution function
sgn[F(x) − 1/2]: Sign indicating which side of the decision boundary
E_D[g(x; D)] − 1/2: Boundary bias
Var[g(x; D)]: Variance
Key insight: In classification, variance usually dominates bias; low variance is crucial for accurate classification.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 82 / 104
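A small Python sketch of the boundary-error expression under the Gaussian assumption, with Φ built from math.erf; the values of F(x), E_D[g], and Var[g] are illustrative:

import math

def std_normal_cdf(t):
    # Phi(t) for the standard normal distribution
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def boundary_error(F_x, mean_g, var_g):
    # Pr[g(x;D) disagrees with the Bayes decision at x], assuming g(x;D) is Gaussian
    t = math.copysign(1.0, F_x - 0.5) * (0.5 - mean_g) / math.sqrt(var_g)
    return std_normal_cdf(t)

# Illustrative point: the Bayes classifier says class 1 (F(x) = 0.8) and the
# estimator is centred on the correct side of 1/2 (E_D[g] = 0.6).
print(boundary_error(F_x=0.8, mean_g=0.6, var_g=0.01))  # ~0.16: low variance keeps boundary errors rare
print(boundary_error(F_x=0.8, mean_g=0.6, var_g=0.25))  # ~0.42: high variance dominates and errors grow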


Bias-Variance Dilemma in Classification
(Figure: a two-dimensional Gaussian problem fit with three model classes: (a) full covariance matrices Σi (low bias, high variance); (b) diagonal Σi (intermediate); (c) identity Σi (high bias, low variance). Each is trained on three datasets D1, D2, D3; the lower rows show the learned decision boundaries and the error histograms.)

Figure 9.5: The (boundary) bias-variance tradeoff in classification can be illustrated with a two-dimensional Gaussian problem. The figure at the top shows the (true) decision boundary of the Bayes classifier. The nine figures in the middle show nine different learned decision boundaries. Each row corresponds to a different training set of n = 8 points selected randomly from the true distributions and labeled according...
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 83 / 104
Practical Implications

Flexible models (many parameters) → lower bias, higher variance
Rigid models (few parameters) → higher bias, lower variance
Large training sets reduce variance
Accurate prior knowledge reduces bias
Matching model complexity to the unknown true distribution is
critical for minimizing generalization error

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 84 / 104


Jackknife Resampling: Motivation
Estimating error for statistics beyond the mean (e.g., median, mode,
percentiles).
For the mean of a dataset D = {x1, x2, . . . , xn}:

µ̂ = (1/n) Σ_{i=1}^{n} xᵢ

Variance of the mean:

σ̂² = 1/(n(n − 1)) Σ_{i=1}^{n} (xᵢ − µ̂)²

For statistics like the median or mode, error estimation is not


straightforward.
Jackknife and Bootstrap resampling help generalize variance
estimation to any statistic.
Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 85 / 104
Leave-One-Out Concept

Remove one data point at a time and compute the statistic.


Leave-one-out mean:

µ(i) = 1/(n − 1) Σ_{j≠i} xⱼ = (n x̄ − xᵢ)/(n − 1)

This is the mean of the dataset if the i-th point is left out.
Repeat for all i = 1, . . . , n to get n leave-one-out estimates.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 86 / 104


Jackknife Estimate of the Mean

The jackknife estimate of the mean is the average of all leave-one-out means:

µ(·) = (1/n) Σ_{i=1}^{n} µ(i)

Interestingly, µ̂ = µ(·) , so the mean estimate remains the same.


The power of jackknife is in estimating variance for any statistic.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 87 / 104


Jackknife Estimate of Variance

Variance of the estimate:

Var[µ̂] = ((n − 1)/n) Σ_{i=1}^{n} (µ(i) − µ(·))²

Equivalent to the traditional variance for the mean.


Advantage: can be generalized to other statistics.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 88 / 104


Generalization to Any Statistic

For any statistic θ̂ (median, mode, percentiles):

θ̂(i) = θ̂(x1, . . . , xi−1, xi+1, . . . , xn)

θ̂(·) = (1/n) Σ_{i=1}^{n} θ̂(i)

Compute the statistic leaving out each data point in turn.
Jackknife variance:

Var[θ̂] = ((n − 1)/n) Σ_{i=1}^{n} (θ̂(i) − θ̂(·))²

This method provides an error estimate for any statistic.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 89 / 104


Jackknife Bias Estimate: Introduction

Bias of an estimator:
bias = θ − E[θ̂]

Measures the difference between the true value θ and the expected
value of the estimator θ̂.
Jackknife can estimate bias for any statistic, not just the mean.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 90 / 104


Jackknife Bias: Computation

Sequentially delete each point xi from the dataset D, compute the leave-one-out estimates θ̂(i), and average them to obtain θ̂(·).
Jackknife estimate of bias:

bias_jack = (n − 1) (θ̂(·) − θ̂)

Bias-corrected estimate of θ:

θ̃ = θ̂ − bias_jack = n θ̂ − (n − 1) θ̂(·)

Benefit: Provides an approximately unbiased estimate of the true statistic.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 91 / 104
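A generic jackknife sketch in Python for an arbitrary statistic (the sample data are made up; the median and the mean are used as example statistics):

import statistics

def jackknife(data, stat):
    # Return (estimate, jackknife bias, jackknife variance) of stat on data
    n = len(data)
    theta_hat = stat(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]   # leave-one-out estimates theta_(i)
    theta_dot = sum(loo) / n
    bias = (n - 1) * (theta_dot - theta_hat)
    var = (n - 1) / n * sum((t - theta_dot) ** 2 for t in loo)
    return theta_hat, bias, var

data = [3.1, 4.2, 2.8, 5.0, 3.7, 4.4, 2.9, 3.3]
print(jackknife(data, statistics.median))
print(jackknife(data, statistics.mean))   # for the mean, the variance matches the usual estimate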


Jackknife Variance Estimate: Introduction

Traditional variance:

Var[θ̂] = E[ (θ̂ − E[θ̂])² ]

Measures how much the estimator θ̂ varies across different samples.


Jackknife provides an analogous way to estimate variance for any
statistic.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 92 / 104


Jackknife Variance: Computation

Jackknife variance estimate:

Var_jack[θ̂] = ((n − 1)/n) Σ_{i=1}^{n} (θ̂(i) − θ̂(·))²

θ̂(i) = statistic computed leaving out the i-th data point.
θ̂(·) = (1/n) Σ_{i=1}^{n} θ̂(i) = average of the leave-one-out estimates.

Provides a reliable estimate of the variance for arbitrary statistics.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 93 / 104


Jackknife Example: Mode Estimation

Dataset: D = {0, 10, 10, 10, 20, 20} (n = 6)


Goal: Estimate the mode of the dataset using jackknife.
Standard mode: θ̂ = 10 (most frequent value).
Jackknife considers leave-one-out resampling to provide bias and
variance estimates.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 94 / 104


Leave-One-Out Mode Estimates

Leave-one-out estimates: θ̂(i)


Compute mode after removing each data point:

θ̂(1) = 10, θ̂(2,3,4) = 15, θ̂(5,6) = 10

Note: When two peaks are equal, the mode is taken as the midpoint.
Jackknife estimate of the mode:

θ̂(·) = (1/6) Σ_{i=1}^{6} θ̂(i) = (10 + 15 + 15 + 15 + 10 + 10)/6 = 12.5

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 95 / 104


Interpretation of Jackknife Mode

Standard mode θ̂ = 10 ignores skew in the distribution.


Jackknife estimate θ̂(·) = 12.5 accounts for the full skew.
The difference indicates the bias in the naive mode estimate.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 96 / 104


Jackknife Bias of the Mode

Bias estimate:

biasjack = (n − 1)(θ̂(·) − θ̂) = 5(12.5 − 10) = 12.5

This shows how much the naive mode underestimates the “true center” when all points are considered.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 97 / 104


Jackknife Variance of the Mode

Variance estimate:

Var_jack[θ̂] = ((n − 1)/n) Σ_{i=1}^{n} (θ̂(i) − θ̂(·))²

= (5/6) [ (10 − 12.5)² + 3·(15 − 12.5)² + 2·(10 − 12.5)² ] = 31.25

Standard deviation: √31.25 ≈ 5.6
Twice this width can be used as a tolerance range for the mode.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 98 / 104
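The mode example above can be reproduced with a short Python script; the mode-with-midpoint-on-ties rule follows the convention stated on the previous slides:

from collections import Counter

def mode_midpoint(data):
    # Mode of data; if several values tie for the highest count, return their midpoint
    counts = Counter(data)
    top = max(counts.values())
    modes = [v for v, c in counts.items() if c == top]
    return sum(modes) / len(modes)

D = [0, 10, 10, 10, 20, 20]
n = len(D)
theta_hat = mode_midpoint(D)                                   # 10
loo = [mode_midpoint(D[:i] + D[i + 1:]) for i in range(n)]     # [10, 15, 15, 15, 10, 10]
theta_dot = sum(loo) / n                                       # 12.5
bias = (n - 1) * (theta_dot - theta_hat)                       # 12.5
var = (n - 1) / n * sum((t - theta_dot) ** 2 for t in loo)     # 31.25
print(theta_hat, theta_dot, bias, var)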


Summary: Jackknife Mode Example

Dataset: D = {0, 10, 10, 10, 20, 20}


Mode: θ̂ = 10
Jackknife estimate: θ̂(·) = 12.5
Bias: 12.5, Variance: 31.25, Std. deviation: 5.6
Visualization: a histogram with a red bar indicating ±2√Var_jack shows that the traditional mode lies within the tolerance of the jackknife estimate.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 99 / 104


Bootstrap Resampling: Introduction

Bootstrap: A resampling method to estimate statistics and their error.


Create a bootstrap dataset by randomly selecting n points from the
original dataset D with replacement.
Some points may be duplicated; some may be omitted.
Repeat this process independently B times to get B bootstrap
datasets.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 100 / 104


Bootstrap Estimate of a Statistic

Notation: θ̂∗ (b) = estimate of statistic θ on bootstrap sample b.


Bootstrap estimate of θ:

θ̂*(·) = (1/B) Σ_{b=1}^{B} θ̂*(b)

Essentially the mean of all B bootstrap estimates.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 101 / 104


Bootstrap Bias Estimate

Bias of the statistic:

bias_boot = θ̂*(·) − θ̂

Measures the difference between the bootstrap mean and the original
estimate.
Can be applied to complex statistics like the trimmed mean.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 102 / 104


Bootstrap Variance Estimate

Variance of the statistic:

Var_boot[θ̂] = (1/B) Σ_{b=1}^{B} ( θ̂*(b) − θ̂*(·) )²

For the mean, as B → ∞, Var_boot converges to the traditional variance of the mean.
Larger B → more accurate estimate; smaller B → faster computation but a noisier estimate.
Advantage over jackknife: B can be adjusted based on computational resources.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 103 / 104
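A minimal bootstrap sketch in Python; the dataset, the statistic (median), and B are illustrative:

import random
import statistics

def bootstrap(data, stat, B=1000, seed=0):
    # Return the bootstrap estimate, bias, and variance of stat on data
    rng = random.Random(seed)
    theta_hat = stat(data)
    boot = [stat([rng.choice(data) for _ in data]) for _ in range(B)]  # B resamples with replacement
    theta_star = sum(boot) / B
    bias = theta_star - theta_hat
    var = sum((t - theta_star) ** 2 for t in boot) / B
    return theta_star, bias, var

data = [3.1, 4.2, 2.8, 5.0, 3.7, 4.4, 2.9, 3.3]
print(bootstrap(data, statistics.median))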


Bootstrap vs Jackknife

Jackknife: Leave-one-out, requires exactly n repetitions,


deterministic.
Bootstrap: Random sampling with replacement, B repetitions,
adjustable.
Bootstrap works well for statistics that are difficult to analyze
analytically (e.g., median, trimmed mean, percentiles).
Jackknife is simpler but may underestimate variance for highly skewed
statistics.

Dr. Jayakrishnan Ananadakrishnan 24ASD641 Pattern Recognition 104 / 104
