Lecture 1: Machine Learning Basics
Siddharth Garg
[email protected]
This Course…
Growing use of ML techniques in cyber-security applications:
• Social network deanonymization
• Spam filtering
• Biometrics
• Browser fingerprinting
• Malware detection
• Automated evasion
• Network intrusion detection
This Course…
Vulnerabilities in ML/AI deployments:
• Bias and fairness
• Interpretability
• Accountability and transparency
• Model privacy
• Adversarial perturbations
• Training data poisoning attacks
What is Machine Learning?
• The ability of machines to learn without being explicitly programmed

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." --- Mitchell, T. (1997). Machine Learning. McGraw Hill. p. 2.

• Why not use user knowledge, experience or expertise?
• Are humans always able to explain their expertise?
• Can machines outperform humans?

• What kinds of experiences (E), tasks (T) and performance measures (P)?
Example: MNIST Digit Recognition
Task (T):
• Given gray-scale images $x$ and labels $y$, find a function $f: x \rightarrow y$
Experience (E):
• A "training dataset": a set of correctly labeled images
Performance (P):
• Accuracy on a "test dataset"

https://www.npmjs.com/package/mnist
"Supervised Learning (Classification)"
Example: Spam Classification
Task (T):
• Given emails $x$ and labels $y$, find a function $f: x \rightarrow y$
Experience (E):
• A "training dataset": emails marked as "spam" or "non_spam"
Performance (P):
• Spam detection accuracy

"Supervised Learning (Classification)"


Some Challenges
Representing Data (or Feature Extraction)
• How do we represent an email mathematically?
• One example is the "bag of words" representation: count the number of times each word in the dictionary occurs (see the sketch below)
• What do you lose?
• What do you gain?
• How can we compress this representation further?

What kind of classifier?
• What does the function f look like?
• And how do we learn its parameters?
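To make the bag-of-words idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer; the example emails are made up for illustration.

```python
# Minimal bag-of-words sketch; the example emails are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "win a free prize now",
    "meeting agenda for tomorrow",
    "free free prize click now",
]

vectorizer = CountVectorizer()             # the "dictionary" is learned from the data
X = vectorizer.fit_transform(emails)       # sparse N x |vocabulary| matrix of word counts

print(vectorizer.get_feature_names_out())  # the learned dictionary
print(X.toarray())                         # counts per email; note that word order is lost
```

What is lost is word order and context; what is gained is a fixed-length numeric vector that any classifier can consume.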
Example: Clustering
Task (T): "Cluster" a set of documents into k groups such that "similar" documents appear in the same group

Experience (E):
• A "training dataset" of documents without "labels"

Performance (P):
• Average distance to cluster center

"Unsupervised Learning"
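As a small illustration of this task, here is a k-means sketch on toy documents; the documents and the choice k = 2 are made up.

```python
# Sketch: k-means clustering of documents (toy data; k = 2 chosen for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats and dogs", "dogs and puppies", "stocks and bonds", "bonds and markets"]
X = TfidfVectorizer().fit_transform(docs)   # turn documents into numeric vectors

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # cluster assignment for each document
print(km.inertia_)   # total squared distance to cluster centers (related to P above)
```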
Example: Anomaly Detection
Task (T):
• Which of these is not like the others?

Experience (E):
• Unlabeled samples

Performance (P):
• Anomaly detection accuracy

“Unsupervised Learning”
Regression
Task (T):
• Given $x$ and $y$, find a linear function $f: x \rightarrow y$
Experience (E):
• Training data: points $(x_i, y_i)$
Performance (P):
• Least squares fit: minimize mean square error between prediction and ground-truth

[S. Rangan, EL-GY-9123 Lec 2]
"Supervised Learning (Regression)"
Linear Least Squares Regression
$$y = f(x) = \beta_1 x + \beta_0$$

• How do we find the values $\beta_1, \beta_0$?

$$\min_{\beta_1, \beta_0} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \beta_1 x_i + \beta_0 \;\; \forall i \in [1, N]$$
Linear Least Squares Regression
$$y = f(x) = \beta_1 x + \beta_0$$

• How do we find the values $\beta_1, \beta_0$? Call the objective $g(\beta_1, \beta_0)$:

$$\min_{\beta_1, \beta_0} g(\beta_1, \beta_0) = \min_{\beta_1, \beta_0} \sum_{i=1}^{N} (y_i - \beta_1 x_i - \beta_0)^2, \qquad \hat{y}_i = \beta_1 x_i + \beta_0 \;\; \forall i \in [1, N]$$

Set the partial derivatives to zero:

$$\frac{\partial g}{\partial \beta_1} = 0, \qquad \frac{\partial g}{\partial \beta_0} = 0$$
Linear Least Squares Regression
$$y = f(x) = \beta_1 x + \beta_0$$

• How do we find the values $\beta_1, \beta_0$?

Residual Sum of Squares (RSS): $g(\beta_1, \beta_0) = \sum_{i=1}^{N} (y_i - \beta_1 x_i - \beta_0)^2$

$$\frac{\partial g}{\partial \beta_0} = \sum_{i=1}^{N} -2\,(y_i - \beta_1 x_i - \beta_0) = 0
\;\;\Rightarrow\;\;
\beta_0 = \frac{\sum_{i=1}^{N} (y_i - \beta_1 x_i)}{N} = \bar{y} - \beta_1 \bar{x}$$

(where $\bar{x}, \bar{y}$ are the sample means)

Are you surprised?
Linear Least Squares Regression
• How do we find the values $\beta_1, \beta_0$?

$$\frac{\partial g}{\partial \beta_1} = \sum_{i=1}^{N} -2\,x_i (y_i - \beta_1 x_i - \beta_0) = 0
\;\;\Rightarrow\;\;
\sum_{i=1}^{N} x_i \left(y_i - \beta_1 x_i - \bar{y} + \beta_1 \bar{x}\right) = 0$$

$$\beta_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{\text{sample covariance}}{\text{sample variance}}$$
Auto Example
• Python code fitting a regression line to the Auto dataset (figure: scatter plot with the fitted regression line)
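The slide's original Python code is not reproduced here; below is a minimal sketch of the closed-form fit just derived, run on synthetic data standing in for the Auto dataset.

```python
# Minimal sketch of the closed-form simple linear regression derived above,
# on synthetic data (the Auto dataset from the slide is not reproduced here).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)   # assumed "true" line plus noise

# beta1 = sample covariance / sample variance; beta0 = ybar - beta1 * xbar
beta1 = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta1, beta0)   # should come out close to 2.0 and 1.0
```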
Linear Least Squares (Multivariate)
• Now consider input $x \in \mathbb{R}^M$ and output $y \in \mathbb{R}$; the goal is to learn

$$y = f(x) = \beta_M x_M + \ldots + \beta_1 x_1 + \beta_0$$

• Given a training dataset $X \in \mathbb{R}^{N \times (M+1)}$ (see note below) and $Y \in \mathbb{R}^N$:

$$\hat{Y} = X\beta, \qquad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1M} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_M \end{bmatrix}, \quad
Y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$$

(each row of X is one training sample)
Note: for simplicity we assume that X includes a column of 1s.
Linear Least Squares (Multivariate)

$$\text{RSS} = (Y - \hat{Y})^T (Y - \hat{Y}) = (Y - X\beta)^T (Y - X\beta)$$

Objective: $\min_{\beta} \; (Y - X\beta)^T (Y - X\beta)$

Solution: $\beta^* = (X^T X)^{-1} X^T Y$
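A short numpy sketch of this solution on synthetic data; the dimensions and coefficients below are made up.

```python
# Sketch: multivariate least squares via the normal equations (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M))])  # prepend the column of 1s
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

beta = np.linalg.solve(X.T @ X, X.T @ Y)   # beta* = (X^T X)^{-1} X^T Y
# In practice np.linalg.lstsq(X, Y, rcond=None) is preferred numerically.
print(beta)
```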
Following slides are from Prof. Sundeep Rangan’s Intro to ML Class.

Polynomial Fitting
• Last lecture: polynomial regression
• Given data $(x_i, y_i)$, $i = 1, \ldots, N$
• Learn a polynomial relationship: $\hat{y} = \beta_0 + \beta_1 x + \ldots + \beta_d x^d$

• $d$ = degree of the polynomial, called the model order
• $\beta = (\beta_0, \ldots, \beta_d)$ = coefficient vector
• Given $d$, we can find $\beta$ via least squares
• How do we select $d$ from data?
• This problem is called model order selection.
Example Question
• You are given some data.
• Want to fit a model: $\hat{y} = f(x)$
• Decide to use a polynomial of degree $d$

• What model order should we use?
• Thoughts?
Synthetic Data
• Previous example is synthetic data
• $x$: 40 samples uniform in [-1,1]
• $y = f_0(x) + \varepsilon$, where $f_0$ is the "true relation" (a cubic polynomial; see next slide) and $\varepsilon$ is noise

• Synthetic data is useful for analysis
• Know "ground truth"
• Can measure performance of various estimators
Fitting with True Model Order
• Suppose the true polynomial order, d=3, is known
• Use linear regression
• numpy.polynomial package
• Get very good fit
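A sketch of such a fit with the numpy.polynomial package; the synthetic cubic below is an assumed stand-in for the slide's true relation.

```python
# Sketch: fitting a degree-3 polynomial with numpy.polynomial.
# The cubic below is an assumed stand-in for the slide's "true relation".
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
y = 1 - 2 * x + 3 * x**3 + rng.normal(scale=0.1, size=40)

beta = P.polyfit(x, y, deg=3)     # coefficients, lowest degree first
y_hat = P.polyval(x, beta)        # predictions at the training points
print(beta)
```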
But, True Model Order not Known
• Suppose we guess the wrong model order?

d=1: "Underfitting"    d=10: "Overfitting"
How Can You Tell from Data?
• Is there a way to tell what is the correct model order to use?
• Must use the data; we do not have access to the true model order d
• What happens if we guess d:
• too big?
• too small?
Using RSS on Training Data?
• Simple (but bad) idea:
• For each model order d, find the estimate $\hat{\beta}$
• Compute predicted values $\hat{y}_i$ on the training data
• Compute the training RSS(d)
• Find the d with the lowest RSS(d)

• This doesn't work
• Training RSS(d) is always decreasing in d (Question: Why?)
• Minimizing it will pick d as large as possible
• Leads to overfitting
• What went wrong?
• How do we do better?
Model Class and True Function
• Analysis set-up:
• Learning algorithm assumes a model class: $\hat{y} = f(x, \beta)$
• But, data has a true relation: $y = f_0(x) + \varepsilon$

• Will quantify three key effects:
• Irreducible error
• Under-modeling
• Over-fitting
Output Mean Squared Error
• To evaluate prediction error suppose we are given:
• A parameter estimate $\hat{\beta}$ (computed from the learning algorithm)
• A test point $x$
• Test point is generally different from training samples.
• Predicted value: $\hat{y} = f(x, \hat{\beta})$
• Actual value: $y = f_0(x) + \varepsilon$
• Output mean squared error: $\mathrm{MSE} = E[(y - \hat{y})^2]$

• Expectation is over noise on the test sample.
Irreducible Error
• Rewrite the output MSE: $E[(y - \hat{y})^2] = E[(f_0(x) + \varepsilon - f(x, \hat{\beta}))^2]$

• Since the noise $\varepsilon$ on the test sample is independent of $f_0(x)$ and $f(x, \hat{\beta})$ (and zero-mean):

$$E[(y - \hat{y})^2] = E[\varepsilon^2] + E[(f_0(x) - f(x, \hat{\beta}))^2]$$

• Define the irreducible error: $\sigma^2 = E[\varepsilon^2]$
• Lower bound on the output MSE
• Fundamental limit on the ability to predict $y$
• Occurs since $y$ is influenced by factors other than $x$
Analysis with Noise (Advanced)
• Now assume noise: $y_i = f_0(x_i) + \varepsilon_i$
• Get training data: $(x_i, y_i)$, $i = 1, \ldots, N$
• Fit a parameter: $\hat{\beta}$

• $\hat{\beta}$ will be random.
• Depends on the particular noise realization.
• Take a new test point $x$ (not random)
• Compute the mean and variance of the estimated function $f(x, \hat{\beta})$
• Define:
• Bias: Difference of the true function from the mean estimate
• Variance: Variance of the estimate around its mean
Bias and Variance Illustrated
• Polynomial example
• Mean and std dev of estimated functions
• 100 trials

(Figure: left panel shows low variance, high bias; right panel shows high variance, zero bias.)
Bias-Variance Tradeoff

Simpler models: fewer parameters, risk of under-fitting.
Richer models: more parameters, risk of over-fitting.
Cross Validation
• Concept: Need to test the fit on data independent of the training data
• Divide the data into two sets: training samples and validation samples
• For each model order d, learn parameters from the training samples
• Measure RSS on the validation samples.
• Select the model order that minimizes the validation RSS
Finding the Model Order
• Estimated optimal model order = 3

(Figure: test RSS is minimized at d = 3; training RSS always decreases with d.)
Problems with Simple Train/Test Split
• Test error could vary significantly depending on the samples selected
• Only a limited number of samples is used for training
• Problems are particularly bad for datasets with a limited number of samples

From http://blog.goldenhelix.com/goldenadmin/cross-validation-for-genomic-prediction-in-svs/

K-Fold Cross Validation

• K-fold cross validation
• Divide the data into K parts
• Use K−1 parts for training. Use the remaining part for test.
• Average over the K test choices
• More accurate, but requires K fits of the parameters

• Leave one out cross validation (LOOCV)
• Take K = N, so one sample is left out each time.
• Most accurate, but requires N model fittings
Polynomial Example
• Use the sklearn KFold object
• Loop
• Outer loop: Over K folds
• Inner loop: Over model order
• Measure test error in each fold and order (see the sketch below)
• Can be time-consuming
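A sketch of this nested loop, using the same synthetic cubic as before; the fold count and order range are illustrative.

```python
# Sketch of the nested loop described above: K folds outside, model order inside.
import numpy as np
from numpy.polynomial import polynomial as P
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
y = 1 - 2 * x + 3 * x**3 + rng.normal(scale=0.1, size=40)  # assumed cubic truth

orders = range(1, 11)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
rss = np.zeros((kf.get_n_splits(), len(orders)))

for k, (tr, te) in enumerate(kf.split(x)):          # outer loop: folds
    for j, d in enumerate(orders):                  # inner loop: model order
        beta = P.polyfit(x[tr], y[tr], deg=d)
        rss[k, j] = np.sum((y[te] - P.polyval(x[te], beta)) ** 2)

print(list(orders)[int(rss.mean(axis=0).argmin())])  # order with lowest mean test RSS
```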
Polynomial Example CV Results
• For each model order d
• Compute the mean test RSS
• Compute the std error (SE) of the test RSS
• SE = std dev / $\sqrt{K}$
• Mean and SE computed over the folds

• Simple model selection
• Select the d with the lowest mean test RSS
• For this example
• Estimated model order = 3
Binary Classification
Binary Classification Task (T):
• Simplest example: $x \in \mathbb{R}$ and $y \in \{0, 1\}$ ($y$ is a "categorical variable")
• Dataset of ICLR'18 review scores vs. accept/reject decisions
• Can you fit a linear model to this data?
Logistic Regression
Binary Classification Task (T):
• Instead, let's compute and plot $p = \Pr\{y = 1 \mid x\}$ (figure: Pr{Decision=Accept|Score} vs. score $x$)

• Idea: Linear regression to fit p as a function of x:
$$p = \beta_1 x + \beta_0$$
• Is this a good idea?
• Probability p is always bounded between [0,1]
Logistic Regression
Binary Classification Task (T):
• Consider the "logits" function: $g = \log\left(\frac{p}{1-p}\right)$
• What is the range of g? $g \in [-\infty, \infty]$
• Logistic Regression: fit the logits function using a linear model!
$$g = \log\left(\frac{p}{1-p}\right) = \beta_1 x + \beta_0$$

(Figure: ground-truth logits vs. a linear fit.)
Note: the linear fit is illustrative only. How to determine the best linear fit will be discussed next!
Logistic Regression
(Figure: Pr{Decision=Accept|Score}.)

$$g = \log\left(\frac{p}{1-p}\right) = \beta_1 x + \beta_0 \quad\Longrightarrow\quad p = \frac{1}{1 + e^{-(\beta_1 x + \beta_0)}}$$

• What is Pr{Decision=Reject|Score}?

$$1 - p = \frac{e^{-(\beta_1 x + \beta_0)}}{1 + e^{-(\beta_1 x + \beta_0)}}$$

How do we find the model parameters β1 and β0?


Model Estimation
• We will use an approach referred to as Maximum Likelihood Estimation (MLE)
• Let's assume that the model (i.e., β1 and β0) is magically known. Consider the training dataset below. What is the likelihood that the dataset came from our model?

#   X        Y
1   x1 = 3   y1 = 0
2   x2 = 8   y2 = 1
..  ..       ..
N   xN = 6   yN = 1

$$\text{Likelihood} = \frac{e^{-(\beta_1 x_1 + \beta_0)}}{1 + e^{-(\beta_1 x_1 + \beta_0)}} \times \frac{1}{1 + e^{-(\beta_1 x_2 + \beta_0)}} \times \ldots \times \frac{1}{1 + e^{-(\beta_1 x_N + \beta_0)}}$$

(each y = 0 sample contributes 1 − p, and each y = 1 sample contributes p)
Model Estimation
• Plugging the training values into the likelihood:

$$\text{Likelihood} = \frac{e^{-(3\beta_1 + \beta_0)}}{1 + e^{-(3\beta_1 + \beta_0)}} \times \frac{1}{1 + e^{-(8\beta_1 + \beta_0)}} \times \ldots \times \frac{1}{1 + e^{-(6\beta_1 + \beta_0)}}$$
Model Estimation
• Taking logs turns the product into a sum:

$$\text{Log-Likelihood} = \log\left(\frac{e^{-(3\beta_1 + \beta_0)}}{1 + e^{-(3\beta_1 + \beta_0)}}\right) + \log\left(\frac{1}{1 + e^{-(8\beta_1 + \beta_0)}}\right) + \ldots + \log\left(\frac{1}{1 + e^{-(6\beta_1 + \beta_0)}}\right) = g(\beta_1, \beta_0)$$

• g is a function of the model parameters only
• Find β1 and β0 that maximize g, or equivalently minimize the "loss":
$$\text{Loss}(\beta_1, \beta_0) = -g(\beta_1, \beta_0)$$
We Won't Worry About How (Phew!)
(Figure: ground-truth probabilities and the fitted LR curve.)

From regression to classification: if probability of Accept > 0.5, then output Accept.
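In practice, the MLE fit is a single library call; here is a minimal sketch with scikit-learn, where the scores and labels are hypothetical stand-ins for the ICLR data.

```python
# Minimal sketch: logistic regression on 1-D scores (made-up data standing in
# for the ICLR review-score example; sklearn performs the MLE fit for us).
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])  # hypothetical scores
accept = np.array([0, 0, 0, 1, 1, 1])                          # 1 = Accept

clf = LogisticRegression()
clf.fit(scores, accept)

print(clf.coef_, clf.intercept_)    # beta1 and beta0
print(clf.predict_proba([[5.5]]))   # [Pr{Reject}, Pr{Accept}] at score 5.5
print(clf.predict([[5.5]]))         # thresholded at 0.5
```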
Logistic Regression: Multi-Variate Case
UCI Spam Dataset: https://archive.ics.uci.edu/ml/datasets/Spambase

• 57 real- or integer-valued features
• Binary output class

$$p_{\text{spam}} = \frac{1}{1 + e^{-\left(\sum_{i=1}^{M} \beta_i x_i + \beta_0\right)}}$$
LR on Spam Database: Results
90% of samples used for training, remaining 10% used for test

(Figure: prediction probabilities for all "SPAM" emails in the test set.)

Which emails are mis-predicted?

Accuracy on test set: ~92%
Which Features Matter?
Our Model:
$$p_{\text{spam}} = \frac{1}{1 + e^{-\left(\sum_{i=1}^{M} \beta_i x_i + \beta_0\right)}}$$

What does $\beta_i = 0$ imply about feature i?

• Reasonable hypothesis: features with larger absolute values of $\beta$ matter more.
(Figure: coefficient magnitudes; char_freq_$, "cs", and "George" are among the most influential features.)
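A sketch of fitting LR on Spambase and ranking features by |β|; it assumes spambase.data (57 feature columns plus a label column, comma-separated) has already been downloaded from the UCI page above.

```python
# Sketch: fit LR on Spambase and rank features by coefficient magnitude.
# Assumes spambase.data has been downloaded locally from the UCI page above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]                 # 57 features, binary label
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.1, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
print("test accuracy:", clf.score(Xte, yte))     # roughly 0.92 on the slide

order = np.argsort(np.abs(clf.coef_.ravel()))[::-1]
print("most influential feature indices:", order[:5])
```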
Feature Selection
Retrain and predict using only the top-k features
(Figure: test accuracy vs. k; 80% accuracy using only 3 features.)

Can we explicitly train the parameters so as to prioritize a "sparser" model?
Why? Low model complexity prevents overfitting!!

Recall that during training we were seeking to minimize:
$$\hat{\beta} = \arg\min_{\beta} \text{Loss}(\beta)$$

How should this objective function change?


Regularization
The Lp norm of a vector x is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$.

p    Lp Norm                                          Interpretation
0    $\|x\|_0 = \left(\sum_i |x_i|^0\right)^{1/0}$    Number of non-zero entries
1    $\|x\|_1 = \sum_i |x_i|$                         Sum of absolute values
2    $\|x\|_2 = \left(\sum_i |x_i|^2\right)^{0.5}$    Root mean square
∞    $\|x\|_\infty = \max_i |x_i|$                    Max. value

"Regularized" loss: $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_0\}$, where c controls the relative importance of the regularization penalty.
Regularization In Practice
L0 Regularization: $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_0\}$ is a hard "combinatorial" optimization problem!

Instead, the following regularization functions are commonly used:

L1 Regularization (LASSO): $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_1\}$
L2 Regularization (Ridge): $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_2\}$

We are penalizing "large" coefficients. But why?
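In scikit-learn the penalty is selected by name and weighted by C, which is the inverse of the c above; a sketch on synthetic data showing that L1 drives coefficients exactly to zero:

```python
# Sketch: L1 vs L2 penalties in LogisticRegression on synthetic data.
# C is the inverse of the penalty weight c, so small C = strong regularization.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                    # only 2 of 20 features matter
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("L1 zero coefficients:", int((l1.coef_ == 0).sum()))   # many exact zeros
print("L2 zero coefficients:", int((l2.coef_ == 0).sum()))   # typically none
```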
LASSO and Ridge Regularization
$\beta = [\beta_1, \beta_2]$

(Figure: contours of the loss function $\text{Loss}(\beta)$ in the $(\beta_1, \beta_2)$ plane, overlaid with the contour of the LASSO penalty. The L1 contour has corners on the axes, so the constrained optimum tends to land on an axis, zeroing out a coefficient.)

LASSO prefers sparse solutions!
Regularization for Spam Classification

Which regularization function to use?
How should we select c?


Impact of C
(Figure: accuracy for Ridge (L2) and Lasso (L1) as c varies, best result marked; decreasing the penalty weight corresponds to increasing model complexity.)


Errors in Binary Classification
• Two types of errors:
• Type I error (False positive / false alarm): Decide $\hat{y} = 1$ when $y = 0$
• Type II error (False negative / missed detection): Decide $\hat{y} = 0$ when $y = 1$
• Implications of these errors may be different
• Think of breast cancer diagnosis
• Accuracy of a classifier can be measured by the rates of these two errors (the false positive rate, FPR, and the true positive rate, TPR)

[Remaining Slides from Prof. Rangan's Intro to ML Class]
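A sketch of counting the two error types with scikit-learn's confusion matrix; the labels and predictions below are made up.

```python
# Sketch: counting the two error types with a confusion matrix (made-up labels).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (Type I):", fp)   # predicted 1 when the truth is 0
print("false negatives (Type II):", fn)  # predicted 0 when the truth is 1
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
```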
ROC Curve
• Varying the threshold obtains a set of classifiers
• Trades off FPR and TPR
• Can visualize with the ROC curve
• Receiver operating characteristic curve
• Term from digital communications
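A sketch of tracing the ROC curve by sweeping the threshold; the scores below are hypothetical predicted probabilities.

```python
# Sketch: sweeping the decision threshold to trace an ROC curve
# (scores here are hypothetical predicted probabilities).
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))        # one (FPR, TPR) point per threshold
print("AUC:", auc(fpr, tpr))      # area under the ROC curve
```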
