Lecture 1: Machine Learning
Basics
Siddharth Garg
[email protected]
This Course…
Growing use of ML techniques in cyber-security applications:
• Social network deanonymization
• Spam filtering
• Biometrics
• Browser fingerprinting
• Malware detection
• Automated evasion
• Network intrusion detection
This Course…
Bias and fairness Spam filtering
Vulnerabilities in
ML/AI deployments Interpretability
Accountability and
transparency
Model privacy
Adversarial
Training data
perturbations
poisoning attacks
What is Machine Learning?
• Ability for machines to learn without being explicitly programmed
"A computer program is said to learn from experience E with
respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured
by P, improves with experience E.” --- Mitchell, T. (1997).
Machine Learning. McGraw Hill. p. 2.
• Why not just use human knowledge, experience, or expertise?
• Are humans always able to explain their expertise?
• Can machines outperform humans?
• What kinds of experiences (E), tasks (T) and performance measures (P)?
Example: MNIST Digit Recognition
Task (T):
• Given gray-scale images x and labels y, find a function f : x → y
Experience (E):
• A "training dataset": a set of correctly labeled images
Performance (P):
• Accuracy on a “test dataset”
https://www.npmjs.com/package/mnist
"Supervised Learning (Classification)"
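Below is a minimal sketch of this (T, E, P) setup in Python. It is illustrative only: it uses scikit-learn's small built-in digits dataset (8×8 images) as a stand-in for full MNIST, and an off-the-shelf classifier.

```python
# A minimal sketch of the (T, E, P) setup for digit recognition, assuming
# scikit-learn is available. The small built-in digits dataset stands in
# for full MNIST to keep the example self-contained.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)            # gray-scale images x, labels y

# Experience (E): a labeled training dataset; performance (P) is measured
# on a held-out test dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Task (T): learn a function f : x -> y.
f = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, f.predict(X_test)))
```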
Example: Spam Classification
Task (T):
• Given emails x and labels y, find f : x → y
Experience (E):
• A "training dataset" of emails marked as "spam" or "non-spam"
Performance (P):
• Spam detection accuracy
“Supervised Learning (Classification)”
Some Challenges
Representing Data (or Feature Extraction)
• How do we represent an email mathematically?
• One example is the "bag of words" representation: count how many times each word in the dictionary occurs (a minimal sketch follows this list)
• What do you lose?
• What do you gain?
• How can we compress this representation further?
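A minimal bag-of-words sketch, assuming scikit-learn; the example emails are made up for illustration.

```python
# A minimal bag-of-words sketch, assuming scikit-learn is available.
# The example emails are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "win a free prize now",
    "meeting agenda for next week",
    "free prize waiting claim now",
]

vectorizer = CountVectorizer()                 # dictionary built from the corpus
X = vectorizer.fit_transform(emails)           # rows = emails, cols = word counts

print(vectorizer.get_feature_names_out())      # the learned dictionary
print(X.toarray())                             # each email as a count vector
# Word order is lost (what you lose); a fixed-length numeric vector is gained;
# dropping rare words or hashing compresses the representation further.
```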
What kind of classifier?
• What does the function f look like?
• And how do we learn its parameters?
Example: Clustering
Task (T): “Cluster” a set of documents into k groups such that
“similar” documents appear in the same group
Experience (E):
• A ”training dataset” of documents
without “labels”
Performance (P):
• Average distance to cluster center
“Unsupervised Learning”
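A minimal sketch of clustering documents into k groups, assuming scikit-learn; the documents are made up for illustration.

```python
# A minimal document-clustering sketch, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock market rally continues",
    "team wins championship game",
    "investors react to market news",
    "player scores in final game",
]

X = TfidfVectorizer().fit_transform(docs)      # unlabeled documents as vectors

# Task (T): cluster into k = 2 groups of "similar" documents.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                          # cluster assignment per document
print(kmeans.inertia_)                         # sum of squared distances to cluster centers
```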
Example: Anomaly Detection
Task (T):
• Which of these is like the others?
Experience (E):
• Unlabeled samples
Performance (P):
• Anomaly detection accuracy
“Unsupervised Learning”
Regression
Task (T):
• Given x and y, find a linear function f : x → y
Experience (E):
• Training data: points (x_i, y_i)
Performance (P):
• Least squares fit: minimize the mean square error between prediction and ground-truth
[S. Rangan, EL-GY-9123 Lec 2]
“Supervised Learning (Regression)”
Linear Least Squares Regression
$y = f(x) = \beta_1 x + \beta_0$
• How do we find the values $\beta_1, \beta_0$?
$$\min_{\beta_1, \beta_0} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \quad \text{where } \hat{y}_i = \beta_1 x_i + \beta_0 \;\; \forall i \in [1, N]$$
Linear Least Squares Regression
$y = f(x) = \beta_1 x + \beta_0$
• How do we find the values $\beta_1, \beta_0$?
$$g(\beta_1, \beta_0) = \sum_{i=1}^{N} (y_i - \beta_1 x_i - \beta_0)^2, \quad \hat{y}_i = \beta_1 x_i + \beta_0 \;\; \forall i \in [1, N]$$
Minimize by setting
$$\frac{\partial g}{\partial \beta_1} = 0, \qquad \frac{\partial g}{\partial \beta_0} = 0$$
Linear Least Squares Regression
$y = f(x) = \beta_1 x + \beta_0$
• How do we find the values $\beta_1, \beta_0$?
Residual Sum of Squares (RSS):
$$g(\beta_1, \beta_0) = \sum_{i=1}^{N} (y_i - \beta_1 x_i - \beta_0)^2$$
Setting $\frac{\partial g}{\partial \beta_0} = \sum_{i=1}^{N} -2\,(y_i - \beta_1 x_i - \beta_0) = 0$ gives
$$\beta_0 = \frac{\sum_{i=1}^{N} (y_i - \beta_1 x_i)}{N} = \bar{y} - \beta_1 \bar{x} \quad \text{(sample means)}$$
Are you surprised?
Linear Least Squares Regression
• How do we find the values $\beta_1, \beta_0$?
$$g(\beta_1, \beta_0) = \sum_{i=1}^{N} (y_i - \beta_1 x_i - \beta_0)^2$$
Setting $\frac{\partial g}{\partial \beta_1} = \sum_{i=1}^{N} -2\,x_i (y_i - \beta_1 x_i - \beta_0) = 0$ and substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$ gives
$$\beta_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} \quad \text{(sample covariance over sample variance)}$$
Auto Example
• Python code (a minimal sketch follows)
• [Figure: regression line fit to the data]
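A minimal sketch of the closed-form fit derived above, assuming numpy. The (x, y) values here are made up; in the Auto example they would come from the dataset (e.g., horsepower vs. mpg).

```python
# A minimal sketch of the closed-form simple linear regression fit,
# assuming numpy; the (x, y) values are illustrative only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta_1 = (mean(xy) - mean(x) mean(y)) / (mean(x^2) - mean(x)^2)
beta1 = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
beta0 = np.mean(y) - beta1 * np.mean(x)        # beta_0 = ybar - beta_1 * xbar

print("Regression line: y = %.3f x + %.3f" % (beta1, beta0))
```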
Linear Least Squares (Multivariate)
• Now consider input $x \in \mathbb{R}^M$ and output $y$; the goal is to learn
$$y = f(x) = \beta_M x_M + \dots + \beta_1 x_1 + \beta_0$$
• Given a training dataset $X$ (one row per training sample) and $Y$:
$$\hat{Y} = X\beta, \quad X = \begin{bmatrix} 1 & x_{01} & x_{02} & \dots & x_{0M} \\ 1 & x_{11} & x_{12} & \dots & x_{1M} \\ \vdots & & & & \\ 1 & x_{N1} & x_{N2} & \dots & x_{NM} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_M \end{bmatrix}, \quad Y = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_N \end{bmatrix}$$
Note: for simplicity we assume that $X$ includes a column of 1s.
Linear Least Squares (Multivariate)
$$\mathrm{RSS}(\beta) = \sum_i (y_i - \hat{y}_i)^2 = (Y - \hat{Y})^T (Y - \hat{Y}) = (Y - X\beta)^T (Y - X\beta)$$
Objective: $\min_{\beta} \; (Y - X\beta)^T (Y - X\beta)$
Solution: $\beta^* = (X^T X)^{-1} X^T Y$
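A minimal sketch of the multivariate least squares solution above, assuming numpy; the training data are made up.

```python
# Closed-form multivariate least squares via the normal equations.
import numpy as np

X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])   # N x M features
Y = np.array([3.0, 2.5, 5.0, 8.0])

X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # prepend the column of 1s

# beta* = (X^T X)^{-1} X^T Y  (np.linalg.lstsq is the numerically preferred route)
beta_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_star)                                        # [beta_0, beta_1, ..., beta_M]
```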
Following slides are from Prof. Sundeep Rangan’s Intro to ML Class.
Polynomial Fitting
• Last lecture: polynomial regression
• Given data $(x_i, y_i)$, $i = 1, \dots, N$
• Learn a polynomial relationship: $\hat{y} = \beta_0 + \beta_1 x + \dots + \beta_d x^d$
• $d$ = degree of the polynomial, called the model order
• $\beta = (\beta_0, \dots, \beta_d)$ = coefficient vector
• Given $d$, we can find $\beta$ via least squares
• How do we select $d$ from the data?
• This problem is called model order selection.
Example Question
• You are given some data $(x_i, y_i)$.
• Want to fit a model: $\hat{y} = f(x)$
• Decide to use a polynomial: $\hat{y} = \beta_0 + \beta_1 x + \dots + \beta_d x^d$
• What model order $d$ should we use?
• Thoughts?
Synthetic Data
• The previous example uses synthetic data
• $x$: 40 samples uniform in $[-1, 1]$
• $f_0(x)$ = "true relation", a degree-3 polynomial
• $y = f_0(x) + \varepsilon$, with additive noise $\varepsilon$
• Synthetic data is useful for analysis
• We know the "ground truth"
• We can measure the performance of various estimators
Fitting with True Model Order
• Suppose the true polynomial order, d = 3, is known
• Use linear regression
• numpy.polynomial package (a minimal sketch follows)
• Get a very good fit
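A minimal sketch of generating synthetic data and fitting a degree-3 polynomial with numpy.polynomial, assuming the setup described above; the true coefficients and noise level are made up.

```python
# Synthetic data + least squares polynomial fit at the true order d = 3.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)                                 # 40 samples uniform in [-1, 1]
f0 = lambda t: 1.0 + 0.5 * t - 2.0 * t**2 + 3.0 * t**3     # "true relation" (assumed)
y = f0(x) + rng.normal(0, 0.2, x.shape)                    # add noise

beta_hat = P.polyfit(x, y, deg=3)                          # least squares fit with d = 3
print(beta_hat)                                            # estimated [beta_0, ..., beta_3]
```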
But, True Model Order not Known
• Suppose we guess the wrong model order?
[Figure: d=1 "Underfitting" vs. d=10 "Overfitting"]
How Can You Tell from Data?
• Is there a way to tell what is the correct model order to use?
• Must use the data; we do not have access to the true relation $f_0$.
• What happens if we guess $d$:
• too big?
• too small?
Using RSS on Training Data?
• Simple (but bad) idea:
• For each model order $d$, find the estimate $\hat{\beta}$
• Compute predicted values $\hat{y}_i$ on the training data
• Compute $\mathrm{RSS}(d)$
• Find the $d$ with the lowest $\mathrm{RSS}(d)$
• This doesn't work
• Training $\mathrm{RSS}(d)$ is always decreasing in $d$ (Question: Why?)
• Minimizing it will pick $d$ as large as possible
• Leads to overfitting
• What went wrong?
• How do we do better?
Model Class and True Function
• Analysis set-up:
• The learning algorithm assumes a model class: $\hat{y} = f(x, \beta)$
• But the data has a true relation: $y = f_0(x) + \varepsilon$
• We will quantify three key effects:
• Irreducible error
• Under-modeling
• Over-fitting
Output Mean Squared Error
• To evaluate prediction error suppose we are given:
• A parameter estimate $\hat{\beta}$ (computed from the learning algorithm)
• A test point $x$
• The test point is generally different from the training samples.
• Predicted value: $\hat{y} = f(x, \hat{\beta})$
• Actual value: $y = f_0(x) + \varepsilon$
• Output mean squared error: $E\left[(y - \hat{y})^2\right]$
• The expectation is over the noise on the test sample.
Irreducible Error
• Rewrite the output MSE: $E\left[(y - \hat{y})^2\right] = E\left[(f_0(x) + \varepsilon - \hat{y})^2\right]$
• Since the noise on the test sample is independent of $f_0(x)$ and $\hat{y}$: $E\left[(y - \hat{y})^2\right] = E\left[(f_0(x) - \hat{y})^2\right] + E[\varepsilon^2]$
• Define the irreducible error: $E[\varepsilon^2]$
• Lower bound on the output MSE
• Fundamental limit on our ability to predict $y$
• Occurs since $y$ is influenced by factors other than $x$
Analysis with Noise (Advanced)
• Now assume noise: $y_i = f_0(x_i) + \varepsilon_i$
• Get training data: $(x_i, y_i)$, $i = 1, \dots, N$
• Fit a parameter: $\hat{\beta}$
• $\hat{\beta}$ will be random.
• It depends on the particular noise realization.
• Take a new test point $x$ (not random)
• Compute the mean and variance of the estimated function $f(x, \hat{\beta})$
• Define:
• Bias: difference of the true function from the mean estimate
• Variance: variance of the estimate around its mean
Bias and Variance Illustrated
• Polynomial example
• Mean and std dev of the estimated functions
• 100 trials
[Figure: low variance / high bias (underfit) vs. high variance / zero bias (overfit)]
Bias-Variance Tradeoff
• Simpler models: fewer parameters, risk of under-fitting
• Richer models: more parameters, risk of over-fitting
Cross Validation
• Concept: Need to test fit on data independent of training data
• Divide the data into two sets: training samples and validation samples
• For each model order d, learn the parameters from the training samples
• Measure RSS on the validation samples.
• Select the model order d that minimizes the validation RSS
Finding the Model Order
• Estimated optimal model order = 3
• Test RSS is minimized at d = 3
• Training RSS always decreases with d
Problems with Simple Train/Test Split
• Test error could vary significantly depending on samples selected
• Only use limited number of samples for training
• Problems particularly bad for data with limited number of samples
[From http://blog.goldenhelix.com/goldenadmin/cross-validation-for-genomic-prediction-in-svs/]
K-Fold Cross Validation
• K-fold cross validation
• Divide the data into K parts
• Use K−1 parts for training; use the remaining part for test.
• Average over the K test choices
• More accurate, but requires K fits of the parameters
• Leave-one-out cross validation (LOOCV)
• Take K = N so that one sample is left out each time.
• Most accurate, but requires N model fittings
Polynomial Example
• Use the sklearn KFold object (see the sketch below)
• Loop:
• Outer loop: over the K folds
• Inner loop: over the model order d
• Measure the test error in each fold and for each order
• Can be time-consuming
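A minimal sketch of K-fold cross validation for polynomial model order selection, assuming scikit-learn and the synthetic (x, y) setup from the earlier sketch.

```python
# K-fold cross validation over the polynomial model order d.
import numpy as np
from numpy.polynomial import polynomial as P
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 3.0 * x**3 + rng.normal(0, 0.2, x.shape)

K, orders = 5, range(1, 11)
rss = np.zeros((K, len(orders)))

# Outer loop over folds, inner loop over model order d.
for k, (tr, te) in enumerate(KFold(n_splits=K, shuffle=True, random_state=0).split(x)):
    for j, d in enumerate(orders):
        beta = P.polyfit(x[tr], y[tr], deg=d)                        # fit on training fold
        rss[k, j] = np.sum((y[te] - P.polyval(x[te], beta)) ** 2)    # test RSS on held-out fold

print("Mean test RSS per order:", rss.mean(axis=0))
print("Selected model order:", orders[int(np.argmin(rss.mean(axis=0)))])
```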
Polynomial Example CV Results
• For each model order d:
• Compute the mean test RSS over the folds
• Compute the standard error (SE) of the test RSS: SE ≈ std dev / √K
• Mean and SE are computed over the K folds
• Simple model selection:
• Select the d with the lowest mean test RSS
• For this example:
• Estimated model order = 3
Binary Classification
y is a "categorical variable." Can you fit a linear model to this data?
Binary Classification Task (T):
• Simplest example: $x \in \mathbb{R}$ and $y \in \{0, 1\}$
• Dataset of ICLR'18 review scores vs. accept/reject decisions
Logistic Regression
[Figure: Pr{Decision = Accept | Score}]
Binary Classification Task (T):
• Instead, let's compute and plot $p = \Pr\{y = 1 \mid x\}$
• Idea: use linear regression to fit $p$ as a function of $x$: $p = \beta_1 x + \beta_0$
• Is this a good idea?
• The probability $p$ is always bounded between $[0, 1]$
Logistic Regression
"Logits" Function
Binary Classification Task (T):
• Consider the following function: $g = \log\left(\frac{p}{1-p}\right)$
• What is the range of $g$? $g \in (-\infty, \infty)$
• Logistic Regression: fit the logits function using a linear model!
$$g = \log\left(\frac{p}{1-p}\right) = \beta_1 x + \beta_0$$
[Figure: ground-truth probabilities and an illustrative linear fit* of g vs. x]
Note: the linear fit is illustrative only. How to determine the best linear fit will be discussed next!
Logistic Regression
[Figure: Pr{Decision = Accept | Score}]
$$g = \log\left(\frac{p}{1-p}\right) = \beta_1 x + \beta_0 \;\;\Longrightarrow\;\; p = \frac{1}{1 + e^{-(\beta_1 x + \beta_0)}}$$
• What is Pr{Decision = Reject | Score}?
$$1 - p = \frac{e^{-(\beta_1 x + \beta_0)}}{1 + e^{-(\beta_1 x + \beta_0)}}$$
How do we find the model parameters $\beta_1$ and $\beta_0$?
Model Estimation
• We will use an approach referred to as Maximum Likelihood Estimation (MLE)
• Let's assume that the model (i.e., $\beta_1$ and $\beta_0$) is magically known. Consider the training dataset below. What is the likelihood that the dataset came from our model?
#   X          Y
1   x_1 = 3    y_1 = 0
2   x_2 = 8    y_2 = 1
..
N   x_N = 6    y_N = 1
$$\text{Likelihood} = \frac{e^{-(\beta_1 x_1 + \beta_0)}}{1 + e^{-(\beta_1 x_1 + \beta_0)}} \times \frac{1}{1 + e^{-(\beta_1 x_2 + \beta_0)}} \times \dots \times \frac{1}{1 + e^{-(\beta_1 x_N + \beta_0)}}$$
(For $y_i = 0$ the factor is $\Pr\{y = 0 \mid x_i\}$; for $y_i = 1$ it is $\Pr\{y = 1 \mid x_i\}$.)
Model Estimation
• We will use an approach referred to as Maximum Likelihood Estimation (MLE)
• Let's assume that the model (i.e., $\beta_1$ and $\beta_0$) is magically known. Consider the training dataset below. What is the likelihood that the dataset came from our model?
#   X          Y
1   x_1 = 3    y_1 = 0
2   x_2 = 8    y_2 = 1
..
N   x_N = 6    y_N = 1
$$\text{Likelihood} = \frac{e^{-(3\beta_1 + \beta_0)}}{1 + e^{-(3\beta_1 + \beta_0)}} \times \frac{1}{1 + e^{-(8\beta_1 + \beta_0)}} \times \dots \times \frac{1}{1 + e^{-(6\beta_1 + \beta_0)}}$$
Model Estimation
• We will use an approach referred to as Maximum Likelihood Estimation (MLE)
• Let's assume that the model (i.e., $\beta_1$ and $\beta_0$) is magically known. Consider the training dataset below. What is the likelihood that the dataset came from our model?
#   X          Y
1   x_1 = 3    y_1 = 0
2   x_2 = 8    y_2 = 1
..
N   x_N = 6    y_N = 1
$$\text{Log Likelihood} = \log\!\left(\frac{e^{-(3\beta_1 + \beta_0)}}{1 + e^{-(3\beta_1 + \beta_0)}}\right) + \log\!\left(\frac{1}{1 + e^{-(8\beta_1 + \beta_0)}}\right) + \dots + \log\!\left(\frac{1}{1 + e^{-(6\beta_1 + \beta_0)}}\right) = g(\beta_1, \beta_0)$$
$g(\beta_1, \beta_0)$ is a function of the model parameters only.
Find $\beta_1$ and $\beta_0$ that maximize $g$ (or minimize the "loss" $-g$):
$$\text{Loss}(\beta_1, \beta_0) = -g(\beta_1, \beta_0)$$
We Won’t Worry About How (Phew!)
[Figure: ground-truth accept probabilities and the fitted logistic regression (LR) curve]
From regression to classification: if probability of Accept > 0.5, then output Accept.
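A minimal sketch of fitting and using the 1-D logistic regression model, assuming scikit-learn; the review scores and decisions below are made up for illustration.

```python
# Fit a 1-D logistic regression and threshold its probability at 0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[3.0], [4.0], [5.0], [6.0], [6.5], [7.0], [8.0]])   # x
accept = np.array([0, 0, 0, 1, 0, 1, 1])                               # y

lr = LogisticRegression().fit(scores, accept)   # maximizes a (regularized) log likelihood
print(lr.coef_, lr.intercept_)                  # beta_1, beta_0

# From regression to classification: output Accept if Pr{Accept | score} > 0.5.
p_accept = lr.predict_proba([[6.2]])[0, 1]
print("Pr{Accept}:", p_accept, "Decision:", "Accept" if p_accept > 0.5 else "Reject")
```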
Logistic Regression: Multi-Variate
Case
UCI Spam Dataset:
https://archive.ics.uci.edu/ml/datasets/Spambase
• 57 real- or integer-valued features
• Binary output class
$$p_{\text{spam}} = \frac{1}{1 + e^{-\left(\sum_{i=1}^{M} \beta_i x_i + \beta_0\right)}}$$
LR on Spam Database: Results
90% of samples used for training, remaining 10% used for test
[Figure: prediction probabilities for all "SPAM" emails in the test set]
Which emails are mis-predicted?
Accuracy on test set: ~92%
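A minimal sketch of multivariate logistic regression on the UCI Spambase data, assuming scikit-learn and that the CSV from the URL above has been downloaded locally as "spambase.data" (57 feature columns followed by the 0/1 spam label).

```python
# Multivariate logistic regression on Spambase with a 90/10 train/test split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# 90% of samples for training, remaining 10% for test, as in the slide.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

lr = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("Test accuracy:", accuracy_score(y_te, lr.predict(X_te)))
```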
Which Features Matter?
Our Model:
$$p_{\text{spam}} = \frac{1}{1 + e^{-\left(\sum_{i=1}^{M} \beta_i x_i + \beta_0\right)}}$$
What does $\beta_i = 0$ imply about feature $i$?
Reasonable hypothesis: features with larger absolute values of $\beta$ matter more.
[Figure: learned coefficients per feature; annotated features include "char_freq_$", "cs", and "George"]
Feature Selection
Retrain and predict using only the top-k features (a minimal sketch follows): ~80% accuracy using only 3 features.
Can we explicitly train the parameters so as to prioritize a "sparser" model?
Why? Low model complexity prevents overfitting!
Recall that during training we were seeking to minimize:
$$\hat{\beta} = \arg\min_{\beta} \; \text{Loss}(\beta)$$
How should this objective function change?
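A minimal sketch of the top-k feature selection described above, assuming the Spambase splits (X_tr, X_te, y_tr, y_te) and the fitted model lr from the earlier sketch are in scope.

```python
# Keep only the k features with the largest |beta| and retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

k = 3
# Hypothesis: features with larger absolute coefficients matter more.
top_k = np.argsort(np.abs(lr.coef_[0]))[::-1][:k]

lr_k = LogisticRegression(max_iter=5000).fit(X_tr[:, top_k], y_tr)
print("Accuracy with top-%d features:" % k,
      accuracy_score(y_te, lr_k.predict(X_te[:, top_k])))
```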
Regularization
The $L_p$ norm of a vector $x$: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
• p = 0: $\|x\|_0$ = number of non-zero entries
• p = 1: $\|x\|_1 = \sum_i |x_i|$ = sum of absolute values
• p = 2: $\|x\|_2 = \left(\sum_i |x_i|^2\right)^{1/2}$ = square root of the sum of squares
• p = ∞: $\|x\|_\infty = \max_i |x_i|$ = maximum absolute value
"Regularized" loss: $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_0\}$, where $c$ controls the relative importance of the regularization penalty.
Regularization In Practice
L0 Regularization: $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_0\}$ is a hard "combinatorial" optimization problem!
Instead, the following regularization functions are commonly used:
L1 Regularization (LASSO): $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_1\}$
L2 Regularization (Ridge): $\hat{\beta} = \arg\min_{\beta} \{\text{Loss}(\beta) + c\,\|\beta\|_2\}$
We are penalizing "large" coefficients. But why?
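A minimal sketch of L1 (LASSO-style) vs. L2 (ridge-style) regularized logistic regression, assuming scikit-learn and the Spambase splits from the earlier sketch. Note that sklearn's C is the inverse of the penalty weight c used on the slides (larger C means weaker regularization).

```python
# Compare L1 and L2 regularized logistic regression on the spam data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_tr, y_tr)
l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=5000).fit(X_tr, y_tr)

print("L1 non-zero coefficients:", (l1.coef_ != 0).sum())   # LASSO prefers sparse solutions
print("L2 non-zero coefficients:", (l2.coef_ != 0).sum())
print("L1 test accuracy:", accuracy_score(y_te, l1.predict(X_te)))
print("L2 test accuracy:", accuracy_score(y_te, l2.predict(X_te)))
```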
LASSO and Ridge Regularization
[Figure: contours of the loss function $\text{Loss}(\beta)$ over $(\beta_1, \beta_2)$, together with the L2 ball (circle) and the L1 ball (diamond). The L1 ball's corners lie on the axes, so LASSO prefers sparse solutions!]
Regularization for Spam Classification
Which regularization function should we use?
How should we select c?
Impact of C
[Figure: test accuracy vs. regularization strength for Ridge (L2) and Lasso (L1); model complexity increases from left to right, and the best result is marked]
Errors in Binary Classification
• Two types of errors:
• Type I error (false positive / false alarm): decide $\hat{y} = 1$ when $y = 0$
• Type II error (false negative / missed detection): decide $\hat{y} = 0$ when $y = 1$
• The implications of these errors may be different
• Think of breast cancer diagnosis
• The accuracy of a classifier can be measured by its true positive rate (TPR) and false positive rate (FPR)
[Remaining slides from Prof. Rangan's Intro to ML Class]
ROC Curve
• Varying the threshold obtains a set of classifiers
• Trades off FPR and TPR
• Can visualize with an ROC curve
• Receiver operating characteristic curve, a term from digital communications
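A minimal sketch of plotting an ROC curve for the spam classifier, assuming scikit-learn, matplotlib, and the fitted model lr and test split from the earlier sketches.

```python
# Sweep the decision threshold and plot the resulting ROC curve.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

scores = lr.predict_proba(X_te)[:, 1]          # predicted spam probabilities
fpr, tpr, thresholds = roc_curve(y_te, scores) # one (FPR, TPR) point per threshold

plt.plot(fpr, tpr, label="LR (AUC = %.2f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], "--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```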