
CS 5/7320

Artificial Intelligence

Learning
from Examples
AIMA Chapter 19
Slides by Michael Hahsler
Based on slides by Dan Klein, Pieter Abbeel, Sergey
Levine and A. Farhadi (http://ai.berkeley.edu)
with figures from the AIMA textbook.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Topics

• Supervised Learning
• Data Use in AI
• Training & Testing
• Types of ML Models
Learning from Examples: Machine Learning
Up until now in this course:
• Hand-craft algorithms to make rational/optimal or at least good decisions.
Examples: Search strategies, heuristics.

Issues
• Designer cannot anticipate all possible future situations.
• Designer may have examples but does not know how to program a solution.

Machine Learning
• Learning: Improve performance after making observations about the world. That is, learn what
works and what doesn’t to get closer to optimal decisions.
• How to learn a model to make better decisions from data/experience?
• Supervised Learning: Learn a function (model) to map input to output from a training set. We focus on
supervised learning. Examples:
▪ Use a naïve Bayesian classifier to distinguish between spam/no spam
▪ Learn a playout policy to simulate games (current board -> good move)
• Unsupervised Learning: Organize data (e.g., clustering, embedding)
• Reinforcement Learning: Learn from rewards/punishment (e.g., winning a game) obtained via interaction
with the environment over time.
Supervised Learning
Supervised Learning
• Examples
• We assume there exists a target function y = f(x) that produces iid (independent
and identically distributed) examples, possibly with noise and errors.
• Examples are observed input-output pairs E = {(x₁, y₁), …, (xᵢ, yᵢ), …, (x_N, y_N)},
where x is a vector called the feature vector.

• Learning problem
• Given a hypothesis space H of representable models.
• Find a hypothesis h ∈ H such that ŷᵢ = h(xᵢ) ≈ yᵢ for all i.
• That is, we want to approximate f by h using E.

[Figure: the hypothesis space H shown as a subset of the set of all functions, with h approximating the target function f.]

• Supervised learning includes
• Classification (outputs = class labels). E.g., x is an email and f(x) is spam / ham.
• Regression (outputs = real numbers). E.g., x is a house and f(x) is its selling price.
Consistency vs. Simplicity
Example: Univariate curve fitting (regression, function approximation)
[Figure: example points (x, f(x)) on the left; learned models h(x) on the right, ranging from a straight line (very simple, but not very consistent with the data!) to more complex curves.]

• Consistency: h(xᵢ) ≈ yᵢ
• Simplicity: small number of model parameters
Measuring Consistency using Loss
Goal of learning: Find a hypothesis that makes predictions that are consistent with
the examples E = {(x₁, y₁), …, (xᵢ, yᵢ), …, (x_N, y_N)}.
That is, ŷ = h(x) ≈ y.

• Measure mistakes with a loss function L(y, ŷ):
• Absolute-value loss: L₁(y, ŷ) = |y − ŷ|  (for regression)
• Squared-error loss: L₂(y, ŷ) = (y − ŷ)²  (for regression)
• 0/1 loss: L₀/₁(y, ŷ) = 0 if y = ŷ, else 1  (for classification)
• Log loss, cross-entropy loss, and many others…

• Empirical loss: the average loss over the N examples in the dataset
EmpLoss_{L,E}(h) = (1/|E|) Σ_{(x,y)∈E} L(y, h(x))
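
To make these definitions concrete, here is a minimal Python sketch; the function names and the toy data are illustrative, not part of the slides:

# Sketch of the loss functions above and the empirical loss
# averaged over a dataset E of (x, y) pairs.

def l1_loss(y, y_hat):          # absolute-value loss |y - y_hat|
    return abs(y - y_hat)

def l2_loss(y, y_hat):          # squared-error loss (y - y_hat)^2
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):    # 0/1 loss for classification
    return 0 if y == y_hat else 1

def empirical_loss(loss, h, E):
    # average loss of hypothesis h over the examples E = [(x, y), ...]
    return sum(loss(y, h(x)) for x, y in E) / len(E)

# Example: evaluate a toy hypothesis h(x) = 2*x on three examples.
E = [(1, 2.1), (2, 3.9), (3, 6.2)]
h = lambda x: 2 * x
print(empirical_loss(l2_loss, h, E))   # mean squared error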
Learning Consistent ℎ by Minimizing the Loss
• Empirical loss:
EmpLoss_{L,E}(h) = (1/|E|) Σ_{(x,y)∈E} L(y, h(x))

• Find the best hypothesis that minimizes the loss:
h* = argmin_{h ∈ H} EmpLoss_{L,E}(h)

• Reasons why h* ≠ f:
a) Realizability: f ∉ H.
b) f is nondeterministic or the examples are noisy.
c) It is computationally intractable to search all of H, so we use a non-optimal heuristic.
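
A toy illustration of this minimization, assuming a hypothesis space H of lines y = wx searched by brute force; real learners use smarter optimization, and all names here are made up for illustration:

# Toy version of h* = argmin_{h in H} EmpLoss: H is a grid of slopes w,
# and we pick the slope with the smallest mean squared error on E.
E = [(1, 2.1), (2, 3.9), (3, 6.2)]

def emp_loss(w):
    return sum((y - w * x) ** 2 for x, y in E) / len(E)

H = [w / 10 for w in range(0, 41)]   # candidate slopes 0.0 .. 4.0
w_star = min(H, key=emp_loss)        # hypothesis minimizing the empirical loss
print(w_star, emp_loss(w_star))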
The Bayes Classifier
For 0/1 loss, the empirical loss is minimized by the model that predicts for each 𝑥 the most likely class 𝑦 using
MAP (Maximum a posteriori) estimates. This is called the Bayes classifier.

h*(x) = argmax_y P(Y = y | X = x) = argmax_y P(x|y)P(y) / P(x) = argmax_y P(x|y)P(y)

Optimality: The Bayes classifier is optimal for 0/1 loss. It is the most consistent classifier possible, with the lowest
possible error, called the Bayes error rate. No better classifier is possible!

Issue: The classifier requires learning P(x|y)P(y) = P(x, y) from the examples.

• It needs the complete joint probability distribution, which in the general case requires a probability table with one
entry for each possible value of the feature vector x.
• This is impractical (unless a simple Bayes network exists), and most classifiers try to approximate the Bayes
classifier using a simpler model with fewer parameters.
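
A hedged sketch of a Bayes classifier for a single binary feature, assuming the joint table P(x, y) is simply given; in practice this table must be estimated, which is exactly what becomes intractable for long feature vectors:

# Bayes classifier for one binary feature x and classes {'spam', 'ham'},
# assuming the joint probabilities P(x, y) are known exactly.
joint = {                      # P(x, y); entries sum to 1
    (0, 'ham'): 0.50, (1, 'ham'): 0.10,
    (0, 'spam'): 0.15, (1, 'spam'): 0.25,
}

def bayes_classify(x):
    # argmax_y P(y | x) = argmax_y P(x, y), since P(x) is a shared constant
    return max(['ham', 'spam'], key=lambda y: joint[(x, y)])

print(bayes_classify(0), bayes_classify(1))   # -> ham spam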
Simplicity
Ease of use
• Simpler hypotheses have fewer model parameters to estimate and store.

Generalization: How well does the hypothesis perform on new data?


• We do not want the model to be too specific to the training examples (an issue called
overfitting).
• Simpler models typically generalize better to new examples.

How to achieve simplicity?


a) Model bias: Restrict 𝐻 to simpler models (e.g., assumptions like independence,
only consider linear models).
b) Feature selection: use fewer variables from the feature vector 𝑥
c) Regularization: penalize the model for its complexity (e.g., number of parameters):
h* = argmin_{h ∈ H} [EmpLoss_{L,E}(h) + λ Complexity(h)]
The term λ Complexity(h) is the penalty term.
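
As one concrete instance of the regularized objective, here is a sketch of ridge-style regularization for polynomial regression, where Complexity(h) is taken to be the sum of squared weights; that choice is an assumption for illustration, since the slides leave the penalty generic:

import numpy as np

# Ridge-style regularized fit: minimize ||Xw - y||^2 + lam * ||w||^2.
# For this penalty the minimizer has a closed form: (X^T X + lam I)^-1 X^T y.
def ridge_fit(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.vander(x, 4)                 # cubic polynomial features
y = np.array([0.1, 0.9, 2.1, 2.9])  # roughly a line
print(ridge_fit(X, y, lam=1.0))     # larger lam -> smaller (simpler) weights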
Overfitting

Model Selection: Bias vs. Variance

[Figure: learned models ranging from simpler (left) to more consistent with the data (right). Points: two samples from the same function f, shown to illustrate variance. Lines: the learned function h. Bias is high on the left (restricted models) and low on the right; variance is low on the left and high on the right.]

• Bias: restrictions imposed by the model class.
• Variance: differences in the learned model due to slightly different data.
• This is a tradeoff.
Data
The Dataset
[Figure: a data table. The columns are the feature vector x (features, variables, attributes) and the class label y; the rows are examples (instances, observations).]
Find a hypothesis (called “model”) to predict the class given the features.
Feature Engineering
• Add information sources as new variables to the model.
• Add derived features that help the classifier (e.g., x₁x₂, x₁²).
• Embedding: e.g., convert words to vectors such that similarity
between the vectors reflects semantic similarity.

• Example for Spam detection: In addition to words


• Have you emailed the sender before?
• Have 1000+ other people just gotten the same email?
• Is the header information consistent?
• Is the email in ALL CAPS?
• Do inline URLs point where they say they point?
• Does the email address you by (your) name?

• Feature Selection: Which features should be used in the
model is a model selection problem (choose between
models with different features).
Training
and
Testing
Model Evaluation (Testing)
The model was trained on the training examples 𝐸. We want to test how well the model
will perform on new examples 𝑇 (i.e., how well it generalizes to new data).

• Testing loss: Calculate the empirical loss for predictions on a testing data set T that is
different from the data used for training.
EmpLoss_{L,T}(h) = (1/|T|) Σ_{(x,y)∈T} L(y, h(x))

• For classification we often use the accuracy measure, the proportion of correctly
classified test examples:
accuracy(h, T) = (1/|T|) Σ_{(x,y)∈T} [h(x) = y] = 1 − EmpLoss_{L₀/₁,T}(h)
where [c] is an indicator function returning 1 if c = True and 0 otherwise.
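
The accuracy measure translates directly into code; a minimal sketch with illustrative names:

# accuracy(h, T): fraction of test examples where the prediction matches.
def accuracy(h, T):
    return sum(1 for x, y in T if h(x) == y) / len(T)

T = [(1, 'ham'), (2, 'spam'), (3, 'ham')]
h = lambda x: 'ham'            # toy classifier that always predicts 'ham'
print(accuracy(h, T))          # -> 0.666..., i.e., 1 - empirical 0/1 loss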


Training a Model
• Models are "trained" (learned) on the training data. This
involves estimating:

1. Model parameters (the model): e.g., probabilities, weights, factors.
2. Hyperparameters: Many learning algorithms have choices for the
learning rate, regularization λ, maximal decision tree depth,
selected features, ... The algorithm optimizes the model
parameters given user-specified hyperparameters.

• We need to tune the hyperparameters!
Hyperparameter Tuning/Model Selection
1. Hold a validation data set back from the training data.
2. Learn models using the training set with different
hyperparameters. Often a grid of possible hyperparameter
combinations or some greedy search is used.
3. Evaluate the models using the validation data and choose
the model with the best accuracy. Selecting the right type of
model, hyperparameters, and features is called model selection.
4. Learn the final model with the chosen hyperparameters
using all training data (including the validation data).

• Notes:
• The validation set was not used for training, so we get the generalization
accuracy for the different hyperparameter settings.
• If no model selection is necessary, then no validation set is used.
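
A sketch of steps 1-4, assuming a polynomial-regression model whose degree plays the role of the hyperparameter and using validation loss in place of accuracy; all names and data are illustrative:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 30)
y = np.sin(x) + rng.normal(0, 0.1, 30)   # noisy examples of a target function

train_x, val_x = x[:20], x[20:]          # step 1: hold back a validation set
train_y, val_y = y[:20], y[20:]

def fit(deg, xs, ys):                    # train: least-squares polynomial fit
    return np.polyfit(xs, ys, deg)

def val_loss(deg):                       # step 3: evaluate on the validation set
    w = fit(deg, train_x, train_y)
    return np.mean((np.polyval(w, val_x) - val_y) ** 2)

degrees = [1, 2, 3, 5]                   # step 2: grid of hyperparameter values
best = min(degrees, key=val_loss)
final_model = fit(best, x, y)            # step 4: refit on all training data
print(best)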
Testing a Model
• After the model is selected, the final model is evaluated against the
test set to estimate the final model accuracy.
• Very important: never "peek" at the test set during training!
How to Split the Dataset
• Random splits: Split the data randomly into, e.g.,
60% training, 20% validation, and 20% testing.

• Stratified splits: Like random splits, but balance classes and other
properties of the examples.

• k-fold cross validation: Uses the training & validation data better.
• Split the training & validation data randomly into k folds.
• For k rounds, hold one fold back for testing and use the remaining k − 1 folds
for training.
• Use the average error/accuracy as a better estimate.
• Some algorithms/tools do this internally.

• LOOCV (leave-one-out cross validation): k = n; used if very little
data is available.
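
A minimal sketch of k-fold cross validation, again with the polynomial model as a stand-in; the function names are illustrative:

import numpy as np

# k-fold cross validation: average the held-out score over k rounds,
# each holding one fold back and training on the remaining k-1 folds.
def k_fold_scores(x, y, k, fit, score):
    idx = np.random.permutation(len(x))   # random fold assignment
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        scores.append(score(model, x[val], y[val]))
    return np.mean(scores)

# Example: mean squared error of a quadratic fit, averaged over 5 folds.
x = np.linspace(0, 3, 30)
y = np.sin(x)
fit = lambda xs, ys: np.polyfit(xs, ys, 2)
mse = lambda w, xs, ys: np.mean((np.polyval(w, xs) - ys) ** 2)
print(k_fold_scores(x, y, k=5, fit=fit, score=mse))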
Learning Curve:
The Effect of Training Data Size

[Figure: learning curve showing the accuracy of a classifier as the amount of available training data increases.]

• More data is better!
• At some point the learning curve flattens out and more data does
not contribute much!
Comparing to Baselines
• First step: get a baseline
• Baselines are very simple straw-man models.
• They help to determine how hard the task is.
• They help to find out what a good accuracy is.

• Weak baseline: the most-frequent-label classifier (see the sketch below)
• Gives all test instances whatever label was most common in the training set.
• Example: For spam filtering, give every message the label "ham."
• Accuracy might be very high if the problem is skewed (called class imbalance).
• Example: If calling everything "ham" already gets 66% right, then a classifier that gets 70% isn't very good…

• Strong baseline: For research, we typically compare to the previously published state-
of-the-art as a baseline.
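
A minimal sketch of the weak baseline mentioned above; the names are illustrative:

from collections import Counter

# Weak baseline: predict the most common training label for every input.
def most_frequent_label_classifier(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

train = ['ham', 'ham', 'spam', 'ham']
baseline = most_frequent_label_classifier(train)
print(baseline("any email"))   # -> 'ham', regardless of the input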
Types of
Models
Regression: Predict a number
Classification: Predict a label
Regression: Linear Regression
• Model: h_w(x_j) = w₀ + w₁ x_{j,1} + ⋯ + wₙ x_{j,n} = Σᵢ wᵢ x_{j,i} = wᵀ x_j

• Empirical loss: squared-error loss over the whole data matrix X:
L(w) = ‖Xw − y‖²

• Gradient: the vector of partial derivatives
∇L(w) = (∂L/∂w₁ (w), ∂L/∂w₂ (w), …, ∂L/∂wₙ (w)) = 2Xᵀ(Xw − y)

• Find: ∇L(w) = 0

• Gradient descent: w ← w − α ∇L(w)

• Analytical solution (via the pseudo-inverse): w* = (XᵀX)⁻¹ Xᵀ y
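
Both solution strategies in a short numpy sketch; the synthetic data, the learning rate, and the step count are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(0, 10, 50)]   # column of 1s for the intercept w0
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)

# Analytical solution via the pseudo-inverse: w* = (X^T X)^-1 X^T y
w_star = np.linalg.pinv(X) @ y

# Gradient descent on L(w) = ||Xw - y||^2, gradient 2 X^T (Xw - y)
w = np.zeros(2)
alpha = 0.01                                      # learning rate
for _ in range(5000):
    w -= alpha * 2 * X.T @ (X @ w - y) / len(y)   # averaged gradient for stability
print(w_star, w)                                  # both close to [1, 2]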
Naïve Bayes Classifier
• Approximates a Bayes classifier with the naïve independence assumption that all n
features are conditionally independent given the class:

h(x) = argmax_y P(y) Πᵢ₌₁ⁿ P(xᵢ|y)

The P(y)s and the P(xᵢ|y)s are estimated from the data by counting.

• Gaussian Naïve Bayes classifiers extend the approach to continuous features by
assuming:

P(xᵢ|y) ~ N(μ_y, σ_y)

The parameters of the normal distribution N(μ_y, σ_y) are estimated from the data.
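
A compact sketch of a Gaussian naïve Bayes classifier, estimating the parameters by averaging and predicting with log-probabilities for numerical stability; the names and toy data are illustrative:

import numpy as np

# Gaussian naive Bayes: estimate P(y) and per-feature N(mu_y, sigma_y),
# then predict argmax_y P(y) * prod_i P(x_i | y).
def fit_gnb(X, y):
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0) + 1e-9)
    return model

def predict_gnb(model, x):
    def log_posterior(c):
        prior, mu, sigma = model[c]
        # log P(y) + sum_i log N(x_i; mu_i, sigma_i), dropping shared constants
        return (np.log(prior)
                - 0.5 * np.sum(((x - mu) / sigma) ** 2)
                - np.sum(np.log(sigma)))
    return max(model, key=log_posterior)

X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.1], [3.2, 3.8]])
y = np.array(['a', 'a', 'b', 'b'])
print(predict_gnb(fit_gnb(X, y), np.array([1.1, 2.0])))   # -> 'a'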
Decision Trees

• A sequence of decisions represented as a tree.


• Many implementations exist that differ in:
• How to select features to split?
• When to stop splitting?
• Is the tree pruned?

• Approximates a Bayesian classifier by
h(x) = argmax_y P(Y = y | leafNodeMatching(x))
K-Nearest Neighbors Classifier

• The class is predicted by looking at the majority in the set of the k nearest neighbors. k is a
hyperparameter. Larger k smooths the decision boundary.
• Neighbors are found using a distance measure (e.g., Euclidean distance between points).
• Approximates a Bayesian classifier by
h(x) = argmax_y P(Y = y | neighborhood(x))
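
A minimal numpy sketch of a k-nearest-neighbors prediction; the names and toy data are illustrative:

import numpy as np
from collections import Counter

# kNN: predict the majority label among the k training points
# closest to x (Euclidean distance).
def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(['blue', 'blue', 'blue', 'red', 'red', 'red'])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))   # -> 'red'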
Support Vector Machine (SVM)

[Figure: a linear decision boundary with the margin on either side, defined by the support vectors.]
• Linear classifier that finds the maximum margin separator using only the points
that are “support vectors” and quadratic optimization.
• The kernel trick can be used to learn non-linear decision boundaries.
Artificial Neural Networks/Deep Learning

[Figure: a computational graph with an input layer, a hidden layer, and an output layer. Each perceptron computes a weighted sum with a bias term followed by a non-linear activation function. For classification, the output layer typically uses a softmax activation function returning P(y|x).]

• Represent ŷ = h(x) as a network of weighted sums with non-linear
activation functions g (e.g., logistic, ReLU).
• Learn the weights w from examples using backpropagation of the
prediction errors L(ŷ, y) (gradient descent).
• ANNs are universal approximators: large networks can approximate
any function (no bias). Regularization is typically used to avoid
overfitting.
• Deep learning adds more hidden layers and layer types (e.g.,
convolution layers) for better learning.
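
A sketch of the forward pass of such a network with one hidden layer, ReLU activations, and a softmax output; the weights are random here, and training them by backpropagation is omitted (all names are illustrative):

import numpy as np

# Forward pass of a tiny one-hidden-layer network: weighted sums with a
# non-linear activation g (ReLU) and a softmax output returning P(y|x).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input (3) -> hidden (4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden (4) -> 2 classes

def forward(x):
    h = np.maximum(0, W1 @ x + b1)   # hidden layer: g = ReLU
    z = W2 @ h + b2                  # output scores
    e = np.exp(z - z.max())          # numerically stabilized softmax
    return e / e.sum()               # P(y|x) over the 2 classes

print(forward(np.array([1.0, -0.5, 2.0])))   # probabilities summing to 1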
Many other models exist

• Generalized linear model (GLM): This important model family includes
linear regression and the classification method logistic regression.
• Regularization: Enforce simplicity by using a penalty for complexity.
• Kernel trick: Let a linear classifier learn non-linear decision boundaries
(= a linear boundary in a high-dimensional space).
• Ensemble learning: Use many models and combine the results
(e.g., random forest, boosting).
• Embedding and dimensionality reduction: Learn how to represent
data in a simpler way.
Some Use Cases of ML for Intelligent Agents

Learn Actions
• Directly learn the best action from examples: action = h(state)
• This model can also be used as a playout policy for Monte Carlo tree search with data from self-play.

Learn Heuristics
• Learn evaluation functions for states: eval = h(state)
• Can learn a heuristic for minimax search from examples.

Perception
• Natural language processing: Use deep learning / word embeddings / language models to understand concepts, translate between languages, or generate text.
• Speech recognition: Identify the most likely sequence of words.
• Vision: Object recognition in images/videos. Generate images/video.

Bottom line: Learning a function is often more effective than hard-coding it,
but we do not always know how it performs in very rare cases!
