Artificial Intelligence
Learning from Examples
AIMA Chapter 19
Slides by Michael Hahsler
Based on slides by Dan Klein, Pieter Abbeel, Sergey
Levine and A. Farhadi (http://ai.berkeley.edu)
with figures from the AIMA textbook.
Issues
• Designer cannot anticipate all possible future situations.
• Designer may have examples but does not know how to program a solution.
Machine Learning
• Learning: Improve performance after making observations about the world. That is, learn what
works and what doesn’t to get closer to optimal decisions.
• How to learn a model to make better decisions from data/experience?
• Supervised Learning: Learn a function (model) to map input to output from a training set. We focus on supervised learning. Examples:
▪ Use a naïve Bayesian classifier to distinguish between spam/no spam
▪ Learn a playout policy to simulate games (current board → good move)
• Unsupervised Learning: Organize data (e.g., clustering, embedding)
• Reinforcement Learning: Learn from rewards/punishment (e.g., winning a game) obtained via interaction
with the environment over time.
Supervised Learning
• Examples
• We assume there exists a target function $y = f(x)$ that produces iid (independent and identically distributed) examples, possibly with noise and errors.
• Examples are observed input-output pairs $E = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_N, y_N)\}$, where each $x$ is a vector called the feature vector.
• Learning problem
• Given a hypothesis space H of representable models.
• Find a hypothesis $h \in H$ such that $\hat{y}_i = h(x_i) \approx y_i \;\forall i$
• That is, we want to approximate 𝑓 by ℎ using E.
[Figure: the hypothesis space $H$ shown inside the set of all functions.]
• Supervised learning includes
• Classification (outputs = class labels). E.g., 𝑥 is an email and 𝑓(𝑥) is spam / ham.
• Regression (outputs = real numbers). E.g., x is a house and 𝑓(𝑥) is its selling price.
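A minimal sketch of this setup, assuming scikit-learn is available (the feature values and labels below are made up for illustration):

```python
# Minimal sketch of supervised learning with scikit-learn (assumed available):
# learn a hypothesis h from examples E = {(x_i, y_i)} and predict for a new x.
from sklearn.linear_model import LogisticRegression

# Hypothetical examples E: feature vectors x_i with class labels y_i.
X = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]]
y = ["spam", "ham", "spam", "ham"]

h = LogisticRegression().fit(X, y)  # pick h from the hypothesis space H
print(h.predict([[0.15, 0.85]]))    # y_hat = h(x) for an unseen x
```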
Consistency vs. Simplicity
Example: Univariate curve fitting (regression, function approximation)
[Figure: examples $(x, f(x))$ and learned models $h(x)$ of increasing complexity; a line is very simple, but not very consistent with the data.]
• Consistency: $h(x_i) \approx y_i$
• Simplicity: small number of model parameters
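A sketch of the trade-off using NumPy polynomial fitting; the data-generating function and noise level are assumptions for illustration:

```python
# Sketch: univariate curve fitting with polynomials of increasing degree (NumPy).
# A line (degree 1) is simple but may be inconsistent with the data; a high-degree
# polynomial fits the examples closely at the cost of many parameters.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # noisy examples from an assumed f

for degree in (1, 3, 9):
    h = np.poly1d(np.polyfit(x, y, degree))  # hypothesis with degree+1 parameters
    loss = np.mean((h(x) - y) ** 2)          # consistency with the examples
    print(f"degree {degree}: training loss {loss:.4f}")
```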
Measuring Consistency using Loss
Goal of learning: Find a hypothesis that makes predictions that are consistent with the examples $E = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_N, y_N)\}$. That is, $\hat{y} = h(x) \approx y$.
The Bayes classifier predicts the most probable class given the observed features:

$$h^*(x) = \operatorname*{argmax}_y P(Y = y \mid X = x) = \operatorname*{argmax}_y \frac{P(x \mid y)\, P(y)}{P(x)} = \operatorname*{argmax}_y P(x \mid y)\, P(y)$$

Optimality: The Bayes classifier is optimal for 0/1 loss. It is the most consistent classifier possible, with the lowest possible error, called the Bayes error rate. No better classifier is possible!
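A worked toy example of the argmax, with hypothetical values for $P(y)$ and $P(x \mid y)$:

```python
# Worked toy example of h*(x) = argmax_y P(x|y) P(y); all numbers are made up.
prior = {"spam": 0.4, "ham": 0.6}         # P(y)
likelihood = {"spam": 0.05, "ham": 0.01}  # P(x|y) for the observed x

score = {y: likelihood[y] * prior[y] for y in prior}  # proportional to P(y|x)
print(max(score, key=score.get))  # "spam": 0.05*0.4 = 0.020 > 0.01*0.6 = 0.006
```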
[Figure: points show two samples from the same function $f$ to illustrate variance.]
Examples (instances, observations): find a hypothesis (called a "model") to predict the class given the features.
Feature Engineering
• Add information sources as new variables to the model.
• Add derived features that help the classifier (e.g., $x_1 x_2$, $x_1^2$).
• Embedding: E.g., convert words to vectors such that similarity between vectors reflects semantic similarity.
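A sketch of generating derived features such as $x_1 x_2$ and $x_1^2$, assuming scikit-learn's PolynomialFeatures:

```python
# Sketch of adding derived features with scikit-learn's PolynomialFeatures
# (assumed available); degree 2 adds x1^2, x1*x2, and x2^2.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one example with original features x1, x2
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))  # [[x1, x2, x1^2, x1*x2, x2^2]] -> [[2. 3. 4. 6. 9.]]
```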
• Testing loss: Calculate the empirical loss for predictions on a testing data set 𝑇 that is
different from the data used for training.
$$\mathit{EmpLoss}_{L,T}(h) = \frac{1}{|T|} \sum_{(x,y) \in T} L(y, h(x))$$
• For classification we often use the accuracy measure, the proportion of correctly
classified test examples.
$$\mathit{accuracy}(h, T) = \frac{1}{|T|} \sum_{(x,y) \in T} [\![\, h(x) = y \,]\!] = 1 - \mathit{EmpLoss}_{L_{0/1},\,T}(h)$$
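A minimal sketch of these two formulas in plain Python (the hypothesis and test set are hypothetical):

```python
# Minimal sketch of empirical 0/1 loss and accuracy on a test set T of (x, y) pairs.
def emp_loss_01(h, T):
    """Average 0/1 loss of hypothesis h over the test examples T."""
    return sum(h(x) != y for x, y in T) / len(T)

def accuracy(h, T):
    """Proportion of correctly classified examples; equals 1 - EmpLoss_01."""
    return sum(h(x) == y for x, y in T) / len(T)

h = lambda x: "spam" if x > 0.5 else "ham"       # hypothetical hypothesis
T = [(0.9, "spam"), (0.2, "ham"), (0.7, "ham")]  # hypothetical test set
print(accuracy(h, T), 1 - emp_loss_01(h, T))     # both print 0.666...
```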
• Notes:
• The validation set was not used for training, so we get generalization accuracy for the different hyperparameter settings.
• If no model selection is necessary, then no validation set is used.
Testing a Model
• After the model is selected, the final model is evaluated against the test set to estimate the final model accuracy.
• Very important: never "peek" at the test set during training!
How to Split the Dataset
• Random splits: Split the data randomly in, e.g.,
60% training, 20% validation, and 20% testing.
• Stratified splits: Like random splits, but balance classes and other properties of the examples.
• k-fold cross validation: Use training & validation data better (see the sketch below).
• Split the training & validation data randomly into k folds.
• For k rounds, hold one fold back for testing and use the remaining $k - 1$ folds for training.
• Use the average error/accuracy as a better estimate.
• Some algorithms/tools do this internally.
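A sketch of a stratified split plus k-fold cross-validation, assuming scikit-learn and its bundled iris data:

```python
# Sketch of a stratified train/test split and k-fold cross-validation with
# scikit-learn (assumed available), using its bundled iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a stratified test set; do not touch it until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold cross-validation on the training data to estimate accuracy.
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)
print(scores.mean())
```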
Gradient descent: repeatedly update the weights in the direction of the negative gradient of the loss,

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla L(\mathbf{w})$$

Analytical solution using the pseudoinverse:

$$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
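A NumPy sketch comparing both solutions on synthetic data (true weights and noise level are assumptions):

```python
# NumPy sketch: linear regression weights by gradient descent vs. the
# analytical pseudoinverse solution; true weights [1, 2] are an assumption.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(0, 1, 50)]  # design matrix with bias column
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 50)

w = np.zeros(2)
alpha = 0.1                                # learning rate
for _ in range(5000):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the squared loss L(w)
    w = w - alpha * grad                   # w <- w - alpha * grad L(w)

w_star = np.linalg.pinv(X) @ y             # pseudoinverse solution w*
print(w, w_star)                           # both should be close to [1, 2]
```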
Naïve Bayes Classifier
• Approximates a Bayes classifier with the naïve independence assumption that all $n$ features are conditionally independent given the class.
$$h(x) = \operatorname*{argmax}_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The $P(y)$s and the $P(x_i \mid y)$s are estimated from the data by counting.
For continuous features, $P(x_i \mid y) \sim N(\mu_y, \sigma_y)$, where the parameters of the normal distribution $N(\mu_y, \sigma_y)$ are estimated from the data.
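A sketch using scikit-learn's GaussianNB, which estimates $P(y)$ and the per-feature $\mu_y, \sigma_y$ from the data:

```python
# Sketch of naive Bayes with normally distributed features, assuming
# scikit-learn's GaussianNB and its bundled iris data.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
h = GaussianNB().fit(X, y)  # estimates P(y) and mu_y, sigma_y for each feature
print(h.predict(X[:3]), h.predict_proba(X[:1]).round(3))
```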
k-Nearest Neighbors (k-NN) Classifier
• The class is predicted by looking at the majority in the set of the $k$ nearest neighbors. $k$ is a hyperparameter; larger $k$ smooths the decision boundary.
• Neighbors are found using a distance measure (e.g., Euclidean distance between points).
• Approximates a Bayesian classifier by

$$h(x) = \operatorname*{argmax}_y P(Y = y \mid \text{neighborhood}(x))$$
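A sketch assuming scikit-learn's KNeighborsClassifier:

```python
# Sketch of k-nearest-neighbor classification, assuming scikit-learn
# and its bundled iris data; k is the hyperparameter n_neighbors.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
h = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(h.predict(X[:3]))  # majority class among the 5 nearest training points
```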
Support Vector Machine (SVM)
[Figure: decision boundary with maximum margin.]
• Linear classifier that finds the maximum margin separator using only the points
that are “support vectors” and quadratic optimization.
• The kernel trick can be used to learn non-linear decision boundaries.
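A sketch assuming scikit-learn's SVC, showing the linear maximum-margin classifier and the kernel trick:

```python
# Sketch of SVM classification, assuming scikit-learn's SVC and its iris data:
# a linear maximum-margin separator, and the kernel trick for non-linear boundaries.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
h_linear = SVC(kernel="linear").fit(X, y)  # maximum margin separator
h_rbf = SVC(kernel="rbf").fit(X, y)        # kernel trick: non-linear boundary
print(len(h_linear.support_))              # number of support vectors used
```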
Artificial Neural Networks/Deep Learning
[Figure: computational graph of a network with a hidden layer; for classification, the output layer typically uses a softmax activation function returning $P(y \mid x)$. Inset: a perceptron with a bias term and a non-linear activation function.]
• Represent $\hat{y} = h(x)$ as a network of weighted sums with non-linear activation functions $g$ (e.g., logistic, ReLU).
• Learn weights $\mathbf{w}$ from examples using backpropagation of prediction errors $L(\hat{y}, y)$ (gradient descent).
• ANNs are universal approximators: large networks can approximate any function (no bias). Regularization is typically used to avoid overfitting.
• Deep learning adds more hidden layers and layer types (e.g., convolution layers) for better learning.
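A sketch of a small feed-forward network trained by backpropagation, assuming scikit-learn's MLPClassifier:

```python
# Sketch of a small feed-forward network trained by backpropagation,
# assuming scikit-learn's MLPClassifier and its bundled iris data.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
h = MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                  max_iter=2000, random_state=0).fit(X, y)
print(h.predict_proba(X[:1]).round(3))  # softmax output approximating P(y|x)
```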
Many other models exist
• Directly learn the best action from examples: $action = h(state)$. This model can also be used as a playout policy for Monte Carlo tree search with data from self-play.
• Learn evaluation functions for states: $eval = h(state)$. Can learn a heuristic for minimax search from examples.
• Natural language processing: Use deep learning / word embeddings / language models to understand concepts, translate between languages, or generate text.
• Speech recognition: Identify the most likely sequence of words.
• Vision: Object recognition in images/videos. Generate images/video.
Bottom line: Learning a function is often more effective than hard-coding it,
but we do not always know how it performs in very rare cases!