Supervised Learning

Supervised Learning

• Each type of task is characterized by the kind of data it requires and the kind of output it generates.
Supervised learning (classification)

• Supervised Learning is a machine learning technique that


uses a collection of paired input-output training samples to
learn about a system's input-output connection.
• Classification and regression are two types of problems that
can be solved with supervised learning.
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations.
• New data is classified based on the training set.
Supervised Learning
• Classification
– Output type: discrete (binary/multi-classes)
– Trying to find: a boundary
– Evaluation: accuracy
• Regression
– Output type: continuous
– Trying to find: best fit line
– Evaluation: sum of squared errors

Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute and
uses it in classifying new data

Regression
• Regression is a type of Supervised Learning task in which
the output has a continuous value.
• The term regression is used when you try to find the
relationship between variables.
• It is used to understand the relationship between dependent
and independent variables.
Classification Vs Regression
Classification: A Two-Step
 Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set

 The model is represented as classification rules, decision trees, or

mathematical formulae
Model usage: for classifying future or unknown objects
 Estimate accuracy of the model


The known label of test sample is compared with the classified
result from the model

Accuracy rate is the percentage of test set samples that are
correctly classified by the model

Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model
Construction
Process (2):Prediction by the
Model
Classification Tasks

Data: A set of data records (also called examples, instances, or cases) described by k attributes A1, A2, …, Ak and a class: each example is labelled with a pre-defined class.
Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

Learning (training): Learn a model using the training data


Testing: Test the model using unseen test data to assess the model accuracy
Classification Tasks

• Given:
– A set of classes
– Instances (examples) of
each class
• Described
as a set of features
or attributes and
their values
• Generate: A method (aka model) that, when given a new instance, will determine its class
Classification Techniques

What classification techniques do you know?

• Base Classifiers
• Decision Tree based Methods
• Rule-based Methods
• Nearest-neighbor
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Neural Networks, Deep Neural Nets
• Ensemble Classifiers
• Boosting, Bagging, Random Forests
Linear Regression
• Linear regression uses the relationship between the data points to draw a straight line through all of them.
• When the outcome and all the attributes are numeric, linear
regression is a natural technique to consider.
Y= a + bX
• Where Y is the dependent variable (that’s the variable that goes on
the Y axis), X is the independent variable (i.e. it is plotted on the X
axis), b is the slope of the line and a is the y-intercept.
• The idea is to express the class as a linear combination of the
attributes, with predetermined weights:

x = w0 + w1·a1 + w2·a2 + … + wk·ak
Where:
• x – the class
• a1 to ak – attribute values
• w0 to wk – weights, calculated from the training data
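As a rough illustration of how such weights can be obtained, here is a minimal Python sketch; the data values are made up for illustration, and least squares (as discussed later in these slides) is just one way to fit the weights:

```python
import numpy as np

# Hypothetical training data: each row is one sample (a1, a2); y is the target.
A = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.1, 2.6, 4.9, 7.2])

# Prepend a column of ones so w0 acts as the intercept.
A1 = np.hstack([np.ones((A.shape[0], 1)), A])

# The least-squares solution minimizes the sum of squared errors.
w, *_ = np.linalg.lstsq(A1, y, rcond=None)
print("weights (w0, w1, w2):", w)

# Predict the value of a new instance.
x_new = np.array([1.0, 2.5, 1.0])  # [1, a1, a2]
print("prediction:", x_new @ w)
```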
Linear Regression
• Linear Regression finds the relationship between the input and output
data by plotting a line that fits the input data and maps it onto the
output.
• This line represents the mathematical relationship between the
independent input variables and the dependent output and is called the Line of Best Fit.

• Linear regression is an excellent, simple method for numeric


prediction.
Linear Regression

• Consider the data that is displayed below, which tells you the sales
corresponding to the amount spent on advertising.

you can plot the graph of Sales vs.


Advertising, and find the line of best fit
between them
Linear Regression

• Best-fitting straight line will be found, where “best”


is interpreted as the least mean-squared difference.

• Linear regression measures the goodness of fit using the


squared error.
• Linear models serve well as building blocks for more complex
learning methods.
Linear Regression
Logistic Regression

• One type of classification algorithm


• Logistic regression builds a linear model based on a
transformed target variable.



Logistic Regression

Logistic Regression

• It is closely related to linear regression, but is used when the target variable is categorical in nature.
• It predicts the category of a dependent variable based on the values of the independent variables.
• The outcome or target variable is dichotomous (two possible classes) in nature.
• Logistic Regression predicts the probability of occurrence of a binary event using a logit function.
Logistic Regression

Logistic Regression
• From linear to logistic regression--- using sigmoid function.
• In logistic regression weighted sum of input is passed through the
sigmoid activation function and the curve which is obtained is
called the sigmoid curve.

• Sigmoid function: also called the logistic function, it gives an ‘S’-shaped curve that can take any real-valued number and map it into a value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))

As the input goes to positive infinity, the predicted y approaches 1; as the input goes to negative infinity, the predicted y approaches 0.
Logistic Regression

• Decision Boundary

– Based upon this threshold, the obtained estimated


probability is classified into classes.
– Say, if predicted value ≥ 0.5, then classify email as spam else
as not spam.
Example: If the output is 0.75, we can say in terms of
probability as: There is a 75 percent chance that email will be
spam.
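A minimal sketch of this thresholding step, assuming the weights have already been learned (the weight and feature values below are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and bias for two features of an email.
w = np.array([0.8, -1.2])
b = 0.3

x = np.array([2.0, 0.5])          # feature vector of a new email
p_spam = sigmoid(w @ x + b)       # estimated probability of "spam"

label = "spam" if p_spam >= 0.5 else "not spam"
print(f"P(spam) = {p_spam:.2f} -> {label}")
```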
Logistic Regression
Sigmoid function
Logistic Regression
Types of Logistic Regression
1. Binary Logistic Regression
– The categorical response has only two possible
outcomes. Example: Spam or Not
– One of the simplest and most commonly used Machine
Learning algorithms for two-class classification.
2. Multinomial Logistic Regression
– Three or more categories without ordering. Example:
Predicting which food is preferred more (Veg, Non-Veg,
Vegan)
3. Ordinal Logistic Regression
– Three or more categories with ordering. Example: Movie
rating from 1 to 5
Logistic Regression -- Advantage
• Advantages:
- Makes no assumptions about distributions of classes in
feature space
- Easily extended to multiple classes (multinomial regression)
- Natural probabilistic view of class predictions
- Quick to train
- Very fast at classifying unknown records
- Good accuracy for many simple data sets
- Resistant to overfitting
- Can interpret model coefficients as indicators of feature
importance
• What are the disadvantages of LR?
Question

• Discuss the difference between


linear regression and logistic regression
Decision Tree

• The most frequently used tree-structured machine learning classification model, which is simple to understand even for non-expert users.
• Decision trees can be applied to both regression and classification tasks.
• They can handle categorical as well as continuous data.
• Internal nodes (tests on a feature), branches (outcomes of a test), and leaf nodes (class labels) are the three elements of a decision tree.
• Decision Tree uses information gain, gain ratio, and Gini Index to
find the best splitting attribute.
• Various Decision tree algorithms are ID3, C4.5, C5.0 and CART
Decision Tree

• ID3 – builds a multiway tree, finding for each node the categorical feature that yields the largest information gain for categorical targets.
• C4.5 – the successor to ID3. C4.5 converts trained trees into sets of if-then rules, and the accuracy of each rule is evaluated to determine the order in which the rules should be applied.
• C5.0 – uses less memory and builds smaller rule sets than C4.5 while being more accurate.
• CART – very similar to C4.5, but it supports numerical target variables (regression) and does not compute rule sets.
Decision Tree

• A decision tree is a predictor, h : X → Y, that predicts the label associated with an instance X by traveling from a root node of the tree to a leaf.
• A method for approximating discrete classification functions
by means of a tree-based representation

• Tree leaf ↔ contains a specific label
• Tree branch ↔ a possible attribute value for the instance in question
DT Learning Algorithm
• Tree is constructed in a top-down recursive manner– greedy
search - through the space of possible solutions
• A general Decision Tree learning algorithm:
1. Perform a statistical test of each attribute – mostly categorical (it is possible to handle continuous attribute values) – to determine how well it classifies the training examples when considered alone;
2. Select the attribute that performs best and use it as the root of the tree;
3. To decide the descendant node down each branch of the root (parent node), sort the training examples according to the value related to the current branch and repeat the process described in steps 1 and 2 – a recursive process.
Example of a Decision Tree
Apply Model to Test Data
Another Example of Decision Tree
What is Good Attribute

• A good attribute splits the data so that each successor node is as pure as possible.

• In other words:
– We want a measure that prefers attributes that have a high
degree of ”order”
• Maximum order: all examples are of the same class
• Minimum order: all classes are equally likely
• Needs a measure of impurity
Measures of Node Impurity

• Information Gain
– Determine how informative an attribute is
– Attributes are assumed to be categorical

• Gini Index
– Attributes are assumed to be continuous
– Assume there exist several possible split values
for each
attribute
Information Gain

• The encoding information that would be gained by branching on A:
Gain(A) = Info(D) − InfoA(D)
• Gain(A) tells us how much would be gained by branching on A.

• The attribute A with the highest information gain is chosen as


the splitting attribute at node N.
Information Gain

Entropy in information theory specifies the minimum number of bits needed to encode the classification of an instance.
Information Gain

• Suppose we were to partition the tuples in D on


some attribute A having v distinct values, {a1, a2, …, av}
• Attribute A can be used to split D into v partitions or subsets,
{D1,D2,…, Dv}, where Dj contains those tuples in D that have
outcome aj of A.
• These partitions would correspond to the branches
grown from node N.
• The expected information required to classify a tuple from D
based on the partitioning by A:
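The original formula did not survive extraction; the standard form of this expected information, together with the entropy of D itself, is:

InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)
Info(D) = − Σ (i = 1 to m) pi log2(pi)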
Information Gain

• Assume an attribute A split the set S into subsets {S1, S2,…,


Sv}

• To compute the Information Gain, we have to compute:
– The entropy of the original set S
– The weighted average entropy of the subsets S1, …, Sv
• The difference between the two is the encoding information that would be gained by branching on A.

E(S) = 0 if S contains only positive or only negative examples
E(S) = 1 if S contains equal amounts of positive and negative examples
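For reference, the standard entropy and gain formulas this slide relies on (reconstructed, since the originals were not preserved):

E(S) = − p+ log2(p+) − p− log2(p−)
Gain(S, A) = E(S) − Σ (v in Values(A)) (|Sv| / |S|) × E(Sv)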
Play Golf Example
Note: No root-to-leaf path should contain the
same discrete attribute twice
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the


same class OR when all the records have similar attribute
values.
Tree pruning

• Pruning reduces the size of decision trees by removing parts


of the tree that do not provide power to classify instances

• Prepruning: stopping tree construction early on before


it is full

• Postpruning: to get simpler tree (to find and prune


unnecessary sub trees)

• Comparing prepruning and postpruning:
– Prepruning is faster, but postpruning leads to more accurate trees.
Overfitting
• The goal of a good machine learning model is to generalize
well from the problem domain.

• Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms.

• If a decision tree is fully grown, it may lose


some generalization capability.
– This is a phenomenon known as overfitting.
• Overfitting refers to a model that models the training data too well.
• It happens when the model learns the detail and noise in the training data, compromising accuracy on unseen data.
Issues in decision trees

• Overfitting negatively impacts performance.

• A decision tree is said to overfit the training data if
– It results in poor accuracy when classifying test samples
– It has too many branches that reflect anomalies


Causes of Overfitting
• Presence of Noise
– Mislabeled instances may contradict the class
labels of
other similar records.

• Lack of Representative Instances


– Lack of representative instances in the training data can
prevent refinement of the learning algorithm.

• The Multiple Comparison Procedure


– Failure to compensate for algorithms that explore a large
number of alternatives can result in spurious fitting.
Avoiding Overfitting

• Ways to avoid overfitting:


1. Stop the training process before the learner reaches the point
where it perfectly classifies the training data.

2. Apply backtracking in the search for the optimal hypothesis. In the case of Decision Tree Learning, the backtracking process is referred to as ‘post-pruning’ of the overfitted tree.
Underfitting

• Underfitting refers to a model that can neither model the


training data nor generalize to new data.

• Underfitting: when model is too simple, both training and test


errors are large
• Easy to detect given a good performance metric

• Remedy: try out an alternative machine learning algorithm


Evaluation Metrics
Confusion Matrix
Cont..

• TP (True Positive): positive instances correctly predicted as positive
• FP (False Positive): negative instances incorrectly predicted as positive
• FN (False Negative): positive instances incorrectly predicted as negative
• TN (True Negative): negative instances correctly predicted as negative
Evaluation Metrics
Example confusion matrix for a binary classifier
Nearest Neighbours
• This technique assumes that data points that are similar can be found near one another.
• Based on learning by analogy– by comparing a given test tuple with
training tuples that are similar to it.
• It attempts to determine the distance between data points, which is
commonly done using Euclidean distance, and then assigns a category
based on the most frequent category or average.
• Training tuple represents a point in an n-dimensional space–
described by n attributes.
• When an unknown tuple is given, it searches the pattern space for the k training tuples (the k nearest neighbors) that are closest to the unknown tuple.
K-Nearest Neighbours
• Euclidean distance: the most commonly used measure
• In mathematics, the Euclidean distance between two
points in Euclidean space is the length of a line
segment between the two points.
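The standard Euclidean distance formula (the original equation did not survive extraction):

d(p, q) = √( Σ (i = 1 to n) (qi − pi)² )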

• Where:
• p, q = two points in Euclidean n-space
• pi, qi = the i-th coordinates of the points p and q
• n = the number of dimensions
• Determine the class from the nearest-neighbor list: take the majority vote of class labels among the k nearest neighbors.
• Optionally, weigh each vote according to distance.
K-Nearest Neighbours

• Example: 1
Name     Acidity Durability   Strength   Class
Type-1   7                    7          Bad
Type-2   7                    4          Bad
Type-3   3                    4          Good
Type-4   1                    4          Good

• Test data: AD=3, S=7, Class=? Check for K=1/2/3
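A worked sketch of this example, using Euclidean distance on (Acidity Durability, Strength) for the test point (3, 7):
– Distance to Type-1 (7, 7): √((3−7)² + (7−7)²) = 4.00
– Distance to Type-2 (7, 4): √((3−7)² + (7−4)²) = 5.00
– Distance to Type-3 (3, 4): √((3−3)² + (7−4)²) = 3.00
– Distance to Type-4 (1, 4): √((3−1)² + (7−4)²) ≈ 3.61
– K=1: nearest neighbor is Type-3 (Good) → Good
– K=2: Type-3 (Good) and Type-4 (Good) → Good
– K=3: {Good, Good, Bad} → majority vote is Good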


Example 2

• Check for K=3


K-Nearest Neighbours

• Choosing k for K-NN is just one of the many model selection


problems we face in machine learning

– If k is too small, sensitive to noise points


– If k is too large, neighborhood may include points from other
classes

• Normalize the values of each attribute before computing closeness – e.g., min-max normalization.
K-Nearest Neighbours

• Advantages
– Conceptually simple, easy to understand and explain
– Very flexible decision boundaries
– Not much learning at all
• Disadvantages
– It can be hard to find a good distance measure
– Irrelevant features and noise can be very detrimental
– Typically can not handle more than a few dozen attributes
– Computational cost: requires a lot of computation and memory
Bayes Learning
• Use a probability framework for fitting a predictive model to a
training dataset.
• Has two roles
– Provides learning algorithms
• Naïve Bayes learning
• Bayes Belief Network learning
– Provides conceptual framework
• Provides “gold standard” to evaluate other
learning
Algorithms.
Probability Theory

• Conditional (posterior) probabilities:


– Formalize the process of accumulating evidence and
updating probabilities based on new evidence
– Specify the belief in one proposition (event, conclusion, diagnosis, etc.) conditioned on another proposition (evidence, feature, symptom, etc.)
• P(A|B) is the conditional probability of A given evidence B:
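The defining formula (a standard identity, added here since the original was not preserved):

P(A|B) = P(A ᴧ B) / P(B)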
Probability Theory

• Conditional probabilities behave like standard probabilities:


– Within the range [0, 1]
– Sum to 1

• Can have P( conjunction of events |B)


– P(A ᴧ B ᴧ C |E) is the probability that the sentence “A ᴧ B ᴧ C” is true, conditioned on the evidence E being true.
Rules of Probability Theory

• Negation: probability event A being false:


P(¬A|B) = 1- P(A|B)
• Sum rule: probability of a disjunction of two events A and B:
P(A v B) = P(A) + P(B) – P(AᴧB)
• Product Rule: probability of a conjunction of two events A and B
P(A ᴧ B) = P(A|B) x P(B)
= P(B|A) x P(A)
• Chain rule: generalization of the product rule for any number of
events.
P(A ᴧ B ᴧ C) = P(A|B ᴧ C) x P(B|C) x P(C)
Rules of Probability Theory

• Conditional chain rule: variant of the chain rule for conditional


probabilities
P(A ᴧ B|C) = P(A|B ᴧ C) x P(B|C)
• Total Probability: summing out over mutually exclusive
events B1,…..,Bn
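The corresponding formula, for mutually exclusive and exhaustive events B1, …, Bn (standard form; the original did not survive extraction):

P(A) = Σ (i = 1 to n) P(A|Bi) x P(Bi)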
Bayes Classification Method

• The goal is to learn a model and use this model to predict.


– Learn a probability distribution

– Use the distribution to make a decision

• A learner tries to find the most probable hypothesis h from a set of hypotheses H, given the observed data.
• Bayesian classifiers (statistical classifiers) -- predict
class
membership probabilities -- based on Bayes’ theorem
Bayes Theorem

• Let X be a data tuple.

• Let H be some hypothesis such as that the data tuple X belongs


to a specified class C.
• We want to determine P(H|X) – the posterior probability of H conditioned on X.
• We are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.
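Bayes’ theorem gives this posterior as:

P(H|X) = P(X|H) x P(H) / P(X)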
Example of Bayes Theorem

• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is1/50,000
– Prior probability of any patient having stiff neck is1/20

• If a patient has stiff neck, what is the probability he/she has


meningitis?
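Applying Bayes’ theorem to the numbers above:

P(meningitis | stiff neck) = P(stiff neck | meningitis) x P(meningitis) / P(stiff neck)
= 0.5 × (1/50,000) / (1/20) = 0.0002

So even with a stiff neck, the probability of meningitis is only 0.02%.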
Naïve Bayes Classification

• A simple Bayesian classifier known as the Naïve Bayesian classifier
– Comparable in performance with decision trees and neural networks
• How it works
1. Each tuple is represented by an n-dimensional attribute vector X = {x1, x2, …, xn}
2. Suppose there are m classes {C1, C2, …, Cm}. The classifier predicts that X belongs to the class having the highest posterior probability conditioned on X:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayes Classification

3. For data with many attributes, assuming class-conditional independence:
P(X|Ci) = ∏ (k = 1 to n) P(xk|Ci)
What if an attribute value is continuous? Use a Gaussian distribution with mean µ and standard deviation σ.
4. Select the class with the highest conditional probability:
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for all j ≠ i


Estimating probability from Data
Example Naïve Bayes
Example Naïve Bayes
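The worked examples on the original slides did not survive extraction; as a stand-in, here is a minimal scikit-learn sketch of the steps above with made-up continuous data (Gaussian likelihoods per attribute, as mentioned earlier):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: two continuous attributes per tuple.
X = np.array([[6.0, 180], [5.9, 190], [5.6, 130], [5.4, 120]])
y = np.array(["male", "male", "female", "female"])

model = GaussianNB()          # estimates a mean and std per attribute per class
model.fit(X, y)

x_new = np.array([[5.8, 150]])
print(model.predict(x_new))          # class with the highest posterior
print(model.predict_proba(x_new))    # P(Ci | X) for each class
```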
Naïve Bayes Summary

• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Independence assumption may not hold for some attributes.

– Use other techniques such as Bayesian


Belief Networks (BBN)
naïve Bayes vs Bayes Nets

• Note that the naïve Bayes classifier is simply a special instance


of a Bayes Net.

• Naïve Bayes simplifies computation by making the assumption of class-conditional independence

• Bayesian belief networks allow the representation


of dependencies among subsets of attributes.
Bayesian Belief Network

• Belief Measure:

– In general, a person's belief in a statement a will depend on some body of knowledge K. We write this as P(a|K).
– P(a|K) represents a belief measure.
Training Bayesian Belief Networks

• Several algorithms exist for learning the network


topology from the training data given observable variables.

• If the network topology is known and the variables


are observable, then training the network is straightforward.

• When the network topology is given and some of


the variables are hidden, there are various techniques.
Training Bayesian Belief Networks
– Gradient descent
• Let D be a training set of data tuples, X1,X2, . . . , X|D|.

• Training the belief network means that we must learn the


values of the CPT entries.

• The CPT entries considered as the weight of the network–


analogous to the hidden weight of NN.

• A gradient descent strategy performs greedy hill-climbing--


at each iteration, the weights are updated and will
eventually converge to a local optimum solution
Application

Introduction SVM
• The support vector machine is a supervised machine-learning
model for classification and regression that is based on
kernels.
• It creates a hyperplane where the distance between two
classes of data points is at its maximum.
• The decision boundary is a hyperplane that separates the
classifications of data points.
• It plots each data item in the dataset in an N-dimensional
space.
• Where N is the number of features or attributes in the data.
• Next, find the optimal hyperplane to separate the data.
Introduction SVM

• Basic idea of support vector machines:


– Optimal hyperplane for linearly separable patterns
• A hyperplane is a linear decision surface that splits the space
into two parts

– For nonlinearly separable data-- transformations of original


data to map into new space – the Kernel function
Introduction SVM

• Important because of:


– Robust to very large number of variables and small samples

– Can learn both simple and highly complex classification models

– Employ sophisticated mathematical principles to


avoid overfitting
– Can be used for both classification and regression tasks
Main ideas of SVMs

• Find a linear decision surface (“hyperplane”) that can separate


patient classes and has the largest distance (i.e., largest “gap”
or “margin”) between border-line patients (i.e., “support
vectors”);
Main ideas of SVMs

• If linear decision surface does not exist, the data is mapped


into a much higher dimensional space (feature space)
• The feature space is constructed via a fancy
mathematical
projection called the kernel trick.
Support Vectors
• Support vectors are the data points that lie closest to
the decision surface
– Most difficult to classify

– Critical elements of the training set

• Removing them would change the position of the dividing hyperplane.


Support vectors

Support Vectors: touch the boundary of the margin


Support Vector Machine

• SVMs maximize the margin around


the separating hyperplane.
• The decision function is fully
specified by a subset of training
samples, the support vectors.
• 2-Ds, it’s a line.
• 3-Ds, it’s a plane.
• In more dimensions, call it a
hyperplane.
Input and Outputs in SVM

• Input: set of (input, output) training pair samples; call the


input sample features x1, x2…xn, and the output result y.

• Output: set of weights w (or wi), one for each feature, whose
linear combination predicts the value of y. (just like neural
nets)
SVM-Mathematical Concepts

• Representing samples geometrically.
Purpose of vector representation
• Representing each sample/patient as a vector allows us to geometrically represent the decision surface that separates two groups of samples/patients.

• In order to define the decision surface, we need to introduce


some basic math elements.
Hyperplane Example

• Which one is better? B1 or B2?


• How do you define better?
Hyperplane Example

• Optimal classification occurs when a hyperplane provides


maximal distance to the nearest training data points.
Intuitively, this makes sense, as if the points are well
separated, the classification between two groups is much
clearer.
SVM- the widest street approach

- Linearly Separable Case -

A separating hyperplane can be written as W · X + b = 0
Where:
- W = {w1, w2, . . . , wd} is a weight vector
- b is a bias (scalar)
Linear and non-linear separable data

Which one is easy to separate?


Kernel trick
• How to efficiently compute the hyperplane that separates two
classes with the largest street width?

• How to separate linearly inseparable data? – an optimization problem

• Kernel trick: for linearly inseparable cases in SVM, kernel trick is


a commonly used technique.
Kernel trick

– Is a nonlinear transformation of samples from the original space to a feature space with higher or even infinite dimension, so as to make the problem linearly separable.
– A nonlinear mapping function maps data in the original (or primal) space X into a feature space F of higher (even infinite) dimension.
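A minimal scikit-learn sketch contrasting a linear kernel with an RBF kernel on a toy nonlinearly separable problem (the data and parameter values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Toy XOR-like data: not separable by any single straight line.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# An RBF kernel can carve a nonlinear boundary that a linear kernel cannot.
print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
```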
Strong points of SVM-based learning
methods

• Empirically achieve excellent results in high-dimensional data


with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex
nonlinear functions by using “kernel trick”
• Robust to outliers and noise
• Do not require direct access to data; work only with dot products of data points.
Weak points of SVM-based learning
methods

• Interpretation is less straightforward than classical statistics


• Lack of parametric statistical significance tests
• Has several key parameters like C, kernel function,
and
Gamma that all need to be set correctly
Ensemble Method

• A set of classifiers whose individual decisions are combined in


some way (typically by weighted or unweighted voting) to
classify new examples.

• One of the most active areas of research in supervised learning has been the study of methods for constructing good ensembles of classifiers.

• Ensembles are often much more accurate than the individual classifiers that make them up.
Ensemble Method
• Ensemble methods can take the form of:
– Using different algorithms

– Using the same algorithm with different settings, or

– Assigning different parts of the dataset to different


classifiers.
Ensemble Method
• Examples: Imagine we have an ensemble of three classifiers:
{h1,h2,h3} and consider a new case x.

– If the three classifiers are identical ( not diverse), then when


h1(x) is wrong, h2(x) and h3(x) will also be wrong.

– If the errors made by classifiers are uncorrelated, then when


h1(x) is wrong, h2(x) and h3(x) may be correct – the majority vote then classifies x correctly.

• Decisions can be combined by many methods,


including averaging, voting, and probabilistic methods.
Ensemble Method

• Simplest approach:
1. Generate multiple classification models
2. Each votes on test instance
3. Take majority as classification
Ensemble Method

• Differ in training strategy and method combination


– Bagging: parallel training with different training sets

– Boosting: sequential training, iteratively re-


weighting training
examples so current classifier focuses on hard
examples

– Mixture of experts: parallel training with an objective that encourages specialization among the individual learners
How Bagging Works
Bagging

• Each member of the ensemble is constructed from a different


training dataset.
– Predictions combined either by uniform averaging or voting over
class labels
– Dataset is generated by sampling from the total N data
examples, choosing N items uniformly at random with
replacement.
– Each sample is known as a bootstrap sample
• More advanced methods of bagging include random forests
Bagging
• Random Forests grows many classification trees.
• To classify a new object from an input vector, put the input
vector down each of the trees in the forest.

• Each tree gives a classification, and we say the tree "votes" for
that class.

• The forest chooses the classification having the most votes


(over all the trees in the forest)
• A random forest produces good predictions that can be
understood easily and it can perform both regression, and
classification tasks.
Random Forest
• The Random Forest algorithm combines the output of multiple
(randomly created) Decision Trees to generate the final output.
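A brief scikit-learn sketch of this voting behaviour (the dataset and parameter values are illustrative, not part of the original slides):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each grown on a bootstrap sample with random feature subsets;
# the forest's prediction is the majority vote over all trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)  # which variables matter
```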
Bagging
• Features of Random Forests
– It runs efficiently on large databases
– It can handle thousands of input variables without variable deletion
– It gives estimates of which variables are important in the classification
– It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
– It has methods for balancing error in class-imbalanced data sets
Random Forests by Leo Breiman and Adele Cutler
Bagging
Boosting

• Boosting algorithms also work by manipulating the training set; they combine weak learners into strong learners by creating sequential models such that the final model has the highest accuracy.
• Boosting makes new classifiers focus on data that was previously misclassified by earlier classifiers.
• Constructing each ensemble member with some measurement
ensuring that it is substantially different from the other members.
– Alters the distribution of training examples to make more accurate
predictions where previous predictors have made errors.
– Adaboost is the most well known of the Boosting family of algorithms
Boosting

• Probably one of the most influential ideas in machine learning in


the last decade.

• Is a way of converting a “weak” learning model (one that behaves only slightly better than chance) into a “strong” learning model (one that behaves arbitrarily close to perfect).

• Boosting is different from bagging because the output is calculated


from a weighted sum of all classifiers.

• The weights aren’t equal as in bagging but are based on


how successful the classifier was in the previous iteration.
Boosting

• Strong theoretical result, but also lead to a very powerful and


practical algorithm which is used all the time in real world
machine learning.

Rich Zemel, ML & DM- Ensemble method


Weighting

• How to weight each training case for classifier m


Boosting
How Boosting Works
Boosting
• Gradient boosted decision trees proven to be effective on a
wide range of datasets for classification and regression
• Combines multiple decision trees to create a more
powerful
model
• Build trees in a serial manner, where each tree tries to
correct
the mistakes of the previous one.
• Often use very shallow trees, of depth one to five, which
makes the model smaller in terms of memory and
makes predictions faster.
• A bit more sensitive to parameter settings than random
forests
Adaboost
• Short for adaptive boosting

• Trains models sequentially with a new model trained at each


round.

• At each round, misclassified examples are identified and fed back into the start of the next round.

• The idea is that subsequent models should be able to


compensate for errors made by earlier models.

• The key difference from bagging is that, at each round, bagging uses a uniform distribution over training examples, while AdaBoost adapts a non-uniform distribution.
Adaboost

• AdaBoost assigns α values to each of the classifiers based


on the error of each classifier.
• The error ε is given by

• α is given by
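In the standard AdaBoost formulation (reconstructed here, since the original formulas did not survive extraction), these are:

ε = (number of misclassified examples) / (total number of examples)
α = ½ ln((1 − ε) / ε)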
Adaboost
• Pros: Low generalization error, easy to code, works with most
classifiers, no parameters to adjust

• Cons: Sensitive to outliers

• Works with: Numeric values, nominal values


• Reading assignment --- read the maths of
Bagging, Boosting and mixtures of
experts
Artificial Neural Network
ANN

• Neural networks are nonlinear models inspired by the


structure of neural networks in the brain.
• We use ANN:
– They are extremely powerful computational devices

– Massive parallelism makes them very efficient

– They are particularly fault tolerant


How ANN Works

– Receives inputs

– Combines them in some way

– Performs a generally nonlinear operation on the result

– Outputs the final result


Artificial Neural Networks: the dimensions
How ANN Works

• The three basic components of the (artificial) neuron are:


– synapses or connecting links
– an adder that sums the input signals
– an activation function

• Each neuron is characterized by its weight, bias and activation


function.
How ANN works
Activation functions
Perceptron

Perceptron = a neuron whose input is the dot product of W and X and which uses a step function as its transfer function
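A minimal sketch of that computation in Python; the weights below are invented to implement a logical AND, and the step function thresholds at 0:

```python
import numpy as np

def step(z):
    # Step (threshold) transfer function.
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    # Output = step(dot product of weights and inputs, plus bias).
    return step(np.dot(w, x) + b)

# Hypothetical weights implementing a logical AND of two binary inputs.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
```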
Perceptron: Example 1
Perceptron: Example 2
NNs: Architecture
Classification Back Propagation

• Back Propagation learns by iteratively processing a set of training data (samples).

• For each sample, weights are modified to minimize the error


between network’s classification and actual classification.
Steps in Back Propagation Algorithm

• STEP ONE: initialize the weights and biases.

• The weights in the network are initialized to small random numbers.
• Each unit has a bias associated with it.
• The biases are similarly initialized to small random numbers.

• STEP TWO: feed the training sample.


Steps in Back Propagation Algorithm

• STEP THREE: propagate the inputs forward; we compute the


net input and output of each unit in the hidden and output
layers.

• STEP FOUR: back propagate the error.

• STEP FIVE: update weights and biases to reflect the propagated errors.

• STEP SIX: terminating conditions.


ANN Application

• Real world applications:


– Financial modelling:- predicting the stock market
– Time series prediction:- climate, weather, seizures
– Computer games:- intelligent agents, chess, backgammon
– Robotics:- autonomous adaptable robots
– Pattern recognition:- speech recognition, seismic activity, sonar signals
– Data analysis:- data compression, data mining
– Bioinformatics:- DNA sequencing, alignment
Weakness of ANN

• The complex internal structure shows black-box behavior: it is very hard to get an idea of the meaning of the internal computations.
• Another feature of neural networks is their random
behavior.
– The training process contains random elements. When this is
repeated, the same input set may yield very different
networks.
– Sometimes they differ in performance, one showing good
behavior, while others behave badly.
• Neural Network needs long time for training.
Deep Learning
Deep Learning

• Deep learning deals with a “neural network with more than


two layers.”
• Deep learning is about learning feature representation in a
compositional manner--- composition of non-linear
transformation of the data.
• Goal: Learn useful representations, aka features, directly from
data.
• There has been a dramatic increase in the performance of
recognition systems due to the introduction of deep
architectures.
Deep NNs

•Deep learning has dramatically improved state-of-the-art in:


•Speech and character recognition
•Visual object detection and recognition
•Convolutional neural nets most commonly applied to analyze
visual imagery such as for processing of images, video, speech
and signals (time series) in general.
•Recurrent neural nets for processing of sequential data (speech,
text).
Deep Learning

Deep learning vs machine learning:


- Do not necessarily need
structured/labeled data to classify
- Do not require human intervention,
which eventually learn through
their own errors
- Deep learning requires much more
data than a traditional machine
learning algorithm
- It is the quality of data which
ultimately determines the quality of
the result.
Cont...

• Neural network mimics the way the biological


neurons in the human brain work, while a deep
learning network comprises several layers of neural
networks.
• A Neural network comprises an input layer, a hidden layer, and an
output layer.

• Deep learning, on the other hand, is made up of several hidden


layers of neural networks that perform complex operations on
massive amounts of structured and unstructured data.
Properties of Deep Learning

• There are three key factors in deep learning


– Architectures: deeper architectures are able to better capture
invariant properties of the data compared to their shallow
counterparts.
– Generalization and Regularization techniques: ability to
generalize from a small number of training examples
– Optimization algorithms
• minimize loss using backpropagation
Common Architectural Principles of Deep
Networks
• The core components
– Parameters: relate directly to the weights on the connections
in the network
• Methods of optimization such as gradient descent to find
good values for the parameter vector to minimize loss across
our training dataset.
• The biggest change in deep networks with respect to
parameters is how the layers are connected in the different
architectures
– Layers: fundamental architectural unit in deep networks
Common Architectural Principles of Deep
Networks
• Activation function: Commonly used functions
– Sigmoid
– Tanh
– Hard tanh
– Rectified linear unit (ReLU) (and its variants)
• A more continuous distribution of input data is generally best
modeled with a ReLU activation function
• Use the tanh activation function (if the network isn’t very deep) in the event that ReLU does not achieve good results.
Common Architectural Principles of Deep
Networks
• Loss functions: quantify the agreement between the predicted
output (or label) and the ground truth output;
• Loss function is a method of evaluating how well your algorithm
is modeling your dataset.
– Use loss functions to determine the penalty for an incorrect
classification of an input vector.
• Squared loss
• Logistic loss
• Hinge loss
• Negative log likelihood
Common Architectural Principles of Deep
Networks
• Optimization methods: training a model to find the best set of
values for the parameter vector of the model.
• Minimize the loss function with respect to the parameters of
our prediction function
– First Order -- the Jacobian matrix
– Second Order -- approximate the Hessian.
Common Architectural Principles of Deep
Networks
• Hyperparameters: any configuration setting that is free to be
chosen by the user
Common Architectural Principles of
Deep Networks
• Hyperparameters fall into several categories:
– Layer size-- number of neurons in a given layer
– Magnitude (momentum, learning rate): how fast we change the
parameter vector as we move through search space
– Regularization: is a measure taken against overfitting
– Activations (and activation function families)
– Weight initialization strategy
– Loss functions
– Settings for epochs during training (mini-batch size)
– Normalization scheme for input data (vectorization)
Major Architectures of Deep Networks
• Deep learning as neural networks with a large number of
parameters and layers in one of four fundamental network
architectures:

– Unsupervised pretrained networks

– Convolutional neural networks

– Recurrent neural networks

– Recursive neural networks


Convolutional neural networks
• The goal of a CNN is to learn higher-order features in the data
via convolutions.
• CNNs transform the input data from the input layer through
all connected layers into a set of class scores given by the
output layer.
• Parametric models that perform sequential operations on their input data. Each such operation consists of a linear transformation, say, a convolution of the input, followed by a pointwise nonlinear “activation function”, e.g., a ReLU or sigmoid.
Convolutional neural networks

• Input layer
• Feature-extraction (learning) layers: have a general repeating
pattern of the sequence – convolution and pooling layers
• Classification layers
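A compact sketch of that layer pattern using the Keras API (the input shape, layer sizes, and class count are illustrative, not a tuned architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                  # input layer (e.g., grayscale images)
    layers.Conv2D(32, (3, 3), activation="relu"),     # feature extraction: convolution
    layers.MaxPooling2D((2, 2)),                      # feature extraction: pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),              # classification layers
    layers.Dense(10, activation="softmax"),           # class scores
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```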
CNN Common Layers
• Convolutional Layer: first layer to extract features from
an input image.
• A filter is applied to the input, resulting in an activation.
• Repeated application results in a map of activations called a feature map.

The layer will compute a dot product between the region of the neurons in the
input layer and the weights to which they are locally connected in the output
layer.
CNN Common Layers

• Components of convolutional layers:


– Filters: a function that has a width and height smaller than
the width and height of the input volume.

• Compute the output of the filter by producing the dot product of


the filter and the input region

• Convolutional kernel works by dividing the image into small slices


commonly known as receptive fields
CNN Common Layers

• Components of convolutional layers:


– Activation maps: slide each filter across the spatial
dimensions (width, height) of the input volume produces a
two-dimensional output called an activation map for that
specific filter.

– Parameter sharing: controls the total parameter count so that fewer resources are used to learn the training dataset.
CNN Common Layers

• Components of convolutional layers:


– Layer-specific hyperparameters: dictate the spatial arrangement and size of the output volume from a convolutional layer
• Filter (or kernel) size (field size)
• Output depth
• Stride
• Zero-padding
CNN Common Layers

• Pooling Layer: pooling layers were developed to reduce the


number of parameters needed to describe layers deeper in
the network.
• Reduces the dimensionality of each map while retaining the important information
• Useful for extracting dominant features
• Inserted between successive convolutional layers
• Pooling layers use filters to perform the downsampling process on the input volume

These two layers find a number of features in the images and progressively construct
higher-order features. This corresponds directly to the ongoing theme in deep learning by
which features are automatically learned as opposed to traditionally hand engineered.
CNN Common Layers

• Fully Connected Layer: takes an input volume (whatever the


output is of the conv or ReLU or pool layer preceding it)
– Compute class scores that we’ll use as output of the
network

• Softmax Layer: assigns decimal probabilities to each class in a


multi-class problem.
CNN Architecture
• There are various architectures of CNNs
CNN Architecture
• LeNet
– One of the earliest successful architectures of CNNs
– Developed by Yann Lecun
– Originally used to read digits in images
• AlexNet
– Helped popularize CNNs in computer vision
– Developed by Alex Krizhevsky, Ilya Sutskever, and Geoff
Hinton
– Won the ILSVRC 2012
CNN Architecture
• ZF Net
– Won the ILSVRC 2013
– Developed by Matthew Zeiler and Rob Fergus
– Introduced the visualization concept of the Deconvolutional
Network
• GoogLeNet
– Won the ILSVRC 2014
– Developed by Christian Szegedy and his team at Google
– Codenamed “Inception,” one variation has 22 layers
CNN Architecture
• VGGNet
– Runner-Up in the ILSVRC 2014
– Developed by Karen Simonyan and Andrew Zisserman
– Showed that depth of network was a critical factor in good
performance
• ResNet
– Trained on very deep networks (up to 1,200 layers)
– Won first in the ILSVRC 2015 classification task
Application of Deep Learning

• Text-to-speech synthesis
• Language identification
• Large vocabulary speech recognition
• Medium vocabulary speech recognition
• English-to-French translation
• Audio onset detection
• Social signal classification
Genetic Algorithm
• Inspired by Charles Darwin’s theory of natural evolution

• Genetic algorithms mimic an evolutionary natural selection process.

• Generations of solutions are evaluated according to a fitness value .

• Only those candidates with high fitness values are used to create
further solutions via crossover and mutation procedures.
• Provide efficient, effective techniques for optimization and machine
learning applications.
Genetic Algorithm
• Not fast in some sense; but sometimes more robust;
scale relatively well
• Have extensions including:
– Genetic Programming (GP) (LISP-like function trees),
– Learning classifier systems (evolving rules),
– Linear GP (evolving “ordinary” programs), many others
Genetic Algorithm

• An individual is characterized by a set of parameters


(variables) known as Genes. Genes are joined into a string to
form a Chromosome (solution).
• The genes of an individual are represented using a string (binary values), in terms of an alphabet

• Encode the genes in a


chromosome

• The term chromosome refers


to a numerical value or values
that represent a candidate
solution to the problem that
the genetic algorithm is trying
to solve
Genetic Algorithm
• A genetic algorithm begins with a randomly chosen assortment of chromosomes, which serves as the first generation (initial population).
• Then each chromosome in the population is evaluated by the
fitness function to test how well it solves the problem
• If f is a non-negative fitness function, then the probability that
chromosome C53 is chosen to reproduce might be
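In fitness-proportionate (roulette-wheel) selection, that probability is typically (reconstructed; the original formula was not preserved):

P(C53 chosen) = f(C53) / Σ f(Ci), summing over all chromosomes Ci in the population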

Selection operator chooses some of the


chromosomes for reproduction based on
a probability distribution defined by the
user.
Genetic Algorithm
• The basic components common to almost all genetic algorithms are:
– A fitness function for optimization
• Tests and quantifies how ‘fit’ each potential solution is
• One of the most pivotal parts of the algorithm
• The fitness function returns a single numerical fitness value, which is supposed to be proportional to the ability of the individual that the chromosome represents.
– A population of chromosomes
• Refers to a numerical value or values that represent a candidate
solution to the problem that the genetic algorithm is trying to solve
• Each candidate solution is encoded as an array of parameter values
Genetic Algorithm

• Selection Methods: can be broadly classified into two classes


as follows.
– Fitness Proportionate Selection: includes methods such as roulette-
wheel selection and stochastic universal selection
– Ordinal Selection: includes methods such as tournament selection and
truncation selection
Genetic Algorithm

– Crossover to produce next generation of chromosomes

Offspring are created by exchanging the genes of parents among themselves


until the crossover point is reached
Basic Genetic algorithm operators

- For each couple we first


decide (using some pre-
defined probability, for
instance 0.6) whether to
actually perform the
crossover or not

- If we decide to actually
perform crossover, we
randomly extract the
crossover points
Genetic algorithm

- Random mutation of chromosomes in new generation

• The mutation operator randomly


flips individual bits in the new
chromosomes (turning a 0 into a 1
and vice versa).

• Mutation occurs to maintain


diversity within the population and
prevent premature convergence.
Genetic algorithm

• Genetic algorithms are iterated until the fitness value


of the “best-so-far” chromosome stabilizes and
does not change for many generations.

• Whole process of iterations is called a run


Example: Maximizing a Function of One
Variable
• Consider the problem of maximizing a function of x, where x is allowed to vary between 0 and 31.
- Encode the possible values of x as chromosomes – represent x as a binary integer of length 5
- Chromosomes for our genetic algorithm will be sequences of 0's and 1's with a length of 5 bits
- 0 (00000) to 31 (11111)
(An Introduction to Genetic Algorithms – Jenna Carr)
Example: Maximizing a Function of One
Variable
• Select an initial population of 10 chromosomes at random
Example: Maximizing a Function of One
Variable
• Select the chromosomes that will reproduce based on their
fitness values, using the following probability
Example: Maximizing a Function of One
Variable
• Each bit of the new chromosomes mutates with a low probability (0.001).
• With 50 total transferred bit positions, we expect 50 × 0.001 = 0.05 bits to mutate.
• After selection, crossover, and mutation are complete, the new population is tested with the fitness function.
• Both the maximum fitness and average fitness of the
population have increased after only one generation.
Genetic algorithm
• GAs are stochastic search methods that could in principle run forever.
• A termination criterion is needed:
– Set a limit on the number of fitness evaluations or the computer clock time

– To track the population’s diversity and stop when this


falls below a preset threshold
Application of Deep Learning

• Arabic handwriting recognition


• TIMIT phoneme recognition
• Optical character recognition
• Image caption generation
• Video-to-textual description
• Syntactic parsing for natural language processing
• Photo-real talking heads
Thank You
