Supervised Learning
Classification
• Predicts categorical class labels (discrete or nominal)
• Constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Regression
• Regression is a type of Supervised Learning task in which
the output has a continuous value.
• The term regression is used when you try to find the
relationship between variables.
• It is used to understand the relationship between dependent and independent variables.
Classification vs. Regression
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
Process (2): Prediction by the Model
Classification Tasks
• Given:
– A set of classes
– Instances (examples) of each class, described as a set of features or attributes and their values
• Generate: a method (aka model) that, given a new instance, will determine its class
Classification Techniques
• Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets
• Ensemble Classifiers
– Boosting, Bagging, Random Forests
Linear Regression
• Linear regression uses the relationship between the data points to draw a straight line through them.
• When the outcome and all the attributes are numeric, linear
regression is a natural technique to consider.
Y = a + bX
• Where Y is the dependent variable (the variable that goes on the Y axis), X is the independent variable (plotted on the X axis), b is the slope of the line, and a is the y-intercept.
• The idea is to express the class as a linear combination of the attributes, with predetermined weights:
x = w0 + w1·a1 + w2·a2 + . . . + wk·ak
Where:
• x = the class
• a1 to ak = attribute values
• w0 to wk = weights, calculated from the training data
Linear Regression
• Linear Regression finds the relationship between the input and output
data by plotting a line that fits the input data and maps it onto the
output.
• This line represents the mathematical relationship between the independent input variables and the dependent output, and is called the Line of Best Fit.
• Consider the data that is displayed below, which tells you the sales
corresponding to the amount spent on advertising.
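To make this concrete, here is a minimal sketch of fitting Y = a + bX by ordinary least squares; the advertising/sales figures below are made-up stand-ins for the data the slide refers to.

```python
# A minimal sketch of fitting Y = a + bX with ordinary least squares.
# The advertising/sales numbers below are hypothetical, for illustration only.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # advertising spend
Y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])   # observed sales

b, a = np.polyfit(X, Y, deg=1)             # slope b and intercept a
print(f"Y = {a:.2f} + {b:.2f}X")
print("prediction for X=6:", a + b * 6)   # read a value off the line of best fit
```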
Logistic Regression
• From linear to logistic regression: the linear output is passed through the sigmoid function.
• In logistic regression the weighted sum of the inputs is passed through the sigmoid activation function, and the curve obtained is called the sigmoid curve.
• Decision boundary: predict class 1 when the sigmoid output is at least 0.5, class 0 otherwise (see the sketch below).
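A minimal sketch of this pipeline, with made-up weights and input purely for illustration:

```python
# A minimal sketch of the logistic model: pass the weighted sum of inputs
# through the sigmoid, then threshold at 0.5 to get the decision boundary.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])     # hypothetical learned weights
b = 0.1                       # hypothetical bias

x = np.array([2.0, 1.0])      # one input instance
p = sigmoid(np.dot(w, x) + b) # sigmoid of the weighted sum
label = 1 if p >= 0.5 else 0  # decision boundary at p = 0.5
print(f"P(class=1) = {p:.3f}, predicted class = {label}")
```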
Decision Tree Induction
1. Select the attribute that performs best and use it as the root of the tree;
• In other words:
– We want a measure that prefers attributes that have a high degree of "order"
• Maximum order: all examples are of the same class
• Minimum order: all classes are equally likely
• Needs a measure of impurity
Measures of Node Impurity
• Information Gain
– Determine how informative an attribute is
– Attributes are assumed to be categorical
• Gini Index
– Attributes are assumed to be continuous
– Assume there exist several possible split values for each attribute
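A small sketch of both impurity measures, computed from a node's class counts (the counts are illustrative):

```python
# Entropy and Gini index of a node, from the counts of each class at that node.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(entropy([5, 5]), gini([5, 5]))   # maximum impurity: 1.0 and 0.5
print(entropy([10, 0]), gini([10, 0])) # pure node (maximum "order"): zero impurity
```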
Information Gain
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) · Entropy(Sv), where Entropy(S) = −Σi pi log2 pi
K-Nearest Neighbours: Distance Measure
The Euclidean distance between two points is
d(p, q) = √( (p1 − q1)² + (p2 − q2)² + . . . + (pn − qn)² )
• Where:
• p, q = two points in Euclidean n-space
• pi, qi = the i-th coordinates of p and q
• n = the dimensionality of the space
• Determine the class from the nearest-neighbor list: take the majority vote of class labels among the k nearest neighbors
• Weigh the vote according to distance (closer neighbors count more).
K-Nearest Neighbours
• Example: 1
Name     Acidity Durability   Strength   Class
Type-1   7                    7          Bad
Type-2   7                    4          Bad
Type-3   3                    4          Good
Type-4   1                    4          Good
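A minimal k-NN sketch on the table above; the query point (3, 7) is a hypothetical new type, chosen for illustration:

```python
# Classify a new sample by majority vote among its k nearest neighbours.
import math
from collections import Counter

# (acidity durability, strength) -> class, from the table above
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

def knn_classify(query, data, k=3):
    # Sort training samples by Euclidean distance to the query
    by_dist = sorted(data, key=lambda s: math.dist(s[0], query))
    # Majority vote among the k nearest neighbours
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((3, 7), train))   # -> "Good"
```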
• Advantages
– Conceptually simple, easy to understand and explain
– Very flexible decision boundaries
– Not much learning at all
• Disadvantages
– It can be hard to find a good distance measure
– Irrelevant features and noise can be very detrimental
– Typically cannot handle more than a few dozen attributes
– Computational cost: requires a lot of computation and memory
Bayes Learning
• Use a probability framework for fitting a predictive model to a
training dataset.
• Has two roles
– Provides learning algorithms
• Naïve Bayes learning
• Bayes Belief Network learning
– Provides a conceptual framework
• Provides a "gold standard" to evaluate other learning algorithms
Probability Theory
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– The prior probability of any patient having meningitis is 1/50,000
– The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what is the probability he/she has meningitis?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
• What if an attribute value Xk is continuous? Model P(Xk | Ci) with a Gaussian distribution with mean µ and standard deviation σ:
P(Xk | Ci) = (1 / (√(2π) σ)) · exp( −(Xk − µ)² / (2σ²) )
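A minimal sketch of this estimate: fit a Gaussian to the attribute's values within class Ci, then evaluate its density at the new value. The sample values are made up.

```python
# Gaussian likelihood P(Xk | Ci) for a continuous attribute in naive Bayes.
import math
import statistics

values_in_class = [5.1, 4.9, 5.4, 5.0, 5.2]   # hypothetical attribute values for class Ci
mu = statistics.mean(values_in_class)
sigma = statistics.stdev(values_in_class)

def gaussian_density(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_density(5.0, mu, sigma))   # likelihood P(Xk = 5.0 | Ci)
```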
Introduction SVM
• The support vector machine is a supervised machine-learning
model for classification and regression that is based on
kernels.
• It creates a hyperplane where the distance between two
classes of data points is at its maximum.
• The decision boundary is a hyperplane that separates the
classifications of data points.
• It plots each data item in the dataset in an N-dimensional space, where N is the number of features or attributes in the data.
• Next, find the optimal hyperplane to separate the data.
Introduction SVM
• Output: set of weights w (or wi), one for each feature, whose
linear combination predicts the value of y. (just like neural
nets)
SVM-Mathematical Concepts
• Samples are represented geometrically, as vectors.
Purpose of vector representation
• Representing each sample/patient as a vector allows us to geometrically represent the decision surface that separates two groups of samples/patients.
• The decision surface is a hyperplane W · X + b = 0
Where:
– W = {w1, w2, . . . , wd} is the weight vector
– X is the input vector
– A nonlinear mapping function Φ maps data in the original (or primal) space into a higher (even infinite) dimensional space F
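A minimal sketch of a kernel SVM, assuming scikit-learn is available; the toy 2-D data for the two classes is made up for illustration.

```python
# Kernel SVM: the RBF kernel implicitly maps the data into a
# higher-dimensional space F where a separating hyperplane is found.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))   # -> [0 1]
print(clf.support_vectors_)                    # the points that define the margin
```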
Strong points of SVM-based learning methods
Ensemble Methods
• Simplest approach:
1. Generate multiple classification models
2. Each model votes on the test instance
3. Take the majority vote as the classification
Ensemble Method
• Each tree gives a classification, and we say the tree "votes" for
that class.
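A minimal sketch of this voting step; the three "models" here are stand-in functions, purely for illustration.

```python
# Majority vote: each model classifies the instance, most common label wins.
from collections import Counter

def vote(models, instance):
    predictions = [m(instance) for m in models]   # each model casts a vote
    return Counter(predictions).most_common(1)[0][0]

models = [lambda x: "A", lambda x: "B", lambda x: "A"]
print(vote(models, instance=None))   # -> "A" (two votes to one)
```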
Adaboost
• The weight α of each weak classifier is given by
α = ½ ln( (1 − ε) / ε )
where ε is the classifier's weighted error rate.
• Pros: Low generalization error, easy to code, works with most
classifiers, no parameters to adjust
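A minimal sketch of the α computation and the sample re-weighting it drives, assuming a weak classifier with a hypothetical weighted error rate ε = 0.2:

```python
# AdaBoost: classifier weight alpha, then sample re-weighting.
import math

eps = 0.2                                  # hypothetical weighted error rate
alpha = 0.5 * math.log((1 - eps) / eps)    # weight of this weak classifier
print(f"alpha = {alpha:.3f}")

# Misclassified samples gain weight, correct ones lose it; then normalise.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]        # hypothetical per-sample outcomes
weights = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]
print(weights)   # the misclassified sample now carries more weight
```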
Neural Networks
• A neuron:
– Receives inputs
– Passes their weighted sum through an activation function
CNN Architecture
• Input layer
• Feature-extraction (learning) layers: have a general repeating pattern of the sequence – convolution and pooling layers
• Classification layers
CNN Common Layers
• Convolutional Layer: the first layer, which extracts features from an input image.
• It applies a filter to the input, which results in an activation.
• Repeated application of the filter results in a map of activations called a feature map.
The layer computes a dot product between each region of the neurons in the input layer and the weights to which they are locally connected in the output layer.
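A minimal sketch of that operation: slide a filter over the input, take the local dot product at each position, and collect the results into a feature map. The 4×4 "image" and 2×2 filter are made up.

```python
# The core of a convolutional layer: local dot products producing a feature map.
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)   # toy input image
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy filter weights

h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        region = image[i:i + 2, j:j + 2]
        feature_map[i, j] = np.sum(region * kernel)  # dot product of region and weights

print(feature_map)   # 3x3 map of activations
```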
CNN Common Layers
These two layers find a number of features in the images and progressively construct higher-order features. This corresponds directly to the ongoing theme in deep learning, by which features are automatically learned, as opposed to traditionally hand-engineered.
Deep Learning Applications
• Text-to-speech synthesis
• Language identification
• Large vocabulary speech recognition
• Medium vocabulary speech recognition
• English-to-French translation
• Audio onset detection
• Social signal classification
Genetic Algorithm
• Inspired by Charles Darwin’s theory of natural evolution
• Only those candidates with high fitness values are used to create
further solutions via crossover and mutation procedures.
• Provide efficient, effective techniques for optimization and machine
learning applications.
Genetic Algorithm
• Not especially fast, but often more robust; scales relatively well
• Have extensions including:
– Genetic Programming (GP) (LISP-like function trees),
– Learning classifier systems (evolving rules),
– Linear GP (evolving “ordinary” programs), many others
Genetic Algorithm
• If we decide to actually perform crossover, we randomly extract the crossover points
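A minimal sketch of single-point crossover: with some crossover probability, pick a random point and swap the tails of the two parents. The bit-string parents and the rate are illustrative choices.

```python
# Single-point crossover on two bit-string chromosomes.
import random

def crossover(parent1, parent2, p_cross=0.7):
    if random.random() < p_cross:                    # decide whether to cross over
        point = random.randint(1, len(parent1) - 1)  # random crossover point
        return (parent1[:point] + parent2[point:],   # swap the tails
                parent2[:point] + parent1[point:])
    return parent1, parent2                          # no crossover: copy parents

child1, child2 = crossover("110100", "001011")
print(child1, child2)
```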