22ECSC306
Learning! Knowledge! Intelligence!
• Learning is the ability to:
  • see patterns
  • recognize patterns
  • add constraints
• Knowledge is the variability in constraints [patterns with variability].
• Intelligence is invoking the knowledge on a given test case.
Machines & Humans
• Von Neumann architecture bottlenecks:
  • Serial
  • Unambiguous
• Architecture of the brain – simulated by ANNs
• Process & Represent / Represent & Process:
  • Humans: Process & Represent
  • Machines: Represent & Process
Learning!!
• Human Intelligence: Biological Neural Network
• Artificial Intelligence: Von Neumann architecture, Artificial Neural Network
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
-- Tom Mitchell, Carnegie Mellon University
Machine Learning

AI, ML and DL

Types of Machine Learning

Classical Machine Learning techniques
Knowledge Discovery (KDD) Process
• Data mining is the core of the knowledge discovery process.
• The KDD pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection (task-relevant data) → Data Mining → Pattern Evaluation → Knowledge.
Machine Learning
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program
Machine Learning
• Past: Training Data → model/predictor
• Future: Testing Data → model/predictor
Machine Learning
• Machine learning is more like gardening:
  • Seeds = Algorithms
  • Nutrients = Data
  • Gardener = You
  • Plants = Programs
ML in a Nutshell
• Tens of thousands of machine learning algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has three components:
  • Representation
  • Evaluation
  • Optimization
Machine Learning - Types
• Supervised (inductive) learning
• Training data includes desired outputs
• Unsupervised learning
• Training data does not include desired outputs
• Semi-supervised learning
• Training data includes a few desired outputs
• Reinforcement learning
• Rewards from sequence of actions
Machine Learning - Supervised
In supervised learning, the training data you feed to the algorithm includes the desired
solutions, called labels.
Machine Learning - Supervised
• Regression: to predict a target numeric value.
Example: to predict the price of a car, given a set of features (mileage, age, brand, etc.) called predictors.
Figure 2: Regression
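A minimal sketch of such a regression in Python with scikit-learn; the car data and the prediction query below are made up for illustration:

```python
# Regression sketch: predict a numeric target (price) from predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictors: [mileage in km, age in years]; target: price.
X = np.array([[120000, 10], [60000, 5], [30000, 2], [90000, 7]])
y = np.array([3000, 8000, 15000, 5500])

model = LinearRegression().fit(X, y)
print(model.predict([[50000, 4]]))  # predicted price for an unseen car
```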
Machine Learning – Unsupervised
• In unsupervised learning, the training data is unlabeled (Figure 3). The system tries to learn without a teacher.
Figure 3: An unlabeled training set
Figure 4: Clustering for unsupervised learning
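A minimal clustering sketch using scikit-learn's KMeans; the toy 2-D points are made up for illustration:

```python
# Clustering: group unlabeled points without a teacher.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each point, e.g. [0 0 1 1]
```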
Machine Learning – Semi-Supervised
• Algorithms that work with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.
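A minimal sketch, assuming scikit-learn's LabelPropagation, where unlabeled points are marked with -1; the toy data is made up:

```python
# Semi-supervised learning: two labeled points, four unlabeled ones.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [4.8]])
y = np.array([0, -1, -1, 1, -1, -1])  # -1 marks unlabeled examples

model = LabelPropagation().fit(X, y)
print(model.transduction_)  # inferred labels for all six points
```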
Machine Learning – Reinforcement
• Reinforcement learning: the learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return.
• It must then learn by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
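A minimal tabular Q-learning sketch on a hypothetical 5-state chain, just to make the agent/action/reward/policy loop concrete; the environment and hyperparameters are invented for illustration:

```python
# Q-learning: the agent learns a policy (best action per state) from rewards alone.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(2000):                 # episodes
    s = 0
    while s != n_states - 1:          # reward waits at the rightmost state
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))   # explore
        else:
            a = int(Q[s].argmax())             # exploit current estimates
        s_next = s + 1 if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:-1].argmax(axis=1))  # learned policy: move right in every non-terminal state
```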
Machine Learning Workflow
Supervised Learning
Linear Regression
Supervised Learning
• In regression, we seek to identify (or estimate) a continuous variable y associated with a given input vector x.
• In classification, we seek to identify the categorical class Ck associated with a given input vector x.
• Regression is used for forecasting and for finding cause-and-effect relationships between variables.
• Regression techniques differ based on the number of independent variables and the type of relationship between the independent and dependent variables.
Why Regression
• Regression analysis is a predictive modelling technique which investigates the relationship between a dependent (target) variable and independent (predictor) variable(s).
• Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve/line to the data points in such a manner that the distances of the data points from the curve or line are minimized.
Why do we use regression analysis?
• There are multiple benefits of using regression analysis:
  • It indicates the significant relationships between the dependent variable and the independent variables.
  • It indicates the strength of the impact of multiple independent variables on a dependent variable.
Linear Regression
• Linear regression is usually among the first few topics people pick while learning predictive modeling.
• In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
• The difference between simple and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Now, the question is: “How do we obtain the best-fit line?”
Linear regression: Introduction
• A scatter plot of the data on a 2-dimensional plane: pizza price vs. pizza size.
Linear regression: Introduction
• Choose a linear (straight-line) model and tweak it to match the data points by changing its slope.
• Choose a linear (straight-line) model and tweak it to match the data points by changing its intercept/bias.
Linear regression: Introduction
• Fit a linear (straight) line by choosing optimal values for the slope and bias.
Linear Regression
• Different applications of linear regression:
  i. Impact of product price on number of sales
  ii. Impact of rainfall amount on number of fruits yielded
  iii. To predict whether the funds invested in marketing a particular brand have given a substantial return on investment
• In order to predict an accurate value of Y for a given value of X, we need:
  1. Data (samples; combinations of X and Y)
  2. Model (function to represent the relationship between X and Y)
  3. Cost function (how well our model approximates the training samples)
  4. Optimization (find the parameters of the model that minimize the value of the cost function)
Linear Regression – data
• In simple linear regression the number of independent variables is one, and there is a linear relationship between the independent (x) and dependent (y) variables.

Population density (per sq km)   Number of COVID-19 patients
 20                               2
 40                               6
 60                               8
 80                              12
100                              14
Linear Regression – hypothesis
• How do we represent the data as a model/hypothesis?
• Since the data is linear, we can use linear operators.
• Hypothesis: hθ(x) = θ0x0 + θ1x1, where conventionally x0 = 1 (the bias input)
  • θ0 and θ1 – parameters
  • x0 and x1 – inputs
  • y – actual output
  • hθ(x) – predicted output
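A minimal NumPy sketch of this hypothesis, with x0 fixed to 1 as the bias input; the parameter values are arbitrary:

```python
# Hypothesis h_theta(x) = theta0*x0 + theta1*x1, with x0 = 1 (bias input).
import numpy as np

theta = np.array([0.5, 0.14])   # [theta0, theta1], arbitrary values

def h(x1):
    x = np.array([1.0, x1])     # x0 = 1
    return theta @ x

print(h(20))  # predicted number of patients at population density 20
```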
Linear Regression – data

Population density (per sq km)   Average age   Number of COVID-19 patients
 20                              16             3
 40                              16             4
 60                              14             6
 80                              14             7
300                              21            21
Linear Regression – cost function

X   Y
1   1
2   2
3   3
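The cost function for this data can be made explicit. A standard squared-error cost in the notation above (a common choice, consistent with the optimization slides that follow):

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

A quick worked check on the toy table (m = 3), taking θ0 = 0 so that hθ(x) = θ1x: J(θ1 = 1) = 0, since the line y = x fits all three points exactly, while J(θ1 = 0.5) = (1/6)[(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²] ≈ 0.58.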
Linear Regression – optimization
• Gradient descent algorithm for the linear regression model:
  Repeat until convergence {
    θj := θj − α · ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1)
  }
• Update θ0 and θ1 simultaneously.
• Correct (simultaneous update):
  temp0 := θ0 − α · ∂J(θ0, θ1)/∂θ0
  temp1 := θ1 − α · ∂J(θ0, θ1)/∂θ1
  θ0 := temp0
  θ1 := temp1
• Incorrect: updating θ0 first and then using the updated θ0 when computing the new value of θ1.
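A minimal NumPy sketch of this rule on the toy (X, Y) data above; the learning rate and iteration count are arbitrary choices:

```python
# Batch gradient descent for h_theta(x) = theta0 + theta1*x on the toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)
theta0, theta1, alpha = 0.0, 0.0, 0.1

for _ in range(1000):
    pred = theta0 + theta1 * x
    # Simultaneous update: both gradients are computed from the same parameters.
    grad0 = (pred - y).sum() / m
    grad1 = ((pred - y) * x).sum() / m
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # converges near (0, 1), i.e. the line y = x
```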
Linear Regression – optimization
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Linear Regression – optimization
• Determine the direction with the steepest downward gradient at the current position and take a step (whose size is controlled by the learning rate α) in that direction. Rinse and repeat; this is known as training.
• The gradient descent algorithm guides us to move in the direction of steepest descent.
Linear Regression – optimization
• Plots of the cost function J(θ0, θ1) visualize how the parameters move toward the minimum during optimization.
Linear Regression – variations
• Gradient descent can vary in the number of training patterns used to calculate the error that is in turn used to update the model.
• The number of patterns used to calculate the error determines how stable the gradient used to update the model is.
• There is a tension in gradient descent configurations between computational efficiency and the fidelity of the error gradient.
• The three main flavours of gradient descent are stochastic, batch, and mini-batch.
Stochastic Gradient Descent
• It is a variation of the gradient descent algorithm that calculates the error
and updates the model for each example in the training dataset.
• Parameters are updated after computing the gradient of error with
respect to a single training example.
• The update of the model for each training example means that stochastic
gradient descent is often called an online machine learning algorithm.
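A minimal sketch of this per-example update, reusing the toy data above; the shuffling and learning rate are arbitrary choices:

```python
# Stochastic gradient descent: parameters are updated after every single example.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
theta0, theta1, alpha = 0.0, 0.0, 0.05
rng = np.random.default_rng(0)

for _ in range(200):                   # epochs
    for i in rng.permutation(len(x)):  # visit examples in random order
        err = (theta0 + theta1 * x[i]) - y[i]
        theta0 -= alpha * err          # gradient w.r.t. this one example only
        theta1 -= alpha * err * x[i]

print(theta0, theta1)  # noisy path, but ends near (0, 1)
```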
Stochastic Gradient Descent – Upsides
• The frequent updates immediately give an insight into the performance of
the model and the rate of improvement.
• This variant of gradient descent may be the simplest to understand and
implement, especially for beginners.
• The increased model update frequency can result in faster learning on
some problems.
• The noisy update process can allow the model to avoid local minima (e.g.
premature convergence).
Stochastic Gradient Descent – Downsides
• Updating the model so frequently is more computationally expensive than
other configurations of gradient descent, taking significantly longer to
train models on large datasets.
• The frequent updates can result in a noisy gradient signal, which may
cause the model parameters and in turn the model error to jump around
(have a higher variance over training epochs).
• The noisy learning process along the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.
Batch Gradient Descent – Upsides
• Fewer updates to the model means this variant of gradient descent is
more computationally efficient than stochastic gradient descent.
• The decreased update frequency results in a more stable error gradient
and may result in a more stable convergence on some problems.
• The separation of the calculation of prediction errors and the model
update lends the algorithm to parallel processing based implementations.
Batch Gradient Descent – Downsides
• The more stable error gradient may result in premature convergence of
the model to a less optimal set of parameters.
• The updates at the end of the training epoch require the additional
complexity of accumulating prediction errors across all training examples.
• Commonly, batch gradient descent is implemented in such a way that it
requires the entire training dataset in memory and available to the
algorithm.
• Model updates, and in turn training speed, may become very slow for
large datasets.
Mini-Batch Gradient Descent
• It splits the training dataset into small batches that are used to calculate
model error and update model coefficients.
• Implementations may choose to sum the gradient over the mini-batch
which further reduces the variance of the gradient.
• Mini-batch gradient descent seeks to find a balance between the
robustness of stochastic gradient descent and the efficiency of batch
gradient descent.
• It is the most common implementation of gradient descent used in the
field of deep learning.
• Parameters are updated after computing the gradient of error with
respect to a subset of the training set.
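A minimal mini-batch sketch; the synthetic data, batch size, and learning rate are made up for illustration:

```python
# Mini-batch gradient descent: one update per small batch of examples.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2 * x + 1                          # synthetic targets for illustration
theta0, theta1, alpha, batch = 0.0, 0.0, 0.02, 2
rng = np.random.default_rng(0)

for _ in range(2000):                  # epochs
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch):
        idx = order[start:start + batch]
        err = (theta0 + theta1 * x[idx]) - y[idx]
        theta0 -= alpha * err.mean()             # gradient averaged over the batch
        theta1 -= alpha * (err * x[idx]).mean()

print(theta0, theta1)  # converges near (1, 2)
```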
Mini-Batch Gradient Descent – Upsides
• The model update frequency is higher than for batch gradient descent, which allows for more robust convergence, avoiding local minima.
• The batched updates provide a computationally more efficient process than stochastic gradient descent.
• Batching gives efficiency both by not requiring all training data in memory and by allowing efficient (vectorized) algorithm implementations.
Mini-Batch Gradient Descent – Downsides
• Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
• Error information must be accumulated across mini-batches of training examples, as in batch gradient descent.
Gradient Descent – summary
• Stochastic Gradient Descent: only a single training example is considered before taking a step in the direction of the gradient, so we are forced to loop over the training set and cannot exploit the speed associated with vectorizing the code. It makes very noisy updates to the parameters.
• Batch Gradient Descent: the entire training set is considered before taking a step in the direction of the gradient, so it takes a lot of time to make a single update. It makes smooth updates to the model parameters.
• Mini-Batch Gradient Descent: a subset of training examples is considered, so it can make quick updates to the model parameters and can also exploit the speed associated with vectorizing the code. Depending upon the batch size, the updates can be made less noisy: the greater the batch size, the less noisy the update.
Multivariate Linear Regression

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
 852           2                    1                  36                    178
 …             …                    …                  …                     …

Notation:
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example
Multivariate Linear Regression
• Univariate linear regression: hθ(x) = θ0 + θ1x
• Multivariate linear regression: hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

Hypothesis: hθ(x) = θ0x0 + θ1x1 + … + θnxn (with x0 = 1)
Parameters: θ0, θ1, …, θn
Cost function: J(θ0, θ1, …, θn) = (1/2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²
Gradient descent:
  Repeat {
    θj := θj − α · ∂J/∂θj
  } (simultaneously update θj for every j = 0, …, n)
Multivariate Linear Regression
New algorithm (n ≥ 1):
  Repeat {
    θj := θj − α · (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) · x_j^(i)
  } (simultaneously update θj for j = 0, …, n)

Previously (n = 1):
  Repeat {
    θ0 := θ0 − α · (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))
    θ1 := θ1 − α · (1/m) Σ_{i=1..m} (hθ(x^(i)) − y^(i)) · x^(i)
  } (simultaneously update θ0 and θ1)
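A minimal vectorized NumPy sketch of this update, using rows in the spirit of the housing table above (feature values rescaled; learning rate and iteration count arbitrary); the per-feature loop collapses to one matrix expression:

```python
# Vectorized multivariate gradient descent: theta -= (alpha/m) * X.T @ (X@theta - y)
import numpy as np

# Columns: x0 = 1 (bias), size/1000, bedrooms, floors, age/10.
X = np.array([[1, 2.104, 5, 1, 4.5],
              [1, 1.416, 3, 2, 4.0],
              [1, 1.534, 3, 2, 3.0],
              [1, 0.852, 2, 1, 3.6]])
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)
theta = np.zeros(X.shape[1])
alpha = 0.01

for _ in range(20000):
    theta -= (alpha / m) * X.T @ (X @ theta - y)

print(X @ theta)  # predictions move toward y as training proceeds
```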
Problem of Overfitting & Under-fitting
• If we take a quadratic function, it will give a good fit:
  w1x + w2x² + b
• If we take a very high-order polynomial, we get a curve that overfits the data:
  w1x + w2x² + w3x³ + w4x⁴ + b
• There are several ways to deal with overfitting:
➢ Collecting more data
➢ Selecting a subset of features
➢ Penalizing the weights (parameter values)
Regularization
• There are several problems, such as overfitting and underfitting, and regularization helps to solve the problem of overfitting.
• Its aim is to:
➢ Reduce model complexity
➢ Reduce the cost function
• It adds a penalty to reduce the magnitude of the weights.
• There are three such types: Ridge, Lasso & Elastic Net.
• Regularized cost function for linear regression:
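A standard form of this regularized cost, in the earlier notation (λ is the regularization parameter; the bias θ0 is conventionally not penalized):

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$$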
Limitations of Linear Regression
• Linear regression is not robust to outliers.
• Linear regression is also susceptible to overfitting.
Ridge Regression
• Ridge regression works with an enhanced cost function compared to the least-squares cost function.
• Instead of the simple sum of squares, ridge regression introduces an additional ‘regularization’ parameter that penalizes the size of the weights.
• It can be used when there are too many predictors, or when the predictors have a high degree of multicollinearity with each other.
• The cost function and the gradient descent update are given by Equations 1 and 2.
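A standard form of the ridge cost and its gradient descent update, corresponding to Equations 1 and 2 in the earlier notation (the bias θ0 is conventionally left unregularized):

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \quad (1)$$

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \quad (2)$$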
Limitations of Ridge Regression
• It is not good for feature selection.
• Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it never shrinks a coefficient to exactly zero; it only makes coefficients smaller.
Lasso Regression
• Lasso regression stands for Least Absolute Shrinkage and Selection Operator.
• It adds a penalty term to the cost function: the absolute sum of the coefficients. As a coefficient grows away from 0, this term penalizes it, causing the model to shrink coefficient values in order to reduce the loss.
• The difference between ridge and lasso regression is that lasso tends to drive coefficients to exactly zero, whereas ridge never sets the value of a coefficient to absolute zero.
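A minimal scikit-learn sketch of this contrast on synthetic data (all values below are made up for illustration): lasso zeroes the irrelevant coefficients, while ridge only shrinks them.

```python
# Lasso drives irrelevant coefficients to exactly zero; ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually influence the target.
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # last three coefficients become exactly 0
print(Ridge(alpha=0.1).fit(X, y).coef_)  # small but nonzero everywhere
```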
Limitations of Lasso Regression
• Lasso sometimes struggles with certain types of data:
• If the number of predictors (p) is greater than the number of observations (n), lasso will pick at most n predictors as non-zero, even if all predictors are relevant (or may be used in the test set).
• If there are two or more highly collinear variables, lasso selects one of them essentially at random, which is not good for the interpretation of the data.
The Elastic Net
• The Elastic Net often proves better because it combines the regularization of both lasso and ridge.
• Its advantage is that it does not as easily eliminate highly collinear coefficients.
• The elastic net has two parameters, since it combines both L1 and L2 regularization: a lambda1 for the L1 term and a lambda2 for the L2 term.
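A standard form of the elastic net cost, combining the two penalties in the earlier notation (scaling conventions vary between texts):

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda_1\sum_{j=1}^{n}\lvert\theta_j\rvert + \lambda_2\sum_{j=1}^{n}\theta_j^2$$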
Thank you