Machine learning (ML)

Well-posed learning problem: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)

• Allows computers to learn and make decisions without being explicitly programmed.

• It involves feeding data into algorithms to identify patterns and make predictions on new
data.

• With the help of machine learning, computers can acquire knowledge and make hypotheses or judgments without explicit programming.

• It involves developing algorithms and models that identify data trends and connections so
that computers can draw inferences, make accurate predictions, and automate processes.
Supervised learning is a type of machine learning where a model is trained on labeled data, meaning each input is paired with the corresponding output label.

How Supervised Machine Learning Works?
A supervised learning algorithm works on input features and their corresponding output labels.

The process works through:

Training Data: The model is provided with a training dataset that includes input data (features) and corresponding output data (labels or target variables).
Learning Process: The algorithm processes the training data, learning the relationships between the input features and the output labels. This is achieved by adjusting the model's parameters to minimize the difference between its predictions and the actual labels.

• After training, the model is evaluated using a test dataset to measure its accuracy
and performance.
• Then the model's performance is optimized by adjusting parameters and using
techniques like cross-validation to balance bias and variance. This ensures the
model generalizes well to new, unseen data.
A supervised machine learning model is trained on a dataset to learn a mapping function between input and output, and the learned function is then used to make predictions on new data.
Types of Supervised Learning in Machine Learning
Now, Supervised learning can be applied to two main types of
problems:

Classification: Where the output is a categorical variable (e.g., spam vs. non-spam emails, yes vs. no).

Regression: Where the output is a continuous variable (e.g., predicting house prices, stock prices).
Figure A: It is a dataset of a shopping store that is useful in predicting whether a
customer will purchase a particular product under consideration or not based on his/
her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means
that the customer won't purchase it.

Figure B: It is a meteorological dataset that serves the purpose of predicting wind speed based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
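For illustration, a minimal sketch (assuming scikit-learn) of how the two problem types are modeled; the tiny datasets below are made up and only mimic the shape of Figures A and B:

```python
# An illustrative sketch (assumes scikit-learn); the tiny datasets below are
# made up, only mimicking the shape of Figures A and B.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification (Figure A style): Gender (0/1), Age, Salary in thousands -> Purchased (0/1)
X_cls = np.array([[0, 25, 30], [1, 47, 80], [0, 35, 50], [1, 52, 110]])
y_cls = np.array([0, 1, 0, 1])
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[1, 40, 60]]))        # categorical output: 0 or 1

# Regression (Figure B style): two weather features -> continuous wind speed
X_reg = np.array([[10.0, 25.0], [12.5, 30.0], [8.0, 20.0], [15.0, 33.0]])
y_reg = np.array([3.2, 5.1, 2.4, 6.0])
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[11.0, 27.0]]))       # continuous output
```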
• In statistics, the term linear model refers to any model which assumes linearity in the system.

• The most common occurrence is in connection with regression models, and the term is often taken as synonymous with linear regression model.

• However, the term is also used in time series analysis with a different meaning.
In machine learning, a perceptron is a fundamental algorithm and a
basic building block of artificial neural networks. It's a type of linear
classifier that takes inputs, applies weights and a bias, sums them,
and then applies an activation function to produce an output,
essentially making a decision.

The perceptron model mimics the functioning of a biological neuron, enabling us to solve binary classification problems.

It is a type of linear classifier that predicts whether an input belongs to one of two classes, typically labeled as 0 and 1.

At its core, the perceptron algorithm performs a weighted sum of the input features, applies a threshold function, and outputs a predicted class label.
Key Components of the Perceptron:

Input Features: The perceptron algorithm takes a set of input features as its
initial input. These features can be numeric, categorical, or binary,
representing different aspects of the problem being solved.

Weights and Bias: Each input feature is associated with a weight, which
determines its importance in the classification process. Additionally, there is
a bias term that allows for adjusting the decision boundary.

Activation Function: The activation function is applied to the weighted sum of the input features plus the bias. It determines the output of the perceptron, indicating which class the input is predicted to belong to. Common activation functions used in perceptrons include the step function and the sigmoid function.
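Putting these components together, a minimal from-scratch sketch of a perceptron; the AND dataset, learning rate, and epoch count below are illustrative choices, not prescribed by the slides:

```python
# A minimal perceptron sketch: weighted sum + bias, step activation,
# and the classic perceptron update rule (updates only on mistakes).
import numpy as np

def step(z):
    return 1 if z >= 0 else 0          # threshold (step) activation

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])           # one weight per input feature
    b = 0.0                            # bias shifts the decision boundary
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(np.dot(w, xi) + b)
            w += lr * (target - pred) * xi   # no change when prediction is correct
            b += lr * (target - pred)
    return w, b

# Example: learning the logical AND function (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, xi) + b) for xi in X])   # expected: [0, 0, 0, 1]
```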
The support vector machine (SVM) method is a popular and effective
machine learning method that finds its application in a wide range of
different areas.

Furthermore, various modifications of this method are still being developed.

We can use this method for both classification and regression problems, but it's more common to use it for classification.
In short, the main idea behind this classification algorithm is to separate classes as
correctly as possible. For example, if we take the classification of red and blue dots
in the left image below, we can see that all three lines (generally hyperplanes)
correctly separate the class of red and blue dots.

However, the question arises as to which is the best solution in general for some
other points of these classes as well. The SVM solves this problem by dividing these
two classes in such a way that the hyperplane remains as far away as possible from
the two nearest points of both classes. We can see this in the right image below.

The maximum possible margin is constructed between these two classes as the space between two parallel gray boundary hyperplanes, while in their midst lies the separating hyperplane. These two parallel gray hyperplanes pass through one or more points that we call support vectors.
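As a quick illustration, assuming scikit-learn's SVC as the SVM implementation (the data points are made up), we can fit a maximum-margin classifier and inspect its support vectors:

```python
# Sketch: fit a linear maximum-margin classifier and read off its support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])   # two small blobs
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)   # the points lying on the margin boundaries
print(svm.predict([[4, 4]]))  # classify a new point
```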
Key Differences
Although these two algorithms are very similar and solve the same problem, there are major differences between them.

1. Inspiration
As we mentioned above, the perceptron is a neural network type of model. The inspiration for creating the perceptron came from simulating biological neural networks. In contrast, SVM is a different type of machine learning model, which was inspired by statistical learning theory.

2. Training and Optimization
Generally, for training and optimizing the weights of perceptrons and neural networks, we use the backpropagation technique, which includes the gradient descent approach. Conversely, to maximize the margin of SVM, we need to solve quadratic equations using quadratic programming (QP).

For example, in the popular machine learning library Scikit-Learn, QP is solved by an algorithm called sequential minimal optimization (SMO).
3. Kernel Trick
The SVM algorithm uses one smart technique that we call the kernel
trick. The main idea is that when we can’t separate the classes in the
current dimension, we add another dimension where the classes may
be separable.

In order to do that, we don't just arbitrarily add another dimension; we use special transformations called kernels.

In contrast to SVM, the perceptron doesn't use the kernel trick and doesn't transform the data into a higher dimension. Consequently, if the data isn't easily separable with the current configuration of the perceptron, we can try to increase the number of neurons or layers in the model.
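A short sketch of the kernel trick, assuming scikit-learn and its make_circles toy dataset: the two rings are not linearly separable in two dimensions, but an RBF kernel separates them by implicitly working in a higher-dimensional feature space.

```python
# Sketch: a linear kernel fails on concentric rings, while an RBF kernel succeeds.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor, around 0.5
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # close to 1.0
```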
4. Multiclass Classification
SVM doesn’t support multiclass classification natively. Therefore, if
we want to separate multiple classes using SVM algorithms, there are
two indirect approaches:
• One-vs-One approach
• One-vs-Rest approach

One-vs-One (OvO) approach means that we break the multiclass problem into multiple binary classification problems. For example, if we have three classes with names X, Y, and Z, the OvO approach would divide it into three binary classification problems:
1. X vs Y
2. X vs Z
3. Y vs Z
Similarly, One-vs-Rest (OvR) approach breaks the multiclass problem
into multiple binary classifications where it tries to separate the
current class with all the other classes together. For example, if we
take the same classes as above, the OvR approach looks like this:
1. X vs [Y, Z]
2. Y vs [X, Z]
3. Z vs [X, Y]
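Both strategies are available as wrappers in scikit-learn; a brief sketch (the Iris dataset stands in for the three classes X, Y, and Z):

```python
# Sketch: OvO and OvR wrappers around a binary SVC for a three-class problem.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # three classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_))   # 3 binary problems: X vs Y, X vs Z, Y vs Z
print(len(ovr.estimators_))   # 3 binary problems: each class vs the rest
```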
Although classic perceptron with one neuron requires the same logic
for solving multiclass classification problems, most of today’s
implementations of perceptron algorithms can directly predict the
probability for each of the classes. This is simply done using the
softmax activation function in the output layer.
5. Probability of Prediction
Finally, the SVM model doesn’t output probability natively. Therefore,
if we want to have a probability of prediction, we can get it indirectly
with the probability calibration method. One standard way is using
Platt scaling.

In contrast to SVM, a perceptron with a probabilistic activation function in the output layer directly predicts the probability for each class. The most common probabilistic functions are sigmoid and softmax.
Logistic Regression is a supervised machine learning algorithm used
for classification problems.

Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class.

It is used for binary classification, where the output can be one of two possible categories such as Yes/No, True/False, or 0/1.

It uses the sigmoid function to convert inputs into a probability value between 0 and 1.
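A minimal sketch of the sigmoid mapping and a binary logistic regression fit, assuming scikit-learn (the hours-studied data below is made up):

```python
# Sketch: the sigmoid squashes any real number into (0, 1); LogisticRegression
# then uses it to turn a linear score into a class probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]

# Hours studied -> pass (1) / fail (0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))         # probability of each class
```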
Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of
the dependent variable:

Binomial Logistic Regression: This type is used when the dependent variable has
only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the
most common form of logistic regression and is used for binary classification
problems.

Multinomial Logistic Regression: This is used when the dependent variable has
three or more possible categories that are not ordered. For example, classifying
animals into categories like "cat," "dog" or "sheep." It extends the binary logistic
regression to handle multiple classes.

Ordinal Logistic Regression: This type applies when the dependent variable has
three or more categories with a natural order or ranking. Examples include ratings
like "low," "medium" and "high." It takes the order of the categories into account
when modeling.
Assumptions of Logistic Regression
Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.
Binary dependent variable: The model assumes that the dependent variable is binary, meaning it can take only two values. For more than two categories, the softmax function is used.
Linear relationship between independent variables and log odds: The model assumes a linear relationship between the independent variables and the log odds of the dependent variable, which means the predictors affect the log odds in a linear way.
No outliers: The dataset should not contain extreme outliers (data points that deviate significantly from the rest of the observations in a dataset), as they can distort the estimation of the logistic regression coefficients.
Large sample size: It requires a sufficiently large sample size to produce reliable and stable results.
How to Evaluate Logistic Regression Model
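Common ways to evaluate a logistic regression classifier include accuracy, precision, recall, F1-score, the confusion matrix, and ROC AUC. A brief sketch using scikit-learn's metrics (the labels and probabilities below are placeholders, not real results):

```python
# Sketch: standard evaluation metrics for a binary classifier.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_test = [0, 1, 1, 0, 1, 0, 1, 1]                     # true labels (illustrative)
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]                     # predicted labels
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.3, 0.7, 0.95]    # predicted probabilities

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```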
Neural networks are machine learning models that mimic
the complex functions of the human brain.

These models consist of interconnected nodes or neurons that process data, learn patterns, and enable tasks such as pattern recognition and decision-making.
Neural networks are pivotal in identifying complex patterns, solving
intricate challenges, and adapting to dynamic environments. Their
ability to learn from vast amounts of data is transformative,
impacting technologies like natural language processing, self-driving
vehicles, and automated decision-making.
Layers in Neural Network Architecture
Input Layer: This is where the network receives its input data. Each
input neuron in the layer corresponds to a feature in the input data.
Hidden Layers: These layers perform most of the computational
heavy lifting. A neural network can have one or multiple hidden
layers. Each layer consists of units (neurons) that transform the inputs
into something that the output layer can use.
Output Layer: The final layer produces the output of the model. The
format of these outputs varies depending on the specific task (e.g.,
classification, regression).
Shallow Neural Network?
A shallow neural network refers to a neural network that consists of
only one hidden layer between the input and output layers.

The term “shallow” refers to the minimal depth of the network due
to just one hidden layer between input and output.

During the training process, input data is fed into the network where
it is processed through weights and biases associated with neurons in
the hidden layer.

The processed information then moves to the output layer, which provides a prediction or classification based on the learned features.

The accuracy of these predictions is refined through a process called backpropagation and optimization algorithms like gradient descent, which adjust weights and biases to minimize errors.
Components of a Shallow Neural Network
Input Layer: This is where the network receives its input data. Each
neuron in this layer represents a feature of the input dataset.

Hidden Layer: The single hidden layer in a shallow network transforms the inputs into something that the output layer can use. The neurons in this layer apply a set of weights to the inputs and pass them through an activation function to introduce non-linearity to the process.

Output Layer: The final layer produces the output of the network. For
regression tasks, this might be a single neuron; for classification, it
could be multiple neurons corresponding to the classes.
How Do Shallow Neural Networks Work?
The functionality of shallow neural networks hinges on the
transformation of inputs through the hidden layer to produce outputs.
Here's a step-by-step breakdown:

Weighted Sum: Each neuron in the hidden layer calculates a weighted sum of the inputs.

Activation Function: The weighted sums are passed through an activation function (such as sigmoid or tanh) to introduce non-linearity, enabling the network to learn complex patterns.
Sigmoid: S-shaped function that maps input values to a range between 0 and 1.
Tanh (Hyperbolic Tangent): S-shaped function like sigmoid, but maps input values between -1 and 1.

Output Generation: The output layer integrates the signals from the
hidden layer, often through another set of weights, to produce the
final output.
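A compact sketch of these three steps for a one-hidden-layer network, written with NumPy (the layer sizes and random weights are illustrative):

```python
# Sketch: forward pass through a shallow network with one hidden layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                        # 3 input features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

z1 = W1 @ x + b1                     # 1) weighted sum in the hidden layer
a1 = np.tanh(z1)                     # 2) non-linear activation (tanh)
z2 = W2 @ a1 + b2                    # 3) output layer combines the hidden signals
y_hat = 1.0 / (1.0 + np.exp(-z2))    # sigmoid output for binary classification
print(y_hat)
```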
Training Shallow Neural Networks
Training a shallow neural network typically involves:

Forward Propagation: Calculating the output for a given input by passing it through the layers of the network.

Loss Calculation: Determining how far the network's output is from the actual desired output using a loss function.

Backpropagation: Calculating the gradient of the loss function with respect to each weight in the network, which informs how the weights should be adjusted to minimize the loss.

Weight Update: Adjusting the weights using an optimization algorithm like gradient descent.
Forward Propagation
• Forward propagation is the fundamental process in a neural network where input data passes through multiple layers to generate an output.

• In forward propagation, input data moves through each layer of the neural network, where each neuron applies a weighted sum, adds a bias, and passes the result through an activation function to produce its output and, ultimately, a prediction.
• It determines the output of neural network with a given set of inputs and current
state of model parameters (weights and biases).

• Understanding this process helps in optimizing neural networks for various tasks
like classification, regression and more.

x (input) -> [Layer 1] -> [Layer 2] -> ... -> [Layer n] -> ŷ (output)
1. Input Layer
• The input data is fed into the network through the input layer.
• Each feature in the input dataset represents a neuron in this layer.
• The input is usually normalized or standardized to improve model
performance.

2. Hidden Layers
• The input moves through one or more hidden layers where
transformations occur.
• Each neuron in hidden layer computes a weighted sum of inputs
and applies activation function to introduce non-linearity.
Each neuron receives inputs and computes: Z = WX + b
where:
W is the weight matrix
X is the input vector
b is the bias term
An activation function such as ReLU or sigmoid is then applied.
3. Output Layer
• The last layer in the network generates the final prediction.
• The activation function of this layer depends on the type of
problem:
• Softmax (for multi-class classification)
• Sigmoid (for binary classification)
• Linear (for regression tasks)
4. Prediction
• The network produces an output based on current weights and
biases.
• The loss function evaluates the error by comparing predicted
output with actual values.
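A sketch of this full forward pass in NumPy, with Z = WX + b at each layer, a softmax output for multi-class classification, and a cross-entropy loss comparing predictions with the actual labels (all shapes and values are illustrative):

```python
# Sketch: batch forward pass through input -> hidden (ReLU) -> output (softmax),
# followed by a cross-entropy loss against the true labels.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))                      # batch of 4 samples, 5 features each

W1, b1 = rng.normal(size=(5, 8)), np.zeros(8)    # hidden layer
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)    # output layer: 3 classes

Z1 = X @ W1 + b1                                 # weighted sum + bias
A1 = np.maximum(0, Z1)                           # ReLU activation
Z2 = A1 @ W2 + b2
expZ = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
Y_hat = expZ / expZ.sum(axis=1, keepdims=True)   # softmax probabilities

y_true = np.array([0, 2, 1, 0])                  # actual class labels
loss = -np.log(Y_hat[np.arange(4), y_true]).mean()   # cross-entropy loss
print(Y_hat.shape, loss)
```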
2. Loss Function
• A loss function is a mathematical way to measure how good or bad a model's predictions are compared to the actual results.
• It gives a single number that tells us how far off the predictions are. The smaller the number, the better the model is doing.
• Loss functions are used to train models.

L(ŷ, y) = measure of difference between prediction ŷ and true value y

Loss functions are important because they:

• Guide Model Training: During training, algorithms such as gradient descent use the loss function to adjust the model's parameters, trying to reduce the error and improve the model's predictions.
• Measure Performance: The difference between predicted and actual values can be used to evaluate the model's performance.
• Affect Learning Behavior: Different loss functions make the model learn in different ways, depending on which kinds of mistakes they penalize more heavily.
There are several types of loss functions, each suited to different tasks.
1. Regression Loss Functions
These are used when your model needs to predict a continuous
number such as predicting the price of a product or age of a person.
Popular regression loss functions are:

A. Mean Squared Error (MSE) Loss
• Mean Squared Error (MSE) Loss is one of the most widely used loss functions for regression tasks.
• It calculates the average of the squared differences between the predicted values and the actual values.
• It is simple to understand and sensitive to outliers, because the errors are squared, which can inflate the loss.

MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
B. Mean Absolute Error (MAE) Loss
• Mean Absolute Error (MAE) Loss is another commonly used loss
function for regression.
• It calculates the average of the absolute differences between the
predicted values and the actual values.
• It is less sensitive to outliers compared to MSE. But it is not
differentiable at zero which can cause issues for some optimization
algorithms.
MAE = (1/n) ∑ᵢ₌₁ⁿ |yᵢ − ŷᵢ|

C. Huber Loss
• Huber Loss combines the advantages of MSE and MAE.
• It is less sensitive to outliers than MSE and differentiable
everywhere unlike MAE. It requires tuning of the parameter δ.
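Minimal sketches of these three regression losses, written directly from their definitions (the values below are made up):

```python
# Sketch: MSE, MAE, and Huber loss implemented from their definitions.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)        # squaring amplifies outliers

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))       # more robust to outliers

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,                        # MSE-like near zero
                            delta * (np.abs(err) - 0.5 * delta)))  # MAE-like far out

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 10.0])          # last prediction is an outlier error
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```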
2. Classification Loss Functions
Classification loss functions are used to evaluate how well a
classification model's predictions match the actual class labels.

A. Binary Cross-Entropy Loss (Log Loss)
• Binary Cross-Entropy Loss is also known as Log Loss and is used for binary classification problems.
• It measures the performance of a classification model whose output is a probability value between 0 and 1.

B. Categorical Cross-Entropy Loss
• Categorical Cross-Entropy Loss is used for multiclass classification problems.
• It measures the performance of a classification model whose output is a probability distribution over multiple classes.
C. Sparse Categorical Cross-Entropy Loss
• Sparse Categorical Cross-Entropy Loss is similar to Categorical
Cross-Entropy Loss but is used when the target labels are integers
instead of one-hot encoded vectors.
• It is efficient for large datasets with many classes.
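Minimal sketches of binary and sparse categorical cross-entropy, written from their definitions (the probabilities below are illustrative):

```python
# Sketch: binary cross-entropy and sparse categorical cross-entropy from scratch.
import numpy as np

def binary_cross_entropy(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)              # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def sparse_categorical_cross_entropy(y_true, probs):
    # y_true holds integer class indices; probs holds one distribution per sample
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
print(sparse_categorical_cross_entropy(
    np.array([2, 0]),
    np.array([[0.1, 0.2, 0.7],
              [0.8, 0.1, 0.1]])))
```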
3. Ranking Loss Functions
Ranking loss functions are used to evaluate models that predict the
relative order of items. These are commonly used in tasks such as
recommendation systems and information retrieval.
A. Contrastive Loss
Contrastive Loss is used to learn embeddings such that similar items
are closer in the embedding space while dissimilar items are farther
apart. It is often used in Siamese networks.
B. Triplet Loss
Triplet Loss is used to learn embeddings by comparing the relative
distances between triplets: anchor, positive example and negative
example.
4. Image and Reconstruction Loss Functions
These loss functions are used to evaluate models that generate or
reconstruct images ensuring that the output is as close as possible to
the target images.

A. Pixel-wise Cross-Entropy Loss
Pixel-wise Cross-Entropy Loss is used for image segmentation tasks where each pixel is classified independently.

B. Dice Loss
Dice Loss is used for image segmentation tasks and is particularly
effective for imbalanced datasets. It measures the overlap between
the predicted segmentation and the ground truth.

C. Jaccard Loss (Intersection over Union, IoU)
Jaccard Loss, also known as IoU Loss, measures the intersection over union of the predicted segmentation and the ground truth.
3. Backpropagation
• Backpropagation is a supervised learning algorithm used for training artificial neural networks.

• It computes the gradient of the loss function with respect to each weight by the chain rule.

• This process allows the network to update its weights and biases, minimizing the error in predictions.

• Without backpropagation, training deep neural networks would be inefficient and impractical.

• Its goal is to reduce the difference between the model's predicted output and the actual output by adjusting the weights and biases in the network.
Backpropagation is all about efficiently computing how each
parameter in the network contributes to the overall error.

It does this by cleverly applying the chain rule of calculus, propagating the error gradient backwards through the network.

It works iteratively to adjust weights and biases to minimize the cost function. In each epoch the model adapts these parameters by reducing the loss, following the error gradient.

It often uses optimization algorithms like gradient descent or stochastic gradient descent.

The algorithm computes the gradient using the chain rule from calculus, allowing it to effectively navigate complex layers in the neural network to minimize the cost function.
Chain Rule
• The chain rule is a fundamental concept in calculus that is crucial
for backpropagation.
• It allows the computation of the derivative of a composite
function by breaking it down into simpler parts.
• In the context of neural networks, the chain rule helps in
computing the gradient of the loss function with respect to each
weight.

∂L/∂θ = (∂L/∂ŷ) * (∂ŷ/∂z) * (∂z/∂θ)


Where:
• L is the loss
• ŷ is the network output
• z is the input to the activation function
• θ is any weight or bias in the network
Back Propagation plays a critical role in how neural
networks improve over time.

• Efficient Weight Update: It computes the gradient of the loss function with respect to each weight using the chain rule, making it possible to update weights efficiently.

• Scalability: The backpropagation algorithm scales well to networks with multiple layers and complex architectures, making deep learning feasible.

• Automated Learning: With backpropagation, the learning process becomes automated and the model can adjust itself to optimize its performance.
The Back Propagation algorithm involves two main steps: the Forward
Pass and the Backward Pass.

1. Initial Calculation

a_j = ∑(w_i,j × x_i)
Where:
a_j is the weighted sum of all the inputs and weights at each node
w_i,j represents the weight between the ith input and the jth neuron
x_i represents the value of the ith input

o_j (output): After applying the activation function to a_j, we get the output of the neuron:
o_j = activation function(a_j)
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing
non-linearity into the model.
y_j = 1 / (1 + e^(−a_j))

3. Computing Outputs
4. Error Calculation
The desired (target) output is 0.5, but we obtained 0.67.

To calculate the error we can use the formula below:

Error_j = y_target − y_5 = 0.5 − 0.67 = −0.17

Back Propagation
1. Calculating Gradients
The change in each weight is calculated as:
Δw_i,j = η × δ_j × O_j

Where:
• δ_j is the error term for each unit
• η is the learning rate
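A tiny end-to-end sketch of these formulas for a single sigmoid neuron: forward pass, error, chain-rule gradient, and the weight update. The numbers are illustrative, not those from the slide's figure, and a squared-error loss is assumed.

```python
# Sketch: one sigmoid neuron, squared-error loss, and one gradient-descent update.
import numpy as np

x = np.array([0.5, 0.3])        # inputs
w = np.array([0.4, 0.7])        # weights w_i,j
b = 0.1                         # bias
y_target = 0.5                  # desired output
eta = 0.5                       # learning rate

# Forward pass
a = np.dot(w, x) + b            # a_j = sum(w_i,j * x_i) + b
y = 1.0 / (1.0 + np.exp(-a))    # sigmoid output o_j

# Backward pass (chain rule): dL/dw = dL/dy * dy/da * da/dw
loss = 0.5 * (y - y_target) ** 2
delta = (y - y_target) * y * (1 - y)    # gradient of the loss w.r.t. a_j
grad_w = delta * x
grad_b = delta

# Weight update: equivalent to Δw = η * δ_j * O_j with δ_j = (target − y) * y * (1 − y)
w -= eta * grad_w
b -= eta * grad_b
print(loss, w, b)
```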
Stochastic gradient descent
• Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning, particularly useful when dealing with large datasets.
• It is a variant of the traditional gradient descent
algorithm but offers several advantages in terms of
efficiency and scalability, making it the go-to method for
many deep-learning tasks.
• Gradient descent is an iterative optimization algorithm
used to minimize a loss function, which represents how
far the model’s predictions are from the actual values.
The main goal is to adjust the parameters of a model
(weights, biases, etc.) so that the error is minimized.
Need for Stochastic Gradient Descent

• For large datasets, computing the gradient using all data points can
be slow and memory-intensive.

• This is where SGD comes into play. Instead of using the full dataset
to compute the gradient at each step, SGD uses only one random
data point (or a small batch of data points) at each iteration.

• This makes the computation much faster.


• The insight of SGD is that the gradient is an expectation.

• The expectation may be approximately estimated using a small set of samples.

• Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x(1), . . . , x(m)} drawn uniformly from the training set.

• The minibatch size m is typically chosen to be a relatively small number of examples, ranging from one to a few hundred.

• Crucially, m is usually held fixed as the training set size grows.

• We may fit a training set with billions of examples using updates computed on only a hundred examples.
How It Works
Initialization: Start with random initial values for the model
parameters (weights and biases).

Iterative Updates:
• Randomly shuffle the dataset.
• For each data point (or mini-batch), compute the gradient of the
loss function with respect to the model parameters.
• Update the parameters using the formula: Θ = Θ − η ∇Θ J(Θ)
  where:
  • Θ: Model parameters (weights, biases, etc.)
  • η: Learning rate (step size)
  • ∇Θ J(Θ): Gradient of the loss function

Repeat: Continue until the loss converges or a stopping criterion (e.g., number of epochs) is met.
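A bare-bones sketch of this loop, fitting a simple linear model y ≈ w·x + b on made-up data with one randomly chosen example per update:

```python
# Sketch: SGD from scratch for a one-feature linear regression.
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 1, size=200)   # true w = 3, b = 2, plus noise

w, b = 0.0, 0.0          # initialization
eta = 0.01               # learning rate
for epoch in range(20):
    for i in rng.permutation(len(X)):            # shuffle each epoch
        err = (w * X[i] + b) - y[i]              # prediction error on one sample
        w -= eta * err * X[i]                    # Θ ← Θ − η ∇Θ J(Θ)
        b -= eta * err
print(w, b)   # should end up close to 3 and 2
```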
SGD Classifier:
• is a linear classification algorithm that aims to find the optimal
decision boundary (a hyperplane) to separate data points belonging
to different classes in a feature space.

• It operates by iteratively adjusting the model's parameters to minimize a cost function, often the cross-entropy loss, using the stochastic gradient descent optimization technique.

SGD Regressor:
• solves regression problems with a machine learning approach.

• The aim of SGD regression, a type of supervised learning, is to predict a continuous output variable (the dependent variable) from one or more inputs (the independent variables).

• The SGD Regressor reduces the discrepancy between target values and predicted values by optimizing the model's parameters.
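Both are available in scikit-learn's linear_model module; a brief sketch on built-in toy datasets (feature scaling is added because SGD is sensitive to feature magnitudes):

```python
# Sketch: SGDClassifier (default hinge loss) and SGDRegressor (squared error)
# wrapped in pipelines with standardization.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import SGDClassifier, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X_c, y_c = load_iris(return_X_y=True)
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000))
clf.fit(X_c, y_c)
print("classification accuracy:", clf.score(X_c, y_c))

X_r, y_r = load_diabetes(return_X_y=True)
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000))
reg.fit(X_r, y_r)
print("regression R^2:", reg.score(X_r, y_r))
```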
Advantages of Stochastic Gradient Descent
Efficiency: Because it uses only one or a few data points to calculate
the gradient, SGD can be much faster, especially for large datasets.
Each step requires fewer computations, leading to quicker
convergence.

Memory Efficiency: Since it does not require storing the entire dataset
in memory for each iteration, SGD can handle much larger datasets
than traditional gradient descent.

Escaping Local Minima: The noisy updates in SGD, caused by the stochastic nature of the algorithm, can help the model escape local minima or saddle points, potentially leading to better solutions in non-convex optimization problems (common in deep learning).

Online Learning: SGD is well-suited for online learning, where the model is trained incrementally as new data comes in, rather than on a static dataset.
Applications of Stochastic Gradient Descent
SGD and its variants are widely used across various domains of machine learning:

Deep Learning: In training deep neural networks, SGD is the default optimizer due
to its efficiency with large datasets and its ability to work with large models. Deep
learning frameworks like TensorFlow and PyTorch typically use variants like Adam or
RMSprop, which are based on SGD.

Natural Language Processing (NLP): Models like Word2Vec and transformers are
trained using SGD variants to optimize large models on vast text corpora.

Computer Vision: For tasks such as image classification, object detection and
segmentation, SGD has been fundamental in training convolutional neural networks
(CNNs).

Reinforcement Learning: SGD is also used to optimize the parameters of models used in reinforcement learning, such as deep Q-networks (DQNs) and policy gradient methods.
