Activation Functions:
• Artificial neurons are the elementary units of an artificial neural
network. An artificial neuron receives one or more inputs, weights
each input separately, and sums them. The weighted sum is then passed
through a function known as an activation function (or transfer
function) to produce the output.
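A minimal sketch of this forward pass in Python (the function names, example
weights, and bias below are illustrative, not taken from the slides):
def neuron_output(inputs, weights, bias, activation):
    # Weight each input separately, sum, then pass through the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Example with a simple threshold (step) activation.
step = lambda z: 1.0 if z >= 0.0 else 0.0
print(neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.2, activation=step))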
Threshold activation function
• The threshold activation function is defined by f(x) = 1 if x ≥ θ and
f(x) = 0 if x < θ, where θ is the threshold value.
Unit step functions
• Sometimes the threshold activation function is defined as a unit step
function (threshold at zero), in which case it is called a unit-step
activation function.
Sigmoid activation function (logistic function):
• One of the most commonly used activation functions is the sigmoid
(logistic) activation function, defined as σ(x) = 1 / (1 + e^(-x)).
• Its graph is an ‘S’-shaped curve, and its output always lies between 0 and 1.
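A minimal sketch of the sigmoid in Python (a plain illustrative implementation,
not library code):
import math

def sigmoid(x):
    # Logistic function: maps any real value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    print('sigmoid(%.1f) = %.4f' % (x, sigmoid(x)))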
GRADIENT DESCENT IN MACHINE LEARNING
• Gradient descent is one of the most commonly used optimization
algorithms for training machine learning models; it works by minimizing
the error between the actual and predicted results.
• In mathematical terms, an optimization algorithm minimizes (or maximizes)
an objective function f(x) with respect to its parameters x. In machine
learning, optimization is the task of minimizing the cost function with
respect to the model's parameters.
• Gradient descent is an iterative optimization algorithm, widely used for
training machine learning and deep learning models, that helps find a
local minimum of a function.
• If we move in the direction of the negative gradient (away from the
gradient) of the function at the current point, we approach a local
minimum of that function.
• If we move in the direction of the positive gradient (towards the
gradient) of the function at the current point, we approach a local
maximum of that function.
Cost function
The cost function measures the difference (error) between the actual
values and the predicted values at the current parameter position.
Direction & Learning Rate
• The gradient gives the direction of steepest increase, so each update
step moves in the opposite (negative gradient) direction. The learning
rate controls how large each step is: too large a rate can overshoot the
minimum, while too small a rate makes training slow.
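A minimal sketch of the gradient descent update rule, w ← w − α·∇J(w), on a
one-dimensional cost function (the example function, learning rate, and step
count below are illustrative assumptions):
def gradient_descent(grad, w0, learning_rate=0.1, steps=50):
    # Repeatedly step in the direction of the negative gradient.
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Example: minimize J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
print(gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0))  # approaches 3.0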
Types of Gradient Descent
• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent
1. Batch Gradient Descent:
• Batch gradient descent (BGD) computes the error for every point in the
training set and updates the model only after evaluating all training
examples.
• One complete pass through the training set is known as a training epoch.
2. Stochastic Gradient Descent:
• Stochastic gradient descent (SGD) is a type of gradient descent that
updates the model using a single training example per iteration.
3. Mini-Batch Gradient Descent:
• Mini-batch gradient descent combines batch gradient descent and
stochastic gradient descent: it divides the training dataset into small
batches and performs an update on each batch separately (see the sketch
after this list).
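A minimal sketch contrasting the three variants on a linear least-squares cost;
the data, learning rate, epoch count, and batch sizes below are illustrative
assumptions:
import numpy as np

def gradient(w, X, y):
    # Gradient of the mean squared error (1/n) * ||X @ w - y||^2.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, learning_rate=0.1, epochs=100):
    # batch_size = len(y)      -> batch gradient descent (one update per epoch)
    # batch_size = 1           -> stochastic gradient descent (noisier updates)
    # 1 < batch_size < len(y)  -> mini-batch gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            w -= learning_rate * gradient(w, X[idx], y[idx])
    return w

X = np.random.randn(100, 2)
y = X @ np.array([2.0, -1.0])
print(train(X, y, batch_size=len(y)))   # batch
print(train(X, y, batch_size=1))        # stochastic
print(train(X, y, batch_size=16))       # mini-batch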
Challenges with Gradient Descent
1. Local Minima and Saddle Points:
• Whenever the slope of the cost function is zero or very close to zero,
the model stops learning further, even though the point may not be the
global minimum.
2. Vanishing and Exploding Gradients:
Vanishing gradient:
• A vanishing gradient occurs when the gradient becomes far smaller than
expected, so the earlier layers receive almost no update.
Exploding gradient:
• An exploding gradient is the opposite of a vanishing gradient: it occurs
when the gradient grows too large, which creates an unstable model.
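A minimal numeric sketch of why this happens: during backpropagation the
gradient is multiplied by a factor at every layer, so per-layer factors below 1
shrink it exponentially and factors above 1 blow it up (the factors and layer
count below are illustrative):
def backprop_scale(per_layer_factor, num_layers):
    # The gradient reaching the first layer is scaled by the product
    # of the per-layer factors.
    return per_layer_factor ** num_layers

print(backprop_scale(0.25, 20))  # ~9.1e-13: vanishing gradient
print(backprop_scale(2.0, 20))   # ~1.0e+06: exploding gradient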
UNIT SATURATION
• Unit saturation occurs when the output of an activation
function reaches its maximum or minimum value and stops
responding to changes in input. This can hinder the learning
process of neural networks.
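A minimal sketch of saturation with the sigmoid: for large |x| the output
flattens out near 0 or 1 and the derivative, σ(x)·(1 − σ(x)), is almost zero
(the sample inputs below are illustrative):
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (0.0, 5.0, 10.0):
    s = sigmoid(x)
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)).
    print('x=%5.1f  output=%.6f  derivative=%.6f' % (x, s, s * (1.0 - s)))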
The problem
• As more layers using certain activation functions (such as the sigmoid)
are added to a neural network, the gradients of the loss function approach
zero, making the network hard to train.
Solution
• The simplest solution is to use other activation functions, such as
ReLU, which does not produce a small derivative for positive inputs.
Residual networks are another solution, as they provide residual (skip)
connections straight to earlier layers.
ReLU
• The rectified linear unit, or ReLU, is one of the landmarks of the deep
learning revolution. It is simple, yet in deep hidden layers it generally
works better than earlier activation functions such as sigmoid or tanh.
• The ReLU formula is: f(x) = max(0, x)
• If the function receives a negative input, it returns 0; if it receives
any positive value x, it returns that value unchanged.
def relu(x):
    # Return x for positive inputs and 0.0 otherwise.
    return max(0.0, x)
To test the function, let’s run it on a few inputs.
x = 1.0
print('Applying Relu on (%.1f) gives %.1f' % (x, relu(x)))
x = -10.0
print('Applying Relu on (%.1f) gives %.1f' % (x, relu(x)))
x = 0.0
print('Applying Relu on (%.1f) gives %.1f' % (x, relu(x)))
x = 15.0
print('Applying Relu on (%.1f) gives %.1f' % (x, relu(x)))
x = -20.0
print('Applying Relu on (%.1f) gives %.1f' % (x, relu(x)))
Advantages of ReLU:
• ReLU is used in the hidden layers instead of sigmoid or tanh, because
using sigmoid or tanh in the hidden layers leads to the infamous
"vanishing gradient" problem.
• Simpler computation and lower computational cost
• Mitigates the vanishing gradient problem
Disadvantages of ReLU:
• Exploding gradient
• Dying ReLU
• Sensitivity to outliers
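A minimal sketch of the "dying ReLU" issue: the derivative of ReLU is zero for
negative inputs, so a unit whose pre-activation stays negative receives no
gradient and stops learning (the values below are illustrative):
def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # Gradient of ReLU: 1 for positive inputs, 0 for negative inputs.
    return 1.0 if x > 0.0 else 0.0

for z in (2.0, -3.0):
    print('pre-activation %.1f -> output %.1f, gradient %.1f'
          % (z, relu(z), relu_derivative(z)))
# A unit stuck with a negative pre-activation always gets gradient 0,
# so its weights are never updated ("dying ReLU").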
HYPERPARAMETER TUNING
• Hyperparameters express important properties of the model, such as its
complexity or how fast it should learn.
• They are usually fixed before the actual training process begins, and
tuning them means searching for the combination of values that gives the
best model performance.
EXAMPLE:
• Grid Search CV
It searches for the best set of hyperparameters from a grid of
hyperparameter values, evaluating each combination with cross-validation.
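A minimal sketch using scikit-learn's GridSearchCV (the estimator, parameter
grid, and dataset below are illustrative choices, not prescribed by the slides):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid of hyperparameter values to try for an SVM classifier.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Evaluate every combination with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)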
DROPOUT
• "Dropout" in machine learning refers to the process of randomly
ignoring (dropping) certain nodes in a layer during training, which helps
prevent overfitting.
• Reference: "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" (Srivastava et al., 2014).
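A minimal sketch of (inverted) dropout applied to a layer's activations; the
dropout rate and activation values below are illustrative:
import numpy as np

def dropout(activations, rate, training=True):
    # During training, randomly zero out a fraction `rate` of the units
    # and scale the survivors so the expected activation stays the same.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob

layer_output = np.array([0.5, 1.2, -0.3, 0.8, 2.0])
print(dropout(layer_output, rate=0.4))                   # training: some units zeroed
print(dropout(layer_output, rate=0.4, training=False))   # inference: unchanged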
