DL Activation Functions Question Bank

16. Explain Artificial Neural Network (ANN).

This is, in essence, another name for a Deep Neural Network, the model behind Deep Learning.


What does a Neural Network mean?
 A neural network essentially takes logistic regression and repeats it multiple times.
 In a normal logistic regression, we have an input layer and an output layer.
 But in a neural network, there is at least one hidden layer of regression between these input and output layers.
How many layers are needed to call it a "Deep" neural network?
 Of course, there is no specific number of layers that classifies a neural network as deep.
 The term "Deep" is, quite frankly, relative to every problem.
 The better question to ask is "How deep?".
 For example, the question "How deep is your swimming pool?" can be answered in multiple ways.

 It could be 2 meters deep or 10 meters deep, but it has "depth". The same goes for our neural network: it can have 2 hidden layers or thousands of hidden layers (yes, you heard that correctly).
 So I'd like to stick with the question "How deep?" for the time being.
Why are they called hidden layers?
 They are called hidden because they do not see the original inputs (the training set).
 For example, let's say you have a NN with an input layer, one hidden layer, and an output layer.
 When asked how many layers your NN has, your answer should be "It has 2 layers", because the input layer is not counted when computing the depth.
 Let me help you visualize what a 2-layer neural network looks like.
Step by step, we shall understand this architecture.
1) Here we have a 2-layered Artificial Neural Network. A neural network was created to mimic the biological neurons of the human brain. In our ANN we have "k" nodes in the hidden layer. The number of nodes is a hyperparameter, which essentially means that the amount is configured by the practitioner making the model.
2) The input and output layers do not change. We have "n" input features and 3 possible outcomes.
3) Unlike logistic regression, this neural network uses the tanh function as its activation function instead of the sigmoid function, which you are quite familiar with. The reason is that the mean of its output is closer to 0, which makes its output more centered as input to the next layer. The tanh function also adds non-linearity, which helps our model learn better.

4) In normal logistic regression: Input => Output. Whereas in a neural network: Input => Hidden Layer => Output. The hidden layer can be imagined as the output of part 1 and the input of part 2 of our ANN.
Now let us take a more practical approach to a 2-layered neural network.
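To make this flow concrete, here is a minimal numpy sketch (an illustration, not code from the source) of a forward pass through such a 2-layer network with a tanh hidden layer; the sizes n and k and the random weights are placeholder assumptions.

```python
import numpy as np

n, k = 4, 5                         # n input features, k hidden nodes (hyperparameters)
x = np.random.randn(n)              # one example with n features

# Hidden layer: weighted sum followed by the tanh activation
W1, b1 = np.random.randn(k, n), np.zeros(k)
h = np.tanh(W1 @ x + b1)

# Output layer: 3 possible outcomes, softmax turns the scores into probabilities
W2, b2 = np.random.randn(3, k), np.zeros(3)
z = W2 @ h + b2
y = np.exp(z) / np.exp(z).sum()
print(y)                            # three probabilities summing to 1
```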
17. Explain elements of Deep Learning?
Researchers tried to mimic the working of the human brain and replicate it in machines, making machines capable of thinking and solving complex problems. Deep Learning (DL) is a subset of Machine Learning (ML) that allows us to train a model using a set of inputs and then predict outputs based on them. Like the human brain, the model consists of a set of neurons that can be grouped into 3 layers:
a) Input Layer: It receives input and passes it to the hidden layers.
b) Hidden Layers: There can be 1 or more hidden layers in a Deep Neural Network (DNN). "Deep" in DL refers to having more than 1 hidden layer. All computations are done by the hidden layers.
c) Output Layer: This layer receives input from the last hidden layer and gives the output.

18. Explain working of deep learning with an example.

We will see how a DNN works with the help of a train price prediction problem. For simplicity, we have taken 3 inputs, namely Departure Station, Arrival Station, and Departure Date. In this case, the input layer will have 3 neurons, one for each input. The first hidden layer will receive input from the input layer and start performing mathematical computations, followed by the other hidden layers. The number of hidden layers and the number of neurons in each hidden layer are hyperparameters that are challenging to decide. The output layer will give the predicted price value. There can be more than 1 neuron in the output layer; in our case, we have only 1 neuron as the output is a single value.
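As an illustration only, a minimal Keras sketch of such a network is shown below; the two hidden layers of 16 neurons are assumed sizes, while the 3 inputs and the single output neuron come from the example above.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 3 inputs: departure station, arrival station, departure date (already encoded as numbers)
model = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(16, activation="relu"),   # first hidden layer (assumed size)
    layers.Dense(16, activation="relu"),   # second hidden layer (assumed size)
    layers.Dense(1),                       # single output neuron: the predicted price
])
model.summary()
```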

Now, how is the price prediction made by the hidden layers? How is the computation done inside them? This will be explained with the help of activation functions, loss functions, and optimizers.
19. What is an activation function?
Each neuron has an activation function that performs computation. Different layers can have different activation functions, but neurons belonging to one layer have the same activation function. In a DNN, a weighted sum of the inputs is calculated based on the weights and inputs provided. Then the activation function comes into the picture: it works on the weighted sum and converts it into the output.
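A small sketch (not from the source) of this "weighted sum, then activation" computation for a single neuron:

```python
import numpy as np

def neuron(inputs, weights, bias, activation=np.tanh):
    """Weighted sum of the inputs, then the activation function converts it into the output."""
    weighted_sum = np.dot(weights, inputs) + bias
    return activation(weighted_sum)

print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.4, 0.1, -0.6]), bias=0.2))
```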

20. Why are activation functions required?

Activation functions help the model learn the complex relationships that exist within the dataset. If we did not use an activation function in the neurons and simply gave the weighted sum as output, computations would be difficult as there is no specific range for the weighted sum. So, the activation function helps keep the output in a particular range. Secondly, a non-linear activation function is always preferred, as it adds non-linearity; without it the network would reduce to a simple linear regression model incapable of taking advantage of hidden layers. The ReLU function or its variants is mostly used for hidden layers, and the sigmoid/softmax function is mostly used for the final layer in binary/multi-class classification problems.

22. What is Loss/Cost Function?

To train the model, we give inputs (departure location, arrival location, and departure date in the case of train price prediction) to the network and let it predict the output using the activation functions. Then, we compare the predicted output with the actual output and compute the error between the two values. This error is computed using the loss/cost function. The same process is repeated for the entire training dataset and we get the average loss/error. Now, the objective is to minimize this loss to make the model accurate. There exist weights on each connection between 2 neurons. Initially, the weights are randomly initialized, and the motive is to update these weights with every iteration to get the minimum value of the loss/cost function. We could change the weights randomly, but that is not an efficient method. Here comes the role of optimizers, which update the weights automatically.
The loss function is chosen based on the problem.

23. What are different loss functions and their use cases?

a) Regression problems: Mean squared error (MSE) is used where a real-valued quantity is to be predicted, e.g. MSE in the case of train price prediction, as the predicted price is a real-valued quantity.
b) Binary/multi-class classification problems: Cross-entropy is used.
c) Maximum-margin classification: Hinge loss is used.
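Illustrative numpy versions of the two most common loss functions above (the formulas are standard; the sample numbers are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: for regression, e.g. train price prediction."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy: y_true is one-hot, y_pred are predicted class probabilities."""
    return -np.sum(y_true * np.log(y_pred + eps))

print(mse(np.array([120.0, 80.0]), np.array([110.0, 95.0])))          # 162.5
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))  # ~0.357
```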

24. Explain optimizers. Why are optimizers required?

Once the loss for one iteration is computed, an optimizer is used to update the weights. Instead of changing the weights manually, optimizers update the weights automatically in small increments and help to find the minimum value of the loss/cost function. The magic of DL! Finding the minimum value of the cost function requires iterating through the dataset many times and thus requires large computational power. The common technique used to update these weights is gradient descent.
It is used to find the minimum value of the loss function by updating the weights. There are 3 variants:

25. What is Gradient Descent (GD) and its variants?

a) Batch/Vanilla Gradient Descent
 In this, the gradient for the entire dataset is computed to perform one weight update.
 It gives good results but can be slow and requires large memory.
b) Stochastic Gradient Descent (SGD)
 Weights are updated for each training data point.

 Therefore, frequent updates are performed, which can cause the objective function to fluctuate.
c) Mini-batch Gradient Descent
 It takes the best of batch gradient descent and SGD.
 It is the algorithm of choice.
 It reduces the frequency of updates and thus can lead to more stable convergence.
 Choosing a proper learning rate is difficult.
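A minimal sketch of mini-batch gradient descent for a toy linear model; the batch size of 32 and learning rate of 0.01 are assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)   # toy dataset
w, lr, batch_size = np.zeros(3), 0.01, 32                  # assumed learning rate / batch size

for epoch in range(10):
    idx = rng.permutation(len(X))                          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)          # MSE gradient on the mini-batch
        w -= lr * grad                                     # one weight update per mini-batch
```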
26. What are GD optimization methods and which optimizer to use?
To overcome challenges in GD, some optimization methods are used by the AI community. Further, less effort is required in hyperparameter tuning.
a) Adagrad
 Algorithm of choice in the case of sparse data.
 Eliminates the need to manually tune the learning rate, unlike GD.
 A default value of 0.01 is preferred.
b) Adadelta
 Reduces Adagrad's monotonically decreasing learning rate.
 Does not require a default learning rate.
c) RMSprop
 RMSprop and Adadelta were developed for the same purpose at around the same time.
 A learning rate of 0.001 is preferred.
d) Adam
 It works well with most problems and is the algorithm of choice.
 Can be seen as a combination of RMSprop and momentum.
 AdaMax and Nadam are variants of Adam.
To sum up, a DNN takes the input, which is worked upon by activation functions to make computations and learn complex relationships within the dataset. Then, the loss is computed for the entire dataset based on the actual and predicted values. Finally, to minimize the loss and bring the predicted values close to the actual ones, the weights are updated using optimizers. This process continues until the model converges, with the motive of getting the minimum loss value.
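Putting the three elements together, here is a hedged Keras sketch (illustrative layer sizes and random placeholder data) that uses a ReLU hidden layer, MSE loss, and the Adam optimizer, as recommended above:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(500, 3)                     # placeholder inputs
y = np.random.rand(500, 1)                     # placeholder target prices

model = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(16, activation="relu"),       # activation in the hidden layer
    layers.Dense(1),                           # single output for regression
])
model.compile(optimizer="adam", loss="mse")    # Adam optimizer, MSE loss
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```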
27. What is Convolutional Neural Network (CNN)?
"Convolutional neural networks" are simply neural networks with some mathematical operation (generally matrix multiplication) between their layers, called convolution. It was proposed by Yann LeCun in 1998. It is one of the most popular architectures for image classification. A convolutional neural network can broadly be divided into these parts:
1. Input layer
2. Convolutional layer
3. Output layer

28. Explain the architecture of Convolutional Neural Networks (CNN)?

The input layer is connected to convolutional layers that perform many tasks such as padding, striding, and the functioning of kernels; this layer is considered the building block of convolutional neural networks. We will discuss its functioning in detail and how the fully connected networks work.
Convolutional Layer: The convolutional layer's main objective is to extract features from images and learn all the features of the image, which helps in object detection. As we know, the input layer will contain pixel values with some width and height; our kernels or filters will convolve over the input layer and give results that retain all the features with fewer dimensions. Let's see how kernels work.
Figure: Formation and arrangement of convolutional kernels, showing how the kernels work and how padding is done.
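A small numpy sketch (illustrative, not the source's code) of a 3x3 kernel convolving over a 6x6 input with stride 1 and no padding, showing how the output has fewer dimensions:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)                      # toy 6x6 "image"
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)  # vertical-edge filter
print(convolve2d(image, kernel).shape)                                # (4, 4): fewer dimensions
```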

Figure: Matrix visualization in CNN.
Need for Padding: We need padding in order to make our kernels fit the input matrices. Sometimes we do zero padding, i.e. adding rows or columns of zeros around the input matrix; alternatively we can cut off the part that does not fit in the input image, which is known as valid padding. To reduce parameters with negligible loss of information, we use techniques like max pooling and average pooling.
Max pooling or Average pooling:

Figure: Matrix formation using max pooling and average pooling.

Max pooling or average pooling reduces the number of parameters and thus the computation in our convolutional architecture. Here, 2*2 filters with a stride of 2 are taken (which we usually use). As the names suggest, max pooling extracts the maximum value from each filter window and average pooling takes the average of the window. We perform pooling to reduce dimensionality, and we add padding only if necessary. More convolutional layers can be added to the model until the required conditions are satisfied.
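A sketch of 2x2 max pooling and average pooling with a stride of 2, the usual setting mentioned above (illustrative code, not from the source):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each size x size window to a single value: its maximum or its average."""
    h, w = feature_map.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 7., 8.],
               [3., 2., 1., 0.],
               [1., 2., 3., 4.]])
print(pool2d(fm, mode="max"))   # [[6. 8.] [3. 4.]]
print(pool2d(fm, mode="avg"))   # [[3.75 5.25] [2.   2.  ]]
```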
29. Explain activation functions in CNN?
An activation function is added to our network anywhere between two convolutional layers or at the end of the network. So you must be wondering what exactly an activation function does; let me explain it in simple words. It helps in deciding which information should be fired forward and which should not, by making decisions at the end of each layer. Broadly, there are both linear and non-linear activation functions, performing linear and non-linear transformations respectively, but non-linear activation functions are far more helpful and therefore widely used in neural networks as well as deep learning networks. The four most famous activation functions for adding non-linearity to the network are described below.
1. Sigmoid Activation Function
The equation for the sigmoid function is:
f(x) = 1 / (1 + e^(-x))

Figure: Sigmoid activation function.


The sigmoid activation function is widely used because it does its task with great efficiency: it is essentially a probabilistic approach to decision making, with outputs ranging between 0 and 1. When we have to make a decision or predict an output, we use this activation function because its limited range makes the output easy to interpret as a probability.

2. Hyperbolic Tangent Activation Function (Tanh)

Figure: Tanh activation function.

This activation function is slightly better than the sigmoid function. Like the sigmoid function, it is also used to predict or to differentiate between two classes, but it maps negative inputs to negative outputs and its range is -1 to 1.
3. ReLU (Rectified Linear Unit) Activation Function
The rectified linear unit, or ReLU, is the most widely used activation function right now. It ranges from 0 to infinity: all negative values are converted to zero, so for negative inputs the neuron outputs nothing and cannot map or fit the data properly, which creates a problem; but where there is a problem there is a solution.

Figure: Rectified Linear Unit activation function.

We use the Leaky ReLU function instead of ReLU to avoid this unfitting; in Leaky ReLU the range is expanded to allow small negative outputs, which enhances performance.

4. Softmax Activation Function
Softmax is used mainly at the last layer, i.e. the output layer, for decision making, much as the sigmoid activation works: softmax assigns a value to each input variable according to its weight, and the sum of these values is eventually one.

Figure: Softmax activation function.

For binary classification, both sigmoid and softmax are equally approachable, but in the case of multi-class classification problems we generally use softmax, with cross-entropy along with it.
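For reference, the four functions above can be written as plain numpy expressions; this is a sketch, and the Leaky ReLU slope of 0.01 is a common default rather than a value given in the text:

```python
import numpy as np

def sigmoid(x):    return 1 / (1 + np.exp(-x))          # outputs in (0, 1)
def tanh(x):       return np.tanh(x)                     # outputs in (-1, 1)
def relu(x):       return np.maximum(0, x)               # negatives become 0
def leaky_relu(x): return np.where(x > 0, x, 0.01 * x)   # small slope for negative inputs
def softmax(x):
    e = np.exp(x - np.max(x))                            # shift for numerical stability
    return e / e.sum()                                   # outputs sum to 1

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), softmax(x), sep="\n")
```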
30. What's the need of activation functions?

"The world is one big data problem."

As it turns out, this saying holds true both for our brains and for machine learning. Every single moment our brain is trying to segregate the incoming information into the "useful" and "not-so-useful" categories.

A similar process occurs in artificial neural network architectures in deep learning. This segregation plays a key role in helping a neural network function properly, ensuring that it learns from the useful information rather than getting stuck analyzing the not-so-useful part. This is also where activation functions come into the picture. An activation function helps the neural network use important information while suppressing irrelevant data points.
31. What is a Neural Network Activation Function?
An activation function decides whether a neuron should be activated or not. This means that it decides whether the neuron's input to the network is important or not in the process of prediction, using simpler mathematical operations. The role of the activation function is to derive an output from a set of input values fed to a node (or a layer). But let's take a step back and clarify: what exactly is a node? Well, if we compare the neural network to our brain, a node is a replica of a neuron that receives a set of input signals, i.e. external stimuli.
Depending on the nature and intensity of these input signals, the brain processes them and decides whether the neuron should be activated ("fired") or not. In deep learning, this is also the role of the activation function, which is why it is often referred to as a Transfer Function in artificial neural networks. The primary role of the activation function is to transform the summed weighted input from the node into an output value to be fed to the next hidden layer or used as the final output.

Now, let's have a look at the neural network architecture.
Elements of a Neural Network Architecture
Here's the thing: if you don't understand the concept of neural networks and how they work, diving deeper into the topic of activation functions might be challenging. That's why it's a good idea to refresh your knowledge and take a quick look at the structure of the neural network architecture and its components.
A neural network is made of interconnected neurons, each of which is characterized by its weight, bias, and activation function.
Input Layer
The input layer takes raw input from the domain. No computation is performed at
this layer. Nodes here just pass on the information (features) to the hidden layer.
Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an
abstraction to the neural network.
The hidden layer performs all kinds of computation on the features entered
through the input layer and transfers the result to the output layer.
Output Layer
It's the final layer of the network; it brings together the information learned through the hidden layers and delivers the final value as a result.
Note: All hidden layers usually use the same activation function. However, the
output layer will typically use a different activation function from the hidden
layers. The choice depends on the goal or type of prediction made by the model.

Feedforward vs. Backpropagation
When learning about neural networks, you will come across two essential terms describing the movement of information: feedforward and backpropagation. Let's explore them.
Feedforward Propagation: the flow of information occurs in the forward direction. The input is used to calculate some intermediate function in the hidden layer, which is then used to calculate the output. In feedforward propagation, the activation function is a mathematical "gate" between the input feeding the current neuron and its output going to the next layer.
Backpropagation: the weights of the network connections are repeatedly adjusted to minimize the difference between the actual output vector of the net and the desired output vector. To put it simply, backpropagation aims to minimize the cost function by adjusting the network's weights and biases. The cost function gradients determine the level of adjustment with respect to parameters like the activation function, weights, bias, etc.
Why do Neural Networks Need an Activation Function?
So we know what an activation function is and what it does, but why do neural networks need it? Well, the purpose of an activation function is to add non-linearity to the neural network.
Activation functions introduce an additional step at each layer during forward propagation, but the computation is worth it. Here is why: let's suppose we have a neural network working without activation functions. In that case, every neuron will only be performing a linear transformation on the inputs using the weights and biases. It doesn't matter how many hidden layers we attach to the neural network; all layers will behave in the same way, because the composition of two linear functions is a linear function itself. Although the neural network becomes simpler, learning any complex task is impossible, and our model would be just a linear regression model.
3 Types of Neural Network Activation Functions
Now that we've covered the essential concepts, let's go over the most popular neural network activation functions.
Binary Step Function: The binary step function depends on a threshold value that decides whether a neuron should be activated or not. The input fed to the activation function is compared to a certain threshold; if the input is greater than it, the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.
Binary Step Function: Mathematically it can be represented as:
f(x) = 0 for x < 0; f(x) = 1 for x >= 0

Here are some of the limitations of the binary step function:

 It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems.
 The gradient of the step function is zero, which causes a hindrance in the backpropagation process.
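A one-line numpy sketch of the binary step function, assuming the common convention of a threshold at 0:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Output 1 when the input reaches the threshold, otherwise 0."""
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-1.5, 0.0, 2.3])))   # [0 1 1]
```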
Linear Activation Function: The linear activation function, also known as "no activation" or the "identity function" (multiplied by 1.0), is where the activation is proportional to the input. The function doesn't do anything to the weighted sum of the input; it simply returns the value it was given.

Linear Activation Function: Mathematically it can be represented as:
f(x) = x
However, a linear activation function has two major problems:

 It's not possible to use backpropagation, as the derivative of the function is a constant and has no relation to the input x.
 All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer, as illustrated below.
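A quick numerical illustration (not from the source) of this collapse: composing two linear layers gives exactly the same result as one combined linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))   # two "layers" of weights
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)        # two stacked layers with identity activations
one_collapsed_layer = (W2 @ W1) @ x      # a single equivalent linear layer
print(np.allclose(two_linear_layers, one_collapsed_layer))   # True
```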
Non-Linear Activation Functions
The linear activation function shown above is simply a linear regression model.
Because of its limited power, this does not allow the model to create complex
mappings between the network’s inputs and outputs.
Non-linear activation functions solve the following limitations of linear activation
functions:
 They allow backpropagation because now the derivative function would be
related to the input, and it’s possible to go back and understand which
weights in the input neurons can provide a better prediction.
 They allow the stacking of multiple layers of neurons as the output would
now be a non-linear combination of input passed through multiple layers.
Any output can be represented as a functional computation in a neural
network.
Now, let's have a look at different non-linear neural network activation functions and their characteristics.

32. Explain non-linear Neural Network activation functions.

Sigmoid / Logistic Activation Function

This function takes any real value as input and outputs values in the range of 0 to 1. The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.

Mathematically it can be represented as:
f(x) = 1 / (1 + e^(-x))

Here's why the sigmoid/logistic activation function is one of the most widely used functions:
 It is commonly used for models where we have to predict a probability as output. Since the probability of anything exists only between 0 and 1, sigmoid is the right choice because of its range.
 The function is differentiable and provides a smooth gradient, i.e. it prevents jumps in output values. This is represented by the S-shape of the sigmoid activation function.
The limitations of the sigmoid function are discussed below:
 The derivative of the function is f'(x) = sigmoid(x) * (1 - sigmoid(x)).

Figure: The derivative of the sigmoid activation function.
As we can see from the figure above, the gradient values are only significant in the range -3 to 3, and the graph gets much flatter in other regions. This implies that for values greater than 3 or less than -3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem. Also, the output of the logistic function is not symmetric around zero, so the outputs of all the neurons will be of the same sign. This makes training the neural network more difficult and unstable.
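A small sketch of the derivative f'(x) = sigmoid(x) * (1 - sigmoid(x)), showing how quickly it shrinks outside roughly [-3, 3]:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)                 # peaks at 0.25 when x = 0

for x in [0.0, 3.0, 6.0, 10.0]:
    print(x, sigmoid_grad(x))          # gradients shrink towards zero as |x| grows
```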
Tanh Function (Hyperbolic Tangent)
The tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-shape, with the difference that its output range is -1 to 1. In tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.

Mathematically it can be represented as:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Advantages of using this activation function are:

 The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.
 It is usually used in hidden layers of a neural network, as its values lie between -1 and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function to understand its limitations.

Figure: Gradient of the tanh activation function.
As you can see, it also faces the problem of vanishing gradients, similar to the sigmoid activation function, although the gradient of the tanh function is much steeper than that of the sigmoid function. Note: although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered and its gradients are not restricted to move in a certain direction. Therefore, in practice, the tanh nonlinearity is always preferred to the sigmoid nonlinearity.
ReLU Function
ReLU stands for Rectified Linear Unit. Although it gives an impression of a linear function, ReLU has a derivative function and allows for backpropagation while simultaneously being computationally efficient. The main catch here is that the ReLU function does not activate all the neurons at the same time: a neuron is deactivated only if the output of the linear transformation is less than 0.

Mathematically it can be represented as:
f(x) = max(0, x)

The advantages of using ReLU as an activation function are as follows:


 Since only a certain number of neurons are activated, the ReLU function
is far more computationally efficient when compared to the sigmoid and
tanh functions.

 ReLU accelerates the convergence of gradient descent towards the
global minimum of the loss function due to its linear, non-saturating
property.
Softmax Function
Before exploring the ins and outs of the softmax activation function, we should focus on its building block: the sigmoid/logistic activation function, which works on calculating probability values.
The output of the sigmoid function is in the range of 0 to 1, which can be thought of as a probability. But this function faces certain problems. Let's suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with them? The answer is: we can't. These values don't make sense, as the sum of all the class/output probabilities should be equal to 1. The softmax function can be described as a combination of multiple sigmoids; it calculates the relative probabilities. Similar to the sigmoid/logistic activation function, the softmax function returns the probability of each class. It is most commonly used as the activation function for the last layer of the neural network in the case of multi-class classification.
Mathematically it can be represented as:
softmax(z_i) = e^(z_i) / sum_j e^(z_j)
Let's go over a simple example together.

Assume that you have three classes, meaning that there would be three neurons in the output layer. Now, suppose that the outputs from the neurons are [1.8, 0.9, 0.68]. Applying the softmax function over these values to give a probabilistic view results in the following outcome: [0.58, 0.23, 0.19]. The largest probability is at index 0, so taking the predicted class gives full weight to index 0 and no weight to index 1 and index 2; the output is the class corresponding to the first neuron (index 0) out of the three. You can see now how the softmax activation function makes things easy for multi-class classification problems.
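Reproducing the worked example above in numpy (a sketch; the printed values are rounded to two decimals):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.8, 0.9, 0.68])    # raw outputs of the three neurons
probs = softmax(scores)
print(np.round(probs, 2))              # [0.58 0.23 0.19]
print(np.argmax(probs))                # 0 -> the class of the first neuron is chosen
```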
Scaled Exponential Linear Unit (SELU)
SELU was defined for self-normalizing networks and takes care of internal normalization, which means each layer preserves the mean and variance from the previous layers. SELU enables this normalization by adjusting the mean and variance. SELU has both positive and negative values to shift the mean, which was impossible for the ReLU activation function as it cannot output negative values. Gradients can be used to adjust the variance; for this, the activation function needs a region with a gradient larger than one.

Mathematically it can be represented as:
f(x) = lambda * x for x > 0; f(x) = lambda * alpha * (e^x - 1) for x <= 0

SELU has predefined values of alpha (α) and lambda (λ).

Here's the main advantage of SELU over ReLU: internal normalization is faster than external normalization, which means the network converges faster. SELU is a relatively new activation function and needs more papers on architectures such as CNNs and RNNs, where it is comparatively less explored.
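A sketch of SELU using the commonly published constants alpha ≈ 1.6733 and lambda ≈ 1.0507; these specific values are an assumption, as they are not given in the text above:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805   # commonly published SELU constants

def selu(x):
    """lambda * x for positive inputs, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

print(selu(np.array([-2.0, 0.0, 2.0])))
```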
33. Why are deep neural networks hard to train?
There are two challenges you might encounter when training deep neural networks. Let's discuss them in more detail.
Vanishing Gradients: Like the sigmoid function, certain activation functions squish an ample input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output, and hence the derivative becomes small. For shallow networks with only a few layers that use these activations, this isn't a big problem. However, when more layers are used, it can cause the gradient to become too small for training to work effectively.
Exploding Gradients: Exploding gradients are problems where significant
error gradients accumulate and result in very large updates to neural network
model weights during training. An unstable network can result when there are
exploding gradients, and the learning cannot be completed. The values of the
weights can also become so large as to overflow and result in something called
NaN values.
34. How to choose the right Activation Function?
You need to match your activation function for your output layer based on the
type of prediction problem that you are solving—specifically, the type of predicted
variable.
Here’s what you should keep in mind.
As a rule of thumb, you can begin with using the ReLU activation function and
then move over to other activation functions if ReLU doesn’t provide optimum
results.
And here are a few other guidelines to help you out.
1. ReLU activation function should only be used in the hidden layers.
2. Sigmoid/Logistic and Tanh functions should not be used in hidden layers
as they make the model more susceptible to problems during training (due
to vanishing gradients).
3. The Swish function is used in neural networks having a depth greater than 40 layers.
Finally, a few rules for choosing the activation function for your output layer based on the type of prediction problem that you are solving:
1. Regression - Linear Activation Function
2. Binary Classification—Sigmoid/Logistic Activation Function
3. Multiclass Classification—Softmax
4. Multilabel Classification—Sigmoid
The activation function used in hidden layers is typically chosen based on the type of neural network architecture:
5. Convolutional Neural Network (CNN): ReLU activation function.
6. Recurrent Neural Network (RNN): Tanh and/or Sigmoid activation function.
Summary:
 Activation Functions are used to introduce non-linearity in the network.
 A neural network will almost always have the same activation function in
all hidden layers. This activation function should be differentiable so that
the parameters of the network are learned in backpropagation.
 ReLU is the most commonly used activation function for hidden layers.

 While selecting an activation function, you must consider the problems it
might face: vanishing and exploding gradients.
 Regarding the output layer, we must always consider the expected value range of the predictions. If it can be any numeric value (as in the case of a regression problem), you can use the linear activation function or ReLU.
 Use the softmax or sigmoid function for classification problems.
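As a hedged Keras illustration of the output-layer rules above (the layer sizes and the 10 classes are placeholders, and the rest of the model is omitted):

```python
from tensorflow.keras import layers

# Output layers matching the rules of thumb above (sizes are placeholders)
regression_out = layers.Dense(1, activation="linear")     # regression
binary_out     = layers.Dense(1, activation="sigmoid")    # binary classification
multiclass_out = layers.Dense(10, activation="softmax")   # multi-class classification
multilabel_out = layers.Dense(10, activation="sigmoid")   # multi-label classification
```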

35. How does deep learning attain such impressive results?

In a word: accuracy. Deep learning achieves recognition accuracy at higher levels than ever before. This helps consumer electronics meet user expectations, and it is crucial for safety-critical applications like driverless cars. Recent advances in deep learning have improved to the point where deep learning outperforms humans in some tasks, such as classifying objects in images. While deep learning was first theorized in the 1980s, there are two main reasons it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example,
driverless car development requires millions of images and thousands of
hours of video.
2. Deep learning requires substantial computing power. High-performance
GPUs have a parallel architecture that is efficient for deep learning. When
combined with clusters or cloud computing, this enables development
teams to reduce training time for a deep learning network from weeks to
hours or less.
36. State examples of Deep Learning.
Deep learning applications are used in industries ranging from automated driving to medical devices.
Automated Driving: Automotive researchers are using deep learning to automatically detect objects such as stop signs and traffic lights. In addition, deep learning is used to detect pedestrians, which helps decrease accidents.
Aerospace and Defense: Deep learning is used to identify objects from
satellites that locate areas of interest, and identify safe or unsafe zones for
troops.
Medical Research: Cancer researchers are using deep learning to
automatically detect cancer cells. Teams at UCLA built an advanced microscope
that yields a high-dimensional data set used to train a deep learning application
to accurately identify cancer cells.
Industrial Automation: Deep learning is helping to improve worker safety around
heavy machinery by automatically detecting when people or objects are within
an unsafe distance of machines.
Electronics: Deep learning is being used in automated hearing and speech
translation. For example, home assistance devices that respond to your voice
and know your preferences are powered by deep learning applications.
37. What's the Difference Between Machine Learning and Deep Learning?
Deep learning is a specialized form of machine learning. A machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that categorizes the objects in the image. With a deep learning workflow, relevant features are automatically extracted from images. In addition, deep learning performs "end-to-end learning", where a network is given raw data and a task to perform, such as classification, and it learns how to do this automatically. Another key difference is that deep learning algorithms scale with data, whereas shallow learning converges. Shallow learning refers to machine learning methods that plateau at a certain level of performance when you add more examples and training data to the network. A key advantage of deep learning networks is that they often continue to improve as the size of your data increases. In machine learning, you manually choose features and a classifier to sort images. With deep learning, the feature extraction and modeling steps are automatic.

Figure: Comparing a machine learning approach to categorizing vehicles
(left) with deep learning (right).
38. Choosing Between Machine Learning and Deep Learning
 Machine learning offers a variety of techniques and models you can
choose based on your application, the size of data you're processing,
and the type of problem you want to solve.
 A successful deep learning application requires a very large amount of
data (thousands of images) to train the model, as well as GPUs, or
graphics processing units, to rapidly process your data.
 When choosing between machine learning and deep learning, consider
whether you have a high-performance GPU and lots of labeled data.
 If you don’t have either of those things, it may make more sense to use
machine learning instead of deep learning.
 Deep learning is generally more complex, so you’ll need at least a few
thousand images to get reliable results.

 Having a high-performance GPU means the model will take less time to
analyze all those images.
39. Application oriented questions
Explain the role of reinforcement learning in the following example. Identify the environment, agent, different actions, reward, punishment, etc. Draw its block diagram.

Reinforcement learning provides the means for robots to learn complex behavior from interaction, on the basis of generalizable behavioural primitives. From negative human feedback, the robot learns from its own misconduct.

Block diagram elements: Agent, Environment, Different actions, Reward, Punishment.

Feel free to contact me on +91-8329347107 (calling) / +91-9922369797 (WhatsApp), email ID: [email protected] and [email protected]

*********************
