Activation Functions in Neural Networks

Last Updated : 12 May, 2026

An activation function is applied to a neuron's weighted sum of inputs (plus bias) to produce the neuron's output. It introduces non-linearity, allowing the network to learn complex patterns.

Activation Functions in Neural Networks
  • Applied after the weighted sum of inputs
  • Introduces non-linearity into the model
  • Enables learning of complex data patterns
  • Without it, the network behaves like a linear model

Importance of Non-Linearity

  • Real-world data is rarely linearly separable.
  • Non-linear functions allow neural networks to form curved decision boundaries, making them capable of handling complex patterns (e.g., classifying apples vs. bananas under varying colors and shapes).
  • They let networks model complex tasks such as image recognition, NLP and speech processing.

Mathematical Example

Consider a neural network with:

  • Inputs: i1, i2
  • Hidden layer: neurons h1 and h2
  • Output layer: one neuron (output)
  • Weights: w1, w2, w3, w4, w5, w6
  • Biases: b1 for hidden layer, b2 for output layer
Neural Network

Each circle represents a neuron (node) and a group of neurons forms a layer.

The hidden layer outputs are:

h_1 = i_1 \cdot w_1 + i_2 \cdot w_3 + b_1

h_2 = i_1 \cdot w_2 + i_2 \cdot w_4 + b_1

The output before activation is:

\text{output} = h_1 \cdot w_5 + h_2 \cdot w_6 + b_2

Without activation, these are linear equations, so the output is a linear function of the inputs.

To introduce non-linearity, we apply a sigmoid activation:

\sigma(x) = \frac{1}{1+e^{-x}}

\text{final output} = \sigma(h_1 \cdot w_5 + h_2 \cdot w_6 + b_2)

This is the final output of the network after applying the sigmoid activation function at the output layer, which introduces the desired non-linearity.
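
The forward pass above can be sketched in a few lines of Python. This is only an illustration: it assumes, following the list of biases, that both hidden neurons share b1 and the output neuron uses b2. The function name `forward` and the numeric weights are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(i1, i2, w, b1, b2):
    """Forward pass for the 2-2-1 network; w = (w1, w2, w3, w4, w5, w6)."""
    w1, w2, w3, w4, w5, w6 = w
    h1 = i1 * w1 + i2 * w3 + b1   # hidden neuron 1 (no activation)
    h2 = i1 * w2 + i2 * w4 + b1   # hidden neuron 2 (no activation)
    z = h1 * w5 + h2 * w6 + b2    # output before activation
    return z, sigmoid(z)          # (linear output, final output)

# Hypothetical values, not taken from the article.
weights = (0.5, -0.5, 0.25, 0.75, 1.0, -1.0)
z, y = forward(1.0, 2.0, weights, b1=0.0, b2=0.0)
print(z, y)
```

With these values both hidden neurons evaluate to 1.0, the two output weights cancel them, and the sigmoid maps the resulting 0 to 0.5.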

Types of Activation Functions in Deep Learning

1. Linear Activation Function

The linear activation function is a straight line defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions, the output is still a linear combination of the inputs.

  • The output range is (-\infty, \infty).
  • Output is a linear combination of inputs
  • Using it in all layers makes the network behave like a linear model
  • Limits the ability to learn complex patterns
  • Commonly used in the output layer for regression tasks
  • Often combined with non-linear functions in hidden layers for better learning


Linear Activation Function or Identity Function returns the input as the output
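
To see why stacking linear layers adds no expressive power, the sketch below composes two one-input layers with identity activation and compares the result with a single equivalent layer. The helper `linear_layer` and the weights are hypothetical and serve only as a scalar illustration.

```python
def linear_layer(weight, bias):
    """Return a one-input layer with identity (linear) activation."""
    return lambda x: weight * x + bias

# Hypothetical parameters for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5
layer1 = linear_layer(w1, b1)
layer2 = linear_layer(w2, b2)

def stacked(x):
    """Two linear layers applied one after the other."""
    return layer2(layer1(x))

# w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
collapsed = linear_layer(w2 * w1, w2 * b1 + b2)

for x in (-2.0, 0.0, 1.5):
    print(x, stacked(x), collapsed(x))
```

Both columns print the same values, because the composition of two linear maps is itself a linear map with weight w2·w1 and bias w2·b1 + b2.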

2. Non-Linear Activation Functions

1. Sigmoid Function

The sigmoid activation function is characterized by its 'S' shape. It is mathematically defined as A = \frac{1}{1 + e^{-x}}. The function is smooth and differentiable everywhere, which is essential for gradient-based optimization methods.

  • It allows neural networks to handle and model complex patterns that linear equations cannot.
  • The output ranges between 0 and 1, hence useful for binary classification.
  • The function has its steepest gradient when x is between -2 and 2, so small changes in input x cause comparatively large changes in output y, which matters during training. Outside this range the curve flattens and the gradient approaches zero.
Sigmoid or Logistic Activation Function Graph
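
The gradient behaviour described above can be checked numerically. The sketch below uses the standard identity σ'(x) = σ(x)(1 − σ(x)); the function name `sigmoid_grad` and the sample points are illustrative choices.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.4f}")
```

The gradient peaks at 0.25 when x = 0 and is already small at |x| = 6, which is the saturation that slows learning for large inputs.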

2. Tanh Activation Function 

The tanh function (hyperbolic tangent) is a scaled and shifted version of the sigmoid, \tanh(x) = 2\sigma(2x) - 1, so its output is centered on zero. It is defined as:

f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1

  • Outputs values from -1 to +1.
  • Enables modeling of complex data patterns.
  • Commonly used in hidden layers due to its zero-centered output, facilitating easier learning for subsequent layers.
Tanh Activation Function
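
The formula above is the sigmoid evaluated at 2x, scaled by 2 and shifted down by 1. A quick sketch, with the hypothetical helper `tanh_via_sigmoid`, compares that form with Python's built-in `math.tanh`:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as 2 * sigmoid(2x) - 1, matching the formula above."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  via sigmoid={tanh_via_sigmoid(x):+.6f}  math.tanh={math.tanh(x):+.6f}")
```

The two columns agree up to floating-point rounding, and the output at x = 0 is 0, which is what "zero-centered" refers to.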

3. ReLU (Rectified Linear Unit) Function

ReLU is defined by A(x) = \max(0,x): if the input x is positive, ReLU returns x; otherwise it returns 0.

  • The value range is [0, \infty), meaning the function only outputs non-negative values.
  • Introduces non-linearity, enabling learning of complex patterns
  • Computationally efficient due to simple operations
  • Outputs zero for negative inputs, so only some neurons are active at a time, making activations sparse and efficient
  • Commonly used in hidden layers for faster training and better performance
ReLU Activation Function
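
A minimal sketch of ReLU and its gradient follows. The pre-activation values are made up, and treating the gradient at exactly x = 0 as 0 is an assumption (one common convention), since the derivative is undefined there.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activations for five neurons in one layer.
pre_activations = [-1.5, -0.2, 0.0, 0.7, 2.0]
outputs = [relu(x) for x in pre_activations]
active = sum(1 for y in outputs if y > 0)

print(outputs)
print(f"{active} of {len(outputs)} neurons are active")
```

Only the neurons with positive pre-activations pass a value on, which is the sparsity mentioned in the list above.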

4. Leaky ReLU

f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \leq 0 \end{cases}

  • Leaky ReLU is similar to ReLU but allows a small negative slope (\alpha, e.g., 0.01) instead of zero.
  • Mitigates the “dying ReLU” problem, where neurons get stuck with zero outputs and zero gradients.
  • Range: (-\infty, \infty).
  • Preferred in some cases for better gradient flow.
Leaky ReLU Activation Function
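
The piecewise definition translates directly into code. The sketch below uses the example slope α = 0.01 from the list above as a default; the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for x > 0, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  output={leaky_relu(x):+.4f}  gradient={leaky_relu_grad(x):.2f}")
```

Unlike plain ReLU, the gradient for negative inputs is α rather than 0, so a neuron with a negative pre-activation still receives a small weight update.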

5. SoftPlus Function

The Softplus function is defined mathematically as A(x) = \log(1 + e^x). It is similar to ReLU but replaces the sharp corner at x = 0 with a smooth curve that is differentiable everywhere.

  • The Softplus function is non-linear.
  • The function outputs values in the range (0, \infty), similar to ReLU, but without the hard zero threshold that ReLU has.
  • Softplus is smooth and continuously differentiable, so it avoids the kink at x = 0 where ReLU's derivative jumps from 0 to 1, which can sometimes cause problems during optimization.
Softplus Activation Function
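
Evaluating \log(1 + e^x) directly overflows for large x. The sketch below uses an algebraically equivalent form, max(x, 0) + log(1 + e^{-|x|}), which is one common way to keep the computation stable; it is an illustrative rewrite, not the only option.

```python
import math

def softplus(x):
    """Softplus, log(1 + e^x), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    """ReLU, for comparison."""
    return max(0.0, x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0, 1000.0):
    print(f"x={x:+8.1f}  softplus={softplus(x):.6f}  relu={relu(x):.6f}")
```

At x = 0 Softplus returns log 2 instead of 0, and for large |x| it becomes practically indistinguishable from ReLU.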

3. Exponential Linear Units

1. ELU (Exponential Linear Unit) Function

ELU (Exponential Linear Unit) is a non-linear activation function that improves learning speed and helps reduce the vanishing gradient problem. It behaves like ReLU for positive inputs but allows smooth negative values.

f(x)=\begin{cases}x, & x>0 \\\alpha (e^{x}-1), & x \le 0\end{cases}

  • Output range is (-\alpha, \infty)
  • Introduces non-linearity for learning complex patterns
  • Allows negative outputs, helping maintain zero-centered activations
  • Smooth for negative inputs and differentiable everywhere when \alpha = 1, supporting stable training
ELU (Exponential Linear Unit) Function
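
A direct translation of the ELU definition is sketched below. The default α = 1.0 is an assumption made for illustration, and `math.expm1` is used because it computes e^x − 1 accurately for small x.

```python
import math

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

for x in (-50.0, -2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+6.1f}  elu={elu(x):+.6f}")
```

Positive inputs pass through unchanged, while negative inputs level off towards −α instead of being cut to zero.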

2. SELU (Scaled Exponential Linear Unit) Function

SELU is a scaled version of ELU designed for self-normalizing neural networks, helping maintain stable activations during training.

f(x)=\lambda \begin{cases} x, & x>0 \\\alpha (e^{x}-1), & x \le 0\end{cases}

where λ ≈ 1.05 (scaling factor) and α ≈ 1.67

  • Output range is (-\lambda\alpha, \infty)
  • Keeps activations near zero mean and unit variance (self-normalizing) when inputs and weights are suitably normalized
  • Helps prevent vanishing and exploding gradients
  • Works well in deep fully connected networks
  • Can reduce the need for batch normalization in some cases
SELU (Scaled Exponential Linear Unit) Function
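
The self-normalizing property can be illustrated with a small experiment: feed standard-normal samples through SELU and measure the mean and variance of the result. The sketch uses the commonly quoted higher-precision values of the constants (the text rounds them to 1.05 and 1.67); the sample size and seed are arbitrary choices.

```python
import math
import random
import statistics

# Higher-precision forms of lambda and alpha; rounded to 1.05 and 1.67 in the text.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise."""
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * math.expm1(x))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
activated = [selu(x) for x in samples]

mean = statistics.fmean(activated)
variance = statistics.fmean((y - mean) ** 2 for y in activated)
print(f"mean={mean:.3f}  variance={variance:.3f}")
```

For standard-normal inputs the output mean stays close to 0 and the variance close to 1. This single-layer check only hints at the behaviour in a full network, where weight initialization also matters.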

4. Output Layer Activation Functions

1. Sigmoid Activation Function

The sigmoid function produces an S-shaped curve and maps input values into a probability-like range between 0 and 1. In the output layer, it produces the final prediction for binary classification problems. It is defined as:

\sigma(x)=\frac{1}{1+e^{-x}}

  • Output range is (0,1)
  • Produces probability-like outputs
  • Commonly used in the output layer for binary classification
  • Smooth and differentiable, useful for gradient-based learning
Sigmoid Activation Function

2. Softmax Function

Softmax function is used for multi-class classification and converts raw output scores into probabilities for each class.

  • Transforms outputs into values between 0 and 1
  • Ensures all probabilities sum to 1
  • Highlights the most likely class among multiple options
  • Commonly used in the output layer for multi-class classification
  • Helps interpret model outputs as probabilities
Softmax Activation Function
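
The section gives no formula, so for reference the standard definition is softmax(z)_i = e^{z_i} / \sum_j e^{z_j}. The sketch below implements it with the usual trick of subtracting the largest score first, which leaves the result unchanged but avoids overflow; the example scores are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                             # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits) for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], sum(probs))

# Large scores would overflow math.exp without the shift.
large = softmax([1000.0, 1001.0, 1002.0])
print([round(p, 3) for p in large])
```

Every output lies between 0 and 1, the outputs sum to 1, and the class with the highest raw score receives the highest probability.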

Impact of Activation Functions on Model Performance

Activation functions play a key role in how efficiently a neural network learns and performs across different tasks.

  1. ReLU speeds up training by mitigating the vanishing gradient problem, while Sigmoid and Tanh saturate for large |x| and can slow down convergence in deep networks
  2. ReLU has a gradient of 1 for positive inputs, which preserves gradient flow so deeper layers can learn effectively, whereas the Sigmoid gradient never exceeds 0.25 and shrinks further when multiplied across layers
  3. Softmax enables handling of multi-class classification problems, while functions like ReLU or Leaky ReLU are commonly used in hidden layers for efficient learning
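
The effect of depth on gradients can be illustrated with a deliberately simplified model: multiply the activation's local derivative once per layer, ignoring the weights. This is a toy assumption, not a full backpropagation; the helper `chained_gradient` is hypothetical, and the sigmoid is evaluated at 0, where its gradient is largest.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU for x != 0 (0 is used at x == 0)."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, depth):
    """Product of `depth` identical local derivatives (all weights taken as 1)."""
    return grad_fn(pre_activation) ** depth

for depth in (1, 5, 10, 20):
    s = chained_gradient(sigmoid_grad, 0.0, depth)
    r = chained_gradient(relu_grad, 1.0, depth)
    print(f"depth={depth:2d}  sigmoid chain={s:.3e}  relu chain={r:.1f}")
```

Even in the sigmoid's best case the factor shrinks by 4x per layer, while an active ReLU passes the gradient through unchanged.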