
Understanding Deep Learning Basics

Deep learning is a subset of artificial intelligence that utilizes neural networks to process information similarly to the human brain, enabling the identification of complex patterns in data. It involves multiple layers of artificial neurons, with processes like forward and backward propagation to train models, and employs various techniques such as activation functions, loss functions, and regularization to optimize performance. Key concepts include the architecture of neural networks, gradient descent methods, and evaluation metrics for assessing model accuracy and generalization.


DEEP LEARNING

Deep learning is a powerful branch of artificial intelligence that mimics the way the human brain
processes information. At its core, deep learning uses structures called neural networks, which are
inspired by the biological neurons in our brain. Just as our brain is made up of billions of
interconnected neurons that fire signals to recognize patterns—like a face or a voice—deep learning
models are made up of layers of artificial neurons that learn to identify patterns in data. These models
are capable of learning directly from raw inputs, gradually transforming them into meaningful
representations through a series of processing layers. Over time, as the model is trained on data, it
strengthens the connections between its artificial neurons, much like how the brain strengthens
synapses when we learn something new. This ability to learn hierarchical features makes deep learning
especially powerful in solving complex problems such as image recognition, language translation, and
speech understanding. The "deep" in deep learning refers to the multiple layers through which data
passes and gets refined, leading to highly accurate outcomes. This biologically inspired design has
allowed machines to learn and adapt in ways that were previously thought to be unique to humans.
1. Neural Network Architecture
 Input Layer:

o Purpose: This is the entry point for data into the network. It does not perform any
computation.

o Structure: It consists of one neuron for each feature in the input dataset. For example, a
dataset with 10 features would have an input layer with 10 neurons. Its role is to pass the
initial data to the first hidden layer.

 Hidden Layers:

o Purpose: These are the computational engines of the network, responsible for learning the
complex patterns in the data. They extract increasingly abstract features from the input as
data flows through them.

o Structure: A network can have one or more hidden layers. The number of layers (depth)
and the number of neurons per layer (width) define the model's capacity. Each neuron in a
hidden layer is connected to all neurons in the previous layer.

 Output Layer:

o Purpose: This layer produces the final prediction of the model.

o Structure: The number of neurons and the activation function in the output layer are
determined by the task. For regression, it's typically one neuron with a linear activation.
For binary classification, one neuron with a sigmoid activation. For multi-class
classification, it's one neuron per class with a softmax activation.

 Role of Weights and Biases:

o Weights: These are learnable parameters that represent the strength of the connection
between neurons. A higher weight means a stronger influence from the input neuron.
During training, the network adjusts these weights to minimize the prediction error.

o Biases: A bias is another learnable parameter associated with each neuron (except in the
input layer). It allows for shifting the activation function to the left or right, providing the
model with more flexibility to fit the data. It essentially provides a trainable constant to the
neuron's input.
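As a concrete sketch of this architecture, the layers above can be represented as plain NumPy arrays. The input size (10 features) follows the text's example; the hidden width of 4 and the single output neuron are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: 10 input features (as in the text's example); the hidden width (4)
# and the single output neuron are illustrative assumptions.
n_in, n_hidden, n_out = 10, 4, 1

# Each fully connected layer owns a learnable weight matrix and a bias vector.
W1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

# Total learnable parameters: 10*4 weights + 4 biases in the hidden layer,
# plus 4*1 weights + 1 bias in the output layer.
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 49
```

Counting parameters this way makes the "width times depth" notion of model capacity tangible.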
2. Forward Propagation
 Conceptual Flow: Forward propagation is the process of passing input data through the
network, layer by layer, to generate an output. It is a sequence of linear transformations followed
by non-linear activations.

 Matrix Operations:

o At each layer, the process starts with a matrix multiplication. The input vector (or the
output from the previous layer) is multiplied by the weight matrix of the current layer. This
operation scales and combines the inputs.

o This dot product effectively computes a weighted sum of the inputs for each neuron in the
current layer.

 Pre-activation and Activation Values:

o Pre-activation (Z): This is the intermediate value calculated for each neuron. It is the
result of the weighted sum of inputs from the previous layer plus the neuron's bias term. It
represents the linear part of the neuron's computation.

o Activation (A): This is the final output of the neuron. It is obtained by applying a non-
linear activation function to the pre-activation value (Z). This non-linearity is crucial, as it
allows the network to learn complex, non-linear relationships in the data. The activation
value (A) is then passed as input to the next layer.
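The flow above can be written out in a few lines of NumPy. The layer sizes and the random input batch are illustrative assumptions; the structure (matrix multiply, add bias, apply non-linearity, pass on) is exactly the pre-activation/activation pattern described:

```python
import numpy as np

def relu(z):
    # Elementwise non-linearity applied to the pre-activation.
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))                     # batch of 5 examples, 10 features each

W1, b1 = rng.normal(size=(10, 4)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # output layer parameters

# Hidden layer: linear step (pre-activation Z1), then non-linearity (activation A1).
Z1 = X @ W1 + b1
A1 = relu(Z1)

# Output layer: the previous layer's activation is this layer's input.
Z2 = A1 @ W2 + b2
A2 = sigmoid(Z2)                                 # outputs in (0, 1) for a binary task

print(A2.shape)  # (5, 1)
```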
3. Activation Functions
 Purpose: To introduce non-linearity into the model. Without them, a neural network would
just be a series of linear transformations, equivalent to a single linear model.

 Common Functions & Use Cases:

o Sigmoid: Compresses any input into a range between 0 and 1. Historically used in hidden
layers but now primarily used in the output layer for binary classification tasks. Its
derivative is small for high or low inputs, leading to the vanishing gradient problem.

o Tanh (Hyperbolic Tangent): Compresses input into a range between -1 and 1. It is
zero-centered, which can help in learning. Like Sigmoid, it suffers from the vanishing
gradient problem in its saturated regions.

o ReLU (Rectified Linear Unit): Outputs the input directly if it is positive, and zero
otherwise. It is the most common activation for hidden layers due to its
computational efficiency and its ability to mitigate the vanishing gradient problem for
positive inputs. Its derivative is either 0 or 1.

o Leaky ReLU: A variant of ReLU that allows a small, non-zero gradient when the unit is
not active (i.e., for negative inputs). This helps to prevent "dying ReLU" neurons, where
neurons get stuck in a state where they always output zero.

o Softmax: Used exclusively in the output layer for multi-class classification. It
converts a vector of raw scores (logits) into a probability distribution, where each value is
between 0 and 1, and the sum of all values is 1.
 Gradient Issues:

o Vanishing Gradients: Occurs when gradients become extremely small as they are
propagated backward through the network. This is common with saturating functions like
Sigmoid and Tanh, effectively halting learning in earlier layers.

o Exploding Gradients: The opposite problem, where gradients become excessively large,
leading to unstable training. This is less common but can occur with certain weight
initializations or architectures. ReLU helps with vanishing gradients but can contribute to
exploding gradients if not managed.
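For concreteness, each of the functions above fits in a line or two of NumPy. This is a minimal sketch; the max-subtraction trick in softmax is a standard numerical-stability detail, not something from the text:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # squashes input into (0, 1)

def tanh(z):
    return np.tanh(z)                    # squashes input into (-1, 1), zero-centered

def relu(z):
    return np.maximum(0, z)              # passes positives through, zeroes negatives

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs prevents "dying ReLU" neurons.
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    e = np.exp(z - np.max(z))            # subtracting the max avoids overflow
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                           # [0. 0. 3.]
print(round(softmax(z).sum(), 6))        # 1.0
```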

4. Loss Functions
 Purpose: A loss function (or cost function) quantifies the difference between the model's
predicted output and the actual target values. The goal of training is to minimize this function.

 Regression Loss Functions:

o Mean Squared Error (MSE): Calculates the average of the squared differences between
predicted and actual values. It penalizes large errors more heavily due to the
squaring operation. It is the default choice for many regression problems.

o Mean Absolute Error (MAE): Calculates the average of the absolute differences between
predicted and actual values. It is less sensitive to outliers than MSE and provides a
more direct interpretation of the average error magnitude.

 Classification Loss Functions:

o Binary Cross-Entropy: Used for binary (two-class) classification problems. It
measures the dissimilarity between the predicted probability and the true binary label. It
works best when the output layer has a single sigmoid neuron.

o Categorical Cross-Entropy: Used for multi-class classification problems. It compares
the model's predicted probability distribution (from a softmax output layer) with the true
distribution (which is typically one-hot encoded).
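The three most common losses above can be sketched directly in NumPy. The clipping inside binary cross-entropy is a standard guard against log(0), an implementation detail rather than part of the definition:

```python
import numpy as np

def mse(y_true, y_pred):
    # Squaring penalizes large errors more heavily.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Average error magnitude, in the same units as the target.
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])
print(round(mse(y_true, y_pred), 4))   # 0.03
print(round(mae(y_true, y_pred), 4))   # 0.1667
```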
5. Backward Propagation
 Core Concept: This is the algorithm used to train the network by calculating the gradients of
the loss function with respect to each weight and bias. It propagates the error signal backward
from the output layer to the input layer.

 Chain Rule and Gradient Flow:

o Backward propagation relies on the chain rule from calculus to compute gradients
efficiently.

o It first calculates the gradient of the loss with respect to the output of the final layer.

o Then, it iteratively moves backward, layer by layer, calculating the gradient of the loss with
respect to each layer's outputs, pre-activations, weights, and biases. The chain rule allows
reusing previously computed gradients, making the process highly efficient.

 Weight Update Process:

o Once the gradients for all weights and biases are computed, they are used to update the
parameters.

o The update rule involves subtracting a fraction of the gradient (determined by the learning
rate) from the current parameter value. This moves the parameter in the direction that
most steeply decreases the loss.
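The chain rule and update step can be seen end-to-end on a single sigmoid neuron, the smallest case where backpropagation applies. The input values, learning rate, and iteration count here are illustrative assumptions; the key line is that for a sigmoid output with binary cross-entropy, the chain rule collapses to dL/dz = a - y:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, -2.0, 0.5])   # one training example (3 features, illustrative)
y = 1.0                          # its true label
w, b = np.zeros(3), 0.0          # start from zero weights and bias
lr = 0.1                         # learning rate

for _ in range(100):
    # Forward pass: pre-activation, then sigmoid activation.
    a = sigmoid(w @ x + b)
    # Backward pass: for sigmoid + binary cross-entropy, the chain rule
    # collapses to dL/dz = a - y, so dL/dw = (a - y) * x and dL/db = a - y.
    dz = a - y
    w -= lr * dz * x             # step against the gradient
    b -= lr * dz

print(sigmoid(w @ x + b))        # close to 1.0: the loss has been driven down
```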

6. Gradient Descent & Its Variants

Gradient Descent: Gradient descent is an optimization method that lets a deep learning model learn by
repeatedly adjusting its weights and biases in small steps, so that its prediction error shrinks over time.

 Batch Gradient Descent:

o Process: Computes the gradient of the loss function using the entire training dataset for a
single weight update.

o Pros: Produces a stable and direct convergence path.

o Cons: Extremely slow and memory-intensive for large datasets. Not practical for most deep
learning applications.

 Stochastic Gradient Descent (SGD):

o Process: Updates the model's weights after processing each single training example.
o Pros: Much faster computation per update. The noisy updates can help the model escape
shallow local minima.

o Cons: The convergence path is very erratic and noisy. It may never fully converge to the
absolute minimum.

 Mini-Batch Gradient Descent:

o Process: A compromise between the two extremes. It updates the weights after processing
a small batch (e.g., 32, 64, or 128 examples) of training data.

o Pros: Offers the best of both worlds: it's computationally efficient and provides a more
stable convergence than SGD. It's the standard method used in deep learning.

⚙️ How It Works (Step-by-Step):

1. Start with random weights
2. Make a prediction
3. Compare it with the actual value (calculate loss)
4. Compute the gradient (i.e., how much error changes with respect to weights)
5. Update the weights in the direction that reduces the error
6. Repeat until the error is very small (or we reach max iterations)
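The steps above, combined with mini-batch updates, can be sketched on a small synthetic regression problem. The data (y = 2x + 1 plus noise), learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 1-D regression data: y = 2*x + 1 plus a little noise (illustrative).
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0                 # step 1: start from initial weights
lr, batch_size = 0.1, 32

for epoch in range(50):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one mini-batch
        xb, yb = X[batch, 0], y[batch]
        pred = w * xb + b                      # step 2: make a prediction
        err = pred - yb                        # step 3: compare with the target
        dw = 2 * np.mean(err * xb)             # step 4: gradient of MSE w.r.t. w
        db = 2 * np.mean(err)                  #         and w.r.t. b
        w -= lr * dw                           # step 5: update against the gradient
        b -= lr * db                           # step 6: repeat over epochs

print(round(w, 1), round(b, 1))  # recovers values near the true 2.0 and 1.0
```

Swapping the batch size to 1 gives SGD; swapping it to len(X) gives batch gradient descent.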
7. Training, Validation, and Testing Sets
 Dataset Splits Explained:

o Training Set: The largest portion of the data, used to train the model by adjusting its
weights and biases.

o Validation Set: A separate subset used to evaluate the model's performance during
training. It helps in tuning hyperparameters (like learning rate or model architecture) and
provides a check for overfitting. The model does not learn from this data.

o Test Set: A completely unseen subset of data that is used only once, after all training and
hyperparameter tuning is complete. It provides an unbiased estimate of the final model's
performance on new, real-world data.

 Typical Ratios: Common splits include 70% for training, 15% for validation, and 15% for
testing (70/15/15), or 80/10/10. For very large datasets, the validation and test sets can be a
smaller percentage (e.g., 98/1/1).
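A 70/15/15 split can be done by shuffling indices once and slicing, as a minimal sketch (dataset size is illustrative; in practice a library helper such as scikit-learn's train_test_split is often used instead):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
indices = rng.permutation(n)               # shuffle once before splitting

# 70/15/15 split by index ranges.
n_train, n_val = round(0.70 * n), round(0.15 * n)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```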
Step-by-Step Workflow of Training a Neural Network
8. Model Evaluation Metrics
 For Regression:

o R-squared (R²): Indicates the proportion of the variance in the dependent variable that
is predictable from the independent variables. A value closer to 1 is better.

o Mean Squared Error (MSE): The average of the squared errors. Useful for penalizing
larger errors. (Preferred)

o Root Mean Squared Error (RMSE): The square root of MSE. It is in the same units as
the target variable, making it more interpretable.

o Mean Absolute Error (MAE): The average of the absolute errors. It is robust to outliers
and also in the same units as the target variable.

 For Classification:

o Accuracy: The ratio of correctly predicted instances to the total instances. Can be
misleading for imbalanced datasets. (Preferred)

o Precision: Measures the accuracy of positive predictions. Answers the question: "Of all
instances predicted as positive, how many were actually positive?"

o Recall (Sensitivity): Measures the model's ability to find all the actual positive
instances. Answers the question: "Of all actual positive instances, how many did the model
correctly identify?"

o F1-Score: The harmonic mean of Precision and Recall. It provides a single score that
balances both metrics, which is useful when there is an uneven class distribution.
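The classification metrics above all derive from the four confusion-matrix counts, as this sketch shows (the example labels are made up for illustration):

```python
def classification_metrics(y_true, y_pred):
    # Confusion-matrix counts, treating label 1 as the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return accuracy, precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```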
9. Underfitting & Overfitting

 Identifying Underfitting (High Bias):

o The model is too simple to capture the underlying patterns in the data.

o Symptoms: Both the training loss and the validation loss are high and plateau at a high
value. The model performs poorly on both the training set and the validation set.

 Identifying Overfitting (High Variance):

o The model has learned the training data too well, including its noise, and fails to generalize
to new, unseen data.

o Symptoms: The training loss continues to decrease to a very low value, while the validation
loss starts to increase after a certain point. There is a large and growing gap between the
training and validation loss curves.

 Diagnostic Plots:

o Plotting the training loss and validation loss over epochs is the primary way to diagnose
these issues.

o Good Fit: Both curves converge to a low value, and the gap between them is minimal.

o Overfitting: The training curve goes down, while the validation curve goes down and then
starts to go up.
o Underfitting: Both curves flatten out at a high loss value.
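The diagnostic logic above can be captured in a small rule-of-thumb function. The thresholds here are arbitrary illustrative choices, not values from the text; real diagnosis is done by inspecting the loss curves themselves:

```python
def diagnose_fit(train_losses, val_losses, gap_tol=0.1, high_loss=0.5):
    # Crude diagnosis from the final loss values (illustrative thresholds).
    final_train, final_val = train_losses[-1], val_losses[-1]
    if final_val - final_train > gap_tol:
        return "overfitting"       # large gap: low train loss, high val loss
    if final_train > high_loss:
        return "underfitting"      # both losses plateau at a high value
    return "good fit"

# Train loss keeps falling while validation loss turns back up.
print(diagnose_fit([0.9, 0.4, 0.1, 0.05], [0.8, 0.5, 0.45, 0.6]))  # overfitting
```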

10. Regularization Techniques

 Purpose: Techniques used to prevent overfitting by adding a penalty for model complexity to the
loss function.

 L1 Regularization (Lasso):

o Concept: Adds a penalty proportional to the absolute value of the weights.

o Effect: It can shrink some weights to exactly zero, effectively performing automatic feature
selection by removing irrelevant features from the model. This results in a "sparse" model.

 L2 Regularization (Ridge / Weight Decay):

o Concept: Adds a penalty proportional to the square of the value of the weights.

o Effect: It forces the weights to be small but rarely shrinks them to zero. It is the most
common form of regularization and is often referred to as "weight decay" because of how it
is implemented in optimizers.

 Dropout:

o Concept: During each training iteration, it randomly sets the activations of a fraction of
neurons in a layer to zero.

o Intuition: This forces the network to learn more robust features and prevents neurons
from co-adapting too much. It's like forcing the network to be redundant, so it doesn't rely
on any single neuron.

o Implementation: It is only active during training and is turned off during
evaluation/testing.
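The L2 penalty and dropout can be sketched in NumPy. The 1/(1-rate) rescaling is the standard "inverted dropout" implementation detail, an assumption here rather than something stated in the text:

```python
import numpy as np

rng = np.random.default_rng(5)

def l2_penalty(weights, lam=0.01):
    # Adds lam * sum(w^2) to the loss, pushing all weights toward small values.
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(activations, rate=0.5, training=True):
    if not training:
        return activations           # dropout is disabled at evaluation time
    # Randomly zero a fraction `rate` of activations; scale the survivors by
    # 1/(1-rate) ("inverted dropout") so the expected activation is unchanged.
    mask = rng.random(activations.shape) > rate
    return activations * mask / (1 - rate)

a = np.ones((4, 8))
dropped = dropout(a, rate=0.5)               # entries are now 0.0 or 2.0
print(dropout(a, training=False).mean())     # 1.0 — unchanged at test time
```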
11. Training Optimization Techniques

 Learning Rate Scheduling:

o Concept: A strategy to adjust the learning rate during training. A common approach is to
start with a higher learning rate for faster initial progress and then gradually decrease it to
allow the model to settle into a good minimum. Examples include step decay, exponential
decay, or adaptive methods.

 Advanced Optimizers:

o Momentum: Helps accelerate SGD in the correct direction by adding a fraction of the
previous weight update to the current one. This helps to smooth out the noisy updates of
SGD.

o RMSProp: Maintains a moving average of the squared gradients for each weight and
divides the learning rate by this average. This effectively adapts the learning rate for each
parameter.

o Adam (Adaptive Moment Estimation): Combines the ideas of both Momentum and
RMSProp. It stores moving averages of both the past gradients and the past squared
gradients. It is the most widely used and recommended optimizer for deep learning.

 Early Stopping:

o Concept: A form of regularization that stops training when the model's performance on the
validation set stops improving.

o Logic: Monitor the validation loss at the end of each epoch. If the validation loss does not
improve for a specified number of consecutive epochs (the "patience"), stop the training
process and save the model from the epoch with the best validation loss.
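The patience logic described above reduces to a short loop. The validation-loss sequence here is made up to show the typical shape (improvement, then a sustained rise once overfitting sets in):

```python
def train_with_early_stopping(epoch_val_losses, patience=3):
    # Stop once the validation loss has failed to improve for `patience`
    # consecutive epochs; report the epoch with the best validation loss.
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, val_loss in enumerate(epoch_val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                 # patience exhausted: stop training
    return best_epoch, best_loss

# Validation loss improves through epoch 3, then starts rising.
losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.61, 0.65]
print(train_with_early_stopping(losses, patience=3))  # (3, 0.55)
```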
12. Model Training Lifecycle

 Key Terms:

o Epoch: One complete pass of the entire training dataset through the network.

o Batch: A small subset of the training dataset.

o Iteration: A single update of the model's weights. It corresponds to processing one batch of
data. The number of iterations in one epoch is the total number of training samples
divided by the batch size.
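The iteration count per epoch is a quick calculation (the sample and batch sizes here are illustrative; the ceiling accounts for a final, partially filled batch):

```python
import math

# Iterations per epoch = ceil(number of training samples / batch size).
n_samples, batch_size = 50_000, 32
iterations_per_epoch = math.ceil(n_samples / batch_size)
print(iterations_per_epoch)  # 1563 (the last batch holds only 16 samples)
```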

 Key Hyperparameters and Tuning:

o Hyperparameters: These are settings configured before training begins, such as the
learning rate, batch size, number of epochs, number of hidden layers, number of neurons
per layer, choice of activation function, and choice of optimizer.

o Tuning: The process of finding the optimal set of hyperparameters for a model. This is
typically an empirical process involving experimentation. Techniques like Grid Search,
Random Search, or more advanced methods like Bayesian Optimization are used, with
performance evaluated on the validation set.
TensorFlow and Keras: Frameworks for Deep Learning

📌 Introduction to TensorFlow

TensorFlow is an open-source deep learning framework developed by the Google Brain Team in 2015.
It is designed to facilitate the development, training, and deployment of machine learning and deep
learning models at scale.

At its core, TensorFlow uses a computational graph approach, where each operation is represented as
a node and data flows as tensors between them. This architecture enables it to run efficiently across
multiple CPUs, GPUs, and even TPUs (Tensor Processing Units), making it highly scalable for
production environments and research alike.

TensorFlow supports both low-level APIs (for maximum control and customization) and high-level
APIs (for rapid development). One of its most powerful features is automatic differentiation, which is
essential for backpropagation in neural networks.

📌 Introduction to Keras

Keras is a high-level neural network API that was initially developed by François Chollet. Since 2017, it
has been tightly integrated into TensorFlow as tf.keras. It simplifies the process of building and
training deep learning models by abstracting much of the complexity of TensorFlow’s lower-level
operations.

Keras follows the principle of modularity and user-friendliness, allowing developers to construct
neural networks layer by layer using intuitive building blocks such as Dense, Conv2D, LSTM, etc.

TensorFlow vs Keras (Before & After TensorFlow 2.0)

Feature           Keras (standalone, pre-TF2.0)    tf.keras (Keras inside TensorFlow)
Backend           Theano, CNTK, or TensorFlow      TensorFlow only
Performance       Moderate                         Highly optimized with XLA, GPU/TPU
Integration       External                         Native and seamless
Industry Usage    Prototyping and research         Research + Production
Why Use TensorFlow and Keras?

 Ease of Use: Keras provides simple syntax for building deep learning models.
 Powerful Back-End: TensorFlow manages low-level operations efficiently, even on large
datasets and clusters.
 Pretrained Models & Tools: TensorFlow Hub, tf.data, and tf.distribute offer ready-
made models, input pipelines, and distributed training.
 Visualization: Built-in integration with TensorBoard for visualizing metrics, model graphs,
and profiling.
 Deployment Ready: TensorFlow supports serving models via TensorFlow Serving, TFLite
for mobile, and TensorFlow.js for browser environments.
 Auto-differentiation: Essential for backpropagation and gradient-based optimization.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Step 1: Define the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Step 2: Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Step 3: Train the model (X_train, y_train, X_val, y_val are the user's data arrays)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

# Step 4: Evaluate or predict
model.evaluate(X_test, y_test)
model.predict(new_data)
