Deep Learning - A Beginners' Guide - Dulani Meedeniya - 1, 2023 - Chapman and Hall - CRC - 103247324X - Anna's Archive
This book focuses on deep learning (DL), which is an important aspect of data science,
that includes predictive modeling. DL applications are widely used in domains such
as finance, transport, healthcare, automanufacturing, and advertising. The design of
the DL models based on artificial neural networks is influenced by the structure and
operation of the brain. This book presents a comprehensive resource for those who
seek a solid grasp of the techniques in DL.
Key features:
Accordingly, the book covers the entire process flow of deep learning by providing
awareness of each of the widely used models. This book can be used as a beginners’
guide where the user can understand the associated concepts and techniques. This
book will be a useful resource for undergraduate and postgraduate students, engin-
eers, and researchers, who are starting to learn the subject of deep learning.
Dulani Meedeniya
Designed cover image: Pdusit, Shutterstock Illustrator
First edition published 2024
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2024 Dulani Meedeniya
Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any future
reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or
contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-
8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
ISBN: 9781032473246 (hbk)
ISBN: 9781032487960 (pbk)
ISBN: 9781003390824 (ebk)
DOI: 10.1201/9781003390824
Typeset in Times
by Newgen Publishing UK
Contents
Preface
Acknowledgements
List of Abbreviations
Chapter 1 Introduction
1.1 Data-Driven Decision-Making and Society
1.2 Overview of Deep Learning
1.3 Bias and Variance
1.3.1 Skewness of Data
1.3.2 Bias
1.3.3 Variance
1.3.4 Trade-off Between Bias and Variance
1.4 Supervised and Unsupervised Learning
1.5 Supportive Tools and Libraries
1.5.1 TensorFlow
1.5.2 Keras
1.5.3 PyTorch
1.5.4 Jupyter Notebook
1.5.5 NumPy and Pandas
1.5.6 TensorFlow Hub
Review Questions
References
Index
Preface
The rapid development of digital technologies has resulted in an explosive growth
of data. Data engineering plays an essential role in many fields, including finance,
medical informatics, and social sciences. This has led to increasing demand for
professionals with knowledge and experience of data science and competence in
computer programming. However, there is still a global shortage of workers whose
skills span these areas.
Deep learning (DL) is an important and evolving area in data science that includes
statistics and predictive modeling. It is concerned with algorithms inspired by the
brain’s structure and functions known as artificial neural networks. DL can automat-
ically learn features in data, by updating learned weights at each layer. This book
provides adequate theoretical coverage of DL techniques and applications. This book
will teach deep learning concepts from scratch. We aim to make DL approachable
by teaching the concepts and theories behind DL models. Thus, practitioners can
acquire the critical thinking skills required to formulate problems and to design and develop
models that make accurate predictions and support the decision-making process. Many
academic institutions have embarked on DL education and research at various levels.
At present, DL has become a forward-looking academic discipline with a wide range
of real-world applications.
DL is extremely beneficial to data scientists in collecting, analyzing, and
interpreting large amounts of data with efficient processing. There are many
advantages associated with deep learning. For instance, DL techniques may produce
new features from a small collection of features in the training dataset without any
further human interaction. The ability to process large numbers of features makes DL
techniques very powerful when dealing with unstructured data, namely text, images,
and voice. More reliable and concise analysis results can be obtained as the prediction
process is based on historical data. In the long run, it also supports improving
prediction accuracy by learning from flaws. Although DL models can be expensive
to train, once trained, they are cost-effective to use. Moreover, these techniques are scalable,
as they can analyze large volumes of data and execute numerous calculations in a
cost- and time-effective way.
DL applications are widely used in several industries such as finance, transport,
healthcare, automanufacturing, and advertising. DL is reshaping and
enhancing living environments by delivering new possibilities to improve people’s
lives. For instance, in the healthcare domain, it helps in the early detection of cancer
cells and tumors, improves the time-consuming process of synthesizing new drugs,
and helps invent sophisticated medical instruments. Deep learning is also used in the
entertainment industry, for example by Netflix and Amazon and in film making. Netflix and Amazon use
recommender systems to provide a personalized experience to their viewers using
their show preferences, time of access, and history. Voice and audio recognition tech-
nology can be used to train a deep learning network to produce music compositions.
Google’s Wavenet and Baidu’s Deep Speech can train a computer to learn the patterns
and the statistics that are unique to the music. It can then generate a completely new
composition. Additionally, the ‘LipNet’, which is developed by Oxford and Google
scientists, could read people’s lips with 93% success. This can be used to add sounds
to silent movies. Further, in advertising, DL allows optimizing a user’s experience.
Deep learning helps publishers and advertisers to increase the relevance of the ads
and boost the advertising campaigns.
Acknowledgements
We are grateful to all who helped improve the content and offered valuable feedback.
Specifically, we thank K. T. S. De Silva, S. Dayarathna, and T. Shyamalee for their
contributions to collecting materials and designing graphics.
Abbreviations
Adaptive delta (AdaDelta)
Adaptive gradient (Adagrad)
Adaptive moment estimation (Adam)
Area under the curve (AUC)
Area under the ROC curve (AUROC)
Artificial intelligence (AI)
Artificial neural network (ANN)
Bidirectional encoder representations from transformers (BERT)
Capsule network (CapsNet)
Carlini & Wagner attack (C&W)
Convolutional neural networks (CNN)
Deep learning (DL)
Deep neural network (DNN)
Differentiable architecture search (DARTS)
Directed acyclic graph (DAG)
Efficient neural architecture search (ENAS)
Exponential moving average (EMA)
Facebook-Berkeley-Nets (FBNet)
False negative (FN)
False positive (FP)
Fast and practical neural architecture search (FPNAS)
Fast gradient sign method (FGSM)
Federated learning (FL)
Generative adversarial networks (GAN)
Geometric Mean (G–mean)
Gradient descent (GD)
Gradient-weighted class activation mapping (Grad-CAM)
Internet-of-things (IoT)
Jacobian-based saliency map attack (JSMA)
Limited-memory Broyden-Fletcher-
Goldfarb-Shanno (L-BFGS)
Logarithmic loss (Log loss)
Long short-term memory networks (LSTM)
Machine learning (ML)
Matthews correlation coefficient (MCC)
Mean absolute error (MAE)
Mean squared error (MSE)
Multilayer perceptron (MLP)
Neural architecture optimization (NAO)
Neural architecture search (NAS)
Natural language processing (NLP)
Peer-to-peer (P2P)
Principal component analysis (PCA)
Receiver operating characteristic (ROC)
Rectified linear unit (ReLU)
Recurrent neural networks (RNN)
Region of interest (ROI)
Residual network (ResNet)
Root mean square error (RMSE)
Root mean square propagation (RMSprop)
Stochastic gradient descent (SGD)
Stochastic neural architecture search (SNAS)
True negative (TN)
True positive (TP)
Vision transformer (ViT)
Visual Geometry Group (VGG)
Youden’s index (YI)
1 Introduction
DOI: 10.1201/9781003390824-1
market and understand how the new product is going to perform in that particular
market. This enables increased efficiency in market research and can immensely
support many departments, including marketing, upper management, and even the
development level to get clarity in the decision-making process. Most importantly,
analyzing the patterns and shifts or new tendencies in the market based on demo-
graphic data applies to all sorts of businesses and different industries. Therefore, the
data-driven decision-making process immensely supports organizations to determine
opportunities or potential threats, where the organization can prepare with potential
ways to tackle them.
In the applications of such data-driven decision-making, artificial intelligence (AI)
plays a significant role. Since most of the activities are carried out by computers, the
decision-making process is also supported by computational implementations more
efficiently and effectively than human intervention. Developing artificial intelligence-
enabled applications has been a topic of interest for a couple of decades. From email
spam filters to autonomous cars, AI supports a wide range of applications, which
are used to guide the human thinking process. In the early stages of artificial intelli-
gence, knowledge was used to solve problems that were difficult to solve with human
intelligence.
The usage of artificial intelligence to assist in decision-making will depend on a
collection of factors including the current issues, vision, goals, nature of the appli-
cation, and the type and quality of data to which it is exposed. This has become an
essential tool to assist in making smarter and more impactful decisions. Therefore,
decision-making benefits hugely from data, powered by the usage of
advanced AI methodologies including machine learning and deep learning.
• Artificial intelligence (AI) is the ability of machines to perform tasks that
normally require human intelligence. It can be considered as a smart application
that simulates the behavioral patterns of humans and learns without human
intervention, as in self-driving cars.
• Machine learning (ML) can be considered as a specialization of AI that consists
of a stack of tools to analyze and visualize data and to make predictions. It can learn
from data without being explicitly programmed with a set of rules. This
approach is based on training a model from datasets.
• Deep learning (DL) is a type of ML that simulates the human brain. It uses mul-
tiple layers in a deep neural network to progressively extract high-level features
from the raw input. Their ability to analyze more complex relationships makes
them particularly useful for modeling a wide variety of real-world problems.
also has an accurate and automated feature engineering process by eliminating many
boundaries in areas such as natural language processing, computer vision, and speech
recognition.
With the overview of the deep learning concept, it is important to identify the
data processing stages to produce insights and predictions to obtain the outcome in
practice. As shown in Figure 1.4, the life cycle of a deep learning process mainly
consists of data acquisition, preprocessing, training, testing, evaluation, deployment,
and monitoring. We learnt that the overall effectiveness of the model depends mainly
on the data. Generally, data can be private or public and collected using surveys or
experiments. With the availability of data, a data scientist needs to understand the
data by exploring the structure, relevance, type, and suitability of data. Although data
preparation is a time-consuming task, it plays an important role in the life cycle, to
derive new features from the existing data. Thus, exploratory data analysis is required
to identify the affecting factors using the data distribution between different feature
variables before the actual model design. Accordingly, in the data flow pipeline, the
data should be preprocessed to clean, remove outliers, manage missing data, nor-
malize, and augment before feeding into the training model. Considering the deep
learning pipeline, the feature engineering techniques are aligned with the data
preprocessing to extract and identify informative features of the data.
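The preprocessing steps named above can be sketched for a single numeric feature as follows. This is a minimal illustration under stated assumptions, not a prescribed pipeline: the function name `preprocess`, the median imputation, the 1.5 × IQR rule (which clips outliers rather than removing them), and the min-max scaling are all choices made here for the example.

```python
import numpy as np

def preprocess(x):
    """Impute missing values, clip outliers, and min-max normalize a 1-D feature."""
    x = np.asarray(x, dtype=float).copy()
    # Manage missing data: replace NaN with the median of the observed values.
    x[np.isnan(x)] = np.nanmedian(x)
    # Handle outliers: clip to 1.5 * IQR beyond the quartiles (a common rule of thumb).
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    x = np.clip(x, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Normalize to the range [0, 1].
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

raw = [2.0, 3.0, np.nan, 4.0, 5.0, 100.0]  # hypothetical feature values
clean = preprocess(raw)
print(clean)
```

In practice each step would be fitted on the training set only and then applied unchanged to the test set, so that no information leaks from the test data.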
The training algorithm can be selected considering different factors such as the
type of data, nature of the application, and resource availability. The model training
and testing processes involve tuning the hyperparameters and applying
optimizations to generate better results. These models should be designed to learn the
data and perform well on new data as well, by ensuring the balance between performance
and generalizability. Once the model is evaluated by testing on unseen data, the
modeling process should be reiterated until the desired level of metrics is achieved.
A detailed description of these concepts and techniques is given in later chapters
of this book. Once the final model is deployed, the application is monitored for further
improvements. In practice, several tools and frameworks are available
to ease the processes in the deep learning life cycle.
Deep learning is still evolving with novel ideas of big data processing with artifi-
cial intelligence. Therefore, you need to better understand deep learning techniques
and their key concepts for the development of innovative applications. This book
will provide you with the theoretical background on basic deep learning techniques,
neural networks, deep learning models, types of deep learning approaches, architec-
tural enhancements, and evaluation techniques.
In general, a function whose predicted output is close to the actual value of
$\theta$ is considered a good estimator. It is important to review the properties of these
estimators. The following sections describe the bias and variance. A correct balance
between bias and variance is important to generate accurate predictions.
1.3.1 Skewness of Data
Skewness determines how far a random variable’s probability distribution deviates
from the normal distribution (probability distribution without any skewness), as
shown in Figure 1.5. Also, skewness indicates the direction of outliers.
For example, let us consider the case of positively skewed data, where a large number
of data instances have small values. This results in better training performance
at predicting data instances with lower values. Thus, there is a ‘bias’ towards lower
values in this scenario. Considering the direction in this case, most of the outliers
appear on the right side of the distribution. Thus, there is a variance in the data.
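As a small numerical illustration, skewness can be measured as the third standardized moment of a sample. The sketch below assumes NumPy and uses synthetic data; the helper name `skewness` is introduced here for the example. A value near zero indicates a symmetric distribution, and a positive value indicates a long right tail.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # no skew expected
right_skewed = rng.exponential(size=10_000)  # many small values, few large outliers

print("normal:     ", skewness(symmetric))
print("exponential:", skewness(right_skewed))
# In positively skewed data the mean is pulled towards the right-hand outliers.
print("mean > median:", right_skewed.mean() > np.median(right_skewed))
```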
1.3.2 Bias
The term bias can be defined as the deviation between the predicted value by the deep
learning model and the actual output or the ground truth. A high bias
indicates a large error in the output of the model. It can also indicate an
imbalance in the dataset. Generally, we expect a model to have a low bias to prevent
issues such as underfitting. Bias can be explained as a systematic
error of the training model, which skews the result in favor of or against the actual
output. Bias shows how well the model matches the dataset, as follows:
• In a high-bias situation, the dataset does not match the model.
• In a low-bias situation, the dataset fits with the training model.
The following indicators help to identify a high-bias model:
• Failure to capture the data trends
• Potential towards underfitting
• More generalized/overly simplified
• High error rate
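The bias of an estimator can be defined as in (1.1):
\[ \mathrm{Bias}(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta \tag{1.1} \]
where $\hat{\theta}_m$ is the point estimator of a parameter whose actual value is $\theta$. The term
$\mathbb{E}(\hat{\theta}_m)$ is the expectation of the estimator over the data. If $\mathrm{Bias}(\hat{\theta}_m) = 0$, the estimator $\hat{\theta}_m$ is considered
unbiased, as $\mathbb{E}(\hat{\theta}_m)$ is equal to $\theta$.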
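The definition of bias can be checked by simulation. The sketch below is an illustration chosen here, not an example from this book: it estimates the variance of a normal distribution with two estimators, one dividing by n and one dividing by n − 1, and approximates the expectation by averaging over many simulated datasets. The values of `true_var`, `n`, and `trials` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0            # theta: variance of the data-generating distribution
n, trials = 5, 200_000    # small samples make the bias visible

# Each row is one simulated dataset of n observations.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
est_biased = samples.var(axis=1, ddof=0)     # divides by n
est_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

# Bias = E(estimate) - theta, with the expectation approximated by the average.
bias_biased = est_biased.mean() - true_var       # theory: -theta / n
bias_unbiased = est_unbiased.mean() - true_var   # theory: 0
print(bias_biased, bias_unbiased)
```

The first estimator systematically underestimates the true variance, while the second has a bias close to zero, which is why it is called unbiased.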
1.3.3 Variance
Variance indicates the expected squared deviation of the observed data instances from
the average value. Thus, variance indicates the spread of the data within the sample set.
Both bias and variance of an estimator are calculated for a dataset. The variance of an estimator
measures how the estimate would vary as the dataset is independently resampled from
the underlying data-generating process. In other words, it indicates the changes in the
model with different parts of the training dataset.
Since the target function is estimated from the dataset, it is acceptable to have
some variance. However, the estimate should not vary drastically from one dataset to another;
a stable estimate indicates that the estimator is good at understanding the hidden mapping
between inputs and outputs. Variance can be considered as an indicator of the uncertainty in
the data. A high variance indicates that the estimator does not generalize to unseen
datasets. In that case, the model shows high performance on the training set
but gives high error rates on the testing set. It is good to have a relatively low variance
for an estimator.
The following indicators help to identify a high-variance model:
• Noise in the dataset
• Potential towards overfitting
• Complex models
• Trying to fit all data points closely
TABLE 1.1
Comparison of Bias and Variance
Aspect | Bias | Variance
High model complexity | Low bias | High variance
Causes | High bias results in underfitting. | High variance results in overfitting.
Feature | Low bias indicates fewer target function assumptions are made. | Low variance means that similar target functions would result from training data changes.
TABLE 1.2
Trade-off Between Bias and Variance
Train error | Very low | Relatively high | Relatively high | Very low
Test error | High | Relatively high | Very high | Low
Bias-variance | High variance | High bias | High bias and high variance | Low bias and low variance
typically result in a decrease in bias error. The variance error will rise as a result,
though, and the model may start to pick up on noise in the training set.
However, since bias and variance are inversely connected, it is hard to have a
model with a low bias and a low variance. Therefore, the trade-off between these
terms can be stated as follows.
• High-bias models will have low variance.
• High-variance models will have a low bias.
Generally, when the model is simple with few parameters, it may have high bias
and low variance. Here, the predictions of the model are stable across datasets,
but the model does not match the dataset. In contrast, if the model has many parameters,
then it will have high variance and low bias. In this situation, although the model fits
the training dataset, there are more chances of inaccurate predictions on new data. These aspects
indicate the flexibility a model needs in order to be optimal. For example, if the model
does not fit the dataset, it will have a high bias. This corresponds to an inflexible model
with low variance.
This trade-off can be addressed, and the most suitable model can be selected, by
comparing the mean squared error (MSE) of the estimators as in (1.2):
\[ \mathrm{MSE} = \mathbb{E}\big[(\hat{\theta}_m - \theta)^2\big] = \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m) \tag{1.2} \]
The estimators with a lower MSE keep both their bias and variance in an acceptable range. It should
be noted that bias and variance are linked with the capacity, overfitting, and
underfitting concepts in deep learning.
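The MSE of an estimator equals its squared bias plus its variance, and this can be verified numerically. The following sketch is an assumption-laden illustration: it compares the sample mean with a deliberately shrunk version of it (multiplied by 0.8, a factor chosen arbitrarily here) when estimating the mean of a normal distribution. The helper name `decompose` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 2.0, 10, 100_000
# Each row is one simulated dataset drawn around the true mean theta.
data = rng.normal(loc=theta, scale=1.0, size=(trials, n))

def decompose(estimates, theta):
    """Return the empirical bias, variance, and MSE of a set of estimates."""
    bias = estimates.mean() - theta
    variance = estimates.var()
    mse = np.mean((estimates - theta) ** 2)
    return bias, variance, mse

candidates = {
    "sample mean": data.mean(axis=1),
    "shrunk mean (x0.8)": 0.8 * data.mean(axis=1),
}
for name, est in candidates.items():
    bias, variance, mse = decompose(est, theta)
    print(f"{name}: bias={bias:.3f} var={variance:.3f} "
          f"mse={mse:.3f} bias^2+var={bias ** 2 + variance:.3f}")
```

The shrunk estimator trades a lower variance for a non-zero bias; comparing the two MSE values shows which side of the trade-off wins for this particular setting.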
Consider Figure 1.7, which plots the error against the model capacity. The capacity of the model
indicates its capability to fit a variety of functions. When the capacity increases,
the bias of the model tends to decrease, and the variance tends to increase. This produces
another U-shaped curve, which represents the generalization error. As capacity varies,
there is an optimal point in the graph that denotes a good balance between bias and
variance that minimizes the error. However, a learning algorithm can handle some
variance. A model that is optimally balanced between bias and variance is neither
overfitting nor underfitting.
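The relationship between capacity and error can be explored with a small experiment. The sketch below is only an illustration under assumptions made here: the polynomial degree stands in for model capacity, the data come from a noisy sine curve, and the degrees and sample sizes are arbitrary. It prints the training and test errors for each degree.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (a synthetic target function)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 6, 12):
    train_mse, test_mse = errors(degree)
    print(f"degree={degree:2d} train MSE={train_mse:.4f} test MSE={test_mse:.4f}")
```

A straight line (degree 1) cannot follow the curve and has a high error on both sets, which is the high-bias regime. The training error keeps falling as the degree grows, whereas the test error depends on how much of the noise the model has absorbed.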
A trained model with the best trade-off between bias and variance for a specific dataset
is the desired outcome. Techniques such as cross-validation, regularization, dimensionality
reduction, early stopping, and using more data help overcome bias and
variance errors. The following tasks can be applied to address the trade-off between
bias and variance.
• Increase the complexity of the model. This decreases the overall bias while
increasing the variance to an acceptable level. This aligns the model with the
training dataset without incurring significant variance errors.
• Increase the training dataset. This is the preferred method when dealing with
overfitting models. With a large dataset, users can increase the complexity without
variance errors that negatively impact the model. A learning
algorithm can generalize easily when there are many data points. However,
when the model is underfitting, that is, when it shows high bias, it is not
sensitive to the dataset, even a large one. Therefore, using a large dataset is a
feasible solution mainly for models with high variance.
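The effect of a larger training dataset on a flexible model can be sketched as follows. This is a hedged illustration with assumptions made here: a degree-9 polynomial plays the role of the complex model, the target is a noisy sine curve, and the helper name `avg_test_mse` and the sample sizes are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_test_mse(n_train, degree=9, repeats=20):
    """Average test MSE of a polynomial fitted to n_train noisy points."""
    x_test = np.linspace(-1.0, 1.0, 200)
    y_true = np.sin(np.pi * x_test)
    results = []
    for _ in range(repeats):
        # Draw a fresh training set each time to average out sampling luck.
        x = rng.uniform(-1.0, 1.0, size=n_train)
        y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n_train)
        coeffs = np.polyfit(x, y, degree)
        results.append(np.mean((np.polyval(coeffs, x_test) - y_true) ** 2))
    return float(np.mean(results))

small = avg_test_mse(15)     # few data points: the flexible model overfits
large = avg_test_mse(1500)   # many data points: the same model generalizes
print(small, large)
```

The model complexity is identical in both runs; only the amount of training data changes, which isolates the effect of dataset size on the variance error.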
Some of the real-world examples that use unsupervised learning are listed as follows.
• Group a set of random photos into landscape photos, pictures of dogs and
cats, babies and mountain peaks, etc. This is known as clustering in machine
learning terminology.
• Find a group of small numbers of parameters that can be used to explain the
data. This extracts the most important features from the dataset, which explain
the dataset the best. This procedure is known as principal component analysis
in machine learning terminology.
• Identify the functional proteins that most affect cancer diagnosis.
• Identify the patterns in financial fraud activities.
• Identify the important minimum set of dimensions in magnetic resonance
imaging data.
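The idea of explaining a dataset with a small number of parameters can be sketched with principal component analysis computed from a singular value decomposition. The example below assumes NumPy and a synthetic three-dimensional dataset that varies mostly along one direction; the data and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in 3-D that mostly vary along one direction.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + rng.normal(scale=0.1, size=(200, 3))

# Center the data, then take the SVD; the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S ** 2 / np.sum(S ** 2)   # share of variance per component
Z = Xc @ Vt[:1].T                     # project onto the first component

print("explained variance ratio:", explained)
print("reduced shape:", Z.shape)
```

No labels are used anywhere in this procedure, which is what makes it an unsupervised technique.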
1.5.1 TensorFlow
TensorFlow is an open-source platform for building large-scale machine learning
applications using different libraries, tools, and resources. Initially, this framework
was developed by Google for internal usage and later released publicly as an end-to-end
open-source machine learning platform. Among several functionalities, it
mainly supports training and inference of deep neural networks. Since there
is a large amount of data to process using complex algorithms, it is required to store
data compactly and feed it to the neural network. Tensors provide a convenient way to
represent data as an n-dimensional array, such as a vector or a matrix. Since tensors hold data
in known shapes, the shape of the data can be identified from the dimensions
of the array. After storing data in tensors, the relevant computations can be
performed in the form of a graph. Tensors can be derived either from input data
or as the result of an operation carried out inside the graph. The input
data goes into the graph at one end, flows through various operations, and
comes out at the other end as an output. The operations in the graph are known as
op nodes, which are connected by tensors as edges. These graphs can
run on multiple CPUs, GPUs, or mobile operating systems. Accordingly, some of the
benefits of using TensorFlow are that it is open-source and platform-independent,
trains on CPU and GPU, is highly flexible, and provides automatic differentiation and
management of threads and asynchronous computation.
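A minimal sketch of tensors and graph execution is given below, assuming TensorFlow 2.x is installed. The function name `affine` and the values are chosen here for illustration only. The `tf.function` decorator traces the Python function into a graph whose operations are connected by tensors.

```python
import tensorflow as tf

# Tensors are n-dimensional arrays with a known shape and data type.
scalar = tf.constant(3.0)                        # shape ()
vector = tf.constant([1.0, 2.0, 3.0])            # shape (3,)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)

@tf.function  # traces the Python function into a TensorFlow graph
def affine(x, w, b):
    # Two operations (matmul and add) linked by an intermediate tensor.
    return tf.matmul(x, w) + b

w = tf.constant([[1.0], [1.0]])
b = tf.constant([0.5])
out = affine(matrix, w, b)

print(scalar.shape, vector.shape, matrix.shape, out.shape)
print(out.numpy())
```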
1.5.2 Keras
Keras is an open-source deep learning API written in Python that runs on top of machine
learning platforms. It provides an interface to solve complex learning problems
with deep learning techniques. This API acts as a high-level wrapper to create
deep learning models, define their layers, and compile models. However, it does
not itself handle low-level operations such as generating computational graphs and creating
tensors and sessions; these are delegated to the backend. Keras has supported multiple
backends for the computation such as
TensorFlow, Theano, CNTK, and PlaidML. TensorFlow uses Keras as its official
high-level API, supporting many in-built modules to build neural networks.
Keras offers a simple API that reduces the complexity of making neural network
models and allows you to implement them with a simple set of functions. Since
Keras supports multiple backends, a given backend can be selected depending
on the requirements. When using TensorFlow with the Keras API, we can easily create
customized workflows based on the requirements. Also, Keras is much easier to learn as
it provides a Python frontend with a high level of abstraction. Keras models can be deployed
on devices like iOS, Android, Raspberry Pi, Cloud Engines, or Web Browsers with
.js support. Also, Keras runs on both GPU and CPU, with the support of in-built data
parallelism to process large data volumes for model training. Therefore, it can be
used easily and efficiently as a high-level API to create deep learning networks.
Let us learn the main steps in creating a simple Keras model. The basic elements
of Keras are models and layers. Initially, we need to define a network by adding layers
to support data flow in the selected model type. There are two types of models namely
sequential and functional. Then we need to define the loss function, optimizer, and
the other metrics to calculate the model accuracy, and compile the model to convert
it into a machine-understandable format. Next, the model can be trained, evaluated, and
used to predict the results. Additionally, the Keras functional API can be used to build arbitrary
graphs of layers or develop models in complex architectures.
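The steps described above can be sketched with a small sequential model. This example assumes TensorFlow 2.x with its bundled Keras API; the toy dataset, the layer sizes, and the training settings are hypothetical values chosen only to keep the example short.

```python
import numpy as np
from tensorflow import keras

# Hypothetical toy data: 200 samples, 4 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")

# 1. Define the network by stacking layers in a sequential model.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# 2. Choose the loss function, optimizer, and metrics, then compile.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 3. Train on the first 160 samples, then evaluate and predict on the rest.
model.fit(X[:160], y[:160], epochs=5, batch_size=16, verbose=0)
loss, accuracy = model.evaluate(X[160:], y[160:], verbose=0)
probabilities = model.predict(X[160:], verbose=0)

print("test loss:", loss, "test accuracy:", accuracy)
print("prediction shape:", probabilities.shape)
```

The same three stages (define, compile, then fit, evaluate, and predict) apply when the functional API is used; only the way the layers are connected changes.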
1.5.3 PyTorch
PyTorch is a Python-based open-source machine learning framework. It uses an
optimized tensor library for deep learning using GPUs and CPUs. One of the main
high-level features of PyTorch is its dynamic computational graph based on automatic
differentiation. In contrast to the graph mode of TensorFlow 1.x, where we need to first define the
entire computational graph before running the model, PyTorch allows us to define
the graph dynamically. The PyTorch library is designed for efficient use by
building the graph as the model runs. Since developers can dynamically change
the behavior of the graph, many find it easier to use than TensorFlow. Also, PyTorch enables
GPU-accelerated tensor computations and effective data parallelism. However,
compared to TensorFlow, PyTorch provides limited visualizations during the training
process.
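The dynamic graph can be illustrated with a short sketch, assuming PyTorch is installed. The function `f` is a hypothetical example: ordinary Python control flow decides which operations are recorded on each call, and automatic differentiation follows whichever branch was actually executed.

```python
import torch

def f(x):
    # The branch taken depends on the data, so the recorded graph
    # can differ from one call to the next.
    if x.sum() > 0:
        return (x ** 2).sum()
    return (3 * x).sum()

x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
f(x_pos).backward()
print(x_pos.grad)   # gradient of sum(x^2) is 2x

x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)
f(x_neg).backward()
print(x_neg.grad)   # gradient of sum(3x) is 3 for every element
```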
1.5.4 Jupyter Notebook
Jupyter Notebook is an open-source web application. Developers use this environment
to create and share documents with source code, text, and visualizations. It
helps to perform end-to-end workflows in data science such as data preprocessing,
model building, model training, data visualization, and many other related tasks.
Jupyter notebooks can be used to write code in independent cells and execute them individually.
Therefore, this allows testing specific blocks of code without executing the
entire script, as is required in many other IDEs. This is a flexible and interactive platform
that is widely used in data science.
1.5.6 TensorFlow Hub
TensorFlow Hub provides a repository of pretrained models as off-the-shelf models
to be used in machine learning tasks. These models can be fine-tuned to build
real-world applications and used for learning purposes with a few lines of code. The models in
this repository support a variety of applications, such as pattern recognition, object
detection, audio processing, and natural language processing.
REVIEW QUESTIONS
1. What are the advantages of using deep learning based applications in the
real world?
2. Explain the importance of balancing bias and variance.
3. What are the problems of having high-dimensional data, and what are the possible
approaches to address those issues?
4. What aspects need to be considered when selecting the correct support tool or
support library when solving a learning problem?
2 Concepts and Terminology
DOI: 10.1201/9781003390824-2
The ImageNet challenge was the next key milestone in the development of artifi-
cial neural networks. Deep neural network (DNN) architectures were able to learn
new features and carry out classification tasks. In 2012, the model AlexNet won
the ImageNet challenge, paving the way for neural architectures. Since then, several
neural architectures have been developed to address different kinds of problems,
including image classification, video classification, feature identification, and image
reconstruction.
If we dive into the mathematical approach in ANNs, we learn about the important
theory of logistic regression, which is considered as the basic theoretical approach in
ANNs. In general, regression is a set of methods to model relationships between inde-
pendent variables and dependent variables. In real-world problems using machine
learning, the independent variables are known as inputs and the dependent variable is
known as the output. Therefore, it is required to find a mapping between inputs and
outputs known as predictions.
Consider an online shopping website that suggests items to buy based on the
customers’ details and their buying patterns. A model to recommend items can be
developed by using a dataset consisting of many variables, such as the customers’
historical purchases, personal data such as age, sex, and location, and user data such
as views and likes. In deep learning terminology, the dataset is divided into the
training set and the test set. The data corresponding to one record is known as a data
instance. The model is trained to predict an item to buy, which is called the label (or
target). All the independent variables used to produce the prediction are known as
features.
2.2 REGRESSION
Although this book does not focus on regression techniques in detail, this section
gives an overview of regression for context. Regression analysis is a method that
identifies the relationship between the independent variables and the dependent
variable in the dataset. The regression technique differs depending on whether this
relationship is linear or non-linear, and the target variable is always a continuous
value in regression. Regression is mainly used to forecast trends, to identify the
strength of predictors, and to analyze time series. Different regression techniques,
such as linear regression and logistic regression, are used based on the nature of the
data.
2.2.1 Linear Regression
Generally, linear regression learns the linear relationship between the features and the
target. It is a supervised learning algorithm used for the predictive analysis of
continuous values, and therefore cannot learn complex non-linear relationships.
For example, consider a set of data points. Linear regression finds the line that fits
the data points best. Thus, the output for a new data point can be predicted from its
position along the best-fit line. In this way, it solves linear problems by predicting a
continuous dependent variable using one or many independent variables, as shown in
Figure 2.4.
Consider an input vector x ∈ R^n and an output value y ∈ R. The output is a linear
function of the input such that y = mX + c + e, where y, m, X, c, and e denote the
target, gradient, predictor, bias, and error, respectively. We get the best-fit line with
the lowest prediction error by changing the m and c values. However, linear regression
does not suit learning complex non-linear relationships. Additionally, it is sensitive to
outliers, and it works well for small-size datasets.
Let us learn the basic principles of linear regression using mathematical representations.
Let the model's predicted value be ŷ and the actual output be y. We can define the output
to be ŷ = w^T x, where w ∈ R^n is a vector of parameters. These parameters are a set of
values that control the behavior of the system. In this case, w is defined as a set of
weights (wi) that determine the effect of each feature (xi) on the final prediction. If
the feature xi receives a positive weight wi, an increase in the feature xi increases the
value of the final prediction ŷ from the model. Similarly, if the feature receives a
negative weight, an increase in the feature decreases the value of the prediction. If the
weight is zero, the feature does not affect the final prediction. Linear regression is
often used with an intercept term b, known as the bias, so that ŷ = w^T x + b. The name
reflects that the prediction is biased towards b in the absence of any input value. The
basic principles of linear regression can be used to develop complicated and advanced
learning algorithms.
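As a minimal sketch of the best-fit line described above, the following code fits
y = mX + c to a single feature using the closed-form least-squares solution. The
function names and the toy data are illustrative assumptions and do not come from the
book.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Gradient m: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Bias c: makes the line pass through the point of means.
    c = mean_y - m * mean_x
    return m, c


def predict(x, m, c):
    """Prediction for a new data point along the fitted line."""
    return m * x + c


# Toy data (assumed for illustration) generated from y = 2x + 1 with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(m, c, predict(5.0, m, c))
```

With several features, the same idea generalizes to the weight vector w and bias b
discussed above, usually solved with a linear algebra library or gradient descent.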
2.2.2 Logistic Regression
We discussed linear regression, which predicts continuous-valued quantities as a
linear function of independent variables. However, it is not a good solution for
predicting binary-valued labels. Logistic regression addresses this by predicting the
probability that a given example belongs to a particular class, using the sigmoid
function to represent a probability distribution over a binary value. Thus, a neural
network with no hidden layer and a sigmoid activation in the output neuron is equivalent
to logistic regression. This is a machine learning method that uses the concepts
of probability in classification. It is used when the target variable is discrete, that
is, 0 or 1, and the sigmoid function denotes the relationship between the predictor and
the target variable, as shown in Figure 2.5.
When logistic regression is extended to more than two classes, we use the Softmax
activation function; this is known as Softmax regression or multinomial logistic
regression. Thus, Softmax is often used as the output of classifiers to represent a
probability distribution over multiple classes. Logistic regression is typically used
when the dataset is large. Also, the independent features should not be strongly
correlated with one another.
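The following sketch illustrates how logistic regression turns a weighted sum into a
class probability with the sigmoid function. The weights, bias, and the 0.5 decision
threshold are hypothetical values chosen for illustration; in practice they are learned
from data.

```python
import math


def sigmoid(z):
    """Map any real value to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability that x belongs to class 1: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def predict_label(x, w, b, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold (assumed 0.5)."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical parameters for a model with two features.
w = [1.5, -2.0]
b = 0.5
print(predict_proba([2.0, 0.5], w, b), predict_label([2.0, 0.5], w, b))
print(predict_proba([0.0, 2.0], w, b), predict_label([0.0, 2.0], w, b))
```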
the overfitting in linear regression and reduce the least-squares error. These are used
when there is more correlation between predictors and the target variables. The L1
and L2 regularization methods are further discussed in Section 2.5.5. Generally, L1
regularization is used when there are many features, because it produces sparse
solutions. It performs both regularization and feature selection. When the independent
variables are highly collinear, L1 regularization tends to keep only one of them and
shrinks the coefficients of the others to zero, thus reducing overfitting.
In polynomial regression, the relationship between the variables is modeled as an
n-th degree polynomial. It tries to obtain the best-fit curve through the data points
by reducing the mean squared error (MSE). However, a curve that passes through all
the data points can cause overfitting, where the model performs well only on the
training set, without generalizing. Bayesian regression is another technique, which
uses Bayes' theorem to identify the coefficients. Here, it finds the posterior
distribution of features
more stable than linear regression.
2.3 CLASSIFICATION
Now, let us see the difference between regression and classification. Classification
is a supervised learning task that performs predictions with labeled datasets.
Although both regression and classification are supervised prediction tasks, they
differ in the type of output and therefore in their applications. For instance,
regression techniques are used to predict continuous values such as temperature, item
price, salary, and age, whereas classification predicts discrete values such as dog or
cat, healthy or unhealthy. Figure 2.6 and Figure 2.7 show visual representations of
classification and regression.
Classification divides the dataset into different classes. As shown in Figure 2.7,
A and B are two classes. There are similar features within the class and different
features between classes. Accordingly, classification algorithms divide the dataset
into different classes based on a set of parameters. Initially, a computer program is
trained on a training dataset. Then based on the trained model, the test set is classified
into different classes. Thus, it can be stated as a mapping function that maps the
input to a discrete output. There are different types of classification, such as binary
classification, multi-class classification, and multi-label classification. A variety of
2.4 HYPERPARAMETERS
2.4.1 Overview
Hyperparameters are the variables that define the structure of the model network.
These variables specify the training parameters of a model. The hyperparameters,
such as the learning rate, are set before optimizing the weights and bias during model
training. Some other hyperparameters are the number of hidden layers, batch size,
number of epochs, and number of nodes in each layer, which are used for the training
process. We can list the steps of hyperparameter tuning as follows. Additionally,
Table 2.1 states the impact of different hyperparameters.
2.4.2 Weight Initialization
In neural networks, the weights are initialized to small random numbers, in such a
way that it prevents the activation outputs from exploding or vanishing during the
feedforward pass through the network. If either of these situations occurs, the loss
gradients become too large or too small to flow backwards in backpropagation, and
eventually the model may not converge. Here, in feedforward propagation, the input
is used to compute the intermediate function in the hidden layer, which is then used to
get the output. During backpropagation, the model weights are altered repeatedly to
bring the predicted output closer to the actual output.
TABLE 2.1
States Different Types of Hyperparameters and Their Approximate Sensitivity
These weights should not be the same and should allow different learning styles. If
the weights are the same, then every neuron in each layer will learn the same features
in the same way. Thus, the weights should have a good variance to learn new features.
The selection of the weight initialization techniques is mainly based on the type of
the dataset and the activation functions. A neural network uses matrix multiplication
as its core operation. In each layer, the layer inputs are multiplied by the weight
matrix, the activation function is applied to the resulting matrix, and the result is
passed as the input to the next layer.
We will discuss two weight initialization techniques: zero initialization and random
initialization. Zero initialization sets the bias to zero in the first step and assigns
zero to all the weights. Here, the derivative of the loss function will be the same for
every weight in the weight matrix. This makes the hidden neurons symmetric, so the
network ultimately performs no better than a linear model. Therefore, zero
initialization does not result in successful classification. Random initialization is
used in many computational problems, including search algorithms such as gradient
descent. Because it breaks this symmetry, it is better than zero initialization.
However, we cannot guarantee that the weights will not take values that are too high
or too low, which can derail the classification or learning process.
If we initialize the weights with very high values, then the term (wx + b) becomes
large in magnitude. When an activation function such as sigmoid is used, the function
then maps the values to nearly 1, where its slope is close to zero. This causes gradient
descent to progress slowly towards the minimum value, which results in increased
learning time. On the other hand, if we initialize the weights with values that are too
low, the signal shrinks as it passes through the layers, which results in a vanishing
gradient problem. However, to utilize the randomness in the search process, there is a
need for stochastic optimization algorithms.
Therefore, the following aspects should be considered when initializing the
weights of a neural network.
• The model efficiency: how long the training will take.
• How to handle the vanishing or exploding gradient problem.
The widely used weight initialization mechanisms in neural networks can be described
as follows.
1. He Initialization
Also known as Kaiming initialization, this method is used to initialize weights when
using non-linear activation functions, such as the rectified linear unit (ReLU). These
functions are discussed later in this chapter. This method calculates the weights as
random numbers with a Gaussian probability distribution (G) with a mean of 0 and a
standard deviation of sqrt(2/n), as shown in (2.2), to avoid the magnitudes of the
inputs vanishing or exploding. Here, n is the number of inputs to the node.
\[
w \sim G\left(0, \sqrt{2/n}\right) \tag{2.2}
\]
2. Xavier Initialization
This approach is also known as Glorot initialization and is used with neural
networks that use sigmoid or tanh activation functions. Similar to He initialization,
the weight initialization is based on a normal or uniform distribution with a mean of
0 and a specified standard deviation. The Xavier initialization method initializes the
weights in a way that the activation variance is the same across each layer. Gradient
exploding or vanishing can be prevented by having a constant variance. This method
calculates the weights as random numbers with a uniform probability distribution (U)
in the range from –(1/sqrt(n)) to 1/sqrt(n), as shown in (2.3), where n is the number
of inputs to the node.
\[
w \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right] \tag{2.3}
\]
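The two initialization schemes can be sketched as follows, using only the distributions
stated in the text: a Gaussian with standard deviation sqrt(2/n) for He initialization
and a uniform range of ±1/sqrt(n) for Xavier initialization. The function names, layer
sizes, and fixed seed are assumptions made for illustration; deep learning libraries
ship their own initializers, which may use slightly different formulas.

```python
import math
import random


def he_init(n_in, n_out, rng):
    """He initialization: Gaussian with mean 0 and std sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def xavier_uniform_init(n_in, n_out, rng):
    """Xavier initialization as described in the text: U(-1/sqrt(n_in), 1/sqrt(n_in))."""
    limit = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


# Hypothetical layer with 256 inputs and 64 nodes; the seed only fixes the example.
rng = random.Random(0)
w_he = he_init(256, 64, rng)
w_xavier = xavier_uniform_init(256, 64, rng)

flat = [w for row in w_he for w in row]
mean = sum(flat) / len(flat)
std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print("He std (expected about %.4f): %.4f" % (math.sqrt(2.0 / 256), std))
```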
2.4.3 Activation Function
The activation function helps a neural network learn complex data patterns by
deciding which important features pass to the next neuron while suppressing
irrelevant data points. This is analogous to the activation of a neuron in a
biological neural network. Similarly, the output from the preceding node is
transformed into a format that can be passed as the input to the next neuron. This
function can be defined in both linear and non-linear forms, where the non-linear
form is widely used. For example, when there are no activation functions in a neural
network, each neuron performs only a linear transformation on the inputs using weights
and biases, and all the hidden layers behave in the same linear way. Thus, when there
is no activation function, it is hard to learn complex tasks and the model will behave
as a linear regression model.
Activation functions are also useful for keeping the output value bounded by
a threshold value, which acts as an upper limit. Here, the input to the activation
function is the product of the weights and inputs plus the bias, which is not
normalized or restricted in its range. Therefore, we can use an activation function to
normalize the output and avoid unnecessary computations.
As we have discussed earlier, one key drawback of the earlier neural networks was
their inability to model non-linear relationships between the inputs and outputs of the
network. Hence, having an activation function enables non-linearity in the neural
network. To highlight the importance of non-linearity in models, let us consider a
real-world example of a classification problem. Imagine that you are given a dataset
consisting of weight, blood pressure, and age, and you are asked to find the patterns
that distinguish a smoker from a non-smoker. This classification scenario is a
non-linear problem and requires using an activation function.
There are multiple things that we need to consider when designing an activation
function. The first one is to beware of the vanishing gradient problem. Neural
networks learn through backpropagation, using the gradient descent method to adjust
the weights and minimize the loss in each epoch. Some activation functions, such as
sigmoid, limit the output of a layer to the range between 0 and 1. When these outputs
saturate near 0 or 1, the gradients backpropagated through those layers become very
small, and the earlier layers may not learn well. These gradients are more likely to
vanish as the network gets deeper, because each activation function shifts them
further towards zero. Hence, we need to design the activation functions such that they
do not move the gradient towards zero. Further, to avoid pushing all the gradients in
the same direction, the activation function should be symmetrical around zero. Also,
the activation function should not be computationally intensive, because it is applied
in every layer and evaluated many times in a neural network. Most importantly, the
activation function should be differentiable, because artificial neural networks learn
through gradient descent.
In summary, the activation of a neuron is determined by the activation function. It
decides the importance of the input for predicting the output and transforms the
weighted sum of the inputs of the nodes in a layer into an output. Thus, during forward
propagation, the activation function adds non-linearity to the model by introducing an
extra step at each layer.
Generally, all hidden layers in a neural network use the same activation function.
It should be differentiable so that the parameters can be learned during
backpropagation. The activation function used in hidden layers is selected considering
the type of the model. In most cases, the hidden layers use ReLU activation. Sigmoid
and Softmax activations are used for the output layer in binary classification and
multi-class classification, respectively. The selection of an activation function
should consider vanishing and exploding gradient issues.
2. Leaky ReLU
to become more responsive towards the negative inputs to solve the limited range
and impulsive behavior. Further, it preserves the monotonic and differentiable
nature of the ReLU.
For any real-valued input, the sigmoid function outputs values between 0 and
1. When the input is more positive, the output becomes closer to 1. For negative
inputs, the output becomes closer to 0. This function is mostly used for the output
layer in binary classification. Generally, in a standard normal (or Gaussian)
distribution, the data is zero-centered with mean 0 and variance 1, which results in a
bell-shaped curve. However, as shown in Figure 2.10, the sigmoid output is not
zero-centered; hence, training consumes more computational time and takes longer to
converge and reach the global minimum.
Following are the advantages and disadvantages of the sigmoid function.
+ Commonly used for models that need to predict the probability as an output.
Since a probability exists only in the range of 0 to 1, sigmoid is the right choice
because of its range.
– The sigmoid function gets negligible gradients for values higher than 3 or lower
than –3. Thus, the model does not learn and faces the vanishing gradient problem,
when the gradient value is closer to 0.
– Difficult to train and the model becomes unstable since the output is not sym-
metric around zero and the output of all the neurons will be of the same sign.
– Takes more time to converge.
The Softmax function is used in the last layer of a neural network for multi-class
classification. It can be described as a combination of multiple sigmoid functions,
which returns the probability of each class. This activation function is a generalized
form of the sigmoid function and produces values in the range of 0–1 that sum to 1. It
transforms the (unnormalized) output of K units of a fully connected layer to a
probability distribution (a normalized output).
+ Since the function output is zero-centered, the output values are mapped as
strongly negative, neutral, or strongly positive.
+ Since the values fit between –1 and 1, tanh is generally used in hidden layers. Since
the mean for the hidden layers becomes 0 or close to 0, it helps to center the
data and makes the learning process of the next layer easier.
+ Although this faces vanishing gradient issues, since the function is zero-centered,
the gradients are not restricted to move in certain directions. Thus, it is widely
used in practice.
– The gradient of the tanh activation function faces the vanishing gradient problem.
This gradient is much steeper than the gradient of the sigmoid function.
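To make the comparison above concrete, the following sketch implements the activation
functions discussed in this section together with the gradients of sigmoid and tanh. It
is an illustration only: the slope of 0.01 used for Leaky ReLU is an assumed, commonly
used value and is not taken from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of sigmoid; its largest value is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh; its largest value is 1 at z = 0."""
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU keeps a small slope (alpha, assumed 0.01) for negative inputs."""
    return z if z > 0 else alpha * z


def softmax(scores):
    """Turn K unnormalized scores into a probability distribution."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


for z in (0.0, 3.0, 6.0):
    print(z, sigmoid_grad(z), tanh_grad(z))
probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The printed gradients shrink quickly as the input moves away from zero, which is the
vanishing gradient behavior noted for both sigmoid and tanh.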
2.4.4 Learning Rate
The learning rate is a hyperparameter that impacts the training of a model. It controls
how much the model changes in response to the estimated loss each time the model
weights are updated. Thus, the learning rate scales the weight updates, such that the
loss is reduced. A traditional default value for the learning rate is 0.1 or 0.01. When
the learning rate is set very low, the training process progresses slowly with
small updates to the weights, as shown in Figure 2.12. In contrast, when the learning
rate is very high, the updates can overshoot the minimum, so the loss may oscillate or
diverge.
Generally, the optimal learning rate depends on the model architecture and dataset.
Finding the optimal learning rate helps to improve performance or speed up the
training process. We need to find the optimal learning rate in a way that minimizes
the loss. For example, the learning rate can be gradually increased, linearly or
exponentially, in each mini-batch, and the loss calculated at each increment. When the
learning rate is very low, the loss value decreases only slightly. When the model enters
the region of optimal learning, the loss function will show a quick drop. When the
learning rate is increased further, the loss value will bounce and increase again while
diverging from the minimal point. It is worth noting that the slope of the curve needs
to be analyzed, as the optimal learning rate is associated with the sharpest drop in
the loss. As shown in Figure 2.13, it is required to set the range of learning rate
boundaries in a way that makes the low, optimal, and high learning rate regions
observable.
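The procedure described above can be sketched on a toy problem. The code below trains a
single-weight model ŷ = w·x with the MSE loss while increasing the learning rate
exponentially at every update, and then reports the learning rate at which the loss
dropped the most. The model, the data, the learning rate range, and the number of steps
are all assumptions chosen to keep the example small.

```python
def lr_range_test(xs, ys, lr_min=1e-4, lr_max=1.0, steps=40):
    """Increase the learning rate exponentially and record the loss after each update."""
    w = 0.0  # single weight of the toy model y_hat = w * x
    n = len(xs)
    lrs, losses = [], []
    for k in range(steps):
        lr = lr_min * (lr_max / lr_min) ** (k / (steps - 1))
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        lrs.append(lr)
        losses.append(loss)
    return lrs, losses


def steepest_drop(lrs, losses):
    """Return the learning rate (and its index) with the largest decrease in loss."""
    drops = [losses[k - 1] - losses[k] for k in range(1, len(losses))]
    k = max(range(len(drops)), key=drops.__getitem__) + 1
    return lrs[k], k


# Toy data (assumed) generated from y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
lrs, losses = lr_range_test(xs, ys)
best_lr, best_k = steepest_drop(lrs, losses)
print("sharpest drop at learning rate", best_lr)
print("final loss after the rate became too high:", losses[-1])
```

On this toy problem the loss first falls slowly, then quickly, and finally grows again
once the learning rate is too high, which mirrors the three regions described above.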
2.4.5 Loss Function
The loss or error function measures the deviation of the estimated value from its true
value. This indicates the model’s ability to predict the expected outcome.
Once the loss function is defined, we can optimize the algorithm to minimize it.
Usually, the loss is a non-negative number. Normally, better results are
obtained with small loss values, and perfect predictions give a loss of zero. The
prediction error can be of three types, namely bias error, variance error, and
irreducible error that occurs due to unknown variables. Following are some of the
types of loss functions associated with classification and regression problems.
For classification problems:
During the training process, we need to find the weight vector and bias that
minimize the total loss across all data points. Figure 2.14 shows an example of a
neural network. Let y and ŷ be the actual and predicted output, respectively.
Cross-entropy is a widely used method to calculate the loss in classification
applications; it reduces the distance between the predicted and actual outputs. The
loss function can be calculated as in (2.4).
\[
\text{Loss} = -\frac{1}{\text{output size}} \sum_{i=1}^{\text{output size}} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right] \tag{2.4}
\]
When we consider regression problems, the squared error is widely used to calculate
the loss. It calculates the square of the difference between the actual and predicted
values. Figure 2.15 shows the graph of a regression problem for a single-dimension
input.
Mean squared error (MSE) is a widely used method that, for a linear model, results in
only a global minimum. That is, MSE does not have any local minima in that case. For n
data points, we can define MSE as in (2.5). Since it squares the error, it penalizes
large errors heavily. However, for the same reason, it does not perform robustly with
outliers.
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \tag{2.5}
\]
\[
\text{MAE} = \frac{\sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|}{n} \tag{2.6}
\]
Huber loss is another loss function that combines MSE and MAE, using both quadratic
and linear terms. As in (2.7), if the absolute value of the error a is at most the
threshold delta (δ), the loss is the square of the error divided by 2. Otherwise, the
loss grows linearly: it multiplies delta by the absolute error minus half of delta. It
supports better regression and works well with outliers, as the delta value limits the
influence of large errors.
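\[
L_\delta(a) =
\begin{cases}
\frac{1}{2} a^2 & \text{for } |a| \le \delta, \\
\delta \left( |a| - \frac{1}{2} \delta \right) & \text{otherwise.}
\end{cases} \tag{2.7}
\]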
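The loss functions in (2.4) to (2.7) can be written directly as short functions. The
sketch below is a plain Python illustration; the function names and the small clipping
constant eps, which guards against log(0), are assumptions added for this example.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss as in (2.4), averaged over all outputs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1 (assumed guard)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error as in (2.5)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error as in (2.6)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def huber(a, delta=1.0):
    """Huber loss as in (2.7) for a single error value a."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]  # the last prediction acts as an outlier
print(mse(y_true, y_pred), mae(y_true, y_pred))
print(huber(0.5), huber(3.0))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Comparing the values for the same error shows why Huber loss is less affected by
outliers than MSE: beyond delta it grows linearly instead of quadratically.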
2.4.6 Other Hyperparameters
The number of epochs: this defines the number of complete passes of the learning
algorithm through the training dataset. Thus, an epoch indicates one cycle through the
training dataset, in which each data sample has had the opportunity to update the
internal parameters. A training process consists of many epochs.
Dropout rate: dropout is a regularization technique used in deep neural networks
to prevent overfitting. The dropout rate hyperparameter indicates the probability
of ignoring a neuron during a training iteration. The dropout rate typically ranges
between 0.1 and 0.5. A high dropout rate can result in underfitting, where the model
does not learn well from the data. A low dropout rate can result in overfitting, where
the model learns too much from the training data and performs poorly on unseen data.
Therefore, the appropriate dropout rate depends on the specific dataset and architec-
ture of the neural network.
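As a minimal sketch of how the dropout rate is applied, the following function ignores
each activation with probability equal to the rate during training. Scaling the
surviving activations by 1/(1 − rate), often called inverted dropout, is an assumption
added here so that the expected output stays the same; the text does not specify it.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at prediction time
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep (inverted dropout, an assumed convention).
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed only to make the example repeatable
layer_output = [1.0] * 10000
dropped = dropout(layer_output, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print("fraction of ignored neurons:", zero_fraction)
```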
2.5 MODEL TRAINING
2.5.1 Model Selection
Model selection identifies the best model from a set of models. In deep learning, the
best model can have different interpretations based on the various criteria that we use
to define it. The first task is selecting the best hyperparameters for the
model. As we discussed, hyperparameters are parameters that are fed into the model
learning function as input. Selecting the right hyperparameters for a model is a
crucial point that affects the model’s performance. Another aspect is to select the
best learning algorithm for the model. Here, we need to carefully select the algorithm
based on different criteria, such as the nature of the training dataset, interpretability
of output, number of features, and linearity.
The most important part of model selection is model evaluation.
This mainly focuses on estimating the generalization error of the selected model to
predict how well it can perform on unseen data. A proper model evaluation
ensures that the performance of the model will not drop even with a completely new
set of data. To do that, we need a completely independent test set that we have
not used to train our model.
If we have plenty of available data, we can split our dataset into three main
parts: a training set, a validation set, and a test set, according to a suitable ratio.
The training set is used to fit different candidate models with various combinations
of model hyperparameters. The candidates are evaluated on the validation set, and
the best model out of all candidates is selected. The selected model is then trained on
the training set and validation set by tuning the model parameters. The generalization
error of the model is then evaluated on the test set. If this error is quite similar to
the validation error, there is a higher probability that the model will perform well on
unseen data as well. Even after training the model, we can further use model learning
curves as a measure of the model's predictive performance. This helps identify
overfitted and underfitted models based on the training and validation scores. Also,
learning curves illustrate the concepts of variance and bias. Bias refers to the error
caused by a model that is too simple, which makes it underfit the data. On the other
hand, high variance causes the model to overfit. Therefore, all these measurements of
model complexity can be used for proper model selection.
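The three-way split described above can be sketched as follows. The 70/15/15 ratio, the
function name, and the fixed shuffling seed are assumptions for illustration; the
appropriate ratio depends on how much data is available.

```python
import random


def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the records and split them into training, validation, and test sets."""
    items = list(records)
    random.Random(seed).shuffle(items)  # shuffle a copy; the input is not modified
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder is kept unseen until the end
    return train, val, test


records = list(range(100))  # stand-in for 100 data instances
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Candidate models would be fitted on `train`, compared on `val`, and the chosen model
evaluated once on `test` to estimate the generalization error.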
2.5.2 Model Convergence
A model converges when additional training will not improve the model. A model
converges when its loss moves towards a minimum (local or global) with a decreasing
trend. Figure 2.16 shows a non-convex function with the minimum and maximum
points. The term local minimum is the lowest value of a loss function in a local
region. The global minimum is defined when the loss function obtains the lowest
value globally across the entire domain.
In deep learning models, we need to avoid getting stuck in local minima. At a local
minimum or a saddle point, the derivative of the loss with respect to the weights is
zero, so the weights remain the same without updating. A momentum value can be used
to address the local minimum points. In other words, by providing an impulse in a given
direction, momentum helps the optimizer to move past local minimum points. We can use
stochastic gradient descent optimization to escape local minima and to move towards the
global minimum, which is discussed in Chapter 5. In addition, changes in the activation
function and learning rate, and the use of batch normalization, can help to avoid local
minima.
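The momentum idea can be sketched with the classical update rule, in which a velocity
term accumulates past gradients and provides the impulse mentioned above. The
hyperparameter values and the simple quadratic loss used here are assumptions for
illustration; the sketch only shows the update rule, and whether momentum escapes a
particular local minimum depends on the loss surface and the chosen settings.

```python
def momentum_descent(grad, x0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with momentum on a single parameter."""
    x, v = x0, 0.0
    for _ in range(steps):
        # The velocity keeps a fraction (beta) of the previous update,
        # so the parameter can keep moving where the gradient is close to zero.
        v = beta * v - lr * grad(x)
        x = x + v
    return x


def quadratic_grad(x):
    """Gradient of the toy loss f(x) = x**2, whose minimum is at x = 0."""
    return 2 * x


x_final = momentum_descent(quadratic_grad, x0=5.0)
print("final parameter value:", x_final)
```

Setting `beta=0.0` removes the velocity memory and reduces the rule to plain gradient
descent.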
• Model Overfitting
Overfitting occurs when the learning algorithm attempts to fit all the data points,
or more than the required data points, within the dataset. As a result, the model learns
noise (irrelevant and unnecessary data that reduce model performance) and
inaccurate features in the dataset, which negatively affects the overall performance of
the model. That means, when the model is trained with data, it picks up noise or other
random fluctuations in the dataset and tries to learn from them. This reduces the
model’s ability to generalize and perform well with unseen data because of the noise
and excessive detail. Therefore, an overfitted model will perform well on training data
but not on unseen data, as shown in Figure 2.17. These models have low bias and high
variance. Consider overfitting in a regression model in Figure 2.17. The model tries to
cover all the data points in the graph. You may think this is very efficient; however,
regression aims to find the best-fit line and not to cover all the data points.
Therefore, this model will not perform well on unseen data. Thus, an overfitted model
can be recognized by a training loss that is low and well below the test loss, together
with a high variance.
In the early days of neural networks, the received wisdom was that one cannot have
more parameters than training examples. With deep learning, that rule seems to have
been thrown out of the window. Nevertheless, if there are more parameters than
samples, the model can easily overfit the training data. Then it loses its ability
to generalize. That is, it will remember the training data very well and work for
training data, but it will not give good results for a new set of related data. Since we
already know the reasons for a model to be overfitted, let us learn about mechanisms
that we can use to avoid overfitting in learning models.
The overfitting reduction methods are listed as follows.
• Perform regularization.
• Increase training dataset size.
• Lessen model complexity: keep the model simpler by using fewer variables
and parameters. This removes noise associated with the training set to reduce
variance.
• Perform early stopping in model training, when the loss starts to increase, as
shown in Figure 2.18.
• Use dropout while training: it will drop out some of the connections and nodes
randomly during training. Thus, the network effectively becomes simpler and smaller.
• Apply regularization, such as ridge regularization and lasso regularization,
which shrink or eliminate some of the model parameters that cause overfitting.
• Evaluate with cross-validation techniques.
Among the possible methods to reduce overfitting, training with more data helps to
avoid model overfitting to some extent, as it exposes more of the underlying
patterns in the data for the model to learn. Different augmentation methods, such as
cropping and rotating, can be used to increase the dataset size, which can address
overfitting. However, if we add more noisy data to the dataset, this technique will not
help to avoid model overfitting. Thus, it is necessary to ensure that the additional data
is clean and relevant. Moreover, the number of features in the training data can be
reduced to decrease the complexity of the network.
Cross-validation is an efficient method to avoid model overfitting by generating
multiple train–test splits. In k-fold cross-validation, we partition the dataset into k
subsets and iteratively train the model on k-1 folds while keeping the remaining fold
for testing. This method helps to tune the model hyperparameters on the training data
and keep the test data as an unseen set of data to select the best final model. You can
read more on cross-validation in Chapter 7.
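The k-fold partitioning described above can be sketched as an index generator. The
function name is hypothetical, and the sketch assumes the data has already been
shuffled; each fold is used once for testing while the remaining k-1 folds are used for
training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In practice, the model is trained and scored once per fold, and the k scores are
averaged to compare hyperparameter settings.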
Regularization is another technique that is used to avoid overfitting by
reducing the complexity of the model. The regularization method depends on the
type of learning algorithm that we use. As an example, we can use dropout layers in
neural networks, pruning in decision trees, and penalty or sparsity terms in the
regression cost function. Dropout regularization randomly ignores units in the hidden
layers during model training, which discourages unusual dependences between them.
More details on regularization techniques will be discussed later in this chapter and
in Chapter 6.
As shown in Figure 2.18, early stopping can also efficiently prevent model
overfitting. During model training, we measure model performance in each iteration.
Up to a certain iteration, each new iteration improves the model performance. After
that point, further training weakens the model’s ability to generalize on unseen data
and the model starts to overfit. Early stopping refers to stopping the model training
process before the learning algorithm passes that point.
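As a rough sketch of this idea, the function below scans a list of validation losses and reports the epoch at which training should have stopped. The `patience` parameter (how many non-improving epochs to tolerate) and the loss values are assumptions introduced for illustration; they do not come from Figure 2.18.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once the validation
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the loss has started to rise; stop training
    return best_epoch


# Hypothetical validation losses: they fall, then start to increase.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
print(early_stopping_epoch(losses))  # 3
```

In practice the model weights saved at the returned epoch would be restored, since later epochs only fit the training data more closely.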
• Model Underfitting
Underfitting happens when the model is not capable of capturing the underlying trend
of data and fails to learn the patterns in the dataset. In this case, the model does not
learn well from the training dataset, which reduces accuracy and leads to unreliable
predictions on test data. Usually, this occurs when there is not sufficient data to train
an accurate model or when we attempt to fit a linear model to non-linear data. In
underfitting, the error or loss in both training and testing is high. The underfitted
model has low variance and high bias.
Following are some of the possible methods to decrease underfitting:
2.5.4 Regularization
Regularization makes slight changes to the learning model so that it generalizes
better, fitting the function suitably on the training set while avoiding overfitting.
This reduces the variance without considerable growth in the bias, which results in an
increase in model performance on unseen data. Although regularization improves
the reliability, speed, and accuracy of convergence, it is not a solution to every
problem. Figure 2.19 shows an example of underfitting, optimum fitting, and
overfitting in a classification task. Generally, regularization should be used when
working with large datasets in neural networks.
The widely used regularization methods are L1 regularization (lasso regression)
and L2 regularization (ridge regression). L1 tends to shrink some coefficients to
exactly zero, whereas L2 tends to shrink all coefficients evenly. Therefore, L1
regularization is used to select features, as the variables associated with coefficients
that tend to zero can be dropped. In contrast, L2 is useful when the features are
collinear or co-dependent. Additionally, the L1 and L2 norms are associated with
estimating the median and mean of the data, respectively. In a neural network with
many layers, underfitting is unlikely. Instead, because of the many associated
weights and biases, overfitting can happen, as the weights are trained to fit the
training data perfectly. Therefore, dropout and regularization (L1 and L2) are used
as a solution for overfitting in multi-layer neural networks.
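To make the two penalties concrete, the sketch below adds an L1 or an L2 term to a data loss. The regularization strength `lam`, the weight values, and the data loss of 0.30 are hypothetical numbers chosen only for illustration.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.5, 2.0]   # hypothetical model weights
data_loss = 0.30             # hypothetical loss on the training data
lam = 0.01                   # assumed regularization strength

print("L1-regularized loss:", data_loss + l1_penalty(weights, lam))
print("L2-regularized loss:", data_loss + l2_penalty(weights, lam))
```

Because the L2 term squares each weight, it penalizes the largest weight (2.0) much more heavily than the smallest one, while the L1 term charges every unit of weight magnitude equally.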
During model training, the following aspects can be observed regarding the
dropout layer and regularization.
• Large weights in a neural network are a sign of a more complex network that
has overfit the training data.
• Probabilistically dropping out nodes in the network is a simple and effective
regularization method.
• A large network with more training and the use of a weight constraint is
suggested when using dropout.
2.5.5 Network Gradients
In neural network training, the associated weights are updated with small and
controlled values by considering the gradients.
Generally, these models are trained using gradient-based methods and
backpropagation, which finds the partial derivatives by navigating from the last layer
to the first layer using the chain rule. Here, it is necessary to calculate the gradient
from each sample element to determine a new approximation of the weight vector.
Backpropagation fine-tunes the weights based on the loss value obtained in the
previous epoch, and it calculates the derivatives quickly. Thus, the weight tuning
results in minimal loss and high reliability by improving the generalization. The chain
rule supports finding the derivative of composite functions, that is, functions that are
made by combining two or more functions. It is applied extensively by the
backpropagation algorithm to train feedforward neural networks. The chain rule
indicates that the derivative of y with respect to x is equal to the product of the
derivative of y with respect to u and the derivative of u with respect to x, as in (2.8).
Chain rule: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ (2.8)
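The chain rule can be checked numerically on a small composite function. In the sketch below, the choice of y = sin(u) with u = x² is an arbitrary example introduced here, and the finite-difference step `h` is an assumed value.

```python
import math


def u(x):
    return x * x                  # inner function u(x) = x^2


def y(u_val):
    return math.sin(u_val)        # outer function y(u) = sin(u)


def dy_dx_chain(x):
    # dy/du = cos(u), du/dx = 2x, so dy/dx = cos(x^2) * 2x.
    return math.cos(u(x)) * 2 * x


def dy_dx_numeric(x, h=1e-6):
    # Central finite difference applied to the composite function.
    return (y(u(x + h)) - y(u(x - h))) / (2 * h)


x = 1.3
print(dy_dx_chain(x), dy_dx_numeric(x))
```

The two printed values should agree to several decimal places, which is the same check commonly used to validate a backpropagation implementation.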
However, there are issues associated with the activation functions and weight
initialization mechanisms. A prominent problem in artificial neural network design is
that of vanishing or exploding gradients, which has been a large barrier in model
training.
Consider a model with n hidden layers, where n derivatives are multiplied
together. If the derivatives are large because of high weights or the activation
functions, then the gradient grows exponentially as it propagates through the model
until it explodes. That is, there is a significant difference between the new weight and
the old weight; thus, the model does not converge and jumps between different points
during gradient descent. We call this the exploding gradient problem.
With the exploding gradient problem, the model becomes unstable and may
not learn the patterns efficiently. As shown in Figure 2.13, if the learning rate is too
high, it causes drastic updates without converging to a global minimum. In other
words, with large changes in extreme values, the weights become large. This causes
an overflow of the multiplied values, resulting in many weights taking not-a-number
(NaN) values, which cannot be updated.
The exploding gradient problem can be detected using the following observations.
• The model does not learn much from the training set; thus, it has a low
performance.
• In each weight update, the model shows substantial differences in the loss, due
to the model’s instability.
• In the training process, the weights increase exponentially and result in large
values.
• The derivative values become constant.
In the vanishing gradient problem, the model has small derivatives, which cause the
gradients to shrink exponentially as they propagate through the model until they
vanish. Here, the accumulation of small gradients prevents the model from learning
insightful patterns, because the weights and biases in the initial layers, which are
responsible for learning the core features, are not updated effectively.
With the sigmoid activation function, when the number of layers increases, the
derivative value that is used to update the weights becomes very low. In this problem,
the derivative of the sigmoid function ranges between 0 and 0.25. Thus, the weight
updating happens very slowly (due to very small derivatives) in backpropagation.
Therefore, the model will not converge towards the global minimum, and sigmoid is
generally not used for the hidden layers of deep networks. In the extreme case, the
gradient will be 0, where the weights remain the same, and the model will stop learning.
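The effect of multiplying many sigmoid derivatives can be illustrated with a short calculation. The sketch below ignores the weights and only tracks the largest possible derivative factor per layer (0.25), which is a simplifying assumption; real gradients also depend on the weight values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0


def max_gradient_factor(n_layers):
    # Upper bound on the product of n sigmoid derivatives (weights ignored).
    return sigmoid_derivative(0.0) ** n_layers


for n in (1, 5, 10, 20):
    print(n, max_gradient_factor(n))
```

Even in this best case the factor falls below one millionth after ten layers, which is why the layers closest to the input barely receive any update.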
The vanishing gradient problem can be identified using the following observations
during model training.
• During the training phase, the model improves slowly and training may stall
early. That means further training does not result in model improvement.
• The weights nearer to the output layer change more, whereas the weights in the
layers closer to the input layer may not alter much.
Now, let us see how we can address the issues related to exploding gradient problems
and vanishing gradient problems.
1. Reduce the number of hidden layers: This solution is applicable for both
exploding and vanishing gradient problems. However, the model complexity
reduces by lessening the number of layers.
2. Gradient clipping: This solution can be applied for exploding gradients,
where the gradient size is limited to a specific range if the gradient exceeds an
expected range of values.
3. Weight initialization: Applying a careful weight initialization approach would
address these two issues in random initialization. Therefore, He initialization
or Xavier initialization can be followed or modified if needed by adjusting the
mechanism to be compliant with the data.
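Gradient clipping, the second item in the list above, can be sketched as follows. This example uses clipping by the L2 norm, which is one common variant and an assumption here; the text does not say which form of clipping is intended, and the threshold `max_norm` is a hypothetical parameter.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)     # already within the allowed range
    scale = max_norm / norm
    # Keep the direction, shrink the magnitude to max_norm.
    return [g * scale for g in gradient]


print(clip_by_norm([3.0, 4.0], 1.0))   # norm 5.0 is scaled down to about 1.0
print(clip_by_norm([0.3, 0.4], 1.0))   # norm 0.5 is left unchanged
```

Scaling the whole vector preserves the update direction, so only the step size is limited when the gradients become very large.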
For further explanation, consider the sigmoid activation function and its derivative
shown in Figure 2.20. Here, activation functions such as sigmoid squeeze the input
space into an output region between 0 and 1. Thus, when the value of the sigmoid
function is very high or low, the resulting derivative becomes low. This results in
vanishing gradients and low model performance. Moreover, as the number of layers
is increased, the gradients become too small for the model to train effectively.
Usually, when the model has a small number of layers, the sigmoid activation
function can be used. However, when there are many layers, sigmoid is not used
for each layer. In such scenarios, activation functions such as ReLU, which do
not result in a small derivative, are used. As another solution, residual models can
be used, as they provide residual connections straight to earlier layers. Another
recommended approach to overcome the vanishing gradient problem is layer-wise
pretraining.
REVIEW QUESTIONS
1. What is the use of loss function in neural networks?
2. How can hyperparameters be trained in neural networks?
3. Why do we use activation functions in deep learning models?
4. How do you initialize weight and biases in neural networks?
5. What are the limitations in zero initialization of weight?
6. Why is ReLU the most commonly used activation function?
7. Justify the cases where the linear regression algorithm is suitable for a given
dataset.
8. List some of the metrics used to evaluate a regression model.
9. How can gradient exploding and vanishing problems be resolved?
10. Show the relationship between the model complexity with the training and
test error in overfitting and underfitting scenarios.
11. How can we identify whether a model is overfitted or underfitted using
learning curves?
3 State-of-the-Art Deep Learning Models: Part I
DOI: 10.1201/9781003390824-3
• Artificial neural network (ANN): used to solve regression and classification tasks.
• Convolutional neural network (CNN): used for the classification of image
and video data, mainly for object recognition, object detection, and object
classification. CNN has additional layers for convolution and max pooling,
compared to ANN.
• Recurrent neural network (RNN): data flows in any direction, and it is used for
applications such as object detection and language modeling. Long short-term
memory is effective for this use, which involves embedding layers and one-hot
implementation.
activation function is applied to get the results as shown in Figure 3.3. The input layer
gets the inputs from the input space and passes to the hidden layer to process where
each layer learns certain weights, and the output layer delivers the result. This is a
feed-forward network, where the data moves through the input, hidden, and output
nodes without any loops in the network. These are mainly used for regression and
classification of tabular data, image data, and text data.
In an ANN, a node or perceptron performs a mathematical operation as in (3.1),
where x is the input, w denotes the associated weight, and b is the bias. As shown in
Figure 3.4, each input is multiplied by its corresponding weight, the products are
summed, and the bias is added to the result. Finally, an activation function g is applied,
resulting in an output of g(w · x + b).
$\sum_{i=0}^{n} w_i x_i + b = w_0 x_0 + w_1 x_1 + \dots + w_n x_n + b = w \cdot x + b$ (3.1)
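The operation in (3.1) followed by the activation g can be written directly in code. The sketch below is a minimal single-node example; the sigmoid used as g and the input, weight, and bias values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b, g):
    """Compute g(w . x + b) for one node."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum plus bias
    return g(z)


x = [1.0, 2.0]        # hypothetical inputs
w = [0.5, -0.25]      # hypothetical weights
b = 0.0               # hypothetical bias

print(neuron_output(x, w, b, sigmoid))  # 0.5
```

Here the weighted sum is 0.5 × 1.0 + (−0.25) × 2.0 + 0.0 = 0, and the sigmoid of 0 is 0.5.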
There are many advantages associated with ANNs. An ANN can learn any non-linear
function and is therefore considered a universal function approximator. The
non-linear characteristics of the model are handled by the associated activation
functions, which learn the complex relationship. Here, each node outputs a weighted
sum of inputs. If a model has no activation function, then the model can learn only
linear relationships. Thus, the activation function gives power to ANN.
Let us consider the challenges in ANNs. Consider an image classification problem.
Initially, the two-dimensional image is transformed into a one-dimensional vector
before the training starts. However, when the image size increases, the required
number of trainable parameters also increases. For example, consider a color image
of size 224×224 with three channels. In this case, the first hidden layer with four
nodes will contain 224 × 224 × 3 × 4 = 602112 trainable weights. Moreover, by
flattening the image, ANN drops the spatial features and the pixel arrangement
of the image. Further, ANN may not capture the sequential data in the input space.
This limitation is addressed using recurrent neural networks (RNNs).
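The parameter count quoted above can be reproduced with a short calculation. The figure of 602112 matches a 224×224 image only if three color channels are assumed and bias terms are left out; both are assumptions of this sketch, since the text does not state them.

```python
def dense_layer_weights(height, width, channels, n_nodes):
    """Number of weights connecting a flattened image to a dense layer."""
    n_inputs = height * width * channels   # length of the flattened vector
    return n_inputs * n_nodes              # one weight per input per node


print(dense_layer_weights(224, 224, 3, 4))    # 602112
print(dense_layer_weights(448, 448, 3, 4))    # doubling each side quadruples the count
```

Including one bias per node would add only four more parameters, so the weights dominate the total.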
Generally, the deep feed-forward models may need specific parameters in each
element to handle sequence data and are not capable of generalizing to variable-
length sequences. On the other hand, RNNs efficiently work with sequential data by
memorizing part of the input data and using them to make accurate predictions. The
advantage is that RNN attains the sequential information in the input space. Thus, as
depicted using different color codes in Figure 3.7, it considers the dependency among
the words in the prediction process. In this example, the outputs o1, o2, o3, and o4 at
each time step are based on both the previous and current words.
Generally, in sequential data processing, it is necessary to consider the order of
elements in the sequential data. As an example, in natural language processing, the
input data is in the form of a sequence, and the output data may also be a sequence
of elements. In this case, the order of sequence elements is important to have good
predictive results. If we are using MLP to process this data, one solution would be to
have an input layer that contains a number of units equal to the number of elements
in the input sequence. However, this will not work with variable-length sequences.
Additionally, these input elements need to be in the same order. Sometimes, the same
sentence may be expressed in various ways with different word orderings. Therefore,
without a dataset that contains all these orderings for similar meanings, it will not be
possible to learn an accurate model effectively.
Recurrent neural networks are mainly designed to overcome problems in dealing
with sequential data. In RNN, we have element-level models that are finally used
together to form the final predictive model. The speciality of these models is that all
element-level models share the same parameters. Therefore, it does not matter where
the element is present in the sequence, as all of them are processed in the same manner
and remain in order in the outputs. Each of these element-level models takes as
input the current element and the output of the element-level model of the previous
element in the input sequence. After progressing in this way and processing the
final element in the sequence, the sequence data is encoded in the output of the final
element-level model. This output can also be decoded into sequential outputs,
as in speech recognition and language translation.
Consider the property of parameter sharing across different time steps. Expanding the
recurrence relationship to obtain the full chain of inputs is known as unfolding
the equation. This shows the hidden state at any given time t as a function of the
parameters and the sequence. With this, we can use fewer parameters for training,
hence reducing the computational cost. As shown in Figure 3.8, U, W, and V
are the three weight matrices that are shared throughout the time steps. Thus,
the same weights are used in the forward direction over time, and the weights are
updated during backward propagation.
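Parameter sharing across time steps can be seen in a small forward pass. The sketch below simplifies U, W, and V to scalars and assumes the common recurrence h(t) = tanh(U·x(t) + W·h(t−1)) with output o(t) = V·h(t); the text does not spell out the exact equations, so this form and the numeric values are assumptions.

```python
import math


def rnn_forward(inputs, U, W, V, h0=0.0):
    """Run a scalar RNN over a sequence, reusing U, W, V at every step."""
    h = h0
    hidden_states, outputs = [], []
    for x in inputs:
        # The same U and W are applied at each time step.
        h = math.tanh(U * x + W * h)
        hidden_states.append(h)
        outputs.append(V * h)      # the same V produces each output
    return hidden_states, outputs


# Sequences of different lengths are handled by the same three parameters.
print(rnn_forward([1.0, 0.5, -0.2], U=0.8, W=0.5, V=2.0))
print(rnn_forward([1.0, 0.5, -0.2, 0.3, 0.9], U=0.8, W=0.5, V=2.0))
```

Because the parameters do not depend on the position in the sequence, the parameter count stays fixed regardless of the sequence length.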
In order to train an RNN model, we need to infer the parameters. That means
the hidden states are computed during the forward pass, and the gradients of the
model are then derived by calculating through all the hidden states. This is known as
unrolling the computational graph to compute the hidden states and then using it to
compute the gradients. These gradients are calculated using backpropagation, by
sequentially going backwards in time from the gradient of the hidden variable h(t) to
h(1), which is known as backpropagation through time.
When we talk about the challenges, vanishing and exploding gradient problems
occur in deep RNNs with many time steps. During the model training, the error is
calculated using the cost function at each time step, and backpropagation is used to
update the weights. Therefore, every single neuron is involved in updating its
weights to minimize the error. The vanishing gradient problem occurs when the error
is moving backwards through all the neurons to get their weights updated. In RNN,
the cost function that is used in a given time step will be used by other shallow layers
to update their weights. Thus, the gradient value that is calculated at each step will be
multiplied back through the weights earlier in the network. This causes the gradient
to vanish if the weight is too small, as the gradient becomes smaller and smaller with
each multiplication. Based on the value of the weight W, this causes two main
problems as follows:
to other objects in the image. Accordingly, CNNs solve problems related to both
images and sequential data. Although CNN works for small size images, it does not
ensure high precision, because when the image data is flattened into an array, it loses
important dimensionality information or spatial information of the image. CNN can
identify image complexities by reducing the number of parameters and reusing the
weights. Thus, CNNs provide high-performance results. As shown in Figure 3.10,
a CNN consists of four main layers namely the convolutional layer, pooling layer,
ReLU correction layer and the fully connected layer. The details of each layer are
described later in this chapter.
3.4.2 Concepts of CNN
Let us discuss some of the concepts and terminology used in CNNs.
• Padding indicates the number of pixels added to an image during filter
processing. The model design becomes easier if we preserve the width and height of
the image matrix, as we do not have to be concerned about changing tensor dimensions.
Padding allows us to design deeper networks and avoids a decrease in image size
during the convolutional operation. It is used to keep the output image size the
same as the input image size, without losing any border information after the
convolution operation. Different types of padding are used to handle the border pixels
of the image matrix. In zero padding, we add zeros to the borders systematically.
Other padding types are reflection (mirror) padding and near-value padding, which
are based on neighboring pixels.
• Stride is a kernel parameter that sets the amount of movement across the
image region; larger strides reduce the output size. For instance, when the stride is
set to 1, the filter moves one unit at a time. Stride controls how the filter convolves
around the input volume. Figure 3.12 shows the feature map obtained from
a 7×7 input volume with 3×3 filters; a stride of 1 yields a 5×5 feature map, whereas a
stride of 2 yields a 3×3 feature map. Since neighboring pixels in
the lowest layers are strongly correlated, the size of the output is reduced by
sub-sampling (also known as pooling) the filter response. Accordingly, a large
stride in the pooling layer results in more information loss.
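The combined effect of padding and stride on the output size follows the standard formula floor((n + 2p − f) / s) + 1, where n is the input size, f the filter size, p the padding, and s the stride. The function name below is an assumption made for this sketch.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (n + 2 * padding - f) // stride + 1


print(conv_output_size(7, 3, stride=1))             # 5
print(conv_output_size(7, 3, stride=2))             # 3
print(conv_output_size(7, 3, stride=1, padding=1))  # 7, size preserved
```

By this formula, a 7×7 input with a 3×3 filter gives a 5×5 map at stride 1 and a 3×3 map at stride 2, while one pixel of padding at stride 1 keeps the output the same size as the input.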
Now let us dive into the components of a CNN as shown in Figure 3.13. They can
be listed as
1. Convolution layer/kernel.
2. Max-pooling layer.
3. Fully connected layer.
range of the size of the input and output layers. One rule of thumb used by
practitioners is that the number of hidden nodes should be 2/3 of the total size
of the input and output layers. Another is that the number of hidden nodes
should be fewer than double the input layer size. However, based on the problem, we
need to experiment and identify the optimal number of hidden layer nodes for the
minimum loss.
3.4.3 Convolutional Layer
Convolution is a mathematical operation that combines the image matrix and a
filter to produce an output. It is the first layer that extracts
features from the original image. Convolution learns the image features and maintains
the relationship among the pixels. A convolutional layer is the main element of a CNN.
It consists of filters (kernels), whose discrete value parameters (kernel weights)
need to be learned. Usually, the convolutional layer applies a filter that is smaller
than the actual image or feature map size, where each filter convolves with the image
and produces an activation map. At the start of the CNN training process, random
numbers are allocated to the kernel weights. Numerous approaches are employed to
initialize the kernel weights, which are then updated at every training step.
First, let us understand the term ‘kernel’, which is an operation to extract features
to solve non-linear problems using a linear classifier. This function is applied to each
data unit to transform the non-linear features into a separable higher-dimensional
region. In detail, the matrix of the kernel moves across the input space performing
the dot product with a sub-region of the input and producing the output as the matrix
of dot products. More precisely, a kernel denotes a 2D array of weights and a filter
represents a 3D stack of multiple kernels. Here, a kernel is allocated to a given input
channel. Thus, a 2D filter is the same as a kernel. However, a 3D filter is a collection
of kernels.
Consider a case where the input of the CNN is a multi-channeled image. Each
CNN kernel slides across the two-dimensional input space, conducting element-wise
multiplication and a summation to produce a single output in reduced size. This pro-
cess is repeated until no sliding is possible and generates a two-dimensional feature
map (activation map). Figure 3.14 graphically illustrates the convolution operation of
a 2×2 kernel applied to a 3×3 two-dimensional input.
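The sliding multiply-and-sum operation can be sketched for a 2×2 kernel on a 3×3 input. The numeric values below are made up for illustration and are not those of Figure 3.14. As in most deep learning libraries, the kernel is applied without flipping (strictly speaking, a cross-correlation), and no padding is used.

```python
def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)       # one entry of the feature map
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

print(convolve2d_valid(image, kernel))  # [[6, 8], [12, 14]]
```

The first output entry is 1×1 + 2×0 + 4×0 + 5×1 = 6, and the 3×3 input shrinks to a 2×2 feature map because the kernel fits in only four positions.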
As we learned, a convolution expresses the amount of overlap of one function as it is
shifted over another function. Generally, starting from the top left corner of the image,
the convolution moves the filter over all the positions at which the filter fits within the
image boundaries. As shown in Figure 3.14, the first entry of the output activation
map is generated by convolving the filter with the selected region of the image. This
process is repeated for each position in the image to obtain the activation map. Thus,
the convolutional layer’s output is produced by stacking each filter’s activation map
along the depth, where each element of the activation map is considered as an output
of a node with parameter sharing. Therefore, each node in the convolutional layer is
associated with a region in the input image, and the filter size is the same as the area
size. The associated local connectivity enables learning filters with higher response
to a region of the input image. Generally, the basic features, such as the edges, are
captured using the initial convolutional layers and the complex features, such as
shapes and objects, are identified using the final layers.
Accordingly, the convolution layer extracts features from raw data and ensures
spatial connectivity among the pixels by learning image features using different
regions of the image. This layer reduces the input image size while identifying the
major information in the image. Filters identify spatial patterns, such as edges, using
the intensity values, as shown in Figure 3.15. The kernel size is the product of
the width and height of the filter mask. In the convolution operation, a kernel
matrix is passed over the input image matrix to calculate the resulting output
matrix. Generally, the number of weights in a convolutional
layer is less than the number of weights in a fully connected or dense layer. The
convolutional layer is followed by a non-linear activation function. Accordingly, it
identifies a local association of features from the prior layer and maps their presence
to a feature map.
3.4.4 Pooling Layer
The pooling layer is used to decrease the spatial size of the convolved feature by
sub-sampling. The reduction of the feature map dimensions lowers the number of
learnable parameters and the computational power needed by the model.
Additionally, it extracts the main features from the feature map, supporting effective
training. Generally, the pooling layer is used in between two convolutional layers, as
the pooling operation reduces the spatial volume of the input image after the
convolution. If we use a fully connected layer after the convolutional layer without
applying pooling, it results in expensive computations. Thus, we can use a technique
such as max pooling to lessen the image’s spatial volume. Several types of pooling
techniques are available, such as max pooling, min pooling, and mean pooling.
The max-pooling technique outputs the highest value from each image region covered
by the kernel. Mean pooling returns the average of all the values in each
region of the convolved feature map. In an image, if a region strongly indicates the
presence of a given feature, max pooling identifies that feature. Comparatively, if
there are regions with contradictory presence, then mean pooling is used. From
another point of view, max pooling discards the noise data associated with the input
image while reducing dimensionality. In contrast, average pooling performs
dimension reduction as a noise-suppressing function. Therefore, max pooling often
performs better than average pooling.
Figure 3.16 shows two examples of max pooling and average pooling with 2×2
filters and stride 2. Max pooling performs well for images with a black background
and white objects. Min pooling works well when the image has white backgrounds
and black objects. Since average pooling smooths out the image, it is hard to identify
strong features with it.
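Max and average pooling with 2×2 filters and stride 2 can be sketched as below. The 4×4 feature map values are invented for illustration and do not reproduce Figure 3.16; the function name and its `mode` argument are likewise assumptions of this example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling over square windows."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values covered by the current window.
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 1],
      [3, 4, 5, 6]]

print(pool2d(fm, mode="max"))      # [[6, 4], [7, 9]]
print(pool2d(fm, mode="average"))  # [[3.75, 2.25], [4.0, 5.25]]
```

Both variants reduce the 4×4 map to 2×2, but max pooling keeps the strongest response in each window while average pooling smooths the values.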
In CNNs, convolutional and max-pooling functions go hand in hand.
The number of these two layers needed is decided considering the image
complexity. An increased number of such layers is capable of capturing low-level,
highly complex details, but at the cost of increased computational
complexity and power.
TABLE 3.1
Summary of ANN, RNN, and CNN
ANN
Advantages:
• Store details about the entire model.
• Train with partial knowledge.
• Capable of error handling.
• Use a distributed memory.
Disadvantages:
• Depend on the hardware specification.
• Limited model explainability.
• Need a well-defined model structure.
RNN
Advantages:
• Record each detail along the time.
• Use convolutional layers to extract features.
• Predicts time series data.
• Suited to analyzing temporal, sequential data, such as text or videos.
Disadvantages:
• Has vanishing gradient and exploding gradient issues.
• Complex training process.
• Hard to process long sequences with Tanh or ReLU activations.
CNN
Advantages:
• Perform well on image classification.
• Automated feature extraction.
• Parameter sharing.
Disadvantages:
• Does not consider the location and direction of the object.
• Require more training data.
TABLE 3.2
Comparison of ANN, RNN, and CNN
different features, eventually, these are converted into a 1D array in the feed-
forward network and get the classification results. On the other hand, ANN
requires fixed data points. For example, consider a model that distinguishes cars
and buses. Here, the details, such as the height of the vehicle and front shape
are given as explicit data points. However, a CNN extracts these spatial features
from the original image. Likewise, CNN can extract many features automatically,
without measuring each of the features.
Therefore, considering its better feature extraction ability, CNN has different filters to
detect many features of the image and gives good clarity of the image as output with
max pooling. It learns features from the image’s region of interest (ROI). However,
ANN takes the image as a multi-dimensional array, and the features of the image are
not recognized. This is the reason CNN is useful: it extracts the relevant image content
and leaves out the noise. When we increase the image size, the number of trainable
parameters increases drastically in ANN, and it fails to capture the features. However,
CNN extracts the spatial features from an image. Hence, ANN does not scale well with
input size and requires large computation power when taking images as inputs for
training.
Further, RNN models have less feature compatibility when compared to CNN. RNN
feeds the same weights and biases to all layers to transform the independent
activations into dependent activations. This helps to lower the complexity due to
parameter increase, and to remember the output from each of the prior layers that is
fed into the next hidden layer.
REVIEW QUESTIONS
1. Compare and contrast the following architectures:
artificial neural networks, recurrent neural networks, and convolutional neural
networks
2. Describe the main layers in a CNN and their process.
3. What is the importance of padding in a CNN?
4 State-of-the-Art Deep Learning Models: Part II
DOI: 10.1201/9781003390824-4
these multi-layered networks of neurons can handle non-linearly separable data. The
complex non-linearity relations between input and output data are handled by the
hidden layers.
The neurons calculate a weighted sum of inputs and then apply activation
functions for the normalization of the summation. Thus, each neuron has weights
that are associated with that neuron, and these weights are learned during the process
of training. Figure 4.2 shows the calculations associated with a neuron. The neurons
use the activation function to learn linear or non-linear decision boundaries. This
also has the effect of normalizing, which prevents the output of each neuron from
becoming very large after several layers due to the cascading effect. The widely used
activation functions are sigmoid, tanh, and ReLU.
Let us consider the model learning process. When the training data are passed
through the network, the actual and predicted outputs are compared. The identified
value of error is used to change the weights of neurons in such a way that minimizes
the error gradually. This process is done using backpropagation. Then batches of data
are iteratively passed through the network, updating the weights until the error is
minimized. The hidden layers in the network use a variety of functions for data
transformation. Therefore, each hidden layer is focused on a given output based on a
feature. Finally, the output layer gives out the prediction.
4.2 MULTI-LAYER PERCEPTRONS
A multi-layer perceptron (MLP) is a fully connected feed-forward neural network.
Artificial neural networks consist of neurons, which are also known as perceptrons.
A perceptron is a single-layer neural network consisting of input values, weights, and
bras, a total sum, and an activation function. Perceptrons are useful in classifying lin-
early separable datasets and encounter problems when non-linear functions, such as
the XOR function, are performed. The MLPs have the same input and output layers
but may have multiple hidden layers in between the layers, as depicted in Figure 4.3.
The MPLs break these restrictions and classify data that is not linearly separable; thus,
supporting the binary classification with supervised learning of complex datasets.
The MLP algorithm passes inputs forward through the network by calculating the dot
product of the inputs and weights. The dot product provides a value at the hidden
layer, to which the activation function of that layer is applied. The MLP then passes
the calculated value to the next layer, and these steps are repeated until the output is
generated. The output is used either for backpropagation to train the model further
or for the testing process to make decisions. MLPs provide the foundation for neural
network processing that enables computer vision systems to solve far more complex
problems than the XOR problem. Although MLPs are widely used for regression,
they are not well-suited to identifying patterns in sequential and multi-dimensional data,
as the spatial information of the input is not considered in the MLP architecture.
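To make the forward pass concrete, the following sketch shows an MLP with one hidden layer that computes the XOR function, which a single perceptron cannot represent. The weights are picked by hand for illustration (they are not learned), and the step activation is an assumption made to keep the arithmetic easy to follow.

```python
import numpy as np

def step(z):
    # Step activation: outputs 1 where the weighted sum is positive, else 0.
    return (z > 0).astype(float)

# Hand-picked weights (illustrative, not learned).
# Hidden unit 1 behaves like OR, hidden unit 2 behaves like AND.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # shape: (inputs, hidden units)
b1 = np.array([-0.5, -1.5])
# The output unit fires when OR is true and AND is false, which is XOR.
W2 = np.array([1.0, -1.0])
b2 = -0.5

def mlp_forward(X):
    hidden = step(X @ W1 + b1)     # dot product of inputs and weights, then activation
    return step(hidden @ W2 + b2)  # same steps repeated at the output layer

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
xor_out = mlp_forward(X)
```

In practice the weights would be found by backpropagation with a differentiable activation such as sigmoid or ReLU, rather than being set manually.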
Figure 4.4 shows a layout comparison of MLP, CNN, and RNN. In comparison
to the CNN architecture with convolutional and pooling layers, one of the key
features of MLPs is the ability to regularize the model to prevent overfitting.
A function named dropout is used for this purpose. This layer discards a portion
of units at random based on a given rate. Consider an example with 256 units in
the first layer. If the dropout rate is 0.4, then on average only (1 − 0.4) × 256 ≈ 153
units will be passed to the next layer. This technique performs well
for unseen input, as the model is trained in such a way that it will make correct
predictions even with missing data.
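The dropout example with 256 units and a rate of 0.4 can be sketched as follows. This is a minimal version of the commonly used "inverted dropout" variant, in which the surviving activations are rescaled during training; the rescaling and the function name are assumptions and are not taken from the text.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the units during training."""
    if not training or rate == 0.0:
        return activations            # no units are dropped at test time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the kept units so that the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
layer_output = np.ones(256)           # 256 units in the first layer
dropped = dropout(layer_output, rate=0.4, rng=rng)

kept_units = int(np.count_nonzero(dropped))
expected_kept = (1 - 0.4) * 256       # 153.6 units on average
```

Because the mask is random, `kept_units` varies from call to call around `expected_kept` rather than being exactly 153 every time.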
MLP is now deemed insufficient for complex computer vision applications, as
the number of parameters grows with the product of the numbers of perceptrons in
adjacent layers, which creates redundancy in high dimensions. The loss of spatial
information is another limitation, as an MLP takes flattened vectors as input. However,
a lightweight MLP with two to three layers can still provide good accuracy levels.
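The growth in parameters can be checked with a short calculation. The layer sizes below are hypothetical and only serve to compare a small grayscale image with a larger color image fed to the same fully connected network.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Flattened 28x28 grayscale image -> 256 hidden units -> 10 classes.
small_image = mlp_param_count([28 * 28, 256, 10])

# Flattened 224x224 RGB image -> 256 hidden units -> 10 classes.
large_image = mlp_param_count([224 * 224 * 3, 256, 10])
```

Under these assumptions, the first layer alone accounts for almost all parameters of the larger model, which illustrates why flattened image inputs scale poorly in an MLP.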
Generative models capture the joint probability p(X, Y), which describes
the probability of observing the given input vector and the label together. In contrast,
discriminative models measure the conditional probability p(Y|X), that is, the
probability of label Y given the input vector X. Looking more closely at
the dataset, a generative model considers the dataset as a whole and
indicates how likely a given example is. On the other hand, discriminative
models ignore the whole-dataset aspect and indicate how likely a given label
is to apply to the data instance.
GANs are useful as an image augmentation method since they can generate
new data that match the original dataset. Early in the training process, the generator
produces obviously fake data instances and the discriminator quickly classifies
them as fake. The generator then learns to generate data instances
that are progressively closer to the input data so that it can mislead the discriminator.
When the generator completes its learning, the discriminator fails to distinguish the
actual data from the data instances generated by the generator, hence its accuracy
decreases. The process view of the generative adversarial model is depicted in
Figure 4.5. Here, the output of the generator is fed to the discriminator. The output
of the discriminator, which is the classification signal, is used for backpropagation to
update the weights of the generator. Backpropagation takes into account the impact of
each generator weight, which depends on the discriminator weights it feeds into.
Subsequently, the generator produces fake data that are similar to the input data based
on the feedback received from the discriminator. The loss function of the GAN is a
combination of the loss of the generator and the loss of the discriminator. Here, the
discriminator loss penalizes the misclassification of real data as fake and of fake
data as real.
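The two losses can be sketched with binary cross-entropy. This is a minimal illustration under stated assumptions: the generator loss uses the non-saturating form that is common in practice, and the discriminator outputs below are made-up probabilities rather than results from a trained model.

```python
import numpy as np

def bce(probs, targets, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1 - targets) * np.log(1 - probs)))

def discriminator_loss(d_real, d_fake):
    # Penalizes real data classified as fake and fake data classified as real.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # The generator is rewarded when the discriminator labels its output as real.
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs (probability that a sample is real).
early_real, early_fake = np.array([0.9, 0.8]), np.array([0.1, 0.2])  # obvious fakes
late_real, late_fake = np.array([0.5, 0.5]), np.array([0.5, 0.5])    # fooled discriminator

early = (discriminator_loss(early_real, early_fake), generator_loss(early_fake))
late = (discriminator_loss(late_real, late_fake), generator_loss(late_fake))
```

With these numbers, the generator loss falls and the discriminator loss rises between the early and late stages, matching the behavior described in the text.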
However, the training of the generator can be affected by vanishing gradients if
the discriminator is too good. That is, an optimal discriminator may not provide
sufficient information for the generator to improve. As solutions, different
loss functions can be used, such as the Wasserstein loss, which is designed to prevent
vanishing gradients even when the discriminator is trained to optimality, or the
modified minimax loss, which deals with the same problem.
When the generator begins to produce the same output continually, the discriminator
will normally learn to reject that output. However, there can be scenarios where the next
iteration of the discriminator is stuck in a local minimum without progressing towards
the optimal solution. In this case, the generator finds it easy to produce the output
that is most acceptable to the current discriminator in the next iteration. Accordingly,
the generator overoptimizes for a given discriminator in each iteration, and the
discriminator never learns to overcome this. Consequently, the generator rotates over
a small set of output types and the GAN starts to fail, which is known as mode collapse.
This can be addressed by using the Wasserstein loss, or by using unrolled GANs, which
utilize a generator loss function that incorporates both the current discriminator's
classification and the outputs of future discriminator versions, so that the generator
cannot overoptimize its outputs for a single discriminator.
4.4 VARIATIONS OF CNNS
4.4.1 Residual Networks (ResNet)
In recent years, many complex problems have been solved by adding more
layers to neural networks to increase model performance. The key idea behind this
approach is that the additional layers progressively learn more complex features.
Therefore, a deeper architecture is expected to solve more complex problems with
improved performance. However, when more layers are added, the
accuracy of the model saturates and then degrades gradually. The reason for
this could be the optimization function, the initialization of the network, or the
vanishing gradient problem. The residual network (ResNet) model, which is made up of residual
blocks, addresses these training issues. Figure 4.6 shows an instance of a residual
block and Figure 4.7 shows the architecture of the ResNet50 model. Generally,
ResNet performs well on image recognition and tasks associated with localization.
The ResNet50 model consists of 50 layers and over 23 million trainable parameters.
This is a fast-performing model and can train many layers without increasing the
training error percentage.
As the core concept, the residual block uses a skip connection, which is a direct
link that bypasses some of the middle layers. In a general neural network without a skip
connection, the input value (x) is multiplied by the weights of the layer and a bias
term is added, as in (4.1), where f() is the activation function and H(x) is the output.
When the skip connection is present, the output changes as in (4.2). However,
the input dimensionality might differ from that of the output, which mainly happens with
convolutional layers and pooling layers. This can be addressed by applying a 1×1
convolutional layer to the input so that it fits the output dimensions, as in (4.3), where
w1 denotes the additional parameters.
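Equations (4.1) to (4.3) are not reproduced in this text. The sketch below therefore assumes the usual forms, namely H(x) = f(wx + b) for a plain layer, H(x) = f(wx + b) + x with a skip connection, and H(x) = f(wx + b) + w1·x when a projection is needed. A dense layer with a ReLU activation stands in for the convolutional layers, and all names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_layer(x, w, b):
    # Assumed form of (4.1): H(x) = f(w x + b)
    return relu(x @ w + b)

def residual_block(x, w, b, w1=None):
    # Assumed form of (4.2): H(x) = f(w x + b) + x
    # Assumed form of (4.3): H(x) = f(w x + b) + w1 x, used when dimensions differ
    shortcut = x if w1 is None else x @ w1
    return relu(x @ w + b) + shortcut

x = np.array([1.0, -2.0, 3.0, 0.5])

# With zero weights the residual block reduces to the identity mapping.
identity_out = residual_block(x, np.zeros((4, 4)), np.zeros(4))

# When the output has 3 units instead of 4, a projection w1 matches the shapes.
rng = np.random.default_rng(0)
projected_out = residual_block(x, rng.normal(size=(4, 3)), np.zeros(3),
                               w1=rng.normal(size=(4, 3)))
```

The first call shows why extra residual blocks need not hurt: if the learned transformation contributes nothing, the block simply passes its input through.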
Let us see how ResNet solves the vanishing gradient problem. ResNet allows the
deep neural network to use alternative shortcut paths for the gradient to flow through.
These skip connections in residual blocks make it easy to learn identity functions,
ensuring that the higher layers perform at least as well as the lower layers. Thus, the
ResNet model handles the vanishing gradient problem using identity mapping. The skip
connections serve as gradient superhighways, allowing the gradient to flow freely and
to propagate through many layers before being attenuated to tiny or zero values.
Another issue in model training is the larger parameter space to be optimized:
naively adding layers can increase the training loss.
4.4.2 Inception Model
We expect neural networks to perform well with large-scale and multi-scale
convolutional layers. The model learning is based on the Hebbian principle, where
neurons that activate together connect with each other and create a neural network to
learn a new aspect. Accordingly, the premise of the Inception models is that, for neural
networks to be highly performant, they should scale well in all aspects.
The Inception-v3 model was introduced by Google in 2015 with 42 layers. It is a
commonly applied CNN model mainly for image classification. This model provides
high-performance gain on CNNs with a lower error rate. It utilizes the resources
efficiently with minimal growth in computational load. Also, it supports feature
extraction at varying scales with various sizes of convolutional filters. It is capable of
handling the vanishing gradient problem.
An Inception model is a DNN with repeating components, as shown in Figure 4.8.
Here, a 1×1 convolution is used for the dimension reduction of the data, which enables
the expansion of the depth and width of the network. The 3×3 and 5×5 convolutions
provide different convolutional filter sizes, which enable the model to learn spatial
patterns at various scales across the depth and width of the input. The
Inception model architecture is shown in Figure 4.9, and the benefits of the Inception
model can be listed as follows.
4.4.3 GoogLeNet
GoogLeNet is a CNN that is 22 layers deep, consisting of convolutional and pooling
layers, and is built as a variant of the Inception model. It is a network with parallel
concatenations and is widely used for object classification. The importance of this
architecture is that it performs on par with the state-of-the-art models, with
increased depth and width, while maintaining the computational
budget at a constant level. In practice, a model used in image recognition or object
detection tasks should be efficient and should be deployable on low-resource devices,
such as mobile phones, for the application to be of actual use. We have learnt that
increasing the number of layers in a neural network often leads to a high performance
gain; however, it is computationally intensive.
More importantly, large networks are prone to overfitting and face either a vanishing
gradient problem or an exploding gradient problem. The GoogLeNet model addresses the
challenges of large networks by utilizing the Inception module to increase
computational efficiency. As shown in Figure 4.10, this architecture has 22 layers, which
extends to 27 layers when pooling layers are included. Part of these layers is composed
of nine Inception modules.
4.4.4 Xception Model
The Xception model is an efficient architecture in terms of computational time, as
it involves depthwise separable convolutions. This is an extension of the Inception
architecture, where the Inception modules are replaced with depthwise separable
convolutions, each being a depthwise convolution followed by a pointwise convolution.
Figure 4.11 shows the architecture of an Xception model, where the plus sign denotes
elementwise addition. However, it is expensive to train.
4.4.5 DenseNet Model
The DenseNet architecture consists of dense connections between layers as shown in
Figure 4.12. Here, the dense blocks directly connect each of the layers with the same
size of feature maps. This feature addresses the vanishing gradient issue in very deep
models and mitigates the resulting decline in accuracy. As shown in Figure 4.12, each
layer acquires additional inputs from all the prior layers and sends their feature maps
to all successive layers. Here, c denotes the channelwise concatenation.
One of the advantages of DenseNet is its powerful gradient flow. Here, the loss is
transmitted to the earlier layers easily and directly, so the earlier layers implicitly
obtain direct supervision from the final layer. In a DenseNet, each layer gets the
combined knowledge of the feature maps from all the previous layers. This allows
the network to be narrower and more compact. This model provides efficient computation
and parameter usage, as each layer can have few channels. Additionally, it
supports diversified features and complex patterns of data, as each layer gets all the
prior layers as inputs. Consequently, DenseNet provides smooth decision boundaries
and hence performs well even with a smaller amount of input data.
Although there are many advantages, DenseNet requires heavy GPU memory due to
the concatenation operations.
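The channelwise concatenation inside a dense block can be sketched as follows. The random 1×1 convolution followed by ReLU is only a stand-in for the real layers of a DenseNet, and the parameter name `growth_rate` (the number of feature maps each layer adds) is an assumption introduced for this illustration.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """x has shape (channels, height, width); feature-map size stays fixed."""
    features = [x]
    for _ in range(num_layers):
        # Each layer receives the concatenation of all earlier feature maps.
        layer_input = np.concatenate(features, axis=0)
        # Stand-in for the real layer: random 1x1 convolution followed by ReLU.
        weights = rng.normal(size=(growth_rate, layer_input.shape[0]))
        new_maps = np.maximum(np.einsum('oc,chw->ohw', weights, layer_input), 0.0)
        features.append(new_maps)
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))
out = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
# Output channels: 16 + 4 * 12 = 64, with the input maps preserved at the front.
```

The sketch also hints at the memory cost mentioned above, since every concatenation creates a new, larger tensor that must be kept for the later layers.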
4.4.6 MobileNet Model
MobileNet is a simple, end-to-end transparent pipeline architecture. It provides
a lightweight and fast neural network with few parameters and uses depthwise
separable convolutions as shown in Figure 4.13. Additionally, the number of
parameters can be further reduced by using dense blocks. Generally, a filter in
a convolutional layer is applied across all the channels of the input. This is done
by taking the weighted sum of the input pixels under the filter. Then
the filter slides to the next input pixels over the image. A MobileNet model uses this
convolution only in the first layer. The next set of layers is the depthwise separ-
able convolutions consisting of both depthwise and pointwise convolutions. The
depthwise convolution applies convolution for each channel of the input sep-
arately to filter the input channels. The pointwise convolution uses a 1×1 filter
to merge the output channels from the depthwise convolution and produce new
features. Since this requires fewer computations, these models are best suited for
embedded and mobile devices.
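A depthwise separable convolution can be sketched in NumPy as below. This is a minimal version that assumes stride 1, no padding, and no bias terms; the function names and tensor sizes are illustrative only.

```python
import numpy as np

def depthwise_conv(x, dw_filters):
    """Apply one k x k filter per input channel. x: (C, H, W), dw_filters: (C, k, k)."""
    c, h, w = x.shape
    k = dw_filters.shape[1]
    out = np.zeros((c, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = x[:, i:i + k, j:j + k]
            # Each channel is filtered separately; channels are not mixed here.
            out[:, i, j] = np.sum(patch * dw_filters, axis=(1, 2))
    return out

def pointwise_conv(x, pw_filters):
    """1x1 convolution that merges channels. pw_filters: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', pw_filters, x)

c_in, c_out, k = 3, 5, 3
x = np.ones((c_in, 8, 8))
features = pointwise_conv(depthwise_conv(x, np.ones((c_in, k, k))),
                          np.ones((c_out, c_in)))

# Parameter comparison (bias terms ignored).
standard_params = k * k * c_in * c_out            # one k x k x C_in filter per output
separable_params = k * k * c_in + c_in * c_out    # depthwise + pointwise
```

For these sizes the standard convolution needs 135 weights while the separable version needs 42, and the gap widens as the number of channels grows.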
4.4.7 VGG Model
Visual Geometry Group (VGG) is an object recognition model based on a
CNN with multiple layers. VGG is a deep NN that has been trained on large datasets
of varied images for complex classification tasks. Thereby, it is one of the widely
used vision models. Based on the depth of the model, there are VGG-16 and VGG-19
with 16 and 19 weight layers, respectively. As shown in Figure 4.14, it repeatedly uses
a set of convolution and max pool layers. There are two fully connected
layers and a Softmax layer before the output. The VGG model focuses on how the
depth of the CNN affects the accuracy of image classification tasks and how to reduce
the number of parameters. Therefore, VGG uses a small receptive field, namely 3×3
convolutional kernels, in all its layers. There are also 1×1 convolutional filters that
act as linear transformations of the inputs, which are followed by ReLU activations.
TABLE 4.1
Comparison of ResNet, Inception, and VGG Networks
The selection of a neural network and hyperparameter tuning should be done with
care. When we add more layers to the model and increase the depth of the network,
the accuracy may be reduced due to the vanishing gradient problem. Here, the derivatives
become insignificant during backpropagation. In addition, we use dropout layers
to address overfitting. A dropout layer drops connections at a specified rate during
training to avoid local minima. However, it roughly doubles the number of iterations
needed for convergence.
In general image classification, the salient features in an image can vary in size.
In that case, we may not be able to use a fixed kernel (filter) size. We use large
filter sizes when global features are distributed over a large region of
the image. In contrast, small filter sizes perform well in identifying area-specific
features in the image. Therefore, different filter sizes are used in the same layer to
identify variable-sized features, as in the Inception model. This process increases the
width of the network, instead of the depth.
Although CNNs are used in a wide variety of image classification tasks, they have
a few drawbacks that are crucial in computer vision problems. For instance, CNNs
can precisely identify image edges using the initial layers and complex features using
deeper layers. However, a CNN is limited in identifying the spatial composition of
the identified features. Therefore, in tasks where the spatial composition
is important for classification, CNN models may not give the best results. In
such situations, the pooling function that connects the layers becomes inefficient in
detecting relationships between the object parts. Here, the pooling loses connectivity
information, as the pools do not overlap.
Consider the following example of identifying a face. If we use a CNN to recreate
the same image, it may output a representation as in Figure 4.15, which has
not considered the positional or instantiation information of the image. In general, a
CNN tries to identify the features in the image, but not their relative position. This is
due to the feature extraction using max-pooling without overlapping areas. Therefore,
it is important to use an approach, where the pools are overlapped to preserve the
positions of features.
4.5 CAPSULE NETWORK
A capsule network (CapsNet) is an extension of an ANN that supports hierarchical
connections and preserves spatial composition. CapsNet overcomes the loss of
information that is seen in pooling operations, hence retaining more of the important
features. A capsule outputs a vector, which has both a length and a direction. The
probability of the existence of a given instance is given by the length of the vector.
The instantiation parameters, such as position, orientation, scale, and color, are given
by the orientation of the vector. Capsules are the functional elements that perform
inverse rendering, which is the extraction of instantiation parameters. Therefore, a
capsule predicts the existence and the instantiation parameters of an object in a given
location.
Standard neural networks use neurons to extract features, while capsule
models capture only the crucial image details. Compared to general neurons that
generate scalar quantities, capsules generate vectors that can represent the direction
of a given feature. When the position of a given feature changes, only the direction
of the vector changes accordingly, while the length of the vector remains
the same. These models provide good results with small datasets as well, and are
robust to changes in the images.
The term ‘routing-by-agreement’ (dynamic routing) describes the routing of
capsule outputs between two consecutive layers and addresses the inefficiency of
max-pooling due to information loss. Here, the capsules of the first layer predict the
output of the second layer. Consider an example of an image of a person. Here, only the
lower-level features with matching content, such as the mouth and eyes, are sent to the
corresponding higher-level capsule. That is, the features representing the eyes and mouth will be
sent to the layer representing the features of a face and the features, such as fingers,
nails, and palms, will be sent to the corresponding layer of a hand. Thus, the process
encodes spatial details into the features with dynamic routing.
Therefore, a capsule network learns the features by reproducing the image using
the extracted features. It learns the prediction of the output by reproducing the
expected object and assessing it against a labeled instance in the training set. Since
the generated predictions are selected based on the highest expected probability
among most of the capsules, a clean input is sent to the next layer, and fewer
capsule layers are required to achieve good results. This in turn reduces the training
time and resources. Also, by navigating backwards through the network, the position
of the elements in the object can be precisely derived; hence, image reconstruction is
feasible. This process leads to better prediction of the instantiation parameters.
The dynamic routing function acts as a dual loss function to get better predictions.
It moves the feature around the image, calculates the probability for each of the
positions, and places the feature in the most appropriate location. This is known as
equivariance between capsules. In a capsule network, the process starts with the capsule
calculating the dot product of the weight matrix and the input vectors. During the
routing process, the capsules in the lower level pass their outputs to the capsules in the
higher level, encoding the spatial relationship among the layers. Then dynamic
routing is used to select a parent capsule for a given capsule, ensuring that the
output of a given capsule goes to an appropriate parent in the layer above. Here, a
function is used to squash the length of a vector to a value between 0 and 1, while
keeping the same direction.
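The squashing step can be sketched with the form commonly used in capsule networks, in which a vector s is scaled by ||s||² / (1 + ||s||²) along its own direction. The text does not state the exact function, so this choice, the small constant `eps`, and the example vectors are assumptions.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink the length of s into the range [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # approaches 1 for long vectors, 0 for short ones
    return scale * s / np.sqrt(sq_norm + eps)  # unit direction multiplied by the new length

long_vector = np.array([3.0, 4.0])     # length 5
short_vector = np.array([0.01, 0.0])   # length 0.01

v_long = squash(long_vector)
v_short = squash(short_vector)

length_long = np.linalg.norm(v_long)    # close to 25/26
length_short = np.linalg.norm(v_short)  # close to 0
```

Since the resulting length lies between 0 and 1, it can be read as the probability that the feature represented by the capsule is present.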
The architecture of a capsule network consists of six layers comprising an encoder
and a decoder. The first three layers correspond to the encoder and they transform the
input into a 16-dimensional vector.
As shown in Figure 4.16 these layers can be explained as follows.
1. The first layer of the encoder is a Conv layer, and it extracts the basic features
of an image that can later be analyzed by the capsules. It has 256 kernels of
size 9×9×1.
2. The second layer is the PrimaryCaps Network, which consists of a number
of capsules, takes the basic features, and finds more detailed patterns
with spatial relationships. This is the lower-level capsule layer that contains
32 capsules. Each of these capsules applies eight 9×9×256 convolutional
filters to the result of the prior convolutional layer and produces a 4D vector
output.
3. The third layer is the DigitCaps Network, which contains different capsules.
This is a high-level layer, where the primary capsules are transmitted in
dynamic routing. After these layers, the encoder has a 16-dimensional vector
with the required details to render the image that goes to the decoder.
4.6 AUTOENCODERS
An autoencoder is an extension of ANN that efficiently learns a data representation
(coding) of unlabeled data through unsupervised learning while ignoring the noise
signals. Generally, an autoencoder is a dimensionality reduction feedforward neural
network where the target output is the same as the input. It is mainly used for denoising,
compression, and data generation of images. As shown in Figure 4.18, it has three
modules, namely encoder, code, and decoder. The encoder, which is a fully connected
ANN, compresses the input to generate a latent-space representation, which is a
reduced-dimensionality code. The decoder uses this compressed code to rebuild the
input by refining and validating the encoded content. Figure 4.19 shows the
architectural visualization of an autoencoder. The code is a single layer, and the size of
the code, that is, the number of nodes in the code layer, is a hyperparameter that
needs to be defined before the training process. The bottleneck architecture restricts
the amount of data that traverses the network and forces it to prioritize the
features of the input that should be copied. The output is closely similar to the input,
and both have the same dimensionality.
Four types of hyperparameters need to be defined before the training process:
• The number of layers: the depth of the model can vary based on the problem.
In Figure 4.19, both the encoder and decoder have two layers, except the input
layer and output layer.
• The number of nodes in each layer of the encoder and decoder: in the
encoder, the number of nodes per layer decreases with each subsequent
layer. In the decoder, the number of nodes in the subsequent layers
increases correspondingly. Therefore, the layer structure in the encoder and
decoder is symmetric.
• Code size: indicates the number of nodes in the middle layer (code). More com-
pression is provided when the code size is small.
• Loss function: uses cross-entropy loss when the input values are between zero
and one. If not, the mean squared error is used.
The main properties of autoencoders are as follows:
• Data-specific: autoencoders can only compress data that are similar to the data
they were trained on.
• Self-supervised: they do not need explicit labels to train, as the labels are
generated from the training data.
• Lossy: the output is a close but degraded representation of the input.
• Complete autoencoders
An autoencoder is trained to copy the original data while retaining only the useful
features. In complete autoencoders, more commonly known as undercomplete
autoencoders, the dimension of the code layer is lower than the
dimension of the input. The learning process of an autoencoder tries to minimize its
loss function. In this way, it can learn the most important features of the input data
and efficiently reconstruct the original data, as the model is penalized based on the
reconstruction error.
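The structure described above can be sketched as a forward pass through a symmetric encoder and decoder. The layer sizes (784, 128, 64, and a code size of 32), the sigmoid activations, and the mean squared error are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_layers(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    for weights, bias in layers:
        x = sigmoid(x @ weights + bias)
    return x

rng = np.random.default_rng(0)
encoder_sizes = [784, 128, 64, 32]    # node count shrinks towards the code layer
decoder_sizes = encoder_sizes[::-1]   # symmetric layer structure in the decoder

encoder = build_layers(encoder_sizes, rng)
decoder = build_layers(decoder_sizes, rng)

x = rng.random((4, 784))              # four hypothetical inputs with values in [0, 1]
code = forward(x, encoder)            # compressed latent-space representation
reconstruction = forward(code, decoder)

# Reconstruction error that training would try to minimize.
reconstruction_error = float(np.mean((x - reconstruction) ** 2))
```

Training would adjust the weights by backpropagation so that `reconstruction_error` decreases; the sketch only shows how the bottleneck forces the input through a 32-node code.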
• Regularized autoencoders
Encoders and decoders with too much capacity may fail to learn useful information
from the data. A similar problem may occur when the hidden code has a dimension
equal to that of the input. Also, in the overcomplete case, where the hidden code has
more dimensions than the input, even a linear encoder and decoder can
learn to replicate the input as the output, instead of learning effective features.
Therefore, it is important to select the dimension of the code and the capacity of the
encoder and decoder based on the data complexity. Regularized autoencoders
utilize a loss function that discourages copying the input data to the output during
learning, instead of reducing the capacity of the model by maintaining a small and
shallow encoder and decoder. These autoencoders can be nonlinear and overcomplete,
and still learn important information from the data.
• Sparse autoencoders
• Denoising autoencoders
Denoising autoencoders learn useful features by modifying the reconstruction term
of the loss, rather than adding a penalty as in sparse autoencoders or having a small
code layer as in complete autoencoders. Denoising autoencoders add random Gaussian
noise to the original images. These noisy data are fed as input to the autoencoder,
which regenerates the output without noise. Since the input contains noise, the
autoencoder cannot simply copy the input to the output. Thus,
denoising autoencoders remove the noise and produce clean data, as shown
in Figure 4.20. The architecture of a denoising autoencoder is similar to that of a regular
autoencoder, except for the loss function. Denoising autoencoders use the function
L(x, g(f(x̂))), where x is the original input, x̂ is the input with noise, f is the
encoder, and g is the decoder. Autoencoders are useful for compressing the input.
However, they are not commonly used for compression in real applications
because of their data-specific nature. Instead, they are used as a preprocessing step
for dimensionality reduction.
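The denoising loss L(x, g(f(x̂))) can be sketched as follows. The mean squared error, the noise level, and the two toy encoder/decoder pairs are assumptions made only to show how the loss compares the clean input with the reconstruction of the noisy input.

```python
import numpy as np

def add_gaussian_noise(x, std, rng):
    """Create the corrupted input by adding Gaussian noise, clipped to [0, 1]."""
    return np.clip(x + rng.normal(scale=std, size=x.shape), 0.0, 1.0)

def denoising_loss(x_clean, x_noisy, encoder, decoder):
    """L(x, g(f(x_noisy))): the target is the clean input, not the noisy one."""
    reconstruction = decoder(encoder(x_noisy))
    return float(np.mean((x_clean - reconstruction) ** 2))

rng = np.random.default_rng(0)
x_clean = rng.random((4, 16))                      # hypothetical clean inputs
x_noisy = add_gaussian_noise(x_clean, std=0.1, rng=rng)

def identity(values):
    return values

# A model that merely copies its (noisy) input is penalized by the loss.
loss_when_copying = denoising_loss(x_clean, x_noisy, identity, identity)

# A hypothetical perfect denoiser that always returns the clean data has zero loss.
loss_when_perfect = denoising_loss(x_clean, x_noisy, identity,
                                   lambda code: x_clean)
```

The comparison of the two loss values shows why copying is no longer a good strategy once the input is corrupted.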
4.7 TRANSFORMERS
Transformers are used to transform an input sequence into an output sequence.
This process is known as sequence transduction. Transformers are often applied in
real-life applications such as speech recognition and other sequential data processing
applications like text-to-speech transformation. These tasks require an understanding
of the dependencies and connections within the given context. Thus, transformers
utilize a self-attention mechanism that applies different weights to specific regions
of the input. The transformer model is designed in such a way that its training can be
parallelized over both data and model. Thus, these models are more efficient than RNNs
such as long short-term memory (LSTM). The effect and efficiency are balanced using
an encoder-decoder architecture.
First, let us refresh your understanding of RNN and LSTM. As we have learned,
RNNs have loops in them that act as memory, as shown in Figure 4.21. An RNN creates
and connects several copies of the same model to provide the prediction. Therefore,
the sequence data are handled in a chain-like manner. Consider an application that
translates a set of words, where each word in the text is a given input. The RNN sends the
details of the prior words to the subsequent copy of the model for further processing.
Generally, RNNs do not perform well when there is a large gap between the relevant
knowledge and the location where it is used. Also, since the details are passed through
each copy of the model, information is more likely to be lost as the chain becomes longer.
When processing the information presented in the data, RNNs completely convert
the data into new forms of information by applying
a function. Therefore, the important information may not persist. Also,
RNNs have gradient issues. To address this, the LSTM architecture has been introduced.
It is a variety of RNN that learns long-term dependencies, which is required for the
prediction of data sequences. The feedback connections of the LSTM support processing
the complete data sequence, instead of a single data point as in images. An LSTM
modifies the data only slightly by performing a few operations, such as multiplications
and additions. The information flow mechanism of LSTM is known as the cell state,
through which the LSTM selectively remembers or forgets information depending on
whether it is crucial or not. Figure 4.22 shows the architecture of an LSTM.
Figure 4.23 shows a comparison of RNN and LSTM. We explained the architecture
of RNNs, which consists of a sequence of repeating neural network modules.
In the original RNN architecture (left figure), these duplicating modules represent a
simple structure as indicated by the tanh layer. In contrast, in the LSTM architecture
(right figure) each duplicate module has a different structure with four neural network
layers.
The same challenge of handling longer sentences faced by RNNs applies to
LSTMs as well. For long sentences, the performance degrades as the distance
increases between a word in the context and the point where that word is processed.
Another issue in both types of models is that, since they process the input word by
word, it is difficult to process sentences in parallel.
Additionally, the long- and short-range dependencies are not modeled explicitly by
these models. This can be addressed by the concept of attention, which takes into
account both the linear distance among the positions and the dependencies in long and
short ranges. In this mechanism, special attention is given to the word currently being
translated while the translation is happening. In a neural network, this can be applied
in several ways.
For instance, in an RNN, there is a hidden state for each word, which is sent to the
decoder, instead of encoding the complete sentence in a single hidden state.
CNNs address a couple of these problems through their inherent features. For
instance, the translation process can be parallelized, the distance between the positions
becomes logarithmic, and the local dependencies can be exploited. However, CNNs
do not handle the dependency issues in sentence translation. In order to address these
issues, transformers combine the parallel processing of CNNs with the concept of
attention. Figure 4.24 shows a transformer architecture consisting of an encoder
and a decoder. Each of these components is built so that it can be stacked on top of
itself multiple times, analogous to the chain architecture
introduced in RNN, as shown in Figure 4.25.
The encoder has a feed-forward neural network and a self-attention layer or
multi-head attention layer. The attention technique draws connections between any
parts of the sequence and addresses the issues with long-range dependencies. With
transformers, long-range dependencies have the same likelihood of being considered
as short-range dependencies. The attention technique addresses this dependency
by letting the decoder refer back to the hidden states of the encoder based on its
current state. Informally, this can be considered as a variation of the encoder-decoder
model with bi-directional LSTM. Therefore, the decoder can retrieve only the crucial
details of the input and learn complex dependencies between the input and the output.
Attention is calculated by dividing the inner product of the sender (query) and the
receiver (key, or target word of the attention) by the square root of the key dimension.
Each word computes its attention to all the other words by matching its query against
their keys. When the vectors have many dimensions, the inner products become
larger, so the division by the square root acts as a variance balance. By applying the
Softmax function to the attention matrix, the normalized matrix can be derived.
A feed-forward network is applied to each attention vector or matrix that is created,
to convert it into a structure that can be fed into the next encoder layer or the
decoder. The decoder network also consists of a stack of multi-head attention and
feed-forward networks, and ends with linear and Softmax layers that act as a
classifier, as in a CNN architecture. Additionally, the decoder block has a
masked (hidden) multi-head attention block.
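The attention calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the book's implementation: the function names and the toy vectors are assumptions, a single attention head is used, and the same vectors serve as queries, keys, and values.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    largest = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(score - largest) for score in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    d_k = len(keys[0])  # dimension of the key vectors
    outputs, weights = [], []
    for q in queries:
        # Inner product of the query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Toy example: three "words" with made-up 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors.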
The vision transformer (ViT) is used for vision processing tasks. It follows the
transformers that are used for natural language processing, where the learning is
based on the relationship between input token pairs. Generally, it lacks the inductive
biases of CNNs, such as locally restricted receptive fields and translation invariance.
Translation invariance means that an object in an image is identified despite varying
appearances or positions. Recall that convolution is a linear local operator that
observes the neighboring values that are specified by the kernel.
The transformer is permutation invariant and cannot directly process grid-structured
data. Thus, a spatial non-sequential signal needs to be transformed into a sequence.
Transformers can outperform CNNs in accuracy and computational efficiency. The term
translation indicates that each pixel of the image moves in a given direction by a
fixed amount. Overall, the vision transformer initially splits an image into fixed-size
patches and embeds them. Then, as shown in Figure 4.26, it incorporates positional
embedding as an input to the transformer encoder. It uses multi-head self-attention to
remove image-specific inductive biases when learning the image features.
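The patch-splitting step can be illustrated with a small sketch. The function name and the 4 x 4 toy image are hypothetical, and pairing each patch with its index is only a stand-in for the learned positional embedding of a real vision transformer.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row into a single vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Toy 4 x 4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)

# Keep the position of each patch, since the sequence order carries the
# spatial information that the transformer itself does not have.
tokens = [(position, patch) for position, patch in enumerate(patches)]
```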
REVIEW QUESTIONS
1. Compare and contrast the following architectures.
a. Multilayer perceptrons and convolutional neural network.
b. Feed-forward neural network and multilayer perceptrons.
c. Recurrent neural network and convolutional neural network.
2. Explain potential applications of capsule neural networks and their ability to
regenerate data.
3. Explain why autoencoders are the most popular neural network element and
their usage.
4. Explain why multilayer architectures are becoming popular and their poten-
tial drawbacks.
5. What is significant about the parallel concatenation of the networks? Describe
its advantages.
5 Advanced Learning Techniques
5.1 TRANSFER LEARNING
5.1.1 Overview of Transfer Learning
Humans can learn a task and transfer the obtained knowledge to resolve associated
tasks using their inherent nature. The concept of transfer learning utilizes the learned
knowledge about a task to solve related problems, without relying on isolated learning.
Transfer learning takes a model that is trained either on the same domain with a
different task or on a different domain with the same task. Subsequently, the learning
process adapts to the considered domain and the target task, without training the model
from scratch.
Transfer learning takes a model trained on a large dataset and transfers the
obtained intelligence to another dataset. For instance, consider the task of
recognizing an object in an image using a CNN. Following the transfer learning approach,
the initial set of convolutional layers of the considered pretrained model can be frozen,
and only the layers in the latter part of the model are trained to predict the results.
Generally, the initial convolutional layers provide low-level feature extraction over
the image, such as edges, gradients, and patterns. The last set of convolutional layers
supports extracting complex features that will lead to recognizing objects correctly.
Accordingly, a model pretrained on an unrelated large dataset can be utilized for the
prediction task, because the general low-level features are common to many images, as
shown in Figure 5.1.
As shown in Figure 5.2, we can distinguish transfer learning from traditional
machine learning. Traditional learning algorithms start from scratch and depend only
on the considered dataset and the allocated application. They do not keep the learned
intelligence to be used by another model. Here, the learning is performed without
comparing the knowledge that is learned in the past with the new tasks. In contrast,
transfer learning learns new tasks by relying on the tasks that were learnt
before. It leverages the associated weights and features in the pretrained
models to train new models by generalizing the knowledge. Importantly, this can
handle the issue of insufficient datasets for new tasks. Consequently, this learning
process can be faster and more accurate.
84 DOI: 10.1201/9781003390824-5
Accordingly, the pretrained model is loaded first and the base model is instantiated.
For instance, some popular models in computer vision are ResNet, Inception,
Xception, and VGG, and in natural language processing, we have the Word2Vec and
GloVe language models. As an option, the pretrained weights can be downloaded, or
the selected model can be trained from scratch. We need to freeze a few layers of
the pretrained model so that they do not change during training. Thus, the weights
associated with these layers are not updated. If the weights are changed, then the
model will no longer be based on the prelearned knowledge and can be considered as a
model trained from scratch.
Usually, the base model contains a different set of elements in the output layer based
on the number of classes of the considered application, as the outputs of the pretrained
model and the new model are different. For example, the pretrained models are generally
trained on the ImageNet dataset and output 1000 classes. However, the new model
may have only two or three classes. Therefore, the model should be trained with a new
layer for the output. Thus, the final output layer has to be removed from the base model
and a new output layer that corresponds to the number of classes in
the considered application has to be added, as shown in Figure 5.3. Then, a set of new
trainable layers is added to train the model with the existing features to make the
predictions for the new data.
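The freeze-and-replace idea can be sketched without any deep learning framework. In this hedged toy example, `FROZEN_WEIGHTS` stands in for the pretrained layers and is never updated, while a new logistic output unit is trained on top of the frozen features. All names, weights, and data points are made up for illustration; a real application would use a pretrained network from a deep learning library.

```python
import math

# Stand-in for the pretrained, frozen layers: fixed weights, never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Apply the frozen layer (linear map followed by ReLU)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_WEIGHTS]


def predict(head, x):
    """New output layer: a logistic unit on top of the frozen features."""
    features = frozen_features(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=200, lr=0.5):
    """Train only the new output layer with stochastic gradient descent."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, y in data:
            features = frozen_features(x)
            error = predict(head, x) - y  # gradient of the log loss w.r.t. z
            head["weights"] = [w - lr * error * f
                               for w, f in zip(head["weights"], features)]
            head["bias"] -= lr * error
    return head


# Hypothetical two-class dataset for the new task.
data = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]
head = train_head(data)
```

Only the entries of `head` change during training, which mirrors replacing and training the output layer while the earlier layers stay frozen.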
After that, the model performance can be improved through fine-tuning as shown
in Figure 5.4. The features extracted by the pretrained model should be fine-tuned
to obtain features that are specific to the new task. The fine-tuning process
unfreezes a set of layers of the base model and trains the model on the entire dataset.
This starts with a low learning rate, as a large model is trained on a small dataset.
A low learning rate also avoids large weight updates that can lead to low performance.
Accordingly, it helps to prevent overfitting and increases performance.
Generally, model training over the dataset is repeated for a given number
of epochs. However, the model training could result in overfitting. As we
learnt in previous chapters, early stopping helps to address this issue. It stops the
training when the validation loss does not reduce, reaches a plateau, or starts to
increase for a consecutive set of epochs, while only the training loss reduces. To find
the generalized model for the test dataset, we compare the parameters corresponding
to each epoch during the region that reduces the validation loss and identify the
parameters with the best validation performance. Accordingly, when the learning
curve does not show any improvement, it is necessary to halt the training.
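The early stopping rule can be sketched as follows. The `patience` parameter (the number of consecutive epochs without improvement that is tolerated) and the list of validation losses are illustrative assumptions.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember this epoch's parameters.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # halt the training here
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improvement stops after epoch 3.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stop_epoch = early_stopping(val_losses, patience=3)
```

In this example the training halts at epoch 6, and the parameters saved at epoch 3 would be kept as the generalized model.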
with unlabeled data in the target domain. In transductive transfer learning, there are
some similarities between the tasks that correspond to the source and target, but the
domains are different. There is a large set of labeled data in the source domain, while
no labeled data is available in the target domain.
When transferring across these different categories, there may be a question about
what to transfer. This can be addressed by transferring parameters, instances, features,
and relational knowledge. Instance transfer reuses a set of source instances
in the target domain to enhance the result. Inductive learning applies modifications
so that the training instances of the source domain can be used to improve target tasks.
In feature representation transfer, the error or domain divergence is minimized by
identifying the most relevant features that can be effectively applied from source to
target domains. As the name implies, parameter transfer shares parameters among the
models with related tasks. Thus, an additional weight is applied to rectify the error
values in the target domain to increase the accuracy. Unlike these transfer approaches,
relational knowledge transfer mainly handles data that are not independent and
identically distributed, such as social networks.
specific features of that task. Consequently, while keeping some of the earlier layers
frozen, the transfer learning process retrains the latter layers. The final layer is a fully
connected layer that connects the outputs from the previous layers. Here, the layered
architecture utilizes the pretrained models by replacing the final layer with one that
is compatible with the outputs of the selected task.
Another strategy is using fine-tuned, pretrained models, where not only the final
layer is replaced, but some of the previous layers are also retrained. Using this
approach, we can retrain or fine-tune selected layers to have a better performance with
less time for training. As we learnt, initial layers learn basic features that are
generalizable to most types of data. The higher layers extract features that are more
specific to the considered dataset for training. The fine-tuning process helps to adapt
the specific feature representations to the new dataset. Therefore, the initial layers
can be frozen and reused with the basic knowledge derived from past training.
One-shot learning and zero-shot learning are variants of transfer learning. One-shot
learning gives only one labeled sample for the transfer task and no labeled samples
are given for the zero-shot learning task.
One-shot learning learns from one or a small number of instances to classify
many new instances and infer the required output. Generally, this method is used when
labeled data is insufficient or when there are many new classes. For instance, face
recognition applications classify faces of people with varying lighting conditions,
hairstyles, accessories, and expressions, where the model has only a small number of
images as input. Thus, it is based on the knowledge obtained by training the base model
with a small amount of data per class.
Zero-shot learning is based on unlabeled training data. It makes alterations during the
model training to generate extra details for the unseen data. This concept of transfer
learning is mainly applied to machine translation in NLP with unlabeled data in the
target language, as well as to computer vision and speech recognition tasks.
• Computer vision
Most of the image classification tasks with complex feature representations are
performed using neural networks. Mainly the fully connected layer recognizes the
image objects together with the fine-tuning of the latter set of layers. Since image
analysis tasks can be enhanced by the learned knowledge and identified patterns in
similar images, the transfer learning approaches are widely used for object detection,
image classification and captioning tasks.
• Audio processing
Transfer learning algorithms are used to solve tasks such as acoustic recognition or
translation from speech to text. For example, when the model trained for English
speech classification is running at the backend, it can be used as a pretrained model
for the French speech classification model.
• Negative transfer: the knowledge transferred from the pretrained model to the
new base model results in reduced performance. This can happen when the
source and target are completely unrelated or when the transferring process
fails to capture the relationship between the two tasks.
• Drift: the environment changes can affect the relationship between the source
and target tasks. This will reduce the performance of the model.
• Transfer bounds: the quality and practicability of the transferring process can
be improved by quantifying the transfer.
Transfer learning does not always yield high performance, for
several reasons. For instance, when the features learnt by the initial layers fail to
distinguish the output classes, performance degradation happens. Consider an image
classification task that identifies whether a door is opened or closed. The pretrained
model will support detecting whether there is a door in an image. However, it
may not be able to classify whether the door is opened or closed. In such scenarios,
the initial set of layers needs to be retrained to extract the needed feature
representation. Another cause of low performance is the removal of layers from
the pretrained model, as it reduces the trainable parameters and can lead to overfitting.
Therefore, identifying the number of removable layers while avoiding overfitting is a
process that consumes more time and effort. Further, when the datasets of the source
and the target tasks are not related, the corresponding feature-transferring process
will not perform well. Even with the above-mentioned aspects, initialization with
the pretrained weights can lead to better predictions compared to the use of random
weights.
5.2 REINFORCEMENT LEARNING
5.2.1 Overview of Reinforcement Learning
Reinforcement learning is a major approach in machine learning that uses the concept
of intelligent agents to obtain optimal results in the considered domain. Its
decision-making process observes the environment and selects actions from the action
space to maximize the rewards over time. Here, a real-world scenario is modeled as a
potentially complex environment with uncertainty, where the machine learning model
employs experiments to find possible solutions. The computational model
then acts as an agent to support decision-making, where the agent receives
rewards or penalties for the corresponding actions. The objective is to achieve the
highest total reward as shown in Figure 5.6. A detailed explanation is
given later in this chapter.
Let us see how reinforcement learning differs from traditional machine
learning as shown in Figure 5.7.
Deep learning and reinforcement learning are not mutually exclusive. Thus, there
is no strict division among these approaches, where both use computationally created
rules for autonomous problem-solving. Deep reinforcement learning is a specialization
of deep learning. Considering the difference, deep learning uses the training
dataset to learn the model and applies the learnt knowledge to an unseen related
dataset to predict results. Reinforcement learning follows a dynamic process to learn
the model and adjusts the model learning based on continuous feedback to obtain
better predictions. Thus, compared to supervised learning, which maps functions
from input to output, reinforcement learning is based on input and feedback.
other actuators. A software agent consists of programs with encoded bit strings.
Accordingly, deep learning applications with agents can be listed as face recognition
systems, self-driving cars, chatbots, and other intelligent systems. Thus, we can
define an agent as a model that uses sensors to observe the environment and uses
effectors to perform activities in that environment, where the agent is located and
interacts. However, the agent does not control the environment with its actions.
As shown in Figure 5.8, the components of a basic reinforcement learning algo-
rithm are listed as follows.
• Agent: the program or the model that can be trained to perform a given task. It
can select an action to commit in the current state.
• Environment: the world where the agent performs its actions. It offers new
inputs to the agent as a reply to the corresponding action.
• Rewards: the evaluation of an action, which can be positive or negative. This is
the incentive or cumulative mechanism returned by the environment.
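The interaction between these components can be sketched as a simple loop. The corridor environment, the reward values, and the random agent are hypothetical choices made only to show how the agent, the environment, and the rewards exchange information.

```python
import random


class CorridorEnvironment:
    """Hypothetical 1-D world with positions 0..4; the goal is position 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        """Apply an action (-1 or +1) and return (new_state, reward, done)."""
        self.position = min(4, max(0, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else -0.1  # small penalty for every extra step
        return self.position, reward, done


class RandomAgent:
    """Agent that selects an action without using the observed state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def select_action(self, state):
        return self.rng.choice([-1, 1])


def run_episode(env, agent, max_steps=100):
    """Let the agent act until the goal is reached or max_steps is used up."""
    state, total_reward = env.position, 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)      # agent chooses an action
        state, reward, done = env.step(action)   # environment replies
        total_reward += reward                   # rewards accumulate
        if done:
            break
    return state, total_reward


final_state, total_reward = run_episode(CorridorEnvironment(), RandomAgent())
```

A trained agent would replace `RandomAgent.select_action` with a policy that uses the state to pick the action with the highest expected reward.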
The policy can be improved based on the feedback that the system obtains from the
value function. This is known as policy iteration, which achieves continuous
evaluation and refinement of the policy. A model has a set of
instructions to refine the policy.
The value function identifies the feedback incentives for a given state based on
a given policy. That is, it calculates the total rewards received by an agent from a
given state based on a given policy. The value function calculates the possible value
using the following methods.
• Policy evaluation: initialize with a random policy and assess the value of the
state. Repeating this process, the value of the current state decides the next state
that will result in the best reward. After that, the model decides the required
action.
• Dynamic programming: use the reward after performing an action to calculate
the value of the following state.
• Monte Carlo: execute the policy and perform the entire task to identify all the
feedback.
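As a hedged illustration of these methods, the sketch below evaluates a fixed "always move right" policy on a hypothetical five-state chain in two ways: by dynamic programming, which reuses the value of the following state, and by Monte Carlo, which sums the rewards of one complete episode. The discount factor `gamma` and the reward values are assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Monte Carlo: total discounted reward of one complete episode."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def policy_evaluation(n_states=5, gamma=0.9, sweeps=50):
    """Dynamic programming for the policy 'always move right' on a chain.

    Each value is updated from the immediate reward and the value of the
    following state: V(s) = r + gamma * V(s + 1).
    """
    values = [0.0] * n_states  # the last state is terminal with value 0
    for _ in range(sweeps):
        for s in range(n_states - 1):
            reward = 1.0 if s + 1 == n_states - 1 else -0.1
            values[s] = reward + gamma * values[s + 1]
    return values


values = policy_evaluation()
# Rewards observed when following the same policy from state 0 to the goal.
episode_return = discounted_return([-0.1, -0.1, -0.1, 1.0])
```

Because the chain is deterministic, both estimates of the value of state 0 agree.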
However, there can be situations where the model cannot be specifically defined.
Therefore, a new function is defined as an action-value function, which calculates the
predicted rewards of an action. Since this process needs to track more data, the
learning may not be optimal. This can be addressed by using a deep Q-Network (DQN),
which approximates the action-value function with a neural network. The idea is to use
off-policy learning. It estimates a future reward by performing an action in a given
state and working towards a target policy.
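The action-value idea can be sketched with tabular Q-learning, an off-policy method. This is a simplified stand-in and not a deep Q-Network: a DQN replaces the table `q` with a neural network. The corridor task, the hyperparameters, and the function name are assumptions.

```python
import random


def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q(state, action) on a 5-state corridor."""
    rng = random.Random(seed)
    actions = [-1, 1]
    n_states, goal = 5, 4
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Behavior policy: epsilon-greedy exploration.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state = min(goal, max(0, state + action))
            reward = 1.0 if next_state == goal else -0.1
            # Target policy: greedy value of the next state (off-policy).
            best_next = 0.0 if next_state == goal else max(
                q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])
            state = next_state
            if state == goal:
                break
    return q


q = q_learning()
greedy_policy = {s: max([-1, 1], key=lambda a: q[(s, a)]) for s in range(4)}
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest way to the goal.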
Furthermore, the Markov decision process is a main element in reinforcement learning.
It formalizes sequential decision-making, where the actions taken in a state influence
not just the immediate reward but also the subsequent state. It is a very useful
framework to model problems that maximize longer-term returns by taking a sequence of
actions. The details of this process are not covered within the scope of this book.
• Fixed ratio: the reward is given after a fixed number of responses.
This leads to a steady and high response rate.
• Variable ratio: the reward is given after a random number of responses.
This leads to a steady, high response rate.
• Fixed interval: the reward is given after a predefined time interval. The
response rate is low immediately after the reinforcer has occurred and
high closer to the end of the time interval.
• Variable interval: the reward is given after a random time interval. It
shows a steady and slow response rate.
example, preparing a simulation environment for a model trained for a board game is
easier, when it is computer-based. However, creating a simulation environment for a
self-driving car is complex, as it will be deployed in the real world and has to
consider many constraints, such as preventing collisions with other vehicles.
Another challenge is based on scaling and modifying the neural network that
controls the agent. Since the communication is mainly based on feedback, it may
result in information loss, due to replacing the previous knowledge with new feedback.
The occurrence of local optimum points can be another challenge in reinforcement
learning. Here, the agent executes the tasks to obtain a high reward, without executing
the task in the correct way to achieve the target. For example, consider a racing
game. The model can take actions to gain rewards without finishing the race.
5.3 FEDERATED LEARNING
5.3.1 Overview of Federated Learning
Generally, machine learning-based applications process a large amount of data, and
it is important to avoid data breaches. Traditionally, such data is kept in one central
location. Therefore, privacy preservation should be incorporated into such applications.
Federated learning (FL) trains machine learning models by transmitting model replicas
to the locations where the data resides. In other words, the learning algorithms are
trained over many distributed edge devices, each keeping its local data instances
without exchanging them. This eliminates the requirement of moving big data
records to a central device for training. Therefore, federated learning, also known
as collaborative learning, supports distributed and heterogeneous networks while
preserving data privacy.
Figure 5.9 shows an overview of federated learning with other learning approaches,
where federated learning is a variation of distributed learning. Centralized learning
passes the data to a central location, such as a cloud, to train the model. Clients use
APIs to access the model through services. In distributed on-site learning, a model
with a local dataset is created in each device. Initially, a model is distributed to each
of the devices from the central location. After that, the devices can operate standalone,
without communicating with the central cloud server. As an extension, federated
learning trains the model in each edge device and passes its parameters to the central
location for aggregation. Here, data is stored on edge devices and only knowledge
sharing occurs among the aggregated models.
Accordingly, in distributed learning, data is kept in a central location and the model
training is distributed to edge devices, whereas federated learning trains a part of
the model by keeping a portion of data in local devices and passing the parameters
between the aggregated devices, allowing collaborative learning. It trains the models
across distributed datasets and preserves the sensitive information in local devices, as
data is not communicated via the network. This helps to reduce the cost associated
with data transfer as well.
In detail, the data remains at source devices, which are known as clients, and they
get a replica of the global model from a central location. Then this global model will
be trained at each device using the local data. Here, instead of keeping a single global
dataset, federated learning distributes many model versions among devices with local
data, which will train locally.
Therefore, federated learning trains a set of models in multiple client devices
and aggregates the knowledge generated from each model, which is sent to a single
final model at the central location. This information is sent using parameter sharing
through an encrypted communication channel. After that, the weights of each local
model at the client are updated in each epoch, to continue with the model training
in edge devices and pass the knowledge again to the server or cloud. The server
aggregates the model updates to improve the combined model while preserving the privacy
of the data, and this process repeats. The final model will behave as if it were trained
using a single dataset. Thus, the key advantage is that the central location does not
keep individual updates and the data remains on the clients’ devices.
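The aggregation step at the central location can be sketched as a weighted average of the client parameters. Weighting each client by the size of its local dataset is an assumption borrowed from the commonly used federated averaging scheme; the text does not prescribe a specific aggregation rule, and the numbers are made up.

```python
def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Parameters reported by three hypothetical clients and their dataset sizes.
client_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
client_sizes = [10, 30, 60]
global_weights = federated_average(client_weights, client_sizes)
```

Only the parameter vectors and the dataset sizes reach the server in this sketch; the local data itself is never transmitted.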
The key steps in federated learning can be listed as follows. This can be
an iteration with initialization, client selection, configuration, reporting, and
termination.
Step 1: select the underlying model framework that supports FL. (Initialization)
The selection rules to implement a model are based on aspects such as the data
type, the compliance of the proposed framework, such as TensorFlow or PyTorch, with
the infrastructure, and the feasibility of applying a given technology.
This considers the format of the communication and the framework for sending
the rules among the local devices. A few options are PySyft with PyTorch
for lower-level access to modeling operations, Flower, which supports multiple
modeling frameworks, and TensorFlow Federated. At this stage, the local devices
are initialized and activated, and wait until they get their tasks from the central
location.
The client system should carry out local model training and share the
knowledge-based parameters with the central location through services. The
parameter sharing helps to update the models in local machines. The training process
may start with a selected set of local devices while the rest of the devices wait
until the next round of federated learning. The associated considerations are the
type of package (installable or Docker image), dependency version management,
client authentication, server communication, model training observation, and
error handling.
It is important to identify the private data used by each device for local model
training. Here, the central service manages the related metadata, such as the
availability of datasets at different clients and the datasets used for each client's
model training. In this stage, the central device assigns a set of devices to start
training, which results in updates over mini-batches.
We need to manage the access rights and the measures associated with a model.
These are used to select the users that can train the model. The key considerations are
the access rights of each client and the model storage location. At this stage, when
each device transfers the local knowledge to the server, the server aggregates the
model and communicates the updates back to the local devices. It also manages the
failures due to disconnected devices and lost updates. It starts the next federated
round by selecting a device set again.
Since the model is trained by sharing the parameters of different edge devices, the
final model is accessible for each client locally. It is important to identify the
acceptable risks, such as the possibility of identifying the clients who performed a
given part of model training. Such information sharing can be prevented by incorporating
privacy-preserving mechanisms into the trained weights before transmitting them to the
central location. However, reducing the risk associated with a model is a trade-off
against the model performance. Subsequently, the central device aggregates the knowledge
received from different local devices and finalizes the model when termination criteria,
such as the completion of the iterations or achieving an accuracy beyond the threshold,
are met.
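The iteration of initialization, client selection, local training, reporting, aggregation, and termination can be sketched end to end. This toy example fits a one-parameter model y = w * x; the client data, the learning rate, and the termination rule (a maximum number of rounds, or a negligible change of the global parameter) are assumptions made for illustration.

```python
import random


def local_training(w, data, lr=0.05, steps=5):
    """Client side: a few gradient steps on local data for y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def run_federated_learning(clients, max_rounds=50, clients_per_round=2,
                           tolerance=1e-6, seed=0):
    """Server side: repeat federated rounds until a termination criterion."""
    rng = random.Random(seed)
    global_w = 0.0  # initialization of the global model
    for round_number in range(1, max_rounds + 1):
        # Client selection: only a subset of devices trains in this round.
        selected = rng.sample(clients, clients_per_round)
        # Configuration and reporting: clients train locally and send
        # back only the updated parameter, never their data.
        updates = [local_training(global_w, data) for data in selected]
        # Aggregation of the reported parameters.
        new_w = sum(updates) / len(updates)
        change = abs(new_w - global_w)
        global_w = new_w
        # Termination: stop once the global model no longer changes.
        if change < tolerance:
            break
    return global_w, round_number


# Hypothetical local datasets, all generated from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.5, 3.0), (3.0, 6.0)],
    [(0.5, 1.0), (2.5, 5.0)],
]
global_w, rounds_used = run_federated_learning(clients)
```

The global parameter approaches 2 even though no client ever shares its local data points with the server.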
• Centralized: different tasks of the learning algorithm and the edge devices are
controlled by the central device. All the clients selected by the server communicate
the knowledge to the single server; thus, the server may become a bottleneck
and a single point of failure.
• Decentralized: the edge devices are organized among themselves such that the
updates of the models are shared among the interconnected edge devices to
generate the final model without a central server. Thus, it addresses the problem
of a single point of failure. However, the selected topology of the network
(e.g., blockchain-based) can impact model learning.
• Heterogeneous: a diverse set of client machines is used, with different
computation and communication capabilities, such as mobile and IoT devices.
• Data partitioning.
Several data partitioning techniques are available in federated learning. In
horizontal data partitioning, similar features with a slight intersection of the sample
space are included in different local machines. This is a widely used method,
where all the clients use a shared model, leading to an easy aggregation
process at the central machine. Considering related examples of different partition
types, a patient dataset for a given type of disease in a hospital is a possible
application for horizontal data partitioning. In vertical data partitioning, different
feature spaces with the same sample space are used by the clients. We use
techniques, such as entity alignment, to identify the overlapping samples in the
client data that are used for training. For example, student GPA datasets from
universities in different countries can be considered, where the feature space
would be the grading scale and evaluation metric. Hybrid data partitioning is a
combination of the above two methods. A possible application would be
measuring student performance across branches of a set of universities.
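The difference between the two main partition types can be made concrete with a small sketch. The patient records, field names, and client names are made up; for simplicity the horizontal split here uses disjoint samples, and the shared `id` field plays the role of the key used for entity alignment.

```python
# Hypothetical patient records; every field name is made up for illustration.
records = [
    {"id": 1, "age": 54, "blood_pressure": 130, "diagnosis": "A"},
    {"id": 2, "age": 61, "blood_pressure": 145, "diagnosis": "B"},
    {"id": 3, "age": 47, "blood_pressure": 120, "diagnosis": "A"},
    {"id": 4, "age": 70, "blood_pressure": 150, "diagnosis": "B"},
]


def horizontal_partition(records, n_clients):
    """Each client holds different samples (rows) with the same features."""
    return [records[i::n_clients] for i in range(n_clients)]


def vertical_partition(records, feature_groups, key="id"):
    """Each client holds the same samples but a different set of features."""
    return [
        [{key: r[key], **{f: r[f] for f in group}} for r in records]
        for group in feature_groups
    ]


hospital_a, hospital_b = horizontal_partition(records, 2)
clinic, laboratory = vertical_partition(
    records, [["age"], ["blood_pressure", "diagnosis"]])
```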
• Machine learning model.
The selection of the machine learning model is based on the dataset and the
associated task. In federated learning, homogeneous models use the same
model in all clients, and the gradients are aggregated at the server. In
heterogeneous models, each client has a different model; thus, there is no
aggregation method, and ensemble methods such as max voting are used instead.
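Max voting over the predictions of heterogeneous client models can be sketched as follows. The function name and the example labels are illustrative.

```python
from collections import Counter


def max_voting(predictions):
    """Return the class label predicted by the largest number of models."""
    counts = Counter(predictions)
    # most_common(1) returns the label with the highest count; on a tie,
    # the label that was encountered first wins.
    return counts.most_common(1)[0][0]


# Predictions of three hypothetical client models for the same input.
final_label = max_voting(["cat", "dog", "cat"])
```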
• Privacy mechanism.
Privacy mechanisms are used to avoid information leakage amongst clients, since the
server can interpret clients’ data from the learning gradients when the data is not
encrypted. Differential privacy methods, such as adding random noise to the model
parameters or associated data, are used to hide or mask the gradients. However, the
noise can result in low model accuracy. Cryptographic techniques such as secure
multi-party computation
and homomorphic encryption send encrypted data from local devices to the
central device. The central server decrypts the encrypted output to get the final
result. However, they are computationally expensive.
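The noise-adding idea can be sketched as follows. Clipping the update before adding Gaussian noise is an assumption taken from common differential privacy practice and is not described in the text; the noise scale is arbitrary, and the sketch performs no formal privacy accounting.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def add_gaussian_noise(update, noise_std, rng):
    """Mask the update by adding zero-mean Gaussian noise to each entry."""
    return [u + rng.gauss(0.0, noise_std) for u in update]


rng = random.Random(0)
update = [3.0, 4.0]                           # hypothetical client update
clipped = clip_update(update, max_norm=1.0)   # bounds one client's influence
noisy = add_gaussian_noise(clipped, noise_std=0.1, rng=rng)
```

A larger `noise_std` hides the individual update better but, as noted above, reduces the accuracy of the aggregated model.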
• Communication architecture.
In federated learning architectures, the functioning is the same, but the
client-server communication is different. In the centralized architecture, the central
device updates the parameters shared by the local devices. In the decentralized
architecture, a given local machine is randomly selected as a server for each
epoch, to update the global model and communicate with other clients in the
network. The implementation is complex and can be based on peer-to-peer (P2P)
networks, graphs, and blockchains.
• Scale of federation.
The scale of the federation falls into two categories. The cross-silo category has
a small set of clients with large computational abilities. This can correspond to
an organization and has high reliability as it is always available to train. The
cross-device category has a large client count with a small computation power.
This can be associated with mobile phones and has low reliability, as poor
network connectivity can hinder the availability of the device.
• Learning over smartphones: statistical models that learn through user behavior
to detect faces, recognize the voice, and predict the next word using mobile
phones.
• Google’s Android keyboard improves word recommendation without uploading
data to the cloud, by training the model on user interactions with mobile
devices. Gboard can personalize the user experience according to the individual’s
way of using the phone by referring to device history and suggesting improvements.
• Learning among organizations: institutes such as hospitals operate with large
amounts of patient data whose privacy should be preserved. These data are stored
locally, due to the associated ethical, administrative, and legal constraints.
• Predict human body conditions such as strokes using wearable devices.
• Identify the behavior of pedestrians and other vehicles in self-driving vehicles.
• Robo automation consists of NLP models that use data from different
locations.
Three-dimensional images are input to the model, and the CNN groups the pixels and
processes them using filters. Based on the complexity of the dataset, we can decide the
number of filters to be used. The pooling layer is used to decrease the parameter space
of the input through downsampling. This process is applied repeatedly on a given dataset
to produce a reliable result.
Generally, we use a small set of models, such as three, five, or ten trained models,
to avoid unnecessary computational expense and a decline in performance. The weight
initialization of the ensemble model is proportionate to the prediction accuracy of
the individual models. These weights are assigned in a way that reduces the MSE of
the sum of the weighted models, for each iteration. Equation (5.1) states the output
of the ensemble model, where w_i and y_i denote the calculated weight and result of
model i, respectively. Then the weights are tuned to obtain the minimum MSE for the
ensemble model (w_1 y_1 + w_2 y_2 + ... + w_n y_n), where the MSE combines the bias
and the variance of the models.
EM = \sum_{i=1}^{n} w_i y_i    (5.1)
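As a minimal sketch of Equation (5.1), the following Python code weights each model in proportion to its accuracy and forms the weighted sum of the model outputs. The accuracy values, model outputs, and function names are hypothetical and chosen only for illustration; the later tuning of the weights to minimize the MSE is not shown.

```python
def accuracy_weights(accuracies):
    """Initial weights proportional to each model's accuracy (they sum to one)."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def ensemble_output(weights, outputs):
    """Equation (5.1): weighted sum of the outputs y_i of the individual models."""
    return sum(w * y for w, y in zip(weights, outputs))


def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical validation accuracies of three trained models.
accuracies = [0.9, 0.6, 0.5]
weights = accuracy_weights(accuracies)

# Hypothetical outputs of each model for three samples, and the true targets.
model_outputs = [
    [1.2, 1.8, 3.1],
    [1.5, 2.5, 2.0],
    [0.0, 1.0, 5.0],
]
targets = [1.0, 2.0, 3.0]

ensemble_preds = [
    ensemble_output(weights, [outputs[k] for outputs in model_outputs])
    for k in range(len(targets))
]
ensemble_mse = mse(ensemble_preds, targets)
print(weights, ensemble_preds, ensemble_mse)
```

In this toy setting the ensemble error is lower than that of the two weaker models, because the most accurate model receives the largest weight.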
The bias is defined as the deviation between the expected prediction and the actual
values as stated in (5.2). The variance and the total expected prediction error (MSE)
are defined in (5.3) and (5.4), respectively.
where
m′(𝑥): output (prediction) of a base model
E[m′(𝑥)]: expected output of the base model
m(𝑥): actual class values
Var(𝜖): irreducible error based on the noise variance
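Equations (5.2) to (5.4) are not reproduced in this text. Assuming they follow the standard bias-variance decomposition, and using only the symbols defined above, they would read approximately as follows. This is a reconstruction under that assumption, not the original typesetting.

```latex
\begin{align}
\mathrm{Bias}[m'(x)] &= E[m'(x)] - m(x) \tag{5.2} \\
\mathrm{Var}[m'(x)]  &= E\big[(m'(x) - E[m'(x)])^2\big] \tag{5.3} \\
\mathrm{MSE}         &= \mathrm{Bias}[m'(x)]^2 + \mathrm{Var}[m'(x)] + \mathrm{Var}(\epsilon) \tag{5.4}
\end{align}
```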
We can use mechanisms to alter the training data of each model in ensemble learning.
As a simple technique, k-fold cross-validation can be used to measure the generalization
error. Here, the training dataset is divided into k subsets and a separate model is
trained for each of the k splits. Bootstrap aggregation (bagging) is another technique,
which trains the model with a training dataset that is resampled with replacement, which
• Changing models
The configuration of each model used for the ensemble can be varied. For instance,
models differ based on the number of layers or nodes, learning rates or regularization
types. Thus, the ensemble model learns a heterogeneous mapping function and results
in a smaller correlation in the output.
• Changing combinations
The method used to combine the outcomes of the ensemble members can be varied. As
a simple technique, model blending can be used, which calculates the average of the
predictions of each model. This can be improved to a weighted average ensemble by
optimizing the weights with techniques such as hold-out validation. Additionally,
new models can be learned to combine the predictions, using techniques such as
stacking. As an extended technique, boosting can be used. It adds one model at a time
to the ensemble to address the errors made by the previously trained models. Model
weight averaging is another method, which combines the weights of several models
that have a similar structure.
The bagging method draws random samples of the training data, in which individual
data points can be selected more than once, to reduce the prediction variance within a
noisy dataset. Here, sampling with replacement is used to create random subsets of a
dataset, which are treated as independent datasets and used to train models in
parallel. Therefore, a given data point can belong to many subsets of data. The
testing phase considers the outputs of the models that are trained using different
subsets of a given dataset, as shown in Figure 5.13. The final result is obtained by
passing the model outputs through an aggregation process. Each bagged predictor may
show slightly higher bias; however, the predictors are less correlated, which reduces
the ensemble’s variance.
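The bagging procedure can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the "model" is simply the mean of its bootstrap sample, the dataset is hypothetical, and the aggregation step is a plain average of the model outputs.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement, so a point can appear repeatedly."""
    return rng.choices(data, k=len(data))


def train_mean_model(sample):
    """Toy base model (an assumption): it always predicts the mean of its sample."""
    mean = sum(sample) / len(sample)
    return lambda: mean


def bagging_predict(data, n_models, seed=0):
    """Train n_models on independent bootstrap samples and average their outputs."""
    rng = random.Random(seed)
    models = [train_mean_model(bootstrap_sample(data, rng)) for _ in range(n_models)]
    predictions = [model() for model in models]
    return sum(predictions) / len(predictions), predictions


# Hypothetical noisy dataset with one extreme value.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0]
bagged, individual = bagging_predict(data, n_models=25)
print(bagged, min(individual), max(individual))
```

The individual predictions vary with each resampled subset, while the aggregated value lies between their extremes, which is the variance-reduction effect described above.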
Advanced Learning Techniques 107
• Boosting
Generally, a single model would not perform well on the entire dataset; however, it
may perform well on parts of the dataset. Boosting processes the dataset
sequentially, such that the entire dataset is input to the initial model and the
result is analyzed. The data points that are incorrectly classified by the model are
then passed to the second model. Thus, the second sub-model focuses on the challenging
regions of the feature space and learns an applicable decision boundary. As the name
suggests, each model contributes to boosting the performance of the overall ensemble.
Subsequently, the same process is followed and the combination of all the previous
models is used to generate the final result on unseen data, as shown in Figure 5.14.
Here, each model is dependent on the previous model, as the successive models aim
to rectify the errors of the prior model training for a given subset of data. Thus, the
boosting algorithm ensembles a set of weak models to generate a robust learning
model and produces the overall prediction using weighted majority voting. This method
reduces the bias in the ensemble prediction. Thus, the selected classifiers should be
simple models with fewer trainable parameters, which have low variance and high bias.
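One common way to make the next model focus on misclassified points is to re-weight the samples, as in AdaBoost. The sketch below shows a single AdaBoost-style re-weighting step. It is an illustrative variant, not necessarily the exact scheme in Figure 5.14; the labels and predictions are hypothetical and encoded as +1 or -1.

```python
import math


def adaboost_reweight(sample_weights, labels, predictions):
    """One AdaBoost-style step: raise the weights of misclassified samples.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak model.
    error = sum(w for w, y, h in zip(sample_weights, labels, predictions) if y != h)
    # Voting weight of this model in the final weighted majority vote.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Correct samples (y * h = +1) shrink, wrong samples (y * h = -1) grow.
    updated = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(sample_weights, labels, predictions)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, 1, 1]  # the fourth sample is misclassified
initial_weights = [0.2] * 5

new_weights, alpha = adaboost_reweight(initial_weights, labels, predictions)
print(new_weights, alpha)
```

After the update the misclassified sample carries half of the total weight, so the next weak model is pushed to classify it correctly.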
• Stacking
The stacking technique uses bootstrapped data subsets and trains multiple models
in parallel, as in the bagging technique. The outputs of all these models are
combined through a meta-classifier that produces the overall prediction.
Here, two layers of classifiers are used to ensure proper training. The meta-classifier
in the second layer may capture the features missed by the set of models in the first
layer. For instance, the class assignment probabilities produced by the first
layer of models can be averaged with some weights to combine the model outputs.
Then the argmax over the average predicted class probabilities can be used for the
final prediction. Stacking with one level is shown in Figure 5.15. Additionally, there
are ensemble models with multi-level stacking, where extra classification layers are
included among them. However, such methods incur a higher computational cost for a
relatively small improvement in performance.
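The weighted averaging and argmax step described for stacking can be sketched as follows. The class probabilities and the combination weights are hypothetical; in practice the weights would be learned by the meta-classifier.

```python
def weighted_average_probs(model_probs, weights):
    """Average the class probabilities of several models with the given weights."""
    n_classes = len(model_probs[0])
    total = sum(weights)
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical class probabilities from three first-layer models for one sample.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
# Hypothetical combination weights for the three models.
weights = [0.5, 0.25, 0.25]

avg_probs = weighted_average_probs(model_probs, weights)
predicted_class = argmax(avg_probs)
print(avg_probs, predicted_class)
```

Although the first model alone would choose class 0, the combined probabilities favor class 1, which illustrates how the second layer can overrule a single first-layer model.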
• Mixture of experts
This technique uses several classifier models, and their outputs are ensembled using
a generalized linear rule. A gating network, which is a trainable neural network, is
used to decide the weight assignment for the combinations as shown in Figure 5.16.
FIGURE 5.16 Process of the mixture-of-experts technique with a generalized linear rule.
The mixture-of-experts method is applied when there are various models trained on
different regions of the feature space, supporting the information fusion problem.
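A minimal sketch of the gating idea follows. It assumes a linear gating network followed by a softmax, which is one common choice; the two experts and the gating weights are hypothetical, and in practice the gating weights would be trained rather than fixed.

```python
import math


def softmax(scores):
    """Convert raw gating scores into positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gating_scores(x, gate_weights):
    """Linear gating network (an assumption): one score per expert."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]


def mixture_output(x, experts, gate_weights):
    """Combine the expert outputs using the input-dependent gate weights."""
    gates = softmax(gating_scores(x, gate_weights))
    outputs = [expert(x) for expert in experts]
    return sum(g * o for g, o in zip(gates, outputs)), gates


# Two hypothetical experts and hypothetical fixed gating weights.
experts = [lambda x: x[0] + x[1], lambda x: x[0] - x[1]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

y, gates = mixture_output([2.0, 2.0], experts, gate_weights)
print(y, gates)
```

For this input both gating scores are equal, so each expert contributes half of the combined output.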
The following are basic techniques in ensemble learning that operate on similar data.
• Disease detection: lung disease detection using chest X-ray and CT scans.
• Social networking: use face detection and recognition to tag users.
• Legal, banking, insurance: provide optical character recognition.
• Remote sensing: different sensor devices produce a variety of data consisting
of different resolutions.
• Entertainment: filter functionalities in social media networks.
• Document digitization: enables flexible access to documents.
• Landslide detection.
• Scene classification.
• Land cover detection.
• Credit card fraud detection.
• Speech emotion recognition in multi-lingual environments.
REVIEW QUESTIONS
1. Compare and contrast transfer learning and fine-tuning.
2. When should deep transfer learning be used?
Deep learning models mainly depend on data. The validation accuracy of the
model can be improved by adding more data. For instance, in image datasets,
the diversity of the available dataset can be increased using data augmentation.
General methods include flipping the image along an axis and introducing noise.
In addition, advanced methods such as generative adversarial networks (GANs)
support data augmentation.
Real-world data often contain missing values and outliers. These decrease the
performance of the model by introducing model biases. Thus, they
should be handled during the preprocessing of the dataset. For instance, when the
dataset contains continuous data, we can use statistical measures, such as the mean,
median, or mode, as a substitute for missing data. When the dataset contains
categorical data, the missing values can be treated as a separate class. Additionally,
models can be built to predict the missing values. The outliers can be handled by
applying methods such as deleting the data, performing alterations, binning, or
treating outliers separately.
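The substitution of missing values and one simple alteration for outliers can be sketched as follows. Missing entries are represented by `None`, and capping values to a fixed range is used as the alteration; both choices, as well as the sample values, are assumptions made for illustration.

```python
import statistics


def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [fill if v is None else v for v in values]


def clip_outliers(values, lower, upper):
    """Alter outliers by capping them to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical feature column with one missing value and one outlier.
raw = [1.0, None, 3.0, 100.0, 2.0]
imputed = impute_missing(raw, strategy="median")
cleaned = clip_outliers(imputed, lower=0.0, upper=10.0)
print(imputed, cleaned)
```

The median is used here because, unlike the mean, it is not pulled towards the outlier value.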
• Feature engineering
Feature engineering extracts more information as new features from the available
data. The identified features are used to explain the data and to improve prediction
accuracy. In this process, feature creation derives new variables from available data to
identify the connections between data points, and the features are then transformed
for the next process. Feature transformation is supported by different techniques, such
as normalization, which changes the scale of the data. For instance, data can be
transformed to a range between zero and one to bring the variables to the same scale.
The skewness of variables can be reduced, bringing them closer to a normal
distribution, using methods such as the square root, log, or inverse of the data. Also,
data discretization is used to transform numeric data into discrete data by dividing
the data into bins.
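The three transformations mentioned above can be sketched as follows. The function names and sample values are hypothetical; `math.log1p` is used for the log transform so that zero values are handled, and equal-width bins are assumed for discretization.

```python
import math


def min_max_scale(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Reduce right skew with log(1 + v); assumes non-negative values."""
    return [math.log1p(v) for v in values]


def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (indices 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


values = [0.0, 5.0, 10.0, 20.0]
scaled = min_max_scale(values)
logged = log_transform(values)
bins = discretize(values, n_bins=4)
print(scaled, logged, bins)
```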
• Feature selection
Feature selection finds a subset of features that can describe the connection between
the source and target data in an effective way. This is supported by knowledge
of the considered domain and experience, visualization of the relationship
between variables, and statistical parameters (e.g., p-values). Additionally, there
are dimensionality reduction methods, such as principal component analysis (PCA),
that represent training data in a reduced dimensional space but still characterize
the intrinsic connections in the data. Other methods include maximum
correlation, low variance, factor analysis, and backward/forward feature selection.
The addition of more layers to a neural network improves its feature learning
capabilities. It allows the network to learn the features of the dataset more deeply
by recognizing subtle differences in the data. However, the benefit depends on the
nature of the application domain. For instance, if the data classes have a clear
difference, then it is sufficient to have a few layers. However, if the classes differ
only slightly, with fine-grained features, then more layers are needed to learn the
subtle features that differentiate the classes.
• Algorithm tuning
Hyperparameter tuning strongly influences the outcome of the learning process
by finding the best value for each hyperparameter. Knowledge of the parameter type
and its impact is useful for making decisions that improve the model performance.
A feasible image size should be identified during data preprocessing. When the
image is very small, it is hard to learn the distinctive features that are needed to
recognize the image. In contrast, if the image is too large, the model requires more
computational resources to process it; otherwise, the model may not be sufficient to
learn the data. Converting an image from a low resolution to a larger size
results in pixelation, which is the visibility of the individual square pixels that
make up an image. This leaves the image with blurry or fuzzy sections, which has a
negative effect on the model performance.
• Increase epochs
An epoch is one complete pass of the entire dataset through the model, and the
number of epochs is the number of such passes. When we increase the number of epochs,
the model trains incrementally. Generally, epochs are increased when there is a
sufficiently large dataset. Subsequently, when the accuracy no longer increases as
the epochs increase, we consider the learning rate of the model at this point. This
hyperparameter decides whether the model reaches its global minimum or stays at a
local minimum.
Generally, color channels are used to represent the image dimensionality. For instance,
color images in RGB contain three color channels. In contrast, grayscale images
have one channel. When there are more color channels, the dataset becomes more
complex and model training takes more time. Depending on the application domain, if
the color does not impact the prediction, then the images can be converted to
grayscale and processed, which requires fewer resources.
• Transfer learning
Transfer learning uses pretrained models that were trained on large datasets for the
predictions of a new task. We have discussed this topic in Chapter 5.
• Ensemble methods
Ensemble learning is a widely used approach that combines the outputs of a set of
weakly performing models to obtain better predictions. Different techniques, such as
bagging (bootstrap aggregating), boosting, and stacking, help to increase model
performance. Generally, ensemble models are more complex than conventional
single models, because they are built from several base models.
• Evolutionary algorithms
then feeds the output back to the input, and repeats the process until the expected
output is generated.
This is an individual large model consisting of all the possible operations. It produces
weights that can be used by other potential models. After the training process, it
samples sub-architectures and compares them with the validation data. It uses the
advantage of parameter sharing to its maximum. Generally, this model trains with
gradient descent by transforming the space into a continuous and differentiable form.
• Cross-validation
6.2 REGULARIZATION
Computational models are known to be hard to interpret, so they are often regarded
as black boxes where the input data is fed in and an output is generated
by the model by executing its classification or regression algorithm. The model
training can be a complex learning process and, depending on the input data, it can be
challenging to make the model generalize. A failure to generalize is known as
overfitting, which leads to low accuracy on new data. Here, the model is adjusted to
the peculiarities of the dataset, so it cannot generalize well to new data. This
becomes more severe when the model tries to fit noise in the input data, that is, data
points that do not indicate the real features of the data but arise by chance.
Figure 6.1 shows different model-fitting scenarios, which we discussed in earlier
chapters. This can be addressed by applying cross-validation, which estimates the error
over held-out data and helps to generalize by deciding the parameters that would work
best for the model.
Regularization is an important concept in learning algorithms that reduces the loss
on unseen data and helps to avoid overfitting. This approach adds extra information,
in the form of a penalty term, and helps to fit the model on an unseen dataset with
reduced errors. In regression, it shrinks the coefficients closer to zero. Because of
that, regularization discourages learning overly complex models and reduces
overfitting. It simplifies the model by effectively using a smaller number of
parameters. Since no regression coefficients take very high values, overfitting is
prevented. This method minimizes the loss and the complexity together, resulting in a
more streamlined and parsimonious model that performs better at predictions.
The model complexity can be represented as the following functions.
• L1 regularization
• L2 regularization
• Elastic nets
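The formulas for these penalties are not reproduced here. As a hedged sketch, the commonly used forms are assumed below: the L1 penalty is the scaled sum of absolute weights, the L2 penalty is the scaled sum of squared weights, and the elastic net adds the two. The function names, the symbols `lam`, `lam1`, and `lam2`, and the sample values are illustrative assumptions.

```python
def l1_penalty(weights, lam):
    """L1 term: lam times the sum of absolute weights (encourages sparsity)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weights (shrinks large weights)."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam1, lam2):
    """Elastic net: a combination of the L1 and L2 terms."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)


def regularized_loss(data_loss, penalty):
    """Total objective: loss on the data plus the complexity penalty."""
    return data_loss + penalty


weights = [0.5, -1.5, 2.0]  # hypothetical model coefficients
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)
en = elastic_net_penalty(weights, lam1=0.1, lam2=0.1)
print(l1, l2, en, regularized_loss(1.0, en))
```

A larger `lam` puts more emphasis on keeping the coefficients small, at the expense of fitting the training data.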
• Early stopping
Early stopping is used in model learning with iterative approaches like gradient
descent. Since training for too many iterations can reduce the model generalizability,
the early stopping mechanism limits the number of iterations for which the training is
executed, so that it ends before the model overfits. This method initially sets a
large number of epochs to train and stops training when the performance on the
validation data starts to degrade. The model tries to follow the loss function
exhaustively on the training data by tuning the parameters. Accordingly, the loss
function is monitored on the validation data and, if there is no improvement on the
validation set, the learning stops without processing all the epochs. Figure 6.3
shows a representation of early stopping. Early stopping regularization has been
useful in providing consistency in boosting algorithms, such as AdaBoost.
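The monitoring logic can be sketched as a small loop over validation losses. The `patience` parameter (the number of epochs without improvement that are tolerated) is a common convention assumed here, not something defined in the text, and the loss values are hypothetical.

```python
def train_with_early_stopping(val_losses, patience):
    """Scan per-epoch validation losses and stop once they stop improving.

    Returns (epoch at which training stopped, best epoch, best validation loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch, best_loss
    return len(val_losses) - 1, best_epoch, best_loss


# Hypothetical validation losses: improving at first, then degrading.
val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8]
stopped_at, best_epoch, best_loss = train_with_early_stopping(val_losses, patience=2)
print(stopped_at, best_epoch, best_loss)
```

In a real training loop the model parameters from the best epoch would also be saved and restored after stopping.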
FIGURE 6.4 Neural network architecture before and after dropout is performed.
• Dropout
The dropout technique randomly ignores, or drops out, some of the layer outputs
at different points during training. As a result, the layer behaves like a layer with
a different set of nodes and connections to the previous layer, which resembles
training models with different architectures simultaneously. Thus, each time the
gradient is updated, a new sparse model is generated without the dropped nodes, based
on hyperparameters. This change may help network layers to correct mistakes from prior
layers and produce a robust model. Dropout can cause the training process to become
noisy, depending on the impact of the dropped nodes on the inputs. Figure 6.4 shows an
instance of a dropout network.
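A minimal sketch of dropout on a list of activations follows. It assumes the widely used "inverted dropout" convention, in which the kept activations are scaled up during training so that no rescaling is needed at test time; this convention and the parameter names are assumptions, not details given in the text.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training.

    Kept activations are divided by (1 - rate), the inverted-dropout assumption.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)
layer_output = [1.0] * 1000
dropped = dropout(layer_output, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
unchanged = dropout(layer_output, rate=0.5, rng=rng, training=False)
print(zero_fraction)
```

Each call draws a new random mask, which corresponds to training a different sparse sub-network at every update.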
Enhancement of Deep Learning Architectures 119
Models with dropout may have a lower classification error on test data,
because of their simpler representation and better generalizability. Dropout addresses
overfitting by reducing the co-dependence of the nodes among different hidden layers,
which results in low error values. However, this technique can take more training
time. This can be addressed by using a regularizer that has an effect similar to a
dropout layer. For example, for linear regression models, dropout can be considered as
a form of L2 regularization.
6.3 AUGMENTATION
Generally, it is challenging to obtain a perfect dataset with the right amount of
data for machine learning model training. Although many public datasets support
object detection and image classification, it is difficult to acquire sufficient data
to train models for a specific application domain. Training with insufficient data can
cause model overfitting or underfitting. This gives lower accuracy on the test dataset,
because the limited variety of the data prevents the model from learning the hidden
peculiarities of the desired application.
Data augmentation is a widely applied technique that expands the variety of the
training data by applying realistic, random transformations. With minor alterations
to the existing dataset, the model treats the machine-generated images as distinct
images, which increases the variation in the training data. This helps the model
to identify the peculiarities of the dataset and to perform well on unseen data.
The following strategies support data augmentation.
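Two simple transformations mentioned earlier in this chapter, flipping and noise injection, can be sketched as follows. The image is represented as a nested list of pixel intensities in the range zero to one; this representation, the function names, and the noise level are assumptions made for illustration.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def add_noise(image, sigma, rng):
    """Add Gaussian noise to every pixel and clip the result to [0, 1]."""
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


# Hypothetical 2 x 2 grayscale image.
image = [[0.0, 0.5], [1.0, 0.25]]
rng = random.Random(0)

flipped_h = horizontal_flip(image)
flipped_v = vertical_flip(image)
noisy = add_noise(image, sigma=0.05, rng=rng)
augmented_dataset = [image, flipped_h, flipped_v, noisy]
print(len(augmented_dataset))
```

Each original image yields several altered copies, so the training set grows without collecting new data.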
6.4 NORMALIZATION
In deep learning, normalization techniques are applied to transform data in such a
way that all the features in a dataset lie on a similar scale. Generally, we use a range
between zero and one to normalize data. If the data is not normalized, it will cause
a dilution in the effectiveness of important attributes that are on a lower scale. The
normalization layers support efficient and steady model learning. Let us discuss some
of the normalization techniques, such as batch normalization, used to train with large
batches without recurrent connections, and techniques to train with small batches
such as weight normalization, weight standardization, group normalization, and layer
normalization.
• Batch normalization
Batch normalization standardizes the inputs to a given layer. Generally, the neural
network learning process updates the weight parameters in the direction
given by the gradient, which depends on the current inputs. As the network layers
are stacked together, a small weight update in an earlier layer can result in a large
change in the inputs passed on to the later layers. This causes sub-optimal
outputs to be generated from the current gradient. Since batch normalization
constrains the inputs of a given layer, such as the activations from the prior layer,
the weights can be updated with better gradients, which results in steady and
efficient training.
The batch normalization layer performs the transformation by subtracting
the mean of the current mini-batch from each input and dividing by the mini-batch
standard deviation. After this step, the layer inputs have nearly zero mean
and unit variance. Depending on the task, the model may perform better with
different values of the mean and variance, so a learnable scale and shift are applied.
Algorithm (6.1) states the batch normalization process of transforming the input x
into the output y, using two learnable parameters γ and β.
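Algorithm (6.1)
Input: values of x over a mini-batch: \mathcal{B} = \{x_1, \ldots, x_m\};
Parameters to be learned: \gamma, \beta
Output: \{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}
\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i    // mini-batch mean
\sigma_{\mathcal{B}}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2    // mini-batch variance
\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}    // normalize
y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)    // scale and shift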
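The steps of Algorithm (6.1) can be sketched in plain Python for a mini-batch of scalar inputs. The default values of `gamma`, `beta`, and `eps` are illustrative assumptions; in a network, γ and β would be learned and the same computation would be applied per feature.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization of a mini-batch of scalars, following Algorithm (6.1)."""
    m = len(xs)
    mu = sum(xs) / m                                          # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m                  # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]     # normalize
    return [gamma * xh + beta for xh in x_hat]                # scale and shift


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=3.0)

mean_normalized = sum(normalized) / len(normalized)
var_normalized = sum((y - mean_normalized) ** 2 for y in normalized) / len(normalized)
mean_shifted = sum(shifted) / len(shifted)
print(mean_normalized, var_normalized, mean_shifted)
```

With the default parameters the output has approximately zero mean and unit variance; with `gamma=2.0` and `beta=3.0` the mean moves to about 3, showing how the learnable parameters restore other scales when needed.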
Generally, the normalization layers are placed between a linear, convolutional, or
recurrent layer and the following non-linear activation layer, such as ReLU. As a
result, the inputs that reach the activation layer are centered around zero. This
supports training, as it prevents inactive neurons caused by poor random initialization.
→ Requires a large batch size: since each training iteration of batch normalization
determines the batch statistics, namely the mean and variance of the mini-batch,
a large batch size is needed to approximate the mean and variance well from the
mini-batch.
→ Does not perform well with RNNs due to complexity. Since there are
recurrent connections to preceding time steps, it requires distinct learnable
parameters for each time step in the batch normalization layer.
→ Different training and test calculations. This increases the complexity, as at
test time the batch normalization layer uses a fixed mean and variance calculated
from the training dataset, without calculating the mean and variance of the
mini-batch from the test dataset.
• Weight normalization
Weight normalization decouples the length from the direction of the weight vector
and reparameterizes the weights to increase the training efficiency. Gradient descent
is then performed on two parameters, the length and the direction of the weight
vector, instead of on the weight vector directly.
Weight normalization works with RNNs; however, this technique is significantly less
stable and not commonly applied.
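The reparameterization can be sketched as follows, assuming the usual form in which the weight vector is written as a length `g` times the unit vector in the direction of `v`. The names `g` and `v` and the sample values are assumptions for illustration.

```python
import math


def weight_norm(v, g):
    """Build the weight vector w = g * v / ||v|| from a direction v and a length g."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]


direction = [3.0, 4.0]   # hypothetical direction parameter
length = 2.0             # hypothetical length parameter

w = weight_norm(direction, length)
w_norm = math.sqrt(sum(wi * wi for wi in w))
print(w, w_norm)
```

Whatever the scale of `v`, the resulting weight vector always has norm `g`, so the length and the direction can be learned separately.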
• Layer normalization
This technique normalizes the activations to zero mean and unit variance along the
direction of the features. Since it does not use the direction of the
mini-batch for the normalization, it addresses the limitations of batch normalization
by removing the reliance on the batches. Thus, it can be applied to recurrent
networks too.
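The difference between the two normalization directions can be sketched on a small batch stored as a list of samples, each a list of features. The helper names and the sample values are assumptions; the learnable scale and shift are omitted for brevity.

```python
import math


def normalize(values, eps=1e-5):
    """Normalize a list of numbers to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(batch):
    """Normalize each sample across its own features (no batch dependence)."""
    return [normalize(sample) for sample in batch]


def batch_norm(batch):
    """Normalize each feature across the samples of the mini-batch."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]


# Hypothetical mini-batch: two samples with three features each.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ln = layer_norm(batch)
bn = batch_norm(batch)
print(ln)
print(bn)
```

Layer normalization gives the same result for a sample regardless of which other samples are in the batch, which is why it also works with a batch size of one and with recurrent networks.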
• Group normalization
The features are split into groups and each group is normalized individually
along the direction of the features. This can work better than layer normalization, as
the number of groups is a hyperparameter that can be tuned.
• Weight standardization
This technique transforms the weights so that they have zero mean and unit variance.
It is applicable to layers such as convolutional, linear, and recurrent layers. During
the forward pass, it transforms the weights and calculates the corresponding
activations. Generally, group normalization with weight standardization performs well
even with small batch sizes. This suits applications such as dense prediction tasks,
where memory consumption rules out large batch sizes.
6.5 HYPERPARAMETER TUNING
Hyperparameters are a set of parameters used to configure the machine learning
model with the aim of lowering the cost function. These include the number of
nodes, learning rate, epochs, activation function, batch size, and optimizer. They
can be defined as the variables that decide the structure and the training process.
We set the hyperparameters before the weight-updating process in training begins.
Hyperparameter tuning finds the best possible set of values for these parameters
by evaluating them through training. An optimal set of hyperparameter values results
in high performance and low errors. Different datasets and machine learning algorithms
require different sets of hyperparameters for accurate predictions. First, the
hyperparameters are tuned in a model, and then the number of layers is tuned. The
number of layers impacts the model accuracy, such that fewer layers may result in
underfitting and more layers may lead to overfitting.
Following are the main hyperparameter types.
• Number of hidden layers: the increase in the number of layers can increase the
accuracy. However, there is a trade-off between the simplicity of the model that
leads to being fast and generalized and the accuracy of the prediction.
• Number of nodes in each layer: the increase in the number of neurons is useful
up to a certain point. However, layers that are too wide may depend heavily on the
training data and may generate low accuracy predictions on unseen data, causing
overfitting. This should be decided based on the complexity of the problem.
• Learning rate: this controls the size of the parameter adjustment in each
iteration. When the learning rate is high, the model learns quickly. However, it
may overshoot and end in a local minimum instead of reaching the global minimum. A
low learning rate reduces the changes to the parameter estimates and directs them
towards the global minimum. However, it requires a large number of epochs, with more
time and computational resources.
• Momentum: by continuing to update the parameters in the direction of their recent
changes, this helps to avoid local minima points and zig-zag movements in
each iteration. Usually, it starts with low momentum and adjusts upward.
• Activation function: this affects the processing of the inputs of each layer
to its corresponding output. The values that move between layers are changed
based on the activation function used.
• Optimizer: the optimizer helps to change the weights and learning rate to reduce
loss and achieve high accuracy.
• Batch size: this defines the number of training samples in each subset (batch) of
the training set. When the training dataset is large, a batch size can be assigned
without using the entire data at once. When the batch size is small, the model learns
fast. However, it results in a high variance on the unseen data.
• Epoch: this is the number of iterations the entire dataset goes through in the
learning process. For instance, during a single epoch, the training set routes
forward and backwards through the model once. Underfitting can occur, when
we use a small number of epochs, due to insufficient learning. In contrast, many
epochs can cause overfitting.
• Layer tuning: generally complex task solving requires many layers including
regularization layers such as batch normalization and dropout to avoid
overfitting. Generally, after the first few hidden layers, batch normalization is
included that normalizes the input of each batch. The dropout rate defines the
percentage of neurons to drop.
A search space must be defined before running the tuning algorithm: set bounds for all
the hyperparameters and add prior knowledge about them, for example by setting a
non-uniform distribution for the search. The following categories of hyperparameter
tuning algorithms are available.
• Grid search
• Randomized search
Halving randomized search uses the same successive halving approach, and it is
further optimized compared to halving grid search. Unlike halving grid search, it
does not train on all combinations of hyperparameters; instead, it randomly picks
a set of combinations of hyperparameters. In practice, the HalvingRandomSearchCV
implementation in the Scikit-Learn library can be used.
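The successive halving idea behind this search can be sketched in plain Python, independent of any library. All names below are hypothetical, and the toy `evaluate` function stands in for training a model with a given budget and returning a validation score; this is a sketch of the idea, not the Scikit-Learn implementation.

```python
import random


def sample_config(rng):
    """Randomly pick one hyperparameter combination (here only a learning rate)."""
    return {"lr": rng.uniform(0.001, 1.0)}


def evaluate(config, budget):
    """Toy score (an assumption): best near lr = 0.1.

    A real version would train with `budget` resources and return a validation score.
    """
    return -abs(config["lr"] - 0.1)


def halving_random_search(n_candidates, min_budget, factor=2, seed=0):
    """Sample random candidates, then repeatedly keep the best 1/factor of them."""
    rng = random.Random(seed)
    candidates = [sample_config(rng) for _ in range(n_candidates)]
    budget = min_budget
    while len(candidates) > 1:
        ranked = sorted(candidates, key=lambda c: evaluate(c, budget), reverse=True)
        candidates = ranked[: max(1, len(ranked) // factor)]
        budget *= factor  # survivors get more resources in the next round
    return candidates[0]


best = halving_random_search(n_candidates=8, min_budget=10)
print(best)
```

Only a random subset of the search space is ever evaluated, and most of the budget is spent on the few candidates that survive the early rounds.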
• Hyperopt-Sklearn
This technique utilizes Bayesian optimization to find the best parameters across the
search space efficiently. It considers the structure of search space and samples the
new instances based on the previous evaluations focusing on better predictions.
6.6 MODEL OPTIMIZATION
6.6.1 Overview of Model Optimization
The optimization process trains the model iteratively by adjusting its parameters
in every iteration until optimum results are obtained. The optimization algorithms
alter model attributes such as the weights and the learning rate, which are updated in
each epoch or iteration, to reduce the loss (error function) or to maximize the
efficiency of production. Usually, there is a trade-off between the network size and
the speed, so when the number of layers increases it is difficult to apply optimizers.
Different techniques like gradient descent (GD) and stochastic gradient descent (SGD)
are used as optimization algorithms.
Let us revisit the following concepts.
• The maximum and minimum values of a function are denoted by maxima and
minima, respectively. Figure 6.6 shows the global maxima, local maxima, local
minima, and global minima locations for a given curve. The global points are
obtained over the entire domain of the function and the local points are based on
a given range of the function. In a learning model, there is a single
global minimum value and a single global maximum value. However, there can be many
local minima and local maxima locations.
• The learning rate denotes the step size at each iteration and is a tunable
parameter. Figure 6.2 shows the behavior of the learning rates. Since the
trade-off between bias and variance is affected by the hyperparameters, a
small learning rate can overfit the model with a large variance. In contrast,
the model can underfit and result in a large bias when the learning rate
is high. The optimal values with a minimum loss can be identified using
cross-validation.
• Gradients measure the slope of a function and determine the weight updates
based on the loss. In general, the slope shifts its sign from negative to positive
at a minimum. In a minimization algorithm, as the points move
towards a minimum, the slope reduces as shown in Figure 6.3. When the gradient is
larger, the slope is steeper and the model learns faster.
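The roles of the gradient and the learning rate can be sketched with plain gradient descent on a one-dimensional function. The function f(w) = (w - 3)^2, with its minimum at w = 3, and the two learning rates are hypothetical choices for illustration.

```python
def gradient(w):
    """Slope of f(w) = (w - 3)^2: negative left of the minimum, positive right of it."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


w_good = gradient_descent(w0=0.0, learning_rate=0.1, steps=100)
w_too_high = gradient_descent(w0=0.0, learning_rate=1.1, steps=50)
print(w_good, w_too_high)
```

With the moderate learning rate the iterate settles at the minimum, whereas with the large learning rate every step overshoots by more than the previous error and the iterate moves further away.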
The weights in a neural network need to be updated with some value. Considering only
the current gradient may not give high-performance results. This can be addressed
by averaging the current gradient and all past gradients; however,
this leads to equally weighted gradients. In order to address this issue, an
exponential moving average of gradients is considered. In this process, higher weight
values are assigned to the most recent gradients and lower weight values to the older
ones, to represent the higher importance of recent gradients.
When the average value of the gradients for the most recent iterations is close to
zero, the model is on an approximately flat region of the loss surface. In order to
converge and reach a global minimum, a downward slope that has the effect of
acceleration should be found. Here, the learning rate component should be increased to
learn fast when the gradient value is low. This inverse proportion is established by
dividing the fixed learning rate by the average gradient value. A high adapted
learning rate results in a large weight update, as it is multiplied by the gradient.
In contrast, when there is a high average gradient value, the model training is on
steep slopes. In order to reach the global optimal point, small steps should be taken
by performing the same division and moving with caution.
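The division described above can be sketched as a single update step. The sketch assumes an RMSProp-style rule, in which the fixed learning rate is divided by the square root of an exponential moving average of squared gradients; the parameter names and default values are assumptions.

```python
import math


def adaptive_step(w, grad, avg_sq_grad, lr=0.01, beta=0.9, eps=1e-8):
    """One update with a learning rate adapted to the recent gradient magnitude."""
    # Exponential moving average of the squared gradients.
    avg_sq_grad = beta * avg_sq_grad + (1.0 - beta) * grad * grad
    # Small average gradient gives a large adapted rate, and vice versa.
    adapted_lr = lr / (math.sqrt(avg_sq_grad) + eps)
    return w - adapted_lr * grad, avg_sq_grad, adapted_lr


# Flat region: tiny gradient. Steep slope: large gradient.
_, _, lr_flat = adaptive_step(w=0.0, grad=0.01, avg_sq_grad=0.0)
_, _, lr_steep = adaptive_step(w=0.0, grad=10.0, avg_sq_grad=0.0)
print(lr_flat, lr_steep)
```

The adapted learning rate is large on the flat region and small on the steep slope, matching the cautious behavior described above.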
• Momentum
Momentum is used to accelerate learning, especially with small, high-curvature, and
consistent or noisy gradients. It accumulates an exponentially decaying moving
average of recent gradients and moves in their direction. As shown in Figure 6.9,
consider a scenario with some noisy data. Even though these dots seem close to
each other, they do not share x coordinates. In such situations, a moving average
should be used to denoise the data and bring the data points closer to the original
function. For instance, when the hyperparameter has a small value, the sequence
will fluctuate, since fewer samples are averaged, and it will stay closer to the
noisy data. When the hyperparameter has a large value, the curve will be smoother
but shifted to the right, as it averages over a large number of samples. Thus, a value
between these two extremes should be selected for the hyperparameter.
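The effect of the hyperparameter can be sketched in a few lines of Python. This is a
minimal illustration, assuming the common recurrence v_t = beta * v_(t-1) + (1 - beta) * x_t
with v_0 = 0. The name beta, the helper total_variation, and the sample values are
assumptions made for this sketch and are not taken from Figure 6.9.

```python
def exponential_moving_average(values, beta):
    """Return the sequence v_t = beta * v_{t-1} + (1 - beta) * x_t, starting from v_0 = 0."""
    averaged = []
    v = 0.0
    for x in values:
        v = beta * v + (1.0 - beta) * x
        averaged.append(v)
    return averaged


def total_variation(seq):
    """Sum of absolute jumps between neighbours; a rough measure of how much a curve fluctuates."""
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))


# Hypothetical noisy observations.
noisy = [1.0, 3.0, 0.5, 2.5, 1.5, 3.5, 1.0, 2.0]

light = exponential_moving_average(noisy, beta=0.1)  # small beta: follows the noise closely
heavy = exponential_moving_average(noisy, beta=0.9)  # large beta: smoother, but lags behind

print(round(total_variation(noisy), 3))
print(round(total_variation(light), 3))
print(round(total_variation(heavy), 3))
```

With a small beta the averaged curve fluctuates almost as much as the raw data, while a
large beta gives a much flatter curve that reacts slowly to changes.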
Adagrad uses parameter-specific learning rates, where the learning rate adapts based
on how often the parameters are updated during model training. The idea behind this
technique is to use different learning rates for each hidden layer and each neuron at
different iterations or epochs. This process divides the learning rate by the square
root of the cumulative sum of the present and earlier squared gradients. The gradient
itself is computed as in SGD and is not changed. This technique improves performance on
tasks with sparse gradients, as it retains a per-parameter learning rate. Thus, Adagrad
is mainly used in computer vision and NLP tasks. However, as the number of iterations
increases, the learning rate becomes very small and the weight updates shrink until
they are no longer significant, which delays convergence. This problem can be
addressed using the Adadelta and RMSProp optimizers.
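The shrinking learning rate can be observed with a small sketch. This is an
illustration under stated assumptions, not a library implementation: a single
parameter is trained on the toy objective f(w) = w^2, and the names adagrad_step, lr,
and eps (a small constant that avoids division by zero) are chosen for this example.

```python
import math


def adagrad_step(weight, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    """One Adagrad update for a single parameter.

    Returns the new weight, the updated sum of squared gradients,
    and the effective learning rate used in this step.
    """
    grad_sq_sum += grad ** 2
    effective_lr = lr / (math.sqrt(grad_sq_sum) + eps)
    return weight - effective_lr * grad, grad_sq_sum, effective_lr


# Toy objective f(w) = w^2 with gradient 2w (an assumption for illustration).
w, acc = 5.0, 0.0
rates = []
for _ in range(50):
    w, acc, eff = adagrad_step(w, 2.0 * w, acc)
    rates.append(eff)

print(rates[0], rates[-1])  # the effective learning rate only ever decreases
```

Because the sum of squared gradients never decreases, the effective learning rate
falls at every step, which is the behaviour that Adadelta and RMSProp aim to correct.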
Adadelta is an extension of the Adagrad optimizer and is an SGD method. The term delta
denotes the difference between two consecutive weight updates. It uses an adaptive
learning rate per dimension to address two limitations, namely the continual
decline of learning rates during training and the need to select the global
learning rate manually. Instead of using the learning rate as a parameter, Adadelta
utilizes the frequency of parameter change to adapt the learning rate. Additionally,
this technique retains the second moments of the gradient and of the change in
parameters as variables.
Adam is a widely used optimizer. It differs from classical SGD, which uses a fixed
learning rate for a given model to update weights throughout the training. Adam
combines the features of both RMSProp and Momentum. For the gradient, the
EMA of gradients is used, as in Momentum. To decide the learning rate of each
parameter, it divides the learning rate by the square root of the EMA of squared
gradients, as in RMSProp. The RMSProp optimizer adapts the learning rates of the
parameters using only the average of the second moments of the gradients, which is the
uncentered variance, whereas Adam also uses the average first moment, which is the mean.
This allows for the management of sparse gradients on noisy tasks. Therefore, Adam
maintains the EMA of the gradient and of the squared gradient, with hyperparameters
that control the decay rates of these moving averages.
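The combination of the two moving averages can be sketched for a single parameter as
follows. This is a minimal sketch, not a framework implementation. The bias-correction
terms and the default values of beta1, beta2, and eps follow the commonly published
formulation of Adam and are not described in the text above; the toy objective
f(w) = w^2 is an assumption for illustration.

```python
import math


def adam_step(weight, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients, as in Momentum
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients, as in RMSProp
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    weight = weight - lr * m_hat / (math.sqrt(v_hat) + eps)
    return weight, m, v


# Toy objective f(w) = w^2 with gradient 2w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)

print(w)
```

In the first step the update has a magnitude close to the learning rate regardless of
the size of the gradient, since the gradient is divided by the square root of its own
square.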
Generally, the Adam optimizer performs well and is used as the default optimizer
for most applications for the following reasons.
• Straightforward to implement.
• Low memory consumption.
• Computationally efficient.
• Faster computation time.
• Requires little parameter tuning.
• Applicable to large tasks.
• Can be used for tasks with noisy or sparse gradients.
selects the best match for the objectives of a given problem by maximizing a fitness
function. This can be considered a biologically inspired optimization-based algo-
rithm. NAS automates both the architecture selection and neural network training,
which along with the increase in computational resources available should eventu-
ally allow the development of tailor-made architectures for each task per each hard-
ware platform. NAS is used in various tasks, such as object detection and image
classification in image processing, hyperparameter optimization and meta-learning
in AutoML. NAS will bring flexibility to industries with deep learning applications
that can adapt to diverse requirements. Thus, the advantages can be listed as follows.
Generally, NAS is not widely applied in real-world applications due to a few
limitations. NAS methods explore many potential solutions with varying complexity.
Hence, it is computationally expensive in terms of resources and time. The larger
the search space, the more architectures there are to test, train, and evaluate. Since
the architectures are evaluated with training data, it can be hard to predict the
performance on real data. Also, expert knowledge is required to speed up the search
process by fine-tuning different architectures and guiding the search to converge
quickly towards an optimal solution. Additionally, it may not produce the globally
optimal solution even with sufficient resources. Further, many NAS studies are
irreproducible and cannot be compared with the baseline models.
6.7.2 NAS Process
We have learnt that recurrent neural networks (RNNs) support the processing of
sequential data to output the next possible element in the sequence. Since RNNs are
prone to vanishing gradient and exploding gradient problems, LSTMs are applied,
which handle the importance of each prior element in the sequence by using
different gates. Generally, NAS uses RNNs to handle sequential data, by decoding
the sequential outputs to produce more suitable models iteratively. The controllers
are designed to navigate the search space more intelligently using convolutional
blocks and stacking some of them to design the learning model. A search space
with basic convolutions, depth-separable convolutions with different kernel sizes
and pooling layers is used to design cells. Here, the convolutional cells that output
a feature map of the same size and half-size of the input are defined as normal cells
and reduction cells, respectively.
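NAS can be explained in terms of three components as follows.
1. Search space: the set of neural network architectures that solve a given problem.
2. Search strategy: the mechanism used to explore the defined search space for
candidate architectures.
3. Performance estimation strategy: the method used to evaluate the performance
of the candidate architectures.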
134 Deep Learning: A Beginners’ Guide
In the NAS process, a search strategy selects a model from the search space. The
performance of the selected architecture is then evaluated, and the result is passed
back to the search strategy. NAS predicts hyperparameters, such as the height and
width of filters and strides, the number of filters, and skip connections. Each
prediction is performed using a Softmax activation and passed to the next layer.
Additionally, NAS models have been developed with reinforcement learning. Here too,
an RNN is used as the controller to sample the search space. After training and
evaluating the sampled architecture, the result is used to update the controller using
gradient-based methods, and the process is iterated until convergence or timeout.
The process of NAS is shown in Figure 6.13 and can be listed as follows.
Neural networks discard some features from the model based on their import-
ance to make decisions. The NAS process discards features from the search pipeline
training with hidden layers to identify their importance. Further, it is important to
consider model compression when designing NAS. Approaches, such as quantiza-
tion, pruning, and knowledge distillation, have been used to compress the models.
to be solved. Thus, although this may lead to human bias, it can simplify the search
process and deliver timely results. NAS includes a reinforcement learning-based
approach, which operates in a discrete space, and a gradient-based approach, which
operates in a continuous space.
The search space can be considered in different types as follows.
This covers graphs that represent an entire neural architecture. This search
for all possible combinations of operations results in an expensive search
space. It combines operations to form a chain or sequence of architectures.
The associated parameters are the number of layers, operation type, and the
associated hyperparameters. Skip connections are also added to allow multi-
branch architectures.
convolutions and dilated convolutions where the parameters determine the condi-
tional space. Multi-branch networks and cell-based search are widely used in image
classification applications.
The search space can be either coarse-grained, where a block consists of many
layers, or fine-grained, where a block is a single layer with a given kernel size. The
architectural search is more adaptable with the finer-grained procedures, enabling the
generation of more complex designs. However, the search will take longer to
converge as a result, and preliminary results might not be ideal. As long as the search
strategy starts with a sound initial architecture, a coarser search space can leverage
‘neural blocks’ based on the domain’s current knowledge to get better predictions
with fewer iterations. However, in the long term, a finer-grained search space
is more flexible than the neural blocks, as the finer-grained process can
independently find such neural blocks and subsequently enhance them.
• Grid search: the simplest form of search strategy, where the search space is
screened systematically.
• Sequential model-based optimization: select a model by iteratively exploring
the suitable models.
• Random search: select architectures at random from the search space and test
them according to the performance estimation method.
• Bayesian optimization: useful in hyperparameter search. Uses probability dis-
tribution and observations from tested architectures.
• Evolutionary or genetic algorithms: existing models may produce additional
architectures or be eliminated from the search space.
• Reinforcement learning.
• One-shot architecture search: trains a supermodel that includes all other
configurations in the search space using parameter sharing.
The search strategies can be categorized into black-box optimization strategies or
differentiable architecture search strategies.
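The interplay between the search space, the search strategy, and the performance
estimation can be sketched with random search, one of the simplest black-box
strategies. Everything in this sketch is hypothetical: the search space, the option
values, and in particular estimate_performance, which is a made-up scoring function
standing in for the training and validation of each candidate architecture.

```python
import random

# Hypothetical search space: each key is one architectural choice.
SEARCH_SPACE = {
    "num_layers": [2, 4, 6],
    "kernel_size": [3, 5, 7],
    "num_filters": [16, 32, 64],
    "skip_connection": [False, True],
}


def sample_architecture(rng):
    """Search strategy (random search): pick one option per choice at random."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def estimate_performance(arch):
    """Stand-in score. A real NAS run would train the model and measure validation accuracy."""
    score = 0.5 + 0.03 * arch["num_layers"] + 0.001 * arch["num_filters"]
    score -= 0.01 * abs(arch["kernel_size"] - 5)
    if arch["skip_connection"]:
        score += 0.05
    return score


def random_search(num_trials, seed=0):
    """Sample architectures, evaluate each one, and keep the best seen so far."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search(num_trials=20)
print(best_arch, round(best_score, 3))
```

More elaborate strategies, such as Bayesian optimization or reinforcement learning,
replace sample_architecture with a procedure that uses the scores of earlier trials
to choose the next candidate.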
This is a method for efficient architecture search that makes a continuous search space,
enabling it to be optimized using gradient descent. This combines both the search and
evaluation stage into one. Compared with black-box optimization strategies, the main
• FBNet
candidate building blocks can be chosen by using Gumbel Softmax random sampling
and differentiable training. It also uses latency to optimize search results
for each target device. FBNet utilizes a basic type of image model block inspired
by MobileNetv2 that utilizes depthwise convolutions and an inverted residual
structure. At each searchable layer, the diverse candidate blocks are placed side
by side, leading to sufficient pretraining of the supernet. The pretrained supernet
is further sampled for fine-tuning of the subnet, to achieve better performance.
FBNet uses a cross-entropy loss to improve accuracy and a latency loss to optimize
the search results on a target device.
• PC-DARTS
• Validation accuracy: although this is a simple method, it may require more time
and more computations in scenarios such as large datasets, large search space,
and models with more layers. This can be addressed by strategies such as low-
reliability measures achieved by model training with a small number of epochs
or a subset of data.
• Measurement of low reliability: use a small set of epochs or data to train and
reduce the model size.
• The exploitation of the learning curves: this can be used to measure the model
performance without completing the training cycle. The efficiency of the search
process can be increased by discarding the models that produce low predictions
during the first few epochs.
• Efficient NAS method: uses subgraphs to explore the search space, which yields
better results and reduces the computational cost.
• Weight sharing one-shot models: the sub-models use the weights from the
supermodel.
• Train a substitute model to predict the model performance based on
characteristics derived from other new architectures.
• Training with warm-start: use the supermodel weights for the weight
initialization.
6.8 ADVERSARIAL TRAINING
6.8.1 Overview of Adversarial Training
At present, there are many support tools, libraries, and online services to apply
deep learning capabilities to applications easily without a thorough knowledge of
machine learning. As with many systems, these models can face the threat of
adversarial attacks, which is an important concern for machine learning applications.
Adversarial machine learning aims to mislead the learning process by giving
misleading input. It addresses the creation of attacks on machine learning algorithms
and the detection of adversarial examples, providing defensive mechanisms against
such attacks. Adversarial attacks are mainly found in spam detection and image
classification tasks.
Unlike other security threats that programmers are accustomed to dealing
with, adversarial attacks are unique. The traditional cybersecurity environment has
developed to handle a variety of software threats. There are many static and dynamic
analysis tools to identify and fix security bugs in software. Compilers detect and
identify unsuitable and possibly harmful source code. Unit testing verifies the
responses of the functions to various inputs. Anti-malware software and other endpoint
solutions can scan the browser and the computer’s hard disk for malicious files and
block them. Web application firewalls can check and deny malicious requests to web
servers. Code and app hosting platforms also have built-in security applications.
TABLE 6.1
Types of Adversarial Attacks
Table 6.1 states the impact of each adversarial attack on the dataset, training, and
testing phases. The symbol * denotes that there is no access to the internal informa-
tion of the model and is considered a black-box attack setting. Figure 6.16 shows the
process behind these attacks.
REVIEW QUESTIONS
1. Explain how we can apply data augmentation for non-image data. Consider
data types, such as tabular data, and multimodality imaging data such as MRI,
fMRI, etc.
2. Explain the importance of normalization and how we can improve the model
performance by normalizing the dataset.
• True positive (TP): both the ground truth and the predicted output are positive.
A true positive is a correctly predicted positive value, which counts the
instances whose actual output is true and whose predicted output is also true.
As an example, suppose we have a classification problem to identify
whether a patient has a given disease or not. The true positive value will be the
number of patients who have the disease (positive) and are predicted as patients
with the disease (positive).
• False positive (FP): the ground truth is negative, and the predicted class is
positive.
A false positive is an instance predicted as true when it is actually false (type
I error). It is considered a ‘false alarm’. For example, the number of patients
who are healthy (negative) but are predicted as patients with the disease
(positive).
• True negative (TN): both the ground truth and the predicted output are negative.
A true negative is a correctly predicted negative value, which counts the
instances whose actual output is false and whose predicted output is also
false. For example, the number of patients who are healthy (negative) and
are predicted as healthy (negative).
• False negative (FN): the ground truth is positive, and the predicted class is
negative.
A false negative is an instance predicted as false when it is actually true (type II
error). For example, the number of patients who have the disease (positive)
but are predicted as healthy (negative).
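The four counts above can be obtained from paired lists of ground-truth and predicted
labels. The following is a minimal sketch; the function name confusion_counts and the
sample labels (1 for a patient with the disease, 0 for a healthy patient) are
assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # has the disease, predicted as having the disease
        elif pred == positive:
            fp += 1   # healthy, but predicted as having the disease (type I error)
        elif truth == positive:
            fn += 1   # has the disease, but predicted as healthy (type II error)
        else:
            tn += 1   # healthy, predicted as healthy
    return tp, fp, tn, fn


# Hypothetical labels for eight patients.
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 2 1 3 2
```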
Deep learning performance assessments involve some level of trade-off, where
characteristics that enhance one aspect of performance degrade another. We will
discuss the trade-offs later in this section.
7.2.2 Accuracy
Accuracy is the most intuitive measure, which is the ratio of correctly predicted
instances to the total number of instances, as given in (7.1). This works well for
symmetric datasets with almost the same number of FP and FN values. For
datasets where the class distribution is not balanced, accuracy cannot differentiate
between the FP and FN values. Accordingly, it calculates the percentage of correct
results that a classifier has achieved and indicates how well the classification
model predicts the class labels specified in the problem statement. For binary
classification, positive and negative concepts are utilized to measure accuracy.
The term error rate (ERR) is defined as the complement of accuracy, which indicates
the proportion of misclassified samples of the positive and negative classes, as in (7.2).
Let us consider the following example that shows the model results that classified
chest X-ray images as pneumonia (the positive class) or normal (the negative class).
$$\text{Accuracy} = \frac{1 + 90}{1 + 90 + 1 + 8} = 0.91$$
As shown in Example 7.1, the model accuracy is 91% (91 correct predictions out
of 100 total samples). In other terms, out of 100 chest X-ray samples, 91 are actually
normal (90 TNs and 1 FP) and nine are actually pneumonia (one TP and eight FNs).
Thus, the model correctly identified 90 out of 91 normal subjects.
But out of nine pneumonia cases, the model correctly identified only one as
pneumonia, which is not good, as eight out of nine pneumonia cases go undiagnosed.
Even though this model has good accuracy at first glance, another model that
always predicts normal would have the same accuracy on this set of samples.
That means that this model does not perform any better than a model with no ability
to distinguish between pneumonia and healthy participants.
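The comparison with a model that always predicts normal can be checked numerically.
The sketch below assumes the definition used in the worked example, namely
accuracy = (TP + TN) / (TP + FP + TN + FN), with the error rate as its complement. The
function name accuracy is chosen for illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Ratio of correctly predicted instances to the total number of instances."""
    return (tp + tn) / (tp + fp + tn + fn)


# Counts from the chest X-ray example: TP = 1, FP = 1, TN = 90, FN = 8.
tp, fp, tn, fn = 1, 1, 90, 8
model_accuracy = accuracy(tp, fp, tn, fn)
error_rate = 1 - model_accuracy

# A baseline that always predicts "normal": all 91 normal cases become TNs
# and all 9 pneumonia cases become FNs.
baseline_accuracy = accuracy(tp=0, fp=0, tn=tn + fp, fn=fn + tp)

print(round(model_accuracy, 2), round(error_rate, 2), round(baseline_accuracy, 2))
```

Both the model and the trivial baseline reach an accuracy of 0.91, which is why
accuracy alone is misleading on this dataset.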
EXAMPLE 7.1
Chest X-Ray Classification
Therefore, accuracy alone does not reflect the entire performance on a
class-imbalanced dataset, which has a considerable difference between the numbers
of positive and negative instances. Better metrics than accuracy, such as
precision and recall, are needed to evaluate class-imbalance problems.
$$\text{Precision} = \frac{TP}{FP + TP} \tag{7.3}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{7.4}$$
While improving recall will reduce the amount of false negatives, maximizing preci-
sion will reduce the number of false positives. When minimizing false positives is the
main goal, precision is appropriate.
Let us calculate the precision and recall for Example 7.1, the chest X-ray
classification. Here, we have TP = 1, FP = 1, FN = 8, and TN = 90. Therefore,
$$\text{Precision} = \frac{1}{1 + 1} = 0.5 \quad \text{and} \quad \text{Recall} = \frac{1}{1 + 8} \approx 0.11$$
Precision measures the percentage of correct predictions among everything that has
been predicted to be positive.
Recall can be interpreted as follows.
• How many instances are found to be truly positive by the model, among all the
actual positive instances.
• Even if they might mistakenly classify some negative examples as positive, a
model with high recall does a good job of discovering all the positive instances
in the data.
• All or a significant portion of the positive instances in the data cannot be found
by a model with low recall.
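Equations (7.3) and (7.4) translate directly into code. This is a minimal sketch; the
function names and the convention of returning 0.0 when the denominator is zero are
assumptions made for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive, as in (7.3)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model finds, as in (7.4)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Chest X-ray example: TP = 1, FP = 1, FN = 8.
p = precision(tp=1, fp=1)
r = recall(tp=1, fn=8)
print(round(p, 2), round(r, 2))  # 0.5 0.11
```

The low recall shows that the model misses most pneumonia cases, even though half of
its positive predictions are correct.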
Precision and recall measure performance better when the data is unbalanced,
because they consider the different types of errors (FP and FN) that the model
generates. For a complete evaluation of the model, both precision and recall should
be evaluated. However, improving precision typically reduces recall and vice versa.
Because of this tension between precision and recall, some metrics, such as the
F1-score, have been developed that rely on both metrics.
7.2.4 F-Measure
F-measure or F1-score offers a score that strikes a compromise between recall and
precision by taking their harmonic mean. It is a statistical metric used
to evaluate performance. In other words, the F1-score summarizes the performance
in terms of recall and precision and can be defined as in (7.5). As
a result, the formula accounts for both FP and FN. The F1-score is in the [0, 1]
range. It reflects both the completeness of the model, in that it does not miss
positive instances, and the precision of its positive predictions.
$$\text{F1-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{7.5}$$
Generally, when the class distribution is uneven, accuracy will not be a good
measurement. In such scenarios, it would be better to consider both precision and
recall. Thus, the F1-score works better than accuracy to figure out how well the model
performed. High precision with low recall produces very accurate positive
predictions, but many instances that are challenging to identify are missed. The
performance of the model improves with increasing F1-score. Thus, class imbalance
problems can be addressed by the F1-score, which is based on the
type and number of prediction errors.
Points to note:
A model will obtain a high F1-score if both precision and recall are high.
A model will obtain a low F1-score if both precision and recall are low.
A model will obtain a medium F1-score if one of precision and recall is low and
the other is high.
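The behaviour described in these points can be checked with a short sketch of (7.5).
The function name f1_score and the second set of counts are hypothetical; the first
set comes from the chest X-ray example.

```python
def f1_score(tp, fp, fn):
    """F1-score computed from precision and recall, as in (7.5)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Chest X-ray example: precision = 0.5 and recall is about 0.11.
chest_xray_f1 = f1_score(tp=1, fp=1, fn=8)

# Hypothetical model with precision = recall = 0.9.
balanced_f1 = f1_score(tp=90, fp=10, fn=10)

print(round(chest_xray_f1, 3), round(balanced_f1, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, the low
recall of the chest X-ray model keeps its F1-score low despite a precision of 0.5.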
$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P} = TPR \tag{7.6}$$
$$\text{Specificity} = \frac{TN}{FP + TN} = \frac{TN}{N} = TNR \tag{7.7}$$
$$FPR = \frac{FP}{FP + TN} = 1 - \text{specificity} \tag{7.8}$$
• When the threshold is high, the operating point moves toward (0, 0); the
specificity increases and the sensitivity decreases.
• When the threshold is low, the operating point moves toward (1, 1); the
specificity decreases and the sensitivity increases.
• The better-performing test will have a larger area under the curve (AUC), with
the curve nearer to the upper left corner.
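The movement of the operating point with the threshold can be traced with a small
sketch that builds the ROC points from (7.6) and (7.8) and integrates them with the
trapezoidal rule. The function names, labels, and scores are hypothetical, and the
sketch assumes that both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) pairs sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = positive) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points[0], points[-1], round(auc, 3))
```

The first point corresponds to the highest threshold and the last point to the
lowest, matching the two extremes described in the list above.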
Performance Evaluation Techniques 153
7.2.8 Cross-Validation
A deep learning model should be evaluated for effectiveness and accuracy in a
real-world situation using a distinct dataset. Cross-validation is used to test
whether a model performs well enough on unseen test data compared to the
training dataset. This assesses the model performance on unknown data.
Thus, cross-validation can be used to indicate underfitting or overfitting, and the
model generalizability to unseen data.
• Training dataset: candidate algorithms are trained on this to fit the model.
• Validation dataset: utilized to compare their performances and select the
best-performing model. In the process of fine-tuning model hyperparameters, it
offers an unbiased assessment of a model’s fit to the training dataset.
• Test dataset: performance metrics such as accuracy, sensitivity, specificity, and
F-measure are obtained using the test dataset. It offers an objective assessment
of how well a final model fits the training dataset.
In exhaustive cross-validation, the model is tested on every possible way of
splitting the initial dataset into training and validation sets.
Examples include leave-one-out cross-validation and leave-p-out cross-validation.
Non-exhaustive cross-validation, such as the hold-out approach and k-fold
cross-validation, does not evaluate every possible combination and permutation of
the original data set.
Cross-validation can be performed in several ways as follows.
• Hold-Out Cross-Validation
are not known and obtain various results for different sets. However, it might produce
misleading results, as everything is executed in a single run. Generally, this method
applies to large datasets and may not give good results for small datasets. With small
datasets, since there is not sufficient data, the validation process may suffer from
underfitting. Also, a small training dataset may miss important features in the
data, which increases the error induced by bias.
In order to determine the model’s overall efficacy, the error estimation is averaged
across all k trials. As a result, each data point will appear exactly once in the valid-
ation set and k-1 times in the training set. Accordingly, as shown in Figure 7.4, the
training and validation datasets are different in each iteration. Since the dataset is
divided into a training and a validation dataset iteratively, the following advantages
can be achieved.
– Avoids overfitting.
– Rotation estimation, or out-of-sample testing.
– Assess the generalizability of the result to an independent data set.
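The property that each data point appears exactly once in the validation set and k-1
times in the training set can be verified with a short sketch. The generator
k_fold_indices is written for illustration; it splits consecutive indices and omits
the shuffling that is usually applied before the split.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for training, validation in folds:
    print(validation, training)
```

In practice, the error of a model trained on each training part and evaluated on the
matching validation part is averaged across the k folds.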
Stratified K-fold keeps almost the same percentage of instances from each target
class in every fold as in the total set and addresses the imbalanced data problem,
as shown in Figure 7.5. Therefore, it is used when the distribution of the target
variable is not uniform, which may require binning the target variable.
• Leave-P-Out Cross-Validation
Leave-P-out cross-validation holds out p instances from a total of n data
instances. This results in n-p instances being utilized for training and the
remaining p instances being used as the validation set. This is carried out for
every possible combination of p instances drawn from the n instances. The
overall model efficacy is determined by taking the average error across all trials.
This method is exhaustive, since every potential combination must be trained
and validated, and for fairly high p, it may become computationally impractical. In
practice, p is set to 1 in most applications, which is known as leave-one-out
cross-validation. Since the number of viable combinations is then equal to
the number of data instances in the original sample, this variant is typically
selected as it requires less computational work.
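The growth in the number of trials can be made concrete with a short sketch. The
generator leave_p_out_splits is written for illustration and enumerates every way of
choosing p validation instances out of n; the dataset size of six is an arbitrary
choice.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n_samples, p):
    """Yield (training_indices, validation_indices) for every choice of p held-out instances."""
    all_indices = set(range(n_samples))
    for validation in combinations(range(n_samples), p):
        training = sorted(all_indices - set(validation))
        yield training, list(validation)


n = 6
num_lpo = sum(1 for _ in leave_p_out_splits(n, 2))  # leave-2-out: C(6, 2) trials
num_loo = sum(1 for _ in leave_p_out_splits(n, 1))  # leave-one-out: n trials

print(num_lpo, num_loo)  # 15 6
```

The number of trials equals the binomial coefficient C(n, p), which grows quickly
with p, whereas leave-one-out needs only n trials.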
7.2.9 Kappa Score
As a performance indicator for a deep learning model, Cohen’s kappa is a numerical
measure of agreement between two raters. It evaluates the level of agreement between
the classification model and the real-world observer, correcting for the agreement
expected by chance, as given in (7.9). The kappa score lies between -1 and +1. Values
of zero and one imply chance-level and complete agreement among raters, respectively.
When the score is negative, there is less agreement among raters than expected by
chance. This gives a convincing performance metric for the model when we have an
imbalanced dataset.
where
$$\text{Total accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
and
$$\text{Random accuracy} = \frac{(TN + FP)(TN + FN) + (FN + TP)(FP + TP)}{(TP + TN + FP + FN)^2}$$
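The two quantities defined above can be combined into a kappa score in code. The
sketch assumes the standard form of Cohen's kappa, (total accuracy - random accuracy)
/ (1 - random accuracy); the function name and the reuse of the chest X-ray counts
are choices made for illustration.

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa from confusion-matrix counts (standard form, assumed here)."""
    total = tp + fp + tn + fn
    total_accuracy = (tp + tn) / total
    random_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)


# Chest X-ray example: accuracy is 0.91, yet agreement beyond chance is weak.
kappa = cohen_kappa(tp=1, fp=1, tn=90, fn=8)
print(round(kappa, 3))
```

For the chest X-ray counts, the random accuracy is already 0.8936, so an accuracy of
0.91 leaves a kappa of only about 0.15.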
impacted to predict the output class. The final score depends primarily on the data in
the locations where this gradient is large. This does not need retraining or modifying
the existing architecture. Although Grad-CAM is class discriminative and capable
of localizing important regions of an image, it cannot emphasize fine-grained
information, as pixel-space gradient visualizations do. This can be addressed by guided
backpropagation, which suppresses the negative gradients when backpropagating
through the ReLU layers. Thus, it will capture pixels identified by the nodes.
The geometric mean (G-mean) is a metric that is used to compare the performance of
majority and minority classes. Poor performance of the classifier is indicated by a
low geometric mean, which is defined as in (7.11).
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} = \sqrt{\text{sensitivity} \times \text{specificity}} \tag{7.11}$$
• Likelihood Ratio
Likelihood ratio is used for diagnostic testing and depends on both sensitivity and
specificity. This ratio is used to determine how a test result affects the likelihood
of a condition. In diagnostic testing, not all positive findings are genuine
positives, and not all negative results are true negatives. Therefore, the likelihood
of having a disease is affected by both positive and negative outcomes. A greater
positive likelihood and a lower negative likelihood indicate superior performance on
the positive and negative classes, respectively, and are calculated as in (7.12) and
(7.13). Both positive and negative likelihoods are appropriate for balanced and
imbalanced datasets.
$$\text{Likelihood positive } (\rho^{+}) = \frac{\text{sensitivity}}{1 - \text{specificity}} \tag{7.12}$$
$$\text{Likelihood negative } (\rho^{-}) = \frac{1 - \text{sensitivity}}{\text{specificity}} \tag{7.13}$$
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7.14}$$
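The metrics in (7.11) to (7.14) can be computed together from the four confusion-matrix
counts. This is a sketch under stated assumptions: MCC is taken in its standard form
with a square root in the denominator, a zero denominator is mapped to 0.0 by
convention, and the inputs are assumed to give a specificity strictly between 0 and 1
so that the likelihood ratios are defined. The function name summary_metrics is
chosen for illustration.

```python
import math


def summary_metrics(tp, fp, tn, fn):
    """G-mean, likelihood ratios, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "g_mean": math.sqrt(sensitivity * specificity),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Chest X-ray example: TP = 1, FP = 1, TN = 90, FN = 8.
metrics = summary_metrics(tp=1, fp=1, tn=90, fn=8)
for name, value in metrics.items():
    print(name, round(value, 3))
```

For the chest X-ray counts, both the G-mean and the MCC are low, which reflects the
poor performance on the pneumonia class that the accuracy of 0.91 hides.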
Mean absolute error (MAE) calculates the average absolute difference between the
actual and predicted outputs. It measures the accuracy of the forecasts compared to
the actual results by averaging the absolute errors, as in (7.15). A lower MAE value
is better.
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{7.15}$$
where n is the total number of data points, y_i is the actual output, and ŷ_i is the
predicted output.
Mean squared error (MSE) calculates the average of the squared differences between
the actual and predicted outputs, as in (7.16). The MSE is non-negative, and values
closer to zero are better.
$$MSE = \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2 \tag{7.16}$$
where n, y_i, and ŷ_i denote the total number of data points and the actual and
predicted outputs, respectively.
Root mean square error (RMSE) calculates the square root of the mean squared error, as
given in (7.17). The disparities between the model-predicted values and the actual
values are quantified by this term, which is also known as the root mean squared
deviation. While MAE assigns all errors the same weight, the RMSE penalizes large
errors by giving errors with greater absolute values more weight than errors with
smaller absolute values. The RMSE is never less than the MAE when both metrics are
evaluated on the same data.
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2} \tag{7.17}$$
where n, y_i, and ŷ_i denote the total number of data points and the actual and
predicted outputs, respectively.
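The three regression metrics in (7.15) to (7.17) can be compared on the same data with
a short sketch. The function names and the sample values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, as in (7.15)."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, as in (7.16)."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, as in (7.17)."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical actual and predicted outputs.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print(mae(y_true, y_pred), mse(y_true, y_pred), round(rmse(y_true, y_pred), 4))
```

The single largest error (1.5) contributes most of the MSE, which is why the RMSE
exceeds the MAE on this data.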
Logarithmic loss indicates the closeness of the predicted probability to the
corresponding true value. Minimizing the log loss is equivalent to maximizing the
likelihood. It accounts for the uncertainty in model predictions by penalizing
incorrect predictions. This performs well for multi-class classification, and lower
values indicate better performance, as shown in Figure 7.7. Here, the model assigns a
probability to every class. This supports comparing the performance of two models.
Log loss lies in the range [0, ∞) and has no upper bound. Higher accuracy is indicated
by a log loss that is closer to zero, whereas a log loss that is further from zero suggests
lower accuracy. In general, the classifier is more accurate when log loss is minimized.
This can be defined as in (7.18) for two classes or (7.19) for M classes.
$$\text{Log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) \tag{7.18}$$
$$\text{Log loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \tag{7.19}$$
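Both forms of the log loss can be sketched as follows. The function names and the
sample probabilities are hypothetical, and the clipping constant eps is an assumption
added to avoid taking the logarithm of zero.

```python
import math


def binary_log_loss(y_true, probs, eps=1e-15):
    """Log loss for two classes, as in (7.18); probs are predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep p away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def multiclass_log_loss(y_onehot, probs, eps=1e-15):
    """Log loss for M classes, as in (7.19); rows are one-hot labels and class probabilities."""
    total = 0.0
    for y_row, p_row in zip(y_onehot, probs):
        for y, p in zip(y_row, p_row):
            total += y * math.log(min(max(p, eps), 1 - eps))
    return -total / len(y_onehot)


labels = [1, 0, 1, 0]
confident_correct = binary_log_loss(labels, [0.9, 0.1, 0.8, 0.2])
uncertain = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
confident_wrong = binary_log_loss(labels, [0.1, 0.9, 0.2, 0.8])

print(round(confident_correct, 3), round(uncertain, 3), round(confident_wrong, 3))
```

Confident correct predictions give a loss close to zero, uniform predictions give
log 2, and confident wrong predictions are penalized most heavily.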
• R Squared Error
R squared, or the coefficient of determination, shows how closely the model’s forecasts
match the actual data. This is calculated as in (7.20). R squared is easy to interpret
and is not affected by the scale of the actual output, but it does not provide any
insight into the magnitude of the prediction error.
$$\text{R squared} = 1 - \frac{\sum_{i=1}^{n} ( y_i - \hat{y}_i )^2}{\sum_{i=1}^{n} ( y_i - \bar{y} )^2} \tag{7.20}$$
where y_i, ŷ_i, and ȳ denote the actual values, the predicted values, and the mean of
the actual values, respectively.
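Equation (7.20) can be checked with a short sketch. The function name r_squared and
the sample values are hypothetical, and the actual values are assumed not to be all
identical, so that the denominator is non-zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, as in (7.20)."""
    mean = sum(y_true) / len(y_true)
    ss_residual = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_total = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_residual / ss_total


# Hypothetical actual and predicted outputs.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

r2 = r_squared(y_true, y_pred)
print(round(r2, 3))
```

A model that predicts every value exactly gives an R squared of 1, while a model that
always predicts the mean of the actual values gives 0.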
TABLE 7.1
Summary of Performance Metrics

| Technique | Description |
|---|---|
| Accuracy | Mostly used when all classes are equally important and the problem is balanced. |
| Precision | Becomes crucial for extremely imbalanced datasets with significantly more negatives than positives, because high accuracy is easy to obtain on them (always predicting negative would result in high accuracy). |
| Recall | Use when the cost of a false negative is high. |
| F1-measure | Use to find a balance between precision and recall when the class distribution is unequal, for example in a task with many TNs. |
| Mean absolute error (MAE) | Typically used when performance is evaluated on continuous data. It weights the individual differences evenly and averages them, giving a linear score. The lower the value, the better the model performs. |
| Mean squared error (MSE) | Used to assess regression models that predict a numeric dependent (target) variable from independent variables. |
| ROC | Do not use when the data is heavily imbalanced. |
| AUC | Use when the positive and negative classes matter equally. |
| PR-AUC | Use when the data is imbalanced and the positive class matters more than the negative class. |
| Hold-out method | Requires only one train-test split. As a result, the score depends on how the data is divided into the train and test sets. This method is useful when the dataset is very large. |
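The point made for precision in Table 7.1, that always predicting the negative class gives high accuracy on an imbalanced dataset, can be checked with a short sketch. The function name and the counts (990 negatives, 10 positives) are hypothetical values chosen for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Ratios with an empty denominator are reported as 0.0 (an assumed convention)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# A model that always predicts "negative" on 990 negatives and 10 positives
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative)
```

Accuracy is 0.99 here although no positive instance is found, which is why recall, precision, and F1 are more informative in this setting.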
REVIEW QUESTIONS
1. What is the ROC curve and what does it represent?
2. When there are many false positives or false negatives, how does it impact
the model?
3. What is overfitting and what are the methods that can be used to prevent it?
4. When the value of k becomes larger in K-fold cross-validation, can it result
in overfitting or underfitting? Explain your answer.
5. Explain the confusion matrix concerning machine learning algorithms.
6. What is the best way to measure performance improvement on imbalanced
datasets?
7. What are the pain points of Cohen’s Kappa?
Appendix – Frequently Asked Questions
1. What are hyperparameters in a deep learning model? Why are they
important?
Learning rate: a common approach is to start training with a small learning
rate, increase it exponentially after every batch, and plot the loss against
the learning rate. A suitable learning rate lies near the point where the loss
decreases fastest, which can be read from the graph.
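The learning rate sweep described above can be sketched on a toy problem. This is only an illustration under stated assumptions: the one-parameter linear model, the synthetic data, the sweep range, and the name `lr_range_test` are all hypothetical, and in practice the recorded loss curve is usually smoothed before it is read.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=512)  # synthetic target, true weight 3.0

def lr_range_test(X, y, lr_min=1e-4, lr_max=1.0, n_steps=50, batch_size=32):
    """Increase the learning rate exponentially per batch and record the loss."""
    w = 0.0
    # Exponential schedule from lr_min to lr_max over n_steps batches
    lrs = lr_min * (lr_max / lr_min) ** (np.arange(n_steps) / (n_steps - 1))
    losses = []
    for lr in lrs:
        idx = rng.integers(0, len(X), size=batch_size)  # random mini-batch
        xb, yb = X[idx, 0], y[idx]
        err = w * xb - yb
        losses.append(np.mean(err ** 2))                # batch MSE before the update
        w -= lr * 2.0 * np.mean(err * xb)               # gradient step on the MSE
    return lrs, np.array(losses)

lrs, losses = lr_range_test(X, y)
steepest = int(np.argmin(np.diff(losses)))  # batch with the largest drop in loss
print("learning rate at the fastest decrease:", lrs[steepest])
```

Plotting `losses` against `lrs` on a logarithmic axis gives the graph described in the text.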
The number of epochs: this indicates the number of complete passes over the
training data, during which the weights are updated. Training with a larger
number of epochs, such as 50, 100, 150, or 200, gives the model more
opportunity to converge. Where the number of epochs cannot be fixed in
advance, it is better to use early stopping: each network is trained for an
initial number of epochs and the training MSE is observed to see whether it
has settled at a minimum; if not, the number of epochs can be increased.
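One simple way to express the early stopping rule is a patience counter over the recorded loss values. This is a sketch, not the only formulation: the function name, the `patience` and `min_delta` parameters, and the loss history are assumptions for illustration.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the loss has failed to improve on the best value
    by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best loss, reset the counter
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses)       # never triggered: all epochs were used

# Hypothetical loss history, one value per epoch
history = [0.90, 0.60, 0.45, 0.40, 0.41, 0.40, 0.42, 0.43]
print(early_stopping_epoch(history, patience=3))  # 7
```

The best loss is reached at epoch 4, and the three following epochs do not improve on it, so training stops at epoch 7.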
Batch size: every model responds differently to different batch sizes. For
instance, with GPU acceleration, training can become faster in wall-clock time
as the batch size increases, until the GPU load is saturated. Decreasing the
batch size can also affect performance, either positively or negatively, if the
network has BatchNorm layers.
Optimizers: techniques such as Adam and SGD can bring better results if
combined with a good learning rate and an annealing schedule, which adjusts
the learning rate during training. Hence, it is better to try different
optimizers, such as Adam, RMSProp, and SGD, and observe the results.
Loss function: a widely used loss function for classification is cross-entropy
loss, also called log loss. It measures the performance of a classification
model whose predicted output is a probability value between 0 and 1, and it
decreases as the predicted probability converges to the actual label.
Activation function: generally, ReLU is a good choice for hidden layers.
When it leads to the dying ReLU problem, modifications such as leaky
ReLU, ELU, and SELU can be used. For the output layer, sigmoid is the usual
choice for binary classification and Softmax for multiclass classification.
However, functions such as sigmoid and tanh tend to suffer from vanishing
gradients when used in hidden layers.
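The activation functions named above can be written directly in NumPy. This is a minimal sketch: the leaky ReLU slope `alpha=0.01` is an assumed default, and the helper `sigmoid_grad` is added only to illustrate how small the sigmoid gradient becomes for large inputs.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Keeps a small slope for negative inputs, so the unit cannot "die"
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # at most 0.25, and close to 0 for large |x|

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), leaky_relu(x))
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
print(softmax(x))
```

The sigmoid gradient never exceeds 0.25, so multiplying many such factors during backpropagation shrinks the signal, which is the vanishing gradient effect mentioned in the text.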
Dropout rate: as a rule of thumb, we divide the number of nodes in the layer
before dropout by the proposed dropout rate (taken here as the probability of
retaining a node) and use the result as the number of nodes in the new network
that uses dropout. For example, a network with 100 nodes and a proposed
dropout rate of 0.5 will require 200 nodes (100/0.5) when using dropout.
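The sizing rule and the dropout operation itself can be sketched as follows. The function names are hypothetical, and the mask-and-rescale form shown here ("inverted" dropout) is one common way to apply dropout during training, assumed for illustration rather than taken from the text.

```python
import numpy as np

def widened_layer_size(n_nodes, rate):
    """Rule of thumb from the text: number of nodes divided by the proposed rate."""
    return int(round(n_nodes / rate))

def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

widened = widened_layer_size(100, 0.5)
print(widened)  # 200

rng = np.random.default_rng(0)
dropped = dropout(np.ones((4, 5)), keep_prob=0.5, rng=rng)
print(dropped)  # entries are either 0.0 or 2.0
```

Rescaling by `keep_prob` keeps the expected activation unchanged, so no adjustment is needed at inference time.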
duration, it would be better to tune the learning rate as it is the most important
hyperparameter.
10. What is meant by overfitting?
Overfitting is a statistical modeling error that occurs when a function is
fitted too closely to a limited set of data points. Attempting to make the
model conform too closely to slightly inaccurate data can introduce
substantial errors into the model and reduce its predictive power.
11. How to prevent overfitting or underfitting.
Use approaches such as cross-validation, training with a large amount of data,
augmenting data instances, reducing complexity, and early stopping.
12. What is batch normalization?
Batch normalization standardizes the inputs to a layer for each mini-batch. It
is applied either to the activations of a prior layer or directly to the inputs.
It is used to accelerate the training process: it can reduce the number of
epochs required, and it has a regularizing effect that can reduce the
generalization error.
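The standardization step can be sketched as a training-time forward pass in NumPy. The scale and shift parameters `gamma` and `beta`, the constant `eps`, and the sample data are assumptions for illustration; running statistics for inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # hypothetical activations
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma` set to 1 and `beta` set to 0, each output feature has approximately zero mean and unit standard deviation across the batch.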
13. How to select a DL model for a given problem.
The selection of the model is based on different factors, such as the modality
of the dataset (image, video), the dimensionality of the data (e.g., text, 2D
images, 3D MRI, 4D fMRI), and the feature types (aggregations, linear, time
series). A set of candidate architectures can be trained and their results
compared to select the best one. Some features of CNN-based models are listed
below.
ResNet: incorporates skip connections between layers and utilizes
batch normalization to normalize the input of activation functions. These
design choices make it possible to train an exceptionally deep neural
network efficiently.
VGG & AlexNet: require large memory and computational power
due to the large parameter space associated with many hidden layers.
Inception family models (GoogLeNet, Inception, Xception): provide
efficient models.
GoogLeNet: good accuracy with a low memory requirement.
Xception: the accuracy is similar to ResNet. It is complicated to modify.
MobileNet: used to run on mobile devices. It uses few parameters,
requires less memory, and is computationally simpler. MobileNet
gives a good trade-off between model size and accuracy.
14. What are the pretrained datasets?
ImageNet
MNIST
PaHaW dataset
AlexNet-ImageNet
CIFAR-10
The training and testing datasets should be balanced, representing all
potential classes with a nearly equal distribution.
Generalize the accuracy by dividing the entire data instances into training,
testing, and validation in the ratio.
Use cross-validation techniques.
21. What is the reason for getting higher testing errors than training errors?
This indicates that the model is overfitting. Regularization techniques such
as early stopping and dropout, together with cross-validation, can be used to
reduce overfitting. Other possible techniques are data augmentation, removal
of unimportant features, and model ensembling.
22. Distinguish the training, testing, and validation datasets.
The training dataset is used to fit the model.
The validation dataset is used for a fair assessment of a model fitted on the
training dataset while its hyperparameters are being tuned. Validation
datasets contain samples that differ from the training samples. The model can
still be adjusted at this point.
A test dataset is a distinct sample that offers a final, unbiased assessment
of a model's fit. Its inputs are similar in nature to those used in the earlier
stages, but they are not the same data.
23. What are the best split ratios?
Common split ratios for training, validation, and testing are:
70% training, 15% validation, 15% testing
80% training, 10% validation, 10% testing
60% training, 20% validation, 20% testing
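A split such as 70/15/15 can be produced by shuffling the sample indices once and slicing them. The function name, the fixed seed, and the sample count are assumptions made for this sketch.

```python
import numpy as np

def split_indices(n_samples, train=0.7, val=0.15, seed=0):
    """Shuffle the indices once and cut them into train, validation, and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    # Whatever remains after the train and validation parts becomes the test set
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Because the three index sets are slices of one permutation, no sample appears in more than one set.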
24. What are the platforms that we can use to implement DL models?
The choice depends on the model size, dataset size, and batch size of the
training procedure.
Jupyter notebook: difficult to work with models that require high
computational power.
Tensor Hub
Google Colab: has memory limitations for long-running jobs. For example, it
may not work well with models such as Mask R-CNN or with BERT pretraining
tasks, which need more memory and training time.
25. What are the frameworks that we can use to implement DL models? How do
we select a framework to use? What are the conditions and features we should
consider?
Keras is used when developers seek a plug-and-play framework that enables
them to design, train, and evaluate their models rapidly. It offers more
deployment options and easier model export. However, PyTorch is faster than
Keras and has better debugging capabilities.
PyTorch is a better framework for developing DL models for research due
to its compatibility with parallel processing APIs. In contrast, TensorFlow
29. How to decide the trade-off between model throughput and accuracy.
In machine learning, throughput is a metric used to assess how well various
models perform in a given application. It refers to the quantity of data units
processed in a certain amount of time.
Model accuracy is the ratio of the number of predictions a model gets right to
the total number of predictions it produces. It is a way of rating a model's
effectiveness.
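Both quantities can be measured in the same evaluation pass, which makes the trade-off between candidate models easy to tabulate. In this sketch the helper `evaluate`, the toy `threshold_model`, and the synthetic data are hypothetical stand-ins for a real model and dataset.

```python
import time
import numpy as np

def evaluate(predict_fn, X, y):
    """Return (accuracy, throughput in samples per second) for one pass over X."""
    start = time.perf_counter()
    preds = predict_fn(X)
    elapsed = time.perf_counter() - start
    accuracy = float(np.mean(preds == y))  # correct predictions / all predictions
    throughput = len(X) / elapsed if elapsed > 0 else float("inf")
    return accuracy, throughput

def threshold_model(X):
    # Toy stand-in for a trained model: predicts class 1 for positive inputs
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
y = (X[:, 0] > 0).astype(int)

accuracy, throughput = evaluate(threshold_model, X, y)
print(f"accuracy={accuracy:.3f}, throughput={throughput:.0f} samples/s")
```

Running the same helper over several candidate models gives one accuracy and one throughput figure per model, from which the acceptable compromise for the application can be chosen.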
30. How to select an evaluation technique?
Most supervised machine learning problems fall into one of two categories,
classification or regression. The evaluation metric to apply depends on the
category.
Classification metrics: confusion matrix, accuracy, precision, recall, and
F-score.
Regression metrics: explained variance, mean squared error, and the R²
coefficient.
Common metrics: learning curves and validation curves.
181
Index
Note: Figures are indicated by italics. Tables are indicated by bold.
A
accuracy 4, 34, 37, 62–4, 69, 71–2, 82, 88, 99–100, 105, 112–15, 119, 123, 132, 136–7, 139–41, 147–51, 153, 157, 159–60, 162
activation function 16, 18, 20, 23–9, 34, 39–40, 44–5, 54, 56, 59–61, 66, 71, 123
AdaDelta 130–1
adagrad 130–1
ADAM 131–2
adversarial training 140, 143–4
asynchronous computation 13
agent 91–6
aggregation 97, 100, 105–6
ANN 42, 44–5, 56–8, 73, 76
API 13, 145
argmax 107, 138
artificial intelligence 1–3
attacks 140–5
attention 50, 79–82
AUC 147, 152–3, 162
augmentation 36, 63, 112, 119–20, 137
autoencoder 76–8
axon 16
B
backpropagation 23, 25–6, 38–9, 47–8, 50, 56, 61, 63, 72, 128, 158
bagging 105–7, 114
batch size 22–3, 121, 123
batch-norm layer 50
bias 6–11, 16, 18–20, 22, 24–5, 30, 33–5, 37–8, 44, 57, 66, 102, 105–7, 125, 155
binning 112, 156
boosting 106–8, 114, 117
bootstrap 105–6, 114
brain cells 16
C
capsule 73–6
centralize 96, 100–1
chain rule 38
classification 11, 21–30, 37–8, 42–5, 48, 55–8, 61, 66, 71–2, 89–90, 108–10, 119, 140–1, 147–61, 165
CNN 42, 48–58, 62, 66–7, 71–3, 81–2, 103
collaborative learning 96–7
complexity 8–11, 13, 33, 35–7, 40, 55, 58, 67, 78, 105, 110, 116, 121, 124
computational load 66
computer vision 48, 57, 61–2, 72, 86, 89, 102, 130
confusion matrix 148, 162
convergence 18, 23, 28, 39, 127–8, 133, 136
convolution 43, 48–9, 51–4, 66–71, 82, 103, 122, 135
cross entropy 30, 77, 139
cross validation 36, 105, 115–16, 125, 153–4, 156
D
DARTS 135, 137–9
data partition 100
data poisoning 141–2
decentralize 100–1
decision boundary 42, 60, 70, 103, 110, 143
decision-making 1–2, 11, 16, 91, 94, 141
decoder 74–82, 135
dense layer 48, 50, 54–5, 70, 114
DenseNet 69–70
depthwise convolution 69–70, 139
dimensionality reduction 69, 76, 78, 113
discriminative model 62–3
distributed features 72
drift 90
dropout 32, 36, 38, 50, 62, 69, 72, 85, 118–19, 124
dual loss 74
dynamic computation 14, 91, 140
dynamic programming 94
dynamic routing 73–5
E
early stopping 35–7, 87, 117–18
elastic net 117
encoder 74–8, 81–3, 90
ensemble 100, 103–10, 114
epoch 25, 32, 38, 87, 98, 101, 114, 123, 125
evaluation 33, 94, 100, 137, 147, 151
evolution 3–4, 114, 137
evolutionary algorithm 114, 137
evasion 142
exploding 23–4, 26, 38–40, 47–8, 56–7, 67, 133
F
FBNet 135, 138–9
feature clipping 122
feature extraction 37, 42, 48, 56, 58, 66, 73, 84, 88
feature selection 21, 42, 113, 143
federated learning 96–102
feedforward 23, 38, 58, 60, 76
feature engineering 5–6, 42, 113
filter 48, 51–3, 66, 69–72, 110
flatten 49–51, 55, 83
F-measure 151, 154
freeze 85–6
fully connected 28, 49–51, 54, 56, 61, 71, 76, 85, 89, 103, 132, 134–5
fusion 50, 103, 109–10
G
GAN 62–4, 112, 119
Gaussian 24, 27, 78
generalization error 10, 33, 103, 105–6
generalize 8, 35, 37, 115–16, 132
geometric mean 158
global features 72, 97–8, 101, 133
GoogLeNet 67–9
Grad-CAM 157–8
gradient descent 24–6, 34, 39, 94, 115, 121, 125, 127, 30, 137–8
H
hardware agnostic 139
Hebbian principle 66
heterogeneous 96, 100, 102–3
hidden layer 3–4, 20, 22–3, 26, 28, 36, 39, 43, 45, 50, 52, 58–61, 88, 119, 123–4, 134, 165
hidden patterns 12
high computation 3, 72
hold-out validation 106, 154, 155, 162
Huber loss 30, 32
human brain 2, 16, 43
hyperparameter tuning 23, 112–13, 119, 123–4
hyperparameters 22, 23, 32, 33, 36, 76, 86, 88, 118, 122–5, 134–5, 154
I
image captioning 45
image classification 17, 45, 48, 57, 71–2, 90, 119, 133, 136–7, 141, 157
ImageNet 17, 75, 86
inception 66–7, 69, 71, 72, 86
inductive 82, 87–8
inference 13, 142, 145
information loss 51, 73, 79
J
Jupyter 14
K
Kappa score 157, 162
Keras 13–14
kernel 23, 48, 51–4, 56, 71–2, 74, 82, 132–3, 136
knowledge distillation 134
L
latent space 76, 119
layer tuning 124
Leaky ReLU 26–7, 165
learning rate 22–3, 29, 30, 34, 39, 87, 114, 123, 125, 127–8, 130–1
life cycle 5–6
likelihood 81, 158–60
local minima 31, 33–4, 72, 123, 125, 128
log scaling 122
logarithmic loss 160, 62
loss function 13, 23–4, 26, 29–34, 63–4, 74–5, 77–8, 116–17
low-level features 4, 48, 55, 72–4, 84
LSTM 48, 79–81, 133
M
majority voting 107, 109
mapping function 6, 11, 21, 106
marginal loss 75
matrix 4, 13–14, 23–4, 50–4, 74, 82
Matthew’s correlation 159
Max rule 109
maxima 125, 126
mean absolute error 30–1, 159, 162
mean squared error 10, 21, 30, 77, 160, 162
meta-classifier 107–8
meta-learning 119, 133
minima 27, 29, 31, 33–4, 38–9, 66, 72, 123, 125, 126, 128–9
mixture of experts 108–9
MobileNet 70, 139
model error 10, 18–19, 29–30, 32–4, 37, 47, 56, 61, 66, 77, 88, 103, 105–6, 119, 125, 147–9, 154–6
model training 6, 8, 13, 22, 33, 35–40, 47, 66, 87, 89, 97–9, 102, 106–7, 112, 114–15, 118–19, 130, 140, 142, 145, 155
momentum 33, 123, 127, 129–31
moving average 126–27, 129–32
multi scale 66, 102, 113, 120
multidimensional data 50
multi-layer perceptron 46, 61–2
N
NAS 132–40
negative transfer 90
neuron 4, 16, 18, 20, 23–8, 32, 42, 47–8, 56, 59–61, 66, 71, 73, 121, 123–4, 130
neuroscience 16, 59
NLP 46, 57, 82, 86, 89–90, 101–2, 130
normal distribution 7, 27
normalization 34, 49, 60, 71, 113, 120–4, 129, 139
NumPy 14, 50
O
one-shot learning 89, 115, 137, 140
optimization 23–4, 34, 50, 64, 66, 99, 123, 125, 127–33, 135, 137–9, 143–4
optimal 4, 10, 29, 34–5, 37, 51, 59, 63–4, 91, 94, 96, 110, 112, 123, 125, 127–9, 132, 134–5
outlier 6–7, 19, 31–2, 112, 116, 122
overfit 8–11, 21, 32–8, 50, 62, 67, 69, 72, 87, 90, 112, 115–17, 119, 123–5, 132, 137, 153, 155
P
padding 51
Pandas 14
parameter sharing 47–8, 53, 56–7, 98–9, 115
PC-DARTS 138–9
perceptron 16, 44, 61–2
pipeline 6, 70, 134, 141, 145
pointwise convolution 69–70
policy 93–4
pooling 43, 49–51, 56, 58, 62, 66–7, 69, 72–3, 103, 105, 133, 135
precision 49, 103, 150–1, 162
prediction 4, 17–19, 23, 40, 46, 61, 74, 79, 90, 103, 105–10, 113–14, 122–3, 134, 137, 142, 149, 151, 153–4, 160–61
prediction variance 103, 106
pre-process 5–6, 14, 78, 112–13
pretrained model 14, 84–6, 88–90, 114, 139
privacy 96, 98–102, 141
probability averaging 109
problem-solving 11, 91, 93
pruning 36, 170
PyTorch 14, 98, 145
Q
quantization 134
R
recall 82, 150–2, 162
regression 11, 17–21, 25, 30–2, 35–6, 38, 42, 44, 105, 115–16, 119, 128, 159, 162
regularization 21, 23, 32, 35–8, 50, 69, 106, 115–17, 119, 124
reinforcement learning 91–6, 134–35, 137
ReLU 22, 24, 26–8, 40, 49, 56, 60, 64, 71, 85, 158, 165
ResNet 64–6, 72, 86
residual block 41, 64, 66
reward 91, 93–6
RMSprop 130–1
RNN 43, 45–8, 56–8, 62, 79–81, 134
robust 31, 73, 75, 103, 107, 110, 118, 132, 141, 144, 145
ROC 152–4, 158, 162
ROI 58
S
scheduler 94, 129
search space 125, 128, 131–40
security 99, 140–1
self-supervised 56, 77
sensitivity 23, 139, 148, 152, 154, 158, 159, 162
sequence transduction 78
sigmoid 20, 24, 27–8, 39–40, 42, 60, 71, 165
skew 7, 113
skip connection 64, 66, 132, 134–6, 139
SoftMax 20, 26, 28, 56, 69, 71, 82, 85, 134–5, 138–9, 165
spatial patterns 45, 48–50, 54, 57, 58, 62, 69, 72, 74–5, 82
specificity 139, 152, 154, 158–9, 162
speech recognition 3, 5, 45, 47, 89
stacking 48, 53, 76, 81–2, 103, 106–8, 114, 120, 133
stock forecasting 12, 45, 142
stride 51–2, 55, 72, 134
supervised 11–12, 21, 61, 77, 91, 93
T
tanh 24, 28–9, 56, 60, 71, 80, 165
temporal dependencies 48
tensor 13–14, 50–1, 138
Tensor Hub 14
TensorFlow 13–14, 50, 98
test error 9, 34
test set 17, 21, 33, 116, 162
threats 2, 140–1, 143–5
tool stack 12
trade-off 8–11, 99, 123, 125, 141, 148
train error 9, 34
training loss 35, 66, 87
transductive 78, 87–8
transfer bound 90
transfer learning 84–5, 87–90, 93, 114
transformation 4, 61, 79, 113, 120, 135
transformer 79, 81–3
U
underfit 6–10, 32–5, 37–8, 62, 116, 119, 123, 125, 153, 155
unlabeled data 12, 76, 88–9
unsupervised 11–12, 76, 87
V
validation error 33
validation set 33, 117, 134, 154–6
vanishing 23–6, 28–9, 38–41, 47–8, 56–7, 63–4, 66–7, 69, 71–2, 133, 165
variance 6–11, 23–5, 27, 30, 33–5, 37, 81, 103, 105–7, 113, 116, 120–3, 125, 132, 154–5
vector 13, 18–19, 30, 38, 45, 50, 59, 62–3, 73–5, 82, 121, 128–9
VGG 71–2, 86
vision transformer 82
W
weight averaging 106
weight initialization 23–4, 38, 40, 140
X
Xception 69, 86
Y
Youden’s index 158
Z
zero-shot learning 89