CHAPTER 1
Introduction to Machine Learning
Learning Objectives
At the end of this chapter, you will be able to:
Give a brief overview of machine learning (ML)
Describe the learning paradigms used in ML
Explain the important steps in ML, including data acquisition, feature
engineering, model selection, model learning, model validation and model
explanation, along with the roles of representation and search
1.1 Evolution of Machine Learning
Machine learning (ML) is the process of learning a model that can be used in
prediction based on data. Prediction involves assigning a data item to one of
the classes or associating the data item with a number. The former activity
is classification while the latter is regression. ML is an important and state-
of-the-art topic. It gained prominence because of the improved processing
speed and storage space of computers and the availability of large data sets
for experimentation. Deep learning (DL) is an offshoot of ML. In fact,
perceptron was the earliest popular ML tool and it forms the basic building
block of various DL architectures, including multi-layer perceptron
networks, convolutional neural networks (CNNs) and recurrent neural
networks (RNNs).
In the early days of artificial intelligence (AI), it was opined that
mathematical logic was the ideal vehicle for building AI systems. Some of
the initial contributions in this area like the General Problem Solver (GPS),
Automatic Theorem Proving (ATP), rule-based systems and programming
languages like Prolog and Lisp (lambda calculus-based) were all outcomes of
this view. Various problem solving and game playing solutions also had this
flavour. During the twentieth century, a majority of prominent AI
researchers were of the view that logic is AI and AI is logic. Most of the
reasoning systems were developed based on this view. Further, the role of
artificial neural networks in solving complex real-world AI problems was
under-appreciated.
However, this view was challenged in the early twenty-first century and the
current view is that AI is deep learning and deep learning is AI. The advent
of efficient graphics processing units (GPUs), platforms like TensorFlow and
PyTorch along with the demonstrated success stories of convolutional neural
networks, gated recurrent units and generative models based on neural
networks have impacted every aspect of science and engineering activities
across the globe.
So, ML along with DL has become a state-of-the-art subject. Artificial neural
networks form the backbone of DL.
A high-level view of AI is shown in Fig. 1.1. The tasks related to conventional
AI and to ML are shown separately. Here, ML may be viewed as dealing with more than just
pattern recognition (PR) tasks. Classification and clustering are the typical tasks of a
PR system. However, ML deals with regression problems also. Data mining is concerned with the
efficient organization of large volumes of data.
Fig. 1.1 A high-level view of AI
The typical background topics of AI are shown in Fig. 1.2.
Fig. 1.2 Background topics of AI
Note that data structures and algorithms are basic to both conventional and
current systems. Logic and discrete structures played an important role in
the analysis and synthesis of conventional AI systems. The importance of the other
background topics may be summarized as follows:
In ML, we deal with vectors and vector spaces, and these topics are
better appreciated through linear algebra. The data input to an ML
system may be viewed as a matrix, popularly called the data matrix. If there are n data
items, each represented as an l-dimensional vector, the corresponding
data matrix A is of size n × l. Linear algebra is useful in analysing the
weights associated with the edges in a neural network. Matrix
multiplication and eigen analysis are important in initializing the
weights of the neural network and in weight updates. It can also help
in weight normalization. The whole activity of clustering may be
viewed as data matrix factorization.
The role of probability and statistics need not be explained as ML is,
in fact, statistical ML. These topics help in estimating the distributions
underlying the data. Further, they play a crucial role in analysis and
inference in ML.
Optimization (along with calculus) is essential in training neural
networks where gradients and their computations are important.
Gradient descent-based optimization is an essential ingredient of any DL system.
Information theoretic concepts like entropy, mutual information and
Kullback-Leibler divergence are essential to understand topics such as
decision tree classifiers, feature selection and deep neural networks.
We will provide details of all these background topics in their
respective chapters.
1.2 Paradigms for ML
There are different ways or paradigms for ML, such as learning by rote,
learning by deduction, learning by abduction, learning by induction and
reinforcement learning. We shall look at each of these in detail.
1.2.1 Learning by Rote
This involves memorization in an effective manner. It is a form of learning
that is popular in elementary schools where the alphabet and numbers are
memorized. Memorizing simple addition and multiplication tables is another
example of rote learning. In the case of data caching, we store computed
values so that we do not have to recompute them later. Caching is
implemented by search engines and it may be viewed as another popular
scheme of rote learning. When computation is more expensive than recall,
this strategy can save a significant amount of time. Chess masters spend a
lot of time memorizing the great games of the past. It is this rote learning
that teaches them how to 'think' in chess. Various board positions and their
potential to reach the winning configuration are exploited in games like
chess and checkers.
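As a concrete illustration of rote learning as caching, the following minimal Python sketch memorizes previously computed results so that repeated calls are answered by recall rather than recomputation; the recursive Fibonacci function is only an illustrative stand-in for an expensive computation.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Results are cached (rote-learnt); repeated calls with the same
    # argument are answered from memory instead of being recomputed.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))   # computed once, intermediate results cached
print(fib(30))   # answered directly from the cache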
1.2.2 Learning by Deduction
Deductive learning deals with the exploitation of deductions made earlier.
This type of learning is based on reasoning that is truth preserving. Given A ,
and if A then B( A → B), we can deduce B. We can use B along with if B then
C (B → C) to deduce C . Note that whenever A and A → B are True, then B is
True, ratifying the truth preserving nature of learning by deduction.
Consider the following statements:
1. It is raining.
2. If it rains, the roads get wet.
3. If a road is wet, it is slippery.
From (1) and (2), we can infer using deduction that (4) the roads are wet.
This deduction can then be used with (3) to deduce or learn that (5) the roads
are slippery. Here, if statements (1), (2) and (3) are True, then statements
(4) and (5) are automatically True.
A digital computer is primarily a deductive engine and is ideally suited for
this form of learning. Deductive learning is applied in well-defined domains
like game playing, including chess.
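The rain example above can be mimicked by a small forward-chaining sketch; the fact and rule names below are illustrative, not part of the text.

# Facts and rules of the form (antecedent, consequent); rules are applied
# repeatedly until no new fact can be deduced (truth-preserving inference).
facts = {"raining"}
rules = [("raining", "roads_wet"), ("roads_wet", "roads_slippery")]

changed = True
while changed:
    changed = False
    for antecedent, consequent in rules:
        if antecedent in facts and consequent not in facts:
            facts.add(consequent)
            changed = True

print(facts)   # {'raining', 'roads_wet', 'roads_slippery'}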
1.2.3 Learning by Abduction
Here, we infer A from B and (A → B). Notice that this is not truth
preserving like deduction, as both B and (A → B) can be True while A is
False. Consider the following inference:
1. An aeroplane is a flying object (aeroplane → flying object).
2. A is a flying object.
From (1) and (2), we infer using abduction that A is an aeroplane. This kind
of reasoning may lead to incorrect conclusions. For example, A could be a
bird or a kite.
1.2.4 Learning by Induction
This is the most popular and effective form of ML. Here, learning is achieved
with the help of examples or observations. It may be categorized as follows:
Learning from Examples: Here, it is assumed that a collection of
labelled examples is provided and the ML system uses these
examples to make a prediction on a new data pattern. In supervised
classification or learning from examples, we deal with two ML
problems: classification and regression.
1. Classification: Consider the handwritten digits shown in Fig. 1.3.
Here, each row has 15 examples of each of the digits. The problem is
to learn an ML model using such data to classify a new data pattern.
This is also called supervised learning as the model is learnt with the
help of such exemplar data. It may be provided by an expert in several
practical situations. For example, a medical doctor may provide
examples of normal patients and patients infected by COVID-19 based
on some test results. In the case of handwritten digits, we have 10
class labels, one class label corresponding to each of the digits from 0
to 9. In classification, we would like to assign an appropriate class
label from these labels to a new pattern.
2. Regression: Contrary to classification, there are several prediction
applications where the labels come from a possibly infinite set. For
example, the share value of a stock could be a positive real number.
The stock may have different values at different points in time and each of
these values is a real number. This is a typical regression or curve
fitting problem. The practical need here is to predict the share value
of a stock at a future time instance based on past data in the form of
examples.
Learning from Observations: Observations are also instances like
examples but they are different because observations need not be
labelled. In this case, we cluster or group the observations into a
smaller number of groups. Such grouping is performed with the help of a
clustering algorithm that assigns similar patterns to the same
group/cluster. Each cluster could be represented by its centroid or mean.
Fig. 1.3 Examples of handwritten digits labelled 0 to 9
Let x_1, x_2, …, x_p be the p elements of a cluster. Then the centroid of the cluster is defined by

\frac{1}{p} \sum_{i=1}^{p} x_i
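A minimal sketch of this centroid computation, assuming a small hypothetical cluster stored as a NumPy array:

import numpy as np

# p = 3 two-dimensional points of a hypothetical cluster.
cluster = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 0.0]])

centroid = cluster.mean(axis=0)   # (1/p) * sum_{i=1}^{p} x_i
print(centroid)                   # [3. 2.]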
Let us consider the handwritten digit data set of 3 classes: 0, 1 and 3. By
using the class labels and clustering the patterns of each class separately into
3 clusters each, we obtain the 9 centroids shown in Fig. 1.4.
Fig. 1.4 Cluster centroids using the class labels in clustering
However, when we cluster the entire data of digits 0, 1 and 3 into 9 clusters, we obtain the
centroids shown in Fig. 1.5. So, the clusters and their representatives could
differ based on how we exploit the class labels.
Fig. 1.5 Cluster centroids without using the class labels in clustering
1.2.5 Reinforcement Learning
In supervised learning, the ML model is learnt in such a way as to maximize
a performance measure like prediction accuracy. In the case of reinforcement
learning, an agent learns an optimal policy to optimize some reward function.
The learnt policy helps the agent in taking an action based on the current
configuration or state of the problem. Robot path planning is a typical
application of reinforcement learning.
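As an illustration of how an agent can learn such a policy, the sketch below shows the tabular Q-learning update; Q-learning is one possible reinforcement learning algorithm (the chapter does not commit to a specific one) and the state/action sizes and the single transition are hypothetical.

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # table of state-action values
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Move Q(s, a) towards the observed reward plus the discounted value
    # of the best action available in the next state.
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q[0, 1])   # 0.1 after a single update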
1.3 Types of Data
In this book, we primarily deal with inductive learning as it is the most
popular paradigm for ML. It is important to observe that in both supervised
learning and learning from observations, we deal with data. In general, data
can be categorical or numerical.
Categorical: This type of data can be nominal or ordinal. In the case of
nominal data, there is no order among the elements of the domain. For
example, for colour of hair, the domain is {brown, black, red}. This data
is of categorical type and the elements of the domain are not ordered.
On the contrary, in ordinal data, there is an order among the values of
the domain. For example, the domain of the variable employee number
could be {1, 2, …, 1011} if there are 1011 employees in an organization.
Here, an ordering among the elements of the domain is observed,
indicating that senior employees have smaller employee numbers
compared to junior employees; the most senior employee will have
employee number 1.
Numerical: In the case of numerical data, the domain of values of the
data type could be a set/subset of integers or a set/subset of real
numbers. For example, Table 1.1 shows a subset of the features used by
the Wisconsin Breast Cancer data. The domain of Diagnosis, the class label, is a
binary set with values Malignant and Benign. The domain of ID Number is a
subset of integers in the range [8670, 917897] and the domain of
Area_Mean is a collection of floating point numbers (an interval) in the range
[143.5, 2501]. It is possible to have binary values in the domain for
categorical or numerical data. For example, the domain of Status could be
{Pass, Fail} and this variable is nominal; an example of a binary ordinal type
is {short, tall} for humans based on their height. A very popular binary
numerical type is {0, 1}; also, in the
Table 1.1 Different types of data from the Wisconsin Breast Cancer database

Feature Number   Attribute          Type of Data   Domain
1                Diagnosis          Nominal        {Malignant, Benign}
2                ID Number          Ordinal        [8670, 917897]
3                Perimeter_Mean     Numerical      [43.79, 188.5]
4                Area_Mean          Numerical      [143.5, 2501]
5                Smoothness_Mean    Numerical      [0.05263, 0.1634]
classification context, the class label data can have the domain {− 1 ,+ 1}
where -1 stands for the label of the negative class and +1 stands for the
label of the positive class.
Typically, each pattern or data item is represented as a vector of feature
values. For example, a data item corresponding to a patient with ID 92751 is
represented by a five-dimensional vector (Benign, 92751, 47.92, 181,
0.05263), where each component of the vector represents the corresponding
feature shown in Table 1.1. Benign is the value of feature 1, Diagnosis;
similarly, the third entry 47.92 corresponds to feature 3, that is
Perimeter_Mean, and so on. Note that Diagnosis is a nominal feature and ID
Number is an ordinal attribute. The remaining three features are numerical.
Here, Diagnosis, or the class label, is a dependent feature and the remaining
four features are independent features. Given a collection of such
data items or patterns in the form of five-dimensional vectors, the ML system
learns an association or mapping between the independent features and the
dependent feature.
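A small sketch of such a representation for the patient record above; the dictionary keys follow Table 1.1 and everything else is illustrative.

record = {
    "Diagnosis": "Benign",        # nominal, dependent feature (class label)
    "ID Number": 92751,           # ordinal
    "Perimeter_Mean": 47.92,      # numerical
    "Area_Mean": 181.0,           # numerical
    "Smoothness_Mean": 0.05263,   # numerical
}

# The independent features form the input vector; Diagnosis is the target.
x = [record["ID Number"], record["Perimeter_Mean"],
     record["Area_Mean"], record["Smoothness_Mean"]]
y = record["Diagnosis"]
print(x, y)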
1.4 Matching
Matching is an important activity in ML. It is used in both supervised
learning and in learning from observations. Matching is carried out by using
a proximity measure which can be a distance/dissimilarity measure or a
similarity measure. Two data items, u and v , represented as l -dimensional
vectors, match better when the distance between them is smaller or when the
similarity between them is larger. A popular distance measure is the Euclidean
distance and a popular similarity measure is the cosine of the angle between
the vectors. The Euclidean distance is given by

d(u, v) = \sqrt{\sum_{i=1}^{l} (u_i - v_i)^2}
The cosine similarity is given by

\cos(u, v) = \frac{u^T v}{\|u\| \, \|v\|}
where u^T v is the dot product between vectors u and v, and ||u|| is the
Euclidean distance between u and the origin; it is also called the Euclidean
norm. Some of the important applications of matching in ML are in:
Finding the Nearest Neighbor of a Pattern: Let x be an l-dimensional
pattern vector. Let X = {x_1, x_2, …, x_n} be a collection of n data vectors.
The nearest neighbor of x from X, denoted by NN_x(X), is x_j if

d(x, x_j) \le d(x, x_i), \ \forall x_i \in X
This is an approximate search where a pattern that best matches x is
obtained. If there is a tie, that is, when both x p ∈ X and x q ∈ X are the nearest
neighbors of x , we can break the tie arbitrarily or choose either of the two to
be the nearest neighbor of x . This step is useful in classification and will be
discussed in the next chapter.
Assigning to a Set with the Nearest Representative: Let C 1 , C 2 , … , C K
be K sets with x 1 , x 2 ,… , x K as their respective representatives. A
pattern x is assigned to C i if
d ( x , x i ) ≤ d ( x , x j ) , for j∈ {1 , 2 ,… , K }
This idea is useful in clustering or learning from observations, where C_i is
the i-th group or cluster of patterns or observations and x_i is the representative
of C_i. The centroid of the data vectors in C_i is a popularly used
representative, x_i, of C_i.
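The following minimal sketch puts these matching operations together: Euclidean distance, cosine similarity and a brute-force nearest neighbor search over a hypothetical set of patterns.

import numpy as np

def euclidean(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_neighbor(x, X):
    # Index of the pattern in X closest to x; ties are broken by taking
    # the first minimum, i.e. arbitrarily.
    distances = [euclidean(x, xi) for xi in X]
    return int(np.argmin(distances))

X = np.array([[1.0, 1.0], [4.0, 4.0], [0.0, 2.0]])   # hypothetical patterns
x = np.array([1.0, 2.0])
print(nearest_neighbor(x, X))    # 0 (tied with pattern 2, broken arbitrarily)
print(cosine(x, X[0]))           # about 0.949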
1.5 Stages in Machine Learning
Building a machine learning system involves a number of steps, as
illustrated in Fig. 1.6. Note the emphasis on data in the form of training,
validation and test sets.
Fig. 1.6 Important steps in a practical machine learning system
Typically, the available data is split into training, validation and test data.
Training data is used in model learning or training, and validation data is used
to tune the ML model. Test data is used to examine how the learnt model is
performing. We now describe the components of an ML system.
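A hedged sketch of such a split using scikit-learn; the 60/20/20 proportions and the toy data are illustrative choices.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)    # 50 hypothetical patterns
y = np.arange(50)                    # their labels/targets

# First carve out 40% for validation + test, then split that half-and-half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 30 10 10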
1.5.1 Data Acquisition
This depends on the domain of the application. For example, to distinguish
between adults and children, measurements of their height or weight are
adequate; however, to distinguish between normal and COVID-19-infected
humans, their body temperature and chest congestion may be more important
than their height or weight. Typically, data collection is carried out before
feature engineering.
1.5.2 Feature Engineering
This step involves a combination of data preprocessing and data
representation.
Data Preprocessing
In several practical applications, the raw data available needs to be updated
before it can be used by an ML model. The common problems encountered
with raw data are missing values, different ranges for different variables and
the presence of outliers. We will now explain how to deal with these problems.
Missing Data: It is likely that in some domains, there could be missing data.
This occurs as a consequence of the inability to measure a feature value or
due to unavailability or erroneous data entry. Some ML algorithms can work
even when there are a reasonable number of missing data values and, in
such cases, there is no need for preprocessing. However, there are a large
number of other cases where the ML models cannot handle missing values.
So, there is a need to examine techniques for dealing with missing data.
Different schemes are used for predicting missing values:
Use the nearest neighbor: Let x be an l-dimensional data vector that
has its i-th component x(i) missing. Let X = {x_1, x_2, …, x_n} be the set of n
training pattern vectors. Let x_j ∈ X be the nearest neighbor of x based
on the remaining l − 1 (excluding the i-th) components. Predict the
value of x(i) to be x_j(i); that is, if the i-th component x(i) of x is missing,
use the i-th component of x_j = NN_x(X) instead.
Use a larger neighborhood: Use the k-nearest neighbors (KNNs) of x
to predict the missing value x(i). Let the KNNs of x, using the
remaining l − 1 components, from X be x_1, x_2, …, x_K. Now the
predicted value of x(i) is the average of the i-th components of these
KNNs. That is, the predicted value of x(i) is

\frac{1}{K} \sum_{j=1}^{K} x_j(i)
Example 1: Consider the set of data vectors
(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, −, 2), (1, 1, −), (6, 6, 1)
There are 6 three-dimensional pattern vectors in the set. Missing values are
indicated by −. Let us see how to predict the missing value in (1, −, 2). Let us
use K = 3 and find the 3 nearest neighbors (NNs) based on the remaining two
feature values. The three NNs are (1, 1, 1), (1, 1, 2) and (1, 1, 3). Note that the
second feature value of all three of these neighbors is 1, which is the predicted
value for the missing value in (1, −, 2). So, the vector becomes (1, 1, 2).
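A minimal NumPy sketch of this 3-NN imputation; np.nan is used to mark the missing values, and only fully observed vectors are used as neighbours here, which is a simplifying assumption.

import numpy as np

X = np.array([[1, 1, 1], [1, 1, 2], [1, 1, 3],
              [1, 1, np.nan], [6, 6, 1]], dtype=float)   # training vectors
query = np.array([1.0, np.nan, 2.0])                      # (1, -, 2)

missing = int(np.where(np.isnan(query))[0][0])            # index of missing component
keep = [i for i in range(len(query)) if i != missing]     # remaining components

candidates = X[~np.isnan(X).any(axis=1)]                  # fully observed vectors only
d = np.sqrt(((candidates[:, keep] - query[keep]) ** 2).sum(axis=1))
knn = candidates[np.argsort(d)[:3]]                       # 3 nearest neighbours
query[missing] = knn[:, missing].mean()                   # predicted value
print(query)                                              # [1. 1. 2.]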
Cluster the data and locate the nearest cluster: This approach is
based on clustering the training data and locating the cluster
to which x belongs based on the remaining l − 1 components. Let x,
with its i-th value missing, belong to cluster C_q. Let μ^q be the centroid of
C_q. Then the predicted value of x(i) is μ^q_i, the i-th component of μ^q. We
will explain clustering in detail in a later chapter; it is sufficient for
now to note that a clustering algorithm can be used to group patterns
in the training data into K clusters where patterns in each cluster are
similar to each other and patterns belonging to different clusters are
dissimilar.
Example 2: Consider the following data matrix. It has 5 data vectors
in a four-dimensional space.
5.1    3.5    1.4    0.2
4.9    3.0    1.4    0.2
4.7    3.2    1.3    0.2
4.6    3.1    1.5    0.2
5.0    3.6    1.4    0.2

Suppose that the values 5.1 (first feature of the first vector) and 3.1 (second
feature of the fourth vector) are removed and then imputed by the mean of
the remaining values of the respective feature. The resulting matrix is

4.8    3.5    1.4    0.2
4.9    3.0    1.4    0.2
4.7    3.2    1.3    0.2
4.6    3.325  1.5    0.2
5.0    3.6    1.4    0.2
We can compute the mean squared error (MSE) of the predicted values based
on their deviations from the original values. The computation of MSE may be
explained as follows: given the n true (target) values y_1, y_2, …, y_n and the
predicted values \hat{y}_1, \hat{y}_2, …, \hat{y}_n, the MSE is defined as

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
In the above example, we have predicted two missing values based on the
mean of the remaining values of the respective feature. In the first case,
instead of 5.1, our estimated value is 4.8. Similarly, in the second case, for
the value 3.1, our estimate is 3.325. So, the MSE here is

\frac{(5.1 - 4.8)^2 + (3.1 - 3.325)^2}{2} \approx 0.0703
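The same MSE can be checked with a couple of lines of code:

import numpy as np

y_true = np.array([5.1, 3.1])       # original values that were removed
y_pred = np.array([4.8, 3.325])     # mean-based imputations
mse = np.mean((y_true - y_pred) ** 2)
print(round(mse, 4))                # 0.0703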
Example 3: Consider three clusters of points and their centroids:
a. Cluster1: {(1 ,1 , 1),(1 , 2 ,3),(1 , 3 ,2)} Centroid 1: (1 , 2, 2)
b. Cluster2: {(3 , 4 ,3),(3 , 5 ,3),(3 , 3 ,3)} Centroid 2 :(3 , 4 , 3)
c. Cluster3: {(6 , 6 , 6) ,(6 , 8 , 6),( 6 ,7 ,6)} Centroid 3: (6 , 7 , 6)
Consider a pattern vector with a missing value given by (1, −, 2). Its nearest
centroid among the three centroids, based on the remaining two features, is
Centroid 1, (1, 2, 2). So, the missing value in the second location is predicted
to be the second component of Centroid 1, that is, 2. So, the pattern with the
missing value becomes (1, 2, 2).
We will now illustrate how the KNN-based and mean-based schemes work
on a bigger data set. We consider the 20,640 patterns of the California Housing
data set. It has 8 features and the target variable is the median house value for
California districts, expressed in hundreds of thousands of dollars ($100,000).
This is a regression problem. Some values in the data set are removed to create
missing values. The missing values are imputed using the KNN-based scheme
and the mean-based scheme. Now a regressor (a function to predict the target)
is used on the whole data set without missing values, on the KNN-based
imputed data set and on the mean-based imputed data set. The resulting mean
squared errors of the predictions of the regressor are shown in Fig. 1.7.
Fig. 1.7 MSE of the regressor on data imputed using KNN and mean
It is easy to observe that the regressor performs best when the whole data is
available. However, when prediction is made by removing some values and
guessing them, the performance of the regressor suffers; this is natural.
Note that between the KNN-based and the mean-based imputations, the
former made better predictions, leading to a smaller MSE. This is because the
KNN-based scheme is more local to the respective point x and so is more focussed.
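A hedged sketch of this kind of comparison using scikit-learn's KNNImputer and SimpleImputer on the California Housing data; the linear regressor, the 10% missing fraction and the evaluation on the imputed data itself are illustrative choices and not necessarily those used for Fig. 1.7.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)

rng = np.random.default_rng(0)
X_missing = X.copy()
mask = rng.random(X.shape) < 0.1          # remove roughly 10% of the values
X_missing[mask] = np.nan

for name, imputer in [("KNN", KNNImputer(n_neighbors=5)),
                      ("mean", SimpleImputer(strategy="mean"))]:
    X_imp = imputer.fit_transform(X_missing)
    pred = LinearRegression().fit(X_imp, y).predict(X_imp)
    print(name, mean_squared_error(y, pred))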
Data from Different Domains: The scales of values of different features could be very
different. This would bias the matching process to depend more upon features that assume
depend more upon features that assume larger values, toning down the
contributions of features with smaller values. So, in applications where
different components of the vectors have different domain ranges, it is
possible for some components to dominate in contributing to the distance
between any pair of patterns. Consider for example, classification of objects
into one of two classes: adult or child. Let the objects be represented by
height in metres and weight in grams. Consider an adult represented by the
vector ( 1.6 , 75000 ) and a child represented by the vector ( 0.6 , 5000 ), where
the heights of the adult and the child in metres are 1.6 and 0.6 ,
respectively, and the weights of the adult and the child in grams are 75000
and 5000 , respectively. Assume that the domain of height is [0.5, 2.5] and
the domain of weight is [2000, 200000] in this example. So, there is a large
difference in the ranges of values of these two features.
Now the Euclidean distance between the adult and child vectors given above
is

\sqrt{(1.6 - 0.6)^2 + (75000 - 5000)^2} \approx 70000

Similarly, the cosine of the angle between the adult and child vectors is

\frac{0.96 + 375 \times 10^6}{\sqrt{25000000.36} \times \sqrt{5625000002.56}} \approx 1.0
Note that the proximity values computed between the two vectors, whether
it is the Euclidean distance or the cosine of the angle between the two
vectors, depend largely upon only one of the two features, that is, weight,
while the contribution of height is negligible. This is because of the
difference in the magnitudes of the ranges of values of the two features. This
example illustrates how the magnitudes/ranges of values of different features
contribute differently to the overall proximity. This can be handled by scaling
different components differently, and such a process of scaling is called
normalization. There are two popular normalization schemes:
Scaling using the range: On any categorical feature, the values of two
patterns either match or mismatch, and the contribution to the distance
is either zero (0) (match) or 1 (mismatch). If we want to be consistent,
then in the case of any numerical feature also, we want the
contribution to be in the range [0, 1]. This is achieved by scaling the
difference by the range of the values of the feature. So, if the p-th
component is of numerical type, its contribution to the distance
between patterns x_i and x_j is

\frac{|x_i(p) - x_j(p)|}{\text{Range}_p}

where Range_p is the range of the p-th feature values. Note that the value of
this term is in the range [0, 1]; the value of 1 is achieved when
|x_i(p) − x_j(p)| = Range_p and it is 0 (zero) if patterns x_i and x_j have the
same value for the p-th feature. Such a scaling will ensure that the contribution,
to the distance, of either a categorical feature or a numerical feature will be in
the range [0, 1].
Standardization: Here, the data is normalized so that it will have 0
(zero) mean and unit variance. This is motivated by the property of
standard normal distribution, which is characterized by zero mean
and unit variance.
Example 4: Let there be 5 l-dimensional data vectors and let the q-th
components of the 5 vectors be 60, 80, 20, 100 and 40. The mean of this
collection is

\frac{60 + 80 + 20 + 100 + 40}{5} = 60

We get zero-mean data by subtracting this mean from each of the 5 data
items to obtain 0, 20, −40, 40, −20 for their q-th components. Note that this is
zero-mean data as these values add up to 0. To make the standard deviation
of this data 1, we divide each of the zero-mean data values by the standard
deviation of the data. Note that the variance of the zero-mean data is

\frac{0^2 + 20^2 + (-40)^2 + 40^2 + (-20)^2}{5} = 800

and the standard deviation is 28.284. So, the scaled data is
0, 0.707, −1.414, 1.414, −0.707. Note that this data corresponding to the q-th
feature value of the 5 vectors has zero mean and unit variance.
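Both normalization schemes can be sketched in a few lines on the q-th feature values of Example 4:

import numpy as np

values = np.array([60.0, 80.0, 20.0, 100.0, 40.0])   # q-th feature of the 5 vectors

# Scaling using the range (min-max): contributions fall in [0, 1].
scaled = (values - values.min()) / (values.max() - values.min())

# Standardization: zero mean and unit variance, as in Example 4.
standardized = (values - values.mean()) / values.std()

print(scaled)          # [0.5  0.75 0.   1.   0.25]
print(standardized)    # [ 0.     0.707 -1.414  1.414 -0.707]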
Outliers in the Data: An outlier is a data item that is either noisy or
erroneous. Faulty measuring instruments or erroneous data recordings are
responsible for the presence of such outliers. A common problem across
various applications is the presence of outliers. A data item is usually called
an outlier if it
Assumes values that are far away from those of the average data
items
Deviates from the normally behaving data item
Is not connected/similar to any other object in terms of its
characteristics
Outliers can occur because of different reasons:
Noisy measurements: The measuring instruments may malfunction
and may lead to recording of noisy data. It is possible that the
recorded value lies outside the domain of the data type.
Erroneous data entry: Outlying data can occur at the data entry level
itself. For example, it is very common for spelling mistakes to occur
when names are entered. Also, it is possible to enter numbers such as
salary erroneously as 2000000 instead of 200000 by typing an extra
zero (0).
Evolving systems: It is possible to encounter data items in sparse
regions during the evolution of a system. For example, it is common to
encounter isolated entities in the early days of a social network. Such
isolated entities may or may not be outliers.
Very naturally: Instead of viewing an outlier as a noisy or unwanted
data item, it may be better to view it as something useful. For
example, a novel idea or breakthrough in a scientific discipline, a
highly paid sportsperson or an expensive car can all be natural and
influential outliers.
An outlying data item can be either out-of-range or within-range. For
example, consider an organization in which the salary could be from
{10000 , 150000 , 225000 , 300000} . In this case, an entry like 2250000 is
an out-of-range outlier that occurs possibly because of an erroneously
added zero (0). Also, if there are only 500 people drawing 10000, 400
drawing 150000, 300 drawing 225000 and 175 drawing 300000, then an
entry like 270000 could be a within-range outlier.
There are different schemes for detecting outliers. They are based on the
density around points in the data. If a data point is located in a sparse
region, it could be a possible outlier. It is possible to use clustering to locate
such outliers. It does not matter whether it is withinrange or out-of-range. If
the clustering output has a singleton cluster, that is, a one-element cluster,
then that element could be a possible outlier.
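A minimal sketch of such cluster-based outlier detection, using k-means (an illustrative choice of clustering algorithm) and flagging singleton clusters; the toy data is hypothetical.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [10, 10]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
for c in range(2):
    members = np.where(labels == c)[0]
    if len(members) == 1:                       # singleton cluster
        print("possible outlier:", X[members[0]])   # [10. 10.]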
1.5.3 Data Representation
Representation is an important step in building ML models. This subsection
introduces how data items are represented. It also discusses the importance
of representation in ML. In the process, it deals with both feature selection
and feature extraction and introduces different categories of dimensionality
reduction.
It is often stated in DL literature that feature engineering is important in
ML, but not in DL, because DL systems have automatic representation
learning capability. This is a highly debatable issue. However, it is possible
that, in some application domains, DL systems
including handling missing data and eliminating outliers is still an important
part of any DL system. Even though representation is not explicit, it is
implicitly handled in DL by choosing the appropriate number of layers and
number of neurons in each layer of the neural network.
Representation of Data Items
The most active and state-of-the-art paradigm for ML is statistical machine
learning. Here, each data item is represented as a vector. Typically, we
consider addition of vectors in computing the mean or centroid of a
collection of vectors, multiplication of a vector by a scalar in dealing with
operations on matrices, and the dot product between a pair of vectors for
computing similarity as important operations on the set of vectors. In most
of the practical applications, the dimensionality
of the data, or correspondingly the size of the vectors representing the data items, L,
can be very large. For example, there are around 468 billion Google Ngrams. In
this case, the dimensionality of the vectors is the vocabulary size or the number
of Ngrams; so, the dimensionality could be very large. Such
high-dimensional data is common in bioinformatics, information retrieval, satellite imagery, and so
on. So, representation is an important component of any ML system. An
arbitrary representation may also be adequate to build an ML model. However,
the predictions made using such a model may not be meaningful.
Current-day applications deal with high-dimensional data. Some of the difficulties associated with
ML using such high-dimensional data vectors are:
Computation time increases with the dimensionality.
The storage space requirement also increases with the dimensionality.
Performance of the model: It is well known that as the dimensionality increases, we require
a larger training data set to build an ML model. There is a result, popularly called the
peaking phenomenon, which shows that as the dimensionality keeps increasing, the accuracy
of a classification model increases up to some value, and beyond that value, the accuracy
starts decreasing.
This may be attributed to the well-known concept of overfitting. The
model tends to remember the training data and fails to perform
well on validation data. With a larger training data set, we can afford
to use a higher-dimensional data set and still avoid overfitting. Even
though the dimensionality of the data set in an application is large, it
is possible that the number of available training vectors is small. In
such cases, a popular technique used in ML is to reduce the
dimensionality so that the learnt model does not overfit the available
data. Well-known dimensionality reduction approaches are:
Feature selection: Let F = {f_1, f_2, …, f_L} be the set of L features. In the
feature selection approach, we would like to select a subset F_l of F
having l (< L) features such that F_l maximizes the performance of the
ML model.
Feature extraction: Here, from the set F of L features, a set
H = {h_1, h_2, …, h_l} of l (< L) features is extracted. It is possible to categorize
these schemes as follows:
1. Linear schemes: In this case,

h_j = \sum_{i=1}^{L} \alpha_{ij} f_i

That is, each element of H is a linear combination of the original features. Note that
feature selection is a specialization of feature extraction. Some
prominent schemes under this category are:
a. Principal components (PCs): Consider the data set of n vectors in an L-dimensional
space; this may be represented as a matrix A of size n × L. The
covariance matrix Σ of size L × L associated with the data is computed and
the eigenvectors of Σ form the principal components. The eigenvector
corresponding to the largest eigenvalue is the first principal component (PC).
Similarly, the second largest eigenvalue provides its corresponding
eigenvector as the second PC. Finally, the eigenvector corresponding to the l-th largest
eigenvalue is the l-th PC. Both the original features and the PCs are
sufficiently powerful to represent any data vector. So, PCs are vectors that
are linear combinations of the given features.
b. Non-negative matrix factorization (NMF): It is possible that PCs have
negative entries. However, if the data is non-negative, it can be factorized
using non-negative entries; NMF is such a factorization of A (of size n × L) into a
product of B (of size n × l) and C (of size l × L). Its use is motivated by the notion that NMF can be
used to characterize objects in an image represented by A. In NMF, the
columns of B can be viewed as linear combinations of the columns of A because of
linear independence.
We will examine, in detail, the concepts of eigenvalue, eigenvector and
linear independence in later chapters.
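A minimal sketch of computing principal components via eigen analysis of the covariance matrix; the small data matrix A is illustrative.

import numpy as np

A = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

cov = np.cov(A, rowvar=False)                  # L x L covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
pcs = eigenvectors[:, order]                   # columns are the PCs

A_reduced = (A - A.mean(axis=0)) @ pcs[:, :1]  # project onto the first PC
print(pcs[:, 0], A_reduced.shape)              # first PC, (6, 1)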
2. Non-linear feature extraction: Here, we represent the data using H = {h_1, …, h_l}, such
that

h_i = t(f_1, f_2, \ldots, f_L)

where t is a non-linear function of the features. For example, if F = {f_1, f_2},
then h_1 = a f_1 + b f_2 + c f_1 f_2 is one such non-linear combination; it is non-linear
because we have a term of the form f_1 f_2 in h_1.
Autoencoder is a popular, state-of-the-art, non-linear feature extraction tool.
Here, a neural network which has an encoder and a decoder is used. The
middle layer has l neurons so that the l outputs from the middle layer give
an l(¿ L) dimensional representation of the L-dimensional pattern that is
input to the autoencoder. Note that the encoder encodes or represents the L
-dimensional pattern in the l -dimensional space while the decoder decodes
or converts the l -dimensional pattern into the L-dimensional space. Note
that it is called autoencoder because the same L-dimensional pattern is
associated with the input and output layers.
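A hedged PyTorch sketch of such an autoencoder; the layer sizes, optimizer and random input batch are illustrative, and a real model would be trained on actual patterns for many more steps.

import torch
import torch.nn as nn

L, l = 784, 32                                         # input and code dimensionalities
encoder = nn.Sequential(nn.Linear(L, l), nn.ReLU())    # L -> l
decoder = nn.Linear(l, L)                              # l -> L
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(16, L)                                  # a batch of random "patterns"
for _ in range(5):                                     # a few training steps
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)                  # target is the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

code = encoder(x)                                      # l-dimensional representation
print(code.shape)                                      # torch.Size([16, 32])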
1.5.4 Model Selection
Selection of the model to be used to train an ML system depends upon the
nature of the data and knowledge of the application domain. For some
applications, only a subset of the ML models can be used. For example, if
some features are numerical and others are categorical, then classifiers
based on perceptrons and support vector machine (SVM) are not suitable as
they compute the dot product between vectors and dot products do not
make sense when some values in the corresponding vectors are non-
numerical. On the other hand, Bayesian models and decision tree-based
models are ideally suited to deal with such data as they depend upon the
frequency of occurrence of values.
1.5.5 Model Learning
This step depends on the size and type of the training data. In practice, a
subset of the labelled data is used as training data for learning the model
and another subset is used for model validation or model evaluation. Some
of the ML models are highly transparent while others are opaque or black
box models. For example, decision tree-based models are ideally suited to
provide transparency; this is because in a decision tree, at each internal or
decision node, branching is carried out based on the value assumed by a
feature. For example, if the height of an object is larger than 5 feet, it is
likely to be an adult and not a child; such easy-to-understand rules are
abstracted by decision trees. Neural networks are typically opaque as the
outputs of intermediate/hidden layer neurons may not offer transparency.
1.5.6 Model Evaluation
This step is also called model validation. It requires specifically
earmarked data called validation data. It is possible that the ML model
works well on the training data; then we say that the model is well trained.
However, it may not work well on the validation data. In such a case, we say
that the ML model overfits the training data. In order to overcome overfitting,
we typically use the validation data to tune the ML model so that it works well
on both the training and validation data sets.
1.5.7 Model Prediction
This step deals with testing the model that is learnt and validated. It is used
for prediction because both classification and regression tasks are predictive
tasks. This step employs the test data set earmarked for the purpose. In the
real world, the model is used for prediction as new patterns keep coming in.
Imagine an ML model built for medical diagnosis. It is like a doctor who
predicts and makes a diagnosis when a new patient comes in.
1.5.8 Model Explanation
This step is important to explain to an expert or a manager why a decision
was taken by the ML model. This will help in obtaining explicit or implicit
feedback from the user to further improve the model. Explanation had an
important role earlier in expert systems and other AI systems. However,
explanation has become very important in the era of DL. This is because DL
systems typically employ neural networks that are relatively opaque. So, their
functioning cannot be easily explained at a level of detail that can be
appreciated by the domain expert/user. Such opaque behaviour has created
the need for explainable AI.
1.6 Search and Learning
Search is a very basic and fundamental operation in both ML and AI. Search
had a special role in conventional AI where it was successfully used in
problem solving, theorem proving, planning and knowledge-based systems.
Further, search plays an important role in several computer science
applications. Some of them are as follows:
Exact search is popular in databases for answering queries, in
operating systems for operations like grep, and in looking for entries in
symbol tables.
In ML, search is important in learning a classification model, a
proximity measure for clustering and classification, and an
appropriate model for regression.
Inference is search in logic and probability. In linear algebra, matrix
factorization is search. In optimization, we use a regularizer to
simplify the
search in finding a solution. In information theory, we search for
purity (low entropy).
So, several activities, including optimization, inference and matrix
factorization, that are essential for ML are all based on search.
Learning itself is search. We will examine how search aids learning for
each ML model in the respective chapters.
1.7 Explanation Offered by the Model
Conventional AI systems were logic-based or rule-based systems. So, the
corresponding reasoning systems naturally exhibited transparency and, as a
consequence, explainability. Both forward and backward reasoning were
possible. In fact, the same knowledge base, based on experts' input, was used
in both diagnosis and in teaching because of this flexibility. Specifically, the
knowledge base used by the MYCIN expert system was used in tutoring
medical students by another expert system called GUIDON.
However, there were some problems associated with conventional AI
systems:
There was no general framework for building AI systems. Acquiring knowledge, using
additional heuristics and dealing with exceptions led to adhocism; experience in building one
AI system did not simplify the building of another AI system.
Acquiring knowledge was a great challenge. Different experts typically
differed in their conclusions, leading to inconsistencies. Conventional logic-
based systems found it difficult to deal with such inconsistent knowledge.
There has been a gradual shift from using knowledge to using data in
building AI systems. Current-day AI systems, which are mostly based on DL,
are by and large data dependent. They can learn representations
automatically. They employ variants of multi-layer neural networks and
backpropagation algorithms in training models.
Some difficulties associated with DL systems are:
They are data dependent. Their performance improves as the size of the
data set increases. So, they need larger data sets. Fortunately, it is not
difficult to provide large data sets in most of the current applications.
Learning in DL systems involves a simple change of weights in the neural
network to optimize the objective function. This is done with the help of
backpropagation, which is a gradient descent algorithm and which can get
stuck with a locally optimal solution. Combining this with large data sets
may possibly lead to overfitting. This is typically avoided by using a variety
of simplifications in the form of regularizers and other heuristics.
A major difficulty is that DL systems are black box systems and lack
explanation capability. This problem is currently attracting the attention of
AI researchers. We will be discussing how each of the ML models is
equipped with explanation capability in the respective chapters.
1.8 Data Sets Used
In this book, we make use of two data sets to conduct experiments and
present results in various chapters. These are:
Data Sets for Classification
1. MNIST Handwritten Digits Data Set:
There are 10 classes (corresponding to digits 0 , 1 , … ,9 ) and each digit
is viewed as an image of size 28 ×28 (¿ 784) pixels; each pixel having
values in the range 0 to 255 .
There are around 6000 digits as training patterns and around 1000
test patterns in each class and the class label is also provided for each
of the digits.
For more details, visit http://yann.lecun.com/exdb/mnist/
2. Fashion MNIST Data Set:
It is a data set of Zalando's article images, consisting of a training set of
60,000 examples and a test set of 10,000 examples.
Each example is a 28 × 28 greyscale image, associated with a label
from 10 classes.
It is intended to serve as a possible replacement for the original
MNIST data set for benchmarking ML models.
It has the same image size and structure of training and testing splits
as the MNIST data.
For more details, visit https://www.kaggle.com/datasets/zalando-research/fashionmnist
3. Olivetti Face Data Set:
It consists of 10 different images each of 40 distinct subjects. For
some subjects, the images were taken at different times, varying the lighting,
facial expressions (open / closed eyes, smiling / not smiling) and facial details
(glasses / no glasses).
All the images were taken against a dark homogeneous background
with the subjects in an upright, frontal position (with tolerance for some side
movement).
Each image is of size 64 × 64=4096 .
It is available on the scikit-learn platform.
For more details, visit https://ai.stanford.edu/~marinka/nimfa/nimfa.examples.orl_images.html
4. Wisconsin Breast Cancer Data Set:
It consists of 569 patterns and each is a 30-dimensional vector.
There are two classes: Benign and Malignant. The number of patterns
from the Benign class is 357 and the number of Malignant class patterns is 212.
It is available on the scikit-learn platform.
For more details, visit
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
Data Sets for Regression
1. Boston Housing Data Set:
It has 506 patterns.
Each pattern is a 13 -dimensional vector.
It is available on the scikit-learn platform.
For more details, visit
https://scikit-learn.org/0.15/modules/generated/sklearn.datasets.load_boston.html
2. Airline Passengers Data Set:
This data set provides monthly totals of US airline passengers from
1949 to 1960.
This data set is taken from an inbuilt data set of Kaggle called
AirPassengers.
For more details, visit
https://www.kaggle.com/datasets/chirag19/air-passengers
3. Australian Weather Data Set:
It provides various weather record details for cities in Australia.
The features include location, min and max temperature, etc.
For more details, visit
https://www.kaggle.com/datasets/arunavakrchakrabal australia-
weather-data
Summary
Machine learning (ML) is an important topic and has affected research
practices in both science and engineering significantly. The important steps in
building an ML system are:
Data acquisition that is application domain dependent.
Feature engineering that involves both data preprocessing and
representation.
Selecting a model based on the type of data and the knowledge of the
domain.
Learning the model based on the training data.
Evaluating and tuning the learnt model based on validation data.
Providing explanation capability so that the model is transparent to
the user/expert.
Exercises
1. You are given that 9 × 17 is 153 and 4 × 17 is 68. From this data, you
need to learn the value of 13 × 17. Which learning paradigm is useful
here? Specify any assumptions you need to make.
2. You are given the following statements:
a. The sum of two even numbers is even.
b. 12 is an even number.
c. 22 is an even number.
What can you deduce from the above statements?
3. Consider the following statements:
a. If x is an even number, then x +1 is odd and x +2 is even.
b. 34 is an even number.
Which learning paradigm is used to learn that 37 is odd and 38 is even?
4. Consider the following reasoning:
a. If x is odd, then x + 1 is even.
b. 22 is even.
You have learnt from the above that 21 is odd. Which learning paradigm is
used? Specify any assumptions to be made.
5. Consider the following attributes. Find out whether they are nominal,
ordinal or numerical features. Give a reason for your choice.
a. Telephone number
b. Feature that takes values from {ball, bat, wicket, umpire, batsman,
bowler}
c. Temperature
d. Weight
e. Feature that takes values from {short, medium height, tall}
6. Let x_i and x_j be two l-dimensional unit norm vectors; that is, ||x_i|| = 1 and
||x_j|| = 1. Derive a relation between the Euclidean distance d(x_i, x_j) and the
cosine of the angle between x_i and x_j.
7. Consider the data set:
(1 , 1, 1),(1, 1 , 2),(1 , 1, 3),(1 ,2 , 2),(1 , 1 ,−),(6 , 6 , 10)