UNIT-V Notes
Digital Notes
[Department of Computer Application]
Subject Name : Artificial Intelligence
Subject Code : KCA301
Course : MCA
Branch : -
Semester : IIIrd
Prepared by : Mr. Narendra Kumar Sharma
Reference No./MCA/NARENDRA/KCA301/2/3
Unit – 5
PATTERN RECOGNITION
1. A pattern is everything around us in this digital world. A pattern can either be observed physically or it can be observed mathematically by applying algorithms.
Example: the colours on clothes, speech patterns, etc. In computer science, a pattern is represented using feature vector values.
1.3 Features may be represented as continuous, discrete or discrete binary variables. A feature
is a function of one or more measurements, computed so that it quantifies some significant
characteristics of the object.
Example: consider a face; then eyes, ears, nose, etc. are features of the face.
A set of features taken together forms the feature vector.
Example: In the face example above, if all the features (eyes, ears, nose, etc.) are taken together, the sequence is the feature vector [eyes, ears, nose]. A feature vector is a sequence of features represented as a d-dimensional column vector. In the case of speech, MFCCs (Mel-Frequency Cepstral Coefficients) are the spectral features of the speech signal; the sequence of the first 13 coefficients forms a feature vector.
Training set:
The training set is used to build a model. It consists of the set of images that are used to train the system. The training rules and algorithms used give relevant information on how to associate input data with output decisions. The system is trained by applying these algorithms to the dataset; all the relevant information is extracted from the data and results are obtained. Generally, 80% of the dataset is taken as training data.
Testing set:
Testing data is used to test the system. It is the set of data used to verify whether the system produces the correct output after being trained. Generally, 20% of the dataset is used for testing. Testing data is used to measure the accuracy of the system.
Example: if a system that identifies which category a particular flower belongs to classifies seven out of ten flowers correctly and the rest wrongly, then the accuracy is 70%.
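The 80/20 split and the accuracy measurement described above can be sketched in a few lines of Python. This is a minimal illustration, assuming scikit-learn is installed; the iris flower dataset and the k-nearest-neighbour classifier are only stand-ins for whatever data and model the system actually uses.
```python
# A minimal sketch of the 80/20 train/test split and accuracy measurement.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # feature vectors and class labels

# 80% of the dataset for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                  # training phase

# accuracy = correctly classified test samples / total test samples
accuracy = (model.predict(X_test) == y_test).mean()
print(f"Accuracy: {accuracy:.0%}")           # e.g. 7 correct out of 10 -> 70%
```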
An obvious representation of a pattern will be a vector. Each element of the vector can
represent one attribute of the pattern. The first element of the vector will contain the value of
the first attribute for the pattern being considered.
Example: While representing spherical objects, (25, 1) may represent a spherical object with 25 units of weight and 1 unit of diameter. The class label can form a part of the vector. If spherical objects belong to class 1, the vector would be (25, 1, 1), where the first element represents the weight of the object, the second element the diameter of the object, and the third element the class of the object.
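As a small illustration of this representation, the labelled vector (25, 1, 1) can be written directly as an array (NumPy is assumed here purely for convenience):
```python
import numpy as np

# Feature vector of one spherical object: [weight, diameter]
pattern = np.array([25.0, 1.0])

# The class label appended as a third element: [weight, diameter, class]
labelled_pattern = np.array([25.0, 1.0, 1.0])   # belongs to class 1

print(pattern.shape)          # (2,)  -> a 2-dimensional feature vector
print(labelled_pattern[-1])   # 1.0   -> the class label
```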
1.7 Advantages:
Pattern recognition solves classification problems.
Pattern recognition solves the problem of fake biometric detection.
It is useful for cloth pattern recognition for visually impaired people.
It helps in speaker diarization.
A particular object can be recognized from different angles.
1.8 Disadvantages:
The syntactic pattern recognition approach is complex to implement and is a very slow process.
Sometimes, to get better accuracy, a larger dataset is required.
It cannot explain why a particular object is recognized.
Example: my face vs my friend's face.
1.9 Applications:
Computer vision
Pattern recognition is used to extract meaningful features from given image/video samples
and is used in computer vision for various applications like biological and biomedical
imaging.
Seismic analysis
The pattern recognition approach is used for the discovery, imaging, and interpretation of temporal patterns in seismic array recordings. Statistical pattern recognition is implemented and used in different types of seismic analysis models.
Speech recognition
The greatest success in speech recognition has been obtained using pattern recognition paradigms. It is used in various speech recognition algorithms, which try to avoid the problems of a phoneme-level description and treat larger units such as words as patterns.
Objects of different classes may have different feature values, but objects of the same class always have the same feature values.
2.4 Components in Pattern Recognition System:
A pattern recognition system can be partitioned into components. There are five typical components for various pattern recognition systems: sensing, segmentation, feature extraction, classification, and post-processing.
2.5 Design Principles of Pattern Recognition
In a pattern recognition system, for recognizing the pattern or structure, two basic approaches are used, which can be implemented with different techniques. These are:
Statistical Approach and
Structural Approach
1. Statistical Approach:
Statistical methods are mathematical formulas, models, and techniques that are used in the
statistical analysis of raw research data. The application of statistical methods extracts
information from research data and provides different ways to assess the robustness of
research outputs.
Two main statistical methods are used :
i. Descriptive Statistics: It summarizes data from a sample using indexes such as
the mean or standard deviation.
ii. Inferential Statistics: It draws conclusions from data that are subject to random variation.
2. Structural Approach:
The Structural Approach is a technique wherein the learner masters the patterns of sentences. Structures are the different arrangements of words in one accepted style or another.
Types of structures:
Sentence Patterns
Phrase Patterns
Formulas
Idioms
3. Dimension Reduction
In pattern recognition, dimension reduction is defined as:
It is the process of converting a data set having vast dimensions into a data set with fewer dimensions.
It ensures that the converted data set conveys similar information concisely.
Example-
Consider the following example, where the graph shows two dimensions x1 and x2.
x1 represents the measurement of several objects in cm.
x2 represents the measurement of the same objects in inches.
In machine learning:
Using both these dimensions conveys similar (redundant) information.
Also, they introduce a lot of noise in the system.
So, it is better to use just one dimension.
We convert the data from 2 dimensions (x1 and x2) to 1 dimension (z1).
This makes the data relatively easier to explain.
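A minimal sketch of this cm-versus-inches example (the numbers are hypothetical) shows why one of the two dimensions can be dropped without losing information:
```python
import numpy as np

# Hypothetical measurements of the same objects in two redundant dimensions
x1 = np.array([10.0, 25.0, 40.0, 55.0])   # lengths in cm
x2 = x1 / 2.54                             # the same lengths in inches

X = np.column_stack([x1, x2])              # the 2-dimensional data set

# x2 carries the same information as x1, so a single dimension z1
# (here simply x1) represents the data without losing information.
z1 = X[:, 0]
print(X.shape, "->", z1.shape)             # (4, 2) -> (4,)
```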
3.2 Benefits-
Dimension reduction offers several benefits such as-
It compresses the data and thus reduces the storage space requirements.
It reduces the time required for computation since fewer dimensions require less computation.
It eliminates the redundant features.
It improves the model performance.
Principal Component Analysis (PCA)-
Principal Component Analysis is a well-known dimension reduction technique.
It transforms the variables into a new set of variables called principal components.
These principal components are linear combinations of the original variables and are orthogonal.
The first principal component accounts for most of the possible variation in the original data.
The second principal component tries to capture the remaining variance in the data.
There can be only two principal components for a two-dimensional data set.
PCA Algorithm-
The steps involved in the PCA algorithm are as follows-
Step-01: Get the data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract the mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step-06: Choose the components and form the new feature vector (project the data onto the principal components).
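The steps above can be sketched directly in NumPy; the six 2-D points below are hypothetical and serve only to make the computation concrete:
```python
import numpy as np

# Toy 2-D data set (hypothetical values, only to make the steps concrete)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

mean = X.mean(axis=0)                       # Step-02: compute the mean vector
X_centered = X - mean                       # Step-03: subtract the mean
cov = np.cov(X_centered, rowvar=False)      # Step-04: covariance matrix

# Step-05: eigenvectors and eigenvalues of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step-06: keep the eigenvector with the largest eigenvalue (1st principal
# component) and project the centred data onto it.
pc1 = eig_vecs[:, np.argmax(eig_vals)]
Z = X_centered @ pc1
print(Z)                                    # 1-D representation of the data
```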
Linear Discriminant Analysis (LDA)-
Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects the data onto this new axis in a way that maximizes the separation of the two categories, hence reducing the 2D graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such that it maximizes the distance between the means of the two classes and minimizes the variation within each class. In simple terms, this newly generated axis increases the separation between the data points of the two classes. After generating this new axis using the above-mentioned criteria, all the data points of the classes are plotted on this new axis and are shown in the figure given below.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both the classes linearly separable. In such cases, we use non-linear discriminant analysis.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Non-linear combinations of inputs are used, such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.
LDA Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular
application in which each face is represented by a very large number of pixel values. Linear
discriminant analysis (LDA) is used here to reduce the number of features to a more
manageable number before the process of classification. Each of the new dimensions
generated is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher's linear discriminant are called Fisher faces.
2. Medical: In this field, Linear Discriminant Analysis (LDA) is used to classify the patient's disease state as mild, moderate, or severe based upon the patient's various parameters and the medical treatment he is going through. This helps the doctors to intensify or reduce the pace of the treatment.
3. Customer Identification: Suppose we want to identify the type of customers most likely to buy a particular product in a shopping mall. By doing a simple question-and-answer survey, we can gather all the features of the customers. Here, linear discriminant analysis helps us identify and select the features which can describe the characteristics of the group of customers most likely to buy that particular product in the shopping mall.
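As a hedged illustration of LDA used for dimensionality reduction before classification (as in the face-recognition use case above), here is a minimal sketch assuming scikit-learn; the iris dataset merely stands in for the real feature data:
```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)           # stand-in feature data and labels

lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)       # project features onto the new axes

print(X.shape, "->", X_projected.shape)     # (150, 4) -> (150, 2)
print("Predictions:", lda.predict(X[:5]))   # LDA can also classify directly
```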
4. K-Nearest Neighbours
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into groups
identified by an attribute.
As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points to a group by analyzing the training set. Note that the unclassified points are marked as 'White'.
4.1 Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified point, we can assign it to a group by observing what group its nearest neighbours belong to. This means a point close to a cluster of points classified as 'Red' has a higher probability of getting classified as 'Red'.
Intuitively, we can see that the first point (2.5, 7) should be classified as 'Green' and the second point (5.5, 4.5) should be classified as 'Red'.
4.2 Algorithm
1. Store the training samples in an array of data points arr[]. This means each element of this array represents a tuple (x, y).
2. for i = 0 to m:
3. Calculate the Euclidean distance d(arr[i], p), where p is the point to be classified.
4. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
5. Return the majority label among S.
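A minimal Python sketch of these steps; the training points and labels below are hypothetical, chosen to match the intuition example of 'Green' and 'Red' groups:
```python
import numpy as np
from collections import Counter

def knn_classify(train_points, train_labels, p, k=3):
    """Classify point p by a majority vote among its k nearest training points."""
    # Steps 2-3: Euclidean distance from p to every stored training sample
    distances = np.linalg.norm(np.asarray(train_points) - np.asarray(p), axis=1)
    # Step 4: indices of the K smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 5: majority label among the K nearest neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data in the spirit of the table mentioned above
train_points = [(1, 7), (2, 8), (3, 7), (6, 3), (7, 4), (6, 5)]
train_labels = ['Green', 'Green', 'Green', 'Red', 'Red', 'Red']

print(knn_classify(train_points, train_labels, (2.5, 7)))    # -> 'Green'
print(knn_classify(train_points, train_labels, (5.5, 4.5)))  # -> 'Red'
```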
5. Naïve Bayes Classifier
Conditional probability (Bayes' theorem):
P(A|B) = [ P(B|A) × P(A) ] / P(B)
Where,
P(A): The probability of hypothesis A being true. This is known as the prior probability.
P(B): The probability of the evidence.
P(A|B): The probability of the hypothesis given that the evidence is true (the posterior probability).
P(B|A): The probability of the evidence given that the hypothesis is true (the likelihood).
It is a kind of classifier that works on Bayes' theorem. Prediction of membership probabilities is made for every class, such as the probability of a data point being associated with a particular class. The class having the maximum probability is regarded as the most suitable class. This is also referred to as Maximum A Posteriori (MAP).
Naive Bayes classifiers assume that all the variables or features are not related to each other. The existence or absence of a variable does not impact the existence or absence of any other variable. For example,
a fruit may be observed to be an apple if it is red, round, and about 4″ in diameter.
In this case, even if all the features are interrelated, a Naive Bayes classifier will treat all of them as contributing independently to the probability that the fruit is an apple.
1. Gaussian Naïve Bayes: When the feature values are continuous in nature, an assumption is made that the values linked with each class are distributed according to a Gaussian, that is, a Normal Distribution.
2. Multinomial Naïve Bayes: Multinomial Naive Bayes is preferred for data that is multinomially distributed. It is widely used in text classification in NLP, where each event constitutes the presence of a word in a document.
3. Bernoulli Naïve Bayes: When data is distributed according to multivariate Bernoulli distributions, Bernoulli Naive Bayes is used. That means there exist multiple features, but each one is assumed to contain a binary value. So, it requires features to be binary-valued.
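A minimal sketch of a Gaussian Naive Bayes classifier, assuming scikit-learn is available; the iris dataset is just an example of continuous-valued features:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)           # continuous-valued features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)                 # estimate a Gaussian per class/feature

print("Accuracy:", model.score(X_test, y_test))
print("Posterior probabilities:", model.predict_proba(X_test[:1]))  # MAP decision
```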
Applications of Naïve Bayes-
Real-time Prediction: Being a fast learning algorithm, it can be used to make predictions in real time as well.
MultiClass Classification: It can be used for multi-class classification problems also.
Text Classification: It has shown good results in multi-class text classification, and therefore has a higher success rate compared to many other algorithms. As a result, it is widely used in sentiment analysis and spam detection.
6. Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms which are used both for classification and regression. But generally, they are used in
classification problems. SVMs were first introduced in the 1960s but were later refined in the 1990s.
SVMs have their unique way of implementation as compared to other machine learning
algorithms. Lately, they are extremely popular because of their ability to handle multiple
continuous and categorical variables.
Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line will be defined with the help of these data points.
Hyperplane − As we can see in the above diagram, it is a decision plane or space that divides a set of objects having different classes.
Margin − It may be defined as the gap between two lines on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
The main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH), and it can be done in the following two steps −
First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.
Then, it will choose the hyperplane that separates the classes correctly with the maximum margin.
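A minimal sketch of a linear SVM finding such a separating hyperplane, assuming scikit-learn; the six 2-D points are hypothetical:
```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D training data for two classes
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')        # linear kernel -> a separating hyperplane
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)  # points closest to the hyperplane
print("Prediction for (3, 3):", clf.predict([[3, 3]]))
```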
7. Clustering
Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same groups are more similar to other data points in the same group and
dissimilar to the data points in other groups. It is basically a collection of objects on the basis
of similarity and dissimilarity between them.
For example, the data points in the graph below that are clustered together can be classified into one single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the picture below.
It is not necessary for clusters to be spherical. Common clustering methods are:
Density-Based Methods: These methods consider a cluster as a dense region having some similarity to, and differences from, the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify Clustering Structure), etc.
Hierarchical-Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. This method is used to optimize an objective criterion similarity function, for example when distance is a major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Grid-Based Methods: In this method, the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In Quest), etc.
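The following minimal sketch, assuming scikit-learn, shows two of the clustering families listed above, density-based (DBSCAN) and hierarchical (agglomerative), applied to a tiny hypothetical data set:
```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Tiny hypothetical data set with two obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],
              [8.0, 8.0], [8.5, 8.2], [7.9, 7.7]])

db_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)          # density-based
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)  # bottom-up hierarchical

print("DBSCAN labels:       ", db_labels)
print("Agglomerative labels:", agg_labels)
```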
K-means Clustering Algorithm-
The algorithm will categorize the items into k groups of similarity. To calculate that similarity, we will use the Euclidean distance as the measurement. The algorithm works as follows:
1. First, we initialize k points, called means, randomly.
2. We categorize each item to its closest mean and we update that mean's coordinates, which are the averages of the items categorized to that mean so far.
3. We repeat the process for a given number of iterations and, at the end, we have our clusters.
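The steps above can be sketched in NumPy as follows; the data points and the choice of k = 2 are hypothetical:
```python
import numpy as np

def k_means(X, k=3, iterations=100, seed=0):
    """A minimal k-means sketch following the steps above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialise k means randomly
    for _ in range(iterations):
        # 2. categorize each item to its closest mean ...
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # ... and update each mean to the average of the items assigned to it
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    # 3. after the given number of iterations, the assignments are our clusters
    return labels, means

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, means = k_means(X, k=2)
print(labels)
print(means)
```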