Machine Learning
Machine learning is a branch of artificial intelligence that enables algorithms to uncover hidden patterns
within datasets, allowing them to make predictions on new, similar data without explicit programming for
each task. Traditional machine learning combines data with statistical tools to predict outputs, yielding
actionable insights. This technology finds applications in diverse fields such as image and speech recognition,
natural language processing, recommendation systems, fraud detection, portfolio optimization, and
automating tasks.
Netflix, for example, employs collaborative and content-based filtering to recommend movies and TV shows
based on user viewing history, ratings, and genre preferences. Machine learning’s impact extends to
autonomous vehicles, drones, and robots, enhancing their adaptability in dynamic environments.
A typical machine learning workflow involves the following steps:
1. Data Collection:
First, relevant data is collected or curated. This data could include examples, features, or attributes that are
important for the task at hand, such as images, text, numerical data, etc.
2. Data Preprocessing:
Before feeding the data into the algorithm, it often needs to be preprocessed. This step may involve cleaning
the data (handling missing values, outliers), transforming the data (normalization, scaling), and splitting it
into training and test sets.
3. Choosing a Model:
Depending on the task (e.g., classification, regression, clustering), a suitable machine learning model is
chosen. Examples include decision trees, neural networks, support vector machines, and more advanced
models like deep learning architectures.
4. Training the Model:
The selected model is trained using the training data. During training, the algorithm learns patterns and
relationships in the data. This involves adjusting model parameters iteratively to minimize the difference
between predicted outputs and actual outputs (labels or targets) in the training data.
5. Evaluating the Model:
After training, the model's performance is measured on the held-out test set to check how well it generalizes to data it has not seen.
6. Fine-tuning:
Models may be fine-tuned by adjusting hyperparameters (parameters that are not directly learned during
training, like learning rate or number of hidden layers in a neural network) to improve performance.
7. Prediction or Inference:
Finally, the trained model is used to make predictions or decisions on new data. This process involves
applying the learned patterns to new inputs to generate outputs, such as class labels in classification tasks or
numerical values in regression tasks.
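To make these steps concrete, here is a minimal end-to-end sketch in Python, assuming scikit-learn and NumPy are available; the dataset is synthetic and the decision tree is just one illustrative model choice.

```python
# A minimal sketch of the workflow above (steps 1-7), assuming scikit-learn
# and NumPy; the data is synthetic and purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data collection: 200 samples, 4 numerical features, a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 2. Preprocessing: split into training/test sets, then scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3-4. Choose a model and train it on the training data.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# 5-6. Evaluate on the test set; hyperparameters such as max_depth
# would be fine-tuned based on this score.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 7. Prediction/inference on a new, unseen data point.
new_point = scaler.transform(rng.normal(size=(1, 4)))
print("prediction:", model.predict(new_point))
```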
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train machines using a "labelled" dataset, and based on that training, the machine predicts the output. Here, labelled data means that some of the inputs are already mapped to outputs. More precisely, we first train the machine with inputs and their corresponding outputs, and then we ask the machine to predict the output for a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tails of cats and dogs, the shape of the eyes, colour, height (dogs are taller, cats are smaller), and so on. After training, we input a picture of a cat and ask the machine to identify the object and predict the output. Since the machine is well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, and tail, and find that it is a cat. So, it will put it in the cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. Classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is a continuous value and there is a relationship between the input and output variables. They are used to predict continuous outputs, such as market trends, weather, etc.
Advantages and Disadvantages of Supervised Learning
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o It may predict the wrong output if the test data differs from the training data.
Applications of Supervised Learning:
o Image Segmentation: Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis purposes. This is done using medical images and historical data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
o Fraud Detection: Supervised learning classification algorithms are used for identifying fraudulent transactions, fraudulent customers, etc. This is done by using historical data to identify the patterns that can indicate possible fraud.
o Spam Detection: In spam detection and filtering, classification algorithms are used. These algorithms classify an email as spam or not spam, and the spam emails are sent to the spam folder.
o Speech Recognition: Supervised learning algorithms are also used in speech recognition. The algorithm is trained with voice data, and various identifications can be done using it, such as voice-activated passwords, voice commands, etc.
In unsupervised learning, there is no need for supervision: the machine is trained using an unlabeled dataset and predicts the output without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are entirely unknown to the model, and its task is to find the patterns and categories of the objects.
The machine will discover the patterns and differences on its own, such as differences in colour and shape, and predict the output when it is tested with the test dataset.
Unsupervised learning can be further classified into two types of problems, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a way to group
the objects into a cluster such that the objects with the most similarities remain in one group and have fewer or
no similarities with the objects of other groups. An example of the clustering algorithm is grouping the
customers by their purchasing behaviour.
2) Association
Association rule learning is an unsupervised learning technique that finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another and to map those variables accordingly, for example so that a retailer can maximize profit. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc.
Advantages and Disadvantages of Unsupervised Learning
Advantages:
o These algorithms can be used for more complicated tasks than supervised algorithms, because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for many tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled one.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, since the dataset is not labelled and the algorithm is not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it uses an unlabelled dataset that is not mapped to outputs.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning. It represents the intermediate ground between Supervised (With Labelled
training data) and Unsupervised learning (with no labelled training data) algorithms and uses the combination of
labelled and unlabeled datasets during the training period.
To overcome the drawbacks of supervised and unsupervised learning algorithms, the concept of semi-supervised learning was introduced. The main aim of semi-supervised learning is to make effective use of all the available data, rather than only labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, and the clusters then help to turn the unlabeled data into labelled data. This matters because labelled data is comparatively more expensive to acquire than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is like a student studying under the supervision of an instructor at home and college. If that student instead analyses the same concepts on their own without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student first analyses the concepts under the guidance of an instructor at college and then revises them by themselves.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
o It is highly efficient.
Disadvantages:
o Accuracy can be low.
Applications of Semi-Supervised Learning:
Here are some common applications of semi-supervised learning:
o Image Classification and Object Recognition: Improve the accuracy of models by combining a
small set of labeled images with a larger set of unlabeled images.
o Natural Language Processing (NLP): Enhance the performance of language models and classifiers
by combining a small set of labeled text data with a vast amount of unlabeled text.
o Speech Recognition: Improve the accuracy of speech recognition by leveraging a limited amount of
transcribed speech data and a more extensive set of unlabeled audio.
o Recommendation Systems: Improve the accuracy of personalized recommendations by
supplementing a sparse set of user-item interactions (labeled data) with a wealth of unlabeled user
behavior data.
o Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small set of labeled
medical images alongside a larger set of unlabeled images.
4. Reinforcement Learning:
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial, taking actions, learning from experience, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data as in supervised learning; agents learn from their experiences only. Due to this way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
Categories of Reinforcement Learning:
Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency
that the required behaviour would occur again by adding something. It enhances the strength of the
behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to the
positive RL. It increases the tendency that the specific behaviour would occur again by avoiding the
negative condition.
Real-world Use Cases of Reinforcement Learning:
o Video Games: RL algorithms are very popular in gaming applications, where they are used to attain super-human performance. Popular systems built on RL include AlphaGo and AlphaGo Zero.
o Resource Management: The paper "Resource Management with Deep Reinforcement Learning" showed how RL can be used in computer systems to automatically learn to schedule resources across waiting jobs in order to minimize average job slowdown.
o Robotics: RL is widely used in robotics applications. Robots are used in industrial and manufacturing settings, and reinforcement learning makes these robots more capable. Various industries have a vision of building intelligent robots using AI and machine learning technology.
o Text Mining: Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by Salesforce.
Advantages and Disadvantages of Reinforcement Learning:
Advantages
o It helps in solving complex real-world problems which are difficult to be solved by general techniques.
o The learning model of RL is similar to the way human beings learn; hence, highly accurate results can be obtained.
Disadvantages
o Too much reinforcement learning can lead to an overload of states, which can weaken the results.
What is Regression?
Regression is a statistical approach used to analyze the relationship between a dependent variable (target
variable) and one or more independent variables (predictor variables). The objective is to determine the most
suitable function that characterizes the connection between these variables.
It is a supervised machine learning technique, used to predict the value of the dependent variable for new,
unseen data. It models the relationship between the input features and the target variable, allowing for the
estimation or prediction of numerical values.
A regression analysis problem arises when the output variable is a real or continuous value, such as "salary" or "weight". It is mainly used for prediction, forecasting, time-series modeling, and determining cause-and-effect relationships between variables.
In regression, we plot a graph between the variables that best fits the given datapoints; using this plot, the machine learning model can make predictions about the data. In simple words, regression fits a line or curve through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimized. This distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
Terminologies Related to Regression Analysis:
o Dependent Variable: The main factor in regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors that affect the dependent variable, or that are used to predict its values, are called independent variables, also called predictors.
o Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. Outliers may distort the results, so they should be handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. And if our algorithm does not perform well even with the training dataset, the problem is called underfitting.
Types of Regression:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear regression.
And if there is more than one input variable, then such linear regression is called multiple linear
regression.
o The relationship between the variables in a linear regression model can be pictured as a straight line fitted to the datapoints; for example, predicting the salary of an employee on the basis of years of experience.
A popular application of linear regression is:
o Salary forecasting
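As an illustration of this idea, below is a minimal simple linear regression sketch, assuming scikit-learn is available; the years-of-experience and salary figures are invented for the example.

```python
# A minimal simple linear regression sketch, assuming scikit-learn; the
# years-of-experience and salary figures are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1], [2], [3], [4], [5], [6]])               # x
salary = np.array([30000, 35000, 41000, 46000, 52000, 57000])  # y

model = LinearRegression().fit(years, salary)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("salary predicted for 7 years:", model.predict([[7]])[0])
```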
Logistic Regression:
Logistic regression is a supervised learning algorithm used for solving classification problems. It predicts a categorical output from a probability produced by the logistic (sigmoid) function f(x) = 1 / (1 + e^-x). When we provide the input values (data) to this function, it gives an S-curve as output.
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multi (cats, dogs, lions)
o Ordinal (low, medium, high)
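The sketch below illustrates this behaviour, assuming scikit-learn; the hours-studied / pass-fail data is invented, and predict_proba exposes the S-curve value before the threshold is applied.

```python
# A minimal binary logistic regression sketch, assuming scikit-learn;
# the hours-studied / pass-fail data is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = fail, 1 = pass

clf = LogisticRegression().fit(hours, passed)
# predict_proba exposes the S-curve value; predict applies the 0.5
# threshold, rounding up to class 1 or down to class 0.
print(clf.predict_proba([[4.5]]))  # [P(fail), P(pass)]
print(clf.predict([[4.5]]))
```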
Polynomial Regression:
Polynomial regression is used to model nonlinear relationships between the dependent variable and the
independent variables. It adds polynomial terms to the linear regression model to capture more complex
relationships.
o The equation for polynomial regression is also derived from the linear regression equation: the linear equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our independent/input variable.
o The model is still considered linear because the coefficients are linear; only the input variable is raised to higher (quadratic, cubic, ...) powers.
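A minimal polynomial regression sketch follows, assuming scikit-learn: PolynomialFeatures expands x into its powers, and an ordinary linear model is then fitted on the expanded features; the data is invented.

```python
# A minimal polynomial regression sketch, assuming scikit-learn:
# PolynomialFeatures expands x into (1, x, x^2, x^3), and an ordinary
# linear model is then fitted on the expanded features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 3  # a nonlinear relationship

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))  # close to 1 + 2*2 + 0.5*8 = 9
```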
Support Vector Regression:
Support vector regression (SVR) is a type of regression algorithm that is based on the support vector machine (SVM) algorithm. SVM is primarily used for classification tasks, but it can also be used for regression. SVR works by finding a function (hyperplane) such that as many data points as possible lie within a margin of tolerance around it, penalizing predictions that fall outside the margin.
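A minimal SVR sketch, assuming scikit-learn; the epsilon parameter sets the width of the tolerance margin described above, and the sine-shaped data is invented.

```python
# A minimal SVR sketch, assuming scikit-learn; epsilon sets the width of
# the tolerance margin around the fitted function.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(X).ravel()

svr = SVR(kernel="rbf", epsilon=0.1).fit(X, y)
print(svr.predict([[1.5]]))  # approximately sin(1.5) ≈ 1.0
```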
Random Forest Regression:
Random forest regression is an ensemble method that combines multiple decision trees to predict the target value. Ensemble methods are machine learning algorithms that combine multiple models to improve the performance of the overall model. Random forest regression works by building a large number of decision trees, each of which is trained on a different subset of the training data. The final prediction is made by averaging the predictions of all of the trees.
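A minimal random forest regression sketch, assuming scikit-learn; 100 trees are trained on bootstrap subsets of invented data and their individual predictions are averaged.

```python
# A minimal random forest regression sketch, assuming scikit-learn;
# 100 trees are trained on bootstrap subsets and their outputs averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=200)  # y ≈ 3x plus noise

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[4.0]]))  # close to 3 * 4 = 12
```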
What is Classification?
Classification is a supervised learning technique in which the output variable is categorical. The algorithm which implements the classification on a dataset is known as a classifier. There are two types of classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, it is called a binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, it is called a multi-class classifier.
Example: Classifications of types of crops, Classification of types of music.
Types of ML Classification Algorithms:
Classification algorithms can be further divided into two main categories:
o Linear Models
  o Logistic Regression
o Non-linear Models
  o K-Nearest Neighbours
  o Naïve Bayes
Classification algorithms can be used in different places. Below are some popular use cases of Classification
Algorithms:
o Email Spam Detection
o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for
Classification as well as Regression problems. However, primarily, it is used for Classification problems in
Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional
space into classes so that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), it will see the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space,
but we need to find out the best decision boundary that helps to classify the data points. This best boundary is
known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has
two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the
pair(x1, x2) of coordinates in either green or blue. Consider the below image:
Since it is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
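A minimal linear SVM sketch, assuming scikit-learn; the six 2-D points and their two tags are invented, and support_vectors_ exposes the extreme points that fix the maximum-margin hyperplane.

```python
# A minimal linear SVM sketch, assuming scikit-learn; the six 2-D points
# and their two tags are invented for illustration.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])  # two tags, e.g. blue vs. green

clf = SVC(kernel="linear").fit(X, y)
# The support vectors are the extreme points that fix the maximum-margin
# hyperplane; other points could be removed without moving the boundary.
print("support vectors:\n", clf.support_vectors_)
print("prediction for (4, 4):", clf.predict([[4, 4]]))
```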
Non-Linear SVM: If data is linearly separable, we can separate it using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space by taking z = 1, the boundary becomes a circle of radius 1 around the origin.
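The sketch below reproduces this lifting trick numerically, assuming scikit-learn and NumPy: points on two concentric circles are not linearly separable in (x, y), but adding the feature z = x² + y² makes a linear SVM separate them perfectly.

```python
# A sketch of the dimension-lifting trick above, assuming scikit-learn
# and NumPy: two concentric circles are not linearly separable in (x, y),
# but adding the feature z = x^2 + y^2 makes them separable by a plane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.concatenate([np.full(50, 1.0), np.full(50, 3.0)])  # inner/outer
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (radii > 2).astype(int)  # label: which circle a point belongs to

z = (X ** 2).sum(axis=1, keepdims=True)  # z = x^2 + y^2
X3 = np.hstack([X, z])                   # lifted 3-D feature space

clf = SVC(kernel="linear").fit(X3, y)    # a flat plane now separates them
print("training accuracy:", clf.score(X3, y))  # 1.0
```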
K-Nearest Neighbour (K-NN) Algorithm:
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will the data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry; for two points (x1, y1) and (x2, y2) it is calculated as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the majority of the 5 nearest neighbors (3 of them) are from category A; hence this new data point must belong to category A.
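A minimal K-NN sketch of the steps above, assuming scikit-learn; the points and their categories are invented, k = 5, and Euclidean distance is scikit-learn's default metric.

```python
# A minimal K-NN sketch of the steps above, assuming scikit-learn;
# the points and their categories are invented, with k = 5 neighbours
# and Euclidean distance (the default metric).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 2], [2, 1], [6, 6], [6, 7], [7, 6], [7, 7]])
y = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
# The new point's class is the majority class among its 5 nearest points.
print(knn.predict([[3, 3]]))  # ['A']: 4 of the 5 nearest neighbours are A
```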
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find the
best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
o Large values for K smooth out noise, but a value that is too large may include points from other categories and blur the class boundaries.
Advantages of KNN:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of KNN:
o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high, because the distance to all the training samples must be calculated for each prediction.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Example: Suppose we have the following dataset of weather conditions ("Outlook") and a corresponding target variable "Play". To decide whether to play on a given day, we convert the dataset into frequency tables, generate a likelihood table, and then use Bayes' theorem to calculate the posterior probabilities.
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather No Yes
Overcast 0 5 5/14=0.35
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
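The same calculation can be checked with a few lines of plain Python; the fractions below come directly from the frequency table above.

```python
# Reproducing the manual calculation above in plain Python. The fractions
# come straight from the frequency table: 10 Yes and 4 No out of 14 days;
# Sunny occurs 3 times with Yes and 2 times with No (5 Sunny days total).
p_yes, p_no = 10 / 14, 4 / 14        # priors P(Yes), P(No)
p_sunny = 5 / 14                     # evidence P(Sunny)
p_sunny_given_yes = 3 / 10           # P(Sunny|Yes)
p_sunny_given_no = 2 / 4             # P(Sunny|No)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
# ~0.60 and 0.40; the 0.41 above comes from rounding the intermediates.
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))
```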
Decision Tree Terminologies:
o Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
o Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a
leaf node.
o Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the
given conditions.
o Branch/Sub Tree: A tree formed by splitting the tree.
o Pruning: Pruning is the process of removing the unwanted branches from the tree.
o Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
How does the Decision Tree algorithm Work?
o In a decision tree, for predicting the class of a given record, the algorithm starts from the root node of the tree. The algorithm compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
o At the next node, the algorithm again compares the record's attribute value with the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based on
an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
where the entropy of a set with a fraction P(yes) of positive and P(no) of negative examples is
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
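The sketch below applies these formulas to the Outlook/Play table above, using only the Python standard library: it computes the entropy of the whole dataset and the information gain of splitting on Outlook.

```python
# A sketch applying the formulas above to the Outlook/Play table:
# entropy of the full dataset and information gain of splitting on Outlook.
from math import log2

def entropy(p_yes, p_no):
    # -P(yes)log2 P(yes) - P(no)log2 P(no); zero-probability terms are 0.
    return -sum(p * log2(p) for p in (p_yes, p_no) if p > 0)

h_s = entropy(10 / 14, 4 / 14)  # whole dataset: 10 Yes, 4 No

# Weighted average entropy after splitting on Outlook:
# Overcast 5 Yes / 0 No, Rainy 2 Yes / 2 No, Sunny 3 Yes / 2 No.
h_split = (5 / 14) * entropy(5 / 5, 0 / 5) \
        + (4 / 14) * entropy(2 / 4, 2 / 4) \
        + (5 / 14) * entropy(3 / 5, 2 / 5)

print("information gain:", h_s - h_split)  # roughly 0.23
```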
To overcome overfitting, pruning techniques are used. Pruning reduces the size of the tree by removing
nodes that provide little power in classifying instances. There are two main types of pruning:
Pre-pruning (Early Stopping): Stops the tree from growing once it meets certain criteria (e.g., maximum
depth, minimum number of samples per leaf).
Post-pruning: Removes branches from a fully grown tree that do not provide significant power.
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process that a human follows while making a decision in real life.
Disadvantages of the Decision Tree
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Random Forest Algorithm:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can
be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
As the name suggests, "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and reduces the risk of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For new data points, find the prediction of each decision tree, and assign the new data points to the category that wins the majority vote.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the Random
forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase,
each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below image:
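A minimal random forest classification sketch, assuming scikit-learn; it uses the bundled Iris dataset in place of the fruit images from the example, and majority voting across 100 trees yields the final class.

```python
# A minimal random forest classification sketch, assuming scikit-learn;
# the bundled Iris dataset stands in for the fruit images in the example.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap subset; majority voting decides.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```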
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
Disadvantages of Random Forest:
o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has few or no similarities with any other group."
It is an unsupervised learning method. Clustering is done by finding similar patterns in the unlabelled dataset, such as shape, size, colour, and behaviour, and dividing the data according to the presence and absence of those patterns.
After applying this clustering technique, each cluster or group is given a cluster ID, which the ML system can use to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a shopping mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and trousers in another, and in the vegetable section, apples, bananas, mangoes, etc. are grouped separately so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique can be widely used in various tasks. Some most common uses of this technique are:
o Market Segmentation - Businesses use clustering to group their customers and use targeted advertisements to attract a wider audience
o Statistical data analysis
o Medical Imaging – Doctors use clustering to find diseased areas in diagnostic images like X-rays.
o Anomaly detection - To find outliers in a stream of real-time data or to flag fraudulent transactions
o Apart from these general usages, clustering is used by Amazon in its recommendation system to provide recommendations based on past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are divided
into several groups with similar properties.
Clustering broadly divides into two subgroups:
Hard Clustering: Each input data point either fully belongs to a cluster or it does not. For instance, if customers are grouped into ten clusters, each customer is assigned to exactly one of the ten groups.
Soft Clustering: Rather than assigning each input data point to a distinct cluster, it assigns a probability or likelihood of the data point belonging to each cluster. In the same scenario, each customer would receive a probability of being in any of the ten clusters.
1) K Means Clustering
K-means is an iterative clustering algorithm that converges to a locally optimal clustering. The algorithm works in these 5 steps:
Step1:
Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.
Step 2:
Randomly assign each data point to a cluster: Let’s assign three points in cluster 1, shown using red color,
and two points in cluster 2, shown using grey color.
Step 3:
Compute cluster centroids: The centroid of data points in the red cluster is shown using the red cross, and
those in the grey cluster using a grey cross.
Step 4:
Re-assign each point to the closest cluster centroid: In our example, the data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the grey cluster; thus, we re-assign that data point to the grey cluster.
Step 5:
Re-compute cluster centroids: Now, re-computing the centroids for both clusters.
Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps until the algorithm converges, i.e., until there is no further switching of data points between the two clusters over two successive repeats. This marks the termination of the algorithm if no stopping criterion is explicitly specified.
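A minimal K-means sketch mirroring these steps, assuming scikit-learn; five invented 2-D points are grouped into k = 2 clusters, and the fitted object exposes the final labels and centroids.

```python
# A minimal K-means sketch mirroring the steps above, assuming
# scikit-learn; five invented 2-D points are grouped into k = 2 clusters.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 2], [8, 8], [8.5, 9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:", km.labels_)              # cluster assignment per point
print("centroids:", km.cluster_centers_)  # final cluster centers
```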
2) Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected. This algorithm does it by identifying
different clusters in the dataset and connects the areas of high densities into clusters. The dense areas in data
space are divided from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high
dimensions.
3) Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that it belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian Mixture
Models (GMM).
4) Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
5) Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that reflect its degree of membership in each cluster. The Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Popular Clustering Algorithms:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into clusters of equal variance. The number of clusters must be specified for this algorithm. It is fast, needs relatively few computations, and has linear complexity O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of data
points. It is an example of a centroid-based model, that works on updating the candidates for centroid to
be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It
is an example of a density-based model similar to the mean-shift, but with some remarkable advantages.
In this algorithm, the areas of high density are separated by the areas of low density. Because of this, the clusters can be found in any arbitrary shape (see the sketch after this list).
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, the data points are assumed to be Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset and
then successively merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from the other clustering algorithms in that it does not require the number of clusters to be specified. In this algorithm, data points exchange messages between pairs of points until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
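As a concrete example of item 3 above, here is a minimal DBSCAN sketch, assuming scikit-learn; the two-moons dataset is a standard example of arbitrarily shaped clusters, and points labelled -1 are treated as noise.

```python
# A minimal DBSCAN sketch (item 3 above), assuming scikit-learn; the
# two-moons dataset is a standard example of arbitrarily shaped clusters.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# Points in sparse regions get the label -1 and are treated as noise.
print("clusters found:", len(set(labels) - {-1}))  # expected: 2
```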
Applications of Clustering:
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of
cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears
based on the closest object to the search query. It does it by grouping similar data objects in one group
that is far from the other dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their choice
and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using the
image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful in determining the purpose for which a particular piece of land is most suitable.
Association Rule Learning:
Association rule learning finds interesting relationships among items in large datasets. For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three main algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
1) Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a Hash Tree to calculate itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be bought together. It
can also be used in the healthcare field to find drug reactions for patients.
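A sketch of market basket analysis with Apriori follows, assuming the third-party mlxtend library is installed (pip install mlxtend) along with pandas; the four transactions are invented for illustration.

```python
# A sketch of market basket analysis with Apriori, assuming the
# third-party mlxtend library; the four transactions are invented.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "eggs"],
                ["butter", "milk"]]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets appearing in at least half of the transactions,
# e.g. {bread, butter}; association rules are then derived from these.
print(apriori(df, min_support=0.5, use_colnames=True))
```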
2) Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database. It generally executes faster than the Apriori algorithm.
3) F-P Growth Algorithm
The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree. The purpose of this tree is to extract the most frequent patterns.
It has various applications in machine learning and data mining. Below are some popular applications of
association rule learning:
o Market Basket Analysis: It is one of the popular examples and applications of association rule mining.
This technique is commonly used by big retailers to determine the association between items.
o Medical Diagnosis: Association rules help in identifying the probability of illness for a particular disease, which can make diagnosing patients easier.
o Protein Sequencing: Association rules help in determining the synthesis of artificial proteins.
o It is also used for catalog design, loss-leader analysis, and many other applications.