Data Mining: Frequent Pattern Analysis
Mining frequent patterns is the process of discovering sets of items that frequently occur together in a given
dataset. It involves finding all itemsets that meet a minimum support threshold and generating association rules
from the frequent itemsets. Apriori, FP-growth, and Eclat are popular algorithms for mining frequent patterns.
This technique has various applications, including market basket analysis, recommending products, and
network traffic analysis. It is a powerful method to identify relationships and insights in large datasets.
Introduction
Mining frequent patterns is a data mining technique that involves discovering sets of items frequently occurring
together in a given dataset. A frequent pattern is a set of items occurring frequently in a given dataset. In other
words, it is a subset of items that appears in a minimum number of transactions or records in the dataset. The
frequency of a pattern is typically measured by its support, which is the percentage of transactions in the
dataset that contain the pattern. A pattern is considered frequent if its support exceeds a predefined threshold.
Mining frequent patterns is important because it can help identify relationships and correlations among various
items in a large dataset, which can be useful in various applications such as market basket analysis,
recommendation systems, and network traffic analysis. For example, in market basket analysis, frequent
itemsets can be used to identify which products are commonly purchased together. This can be used to
optimize store layout or recommend related products to customers.
To illustrate, consider a dataset of customer purchases in a retail store. A frequent pattern may be that
customers who purchase bread and milk are also likely to purchase eggs. By identifying such patterns, a store
can optimize its product placement and promotions to increase sales and customer satisfaction.
A few of the commonly used algorithms for mining frequent patterns include the following -
• Apriori - Apriori is a classic algorithm for mining frequent patterns in large datasets. It works by iteratively
generating candidate itemsets of increasing size and pruning those that do not meet the minimum
support threshold. This approach significantly reduces the search space and makes it possible to handle
datasets with a large number of items. However, Apriori can be computationally expensive for datasets
with many infrequent itemsets.
• FP-growth - FP-growth is an algorithm for mining frequent patterns that uses a divide-and-conquer
approach. It constructs a tree-like data structure called the frequent pattern (FP) tree, where each node
represents an item in a frequent pattern, and its children represent its immediate sub-patterns. By
scanning the dataset only twice, FP-growth can efficiently mine all frequent itemsets without generating
candidate itemsets explicitly. It is particularly suitable for datasets with long patterns and relatively low
support thresholds.
• Eclat - Eclat is a depth-first search algorithm for mining frequent itemsets similar to Apriori. However,
instead of generating candidate itemsets of increasing size, Eclat uses a vertical representation of the
dataset to identify frequent itemsets recursively. It exploits the overlap among the itemsets in different
transactions to reduce the search space and is efficient for datasets with many short and frequent
itemsets. However, Eclat may perform poorly for datasets with long itemsets or low support thresholds.
In the subsequent sections, let’s understand various terminologies used in mining frequent patterns.
Support
In data mining, support is a measure used to identify frequent patterns in a dataset. It is the proportion of
transactions or records in the dataset that contain a given set of items or attributes. The support value is
typically expressed as a percentage or decimal value between 0 and 1.
For example, consider a dataset of customer transactions at a grocery store that contains the following items -
milk, bread, cheese, eggs, butter, and yogurt. Suppose we want to find frequent itemsets of products commonly
purchased together. If we set a minimum support threshold of 30%, we would only consider itemsets that
appear in at least 30% of the transactions in the dataset. To calculate the support of an itemset, we count the
number of transactions in which it appears and divide it by the total number of transactions in the dataset. For
instance, if the itemset {bread, eggs} appears in 5 out of 10 transactions in the dataset, then its support
is 5/10 = 0.5, or 50%. As the support of {bread, eggs} is higher than the defined threshold of 30%, it will
be considered a frequent itemset.
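To make this calculation concrete, here is a minimal Python sketch of computing support; the transaction list and the compute_support helper are hypothetical examples created for illustration, not part of any particular library.

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "eggs"}, {"milk", "yogurt"},
    {"bread", "eggs", "butter"}, {"cheese", "milk"}, {"bread", "eggs", "milk"},
    {"butter", "yogurt"}, {"bread", "eggs", "cheese"}, {"milk", "bread"}, {"eggs", "milk"},
]

def compute_support(itemset, transactions):
    # Count the transactions that contain every item in the itemset.
    count = sum(1 for t in transactions if itemset.issubset(t))
    return count / len(transactions)

print(compute_support({"bread", "eggs"}, transactions))  # 5/10 = 0.5 for this toy data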
Confidence
In data mining, confidence is a measure used to determine the strength of association between two items in a
frequent pattern. It is the conditional probability that item Y appears in a transaction, given that item X also
appears in the same transaction.
For example, suppose we have a dataset of customer transactions at a grocery store. We can calculate the
confidence of an association rule, such as {bread, milk} -> {eggs}, which means that customers who buy bread
and milk are likely to also buy eggs.
The confidence of an association rule is calculated as the support of the combined itemset divided by the
support of the antecedent (left-hand side) itemset. In other words, it measures the proportion of transactions
that contain both the antecedent and consequent itemsets out of the transactions that contain the antecedent
itemset. The formula for calculating confidence is shown below -
confidence(A⇒B) = P(B|A) = sup(A∪B) / sup(A)
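As a hedged illustration of this formula, the sketch below computes the confidence of a rule from raw transaction data; the transactions and the confidence helper are hypothetical.

transactions = [
    {"bread", "milk", "eggs"}, {"bread", "milk"}, {"milk", "yogurt"},
    {"bread", "milk", "eggs"}, {"bread", "butter"}, {"milk", "eggs"},
]

def confidence(antecedent, consequent, transactions):
    # confidence(A => B) = support(A ∪ B) / support(A)
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    antecedent_count = sum(1 for t in transactions if antecedent <= t)
    return both / antecedent_count if antecedent_count else 0.0

print(confidence({"bread", "milk"}, {"eggs"}, transactions))  # 2/3 ≈ 0.67 for this toy data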
Association and Correlation in Data Mining are two of the most widely used techniques. They are used to
identify patterns and relationships between variables in a dataset. Association refers to the discovery of co-
occurrences or relationships between items in a dataset. On the other hand, correlation measures the strength
of the relationship between two variables. It provides an insight into how the variables are related and how they
affect each other.
Introduction
Data mining is the process of extracting useful information and knowledge from large datasets. With the ever-
increasing amount of data generated in various domains, data mining has become crucial for organizations to
make informed decisions. Association and Correlation in data mining are two of the most commonly used
techniques that help identify patterns, trends, and relationships between variables. Association analysis is used
to discover co-occurrences or relationships between items in a dataset, while correlation analysis measures the
strength of the relationship between two variables. In the subsequent sections, let’s explore both techniques,
their types, and various algorithms/methods to implement them.
What is Association?
Association is a technique used in data mining to identify the relationships or co-occurrences between items in
a dataset. It involves analyzing large datasets to discover patterns or associations between items, such as
products purchased together in a supermarket or web pages frequently visited together on a website.
Association analysis is based on the idea of finding the most frequent patterns or itemsets in a dataset, where an
itemset is a collection of one or more items.
Association analysis can provide valuable insights into consumer behaviour and preferences. It can help
retailers identify the items that are frequently purchased together, which can be used to optimize product
placement and promotions. Similarly, it can help e-commerce websites recommend related products to
customers based on their purchase history.
Types of Associations
Here are the most common types of associations used in data mining:
• Itemset Associations: Itemset association is the most common type of association analysis, which is
used to discover relationships between items in a dataset. In this type of association, a collection of one
or more items that frequently co-occur together is called an itemset. For example, in a supermarket
dataset, itemset association can be used to identify items that are frequently purchased together, such
as bread and butter.
• Sequential Associations: Sequential association is used to identify patterns that occur in a specific
sequence or order. This type of association analysis is commonly used in applications such as analyzing
customer behaviour on e-commerce websites or studying weblogs. For example, in the weblogs dataset,
a sequential association can be used to identify the sequence of pages that users visit before making a
purchase.
• Graph-based Associations: Graph-based association is a type of association analysis that involves
representing the relationships between items in a dataset as a graph. In this type of association, each
item is represented as a node in the graph, and the edges between nodes represent the co-occurrence or
relationship between items. The graph-based association is used in various applications, such as social
network analysis, recommendation systems, and fraud detection. For example, in a social network
dataset, it can be used to identify groups of users with similar interests or behaviours.
Here are the most commonly used algorithms to implement association rule mining in data mining:
• Apriori Algorithm - Apriori is one of the most widely used algorithms for association rule mining. It
generates frequent item sets from a given dataset by pruning infrequent item sets iteratively. The Apriori
algorithm is based on the concept that if an item set is frequent, then all of its subsets must also be
frequent. The algorithm first identifies the frequent items in the dataset, then generates candidate
itemsets of length two from the frequent items, and so on until no more frequent itemsets can be
generated. The Apriori algorithm is computationally expensive, especially for large datasets with many
items.
• FP-Growth Algorithm - FP-Growth is another popular algorithm for association rule mining that is based
on the concept of frequent pattern growth. It is faster than the Apriori algorithm, especially for large
datasets. The FP-Growth algorithm builds a compact representation of the dataset called a frequent
pattern tree (FP-tree), which is used to mine frequent item sets. The algorithm scans the dataset only
twice, first to build the FP-tree and then to mine the frequent itemsets. The FP-Growth algorithm can
handle datasets with both discrete and continuous attributes.
• Eclat Algorithm - Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is a frequent
itemset mining algorithm based on the vertical data format. The algorithm first converts the dataset into a
vertical data format, where each item and the transaction ID in which it appears are stored. Eclat then
performs a depth-first search on a tree-like structure, representing the dataset's frequent itemsets. The
algorithm is efficient regarding both memory usage and runtime, especially for sparse datasets.
Correlation Analysis is a data mining technique used to identify the degree to which two or more variables are
related or associated with each other. Correlation refers to the statistical relationship between two or more
variables, where the variation in one variable is associated with the variation in another variable. In other words,
it measures how changes in one variable are related to changes in another variable. Correlation can
be positive, negative, or zero, depending on the direction and strength of the relationship between the variables.
For example, suppose we are studying the relationship between the hours of study and the grades obtained by
students. If we find that as the number of hours of study increases, the grades obtained also increase, then there
is a positive correlation between the two variables. On the other hand, if we find that as the number of hours of
study increases, the grades obtained decrease, then there is a negative correlation between the two variables. If
there is no relationship between the two variables, we would say that there is zero correlation.
Correlation analysis is important because it allows us to measure the strength and direction of the relationship
between two or more variables. This information can help identify patterns and trends in the data, make
predictions, and select relevant variables for analysis. By understanding the relationships between different
variables, we can gain valuable insights into complex systems and make informed decisions based on data-
driven analysis.
There are three main types of correlation analysis used in data mining, as mentioned below:
• Pearson Correlation Coefficient - Pearson correlation measures the linear relationship between two
continuous variables. It ranges from -1 to +1, where -1 indicates a perfect negative correlation, 0 indicates
no correlation, and +1 indicates a perfect positive correlation. The Pearson correlation coefficient
between two variables, X and Y, is calculated as follows -
r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² × Σ(Y − Ȳ)²)
• Kendall Rank Correlation - Kendall correlation is a non-parametric measure of the association between
two ordinal variables. It measures the degree of correspondence between the ranking of observations on
two variables. It calculates the difference between the number of concordant pairs (pairs of observations
that have the same rank order in both variables) and discordant pairs (pairs of observations that have an
opposite rank order in the two variables) and normalizes the result by dividing by the total number of
pairs. The formula for the Kendall correlation is -
τ = (C − D) / (n(n − 1) / 2), where C is the number of concordant pairs, D is the number of discordant pairs, and n is the number of observations.
The resulting correlation score can be interpreted as follows (a short code sketch for computing these coefficients follows the list) -
• Any score from +0.5 to +1 indicates a very strong positive correlation, meaning that the variables are
strongly related in a positive direction, increasing together or simultaneously.
• Any score from -0.5 to -1 indicates a strong negative correlation, meaning that the variables are strongly
related in a negative direction. It also means that as one variable decreases, the other variable increases
and vice-versa.
• A score of 0 indicates no correlation, meaning there is no relationship between the analyzed variables.
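As a rough illustration (assuming NumPy and SciPy are available), the sketch below computes a Pearson coefficient and a Kendall tau for the hours-of-study example; the numbers are invented for illustration.

import numpy as np
from scipy.stats import pearsonr, kendalltau

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])           # hours of study (hypothetical)
grades = np.array([52, 55, 61, 60, 68, 74, 79, 85])  # grades obtained (hypothetical)

r, _ = pearsonr(hours, grades)      # linear relationship between continuous variables
tau, _ = kendalltau(hours, grades)  # rank-based (ordinal) relationship
print(f"Pearson r = {r:.2f}, Kendall tau = {tau:.2f}")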
Correlation analysis is a powerful tool in data mining and statistical analysis that offers several benefits.
• Identifying Relationships - Correlation analysis helps identify the relationships between different
variables in a dataset. By quantifying the degree and direction of the relationship, we can gain insights into
how changes in one variable are likely to affect the other.
• Prediction - Correlation analysis can help predict one variable's values based on another variable's
values. Building models based on correlations can predict future outcomes and make informed
decisions.
• Feature Selection - Correlation analysis can also help select the most relevant features for a particular
analysis or model. By identifying the features that are highly correlated with the outcome features, we can
focus on those features and exclude the irrelevant ones, improving the accuracy and efficiency of the
analysis or model.
• Quality Control - Correlation analysis is useful in quality control applications, where it can be used to
identify correlations between different process variables and identify potential sources of quality
problems.
Introduction
Data mining techniques are used to extract useful knowledge and insights from large datasets. A good data
mining technique should have the following characteristics -
• Scalability -
The technique should be able to handle large amounts of data efficiently.
• Robustness -
The technique should be able to handle noisy or incomplete data without compromising the quality of the
results.
• Accuracy -
The technique should produce accurate results with a low error rate.
• Interpretability -
The technique should produce results that domain experts can easily understand and interpret.
Data mining techniques are important because they enable organizations to discover hidden patterns,
relationships, and insights in their data. These insights can be used to make informed decisions, improve
business processes, and identify new opportunities. Data mining techniques are widely used in fields such
as marketing, finance, healthcare, and scientific research.
Classification
Classification is a supervised learning technique in data mining that assigns predefined classes to objects or
instances based on their attributes or features. It involves building a model from a set of training data that
consists of labeled examples, where the class label of each example is known. The model is then used to
classify new, unseen data based on their attributes.
For example, consider a bank that wants to identify customers who are likely to default on their loans. The bank
can use classification to build a model that predicts the default risk of a customer based on their credit score,
income, and other relevant factors. The model can then be used to classify new loan applicants as low or high-
risk.
Classification algorithms used in data mining include decision trees, naive Bayes, support vector machines
(SVM), and logistic regression, among others. These algorithms differ in their assumptions, strengths, and
weaknesses and are chosen based on the characteristics of the data and the problem being solved.
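A minimal sketch of the loan-default example, assuming scikit-learn is available; the feature values and labels below are invented stand-ins for credit score and income, not real bank data.

from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [credit_score, income_in_thousands]; label 1 means the customer defaulted.
X_train = [[580, 25], [700, 60], [640, 40], [750, 90], [600, 30], [720, 75]]
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)

# Classify a new, unseen loan applicant as low (0) or high (1) risk.
print(model.predict([[660, 50]])[0])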
Clustering
Clustering is an unsupervised learning technique in data mining that involves grouping similar objects or
instances together based on their attributes or features. Unlike classification, clustering does not involve
predefined classes but rather groups objects based on their similarity. The objective of clustering is to discover
inherent patterns and structures in the data that may not be immediately apparent.
For example, consider a retailer that wants to segment its customers based on their shopping behavior. The
retailer can use clustering to group customers with similar purchasing patterns, such as those who buy high-end
products or shop frequently. This information can be used to tailor marketing strategies and promotions to each
segment.
Clustering algorithms used in data mining include k-means, hierarchical clustering, and density-based
clustering, among others. These algorithms differ in their assumptions and how they define similarity or distance
between objects.
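A short sketch of customer segmentation with k-means, assuming scikit-learn is available; the two features and their values are hypothetical.

from sklearn.cluster import KMeans

# Hypothetical customers: [average_basket_value, visits_per_month].
X = [[20, 2], [22, 3], [120, 1], [115, 2], [25, 8], [30, 10]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # cluster index assigned to each customer
print(labels)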
Regression
Regression is a supervised learning technique in data mining that involves building a model to predict a
continuous or numerical output variable based on one or more input variables or predictors. Regression aims to
establish a functional relationship between the input and output variables.
For example, consider a real estate agency that wants to predict the price of a house based on its features,
such as size, location, and the number of bedrooms. The agency can use regression to build a model that
predicts the price of a house based on these features. The model can then be used to estimate the price of new
houses or to identify undervalued properties.
Regression algorithms used in data mining include linear regression, polynomial regression, and decision tree
regression, among others. These algorithms differ in their assumptions and how they model the relationship
between the input and output variables.
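A minimal sketch of the house-price example, assuming scikit-learn is available; the sizes, bedroom counts, and prices are invented.

from sklearn.linear_model import LinearRegression

# Hypothetical houses: [size_in_sqft, bedrooms] and their sale prices.
X = [[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 5]]
y = [200000, 280000, 320000, 410000, 500000]

model = LinearRegression()
model.fit(X, y)
print(model.predict([[2000, 3]])[0])  # estimated price of a new house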
Association Rules Mining
Association rule mining is an unsupervised learning technique in data mining that involves discovering
relationships or associations between variables in a dataset. It aims to find patterns of co-occurrence or
correlation among variables frequently occurring together in the data.
For example, consider a retailer that wants to increase its sales by offering promotions or discounts to
customers who buy certain products. The retailer can use association rule mining to identify which products are
often bought together, such as bread and butter or shampoo and conditioner. This information can be used to
create targeted promotions and cross-selling strategies.
Association rule mining algorithms used in data mining include Apriori, FP-Growth, and Eclat, among others.
These algorithms differ in their approach to identifying frequent itemsets or sets of variables that occur together.
Outlier Detection
Outlier detection is a data mining technique that involves identifying and analyzing data points or observations
significantly different from most of the data. Outliers are data points that deviate from the expected or normal
behavior of the data and may indicate errors, anomalies, or rare events.
For example, consider a credit card company that wants to detect fraudulent transactions. The company can
use outlier detection to identify transactions significantly different from a customer's normal spending behavior,
such as unusually large purchases or transactions made in different countries. These transactions can be
flagged for further investigation or declined to prevent fraud.
Outlier detection algorithms used in data mining include statistical methods, such as z-score and boxplot, and
machine learning methods, such as isolation forest and LOF.
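A simple hedged sketch of the z-score approach mentioned above; the transaction amounts are hypothetical.

import numpy as np

# Hypothetical transaction amounts for one customer; the last one is unusually large.
amounts = np.array([25, 30, 28, 35, 32, 27, 29, 950])

z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 2]  # flag points more than 2 standard deviations from the mean
print(outliers)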
Sequential Patterns
Sequential pattern mining is a data mining technique that involves discovering patterns or sequences of events
that frequently occur together in a dataset. It aims to identify temporal or time-dependent relationships between
variables or events.
For example, consider an e-commerce company that wants to improve its user experience by recommending
products based on the purchase behavior of its users. The company can employ sequential pattern mining to
identify which products are often purchased together in a sequence, such as a user buying accessories post-
purchase of a computer or smartphone. This information can be used to personalize recommendations and
improve user engagement.
Sequential pattern mining algorithms used in data mining include GSP (Generalized Sequential Pattern), SPADE
(Sequential PAttern Discovery using Equivalence classes), and PrefixSpan, among others.
Prediction
Prediction is a data mining technique that involves building a model to predict the value or class of a target
variable based on a set of input or predictor variables. The objective of prediction is to make accurate predictions
for new or unseen data based on the patterns and relationships discovered in the training data.
Prediction algorithms used in data mining include linear regression, decision trees, neural networks, support
vector machines, and random forests, among others. These algorithms differ in their approach to building
prediction models and are based on the data type of the variable (categorical or continuous) to be predicted.
One can choose the appropriate algorithm for the prediction model based on the characteristics of the data and
the problem being solved.
• Privacy concerns -
Data mining techniques can be used to extract sensitive information about individuals, which can raise
privacy concerns.
• Reliance on data quality -
Data mining techniques rely on the quality and accuracy of the data, and inaccurate or incomplete data
can lead to incorrect conclusions.
In data mining, pattern evaluation is the process of assessing the quality of discovered patterns. This process
is important in order to determine whether the patterns are useful and whether they can be trusted. There are
a number of different measures that can be used to evaluate patterns, and the choice of measure will depend
on the application.
There are several ways to evaluate pattern mining algorithms:
1. Accuracy
The accuracy of a data mining model is a measure of how correctly the model predicts the target values. The
accuracy is measured on a test dataset, which is separate from the training dataset that was used to train the
model. There are a number of ways to measure accuracy, but the most common is to calculate the
percentage of correct predictions. This is known as the accuracy rate.
Other measures of accuracy include the root mean squared error (RMSE) and the mean absolute error (MAE).
The RMSE is the square root of the mean squared error, and the MAE is the mean of the absolute errors. The
accuracy of a data mining model is important, but it is not the only thing that should be considered. The model
should also be robust and generalizable.
A model that is 100% accurate on the training data but only 50% accurate on the test data is not a good
model. The model is overfitting the training data and is not generalizable to new data. A model that is 80%
accurate on the training data and 80% accurate on the test data is a good model. The model is generalizable
and can be used to make predictions on new data.
2. Classification Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to classify new data. This
is typically done by taking a set of data that has been labeled with known class labels and then using the
discovered patterns to predict the class labels of the data. The accuracy can then be computed by comparing
the predicted labels to the actual labels.
Classification accuracy is one of the most popular evaluation metrics for classification models, and it is
simply the percentage of correct predictions made by the model. Although it is a straightforward and easy-to-
understand metric, classification accuracy can be misleading in certain situations. For example, if we have a
dataset with a very imbalanced class distribution, such as 100 instances of class 0 and 1,000 instances of
class 1, then a model that always predicts class 1 will achieve a high classification accuracy of 90%.
However, this model is clearly not very useful, since it is not making any correct predictions for class 0.
There are a few different ways to evaluate classification models, such as precision and recall, which are more
informative in imbalanced datasets. Precision is the percentage of correct predictions made by the model for
a particular class, and recall is the percentage of instances of a particular class that were correctly predicted
by the model. In the above example, if we looked at precision and recall for class 0, we would see that the
model has a precision of 0% and a recall of 0%.
Another way to evaluate classification models is to use a confusion matrix. A confusion matrix is a table that
shows the number of correct and incorrect predictions made by the model for each class. This can be a
helpful way to visualize the performance of a model and to identify where it is making mistakes. For example,
in the above example, the confusion matrix would show that the model is making all predictions for class 1
and no predictions for class 0.
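The sketch below reproduces this kind of imbalanced situation using scikit-learn's metric functions (a recent scikit-learn is assumed); the labels are invented so that the model never predicts class 0.

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Hypothetical imbalanced labels: the model always predicts class 1.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))                                             # 0.7, looks decent
print("Precision (class 0):", precision_score(y_true, y_pred, pos_label=0, zero_division=0))   # 0.0
print("Recall (class 0):", recall_score(y_true, y_pred, pos_label=0, zero_division=0))         # 0.0
print(confusion_matrix(y_true, y_pred))  # rows = actual classes, columns = predicted classes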
Overall, classification accuracy is a good metric to use when evaluating classification models. However, it is
important to be aware of its limitations and to use other evaluation metrics in situations where classification
accuracy could be misleading.
3. Clustering Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to cluster new data. This
is typically done by taking a set of data that has been labeled with known cluster labels and then using the
discovered patterns to predict the cluster labels of the data. The accuracy can then be computed by
comparing the predicted labels to the actual labels.
There are a few ways to evaluate the accuracy of a clustering algorithm, as listed below (a short code sketch follows the list):
• External indices: these indices compare the clusters produced by the algorithm to some known
ground truth. For example, the Rand Index or the Jaccard coefficient can be used if the ground truth
is known.
• Internal indices: these indices assess the goodness of clustering without reference to any external
information. The most popular internal index is the Dunn index.
• Stability: this measures how robust the clustering is to small changes in the data. A clustering
algorithm is said to be stable if, when applied to different samples of the same data, it produces the
same results.
• Efficiency: this measures how quickly the algorithm converges to the correct clustering.
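As a hedged illustration, the sketch below computes one external index (the adjusted Rand index) and one internal index with scikit-learn; scikit-learn does not ship a Dunn index, so the silhouette coefficient is used here as the internal measure instead, and the points and labels are hypothetical.

from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical 2-D points, their ground-truth labels, and labels produced by some clustering algorithm.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
true_labels = [0, 0, 0, 1, 1, 1]
predicted_labels = [1, 1, 1, 0, 0, 0]  # same grouping, different label names

print("Adjusted Rand index:", adjusted_rand_score(true_labels, predicted_labels))  # external index, 1.0 here
print("Silhouette score:", silhouette_score(X, predicted_labels))                  # internal index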
4. Coverage
This measures how many of the possible patterns in the data are discovered by the algorithm. This can be
computed by dividing the number of patterns discovered by the algorithm by the total number of possible
patterns. A Coverage Pattern is a type of sequential pattern that is found by looking for items that tend
to appear together in sequential order. For example, a coverage pattern might be “customers who purchase
item A also tend to purchase item B within the next month.”
To evaluate a coverage pattern, analysts typically look at two things: support and confidence. Support is the
percentage of transactions that contain the pattern. Confidence is the number of transactions that contain
the pattern divided by the number of transactions that contain the first item in the pattern.
For example, consider the following coverage pattern: “customers who purchase item A also tend to
purchase item B within the next month.” If the support for this pattern is 0.1%, that means that 0.1% of all
transactions contain the pattern. If the confidence for this pattern is 80%, that means that 80% of the
transactions that contain item A also contain item B.
Generally, a higher support and confidence value indicates a stronger pattern. However, analysts must be
careful to avoid overfitting, which is when a pattern is found that is too specific to the data and would not be
generalizable to other data sets.
5. Visual Inspection
This is perhaps the most common method, where the data miner simply looks at the patterns to see if they
make sense. In visual inspection, the data is plotted in a graphical format and the pattern is observed. This
method is used when the data is not too large and can be easily plotted. It is also used when the data is
categorical in nature. Visual inspection is a pattern evaluation method in data mining where the data is visually
inspected for patterns. This can be done by looking at a graph or plot of the data, or by looking at the raw data
itself. This method is often used to find outliers or unusual patterns.
6. Running Time
This measures how long it takes for the algorithm to find the patterns in the data. This is typically measured in
seconds or minutes. There are a few different ways to measure the performance of a machine learning
algorithm, but one of the most common is to simply measure the amount of time it takes to train the model
and make predictions. This is known as the running time pattern evaluation.
There are a few different things to keep in mind when measuring the running time of an algorithm. First, you
need to take into account the time it takes to load the data into memory. Second, you need to account for the
time it takes to pre-process the data if any. Finally, you need to account for the time it takes to train the model
and make predictions.
In general, the running time of an algorithm will increase as the amount of data increases. This is because the
algorithm has to process more data in order to learn from it. However, there are some algorithms that are
more efficient than others and can scale to large datasets better. When comparing different algorithms, it is
important to keep in mind the specific dataset that is being used. Some algorithms may be better suited for
certain types of data than others. In addition, the running time can also be affected by the hardware that is
being used.
7. Support
The support of a pattern is the percentage of the total number of records that contain the pattern. Support
Pattern evaluation is a process of finding interesting and potentially useful patterns in data. The purpose of
support pattern evaluation is to identify interesting patterns that may be useful for decision -making. Support
pattern evaluation is typically used in data mining and machine learning applications.
There are a variety of ways to evaluate support patterns. One common approach is to use a support metric,
which measures the number of times a pattern occurs in a dataset. Another common approach is to use a lift
metric, which measures the ratio of the occurrence of a pattern to the expected occurrence of the pattern.
Support pattern evaluation can be used to find a variety of interesting patterns in data, including association
rules, sequential patterns, and co-occurrence patterns. Support pattern evaluation is an important part of
data mining and machine learning, and can be used to help make better decisions.
8. Confidence
The confidence of a pattern is the percentage of times that the pattern is found to be correct. Confidence
Pattern evaluation is a method of data mining that is used to assess the quality of patterns found in data. This
evaluation is typically performed by calculating the percentage of times a pattern is found in a data set and
comparing this percentage to the percentage of times the pattern is expected to be found based on the overall
distribution of data. If the percentage of times a pattern is found is significantly higher than the expected
percentage, then the pattern is said to be a strong confidence pattern.
9. Lift
The lift of a pattern is the ratio of the number of times that the pattern is found to be correct to the number of
times that the pattern is expected to be correct. Lift Pattern evaluation is a data mining technique that can be
used to evaluate the performance of a predictive model. The lift pattern is a graphical representation of the
model’s performance and can be used to identify potential problems with the model.
The lift pattern is a plot of the true positive rate (TPR) against the false positive rate (FPR). The TPR is the
percentage of positive instances that are correctly classified by the model, while the FPR is the percentage of
negative instances that are incorrectly classified as positive. Ideally, the TPR would be 100% and the FPR
would be 0%, but this is rarely the case in practice. The lift pattern can be used to evaluate how close the
model is to this ideal.
A good model will have a curve that lies well above the diagonal line, meaning the TPR stays much higher than
the FPR across thresholds, so the model correctly classifies most positive instances while mislabeling few negatives. A
model whose curve stays close to the diagonal line performs little better than random guessing. Poor performance can be caused by a
number of factors, including imbalanced data, poor feature selection, or overfitting.
The lift pattern can be a useful tool for identifying potential problems with a predictive model. It is important to
remember, however, that the lift pattern is only a graphical representation of the model’s performance, and
should be interpreted in conjunction with other evaluation measures.
10. Prediction
The prediction of a pattern is the percentage of times that the pattern is found to be correct. Prediction Pattern
evaluation is a data mining technique used to assess the accuracy of predictive models. It is used to
determine how well a model can predict future outcomes based on past data. Prediction Pattern evaluation
can be used to compare different models, or to evaluate the performance of a single model.
Prediction Pattern evaluation involves splitting the data set into two parts: a training set and a test set. The
training set is used to train the model, while the test set is used to assess the accuracy of the model. To
evaluate the accuracy of the model, the prediction error is calculated. Prediction Pattern evaluation can be
used to improve the accuracy of predictive models. By using a test set, predictive models can be fine-tuned to
better fit the data. This can be done by changing the model parameters or by adding new features to the data
set.
11. Precision
Precision Pattern Evaluation is a method for analyzing data that has been collected from a variety of sources.
This method can be used to identify patterns and trends in the data, and to evaluate the accuracy of data.
Precision Pattern Evaluation can be used to identify errors in the data, and to determine the cause of the
errors. This method can also be used to determine the impact of the errors on the overall accuracy of the data.
Precision Pattern Evaluation is a valuable tool for data mining and data analysis. This method can be used to
improve the accuracy of data, and to identify patterns and trends in the data.
12. Cross-Validation
This method involves partitioning the data into two sets, training the model on one set, and then testing it on
the other. This can be done multiple times, with different partitions, to get a more reliable estimate of the
model’s performance. Cross-validation is a model validation technique for assessing how the results of a data
mining analysis will generalize to an independent data set. It is mainly used in settings where the goal is
prediction, and one wants to estimate how accurately a predictive model will perform in practice. Cross-validation
is also referred to as out-of-sample testing.
Cross-validation is a pattern evaluation method that is used to assess the accuracy of a model. It does this by
splitting the data into a training set and a test set. The model is then fit on the training set and the accuracy is
measured on the test set. This process is then repeated a number of times, with the accuracy being averaged
over all the iterations.
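A minimal sketch of k-fold cross-validation, assuming scikit-learn is available; the Iris dataset and decision tree are just convenient stand-ins.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: fit on four folds, score on the held-out fold, repeat, then average.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())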
13. Test Set
This method involves partitioning the data into two sets, training the model on the training set, and then
testing it on the held-out test set. This is more reliable than cross-validation but can be more expensive if the
data set is large. There are a number of ways to evaluate the performance of a model on a test set. The most
common is to simply compare the predicted labels to the true labels and compute the percentage of
instances that are correctly classified. This is called accuracy. Another popular metric is precision, which is
the number of true positives divided by the sum of true positives and false positives. The recall is the number
of true positives divided by the sum of true positives and false negatives. These metrics can be combined into
the F1 score, which is the harmonic mean of precision and recall.
14. Bootstrapping
This method involves randomly sampling the data with replacement, training the model on the sampled data,
and then testing it on the original data. This can be used to get a distribution of the model’s performance,
which can be useful for understanding how robust the model is. Bootstrapping is a resampling technique used
to estimate the accuracy of a model. It involves randomly selecting a sample of data from the original dataset
and then training the model on this sample. The model is then tested on another sample of data that is not
used in training. This process is repeated a number of times, and the average accuracy of the model is
calculated.
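A rough sketch of bootstrap evaluation, assuming scikit-learn and NumPy are available. Here the model is tested on the rows left out of each bootstrap sample (the out-of-bag rows), a common variant of the idea described above.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(seed=0)
n = len(y)
accuracies = []

for _ in range(100):
    boot_idx = rng.integers(0, n, size=n)           # sample row indices with replacement
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)  # rows never drawn act as a test set
    model = DecisionTreeClassifier(random_state=0).fit(X[boot_idx], y[boot_idx])
    accuracies.append(model.score(X[oob_idx], y[oob_idx]))

print("Average bootstrap accuracy:", np.mean(accuracies))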
Apriori Property
The Apriori property is a fundamental property of frequent itemsets used in the Apriori algorithm. It states that all non-empty subsets of a frequent itemset must also be frequent. In other words,
if an itemset appears frequently enough in the dataset to be considered significant, then all of its subsets must
also appear frequently enough to be significant. For example, if the itemset {A, B, C} frequently appears in a
dataset, then the subsets {A, B}, {A, C}, {B, C}, {A}, {B}, and {C} must also appear frequently in the dataset.
The Apriori property allows the Apriori algorithm in data mining to efficiently search for frequent itemsets by
eliminating candidate itemsets containing infrequent subsets, as they cannot be frequent. This search space
pruning reduces the time and memory required to find frequent itemsets in large datasets.
Before getting into the steps involved in the Apriori algorithm, let’s understand the various terminologies used in
the Apriori algorithm.
Support
In the Apriori algorithm, support refers to the frequency or occurrence of an item set in a dataset. It is defined as
the proportion of transactions in the dataset that contain the itemset. For example, let's consider a dataset of
sales transactions in a retail store that contains the following items - milk, bread, cheese, eggs, butter, and
yogurt. To calculate the support of an itemset, we count the number of transactions in which the itemset
appears and divide it by the total number of transactions in the dataset. For instance, if the itemset {milk, bread}
appears in 5 transactions out of 10 transactions in the dataset, then its support is 5/10 = 0.5, or 50%.
In the Apriori algorithm, itemsets with a support value above the minimum defined support threshold are
considered frequent and are used to generate candidate itemsets for the next iteration of the algorithm.
Lift
Lift measures the strength of the association between two items. It is defined as the ratio of the support of the
two items occurring together to the support of the individual items multiplied together. Lift for any two items can
be calculated using the below formula -
Lift(A→B) = Support(A and B) / (Support(A) × Support(B))
If the lift value is greater than 1, then it indicates a positive association between the two items, which means that
the two items are more likely to be bought together. A lift value of exactly 1 indicates that the two items are
independent and there is no association between the two items, while a value less than 1 indicates a negative
association, meaning that two items are more likely to be bought separately.
Confidence
In the Apriori algorithm, confidence is also a measure of the strength of the association between two items in an
itemset. It is defined as the conditional probability that item B appears in a transaction, given that another item A
appears in the same transaction. Confidence for two items can be calculated using the below formula -
Confidence(A→B) = Support(A and B) / Support(A)
If the confidence value exceeds a specified threshold, it indicates that item B is likely to be purchased with item
A. For instance, if the confidence of the association between "bread" and "butter" is 0.8, it means that when a
customer buys "bread", there is an 80% chance that they will also buy "butter". This can be useful in
recommending to customers or optimizing product placement in a store.
Here are the steps involved in implementing the Apriori algorithm in data mining -
1. Define minimum support threshold - This is the minimum number of times an item set must appear in
the dataset to be considered as frequent. The support threshold is usually set by the user based on the
size of the dataset and the domain knowledge.
2. Generate a list of frequent 1-item sets - Scan the entire dataset to identify the items that meet the
minimum support threshold. These item sets are known as frequent 1-item sets.
3. Generate candidate item sets - In this step, the algorithm generates a list of candidate item sets of
length k+1 from the frequent k-item sets identified in the previous step.
4. Count the support of each candidate item set - Scan the dataset again to count the number of times
each candidate item set appears in the dataset.
5. Prune the candidate item sets - Remove the item sets that do not meet the minimum support threshold.
6. Repeat steps 3-5 until no more frequent item sets can be generated.
7. Generate association rules - Once the frequent item sets have been identified, the algorithm generates
association rules from them. Association rules are rules of the form A -> B, where A and B are item sets. The
rule indicates that if a transaction contains A, it is also likely to contain B.
8. Evaluate the association rules - Finally, the association rules are evaluated based on metrics such as
confidence and lift. A compact code sketch of these steps is shown below.
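The following is a self-contained Python sketch of the steps above; the transaction data, thresholds, and helper functions are hypothetical and kept deliberately small.

from itertools import combinations

transactions = [
    {"milk", "bread"}, {"bread", "sugar"}, {"bread", "butter"},
    {"milk", "sugar"}, {"milk", "bread", "butter"}, {"milk", "bread", "sugar"},
    {"milk", "butter"}, {"milk", "bread", "butter"},
]
min_support_count = 3   # step 1: minimum support threshold

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

# Step 2: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_support_count}]

# Steps 3-6: generate, count, and prune candidates of increasing size.
k = 1
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Apriori property: keep a candidate only if all of its k-subsets are frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    frequent.append({c for c in candidates
                     if support_count(c, transactions) >= min_support_count})
    k += 1

all_frequent = [s for level in frequent for s in level]
print("Frequent itemsets:", [set(s) for s in all_frequent])

# Steps 7-8: association rules A -> B with confidence above a threshold.
min_confidence = 0.7
for itemset in all_frequent:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support_count(itemset, transactions) / support_count(antecedent, transactions)
            if conf >= min_confidence:
                print(f"{set(antecedent)} -> {set(consequent)} (confidence = {conf:.2f})")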
Let’s try to understand the Apriori algorithm implementation using an example. In this example, we will use a
minimum support threshold of 3. This means an item set must appear in at least three transactions to be
considered frequent.
• Let’s consider the transaction dataset of a retail store as shown in the below table.
TID Items
T1 {milk, bread}
T2 {bread, sugar}
T3 {bread, butter}
T7 {milk, sugar}
T8 {milk, sugar}
T9 {sugar, butter}
• Let’s calculate support for each item present in the dataset. As shown in the below table, support for all
items is greater than 3. It means that all items are considered as frequent 1-itemsets and will be used to
generate candidates for 2-itemsets.
Item Support Count
milk 8
bread 7
sugar 5
butter 7
• Below table represents all candidates generated from frequent 1-itemsets identified from the previous
step and their support value.
Candidate 2-itemset Support Count
{milk, bread} 5
{milk, sugar} 3
{milk, butter} 5
{bread, sugar} 2
{bread, butter} 3
{sugar, butter} 2
• Now remove candidate item sets that do not meet the minimum support threshold of 3. After this step,
frequent 2-itemsets would be - {milk, bread}, {milk, sugar}, {milk, butter}, and {bread, butter}. In the next
step, let’s generate candidates for 3-itemsets and calculate their respective support values. It is shown
in the below table.
• Based on association rules mentioned in the above table, we can recommend products to the customer
or optimize product placement in retail stores.
Here are some of the advantages of the Apriori algorithm in data mining -
• Apriori algorithm is simple and easy to implement, making it accessible even to those without a deep
understanding of data mining or machine learning.
• Apriori algorithm can handle large datasets and run on distributed systems, making it scalable for large-
scale applications.
• Apriori algorithm is one of the most widely used algorithms for association rule mining and is supported
by many popular data mining tools.
Below are some of the limitations of the Apriori algorithm in data mining -
• Apriori algorithm can be computationally expensive, especially for large datasets with many itemsets. For
example, if a dataset contains 10^4 frequent 1-itemsets, it will generate more than 10^7 candidate 2-itemsets,
which makes this algorithm computationally expensive.
• Apriori algorithm can generate a large number of rules, making it difficult to sift through and identify the
most important ones.
• The algorithm requires multiple database scans to generate frequent itemsets, which can be a limitation
in systems where data access is slow or expensive.
• Apriori algorithm is sensitive to data sparsity, meaning it may not perform well on datasets with a low
frequency of itemsets.
The FP Growth algorithm is a popular method for frequent pattern mining in data mining. It works by
constructing a frequent pattern tree (FP-tree) from the input dataset. The FP-tree is a compressed
representation of the dataset that captures the frequency and association information of the items in the data.
The algorithm first scans the dataset and maps each transaction to a path in the tree. Items are ordered in each
transaction based on their frequency, with the most frequent items appearing first. Once the FP tree is
constructed, frequent itemsets can be generated by recursively mining the tree. This is done by starting at the
bottom of the tree and working upwards, finding all combinations of itemsets that satisfy the minimum support
threshold.
The FP Growth algorithm in data mining has several advantages over other frequent pattern mining algorithms,
such as Apriori. The Apriori algorithm is not suitable for handling large datasets because it generates a large
number of candidates and requires multiple scans of the database to mine frequent itemsets. In comparison, the FP
Growth algorithm requires only a single scan of the data and a small amount of memory to construct the FP tree.
It can also be parallelized to improve performance.
The working of the FP Growth algorithm in data mining can be summarized in two broad steps - constructing the FP-tree and then recursively mining frequent itemsets from it, as described below:
FP Tree
The FP-tree (Frequent Pattern tree) is a data structure used in the FP Growth algorithm for frequent pattern
mining. It represents the frequent itemsets in the input dataset compactly and efficiently. The FP tree consists of
the following components:
• Root Node:
The root node of the FP-tree represents an empty set. It has no associated item but a pointer to the first
node of each item in the tree.
• Item Node:
Each item node in the FP-tree represents a unique item in the dataset. It stores the item name and the
frequency count of the item in the dataset.
• Header Table:
The header table lists all the unique items in the dataset, along with their frequency count. It is used to
track each item's location in the FP tree.
• Child Node:
Each child node of an item node represents an item that co-occurs with the item the parent node
represents in at least one transaction in the dataset.
• Node Link:
The node-link is a pointer that connects each item in the header table to the first node of that item in the
FP-tree. It is used to traverse the conditional pattern base of each item during the mining process.
The FP tree is constructed by scanning the input dataset and inserting each transaction into the tree one at a
time. For each transaction, the items are sorted in descending order of frequency count and then added to the
tree in that order. If an item already exists along the current path of the tree, its frequency count is incremented;
if it does not, a new node is created for that item, and a new path is
added to the tree. We will understand in detail how FP-tree is constructed in the next section.
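Before the worked example, here is a rough, self-contained sketch of the construction step only, using the transactions from the example that follows (mining the tree is omitted); the FPNode class and header table layout are illustrative choices, not a reference implementation.

from collections import Counter

transactions = [
    ["M", "N", "O", "E", "K", "Y"], ["D", "O", "E", "N", "Y", "K"],
    ["K", "A", "M", "E"], ["M", "C", "U", "Y", "K"], ["C", "O", "O", "K", "I", "E"],
]
min_support = 3

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

# First scan: count item frequencies (once per transaction) and keep only frequent items.
counts = Counter(item for t in transactions for item in set(t))
frequent = {item: c for item, c in counts.items() if c >= min_support}

root = FPNode(None, None)
header_table = {}  # item -> list of nodes holding that item (node links)

# Second scan: insert each transaction, items ordered by descending frequency.
for t in transactions:
    ordered = sorted((i for i in set(t) if i in frequent), key=lambda i: (-frequent[i], i))
    node = root
    for item in ordered:
        if item in node.children:
            node.children[item].count += 1            # shared prefix: increment count
        else:
            node.children[item] = FPNode(item, node)  # new branch
            header_table.setdefault(item, []).append(node.children[item])
        node = node.children[item]

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

show(root)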
Algorithm by Han
Let’s understand with an example how the FP Growth algorithm in data mining can be used to mine frequent
itemsets. Suppose we have a dataset of transactions as shown below:
Transaction ID Items
T1 {M, N, O, E, K, Y}
T2 {D, O, E, N, Y, K}
T3 {K, A, M, E}
T4 {M, C, U, Y, K}
T5 {C, O, K, O, E, I}
Let’s scan the above database and compute the frequency of each item as shown in the below table.
Item Frequency
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3
Let’s consider minimum support as 3. After removing all the items below minimum support in the above table,
we would remain with these items - {K: 5, E: 4, M : 3, O : 3, Y : 3}. Let’s re-order the transaction database based
on the items above minimum support. In this step, in each transaction, we will remove infrequent items and re-
order them in the descending order of their frequency, as shown in the table below.
Transaction ID Original Items Ordered Frequent Items
T1 {M, N, O, E, K, Y} {K, E, M, O, Y}
T2 {D, O, E, N, Y, K} {K, E, O, Y}
T3 {K, A, M, E} {K, E, M}
T4 {M, C, U, Y, K} {K, M, Y}
T5 {C, O, K, O, E, I} {K, E, O}
Now we will use the ordered itemsets in each transaction to build the FP tree. Each transaction is inserted into
the tree one at a time, with counts incremented along shared prefixes. From the completed FP tree, a conditional
pattern base - the set of prefix paths that lead to an item - is extracted for each frequent item, for example -
Item Conditional Pattern Base
M {K, E : 2}, {K : 1}
E {K : 4}
Now for each item, we will build a conditional frequent pattern tree. It is computed by identifying the set of
elements common in all the paths in the conditional pattern base of a given frequent item and computing its
support count by summing the support counts of all the paths in the conditional pattern base. The conditional
frequent pattern tree will look as shown in the below table -
Item Conditional Pattern Base Conditional FP-tree
E {K : 4} {K : 4}
From the above conditional FP tree, we will generate the frequent itemsets as shown in the below table:
Item Frequent Itemsets Generated
Y {K, Y : 3}
M {K, M : 3}
E {K, E : 4}
Here's a tabular comparison between the FP Growth algorithm and the Apriori algorithm:
Factor FP Growth Algorithm Apriori Algorithm
Candidate generation Does not generate candidate itemsets explicitly Generates a large number of candidate itemsets
Database scans Requires only two scans of the dataset Requires multiple scans of the dataset
Data structure Uses a compact FP-tree Uses flat lists of candidate itemsets
Speed and memory Generally faster and more memory-efficient on large datasets Can be computationally expensive on large datasets
The FP Growth algorithm in data mining has several advantages over other frequent itemset mining algorithms,
as mentioned below:
• Efficiency:
FP Growth algorithm is faster and more memory-efficient than other frequent itemset mining algorithms
such as Apriori, especially on large datasets with high dimensionality. This is because it generates
frequent itemsets by constructing the FP-Tree, which compresses the database and requires only two
scans.
• Scalability:
FP Growth algorithm scales well with increasing database size and itemset dimensionality, making it
suitable for mining frequent itemsets in large datasets.
• Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than other frequent itemset mining algorithms,
as it generates only frequent itemsets and ignores infrequent itemsets that may be caused by noise.
• Parallelization:
FP Growth algorithm can be easily parallelized, making it suitable for distributed computing environments
and allowing it to take advantage of multi-core processors.
Disadvantages of FP Growth Algorithm
While the FP Growth algorithm in data mining has several advantages, it also has some limitations and
disadvantages, as mentioned below:
• Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other frequent itemset mining
algorithms, storing the FP-Tree and the conditional pattern bases can still require a significant amount of
memory, especially for large datasets.
• Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset mining algorithms, making it more
difficult to understand and implement.
A data mining technique that is used to uncover purchase patterns in any retail setting is known as Market
Basket Analysis. In simple terms, market basket analysis in data mining analyzes the combinations of products
that are bought together.
This technique is based on a careful study of the purchases made by customers in a supermarket. It identifies
the patterns of items that customers frequently purchase together. This analysis helps companies promote
deals, offers, and sales, and data mining techniques help to achieve this analysis task. Example:
• Data mining concepts are used in sales and marketing to provide better customer service, improve
cross-selling opportunities, and increase direct mail response rates.
• Customer retention, in the form of pattern identification and prediction of likely defections, is made
possible by data mining.
• Risk assessment and fraud detection also use data mining concepts to identify inappropriate or
unusual behavior.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
• IF means Antecedent: An antecedent is an item found within the data
• THEN means Consequent: A consequent is an item found in combination with the antecedent.
Let’s see how the ASSOCIATION RULE {IF} -> {THEN} is used in Market Basket Analysis in Data Mining. For
example, a customer buying a domain will likely also need extra plugins/extensions to make it easier for
the users.
As noted above, the antecedent is the itemset found in the data; it forms the {IF} component of the rule, which
in this example is the domain.
Likewise, the consequent is the item found in combination with the antecedent; it forms the {THEN}
component, which in this example is the extra plugins/extensions.
With the help of these rules, we are able to predict customer behavioral patterns. From this, we can build
product combinations and offers that customers are likely to buy, which automatically increases the sales and
revenue of the company.
With the help of the Apriori Algorithm, we can further classify and simplify the item sets which are frequently
bought by the consumer.
There are three components in APRIORI ALGORITHM:
• SUPPORT
• CONFIDENCE
• LIFT
Now take an example: suppose 5000 transactions have been made through a popular eCommerce website, and
we want to calculate the support, confidence, and lift for two products, say a pen and a notebook. Out of the
5000 transactions, 500 transactions are for the pen, 700 transactions are for the notebook, and 1000
transactions are for both.
SUPPORT: It is calculated as the number of transactions containing the item divided by the total number of
transactions made.
Lift -> 20/10 = 2
A lift value below 1 means the combination is bought together less often than chance would suggest. In this case, the lift of 2 shows that the probability of buying both items together is high compared with what the sales of the individual items alone would suggest.
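To make the arithmetic concrete, here is a minimal Python sketch that reproduces the support, confidence, and lift figures from the pen-and-notebook example above (the counts themselves are illustrative, not real data):

# Minimal sketch of the support / confidence / lift arithmetic used above.
total_transactions = 5000
pen_count = 500            # transactions containing a pen
notebook_count = 500       # transactions containing a notebook
both_count = 100           # transactions containing both

support_both = both_count / total_transactions          # 0.02 -> 2%
support_notebook = notebook_count / total_transactions  # 0.10 -> 10%
confidence = both_count / pen_count                      # 0.20 -> 20%
lift = confidence / support_notebook                     # 2.0

print(f"support={support_both:.2%}, confidence={confidence:.2%}, lift={lift:.1f}")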
With this, we have an overall view of market basket analysis in data mining and of how to evaluate the sales potential of product combinations.
There are three types of market basket analysis. They are as follows:
1. Descriptive market basket analysis: This sort of analysis looks for patterns and connections in the data that exist between the components of a market basket. It is mostly used to understand consumer behavior, including which products are purchased in combination and what the most typical item combinations are. Descriptive market basket analysis helps retailers place products in their stores more profitably by revealing which products are frequently bought together.
2. Predictive market basket analysis: Market basket analysis that predicts future purchases based on past purchasing patterns is known as predictive market basket analysis. In this sort of analysis, large volumes of data are analyzed using machine learning algorithms to create predictions about which products are most likely to be bought together in the future. Predictive market basket analysis helps retailers make data-driven decisions about which products to carry, how to price them, and how to optimize shop layouts.
3. Differential market basket analysis: Differential market basket analysis compares two sets of market basket data to identify variations between them. It is commonly used to compare the behavior of different customer segments or the behavior of customers over time. Differential market basket analysis helps retailers respond to shifting consumer behavior by adjusting their marketing and sales tactics.
Benefits of Market Basket Analysis:
1. Enhanced Customer Understanding: Market basket analysis offers insights into customer behavior, including which products they buy together and which products they buy most frequently. Retailers can use this information to better understand their customers and make informed decisions.
2. Improved Inventory Management: By examining market basket data, retailers can determine
which products are sluggish sellers and which ones are commonly bought together. Retailers can
use this information to make well-informed choices about what products to stock and how to
manage their inventory most effectively.
3. Better Pricing Strategies: A better understanding of the connection between product prices and
consumer behavior might help merchants develop better pricing strategies. Using this knowledge,
pricing plans that boost sales and profitability can be created.
4. Sales Growth: Market basket analysis can assist businesses in determining which products are
most frequently bought together and where they should be positioned in the store to grow sales.
Retailers may boost revenue and enhance customer shopping experiences by improving store
layouts and product positioning.
Applications of Market Basket Analysis:
1. Retail: Market basket analysis is frequently used in the retail sector to examine consumer buying patterns and to inform decisions about product placement, inventory management, and pricing tactics. Retailers can use it to identify which items are sluggish sellers and which ones are commonly bought together, and then adjust their inventory management strategy accordingly.
2. E-commerce: Market basket analysis can help online merchants better understand customers' buying habits and make data-driven decisions about product recommendations and targeted advertising campaigns. The behaviour of visitors to a website can also be examined using market basket analysis to pinpoint problem areas.
3. Finance: Market basket analysis can be used to evaluate investor behaviour and forecast the types
of investment items that investors will likely buy in the future. The performance of investment
portfolios can be enhanced by using this information to create tailored investment strategies.
4. Telecommunications: To evaluate consumer behaviour and make data-driven decisions about which goods and services to offer, telecommunications businesses might employ market basket analysis. Using this data can enhance customer satisfaction and the shopping experience.
5. Manufacturing: To evaluate consumer behaviour and make data-driven decisions about which
products to produce and which materials to employ in the production process, the manufacturing
sector might use market basket analysis. Utilizing this knowledge will increase effectiveness and
cut costs.
Multilevel Association Rule:
Association rules generated by mining data at different levels of abstraction are called multilevel or multiple-level association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. Rules at a high concept level may only express common knowledge, while rules at a low concept level may not always be useful.
Using uniform minimum support for all levels:
• When a uniform minimum support threshold is used, the search procedure is simplified.
• The method is also simple, in that users are required to specify only a single minimum support threshold.
• The same minimum support threshold is used when mining at each level of abstraction (for example, when mining from "computer" down to "laptop computer"). Here both "computer" and "laptop computer" may be found to be frequent, while "desktop computer" is not.
Need for multilevel association rules:
• Sometimes, at a low level of abstraction, the data does not show any significant pattern, yet useful information may be hiding behind it.
• The aim is to find this hidden information within and between levels of abstraction.
Approaches to multilevel association rule mining:
1. Uniform support (using uniform minimum support for all levels)
2. Reduced support (using reduced minimum support at lower levels)
3. Group-based support (using item- or group-based support)
Let's discuss them one by one; a short code sketch of the reduced-support idea follows the list.
1. Uniform Support –
When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only a single minimum support threshold. An optimization can be adopted based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, many meaningful associations occurring at low abstraction levels could be missed. This provides the motivation for the following approach.
2. Reduced Support –
For mining multilevel associations with reduced support, there are several alternative search strategies, as follows.
• Level-by-level independent –
This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether its parent node is found to be frequent.
• Level-cross filtering by single item –
An item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent. In other words, we investigate a more specific association starting from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.
• Level-cross filtering by k-itemset –
A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i-1)-th level is frequent.
3. Group-based support –
Group-wise threshold values for support and confidence are provided by the user or a domain expert. Groups are selected based on, for example, product price or item category, because experts often have insight into which groups are more important than others.
Example –
Experts may be particularly interested in the purchase patterns of laptops (an electronic category) or clothes (a non-electronic category). A low support threshold is therefore set for such a group, to give attention to these items' purchase patterns.
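As a rough sketch of the reduced-support idea, the snippet below counts single-item support at the two levels of a made-up concept hierarchy and applies a lower threshold at the more specific level. The hierarchy, transactions, and thresholds are invented for illustration:

# Sketch: per-level minimum support over a toy two-level concept hierarchy.
from collections import Counter

parent = {"laptop": "computer", "desktop": "computer",
          "inkjet": "printer", "laser": "printer"}   # level-2 item -> level-1 category

transactions = [
    {"laptop", "inkjet"},
    {"laptop"},
    {"desktop", "laser"},
    {"laptop", "laser"},
]
n = len(transactions)

# Reduced support: the lower (more specific) level gets a lower threshold.
min_support = {1: 0.75, 2: 0.5}   # a uniform approach would use 0.75 at both levels

level2_counts = Counter(item for t in transactions for item in t)
level1_counts = Counter()
for t in transactions:
    for category in {parent[item] for item in t}:   # count each category once per transaction
        level1_counts[category] += 1

for level, counts in ((1, level1_counts), (2, level2_counts)):
    frequent = sorted(i for i, c in counts.items() if c / n >= min_support[level])
    print(f"level {level} frequent items: {frequent}")

With a uniform threshold of 0.75, only "laptop" would survive at level 2; the reduced threshold also keeps "laser", illustrating why lower levels often need smaller minimum support.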
Classification Using Frequent Patterns in Data Mining
A data mining approach called frequent pattern mining is used to find recurring patterns in a dataset. It is a
kind of unsupervised machine-learning technique that looks for and identifies patterns in data using
algorithms. This method can be applied to find products that are frequently purchased together or to find
products that are more likely to be purchased by particular demographic groups. Numerous applications of
this method include client segmentation, fraud detection, and marketing analysis. Frequent pattern mining can
be utilized in classification tasks to identify the patterns that are most likely related to a particular class.
Frequent patterns refer to item sets, subsequences, or substructures that appear frequently in a data set.
It works by scanning a data collection for common patterns, or itemsets, and then using those patterns to categorize previously unseen data items, such as new customer purchases or new customer behaviours. This categorization may be used for a number of purposes, including forecasting customer churn and detecting fraudulent activity.
1. Apriori Algorithm:
The Apriori algorithm is an algorithm for finding frequent item sets in a given dataset. It is an unsupervised
learning technique that employs a “bottom-up” strategy to discover frequent itemsets in a dataset by first
recognizing individual items in the dataset and then looking for combinations of items that appear often
together. The Apriori technique may be used to identify the rules that govern the relationships between various
objects in a collection. It is frequently used in market basket analysis, which seeks to find goods that are
frequently purchased together.
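A minimal, illustrative Apriori sketch is shown below. It uses a toy basket list and omits the usual subset-based candidate pruning for brevity, so it is a simplified version of the full algorithm rather than a production implementation:

# Minimal Apriori sketch: generate candidate itemsets of increasing size and
# keep those meeting a minimum support threshold. Toy data for illustration.
def apriori(transactions, min_support):
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    k_sets = {s for s in items
              if sum(s <= t for t in transactions) / n >= min_support}
    frequent, k = [], 1
    while k_sets:
        frequent.extend(k_sets)
        # join step: candidates of size k+1 built from frequent k-itemsets
        candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == k + 1}
        # prune step: keep candidates whose support meets the threshold
        k_sets = {c for c in candidates
                  if sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent

baskets = [{"bread", "milk", "eggs"}, {"bread", "milk"}, {"milk", "eggs"}, {"bread", "eggs"}]
for itemset in apriori(baskets, min_support=0.5):
    print(set(itemset))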
2. FP-Growth Algorithm:
The FP-Growth (Frequent Pattern Growth) algorithm is a data mining technique that finds frequent patterns or itemsets in a dataset. It operates by building an FP-Tree, which is a compact representation of the dataset. The FP-Tree is then used to derive the frequent patterns from the ground up. The FP-Growth technique is highly scalable and can efficiently detect frequent patterns in huge datasets. It is also generally more efficient than another common approach for mining frequent itemsets, the Apriori algorithm.
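Building the FP-Tree from scratch is more involved; assuming the third-party mlxtend and pandas libraries are available (an assumption, not something this unit otherwise requires), a frequent-itemset run might look like the following sketch:

# Hedged sketch: mining frequent itemsets with FP-Growth via mlxtend
# (assumed installed: pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

baskets = [["bread", "milk", "eggs"], ["bread", "milk"], ["milk", "eggs"], ["bread", "eggs"]]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(baskets).transform(baskets), columns=encoder.columns_)

# use_colnames=True reports item names instead of column indices
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))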
3. Closed Frequent Itemset Mining:
Closed frequent itemset mining is a variant of frequent itemset mining that discovers only the closed frequent itemsets: frequent itemsets for which no proper superset has exactly the same support. Because the support of any non-closed itemset can be recovered from one of its closed supersets, the closed itemsets form a compact, lossless representation of all frequent itemsets. The technique works by first generating the frequent itemsets in the dataset and then evaluating each itemset to determine whether any of its supersets has the same support; if one does, the itemset is not closed and is discarded, otherwise it is added to the list of closed frequent itemsets.
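Given frequent itemsets and their support counts (for example, from an Apriori run), the closed ones can be filtered with a simple check. The sketch below assumes a hypothetical supports dictionary mapping frozenset itemsets to their counts:

# Sketch: keep only closed frequent itemsets, i.e. those with no proper
# superset having the same support. `supports` is a hypothetical input.
def closed_itemsets(supports):
    closed = {}
    for itemset, sup in supports.items():
        if not any(itemset < other and sup == other_sup
                   for other, other_sup in supports.items()):
            closed[itemset] = sup
    return closed

supports = {
    frozenset({"bread"}): 3,
    frozenset({"milk"}): 3,
    frozenset({"bread", "milk"}): 3,   # same support as {"bread"} and {"milk"}
    frozenset({"eggs"}): 2,
}
print(closed_itemsets(supports))       # only {"bread", "milk"} and {"eggs"} remain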
4. Naive Bayesian Algorithm:
Naive Bayes is a supervised machine learning method used for classification that is based on Bayes' Theorem. It is a probabilistic method that makes predictions using the probability of each attribute value given each class. The Naive Bayes method rests on the assumption that all attributes are independent of one another given the class. This simplifies the computation of the probabilities, allowing the algorithm to quickly predict the class of a new data point.
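As a small illustration of this Bayes'-Theorem-based prediction, the sketch below uses scikit-learn's GaussianNB on made-up numeric features (assuming scikit-learn is installed; the feature values and labels are toy examples):

# Hedged sketch: Gaussian Naive Bayes classification with scikit-learn.
from sklearn.naive_bayes import GaussianNB

X = [[25, 40_000], [47, 82_000], [35, 60_000], [52, 95_000]]  # e.g. age, income
y = [0, 1, 0, 1]                                              # e.g. 0 = no churn, 1 = churn

model = GaussianNB().fit(X, y)
print(model.predict([[30, 55_000]]))        # predicted class for a new customer
print(model.predict_proba([[30, 55_000]]))  # per-class probabilities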