Unit 2 Data Preprocessing and Association Rule Mining
Data processing is the task of collecting raw data and translating it into usable information. The raw data
is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable
format. It is usually performed in a step-by-step process by a team of data scientists and data
engineers in an organization.
Data processing can be carried out manually or automatically. Nowadays, most data is
processed automatically with the help of computers, which is faster and gives more accurate
results. The processed data can be delivered in different forms, such as graphics or audio,
depending on the software and the data processing methods used.
1. Data Collection
The collection of raw data is the first step of the data processing cycle. The raw data collected
has a huge impact on the output produced. Hence, raw data should be gathered from defined
and accurate sources so that the subsequent findings are valid and usable. Raw data can
include monetary figures, website cookies, profit/loss statements of a company, user
behavior, etc.
2. Data Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw data to remove
unnecessary and inaccurate data. Raw data is checked for errors, duplication,
miscalculations, or missing data and transformed into a suitable form for further analysis
and processing. This ensures that only the highest quality data is fed into the processing unit.
3. Data Input
In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner, or any
other input source.
4. Data Processing
In this step, the raw data is subjected to various data processing methods using machine
learning and artificial intelligence algorithms to generate the desired output. This step
may vary slightly from process to process depending on the source of data being processed
(data lakes, online databases, connected devices, etc.) and the intended use of the output.
5. Data Output
The data is finally transmitted and displayed to the user in a readable form like graphs,
tables, vector files, audio, video, documents, etc. This output can be stored and further
processed in the next data processing cycle.
6. Data Storage
The last step of the data processing cycle is storage, where data and metadata are stored for
further use. This allows quick access and retrieval of information whenever needed. Proper
data storage is also necessary for compliance with data protection legislation such as the GDPR.
==================================================================
What is ETL?
The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.
The ETL process requires active input from various stakeholders, including developers, analysts,
testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change as the
business changes. ETL is a recurring activity of a data warehouse system (run daily, weekly, or monthly)
and needs to be agile, automated, and well documented.
How ETL Works?
Extraction
o Extraction is the operation of extracting information from a source system for
further use in a data warehouse environment. This is the first stage of the ETL
process.
o Extraction process is often one of the most time-consuming tasks in the ETL.
o The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all
changed data to the warehouse and keep it up-to-date.
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
Loading
The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible. Loading can be carried out in two ways (a short sketch follows this list):
1. Refresh: Data warehouse data is completely rewritten, which means the older data
is replaced. Refresh is usually used in combination with static extraction to
populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying
preexisting data. This method is used in combination with incremental extraction
to update data warehouses regularly.
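To make the refresh/update distinction concrete, here is a minimal, illustrative ETL sketch in Python using pandas and SQLite. The column names, sample rows, and the fact_sales table are hypothetical assumptions, not part of any particular warehouse.

# Minimal ETL sketch (illustrative assumptions throughout).
import io
import sqlite3
import pandas as pd

# Extract: in a real job this would be pd.read_csv("sales.csv") or a database
# query; a small in-memory CSV stands in for the source system here.
source_csv = io.StringIO(
    "order_id,amount,order_date\n"
    "1,100.5,2023-01-05\n"
    "2,,2023-01-06\n"
    "3,80.0,bad-date\n"
)
raw = pd.read_csv(source_csv)

# Transform: clean and reshape the records into the warehouse format.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["amount"])          # drop unusable rows
raw["amount"] = raw["amount"].astype(float)

# Load: write the reconciled data into the target table.
# if_exists="replace" behaves like a full refresh; "append" resembles an
# incremental update.
with sqlite3.connect(":memory:") as conn:
    raw.to_sql("fact_sales", conn, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM fact_sales", conn))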
==============================================================================
Data cleaning is a crucial process in data mining. It plays an important part in building
a model. Data cleaning is a necessary step, yet it is often neglected. Data quality is the
main issue in quality information management; data quality problems can occur anywhere in
an information system, and data cleaning addresses them.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to clean your data:
1. Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find both "N/A" and "Not Applicable" in a sheet, but they should
be analyzed as the same category.
2. Filter unwanted outliers
Often, there will be one-off observations that, at a glance, do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data
entry, doing so will help the performance of the data you are working with.
However, sometimes the appearance of an outlier will prove a theory you are working on, and
the mere existence of an outlier does not mean it is incorrect. This step is needed to determine the
validity of that value. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.
3. Handle missing data
You can't ignore missing data because many algorithms will not accept missing values. There
are a couple of ways to deal with missing data, as sketched in the example after this list. Neither
is optimal, but both can be considered, such as:
o You can drop observations with missing values, but this will drop or lose information,
so be careful before removing them.
o You can impute missing values based on other observations; again, there is a risk of
losing the integrity of the data because you may then be operating from assumptions
rather than actual observations.
o You might alter the way the data is used to navigate null values effectively.
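A minimal sketch of the two main options above using pandas; the DataFrame, column names, and values are invented for illustration.

# Two ways to handle missing values (illustrative data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
})

# Option 1: drop observations with missing values (information is lost).
dropped = df.dropna()

# Option 2: impute missing values from other observations, e.g. the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)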
4. Validate and QA
At the end of the data cleaning process, you should be able to answer basic validation questions,
such as whether the data makes sense and whether it follows the appropriate rules for its field.
Incorrect or noisy data can lead to false conclusions that inform poor business strategy and
decision-making. False conclusions can also lead to an embarrassing moment in a reporting meeting
when you realize your data doesn't stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization and to document the tools and processes
you use to maintain it.
There are many data cleaning methods through which the data should be run. The methods are
described below:
1. Ignore the tuples: This method is not very feasible, as it is only useful when a
tuple has several attributes with missing values.
2. Fill the missing value: This approach is also not very effective or feasible. Moreover,
it can be a time-consuming method. In the approach, one has to fill in the missing value.
This is usually done manually, but it can also be done by attribute mean or using the
most probable value.
3. Binning method: This approach is simple to understand. The sorted data is first divided
into several segments (bins) of equal size; the values in each bin are then smoothed using
the values around them, for example by replacing them with the bin mean (see the sketch
after this list).
4. Regression: The data is made smooth with the help of using the regression function.
The regression can be linear or multiple. Linear regression has only one independent
variable, and multiple regressions have more than one independent variable.
5. Clustering: This method operates on groups of data. Similar values are arranged into
"groups" or "clusters", and values that fall outside every cluster are detected and
treated as outliers.
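As referenced in the binning method above, here is a small sketch of smoothing by bin means; the sample values and the choice of three equal-frequency bins are assumptions made for the example.

# Smoothing by bin means on a small sorted sample (values chosen for illustration).
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
n_bins = 3
bins = np.array_split(data, n_bins)          # equal-frequency (equi-depth) bins

# Replace every value in a bin by the mean of that bin.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)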
The following steps show the process of data cleaning in data mining.
1. Monitoring the errors: Keep a record of where most of your mistakes arise. This will
make it easier to identify and correct false or corrupt information, which is
especially necessary when integrating another data source with established
management software.
2. Standardize the mining process: Standardize the point of insertion to assist and
reduce the chances of duplicity.
3. Validate data accuracy: Analyze and invest in data tools that clean records in real
time. Some tools use artificial intelligence to better check data for correctness.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing data.
Repeatedly handling the same data can be avoided by analyzing and investing in
separate data-erasing tools that can analyze raw data in bulk and automate the
operation.
5. Research on data: Before this activity, our data must be standardized, validated,
and scrubbed for duplicates. There are many third-party sources, and these approved
and authorized sources can capture information directly from our databases. They
help us clean and compile the data to ensure completeness, accuracy, and
reliability for business decision-making.
6. Communicate with the team: Keeping the team in the loop will help develop and
strengthen client relationships and send more targeted data to prospective customers.
Here are the main uses of data cleaning in data mining:
o Data Integration: Since it is difficult to ensure quality in low-quality data, data
integration has an important role in solving this problem. Data Integration is the
process of combining data from different data sets into a single one. This process
uses data cleansing tools to ensure that the embedded data set is standardized and
formatted before moving to the final destination.
o Data Migration: Data migration is the process of moving data from one system
to another, one format to another, or one application to another. While the data is
on the move, it is important to maintain its quality, security, and consistency, to ensure
that the resultant data has the correct format and structure without any discrepancies at the
destination.
o Data Transformation: Before the data is uploaded to a destination, it needs to be
transformed. This is only possible through data cleaning, which considers the system
criteria of formatting, structuring, etc. Data transformation processes usually include
using rules and filters before further analysis. Data transformation is an integral
part of most data integration and data management processes. Data cleansing tools
help to clean the data using the built-in transformations of the systems.
o Data Debugging in ETL Processes: Data cleansing is crucial to preparing data during
extract, transform, and load (ETL) for reporting and analysis. Data cleansing ensures
that only high-quality data is used for decision-making and analysis.
For example, a retail company receives data from various sources, such as CRM or ERP
systems, containing misinformation or duplicate data. A good data debugging tool would detect
inconsistencies in the data and rectify them. The purged data will be converted to a standard
format and uploaded to a target database.
Data cleaning is mandatory to guarantee the accuracy, integrity, and security of business data.
Data can vary in quality depending on its characteristics. The main characteristics that data
cleaning aims to ensure are:
o Accuracy: All the data that make up a database within the business must be highly
accurate. One way to corroborate their accuracy is by comparing them with different
sources. If the source is not found or has errors, the stored information will have the
same problems.
o Coherence: The data must be consistent with each other, so you can be sure that the
information of an individual or body is the same in different forms of storage used.
o Validity: The stored data must have certain regulations or established restrictions.
Likewise, the information has to be verified to corroborate its authenticity.
o Uniformity: The data that make up a database must have the same units or values. This
is an essential aspect when carrying out the data cleansing process, since consistent
units keep the complexity of the procedure from increasing.
o Data Verification: The process must be verified at all times, both for the appropriateness
and the effectiveness of the procedure. This verification is carried out through repeated
iterations of the study, design, and validation stages. The drawbacks often become evident
only after the data has been put through a certain number of changes.
o Clean Data Backflow: After quality problems have been eliminated, the clean data should
flow back to replace the dirty data in the original sources, so that legacy applications
also obtain the benefits and later data cleaning efforts are avoided.
Data Cleansing Tools can be very helpful if you are not confident of cleaning the data yourself
or have no time to clean up all your data sets. You might need to invest in those tools, but it is
worth the expenditure. There are many data cleaning tools in the market. Here are some top-
ranked data cleaning tools, such as:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure
=====================================================================
Data Integration
Data integration is the process of merging data from several disparate sources. While performing
data integration, you must deal with issues such as data redundancy, inconsistency, and duplication.
Data integration is formally stated as a triple (G, S, M), where G represents the global schema,
S represents the heterogeneous set of source schemas, and M represents the mapping between
queries over the source and global schemas.
Data integration is important because it gives a uniform view of data scattered across sources
while also maintaining data accuracy. It supports meaningful mining of information, which in turn
helps executives and managers make strategic decisions for the enterprise's benefit.
There are mainly two kinds of approaches to data integration in data mining, as mentioned below -
1. Tight Coupling
• This approach involves the creation of a centralized database that integrates data from
different sources. The data is loaded into the centralized database using extract, transform,
and load (ETL) processes.
• In this approach, the integration is tightly coupled, meaning that the data is physically stored
in the central database, and any updates or changes made to the data sources are
immediately reflected in the central database.
• Tight coupling is suitable for situations where real-time access to the data is required, and
data consistency is critical. However, this approach can be costly and complex, especially
when dealing with large volumes of data.
2. Loose Coupling
• This approach involves the integration of data from different sources without physically
storing it in a centralized database.
• In this approach, data is accessed from the source systems as needed and combined in real
time to provide a unified view (see the sketch after this list). This approach uses middleware,
such as application programming interfaces (APIs) and web services, to connect the source
systems and access the data.
• Loose coupling is suitable for situations where real-time access to the data is not critical,
and the data sources are highly distributed. This approach is more cost-effective and flexible
than tight coupling but can be more complex to set up and maintain.
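As a rough illustration of combining records from two sources into one unified view, here is a small pandas sketch; the CRM and ERP tables, the cust_id key, and the values are hypothetical.

# Building a unified view from two (hypothetical) source extracts.
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Bilal", "Chen"]})
erp = pd.DataFrame({"cust_id": [2, 3, 4], "total_orders": [5, 2, 7]})

# Resolve the common key and combine the records into a single view.
unified = crm.merge(erp, on="cust_id", how="outer")
print(unified)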
A few of the most common issues in data integration in data mining include -
• Data Quality - The quality of the data being integrated can be a significant issue in data
integration. Data from different sources may have varying levels of accuracy, completeness,
and consistency, which can lead to data quality issues in the integrated data.
• Data Semantics - Data semantics refers to the meaning and interpretation of data.
Integrating data from different sources can be challenging because the same data element
may have different meanings across sources. This can result in data integration issues and
impact the integrated data's accuracy.
• Data Heterogeneity - Data heterogeneity refers to the differences in data formats,
structures, and storage mechanisms across different data sources. Data integration can be
challenging when dealing with heterogeneous data sources, as it requires data
transformation and mapping to make the data compatible with the target data model.
• Complexity - Data integration can be complex, especially when dealing with large volumes
of data or multiple data sources. As the complexity of data integration increases, it becomes
more challenging to maintain data quality, ensure data consistency, and manage data
security and privacy.
• Data Privacy and Security - Data integration can increase the risk of data privacy and
security breaches. Integrating data from multiple sources can expose sensitive information
and increase the risk of unauthorized access or disclosure.
• Scalability - Scalability refers to the ability of the data integration solution to handle
increasing volumes of data and accommodate changes in data sources. Data integration
solutions must be scalable to meet the organization's evolving needs and ensure that the
integrated data remains accurate and consistent.
Data Reduction
Data reduction techniques ensure the integrity of data while reducing the data. Data
reduction is a process that reduces the volume of original data and represents it in a much
smaller volume. Data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume by maintaining the integrity of the original data. By
reducing the data, the efficiency of the data mining process is improved, which produces the
same analytical results.
Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.
Here are the following techniques or methods of data reduction in data mining, such as:
1. Dimensionality Reduction
Whenever a data set contains many attributes that are only weakly relevant, we keep only the
attributes required for our analysis. Dimensionality reduction eliminates the remaining attributes
from the data set under consideration, thereby reducing the volume of the original data. It reduces
data size by eliminating outdated or redundant features.
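The text does not prescribe a specific technique, but Principal Component Analysis (PCA) is one common way to perform dimensionality reduction. The sketch below assumes scikit-learn is available and uses random data purely for illustration.

# Reducing 10 attributes to 3 derived attributes with PCA (random demo data).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)        # 100 records with 10 attributes

pca = PCA(n_components=3)          # keep 3 principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)         # (100, 10) -> (100, 3)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained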
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller
form. This technique includes two types: parametric and non-parametric numerosity reduction.
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data cube aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018
to the year 2022. If you want to get the annual sale per year, you just have to aggregate the
sales per quarter for each year. In this way, aggregation provides you with the required data,
which is much smaller in size, and thereby we achieve data reduction even without losing any
data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional
analysis. The data cube presents precomputed and summarized data, which makes it fast to
access for data mining.
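A small pandas sketch of the quarterly-to-annual roll-up described above; the sales figures are invented.

# Rolling quarterly sales up to annual totals (a coarser cube cell).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 300, 350, 220, 260, 310, 400],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)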
4. Data Compression
Data compression employs modification, encoding, or conversion of the structure of data in a way
that consumes less space. It builds a compact representation of information by removing
redundancy and representing data in binary form. Compression from which the original data can be
restored exactly is called lossless compression; when the original form cannot be fully restored
from the compressed form, the compression is lossy. Dimensionality and numerosity reduction
methods can also be used for data compression.
This technique reduces the size of the files using different encoding mechanisms, such as
Huffman Encoding and run-length Encoding. We can divide it into two types based on their
compression techniques.
i. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple
and minimal data size reduction. Lossless data compression uses algorithms to restore
the precise original data from the compressed data.
ii. Lossy Compression: In lossy-data compression, the decompressed data may differ
from the original data but are useful enough to retrieve information from them. For
example, the JPEG image format is a lossy compression, but we can find the meaning
equivalent to the original image.
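To illustrate the lossless case above, the following sketch uses Python's standard zlib module; the sample byte string is made up, and the point is only that the exact original bytes are recovered, which lossy schemes such as JPEG cannot guarantee.

# Lossless compression demo: the original is recovered exactly.
import zlib

original = b"AAAAAABBBCCCCCCCCCCAAAA" * 100      # highly redundant sample data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")
assert restored == original                      # lossless: nothing was lost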
5. Discretization Operation
The data discretization technique is used to divide attributes of a continuous nature into
data with intervals. We replace many continuous values of the attributes with labels of small
intervals. This means that mining results are presented in a concise and easily understandable way.
The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk
space, the less capacity you will need to purchase. Data reduction also greatly increases the
efficiency of a storage system and directly impacts your total spending on capacity.
================================================================
Data Transformation
Data transformation is a technique used to convert the raw data into a suitable format that
efficiently eases data mining and retrieves strategic information. Data transformation
includes data cleaning techniques and a data reduction technique to convert the data into
the appropriate form.
There are several data transformation techniques that can help structure and clean up the data
before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or any other noise form.
The concept behind data smoothing is that it can identify simple changes that help predict
different trends and patterns. This helps analysts or traders, who need to look at a lot of
data that can be difficult to digest, to find patterns they would not otherwise see.
We have seen how noise is removed from the data using techniques such as binning, regression,
and clustering:
o Binning: This method splits the sorted data into a number of bins and smooths the data
values in each bin using the neighboring values around them.
o Regression: This method identifies the relation between two attributes so that, given one
attribute, it can be used to predict the other (see the sketch after this list).
o Clustering: This method groups similar data values into clusters. The values that lie
outside every cluster are known as outliers.
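As referenced in the regression bullet, here is a minimal sketch of regression-based smoothing with NumPy; the x and y values are synthetic.

# Fit a linear trend to noisy values and use the fitted line as the smoothed series.
import numpy as np

x = np.arange(20, dtype=float)
y = 2.0 * x + 5 + np.random.normal(scale=4.0, size=x.size)   # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)    # simple linear regression
y_smooth = slope * x + intercept              # predictions replace noisy values

print(np.round(y_smooth[:5], 2))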
2. Attribute Construction
In the attribute construction method, the new attributes consult the existing attributes to
construct a new data set that eases data mining. New attributes are created and applied to assist
the mining process from the given attributes. This simplifies the original data and makes the
mining more efficient.
For example, suppose we have a data set containing measurements of different plots, i.e., we
may have the height and width of each plot. We can then construct a new attribute 'area'
from the attributes 'height' and 'width'. This also helps in understanding the relations among
the attributes in a data set.
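The plot example above can be expressed in a couple of lines of pandas; the measurements are made up.

# Constructing the new 'area' attribute from existing attributes.
import pandas as pd

plots = pd.DataFrame({"height": [10, 12, 8], "width": [4, 5, 6]})
plots["area"] = plots["height"] * plots["width"]   # new attribute eases later mining
print(plots)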
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each
year. We can aggregate the data to get the enterprise's annual sales report.
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1,
1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for
attribute A that are V1, V2, V3, ….Vn.
o Z-score normalization: This method normalizes the value for attribute A using
the mean and standard deviation. The following formula is used for Z-score
normalization:
v' = (v - Ā) / σA
Here Ā and σA are the mean and standard deviation for attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and
$16,000, and we have to normalize the value $73,600 using z-score normalization:
v' = (73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the
decimal point in the value. The movement of the decimal point depends on the maximum
absolute value of A. The formula for decimal scaling is:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1. A short sketch of both
normalization methods follows this list.
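A short sketch of the two normalization methods above on a small, made-up sample; the decimal-scaling exponent j is chosen so that the largest absolute value falls below 1.

# Z-score normalization and decimal scaling on sample values.
import numpy as np

A = np.array([48000.0, 54000.0, 61000.0, 73600.0, 39000.0])

# Z-score normalization: (v - mean) / standard deviation.
z = (A - A.mean()) / A.std()

# Decimal scaling: divide by 10^j so the largest absolute value falls below 1.
j = int(np.ceil(np.log10(np.abs(A).max())))
decimal_scaled = A / (10 ** j)

print(np.round(z, 3))
print(decimal_scaled)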
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to study and
analyze. If a data mining task handles a continuous attribute, then its discrete values can be
replaced by constant quality attributes. This improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a
set of categorical data. Discretization also uses decision tree-based algorithms to produce short,
compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, which uses
the class information, and unsupervised discretization, which does not. In either case the
process can proceed 'top-down' (splitting strategy) or 'bottom-up' (merging strategy).
For example, the values for the age attribute can be replaced by the interval labels such as (0-
10, 11-20…) or (kid, youth, adult, senior).
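A small pandas sketch of discretizing an age attribute into the interval and concept labels mentioned above; the ages and bin boundaries are assumptions for illustration.

# Discretizing a continuous 'age' attribute into intervals and concept labels.
import pandas as pd

ages = pd.Series([4, 15, 23, 37, 45, 62, 71])

# Numeric interval labels ...
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

# ... or conceptual labels such as kid / youth / adult / senior.
labels = pd.cut(ages, bins=[0, 12, 20, 60, 100],
                labels=["kid", "youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "label": labels}))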
6. Data Generalization
It converts low-level data attributes to high-level data attributes using a concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches: the data cube
(OLAP) approach and the attribute-oriented induction approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher
conceptual level into a categorical value (young, old).
The entire process for transforming data is known as ETL (Extract, Transform, and
Load). Through the ETL process, analysts can convert data to its desired format. Here
are the steps involved in the data transformation process:
1. Data Discovery: During the first stage, analysts work to understand and identify data
in its source format. To do this, they will use data profiling tools. This step helps
analysts decide what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data
mapping is essential to many data processes, and one misstep can lead to incorrect
analysis and ripple through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original
source. These may include structured sources such as databases or streaming
sources such as customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to
write code to complete the transformation. Often, analysts generate this code with
the help of data transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything
has been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target
might be a data warehouse or a database that handles both structured and unstructured
data.
Transforming data can help businesses in a variety of ways. Here are some of the essential
advantages of data transformation, such as:
o Better Organization: Transformed data is easier for both humans and computers to
use.
o Improved Data Quality: There are many risks and costs associated with bad data.
Data transformation can help your organization eliminate quality issues such as missing
values and other inconsistencies.
o Perform Faster Queries: You can quickly and easily retrieve transformed data thanks
to it being stored and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and
more sources. If there are inconsistencies in the metadata, it can be challenging to
organize and understand it. Data transformation refines your metadata, so it's easier
to organize and understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of
that data sits around unanalysed. Transformation makes it easier to get the most out
of your data by standardizing it and making it more usable.
While data transformation comes with many benefits, carrying it out effectively still
involves challenges and trade-offs. The common approaches to transforming data are:
o Scripting: Data transformation through scripting involves writing code in Python or SQL
to extract and transform data. Python and SQL are scripting languages that allow
you to automate certain tasks in a program and to extract information from data sets.
Scripting languages require less code than traditional programming languages, so this
approach is less labor-intensive.
o On-Premises ETL Tools: ETL tools take the required work to script the data
transformation by automating the process. On-premises ETL tools are hosted on
company servers. While these tools can help save you time, using them often requires
extensive expertise and significant infrastructure costs.
o Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in
the cloud. These tools are often the easiest for non-technical users to utilize. They allow
you to collect data from any cloud source and load it into your data warehouse. With
cloud-based ETL tools, you can decide how often you want to pull data from your
source, and you can monitor your usage.
Frequent Pattern
The technique of frequent pattern mining is built upon a number of fundamental ideas. The
analysis is based on transaction databases, which contain records (transactions) that represent
collections of items. Sets of items occurring together in these transactions are called itemsets.
Association rule learning is one of the most important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc. Market basket analysis is a technique used by
big retailers to discover associations between items. We can understand it
with the example of a supermarket, where products that are frequently
purchased together are placed together.
For example, if a customer buys bread, they are also likely to buy butter, eggs,
or milk, so these products are stored on the same shelf or nearby.
How does Association Rule Learning work?
Association rule learning works on the concept of an if-then statement, such as:
if A, then B. The strength of an association rule is measured using three metrics:
o Support
o Confidence
o Lift
Let's understand each of them:
Support
Support tells how frequently an itemset appears in the dataset. It is the fraction of
transactions that contain the itemset X:
Support(X) = Frequency(X) / Total number of transactions
Confidence
Confidence indicates how often the rule has been found to be true. Or how often
the items X and Y occur together in the dataset when the occurrence of X is
already given. It is the ratio of the transaction that contains X and Y to the number
of records that contain X.
Lift
It is the ratio of the observed support measure and the expected support if X and Y
were independent of each other. It has three possible values:
o Lift = 1: X and Y are independent; the rule carries no extra information.
o Lift > 1: X and Y are positively correlated and tend to be bought together.
o Lift < 1: X and Y are negatively correlated and tend to substitute each other.
It is mainly used for market basket analysis and helps to understand the products
that can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.
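To make the three metrics concrete, here is a tiny sketch that computes support, confidence, and lift for the hypothetical rule {bread} -> {butter}; the five transactions are invented.

# Support, confidence, and lift for a toy rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

sup_bread = support({"bread"})
sup_butter = support({"butter"})
sup_both = support({"bread", "butter"})

confidence = sup_both / sup_bread
lift = sup_both / (sup_bread * sup_butter)

print(f"support={sup_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")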
The Apriori algorithm is the classic algorithm for frequent pattern (frequent itemset) mining.
What is Apriori Algorithm?
The Apriori algorithm mines frequent itemsets and association rules; in retail, the
resulting rules help place related products together, which makes it easier for customers
to buy the products they need and increases the sales performance of the store. The
algorithm relies on two measures:
1. Support
2. Confidence
Consider a transaction dataset over the items I1 to I5. We will find its frequent itemsets
and generate association rules for them, using a minimum support count of 2 and a minimum
confidence of 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called
C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here
min_support = 2); if an item's support count is less than min_support, remove that item.
This gives us the itemset L1.
Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for
joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
• Check whether all subsets of each itemset are frequent; if not, remove that itemset.
(For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this
for each itemset.)
• Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count (here
min_support = 2); if an itemset's support count is less than min_support, remove that
itemset. This gives us the itemset L2.
Step-3:
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with
Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the
first element should match. The itemsets generated by joining L2 are
{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent; if not, remove that itemset.
(Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent.
For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every itemset
similarly.)
• Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate (C3) support counts with the minimum support count (here
min_support = 2); if an itemset's support count is less than min_support, remove that
itemset. This gives us the itemset L3.
Step-4:
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with
Lk-1 (K=4) is that the itemsets should have (K-2) elements in common, so here, for L3,
the first two items should match.
• Check whether all subsets of these itemsets are frequent. (Here the itemset formed by
joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there
is no itemset in C4.
• We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of strong
association rules comes into the picture. For that, we need to calculate the confidence
of each rule.
Confidence: A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
So here, taking the frequent itemset {I1, I2, I3} from L3 as an example, the candidate
rules are:
[I1^I2]=>[I3], [I1^I3]=>[I2], [I2^I3]=>[I1], [I1]=>[I2^I3], [I2]=>[I1^I3], [I3]=>[I1^I2]
For each rule, we compute its confidence and keep the rules whose confidence meets the
minimum confidence of 60%; these are the strong association rules.
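As a rough sketch of the full procedure (join step, prune step, support counting), here is a compact Apriori implementation in plain Python. The toy transactions and the minimum support count of 2 are illustrative assumptions and are not tied to the worked example above.

# Compact Apriori sketch: finds all frequent itemsets in toy transactions.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support_count = 2

def support_count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support_count({i}) >= min_support_count]

frequent = list(Lk)
k = 2
while Lk:
    # Join step: build size-k candidates from frequent (k-1)-itemsets.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    # Keep only candidates that meet the minimum support count.
    Lk = [c for c in candidates if support_count(c) >= min_support_count]
    frequent.extend(Lk)
    k += 1

for itemset in frequent:
    print(sorted(itemset), support_count(itemset))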