Unit 2 Data Preprocessing and Association Rule Mining

Data Pre-processing: An Overview

Data processing is the collection of raw data and its translation into usable information. The raw data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format. It is usually performed step by step by a team of data scientists and data engineers in an organization.

Data processing can be carried out automatically or manually. Nowadays, most data is processed automatically with the help of computers, which is faster and gives more accurate results. The processed data can be presented in different forms, such as graphics or audio, depending on the software and the data processing methods used.

Stages of Data Processing

The data processing consists of the following six stages.

1. Data Collection

The collection of raw data is the first step of the data processing cycle. The raw data collected
has a huge impact on the output produced. Hence, raw data should be gathered from defined
and accurate sources so that the subsequent findings are valid and usable. Raw data can
include monetary figures, website cookies, profit/loss statements of a company, user
behavior, etc.
2. Data Preparation

Data preparation or data cleaning is the process of sorting and filtering the raw data to remove
unnecessary and inaccurate data. Raw data is checked for errors, duplication,
miscalculations, or missing data and transformed into a suitable form for further analysis
and processing. This ensures that only the highest quality data is fed into the processing unit.

3. Data Input

In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner, or any
other input source.

4. Data Processing

In this step, the raw data is subjected to various data processing methods using machine
learning and artificial intelligence algorithms to generate the desired output. This step
may vary slightly from process to process depending on the source of data being processed
(data lakes, online databases, connected devices, etc.) and the intended use of the output.

5. Data Interpretation or Output

The data is finally transmitted and displayed to the user in a readable form like graphs,
tables, vector files, audio, video, documents, etc. This output can be stored and further
processed in the next data processing cycle.

6. Data Storage

The last step of the data processing cycle is storage, where data and metadata are stored for
further use. This allows quick access and retrieval of information whenever needed. Effective data storage is also necessary for compliance with data protection legislation such as the GDPR.

==================================================================

What is ETL?

The mechanism of extracting information from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and Loading.

The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.

To maintain its value as a tool for decision-makers, a data warehouse must change as the business changes. ETL is a recurring activity (daily, weekly, or monthly) of a data warehouse system and needs to be agile, automated, and well documented.
How Does ETL Work?

ETL consists of three separate phases:

Extraction
o Extraction is the operation of extracting information from a source system for
further use in a data warehouse environment. This is the first stage of the ETL
process.
o Extraction process is often one of the most time-consuming tasks in the ETL.
o The source systems might be complicated and poorly documented, and thus
determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all
changed data to the warehouse and keep it up-to-date.

Transformation

Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.

Loading

The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.

Loading can be carried out in two ways:

1. Refresh: Data warehouse data is completely rewritten, meaning the older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying
preexisting data. This method is used in combination with incremental extraction
to update data warehouses regularly.
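As a minimal illustration of these three phases, the sketch below extracts rows from a hypothetical CSV source, applies a simple transformation, and loads the result into a SQLite table. The file name, column names, and table name are assumptions for the example, not part of the original text.

import csv
import sqlite3

def extract(path):
    # Extraction: read raw records from the source file (assumed CSV layout).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: convert source records into the warehouse format.
    return [{"customer": row["customer"].strip().title(),
             "amount": round(float(row["amount"]), 2)} for row in rows]

def load(rows, db_path="warehouse.db"):
    # Loading: write transformed records into the target table (update-style append).
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))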
==============================================================================

Data Cleaning in Data Mining

Data cleaning is a crucial process in data mining and plays an important part in building a model. It is a necessary step, yet it is often neglected. Data quality is the main issue in quality information management, and data quality problems can occur anywhere in an information system. These problems are addressed by data cleaning.

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.

Steps of Data Cleaning

While the techniques used for data cleaning may vary according to the types of data your company stores, you can follow these basic steps to clean your data:

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations and irrelevant observations. Duplicate observations happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are observations that do not fit into the specific problem you are trying to analyze.
For example, if you want to analyze data regarding millennial customers but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient, minimize distraction from your primary target, and create a more manageable and better-performing dataset.

2. Fix structural errors

Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find both "N/A" and "Not Applicable" in the same sheet, but they should be analyzed as the same category.

3. Filter unwanted outliers

Often there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the performance of the data you are working with.

However, sometimes, the appearance of an outlier will prove a theory you are working on. And
just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.

4. Handle missing data

You can't simply ignore missing data, because many algorithms will not accept missing values. There are a couple of ways to deal with missing data. Neither is optimal, but both can be considered, such as:

o You can drop observations with missing values, but this will drop or lose information, so be careful before removing them.
o You can impute missing values based on other observations; again, there is a risk of losing the integrity of the data because you may be operating from assumptions rather than actual observations.
o You might alter how the data is used to navigate null values effectively. A small sketch of the first two options is given after this list.
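The snippet below is a minimal sketch of the first two options using pandas; the column names and values are hypothetical.

import pandas as pd

# A small hypothetical dataset with missing values.
df = pd.DataFrame({
    "age": [25, None, 31, 47],
    "income": [52000, 61000, None, 78000],
})

# Option 1: drop observations (rows) that contain missing values.
dropped = df.dropna()

# Option 2: impute missing values from other observations, here the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)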

5. Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as part of basic validation:

o Does the data make sense?


o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory or bring any insight to light?
o Can you find trends in the data to help you for your next theory?
o If not, is that because of a data quality issue?

Incorrect or noisy data can lead to false conclusions that inform poor business strategy and decision-making. False conclusions can also lead to an embarrassing moment in a reporting meeting when you realize your data doesn't stand up to scrutiny. Before you get there, it is important to create a culture of quality data in your organization, and to document the tools you might use to create this strategy.

Methods of Data Cleaning

There are many data cleaning methods through which the data should be run. The methods are
described below:

1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
2. Fill in the missing value: This approach is also not very effective or feasible. Moreover, it can be time-consuming. In this approach, one has to fill in the missing values. This is usually done manually, but it can also be done using the attribute mean or the most probable value.
3. Binning method: This approach is very simple to understand. The sorted data is divided into several segments (bins) of equal size, and the values in each bin are then smoothed using the values around them, for example by replacing them with the bin mean (a sketch of this follows the list).
4. Regression: The data is smoothed with the help of a regression function. The regression can be linear or multiple: linear regression has only one independent variable, while multiple regression has more than one.
5. Clustering: This method mainly operates on groups. Clustering arranges similar values into a "group" or "cluster", and outliers are then detected with the help of the clustering.
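As an illustrative sketch of the binning method (the values are made up, not taken from the text), the function below smooths a sorted list of values by replacing each value with the mean of its equal-size bin.

def smooth_by_bin_means(values, bin_size):
    # Smooth sorted values by replacing each one with its bin's mean.
    values = sorted(values)
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_vals = values[start:start + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(mean, 2)] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]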

Process of Data Cleaning

The following steps show the process of data cleaning in data mining.

1. Monitor the errors: Keep a note of where the most mistakes arise. This makes it easier to identify and correct false or corrupt information, which is especially necessary when integrating another possible alternative with established management software.
2. Standardize the mining process: Standardize the point of insertion to help reduce the chances of duplication.
3. Validate data accuracy: Analyze and invest in data tools that clean records in real time. Such tools use artificial intelligence to better check data for correctness.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing data. Repeatedly analyzing the same data can be avoided by investing in separate data-erasing tools that can analyze raw data in bulk and automate the operation.
5. Research the data: Before this activity, our data must be standardized, validated, and scrubbed for duplicates. There are many third-party sources, and these approved and authorized sources can capture information directly from our databases. They help us clean and compile the data to ensure completeness, accuracy, and reliability for business decision-making.
6. Communicate with the team: Keeping the team in the loop will assist in developing and strengthening client relationships and in sending more targeted information to prospective customers.

Usage of Data Cleaning in Data Mining

Here are some of the main uses of data cleaning in data mining:
o Data Integration: Since it is difficult to ensure quality in low-quality data, data
integration has an important role in solving this problem. Data Integration is the
process of combining data from different data sets into a single one. This process
uses data cleansing tools to ensure that the embedded data set is standardized and
formatted before moving to the final destination.
o Data Migration: Data migration is the process of moving a file from one system to another, one format to another, or one application to another. While the data is on the move, it is important to maintain its quality, security, and consistency, to ensure that the resulting data has the correct format and structure, without any deficiencies, at the destination.
o Data Transformation: Before the data is uploaded to a destination, it needs to be
transformed. This is only possible through data cleaning, which considers the system
criteria of formatting, structuring, etc. Data transformation processes usually include
using rules and filters before further analysis. Data transformation is an integral
part of most data integration and data management processes. Data cleansing tools
help to clean the data using the built-in transformations of the systems.
o Data Debugging in ETL Processes: Data cleansing is crucial to preparing data during
extract, transform, and load (ETL) for reporting and analysis. Data cleansing ensures
that only high-quality data is used for decision-making and analysis.

For example, a retail company receives data from various sources, such as CRM or ERP
systems, containing misinformation or duplicate data. A good data debugging tool would detect
inconsistencies in the data and rectify them. The purged data will be converted to a standard
format and uploaded to a target database.

Characteristics of Data Cleaning

Data cleaning is mandatory to guarantee the accuracy, integrity, and security of business data. Depending on its qualities or characteristics, data may vary in quality. Here are the main characteristics that data cleaning in data mining aims to ensure:

o Accuracy: All the data that make up a database within the business must be highly
accurate. One way to corroborate their accuracy is by comparing them with different
sources. If the source is not found or has errors, the stored information will have the
same problems.
o Coherence: The data must be consistent with each other, so you can be sure that the
information of an individual or body is the same in different forms of storage used.
o Validity: The stored data must have certain regulations or established restrictions.
Likewise, the information has to be verified to corroborate its authenticity.
o Uniformity: The data that make up a database must have the same units or values. It
is an essential aspect when carrying out the Data Cleansing process since it does not
increase the complexity of the procedure.
o Data Verification: The process must be verified at all times, in terms of both the appropriateness and the effectiveness of the procedure. This verification is carried out through several iterations of the study, design, and validation stages. Drawbacks often become evident only after the data has been applied in a certain number of changes.
o Clean Data Backflow: After quality problems have been eliminated, the cleaned data should also replace the dirty data in the original sources, so that legacy applications obtain its benefits and later data cleaning actions are not needed again.

Tools for Data Cleaning in Data Mining

Data Cleansing Tools can be very helpful if you are not confident of cleaning the data yourself
or have no time to clean up all your data sets. You might need to invest in those tools, but it is
worth the expenditure. There are many data cleaning tools in the market. Here are some top-
ranked data cleaning tools, such as:

1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure

=====================================================================

Data Integration

Data integration is the process of merging data from several disparate sources. While performing
data integration, you must work on data redundancy, inconsistency, duplicity, etc.

The data integration approach is formally stated as a triple (G, S, M), where G represents the global schema, S represents the set of heterogeneous source schemas, and M represents the mapping between queries over the source and global schemas.
Data integration is important because it gives a uniform view of scattered data while also maintaining data accuracy. It assists the data mining process in extracting meaningful information, which in turn helps executives and managers make strategic decisions for the enterprise's benefit.

Approaches for Data Integration

There are mainly two kinds of approaches to data integration in data mining, as mentioned below -

1. Tight Coupling

• This approach involves the creation of a centralized database that integrates data from
different sources. The data is loaded into the centralized database using extract, transform,
and load (ETL) processes.
• In this approach, the integration is tightly coupled, meaning that the data is physically stored
in the central database, and any updates or changes made to the data sources are
immediately reflected in the central database.
• Tight coupling is suitable for situations where real-time access to the data is required, and
data consistency is critical. However, this approach can be costly and complex, especially
when dealing with large volumes of data.

2. Loose Coupling

• This approach involves the integration of data from different sources without physically
storing it in a centralized database.
• In this approach, data is accessed from the source systems as needed and combined in real-
time to provide a unified view. This approach uses middleware, such as application
programming interfaces (APIs) and web services, to connect the source systems and access
the data.
• Loose coupling is suitable for situations where real-time access to the data is not critical,
and the data sources are highly distributed. This approach is more cost-effective and flexible
than tight coupling but can be more complex to set up and maintain.
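As a small, hedged sketch of the integration idea, the pandas code below maps two hypothetical source schemas onto a shared global schema and combines them into a unified view on demand; the table and column names are assumptions.

import pandas as pd

# Two hypothetical source systems with heterogeneous schemas.
crm = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Ana Roy", "Raj Patel"]})
erp = pd.DataFrame({"customer": [1, 2], "total_spend": [1200.0, 870.5]})

# Mapping M: rename source attributes onto the global schema G(customer_id, name, spend).
crm_g = crm.rename(columns={"cust_id": "customer_id", "full_name": "name"})
erp_g = erp.rename(columns={"customer": "customer_id", "total_spend": "spend"})

# Integrated, unified view over both sources.
unified = crm_g.merge(erp_g, on="customer_id")
print(unified)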

Issues in Data Integration

A few of the most common issues in data integration in data mining include -

• Data Quality - The quality of the data being integrated can be a significant issue in data
integration. Data from different sources may have varying levels of accuracy, completeness,
and consistency, which can lead to data quality issues in the integrated data.
• Data Semantics - Data semantics refers to the meaning and interpretation of data.
Integrating data from different sources can be challenging because the same data element
may have different meanings across sources. This can result in data integration issues and
impact the integrated data's accuracy.
• Data Heterogeneity - Data heterogeneity refers to the differences in data formats,
structures, and storage mechanisms across different data sources. Data integration can be
challenging when dealing with heterogeneous data sources, as it requires data
transformation and mapping to make the data compatible with the target data model.
• Complexity - Data integration can be complex, especially when dealing with large volumes
of data or multiple data sources. As the complexity of data integration increases, it becomes
more challenging to maintain data quality, ensure data consistency, and manage data
security and privacy.
• Data Privacy and Security - Data integration can increase the risk of data privacy and
security breaches. Integrating data from multiple sources can expose sensitive information
and increase the risk of unauthorized access or disclosure.
• Scalability - Scalability refers to the ability of the data integration solution to handle
increasing volumes of data and accommodate changes in data sources. Data integration
solutions must be scalable to meet the organization's evolving needs and ensure that the
integrated data remains accurate and consistent.

Data Reduction

Data reduction techniques ensure the integrity of data while reducing the data. Data
reduction is a process that reduces the volume of original data and represents it in a much
smaller volume. Data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume by maintaining the integrity of the original data. By
reducing the data, the efficiency of the data mining process is improved, which produces the
same analytical results.

Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.

Techniques of Data Reduction

Here are the following techniques or methods of data reduction in data mining, such as:
1. Dimensionality Reduction

Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes required for our analysis. Dimensionality reduction eliminates such attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. One common way to implement it is sketched below.
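The text does not name a specific dimensionality reduction method; as one common choice, the sketch below uses principal component analysis (PCA) from scikit-learn to project a small, made-up feature matrix onto fewer attributes.

import numpy as np
from sklearn.decomposition import PCA

# A small hypothetical dataset: 5 observations, 4 correlated features.
X = np.array([
    [2.5, 2.4, 1.2, 0.5],
    [0.5, 0.7, 0.3, 0.1],
    [2.2, 2.9, 1.1, 0.6],
    [1.9, 2.2, 0.9, 0.4],
    [3.1, 3.0, 1.5, 0.7],
])

# Keep 2 components: the reduced data uses half as many attributes
# while preserving most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)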

2. Numerosity Reduction

The numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.

i. Parametric: Parametric numerosity reduction stores only the parameters of a model of the data instead of the original data. Regression and log-linear models are examples of parametric numerosity reduction.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction irrespective of data size, but they may not achieve as high a volume of reduction as the parametric ones. Non-parametric data reduction techniques include histograms, clustering, sampling, data cube aggregation, and data compression.
1. Histogram: A histogram is a graph that represents the frequency distribution of the data, i.e., how often each value appears. A histogram uses the binning method to represent an attribute's data distribution, using disjoint subsets called bins or buckets.
2. Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters. How similar the objects inside a cluster are can be measured using a distance function: the more similar two objects in a cluster are, the closer they appear within it. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster. The cluster representation replaces the original data. This technique is more effective when the data can be classified into distinct clusters.
3. Sampling: Sampling can reduce a large data set into a much smaller data sample. Below are two common methods for sampling a large data set D containing N tuples; a short sketch of both follows this list.
o Simple random sample without replacement (SRSWOR) of size s: s tuples (s < N) are drawn from the N tuples of D such that the probability of drawing any tuple of D is 1/N, i.e., all tuples have an equal probability of being sampled.
o Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, but each drawn tuple is recorded and then replaced into the data set D so that it can be drawn again.
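A minimal sketch of both sampling schemes with pandas (the data frame is hypothetical):

import pandas as pd

# Hypothetical data set D with N = 10 tuples.
D = pd.DataFrame({"tuple_id": range(1, 11), "value": range(10, 110, 10)})

# SRSWOR: draw s = 4 tuples without replacement.
srswor = D.sample(n=4, replace=False, random_state=42)

# SRSWR: draw s = 4 tuples with replacement, so the same tuple may appear twice.
srswr = D.sample(n=4, replace=True, random_state=42)

print(srswor)
print(srswr)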

3. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.

For example, suppose you have the data of All Electronics sales per quarter for the year 2018
to the year 2022. If you want to get the annual sale per year, you just have to aggregate the
sales per quarter for each year. In this way, aggregation provides you with the required data,
which is much smaller in size, and thereby we achieve data reduction even without losing any
data.
The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. A data cube holds precomputed and summarized data, which gives data mining fast access to it.
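A minimal pandas sketch of the quarterly-to-annual aggregation described above (the sales figures are made up for illustration):

import pandas as pd

# Hypothetical quarterly sales for two years.
sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [100, 120, 90, 150, 110, 130, 95, 160],
})

# Aggregate to one row per year: a much smaller representation of the same information.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)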

4. Data Compression

Data compression employs modification, encoding, or conversion of the structure of data so that it consumes less space. It involves building a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression, whereas compression from which the exact original cannot be restored is called lossy compression. Dimensionality and numerosity reduction methods can also be viewed as forms of data compression.

This technique reduces the size of the files using different encoding mechanisms, such as
Huffman Encoding and run-length Encoding. We can divide it into two types based on their
compression techniques.
i. Lossless Compression: Encoding techniques (such as run-length encoding) allow a simple but limited reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data; a small sketch follows this list.
ii. Lossy Compression: In lossy-data compression, the decompressed data may differ
from the original data but are useful enough to retrieve information from them. For
example, the JPEG image format is a lossy compression, but we can find the meaning
equivalent to the original image.
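A small sketch of lossless compression using Python's built-in zlib module (my choice of codec; the text itself only names Huffman and run-length encoding as examples):

import zlib

original = b"AAAAAABBBBCCCCCCCCDD" * 50   # highly redundant data compresses well

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")
assert restored == original   # lossless: the exact original is recovered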

5. Discretization Operation

The data discretization technique is used to divide attributes of a continuous nature into data with intervals. We replace the many continuous values of an attribute with labels of small intervals. This means that mining results are presented in a concise and easily understandable way.

Benefits of Data Reduction

The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk space, the less capacity you will need to purchase. Here are some benefits of data reduction:

o Data reduction can save energy.
o Data reduction can reduce your physical storage costs.
o Data reduction can decrease your data center footprint.

Data reduction greatly increases the efficiency of a storage system and directly impacts your
total spending on capacity.

================================================================

Data Transformation

Data transformation is a technique used to convert the raw data into a suitable format that
efficiently eases data mining and retrieves strategic information. Data transformation
includes data cleaning techniques and data reduction techniques to convert the data into the appropriate form.

Data Transformation Techniques

There are several data transformation techniques that can help structure and clean up the data
before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing

Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting patterns. During collection, the data can be manipulated to eliminate or reduce variance and other forms of noise.

The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to
look at a lot of data which can often be difficult to digest for finding patterns that they wouldn't
see otherwise.

We have seen how noise is removed from the data using techniques such as binning, regression, and clustering.

o Binning: This method splits the sorted data into a number of bins and smooths the data values in each bin by considering the neighborhood values around them.
o Regression: This method identifies the relation between two attributes so that, if we know one attribute, it can be used to predict the other.
o Clustering: This method groups similar data values into clusters. Values that lie outside a cluster are known as outliers.

2. Attribute Construction

In the attribute construction method, new attributes are constructed from the existing attributes to produce a data set that eases data mining. New attributes are created from the given attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.

For example, suppose we have a data set of measurements of different plots, i.e., we have the height and width of each plot. We can then construct a new attribute 'area' from the attributes 'height' and 'width', as in the one-line sketch below. This also helps in understanding the relations among the attributes in a data set.
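A minimal pandas sketch of this construction (the column names follow the example above):

import pandas as pd

plots = pd.DataFrame({"height": [10, 12, 8], "width": [5, 7, 6]})
plots["area"] = plots["height"] * plots["width"]   # constructed attribute
print(plots)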

3. Data Aggregation

Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used.

Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales of each
year. We can aggregate the data to get the enterprise's annual sales report.

4. Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as [-1,
1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.

Consider that we have a numeric attribute A and n observed values for attribute A: v1, v2, v3, ..., vn.

o Min-max normalization: This method implements a linear transformation on the original data. Let minA and maxA be the minimum and maximum values observed for attribute A, and let vi be a value of attribute A that has to be normalized. Min-max normalization maps vi to v'i in a new, smaller range [new_minA, new_maxA]. The formula for min-max normalization is:

v'i = ((vi - minA) / (maxA - minA)) x (new_maxA - new_minA) + new_minA

For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600. Min-max normalization transforms $73,600 to (73,600 - 12,000) / (98,000 - 12,000) x (1.0 - 0.0) + 0.0 = 0.716.

o Z-score normalization: This method normalizes the values of attribute A using the mean and standard deviation of A. The formula for z-score normalization is:

v'i = (vi - Ā) / σA

Here Ā and σA are the mean and standard deviation of attribute A, respectively. For example, if the mean and standard deviation for attribute A are $54,000 and $16,000, the value $73,600 is normalized by z-score to (73,600 - 54,000) / 16,000 = 1.225.

o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the value. How far the decimal point is moved depends on the maximum absolute value of A. The formula for decimal scaling is:

v'i = vi / 10^j

Here j is the smallest integer such that max(|v'i|) < 1.
For example, suppose the observed values for attribute A range from -986 to 917, so the maximum absolute value of A is 986. To normalize each value of attribute A using decimal scaling, we divide each value by 1,000, i.e., j = 3. The value -986 is then normalized to -0.986, and 917 is normalized to 0.917.
The normalization parameters, such as the mean, standard deviation, and maximum absolute value, must be preserved so that future data can be normalized uniformly. A short sketch reproducing the three calculations above follows.
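The three normalizations above can be reproduced with a few lines of Python; the figures are the ones used in the examples.

def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, j):
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))    # 0.716
print(round(z_score(73600, 54000, 16000), 3))    # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))   # -0.986 0.917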

5. Data Discretization

This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to study and
analyze. If a data mining task handles a continuous attribute, its values can be replaced by discrete interval labels. This improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a
set of categorical data. Discretization also uses decision tree-based algorithms to produce short,
compact, and accurate results when using discrete values.

Data discretization can be classified into two types: supervised discretization, where class information is used, and unsupervised discretization, where it is not. Unsupervised discretization is further characterized by the direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.

For example, the values for the age attribute can be replaced by the interval labels such as (0-
10, 11-20…) or (kid, youth, adult, senior).
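A minimal sketch with pandas showing both kinds of interval labels for an age attribute (the ages themselves are made up):

import pandas as pd

ages = pd.Series([4, 15, 27, 45, 70])

# Numeric interval labels.
print(pd.cut(ages, bins=[0, 10, 20, 60, 100]))

# Conceptual labels for the same intervals.
print(pd.cut(ages, bins=[0, 10, 20, 60, 100],
             labels=["kid", "youth", "adult", "senior"]))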

6. Data Generalization

It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:

o Data cube process (OLAP) approach.


o Attribute-oriented induction (AOI) approach.

For example, age data may appear as low-level numeric values such as 20 or 30 in a dataset. Generalization transforms them to a higher conceptual level, such as the categorical values (young, old).
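A tiny sketch of generalization through a concept hierarchy mapping (the age cut-off of 40 is an assumption for the example):

def generalize_age(age):
    # Map a low-level numeric age to a higher-level concept.
    return "young" if age < 40 else "old"

print([generalize_age(a) for a in (20, 30, 55)])   # ['young', 'young', 'old']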

Data Transformation Process

The entire process for transforming data is known as ETL (Extract, Transform, and Load). Through the ETL process, analysts can convert data to its desired format. Here are the steps involved in the data transformation process:
1. Data Discovery: During the first stage, analysts work to understand and identify data
in its source format. To do this, they will use data profiling tools. This step helps
analysts decide what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data
mapping is essential to many data processes, and one misstep can lead to incorrect
analysis and ripple through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original
source. These may include structured sources such as databases or streaming
sources such as customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to write code to complete the transformation. Often, analysts generate this code with the help of data transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything
has been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target
might be a data warehouse or a database that handles both structured and unstructured
data.

Advantages of Data Transformation

Transforming data can help businesses in a variety of ways. Here are some of the essential
advantages of data transformation, such as:
o Better Organization: Transformed data is easier for both humans and computers to
use.
o Improved Data Quality: There are many risks and costs associated with bad data.
Data transformation can help your organization eliminate quality issues such as missing
values and other inconsistencies.
o Perform Faster Queries: You can quickly and easily retrieve transformed data thanks
to it being stored and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and
more sources. If there are inconsistencies in the metadata, it can be challenging to
organize and understand it. Data transformation refines your metadata, so it's easier
to organize and understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of
that data sits around unanalysed. Transformation makes it easier to get the most out
of your data by standardizing it and making it more usable.

Disadvantages of Data Transformation

While data transformation comes with a lot of benefits, still there are some challenges to
transforming data effectively, such as:

o Data transformation can be expensive. The cost is dependent on the specific infrastructure, software, and tools used to process data. Expenses may include licensing, computing resources, and hiring necessary personnel.
o Data transformation processes can be resource-intensive. Performing
transformations in an on-premises data warehouse after loading or transforming data
before feeding it into applications can create a computational burden that slows down
other operations. If you use a cloud-based data warehouse, you can do the
transformations after loading because the platform can scale up to meet demand.
o Lack of expertise and carelessness can introduce problems during transformation.
Data analysts without appropriate subject matter expertise are less likely to notice
incorrect data because they are less familiar with the range of accurate and permissible
values.
o Enterprises can perform transformations that don't suit their needs. A business
might change information to a specific format for one application only to then revert
the information to its prior format for a different application.
Ways of Data Transformation

There are several different ways to transform data, such as:

o Scripting: Data transformation through scripting involves using Python or SQL to write code that extracts and transforms data. Python and SQL are scripting languages that allow you to automate certain tasks in a program and to extract information from data sets. Scripting languages require less code than traditional programming languages, so the work is less intensive.
o On-Premises ETL Tools: ETL tools take the required work to script the data
transformation by automating the process. On-premises ETL tools are hosted on
company servers. While these tools can help save you time, using them often requires
extensive expertise and significant infrastructure costs.
o Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in
the cloud. These tools are often the easiest for non-technical users to utilize. They allow
you to collect data from any cloud source and load it into your data warehouse. With
cloud-based ETL tools, you can decide how often you want to pull data from your
source, and you can monitor your usage.

Frequent Pattern

The technique of frequent pattern mining is built upon a number of fundamental ideas. The analysis is based on transaction databases, which contain records, or transactions, that represent collections of items. The items inside these transactions are grouped together as itemsets.

The importance of patterns is measured mainly by support and confidence. Support quantifies how frequently an itemset appears in the database, whereas confidence quantifies how likely it is that a rule generated from the itemset is accurate.
Association rule
What is meant by an association rule?

Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps the items accordingly so that the relationship can be used profitably.

Association rule learning is one of the most important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by large retailers to discover the associations between items. We can understand it by taking the example of a supermarket: all products that are frequently purchased together are placed together.

For example, if a customer buys bread, he is likely to also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
How does Association Rule Learning work?

Association rule learning works on the concept of if/then statements, such as "if A, then B".

Here the 'if' element is called the antecedent, and the 'then' element is called the consequent. A relationship of this kind between two individual items is known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift
Let's understand each of them:
Support

Support is the frequency of A, i.e., how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a set of transactions T, it can be written as:

Support(X) = (number of transactions in T that contain X) / (total number of transactions in T)

Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X already occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X => Y) = Support(X ∪ Y) / Support(X)

Lift

Lift measures the strength of a rule and is defined by the following formula:

Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It is the ratio of the observed support to the support expected if X and Y were independent of each other. It has three possible cases:

o Lift = 1: the occurrence of the antecedent and that of the consequent are independent of each other.
o Lift > 1: it indicates the degree to which the two itemsets are dependent on each other.
o Lift < 1: it tells us that one item is a substitute for the other, i.e., one item has a negative effect on the other. A short sketch computing these three metrics follows.
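A short sketch computing these three metrics for a toy set of transactions (the transactions are made up):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

X, Y = {"bread"}, {"butter"}
print(support(X | Y))     # 0.5
print(confidence(X, Y))   # 0.666...
print(lift(X, Y))         # 1.333... (> 1, so bread and butter are positively associated)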

Frequent Item Set Generation

Frequent itemsets are a fundamental concept in association rule mining, a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify itemsets that occur frequently together and the relationships between them.
Need for association mining: Frequent pattern mining is the generation of association rules from a transactional dataset. If two items X and Y are purchased together frequently, then it is good to put them together in stores or to offer a discount on one item with the purchase of the other. This can really increase sales. For example, it is likely that if a customer buys milk and bread, he or she also buys butter. So the association rule is {milk, bread} => {butter}, and the seller can suggest that the customer buy butter if he or she buys milk and bread.
Example of finding frequent itemsets: consider a given dataset of transactions.

• Let's say the minimum support count is 3.
• The relation that holds is: maximal frequent => closed => frequent, i.e., every maximal frequent itemset is a closed frequent itemset, and every closed frequent itemset is frequent.
Apriori algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. The algorithm uses a breadth-first search and a hash tree to compute the itemsets efficiently.

It is mainly used for market basket analysis and helps to understand which products can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
The Apriori algorithm is a classic algorithm for frequent pattern mining.
What is Apriori Algorithm?

The Apriori algorithm is an algorithm used for mining frequent itemsets and the relevant association rules. Generally, the Apriori algorithm operates on a database containing a huge number of transactions, for example the items customers buy at a Big Bazar.

The Apriori algorithm helps customers buy their products with ease and increases the sales performance of the particular store.

Components of the Apriori algorithm

The following three components comprise the Apriori algorithm:

1. Support
2. Confidence
3. Lift

Advantages of the Apriori Algorithm

o It is used to compute large itemsets.
o It is simple to understand and apply.

Disadvantages of the Apriori Algorithm

o The Apriori algorithm is an expensive method for finding support, since the calculation has to pass through the whole database.
o Sometimes it needs a huge number of candidate rules, so it can become computationally expensive.

Consider the following dataset; we will find the frequent itemsets and generate association rules for them. The minimum support count is 2 and the minimum confidence is 60%. (A runnable sketch of the full computation is given at the end of this example.)

Step-1: K=1 (I) Create a table containing support count of each item present in
dataset – Called C1(candidate set)

(II) Compare each candidate itemset's support count with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L1.

Step-2: K=2
• Generate candidate set C2 using L1 (this is called join step). Condition of
joining Lk-1 and Lk-1 is that it should have (K-2) elements in common.
• Check whether all subsets of each itemset are frequent or not, and if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L2.
Step-3:
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is that they should have (K-2) elements in common. So here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent or not; if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L3.

Step-4:
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K = 4) is that they should have (K-2) elements in common. So here, for L3, the first two elements (items) should match.
• Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence: a confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)

So here, taking one frequent itemset as an example, we will show the rule generation.
For the itemset {I1, I2, I3} from L3, the candidate rules and their confidences are:

1. [I1 ^ I2] => [I3]: confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I2) = 2/4 × 100 = 50%
2. [I1 ^ I3] => [I2]: confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I3) = 2/4 × 100 = 50%
3. [I2 ^ I3] => [I1]: confidence = sup(I1 ^ I2 ^ I3) / sup(I2 ^ I3) = 2/4 × 100 = 50%
4. [I1] => [I2 ^ I3]: confidence = sup(I1 ^ I2 ^ I3) / sup(I1) = 2/6 × 100 = 33%
5. [I2] => [I1 ^ I3]: confidence = sup(I1 ^ I2 ^ I3) / sup(I2) = 2/7 × 100 = 28%
6. [I3] => [I1 ^ I2]: confidence = sup(I1 ^ I2 ^ I3) / sup(I3) = 2/6 × 100 = 33%

So if the minimum confidence is 50%, the first three rules can be considered strong association rules (note that with the 60% threshold stated at the start of this example, none of these rules would qualify).
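To make the walk-through concrete, the brute-force sketch below reproduces the frequent itemsets and the rule confidences computed above. The transaction table itself was not reproduced in the original text, so the nine transactions here are an assumption chosen to be consistent with the support counts used in the example (e.g. sup(I1) = 6, sup(I2) = 7, sup(I1 ^ I2 ^ I3) = 2); the code simply counts support for every candidate itemset rather than using Apriori's level-wise pruning.

from itertools import combinations

# Assumed transactions, consistent with the support counts used in the example.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPPORT_COUNT = 2

def support_count(itemset):
    # Number of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions)

# Enumerate frequent itemsets of every size (brute force, fine for a toy dataset).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c): support_count(frozenset(c))
             for c in combinations(items, k)
             if support_count(frozenset(c)) >= MIN_SUPPORT_COUNT}
    if not level:
        break
    frequent.update(level)

print(sorted((sorted(s), c) for s, c in frequent.items()))

# Rule generation for the frequent itemset {I1, I2, I3}, as in the text.
target = frozenset({"I1", "I2", "I3"})
for r in (1, 2):
    for antecedent in combinations(sorted(target), r):
        a = frozenset(antecedent)
        conf = frequent[target] / frequent[a] * 100
        print(f"{sorted(a)} => {sorted(target - a)}: confidence = {conf:.1f}%")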
