
INTRODUCTION TO AI & ML

B.TECH(CSE)-IV SEMESTER
DATA PREPROCESSING

Prepared By:
Prof.(Dr.) Namrata Dhanda
Department of Computer Science & Engineering
Amity School of Engineering & Technology
Amity University Uttar Pradesh, Lucknow
Email: [email protected]
Contact No: 8299875092, 9415094250

1
Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
2
Data Quality: Why Preprocess the
Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not,
dangling, …
 Timeliness: timely update?
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily the data can be
understood?
3
Inaccurate Data
Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and data warehouses.
There are many possible reasons for inaccurate data (i.e., having
incorrect attribute values).

 The data collection instruments used may be faulty.


 There may have been human or computer errors occurring at
data entry.
 Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit personal
information (e.g., by choosing the default value “January 1”
displayed for birthday). This is known as disguised missing
data.
4
 Errors in data transmission can also occur. There may be
technology limitations such as limited buffer size for
coordinating synchronized data transfer and consumption.
 Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input
fields (e.g., date).
 Duplicate tuples also require data cleaning.

5
Incomplete Data
Incomplete data can occur for a number of reasons.
 Attributes of interest may not always be available, such as
customer information for sales transaction data.
 Other data may not be included simply because they were not

considered important at the time of entry.


 Relevant data may not be recorded due to a misunderstanding

or because of equipment malfunctions.


 Data that were inconsistent with other recorded data may have

been deleted.
 Furthermore, the recording of the data history or modifications

may have been overlooked.


 Missing data, particularly for tuples with missing values for

some attributes, may need to be inferred.


6
Timeliness
Timeliness also affects data quality. Suppose that you are
overseeing the distribution of monthly sales bonuses to the top
sales representatives at AllElectronics.
 Several sales representatives, however, fail to submit their sales
records on time at the end of the month.
 There are also a number of corrections and adjustments that
flow in after the month’s end.
 For a period of time following each month, the data stored in
the database are incomplete. However, once all of the data are
received, it is correct.
 The fact that the month-end data are not updated in a timely
fashion has a negative impact on the data quality.

7
Believability & Interpretability

 Believability reflects how much the data are trusted by users,


while interpretability reflects how easy the data are
understood.
 Suppose that a database, at one point, had several errors, all of
which have since been corrected. The past errors, however,
had caused many problems for sales department users, and so
they no longer trust the data.
 The data also use many accounting codes, which the sales
department does not know how to interpret.
 Even though the database is now accurate, complete,
consistent, and timely, sales department users may regard it as
of low quality due to poor believability and interpretability.
8
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation

9
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
10
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect
data, e.g., instrument faulty, human or computer error,
transmission error
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data

e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers

e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names,
e.g.,

Age=“42”, Birthday=“03/07/2010”

Was rating “1, 2, 3”, now rating “A, B, C”

discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
11
Incomplete (Missing) Data
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time
of entry
 not register history or changes of the data
 Missing data may need to be inferred
12
How to Handle Missing
Data?
 Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective unless the tuple contains
several attributes with missing values. By ignoring the tuple, we
do not make use of the remaining attribute values in the tuple.
 Fill in the missing value manually: Time consuming + tedious +
infeasible for large data set with many missing values.
 Fill it in automatically with
 a global constant, e.g., "unknown" or −∞ (note that the mining
program may mistake such a label for a meaningful class)
 a measure of central tendency for the attribute (e.g., the mean
or median)
 the attribute mean or median for all samples belonging to the
same class as the tuple
 Use the most probable value to fill in the missing value: this may
be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction (see the pandas
sketch below).

14
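A minimal pandas sketch of these fill-in strategies (the DataFrame, its column names, and the constant −1 are illustrative assumptions, not part of the slides):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [45000, np.nan, 52000, np.nan, 61000],
    "class":  ["A", "A", "B", "B", "A"],
})

# Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# Fill with a measure of central tendency (the overall mean)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of all samples belonging to the same class as the tuple
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)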
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

15
How to Handle Noisy Data?
 Binning
 Binning methods smooth a sorted data value by consulting its

“neighborhood,” that is, the values around it.


 The sorted values are distributed into a number of “buckets” or

bins. Because binning methods consult the neighborhood of


values, they perform local smoothing.
 For example, suppose the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin
contains three values), as in the worked example that follows. In
smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8,
and 15 in Bin 1 is 9; therefore, each original value in this bin is
replaced by the value 9.

16
 Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median.
 In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the greater
the effect of the smoothing. Alternatively, bins may be equal
width, where the interval range of values in each bin is
constant.

17
Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
 Bin 1: 4, 8, 15

 Bin 2: 21, 21, 24

 Bin 3: 25, 28, 34

Smoothing by bin means:


 Bin 1: 9, 9, 9

 Bin 2: 22, 22, 22

 Bin 3: 29, 29, 29

Smoothing by bin boundaries:


 Bin 1: 4, 4, 15

 Bin 2: 21, 21, 24

 Bin 3: 25, 25, 34


18
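A small Python sketch of the two smoothing schemes applied to the price list above (the helper logic is one way to do it, not a prescribed implementation):

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]      # already sorted
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closest boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]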
Regression
 Regression
 Data smoothing can also be done by regression, a technique

that conforms data values to a function.


 Linear regression involves finding the “best” line to fit two

attributes (or variables) so that one attribute can be used to


predict the other.
 Multiple linear regression is an extension of linear
regression, where more than two attributes are involved
and the data are fit to a multidimensional surface.

19
Outlier Analysis
 Outliers may be detected by clustering, for example, where
similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be
considered outliers.

Figure: a 2-D customer data plot with respect to customer locations in a city, showing three data clusters. Outliers may be detected as values that lie outside the clusters.
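A rough scikit-learn sketch of the same idea: cluster synthetic 2-D "customer location" data with KMeans and flag points that lie far from their cluster centre. The synthetic data, the choice of KMeans, and the mean-plus-three-standard-deviations threshold are all assumptions made for the illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic 2-D "customer location" clusters plus one stray point
clusters = [rng.normal(loc=c, scale=0.3, size=(50, 2))
            for c in [(0, 0), (5, 5), (0, 5)]]
data = np.vstack(clusters + [np.array([[10.0, -3.0]])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
# Distance from each point to the centre of its assigned cluster
dist = np.linalg.norm(data - km.cluster_centers_[km.labels_], axis=1)

# Flag points that lie far outside their cluster as potential outliers
threshold = dist.mean() + 3 * dist.std()
print(data[dist > threshold])   # expected to contain the stray point near (10, -3)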
Data Cleaning as a Process
Data Discrepancy Detection
 Discrepancies can be caused by several factors, including poorly
designed data entry forms that have many optional fields,
human error in data entry, deliberate errors (e.g., respondents
not wanting to divulge information about themselves), and
data decay (e.g., outdated addresses).
 Discrepancies may also arise from inconsistent data
representations and inconsistent use of codes.
 Other sources of discrepancies include errors in instrumentation

devices that record data and system errors.


 Errors can also occur when the data are (inadequately) used for

purposes other than originally intended.

21
 There may also be inconsistencies due to data integration (e.g.,
where a given attribute can have different names in different
databases).
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading

 Check uniqueness rule, consecutive rule and null rule

 Use commercial tools


Data scrubbing: use simple domain knowledge (e.g.,
postal code, spell-check) to detect errors and make
corrections

Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
How can we deal with Data Discrepancy
 Use any knowledge you may already have regarding properties of
the data. Such knowledge or “data about data” is referred to as
metadata.
 For example, what are the data type and domain of each
attribute? What are the acceptable values for each attribute?
The basic statistical data descriptions are useful to grasp data
trends and identify anomalies. For example, find the mean,
median, and mode values. Are the data symmetric or skewed?
What is the range of values? Do all values fall within the
expected range? What is the standard deviation of each
attribute? Values that are more than two standard deviations
away from the mean for a given attribute may be flagged as
potential outliers. Are there any known dependencies between
attributes?
23
 From this, you may find noise, outliers, and unusual values that
need investigation.
 Field overloading is another error source that typically results
when developers squeeze new attribute definitions into unused
(bit) portions of already defined attributes (e.g., an unused bit of
an attribute that has a value range that uses only, say, 31 out of
32 bits).
 The data should also be examined regarding unique rules,
consecutive rules, and null rules.
 A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
 A consecutive rule says that there can be no missing values
between the lowest and highest values for the attribute, and
that all values must also be unique (e.g., as in check numbers).
24
 A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition
(e.g., where a value for a given attribute is not available), and
how such values should be handled. The reasons for missing
values may include
 the person originally asked to provide a value for the
attribute refuses and/or finds that the information
requested is not applicable (e.g., a license number attribute
left blank by nondrivers);
 the data entry person does not know the correct value;
 the value is to be provided by a later step of the process.

25
 The null rule should specify how to record the null condition,
for example, such as to store zero for numeric attributes, a
blank for character attributes, or any other conventions that
may be in use (e.g., entries like “don’t know” or “?” should be
transformed to blank).

26
Tools

 Data scrubbing tools use simple domain knowledge (e.g.,


knowledge of postal addresses and spell-checking) to detect
errors and make corrections in the data. These tools rely on
parsing and fuzzy matching techniques when cleaning data from
multiple sources.
 Data auditing tools find discrepancies by analyzing the data to
discover rules and relationships, and detecting data that violate
such conditions. They are variants of data mining tools. For
example, they may employ statistical analysis to find correlations,
or clustering to identify outliers. They may also use the basic
statistical data descriptions.

27
 Some data inconsistencies may be corrected manually using
external references. For example, errors made at data entry may
be corrected by performing a paper trace. Most errors, however,
will require data transformations. That is, once we find
discrepancies, we typically need to define and apply (a series of)
transformations to correct them.
 Data migration tools allow simple transformations to be specified
such as to replace the string “pincode” by “zipcode.” ETL
(extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI). These tools
typically support only a restricted set of transforms so that, often,
we may also choose to write custom scripts for this step of the
data cleaning process.

28
Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
29
Data Integration
 Data integration:
 Data mining requires combining data from multiple sources
into a coherent store.
 Careful integration can help reduce and avoid the
redundancies and inconsistencies in the resulting dataset.
 Thus helping in increasing the speed and accuracy of
corresponding data mining process.

 Entity identification problem:


 Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real-world entity, attribute values from different sources may differ
Entity identification Problem
 There are a number of issues to consider during data
integration.
 Schema integration and object matching can be tricky. How can
equivalent real-world entities from multiple data sources be
matched up? This is referred to as the entity identification
problem.
 For example, how can the data analyst or the computer be sure
that customer id in one database and cust number in another
refer to the same attribute?
 Examples of metadata for each attribute include the name,
meaning, data type, and range of values permitted for the
attribute, and null rules for handling blank, zero, or null values.

31
 Such metadata can be used to help avoid errors in schema
integration. The metadata may also be used to help transform
the data (e.g., where data codes for pay type in one database
may be “H” and “S” but 1 and 2 in another). Hence, this step also
relates to data cleaning.
 When matching attributes from one database to another during
integration, special attention must be paid to the structure of the
data. This is to ensure that any attribute functional dependencies
and referential constraints in the source system match those in
the target system. For example, in one system, a discount may be
applied to the order, whereas in another system it is applied to
each individual line item within the order.
 If this is not caught before integration, items in the target system
may be improperly discounted.
32
Redundancy & Correlation Analysis
 An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or set of
attributes.
 Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
 Redundant data occur often when integration of multiple
databases
 Object identification: The same attribute or object may have

different names in different databases


 Derivable data: One attribute may be a “derived” attribute in

another table, e.g., annual revenue


 Redundant attributes may be able to be detected by correlation
analysis and covariance analysis

33
Correlation Analysis (Nominal Data)
 Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the available data.
 For nominal data, we use the χ2 (chi-square) test.
 For numeric attributes, we can use the correlation coefficient and
covariance, both of which assess how one attribute's values vary
from those of another.

34
Correlation for Nominal Data

 For nominal data, a correlation relationship between two


attributes, A and B, can be discovered by a χ2 (chi-square) test.
 Suppose A has c distinct values, namely a1, a2, ..., ac.
 B has r distinct values, namely b1, b2, ..., br.
 The data tuples described by A and B can be shown as a
contingency table, with the c values of A making up the columns
and the r values of B making up the rows.
 Let (Ai ,Bj) denote the joint event that attribute A takes on value ai
and attribute B takes on value bj , that is, where (A=ai ,B= bj ).

 Each possible (Ai, Bj) joint event has its own cell (or slot) in the
table. The χ2 value (also known as the Pearson χ2 statistic) is
computed as

χ2 = Σi=1..c Σj=1..r (oij − eij)2 / eij                          (Eq. 1)

where oij is the observed frequency (i.e., the actual count) of the
joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj),
which can be computed as

eij = count(A = ai) × count(B = bj) / n                          (Eq. 2)

where n is the number of data tuples, count(A = ai) is the number
of tuples having value ai for A, and count(B = bj) is the number of
tuples having value bj for B.

 The sum in the first equation (Eq. 1) is computed over all of the r × c
cells.
 The cells that contribute the most to the χ2 value are those for
which the actual count is very different from that expected.
 The χ2 statistic tests the hypothesis that A and B are
independent, that is, there is no correlation between them. The
test is based on a significance level, with (r-1) X (c-1) degrees of
freedom.
 If the hypothesis can be rejected, then we say that A and B are
statistically correlated.

Example
Suppose that a group of 1500 people was surveyed. The
gender of each person was noted. Each person was polled as
to whether his or her preferred type of reading material was
fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading. The observed frequency (or count) of each
possible joint event is summarized in the contingency table
below:
 where the numbers in parentheses are the expected

frequencies. The expected frequencies are calculated


based on the data distribution for both attributes using Eq.2.
 Using Eq.2, we can verify the expected frequencies for

each cell.

For example, the expected frequency for the cell (male, fiction) is:
e11 = count(male) × count(fiction) / n = 300 × 450 / 1500 = 90, and so on.

 In any row, the sum of the expected frequencies must


equal to the total observed frequency for that row, and the
sum of the expected frequencies in any column must also
equal the total observed frequency for that column.

Chi-Square Calculation: An Example

                            Male        Female       Sum (row)
 Like science fiction       250 (90)    200 (360)    450
 Not like science fiction   50 (210)    1000 (840)   1050
 Sum (col.)                 300         1200         1500

 χ2 (chi-square) calculation (numbers in parentheses are the expected
counts, calculated based on the data distribution in the two categories):

χ2 = (250 − 90)2/90 + (50 − 210)2/210 + (200 − 360)2/360 + (1000 − 840)2/840 = 507.93

40
 For this 2 × 2 table, the degrees of freedom are (2 − 1) × (2 − 1) = 1.
For 1 degree of freedom, the χ2 value needed to reject the
hypothesis at the 0.001 significance level is 10.828 (taken from
the table of upper percentage points of the χ2 distribution,
available in any statistics textbook).
 Since our computed value is above this, we can reject the
hypothesis that gender and preferred reading are
independent and conclude that the two attributes are
(strongly) correlated for the given group of people.
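The hand calculation can be checked with SciPy (assuming SciPy is available); correction=False turns off Yates' continuity correction so the result matches the value above:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction: male, female
                     [50, 1000]])    # not like science fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)       # ~507.93
print(dof)        # 1
print(expected)   # [[ 90. 360.] [210. 840.]]
print(p_value)    # far below 0.001, so the independence hypothesis is rejected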

Correlation Analysis (Numeric Data)
 For numeric attributes, we can evaluate the correlation
between two attributes, A and B, by computing the
correlation coefficient (also known as Pearson’s
product moment coefficient, named after its inventer,
Karl Pearson).
 Correlation coefficient is given by the formula:

rA,B = Σi=1..n (ai − Ā)(bi − B̄) / ((n − 1) σA σB) = (Σi=1..n ai·bi − n·Ā·B̄) / ((n − 1) σA σB)

where n is the number of tuples, Ā and B̄ are the respective means
of A and B, σA and σB are the respective standard deviations of A
and B, and Σ(ai·bi) is the sum of the AB cross-product.
Interpretations
 Note that -1≤rA,B ≤1
 If rA,B > 0, then A and B are positively correlated (A’s values
increases as B’s). The higher the value the stronger the
correlation (i.e., the more each attribute implies the other).
Hence, a higher value may indicate that A (or B) may be
removed as a redundancy.
 If rA,B = 0, then A and B are independent of each other and
there is no correlation between them.
 If rAB < 0 then A and B are negatively correlated that means
when one value increases the other decreases. This means
that each attribute discourages the other.
 Scatter plots can be used to visualize the correlation between the
attributes.
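A minimal NumPy sketch computing rA,B both from the definition above and with np.corrcoef (the five pairs of values are purely illustrative):

import numpy as np

a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(a)
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)   # both ~0.87 > 0, i.e. positively correlated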
Visually Evaluating Correlation

Figure: scatter plots showing the similarity (correlation) ranging from −1 to 1.

44
Covariance
 In probability theory and statistics, correlation and covariance are
two similar measures for assessing how much two attributes
change together.
 Consider two numeric attributes A and B, and a set of n
observations {(a1,b1),(a2,b2),….…….,(an ,bn)}. The mean values of A
and B, respectively, are also known as the expected values on A
and B, that is:
E(A) = Ā = (Σi=1..n ai)/n and E(B) = B̄ = (Σi=1..n bi)/n
 The covariance between A and B is defined as:
Cov(A,B) = E((A − Ā)(B − B̄)) = (Σi=1..n (ai − Ā)(bi − B̄))/n, and the
correlation coefficient can be written as rA,B = Cov(A,B)/(σA σB).

 where σA and σB are the standard deviations of A and B,
respectively. It can also be shown that
Cov(A,B) = E(A·B) − Ā·B̄
 For two attributes A and B that tend to change together, if A is
larger than Ā (the expected value of A), then B is likely to be larger
than B̄ (the expected value of B). Therefore, the covariance
between A and B is positive.
 On the other hand, if one of the attributes tends to be above its
expected value when the other attribute is below its expected
value, then the covariance of A and B is negative.
 If A and B are independent (i.e., they have no correlation), then
E(A·B) = E(A)·E(B). Therefore Cov(A,B) = E(A·B) − Ā·B̄ = E(A)·E(B) − E(A)·E(B) = 0.
Covariance Example
Consider the Table below which presents a simplified example of
stock prices observed at five time points for AllElectronics and
HighTech, a high-tech company. If the stocks are affected by the
same industry trends, will their prices rise or fall together?
Time Point AllElectronics HighTech

T1 6 20
T2 5 10
T3 4 14
T4 3 5
T5 2 5

E(AllElectronics) = (6 + 5 + 4 + 3 + 2)/5 = 20/5 = 4
E(HighTech) = (20 + 10 + 14 + 5 + 5)/5 = 54/5 = 10.80

Cov(AllElectronics, HighTech) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 − 4 × 10.80 = 50.2 − 43.2 = 7

Thus, given the positive covariance, we can say that the stock prices of
the two companies rise together.
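A quick NumPy check of this example (np.cov with bias=True uses the same 1/n averaging as the E(A·B) − Ā·B̄ formula above):

import numpy as np

all_electronics = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
high_tech = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

# Cov(A, B) = E(A·B) − E(A)·E(B)
cov = (all_electronics * high_tech).mean() - all_electronics.mean() * high_tech.mean()
print(cov)                                                   # 7.0
print(np.cov(all_electronics, high_tech, bias=True)[0, 1])   # 7.0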

Tuple Duplication
 In addition to detecting redundancies between attributes,
duplication should also be detected at the tuple level (e.g.,
where there are two or more identical tuples for a given unique
data entry case).
 The use of denormalized tables (often done to improve
performance by avoiding joins) is another source of data
redundancy.
 Inconsistencies often arise between various duplicates, due to
inaccurate data entry or updating some but not all data
occurrences. For example, if a purchase order database contains
attributes for the purchaser’s name and address instead of a key
to this information in a purchaser database, discrepancies can
occur, such as the same purchaser’s name appearing with
different addresses within the purchase order database.
Data Value Conflict Detection and
Resolution
 Data integration also involves the detection and resolution of
data value conflicts. For example, for the same real-world entity,
attribute values from different sources may differ.
 This may be due to differences in representation, scaling, or
encoding. For instance, a weight attribute may be stored in
metric units in one system and British imperial units in another.
For a hotel chain, the price of rooms in different cities may
involve not only different currencies but also different services
(e.g., free breakfast) and taxes. When exchanging information
between schools, for example, each school may have its own
curriculum and grading scheme.

 One university may adopt a quarter system, offer three courses
on database systems, and assign grades from A+ to F, whereas
another may adopt a semester system, offer two courses on
databases, and assign grades from 1 to 10. It is difficult to work
out precise course-to-grade transformation rules between the
two universities, making information exchange difficult.
 Attributes may also differ on the abstraction level, where an
attribute in one system is recorded at, say, a lower abstraction
level than the “same” attribute in another. For example, the total
sales in one database may refer to one branch of All Electronics,
while an attribute of the same name in another database may
refer to the total sales for All Electronics stores in a given region.

Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary

52
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data
set that is much smaller in volume but yet produces the same
(or almost the same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very
long time to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant
attributes

Wavelet transforms

Principal Components Analysis (PCA)

Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)


Regression and Log-Linear Models

Histograms, clustering, sampling

Data cube aggregation
 Data compression

53
Data Reduction 1: Dimensionality
Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to
clustering, outlier analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

54
Mapping Data to a New Space
 Fourier transform
 Wavelet transform

Figure: two sine waves; the two sine waves plus noise; and the corresponding frequency-domain representation.

55
What Is Wavelet Transform?
 The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it to a
numerically different vector, X’, of wavelet coefficients.
 The two vectors are of the same length. When applying this
technique to data reduction, we consider each tuple as an n-
dimensional data vector, that is, X=(x1,x2,……,xn), depicting n
measurements made on the tuple from n database attributes.
 The usefulness lies in the fact that the wavelet transformed data
can be truncated.
 A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet
coefficients.

56
 For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are set
to 0.
 The resulting data representation is therefore very sparse, so
that operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space.
 The technique also works to remove noise without smoothing
out the main features of the data, making it effective for data
cleaning as well.
 Given a set of coefficients, an approximation of the original data
can be constructed by applying the inverse of the DWT used.
 The DWT is closely related to the discrete Fourier transform
(DFT), a signal processing technique involving sines and cosines.
In general, however, the DWT achieves better lossy compression.
That is, if the same number of coefficients is retained for a DWT
and a DFT of a given data vector, the DWT version will provide a
more accurate approximation of the original data. Hence, for an
equivalent approximation, the DWT
requires less space than the DFT.
 Popular wavelet transforms include
the Haar-2, Daubechies-4, and
Daubechies-6.

Figure: the Haar-2 and Daubechies-4 wavelets.

58
Method
The general procedure for applying a discrete wavelet transform
uses a hierarchical pyramid algorithm that halves the data at each
iteration, resulting in fast computational speed. The method is as
follows:
1. The length, L, of the input data vector must be an integer power
of 2. This condition can be met by padding the data vector with
zeros as necessary (L≥n).
2. Each transform involves applying two functions. The first applies
some data smoothing, such as a sum or weighted average. The
second performs a weighted difference, which acts to bring out
the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that
is, to all pairs of measurements (x2i, x2i+1). This results in two data
sets of length L/2.
In general, these represent a smoothed or low-frequency version of
the input data and the high frequency content of it, respectively.
4. The two functions are recursively applied to the data sets
obtained in the previous loop, until the resulting data sets
obtained are of length 2.
5. Selected values from the data sets obtained in the previous
iterations are designated the wavelet coefficients of the
transformed data.
Equivalently, a matrix multiplication can be applied to the input
data in order to obtain the wavelet coefficients, where the matrix
used depends on the given DWT. The matrix must be orthonormal,
meaning that the columns are unit vectors and are mutually
orthogonal, so that the matrix inverse is just its transpose.
60
Wavelet Decomposition
 Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
 S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2¾, −1¼, ½,
0, 0, −1, −1, 0], i.e., [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
 Compression: many small detail coefficients can be replaced by
0’s, and only the significant coefficients are retained

61
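A bare-bones Python sketch of this unnormalized Haar decomposition (pairwise averages and halved differences, applied recursively); it illustrates the idea rather than a production DWT:

def haar_dwt(signal):
    """Return [overall average] + detail coefficients from coarsest to finest."""
    coeffs = []
    data = list(signal)
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = details + coeffs        # finer-level details sit to the right
        data = averages
    return data + coeffs

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]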
Haar Wavelet Coefficients
Figure: hierarchical decomposition structure (a.k.a. the "error tree") for the original frequency distribution 2, 2, 0, 2, 3, 5, 4, 4. The root holds the overall average 2.75; the internal nodes hold the detail coefficients −1.25, 0.5, 0, 0, −1, −1, 0 (each labelled with the +/− pattern of leaves it supports); the leaves are the original values.
62
Why Wavelet Transform?
 Use hat-shape filters
 Emphasize region where points cluster

 Suppress weaker information in their boundaries

 Effective removal of outliers


 Insensitive to noise, insensitive to input order

 Multi-resolution
 Detect arbitrary shaped clusters at different scales

 Efficient
 Complexity O(N)

 Only applicable to low dimensional data

63
Principal Component Analysis (PCA)
 Used for Dimensionality Reduction.
 Find a projection that captures the largest amount of
variation in data
 The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the
eigenvectors of the covariance matrix, and these
eigenvectors define the new space.

Figure: a 2-D data set plotted on axes x1 and x2, together with its
principal directions.
64
 Suppose that the data to be reduced consist of tuples or data
vectors described by n attributes or dimensions.
 Principal components analysis (PCA; also called the Karhunen-
Loeve, or K-L, method) searches for k n-dimensional orthogonal
vectors that can best be used to represent the data, where k≤ n.
 The original data are thus projected onto a much smaller space,
resulting in dimensionality reduction.
 PCA “combines” the essence of attributes by creating an
alternative, smaller set of variables. The initial data can then be
projected onto this smaller set.
 PCA often reveals relationships that were not previously
suspected and thereby allows interpretations that would not
ordinarily result.
65
Principal Component Analysis
(Steps)

1. The input data are normalized, so that each attribute falls within
the same range. This step helps ensure that attributes with large
domains will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the
normalized input data. These are unit vectors that each point in
a direction perpendicular to the others. These vectors are
referred to as the principal components. The input data are a
linear combination of the principal components.
3. The principal components are sorted in order of decreasing
“significance” or strength. The principal components essentially
serve as a new set of axes for the data providing important
information about variance.
66
That is, the sorted axes are such that the first axis shows the most
variance among the data, the second axis shows the next highest
variance, and so on. For example, Figure 3.5 shows the first two
principal components, Y1 and Y2, for the given set of data originally
mapped to the axes X1 and X2. This information helps identify
groups or patterns within the data.

4. Because the components are sorted in decreasing order of


“significance,” the data size can be reduced by eliminating the
weaker components, that is, those with low variance. Using the
strongest principal components, it should be possible to
reconstruct a good approximation of the original data.

67
Applications
 PCA can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data.
 Multidimensional data of more than two dimensions can be
handled by reducing the problem to two dimensions.
 Principal components may be used as inputs to multiple
regression and cluster analysis. In comparison with wavelet
transforms, PCA tends to be better at handling sparse data,
whereas wavelet transforms are more suitable for data of high
dimensionality.

68
Example
 Given the data points (X, Y) = (2, 4), (1, 3), (0, 1), (−1, 0.5), find the principal components.
 Determine the means: X̄ = 0.5 and Ȳ = 2.125.

  X     Y     X−X̄    Y−Ȳ     (X−X̄)(Y−Ȳ)   (X−X̄)2   (Y−Ȳ)2
  2     4     1.5     1.875    2.8125        2.25      3.5156
  1     3     0.5     0.875    0.4375        0.25      0.7656
  0     1    −0.5    −1.125    0.5625        0.25      1.2656
 −1     0.5  −1.5    −1.625    2.4375        2.25      2.6406
 Sum          0       0        6.25          5.00      8.1874

69
 Cov(x,x) = Σ(X−X̄)2/(n−1) = 5/3 = 1.67
 Cov(y,y) = Σ(Y−Ȳ)2/(n−1) = 8.1874/3 = 2.73
 Cov(x,y) = Cov(y,x) = Σ(X−X̄)(Y−Ȳ)/(n−1) = 6.25/3 = 2.083

 Make the covariance matrix:
S = | 1.67    2.083 |
    | 2.083   2.73  |
 Solve the equation |S − 𝜆I| = 0, where S is the covariance matrix
and I is the identity matrix of the same dimension as S.

70
|S − 𝜆I| = 0, that is,
det | 1.67−𝜆   2.083  | = 0
    | 2.083    2.73−𝜆 |
that is, (1.67 − 𝜆)(2.73 − 𝜆) − 2.083 × 2.083 = 0
that is, 𝜆2 − 4.4𝜆 + 0.22 = 0
giving 𝜆1 = 4.3494 and 𝜆2 = 0.0506.
Consider 𝜆1: (S − 𝜆1 I)·(a11, a12)T = 0 gives
−2.6794 a11 + 2.083 a12 = 0
2.083 a11 − 1.6194 a12 = 0
Together with the orthonormality (unit-length) condition a112 + a122 = 1,
this gives a11 = 0.61 and a12 = 0.79.

71
Consider 𝜆2 and solve similarly: with a212 + a222 = 1, we get a21 = 0.79 and a22 = −0.61.
 Hence the principal components are given as:
z1 = a11x + a12y
z2 = a21x + a22y
that is, z1 = 0.61x + 0.79y   (most significant; captures the maximum
possible variance)
z2 = 0.79x − 0.61y            (second most significant; captures the
remaining variance, and is orthogonal to z1)
72
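The example can be reproduced with NumPy as sketched below; note that an eigenvector is only defined up to sign, so the printed directions may be the negatives of (0.61, 0.79) and (0.79, −0.61).

import numpy as np

data = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])
centered = data - data.mean(axis=0)        # means are (0.5, 2.125)

cov = np.cov(centered, rowvar=False)       # [[1.67, 2.083], [2.083, 2.73]]
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)           # ~[4.35, 0.05]
print(eigvecs[:, 0])     # first principal direction, ~(0.61, 0.79) up to sign
print(eigvecs[:, 1])     # second principal direction, ~(0.79, -0.61) up to sign

z = centered @ eigvecs   # project the centered data onto the components
print(z[:, 0])           # scores along z1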
Attribute Subset Selection
 Another way to reduce dimensionality of data.
 Data sets for analysis may contain hundreds of attributes, many of
which may be irrelevant to the mining task or redundant. For
example, if the task is to classify customers based on whether or
not they are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer’s
telephone number are likely to be irrelevant, unlike attributes
such as age or music taste.
 Although it may be possible for a domain expert to pick out some
of the useful attributes, this can be a difficult and time consuming
task, especially when the data’s behavior is not well known.

73
 Leaving out relevant attributes or keeping irrelevant attributes
may be detrimental, causing confusion for the mining algorithm
employed.
 This can result in discovered patterns of poor quality. In addition,
the added volume of irrelevant or redundant attributes can slow
down the mining process.
 Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions). The goal of
attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes
is as close as possible to the original distribution obtained using
all attributes.

74
 Mining on a reduced set of attributes has an additional benefit: It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.

75
Heuristic Search in Attribute
Selection
 For n attributes, there are 2^n possible subsets. An exhaustive
search for the optimal subset of attributes can be prohibitively
expensive, especially as n and the number of data classes
increase.
 Therefore, heuristic methods that explore a reduced search
space are commonly used for attribute subset selection. These
methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best
choice at the time.
 Their strategy is to make a locally optimal choice in the hope that
this will lead to a globally optimal solution. Such greedy methods
are effective in practice and may come close to estimating an
optimal solution.

76
 The “best” (and “worst”) attributes are typically determined
using tests of statistical significance, which assume that the
attributes are independent of one another.
 Many other attribute evaluation measures can be used such as
the information gain measure used in building decision trees for
classification.

77
Heuristic Approaches
1. Stepwise forward selection: The procedure starts with an
empty set of attributes as the reduced set. The best of the
original attributes is determined and added to the reduced set.
At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the
full set of attributes. At each step, it removes the worst attribute
remaining in the set.
3. Combination of forward selection and backward elimination:
The stepwise forward selection and backward elimination
methods can be combined so that, at each step, the procedure
selects the best attribute and removes the worst from among
the remaining attributes.

78
4. Decision tree induction: Decision tree algorithms (e.g., ID3,
C4.5, and CART) were originally intended for classification.
Decision tree induction constructs a flowchart like structure
where each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction. At
each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
When decision tree induction is used for attribute subset selection,
a tree is constructed from the given data. All attributes that do not
appear in the tree are assumed to be irrelevant. The set of
attributes appearing in the tree form the reduced subset of
attributes. The stopping criteria may vary.

79
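A hedged scikit-learn sketch of the decision-tree-induction approach: fit a tree on a synthetic classification data set and keep only the attributes the tree actually splits on. The data set, the depth limit, and the zero-importance criterion are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Attributes with zero importance never appear in the tree -> drop them
selected = np.flatnonzero(tree.feature_importances_ > 0)
print("reduced attribute subset:", selected)
print("importances:", np.round(tree.feature_importances_, 3))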
Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary

81
Data Transformation
 In data transformation, the data are transformed or
consolidated into forms appropriate for mining. It is a function
that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be
identified with one of the new values. Various methods are:
1. Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts.
82
4. Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute
(e.g., age) are replaced by interval labels (e.g., 0–10, 11–20,
etc.) or conceptual labels (e.g., youth, adult, senior). The labels,
in turn, can be recursively organized into higher-level concepts,
resulting in a concept hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be
automatically defined at the schema definition level.
Data Transformation by
Normalization
 The measurement unit used can affect the data analysis. For
example, changing measurement units from meters to inches for
height, or from kilograms to pounds for weight, may lead to very
different results. In general, expressing an attribute in smaller
units will lead to a larger range for that attribute, and thus tend
to give such an attribute greater effect or “weight.” To help avoid
dependence on the choice of measurement units, the data
should be normalized or standardized.
 This involves transforming the data to fall within a smaller or
common range such as [-1, 1] or [0.0, 1.0].
 Normalizing the data attempts to give all attributes an equal
weight. Normalization is particularly useful for classification
algorithms involving neural networks or distance measurements
such as nearest-neighbor classification and clustering.
Data Transformation by
Normalization
 Min-max normalization: Suppose that minA and maxA are the
minimum and maximum values of an attribute, A. Min-max
normalization maps a value vi of A to vi' in the range
[new_minA, new_maxA] by computing:

vi' = (vi − minA)/(maxA − minA) × (new_maxA − new_minA) + new_minA

Ex. Let income range $12,000 to $98,000 be normalized to
[0.0, 1.0]. Then $73,600 is mapped to:
v' = (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization or zero-mean normalization (μ: mean,
σ: standard deviation): The values for an attribute A are
normalized based on the mean (i.e., average) and standard
deviation of A. A value vi of A is normalized to vi' by computing:
vi' = (vi − Ā)/σA
85
This method of normalization is useful when the actual minimum
and maximum of attribute A are unknown, or when outliers
dominate the min-max normalization.
Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With
z-score normalization, a value of $73,600 for income is
transformed to
v' = (73,600 − 54,000)/16,000 = 1.225

 Normalization by decimal scaling normalizes by moving the
decimal point of values of attribute A. The number of decimal
points moved depends on the maximum absolute value of A. A
value vi of A is normalized to vi' by computing:

vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1.

 Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986. To normalize by
decimal scaling, we therefore divide each value by 1000 (i.e., j
= 3) so that -986 normalizes to -0.986 and 917 normalizes to
0.917.
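A compact Python sketch of the three normalization methods, reusing the income and decimal-scaling numbers from the slides (the small income array itself is illustrative):

import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())
print(min_max)           # 73,600 maps to ~0.716

# Z-score normalization with the given mean and standard deviation
z_score = (income - 54_000.0) / 16_000.0
print(z_score)           # 73,600 maps to 1.225

# Decimal scaling: divide by 10**j for the smallest j with max(|v'|) < 1
values = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
print(values / 10 ** j)  # [-0.986  0.917]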

Discretization
 Three types of attributes
 Nominal—values from an unordered set, e.g., color,
profession
 Ordinal—values from an ordered set, e.g., military or
academic rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into
intervals
 Interval labels can then be used to replace actual data
values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute 88
Data Discretization Methods
 Typical methods: All the methods can be applied
recursively
 Binning

Top-down split, unsupervised
 Histogram analysis

Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down
split or bottom-up merge)
 Decision-tree analysis (supervised, top-down
split)
 Correlation (e.g., 2) analysis (unsupervised,
89
Simple Discretization: Binning

 Equal-width (distance) partitioning


 Divides the range into N intervals of equal size: uniform
grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate
presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing
approximately the same number of samples
 Good data scaling (see the pandas sketch below)
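A short pandas sketch contrasting the two partitioning schemes on the price data of the next slide (pd.cut for equal width, pd.qcut for equal frequency; three bins is an illustrative choice):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)   # each interval spans (34 - 4) / 3 = 10
equal_depth = pd.qcut(prices, q=3)     # each bin holds ~4 of the 12 values

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())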
Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

91
Class Labels
(Binning vs. Clustering)

Figure: the same data discretized by equal-interval-width binning, equal-frequency binning, and K-means clustering; K-means clustering leads to better results.
92
Classification & Correlation
Analysis
 Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization
point)
 Top-down, recursive split
 Details to be covered in Chapter 7
 Correlation analysis (e.g., Chi-merge: χ2-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals
(those having similar distributions of classes, i.e., low χ2
values) to merge
93
Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in
a data warehouse
 Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult,
or senior)
 Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
 Concept hierarchy can be automatically formed for both
numeric and nominal data. For numeric data, use the
discretization methods shown earlier.
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit data
grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
95
Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data
set:
 The attribute with the most distinct values is placed at the

lowest level of the hierarchy.


 Exceptions, e.g., weekday, month, quarter, year.

country              15 distinct values
province_or_state    365 distinct values
city                 3,567 distinct values
street               674,339 distinct values


96
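A tiny pandas sketch of the distinct-value heuristic (the four-row DataFrame and its values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "street":  ["5 Oak St", "12 Elm St", "9 Pine Rd", "3 Mall Ave"],
    "city":    ["Lucknow", "Kanpur", "Lucknow", "Mumbai"],
    "state":   ["UP", "UP", "UP", "MH"],
    "country": ["India", "India", "India", "India"],
})

distinct = df.nunique().sort_values()            # fewest distinct values first
hierarchy = " < ".join(reversed(distinct.index.tolist()))
print(distinct.to_dict())   # {'country': 1, 'state': 2, 'city': 3, 'street': 4}
print(hierarchy)            # street < city < state < country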
Data Preprocessing

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary

97
Summary
 Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem

 Remove redundancies

 Detect inconsistencies

 Data reduction
 Dimensionality reduction

 Numerosity reduction

 Data compression

 Data transformation and data discretization


 Normalization

 Concept hierarchy generation

98
