AIML Data Preprocessing
B.TECH(CSE)-IV SEMESTER
DATA PREPROCESSING
Prepared By:
Prof.(Dr.) Namrata Dhanda
Department of Computer Science & Engineering
Amity School of Engineering & Technology
Amity University Uttar Pradesh, Lucknow
Email: [email protected]
Contact No: 8299875092, 9415094250
1
Data Preprocessing
5
Incomplete Data
Incomplete data can occur for a number of reasons.
Attributes of interest may not always be available, such as
customer information for sales transaction data.
Other data may not be included simply because they were not
considered important at the time of entry, and data that were
inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the data history or modifications
may have been overlooked.
7
Believability & Interpretability
9
Chapter 3: Data Preprocessing
13
Use the attribute mean or median for all samples belonging to
the same class as the given tuple (see the code sketch below).
Use the most probable value to fill in the missing value: this
may be determined with regression, inference-based tools
using a Bayesian formalism, or decision tree induction.
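As an assumed illustration of the class-wise mean strategy, the sketch below fills missing values with the mean of the same class using pandas; the column names and values are hypothetical, not taken from the slides.

```python
# A minimal sketch (assumed, not from the slides) of filling missing values
# with the class-wise attribute mean, using pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],       # hypothetical class labels
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

# Replace each missing 'income' with the mean income of tuples in the same class.
df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(df)
```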
14
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
technology limitation
incomplete data
inconsistent data
15
How to Handle Noisy Data?
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins. In smoothing
by bin means, for example, each value in a bin is replaced by the
mean value of the bin.
16
Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the greater
the effect of the smoothing. Alternatively, bins may be equal
width, where the interval range of values in each bin is
constant.
17
Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
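The following small sketch (assumed, not from the slides) reproduces the example above in Python: equal-frequency binning followed by smoothing by bin means and by bin boundaries.

```python
# Equal-frequency binning and smoothing, reproducing the price example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3

bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
smoothed_by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closest of min/max of its bin.
smoothed_by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
    for b in bins
]

print(bins)                     # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed_by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smoothed_by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```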
19
Outlier Analysis
Outliers may be detected by clustering, for example, where
similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be
considered outliers.
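As an assumed illustration of clustering-based outlier detection, the sketch below uses DBSCAN from scikit-learn, which labels points that do not fall within any cluster as noise; the data values and the parameters eps and min_samples are illustrative choices, not part of the slides.

```python
# Clustering-based outlier detection: points that belong to no cluster
# are reported as noise (label -1) by DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([[10.], [11.], [12.], [50.], [51.], [52.], [200.]])

labels = DBSCAN(eps=5, min_samples=2).fit_predict(values)
outliers = values[labels == -1]
print(outliers)   # [[200.]] -- the value that falls outside both clusters
```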
21
There may also be inconsistencies due to data integration (e.g.,
where a given attribute can have different names in different
databases).
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Data scrubbing: use simple domain knowledge (e.g.,
postal code, spell-check) to detect errors and make
corrections
Data auditing: analyze the data to discover rules and
relationships, and to detect violators (e.g., use correlation and
clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
22
How Can We Deal with Data Discrepancies?
Use any knowledge you may already have regarding properties of
the data. Such knowledge or “data about data” is referred to as
metadata.
For example, what are the data type and domain of each
attribute? What are the acceptable values for each attribute?
The basic statistical data descriptions are useful to grasp data
trends and identify anomalies. For example, find the mean,
median, and mode values. Are the data symmetric or skewed?
What is the range of values? Do all values fall within the
expected range? What is the standard deviation of each
attribute? Values that are more than two standard deviations
away from the mean for a given attribute may be flagged as
potential outliers. Are there any known dependencies between
attributes?
23
From this, you may find noise, outliers, and unusual values that
need investigation.
Field overloading is another error source that typically results
when developers squeeze new attribute definitions into unused
(bit) portions of already defined attributes (e.g., an unused bit of
an attribute that has a value range that uses only, say, 31 out of
32 bits).
The data should also be examined regarding unique rules,
consecutive rules, and null rules.
A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
A consecutive rule says that there can be no missing values
between the lowest and highest values for the attribute, and
that all values must also be unique (e.g., as in check numbers).
24
A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition
(e.g., where a value for a given attribute is not available), and
how such values should be handled. The reasons for missing
values may include
the person originally asked to provide a value for the
attribute refuses and/or finds that the information
requested is not applicable (e.g., a license number attribute
left blank by nondrivers);
the data entry person does not know the correct value;
the value is to be provided by a later step of the process.
25
The null rule should specify how to record the null condition,
for example, such as to store zero for numeric attributes, a
blank for character attributes, or any other conventions that
may be in use (e.g., entries like “don’t know” or “?” should be
transformed to blank).
26
Tools
27
Some data inconsistencies may be corrected manually using
external references. For example, errors made at data entry may
be corrected by performing a paper trace. Most errors, however,
will require data transformations. That is, once we find
discrepancies, we typically need to define and apply (a series of)
transformations to correct them.
Data migration tools allow simple transformations to be specified
such as to replace the string “pincode” by “zipcode.” ETL
(extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI). These tools
typically support only a restricted set of transforms so that, often,
we may also choose to write custom scripts for this step of the
data cleaning process.
28
Data Preprocessing
31
Such metadata can be used to help avoid errors in schema
integration. The metadata may also be used to help transform
the data (e.g., where data codes for pay type in one database
may be “H” and “S” but 1 and 2 in another). Hence, this step also
relates to data cleaning.
When matching attributes from one database to another during
integration, special attention must be paid to the structure of the
data. This is to ensure that any attribute functional dependencies
and referential constraints in the source system match those in
the target system. For example, in one system, a discount may be
applied to the order, whereas in another system it is applied to
each individual line item within the order.
If this is not caught before integration, items in the target system
may be improperly discounted.
32
Redundancy & Correlation Analysis
An attribute (such as annual revenue, for instance) may be
redundant if it can be “derived” from another attribute or set of
attributes.
Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
Redundant data often occur when integrating multiple databases:
Object identification: the same attribute or object may have
different names in different databases.
Derivable data: one attribute may be a “derived” attribute
computed from others (e.g., annual revenue).
33
Correlation Analysis (Nominal Data)
Given two attributes, correlation analysis can measure how
strongly one attribute implies the other, based on the available data.
For nominal data, we use the χ2 (chi-square) test.
For numeric attributes, we can use the correlation coefficient and
covariance, both of which assess how one attribute’s values vary
from those of another.
34
Correlation for Nominal Data
The χ2 statistic is computed as χ2 = Σ (oij − eij)² / eij, summed over
every cell of the contingency table, where oij is the observed
(actual) count and eij is the expected count for each cell. The cells
that contribute the most to the χ2 value are those whose observed
count differs most from the expected count.
40
For this 2 × 2 table, the degrees of freedom are (2−1)×(2−1) = 1.
For 1 degree of freedom, the χ2 value needed to reject the
hypothesis at the 0.001 significance level is 10.828 (taken from
the table of upper percentage points of the χ2 distribution,
available in any statistics textbook).
Since our computed value is above this, we can reject the
hypothesis that gender and preferred reading are
independent and conclude that the two attributes are
(strongly) correlated for the given group of people.
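To make the test concrete, the following hedged sketch runs a χ2 test on a 2 × 2 contingency table with SciPy; the counts are illustrative and are not taken from the slide.

```python
# Chi-square test of independence for two nominal attributes
# (gender vs. preferred reading), using illustrative counts.
from scipy.stats import chi2_contingency

# rows: gender (male, female); columns: preferred reading (fiction, non-fiction)
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p_value)
# With dof = 1, a chi2 value far above 10.828 lets us reject independence
# at the 0.001 significance level.
```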
rA,B = Σi=1..n (ai − Ā)(bi − B̄) / ((n−1)·σA·σB) = (Σi=1..n ai·bi − n·Ā·B̄) / ((n−1)·σA·σB)
where n is the number of tuples, Ā and B̄ are the mean values of A and B, and σA and σB are their respective standard deviations.
[Figure: scatter plots illustrating correlation values ranging from −1 to +1.]
44
Covariance
In probability theory and statistics, correlation and covariance are
two similar measures for assessing how much two attributes
change together.
Consider two numeric attributes A and B, and a set of n
observations {(a1,b1),(a2,b2),….…….,(an ,bn)}. The mean values of A
and B, respectively, are also known as the expected values on A
and B, that is:
E(A) = Ā = (Σi=1..n ai) / n and E(B) = B̄ = (Σi=1..n bi) / n
The covariance between A and B is defined as:
Cov(A,B) = E((A − Ā)(B − B̄)) = Σi=1..n (ai − Ā)(bi − B̄) / n
and the correlation coefficient can be expressed as rA,B = Cov(A,B) / (σA·σB).
Example: the prices of two companies’ stocks observed at five time points:
Time   Company A   Company B
T1     6           20
T2     5           10
T3     4           14
T4     3           5
T5     2           5
E(A) = (6+5+4+3+2)/5 = 4 and E(B) = (20+10+14+5+5)/5 = 10.8, so
Cov(A,B) = [(6−4)(20−10.8) + (5−4)(10−10.8) + (4−4)(14−10.8) + (3−4)(5−10.8) + (2−4)(5−10.8)] / 5 = 35/5 = 7.
Thus, given the positive covariance, we can say that the stock prices of
the two companies tend to rise together.
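As a quick numerical check of the example above, the covariance and correlation can be computed with NumPy; this is an assumed sketch, not code from the slides.

```python
# Numerical check of the stock-price covariance example.
import numpy as np

a = np.array([6, 5, 4, 3, 2])      # Company A prices at T1..T5
b = np.array([20, 10, 14, 5, 5])   # Company B prices at T1..T5

cov_ab = np.mean((a - a.mean()) * (b - b.mean()))   # covariance as defined above
corr_ab = cov_ab / (a.std() * b.std())              # correlation coefficient r(A,B)
print(cov_ab, corr_ab)   # 7.0 and a positive correlation (~0.87)
```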
52
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data
set that is much smaller in volume but yet produces the same
(or almost the same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very
long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant
attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
53
Data Reduction 1: Dimensionality
Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to
clustering, outlier analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
54
Mapping Data to a New Space
Fourier transform
Wavelet transform
55
What Is Wavelet Transform?
The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it to a
numerically different vector, X’, of wavelet coefficients.
The two vectors are of the same length. When applying this
technique to data reduction, we consider each tuple as an n-
dimensional data vector, that is, X=(x1,x2,……,xn), depicting n
measurements made on the tuple from n database attributes.
The usefulness lies in the fact that the wavelet transformed data
can be truncated.
A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet
coefficients.
56
For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are set
to 0.
The resulting data representation is therefore very sparse, so
that operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space.
The technique also works to remove noise without smoothing
out the main features of the data, making it effective for data
cleaning as well.
Given a set of coefficients, an approximation of the original data
can be constructed by applying the inverse of the DWT used.
The DWT is closely related to the discrete Fourier transform
(DFT), a signal processing technique involving sines and cosines.
In general, however, the DWT achieves better lossy compression.
That is, if the same number of coefficients is retained for a DWT
and a DFT of a given data vector, the DWT version will provide a
more accurate approximation of the original data. Hence, for an
equivalent approximation, the DWT
requires less space than the DFT.
Popular wavelet transforms include
the Haar-2, Daubechies-4, and
Daubechies-6.
[Figure: the Haar-2 and Daubechies-4 wavelet functions.]
58
Method
The general procedure for applying a discrete wavelet transform
uses a hierarchical pyramid algorithm that halves the data at each
iteration, resulting in fast computational speed. The method is as
follows:
1. The length, L, of the input data vector must be an integer power
of 2. This condition can be met by padding the data vector with
zeros as necessary (L≥n).
2. Each transform involves applying two functions. The first applies
some data smoothing, such as a sum or weighted average. The
second performs a weighted difference, which acts to bring out
the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that
is, to all pairs of measurements (x2i ,x2i+1). This results in two data
sets of length L/2.
59
In general, these represent a smoothed or low-frequency version of
the input data and the high frequency content of it, respectively.
4. The two functions are recursively applied to the data sets
obtained in the previous loop, until the resulting data sets
obtained are of length 2.
5. Selected values from the data sets obtained in the previous
iterations are designated the wavelet coefficients of the
transformed data.
Equivalently, a matrix multiplication can be applied to the input
data in order to obtain the wavelet coefficients, where the matrix
used depends on the given DWT. The matrix must be orthonormal,
meaning that the columns are unit vectors and are mutually
orthogonal, so that the matrix inverse is just its transpose.
60
Wavelet Decomposition
Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2.75, -1.25, 0.5,
0, 0, -1, -1, 0]
Compression: many small detail coefficients can be replaced by
0’s, and only the significant coefficients are retained
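The following minimal sketch (assumed, not taken from the textbook) implements the pairwise average/difference procedure described in the method above and reproduces the coefficients for S.

```python
# Haar wavelet decomposition via repeated pairwise averaging and differencing.
def haar_decompose(signal):
    """Return Haar wavelet coefficients: the overall average followed by
    detail coefficients from the coarsest to the finest level."""
    data = list(signal)
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs    = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details   # prepend so finer details end up last
        data = averages
    return data + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```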
61
Haar Wavelet Coefficients
[Figure: hierarchical decomposition structure (a.k.a. the “error tree”) for S = [2, 2, 0, 2, 3, 5, 4, 4]. The root holds the overall average 2.75; beneath it are the detail coefficients −1.25, then 0.5 and 0, then 0, −1, −1, 0; the leaves are the original values 2, 2, 0, 2, 3, 5, 4, 4.]
62
Why Wavelet Transform?
Use hat-shape filters
Emphasize region where points cluster
Multi-resolution
Detect arbitrary shaped clusters at different scales
Efficient
Complexity O(N)
63
Principal Component Analysis (PCA)
Used for Dimensionality Reduction.
Find a projection that captures the largest amount of
variation in data
The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the
eigenvectors of the covariance matrix, and these
eigenvectors define the new space.
[Figure: data points in the (x1, x2) plane with the principal directions overlaid.]
64
Suppose that the data to be reduced consist of tuples or data
vectors described by n attributes or dimensions.
Principal components analysis (PCA; also called the Karhunen-
Loeve, or K-L, method) searches for k n-dimensional orthogonal
vectors that can best be used to represent the data, where k≤ n.
The original data are thus projected onto a much smaller space,
resulting in dimensionality reduction.
PCA “combines” the essence of attributes by creating an
alternative, smaller set of variables. The initial data can then be
projected onto this smaller set.
PCA often reveals relationships that were not previously
suspected and thereby allows interpretations that would not
ordinarily result.
65
Principal Component Analysis
(Steps)
1. The input data are normalized, so that each attribute falls within
the same range. This step helps ensure that attributes with large
domains will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the
normalized input data. These are unit vectors that each point in
a direction perpendicular to the others. These vectors are
referred to as the principal components. The input data are a
linear combination of the principal components.
3. The principal components are sorted in order of decreasing
“significance” or strength. The principal components essentially
serve as a new set of axes for the data providing important
information about variance.
66
That is, the sorted axes are such that the first axis shows the most
variance among the data, the second axis shows the next highest
variance, and so on. For example, Figure 3.5 shows the first two
principal components, Y1 and Y2, for the given set of data originally
mapped to the axes X1 and X2. This information helps identify
groups or patterns within the data.
67
Applications
PCA can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data.
Multidimensional data of more than two dimensions can be
handled by reducing the problem to two dimensions.
Principal components may be used as inputs to multiple
regression and cluster analysis. In comparison with wavelet
transforms, PCA tends to be better at handling sparse data,
whereas wavelet transforms are more suitable for data of high
dimensionality.
68
Example
Given a small set of (x, y) observations, find the principal components.
Step 1: Determine the means of the two attributes: x̄ = 0.5 and ȳ = 2.125.
69
Cov(x,x) = Σ(xi − x̄)² / (n−1) = 5/3 = 1.67
Cov(y,y) = Σ(yi − ȳ)² / (n−1) = 8.1874/3 = 2.73
Cov(x,y) = Cov(y,x) = Σ(xi − x̄)(yi − ȳ) / (n−1) = 6.25/3 = 2.083
S = covariance matrix =
    | 1.67    2.083 |
    | 2.083   2.73  |
Solve the equation |S − λI| = 0, where S is the covariance matrix and
I is the identity matrix whose dimension is the same as that of the
covariance matrix.
70
|S − λI| = 0, that is,
| 1.67 − λ    2.083     |
| 2.083       2.73 − λ  | = 0
that is, (1.67 − λ)(2.73 − λ) − 2.083 × 2.083 = 0
that is, λ² − 4.4λ + 0.22 = 0
which gives λ1 = 4.3494 and λ2 = 0.0506.
Consider λ1 and solve (S − λ1·I)·a = 0 for a = (a11, a12):
−2.6794·a11 + 2.083·a12 = 0
2.083·a11 − 1.6194·a12 = 0
For an orthogonal transformation we also require a11² + a12² = 1.
71
Solving for λ1 with a11² + a12² = 1 gives a11 = 0.61 and a12 = 0.79;
similarly, solving for λ2 with a21² + a22² = 1 gives a21 = 0.79 and
a22 = −0.61.
Hence the principal components are given as:
z1 = a11·x + a12·y
z2 = a21·x + a22·y
that is, z1 = 0.61x + 0.79y   (most significant; captures the maximum
possible variance)
z2 = 0.79x − 0.61y   (second most significant; captures the remaining
variance, and so on)
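As a hedged numerical check of the worked example, the following sketch eigendecomposes the covariance matrix obtained above with NumPy; it is an assumed illustration, not part of the original slides.

```python
# Eigendecomposition of the covariance matrix from the worked PCA example.
import numpy as np

S = np.array([[1.67,  2.083],
              [2.083, 2.73 ]])

eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigh: symmetric matrices
order = np.argsort(eigenvalues)[::-1]           # sort by decreasing eigenvalue
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)        # approximately [4.35, 0.05]
print(eigenvectors[:, 0]) # roughly (0.61, 0.79), up to an overall sign
print(eigenvectors[:, 1]) # roughly (0.79, -0.61), up to an overall sign
```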
72
Attribute Subset Selection
Another way to reduce dimensionality of data.
Data sets for analysis may contain hundreds of attributes, many of
which may be irrelevant to the mining task or redundant. For
example, if the task is to classify customers based on whether or
not they are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer’s
telephone number are likely to be irrelevant, unlike attributes
such as age or music taste.
Although it may be possible for a domain expert to pick out some
of the useful attributes, this can be a difficult and time consuming
task, especially when the data’s behavior is not well known.
73
Leaving out relevant attributes or keeping irrelevant attributes
may be detrimental, causing confusion for the mining algorithm
employed.
This can result in discovered patterns of poor quality. In addition,
the added volume of irrelevant or redundant attributes can slow
down the mining process.
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions). The goal of
attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes
is as close as possible to the original distribution obtained using
all attributes.
74
Mining on a reduced set of attributes has an additional benefit: It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
75
Heuristic Search in Attribute
Selection
For n attributes, there are 2^n possible subsets. An exhaustive
search for the optimal subset of attributes can be prohibitively
expensive, especially as n and the number of data classes
increase.
Therefore, heuristic methods that explore a reduced search
space are commonly used for attribute subset selection. These
methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best
choice at the time.
Their strategy is to make a locally optimal choice in the hope that
this will lead to a globally optimal solution. Such greedy methods
are effective in practice and may come close to estimating an
optimal solution.
76
The “best” (and “worst”) attributes are typically determined
using tests of statistical significance, which assume that the
attributes are independent of one another.
Many other attribute evaluation measures can be used such as
the information gain measure used in building decision trees for
classification.
77
Heuristic Approaches
1. Stepwise forward selection: The procedure starts with an
empty set of attributes as the reduced set. The best of the
original attributes is determined and added to the reduced set.
At each subsequent iteration or step, the best of the remaining
original attributes is added to the set (see the code sketch after this list).
2. Stepwise backward elimination: The procedure starts with the
full set of attributes. At each step, it removes the worst attribute
remaining in the set.
3. Combination of forward selection and backward elimination:
The stepwise forward selection and backward elimination
methods can be combined so that, at each step, the procedure
selects the best attribute and removes the worst from among
the remaining attributes.
78
4. Decision tree induction: Decision tree algorithms (e.g., ID3,
C4.5, and CART) were originally intended for classification.
Decision tree induction constructs a flowchart like structure
where each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction. At
each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
When decision tree induction is used for attribute subset selection,
a tree is constructed from the given data. All attributes that do not
appear in the tree are assumed to be irrelevant. The set of
attributes appearing in the tree form the reduced subset of
attributes. The stopping criteria may vary.
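The following is a minimal sketch of stepwise forward selection (approach 1 above), assuming scikit-learn is available; the data set, classifier, evaluation measure, and stopping rule are illustrative choices, not part of the slides.

```python
# Stepwise forward selection: greedily add the attribute that most improves
# a scoring function until no remaining attribute improves it.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # illustrative data set

def score(columns):
    """Cross-validated accuracy using only the given attribute columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    candidate = max(remaining, key=lambda c: score(selected + [c]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_score:   # stop when no attribute improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = candidate_score

print(selected, best_score)
```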
79
Example
80
Data Preprocessing
81
Data Transformation
In data transformation, the data are transformed or
consolidated into forms appropriate for mining. It is a function
that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be
identified with one of the new values. Various methods are:
1. Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
2. Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts.
82
4. Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute
(e.g., age) are replaced by interval labels (e.g., 0–10, 11–20,
etc.) or conceptual labels (e.g., youth, adult, senior); see the
code sketch after this list. The labels, in turn, can be recursively
organized into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where
attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be
automatically defined at the schema definition level.
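As an illustration of discretization (method 5 above), the sketch below replaces raw ages with interval labels and then with conceptual labels using pandas; the age values and cut points are assumed for the example.

```python
# Discretization of a numeric attribute into interval and conceptual labels.
import pandas as pd

ages = pd.Series([6, 15, 23, 37, 48, 52, 70])

interval_labels = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])
concept_labels  = pd.cut(ages, bins=[0, 20, 60, 100],
                         labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages,
                    "interval": interval_labels,
                    "concept": concept_labels}))
```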
Data Transformation by
Normalization
The measurement unit used can affect the data analysis. For
example, changing measurement units from meters to inches for
height, or from kilograms to pounds for weight, may lead to very
different results. In general, expressing an attribute in smaller
units will lead to a larger range for that attribute, and thus tend
to give such an attribute greater effect or “weight.” To help avoid
dependence on the choice of measurement units, the data
should be normalized or standardized.
This involves transforming the data to fall within a smaller or
common range such as [-1, 1] or [0.0, 1.0].
Normalizing the data attempts to give all attributes an equal
weight. Normalization is particularly useful for classification
algorithms involving neural networks or distance measurements
such as nearest-neighbor classification and clustering.
84
Data Transformation by
Normalization
Min-max normalization: Suppose that minA and maxA are the
minimum and maximum values of an attribute, A. Min-max
normalization maps a value vi of A to vi’ in the range
[new_minA, new_maxA] by computing:
vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
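A small assumed sketch of min-max normalization to the range [0.0, 1.0]; the income values are illustrative.

```python
# Min-max normalization of an attribute to [new_min, new_max].
import numpy as np

income = np.array([12000.0, 73600.0, 98000.0])
new_min, new_max = 0.0, 1.0

normalized = ((income - income.min()) / (income.max() - income.min())
              * (new_max - new_min) + new_min)
print(normalized)   # 12000 -> 0.0, 73600 -> ~0.716, 98000 -> 1.0
```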
91
Discretization Without Using Class Labels
(Binning vs. Clustering)
97
Summary
Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
98