Lecture Slide – 2
Data Preprocessing
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization and concept hierarchy generation
◼ Summary
Why Data Preprocessing?
◼ Data in the real world is dirty
◼Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
◼e.g., Telephone no, Mother’s name, etc.
◼ Noisy: containing errors or outliers
◼e.g., Salary=“‐10”
◼ Inconsistent: containing discrepancies in codes or
names
◼e.g., Age=“42” Birthday=“03/07/1997”
◼e.g., Was rating “1,2,3”, now rating “A, B, C”
◼e.g., discrepancy between duplicate records
Why Is Data Dirty?
◼ Incomplete data may come from
◼ “Not applicable” data value when collected
◼ Different considerations between the time when the data
was collected and when it is analyzed.
◼ Human/hardware/software problems
◼ Noisy data (incorrect values) may come from
◼ Faulty data collection instruments
◼ Human or computer error at data entry
◼ Errors in data transmission
◼ Inconsistent data may come from
◼ Different data sources
◼ Functional dependency violation (e.g., modify linked data)
Why Is Data Preprocessing Important?
◼ No quality data, no quality mining results! (GIGO)
◼ Quality decisions must be based on quality data
◼ e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
◼ Data warehouse needs consistent integration of quality data
◼Data extraction, Cleaning, and Transformation
comprise the majority of the work of building a data
warehouse
Multi‐Dimensional Measure of Data Quality
◼ A well‐accepted multidimensional view:
◼ Accuracy
◼ Completeness
◼ Consistency
◼ Timeliness
◼ Value added
◼ Interpretability
◼ Accessibility
Major Tasks in Data Preprocessing
◼ Data cleaning: Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
◼ Data integration: Integration of multiple data sources (databases, flat
files, etc.) into a coherent data store (data warehouse)
◼ Data transformation: Transformation of data from one form to another
for accuracy and efficiency purposes (normalization, aggregation, etc.)
◼ Data reduction: Obtains a reduced representation of the data that is much
smaller in volume but produces the same or similar analytical results
(aggregation, duplicate elimination, clustering, attribute subset selection, etc.)
Visualization of Data Preprocessing Tasks (figure)
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization and concept hierarchy generation
◼ Summary
Descriptive Data Summarization
◼ A technique used to identify the typical properties (central
tendency and dispersion) of the data and to highlight which
values should be treated as noise or outliers.
◼ Measures of central tendency
◼ Mean, Median, Mode, and Midrange
◼ Measures of dispersion
◼ Quartiles, Inter‐Quartile Range (IQR), Variance
Central Tendency Measure: Mean
◼ An algebraic measure
◼ Arithmetic Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
◼ Weighted Arithmetic Mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
◼ Trimmed Arithmetic Mean
◼ chopping the extreme values (a small percentage at each end):
  $\bar{x} = \frac{\sum_{i=m+1}^{n-m} w_i x_i}{\sum_{i=m+1}^{n-m} w_i}$
◼ The mean is sensitive to outliers
Measures of Central Tendency
Measures of central tendency are statistical measures that identify a
single value as representative of the entire distribution. The following are
the different measures of central tendency:
▪ Mean
▪ Mode
▪ Median
▪ Midrange
Mean:
The mean, also known as the average, is a measure of central tendency
calculated by summing all values in the data set and then dividing the sum
by the total number of values.
Measures of Central Tendency
$\bar{A} = \frac{\sum a_i}{n}$
For example, consider the following data:
a_i: 20, 39, 27, 55, 12, 13
Then $\bar{A} = \frac{20+39+27+55+12+13}{6} = \frac{166}{6} \approx 27.67$
Measures of Central Tendency
The trimmed mean can be calculated with the following formula:
$\bar{A} = \frac{\sum_{i=m+1}^{n-m} a_i w_i}{\sum_{i=m+1}^{n-m} w_i}$
For example (values a_i with weights w_i):
  a_i:  -150   20   30   10   10000
  w_i:     1    2    2    4       1
Then, after trimming the two extreme values (−150 and 10000), the trimmed mean is:
$\bar{A} = \frac{20 \times 2 + 30 \times 2 + 10 \times 4}{2 + 2 + 4} = \frac{140}{8} = 17.5$
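The following is a minimal Python sketch (not from the slides; it assumes NumPy and SciPy are available) showing how the arithmetic, weighted and trimmed means above could be computed:

```python
# Minimal sketch: arithmetic, weighted and trimmed means for the examples above.
import numpy as np
from scipy import stats  # assumption: SciPy is installed

data = [20, 39, 27, 55, 12, 13]
print(np.mean(data))                       # arithmetic mean, ~27.67

values  = np.array([-150, 20, 30, 10, 10000])
weights = np.array([   1,  2,  2,  4,     1])
print(np.average(values, weights=weights))  # weighted arithmetic mean

# Trimmed mean: drop a proportion of the extreme values from each end.
# Trimming 20% from each tail of the (unweighted) values is analogous to
# dropping -150 and 10000 before averaging.
print(stats.trim_mean(values, proportiontocut=0.2))
```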
Central Tendency Measure: Median
◼ A holistic measure
◼ The middle value if there is an odd number of values, or the average of the middle two values otherwise
◼ Estimated by interpolation (for grouped data):
  $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
  L1 = lower boundary of the median interval
  n = number of values in the entire data set
  (Σf)l = sum of the frequencies of all intervals lower than the median interval
  f_median = frequency of the median interval
  c = width of the median interval
Measures of Central Tendency
Median:
The median is a measure of central tendency that represents the middle value
of a dataset when it is arranged in ascending or descending order.
If there is an odd number of values, then the median is the middle value.
If there is an even number of values, then the median is the average of the two
middle values.
For example:
12, 24, 25, 34, 69, 23, 66
Sorted, this is 12, 23, 24, 25, 34, 66, 69, so the median is 25.
And if the data is as follows:
60, 22, 44, 56, 67, 78, 78, 89
Sorted, this is 22, 44, 56, 60, 67, 78, 78, 89, so the median is
(60 + 67) / 2 = 63.5
Measures of Central Tendency
If our data is grouped, then we use the following formula to calculate the
median:
$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
For example:
  Interval    Frequency
  10 – 20         5
  20 – 30         2
  30 – 40         4
  40 – 50         3
  50 – 60         6
Here n = 20, so n/2 = 10. The cumulative frequencies are 5, 7, 11, 14, 20, so the
median interval is 30 – 40, with L1 = 30, (Σf)l = 7, f_median = 4 and c = 10:
$\text{median} = 30 + \frac{10 - 7}{4} \times 10 = 37.5$
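A minimal Python sketch of the grouped-data median formula above (the helper function and variable names are illustrative assumptions, not from the slides):

```python
# Minimal sketch: median of grouped data via interpolation,
# median = L1 + ((n/2 - F_below) / f_median) * c.
def grouped_median(intervals):
    """intervals: list of ((lower, upper), frequency) in ascending order."""
    n = sum(freq for _, freq in intervals)
    cumulative = 0
    for (lower, upper), freq in intervals:
        if cumulative + freq >= n / 2:      # first interval that reaches n/2
            return lower + ((n / 2 - cumulative) / freq) * (upper - lower)
        cumulative += freq

bins = [((10, 20), 5), ((20, 30), 2), ((30, 40), 4), ((40, 50), 3), ((50, 60), 6)]
print(grouped_median(bins))   # 37.5 for the example above
```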
Central Tendency Measure: Mode
◼ The value that occurs most frequently in the data set
◼ Unimodal vs. Multimodal (Bimodal, Trimodal, etc.)
◼ No mode: if each value occurs only once
◼ For a unimodal, moderately skewed (asymmetrical) data set:
  mean − mode ≈ 3 × (mean − median)
Measures of Central Tendency
Mode
The mode in statistics is the value that appears most frequently in the
dataset. It is possible to have no mode, one mode, or more than one mode.
To find the mode, sort your dataset numerically and select the value that
occurs most frequently.
For example, the mode of the following dataset:
12, 23, 25, 12, 45, 44, 77, 12, 14
is 12, because 12 appears most frequently in the dataset.
The following dataset has more than one mode:
12, 13, 14, 16, 17, 12, 13
12 and 13 are the modes of the dataset.
And the following dataset has no mode:
12, 13, 14, 15, 16, 17, 18
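A minimal Python sketch (not from the slides) that reproduces the three mode cases above using collections.Counter:

```python
# Minimal sketch: finding the mode(s), including multimodal and "no mode" cases.
from collections import Counter

def modes(data):
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []                                   # every value occurs once: no mode
    return [value for value, c in counts.items() if c == top]

print(modes([12, 23, 25, 12, 45, 44, 77, 12, 14]))  # [12]
print(modes([12, 13, 14, 16, 17, 12, 13]))          # [12, 13]
print(modes([12, 13, 14, 15, 16, 17, 18]))          # []  (no mode)
```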
Symmetric vs. Skewed Data
◼ Relative positions of the median, mean and mode for symmetric,
positively skewed and negatively skewed data
(figures: symmetric data, positively skewed data, negatively skewed data)
Central Tendency Measure: Midrange
◼ Average of largest and smallest values in the data set
◼ Can be calculated using SQL aggregate functions max()
and min()
Measuring Dispersion of Data
◼ The degree to which numerical data tend to spread is
called dispersion or variance of data
◼ Common measures are:
◼ Range
◼ Interquartile range (IQR)
◼ Standard deviation & Variance
◼ The five‐number summary (based on quartiles)
◼ Can be used to draw boxplots (used for outlier analysis)
Dispersion Measure: Range, Quartiles & IQR
◼ Range: the difference between the largest and smallest data values
◼ The kth percentile of a set of data in numerical order is the
value xi having the property that k% of the data entries lie
at or below xi
◼ The median is the 50th percentile
◼ The most commonly used percentiles other than the median are the quartiles
◼First quartile (Q1) is the 25th percentile
◼Third quartile (Q3) is the 75th percentile
◼ IQR = Q3 – Q1
◼Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Dispersion Measure: Variance & SD
◼ Variance (σ²): an algebraic measure
◼ The variance is the average squared distance of each point
from the mean:
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
◼ The variance is thus the difference between the average of the
squared magnitudes of the data points and the squared magnitude
of the mean.
◼ Standard Deviation (σ): an algebraic measure
◼ The square root of the variance, σ²
Measures of Dispersion
Range:
The range is the simplest measure of dispersion and provides a quick glimpse
into the spread of data. It is calculated by subtracting the minimum value
from the maximum value in a dataset. The formula for the range is as follows:
Range (R) = Maximum Value − Minimum Value
Example: Suppose you have a dataset representing the daily temperatures in
a city for a week: [68, 72, 75, 80, 62, 70, 78]. To find the range:
R = 80 (maximum value) − 62 (minimum value) = 18 degrees Fahrenheit
Data scientists use the range to identify the extent of variation in a dataset.
However, it has limitations, such as sensitivity to outliers. Thus, it is often
used in conjunction with other measures for a more comprehensive analysis
Measures of Dispersion
Variance
Variance measures the average squared deviation of each data
point from the mean (average) of the dataset. It quantifies how
data points are spread out from the mean. The formula for
variance is:
Variance (σ²) = Σ(xi − μ)² / N
Where:
•Σ represents summation (i.e., adding up)
•xi is each data point
•μ is the mean of the dataset
•N is the total number of data points
Measures of Dispersion
Example: Consider a dataset of monthly sales figures for a small business:
[5000, 6000, 5500, 7000, 7500].
To find the variance:
1. Calculate the mean (μ): μ = (5000 + 6000 + 5500 + 7000 + 7500) / 5 = 6200
2. Calculate the squared differences from the mean, sum them, and divide by N:
Variance = [(5000−6200)² + (6000−6200)² + (5500−6200)² + (7000−6200)²
+ (7500−6200)²] / 5 = 4,300,000 / 5 = 860,000
Variance is a valuable measure in data science because it quantifies the
spread of data while considering all data points. However, its units are
squared, which can be less intuitive. This leads us to the next measure.
Measures of Dispersion
Standard Deviation
The standard deviation is a more interpretable measure of dispersion as it
is the square root of the variance. It tells us how much individual data
points typically deviate from the mean. The formula for standard
deviation is:
Standard Deviation (σ) = √Variance
Using the previous example's variance, the standard deviation is:
Standard Deviation (σ) = √860,000 ≈ 927.36
The standard deviation is extensively used in data science for several
reasons:
Identifying Outliers: Data points that deviate significantly from the mean
(beyond 2 or 3 standard deviations) may be considered outliers, which
can be important to detect anomalies in data.
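A minimal Python sketch (assuming NumPy is available) that reproduces the variance and standard deviation of the sales example above:

```python
# Minimal sketch: population variance and standard deviation of the sales data.
import numpy as np

sales = np.array([5000, 6000, 5500, 7000, 7500])
mean = sales.mean()                       # 6200.0
variance = np.mean((sales - mean) ** 2)   # population variance (divide by N): 860000.0
std_dev = np.sqrt(variance)               # ~927.36
print(mean, variance, std_dev)
# Note: np.var / np.std divide by N by default (ddof=0); pass ddof=1 for the
# sample versions, which divide by N-1 instead.
```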
Measures of Dispersion
Quartiles
A quartile is a value that divides a set of data into four equal groups. The
first quartile (Q1) is the value that separates the bottom 25% of the data
from the top 75%. The second quartile (Q2) is the median, which is the
value that separates the bottom 50% of the data from the top 50%. The
third quartile (Q3) is the value that separates the bottom 75% of the data
from the top 25%.
Interquartile Range (IQR)
The interquartile range (IQR) is the difference between the third quartile
(Q3) and the first quartile (Q1). It is a measure of how spread out the
middle 50% of the data is. A larger IQR indicates that the data is more
spread out, while a smaller IQR indicates that the data is more tightly
clustered together.
Measures of Dispersion
Example:
Data: [100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000,
800,000, 900,000, 1,000,000]
Q2 = (500,000 + 600,000) / 2 = 550,000
Then Q1 is the median of the lower half:
100,000, 200,000, 300,000, 400,000, 500,000, which is 300,000
And Q3 is the median of the upper half:
600,000, 700,000, 800,000, 900,000, 1,000,000, which is 800,000
Q1 = 300,000
Q3 = 800,000
IQR = 800,000 – 300,000 = 500,000
Now, values more than 1.5 × IQR (i.e., 750,000) above the upper quartile (Q3)
or below the lower quartile (Q1) are considered outliers.
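A minimal Python sketch (assuming NumPy) of the quartiles, IQR and the 1.5 × IQR outlier fences; note that NumPy interpolates percentiles, so Q1/Q3 may differ slightly from the median-of-halves values computed above:

```python
# Minimal sketch: quartiles, IQR and outlier fences for the example data.
import numpy as np

data = np.array([100_000, 200_000, 300_000, 400_000, 500_000,
                 600_000, 700_000, 800_000, 900_000, 1_000_000])

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # interpolated quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q2, q3, iqr, outliers)                 # no outliers for this data
```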
Boxplot Analysis
◼ Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
◼ Boxplot
◼ Data is represented with a box
◼ The ends of the box are at the first and third quartiles,
i.e., the height of the box is IQR
◼ The median is marked by a line within the box
◼ Whiskers: two lines outside the box extend to Minimum
and Maximum
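A minimal sketch (assuming Matplotlib; the data reuses the earlier IQR example) of how such a boxplot could be drawn:

```python
# Minimal sketch: a boxplot showing the five-number summary and any points
# beyond the 1.5 * IQR whiskers.
import matplotlib.pyplot as plt

data = [100_000, 200_000, 300_000, 400_000, 500_000,
        600_000, 700_000, 800_000, 900_000, 1_000_000]

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)     # whiskers at 1.5 * IQR (the default)
ax.set_ylabel("value")
plt.show()
```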
Scatter plot
◼ Provides a first look at bivariate data (two variables) to see
clusters of points, outliers, etc.
◼ Gives a good visual picture of the relationship between the
two variables
◼ Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
◼ Points are plotted but not joined
◼ The resulting pattern indicates the type and strength of
the relationship between the two variables.
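A minimal sketch (assuming Matplotlib) of a scatter plot; the age/glucose pairs are the ones used later in the correlation example:

```python
# Minimal sketch: plotting two variables as points in the plane.
import matplotlib.pyplot as plt

age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

plt.scatter(age, glucose)      # points are plotted but not joined
plt.xlabel("age")
plt.ylabel("glucose level")
plt.show()
```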
Scatter plot (cont…)
◼ The more the points tend to cluster around a straight line, the
stronger the linear relationship between the two variables (the
higher the correlation).
◼ Line runs from lower left to upper right → positive relationship
(direct)
◼ Line runs from upper left to lower right → negative
relationship (inverse).
◼ Random scatter of points → no relationship (very low or zero
correlation).
◼ Points clustering around a curve → non‐linear relationship (the
correlation coefficient will not be a good measure of the
strength)
Non‐correlated Data (figures: scatter plots of non‐correlated data)
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization
◼ Summary
Data Cleaning
◼ Importance:
◼ “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
◼ “Data cleaning is the number one problem in data
warehousing”—DCI survey
◼ Data cleaning tasks:
◼ Fill in missing values
◼ Identify outliers and smooth out noisy data
◼ Correct inconsistent data
◼ Resolve redundancy caused by data integration
Missing Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the time of entry
◼ Missing data may need to be inferred
How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with (see the sketch below)
◼ A global constant: e.g., “unknown”, a new class!
◼ The attribute mean
◼ The attribute mean for all samples belonging to the same class
◼ The most probable value: inference‐based, such as a Bayesian
formula or a decision tree
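A minimal pandas sketch of the automatic fill-in strategies above (the DataFrame, column names and fill constant are made up for illustration):

```python
# Minimal sketch: three ways to fill in missing values automatically.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],               # hypothetical class label
    "income": [50_000, None, 42_000, None, 46_000],    # attribute with missing values
})

# 1. A global constant (use a string such as "unknown" for categorical attributes)
df["income_const"] = df["income"].fillna(-1)

# 2. The attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. The attribute mean for all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```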
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitations
◼ inconsistency in naming conventions
◼ Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ First sort data and partition into bins
◼ Then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
◼ Regression
◼ Smooth by fitting the data into regression functions
◼ Clustering
◼ Detect and remove outliers
◼ Combined computer and human inspection
◼ Detect suspicious values and check by human (e.g., deal with possible
outliers)
Binning (aka Discretization)
◼ A process of transforming numerical
variables into categorical counterparts.
◼ Example: Bin values for Age into categories such as 20‐39, 40‐59, and 60‐79.
◼ Numerical variables are usually discretized in the modeling methods based on
frequency tables (e.g., decision trees)
◼ Binning may be supervised or
unsupervised
◼ Advantages:
◼ Binning may improve accuracy of the predictive models by reducing the noise or non‐linearity.
◼ Binning allows easy identification of outliers, invalid and missing values of numerical variables.
Unsupervised Binning (cont…)
◼ Equal Frequency (aka Equal Depth) Binning:
◼ The algorithm divides the data into k groups, where each group
contains approximately the same number of values.
◼ For both methods, the best way of determining k is to look at the
histogram and try different numbers of intervals or groups.
Example: Data: 0, 4, 12, 13, 16, 16, 18, 24, 26, 28 with k = 3 (a code sketch follows):
◼ Bin‐1: 0, 4, 12, 13 [‐, 14]
◼ Bin‐2: 16, 16, 18 [14, 21]
◼ Bin‐3: 24, 26, 28 [21, +]
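A minimal sketch (assuming pandas) of equal-frequency binning with pd.qcut; the quantile-based bin edges may differ slightly from the hand-chosen cut points above:

```python
# Minimal sketch: equal-frequency (equal-depth) binning with pandas.qcut,
# which assigns roughly the same number of values to each bin.
import pandas as pd

data = pd.Series([0, 4, 12, 13, 16, 16, 18, 24, 26, 28])
bins = pd.qcut(data, q=3)        # 3 quantile-based bins
print(bins.value_counts())       # roughly equal counts per bin
```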
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
*Partition into equal‐frequency (equi‐depth) bins:
‐ Bin 1: 4, 8, 9, 15
‐ Bin 2: 21, 21, 24, 25
‐ Bin 3: 26, 28, 29, 34
*Smoothing by bin means:
‐ Bin 1: 9, 9, 9, 9
‐ Bin 2: 23, 23, 23, 23
‐ Bin 3: 29, 29, 29, 29
*Smoothing by bin boundaries:
‐ Bin 1: 4, 4, 4, 15
‐ Bin 2: 21, 21, 25, 25
‐ Bin 3: 26, 26, 26, 34
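A minimal Python sketch (not from the slides) that reproduces the smoothing-by-bin-means and smoothing-by-bin-boundaries results above:

```python
# Minimal sketch: equal-depth partitioning of the price data, then smoothing
# each bin by its mean and by its boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # 3 bins of depth 4

# Smoothing by bin means: replace each value with its bin's (rounded) mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: replace each value with the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```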
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization
◼ Summary
Data Integration
◼ A process to combine data from multiple sources into a coherent
store
◼ Schema integration: e.g., “Scholarship” in one source ≡ “Fellowship” in another
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
◼ Detecting and resolving data value conflicts:
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
◼ Redundant data often occur when integrating multiple
databases
◼ Object identification: The same attribute or object may have
different names in different databases
◼ Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue, age, etc.
◼ Redundant attributes may be able to be detected by
correlation analysis
◼ Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Numerical Data)
◼ Correlation coefficient (also called Pearson's product‐moment
correlation coefficient):
  $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$
  where n is the number of tuples, Ā and B̄ are the respective means of A and B, and
  σA and σB are the respective standard deviations of A and B.
◼ If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The
higher the value, the stronger the correlation.
◼ If rA,B = 0, A and B are uncorrelated (there is no linear relationship)
◼ If rA,B < 0, A and B are negatively correlated
Correlation Analysis
Correlation Coefficient:
A correlation coefficient is a number between -1 and 1 that tells you the
strength and direction of the relationship between two variables.
To find the correlation coefficient we use the following formula:
$r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$
a_i = 43, 21, 25, 42, 57, 59
b_i = 99, 65, 79, 75, 87, 81
To further understand the correlation coefficient, consider the above
example, in which a_i are the ages of people and b_i are their respective
glucose levels.
Correlation Analysis
To calculate r, i.e. the correlation coefficient, for the above example we proceed
as follows:
Ā = (43 + 21 + 25 + 42 + 57 + 59) / 6 ≈ 41  and  B̄ = (99 + 65 + 79 + 75 + 87 + 81) / 6 = 81

  a_i − Ā            b_i − B̄            (a_i − Ā)(b_i − B̄)
  43 − 41 = 2        99 − 81 = 18        2 × 18 = 36
  21 − 41 = −20      65 − 81 = −16       −20 × −16 = 320
  25 − 41 = −16      79 − 81 = −2        −16 × −2 = 32
  42 − 41 = 1        75 − 81 = −6        1 × −6 = −6
  57 − 41 = 16       87 − 81 = 6         16 × 6 = 96
  59 − 41 = 18       81 − 81 = 0         18 × 0 = 0

$\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B}) = 478$
Correlation Analysis
$\sigma_A = \sqrt{\frac{\sum (a_i - \bar{A})^2}{n}} = \sqrt{\frac{1241}{6}} \approx 14.4$
$\sigma_B = \sqrt{\frac{\sum (b_i - \bar{B})^2}{n}} = \sqrt{\frac{656}{6}} \approx 10.5$
$r_{A,B} = \frac{478}{6 \times 14.4 \times 10.5} \approx 0.53$
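A minimal sketch (assuming NumPy) that checks the hand calculation above with np.corrcoef:

```python
# Minimal sketch: Pearson correlation coefficient for the age/glucose example.
import numpy as np

age     = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

r = np.corrcoef(age, glucose)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(round(r, 2))                    # ~0.53
```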
Correlation Analysis (Categorical Data)
◼ χ² (chi‐square) test:
  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
◼ The larger the χ² value, the more likely the variables are related
◼ The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count
◼ Correlation does not imply causality
◼ # of hospitals and # of car‐thefts in a city are correlated
◼ Both are causally linked to the third variable: population
Correlation vs. Causality
◼ Correlation is what we observe when we cannot see under the
covers. The less information we have, the more we are forced to
rely on observed correlations. Conversely, the more information we
have, the more transparent things become and the better we can
see the actual causal relationships.
◼ Example:
◼ Suppose that a group of 1,500 people was surveyed. The gender of each person
was noted. Each person was polled as to whether their preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading.
Are gender and preferred reading correlated?
                 Male        Female       Sum (row)
  Fiction        250 (90)    200 (360)    450
  Non-fiction    50 (210)    1000 (840)   1050
  Sum (col.)     300         1200         1500
χ² (chi-square) calculation (the numbers in parentheses are the expected counts,
calculated from the data distribution in the two categories):
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
Since the χ² value is very large, gender and preferred reading are strongly
correlated in the surveyed group.
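A minimal sketch (assuming SciPy) of the same chi-square test; correction=False disables Yates' continuity correction so the result matches the hand calculation:

```python
# Minimal sketch: chi-square test of independence on the gender/reading table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # fiction:     male, female
                     [50, 1000]])   # non-fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)     # ~507.93 with 1 degree of freedom
print(expected)      # [[90, 360], [210, 840]]
print(p_value)       # far below 0.05, so the attributes are not independent
```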
Data Transformation: Normalization
◼ Min‐max normalization: to [new_minA, new_maxA]
◼ Performs a linear transformation on the original data:
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
◼ Example: Let income range from Rs. 12,000 to Rs. 98,000, normalized to [0, 1].
Then Rs. 73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
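A minimal Python sketch of min-max normalization applied to the income example (the helper name is made up for illustration):

```python
# Minimal sketch: min-max normalization to a new range [new_min, new_max].
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
```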
Normalization (cont…)
◼ Z‐score normalization (or zero‐mean normalization)
◼ Let μA be the mean and σA the standard deviation of attribute A:
  $v' = \frac{v - \mu_A}{\sigma_A}$
◼ Example: Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to
  $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
◼ Useful when the actual min and max values of the attribute are
unknown, or when there are outliers that dominate a min‐max
normalization.
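A minimal Python sketch of z-score normalization; the example income column is made up for illustration:

```python
# Minimal sketch: z-score normalization with given or estimated statistics.
import numpy as np

def z_score(v, mean, std):
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))   # 1.225

# Estimating the mean and standard deviation from a column instead:
incomes = np.array([30_000, 54_000, 61_000, 73_600, 98_000])  # made-up values
z = (incomes - incomes.mean()) / incomes.std()
print(z)
```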
Normalization (cont…)
◼ Normalization by decimal scaling
◼ Normalizes data values by moving the decimal point
◼ The number of places the decimal point is moved depends on the
maximum absolute value of A
◼ A value v is normalized to v' by computing
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
◼ Example: A ranges from ‐986 to 97. The maximum absolute value is 986,
so j = 3 (i.e., divide each value by 1,000)
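A minimal Python sketch of normalization by decimal scaling; the values other than −986 and 97 are made up within the stated range:

```python
# Minimal sketch: decimal scaling, v' = v / 10**j for the smallest j
# such that max(|v'|) < 1.
def decimal_scale(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 45, 97])
print(j, scaled)    # j = 3 -> [-0.986, 0.045, 0.097]
```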
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization and concept hierarchy generation
◼ Summary
Data Reduction Techniques
◼ Why data reduction?
◼ A database/data warehouse may store terabytes of data
◼ Complex data analysis/mining may take a very long time to run on the complete data set
◼ What is data reduction?
◼ A process to obtain a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results
◼ Data reduction techniques:
◼ Numerosity reduction: e.g., fit data into models
◼ Dimensionality reduction: e.g., remove unimportant attributes
◼ Discretization
Numerosity Reduction
◼Reduce data volume by choosing alternative,
smaller forms of data representation
◼Parametric methods
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
◼ Non‐parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling
Dimensionality Reduction: Attribute Subset Selection
◼ Feature selection (i.e., attribute subset selection):
◼ Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
◼ Reduce the # of patterns, making them easier to understand
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization
◼ Summary
Discretization
◼ Three types of attributes:
◼ Nominal — values from an unordered set, e.g., color, profession
◼ Ordinal — values from an ordered set, e.g., military or academic
rank
◼ Numeric — integer or real numbers
◼ Discretization:
◼ Divide the range of a continuous attribute into intervals
◼ Some classification algorithms only accept categorical attributes.
◼ Reduce data size by discretization
◼ Prepare for further analysis