Lecture Slide – 2
Data Preprocessing
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization and concept hierarchy generation
◼ Summary
Why Data Preprocessing?
◼ Data in the real world is dirty
◼Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
◼e.g., Telephone no, Mother’s name, etc.
◼ Noisy: containing errors or outliers
◼e.g., Salary=“‐10”
◼ Inconsistent: containing discrepancies in codes or
names
◼e.g., Age=“42” Birthday=“03/07/1997”
◼e.g., Was rating “1,2,3”, now rating “A, B, C”
◼e.g., discrepancy between duplicate records
Why Is Data Dirty?
◼ Incomplete data may come from
◼ “Not applicable” data value when collected
◼ Different considerations between the time when the data
was collected and when it is analyzed.
◼ Human/hardware/software problems
◼ Noisy data (incorrect values) may come from
◼ Faulty data collection instruments
◼ Human or computer error at data entry
◼ Errors in data transmission
◼ Inconsistent data may come from
◼ Different data sources
◼ Functional dependency violation (e.g., modify linked data)
Why Is Data Preprocessing Important?
◼ No quality data, no quality mining results! (GIGO)
◼ Quality decisions must be based on quality data
◼ e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
◼ Data warehouse needs consistent integration of quality data
◼Data extraction, Cleaning, and Transformation
comprise the majority of the work of building a data
warehouse
Multi‐Dimensional Measure of Data Quality
◼ A well‐accepted multidimensional view:
◼ Accuracy
◼ Completeness
◼ Consistency
◼ Timeliness
◼ Value added
◼ Interpretability
◼ Accessibility
Major Tasks in Data Preprocessing
◼ Data cleaning: Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
◼ Data integration: Integration of multiple data sources (databases, flat
files, etc.) into a coherent data store (data warehouse)
◼ Data transformation: Transformation of data from one form to another
for accuracy and efficiency purposes (normalization, aggregation, etc.)
◼ Data reduction: Obtains a reduced representation of the data that is much
smaller in volume but produces the same or similar analytical results
(aggregation, duplicate elimination, clustering, attribute subset selection, etc.)
Visualization of Data Preprocessing Tasks (figure)
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization and concept hierarchy generation
◼ Summary
Descriptive Data Summarization
◼ A technique used to identify the typical properties (central
tendency and dispersion) of the data and to highlight which
values should be treated as noise or outliers.
◼ Measures of central tendency
◼ Mean, Median, Mode, and Midrange
◼ Measures of dispersion
◼ Quartiles, Inter‐Quartile Range (IQR), Variance
Central Tendency Measure: Mean
◼ An algebraic measure
◼ Arithmetic Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
◼ Weighted Arithmetic Mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
◼ Trimmed Arithmetic Mean
◼ chopping the extreme values (a small percentage at each end):
  $\bar{x} = \frac{\sum_{i=m+1}^{n-m} w_i x_i}{\sum_{i=m+1}^{n-m} w_i}$
◼ The mean is sensitive to outliers
Measures of Central Tendency
Measures of central tendency are statistical measures that identify a
single value as representative of the entire distribution. The following are
the different measures of central tendency:
▪ Mean
▪ Mode
▪ Median
▪ Midrange
Mean:
The mean, also known as the average, is a measure of central tendency
calculated by summing all values in the data set and then dividing the sum
by the total number of values.
Measures of Central Tendency
$\bar{A} = \frac{\sum a_i}{n}$
For example, consider the following data:
a_i: 20, 39, 27, 55, 12, 13
Then $\bar{A} = \frac{20+39+27+55+12+13}{6} = \frac{166}{6} \approx 27.67$
Measures of Central Tendency
The trimmed mean can be calculated with the following formula:
$\bar{A} = \frac{\sum_{i=m+1}^{n-m} a_i w_i}{\sum_{i=m+1}^{n-m} w_i}$
For example (values a_i with weights w_i):
  a_i:  -150   20   30   10   10000
  w_i:     1    2    2    4       1
Then, after trimming the two extreme values (−150 and 10000), the trimmed mean is:
$\bar{A} = \frac{20 \times 2 + 30 \times 2 + 10 \times 4}{2 + 2 + 4} = \frac{140}{8} = 17.5$
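The following is a minimal Python sketch (not from the slides; it assumes NumPy and SciPy are available) showing how the arithmetic, weighted and trimmed means above could be computed:

```python
# Minimal sketch: arithmetic, weighted and trimmed means for the examples above.
import numpy as np
from scipy import stats  # assumption: SciPy is installed

data = [20, 39, 27, 55, 12, 13]
print(np.mean(data))                       # arithmetic mean, ~27.67

values  = np.array([-150, 20, 30, 10, 10000])
weights = np.array([   1,  2,  2,  4,     1])
print(np.average(values, weights=weights))  # weighted arithmetic mean

# Trimmed mean: drop a proportion of the extreme values from each end.
# Trimming 20% from each tail of the (unweighted) values is analogous to
# dropping -150 and 10000 before averaging.
print(stats.trim_mean(values, proportiontocut=0.2))
```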
Central Tendency Measure: Median
◼ A holistic measure
◼ The middle value if there is an odd number of values, or the average of the middle two values otherwise
◼ Estimated by interpolation (for grouped data):
  $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
  L1 = lower boundary of the median interval
  n = number of values in the entire data set
  (Σf)l = sum of the frequencies of all intervals lower than the median interval
  f_median = frequency of the median interval
  c = width of the median interval
Measures of Central Tendency
Median:
The median is a measure of central tendency that represents the middle value
of a dataset when it is arranged in ascending or descending order.
If there is an odd number of values, then the median is the middle value.
If there is an even number of values, then the median is the average of the two
middle values.
For example:
12, 24, 25, 34, 69, 23, 66
Sorted, this is 12, 23, 24, 25, 34, 66, 69, so the median is 25.
And if the data is as follows:
60, 22, 44, 56, 67, 78, 78, 89
Sorted, this is 22, 44, 56, 60, 67, 78, 78, 89, so the median is
(60 + 67) / 2 = 63.5
Measures of Central Tendency
If our data is grouped, then we use the following formula to calculate the
median:
$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
For example:
  Interval    Frequency
  10 – 20         5
  20 – 30         2
  30 – 40         4
  40 – 50         3
  50 – 60         6
Here n = 20, so n/2 = 10. The cumulative frequencies are 5, 7, 11, 14, 20, so the
median interval is 30 – 40, with L1 = 30, (Σf)l = 7, f_median = 4 and c = 10:
$\text{median} = 30 + \frac{10 - 7}{4} \times 10 = 37.5$
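A minimal Python sketch of the grouped-data median formula above (the helper function and variable names are illustrative assumptions, not from the slides):

```python
# Minimal sketch: median of grouped data via interpolation,
# median = L1 + ((n/2 - F_below) / f_median) * c.
def grouped_median(intervals):
    """intervals: list of ((lower, upper), frequency) in ascending order."""
    n = sum(freq for _, freq in intervals)
    cumulative = 0
    for (lower, upper), freq in intervals:
        if cumulative + freq >= n / 2:      # first interval that reaches n/2
            return lower + ((n / 2 - cumulative) / freq) * (upper - lower)
        cumulative += freq

bins = [((10, 20), 5), ((20, 30), 2), ((30, 40), 4), ((40, 50), 3), ((50, 60), 6)]
print(grouped_median(bins))   # 37.5 for the example above
```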
Central Tendency Measure: Mode
◼ The value that occurs most frequently in the data set
◼ Unimodal vs. Multimodal (Bimodal, Trimodal, etc.)
◼ No mode: if each value occurs only once
◼ For a unimodal, moderately skewed (asymmetrical) data set:
  mean − mode ≈ 3 × (mean − median)
Measures of Central Tendency
Mode
The mode in statistics is the value that appears most frequently in the
dataset. It is possible to have no mode, one mode, or more than one mode.
To find the mode, sort your dataset numerically and select the value that
occurs most frequently.
For example, the mode of the following dataset:
12, 23, 25, 12, 45, 44, 77, 12, 14
is 12, because 12 appears most frequently in the dataset.
The following dataset has more than one mode:
12, 13, 14, 16, 17, 12, 13
12 and 13 are the modes of the dataset.
And the following dataset has no mode:
12, 13, 14, 15, 16, 17, 18
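A minimal Python sketch (not from the slides) that reproduces the three mode cases above using collections.Counter:

```python
# Minimal sketch: finding the mode(s), including multimodal and "no mode" cases.
from collections import Counter

def modes(data):
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []                                   # every value occurs once: no mode
    return [value for value, c in counts.items() if c == top]

print(modes([12, 23, 25, 12, 45, 44, 77, 12, 14]))  # [12]
print(modes([12, 13, 14, 16, 17, 12, 13]))          # [12, 13]
print(modes([12, 13, 14, 15, 16, 17, 18]))          # []  (no mode)
```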
Symmetric vs. Skewed Data
◼ Relative positions of the median, mean and mode for symmetric,
positively skewed and negatively skewed data
(figures: symmetric data, positively skewed data, negatively skewed data)
Central Tendency Measure: Midrange
◼ Average of largest and smallest values in the data set
◼ Can be calculated using SQL aggregate functions max()
and min()
Measuring Dispersion of Data
◼ The degree to which numerical data tend to spread is
called dispersion or variance of data
◼ Common measures are:
◼ Range
◼ Interquartile range (IQR)
◼ Standard deviation & Variance
◼ The five‐number summary (based on quartiles)
◼ Can be used to draw boxplots (used for outlier analysis)
Dispersion Measure: Range, Quartiles & IQR
◼ Range: the difference between the largest and smallest data values
◼ The kth percentile of a set of data in numerical order is the
value xi having the property that k% of the data entries lie
at or below xi
◼ The median is the 50th percentile
◼ The most commonly used percentiles other than the median are the quartiles
◼First quartile (Q1) is the 25th percentile
◼Third quartile (Q3) is the 75th percentile
◼ IQR = Q3 – Q1
◼Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Dispersion Measure: Variance & SD
◼ Variance (σ²): an algebraic measure
◼ The variance is the average squared distance of each point
from the mean:
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
◼ The variance is thus the difference between the average of the
squared magnitudes of the data points and the squared magnitude
of the mean.
◼ Standard Deviation (σ): an algebraic measure
◼ The square root of the variance, σ²
Measures of Dispersion
Range:
The range is the simplest measure of dispersion and provides a quick glimpse
into the spread of data. It is calculated by subtracting the minimum value
from the maximum value in a dataset. The formula for the range is as follows:
Range (R) = Maximum Value − Minimum Value
Example: Suppose you have a dataset representing the daily temperatures in
a city for a week: [68, 72, 75, 80, 62, 70, 78]. To find the range:
R = 80 (maximum value) − 62 (minimum value) = 18 degrees Fahrenheit
Data scientists use the range to identify the extent of variation in a dataset.
However, it has limitations, such as sensitivity to outliers. Thus, it is often
used in conjunction with other measures for a more comprehensive analysis
Measures of Dispersion
Variance
Variance measures the average squared deviation of each data
point from the mean (average) of the dataset. It quantifies how
data points are spread out from the mean. The formula for
variance is:
Variance (σ²) = Σ(xi − μ)² / N
Where:
•Σ represents summation (i.e., adding up)
•xi is each data point
•μ is the mean of the dataset
•N is the total number of data points
Measures of Dispersion
Example: Consider a dataset of monthly sales figures for a small business:
[5000, 6000, 5500, 7000, 7500].
To find the variance:
1. Calculate the mean (μ): μ = (5000 + 6000 + 5500 + 7000 + 7500) / 5 = 6200
2. Calculate the squared differences from the mean, sum them, and divide by N:
Variance = [(5000−6200)² + (6000−6200)² + (5500−6200)² + (7000−6200)²
+ (7500−6200)²] / 5 = 4,300,000 / 5 = 860,000
Variance is a valuable measure in data science because it quantifies the
spread of data while considering all data points. However, its units are
squared, which can be less intuitive. This leads us to the next measure.
Measures of Dispersion
Standard Deviation
The standard deviation is a more interpretable measure of dispersion as it
is the square root of the variance. It tells us how much individual data
points typically deviate from the mean. The formula for standard
deviation is:
Standard Deviation (σ) = √Variance
Using the previous example's variance, the standard deviation is:
Standard Deviation (σ) = √860,000 ≈ 927.36
The standard deviation is extensively used in data science for several
reasons:
Identifying Outliers: Data points that deviate significantly from the mean
(beyond 2 or 3 standard deviations) may be considered outliers, which
can be important to detect anomalies in data.
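A minimal Python sketch (assuming NumPy is available) that reproduces the variance and standard deviation of the sales example above:

```python
# Minimal sketch: population variance and standard deviation of the sales data.
import numpy as np

sales = np.array([5000, 6000, 5500, 7000, 7500])
mean = sales.mean()                       # 6200.0
variance = np.mean((sales - mean) ** 2)   # population variance (divide by N): 860000.0
std_dev = np.sqrt(variance)               # ~927.36
print(mean, variance, std_dev)
# Note: np.var / np.std divide by N by default (ddof=0); pass ddof=1 for the
# sample versions, which divide by N-1 instead.
```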
Measures of Dispersion
Quartiles
A quartile is a value that divides a set of data into four equal groups. The
first quartile (Q1) is the value that separates the bottom 25% of the data
from the top 75%. The second quartile (Q2) is the median, which is the
value that separates the bottom 50% of the data from the top 50%. The
third quartile (Q3) is the value that separates the bottom 75% of the data
from the top 25%.
Interquartile Range (IQR)
The interquartile range (IQR) is the difference between the third quartile
(Q3) and the first quartile (Q1). It is a measure of how spread out the
middle 50% of the data is. A larger IQR indicates that the data is more
spread out, while a smaller IQR indicates that the data is more tightly
clustered together.
Measures of Dispersion
Example:
Data: [100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000,
800,000, 900,000, 1,000,000]
Q2 = (500,000 + 600,000) / 2 = 550,000
Then Q1 is the median of the lower half:
100,000, 200,000, 300,000, 400,000, 500,000, which is 300,000
And Q3 is the median of the upper half:
600,000, 700,000, 800,000, 900,000, 1,000,000, which is 800,000
Q1 = 300,000
Q3 = 800,000
IQR = 800,000 – 300,000 = 500,000
Now, values more than 1.5 × IQR (i.e., 750,000) above the upper quartile (Q3)
or below the lower quartile (Q1) are considered outliers.
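A minimal Python sketch (assuming NumPy) of the quartiles, IQR and the 1.5 × IQR outlier fences; note that NumPy interpolates percentiles, so Q1/Q3 may differ slightly from the median-of-halves values computed above:

```python
# Minimal sketch: quartiles, IQR and outlier fences for the example data.
import numpy as np

data = np.array([100_000, 200_000, 300_000, 400_000, 500_000,
                 600_000, 700_000, 800_000, 900_000, 1_000_000])

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # interpolated quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q2, q3, iqr, outliers)                 # no outliers for this data
```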
Boxplot Analysis
◼ Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
◼ Boxplot
◼ Data is represented with a box
◼ The ends of the box are at the first and third quartiles,
i.e., the height of the box is IQR
◼ The median is marked by a line within the box
◼ Whiskers: two lines outside the box extend to Minimum
and Maximum
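A minimal sketch (assuming Matplotlib; the data reuses the earlier IQR example) of how such a boxplot could be drawn:

```python
# Minimal sketch: a boxplot showing the five-number summary and any points
# beyond the 1.5 * IQR whiskers.
import matplotlib.pyplot as plt

data = [100_000, 200_000, 300_000, 400_000, 500_000,
        600_000, 700_000, 800_000, 900_000, 1_000_000]

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)     # whiskers at 1.5 * IQR (the default)
ax.set_ylabel("value")
plt.show()
```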
Scatter plot
◼ Provides a first look at bivariate data (two variables) to see
clusters of points, outliers, etc.
◼ Gives a good visual picture of the relationship between the
two variables
◼ Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
◼ Points are plotted but not joined
◼ The resulting pattern indicates the type and strength of
the relationship between the two variables.
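A minimal sketch (assuming Matplotlib) of a scatter plot; the age/glucose pairs are the ones used later in the correlation example:

```python
# Minimal sketch: plotting two variables as points in the plane.
import matplotlib.pyplot as plt

age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

plt.scatter(age, glucose)      # points are plotted but not joined
plt.xlabel("age")
plt.ylabel("glucose level")
plt.show()
```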
Scatter plot (cont…)
◼ The more the points tend to cluster around a straight line, the
stronger the linear relationship between the two variables (the
higher the correlation).
◼ Line runs from lower left to upper right → positive relationship
(direct)
◼ Line runs from upper left to lower right → negative
relationship (inverse).
◼ Random scatter of points → no relationship (very low or zero
correlation).
◼ Points clustering around a curve → non‐linear relationship (the
correlation coefficient will not be a good measure of the
strength)
Non‐correlated Data (figures: scatter plots of non‐correlated data)
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization
◼ Summary
Data Cleaning
◼ Importance:
◼ “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
◼ “Data cleaning is the number one problem in data
warehousing”—DCI survey
◼ Data cleaning tasks:
◼ Fill in missing values
◼ Identify outliers and smooth out noisy data
◼ Correct inconsistent data
◼ Resolve redundancy caused by data integration
Missing Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the time of entry
◼ Missing data may need to be inferred
How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with (see the sketch below)
◼ A global constant: e.g., “unknown”, a new class!
◼ The attribute mean
◼ The attribute mean for all samples belonging to the same class
◼ The most probable value: inference‐based, such as a Bayesian
formula or a decision tree
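A minimal pandas sketch of the automatic fill-in strategies above (the DataFrame, column names and fill constant are made up for illustration):

```python
# Minimal sketch: three ways to fill in missing values automatically.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],               # hypothetical class label
    "income": [50_000, None, 42_000, None, 46_000],    # attribute with missing values
})

# 1. A global constant (use a string such as "unknown" for categorical attributes)
df["income_const"] = df["income"].fillna(-1)

# 2. The attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. The attribute mean for all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```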
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitations
◼ inconsistency in naming conventions
◼ Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ First sort data and partition into bins
◼ Then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
◼ Regression
◼ Smooth by fitting the data into regression functions
◼ Clustering
◼ Detect and remove outliers
◼ Combined computer and human inspection
◼ Detect suspicious values and check by human (e.g., deal with possible
outliers)
Binning (aka Discretization)
◼ A process of transforming numerical
variables into categorical counterparts.
◼ Example: Bin values for Age into categories such as 20‐39, 40‐59, and 60‐79.
◼ Numerical variables are usually discretized in the modeling methods based on
frequency tables (e.g., decision trees)
◼ Binning may be supervised or
unsupervised
◼ Advantages:
◼ Binning may improve accuracy of the predictive models by reducing the noise or non‐linearity.
◼ Binning allows easy identification of outliers, invalid and missing values of numerical variables.
Unsupervised Binning (cont…)
◼ Equal Frequency (aka Equal Depth) Binning:
◼ The algorithm divides the data into k groups, where each group
contains approximately the same number of values.
◼ For both methods, the best way of determining k is to look at the
histogram and try different numbers of intervals or groups.
Example: Data: 0, 4, 12, 13, 16, 16, 18, 24, 26, 28 with k = 3 (a code sketch follows):
◼ Bin‐1: 0, 4, 12, 13 [‐, 14]
◼ Bin‐2: 16, 16, 18 [14, 21]
◼ Bin‐3: 24, 26, 28 [21, +]
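A minimal sketch (assuming pandas) of equal-frequency binning with pd.qcut; the quantile-based bin edges may differ slightly from the hand-chosen cut points above:

```python
# Minimal sketch: equal-frequency (equal-depth) binning with pandas.qcut,
# which assigns roughly the same number of values to each bin.
import pandas as pd

data = pd.Series([0, 4, 12, 13, 16, 16, 18, 24, 26, 28])
bins = pd.qcut(data, q=3)        # 3 quantile-based bins
print(bins.value_counts())       # roughly equal counts per bin
```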
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
*Partition into equal‐frequency (equi‐depth) bins:
‐ Bin 1: 4, 8, 9, 15
‐ Bin 2: 21, 21, 24, 25
‐ Bin 3: 26, 28, 29, 34
*Smoothing by bin means:
‐ Bin 1: 9, 9, 9, 9
‐ Bin 2: 23, 23, 23, 23
‐ Bin 3: 29, 29, 29, 29
*Smoothing by bin boundaries:
‐ Bin 1: 4, 4, 4, 15
‐ Bin 2: 21, 21, 25, 25
‐ Bin 3: 26, 26, 26, 34
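A minimal Python sketch (not from the slides) that reproduces the smoothing-by-bin-means and smoothing-by-bin-boundaries results above:

```python
# Minimal sketch: equal-depth partitioning of the price data, then smoothing
# each bin by its mean and by its boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # 3 bins of depth 4

# Smoothing by bin means: replace each value with its bin's (rounded) mean
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: replace each value with the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```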
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization
◼ Summary
Data Integration
◼ A process to combine data from multiple sources into a coherent
store
◼ Schema integration: e.g., “Scholarship” in one source ≡ “Fellowship” in another
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
◼ Detecting and resolving data value conflicts:
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
◼ Redundant data often occur when integrating multiple
databases
◼ Object identification: The same attribute or object may have
different names in different databases
◼ Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue, age, etc.
◼ Redundant attributes may be able to be detected by
correlation analysis
◼ Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Numerical Data)
◼ Correlation coefficient (also called Pearson's product‐moment
correlation coefficient):
  $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$
  where n is the number of tuples, Ā and B̄ are the respective means of A and B, and
  σA and σB are the respective standard deviations of A and B.
◼ If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The
higher the value, the stronger the correlation.
◼ If rA,B = 0, A and B are uncorrelated (there is no linear relationship)
◼ If rA,B < 0, A and B are negatively correlated
Correlation Analysis
Correlation Coefficient:
A correlation coefficient is a number between -1 and 1 that tells you the
strength and direction of the relationship between two variables.
To find the correlation coefficient we use the following formula:
$r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B}$
a_i = 43, 21, 25, 42, 57, 59
b_i = 99, 65, 79, 75, 87, 81
To further understand the correlation coefficient, consider the above
example, in which a_i are the ages of people and b_i are their respective
glucose levels.
Correlation Analysis
To calculate r, i.e. the correlation coefficient, for the above example we proceed
as follows:
Ā = (43 + 21 + 25 + 42 + 57 + 59) / 6 ≈ 41  and  B̄ = (99 + 65 + 79 + 75 + 87 + 81) / 6 = 81

  a_i − Ā            b_i − B̄            (a_i − Ā)(b_i − B̄)
  43 − 41 = 2        99 − 81 = 18        2 × 18 = 36
  21 − 41 = −20      65 − 81 = −16       −20 × −16 = 320
  25 − 41 = −16      79 − 81 = −2        −16 × −2 = 32
  42 − 41 = 1        75 − 81 = −6        1 × −6 = −6
  57 − 41 = 16       87 − 81 = 6         16 × 6 = 96
  59 − 41 = 18       81 − 81 = 0         18 × 0 = 0

$\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B}) = 478$
Correlation Analysis
$\sigma_A = \sqrt{\frac{\sum (a_i - \bar{A})^2}{n}} = \sqrt{\frac{1241}{6}} \approx 14.4$
$\sigma_B = \sqrt{\frac{\sum (b_i - \bar{B})^2}{n}} = \sqrt{\frac{656}{6}} \approx 10.5$
$r_{A,B} = \frac{478}{6 \times 14.4 \times 10.5} \approx 0.53$
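A minimal sketch (assuming NumPy) that checks the hand calculation above with np.corrcoef:

```python
# Minimal sketch: Pearson correlation coefficient for the age/glucose example.
import numpy as np

age     = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

r = np.corrcoef(age, glucose)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(round(r, 2))                    # ~0.53
```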
Correlation Analysis (Categorical Data)
◼ χ² (chi‐square) test:
  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
◼ The larger the χ² value, the more likely the variables are related
◼ The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count
◼ Correlation does not imply causality
◼ # of hospitals and # of car‐thefts in a city are correlated
◼ Both are causally linked to the third variable: population
Correlation vs. Causality
◼ Correlation is what we observe when we cannot see under the
covers. The less information we have, the more we are forced to
rely on observed correlations. Conversely, the more information we
have, the more transparent things become and the better we can
see the actual causal relationships.
◼ Example:
◼ Suppose that a group of 1,500 people was surveyed. The gender of each person
was noted. Each person was polled as to whether their preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading.
Are gender and preferred reading correlated?
                 Male        Female       Sum (row)
  Fiction        250 (90)    200 (360)    450
  Non-fiction    50 (210)    1000 (840)   1050
  Sum (col.)     300         1200         1500
χ² (chi-square) calculation (the numbers in parentheses are the expected counts,
calculated from the data distribution in the two categories):
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
Since the χ² value is very large, gender and preferred reading are strongly
correlated in the surveyed group.
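A minimal sketch (assuming SciPy) of the same chi-square test; correction=False disables Yates' continuity correction so the result matches the hand calculation:

```python
# Minimal sketch: chi-square test of independence on the gender/reading table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # fiction:     male, female
                     [50, 1000]])   # non-fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)     # ~507.93 with 1 degree of freedom
print(expected)      # [[90, 360], [210, 840]]
print(p_value)       # far below 0.05, so the attributes are not independent
```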
Data Transformation: Normalization
◼ Min‐max normalization: to [new_minA, new_maxA]
◼ Performs a linear transformation on the original data:
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
◼ Example: Let income range from Rs. 12,000 to Rs. 98,000, normalized to [0, 1].
Then Rs. 73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
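A minimal Python sketch of min-max normalization applied to the income example (the helper name is made up for illustration):

```python
# Minimal sketch: min-max normalization to a new range [new_min, new_max].
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
```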
Normalization (cont…)
◼ Z‐score normalization (or zero‐mean normalization)
◼ Let μA be the mean and σA the standard deviation of attribute A:
  $v' = \frac{v - \mu_A}{\sigma_A}$
◼ Example: Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to
  $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
◼ Useful when the actual min and max values of the attribute are
unknown, or when there are outliers that dominate a min‐max
normalization.
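A minimal Python sketch of z-score normalization; the example income column is made up for illustration:

```python
# Minimal sketch: z-score normalization with given or estimated statistics.
import numpy as np

def z_score(v, mean, std):
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))   # 1.225

# Estimating the mean and standard deviation from a column instead:
incomes = np.array([30_000, 54_000, 61_000, 73_600, 98_000])  # made-up values
z = (incomes - incomes.mean()) / incomes.std()
print(z)
```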
Normalization (cont…)
◼ Normalization by decimal scaling
◼ Normalizes data values by moving the decimal point
◼ The number of places the decimal point is moved depends on the
maximum absolute value of A
◼ A value v is normalized to v' by computing
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
◼ Example: A ranges from ‐986 to 97. The maximum absolute value is 986,
so j = 3 (i.e., divide each value by 1,000)
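A minimal Python sketch of normalization by decimal scaling; the values other than −986 and 97 are made up within the stated range:

```python
# Minimal sketch: decimal scaling, v' = v / 10**j for the smallest j
# such that max(|v'|) < 1.
def decimal_scale(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 45, 97])
print(j, scaled)    # j = 3 -> [-0.986, 0.045, 0.097]
```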
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization and concept hierarchy generation
◼ Summary
Data Reduction Techniques
◼ Why data reduction?
◼ A database/data warehouse may store terabytes of data
◼ Complex data analysis/mining may take a very long time to run on the complete data set
◼ What is data reduction?
◼ A process to obtain a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results
◼ Data reduction techniques:
◼ Numerosity reduction: e.g., fit data into models
◼ Dimensionality reduction: e.g., remove unimportant attributes
◼ Discretization
Numerosity Reduction
◼Reduce data volume by choosing alternative,
smaller forms of data representation
◼Parametric methods
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
◼ Non‐parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling
Dimensionality Reduction: Attribute Subset Selection
◼ Feature selection (i.e., attribute subset selection):
◼ Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
◼ Reduce the # of patterns, making them easier to understand
Data Preprocessing
◼ Why preprocess the data?
◼ Descriptive data summarization
◼ Data cleaning
◼ Data integration and transformation
◼ Data reduction
◼ Discretization
◼ Summary
Discretization
◼ Three types of attributes:
◼ Nominal — values from an unordered set, e.g., color, profession
◼ Ordinal — values from an ordered set, e.g., military or academic
rank
◼ Numeric — integer or real numbers
◼ Discretization:
◼ Divide the range of a continuous attribute into intervals
◼ Some classification algorithms only accept categorical attributes.
◼ Reduce data size by discretization
◼ Prepare for further analysis