Topics To Be Covered
• Why pre-process data?
• Mean, Median, Mode, Range & Standard Deviation
• Attribute Types
• Data Summarization
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Section - 1
Why pre-process data?
Data pre-processing is a data mining technique that transforms raw (real-world) data
into an understandable format.
Real-world data is often incomplete, inconsistent, lacking in certain behaviors or
trends, and likely to contain many errors.
Incomplete: Missing attribute values, lacking certain attributes of interest, or containing only
aggregate data.
E.g. Occupation = " "
Noisy: Containing errors or outliers.
E.g. Salary = "abcxy"
Inconsistent: Containing discrepancies in codes or names.
E.g. "Gujarat" & "Gujrat" (common mistakes in spelling, grammar, articles)
Why pre-process data? (Cont.)
No quality data, no quality results.
This is the principle of Garbage In, Garbage Out (GIGO).
Mode
Example data: 12, 15, 11, 11, 7, 12, 12, 15, 11, 10, 7, 14, 13
[Slide figure: example sets labeled Mode = 13; Mode = 11, 12 (Bimodal); No Mode]
If more than two values tie for the highest frequency within a set of numbers, the set is called
multimodal.
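The mode labels above can be checked with a short sketch. The helper name `modes` is hypothetical, and two conventions are assumptions rather than statements from the slides: a set in which no value repeats is reported as having no mode, and three or more tied values are reported as multimodal.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, sorted.

    Assumed convention: if no value repeats, the set has no mode.
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

data = [12, 15, 11, 11, 7, 12, 12, 15, 11, 10, 7, 14, 13]
found = modes(data)

# Name the result by the number of modes found (assumed labels).
kind = {0: "no mode", 1: "unimodal", 2: "bimodal"}.get(len(found), "multimodal")
print(found, kind)  # [11, 12] bimodal
```

In this data 11 and 12 each occur three times, which is why the result is bimodal.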
Range
The range of a set of data is the difference between the largest and the smallest
number in the set.
Example
Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
Range = 55 - 26 = 29
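As a quick check of the range example, a minimal sketch using only built-in functions (the variable names are illustrative):

```python
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]

# Range = largest value - smallest value
data_range = max(data) - min(data)
print(max(data), min(data), data_range)  # 55 26 29
```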
Standard Deviation (σ)
The owner of an Indian restaurant wants to know how much people spend at the restaurant.
He examines 8 randomly selected receipts for parties and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39
1. Find the mean (the mean is 49.5 for the given data).
2. Write a table that subtracts the mean from each observed value.
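The two steps above, and the usual remaining steps of the calculation, can be sketched as follows. The later steps (squaring the deviations, averaging them, taking the square root) are the standard definition of standard deviation and are an assumption here, since they are not shown in the text above. Because it is not stated whether the population form (divide by n) or the sample form (divide by n - 1) is intended, the sketch prints both.

```python
import math

receipts = [44, 50, 38, 96, 42, 47, 40, 39]
n = len(receipts)

# Step 1: mean
mean = sum(receipts) / n

# Step 2: table of (value, value - mean), extended with the squared deviation
deviations = [x - mean for x in receipts]
squared = [d ** 2 for d in deviations]
for x, d, s in zip(receipts, deviations, squared):
    print(f"{x:>3} {d:>7.2f} {s:>9.2f}")

# Assumed remaining steps: sum of squares, variance, square root
sum_sq = sum(squared)
population_sd = math.sqrt(sum_sq / n)    # divide by n
sample_sd = math.sqrt(sum_sq / (n - 1))  # divide by n - 1

print(f"mean={mean}, sum of squares={sum_sq}")
print(f"population sd={population_sd:.3f}, sample sd={sample_sd:.3f}")
```

Because the receipts are described as randomly selected, the sample form may be the one intended, but that is a guess.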
Attribute Types
1. Quantitative
• Discrete
• Continuous
2. Qualitative
• Nominal
• Ordinal
• Binary
o Symmetric
o Asymmetric
2. Qualitative Attribute
Qualitative data deals with characteristics and descriptors that cannot easily be
measured but can be observed subjectively, such as smells, tastes, textures,
attractiveness, and color.
Qualitative attributes are named or described in words; quantitative attributes, by contrast,
are represented by integer or real values.
Results for qualitative attributes are often quoted on scales.
The qualitative attribute types are:
• Nominal
• Ordinal
• Binary
o Symmetric
o Asymmetric
2. Qualitative Attribute (Cont.)
1) Nominal Attribute
Nominal attributes are named attributes whose values can be separated into discrete (individual)
categories that do not overlap.
The values of a nominal attribute are also called distinct values.
2) Ordinal Attribute
An ordinal attribute is one whose values have a meaningful, significant order, but the differences
between successive values are not really known.
Example
Rankings 1st, 2nd, 3rd
Ratings (e.g. star ratings)
We know that a 5-star rating is better than a 2-star or 3-star rating, but we do not know, and
cannot quantify, how much better it is.
3) Binary Attribute
Binary attributes are categorical attributes with only two possible values: (yes or no), (true or
false), (0 or 1).
A symmetric binary attribute is one in which both values are equally important (e.g. male or female).
Here the male value is not more important than the female value.
An asymmetric binary attribute is one in which the two states are not equally important. For example,
in a medical test (positive or negative), a positive result is more significant than a negative one.
Extra Attribute Types
Interval Attribute
An interval attribute takes numerical values for which the difference between points is
meaningful.
Example
Temperature 10°-20°, 30°-50°, 35°-45°
Calendar Dates 15th - 22nd, 10th - 30th
Interval attributes have no true (absolute) zero value.
Ratio Attribute
A ratio attribute is like an interval attribute, but it must have a true (absolute) zero value.
It tells us about the order and the exact difference between units or data values.
Example
Age Group 10-20, 30-50, 35-45 (in years)
Mass 20-30 kg, 10-15 kg
Because it has a true (absolute) zero, it is possible to compute ratios.
Section - 4
Why Data Summarization?
We live in a digital world where data is transferred in seconds, much faster than a
human can process it.
In the corporate field, employees work with huge volumes of data derived from
different sources such as social networks, media, newspapers, books, cloud storage,
etc.
This volume can make the data difficult to summarize.
The volume is also often unexpected: when you retrieve data from relational sources,
you cannot predict how much data is stored in the database.
As a result, the data becomes more complex and takes more time to summarize.
What is Data Summarization?
Summarization is a key data mining concept which involves techniques for finding a
compact description of a dataset.
It is aimed at extracting useful information and general trends from the raw data.
Two methods for data summarization are tables and graphs.
Tables are a row-and-column representation of the dataset, to which aggregate functions can be applied.
Graphs show the relation between variable quantities, typically two variables, each
measured along one of a pair of axes at right angles.
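To illustrate applying aggregate functions to a table, here is a minimal sketch in plain Python. The rows, the column names `region` and `sales`, and the choice of aggregates are all hypothetical, made up for this example.

```python
from collections import defaultdict

# Hypothetical table: one dict per row.
rows = [
    {"region": "North", "sales": 120},
    {"region": "South", "sales": 80},
    {"region": "North", "sales": 100},
    {"region": "South", "sales": 60},
    {"region": "East", "sales": 90},
]

# Group the sales values by region.
groups = defaultdict(list)
for row in rows:
    groups[row["region"]].append(row["sales"])

# Apply aggregate functions to each group to get a compact summary table.
summary = {
    region: {
        "count": len(values),
        "sum": sum(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    for region, values in groups.items()
}

for region, stats in summary.items():
    print(region, stats)
```

Five raw rows collapse into three summary rows, which is the "compact description" that summarization aims for.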
Section - 4
Data Cleaning
1. Fill in missing values
   1. Ignore the tuple
   2. Fill in the missing value manually
   3. Fill in the missing value automatically
   4. Use a global constant to fill in the missing value
2. Identify outliers and smooth out noisy data
   1. Binning Method
   2. Regression
   3. Clustering
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
1) Fill in missing values
Ignore the tuple (record/row):
• Usually done when the class label is missing.
• Example
o The task is to distinguish between two types of emails, "spam" and "non-spam" (ham).
o Spam and non-spam are the class labels.
o If an email arrives with its class label missing, it is discarded.
Fill in the missing value manually:
• Use the attribute mean (average) to fill in the missing value, or use the attribute mean
(average) of all samples belonging to the same class.
Fill in the missing value automatically:
• Predict the missing value by using a learning algorithm:
o Consider the attribute with the missing value as a dependent variable and run a learning algorithm
(usually Naive Bayes or Decision tree) to predict the missing value.
Use a global constant to fill in the missing value:
• Replace all missing attribute values by the same constant, such as a label like “Unknown”.
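Several of the strategies above can be sketched in a few lines. The records, the field names `label` and `length`, and all helper names are hypothetical. Prediction with a learning algorithm is not shown.

```python
# Hypothetical email records; None marks a missing value.
records = [
    {"label": "spam", "length": 120},
    {"label": "spam", "length": 100},
    {"label": "spam", "length": None},
    {"label": "ham", "length": 40},
    {"label": "ham", "length": 60},
    {"label": "ham", "length": None},
    {"label": None, "length": 75},
]

def drop_unlabeled(rows):
    """Ignore the tuple: discard rows whose class label is missing."""
    return [r for r in rows if r["label"] is not None]

def average(values):
    return sum(values) / len(values)

def fill_with_mean(rows, attr):
    """Fill missing values with the mean of the known values of attr."""
    m = average([r[attr] for r in rows if r[attr] is not None])
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr):
    """Fill missing values with the mean of attr within the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r["label"], []).append(r[attr])
    means = {label: average(vals) for label, vals in by_class.items()}
    return [{**r, attr: means[r["label"]] if r[attr] is None else r[attr]}
            for r in rows]

def fill_with_constant(rows, attr, constant="Unknown"):
    """Fill missing values with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

labeled = drop_unlabeled(records)
overall = fill_with_mean(labeled, "length")
per_class = fill_with_class_mean(labeled, "length")
constant = fill_with_constant(labeled, "length")

print([r["length"] for r in overall])    # missing values become 80.0
print([r["length"] for r in per_class])  # spam gap -> 110.0, ham gap -> 50.0
print([r["length"] for r in constant])   # missing values become "Unknown"
```

The class-wise mean keeps the filled values closer to their own group (110 for spam, 50 for ham) than the overall mean of 80 does.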
2) Identify outliers and smooth out noisy data
There are three data smoothing techniques:
1. Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values
around it.
2. Regression:
Regression smooths data by fitting the data values to a function.
Linear regression involves finding the "best" line to fit two attributes (or variables) so that one
attribute can be used to predict the other.
3. Outlier analysis:
Outliers may be detected by clustering, where similar values are organized into
groups, or "clusters".
Values that fall outside the set of clusters may be considered outliers.
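The linear regression technique (item 2 above) can be sketched with an ordinary least-squares fit. Taking the "best" line to mean the least-squares line is an assumption, the helper name `fit_line` is hypothetical, and the data values are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative noisy observations of two attributes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value the fitted line predicts.
smoothed = [slope * x + intercept for x in xs]

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print([round(v, 2) for v in smoothed])
```

The fitted line can also predict the second attribute for an unseen value of the first, for example `slope * 6 + intercept`.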
1. Binning Method
The binning method is a top-down splitting technique based on a specified number of
bins.
The data is first sorted, and the sorted values are then distributed into a
number of buckets, or bins.
For example, attribute values can be discretized (separated) by applying equal-width
or equal-frequency binning and then replacing each value by the bin mean, the bin median,
or the closest bin boundary.
It can be applied recursively to the resulting partitions to generate concept
hierarchies.
It does not use class information and is therefore called an unsupervised discretization
technique.
It is used to minimize the effects of small observation errors.
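Equal-frequency binning with smoothing by bin means and by bin boundaries can be sketched as follows. The helper names are hypothetical, the data values are illustrative (not taken from the slides), and the sketch assumes the number of bins does not exceed the number of values.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max (ties go to min)."""
    result = []
    for b in bins:
        low, high = b[0], b[-1]
        result.append([low if v - low <= high - v else high for v in b])
    return result

data = [8, 21, 4, 34, 15, 25, 21, 28, 24]

bins = equal_frequency_bins(data, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Equal-width binning would instead split the value range (here 4 to 34) into intervals of equal length, so the bins could hold different numbers of values.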