Data Preprocessing
CC19 – Data Mining
Agenda
• Definition of Data Preprocessing
• Types of Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Normalization
• Data Reduction
• Steps of Data Preprocessing
Defining Data Preprocessing
• Data preprocessing is a key step in data mining that involves
modifying data to prepare it for analysis.
• This allows the data to better fit different data mining analysis
techniques and tools.
• Different techniques can be utilized depending on the type of data
being analyzed.
Defining Data Preprocessing
• Preparing data is important to ensure that large datasets can be
processed more easily.
• While more data is available for analysis compared to before, a lot of
that data is “dirty”.
• The data collected via data collection techniques can also be
inconsistent in format and quality.
Defining Data Preprocessing
• Many data mining techniques rely on data that is complete
or noise-free.
• Unfortunately for us, real-world data is rarely clean or complete.
• This is another reason why we need to preprocess data to make it
usable for data mining tools.
Types of Data Preprocessing
• Listed below are common techniques for data preprocessing:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Normalization
• Data Reduction
Types of Data Preprocessing – Data Cleaning
• Data cleaning involves correcting bad data, filtering incorrect data, or
reducing unnecessary data details.
• It is a general technique that is commonly used with other techniques.
• Treatment of missing and noisy data is also included here.
Types of Data Preprocessing – Data Cleaning
• Data cleaning involves
identifying and correcting errors
and inconsistencies in the data.
• These errors can involve missing
values, outliers, and duplicates.
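A minimal sketch of the three error types above, using only the Python standard library. The records, the `clean` function name, the median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration; other datasets may call for different choices.

```python
from statistics import median, quantiles

# Hypothetical raw records: (id, age). None marks a missing value.
raw = [(1, 34), (2, None), (3, 29), (3, 29), (4, 31), (5, 420), (6, 38)]

def clean(records):
    # 1. Duplicates: drop exact repeats while keeping the original order.
    deduped = list(dict.fromkeys(records))

    # 2. Missing values: fill with the median of the observed ages.
    observed = [age for _, age in deduped if age is not None]
    fill = median(observed)
    imputed = [(rid, fill if age is None else age) for rid, age in deduped]

    # 3. Outliers: drop ages outside 1.5 * IQR beyond the quartiles.
    q1, _, q3 = quantiles([age for _, age in imputed], n=4)
    spread = 1.5 * (q3 - q1)
    return [(rid, age) for rid, age in imputed
            if q1 - spread <= age <= q3 + spread]

cleaned = clean(raw)
print(cleaned)
```

Here the repeated record for id 3 is removed, the missing age for id 2 is filled in, and the implausible age of 420 is dropped.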
Types of Data Preprocessing – Data Integration
• Data integration involves merging data from multiple data sources.
• This should include steps to reduce redundancies and inconsistencies
in your data set.
• Techniques involved here include identification and unification of
variables and domains.
Types of Data Preprocessing – Data Integration
• Data integration can be
challenging as it requires
combining data from different
sources with different formats,
structures, and semantics.
• Techniques used here can
include record linkage and data
fusion.
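As a rough illustration of record linkage, the sketch below merges two invented sources that describe the same customers with different field names and key formats. The source names (`crm`, `billing`), the fields, and the choice of a normalized email address as the linking key are assumptions, not part of the lecture material.

```python
# Two hypothetical sources with different field names and formats.
crm = [
    {"email": "Ana@Example.com ", "name": "Ana Cruz"},
    {"email": "ben@example.com", "name": "Ben Reyes"},
]
billing = [
    {"EMAIL": "ana@example.com", "total_spent": 120.0},
    {"EMAIL": "carl@example.com", "total_spent": 45.5},
]

def normalize_key(email):
    # Unify the key domain: trim whitespace and ignore letter case.
    return email.strip().lower()

def integrate(crm_rows, billing_rows):
    merged = {}
    for row in crm_rows:
        key = normalize_key(row["email"])
        merged[key] = {"email": key, "name": row["name"], "total_spent": None}
    for row in billing_rows:
        key = normalize_key(row["EMAIL"])
        # Link to an existing record, or create one if the key is new.
        record = merged.setdefault(
            key, {"email": key, "name": None, "total_spent": None}
        )
        record["total_spent"] = row["total_spent"]
    return list(merged.values())

for record in integrate(crm, billing):
    print(record)
```

Records found in only one source keep `None` for the fields the other source would have supplied, which makes the remaining gaps visible for the cleaning step.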
Types of Data Preprocessing – Data Transformation
• Data transformation involves converting data so that the mining
process can be more efficient.
• These are typically composed of different tasks that are dependent on
the type of data being transformed.
• Some data transformation techniques might not work if the data used
is incompatible.
Types of Data Preprocessing – Data Transformation
• Data transformation techniques
include smoothing, feature
construction, aggregation, or
summarization.
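Two of the techniques named above, smoothing and aggregation, can be sketched as follows. The function names, the window size, and the sample values are hypothetical; a simple moving average is only one of several possible smoothing methods.

```python
from collections import defaultdict

def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def aggregate_totals(records):
    """Aggregate (group, amount) pairs into one total per group."""
    totals = defaultdict(float)
    for group, amount in records:
        totals[group] += amount
    return dict(totals)

# Hypothetical readings with one spike that smoothing dampens.
print(moving_average([3, 6, 9, 30, 9]))

# Hypothetical sales records aggregated per month.
sales = [("2024-01", 100), ("2024-01", 50), ("2024-02", 70)]
print(aggregate_totals(sales))
```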
Types of Data Preprocessing – Data Normalization
• Data normalization involves scaling data to a common range.
• Normalizing the data attempts to give all attributes equal weight to
make them easier to analyze.
• This is done because the measurement units of the attributes can
affect the data analysis.
Types of Data Preprocessing – Data Normalization
• All attributes in the data mining
process should be expressed in
the same measurement units and
should use a common scale or
range.
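Two common ways to bring attributes onto a common scale are min-max scaling and z-score standardization. The sketch below assumes each attribute is a plain list of numbers; the function names and sample values are illustrative only.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map it to new_min.
        return [new_min for _ in values]
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]

def z_score(values):
    """Center values on their mean and divide by the population std. dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

print(min_max([50, 60, 70, 100]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```

Min-max scaling keeps values within fixed bounds but is sensitive to outliers, while z-scores have no fixed range but are less affected by a single extreme value.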
Types of Data Preprocessing – Data Reduction
• Data reduction comprises techniques which obtain a reduced
representation of the original data.
• The reduced data maintains the essential structure and integrity of
the original data but is smaller.
• This is done because many data mining algorithms become much slower
as the amount of data they process grows.
Types of Data Preprocessing – Data Reduction
• There are three common types of data reduction methods:
• Feature selection
• Instance selection
• Discretization
Types of Data Preprocessing – Data Reduction
Feature Selection
• This achieves the reduction of
data by removing irrelevant or
redundant features.
• This aims to find a minimum set
of attributes.
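One simple filter-style approach is sketched below: it drops constant features (irrelevant) and any feature that is almost perfectly correlated with one already kept (redundant). The column names, the data, and the 0.95 threshold are assumptions, and this greedy filter does not guarantee a truly minimal attribute set.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(columns, corr_threshold=0.95):
    kept = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant column: carries no information
        if any(abs(pearson(values, columns[k])) >= corr_threshold
               for k in kept):
            continue  # nearly duplicates a feature that is already kept
        kept.append(name)
    return kept

# Hypothetical dataset stored column by column.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # same information as height_cm
    "sensor_id": [7, 7, 7, 7],              # constant
    "weight_kg": [70, 52, 81, 60],
}
print(select_features(columns))
```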
Types of Data Preprocessing – Data Reduction
Instance Selection
• This looks at choosing a subset
of the total available data to
achieve the original purpose of
data mining.
• It works in a similar manner to
statistical sampling methods.
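Because instance selection resembles statistical sampling, a stratified random sample is one way to sketch it: each class keeps roughly the same share in the subset as in the full data. The function name, the `fraction` parameter, and the invented transaction data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample about `fraction` of the rows from each class separately."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one instance so no class disappears entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset: 80 normal and 20 fraudulent transactions.
rows = [(i, "ok") for i in range(80)] + [(i, "fraud") for i in range(80, 100)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.1)
print(len(subset), sum(1 for r in subset if r[1] == "fraud"))
```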
Types of Data Preprocessing – Data Reduction
Discretization
• This transforms quantitative
(numerical) data into qualitative
(nominal) data.
• An association between each
interval and a discrete numerical
value is then established.
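Equal-width binning is one straightforward discretization method: the value range is split into intervals of the same width, and each value is replaced by the index of its interval. The function name and the sample ages are hypothetical, and other schemes (such as equal-frequency binning) may suit skewed data better.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index (0 .. n_bins - 1) of its interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # all values fall in a single interval
    width = (hi - lo) / n_bins
    # The maximum value would land on the upper edge, so clamp the index.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 33, 41, 58, 78]
print(equal_width_bins(ages, 3))  # intervals: [18, 38), [38, 58), [58, 78]
```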
Types of Data Preprocessing
• To summarize how these data preprocessing tools work:
• Data Cleaning – How do I clean up the data?
• Data Integration – How do I incorporate and adjust data?
• Data Transformation – How do I provide accurate data?
• Data Normalization – How do I unify and scale data?
• Data Reduction – How do I select the best features of my data?
Steps in Data Preprocessing
• These are the general steps to consider when doing data preprocessing:
• Assess your Data Quality
• Clean your Data
• Transform your Data
• Reduce your Data
• Further Process your Data
Steps in Data Preprocessing
Assess your Data Quality
• Start by looking at your data to get an idea of its overall quality.
• This is where you look at your data collection results and determine
what issues your data may have.
• Once you have identified issues, you then need to determine which
data preprocessing techniques to use.
Steps in Data Preprocessing
Assess your Data Quality
• These are common issues you might need to look at in your data:
• Mismatched Data Types
• Mixed Data Values
• Outliers
• Missing Data
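A first quality check can be as simple as counting problems per column. The sketch below covers missing data, mismatched data types, and mixed data values; the function names, the convention that `None` or an empty string means "missing", and the sample columns are all assumptions.

```python
def assess_column(values, expected_type):
    """Count missing entries and entries of an unexpected type."""
    missing = sum(1 for v in values if v is None or v == "")
    wrong_type = sum(
        1 for v in values
        if v is not None and v != "" and not isinstance(v, expected_type)
    )
    return {"missing": missing, "wrong_type": wrong_type}

def distinct_values(values):
    """List the distinct non-missing values to reveal mixed encodings."""
    return sorted({v for v in values if v is not None and v != ""})

# Hypothetical columns from a survey export.
ages = [34, "29", None, 41, ""]
genders = ["M", "Male", "F", "male", "F", None]

print(assess_column(ages, int))
print(distinct_values(genders))
```

Seeing both "M" and "Male" in the distinct values signals mixed data values that should be unified during cleaning.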
Steps in Data Preprocessing
Clean your Data
• Generally, you want to clean your data as your first
preprocessing method.
• This is because it removes useless, unrelated, corrupted, or incorrect
data which can interfere with other steps.
• This can be done manually by deleting files or automatically with code or
tools.
Steps in Data Preprocessing
Transform your Data
• This is where your data is transformed into a format suitable for
your data analysis tools.
• How you transform your data will depend on what tool you are using
and what analysis you will perform.
• This involves steps such as normalization to further enhance the data.
Steps in Data Preprocessing
Reduce your Data
• You will then want to reduce the size of your overall dataset as
needed to make analysis easier.
• This may not be needed for small datasets but becomes important for
larger datasets.
• This helps ensure that your data analysis process does not become too
slow or infeasible.
Steps in Data Preprocessing
Further Process your Data
• You will need to determine if your current data preprocessing
steps are sufficient.
• This is typically done after data analysis to check if the data
preprocessing enhanced the results.
• You can add or remove preprocessing methods if you find that they are
not effective for your dataset.
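One way to check whether a preprocessing step helped is to run the same analysis with and without it and compare the results. The sketch below does this for min-max normalization using leave-one-out accuracy of a 1-nearest-neighbor classifier. The dataset is deliberately contrived (income is unrelated to the label but has a much larger scale than hours), and all names are hypothetical.

```python
from math import dist

def min_max_columns(rows):
    """Rescale every column of a list of numeric tuples to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1 for c in cols]  # avoid division by zero
    return [
        tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]

def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, point in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        nearest = min(others, key=lambda j: dist(point, features[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(features)

# Hypothetical data: (income, hours of use per week) and a usage label.
features = [(20000, 1), (35000, 2), (50000, 1),
            (21000, 9), (36000, 10), (51000, 9)]
labels = ["low", "low", "low", "high", "high", "high"]

print("raw:       ", loo_accuracy(features, labels))
print("normalized:", loo_accuracy(min_max_columns(features), labels))
```

On this toy data the raw distances are dominated by income, so normalization improves the score; on a real dataset the comparison could just as well show that a step makes no difference and can be removed.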
References
• Data Preprocessing in Data Mining – GeeksforGeeks
• Data Preprocessing in Data Mining.pdf (dstu.dp.ua)
• What Is Data Preprocessing & What Are The Steps Involved? (monkeylearn.com)
• Data Preprocessing: Definition, Key Steps and Concepts (techtarget.com)
• A survey on data preprocessing for data stream mining: Current status
and future directions (ugr.es)