UNIT II
UNIT II: Data Preprocessing: An Overview, Data Cleaning, Data Integration, Data
Reduction, Data Transformation and Data Discretization.
Data preprocessing is the process of preparing raw data for analysis by cleaning and
transforming it into a usable format. In data mining it refers to preparing raw data for mining
by performing tasks like cleaning, transforming, and organizing it into a format suitable for
mining algorithms.
Data preprocessing is important because raw data is often messy, inconsistent, and unsuitable
for direct use in analysis or model training. Without it, your results can be misleading,
inaccurate, or completely useless.
1. Ensures Data Quality
Real-world datasets often contain missing values, duplicates, or outliers.
Preprocessing detects and fixes these issues, ensuring your model works on clean and
reliable data.
2. Improves Accuracy of Models
Machine learning algorithms are sensitive to noise, irrelevant features, and scale
differences.
By normalizing, standardizing, or selecting features, preprocessing improves prediction
accuracy.
3. Handles Inconsistencies
Different sources may store data in different formats (e.g., “Male/Female” vs. “M/F”).
Preprocessing makes data consistent and comparable.
4. Reduces Complexity
Dimensionality reduction (e.g., PCA) or feature selection removes redundant or
irrelevant data, making models faster and easier to train.
5. Prevents Bias from Bad Data
Incorrect or unbalanced datasets can bias models toward wrong conclusions.
Preprocessing ensures fairer, more representative datasets.
6. Saves Time in the Long Run
Clean, well-structured data is easier to reuse and share for multiple experiments.
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data Integration, Data
Transformation, and Data Reduction.
Data Cleaning: It is the process of identifying and correcting errors or inconsistencies in the
dataset. It involves handling missing values, removing duplicates, and correcting incorrect or
outlier data to ensure the dataset is accurate and reliable. Clean data is essential for effective
analysis, as it improves the quality of results and enhances the performance of data models.
Missing Values: These occur when data is absent from a dataset. You can either ignore
the rows with missing data or fill the gaps manually, with the attribute mean, or with
the most probable value. This keeps the dataset accurate and complete for analysis.
Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to
interpret, often caused by errors in data collection or entry. It can be handled in several
ways:
o Binning Method: The sorted data is divided into equal-sized segments (bins), and
each bin is smoothed by replacing its values with the bin mean or the bin boundary values.
o Regression: Data can be smoothed by fitting it to a regression function, either
linear or multiple, to predict values.
o Clustering: This method groups similar data points together; values that fall
outside the clusters can be treated as outliers (noise) and removed. These techniques
help remove noise and improve data quality.
Removing Duplicates: It involves identifying and eliminating repeated data entries
to ensure accuracy and consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.
Data Integration
It involves merging data from various sources into a single, unified dataset. It can be
challenging due to differences in data formats, structures, and meanings. Techniques like
record linkage and data fusion help in combining data efficiently, ensuring consistency and
accuracy.
Data Transformation: It involves converting data into a format suitable for analysis.
Common techniques include normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit variance; and discretization,
which converts continuous data into discrete categories. These techniques help prepare the
data for more accurate analysis.
Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
Discretization: Converting continuous data into discrete categories for easier
analysis.
Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis.
Data Reduction: It reduces the dataset's size while maintaining key information. This can
be done through feature selection, which chooses the most relevant features, and feature
extraction, which transforms the data into a lower-dimensional space while preserving
important details. It uses various reduction techniques such as:
Dimensionality Reduction (e.g., Principal Component Analysis): A technique that
reduces the number of variables in a dataset while retaining its essential information.
Numerosity Reduction: Reducing the number of data points by methods like
sampling to simplify the dataset without losing critical patterns.
Data Compression: Reducing the size of data by encoding it in a more compact form,
making it easier to store and process.
Data cleaning: Data cleaning (also called data cleansing or data scrubbing) is the process
of detecting and correcting (or removing) errors, inconsistencies, and inaccuracies in a
dataset so it becomes accurate, complete, and reliable for analysis.
Importance & Benefits of Data Cleaning
1. Improves Data Accuracy
Removes errors, duplicates, and inconsistencies so your dataset truly reflects reality.
Example: Changing “Indai” → “India” ensures correct analysis.
Benefit: Reliable results and better decision-making.
2. Enhances Model Performance
Clean, consistent data helps machine learning algorithms learn patterns more
effectively.
Noisy or inconsistent data can confuse the model.
Benefit: Higher accuracy, precision, and recall.
3. Improves Consistency Across Systems
Standardizing formats (dates, text, units) ensures smooth integration with other
systems.
Benefit: Easier merging of datasets from different sources.
4. Helps in Compliance & Governance
Clean, well-maintained datasets are essential for following legal requirements like
GDPR or HIPAA.
Benefit: Avoids legal risks and penalties.
Sources of Missing Values
Missing values occur when certain data points are not recorded, lost, or not applicable. They
can be introduced at data collection, entry, storage, or processing stages.
Data Entry Errors
Data Collection Problems
Data Loss During Transfer or Storage
Inapplicable or Irrelevant Data
Equipment or Sensor Failure
Data Filtering or Cleaning
Privacy or Confidentiality Restrictions
Data Cleaning Steps:
1) Data Inspection (Identify Problems)
Understand the dataset structure and quality.
Check for missing values, duplicates, outliers, inconsistencies.
Example: Using df.info() and df.describe() in Python to spot anomalies.
2) Remove or Handle Duplicate Data
Detect and delete identical records.
Example: df.drop_duplicates() in Pandas.
3) Handle Missing Values
Methods:
o Remove rows/columns with too many missing values.
o Fill in (impute) using mean, median, mode, or prediction models.
Example: Filling missing "age" with the median value.
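A minimal pandas sketch of these imputation choices (the small "age" column below is made up for illustration):
import pandas as pd
# Hypothetical column with missing ages
df = pd.DataFrame({'age': [25, None, 32, 40, None, 29]})
df['age_median'] = df['age'].fillna(df['age'].median())  # impute with the median
df['age_mean'] = df['age'].fillna(df['age'].mean())      # impute with the mean
print(df)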
4) Correct Structural Errors
Fix typos, inconsistent capitalization, wrong formats.
Example: “M”, “Male”, “male” → “Male”.
5) Handle Outliers
Detect using statistical methods (Z-score, IQR).
Decide to remove, cap, or transform them.
Example: Removing height values like 500 cm.
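A small sketch of IQR-based outlier removal (the "height_cm" values are invented; 500 cm plays the role of the outlier):
import pandas as pd
df = pd.DataFrame({'height_cm': [160, 172, 168, 500, 175, 158]})
q1, q3 = df['height_cm'].quantile([0.25, 0.75])
iqr = q3 - q1
# Keep only values inside the usual 1.5 * IQR fences; the 500 cm entry is dropped
mask = df['height_cm'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])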
6) Standardize Data Formats
Make sure all dates, numbers, and text follow the same format.
Example: Convert MM/DD/YYYY to YYYY-MM-DD.
7) Validate Data Consistency
Check relationships between columns.
Example: If Purchase Date < Registration Date, it’s invalid.
8) Final Review & Export
Verify that data is clean.
Save cleaned data for analysis or modelling.
Handling Noisy Data:
Binning
Data binning or bucketing is a data preprocessing method used to minimize the effects of
small observation errors. The original data values are divided into small intervals known as
bins and then they are replaced by a general value calculated for that bin. This has a
smoothing effect on the input data and may also reduce the chances of overfitting in the case
of small datasets.
Types of Binning Techniques
Binning can be broadly categorized into two main types based on how the bins are defined:
1. Equal-Width Binning
Each bin has an equal width, determined by dividing the range of the data into n intervals.
Formula: Width = (max − min) / n
Advantages: Simple to implement and easy to understand.
Disadvantages: May result in bins with highly uneven data distribution.
2. Equal-Frequency Binning
Each bin contains approximately the same number of data points.
Advantages: Ensures balanced bin sizes, avoiding sparse bins.
Disadvantages: The bin width may vary significantly.
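Both types can be sketched with pandas (the sample values are made up; pd.cut gives equal-width bins, pd.qcut gives equal-frequency bins):
import pandas as pd
values = pd.Series([3, 6, 8, 5, 10, 12, 14, 11, 13, 40])
equal_width = pd.cut(values, bins=3)   # same interval width, possibly uneven counts
equal_freq = pd.qcut(values, q=3)      # roughly the same number of points per bin
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())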
Applications of Binning
1. Data Preprocessing: Often used to prepare data for machine learning models by
converting continuous variables into categorical ones.
2. Anomaly Detection: Helps identify anomalies or outliers by binning data and
analyzing the distributions.
3. Data Visualization: Used in histograms and bar charts to represent the frequency
distribution of data.
4. Feature Engineering: Creates categorical features that can enhance the performance
of certain machine learning models.
Data Integration:
Data integration is the process of combining data from multiple sources into a unified view
so it can be used for analysis, reporting, or machine learning.
It’s a key step in data preprocessing when working with datasets from different places.
Challenges in Data Integration
1. Schema Differences
o Same information stored in different formats or field names.
o Example: "Customer_ID" in one file vs. "CustID" in another.
2. Data Redundancy
o Duplicate records across sources.
o Example: Same customer record in both sales and marketing databases.
3. Data Inconsistency
o Conflicting values for the same attribute.
o Example: Customer age is 30 in one dataset and 32 in another.
4. Different Units or Scales
o Example: Weight recorded in pounds vs. kilograms.
5. Missing Values
o Some sources may not have certain attributes.
Example:
import pandas as pd
from sklearn.datasets import load_iris
# Load iris dataset as DataFrame
iris = load_iris(as_frame=True)
df_iris = iris.frame
# Create dummy sales data (same number of rows = 150)
df_sales = pd.DataFrame({
'OrderID': range(1001, 1001 + len(df_iris)),
'Product': ['Pen', 'Book', 'Pencil'] * 50,
'Amount': [50, 150, 75] * 50
})
# Combine column-wise
combined = pd.concat([df_iris, df_sales], axis=1)
print(combined.head())
Output: the first five rows of the combined DataFrame (the Iris feature columns and target alongside OrderID, Product, and Amount).
Data Reduction:
Data Reduction refers to the process of reducing the volume of data while maintaining its
informational quality. It involves methods for minimizing, summarizing, or simplifying data
while preserving its fundamental properties for storage or analysis.
Dimensionality Reduction:
Dimensionality reduction is the process of reducing the number of input variables (features)
in a dataset while keeping as much useful information as possible.
It’s mainly used to simplify data, speed up processing, and avoid problems like overfitting.
Types of Dimensionality Reduction
Feature Selection
Selecting the most important features and discarding the rest.
Methods:
o Filter methods (Correlation, Chi-square test)
o Wrapper methods (Forward Selection, Backward Elimination)
o Embedded methods (Lasso, Decision Tree feature importance)
Correlation:
Correlation measures the strength and direction of the relationship between two variables.
Values range from –1 to +1.
o +1 → Perfect positive relationship (when one increases, the other increases).
o –1 → Perfect negative relationship (when one increases, the other decreases).
o 0 → No relationship.
Common measure: Pearson correlation coefficient (for numerical data).
Formula for Pearson correlation (r):
r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² × Σ(y_i − ȳ)² )
Example:
Height and weight → High positive correlation (~0.85).
Age and gaming hours → Negative correlation (~–0.40).
Uses in preprocessing:
Detect multicollinearity (too-high correlation) → remove redundant features.
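A quick sketch of computing a Pearson correlation matrix with pandas (using the Iris features as sample data):
import pandas as pd
from sklearn.datasets import load_iris
df = load_iris(as_frame=True).frame.iloc[:, :4]   # the four numeric features
corr_matrix = df.corr(method='pearson')           # pairwise Pearson correlations
print(corr_matrix.round(2))
# Feature pairs with |r| close to 1 (e.g. petal length vs. petal width)
# are candidates for removal as redundant.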
2. Chi-Square Test
The Chi-Square (χ²) test checks whether there is a significant relationship between
two categorical variables.
It compares observed frequencies (from data) with expected frequencies (if there was
no relationship).
Formula:
χ² = Σ (O_i − E_i)² / E_i
Where:
O_i = Observed frequency
E_i = Expected frequency
Steps:
1. Create a contingency table of observed counts.
2. Calculate expected counts assuming no relationship.
3. Apply the formula and compare the result with the Chi-Square critical value (from a
statistical table) or use p-value.
Example:
Does gender (Male/Female) affect product preference (A/B)?
If p-value < 0.05 → Significant relationship.
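A minimal sketch of this test using scipy.stats.chi2_contingency; the gender vs. product-preference counts below are invented for illustration:
import pandas as pd
from scipy.stats import chi2_contingency
# Hypothetical contingency table of observed counts
observed = pd.DataFrame({'Product_A': [30, 20],
                         'Product_B': [10, 40]},
                        index=['Male', 'Female'])
chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square:", round(chi2, 2), "p-value:", round(p_value, 4))
if p_value < 0.05:
    print("Significant relationship between gender and product preference.")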
Feature Extraction
Transforming existing features into a smaller set of new features.
Methods:
o PCA (Principal Component Analysis) → Creates new uncorrelated
variables (principal components).
PCA (Principal Component Analysis): PCA is a dimensionality reduction technique used
to convert a large set of variables into a smaller set of new variables (called principal
components) while preserving as much important information (variance) as possible.
Steps in PCA
1. Standardize the Data
o Ensure all features have the same scale (mean = 0, variance = 1).
2. Compute the Covariance Matrix
o Measures how features vary together.
3. Calculate Eigenvalues & Eigenvectors
o Eigenvalues → amount of variance explained by each principal component.
o Eigenvectors → directions of the principal components.
4. Sort Components by Variance
o Rank them by eigenvalues in descending order.
5. Select Top k Components
o Choose enough components to explain desired variance (e.g., 95%).
6. Transform the Data
o Multiply original data by selected eigenvectors to get the reduced dataset.
Example:
# PCA on Iris Dataset - Complete Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# 1. Load Iris Dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target # Add species as target
species_names = iris.target_names
# 2. Standardize the Features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.iloc[:, :-1]) # exclude target column
# 3. Apply PCA (reduce to 2 components for visualization)
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
# 4. Create a DataFrame for PCA results
df_pca_result = pd.DataFrame(data=df_pca, columns=['PC1', 'PC2'])
df_pca_result['species'] = df['species']
# 5. Explained Variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", sum(pca.explained_variance_ratio_))
# 6. Visualization
plt.figure(figsize=(8,6))
for target, color, label in zip([0, 1, 2], ['red', 'green', 'blue'], species_names):
    subset = df_pca_result[df_pca_result['species'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], c=color, label=label, alpha=0.7, edgecolors='k')
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.2f}% variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.2f}% variance)")
plt.title("PCA on Iris Dataset")
plt.legend()
plt.grid(True)
plt.show()
# 7. Component Loadings (Contribution of each feature)
loadings = pd.DataFrame(pca.components_,
columns=iris.feature_names,
index=['PC1', 'PC2'])
print("\nFeature contributions to each principal component:")
print(loadings)
Data Compression
Data compression is the process of encoding information in fewer bits than it originally
occupied. This is mainly achieved through methods that eliminate duplication and other
extraneous information.
Lossless Data Compression
Lossless data compression guarantees that the decompressed data is identical to the original
data. It works best for text and data files where precision matters.
Huffman coding: Builds a frequency-sorted binary tree to assign shorter codes to more frequent values.
Run-length encoding (RLE): Compresses runs of repeated data values into a value and a count.
Huffman Coding
Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to
more frequent symbols and longer codes to less frequent symbols, reducing overall storage
space.
Working steps:
Count the frequency of each symbol in the data.
Build a priority queue (min-heap) of nodes (each node represents a symbol +
frequency).
Combine the two smallest frequency nodes into a new node (sum of frequencies).
Repeat until there is only one tree (Huffman Tree).
Assign binary codes:
Left branch → 0
Right branch → 1
Encode the data using these variable-length codes.
Example:
Step 1: Count Frequencies
Symbol   Frequency
A        2
B        2
C        1
Step 2: Build Huffman Tree
1. Take smallest two:
o C (1) + A (2) → New node (3)
2. Combine with B (2):
o B (2) + Node(3) → Root (5)
Step 3: Assign Codes
Go Left → 0
Go Right → 1
Tree might look like:
(5)
/ \
B(2) (3)
/ \
C(1) A(2)
Codes:
B → 0
C → 10
A → 11
Step 4: Encode
Text: AABBC
A → 11
A → 11
B → 0
B → 0
C → 10
Encoded:
11 11 0 0 10 → 11110010
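A minimal Python sketch of Huffman coding using heapq; it reproduces the AABBC example above (the exact codes can vary with tie-breaking, but the encoded length stays the same):
import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                      # edge case: only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # smallest frequency -> left branch (0)
        f2, _, right = heapq.heappop(heap)  # next smallest -> right branch (1)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "AABBC"
codes = huffman_codes(text)
print(codes)                                 # e.g. {'B': '0', 'C': '10', 'A': '11'}
print("".join(codes[ch] for ch in text))     # 11110010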
RLE:
Run-Length Encoding compresses data by storing repeated characters as a single value plus
the number of repeats.
It works best on data with many consecutive repeated values (e.g., images with large blocks
of the same color).
Example:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Sort by species to ensure runs
df = df.sort_values(by='species').reset_index(drop=True)
# RLE function
def run_length_encode_column(column):
    encoding = []
    prev = column[0]
    count = 1
    for value in column[1:]:
        if value == prev:
            count += 1
        else:
            encoding.append((prev, count))
            prev = value
            count = 1
    encoding.append((prev, count))
    return encoding
# Run RLE on species
rle_species = run_length_encode_column(df['species'].tolist())
# Convert RLE result to DataFrame
rle_df = pd.DataFrame(rle_species, columns=['species', 'count'])
# --- Visualization 1: Bar Chart of Species Count from RLE ---
plt.figure(figsize=(6, 4))
plt.bar(rle_df['species'], rle_df['count'], color=['green', 'blue', 'red'])
plt.title("RLE: Species Counts in Iris Dataset")
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()
Data Transformation:
Data Transformation is the process of converting data from its original format into a more
useful, consistent, or structured format for analysis, storage, or further processing. It's like
"cleaning and reshaping" your data so it's ready for use in machine learning or statistical
models.
Smoothing: Smoothing removes noise (random variations) from data to reveal patterns more
clearly. One common method is binning with mean smoothing.
Ex: Data: 3, 6, 8, 5, 10, 12, 14, 11, 13
Step 1: Apply Equal-Depth Binning
We split the data into 3 bins of three values each.
Bin 1: 3, 6, 8
Bin 2: 5, 10, 12
Bin 3: 14, 11, 13
Step 2: Mean Smoothing
Replace each value in a bin with the mean of that bin.
Bin 1 mean: (3 + 6 + 8) / 3 = 5.67 → values become 5.67, 5.67, 5.67
Bin 2 mean: (5 + 10 + 12) / 3 = 9 → values become 9, 9, 9
Bin 3 mean: (14 + 11 + 13) / 3 = 12.67 → values become 12.67, 12.67, 12.67
Step 3: Smoothed Data
Original: 3, 6, 8, 5, 10, 12, 14, 11, 13
Smoothed: 5.67, 5.67, 5.67, 9, 9, 9, 12.67, 12.67, 12.67
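The same bin-mean smoothing can be sketched with pandas (three consecutive values per bin, as in the worked example):
import pandas as pd
data = pd.Series([3, 6, 8, 5, 10, 12, 14, 11, 13])
bins = data.index // 3                            # bin id: three values per bin
smoothed = data.groupby(bins).transform('mean')   # replace each value with its bin mean
print(smoothed.round(2).tolist())                 # [5.67, 5.67, 5.67, 9.0, 9.0, 9.0, 12.67, 12.67, 12.67]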
Aggregation
Aggregation is the process of combining two or more attributes (or records) into a single,
more meaningful attribute. It is used to:
Summarize data
Reduce data size
Show higher-level trends rather than individual details
Example 1 – Attribute Aggregation
Dataset:
Sales_Q1 Sales_Q2 Sales_Q3 Sales_Q4
1000 1200 1100 1300
900 950 1000 1050
After Aggregation (Total Annual Sales):
Annual_Sales
4600
3900
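A small pandas sketch of this quarterly-to-annual aggregation (the figures match the table above):
import pandas as pd
sales = pd.DataFrame({'Sales_Q1': [1000, 900], 'Sales_Q2': [1200, 950],
                      'Sales_Q3': [1100, 1000], 'Sales_Q4': [1300, 1050]})
sales['Annual_Sales'] = sales.sum(axis=1)   # row-wise total across the four quarters
print(sales[['Annual_Sales']])              # 4600 and 3900, as in the table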
Generalization
o Replaces low-level data with higher-level concepts.
o Example: Changing age into age group (20–29, 30–39, etc.).
Normalization (Scaling)
Data normalization is a technique used in data mining to transform the values of a dataset into
a common scale. This is important because many machine learning algorithms are sensitive to
the scale of the input features and can produce better results when the data is normalized.
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such
as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization
1. Decimal Scaling
2. Min-Max Normalization
3. z-Score Normalization (Zero-Mean Normalization)
1. Decimal Scaling Method For Normalization
It normalizes data by moving the decimal point of its values. Each data value v_i is divided
by 10^j to obtain the normalized value v_i':
v_i' = v_i / 10^j
where j is the smallest integer such that max(|v_i'|) < 1.
Example :
Let the input data be: -10, 201, 301, -401, 501, 601, 701. To normalize the above data:
Step 1: Maximum absolute value in the given data: 701
Step 2: Divide each value by 1000 (i.e., j = 3)
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
Min-Max Normalization
In this technique of data normalization, a linear transformation is performed on the original
data. The minimum and maximum values of the attribute are fetched and each value is replaced
according to the following formula:
v' = ((v − min(A)) / (max(A) − min(A))) × (new_max(A) − new_min(A)) + new_min(A)
where A is the attribute, min(A) and max(A) are the minimum and maximum values of A
respectively, v is the old value of each entry, v' is the new value, and new_min(A),
new_max(A) are the boundary values of the required range.
Z-score normalization
In this technique, values are normalized using the mean and standard deviation of attribute A:
v' = (v − Ā) / σ_A
where v and v' are the old and new values of each entry respectively, Ā is the mean of A,
and σ_A is its standard deviation.
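A minimal sketch of Min-Max and z-score normalization using scikit-learn's MinMaxScaler and StandardScaler (the "income" values are made up):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.DataFrame({'income': [20000, 35000, 50000, 80000, 120000]})
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()    # scaled to [0, 1]
df['income_zscore'] = StandardScaler().fit_transform(df[['income']]).ravel()  # mean 0, unit variance
print(df.round(3))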
Discretization:
Data discretization is the process of converting continuous numerical data into a set of finite
intervals (bins) or categories. It’s a type of data reduction that makes the data simpler to
analyze, especially for algorithms that work better with discrete values (e.g., decision trees,
association rules).
Types of Discretization Methods
1. Unsupervised Discretization (No class label used)
Equal-width binning
o Divides range into equal-sized intervals.
o Example: Age range 20–50 → bins of width 10 → [20–30), [30–40), [40–50]
Equal-frequency binning
o Each bin has roughly the same number of data points.
Clustering-based
o Use k-means or other clustering to group values.
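These unsupervised strategies can be sketched with scikit-learn's KBinsDiscretizer, whose strategy parameter selects equal-width ('uniform'), equal-frequency ('quantile'), or clustering-based ('kmeans') binning; the age values below are made up:
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
ages = np.array([[22], [25], [31], [38], [45], [47], [52], [61]])
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(ages).ravel())   # bin index (0, 1 or 2) assigned to each age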
2. Supervised Discretization (Uses class label to guide binning)
Entropy-based discretization (e.g., ID3, C4.5)
o Split points chosen to maximize information gain.
ChiMerge
o Uses chi-square test to merge intervals until a stopping criterion is met.
Encoding (for Categorical Data)
o Converts categorical values into numerical format.
o Methods:
One-Hot Encoding: One-hot encoding is a data transformation technique used to convert
categorical (non-numeric) data into a numerical format so that machine learning models can
process it.
It works by creating a separate binary column for each category and marking the presence (1)
or absence (0) of that category.
Original Data:
Color
Red
Green
Blue
Red
After One-Hot Encoding:
Color_Red Color_Green Color_Blue
1 0 0
0 1 0
0 0 1
1 0 0
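A one-line sketch of the same encoding with pandas.get_dummies:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})
one_hot = pd.get_dummies(df, columns=['Color'], dtype=int)
print(one_hot)   # Color_Blue, Color_Green, Color_Red columns of 0/1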
Label Encoding:
Label encoding is a data transformation technique that converts categorical values into numeric
codes by assigning a unique integer to each category. Unlike one-hot encoding, it doesn’t create
new columns—it replaces each category with a number.
Example
Original Data:
Color
Red
Green
Blue
Red
After Label Encoding:
Color
2
1
0
2
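A small sketch using scikit-learn's LabelEncoder, which numbers categories alphabetically (Blue = 0, Green = 1, Red = 2), matching the table above:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})
df['Color_encoded'] = LabelEncoder().fit_transform(df['Color'])
print(df)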
Attribute Construction (Feature Engineering)
o Creating new attributes from existing ones.
o Example: From “Date of Birth” → create “Age” attribute.
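A tiny sketch of this with pandas (the birth dates and the reference date are invented; age is taken as a simple year difference):
import pandas as pd
df = pd.DataFrame({'dob': ['1995-04-12', '2001-09-30', '1988-01-05']})
df['dob'] = pd.to_datetime(df['dob'])
df['age'] = pd.Timestamp('2025-01-01').year - df['dob'].dt.year   # simple year difference
print(df)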