
UNIT II

UNIT II: Data Preprocessing: An Overview, Data Cleaning, Data Integration, Data
Reduction, Data Transformation and Data Discretization.
Data preprocessing is the process of preparing raw data for analysis by cleaning and
transforming it into a usable format. In data mining it refers to preparing raw data for mining
by performing tasks like cleaning, transforming, and organizing it into a format suitable for
mining algorithms.
Data preprocessing is important because raw data is often messy, inconsistent, and unsuitable
for direct use in analysis or model training. Without it, your results can be misleading,
inaccurate, or completely useless.
1. Ensures Data Quality
 Real-world datasets often contain missing values, duplicates, or outliers.
 Preprocessing detects and fixes these issues, ensuring your model works on clean and reliable data.

2. Improves Accuracy of Models
 Machine learning algorithms are sensitive to noise, irrelevant features, and scale differences.
 By normalizing, standardizing, or selecting features, preprocessing improves prediction accuracy.

3. Handles Inconsistencies
 Different sources may store data in different formats (e.g., “Male/Female” vs. “M/F”).
 Preprocessing makes data consistent and comparable.

4. Reduces Complexity
 Dimensionality reduction (e.g., PCA) or feature selection removes redundant or irrelevant data, making models faster and easier to train.

5. Prevents Bias from Bad Data
 Incorrect or unbalanced datasets can bias models toward wrong conclusions.
 Preprocessing ensures fairer, more representative datasets.

6. Saves Time in the Long Run
 Clean, well-structured data is easier to reuse and share for multiple experiments.
Steps in Data Preprocessing

 Some key steps in data preprocessing are Data Cleaning, Data Integration, Data
Transformation, and Data Reduction.
Data Cleaning: It is the process of identifying and correcting errors or inconsistencies in the
dataset. It involves handling missing values, removing duplicates, and correcting incorrect or
outlier data to ensure the dataset is accurate and reliable. Clean data is essential for effective
analysis, as it improves the quality of results and enhances the performance of data models.

 Missing Values: These occur when data is absent from a dataset. You can either ignore
the rows with missing data or fill in the gaps manually, with the attribute mean, or with
the most probable value. This keeps the dataset accurate and complete for
analysis.

 Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to
interpret, often caused by errors in data collection or entry. It can be handled in several
ways:

o Binning Method: The data is sorted and divided into equal segments (bins), and each
segment is smoothed by replacing its values with the bin mean or boundary values.

o Regression: Data can be smoothed by fitting it to a regression function, either
linear or multiple, to predict values.

o Clustering: This method groups similar data points together; values that fall
outside the clusters can be treated as outliers. These techniques help
remove noise and improve data quality.
 Removing Duplicates: It involves identifying and eliminating repeated data entries
to ensure accuracy and consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.

Data Integration

It involves merging data from various sources into a single, unified dataset. It can be
challenging due to differences in data formats, structures, and meanings. Techniques like
record linkage and data fusion help in combining data efficiently, ensuring consistency and
accuracy.

Data Transformation: It involves converting data into a format suitable for analysis.
Common techniques include normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit variance; and discretization,
which converts continuous data into discrete categories. These techniques help prepare the
data for more accurate analysis.
 Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
 Discretization: Converting continuous data into discrete categories for easier
analysis.
 Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
 Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis.

Data Reduction: It reduces the dataset's size while maintaining key information. This can
be done through feature selection, which chooses the most relevant features, and feature
extraction, which transforms the data into a lower-dimensional space while preserving
important details. It uses various reduction techniques such as,

 Dimensionality Reduction (e.g., Principal Component Analysis): A technique that
reduces the number of variables in a dataset while retaining its essential information.
 Numerosity Reduction: Reducing the number of data points by methods like
sampling to simplify the dataset without losing critical patterns.
 Data Compression: Reducing the size of data by encoding it in a more compact form,
making it easier to store and process.

Data Cleaning: Data cleaning (also called data cleansing or data scrubbing) is the process
of detecting and correcting (or removing) errors, inconsistencies, and inaccuracies in a
dataset so it becomes accurate, complete, and reliable for analysis.

Importance & Benefits of Data Cleaning

1. Improves Data Accuracy
 Removes errors, duplicates, and inconsistencies so your dataset truly reflects reality.
 Example: Changing “Indai” → “India” ensures correct analysis.
Benefit: Reliable results and better decision-making.

2. Enhances Model Performance
 Clean, consistent data helps machine learning algorithms learn patterns more effectively.
 Noisy or inconsistent data can confuse the model.
Benefit: Higher accuracy, precision, and recall.

3. Improves Consistency Across Systems
 Standardizing formats (dates, text, units) ensures smooth integration with other systems.
Benefit: Easier merging of datasets from different sources.

4. Helps in Compliance & Governance
 Clean, well-maintained datasets are essential for meeting legal requirements such as GDPR or HIPAA.
Benefit: Avoids legal risks and penalties.
Sources of Missing Values

Missing values occur when certain data points are not recorded, lost, or not applicable. They
can be introduced at data collection, entry, storage, or processing stages.

 Data Entry Errors
 Data Collection Problems
 Data Loss During Transfer or Storage
 Inapplicable or Irrelevant Data
 Equipment or Sensor Failure
 Data Filtering or Cleaning
 Privacy or Confidentiality Restrictions

Data Cleaning Steps:

1) Data Inspection (Identify Problems)
 Understand the dataset structure and quality.
 Check for missing values, duplicates, outliers, and inconsistencies.
 Example: Using df.info() and df.describe() in Python to spot anomalies.

2) Remove or Handle Duplicate Data
 Detect and delete identical records.
 Example: df.drop_duplicates() in Pandas.

3) Handle Missing Values
 Methods:
o Remove rows/columns with too many missing values.
o Fill in (impute) using the mean, median, mode, or prediction models.
 Example: Filling missing "age" values with the median.

4) Correct Structural Errors
 Fix typos, inconsistent capitalization, and wrong formats.
 Example: “M”, “Male”, “male” → “Male”.

5) Handle Outliers
 Detect using statistical methods (Z-score, IQR).
 Decide whether to remove, cap, or transform them.
 Example: Removing height values like 500 cm.

6) Standardize Data Formats
 Make sure all dates, numbers, and text follow the same format.
 Example: Convert MM/DD/YYYY to YYYY-MM-DD.

7) Validate Data Consistency
 Check relationships between columns.
 Example: If Purchase Date < Registration Date, the record is invalid.

8) Final Review & Export
 Verify that the data is clean.
 Save the cleaned data for analysis or modelling.
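The steps above can be tied together in a short pandas sketch. The tiny DataFrame and its column names (age, gender, purchase_date) are invented purely for illustration.

import pandas as pd
import numpy as np

# Hypothetical raw data illustrating the cleaning steps above
df = pd.DataFrame({
    'age': [25, 25, np.nan, 32, 500],                  # a missing value and an outlier
    'gender': ['M', 'M', 'male', 'F', 'Female'],       # inconsistent labels
    'purchase_date': ['03/01/2024', '03/01/2024', '03/15/2024', '04/02/2024', '05/09/2024'],
})

# 1) Inspect structure and summary statistics
df.info()
print(df.describe(include='all'))

# 2) Remove exact duplicate rows
df = df.drop_duplicates()

# 3) Impute missing age with the median
df['age'] = df['age'].fillna(df['age'].median())

# 4) Correct structural errors: map label variants to one form
df['gender'] = df['gender'].str.lower().map({'m': 'Male', 'male': 'Male',
                                             'f': 'Female', 'female': 'Female'})

# 5) Handle outliers with the IQR rule
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['age'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 6) Standardize the date format from MM/DD/YYYY to YYYY-MM-DD
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%m/%d/%Y')

# 8) Final review and export
print(df)
df.to_csv('cleaned_data.csv', index=False)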

Handling the noise data:

Binning

Data binning or bucketing is a data preprocessing method used to minimize the effects of
small observation errors. The original data values are divided into small intervals known as
bins and then they are replaced by a general value calculated for that bin. This has a
smoothing effect on the input data and may also reduce the chances of overfitting in the case
of small datasets.

Types of Binning Techniques

Binning is commonly categorized into two main types, based on how the bins are defined (a short pandas sketch of both follows this list):
1. Equal-Width Binning
Each bin has an equal width, determined by dividing the range of the data into n intervals.
Formula: width = (max(A) − min(A)) / n

 Advantages: Simple to implement and easy to understand.
 Disadvantages: May result in bins with highly uneven data distribution.

2. Equal-Frequency Binning
Each bin contains approximately the same number of data points.
 Advantages: Ensures balanced bin sizes, avoiding sparse bins.
 Disadvantages: The bin width may vary significantly.
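A minimal pandas sketch contrasting the two approaches; the nine sample values are illustrative only.

import pandas as pd

# Illustrative values (not from any particular dataset)
values = pd.Series([3, 6, 8, 5, 10, 12, 14, 11, 13])

# Equal-width binning: 3 intervals of equal width over the range of the data
equal_width = pd.cut(values, bins=3)

# Equal-frequency binning: 3 bins with roughly the same number of points
equal_freq = pd.qcut(values, q=3)

print(pd.DataFrame({'value': values,
                    'equal_width_bin': equal_width,
                    'equal_freq_bin': equal_freq}))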

Applications of Binning
1. Data Preprocessing: Often used to prepare data for machine learning models by
converting continuous variables into categorical ones.
2. Anomaly Detection: Helps identify anomalies or outliers by binning data and
analyzing the distributions.
3. Data Visualization: Used in histograms and bar charts to represent the frequency
distribution of data.
4. Feature Engineering: Creates categorical features that can enhance the performance
of certain machine learning models.

Data Integration:

Data integration is the process of combining data from multiple sources into a unified view
so it can be used for analysis, reporting, or machine learning.
It’s a key step in data preprocessing when working with datasets from different places.

Challenges in Data Integration


1. Schema Differences
o Same information stored in different formats or field names.
o Example: "Customer_ID" in one file vs. "CustID" in another.
2. Data Redundancy
o Duplicate records across sources.
o Example: Same customer record in both sales and marketing databases.
3. Data Inconsistency
o Conflicting values for the same attribute.
o Example: Customer age is 30 in one dataset and 32 in another.
4. Different Units or Scales
o Example: Weight recorded in pounds vs. kilograms.
5. Missing Values
o Some sources may not have certain attributes.
Example:

import pandas as pd
from sklearn.datasets import load_iris

# Load iris dataset as DataFrame
iris = load_iris(as_frame=True)
df_iris = iris.frame

# Create dummy sales data (same number of rows = 150)
df_sales = pd.DataFrame({
    'OrderID': range(1001, 1001 + len(df_iris)),
    'Product': ['Pen', 'Book', 'Pencil'] * 50,
    'Amount': [50, 150, 75] * 50
})

# Combine column-wise
combined = pd.concat([df_iris, df_sales], axis=1)
print(combined.head())

Output:

Data Reduction :

Data Reduction refers to the process of reducing the volume of data while maintaining its
informational quality. It involves methods for minimizing, summarizing, or simplifying data
while preserving its fundamental properties for storage or analysis.
Dimensionality Reduction:

Dimensionality reduction is the process of reducing the number of input variables (features)
in a dataset while keeping as much useful information as possible.
It’s mainly used to simplify data, speed up processing, and avoid problems like overfitting.

Types of Dimensionality Reduction


Feature Selection
 Selecting the most important features and discarding the rest.
 Methods:
o Filter methods (Correlation, Chi-square test)
o Wrapper methods (Forward Selection, Backward Elimination)
o Embedded methods (Lasso, Decision Tree feature importance)

Correlation:

Correlation measures the strength and direction of the relationship between two variables.
 Values range from –1 to +1.
o +1 → Perfect positive relationship (when one increases, the other increases).
o –1 → Perfect negative relationship (when one increases, the other decreases).
o 0 → No relationship.
 Common measure: Pearson correlation coefficient (for numerical data).
Formula for Pearson correlation (r): r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

Example:
 Height and weight → High positive correlation (~0.85).
 Age and gaming hours → Negative correlation (~–0.40).
Uses in preprocessing:
 Detect multicollinearity (too-high correlation) → remove redundant features.
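A short sketch of a correlation-based filter on the Iris features; the 0.9 threshold is an arbitrary choice for illustration.

import pandas as pd
from sklearn.datasets import load_iris

# Pearson correlation matrix of the Iris features
iris = load_iris(as_frame=True)
df = iris.data
corr = df.corr(method='pearson')
print(corr.round(2))

# Flag highly correlated (potentially redundant) feature pairs; threshold chosen arbitrarily
threshold = 0.9
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if abs(corr.loc[col_i, col_j]) > threshold:
            print(f"High correlation: {col_i} vs {col_j} = {corr.loc[col_i, col_j]:.2f}")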

2. Chi-Square Test
The Chi-Square (χ²) test checks whether there is a significant relationship between
two categorical variables.
 It compares observed frequencies (from the data) with expected frequencies (those
expected if there were no relationship).
Formula: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
 Oᵢ = Observed frequency
 Eᵢ = Expected frequency
Steps:
1. Create a contingency table of observed counts.
2. Calculate expected counts assuming no relationship.
3. Apply the formula and compare the result with the Chi-Square critical value (from a
statistical table) or use p-value.
Example:
 Does gender (Male/Female) affect product preference (A/B)?
 If p-value < 0.05 → Significant relationship.
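A hedged sketch using scipy.stats.chi2_contingency on an invented gender × product-preference table (the counts are made up for illustration).

import pandas as pd
from scipy.stats import chi2_contingency

# Invented contingency table: gender vs. product preference
observed = pd.DataFrame({'Product_A': [30, 20],
                         'Product_B': [10, 40]},
                        index=['Male', 'Female'])

chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square statistic:", round(chi2, 3))
print("p-value:", round(p_value, 4))
print("Expected counts:\n", expected)

# Decision at the 0.05 significance level
if p_value < 0.05:
    print("Significant relationship between gender and preference")
else:
    print("No significant relationship detected")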

Feature Extraction
 Transforming existing features into a smaller set of new features.
 Methods:
o PCA (Principal Component Analysis) → Creates new uncorrelated
variables (principal components).

PCA (Principal Component Analysis): PCA is a dimensionality reduction technique used
to convert a large set of variables into a smaller set of new variables (called principal
components) while preserving as much important information (variance) as possible.

Steps in PCA
1. Standardize the Data
o Ensure all features have the same scale (mean = 0, variance = 1).
2. Compute the Covariance Matrix
o Measures how features vary together.
3. Calculate Eigenvalues & Eigenvectors
o Eigenvalues → amount of variance explained by each principal component.
o Eigenvectors → directions of the principal components.
4. Sort Components by Variance
o Rank them by eigenvalues in descending order.
5. Select Top k Components
o Choose enough components to explain desired variance (e.g., 95%).
6. Transform the Data
o Multiply original data by selected eigenvectors to get the reduced dataset.
Example:
# PCA on Iris Dataset - Complete Code

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Load Iris Dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target  # Add species as target
species_names = iris.target_names

# 2. Standardize the Features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.iloc[:, :-1])  # exclude target column

# 3. Apply PCA (reduce to 2 components for visualization)
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)

# 4. Create a DataFrame for PCA results
df_pca_result = pd.DataFrame(data=df_pca, columns=['PC1', 'PC2'])
df_pca_result['species'] = df['species']

# 5. Explained Variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", sum(pca.explained_variance_ratio_))

# 6. Visualization
plt.figure(figsize=(8, 6))
for target, color, label in zip([0, 1, 2], ['red', 'green', 'blue'], species_names):
    subset = df_pca_result[df_pca_result['species'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], c=color, label=label, alpha=0.7, edgecolors='k')

plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.2f}% variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.2f}% variance)")
plt.title("PCA on Iris Dataset")
plt.legend()
plt.grid(True)
plt.show()

# 7. Component Loadings (Contribution of each feature)
loadings = pd.DataFrame(pca.components_,
                        columns=iris.feature_names,
                        index=['PC1', 'PC2'])
print("\nFeature contributions to each principal component:")
print(loadings)
Data Compression

Data compression is the process of encoding information in fewer bits than it originally
occupied. This is mainly achieved through methods that eliminate duplication and other
redundant information.

Lossless Data Compression


Lossless data compression guarantees that the decompressed data is identical to the original
data. It works best for text and data files where precision matters.
 Huffman coding: Builds a frequency-based binary tree to assign shorter codes to more frequent values.
 Run-length encoding (RLE): Compresses runs of repeated data values into a value and a count.

Huffman Coding

Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to
more frequent symbols and longer codes to less frequent symbols, reducing overall storage
space.

Working steps:

 Count the frequency of each symbol in the data.


 Build a priority queue (min-heap) of nodes (each node represents a symbol +
frequency).
 Combine the two smallest frequency nodes into a new node (sum of frequencies).
 Repeat until there is only one tree (Huffman Tree).
 Assign binary codes:
 Left branch → 0
 Right branch → 1
 Encode the data using these variable-length codes.

Example:

Step 1: Count Symbol Frequencies

Symbol  Frequency
A       2
B       2
C       1

Step 2: Build Huffman Tree

1. Take the smallest two:
o C (1) + A (2) → New node (3)
2. Combine with B (2):
o B (2) + Node (3) → Root (5)

Step 3: Assign Codes

 Go Left → 0
 Go Right → 1

Tree might look like:

(5)

/ \

B(2) (3)

/ \

C(1) A(2)

Codes:

 B→0
 C → 10
 A → 11

Step 4: Encode

Text: AABBC

 A → 11
 A → 11
 B→0
 B→0
 C → 10

Encoded:

11 11 0 0 10 → 11110010
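A compact, generic Python sketch of the algorithm using heapq (a textbook-style implementation, not tied to any particular library). On "AABBC" it happens to reproduce the codes worked out above, though different tie-breaking could yield an equivalent alternative table.

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`."""
    freq = Counter(text)
    if len(freq) == 1:                      # edge case: only one distinct symbol
        return {next(iter(freq)): '0'}
    # Heap items: [frequency, tie-breaker, {symbol: code-so-far}]
    heap = [[f, i, {sym: ''}] for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)            # two least-frequent subtrees
        hi = heapq.heappop(heap)
        for sym in lo[2]:
            lo[2][sym] = '0' + lo[2][sym]   # left branch -> prepend 0
        for sym in hi[2]:
            hi[2][sym] = '1' + hi[2][sym]   # right branch -> prepend 1
        counter += 1
        heapq.heappush(heap, [lo[0] + hi[0], counter, {**lo[2], **hi[2]}])
    return heap[0][2]

text = "AABBC"
codes = huffman_codes(text)
encoded = ''.join(codes[ch] for ch in text)
print("Codes:", codes)       # e.g. {'B': '0', 'C': '10', 'A': '11'}, depending on tie-breaking
print("Encoded:", encoded)   # 11110010 for the codes above
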
RLE:

Run-Length Encoding compresses data by storing repeated characters as a single value plus
the number of repeats.
It works best on data with many consecutive repeated values (e.g., images with large blocks
of the same color).

Example:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Sort by species to ensure runs
df = df.sort_values(by='species').reset_index(drop=True)

# RLE function
def run_length_encode_column(column):
    encoding = []
    prev = column[0]
    count = 1
    for value in column[1:]:
        if value == prev:
            count += 1
        else:
            encoding.append((prev, count))
            prev = value
            count = 1
    encoding.append((prev, count))
    return encoding

# Run RLE on species
rle_species = run_length_encode_column(df['species'].tolist())

# Convert RLE result to DataFrame
rle_df = pd.DataFrame(rle_species, columns=['species', 'count'])

# --- Visualization 1: Bar Chart of Species Count from RLE ---
plt.figure(figsize=(6, 4))
plt.bar(rle_df['species'], rle_df['count'], color=['green', 'blue', 'red'])
plt.title("RLE: Species Counts in Iris Dataset")
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()

Data Transformation :

Data Transformation is the process of converting data from its original format into a more
useful, consistent, or structured format for analysis, storage, or further processing. It’s like
“cleaning and reshaping” your data so it’s ready for use in machine learning or statistical
models.

Smoothing : Smoothing removes noise (random variations) from data to reveal patterns more
clearly. One common method is binning with mean smoothing.

Ex: Data: 3, 6, 8, 5, 10, 12, 14, 11, 13

Step 1: Partition the Data into Bins

We split the data into 3 bins with three values each (equal-depth, i.e. equal-frequency, partitioning).

Bin 1: 3, 6, 8

Bin 2: 5, 10, 12

Bin 3: 14, 11, 13

Step 2: Mean Smoothing

Replace each value in a bin with the mean of that bin.

 Bin 1 mean: (3 + 6 + 8) / 3 = 5.67 → values become 5.67, 5.67, 5.67
 Bin 2 mean: (5 + 10 + 12) / 3 = 9 → values become 9, 9, 9
 Bin 3 mean: (14 + 11 + 13) / 3 = 12.67 → values become 12.67, 12.67, 12.67

Step 3: Smoothed Data

Original: 3, 6, 8, 5, 10, 12, 14, 11, 13
Smoothed: 5.67, 5.67, 5.67, 9, 9, 9, 12.67, 12.67, 12.67
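The same calculation expressed in pandas: partition the nine values into three equal-sized bins (in the order given above) and replace each value with its bin mean.

import pandas as pd

data = pd.Series([3, 6, 8, 5, 10, 12, 14, 11, 13])

# Three equal-sized bins, in the order the values appear (as in the example above)
bins = pd.Series([i // 3 for i in range(len(data))])

# Smoothing by bin means: each value is replaced by the mean of its bin
smoothed = data.groupby(bins).transform('mean').round(2)

print("Original:", data.tolist())
print("Smoothed:", smoothed.tolist())
# Smoothed: [5.67, 5.67, 5.67, 9.0, 9.0, 9.0, 12.67, 12.67, 12.67]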

Aggregation

Aggregation is the process of combining two or more attributes (or records) into a single,
more meaningful attribute. It is used to:

 Summarize data
 Reduce data size
 Show higher-level trends rather than individual details

Example 1 – Attribute Aggregation

Dataset:

Sales_Q1  Sales_Q2  Sales_Q3  Sales_Q4
1000      1200      1100      1300
900       950       1000      1050

After Aggregation (Total Annual Sales):

Annual_Sales
4600
3900
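In pandas, the four quarterly attributes can be aggregated into the annual total in one line (a sketch of the table above).

import pandas as pd

sales = pd.DataFrame({
    'Sales_Q1': [1000, 900],
    'Sales_Q2': [1200, 950],
    'Sales_Q3': [1100, 1000],
    'Sales_Q4': [1300, 1050],
})

# Aggregate the four quarterly attributes into a single annual attribute
sales['Annual_Sales'] = sales[['Sales_Q1', 'Sales_Q2', 'Sales_Q3', 'Sales_Q4']].sum(axis=1)
print(sales[['Annual_Sales']])   # 4600 and 3900, as in the table above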

Generalization

o Replaces low-level data with higher-level concepts.
o Example: Changing age into age groups (20–29, 30–39, etc.).

Normalization (Scaling)

Data normalization is a technique used in data mining to transform the values of a dataset into
a common scale. This is important because many machine learning algorithms are sensitive to
the scale of the input features and can produce better results when the data is normalized.
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such
as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.

Methods of Data Normalization

1. Decimal Scaling

2. Min-Max Normalization

3. z-Score Normalization(zero-mean Normalization)

1. Decimal Scaling Method For Normalization

It normalizes by moving the decimal point of the data values: each value is divided by 10ʲ,
where 10ʲ is the smallest power of 10 greater than the maximum absolute value in the data.
The data value vᵢ is normalized to vᵢ' using the formula below:

vᵢ' = vᵢ / 10ʲ

where j is the smallest integer such that max(|vᵢ'|) < 1.

Example :

Let the input data be: -10, 201, 301, -401, 501, 601, 701. To normalize the above data:

Step 1: Maximum absolute value in the given data: 701

Step 2: Divide each value by 1000 (i.e., j = 3)

Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701

Min-Max Normalization

In this technique of data normalization, a linear transformation is performed on the original
data. The minimum and maximum values of the attribute are fetched and each value is replaced
according to the following formula:

v' = ((v − min(A)) / (max(A) − min(A))) × (new_max(A) − new_min(A)) + new_min(A)

Where A is the attribute, min(A) and max(A) are the minimum and maximum values of A,
v is the old value of each entry in the data and v' is the new value. new_min(A) and
new_max(A) are the minimum and maximum of the required target range (i.e., its boundary
values).

Z-score normalization
In this technique, values are normalized based on the mean and standard deviation of the attribute A:

v' = (v − Ā) / σ_A

where v and v' are the old and new values of each entry, σ_A is the standard deviation of A, and
Ā is the mean of A.
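A small NumPy sketch applying all three methods to the decimal-scaling example data; scikit-learn's MinMaxScaler and StandardScaler provide the last two as ready-made transformers.

import numpy as np

v = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

# 1. Decimal scaling: divide by 10^j so that max(|v'|) < 1  (here j = 3)
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / 10**j

# 2. Min-max normalization to the range [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# 3. Z-score normalization: zero mean, unit variance
z_score = (v - v.mean()) / v.std()

print("Decimal scaled:", decimal_scaled)
print("Min-max:", min_max.round(3))
print("Z-score:", z_score.round(3))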

Discretization:

Data discretization is the process of converting continuous numerical data into a set of finite
intervals (bins) or categories. It’s a type of data reduction that makes the data simpler to
analyze, especially for algorithms that work better with discrete values (e.g., decision trees,
association rules).

Types of Discretization Methods

1. Unsupervised Discretization (No class label used)

 Equal-width binning
o Divides range into equal-sized intervals.
o Example: Age range 20–50 → bins of width 10 → [20–30), [30–40), [40–50]
 Equal-frequency binning
o Each bin has roughly the same number of data points.
 Clustering-based
o Use k-means or other clustering to group values.

2. Supervised Discretization (Uses class label to guide binning)

 Entropy-based discretization (e.g., ID3, C4.5)
o Split points are chosen to maximize information gain.
 ChiMerge
o Uses the chi-square test to merge adjacent intervals until a stopping criterion is met.
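A hedged sketch of unsupervised discretization with scikit-learn's KBinsDiscretizer, showing the equal-width ('uniform') and equal-frequency ('quantile') strategies on one Iris feature.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

iris = load_iris(as_frame=True)
petal_length = iris.data[['petal length (cm)']]

# Equal-width binning into 3 intervals
equal_width = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
# Equal-frequency binning into 3 intervals
equal_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')

result = pd.DataFrame({
    'petal length (cm)': petal_length.iloc[:, 0],
    'equal_width_bin': equal_width.fit_transform(petal_length).ravel().astype(int),
    'equal_freq_bin': equal_freq.fit_transform(petal_length).ravel().astype(int),
})
print(result.head(10))
print(result['equal_freq_bin'].value_counts())  # roughly equal counts per bin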

Encoding (for Categorical Data)

o Converts categorical values into numerical format.
o Methods:

One-Hot Encoding: One-hot encoding is a data transformation technique used to convert
categorical (non-numeric) data into a numerical format so that machine learning models can
process it.
It works by creating a separate binary column for each category and marking the presence (1)
or absence (0) of that category.

Original Data:

Color
Red
Green
Blue
Red

After One-Hot Encoding:

Color_Red  Color_Green  Color_Blue
1          0            0
0          1            0
0          0            1
1          0            0

Label Encoding:

Label encoding is a data transformation technique that converts categorical values into numeric
codes by assigning a unique integer to each category. Unlike one-hot encoding, it doesn’t create
new columns—it replaces each category with a number.

Example

Original Data:

Color
Red
Green
Blue
Red

After Label Encoding:

Color
2
1
0
2
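Both encodings in a few lines of pandas/scikit-learn; the Color column matches the tables above, and LabelEncoder assigns integers alphabetically (Blue = 0, Green = 1, Red = 2).

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})

# One-hot encoding: one binary column per category (columns appear in alphabetical order)
one_hot = pd.get_dummies(df, columns=['Color'], dtype=int)
print(one_hot)

# Label encoding: one integer code per category (alphabetical order)
encoder = LabelEncoder()
df['Color_label'] = encoder.fit_transform(df['Color'])
print(df)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))  # {'Blue': 0, 'Green': 1, 'Red': 2}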

Attribute Construction (Feature Engineering)

o Creating new attributes from existing ones.
o Example: From “Date of Birth” → create an “Age” attribute.
