Chapter 2 - Data Preprocessing

This chapter discusses why data must be preprocessed before analysis. Raw data often contains missing values, outliers, redundant fields, and values that are inconsistent with policy or common sense. The two principal preprocessing methods are data cleaning and data transformation. Data cleaning handles missing data, which can be replaced with an analyst-specified constant, the field mean or mode, or a value drawn at random from the observed distribution. It also identifies misclassified values.

Uploaded by Nosair ibrahim
Copyright © All Rights Reserved

CHAPTER 2 – DATA PREPROCESSING

WHY DO WE NEED TO PREPROCESS THE DATA?
MUCH OF THE RAW DATA CONTAINED IN DATABASES IS UNPREPROCESSED, INCOMPLETE AND NOISY.
THE DATABASES MAY CONTAIN:
• FIELDS THAT ARE REDUNDANT
• MISSING VALUES
• OUTLIERS
• DATA IN A FORM NOT SUITABLE FOR DATA MINING MODELS
• VALUES NOT CONSISTENT WITH POLICY OR COMMON SENSE
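
The problems listed above can be checked for before any modeling. The following is a minimal sketch, not a method from the slides: the sample records, the field names (`age`, `income`, `income_usd`), the 1.5 × IQR outlier rule, and the 0 to 120 age rule are all assumptions made for illustration.

```python
from statistics import quantiles

def missing_counts(rows):
    """Count missing (None) entries per field."""
    return {f: sum(r[f] is None for r in rows) for f in rows[0].keys()}

def iqr_outliers(values, k=1.5):
    """Return observed values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in observed if v < low or v > high]

def duplicate_fields(rows):
    """Return pairs of fields that agree in every row, so one is redundant."""
    fields = list(rows[0].keys())
    return [(a, b)
            for i, a in enumerate(fields)
            for b in fields[i + 1:]
            if all(r[a] == r[b] for r in rows)]

def policy_violations(rows, field, is_valid):
    """Return non-missing values of `field` that fail the analyst's rule."""
    return [r[field] for r in rows
            if r[field] is not None and not is_valid(r[field])]

# Hypothetical records, invented purely for illustration.
rows = [
    {"age": 34,   "income": 52000,  "income_usd": 52000},
    {"age": None, "income": 48000,  "income_usd": 48000},
    {"age": -5,   "income": 51000,  "income_usd": 51000},
    {"age": 41,   "income": 50000,  "income_usd": 50000},
    {"age": 29,   "income": 990000, "income_usd": 990000},
    {"age": 38,   "income": 49000,  "income_usd": 49000},
]

print(missing_counts(rows))
print(iqr_outliers([r["income"] for r in rows]))
print(duplicate_fields(rows))
print(policy_violations(rows, "age", lambda a: 0 <= a <= 120))
```

Each check only flags candidates. Whether a flagged value is an error or a legitimate extreme remains the analyst's decision.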
TWO PRINCIPAL METHODS
• DATA CLEANING
• DATA TRANSFORMATION
DATA CLEANING
HANDLING MISSING DATA
INSIGHTFUL MINER OFFERS A CHOICE OF REPLACEMENT VALUES FOR MISSING DATA:
1. REPLACE THE MISSING VALUE WITH SOME CONSTANT, SPECIFIED BY THE ANALYST.
2. REPLACE THE MISSING VALUE WITH THE FIELD MEAN (FOR NUMERICAL VARIABLES) OR THE MODE (FOR CATEGORICAL VARIABLES).
3. REPLACE THE MISSING VALUE WITH A VALUE GENERATED AT RANDOM FROM THE OBSERVED DISTRIBUTION OF THE VARIABLE.
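
The three replacement options above can be sketched in plain Python. This is an illustration only, not Insightful Miner's implementation: the function names and sample values are hypothetical, missing entries are assumed to be represented as `None`, and option 3 is approximated by resampling from the observed (non-missing) values.

```python
import random
from statistics import mean, mode

def impute_constant(values, constant):
    """Option 1: replace each missing entry with an analyst-chosen constant."""
    return [constant if v is None else v for v in values]

def impute_central(values, numeric=True):
    """Option 2: replace missing entries with the mean (numerical)
    or the mode (categorical) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

def impute_random(values, rng=None):
    """Option 3: replace missing entries with random draws from the
    observed values (assumed stand-in for the observed distribution)."""
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical sample data.
ages = [25, None, 35, 30, None]
colors = ["red", "blue", None, "red"]

print(impute_constant(ages, 0))
print(impute_central(ages))
print(impute_central(colors, numeric=False))
print(impute_random(ages, random.Random(0)))
```

The three options trade off differently. A constant is simple but can distort the distribution, the mean or mode preserves the center but shrinks the spread, and random draws keep the spread at the cost of adding noise to individual records.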
IDENTIFYING MISCLASSIFICATIONS
