
Lec 1: Data Acquisition and Preprocessing

The document provides an overview of data preprocessing, emphasizing the importance of data quality and the major tasks involved, including data cleaning, integration, reduction, and transformation. It highlights the challenges of dealing with dirty data, such as missing, noisy, and inconsistent data, and outlines the need for effective data cleaning methods. The document serves as a foundational guide for understanding the critical aspects of preparing data for analysis.

Uploaded by

opoe14055
Copyright
© All Rights Reserved

Data Acquisition and Preprocessing
Instructor: Mr. Asad Abbas
Lec#1
Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality
– Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization

Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: are the recorded values correct?
– Completeness: are values missing because they were not recorded or were unavailable?
– Consistency: have some copies of the data been modified while others have not?
– Timeliness: are the data updated in a timely manner?
– Believability: how much do users trust that the data are correct?
– Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
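
As an illustration of the normalization task listed above, the following is a minimal sketch of min-max normalization, which rescales values linearly into a target range. The function name `min_max_normalize`, its parameters, and the sample income values are assumptions made for this example; they do not come from the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map every value to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Hypothetical income values, used only to exercise the function.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print(normalized)
```

The smallest value maps to `new_min` and the largest to `new_max`; everything else keeps its relative position between them.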
Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality
– Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Cleaning
• Data in the real world is dirty: much of it is potentially incorrect, e.g., because of
faulty instruments, human or computer error, or transmission errors
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
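
The kinds of dirty data listed above can be flagged with simple rule-based checks. The sketch below is one possible way to do that, not a complete cleaning method. The record layout (`occupation`, `salary`, `age`, `birthday`), the function names, and the reference date `as_of` are all assumptions made for this example.

```python
from datetime import date


def age_on(birthday, as_of):
    """Return the age in whole years on the date as_of."""
    years = as_of.year - birthday.year
    if (as_of.month, as_of.day) < (birthday.month, birthday.day):
        years -= 1  # Birthday has not occurred yet in the as_of year.
    return years


def find_issues(record, as_of):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not (record.get("occupation") or "").strip():
        issues.append("incomplete: occupation is missing")
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None and age is not None:
        if age_on(birthday, as_of) != age:
            issues.append("inconsistent: age does not match birthday")
    if birthday is not None and (birthday.month, birthday.day) == (1, 1):
        issues.append("suspicious: Jan. 1 birthday may be disguised missing data")
    return issues


# Hypothetical record that mirrors the examples on the slide.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)}
issues = find_issues(record, as_of=date(2024, 6, 1))
for issue in issues:
    print(issue)
```

A Jan. 1 birthday is only flagged as suspicious, since some people really are born on that date; deciding whether it is disguised missing data usually needs a look at how often the value occurs across the whole dataset.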
Incomplete (Missing) Data

• Data is not always available


– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data that were inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– failure to record the history of, or changes to, the data
• Missing data may need to be inferred
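
One simple way to infer a missing numeric value is to replace it with the mean of the observed values for that attribute. The sketch below shows this, assuming missing values are represented as `None`; the function name and the sample income values are hypothetical. Mean imputation is only one option among several and is not presented here as the method the lecture prescribes.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to infer from, so return the values unchanged.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical customer incomes with two missing entries.
incomes = [30000, None, 50000, None, 40000]
filled = fill_missing_with_mean(incomes)
print(filled)
```

Filling with the mean keeps the attribute's average unchanged but reduces its variance, so it is best treated as a baseline.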
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitations
– inconsistencies in naming conventions
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
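
Duplicate records are often hidden by inconsistent naming conventions, such as differences in case or spacing. The sketch below normalizes text fields before comparing records and keeps the first occurrence of each duplicate. The helper names, the choice of key fields, and the sample customer records are assumptions made for this example.

```python
def normalize_text(text):
    """Collapse runs of whitespace and lowercase the text for comparison."""
    return " ".join(text.split()).lower()


def drop_duplicates(records, key_fields):
    """Keep the first record for each normalized combination of key_fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(normalize_text(str(rec[f])) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical customer records; the second duplicates the first
# except for case and spacing.
customers = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "ada  lovelace ", "city": "LONDON"},
    {"name": "Alan Turing", "city": "London"},
]
unique = drop_duplicates(customers, key_fields=("name", "city"))
print(unique)
```

Exact matching after normalization catches only the simplest duplicates; records that differ by typos or abbreviations would need approximate matching.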
