This reviewer outlines the fundamental concepts and methods of data preprocessing, including data cleaning, integration, transformation, and reduction. It also covers post-processing and visualization techniques using R, with emphasis on ethical considerations in research, particularly data privacy.

BANA REVIEWER

MODULE 1: BASIC CONCEPTS IN DATA PREPROCESSING

Data preprocessing
- It aims at assessing and improving the quality of data for secondary statistical analysis
- It is a data mining technique that transforms raw or source data into an understandable
format for further processing

Tasks for Data Preprocessing


1. Data cleaning
● This step deals with missing data, noise, outliers, and duplicate or incorrect
records while minimizing introduction of bias into the database.
2. Data integration
● This step reorganizes the various raw datasets into a single dataset that contains all
the information required for the desired statistical analyses.
3. Data transformation
● This step translates and/or scales variables stored in a variety of formats or units
in the raw data into formats or units that are more useful for the statistical methods
that the researcher wants to use.
4. Data reduction
● This step removes redundant records and variables, as well as reorganizes the
data in an efficient and “tidy” manner for analysis.

MODULE 2: METHODS OF DATA PREPROCESSING

Data integration
- Is the process of combining data derived from various data sources (such as databases, flat files,
etc.) into a consistent dataset.

Four Types of Data Integration Methodologies


1. Inner Join
● Creates a new result table by combining column values of two tables (A and B) based
upon the join-predicate.
2. Left Join
● Returns all the values from an inner join plus all values in the left table that do not match
the right table, including rows with NULL (empty) values in the link column.
3. Right Join
● Returns all the values from the right table and matched values from the left table (NULL in
the case of no matching join predicate).
4. Outer Join
● The union of all the left join and right join values.
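The four join types above can be sketched with pandas (an illustrative Python sketch; the tables `a` and `b` and their columns are hypothetical):

```python
import pandas as pd

# Hypothetical tables: A (customers) and B (orders), joined on "id".
a = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cara"]})
b = pd.DataFrame({"id": [2, 3, 4], "total": [100, 250, 80]})

inner = a.merge(b, on="id", how="inner")  # only ids in both tables: 2, 3
left = a.merge(b, on="id", how="left")    # all of A; NULL (NaN) total for id 1
right = a.merge(b, on="id", how="right")  # all of B; NULL (NaN) name for id 4
outer = a.merge(b, on="id", how="outer")  # union of left and right: ids 1-4
```

Rows with no match on the other side carry NaN (pandas' NULL) in the unmatched columns, mirroring the NULL behavior described above.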

Data transformation
- Is a process of transforming data from one format to another

Here are a few common possible options for data transformation:

1. Normalization
- A way to scale a specific variable so that its values fall within a small, specified range
A. Min-max normalization
● Transforming values to a new scale such that all attributes fall within a
standardized range (commonly 0 to 1).

B. Z-score standardization
● Transforming a numerical variable to a standard normal distribution
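Both normalizations can be sketched in plain Python (the sample values are hypothetical):

```python
# Min-max normalization and z-score standardization on a small sample.
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Min-max: rescale to [0, 1] via (x - min) / (max - min).
lo, hi = min(values), max(values)
minmax = [(x - lo) / (hi - lo) for x in values]

# Z-score: (x - mean) / standard deviation, giving mean 0 and std 1.
mean = sum(values) / len(values)
std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
zscores = [(x - mean) / std for x in values]
```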

2. Encoding and Binning


A. Binning
- The process of transforming numerical variables into categorical counterparts.
● Equal-width (distance) partitioning
➔ Divides the range into N intervals of equal size, thus forming a uniform grid
● Equal-depth (frequency) partitioning
➔ Divides the range into N intervals, each containing approximately the same
number of samples.
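The two partitioning schemes can be sketched in Python (the sample values and N = 3 bins are hypothetical):

```python
# Equal-width vs. equal-depth binning of a small numerical sample into N = 3 bins.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n = 3

# Equal-width: divide the value range [min, max] into N equal-size intervals.
width = (max(data) - min(data)) / n  # (34 - 4) / 3 = 10
equal_width = [min(int((x - min(data)) // width), n - 1) for x in data]

# Equal-depth: each bin holds approximately the same number of samples.
depth = len(data) // n  # 3 samples per bin
equal_depth = [min(i // depth, n - 1) for i, _ in enumerate(data)]
```

Note how the equal-width bins hold 2, 3, and 4 samples here, while the equal-depth bins hold exactly 3 each.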
B. Encoding
- The process of transforming categorical values into binary or numerical
counterparts, e.g. mapping male or female for gender to 1 or 0
● Binary Encoding (Unsupervised)
➔ Transformation of categorical variables by taking the values 0 or 1 to
indicate the absence or presence of each category
● Class-based Encoding (Supervised)
➔ Discrete Class
➢ Replace the categorical variable with just one new numerical
variable and replace each category of the categorical variable with
its corresponding probability of the class variable.
➔ Continuous Class
➢ Replace the categorical variable with just one new numerical
variable and replace each category of the categorical variable with
its corresponding average of the class variable.
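Binary encoding and continuous class-based encoding can be sketched in Python (the gender categories and income class variable are hypothetical):

```python
# Binary encoding (0/1 indicators per category) and continuous class-based
# encoding (each category replaced by the average of the class variable).
gender = ["male", "female", "female", "male", "female"]
income = [30.0, 40.0, 50.0, 20.0, 60.0]  # continuous class variable

# Binary: one 0/1 indicator per category value (absence/presence).
categories = sorted(set(gender))  # ["female", "male"]
binary = [[1 if g == c else 0 for c in categories] for g in gender]

# Continuous class-based: category -> mean of the class variable.
means = {c: sum(v for g, v in zip(gender, income) if g == c) / gender.count(c)
         for c in categories}
encoded = [means[g] for g in gender]
```

Here "female" maps to 50.0 (the mean of 40, 50, 60) and "male" to 25.0 (the mean of 30, 20).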

Data Cleaning
- All data sources potentially include errors and missing values – data cleaning addresses these
anomalies.
- Is the process of altering data in a given storage resource to make sure that it is accurate and
correct.

Data Cleaning Tasks:


1. Fill in missing values
Solutions for handling missing data:
a. Ignore the tuple
b. Fill in the missing value manually
c. Data Imputation
● Use a global constant to fill in the missing value
● Use the attribute mean to fill in the missing value
● Use the attribute mean for all samples belonging to the same class
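The imputation options above can be sketched in Python (the small class/score table is hypothetical):

```python
# Handling a missing value: drop the tuple, fill with the overall attribute
# mean, or fill with the mean of samples in the same class.
rows = [
    {"class": "A", "score": 10.0},
    {"class": "A", "score": None},   # missing value
    {"class": "B", "score": 30.0},
    {"class": "B", "score": 50.0},
]

# a) Ignore the tuple: simply drop rows with missing values.
complete = [r for r in rows if r["score"] is not None]

# c) Impute with the overall attribute mean.
overall = sum(r["score"] for r in complete) / len(complete)
mean_filled = [r["score"] if r["score"] is not None else overall for r in rows]

# c) Impute with the mean of samples belonging to the same class.
def class_mean(cls):
    vals = [r["score"] for r in complete if r["class"] == cls]
    return sum(vals) / len(vals)

class_filled = [r["score"] if r["score"] is not None else class_mean(r["class"])
                for r in rows]
```

Class-conditional imputation fills the missing "A" score with 10.0 (the class mean) rather than 30.0 (the global mean), which usually introduces less bias.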

2. Cleaning noisy data


Solutions for cleaning noisy data:
a. Binning
● Transforming numerical values into categorical counterparts
b. Clustering
● Grouping data into corresponding clusters and using the cluster average to
represent a value
c. Regression
● Fitting a simple regression line to smooth a very erratic data set
d. Combined computer and human inspection
● Detecting suspicious values and checking them through human intervention

3. Identifying outliers
Solutions for identifying outliers:
a. Box plot
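The box plot flags values lying beyond 1.5 × IQR from the quartiles as outliers; a Python sketch (the sample data are hypothetical, and the quartile function uses one of several common interpolation conventions):

```python
# Box-plot (IQR) rule: outliers fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
data = sorted([2, 3, 4, 5, 5, 6, 7, 8, 50])

def quartile(xs, q):
    # Linear-interpolation quantile on sorted data (one common convention).
    pos = q * (len(xs) - 1)
    lo, frac = int(pos), pos - int(pos)
    return xs[lo] + frac * (xs[min(lo + 1, len(xs) - 1)] - xs[lo])

q1, q3 = quartile(data, 0.25), quartile(data, 0.75)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

With Q1 = 4, Q3 = 7, and IQR = 3, the fences are -0.5 and 11.5, so only 50 is flagged.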

Data reduction
- Is a process of obtaining a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results

Data Reduction Strategies:


1. Sampling
- Using a smaller representative sample from the big data set or population that will
generalize to the entire population.
A. Types of Sampling
● Simple Random Sampling
➔ There is an equal probability of selecting any particular item.
● Sampling without Replacement
➔ As each item is selected, it is removed from the population.
● Sampling with Replacement
➔ Objects are not removed from the population as they are selected
for the sample.
● Stratified Sampling
➔ Split the data into several partitions, then draw random samples
from each partition.
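The sampling types can be sketched with Python's standard random module (the population, strata boundaries, and sample sizes are hypothetical):

```python
import random

random.seed(0)  # fixed seed for reproducibility
population = list(range(100))

# Simple random sampling without replacement: selected items leave the pool,
# so every drawn item is distinct.
without = random.sample(population, 10)

# Sampling with replacement: the same item may be drawn more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition the data, then sample from each stratum.
strata = {"low": [x for x in population if x < 50],
          "high": [x for x in population if x >= 50]}
stratified = [x for part in strata.values() for x in random.sample(part, 5)]
```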

2. Feature Subset Selection


- Reduces the dimensionality of data by eliminating redundant and irrelevant features.
A. Feature Subset Selection Techniques
● Brute-force approach
➔ Try all possible feature subsets as input to the data mining
algorithm
● Embedded approaches
➔ Feature selection occurs naturally as part of the data mining
algorithm
● Filter approaches
➔ Features are selected before the data mining algorithm is run
● Wrapper approaches
➔ Use the data mining algorithm as a black box to find the best subset
of attributes
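A filter approach can be sketched in Python, here dropping near-zero-variance features before any mining algorithm runs (the feature names, values, and threshold are illustrative):

```python
# Filter approach: select features by a cheap statistic (here, variance)
# computed before running any data mining algorithm.
features = {
    "age":    [23.0, 45.0, 31.0, 52.0],
    "flag":   [1.0, 1.0, 1.0, 1.0],      # constant -> zero variance, irrelevant
    "income": [30.0, 55.0, 42.0, 61.0],
}

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

threshold = 1e-8
selected = [name for name, col in features.items() if variance(col) > threshold]
```

The constant "flag" feature carries no information and is filtered out, reducing dimensionality before modeling.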

3. Feature Creation
- Creating new attributes that can capture the important information in a data set much
more efficiently than the original attributes.
A. Feature Creation Methodologies
● Feature Extraction
● Mapping Data to New Space
● Feature Construction
MODULE 3: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE

R
- Is an integrated suite of software facilities for data manipulation, calculation and graphical display.
- Take note that R is not a database but connects to a DBMS.
- Is free and open source though it has a steep learning curve.

Visualization
- Is the presentation of information using spatial or graphical representations, for the purposes of
facilitating comparison, recognizing patterns, and supporting general decision making.

Types of Visualizations
1. Graph
- is a medium of visualization designed to communicate information. Depending on the type
of data, there is almost always a suitable graph to use.

Categorical Data can be visualized through:


● Bar Graph
● Pie Chart
● Pareto Chart
● Side-by-side chart

Numerical Data can be displayed using:


● Stem-and-Leaf Display
● Histogram

2. Charts
- Refer to a visualization medium that shows structure and relationship.
● Pie Chart
- A circle divided into sections, each showing a proportion of the
whole.
3. Diagram
- Schematic pictures or illustrations of objects and entities

UNIT IV: ETHICS IN DESCRIPTIVE ANALYTICS

Why is research ethics important?


● It is the right thing to do;
● It protects research participants;
● It provides advocates for research participants;
● It preserves credibility, trust, and accountability;
● It reduces liabilities, wasted time, and resources;
● It turns research from useless, harmful, and worthless into useful, helpful, and worthy work.

Republic Act 10173, the Data Privacy Act of 2012


- Recognizes the need for citizens’ data to be protected and secured.

You might also like