This reviewer outlines the fundamental concepts and methods of data preprocessing, including data cleaning, integration, transformation, and reduction. It also covers post-processing and visualization techniques using R, with emphasis on ethical considerations in research, particularly data privacy.

BANA REVIEWER

MODULE 1: BASIC CONCEPTS IN DATA PREPROCESSING

Data preprocessing
- It aims at assessing and improving the quality of data for secondary statistical analysis
- It is a data mining technique that transforms raw or source data into an understandable
format for further processing

Tasks for Data Preprocessing


1. Data cleaning
● This step deals with missing data, noise, outliers, and duplicate or incorrect
records while minimizing introduction of bias into the database.
2. Data integration
● This step reorganizes the various raw datasets into a single dataset that contains all
the information required for the desired statistical analyses.
3. Data transformation
● This step translates and/or scales variables stored in a variety of formats or units
in the raw data into formats or units that are more useful for the statistical methods
that the researcher wants to use.
4. Data reduction
● This step removes redundant records and variables, as well as reorganizes the
data in an efficient and “tidy” manner for analysis.

MODULE 2: METHODS OF DATA PREPROCESSING

Data integration
- Is the process of combining data derived from various data sources (such as databases, flat files,
etc.) into a consistent dataset.

Four Types of Data Integration Methodologies


1. Inner Join
● Creates a new result table by combining column values of two tables (A and B) based
upon the join-predicate.
2. Left Join
● Returns all the values from an inner join plus all values in the left table that do not match
the right table, including rows with NULL (empty) values in the link column.
3. Right Join
● Returns all the values from the right table and matched values from the left table (NULL in
the case of no matching join predicate).
4. Outer Join
● The union of all the left join and right join values.
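The four join types above can be sketched with pandas (an illustrative Python sketch; the tables `a` and `b` and their columns are hypothetical):

```python
import pandas as pd

# Hypothetical tables: A (customers) and B (orders), joined on "id".
a = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cara"]})
b = pd.DataFrame({"id": [2, 3, 4], "total": [100, 250, 80]})

inner = a.merge(b, on="id", how="inner")  # only ids in both tables: 2, 3
left = a.merge(b, on="id", how="left")    # all of A; NULL (NaN) total for id 1
right = a.merge(b, on="id", how="right")  # all of B; NULL (NaN) name for id 4
outer = a.merge(b, on="id", how="outer")  # union of left and right: ids 1-4
```

Rows with no match on the other side carry NaN (pandas' NULL) in the unmatched columns, mirroring the NULL behavior described above.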

Data transformation
- Is a process of transforming data from one format to another

Here are a few common possible options for data transformation:

1. Normalization
- A way to scale a specific variable so that its values fall within a small, specified range
A. Min-max normalization
● Transforming values to a new scale such that all attributes fall within a
standardized range (commonly 0 to 1).

B. Z-score standardization
● Transforming a numerical variable to a standard normal distribution
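Both normalizations can be sketched in plain Python (the sample values are hypothetical):

```python
# Min-max normalization and z-score standardization on a small sample.
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Min-max: rescale to [0, 1] via (x - min) / (max - min).
lo, hi = min(values), max(values)
minmax = [(x - lo) / (hi - lo) for x in values]

# Z-score: (x - mean) / standard deviation, giving mean 0 and std 1.
mean = sum(values) / len(values)
std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
zscores = [(x - mean) / std for x in values]
```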

2. Encoding and Binning


A. Binning
- The process of transforming numerical variables into categorical counterparts.
● Equal-width (distance) partitioning
➔ Divides the range into N intervals of equal size, thus forming a uniform grid
● Equal-depth (frequency) partitioning
➔ Divides the range into N intervals, each containing approximately the same
number of samples.
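The two partitioning schemes can be sketched in Python (the sample values and N = 3 bins are hypothetical):

```python
# Equal-width vs. equal-depth binning of a small numerical sample into N = 3 bins.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n = 3

# Equal-width: divide the value range [min, max] into N equal-size intervals.
width = (max(data) - min(data)) / n  # (34 - 4) / 3 = 10
equal_width = [min(int((x - min(data)) // width), n - 1) for x in data]

# Equal-depth: each bin holds approximately the same number of samples.
depth = len(data) // n  # 3 samples per bin
equal_depth = [min(i // depth, n - 1) for i, _ in enumerate(data)]
```

Note how the equal-width bins hold 2, 3, and 4 samples here, while the equal-depth bins hold exactly 3 each.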
B. Encoding
- The process of transforming categorical values into binary or numerical
counterparts, e.g. mapping male or female for gender to 1 or 0
● Binary Encoding (Unsupervised)
➔ Transformation of categorical variables by taking the values 0 or 1 to
indicate the absence or presence of each category
● Class-based Encoding (Supervised)
➔ Discrete Class
➢ Replace the categorical variable with just one new numerical
variable and replace each category of the categorical variable with
its corresponding probability of the class variable.
➔ Continuous Class
➢ Replace the categorical variable with just one new numerical
variable and replace each category of the categorical variable with
its corresponding average of the class variable.
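Binary encoding and continuous class-based encoding can be sketched in Python (the gender categories and income class variable are hypothetical):

```python
# Binary encoding (0/1 indicators per category) and continuous class-based
# encoding (each category replaced by the average of the class variable).
gender = ["male", "female", "female", "male", "female"]
income = [30.0, 40.0, 50.0, 20.0, 60.0]  # continuous class variable

# Binary: one 0/1 indicator per category value (absence/presence).
categories = sorted(set(gender))  # ["female", "male"]
binary = [[1 if g == c else 0 for c in categories] for g in gender]

# Continuous class-based: category -> mean of the class variable.
means = {c: sum(v for g, v in zip(gender, income) if g == c) / gender.count(c)
         for c in categories}
encoded = [means[g] for g in gender]
```

Here "female" maps to 50.0 (the mean of 40, 50, 60) and "male" to 25.0 (the mean of 30, 20).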

Data Cleaning
- All data sources potentially include errors and missing values – data cleaning addresses these
anomalies.
- Is the process of altering data in a given storage resource to make sure that it is accurate and
correct.

Data Cleaning Tasks:


1. Fill in missing values
Solutions for handling missing data:
a. Ignore the tuple
b. Fill in the missing value manually
c. Data Imputation
● Use a global constant to fill in the missing value
● Use the attribute mean to fill in the missing value
● Use the attribute mean for all samples belonging to the same class
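The imputation options above can be sketched in Python (the small class/score table is hypothetical):

```python
# Handling a missing value: drop the tuple, fill with the overall attribute
# mean, or fill with the mean of samples in the same class.
rows = [
    {"class": "A", "score": 10.0},
    {"class": "A", "score": None},   # missing value
    {"class": "B", "score": 30.0},
    {"class": "B", "score": 50.0},
]

# a) Ignore the tuple: simply drop rows with missing values.
complete = [r for r in rows if r["score"] is not None]

# c) Impute with the overall attribute mean.
overall = sum(r["score"] for r in complete) / len(complete)
mean_filled = [r["score"] if r["score"] is not None else overall for r in rows]

# c) Impute with the mean of samples belonging to the same class.
def class_mean(cls):
    vals = [r["score"] for r in complete if r["class"] == cls]
    return sum(vals) / len(vals)

class_filled = [r["score"] if r["score"] is not None else class_mean(r["class"])
                for r in rows]
```

Class-conditional imputation fills the missing "A" score with 10.0 (the class mean) rather than 30.0 (the global mean), which usually introduces less bias.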

2. Cleaning noisy data


Solutions for cleaning noisy data:
a. Binning
● Transforming numerical values into categorical counterparts
b. Clustering
● Grouping data into corresponding clusters and using the cluster average to
represent a value
c. Regression
● Fitting a simple regression line to smooth a very erratic data set
d. Combined computer and human inspection
● Detecting suspicious values and checking them through human intervention

3. Identifying outliers
Solutions for identifying outliers:
a. Box plot
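The box plot flags values lying beyond 1.5 × IQR from the quartiles as outliers; a Python sketch (the sample data are hypothetical, and the quartile function uses one of several common interpolation conventions):

```python
# Box-plot (IQR) rule: outliers fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
data = sorted([2, 3, 4, 5, 5, 6, 7, 8, 50])

def quartile(xs, q):
    # Linear-interpolation quantile on sorted data (one common convention).
    pos = q * (len(xs) - 1)
    lo, frac = int(pos), pos - int(pos)
    return xs[lo] + frac * (xs[min(lo + 1, len(xs) - 1)] - xs[lo])

q1, q3 = quartile(data, 0.25), quartile(data, 0.75)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

With Q1 = 4, Q3 = 7, and IQR = 3, the fences are -0.5 and 11.5, so only 50 is flagged.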

Data reduction
- Is a process of obtaining a reduced representation of the data set that is much smaller in volume
yet produces the same (or almost the same) analytical results

Data Reduction Strategies:


1. Sampling
- Using a smaller representative sample from the big data set or population that will
generalize to the entire population.
A. Types of Sampling
● Simple Random Sampling
➔ There is an equal probability of selecting any particular item.
● Sampling without Replacement
➔ As each item is selected, it is removed from the population.
● Sampling with Replacement
➔ Objects are not removed from the population as they are selected
for the sample.
● Stratified Sampling
➔ Split the data into several partitions, then draw random samples
from each partition.
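The sampling types can be sketched with Python's standard random module (the population, strata boundaries, and sample sizes are hypothetical):

```python
import random

random.seed(0)  # fixed seed for reproducibility
population = list(range(100))

# Simple random sampling without replacement: selected items leave the pool,
# so every drawn item is distinct.
without = random.sample(population, 10)

# Sampling with replacement: the same item may be drawn more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition the data, then sample from each stratum.
strata = {"low": [x for x in population if x < 50],
          "high": [x for x in population if x >= 50]}
stratified = [x for part in strata.values() for x in random.sample(part, 5)]
```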

2. Feature Subset Selection


- Reduces the dimensionality of data by eliminating redundant and irrelevant features.
A. Feature Subset Selection Techniques
● Brute-force approach
➔ Try all possible feature subsets as input to the data mining
algorithm
● Embedded approaches
➔ Feature selection occurs naturally as part of the data mining
algorithm
● Filter approaches
➔ Features are selected before the data mining algorithm is run
● Wrapper approaches
➔ Use the data mining algorithm as a black box to find the best subset
of attributes
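A filter approach can be sketched in Python, here dropping near-zero-variance features before any mining algorithm runs (the feature names, values, and threshold are illustrative):

```python
# Filter approach: select features by a cheap statistic (here, variance)
# computed before running any data mining algorithm.
features = {
    "age":    [23.0, 45.0, 31.0, 52.0],
    "flag":   [1.0, 1.0, 1.0, 1.0],      # constant -> zero variance, irrelevant
    "income": [30.0, 55.0, 42.0, 61.0],
}

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

threshold = 1e-8
selected = [name for name, col in features.items() if variance(col) > threshold]
```

The constant "flag" feature carries no information and is filtered out, reducing dimensionality before modeling.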

3. Feature Creation
- Creating new attributes that can capture the important information in a data set much
more efficiently than the original attributes.
A. Feature Creation Methodologies
● Feature Extraction
● Mapping Data to New Space
● Feature Construction
MODULE 3: POST-PROCESSING AND VISUALIZATION OF DATA INSIDE THE DATA WAREHOUSE

R
- Is an integrated suite of software facilities for data manipulation, calculation and graphical display.
- Take note that R is not a database but connects to a DBMS.
- Is free and open source though it has a steep learning curve.

Visualization
- Is the presentation of information using spatial or graphical representations, for the purposes of
facilitating comparison, recognizing patterns, and supporting general decision making.

Types of Visualizations
1. Graph
- is a medium of visualization designed to communicate information. Depending on the type
of data, there is almost always a suitable graph to use.

Categorical Data can be visualized through:


● Bar Graph
● Pie Chart
● Pareto Chart
● Side-by-side chart

Numerical Data can be displayed using:


● Stem-and-Leaf Display
● Histogram

2. Charts
- Refer to a visualization medium that shows structure and relationship.
● Pie Chart
- A circle divided into sections, each showing a proportion of the
whole.
3. Diagram
- Schematic pictures or illustrations of objects and entities

UNIT IV: ETHICS IN DESCRIPTIVE ANALYTICS

Why is research ethics important?


● It is the right thing to do;
● It protects research participants;
● It provides advocates for research participants;
● It preserves credibility, trust, and accountability;
● It reduces liabilities, wasted time, and resources;
● It turns research from useless, harmful, and worthless into useful, helpful, and worthy work.

Republic Act 10173, the Data Privacy Act of 2012


- Recognizes the need for citizens’ data to be protected and secured.

You might also like