Data Collection and Preparation
PEARLY ANN A. ESCALAMBRE, MIT
OBJECTIVES
Importance of Data Collection and Preparation
Basic Data Quality Assessment
Ethical Considerations in Data Collection
Importing Data from Various Sources
Data Cleaning and Preprocessing in Excel
Handling Missing Data and Outliers
Importance of Data Collection and Preparation
Data collection is the systematic gathering of information for analysis.
Why It Matters:
Informs decision-making processes.
Enhances the accuracy of insights derived from data.
Supports strategic planning and operational efficiency.
Basic Data Quality Assessment
What is Data Quality Assessment (DQA)?
Evaluates data accuracy, completeness, reliability, and validity.
Is it quality data?
Key Components:
Accuracy: How well data reflects real-world scenarios.
Completeness: Whether all necessary data is present.
Consistency: Uniformity across datasets.
Validity: Adherence to defined rules and formats.
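Two of the components above, completeness and validity, can be measured with simple counts. The following is a minimal Python sketch, not a full DQA tool; the sample records, the field names (id, email, signup), and the YYYY-MM-DD date rule are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": "1", "email": "ana@example.com", "signup": "2024-01-15"},
    {"id": "2", "email": "", "signup": "2024-02-30"},  # missing email, impossible date
    {"id": "3", "email": "ben@example.com", "signup": "2024-03-01"},
]


def is_valid_date(text):
    """Validity rule (assumed): value must be a real YYYY-MM-DD calendar date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


total = len(records)

# Completeness: share of records with a non-empty email.
completeness = sum(1 for r in records if r["email"].strip()) / total

# Validity: share of records whose signup date passes the rule above.
validity = sum(1 for r in records if is_valid_date(r["signup"])) / total

print(f"Completeness (email): {completeness:.0%}")
print(f"Validity (signup date): {validity:.0%}")
```

Accuracy is harder to automate, since it usually requires comparing the data against a trusted real-world source.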
Ethical Considerations in Data Collection
Informed Consent: Participants should understand how their data will be used.
Privacy Protection: Safeguarding personal information is crucial.
Data Ownership: Clarifying who owns the data collected and how it can be used.
Do not share personal information on the internet, to avoid identity theft.
Importing Data from Various Sources
Common Sources:
CSV files
Excel spreadsheets
Databases (SQL, NoSQL)
Steps to Import:
Identify the source format.
Use appropriate tools or software (e.g., Excel, Python) to import data.
Microsoft Word (.doc)
Microsoft PowerPoint (.ppt / .pptx)
Microsoft Excel (.xls)
Import Data: CSV File Example
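For readers using Python rather than Excel, a CSV import can be sketched with the standard csv module. The column names and values below are hypothetical, and the data is held in a string so the example runs on its own; in practice the reader would be given a file opened with open(), for example a file named sales.csv (an assumed name).

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk.
# With a real file: open("sales.csv", newline="", encoding="utf-8")
csv_text = """name,region,amount
Ana,North,120.50
Ben,South,80
Cara,North,
"""

# DictReader treats the first row as column headers and
# returns every later row as a dictionary of strings.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
print(len(rows), "rows imported")
for row in rows:
    print(row)
```

Every value arrives as text, and an empty cell becomes an empty string, so numeric columns still need to be converted and checked after import.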
Data Cleaning and Preprocessing in Excel
What is Data Cleaning?
The process of correcting or removing inaccurate records from a dataset.
Steps in Excel:
Remove duplicates using the "Remove Duplicates" feature.
Use filters to identify and correct errors.
Standardize formats for consistency.
Remove Duplicates
Duplicates are highlighted in red; delete them once they are identified.
In some Excel versions, duplicate highlighting is found under Conditional Formatting.
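The same cleaning steps (standardizing formats, then removing duplicates) can be sketched outside Excel. This Python example is an illustration under stated assumptions: the rows and the name and city fields are hypothetical, and title case is only one possible standard format.

```python
# Hypothetical raw rows with inconsistent casing, stray spaces, and a duplicate.
raw_rows = [
    {"name": " ana cruz ", "city": "CEBU"},
    {"name": "Ana Cruz", "city": "Cebu"},
    {"name": "ben lim", "city": "davao "},
]


def standardize(row):
    """Collapse extra whitespace and apply title case to every value."""
    return {key: " ".join(value.split()).title() for key, value in row.items()}


cleaned = []
seen = set()
for row in raw_rows:
    std = standardize(row)
    fingerprint = tuple(std.items())  # hashable copy of the whole row
    if fingerprint not in seen:       # keep only the first occurrence
        seen.add(fingerprint)
        cleaned.append(std)

print(cleaned)
```

Standardizing before comparing is a deliberate choice here: the first two rows differ only in spacing and casing, so they are recognized as duplicates only after formats are made consistent.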
Handling Missing Data and Outliers
Missing Data Strategies:
Imputation (filling in missing values).
Deleting rows/columns with excessive missing values.
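Both strategies can be combined: first delete rows that are mostly empty, then impute what remains. The sketch below assumes mean imputation and a 50% missing-value threshold; both are illustrative choices, and the age and income fields are hypothetical.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 31, "income": None},
    {"age": None, "income": None},
]

# Deletion: drop rows where more than half of the fields are missing.
max_missing_share = 0.5  # assumed threshold
kept = [
    dict(r)  # copy, so the original rows stay untouched
    for r in rows
    if sum(v is None for v in r.values()) / len(r) <= max_missing_share
]

# Imputation: fill each remaining gap with the mean of the observed values.
for column in ("age", "income"):
    observed = [r[column] for r in kept if r[column] is not None]
    fill_value = mean(observed)
    for r in kept:
        if r[column] is None:
            r[column] = fill_value

print(kept)
```

Mean imputation is simple but reduces the spread of the data, so the median or a domain-specific value may be a better fill in some datasets.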
Outlier Treatment:
Identify outliers using statistical methods (e.g., Z-scores).
Decide whether to remove or adjust outliers based on context.
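A Z-score measures how many standard deviations a value lies from the mean: z = (value - mean) / standard deviation. The Python sketch below flags values whose absolute Z-score exceeds a cutoff. The measurements are hypothetical, and the cutoff of 2.5 is an assumption; cutoffs between 2 and 3 are common, and the right one depends on context.

```python
from statistics import mean, pstdev

# Hypothetical measurements; 95 is far from the rest.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

mu = mean(values)
sigma = pstdev(values)  # population standard deviation

threshold = 2.5  # assumed cutoff

# Z-score: distance from the mean in units of standard deviation.
z_scores = [(v - mu) / sigma for v in values]
outliers = [v for v, z in zip(values, z_scores) if abs(z) > threshold]

print("Outliers:", outliers)
```

In a small dataset a single extreme value inflates the standard deviation, which limits how large its own Z-score can get, so a strict cutoff of 3 may flag nothing. In Excel, the same quantities can be computed with the AVERAGE, STDEV.P, and STANDARDIZE functions.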
How to Find Outliers in Excel
END