📌 Data Cleaning Notes
🔹 What is Data Cleaning?
● The process of detecting and correcting (or removing) errors, inconsistencies, and
inaccuracies in datasets.
● Ensures that data is accurate, complete, consistent, and reliable for analysis or
decision-making.
🔹 Common Issues in Raw Data
1. Missing values – empty or null fields.
2. Duplicates – repeated records.
3. Inconsistent formatting – e.g., "PH", "Philippines", "PHIL" for the same country.
4. Outliers – unusual values that may be errors.
5. Incorrect data types – e.g., numbers stored as text.
6. Noise or irrelevant data – unnecessary information.
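A quick way to surface several of these issues is to profile the dataset before changing anything. The sketch below uses pandas; the sample table and its column names (name, country, age) are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara", "Dan"],
    "country": ["PH", "Philippines", "Philippines", "PHIL", None],
    "age": ["29", "34", "34", "-5", "41"],  # numbers stored as text
})

# Missing values: count of null fields in each column.
missing_per_column = df.isna().sum()

# Duplicates: number of rows that fully repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Inconsistent formatting: distinct labels used for the same country.
country_variants = df["country"].dropna().unique().tolist()

# Incorrect data types: "age" is not numeric because it was read as text.
age_is_numeric = pd.api.types.is_numeric_dtype(df["age"])

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("country variants:", country_variants)
print("age is numeric:", age_is_numeric)
```

Profiling first makes it easier to decide which cleaning steps the data actually needs.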
🔹 Steps in Data Cleaning
1. Remove duplicates – drop or merge repeated entries.
2. Handle missing values:
○ Delete rows/columns (if too many values are missing).
○ Fill in with mean, median, mode, or placeholder values.
3. Correct inconsistencies – standardize formats (e.g., dates, units, spelling).
4. Fix data types – convert text to numeric, ensure correct date/time formats.
5. Handle outliers – investigate and decide whether to remove or keep.
6. Validate data – check for logical accuracy (e.g., age cannot be negative).
7. Normalize/standardize values – convert to uniform units and scales (e.g., all amounts in USD).
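Several of the steps above can be sketched as a single pandas function. This is a minimal illustration, not a definitive pipeline: the sample data, the column names, the COUNTRY_MAP lookup, and the choice to fill missing ages with the median are all assumptions. The steps run in a slightly different order than listed, because validation has to happen before the median is computed. Outlier handling and unit conversion are left out since they depend heavily on the dataset.

```python
import pandas as pd

# Hypothetical raw data, invented for illustration.
raw = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara", "Dan", "Eve"],
    "country": ["PH", "Philippines", "Philippines", "PHIL", "PH", "Philippines"],
    "age": ["29", "34", "34", "-5", None, "41"],
    "signup": ["2024-01-05", "2024-02-10", "2024-02-10",
               "2024-03-01", "2024-03-15", "2024-04-20"],
})

# Assumed lookup table that maps known variants to one standard label.
COUNTRY_MAP = {"PH": "Philippines", "PHIL": "Philippines"}


def clean(df):
    # Remove duplicates: drop fully repeated rows.
    out = df.drop_duplicates().copy()

    # Correct inconsistencies: standardize country labels.
    out["country"] = out["country"].replace(COUNTRY_MAP)

    # Fix data types: text to numeric and text to datetime.
    # Values that cannot be parsed become missing instead of raising.
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    out["signup"] = pd.to_datetime(out["signup"], format="%Y-%m-%d", errors="coerce")

    # Validate data: an age cannot be negative, so treat it as missing.
    out.loc[out["age"] < 0, "age"] = float("nan")

    # Handle missing values: fill with the median of the valid ages.
    out["age"] = out["age"].fillna(out["age"].median())

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Filling with the median is only one option; for some datasets, dropping the affected rows or using a placeholder value is the better choice.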
🔹 Tools & Methods Used
● Spreadsheets (Excel, Google Sheets) – basic cleaning.
● Programming:
○ Python: pandas, NumPy.
○ R: dplyr, tidyr.
● Dedicated tools: OpenRefine – a standalone application for interactive cleaning (not a Python library).
● Databases: SQL queries for filtering and updating.
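As a small illustration of SQL-based cleaning, the sketch below runs an UPDATE and a filtering SELECT against an in-memory SQLite database through Python's built-in sqlite3 module. The customers table and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Ana", "PH"), ("Ben", "Philippines"), ("Cara", "PHIL"), ("Dan", None)],
)

# Updating: standardize the country variants to one label.
conn.execute(
    "UPDATE customers SET country = 'Philippines' WHERE country IN ('PH', 'PHIL')"
)

# Filtering: keep only the rows that have a country value.
rows = conn.execute(
    "SELECT name, country FROM customers WHERE country IS NOT NULL ORDER BY id"
).fetchall()
print(rows)

conn.close()
```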
🔹 Benefits of Data Cleaning
● Improves accuracy of analysis.
● Saves time and cost in decision-making.
● Leads to better predictions and insights.
● Ensures data quality and trustworthiness.