Introduction to Data Cleaning
What is Data Cleaning?
Data Cleaning is the process of detecting and correcting (or removing)
inaccurate, incomplete, or inconsistent data to improve data quality.
Why is Data Cleaning Important?
• Ensures accurate analysis and reliable insights.
• Removes errors that can affect machine learning models.
• Enhances data consistency and integrity.
• Supports better decision-making.
Steps of Data Cleaning
1. Handling Missing Values
• Methods:
o Removing missing values: Using dropna() in pandas (Python).
o Filling missing values: Using fillna() in pandas with the mean, median, or mode.
o Interpolation: Estimating missing values based on other data
points.
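The three methods above can be sketched with pandas as follows. The sample DataFrame and its column names (age, city) are made up for illustration, and the choice of median for numbers and mode for text is one reasonable option, not the only one.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, None, 35.0, 45.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the median and text gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Option 3: estimate numeric gaps by linear interpolation between neighbours.
interpolated = df["age"].interpolate()

print(dropped)
print(filled)
print(interpolated)
```

Dropping rows is the simplest choice but discards data; filling or interpolating keeps every row at the cost of introducing estimated values.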
2. Removing Duplicates
• Duplicate data can lead to biased results.
• Method: Using drop_duplicates() in pandas (Python).
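A minimal sketch of duplicate removal with pandas. The sample data and the column names (customer_id, email) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data with one fully repeated row.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

# Alternatively, treat rows as duplicates when only a key column matches.
deduped_by_id = df.drop_duplicates(subset=["customer_id"], keep="first")

print(n_duplicates)
print(deduped)
```

Checking the count with duplicated() before dropping is a cheap way to confirm how many rows will be removed.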
3. Handling Outliers
• Detection: Using statistical methods like Z-score or IQR
(Interquartile Range).
• Removal or transformation: Removing extreme values or transforming
data using log scaling.
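The detection and treatment options above might look like this in pandas and NumPy. The sample values are invented, and the cutoffs (|z| > 2 and 1.5 × IQR) are conventional choices assumed here rather than fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

# Z-score: distance from the mean in units of standard deviation.
# A cutoff of 3 is common for larger samples; 2 is used here because
# the sample is tiny.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]

# IQR: flag points more than 1.5 * IQR outside the middle 50% of the data.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Treatment A: remove the flagged points.
cleaned = values[(values >= lower) & (values <= upper)]

# Treatment B: keep every point but compress large values with log(1 + x).
log_scaled = np.log1p(values)

print(z_outliers.tolist(), iqr_outliers.tolist())
```

The IQR rule is less sensitive to the outliers themselves than the Z-score, because extreme values inflate the mean and standard deviation that the Z-score relies on.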
4. Standardizing Data Formats
• Ensuring consistency in date formats, text case, and numerical formats.
• Example: Converting all date formats to YYYY-MM-DD.
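A short sketch of standardizing dates, text case, and numbers with pandas. The input values are made up, and the source date format (DD/MM/YYYY) is an assumption; real data needs its own format string.

```python
import pandas as pd

# Dates: assumed to arrive as DD/MM/YYYY, converted to YYYY-MM-DD.
raw_dates = pd.Series(["15/01/2023", "03/02/2023", "28/12/2022"])
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")
iso_dates = parsed.dt.strftime("%Y-%m-%d")

# Text case: trim surrounding spaces and use one capitalization style.
names = pd.Series(["alice", "BOB", " carol "])
clean_names = names.str.strip().str.title()

# Numbers: remove thousands separators, then convert text to numeric.
amounts = pd.Series(["1,200.50", "300", "4,000"])
clean_amounts = pd.to_numeric(amounts.str.replace(",", "", regex=False))

print(iso_dates.tolist())
print(clean_names.tolist())
print(clean_amounts.tolist())
```

Passing an explicit format to to_datetime avoids silent day/month mix-ups in ambiguous dates such as 03/02/2023.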
5. Correcting Data Errors
• Fixing typos, incorrect data entries, and inconsistencies.
• Example: Mapping variant country names (USA, U.S., United States)
to a single standard form.
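One way to correct inconsistent entries is a lookup table that maps each known variant to a standard value. The table below is hypothetical and would need to be built from the variants actually present in the data.

```python
import pandas as pd

# Hypothetical raw entries with inconsistent spellings and spacing.
countries = pd.Series(["USA", "U.S.", "United States", "Canada", "canada "])

# Hypothetical lookup table: normalized variant -> standard name.
canonical = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "canada": "Canada",
}

# Normalize case and spacing first so fewer variants need to be listed.
key = countries.str.strip().str.lower()

# Map known variants; leave unrecognized values unchanged for manual review.
corrected = key.map(canonical).fillna(countries)

print(corrected.tolist())
```

Keeping unrecognized values instead of dropping them makes it easy to spot new variants and extend the table.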
6. Handling Noisy Data
• Removing unwanted characters, white spaces, or irrelevant symbols.
• Method: Using regular expressions (re module in Python).
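A minimal sketch using Python's re module. The helper name clean_text and the rule "keep only letters, digits, and spaces" are assumptions for illustration; which characters count as noise depends on the dataset.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper: strip symbols and tidy up white space."""
    # Remove every character that is not a letter, digit, or white space.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing spaces.
    return text.strip()

samples = ["  Hello,   World!! ", "Order #1234 @@", "\tNew\nYork??"]
cleaned = [clean_text(s) for s in samples]

print(cleaned)
```

A pattern this strict also removes meaningful punctuation such as decimal points, so the character class should be widened when such symbols carry information.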