DATA CLEANING BEST PRACTICE
Five-Step Data Cleaning Framework (C.L.E.A.N.)
• The framework consists of Conceptualize, Locate solvable issues, Evaluate
unsolvable issues, Augment the data, and Note and document. This
structured approach helps analysts clean data effectively in a real job setting
• Data cleaning is about achieving data that is good enough for analysis and
iteration, not perfect data. It can be thought of as peeling an onion in layers:
an initial pass to remove obvious errors, a round of polishing and
synchronization, and deeper refinement after analysis
Understanding and Conceptualizing the Data
• Identify three key elements before cleaning:
• Grain (what each row represents; e.g., unique order)
• Key metrics (e.g., price)
• Key dimensions (e.g., time, product, marketing channel, geography)
• Example: Each row is an order with attributes like purchase date, shipping date,
product, and marketing channel. Knowing this helps prioritize cleaning efforts
aligned with business questions (e.g., sales trends across regions); see the
sketch below
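A minimal sketch of this conceptualization pass, assuming the data lives in a pandas DataFrame; the file name and column names (order_id, price, marketing_channel, purchase_date, ship_date) are hypothetical stand-ins, not from the original notes:

```python
import pandas as pd

# Load the raw orders export (file name is hypothetical)
orders = pd.read_csv("orders_raw.csv")

# Confirm the grain: each row should represent one unique order
dupes = orders["order_id"].duplicated().sum()
print(f"{dupes} rows share an order_id; the grain is 'one row per order' only if this is 0")

# Survey the key metric and key dimensions before deciding what to clean
print(orders["price"].describe())                              # key metric
print(orders["marketing_channel"].value_counts(dropna=False))  # key dimension
print(orders[["purchase_date", "ship_date"]].head())           # time dimensions
```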
Locating and Addressing Solvable Issues
• Solvable issues include inconsistent data formats, spelling errors, categorization
inconsistencies, duplicates, and some missing values that can be imputed or
inferred from the data itself
• Initial cleaning steps: eyeball the data for glaring issues, filter the distinct
values in each column, and create an issues log to track problems and their
magnitude
• Examples of solvable problems:
• Reformatting inconsistent date formats
• Standardizing product names with formulaic replacements (e.g., using Excel IF
statements)
• Replacing blanks in categorical columns (e.g., marketing channel) with
"unknown"
• Fixing inconsistent or nonsensical regional codes using a lookup table
• For duplicates, assess their impact before removal; if it is low (e.g., 145
duplicates in 20,000 records), document and retain them until business context
confirms deletion, as in the sketch below
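A minimal sketch of these solvable fixes, continuing the hypothetical orders DataFrame from above; the replacement map and lookup file are illustrative assumptions:

```python
import pandas as pd

orders = pd.read_csv("orders_raw.csv")  # hypothetical file from the sketch above

# Filter distinct values per column to spot spelling and categorization issues
for col in ["product", "marketing_channel", "region_code"]:
    print(col, sorted(orders[col].dropna().unique()))

# Reformat inconsistent date strings into one datetime type
# (errors="coerce" turns unparseable values into NaT for the issues log)
orders["purchase_date"] = pd.to_datetime(orders["purchase_date"], errors="coerce")
orders["ship_date"] = pd.to_datetime(orders["ship_date"], errors="coerce")

# Standardize product name variants with a formulaic replacement map
name_fixes = {"i-phone": "iPhone", "Iphone": "iPhone"}  # illustrative mapping
orders["product"] = orders["product"].replace(name_fixes)

# Replace blanks in a categorical column with an explicit "unknown" label
orders["marketing_channel"] = orders["marketing_channel"].fillna("unknown")

# Fix inconsistent regional codes via a lookup table kept in a separate file
region_lookup = pd.read_csv("region_lookup.csv")  # columns: region_code, region
orders = orders.merge(region_lookup, on="region_code", how="left")

# Count duplicates but retain them until business context confirms deletion
dupes = orders.duplicated().sum()
print(f"{dupes} duplicate rows ({dupes / len(orders):.1%}); logged, not dropped")
```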
Evaluating and Managing Unsolvable Issues
• Unsolvable issues include missing data that cannot be inferred, outliers whose
validity is uncertain, and business logic violations (e.g., ship date before
purchase date)
• Recommended approach:
• Document the issue and its magnitude
• Do not impute or delete without reliable business context or additional data
sources
• Surface issues transparently in analysis and reports to stakeholders
• Imputation (e.g., filling missing prices with averages) is rarely used by data
analysts due to the risk of bias, unless there is a trusted source or clear logic
for inference
• Outliers should generally be retained unless confirmed erroneous, as they may
reflect real events; their detection often occurs during exploratory analysis rather
than initial cleaning
• Business logic checks help identify nonsensical data patterns, such as shipping
dates preceding purchase dates; these require domain knowledge and may need
stakeholder input for resolution (see the sketch below)
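A minimal sketch of surfacing these unsolvable issues without forcing fixes, under the same assumptions as the earlier sketches (dates already parsed to datetime):

```python
# Business logic check: a ship date should never precede its purchase date
violations = orders["ship_date"] < orders["purchase_date"]
print(f"{violations.sum()} rows ({violations.mean():.1%}) violate "
      "ship-after-purchase logic; surfaced to stakeholders, not auto-fixed")

# Missing prices with no trusted source for inference: document, do not impute
print(f"{orders['price'].isna().mean():.1%} of rows lack a price; left as-is")

# Outliers: flag with a simple IQR rule for exploratory review, not deletion
q1, q3 = orders["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (orders["price"] < q1 - 1.5 * iqr) | (orders["price"] > q3 + 1.5 * iqr)
print(f"{outlier.sum()} potential outliers flagged and retained")
```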
Augmenting the Data for Robustness and Flexibility
• Enhancing the dataset involves creating additional dimensions or metrics to
enable richer analysis, such as:
• Breaking timestamps into multiple time grains (year, month, week)
• Calculating derived metrics like "time to ship" (difference between ship and
purchase dates)
• Incorporating external reference data (e.g., region from country code lookup)
• Adding demographic or customer information if available
• Careful formatting and sanity checks (e.g., removing nonsensical default dates
like 1900) ensure the augmented data is meaningful; a sketch follows below
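A minimal sketch of the augmentation step, again continuing the hypothetical orders DataFrame; the country-code lookup file is an assumed example:

```python
import pandas as pd

# Break the purchase timestamp into multiple time grains
orders["purchase_year"] = orders["purchase_date"].dt.year
orders["purchase_month"] = orders["purchase_date"].dt.to_period("M").astype(str)
orders["purchase_week"] = orders["purchase_date"].dt.isocalendar().week

# Derived metric: time to ship, as the day difference between the two dates
orders["time_to_ship_days"] = (orders["ship_date"] - orders["purchase_date"]).dt.days

# External reference data: region from a country-code lookup (illustrative file)
country_lookup = pd.read_csv("country_regions.csv")  # columns: country_code, region
orders = orders.merge(country_lookup, on="country_code", how="left")

# Sanity check: null out nonsensical default dates such as 1900 placeholders
orders.loc[orders["purchase_date"].dt.year <= 1900, "purchase_date"] = pd.NaT
```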
Noting and Documenting: The Issues Log and Final Reporting
• Maintain a detailed issues log throughout the cleaning process documenting:
• Identified problems
• Magnitude (percentage of affected records)
• Decisions on solvability and resolution steps
• Notes on outstanding issues requiring further investigation or stakeholder input
• Transparency and clear documentation demonstrate analytical rigor and assist
communication with hiring managers or team members
• Columns with more than 70% corrupted data are generally considered unusable,
guiding decisions on data inclusion (see the sketch below)
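A minimal sketch of an issues log kept as its own small table; every entry and percentage here is illustrative, echoing the examples above and the 70% usability rule:

```python
import pandas as pd

# Illustrative entries: the problem, its magnitude, and the decision taken
issues_log = pd.DataFrame([
    {"issue": "duplicate order rows", "magnitude_pct": 0.7,
     "decision": "retained; awaiting business confirmation"},
    {"issue": "blank marketing_channel values", "magnitude_pct": 4.2,
     "decision": "replaced with 'unknown'"},
    {"issue": "ship date before purchase date", "magnitude_pct": 0.3,
     "decision": "surfaced to stakeholders; not altered"},
])

# Usability rule: flag columns where more than 70% of values are missing/corrupted
corruption = orders.isna().mean()
unusable = corruption[corruption > 0.70].index.tolist()
print("Columns considered unusable (>70% corrupted):", unusable)

issues_log.to_csv("issues_log.csv", index=False)  # share alongside the analysis
```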
Summary of the Complete Data Cleaning Process
• Conceptualize: Understand the data grain, key metrics, and dimensions to frame
cleaning priorities
• Locate Solvable Issues: Identify and fix data format inconsistencies, spelling
errors, duplicates, and imputable nulls
• Evaluate Unsolvable Issues: Document missing data, outliers, and logic errors;
surface them transparently without forced fixes
• Augment Data: Add new metrics, time grains, and reference data to enrich
analysis potential
• Note and Document: Keep a comprehensive issues log detailing problems,
magnitude, and cleaning decisions
This approach ensures data is clean enough for meaningful analysis while maintaining
transparency about limitations and assumptions
💡 Key Insight: Data cleaning is iterative and contextual. The goal is not perfect data
but data that is reliable enough to analyze, share, and improve upon.
Documentation and communication of data issues are as important as the
cleaning itself.
❗ Important: Avoid imputing missing values unless supported by strong business
logic or reliable external data to prevent bias.
ℹ️ Note: Outliers and business logic violations should usually be surfaced and
documented rather than automatically corrected or deleted.
⚠️ Warning: Always preserve original data and create cleaned versions in new fields
or tabs to maintain transparency and reproducibility.
[Figure: C.L.E.A.N. process flow. Conceptualize: understand data grain, metrics,
dimensions, and business context → Locate: identify and fix formatting, spelling,
duplicates, and some nulls → Evaluate: document missing data, outliers, and logic
violations; escalate if needed → Augment: add time grains, calculated metrics, and
lookup-table enrichment → Note: finalize the issues log, track fixes, and maintain
data transparency]