Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Overview of Data Analytics Lifecycle
Phase 1: Discovery
Learning the Business Domain
Resources Available: Time, People, Technology, Data
Framing the Problem
Identifying Key Stakeholders
Interviewing the Analytics Sponsor
Developing Initial Hypotheses
Identifying Potential Data Sources
Phase 2: Data Preparation
Preparing the Analytic Sandbox
Performing ETLT
Learning about the Data
Data Conditioning
Survey and Visualize
Common Tools for Data Preparation
Preparing the Analytic Sandbox
● Create the analytic sandbox (also called workspace)
● Allows team to explore data without interfering with
live production data
● Sandbox collects all kinds of data
● The sandbox allows organizations to undertake
ambitious projects beyond traditional data analysis
and BI to perform advanced predictive analytics
Performing ETLT
(Extract, Transform, Load, Transform)
● In ETL, users extract, transform, and then load the data
● In the sandbox the process is often ELT: loading early
preserves the raw data, which can be useful to examine
● Example – in credit card fraud detection, outliers
can represent high-risk transactions that might be
inadvertently filtered out or transformed before
being loaded into the database
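A minimal sketch of the ELT idea, assuming an in-memory SQLite database as the sandbox. The table names, the sample amounts, and the 1000.0 threshold are hypothetical, chosen only for illustration. The raw rows are loaded untouched, and the outlier is flagged afterwards rather than filtered out before loading.

```python
import sqlite3

# Hypothetical extracted transactions: (id, amount). Values are made up.
raw_transactions = [(1, 25.0), (2, 40.0), (3, 31.5), (4, 9800.0), (5, 18.0)]


def load_raw(conn, rows):
    """'L' before 'T': load the extracted rows into the sandbox unchanged."""
    conn.execute("CREATE TABLE raw_txn (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO raw_txn VALUES (?, ?)", rows)


def transform_in_sandbox(conn, threshold):
    """Second 'T': build a derived table; the raw table stays intact."""
    conn.execute(
        "CREATE TABLE txn_flagged (id INTEGER, amount REAL, high_risk INTEGER)"
    )
    conn.execute(
        "INSERT INTO txn_flagged "
        "SELECT id, amount, amount > ? FROM raw_txn",
        (threshold,),
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, raw_transactions)
transform_in_sandbox(conn, threshold=1000.0)  # threshold is an assumption

raw_count = conn.execute("SELECT COUNT(*) FROM raw_txn").fetchone()[0]
high_risk_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM txn_flagged WHERE high_risk = 1 ORDER BY id"
    )
]
print(raw_count, high_risk_ids)
```

Because the unusually large transaction is still present in raw_txn, the team can examine it later even if a downstream transformation treats it differently.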
Outlier
Learning about the Data
● Determines the data available to the team early in
the project
● Highlights gaps: identifies data not currently available
● Identifies data outside the organization that might
be useful
Learning about the Data
Sample Dataset Inventory
Data Conditioning
● Cleaning data
● Normalizing datasets
● Managing missing data, outliers, and unwanted data
● Performing transformation
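A small sketch of two conditioning steps, assuming median imputation for missing values and min-max scaling for normalization. Both are illustrative choices, not the only options, and the sample ages are hypothetical.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40, 30, 60]       # hypothetical column with one missing value
filled = impute_missing(ages)       # None is replaced by the median of 20, 30, 40, 60
scaled = min_max_normalize(filled)  # all values now lie in [0, 1]
print(filled)
print(scaled)
```

Which values count as outliers or unwanted data depends on the business problem, so that decision is left out of the sketch.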
Survey and Visualize
● Leverage data visualization tools to gain an
overview of the data
● “Overview first, zoom and filter, then details-on-demand”
○ This enables the user to find areas of interest, zoom
and filter to find more detailed information about a
particular area, then find the detailed data in that area
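The "overview first, zoom and filter, then details-on-demand" pattern can be sketched without a visualization tool. The records, field names, and helper functions below are hypothetical and only illustrate the three steps.

```python
from collections import defaultdict

# Hypothetical sales records, made up for illustration.
sales = [
    {"id": 1, "region": "East", "amount": 120.0},
    {"id": 2, "region": "East", "amount": 80.0},
    {"id": 3, "region": "West", "amount": 950.0},
    {"id": 4, "region": "West", "amount": 60.0},
]


def overview(rows):
    """Overview first: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def zoom_and_filter(rows, region, min_amount):
    """Zoom and filter: keep one region and amounts at or above a cutoff."""
    return [
        row for row in rows
        if row["region"] == region and row["amount"] >= min_amount
    ]


def details(rows, record_id):
    """Details on demand: return the full record for one id."""
    return next(row for row in rows if row["id"] == record_id)


summary = overview(sales)
west_large = zoom_and_filter(sales, region="West", min_amount=500.0)
record = details(sales, record_id=west_large[0]["id"])
print(summary, west_large, record)
```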
Survey and Visualize
Guidelines and Considerations
● Assess the granularity of the data, the range of values,
and the level of aggregation of the data
● Does the data represent the population of interest?
● Check time-related variables: are they daily, weekly, or
monthly, and is that granularity good enough?
● Is the data standardized/normalized? Scales consistent?
● For geospatial datasets, are state/country abbreviations
consistent?
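Two of these checks can be sketched in a few lines. The helper names, the sample state codes, and the sample dates are hypothetical: one helper lists codes that fall outside an expected set, and the other reports the distinct gaps between consecutive dates as a hint at time granularity.

```python
from collections import Counter
from datetime import date


def inconsistent_codes(values, valid):
    """Count values that do not belong to the expected set of codes."""
    return Counter(v for v in values if v not in valid)


def day_gaps(dates):
    """Return the distinct gaps, in days, between consecutive sorted dates."""
    ordered = sorted(dates)
    return sorted({(b - a).days for a, b in zip(ordered, ordered[1:])})


# Hypothetical state column with mixed spellings.
states = ["CA", "Calif.", "NY", "ny", "CA"]
bad_codes = inconsistent_codes(states, valid={"CA", "NY"})

# Hypothetical observation dates: mostly weekly, with one missing week.
observed = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 15), date(2024, 1, 29)]
gaps = day_gaps(observed)

print(bad_codes)
print(gaps)
```

A single gap value suggests a regular interval, while several values, as in this sample, point to missing periods or mixed granularity worth investigating.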
Common Tools for Data Preparation
● Hadoop: performs parallel ingest and analysis
● Alpine Miner: provides a GUI for creating analytic
workflows
● OpenRefine: free, open source tool for working with
messy data
● Data Wrangler: tool for data cleansing & transformation