Week 2 - Data Analytics Life Cycle
▪ Data analytics involves six main phases that are carried
out in a cycle:
1. Data discovery,
2. Data preparation,
3. Model planning,
4. Model building,
5. Communicate results,
6. Operationalization.
Data Analytics Life Cycle Phases
✓ In this stage, data is collected, cleaned, and transformed into a format
that is suitable for analysis (data integration, data cleansing, data
enrichment, and data transformation activities).
• Clarifies the data that the data science team has access to at the start of the project.
4. Data Conditioning
Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data.
✓ Data conditioning can involve many complex steps to join or merge data sets or
otherwise get datasets into a state that enables analysis in further phases.
A. Review data to ensure that calculations remained consistent within columns or across tables for
a given data field. For instance, did customer lifetime value change at some point in the middle
of data collection? Or if working with financials, did the interest calculation change from simple
to compound at the end of the year?
B. Does the data distribution stay consistent over all the data? If not, what kinds of actions should
be taken to address this problem?
C. Does the data represent the population of interest? For marketing data, if the project is focused on
targeting customers of child-rearing age, does the data represent that, or is it full of senior citizens
and teenagers?
2. Data preparation
D. For time-related variables, are the measurements daily, weekly, monthly? Is that good
enough? Is time measured in seconds everywhere? Or is it in milliseconds in some
places? Determine the level of granularity of the data needed for the analysis and assess
whether the current level of timestamps on the data meets that need.
E. Is the data standardized/normalized? Are the scales consistent? If not, how consistent or
irregular is the data?
F. For geospatial datasets, are state or country abbreviations consistent across the data?
Are personal names normalized? English units? Metric units?
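A few of the checks above (D, E, and F) can be sketched in code. The example below is a minimal illustration, not a prescribed method: the field names, the alias table, and the seconds-versus-milliseconds threshold are all assumptions made up for this sketch.

```python
from statistics import mean, pstdev

# Hypothetical alias table; a real project would use a complete reference list.
STATE_ALIASES = {
    "calif.": "CA", "california": "CA", "ca": "CA",
    "n.y.": "NY", "new york": "NY", "ny": "NY",
}


def to_seconds(timestamp):
    """Bring epoch timestamps to one granularity (seconds).

    Heuristic assumption: values at or above 1e11 are milliseconds,
    since epoch seconds will not reach 1e11 for thousands of years.
    """
    return timestamp / 1000.0 if timestamp >= 1e11 else float(timestamp)


def normalize_state(value):
    """Map spelling variants of a state to one abbreviation."""
    key = value.strip().lower()
    return STATE_ALIASES.get(key, value.strip().upper())


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up records with mixed time units and inconsistent state spellings.
records = [
    {"ts": 1700000000, "state": "Calif.", "spend": 120.0},
    {"ts": 1700000060000, "state": "ca", "spend": 80.0},
    {"ts": 1700000120, "state": "New York", "spend": 100.0},
]

cleaned = [
    {"ts": to_seconds(r["ts"]),
     "state": normalize_state(r["state"]),
     "spend": r["spend"]}
    for r in records
]
scaled_spend = z_scores([r["spend"] for r in cleaned])
```

After this step every timestamp is in seconds, every state uses one abbreviation, and the spend column is on a standardized scale.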
3. Model Planning
➢ The third phase of the lifecycle is model
planning. At this stage, the team decides
how to divide the work so that each
member's workload is clearly defined.
➢ The team develops data sets for training, testing, and production purposes.
➢ In the later phases, the team builds and executes the models that were
created in the model planning stage.
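As a rough illustration of developing training and test data sets, the sketch below splits a list of rows at random. The function name, the 25% test fraction, and the fixed seed are assumptions chosen for this example only.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 data records
train, test = train_test_split(rows)
```

The model is fitted on `train` only, and `test` is held back so that the later phases can check the model on data it has not seen.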
• After mapping out your business goals and collecting a large volume of data
(structured, unstructured, or semi-structured), it is time to build a model
that uses the data to achieve the goal. This is the model planning stage of
the data analytics process.
There are several techniques available to load data into the system:
• ETL (Extract, Transform, and Load) transforms the data first using a set of business
rules, before loading it into a sandbox.
• ELT (Extract, Load, and Transform) first loads raw data into the sandbox and then
transforms it.
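The difference between the two loading techniques is mainly the order of the steps. The sketch below is a simplified illustration in which the sandbox is an in-memory Python object and `transform` is a hypothetical business rule invented for this example.

```python
def transform(row):
    """Hypothetical business rule: tidy the name, convert cents to dollars."""
    return {
        "name": row["name"].strip().title(),
        "amount": row["amount_cents"] / 100,
    }


def etl(source_rows):
    """ETL: transform each row first, then load it into the sandbox."""
    sandbox = []
    for row in source_rows:
        sandbox.append(transform(row))
    return sandbox


def elt(source_rows):
    """ELT: load raw rows first, then transform them inside the sandbox."""
    sandbox = {"raw": [dict(r) for r in source_rows]}
    sandbox["clean"] = [transform(r) for r in sandbox["raw"]]
    return sandbox


# Made-up source data.
source = [
    {"name": "  alice smith ", "amount_cents": 1250},
    {"name": "BOB JONES", "amount_cents": 399},
]

etl_sandbox = etl(source)
elt_sandbox = elt(source)
```

Both routes end with the same cleaned rows. With ELT the raw rows also remain available in the sandbox, so the team can still inspect the data as it looked before any business rule was applied.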
2. Model Selection
In the model selection subphase, the team's main goal is to choose
an analytical technique, or a short list of candidate techniques,
based on the end goal of the project.
Common Tools for the Model Planning Phase
• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts? That is, does it
appear as if the model is giving answers that make sense in this context?
• Do the parameter values of the fitted model make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes? Depending on context, false positives may
be more serious or less serious than false negatives.
• Are more data or more inputs needed? Do any of the inputs need to be transformed or
eliminated? Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If so, go back
to the model planning phase and revise the modeling approach.
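Several of the questions above (accuracy on the test data, false positives versus false negatives) can be answered by counting outcomes on held-out data. The sketch below assumes a binary classifier; the labels and predictions are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1   # predicted positive, actually positive
        elif p and not a:
            counts["fp"] += 1   # false alarm
        elif not p and a:
            counts["fn"] += 1   # missed case
        else:
            counts["tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up test-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
metrics = summarize(counts)
```

If false negatives are the intolerable mistake in a given context, the team would watch recall. If false positives are more serious, precision is the more relevant number.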
5. Communicate Results
➢ In phase five of the life cycle, the team evaluates the results of the
project to determine whether it is a success or a failure.
➢ The entire team studies the results along with the stakeholders
to draw inferences and implications from the key findings and
summarize the entire work done.
✓ The team should identify key findings, quantify the business value, and develop
a narrative to summarize and convey the findings to stakeholders.
6. Operationalization
➢ In phase six, the team prepares a final report
along with the briefings, source code, and related
documents.
➢ The last phase also involves running the pilot project to implement the
model and test it in a real-time environment.
✓ The team communicates the benefits of the project more broadly and sets up a pilot
project to deploy the work in a controlled way before broadening it to the full
enterprise of users.
Example
Consider an example of a retail store chain that wants to optimize its products' prices to boost its
revenue. The store chain has thousands of products over hundreds of outlets, making it a highly
complex scenario. Once you identify the store chain's objective, you find the data you need,
prepare it, and go through the Data Analytics lifecycle process.
You observe different types of customers, such as ordinary customers and customers like
contractors who buy in bulk. You suspect that treating these customer types differently could
lead to a solution. However, you don't have enough information about them and need to discuss
this with the client team.
In this case, you need to agree on a definition of each customer type, find the relevant data,
and conduct hypothesis testing to check whether customer type affects the model results. Once you are
convinced by the model results, you can deploy the model, integrate it into the business,
and roll out the prices you consider optimal across the
outlets of the store.
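The hypothesis-testing step in this example could look like the sketch below, which uses a permutation test to check whether two customer types differ in average basket size. The source does not name a specific test, so this choice is an assumption, and the basket sizes are made-up numbers for illustration only.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the estimated p-value under the null hypothesis that
    both groups come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical basket sizes (items per purchase) for the two customer types.
ordinary = [12.0, 15.5, 9.0, 14.0, 11.5, 13.0, 10.5, 12.5]
contractor = [48.0, 55.0, 61.5, 43.0, 52.5, 58.0, 47.0, 50.0]

p_value = permutation_test(ordinary, contractor)
```

A small p-value would suggest that the two customer types really do buy differently, which supports modeling them separately. A large one would suggest the distinction adds little.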
Advantages of the Data Analytics Life Cycle
▪ Business User:
Someone who understands the domain area and usually
benefits from the results. This person can consult and advise
the project team on the context of the project, the value of the
results, and how the outputs will be operationalized.
▪ Project Sponsor:
Responsible for establishing the project. Provides the impetus
and requirements for the project and defines the core
business problem. Generally, provides the funding and
measures the degree of value from the final outputs of
the working team.
Key Roles for a Successful Analytics Project
▪ Project Manager:
Ensures that key milestones and objectives are met on time and
at the expected quality.
▪ Data Engineer:
Leverages deep technical skills to assist with tuning SQL queries
for data management and data extraction and provides support for
data ingestion into the analytic sandbox.
IMPORTANT QUESTIONS
1. In which phase would the team expect to invest most of the project time?
Why? Where would the team expect to spend the least time?
2. What are the benefits of doing a pilot program before a full-scale rollout of
a new analytical methodology? Discuss this in the context of the mini case
study.
3. What kinds of tools would be used in the following phases, and for which
kinds of use scenarios?