Week 2 - Data Analytics Life Cycle

The document discusses the data analytics life cycle which consists of 6 phases: data discovery, data preparation, model planning, model building, communicating results, and operationalization. These phases involve defining objectives, collecting and preparing data, analyzing it, interpreting findings, and implementing insights. The phases are iterative and can move forward or backward between each phase.

Uploaded by

Alif Luqman

Data Analytics Life Cycle

Dr. Hanan Aldowah


Data Analytics Life Cycle Overview

What is Data Analytics Life Cycle


➢ The data analytics life cycle refers to the systematic process of defining objectives, collecting data, preparing and analyzing it, interpreting findings, and implementing the insights gained from data to achieve organizational objectives effectively.
The Importance of Data Analytics Life Cycle

✓ The data analytics life cycle is the road map for how data is generated, collected, processed, used, and analyzed to achieve business objectives/goals.

✓ It offers a systematic way of turning data into useful information that can help achieve organizational or project goals.

✓ It provides guidance and strategies for extracting this information and moving in the appropriate direction in order to accomplish business goals.
Data Analytics Life Cycle Phases

▪ Data analytics mainly involves six important phases that are carried out in a cycle:

1. Data discovery,
2. Data preparation,
3. Model planning,
4. Model building,
5. Communicate results,
6. Operationalization.

➢ The six phases of the data analytics lifecycle are followed one after another to complete one cycle.

➢ The phases are iterative: the team can move both forward and backward between phases (see the figure).
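The forward and backward movement between phases can be pictured as a small state model. The sketch below is only an illustration: the names `Phase`, `can_move`, and `next_phase` are hypothetical, and the rule that a move goes to an adjacent phase is an assumption, not something the slides state.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six phases, numbered as in the list of phases."""
    DATA_DISCOVERY = 1
    DATA_PREPARATION = 2
    MODEL_PLANNING = 3
    MODEL_BUILDING = 4
    COMMUNICATE_RESULTS = 5
    OPERATIONALIZATION = 6


def can_move(current: Phase, target: Phase) -> bool:
    """Assumed rule: a move is allowed to the adjacent phase, forward or backward."""
    return abs(int(target) - int(current)) == 1


def next_phase(current: Phase, ready: bool) -> Phase:
    """Step forward when the current phase's work is ready, otherwise step back."""
    step = 1 if ready else -1
    value = int(current) + step
    # Clamp so the first and last phases do not fall off either end.
    value = max(int(Phase.DATA_DISCOVERY), min(value, int(Phase.OPERATIONALIZATION)))
    return Phase(value)
```

For instance, `next_phase(Phase.MODEL_BUILDING, ready=False)` returns `Phase.MODEL_PLANNING`, which mirrors a team going back to revise its modeling approach.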
1. Data Discovery
In this first phase of data analytics, the stakeholders regularly perform the
following tasks:

✓ Examine the business trends,
✓ Make case studies of similar data analytics,
✓ Study the domain of the business industry,
✓ Assess, as an entire team, the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements.
✓ Once all these assessments and evaluations are completed, start formulating the initial hypotheses for resolving the business challenges in terms of the current market scenario.
Discovery Phase
Key Points in the Discovery Phase

1. Learning the Business Domain

Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines.

✓ An example of this role would be someone with an advanced degree in applied mathematics or statistics.
2. Resources
As part of the discovery phase, the team needs to assess the resources
available to support the project.
✓ In this context, resources include technology, tools, data, people (the team), and the types of systems needed for later phases to operationalize the models.

3. Framing the Problem

Framing is the process of stating the analytics problem to be solved.
✓ At this point, it is a best practice to write down the problem statement and share it with the key stakeholders.
4. Identifying Key Stakeholders

Another important step is to identify the key stakeholders and their interests in the project.

✓ During these discussions, the team can identify the success criteria, key risks, and stakeholders, which should include anyone who will benefit from the project or will be significantly impacted by the project.

5. Interviewing the Analytics Sponsor


The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem.
✓ At the outset, project sponsors may have a predetermined solution that may not necessarily realize the desired outcome. In these cases, the team must use its knowledge and expertise to identify the true underlying problem and an appropriate solution.

6. Developing Initial Hypotheses

Developing a set of initial hypotheses (IHs) is a key facet of the discovery phase. This step involves forming ideas that the team can test with data.

7. Identifying Potential Data Sources


As part of the discovery phase, identify the kinds of data the team will need to solve the problem. Consider the volume, type, and time span of the data needed to test the hypotheses.
The team should perform five main activities during this step of the discovery phase:

✓ Identify data sources
✓ Capture aggregate data sources
✓ Review the raw data
✓ Evaluate the data structures and tools needed
✓ Scope the sort of data infrastructure needed for this type of problem
2. Data Preparation
▪ Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It often involves reformatting data, making corrections to data, and combining datasets to enrich data.

✓ In this stage, data is collected, cleaned, and transformed into a format that is suitable for analysis (data integration, data cleansing, data enrichment, and data transformation activities).

✓ Data visualization techniques may also be used to gain a better understanding of the data and identify any data quality issues.
➢ In this phase, data is prepared by transforming it from a legacy system into a form suitable for data analytics, using the sandbox platform.
• A sandbox is a scalable platform commonly used by data scientists for data preprocessing.
➢ To prepare the analytics sandbox, the team performs extract, load, and transform steps to get data into the sandbox.

➢ Data preparation tasks are likely to be performed multiple times and not in a predefined order.
✓ Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.

Key Points in the Data Preparation Phase:

1. Preparing the Analytic Sandbox


The first subphase of data preparation requires
the team to obtain an analytic sandbox (also
commonly referred to as a workspace), in which
the team can explore the data without
interfering with live production databases.

2. Enhancing Data Quality


➢ Data preparation improves data quality by correcting errors, inconsistencies, missing values, outliers, and more.
➢ It also validates and verifies data to ensure correctness and completeness.

✓ For example, effective data quality management can prevent inaccurate analysis by removing duplicate entries from a customer database.
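As a minimal sketch of the duplicate-removal example, the snippet below keeps the first record seen for each customer. The field names (`id`, `email`), the sample rows, and the choice of a normalized email address as the duplicate key are all assumptions made for illustration; a real customer database may need a different matching rule.

```python
def normalize_email(email: str) -> str:
    """Treat addresses that differ only in case or surrounding spaces as equal."""
    return email.strip().lower()


def drop_duplicate_customers(rows):
    """Keep the first row seen for each normalized email address."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize_email(row["email"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data: rows 1 and 2 describe the same customer.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": " Ana@Example.com "},
    {"id": 3, "email": "ben@example.com"},
]
deduplicated = drop_duplicate_customers(customers)
```

Counting customers on `deduplicated` rather than on `customers` avoids reporting three customers when there are only two.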

3. Learning About the Data

• Clarifies the data that the data science team has access to at the start of the project.

• Highlights gaps by identifying datasets within an organization that the team may find useful but that may not be accessible to the team today.

• Identifies datasets outside the organization that may be useful to obtain, through open APIs, data sharing, or purchasing data to supplement already existing datasets.

4. Data Conditioning

Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data.

✓ Data conditioning can involve many complex steps to join or merge data sets or
otherwise get datasets into a state that enables analysis in further phases.

➢ Performing transformations on the data involves applying mathematical functions or operations to the dataset to modify its distribution or structure. This can include operations such as logarithmic transformations, square root transformations, or other statistical adjustments to improve the data's properties for analysis. These transformations help in preparing the data for further analysis or modeling by making it more suitable for statistical assumptions or algorithm requirements.

➢ Normalizing datasets in data analytics refers to the process of rescaling the values of numerical features to a standard range, typically between 0 and 1 or -1 and 1. This ensures that all features contribute equally to the analysis, especially when dealing with features that have different scales or units.
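The two conditioning operations described here, rescaling to a standard range and applying a logarithmic transformation, can be sketched in a few lines. The function names and the sample `incomes` values are hypothetical, and `log(1 + x)` is chosen here as one common variant that also accepts zeros.

```python
import math


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to `low` and the maximum to `high`."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant column has no spread; map every value to the lower bound.
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress large values; expects non-negative input."""
    return [math.log1p(v) for v in values]


# Hypothetical income column with one very large value.
incomes = [20_000, 35_000, 50_000, 1_000_000]
scaled = min_max_scale(incomes)        # range 0 to 1
centered = min_max_scale([0, 5, 10], low=-1.0, high=1.0)  # range -1 to 1
logged = log_transform(incomes)
```

In `scaled`, the three ordinary incomes end up crowded near 0 because of the single large value, which is the kind of skew a logarithmic transformation is meant to reduce.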
Additional questions and considerations for the data conditioning step include
these.
• What are the data sources? What are the target fields (for example, columns of the tables)?
• How clean is the data?
• How consistent are the contents and files? Determine to what degree the data contains
missing or inconsistent values and if the data contains values deviating from normal.
• Assess the consistency of the data types. For instance, if the team expects certain data to
be numeric, confirm it is numeric or if it is a mixture of alphanumeric strings and text.
• Review the content of data columns or other inputs and check to ensure they make sense.
For instance, if the project involves analyzing income levels, preview the data to confirm
that the income values are positive or if it is acceptable to have zeros or negative values.
• Look for any evidence of systematic error.
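Several of these checks can be automated for a single column. The sketch below counts missing, non-numeric, and negative entries; the function names and the sample `income` values are hypothetical, and whether zeros or negatives are acceptable remains a decision for the team.

```python
def is_number(text) -> bool:
    """Return True when the value parses as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def profile_column(values):
    """Count missing, non-numeric, and negative entries in one column."""
    report = {"missing": 0, "non_numeric": 0, "negative": 0}
    for value in values:
        if value is None or str(value).strip() == "":
            report["missing"] += 1
        elif not is_number(value):
            report["non_numeric"] += 1
        elif float(value) < 0:
            report["negative"] += 1
    return report


# Hypothetical raw column as it might arrive from a text export.
income = ["52000", "", "n/a", "-100", "48000.50", None]
report = profile_column(income)
```

Note that `float` also accepts strings such as `"nan"` and `"inf"`, so a stricter check may be needed if those can appear in the data.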
5. Survey and Visualize
After the team has collected and obtained at least some of the datasets
needed for the subsequent analysis, a useful step is to leverage data
visualization tools to gain an overview of the data. Seeing high-level
patterns in the data enables one to understand characteristics about
the data very quickly.
✓ One example is using data visualization to examine data quality, such as
whether the data contains many unexpected values or other
indicators of dirty data.
When pursuing this approach with a data visualization tool or statistical package, the following
guidelines and considerations are recommended.

A. Review data to ensure that calculations remained consistent within columns or across tables for
a given data field. For instance, did customer lifetime value change at some point in the middle
of data collection? Or if working with financials, did the interest calculation change from simple
to compound at the end of the year?

B. Does the data distribution stay consistent over all the data? If not, what kinds of actions should
be taken to address this problem?

C. Does the data represent the population of interest? For marketing data, if the project is focused on
targeting customers of child-rearing age, does the data represent that, or is it full of senior citizens
and teenagers?

D. For time-related variables, are the measurements daily, weekly, monthly? Is that good
enough? Is time measured in seconds everywhere? Or is it in milliseconds in some
places? Determine the level of granularity of the data needed for the analysis and assess
whether the current level of timestamps on the data meets that need.

E. Is the data standardized/normalized? Are the scales consistent? If not, how consistent or
irregular is the data?

F. For geospatial datasets, are state or country abbreviations consistent across the data?
Are personal names normalized? English units? Metric units?
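The question about seconds versus milliseconds can be handled with a simple heuristic once the mix has been detected. The sketch below assumes Unix epoch timestamps and an arbitrary cutoff, `MILLISECOND_THRESHOLD`, chosen for illustration; it would misclassify millisecond timestamps from the early 1970s, so it is not a general solution.

```python
# Assumed cutoff: present-day epoch seconds are around 1.7e9,
# while present-day epoch milliseconds are around 1.7e12.
MILLISECOND_THRESHOLD = 1e11


def to_epoch_seconds(timestamp):
    """Convert a timestamp to seconds, treating very large values as milliseconds."""
    if timestamp >= MILLISECOND_THRESHOLD:
        return timestamp / 1000.0
    return float(timestamp)


# Hypothetical column where the middle value was recorded in milliseconds.
mixed = [1_700_000_000, 1_700_000_060_000, 1_700_000_120]
seconds = [to_epoch_seconds(t) for t in mixed]
```

After conversion the three readings are one minute apart, which is the granularity the original data presumably intended.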
3. Model Planning
➢ The third phase of the lifecycle is model planning. At this stage, the division of work is decided so that the workload of each team member is clearly defined.

➢ The data prepared in the previous phase is explored further to understand the various features and their relationships, and to select the features that will be applied to the model.
➢ In this step, the data analytics team also plans the methods, techniques, and workflow it will use to build the model in the subsequent phase (the model building phase).

➢ Model building starts with identifying the relationships between data points in order to select the key variables and eventually find a suitable model.

➢ The team develops datasets for testing, training, and production.

➢ In the later phases, the team builds and executes the models that were planned in the model planning stage.
• After mapping out your business goals and collecting a large amount of data (structured, unstructured, or semi-structured), it is time to build a model that utilizes the data to achieve the goal. This stage of the data analytics process is model planning.

There are several techniques available to load data into the system:

• ETL (Extract, Transform, and Load) transforms the data first, using a set of business rules, before loading it into a sandbox.

• ELT (Extract, Load, and Transform) first loads raw data into the sandbox and then transforms it.

• ETLT (Extract, Transform, Load, Transform) is a mixture; it has two transformation levels.
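The difference between the three loading techniques is only the order of the steps, which a short sketch can make concrete. Everything here is hypothetical: the sandbox is modeled as a plain dictionary, and `extract`, `light_clean`, and `transform` stand in for whatever source systems and business rules a real project would use.

```python
def extract():
    """Pretend source system: raw rows with untidy text fields."""
    return [{"sku": " a1 ", "price": "10.50"}, {"sku": "B2", "price": "3"}]


def light_clean(rows):
    """First-level transformation: trim whitespace only."""
    return [{**r, "sku": r["sku"].strip()} for r in rows]


def transform(rows):
    """Full transformation: tidy the SKU and convert the price to a number."""
    return [{"sku": r["sku"].strip().upper(), "price": float(r["price"])} for r in rows]


def etl(sandbox):
    # Transform first, then load only the finished table.
    sandbox["products"] = transform(extract())


def elt(sandbox):
    # Load the raw rows first, then transform inside the sandbox.
    sandbox["products_raw"] = extract()
    sandbox["products"] = transform(sandbox["products_raw"])


def etlt(sandbox):
    # Light transformation before loading, full transformation after.
    sandbox["products_staged"] = light_clean(extract())
    sandbox["products"] = transform(sandbox["products_staged"])


etl_box, elt_box, etlt_box = {}, {}, {}
etl(etl_box)
elt(elt_box)
etlt(etlt_box)
```

All three sandboxes end up with the same `products` table; only ELT keeps the untouched raw rows available for later exploration.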
Key Points in the Model planning Phase:
1. Data Exploration and Variable Selection
In Phase 3, the objective of the data exploration is to understand
the relationships among the variables to inform selection of the
variables and methods and to understand the problem domain.

2. Model Selection
In the model selection subphase, the team's main goal is to choose
an analytical technique, or a short list of candidate techniques,
based on the end goal of the project.
Common Tools for the Model Planning Phase

1. R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code.
2. SQL Analysis Services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.
3. SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB.
4. Model Building
➢ The fourth phase of the lifecycle is model building, in which the team works on developing datasets for training and testing as well as for production purposes.

➢ Also, the execution of the model, based on the planning made in the previous phase, is carried out.

➢ The kind of environment needed for the execution of the model is decided and prepared so that if a more robust environment is required, it is accordingly applied.
Key Points in the Model Building Phase:
✓ Team develops datasets for testing, training, and production purposes.
✓ Team also considers whether the existing tools will be sufficient to run the
models or if they need a more robust environment for executing the models.
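Developing separate training and testing datasets can be sketched as a reproducible random split. The function name `split_dataset`, the 80/20 proportion, and the fixed seed are assumptions for illustration; the production dataset mentioned above is not covered by this sketch.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split repeatable across runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical dataset of 100 record identifiers.
records = list(range(100))
train_rows, test_rows = split_dataset(records)
```

Keeping the seed fixed means that a model re-run later is evaluated on the same held-out rows, which makes results comparable between iterations.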

• Common Tools for the Model Building Phase:
• Free or open-source tools – R and PL/R, Octave, WEKA, RapidMiner, Power BI, Tableau.
• Commercial tools – MATLAB, STATISTICA.
Questions to consider in the Modeling Phase include:

• Does the model appear valid and accurate on the test data?

• Does the model output/behavior make sense to the domain experts? That is, does it
appear as if the model is giving answers that make sense in this context?
• Do the parameter values of the fitted model make sense in the context of the domain?
• Is the model sufficiently accurate to meet the goal?
• Does the model avoid intolerable mistakes? Depending on context, false positives may be more serious or less serious than false negatives.
• Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated? Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If so, go back to the model planning phase and revise the modeling approach.
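The questions about accuracy on test data and about false positives versus false negatives can be made concrete with confusion counts. In the sketch below, the labels, the function names, and the cost weights (a false negative assumed five times as costly as a false positive) are hypothetical; the right weights depend on the business context.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


def weighted_error_cost(counts, fp_cost=1.0, fn_cost=5.0):
    """Total cost of mistakes under assumed per-error costs."""
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost


# Hypothetical test-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
```

Two models with the same accuracy can have very different values of `weighted_error_cost`, which is why the question about intolerable mistakes is asked separately from the question about accuracy.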
5. Communicate Results

➢ Phase five of the life cycle checks the results of the project to
find whether it is a success or failure.

➢ The result is studied by the entire team along with its stakeholders
to draw inferences/ implications on the key findings and
summarize the entire work done.

➢ Also, the business value is quantified and measured, and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.
Key Points in this Phase:
✓ After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.

✓ The team considers how best to communicate findings and outcomes to the various team members and stakeholders, taking into account warnings and assumptions.

✓ The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
6. Operationalization
➢ In phase six, the team prepares a final report along with the briefings, source code, and related documents.

➢ The last phase also involves running a pilot project to implement the model and test it in a real-time environment.

➢ Because data analytics helps build models that lead to better decision-making, it in turn adds value to individuals, customers, business sectors, and other organizations.
Key Points in this Phase:

✓ The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.

✓ This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.

✓ The team delivers final reports, briefings, and code.
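One way to deploy in a controlled way is to assign a stable subset of units to the pilot. The sketch below is an assumption-laden illustration: the unit of rollout (outlets), the identifier format, and the hash-bucket scheme are all hypothetical choices, not something the slides prescribe.

```python
import hashlib


def in_pilot(outlet_id: str, pilot_percent: int) -> bool:
    """Deterministically place an outlet in the pilot based on a hash bucket (0-99)."""
    digest = hashlib.sha256(outlet_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < pilot_percent


# Hypothetical outlet identifiers.
outlets = [f"outlet-{n:03d}" for n in range(200)]
pilot_10 = [o for o in outlets if in_pilot(o, 10)]
pilot_20 = [o for o in outlets if in_pilot(o, 20)]
```

Because the bucket depends only on the identifier, raising the percentage keeps every outlet that was already in the pilot, so the rollout can be widened step by step toward full deployment.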


Data Analytics Life Cycle

Example
Consider an example of a retail store chain that wants to optimize its products' prices to boost its revenue. The store chain has thousands of products across hundreds of outlets, making it a highly complex scenario. Once you identify the store chain's objective, you find the data you need, prepare it, and go through the data analytics life cycle process.
You observe different types of customers, such as ordinary customers and customers like contractors who buy in bulk. You suspect that treating these customer types differently may lead to the solution. However, you don't have enough information about them and need to discuss this with the client team.

In this case, you need to get the definitions of the customer types, find the data, and conduct hypothesis testing to check whether the customer types affect the model results. Once you are convinced by the model results, you can deploy the model, integrate it into the business, and roll out the prices you consider most optimal across the outlets of the store.
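The hypothesis-testing step in this example can be sketched with a permutation test on the difference in average basket value between two customer types. This is only an illustration: the sample values, the group names `ordinary` and `contractors`, and the choice of a permutation test are assumptions, not part of the case study.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, rounds=2000, seed=0):
    """Return the observed mean difference and an estimated two-sided p-value."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(rounds):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (extreme + 1) / (rounds + 1)


# Hypothetical basket values per visit for the two customer types.
ordinary = [12, 15, 9, 14, 11, 13, 10, 16]
contractors = [85, 120, 95, 110, 130, 90, 105, 115]
difference, p_value = permutation_test(ordinary, contractors)
```

A small `p_value` would support treating the two customer types differently in the pricing model, while a large one would suggest the distinction adds little.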
Advantages of the Data Analytics Life Cycle

• Identification of Potential Risks
Businesses operate in high-risk settings and thus need efficient risk management solutions to deal with problems. Creating efficient risk management procedures and strategies depends heavily on big data. The data analytics life cycle and its tools quickly minimize risks by optimizing complicated decisions for unforeseen occurrences and prospective threats.
• Reducing Cost
• Increasing Efficiency
Key roles for a successful analytics Project
➢ While proceeding through these six phases, the various stakeholders that can be involved in the planning, implementation, and decision-making are:

✓ Data analysts, business intelligence analysts, database administrators, data engineers, executive project sponsors, project managers, and data scientists.

➢ All these stakeholders are closely involved in the proper planning and completion of the project, keeping in mind the various crucial factors to be considered for the success of the project.

▪ Business User:
Someone who understands the domain area and usually
benefits from the results. This person can consult and advise
the project team on the context of the project, the value of the
results, and how the outputs will be operationalized.

▪ Project Sponsor:
Responsible for establishing the project. Provides the impetus
and requirements for the project and defines the core
business problem. Generally, provides the funding and
measures the degree of value from the final outputs of
the working team.

▪ Project Manager:
Ensures that key milestones and objectives are met on time and
at the expected quality.

▪ Business Intelligence Analyst:
Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources.

▪ Database Administrator (DBA):
Provisions and configures the database environment to support
the analytics needs of the working team. These responsibilities
may include providing access to key databases or tables and
ensuring the appropriate security levels are in place related to the
data repositories.

▪ Data Engineer:
Leverages deep technical skills to assist with tuning SQL queries
for data management and data extraction and provides support for
data ingestion into the analytic sandbox.

IMPORTANT QUESTIONS
1. In which phase would the team expect to invest most of the project time?
Why? Where would the team expect to spend the least time?
2. What are the benefits of doing a pilot program before a full-scale rollout of
a new analytical methodology? Discuss this in the context of the mini case
study.

3. What kinds of tools would be used in the following phases, and for which kinds of use scenarios?
a. Phase 2: Data preparation
b. Phase 4: Model building
Thank You
