Chapter 1
Introduction to Data Analytics and Life Cycle
Ms. Tina Maru
Assistant Professor
Artificial Intelligence and Data Science Department
Key Roles for a Successful Analytics Project
Key Roles
Business User: Someone who understands the domain area and usually benefits from
the results. This person can consult and advise the project team on the context of the
project, the value of the results, and how the outputs will be operationalized. Usually
a business analyst, line manager, or deep subject matter expert in the project domain
fulfills this role.
Project Sponsor: Responsible for the genesis of the project. Provides the impetus
and requirements for the project and defines the core business problem. Generally
provides the funding and gauges the degree of value from the final outputs of the
working team. This person sets the priorities for the project and clarifies the desired
outputs.
Project Manager: Ensures that key milestones and objectives are met on time and at
the expected quality.
Key Roles
Business Intelligence Analyst: Provides business domain expertise based on a
deep understanding of the data, key performance indicators (KPIs), key
metrics, and business intelligence from a reporting perspective. Business
Intelligence Analysts generally create dashboards and reports and have
knowledge of the data feeds and sources.
Database Administrator (DBA): Provisions and configures the database
environment to support the analytics needs of the working team. These
responsibilities may include providing access to key databases or tables and
ensuring the appropriate security levels are in place related to the data
repositories.
Key Roles
Data Engineer: Leverages deep technical skills to assist with tuning SQL
queries for data management and data extraction, and provides support for
data ingestion into the analytic sandbox. Whereas the DBA sets up and
configures the databases to be used, the data engineer executes the actual
data extractions and performs substantial data manipulation to facilitate the
analytics. The data engineer works closely with the data scientist to help
shape data in the right ways for analyses.
Data Scientist: Provides subject matter expertise for analytical techniques,
data modeling, and applying valid analytical techniques to given business
problems. Ensures overall analytics objectives are met. Designs and executes
analytical methods and approaches with the data available to the project.
Background and Overview of the Data Analytics Lifecycle
Overview of Data Analytics Lifecycle
Phase 1—Discovery: In Phase 1, the team learns the
business domain, including relevant history such as
whether the organization or business unit has attempted
similar projects in the past from which they can learn.
The team assesses the resources available to support the
project in terms of people, technology, time, and data.
Important activities in this phase include framing the
business problem as an analytics challenge that can be
addressed in subsequent phases and formulating initial
hypotheses (IHs) to test and begin learning the data.
Overview of Data Analytics Lifecycle
Phase 2—Data preparation: Phase 2 requires the presence
of an analytic sandbox, in which the team can work with
data and perform analytics for the duration of the
project. The team needs to execute extract, load, and
transform (ELT) or extract, transform, and load (ETL) processes to
get data into the sandbox. This combination of ELT and ETL is
sometimes abbreviated as ETLT. Data should be transformed in the
ETLT process so the team can work with it and analyze it.
In this phase, the team also needs to familiarize itself
with the data thoroughly and take steps to condition the
data.
Overview of Data Analytics Lifecycle
Phase 3—Model planning: Phase 3 is model planning,
where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model
building phase. The team explores the data to learn about
the relationships between variables and subsequently
selects key variables and the most suitable models.
Overview of Data Analytics Lifecycle
Phase 4—Model building: In Phase 4, the team develops
datasets for testing, training, and production purposes. In
addition, in this phase the team builds and executes
models based on the work done in the model planning
phase. The team also considers whether its existing tools
will suffice for running the models, or if it will need a
more robust environment for executing models and
workflows (for example, fast hardware and parallel
processing, if applicable).
Overview of Data Analytics Lifecycle
Phase 5—Communicate results: In Phase 5, the team, in
collaboration with major stakeholders, determines if the
results of the project are a success or a failure based on
the criteria developed in Phase 1. The team should
identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to
stakeholders.
Phase 6—Operationalize: In Phase 6, the team delivers
final reports, briefings, code, and technical documents. In
addition, the team may run a pilot project to implement
the models in a production environment.
Phase 1: Discovery
The first phase of the Data Analytics Lifecycle involves
discovery.
In this phase, the data science team must learn and
investigate the problem, develop context and
understanding, and learn about the data sources needed
and available for the project.
In addition, the team formulates initial hypotheses that
can later be tested with data.
Phase 1: Discovery
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase 1: Discovery
Learning the Business Domain
The team needs to determine how much business or domain
knowledge the data scientist needs to develop models.
This decision helps dictate the resources needed for the
project team and ensures the team has the right balance of
domain knowledge and technical expertise.
Phase 1: Discovery
Resources
Technology, tools, systems, data, and people
Skills and computing resources
Inventory of the types of data available
Long-term goals along with short-term goals
Negotiation for resources
Phase 1: Discovery
Framing the problem
Framing is the process of stating the analytics problem to be
solved.
The team needs to clearly articulate the current
situation and its main challenges.
It is also important to establish failure criteria.
Phase 1: Discovery
Identifying key stakeholders
The team can identify the success criteria, key risks, and
stakeholders, which should include anyone who will benefit
from the project or will be significantly impacted by it.
Phase 1: Discovery
Interviewing the Analytics Sponsor
The sponsor tends to be the one funding the project or providing
the high-level requirements.
This person understands the problem and usually has an idea
of a potential working solution.
Phase 1: Discovery
Questions for project sponsors:
What business problem is the team trying to solve?
What is the desired outcome of the project?
What data sources are available?
What industry issues may impact the analysis?
What timelines need to be considered?
Who could provide insight into the project?
Who has final decision-making authority on the project?
Phase 1: Discovery
Questions for project sponsors:
How will the focus and scope of the problem change if the
following dimensions change:
Time: Analyzing 1 year or 10 years’ worth of data?
People: Assess impact of changes in resources on project timeline.
Risk: Conservative to aggressive
Resources: None to unlimited (tools, technology, systems)
Size and attributes of data: Including internal and external data
sources
Phase 1: Discovery
Developing Initial Hypotheses
Forming ideas that the team can test with data.
This involves gathering and assessing hypotheses from
stakeholders and domain experts who may have their own
perspectives on what the problem is, what the solution
should be, and how to arrive at a solution.
Phase 1: Discovery
Identifying Potential Data Sources
Five main activities in this stage:
Identify data sources
Capture aggregate data sources
Review the raw data
Evaluate the data structures and tools needed
Scope the sort of data infrastructure needed for this type of
problem
Phase 2 Data Preparation
The steps to explore, preprocess, and condition data.
Requires preparing an analytics sandbox.
Typically the most labor-intensive step in the analytics lifecycle.
Phase 2 Data Preparation
1. Preparing the Analytic Sandbox
2. Performing ETLT
3. Learning About the Data
4. Data Conditioning
5. Survey and Visualize
6. Common Tools for the Data Preparation Phase
Phase 2 Data Preparation
Preparing the Analytic Sandbox
Enables the team to explore the data without interfering with
live production databases.
Collect all kinds of data: summary-level aggregated data,
structured data, raw data feeds, and unstructured text
data from call logs or web logs.
It is critical for the data science team to collaborate with IT,
make clear what it is trying to accomplish, and align
goals.
Sandbox size can vary greatly.
A good rule is to plan for the sandbox to be at least 5–10
times the size of the original datasets.
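As a rough illustration of the 5–10x sizing rule above, a minimal Python sketch (the dataset names and sizes are hypothetical):

# Rough sandbox sizing using the 5-10x rule of thumb.
# Dataset names and sizes below are hypothetical.
source_datasets_gb = {"sales_history": 120, "web_logs": 450, "crm_extract": 30}

total_gb = sum(source_datasets_gb.values())
low, high = 5 * total_gb, 10 * total_gb
print(f"Original data: {total_gb} GB")
print(f"Plan sandbox capacity of roughly {low}-{high} GB")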
Phase 2 Data Preparation
Performing ETLT
In typical extract, transform, load (ETL) processes, data is
transformed before it is loaded; the analytic sandbox approach
differs slightly.
Here the data is extracted in its raw form and loaded into the
datastore, where analysts can choose to transform the data into
a new state or leave it in its original, raw condition.
The reason for this approach is that there is significant value in
preserving the raw data and including it in the sandbox before
any transformations take place.
The data movement can be parallelized by technologies such as
Hadoop or MapReduce.
Check the inventory of the data and compare the data currently
available with the datasets the team needs.
An application programming interface (API) may be needed to
access some of the data sources.
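A minimal sketch of the extract-load-then-transform pattern described above, using pandas with SQLite standing in for the sandbox datastore (file and table names are hypothetical):

import sqlite3
import pandas as pd

# Connect to the analytic sandbox (a local SQLite file stands in for
# the team's sandbox datastore here).
conn = sqlite3.connect("analytic_sandbox.db")

# Extract + Load: land the raw feed untouched so the raw data is
# preserved in the sandbox before any transformations take place.
raw = pd.read_csv("call_logs_raw.csv")          # hypothetical raw feed
raw.to_sql("call_logs_raw", conn, if_exists="replace", index=False)

# Transform (optional, inside the sandbox): derive a cleaned table
# while the raw table remains available in its original condition.
clean = raw.dropna(subset=["customer_id"]).copy()
clean["call_date"] = pd.to_datetime(clean["call_date"], errors="coerce")
clean.to_sql("call_logs_clean", conn, if_exists="replace", index=False)

conn.close()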
Phase 2 Data Preparation
Learning About the Data
Catalog the data sources as a dataset inventory.
Clarifies the data the team has access to at the start of the project.
Highlights gaps by identifying datasets the team needs but does not
yet have.
Identifies datasets outside the organization that the team may need
to obtain from third parties.
Phase 2 Data Preparation
Data Conditioning
Data conditioning refers to the process of
cleaning data, normalizing datasets, and
performing transformations on the data.
Important to involve the data scientist in this step
because many decisions are made in the data
conditioning phase that affect subsequent
analysis.
Phase 2 Data Preparation
Data Conditioning questions to ask
What are the data sources? What are the target
fields (for example, columns of the tables)?
How clean is the data?
How consistent are the contents and files?
Assess the consistency of the data types.
Review the content of data columns or other
inputs, and check to ensure they make sense.
Look for any evidence of systematic error.
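A minimal pandas sketch of the conditioning checks listed above (the input file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("customer_orders.csv")   # hypothetical source extract

# How clean is the data? Count missing values per column.
print(df.isna().sum())

# Assess the consistency of the data types and coerce where needed.
print(df.dtypes)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Review column contents and check that they make sense; negative
# order amounts, for example, may be evidence of systematic error.
print(df["amount"].describe())
print(f"{(df['amount'] < 0).sum()} rows with negative amounts to investigate")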
Phase 2 Data Preparation
Survey and Visualize
Reviewing high-level patterns in the data enables one to understand
characteristics about the data, such as dirty data and skewness.
Review data to ensure that calculations remained consistent
within columns or across tables for a given data field.
Does the data distribution stay consistent over all the data?
Assess the granularity of the data, the range of values, and
the level of aggregation of the data.
Does the data represent the population of interest?
For time-related variables, is the level of granularity (for
example, daily, weekly, or monthly measurements) sufficient?
Is the data standardized/normalized? Are the scales consistent?
For geospatial datasets, are location values (such as state or
country abbreviations) consistent across the data?
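A minimal sketch of surveying and visualizing the data with pandas and matplotlib (file and column names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_orders.csv")   # hypothetical conditioned dataset

# High-level survey: range of values, level of aggregation, skewness.
print(df.describe())
print("Skewness of order amount:", df["amount"].skew())

# Does the distribution stay consistent over all the data?
# Compare summary statistics across time slices.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
print(df.groupby(df["order_date"].dt.year)["amount"].describe())

# Visualize the distribution to spot dirty data or heavy skew.
df["amount"].hist(bins=50)
plt.xlabel("Order amount")
plt.ylabel("Frequency")
plt.show()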
Phase 2 Data Preparation
Common Tools for the Data Preparation Phase
Hadoop can perform massively parallel ingest and custom analysis for web
traffic parsing, GPS location analytics, genomic analysis, and combining of
massive unstructured data feeds from multiple sources.
Alpine Miner provides a graphical user interface (GUI) for creating analytic
workflows, including data manipulations and a series of analytic events such
as staged data-mining techniques (for example, first select the top 100
customers, and then run descriptive statistics and clustering) on PostgreSQL
and other Big Data sources.
OpenRefine (formerly called Google Refine) is “a free, open source, powerful
tool for working with messy data.” It is a popular GUI-based tool for
performing data transformations, and it’s one of the most robust free tools
currently available.
Data Wrangler is an interactive tool for data cleaning and transformation.
Wrangler was developed at Stanford University and can be used to perform
many transformations on a given dataset. In addition, data transformation
outputs can be put into Java or Python. The advantage of this feature is that
a subset of the data can be manipulated in Wrangler via its GUI, and then the
same operations can be written out as Java or Python code to be executed
against the full, larger dataset offline in a local analytic sandbox.
Phase 3: Model Planning
Assess the structure of the datasets.
The structure of the datasets is one factor that
dictates the tools and analytical techniques for
the next phase. Depending on whether the team
plans to analyze textual data or transactional
data, for example, different tools and
approaches are required.
Ensure that the analytical techniques enable
the team to meet the business objectives and
accept or reject the working hypotheses.
Determine if the situation warrants a single
model or a series of techniques as part of a
larger analytic workflow
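A minimal sketch of data exploration and variable selection ahead of model selection, using a simple correlation screen in pandas (file, column, and target names are hypothetical):

import pandas as pd

df = pd.read_csv("customer_churn.csv")    # hypothetical prepared dataset

# Explore relationships between candidate variables and the target.
corr = df.select_dtypes("number").corr()
print(corr["churned"].sort_values(ascending=False))   # hypothetical target column

# Shortlist key variables: keep predictors with a meaningful
# correlation to the target for the model building phase.
candidates = corr["churned"].drop("churned").abs()
selected = candidates[candidates > 0.1].index.tolist()
print("Variables to carry into model building:", selected)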
Phase 3: Model Planning
Data Exploration and Variable Selection
Model Selection
Common Tools for the Model Planning Phase
R has a complete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality
code. In addition, it has the ability to interface with databases
via an ODBC connection and execute statistical tests and analyses
against Big Data via an open source connection.
SQL Analysis services can perform in-database analytics of
common data mining functions, involved aggregations, and basic
predictive models.
SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connectors such as ODBC, JDBC, and
OLE DB.
Phase 4: Model Building
The data science team needs to develop datasets
for training, testing, and production purposes.
These datasets enable the data scientist to develop
the analytical model and train it (“training data”),
while holding aside some of the data (“hold-out
data” or “test data”) for testing the model.
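A minimal scikit-learn sketch of creating training and hold-out (test) datasets and fitting a first model (file, feature, and target names are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("customer_churn.csv")                         # hypothetical prepared data
X = df[["tenure_months", "monthly_spend", "support_calls"]]    # hypothetical features
y = df["churned"]                                              # hypothetical target

# Hold aside a portion of the data ("hold-out" or "test" data)
# for evaluating the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy on hold-out data:", model.score(X_test, y_test))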
Phase 4: Model Building
Does the model appear valid and accurate on the test data?
Does the model output/behavior make sense to the domain experts?
That is, does it appear as if the model is giving answers that make sense
in this context?
Do the parameter values of the fitted model make sense in the context
of the domain?
Is the model sufficiently accurate to meet the goal?
Does the model avoid intolerable mistakes? Depending on context, false
positives may be more serious or less serious than false negatives.
Are more data or more inputs needed? Do any of the inputs need to be
transformed or eliminated?
Will the kind of model chosen support the runtime requirements?
Is a different form of the model required to address the business
problem? If so, go back to the model planning phase and revise the
modeling approach.
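A minimal sketch of checking accuracy and the balance of false positives versus false negatives on the hold-out data; it repeats the hypothetical training setup from the previous sketch so it is self-contained:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("customer_churn.csv")                         # hypothetical prepared data
X = df[["tenure_months", "monthly_spend", "support_calls"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Does the model appear valid and accurate on the test data?
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Depending on context, false positives may be more or less serious
# than false negatives; the confusion matrix makes that explicit.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"false positives: {fp}, false negatives: {fn}")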
Phase 4: Model Building
Common Tools for the Model Building Phase
Commercial Tools:
SAS Enterprise Miner
SPSS Modeler
MATLAB
Alpine Miner
STATISTICA [20] and Mathematica
Free or Open Source tools:
R and PL/R
Octave
WEKA
Python is a programming language that provides toolkits for machine learning,
analysis, and visualization.
SQL in-database implementations, such as MADlib
Phase 5: Communicate Results
Determine if the project succeeded or failed in its objectives.
Note cases where the data failed to accept or reject a given hypothesis.
Determine if the results are statistically significant and
valid.
Assess the results and identify which were in line with the
hypotheses.
Record all the findings and share with stakeholders.
Make recommendations for future work or improvements to
existing processes, and consider what each of the team
members and stakeholders needs in order to fulfill their
responsibilities.
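A minimal sketch of checking whether an observed difference in results is statistically significant, using a two-sample t-test from SciPy (the sample values are hypothetical):

from scipy import stats

# Hypothetical hold-out metric for customers who received the new
# treatment versus those who did not.
treated = [42.1, 38.7, 45.0, 41.3, 39.9, 44.2, 40.5, 43.8]
control = [36.4, 35.1, 38.2, 34.9, 37.0, 36.8, 35.6, 37.5]

t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (for example, below 0.05) suggests the difference
# is statistically significant rather than random noise.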
Phase 6: Operationalize
The team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the
work in a controlled way before broadening the work
to a full enterprise or ecosystem of users.
The technical group needs to ensure that running the model
fits smoothly into the production environment and that
the model can be integrated into related business
processes.
Create a mechanism for performing ongoing monitoring of
model accuracy and, if accuracy degrades, find ways to
retrain the model.
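A minimal sketch of that monitoring-and-retrain mechanism, assuming the team can score the deployed model on recently labeled production data (the threshold value and the retrain_fn hook are hypothetical placeholders):

from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.80   # hypothetical acceptable production accuracy

def check_and_retrain(model, recent_features, recent_labels, retrain_fn):
    """Score the deployed model on recent labeled data; retrain if accuracy degrades."""
    accuracy = accuracy_score(recent_labels, model.predict(recent_features))
    print(f"Recent production accuracy: {accuracy:.3f}")

    if accuracy < ACCURACY_THRESHOLD:
        # Accuracy has degraded below the threshold; retrain_fn is a
        # placeholder for the team's own retraining pipeline.
        model = retrain_fn(recent_features, recent_labels)
    return model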
Phase 6: Operationalize
Four main deliverables:
1. Presentation for project sponsors: This contains high-level
takeaways for executive level stakeholders, with a few
key messages to aid their decision-making process. Focus
on clean, easy visuals for the presenter to explain and for
the viewer to grasp.
2. Presentation for analysts, which describes business process
changes and reporting changes. Fellow data scientists will
want the details and are comfortable with technical
graphs.
3. Code for technical people.
4. Technical specifications of implementing the code.