Data Science: Lesson 4
Part 1
Module Objectives
Data science projects differ from most traditional Business Intelligence projects and many data analysis projects in
that data science projects are more exploratory in nature.
Many problems that appear huge and daunting at first can be broken down into smaller pieces or actionable phases
that can be more easily addressed.
Having a good process ensures a comprehensive and repeatable method for conducting analysis. In addition, it helps
focus time and energy early in the process to get a clear grasp of the business problem to be solved.
A common mistake made in data science projects is rushing into data collection and analysis without spending
sufficient time to plan and scope the amount of work involved, understand requirements, or even frame the business
problem properly.
Consequently, participants may discover mid-stream that the project sponsors are actually trying to achieve an
objective that may not match the available data, or they are attempting to address an interest that differs from what
has been explicitly communicated.
When this happens, the project may need to revert to the initial phases of the process for a proper discovery phase,
or the project may be canceled.
Creating and documenting a process helps demonstrate rigor, which provides additional credibility to the project
when the data science team shares its findings.
A well-defined process also offers a common framework for others to adopt, so the methods and analysis can be
repeated in the future or as new members join a team.
The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects. The lifecycle
has six phases, and project work can occur in several phases at once.
The following describes the various roles and key stakeholders of an analytics project. Each plays a critical part in a
successful analytics project.
Although seven roles are listed, fewer or more people can accomplish the work depending on the scope of the
project, the organizational structure, and the skills of the participants.
Business User: Someone who understands the domain area and usually benefits from the results. This person can
consult and advise the project team on the context of the project, the value of the results, and how the outputs will be
operationalized. Usually a business analyst, line manager, or deep subject matter expert in the project domain fulfills
this role.
Project Sponsor: Responsible for the genesis of the project. Provides the impetus and requirements for the project
and defines the core business problem. Generally provides the funding and gauges the degree of value from the final
outputs of the working team. This person sets the priorities for the project and clarifies the desired outputs.
Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs
of the working team. These responsibilities may include providing access to key databases or tables and ensuring the
appropriate security levels are in place related to the data repositories.
Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and data
extraction, and provides support for data ingestion into the analytic sandbox. Whereas the DBA sets up and
configures the databases to be used, the data engineer executes the actual data extractions and performs substantial
data manipulation to facilitate the analytics. The data engineer works closely with the data scientist to help shape
data in the right ways for analyses.
Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid
analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and
executes analytical methods and approaches with the data available to the project.
Phase 1—Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether
the organization or business unit has attempted similar projects in the past from which they can learn. The team
assesses the resources available to support the project in terms of people, technology, time, and data. Important
activities in this phase include framing the business problem as an analytics challenge that can be addressed in
subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2—Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work
with data and perform analytics for the duration of the project. The team needs to execute extract, load, and
transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox; the combination is sometimes
abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In
this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
Phase 3—Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the
relationships between variables and subsequently selects key variables and the most suitable models.
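To make this concrete, the brief sketch below shows one way a team might explore relationships between variables during model planning; the DataFrame, file path, and column names (including the "churn" target) are hypothetical.

    # Minimal sketch of exploring variable relationships during model planning.
    # The file path and column names below are illustrative assumptions.
    import pandas as pd

    df = pd.read_csv("sandbox/customer_data.csv")   # hypothetical extract

    # Pairwise correlations among numeric variables point to candidate predictors.
    correlations = df.corr(numeric_only=True)
    print(correlations["churn"].sort_values(ascending=False))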
Phase 4—Model building: In Phase 4, the team develops datasets for testing, training, and production purposes. In
addition, in this phase the team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust
environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
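As a rough illustration of this phase, the sketch below splits a hypothetical dataset into training and test sets and fits a simple model; the file, the columns, and the choice of logistic regression are assumptions rather than a prescribed approach.

    # Illustrative model building step on a hypothetical churn dataset.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("sandbox/customer_data.csv")   # hypothetical dataset
    X = df.drop(columns=["churn"])                  # assumes numeric feature columns
    y = df["churn"]                                 # "churn" is an assumed target column

    # Hold out a test set so the model can be evaluated on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Holdout accuracy:", model.score(X_test, y_test))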
Phase 5—Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the
results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify
key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In
addition, the team may run a pilot project to implement the models in a production environment.
Once team members have run models and produced findings, it is critical to frame these results in a way that is tailored to the
audience that engaged the team. Moreover, it is critical to frame the results of the work in a manner that demonstrates clear
value.
If the team performs a technically accurate analysis but fails to translate the results into a language that resonates with the
audience, people will not see the value, and much of the time and effort on the project will have been wasted.
Phase 1: Discovery
In this phase, the data science team must learn and investigate the problem, develop context and understanding, and
learn about the data sources needed and available for the project.
In addition, the team formulates initial hypotheses that can later be tested with data.
At this early stage in the process, the team needs to determine how much business or domain knowledge the
data scientist needs to develop models in Phases 3 and 4.
The earlier the team can make this assessment, the better, because the decision helps dictate the resources
needed for the project team and ensures the team has the right balance of domain knowledge and technical
expertise.
RESOURCES
As part of the discovery phase, the team needs to assess the resources available to support the project. In this
context, resources include technology, tools, systems, data, and people.
During this scoping, there is a need to consider the available tools and technology the team will be using and the
types of systems needed for later phases to operationalize the models. In addition, try to evaluate the level of
analytical sophistication within the organization and gaps that may exist related to tools, technology, and skills.
In addition to the skills and computing resources, it is advisable to take inventory of the types of data available to
the team for the project. Consider if the data available is sufficient to support the project’s goals. The team will
need to determine whether it must collect additional data, purchase it from outside sources, or transform existing
data.
Often, projects are started by looking only at the data available. When the data is less than hoped for, the size and
scope of the project are reduced to work within the constraints of the existing data.
Ensure the project team has the right mix of domain experts, customers, analytic talent, and project management
to be effective. In addition, evaluate how much time is needed and if the team has the right breadth and depth of
skills.
Framing the problem well is critical to the success of the project. Framing is the process of stating the analytics
problem to be solved. At this point, it is a best practice to write down the problem statement and share it with the
key stakeholders.
As part of this activity, it is important to identify the main objectives of the project, identify what needs to be
achieved in business terms, and identify what needs to be done to meet the needs. Additionally, consider the
objectives and the success criteria for the project.
Perhaps equally important is to establish failure criteria. The failure criteria will guide the team in understanding
when it is best to stop trying or settle for the results that have been gleaned from the data. Establishing criteria for
both success and failure helps the participants avoid unproductive effort and remain aligned with the project
sponsors.
Another important step is to identify the key stakeholders and their interests in the project. During these
discussions, the team can identify the success criteria, key risks, and stakeholders, which should include anyone
who will benefit from the project or will be significantly impacted by the project. The team should plan to
collaborate with the stakeholders to clarify and frame the analytics problem.
Depending on the number of stakeholders and participants, the team may consider outlining the type of activity
and participation expected from each stakeholder and participant. This will set clear expectations with the
participants and avoid delays later.
When interviewing the main stakeholders, the team needs to take time to thoroughly interview the project
sponsor, who tends to be the one funding the project or providing the high-level requirements.
This person understands the problem and usually has an idea of a potential working solution. It is critical to
thoroughly understand the sponsor’s perspective to guide the team in getting started on the project.
Following is a brief list of common questions that are helpful to ask during the discovery phase when
interviewing the project sponsor. The responses will begin to shape the scope of the project and give the team an
idea of the goals and objectives of the project.
Developing a set of IHs is a key facet of the discovery phase. This step involves forming ideas that the team can test
with data. These IHs form the basis of the analytical tests the team will use in later phases and serve as the
foundation for the findings in Phase 5.
Another part of this process involves gathering and assessing hypotheses from stakeholders and domain experts
who may have their own perspective on what the problem is, what the solution should be, and how to arrive at a
solution. These stakeholders would know the domain area well and can offer suggestions on ideas to test as the
team formulates hypotheses during this phase.
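For illustration only, the sketch below turns one hypothetical IH (customers who contact support frequently churn at a higher rate) into a quick statistical check; the data source and column names are assumptions.

    # Quick check of a hypothetical initial hypothesis (IH).
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("sandbox/customer_data.csv")   # hypothetical extract

    # IH: customers with frequent support calls churn at a higher rate.
    frequent = df[df["support_calls"] > 2]["churned"]
    infrequent = df[df["support_calls"] <= 2]["churned"]

    # A two-sample t-test on the churn indicator gives a first read on the IH;
    # later phases would apply more rigorous modeling.
    t_stat, p_value = stats.ttest_ind(frequent, infrequent, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")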
Identify data sources: Make a list of candidate data sources the team may need to test the initial hypotheses
outlined in this phase. Make an inventory of the datasets currently available and those that can be purchased or
otherwise acquired for the tests the team wants to perform.
Capture aggregate data sources: This is for previewing the data and providing high-level understanding. It enables
the team to gain a quick overview of the data and perform further exploration on specific areas. It also points the
team to possible areas of interest within the data.
Review the raw data: Obtain preliminary data from initial data feeds. Begin understanding the interdependencies
among the data attributes, and become familiar with the content of the data, its quality, and its limitations.
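A minimal sketch of this kind of preview, assuming an initial data feed has been landed as a CSV file (the path and column names are hypothetical):

    # First-pass review of a raw data feed in the sandbox.
    import pandas as pd

    raw = pd.read_csv("sandbox/initial_feed.csv")   # hypothetical initial feed

    print(raw.shape)          # volume: number of rows and columns
    print(raw.dtypes)         # data type of each attribute
    print(raw.head())         # a first look at the content
    print(raw.describe())     # high-level summary statistics
    print(raw.isna().sum())   # where data is missing or of limited quality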
Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use
to analyze the data. This evaluation gets the team thinking about which technologies may be good candidates for the
project and how to start getting access to these tools.
Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data
influences the kind of infrastructure that's required, such as disk storage and network capacity.
The team can move to the next phase when it has enough information to draft an analytics plan and share it for peer
review. Although a peer review of the plan may not actually be required by the project, creating the plan is a good test
of the team’s grasp of the business problem and the team’s approach to addressing it.
Creating the analytic plan also requires a clear understanding of the domain area, the problem to be solved, and
scoping of the data sources to be used. Developing success criteria early in the project clarifies the problem definition
and helps the team when it comes time to make choices about the analytical methods being used in later phases.
The second phase of the Data Analytics Lifecycle involves data preparation, which includes the steps to explore,
preprocess, and condition data prior to modeling and analysis. In this phase, the team needs to create a robust
environment in which it can explore the data that is separate from a production environment.
Usually, this is done by preparing an analytics sandbox. To get the data into the sandbox, the team needs to perform
ETLT, a combination of extracting, transforming, and loading data into the sandbox. Once the data is in the
sandbox, the team needs to learn about the data and become familiar with it. Understanding the data in detail is
critical to the success of the project.
The team also must decide how to condition and transform data to get it into a format to facilitate subsequent
analysis. The team may perform data visualizations to help team members understand the data, including its trends,
outliers, and relationships among data variables. Each of these steps of the data preparation phase is discussed
throughout this section.
Data preparation tends to be the most labor-intensive step in the analytics lifecycle. In fact, it is common for teams to
spend at least 50% of a data science project’s time in this critical phase. If the team cannot obtain enough data of
sufficient quality, it may be unable to perform the subsequent steps in the lifecycle process.
The data preparation phase is generally the most iterative and the one that teams tend to underestimate most often.
This is because most teams and leaders are anxious to begin analyzing the data, testing hypotheses, and getting
answers to some of the questions posed in Phase 1.
Many tend to jump into Phase 3 or Phase 4 to begin rapidly developing models and algorithms without spending the
time to prepare the data for modeling. Consequently, teams come to realize the data they are working with does not
allow them to execute the models they want, and they end up back in Phase 2 anyway.
Preparing the Analytic Sandbox
The first subphase of data preparation requires the team to obtain an analytic sandbox (also commonly referred to as
a workspace), in which the team can explore the data without interfering with live production databases.
Consider an example in which the team needs to work with a company’s financial data. The team should access a
copy of the financial data from the analytic sandbox rather than interacting with the production version of the
organization’s main database, because that will be tightly controlled and needed for financial reporting.
When developing the analytic sandbox, it is a best practice to collect all kinds of data there, as team members need
access to high volumes and varieties of data for a Big Data analytics project.
This expansive approach to collecting data of all kinds differs considerably from the approach advocated by many
information technology (IT) organizations. Many IT groups provide access to only a particular subsegment of the
data for a specific purpose. Often, the mindset of the IT group is to provide the minimum amount of data required to
allow the team to achieve its objectives.
Conversely, the data science team wants access to everything. From its perspective, more data is better, as
oftentimes data science projects are a mixture of purpose-driven analyses and experimental approaches to test a
variety of ideas.
During these discussions, the data science team needs to give IT a justification to develop an analytics sandbox,
which is separate from the traditional IT-governed data warehouses within an organization.
The analytic sandbox enables organizations to undertake more ambitious data science projects and move beyond
doing traditional data analysis and Business Intelligence to perform more robust and advanced predictive analytics.
Performing ETLT
As the team looks to begin data transformations, make sure the analytics sandbox has ample bandwidth and reliable
network connections to the underlying data sources to enable uninterrupted reads and writes.
In ETL, users perform extract, transform, load processes to extract data from a datastore, perform data
transformations, and load the data back into the datastore. However, the analytic sandbox approach differs slightly; it
advocates extract, load, and then transform.
In this case, the data is extracted in its raw form and loaded into the datastore, where analysts can choose to
transform the data into a new state or leave it in its original, raw condition. The reason for this approach is that there
is significant value in preserving the raw data and including it in the sandbox before any transformations take place.
For instance, consider an analysis for fraud detection on credit card usage. Many times, outliers in this data
population can represent higher-risk transactions that may be indicative of fraudulent credit card activity.
Using ETL, these outliers may be inadvertently filtered out or transformed and cleaned before being loaded into the
datastore. In this case, the very data that would be needed to evaluate instances of fraudulent activity would be
inadvertently cleansed, preventing the kind of analysis that a team would want to do.
Following the ELT approach gives the team clean data to analyze after the data has been loaded into the database,
as well as access to the data in its original form for finding hidden nuances.
This approach is part of the reason that the analytic sandbox can quickly grow large. The team may want clean data
and aggregated data and may need to keep a copy of the original data to compare against or look for hidden patterns
that may have existed in the data before the cleaning stage. This process can be summarized as ETLT to reflect the
fact that a team may choose to perform ETL in one case and ELT in another.
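The sketch below illustrates the ELT idea under stated assumptions: SQLite stands in for the sandbox datastore, and the file, table, and column names are hypothetical. The raw extract is loaded untouched, and a cleaned copy is created alongside it rather than in place of it.

    # Minimal ELT sketch: load the raw extract as-is, then transform a copy.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("analytic_sandbox.db")   # stand-in for the sandbox datastore

    # Extract and Load: land the raw credit card transactions untouched,
    # preserving outliers that may matter for fraud analysis.
    raw = pd.read_csv("extracts/transactions.csv")
    raw.to_sql("transactions_raw", conn, if_exists="replace", index=False)

    # Transform: build a cleaned copy for modeling, keeping the raw table intact.
    clean = raw.dropna(subset=["amount"])
    clean = clean[clean["amount"] > 0]
    clean.to_sql("transactions_clean", conn, if_exists="replace", index=False)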
In addition, it is important to catalog the data sources that the team has access to and identify additional data sources
that the team can leverage but perhaps does not have access to today. Some of the activities in this step may
overlap with the initial investigation of the datasets that occur in the discovery phase.
The following table demonstrates one way to organize this type of data inventory.
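One illustrative layout (the dataset names and categories shown here are examples, not a prescribed template):

    Dataset                  Available and   Available but    To Be       To Obtain from
                             Accessible      Not Accessible   Collected   Third Party
    Customer transactions         X
    Call center logs                              X
    Web clickstream                                               X
    Demographic overlays                                                       X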
Data Conditioning
Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on
the data. A critical step within the Data Analytics Lifecycle, data conditioning can involve many complex steps to join
or merge datasets or otherwise get datasets into a state that enables analysis in further phases.
Data conditioning is often viewed as a preprocessing step for the data analysis because it involves many operations
on the dataset before developing models to process or analyze the data. This can create the impression that the data
conditioning step should be performed only by IT, the data owners, a DBA, or a data engineer.
However, it is also important to involve the data scientist in this step because many decisions are made in the data
conditioning phase that affect subsequent analysis. Part of this phase involves deciding which aspects of particular
datasets will be useful to analyze in later steps.
Typically, data science teams would rather have too much data than too little for the analysis. Additional questions and
considerations for the data conditioning step include the following; a brief sketch of such checks appears after the list.
What are the data sources? What are the target fields (for example, columns of the tables)?
How consistent are the contents and files? Determine to what degree the data contains missing or inconsistent
values and if the data contains values deviating from normal.
Assess the consistency of the data types. For instance, if the team expects certain data to be numeric, confirm
whether it is numeric or a mixture of alphanumeric strings and text.
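A brief sketch of such checks, assuming a pandas DataFrame loaded from the sandbox (the file path and column names are illustrative):

    # Common data-conditioning checks on a sandbox extract.
    import pandas as pd

    df = pd.read_csv("sandbox/initial_feed.csv")   # hypothetical extract

    # Missing or inconsistent values per column.
    print(df.isna().sum())

    # Confirm that a field expected to be numeric really is numeric; coercion
    # exposes any alphanumeric strings mixed into the column.
    amounts = pd.to_numeric(df["amount"], errors="coerce")
    print("Non-numeric entries:", (amounts.isna() & df["amount"].notna()).sum())

    # Values deviating from normal, for example outside an expected range.
    print(df[(amounts < 0) | (amounts > 100_000)].head())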
After the team has collected and obtained at least some of the datasets needed for the subsequent analysis, a useful
step is to leverage data visualization tools to gain an overview of the data.
Seeing high-level patterns in the data enables one to understand characteristics of the data very quickly. One
example is using data visualization to examine data quality, such as whether the data contains many unexpected
values or other indicators of dirty data. Another example is skewness, such as if the majority of the data is heavily
shifted toward one value or end of a continuum.
Shneiderman [9] is well known for his mantra for visual data analysis of “overview first, zoom and filter, then details-
on-demand.”
This is a pragmatic approach to visual data analysis. It enables the user to find areas of interest, zoom and filter to
find more detailed information about a particular area of the data, and then find the detailed data behind a particular
area.
This approach provides a high-level view of the data and a great deal of information about a given dataset in a
relatively short period of time.
When pursuing this approach with a data visualization tool or statistical package, the following guidelines and considerations
are recommended; a short sketch of a few such checks appears after the list.
Review data to ensure that calculations remained consistent within columns or across tables for a given data field.
Does the data distribution stay consistent over all the data?
Assess the granularity of the data, the range of values, and the level of aggregation of the data.
Does the data represent the population of interest?
Etc.
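A short sketch of a few of these checks, assuming pandas and matplotlib and illustrative column names:

    # Overview-first visual and summary checks on a sandbox extract.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sandbox/initial_feed.csv")   # hypothetical extract

    # Skewness: is the bulk of the data shifted toward one end of the continuum?
    df["amount"].hist(bins=50)
    plt.title("Distribution of transaction amounts")
    plt.show()

    # Granularity, range of values, and level of aggregation of a field.
    print(df["amount"].describe())

    # Does the distribution stay consistent across the data, for example by month?
    print(df.groupby("month")["amount"].mean())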