
Big Data Analytics

• Unit 1: Introduction to big data analytics


– Big data overview
– State of the practice in Analytics
– Key roles for the new big data ecosystem
– Examples of big data analytics
– Data analytics lifecycle overview
– Phase 1: Discovery
– Phase 2: Data preparation
– Phase 3: Model planning
– Phase 4: Model building
– Phase 5: Communicate results
– Phase 6: Operationalize
1.1 Big data overview:
• Big Data is a term used to describe a collection of data that is huge in
volume and yet growing exponentially with time. In short, such data is so
large and complex (structured, unstructured, or sometimes semi-structured)
that none of the traditional data management tools can store or process it
efficiently.
• Organizations will therefore need new data architectures and analytic sandboxes,
new tools, new analytical methods, and an integration of multiple skills into
the new role of the data scientist.
Characteristics Of Big Data
• The following are known as the “Big Data characteristics”:

1. Volume - the size of the data

2. Velocity - the speed at which data is produced

3. Variety - data originating from a variety of sources and in a variety of formats

4. Veracity - the accuracy and trustworthiness of the data

5. Value - the business value that can be derived from the data

1.1.1 Data structure:

1. Structured data: Data containing a defined data type, format, and structure.

It is information designed with the explicit function of being easily searchable;
it is quantitative and highly organized.

2. Semi-structured data: Data organized by subject or topic, or fitting into a
hierarchical structure such as tagged markup, yet whose content itself has no
structure (a structured envelope around unstructured content).
– Ex: 1. Email - email messages contain structured fields such as sender, recipient,
date, and time, and they are organized into folders such as Inbox, Sent, and Trash,
but the data within each message body is unstructured.
– 2. XML is widely used to store and exchange semi-structured data. It allows its
users to define tags and attributes to store the data in hierarchical form.
3. Quasi-structured data: Textual data with erratic data formats that can be
formatted with effort, tools, and time (for instance, web clickstream data that
may contain inconsistencies in data values and formats).
4. Unstructured data: Usually open text, images, videos, and similar content with no
predetermined organization or design. Examples include text documents (chats, PDFs,
and presentations) and social media data such as posts, tweets, and comments
(a small sketch of each type follows below).
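• Below is an illustrative sketch (not from the source material) with tiny Python samples of each data type; the values, tags, and log line are hypothetical.

```python
# Illustrative sketch (not from the source): tiny samples of each data
# structure type and one way a Python program might begin to handle them.
import csv
import io
import re
import xml.etree.ElementTree as ET

# 1. Structured: fixed columns and types, directly queryable.
structured = "order_id,amount,country\n1001,25.50,IN\n1002,40.00,US\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["amount"])                      # '25.50'

# 2. Semi-structured: tags define a hierarchy, but the text inside is free-form.
semi = "<email><to>ops@example.com</to><body>Server was slow today...</body></email>"
body = ET.fromstring(semi).findtext("body")   # structured envelope, unstructured body
print(body)

# 3. Quasi-structured: a clickstream line with erratic, inconsistent formatting.
clickstream = "2024-01-05 10:42:01 /home?ref=ad;  user=42 | status:200"
fields = re.split(r"[;|]\s*", clickstream)    # effort and tools needed to impose structure
print(fields)

# 4. Unstructured: free text with no predetermined organization.
unstructured = "Great product, but delivery took two weeks. Would still recommend."
print(len(unstructured.split()))              # only trivial structure (a word count)
```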
1.1.2 Analyst Perspective on Data Repositories:
• Spreadsheets:
– Advantages: structured data; easy to share; end-user control
– Disadvantages: lack of security; data is scattered and needs to be centralized
• Data warehouse:
– Advantages: faster and more efficient data analysis; better decision-making; increased data accessibility
– Disadvantages: limited flexibility; data latency
• Analytic sandbox:
– Advantages: robust analytics; flexible; high-performance computing; supports a variety of data, such as raw data, textual data, and other kinds of unstructured data
• There are several things to consider with Big Data Analytics projects to
ensure the approach fits with the desired goals
1.2 State of the Practice in Analytics
1.2.1 BI Versus Data Science
What is BI?
• Business Intelligence is a process of collecting, integrating, analyzing, and
presenting data. With Business Intelligence, executives and managers can make
better-informed decisions.
What is Data science?

• Data Science is a process of extracting, manipulating, visualizing, and
maintaining data, as well as generating predictions.
1.2.2 Current Analytical Architecture:
• Analytics architecture refers to the systems, protocols, and technology used
to collect, store, and analyze data.
• The concept is an umbrella term for a variety of technical layers that allow
organizations to more effectively collect, organize, and parse the multiple
data streams they utilize.
• When building analytics architecture, organizations need to consider both
the hardware (how data will be physically stored) and the software that will
be used to manage and process it.
• Analytics architecture also focuses on multiple layers
• Analytics architecture helps you not just store your data but plan the optimal
flow for data from capture to analysis
• The typical data architectures just described are designed for storing and processing
mission-critical data, supporting enterprise applications, and enabling corporate
reporting activities.
• Although reports and dashboards are still important for organizations, most
traditional data architectures inhibit data exploration and more sophisticated
analysis.
1.2.3 Drivers of Big Data
• Medical information, such as genomic sequencing and diagnostic imaging

• Photos and video footage uploaded to the World Wide Web

• Video surveillance,

• Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smart phones
• Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry
infrastructures
• Nontraditional IT devices, including the use of radio-frequency
identification (RFID) readers, GPS navigation systems, and seismic
processing.
1.2.4 Emerging Big Data Ecosystem and a New Approach to Analytics:
• Organizations and data collectors are realizing that the data they can gather
from individuals contains intrinsic value and, as a result, a new economy is
emerging.
• As this new digital economy continues to evolve, the market sees the
introduction of data vendors and data cleaners
• There are four main groups of players within this interconnected web:
– 1. Data devices
– 2. Data collectors
– 3. Data aggregators
– 4. Data users and buyers
1.3 Key Roles for the New Big Data Ecosystem:
• The Big Data ecosystem demands three categories of roles:
1. Deep Analytical Talent:
• combination of skills to handle raw, unstructured data and to apply complex
analytical techniques at massive scales.
• This group has advanced training in quantitative disciplines, such as
mathematics, statistics, and machine learning
2. Data Savvy Professionals:
• They tend to have a base knowledge of working with data and understand some of
the work being performed by data scientists and others with deep analytical talent.

3. Technology and Data Enablers.


• People providing technical expertise to support analytical projects, such as
provisioning and administering analytical sandboxes and managing large-scale
data architectures.
• These three groups must work together closely to solve complex Big Data
challenges
• For simplicity, this discussion focuses on the emerging role of the Data
Scientist.
• Data scientists are generally thought of as having five main sets of skills and
behavioral characteristics.
• Data scientists are generally comfortable using this blend of skills to acquire,
manage, analyze, and visualize data and tell compelling stories about it
1.4 Examples of Big Data Analytics:
three examples of Big Data Analytics in different areas:
– retail,
– IT infrastructure, and
– social media.

• Retail: Big Data presents many opportunities to improve sales and marketing
analytics.
• IT infrastructure: Apache Hadoop is an open source framework that allows
companies to process vast amounts of information in a highly parallelized
way.
• Hadoop employs a distributed file system, meaning it can use a distributed cluster
of servers and commodity hardware to process large amounts of data.
• Social media: one of the most common examples of Hadoop's use is in the social
media space, where Hadoop can manage transactions, give textual updates, and
develop social graphs among millions of users.
• Twitter and Facebook generate massive amounts of unstructured data and
use Hadoop and its ecosystem of tools to manage this high volume.
2. Data Analytics Lifecycle
2.1 Data Analytics Lifecycle Overview
• The Data Analytics Lifecycle is designed specifically for Big Data problems and
data science projects. The lifecycle has six phases, and project work can occur
in several phases at once.
• For most phases in the lifecycle, the movement can be either forward or
backward.
2.1.1 Key Roles for a Successful Analytics Project
• seven key roles that need to be fulfilled for a high functioning data science
team to execute analytic projects successfully.
1. Business User:
• Someone who understands the domain area and usually benefits from the
results. This person can consult and advise the project team on the context
of the project, the value of the results, and how the outputs will be
operationalized. Usually a business analyst
2. Project Sponsor:
• Responsible for the genesis of the project.
• Provides the impetus and requirements for the project and defines the core
business problem
• Generally provides the funding and gauges the degree of value from the final
outputs of the working team
3. Project Manager:
• Ensures that key milestones and objectives are met on time and at the
expected quality.
4. Business Intelligence Analyst:
• Business Intelligence Analysts generally create dashboards and reports and
have knowledge of the data feeds and sources.
5. Database Administrator (DBA):
• Provisions and configures the database environment to support the analytics
needs of the working team.
• These responsibilities may include providing access to key databases or
tables and ensuring the appropriate security levels are in place related to the
data repositories.
6. Data Engineer:
• Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion
into the analytic sandbox.
• The data engineer works closely with the data scientist to help shape data in
the right ways for analyses.
7. Data Scientist:
• Provides subject matter expertise for analytical techniques, data modeling,
and applying valid analytical techniques to given business problems.
• Ensures overall analytics objectives are met. Designs and executes analytical
methods and approaches with the data available to the project.
2.1.2 Background and Overview of the Data Analytics Lifecycle:
Phase 1: Discovery
• The team learns the business domain, including relevant history.
• It assesses the resources available to support the project in terms of people,
technology, time, and data.
• Important activities in this phase include framing the business problem as an
analytics challenge that can be addressed in subsequent phases and
formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2: Data preparation
• It requires the presence of an analytic sandbox, in which the team can work
with data and perform analytics for the duration of the project.
• The team needs to execute extract, load, and transform (ELT) or extract,
transform and load (ETL) to get data into the sandbox.
Phase 3: Model planning
- where the team determines the methods, techniques, and workflow it
intends to follow for the subsequent model building phase. The team
explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.
Phase 4: Model building
• the team develops datasets for testing, training, and production purposes. In
addition, in this phase the team builds and executes models based on the
work done in the model planning phase.
• The team also considers whether its existing tools will suffice for running
the models, or if it will need a more robust environment for executing
models and workflows (for example, fast hardware and parallel processing,
if applicable).
Phase 5: Communicate results
• the team, in collaboration with major stakeholders, determines if the results
of the project are a success or a failure based on the criteria developed in
Phase 1.
• The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
• In Phase 6, the team delivers final reports, briefings, code, and technical
documents.
• In addition, the team may run a pilot project to implement the models in a
production environment.
2.2 Phase 1: Discovery
• the data science team must learn and investigate the problem, develop
context and understanding, and learn about the data sources needed and
available for the project.
– 2.2.1 Learning the Business Domain
– Understanding the domain area of the problem is essential.
– At this early stage in the process, the team needs to determine how much
business or domain knowledge the data scientist needs to develop models
in Phases 3 and 4.
– The earlier the team can make this assessment the better, because the
decision helps dictate the resources needed for the project team and
ensures the team has the right balance of domain knowledge and
technical expertise.
– 2.2.2 Resources:

• The team needs to assess the resources available to support the project,
including technology, tools, systems, data, and people.
– Consider the long-term goals of this kind of project, without being constrained
by the current data.
– The team will need to determine whether it must collect additional data or
purchase it from outside sources.
– Evaluate how much time is needed.

– 2.2.3 Framing the Problem

– Framing is the process of stating the analytics problem to be solved.


– identify the main objectives of the project,

– identify what needs to be achieved in business terms, and identify what needs to be
done to meet the needs. Additionally, consider the objectives and the success criteria.
– Equally important is to establish failure criteria.
– Failure criteria will guide the team in understanding when it is best to
stop trying or settle for the results that have been gleaned from the data.
– 2.2.4 Identifying Key Stakeholders
– Another important step is to identify the key stakeholders and their
interests in the project.
– During these discussions, the team can identify the success criteria, key
risks, and stakeholders, which should include anyone who will benefit
from the project or will be significantly impacted by the project

– Stakeholders can be investors, peers, customers, or superiors. A project sponsor,
on the other hand, is generally not only part of the organization but also
accountable for the project.
• 2.2.5 Interviewing the Analytics Sponsor
• The team should interview the project sponsor, who tends to be the one funding
the project or providing the high-level requirements. This person understands
the problem and usually has an idea of a potential working solution.
• It is critical to thoroughly understand the sponsor’s perspective to guide the
team in getting started on the project.
The responses will begin to shape the scope of the project and give the team an
idea of the goals and objectives of the project.
– What business problem is the team trying to solve?
– What is the desired outcome of the project?
– What data sources are available?
– What industry issues may impact the analysis?
– What timelines need to be considered?
– Who could provide insight into the project?
– Who has final decision-making authority on the project?
– 2.2.6 Developing Initial Hypotheses
– What is an initial hypothesis (IH)? It involves generating potential explanations
or predictions about a particular phenomenon or problem based on existing
knowledge, observations, and logical reasoning.
– 2.2.7 Identifying Potential Data Sources:
– Identify the kinds of data the team will need to solve the problem.
Consider the volume, type, and time span of the data needed to test the
initial hypotheses.
• The team should perform five main activities during this step of the discovery phase:

1. Identify data sources:


• Make a list of candidate data sources the team may need to test the initial hypotheses
outlined in this phase.
• Make an inventory of the datasets currently available and those that can be purchased or
otherwise acquired for the tests the team wants to perform.

2. Capture aggregate data sources:


• This is for previewing the data and providing high-level understanding.
• It enables the team to gain a quick overview of the data and perform further exploration
on specific areas.
• It also points the team to possible areas of interest within the data.

3. Review the raw data:


• Obtain preliminary data from initial data feeds.
• Begin understanding the interdependencies among the data attributes, and become
familiar with the content of the data, its quality, and its limitations.
4. Evaluate the data structures and tools needed:
• The data type and structure dictate which tools the team can use to analyze the data.

• This evaluation gets the team thinking about which technologies may be good for
the project and how to start getting access to these tools.

5. Scope the sort of data infrastructure needed for this type of problem:
• In addition to the tools needed, the data influences the kind of infrastructure that’s
required, such as disk storage and network capacity
Phase 2: Data Preparation
• The second phase of the Data Analytics Lifecycle involves data preparation,
which includes the steps to explore, preprocess, and condition data prior to
modeling and analysis
• This is done by preparing an analytics sandbox.
• To get the data into the sandbox, the team needs to perform ETLT, a
combination of extracting, transforming, and loading data into the sandbox.
• Once the data is in the sandbox, the team needs to learn about the data and
become familiar with it.
• The team also must decide how to condition and transform data to get it into
a format to facilitate subsequent analysis.
• The team may perform data visualizations to help team members understand
the data
2.3.1 Preparing the Analytic Sandbox
• The first subphase of data preparation requires the team to obtain an
analytic sandbox (also commonly referred to as a workspace), in which the
team can explore the data without interfering with live production databases.
• When developing the analytic sandbox, it is a best practice to collect all
kinds of data there, as team members need access to high volumes and
varieties of data for a Big Data analytics project.
• This can include everything from summary-level aggregated data, structured
data, raw data feeds, and unstructured text data from call logs or web logs,
depending on the kind of analysis the team plans to undertake.
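• A minimal sketch of gathering several kinds of data into one sandbox, assuming a Python/pandas environment with SQLite standing in for the sandbox store; the table names and records are hypothetical.

```python
# Minimal sketch (hypothetical file/table names and records): gathering several
# kinds of data into a single analytic sandbox using pandas and SQLite.
import pandas as pd
import sqlite3

sandbox = sqlite3.connect("analytic_sandbox.db")   # the sandbox store (assumption)

# Summary-level aggregated data (structured export, e.g. from a warehouse).
sales_summary = pd.DataFrame(
    {"month": ["2024-01", "2024-02"], "revenue": [120000.0, 135000.0]}
)
sales_summary.to_sql("sales_summary", sandbox, if_exists="replace", index=False)

# Raw, record-level feed kept in its original form for later transformation.
raw_orders = pd.DataFrame(
    {"order_id": [1001, 1002], "payload": ['{"amount":25.5}', '{"amount":40.0}']}
)
raw_orders.to_sql("raw_orders", sandbox, if_exists="replace", index=False)

# Unstructured text (e.g., call-log notes) stored as-is alongside the rest.
call_logs = pd.DataFrame({"note": ["customer reported login issue", "asked about refund"]})
call_logs.to_sql("call_logs", sandbox, if_exists="replace", index=False)

print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", sandbox))
```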
2.3.2 Performing ETLT
• The ELT approach advocates extract, load, and then transform.
• In this case, the data is extracted in its raw form and loaded into the data
store, where analysts can choose to transform the data into a new state or
leave it in its original, raw condition.
• The reason for this approach is that there is significant value in preserving
the raw data and including it in the sandbox before any transformations take
place.
• The team may want clean data and aggregated data and may need to keep a
copy of the original data to compare against or look for hidden patterns that
may have existed in the data before the cleaning stage.
• This process can be summarized as ETLT to reflect the fact that a team may
choose to perform ETL in one case and ELT in another.
• Depending on the size and number of the data sources, the team may need to
consider how to parallelize the movement of the datasets into the sandbox.
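• A minimal ELT sketch, assuming SQLite as the sandbox store; the raw table, its messy values, and the cleaning rules are hypothetical, but the pattern of loading raw data first and deriving a transformed copy alongside it follows the ELT idea described above.

```python
# Minimal ELT sketch (hypothetical tables): load raw data unchanged, then
# transform it inside the sandbox while preserving the original copy.
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: the raw feed goes in exactly as received (even messy values).
con.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, country TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1001, "25.50", "IN"), (1002, " 40.00 ", "us"), (1003, "n/a", "US")],
)

# Transform: derive a cleaned table, leaving raw_orders intact for comparison
# or for discovering patterns that cleaning might hide.
con.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(TRIM(country))       AS country
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)
print(con.execute("SELECT * FROM clean_orders").fetchall())
print(con.execute("SELECT COUNT(*) FROM raw_orders").fetchone())  # raw copy preserved
```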
2.3.3 Learning About the Data

• Becoming familiar with the data itself and learning about the datasets provides
context to understand what constitutes a reasonable value and expected output
versus what is a surprising finding.
• it is important to identify additional data sources that the team can leverage.

• Some of the activities in this step may overlap with the initial investigation of the
datasets that occur in the discovery phase
2.3.4 Data Conditioning
• Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data.
• The team must decide which data to keep and which data to transform or discard.
• Data conditioning is often viewed as a preprocessing step for the data
analysis.
• Additional questions and considerations for the data conditioning step
include these.
1. What are the data sources? What are the target fields?
2. How clean is the data?
3. How consistent are the contents and files?
4. Assess the consistency of the data types
5. Look for any evidence of systematic error
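• A minimal data-conditioning sketch, assuming pandas; the dataset, the sentinel value, and the cleaning rules are hypothetical and simply walk through the checks listed above.

```python
# Minimal data-conditioning sketch (hypothetical dataset): clean, normalize,
# and check consistency before analysis.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3],
        "signup_date": ["2024-01-03", "2024-01-05", "2024-01-05", None],
        "spend": ["120", "85.5", "85.5", "-999"],   # -999 looks like a sentinel code
    }
)

# How clean is the data? Check missing values and duplicate records.
print(df.isna().sum())
df = df.drop_duplicates()

# Assess the consistency of data types: parse dates, make spend numeric.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")

# Look for evidence of systematic error, e.g. a sentinel value meaning "unknown".
df.loc[df["spend"] < 0, "spend"] = np.nan

# Normalize a numeric field (z-score) so it is ready for later modeling.
df["spend_z"] = (df["spend"] - df["spend"].mean()) / df["spend"].std()
print(df.dtypes)
print(df)
```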
2.3.5 Survey and Visualize
• After the team has collected and obtained at least some of the datasets needed for
the subsequent analysis, a useful step is to leverage data visualization tools to gain
an overview of the data.

• Seeing high-level patterns in the data enables one to understand characteristics

about the data very quickly.

• Shneiderman’s mantra for visual data analysis, “overview first, zoom and filter,
then details-on-demand,” describes how data should be presented on screen so that
it is most effective for users.
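• A minimal sketch of “overview first, zoom and filter, then details-on-demand,” assuming pandas and matplotlib; the orders dataset is synthetic.

```python
# Minimal sketch of Shneiderman's mantra using pandas and matplotlib on a
# hypothetical (synthetic) orders dataset.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
orders = pd.DataFrame(
    {
        "amount": rng.lognormal(mean=3.0, sigma=0.6, size=1000),
        "region": rng.choice(["north", "south", "east", "west"], size=1000),
    }
)

# Overview first: distribution of the whole dataset.
orders["amount"].hist(bins=40)
plt.title("All orders")
plt.show()

# Zoom and filter: restrict to one region and a value range of interest.
subset = orders[(orders["region"] == "north") & (orders["amount"] > 50)]
subset["amount"].hist(bins=20)
plt.title("North region, amount > 50")
plt.show()

# Details on demand: inspect the individual records behind the filtered view.
print(subset.sort_values("amount", ascending=False).head())
```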
2.3.6 Common Tools for the Data Preparation Phase

1. Hadoop - can perform massively parallel ingest and custom analysis for
web traffic parsing, GPS location analytics, genomic analysis, and combining of
massive unstructured data feeds from multiple sources.

2. Alpine Miner - provides a graphical user interface (GUI) for creating
analytic workflows, including data manipulations and a series of analytic events such
as staged data-mining techniques.

3. OpenRefine (formerly called Google Refine) - “a free, open source,
powerful tool for working with messy data.” It is a popular GUI-based tool for
performing data transformations.

4. Data Wrangler [13] - an interactive tool for data cleaning and
transformation.
2.4 Phase 3: Model Planning
• the data science team identifies candidate models to apply to the data for
clustering, classifying, or finding relationships in the data depending on the
goal of the project
• It is during this phase that the team refers to the hypotheses developed in
Phase 1
• These hypotheses help the team frame the analytics to execute in Phase 4
and select the right methods to achieve its objectives.
Some of the activities to consider in this phase include the following:
1. Assess the structure of the datasets.
– The structure of the datasets is one factor that dictates the tools and
analytical techniques for the next phase
2. Ensure that the analytical techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
3. Determine if the situation warrants a single model or a series of
techniques as part of a larger analytic workflow
• In addition, it is useful to research and understand how other analysts
generally approach a specific kind of problem.
• Given the kind of data and resources that are available, evaluate whether
similar, existing approaches will work or if the team will need to create
something new.
2.4.1 Data Exploration and Variable Selection
• Although some data exploration takes place in the data preparation phase,
those activities focus mainly on data hygiene and on assessing the quality of
the data itself.
• In Phase 3, the objective of the data exploration is to understand the
relationships among the variables and to understand the problem domain
• A common way to conduct this step involves using tools to perform data
visualizations.
• The model can be a statistical model or a machine learning model, depending on
the problem to be solved. If it is a regression problem, the team builds a
regression model; if it is a classification problem, the team selects a
classification algorithm such as a decision tree or SVM and trains that model.
• To determine what kind of model to use, exploratory data analysis has to be
done to understand the relationships between variables and to see
what the data can tell us (see the sketch below).
– Know the data type of each column and its range (minimum and maximum values).
– This builds an understanding of the data.
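• A minimal exploratory-data-analysis sketch, assuming pandas; the customer dataset is synthetic, and the binary target only illustrates why a classification model (rather than a regression model) would be chosen.

```python
# Minimal exploratory-data-analysis sketch (synthetic dataset): inspect types,
# ranges, and relationships between variables before choosing a model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=n),
        "monthly_usage": rng.normal(50, 15, size=n).round(1),
        "churned": rng.integers(0, 2, size=n),   # 0/1 target (classification)
    }
)

print(df.dtypes)        # data type of each column
print(df.describe())    # min, max, mean for each numeric column
print(df.corr())        # pairwise relationships between variables

# A binary target such as 'churned' suggests a classification model
# (e.g., a decision tree or SVM); a continuous target would suggest regression.
print(df.groupby("churned")["monthly_usage"].mean())
```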

2.4.2 Model Selection


• In the model selection sub phase, the team’s main goal is to choose an
analytical technique.
• The team can move to the model building phase once it has a good idea
about the type of model to try and the team has gained enough knowledge to
refine the analytics plan
2.4.3 Common Tools for the Model Planning Phase
• R - has a complete set of modeling capabilities and provides a good environment
for building interpretive models with high-quality code.
– In addition, it has the ability to interface with databases via an ODBC

• SQL - Analysis services can perform in-database analytics of common data mining
functions, involved aggregations, and basic predictive models.
• SAS/ACCESS (Statistical Analysis Software) - provides integration between SAS
and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and
OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS,
users can connect to relational databases (such as Oracle or Teradata) and data
warehouse appliances (such as Greenplum or Aster), files, and enterprise
applications
2.5 Phase 4: Model Building
• In Phase 4, the data science team needs to develop datasets for training,
testing, and production purposes.
• These datasets enable the data scientist to develop the analytical model and
train it (“training data”), while holding aside some of the data (“hold-out
data” or “test data”) for testing the model.
• In the model building phase, an analytical model is developed and fit on the
training data and evaluated (scored) against the test data
• The phases of model planning and model building can overlap quite a bit

• Statistical modeling is the use of mathematical models and statistical
assumptions to generate sample data and make predictions about the real world.
• On a small scale, assess the validity of the model and its results.
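• A minimal model-building sketch using scikit-learn (an assumption; the source lists commercial tools for this phase); the features, labels, and split ratio are hypothetical.

```python
# Minimal model-building sketch: hold out test data, fit a model on the
# training data only, and score it against the hold-out set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))                  # hypothetical features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # hypothetical 0/1 label

# Training data vs. hold-out ("test") data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on training data only.
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Evaluate (score) against the hold-out data, not the data used for training.
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```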
2.5.1 Common Tools for the Model Building Phase
• There are many tools available to assist in this phase, focused primarily on
statistical analysis or data mining software. Common tools are
Commercial Tools:
• SPSS Modeler offers methods to explore and analyze data through a GUI.
• Matlab provides a high-level language for performing a variety of data
analytics, algorithms, and data exploration.
• Alpine Miner provides a GUI front end for users to develop analytic
workflows and interact with Big Data tools and platforms on the back end.
• STATISTICA and Mathematica are also popular and well-regarded data
mining and analytics tools.
1. SPSS Modeler
• offers a variety of modeling methods taken from machine learning, artificial
intelligence, and statistics.
• It is used to derive new information from data and to develop predictive models.

2. Alpine Miner
• embeds statistical algorithms in the database to leverage the innate
capabilities of parallel processing databases.
2.6 Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure
• the key is to remember that the team must be rigorous enough with the data
to determine whether it will prove or disprove the hypotheses outlined in
Phase 1 (discovery).
• When conducting this assessment, determine if the results are statistically
significant and valid. If they are, identify the aspects of the results that stand
out (a worked sketch of such a check follows at the end of this section).
• If the results are not valid, think about adjustments that can be made to
refine and iterate on the model to make it valid
• By this time, the team should have determined which model or models
address the analytical challenge in the most appropriate way.
• Make recommendations for future work or improvements to existing
processes, and consider what each of the team members and stakeholders
needs to fulfill her responsibilities.
• The deliverables of this phase will be the most visible portion of the process
to outside stakeholders and sponsors, so take care to clearly articulate the
results and methodology.
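• A worked sketch of such a significance check, using only the Python standard library; the conversion counts and the success criterion are hypothetical numbers, and a two-proportion z-test is just one possible way to make the comparison.

```python
# Minimal sketch (hypothetical numbers): check whether results meet the Phase 1
# success criterion and are statistically significant, using a two-proportion
# z-test between a model-driven group and a control group.
from statistics import NormalDist

control_conv, control_n = 120, 2000      # control group conversions / size
model_conv, model_n = 165, 2000          # model-driven group conversions / size
success_criterion = 0.015                # Phase 1 target: +1.5 points of conversion

p1, p2 = control_conv / control_n, model_conv / model_n
p_pool = (control_conv + model_conv) / (control_n + model_n)
se = (p_pool * (1 - p_pool) * (1 / control_n + 1 / model_n)) ** 0.5
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test

print(f"lift = {p2 - p1:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
print("meets success criterion:", (p2 - p1) >= success_criterion and p_value < 0.05)
```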
2.7 Phase 6: Operationalize
• In the final phase, the team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a controlled way
before broadening the work to a full enterprise or ecosystem of users
• This approach enables the team to learn about the performance and related
constraints of the model in a production environment on a small scale and
make adjustments before a full deployment
• Part of the operationalizing phase includes creating a mechanism for
performing ongoing monitoring of model accuracy and, if accuracy
degrades, finding ways to retrain the model (a monitoring sketch follows below).
• If feasible, design alerts for when the model is operating “out-of-bounds.”
• This includes situations when the inputs are beyond the range that the model
was trained on, which may cause the outputs of the model to be inaccurate
or invalid.
• If this begins to happen regularly, the model needs to be retrained on new
data
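• A minimal monitoring sketch in Python; the accuracy floor, the training range, and the print-based alerts are hypothetical placeholders for a real production monitoring setup.

```python
# Minimal monitoring sketch (hypothetical thresholds): watch model accuracy on
# recent data, flag retraining when it degrades, and alert when inputs fall
# outside the range the model was trained on ("out-of-bounds").
from dataclasses import dataclass

@dataclass
class ModelMonitor:
    train_min: float            # smallest feature value seen during training
    train_max: float            # largest feature value seen during training
    accuracy_floor: float = 0.80

    def check_input(self, x: float) -> bool:
        """Return True if the input is within the training range."""
        in_bounds = self.train_min <= x <= self.train_max
        if not in_bounds:
            print(f"ALERT: input {x} is outside training range "
                  f"[{self.train_min}, {self.train_max}]")
        return in_bounds

    def check_accuracy(self, recent_accuracy: float) -> bool:
        """Return True if accuracy is acceptable; otherwise flag retraining."""
        if recent_accuracy < self.accuracy_floor:
            print(f"ALERT: accuracy {recent_accuracy:.2f} below "
                  f"{self.accuracy_floor:.2f}; schedule retraining on new data")
            return False
        return True

monitor = ModelMonitor(train_min=0.0, train_max=100.0)
monitor.check_input(250.0)        # out-of-bounds input triggers an alert
monitor.check_accuracy(0.74)      # degraded accuracy triggers a retraining flag
```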
The following are the key outputs for each of the main stakeholders of an analytics
project and what they usually expect at the conclusion of a project:
• Business User typically tries to determine the benefits and implications of
the findings to the business.
• Project Sponsor typically asks questions related to the business impact of
the project, the risks, and the return on investment (ROI).
• Project Manager needs to determine if the project was completed on time
and within budget and how well the goals were met.
• Business Intelligence Analyst needs to know if the reports and dashboards
he manages will be impacted and need to change.
• Data Engineer and Database Administrator (DBA) typically need to
share their code from the analytics project and create a technical document
on how to implement it.
• Data Scientist needs to share the code and explain the model to her peers,
managers, and other stakeholders.
