INTRODUCTION TO BIG DATA ANALYTICS
Utkarsh Sharma
Asst. Prof. (CSE)
Jaypee University Of Engineering & Technology
Big Data Overview
Several industries have led the way in developing their ability to
gather and exploit data:
• Credit card companies monitor every purchase their customers make and
can identify fraudulent purchases with a high degree of accuracy using
rules derived by processing billions of transactions.
• Mobile phone companies analyze subscribers' calling patterns to
determine whether a rival network is offering an attractive promotion that might
cause the subscriber to defect.
• For companies such as LinkedIn and Facebook, data itself is their primary
product.
Big Data Overview
Three attributes stand out as defining Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows
and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of new data sources, formats,
and structures, including digital traces being left on the web and other digital repositories for
subsequent analysis.
• Speed of new data creation and growth: Big Data can describe high velocity data, with rapid data
ingestion and near real time analysis.
Another definition of Big Data comes from the McKinsey Global Institute's 2011 report:
• Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value.
McKinsey's definition of Big Data implies that organizations will need new data architectures and
analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the
new role of the data scientist.
Data Deluge
An Example (Genomic Sequencing)
While the data has grown, the cost to perform this work has fallen dramatically. The cost to sequence one
human genome fell from $100 million in 2001 to $10,000 in 2011, and it continues to drop. Now,
websites such as 23andMe offer genotyping for less than $100.
Data Structures
• Big Data can come in multiple forms, including structured and
unstructured data such as financial data, text files, multimedia
files, and genetic mappings.
• Most Big Data is unstructured or semi-structured in
nature, which requires different techniques and tools to process
and analyze it.
• Distributed computing environments and massively parallel
processing (MPP) architectures that enable parallelized data
ingest and analysis are the preferred approach to process such
complex data.
Data Structures
Structured Data
• Data containing a defined data type, format, and structure (that is, transaction data, online analytical
processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
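As a minimal illustration (the inline CSV contents and column names below are hypothetical), structured data can be loaded directly into a typed, tabular form because its schema is known in advance:

```python
import io
import pandas as pd

# Inline stand-in for a structured source (a CSV export, an RDBMS table, etc.).
csv_data = io.StringIO(
    "customer_id,timestamp,amount,category\n"
    "101,2021-03-01,49.99,electronics\n"
    "102,2021-03-02,12.50,grocery\n"
)

transactions = pd.read_csv(
    csv_data,
    dtype={"customer_id": "int64", "amount": "float64", "category": "string"},
    parse_dates=["timestamp"],
)
print(transactions.dtypes)  # every column has a defined type and format
```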
Semi-structured data
• Textual data files with a discernible pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and defined by an XML schema).
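A minimal parsing sketch, assuming a small hypothetical XML document whose tags describe its own structure:

```python
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><item>keyboard</item><price>49.99</price></order>
  <order id="2"><item>monitor</item><price>199.00</price></order>
</orders>
"""

root = ET.fromstring(doc)
for order in root.findall("order"):
    # The discernible pattern (consistent tags) lets us pull out fields reliably.
    print(order.get("id"), order.findtext("item"), order.findtext("price"))
```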
Quasi-structured data
• Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in data values and formats).
• Consider the following example. A user attends the EMC World conference and subsequently runs
a Google search online to find information related to EMC and Data Science. This would produce a
URL such as https://www.google.com/#q=EMC+data+science
• After doing this search, the user may choose the second link, to read more about the headline "Data
Scientist - EMC Education, Training, and Certification." This brings the user to an emc.com site
focused on this topic and a new URL, https://education.emc.com/guest/campaign/data_science.aspx
• Arriving at this site, the user may decide to click to learn more about the process of becoming
certified in data science. The user chooses a link toward the top of the page on Certifications,
bringing the user to a new URL: https://education.emc.com/guest/certification/framework/stf/data_science.aspx
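A minimal sketch of how such clickstream URLs might be parsed, using the three URLs from the example above; note that the search terms may sit in the URL fragment or in the query string, a typical format inconsistency in quasi-structured data:

```python
from urllib.parse import urlparse, parse_qs

clickstream = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

for url in clickstream:
    parts = urlparse(url)
    # Query terms may live in the fragment (as in Google's #q=...) or the
    # query string, so both must be checked -- a typical inconsistency.
    query = parse_qs(parts.fragment) or parse_qs(parts.query)
    print(parts.netloc, parts.path, query.get("q", []))
```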
Unstructured data
• Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
• All of these heterogeneous types of data structures created the need
for specialized data storage and retrieval techniques, such as
data warehouses and the analytics sandbox.
Data Warehouse
• A data warehouse is a central repository of information that can be analyzed to make more informed
decisions.
• Data flows into a data warehouse from transactional systems, relational databases, and other sources,
typically on a regular cadence.
• Business analysts, data engineers, data scientists, and decision makers access the data
through business intelligence (BI) tools, SQL clients, and other analytics applications.
Intro. to Data Warehouse
• The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data. This
data helps analysts to take informed decisions in an organization.
• An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place whereas a Data Warehouse keeps historical data also.
• A data warehouses provides us generalized and consolidated data in multidimensional view.
Along with generalized and consolidated view of data, a data warehouses also provides us Online
Analytical Processing (OLAP) tools.
Understanding a Data Warehouse
• A data warehouse is a database that is kept separate from the organization's operational
database.
• A data warehouse is not updated frequently.
• It holds consolidated historical data, which helps the organization analyze its business.
• A data warehouse helps executives organize, understand, and use their data to make strategic
decisions.
• Data warehouse systems help integrate a diverse set of application systems.
• A data warehouse system supports consolidated historical data analysis.
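As a minimal sketch of the kind of consolidated, historical query a warehouse serves, the following uses SQLite as a stand-in for a real warehouse (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 2019, 120.0), ("North", 2020, 150.0),
     ("South", 2019, 90.0), ("South", 2020, 110.0)],
)

# Consolidated view across the time dimension: revenue by region and year.
for row in conn.execute(
    "SELECT region, year, SUM(revenue) FROM sales GROUP BY region, year"
):
    print(row)
```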
Analytics sandbox
• A workspace in which data assets are gathered from multiple sources
and technologies for analysis.
• To lessen the performance burden of the analysis, the workspace may
use in-database processing and is considered to be owned by the
analysts rather than database administrators.
• Often, this workspace is created by using a sampling of the dataset
rather than the entire dataset.
• The sandbox may also reduce the proliferation of stove-piped, partial versions of
the true data that may have developed in individual business units.
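A minimal sketch of building a sandbox from a sample rather than the full dataset (the DataFrame here is synthetic):

```python
import numpy as np
import pandas as pd

full_dataset = pd.DataFrame({
    "customer_id": np.arange(1_000_000),
    "spend": np.random.default_rng(0).gamma(2.0, 50.0, 1_000_000),
})

# Pull a 1% sample into the analyst-owned workspace.
sandbox = full_dataset.sample(frac=0.01, random_state=42)
print(len(sandbox), "rows in the sandbox")
```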
Types of Data Repositories
Business Intelligence vs Data Science
Examples of Big Data Analytics
• As mentioned earlier, Big Data presents many opportunities to improve sales and marketing
analytics.
• An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main life-
event situations:
• Marriage, when people tend to buy many new products.
• Divorce, when people buy new products and change their spending habits.
• Pregnancy, when people have many new things to buy and an urgency to buy them.
• Target determined that the most lucrative of these life events is the third: pregnancy. Using
data collected from shoppers, Target was able to identify this fact and predict which of its shoppers
were pregnant. In one case, Target knew a female shopper was pregnant even before her family
knew.
Data Science Project Lifecycle
• 1. Obtain Data
• Skills required
• how to use MySQL, PostgreSQL or MongoDB
• 2. Scrub Data
• Skills required
• You will need scripting tools like Python or R to help you to scrub the data.
• 3. Explore Data
• Skills required
• If you are using Python: NumPy, Matplotlib, Pandas, or SciPy; if you are using R:
ggplot2 or the data-exploration Swiss Army knife dplyr. On top of that, you need knowledge
and skills in inferential statistics and data visualization.
• 4. Model Data
• Skills required
• In machine learning, you will need skills in both supervised and unsupervised algorithms.
• 5. Interpreting Data
• Skills required
• You will need strong business domain knowledge to present your findings in a way that
answers the business questions you set out to answer. (A minimal end-to-end sketch of steps 1-4 follows this list.)
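The sketch below runs steps 1-4 on a synthetic dataset; in practice the Obtain step would query MySQL, PostgreSQL, or MongoDB rather than construct a DataFrame in memory:

```python
import numpy as np
import pandas as pd

# 1. Obtain: stubbed out with an in-memory DataFrame.
rng = np.random.default_rng(1)
df = pd.DataFrame({"age": rng.integers(18, 70, 200).astype(float),
                   "income": rng.normal(50_000, 15_000, 200)})
df.loc[rng.choice(200, 10, replace=False), "age"] = np.nan  # inject dirty values

# 2. Scrub: impute the missing values.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Explore: summary statistics and a quick correlation check.
print(df.describe())
print(df.corr())

# 4. Model: a one-line fit, e.g. income as a linear function of age.
slope, intercept = np.polyfit(df["age"], df["income"], deg=1)
print(f"income ~ {slope:.1f}*age + {intercept:.0f}")
```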
The Analytics Process
An analytics process contains all or some of the following phases:
• Business understanding: Identifying and understanding the business objectives
• Data Collection: Collection of data from different sources and its representation
in terms of its application.
• Data Preparation: Removing the unnecessary and unwanted data
• Data Modelling: Create a model to analyze the different relationships between
the objects.
• Data Evaluation: Evaluation and preparation of the analysis report
• Deployment: Finalizing the plan for deployment
Types of Analytics
On the basis of the problem description, four types of data analytics are used:
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Descriptive analytics : What is happening?
• This is the most common of all forms. In business it provides the analyst a view of
key metrics and measures within the business.
• Descriptive analytics juggles raw data from
multiple data sources to give valuable insights
into the past.
• However, these findings simply signal that something
is wrong or right, without explaining why.
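A minimal sketch of descriptive analytics, aggregating synthetic raw records into key metrics about the past:

```python
import pandas as pd

orders = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "revenue": [100.0, 150.0, 90.0, 60.0, 200.0],
})

# "What is happening?" -- totals, averages, and counts per month.
print(orders.groupby("month", sort=False)["revenue"].agg(["sum", "mean", "count"]))
```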
Diagnostic: Why is it happening?
• At this stage, historical data can be measured against other data to answer the question
of why something happened.
• Diagnostic analytics gives in-depth insights into a
particular problem.
• On assessment of the descriptive data, diagnostic
analytical tools will empower an analyst to drill down
and in so doing isolate the root-cause of a problem.
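A minimal drill-down sketch on synthetic data: February revenue dropped, so segmenting the change by channel helps isolate a likely root cause:

```python
import pandas as pd

feb = pd.DataFrame({
    "channel": ["web", "web", "store", "store"],
    "revenue": [40.0, 35.0, 10.0, 5.0],
    "prior_month_revenue": [45.0, 40.0, 30.0, 35.0],
})

feb["change"] = feb["revenue"] - feb["prior_month_revenue"]
# Drill down: which channel explains most of the decline?
print(feb.groupby("channel")["change"].sum().sort_values())
```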
Predictive: What is likely to happen?
• Predictive analytics tells what is likely to happen. It uses the findings
of descriptive and diagnostic analytics to detect clusters and
exceptions, and to predict future trends.
• Predictive models typically utilize
a variety of variable data to make
the prediction.
• Predictive analytics belongs to
advanced analytics types and brings
many advantages like sophisticated
analysis based on machine or deep
learning.
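A minimal predictive sketch using scikit-learn on synthetic data, predicting a binary outcome (say, churn) from usage features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))  # synthetic usage features
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```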
Prescriptive: What do I need to do?
• The purpose of prescriptive analytics is to literally prescribe what action to take to
eliminate a future problem or take full advantage of a promising trend.
• The prescriptive model utilizes an understanding of what has
happened, why it has happened and a variety of
“what-might-happen” analysis to help the user determine
the best course of action to take.
• In addition, this state-of-the-art type of data analytics requires not
only historical internal data but also external information, due
to the nature of the algorithms it is based on.
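A minimal sketch of prescriptive logic on a hypothetical payoff model: score candidate actions against a "what-might-happen" estimate and recommend the best one:

```python
def expected_profit(discount: float, predicted_demand: float) -> float:
    # Hypothetical model: demand rises with discount, margin falls with it.
    demand = predicted_demand * (1 + 2.0 * discount)
    margin = 10.0 * (1 - discount)
    return demand * margin

candidate_actions = [0.0, 0.05, 0.10, 0.20]
best = max(candidate_actions, key=lambda d: expected_profit(d, predicted_demand=100))
print(f"recommended discount: {best:.0%}")
```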
Big Data Analytics (One more categorization)
• Basic Analytics
Slicing & Dicing
Basic monitoring
Anomaly identification
• Advanced Analytics
Predictive Modelling
Text Analytics
Statistics and data mining algorithms
• Operational Analytics
• Monetized Analytics
Data Analytics Lifecycle
Brief Overview
• The Data Analytics Lifecycle is designed specifically for Big Data problems and data
science projects.
• The lifecycle has six phases, and project work can occur in several phases at once.
• For most phases in the lifecycle, the movement can be either forward or backward.
• In recent years, substantial attention has been placed on the emerging role of the data
scientist.
• Despite this strong focus on the emerging role of the data scientist specifically, there are
actually seven key roles that need to be fulfilled for a high-functioning data science team
to execute analytic projects successfully.
Key Roles for a Successful Analytics Project
• For a small, versatile team, the seven roles may be fulfilled by only three people, but a very large
project may require 20 or more people. The seven roles follow:
• Business User :- A business analyst, line manager, or deep subject matter expert in the project
domain.
• Project Sponsor :- Provides the funding and gauges the degree of value from the final outputs of the working team.
• Project Manager :- Ensures that key milestones and objectives are met on time and at the expected
quality.
• Business Intelligence Analyst :- Provides business domain expertise based on a deep
understanding of the data and key performance indicators (KPIs).
• Database Administrator (DBA) :- Provisions and configures the database environment to support
the analytics needs of the working team.
• Data Engineer :- Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into the analytic sandbox.
• Data Scientist :- Provides subject matter expertise for analytical techniques, data modeling, and
applying valid analytical techniques to given business problems.
Data Analytics Lifecycle
Phase 1: Discovery
• Learning the Business Domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
Phase 2: Data Preparation
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
Phase 3: Model Planning
• Data Exploration and Variable Selection
• Model Selection
Phase 4: Model Building
• The team develops data sets for testing, training, and production purposes.
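A minimal sketch of carving one synthetic dataset into training, testing, and a production-like holdout:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# 60% train, 20% test, 20% reserved to mimic production scoring.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_prod, y_test, y_prod = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_test), len(X_prod))
```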
Phase 5: Communicate Results
• The team, in collaboration with major stakeholders, determines if the results of the project
are a success or a failure based on the criteria developed in Phase 1.
Phase 6: Operationalize
• The team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production
environment.
Key Outputs from a Successful Analytic Project
Big Data Pre-processing
• Data preprocessing for data mining refers to the set of techniques applied
to data before a data mining method is used.
• The larger the amounts of data collected, the more sophisticated the
mechanisms required to analyze them.
• Data preprocessing adapts the data to the requirements posed by each
data mining algorithm, making it feasible to process data that could
not be handled otherwise.
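A minimal preprocessing sketch on synthetic data, showing three common adaptations: imputing missing values, encoding categories, and scaling numeric features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    "age": [25, None, 47, 31],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

raw["age"] = raw["age"].fillna(raw["age"].mean())      # impute missing values
encoded = pd.get_dummies(raw, columns=["city"])        # one-hot encode categories
encoded[["age"]] = StandardScaler().fit_transform(encoded[["age"]])  # scale numerics
print(encoded)
```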