Introduction to Analytics and Big Data - Hadoop

Rob Peglar
EMC Isilon
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA.
Member companies and individual members may use this material in
presentations and literature under the following conditions:
Any slide or slides used must be reproduced in their entirety without
modification
The SNIA must be acknowledged as the source of any material used in the
body of any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee.
Neither the author nor the presenter is an attorney and nothing in this
presentation is intended to be, or should be construed as legal advice or an
opinion of counsel. If you need legal advice or a legal opinion please
contact your attorney.
The information presented herein represents the author's personal opinion
and current understanding of the relevant issues involved. The author, the
presenter, and the SNIA do not assume any responsibility or liability for
damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
Introduction to Analytics and Big Data – Hadoop
© 2012 Storage Networking Industry Association. All Rights Reserved.
BIG DATA AND HADOOP

Data Challenges
Why Hadoop

IN 2010 THE DIGITAL UNIVERSE WAS 1.2 ZETTABYTES
IN A DECADE THE DIGITAL UNIVERSE WILL BE 35 ZETTABYTES
90% OF THE DIGITAL UNIVERSE IS UNSTRUCTURED
IN 2011 THE DIGITAL UNIVERSE IS 300 QUADRILLION FILES
The Economist, Feb 25, 2010

“TRADITIONAL BI”         “BIG DATA ANALYTICS”
Repetitive               Experimental, Ad Hoc
Structured               Mostly Semi-Structured
Operational              External + Operational
GBs to 10s of TBs        10s of TBs to 100s of PBs

Past → Future

What happened?               Reporting, Dashboards
What is happening?           Real-Time Analytics
What is likely to happen?    Predictive Analytics

Why did it happen?           Forensics & Data Mining
Why is it happening?         Real-Time Data Mining
What should I do about it?   Prescriptive Analytics

“The future is here, it’s just not evenly distributed yet.”

William Gibson

Volume: Terabytes, Records, Transactions, Tables, Files
Velocity: Batch, Near Time, Real Time, Streams
Variety: Structured, Unstructured, Semistructured
Ten Common Big Data Problems

1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”

Financial Services Healthcare

Retail Web/Social/Mobile

Manufacturing Government

Retail
• CRM – Customer Scoring
• Store Siting and Layout
• Fraud Detection / Prevention
• Supply Chain Optimization

Advertising & Public Relations
• Demand Signaling
• Ad Targeting
• Sentiment Analysis
• Customer Acquisition

Financial Services
• Algorithmic Trading
• Risk Analysis
• Fraud Detection
• Portfolio Analysis

Media & Telecommunications
• Network Optimization
• Customer Scoring
• Churn Prevention
• Fraud Prevention

Manufacturing
• Product Research
• Engineering Analytics
• Process & Quality Analysis
• Distribution Optimization

Energy
• Smart Grid
• Exploration

Government
• Market Governance
• Counter-Terrorism
• Econometrics
• Health Informatics

Healthcare & Life Sciences
• Pharmaco-Genomics
• Bio-Informatics
• Pharmaceutical Research
• Clinical Outcomes Research

Answer: Big Datasets!

Big Data analytics and the Apache Hadoop open source project are rapidly
emerging as the preferred solution to address business and technology
trends that are disrupting traditional data management and processing.
Enterprises can gain a competitive advantage by
being early adopters of big data analytics.

                                                CPU    DRAM   LAN    Disk
Annual bandwidth improvement (all milestones)   1.5    1.27   1.39   1.28
Annual latency improvement (all milestones)     1.17   1.07   1.12   1.11

Memory Wall, Storage Chasm
CPU B/W requirements out-pacing memory and storage
Disk & memory getting “further” away from CPU
Large sequential transfers better for both memory & disk
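
To make the gap concrete, the annual bandwidth factors in the table can be compounded over several years. The sketch below is only an illustration: it assumes the factors stay constant year over year, and the 10-year horizon and the function names are choices made here, not part of the original slide.

```python
# Annual bandwidth improvement factors copied from the table above.
BANDWIDTH_GROWTH = {"CPU": 1.5, "DRAM": 1.27, "LAN": 1.39, "Disk": 1.28}


def compounded(factor: float, years: int) -> float:
    """Total improvement after `years` of steady annual growth (an assumption)."""
    return factor ** years


def relative_gap(fast: str, slow: str, years: int) -> float:
    """How far `fast` has pulled ahead of `slow` after `years`."""
    return compounded(BANDWIDTH_GROWTH[fast], years) / compounded(BANDWIDTH_GROWTH[slow], years)


if __name__ == "__main__":
    # The 10-year horizon is illustrative only.
    for part in ("DRAM", "Disk"):
        print(f"CPU vs {part} after 10 years: {relative_gap('CPU', part, 10):.1f}x")
```

Under these assumptions CPU bandwidth outgrows both DRAM and disk bandwidth by roughly a factor of five in a decade, which is the "wall" and "chasm" the slide names.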

For $1000, one computer can:
Process ~32GB
Store ~15TB

99.9% of data is underutilized

WHAT IS HADOOP
Hadoop Adoption
HDFS
MapReduce
Ecosystem Projects

The Datagraph Blog

Source: Hadoop Summit Presentations
What is Hadoop?

A scalable, fault-tolerant distributed system for data storage and processing
Core Hadoop has two main components
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
– Reliable, redundant, distributed file system optimized for large files
• MapReduce: fault-tolerant distributed processing
– Programming model for processing sets of data
– Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
Operates on unstructured and structured data
A large and active ecosystem
Open source under the friendly Apache License
http://wiki.apache.org/hadoop/

• Sits on top of a native (ext3, xfs, etc.) file system
• Performs best with a ‘modest’ number of large files
• Files in HDFS are ‘write once’
• HDFS is optimized for large, streaming reads of files

HDFS
• Hadoop Distributed File System
– Data is organized into files & directories
– Files are divided into blocks, distributed across cluster nodes
– Block placement is known at runtime by MapReduce, so computation can be co-located with data
– Blocks replicated to handle failure
– Checksums used to ensure data integrity
• Replication: one and only strategy for error handling, recovery and fault tolerance
– Self Healing
– Make multiple copies
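
The block, replication, and checksum ideas above can be sketched in a few lines. This is a toy model, not HDFS code: the function names, the tiny block size, and the round-robin placement are all assumptions made for illustration (real HDFS placement is rack-aware and far more involved).

```python
import zlib


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide a file's bytes into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Hypothetical round-robin placement: each block lands on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + k) % len(nodes)] for k in range(replication)]
        for block_id in range(num_blocks)
    }


def checksum(block: bytes) -> int:
    """CRC32 stands in for whatever checksum guards block integrity."""
    return zlib.crc32(block)


def surviving_replicas(placement: dict[int, list[str]], failed_node: str) -> dict[int, list[str]]:
    """Replicas still readable after one node is lost."""
    return {b: [n for n in ns if n != failed_node] for b, ns in placement.items()}


# Illustrative values only: a 10-byte "file", 4-byte blocks, 4 nodes, 3 copies.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
after_failure = surviving_replicas(placement, "node2")
```

Because every block has copies on other nodes, losing `node2` leaves each block readable, and a stored checksum lets a reader detect a corrupted copy and fall back to another replica.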

[Figure: Hadoop cluster architecture. Clients connect to the master nodes (Name Node, Job Tracker, Secondary Name Node); each slave node runs a Data Node and a Task Tracker. Up to 4K nodes.]

[Figure: Cluster network topology. Two core switches connect the client and the racks over 1GbE/10GbE links; the racks hold the NN (Name Node), JT (Job Tracker), SNN (Secondary Name Node) and the DN, TT (Data Node, Task Tracker) nodes. Up to 4K nodes.]

MapReduce 101
Functional Programming meets Distributed Processing

Automatic parallelization and distribution

Fault Tolerance
Status and Monitoring Tools
A clean abstraction for programmers
Google Technology RoundTable: MapReduce

A method for distributing a task across multiple nodes

Each node processes data stored on that node
Consists of two developer-created phases
1. Map
2. Reduce
In between Map and Reduce is the Shuffle and Sort

A user runs a client program on a client computer
The client program submits a job to Hadoop
The job is sent to the JobTracker process on the Master Node
Each Slave Node runs a process called the TaskTracker
The JobTracker instructs TaskTrackers to run and monitor tasks
A task attempt is an instance of a task running on a slave node
There will be at least as many task attempts as there are tasks which need to be performed
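
The relationship between tasks and task attempts can be shown with a small simulation. It is a sketch under stated assumptions, not JobTracker code: `run_job`, the failure model (a listed task fails only on its first attempt), and the `max_attempts` value are all hypothetical.

```python
def run_job(tasks, fails_on_first_attempt, max_attempts=4):
    """Return the list of (task, attempt number) pairs that were run.

    A task whose attempt fails is retried, so the number of attempts is
    always at least the number of tasks.
    """
    attempts = []
    for task in tasks:
        for attempt_no in range(1, max_attempts + 1):
            attempts.append((task, attempt_no))
            # Hypothetical failure model: listed tasks fail once, then succeed.
            failed = task in fails_on_first_attempt and attempt_no == 1
            if not failed:
                break
    return attempts


# Task names are made up for the example.
job_tasks = ["map-0", "map-1", "reduce-0"]
job_attempts = run_job(job_tasks, fails_on_first_attempt={"map-1"})
```

Here three tasks produce four task attempts, because `map-1` had to be run twice.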
MapReduce: Basic Concepts

Each Mapper processes a single input split from HDFS
Hadoop passes the developer’s Map code one record at a time
Each record has a key and a value
Intermediate data is written by the Mapper to local disk
During the shuffle and sort phase, all values associated with the same intermediate key are transferred to the same Reducer
The Reducer is passed each key and a list of all its values
Output from Reducers is written to HDFS
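
These concepts map onto a very small single-process model. The sketch below is not the Hadoop API: `run_mapreduce`, the sample log records, and the mapper and reducer names are invented for illustration, and everything is held in memory rather than on local disk or in HDFS.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a toy, single-process MapReduce over (key, value) records."""
    # Map: the mapper is handed one record (a key and a value) at a time.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle and sort: all values for the same intermediate key end up together.
    grouped = defaultdict(list)
    for inter_key, inter_value in intermediate:
        grouped[inter_key].append(inter_value)

    # Reduce: the reducer is handed each key with the list of all its values.
    output = []
    for inter_key in sorted(grouped):
        output.extend(reducer(inter_key, grouped[inter_key]))
    return output


# Hypothetical input: (line offset, log line) records.
LOG_RECORDS = [(0, "GET /index"), (1, "POST /login"), (2, "GET /about")]


def method_mapper(_offset, line):
    yield line.split()[0], 1  # emit (HTTP method, 1)


def sum_reducer(method, ones):
    yield method, sum(ones)


method_counts = run_mapreduce(LOG_RECORDS, method_mapper, sum_reducer)
```

The developer writes only `method_mapper` and `sum_reducer`; the grouping step in the middle is what the framework provides.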
MapReduce Operation

What was the max/min temperature for the last century?
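
One way to phrase this question as map and reduce steps is sketched below. The readings are made-up sample values in an assumed "year,temperature" layout, and the function names are hypothetical; the slide does not specify the data format.

```python
from collections import defaultdict

# Hypothetical readings in "year,temperature_celsius" form; not real weather data.
READINGS = ["1901,-3.5", "1901,31.0", "1950,35.2", "1950,-12.1", "1999,28.4"]


def map_reading(line):
    """Map: parse one record and emit (year, temperature)."""
    year, temp = line.split(",")
    yield int(year), float(temp)


def reduce_year(year, temps):
    """Reduce: collapse all temperatures for one year to (year, min, max)."""
    return year, min(temps), max(temps)


def min_max_by_year(lines):
    groups = defaultdict(list)  # shuffle: group temperatures by year
    for line in lines:
        for year, temp in map_reading(line):
            groups[year].append(temp)
    return [reduce_year(year, groups[year]) for year in sorted(groups)]


def century_extremes(per_year):
    """A final pass over the per-year results gives the overall min and max."""
    return min(lo for _, lo, _ in per_year), max(hi for _, _, hi in per_year)


per_year = min_max_by_year(READINGS)
coldest, hottest = century_extremes(per_year)
```

Keying on the year lets each reducer work independently; the final pass touches only one small record per year.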

The requirement:
Grouped by type of customer, find out how many customers of each type
are in each country. The final result must list the country name from
countries.dat, not the 2-digit country code. Each record has a key
and a value.
To do this you need to:
Join the data sets
Key on country
Count type of customer per country
Output the results
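
The four steps above can be sketched as follows. The record layouts are assumptions (the slide does not show the customer file or the format of countries.dat), the sample rows are invented, and the join shown is a simple in-memory lookup inside the map step rather than any particular Hadoop join strategy.

```python
from collections import Counter

# Hypothetical sample rows; the real file layouts are not given in the text.
COUNTRIES = ["US|United States", "DE|Germany"]            # stands in for countries.dat
CUSTOMERS = ["alice|US|retail", "bob|DE|retail",
             "carol|US|wholesale", "dave|US|retail"]


def load_countries(lines):
    """Build a code -> country name lookup from the small data set."""
    return dict(line.split("|") for line in lines)


def map_customer(line, country_names):
    """Join and key on country: emit ((country name, customer type), 1)."""
    _name, code, customer_type = line.split("|")
    yield (country_names[code], customer_type), 1


def count_types_per_country(customers, countries):
    names = load_countries(countries)
    counts = Counter()
    for line in customers:
        for key, one in map_customer(line, names):
            counts[key] += one  # reduce: sum the ones per (country, type) key
    return dict(counts)


type_counts = count_types_per_country(CUSTOMERS, COUNTRIES)
```

Loading the country lookup into memory works here because that data set is small; joining two large data sets would instead key both on the country code and join in the reducer.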

[Figure: MapReduce data flow. Input → Map → Shuffle and Sort → Reduce → Output, shown alongside the analogous Unix pipeline: cat, grep, sort, uniq, output.]
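
The pipeline analogy can be written out directly. This is a sketch only: `pipeline` and the sample lines are invented, and the last stage counts duplicates the way `uniq -c` would, which is an assumption beyond the plain `uniq` named in the figure.

```python
from itertools import groupby


def pipeline(lines, pattern):
    """Mimic cat | grep pattern | sort | uniq -c on a list of lines."""
    matched = (line for line in lines if pattern in line)  # grep ~ Map (filter)
    ordered = sorted(matched)                              # sort ~ Shuffle and Sort
    # uniq -c ~ Reduce: adjacent equal lines collapse into (line, count).
    return [(line, len(list(group))) for line, group in groupby(ordered)]


# Hypothetical input lines.
SAMPLE_LINES = ["error disk", "ok", "error net", "error disk"]
error_counts = pipeline(SAMPLE_LINES, "error")
```

The sort step matters for the same reason as in MapReduce: `groupby`, like `uniq`, only merges items that sit next to each other.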

MapReduce Example
Problem: Count the number of times that each word appears in the following paragraph:
John has a red car, which has no radio. Mary has a red bicycle. Bill has no car or bicycle.

Map (each server counts the words in its own sentence)
Server 1: John has a red car, which has no radio.
    John: 1, has: 2, a: 1, red: 1, car: 1, which: 1, no: 1, radio: 1
Server 2: Mary has a red bicycle.
    Mary: 1, has: 1, a: 1, red: 1, bicycle: 1
Server 3: Bill has no car or bicycle.
    Bill: 1, has: 1, no: 1, car: 1, or: 1, bicycle: 1

Reduce (each server receives every count for the words assigned to it, then sums them)
Server 1 receives: John: 1, has: 2, has: 1, has: 1, a: 1, a: 1, red: 1, red: 1, Mary: 1
    Result: John: 1, has: 4, a: 2, red: 2, Mary: 1
Server 2 receives: car: 1, car: 1, which: 1, no: 1, no: 1, radio: 1
    Result: car: 2, which: 1, no: 2, radio: 1
Server 3 receives: bicycle: 1, bicycle: 1, Bill: 1, or: 1
    Result: bicycle: 2, Bill: 1, or: 1
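
The same word count can be reproduced in a few lines. This is a single-process sketch of the example, not Hadoop code: the three "servers" are just list entries, and the function names and the simple letters-only tokenizer are choices made here.

```python
import re
from collections import Counter, defaultdict

# One sentence per server, as in the example above.
SERVER_INPUTS = [
    "John has a red car, which has no radio.",
    "Mary has a red bicycle.",
    "Bill has no car or bicycle.",
]


def map_words(text):
    """Map: each server emits (word, count) pairs for its own sentence."""
    return Counter(re.findall(r"[A-Za-z]+", text))


def shuffle(mapped):
    """Shuffle: gather every count for the same word in one place."""
    grouped = defaultdict(list)
    for counts in mapped:
        for word, n in counts.items():
            grouped[word].append(n)
    return grouped


def reduce_counts(grouped):
    """Reduce: sum the list of counts for each word."""
    return {word: sum(ns) for word, ns in grouped.items()}


word_totals = reduce_counts(shuffle(map_words(text) for text in SERVER_INPUTS))
```

After the shuffle, "has" carries the list [2, 1, 1], which the reduce step sums to 4, matching the worked example.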

[Figure: Job flow. A large data set (log files, sensor data) is loaded into the Hadoop Distributed File System (HDFS); Task Trackers run Map jobs and Reduce jobs over it.]

Hadoop Ecosystem Projects
• Hadoop is a ‘top-level’ Apache project
  • Created and managed under the auspices of the Apache Software Foundation
• Several other projects exist that rely on some or all of Hadoop
  • Typically either both HDFS and MapReduce, or just HDFS
• Ecosystem Projects Include
  • Hive
  • Pig
  • HBase
  • Many more…

Hadoop                      Traditional SQL Systems    MPP Systems
Scale-Out                   Scale-Up                   Scale-Out
Key/Value Pairs             Relational Tables          Relational Tables
Functional Programming      Declarative Queries        Declarative Queries
Offline Batch Processing    Online Transactions        Online Transactions

             Traditional RDBMS          MapReduce
Data Size    Gigabytes (Terabytes)      Petabytes (Exabytes)
Access       Interactive and Batch      Batch
Updates      Read / Write many times    Write once, Read many times
Structure    Static Schema              Dynamic Schema
Integrity    High (ACID)                Low
Scaling      Nonlinear                  Linear
DBA Ratio    1:40                       1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Issues
What make and model systems are deployed?
Are certain set top boxes in need of replacement based on system
diagnostic data?
Is there a correlation between make, model or vintage of set top box and
customer churn?
What are the most expensive boxes to maintain?
Which systems should we pro-actively replace to keep customers happy?
Big Data Solution
Collect unstructured data from set top boxes—multiple terabytes
Analyze system data in Hadoop in near real time
Pull data into Hive for interactive query and modeling
Analytics with Hadoop increases customer satisfaction

Pay Per View Advertising
Issues
Fixed inventory of ad space is provided by national content providers. For
example, 100 ads offered to provider for 1 month of programming
Provider can use this space to advertise its products and services, such as
pay per view
Do we advertise “The Longest Yard” in the middle of a football game or in
the middle of a romantic comedy?
10% increase in pay per view movie rentals = $10M in incremental revenue
Big Data Solution
Collect programming data and viewer rental data in a large data repository
Develop models to correlate proclivity to rent to programming format
Find the most productive time slots and programs to advertise pay per
view inventory
Improve ad placement and pay-per-view conversion with Hadoop

Risk Modeling
• Risk Modeling
– Bank had customer data across multiple lines of business and needed to
develop a better risk picture of its customers, e.g., if direct deposits stop
coming into a checking account, it’s likely that the customer lost his/her job,
which impacts creditworthiness for other products (CC, mortgage, etc.)
– Data existed in silos across multiple LOBs and acquired bank systems
– Data size approached 1 petabyte
• Why do this in Hadoop?
– Ability to cost-effectively integrate 1+ PB of data from multiple data
sources: data warehouse, call center, chat and email
– Platform for more analysis with poly-structured data sources, e.g.,
combining bank data with credit bureau data, Twitter, etc.
– Offload intensive computation from DW
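
The direct-deposit signal described above could be computed as a group-by-customer reduction. The sketch below is purely illustrative: the event layout, the customer ids, the month numbering, and the gap threshold are all assumptions, not details from the bank's system.

```python
from collections import defaultdict

# Hypothetical (customer_id, month_index) direct-deposit events; months run 1..6.
DEPOSITS = [("c1", 1), ("c1", 2), ("c1", 3),
            ("c2", 1), ("c2", 2), ("c2", 3), ("c2", 4), ("c2", 5), ("c2", 6)]


def last_deposit_month(events):
    """Group by customer (the shuffle key) and keep the latest deposit month."""
    latest = defaultdict(int)
    for customer, month in events:
        latest[customer] = max(latest[customer], month)
    return dict(latest)


def flag_stopped(events, current_month, max_gap):
    """Customers whose most recent deposit is more than `max_gap` months old."""
    latest = last_deposit_month(events)
    return sorted(c for c, month in latest.items() if current_month - month > max_gap)


# The 2-month threshold is an illustrative choice.
at_risk = flag_stopped(DEPOSITS, current_month=6, max_gap=2)
```

The flagged ids could then be joined against other lines of business, which is the cross-silo risk picture the slide describes.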

• Sentiment Analysis
– Hadoop is frequently used to monitor what customers think of a
company’s products or services
– Data loaded from social media sources (Twitter, blogs,
Facebook, emails, chats, etc.) into Hadoop cluster
– Map/Reduce jobs run continuously to identify sentiment (e.g.,
Acme Company’s rates are “outrageous” or “rip off”)
– Negative/positive comments can be acted upon (special offer,
coupon, etc.)
• Why Hadoop
– Social media/web data is unstructured
– Amount of data is immense
– New data sources arise weekly
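
A very naive version of such a sentiment job is sketched below. Only the negative terms come from the example in the text; the positive terms, the sample messages, and the function names are hypothetical, and plain substring matching is a stand-in for a real sentiment model.

```python
from collections import Counter

NEGATIVE_TERMS = ("outrageous", "rip off")  # from the example in the text
POSITIVE_TERMS = ("great", "love")          # hypothetical additions


def map_sentiment(message):
    """Map: label one message by naive substring matching."""
    text = message.lower()
    if any(term in text for term in NEGATIVE_TERMS):
        return "negative"
    if any(term in text for term in POSITIVE_TERMS):
        return "positive"
    return "neutral"


def reduce_sentiment(messages):
    """Reduce: count how many messages received each label."""
    return Counter(map_sentiment(message) for message in messages)


# Made-up messages for illustration.
SAMPLE_MESSAGES = [
    "Acme Company's rates are outrageous",
    "Total rip off",
    "I love the new plan",
    "Called support today",
]
sentiment_counts = reduce_sentiment(SAMPLE_MESSAGES)
```

Each message is scored independently, so the map step parallelizes across however many nodes hold the social media data.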
Resources to enable the Big Data Conversation

World Economic Forum: “Personal Data: The Emergence of a New Asset Class” 2011
McKinsey Global Institute: Big Data: The next frontier for innovation,
competition, and productivity
Big Data: Harnessing a game-changing asset
IDC: 2011 Digital Universe Study: Extracting Value from Chaos
The Economist: Data, Data Everywhere
Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New
Field
O’Reilly – What is Data Science?
O’Reilly – Building Data Science Teams?
O’Reilly – Data for the public good
Obama Administration “Big Data Research and Development Initiative.”

Please send any questions or comments on this presentation to the SNIA at this address:
[email protected]
Many thanks to the following individuals
for their contributions to this tutorial.
SNIA Education Committee

Denis Guyadeen
Rob Peglar
