Big Data Analytics
Veningston .K
Associate Professor
Department of CSE
Madanapalle Institute of Technology & Science
[email protected]
Contents
1 Explosion in Quantity of Data
2 Big Data Characteristics
3 Importance of Big Data
4 Usage Example in Big Data
5 Challenges in Big Data
6 Big Data vs. Hadoop
7 Data Analytics Architecture
8 Hadoop Key Characteristics
9 MapReduce Architecture
10 Potential Applications
What is big data?
A massive volume of both structured and
unstructured data that is so large that it's
difficult to process with traditional database
and software techniques.
Explosion in Quantity of Data
Airbus A380: each engine generates 10 TB of data every 30 minutes (640 TB per flight)
Stock exchange: generates 1 TB of new trade data every day
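A quick check of the per-flight figure, as a rough sketch. The slide gives 10 TB per engine per 30 minutes and 640 TB per flight; the four engines are a known property of the A380, while the 8-hour flight duration is an assumption chosen here because it reproduces the quoted total.

```python
# Back-of-the-envelope check of the 640 TB per flight figure.
ENGINES = 4                       # the A380 is a four-engine aircraft
TB_PER_ENGINE_PER_HALF_HOUR = 10  # figure quoted on the slide
FLIGHT_HOURS = 8                  # assumption: not stated on the slide


def flight_data_tb(engines, tb_per_half_hour, hours):
    """Total sensor data for one flight, in terabytes."""
    half_hours = hours * 2
    return engines * tb_per_half_hour * half_hours


total_tb = flight_data_tb(ENGINES, TB_PER_ENGINE_PER_HALF_HOUR, FLIGHT_HOURS)
print(total_tb)  # 640
```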
Explosion in Quantity of Data
Science
Databases from astronomy, genomics, environmental data, transportation data
Humanities and Social Sciences
Scanned books, historical documents, social interaction data, new technology like GPS
Business & Commerce
Corporate sales, stock market transactions, census, airline traffic
Entertainment
Internet images, movies, MP3 files
Medicine
MRI & CT scans, patient health records
Explosion in Quantity of Data
http://newstex.com/
The Data Explosion in 2014 Minute by Minute
In 2012, Google received over 2 million search
queries per minute
Today, Google receives over 4 million search
queries per minute from the 2.4-billion-strong
global internet population
Explosion in Quantity of Data
http://newstex.com/
Every minute
Facebook users share nearly 2.5 million pieces of
content
Twitter users tweet nearly 300,000 times
Instagram users post nearly 220,000 new photos
YouTube users upload 72 hours of new video
content
Apple users download nearly 50,000 apps
Email users send over 200 million messages
Amazon generates over $80,000 in online sales
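To get a feel for these per-minute rates, they can be scaled to a full day. The sketch below uses only the figures quoted on these slides; the dictionary keys and the helper name `per_day` are illustrative.

```python
# Scale the quoted per-minute rates to daily totals.
MINUTES_PER_DAY = 24 * 60

per_minute = {
    "Google search queries": 4_000_000,
    "Facebook shares": 2_500_000,
    "Tweets": 300_000,
    "Instagram photos": 220_000,
    "Emails": 200_000_000,
}


def per_day(rate_per_minute):
    """Convert an events-per-minute rate into events per day."""
    return rate_per_minute * MINUTES_PER_DAY


daily = {name: per_day(rate) for name, rate in per_minute.items()}
for name, count in daily.items():
    print(f"{name}: about {count:,} per day")
```

At 4 million queries per minute, for example, Google would handle about 5.76 billion queries per day.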
Big Data Characteristics
Volume: data at rest
Amount of data
Velocity: data in motion
Rate at which data is collected, acquired, generated, or processed
Variety: data in many forms
Different data types such as audio, video, and image data (mostly unstructured data)
Veracity: data in doubt
Sparse, inconsistent, and missing data
4Vs
5 Vs of Big Data
Volume, Veracity, Velocity,
Variety, and Value
Banking/Marketing/IT:
Volume, Velocity, and Value
Healthcare/Life Sciences:
Veracity, Variety, and Value
Big Data Characteristics 4Vs
Types of Data
Structured
Fields/ Tables/ Columns/ RDBMS/Spreadsheet
Semi-structured
Markers/Tags to separate elements
XML/HTML
Unstructured
No fields/attributes
Free-form text (e-mail body, notes, articles)
Audio, video, and image
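The three categories can be illustrated with a small sketch using only the Python standard library. The sample records (names, amounts, the e-mail text) are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed fields and columns, so values can be addressed by name.
structured = io.StringIO("id,name,amount\n1,Asha,250\n2,Ravi,400\n")
rows = list(csv.DictReader(structured))
total_amount = sum(int(row["amount"]) for row in rows)

# Semi-structured: tags separate the elements, but records may differ in shape
# (the second order carries an extra <note> element).
semi_structured = (
    "<orders>"
    "<order id='1'><name>Asha</name></order>"
    "<order id='2'><name>Ravi</name><note>gift</note></order>"
    "</orders>"
)
root = ET.fromstring(semi_structured)
names = [order.findtext("name") for order in root.findall("order")]

# Unstructured: free-form text with no fields; only generic operations
# such as tokenizing are available without further analysis.
unstructured = "Hi team, Asha's order arrived late again. Please follow up."
word_count = len(unstructured.split())

print(total_amount, names, word_count)
```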
Comprehensive List of Big Data Statistics
http://wikibon.org/
Big Data in Today's Business and Technology Environment
2.7 zettabytes of data exist in the digital universe today
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide
Decoding the human genome originally took 10 years to process; now it can be achieved in one week
Units of data, smallest to largest:
Byte (B)
Kilobyte (KB)
Megabyte (MB)
Gigabyte (GB)
Terabyte (TB)
Petabyte (PB)
Exabyte (EB)
Zettabyte (ZB)
Yottabyte (YB)
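The unit ladder from byte to yottabyte can be made concrete with a small conversion helper. This is a sketch that assumes decimal (SI) prefixes, where each step is a factor of 1000; the function name `humanize` is illustrative.

```python
# Units in ascending order; each step is assumed to be a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize(num_bytes, base=1000):
    """Express a byte count in the largest unit that keeps the value below `base`."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < base or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= base


print(humanize(2.5e15))  # Walmart's databases: 2.5 PB
print(humanize(2.7e21))  # the digital universe: 2.7 ZB
```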
Comprehensive List of Big Data Statistics
http://wikibon.org/
The Rapid Growth of Unstructured Data
YouTube users upload 72 hours of new video every minute of
the day
571 new websites are created every minute of the day
Brands and organizations on Facebook receive 34,722 Likes
every minute of the day
100 terabytes of data are uploaded daily to Facebook
Data production will be 44 times greater in 2020 than it was in
2009
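The 44-fold projection can be turned into an implied yearly growth rate. This is a sketch that assumes steady compound growth between 2009 and 2020, which the source does not state.

```python
def implied_annual_growth(factor, years):
    """Yearly growth rate that compounds to `factor` over `years` years."""
    return factor ** (1 / years) - 1


rate = implied_annual_growth(44, 2020 - 2009)
print(f"{rate:.1%}")  # roughly 41% per year
```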
Comprehensive List of Big Data Statistics
http://wikibon.org/
The Market Challenge with Big Data
Big data is a top business priority and drives enormous
opportunity for business improvement
Customer Churn analysis (the cost of retaining an existing
customer is far less than acquiring a new one)
Government administration could save more than €100 billion
($149 billion) in operational efficiency improvements alone by
using big data
Operational Efficiency: Ratio between the input to run a
business operation and the output gained from the business
Comprehensive List of Big Data Statistics
http://wikibon.org/
Big Data & Real Business Issues
What data to collect?
Poor data can cost businesses 20% to 35% of their operating revenue.
Data scientist: "Just give me the data and I'll work out what it is we'll need."
Response: "Well, if you can tell me just exactly what you need, we'll get it for you."
Data scientist: "I'm not going to know what I need until I see it all."
Response: "You really want all the data?"
Data scientist: "Yes, ideally we'd have all the data in its most basic form."
Response: "We've got that on tape drive somewhere."
Big Data is a Hot Topic of Research because Technology Makes it
Possible to Analyze All Available Data
Cost-effectively manage and analyze all available data in its native form (unstructured, structured, streaming)
ERP: Business management software that a company can use to collect, store, manage and interpret data from many business activities, including:
Product planning, cost
Manufacturing or service delivery
Marketing and sales
Inventory management
Shipping and payment
CRM: sales, marketing, customer service, and technical support
Data sources shown: Website, Social Media, Billing, ERP, CRM, Network Switches, RFID
Common Big data Customer Scenarios
BIG DATA vs. HADOOP
Understand and navigate federated big data sources: Federated Discovery and Navigation
Manage & store huge volume of any data: Hadoop File System, MapReduce
Structure and control data: Data Warehousing
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Integrate and govern all data sources: ETL, Integration, Data Quality, Security, Lifecycle Management
A Holistic View of a Big Data System
[Figure: components shown are Real-Time Streams; Real-Time Processing (S4, Storm); Analytics; ETL; Real-Time Structured Database (HBase, GemFire, Cassandra); Big SQL (Greenplum, AsterData, etc.); Batch Processing; Unstructured Data (HDFS)]
Limitations of Existing Data Analytics Architecture
Solution: A Combined Storage Compute Layer
Why DFS?
What is Hadoop?
Apache Hadoop is a framework that allows for
the distributed processing of large data sets
across clusters of commodity computers using
a simple programming model
It is an open-source data management framework with
scale-out storage and distributed processing
Scalability: Scale-up or Scale-out
Vertical Scaling (Scale-up): Generally refers to adding more
processors and RAM, buying a more expensive and robust
server.
Pros
Lower power consumption than running multiple servers
Cooling costs are lower than with horizontal scaling
Generally less challenging to implement
Lower licensing costs
Sometimes uses less network hardware than scaling horizontally (a separate topic for later discussion)
Cons
PRICE, PRICE, PRICE
Greater risk of a hardware failure causing a bigger outage
Generally severe vendor lock-in and limited upgradeability in the future
Horizontal Scaling (Scale-out): Generally refers to adding
more servers with fewer processors and less RAM. This is usually
cheaper overall and can in principle keep scaling (although
there are usually limits imposed by software or
other attributes of an environment's infrastructure)
Pros
Much cheaper than scaling vertically
Easier to implement fault tolerance
Easy to upgrade
Cons
More licensing fees
Bigger footprint in the Data Center
Higher utility cost (Electricity and cooling)
Possible need for more networking equipment (switches/routers)
Hadoop
Open-source software framework from Apache
Inspired by
Google MapReduce
GFS (Google File System)
Core components
HDFS
Map/Reduce
Hadoop Distribution
Microsoft
IBM
Cloudera
Apache
MapR
Hortonworks
Hadoop Key Characteristics
Hadoop enables...
Scalable
New nodes can be added as needed
Cost effective
Hadoop brings massively parallel computing to commodity
servers.
The result is a sizeable decrease in the cost per terabyte of storage
Flexible
Hadoop is schema-less, and can absorb any type of data,
structured or not, from any number of sources.
Fault tolerant
When you lose a node, the system redirects work to another
location of the data and continues processing
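The fault-tolerance behaviour described above can be sketched as a lookup over replicas. This is a toy model, not Hadoop's actual scheduler: the block names, node names, and the `pick_node` helper are all hypothetical.

```python
# Hypothetical placement: each block is stored on several nodes.
replicas = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-c", "node-d"],
}


def pick_node(block, live_nodes):
    """Return the first live node that holds a copy of the block."""
    for node in replicas[block]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"no live replica for {block}")


live = {"node-a", "node-b", "node-c", "node-d"}
before = pick_node("block-1", live)  # node-a serves the block

live.discard("node-a")               # simulate losing a node
after = pick_node("block-1", live)   # work is redirected to node-b

print(before, after)
```

Because every block has more than one location, losing a single node changes where the work runs but not whether it can run.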
RDBMS vs. Hadoop
Hadoop Ecosystem
Hadoop 2.x Core Components
Main Components of HDFS
NameNode Metadata
File Blocks
HDFS Architecture
Anatomy of a File Read
Anatomy of a File Write
Replication and Rack Awareness
Hadoop 2.x Cluster Architecture
Hadoop 2.x Cluster Architecture - Federation
[Figure labels: R data, Finance data, Marketing data]
Hadoop 2.x High Availability
Hadoop 2.x Resource Management
YARN: Moving beyond MapReduce
Hadoop Cluster - Facebook
Facebook uses Hadoop to store copies of internal log and
dimension data sources, and uses it as a source
for reporting/analytics and machine learning.
2 major clusters:
1100-machine cluster with 8800 cores & about
12 PB raw storage
300-machine cluster with 2400 cores & about 3 PB
raw storage
Each node has 8 cores & 12 TB of storage
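The cluster figures can be cross-checked against the per-node specification. In this sketch the helper name `cluster_totals` is illustrative and 1 PB is taken as 1000 TB.

```python
CORES_PER_NODE = 8
TB_PER_NODE = 12


def cluster_totals(machines):
    """Return (total cores, raw storage in PB) for a cluster of identical nodes."""
    cores = machines * CORES_PER_NODE
    storage_pb = machines * TB_PER_NODE / 1000  # assumes 1 PB = 1000 TB
    return cores, storage_pb


for machines in (1100, 300):
    cores, storage_pb = cluster_totals(machines)
    print(f"{machines} machines: {cores} cores, {storage_pb:.1f} PB raw")
```

The core counts match the slide exactly (8800 and 2400). The storage totals come out at 13.2 PB and 3.6 PB, somewhat above the quoted "about 12 PB" and "about 3 PB", so the slide's storage figures are presumably rounded or not every node carries the full 12 TB.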
Hadoop 2.x Configuration files
Data Loading Techniques & Data Analysis
MapReduce Way
Why MapReduce?
Two Advantages:
Taking processing to the data
Processing data in parallel
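The map, shuffle, and reduce steps can be sketched in plain Python with the classic word count. This runs in a single process and only mirrors the shape of a MapReduce job; in Hadoop, each chunk would be mapped on the node that stores it and the chunks would be processed in parallel. The function names and input strings are illustrative.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for a block of a file held on a different node.
chunks = ["big data is big", "data moves fast"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```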
Solving the Problem with MapReduce
Hadoop 2.x MapReduce Architecture
Hadoop 2.x MapReduce Components
Application Workflow
MapReduce Paradigm
Unifying the Big Data Platform using
Virtualization
Goals
Make it fast and easy to provision new data Clusters
on Demand
Allow Mixing of Workloads
Leverage virtual machines to provide isolation (esp.
for Multi-tenant)
Unifying the Big Data Platform using
Virtualization
Leveraging Virtualization
Elastic scale
Use high-availability to protect key services, e.g.,
Hadoop's NameNode
Resource controls and sharing: re-use underutilized
memory, CPU
Use Local Disk where it's Needed
                 SAN Storage         NAS Filers          Local Storage
Cost             $2 - $10/Gigabyte   $1 - $5/Gigabyte    $0.05/Gigabyte
$1M gets:
  Capacity       0.5 Petabytes       1 Petabyte          20 Petabytes
  IOPS           200,000             400,000             10,000,000
  Throughput     1 Gbyte/sec         2 Gbyte/sec         800 Gbytes/sec
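The capacity figures follow from dividing the budget by the price per gigabyte. The sketch below assumes the low end of each quoted price range and 1 PB = 1,000,000 GB; the names are illustrative.

```python
BUDGET_USD = 1_000_000

# Low end of each price range quoted on the slide, in dollars per gigabyte.
price_per_gb = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.05}


def capacity_pb(budget_usd, usd_per_gb):
    """Petabytes of storage the budget buys (1 PB = 1,000,000 GB)."""
    return budget_usd / usd_per_gb / 1_000_000


capacities = {name: capacity_pb(BUDGET_USD, price) for name, price in price_per_gb.items()}
for name, petabytes in capacities.items():
    print(f"{name}: {petabytes:g} PB")
```

Under these assumptions the results reproduce the slide: 0.5 PB, 1 PB, and 20 PB, a 40-fold capacity advantage for local disk over SAN at the same spend.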
Text Analytics for Large unstructured information
Digital Data (1960), Data Warehouse (1980), Data Mining (1990), Big Data Analytics (1960)
Descriptive analysis (What happened?)
Diagnostic analysis (Why did it happen?)
Predictive analysis (What is going to happen?)
Prescriptive analysis (What shall we do?)
INFORM, ANALYSIS, ACT
Past vs. Future
Text Mining
Discover useful and previously unknown
gems of information from a large collection
Patterns
Trends
Associations
Search vs. Discover
                    Search (Goal oriented)    Discover (Opportunistic)
Structured Data     Data Retrieval            Data Mining
Unstructured Data   Information Retrieval     Text Mining
Analytic Stack
Hadoop, HBase,
HDFS
Hardware servers
Live Datasets
PUBMED Medical Literature Abstract
Twitter API
Yahoo API
Yelp Reviews data
Foursquare
...
Implementation of Big Data
Platforms for Large-scale Data Analysis
Parallel DBMS technologies
Proposed in late eighties
Matured over the last two decades
Multi-billion dollar industry: Proprietary DBMS Engines intended as
Data Warehousing solutions for very large enterprises
MapReduce
pioneered by Google
popularized by Yahoo! (Hadoop)
Implementation of Big Data
MapReduce
Overview:
Data-parallel programming model
An associated parallel and distributed implementation for commodity clusters
Pioneered by Google
Processes 20 PB of data per day
Popularized by open-source Hadoop
Used by Yahoo!, Facebook, Amazon, and the list is growing
Parallel DBMS technologies
Popularly used for more than two decades
Relational Data Model
Indexing
Familiar SQL interface
Advanced query optimization
Implementation of Big Data
MapReduce vs. Parallel DBMS
Feature                       Parallel DBMS               MapReduce
Schema Support                Supported                   Not out of the box
Indexing                      Supported                   Not out of the box
Programming Model             Declarative (SQL)           Imperative (C/C++, Java, ...); extensions through Pig and Hive
Optimizations (Compression,   Supported                   Not out of the box
Query Optimization)
Flexibility                   Not out of the box          Supported
Fault Tolerance               Coarse-grained techniques   Supported
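The difference between the declarative and imperative programming models can be shown on one small aggregation. This sketch uses Python's built-in `sqlite3` as a stand-in for a SQL engine (it is not a parallel DBMS), and a hand-written grouping loop as a stand-in for MapReduce-style code; the sales data is made up.

```python
import sqlite3
from collections import defaultdict

sales = [("north", 120), ("south", 80), ("north", 30), ("south", 20)]

# Declarative: state WHAT is wanted; the engine decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
declarative = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

# Imperative: spell out HOW, step by step (group by key, then reduce).
grouped = defaultdict(list)
for region, amount in sales:
    grouped[region].append(amount)
imperative = {region: sum(amounts) for region, amounts in grouped.items()}

print(declarative == imperative)  # True
```

Both paths give the same totals. The SQL version leaves room for the engine to apply indexing and query optimization, while the hand-written version gives full control over each step, which is the trade-off the comparison above describes.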
Applications for Big Data Analytics
Smarter Healthcare
Multi-channel sales
Finance
Log Analysis
Homeland Security
Traffic Control
Telecom
Search Quality
Manufacturing
Trading Analytics
Fraud and Risk
Retail: Churn analysis
Big Data Driven by Real-World Benefit
Fraud detection in Stock markets
Twitter Trend analysis
Google trend analysis
Location aware recommendations
Sentiment Analysis
Health care systems
...
Health care
Health care management for cancer survivors
Parental stress
Health status of survivors
Physical stress factors
Psychological stress factors
Psychosocial factors
Impact of family
Health care demands
Potential Applications
Healthcare Recommendation system
Patient-driven health social network
To find health related resources such
as clinical trials, physician question
and answers, emotional support, etc.
Doctor recommender system
[Figure labels: Patient; Secured ratings data; Recommendation; Secure computation]
Potential Applications
Personalized health education system
User modeling and document modeling
Similarity matching
Personalized resources to users
Nursing care plan recommender system
Recommends all the required items to
nurses
To create effective comprehensive care
plans for their patients
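The user modeling, document modeling, and similarity matching steps of the personalized health education system can be sketched as follows. The slides do not say which similarity measure is used; cosine similarity over word counts is an assumption chosen here for illustration, and the profile text and document names are made up.

```python
import math
from collections import Counter


def vectorize(text):
    """Model a user profile or a document as a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# User model: interests of one (hypothetical) user.
user_profile = vectorize("diabetes diet exercise")

# Document models: hypothetical education resources.
documents = {
    "doc_diet": "diet plans for diabetes patients",
    "doc_sleep": "sleep hygiene tips",
}

# Similarity matching: rank resources by closeness to the user model.
ranked = sorted(
    documents,
    key=lambda name: cosine(user_profile, vectorize(documents[name])),
    reverse=True,
)
print(ranked)
```

The top-ranked documents are the personalized resources that would be offered to the user.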
Other Aspects of Big Data
Provocations for Big Data
1- Bigger data are not always better data
2- Not all data are equivalent
3- Just because it is accessible doesn't make it ethical
Can we Avoid Big Data?
YES
YES
YES
How Can we Avoid Big Data?
Pay cash for everything!
Never go online!
Don't use a telephone!
Don't use smart cards!
Don't fill any prescriptions!
Never leave your house!
Summary
Big Data is Unavoidable
Greater Opportunities in
Financial Services
Retail
Manufacturing
Healthcare
Web/Social/Mobile
Government
Thank you all