Berkeley Data Analytics Stack Overview

The document summarizes the goals and components of the Berkeley Data Analytics Stack (BDAS). The goals are to enable low latency queries on historical and live data for faster decisions, and sophisticated data processing. The stack aims to make batch, streaming, and interactive computations easy by using memory aggressively and increasing parallelism. It includes components like Spark, Spark Streaming, Shark, Tachyon, and BlinkDB that integrate with existing open source tools like Hadoop, Hive and Pig. The stack is managed by Mesos to share infrastructure across frameworks.

Uploaded by

vidisha vaid

Berkeley Data Analytics Stack

Prof. Harold Liu

15 December 2014
Data Processing Goals
• Low latency (interactive) queries on
historical data: enable faster decisions
– E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming):
enable decisions on real-time data
– E.g., detect & block worms in real-time (a
worm may infect 1mil hosts in 1.3sec)
• Sophisticated data processing: enable
“better” decisions
– E.g., anomaly detection, trend analysis
Today’s Open Analytics Stack…
• ... mostly focused on large on-disk datasets: great for batch but slow
[Figure: stack layers, top to bottom: Application, Data Processing, Storage, Infrastructure]
Goals

[Figure: Batch, Interactive, and Streaming joined in one stack: "One stack to rule them all!"]
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem (Hadoop/HDFS)
Support Interactive and Streaming Comp.

• Aggressive use of memory
• Why?
1. Memory transfer rates >> disk or SSDs
2. Many datasets already fit into memory
– Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
– e.g., 1TB = 1 billion records @ 1KB each
3. Memory density (still) grows with Moore’s law
– RAM/SSD hybrid memories on the horizon
[Figure: high-end datacenter node: 16 cores; 128-512GB memory at 40-60GB/s; 10-30TB at 0.2-1GB/s (x10 disks); 1-4TB at 1-4GB/s (x4 disks); 10Gbps network]
Support Interactive and Streaming Comp.

• Increase parallelism
• Why?
– Reduce work per node → improve latency
• Techniques:
– Low latency parallel scheduler that achieves high locality
– Optimized parallel communication patterns (e.g., shuffle, broadcast)
– Efficient recovery from failures and straggler mitigation
[Figure: spreading the work over more nodes cuts the time to result from T to Tnew (< T)]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response times
• Why?
– In-memory processing does not guarantee interactive query processing
– e.g., ~10’s of seconds just to scan 512 GB of RAM!
– Gap between memory capacity and transfer rate is increasing
• Challenges:
– accurately estimate error and running time for…
– … arbitrary computations
[Figure: 16-core node: memory capacity (128-512GB) doubles every 18 months; memory transfer rate (40-60GB/s) doubles every 36 months]
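The scan-time claim on this slide follows from simple arithmetic. A minimal sketch, using only the capacity, bandwidth, and doubling periods quoted on the slide (the function names are illustrative, not from the deck):

```python
def scan_seconds(capacity_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to read `capacity_gb` once at a sustained `bandwidth_gb_per_s`."""
    return capacity_gb / bandwidth_gb_per_s


def scan_time_growth(months: float) -> float:
    """Factor by which a full-memory scan slows down after `months`,
    if capacity doubles every 18 months and transfer rate every 36."""
    return 2 ** (months / 18) / 2 ** (months / 36)


# Slide figures: 512 GB of RAM, 40-60 GB/s memory transfer rate.
slow = scan_seconds(512, 40)
fast = scan_seconds(512, 60)
print(f"full scan of 512 GB: {fast:.1f}-{slow:.1f} s")
print(f"scan time after 36 months: x{scan_time_growth(36):.1f}")
```

Under these assumptions a full scan takes roughly 8.5 to 12.8 seconds, and the time for a full scan itself doubles every 36 months, which is the widening gap the slide refers to.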
Berkeley Data Analytics Stack (BDAS)
• Application: new apps (AMP-Genomics, Carat, …)
• Data Processing: in-memory processing; trade between time, quality, and cost
• Data Management (storage layer): efficient data sharing across frameworks
• Resource Management (infrastructure layer): share infrastructure across frameworks (multi-programming for datacenters)
Berkeley AMPLab
• “Launched” January 2011: 6 Year Plan
– 8 CS Faculty
– ~40 students
– 3 software engineers
• Organized for collaboration:
[Figure: Algorithms, Machines, People]
Berkeley AMPLab
• Funding:
– XData, CISE Expedition Grant
– Industrial, founding sponsors [logos not preserved]
– 18 other sponsors, including [logos not preserved]
Goal: Next Generation of Analytics Data Stack for Industry & Research:
• Berkeley Data Analytics Stack (BDAS)
• Release as Open Source
Berkeley Data Analytics Stack
(BDAS)
• Existing stack components…
[Figure: BDAS stack. Data Processing: HIVE, Pig, HBase, Storm, MPI, Hadoop, …; Data Mgmnt.: HDFS; Resource Mgmnt.: no component shown]
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
[Figure: BDAS stack. Data Processing: HIVE, Pig, HBase, Storm, MPI, Hadoop, …; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
Spark
• In-memory framework for interactive and iterative computations
– Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs
[Figure: BDAS stack. Data Processing: Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
Spark Streaming [Alpha Release]
• Large scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark → unifies batch, interactive, and streaming computations!
[Figure: BDAS stack. Data Processing: Spark Streaming, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
Shark → Spark SQL
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
– up to 100x faster for in-memory data, and 5-10x for disk
• In tests on a cluster of hundreds of nodes at [company logo not preserved]
[Figure: BDAS stack. Data Processing: Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
[Figure: BDAS stack. Data Processing: Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: Tachyon, HDFS; Resource Mgmnt.: Mesos]
BlinkDB
• Large scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype starting to be tested at Facebook
[Figure: BDAS stack. Data Processing: BlinkDB, Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: Tachyon, HDFS; Resource Mgmnt.: Mesos]
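To make the idea of a user-specified error bound concrete, here is a minimal sketch of answering an average query from a sample instead of the full dataset. It is not BlinkDB's algorithm: it assumes plain uniform sampling, a normal-approximation confidence interval, and a hypothetical `approx_mean` helper that doubles the sample until the bound is met. Time bounds are not modeled.

```python
import math
import random
import statistics


def approx_mean(values, max_error, z=1.96, start=100, seed=0):
    """Estimate mean(values) from a growing uniform sample.

    Stops once the half-width of the ~95% confidence interval
    (z * standard error) is <= max_error, or when the sample covers the
    whole dataset. Assumes at least two values. Each round draws a
    fresh sample.
    """
    rng = random.Random(seed)
    n = min(max(start, 2), len(values))
    while True:
        sample = rng.sample(values, n)
        mean = statistics.fmean(sample)
        half_width = z * statistics.stdev(sample) / math.sqrt(n)
        if half_width <= max_error or n == len(values):
            return mean, half_width, n
        n = min(2 * n, len(values))


data = [float(i % 1000) for i in range(1_000_000)]   # synthetic column
estimate, err, rows_read = approx_mean(data, max_error=5.0)
print(f"mean ~ {estimate:.1f} +/- {err:.1f} using {rows_read} of {len(data)} rows")
```

The looser the requested bound, the fewer rows are read, which is the accuracy versus response-time trade the earlier slides describe.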
SparkGraph
• GraphLab API and Toolkits on top of Spark
• Fault tolerance by leveraging Spark

[Figure: BDAS stack. Data Processing: Spark Graph, BlinkDB, Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: Tachyon, HDFS; Resource Mgmnt.: Mesos]
MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts

[Figure: BDAS stack. Data Processing: MLbase, Spark Graph, BlinkDB, Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: Tachyon, HDFS; Resource Mgmnt.: Mesos]
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
– GraphLab API (Spark Graph)
– Hive interface and shell (Shark)
– HDFS API (Tachyon)
– Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos
[Figure: BDAS stack. Data Processing: MLbase, Spark Graph, BlinkDB, Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: Tachyon, HDFS; Resource Mgmnt.: Mesos]
Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
– Accept inputs from Kafka, Flume, Twitter, TCP Sockets, …
– Support Hive API
– Support HDFS API, S3 API, and Hive metadata
[Figure: BDAS stack. Data Processing: MLbase, Spark Graph, BlinkDB, Spark Streaming, Shark, Spark, HIVE, Pig, Storm, MPI, Hadoop, …; Data Mgmnt.: Tachyon, HDFS; Resource Mgmnt.: Mesos]
Summary
• Support interactive and streaming computations
– In-memory, fault-tolerant storage abstraction, low-latency scheduling, ...
• Easy to combine batch, streaming, and interactive computations
– Spark execution engine supports all comp. models
• Easy to develop sophisticated algorithms
– Scala interface, APIs for Java, Python, Hive QL, …
– New frameworks targeted to graph based and ML algorithms
• Compatible with existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high quality software
– Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
[Figure: Spark at the center of Batch, Interactive, and Streaming]
Spark
In-Memory Cluster Computing for
Iterative and Interactive Applications

UC Berkeley
Background
• Commodity clusters have become an important computing
platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce
power many of these apps
• Theme of this work: provide similarly powerful abstractions
for a broader class of applications
Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage
e.g., MapReduce:
[Figure: Input → Map tasks → Reduce tasks → Output]
Motivation
• Acyclic data flow is a powerful abstraction, but is
not efficient for applications that repeatedly reuse a
working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to
efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to
support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability

Solution: augment data flow model with “resilient distributed datasets” (RDDs)
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across cluster that
can be rebuilt if a partition is lost
– Created by transforming data in stable storage using
data flow operators (map, filter, group-by, …)
– Can be cached across parallel operations
• Parallel operations on RDDs
– Reduce, collect, count, save, …
• Restricted shared variables
– Accumulators, broadcast variables
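The split between transformations, parallel operations, and caching can be illustrated without a cluster. The sketch below is a toy, single-machine model and not Spark's implementation; `MiniRDD`, `read_log`, and the sample log lines are all hypothetical names introduced for illustration.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a recipe for data, not the data itself."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning an iterator
        self._want_cache = False
        self._cached = None          # filled on first use if cache() was requested

    # Transformations: return a new MiniRDD and do no work.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._iterate()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        self._want_cache = True
        return self

    # Parallel operations (actions): run the recipe and return a value.
    def count(self):
        return sum(1 for _ in self._iterate())

    def collect(self):
        return list(self._iterate())

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._want_cache:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()


reads = []                           # one entry per pass over the "stable storage"

def read_log():
    reads.append(1)
    return iter(["ERROR\tdisk\tfoo failed", "INFO\tnet\tok", "ERROR\tnet\tbar timeout"])

messages = (MiniRDD(read_log)
            .filter(lambda line: line.startswith("ERROR"))
            .map(lambda line: line.split("\t")[2])
            .cache())

reads_before_any_action = len(reads)                       # transformations are lazy
foo_count = messages.filter(lambda m: "foo" in m).count()  # first action reads the log
bar_count = messages.filter(lambda m: "bar" in m).count()  # second action hits the cache
print(reads_before_any_action, foo_count, bar_count, len(reads))
```

Only the first action touches the source; later queries reuse the cached working set, which is the property the slides rely on for interactive use.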
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // Base RDD
errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                  // Cached RDD

cachedMsgs.filter(_.contains("foo")).count     // Parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the Driver sends tasks to three Workers and collects results; each Worker reads one input block (Block 1-3) and holds one cache partition (Cache 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection of records
– Need not be materialized, but rather contains information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse
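The two key-based partitioning schemes named above differ in what they preserve. A minimal sketch, assuming string keys and hypothetical helper names (`hash_partition`, `range_partition`); it illustrates the idea only and does not reproduce Spark's own partitioner classes:

```python
import bisect
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Spread keys across partitions; crc32 is used because it is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def range_partition(key: str, boundaries: list) -> int:
    """Keep keys ordered: partition i holds keys <= boundaries[i] (and above
    the previous boundary); the last partition holds everything beyond."""
    return bisect.bisect_left(boundaries, key)


records = [("apple", 3), ("kiwi", 1), ("mango", 7), ("zebra", 2), ("apple", 5)]
boundaries = ["h", "p"]          # three ranges: up to "h", up to "p", the rest

for key, value in records:
    print(key, hash_partition(key, 3), range_partition(key, boundaries))
```

Either way, records with the same key always land in the same partition; range partitioning additionally keeps neighbouring keys together, at the cost of needing sensible boundaries.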

RDD Operations
Transformations        Parallel operations (Actions)
(define a new RDD)     (return a result to driver)
map                    reduce
filter                 collect
sample                 count
union                  save
groupByKey             lookupKey
reduceByKey            …
join
cache
…
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to
reconstruct lost partitions
• e.g.:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

[Figure: lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
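Lineage-based recovery can be sketched in a few lines: each dataset remembers its parent and the function applied to it, so a lost partition is rebuilt by replaying that chain for that partition only. The `Lineage` class, the in-memory `storage` dictionary standing in for HDFS, and the sample rows are assumptions made for illustration.

```python
class Lineage:
    """One node in a lineage chain: a parent plus the function applied to it."""

    def __init__(self, name, parent=None, func=None, source=None):
        self.name, self.parent, self.func, self.source = name, parent, func, source

    def compute(self, partition_id):
        """Rebuild one partition by replaying the chain from stable storage."""
        if self.parent is None:
            return list(self.source[partition_id])     # read from "stable storage"
        return self.func(self.parent.compute(partition_id))

    def describe(self):
        chain = [] if self.parent is None else [self.parent.describe()]
        return " -> ".join(chain + [self.name])


# Hypothetical two-partition log file standing in for hdfs://...
storage = {
    0: ["a\tb\terror in disk", "a\tb\tall good"],
    1: ["c\td\terror in net"],
}

hdfs = Lineage("HdfsRDD", source=storage)
filtered = Lineage("FilteredRDD", hdfs, lambda rows: [r for r in rows if "error" in r])
mapped = Lineage("MappedRDD", filtered, lambda rows: [r.split("\t")[2] for r in rows])

cache = {p: mapped.compute(p) for p in storage}   # materialize both partitions
del cache[1]                                      # simulate losing one worker's partition
cache[1] = mapped.compute(1)                      # recompute only partition 1

print(mapped.describe())
print(cache)
```

No replication of the cached data is needed in this model; the cost of a failure is the recomputation of the lost partitions.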
Example 1: Logistic Regression
• Goal: find best line separating two sets of points
[Figure: two sets of points, a random initial line, and the target line]
Logistic Regression Code
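val data = spark.textFile(...).map(readPoint).cache()  // parse once, keep points in memory

var w = Vector.random(D)  // random initial weight vector of dimension D

for (i <- 1 to ITERATIONS) {
  // per-point gradient of the logistic loss, summed across the cluster
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)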
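The expression inside `map` is the gradient of the logistic loss for one point. As a worked derivation, assuming labels $y_i \in \{-1, +1\}$ and an implicit step size of 1 (neither is stated on the slide):

```latex
\begin{aligned}
L(w) &= \sum_{i} \log\!\left(1 + e^{-y_i \, (w \cdot x_i)}\right) \\
\nabla_w L(w) &= \sum_{i} \frac{-y_i \, x_i \, e^{-y_i (w \cdot x_i)}}{1 + e^{-y_i (w \cdot x_i)}}
  = \sum_{i} \left( \frac{1}{1 + e^{-y_i (w \cdot x_i)}} - 1 \right) y_i \, x_i \\
w &\leftarrow w - \nabla_w L(w)
\end{aligned}
```

Each term of the sum depends on a single point, so it can be computed independently per record and combined with an associative `reduce`.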
Logistic Regression Performance

• Hadoop: 127 s / iteration
• Spark: first iteration 174 s, further iterations 6 s
Example 2: MapReduce
• MapReduce data flow can be expressed using RDD
transformations

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map { case (key, vals) => myReduceFunc(key, vals) }

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map { case (key, v) => myReduceFunc(key, v) }
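The same data flow can be traced on plain Python collections. This is a single-machine sketch of the pattern, not Spark code: `flat_map`, `group_by_key`, and `reduce_by_key` are hypothetical helpers, and the word-count `my_map_func` and `my_reduce_func` are example user functions. The map step is modeled as a flat map because a MapReduce map function may emit several pairs per record.

```python
from collections import defaultdict


def flat_map(records, f):
    """Apply f to every record and concatenate the emitted (key, value) pairs."""
    return [pair for rec in records for pair in f(rec)]


def group_by_key(pairs):
    """Collect all values that share a key: [(k, v)] -> {k: [v, ...]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, combine):
    """Fold values per key as they arrive, the way a combiner would."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc


# Hypothetical user functions for a word count.
def my_map_func(line):
    return [(word, 1) for word in line.split()]


def my_reduce_func(key, vals):
    return (key, sum(vals))


data = ["to be or not", "to be"]
pairs = flat_map(data, my_map_func)

res = dict(my_reduce_func(k, vs) for k, vs in group_by_key(pairs).items())
res_combined = reduce_by_key(pairs, lambda a, b: a + b)
print(res)
print(res_combined)
```

Both variants give the same counts; the combiner form never builds the full per-key value lists, which is why it is usually preferred when the reduce function is associative.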
Example 3
Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM alg. for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating Least Squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)
Conclusion
• By making distributed datasets a first-class primitive,
Spark provides a simple, efficient programming model for
stateful data analytics

• RDDs provide:
– Lineage info for fault recovery and debugging
– Adjustable in-memory caching
– Locality-aware parallel operations

• We plan to make Spark the basis of a suite of batch and interactive data analysis tools
