© 2014 MapR Technologies 1
adawar@mapr.com
pat.mcdonough@databricks.com
© 2014 MapR Technologies 2
About MapR and Databricks
• Project leads for Spark,
formerly with UC Berkeley’s
AMPLab
• Founded in June 2013 and
backed by Andreessen
Horowitz
• Strong Engineering focus
* Forrester Wave Big Data Hadoop Solutions, Q1 2014
• Top Ranked distribution for
Hadoop*
• Hundreds of deployments
– 17 of Fortune 100
– Largest deployment in FSI
(1000+ nodes)
• Strong focus on making
Hadoop resilient and
enterprise grade
• Worldwide Presence
© 2014 MapR Technologies 3
Hadoop Evolves
Make it solid
• HA: eliminate SPOFs
• Data Protection: recover
from application/user
errors
• Disaster Recovery: data
center outages
• Enterprise Integration:
breaking the wall that
separates Hadoop from
the rest
• Security & Multi-
tenancy: sharing the
cluster and meeting
SLA’s, secure
authorization, data
governance
Make it do more
(easily)
• Interactive apps (e.g. SQL)
• Iterative programs
• Streaming apps
• Medium/Small Data
• Architecture: using
memory efficiently
• How many different tools
should it take?
– It’s hard to get
interoperability amongst
different data-parallel models
right
– Learning curves and
operational costs increase
with each new tool
© 2014 MapR Technologies 4
MapR – Top ranked Hadoop distribution
Management
MapR Data Platform
APACHE HADOOP AND OSS ECOSYSTEM
Security
YARN
Pig
Cascading
Batch
Storm*
Streaming
HBase
Solr
NoSQL &
Search
Juju
Provisioning /
coordination
Savannah*
Mahout
ML, Graph
MapReduce
v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS
Workflow
& Data
Governance
Tez*
Accumulo*
Hive
Impala
Drill*
SQL
Sentry* Oozie ZooKeeper Sqoop
Knox* Whirr Falcon* Flume
Data Integration
& Access
HttpFS
Hue
* Certification/support planned for 2014
Enterprise-grade Security Operational Performance Multi-tenancy Interoperability
• High availability
• Data protection
• Disaster recovery
• Standard file
access
• Standard
database access
• Pluggable
services
• Broad developer
support
• Enterprise
security
authorization
• Wire-level
authentication
• Data governance
• Ability to support
predictive
analytics, real-
time database
operations, and
support high
arrival rate data
• Ability to logically
divide a cluster to
support different
use cases, job
types, user
groups, and
administrators
• 2X to 7X higher
performance
• Consistent, low
latency
* Forrester Wave Big Data Hadoop Solutions, Q1 2014
© 2014 MapR Technologies 5
MapR – The Only Distribution to Integrate
the Complete Apache Spark Stack
Management
MapR Data Platform
APACHE HADOOP AND OSS ECOSYSTEM
Security
YARN
Pig
Cascading
Batch
Storm*
Streaming
HBase
Solr
NoSQL &
Search
Juju
Provisioning
&
coordination
Savannah*
Mahout
ML, Graph
MapReduce
v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS
Workflow
& Data
Governance
Tez*
Accumulo*
Hive
Impala
Drill*
SQL
Sentry* Oozie ZooKeeper Sqoop
Knox* Whirr Falcon* Flume
Data Integration
& Access
HttpFS
Hue
* Certification/support planned for 2014
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Spark
Spark
Streaming
MLlib
GraphX Shark
© 2014 MapR Technologies 6
Spark on MapR
World-record performance on
disk coupled with in-memory
processing advantages
High Performance
Industry-leading enterprise-grade
High Availability, Data Protection
and Disaster Recovery
Enterprise-grade dependability for
Spark
Strategic partnership with
Databricks to ensure enterprise
support for the entire stack
24/7 Best-in-class Global Support
Spark stack can also be deployed
natively as an independent
standalone service on the MapR
cluster
Can Run Natively on MapR
Apache Spark
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009
in UC Berkeley’s AMP Lab
• Fully open sourced in 2010
• Top-level Apache Project as of
2014
The Spark Community
Spark is The Most Active Open Source
Project in Big Data
[Chart: project contributors in the past year – Spark leads Giraph, Storm, and Tez by a wide margin]
Spark: Easy and Fast Big Data
Easy to Develop
> Rich APIs in Java, Scala, Python
> Interactive shell
Fast to Run
> General execution
graphs
> In-memory storage
Spark: Easy and Fast Big Data
Easy to Develop
> Rich APIs in Java, Scala, Python
> Interactive shell
Fast to Run
> General execution
graphs
> In-memory storage
2-5× less code Up to 10× faster on disk,
100× in memory
Easy: Get Started Immediately
• Multi-language support
• Interactive Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Easy: Get Started Immediately
• Multi-language support
• Interactive Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Java 8 (Coming Soon)
JavaRDD<String> lines = sc.textFile(...);
lines.filter(x -> x.contains("ERROR")).count();
Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread
across a cluster, stored in RAM
or on Disk
• Built through parallel
transformations
• Automatically rebuilt on failure
Operations
• Transformations
(e.g.
map, filter, groupBy)
• Actions
(e.g.
count, collect, save)
Write programs in terms of transformations on
distributed datasets
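A minimal sketch of the transformation/action split, assuming an existing SparkContext named sc (for example, the one the interactive shell provides):

val nums = sc.parallelize(1 to 1000000)              // distributed collection built from a local range
val evens = nums.filter(_ % 2 == 0)                  // transformation: lazy, only records lineage
val squares = evens.map(n => n.toLong * n).cache()   // transformation plus a hint to keep partitions in RAM
val total = squares.count()                          // action: triggers the actual computation
val firstFew = squares.take(5)                       // action: returns a small result to the driver

Nothing runs until an action is called; until then Spark only builds up the lineage of transformations.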
Easy: Expressive API
map reduce
Easy: Expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save ...
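A short Scala sketch combining a few of the operators above on key/value RDDs; the SparkContext sc and the sample data are assumptions for illustration:

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"), ("about.html", "About")))

val counts = visits.mapValues(_ => 1).reduceByKey(_ + _)   // hits per page
val joined = counts.join(pageNames)                        // (url, (hits, title))
joined.collect().foreach(println)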
Easy: Example – Word Count
Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Easy: Works Well With Hadoop
Data Compatibility
• Access your existing
Hadoop Data
• Use the same data
formats
• Adheres to data locality
for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on existing Hadoop cluster or side-by-side
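A hedged sketch of reading data that already lives in Hadoop; the paths are placeholders and sc is an assumed SparkContext:

val logs  = sc.textFile("hdfs:///data/logs/2014/*")             // plain or compressed text files
val pairs = sc.sequenceFile[String, Int]("hdfs:///data/counts") // SequenceFiles via Writable converters
logs.filter(_.contains("ERROR")).saveAsTextFile("hdfs:///data/errors")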
Easy: User-Driven Roadmap
Language support
> Improved Python
support
> SparkR
> Java 8
> Integrated Schema
and SQL support in
Spark’s APIs
Better ML
> Sparse Data Support
> Model Evaluation
Framework
> Performance Testing
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient
print "Final w: %s" % w
Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1–30) – Hadoop takes ~110 s per iteration; Spark takes ~80 s for the first iteration and ~1 s for each further iteration]
Fast: Using RAM, Operator Graphs
In-memory Caching
• Data Partitions read from
RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
[Diagram: RDD operator graph – stages 1–3 built from map, join, filter, and groupBy, with cached partitions marked]
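Caching can also be requested explicitly with the persist API; a sketch assuming an existing SparkContext sc and a placeholder HDFS path:

import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("hdfs:///data/events")
  .map(_.split("\t"))
  .persist(StorageLevel.MEMORY_AND_DISK)   // keep partitions in RAM, spill to disk if needed

parsed.count()   // first action materializes and caches the partitions
parsed.count()   // later actions read the cached partitions instead of re-reading HDFS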
Fast: Scaling Down
[Chart: execution time (s) vs. % of working set in cache – 69 s with caching disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, 12 s fully cached]
Easy: Fault Recovery
RDDs track lineage information that can be used to
efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Diagram: lineage – HDFS file → filter(func = startswith("ERROR")) → Filtered RDD → map(func = split("\t")) → Mapped RDD]
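The lineage an RDD would use for recovery can be inspected directly; a sketch assuming textFile is an existing RDD[String] like the one above:

val msgs = textFile.filter(_.startsWith("ERROR")).map(_.split("\t")(2))
println(msgs.toDebugString)   // prints the chain of parent RDDs Spark would recompute from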
Easy: Unified Platform
Spark SQL
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Continued innovation bringing new functionality, e.g.:
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
• Tachyon (off-heap RDD caching)
Spark SQL
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Hive Compatibility
• Interfaces to access data and code in the Hive
ecosystem:
o Support for writing queries in HQL
o Catalog that interfaces with the
Hive Metastore
o Tablescan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, UDTFs
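A sketch of the 2014-era HiveContext API; the table and data are assumptions, and later releases renamed hql to sql:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val top = hiveContext.hql("SELECT key, value FROM src ORDER BY key LIMIT 10")
top.collect().foreach(println)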
Parquet Support
Native support for reading data stored in
Parquet:
• Columnar storage avoids reading
unneeded data.
• Currently only supports flat structures
(nested data on short-term roadmap).
• RDDs can be written to Parquet
files, preserving the schema.
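A sketch using the Spark SQL API of that era; people is an assumed SchemaRDD and the paths are placeholders:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
people.saveAsParquetFile("hdfs:///warehouse/people.parquet")        // schema is preserved in the file
val parquetPeople = sqlContext.parquetFile("hdfs:///warehouse/people.parquet")
parquetPeople.registerAsTable("people")                             // query it back with SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")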
Mixing SQL and Machine Learning
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results can easily be used in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)
Relationship to Shark
Borrows
• Hive data loading code / in-
memory columnar
representation
• hardened Spark execution
engine
Adds
• RDD-aware optimizer /
query planner
• execution engine
• language interfaces
Catalyst/Spark SQL is a nearly-from-scratch
rewrite that leverages the best parts of Shark
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Spark Streaming
Run a streaming computation as a series of very
small, deterministic batch jobs
Spark
Spark
Streaming
batches of X
seconds
live data stream
processed
results
• Chop up the live stream into batches of
½ second or more, leverage RDDs for
micro-batch processing
• Use the same familiar Spark APIs to
process streams
• Combine your batch and online
processing in a single system
• Guarantee exactly-once semantics
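A minimal sketch of this model with a socket source; sc is an assumed SparkContext, and the host and port are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))        // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // live data stream as a DStream of lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                        // emit processed results for each batch

ssc.start()
ssc.awaitTermination()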
DStream of data
Window-based Transformations
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: sliding window operation over a DStream – window length Minutes(1), sliding interval Seconds(5)]
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
MLlib – Machine Learning library
MLlib
Classification: Logistic Regression, Linear SVM (+L1, L2), Decision Trees, Naive Bayes
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares
Clustering / Exploration: K-Means, SVD
Optimization Primitives: SGD, Parallel Gradient
Interoperability: Scala, Java, PySpark (0.9)
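A sketch of calling one of these algorithms, here k-means, with the Spark 1.x MLlib API; the input path is a placeholder:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///data/points")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
println("Within-set sum of squared errors: " + model.computeCost(points))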
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Enabling users to easily and efficiently
express the entire graph analytics
pipeline
New API
Blurs the distinction
between Tables and
Graphs
New System
Combines Data-Parallel and
Graph-Parallel Systems
The GraphX Unified Approach
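A small GraphX sketch, building a property graph from ordinary RDDs and running PageRank; the data is made up for illustration, and GraphX was still an alpha component when this deck was written:

import org.apache.spark.graphx.{Edge, Graph}

val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(users, follows)             // tables (RDDs) become a graph
val ranks = graph.pageRank(0.001).vertices    // (vertexId, rank) pairs
ranks.join(users).collect().foreach(println)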
Easy: Unified Platform
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Continued innovation bringing new functionality, e.g.:
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
• Tachyon (off-heap RDD caching)
Use Cases
Interactive Exploratory Analytics
• Leverage Spark’s in-memory caching and efficient
execution to explore large distributed datasets
• Use Spark’s APIs to explore any kind of data
(structured, unstructured, semi-structured, etc.) and
combine programming models
• Execute arbitrary code using a fully-functional interactive
programming environment
• Connect external tools via SQL Drivers
Machine Learning
• Improve performance of iterative algorithms by caching
frequently accessed datasets
• Develop programs that are easy to reason using a fully-
capable functional programming style
• Refine algorithms using the interactive REPL
• Use carefully-curated algorithms out-of-the-box with
MLlib
Power Real-time Dashboards
• Use Spark Streaming to perform low-latency window-
based aggregations
• Combine offline models with streaming data for online
clustering and classification within the dashboard
• Use Spark’s core APIs and/or Spark SQL to give users
large-scale, low-latency drill-down capabilities in
exploring dashboard data
Faster ETL
• Leverage Spark’s optimized scheduling for more efficient
I/O on large datasets, and in-memory processing for
aggregations, shuffles, and more
• Use Spark SQL to perform ETL using a familiar SQL
interface
• Easily port Pig scripts to Spark’s API
• Run existing Hive queries directly on Spark SQL or
Shark
Spark Summit 2014 – San Francisco
June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
© 2014 MapR Technologies 47
Q&A
@mapr maprtech
adawar@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

Editor's Notes

  • #4–#5 The power of MapR begins with the power of open source innovation and community participation. In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop). In other areas, MapR contributes to and integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management. MapR releases a new version with the latest OSS innovations on a monthly basis. We add 2-4 new Apache projects annually as new projects become production ready and based on customer demand.
  • #9 You can find Project Resources on the Apache Incubator site. You’ll also find information about the mailing list there (including archives).
  • #10 One of the most exciting things you’ll find. Growing all the time. NASCAR slide: including several sponsors of this event who are just starting to get involved… If your logo is not up here, forgive us – it’s hard to keep up!
  • #23 Key idea: add “variables” to the “functions” in functional programming
  • #24 This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • #26 Gracefully