Spark Overview
Oleg Mürk, Founding Engineer
October 2016
Big Data: Volume & Velocity

Volume (Batch Computations)
10s of terabytes up to petabytes
Scales to 100s of computing nodes
Examples: Hadoop, Spark, Flink, Google Dataflow, H2O, Dato/Turi, etc

Velocity (Streaming Computations)
100K+ msg/sec/node
Latency: 100 ms - 1 sec
Scales to 100s of computing nodes
Examples: Storm, Spark Streaming, Flink, Google Dataflow, Kafka Streams, etc
Hadoop Stack
Key-value store: HBase
Computation: MapReduce, Spark, Flink
Filesystem: HDFS
Scheduler: YARN
Coordination: Zookeeper
SMACK Stack
Computation: Spark
Scheduler: Mesos
Reactive: Akka
Key-value store: Cassandra
Event streaming: Kafka
PANCAKE STACK
Hadoop MapReduce: Word Count
Input (three lines):
Deer Bear River
Car Car River
Deer Car Bear

Splitting: each input line becomes one split:
Deer Bear River | Car Car River | Deer Car Bear

Mapping: each split emits (word, 1) pairs:
(Deer, 1) (Bear, 1) (River, 1) | (Car, 1) (Car, 1) (River, 1) | (Deer, 1) (Car, 1) (Bear, 1)

Shuffling: pairs are grouped by key across nodes:
Bear: [1, 1] | Car: [1, 1, 1] | Deer: [1, 1] | River: [1, 1]

Reducing: each group is summed:
(Bear, 2) (Car, 3) (Deer, 2) (River, 2)

Final result:
Bear, 2
Car, 3
Deer, 2
River, 2
Hadoop/MapReduce Problems
Writing MapReduce jobs is very verbose
High overhead when starting jobs (seconds)
Many real-world computations are iterative
Each phase writes output to disk (HDFS)
Intermediate results are not cached in RAM
Hadoop is written in unoptimized Java :)
Spark 1.x Improvements over Hadoop
Productivity: up to 10x fewer lines of code
Similar API for batch and stream processing
Functional programming (Scala, Python)
… but also works well with plain Java
Computations are defined as transformations of datasets (RDDs)
Can develop from REPL (read-eval-print loop)
Performance: up to 10x faster
In-memory computations when possible
Optimized computation graph (DAG)
Word Count in Spark and PySpark
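The code on this slide was an image in the original deck. Here is a minimal sketch of the canonical Spark word count in Scala (the HDFS paths are hypothetical; the PySpark version is line-for-line analogous):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        val counts = sc.textFile("hdfs:///input.txt")  // hypothetical input path
          .flatMap(_.split(" "))                       // lines -> words
          .map((_, 1))                                 // word -> (word, 1)
          .reduceByKey(_ + _)                          // sum counts per word

        counts.saveAsTextFile("hdfs:///counts")        // action: triggers execution
        sc.stop()
      }
    }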
Word Count in MapReduce
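The slide contrasted this with the classic Hadoop word count, also shown as an image. For comparison, here is a sketch of the standard Hadoop MapReduce word count, written against the Hadoop Java API from Scala (to keep one language throughout); note how much boilerplate the mapper, reducer, and driver require:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Mapper: emit (word, 1) for every word in the input line.
    class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: Object, value: Text,
                       ctx: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").foreach { w => word.set(w); ctx.write(word, one) }
    }

    // Reducer: sum the 1s collected for each word.
    class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    // Driver: configure and submit the job.
    object MapReduceWordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenizerMapper])
        job.setMapperClass(classOf[TokenizerMapper])
        job.setCombinerClass(classOf[IntSumReducer])
        job.setReducerClass(classOf[IntSumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }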
Spark 1.x Components
Libraries: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph processing)
Engine: Spark Core
Cluster managers: Standalone Scheduler, YARN, Mesos
Resilient Distributed Datasets (RDD)
RDDs are read-only collections of values
Transformations: map, filter, flatMap, join, etc
Can be re-executed and should not have side-effects.
Actions: reduce, collect, save, foreach
Can have side-effects and are executed exactly once.
Transformations are lazy
Computations are triggered by actions
Word Count in Spark (again)
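The slide repeated the word count to illustrate the lazy-evaluation point above. An annotated sketch, assuming the SparkContext sc from the earlier example:

    val lines  = sc.textFile("hdfs:///input.txt")  // transformation: nothing runs yet
    val words  = lines.flatMap(_.split(" "))       // transformation: lazy
    val pairs  = words.map((_, 1))                 // transformation: lazy
    val counts = pairs.reduceByKey(_ + _)          // transformation: lazy

    counts.collect()                               // action: triggers the whole computation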
RDD Partitions
RDDs are partitioned, potentially across nodes
Partitioning scheme can be customized
Partitions can be recomputed from input partitions
… this is called lineage
Partition dependencies form a DAG
… DAG = Directed Acyclic Graph
Partition dependencies can be narrow or wide
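A quick way to see lineage and dependencies in practice is RDD.toDebugString, which prints the lineage DAG. A sketch, assuming the sc from earlier (map is a narrow dependency; reduceByKey is a wide one):

    val pairs  = sc.parallelize(Seq("a", "b", "a")).map((_, 1))  // narrow dependency
    val counts = pairs.reduceByKey(_ + _)                        // wide dependency (shuffle)

    // Prints the lineage; indentation steps mark shuffle (stage) boundaries.
    println(counts.toDebugString)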
Computation Graph (DAG)
[Diagram: RDD A and RDD B are map()ed into RDD C and RDD D, which are join()ed into RDD E. Legend: each box is a Resilient Distributed Dataset (RDD), subdivided into RDD partitions, with dependency arrows between parent and child partitions.]
RDD DAG Execution
DAG is compiled into Execution Plan
Partitions are kept in-memory and spilled to disk
Lost / slow partitions can be recomputed
Unneeded partitions are freed
Unless explicitly marked as cached/persistent
… either in RAM or on disk
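A sketch of explicit caching (the log path and parsing step are hypothetical):

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs:///events.log").map(_.split("\t"))

    // Keep partitions in RAM, spilling to disk when memory runs out:
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    parsed.count()  // first action computes and materializes the cache
    parsed.count()  // later actions reuse the cached partitions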
Execution Plan
[Diagram: the DAG above compiled into stages. STAGE 1 and STAGE 2 each pipeline the map() transformations; STAGE 3 performs the join() into RDD E. Each stage runs as one task per partition, with a shuffle at the stage boundary.]
Spark Streaming
Processing event streams from:
Kafka, ZeroMQ, TCP, and various other sources
Can output results to:
Kafka, HDFS, KV-Stores, Databases, Dashboards and various other sinks
Developer productivity
Syntax very similar to Spark Core
Reliability:
Can recover from crashes via RDD snapshots and lineage
Throughput:
100K+ events/sec per CPU core, scales to 100s of nodes
Latency:
~500 ms micro-batches
Word Count in Spark Streaming
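The slide's code was an image; here is a minimal sketch of the streaming word count over a TCP socket source (host and port are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))     // 1-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)  // hypothetical TCP source
    val pairs  = lines.flatMap(_.split(" ")).map((_, 1))
    val counts = pairs.reduceByKey(_ + _)                 // per-batch word counts

    counts.print()  // sink; could equally be Kafka, HDFS, a database, a dashboard
    ssc.start()
    ssc.awaitTermination()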
Discretized Streams (DStreams)
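The original diagram is not preserved. Conceptually, a DStream is a sequence of RDDs, one per micro-batch interval, and every DStream transformation is applied to each batch's RDD; windowed operations combine several batch RDDs. A sketch, continuing the streaming word count above:

    import org.apache.spark.streaming.Seconds

    // Counts over a sliding 30-second window, recomputed every 10 seconds.
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()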
Stateful Spark Streaming: updateStateByKey
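The slide's example was an image; here is a sketch of a stateful running count per word, continuing from the streaming sketch above (the checkpoint path is hypothetical; updateStateByKey requires one for state recovery):

    ssc.checkpoint("hdfs:///checkpoints")  // hypothetical path; needed for state snapshots

    // Merge this batch's counts for a key into the accumulated state.
    def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
      Some(newValues.sum + state.getOrElse(0))

    val runningCounts = pairs.updateStateByKey[Int](updateCount _)
    runningCounts.print()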
Spark SQL
Analytical SQL queries
Unified Data Source API
File formats: CSV, JSON
Columnar files: Parquet, RCFile, ORC
Key-value stores: HBase, Cassandra
Databases: JDBC
Can do predicate pushdown to data sources
… i.e., SQL WHERE clauses can be passed down to data sources
Spark SQL Example
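The example on the slide was an image; a minimal Spark 1.x sketch follows (the JSON file, table, and column names are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val events = sqlContext.read.json("hdfs:///events.json")  // hypothetical dataset

    events.registerTempTable("events")                        // Spark 1.x API
    val top = sqlContext.sql(
      """SELECT user, COUNT(*) AS cnt
         FROM events
         GROUP BY user
         ORDER BY cnt DESC
         LIMIT 10""")
    top.show()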
DataFrames API
DataFrames API is a typed DSL equivalent to Spark SQL
Query result is a DataFrame
Intermediate result is a DataFrame
DataFrames can be cached in columnar compressed format
External table can be cached as a DataFrame
DataFrame API Example
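A sketch of the same query in the DataFrame DSL, reusing the hypothetical events DataFrame from the SQL sketch above:

    import org.apache.spark.sql.functions.desc

    val top = events.groupBy("user")
      .count()                 // adds a "count" column
      .orderBy(desc("count"))
      .limit(10)

    top.cache()                // cached in compressed columnar format
    top.show()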
Columnar Table Representation
Split table rows into chunks
… can skip chunks based on precomputed column min/max values
Store each table column separately
… can read only needed columns
Columns compress better than rows
Better CPU cache utilization
Better CPU vectorization
Off-heap Java storage
Examples: Parquet, RCFile, ORC
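A sketch of how this plays out with Parquet, reusing the hypothetical events DataFrame and sqlContext from above (the "ts" column is an assumption):

    import org.apache.spark.sql.functions.col

    events.write.parquet("hdfs:///events.parquet")

    val recent = sqlContext.read.parquet("hdfs:///events.parquet")
      .select("user", "ts")              // column pruning: other columns are never read
      .filter(col("ts") > "2016-01-01")  // chunks whose min/max stats fail the
                                         // predicate are skipped entirely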
Analytical Databases (OLAP)
Also called data warehouses
Support large batch inserts & deletes
Support long-running analytical queries
Often organize tables in columns
Examples: Redshift, Vertica, Greenplum, Hive
Hive on MapReduce vs Spark
Data warehousing system (OLAP)
Used to manage petabytes of event logs
Table data stored in partitions on HDFS
Optimized Row Columnar Format (ORC)
Table metadata stored in HCatalog
HiveQL originally executed using MapReduce
Now 10x faster on Spark
Spark ML(Lib)
Algorithms: classification, regression, clustering, collab. filtering
Featurization: feature extraction, transformation, dim. reduction
Pipelines: constructing, evaluating, and tuning ML Pipelines
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
Spark ML: based on DataFrames
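A minimal spark.ml pipeline sketch (training is a hypothetical DataFrame with "text" and "label" columns; the save path is a placeholder):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model    = pipeline.fit(training)          // fits featurizers + classifier

    model.write.overwrite().save("/tmp/spark-lr")  // persistence of the fitted pipeline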
Spark GraphX & GraphFrames
Vertex & Edge RDDs
Edge RDD partitioning is essential
… in power-law graphs with highly skewed edge distributions
Common algorithms
PageRank, Connected components, Label propagation, SVD++,
Strongly connected components, Triangle count
GraphX exposes a variant of the Google Pregel API
GraphFrames: based on DataFrames
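A GraphX sketch running PageRank over an edge-list file (the file path is hypothetical; each line holds a source and destination vertex id):

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///followers.txt")
    val ranks = graph.pageRank(tol = 0.0001).vertices  // RDD of (vertexId, rank)

    ranks.sortBy(_._2, ascending = false).take(5).foreach(println)  // top 5 by rank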
Spark 2.0 Components
Libraries: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph processing)
API layer: DataFrame / Dataset, with Incrementalization
Query planner: Catalyst
Execution: Tungsten (binary representation & code generation)
Engine: Spark Core (RDDs, etc)
Cluster managers: Standalone Scheduler, YARN, Mesos
Catalyst + Tungsten Pipeline
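The pipeline diagram is not preserved. In outline: Catalyst parses a DataFrame/SQL query into a logical plan, analyzes and optimizes it, and selects a physical plan; Tungsten then generates code operating on a compact binary row format. The phases can be inspected on any DataFrame (reusing the hypothetical events from above):

    // Prints parsed, analyzed, and optimized logical plans plus the physical plan.
    events.groupBy("user").count().explain(true)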
Incrementalization: Structured Streaming
Identical API for Batch and Streaming
Auto-incrementalizing DataFrame / SQL queries
Prefix consistency on input stream
Transactional consistency on output sinks
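A minimal Spark 2.0 Structured Streaming sketch of the same word count, expressed as a continuously updated result table (host and port are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")  // hypothetical TCP source
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")             // "value" is the default text column name
      .count()

    counts.writeStream
      .outputMode("complete")       // re-emit the full result table every trigger
      .format("console")
      .start()
      .awaitTermination()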