Apache Spark 
Buenos Aires High Scalability 
Buenos Aires, Argentina, Dec 2014 
Fernando Rodriguez Olivera 
@frodriguez
Fernando Rodriguez Olivera 
Professor at Universidad Austral (Distributed Systems, Compiler 
Design, Operating Systems, …) 
Creator of mvnrepository.com 
Organizer at Buenos Aires High Scalability Group, Professor at 
nosqlessentials.com 
Twitter: @frodriguez
Apache Spark 
Apache Spark is a Fast and General Engine 
for Large-Scale data processing 
In-Memory computing primitives 
Support for Batch, Interactive, Iterative and 
Stream processing with a Unified API
Apache Spark 
Unified API for multiple kinds of processing 
Batch (high throughput) 
Interactive (low latency) 
Stream (continuous processing) 
Iterative (results used immediately)
Daytona Gray Sort 100TB Benchmark 
                     Data Size   Time     Nodes   Cores 
Hadoop MR (2013)     102.5 TB    72 min   2,100   50,400 (physical) 
Apache Spark (2014)  100 TB      23 min   206     6,592 (virtualized) 
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Daytona Gray Sort 100TB Benchmark 
                     Data Size   Time     Nodes   Cores 
Hadoop MR (2013)     102.5 TB    72 min   2,100   50,400 (physical) 
Apache Spark (2014)  100 TB      23 min   206     6,592 (virtualized) 
3X faster using 10X fewer machines 
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Hadoop vs Spark for Iterative Processing 
Logistic regression in Hadoop and Spark 
source: https://spark.apache.org/
Hadoop MR Limits 
(Diagram: a chain of Jobs communicating through Hadoop HDFS) 
MapReduce was designed for Batch Processing: 
- Communication between jobs goes through the FS 
- Fault-Tolerance (between jobs) by Persistence to the FS 
- Memory is not managed (relies on OS caches) 
These limits are compensated for with: Storm, Samza, Giraph, Impala, Presto, etc.
Apache Spark 
Apache Spark (Core), with Spark SQL, Spark Streaming, MLlib and GraphX on top 
Powered by Scala and Akka 
APIs for Java, Scala, Python
Resilient Distributed Datasets (RDD) 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
Immutable Collection of Objects
Resilient Distributed Datasets (RDD) 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
Immutable Collection of Objects 
Partitioned and Distributed
Resilient Distributed Datasets (RDD) 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
Immutable Collection of Objects 
Partitioned and Distributed 
Stored in Memory
Resilient Distributed Datasets (RDD) 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
Immutable Collection of Objects 
Partitioned and Distributed 
Stored in Memory 
Partitions Recomputed on Failure
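A minimal sketch of these properties (assuming an existing SparkContext named sc; the sample data is illustrative): 
// create an RDD of strings split into 4 partitions 
val rdd = sc.parallelize(Seq("Hello World", "A New Line", "hello", "The End"), 4) 
rdd.cache()                    // keep the partitions in memory after the first computation 
println(rdd.partitions.length) // 4 - a lost partition is recomputed from its lineage on failure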
RDD Transformations and Actions 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
...
RDD Transformations and Actions 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
Compute 
Function 
(transformation) 
e.g: apply 
function 
to count 
chars
RDD Transformations and Actions 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
RDD of Ints 
11 
... 
... 
10 
... 
5 
... 
7 
... 
Compute 
Function 
(transformation) 
e.g: apply 
function 
to count 
chars
RDD Transformations and Actions 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
RDD of Ints 
11 
... 
... 
10 
... 
5 
... 
7 
... 
depends on 
Compute 
Function 
(transformation) 
e.g: apply 
function 
to count 
chars
RDD Transformations and Actions 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
RDD of Ints 
11 
... 
... 
10 
... 
5 
... 
7 
... 
depends on 
Compute 
Function 
(transformation) 
e.g: apply 
function 
to count 
chars 
Int 
N 
Action
RDD Transformations and Actions 
RDD of Strings 
Hello World 
... 
... 
A New Line 
... 
... 
hello 
The End 
... 
RDD of Ints 
11 
... 
... 
10 
... 
5 
... 
7 
... 
Compute 
Function 
(transformation) 
e.g: apply 
function 
to count 
chars 
RDD Implementation 
Partitions 
Compute Function 
Dependencies 
Preferred Compute 
Location 
(for each partition) 
Partitioner 
depends on 
Int 
N 
Action
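The diagram above corresponds to a transformation followed by an action; a minimal sketch (assuming an RDD[String] named lines): 
val lengths = lines.map(_.length)   // transformation: RDD[String] -> RDD[Int], evaluated lazily 
val total = lengths.reduce(_ + _)   // action: triggers the job and returns a single Int to the driver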
Spark API 
Scala 
val spark = new SparkContext() 
val lines = spark.textFile("hdfs://docs/") // RDD[String] 
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String] 
val count = nonEmpty.count 
Java 8 
JavaSparkContext spark = new JavaSparkContext(); 
JavaRDD<String> lines = spark.textFile("hdfs://docs/"); 
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0); 
long count = nonEmpty.count(); 
Python 
spark = SparkContext() 
lines = spark.textFile("hdfs://docs/") 
nonEmpty = lines.filter(lambda line: len(line) > 0) 
count = nonEmpty.count()
RDD Operations 
Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(func), mapValues(func), … 
Actions: take(N), count(), collect(), reduce(func), takeOrdered(N), top(N), …
Text Processing Example 
Top Words by Frequency 
(Step by step)
Create RDD from External Data 
Apache Spark reads through the Hadoop FileSystem, I/O Formats and Codecs 
Data sources: HDFS, S3, HBase, MongoDB, Cassandra, ElasticSearch, … 
Spark can read/write from any data source supported by Hadoop 
I/O via Hadoop is optional (e.g: the Cassandra connector bypasses Hadoop) 
// Step 1 - Create RDD from Hadoop Text File 
val docs = spark.textFile("/docs/")
Function map 
RDD[String] RDD[String] 
Hello World 
A New Line 
hello 
... 
The end 
.map(line => line.toLowerCase) 
hello world 
a new line 
hello 
... 
the end 
= 
.map(_.toLowerCase) 
// Step 2 - Convert lines to lower case 
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap 
RDD[String] 
hello world 
a new line 
hello 
... 
the end
Functions map and flatMap 
RDD[String] 
hello world 
a new line 
hello 
... 
the end 
.map( … ) 
RDD[Array[String]] 
_.split("\\s+") 
hello 
a 
hello 
... 
the 
world 
new line 
end
Functions map and flatMap 
RDD[String] 
hello world 
a new line 
hello 
... 
the end 
.map( … ) 
RDD[Array[String]] 
_.split("\\s+") 
hello 
a 
hello 
... 
the 
world 
new line 
end 
.flatten 
RDD[String] 
hello 
world 
a 
new 
line 
... 
*
Functions map and flatMap 
hello world 
a new line 
hello 
... 
the end 
RDD[Array[String]] 
hello 
.flatMap(line => line.split("\\s+")) 
RDD[String] 
.map( … ) 
_.split("\\s+") 
a 
hello 
... 
the 
world 
new line 
end 
.flatten 
RDD[String] 
hello 
world 
a 
new 
line 
... 
*
Functions map and flatMap 
RDD[String] 
hello world 
a new line 
hello 
... 
the end 
.map( … ) 
RDD[Array[String]] 
_.split("\\s+") 
hello 
a 
world 
new line 
hello 
... 
the 
end 
.flatten 
.flatMap(line => line.split("\\s+")) 
RDD[String] 
world 
// Step 3 - Split lines into words 
val words = lower.flatMap(line => line.split("\\s+")) 
Note: flatten is not available in Spark; use flatMap instead 
hello 
a 
new 
line 
... 
*
Key-Value Pairs 
RDD[Tuple2[String, Int]] 
RDD[String] RDD[(String, Int)] 
hello 
world 
a 
new 
line 
hello 
... 
hello 
world 
a 
new 
line 
hello 
... 
.map(word => Tuple2(word, 1)) 
1 
1 
1 
1 
1 
1 
= 
.map(word => (word, 1)) 
// Step 4 - Map each word to a (word, 1) pair 
val counts = words.map(word => (word, 1)) 
Pair RDD
Shuffling 
RDD[(String, Int)] 
hello 
world 
a 
new 
line 
hello 
1 
1 
1 
1 
1 
1
Shuffling 
hello 
world 
a 
new 
line 
hello 
1 
1 
1 
1 
1 
1 
RDD[(String, Iterable[Int])] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
1 
.groupByKey 
1 
RDD[(String, Int)]
Shuffling 
hello 
world 
a 
new 
line 
hello 
1 
1 
1 
1 
1 
1 
RDD[(String, Iterable[Int])] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
1 
.groupByKey 
1 
RDD[(String, Int)] 
RDD[(String, Int)] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
2 
.mapValues 
_.reduce(…) 
(a,b) => a+b
Shuffling 
hello 
world 
a 
new 
line 
hello 
1 
1 
1 
1 
1 
1 
RDD[(String, Iterable[Int])] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
1 
.groupByKey 
1 
.reduceByKey((a, b) => a + b) 
RDD[(String, Int)] 
RDD[(String, Int)] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
2 
.mapValues 
_.reduce(…) 
(a,b) => a+b
Shuffling 
RDD[(String, Int)] 
hello 
world 
a 
new 
line 
hello 
1 
1 
1 
1 
1 
1 
RDD[(String, Iterable[Int])] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
1 
.groupByKey 
1 
RDD[(String, Int)] 
.reduceByKey((a, b) => a + b) 
// Step 5 - Count all words 
val freq = counts.reduceByKey(_ + _) 
world 
a 
1 
1 
new 1 
line 
hello 
1 
2 
.mapValues 
_.reduce(…) 
(a,b) => a+b
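Both paths above give the same result; a minimal sketch of the difference (reduceByKey pre-combines values on each partition before the shuffle, which is why the deck's Step 5 uses it): 
// groupByKey + mapValues: every (word, 1) pair is shuffled, then values are reduced 
val freqViaGroup = counts.groupByKey().mapValues(_.reduce(_ + _)) 
// reduceByKey: partial sums are computed per partition (map-side combine), then shuffled 
val freq = counts.reduceByKey(_ + _)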
Top N (Prepare data) 
RDD[(String, Int)] RDD[(Int, String)] 
world 
a 
1 
1 
new 1 
line 
hello 
1 
2 
.map(_.swap) 
1 
1 
1 new 
world 
a 
line 
hello 
1 
2 
// Step 6 - Swap tuples (partial code) 
freq.map(_.swap)
Top N (First Attempt) 
RDD[(Int, String)] 
1 
1 
1 new 
world 
a 
line 
hello 
1 
2
Top N (First Attempt) 
RDD[(Int, String)] 
1 
1 
1 new 
world 
a 
line 
hello 
1 
2 
.sortByKey 
RDD[(Int, String)] 
2 
1 
1 a 
hello 
world 
new 
line 
1 
1 
(sortByKey(false) for descending)
Top N (First Attempt) 
RDD[(Int, String)] Array[(Int, String)] 
1 
1 
1 new 
world 
a 
line 
hello 
1 
2 
hello 
world 
2 
1 
RDD[(Int, String)] 
2 
1 
1 a 
hello 
world 
.sortByKey .take(N) 
new 
line 
1 
1 
(sortByKey(false) for descending)
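Putting this first attempt into code, as a sketch (the deck's final Step 6 below replaces it with top(N)): 
// sort the whole RDD by key in descending order, then take the first N elements 
val top = freq.map(_.swap).sortByKey(ascending = false).take(N)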
Top N 
Array[(Int, String)] 
RDD[(Int, String)] 
1 
1 
1 new 
world 
a 
line 
hello 
1 
2 
world 
a 
1 
1 
.top(N) 
hello 
line 
2 
1 
hello 
line 
2 
1 
local top N * 
local top N * 
reduction 
* local top N implemented by bounded priority queues 
// Step 6 - Swap tuples (complete code) 
val top = freq.map(_.swap).top(N)
Top Words by Frequency (Full Code) 
val spark = new SparkContext() 
// RDD creation from external data source 
val docs = spark.textFile("hdfs://docs/") 
// Convert lines to lower case and split them into words 
val lower = docs.map(line => line.toLowerCase) 
val words = lower.flatMap(line => line.split("\\s+")) 
val counts = words.map(word => (word, 1)) 
// Count all words (automatic combination) 
val freq = counts.reduceByKey(_ + _) 
// Swap tuples and get top results 
val top = freq.map(_.swap).top(N) 
top.foreach(println)
RDD Persistence (in-memory) 
(Diagram: an RDD with its partitions held in memory) 
.cache() (memory only) 
.persist() (memory only) 
.persist(storageLevel) 
StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, … 
(lazy persistence & caching)
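A minimal sketch of explicit persistence, reusing the freq RDD from the word-count example (the import line is the standard location of StorageLevel): 
import org.apache.spark.storage.StorageLevel 
val freq = counts.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK) 
freq.count()  // the first action materializes and caches the partitions 
freq.take(10) // later actions reuse the cached partitions instead of recomputing them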
RDD Lineage 
RDD Transformations (lineage is built on the driver by the transformations): 
words = sc.textFile("hdfs://large/file/")     // HadoopRDD 
  .map(_.toLowerCase)                         // MappedRDD 
  .flatMap(_.split(" "))                      // FlatMappedRDD 
nums = words.filter(_.matches("[0-9]+"))      // FilteredRDD 
alpha = words.filter(_.matches("[a-z]+"))     // FilteredRDD 
alpha.count() // Action (runs the job on the cluster)
SchemaRDD & SQL 
SchemaRDD 
Row 
... 
... 
Row 
... 
... 
Row 
Row 
... 
RDD of Row + Column Metadata 
Queries with SQL 
Support for Reflection, JSON, 
Parquet, …
SchemaRDD & SQL 
topWords 
Row 
... 
... 
Row 
... 
... 
Row 
Row 
... 
case class Word(text: String, n: Int) 
val wordsFreq = freq.map { 
case (text, count) => Word(text, count) 
} // RDD[Word] 
wordsFreq.registerTempTable("wordsFreq") 
val topWords = sql("select text, n 
from wordsFreq 
order by n desc 
limit 20") // RDD[Row] 
topWords.collect().foreach(println)
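The sql(...) call above assumes a SQLContext whose members are in scope; a sketch of that setup for Spark 1.x (the wildcard import is the usual idiom, stated here as an assumption): 
import org.apache.spark.sql.SQLContext 
val sqlContext = new SQLContext(spark) 
import sqlContext._ // brings sql(...) and the implicit RDD-to-SchemaRDD conversion into scope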
Spark Streaming 
DStream 
RDD RDD RDD RDD RDD RDD 
Data Collected, Buffered and Replicated 
by a Receiver (one per DStream) 
then Pushed to a stream as small RDDs 
Configurable Batch Intervals. 
e.g: 1 second, 5 seconds, 5 minutes 
Receiver sources, e.g: Kafka, Kinesis, Flume, Sockets, Akka, etc.
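A minimal sketch of setting up a DStream (the socket source and the 5-second interval are illustrative assumptions): 
import org.apache.spark.streaming.{Seconds, StreamingContext} 
val ssc = new StreamingContext(spark, Seconds(5))    // 5-second batch interval 
val stream = ssc.socketTextStream("localhost", 9999) // one receiver for this DStream 
// declare transformations on the stream, then: 
ssc.start() 
ssc.awaitTermination()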
DStream Transformations 
DStream 
RDD RDD RDD RDD RDD RDD 
DStream 
transform 
RDD RDD RDD RDD RDD RDD 
Receiver 
// Example 
val entries = stream.transform { rdd => rdd.map(Log.parse) } 
// Alternative 
val entries = stream.map(Log.parse)
Parallelism with Multiple Receivers 
DStream 1 
Receiver 1 RDD RDD RDD RDD RDD RDD 
DStream 2 
Receiver 2 RDD RDD RDD RDD RDD RDD 
union of (stream1, stream2, …) 
Union can be used to manage multiple DStreams as 
a single logical stream
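A sketch, assuming two receiver DStreams already created as stream1 and stream2: 
// combine both receiver streams into one logical stream before further processing 
val unified = ssc.union(Seq(stream1, stream2)) 
val entries = unified.map(Log.parse)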
Sliding Windows 
DStream 
RDD RDD RDD RDD RDD RDD 
DStream 
… … … W3 W2 W1 
Window Length: 3, Sliding Interval: 1 
Receiver
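With the 5-second batch interval assumed earlier, a window length of 3 intervals sliding by 1 interval looks like this sketch: 
// each window covers the last 3 batches (15 seconds), recomputed every batch (5 seconds) 
val windowed = stream.window(Seconds(15), Seconds(5)) 
val countsPerWindow = windowed.count()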
Deployment with Hadoop 
(Diagram: Spark Workers co-located with HDFS Data Nodes 1-4; the Spark Master, alongside the Name Node, allocates resources (cores and memory) to the Application; a Client submits the app with mode=cluster, so the Driver and the Executors run on the workers; HDFS blocks A, B, C and D of /large/file are replicated with RF 3 across the Data Nodes; each node runs DN + Spark, i.e. HDFS and Spark together.)
Fernando Rodriguez Olivera 
twitter: @frodriguez
