!"#$%&'()&")'*+)+,,",-./'-%'!0+)123+."&'4+5+
*-0",-%".
Zachary Ennenga · 8 min read · May 28, 2023
The entire purpose of Spark is to efficiently distribute and parallelize work, and
because of this, it can be easy to miss places where applying additional parallelism
on top of Spark can increase the efficiency of your application.
Actions are blocking operations that actually perform distributed computation. These
include things like count or repartition , as well as any sort of saving/serialization
operation.
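For instance, given an existing SparkSession named spark (the paths here are purely illustrative), the count and write calls below are each actions that trigger a distributed job:

val df = spark.read.parquet("s3://bucket/input")   // hypothetical input path
df.count()                                         // action: computes and returns a row count
df.write.parquet("s3://bucket/output")             // action: computes the plan and writes the result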
In some cases, your job may only contain one action, or all your actions might be dependent on one another. Often, however, jobs contain multiple independent actions.
First, perhaps you need to split your data into multiple subsets and save them to unique tables:
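A minimal sketch of what this might look like, assuming a hypothetical source table events with a region column and illustrative output table names:

import org.apache.spark.sql.functions.col

val events  = spark.table("events")        // hypothetical source
val regions = Seq("us", "eu", "apac")      // hypothetical split keys

regions.foreach { region =>
  events
    .filter(col("region") === region)
    .write
    .mode("overwrite")
    .saveAsTable(s"events_$region")        // one unique table per subset
}

Each saveAsTable call here is an independent action.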
Or perhaps you’re not splitting a single dataset, but your job just results in multiple
unique datasets:
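Sketched with hypothetical inputs and outputs, that might look like two unrelated datasets being derived and saved side by side:

val orders    = spark.table("raw_orders")
val customers = spark.table("raw_customers")

orders
  .groupBy("order_date").count()
  .write.mode("overwrite").saveAsTable("daily_order_counts")

customers
  .dropDuplicates("customer_id")
  .write.mode("overwrite").saveAsTable("unique_customers")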
Maybe you’re calculating some metrics, as well as saving the underlying data:
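Again as an illustrative sketch (table and column names assumed), one action produces a metric while another persists the data itself:

import org.apache.spark.sql.functions.current_timestamp

val enriched = spark.table("raw_events")
  .withColumn("ingested_at", current_timestamp())

val rowCount = enriched.count()                                   // action 1: a simple metric
enriched.write.mode("overwrite").saveAsTable("events_enriched")   // action 2: the underlying data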
The point is, with a traditional application structure, Spark will only process one of
these actions at a time, even though there may be no, or minimal, direct
dependencies between them.
Quantifying Efficiency
When considering the performance characteristics of a data pipeline, there are two quantities that come to mind: runtime and efficiency. Runtime is simply how long the job takes end to end; efficiency is, roughly, how much of the compute allocated to the job is spent doing useful work.
So, we want our executors to minimize their idle time: the time in which they are allocated to your job but not doing work. A common solution to this problem is dynamic allocation, in which Spark dynamically adds and removes executors from your job. This works well… until it doesn't.
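For reference, dynamic allocation is enabled through configuration; here is a sketch with illustrative values (the exact settings depend on your cluster and Spark version):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-pipeline")                                        // hypothetical app name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "200")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  // how quickly idle executors are released
  .config("spark.shuffle.service.enabled", "true")               // or shuffle tracking on newer Spark versions
  .getOrCreate()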
Embracing Skew
There are two sorts of skew I see in jobs. First, there is the kind you're probably thinking of, which I call "unnatural" skew: some Spark partitions are much larger than their peers. This sort of skew has been documented to death, and I won't spend much time on it.
That said, when observing Spark job execution, you'll notice that even a job with right-sized partitions still shows a bit of a bell curve in partition computation times. This is what I refer to as "natural" skew: some partitions simply take somewhat longer to complete than others.
This can happen for a number of reasons: slight (10–20%) record-count differences between partitions, differences in computation cost between partitions (common when using complex Scala functions to transform data), network delays, and so on.
While you can direct work towards minimizing natural skew, in a complex Spark
application, you might as well be trying to bail out the ocean. This sort of skew
broadly has minimal impact on job runtime, and the fixes (complex repartitioning
schemes) generally cost more to implement and execute than doing nothing at all.
However, this sort of natural skew is the enemy of dynamic allocation. As a Spark job nears completion, natural skew means the last ~20% of your tasks will likely take some additional time to compute. With dynamic allocation, Spark is happy to drop those 80% "idle" executors, meaning your application will start its next Spark job with a small percentage of its initial resource allocation and has to spend time scaling up again.
The reason for this is simple: Spark doesn't know the entire body of work your application will do in advance; it can't see beyond an action when computing its execution plan.
The result of this is a negative impact on your runtime, to the benefit of your efficiency. As a developer, this is often not a trade you intend to make.
To solve this problem, and simultaneously improve runtime and efficiency, we need
to give Spark a bit of a helping hand, so it can build a more complete execution plan
from the start.
<%5")>'!"#$%&2()&")'*+)+,,",-./
Spark is our primary, or first-order, parallelism engine. To execute multiple actions in parallel, we need a secondary, or second-order, parallelism construct.
Let’s take one of our initial examples, and apply some standard Scala parallelization
constructs:
Await.result(Future.sequence(futures), Duration.Inf)
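Filled out into a minimal, self-contained sketch (dfA, dfB, and the table names are assumed here), the pattern wraps each save in a Future and blocks on all of them at the end, as in the line above:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future submits an independent Spark action; both run concurrently
// on the same SparkContext and share the available executors.
val futures = Seq(
  Future { dfA.write.mode("overwrite").saveAsTable("table_a") },
  Future { dfB.write.mode("overwrite").saveAsTable("table_b") }
)

// Block until every action has finished
Await.result(Future.sequence(futures), Duration.Inf)

The global execution context is used here for brevity; a dedicated thread pool sized to the number of concurrent actions works just as well.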
The result of this is that Spark will execute the save commands in parallel. This means that as executors complete the first save, they will immediately begin working on the second, and will not become idle until both saves, and the entire body of work for your application, are complete.
Outcomes
I have used this pattern to great effect across many jobs; it generally improves efficiency by 10–20%, and runtime by an even larger margin, depending on the number of actions that can be parallelized.
Most importantly, it minimizes idle time of executors, and minimizes cases where
executors are deallocated from your job before it is complete (when using dynamic
allocation).
Spark generally does a good job of thread safety, and I have not encountered any issues using this sort of pattern with modern Spark versions.