!"#$%&'()&")'*+)+,,",-./'-%'!0+)123+."&'4+5+
*-0",-%".
Zachary Ennenga · 8 min read · May 28, 2023
The entire purpose of Spark is to efficiently distribute and parallelize work, and
because of this, it can be easy to miss places where applying additional parallelism
on top of Spark can increase the efficiency of your application.
Actions are blocking operations that actually perform distributed computation. These
include things like count or repartition , as well as any sort of saving/serialization
operation.
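For instance, given an existing SparkSession named spark (the paths here are purely illustrative), the count and write calls below are each actions that trigger a distributed job:

val df = spark.read.parquet("s3://bucket/input")   // hypothetical input path
df.count()                                         // action: computes and returns a row count
df.write.parquet("s3://bucket/output")             // action: computes the plan and writes the result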
In some cases, your job may only contain one action, or all your actions might be dependent on one another. Often, however, jobs contain multiple independent actions.
First, perhaps you need to split your data into multiple subsets and save them to unique tables:
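A minimal sketch of what this might look like, assuming a hypothetical source table events with a region column and illustrative output table names:

import org.apache.spark.sql.functions.col

val events  = spark.table("events")        // hypothetical source
val regions = Seq("us", "eu", "apac")      // hypothetical split keys

regions.foreach { region =>
  events
    .filter(col("region") === region)
    .write
    .mode("overwrite")
    .saveAsTable(s"events_$region")        // one unique table per subset
}

Each saveAsTable call here is an independent action.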
Or perhaps you’re not splitting a single dataset, but your job just results in multiple
unique datasets:
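Sketched with hypothetical inputs and outputs, that might look like two unrelated datasets being derived and saved side by side:

val orders    = spark.table("raw_orders")
val customers = spark.table("raw_customers")

orders
  .groupBy("order_date").count()
  .write.mode("overwrite").saveAsTable("daily_order_counts")

customers
  .dropDuplicates("customer_id")
  .write.mode("overwrite").saveAsTable("unique_customers")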
Maybe you’re calculating some metrics, as well as saving the underlying data:
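Again as an illustrative sketch (table and column names assumed), one action produces a metric while another persists the data itself:

import org.apache.spark.sql.functions.current_timestamp

val enriched = spark.table("raw_events")
  .withColumn("ingested_at", current_timestamp())

val rowCount = enriched.count()                                   // action 1: a simple metric
enriched.write.mode("overwrite").saveAsTable("events_enriched")   // action 2: the underlying data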
The point is, with a traditional application structure, Spark will only process one of
these actions at a time, even though there may be no, or minimal, direct
dependencies between them.
Quantifying Efficiency
When considering the performance characteristics of a data pipeline, there are two quantities that come to mind: runtime and efficiency. Runtime is simply how long the job takes end to end; efficiency is, roughly, how much of the compute allocated to the job is spent doing useful work.
So, we want our executors to minimize their idle time: the time in which they are allocated to your job but not doing work. A common solution to this problem is dynamic allocation, in which Spark dynamically adds and removes executors from your job. This works well… until it doesn't.
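For reference, dynamic allocation is enabled through configuration; here is a sketch with illustrative values (the exact settings depend on your cluster and Spark version):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-pipeline")                                        // hypothetical app name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "200")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  // how quickly idle executors are released
  .config("spark.shuffle.service.enabled", "true")               // or shuffle tracking on newer Spark versions
  .getOrCreate()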
Embracing Skew
There are two sorts of skew I see in jobs. First, there is the kind you're probably thinking of, which I call "unnatural" skew: some Spark partitions are much larger than their peers. This sort of skew has been documented to death, and I won't spend much time on it.
That said, when observing Spark job execution, you'll notice that even a job with right-sized partitions still shows a bit of a bell curve in partition computation times. This is what I refer to as "natural" skew: some partitions simply take somewhat longer to complete than others.
This can happen for a number of reasons: slight (10–20%) record-count differences between partitions, differences in computation cost between partitions (common when using complex Scala functions to transform data), network delays, and so on.
While you can direct work towards minimizing natural skew, in a complex Spark
application, you might as well be trying to bail out the ocean. This sort of skew
broadly has minimal impact on job runtime, and the fixes (complex repartitioning
schemes) generally cost more to implement and execute than doing nothing at all.
However, this sort of natural skew is the enemy of dynamic allocation. As a Spark job nears completion, natural skew means the last ~20% of your tasks will likely take some additional time to compute. With dynamic allocation, Spark is happy to drop those 80% "idle" executors, meaning your application will start its next Spark job with a small percentage of its initial resource allocation and has to spend time scaling up again.
The reason for this is simple: Spark doesn't know the entire body of work your application will do in advance; it can't see beyond an action when computing its execution plan.
The result of this is a negative impact on your runtime, to the benefit of your efficiency. As a developer, this is often not a trade you intend to make.
To solve this problem, and simultaneously improve runtime and efficiency, we need
to give Spark a bit of a helping hand, so it can build a more complete execution plan
from the start.
<%5")>'!"#$%&2()&")'*+)+,,",-./
Spark is our primary, or first-order, parallelism engine. To execute multiple actions in parallel, we need a secondary, or second-order, parallelism construct.
Let’s take one of our initial examples, and apply some standard Scala parallelization
constructs:
Await.result(Future.sequence(futures), Duration.Inf)
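Filled out into a minimal, self-contained sketch (dfA, dfB, and the table names are assumed here), the pattern wraps each save in a Future and blocks on all of them at the end, as in the line above:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future submits an independent Spark action; both run concurrently
// on the same SparkContext and share the available executors.
val futures = Seq(
  Future { dfA.write.mode("overwrite").saveAsTable("table_a") },
  Future { dfB.write.mode("overwrite").saveAsTable("table_b") }
)

// Block until every action has finished
Await.result(Future.sequence(futures), Duration.Inf)

The global execution context is used here for brevity; a dedicated thread pool sized to the number of concurrent actions works just as well.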
The result of this is that Spark will execute the save commands in parallel. This means that as executors complete the first save, they will immediately begin working on the second, and will not become idle until both saves, and the entire body of work for your application, are complete.
Outcomes
I have used this pattern to great effect across many jobs; it generally improves efficiency by 10–20%, and runtime by an even larger margin, depending on the number of actions that can be parallelized.
Most importantly, it minimizes idle time of executors, and minimizes cases where
executors are deallocated from your job before it is complete (when using dynamic
allocation).
Spark generally does a good job of thread safety, and I have not encountered any issues using this sort of pattern with modern Spark versions.