Introduction to Apache Spark Overview

The document discusses the limitations of Map-Reduce in handling queries and analytical tasks, highlighting issues such as inefficiency with full scans, lack of iteration, and absence of caching. It introduces Apache Spark as a solution that offers in-memory processing, simplified programming abstractions, and improved performance for big data processing tasks. Spark's Resilient Distributed Dataset (RDD) is emphasized as a core feature that enables fault-tolerant and efficient data manipulation across distributed systems.

Introduction to Apache Spark

pm jat @ daiict
Issues with “Map Reduce”
• Map-Reduce runs into problems with many queries and analytical tasks
• Some of the issues listed here are from a survey article [6]
• Requirement of a FULL SCAN of the file
– Very inefficient when low-selectivity queries are to be executed
– Conditional (early) termination of file processing is not possible; the scan cannot stop once enough results are found.
• Lack of iteration: if we need to iterate over a dataset multiple times, the data is re-read from disk files on every pass; this happens to be the case with many analytical and most machine-learning tasks.

14-Aug-25 Introduction to Apache Spark 2


Issues with “Map Reduce”
• Lack of caching: multiple MR jobs often process the same data at almost the same time; caching could make them run up to 100× faster.
• The system cannot reuse results of previously executed queries/jobs.
• No quick retrieval of approximate results (for example, when we want to process only 10% of the data in a file).
• Lack of interactive ("dashboard-style") or real-time processing: a Map-Reduce job runs in the background, and there is no interaction until it finishes.



“Spark” [1]
• Spark was created by Matei Zaharia at the AMP Lab
of UC Berkeley in 2009. It was the outcome of his Ph.D. thesis!
• It was introduced through the paper
"Spark: Cluster Computing with Working Sets" [1] in 2010.
• Spark comes as a solution to most of the Map-Reduce problems:
– Caching: makes iterations faster and enables reuse of intermediate results
– Faster execution, due to caching
– Support for query optimization
– A further simplified programming abstraction

[1] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010)
Apache Spark
• Spark was open-sourced in 2010 and later became an Apache project
• Spark primarily has two revolutionary features:
1. In-memory processing: once data are read from files, they can be kept in
primary memory (distributed across various computers), and processing can be
done in parallel on the data in memory.
2. A further simplified programming abstraction: we only need to write driver
programs



Apache Spark
• Spark is described by its creators as a "Unified Engine for Big Data Processing"
• Its RDD abstraction has been demonstrated to
be a core engine for a wide array of computational
tasks [2]:
– SQL,
– Stream Processing,
– Machine Learning,
– Graph Processing, and so forth



Apache Spark [2]

[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big
data processing." Communications of the ACM 59.11 (2016)
A Unified Engine for Big Data Processing [2]
• Spark’s generality has several important benefits.
• “First, applications are easier to develop because they use a unified API”
• “Second, it is more efficient to combine processing tasks; whereas prior systems
required writing the data to storage to pass it to another engine, Spark can run
diverse functions over the same data, often in memory.”
• Finally, Spark enables new applications (such as interactive queries on a graph and
streaming machine learning) that were not possible with previous systems!

[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big data processing."
Communications of the ACM 59.11 (2016)
Programming Model [2]
• The simpler programming model is yet another highlight of
Spark!
• The beauty is that all of this is distributed and parallel!

[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big
data processing." Communications of the ACM 59.11 (2016)
Apache Spark – Some Numbers[2]
• The article reports results of a "Logistic Regression Analysis" using Spark
• The figure shows a Scala implementation of logistic regression via batch gradient
descent in Spark
• Note the simplicity of the code.
• Spark makes it easy to load the data into RAM once and run multiple passes over it.
• As a result, it runs much faster than traditional Map-Reduce.
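The Scala listing from the figure is not reproduced here, but the algorithm it describes can be sketched as a minimal single-machine Python analogue of batch gradient descent for logistic regression (the data, learning rate, and iteration count below are hypothetical; in Spark, the per-point gradient sum would be computed in parallel over a cached RDD):

```python
import math

def logistic_regression(points, labels, iterations=100, lr=0.1):
    """Batch gradient descent for logistic regression.
    points: list of feature vectors; labels: list of 0/1 labels."""
    dim = len(points[0])
    w = [0.0] * dim
    for _ in range(iterations):
        # One full pass over the (in-memory) dataset per iteration,
        # mirroring how Spark re-scans a cached RDD on every pass.
        grad = [0.0] * dim
        for x, y in zip(points, labels):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid prediction
            for j in range(dim):
                grad[j] += (p - y) * x[j]    # accumulate gradient
        w = [wi - lr * g / len(points) for wi, g in zip(w, grad)]
    return w
```

Because the dataset stays in memory between iterations, only the first pass pays the cost of loading the data; this is exactly the access pattern where Spark's caching beats Map-Reduce.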



MR vs Apache Spark – some numbers
• The following are experimental results of running logistic regression in Hadoop
Map-Reduce vs. Spark, for 100 GB of data on 50 m2.4xlarge EC2 nodes [2]
• Map-Reduce takes 110 seconds per iteration, because each iteration loads the data
from disk, while Spark takes only one second per iteration after the first load.

"100 times faster at 10 iterations"!

The gain grows larger with more iterations!


MR vs Apache Spark – some numbers

[Link]
MR vs Apache Spark – some numbers

[Link]
Spark “Word Frequency Count” - Scala
• Spark is written in Scala, and Scala is the most native language for Spark
• Scala is a functional programming language that makes extensive use of lambda
expressions; Scala code is quite compact

[Link]
Spark “Word Frequency Count” - Java
• Spark programs are "driver programs" only; they do not require writing a mapper,
reducer, combiner, etc.

[Link]
Spark “Word Frequency Count” - Python

[Link]



Apache Spark – framework*
• Though Spark itself is programmed in Scala, it can be accessed from multiple
programming environments: Spark Scala, Spark Java, Spark Python, and Spark R
* [Link]
Spark Overview
• Spark offers a revolutionary programming paradigm that makes distributed programming
feel like desktop programming
• The following are the main revolutionary features that make Spark an amazing solution
for cluster computing:
– "In-memory", "distributed", "fault-tolerant" collections of objects (called RDDs)
– Simple programming abstractions
– A rich set of operations on distributed collection objects, performed
in parallel
– Support for a "query optimizer"



Resilient Distributed Dataset (RDD)
• The main abstraction in Spark is the Resilient Distributed Dataset (RDD)
• An RDD is a distributed, fault-tolerant object collection that is partitioned across a set
of machines. (Note that the object collection here is "in memory".)
– Compared with a Python list, an RDD is a "distributed, fault-tolerant list"
• RDD objects can be explicitly cached in memory and reused in subsequent
computations. This in-memory processing is what makes Spark amazingly fast!
• For fault tolerance, RDD objects themselves are not replicated (as per the original
article; today's implementations may differ); instead, an RDD maintains enough lineage
information that a partition can be rebuilt if a node fails!
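The idea of a partitioned collection that recovers lost data from lineage rather than from replicas can be sketched in plain Python. This is a toy single-machine analogue, not Spark's actual implementation; the class and method names are invented for illustration:

```python
class ToyRDD:
    """Toy analogue of an RDD: data split into partitions, plus the
    lineage (parent + transformation) needed to rebuild a lost partition."""
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists, one per "machine"
        self.parent = parent          # lineage: where the data came from
        self.fn = fn                  # lineage: how it was derived

    @staticmethod
    def from_list(data, num_partitions=2):
        size = (len(data) + num_partitions - 1) // num_partitions
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return ToyRDD(parts)

    def map(self, fn):
        # A transformation is applied independently to every partition.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def rebuild_partition(self, i):
        # Fault recovery: recompute a lost partition from lineage
        # instead of keeping a replica of the data.
        return [self.fn(x) for x in self.parent.partitions[i]]

    def collect(self):
        return [x for p in self.partitions for x in p]
```

For example, after `rdd = ToyRDD.from_list([1, 2, 3, 4]).map(lambda x: x * x)`, "losing" partition 0 can be undone with `rdd.partitions[0] = rdd.rebuild_partition(0)`, because the lineage records both the parent data and the function that derived the partition.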



A Complete Spark Program in Python



A Complete Spark Program in Python
• A Spark program is a driver program
• It creates two types of objects:
– Local objects
– Distributed objects: RDDs
• The RDD acts as the main "data model" in Spark programs
• You perform various manipulation operations on it
– recall applying various operations on "relations"
• That is all there is to it!



Spark “Word Frequency Count” - Python
• Spark programs are "driver programs" only; they do not require writing a mapper,
reducer, combiner, etc.
• By the way, what do we mean by a "driver program"?

The RDDs here are:

lines (a distributed list of String objects),
pairs (a distributed list of <String, int> pair objects),
counts (a distributed list of <String, int> pair objects)
[Link]
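The code listing from the figure is not reproduced here, but the pipeline it describes (lines → pairs → counts) can be sketched as a single-machine Python analogue of the Spark operations flatMap, map, and reduceByKey, using only the standard library (the input text is hypothetical; the variable names follow the slide):

```python
# Single-machine analogue of the Spark word-count pipeline.
text = ["to be or not to be", "that is the question"]

# lines: in Spark, an RDD of String objects (here, a plain list)
lines = text

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"])  # → 2
```

In Spark, each of these steps would run in parallel across the partitions of the RDD; the driver program only strings the operations together.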
Map-Reduce vs Spark (recap)
• Map-Reduce?
– Simple file processing in parallel
– "Distributed data file" processing
• Spark?
– Same as MR, but
– quite a bit smarter and faster
• "In-memory distributed processing" vs. "distributed file processing" in Map-Reduce
• Lazy evaluation and an optimizer behind the scenes
• Intermediate results can be reused!
– Further simplifies programming
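Lazy evaluation, mentioned above, means transformations merely build up a recipe that runs only when a result is actually demanded. A rough single-machine analogue (nothing Spark-specific here) is Python's generators:

```python
log = []

def numbers():
    for i in range(5):
        log.append(i)       # record when each element is actually produced
        yield i

# Building the pipeline is cheap: nothing has been computed yet.
squares = (x * x for x in numbers())
assert log == []            # lazy: no work done so far

# Only demanding results (an "action" in Spark terms) triggers computation,
# and only as much of the input is consumed as needed.
first_two = [next(squares), next(squares)]
assert first_two == [0, 1]
assert log == [0, 1]        # the remaining elements were never produced
```

Deferring work this way is what lets an optimizer see the whole pipeline before anything runs, and lets a scan stop early once enough results exist.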



What makes Spark Faster?
• In-memory processing
• An optimizer behind the scenes
• Intermediate results can be reused!



References
[1] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010): 95.
[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big data processing." Communications of
the ACM 59.11 (2016): 56-65.
[3] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory
cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and
Implementation. USENIX Association, 2012.
[4] Chambers, Bill, and Matei Zaharia. Spark: The Definitive Guide: Big Data Processing Made Simple.
O'Reilly Media, Inc., 2018.
[5] Python Spark Documentation
[Link]
[6] Doulkeridis, Christos, and Kjetil Nørvåg. "A survey of large-scale analytical query processing in
MapReduce." The VLDB Journal 23.3 (2014): 355-380.
