Introduction to Apache Spark Overview

The document discusses the limitations of Map-Reduce in handling queries and analytical tasks, highlighting issues such as inefficiency with full scans, lack of iteration, and absence of caching. It introduces Apache Spark as a solution that offers in-memory processing, simplified programming abstractions, and improved performance for big data processing tasks. Spark's Resilient Distributed Dataset (RDD) is emphasized as a core feature that enables fault-tolerant and efficient data manipulation across distributed systems.

Introduction to Apache Spark

pm jat @ daiict
Issues with “Map Reduce”
• Map-Reduce runs into problems with many queries and analytical tasks
• Some of the issues listed here are from a survey article [6]
• Requirement of a FULL SCAN of the file
– Very inefficient when low-selectivity queries are to be executed
– Conditional (early) termination of file processing is not possible; the scan cannot stop once enough results are found.
• Lack of iteration: if we need to iterate over a dataset multiple times, the data is re-read from disk files on every pass; this happens to be the case with many analytical and most machine-learning tasks.

14-Aug-25 Introduction to Apache Spark 2


Issues with “Map Reduce”
• Lack of caching: multiple MR jobs often process the same data at almost the same time; caching could make them run up to 100× faster.
• The system cannot reuse results of previously executed queries/jobs.
• No quick retrieval of approximate results (for example, when we want to process only 10% of the data in a file).
• Lack of interactive ("dashboard-style") or real-time processing: a Map-Reduce job runs in the background, and there is no interaction until it finishes.



“Spark” [1]
• Spark was created by Matei Zaharia at the AMP Lab
of UC Berkeley in 2009. It was the outcome of his Ph.D. thesis!
• It was introduced through the paper
"Spark: Cluster Computing with Working Sets" [1] in 2010.
• Spark comes as a solution to most of the Map-Reduce problems:
– Caching: makes iterations faster and enables reuse of intermediate results
– Faster execution, due to caching
– Support for query optimization
– A further simplified programming abstraction

[1] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010)
Apache Spark
• Spark was open-sourced in 2010 and later became an Apache project
• Spark primarily has two revolutionary features:
1. In-memory processing: once data are read from files, they can be kept in
primary memory (distributed across various computers), and processing can be
done in parallel on the data in memory.
2. A further simplified programming abstraction: we only need to write driver
programs



Apache Spark
• Spark is described by its creators as a "Unified Engine for Big Data Processing"
• Its RDD abstraction has been demonstrated to
be a core engine for a wide array of computational
tasks [2]:
– SQL,
– Stream Processing,
– Machine Learning,
– Graph Processing, and so forth



Apache Spark [2]

[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big
data processing." Communications of the ACM 59.11 (2016)
A Unified Engine for Big Data Processing [2]
• Spark’s generality has several important benefits.
• “First, applications are easier to develop because they use a unified API”
• “Second, it is more efficient to combine processing tasks; whereas prior systems
required writing the data to storage to pass it to another engine, Spark can run
diverse functions over the same data, often in memory.”
• Finally, Spark enables new applications (such as interactive queries on a graph and
streaming machine learning) that were not possible with previous systems!

[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big data processing."
Communications of the ACM 59.11 (2016)
Programming Model [2]
• The simpler programming model is yet another highlight of
Spark!
• The beauty is that all of this is distributed and parallel!

[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big
data processing." Communications of the ACM 59.11 (2016)
Apache Spark – Some Numbers[2]
• The article reports results of a "Logistic Regression Analysis" using Spark
• The figure shows a Scala implementation of logistic regression via batch gradient
descent in Spark
• Note the simplicity of the code.
• Spark makes it easy to load the data into RAM once and run multiple passes over it.
• As a result, it runs much faster than traditional Map-Reduce.
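The Scala listing from the figure is not reproduced here, but the algorithm it describes can be sketched as a minimal single-machine Python analogue of batch gradient descent for logistic regression (the data, learning rate, and iteration count below are hypothetical; in Spark, the per-point gradient sum would be computed in parallel over a cached RDD):

```python
import math

def logistic_regression(points, labels, iterations=100, lr=0.1):
    """Batch gradient descent for logistic regression.
    points: list of feature vectors; labels: list of 0/1 labels."""
    dim = len(points[0])
    w = [0.0] * dim
    for _ in range(iterations):
        # One full pass over the (in-memory) dataset per iteration,
        # mirroring how Spark re-scans a cached RDD on every pass.
        grad = [0.0] * dim
        for x, y in zip(points, labels):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid prediction
            for j in range(dim):
                grad[j] += (p - y) * x[j]    # accumulate gradient
        w = [wi - lr * g / len(points) for wi, g in zip(w, grad)]
    return w
```

Because the dataset stays in memory between iterations, only the first pass pays the cost of loading the data; this is exactly the access pattern where Spark's caching beats Map-Reduce.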



MR vs Apache Spark – some numbers
• The following are experimental results of running logistic regression in Hadoop
Map-Reduce vs. Spark, for 100 GB of data on 50 m2.4xlarge EC2 nodes [2]
• Map-Reduce takes 110 seconds per iteration, because each iteration loads the data
from disk, while Spark takes only one second per iteration after the first load.

"100 times faster at 10 iterations"!

The gain grows larger with more iterations!


MR vs Apache Spark – some numbers

[Link]
MR vs Apache Spark – some numbers

[Link]
Spark “Word Frequency Count” - Scala
• Spark is written in Scala, and Scala is the most native language for Spark
• Scala is a functional programming language that makes extensive use of lambda
expressions; Scala code is quite compact

[Link]
Spark “Word Frequency Count” - Java
• Spark programs are "driver programs" only; they do not require writing a mapper,
reducer, combiner, etc.

[Link]
Spark “Word Frequency Count” - Python

[Link]



Apache Spark – framework*
• Though Spark itself is programmed in Scala, it can be accessed from multiple
programming environments: Spark Scala, Spark Java, Spark Python, and Spark R
* [Link]
Spark Overview
• Spark offers a revolutionary programming paradigm that makes distributed programming
feel like desktop programming
• The following are the main revolutionary features that make Spark an amazing solution
for cluster computing:
– "In-memory", "distributed", "fault-tolerant" collections of objects (called RDDs)
– Simple programming abstractions
– A rich set of operations on distributed collection objects, performed
in parallel
– Support for a "query optimizer"



Resilient Distributed Dataset (RDD)
• The main abstraction in Spark is the Resilient Distributed Dataset (RDD)
• An RDD is a distributed, fault-tolerant object collection that is partitioned across a set
of machines. (Note that the object collection here is "in memory".)
– Compared with a Python list, an RDD is a "distributed, fault-tolerant list"
• RDD objects can be explicitly cached in memory and reused in subsequent
computations. This in-memory processing is what makes Spark amazingly fast!
• For fault tolerance, RDD objects themselves are not replicated (as per the original
article; today's implementations may differ); instead, an RDD maintains enough lineage
information that a partition can be rebuilt if a node fails!
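The idea of a partitioned collection that recovers lost data from lineage rather than from replicas can be sketched in plain Python. This is a toy single-machine analogue, not Spark's actual implementation; the class and method names are invented for illustration:

```python
class ToyRDD:
    """Toy analogue of an RDD: data split into partitions, plus the
    lineage (parent + transformation) needed to rebuild a lost partition."""
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists, one per "machine"
        self.parent = parent          # lineage: where the data came from
        self.fn = fn                  # lineage: how it was derived

    @staticmethod
    def from_list(data, num_partitions=2):
        size = (len(data) + num_partitions - 1) // num_partitions
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return ToyRDD(parts)

    def map(self, fn):
        # A transformation is applied independently to every partition.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def rebuild_partition(self, i):
        # Fault recovery: recompute a lost partition from lineage
        # instead of keeping a replica of the data.
        return [self.fn(x) for x in self.parent.partitions[i]]

    def collect(self):
        return [x for p in self.partitions for x in p]
```

For example, after `rdd = ToyRDD.from_list([1, 2, 3, 4]).map(lambda x: x * x)`, "losing" partition 0 can be undone with `rdd.partitions[0] = rdd.rebuild_partition(0)`, because the lineage records both the parent data and the function that derived the partition.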



A Complete Spark Program in Python



A Complete Spark Program in Python
• A Spark program is a driver program
• It creates two types of objects:
– Local objects
– Distributed objects: RDDs
• The RDD acts as the main "data model" in Spark programs
• You perform various manipulation operations on it
– recall applying various operations on "relations"
• That is all there is to it!



Spark “Word Frequency Count” - Python
• Spark programs are "driver programs" only; they do not require writing a mapper,
reducer, combiner, etc.
• By the way, what do we mean by a "driver program"?

The RDDs here are:

lines (a distributed list of String objects),
pairs (a distributed list of <String, int> pair objects),
counts (a distributed list of <String, int> pair objects)
[Link]
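The code listing from the figure is not reproduced here, but the pipeline it describes (lines → pairs → counts) can be sketched as a single-machine Python analogue of the Spark operations flatMap, map, and reduceByKey, using only the standard library (the input text is hypothetical; the variable names follow the slide):

```python
# Single-machine analogue of the Spark word-count pipeline.
text = ["to be or not to be", "that is the question"]

# lines: in Spark, an RDD of String objects (here, a plain list)
lines = text

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"])  # → 2
```

In Spark, each of these steps would run in parallel across the partitions of the RDD; the driver program only strings the operations together.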
Map-Reduce vs Spark (recap)
• Map-Reduce?
– Simple file processing in parallel
– "Distributed data file" processing
• Spark?
– Same as MR, but
– quite a bit smarter and faster
• "In-memory distributed processing" vs. "distributed file processing" in Map-Reduce
• Lazy evaluation and an optimizer behind the scenes
• Intermediate results can be reused!
– Further simplifies programming
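Lazy evaluation, mentioned above, means transformations merely build up a recipe that runs only when a result is actually demanded. A rough single-machine analogue (nothing Spark-specific here) is Python's generators:

```python
log = []

def numbers():
    for i in range(5):
        log.append(i)       # record when each element is actually produced
        yield i

# Building the pipeline is cheap: nothing has been computed yet.
squares = (x * x for x in numbers())
assert log == []            # lazy: no work done so far

# Only demanding results (an "action" in Spark terms) triggers computation,
# and only as much of the input is consumed as needed.
first_two = [next(squares), next(squares)]
assert first_two == [0, 1]
assert log == [0, 1]        # the remaining elements were never produced
```

Deferring work this way is what lets an optimizer see the whole pipeline before anything runs, and lets a scan stop early once enough results exist.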



What makes Spark Faster?
• In-memory processing
• An optimizer behind the scenes
• Intermediate results can be reused!



References
[1] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010): 95.
[2] Zaharia, Matei, et al. "Apache spark: a unified engine for big data processing." Communications of
the ACM 59.11 (2016): 56-65.
[3] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory
cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and
Implementation. USENIX Association, 2012.
[4] Chambers, Bill, and Matei Zaharia. Spark: The Definitive Guide: Big Data Processing Made Simple.
O'Reilly Media, Inc., 2018.
[5] Python Spark Documentation
[Link]
[6] Doulkeridis, Christos, and Kjetil Nørvåg. "A survey of large-scale analytical query processing in
MapReduce." The VLDB Journal 23.3 (2014): 355-380.
