
Apache Spark

Lecture 8
Introduction

Industries widely use Hadoop to analyse large datasets because it follows a simple
programming model (MapReduce) and offers scalability, flexibility, fault tolerance, and
cost-effectiveness.
However, processing large amounts of data quickly remains a challenge, particularly in
terms of query response time and program execution speed. To address this, the Apache
Software Foundation introduced Spark, which significantly improves computational
performance.
Contrary to popular belief, Spark is not an improved version of Hadoop. It operates independently because it has its own cluster management system; Hadoop is just one of several ways to deploy Spark. Spark interacts with Hadoop in two main ways:
1. Storage: It can use Hadoop’s distributed file system (HDFS) to store data.
2. Processing: While Spark has its own computational framework, it can use Hadoop’s resources (such as YARN) when needed.
Because Spark manages computation on its own, it primarily relies on Hadoop for data storage rather than processing.
Apache Spark is a fast, general-purpose cluster computing framework built for large-scale data processing. While it builds on Hadoop’s MapReduce model, it extends that model to handle additional computation types, such as interactive queries and stream processing.
The key advantage of Spark is its in-memory cluster computing, which significantly boosts
processing speed by reducing the need for disk reads and writes. Spark is designed to
efficiently handle various workloads, including:
1. Batch Processing: Handling large volumes of data at once.
2. Iterative Algorithms: Ideal for machine learning tasks.
3. Interactive Queries: Enables real-time data exploration.
4. Streaming Processing: Processes continuous flows of data.
By integrating these capabilities into a single system, Spark minimizes the complexity of
managing separate tools, making big data processing more streamlined.
Apache Spark provides high-level APIs in Java, Scala, Python, and R, making it
accessible to developers across different programming environments. Although
Spark itself is written in Scala, it offers rich and efficient APIs for all supported
languages, allowing developers to build and run Spark applications effectively.
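As a concrete illustration, here is a minimal word-count application written against the Python API; the same program can be expressed almost line for line in Scala or Java, and the input path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file, split each line into words, and count occurrences.
words = (spark.read.text("input.txt")  # placeholder path
         .select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word").count())
words.show()

spark.stop()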
The biggest advantage of Spark lies in its speed. Compared to Hadoop MapReduce:
• In-Memory Processing: Spark runs up to 100 times faster when computations fit in memory.
• On-Disk Processing: Spark still outperforms Hadoop, running up to 10 times faster in disk-based mode.
This performance boost is primarily due to Spark’s ability to keep data in memory instead of repeatedly writing intermediate results to disk, which is a limitation of the traditional MapReduce model.
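Much of this speed-up is available explicitly through caching: once a dataset is marked for in-memory storage, later actions reuse it without rereading the disk. A small sketch, assuming a CSV file with a status column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

df = spark.read.csv("events.csv", header=True)  # placeholder input

df.cache()   # mark the DataFrame for in-memory storage
df.count()   # the first action materializes and caches the data

# Later actions read from memory instead of re-scanning the file on disk.
df.filter(df["status"] == "error").count()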
Apache Spark: Timeline
Apache Spark: Features

• Speed: Spark significantly accelerates computation in a Hadoop cluster, running up to 100 times faster in memory and 10 times faster on disk. This efficiency comes from reducing excessive read/write operations, as intermediate data is stored in memory rather than on disk.
• Multi-language Support: Spark includes built-in APIs for Java, Scala, Python, and R, allowing developers to write applications in different programming languages. It also provides 80+ high-level operators for interactive querying, making data manipulation easier.
• Advanced Analytics: Unlike traditional MapReduce, Spark expands its capabilities to support a variety of data processing tasks, including:
  • SQL queries
  • Real-time streaming
  • Machine learning (ML)
  • Graph processing
By integrating these diverse functionalities, Spark reduces the complexity of managing separate tools for different analytics needs.
Apache Spark on Hadoop

The diagram presents three configurations for integrating Apache Spark with Hadoop components:
1. Standalone Mode: Spark runs independently, using HDFS (Hadoop Distributed File System) for storage but managing its own execution.
2. YARN (Hadoop 2.x): Spark is integrated with YARN, Hadoop’s resource manager, allowing it to share cluster resources dynamically while still relying on HDFS for storage.
3. SIMR (Hadoop V1 - Spark in MapReduce): Spark is executed within MapReduce, leveraging Hadoop’s existing framework for processing tasks while utilizing HDFS.
This diagram highlights Spark’s flexibility in deployment, showing how it can function independently or alongside various Hadoop components.
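In code, the deployment mode is selected through the master setting when the application starts (in practice it is more often passed on the command line via spark-submit --master). A sketch with placeholder host names:

from pyspark.sql import SparkSession

# Standalone mode: connect to Spark's own cluster manager.
spark = (SparkSession.builder
         .master("spark://master-host:7077")  # placeholder host
         .appName("StandaloneApp")
         .getOrCreate())

# YARN mode would instead use .master("yarn"), letting Hadoop's resource
# manager allocate executors while HDFS continues to provide storage.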
Apache Spark Components
Spark Core
Spark Core serves as the foundation of the Apache Spark framework, acting as the
primary execution engine for all Spark applications. It is responsible for essential
functions such as task scheduling, memory management, fault recovery, and
interacting with storage systems.
To accommodate diverse workloads, Spark Core provides a generalized platform that
supports multiple specialized libraries, including:
1. Spark SQL – Enables structured querying and processing of data using SQL.
2. Spark Streaming – Handles real-time data streams for continuous processing.
3. MLlib – A library for scalable machine learning algorithms.
4. GraphX – Designed for graph processing and analytics.
This modular architecture ensures that Spark can efficiently manage a wide range of batch, interactive, and streaming workloads while maintaining high performance.
Spark SQL
Spark SQL is a powerful component built on top of Apache Spark that allows users
to run SQL and HQL queries seamlessly. It enables efficient processing of both
structured and semi-structured data, making it ideal for working with databases,
JSON files, and other formatted datasets.
One of its biggest advantages is performance: Spark SQL can execute unmodified SQL and Hive queries up to 100 times faster than traditional disk-based systems, thanks to its in-memory processing and query optimization techniques.
Additionally, Spark SQL supports:
• Integration with existing databases such as Hive, allowing compatibility with HQL.
• DataFrame API, which offers flexibility for programmatic access in Java, Scala, Python, and R.
• Catalyst Optimizer, which enhances query execution efficiency.
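A short sketch of both query styles, registering a DataFrame as a temporary view for SQL and then expressing the same query through the DataFrame API (the data is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# SQL style: register a temporary view and query it with plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# DataFrame API style: the same query expressed programmatically.
df.filter(df["age"] > 30).select("name").show()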
Spark Streaming
Spark Streaming is a powerful component of Apache Spark that enables real-time data
processing and interactive analytics. It allows applications to handle live data streams
efficiently, making it ideal for scenarios like log monitoring, fraud detection, and real-time dashboard updates.
Instead of processing each individual event separately, Spark Streaming converts
incoming live data into micro-batches, which are then processed on Spark Core.
This approach balances performance and fault tolerance while integrating seamlessly
with other Spark components.
Key advantages of Spark Streaming:
• Real-time processing for continuous data flows.
• Scalability to handle large-scale streaming workloads.
• Integration with tools like Kafka, Flume, and HDFS for data ingestion.
• Fault tolerance, ensuring reliable data processing even in distributed environments.
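A minimal sketch of the micro-batch model using the classic DStream API, assuming a text source is listening on localhost port 9999 (for example, one started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingDemo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Each micro-batch of lines is processed as a small job on Spark Core.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the console

ssc.start()
ssc.awaitTermination()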
Spark MLlib
MLlib is Spark’s distributed machine learning library, designed to efficiently handle
large-scale data processing by leveraging Spark’s distributed memory-based
architecture. This enables faster computations compared to traditional disk-based
systems.
According to benchmarks conducted by MLlib developers, its Alternating Least
Squares (ALS) implementation significantly outperforms Apache Mahout’s original
disk-based version. Specifically, MLlib is:
• Nine times faster than Hadoop’s disk-based Apache Mahout.
• Optimized for in-memory processing, reducing I/O overhead and improving speed.
Since Mahout later introduced a Spark interface, performance differences have
become less pronounced, but MLlib remains widely used for scalable machine
learning tasks, such as regression, classification, clustering, and recommendation
systems.
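A hedged sketch of the ALS recommender mentioned above, using the DataFrame-based pyspark.ml API with made-up (userId, movieId, rating) data:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("AlsDemo").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "movieId", "rating"])  # toy data, purely illustrative

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 2 items for every user.
model.recommendForAllUsers(2).show(truncate=False)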
Spark GraphX
GraphX is Spark’s distributed graph-processing framework, designed for efficiently
handling large-scale graph computations within the Spark ecosystem. It provides a
specialized API for expressing graph algorithms, enabling users to model complex relationships using the Pregel abstraction, a popular iterative computation model for graphs.
Key features of GraphX include:
• Unified Data Representation: Merges graph processing with standard Spark APIs,
allowing seamless integration of structured data and graph analysis.
• Optimized Runtime: Offers an enhanced execution engine tailored for iterative graph
computations.
• Scalability: Handles massive datasets by leveraging Spark’s distributed computing
power.
• Built-in Algorithms: Includes pre-implemented algorithms such as PageRank, Connected Components, and Triangle Counting, simplifying common graph analytics tasks.
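GraphX itself exposes a Scala/JVM API. To keep the examples in Python, the sketch below uses the separate GraphFrames package, which provides comparable functionality (including PageRank) on top of DataFrames; it assumes graphframes is installed:

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, not GraphX itself

spark = SparkSession.builder.appName("GraphDemo").getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)

# Run PageRank, one of the built-in algorithms mentioned above.
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank").show()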
SparkR
SparkR is a lightweight R package that provides a front-end interface for using Apache
Spark within R. It enables data scientists to analyze large datasets efficiently and allows
interactive job execution directly from the R shell.
Key features of SparkR:
• Scalability – Combines R’s usability with Spark’s ability to process massive datasets.
• Interactive Analysis – Supports real-time data exploration from the R console.
• DataFrame API – Introduces a distributed DataFrame similar to Pandas or R’s data frames, allowing optimized manipulation of structured data.
• Integration with MLlib – Supports machine learning algorithms for scalable data modeling.
SparkR was developed to bridge the gap between R’s statistical capabilities and Spark’s
distributed processing power, making large-scale analytics accessible within an R
programming environment.
Catalyst Optimizer

Catalyst is the query optimization engine within Spark SQL. It enhances query
execution efficiency by:
• Logical Query Optimization: Automatically rewriting queries to improve performance.
• Physical Query Optimization: Selecting the best execution strategy based on available resources.
• Extensibility: Supports custom optimization rules for advanced tuning.
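Catalyst’s work can be inspected directly: calling explain(True) on a DataFrame prints the parsed, analyzed, and optimized logical plans along with the chosen physical plan. A small sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystDemo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Catalyst rewrites this query (e.g. pushing the filter down) before it runs.
query = df.select("name", "age").filter("age > 30")
query.explain(True)  # prints the logical and physical plans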

Tungsten Execution Engine

Tungsten is an advanced optimization framework within Spark, designed to maximize CPU and memory efficiency. Key features include:
• Binary Processing: Uses low-level memory management to reduce Java object overhead.
• Code Generation: Dynamically compiles queries into JVM bytecode for faster execution.
• Improved Memory Management: Reduces garbage collection issues for better performance.
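Tungsten’s output can also be inspected. A sketch, assuming Spark 3.x, where explain(mode="codegen") prints the Java source that whole-stage code generation produces and compiles for a query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TungstenDemo").getOrCreate()

# Whole-stage code generation is on by default; this config controls it.
print(spark.conf.get("spark.sql.codegen.wholeStage"))

df = spark.range(1000).selectExpr("id * 2 AS doubled")
df.explain(mode="codegen")  # show the generated code for this query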
RDDs (Resilient Distributed Datasets)

RDDs are the foundational data structure in Spark, enabling fault-tolerant parallel
processing. Important aspects:
• Immutable and Distributed: Data is partitioned across multiple nodes to ensure reliability.
• Lazy Evaluation: Computations are executed only when needed, optimizing efficiency.
• Fault Tolerance: Automatically recovers lost partitions in case of node failures.
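A short sketch of these properties with the low-level RDD API: transformations only record a lineage, and the final action triggers computation; that same lineage is what lets Spark recompute lost partitions:

from pyspark import SparkContext

sc = SparkContext(appName="RddDemo")

rdd = sc.parallelize(range(10), numSlices=4)  # spread across 4 partitions

# Transformations are lazy: nothing runs yet, only the lineage is recorded.
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# The action finally triggers execution across the cluster.
print(evens.collect())  # [0, 4, 8, 12, 16]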

DataFrames and Datasets

Spark introduced DataFrames and Datasets as higher-level alternatives to RDDs, offering enhanced performance and usability:
• DataFrames: Tabular data structure similar to Pandas or SQL tables.
• Datasets: Type-safe, object-oriented interface (available in Scala and Java) combining the benefits of RDDs and DataFrames.
• Optimized Execution: Uses the Catalyst Optimizer and Tungsten Engine for high-speed processing.
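A brief sketch contrasting the two levels of abstraction in Python (where the typed Dataset API is not available, so DataFrames play that role):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DfDemo").getOrCreate()

# The same data as an RDD and as a DataFrame.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
df = rdd.toDF(["name", "age"])  # promote to a DataFrame with named columns

# DataFrame operations go through Catalyst and Tungsten; RDD lambdas do not.
df.groupBy().avg("age").show()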
Integration with External Tools

Spark integrates seamlessly with various big data tools, allowing flexible data
ingestion and processing:
• Kafka & Flume: For real-time event streaming.
• HDFS & HBase: For large-scale distributed storage.
• Cassandra: For NoSQL database operations.
These integrations make Spark adaptable across diverse use cases.
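As one example, Structured Streaming can read directly from Kafka. A hedged sketch assuming a broker at localhost:9092 and a topic named events (the spark-sql-kafka connector must be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaDemo").getOrCreate()

# Subscribe to a Kafka topic; broker address and topic name are placeholders.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers binary key/value columns; cast the payload to a string.
messages = stream.selectExpr("CAST(value AS STRING) AS message")

query = messages.writeStream.format("console").start()
query.awaitTermination()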
Uses of Spark
Apache Spark plays a crucial role in modern data processing across multiple domains:
• Data Integration (ETL): Raw data from different sources often lacks consistency. The ETL (Extract, Transform, Load) process ensures that data is properly formatted and integrated for analysis. Spark streamlines ETL workflows by reducing cost and processing time, making data preparation more efficient (see the sketch after this list).
• Stream Processing: Handling real-time data streams, such as log files or sensor data, can be challenging. Spark Streaming enables scalable stream processing, helping detect anomalies and prevent fraudulent operations.
• Machine Learning: As data volumes grow, machine learning models become more accurate and feasible. Spark’s in-memory processing allows repeated computations to run efficiently, making it ideal for training ML algorithms.
• Interactive Analytics: Instead of relying on predefined queries, Spark allows dynamic, interactive data exploration, offering rapid responses and real-time insights.
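A compact ETL sketch in PySpark tying these ideas together: extract from CSV, transform (clean and normalize), and load to Parquet. All paths and column names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, trim

spark = SparkSession.builder.appName("EtlDemo").getOrCreate()

# Extract: read raw, inconsistent source data (placeholder path).
raw = spark.read.csv("raw_customers.csv", header=True)

# Transform: drop incomplete rows and normalize a text column.
clean = (raw.dropna(subset=["email"])
         .withColumn("email", lower(trim(col("email")))))

# Load: write the integrated result in an analysis-friendly format.
clean.write.mode("overwrite").parquet("warehouse/customers")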
