
Apache Spark

Lecture 8
Introduction

Industries widely use Hadoop to analyse large datasets because it follows a simple
programming model (MapReduce) and offers scalability, flexibility, fault tolerance, and
cost-effectiveness.
However, processing large amounts of data quickly remains a challenge, particularly in
terms of query response time and program execution speed. To address this, the Apache
Software Foundation introduced Spark, which significantly improves computational
performance.
Contrary to popular belief, Spark is not an improved version of Hadoop. It operates independently because it has its own cluster management system; Hadoop is just one of several ways to deploy Spark. Spark interacts with Hadoop in two main ways:
1. Storage: It can use Hadoop’s distributed file system (HDFS) to store data.
2. Processing: While Spark has its own computational framework, it can use Hadoop’s resources (such as YARN) when needed.
Because Spark manages computation on its own, it primarily relies on Hadoop for data storage rather than processing.
Apache Spark is a fast, general-purpose cluster computing framework built for large-scale data processing. While it builds on Hadoop’s MapReduce model, it extends that model to handle additional computation types, such as interactive queries and stream processing.
The key advantage of Spark is its in-memory cluster computing, which significantly boosts
processing speed by reducing the need for disk reads and writes. Spark is designed to
efficiently handle various workloads, including:
1. Batch Processing: Handling large volumes of data at once.
2. Iterative Algorithms: Ideal for machine learning tasks.
3. Interactive Queries: Enables real-time data exploration.
4. Streaming Processing: Processes continuous flows of data.
By integrating these capabilities into a single system, Spark minimizes the complexity of
managing separate tools, making big data processing more streamlined.
Apache Spark provides high-level APIs in Java, Scala, Python, and R, making it
accessible to developers across different programming environments. Although
Spark itself is written in Scala, it offers rich and efficient APIs for all supported
languages, allowing developers to build and run Spark applications effectively.
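As a concrete illustration, here is a minimal word-count application written against the Python API; the same program can be expressed almost line for line in Scala or Java, and the input path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file, split each line into words, and count occurrences.
words = (spark.read.text("input.txt")  # placeholder path
         .select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word").count())
words.show()

spark.stop()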
The biggest advantage of Spark lies in its speed. Compared to Hadoop MapReduce:
• In-Memory Processing: Spark runs up to 100 times faster when computations fit in memory.
• On-Disk Processing: Spark still outperforms Hadoop, running up to 10 times faster in disk-based mode.
This performance boost is primarily due to Spark’s ability to keep data in memory instead of repeatedly writing intermediate results to disk, which is a limitation of the traditional MapReduce model.
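Much of this speed-up is available explicitly through caching: once a dataset is marked for in-memory storage, later actions reuse it without rereading the disk. A small sketch, assuming a CSV file with a status column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

df = spark.read.csv("events.csv", header=True)  # placeholder input

df.cache()   # mark the DataFrame for in-memory storage
df.count()   # the first action materializes and caches the data

# Later actions read from memory instead of re-scanning the file on disk.
df.filter(df["status"] == "error").count()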
Apache Spark: Timeline
Apache Spark: Features

• Speed: Spark significantly accelerates computation in a Hadoop cluster, running up to 100 times faster in memory and 10 times faster on disk. This efficiency comes from reducing excessive read/write operations, as intermediate data is stored in memory rather than on disk.
• Multi-language Support: Spark includes built-in APIs for Java, Scala, Python, and R, allowing developers to write applications in different programming languages. It also provides 80+ high-level operators for interactive querying, making data manipulation easier.
• Advanced Analytics: Unlike traditional MapReduce, Spark expands its capabilities to support a variety of data processing tasks, including:
  • SQL queries
  • Real-time streaming
  • Machine learning (ML)
  • Graph processing
By integrating these diverse functionalities, Spark reduces the complexity of managing separate tools for different analytics needs.
Apache Spark on Hadoop

The diagram presents three configurations for integrating Apache Spark with Hadoop components:
1. Standalone Mode: Spark runs independently, using HDFS (Hadoop Distributed File System) for storage but managing its own execution.
2. YARN (Hadoop 2.x): Spark is integrated with YARN, Hadoop’s resource manager, allowing it to share cluster resources dynamically while still relying on HDFS for storage.
3. SIMR (Hadoop V1 - Spark in MapReduce): Spark is executed within MapReduce, leveraging Hadoop’s existing framework for processing tasks while utilizing HDFS.
This diagram highlights Spark’s flexibility in deployment, showing how it can function independently or alongside various Hadoop components.
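In code, the deployment mode is selected through the master setting when the application starts (in practice it is more often passed on the command line via spark-submit --master). A sketch with placeholder host names:

from pyspark.sql import SparkSession

# Standalone mode: connect to Spark's own cluster manager.
spark = (SparkSession.builder
         .master("spark://master-host:7077")  # placeholder host
         .appName("StandaloneApp")
         .getOrCreate())

# YARN mode would instead use .master("yarn"), letting Hadoop's resource
# manager allocate executors while HDFS continues to provide storage.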
Apache Spark Components
Spark Core
Spark Core serves as the foundation of the Apache Spark framework, acting as the
primary execution engine for all Spark applications. It is responsible for essential
functions such as task scheduling, memory management, fault recovery, and
interacting with storage systems.
To accommodate diverse workloads, Spark Core provides a generalized platform that
supports multiple specialized libraries, including:
1. Spark SQL – Enables structured querying and processing of data using SQL.
2. Spark Streaming – Handles real-time data streams for continuous processing.
3. MLlib – A library for scalable machine learning algorithms.
4. GraphX – Designed for graph processing and analytics.
This modular architecture ensures that Spark can efficiently manage a wide range of batch, interactive, and streaming workloads while maintaining high performance.
Spark SQL
Spark SQL is a powerful component built on top of Apache Spark that allows users
to run SQL and HQL queries seamlessly. It enables efficient processing of both
structured and semi-structured data, making it ideal for working with databases,
JSON files, and other formatted datasets.
One of its biggest advantages is performance: Spark SQL can execute unmodified SQL and Hive queries up to 100 times faster than traditional disk-based systems, thanks to its in-memory processing and query optimization techniques.
Additionally, Spark SQL supports:
• Integration with existing databases such as Hive, allowing compatibility with HQL.
• DataFrame API, which offers flexibility for programmatic access in Java, Scala, Python, and R.
• Catalyst Optimizer, which enhances query execution efficiency.
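A short sketch of both query styles, registering a DataFrame as a temporary view for SQL and then expressing the same query through the DataFrame API (the data is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# SQL style: register a temporary view and query it with plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# DataFrame API style: the same query expressed programmatically.
df.filter(df["age"] > 30).select("name").show()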
Spark Streaming
Spark Streaming is a powerful component of Apache Spark that enables real-time data
processing and interactive analytics. It allows applications to handle live data streams
efficiently, making it ideal for scenarios like log monitoring, fraud detection, and real-time dashboard updates.
Instead of processing each individual event separately, Spark Streaming converts
incoming live data into micro-batches, which are then processed on Spark Core.
This approach balances performance and fault tolerance while integrating seamlessly
with other Spark components.
Key advantages of Spark Streaming:
• Real-time processing for continuous data flows.
• Scalability to handle large-scale streaming workloads.
• Integration with tools like Kafka, Flume, and HDFS for data ingestion.
• Fault tolerance, ensuring reliable data processing even in distributed environments.
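A minimal sketch of the micro-batch model using the classic DStream API, assuming a text source is listening on localhost port 9999 (for example, one started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingDemo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Each micro-batch of lines is processed as a small job on Spark Core.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts to the console

ssc.start()
ssc.awaitTermination()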
Spark MLlib
MLlib is Spark’s distributed machine learning library, designed to efficiently handle
large-scale data processing by leveraging Spark’s distributed memory-based
architecture. This enables faster computations compared to traditional disk-based
systems.
According to benchmarks conducted by MLlib developers, its Alternating Least
Squares (ALS) implementation significantly outperforms Apache Mahout’s original
disk-based version. Specifically, MLlib is:
• Nine times faster than Hadoop’s disk-based Apache Mahout.
• Optimized for in-memory processing, reducing I/O overhead and improving speed.
Since Mahout later introduced a Spark interface, performance differences have
become less pronounced, but MLlib remains widely used for scalable machine
learning tasks, such as regression, classification, clustering, and recommendation
systems.
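A hedged sketch of the ALS recommender mentioned above, using the DataFrame-based pyspark.ml API with made-up (userId, movieId, rating) data:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("AlsDemo").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["userId", "movieId", "rating"])  # toy data, purely illustrative

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 2 items for every user.
model.recommendForAllUsers(2).show(truncate=False)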
Spark GraphX
GraphX is Spark’s distributed graph-processing framework, designed for efficiently
handling large-scale graph computations within the Spark ecosystem. It provides a
specialized API for expressing graph algorithms, enabling users to model complex relationships using the Pregel abstraction, a popular iterative computation model for graphs.
Key features of GraphX include:
• Unified Data Representation: Merges graph processing with standard Spark APIs,
allowing seamless integration of structured data and graph analysis.
• Optimized Runtime: Offers an enhanced execution engine tailored for iterative graph
computations.
• Scalability: Handles massive datasets by leveraging Spark’s distributed computing
power.
• Built-in Algorithms: Includes pre-implemented algorithms such as PageRank, Connected Components, and Triangle Counting, simplifying common graph analytics tasks.
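GraphX itself exposes a Scala/JVM API. To keep the examples in Python, the sketch below uses the separate GraphFrames package, which provides comparable functionality (including PageRank) on top of DataFrames; it assumes graphframes is installed:

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, not GraphX itself

spark = SparkSession.builder.appName("GraphDemo").getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)

# Run PageRank, one of the built-in algorithms mentioned above.
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank").show()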
SparkR
SparkR is a lightweight R package that provides a front-end interface for using Apache
Spark within R. It enables data scientists to analyze large datasets efficiently and allows
interactive job execution directly from the R shell.
Key features of SparkR:
• Scalability – Combines R’s usability with Spark’s ability to process massive datasets.
• Interactive Analysis – Supports real-time data exploration from the R console.
• DataFrame API – Introduces a distributed DataFrame similar to Pandas or R’s data frames, allowing optimized manipulation of structured data.
• Integration with MLlib – Supports machine learning algorithms for scalable data modeling.
SparkR was developed to bridge the gap between R’s statistical capabilities and Spark’s
distributed processing power, making large-scale analytics accessible within an R
programming environment.
Catalyst Optimizer

Catalyst is the query optimization engine within Spark SQL. It enhances query
execution efficiency by:
• Logical Query Optimization: Automatically rewriting queries to improve performance.
• Physical Query Optimization: Selecting the best execution strategy based on available resources.
• Extensibility: Supports custom optimization rules for advanced tuning.
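Catalyst’s work can be inspected directly: calling explain(True) on a DataFrame prints the parsed, analyzed, and optimized logical plans along with the chosen physical plan. A small sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystDemo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Catalyst rewrites this query (e.g. pushing the filter down) before it runs.
query = df.select("name", "age").filter("age > 30")
query.explain(True)  # prints the logical and physical plans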

Tungsten Execution Engine

Tungsten is an advanced optimization framework within Spark, designed to maximize CPU and memory efficiency. Key features include:
• Binary Processing: Uses low-level memory management to reduce Java object overhead.
• Code Generation: Dynamically compiles queries into JVM bytecode for faster execution.
• Improved Memory Management: Reduces garbage collection issues for better performance.
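Tungsten’s output can also be inspected. A sketch, assuming Spark 3.x, where explain(mode="codegen") prints the Java source that whole-stage code generation produces and compiles for a query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TungstenDemo").getOrCreate()

# Whole-stage code generation is on by default; this config controls it.
print(spark.conf.get("spark.sql.codegen.wholeStage"))

df = spark.range(1000).selectExpr("id * 2 AS doubled")
df.explain(mode="codegen")  # show the generated code for this query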
RDDs (Resilient Distributed Datasets)

RDDs are the foundational data structure in Spark, enabling fault-tolerant parallel
processing. Important aspects:
• Immutable and Distributed: Data is partitioned across multiple nodes to ensure reliability.
• Lazy Evaluation: Computations are executed only when needed, optimizing efficiency.
• Fault Tolerance: Automatically recovers lost partitions in case of node failures.
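A short sketch of these properties with the low-level RDD API: transformations only record a lineage, and the final action triggers computation; that same lineage is what lets Spark recompute lost partitions:

from pyspark import SparkContext

sc = SparkContext(appName="RddDemo")

rdd = sc.parallelize(range(10), numSlices=4)  # spread across 4 partitions

# Transformations are lazy: nothing runs yet, only the lineage is recorded.
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# The action finally triggers execution across the cluster.
print(evens.collect())  # [0, 4, 8, 12, 16]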

DataFrames and Datasets

Spark introduced DataFrames and Datasets as higher-level alternatives to RDDs, offering enhanced performance and usability:
• DataFrames: Tabular data structure similar to Pandas or SQL tables.
• Datasets: Type-safe, object-oriented interface (available in Scala and Java) combining the benefits of RDDs and DataFrames.
• Optimized Execution: Uses the Catalyst Optimizer and Tungsten Engine for high-speed processing.
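A brief sketch contrasting the two levels of abstraction in Python (where the typed Dataset API is not available, so DataFrames play that role):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DfDemo").getOrCreate()

# The same data as an RDD and as a DataFrame.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
df = rdd.toDF(["name", "age"])  # promote to a DataFrame with named columns

# DataFrame operations go through Catalyst and Tungsten; RDD lambdas do not.
df.groupBy().avg("age").show()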
Integration with External Tools

Spark integrates seamlessly with various big data tools, allowing flexible data
ingestion and processing:
• Kafka & Flume: For real-time event streaming.
• HDFS & HBase: For large-scale distributed storage.
• Cassandra: For NoSQL database operations.
These integrations make Spark adaptable across diverse use cases.
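As one example, Structured Streaming can read directly from Kafka. A hedged sketch assuming a broker at localhost:9092 and a topic named events (the spark-sql-kafka connector must be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaDemo").getOrCreate()

# Subscribe to a Kafka topic; broker address and topic name are placeholders.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers binary key/value columns; cast the payload to a string.
messages = stream.selectExpr("CAST(value AS STRING) AS message")

query = messages.writeStream.format("console").start()
query.awaitTermination()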
Uses of Spark
Apache Spark plays a crucial role in modern data processing across multiple domains:
• Data Integration (ETL): Raw data from different sources often lacks consistency. The ETL (Extract, Transform, Load) process ensures that data is properly formatted and integrated for analysis. Spark streamlines ETL workflows by reducing cost and processing time, making data preparation more efficient (see the sketch after this list).
• Stream Processing: Handling real-time data streams, such as log files or sensor data, can be challenging. Spark Streaming enables scalable stream processing, helping detect anomalies and prevent fraudulent operations.
• Machine Learning: As data volumes grow, machine learning models become more accurate and feasible. Spark’s in-memory processing allows repeated computations to run efficiently, making it ideal for training ML algorithms.
• Interactive Analytics: Instead of relying on predefined queries, Spark allows dynamic, interactive data exploration, offering rapid responses and real-time insights.
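A compact ETL sketch in PySpark tying these ideas together: extract from CSV, transform (clean and normalize), and load to Parquet. All paths and column names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, trim

spark = SparkSession.builder.appName("EtlDemo").getOrCreate()

# Extract: read raw, inconsistent source data (placeholder path).
raw = spark.read.csv("raw_customers.csv", header=True)

# Transform: drop incomplete rows and normalize a text column.
clean = (raw.dropna(subset=["email"])
         .withColumn("email", lower(trim(col("email")))))

# Load: write the integrated result in an analysis-friendly format.
clean.write.mode("overwrite").parquet("warehouse/customers")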
