Learn by doing it
11 November 2024 15:27
What is Spark?
It is a general-purpose, in-memory computation engine.
Spark is called general purpose because every operation (data cleaning, SQL, and so on) can be performed within Spark itself.
Hadoop vs PySpark
Hadoop provides 3 components:
1. HDFS - storage
2. YARN - resource management
3. MapReduce - computation
We should not compare Spark with Hadoop as a whole; we should compare Spark with MapReduce, Hadoop's computation layer.
In-memory (RAM)
Spark performs all of its computation in memory.
How MapReduce and Spark computation work
In a MapReduce pipeline there are multiple MR (map/reduce) jobs. Each job takes its input from HDFS, processes it, and writes its output back to HDFS, and the next job reads that output from HDFS again. This disk read/write happens for every job in the chain, which is why MapReduce is slow.
In Spark, when there are multiple jobs, the first job takes the input (a one-time read from HDFS), performs its computation, and writes the result to memory only. The second job does not hit HDFS again; it processes the in-memory output of the previous job. So input and output touch the disk only twice (the initial read and the final write), which is why Spark is faster than MapReduce.
Hadoop is slower than Spark because:
1. Disk-based processing: Hadoop's MapReduce reads and writes data to disk after each stage, causing high I/O overhead. Spark, on the other hand, keeps data in memory using RDDs, which significantly speeds up processing.
2. Lack of optimization: Hadoop processes each MapReduce task independently, without advanced query optimizations. Spark leverages DAG (Directed Acyclic Graph) execution and the Catalyst optimizer for efficient query planning and execution.
SPARK ARCHITECTURE:
Apache Spark Architecture
Apache Spark is a distributed computing framework designed for processing large datasets efficiently. Its architecture ensures high performance,
scalability, and fault tolerance.
Core Components of Spark Architecture
1. Driver:
○ The Driver is the master process that controls the execution of the application.
○ It splits the job into smaller tasks and distributes them across worker nodes.
○ It maintains metadata, tracks task execution, and collects results.
2. Executors:
○ Executors are worker processes running on each node in the cluster.
○ They execute tasks assigned by the Driver and store intermediate and final results.
○ Each Executor runs multiple tasks and has its own memory for computation and storage.
3. Cluster Manager:
○ The Cluster Manager is responsible for resource allocation and managing the nodes in the cluster.
○ Supported Cluster Managers:
▪ Standalone: Spark's built-in cluster manager.
▪ YARN: Hadoop's cluster manager.
▪ Mesos: General-purpose cluster manager.
▪ Kubernetes: Container-based orchestration.
Key Concepts in Spark
1. RDD (Resilient Distributed Dataset):
○ Immutable distributed collections of objects.
○ Provides fault tolerance and parallel processing.
○ Operations on RDDs:
▪ Transformations (e.g., map, filter) – lazy operations that define a computation.
▪ Actions (e.g., collect, reduce) – trigger execution and return results.
2. DataFrame and Dataset:
○ DataFrame: Distributed collection of data organized into named columns (like a table in SQL).
○ Dataset: Strongly-typed API for structured data with compile-time type safety.
3. SparkSession:
○ The entry point for working with Spark. It allows access to APIs for RDDs, DataFrames, and SQL.
4. Partitions:
○ Spark divides data into smaller chunks called partitions to enable parallel processing.
○ Each partition is processed by a single task on an Executor.
5. Stages and Tasks:
○ A job is split into stages, and each stage contains multiple tasks.
○ Stages are determined by shuffle boundaries (data redistribution between nodes).
6. DAG (Directed Acyclic Graph):
○ Spark constructs a DAG to represent the sequence of transformations and actions.
○ The DAG Scheduler optimizes the execution by grouping operations and minimizing shuffles.
7. Shuffling:
○ The process of redistributing data across partitions to align with computation needs.
○ It occurs during operations like groupByKey and can impact performance.
8. Broadcast Variables:
○ Used to share read-only variables with all Executors to minimize data transfer.
9. Accumulators:
○ Write-only shared variables used for aggregating values across tasks (e.g., counters).
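A minimal sketch tying a few of these concepts together (the app name, local[*] master, and sample data are illustrative, not part of the original notes): SparkSession as the entry point, a small DataFrame with named columns, and the underlying SparkContext for RDD-level work and partitions.
from pyspark.sql import SparkSession

# Entry point for DataFrame and SQL APIs; app name and local master are illustrative
spark = SparkSession.builder.appName("ConceptsDemo").master("local[*]").getOrCreate()

# DataFrame: distributed data organized into named columns
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

# The underlying SparkContext is still available for RDD-level work
rdd = spark.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)  # 2 partitions
print(rdd.getNumPartitions())

spark.stop()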
Execution Flow
1. The user submits a Spark application (e.g., PySpark code) to the Driver.
2. The Driver creates a DAG of stages and tasks.
3. The Cluster Manager allocates resources (Executors).
4. The Driver sends tasks to the Executors for processing.
5. Executors execute tasks, process partitions, and return results to the Driver.
6. The final result is sent to the user.
Diagram of Spark Architecture
• Driver: Manages application lifecycle and schedules tasks.
• Cluster Manager: Allocates Executors to the Driver.
• Executors: Process partitions in parallel.
RDD:
What is RDD?
RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It represents an immutable, distributed collection of objects that
can be processed in parallel across a cluster.
Why RDD?
1. Fault Tolerance:
○ RDDs provide resilience through lineage. If a partition of data is lost, it can be recomputed from its source or prior transformations.
2. Distributed Computing:
○ Data is distributed across a cluster, enabling parallel processing and efficient utilization of resources.
3. Immutability:
○ RDDs are immutable, ensuring consistency and enabling optimization by Spark's execution engine.
4. Flexibility:
○ RDDs support a wide range of transformations and actions for data processing.
5. Low-Level Control:
○ Compared to higher-level APIs like DataFrames, RDDs give more control over how operations are executed, making them suitable for complex
or unstructured data processing.
How RDD Works?
1. Creation:
○ RDDs can be created from:
▪ Existing data (e.g., a collection in memory).
▪ External datasets (e.g., HDFS, S3, or local files).
▪ Transformations on other RDDs.
2. Transformations:
○ Operations like map, filter, or flatMap are lazy; they create a new RDD by applying a function but don’t immediately execute the operation.
3. Actions:
○ Operations like collect, count, or saveAsTextFile trigger execution, computing and returning results to the driver or saving them to storage.
4. Partitioning:
○ RDDs are automatically divided into partitions, which are processed in parallel.
5. Execution:
○ RDD operations generate a DAG (Directed Acyclic Graph), which Spark optimizes before execution.
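As a quick illustration of the creation, transformation, and action flow described above, here is a small word-count sketch (the input strings and app name are made up for the example):
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDFlowDemo")

# 1. Creation: from an in-memory collection (textFile() would be used for HDFS/S3/local files)
rdd = sc.parallelize(["spark makes rdds", "rdds are resilient"])

# 2. Transformations: lazy, each returns a new RDD
words = rdd.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# 3. Action: triggers the actual computation and returns results to the driver
print(counts.collect())

# 4. Partitioning: the data was split automatically; inspect it
print(counts.getNumPartitions())

sc.stop()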
Key Properties of RDD
1. Immutability:
○ Once created, an RDD cannot be modified. Transformations generate new RDDs.
2. Partitioning:
○ Data in an RDD is split into partitions, distributed across the nodes in a cluster.
3. Persistence:
○ RDDs can be cached in memory or disk using methods like persist() or cache() for faster access during iterative computations.
4. Fault Tolerance:
○ Spark reconstructs lost partitions using lineage information.
When to Use RDD?
1. Unstructured Data: When working with raw, unstructured, or semi-structured data.
2. Low-Level Control: When transformations or optimizations aren’t available in higher-level APIs.
3. Complex Logic: When implementing algorithms that require direct manipulation of distributed data.
Limitations of RDD
1. Performance:
○ RDDs do not use Spark's Catalyst optimizer or Tungsten execution engine, making them slower than DataFrames or Datasets for structured
data.
2. No Schema:
○ RDDs lack a schema, making them harder to work with for structured data.
3. Developer Overhead:
○ More code and effort are required compared to higher-level APIs like DataFrames.
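To make the developer-overhead point concrete, a small side-by-side sketch (sample data and names are illustrative): the RDD version manipulates raw tuples by position with no schema, while the DataFrame version works with named columns and benefits from Catalyst/Tungsten.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("RDDvsDF").master("local[*]").getOrCreate()
data = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD version: positional tuple access, no schema, no Catalyst optimization
rdd_names = (spark.sparkContext.parallelize(data)
             .filter(lambda t: t[1] > 30)
             .map(lambda t: t[0])
             .collect())

# DataFrame version: named columns, optimized execution
df = spark.createDataFrame(data, ["name", "age"])
df_names = [r["name"] for r in df.filter(F.col("age") > 30).select("name").collect()]

print(rdd_names, df_names)
spark.stop()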
Fault Tolerance: Interview Questions and Answers
1. What is fault tolerance in distributed systems?
• Answer:
Fault tolerance is the ability of a system to continue functioning properly in the event of hardware failures, software errors, or network issues. In
distributed systems, it ensures reliability and availability by detecting failures and recovering from them.
2. How does Apache Spark provide fault tolerance?
• Answer:
Spark achieves fault tolerance through:
1. Lineage: RDDs keep track of transformations used to compute the dataset. Lost partitions can be recomputed using this lineage.
2. Replication (for persisted data): When RDDs are cached, their partitions can be replicated across nodes, ensuring availability even if one node
fails.
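A brief sketch of both mechanisms (app name and data are illustrative): lineage alone lets Spark recompute a lost partition from parallelize + map, while a replicated storage level additionally keeps a cached copy on a second node.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "FaultToleranceDemo")

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Lineage: if a partition is lost, Spark re-runs parallelize + map for just that partition.
# Replication: the "_2" storage levels keep two copies of each cached partition.
rdd.persist(StorageLevel.MEMORY_ONLY_2)

print(rdd.sum())
sc.stop()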
3. What is lineage in Apache Spark, and how does it contribute to fault tolerance?
• Answer:
Lineage is a directed acyclic graph (DAG) that records the sequence of transformations applied to an RDD. In case of a failure, Spark uses this
lineage to recompute lost data, ensuring fault tolerance without requiring replication of all data.
4. How does Spark handle node failures?
• Answer:
When a node fails:
1. Lost RDD partitions are recomputed using the lineage graph.
2. If data is cached and replicated, Spark retrieves a copy from another node.
3. Task scheduling ensures that failed tasks are retried on other available nodes.
5. What happens if a Spark executor fails during computation?
• Answer:
If an executor fails:
1. Tasks running on the executor are rescheduled on other executors.
2. Lost RDD partitions are recomputed using lineage or retrieved from cached replicas.
3. If the driver fails, the Spark application is terminated unless run in a highly available mode.
6. How does checkpointing differ from lineage in fault tolerance?
• Answer:
1. Lineage: Maintains the history of transformations to recompute data. It is lightweight but may become inefficient for long lineage chains.
2. Checkpointing: Saves RDDs to stable storage (e.g., HDFS). It is useful when the lineage graph becomes too long or computation is iterative.
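A minimal checkpointing sketch, assuming a long lineage built up by an iterative loop (the checkpoint directory is illustrative; in production it is typically an HDFS or S3 path):
from pyspark import SparkContext

sc = SparkContext("local[*]", "CheckpointDemo")
sc.setCheckpointDir("/tmp/spark-checkpoints")   # illustrative path

rdd = sc.parallelize(range(100))
for _ in range(50):                  # iterative job => long lineage chain
    rdd = rdd.map(lambda x: x + 1)

rdd.checkpoint()                     # truncate the lineage by saving to stable storage
print(rdd.count())                   # the action materializes the RDD and writes the checkpoint
sc.stop()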
7. Can you explain fault tolerance in Spark Streaming?
• Answer:
Spark Streaming ensures fault tolerance by:
1. Metadata Checkpointing: Saves information like offsets and DAGs to recover the stream processing state.
2. Data Checkpointing: Optionally saves intermediate RDDs to stable storage for long-running computations.
3. Spark's lineage recomputes lost data for unprocessed batches.
8. What is the role of replication in Spark fault tolerance?
• Answer:
Replication creates multiple copies of cached RDD partitions across nodes. If one node fails, Spark retrieves the data from a replica, reducing the
need for recomputation and speeding up recovery.
9. How does Spark ensure reliability in case of task failure?
• Answer:
Spark retries failed tasks up to a configurable number of times (spark.task.maxFailures). If a task continues to fail after retries, the job is
terminated. Task rescheduling ensures other healthy nodes take over the computation.
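A small configuration sketch for this setting (the value 8 is illustrative; retry behaviour mainly matters on a real cluster rather than in local mode):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("RetryDemo")
        .setMaster("local[*]")
        .set("spark.task.maxFailures", "8"))   # default is 4; 8 here is just an example

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.task.maxFailures"))
sc.stop()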
10. What are the limitations of Spark’s fault tolerance mechanism?
• Answer:
1. Lineage Overhead: Long lineage chains can slow down recomputation.
2. Checkpointing Cost: Frequent checkpointing can increase I/O overhead.
3. Driver Failure: Unless configured for high availability, driver failure can cause job termination.
DAG and Lazy Evaluation in Apache Spark
1. What is a DAG in Spark?
DAG (Directed Acyclic Graph) is a data structure that represents the sequence of computations performed on data in Apache Spark.
• Directed: The graph has a direction, indicating the flow of data transformations.
• Acyclic: There are no cycles; you can't return to a previous node.
• Graph: Nodes represent RDDs or data transformations, and edges represent dependencies between them.
Purpose of DAG:
• Optimizes execution by analyzing the logical plan before running.
• Groups transformations into stages for efficient processing.
Why is DAG important in Spark?
1. Fault Tolerance: DAG helps recompute lost partitions using lineage.
2. Optimized Execution: Spark analyzes the DAG to determine the best execution strategy.
3. Parallelism: The DAG ensures transformations are parallelized where possible.
Example of DAG Formation
from pyspark import SparkContext
sc = SparkContext("local", "DAG Example")
data = [1, 2, 3, 4]
rdd = sc.parallelize(data) # Node 1: Source RDD
mapped_rdd = rdd.map(lambda x: x * 2) # Node 2: Transformation
filtered_rdd = mapped_rdd.filter(lambda x: x > 4) # Node 3: Transformation
result = filtered_rdd.collect() # Action
DAG Explanation:
• Nodes: parallelize, map, filter.
• Edges: Dependencies between transformations.
• Stages: Spark breaks the DAG into stages, each representing a collection of tasks that can run in parallel.
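To inspect the lineage side of this DAG before any action runs, toDebugString() can be called on the final RDD, continuing the example above (in PySpark it returns bytes, so it is decoded here for readability):
# Continuing from the DAG example above
print(filtered_rdd.toDebugString().decode("utf-8"))   # shows the parallelize -> map -> filter lineage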
2. What is Lazy Evaluation in Spark?
Lazy Evaluation means that Spark doesn’t execute transformations immediately when they’re called. Instead, it builds a DAG of transformations
and only executes when an action (like collect, count) triggers computation.
Why Lazy Evaluation?
1. Optimization:
○ Spark analyzes the DAG before execution to combine transformations into efficient physical execution plans.
○ Reduces the number of passes through the data.
2. Fault Tolerance:
○ Delaying execution helps Spark recompute only the required transformations in case of failure.
3. Efficiency:
○ Spark can skip unnecessary transformations by optimizing the plan.
Example of Lazy Evaluation:
data = [1, 2, 3, 4]
rdd = sc.parallelize(data) # No computation yet
mapped_rdd = rdd.map(lambda x: x * 2) # Still no computation
filtered_rdd = mapped_rdd.filter(lambda x: x > 4) # Still no computation
result = filtered_rdd.collect() # Computation triggered here
• When Computation Happens: Only when collect() is called.
• Spark builds the DAG during transformations and optimizes it before execution.
Relation Between DAG and Lazy Evaluation
1. DAG Construction: Spark builds a DAG during lazy evaluation as transformations are called.
2. Execution: When an action is triggered, Spark optimizes and executes the DAG in stages.
3. Fault Recovery: The DAG is used to recompute lost data efficiently, leveraging lazy evaluation.
Benefits of DAG and Lazy Evaluation
1. Optimized Query Execution:
○ Spark combines multiple transformations into stages to minimize shuffles and I/O.
2. Parallelism:
○ The DAG enables tasks to be distributed across a cluster for efficient execution.
3. Reduced Overhead:
○ Lazy evaluation ensures that Spark processes only what is needed, reducing unnecessary computations.
Interview Questions and Answers: DAG and Lazy Evaluation in Spark
1. What is a DAG in Apache Spark, and why is it important?
Answer:
A DAG (Directed Acyclic Graph) in Spark is a logical representation of the sequence of computations to be performed on data. Nodes in the DAG represent
RDDs or transformations, while edges represent dependencies between operations.
Importance:
• Optimized Execution: Spark divides the DAG into stages and tasks, reducing unnecessary computations.
• Fault Tolerance: DAG tracks lineage, allowing Spark to recompute lost data in case of failures.
• Parallelism: Identifies independent tasks for efficient execution across a cluster.
2. How does Spark create and execute a DAG?
Answer:
• Creation: Spark builds a DAG during the definition of transformations (e.g., map, filter). These transformations are lazily evaluated and stored as a
logical execution plan.
• Execution: When an action (e.g., collect, count) is called, Spark optimizes the DAG by grouping transformations into stages. Each stage is executed as a
series of parallel tasks.
3. What is lazy evaluation in Spark, and how does it work?
Answer:
Lazy evaluation means Spark does not execute transformations immediately but instead builds a DAG of transformations.
How it works:
1. When transformations (e.g., map, filter) are called, they are added to the DAG as nodes.
2. Execution is deferred until an action (e.g., collect, save) is invoked.
3. At that point, Spark optimizes the DAG and processes data in an efficient manner.
4. Why does Spark use lazy evaluation?
Answer:
Lazy evaluation offers the following advantages:
• Optimization: Spark analyzes the entire transformation pipeline to minimize unnecessary computations.
• Reduced I/O: Combines multiple transformations into a single stage to reduce data shuffling.
• Fault Tolerance: Delaying execution ensures that Spark recomputes only the required transformations in case of failure.
5. Can you explain the relationship between DAG and lazy evaluation in Spark?
Answer:
• DAG Construction: Lazy evaluation is used to construct the DAG as transformations are defined.
• Execution Trigger: The DAG is only executed when an action is called.
• Optimization: Spark optimizes the DAG for efficient execution before processing data.
For example, if you perform a series of transformations on an RDD, Spark does not execute them immediately. It builds a DAG of transformations. When an
action like collect() is invoked, Spark optimizes and executes the DAG.
6. What are the benefits of DAG in Spark?
Answer:
1. Optimized Execution: Reduces shuffles and combines narrow transformations into a single stage.
2. Fault Tolerance: Tracks lineage for recomputation of lost partitions.
3. Parallelism: Enables Spark to process tasks in parallel across nodes.
7. Provide an example demonstrating DAG and lazy evaluation.
Answer:
data = [1, 2, 3, 4]
rdd = sc.parallelize(data)                         # DAG Node 1: RDD creation
mapped_rdd = rdd.map(lambda x: x * 2)              # DAG Node 2: Transformation
filtered_rdd = mapped_rdd.filter(lambda x: x > 4)  # DAG Node 3: Transformation
result = filtered_rdd.collect()                    # Action triggers execution of the DAG
Explanation:
• When parallelize, map, and filter are called, Spark builds the DAG without executing.
• Execution begins only when the collect action is called, optimizing transformations into stages for efficient execution.
8. How does Spark optimize the DAG before execution?
Answer:
• Stage Division: Spark divides the DAG into stages based on shuffle dependencies.
• Pipelining: Groups narrow transformations (e.g., map, filter) into a single stage to minimize data shuffling.
• Task Scheduling: Tasks within a stage are distributed across cluster nodes for parallel processing.
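For DataFrames, the optimized plan and its shuffle boundaries can be inspected with explain(); a quick sketch (column names and expressions are illustrative), where the Exchange operator in the output marks the stage boundary:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PlanDemo").master("local[*]").getOrCreate()

df = spark.range(1000).withColumn("doubled", F.col("id") * 2)
narrow = df.filter(F.col("doubled") > 10)                              # narrow ops: pipelined into one stage
grouped = narrow.groupBy((F.col("id") % 10).alias("bucket")).count()   # shuffle => new stage

grouped.explain()   # prints the physical plan; "Exchange" marks the shuffle boundary

spark.stop()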
9. What is the role of DAG in Spark’s fault tolerance?
Answer:
The DAG tracks the lineage of transformations applied to data. In case of failure:
1. Spark uses the DAG to identify lost partitions.
2. It recomputes only the required transformations for the affected data.
This ensures efficient recovery without reprocessing the entire dataset.
10. What are the limitations of DAG and lazy evaluation in Spark?
Answer:
1. Long Lineages: DAGs with long lineage chains can slow down recovery and execution.
2. Checkpointing Overhead: While checkpointing mitigates long lineage issues, it introduces storage and I/O overhead.
3. Complexity: Managing and debugging DAGs for large-scale jobs with complex dependencies can be challenging.
Cache vs Persist in Apache Spark
Cache and Persist are both mechanisms in Apache Spark used to store intermediate RDDs (Resilient Distributed Datasets) or DataFrames in memory or on
disk for faster access during future operations. While they serve similar purposes, they have some differences in terms of storage levels and flexibility.
1. What is cache in Spark?
Answer:
cache() is shorthand for persist() with the default storage level. For DataFrames and Datasets the default is MEMORY_AND_DISK: data is stored in memory, and if there isn't enough memory it spills over to disk. (For plain RDDs, cache() defaults to MEMORY_ONLY.)
Default Storage Level (DataFrames):
• MEMORY_AND_DISK: Spark first tries to store data in memory, and if memory is insufficient, it writes the data to disk.
When to use cache:
• When you have data that will be accessed repeatedly and fits into memory, or when memory is large enough to hold the dataset.
2. What is persist in Spark?
Answer:
persist() allows you to store an RDD or DataFrame in memory or on disk using various storage levels (like MEMORY_ONLY, DISK_ONLY,
MEMORY_AND_DISK, etc.). persist() is more flexible than cache() because it allows specifying the storage level.
Example Storage Levels:
• MEMORY_ONLY: Stores data only in memory.
• MEMORY_AND_DISK: Stores data in memory, spills to disk if memory is full.
• DISK_ONLY: Stores data only on disk.
• MEMORY_ONLY_SER: Stores data in memory in a serialized format (lower memory usage).
• OFF_HEAP: Stores data in off-heap memory (not recommended in many cases).
When to use persist:
• When you need fine-grained control over the storage level and decide whether to store the data in memory, disk, or both.
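A short sketch showing both calls side by side (dataset sizes and expressions are illustrative):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachePersistDemo").master("local[*]").getOrCreate()
df = spark.range(1_000_000)

cached = df.cache()                 # default storage level; materialized by the first action
cached.count()

persisted = df.selectExpr("id * 2 AS doubled").persist(StorageLevel.DISK_ONLY)   # explicit level
persisted.count()

cached.unpersist()                  # release memory/disk when no longer needed
persisted.unpersist()
spark.stop()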
3. Key Differences Between cache and persist
Feature       | cache()                                               | persist()
Storage Level | Default is MEMORY_AND_DISK                            | Allows specifying any storage level (e.g., MEMORY_ONLY, DISK_ONLY)
Flexibility   | Less flexible, suitable for basic caching needs       | More flexible, suitable for advanced use cases where control over storage is needed
Memory Usage  | Stores data in memory first, spills to disk if needed | Can store data in memory, disk, or both, based on the storage level
Use Case      | Quick caching of data to speed up operations          | Suitable for complex scenarios requiring custom storage management
4. When to use cache and persist?
• Use cache():
○ When you want a quick, default caching mechanism and are okay with
Spark using MEMORY_AND_DISK storage.
○ For situations where data fits in memory, but if not, it can spill to disk.
• Use persist():
○ When you need more control over where the data is stored (in memory,
on disk, serialized format, etc.).
○ For long-running jobs or iterative algorithms (e.g., machine learning)
where different storage levels can help optimize performance.
6. Performance Implications
• cache():
○ Faster, simpler caching with the default MEMORY_AND_DISK level. Best used for datasets that can fit into memory or benefit from spilling
to disk.
• persist():
○ More control over performance. You can choose a storage level that best matches the size of your data and the nature of the operations.
For example, MEMORY_ONLY can be faster but can lead to OutOfMemoryError for large datasets, while DISK_ONLY may be slower but can
handle larger-than-memory datasets.
7. Memory Considerations
• cache() will use memory first, and only spill to disk if the memory is insufficient. This is often ideal for datasets that are small enough to fit into
memory, with the option for disk fallback.
• persist(StorageLevel.MEMORY_ONLY) stores data only in memory. This is ideal for small datasets that will fit comfortably in memory, providing
the fastest access speed.
• persist(StorageLevel.DISK_ONLY) stores data only on disk. This is useful when you have a large dataset that won’t fit in memory, but accessing
data from disk is slower compared to memory.
8. Can you unpersist data after caching or persisting?
Answer:
Yes, you can unpersist cached or persisted data to free up memory or disk space by calling the unpersist() method:
rdd.unpersist() # Frees up the memory/disk used for the cached or persisted RDD
Parquet in Apache Spark
Parquet is a popular columnar storage format supported by Apache Spark and many other big data processing frameworks. It is optimized for
analytical workloads and is widely used in the industry.
1. What is Parquet?
• Definition: Parquet is a columnar file format that organizes data by columns instead of rows, enabling efficient storage and querying of
structured and semi-structured data.
• File Type: Binary, highly compressed format.
• Supported By: Spark, Hive, Presto, Snowflake, etc.
2. Features of Parquet
1. Columnar Storage:
○ Data for each column is stored together, enabling faster column-wise operations like filtering and aggregation.
2. Efficient Compression:
○ Parquet applies column-level compression, often achieving better compression ratios compared to row-based formats like CSV or JSON.
3. Schema Evolution:
○ Parquet supports schema evolution, allowing you to add, remove, or modify columns without breaking existing data.
4. Predicate Pushdown:
○ Filters are pushed down to the Parquet reader, so only matching row groups are scanned; combined with column pruning (e.g., SELECT col1 reads only col1's data), this skips unnecessary data.
5. Splittable:
○ Parquet files are splittable, enabling parallel processing across partitions.
3. Why Use Parquet in Spark?
1. Performance:
○ Columnar format enables faster read operations for analytical queries.
○ Compression reduces storage and I/O costs.
2. Integration:
○ Works seamlessly with Spark's DataFrame and Dataset APIs.
○ Optimized for Spark's Catalyst and Tungsten engines.
3. Cost-Effective:
○ Saves storage space and reduces read/write time, leading to lower compute costs in cloud environments.
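A minimal read/write sketch (the /tmp path and sample data are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetDemo").master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "alice", 34), (2, "bob", 45)], ["id", "name", "age"])

# Write as Parquet (columnar, compressed)
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Read it back: only the selected columns are scanned, and the filter can be pushed down
people = spark.read.parquet("/tmp/people.parquet")
people.select("name").filter(people.age > 40).show()

spark.stop()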
5. Advantages of Parquet
Feature            | Benefit
Columnar Format    | Faster column-wise operations like filtering and aggregation.
Compression        | Reduces storage and I/O costs.
Predicate Pushdown | Improves query performance by skipping unnecessary data.
Interoperability   | Supported by multiple tools in the big data ecosystem.
Splittable         | Allows parallel processing for large-scale jobs.
6. Comparison with Other Formats
Feature          | Parquet            | CSV       | JSON
Storage Format   | Columnar           | Row-based | Row-based
Compression      | Highly efficient   | Low       | Moderate
Read Performance | Fast for analytics | Slower    | Slower
Schema Support   | Yes                | No        | No
7. Interview Questions
1. What is Parquet, and why is it used in Spark?
○ Parquet is a columnar storage format optimized for analytical workloads, enabling faster queries, better compression, and efficient
storage.
2. What are the benefits of using Parquet in Spark?
○ Columnar storage for faster queries, efficient compression, schema evolution, predicate pushdown, and integration with Spark SQL.
3. What is predicate pushdown in Parquet?
○ Predicate pushdown allows queries to filter data at the file level by reading only the relevant rows and columns, improving performance.
4. How does Parquet differ from row-based formats like CSV?
○ Parquet stores data column-wise, enabling faster analytical queries, better compression, and reduced I/O compared to CSV, which is row-
based.
5. What compression formats are supported by Parquet?
○ Common formats include Snappy, Gzip, and None. Snappy is the default in Spark.
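The codec can be chosen per write or set globally; a quick sketch (paths are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetCompression").master("local[*]").getOrCreate()
df = spark.range(100)

# Per-write codec: gzip trades CPU for smaller files (snappy is the default)
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/range_gzip.parquet")

# Global default via configuration
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.mode("overwrite").parquet("/tmp/range_snappy.parquet")

spark.stop()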