Learn by doing it
11 November 2024 15:27
What is Spark?
It is a general-purpose, in-memory computation engine.
Spark is called general purpose because every operation (data cleaning, SQL, and so on) can be performed within Spark itself.
Hadoop vs PySpark
Hadoop provides 3 components:
1. HDFS - storage
2. YARN - resource management
3. MapReduce - computation
We should not compare Spark with Hadoop as a whole; we should compare Spark with MapReduce, Hadoop's computation layer.
In-memory (RAM)
Spark performs all of its computation in memory.
How MapReduce and Spark computation work
In a MapReduce pipeline there are multiple MR (map/reduce) jobs. Each job takes its input from HDFS, processes it, and writes its output back to HDFS, and the next job reads that output from HDFS again. This disk read/write happens for every job in the chain, which is why MapReduce is slow.
In Spark, when there are multiple jobs, the first job takes the input (a one-time read from HDFS), performs its computation, and writes the result to memory only. The second job does not hit HDFS again; it processes the in-memory output of the previous job. So input and output touch the disk only twice (the initial read and the final write), which is why Spark is faster than MapReduce.
Hadoop is slower than Spark because:
1. Disk-based processing: Hadoop's MapReduce reads and writes data to disk after each stage, causing high I/O overhead. Spark, on the other hand, keeps data in memory using RDDs, which significantly speeds up processing.
2. Lack of optimization: Hadoop processes each MapReduce task independently, without advanced query optimizations. Spark leverages DAG (Directed Acyclic Graph) execution and the Catalyst optimizer for efficient query planning and execution.
SPARK ARCHITECTURE:
Apache Spark Architecture
Apache Spark is a distributed computing framework designed for processing large datasets efficiently. Its architecture ensures high performance,
scalability, and fault tolerance.
Core Components of Spark Architecture
1. Driver:
○ The Driver is the master process that controls the execution of the application.
○ It splits the job into smaller tasks and distributes them across worker nodes.
○ It maintains metadata, tracks task execution, and collects results.
2. Executors:
○ Executors are worker processes running on each node in the cluster.
○ They execute tasks assigned by the Driver and store intermediate and final results.
○ Each Executor runs multiple tasks and has its own memory for computation and storage.
3. Cluster Manager:
○ The Cluster Manager is responsible for resource allocation and managing the nodes in the cluster.
○ Supported Cluster Managers:
▪ Standalone: Spark's built-in cluster manager.
▪ YARN: Hadoop's cluster manager.
▪ Mesos: General-purpose cluster manager.
▪ Kubernetes: Container-based orchestration.
Key Concepts in Spark
1. RDD (Resilient Distributed Dataset):
○ Immutable distributed collections of objects.
○ Provides fault tolerance and parallel processing.
○ Operations on RDDs:
▪ Transformations (e.g., map, filter) – lazy operations that define a computation.
▪ Actions (e.g., collect, reduce) – trigger execution and return results.
2. DataFrame and Dataset:
○ DataFrame: Distributed collection of data organized into named columns (like a table in SQL).
○ Dataset: Strongly-typed API for structured data with compile-time type safety.
3. SparkSession:
○ The entry point for working with Spark. It allows access to APIs for RDDs, DataFrames, and SQL.
4. Partitions:
○ Spark divides data into smaller chunks called partitions to enable parallel processing.
○ Each partition is processed by a single task on an Executor.
5. Stages and Tasks:
○ A job is split into stages, and each stage contains multiple tasks.
○ Stages are determined by shuffle boundaries (data redistribution between nodes).
6. DAG (Directed Acyclic Graph):
○ Spark constructs a DAG to represent the sequence of transformations and actions.
○ The DAG Scheduler optimizes the execution by grouping operations and minimizing shuffles.
7. Shuffling:
○ The process of redistributing data across partitions to align with computation needs.
○ It occurs during operations like groupByKey and can impact performance.
8. Broadcast Variables:
○ Used to share read-only variables with all Executors to minimize data transfer.
9. Accumulators:
○ Write-only shared variables used for aggregating values across tasks (e.g., counters).
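A minimal sketch tying a few of these concepts together (the app name, local[*] master, and sample data are illustrative, not part of the original notes): SparkSession as the entry point, a small DataFrame with named columns, and the underlying SparkContext for RDD-level work and partitions.
from pyspark.sql import SparkSession

# Entry point for DataFrame and SQL APIs; app name and local master are illustrative
spark = SparkSession.builder.appName("ConceptsDemo").master("local[*]").getOrCreate()

# DataFrame: distributed data organized into named columns
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

# The underlying SparkContext is still available for RDD-level work
rdd = spark.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)  # 2 partitions
print(rdd.getNumPartitions())

spark.stop()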
Execution Flow
1. The user submits a Spark application (e.g., PySpark code) to the Driver.
2. The Driver creates a DAG of stages and tasks.
3. The Cluster Manager allocates resources (Executors).
4. The Driver sends tasks to the Executors for processing.
5. Executors execute tasks, process partitions, and return results to the Driver.
6. The final result is sent to the user.
Diagram of Spark Architecture
• Driver: Manages application lifecycle and schedules tasks.
• Cluster Manager: Allocates Executors to the Driver.
• Executors: Process partitions in parallel.
RDD:
What is RDD?
RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It represents an immutable, distributed collection of objects that
can be processed in parallel across a cluster.
Why RDD?
1. Fault Tolerance:
○ RDDs provide resilience through lineage. If a partition of data is lost, it can be recomputed from its source or prior transformations.
2. Distributed Computing:
○ Data is distributed across a cluster, enabling parallel processing and efficient utilization of resources.
3. Immutability:
○ RDDs are immutable, ensuring consistency and enabling optimization by Spark's execution engine.
4. Flexibility:
○ RDDs support a wide range of transformations and actions for data processing.
5. Low-Level Control:
○ Compared to higher-level APIs like DataFrames, RDDs give more control over how operations are executed, making them suitable for complex
or unstructured data processing.
How RDD Works?
1. Creation:
○ RDDs can be created from:
▪ Existing data (e.g., a collection in memory).
▪ External datasets (e.g., HDFS, S3, or local files).
▪ Transformations on other RDDs.
2. Transformations:
○ Operations like map, filter, or flatMap are lazy; they create a new RDD by applying a function but don’t immediately execute the operation.
3. Actions:
○ Operations like collect, count, or saveAsTextFile trigger execution, computing and returning results to the driver or saving them to storage.
4. Partitioning:
○ RDDs are automatically divided into partitions, which are processed in parallel.
5. Execution:
○ RDD operations generate a DAG (Directed Acyclic Graph), which Spark optimizes before execution.
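As a quick illustration of the creation, transformation, and action flow described above, here is a small word-count sketch (the input strings and app name are made up for the example):
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDFlowDemo")

# 1. Creation: from an in-memory collection (textFile() would be used for HDFS/S3/local files)
rdd = sc.parallelize(["spark makes rdds", "rdds are resilient"])

# 2. Transformations: lazy, each returns a new RDD
words = rdd.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# 3. Action: triggers the actual computation and returns results to the driver
print(counts.collect())

# 4. Partitioning: the data was split automatically; inspect it
print(counts.getNumPartitions())

sc.stop()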
Key Properties of RDD
1. Immutability:
○ Once created, an RDD cannot be modified. Transformations generate new RDDs.
2. Partitioning:
○ Data in an RDD is split into partitions, distributed across the nodes in a cluster.
3. Persistence:
○ RDDs can be cached in memory or disk using methods like persist() or cache() for faster access during iterative computations.
4. Fault Tolerance:
○ Spark reconstructs lost partitions using lineage information.
When to Use RDD?
1. Unstructured Data: When working with raw, unstructured, or semi-structured data.
2. Low-Level Control: When transformations or optimizations aren’t available in higher-level APIs.
3. Complex Logic: When implementing algorithms that require direct manipulation of distributed data.
Limitations of RDD
1. Performance:
○ RDDs do not use Spark's Catalyst optimizer or Tungsten execution engine, making them slower than DataFrames or Datasets for structured
data.
2. No Schema:
○ RDDs lack a schema, making them harder to work with for structured data.
3. Developer Overhead:
○ More code and effort are required compared to higher-level APIs like DataFrames.
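To make the developer-overhead point concrete, a small side-by-side sketch (sample data and names are illustrative): the RDD version manipulates raw tuples by position with no schema, while the DataFrame version works with named columns and benefits from Catalyst/Tungsten.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("RDDvsDF").master("local[*]").getOrCreate()
data = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD version: positional tuple access, no schema, no Catalyst optimization
rdd_names = (spark.sparkContext.parallelize(data)
             .filter(lambda t: t[1] > 30)
             .map(lambda t: t[0])
             .collect())

# DataFrame version: named columns, optimized execution
df = spark.createDataFrame(data, ["name", "age"])
df_names = [r["name"] for r in df.filter(F.col("age") > 30).select("name").collect()]

print(rdd_names, df_names)
spark.stop()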
Fault Tolerance: Interview Questions and Answers
1. What is fault tolerance in distributed systems?
• Answer:
Fault tolerance is the ability of a system to continue functioning properly in the event of hardware failures, software errors, or network issues. In
distributed systems, it ensures reliability and availability by detecting failures and recovering from them.
2. How does Apache Spark provide fault tolerance?
• Answer:
Spark achieves fault tolerance through:
1. Lineage: RDDs keep track of transformations used to compute the dataset. Lost partitions can be recomputed using this lineage.
2. Replication (for persisted data): When RDDs are cached, their partitions can be replicated across nodes, ensuring availability even if one node
fails.
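A brief sketch of both mechanisms (app name and data are illustrative): lineage alone lets Spark recompute a lost partition from parallelize + map, while a replicated storage level additionally keeps a cached copy on a second node.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "FaultToleranceDemo")

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Lineage: if a partition is lost, Spark re-runs parallelize + map for just that partition.
# Replication: the "_2" storage levels keep two copies of each cached partition.
rdd.persist(StorageLevel.MEMORY_ONLY_2)

print(rdd.sum())
sc.stop()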
3. What is lineage in Apache Spark, and how does it contribute to fault tolerance?
• Answer:
Lineage is a directed acyclic graph (DAG) that records the sequence of transformations applied to an RDD. In case of a failure, Spark uses this
lineage to recompute lost data, ensuring fault tolerance without requiring replication of all data.
4. How does Spark handle node failures?
• Answer:
When a node fails:
1. Lost RDD partitions are recomputed using the lineage graph.
2. If data is cached and replicated, Spark retrieves a copy from another node.
3. Task scheduling ensures that failed tasks are retried on other available nodes.
5. What happens if a Spark executor fails during computation?
• Answer:
If an executor fails:
1. Tasks running on the executor are rescheduled on other executors.
2. Lost RDD partitions are recomputed using lineage or retrieved from cached replicas.
3. If the driver fails, the Spark application is terminated unless run in a highly available mode.
6. How does checkpointing differ from lineage in fault tolerance?
• Answer:
1. Lineage: Maintains the history of transformations to recompute data. It is lightweight but may become inefficient for long lineage chains.
2. Checkpointing: Saves RDDs to stable storage (e.g., HDFS). It is useful when the lineage graph becomes too long or computation is iterative.
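A minimal checkpointing sketch, assuming a long lineage built up by an iterative loop (the checkpoint directory is illustrative; in production it is typically an HDFS or S3 path):
from pyspark import SparkContext

sc = SparkContext("local[*]", "CheckpointDemo")
sc.setCheckpointDir("/tmp/spark-checkpoints")   # illustrative path

rdd = sc.parallelize(range(100))
for _ in range(50):                  # iterative job => long lineage chain
    rdd = rdd.map(lambda x: x + 1)

rdd.checkpoint()                     # truncate the lineage by saving to stable storage
print(rdd.count())                   # the action materializes the RDD and writes the checkpoint
sc.stop()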
7. Can you explain fault tolerance in Spark Streaming?
• Answer:
Spark Streaming ensures fault tolerance by:
1. Metadata Checkpointing: Saves information like offsets and DAGs to recover the stream processing state.
2. Data Checkpointing: Optionally saves intermediate RDDs to stable storage for long-running computations.
3. Spark's lineage recomputes lost data for unprocessed batches.
8. What is the role of replication in Spark fault tolerance?
• Answer:
Replication creates multiple copies of cached RDD partitions across nodes. If one node fails, Spark retrieves the data from a replica, reducing the
need for recomputation and speeding up recovery.
9. How does Spark ensure reliability in case of task failure?
• Answer:
Spark retries failed tasks up to a configurable number of times (spark.task.maxFailures). If a task continues to fail after retries, the job is
terminated. Task rescheduling ensures other healthy nodes take over the computation.
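A small configuration sketch for this setting (the value 8 is illustrative; retry behaviour mainly matters on a real cluster rather than in local mode):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("RetryDemo")
        .setMaster("local[*]")
        .set("spark.task.maxFailures", "8"))   # default is 4; 8 here is just an example

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.task.maxFailures"))
sc.stop()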
10. What are the limitations of Spark’s fault tolerance mechanism?
• Answer:
1. Lineage Overhead: Long lineage chains can slow down recomputation.
2. Checkpointing Cost: Frequent checkpointing can increase I/O overhead.
3. Driver Failure: Unless configured for high availability, driver failure can cause job termination.
DAG and Lazy Evaluation in Apache Spark
1. What is a DAG in Spark?
DAG (Directed Acyclic Graph) is a data structure that represents the sequence of computations performed on data in Apache Spark.
• Directed: The graph has a direction, indicating the flow of data transformations.
• Acyclic: There are no cycles; you can't return to a previous node.
• Graph: Nodes represent RDDs or data transformations, and edges represent dependencies between them.
Purpose of DAG:
• Optimizes execution by analyzing the logical plan before running.
• Groups transformations into stages for efficient processing.
Why is DAG important in Spark?
1. Fault Tolerance: DAG helps recompute lost partitions using lineage.
2. Optimized Execution: Spark analyzes the DAG to determine the best execution strategy.
3. Parallelism: The DAG ensures transformations are parallelized where possible.
Example of DAG Formation
from pyspark import SparkContext
sc = SparkContext("local", "DAG Example")
data = [1, 2, 3, 4]
rdd = sc.parallelize(data) # Node 1: Source RDD
mapped_rdd = rdd.map(lambda x: x * 2) # Node 2: Transformation
filtered_rdd = mapped_rdd.filter(lambda x: x > 4) # Node 3: Transformation
result = filtered_rdd.collect() # Action
DAG Explanation:
• Nodes: parallelize, map, filter.
• Edges: Dependencies between transformations.
• Stages: Spark breaks the DAG into stages, each representing a collection of tasks that can run in parallel.
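To inspect the lineage side of this DAG before any action runs, toDebugString() can be called on the final RDD, continuing the example above (in PySpark it returns bytes, so it is decoded here for readability):
# Continuing from the DAG example above
print(filtered_rdd.toDebugString().decode("utf-8"))   # shows the parallelize -> map -> filter lineage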
2. What is Lazy Evaluation in Spark?
Lazy Evaluation means that Spark doesn’t execute transformations immediately when they’re called. Instead, it builds a DAG of transformations
and only executes when an action (like collect, count) triggers computation.
Why Lazy Evaluation?
1. Optimization:
○ Spark analyzes the DAG before execution to combine transformations into efficient physical execution plans.
○ Reduces the number of passes through the data.
2. Fault Tolerance:
○ Delaying execution helps Spark recompute only the required transformations in case of failure.
3. Efficiency:
○ Spark can skip unnecessary transformations by optimizing the plan.
Example of Lazy Evaluation:
data = [1, 2, 3, 4]
rdd = sc.parallelize(data) # No computation yet
mapped_rdd = rdd.map(lambda x: x * 2) # Still no computation
filtered_rdd = mapped_rdd.filter(lambda x: x > 4) # Still no computation
result = filtered_rdd.collect() # Computation triggered here
• When Computation Happens: Only when collect() is called.
• Spark builds the DAG during transformations and optimizes it before execution.
Relation Between DAG and Lazy Evaluation
1. DAG Construction: Spark builds a DAG during lazy evaluation as transformations are called.
2. Execution: When an action is triggered, Spark optimizes and executes the DAG in stages.
3. Fault Recovery: The DAG is used to recompute lost data efficiently, leveraging lazy evaluation.
Benefits of DAG and Lazy Evaluation
1. Optimized Query Execution:
○ Spark combines multiple transformations into stages to minimize shuffles and I/O.
2. Parallelism:
○ The DAG enables tasks to be distributed across a cluster for efficient execution.
3. Reduced Overhead:
○ Lazy evaluation ensures that Spark processes only what is needed, reducing unnecessary computations.
Interview Questions and Answers: DAG and Lazy Evaluation in Spark
1. What is a DAG in Apache Spark, and why is it important?
Answer:
A DAG (Directed Acyclic Graph) in Spark is a logical representation of the sequence of computations to be performed on data. Nodes in the DAG represent
RDDs or transformations, while edges represent dependencies between operations.
Importance:
• Optimized Execution: Spark divides the DAG into stages and tasks, reducing unnecessary computations.
• Fault Tolerance: DAG tracks lineage, allowing Spark to recompute lost data in case of failures.
• Parallelism: Identifies independent tasks for efficient execution across a cluster.
2. How does Spark create and execute a DAG?
Answer:
• Creation: Spark builds a DAG during the definition of transformations (e.g., map, filter). These transformations are lazily evaluated and stored as a
logical execution plan.
• Execution: When an action (e.g., collect, count) is called, Spark optimizes the DAG by grouping transformations into stages. Each stage is executed as a
series of parallel tasks.
3. What is lazy evaluation in Spark, and how does it work?
Answer:
Lazy evaluation means Spark does not execute transformations immediately but instead builds a DAG of transformations.
How it works:
1. When transformations (e.g., map, filter) are called, they are added to the DAG as nodes.
2. Execution is deferred until an action (e.g., collect, save) is invoked.
3. At that point, Spark optimizes the DAG and processes data in an efficient manner.
4. Why does Spark use lazy evaluation?
Answer:
Lazy evaluation offers the following advantages:
• Optimization: Spark analyzes the entire transformation pipeline to minimize unnecessary computations.
• Reduced I/O: Combines multiple transformations into a single stage to reduce data shuffling.
• Fault Tolerance: Delaying execution ensures that Spark recomputes only the required transformations in case of failure.
5. Can you explain the relationship between DAG and lazy evaluation in Spark?
Answer:
• DAG Construction: Lazy evaluation is used to construct the DAG as transformations are defined.
• Execution Trigger: The DAG is only executed when an action is called.
• Optimization: Spark optimizes the DAG for efficient execution before processing data.
For example, if you perform a series of transformations on an RDD, Spark does not execute them immediately. It builds a DAG of transformations. When an
action like collect() is invoked, Spark optimizes and executes the DAG.
6. What are the benefits of DAG in Spark?
Answer:
1. Optimized Execution: Reduces shuffles and combines narrow transformations into a single stage.
2. Fault Tolerance: Tracks lineage for recomputation of lost partitions.
3. Parallelism: Enables Spark to process tasks in parallel across nodes.
7. Provide an example demonstrating DAG and lazy evaluation.
Answer:
data = [1, 2, 3, 4]
rdd = sc.parallelize(data)                         # DAG Node 1: RDD creation
mapped_rdd = rdd.map(lambda x: x * 2)              # DAG Node 2: Transformation
filtered_rdd = mapped_rdd.filter(lambda x: x > 4)  # DAG Node 3: Transformation
result = filtered_rdd.collect()                    # Action triggers execution of the DAG
Explanation:
• When parallelize, map, and filter are called, Spark builds the DAG without executing.
• Execution begins only when the collect action is called, optimizing transformations into stages for efficient execution.
8. How does Spark optimize the DAG before execution?
Answer:
• Stage Division: Spark divides the DAG into stages based on shuffle dependencies.
• Pipelining: Groups narrow transformations (e.g., map, filter) into a single stage to minimize data shuffling.
• Task Scheduling: Tasks within a stage are distributed across cluster nodes for parallel processing.
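For DataFrames, the optimized plan and its shuffle boundaries can be inspected with explain(); a quick sketch (column names and expressions are illustrative), where the Exchange operator in the output marks the stage boundary:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("PlanDemo").master("local[*]").getOrCreate()

df = spark.range(1000).withColumn("doubled", F.col("id") * 2)
narrow = df.filter(F.col("doubled") > 10)                              # narrow ops: pipelined into one stage
grouped = narrow.groupBy((F.col("id") % 10).alias("bucket")).count()   # shuffle => new stage

grouped.explain()   # prints the physical plan; "Exchange" marks the shuffle boundary

spark.stop()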
9. What is the role of DAG in Spark’s fault tolerance?
Answer:
The DAG tracks the lineage of transformations applied to data. In case of failure:
1. Spark uses the DAG to identify lost partitions.
2. It recomputes only the required transformations for the affected data.
This ensures efficient recovery without reprocessing the entire dataset.
10. What are the limitations of DAG and lazy evaluation in Spark?
Answer:
1. Long Lineages: DAGs with long lineage chains can slow down recovery and execution.
2. Checkpointing Overhead: While checkpointing mitigates long lineage issues, it introduces storage and I/O overhead.
3. Complexity: Managing and debugging DAGs for large-scale jobs with complex dependencies can be challenging.
Cache vs Persist in Apache Spark
Cache and Persist are both mechanisms in Apache Spark used to store intermediate RDDs (Resilient Distributed Datasets) or DataFrames in memory or on
disk for faster access during future operations. While they serve similar purposes, they have some differences in terms of storage levels and flexibility.
1. What is cache in Spark?
Answer:
cache() is shorthand for persist() with the default storage level. For DataFrames and Datasets the default is MEMORY_AND_DISK: data is stored in memory, and if there isn't enough memory it spills over to disk. (For plain RDDs, cache() defaults to MEMORY_ONLY.)
Default Storage Level (DataFrames):
• MEMORY_AND_DISK: Spark first tries to store data in memory, and if memory is insufficient, it writes the data to disk.
When to use cache:
• When you have data that will be accessed repeatedly and fits into memory, or when memory is large enough to hold the dataset.
2. What is persist in Spark?
Answer:
persist() allows you to store an RDD or DataFrame in memory or on disk using various storage levels (like MEMORY_ONLY, DISK_ONLY,
MEMORY_AND_DISK, etc.). persist() is more flexible than cache() because it allows specifying the storage level.
Example Storage Levels:
• MEMORY_ONLY: Stores data only in memory.
• MEMORY_AND_DISK: Stores data in memory, spills to disk if memory is full.
• DISK_ONLY: Stores data only on disk.
• MEMORY_ONLY_SER: Stores data in memory in a serialized format (lower memory usage).
• OFF_HEAP: Stores data in off-heap memory (not recommended in many cases).
When to use persist:
• When you need fine-grained control over the storage level and decide whether to store the data in memory, disk, or both.
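A short sketch showing both calls side by side (dataset sizes and expressions are illustrative):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachePersistDemo").master("local[*]").getOrCreate()
df = spark.range(1_000_000)

cached = df.cache()                 # default storage level; materialized by the first action
cached.count()

persisted = df.selectExpr("id * 2 AS doubled").persist(StorageLevel.DISK_ONLY)   # explicit level
persisted.count()

cached.unpersist()                  # release memory/disk when no longer needed
persisted.unpersist()
spark.stop()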
3. Key Differences Between cache and persist
Feature       | cache()                                               | persist()
Storage Level | Default is MEMORY_AND_DISK                            | Allows specifying any storage level (e.g., MEMORY_ONLY, DISK_ONLY)
Flexibility   | Less flexible, suitable for basic caching needs       | More flexible, suitable for advanced use cases where control over storage is needed
Memory Usage  | Stores data in memory first, spills to disk if needed | Can store data in memory, disk, or both, based on the storage level
Use Case      | Quick caching of data to speed up operations          | Suitable for complex scenarios requiring custom storage management
4. When to use cache and persist?
• Use cache():
○ When you want a quick, default caching mechanism and are okay with
Spark using MEMORY_AND_DISK storage.
○ For situations where data fits in memory, but if not, it can spill to disk.
• Use persist():
○ When you need more control over where the data is stored (in memory,
on disk, serialized format, etc.).
○ For long-running jobs or iterative algorithms (e.g., machine learning)
where different storage levels can help optimize performance.
6. Performance Implications
• cache():
○ Faster, simpler caching with the default MEMORY_AND_DISK level. Best used for datasets that can fit into memory or benefit from spilling
to disk.
• persist():
○ More control over performance. You can choose a storage level that best matches the size of your data and the nature of the operations.
For example, MEMORY_ONLY can be faster but can lead to OutOfMemoryError for large datasets, while DISK_ONLY may be slower but can
handle larger-than-memory datasets.
7. Memory Considerations
• cache() will use memory first, and only spill to disk if the memory is insufficient. This is often ideal for datasets that are small enough to fit into
memory, with the option for disk fallback.
• persist(StorageLevel.MEMORY_ONLY) stores data only in memory. This is ideal for small datasets that will fit comfortably in memory, providing
the fastest access speed.
• persist(StorageLevel.DISK_ONLY) stores data only on disk. This is useful when you have a large dataset that won’t fit in memory, but accessing
data from disk is slower compared to memory.
8. Can you unpersist data after caching or persisting?
Answer:
Yes, you can unpersist cached or persisted data to free up memory or disk space by calling the unpersist() method:
rdd.unpersist() # Frees up the memory/disk used for the cached or persisted RDD
Parquet in Apache Spark
Parquet is a popular columnar storage format supported by Apache Spark and many other big data processing frameworks. It is optimized for
analytical workloads and is widely used in the industry.
1. What is Parquet?
• Definition: Parquet is a columnar file format that organizes data by columns instead of rows, enabling efficient storage and querying of
structured and semi-structured data.
• File Type: Binary, highly compressed format.
• Supported By: Spark, Hive, Presto, Snowflake, etc.
2. Features of Parquet
1. Columnar Storage:
○ Data for each column is stored together, enabling faster column-wise operations like filtering and aggregation.
2. Efficient Compression:
○ Parquet applies column-level compression, often achieving better compression ratios compared to row-based formats like CSV or JSON.
3. Schema Evolution:
○ Parquet supports schema evolution, allowing you to add, remove, or modify columns without breaking existing data.
4. Predicate Pushdown:
○ Filters are pushed down to the Parquet reader, so only matching row groups are scanned; combined with column pruning (e.g., SELECT col1 reads only col1's data), this skips unnecessary data.
5. Splittable:
○ Parquet files are splittable, enabling parallel processing across partitions.
3. Why Use Parquet in Spark?
1. Performance:
○ Columnar format enables faster read operations for analytical queries.
○ Compression reduces storage and I/O costs.
2. Integration:
○ Works seamlessly with Spark's DataFrame and Dataset APIs.
○ Optimized for Spark's Catalyst and Tungsten engines.
3. Cost-Effective:
○ Saves storage space and reduces read/write time, leading to lower compute costs in cloud environments.
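A minimal read/write sketch (the /tmp path and sample data are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetDemo").master("local[*]").getOrCreate()

df = spark.createDataFrame([(1, "alice", 34), (2, "bob", 45)], ["id", "name", "age"])

# Write as Parquet (columnar, compressed)
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Read it back: only the selected columns are scanned, and the filter can be pushed down
people = spark.read.parquet("/tmp/people.parquet")
people.select("name").filter(people.age > 40).show()

spark.stop()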
5. Advantages of Parquet
Feature            | Benefit
Columnar Format    | Faster column-wise operations like filtering and aggregation.
Compression        | Reduces storage and I/O costs.
Predicate Pushdown | Improves query performance by skipping unnecessary data.
Interoperability   | Supported by multiple tools in the big data ecosystem.
Splittable         | Allows parallel processing for large-scale jobs.
6. Comparison with Other Formats
Feature          | Parquet            | CSV       | JSON
Storage Format   | Columnar           | Row-based | Row-based
Compression      | Highly efficient   | Low       | Moderate
Read Performance | Fast for analytics | Slower    | Slower
Schema Support   | Yes                | No        | No
7. Interview Questions
1. What is Parquet, and why is it used in Spark?
○ Parquet is a columnar storage format optimized for analytical workloads, enabling faster queries, better compression, and efficient
storage.
2. What are the benefits of using Parquet in Spark?
○ Columnar storage for faster queries, efficient compression, schema evolution, predicate pushdown, and integration with Spark SQL.
3. What is predicate pushdown in Parquet?
○ Predicate pushdown allows queries to filter data at the file level by reading only the relevant rows and columns, improving performance.
4. How does Parquet differ from row-based formats like CSV?
○ Parquet stores data column-wise, enabling faster analytical queries, better compression, and reduced I/O compared to CSV, which is row-
based.
5. What compression formats are supported by Parquet?
○ Common formats include Snappy, Gzip, and None. Snappy is the default in Spark.
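The codec can be chosen per write or set globally; a quick sketch (paths are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetCompression").master("local[*]").getOrCreate()
df = spark.range(100)

# Per-write codec: gzip trades CPU for smaller files (snappy is the default)
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/range_gzip.parquet")

# Global default via configuration
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.mode("overwrite").parquet("/tmp/range_snappy.parquet")

spark.stop()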