homework10_mounika

1) Mention two differences between Apache Spark and MapReduce.

These are two differences between Apache Spark and MapReduce:

1) Speed and Performance:

Spark: Processes data in memory, which makes it much faster than MapReduce, especially for iterative algorithms and real-time processing.

MapReduce: Writes intermediate results to disk between each Map and Reduce step, which can significantly slow down performance.

2) Fault Tolerance:

Spark: Uses RDD (Resilient Distributed Dataset) lineage, allowing it to recompute lost partitions without needing full replication.

MapReduce: Relies on replication of data blocks in HDFS to recover from failures.
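
For illustration, here is a minimal PySpark sketch (the numbers and structure are invented) of how an iterative job can reuse data cached in memory instead of re-reading it on every pass:

    # Cache the RDD in memory so repeated passes avoid disk round-trips.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    data = sc.parallelize(range(1000000)).cache()  # keep partitions in memory

    total = 0
    for _ in range(5):          # iterative passes reuse the cached data
        total += data.sum()     # no intermediate writes to disk between passes

    print(total)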

2) What are the types of operations supported by RDDs? What is the difference between them?

The two types of operations supported by RDDs are:

1. Actions

2. Transformations

1. Actions

Definition:

Actions are operations that trigger the computation on the RDD and return a value to the driver
program or write data to external storage.

Characteristics:

Eager Execution

Materializes Data

Returns a Value

Examples:

 collect(): Retrieves all the elements of the RDD to the driver program.

 count(): Returns the number of elements in the RDD.

 first(): Returns the first element of the RDD.

 saveAsTextFile(): Writes the data in the RDD to a file system.
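
A short sketch of these actions in PySpark (the data and output path are placeholders):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([3, 1, 4, 1, 5])

    print(rdd.collect())   # [3, 1, 4, 1, 5] - all elements sent to the driver
    print(rdd.count())     # 5 - number of elements
    print(rdd.first())     # 3 - first element
    rdd.saveAsTextFile("/tmp/rdd_output")  # fails if the path already exists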

2. Transformations

Definition:

Transformations are operations on an RDD that return a new RDD. They do not modify the
original RDD because RDDs are immutable. Instead, they create a lineage graph of dependencies
between RDDs.

Characteristics:

Lazy Evaluation

Non-Immediate Execution

Creates a New RDD

Optimization

Examples:

map(): Applies a function to each element of the RDD and returns a new RDD.

filter(): Filters the elements based on a condition and returns a new RDD.

flatMap(): Similar to map(), but can return multiple elements for each input element.

reduceByKey(): Aggregates data by key and returns a new RDD.
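
These transformations can be sketched in PySpark as follows (sample data is invented; nothing actually runs until collect() is called):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    nums  = sc.parallelize([1, 2, 3, 4])
    lines = sc.parallelize(["to be", "or not"])
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    doubled = nums.map(lambda x: x * 2)              # [2, 4, 6, 8]
    evens   = nums.filter(lambda x: x % 2 == 0)      # [2, 4]
    words   = lines.flatMap(lambda s: s.split())     # ["to", "be", "or", "not"]
    sums    = pairs.reduceByKey(lambda a, b: a + b)  # [("a", 4), ("b", 2)]

    # collect() is the action that finally triggers the computation.
    print(doubled.collect(), evens.collect(), words.collect(), sums.collect())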

Key Differences Between Transformations and Actions:

1. Execution:
   o Transformations: Lazy evaluation; the computation is not performed until an action is invoked.
   o Actions: Eager execution; an action triggers the execution of all preceding transformations.

2. Output:
   o Transformations: Return a new RDD without executing the computation.
   o Actions: Return a final result (either to the driver or to an external storage system), causing computation to occur.

3. Purpose:
   o Transformations: Define the computation and manipulation logic on data, building the data flow and computational pipeline.
   o Actions: Trigger the actual computation and produce the output (e.g., return data or store it).

4. Data Processing:
   o Transformations: No data is processed when a transformation is called; it only builds a new data structure (RDD) for future use.
   o Actions: Actual data processing and result generation occur when an action is invoked.

5. Relation to DAG:
   o Transformations: Contribute to the logical DAG, which represents the sequence of computations.
   o Actions: Execute the DAG and produce the final output.

3) What are the advantages of a Parquet file for a Spark system?

 Efficient Storage: Columnar format leads to better compression and more efficient use of storage.

 Improved Performance: Optimized for analytical queries, with features like column pruning and predicate pushdown that improve read performance.

 Schema Flexibility: Supports schema evolution and complex nested data types.

 Cross-System Compatibility: Works well with various data processing frameworks.

 Parallel Processing: Supports distributed processing and efficient parallelism in Spark and Hadoop.

 Compression: High compression ratios reduce storage costs and improve I/O performance.
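
As a rough illustration (the path and column names are invented), writing and reading Parquet with Spark SQL might look like this; the final query benefits from column pruning (only name is read) and predicate pushdown (the filter is applied while scanning):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1, "alice", 90), (2, "bob", 75)],
        ["id", "name", "score"],
    )
    df.write.mode("overwrite").parquet("/tmp/scores.parquet")

    # Column pruning and predicate pushdown happen at scan time.
    spark.read.parquet("/tmp/scores.parquet") \
         .filter("score > 80") \
         .select("name") \
         .show()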

4) What is lazy evaluation in Spark?


Lazy evaluation in Spark is a computational strategy where the execution of
RDD transformations is postponed until an action is invoked on the resulting
RDD. Spark records the sequence of transformations (building a lineage
graph) and only computes the results when an action requires the data.
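
A tiny PySpark sketch of this behavior (invented data): the map() call returns immediately and only records the transformation; the work happens at count():

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(10))

    squared = rdd.map(lambda x: x * x)  # recorded in the lineage, not executed
    print(squared.count())              # action: triggers the actual computation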

5) What is the significance of Resilient Distributed Datasets in Spark?

RDDs are Spark's fundamental building blocks for big data. They are
distributed, meaning data is spread across computers for parallel processing.
They are resilient, automatically recovering from failures. They enable fast,
in-memory computation and provide a simple way for developers to work
with distributed data. Their lazy evaluation allows for optimization, and they
form the base for Spark's more advanced tools.
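
A minimal sketch of the "distributed" part (the partition count is chosen arbitrarily): the data is split into partitions that Spark processes in parallel:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(100), 4)  # ask for 4 partitions

    print(rdd.getNumPartitions())  # 4
    print(rdd.glom().collect())    # elements grouped per partition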

6) What is a Lineage Graph?

A lineage graph (also known as a DAG, a directed acyclic graph) in Apache Spark is a record of all the operations (like map, filter, etc.) that have been applied to a Resilient Distributed Dataset (RDD) to create new RDDs.
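
Spark can print this lineage with toDebugString(); a small sketch with an invented pipeline:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = (sc.parallelize(range(10))
             .map(lambda x: (x % 2, x))
             .reduceByKey(lambda a, b: a + b))

    # Shows the chain of parent RDDs Spark would replay to recompute
    # a lost partition (PySpark returns this as bytes).
    print(rdd.toDebugString().decode())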

7) What are the important components of the Spark ecosystem?

 Spark Core: The fundamental engine for distributed processing using RDDs.
 Spark SQL: For working with structured data using SQL and DataFrames.
 Spark Streaming: For processing real-time data streams.
 MLlib: The library for scalable machine learning algorithms.
 GraphX: For graph processing and analysis.
 SparkR: R language bindings for Spark.
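
As a brief example of the Spark SQL component (the view name and data are invented), a DataFrame can be registered as a temporary view and queried with plain SQL:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")

    spark.sql("SELECT name FROM people WHERE age > 30").show()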

8) How many SparkContexts can be active per JVM?

Only one SparkContext can be active at a time per JVM (Java Virtual Machine).
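
A small sketch of what this means in practice: getOrCreate() returns the already-active context instead of creating a second one, and constructing another SparkContext directly raises an error:

    from pyspark import SparkContext

    sc1 = SparkContext.getOrCreate()
    sc2 = SparkContext.getOrCreate()  # same context, not a new one
    print(sc1 is sc2)                 # True

    # Calling SparkContext() again here would raise:
    # ValueError: Cannot run multiple SparkContexts at once
    sc1.stop()  # stop the active context before creating a new one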
