homework10_mounika
These are two key differences between Apache Spark and MapReduce:
1) Processing Speed
Spark: Processes data in-memory, which makes it much faster than MapReduce,
especially for iterative algorithms and real-time processing.
MapReduce: Writes intermediate results to disk between each Map and Reduce step,
which can significantly slow down performance.
2) Fault Tolerance
Spark: Uses RDD lineage (Resilient Distributed Datasets), allowing it to recompute lost
partitions without needing full replication.
MapReduce: Achieves fault tolerance by replicating data in HDFS and re-executing failed
tasks, which is reliable but incurs more disk and network overhead.
RDD operations fall into two categories:
1.Actions
2.Transformations

1.Actions
Definition:
Actions are operations that trigger the computation on the RDD and return a value to the driver
program or write data to external storage.
Characteristics:
Eager Execution
Materializes Data
Returns a Value
Examples:
collect(): Retrieves all the elements of the RDD to the driver program.
2.Transformations
Definition:
Transformations are operations on an RDD that return a new RDD. They do not modify the
original RDD because RDDs are immutable. Instead, they create a lineage graph of dependencies
between RDDs.
Characteristics:
Lazy Evaluation
Non-Immediate Execution
Optimization
Examples:
map(): Applies a function to each element of the RDD and returns a new RDD.
filter(): Filters the elements based on a condition and returns a new RDD.
flatMap(): Similar to map(), but can return multiple elements for each input element.
Key Differences Between Transformations and Actions:
1. Execution:
   - Transformations: Lazy evaluation; the computation is not performed until an action is invoked.
   - Actions: Eager execution; an action triggers the execution of all preceding transformations.
2. Output:
   - Transformations: Return a new RDD without executing the computation.
   - Actions: Return a final result (either to the driver or to an external storage system), causing computation to occur.
3. Purpose:
   - Transformations: Define the computation and manipulation logic on data, enabling the data flow and computational pipeline.
   - Actions: Trigger the actual computation and produce the output (e.g., return data or store it).
4. Data Processing:
   - Transformations: No data is processed when a transformation is called; it only defines a new RDD for future use.
   - Actions: Actual data processing and result generation occur when an action is invoked.
5. Relation to DAG:
   - Transformations: Contribute to the logical DAG, which represents the sequence of computations.
   - Actions: Execute the DAG and produce the final output.
Advantages of the Parquet file format:
Efficient Storage: Columnar format leads to better compression and more efficient use of
storage.
Improved Performance: Optimized for analytical queries, with features like column pruning
and predicate pushdown that improve read performance.
Schema Flexibility: Supports schema evolution and complex nested data types.
Parallel Processing: Supports distributed processing and efficient parallelism in Spark and
Hadoop.
Compression: High compression ratios reduce storage costs and improve I/O performance.
RDDs are Spark's fundamental building blocks for big data. They are
distributed, meaning data is spread across computers for parallel processing.
They are resilient, automatically recovering from failures. They enable fast,
in-memory computation and provide a simple way for developers to work
with distributed data. Their lazy evaluation allows for optimization, and they
form the base for Spark's more advanced tools.
A Lineage Graph (also known as a DAG, a Directed Acyclic Graph) in Apache Spark is a
record of all the operations (like map, filter, etc.) that have been applied to a Resilient
Distributed Dataset (RDD) to create new RDDs.
The main components of the Spark ecosystem are:
Spark Core: The fundamental engine for distributed processing using RDDs.
Spark SQL: For working with structured data using SQL and DataFrames.
Spark Streaming: For processing real-time data streams.
MLlib: The library for scalable machine learning algorithms.
GraphX: For graph processing and analysis.
Only one SparkContext can be active at a time per JVM (Java Virtual
Machine).