homework10_mounika
These are two key differences between Apache Spark and MapReduce:
1) Processing Speed
Spark: Processes data in-memory, which makes it much faster than MapReduce,
especially for iterative algorithms and real-time processing.
MapReduce: Writes intermediate results to disk between each Map and Reduce step,
which can significantly slow down performance.
2) Fault Tolerance
Spark: Uses RDD lineage (Resilient Distributed Datasets), allowing it to recompute lost
partitions without needing full replication.
MapReduce: Achieves fault tolerance by replicating data in HDFS and re-executing failed
tasks, which is reliable but incurs more disk and network overhead.
RDD operations fall into two categories:
1.Actions
2.Transformations

1.Actions
Definition:
Actions are operations that trigger the computation on the RDD and return a value to the driver
program or write data to external storage.
Characteristics:
Eager Execution
Materializes Data
Returns a Value
Examples:
collect(): Retrieves all the elements of the RDD to the driver program.
2.Transformations
Definition:
Transformations are operations on an RDD that return a new RDD. They do not modify the
original RDD because RDDs are immutable. Instead, they create a lineage graph of dependencies
between RDDs.
Characteristics:
Lazy Evaluation
Non-Immediate Execution
Optimization
Examples:
map(): Applies a function to each element of the RDD and returns a new RDD.
filter(): Filters the elements based on a condition and returns a new RDD.
flatMap(): Similar to map(), but can return multiple elements for each input element.
Key Differences Between Transformations and Actions:
1. Execution:
   - Transformations: Lazy evaluation; the computation is not performed until an action is invoked.
   - Actions: Eager execution; an action triggers the execution of all preceding transformations.
2. Output:
   - Transformations: Return a new RDD without executing the computation.
   - Actions: Return a final result (either to the driver or to an external storage system), causing computation to occur.
3. Purpose:
   - Transformations: Define the computation and manipulation logic on data, enabling the data flow and computational pipeline.
   - Actions: Trigger the actual computation and produce the output (e.g., return data or store it).
4. Data Processing:
   - Transformations: No data is processed when a transformation is called; it only defines a new RDD for future use.
   - Actions: Actual data processing and result generation occur when an action is invoked.
5. Relation to DAG:
   - Transformations: Contribute to the logical DAG, which represents the sequence of computations.
   - Actions: Execute the DAG and produce the final output.
Advantages of the Parquet file format:
Efficient Storage: Columnar format leads to better compression and more efficient use of
storage.
Improved Performance: Optimized for analytical queries, with features like column pruning
and predicate pushdown that improve read performance.
Schema Flexibility: Supports schema evolution and complex nested data types.
Parallel Processing: Supports distributed processing and efficient parallelism in Spark and
Hadoop.
Compression: High compression ratios reduce storage costs and improve I/O performance.
RDDs are Spark's fundamental building blocks for big data. They are
distributed, meaning data is spread across computers for parallel processing.
They are resilient, automatically recovering from failures. They enable fast,
in-memory computation and provide a simple way for developers to work
with distributed data. Their lazy evaluation allows for optimization, and they
form the base for Spark's more advanced tools.
A Lineage Graph (also known as a DAG, a Directed Acyclic Graph) in Apache Spark is a
record of all the operations (like map, filter, etc.) that have been applied to a Resilient
Distributed Dataset (RDD) to create new RDDs.
The main components of the Spark ecosystem are:
Spark Core: The fundamental engine for distributed processing using RDDs.
Spark SQL: For working with structured data using SQL and DataFrames.
Spark Streaming: For processing real-time data streams.
MLlib: The library for scalable machine learning algorithms.
GraphX: For graph processing and analysis.
Only one SparkContext can be active at a time per JVM (Java Virtual
Machine).