INF4101-Big Data et NoSQL
5. SPARK
Dr Mouhim Sanaa
HADOOP MAPREDUCE
Limitations of Hadoop MapReduce
Hadoop MapReduce is a powerful framework for distributed processing of large
datasets, but it has several limitations:
• Difficult to Program
The MapReduce API is low-level and requires extensive code for relatively simple
tasks.
• I/O Intensive
The MapReduce model relies on intermediate storage of results between the
map and reduce phases, leading to numerous disk read/write operations.
This limits performance, especially for tasks that require repeated operations on
the same data (e.g., iterative algorithms like machine learning).
HADOOP MAPREDUCE
Limitations of Hadoop MapReduce
• High Latency
MapReduce is designed for batch processing, making it unsuitable for
applications requiring low latency (real-time or near real-time).
Each MapReduce task involves reading/writing from HDFS, significantly slowing
execution.
• Inefficient for Iterative and Interactive Tasks
Machine learning and graph analysis algorithms often require multiple
passes over the data, which is inefficient with MapReduce.
Systems like Apache Spark, which keep data in memory, are much more efficient
for such tasks.
• Not Suitable for Stream Processing
Hadoop MapReduce is not designed for real-time data stream processing.
Solutions like Apache Flink or Apache Kafka Streams are better suited for streaming.
HADOOP MAPREDUCE ALTERNATIVES
Several solutions have been developed to overcome MapReduce's limitations:
• Batch Processing
Apache Spark
Stores data in-memory (RDD, DataFrame) to avoid unnecessary disk writes.
Much faster than MapReduce for repetitive tasks and iterative algorithms.
Simpler API with support for Python (PySpark), Scala, Java, and R.
Compatible with Hadoop, HDFS, S3, and NoSQL databases (Cassandra, HBase, etc.).
Use Cases
Large-scale batch data processing.
Machine Learning with MLlib.
Distributed SQL processing with Spark SQL.
Disadvantages
Requires a lot of RAM for good performance.
HADOOP MAPREDUCE ALTERNATIVES
Several solutions have been developed to overcome MapReduce's limitations:
• Batch Processing
Apache Flink
Supports batch and stream processing with the same API.
Advanced memory management, sometimes better than Spark.
Efficient for iterative algorithms and graph processing.
Use Cases
Big Data analytics in batch and streaming.
Log processing and complex event handling.
Distributed machine learning.
Disadvantages
Less widely adopted than Spark, meaning fewer resources and community support.
HADOOP MAPREDUCE ALTERNATIVES
Several solutions have been developed to overcome MapReduce's limitations:
• Batch Processing
Apache Tez
Replaces MapReduce in Apache Hive and Apache Pig.
Optimized to execute DAG-based tasks (Directed Acyclic Graph).
Reduces intermediate read/write operations on HDFS.
Use Cases
Running SQL queries on Hadoop via Hive.
Replacing MapReduce for Pig and other Hadoop tools.
Disadvantages
Not as fast as Spark for complex tasks.
SPARK
Apache Spark is an open-source distributed data processing engine designed to be
fast, scalable, and easy to use. It enables parallel computing on a cluster of
machines.
Speed
• In memory computations
• Faster than MapReduce for complex applications on disk
Generality
• Batch applications
• Iterative algorithms
• Interactive queries and streaming
Ease of use
• API for Scala, Python, Java and R.
• Libraries for SQL, machine learning, streaming, and graph processing
• Runs on Hadoop clusters or in standalone mode
SPARK
Spark is not intended to replace Hadoop; rather, it can be regarded as an extension to it.
Hadoop is a complete ecosystem with several modules:
• HDFS (Hadoop Distributed File System): distributed storage system.
• YARN (Yet Another Resource Negotiator): resource and task management.
• MapReduce: batch data processing model.
How Spark fits in:
• Storage: Spark does not have its own storage system; it typically uses HDFS, Amazon S3, Cassandra, or NoSQL databases.
• Data processing: Spark can replace MapReduce but can also work with Hadoop YARN for resource management.
SPARK FEATURES
• 100x faster than MapReduce for large-scale data processing
• Provides powerful caching and disk persistence capabilities
• Can be programmed in Scala, Java, Python or R
• Can be deployed through Mesos, Hadoop via YARN, or Spark's own standalone cluster manager
SPARK ECOSYSTEM
Upper Layers (Processing Modules)
These components are libraries that provide different
functionalities for data analysis:
Spark SQL: Enables running interactive SQL queries on
structured data.
Spark Streaming: Handles real-time data stream
processing.
Spark MLlib: Provides tools for Machine Learning.
GraphX: Used for graph processing and analysis.
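As an illustration of the upper layers, here is a minimal PySpark sketch (assuming a SparkSession named spark is already available, for example in the pyspark shell; data is invented) that builds a DataFrame and queries it with Spark SQL:
# Build a small DataFrame and query it with Spark SQL
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()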
SPARK ECOSYSTEM
Spark Core Engine
This is the heart of Spark, responsible for:
• Managing tasks, resource allocation, and executing
programs.
• Optimizing tasks and handling memory management.
• Facilitating communication between different
components.
SPARK ECOSYSTEM
Lower Layers (Cluster Managers)
Spark can run on different cluster managers:
• YARN (Yet Another Resource Negotiator): Mainly
used in Hadoop for resource management.
• Mesos: Another resource manager that allows
resource sharing among multiple applications.
• Standalone Scheduler: A built-in mode where Spark
manages its own cluster resources.
• Kubernetes: Allows Spark to run in a containerized
environment.
SPARK ECOSYSTEM
Spark Core is the base engine for large-scale
parallel and distributed data processing
It is responsible for:
• Memory management and fault recovery
• Scheduling, distributing and monitoring
jobs on a cluster
• Interacting with storage systems
SPARK ARCHITECTURE
Master slave architecture:
• Driver
• Workers
SPARK ARCHITECTURE
1. DRIVER (Master Node)
The Driver is the central control point of a Spark
application. It is responsible for:
• Launching the application and maintaining its
state.
• Creating a SparkContext, which is the main
entry point to interact with Spark.
• Planning task execution and
communicating with the worker nodes.
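As a minimal sketch (application name and master URL are illustrative), this is how a driver program typically creates its SparkContext in PySpark:
from pyspark import SparkConf, SparkContext

# The driver builds a configuration and creates the SparkContext,
# which connects to the cluster manager and coordinates the workers.
conf = SparkConf().setAppName("MyDriverApp").setMaster("local[*]")
sc = SparkContext(conf=conf)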
SPARK ARCHITECTURE
2. Cluster Manager (Resource Manager)
The Cluster Manager handles resource allocation
to different Workers.
Spark can run on multiple cluster managers:
Hadoop YARN → Used within the Hadoop
ecosystem.
Apache Mesos → A general-purpose cluster
manager.
Kubernetes → Runs Spark in containerized
environments.
Spark Standalone → Spark’s built-in cluster
manager, requiring no external dependencies.
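For illustration, the cluster manager is typically selected through the --master option of spark-submit (hosts, ports, and the application file below are placeholders):
spark-submit --master yarn myapp.py                      # Hadoop YARN
spark-submit --master spark://host:7077 myapp.py         # Spark Standalone
spark-submit --master k8s://https://host:6443 myapp.py    # Kubernetes
spark-submit --master mesos://host:5050 myapp.py         # Mesos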
SPARK ARCHITECTURE
3. WORKERS (Compute Nodes)
Workers are the machines where computations take
place.
Each Worker:
• Hosts one or more Executors (processes that
execute tasks).
• Handles multiple tasks simultaneously to
speed up processing.
• Caches data to optimize performance.
RESILIENT DISTRIBUTED DATASET RDD
An RDD (Resilient Distributed Dataset) is the fundamental data structure in
Apache Spark. It is a collection of data distributed across multiple nodes in a cluster,
enabling parallel processing.
• Immutable
• Lazy evaluation
• In-memory computation
RDD stands for:
• Resilient: fault tolerant and capable of rebuilding data on failure
• Distributed: data distributed among the multiple nodes in a cluster
• Dataset: collection of partitioned data with values
• Three methods for creating an RDD:
Parallelizing an existing collection
Referencing a dataset (from any storage supported by Hadoop: HDFS, Cassandra, HBase, ...)
Transformation from an existing RDD
• Two types of RDD operations: Transformations and Actions
• Types of files supported: text files, sequence files, Hadoop InputFormat
RESILIENT DISTRIBUTED DATASET RDD
RDD OPERATIONS
Two types of RDD operations:
Transformations
• Create a DAG
• Lazy evaluation
• No return value
Actions
• Perform the transformations and the action that follows
• Return a value
Transformations
create a new dataset from an existing one. They are lazy, meaning they are only
executed when an action is called.
Actions (Trigger execution)
Actions trigger the execution of transformations and return a result.
RDD OPERATIONS
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Transformations
Transformation methods and their description:
• map(func) – Applies a transformation function on the dataset and returns the same number of elements in a distributed dataset.
• filter(func) – Returns a new RDD after applying a filter function on the source dataset.
• flatMap(func) – Returns a flattened map: if you have a dataset of arrays, each element of an array becomes a row. In other words, it returns 0 or more output items for each element in the dataset.
• distinct([numPartitions]) – Returns a new dataset that contains the distinct elements of the source dataset.
• cache() – Caches the RDD.
• groupByKey([numPartitions]) – When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
• reduceByKey(func, [numPartitions]) – Merges the values for each key with the specified function.
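A small PySpark sketch (sample data invented for illustration) combining several of these transformations:
rdd = sc.parallelize(["spark is fast", "spark is easy", "hadoop is robust"])
words = rdd.flatMap(lambda line: line.split(" "))           # split lines into words
pairs = words.map(lambda w: (w, 1))                         # build (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)              # sum counts per word
longWords = words.filter(lambda w: len(w) > 4).distinct()   # distinct words longer than 4 chars
# Nothing runs yet: transformations are lazy until an action is called.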
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: actions
Action methods and their description:
• collect(): Array[T] – Return the complete dataset as an Array.
• count(): Long – Return the count of elements in the dataset.
• first(): T – Return the first element in the dataset.
• foreach(f: (T) => Unit): Unit – Iterates over all elements in the dataset by applying function f to each element.
• max()(implicit ord: Ordering[T]): T – Return the maximum value from the dataset.
• reduce(f: (T, T) => T): T – Reduces the elements of the dataset using the specified binary operator.
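A short PySpark sketch (illustrative data) that triggers execution with a few of these actions:
nums = sc.parallelize([3, 1, 4, 1, 5, 9])
print(nums.collect())                    # [3, 1, 4, 1, 5, 9]
print(nums.count())                      # 6
print(nums.first())                      # 3
print(nums.max())                        # 9
print(nums.reduce(lambda a, b: a + b))   # 23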
RDD OPERATIONS
Transformations
Actions
SPARK INSTALLATION
• JDK
Apache Spark requires Java Development Kit (JDK) because Spark is written in Scala,
which runs on the Java Virtual Machine (JVM).
Once the JDK is installed, you must then specify the JAVA_HOME environment variable so
that Spark knows where the Java installation directory is located.
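For example, the variable might be set as follows (the JDK paths below are only illustrative placeholders):
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_202"          (Windows)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64           (Linux)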
SPARK INSTALLATION
• SPARK
Pre-built versions of Spark are available for download from the project's official website:
http://spark.apache.org/downloads.html
SPARK SHELL
•The Spark shell provides a simple way to learn Spark's API.
•It is also a powerful tool to analyze data interactively.
•The Shell is available in either Scala, which runs on the Java VM, or Python.
Scala:
• To launch the Scala shell: spark-shell
• To read a text file: scala> val textfile = sc.textFile("file.txt")
Python:
• To launch the Python shell: pyspark
• To read a text file:
>>> textfile = sc.textFile("file.txt")
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Basics
Loading a file: file = sc.textFile("monText.txt")
Applying a transformation: lineslength = file.map(lambda s: len(s))
Invoking an action: total = lineslength.reduce(lambda a, b: a + b)
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()
scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
Create a Spark RDD using textFile() or wholeTextFiles()
Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods, we can also read all files from a directory, or files matching a specific pattern.
• textFile() – Reads single or multiple text or CSV files and returns a single RDD[String].
• wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Create a Spark RDD using textFile() or wholeTextFiles()
This example reads all files from a directory, creates a single RDD and prints the contents
of the RDD.
# Load the files and create the RDD
rdd = spark.sparkContext.textFile("C:/tmp/files/*")
# Use foreach to print each line
rdd.foreach(lambda f: print(f))
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using textFile() or wholeTextFiles()
In the returned tuples, the first value f[0] is the file name and the second value f[1] is the content of the file.
rddWhole = sc.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(lambda f: print(f[0], "=>", f[1]))
RESILIENT DISTRIBUTED DATASET RDD
map
Returns a new distributed dataset formed by passing each element of the source through a
function.
pairs = test.map(lambda s: len(s))
pairs = test.map(lambda s: (s, len(s)))
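A small self-contained sketch of the two map() calls above, assuming test is an RDD of strings (sample data invented here):
test = sc.parallelize(["spark", "hadoop", "flink"])
lengths = test.map(lambda s: len(s))       # 5, 6, 5
pairs = test.map(lambda s: (s, len(s)))    # ("spark", 5), ("hadoop", 6), ("flink", 5)
print(pairs.collect())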
RESILIENT DISTRIBUTED DATASET RDD
flatMap
The Spark flatMap() transformation flattens the RDD after applying the function to every element and returns a new RDD.
The returned RDD can have the same number of elements or more. This is one of the major differences between flatMap() and map(): the map() transformation always returns the same number of elements as the input.
rdd1 = rdd.flatMap(lambda f: f.split(" "))
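For comparison, a short sketch showing map() versus flatMap() on the same illustrative data:
rdd = sc.parallelize(["hello world", "big data with spark"])
print(rdd.map(lambda f: f.split(" ")).collect())
# [['hello', 'world'], ['big', 'data', 'with', 'spark']]  : one list per element
print(rdd.flatMap(lambda f: f.split(" ")).collect())
# ['hello', 'world', 'big', 'data', 'with', 'spark']      : flattened into words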
RESILIENT DISTRIBUTED DATASET RDD
reduce
Spark RDD reduce() is an aggregate action function used to calculate the min, max, and total of the elements in a dataset.
# Create an RDD
list_rdd = sc.parallelize([1, 2, 3, 4, 5, 3, 2])
# Find the min using reduce function
min_value = list_rdd.reduce(lambda a, b: a if a < b else b)
print("output min using binary:", min_value)
# Find the max using reduce function
max_value = list_rdd.reduce(lambda a, b: a if a > b else b)
print("output max using binary:", max_value)
# Calculate Sum using reduce function
sum_value = list_rdd.reduce(lambda a, b: a + b)
print("output sum using binary:", sum_value)
RESILIENT DISTRIBUTED DATASET RDD
reduceByKey
The Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative reduce function.
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
rdd2.foreach(lambda x: print(x))
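A minimal sketch with invented (word, count) pairs so that the reduceByKey() call above is runnable:
rdd = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1)])
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())   # [('spark', 2), ('hadoop', 1)] (order may vary)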
RESILIENT DISTRIBUTED DATASET RDD
Filter transformation
Spark RDD filter is an operation that creates a new RDD by selecting the elements
from the input RDD that satisfy a given predicate (or condition).
filteredRDD = RDD.filter(lambda x: x % 2 == 0)
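For example, with an illustrative RDD of integers, the predicate keeps only the even values:
RDD = sc.parallelize(range(10))
filteredRDD = RDD.filter(lambda x: x % 2 == 0)
print(filteredRDD.collect())   # [0, 2, 4, 6, 8]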
RESILIENT DISTRIBUTED DATASET RDD
Word Count Example
val fileRDD = sc.textFile("File.txt")
  .map(l => l.toLowerCase())
  .flatMap(l => l.split(" "))
  .map(m => (m, 1))
  .reduceByKey(_ + _)
  .collect()
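The same word count expressed as a PySpark sketch (file name kept from the Scala example):
fileRDD = (sc.textFile("File.txt")
             .map(lambda l: l.lower())
             .flatMap(lambda l: l.split(" "))
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b)
             .collect())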
RDD ACTIONS
collect()
Return the complete dataset as an Array.
count()
Return the count of elements in the dataset.
SPARK CACHE
• In Apache Spark, the cache() function is used to store a DataFrame or RDD in memory to
improve performance when accessing the data multiple times.
When to cache
• If the DataFrame/RDD is used multiple times in calculations.
• If transformations are expensive to recompute.
• Avoid using it for very large datasets (it can cause memory issues).
val rdd_cached = rdd.cache()
SPARK CACHE
Benefits of caching DataFrame
• Reading data from the source (hdfs:// or s3://) is time-consuming. So after you read data from the source and apply all the common operations, cache it if you are going to reuse the data.
• By caching, you create a checkpoint in your Spark application, and if any task fails further down the execution, your application will be able to recompute the lost RDD partitions from the cache.
• If you don't have enough memory, data will be cached on the local disk of the executor, which will still be faster than reading from the source.
• If you can only cache a fraction of the data, it will still improve performance; the rest of the data can be recomputed by Spark, and that is what the "resilient" in RDD means.
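A short sketch of the typical caching pattern in PySpark (the path is illustrative):
rdd = sc.textFile("C:/tmp/files/*").map(lambda line: line.lower())
rdd.cache()     # mark the RDD to be kept in memory
rdd.count()     # first action: reads from disk and fills the cache
rdd.count()     # subsequent actions reuse the cached partitions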
SPARK DATAFRAME
• A DataFrame in Apache Spark is a tabular data structure similar to a table in a
relational database or a Pandas DataFrame in Python. It is based on the concept of
RDD, but with a more organized structure and optimized for processing large data
sets.
DataFrame Features
• Stores data in columns and rows, like an SQL
table.
• Well-defined schema: each column has a
name and a type (int, string, etc.).
• Distributed across multiple Spark cluster nodes
for massively parallel processing (MPP).
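As a first illustration (data invented), a DataFrame can be created directly from a Python collection, assuming a SparkSession named spark is available:
data = [("Alice", 34), ("Bob", 45), ("Carla", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.printSchema()   # each column has a name and an inferred type
df.show()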
SPARK DATAFRAME
• SparkSession is an abstraction introduced in Spark 2.0 to simplify the Spark
API. It consolidates multiple components (such as SparkContext, SQLContext,
HiveContext) into a single interface. It is now the primary object used in Spark
applications to work with DataFrames, Datasets, and other high-level
abstractions.
val df = spark.read.option("header","true")
.option("inferSchema", "true")
.csv("C:\\Users\\user\\Downloads\\LabData\\LabData\\nyctaxi.csv")
df.printSchema()
df.show()
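A PySpark equivalent sketch, creating the SparkSession explicitly and reading the same CSV file (the application name is illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NycTaxi").getOrCreate()
df = (spark.read.option("header", "true")
                .option("inferSchema", "true")
                .csv("C:\\Users\\user\\Downloads\\LabData\\LabData\\nyctaxi.csv"))
df.printSchema()
df.show()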
SPARK MAVEN APPLICATION
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
SPARK MAVEN APPLICATION
In production: create the wc.jar file, then submit the application:
spark-submit \
  --class sparkwordcount \
  --master local \
  wc.jar test.txt resultat