INF4101-Big Data et NoSQL
5. SPARK
Dr Mouhim Sanaa
HADOOP MAPREDUCE
Limitations of Hadoop MapReduce
Hadoop MapReduce is a powerful framework for distributed processing of large
datasets, but it has several limitations:
• Difficult to Program
The MapReduce API is low-level and requires extensive code for relatively simple
tasks.
• I/O Intensive
The MapReduce model relies on intermediate storage of results between the
map and reduce phases, leading to numerous disk read/write operations.
This limits performance, especially for tasks that require repeated operations on
the same data (e.g., iterative algorithms like machine learning).
HADOOP MAPREDUCE
Limitations of Hadoop MapReduce
• High Latency
MapReduce is designed for batch processing, making it unsuitable for
applications requiring low latency (real-time or near real-time).
Each MapReduce task involves reading/writing from HDFS, significantly slowing
execution.
• Inefficient for Iterative and Interactive Tasks
Machine learning and graph analysis algorithms often require multiple
passes over the data, which is inefficient with MapReduce.
Systems like Apache Spark, which keep data in memory, are much more efficient
for such tasks.
• Not Suitable for Stream Processing
Hadoop MapReduce is not designed for real-time data stream processing.
Solutions like Apache Flink or Apache Kafka Streams are better suited for streaming.
HADOOP MAPREDUCE ALTERNATIVES
Several solutions have been developed to overcome MapReduce's limitations:
• Batch Processing
Apache Spark
Stores data in-memory (RDD, DataFrame) to avoid unnecessary disk writes.
Much faster than MapReduce for repetitive tasks and iterative algorithms.
Simpler API with support for Python (PySpark), Scala, Java, and R.
Compatible with Hadoop, HDFS, S3, and NoSQL databases (Cassandra, HBase, etc.).
Use Cases
Large-scale batch data processing.
Machine Learning with MLlib.
Distributed SQL processing with Spark SQL.
Disadvantages
Requires a lot of RAM for good performance.
HADOOP MAPREDUCE ALTERNATIVES
Several solutions have been developed to overcome MapReduce's limitations:
• Batch Processing
Apache Flink
Supports batch and stream processing with the same API.
Advanced memory management, sometimes better than Spark.
Efficient for iterative algorithms and graph processing.
Use Cases
Big Data analytics in batch and streaming.
Log processing and complex event handling.
Distributed machine learning.
Disadvantages
Less widely adopted than Spark, meaning fewer resources and community support.
HADOOP MAPREDUCE ALTERNATIVES
Several solutions have been developed to overcome MapReduce's limitations:
• Batch Processing
Apache Tez
Replaces MapReduce in Apache Hive and Apache Pig.
Optimized to execute DAG-based tasks (Directed Acyclic Graph).
Reduces intermediate read/write operations on HDFS.
Use Cases
Running SQL queries on Hadoop via Hive.
Replacing MapReduce for Pig and other Hadoop tools.
Disadvantages
Not as fast as Spark for complex tasks.
SPARK
Apache Spark is an open-source distributed data processing engine designed to be
fast, scalable, and easy to use. It enables parallel computing on a cluster of
machines.
Speed
• In memory computations
• Faster than MapReduce for complex applications on disk
Generality
• Batch applications
• Iterative algorithms
• Interactive queries and streaming
Ease of use
• API for Scala, Python, Java and R.
• Libraries for SQL, machine learning, streaming, and graph processing
• Runs on Hadoop clusters or in standalone mode
SPARK
Spark is not intended to replace Hadoop; rather, it can be regarded as an extension to it.
Hadoop is a complete ecosystem with several modules:
• HDFS (Hadoop Distributed File System): distributed storage system.
• YARN (Yet Another Resource Negotiator): resource and task management.
• MapReduce: batch data processing model.
How Spark fits in:
• Storage: Spark does not have its own storage system; it typically uses HDFS, Amazon S3, Cassandra, or NoSQL databases.
• Data processing: Spark can replace MapReduce but can also work with Hadoop YARN for resource management.
SPARK FEATURES
• 100x faster than MapReduce for large-scale data processing
• Provides powerful caching and disk persistence capabilities
• Can be programmed in Scala, Java, Python or R
• Can be deployed through Mesos, Hadoop via YARN, or Spark's own standalone cluster manager
SPARK ECOSYSTEM
Upper Layers (Processing Modules)
These components are libraries that provide different
functionalities for data analysis:
Spark SQL: Enables running interactive SQL queries on
structured data.
Spark Streaming: Handles real-time data stream
processing.
Spark MLlib: Provides tools for Machine Learning.
GraphX: Used for graph processing and analysis.
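As an illustration of the upper layers, here is a minimal PySpark sketch (assuming a SparkSession named spark is already available, for example in the pyspark shell; data is invented) that builds a DataFrame and queries it with Spark SQL:
# Build a small DataFrame and query it with Spark SQL
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()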
SPARK ECOSYSTEM
Spark Core Engine
This is the heart of Spark, responsible for:
• Managing tasks, resource allocation, and executing
programs.
• Optimizing tasks and handling memory management.
• Facilitating communication between different
components.
SPARK ECOSYSTEM
Lower Layers (Cluster Managers)
Spark can run on different cluster managers:
• YARN (Yet Another Resource Negotiator): Mainly
used in Hadoop for resource management.
• Mesos: Another resource manager that allows
resource sharing among multiple applications.
• Standalone Scheduler: A built-in mode where Spark
manages its own cluster resources.
• Kubernetes: Allows Spark to run in a containerized
environment.
SPARK ECOSYSTEM
Spark Core is the base engine for large-scale
parallel and distributed data processing
It is responsible for:
• Memory management and fault recovery
• Scheduling, distributing and monitoring
jobs on a cluster
• Interacting with storage systems
SPARK ARCHITECTURE
Master slave architecture:
• Driver
• Workers
SPARK ARCHITECTURE
1. DRIVER (Master Node)
The Driver is the central control point of a Spark
application. It is responsible for:
• Launching the application and maintaining its
state.
• Creating a SparkContext, which is the main
entry point to interact with Spark.
• Planning task execution and
communicating with the worker nodes.
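As a minimal sketch (application name and master URL are illustrative), this is how a driver program typically creates its SparkContext in PySpark:
from pyspark import SparkConf, SparkContext

# The driver builds a configuration and creates the SparkContext,
# which connects to the cluster manager and coordinates the workers.
conf = SparkConf().setAppName("MyDriverApp").setMaster("local[*]")
sc = SparkContext(conf=conf)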
SPARK ARCHITECTURE
2. Cluster Manager (Resource Manager)
The Cluster Manager handles resource allocation
to different Workers.
Spark can run on multiple cluster managers:
Hadoop YARN → Used within the Hadoop
ecosystem.
Apache Mesos → A general-purpose cluster
manager.
Kubernetes → Runs Spark in containerized
environments.
Spark Standalone → Spark’s built-in cluster
manager, requiring no external dependencies.
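For illustration, the cluster manager is typically selected through the --master option of spark-submit (hosts, ports, and the application file below are placeholders):
spark-submit --master yarn myapp.py                      # Hadoop YARN
spark-submit --master spark://host:7077 myapp.py         # Spark Standalone
spark-submit --master k8s://https://host:6443 myapp.py    # Kubernetes
spark-submit --master mesos://host:5050 myapp.py         # Mesos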
SPARK ARCHITECTURE
3. WORKERS (Compute Nodes)
Workers are the machines where computations take
place.
Each Worker:
• Hosts one or more Executors (processes that
execute tasks).
• Handles multiple tasks simultaneously to
speed up processing.
• Caches data to optimize performance.
RESILIENT DISTRIBUTED DATASET RDD
An RDD (Resilient Distributed Dataset) is the fundamental data structure in
Apache Spark. It is a collection of data distributed across multiple nodes in a cluster,
enabling parallel processing.
• Immutable
• Lazy evaluation
• In-memory computation
RDD stands for:
• Resilient: fault tolerant and capable of rebuilding data on failure
• Distributed: data distributed among the multiple nodes in a cluster
• Dataset: collection of partitioned data with values
• Three methods for creating an RDD:
Parallelizing an existing collection
Referencing a dataset (from any storage supported by Hadoop: HDFS, Cassandra, HBase, ...)
Transformation from an existing RDD
• Two types of RDD operations: Transformations and Actions
• Types of files supported: text files, sequence files, Hadoop InputFormat
RESILIENT DISTRIBUTED DATASET RDD
RDD OPERATIONS
Two types of RDD operations:
Transformations
• Create a DAG
• Lazy evaluation
• No return value
Actions
• Perform the transformations and the action that follows
• Return a value
Transformations
create a new dataset from an existing one. They are lazy, meaning they are only
executed when an action is called.
Actions (Trigger execution)
Actions trigger the execution of transformations and return a result.
RDD OPERATIONS
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Transformations
Transformation methods and their description:
• map(func) – Applies a transformation function on the dataset and returns the same number of elements in a distributed dataset.
• filter(func) – Returns a new RDD after applying a filter function on the source dataset.
• flatMap(func) – Returns a flattened map: if you have a dataset of arrays, each element of an array becomes a row. In other words, it returns 0 or more output items for each element in the dataset.
• distinct([numPartitions]) – Returns a new dataset that contains the distinct elements of the source dataset.
• cache() – Caches the RDD.
• groupByKey([numPartitions]) – When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
• reduceByKey(func, [numPartitions]) – Merges the values for each key with the specified function.
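A small PySpark sketch (sample data invented for illustration) combining several of these transformations:
rdd = sc.parallelize(["spark is fast", "spark is easy", "hadoop is robust"])
words = rdd.flatMap(lambda line: line.split(" "))           # split lines into words
pairs = words.map(lambda w: (w, 1))                         # build (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)              # sum counts per word
longWords = words.filter(lambda w: len(w) > 4).distinct()   # distinct words longer than 4 chars
# Nothing runs yet: transformations are lazy until an action is called.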
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: actions
Action methods and their description:
• collect(): Array[T] – Return the complete dataset as an Array.
• count(): Long – Return the count of elements in the dataset.
• first(): T – Return the first element in the dataset.
• foreach(f: (T) => Unit): Unit – Iterates over all elements in the dataset by applying function f to each element.
• max()(implicit ord: Ordering[T]): T – Return the maximum value from the dataset.
• reduce(f: (T, T) => T): T – Reduces the elements of the dataset using the specified binary operator.
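A short PySpark sketch (illustrative data) that triggers execution with a few of these actions:
nums = sc.parallelize([3, 1, 4, 1, 5, 9])
print(nums.collect())                    # [3, 1, 4, 1, 5, 9]
print(nums.count())                      # 6
print(nums.first())                      # 3
print(nums.max())                        # 9
print(nums.reduce(lambda a, b: a + b))   # 23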
RDD OPERATIONS
Transformations
Actions
SPARK INSTALLATION
• JDK
Apache Spark requires Java Development Kit (JDK) because Spark is written in Scala,
which runs on the Java Virtual Machine (JVM).
Once the JDK is installed, you must then specify the JAVA_HOME environment variable so
that Spark knows where the Java installation directory is located.
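For example, the variable might be set as follows (the JDK paths below are only illustrative placeholders):
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_202"          (Windows)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64           (Linux)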
SPARK INSTALLATION
• SPARK
Pre-built versions of Spark are available for download from the project's official website:
http://spark.apache.org/downloads.html
SPARK SHELL
•The Spark shell provides a simple way to learn Spark's API.
•It is also a powerful tool to analyze data interactively.
•The Shell is available in either Scala, which runs on the Java VM, or Python.
Scala:
• To launch the Scala shell: spark-shell
• To read a text file: scala> val textfile = sc.textFile("file.txt")
Python:
• To launch the Python shell: pyspark
• To read a text file:
>>> textfile = sc.textFile("file.txt")
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Basics
Loading a file: file = sc.textFile("monText.txt")
Applying a transformation: lineslength = file.map(lambda s: len(s))
Invoking an action: total = lineslength.reduce(lambda a, b: a + b)
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()
scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
Create a Spark RDD using textFile() or wholeTextFiles()
Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods, we can also read all files from a directory, or files matching a specific pattern.
• textFile() – Reads single or multiple text or CSV files and returns a single RDD[String].
• wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Create a Spark RDD using textFile() or wholeTextFiles()
This example reads all files from a directory, creates a single RDD and prints the contents
of the RDD.
# Load the files and create the RDD
rdd = spark.sparkContext.textFile("C:/tmp/files/*")
# Use foreach to print each line
rdd.foreach(lambda f: print(f))
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using textFile() or wholeTextFiles()
In the returned tuples, the first value f[0] is the file name and the second value f[1] is the content of the file.
rddWhole = sc.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(lambda f: print(f[0], "=>", f[1]))
RESILIENT DISTRIBUTED DATASET RDD
map
Returns a new distributed dataset formed by passing each element of the source through a
function.
pairs = test.map(lambda s: len(s))
pairs = test.map(lambda s: (s, len(s)))
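A small self-contained sketch of the two map() calls above, assuming test is an RDD of strings (sample data invented here):
test = sc.parallelize(["spark", "hadoop", "flink"])
lengths = test.map(lambda s: len(s))       # 5, 6, 5
pairs = test.map(lambda s: (s, len(s)))    # ("spark", 5), ("hadoop", 6), ("flink", 5)
print(pairs.collect())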
RESILIENT DISTRIBUTED DATASET RDD
flatMap
The Spark flatMap() transformation flattens the RDD after applying the function to every element and returns a new RDD.
The returned RDD can have the same number of elements or more. This is one of the major differences between flatMap() and map(): the map() transformation always returns the same number of elements as the input.
rdd1 = rdd.flatMap(lambda f: f.split(" "))
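For comparison, a short sketch showing map() versus flatMap() on the same illustrative data:
rdd = sc.parallelize(["hello world", "big data with spark"])
print(rdd.map(lambda f: f.split(" ")).collect())
# [['hello', 'world'], ['big', 'data', 'with', 'spark']]  : one list per element
print(rdd.flatMap(lambda f: f.split(" ")).collect())
# ['hello', 'world', 'big', 'data', 'with', 'spark']      : flattened into words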
RESILIENT DISTRIBUTED DATASET RDD
reduce
Spark RDD reduce() is an aggregate action function used to calculate the min, max, and total of the elements in a dataset.
# Create an RDD
list_rdd = sc.parallelize([1, 2, 3, 4, 5, 3, 2])
# Find the min using reduce function
min_value = list_rdd.reduce(lambda a, b: a if a < b else b)
print("output min using binary:", min_value)
# Find the max using reduce function
max_value = list_rdd.reduce(lambda a, b: a if a > b else b)
print("output max using binary:", max_value)
# Calculate Sum using reduce function
sum_value = list_rdd.reduce(lambda a, b: a + b)
print("output sum using binary:", sum_value)
RESILIENT DISTRIBUTED DATASET RDD
reduceByKey
The Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative reduce function.
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
rdd2.foreach(lambda x: print(x))
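A minimal sketch with invented (word, count) pairs so that the reduceByKey() call above is runnable:
rdd = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1)])
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())   # [('spark', 2), ('hadoop', 1)] (order may vary)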
RESILIENT DISTRIBUTED DATASET RDD
Filter transformation
Spark RDD filter is an operation that creates a new RDD by selecting the elements
from the input RDD that satisfy a given predicate (or condition).
filteredRDD = RDD.filter(lambda x: x % 2 == 0)
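For example, with an illustrative RDD of integers, the predicate keeps only the even values:
RDD = sc.parallelize(range(10))
filteredRDD = RDD.filter(lambda x: x % 2 == 0)
print(filteredRDD.collect())   # [0, 2, 4, 6, 8]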
RESILIENT DISTRIBUTED DATASET RDD
Word Count Example
val fileRDD = sc.textFile("File.txt")
  .map(l => l.toLowerCase())
  .flatMap(l => l.split(" "))
  .map(m => (m, 1))
  .reduceByKey(_ + _)
  .collect()
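The same word count expressed as a PySpark sketch (file name kept from the Scala example):
fileRDD = (sc.textFile("File.txt")
             .map(lambda l: l.lower())
             .flatMap(lambda l: l.split(" "))
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b)
             .collect())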
RDD ACTIONS
collect()
Return the complete dataset as an Array.
count()
Return the count of elements in the dataset.
SPARK CACHE
• In Apache Spark, the cache() function is used to store a DataFrame or RDD in memory to
improve performance when accessing the data multiple times.
When to cache
• If the DataFrame/RDD is used multiple times in calculations.
• If transformations are expensive to recompute.
• Avoid using it for very large datasets (it can cause memory issues).
val rdd_cached = rdd.cache()
SPARK CACHE
Benefits of caching DataFrame
• Reading data from the source (hdfs:// or s3://) is time-consuming. So after you read data from the source and apply all the common operations, cache it if you are going to reuse the data.
• By caching, you create a checkpoint in your Spark application, and if any task fails further down the execution, your application will be able to recompute the lost RDD partitions from the cache.
• If you don't have enough memory, data will be cached on the local disk of the executor, which will still be faster than reading from the source.
• If you can only cache a fraction of the data, it will still improve performance; the rest of the data can be recomputed by Spark, and that is what the "resilient" in RDD means.
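A short sketch of the typical caching pattern in PySpark (the path is illustrative):
rdd = sc.textFile("C:/tmp/files/*").map(lambda line: line.lower())
rdd.cache()     # mark the RDD to be kept in memory
rdd.count()     # first action: reads from disk and fills the cache
rdd.count()     # subsequent actions reuse the cached partitions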
SPARK DATAFRAME
• A DataFrame in Apache Spark is a tabular data structure similar to a table in a
relational database or a Pandas DataFrame in Python. It is based on the concept of
RDD, but with a more organized structure and optimized for processing large data
sets.
DataFrame Features
• Stores data in columns and rows, like an SQL
table.
• Well-defined schema: each column has a
name and a type (int, string, etc.).
• Distributed across multiple Spark cluster nodes
for massively parallel processing (MPP).
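As a first illustration (data invented), a DataFrame can be created directly from a Python collection, assuming a SparkSession named spark is available:
data = [("Alice", 34), ("Bob", 45), ("Carla", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.printSchema()   # each column has a name and an inferred type
df.show()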
SPARK DATAFRAME
• SparkSession is an abstraction introduced in Spark 2.0 to simplify the Spark
API. It consolidates multiple components (such as SparkContext, SQLContext,
HiveContext) into a single interface. It is now the primary object used in Spark
applications to work with DataFrames, Datasets, and other high-level
abstractions.
val df = spark.read.option("header","true")
.option("inferSchema", "true")
.csv("C:\\Users\\user\\Downloads\\LabData\\LabData\\nyctaxi.csv")
df.printSchema()
df.show()
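A PySpark equivalent sketch, creating the SparkSession explicitly and reading the same CSV file (the application name is illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NycTaxi").getOrCreate()
df = (spark.read.option("header", "true")
                .option("inferSchema", "true")
                .csv("C:\\Users\\user\\Downloads\\LabData\\LabData\\nyctaxi.csv"))
df.printSchema()
df.show()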
SPARK MAVEN APPLICATION
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
SPARK MAVEN APPLICATION
In production: create the wc.jar file, then submit the application:
spark-submit \
  --class sparkwordcount \
  --master local \
  wc.jar test.txt resultat