
INF4101 – Big Data and NoSQL

5. SPARK

Dr Mouhim Sanaa
HADOOP MAPREDUCE

Limitations of Hadoop MapReduce

Hadoop MapReduce is a powerful framework for distributed processing of large
datasets, but it has several limitations:

• Difficult to Program
The MapReduce API is low-level and requires extensive code for relatively simple
tasks.

• I/O Intensive
 The MapReduce model relies on intermediate storage of results between the
map and reduce phases, leading to numerous disk read/write operations.
 This limits performance, especially for tasks that require repeated operations on
the same data (e.g., iterative algorithms like machine learning).
HADOOP MAPREDUCE

Limitations of Hadoop MapReduce

• High Latency
 MapReduce is designed for batch processing, making it unsuitable for
applications requiring low latency (real-time or near real-time).
 Each MapReduce task involves reading/writing from HDFS, significantly slowing
execution.

• Inefficient for Iterative and Interactive Tasks


 Machine learning and graph analysis algorithms often require multiple
passes over the data, which is inefficient with MapReduce.
 Systems like Apache Spark, which keep data in memory, are much more efficient
for such tasks.

• Not Suitable for Stream Processing


 Hadoop MapReduce is not designed for real-time data stream processing.
 Solutions like Apache Flink or Apache Kafka Streams are better suited for streaming workloads.
HADOOP MAPREDUCE ALTERNATIVES

Several solutions have been developed to overcome MapReduce's limitations:


• Batch Processing

 Apache Spark
 Stores data in-memory (RDD, DataFrame) to avoid unnecessary disk writes.
 Much faster than MapReduce for repetitive tasks and iterative algorithms.
 Simpler API with support for Python (PySpark), Scala, Java, and R.
 Compatible with Hadoop, HDFS, S3, and NoSQL databases (Cassandra, HBase, etc.).

Use Cases
 Large-scale batch data processing.
 Machine Learning with MLlib.
 Distributed SQL processing with Spark SQL.
Disadvantages
Requires a lot of RAM for good performance.
HADOOP MAPREDUCE ALTERNATIVES

Several solutions have been developed to overcome MapReduce's limitations:


• Batch Processing

 Apache Flink
 Supports batch and stream processing with the same API.
 Advanced memory management, sometimes better than Spark.
 Efficient for iterative algorithms and graph processing.
Use Cases
 Big Data analytics in batch and streaming.
 Log processing and complex event handling.
 Distributed machine learning.
Disadvantages
 Less widely adopted than Spark, meaning fewer resources and community support.
HADOOP MAPREDUCE ALTERNATIVES

Several solutions have been developed to overcome MapReduce's limitations:


• Batch Processing

 Apache Tez
 Replaces MapReduce in Apache Hive and Apache Pig.
 Optimized to execute DAG-based tasks (Directed Acyclic Graph).
 Reduces intermediate read/write operations on HDFS.
Use Cases
 Running SQL queries on Hadoop via Hive.
 Replacing MapReduce for Pig and other Hadoop tools.
Disadvantages
 Not as fast as Spark for complex tasks.
SPARK

Apache Spark is an open-source distributed data processing engine designed to be
fast, scalable, and easy to use. It enables parallel computing on a cluster of machines.

Speed
• In-memory computations
• Faster than MapReduce for complex applications on disk
Generality
• Batch applications
• Iterative algorithms
• Interactive queries and streaming
Ease of use
• APIs for Scala, Python, Java and R
• Libraries for SQL, machine learning, streaming, and graph processing
• Runs on Hadoop clusters or standalone
SPARK
Spark is not intended to replace Hadoop; rather, it can be regarded as an extension to it.

Hadoop is a complete ecosystem with several modules:
• HDFS (Hadoop Distributed File System): Distributed storage system.
• YARN (Yet Another Resource Negotiator): Resource and task management.
• MapReduce: Batch data processing model.

Spark, in contrast:
• Storage: Spark does not have its own storage system; it typically uses HDFS, Amazon S3, Cassandra, or NoSQL databases.
• Data Processing: Spark can replace MapReduce but can also work with Hadoop YARN for resource management.
SPARK FEATURES

• 100x faster than MapReduce for large-scale data processing
• Provides powerful caching and disk persistence capabilities
• Can be programmed in Scala, Java, Python or R
• Can be deployed through Mesos, Hadoop via YARN, or Spark's standalone cluster manager
SPARK ECOSYSTEM

Upper Layers (Processing Modules)


These components are libraries that provide different
functionalities for data analysis:

• Spark SQL: Enables running interactive SQL queries on structured data.
• Spark Streaming: Handles real-time data stream processing.
• Spark MLlib: Provides tools for Machine Learning.
• GraphX: Used for graph processing and analysis.
SPARK ECOSYSTEM

Spark Core Engine

This is the heart of Spark, responsible for:

• Managing tasks, resource allocation, and executing programs.
• Optimizing tasks and handling memory management.
• Facilitating communication between different
components.
SPARK ECOSYSTEM

Lower Layers (Cluster Managers)

Spark can run on different cluster managers:

• YARN (Yet Another Resource Negotiator): Mainly used in Hadoop for resource management.
• Mesos: Another resource manager that allows
resource sharing among multiple applications.
• Standalone Scheduler: A built-in mode where Spark
manages its own cluster resources.
• Kubernetes: Allows Spark to run in a containerized
environment.
SPARK ECOSYSTEM

Spark Core is the base engine for large-scale parallel and distributed data processing.

It is responsible for:

• Memory management and fault recovery


• Scheduling, distributing and monitoring
jobs on a cluster
• Interacting with storage systems
SPARK ARCHITECTURE

Master-slave architecture:

• Driver
• Workers
SPARK ARCHITECTURE

1. DRIVER (Master Node)

The Driver is the central control point of a Spark application. It is responsible for:

• Launching the application and maintaining its state.
• Creating a SparkContext, which is the main
entry point to interact with Spark.
• Planning task execution and
communicating with the worker nodes.
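
For illustration, a minimal PySpark sketch of a driver program creating its SparkContext is shown below; the application name and the local master URL are assumed example values, not part of the course material.

from pyspark import SparkConf, SparkContext

# Example configuration: the app name and master URL are placeholders
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The driver plans tasks on this RDD; the executors run them
rdd = sc.parallelize(range(10))
print(rdd.count())

sc.stop()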
SPARK ARCHITECTURE

2. Cluster Manager (Resource Manager)

The Cluster Manager handles resource allocation to the different Workers.

Spark can run on multiple cluster managers:

• Hadoop YARN → Used within the Hadoop ecosystem.
• Apache Mesos → A general-purpose cluster manager.
• Kubernetes → Runs Spark in containerized environments.
• Spark Standalone → Spark's built-in cluster manager, requiring no external dependencies.
SPARK ARCHITECTURE

3. WORKERS (Compute Nodes)

Workers are the machines where computations take place.

Each Worker:
• Hosts one or more Executors (processes that
execute tasks).
• Handles multiple tasks simultaneously to
speed up processing.
• Caches data to optimize performance.
RESILIENT DISTRIBUTED DATASET RDD
An RDD (Resilient Distributed Dataset) is the fundamental data structure in
Apache Spark. It is a collection of data distributed across multiple nodes in a cluster,
enabling parallel processing.

• Properties: Immutable, lazy evaluation, in-memory computation
• Resilient: Fault tolerant and capable of rebuilding data on failure
• Distributed: Data is distributed among the multiple nodes in a cluster
• Dataset: Collection of partitioned data with values

• Three methods for creating an RDD:
 Parallelizing an existing collection
 Referencing a dataset from any storage supported by Hadoop (HDFS, Cassandra, HBase, …)
 Transformation from an existing RDD

• Two types of RDD operations: Transformations and Actions

• Types of files supported: text files, sequence files, Hadoop InputFormat
RESILIENT DISTRIBUTED DATASET RDD
RDD OPERATIONS

Two types of RDD operations

Transformations
• Create a DAG
• Lazy evaluation
• No return value

Actions
• Perform the transformations and the action that follows
• Return a value

Transformations
Transformations create a new dataset from an existing one. They are lazy, meaning they are only
executed when an action is called.

Actions (trigger execution)
Actions trigger the execution of transformations and return a result.
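
To make the lazy evaluation point concrete, here is a small PySpark sketch; it assumes an existing SparkContext sc, and the sample numbers are only illustrative.

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: nothing runs yet, Spark only records the DAG
squares = nums.map(lambda x: x * x)

# Action: triggers execution of the recorded transformations and returns a value
print(squares.collect())   # [1, 4, 9, 16, 25]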


RDD OPERATIONS
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: Transformations

Transformation methods and their usage:

map(func) – Applies a transformation function to each element of the dataset and returns the
same number of elements in a distributed dataset.
filter(func) – Returns a new RDD after applying the filter function on the source dataset.
flatMap(func) – Returns a flattened map: if the dataset contains arrays, each element of an
array becomes a row. In other words, it returns 0 or more output items for each element of
the dataset.
distinct([numPartitions]) – Returns a new dataset that contains the distinct elements of the
source dataset.
cache() – Caches the RDD.
groupByKey([numPartitions]) – When called on a dataset of (K, V) pairs, returns a dataset of
(K, Iterable<V>) pairs.
reduceByKey(func, [numPartitions]) – Merges the values for each key with the specified
function.
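
The following PySpark sketch illustrates several of these transformations together; it assumes an existing SparkContext sc, and the sample sentences are example data.

lines = sc.parallelize(["spark is fast", "spark is easy", "hadoop is batch"])

words = lines.flatMap(lambda l: l.split(" "))     # one row per word
long_words = words.filter(lambda w: len(w) > 2)   # keep words longer than 2 characters
pairs = long_words.map(lambda w: (w, 1))          # (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)    # merge the values for each key
unique = words.distinct().cache()                 # distinct words, kept in memory

print(counts.collect())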
RESILIENT DISTRIBUTED DATASET RDD
RDD operations: actions

Action methods and their usage:

collect(): Array[T] – Returns the complete dataset as an Array.
count(): Long – Returns the count of elements in the dataset.
first(): T – Returns the first element in the dataset.
foreach(f: (T) ⇒ Unit): Unit – Iterates over all elements in the dataset, applying function f to each element.
max()(implicit ord: Ordering[T]): T – Returns the maximum value from the dataset.
reduce(f: (T, T) ⇒ T): T – Reduces the elements of the dataset using the specified binary operator.
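
A small PySpark sketch of these actions, assuming an existing SparkContext sc and example data:

rdd = sc.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.collect())                    # the complete dataset as a list
print(rdd.count())                      # 6
print(rdd.first())                      # 3
print(rdd.max())                        # 9
print(rdd.reduce(lambda a, b: a + b))   # 23
rdd.foreach(lambda x: print(x))         # runs on the executors; output appears in their logs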
RDD OPERATIONS
Transformations

Actions
SPARK INSTALLATION

• JDK
Apache Spark requires Java Development Kit (JDK) because Spark is written in Scala,
which runs on the Java Virtual Machine (JVM).
Once the JDK is installed, you must then specify the JAVA_HOME environment variable so
that Spark knows where the Java installation directory is located.
SPARK INSTALLATION

• SPARK
Pre-built versions of Spark are available for download from the project's official website:
http://spark.apache.org/downloads.html
SPARK SHELL

•The Spark shell provides a simple way to learn Spark's API.


•It is also a powerful tool to analyze data interactively.
•The Shell is available in either Scala, which runs on the Java VM, or Python.

Scala:
• To launch the Scala shell: spark-shell

• To read a text file: scala> val textfile = sc.textFile("file.txt")

Python:
• To launch the Python shell: pyspark

• To read a text file:
>>> textfile = sc.textFile("file.txt")
RESILIENT DISTRIBUTED DATASET RDD

RDD operations: Basics

Loading a file:            file = sc.textFile("monText.txt")

Applying a transformation: lineslength = file.map(lambda s: len(s))

Invoking an action:        total = lineslength.reduce(lambda a, b: a + b)
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))

Create a Spark RDD using textFile() or wholeTextFiles()


Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to
read single or multiple text or CSV files into a single Spark RDD. These methods can also read all files from a
directory, or files matching a specific pattern.

• textFile() – Reads single or multiple text/CSV files and returns a single RDD[String].

• wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]],
where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()

rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Create a Spark RDD using textFile() or wholeTextFiles()


This example reads all files from a directory, creates a single RDD and prints the contents
of the RDD.
# Load the files and create the RDD
rdd = spark.sparkContext.textFile("C:/tmp/files/*")

# Use foreach to print each line
rdd.foreach(lambda f: print(f))
RESILIENT DISTRIBUTED DATASET RDD
Create a Spark RDD using Parallelize()

scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))

Create a Spark RDD using textFile() or wholeTextFiles()

where the first value f[0] in the tuple is the file name and the second value f[1] is the content of the file.

rddWhole = sc.wholeTextFiles("C:/tmp/files/*")

rddWhole.foreach(lambda f: print(f[0], "=>", f[1]))


RESILIENT DISTRIBUTED DATASET RDD
map

Returns a new distributed dataset formed by passing each element of the source through a
function.

pairs = test.map(lambda s: len(s))

pairs = test.map(lambda s: (s, len(s)))


RESILIENT DISTRIBUTED DATASET RDD
flatMap

The Spark flatMap() transformation flattens the RDD after applying the function to every
element and returns a new RDD.

The returned RDD can have the same number of elements as the input, or more. This is one of
the major differences between flatMap() and map(): the map() transformation always returns
the same number of elements as the input.

rdd1 = rdd.flatMap(lambda f: f.split(" "))
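
A self-contained comparison of map() and flatMap(), assuming an existing SparkContext sc and example sentences:

rdd = sc.parallelize(["hello world", "spark flatMap example"])

mapped = rdd.map(lambda f: f.split(" "))      # 2 elements, each a list of words
flat = rdd.flatMap(lambda f: f.split(" "))    # 5 elements, one word per element

print(mapped.collect())   # [['hello', 'world'], ['spark', 'flatMap', 'example']]
print(flat.collect())     # ['hello', 'world', 'spark', 'flatMap', 'example']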


RESILIENT DISTRIBUTED DATASET RDD
reduce

The Spark RDD reduce() action aggregates the elements of a dataset; it can be used, for example, to
compute the min, max, or sum of the elements.
# Create an RDD
list_rdd = sc.parallelize([1, 2, 3, 4, 5, 3, 2])

# Find the min using reduce function


min_value = list_rdd.reduce(lambda a, b: a if a < b else b)
print("output min using binary:", min_value)

# Find the max using reduce function


max_value = list_rdd.reduce(lambda a, b: a if a > b else b)
print("output max using binary:", max_value)

# Calculate Sum using reduce function


sum_value = list_rdd.reduce(lambda a, b: a + b)
print("output sum using binary:", sum_value)
RESILIENT DISTRIBUTED DATASET RDD
reduceByKey

Spark RDD reduceByKey() transformation is used to merge the values of each key using an
associative reduce function.

rdd2 = rdd.reduceByKey(lambda a, b: a + b)
rdd2.foreach(lambda x: print(x))
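
A self-contained version of this example, assuming an existing SparkContext sc and example key/value pairs:

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)])

rdd2 = rdd.reduceByKey(lambda a, b: a + b)
print(rdd2.collect())   # [('a', 3), ('b', 2)] (order may vary)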
RESILIENT DISTRIBUTED DATASET RDD
Filter transformation

Spark RDD filter is an operation that creates a new RDD by selecting the elements
from the input RDD that satisfy a given predicate (or condition).
filteredRDD = RDD.filter(lambda x: x % 2 == 0)
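
A self-contained version, assuming an existing SparkContext sc:

RDD = sc.parallelize(range(10))

filteredRDD = RDD.filter(lambda x: x % 2 == 0)   # keep only the even numbers
print(filteredRDD.collect())                     # [0, 2, 4, 6, 8]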
RESILIENT DISTRIBUTED DATASET RDD

Word Count Example


val fileRDD=
sc.textFile(« File.txt »)
File.txt

.map(l=>l.toLowerCase()

.flatMap(l=>l.split(« »)

.map(m=>(m,1))

.reduceByKey(_+_) .collect()
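
For comparison, an equivalent word count sketch in PySpark, assuming an existing SparkContext sc and the same input file:

counts = (sc.textFile("File.txt")
            .map(lambda l: l.lower())
            .flatMap(lambda l: l.split(" "))
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b)
            .collect())
print(counts)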
RDD ACTIONS
collect

Return the complete dataset as an Array

count

count() – Return the count of elements in the dataset.


SPARK CACHE

• In Apache Spark, the cache() function is used to store a DataFrame or RDD in memory to
improve performance when accessing the data multiple times.

When to cache

• If the DataFrame/RDD is used multiple times in calculations.
• If transformations are expensive to recompute.
• Avoid using it for very large datasets (it can cause memory issues).

val rdd_cached = rdd.cache()
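
A short PySpark sketch of caching a reused RDD, assuming an existing SparkContext sc and the file path used earlier in this document:

rdd = sc.textFile("C:/tmp/files/*").map(lambda l: l.strip())

rdd_cached = rdd.cache()     # marks the RDD for in-memory storage (lazy)
print(rdd_cached.count())    # the first action materializes and caches the partitions
print(rdd_cached.count())    # later actions reuse the cached partitions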
SPARK CACHE

Benefits of caching DataFrame

• Reading data from a source (hdfs:// or s3://) is time-consuming. So after you read data from the
source and apply all the common operations, cache it if you are going to reuse the data.

• By caching, you create a checkpoint in your Spark application, and if any task fails further down
the execution of the application, Spark will be able to recompute the lost RDD partitions from
the cache.

• If you do not have enough memory, the data will be cached on the executor's local disk, which
is still faster than reading from the source.

• If you can only cache a fraction of the data, it will still improve performance; the rest of the
data can be recomputed by Spark, and that is what "resilient" in RDD means.
SPARK DATAFRAME

• A DataFrame in Apache Spark is a tabular data structure similar to a table in a
relational database or a Pandas DataFrame in Python. It is based on the concept of
RDD, but with a more organized structure and optimized for processing large data sets.
DataFrame Features

• Stores data in columns and rows, like an SQL table.
• Well-defined schema: each column has a
name and a type (int, string, etc.).
• Distributed across multiple Spark cluster nodes
for massively parallel processing (MPP).
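
As an illustration, a minimal PySpark sketch of building a DataFrame with named, typed columns; it assumes an existing SparkSession spark, and the column names and rows are example data.

data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])

df.printSchema()   # each column has a name and a type
df.show()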

SPARK DATAFRAME

• SparkSession is an abstraction introduced in Spark 2.0 to simplify the Spark
API. It consolidates multiple components (such as SparkContext, SQLContext,
HiveContext) into a single interface. It is now the primary object used in Spark
applications to work with DataFrames, Datasets, and other high-level abstractions.

val df = spark.read.option("header","true")
.option("inferSchema", "true")
.csv("C:\\Users\\user\\Downloads\\LabData\\LabData\\nyctaxi.csv")

df.printSchema()

df.show()
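
A PySpark equivalent of the Scala snippet above, as a sketch; the application name is an assumed example value, and the CSV path is the one used in the Scala version.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("nyctaxi") \
    .getOrCreate()

df = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .csv("C:/Users/user/Downloads/LabData/LabData/nyctaxi.csv")

df.printSchema()
df.show()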
SPARK MAVEN APPLICATION

<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.2.1</version>
  </dependency>

  <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
SPARK MAVEN APPLICATION

In production: create the wc.jar file, then submit it with spark-submit.

spark-submit \
  --class sparkwordcount \
  --master local \
  wc.jar test.txt resultat
