SPARK

The document provides an overview of Apache Hadoop and its basic modules, including HDFS, YARN, and MapReduce, along with other related technologies like Spark and HBase. It highlights the limitations of MapReduce and presents Apache Spark as a more efficient alternative for data processing, emphasizing its capabilities in handling various workflows, in-memory data sharing, and support for multiple programming languages. Additionally, it discusses real-world applications of Spark in companies like Uber and Netflix, as well as considerations for when not to use Spark.


Apache Hadoop Basic Modules

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Other modules: Zookeeper, Impala, Oozie, etc.

[Hadoop ecosystem stack]
• Spark, Storm, Tez, etc.
• Pig (scripting) | Hive (SQL-like queries) | HBase (non-relational database)
• MapReduce and others: distributed processing
• YARN: resource manager
• HDFS: distributed file system (storage)


HBase

• NoSQL data store built on top of HDFS
• Based on the Google BigTable paper (2006)
• Can handle various types of data
• Stores large amounts of data (TB, PB)
• Column-oriented data store
• Handles Big Data with random reads and writes
• Horizontally scalable

When not to use HBase:
• Not a replacement for a traditional RDBMS (Relational Database Management System)
  – Transactional applications
  – Data analytics
• Not efficient for text searching and processing


Map Reduce Paradigm

• Map and Reduce are based on functional programming

Map: apply a function to all the elements of a list
    list1 = [1,2,3,4,5]
    square x = x * x
    list2 = Map square(list1)
    print list2  ->  [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary value
    list1 = [1,2,3,4,5]
    A = reduce (+) list1
    print A  ->  15

Data flow: Input -> Map -> Reduce -> Output
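The same idea can be expressed directly in plain Python (a minimal sketch using only the standard library, independent of Hadoop or Spark):

from functools import reduce

list1 = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list
list2 = list(map(lambda x: x * x, list1))
print(list2)                            # [1, 4, 9, 16, 25]

# Reduce: combine all the elements into a single summary value
A = reduce(lambda a, b: a + b, list1)
print(A)                                # 15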


Shortcomings of MapReduce
• Forces your data processing into the Map and Reduce pattern
  – Other workflows are missing, e.g. join, filter, flatMap,
    groupByKey, union, intersection, …
• Based on an "acyclic data flow" from disk to disk (HDFS)
• Reads from and writes to disk before and after each Map and Reduce
  (no state kept in memory between jobs)
  – Not efficient for iterative tasks, e.g. machine learning
• Only Java is natively supported
  – Support for other languages is needed
• Only suited for batch processing
  – No interactivity or streaming data
One Solution: Apache Spark
• A new general framework that solves many of the shortcomings of
  MapReduce
• Capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase,
  S3, …
• Offers many other workflows, e.g. join, filter, flatMap, distinct, groupByKey,
  reduceByKey, sortByKey, collect, count, first, …
  – (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine learning
  algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• The Spark API is extremely simple to use
• Developed at AMPLab, UC Berkeley; now developed by Databricks.com
Spark Uses Memory instead of Disk

Hadoop: disk-based data sharing
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS
(HDFS read -> Iteration 1 -> HDFS write -> HDFS read -> Iteration 2 -> HDFS write)]

Spark: in-memory data sharing
[Diagram: data is read from HDFS once and then shared between iterations in memory
(HDFS read -> Iteration 1 -> memory -> Iteration 2)]
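A minimal PySpark sketch of this pattern, assuming an existing SparkContext `sc` (the data and the per-iteration computation are illustrative stand-ins):

# Stand-in for a dataset that would normally be read from HDFS
data = sc.parallelize(range(1, 1000001))
data.cache()   # keep the RDD in memory after it is first computed

total = 0
for i in range(10):
    # each iteration reuses the cached RDD instead of re-reading it from disk
    total += data.map(lambda x: x * i).reduce(lambda a, b: a + b)
print(total)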
Sort competition

                          Hadoop MR record (2013)    Spark record (2014)
Data size                 102.5 TB                   100 TB
Elapsed time              72 min                     23 min
# Nodes                   2,100                      206
# Cores                   50,400 physical            6,592 virtualized
Cluster disk throughput   3,150 GB/s (est.)          618 GB/s
Network                   dedicated data center,     virtualized (EC2),
                          10 Gbps                    10 Gbps network
Sort rate                 1.42 TB/min                4.27 TB/min
Sort rate/node            0.67 GB/min                20.7 GB/min

Spark was 3x faster using 1/10 the number of nodes.

Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records)

http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Apache Spark
Apache Spark supports data analysis, machine learning, graph processing, streaming data, etc. It
can read from and write to a range of data sources and allows development in multiple
languages.

[Spark stack]
• Languages: Scala, Java, Python, R, SQL
• Libraries: Spark SQL (DataFrames), MLlib (ML Pipelines), GraphX, Spark Streaming
• Engine: Spark Core
• Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
Resilient Distributed Datasets (RDDs)

• RDDs (Resilient Distributed Datasets) are Spark's data containers
• All the different processing components in Spark
  share the same abstraction, the RDD
• Because applications share the RDD abstraction, you can
  mix different kinds of transformations to create new
  RDDs
• Created by parallelizing a collection or reading a file (see the sketch below)
• Fault tolerant: lost partitions can be recomputed from their lineage
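A minimal PySpark sketch of both creation routes, assuming an existing SparkContext `sc` (the HDFS path is an illustrative placeholder):

# Create an RDD by parallelizing an in-memory collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD by reading a file (placeholder path)
rdd2 = sc.textFile("hdfs:///path/to/input.txt")

# Transformations on an RDD produce new RDDs
squares = rdd1.map(lambda x: x * x)
print(squares.collect())   # [1, 4, 9, 16, 25]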
DataFrames & Spark SQL
• A DataFrame (DF) is another kind of distributed dataset, organized
  into named columns
• Similar to a relational database table, a Python Pandas DataFrame, or an R
  data frame
  – Immutable once constructed
  – Track lineage
  – Enable distributed computations
• How to construct DataFrames (see the sketch below)
  – Read from file(s)
  – Transform an existing DF (Spark or Pandas)
  – Parallelize a Python collection (list)
  – Apply transformations and actions
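A brief sketch of these construction routes, assuming a SparkSession `spark` and a pandas DataFrame `pdf` already exist (paths and column names are illustrative placeholders):

# From a file (placeholder path)
df1 = spark.read.json("hdfs:///path/to/users.json")

# From an existing pandas DataFrame
df2 = spark.createDataFrame(pdf)

# From a Python collection (list of tuples plus column names)
df3 = spark.createDataFrame([("Alice", 20), ("Bob", 23)], ["name", "age"])

# Apply transformations and an action
df3.filter(df3.age < 21).show()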
DataFrame example
# Create a new DataFrame that contains only the "students"
students = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
students = users[users.age < 21]

# Count the number of student users by gender
students.groupBy("gender").count()

# Join students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
RDDs vs. DataFrames

• RDDs provide a low-level interface into Spark


• DataFrames have a schema
• DataFrames are cached and optimized by Spark
• DataFrames are built on top of the RDDs and the core
Spark API

[Chart: example of RDD vs. DataFrame performance]
Spark Operations

Transformations (create a new RDD):
  map, flatMap, filter, union, join, sample, cogroup, groupByKey, cross,
  reduceByKey, sortByKey, mapValues, intersection

Actions (return results to the driver program):
  collect, first, reduce, take, count, takeOrdered, takeSample, countByKey,
  save, lookup, foreach
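A brief PySpark sketch using a few of these operations, assuming an existing SparkContext `sc` (the data is illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Transformations: each returns a new RDD
summed  = pairs.reduceByKey(lambda a, b: a + b)
ordered = summed.sortByKey()

# Actions: return results to the driver program
print(ordered.collect())   # [('a', 4), ('b', 2)]
print(ordered.count())     # 2
print(ordered.first())     # ('a', 4)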
Directed Acyclic Graphs (DAGs)

[Diagram: a DAG of RDDs A through F connected by transformation arrows]

DAGs track dependencies (also known as lineage)
  – nodes are RDDs
  – arrows are transformations
Narrow vs. Wide Transformations

[Diagram: map is a narrow transformation, each output partition depends on a single input
partition; groupByKey is a wide transformation, records with the same key, e.g. (A,1) and
(A,2), are shuffled across partitions into (A,[1,2])]
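A small PySpark sketch contrasting the two kinds of transformation, assuming an existing SparkContext `sc`:

pairs = sc.parallelize([("A", 1), ("B", 5), ("A", 2)])

# Narrow transformation: each partition is processed independently, no shuffle
doubled = pairs.mapValues(lambda v: v * 2)

# Wide transformation: values for the same key are shuffled across partitions
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())   # e.g. [('A', [1, 2]), ('B', [5])]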
Actions

• What is an action
– The final stage of the workflow
– Triggers the execution of the DAG
– Returns the results to the driver
– Or writes the data to HDFS or to a file
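For example (a sketch assuming an existing SparkContext `sc`), the transformations below only build the DAG; nothing runs until the action on the last line:

nums = sc.parallelize(range(10))

# Transformations: lazily recorded in the DAG
evens   = nums.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Action: triggers execution of the DAG and returns results to the driver
print(squared.collect())   # [0, 4, 16, 36, 64]

# An action could instead write the data out (placeholder path):
# squared.saveAsTextFile("hdfs:///path/to/output")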
Spark Workflow

[Diagram: a workflow of flatMap -> map -> groupByKey transformations followed by a
collect action, which returns the results from the cluster to the Spark context in
the driver program]
Python RDD API Examples

• Word count

text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")

• Logistic Regression

# Assumes an existing SQLContext `sqlContext` and a collection `data` of
# (label, feature-vector) records.
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Examples from http://spark.apache.org/


RDD Persistence and Removal
• RDD Persistence
  – RDD.persist()
  – Storage levels:
    • MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …

• RDD Removal
  – RDD.unpersist()
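A short sketch of persisting and removing an RDD, assuming an existing SparkContext `sc`:

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Persist with an explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()       # the first action materializes and caches the RDD
rdd.count()       # later actions reuse the cached partitions

# Remove the RDD from the cache when it is no longer needed
rdd.unpersist()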
Spark’s Main Use Cases

• Streaming Data
• Machine Learning
• Interactive Analysis
• Data Warehousing
• Batch Processing
• Exploratory Data Analysis
• Graph Data Analysis
• Spatial (GIS) Data Analysis
• And many more
My Spark Use Cases

• Fingerprint Matching
– Developed a Spark based fingerprint minutia
detection and fingerprint matching code
• Twitter Sentiment Analysis
– Developed a Spark based Sentiment Analysis code
for a Twitter dataset
Spark in the Real World (I)
• Uber – the online taxi company gathers terabytes of event data from its
  mobile users every day.
  – Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL
    pipeline
  – Converts raw unstructured event data into structured data as it is collected
  – Uses it further for more complex analytics and for optimizing operations

• Pinterest – uses a Spark ETL pipeline
  – Leverages Spark Streaming to gain immediate insight into how users all
    over the world are engaging with Pins, in real time
  – Can make more relevant recommendations as people navigate the site
  – Recommends related Pins
  – Helps determine which products to buy or destinations to visit
Spark in the Real World (II)
Here are a few other real-world use cases:

• Conviva – 4 million video feeds per month
  – This streaming video company is second only to YouTube.
  – Uses Spark to reduce customer churn by optimizing video streams and
    managing live video traffic
  – Maintains a consistently smooth, high-quality viewing experience

• Capital One – is using Spark and data science algorithms to understand its
  customers better
  – Developing the next generation of financial products and services
  – Finding attributes and patterns associated with a higher probability of fraud

• Netflix – leverages Spark for insights into user viewing habits and then
  recommends movies to users
  – User data is also used for content creation
Spark: when not to use

• Even though Spark is versatile, that doesn't mean Spark's
  in-memory capabilities are the best fit for all use cases:
  – For many simple use cases, Apache MapReduce and
    Hive might be a more appropriate choice
  – Spark was not designed as a multi-user environment
  – Spark users need to know whether the memory they
    have available is sufficient for a dataset
  – Adding more users adds complications, since the users
    will have to coordinate memory usage to run code
HPC and Big Data Convergence
• Clouds and supercomputers are both collections of computers
  networked together in a datacenter
• Clouds have different networking, I/O, CPU, and cost trade-offs
  than supercomputers
• Cloud workloads are data-oriented rather than computation-oriented
  and are less tightly coupled than supercomputer workloads
• The principles of parallel computing are the same on both
• Apache Hadoop and Spark vs. Open MPI
Conclusion
• Hadoop (HDFS, MapReduce)
  – Provides an easy solution for processing Big Data
  – Brought a paradigm shift in programming distributed systems
• Spark
  – Extends MapReduce with in-memory computation
  – Supports streaming, interactive, iterative, and machine learning
    tasks
• Changing the world
  – Made data processing cheaper, more efficient, and more scalable
  – Is the foundation of many other tools and software
