SPARK

The document provides an overview of Apache Hadoop and its basic modules, including HDFS, YARN, and MapReduce, along with other related technologies like Spark and HBase. It highlights the limitations of MapReduce and presents Apache Spark as a more efficient alternative for data processing, emphasizing its capabilities in handling various workflows, in-memory data sharing, and support for multiple programming languages. Additionally, it discusses real-world applications of Spark in companies like Uber and Netflix, as well as considerations for when not to use Spark.


Apache Hadoop Basic Modules

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Other modules: Zookeeper, Impala, Oozie, etc.

[Hadoop ecosystem stack]
• Spark, Storm, Tez, etc.
• Pig (scripting) | Hive (SQL-like queries) | HBase (non-relational database)
• MapReduce and others: distributed processing
• YARN: resource manager
• HDFS: distributed file system (storage)


HBase

• NoSQL data store built on top of HDFS
• Based on the Google BigTable paper (2006)
• Can handle various types of data
• Stores large amounts of data (TB, PB)
• Column-oriented data store
• Handles Big Data with random reads and writes
• Horizontally scalable

When not to use HBase:
• Not a replacement for a traditional RDBMS (Relational Database Management System)
  – Transactional applications
  – Data analytics
• Not efficient for text searching and processing


Map Reduce Paradigm

• Map and Reduce are based on functional programming

Map: apply a function to all the elements of a list
    list1 = [1,2,3,4,5]
    square x = x * x
    list2 = Map square(list1)
    print list2  ->  [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary value
    list1 = [1,2,3,4,5]
    A = reduce (+) list1
    print A  ->  15

Data flow: Input -> Map -> Reduce -> Output
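The same idea can be expressed directly in plain Python (a minimal sketch using only the standard library, independent of Hadoop or Spark):

from functools import reduce

list1 = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list
list2 = list(map(lambda x: x * x, list1))
print(list2)                            # [1, 4, 9, 16, 25]

# Reduce: combine all the elements into a single summary value
A = reduce(lambda a, b: a + b, list1)
print(A)                                # 15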


Shortcomings of MapReduce
• Forces your data processing into the Map and Reduce pattern
  – Other workflows are missing, e.g. join, filter, flatMap,
    groupByKey, union, intersection, …
• Based on an "acyclic data flow" from disk to disk (HDFS)
• Reads from and writes to disk before and after each Map and Reduce
  (no state kept in memory between jobs)
  – Not efficient for iterative tasks, e.g. machine learning
• Only Java is natively supported
  – Support for other languages is needed
• Only suited for batch processing
  – No interactivity or streaming data
One Solution: Apache Spark
• A new general framework that solves many of the shortcomings of
  MapReduce
• Capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase,
  S3, …
• Offers many other workflows, e.g. join, filter, flatMap, distinct, groupByKey,
  reduceByKey, sortByKey, collect, count, first, …
  – (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine learning
  algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• The Spark API is extremely simple to use
• Developed at AMPLab, UC Berkeley; now developed by Databricks.com
Spark Uses Memory instead of Disk

Hadoop: disk-based data sharing
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS
(HDFS read -> Iteration 1 -> HDFS write -> HDFS read -> Iteration 2 -> HDFS write)]

Spark: in-memory data sharing
[Diagram: data is read from HDFS once and then shared between iterations in memory
(HDFS read -> Iteration 1 -> memory -> Iteration 2)]
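A minimal PySpark sketch of this pattern, assuming an existing SparkContext `sc` (the data and the per-iteration computation are illustrative stand-ins):

# Stand-in for a dataset that would normally be read from HDFS
data = sc.parallelize(range(1, 1000001))
data.cache()   # keep the RDD in memory after it is first computed

total = 0
for i in range(10):
    # each iteration reuses the cached RDD instead of re-reading it from disk
    total += data.map(lambda x: x * i).reduce(lambda a, b: a + b)
print(total)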
Sort competition

                          Hadoop MR record (2013)    Spark record (2014)
Data size                 102.5 TB                   100 TB
Elapsed time              72 min                     23 min
# Nodes                   2,100                      206
# Cores                   50,400 physical            6,592 virtualized
Cluster disk throughput   3,150 GB/s (est.)          618 GB/s
Network                   dedicated data center,     virtualized (EC2),
                          10 Gbps                    10 Gbps network
Sort rate                 1.42 TB/min                4.27 TB/min
Sort rate/node            0.67 GB/min                20.7 GB/min

Spark was 3x faster using 1/10 the number of nodes.

Sort benchmark, Daytona Gray: sort of 100 TB of data (1 trillion records)

http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Apache Spark
Apache Spark supports data analysis, machine learning, graph processing, streaming data, etc. It
can read from and write to a range of data sources and allows development in multiple
languages.

[Spark stack]
• Languages: Scala, Java, Python, R, SQL
• Libraries: Spark SQL (DataFrames), MLlib (ML Pipelines), GraphX, Spark Streaming
• Engine: Spark Core
• Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
Resilient Distributed Datasets (RDDs)

• RDDs (Resilient Distributed Datasets) are Spark's data containers
• All the different processing components in Spark
  share the same abstraction, the RDD
• Because applications share the RDD abstraction, you can
  mix different kinds of transformations to create new
  RDDs
• Created by parallelizing a collection or reading a file (see the sketch below)
• Fault tolerant: lost partitions can be recomputed from their lineage
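A minimal PySpark sketch of both creation routes, assuming an existing SparkContext `sc` (the HDFS path is an illustrative placeholder):

# Create an RDD by parallelizing an in-memory collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD by reading a file (placeholder path)
rdd2 = sc.textFile("hdfs:///path/to/input.txt")

# Transformations on an RDD produce new RDDs
squares = rdd1.map(lambda x: x * x)
print(squares.collect())   # [1, 4, 9, 16, 25]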
DataFrames & Spark SQL
• A DataFrame (DF) is another kind of distributed dataset, organized
  into named columns
• Similar to a relational database table, a Python Pandas DataFrame, or an R
  data frame
  – Immutable once constructed
  – Track lineage
  – Enable distributed computations
• How to construct DataFrames (see the sketch below)
  – Read from file(s)
  – Transform an existing DF (Spark or Pandas)
  – Parallelize a Python collection (list)
  – Apply transformations and actions
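A brief sketch of these construction routes, assuming a SparkSession `spark` and a pandas DataFrame `pdf` already exist (paths and column names are illustrative placeholders):

# From a file (placeholder path)
df1 = spark.read.json("hdfs:///path/to/users.json")

# From an existing pandas DataFrame
df2 = spark.createDataFrame(pdf)

# From a Python collection (list of tuples plus column names)
df3 = spark.createDataFrame([("Alice", 20), ("Bob", 23)], ["name", "age"])

# Apply transformations and an action
df3.filter(df3.age < 21).show()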
DataFrame example
# Create a new DataFrame that contains only the "students"
students = users.filter(users.age < 21)

# Alternatively, using Pandas-like syntax
students = users[users.age < 21]

# Count the number of student users by gender
students.groupBy("gender").count()

# Join students with another DataFrame called logs
students.join(logs, logs.userId == users.userId, "left_outer")
RDDs vs. DataFrames

• RDDs provide a low-level interface into Spark


• DataFrames have a schema
• DataFrames are cached and optimized by Spark
• DataFrames are built on top of the RDDs and the core
Spark API

[Chart: example of RDD vs. DataFrame performance]
Spark Operations

Transformations (create a new RDD):
  map, flatMap, filter, union, join, sample, cogroup, groupByKey, cross,
  reduceByKey, sortByKey, mapValues, intersection

Actions (return results to the driver program):
  collect, first, reduce, take, count, takeOrdered, takeSample, countByKey,
  save, lookup, foreach
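A brief PySpark sketch using a few of these operations, assuming an existing SparkContext `sc` (the data is illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Transformations: each returns a new RDD
summed  = pairs.reduceByKey(lambda a, b: a + b)
ordered = summed.sortByKey()

# Actions: return results to the driver program
print(ordered.collect())   # [('a', 4), ('b', 2)]
print(ordered.count())     # 2
print(ordered.first())     # ('a', 4)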
Directed Acyclic Graphs (DAGs)

[Diagram: a DAG of RDDs A through F connected by transformation arrows]

DAGs track dependencies (also known as lineage)
  – nodes are RDDs
  – arrows are transformations
Narrow vs. Wide Transformations

[Diagram: map is a narrow transformation, each output partition depends on a single input
partition; groupByKey is a wide transformation, records with the same key, e.g. (A,1) and
(A,2), are shuffled across partitions into (A,[1,2])]
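A small PySpark sketch contrasting the two kinds of transformation, assuming an existing SparkContext `sc`:

pairs = sc.parallelize([("A", 1), ("B", 5), ("A", 2)])

# Narrow transformation: each partition is processed independently, no shuffle
doubled = pairs.mapValues(lambda v: v * 2)

# Wide transformation: values for the same key are shuffled across partitions
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())   # e.g. [('A', [1, 2]), ('B', [5])]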
Actions

• What is an action
– The final stage of the workflow
– Triggers the execution of the DAG
– Returns the results to the driver
– Or writes the data to HDFS or to a file
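For example (a sketch assuming an existing SparkContext `sc`), the transformations below only build the DAG; nothing runs until the action on the last line:

nums = sc.parallelize(range(10))

# Transformations: lazily recorded in the DAG
evens   = nums.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Action: triggers execution of the DAG and returns results to the driver
print(squared.collect())   # [0, 4, 16, 36, 64]

# An action could instead write the data out (placeholder path):
# squared.saveAsTextFile("hdfs:///path/to/output")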
Spark Workflow

[Diagram: a workflow of flatMap -> map -> groupByKey transformations followed by a
collect action, which returns the results from the cluster to the Spark context in
the driver program]
Python RDD API Examples

• Word count

text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")

• Logistic Regression

# Assumes an existing SQLContext `sqlContext` and a collection `data` of
# (label, feature-vector) records.
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Examples from http://spark.apache.org/


RDD Persistence and Removal
• RDD Persistence
  – RDD.persist()
  – Storage levels:
    • MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …

• RDD Removal
  – RDD.unpersist()
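A short sketch of persisting and removing an RDD, assuming an existing SparkContext `sc`:

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Persist with an explicit storage level
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()       # the first action materializes and caches the RDD
rdd.count()       # later actions reuse the cached partitions

# Remove the RDD from the cache when it is no longer needed
rdd.unpersist()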
Spark’s Main Use Cases

• Streaming Data
• Machine Learning
• Interactive Analysis
• Data Warehousing
• Batch Processing
• Exploratory Data Analysis
• Graph Data Analysis
• Spatial (GIS) Data Analysis
• And many more
My Spark Use Cases

• Fingerprint Matching
– Developed a Spark based fingerprint minutia
detection and fingerprint matching code
• Twitter Sentiment Analysis
– Developed a Spark based Sentiment Analysis code
for a Twitter dataset
Spark in the Real World (I)
• Uber – the online taxi company gathers terabytes of event data from its
  mobile users every day.
  – Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL
    pipeline
  – Converts raw unstructured event data into structured data as it is collected
  – Uses it further for more complex analytics and for optimizing operations

• Pinterest – uses a Spark ETL pipeline
  – Leverages Spark Streaming to gain immediate insight into how users all
    over the world are engaging with Pins, in real time
  – Can make more relevant recommendations as people navigate the site
  – Recommends related Pins
  – Helps determine which products to buy or destinations to visit
Spark in the Real World (II)
Here are a few other real-world use cases:

• Conviva – 4 million video feeds per month
  – This streaming video company is second only to YouTube.
  – Uses Spark to reduce customer churn by optimizing video streams and
    managing live video traffic
  – Maintains a consistently smooth, high-quality viewing experience

• Capital One – is using Spark and data science algorithms to understand its
  customers better
  – Developing the next generation of financial products and services
  – Finding attributes and patterns associated with a higher probability of fraud

• Netflix – leverages Spark for insights into user viewing habits and then
  recommends movies to users
  – User data is also used for content creation
Spark: when not to use

• Even though Spark is versatile, that doesn't mean Spark's
  in-memory capabilities are the best fit for all use cases:
  – For many simple use cases, Apache MapReduce and
    Hive might be a more appropriate choice
  – Spark was not designed as a multi-user environment
  – Spark users need to know whether the memory they
    have available is sufficient for a dataset
  – Adding more users adds complications, since the users
    will have to coordinate memory usage to run code
HPC and Big Data Convergence
• Clouds and supercomputers are both collections of computers
  networked together in a datacenter
• Clouds have different networking, I/O, CPU, and cost trade-offs
  than supercomputers
• Cloud workloads are data-oriented rather than computation-oriented
  and are less tightly coupled than supercomputer workloads
• The principles of parallel computing are the same on both
• Apache Hadoop and Spark vs. Open MPI
Conclusion
• Hadoop (HDFS, MapReduce)
  – Provides an easy solution for processing Big Data
  – Brought a paradigm shift in programming distributed systems
• Spark
  – Extends MapReduce with in-memory computation
  – Supports streaming, interactive, iterative, and machine learning
    tasks
• Changing the world
  – Made data processing cheaper, more efficient, and more scalable
  – Is the foundation of many other tools and software
