Apache Spark Streaming: Architecture and Fault Tolerance
© 2015 IBM Corporation
Apache Hadoop Day 2015
Paranth Thiruvengadam – Architect @ IBM
Sachin Aggarwal – Developer @ IBM
Spark Streaming
 Features of Spark Streaming
 High-level API (joins, windows, etc.)
 Fault-tolerant (exactly-once semantics achievable)
 Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)
Architecture
High Level Overview
Receiving Data
(Diagram) Input source → Receiver → Data blocks on executors:
 The driver runs receivers as long-running tasks.
 The receiver divides the stream into blocks and keeps them in memory.
 Data blocks are replicated to another executor.
Processing Data
(Diagram) Driver, receiver and executors holding data blocks:
 Every batch interval, the driver launches tasks to process the blocks.
 Results are stored to the data store.
What’s different from other Streaming applications?
Traditional Stream Processing
Load Balancing…
Node failure / Stragglers…
Word Count with Kafka
Fault Tolerance
 Why Care?
 Different guarantees for Data Loss
 At-least once
 Exactly once
 What can fail?
 Driver
 Executor
What happens when an executor fails?
What happens when the Driver fails?
Recovering Driver – Checkpointing
Driver restart
Driver restart – To-Do list
 Configure automatic driver restart
 Spark Standalone
 YARN
 Set a checkpoint in an HDFS-compatible file system
streamingContext.checkpoint(hdfsDirectory)
 Ensure the code uses checkpoints for recovery
def setupStreamingContext(): StreamingContext = {
  val context = new StreamingContext(…)
  val lines = KafkaUtils.createStream(…)
  …
  context.checkpoint(hdfsDir)
  context
}
val context = StreamingContext.getOrCreate(hdfsDir, setupStreamingContext)
context.start()
WAL for no data loss
Recover using WAL
Configuration – Enabling WAL
 Enable Checkpointing.
 Enable WAL in the Spark configuration
 sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
 The receiver should acknowledge the input source only after data is written to the WAL
 Disable in-memory replication (the WAL already provides durability); a configuration sketch follows
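A minimal configuration sketch (not from the deck) tying these settings together; the app name, checkpoint directory, host and port below are hypothetical placeholders, assuming the Spark 1.x streaming API:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical app name, checkpoint path and socket source -- adjust for your setup.
val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WALEnabledApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(sparkConf, Seconds(2))
// WAL files are written under the checkpoint directory, so checkpointing must be enabled.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/wal-app")

// With the WAL persisting received data, replicated storage levels (the *_2 variants) are unnecessary.
val lines = ssc.socketTextStream("localhost", 9998, StorageLevel.MEMORY_AND_DISK_SER)
lines.count().print()

ssc.start()
ssc.awaitTermination()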
Normal Processing
Restarting Failed Driver
Fault-Tolerant Semantics
Pipeline: Source → Receiving → Transforming → Outputting → Sink
Receiving: at-least once, with checkpointing / WAL
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
With the Kafka Direct API, Receiving becomes exactly once as well.
How to achieve the “exactly once” guarantee?
Before Kafka Direct API
Kafka Direct API
Benefits of this approach:
• Simplified parallelism
• Less storage needed
• Exactly-once semantics
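A hedged sketch of the direct approach (not from the deck), assuming the Spark 1.x spark-streaming-kafka artifact (Kafka 0.8 consumer) is on the classpath; the broker address and topic name are hypothetical:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

// No receiver and no WAL: each batch reads an exact offset range from each Kafka partition,
// and one DStream partition maps to one Kafka partition (simplified parallelism).
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val topics = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

val wordCounts = messages
  .map(_._2)                 // keep the message value, drop the key
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()

Because the offsets themselves define what each batch contains, reprocessing a batch after a failure re-reads the same records, which is what makes end-to-end exactly-once achievable with idempotent or transactional output.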
Demo
DEMO – SPARK STREAMING
OVERVIEW OF SPARK STREAMING
DISCRETIZED STREAMS (DSTREAMS)
• DStream is the basic abstraction in Spark Streaming.
• It is represented by a continuous series of RDDs (of the same type).
• Each RDD in a DStream contains data from a certain interval.
• DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) using a StreamingContext, or generated by transforming existing DStreams using operations such as `map`, `window` and `reduceByKeyAndWindow`.
DISCRETIZED STREAMS (DSTREAMS)
WORD COUNT
// Batch word count (Spark core):
val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WordCount")
val sc = new SparkContext(sparkConf)
val file = sc.textFile("filePath")
val words = file.flatMap(_.split(" "))
val pairs = words.map(x => (x, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFile(args(1))

// Streaming word count (Spark Streaming, socket source):
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("SocketStreaming")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9998)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
DEMO
KAFKA STREAM
// Socket source (for comparison):
val lines = ssc.socketTextStream("localhost", 9998)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Kafka (receiver-based) source:
val zkQuorum = "localhost:2181"
val group = "test"
val topics = "test"
val numThreads = "1"
val topicMap = topics
  .split(",")
  .map((_, numThreads.toInt))
  .toMap
val lines = KafkaUtils
  .createStream(ssc, zkQuorum, group, topicMap)
  .map(_._2)
val words = lines.flatMap(_.split(" "))
……..
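The elided tail mirrors the socket version above; a self-contained sketch of the receiver-based Kafka word count, assuming a local ZooKeeper at localhost:2181 and a topic named test (both hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

// topic -> number of receiver threads
val topicMap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "test", topicMap).map(_._2)

val wordCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()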
DEMO
OPERATIONS
• Repartition
• Operation on an RDD (example: print the partition count of each RDD)

val re_lines = lines.repartition(5)
re_lines.foreachRDD(x => fun(x))

def fun(rdd: RDD[String]) = {
  print("partition count " + rdd.partitions.length)
}
DEMO
STATELESS TRANSFORMATIONS
• map() Apply a function to each element in the DStream and return a DStream of the result.
• ds.map(x => x + 1)
• flatMap() Apply a function to each element in the DStream and return a DStream of the contents
of the iterators returned.
• ds.flatMap(x => x.split(" "))
• filter() Return a DStream consisting of only elements that pass the condition passed to filter.
• ds.filter(x => x != 1)
• repartition() Change the number of partitions of the DStream.
• ds.repartition(10)
• reduceByKey() Combine values with the same key in each batch.
• ds.reduceByKey((x, y) => x + y)
• groupByKey() Group values with the same key in each batch.
• ds.groupByKey()
DEMO
STATEFUL TRANSFORMATIONS
Stateful transformations require checkpointing to be enabled in your StreamingContext for fault tolerance (see the sketch below).
• Windowed transformations: windowed computations allow you to apply transformations over a sliding window of data.
• updateStateByKey transformation: maintains arbitrary per-key state by providing access to a state variable for DStreams of key/value pairs.
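A minimal sketch of the checkpointing prerequisite (not from the deck), with a local master and a hypothetical HDFS directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setMaster("local[2]").setAppName("StatefulApp"), Seconds(2))

// Must be set before any stateful transformation (updateStateByKey, windowed ops with an
// inverse function); the directory stores metadata and the periodically checkpointed state RDDs.
ssc.checkpoint("hdfs://namenode:8020/checkpoints/stateful-app")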
DEMO
WINDOW OPERATIONS
Any window operation needs to specify two parameters:
• window length - the duration of the window.
• sliding interval - the interval at which the window operation is performed.
These two parameters must be multiples of the batch interval of the source DStream, as in the sketch below.
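A short sketch illustrating the constraint, assuming a hypothetical socket source; the batch interval is 10 seconds, so a 30-second window sliding every 10 seconds is valid:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
val ssc = new StreamingContext(conf, Seconds(10))   // batch interval = 10 s

val pairs = ssc.socketTextStream("localhost", 9998)
  .flatMap(_.split(" "))
  .map((_, 1))

// window length 30 s and sliding interval 10 s are both multiples of the 10 s batch interval
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()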
DEMO
WINDOWED TRANSFORMATIONS
• window(windowLength, slideInterval)
• Return a new DStream, computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval)
• Return a sliding window count of elements in the stream.
• val totalWordCount = words.countByWindow(Seconds(30), Seconds(10))
• reduceByWindow(func, invFunc, windowLength, slideInterval)
• Return a new single-element stream, created by aggregating elements in the stream over a sliding window using func; the optional inverse function invFunc removes elements that leave the window.
• The functions should be associative so that they can be computed correctly in parallel.
• val totalWordCount = pairs.map(_._2).reduceByWindow((x, y) => x + y, (x, y) => x - y, Seconds(30), Seconds(10))
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
• Returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
• val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
• countByValueAndWindow(windowLength, slideInterval)
• Returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window.
• val eachWordCount = words.countByValueAndWindow(Seconds(30), Seconds(10))
DEMO
UPDATE STATE BY KEY TRANSFORMATION
• updateStateByKey()
• Provides access to a state variable for DStreams of key/value pairs.
• The user provides a function updateFunc(events, oldState) and, optionally, an initialRDD.
• val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
• val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }
• val stateCount = pairs.updateStateByKey[Int](updateFunc)
DEMO
TRANSFORM OPERATION
• The transform operation allows arbitrary RDD-to-RDD
functions to be applied on a DStream.
• It can be used to apply any RDD operation that is not
exposed in the DStream API.
• For example, the functionality of joining every batch in a
data stream with another dataset is not directly exposed
in the DStream API.
• val cleanedDStream = wordCounts.transform(rdd => {
rdd.join(data)
})
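The data RDD in the snippet above is left undefined on the slide; a hedged, illustrative variant below joins against a hypothetical blacklist RDD, reusing ssc and wordCounts from the earlier streaming word-count sketch:

// Hypothetical blacklist of words to drop, built once on the driver.
val blacklist = ssc.sparkContext.parallelize(Seq(("spamword", true)))

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.leftOuterJoin(blacklist)                        // (word, (count, Option[flag]))
     .filter { case (_, (_, flag)) => flag.isEmpty }  // keep words not on the blacklist
     .mapValues { case (count, _) => count }          // back to (word, count)
}
cleanedDStream.print()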
DEMO
JOIN OPERATIONS
• Stream-stream joins:
• Streams can be very easily joined with other streams.
• val stream1: DStream[(String, String)] = ...
• val stream2: DStream[(String, String)] = ...
• val joinedStream = stream1.join(stream2)
• Windowed join
• val windowedStream1 = stream1.window(Seconds(20))
• val windowedStream2 = stream2.window(Minutes(1))
• val joinedStream = windowedStream1.join(windowedStream2)
• Stream-dataset joins
• val dataset: RDD[(String, String)] = ...
• val windowedStream = stream.window(Seconds(20))...
• val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
DEMO
USING FOREACHRDD()
• foreachRDD is a powerful primitive that allows data to be sent out to
external systems.
• dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection)
}
}
• Using foreachRDD, each RDD can be converted to a DataFrame, registered as a temporary table, and then queried using SQL.
• words.foreachRDD { rdd =>
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val wordsDataFrame = rdd.toDF("word")
wordsDataFrame.registerTempTable("words")
val wordCountsDataFrame =
sqlContext.sql("select word, count(*) as total from words group by word")
wordCountsDataFrame.show()
}
DEMO
DSTREAMS (SPARK CODE)
• A DStream is internally characterized by a few basic properties:
• A list of other DStreams that the DStream depends on
• A time interval at which the DStream generates an RDD
• A function that is used to generate an RDD after each time interval
• Methods that should be implemented by subclasses of DStream:
• Time interval after which the DStream generates an RDD
• def slideDuration: Duration
• List of parent DStreams on which this DStream depends
• def dependencies: List[DStream[_]]
• Method that generates an RDD for the given time
• def compute(validTime: Time): Option[RDD[T]]
• The DStream class contains the basic operations available on all DStreams, such as `map`, `filter` and `window`. In addition, PairDStreamFunctions contains operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and `join`. These operations are automatically available on any DStream of pairs (e.g., DStream[(Int, Int)]) through implicit conversions.
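To make the three properties concrete, here is a toy sketch (not from the deck) of a custom input DStream; the class name and the constant record it emits are hypothetical, and because it captures the StreamingContext it is not suitable for checkpoint-based recovery:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Emits one constant record per batch interval. InputDStream already defines
// slideDuration (the batch interval) and dependencies (an empty list), so only
// start(), stop() and compute() need to be provided.
class ConstantLineDStream(ssc0: StreamingContext, line: String)
  extends InputDStream[String](ssc0) {

  override def start(): Unit = {}   // nothing to set up
  override def stop(): Unit = {}    // nothing to tear down

  // Called by the scheduler once per batch interval to produce that batch's RDD.
  override def compute(validTime: Time): Option[RDD[String]] =
    Some(ssc0.sparkContext.parallelize(Seq(s"$line @ $validTime")))
}

Wiring it up is just new ConstantLineDStream(ssc, "hello").print() before ssc.start().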
Editor's Notes
  • #5, #9–#11: Continuous operator processing model: each node continuously receives records, updates internal state, and emits new records. Latency is low, but fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model: in each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.
  • #19: Have to have a sample code before coming to this slide.
  • #23: (i) reference IDs of the blocks for locating their data in the executor memory, and (ii) offset information of the block data in the logs.
  • #26: Have to read on Kafka Direct API.