Lecture - 20
Spark Streaming and Sliding Window Analytics (Part-II)
Spark Streaming Workflow.
Refer slide time :( 0:15)
So, the Spark Streaming workflow has four high-level stages, and let us see all these stages briefly. The first is to stream the data from various sources, and these sources are such as the Akka system, the Kafka system, Flume, AWS or Parquet for the real-time streaming data input. So, this streaming data is input using these different sources, Akka, Kafka, Flume and Parquet, and they give the live feed of streaming data to the system. Now, the second type of data sources are the so-called static data sources. They include HBase, which we have discussed, MySQL, PostgreSQL, MongoDB and Cassandra, and they are for the static or batch data feed into the Spark Streaming system. Once this data is streamed into the Spark system, we can perform machine learning on it using the machine learning API, that is, MLlib; further, Spark SQL can also be used to perform further operations on this data. Finally, the streaming output can be stored into various data storage systems like HBase, Cassandra, MemSQL, Kafka, Elasticsearch, HDFS, the local file system and so on.
Refer slide time :( 02:40)
So, therefore, the Spark Streaming workflow looks like this: the input streaming data comes from Kafka, Flume, HDFS, Kinesis or Twitter, and finally, after processing, the results are stored onto HDFS, databases, dashboards and so on. So, these incoming data streams are divided into batches of input data, which are given to the Spark engine for computation, and it processes them and gives back the output data. Now, the input data stream is divided into discrete chunks of data; the stream which is handled by Spark Streaming is called the 'Discretized Stream', or DStream. So, a DStream is the batches of data of X seconds. Here the batch is of, let us say, one second, so the batch from time zero to one second is called the 'RDD @ time 1', the batch from time one to time two, that is, the next one second, is called the 'RDD @ time 2', and so on. So, the stream is divided into discrete chunks; in this example it is divided into batches of one second, and the batch size can be as small as about half a second, for an end-to-end latency of better than one second. So, once this data stream, or DStream, is cut up by the Spark Streaming system, then various transformations can be applied, which are also part of the Spark Streaming system; for example, a flatMap operation can be applied on every DStream. So, when a flatMap operation is applied, it will give the different words which are contained in each one-second batch of the DStream; similarly, when this flatMap is applied on all the batches, it will extract the words from the entire input stream.
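As a minimal sketch of this idea (this is not the code from the slides; the socket source and the word-splitting logic are only illustrative), a StreamingContext can be created with a one-second batch interval and a flatMap applied to every batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordSplitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordSplitSketch").setMaster("local[2]")
    // each batch of the DStream covers one second of data (RDD @ time 1, RDD @ time 2, ...)
    val ssc = new StreamingContext(conf, Seconds(1))

    // hypothetical live source: a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // flatMap turns each one-second batch of lines into a batch of words
    val words = lines.flatMap(line => line.split(" "))
    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}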
Refer slide time :( 05:19)
And, this can be shown here in this example of getting the hashtags from Twitter. So, the Twitter stream is input into the Spark Streaming system, and after dividing it into the different DStream batches, the system performs a transformation on top of it, which is called 'flatMap'. The flatMap is defined on the tweets which are given into the system; from each tweet status it will get the hashtags, and these hashtags will be extracted by the transformation. So, the transformation will modify one DStream to create another DStream in this particular manner, and finally, from this tweet DStream, it will extract the hashtags in this particular example. So, this is shown over here: when a flatMap is applied on the tweet DStream, it will give the hashtag DStream, which contains hashtags such as #cat, #dog and so on for the different topics; it extracts them and generates a new RDD out of every batch.
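A minimal sketch of such a hashtag-extracting flatMap, assuming the tweets are already available as a DStream[String] of tweet statuses (the Twitter receiver setup is omitted here), could look like this:

import org.apache.spark.streaming.dstream.DStream

// from each tweet status, keep only the words that start with '#'
def extractHashTags(tweets: DStream[String]): DStream[String] =
  tweets.flatMap(status => status.split(" ").filter(word => word.startsWith("#")))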
Refer slide time :( 07:03)
After that, these hashtags can be saved into Hadoop as a file, and this will be the output. So, the output operation is to push this transformed data, in the form of hashtags, to the external storage, and here that is shown. But we do not have to store it every time; we can perform various analytics on these transformed hashtags. And maybe sometimes this transformed hashtag analytics will be required to update a website, or to perform various other applications, depending upon whatever we want to do with this output, for example by using foreach on each batch of results.
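A minimal sketch of these two kinds of output operations, reusing the hashTags DStream from the sketch above (the HDFS path is only a placeholder), could be:

// push the transformed hashtag DStream to external storage as text files
hashTags.saveAsTextFiles("hdfs://namenode:8020/user/demo/hashtags")

// alternatively, act on every batch of results, e.g. to update a website or a database
hashTags.foreachRDD { rdd =>
  rdd.take(10).foreach(println)   // here we just print a small sample from each batch
}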
Refer slide time :( 08:01)
So, therefore, different programming languages are supported. The example which we have seen was written in Scala; the same application can be written using Java. So, a Java API is also available with the Spark Streaming system.
Refer slide time :( 08:21)
And, now let us see about the fault tolerance. So, RDDs remember the sequence of operations that created them from the original fault-tolerant input data; therefore, RDDs know, using lineage, the sequence of operations by which they were created from the original data. This we know from the Spark fault tolerance system. In addition, the batches of input data are replicated in the memory of multiple worker nodes, and thereby we achieve fault tolerance in this case. So, whenever data is lost due to a worker failure, it can be recomputed using lineage from the replicated input, because these RDDs remember all these things. So, therefore, fault tolerance can also be ensured here, as part of the Spark system, in Spark Streaming.
Refer slide time :( 09:17)
So, let us see the key concepts. So far we have seen the DStream, which is a sequence of RDDs representing a stream of data, and these DStreams are created out of streams of data from Twitter, HDFS, Kafka, Flume and so on. And there are various transformations that can be applied on a DStream, which modify one DStream into another DStream. So, the standard RDD transformations are available, such as map, countByValue, reduce, join and so on; similarly, there are stateful operations which are available in the form of transformations, such as the window operations, countByValueAndWindow and so on, and we will see all these stateful operations. Now, besides the transformations, there are actions, or output operations, to be performed on the DStreams, which are available as part of the Spark Streaming system. These output operations send the data to external entities: saveAsHadoopFiles saves to HDFS, and foreach does anything with each batch of results. So, we will see that whenever we perform an action, or an output operation, either it will save to an HDFS file, with saveAsHadoopFiles, or it will perform further actions, using the foreach command.
Refer slide time :( 10:53)
So, again, we will now count the hashtags in this particular example. So, that means once we get the hashtags, we are going to count these hashtags, which are now available through the Spark Streaming system. Now, we are going to perform an action, or an operation, as the output. So, the output operation here is countByValue: it counts how many occurrences of each of these hashtags are there in the stream of data. So, here we can see that we perform this countByValue.
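A minimal sketch, again reusing the hashTags DStream from above; countByValue produces, for every batch, a pair of each distinct hashtag and its count:

val tagCounts = hashTags.countByValue()
tagCounts.print()   // e.g. (#cat, 10), (#dog, 25), ...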
Refer slide time :( 11:42)
Now, we will also see some more functionality, in terms of an example which counts the hashtags over the last 10 minutes. So, basically, for that we have to use the windowing operation: we take a window covering the last 10 minutes, and after every few seconds, say five seconds, we count the hashtags seen within that window by value. So, this window operation has two different arguments: one is the length of the window, the window length, and the other is the sliding interval. So, here, over the window duration, every five seconds we are going to do the countByValue. So, the sliding window operation has the window length and the sliding interval, and we can see, using this particular diagram, that this is the window length and this is the sliding interval, a much shorter duration. So, the sliding interval decides when the data is taken for processing; so whenever we say countByValue, this window length and sliding interval together will give the data for this computation.
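A minimal sketch of such a windowed count, where the concrete durations (a 10-minute window sliding every 5 seconds) are chosen only for illustration:

import org.apache.spark.streaming.{Minutes, Seconds}

val windowedTagCounts =
  hashTags.window(Minutes(10), Seconds(5))   // (window length, sliding interval)
          .countByValue()
windowedTagCounts.print()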
Refer slide time :( 13:35)
So, here we can see, in this particular example, that we are going to count all the data in that particular window. Now, one important thing is what happens when the window slides; remember, there are two things, the 'window length' and the 'sliding interval'. So, when the window slides, the old data falls out of the window and new data enters into the window. So, the value which we are counting is going to change every time in this way. So, to obtain the new count, the countByValue of the data which has left the window has to be subtracted, and that of the new data has to be added; this idea requires a different, window-based incremental algorithm to do the analysis.
Refer slide time :( 14:48)
That we will see in the further slides. So, for smart window-based reduction, we will see that different commands are there: there are techniques to incrementally compute the count, and this generalizes to many reduce operations; it needs a function to inverse the reduce, that is, subtraction in the case of counting. So, we could have implemented the counting as a reduceByKeyAndWindow over the hashtags, so that, within that particular window, it will do this operation: after sliding, the value of the new batches is added and the value of the old batches is subtracted. And this is performed by the reduceByKeyAndWindow operation.
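A minimal sketch of this incremental window reduction, with illustrative durations; the second function given to reduceByKeyAndWindow is the inverse reduce, used to subtract the batches that slide out of the window (checkpointing must be enabled on the context for this form):

import org.apache.spark.streaming.{Minutes, Seconds}

val tagCountsIncremental = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add the counts of new batches entering the window
    (a: Int, b: Int) => a - b,   // subtract the counts of old batches leaving the window
    Minutes(10),                 // window length
    Seconds(5)                   // sliding interval
  )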
Refer slide time :( 15:31)
Arbitrary stateful computation: this is also a very important computation in the Spark Streaming system, because it allows you to maintain, for example, a count of particular words or events occurring in the stream of data. So, this maintains a stateful computation: we specify a function to generate the new state based on the previous state and the new data. For example, maintain the per-user mood as the state and update it whenever a new tweet of that user is seen. So, an updateMood function can be defined which takes the new tweets and the last mood which it has seen, and then it updates to the new mood; so whenever a new tweet comes, it will perform this updateMood function through updateStateByKey. So, whenever the tweets by a user are given, the mood will be extracted, and a state variable is maintained to track that mood. That is why it is called a 'stateful computation': the state is maintained all the time, and this state is updated whenever new data comes in the stream and there is a change of mood; it is extracted out of the stream and the state is updated.
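Instead of the per-user mood example, here is a simpler minimal sketch of the same updateStateByKey mechanism, keeping a running total per hashtag (the checkpoint path is a placeholder; stateful operations require checkpointing to be enabled):

ssc.checkpoint("hdfs://namenode:8020/user/demo/checkpoints")

// running count of how many times each hashtag has been seen since the start
val tagTotals = hashTags
  .map(tag => (tag, 1))
  .updateStateByKey[Int] { (newCounts: Seq[Int], previousTotal: Option[Int]) =>
    // new state = previous state (if any) + whatever arrived in this batch
    Some(previousTotal.getOrElse(0) + newCounts.sum)
  }
tagTotals.print()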
Refer slide time :( 17:34)
So, now we are going to see an example of an arbitrary combination of batch and stream computation: we can intermix RDD and DStream operations. So, for example, join the incoming stream with a spam file on HDFS to filter out the bad tweets. Using this particular way of joining, the tweet RDDs are joined with the file from HDFS, and we perform various filters on top of it; this transforms the tweets, and the transformed tweets are given as the output. So, all these functions are written in Scala; to understand these particular operations, commands or programs, I advise you to refer to the Scala programming language to have a better understanding of the streaming.
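A minimal sketch of such a mix of batch and stream computation, where the spam-keyword file and its layout are assumptions; a static RDD is loaded once and each batch of hashtag pairs is joined against it inside transform:

// static data set loaded once as an RDD of (keyword, true) pairs
val spamKeywords = ssc.sparkContext
  .textFile("hdfs://namenode:8020/user/demo/spam_keywords.txt")
  .map(word => (word, true))

val cleanedTags = hashTags
  .map(tag => (tag, 1))
  .transform { rdd =>
    // join each batch against the static RDD and keep only tags with no spam match
    rdd.leftOuterJoin(spamKeywords)
       .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }
       .map { case (tag, (count, _)) => (tag, count) }
  }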
Refer slide time :( 18:36)
So, Spark Streaming ties these three together: DStreams, batches and RDDs. So, again, let us summarize all this: whenever the input streaming data is coming, the Spark Streaming system divides it into batches, of one second duration in this example, and on these batches various transformations and actions are performed, and they are given to the Spark engine, which produces the output. So, these steps are repeated for each batch continuously, because we are dealing with streaming data; the data is continuously coming, in the form of streams. So, Spark Streaming has the ability to remember the previous RDDs, to some extent.
Refer slide time :( 19:26)
Therefore, DStreams and RDDs together, that is, the DStreams for the streaming data and the RDDs for the batch data, give more power when combined. For example, we can introduce online machine learning: when the stream becomes RDDs, we can apply the machine learning libraries on top of them, and hence it becomes an online machine learning technique, that is, machine learning applied on the DStream. So, we can continuously learn and update a data model; this can be performed using the updateStateByKey and transform operations. Also, in this manner we can combine the live data streams with historical data: we generate the historical data model with Spark etc., and then we can use that data model to process the live data streams, using transformations. We can also do CEP-style processing, such as window-based operations, reduceByWindow, etc.
Refer slide time :( 20:26)
So, from DStreams to Spark jobs: we can see that at every interval, an RDD graph is computed from the DStream graph; for each output operation a Spark action is created, and for each action a Spark job is created. This is shown here in this particular example: there is a DStream graph in which input streams are coming, we can perform a union on all these different input streams, and then we can perform different transformations and then various output operations. These become blocks of RDDs, with the data received in the last batch interval, which are given to the Spark system in the form of an RDD graph: again the RDDs of the received blocks, the union over them, the transformations, and then the output. So, from Spark Streaming, when the jobs are handed over to Spark in this way, the transformations are applied on these RDDs before the output is produced.
Refer slide time :( 21:34)
So, there are the input streams, which we have already seen; let us summarize that out of the box, support is provided for the input sources Kafka, HDFS, Flume, Akka actors and raw TCP sockets, and it is very easy to write a receiver for your own data source. So, these are the different receivers which are inbuilt, and you can also write your own receiver for your own data source, and you can also generate your own RDDs from Spark and push them in as a stream.
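A minimal sketch of such a custom receiver (the data source here is purely hypothetical; a real receiver would read from your own system inside the run loop):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // start a thread that pulls records from the custom source and hands them to Spark
    new Thread("My Custom Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record from my custom source")   // placeholder record
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // nothing to clean up: the thread above checks isStopped() and exits on its own
  }
}

// attach the receiver to the streaming context to obtain a DStream
val customStream = ssc.receiverStream(new MyReceiver)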
Refer slide time :( 22:09)
So, the current Spark Streaming input and output can now be summarized: there are advanced sources such as Kafka, Flume, Twitter, ZeroMQ and so on, and the basic sources are sockets, files, Akka actors and so on. The output operations are print, saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles and foreachRDD; foreachRDD can be used for message queues and DB operations and many more things.
Refer slide time :( 22:38)
So, for the DStream classes, different classes are ported for different languages, Scala and Java; they have many value members, and there are multiple types of DStreams. A separate Python API is also provided.
Refer slide time :( 22:54)
Now, the Spark Streaming operations are summarized here. On the RDD side, there are transformations such as map and flatMap, which we have seen, filter, which we have also seen, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup, transform and updateStateByKey; these are the different transformations available on the DStreams. Then, the Spark Streaming window operations are available, such as window, countByWindow, reduceByWindow, reduceByKeyAndWindow and countByValueAndWindow. Similarly, for output operations in Spark Streaming there are various commands such as print, saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles, foreachRDD and so on.
Refer slide time :( 23:46)
So, therefore, the batches of input data are replicated in memory for fault tolerance. Data lost due to a worker failure can be recomputed from the replicated input data. And all transformations are fault tolerant and follow exactly-once transformation semantics.
Refer slide time :( 24:02)
So, for fault tolerance, the received data is replicated among multiple Spark executors, the default being two, and we must also protect the driver program, because there is only one driver running. So, if the driver node which is running the Spark Streaming application fails, the driver must be restarted on another node, and this can be handled using ZooKeeper or YARN. This mechanism requires a checkpoint directory in the streaming context. So, checkpointing saves the state at regular intervals; typically every five to ten batches of data a checkpoint is made. So, on a failure, we would have to replay the five to ten previous batches to recreate the appropriate RDDs. The checkpoint is done to HDFS or an equivalent store, and streaming backpressure can also be enabled; all of these together achieve the fault tolerance. So, in a nutshell, we can say that checkpointing and replication together ensure the recovery from failures.
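A minimal sketch of checkpoint-based driver recovery (the checkpoint path and the small pipeline inside the factory function are placeholders); StreamingContext.getOrCreate either rebuilds the context from the checkpoint data or calls the factory to create a fresh one:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/user/demo/streaming-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val newSsc = new StreamingContext(conf, Seconds(1))
  newSsc.checkpoint(checkpointDir)
  // a trivial pipeline, just to make the sketch complete
  newSsc.socketTextStream("localhost", 9999).count().print()
  newSsc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()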
Refer slide time :( 25:19)
So, if we look at the performance, for both grep and wordcount, sub-second end-to-end latency is achieved with batch intervals of one second and two seconds, which is a very good performance that we are seeing here.
Refer slide time :( 25:39)
Compared to the other systems, Spark is also showing the better performance: Storm achieves a lower throughput, whereas Spark achieves the higher throughput.
Refer slide time :( 26:02)
So, there are fault tolerance and recovery mechanisms, so it recovers from faults or stragglers within one second.
Refer slide time :( 26:10)
So, this is also reported here.
Refer slide time :( 26:18)
Now, let us see the Spark program versus the Spark Streaming program; here, again, those issues we have seen apply: whether the data is coming from Spark Streaming or coming from a Hadoop file, both are handled by the Spark Streaming system,
Refer slide time :( 26:50)
both as batch as well as streaming data. So, all these things we have already discussed: that is, we have a unified stack, we can explore the data interactively to identify the problems, use the same code in Spark for processing the large logs, and use similar code in Spark Streaming for the real-time processing. And, in the code, we can see that we can apply the filter, we can also invoke the fault tolerance aspects, and then the Spark context is invoked; once the Spark context is invoked, the driver and the executors are allocated
Refer slide time :( 27:42)
and they are basically able to get ready for the computation.
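A minimal sketch of this reuse of the same logic across batch and streaming, with hypothetical names such as containsError and a StreamingContext ssc as in the earlier sketches:

def containsError(line: String): Boolean = line.contains("ERROR")

// batch: process large historical logs with plain Spark
val historicalErrors = ssc.sparkContext
  .textFile("hdfs://namenode:8020/user/demo/old_logs")
  .filter(containsError _)

// streaming: apply similar code to the live log stream
val liveErrors = ssc.socketTextStream("localhost", 9999).filter(containsError _)
liveErrors.print()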