PCET-NMVPM’s
Nutan College of Engineering and Research, Talegaon,
Pune
DEPARTMENT OF CSE
AY-2024-25
Subject: Big Data Analytics
Subject Code: BTCOE703 (C)
Subject Teacher: Prof. Sarita Charkha
Unit II Big Data Platforms
Syllabus
Big Data Streaming Platforms [6 Hours]
Big Data Streaming Platforms for Fast Data, Streaming Systems, Big Data Pipelines for Real-Time computing,
Spark Streaming, Kafka, Streaming Ecosystem.
What is Streaming Data?
Also known as event stream processing, streaming data is the continuous flow of data
generated by various sources. By using stream processing technology, data streams can
be processed, stored, analyzed, and acted upon as they are generated, in real time.
What is Streaming?
The term "streaming" is used to describe continuous, never-ending data streams with no
beginning or end, that provide a constant feed of data that can be utilized/acted upon
without needing to be downloaded first.
Similarly, data streams are generated by all types of sources, in various formats and
volumes. From applications, networking devices, and server log files, to website activity,
banking transactions, and location data, they can all be aggregated to seamlessly gather
real-time information and analytics from a single source of truth.
Streaming data is data that is emitted at high volume in a continuous, incremental manner
with the goal of low-latency processing. Organizations have thousands of data sources
that typically simultaneously emit messages, records, or data ranging in size from a few
bytes to several megabytes (MB). Streaming data includes location, event, and sensor
data that companies use for real-time analytics and visibility into many aspects of their
business. For example, companies can track changes in public sentiment on their brands
and products by continuously analyzing clickstreams and customer posts from social
media streams, and then respond promptly as needed.
What are the characteristics of streaming data?
A data stream has the following specific characteristics that define it.
Chronologically significant
Individual elements in a data stream contain time stamps. The data stream itself may be
time-sensitive with diminished significance after a specific time interval. For example,
your application makes restaurant recommendations based on the current location of its
user. You have to act upon user geolocation data in real time or the data loses
significance.
Continuously flowing
A data stream has no beginning or end. It collects data constantly and continuously as
long as required. For example, server activity logs accumulate as long as the server runs.
Unique
Repeat transmission of a data stream is challenging because of time sensitivity. Hence,
accurate real-time data processing is critical. Unfortunately, provisions for retransmission
are limited in most streaming data sources.
Nonhomogeneous
Some sources may stream data in multiple formats, including structured formats such as
JSON, Avro, and comma-separated values (CSV), with data types that include strings,
numbers, dates, and binary types. Your stream processing systems should have the
capabilities to handle such data variations.
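To make this concrete, the sketch below normalizes records that may arrive as either JSON or CSV into one common dictionary shape. It is only an illustration: the format-detection rule and the CSV column order (user_id, event, ts) are assumptions, not part of any particular streaming product.

```python
import csv
import io
import json

def normalize_record(raw: bytes) -> dict:
    """Best-effort parse of a record that may arrive as JSON or CSV.

    Assumes CSV records follow a known column order (user_id, event, ts);
    both the format detection and the column layout are illustrative.
    """
    text = raw.decode("utf-8").strip()
    if text.startswith("{"):                      # looks like JSON
        return json.loads(text)
    row = next(csv.reader(io.StringIO(text)))     # otherwise treat as CSV
    return {"user_id": row[0], "event": row[1], "ts": row[2]}

print(normalize_record(b'{"user_id": "42", "event": "click", "ts": "2024-01-01T10:00:00"}'))
print(normalize_record(b'42,click,2024-01-01T10:00:00'))
```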
Imperfect
Temporary errors at the source may result in damaged or missing elements in the
streamed data. It can be challenging to guarantee data consistency because of the
continuous nature of the stream. Stream processing and analytics systems typically
include logic for data validation to mitigate or minimize these errors.
Why is streaming data important?
Traditional data processing systems capture data in a central data warehouse and process
it in groups or batches. These systems were built to ingest and structure data before
analytics. However, in recent years, the nature of enterprise data and the underlying data
processing systems have changed significantly.
Infinite data volume
Generated data volumes from stream sources can be very large, making it a challenge for
real-time analytics to regulate the streaming data's integrity (validation), structure
(evolution), or velocity (throughput and latency).
Advanced data processing systems
At the same time, cloud infrastructure has introduced flexibility in the scale and usage of
computing resources. You use exactly what you need and pay only for what you use. You
have the options of real-time filtering or aggregation both before and after storing
streaming data. Streaming data architecture uses cloud technologies to consume, enrich,
analyze, and permanently store streaming data as required.
What are the use cases for streaming data?
A stream processing system is beneficial in most scenarios where new and dynamic data
is generated continually. It applies to most industry segments and big data use cases.
Companies generally begin with simple applications, such as collecting system logs and
rudimentary processing like rolling min-max computations. Then, these applications
evolve to more sophisticated near real-time processing.
Here are some more examples of streaming data.
Data analysis
Applications process data streams to produce reports and perform actions in response,
such as emitting alarms when key measures exceed certain thresholds. More sophisticated
stream processing applications extract deeper insights by applying machine learning
algorithms to business and customer activity data.
IoT applications
Internet of Things (IoT) devices are another use case for streaming data. Sensors in
vehicles, industrial equipment, and farm machinery send data to a streaming application.
The application monitors performance, detects potential defects in advance, and
automatically places a spare part order, preventing equipment downtime.
Financial analysis
Financial institutions use stream data to track real-time changes in the stock market,
compute value at risk, and automatically rebalance portfolios based on stock price
movements. Another financial use case is fraud detection of credit card transactions using
real-time inferencing against streaming transaction data.
Real-time recommendations
Real estate applications track geolocation data from consumers’ mobile devices and make
real-time recommendations of properties to visit. Similarly, advertising, food, retail, and
consumer applications can integrate real-time recommendations to give more value to
customers.
Service guarantees
You can implement data stream processing to track and maintain service levels in
applications and equipment. For example, a solar power company has to maintain power
throughput for its customers or pay penalties. It implements a streaming data application
that monitors all panels in the field and schedules service in real time. Thus, it can
minimize each panel's periods of low throughput and the associated penalty payouts.
Media and gaming
Media publishers stream billions of clickstream records from their online properties,
aggregate and enrich the data with user demographic information, and optimize the
content placement. This helps publishers deliver a better, more relevant experience to
audiences. Similarly, online gaming companies use event stream processing to analyze
player-game interactions and offer dynamic experiences to engage players.
Risk control
Live streaming and social platforms capture user behavior data in real time for risk
control over users' financial activity, such as recharges, refunds, and rewards. They view
real-time dashboards to flexibly adjust risk strategies.
Challenges Building Real-Time Applications
Scalability: When system failures happen, log data coming from each device could
increase from being sent at a rate of kilobits per second to megabits per second, and
aggregate to gigabits per second across devices. Adding more capacity, resources, and
servers as applications scale must happen instantly, because the amount of raw data
generated can grow exponentially. Designing applications to scale is crucial when working
with streaming data.
Ordering: It is not trivial to determine the sequence of data in a data stream, yet ordering is
very important in many applications. A chat or conversation wouldn't make sense out of order.
When developers debug an issue by looking at an aggregated log view, it's crucial that each
line is in order. There are often discrepancies between the order in which data packets are
generated and the order in which they reach the destination. There are also often discrepancies
in the timestamps and clocks of the devices generating data. When analyzing data streams,
applications must be aware of their assumptions about ACID transactions.
Consistency and Durability: Data consistency and data access are always hard problems
in data stream processing. The data read at any given time could already be modified and
stale in another data centre in another part of the world. Data durability is also a
challenge when working with data streams in the cloud.
Fault Tolerance & Data Guarantees: these are important considerations when working
with data, stream processing, or any distributed systems. With data coming from
Unit II Big Data Platforms 6
numerous sources, locations, and in varying formats and volumes, can your system
prevent disruptions from a single point of failure? Can it store streams of data with high
availability and durability?
What is a Data Pipeline?
A data pipeline is a systematic and automated process for the efficient and reliable
movement, transformation, and management of data from one point to another
within a computing environment. It plays a crucial role in modern data-driven
organizations by enabling the seamless flow of information across various stages
of data processing.
A data pipeline consists of a series of data processing steps. If the data is not
currently loaded into the data platform, then it is ingested at the beginning of the
pipeline. Then there are a series of steps in which each step delivers an output that
is the input to the next step. This continues until the pipeline is complete. In some
cases, independent steps may be run in parallel.
Data pipelines consist of three key elements: a source, a processing step or steps,
and a destination. In some data pipelines, the destination may be called a sink.
Data pipelines enable the flow of data from an application to a data warehouse,
from a data lake to an analytics database, or into a payment processing
system, for example. Data pipelines also may have the same source and
sink, such that the pipeline is purely about modifying the data set. Any time data
is processed between point A and point B (or points B, C, and D), there is a data
pipeline between those points.
What Is a Big Data Pipeline?
As the volume, variety, and velocity of data have dramatically grown in recent
years, architects and developers have had to adapt to “big data.” The term “big
data” implies that there is a huge volume to deal with. This volume of data can
open opportunities for use cases such as predictive analytics, real-time reporting,
and alerting, among many examples.
Like many components of data architecture, data pipelines have evolved to
support big data. Big data pipelines are data pipelines built to accommodate one
or more of the three traits of big data. The velocity of big data makes it appealing
to build streaming data pipelines for big data. Then data can be captured and
processed in real time so some action can then occur. The volume of big data
requires that data pipelines must be scalable, as the volume can be variable over
time. In practice, there are likely to be many big data events that occur
simultaneously or very close together, so the big data pipeline must be able to
scale to process significant volumes of data concurrently. The variety of big data
requires that big data pipelines be able to recognize and process data in many
different formats—structured, unstructured, and semi-structured.
Benefits of a Data Pipeline
Efficiency
Data pipelines automate the flow of data, reducing manual intervention and
minimizing the risk of errors. This enhances overall efficiency in data processing
workflows.
Real-time Insights
With the ability to process data in real-time, data pipelines empower organizations
to derive insights quickly and make informed decisions on the fly.
Scalability
Scalable architectures in data pipelines allow organizations to handle growing
volumes of data without compromising performance, ensuring adaptability to
changing business needs.
Data Quality
By incorporating data cleansing and transformation steps, data pipelines
contribute to maintaining high data quality standards, ensuring that the
information being processed is accurate and reliable.
Cost-Effective
Automation and optimization of data processing workflows result in cost savings
by reducing manual labor, minimizing errors, and optimizing resource utilization.
Types of Data Pipelines
Batch Processing
Batch processing involves the execution of data jobs at scheduled intervals. It is
well-suited for scenarios where data can be processed in non-real-time, allowing
for efficient handling of large datasets.
Streaming Data
Streaming data pipelines process data in real-time as it is generated. This type of
pipeline is crucial for applications requiring immediate insights and actions based
on up-to-the-moment information.
How Data Pipelines Work
A typical data pipeline involves several key stages:
1. Ingestion
Data is collected from various sources and ingested into the pipeline. This
can include structured and unstructured data from databases, logs, APIs, and
other sources.
2. Processing
The ingested data undergoes processing, which may involve transformation,
cleansing, aggregation, and other operations to prepare it for analysis or
storage.
3. Storage
Processed data is stored in a suitable data store, such as a database, data
warehouse, or cloud storage, depending on the requirements of the
organization.
4. Analysis
Analytical tools and algorithms are applied to the stored data to extract
meaningful insights, patterns, and trends.
5. Visualization
The results of the analysis are presented in a visual format through
dashboards or reports, making it easier for stakeholders to interpret and act
upon the information.
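The following minimal, self-contained Python sketch walks through these five stages as an in-memory "pipeline". The event schema, the validation rule, and the use of plain Python objects as the store are illustrative stand-ins for real sources, databases, and dashboards.

```python
import json
from collections import Counter

# A toy, in-memory pipeline whose stages mirror the list above.

def ingest():
    # 1. Ingestion: pull raw events from a source (here, a hard-coded list)
    return ['{"user": "a", "amount": "10"}',
            '{"user": "b", "amount": "oops"}',
            '{"user": "a", "amount": "5"}']

def process(raw_events):
    # 2. Processing: parse, cleanse (drop invalid rows), and cast types
    for raw in raw_events:
        event = json.loads(raw)
        try:
            event["amount"] = float(event["amount"])
        except ValueError:
            continue                      # discard records that fail validation
        yield event

def store(events):
    # 3. Storage: persist to a destination (here, just a list in memory)
    return list(events)

def analyze(stored):
    # 4. Analysis: aggregate spend per user
    totals = Counter()
    for event in stored:
        totals[event["user"]] += event["amount"]
    return totals

def visualize(totals):
    # 5. Visualization: a printed report stands in for a dashboard
    for user, total in totals.items():
        print(f"{user}: {total:.2f}")

visualize(analyze(store(process(ingest()))))
```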
Data Pipeline Architecture Examples
Data pipelines may be architected in several different ways. One common
example is a batch-based data pipeline. In that example, you may have an
application such as a point-of-sale system that generates a large number of data
points that you need to push, on a schedule, to a data warehouse and an analytics database.
Another example is a streaming data pipeline. In a streaming data
pipeline, data from the point-of-sale system would be processed as it is generated.
The stream processing engine could feed outputs from the pipeline to data stores,
marketing applications, and CRMs, among other applications, as well as back to
the point-of-sale system itself.
Use Cases
Finance
Handling financial transactions, fraud detection, and risk analysis in real-time.
E-commerce
Managing and analyzing large volumes of customer data, transaction logs, and
inventory information in real-time.
Business Intelligence
Deriving insights from historical and real-time data to inform decision-making
processes.
Healthcare
Processing and analyzing patient records, medical images, and sensor data for
improved diagnostics and patient care.
Spark Streaming
Spark Streaming is an extension of the Apache Spark cluster computing system that
enables processing of real-time data streams. It allows you to process and analyze
streaming data in near real-time with high fault tolerance, scalability, and ease of use.
In Spark Streaming, data is ingested in small batches or micro-batches, and each batch is
processed using the same set of operations as used in batch processing. The processed data
can then be stored or further analyzed in real-time.
It allows you to process live data streams from many sources, such as Kafka, Flume,
Kinesis, or TCP sockets. You can then use Spark’s machine learning and graph processing
algorithms to analyze the data.
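As a minimal illustration of the micro-batch model, here is the classic word-count job written with PySpark's DStream API. It assumes pyspark is installed and that a text source is listening on a TCP socket (for example `nc -lk 9999`); the host, port, and batch interval are arbitrary choices.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word count over a TCP socket using the DStream (micro-batch) API.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # a DStream of text lines
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()
```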
Monitoring and alerting
You can use Spark Streaming to monitor your applications and systems for errors or
anomalies.
For example, you could use it to track the number of errors in a web application or
the number of requests per second to a server.
For example, you could create a Spark Streaming job that reads the logs from your
web application and counts the number of errors. If the number of errors exceeds a
certain threshold, you could send an alert to your team.
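A hedged sketch of such a monitoring job is shown below: it filters "ERROR" lines out of a socket-based log stream and prints an alert when a per-batch count exceeds a threshold. The log source, the threshold value, and the alerting action are all assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

ERROR_THRESHOLD = 100   # illustrative threshold

def check_threshold(time, rdd):
    # Count errors in this micro-batch and raise a (printed) alert if needed.
    error_count = 0 if rdd.isEmpty() else rdd.count()
    if error_count > ERROR_THRESHOLD:
        print(f"[{time}] ALERT: {error_count} errors in the last batch")
        # in a real job, this is where you would notify your team

sc = SparkContext("local[2]", "LogMonitor")
ssc = StreamingContext(sc, 10)
logs = ssc.socketTextStream("localhost", 9999)     # stand-in for a log stream
errors = logs.filter(lambda line: "ERROR" in line)
errors.foreachRDD(check_threshold)

ssc.start()
ssc.awaitTermination()
```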
Real-time analytics
You can use Spark Streaming to analyze data in real time.
For example, you could use it to track the sentiment of social media posts or the
price of stocks.
For example, you could create a Spark Streaming job that reads tweets from Twitter
and calculates the sentiment of each tweet. You could then use this information to
track the public’s reaction to a product launch or an event.
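As a toy version of this idea, the sketch below scores each incoming post with a tiny word lexicon and sums the scores per micro-batch. A real job would read from a Twitter or Kafka connector and use a proper sentiment model; the socket source and the lexicon here are purely illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Toy lexicon-based sentiment scoring, for illustration only.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "terrible"}

def score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sc = SparkContext("local[2]", "SentimentTracker")
ssc = StreamingContext(sc, 10)
posts = ssc.socketTextStream("localhost", 9999)    # stand-in for a social feed
(posts.map(lambda text: ("sentiment", score(text)))
      .reduceByKey(lambda a, b: a + b)
      .pprint())                                   # net sentiment per batch

ssc.start()
ssc.awaitTermination()
```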
Machine learning
You can use Spark Streaming to train and deploy machine learning models on streaming
data.
For example, you could use it to predict customer churn or fraud.
For example, you could create a Spark Streaming job that reads the price of stocks
from a financial data feed and trains a machine learning model to predict future
prices. You could then use this model to make investment decisions.
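A sketch along the lines of Spark's streaming linear regression example is shown below: the model's weights are updated on one labeled stream while predictions are emitted for another. The text format ("label,f1 f2 ..."), the socket sources, and the single-feature setup are assumptions for illustration, not a recipe for real trading decisions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

def parse(line):
    # Assumed input format: "label,feature1 feature2 ..."
    label, features = line.split(",")
    return LabeledPoint(float(label),
                        Vectors.dense([float(x) for x in features.split()]))

sc = SparkContext("local[2]", "StreamingPricePredictor")
ssc = StreamingContext(sc, 10)

training = ssc.socketTextStream("localhost", 9999).map(parse)   # labeled stream
testing = ssc.socketTextStream("localhost", 9998).map(parse)    # stream to score

model = StreamingLinearRegressionWithSGD()
model.setInitialWeights([0.0])                  # one feature in this toy setup
model.trainOn(training)                         # update weights on each batch
model.predictOnValues(testing.map(lambda lp: (lp.label, lp.features))).pprint()

ssc.start()
ssc.awaitTermination()
```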
Spark Streaming is available from the following cloud providers:
AWS: Amazon EMR, AWS Lambda
Azure: Azure HDInsight, Azure Functions
Google Cloud Platform: Cloud Dataproc, Cloud Functions
IBM Cloud: IBM Cloud Pak for Data, IBM Cloud Functions
Alibaba Cloud: Alibaba Cloud EMR, Alibaba Cloud Functions
Benefits of using Spark Streaming in the cloud
Scalability: Cloud providers offer a wide range of resources, so you can scale your
Spark Streaming jobs up or down as needed.
Cost-effectiveness: Cloud providers offer pay-as-you-go pricing, so you only pay
for the resources that you use.
Ease of use: Cloud providers offer managed Spark services, so you can focus on
your application development and not worry about managing the underlying
infrastructure.
Real-time processing: Spark Streaming allows you to process and analyze data in
near real-time, which enables you to make decisions and take actions quickly based
on the data.
High fault tolerance: Spark Streaming provides built-in fault tolerance by
replicating the data across multiple nodes in the cluster. If a node fails, the data is
automatically reprocessed on another node, ensuring that no data is lost and
processing is not interrupted.
Integration with other Spark components: Spark Streaming integrates
seamlessly with other Spark components, such as Spark SQL, MLlib, and GraphX.
This allows you to perform complex analytics on real-time data, such as machine
learning, graph processing, and SQL queries.
Support for multiple data sources: Spark Streaming supports a variety of data
sources, such as Kafka, Flume, HDFS, and S3. This allows you to easily ingest data
from different sources into Spark Streaming for processing.
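In recent Spark versions, the supported way to read Kafka from Python is through Structured Streaming rather than the DStream API; the hedged sketch below shows that route. It assumes the spark-sql-kafka package is on the classpath (for example via `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0`), and the broker address and topic name are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Read a Kafka topic as a streaming DataFrame and echo it to the console.
spark = SparkSession.builder.appName("KafkaIngest").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clickstream")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

query = (events.writeStream
         .format("console")          # print each micro-batch for inspection
         .outputMode("append")
         .start())
query.awaitTermination()
```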
Apache Kafka
Kafka Streams is a client library for building applications and microservices, where the input and
output data are stored in Kafka clusters. It combines the simplicity of writing and deploying
standard Java and Scala applications on the client side with the benefits of Kafka's server-side
cluster technology.
Apache Kafka is a distributed messaging system that provides fast, durable, highly
scalable, and fault-tolerant messaging through a publish-subscribe (pub-sub) model. Kafka
offers high throughput, reliability, and replication. It is built for Big Data
applications, real-time data pipelines, and streaming apps.
Apache Kafka was originally developed by LinkedIn, and was subsequently open sourced
in early 2011. In November 2014, several engineers who worked on Kafka at LinkedIn
created a new company named Confluent with a focus on Kafka.
Kafka as a distributed system runs in a cluster. Each node in the cluster is called a
Kafka broker.
The basic architecture of Kafka is organized around a few key terms: topics, producers,
consumers, and brokers.
Kafka Terminology
Kafka Broker
Kafka runs as a distributed system in a cluster. There are one or more servers available in
the cluster. Each node in the cluster is called a Kafka Broker.
Brokers are responsible for receiving and storing the data when it arrives. The broker also
provides the data when requested.
A Kafka broker is more precisely described as a message broker, which is responsible for
mediating the conversation between different computer systems and guaranteeing delivery
of the message to the correct parties.
Kafka Topic
Topics represent the logical collection of messages that belong to a group/category. The
data sent by the producers is stored in topics. Consumers subscribe to a specific topic that
they are interested in. A topic can have zero or more consumers.
In a nutshell, a topic is a category or feed name to which records are published.
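For illustration, a topic can also be created programmatically. The sketch below uses the kafka-python admin client (one possible client library; the broker address, topic name, and partition/replication counts are assumptions), though topics are just as commonly created with the kafka-topics CLI or auto-created by the broker.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Create a topic with 3 partitions and a replication factor of 1
# (illustrative values for a single-broker development setup).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="clickstream", num_partitions=3, replication_factor=1)
])
admin.close()
```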
Kafka Message
In Kafka, messages represent the fundamental unit of data. Each message is represented as
a record, which comprises two parts: key and value. Irrespective of the data type, Kafka
always converts messages into byte arrays.
Many other messaging systems also have a way of carrying other information along with
the messages. Kafka 0.11 introduced record headers for this purpose.
Partitions
Topics are divided into one or more partitions (the default is one). A partition lives on a
physical node and persists the messages it receives. A partition can be replicated onto
other nodes in a leader/follower relationship. There is only one "leader" node for a given
partition, which accepts all reads and writes; in case of failure, a new leader is chosen.
The other nodes simply replicate messages from the leader to ensure fault tolerance.
Kafka ensures strict ordering within a partition, i.e. consumers receive messages in the
order in which the producer published them.
Kafka Producer
A Producer is an entity that publishes streams of messages to Kafka topics. A producer
can publish to one or more topics and can optionally choose the partition that stores the
data.
Kafka comes with its own producer written in Java, but there are many other Kafka client
libraries that support C/C++, Go, Python, REST, and more.
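Below is a hedged sketch of publishing records with the kafka-python client, one of the non-Java client libraries mentioned above. The broker address, topic name, and record schema are illustrative; note how the serializers turn keys and values into byte arrays, and how a record key pins the record to a particular partition.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # Kafka stores bytes
)

# Records with the same key always land in the same partition,
# which preserves per-key ordering.
producer.send("clickstream", key="user-42", value={"event": "click", "page": "/home"})
producer.flush()   # block until the broker has acknowledged the batch
producer.close()
```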
Kafka Consumers
Consumers are the subscribers or readers that receive the data. A consumer is an entity
that consumes/reads messages from Kafka topics and processes the feed of messages; it
can consume from one or more topics or partitions. Kafka consumers are stateful, which
means they are responsible for remembering the cursor position, which is called an offset.
Kafka Consumer Groups
A consumer group is a group of related consumers that perform a task.
Each consumer group must have a unique id. Each consumer group is a subscriber to one
or more Kafka topics, and each consumer group maintains its offset per topic partition.
Kafka Offset
The offset is a position within a partition for the next message to be sent to a consumer.
Offset is used to uniquely identify each record within the partition.
Kafka Consumer Lags
Kafka Consumer Lag is an indicator of how much lag there is between Kafka producers
and consumers.
Inside Kafka, data is stored in one or more topics. Each topic consists of one or more
partitions. When writing data, a broker actually writes it into a specific partition. As it
writes data, it keeps track of the last "write position" in each partition. This is called the
Latest Offset. Each partition has its own independent Latest Offset.
Just like brokers keep track of their write position in each partition, each consumer keeps
track of its "read position" in each partition whose data it is consuming. This is known as
the Consumer Offset. The Consumer Offset is periodically persisted (to ZooKeeper or a
special topic in Kafka itself) so it can survive consumer crashes or unclean shutdowns
and avoid re-consuming too much old data.
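As a rough illustration, consumer lag per partition can be estimated as the difference between the broker's Latest Offset and the group's committed Consumer Offset. The kafka-python sketch below does exactly that; the topic and group names are placeholders.

```python
from kafka import KafkaConsumer, TopicPartition

# Lag per partition = broker's latest offset - group's committed offset.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="analytics-service",
                         enable_auto_commit=False)

partitions = [TopicPartition("clickstream", p)
              for p in (consumer.partitions_for_topic("clickstream") or set())]
consumer.assign(partitions)

latest = consumer.end_offsets(partitions)          # broker-side "write position"
for tp in partitions:
    committed = consumer.committed(tp) or 0        # group's last committed offset
    print(f"partition {tp.partition}: lag = {latest[tp] - committed}")
```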