PCET-NMVPM’s
Nutan College of Engineering and Research, Talegaon,
Pune
DEPARTMENT OF CSE
AY-2024-25
Subject: Big Data Analytics
Subject Code: BTCOE703 (C)
Subject Teacher: Prof. Sarita Charkha
Unit II Big Data Platforms
Syllabus
Big Data Streaming Platforms [6 Hours]
Big Data Streaming Platforms for Fast Data, Streaming Systems, Big Data Pipelines for Real-Time computing,
Spark Streaming, Kafka, Streaming Ecosystem.
What is Streaming Data?
Also known as event stream processing, streaming data is the continuous flow of data
generated by various sources. By using stream processing technology, data streams can
be processed, stored, analyzed, and acted upon as they are generated, in real time.
What is Streaming?
The term "streaming" is used to describe continuous, never-ending data streams with no
beginning or end, that provide a constant feed of data that can be utilized/acted upon
without needing to be downloaded first.
Similarly, data streams are generated by all types of sources, in various formats and
volumes. From applications, networking devices, and server log files, to website activity,
banking transactions, and location data, they can all be aggregated to seamlessly gather
real-time information and analytics from a single source of truth.
Streaming data is data that is emitted at high volume in a continuous, incremental manner
with the goal of low-latency processing. Organizations have thousands of data sources
that typically simultaneously emit messages, records, or data ranging in size from a few
bytes to several megabytes (MB). Streaming data includes location, event, and sensor
data that companies use for real-time analytics and visibility into many aspects of their
business. For example, companies can track changes in public sentiment on their brands
and products by continuously analyzing clickstreams and customer posts from social
media streams, and then respond promptly as needed.
What are the characteristics of streaming data?
A data stream has the following specific characteristics that define it.
Chronologically significant
Individual elements in a data stream contain time stamps. The data stream itself may be
time-sensitive with diminished significance after a specific time interval. For example,
your application makes restaurant recommendations based on the current location of its
user. You have to act upon user geolocation data in real time or the data loses
significance.
Continuously flowing
A data stream has no beginning or end. It collects data constantly and continuously as
long as required. For example, server activity logs accumulate as long as the server runs.
Unique
Repeat transmission of a data stream is challenging because of time sensitivity. Hence,
accurate real-time data processing is critical. Unfortunately, provisions for retransmission
are limited in most streaming data sources.
Nonhomogeneous
Some sources may stream data in multiple formats, including structured formats such as
JSON, Avro, and comma-separated values (CSV), with data types that include strings,
numbers, dates, and binary types. Your stream processing systems should have the
capabilities to handle such data variations.
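To make this concrete, the sketch below normalizes records that may arrive as either JSON or CSV into one common dictionary shape. It is only an illustration: the format-detection rule and the CSV column order (user_id, event, ts) are assumptions, not part of any particular streaming product.

```python
import csv
import io
import json

def normalize_record(raw: bytes) -> dict:
    """Best-effort parse of a record that may arrive as JSON or CSV.

    Assumes CSV records follow a known column order (user_id, event, ts);
    both the format detection and the column layout are illustrative.
    """
    text = raw.decode("utf-8").strip()
    if text.startswith("{"):                      # looks like JSON
        return json.loads(text)
    row = next(csv.reader(io.StringIO(text)))     # otherwise treat as CSV
    return {"user_id": row[0], "event": row[1], "ts": row[2]}

print(normalize_record(b'{"user_id": "42", "event": "click", "ts": "2024-01-01T10:00:00"}'))
print(normalize_record(b'42,click,2024-01-01T10:00:00'))
```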
Imperfect
Temporary errors at the source may result in damaged or missing elements in the
streamed data. It can be challenging to guarantee data consistency because of the
continuous nature of the stream. Stream processing and analytics systems typically
include logic for data validation to mitigate or minimize these errors.
Why is streaming data important?
Traditional data processing systems capture data in a central data warehouse and process
it in groups or batches. These systems were built to ingest and structure data before
analytics. However, in recent years, the nature of enterprise data and the underlying data
processing systems have changed significantly.
Infinite data volume
Generated data volumes from stream sources can be very large, making it a challenge for
real-time analytics to regulate the streaming data's integrity (validation), structure
(evolution), or velocity (throughput and latency).
Advanced data processing systems
At the same time, cloud infrastructure has introduced flexibility in the scale and usage of
computing resources. You use exactly what you need and pay only for what you use. You
have the options of real-time filtering or aggregation both before and after storing
streaming data. Streaming data architecture uses cloud technologies to consume, enrich,
analyze, and permanently store streaming data as required.
What are the use cases for streaming data?
A stream processing system is beneficial in most scenarios where new and dynamic data
is generated continually. It applies to most industry segments and big data use cases.
Companies generally begin with simple applications, such as collecting system logs and
rudimentary processing like rolling min-max computations. Then, these applications
evolve to more sophisticated near real-time processing.
Here are some more examples of streaming data.
Data analysis
Applications process data streams to produce reports and perform actions in response,
such as emitting alarms when key measures exceed certain thresholds. More sophisticated
stream processing applications extract deeper insights by applying machine learning
algorithms to business and customer activity data.
IoT applications
Internet of Things (IoT) devices are another use case for streaming data. Sensors in
vehicles, industrial equipment, and farm machinery send data to a streaming application.
The application monitors performance, detects potential defects in advance, and
automatically places a spare part order, preventing equipment downtime.
Financial analysis
Financial institutions use stream data to track real-time changes in the stock market,
compute value at risk, and automatically rebalance portfolios based on stock price
movements. Another financial use case is fraud detection of credit card transactions using
real-time inferencing against streaming transaction data.
Real-time recommendations
Real estate applications track geolocation data from consumers’ mobile devices and make
real-time recommendations of properties to visit. Similarly, advertising, food, retail, and
consumer applications can integrate real-time recommendations to give more value to
customers.
Service guarantees
You can implement data stream processing to track and maintain service levels in
applications and equipment. For example, a solar power company has to maintain power
throughput for its customers or pay penalties. It implements a streaming data application
that monitors all panels in the field and schedules service in real time. Thus, it can
minimize each panel's periods of low throughput and the associated penalty payouts.
Media and gaming
Media publishers stream billions of clickstream records from their online properties,
aggregate and enrich the data with user demographic information, and optimize the
content placement. This helps publishers deliver a better, more relevant experience to
audiences. Similarly, online gaming companies use event stream processing to analyze
player-game interactions and offer dynamic experiences to engage players.
Risk control
Live streaming and social platforms capture user behavior data in real time for risk
control over users' financial activity, such as recharges, refunds, and rewards. They view
real-time dashboards to flexibly adjust risk strategies.
Challenges Building Real-Time Applications
Scalability: When system failures happen, log data coming from each device could
increase from being sent at a rate of kilobits per second to megabits per second, and
aggregate to gigabits per second across devices. Adding more capacity, resources, and
servers as applications scale must happen instantly, because the amount of raw data
generated can grow exponentially. Designing applications to scale is crucial when working
with streaming data.
Ordering: It is not trivial to determine the sequence of data in a data stream, yet ordering is
very important in many applications. A chat or conversation wouldn't make sense out of order.
When developers debug an issue by looking at an aggregated log view, it's crucial that each
line is in order. There are often discrepancies between the order in which data packets are
generated and the order in which they reach the destination. There are also often discrepancies
in the timestamps and clocks of the devices generating data. When analyzing data streams,
applications must be aware of their assumptions about ACID transactions.
Consistency and Durability: Data consistency and data access are always hard problems
in data stream processing. The data read at any given time could already be modified and
stale in another data centre in another part of the world. Data durability is also a
challenge when working with data streams in the cloud.
Fault Tolerance & Data Guarantees: these are important considerations when working
with data, stream processing, or any distributed systems. With data coming from
Unit II Big Data Platforms 6
numerous sources, locations, and in varying formats and volumes, can your system
prevent disruptions from a single point of failure? Can it store streams of data with high
availability and durability?
What is a Data Pipeline?
A data pipeline is a systematic and automated process for the efficient and reliable
movement, transformation, and management of data from one point to another
within a computing environment. It plays a crucial role in modern data-driven
organizations by enabling the seamless flow of information across various stages
of data processing.
A data pipeline consists of a series of data processing steps. If the data is not
currently loaded into the data platform, then it is ingested at the beginning of the
pipeline. Then there are a series of steps in which each step delivers an output that
is the input to the next step. This continues until the pipeline is complete. In some
cases, independent steps may be run in parallel.
Data pipelines consist of three key elements: a source, a processing step or steps,
and a destination. In some data pipelines, the destination may be called a sink.
Data pipelines enable the flow of data from an application to a data warehouse,
from a data lake to an analytics database, or into a payment processing
system, for example. Data pipelines also may have the same source and
sink, such that the pipeline is purely about modifying the data set. Any time data
is processed between point A and point B (or points B, C, and D), there is a data
pipeline between those points.
What Is a Big Data Pipeline?
As the volume, variety, and velocity of data have dramatically grown in recent
years, architects and developers have had to adapt to “big data.” The term “big
data” implies that there is a huge volume to deal with. This volume of data can
open opportunities for use cases such as predictive analytics, real-time reporting,
and alerting, among many examples.
Like many components of data architecture, data pipelines have evolved to
support big data. Big data pipelines are data pipelines built to accommodate one
or more of the three traits of big data. The velocity of big data makes it appealing
to build streaming data pipelines for big data. Then data can be captured and
processed in real time so some action can then occur. The volume of big data
requires that data pipelines must be scalable, as the volume can be variable over
time. In practice, there are likely to be many big data events that occur
simultaneously or very close together, so the big data pipeline must be able to
scale to process significant volumes of data concurrently. The variety of big data
requires that big data pipelines be able to recognize and process data in many
different formats—structured, unstructured, and semi-structured.
Benefits of a Data Pipeline
Efficiency
Data pipelines automate the flow of data, reducing manual intervention and
minimizing the risk of errors. This enhances overall efficiency in data processing
workflows.
Real-time Insights
With the ability to process data in real-time, data pipelines empower organizations
to derive insights quickly and make informed decisions on the fly.
Scalability
Scalable architectures in data pipelines allow organizations to handle growing
volumes of data without compromising performance, ensuring adaptability to
changing business needs.
Data Quality
By incorporating data cleansing and transformation steps, data pipelines
contribute to maintaining high data quality standards, ensuring that the
information being processed is accurate and reliable.
Cost-Effective
Automation and optimization of data processing workflows result in cost savings
by reducing manual labor, minimizing errors, and optimizing resource utilization.
Types of Data Pipelines
Batch Processing
Batch processing involves the execution of data jobs at scheduled intervals. It is
well-suited for scenarios where data can be processed in non-real-time, allowing
for efficient handling of large datasets.
Streaming Data
Streaming data pipelines process data in real-time as it is generated. This type of
pipeline is crucial for applications requiring immediate insights and actions based
on up-to-the-moment information.
How Data Pipelines Work
A typical data pipeline involves several key stages:
1. Ingestion
Data is collected from various sources and ingested into the pipeline. This
can include structured and unstructured data from databases, logs, APIs, and
other sources.
2. Processing
The ingested data undergoes processing, which may involve transformation,
cleansing, aggregation, and other operations to prepare it for analysis or
storage.
3. Storage
Processed data is stored in a suitable data store, such as a database, data
warehouse, or cloud storage, depending on the requirements of the
organization.
4. Analysis
Analytical tools and algorithms are applied to the stored data to extract
meaningful insights, patterns, and trends.
5. Visualization
The results of the analysis are presented in a visual format through
dashboards or reports, making it easier for stakeholders to interpret and act
upon the information.
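The following minimal, self-contained Python sketch walks through these five stages as an in-memory "pipeline". The event schema, the validation rule, and the use of plain Python objects as the store are illustrative stand-ins for real sources, databases, and dashboards.

```python
import json
from collections import Counter

# A toy, in-memory pipeline whose stages mirror the list above.

def ingest():
    # 1. Ingestion: pull raw events from a source (here, a hard-coded list)
    return ['{"user": "a", "amount": "10"}',
            '{"user": "b", "amount": "oops"}',
            '{"user": "a", "amount": "5"}']

def process(raw_events):
    # 2. Processing: parse, cleanse (drop invalid rows), and cast types
    for raw in raw_events:
        event = json.loads(raw)
        try:
            event["amount"] = float(event["amount"])
        except ValueError:
            continue                      # discard records that fail validation
        yield event

def store(events):
    # 3. Storage: persist to a destination (here, just a list in memory)
    return list(events)

def analyze(stored):
    # 4. Analysis: aggregate spend per user
    totals = Counter()
    for event in stored:
        totals[event["user"]] += event["amount"]
    return totals

def visualize(totals):
    # 5. Visualization: a printed report stands in for a dashboard
    for user, total in totals.items():
        print(f"{user}: {total:.2f}")

visualize(analyze(store(process(ingest()))))
```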
Data Pipeline Architecture Examples
Data pipelines may be architected in several different ways. One common
example is a batch-based data pipeline. In that example, you may have an
application such as a point-of-sale system that generates a large number of data
points that you need to push, on a schedule, to a data warehouse and an analytics database.
Another example is a streaming data pipeline. In a streaming data
pipeline, data from the point-of-sale system would be processed as it is generated.
The stream processing engine could feed outputs from the pipeline to data stores,
marketing applications, and CRMs, among other applications, as well as back to
the point-of-sale system itself.
Use Cases
Finance
Handling financial transactions, fraud detection, and risk analysis in real-time.
E-commerce
Managing and analyzing large volumes of customer data, transaction logs, and
inventory information in real-time.
Business Intelligence
Deriving insights from historical and real-time data to inform decision-making
processes.
Healthcare
Processing and analyzing patient records, medical images, and sensor data for
improved diagnostics and patient care.
Spark Streaming
Spark Streaming is an extension of the Apache Spark cluster computing system that
enables processing of real-time data streams. It allows you to process and analyze
streaming data in near real-time with high fault tolerance, scalability, and ease of use.
In Spark Streaming, data is ingested in small batches or micro-batches, and each batch is
processed using the same set of operations as used in batch processing. The processed data
can then be stored or further analyzed in real-time.
It allows you to process live data streams from many sources, such as Kafka, Flume,
Kinesis, or TCP sockets. You can then use Spark’s machine learning and graph processing
algorithms to analyze the data.
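As a minimal illustration of the micro-batch model, here is the classic word-count job written with PySpark's DStream API. It assumes pyspark is installed and that a text source is listening on a TCP socket (for example `nc -lk 9999`); the host, port, and batch interval are arbitrary choices.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word count over a TCP socket using the DStream (micro-batch) API.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # a DStream of text lines
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()
```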
Monitoring and alerting
You can use Spark Streaming to monitor your applications and systems for errors or
anomalies.
For example, you could use it to track the number of errors in a web application or
the number of requests per second to a server.
For example, you could create a Spark Streaming job that reads the logs from your
web application and counts the number of errors. If the number of errors exceeds a
certain threshold, you could send an alert to your team.
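A hedged sketch of such a monitoring job is shown below: it filters "ERROR" lines out of a socket-based log stream and prints an alert when a per-batch count exceeds a threshold. The log source, the threshold value, and the alerting action are all assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

ERROR_THRESHOLD = 100   # illustrative threshold

def check_threshold(time, rdd):
    # Count errors in this micro-batch and raise a (printed) alert if needed.
    error_count = 0 if rdd.isEmpty() else rdd.count()
    if error_count > ERROR_THRESHOLD:
        print(f"[{time}] ALERT: {error_count} errors in the last batch")
        # in a real job, this is where you would notify your team

sc = SparkContext("local[2]", "LogMonitor")
ssc = StreamingContext(sc, 10)
logs = ssc.socketTextStream("localhost", 9999)     # stand-in for a log stream
errors = logs.filter(lambda line: "ERROR" in line)
errors.foreachRDD(check_threshold)

ssc.start()
ssc.awaitTermination()
```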
Real-time analytics
You can use Spark Streaming to analyze data in real time.
For example, you could use it to track the sentiment of social media posts or the
price of stocks.
For example, you could create a Spark Streaming job that reads tweets from Twitter
and calculates the sentiment of each tweet. You could then use this information to
track the public’s reaction to a product launch or an event.
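As a toy version of this idea, the sketch below scores each incoming post with a tiny word lexicon and sums the scores per micro-batch. A real job would read from a Twitter or Kafka connector and use a proper sentiment model; the socket source and the lexicon here are purely illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Toy lexicon-based sentiment scoring, for illustration only.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "terrible"}

def score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sc = SparkContext("local[2]", "SentimentTracker")
ssc = StreamingContext(sc, 10)
posts = ssc.socketTextStream("localhost", 9999)    # stand-in for a social feed
(posts.map(lambda text: ("sentiment", score(text)))
      .reduceByKey(lambda a, b: a + b)
      .pprint())                                   # net sentiment per batch

ssc.start()
ssc.awaitTermination()
```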
Machine learning
You can use Spark Streaming to train and deploy machine learning models on streaming
data.
For example, you could use it to predict customer churn or fraud.
For example, you could create a Spark Streaming job that reads the price of stocks
from a financial data feed and trains a machine learning model to predict future
prices. You could then use this model to make investment decisions.
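A sketch along the lines of Spark's streaming linear regression example is shown below: the model's weights are updated on one labeled stream while predictions are emitted for another. The text format ("label,f1 f2 ..."), the socket sources, and the single-feature setup are assumptions for illustration, not a recipe for real trading decisions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

def parse(line):
    # Assumed input format: "label,feature1 feature2 ..."
    label, features = line.split(",")
    return LabeledPoint(float(label),
                        Vectors.dense([float(x) for x in features.split()]))

sc = SparkContext("local[2]", "StreamingPricePredictor")
ssc = StreamingContext(sc, 10)

training = ssc.socketTextStream("localhost", 9999).map(parse)   # labeled stream
testing = ssc.socketTextStream("localhost", 9998).map(parse)    # stream to score

model = StreamingLinearRegressionWithSGD()
model.setInitialWeights([0.0])                  # one feature in this toy setup
model.trainOn(training)                         # update weights on each batch
model.predictOnValues(testing.map(lambda lp: (lp.label, lp.features))).pprint()

ssc.start()
ssc.awaitTermination()
```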
Spark Streaming is available from the following cloud providers:
AWS: Amazon EMR, AWS Lambda
Azure: Azure HDInsight, Azure Functions
Google Cloud Platform: Cloud Dataproc, Cloud Functions
IBM Cloud: IBM Cloud Pak for Data, IBM Cloud Functions
Alibaba Cloud: Alibaba Cloud EMR, Alibaba Cloud Functions
Benefits of using Spark Streaming in the cloud
Scalability: Cloud providers offer a wide range of resources, so you can scale your
Spark Streaming jobs up or down as needed.
Cost-effectiveness: Cloud providers offer pay-as-you-go pricing, so you only pay
for the resources that you use.
Ease of use: Cloud providers offer managed Spark services, so you can focus on
your application development and not worry about managing the underlying
infrastructure.
Real-time processing: Spark Streaming allows you to process and analyze data in
near real-time, which enables you to make decisions and take actions quickly based
on the data.
High fault tolerance: Spark Streaming provides built-in fault tolerance by
replicating the data across multiple nodes in the cluster. If a node fails, the data is
automatically reprocessed on another node, ensuring that no data is lost and
processing is not interrupted.
Integration with other Spark components: Spark Streaming integrates
seamlessly with other Spark components, such as Spark SQL, MLlib, and GraphX.
This allows you to perform complex analytics on real-time data, such as machine
learning, graph processing, and SQL queries.
Support for multiple data sources: Spark Streaming supports a variety of data
sources, such as Kafka, Flume, HDFS, and S3. This allows you to easily ingest data
from different sources into Spark Streaming for processing.
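In recent Spark versions, the supported way to read Kafka from Python is through Structured Streaming rather than the DStream API; the hedged sketch below shows that route. It assumes the spark-sql-kafka package is on the classpath (for example via `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0`), and the broker address and topic name are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Read a Kafka topic as a streaming DataFrame and echo it to the console.
spark = SparkSession.builder.appName("KafkaIngest").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clickstream")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

query = (events.writeStream
         .format("console")          # print each micro-batch for inspection
         .outputMode("append")
         .start())
query.awaitTermination()
```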
Apache Kafka
Kafka Streams is a client library for building applications and microservices, where the input and
output data are stored in Kafka clusters. It combines the simplicity of writing and deploying
standard Java and Scala applications on the client side with the benefits of Kafka's server-side
cluster technology.
Apache Kafka is a distributed messaging system that provides fast, durable, highly
scalable, and fault-tolerant messaging through a publish-subscribe (pub-sub) model. Kafka
offers high throughput, reliability, and replication. It is built for Big Data
applications, real-time data pipelines, and streaming apps.
Apache Kafka was originally developed by LinkedIn, and was subsequently open sourced
in early 2011. In November 2014, several engineers who worked on Kafka at LinkedIn
created a new company named Confluent with a focus on Kafka.
Kafka as a distributed system runs in a cluster. Each node in the cluster is called a
Kafka broker.
The basic architecture of Kafka is organized around a few key terms: topics, producers,
consumers, and brokers.
Kafka Terminology
Kafka Broker
Kafka runs as a distributed system in a cluster. There are one or more servers available in
the cluster. Each node in the cluster is called a Kafka Broker.
Brokers are responsible for receiving and storing the data when it arrives. The broker also
provides the data when requested.
A Kafka broker is more precisely described as a message broker, which is responsible for
mediating the conversation between different computer systems and guaranteeing delivery
of the message to the correct parties.
Kafka Topic
Topics represent the logical collection of messages that belong to a group/category. The
data sent by the producers is stored in topics. Consumers subscribe to a specific topic that
they are interested in. A topic can have zero or more consumers.
In a nutshell, a topic is a category or feed name to which records are published.
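For illustration, a topic can also be created programmatically. The sketch below uses the kafka-python admin client (one possible client library; the broker address, topic name, and partition/replication counts are assumptions), though topics are just as commonly created with the kafka-topics CLI or auto-created by the broker.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Create a topic with 3 partitions and a replication factor of 1
# (illustrative values for a single-broker development setup).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="clickstream", num_partitions=3, replication_factor=1)
])
admin.close()
```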
Kafka Message
In Kafka, messages represent the fundamental unit of data. Each message is represented as
a record, which comprises two parts: key and value. Irrespective of the data type, Kafka
always converts messages into byte arrays.
Many other messaging systems also have a way of carrying other information along with
the messages. Kafka 0.11 introduced record headers for this purpose.
Partitions
Topics are divided into one or more partitions (the default is one). A partition lives on a
physical node and persists the messages it receives. A partition can be replicated onto
other nodes in a leader/follower relationship. There is only one "leader" node for a given
partition, which accepts all reads and writes; in case of failure, a new leader is chosen.
The other nodes simply replicate messages from the leader to ensure fault tolerance.
Kafka ensures strict ordering within a partition, i.e. consumers receive messages in the
order in which the producer published them.
Kafka Producer
A Producer is an entity that publishes streams of messages to Kafka topics. A producer
can publish to one or more topics and can optionally choose the partition that stores the
data.
Kafka comes with its own producer written in Java, but there are many other Kafka client
libraries that support C/C++, Go, Python, REST, and more.
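Below is a hedged sketch of publishing records with the kafka-python client, one of the non-Java client libraries mentioned above. The broker address, topic name, and record schema are illustrative; note how the serializers turn keys and values into byte arrays, and how a record key pins the record to a particular partition.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # Kafka stores bytes
)

# Records with the same key always land in the same partition,
# which preserves per-key ordering.
producer.send("clickstream", key="user-42", value={"event": "click", "page": "/home"})
producer.flush()   # block until the broker has acknowledged the batch
producer.close()
```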
Kafka Consumers
Consumers are the subscribers or readers that receive the data. A consumer is an entity
that consumes/reads messages from Kafka topics and processes the feed of messages; it
can consume from one or more topics or partitions. Kafka consumers are stateful, which
means they are responsible for remembering the cursor position, which is called an offset.
Kafka Consumer Groups
A consumer group is a group of related consumers that perform a task.
Each consumer group must have a unique id. Each consumer group is a subscriber to one
or more Kafka topics, and each consumer group maintains its offset per topic partition.
Kafka Offset
The offset is a position within a partition for the next message to be sent to a consumer.
Offset is used to uniquely identify each record within the partition.
Kafka Consumer Lags
Kafka Consumer Lag is an indicator of how much lag there is between Kafka producers
and consumers.
Inside Kafka, data is stored in one or more topics. Each topic consists of one or more
partitions. When writing data, a broker actually writes it into a specific partition. As it
writes data, it keeps track of the last "write position" in each partition. This is called the
Latest Offset. Each partition has its own independent Latest Offset.
Just like brokers keep track of their write position in each partition, each consumer keeps
track of its "read position" in each partition whose data it is consuming. This is known as
the Consumer Offset. The Consumer Offset is periodically persisted (to ZooKeeper or a
special topic in Kafka itself) so it can survive consumer crashes or unclean shutdowns
and avoid re-consuming too much old data.
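As a rough illustration, consumer lag per partition can be estimated as the difference between the broker's Latest Offset and the group's committed Consumer Offset. The kafka-python sketch below does exactly that; the topic and group names are placeholders.

```python
from kafka import KafkaConsumer, TopicPartition

# Lag per partition = broker's latest offset - group's committed offset.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="analytics-service",
                         enable_auto_commit=False)

partitions = [TopicPartition("clickstream", p)
              for p in (consumer.partitions_for_topic("clickstream") or set())]
consumer.assign(partitions)

latest = consumer.end_offsets(partitions)          # broker-side "write position"
for tp in partitions:
    committed = consumer.committed(tp) or 0        # group's last committed offset
    print(f"partition {tp.partition}: lag = {latest[tp] - committed}")
```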