Unit 1 Windowing

Windowing in big data involves processing unbounded data streams by dividing them into finite sets based on criteria like time or tuple count. Different types of windows include fixed, sliding, tumbling, and hopping, each serving specific analytical needs. Event streaming allows for real-time data processing and insights, supported by components like event producers, message brokers, and consumers, while addressing challenges such as data volume and complexity.


Concept of Windowing in Big Data
• Windowing is one of the most frequently used processing methods for streams of data.
• An unbounded stream of data (events) is split into finite sets, or windows, based on specified criteria, such as time.
• A window can be conceptualized as an in-memory table in which events are added and removed based on a set of policies, as in the sketch below.
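A minimal Python sketch of that in-memory-table view, assuming a simple age-based eviction policy; the class and method names are illustrative, not from any particular streaming framework:

```python
from collections import deque
import time

class Window:
    """An in-memory table of events governed by an eviction policy."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.events = deque()  # (timestamp, payload) pairs, oldest first

    def add(self, payload, ts=None):
        """Admit a new event, then apply the eviction policy."""
        self.events.append((ts if ts is not None else time.time(), payload))
        self.evict()

    def evict(self):
        """Policy: remove events older than max_age_seconds."""
        cutoff = time.time() - self.max_age
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def contents(self):
        return [payload for _, payload in self.events]
```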
Time-Based Windows
● A time-based window is defined by a rowtime interval: its criteria select a finite set of rows using a timestamp-based specification.
● In time-based windowing, data is grouped into specific time intervals. These could be, for example, 1-minute, 5-minute, or 1-hour windows.
● As new data arrives, it is assigned to the appropriate time interval, and computations are performed on the data within that interval. This type of windowing is useful for analyzing trends and patterns over time, as in the sketch below.
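As a concrete illustration, a small Python sketch that assigns (timestamp, value) events to fixed 1-minute intervals; the helper names are invented for this example:

```python
from collections import defaultdict

def window_start(ts, size_seconds):
    """Map an event timestamp to the start of its time interval."""
    return ts - (ts % size_seconds)

def group_by_time_window(events, size_seconds=60):
    """Group (timestamp, value) events into fixed time intervals."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[window_start(ts, size_seconds)].append(value)
    return dict(windows)

# Three events assigned to 1-minute windows
events = [(0, 10), (59, 20), (61, 30)]
print(group_by_time_window(events))  # {0: [10, 20], 60: [30]}
```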
Tuple-Count-Based Windows
● In tuple-count-based windowing, data is grouped based on a fixed number of data points. For example, you could define a window size of 100 data points; as soon as 100 data points arrive, the computation is performed on that batch of data. This type of windowing is useful when you want to process data in chunks of a certain size.
● Tuples are grouped into a single window based on time or count; in the non-overlapping case, any tuple belongs to exactly one window.
● Storm core has support for processing groups of tuples that fall within a window. Windows are specified with the following two parameters, illustrated in the sketch below:
1. Window length - the length or duration of the window
2. Sliding interval - the interval at which the window slides
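The two parameters can be imitated in plain Python. This is a framework-free sketch, not Storm's actual API; here `window_length` and `sliding_interval` are both counts of tuples:

```python
from collections import deque

def count_windows(stream, window_length, sliding_interval):
    """Yield windows of window_length tuples, emitting a new window
    after every sliding_interval tuples. When sliding_interval equals
    window_length, the windows tumble (no overlap)."""
    buffer = deque(maxlen=window_length)  # keeps only the newest tuples
    for count, tup in enumerate(stream, start=1):
        buffer.append(tup)
        # emit once the first window fills, then every sliding_interval tuples
        if count >= window_length and (count - window_length) % sliding_interval == 0:
            yield list(buffer)

# Windows of 4 tuples, sliding by 2
for w in count_windows(range(10), window_length=4, sliding_interval=2):
    print(w)
# [0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]
```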
Movement of Windows

1) Fixed
2) Sliding
3) Tumbling
4) Hopping
Windows can be:
• fixed/tumbling: time is partitioned into same-length, non-overlapping chunks. Each event belongs to exactly one window.
• sliding: windows have fixed lengths but are separated by a time interval (step). Typically the window length is a multiple of the step. Each event may belong to several windows.
• session: windows have varying sizes and are defined by the data itself, which should carry some session identifier; see the sketch below.
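Session windows are not covered by a later slide, so here is a hedged Python sketch. It assumes the common convention that a session closes after a gap of inactivity; the gap threshold and function name are illustrative:

```python
def session_windows(events, gap_seconds):
    """Group (timestamp, payload) events, assumed sorted by timestamp,
    into sessions. A new session starts whenever the gap since the
    previous event exceeds gap_seconds."""
    sessions, current, last_ts = [], [], None
    for ts, payload in events:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)  # inactivity gap: close the session
            current = []
        current.append((ts, payload))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

# A 30-second inactivity gap splits these events into two sessions
events = [(0, "a"), (10, "b"), (100, "c"), (110, "d")]
print(len(session_windows(events, gap_seconds=30)))  # 2
```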
Fixed Window Movement
● Fixed windows, often referred to as "tumbling windows," are non-overlapping
windows that are defined by a fixed size or time interval.
● Data is grouped into these windows, and each window is processed
independently.
● When the window size or time interval is reached, the window "tumbles" to the
next set of data points.
● Fixed windows are particularly useful for discrete, non-overlapping analyses on
distinct chunks of data.
Sliding Window Movement
● Sliding windows are overlapping windows that continuously move forward
in time by a specified increment, also known as the "slide" or "hop."
● These windows allow for capturing trends and patterns that might span
multiple time intervals.
● As new data arrives, the window slides by the specified step size,
incorporating new data while also retaining some overlap with previous
data.
● This overlap enables more comprehensive analysis of evolving trends, as in the moving-average sketch below.
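A hedged Python sketch of one common sliding-window variant: the window slides on every arriving event, and because windows overlap, each event contributes to many successive moving averages. Names are illustrative:

```python
from collections import deque

def sliding_average(events, window_seconds):
    """Emit the average of the last window_seconds of (timestamp, value)
    events each time a new event arrives."""
    window, total = deque(), 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        while window[0][0] <= ts - window_seconds:
            _, old = window.popleft()  # slide: evict expired events
            total -= old
        yield ts, total / len(window)

# 60-second window sliding over three events
for ts, avg in sliding_average([(0, 2), (30, 4), (70, 6)], 60):
    print(ts, avg)   # (0, 2.0), (30, 3.0), (70, 5.0)
```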
Tumbling Window Movement
● A “tumbling window” is a collection of rows that are aggregated to produce a smaller number of output rows, such as “the sum of the last twenty rows” or “the sum of the rows in the last hour”. One row is returned for every group of rows.
● Tumbling windows are the same as fixed windows.
● They are non-overlapping windows defined by a fixed size or time interval.
● The data is partitioned into these windows, and computations are performed independently on each window. Once the window's boundary is reached, it "tumbles" to the next set of data points, as in the sketch below.
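A minimal sketch of "one row per group" in Python, assuming events arrive in timestamp order; a (window_start, sum) row is emitted each time the window tumbles:

```python
def tumbling_sums(events, size_seconds):
    """Consume (timestamp, value) events in timestamp order and emit one
    (window_start, sum) row per window. Each event lands in exactly one
    window; crossing a boundary makes the window "tumble"."""
    current_start, total = None, 0
    for ts, value in events:
        start = ts - (ts % size_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            yield current_start, total  # one output row per group of rows
            current_start, total = start, 0
        total += value
    if current_start is not None:
        yield current_start, total

# 60-second tumbling windows
print(list(tumbling_sums([(0, 1), (30, 2), (65, 5)], 60)))
# [(0, 3), (60, 5)]
```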
Hopping Window Movement
● Hopping windows are a variation of sliding windows: fixed-size windows that advance by a regular interval, the "hop".
● When the hop is smaller than the window size, adjacent windows overlap, as with sliding windows; when it is larger, gaps appear between windows.
● The hop size determines how frequently a new window starts.
● Hopping windows are useful when you want to balance overlap for comprehensive analysis against efficient processing; the sketch below enumerates the hopping windows an event falls into.
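A small Python sketch that enumerates which hopping windows an event falls into, assuming half-open windows that start at non-negative multiples of the hop:

```python
def hopping_windows_for(ts, size_seconds, hop_seconds):
    """Return the start times of every hopping window containing ts,
    assuming half-open windows [start, start + size_seconds) that start
    at non-negative multiples of hop_seconds."""
    first = ((ts - size_seconds) // hop_seconds + 1) * hop_seconds
    start = max(0, first)
    starts = []
    while start <= ts:
        starts.append(start)
        start += hop_seconds
    return starts

# 60-second windows hopping every 20 seconds: the windows overlap,
# so one event belongs to several of them
print(hopping_windows_for(65, size_seconds=60, hop_seconds=20))
# [20, 40, 60]
```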
Event Streaming
● Event streaming refers to the practice of processing and analyzing real-time
data as a continuous stream of events.
● These events could be anything from user interactions on a website to
sensor readings in an industrial setting.
● Event streaming systems enable organizations to capture, process, and respond to events as they happen, allowing for real-time insights, decision-making, and actions.
Key Concepts

● Events: Events are discrete pieces of data that represent occurrences in the system or the external environment. They can be generated by applications, sensors, devices, users, or any other source.

● Event Stream: An event stream is a sequence of events that occur over time. Event streams can vary in volume, velocity, and variety, depending on the sources and use cases.
Components of Event Streaming
➔ Event Producers: These are the sources that generate and emit events. They can
be applications, sensors, databases, IoT devices, social media feeds, etc.
➔ Message Broker: The message broker serves as an intermediary that accepts
events from producers and delivers them to consumers (subscribers). Popular
message broker technologies include Apache Kafka, RabbitMQ, and Amazon
Kinesis.
➔ Event Consumers: Consumers subscribe to specific event types and receive events from the message broker. Consumers can be applications, analytics systems, real-time dashboards, or any component that processes events. A toy producer/broker/consumer sketch follows.
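To make the three roles concrete, here is a toy in-memory broker in Python. This is a teaching sketch only, not how Kafka, RabbitMQ, or Amazon Kinesis actually work; the class and topic names are invented:

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory message broker: producers publish events to
    topics; consumers subscribe with callbacks and receive each event."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self.subscribers[topic]:
            callback(event)

# Producer: an IoT sensor; consumer: a real-time dashboard
broker = ToyBroker()
broker.subscribe("sensor.temperature", lambda e: print("dashboard got:", e))
broker.publish("sensor.temperature", {"device": "t-101", "celsius": 21.5})
```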
Benefits
● Real-Time Insights: Event streaming enables organizations to gain insights
from data as events occur, leading to better decision-making and faster
responses.
● Continuous Processing: Event streaming supports continuous processing
of data, enabling businesses to react to changing conditions immediately.
● Flexibility: Event streaming platforms can handle a variety of data types and
sources, making them suitable for diverse use cases.
● IoT and Monitoring: Event streaming is ideal for IoT applications and real-time monitoring of systems, enabling proactive maintenance and rapid issue resolution.
● Event-Driven Architectures: Event streaming supports event-driven
architectures, which allow applications to react to events without needing to
constantly poll for updates.
● Data Integration: Event streaming can help integrate data from various
sources and systems, providing a unified view of operations.
Challenges

● Data Volume and Velocity: Managing high volumes of events and ensuring low-latency
processing can be challenging.
● Data Quality: Ensuring the accuracy and reliability of events is crucial for making
informed decisions.
● Complexity: Designing, deploying, and maintaining event streaming systems can be
complex, especially for organizations new to this approach.
● Scalability and Resource Management: As event loads increase, scaling the system
while managing resources becomes important.
Architecture
● Event Producers: Event producers are the sources that generate and emit events.
These can be applications, sensors, devices, databases, IoT devices, or any other data
source that generates events. Producers send events to the event streaming platform
for processing.
● Event Ingestion Layer: The event ingestion layer is responsible for receiving events
from producers and preparing them for processing. This layer might include
components like message brokers, event gateways, or data ingestion pipelines.
Popular message broker technologies like Apache Kafka, RabbitMQ, or cloud-based
services like Amazon Kinesis can be used here.
● Event Stream Processing: This layer processes the incoming events in real time. It
involves various components that analyze, filter, transform, enrich, and aggregate the
events. Complex event processing (CEP) engines, stream processing frameworks like
Apache Flink, Apache Spark Streaming, or even custom applications can be used for
this purpose.
Contd.
● State Management: For stateful event processing, a state management
component stores and maintains the state required for processing events over
time. This can be a key-value store, a database, or an in-memory data grid.
Stateful processing is useful for maintaining context and aggregating data over
windows of time.
● Event Consumers: Consumers subscribe to specific event types or topics and
receive processed events for further action or analysis. Consumers can be
applications, microservices, analytics platforms, real-time dashboards, or
downstream systems that require the processed event data.
● Event Schemas and Metadata: Managing event schemas and metadata is crucial
for maintaining data consistency and compatibility as events evolve over time.
Schema registries can be used to enforce schema validation and compatibility
checks.
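A hedged sketch of stateful processing: a plain dict stands in for the key-value state store, keeping a per-key running count per time window. Production systems (e.g. Flink's keyed state) add fault tolerance and checkpointing, which this toy ignores:

```python
from collections import defaultdict

class KeyedWindowCounter:
    """A stateful operator that counts events per (key, time window).
    A plain dict stands in for the key-value state store."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.state = defaultdict(int)  # (key, window_start) -> count

    def process(self, key, ts):
        window_start = ts - (ts % self.window)
        self.state[(key, window_start)] += 1
        return self.state[(key, window_start)]

counter = KeyedWindowCounter(window_seconds=60)
counter.process("user-1", 5)
counter.process("user-1", 42)
print(counter.process("user-1", 70))  # 1 -- a new window began at t=60
```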
Stream Processing

● Stream processing is a data management technique that involves ingesting a continuous data stream to quickly analyze, filter, transform, or enhance the data in real time. Once processed, the data is passed off to an application, data store, or another stream processing engine.

● Stream processing services and architectures are growing in popularity because they allow enterprises to combine data feeds from various sources. Sources can include transactions, stock feeds, website analytics, connected devices, operational databases, weather reports, and other commercial services.
How does stream processing work?
● Stream processing architectures help simplify the data management tasks
required to consume, process and publish the data securely and reliably. Stream
processing starts by ingesting data from a publish-subscribe service, performs an
action on it and then publishes the results back to the publish-subscribe service
or another data store. These actions can include processes such as analyzing, filtering, transforming, combining, or cleaning data; the pipeline sketch below walks through this ingest-act-publish loop.
● Stream processing commonly connotes the notion of real-time analytics, which is
a relative term. Real time could mean five minutes for a weather analytics app,
millionths of a second for an algorithmic trading app or a billionth of a second for
a physics researcher.
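A self-contained Python sketch of the ingest-act-publish loop described above; the generator functions stand in for a publish-subscribe service, and the sensor data and filtering rule are invented:

```python
def ingest():
    """Stand-in for a publish-subscribe source: yields raw events."""
    yield from [{"sensor": "t-101", "celsius": 21.5},
                {"sensor": "t-102", "celsius": -999.0},  # bad reading
                {"sensor": "t-103", "celsius": 19.0}]

def act(events):
    """Filter out bad readings, then transform Celsius to Fahrenheit."""
    for e in events:
        if e["celsius"] > -100:                          # filtering
            e["fahrenheit"] = e["celsius"] * 9 / 5 + 32  # transforming
            yield e

def publish(events):
    """Stand-in for publishing results back to a topic or data store."""
    for e in events:
        print("publish:", e)

publish(act(ingest()))  # ingest -> act -> publish
```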
How does stream processing work?
● However, this notion of real-time points to something important about how the stream
processing engine packages up bunches of data for different applications. The stream
processing engine organizes data events arriving in short batches and presents them to
other applications as a continuous feed, as in the micro-batch sketch below. This simplifies the logic for application developers combining and recombining data from various sources and from different time scales.
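A minimal Python sketch of that packaging step: the stream is cut into short batches and handed downstream as one continuous feed. The batch size here is count-based for simplicity; real engines often batch by time as well:

```python
import itertools

def micro_batches(stream, batch_size):
    """Cut an event stream into short fixed-size batches and present
    them downstream as one continuous feed of lists."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

# Downstream code sees a continuous feed of small batches
for batch in micro_batches(range(7), batch_size=3):
    print(batch)
# [0, 1, 2], [3, 4, 5], [6]
```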
Why is stream processing needed?

● Stream processing is needed to:
● Develop adaptive and responsive applications
● Help enterprises improve real-time business analytics
● Facilitate faster decision-making
● Improve decision-making with increased context
● Improve the user experience
● Create new applications that use a wider variety of data sources
