@gschmutz
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Introduction to
Stream Processing
Guido Schmutz
Frankfurt - 21.2.2019
@gschmutz guidoschmutz.wordpress.com
@gschmutz
Agenda
Introduction to Stream Processing
1. Motivation for Stream Processing?
2. Capabilities for Stream Processing
3. Implementing Stream Processing Solutions
4. Demo
5. Summary
@gschmutz
Guido Schmutz
Working at Trivadis for more than 22 years
Oracle Groundbreaker Ambassador & Oracle ACE Director
Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
145th edition
Introduction to Stream Processing
@gschmutz
Motivation for Stream Processing?
Introduction to Stream Processing
@gschmutz
Big Data solves Volume and Variety – not Velocity
[Diagram: bulk sources (DB extracts, files) are loaded via file/SQL import into a Big Data platform (Hadoop cluster with raw, refined and results storage plus parallel processing), which serves BI tools, the enterprise data warehouse, search/explore and enterprise apps via SQL export and services, all at high latency.]
@gschmutz
Big Data solves Volume and Variety – not Velocity
[Diagram: the same platform, now also facing event sources (location, telemetry, IoT data, mobile apps, social) that emit continuous event streams which the batch-oriented platform cannot serve with low latency.]
@gschmutz
Big Data solves Volume and Variety – not Velocity
[Diagram: event hubs now collect the event streams; the platform's parallel processing covers machine learning, graph algorithms and natural language processing, but results still arrive with high latency.]
@gschmutz
"Data at Rest" vs. "Data in Motion"
Data at Rest Data in Motion
Store
Act
Analyze
StoreAct
Analyze
11101
01010
10110
11101
01010
10110
Introduction to Stream Processing
@gschmutz
When to Stream / When not?
Introduction to Stream Processing
Real-Time: constant low latency, milliseconds and under
Near-Real-Time: low milliseconds to seconds, delay in case of failures
Batch: tens of seconds or more, re-run in case of failures
Source: adapted from Cloudera
@gschmutz
"No free lunch"
Introduction to Stream Processing
Real-Time: constant low latency, milliseconds and under
Near-Real-Time: low milliseconds to seconds, delay in case of failures
Batch: tens of seconds or more, re-run in case of failures
The lower the latency, the more "difficult" the architecture; "easier" architectures come with higher latency.
@gschmutz
Stream Processing Architecture solves Velocity
[Diagram: event sources (location, telemetry, IoT data, mobile apps, social) publish event streams to event hubs; a Stream Analytics platform (stream analytics, reference data / models, dashboards) processes them and serves results to services, enterprise apps, BI tools, the enterprise data warehouse and search/explore.]
Low(est) latency, no history
@gschmutz
Big Data for all historical data analysis
[Diagram: the streaming architecture from the previous slide, with the event hub additionally feeding the Big Data platform (parallel processing over raw, refined and results storage) via a data flow; bulk sources are still loaded through file/SQL import, and all historical data analysis stays on the Big Data platform.]
@gschmutz
Integrate existing systems with lower latency through CDC
[Diagram: existing databases now feed the event hub through Change Data Capture instead of periodic bulk extracts, so their changes reach the streaming and Big Data platforms with much lower latency.]
@gschmutz
New systems participate in event-oriented fashion
[Diagram: a Microservice platform joins the architecture; microservices, with their own state and APIs, consume and produce event streams via the event hub, next to the Stream Analytics and Big Data platforms and the traditional consumers.]
@gschmutz
Edge computing allows processing close to data sources
[Diagram: an Edge Node with local rules, event hub and storage sits next to the event sources (IoT data, mobile apps, telemetry), pre-processes events close to where they originate, and forwards them via data flows and event streams to the central event hub and the downstream platforms.]
@gschmutz
Unified Architecture for Modern Data Analytics Solutions
[Diagram: the unified architecture combines edge nodes, the central event hub, data flows and change data capture, the Big Data platform (parallel processing over raw, refined and results storage), stream analytics, microservices and the traditional consumers (BI tools, enterprise data warehouse, search/explore, enterprise apps).]
@gschmutz
Two Types of Stream Processing
(by Gartner)
Introduction to Stream Processing
Stream Data Integration
• focuses on the ingestion and processing of
data sources targeting real-time extract-
transform-load (ETL) and data integration
use cases
• filter and enrich the data
Stream Analytics
• targets analytics use cases
• calculating aggregates and detecting
patterns to generate higher-level, more
relevant summary information (complex
events)
• Complex events may signify threats or
opportunities that require a response from
the business
Gartner: Market Guide for Event Stream Processing, Nick Heudecker, W. Roy Schulte
@gschmutz
Stream Processing & Analytics Ecosystem
Stream Analytics
Event Hub
Open Source Closed Source
Stream Data Integration
Source: adapted from Tibco
Edge
Introduction to Stream Processing
@gschmutz
Stream vs. Table / Static
Stream ("History"): an unbounded sequence of structured data ("facts"); facts in a stream are immutable.
Table / Static ("State"): a view of a stream, or of another table, representing a collection of evolving facts; the latest value for each key in a stream; facts in a table are mutable.
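As a minimal illustration of this duality (the keys and values below are made up), the same sequence of facts can be read both ways:

// Toy sketch: the same facts as a stream (the full, append-only history)
// and as a table (latest value per key).
val stream = Seq(
  ("truck-1", "Normal"),
  ("truck-2", "Speeding"),
  ("truck-1", "Lane Departure"))

// Table view: collapse the stream to the latest value per key.
val table: Map[String, String] = stream.toMap
// Map("truck-1" -> "Lane Departure", "truck-2" -> "Speeding")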
Introduction to Stream Processing
@gschmutz
Introduction to Stream Processing
Important Capabilities for Stream Processing
@gschmutz
Capabilities: Stream Data Integration vs. Stream Analytics
Capability (Stream Data Integration / Stream Analytics)
Support for Various Data Sources: yes / partial
Streaming ETL (Transformation/Format Translation, Routing, Validation): yes / partial
Micro-Batching: yes / partial
Event-at-a-time: yes / yes
Delivery Guarantees: yes / yes
API: GUI-Based API / Declarative API / Programmatic: yes / yes
API: Streaming SQL: - / yes
Event Time vs. Ingestion / Processing Time: - / yes
Windowing: - / yes
Stream-to-Static Joins (Lookup/Enrichment): partial / yes
Stream-to-Stream Joins: - / yes
State Management: - / yes
Queryable State (aka Interactive Queries): - / yes
Event Pattern Detection: - / yes
Introduction to Stream Processing
@gschmutz
Integrating Data Sources
Introduction to Stream Processing
SQL Polling
Change Data Capture
(CDC)
File Polling
File Stream (File Tailing)
File Stream (Appender)
Sensor Stream
@gschmutz
Streaming ETL
Introduction to Stream Processing
• Flow-based "programming"
• Ingest Data from various sources
• Extract – Transform – Load
• High-Throughput, straight-through
data flows
• Data Lineage
• Batch- or Stream-Processing
• Visual coding with flow editor
• Event Stream Processing (ESP) but
not Complex Event Processing (CEP)
Source: Confluent
@gschmutz
Event-at-a-time vs. Micro Batch
Introduction to Stream Processing
Event-at-a-time Processing
• Events processed as they
arrive
• low-latency
• fault tolerance expensive
Micro-Batch Processing
• Splits the incoming stream into small batches
• Fault tolerance easier
• Better throughput
@gschmutz
Delivery Guarantees
Introduction to Stream Processing
At most once (fire-and-forget) [ 0 | 1 ]
The message is sent, but the sender doesn't care whether it is received or lost. It is the easiest and most performant behavior to support.
At least once [ 1+ ]
Retransmission of a message occurs until an acknowledgment is received. Since a delayed acknowledgment from the receiver could be in flight when the sender retransmits, the message may be received one or more times.
Exactly once [ 1 ]
Ensures that a message is received once and only once, and is never lost and never repeated. The system must implement whatever mechanisms are required to ensure that a message is received and processed just once.
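The slide itself is technology-neutral; as a hedged illustration, this is roughly how the three guarantees map onto Kafka producer settings (using Kafka here is an assumption, and the broker address is a placeholder):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

// Sketch: typical producer configurations for each delivery guarantee.
def atMostOnceProps(): Properties = {
  val p = new Properties()
  p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")
  p.put(ProducerConfig.ACKS_CONFIG, "0")      // fire-and-forget: no acknowledgment
  p.put(ProducerConfig.RETRIES_CONFIG, "0")   // and no retransmission
  p
}

def atLeastOnceProps(): Properties = {
  val p = new Properties()
  p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")
  p.put(ProducerConfig.ACKS_CONFIG, "all")    // wait for acknowledgment ...
  p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE.toString) // ... and retransmit, so duplicates are possible
  p
}

def exactlyOnceProps(): Properties = {
  val p = new Properties()
  p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")
  p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")           // broker deduplicates retransmissions
  p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-producer-1") // transactions for end-to-end exactly-once
  p
}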
@gschmutz
API
Introduction to Stream Processing
GUI-based / Drag-and-Drop
• A graphical way of designing a pipeline
• Often web-based to improve usability
Declarative
• A streaming engine that can be configured in a declarative way
• JSON, YAML
Programmatic
• Low-level (class) or high-level fluent API
• Higher-order functions as operators (filter, mapWithState, …)
Streaming SQL
• Use a stream in a FROM clause
• Extensions supporting windowing, pattern matching, spatial, … operators

val filteredDf = truckPosDf.
  where("eventType != 'Normal'")

SELECT * FROM truck_position_s
WHERE eventType != 'Normal'
"config": {
"connector.class":
"io.confluent.connect.mqtt.MqttSourceConnector",
"tasks.max": "1",
"mqtt.server.uri": "tcp://mosquitto-1:1883",
"mqtt.topics": "truck/+/position",
"kafka.topic":"truck_position",
...
@gschmutz
Event Time vs. Ingestion / Processing Time
Introduction to Stream Processing
Event time
the time at which events actually occurred
Ingestion time / Processing Time
the time at which events are ingested into /
processed by the system
Not all use cases care about event times but many do
Examples
• characterizing user behavior over time
• most billing applications
• anomaly detection
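As a sketch of how an engine is told to use event time rather than ingestion time: in Kafka Streams (an assumption here, the slide is engine-neutral) this is done with a timestamp extractor; TruckPosition and its eventTimestamp field are hypothetical.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.streams.processor.TimestampExtractor

// Hypothetical event type carrying its own event time.
case class TruckPosition(truckId: Int, eventTimestamp: Long, eventType: String)

// Take the event time from the payload instead of the broker/ingestion timestamp.
class TruckPositionTimestampExtractor extends TimestampExtractor {
  override def extract(record: ConsumerRecord[AnyRef, AnyRef], partitionTime: Long): Long =
    record.value() match {
      case pos: TruckPosition => pos.eventTimestamp  // event time: when it actually occurred
      case _                  => record.timestamp()  // fallback: ingestion-side timestamp
    }
}
// Registered via the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG property.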
@gschmutz
Windowing
Introduction to Stream Processing
Computations over events are done using windows of data.
Due to the size and never-ending nature of a stream, it is not feasible to keep the entire stream of data in memory.
A window of data represents a certain amount of data on which we can perform computations.
Windows give us the power to keep a working memory and look back at recent data efficiently.
[Diagram: a window of data sliding along a stream of data over time.]
@gschmutz
Windowing
Sliding Window (aka Hopping Window): uses eviction and trigger policies that are based on time: window length and sliding interval length.
Fixed Window (aka Tumbling Window): eviction policy always based on the window being full, and trigger policy based on either the count of items in the window or time.
Session Window: composed of sequences of temporally related events terminated by a gap of inactivity greater than some timeout.
Introduction to Stream Processing
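A minimal sketch of a windowed computation, assuming the kafka-streams-scala DSL (a 3.x version) and string-encoded events on a truck_position topic; it counts events per event type over a hopping window of 1 minute that advances every 30 seconds:

import java.time.Duration
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

val builder = new StreamsBuilder()
val positions = builder.stream[String, String]("truck_position")

val countsByEventType = positions
  .groupBy((_, eventType) => eventType)      // simplification: the value is just the event type
  .windowedBy(TimeWindows
    .ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofSeconds(10))
    .advanceBy(Duration.ofSeconds(30)))      // hopping window: 1 minute long, sliding every 30 seconds
  .count()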
@gschmutz
Joining – Stream-to-Static and Stream-to-Stream
Introduction to Stream Processing
Challenges of joining streams:
1. Data streams need to be aligned as they come, because they have different timestamps.
2. Since streams are never-ending, the joins must be limited; otherwise the join will never end.
3. The join needs to produce results continuously, as there is no end to the data.
Join types: Stream-to-Static (Table) Join, Stream-to-Stream Join (one window join), Stream-to-Stream Join (two window join).
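A sketch of both join flavours, again assuming the kafka-streams-scala DSL; the topics and the plain string payloads (everything keyed by truck id) are hypothetical simplifications:

import java.time.Duration
import org.apache.kafka.streams.kstream.JoinWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

val builder = new StreamsBuilder()
val positions = builder.stream[String, String]("truck_position")

// Stream-to-static (table) join: enrich each position with the latest driver fact.
val drivers = builder.table[String, String]("truck_driver")
val positionWithDriver = positions.join(drivers)((pos, driver) => s"$pos,$driver")

// Stream-to-stream join: must be windowed, otherwise it would never end.
val speeds = builder.stream[String, String]("truck_speed")
val positionWithSpeed = positions.join(speeds)(
  (pos, speed) => s"$pos,$speed",
  JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofSeconds(30)))  // only pair events at most 30s apart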
@gschmutz
State Management
Introduction to Stream Processing
Necessary if stream processing use case
is dependent on previously seen data or
external data
Windowing, Joining and Pattern
Detection use State Management behind
the scenes
State Management services can be made
available for custom state handling logic
State needs to be managed as close to
the stream processor as possible
Options for State Management, with increasing operational complexity and features: in-memory, local embedded store, replicated distributed store.
Key question: how does it handle failures if a machine crashes and some or all of the state is lost?
@gschmutz
Queryable State (aka. Interactive Queries)
Introduction to Stream Processing
Exposes the state managed by the
Stream Analytics solution to the
outside world
Allows an application to query the managed state, e.g. to visualize it
For some scenarios, Queryable State
can eliminate the need for an external
database to keep results
[Diagram: a stream processor inside the stream processing cluster manages state (including reference data and models); a Query API exposes that state to search/explore tools, online and mobile apps, and dashboards.]
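A sketch of reading such state, assuming Kafka Streams interactive queries and a hypothetical state store named count_by_event_type:

import org.apache.kafka.streams.{KafkaStreams, StoreQueryParameters}
import org.apache.kafka.streams.state.QueryableStoreTypes

// Read an aggregate straight out of the stream processor's state store
// instead of an external database.
def currentCount(streams: KafkaStreams, eventType: String): Option[Long] = {
  val store = streams.store(
    StoreQueryParameters.fromNameAndType(
      "count_by_event_type",
      QueryableStoreTypes.keyValueStore[String, java.lang.Long]()))
  Option(store.get(eventType)).map(_.longValue())  // None if the key has not been counted yet
}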
@gschmutz
Event Pattern Detection
Introduction to Stream Processing
• Streaming Data often contains interesting patterns that only emerge as new
streaming data arrives, e.g.
• Absence Pattern: event A not followed by event B within time window
• Sequence Pattern: event A followed by event B followed by event C
• Increasing Pattern: up trend of a value of a certain attribute
• Decreasing Pattern: down trend of a value of a certain attribute
• …
• Pattern operators allow developers to define complex relationships between
streaming events
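As a toy illustration only (no particular CEP engine assumed), the absence pattern can be checked over an in-memory, time-ordered batch of events like this; real engines express such patterns declaratively with pattern operators and evaluate them incrementally as events arrive:

// Hypothetical event type and a brute-force absence check:
// "event A not followed by event B (same key) within windowMs".
case class Event(key: String, eventType: String, timestamp: Long)

def absenceViolations(events: Seq[Event], windowMs: Long): Seq[Event] =
  events.filter(_.eventType == "A").filterNot { a =>
    events.exists(b =>
      b.key == a.key && b.eventType == "B" &&
      b.timestamp > a.timestamp && b.timestamp <= a.timestamp + windowMs)
  }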
@gschmutz
Capabilities: Stream Data Integration vs. Stream Analytics
Capability (Stream Data Integration / Stream Analytics)
Support for Various Data Sources: yes / -
Streaming ETL (Transformation/Format Translation, Routing, Validation): yes / partial
Micro-Batching: yes / partial
Event-at-a-time: yes / yes
Delivery Guarantees: yes / yes
API: GUI-Based API / Declarative API / Programmatic: yes / yes
API: Streaming SQL: - / yes
Event Time vs. Ingestion / Processing Time: - / yes
Windowing: - / yes
Stream-to-Static Joins (Lookup/Enrichment): partial / yes
Stream-to-Stream Joins: - / yes
State Management: - / yes
Queryable State (aka Interactive Queries): - / yes
Event Pattern Detection: - / yes
Introduction to Stream Processing
@gschmutz
Implementing Stream Processing Solutions
Introduction to Stream Processing
@gschmutz
Stream Processing & Analytics Ecosystem
[Diagram: product landscape grouped into Event Hub, Stream Data Integration, Stream Analytics and Edge, each split into open source and closed source offerings. Source: adapted from Tibco]
Introduction to Stream Processing
@gschmutz
Event Hub: Apache Kafka
Distributed Log at the Core
Highly available, Pub/Sub infrastructure
Highly Scalable
Logs do not (necessarily) forget; retention can be (see the sketch below):
• Never (keep forever)
• Time- (TTL) or size-based
• Log-compaction-based
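A sketch of what those retention flavours look like as topic configurations, using the Kafka AdminClient (broker address, topic names and sizing are placeholders):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")
val admin = AdminClient.create(props)

val keepForever = new NewTopic("audit_log", 8, 3.toShort)
  .configs(Collections.singletonMap("retention.ms", "-1"))         // never forget
val keepSevenDays = new NewTopic("truck_position", 8, 3.toShort)
  .configs(Collections.singletonMap("retention.ms", "604800000"))  // time-based (TTL): 7 days
val compacted = new NewTopic("truck_driver", 8, 3.toShort)
  .configs(Collections.singletonMap("cleanup.policy", "compact"))  // log compaction: latest value per key

admin.createTopics(java.util.Arrays.asList(keepForever, keepSevenDays, compacted))
admin.close()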
@gschmutz
Stream Data Integration: Kafka Connect
Introduction to Stream Processing
Many connectors available
Implement custom connectors using Java
Supported by Confluent
#!/bin/bash
curl -X "POST" "http://192.168.69.138:8083/connectors" \
  -H "Content-Type: application/json" \
  -d $'{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "tcp://mosquitto:1883",
    "mqtt.topics": "truck/+/position",
    "kafka.topic": "truck_position"
  }
}'
declarative style data flows
simplicity - “simple things done simple”
very well integrated with Kafka –
framework is part of Kafka
Single Message Transforms (SMT)
@gschmutz
Stream Data Integration: StreamSets
Continuous open source, intent-driven,
big data ingest
Visible, record-oriented approach fixes
combinatorial explosion
Both stream and batch processing
• Standalone, Spark cluster, MapReduce
cluster
IDE for pipeline development by ‘civilians’
special option for Edge computing
custom sources, sinks, processors
Supported by StreamSets
@gschmutz
Streaming Analytics: Kafka Streams
Designed as a simple and lightweight
library in Apache Kafka
no dependencies other than Kafka
Supports fault-tolerant local state
Supports Windowing (Fixed, Sliding and
Session) and Stream-Stream / Stream-
Table Joins
Millisecond processing latency, no micro-
batching
At-least-once and exactly-once
processing guarantees
KTable<Integer, Customer> customers =
  builder.table("customer");
KStream<Integer, Order> orders =
  builder.stream("order");
KStream<Integer, String> enriched =
  orders.leftJoin(customers, …);
enriched.to("orderEnriched");
[Diagram: a Java application embeds the Kafka Streams library and reads from / writes to the trucking_driver topics on the Kafka broker.]
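A minimal sketch of how such a topology is wired up and started inside a plain application (application id and broker address are placeholders):

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enricher")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")

val builder = new StreamsBuilder()
// ... build the KStream/KTable topology as sketched above ...

val streams = new KafkaStreams(builder.build(), props)
streams.start()
sys.addShutdownHook(streams.close())  // close the embedded stream processor cleanly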
@gschmutz
Streaming Analytics: KSQL
STREAM and TABLE as first-class
citizens
• STREAM = data in motion
• TABLE = collected state of a stream
Stream Processing with zero coding
using SQL-like language
Built on top of Kafka Streams
Interactive (CLI) and headless (command
file)
ksql> CREATE STREAM order_s
        WITH (kafka_topic='order',
              value_format='AVRO');

 Message
----------------
 Stream created

ksql> SELECT * FROM order_s
      WHERE address->country = 'Switzerland';
...
[Diagram: the KSQL CLI sends commands to the KSQL engine, which executes them as Kafka Streams topologies against the Kafka broker.]
@gschmutz
Streaming Analytics: Spark Structured Streaming
Introduction to Stream Processing
2nd generation (the 1st generation was Spark Streaming)
Structured API through DataFrames /
Datasets rather than RDDs
Easier code reuse between batch and
streaming
marked production ready in Spark 2.2.0
Support for Java, Scala, Python, R and
SQL
val orderDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")
  .option("subscribe", "order")
  .load()

val orderFilteredDf = orderDf
  .where("address.country = 'Switzerland'")
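Continuing the snippet above, nothing runs until a sink is attached and the query is started; a minimal sketch using the console sink (in practice a Kafka or file sink plus a checkpointLocation would be used):

val query = orderFilteredDf
  .writeStream
  .format("console")       // print results; swap for "kafka" or a file sink in a real pipeline
  .outputMode("append")
  .start()

query.awaitTermination()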
@gschmutz
Demo
Introduction to Stream Processing
@gschmutz
Sample Use Case
Trucks (Truck-1, Truck-2, Truck-3) publish their positions to the truck_position topic. A detect_dangerous_driving processor filters the dangerous driving events. Truck and Driver reference data arrives via Change Data Capture and is joined to these events (join_dangerous_driving_driver), producing dangerous_driving_driver. A Count By Event Type over a Window (1m, 30s) produces count_by_event_type.
Introduction to Stream Processing
@gschmutz
Summary
Introduction to Stream Processing
@gschmutz
Summary
Introduction to Stream Processing
Stream Processing is the solution for low-latency requirements
Event Hub, Stream Data Integration and Stream Analytics are the main building
blocks in your architecture
Kafka is currently the de-facto standard for Event Hub
Various options exist for Stream Data Integration and Stream Analytics
SQL becomes a valid option for implementing Stream Analytics
Still room for improvements (SQL, Event Pattern Detection, Streaming Machine
Learning)
@gschmutz
Introduction to Stream Processing
Technology on its own won't help you.
You need to know how to use it properly.