BASEL | BERN | BRUGG | BUCHAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENEVA
HAMBURG | COPENHAGEN | LAUSANNE | MANNHEIM | MUNICH | STUTTGART | VIENNA | ZURICH
http://guidoschmutz.wordpress.com @gschmutz
Fundamentals of Big Data and AI Architecture
DOAG Data Centric Day, 25.9.2019 in Cologne
Guido Schmutz
Guido Schmutz
Working at Trivadis for more than 22 years
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE Director
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
169th edition
From Data Warehouse …
Data Warehouse Architecture
[Diagram components: Bulk Source (DB, File, Extract); ETL Engine (Extract, Transform & Load); Enterprise Data Warehouse (RDBMS); Consumer (BI Tools). Annotation: high latency]
Data Warehouse is an architecture
Layered model, controlled ETL, single point of truth, query-optimized data marts
Tested, optimized, quality assured, "operated"
Standard reporting, ad hoc reporting on DWH base
Perfect and fast for new requirements on known and prepared data and structures
Data Warehouse is not "agile"
No free definition and shaping of arbitrary analytical questions
= Data Production
Source: https://www.flickr.com/photos/128950981@N04/15452926858
DWH Architecture – what about Streaming Data?
[Diagram components: Bulk Source (DB, File, Extract); ETL Engine (Extract, Transform & Load); Enterprise Data Warehouse (RDBMS); Consumer (BI Tools); Event Source (Location, Weather, IoT Data, Mobile Apps, Social)]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
… to Big Data / Data Lake
Initial Idea of a Data Lake …
Adapted from Wikipedia.org
“Reporting, visualization, analytics and machine learning”
“Single store of all data in the enterprise”
“Should put an end to data silos.”
“Example: Distributed file system used in Apache Hadoop”
Data Lake
Data Lake is an Infrastructure
Permanently new Data and Structures
Schema on Read
Really large amounts of Data
Explorative Working (Research)
Established Error-Culture
New user groups ([Data] Scientists)
Freedom of data-choice
Freedom of source-choice
Self-Service Data Labs
adHoc- & One-Shot implementations
Query + Advanced Analytics
= Research & Development
Source: https://www.flickr.com/photos/ian-arlett/34233379390
Data-Lab Interpretation
Schema on Read instead of (only) Schema on Write
"Schema on Write"
• Data quality managed by formalized ETL process
• Data persisted in tabular, agreed and consistent form
• Data integration happens in ETL
• Structure must be decided before writing
"Schema on Read"
• Interpretation of data captured in code for each program accessing the data
• Data quality dependent on code quality
• Data integration happens in code
[Diagram, Schema on Write: Data Source; ETL (Transform); EDWH (RDBMS); Consumer (BI Tools)]
[Diagram, Schema on Read: Data Source; Data Lake (Storage); Script (Transform); Consumer (Data Science Workbench)]
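The "Schema on Read" side can be sketched in a few lines of Python. This is a minimal illustration and not taken from the slides: the record layout, the field names and the `read_temperature` helper are assumptions. The raw lines are stored untouched, and the consumer applies its own interpretation at read time, so data quality depends on that code.

```python
import json

# Raw storage: records are kept exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"sensor": "s1", "temp": "21.5", "ts": "2019-09-25T10:00:00"}',
    '{"sensor": "s2", "temperature": 19.0}',  # same meaning, different field name
    'not json at all',                         # broken record
]

def read_temperature(line):
    """Schema on read: interpret one raw line, return (sensor, temp) or None."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return None  # the reader, not the writer, decides how to treat bad data
    if not isinstance(rec, dict) or "sensor" not in rec:
        return None
    value = rec.get("temp", rec.get("temperature"))
    if value is None:
        return None
    return rec["sensor"], float(value)

readings = [r for r in map(read_temperature, RAW_LINES) if r is not None]
print(readings)
```

A second program reading the same raw lines could apply a different interpretation, which is the flexibility and the risk the bullets describe.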
Big Data / Data Lake Architecture
[Diagram components: Bulk Source (DB, File, Extract); File Import / SQL Import; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing); “Native” Raw access; Consumer (Data Science Workbench). Annotation: high latency]
Consumer workloads:
• Machine Learning
• Graph Algorithms
• Natural Language Processing
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Big Data / Data Lake Architecture
[Diagram components: Bulk Source (DB, File, Extract); File Import / SQL Import; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); access paths (SQL, “Native” Raw); Consumer (BI Tools, Data Science Workbench)]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Data Lake & EDWH Architecture
[Diagram components: Bulk Source (DB, File, Extract); File Import / SQL Import; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; Enterprise Data Warehouse (RDBMS); access paths (SQL, “Native” Raw); Consumer (BI Apps, Data Science Workbench)]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Data Lake & EDWH Architecture
[Diagram components: Bulk Source (DB, File, Extract); File Import / SQL Import; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; Enterprise Data Warehouse (RDBMS); access paths (SQL, SQL / Search, “Native” Raw); Consumer (BI Apps, Data Science Workbench)]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Data Lake & EDWH Architecture with Streaming Data
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); File Import / SQL Import; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; Enterprise Data Warehouse; access paths (SQL, SQL / Search, “Native” Raw); Consumer (BI Apps, Data Science Workbench)]
Data Lake & EDWH Architecture with Streaming Data
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Bulk Data Import; File Import / SQL Import; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; Enterprise Data Warehouse; access paths (SQL, SQL / Search, “Native” Raw); Consumer (BI Apps, Data Science Workbench). Annotation: high latency]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Keep the data in motion …
[Diagram: Data at Rest vs. Data in Motion, each shown as a sequence of the steps Store, Analyze, Visualize and Act / (Re)Act]
Event Processing Architecture
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Stream Processing Cluster (Stream Processor, Model / State, Rules Engine); Service; “SQL” / Search; Consumer (Enterprise App, Dashboard). Annotation: Low(est) latency, no history]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Stream processing capabilities:
• Complex Event Processing (CEP)
• Machine Learning Model Execution (Inference)
• State Transition
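How a stream processor combines a rule with per-key state can be sketched as follows. The `OverheatDetector` class, the event fields and the thresholds are hypothetical and only illustrate the idea of a state transition per incoming event; a real deployment would run such logic inside a stream processing framework.

```python
from collections import defaultdict

class OverheatDetector:
    """Keeps per-sensor state and emits an alert after `limit` consecutive hot readings."""

    def __init__(self, threshold=80.0, limit=3):
        self.threshold = threshold
        self.limit = limit
        self.hot_streak = defaultdict(int)  # state: sensor id -> consecutive hot readings

    def on_event(self, event):
        sensor, temp = event["sensor"], event["temp"]
        if temp > self.threshold:
            self.hot_streak[sensor] += 1  # state transition: streak grows
        else:
            self.hot_streak[sensor] = 0   # state transition: streak resets
        if self.hot_streak[sensor] == self.limit:
            return {"alert": "overheat", "sensor": sensor}
        return None

detector = OverheatDetector()
stream = [{"sensor": "s1", "temp": t} for t in (70, 85, 90, 95, 60)]
alerts = [a for a in map(detector.on_event, stream) if a]
print(alerts)
```

Each event is handled as it arrives and only the small streak counter is kept, which matches the "low(est) latency, no history" character of this architecture.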
Event Processing & Data Lake
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); File Import / SQL Import; Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Stream Processing Cluster (Stream Processor, Model / State, Rules Engine); SQL Export; Enterprise Data Warehouse (RDBMS); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App, Dashboard)]
Rating criteria (scale): Elasticity (Yes/No); End-to-End Latency (Low/High); Ad-Hoc (SQL) Queries (Yes/No); Storage Costs (Low/High); Supports Raw Data (Yes/No); Supports Streaming Data (Yes/No); Access Latency (Low/High)
Event Processing & Data Lake: Lambda Architecture
[Diagram components: Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Bulk Data Flow; Batch Layer: Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine) producing the Batch Result; Speed Layer: Stream Processing Cluster (Stream Processor, Model / State) with ML Inference Server producing the Speed Result; Serving: API (Merger); Consumer (BI Apps, Dashboard)]
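The role of the merger in the serving layer can be sketched as below. The view contents and the `merge_views` function are assumptions made for illustration: the batch view is taken to hold counts up to the last batch run, and the speed view holds only the increments that arrived afterwards.

```python
def merge_views(batch_view, speed_view):
    """Serving-layer merge: batch counts plus increments not yet covered by a batch run."""
    merged = dict(batch_view)
    for key, increment in speed_view.items():
        merged[key] = merged.get(key, 0) + increment
    return merged

# Batch layer: complete but stale result, recomputed from all raw data (high latency).
batch_view = {"zurich": 1200, "bern": 800}
# Speed layer: events that arrived after the last batch run (low latency).
speed_view = {"zurich": 7, "basel": 3}

merged = merge_views(batch_view, speed_view)
print(merged)
```

Once the next batch run has absorbed those events, the speed view for that period is discarded; this sketch leaves that hand-over out.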
Event Processing & Data Lake: Kappa Architecture
[Diagram components: Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Replay; Speed Layer: Stream Processing Cluster with Stream Processor V1.0 / State V1.0 producing Result V1.0 and Stream Processor V2.0 / State V2.0 producing Result V2.0; Serving: API (Switcher); Bulk Data Flow; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Consumer (BI Apps, Dashboard)]
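The replay-and-switch idea can be sketched in plain Python. The event log, both processor functions and the `active_version` variable are hypothetical stand-ins: the list plays the role of the retained log in the event hub, and switching is reduced to choosing which result is served.

```python
# The event hub retains the ordered log, so a new processor version can replay it.
EVENT_LOG = [("s1", 20.0), ("s1", 22.0), ("s2", 30.0), ("s1", 24.0)]

def processor_v1(log):
    """V1.0: last value seen per sensor."""
    state = {}
    for sensor, value in log:
        state[sensor] = value
    return state

def processor_v2(log):
    """V2.0: average value per sensor."""
    sums, counts = {}, {}
    for sensor, value in log:
        sums[sensor] = sums.get(sensor, 0.0) + value
        counts[sensor] = counts.get(sensor, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}

results = {"v1": processor_v1(EVENT_LOG)}
results["v2"] = processor_v2(EVENT_LOG)  # replay from the start of the log
active_version = "v2"                     # the switcher now serves the V2.0 result
print(results[active_version])
```

Because the same log feeds both versions, no separate batch layer is needed to rebuild results after a logic change.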
Integrate existing systems with CDC
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Change Data Capture; File Import / SQL Import; Bulk Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Stream Processing Cluster (Stream Processor, Model / State, Rules Engine); SQL Export; Enterprise Data Warehouse (RDBMS); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App, Dashboard)]
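What change data capture delivers to the event hub can be sketched as follows. This is a simplified illustration that compares two table snapshots; the `capture_changes` function and the event layout are assumptions, and log-based CDC tools read the database transaction log instead of diffing snapshots.

```python
def capture_changes(before, after):
    """Compare two snapshots of a table (keyed by primary key) and emit change events."""
    events = []
    for key, row in after.items():
        if key not in before:
            events.append({"op": "insert", "key": key, "after": row})
        elif before[key] != row:
            events.append({"op": "update", "key": key, "before": before[key], "after": row})
    for key, row in before.items():
        if key not in after:
            events.append({"op": "delete", "key": key, "before": row})
    return events

before = {1: {"name": "Ann", "city": "Bern"}, 2: {"name": "Bob", "city": "Basel"}}
after = {1: {"name": "Ann", "city": "Zurich"}, 3: {"name": "Eve", "city": "Geneva"}}
events = capture_changes(before, after)
print([e["op"] for e in events])
```

Downstream consumers then treat each change like any other event on the stream, which is how an existing database-centric system joins the event-driven part of the platform without being rewritten.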
Applications participate in an event-driven way
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Change Data Capture; File Import / SQL Import; Bulk Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt); Stream Processing Platform (Stream Processor, Model / State, Rules Engine); Microservice Platform (Microservice Data, API); SQL Export; Enterprise Data Warehouse (RDBMS); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App)]
Move Processing to Edge
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; File Import / SQL Import; Bulk Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Stream Processing Cluster (Stream Processor, Model / State, Rules Engine); Microservice Cluster (Microservice Data, API); SQL Export; Enterprise Data Warehouse (RDBMS); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App)]
But be careful …
Data Swamp
Anyone does what they want
No (central?) documentation
No uniform data structure
No uniform transformations
No uniform KPI definitions
No quality assurance
No data flow analysis
Silo thinking
Data availability? Security? Auditability?
= No Data Architecture
Source: https://www.flickr.com/photos/82134796@N03/10603438015
Data Lake Zones & Data Catalog
Data Lake Zones
[Diagram areas: Data Source; Data Storage; Data Access. Zones shown: Landing Zone, Raw Zone, Refined Zone, Usage-Optimized Zone, Sandbox Zone, Archive Zone. Storage technologies shown across the zones: File System, Object Store, Event Hub, Event Hub/Store, RDBMS, RDBMS/NoSQL, NoSQL, In-Memory Grid, Disk Service, Tape]
Data Catalog
[Diagram components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; File Import / SQL Import; Bulk Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Stream Processing Cluster (Stream Processor, Model / State, Rules Engine); Microservice Cluster (Microservice Data, API); SQL Export; Enterprise Data Warehouse (RDBMS); Governance (Data Catalog); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App)]
(Machine Learning Augmented) Data Catalog
A data catalog creates and maintains an inventory of data assets through discovery, description and organization of distributed datasets.
It provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value.
Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata.
Data Catalog
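The inventory-and-search core of a data catalog can be sketched as follows. The `CatalogEntry` and `DataCatalog` classes, the dataset names and the tags are hypothetical; the machine-learning-augmented parts (automated discovery, enrichment, semantic relationships) are not modelled here.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    location: str
    description: str
    tags: set = field(default_factory=set)

class DataCatalog:
    """In-memory inventory of data assets with a simple keyword search."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        )

catalog = DataCatalog()
catalog.register(CatalogEntry("orders_raw", "lake/raw/orders", "Unprocessed order events", {"sales"}))
catalog.register(CatalogEntry("customer_dim", "edwh.customer", "Curated customer dimension", {"sales", "pii"}))
print(catalog.search("sales"))
print(catalog.search("order"))
```

The point of the sketch is that the catalog stores descriptions and locations, not the data itself, so it can span the data lake, the warehouse and other stores.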
Data Catalog Features
• Ranking on Utilization
• Rate Catalog Objects
• Maintain Multiple Versions of Catalog Object
• Search & Navigation for Content
• Content Check in/out
• Certify Official Versions of Metadata
• Analyze and Audit Decision Processes
• Integrate Data Lineage
• Levels of Access to Catalog Objects
• Impact Analysis
• API for Search / Catalog / Mgmt Functions
• Track Usage of Catalog Objects
• Integration with IAM
• Automated Crawling of Source System
• Catalog Cloud-Deployed Sources
• Catalog Hadoop-based Sources
• Catalog BI & Data Visualization Tools
• Catalog Databases
• Integration with self-service Tools
• Classify Catalog Objects by Business Glossary
• Supports user-defined Tagging
• Integrates with Data Profiling
• Supports Data Sampling
• Quality Metrics
• Catalog Machine/IoT Data
• Supports Discussion Threads on Catalog Objects
• Annotate & Comment on Catalog Objects
• Catalog Unstructured Data with NLP functionality
• Semantic Search
• Classify Catalog Objects by Domain
• Publish/Subscribe on Changes of Catalog Objects
• AI/ML based Recommendation
• Detect Similar/Duplicate or Related Data
• Easy to use, intuitive GUI
• Supports Manual Curation
• Supports Automated (ML based) tagging
• Supports ongoing discovery of new data sets
• Natural Language Search
• Faceted Search
• Catalog Object Value Estimation
• Incentive-based Participation Encouragement
• Assign Data Steward
Traditional vs. Cloud Native Big Data Platforms
Traditional vs. Cloud Native Big Data
[Diagram, Data Local Compute (traditional): Master Node and Worker #1 to #3, each worker with its own Disk and Processing, connected by a Network]
[Diagram, Separate Compute and Storage (cloud native): Master Node and Compute #1 to #3 (Processing only), connected over the Network to shared Storage (Disks)]
Separation of compute and storage – the fundamental difference
• store data in Object Storage instead of HDFS
• bring up Compute nodes only for data processing
• multiple workloads on separate clusters can access same data
Traditional vs. Cloud Native Big Data
Criterion                        Traditional               Cloud Native
Data Local Compute               Yes                       No
Network Bandwidth Req.           Low                       High
Scalable, shared usage of Data   No (only within cluster)  Yes
Persistence                      HDFS                      Object Storage
Data Lifecycle                   Tiered Storage            Built-in (cloud)
Compute                          Hadoop, Spark             Hadoop, Spark
Serverless Processing            No                        Yes
Infrastructure                   Hadoop Cluster            Cloud, Container Orchestration
Entry Threshold                  High                      Low
Modern Data Platform
[Diagram, Data Platform components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; File Import / SQL Import; Bulk Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Stream Processing Cluster (Stream Processor, Model / State, Rules Engine); Microservice Cluster (Microservice Data, API); SQL Export; Enterprise Data Warehouse (RDBMS); Governance (Data Catalog); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App)]
On-Premises – Traditional
[Diagram: the Modern Data Platform components, annotated with the following technologies]
Hadoop YARN, HDFS, MapReduce, Spark, SparkSQL, Spark Streaming, Pig, Hive, Impala, Kafka, Confluent, Kafka Streams, Spring Boot, NoSQL, RDBMS, Atlas, Debezium, Streamsets, Flume, Sqoop, Zeppelin, Jupyter
Oracle Cloud
[Diagram: the Modern Data Platform components, annotated with the following technologies]
Kafka, Confluent, Streamsets, Nifi, Object Storage, Archive Storage, Data Science, Big Data Cloud Service, Machine Learning, Streaming, Functions, Visual Builder, Java, NoSQL DB, Data Catalog, Autonomous Transaction Proc, Autonomous DWH, Big Data SQL Cloud Service, GoldenGate Cloud Service, Kafka Streams/KSQL, SOA Cloud Service, Container Engine for Kubernetes, Zeppelin, Jupyter, Transfer Service, Container Pipelines, Container Registry
AWS Cloud
[Diagram: the Modern Data Platform components, annotated with the following technologies]
Kafka, Confluent, Streamsets, Nifi, Zeppelin, Jupyter, S3, S3 Glacier Deep Archive, Dynamo DB, Redshift, Redshift Spectrum, Spark on EMR, Glue, Snowball, Data Sync, Athena, Presto on EMR, SageMaker, Deep Learning Containers, Spark Streaming on EMR, Databricks on AWS, Kinesis Data Analytics, Lambda, Batch, Spring Boot, QuickSight, Zeppelin on EMR, RStudio on EMR, API Gateway, Managed Streaming for Kafka, Kinesis Data Firehose, Confluent Cloud
On-Premises – Cloud Native
[Diagram: the Modern Data Platform components, annotated with the following technologies]
Istio, Kubernetes, Docker, Spark, SparkSQL, Spark Streaming, MinIO, S3, Kafka, Confluent, Kafka Streams, Presto, Spring Boot, NoSQL, RDBMS, Atlas, Debezium, Streamsets, Nifi, Zeppelin, Jupyter
Physical Data Lake vs. Virtual Data Lake
Physical Data Lake
[Diagram components: Data Source 1 (File); Data Source 2 (RDBMS); Data Source 3 (NoSQL); Data Source 4 (Enterprise App); Data Ingest; Data Lake / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Consumer (BI Apps); Governance (Data Catalog, Data Lineage, Encryption, Policy Mgmt). Interactions: Query, Discovery, Catalog]
Virtual Data Lake
[Diagram components: Data Source 1 (File); Data Source 2 (RDBMS); Data Source 3 (NoSQL); Data Source 4 (Enterprise App); Data Virtualization (Query Engine); Consumer (BI Apps); Governance (Logical Data Catalog, Data Lineage, Encryption, Policy Mgmt). Interactions: Query, Discovery, Catalog]
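The idea of a virtual data lake, where data stays in its source and is combined at query time, can be sketched with the standard library. Everything here is a stand-in chosen for illustration: CSV text plays the role of a file source, an in-memory SQLite database plays the role of the RDBMS, and `revenue_per_customer` plays the role of the virtualization layer.

```python
import csv
import io
import sqlite3

# Source 1: a file (CSV text stands in for a file on disk).
CSV_SOURCE = "customer_id,amount\n1,10.0\n2,5.5\n1,4.5\n"

# Source 2: an RDBMS (in-memory SQLite stands in for it).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

def revenue_per_customer():
    """Virtualization layer: query both sources at request time and join the results."""
    totals = {}
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    names = dict(db.execute("SELECT id, name FROM customer"))
    return {names[cid]: total for cid, total in totals.items()}

result = revenue_per_customer()
print(result)
```

No data is copied into a central store; the trade-off is that every query depends on the availability and speed of the underlying sources.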
Physical Data Lake as part of Virtual Data Lake
[Diagram components: Data Source 1 (File); Data Source 2 (RDBMS); Data Source 3 (NoSQL); Data Source 4 (Enterprise App); Data Ingest; Data Lake / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Data Virtualization (Query Engine); Consumer (BI Apps); Governance (Logical Data Catalog, Data Lineage, Encryption, Policy Mgmt). Interactions: Query, Discovery, Catalog]
Multiple Data Lakes form a Virtual Data Lake
[Diagram components: Data Source 1 (File); Data Source 2 (RDBMS); Data Lake 1 / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Data Lake 2 / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Data Virtualization (Query Engine); Consumer (BI Apps); Governance (Logical Data Catalog, Data Lineage, Encryption, Policy Mgmt). Interactions: Query, Discovery, Catalog]
AI & Machine Learning
AI & Machine Learning: Training vs. Inference
[Diagram, © 2019 Gartner, Inc., ID: 354956. Logical Flow of Data]
Training (Learning a New Capability From Existing Data): Raw Data from the Logical Data Warehouse forms a labeled Training Dataset (“dog”, “cat”), from which a Deep-Learning Framework produces a Trained Model. Runs On-Premises or Cloud-Hosted.
Inference (Applying This Capability to New Data): an App or Service Featuring the Capability passes New Data (“?”) to the Trained Model, which returns a label (“cat”). Runs on an Edge Device, On-Premises or Cloud-Hosted.
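The split between training and inference can be sketched with a deliberately tiny model. A nearest-centroid classifier stands in for the deep-learning framework on the slide, and the features (weight, ear length) and their values are invented for illustration.

```python
def train(dataset):
    """Training: learn one centroid per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def infer(model, features):
    """Inference: apply the trained model to new data (nearest centroid wins)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: dist2(model[label]))

# Hypothetical features: (weight in kg, ear length in cm)
training_dataset = [
    ((4.0, 6.0), "cat"), ((5.0, 7.0), "cat"),
    ((20.0, 12.0), "dog"), ((30.0, 14.0), "dog"),
]
model = train(training_dataset)   # done once, where the training data lives
print(infer(model, (4.5, 6.5)))   # done per request, wherever the model is deployed
```

Only the small `model` object has to travel to the place where inference runs, which is why inference can sit on an edge device while training stays close to the data platform.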
Modern Data Platform
[Diagram, Data Platform components: Bulk Source (DB, File, Extract); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; File Import / SQL Import; Bulk Data Flow; Event Hub; Event Stream; Big Data Platform / Hadoop Cluster (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Stream Processing Cluster (Stream Processor, Model / State) with ML Inference Server; Microservice Cluster (Microservice Data, API) with ML Inference Server; SQL Export; Enterprise Data Warehouse (RDBMS); Governance (Data Catalog); access paths (SQL, SQL / Search, “SQL” / Search, “Native” Raw, Service); Consumer (BI Apps, Data Science Workbench, Enterprise App)]
AI & Machine Learning: Model Training & Deployment
Integration of Machine Learning Model in application
[Diagram: four integration patterns, each combining an Application, ML Serving and a Trained ML Model; one element is labeled Backing Service]
• ML as an API
• ML in Application
• ML and Stream Processing (Application and ML Serving connected through an Event Hub)
• ML as a Cloud Service
Fundamentals Big Data and AI Architecture

Fundamentals Big Data and AI Architecture

  • 1.
    BASEL | BERN| BRUGG | BUCHAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENEVA HAMBURG | COPENHAGEN | LAUSANNE | MANNHEIM | MUNICH | STUTTGART | VIENNA | ZURICH http://guidoschmutz.wordpress.com@gschmutz Grundlagen der Big-Data und KI-Architektur DOAG Data Centric Day, 25.9.2019 in Köln Guido Schmutz
  • 2.
    Guido Schmutz Working atTrivadis for more than 22 years Consultant, Trainer Software Architect for Java, Oracle, SOA and Big Data / Fast Data Oracle Groundbreaker Ambassador & Oracle ACE Director Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: [email protected] Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz 169th edition
  • 3.
  • 4.
    Data Warehouse Architecture EnterpriseData Warehouse Extract, Transform & Load (ETL) Bulk Source DB Extract File DB Consumer RDBMS BI Tools ETL Engine high latency
  • 5.
    Data Warehouse isan architecture Layered model, controlled ETL, single point of truth, query optimized data marts Tested, optimized, quality assured, „operated“ Standard-reporting, adHoc-reporting on DWH Base Perfect and fast for new requirements to known and prepared data and structures Data Warehouse ist not „agile“ No free definition and shaping of arbitrary analytical questions = Data Production Source: https://www.flickr.com/photos/128950981@N04/15452926858
  • 6.
    DWH Architecture –what about Streaming Data? Enterprise Data Warehouse Extract, Transform & Load (ETL) Bulk Source DB Extract File DB Consumer RDBMS BI Tools ETL Engine Event Source Location Weather IoT Data Mobile Apps Social Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency
  • 7.
    … to BigData / Data Lake
  • 8.
    Initial Idea ofa Data Lake … Adapted from Wikipedia.org “Reporting, visualization, analytics and machine learning” “Single store of all data in the enterprise” “Should put an end to data silos.” “Example: Distributed file system used in Apache Hadoop”
  • 9.
    Data Lake Data Lake isan Infrastructure Permanently new Data and Structures Schema on Read Really large amounts of Data Explorative Working (Research) Established Error-Culture New user groups ([Data] Scientists) Freedom of data-choice Freedom of source-choice Self-Service Data Labs adHoc- & One-Shot implementations Query + Advanced Analytics = Research & Development Source: https://www.flickr.com/photos/ian-arlett/34233379390 Data-Lab Interpretation
  • 10.
    Schema on Readinstead of (only) Schema on Write "Schema on Write" • Data quality managed by formalized ETL process • Data persisted in tabular, agreed and consistent form • Data integration happens in ETL • Structure must be decided before writing "Schema on Read" • Interpretation of data captured in code for each program accessing the data • Data quality dependent on code quality • Data integration happens in code EDWHETLData Source Consumer RDBMS BI Tools Data LakeData Source Consumer Storage Script Data Science Workbench Data Science Workbench Transform Transform
  • 11.
    Bulk Source Consumer • MachineLearning • Graph Algorithms • Natural Language Processing DB Extract File DB Big Data / Data Lake Architecture Data Science Workbench File Import / SQL Import “Native” Raw Hadoop ClusterdHadoop ClusterBig Data Platform Parallel Processing Storage Storage Raw Refined/ UsageOpt Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency high latency
  • 12.
    Bulk Source Consumer DB Extract File DB Big Data/ Data Lake Architecture BI Tools Data Science Workbench SQL File Import / SQL Import “Native” Raw Hadoop ClusterdHadoop ClusterBig Data Platform Parallel Processing Storage Storage Raw Refined/ UsageOpt Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency Query Engine
  • 13.
    Enterprise Data Warehouse SQL SQL Export DataLake & EDWH Architecture Bulk Source DB Extract File DB File Import / SQL Import Consumer BI Apps Data Science Workbench “Native” Raw RDBMS Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency Parallel Processing Query Engine
  • 14.
    Enterprise Data Warehouse SQL /Search Data Lake & EDWH Architecture Consumer BI Apps Data Science Workbench SQL “Native” Raw RDBMS Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt File Import / SQL Import Bulk Source DB Extract File DB SQL Export Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency Parallel Processing Query Engine
  • 15.
    Bulk Source Enterprise Data Warehouse SQL/ Search SQL Export File Import / SQL Import DB Extract File DB Data Lake & EDWH Architecture with Streaming Data SQL Event Source Location Weather IoT Data Mobile Apps Social Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt Consumer BI Apps Data Science Workbench Parallel Processing Query Engine “Native” Raw
  • 16.
    Bulk Source Enterprise Data Warehouse SQL/ Search SQL Export File Import / SQL Import DB Extract File DB Data Lake & EDWH Architecture with Streaming Data Consumer BI Apps Data Science Workbench SQL Event Source Location Weather IoT Data Mobile Apps Social Event Hub Event Hub Event Hub Event Stream B ulk D ata Im port Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt high latency Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency Parallel Processing Query Engine “Native” Raw
  • 17.
    Keep the datain motion … Data at Rest Data in Motion Store (Re)Act Visualize/ Analyze StoreAct Analyze 11101 01010 10110 11101 01010 10110 vs. Visualize
  • 18.
    Event Hub Event Hub Event Processing Architecture Event Hub “SQL”/ Search Event Stream Bulk Source Event Source Location DB Extract File Weather DB IoT Data Mobile Apps Social Low(est) latency, no history Consumer Enterprise App Dashboard Stream Processing Cluster Stream Processor Model / State Event Stream Service Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency Rules Engine • Complex Event Processing (CEP) • Machine Learning Model Execution (Inference) • State Transition Event Stream
  • 19.
    Event Processing &Data Lake ServiceEvent Stream Data Flow Event Stream Bulk Source Event Source Location DB Extract File Weather DB IoT Data Mobile Apps Social File Import / SQL Import Consumer BI Apps Data Science Workbench Enterprise App “SQL” / Search Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt DashboardStream Processing Cluster Stream Processor Model / State Event Hub Yes No Low High Yes No Elasticity End-to-End Latency Ad-Hoc (SQL) Queries Low HighStorage Costs Yes NoSupports Raw Data Yes NoSupports Streaming Data Low HighAccess Latency Parallel Processing Query Engine Rules Engine Event Stream Enterprise Data Warehouse SQL / Search SQL “Native” Raw RDBMS SQL Export
  • 20.
    Event Processing &Data Lake: Lambda Architecture Event Stream Bulk Data Flow Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt Stream Processing Cluster Stream Processor Model / State ML Inference Server Event Hub Consumer BI Apps Dashboard Serving API (Merger) Event Source Location Weather IoT Data Mobile Apps Social Event Stream Batch Result Speed Result { } Batch Layer Speed Layer Parallel Processing Query Engine
  • 21.
    Event Processing &Data Lake: Kappa Architecture Event Stream Stream Processing Cluster Stream Processor V1.0 State V1.0 Event Hub Event Source Location Weather IoT Data Mobile Apps Social Reply Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt Bulk Data Flow Consumer BI Apps Dashboard Serving Stream Processor V2.0 State V2.0 Result V1.0 Result V2.0 API (Switcher) { } Speed Layer Parallel Processing Query Engine
  • 22.
    Integrate existing systemswith CDC ServiceEvent Stream Event Stream Bulk Source Event Source Location DB Extract File Weather DB IoT Data Mobile Apps Social File Import / SQL Import Consumer BI Apps Data Science Workbench Enterprise App “SQL” / Search Hadoop ClusterdHadoop ClusterBig Data Platform Storage Storage Raw Refined/ UsageOpt DashboardStream Processing Cluster Stream Processor Model / State Event Hub Change Data Capture Parallel Processing Query Engine Rules Engine Bulk Data Flow Event Stream Enterprise Data Warehouse SQL / Search SQL “Native” Raw RDBMS SQL Export
  • 23.
    Applications participate Event-Driven
    [Architecture diagram. Sources: bulk sources (DB, file, DB extract) via file import / SQL import and via Change Data Capture; event sources (location, weather, IoT data, mobile apps, social) via event streams into an Event Hub. Platform: Stream Processing Platform (stream processor, rules engine, model / state); Microservice Platform (microservice, data, API) connected to the Event Hub; Big Data Platform on a Hadoop cluster (raw and refined / usage-optimized storage), also fed by a bulk data flow; Enterprise Data Warehouse (RDBMS, SQL export). Consumers: BI apps, data science workbench, enterprise apps, and services, via "SQL" / search.]
  • 24.
    Move Processing to Edge
    [Architecture diagram. Sources: bulk sources (DB, file, DB extract) via file import / SQL import and via Change Data Capture; event sources (location, weather, IoT data, mobile apps, social) via an Edge Node (rules, event hub, storage) and event streams into the central Event Hub. Platform: Stream Processing Cluster (stream processor, rules engine, model / state); Microservice Cluster (microservice, data, API); Big Data Platform on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine), also fed by a bulk data flow; Enterprise Data Warehouse (RDBMS, SQL export). Consumers: BI apps, data science workbench, enterprise apps, and services, via "SQL" / search.]
  • 25.
    But be careful …
    Data Swamp = No Data Architecture
    • Anyone does what they want
    • No (central?) documentation
    • No unique data structure
    • No unique transformations
    • No unique KPI definitions
    • No quality assurance
    • No data flow analysis
    • Silo thinking
    • Data availability? Security? Auditability?
    Source: https://www.flickr.com/photos/82134796@N03/10603438015
  • 26.
    Data Lake Zones & Data Catalog
  • 27.
    Data Lake Zones
    [Diagram. Between Data Source and Data Access, the data storage is divided into zones: Landing Zone, Raw Zone, Refined Zone, Usage-Optimized Zone, Sandbox Zone, and Archive Zone. Storage options shown across the zones: file system, object store, event hub / store, RDBMS, NoSQL, in-memory grid, disk service, and tape.]
  • 28.
    Data Catalog
    [Architecture diagram. Sources: bulk sources (DB, file, DB extract) via file import / SQL import and via Change Data Capture; event sources (location, weather, IoT data, mobile apps, social) via an Edge Node (rules, event hub, storage) and event streams into the Event Hub. Platform: Stream Processing Cluster (stream processor, rules engine, model / state); Microservice Cluster (microservice, data, API); Big Data Platform on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine), also fed by a bulk data flow; Enterprise Data Warehouse (RDBMS, SQL export). Governance: Data Catalog. Consumers: BI apps, data science workbench, enterprise apps, and services, via "SQL" / search.]
  • 29.
    (Machine Learning Augmented) Data Catalog
    A data catalog creates and maintains an inventory of data assets through discovery, description and organization of distributed datasets. It provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value.
    Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata.
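The inventory idea in the definition above (describe distributed datasets, then let consumers find them) can be sketched in a few lines. This is a toy illustration, not the API of any catalog product: `CatalogEntry`, `DataCatalog`, the storage locations, and the tags are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Metadata describing one dataset; the data itself stays where it is."""
    name: str
    location: str
    description: str
    tags: Set[str] = field(default_factory=set)


class DataCatalog:
    """Minimal inventory of data assets with keyword and tag search."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        """Return the names of datasets whose description or tags match."""
        term = term.lower()
        hits = [
            entry.name
            for entry in self._entries.values()
            if term in entry.description.lower()
            or term in {tag.lower() for tag in entry.tags}
        ]
        return sorted(hits)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="weather_raw",
    location="s3://example-bucket/raw/weather/",  # hypothetical location
    description="Hourly weather observations as delivered by the provider",
    tags={"raw", "weather"},
))
catalog.register(CatalogEntry(
    name="sales_mart",
    location="dwh.sales.mart",  # hypothetical location
    description="Daily sales figures enriched with weather conditions",
    tags={"refined", "sales"},
))

print(catalog.search("weather"))
print(catalog.search("refined"))
```

A machine-learning-augmented catalog would fill in descriptions and tags automatically instead of relying on manual registration as in this sketch.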
  • 30.
    Data Catalog Features
    • Ranking on Utilization Rate Catalog Objects
    • Maintain Multiple Versions of Catalog Object
    • Search & Navigation for Content
    • Content Check in/out
    • Certify Official Versions of Metadata
    • Analyze and Audit Decision Processes
    • Integrate Data Lineage
    • Levels of Access to Catalog Objects
    • Impact Analysis
    • API for Search / Catalog / Mgmt Functions
    • Track Usage of Catalog Objects
    • Integration with IAM
    • Automated Crawling of Source System
    • Catalog Cloud-Deployed Sources
    • Catalog Hadoop-based Sources
    • Catalog BI & Data Visualization Tools
    • Catalog Databases
    • Integration with self-service Tools
    • Classify Catalog Objects by Business Glossary
    • Supports user-defined Tagging
    • Integrates with Data Profiling
    • Supports Data Sampling
    • Quality Metrics
    • Catalog Machine/IoT Data
    • Supports Discussion Threads on Catalog Objects
    • Annotate & Comment on Catalog Objects
    • Catalog Unstructured Data with NLP functionality
    • Semantic Search
    • Classify Catalog Objects by Domain
    • Publish/Subscribe on Changes of Catalog Objects
    • AI/ML based Recommendation
    • Detect Similar/Duplicate or Related Data
    • Easy to use, intuitive GUI
    • Supports Manual Curation
    • Supports Automated (ML based) tagging
    • Supports ongoing discovery of new data sets
    • Natural Language Search
    • Facetted based Search
    • Catalog Object Value Estimation
    • Incentive-based Participation Encouragement
    • Assign Data Steward
  • 31.
    Traditional vs. Cloud Native Big Data Platforms
  • 32.
    Traditional vs. Cloud Native Big Data
    [Diagram. Data Local Compute (traditional): a master node and workers #1 to #3, each worker combining disk and processing, connected by a network. Separate Compute and Storage (cloud native): a storage tier of disks and compute nodes #1 to #3 for processing, with a master node, connected by networks.]
    Separation of compute and storage – the fundamental difference
    • store data in Object Storage instead of HDFS
    • bring up Compute nodes only for data processing
    • multiple workloads on separate clusters can access same data
  • 33.
    Traditional vs. Cloud Native Big Data
    Criterion | Traditional | Cloud Native
    Data Local Compute | Yes | No
    Network Bandwidth Req. | Low | High
    Scalable, shared-usage of Data | No (only within cluster) | Yes
    Persistence | HDFS | Object Storage
    Data Lifecycle | Tiered Storage | Built-in (cloud)
    Compute | Hadoop, Spark | Hadoop, Spark
    Serverless Processing | no | yes
    Infrastructure | Hadoop Cluster | Cloud, Container Orchestration
    Entry Threshold | high | low
  • 34.
  • 35.
    Modern Data Platform
    [Architecture diagram. Sources: bulk sources (DB, file, DB extract) via file import / SQL import and via Change Data Capture; event sources (location, weather, IoT data, mobile apps, social) via an Edge Node (rules, event hub, storage) and event streams into the Event Hub. Platform: Stream Processing Cluster (stream processor, rules engine, model / state); Microservice Cluster (microservice, data, API); Big Data Platform on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine), also fed by a bulk data flow; Enterprise Data Warehouse (RDBMS, SQL export). Governance: Data Catalog. Consumers: BI apps, data science workbench, enterprise apps, and services, via "SQL" / search.]
  • 36.
    On-Premises – Traditional
    [Architecture diagram: the modern data platform (sources, Event Hub, Stream Processing Cluster, Microservice Cluster, Big Data Platform, Enterprise Data Warehouse, governance with Data Catalog, consumers) annotated with these technologies: Hadoop YARN, Pig, HDFS, Kafka, Confluent, Hive, Kafka Streams, Spring Boot, NoSQL, RDBMS, Atlas, Debezium, Streamsets, Flume, Sqoop, Impala, MapReduce, Spark, SparkSQL, Spark Streaming, Zeppelin, Jupyter.]
  • 37.
    Oracle Cloud
    [Architecture diagram: the modern data platform (sources, Event Hub, Stream Processing Cluster, Microservice Cluster, Big Data Platform, Enterprise Data Warehouse, governance with Data Catalog, consumers) annotated with these technologies: Kafka, Confluent, Streamsets, Nifi, Object Storage, Archive Storage, Data Science, Big Data Cloud Service, Machine Learning, Streaming, Functions, Visual Builder, Java, NoSQL DB, Data Catalog, Autonomous Transaction Proc, Autonomous DWH, Big Data SQL Cloud Service, GoldenGate Cloud Service, Kafka Streams / KSQL, SOA Cloud Service, Container Engine for Kubernetes, Zeppelin, Jupyter, Transfer Service, Container Pipelines, Container Registry.]
  • 38.
    AWS Cloud
    [Architecture diagram: the modern data platform (sources, Event Hub, Stream Processing Cluster, Microservice Cluster, Big Data Platform, Enterprise Data Warehouse, governance with Data Catalog, consumers) annotated with these technologies: Kafka, Confluent, Streamsets, Nifi, Zeppelin, Jupyter, S3, S3 Glacier Deep Archive, Dynamo DB, Redshift, Redshift Spectrum, Spark on EMR, Glue, Snowball, Data Sync, Athena, Presto on EMR, SageMaker, Deep Learning Containers, Spark Streaming on EMR, Databricks on AWS, Kinesis Data Analytics, Lambda, Batch, Spring Boot, QuickSight, Zeppelin on EMR, RStudio on EMR, API Gateway, Managed Streaming for Kafka, Kinesis Data Firehose, Confluent Cloud.]
  • 39.
    On-Premises – Cloud Native
    [Architecture diagram: the modern data platform (sources, Event Hub, Stream Processing Cluster, Microservice Cluster, Big Data Platform, Enterprise Data Warehouse, governance with Data Catalog, consumers) annotated with these technologies: Istio, Kubernetes, Docker, Spark, MinIO S3, Kafka, Confluent, NoSQL, Presto, Kafka Streams, Spring Boot, RDBMS, Atlas, Debezium, Streamsets, Nifi, SparkSQL, Spark Streaming, Zeppelin, Jupyter.]
  • 40.
    Physical Data Lake vs. Virtual Data Lake
  • 41.
    Physical Data Lake
    [Diagram. Data sources 1 to 4 (file, RDBMS, NoSQL, enterprise app); a Data Lake on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine); consumers (BI apps). Governance: data catalog, data lineage, encryption, policy management. Labeled flows: data ingest, query, discovery, catalog.]
  • 42.
    Virtual Data Lake
    [Diagram. Data sources 1 to 4 (file, RDBMS, NoSQL, enterprise app); a data virtualization layer with a query engine; consumers (BI apps). Governance: logical data catalog, data lineage, encryption, policy management. Labeled flows: query, discovery, catalog.]
  • 43.
    Physical Data Lake as part of Virtual Data Lake
    [Diagram. Data sources 1 to 4 (file, RDBMS, NoSQL, enterprise app); a Data Lake on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine); a data virtualization layer with a query engine; consumers (BI apps). Governance: logical data catalog, data lineage, encryption, policy management. Labeled flows: data ingest, query, discovery, catalog.]
  • 44.
    Multiple Data Lakes form a Virtual Data Lake
    [Diagram. Data Lake 1 and Data Lake 2, each on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine); data sources 1 and 2 (file, RDBMS); a data virtualization layer with a query engine; consumers (BI apps). Governance: logical data catalog, data lineage, encryption, policy management. Labeled flows: query, discovery, catalog.]
  • 45.
    AI & Machine Learning
  • 46.
    AI & Machine Learning: Training vs. Inference
    [Diagram, © 2019 Gartner, Inc., ID: 354956. Logical flow of data: raw data from the logical data warehouse becomes a training dataset with labeled examples ("cat", "dog"). Training (learning a new capability from existing data) runs in a deep-learning framework, on-premises or cloud-hosted, and yields a trained model. Inference (applying this capability to new data) runs in an app or service featuring the capability, on an edge device, on-premises or cloud-hosted: new data "?" is classified as "cat".]
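The split between training and inference can be made concrete with a deliberately tiny example. Instead of the deep-learning framework shown in the diagram, this sketch uses a nearest-centroid classifier on two made-up numeric features; the data, the feature meaning, and the function names `train` and `infer` are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Tuple[float, float]
Model = Dict[str, Vector]


def train(dataset: List[Tuple[Vector, str]]) -> Model:
    """Training: learn one centroid per label from existing, labeled data."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in dataset:
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def infer(model: Model, sample: Vector) -> str:
    """Inference: apply the trained model to new, unlabeled data."""
    def squared_distance(centroid: Vector) -> float:
        return (centroid[0] - sample[0]) ** 2 + (centroid[1] - sample[1]) ** 2

    return min(model, key=lambda label: squared_distance(model[label]))


# Made-up training dataset with labels, as in the "cat" / "dog" example.
training_dataset = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.8), "cat"),
    ((4.0, 4.2), "dog"),
    ((3.8, 4.0), "dog"),
]

model = train(training_dataset)          # done once, offline
prediction = infer(model, (1.1, 1.0))    # done per request, on new data
print(prediction)
```

The trained model is a small artifact that can be shipped to wherever inference has to run, which is why training and inference can live on different infrastructure.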
  • 47.
    Modern Data Platform
    [Architecture diagram. Sources: bulk sources (DB, file, DB extract) via file import / SQL import and via Change Data Capture; event sources (location, weather, IoT data, mobile apps, social) via an Edge Node (rules, event hub, storage) and event streams into the Event Hub. Platform: Stream Processing Cluster (stream processor, model / state, ML inference server); Microservice Cluster (microservice, data, API, ML inference server); Big Data Platform on a Hadoop cluster (raw and refined / usage-optimized storage, parallel processing, query engine), also fed by a bulk data flow; Enterprise Data Warehouse (RDBMS, SQL export). Governance: Data Catalog. Consumers: BI apps, data science workbench, enterprise apps, and services, via "SQL" / search.]
  • 48.
    AI & Machine Learning: Model Training & Deployment
  • 49.
    Integration of Machine Learning Model in Application
    [Diagram. Four integration patterns, each combining an application with a trained ML model and ML serving: ML in Application; ML as an API; ML and Stream Processing (applications connected through an event hub); ML as a Cloud Service. The label "Backing Service" also appears in the diagram.]
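Two of the patterns named on this slide, ML in Application and ML as an API, can be contrasted in a short sketch. Everything here is hypothetical: `TrainedModel` is a stand-in threshold rule rather than a real model, and the "API" is a direct function call with JSON payloads where a real deployment would use a network call.

```python
import json
from typing import Callable


class TrainedModel:
    """Stand-in for a trained ML model (hypothetical threshold rule)."""

    def predict(self, amount: float) -> str:
        return "suspicious" if amount > 1000 else "ok"


class EmbeddedApp:
    """ML in Application: the application holds the model and calls it in-process."""

    def __init__(self, model: TrainedModel) -> None:
        self.model = model

    def handle(self, amount: float) -> str:
        return self.model.predict(amount)


class MLServing:
    """ML as an API: the model sits behind a serving component."""

    def __init__(self, model: TrainedModel) -> None:
        self.model = model

    def handle_request(self, body: str) -> str:
        payload = json.loads(body)
        return json.dumps({"label": self.model.predict(payload["amount"])})


class ApiClientApp:
    """Application that only knows the request/response contract, not the model."""

    def __init__(self, call_serving: Callable[[str], str]) -> None:
        self.call_serving = call_serving

    def handle(self, amount: float) -> str:
        response = self.call_serving(json.dumps({"amount": amount}))
        return json.loads(response)["label"]


model = TrainedModel()
embedded = EmbeddedApp(model)
serving = MLServing(model)
client = ApiClientApp(serving.handle_request)

print(embedded.handle(2500.0))
print(client.handle(2500.0))
```

Embedding avoids a remote call per prediction, while the API variant lets the model be updated and scaled independently of the applications that use it.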