1
Building a fully-automated Fast Data
Platform
Bernd Zuther, codecentric AG
2 . 1
Outline
Fast Data
SMACK
DC/OS
Extend DC/OS cluster
3 . 1
In the beginning of Big Data there
was
HADOOP
3 . 2
BATCH
SEEMS TO BE GOOD
map and reduce were everywhere
3 . 3
But now Business does not wait.
It always demands more...
EVER FASTER
3 . 4
Updating machine learning models as new information
arrives
Detecting anomalies, faults, performance problems, etc.
and taking timely action
Aggregating and processing data on arrival for
downstream storage and analytics
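As an illustration of detecting anomalies on arrival, the following minimal Python sketch keeps a running mean and standard deviation (Welford's online algorithm) and flags values that deviate strongly from what has been seen so far. The threshold, the warm-up length and the sample readings are assumptions made for this example, not part of the talk.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance without storing the stream."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        # Sample standard deviation; reported as 0.0 for fewer than two values.
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Yield values more than `threshold` standard deviations from the running mean."""
    stats = RunningStats()
    for value in stream:
        if stats.count >= warmup and stats.stddev > 0:
            if abs(value - stats.mean) > threshold * stats.stddev:
                yield value
        # Every value, anomalous or not, updates the statistics.
        stats.update(value)


# Hypothetical sensor readings with one obvious outlier.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 42.0, 10.1]
anomalies = list(detect_anomalies(readings))
```

Each value is inspected once and then discarded, so memory use stays constant however long the stream runs.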
3 . 5
λ-Architecture
Batch Layer: Master Dataset
Serving Layer: batch view, batch view, ...
Speed Layer: realtime view, realtime view
New Data feeds the Batch Layer and the Speed Layer
Queries read the batch views and the realtime views
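The interplay of the three layers can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the class name `LambdaStore`, the counting workload and the bus identifiers are made up for illustration, and a real system would keep the master dataset and the views in distributed storage.

```python
from collections import Counter


class LambdaStore:
    def __init__(self):
        self.master_dataset = []        # append-only log of all events (batch layer)
        self.batch_view = Counter()     # recomputed from the master dataset
        self.realtime_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """New data goes to both the batch layer and the speed layer."""
        self.master_dataset.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch and reset the speed layer."""
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view = Counter()

    def query(self, key):
        """Answer a query by merging the batch view and the realtime view."""
        return self.batch_view[key] + self.realtime_view[key]


store = LambdaStore()
store.ingest("bus-1")
store.ingest("bus-1")
store.run_batch()       # batch view now covers the first two events
store.ingest("bus-1")   # only visible through the realtime view
store.ingest("bus-2")
print(store.query("bus-1"))  # prints 3
```

The batch run is deliberately a full recomputation: errors in the incremental path are wiped out the next time the batch view is rebuilt.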
3 . 6
Fast Data
Fast Data covers a range of new
systems and approaches, which
balance various tradeoffs to deliver
timely, cost-efficient data processing,
as well as higher developer
productivity.
3 . 7
Requirements for a
Fast Data Architecture
Reliable data ingestion
Flexible storage and query options
Sophisticated analytics tools
4 . 1
SMACK
Spark
Mesos
Akka
Cassandra
Kafka
Spark
SWISS ARMY KNIFE FOR DATA PROCESSING
ETL Jobs
μ-Batching on Streams
SQL and Joins on non-RDBMS
Graph Operations on non-Graphs
Super Fast Map/Reduce
4 . 2
4 . 3
How does it fit into a λ-Architecture?
Spark operations can be run unaltered in either batch or
stream mode
Serving layer uses a Resilient Distributed Dataset (RDD)
Speed layer can use a DStream
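The idea of running one operation unaltered in batch and stream mode can be shown without a cluster. The sketch below is a plain Python stand-in, not the Spark API: a list plays the role of the batch dataset, small consecutive slices play the role of micro-batches, and the record layout and function names are assumptions for illustration.

```python
from collections import Counter
from itertools import islice


def count_by_route(records):
    """The shared transformation: count records per route."""
    return Counter(route for route, _ in records)


def micro_batches(stream, size):
    """Cut an iterator into consecutive lists of at most `size` records."""
    iterator = iter(stream)
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical (route, delay) records.
records = [("r1", 5), ("r2", 7), ("r1", 3), ("r3", 1), ("r1", 9)]

# Batch mode: apply the transformation to the whole dataset at once.
batch_result = count_by_route(records)

# Stream mode: apply the same transformation to each micro-batch and merge.
stream_result = Counter()
for batch in micro_batches(records, size=2):
    stream_result += count_by_route(batch)
```

Both paths call the same `count_by_route` function, which is the property the slide attributes to Spark.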
Mesos
DISTRIBUTED KERNEL FOR THE CLOUD
Links machines into one logical instance
Static deployment of Mesos
Dynamic deployment of the workload
Good integration with Hadoop, Kafka, Spark, and Akka
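To make "dynamic deployment of the workload" concrete, here is a toy first-fit placement in Python. It is a deliberately simplified sketch and not how Mesos itself allocates resources (Mesos hands resource offers to frameworks, which then decide what to launch); the agent sizes and task names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    cpus: float
    mem: int  # free memory in MiB
    tasks: list = field(default_factory=list)


def place(task_name, cpus, mem, agents):
    """Place a task on the first agent with enough free resources (first-fit)."""
    for agent in agents:
        if agent.cpus >= cpus and agent.mem >= mem:
            agent.cpus -= cpus
            agent.mem -= mem
            agent.tasks.append(task_name)
            return agent.name
    return None  # no agent can currently run the task


# The pool of machines is static; what runs where is decided at launch time.
agents = [Agent("agent-1", cpus=2.0, mem=4096), Agent("agent-2", cpus=4.0, mem=8192)]
placements = [
    place("ingest", 1.0, 1024, agents),
    place("spark-driver", 2.0, 4096, agents),
    place("dashboard", 1.0, 2048, agents),
]
```

The caller never names a machine; it only states what the task needs, which is the sense in which the cluster behaves like one logical instance.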
4 . 4
Akka
FRAMEWORK FOR REACTIVE APPLICATIONS
Highly performant - 50 million messages per second per
machine
Simple concurrency via asynchronous processing
Elastic, resilient and without single point of failure
Used for applications that can process or query data
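The "simple concurrency via asynchronous processing" point rests on the actor idea: state is owned by one actor that handles one message at a time from its mailbox. The following sketch imitates that pattern with Python's asyncio rather than Akka; the class name and the summing workload are assumptions for illustration only.

```python
import asyncio


class CounterActor:
    """Handles one message at a time from its mailbox, so its state needs no locks."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.total = 0

    async def run(self):
        while True:
            message = await self.mailbox.get()
            if message is None:  # sentinel: stop the actor
                break
            self.total += message

    def tell(self, message):
        """Fire-and-forget send; the caller does not wait for processing."""
        self.mailbox.put_nowait(message)


async def main():
    actor = CounterActor()
    worker = asyncio.create_task(actor.run())
    for value in range(1, 101):
        actor.tell(value)
    actor.tell(None)
    await worker
    return actor.total


total = asyncio.run(main())
```

Senders only enqueue messages, so they never block on the actor's work, and the actor's `total` is only ever touched from its own loop.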
4 . 5
Cassandra
PERFORMANT AND ALWAYS-UP NOSQL DATABASE
Linear scaling - approx. 10,000 requests per second per
machine
No downtime
Comfort of a column index with append-only performance
Data safety across multiple data centers
Strong in denormalized models
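A denormalized model means one table per query pattern, with the same event written to each of them. The Python sketch below mimics this with dictionaries keyed by a partition key; it is not Cassandra client code, and the table names, columns and sample values are hypothetical.

```python
from collections import defaultdict

# One "table" per query pattern, keyed by the partition key the query filters on.
positions_by_vehicle = defaultdict(list)
positions_by_route = defaultdict(list)


def record_position(vehicle_id, route_id, timestamp, lat, lon):
    """Write the same event to every table that serves a query (denormalization)."""
    row = {"vehicle_id": vehicle_id, "route_id": route_id,
           "timestamp": timestamp, "lat": lat, "lon": lon}
    positions_by_vehicle[vehicle_id].append(row)
    positions_by_route[route_id].append(row)


record_position("bus-7", "route-720", 1, 34.05, -118.24)
record_position("bus-9", "route-720", 2, 34.06, -118.25)
record_position("bus-7", "route-720", 3, 34.07, -118.26)

# Each query touches exactly one partition of one table; no joins are needed.
latest_for_bus_7 = positions_by_vehicle["bus-7"][-1]
vehicles_on_route = {row["vehicle_id"] for row in positions_by_route["route-720"]}
```

Writes are duplicated, but every write is an append, which is the trade the slide describes: cheap appends in exchange for reads that need no joins.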
4 . 6
Kafka
MESSAGING SYSTEM FOR BIG DATA APPLICATIONS
Fast - delivers hundreds of megabytes per second to
thousands of clients
Scales - partitions data to manageable volumes
Managing backpressure
Distributed - from the ground up
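Partitioning by key is what lets a topic scale while keeping records for one key in order. The sketch below shows the principle in Python. It uses CRC32 as a stand-in hash (Kafka's Java client hashes keys with murmur2 by default), and the keys and partition count are assumptions for the example.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition; equal keys always land on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Distribute (key, value) records over partitions, preserving per-key order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical vehicle updates keyed by vehicle id.
records = [("bus-1", 1), ("bus-2", 2), ("bus-1", 3), ("bus-3", 4), ("bus-1", 5)]
partitions = partition_records(records, num_partitions=3)
```

Because all records for `bus-1` share a partition, a consumer reading that partition sees them in the order they were produced.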
4 . 7
4 . 8
Big Ball of Mud
Source 1
Source 2
Log/Files
Source
Akka Ingest 1
Akka Ingest 2
Spark Ingest 1
4 . 9
Kafka as a Multiplexer-Demultiplexer
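The multiplexer-demultiplexer role can be modelled as one append-only log that many producers write to and that each consumer reads at its own pace. This is a minimal in-memory sketch, not Kafka client code; the `Topic` class and the producer and consumer names are assumptions chosen to echo the ingest components named earlier.

```python
class Topic:
    """An append-only log that many producers write to and many consumers read from."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer name -> index of the next record to read

    def publish(self, producer, payload):
        """Multiplex: records from all producers end up in one ordered log."""
        self.log.append((producer, payload))

    def poll(self, consumer):
        """Demultiplex: return the records this consumer has not seen yet."""
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.log)
        return self.log[start:]


topic = Topic()
topic.publish("source-1", "a")
topic.publish("source-2", "b")
first_for_akka = topic.poll("akka-ingest")    # sees both records
topic.publish("log-files", "c")
second_for_akka = topic.poll("akka-ingest")   # sees only the new record
all_for_spark = topic.poll("spark-ingest")    # independent offset: sees all three
```

Producers and consumers no longer need to know about each other, which removes the point-to-point wiring of the "Big Ball of Mud" picture.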
4 . 10
Emerging Architecture
4 . 11
Zeppelin
4 . 12
Benefits and downsides of Zeppelin
+ No Jar-Wars
+ Easy analytics
- New technology
4 . 13
Real World Example
4 . 14
Traditional Approach
4 . 15
DC/OS Approach
5 . 1
DC/OS
5 . 2
DC/OS Architecture
DC/OS Master (1..3): ZooKeeper, Mesos Master Process, Mesos DNS, Marathon, Admin Router
DC/OS Private Agent (0..n): Mesos Agent Process, Mesos Containerizer, Docker Containerizer
DC/OS Public Agent (0..n): Mesos Agent Process, Mesos Containerizer, Docker Containerizer
Public Internet
User
5 . 3
DC/OS Network Security
Admin
Public Internet
Secure by port
number or IP address
Master Nodes
Public
Public Agents
Private
Private Agents
5 . 4
DC/OS Installation
5 . 5
DC/OS Universe
5 . 6
Command Line Interface
$ dcos
Command line utility for the Mesosphere Datacenter Operating
System (DC/OS). The Mesosphere DC/OS is a distributed operating
system built around Apache Mesos. This utility provides tools
for easy management of a DC/OS installation.
Available DC/OS commands:
    config      Get and set DC/OS CLI configuration properties
    help        Display command line usage information
    marathon    Deploy and manage applications on the DC/OS
    node        Manage DC/OS nodes
    package     Install and manage DC/OS packages
    service     Manage DC/OS services
    task        Manage DC/OS tasks
Get detailed command description with 'dcos <command> --help'.
5 . 7
SMACK Installation - Databases/Tools
dcos package install --yes cassandra
dcos package install --yes kafka
dcos package install --yes spark
dcos kafka topic add METRO-Vehicles
5 . 8
SMACK Installation - Custom Application
cat > /opt/smack/conf/bus-demo-ingest.json << EOF
{
  "id": "/ingest",
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "codecentric/bus-demo-ingest",
      "network": "HOST",
      "privileged": false,
      "parameters": [],
      "forcePullImage": true
    }
  },
  "env": {
    "CASSANDRA_HOST": "$CASSANDRA_HOST",
    "CASSANDRA_PORT": "$CASSANDRA_PORT",
5 . 9
Service Discovery
DNS-based
+ easy to integrate
+ SRV records
- no health checks
- TTL
Proxy-based
+ no port conflicts
+ fast failover
- no UDP
- management of VIPs (Minuteman) or service ports (Marathon-lb)
Application-aware
+ developer fully in control and full-featured
- implementation effort
- requires distributed state management (ZK, etcd or Consul)
5 . 10
A Records
An A record associates a hostname with an IP address
bz@cc ~/$ dig app.marathon.mesos
; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> app.marathon.mesos
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9336
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;app.marathon.mesos. IN A
;; ANSWER SECTION:
app.marathon.mesos. 60 IN A 10.0.3.201
app.marathon.mesos. 60 IN A 10.0.3.199
;; Query time: 2 msec
;; SERVER: 10.0.5.98#53(10.0.5.98)
5 . 11
SRV Records
An SRV record associates a service name with a hostname
and a port
bz@cc ~/$ dig _app._tcp.marathon.mesos SRV
; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> _app._tcp.marathon.mesos SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31708
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 2
;; QUESTION SECTION:
;_app._tcp.marathon.mesos. IN SRV
;; ANSWER SECTION:
_app._tcp.marathon.mesos. 60 IN SRV 0 0 10148 app-qtugm-s5.marathon.
_app._tcp.marathon.mesos. 60 IN SRV 0 0 13289 app-t49o6-s2.marathon.
;; ADDITIONAL SECTION:
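An application that discovers a service through SRV records needs the port and target from each answer line. The following Python sketch parses answer lines in the format shown above; the function name is hypothetical, and the sample reuses the slide's output, including the target names that are cut off on the slide.

```python
def parse_srv_answers(dig_output):
    """Extract (priority, weight, port, target) from SRV answer lines of dig output."""
    records = []
    for line in dig_output.splitlines():
        if line.startswith(";"):
            continue  # comments, headers and the echoed question
        fields = line.split()
        # Answer lines look like: name ttl IN SRV priority weight port target
        if len(fields) == 8 and fields[2] == "IN" and fields[3] == "SRV":
            priority, weight, port = (int(f) for f in fields[4:7])
            records.append((priority, weight, port, fields[7]))
    return records


# Sample taken from the slide; the targets are truncated there and kept as-is.
sample = """\
;; QUESTION SECTION:
;_app._tcp.marathon.mesos. IN SRV
;; ANSWER SECTION:
_app._tcp.marathon.mesos. 60 IN SRV 0 0 10148 app-qtugm-s5.marathon.
_app._tcp.marathon.mesos. 60 IN SRV 0 0 13289 app-t49o6-s2.marathon.
"""
endpoints = parse_srv_answers(sample)
```

In practice a DNS client library would return these fields directly; parsing the text output is only meant to show which parts of the record carry the port and the host.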
5 . 12
DNS Pattern
Service                               | CT-IP Avail | DI Avail | Target Host                   | Target Port | A (Target Resolution)
{task}.{proto}.framework.domain       | no          | no       | {task}.framework.slave.domain | host-port   | slave-ip
{task}.{proto}.framework.domain       | yes         | no       | {task}.framework.slave.domain | host-port   | slave-ip
{task}.{proto}.framework.domain       | no          | yes      | {task}.framework.domain       | di-port     | slave-ip
{task}.{proto}.framework.domain       | yes         | yes      | {task}.framework.domain       | di-port     | container-ip
{task}.{proto}.framework.slave.domain | n/a         | n/a      | {task}.framework.slave.domain | host-port   | slave-ip
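The rows for {task}.{proto}.framework.domain reduce to two decisions, which the Python sketch below spells out. It assumes that "CT-IP Avail" means a container IP is available and "DI Avail" means discovery info is available; the function name and the example arguments are hypothetical, and the returned port and address are the symbolic labels from the table, not real values.

```python
def resolve_srv_target(task, framework, domain, ct_ip_avail, di_avail):
    """Return (target_host, target_port, a_record) for {task}.{proto}.framework.domain."""
    if di_avail:
        # Discovery info present: the target stays in the framework domain.
        host = f"{task}.{framework}.{domain}"
        port = "di-port"
        address = "container-ip" if ct_ip_avail else "slave-ip"
    else:
        # No discovery info: fall back to the slave domain and the host port.
        host = f"{task}.{framework}.slave.{domain}"
        port = "host-port"
        address = "slave-ip"
    return host, port, address


result = resolve_srv_target("app", "marathon", "mesos", ct_ip_avail=True, di_avail=True)
```

Note that the container IP only matters when discovery info is present; without it the record always resolves to the agent's address and host port.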
Benefits and downsides of DC/OS
+ Layer that abstracts the hardware
+ Applications run in a sandbox, with or without Docker
+ Built-in service discovery
- Effort needed to learn the technology
- Monitoring plays a bigger role
5 . 13
6 . 1
Extend our DC/OS cluster
6 . 2
Add new Network Security Zone
Master
Public Internet
Master Nodes
Public
Public Agents
Private
Private Agents
Admin
VPN Cli
6 . 3
Add ELK
Agent Nodes, each running Filebeat -> Logstash -> Elasticsearch -> Kibana
6 . 4
Download Filebeat
- "content": |
    [Unit]
    Description=ELK: Download Filebeat
    After=network-online.target
    Wants=network-online.target
    ConditionPathExists=!/opt/filebeat/filebeat
    [Service]
    Type=oneshot
    StandardOutput=journal+console
    StandardError=journal+console
    ExecStartPre=/usr/bin/curl --fail --retry 20 --continue-at - --location
    ExecStartPre=/usr/bin/mkdir -p /opt/filebeat /tmp/filebeat /etc/filebea
    ExecStartPre=/usr/bin/tar -axf /tmp/filebeat.tar.xz -C /tmp/filebeat --
    ExecStart=-/bin/mv /tmp/filebeat/filebeat /opt/filebeat/filebeat
    ExecStartPost=-/usr/bin/rm -rf /tmp/filebeat.tar.xz /tmp/filebeat
  "name": |-
    filebeat-download.service
6 . 5
Start Filebeat
- "command": |-
    start
  "content": |
    [Unit]
    Description=ELK: Filebeat collects log files and sends them to Logstash
    Requires=filebeat-download.service
    After=filebeat-download.service
    [Service]
    Type=simple
    StandardOutput=journal+console
    StandardError=journal+console
    ExecStart=/opt/filebeat/filebeat -e -c /etc/filebeat/filebeat.yml -
  "enable": !!bool |-
    true
  "name": |-
    filebeat.service
6 . 6
Working with CloudFormation
+ Easy integration in a build pipeline
- Hard to maintain
- Hard to extend
- Not cloud-agnostic (supports only AWS)
6 . 7
Terraform
BUILD, COMBINE, AND LAUNCH INFRASTRUCTURE
Infrastructure as code
Combine Multiple Providers (AWS, Azure, etc.)
Evolve your Infrastructure
6 . 8
Terraform
resource "aws_launch_configuration" "public_slave" {
  security_groups             = ["${aws_security_group.public_slave.id}"]
  image_id                    = "${lookup(var.coreos_amis, var.aws_region)}"
  instance_type               = "${var.public_slave_instance_type}"
  key_name                    = "${aws_key_pair.dcos.key_name}"
  user_data                   = "${template_file.public_slave_user_data.rendered}"
  associate_public_ip_address = true
  lifecycle {
    create_before_destroy = false
  }
}
6 . 9
Benefits of Terraform
+ Easy integration in a build pipeline
+ Easier to maintain
+ Easier to extend
+ Cloud-agnostic (AWS, Azure, etc.)
- It takes some time until new cloud resources are supported
6 . 10
Create infrastructure with Jenkins
6 . 11
Terraform - DC/OS Source & Real World
Example
https://github.com/ANierbeck/BusFloatingData
https://github.com/zutherb/terraform-dcos/
7 . 1
Summary
SMACK helps you to build a near realtime Fast Data
platform
Kafka & Akka can be used for reliable data ingestion
Cassandra provides flexible storage and query options
Mesos enables fault-tolerant and elastic distributed
systems
Zeppelin is a sophisticated analytics tool
Terraform makes it easy to integrate our infrastructure
with a build pipeline
7 . 2
Lessons Learned
Cassandra is good for known problems
When dealing with unknown problems it is better to
store raw data with Apache Parquet
Automate everything
Bleeding edge sometimes sucks (Zeppelin, S3a, Spark,
etc.)
7 . 3
Is your infrastructure a pet?
7 . 4
Treat your infrastructure like cattle
7 . 5
If you want your infrastructure like cattle
KEEP CALM
AND
AUTOMATE EVERYTHING
7 . 6
Feedback
@Bernd_Z
http://github.com/zutherb
http://zutherb.github.io/Building-a-full-automated-Fast-Data-Platform/slides/
7 . 7
The End
Copyright 2016
