Sensu and Sensibility
Tomas Doran
@bobtfish
2014-09-23
Cycle of failure and disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility into production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility
“Normality”
This is dysfunctional
- http://gunshowcomic.com/648
Sensibility
“51% viewed their ERP implementation as unsuccessful”
- The Robbins-Gioia Survey (2001)
“40% of the projects failed to achieve their business case within one year of going live”
- The Conference Board Survey (2001)
McKinsey & Company in conjunction with the University of Oxford (2012)
• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”
Failure is an option
- blog.parasoft.com/single-greatest-barrier-with-sw-delivery
Why Sensu? 
• Designed to be pluggable / extensible 
• Arbitrary check metadata 
• Simple model 
• Components do exactly one thing 
• Ruby 
• Not afraid to extend (or fork!) 
‘industry standard’
‘enterprise class’
Cheap shot
status.dat
cmd.dat
Centralized
How we use Sensu 
• Don’t use all of this! 
• ‘Standalone’ checks only 
• Default in the Puppet module
Sensu data flow
• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered; clients and messages will fail over
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind HAProxy
Quis custodiet ipsos custodes?
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
Mutually assured monitoring 
• Multiple independent Sensu installs (per-datacenter) 
• Monitor each other! 
Machine readable config 
• /etc/sensu/conf.d/checks/check_name.json 
• Extensible with arbitrary metadata 
• Hash merge 
• Never edit by hand! 
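The "hash merge" bullet can be illustrated with a small sketch. It assumes that the JSON fragments under conf.d are combined by recursively merging nested objects, with later values winning on conflict. The function name `deep_merge`, the sample fragments and the `y/disk` runbook value are hypothetical, and Sensu's own merge rules (for arrays in particular) may differ.

```python
import json


def deep_merge(base, override):
    """Recursively merge two dicts; values in `override` win on conflict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Two hypothetical fragments, as they might appear in separate JSON files.
fragment_a = json.loads(
    '{"checks": {"disk_ro_mounts": {"interval": 60, "team": "operations"}}}'
)
fragment_b = json.loads(
    '{"checks": {"disk_ro_mounts": {"page": false, "runbook": "y/disk"}}}'
)

merged = deep_merge(fragment_a, fragment_b)
print(json.dumps(merged, indent=2, sort_keys=True))
```

Because each check lives in its own file and the files are merged, configuration management can add or remove one check without touching any other file.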
monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
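The `check_every` and `alert_after` values are written as human-friendly durations ('5m', '30m'), while the generated JSON uses plain seconds for `interval`. A minimal sketch of that conversion, assuming single-letter s/m/h/d suffixes; the helper name `to_seconds` is hypothetical and the real puppet-monitoring_check parsing may differ.

```python
import re

# Seconds per unit suffix; this set of suffixes is an assumption.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert 300, '300', '5m' or '2h' to a whole number of seconds."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    # A bare number (no suffix) is treated as seconds.
    return int(number) * UNIT_SECONDS.get(unit, 1)


print(to_seconds("5m"), to_seconds("30m"))
```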
sensu::check 
• monitoring_check wraps this 
• Writes a JSON file for each check 
• Comment safe 
"disk_ro_mounts": {
  "standalone": true, "handlers": ["default"], "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0, "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}
Check scripts
• Same as Nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
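A check script in this style is just a program that prints one line of text and exits with a status code. The sketch below follows the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the disk-space example, the thresholds and the function names are illustrative assumptions, not one of the checks from the talk.

```python
import shutil

# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, one-line output)."""
    if free_pct < crit_pct:
        return CRITICAL, f"CRITICAL: {free_pct:.1f}% free"
    if free_pct < warn_pct:
        return WARNING, f"WARNING: {free_pct:.1f}% free"
    return OK, f"OK: {free_pct:.1f}% free"


def run_check(path):
    """Measure free space under `path` and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"UNKNOWN: {path} reports zero capacity"
    return classify_free_space(100.0 * usage.free / usage.total)


code, message = run_check("/")
print(message)
# A real check script would finish with sys.exit(code) so that the
# monitoring client can read the status from the process exit code.
```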
Handlers 
• base 
• JIRA 
• email 
• irc 
• pagerduty 
• awsprune 
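To show how handlers can act on the custom metadata carried with each result, here is a rough routing sketch. The field names (`page`, `ticket`, `project`, `irc_channels`, `notification_email`) come from the JSON example in this deck; the function `route_event`, the target strings and the treatment of "undef" as "no explicit address" are assumptions for illustration. The real handlers in Yelp/sensu_handlers are written in Ruby and apply more rules than this.

```python
def route_event(check):
    """Return the notification targets implied by a check's metadata."""
    targets = []
    if check.get("page"):
        targets.append("pagerduty")
    if check.get("ticket"):
        targets.append("jira:" + check.get("project", "UNKNOWN"))
    channels = check.get("irc_channels") or []
    if isinstance(channels, str):
        channels = [channels]
    targets.extend("irc:#" + channel for channel in channels)
    email = check.get("notification_email")
    # Assumption: the literal string "undef" means no address was given.
    if email and email != "undef":
        targets.append("email:" + email)
    return targets


disk_ro_mounts = {
    "team": "operations",
    "irc_channels": "operations-notifications",
    "notification_email": "undef",
    "ticket": True,
    "project": "OPS",
    "page": False,
}
print(route_event(disk_ro_mounts))
```

Keeping the routing data on the check itself means a team changes who gets notified by editing the check definition, not the handler code.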
How do checks get run?
• Every machine runs the client.
• Client managed by Puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
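The client socket makes it possible for an application to report its own results. A minimal sketch, assuming the Sensu client is listening on its default local port 3030 and accepts a JSON object with `name`, `status` and `output` plus any extra metadata. The helper names and the sample check are hypothetical; pysensu-yelp wraps this idea with its own API, which is not reproduced here.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Assemble a check result; extra keyword arguments become metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to a local client socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


result = build_result(
    "my_service_queue_depth",  # hypothetical check name
    1,
    "WARNING: queue depth 1200",
    team="operations",
    page=False,
    runbook="y/my_service",
)
print(json.dumps(result, sort_keys=True))
# send_result(result)  # needs a Sensu client listening locally
```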
Situational awareness
Single source of truth
• DNS is canonical for Sensu servers
• Configure things in one place!
Automatic monitoring
• E.g. cron jobs: check that each one ran successfully recently!
• cron::d
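One way to check that a cron job "ran successfully recently" is a freshness check. The sketch assumes the job touches a marker file each time it succeeds, and that the check compares the file's modification time against a maximum age. The marker-file convention and the function name are assumptions; the talk does not say how cron::d implements this.

```python
import os
import time

# Exit codes follow the Nagios plugin convention.
OK, CRITICAL = 0, 2


def check_freshness(marker_path, max_age_seconds, now=None):
    """Return (exit_code, output) based on the age of a success marker."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(marker_path)
    except OSError:
        return CRITICAL, f"CRITICAL: {marker_path} is missing; no recorded success"
    if age > max_age_seconds:
        return CRITICAL, (
            f"CRITICAL: last success {age:.0f}s ago (limit {max_age_seconds}s)"
        )
    return OK, f"OK: last success {age:.0f}s ago"


# Hypothetical marker path for a nightly job, allowing 25 hours between runs.
print(check_freshness("/var/run/cron-markers/nightly_backup", 25 * 3600))
```

Generating this check from the same resource that declares the cron job means every job gets monitored without anyone remembering to add it.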
Generate monitoring_check
User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Simple checks for free!
• Developers can push without Ops
Cluster checks
• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
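The cluster-check idea can be sketched as a function over per-host results. It assumes each host reports a Nagios-style exit code and that the cluster pages only when the healthy share drops below a threshold, filing a ticket for partial failures. The function name, the 50% default and the returned fields are illustrative; this feature was still being built when the talk was given.

```python
def evaluate_cluster(host_statuses, min_healthy_pct=50.0):
    """Aggregate per-host exit codes (0 = healthy) into one cluster result."""
    total = len(host_statuses)
    if total == 0:
        return {"status": 3, "page": False, "ticket": False,
                "output": "UNKNOWN: no hosts reported"}
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    healthy_pct = 100.0 * healthy / total
    summary = f"{healthy}/{total} hosts healthy ({healthy_pct:.0f}%)"
    if healthy_pct < min_healthy_pct:
        # Too few healthy hosts: the service is at risk, so page someone.
        return {"status": 2, "page": True, "ticket": False,
                "output": "CRITICAL: " + summary}
    if healthy < total:
        # Some hosts are down but the service survives: ticket, no page.
        return {"status": 1, "page": False, "ticket": True,
                "output": "WARNING: " + summary}
    return {"status": 0, "page": False, "ticket": False,
            "output": "OK: " + summary}


print(evaluate_cluster({"web1": 0, "web2": 0, "web3": 2, "web4": 0}))
```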
WIP
• This is all still a work in progress.
• We haven’t fully migrated off Nagios yet
• Open sourcing the pieces
Thanks!
• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp

Sensu and Sensibility - Puppetconf 2014
