Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Container Camp UK

A Million Ways
to Crash Your Cluster
CONTAINER CAMP UK
HENNING JACOBS
@try_except_
2018-09-07

4
ZALANDO AT A GLANCE
• ~ 4.5 billion EUR revenue 2017
• > 200 million visits per month
• > 15.000 employees in Europe
• > 70% of visits via mobile devices
• > 23 million active customers
• > 300.000 product choices
• ~ 2.000 brands
• 15 countries

5
SCALE
95 Clusters
378 Accounts

7
INCIDENT #1: CUSTOMER IMPACT

8
INCIDENT #1: IAM RETURNING 404

10
LIFE OF A REQUEST (INGRESS)
DNS: my-app.example.org (ALIAS record)
ALB: aws-1234-lb.eu-central-1.elb.amazonaws.com
SKIPPER: 172.31.1.1:9999, 172.31.2.1:9999, 172.31.3.1:9999, 172.31.4.1:9999
SERVICE: 10.3.0.216
DEPLOYMENT
POD: 10.2.0.1, 10.2.1.1, 10.2.2.1, 10.2.3.1

11
INCIDENT #1: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
  labels:
    application: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
        spec:
          restartPolicy: Never
          # These three are CronJob spec fields but are placed in the pod spec here.
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...

12
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
  labels:
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        metadata:
          labels:
        spec:
          restartPolicy: Never
          containers:
          ...

13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to stay “healthy” during API server problems
• Fix Skipper Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
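
The second and third lessons describe a fail-static pattern: when the API server cannot be reached, keep serving the last route table that loaded successfully and keep reporting healthy. A minimal Python sketch of that idea follows. It is not Skipper's actual implementation; `RouteTable`, `fetch_routes`, and the host-to-backend route format are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


class RouteTable:
    """Keeps the last successfully fetched routes when a refresh fails."""

    def __init__(self, fetch_routes: Callable[[], Dict[str, str]]):
        # fetch_routes is a hypothetical callable that asks the API server
        # for the current host -> backend mapping and raises on failure.
        self._fetch_routes = fetch_routes
        self._routes: Dict[str, str] = {}
        self.stale = False  # True while serving routes from an older fetch

    def refresh(self) -> None:
        try:
            routes = self._fetch_routes()
        except Exception:
            # API server problem: keep the old routes instead of dropping them.
            self.stale = True
            return
        self._routes = routes
        self.stale = False

    def healthy(self) -> bool:
        # Stay healthy as long as there is something to route with, even if
        # the data is stale, so the load balancer keeps this instance.
        return bool(self._routes)

    def lookup(self, host: str) -> str:
        return self._routes[host]
```

The design choice here is that stale routes are treated as better than no routes: a failed refresh only sets a flag that monitoring can alert on.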

15
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix

16
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
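
The help text shows why the command was destructive: the second positional argument is `range_end`, so the literal word `prefix` was taken as the end of a key range rather than as an option. etcd ranges are half-open intervals `[key, range_end)` over byte-ordered keys. The Python sketch below imitates that selection on a handful of made-up keys; the key names are illustrative assumptions, not taken from the incident.

```python
def keys_in_range(keys, key, range_end):
    """Return the keys a range delete over [key, range_end) would select.

    Keys are compared as bytes, which is how etcd orders them.
    """
    start, end = key.encode(), range_end.encode()
    return [k for k in sorted(keys) if start <= k.encode() < end]


# Hypothetical keys for illustration only.
store = [
    "/registry-kube-1/apiservices/a",
    "/registry-kube-1/certificatesigningrequest/csr-1",
    "/registry-kube-1/deployments/default/my-app",
    "/registry-kube-1/pods/default/my-app-1",
]

doomed = keys_in_range(
    store, "/registry-kube-1/certificatesigningrequest", "prefix"
)
print(doomed)
```

Because `/` sorts before `p`, every key from the starting key onward falls inside the range, not only the certificate signing requests.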

17
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44

https://www.outcome-eng.com/human-error-never-root-cause/

19
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots

20
INCIDENT #3: LATENCY SPIKES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...

21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
# Emergency watchdog: restart the local etcd member whenever the API server
# does not answer within 5 seconds.
SLEEPTIME=60
while true; do
    echo "sleep for $SLEEPTIME seconds"
    sleep "$SLEEPTIME"
    if timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null; then
        echo "all fine, no need to restart etcd member"
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done

22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]

23
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable

29
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager

30
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel  Description                                                     Clusters
dev      Development and playground clusters.                            3
alpha    Main infrastructure cluster (important to us).                  1
beta     Product clusters for the rest of the organization (prod/test).  90+

31
E2E TESTS ON EVERY PR

32
RUNNING E2E TESTS (BEFORE)
Cluster: control plane, node, node (branch: dev)
Steps: Create Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade

33
RUNNING E2E TESTS (NOW)
Cluster before update: control plane, node, node (branch: alpha, base)
Cluster after update: control plane, node, node (branch: dev, head)
Steps: Create Cluster → Update Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade

34
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with the previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
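
The flow in the bullets above can be sketched as a small driver. This is an illustrative Python outline, not the actual pipeline: the `Cluster` protocol and its `create`, `apply`, `run_tests`, and `delete` methods are hypothetical placeholders for whatever provisioning tooling is in use.

```python
from typing import Protocol


class Cluster(Protocol):
    # Hypothetical interface; the real tooling will look different.
    def create(self, config: str) -> None: ...
    def apply(self, config: str) -> None: ...
    def run_tests(self) -> bool: ...
    def delete(self) -> None: ...


def check_upgrade(cluster: Cluster, base_config: str, head_config: str) -> bool:
    """Bootstrap with the previous config, migrate to the new one, then test."""
    cluster.create(base_config)       # state that existing clusters are in
    try:
        cluster.apply(head_config)    # exercise the migration path itself
        return cluster.run_tests()    # end-to-end and conformance tests
    finally:
        cluster.delete()              # always clean up the test cluster
```

Creating the cluster from the base configuration first is the point of the exercise: a fresh cluster built directly from the new configuration never runs the migration that every existing cluster will go through.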

35
INCIDENT #5: IMPACT
[4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied
one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 ..
[5:01 PM] Alice: Now it does not start the build step at all
[5:02 PM] John: +1
[5:02 PM] John: Failed to create builder pod: …
[5:02 PM] Pedro: +1
[5:04 PM] Damien: +1
[5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which
results in many problems…
...

37
INCIDENT #5: A VERY INNOCENT PULL REQUEST

38
INCIDENT #5: WHAT HAPPENED
• Deployment caused rebuild with the latest stable Go version
• Library for signature verification was incompatible with Go 1.10,
causing all verification checks to fail during runtime.
• Lack of unit/smoke tests and alerting for one component
• "Near miss": outage could have had large impact

39
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s),
switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have
issues with timeouts
• 502's during cluster updates: race condition during network setup
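
As a rough illustration of why a CFS quota can add latency: with a quota, a container may use only a fixed slice of CPU time per scheduling period (100 ms by default in Linux) and is paused for the rest of the period once the slice is used up. The Python sketch below works through that arithmetic under simplifying assumptions (a single thread, work starting at the beginning of a period, no other load); the function name and the numbers are illustrative.

```python
def wall_time_ms(cpu_ms: int, quota_ms: int, period_ms: int = 100) -> int:
    """Wall-clock time to finish cpu_ms of work under a CFS quota.

    The task runs for at most quota_ms in each period and is throttled
    until the next period starts once the quota is used up.
    """
    if cpu_ms <= 0:
        return 0
    exhausted_periods = (cpu_ms - 1) // quota_ms  # periods ending throttled
    remainder = cpu_ms - exhausted_periods * quota_ms
    return exhausted_periods * period_ms + remainder


# A CPU limit of 500m corresponds to 50 ms of quota per 100 ms period.
print(wall_time_ms(60, quota_ms=50))   # 60 ms of work under a 500m limit
print(wall_time_ms(60, quota_ms=100))  # same work with a full core
```

In this model, 60 ms of CPU work takes 110 ms of wall-clock time under the 500m limit, compared with 60 ms when a full core is available.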

40
MORE TOPICS
• Graceful Pod shutdown and
race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker

41
RACE CONDITIONS..

42
TIMEOUTS TO API SERVER..

45
OPEN SOURCE
Kubernetes on AWS
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler

46
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report

https://github.com/hjacobs/kube-ops-view

48
OTHER TALKS
• Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
• Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018
• Inside Kubernetes Resource Management (QoS) - KubeCon 2018
We need more failure talks!

QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k
