5 ways deployments go
wrong and 5 solutions
Cloud
adoption
fails
FAIL
“All happy cloud
deployments are alike;
each unhappy cloud
deployment is unhappy
in its own way.”
Leo Tolstoy
Site Reliability Engineer
I’m
Yevgeniy
Brikman
ybrikman.com
Author
Co-founder of
Gruntwork
gruntwork.io
At Gruntwork,
I’ve seen the
cloud adoption
journeys of
hundreds of
companies
I’ve seen some go well.
I’ve seen some go poorly.
I've seen things you people
wouldn’t believe. DDoS attacks
starting fires off the shoulder
of Ohio (us-east-2). I watched
C-suite foreheads glitter in the
dark near their Fargate bills.
All those moments will be lost
in time, like tears in rain...
Image credit: Blade Runner, Warner Bros, 1982
Why is it so hard?
Because everything has changed
about how we build software.
Area | Before | After
Dev team | Write code, “toss it over the wall” | Write code, deploy
Ops team | Rack servers, deploy code | Write code, deploy
Servers | Dedicated physical servers | Elastic virtual servers
Connectivity | Static IPs | Dynamic IPs, service discovery
Security | Physical, strong perimeter, high trust | Virtual, end-to-end, zero trust
Infra provisioning | Manual | Infrastructure as Code (IaC) tools
Server configuration | Manual | Configuration management tools
Testing | Manual | Automated testing
Deployments | Manual | Automated
Deployment cadence | Weeks or months | Many times per day
Change process | Change request tickets | Self-service
Change cadence | Weeks or months | Minutes
The shift to DevOps and the cloud
Adopting the cloud without acknowledging
these changes leads to problems
This talk is about 5 common causes of
cloud adoption failure…
Plus 5 solutions
based on the
patterns that
worked across
hundreds of
companies
The 5 solutions
are part of the
Gruntwork
Production
Framework
https://docs.gruntwork.io/guides/production-framework/
Outline
1. Do it by hand
2. Do it live
3. Do it on my machine
4. Do it only on my machine
5. Do it once
NUMBER 1:
FAIL
Deploying by using the web console
for your cloud provider: “ClickOps”
Almost everyone starts this way.
Almost everyone regrets it.
Problems with ClickOps:
1. Slow
Hours of clicking to spin up a new environment.
2. No reuse
Every deploy must be done from scratch. No leverage from previous work.
3. No audit trail
All info trapped in one person’s head. No versioning.
4. Error-prone
Manual task = human error. Deployment problems. Snowflake servers. Can’t use tests.
5. Tedious
No one likes doing slow, repetitive, error-prone, risky work over and over again.
“Realizing your
DevOps Engineer left...
After deploying
everything via
ClickOps.”
Vasily Vereshchagin
Oil on canvas, 1887
Side note:
credit to Classic
Programmer
Paintings for the
comic inspiration!
https://classicprogrammerpaintings.com/
NUMBER 1:
SOLUTION
Create a Service Catalog
A modern Service Catalog.
The modern Service Catalog:
1. Defined as code
Using tools such as Terraform, CloudFormation, Docker, Kubernetes, etc.
2. Designed for production use
Not a “5 minute demo,” but production-grade code.
3. Meets company requirements out-of-the-box
Scalability, HA, security, compliance (e.g., SOC 2, ISO 27001, PCI, HIPAA), etc.
4. Tested to meet company requirements
Code reviews, static analysis, functional testing, policy enforcement, etc.
5. Infrastructure and app code
Defines templates and patterns for both infrastructure and applications.
Infrastructure
templates
This is your Cloud API
https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/infrastructure-templates
Application
templates
This is your API between the
cloud and your apps
https://docs.gruntwork.io/guides/production-framework/ingredients/service-catalog/application-templates
Real-world example: Gruntwork Service Catalog
Example infrastructure template for EKS
Example application template for Node.js
Key idea #1: Manage everything as
code in a Service Catalog.
Manual provisioning → Infrastructure as code
Manual server config → Configuration management
Manual app config → Configuration files
Manual builds → Continuous integration
Manual deployment → Continuous delivery
Manual testing → Automated testing
Manual policies → Automated policies (OPA)
Manual DBA work → Schema migrations
Manual specs → Automated specs (BDD)
Recall the problems with ClickOps:
1. Slow
Hours of clicking to spin up a new environment.
2. No reuse
Every deploy must be done from scratch. No leverage from previous work.
3. No audit trail
All info trapped in one person’s head. No reproducibility. No versioning.
4. Error-prone
Manual task = human error. Every environment a little bit different. No testing.
5. Tedious
No one likes doing slow, repetitive, error-prone, risky work over and over again.
Advantages of code:
1. Slow → Fast
Computers can do in seconds what it takes a human hours to do.
2. No reuse → Reusable
Leverage your previous work and the work of others. Evolve your code over time.
3. No audit trail → Logged & versioned
Everything is in your version control system, including the full history of changes.
4. Error-prone → Reliable
Code + automated tests + code reviews dramatically reduce errors.
5. Tedious → Enjoyable
Writing code and being creative is more fun than repetitive, stressful, manual work.
NUMBER 2:
FAIL
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
Making everyone an admin
Initially, most companies try to limit
permissions…
But IAM is hard
Image from Why is AWS IAM So Hard? by Stephen Kuenzli
An error occurred (AccessDenied) when calling the
ListBuckets operation: Access Denied
(tweak the IAM policy)
An error occurred (AccessDenied) when calling the
ListBuckets operation: Access Denied
(tweak the IAM policy)
An error occurred (AccessDenied) when calling the
ListBuckets operation: Access Denied
And frustrating. It’s just “Access Denied”
over and over and over again.
The inevitable result: “F*ck it, we’ll do it
live!” and you make everyone an admin.
Problems with making everyone an admin:
1. Weak security
Huge blast radius from any mistake. Any compromised credentials may result in a
severe security incident. Any guard rails you put in place are ineffective.
2. Sprawl
Tons of new accounts and resources spun up and no one knows what they are for.
3. No consistency
Everything is configured differently: logging, networking, security controls, etc.
4. Difficult to fix it
If everyone is an admin, very hard to “undo” the damage: you don’t know what they’ve
done and you’re never 100% confident you’ve reined things in.
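One way to keep the wildcard policy from creeping back in is to check for it automatically. The following is a minimal sketch, not a complete IAM analyzer: it only flags Allow statements where both Action and Resource include a bare "*", and it ignores NotAction, conditions, and partial wildcards such as "iam:*". The function names are hypothetical.

```python
import json


def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def admin_statements(policy):
    """Return the Allow statements that grant every action on every resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may be a bare object
        statements = [statements]
    return [
        s for s in statements
        if s.get("Effect") == "Allow"
        and "*" in _as_list(s.get("Action", []))
        and "*" in _as_list(s.get("Resource", []))
    ]


# The "everyone is an admin" policy from the slide above.
ADMIN_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "*", "Resource": "*"}
  ]
}
""")

# A narrower policy that only allows listing buckets.
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListAllMyBuckets"], "Resource": "*"}
    ],
}

if __name__ == "__main__":
    for name, policy in [("admin", ADMIN_POLICY), ("scoped", SCOPED_POLICY)]:
        print(name, "admin-equivalent statements:", len(admin_statements(policy)))
```

A check like this could run in code review or in the pipeline, so that reaching for admin permissions becomes a visible decision rather than a quiet workaround.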
“Attempting to
get all the AWS
accounts under
control”
Jacques-Louis David
Oil on canvas, 1799
NUMBER 2:
SOLUTION
Set up your Landing Zone as
early as possible
landing zone noun
/ˈlændɪŋ zəʊn/
A streamlined way to create new accounts in your cloud provider that are
configured out-of-the-box with best practices (e.g., authentication, authorization,
logging, monitoring, tagging, guard rails, etc.).
Key ingredients of a Landing Zone:
1. Account structure
2. Account baselines
3. Account vending machine
account structure noun
/əˈkaʊnt ˈstrʌktʃə(r) /
How to configure multiple inter-connected accounts in the cloud to provide
isolation, compartmentalization, authentication, authorization, auditing, and
reporting.
Each cloud recommends different
account structures
account baseline noun
/əˈkaʊnt ˈbeɪslaɪn/
The basic set of controls installed in every account to enforce a common set of
best practices (e.g., authentication, authorization, logging, monitoring, tagging,
guard rails, etc.).
Account baselines should handle:
Area | Description | Examples
Authentication | User identity, login, MFA | IAM users & roles, SSO, IdPs
Authorization | User permissions and access | IAM policies & groups, ACLs, RBAC
Monitoring | Audit logging, app logging, metrics | CloudTrail, Elastic stack, Grafana
Networking | IPs, routing, DNS, connectivity | VPCs, NAT, Route 53, VPN, SSH, RDP
Hardening | Network hardening, intrusion detection | WAF, IPS, Squid Proxy, GuardDuty
Guard rails | Limit what actions can be taken | IAM policies, SCPs, OPA, AWS Config
Compliance | Enforce compliance requirements | SOC 2, ISO 27001, CIS, PCI, HIPAA
Ownership | Associate accounts & resources with teams | Tagging, billing
module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  enable_cloudtrail = true
  enable_aws_config = true
  enable_guard_duty = true

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"
  }
}
Define your account baselines as code
account vending machine noun
/əˈkaʊnt ˈvendɪŋ məˈʃiːn/
An official tool or process for spinning up new accounts which ensures that each of
those accounts is configured with the appropriate account baseline.
Key ingredients for an account vending machine:
1. Self-service
Teams should be able to spin up new accounts for themselves on-demand.
2. GitOps-driven
Under the hood, manage accounts as code checked into version control.
3. Apply baselines
The vending machine ensures the proper baseline is applied to every new account.
4. Provision access
The vending machine not only creates accounts, but also grants teams access to them
(e.g., via SSO).
module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"

    # Add new account
    example = "accounts+example@company.com"
  }
}
Example vending machine: update a
file, commit, CI / CD system deploys it
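To illustrate what "update a file, commit, CI / CD system deploys it" might mean under the hood, here is a minimal sketch of the comparison step: given the account map before and after a commit, work out which accounts the vending machine needs to create. This is an assumption about how such a tool could be structured, not a description of any specific product; the function name and return shape are hypothetical.

```python
def plan_account_changes(current, desired):
    """Compare two {account name: root email} maps from version control.

    Returns what a vending machine would need to act on:
      - create: accounts present only in the desired map
      - removed_from_code: accounts deleted from the file (flag for human review)
      - email_changed: accounts whose root email differs between the two maps
    """
    create = {name: email for name, email in desired.items() if name not in current}
    removed = sorted(name for name in current if name not in desired)
    changed = sorted(
        name for name in current
        if name in desired and current[name] != desired[name]
    )
    return {"create": create, "removed_from_code": removed, "email_changed": changed}


# The account map before and after the commit shown in the slide above.
BEFORE = {
    "dev": "accounts+dev@company.com",
    "stage": "accounts+stage@company.com",
    "prod": "accounts+prod@company.com",
}
AFTER = dict(BEFORE, example="accounts+example@company.com")

if __name__ == "__main__":
    print(plan_account_changes(BEFORE, AFTER))
```

Removals are reported rather than acted on automatically in this sketch, since closing a cloud account is rarely something you want triggered by a one-line diff.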
Key idea #2: Set up your Landing Zone
as early as you can.
NUMBER 3:
FAIL
Deployments are done by humans
from their own computers
Even with IaC, relying on a person to do
deployments leads to problems
Problems with a person deploying:
1. Error-prone
Manual process = human error. E.g., fat-fingering a command, forgetting some step.
2. Not reproducible
E.g., Wrong version installed locally, accidentally deploying uncommitted changes.
3. Low bus factor
Often only 1 or 2 devs can deploy. What if they go on vacation or leave the company?
4. Race conditions
Different devs accidentally deploy different code (e.g., different branches) = conflicts.
5. Not secure
Deploying arbitrary changes requires arbitrary—admin—permissions. We already know
what happens when you give too many people admin permissions.
“Realizing you
just ran terraform
destroy in prod.”
Gustave Courbet
Oil on canvas, 1845
NUMBER 3:
SOLUTION
Do all deploys through a
CI / CD pipeline
Key CI / CD pipeline features:
Feature | Description
GitOps-driven | The pipeline is triggered by commits to version control
Defined as code | The full workflow should be defined as code
Automated tests | The pipeline should run checks before, during, and after each deploy
Preview environments | Deploy the changes in each PR into an ephemeral environment
Promotion workflows | Promote immutable artifacts across environments: e.g., dev → stage → prod
Approval workflows | For some types of changes, require human approval for deployment to prod
Deployment workflows | Blue/green deploys, rolling deploys, canary deploys, feature toggles
App and infra code | You need workflows for both application and infrastructure code
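The promotion and approval workflows can be sketched in a few lines. The example below is a simplified model under stated assumptions: the environment order, the rule that only prod needs approval, and the names Release and promote are all hypothetical, and a real pipeline would express this in its own configuration format.

```python
from dataclasses import dataclass, field

ENVIRONMENTS = ["dev", "stage", "prod"]  # promotion order (assumed)
NEEDS_APPROVAL = {"prod"}                # environments gated by a human (assumed)


@dataclass
class Release:
    artifact: str  # immutable, versioned artifact, e.g. "app:1.4.2"
    deployed_to: list = field(default_factory=list)


def promote(release, checks_passed, approved_by=None):
    """Deploy the same artifact to the next environment, or raise if a gate fails."""
    if len(release.deployed_to) == len(ENVIRONMENTS):
        raise ValueError("release is already deployed to every environment")
    target = ENVIRONMENTS[len(release.deployed_to)]
    if not checks_passed:
        raise RuntimeError(f"automated checks failed; not promoting to {target}")
    if target in NEEDS_APPROVAL and not approved_by:
        raise PermissionError(f"{target} requires human approval")
    release.deployed_to.append(target)
    return target


if __name__ == "__main__":
    release = Release("app:1.4.2")
    print(promote(release, checks_passed=True))
    print(promote(release, checks_passed=True))
    print(promote(release, checks_passed=True, approved_by="release-manager"))
```

The point of the sketch is that the artifact never changes between environments; only the gates in front of each environment differ.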
The workflows for app & infra code are
similar, but with key differences.
Workflows for app & infra code:
Stage | Application code | Infrastructure code
Run locally | Run the code on localhost; make a change, refresh | Run the code in the cloud (sandboxes); make a change, redeploy (use stages!)
Code review | Submit pull request with code changes | Submit pull request with code changes
Test | Static analysis: linter. Functional tests: unit, integration, e2e | Static analysis: linter, policy enforcement. Functional tests: plan, integration
Release | Merge pull request; build immutable, versioned artifact | Merge pull request; create git tag
CI config | CI server has limited permissions; CI server triggers K8S, ECS, EC2, etc. | Isolated worker has admin permissions; CI server triggers isolated worker
Deploy | Promote artifacts: e.g., dev → stage → prod; rolling, blue/green, canary, feature flags | Promote tags: e.g., dev → stage → prod; plan, approve, deploy, hope
Key idea #3: The CI / CD pipeline is the
only thing that can deploy to prod.
No one has write access to prod (let
alone admin access) except the pipeline.
Key idea #4: The CI / CD pipeline will
only deploy vetted services from the
Service Catalog to prod.
The Catalog + Pipeline are the only path
to prod; the API between Devs and Ops.
Key idea #5: The CI / CD pipeline
protects its permissions for prod.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
To deploy arbitrary infra changes, you
need arbitrary (admin) permissions!
Giving your CI server direct access to
admin permissions considered harmful.
This is a BAD combination:
1. Everyone in your company can access your CI server
2. You use the CI server to execute arbitrary code
3. The CI server has admin permissions
Congratulations, everyone in your
company has admin permissions again!
And so do
hackers
outside your
company!
https://research.nccgroup.com/2022/01/13/10-real-world-stories-of-how-weve-compromised-ci-cd-pipelines/
The solution: only give admin
permissions to an isolated worker
The isolated worker:
1. Is highly locked down
Unlike the CI server, no one at the company has direct access to the worker.
2. Can only be triggered by the CI server
The CI server only has permissions to trigger the worker via an API & stream logs from it.
3. Exposes a limited, locked-down API
The worker only allows you to run certain commands (e.g., terraform apply), in certain
repos, in certain branches, in certain folders, etc.
4. Minimizes the potential damage
If an attacker gets access to your CI server, the worst they can do is trigger a deploy on
your own code. They do NOT get admin permissions directly.
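The "limited, locked-down API" can be pictured as an allowlist check that the worker runs before doing anything. This is a minimal sketch assuming the worker receives a request naming a repo, a git ref, a folder, and a command; the repo URL, the allowlist values, and the names DeployRequest and validate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    repo: str
    ref: str      # branch or tag to deploy from
    folder: str   # path inside the repo to run the command in
    command: str


# Hypothetical allowlists; a real worker would load these from its own config.
ALLOWED_COMMANDS = {"terraform plan", "terraform apply"}
ALLOWED_REPOS = {"github.com/example-org/infrastructure-live"}
ALLOWED_REFS = {"main"}
ALLOWED_FOLDER_PREFIXES = ("dev/", "stage/", "prod/")


def validate(request):
    """Return the reasons to reject a request; an empty list means accept."""
    problems = []
    if request.command not in ALLOWED_COMMANDS:
        problems.append(f"command not allowed: {request.command}")
    if request.repo not in ALLOWED_REPOS:
        problems.append(f"repo not allowed: {request.repo}")
    if request.ref not in ALLOWED_REFS:
        problems.append(f"ref not allowed: {request.ref}")
    escapes = ".." in request.folder.split("/")
    if escapes or not request.folder.startswith(ALLOWED_FOLDER_PREFIXES):
        problems.append(f"folder not allowed: {request.folder}")
    return problems


if __name__ == "__main__":
    ok = DeployRequest(
        repo="github.com/example-org/infrastructure-live",
        ref="main",
        folder="prod/networking/vpc",
        command="terraform apply",
    )
    print(validate(ok))
```

Because the worker, not the CI server, holds the admin credentials, a compromised CI job can at most submit requests that pass this check.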
NUMBER 4:
FAIL
Only Ops is allowed to deploy
The Ops team, trying to protect the
company, acts as a gatekeeper.
But that usually backfires:
Inevitably, the Ops team is overwhelmed
and becomes a bottleneck
So the Dev team finds a workaround…
So Ops adds more process… but that just
makes things even more backed up.
“The Ops team
explains the new
95-step change
request process
to the Dev team.”
Ferdinand Pauwels
Oil on canvas, 1872
NUMBER 4:
SOLUTION
Provide
developers with
self-service
Key idea #6: Any team can deploy their
own infra + apps from the Service Catalog
The cloud is primarily a tool for Devs,
not Ops.
One of the biggest benefits of the cloud:
Devs can be more self-sufficient.
Ops team as gatekeeper: Devs
aren’t self-sufficient, go slow.
Ops team as enabler: Devs are
self-sufficient, go fast.
Enable self-service safely via the Catalog
+ Pipeline: your API on top of the cloud.
Devs should have sandbox accounts
for easy testing, learning, etc.
Tool | Clouds | Features
cloud-nuke | AWS | Delete all resources older than a certain date; in a certain region; of a certain type.
safe-scrub | Google Cloud | Safely delete unwanted resources in a GCP project.
Azure PowerShell | Azure | Includes native commands to delete Resource Groups.
Run cleanup tools in cron jobs to remove
old resources in sandbox accounts
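The core of any such cleanup job is an age filter. The sketch below shows that selection logic only, under assumptions: the resource records, the "keep" opt-out tag, and the function name are hypothetical, and it does not call any cloud API or reproduce how the tools above work internally.

```python
from datetime import datetime, timedelta, timezone


def select_expired(resources, now, max_age, protected_tag="keep"):
    """Return the ids of resources older than max_age that are not tagged to keep."""
    cutoff = now - max_age
    return [
        r["id"] for r in resources
        if r["created_at"] < cutoff and protected_tag not in r.get("tags", {})
    ]


# Hypothetical inventory of a sandbox account.
SAMPLE = [
    {"id": "vm-old", "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc), "tags": {}},
    {"id": "vm-new", "created_at": datetime(2024, 1, 9, 12, tzinfo=timezone.utc), "tags": {}},
    {"id": "db-keep", "created_at": datetime(2023, 12, 1, tzinfo=timezone.utc),
     "tags": {"keep": "true"}},
]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(select_expired(SAMPLE, now, max_age=timedelta(days=2)))
```

Passing the current time in as an argument keeps the function deterministic, which makes it easy to test before letting a cron job delete anything.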
In prod, Devs deploy via self-service with
the Service Catalog + CI / CD Pipeline.
Key self-service features:
1. GitOps-driven
Everything is managed as code and driven by commits to version control. Allows code
review, testing, audit log, versioning, etc.
2. UI-driven (optional)
Web UI as a layer on top of GitOps layer to make it more accessible.
3. Focus on common use cases
E.g., Account vending machine, data store deployment, app deployment. Don’t have to
solve everything right away.
4. Access controls
Different teams can access/deploy different things. E.g., NetOps team might be able to
deploy networking, whereas app teams can deploy orchestration tools and data stores.
module "account_baseline" {
  source = "github.com/gruntwork-io/account-baseline"

  child_accounts = {
    dev   = "accounts+dev@company.com"
    stage = "accounts+stage@company.com"
    prod  = "accounts+prod@company.com"

    # Add new account
    example = "accounts+example@company.com"
  }
}
Example of self-service: update a file,
commit, CI / CD system deploys it
Key idea #7: Any team can contribute
to the Service Catalog.
Modern software involves many
moving pieces
If only Ops can add those pieces to the
Service Catalog, that’ll be a bottleneck
Automated tests:
✓ tflint
✓ tfsec
✓ OPA
✓ steampipe
✓ checkov
✓ Terratest
Passed: 6. Failed: 0. Skipped: 0.
Test run successful.
Instead, allow
everyone to contribute
and enforce company
requirements through
code reviews and
automated tests
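Alongside off-the-shelf tools, a company requirement can often be written as a small automated check of its own. Here is a minimal sketch, assuming resources are available as simple dictionaries with an address and a tags map; that shape, the required tag names, and the function name are hypothetical and are not the format of any particular tool.

```python
REQUIRED_TAGS = {"team", "environment"}  # hypothetical company requirement


def check_required_tags(resources):
    """Return one violation message per resource that is missing a required tag."""
    violations = []
    for resource in resources:
        missing = sorted(REQUIRED_TAGS - set(resource.get("tags", {})))
        if missing:
            violations.append(
                f"{resource['address']}: missing tags {', '.join(missing)}"
            )
    return violations


# Hypothetical resources from a contribution to the Service Catalog.
PROPOSED = [
    {"address": "aws_s3_bucket.logs", "tags": {"team": "platform", "environment": "prod"}},
    {"address": "aws_instance.worker", "tags": {"team": "payments"}},
    {"address": "aws_sqs_queue.jobs"},
]

if __name__ == "__main__":
    for violation in check_required_tags(PROPOSED):
        print(violation)
```

Run against every pull request, a check like this lets any team contribute while the requirement itself stays enforced by code rather than by a gatekeeper.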
NUMBER 5:
FAIL
Not taking into account ongoing
maintenance work
Not only are there many moving pieces,
but they’re all also constantly changing.
AWS is
constantly
changing
The last S3 security document that we’ll ever need, and how to use it
How To Keep Up With AWS Announcements
Docker is
constantly
changing
Docker Releases
Kubernetes is
constantly
changing
Kubernetes Wikipedia page
Terraform is
constantly
changing
Terraform Upgrade Guides
Many companies assume that the initial
cloud deployment is the hard part.
It isn’t.
“Software maintenance
cost is increasingly
growing and estimates
showed that about 90%
of software life cost is
related to its
maintenance phase.”
Which Factors Affect Software Projects
Maintenance Cost More?
Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi
If you don’t have a plan for maintenance,
all that code you wrote will rot.
“Coming back to that
Terraform codebase
after 6 months.”
Eero Järnefelt
Oil on canvas, 1893
NUMBER 5:
SOLUTION
Set up
automatic
updates
Key auto-update features:
1. Automation-driven
Updates are discovered and the code is updated automatically. No relying on a human
to remember it. Update cadence should be configurable.
2. GitOps-driven
The code is updated via automated pull requests.
3. Automated testing
You must have automated tests in place and running against each pull request to let
you know if the updated code still works.
4. Automated deployment
Once a pull request is merged, it must deploy automatically via the CI / CD pipeline,
promoting the update across environments: e.g., dev → stage → prod.
Key idea #8: Updates are pushed to the
code via PRs, automatically.
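The discovery step behind those automated PRs amounts to comparing pinned versions against the newest available ones. The following sketch assumes plain numeric versions such as "v1.2.3" and that the latest versions have already been fetched; the module names and function names are hypothetical, and pre-release or build suffixes are not handled.

```python
def parse_version(text):
    """Parse 'v1.2.3' or '1.2.3' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def propose_updates(pinned, latest):
    """Return {module: (current, newest)} for every pinned version that is behind."""
    updates = {}
    for module, current in pinned.items():
        newest = latest.get(module)
        if newest and parse_version(newest) > parse_version(current):
            updates[module] = (current, newest)
    return updates


# Hypothetical version pins found in the code, and the latest known releases.
PINNED = {"vpc": "v0.9.2", "eks": "v1.4.0", "rds": "v2.0.0"}
LATEST = {"vpc": "v0.9.10", "eks": "v1.4.0"}

if __name__ == "__main__":
    for module, (current, newest) in propose_updates(PINNED, LATEST).items():
        print(f"open PR: bump {module} from {current} to {newest}")
```

Versions are compared as integer tuples because a plain string comparison would rank "v0.9.10" below "v0.9.2". Each proposed update would then become its own pull request, so the automated tests can pass or fail it independently.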
Key idea #9: Code without automated
tests will rot.
How to do automated testing for infrastructure code
https://terratest.gruntwork.io/docs/getting-started/introduction/#watch-how-to-test-infrastructure-code
Let’s recap:
Key ideas:
1. Manage everything as code in a Service Catalog.
2. Set up your Landing Zone as early as you can.
3. Only the CI / CD Pipeline can deploy to prod.
4. The CI / CD Pipeline only deploys from the Service Catalog.
5. The CI / CD Pipeline protects its admin permissions.
6. Any team can deploy infra + apps from the Service Catalog.
7. Any team can contribute to the Service Catalog.
8. Updates are pushed to the code via PRs, automatically.
9. Code without automated tests will rot.
5 cloud adoption fails and solutions:
Fail | Description | Solution
Do it by hand | ClickOps | Service Catalog
Do it live | Everyone is an admin | Landing Zone
Do it on my machine | People deploying from their computers | CI / CD Pipeline
Do it only on my machine | Only Ops can deploy | Self-Service
Do it once | Not taking maintenance into account | Automatic Updates
The 5 solutions
are part of the
Gruntwork
Production
Framework
https://docs.gruntwork.io/guides/production-framework/
If you use this framework, here’s the
experience for your Ops team:
Step 1: Create a Service Catalog
Everything defined as code. Works for app + infra. You could build from
scratch or on top of an existing one (e.g., Gruntwork Service Catalog).
Step 2: Set up your Landing Zone
Set up your basic account structure, define account baselines, etc.
Step 3: Set up a CI / CD pipeline
Ensure it’s the only way to deploy to prod. Make it work for apps + infra.
Step 4: Provide self-service
Enable all teams to deploy. Start with a GitOps solution. Add UI later.
Step 5: Set up automatic updates
PRs opened automatically. Automated tests in place for app + infra code.
And here’s the experience for your
Dev team:
Step 1: Scaffold a new app
Leverage vetted application templates from the Service Catalog and the
logic built in: e.g., service discovery, packaging, monitoring, testing, etc.
Step 2: Deploy infrastructure
Leverage Self-Service + Service Catalog + CI / CD Pipeline.
Step 3: Iterate on the app
Leverage CI / CD built into the templates to deploy subsequent changes.
Step 4: Debug issues
Leverage monitoring, logging, alerting, etc. built into the templates.
Step 5: Stay up to date
Leverage auto update built into the templates. Automated PRs + tests.
“The Cloud
you always
wanted.”
Thomas Cole
Oil on canvas, 1836
Questions?
info@gruntwork.io

Cloud adoption fails - 5 ways deployments go wrong and 5 solutions

  • 1.
    5 ways deploymentsgo wrong and 5 solutions Cloud adoption fails FAIL
  • 2.
    “All happy cloud deploymentsare alike; each unhappy cloud deployment is unhappy in its own way.” Leo Tolstoy Site Reliability Engineer
  • 3.
  • 4.
  • 5.
  • 6.
    At Gruntwork, I’ve seenthe cloud adoption journeys of hundreds of companies
  • 7.
    I’ve seen somego well. I’ve seen some go poorly.
  • 8.
    I've seen thingsyou people wouldn’t believe. DDos attacks starting fires off the shoulder of Ohio (us-east-2). I watched C-suite foreheads glitter in the dark near their Fargate bills. All those moments will be lost in time, like tears in rain... Image credit: Blade Runner, Warner Bros, 1982
  • 9.
    Why is itso hard?
  • 10.
    Because everything haschanged about how we build software.
  • 11.
    Before After Dev teamWrite code, “toss it over the wall” Write code, deploy Ops team Rack servers, deploy code Write code, deploy Servers Dedicated physical servers Elastic virtual servers Connectivity Static IPs Dynamic IPs, service discovery Security Physical, strong perimeter, high trust Virtual, end-to-end, zero trust Infra provisioning Manual Infrastructure as Code (IaC) tools Server configuration Manual Configuration management tools Testing Manual Automated testing Deployments Manual Automated Deployment cadence Weeks or months Many times per day Change process Change request tickets Self-service Change cadence Weeks or months Minutes The shift to DevOps and the cloud
  • 12.
    Adopting the cloudwithout acknowledging these changes leads to problems
  • 13.
    This talk isabout 5 common causes of cloud adoption failure…
  • 14.
    Plus 5 solutions basedon the patterns that worked across hundreds of companies
  • 15.
    The 5 solutions arepart of the Gruntwork Production Framework https://docs.gruntwork.io/guides/production-framework/
  • 16.
    1. Do itby hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  • 17.
    1. Do itby hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  • 18.
  • 19.
    Deploying by usingthe web console for your cloud provider: “ClickOps”
  • 20.
    Almost everyone startsthis way. Almost everyone regrets it.
  • 21.
    Problems with ClickOps: 1.Slow Hours of clicking to spin up a new environment. 2. No reuse Every deploy must be done from scratch. No leverage from previous work. 3. No audit trail All info trapped in one person’s head. No versioning. 4. Error-prone Manual task = human error. Deployment problems. Snowflake servers. Can’t use tests. 5. Tedious No one likes doing slow, repetitive, error-prone, risky work over and over again.
  • 22.
    “Realizing your DevOps Engineerleft... After deploying everything via ClickOps.” Vasily Vereshchagin Oil on canvas, 1887
  • 23.
    Side note: credit toClassic Programmer Paintings for the comic inspiration! https://classicprogrammerpaintings.com/
  • 24.
  • 25.
  • 26.
  • 27.
    The modern ServiceCatalog: 1. Defined as code Using tools such as Terraform, CloudFormation, Docker, Kubernetes, etc. 2. Designed for production use Not a “5 minute demo,” but production-grade code. 3. Meet company requirements out-of-the-box Scalability, HA, security, compliance (e.g., SOC 2, ISO 27001, PCI, HIPAA), etc. 4. Tested to meet company requirements Code reviews, static analysis, functional testing, policy enforcement, etc. 5. Infrastructure and app code Defines templates and patterns for both infrastructure and applications.
  • 28.
    Infrastructure templates This is yourCloud API https://docs.gruntwork.io/guides/production- framework/ingredients/service-catalog/infrastructure-templates
  • 29.
    Application templates This is yourAPI between the cloud and your apps https://docs.gruntwork.io/guides/production- framework/ingredients/service-catalog/application-templates
  • 30.
  • 31.
  • 32.
  • 33.
    Key idea #1:Manage everything as code in a Service Catalog.
  • 34.
    Manual provisioning àInfrastructure as code Manual server config à Configuration management Manual app config à Configuration files Manual builds à Continuous integration Manual deployment à Continuous delivery Manual testing à Automated testing Manual policies à Automated policies (OPA) Manual DBA work à Schema migrations Manual specs à Automated specs (BDD)
  • 35.
    Recall the problemswith ClickOps: 1. Slow Hours of clicking to spin up a new environment. 2. No reuse Every deploy must be done from scratch. No leverage from previous work. 3. No audit trail All info trapped in one person’s head. No reproducibility. No versioning. 4. Error-prone Manual task = human error. Every environment a little bit different. No testing. 5. Tedious No one likes doing slow, repetitive, error-prone, risky work over and over again.
  • 36.
    Advantages of code: 1.Slow Fast Computers can do in seconds what it takes a human hours to do. 2. No reuse Reusable Leverage your previous work and the work of others. Evolve your code over time. 3. No audit trail Logged & versioned Everything is in your version control system, including the full history of changes. 4. Error-prone Reliable Code + automated tests + code reviews dramatically reduce errors. 5. Tedious Enjoyable Writing code and being creative is more fun than repetitive, stressful, manual work.
  • 37.
    1. Do itby hand 2. Do it live 3. Do it on my machine 4. Do it only on my machine 5. Do it once Outline
  • 38.
  • 39.
    { "Version": "2012-10-17", "Statement": [ { "Effect":"Allow", "Action": "*", "Resource": "*" } ] } Making everyone an admin
  • 40.
    Initially, most companiestry to limit permissions…
  • 41.
    But IAM ishard Image from Why is AWS IAM So Hard? by Stephen Kuenzli
  • 42.
    An error occurred(AccessDenied) when calling the ListBuckets operation: Access Denied (tweak the IAM policy) An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied (tweak the IAM policy) An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied And frustrating. It’s just “Access Denied” over and over and over again.
  • 43.
    The inevitable result:“F*ck it, we’ll do it live!” and you make everyone an admin.
  • 45.
    Problems with everyoneis an admin: 1. Weak security Huge blast radius from any mistake. Any compromised credentials may result in a severe security incident. Any guard rails you put in place are ineffective. 2. Sprawl Tons of new accounts and resources spun up and no one knows what they are for. 3. No consistency Everything is configured differently: logging, networking, security controls, etc. 4. Difficult to fix it If everyone is an admin, very hard to “undo” the damage: you don’t know what they’ve done and you’re never 100% confident you’ve reined things in.
  • 46.
    “Attempting to get allthe AWS accounts under control” Jacques-Louis David Oil on canvas, 1799
  • 47.
  • 48.
    Set up yourLanding Zone as early as possible
  • 49.
    landing zone noun /ˈlændɪŋzəʊn/ A streamlined way to create new accounts in your cloud provider that are configured out-of-the-box with best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
  • 50.
    Key ingredients ofa Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 51.
    Key ingredients ofa Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 52.
    account structure noun /əˈkaʊntˈstrʌktʃə(r) / How to configure multiple inter-connected accounts in the cloud to provide isolation, compartmentalization, authentication, authorization, auditing, and reporting.
  • 53.
    Each cloud recommendsdifferent account structures
  • 54.
    Key ingredients ofa Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 55.
    account baseline noun /əˈkaʊntˈbeɪslaɪn/ The basic set of controls installed in every account to enforce a common set of best practices (e.g., authentication, authorization, logging, monitoring, tagging, guard rails, etc.).
  • 56.
    Description Examples Authentication Useridentity, login, MFA IAM users & roles, SSO, IdPs Authorization User permissions and access IAM policies & groups, ACLs, RBAC Monitoring Audit logging, app logging, metrics CloudTrail, Elastic stack, Grafana Networking IPs, routing, DNS, connectivity VPCs, NAT, Route 53, VPN, SSH, RDP Hardening Network hardening, intrusion detection WAF, IPS, Squid Proxy, GuardDuty Guard rails Limit what actions can be taken IAM policies, SCPs, OPA, AWS Config Compliance Enforce compliance requirements SOC2, ISO 27001, CIS, PCI, HIPAA Ownership Associate accounts & resources with teams Tagging, billing Account baselines should handle:
  • 57.
    module "account_baseline" { source= "github.com/gruntwork-io/account-baseline" enable_cloudtrail = true enable_aws_config = true enable_guard_duty = true child_accounts = { dev = "[email protected]" stage = "[email protected]" prod = "[email protected]" } } Define your account baselines as code
  • 58.
    Key ingredients ofa Landing Zone: 1. Account structure 2. Account baselines 3. Account vending machine
  • 59.
    account vending machinenoun /əˈkaʊnt ˈvendɪŋ məˈʃiːn/ An official tool or process for spinning up new accounts which enforces each of those accounts is configured with the appropriate account baseline.
  • 60.
    Key ingredients foran account vending machine: 1. Self-service Teams should be able to spin up new accounts for themselves on-demand. 2. GitOps-driven Under the hood, manage accounts as code checked into version control. 3. Apply baselines The vending machine ensures the proper baseline is applied to every new account. 4. Provision access The vending machine not only creates accounts, but also grants teams access to them (e.g., via SSO).
  • 61.
    module "account_baseline" { source= "github.com/gruntwork-io/account-baseline" child_accounts = { dev = "[email protected]" stage = "[email protected]" prod = "[email protected]" # Add new account example = "[email protected]" } } Example vending machine: update a file, commit, CI / CD system deploys it
  • 62.
    Key idea #2: Set up your Landing Zone as early as you can.
  • 63.
    Outline:
    1. Do it by hand
    2. Do it live
    3. Do it on my machine
    4. Do it only on my machine
    5. Do it once
  • 64.
  • 65.
    Deployments are done by humans from their own computers
  • 66.
    Even with IaC, relying on a person to do deployments leads to problems
  • 67.
    Problems with a person deploying:
    1. Error prone
       Manual process = human error. E.g., fat-fingering a command, forgetting a step.
    2. Not reproducible
       E.g., wrong version installed locally, accidentally deploying uncommitted changes.
    3. Low bus factor
       Often only 1 or 2 devs can deploy. What if they go on vacation or leave the company?
    4. Race conditions
       Different devs accidentally deploy different code (e.g., different branches) = conflicts.
    5. Not secure
       Deploying arbitrary changes requires arbitrary (admin) permissions. We already know what happens when you give too many people admin permissions.
  • 68.
    “Realizing you just ran terraform destroy in prod.”
    Gustave Courbet, oil on canvas, 1845
  • 69.
  • 70.
    Do all deploys through a CI / CD pipeline
  • 71.
    Key CI / CD pipeline features:
    Feature | Description
    GitOps-driven | The pipeline is triggered by commits to version control
    Defined as code | The full workflow should be defined as code
    Automated tests | The pipeline should run pre-, during-, and post-deploy checks
    Preview environments | Deploy the changes in each PR into an ephemeral environment
    Promotion workflows | Promote immutable artifacts across environments: e.g., dev → stage → prod
    Approval workflows | For some types of changes, require human approval for deployment to prod
    Deployment workflows | Blue/green deploys, rolling deploys, canary deploys, feature toggles
    App and infra code | You need workflows for both application and infrastructure code
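The promotion and approval rows above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real pipeline: `promote`, `PROMOTION_ORDER`, `NEEDS_APPROVAL`, and the `checks_pass` callback are all hypothetical names. The point it shows is that the same immutable artifact moves through each environment in order, and that prod is gated on a human approval.

```python
from typing import Callable, List

PROMOTION_ORDER = ["dev", "stage", "prod"]  # order taken from the slide
NEEDS_APPROVAL = {"prod"}                   # assumption: only prod is gated


def promote(artifact: str,
            checks_pass: Callable[[str, str], bool],
            approved: bool) -> List[str]:
    """Deploy one immutable artifact through each environment in order.

    Stops before an approval-gated environment that has not been approved,
    and stops after the first environment whose post-deploy checks fail.
    Returns the environments that were actually deployed to.
    """
    deployed = []
    for env in PROMOTION_ORDER:
        if env in NEEDS_APPROVAL and not approved:
            break
        # The same artifact is deployed everywhere; nothing is rebuilt per env.
        deployed.append(env)
        if not checks_pass(artifact, env):
            break
    return deployed
```

For example, `promote("app:v1", lambda artifact, env: True, approved=False)` deploys to dev and stage and then waits for approval before prod.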
  • 72.
    The workflows for app & infra code are similar, but with key differences.
  • 73.
    Workflows for app & infra code:
    Run locally
      Application code: Run the code on localhost. Make a change, refresh.
      Infrastructure code: Run the code in the cloud (sandboxes). Make a change, redeploy (use stages!).
    Code review
      Application code: Submit pull request with code changes.
      Infrastructure code: Submit pull request with code changes.
    Test
      Application code: Static analysis: linter. Functional tests: unit, integration, e2e.
      Infrastructure code: Static analysis: linter, policy enforcement. Functional tests: plan, integration.
    Release
      Application code: Merge pull request. Build immutable, versioned artifact.
      Infrastructure code: Merge pull request. Create git tag.
    CI config
      Application code: CI server has limited permissions. CI server triggers K8S, ECS, EC2, etc.
      Infrastructure code: Isolated worker has admin permissions. CI server triggers isolated worker.
    Deploy
      Application code: Promote artifacts: e.g., dev → stage → prod. Rolling, blue/green, canary, feature flags.
      Infrastructure code: Promote tags: e.g., dev → stage → prod. Plan, approve, deploy, hope.
  • 74.
    Key idea #3: The CI / CD pipeline is the only thing that can deploy to prod.
  • 75.
    No one has write access to prod (let alone admin access) except the pipeline.
  • 76.
    Key idea #4: The CI / CD pipeline will only deploy vetted services from the Service Catalog to prod.
  • 77.
    The Catalog + Pipeline are the only path to prod: the API between Devs and Ops.
  • 78.
    Key idea #5: The CI / CD pipeline protects its permissions for prod.
  • 79.
    To deploy arbitrary infra changes, you need arbitrary (admin) permissions!

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "*",
          "Resource": "*"
        }
      ]
    }
  • 80.
    Giving your CI server direct access to admin permissions considered harmful.
  • 81.
    This is a BAD combination:
    1. Everyone in your company can access your CI server
    2. You use the CI server to execute arbitrary code
    3. The CI server has admin permissions
  • 82.
    Congratulations, everyone in your company has admin permissions again!
  • 83.
    And so do hackers outside your company!
    https://research.nccgroup.com/2022/01/13/10-real-world-stories-of-how-weve-compromised-ci-cd-pipelines/
  • 84.
    The solution: only give admin permissions to an isolated worker
  • 85.
    The isolated worker:
    1. Is highly locked down
       Unlike the CI server, no one at the company has direct access to the worker.
    2. Can only be triggered by the CI server
       The CI server only has permissions to trigger the worker via an API & stream logs from it.
    3. Exposes a limited, locked-down API
       The worker only allows you to run certain commands (e.g., terraform apply), in certain repos, in certain branches, in certain folders, etc.
    4. Minimizes the potential damage
       If an attacker gets access to your CI server, the worst they can do is trigger a deploy on your own code. They do NOT get admin permissions directly.
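The "limited, locked-down API" in point 3 boils down to an allow-list check that the worker runs before doing anything with its admin permissions. Here is a minimal sketch in Python. The names (`DeployRequest`, `is_allowed`) and every allow-list value, including the repo name, are hypothetical and only there for illustration; a real worker would load them from its own protected configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    """What the CI server is allowed to ask the worker for."""
    command: str
    repo: str
    branch: str
    folder: str


# Hypothetical allow-lists; none of these values come from the talk.
ALLOWED_COMMANDS = {"terraform plan", "terraform apply"}
ALLOWED_REPOS = {"example-org/infrastructure-live"}
ALLOWED_BRANCHES = {"main"}
ALLOWED_FOLDER_PREFIXES = ("dev/", "stage/", "prod/")


def is_allowed(request: DeployRequest) -> bool:
    """Accept a request only if every field is on its allow-list."""
    return (
        request.command in ALLOWED_COMMANDS
        and request.repo in ALLOWED_REPOS
        and request.branch in ALLOWED_BRANCHES
        and request.folder.startswith(ALLOWED_FOLDER_PREFIXES)
        # Reject path traversal out of the allowed folders.
        and ".." not in request.folder.split("/")
    )
```

Because the worker rejects anything off the list, a compromised CI server can only re-trigger deploys of code that already passed review on an allowed branch.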
  • 86.
    Outline:
    1. Do it by hand
    2. Do it live
    3. Do it on my machine
    4. Do it only on my machine
    5. Do it once
  • 87.
  • 88.
    Only Ops is allowed to deploy
  • 89.
    The Ops team, trying to protect the company, acts as a gatekeeper.
  • 90.
    But that usually backfires:
  • 91.
    Inevitably, the Ops team is overwhelmed and becomes a bottleneck
  • 92.
    So the Dev team finds a workaround…
  • 94.
    So Ops adds more process… but that just makes things even more backed up.
  • 95.
    “The Ops team explains the new 95-step change request process to the Dev team.”
    Ferdinand Pauwels, oil on canvas, 1872
  • 96.
  • 97.
  • 98.
    Key idea #6: Any team can deploy their own infra + apps from the Service Catalog
  • 99.
    The cloud is primarily a tool for Devs, not Ops.
  • 100.
    One of the biggest benefits of the cloud: Devs can be more self-sufficient.
  • 101.
    Ops team as a gatekeeper: Devs aren’t self-sufficient, go slow.
  • 102.
    Ops team as an enabler: Devs are self-sufficient, go fast.
  • 103.
    Enable self-service safely via the Catalog + Pipeline: your API on top of the cloud.
  • 104.
    Devs should have sandbox accounts for easy testing, learning, etc.
  • 105.
    Run cleanup tools in cron jobs to remove old resources in sandbox accounts:
    Tool | Clouds | Features
    cloud-nuke | AWS | Delete all resources older than a certain date; in a certain region; of a certain type.
    safe-scrub | Google Cloud | Safely delete unwanted resources in a GCP project
    Azure PowerShell | Azure | Includes native commands to delete Resource Groups
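The core rule these cleanup tools apply ("delete resources older than a certain date, of a certain type") is simple to sketch. The Python below is not the API of cloud-nuke or any tool in the table: `select_expired` and the resource dictionary shape (`id`, `type`, `created_at`) are assumptions made for illustration. It only selects what would be deleted, which is also a sensible dry-run step before a real cron job removes anything.

```python
from datetime import datetime, timedelta, timezone


def select_expired(resources, max_age, now, resource_type=None):
    """Return the IDs of resources created before `now - max_age`.

    `resources` is a list of dicts with "id", "type", and "created_at"
    (a timezone-aware datetime). If `resource_type` is given, only
    resources of that type are considered.
    """
    cutoff = now - max_age
    expired = []
    for resource in resources:
        if resource_type is not None and resource["type"] != resource_type:
            continue
        if resource["created_at"] < cutoff:
            expired.append(resource["id"])
    return expired
```

A nightly job in a sandbox account might call this with `max_age=timedelta(days=2)` and `now=datetime.now(timezone.utc)`, log the result, and only then delete.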
  • 106.
    In prod, Devs deploy via self-service with the Service Catalog + CI / CD Pipeline.
  • 107.
    Key self-service features:
    1. GitOps-driven
       Everything is managed as code and driven by commits to version control. Allows code review, testing, audit log, versioning, etc.
    2. UI-driven (optional)
       A web UI as a layer on top of the GitOps layer to make it more accessible.
    3. Focus on common use cases
       E.g., account vending machine, data store deployment, app deployment. You don’t have to solve everything right away.
    4. Access controls
       Different teams can access/deploy different things. E.g., the NetOps team might be able to deploy networking, whereas app teams can deploy orchestration tools and data stores.
  • 108.
    Example of self-service: update a file, commit, CI / CD system deploys it.

    module "account_baseline" {
      source = "github.com/gruntwork-io/account-baseline"

      child_accounts = {
        dev   = "[email protected]"
        stage = "[email protected]"
        prod  = "[email protected]"

        # Add new account
        example = "[email protected]"
      }
    }
  • 109.
    Key idea #7: Any team can contribute to the Service Catalog.
  • 110.
    Modern software involves many moving pieces
    [Diagram: stage and prod environments]
  • 111.
    If only Ops can add those pieces to the Service Catalog, that’ll be a bottleneck
  • 112.
    Instead, allow everyone to contribute and enforce company requirements through code reviews and automated tests.

    Automated tests:
    ✓ tflint
    ✓ tfsec
    ✓ OPA
    ✓ steampipe
    ✓ checkov
    ✓ Terratest
    Passed: 6. Failed: 0. Skipped: 0.
    Test run successful.
  • 113.
    Outline:
    1. Do it by hand
    2. Do it live
    3. Do it on my machine
    4. Do it only on my machine
    5. Do it once
  • 114.
  • 115.
    Not taking into account ongoing maintenance work
  • 116.
    Not only are there many moving pieces, but they’re also all constantly changing.
    [Diagram: stage and prod environments]
  • 117.
    AWS is constantly changing
    [Article headlines shown: “The last S3 security document that we’ll ever need, and how to use it”; “How To Keep Up With AWS Announcements”]
  • 118.
  • 119.
  • 120.
  • 121.
    Many companies assume that the initial cloud deployment is the hard part.
  • 122.
  • 123.
    “Software maintenance cost is increasingly growing and estimates showed that about 90% of software life cost is related to its maintenance phase.”
    Which Factors Affect Software Projects Maintenance Cost More?, Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi
  • 124.
    If you don’t have a plan for maintenance, all that code you wrote will rot.
  • 125.
    “Coming back to that Terraform codebase after 6 months.”
    Eero Järnefelt, oil on canvas, 1893
  • 126.
  • 127.
  • 128.
    Key auto-update features:
    1. Automation-driven
       Updates are discovered and the code is updated automatically, with no reliance on a human remembering to do it. Update cadence should be configurable.
    2. GitOps-driven
       The code is updated via automated pull requests.
    3. Automated testing
       You must have automated tests in place and running against each pull request to let you know if the updated code still works.
    4. Automated deployment
       Once a pull request is merged, it must deploy automatically via the CI / CD pipeline, promoting the update across environments: e.g., dev → stage → prod.
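The first two features above (discover updates automatically, then open pull requests) can be sketched as a small planning step. This Python is an illustration only: `parse_version`, `plan_update_prs`, the module names, and the assumption that versions look like `v1.2.3` are all hypothetical. It compares the versions pinned in your code against the latest released versions and describes the PRs an update bot would open.

```python
def parse_version(version: str) -> tuple:
    """Turn "v1.2.3" into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def plan_update_prs(pinned: dict, latest: dict) -> list:
    """Describe one pull request per module that has a newer release.

    `pinned` maps module name to the version used in the code today;
    `latest` maps module name to the newest released version.
    """
    prs = []
    for module in sorted(pinned):
        current = pinned[module]
        newest = latest.get(module)
        if newest and parse_version(newest) > parse_version(current):
            prs.append({
                "module": module,
                "from": current,
                "to": newest,
                "title": f"Update {module} from {current} to {newest}",
            })
    return prs
```

Comparing parsed tuples rather than raw strings matters: as strings, "v0.9.0" sorts after "v0.10.1", so a naive comparison would miss that update. Each planned PR would then run through the automated tests before it is merged and promoted.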
  • 129.
    Key idea #8: Updates are pushed to the code via PRs, automatically.
  • 130.
    Key idea #9: Code without automated tests will rot.
  • 131.
    How to do automated testing for infrastructure code
    https://terratest.gruntwork.io/docs/getting-started/introduction/#watch-how-to-test-infrastructure-code
  • 132.
    Outline:
    1. Do it by hand
    2. Do it live
    3. Do it on my machine
    4. Do it only on my machine
    5. Do it once
  • 133.
  • 134.
    Key ideas:
    1. Manage everything as code in a Service Catalog.
    2. Set up your Landing Zone as early as you can.
    3. Only the CI / CD Pipeline can deploy to prod.
    4. The CI / CD Pipeline only deploys from the Service Catalog.
    5. The CI / CD Pipeline protects its admin permissions.
    6. Any team can deploy infra + apps from the Service Catalog.
    7. Any team can contribute to the Service Catalog.
    8. Updates are pushed to the code via PRs, automatically.
    9. Code without automated tests will rot.
  • 135.
    5 cloud adoption fails and solutions:
    Fail | Description | Solution
    Do it by hand | ClickOps | Service Catalog
    Do it live | Everyone is an admin | Landing Zone
    Do it on my machine | People deploying from their computers | CI / CD Pipeline
    Do it only on my machine | Only Ops can deploy | Self-Service
    Do it once | Not taking maintenance into account | Automatic Updates
  • 136.
    The 5 solutions are part of the Gruntwork Production Framework
    https://docs.gruntwork.io/guides/production-framework/
  • 137.
    If you use this framework, here’s the experience for your Ops team:
  • 138.
    Step 1: Create a Service Catalog
    Everything defined as code. Works for app + infra. You could build from scratch or on top of an existing one (e.g., Gruntwork Service Catalog).
  • 139.
    Step 2: Set up your Landing Zone
    Set up your basic account structure, define account baselines, etc.
  • 140.
    Step 3: Set up a CI / CD pipeline
    Ensure it’s the only way to deploy to prod. Make it work for apps + infra.
  • 141.
    Step 4: Provide self-service
    Enable all teams to deploy. Start with a GitOps solution. Add a UI later.
  • 142.
    Step 5: Set up automatic updates
    PRs opened automatically. Automated tests in place for app + infra code.
  • 143.
    And here’s the experience for your Dev team:
  • 144.
    Step 1: Scaffold a new app
    Leverage vetted application templates from the Service Catalog and the logic built in: e.g., service discovery, packaging, monitoring, testing, etc.
  • 145.
    Step 2: Deploy infrastructure
    Leverage Self-Service + Service Catalog + CI / CD Pipeline.
  • 146.
    Step 3: Iterate on the app
    Leverage CI / CD built into the templates to deploy subsequent changes.
  • 147.
    Step 4: Debug issues
    Leverage monitoring, logging, alerting, etc. built into the templates.
  • 148.
    Step 5: Stay up to date
    Leverage the auto-update built into the templates. Automated PRs + tests.
  • 149.
  • 150.