20171122 aws usergrp_coretech-spn-cicd-aws-v01

SPN CI/CD journey on AWS
SPN Infra., CoreTech
Scott Miao
11/22/2017

Who am I
• Scott Miao
• RD, SPN Infra., TrendMicro
• OOAD system dev. 10+ years
• Hadoop ecosystem 6 years
• AWS for BigData 4 years
• @linkedIn
• @slideshare

Agenda
• Original services delivery process in SPN
• Dev/Ops
– DevOps goals V.S. our original way
• CI/CD on AWS
• An example service CI/CD on AWS
• DevOps goals V.S. our original way V.S. CI/CD on AWS
• Lessons learned

Original services delivery process in SPN

Figure: delivery flow through the Release portal to Stg. and PROD
Roles: Developers, Operators, Infra. admin
Teams: Service team, Operation team, DCS team
1. Dev, utests,…
2. Source Repo
3. Back and forth
4. Trigger CI
5. Devices spec. for both Stg/PROD
6.1 Monitoring scripts
6.2 Puppet scripts
6.3 Operation guides
7. Trigger release build
8. Release artifacts
9. Stg resources ready
10. Deploy service & scripts
11. Deploy and monitor
12.1 Itests
12.2 Stress tests
12.3 UAT
13. Release artifacts
14. PROD resources ready
15. 16. (no labels in the diagram)
17. PROD release

DevOps is not a new technology or a
product. It’s an approach or culture of
software development that seeks stability
and performance at the same time that it
speeds software deliveries to the business.
── Andi Mann, CA Technology ──
Cited from: Derek Chen, RD, TrendMicro
https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#15

Software Delivery
Figure: delivery stages (Plan, Code, Build, Test, Release, Deploy, Operate, Monitor)
and the practices that span them: Agile Development, Continuous Integration,
Continuous Delivery, Continuous Deployment, DevOps
Cited from: Derek Chen, RD, TrendMicro
https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#23

DevOps goals V.S. our original way
• Faster time to market
– So many steps that some are easily missed
– Service team needs to follow up on the steps themselves
– Some steps need lead time (machine resources, etc.)
• Lower failure rate of new releases
– Manual steps lead to errors
• Shorten lead time between fixes
– Rolling upgrade
– Invasive
• Faster mean time to recovery
– Hard to deal with machine errors and peak load
https://en.wikipedia.org/wiki/DevOps#Goals

“Very often, automation supports
this objective”
https://en.wikipedia.org/wiki/DevOps#Goals
Quoted from Wikipedia for DevOps goals

CI/CD on AWS
TO ACHIEVE THE SAME DEVOPS GOALS
DEVOPS FOCUSES ON ORGANIZATIONAL CHANGES
CI/CD FOCUSES ON TECHNICAL IMPLEMENTATIONS

Review for CI and CD
• Continuous Integration
– is the practice of merging all developer working
copies to a shared mainline (trunk) several times
a day
• Continuous Delivery
– produce software in short cycles, ensuring that
the software can be reliably released at any time
• Continuous Deployment
– means that every change is automatically
deployed to production
https://en.wikipedia.org/wiki/Continuous_integration
https://en.wikipedia.org/wiki/Continuous_delivery

Characteristics of Cloud Computing
• On-demand self-service
– A consumer can unilaterally provision computing capabilities
• Broad network access
– Capabilities are available over the network and accessed
through standard mechanisms
• Resource pooling
– The provider's computing resources are pooled to serve
multiple consumers using a multi-tenant model
• Rapid elasticity
– Capabilities can be elastically provisioned and released
• Measured service
– Cloud systems automatically control and optimize resource use
http://www.inforisktoday.com/5-essential-characteristics-cloud-computing-a-4189
https://en.wikipedia.org/wiki/Infrastructure_as_Code

Figure: DevOps, CI/CD, Automation, Cloud Computing (AWS)

AWS managed services SPN used
• AWS CloudFormation
– Gives developers and systems administrators an easy
way to create and manage a collection of related
AWS resources
– We use it to provision our service components
• Such as Load balancer (ALB), machines (EC2)
• AWS OpsWorks
– A configuration management service that uses Chef,
an automation platform that treats server
configurations as code
– We use it to deploy, configure and start up our
service components
https://aws.amazon.com/cloudformation/
https://aws.amazon.com/opsworks/
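To make "provision our service components" concrete, here is a minimal sketch of a CloudFormation template expressed as a Python dict. It is not the SPN template: the logical ID ServiceAlb and the parameter name SubnetIds are hypothetical, and only one resource (an internal ALB) is declared.

```python
import json


def build_alb_template():
    """Return a minimal CloudFormation template that declares one internal ALB."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative ALB-only template (not the SPN template)",
        "Parameters": {
            # Subnets are passed in because the VPC is managed outside this template.
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "ServiceAlb": {  # hypothetical logical ID
                "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
                "Properties": {
                    "Type": "application",
                    "Scheme": "internal",
                    "Subnets": {"Ref": "SubnetIds"},
                },
            }
        },
    }


if __name__ == "__main__":
    # The JSON text is what would be uploaded as a template file.
    print(json.dumps(build_alb_template(), indent=2))
```

Keeping the template as data makes it easy to generate per-environment variants before uploading them.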

AWS CloudFormation + OpsWorks
Figure: provisioning flow
Components: user; AWS S3 (CF templates: main, IAM, ELB, OpsWorks; artifacts; Chef recipes);
AWS CloudFormation (stacks: main, IAM, ALB, OpsWorks); AWS OpsWorks; AWS VPC
Legend: User, CF, Ops
1. Put CF templates
2. Put artifacts
3. Put Chef recipes
4. Create CF W/ params, VPC ID, etc
5. Templates input
6. Create CF stacks
7. Provision AWS resources
8. Create OpsWorks
9. Artifacts/recipes input
10. Deploy/Config/start up service
Ready to serve
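Step 4 ("Create CF W/ params, VPC ID, etc") could be sketched as below. The function only builds the request in the shape CloudFormation's CreateStack call expects; the template URL and the parameter names VpcId and ArtifactVersion are hypothetical, not taken from the SPN templates.

```python
def build_create_stack_request(stack_name, template_url, params):
    """Build keyword arguments in the shape CloudFormation's CreateStack expects."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(params.items())
        ],
        # Required when a template creates IAM resources, as an IAM stack would.
        "Capabilities": ["CAPABILITY_IAM"],
    }


request = build_create_stack_request(
    stack_name="ae-dev-myAE",
    template_url="https://s3.amazonaws.com/example-bucket/main.template",  # hypothetical
    params={"VpcId": "vpc-00000000", "ArtifactVersion": "1.19.AE_100"},  # hypothetical
)
print(request["StackName"], len(request["Parameters"]))
```

With boto3 installed and credentials configured, the same dict could be passed on as boto3.client("cloudformation").create_stack(**request).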

CoreTech DCS managed services
• Enterprise GitHub
– Works just like the GitHub we use on the Internet
• CloudCI – Enterprise CircleCI
– A Docker-container-based CI solution
– Seamlessly integrated with GitHub
• JFrog Artifactory
– A CoreTech-wide shared artifact repo.

An example service CI/CD on AWS
ANALYTIC ENGINE

Analytic Engine is an API service for…
Common Big Data computation
service on Cloud (AWS)
https://www.slideshare.net/takeshi_miao/analytic-engine-a-common-big-data-computation-service-on-the-aws

AE High Level Architecture Design
Figure: AE deployed in Oregon (us-west-2) across multiple AZs (AZa, AZb, AZc)
• Components: Private ALB, AE API servers, workers, Eureka, Auto Scaling groups,
RDS, EMR, cross-account S3 buckets, Amazon SNS
• External parties: IDC, other services, Cloud Storage, Splunk
• Connection labels: VPN, peering, HTTPS, HTTPS/HTTP Basic, isValidUser, CS output
This is what we really care about
Figure: the same AE architecture diagram as on the previous slide

What components in CI/CD scope
• In scope
– API, Worker, Eureka, Genie W/ auto-scaling group
• EC2, deploy, configure and startup component services
– AWS Elastic Application Load Balancer
– AWS Simple Notification Service
• NOT in scope
– VPC/subnets/VPC peerings
• We use fixed VPC and subnets for both VPN connections and VPC
peerings
– RDS MySQL DB
• Already pre-created
– EMR clusters
• Created by user API calls via the AWS Java SDK

CI/CD Usecases
1. Developer edits code and pushes it to GitHub
2. Developer deploys AE to Dev env. for tests
3. Developer terminates AE in Dev env. after tests
4. Developer deploys AE to Stg env. for integration
tests/UAT
5. Developer deploys AE to PROD env.
6. Developer patches hotfixes and deploys to
PROD
7. Monitor your service components

1. Developer edits code and pushes it to GitHub
Figure: build flow
Components: Developers; Repo: spn/ae-saas (master, AE-100); Project: spn/ae-saas (1.19.0);
S3: dev-us-east-1 (CF templates, ae-1.19.AE_100.jars, Chef recipes); mvn repo.
1. Push AE-100 branch
2. Trigger CI
3. Build
4. Utests
5. Package
6. cp artifacts to S3
7. cp to S3
8. Publish artifacts to mvn repo.
9. Publish artifacts to mvn repo.
Feature branch workflow
https://www.atlassian.com/git/tutorials/comparing-workflows
Every commit will trigger this build
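The artifact names on the slides (ae-1.19.AE_100.jars for branch AE-100, 1.19.<buildNum> for master) suggest a naming rule like the one sketched below. The rule is inferred from those examples only; the function names, the "ae" prefix default and the ".jar" suffix are assumptions.

```python
import re


def artifact_version(base_version, branch, build_number):
    """Return the version string used in artifact names for a given branch."""
    if branch == "master":
        # Mainline builds are numbered, e.g. 1.19.563.
        return f"{base_version}.{build_number}"
    # Feature branches embed the branch name; anything that is not a letter
    # or digit (such as '-') becomes '_', e.g. AE-100 -> AE_100.
    safe_branch = re.sub(r"[^A-Za-z0-9]", "_", branch)
    return f"{base_version}.{safe_branch}"


def artifact_name(base_version, branch, build_number, prefix="ae"):
    """Return a jar file name such as ae-1.19.AE_100.jar."""
    return f"{prefix}-{artifact_version(base_version, branch, build_number)}.jar"


print(artifact_name("1.19", "AE-100", 12))
print(artifact_name("1.19", "master", 563))
```

Replacing '-' in the branch name keeps the version free of the separator that the deployment tags use between their fields.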

2. Developer deploys AE to Dev env. for tests
Figure: deploy-to-Dev flow
Components: Developers; GitHub (master, AE-100); CI (env. variables in CI);
S3: dev-us-east-1 (CF templates, ae-1.19.AE_100.jars, Chef recipes); AWS CF; Dev VPC
1. Git tag: c-1.19.AE_100-dev-us-east-1-myAE
2. Push tag
3. Trigger CI
4. Create CF
5. CF creating for stack: ae-dev-myAE
5.1 Templates input
6. Provision resources
7. Deploy/config/start up service
Ready for tests
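The Git tags on the slides (c-1.19.AE_100-dev-us-east-1-myAE, d-…, u-…) appear to follow the pattern <action>-<version>-<env>-<region>-<name>, and the stack names follow ae-<env>-<name>. A CI job could decode such a tag roughly as below. The pattern is inferred from the examples only, and the names DeployTag and parse_deploy_tag are hypothetical.

```python
import re
from typing import NamedTuple

# c = create, d = delete, u = update, as used in the use cases on the slides.
ACTIONS = {"c": "create", "d": "delete", "u": "update"}

TAG_PATTERN = re.compile(
    r"^(?P<action>[cdu])-"
    r"(?P<version>\d+\.\d+\.\w+)-"
    r"(?P<env>dev|stg|prod)-"
    r"(?P<region>[a-z]{2}-[a-z]+-\d)-"
    r"(?P<name>[A-Za-z0-9]+)$"
)


class DeployTag(NamedTuple):
    action: str
    version: str
    env: str
    region: str
    name: str

    @property
    def stack_name(self):
        # Matches the stack names on the slides, e.g. ae-dev-myAE.
        return f"ae-{self.env}-{self.name}"


def parse_deploy_tag(tag):
    """Split a deployment tag into its fields, or raise ValueError."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a deployment tag: {tag!r}")
    fields = match.groupdict()
    fields["action"] = ACTIONS[fields["action"]]
    return DeployTag(**fields)


tag = parse_deploy_tag("c-1.19.AE_100-dev-us-east-1-myAE")
print(tag.action, tag.version, tag.region, tag.stack_name)
```

Ordinary release tags that do not match the pattern raise ValueError, so the CI job can ignore them instead of touching any stack.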

3. Developer terminates AE in Dev env. after tests
Figure: terminate-in-Dev flow
Components: Developers; GitHub (master); CI; AWS CF; Dev VPC
1. Git tag: d-1.19.AE_100-dev-us-east-1-myAE
2. Push tag
3. Trigger CI
4. Delete CF
5. CF deleting for stack: ae-dev-myAE
6. Terminating resources

4. Developer deploys AE to Stg env. for integration tests/UAT (Much like UC#2)
Figure: deploy-to-Stg flow
Components: Developers; GitHub (master, AE-100, 1.19.563); CI (env. variables in CI);
S3: dev-us-east-1 and S3: stg-us-east-1 (CF templates, ae-1.19.563.jars, Chef recipes);
AWS CF; Dev VPC
1. Merge feature branch: 1.19.<buildNum>
2. Git tag: c-1.19.563-stg-us-east-1-myAE
3. Push tag
4. Trigger CI
5. cp artifacts to stg S3
6. cp artifacts from dev to stg
6.1 copying
7. Create CF
8. Provision resources for stack: ae-stg-myAE
8.1 Deploy/config/start up service
9. Run itests on service
Ready for tests
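The "cp artifacts from dev to stg" step (and the later stg to prod copy in the hotfix use case) amounts to promoting the same artifacts one environment at a time. A minimal sketch of that rule follows; the bucket names are the S3 labels from the slides, while the key layout and the function name are assumptions.

```python
# S3 labels as they appear on the slides; treating them as bucket names is an assumption.
BUCKETS = {
    "dev": "dev-us-east-1",
    "stg": "stg-us-east-1",
    "prod": "prod-us-west-2",
}
PROMOTION_ORDER = ["dev", "stg", "prod"]


def promotion_plan(version, target_env, prefix="ae"):
    """Return (source, destination) S3 URIs for promoting one artifact."""
    index = PROMOTION_ORDER.index(target_env)  # raises ValueError for unknown envs
    if index == 0:
        raise ValueError("dev artifacts are uploaded by CI, not promoted")
    source_env = PROMOTION_ORDER[index - 1]
    key = f"{prefix}-{version}.jar"  # hypothetical key layout
    return (
        f"s3://{BUCKETS[source_env]}/{key}",
        f"s3://{BUCKETS[target_env]}/{key}",
    )


print(promotion_plan("1.19.563", "stg"))
```

Copying the already-built artifact, rather than rebuilding it, means the bits tested in Stg are the bits that reach PROD.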

5. Developer deploys AE to PROD env. (Much like UC#4)
Git tag: c-1.19.563-prod-us-west-2-myAE

6. Developer patches hotfixes and deploys to PROD
(1/2)
Figure: hotfix-to-PROD flow
Components: Developers; GitHub (master, AE-105, 1.19.570); CI (env. variables in CI);
S3: stg-us-east-1 and S3: prod-us-west-2 (CF templates, ae-1.19.563.jars, Chef recipes);
AWS CF; Dev VPC
1. Git tag: u-1.19.570-prod-us-west-2-myAE
2. Push tag
3. Trigger CI
4. cp artifacts to prod S3
5. cp artifacts from stg to prod
5.1 copying
6. Update CF
7. Update CF stack: ae-prod-myAE
8.1 Re-deploy/config/start up service
Ready to serve

6. Developer patches hotfixes and deploys to PROD
(2/2)
• Updating W/O SLA impact
– ALB W/ AutoScalingReplacingUpdate configured in the
UpdatePolicy attribute of the Auto Scaling group
• Better and more flexible auto-scaling
– EC2 Auto-scaling group + OpsWorks
• Deploy across regions as early as possible
– Minor configuration diffs
• A successful deployment to us-east-1 does not guarantee success in other regions…
– AWS SDK default value is us-east-1
• You may forget to set the region in your code…
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
https://aws.amazon.com/tw/blogs/devops/auto-scaling-aws-opsworks-instances/
(Auto-healing really sucks)
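For reference, this is roughly where the AutoScalingReplacingUpdate setting sits in a CloudFormation resource. The sketch builds the resource as a Python dict; the function name and the referenced logical IDs are hypothetical, and the real SPN template is not shown on the slides.

```python
import json


def asg_with_replacing_update(min_size, max_size, launch_config_ref, subnet_param):
    """Return an Auto Scaling group resource that is replaced, not patched, on update."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "MinSize": str(min_size),
            "MaxSize": str(max_size),
            "LaunchConfigurationName": {"Ref": launch_config_ref},
            "VPCZoneIdentifier": {"Ref": subnet_param},
        },
        # UpdatePolicy is a sibling of Properties, not a property itself.
        "UpdatePolicy": {
            "AutoScalingReplacingUpdate": {"WillReplace": True}
        },
    }


resource = asg_with_replacing_update(2, 4, "ApiLaunchConfig", "SubnetIds")
print(json.dumps(resource, indent=2))
```

With WillReplace set, CloudFormation keeps the old group until the new one has been created, which is what lets an update proceed without taking the service down.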

7. Monitor your service components (1/2)
These are the practices we learned from other teams in Trend
• Visibility
– Operators can get timely system status anytime, anywhere
– Practice:
• CW metrics -> CW dashboard
• CloudWatchLog -> AWS Lambda -> Log management system
• Monitoring
– Operators can set a threshold on any metric to act as a monitor
– The monitor can then trigger corresponding actions to notify operators
– Practice:
• [App logs -> CW agent -> | custom] CW metrics -> CW Alarm
• Auto-Recovery
– The system can recover by itself whenever a component fails
– Practice:
• EC2 auto-scaling group + OpsWorks
• CW metrics -> CW Alarm -> AWS Lambda -> AWS SDK -> AWS OpsWorks|AWS EC2
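The "CloudWatchLog -> AWS Lambda -> Log management system" practice relies on CloudWatch Logs delivering log events to Lambda as a base64-encoded, gzip-compressed JSON payload under event["awslogs"]["data"]. A minimal handler sketch that decodes it is shown below; the forwarding step is left out because the slides do not describe the log management system's API, and the output record layout is an assumption.

```python
import base64
import gzip
import json


def decode_cloudwatch_logs_event(event):
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Flatten the delivered log events into one record per log line."""
    payload = decode_cloudwatch_logs_event(event)
    records = [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload["logEvents"]
    ]
    # Forwarding to the log management system would happen here (not shown).
    return records
```

The handler can be exercised locally by gzip-compressing and base64-encoding a small JSON payload with the same three keys.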

7. Monitor your service components (2/2)
A high level architecture design
Figure: monitoring pipeline in four stages (Input, Store, Process, Output)
• Input: App components, Managed Services
• Store: AWS CloudWatch (default metrics, custom metrics (CPU, mem, disk), CW metrics);
AWS CloudWatchLog (app logs to CWLog, metric filters)
• Process: CW Dashboard, CW Alarms, AWS Lambda
• Output: Visibility (CW Dashboard, Log management), Monitoring (AWS SNS, Pager),
Auto-recovery (AWS Lambda)
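The "CW metrics -> CW Alarms -> AWS SNS" path could be set up with a request like the one below. The function only builds keyword arguments in the shape CloudWatch's PutMetricAlarm call expects; the alarm name, namespace, metric name, topic ARN and the chosen period and threshold are all hypothetical.

```python
def build_alarm_request(alarm_name, namespace, metric_name, threshold, topic_arn, asg_name):
    """Build keyword arguments in the shape CloudWatch's PutMetricAlarm expects."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 300,  # seconds per evaluation window
        "EvaluationPeriods": 2,  # consecutive windows that must breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic fans the alarm out to a pager or a Lambda function.
        "AlarmActions": [topic_arn],
    }


request = build_alarm_request(
    alarm_name="ae-prod-myAE-mem-high",  # hypothetical
    namespace="Custom/AE",  # hypothetical custom-metric namespace
    metric_name="MemoryUtilization",  # hypothetical custom metric
    threshold=80,
    topic_arn="arn:aws:sns:us-west-2:123456789012:ae-alerts",  # hypothetical
    asg_name="ae-prod-myAE-api",  # hypothetical
)
print(request["AlarmName"], request["Threshold"])
```

With boto3 available, the dict could be passed on as boto3.client("cloudwatch").put_metric_alarm(**request).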

DevOps goals V.S. our original way V.S. CI/CD on AWS
• Faster time to market
– Original way: so many steps that some are easily missed; service team needs to
follow up themselves; some steps need lead time (machine resources, etc.)
– CI/CD: one-click delivery; only one role, "developer"; minutes of lead time for
resources
• Lower failure rate of new releases
– Original way: manual steps lead to errors
– CI/CD: full automation
• Shorten lead time between fixes
– Original way: rolling upgrade; invasive
– CI/CD: replacing/rolling upgrade deployment; non-invasive
• Faster mean time to recovery
– Original way: hard to deal with machine errors and peak load
– CI/CD: elasticity provided by the Cloud Computing platform
https://en.wikipedia.org/wiki/DevOps#Goals

Lessons learned
• Automate as much as you can
– CloudFormation + EC2 Auto-scaling group + OpsWorks
– AWS::CloudFormation::CustomResource can also come to the rescue
• Consider splitting your service CF template
– Service infra. (RDS, SNS, KMS key, etc)
• You do not update your infra. often
– Service instances (EC2, etc)
• We update our service instances very often
• Do not only think about first-time creation
– How do you update your services W/O impacting the SLA?
• Monitor ! Monitor !! Monitor !!!
• TEST ! TEST !! TEST !!!
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cfn-customresource.html
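One way to split a template into a rarely updated infra. stack and a frequently updated service stack is to export values from the first and import them in the second. The slides do not say which mechanism SPN uses, so the sketch below is only one possibility; the export name, logical IDs and the example alarm are hypothetical.

```python
import json

EXPORT_NAME = "ae-infra-NotificationTopicArn"  # hypothetical export name


def infra_template():
    """Long-lived resources: created once, rarely updated."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {"NotificationTopic": {"Type": "AWS::SNS::Topic"}},
        "Outputs": {
            "NotificationTopicArn": {
                # Ref on an SNS topic returns its ARN.
                "Value": {"Ref": "NotificationTopic"},
                "Export": {"Name": EXPORT_NAME},
            }
        },
    }


def service_template():
    """Frequently updated resources that consume the infra. stack's exports."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "CpuHighAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Statistic": "Average",
                    "Period": 300,
                    "EvaluationPeriods": 2,
                    "Threshold": 80,
                    "ComparisonOperator": "GreaterThanThreshold",
                    # Imported from the infra. stack instead of being redefined here.
                    "AlarmActions": [{"Fn::ImportValue": EXPORT_NAME}],
                },
            }
        },
    }


for template in (infra_template(), service_template()):
    print(json.dumps(template, indent=2))
```

The service stack can then be replaced or updated as often as needed without touching the topic, database or keys owned by the infra. stack.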

Different types of Auto-scaling group
Columns: Service, Auto Scaling Group, Features, Deploy
• OpsWorks, 24/7 (deploy: chef recipe)
– Manual creation/deletion
– Configure one instance for one AZ
• OpsWorks, time-based (deploy: chef recipe)
– Can specify time slot(s) in hour units, every day or on any day of the week
• OpsWorks, load-based (deploy: chef recipe)
– Can specify CPU/MEM/workload avg. based on an OPS layer
– Up: when to increase instances
– Down: when to decrease instances
– No max./min. # of instances setting
• EC2 (deploy: user-data)
– Can set max./min. # of instances
– Multi-AZ support

Auto Recovery based on Monit
• OpsWorks already uses Monit for Auto
Recovery
– Leverage Monit on EC2
– We already have practice with it on-premises
https://mmonit.com/monit/
Figure: API servers in AZ1 and AZ2 inside one Auto Scaling group
• Instance check by CloudWatch
• Process check by Monit
• No process – restart process
• Process health check failed – terminate EC2
• Terminate EC2 -> Auto Scaling group launches new EC2
11/22/2017 | Confidential | Copyright 2014 TrendMicro Inc.
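The recovery rules above reduce to a small decision table. The sketch below expresses it in Python purely for illustration; in practice the checks and actions would live in the Monit configuration, and the names used here are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART_PROCESS = "restart process"
    TERMINATE_INSTANCE = "terminate EC2"


def recovery_action(process_running, health_check_ok):
    """Pick the recovery step for one API server, following the rules on the slide."""
    if not process_running:
        # No process: a restart is the cheapest fix.
        return Action.RESTART_PROCESS
    if not health_check_ok:
        # Process is up but unhealthy: terminate the instance and let the
        # Auto Scaling group launch a replacement.
        return Action.TERMINATE_INSTANCE
    return Action.NONE


for state in [(False, False), (True, False), (True, True)]:
    print(state, recovery_action(*state).value)
```

Restarting first and terminating only on a failed health check keeps instance churn low while still guaranteeing that a broken node is eventually replaced.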

Little variances among AWS regions
• Impact
– The same automation scripts may not run successfully across regions, and
sometimes not even in the same region
• Issues (Service; Regions; Root cause)
– OpsWorks; same region, us-west-2; the accepted S3 URL spec. for the property
“Repository URL” changed from “https://s3.amazonaws.com” to
“https://s3-us-west-2.amazonaws.com”
– OpsWorks; us-west-2 V.S. us-east-1; still the “Repository URL” issue,
“https://s3-us-west-2.amazonaws.com” V.S. “https://s3.amazonaws.com”
– EC2; us-west-2 V.S. us-east-1; the EC2 FQDN spec. is different,
“ip-10-104-33-152.us-west-2.compute.internal” V.S. “ip-10-103-73-248.ec2.internal”
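Differences like these are easier to handle when they are isolated in one helper instead of being scattered through the scripts. The sketch below encodes only the two cases listed in the table; applying the same rule to regions other than us-east-1 and us-west-2 is an assumption, and the function names are hypothetical.

```python
def s3_repository_endpoint(region):
    """Return the S3 endpoint form observed for the OpsWorks Repository URL."""
    if region == "us-east-1":
        return "https://s3.amazonaws.com"
    # Observed for us-west-2 on the slides; other regions are assumed to match.
    return f"https://s3-{region}.amazonaws.com"


def internal_fqdn(private_ip, region):
    """Return the private DNS name EC2 assigns to an instance."""
    host = "ip-" + private_ip.replace(".", "-")
    if region == "us-east-1":
        return f"{host}.ec2.internal"
    return f"{host}.{region}.compute.internal"


print(s3_repository_endpoint("us-west-2"))
print(internal_fqdn("10.104.33.152", "us-west-2"))
print(internal_fqdn("10.103.73.248", "us-east-1"))
```

Passing the region explicitly everywhere also guards against the SDK silently falling back to its default region.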

OpsWorks V.S. image-based deployment
• OpsWorks deployment
– This is what we currently use
– It takes too long to launch a service component
• E.g. it takes about 10 mins to launch a Genie node
• Image-based deployment
– Theoretically, it should take very little time to
launch a service component
– More responsive to peak workloads
– AMI (Amazon Machine Images) V.S. Docker images ?

How about API Gateway and ECS ?
• API Gateway
– Not a good fit because it is only accessible from the Internet
– Cold start
– RDB connection overflow
– CORS integration for web UI
• ECS
– Still need to run standby EC2 instances for peak…
– Only takes care of RESTful API services
– Kubernetes is more suitable for our use cases
