PHIL BASFORD – HEAD OF SOLUTION ENGINEERING
Tuesday, 8 October 2019
MACHINE LEARNING
AT SCALE
@philipbasford
AGENDA
Baseline:
➤ Machine Learning, Serverless, DevOps, Well-Architected
Endpoint Architecture:
➤ Reference architecture
➤ Endpoint components
➤ Docker & standard algorithms
➤ Implications of Using Docker
Experiment:
➤ Lambda and serverless-artillery
➤ Results + Charts
➤ Support email
➤ Instance types and Gunicorn
➤ Findings
➤ Auto scaling – HA / Fault Tolerance
Operational excellence:
➤ SageMaker CloudWatch metrics
➤ Custom CloudWatch metrics
➤ CloudWatch dashboard
➤ X-Ray – inclusion, sample rate, and setup
➤ X-Ray – Service Map, Traces, and Analytics
DevOps:
➤ Deployment Types
➤ AWS SageMaker with AWS CodePipeline, AWS Lambda
and AWS Step Functions
➤ SAM
Questions
SERVERLESS
AWS Lambda: AWS's native, fully managed cloud service for running application code without the need to run servers.
API Gateway: the endpoint for your API; it has extensive security measures, logging, and API definition using OpenAPI/Swagger.
DynamoDB: a fully managed NoSQL cloud service from AWS. In machine learning it is typically used for reference data.
S3: highly durable object storage used for many things, including data lakes. For machine learning it is used to store training data sets and model artefacts.
SNS: pub + sub
SQS: queues
Fargate: containers
Step Functions: workflows
...and more
WELL-ARCHITECTED
Operational Excellence: monitoring, observing and alerting using CloudWatch and X-Ray; Infrastructure as Code with SAM and CloudFormation.
Security: least privilege, data encryption at rest, and data encryption in transit using IAM policies, resource policies, KMS, Secrets Manager, VPCs and security groups.
Performance: elastic scaling based on demand and meeting response times using Auto Scaling, serverless, and per-request managed services.
Cost Optimisation: serverless and fully managed services to lower TCO; resource-tag everything possible for cost analysis; right-size instance types for model hosting.
Reliability: fault tolerance and auto healing to meet a target availability using Auto Scaling, Multi-AZ, multi-Region, read replicas and snapshots.
Serverless Machine Learning application using an XGBoost model for predictions
REFERENCE ARCHITECTURE
[Diagram: within the AWS Cloud (Inawisdom RAMP), users call a Web API fronted by Amazon API Gateway and backed by AWS Lambda, which invokes an Amazon SageMaker endpoint and looks up reference data in Amazon DynamoDB; a data lake on Amazon S3, maintained with AWS Glue, serves the data science team.]
Logical components of an endpoint within Amazon SageMaker
AMAZON SAGEMAKER – COMPONENTS
All components are immutable: any configuration change requires new models and endpoint configurations. However, there is a specific SageMaker API to update instance count and variant weight.
[Diagram: clients send SigV4-signed requests via SDKs or REST to a named Endpoint. The Endpoint references an Endpoint Configuration, which defines one or more Production Variants. Each Production Variant names a Model, an initial instance count and weight, and an instance type; each Model consists of a primary container plus optional further containers, together with its VPC, S3 and KMS + IAM settings.]
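Because the components are immutable, "updating" an endpoint in practice means pointing it at a newly created endpoint configuration. A minimal boto3 sketch (the endpoint and configuration names here are illustrative):

import boto3

sm_client = boto3.client('sagemaker')

# Repoint the endpoint at a freshly created endpoint configuration;
# SageMaker swaps the underlying instances for you.
sm_client.update_endpoint(
    EndpointName='MyEndpoint',
    EndpointConfigName='my-new-endpoint-config',
)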
Docker containers host the inference engines. Inference engines can be written in any language, and endpoints can use more than one container; the primary container needs to implement a simple REST API.
Common Engines:
➤ 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-cpu-py2
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.11-gpu-py2
➤ 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu
➤ 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow-serving:1.11-cpu
AMAZON SAGEMAKER – INFERENCE ENGINES
Dockerfile:
FROM tensorflow/serving:latest
RUN apt-get update && apt-get install -y --no-install-recommends nginx git
RUN mkdir -p /opt/ml/model
COPY nginx.conf /etc/nginx/nginx.conf
ENTRYPOINT service nginx start | tensorflow_model_server --rest_api_port=8501 --model_config_file=/opt/ml/model/models.config
[Diagram: Amazon SageMaker pulls model.tar.gz from S3 and links it into the container at /opt/ml/model. Within the primary container, Nginx sits in front of Gunicorn, which hosts the model runtime. SageMaker calls the container on http://localhost:8080/invocations and health-checks it on http://localhost:8080/ping; custom metadata is carried in the X-Amzn-SageMaker-Custom-Attributes header.]
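To make that REST contract concrete, here is a minimal sketch of what a primary container must serve, written with Flask. It is illustrative only: the real AWS images wire this through Nginx and Gunicorn as in the diagram, and my_model is a hypothetical model object.

from flask import Flask, Response, request

app = Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # Health check: SageMaker expects HTTP 200 once the container is ready
    return Response(status=200)

@app.route('/invocations', methods=['POST'])
def invocations():
    payload = request.get_data().decode('utf-8')
    result = my_model.predict(payload)  # hypothetical model object
    return Response(result, status=200, mimetype='text/csv')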
Using Docker immediately raises the following questions:
➤ How many Docker containers run on a single underlying EC2 instance?
➤ Is Kubernetes or ECS used? Do I have to become a Docker expert?
➤ How quickly are instances started and stopped?
➤ How do instances reside within the VPC and use network resources? For example, can the number of instances exhaust the network addresses of a VPC?
➤ How isolated are my models, given that Docker uses soft CPU and memory units?
➤ Will I suffer issues if containers are bin-packed or redistributed?
IMPLICATIONS OF USING DOCKER
To answer these questions, a series of experiments was carried out
THE EXPERIMENT
After VPC creation:
AZ          Available Addresses
EU-West-1a  4091
EU-West-1b  4091
EU-West-1c  4091

After notebook instance creation:
AZ          Available Addresses
EU-West-1a  4090
EU-West-1b  4091
EU-West-1c  4091

After endpoint creation:
AZ          Available Addresses
EU-West-1a  4090
EU-West-1b  4090
EU-West-1c  4090
Endpoint creation:
primary_container = {
    "Image": "685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1",
    "ModelDataUrl": "s3://mybucket/mymodel/output/model.tar.gz",
}
create_model_response = sm_client.create_model(
    ModelName='load-test',
    ExecutionRoleArn=role,
    PrimaryContainer=primary_container,
    VpcConfig={
        "SecurityGroupIds": [
            "My SecurityGroupId"
        ],
        "Subnets": [
            "Subnet Id 1b",
            "Subnet Id 1c"
        ]
    }
)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': 'ml.t2.medium',
        'InitialInstanceCount': 2,
        'InitialVariantWeight': 1,
        'ModelName': 'load-test',
        'VariantName': 'AllTraffic'
    }]
)
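The slide stops at the endpoint configuration. For completeness, the final call that actually provisions the instances is create_endpoint, sketched here with the same names as above:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName='load-test',
    EndpointConfigName=endpoint_config_name,
)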
THE EXPERIMENT (CONTINUED)
Lambda:
import json

import boto3

session = boto3.Session()
client = session.client('sagemaker-runtime')
ENDPOINT_NAME = 'XGBoostEndpoint'

def lambda_handler(event, context):
    file_name = 'test_point.csv'
    with open(file_name, 'r') as f:
        payload = f.read().strip()
    print(payload)
    response = client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        Body=payload,
        ContentType='text/csv'
    )
    return {
        'statusCode': 200,
        'body': json.dumps('Hello!')
    }
Artillery:
# You can find great documentation of the possibilities at:
# https://artillery.io/docs/
config:
  target: "https://???.execute-api.us-west-2.amazonaws.com/Dev"
  phases:
    - duration: 3600
      arrivalRate: 200
      rampTo: 600
      name: "warm up"
    - duration: 3600
      arrivalRate: 600
      rampTo: 800
      name: "Max load"
scenarios:
  - flow:
      - post:
          url: "/"
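For reference, a hedged sketch of how such a script is typically driven (serverless-artillery picks up script.yml from the working directory by default; check the docs above for the exact flags):

npm install -g artillery serverless-artillery
artillery run script.yml   # dry-run the script locally first
slsart deploy              # deploy the Lambda-based load generator
slsart invoke              # start the load test from Lambda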
The CPU usage in AWS CloudWatch for a load test run
RESULTS
At 13:20 we saw the start of a drop in the CPU usage, and at 13:40 it stopped at 100%. Why was this? From the load script I configured, we know that this is when serverless-artillery entered the 2nd phase of sustained load.
There was a slow ramp-up for the first 15 minutes until we hit around the 200% CPU usage mark. 200% CPU usage means we were using more than the capacity of a single endpoint instance.
We then saw a return to 200% CPU usage 10 minutes later. At 14:40 we saw a complete stop in load; this is when the serverless-artillery job completed.
Investigation into what happened at 13:20; luckily AWS SageMaker sends logs to AWS CloudWatch
RESULTS
➤ During the entire run there were only three log streams, and each one had an 'instance id' as an identifier
➤ There were 2 instances at the start of the run, and at the end we still had only two instances
➤ There were no errors in the logs, and the last entry was at 13:24. This would imply none of the instances crashed due to a fatal error
➤ Instance i-010adxx started logging at 13:38 but we only saw the CPU start to increase at 13:45, a gap of 7 minutes
➤ The output from AWS CloudWatch Logs Insights confirms that each instance was producing 100-150 entries per second
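A sketch of the kind of CloudWatch Logs Insights query behind that last point; the assumption here is the usual /aws/sagemaker/Endpoints/<endpoint-name> log group:

fields @timestamp, @logStream
| stats count(*) as entries by bin(1s), @logStream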
SUPPORT EMAIL
Inside the XGBoost Docker image and its implementation
XGBOOST INFERENCE ENGINE
"Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request" 2019 © http://docs.gunicorn.org/
The XGBoost inference engine is implemented using Gunicorn, which provides a lightweight REST API to the model located on each of the instances.
Five Gunicorn workers were created. This is very important: Python is effectively single-threaded per process, so you need to relate the number of workers to CPU cores.
The Docker image used was 685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1
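A quick check of the Gunicorn formula against the instance type used; ml.t2.medium has 2 vCPUs, which would explain the five workers observed:

import multiprocessing

# Gunicorn's suggested starting point: (2 x num_cores) + 1
workers = (2 * multiprocessing.cpu_count()) + 1
print(workers)  # 5 on a 2-vCPU instance such as ml.t2.medium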
The key findings from the experiment were:
FINDINGS
We have proven from the endpoint creation process and from building a model that an endpoint uses Docker containers on EC2 instances.
Load was spread evenly over the 2 instances, and each instance can serve a large number of inference requests per second.
Instance i-01be56xxx took 5 minutes to terminate and instance i-010adxx took 7 minutes to start. Such timings are indicative of the time it takes to start EC2 instances and of the configured cooldown times.
Each instance, when deployed into a VPC, uses a single IP address. This means that if a reasonable subnet range is configured then the number of IP addresses will not be exhausted.
We have proven that if an instance crashes or terminates it will be auto healed.
Instance scaling using Application Auto Scaling
AUTO SCALING
➤ Application Auto Scaling is supported, with the usual controls.
➤ 'InitialInstanceCount' becomes DesiredInstanceCount
➤ Models can be scaled on CPU, disk and memory utilisation
➤ More importantly, they can be scaled on SageMakerVariantInvocationsPerInstance
➤ Make sure MinCapacity is greater than 1 for fault tolerance and high availability.
"SageMakerAutoScalingTarget": {
"Type": "AWS::ApplicationAutoScaling::ScalableTarget",
"Properties": {
"MaxCapacity": 4,
"MinCapacity": 2,
"ScalableDimension":
"sagemaker:variant:DesiredInstanceCount",
"ServiceNamespace": "sagemaker",
"RoleARN": "arn:aws:iam::xxxxx:role/sagemakerrole",
"ResourceId": "endpoint/MyEndpoint/variant/AllTraffic”
}
},
"SageMakerAutoScalingPolicy": {
"Type" : "AWS::ApplicationAutoScaling::ScalingPolicy",
"DependsOn": “SageMakerAutoScalingTarget",
"Properties": {
"PolicyName": "SageMakerAutoScalingPolicy",
"ServiceNamespace": "sagemaker",
"ResourceId": "endpoint/MyEndpoint/variant/AllTraffic”
"ScaleInCooldown": 300,
"ScaleOutCooldown": 300,
"ScalableDimension":"sagemaker:variant:DesiredInstanceCount",
"PolicyType": "TargetTrackingScaling",
"TargetTrackingScalingPolicyConfiguration": {
"TargetValue": 100,
"PredefinedMetricSpecification":{
"PredefinedMetricType":
"SageMakerVariantInvocationsPerInstance"
}
}
}
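The same scalable target and policy can also be registered programmatically. A minimal boto3 sketch using the same names as the CloudFormation above:

import boto3

client = boto3.client('application-autoscaling')

client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/MyEndpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,
    MaxCapacity=4,
)

client.put_scaling_policy(
    PolicyName='SageMakerAutoScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/MyEndpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 100.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 300,
    },
)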
Amazon SageMaker exposes metrics to AWS CloudWatch
METRICS AND ALARMS
Used with the following metrics to provide a complete view:
➤ API Gateway 4XX and 5XX errors
➤ Lambda Latency
➤ Lambda Invocations
➤ Lambda Errors
Name                        Dimension     Statistic  Threshold  Time Period     Missing
Endpoint model latency      Milliseconds  Average    > 100      For 5 minutes   ignore
Endpoint model invocations  Count         Sum        > 10000    For 15 minutes  notBreaching
                                                     < 1000                     breaching
Endpoint disk usage         %             Average    > 90%      For 15 minutes  ignore
                                                     > 80%
Endpoint CPU usage          %             Average    > 90%      For 15 minutes  ignore
                                                     > 80%
Endpoint memory usage       %             Average    > 90%      For 15 minutes  ignore
                                                     > 80%
Endpoint 5XX errors         Count         Sum        > 10       For 5 minutes   notBreaching
Endpoint 4XX errors         Count         Sum        > 50       For 5 minutes
The metrics in AWS CloudWatch can then be used for alarms:
➤ Always pay attention to how to handle missing data
➤ Always test your alarms
➤ Look to level your alarms
➤ Make your alarms complement each other
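As an illustration, a hedged boto3 sketch of the first alarm in the table (the endpoint and variant names are carried over from the earlier slides; confirm the native unit of the ModelLatency metric before relying on the threshold):

import boto3

cw = boto3.client('cloudwatch')
cw.put_metric_alarm(
    AlarmName='endpoint-model-latency',
    Namespace='AWS/SageMaker',
    MetricName='ModelLatency',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'MyEndpoint'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'},
    ],
    Statistic='Average',
    Period=300,               # "For 5 minutes"
    EvaluationPeriods=1,
    Threshold=100.0,          # the slide's threshold; check the metric's unit
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='ignore',
)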
AWS SAGEMAKER WITH CLOUDFORMATION
ADDITIONAL CUSTOM METRICS
Usage Plan Metrics:
➤ AWS API GW has no Usage Plan or Access Key metric in AWS CloudWatch. Therefore we added it!
➤ Runs at 0100 UTC daily and publishes the previous 24 hours' usage, with dimensions 'Usage Plan Id' and 'Key Id', for RemainingQuota, UsedQuota and PercentageUsed
Availability Metrics:
➤ There is no way of recording complete end-to-end availability, therefore we added it!
➤ The end-to-end availability check performs a 'meaningful' health check, involving all elements of the solution.
➤ Runs every minute and publishes a simple count.
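A sketch of how the health checker might publish its count; the namespace and metric name are illustrative, not the ones used in the deck:

import boto3

cw = boto3.client('cloudwatch')
cw.put_metric_data(
    Namespace='Custom/Availability',      # hypothetical namespace
    MetricData=[{
        'MetricName': 'HealthCheckSuccess',
        'Value': 1,
        'Unit': 'Count',
    }],
)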
A dashboard in AWS CloudWatch provides complete oversight of the inference process
MONITORING
[Dashboard widgets: API error and success rates; API Gateway response times using percentiles; Lambda executions; availability recorded from the health checker; API usage data for the Usage Plan]
Complete production observations using AWS X-Ray, including all services involved in the inference process
PRODUCTION OBSERVATIONS
Using AWS X-Ray from AWS Lambda is very easy
INTEGRATING X-RAY
➤ Download the X-Ray SDK via requirements.txt
➤ Monkey-patch the AWS SDK and common SDKs
➤ Use annotations on your own functions, but be careful!
➤ Use sample rates so you always capture in production
# Tracing
from aws_xray_sdk.core import patch_all
patch_all()

from aws_xray_sdk.core import xray_recorder

@xray_recorder.capture("## predict")
def __predict(self, req):
    ...
    return a
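Sample rates can be supplied as local rules. A sketch assuming the SDK's local sampling-rule format:

from aws_xray_sdk.core import xray_recorder

# Trace the first request each second, then 1% of the remainder
local_rules = {
    "version": 2,
    "default": {"fixed_target": 1, "rate": 0.01},
    "rules": [],
}
xray_recorder.configure(sampling_rules=local_rules)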
For target response times below 200 milliseconds, X-Ray traces can help you spot bottlenecks and costly areas of the code.
X-RAY : TRACES
New this year, Analytics allows you to see the distribution of traces
X-RAY : ANALYTICS
The following are the four ways to deploy new versions of models in Amazon SageMaker
DEPLOYMENT TYPES
[Diagram: each strategy moves traffic between endpoint configurations and their variants, e.g. a weighted canary variant alongside a full variant, or an old and a new variant.]
Rolling: the default option. SageMaker will start new instances and then, once they are healthy, stop the old ones.
Canary: done using two variants in the endpoint configuration and performed over two CloudFormation updates.
Blue/Green: requires two CloudFormation stacks, then changing the endpoint name in the AWS Lambda using an environment variable.
Linear: uses two variants in the endpoint configuration, with an AWS Step Function and AWS Lambda calling the UpdateEndpointWeightsAndCapacities API.
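A sketch of the call a Linear deployment makes to shift traffic (the variant names and weights are illustrative):

import boto3

sm_client = boto3.client('sagemaker')
sm_client.update_endpoint_weights_and_capacities(
    EndpointName='MyEndpoint',
    DesiredWeightsAndCapacities=[
        {'VariantName': 'OldVariant', 'DesiredWeight': 0.8},
        {'VariantName': 'NewVariant', 'DesiredWeight': 0.2},
    ],
)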
AWS SageMaker with AWS CodePipeline, AWS Lambda and AWS Step Functions
CONTINUOUS DEPLOYMENT OF ML MODELS
Using Infrastructure as Code to define the application with AWS SAM
DEPLOYMENT
Defining applications with SAM allows you to use SAM Local. SAM Local allows rapid changes to be made locally to the inference process without the need to constantly upload Lambda functions to AWS.
SAM helps to package the source code of your AWS Lambda functions into zip files (including dependencies). It then uploads the zip files to S3 and deploys them to AWS Lambda.
Using stages in API GW and aliases for AWS Lambda functions is brilliant for release control. This can be automated using SAM's AutoPublishAlias feature.
The AWS Toolkit for Visual Studio Code allows you to use the Python debugger to connect to AWS Lambda functions running within SAM Local.
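A minimal SAM template fragment showing AutoPublishAlias; the resource name and the deployment preference are illustrative:

Resources:
  InferenceFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      Runtime: python3.7
      AutoPublishAlias: live          # publishes a new version and shifts the alias on deploy
      DeploymentPreference:
        Type: Canary10Percent5Minutes # 10% of traffic first, the rest after 5 minutes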
QUESTIONS
REFERENCES
Webinar:
➤ https://pages.awscloud.com/GLOBAL-PTNR-OE-IPC-AIML-Inawisdom-Oct-2019-reg-event.html
My blogs:
➤ https://www.inawisdom.com/machine-learning/amazon-sagemaker-endpoints-inference/
➤ https://www.inawisdom.com/machine-learning/machine-learning-performance-more-than-skin-deep/
Other:
➤ https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html
➤ https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data
020 3575 1337
info@inawisdom.com
Columba House,
Adastral Park, Martlesham Heath
Ipswich, Suffolk, IP5 3RE
www.inawisdom.com
@philipbasford
Editor's Notes
• #6 (Reference architecture): Above is our reference architecture for this blog. Here is a quick rundown of the components (each one could be the subject of its own blog): Amazon API Gateway exposes and manages the API and is our contract with the rest of an enterprise (or it could be consumed by a mobile client). Amazon DynamoDB stores any reference data. The Amazon SageMaker endpoint hosts the deployed ML model. AWS Lambda contains the application logic; it processes the API request, looks up the reference data, calls the model for a prediction and then formats the response. The target uptime for the reference architecture is 99.95% (the same as API Gateway), it is business critical, and a response time below 200ms at the 90th percentile is required. Everything is Multi-AZ.
• #10 (The experiment, setup): Mention why t3. I created a dedicated VPC in EU-West-1 with three /20 subnets (4091 addresses), one located in each of the three AZs. I deployed a notebook instance of Amazon SageMaker into the VPC in AZ 1a, downloaded an example notebook for XGBoost, and trained a model inside the notebook instance to use AWS SageMaker's BYOM. I changed the model configuration to use the VPC and changed the endpoint configuration to use more than one instance. I noted the addresses available in the subnets, then ran the "Create endpoint" method and waited until the endpoint was in service. Once in service, I again noted the addresses available in the VPC.
• #11 (The experiment, load): I then created a new AWS Lambda function that takes the body of an HTTP POST request and forwards it to the AWS SageMaker endpoint URL. Next I created an AWS API Gateway instance and used an API key for authentication. Within the AWS API Gateway I created a resource integrated with the AWS Lambda function. I downloaded, installed and configured serverless-artillery to hit my API Gateway. I started the load test from serverless-artillery and, every 15 minutes for 2 hours, recorded the available addresses in the VPC and the metrics from CloudWatch.
• #13 (Results): During the entire run there were only three log streams, and each one had an 'instance id' as an identifier. This, combined with the fact that we only ever saw two IP addresses in use in the subnets, means there were only ever three instances. There were 2 instances at the start of the run and at the end we still had only two instances. However, in between, at 13:24 one of them stopped making log entries, and at 13:38, just before 13:40 (the 100% CPU marker), we can see in CloudWatch that a new instance starts to record log entries. There are no errors in the logs and the last entry was at 13:24; this would imply the instance did not crash due to a fatal error. Instance i-010adxx started logging at 13:38 but we only saw the CPU start to increase at 13:45, i.e. a gap of 7 minutes to start up, which is in line with the default start-up of 5 minutes. The output from AWS CloudWatch Logs Insights confirms that each instance was producing 100-150 entries per second (one entry per request).
• #16 (Findings): Load was spread evenly over the 2 instances, and each instance can serve a large number of inference requests per second. This means that only a few instances are required, and this is reflected in the AWS SageMaker service limits. The failure of instance i-01be56xxx proves that high availability is possible and that instances can auto heal. Instance i-01be56xxx was an 'ml.t2.medium' and we can assume we used up all of its CPU credits, causing it to become overwhelmed and stop responding when there was not enough CPU to respond to health checks. Instance i-01be56xxx took 5 minutes to terminate and instance i-010adxx took 7 minutes to start; such timings are indicative of the time it takes to start EC2 instances (5 minutes or more). By comparison, Docker containers would normally take minutes and Lambda functions seconds. The 'InitialInstanceCount' was set to 2 as a hard limit: neither in the AWS CloudWatch outputs nor in the VPC monitoring did we ever see fewer or more than 2 instances launched at the same time. We did, however, prove that if an instance crashes or terminates it will be auto healed. We have proven from the endpoint creation process and from building a model that an endpoint uses Docker containers on EC2 instances. The load test confirms that the containers are invoked using requests. Each instance, if deployed into a VPC, will use a single IP address; this means that if a reasonable subnet range is used then the number of IP addresses will not be exhausted.
• #17 (Auto scaling): We could stop there, but let's think about 'InitialInstanceCount' again and discuss a top tip. From the findings in this blog it does not seem very elastic, but that is actually not correct: you can make the instances a SageMaker endpoint uses scale in and out with your load. The way to achieve this is an EC2 Application Auto Scaling group with the SageMaker endpoint as your target, which lets you scale it like a regular application auto-scaling group, including adjusting the scalable dimension. How to do it is hidden in the depths of the AWS documentation (AWS: Endpoint auto scaling, add policy). You can set up an auto-scaling group from the AWS Console, the CLI or using CloudFormation; a CloudFormation example is on the slide. (Terminology note: instance numbers scale "up or down"; instances scale "in or out".)
• #18 (Alarms): The options for treating missing data are breaching, notBreaching, ignore, and missing.
• #22 (Integrating X-Ray): In order to use AWS X-Ray, you need to download the AWS X-Ray SDK. The AWS X-Ray SDK works by monkey patching the main AWS SDK and a number of open source SDKs. Monkey patching is wrapping a class at runtime with another, recording its invocation and completion. AWS X-Ray takes these captures locally, then in the background sends them asynchronously, in batches, over UDP to the X-Ray Daemon. The X-Ray Daemon runs as a sidecar to your main application (note: for Lambda, running the X-Ray Daemon is handled for you by the Lambda runtime). The X-Ray Daemon then sends the data to the AWS X-Ray API.
• #23 (X-Ray top tips): X-Ray is not part of the AWS SDK, so you need to add it to your dependency management framework. This also applies to the Lambda runtime, which requires you to deploy the X-Ray SDK yourself; this seems a bit strange, as the X-Ray Daemon is available as a sidecar within the Lambda runtime, and it makes the X-Ray SDK ideal for putting into a Lambda Layer. Cold starts are shown in the AWS console in amber; look into optimising these as much as you can. Having monkey patched X-Ray, X-Ray will not record anything until you complete the following steps: enable X-Ray for your stage in API Gateway, and ensure any Lambda functions have IAM roles that include a policy with write access for the X-Ray API. Use AWS X-Ray everywhere (production, QA and development) and all of the time. X-Ray can load different sample rates at run-time, so in production the sample rate can be lowered to only 1% of your traffic. This can help in spotting issues and identifying changes in performance between deployments.
• #24 (Annotations): Annotations and multiple functions work as follows: calling an annotated function from another annotated function nests the captures; calling an annotated function more than once from another annotated function nests the multiple captures under the parent capture; and calling an annotated function from within a loop means each iteration will create a new capture under the calling annotated function.