MLOps and Reproducible ML on AWS
with Kubeflow and Amazon SageMaker
Presented by:
Stepan Pushkarev, CTO @ Provectus
Qingwei Li, ML Specialist Solutions Architect @ AWS
1. Learn how to build a scalable and secure ML Infrastructure on AWS with
Provectus
2. Explore best practices of using Amazon SageMaker with open source tools
for better experience and productivity
Webinar Objectives
1. Familiarity with AWS & Amazon SageMaker services
2. Familiarity with ML Workflow
3. Familiarity with Kubeflow & Kubeflow Pipelines
Webinar Prerequisites
1. Introductions
2. Case Study: GoCheck Kids
3. Overview of AWS Infrastructure for Machine Learning
4. Provectus ML Infrastructure on AWS
a. Experimentation
b. MLOps
c. Feature Store
Agenda
AI-First Consultancy & Solutions Provider
Clients ranging from
fast-growing startups to
large enterprises
450 employees and
growing
Established in 2010
HQ in Palo Alto
Offices across the US,
Canada, and Europe
We are obsessed with leveraging cloud, data, and AI to reimagine the way
businesses operate, compete, and deliver customer value
Innovative Tech Vendors
Seeking niche expertise to
differentiate and win the market
Midsize to Large Enterprises
Seeking to accelerate innovation,
achieve operational excellence
Our Clients
Introductions
Stepan Pushkarev
Chief Technology
Officer, Provectus
Iskandar Sitdikov
ML Solutions Architect,
Provectus
Rinat Gareev
ML Solutions Architect,
Provectus
Ilnur Garifullin
ML Solutions Architect,
Provectus
Qingwei Li
ML Specialist Solutions
Architect, AWS
The past few years have been like a dream come true for those who work in
analytics and big data. There is a new career path for platform engineers to learn
Hadoop, Scala and Spark. Java and Python programmers have a chance to move
to the Big Data world. There they find higher salaries, new challenges and get
to scale up to distributed systems. But recently I am starting to hear some
complaints and dashed hopes from engineers who have spent time working there.
1. Tools evolution — The Apache Spark/Hadoop ecosystem is great, but it is not stable and user-friendly enough
to just run and forget. Engineers and data scientists should contribute to existing open source projects and create
new tools to fill the gaps in day-to-day operations.
2. Education and cross skills — When data scientists write code, they need to think not just about abstractions,
but consider the practical issues of what is possible and what is reasonable. For example, they need to think how
long their query will run and whether the data they extract will fit into the storage mechanism they are using.
3. Improve the process — DevOps might be a solution. Here DevOps does not just mean writing Ansible scripts
and installing Jenkins. We need DevOps working in optimal fashion to reduce handoffs and invent new tools to
give everyone self-service to make them as productive as possible.
Why ML Infrastructure
GoCheck Kids Story: Secure, agile, and compliant ML
infrastructure for Deep Vision Screening
GoCheck Kids
Business Problem
● Reduce manual overhead for child vision screening.
● Detect strabismus, crescent, and dark iris/pupil population, and reject images where
the child is not looking straight into the camera.
● Security and compliance requirements: track everything, do not touch anything.
Solution
● End-to-end deep learning image classification models to detect child gaze, strabismus,
crescent, and dark iris/pupil population.
Deep Vision Solution for GoCheck Kids
Provectus has developed quite a few ML models:
● Different input (pre-processing, region cropping, single vs two eyes, etc.), 6
● Different feature generation backbones (deep convolutional networks: ResNet,
MobileNet, EfficientNet, custom, etc.), 7
● Transfer learning from a synthetic dataset, 3
● Tweaks with objective functions to tackle data imbalance, 5
● Different dataset splits, 10
Modeling Hypothesis
6x7x3x5x10 = 6,300 combinations to test in 3 weeks!
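The figure above is the size of the Cartesian product of the five hypothesis dimensions. A minimal sketch of that arithmetic follows; the dimension names are illustrative labels chosen here (the slide gives only the counts), and the options are represented by placeholder indices.

```python
from itertools import product
from math import prod

# Number of options per hypothesis dimension, as listed on the slide.
# The keys are hypothetical labels, not identifiers from the project.
DIMENSIONS = {
    "input_variant": 6,
    "backbone": 7,
    "transfer_learning": 3,
    "objective_tweak": 5,
    "dataset_split": 10,
}


def count_combinations(dimensions):
    """Return the size of the full experiment grid."""
    return prod(dimensions.values())


def enumerate_grid(dimensions):
    """Yield every combination as a dict of option indices."""
    names = list(dimensions)
    ranges = [range(n) for n in dimensions.values()]
    for combo in product(*ranges):
        yield dict(zip(names, combo))


total = count_combinations(DIMENSIONS)
print(total)  # 6300
```

Enumerating the grid lazily makes it possible to rank or sample combinations instead of running all of them, which matters when the time budget allows only a small fraction of the grid.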
Conducted ~100* experiments on the entire dataset using pipelines within 3 weeks
● 100 000+ images
● Each experiment takes 15 min – 6 hours on a single GPU (P3 instance type)
* not counting development runs and experiments in notebook instances
We always had quite a few pending improvement hypotheses in backlog
● Each good hypothesis needs several runs to determine best hyperparameters
● OR automatic hyperparameter optimizer
Data preparation took ~5 hours
● Had to parallelize and reuse outputs
Each experiment produces artifacts: models, metrics, predictions
Met security and compliance requirements
Benefits and Outcomes of ML Infrastructure
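The "reuse outputs" point can be illustrated with a cache keyed on a step's name, parameters, and input versions. This is a sketch under stated assumptions, not the pipeline used in the project: `StepCache`, `cache_key`, and `prepare_data` are hypothetical names, and the in-memory dict stands in for a real artifact store.

```python
import hashlib
import json


def cache_key(step_name, params, input_digests):
    """Derive a deterministic key from the step name, parameters, and input digests."""
    payload = json.dumps(
        # Input order is treated as irrelevant in this sketch, hence the sort.
        {"step": step_name, "params": params, "inputs": sorted(input_digests)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    """In-memory stand-in for an artifact store (assumption for illustration)."""

    def __init__(self):
        self._store = {}
        self.executions = 0

    def run(self, step_name, params, input_digests, fn):
        key = cache_key(step_name, params, input_digests)
        if key not in self._store:
            # Only execute the step when no stored output matches the key.
            self.executions += 1
            self._store[key] = fn(**params)
        return self._store[key]


def prepare_data(crop, size):
    # Placeholder for an expensive data preparation step.
    return f"dataset(crop={crop}, size={size})"


cache = StepCache()
a = cache.run("prepare_data", {"crop": "two_eyes", "size": 224}, ["raw-v1"], prepare_data)
b = cache.run("prepare_data", {"size": 224, "crop": "two_eyes"}, ["raw-v1"], prepare_data)
c = cache.run("prepare_data", {"crop": "single_eye", "size": 224}, ["raw-v1"], prepare_data)
```

Here the second call is served from the cache because the parameters and inputs are identical, while the third call runs again because one parameter changed.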
Results Summary
● 3X increase in the ML model's recall (same precision)
● 95% of ML engineers' time was dedicated to experimentation
● 100+ large-scale experiments in 3 weeks by 3 ML engineers
● 100% secure and FDA compliant
This could not be achieved without Provectus ML Infrastructure on AWS
Overview of AWS Infrastructure
for Machine Learning
AWS AI Services
● Vision: Amazon Rekognition
● Speech: Amazon Polly, Amazon Transcribe (+Medical)
● Text: Amazon Comprehend (+Medical), Amazon Translate, Amazon Textract
● Search (NEW): Amazon Kendra
● Chatbots: Amazon Lex
● Personalization: Amazon Personalize
● Forecasting: Amazon Forecast
● Fraud (NEW): Amazon Fraud Detector
● Development (NEW): Amazon CodeGuru
● Contact centers (NEW): Contact Lens for Amazon Connect
AWS ML Services
● Amazon SageMaker, with Amazon SageMaker Studio IDE (NEW)
● Amazon SageMaker Ground Truth, Amazon A2I, Amazon SageMaker Neo
● Built-in algorithms, SageMaker Notebooks (NEW), SageMaker Experiments (NEW), model tuning,
SageMaker Debugger (NEW), SageMaker Autopilot (NEW), model hosting, SageMaker Model Monitor (NEW)
AWS ML Frameworks & Infrastructure
● Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Inferentia, FPGA
AWS AI/ML Stack
Amazon SageMaker - A Fully Managed Service for ML
● Collect and prepare training data
● Select or build ML algorithms
● Set up and manage environments for training
● Train, debug, and tune models
● Deploy models in production
● Manage training runs
● Monitor models
● Scale and manage the production environment
● Validate predictions
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Image registry: container image repository
● Amazon Elastic Container Registry (Amazon ECR)
Compute: where the containers run
● Amazon Elastic Compute Cloud (Amazon EC2)
ML services: fully managed service that covers the entire machine learning workflow
● Amazon SageMaker: Jupyter notebook instances, high-performance algorithms, large-scale
training, optimization, one-click deployment, fully managed with auto-scaling
Management: deployment, scheduling, scaling, and management of containerized applications
● Amazon Elastic Kubernetes Service (Amazon EKS)
● Amazon Elastic Container Service (Amazon ECS)
ML Infrastructure and Services
Kubernetes: Amazon SageMaker Operators for Kubernetes
github.com/aws/amazon-sagemaker-operator-for-k8s
Kubeflow: Amazon SageMaker Components for Kubeflow Pipelines
github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker
Scaling ML on Kubernetes with Amazon SageMaker
Amazon SageMaker
• Fully-managed infrastructure
• Ground Truth labeling
• Automatic model tuning
• Built-in optimized algorithms
• Managed Spot Training
• Scalable inference endpoints
• Model monitoring
• Easy scalability
Kubeflow
• Portability
• Composability
• Scalability
• Shared infrastructure
• Repeatable pipelines
• Automation
• CI/CD
• Open-source
Open Source + Amazon SageMaker Value Proposition
[Diagram: a Kubeflow pipeline composed of pipeline steps. Each component has Input/Output,
Implementation (container), and Metadata. Shown alongside: Amazon ECR, Amazon SageMaker,
and other components.]
Amazon SageMaker Components for Kubeflow Pipelines
Example pipeline:
1. Hyperparameter optimization
2. Select best hyperparameters and increase epochs
3. Training model using the best hyperparameters
4. Create an Amazon SageMaker model
5. Deploy the model
BYO container / BYO training scripts
Amazon SageMaker Components for Kubeflow Pipelines
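The five steps of the example pipeline can be sketched in plain Python to show how data flows between them. This is only an illustration: it does not use the Kubeflow Pipelines SDK or the SageMaker components, every function name is hypothetical, and `fake_validation_score` is a toy objective standing in for a real training job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    learning_rate: float
    epochs: int
    score: float


def fake_validation_score(learning_rate, epochs):
    """Toy objective (assumption): peaks at learning_rate = 0.01."""
    return 1.0 / (1.0 + abs(learning_rate - 0.01) * 100) + 0.001 * epochs


def hyperparameter_optimization(learning_rates, epochs):
    """Step 1: evaluate each candidate with a small epoch budget."""
    return [Trial(lr, epochs, fake_validation_score(lr, epochs)) for lr in learning_rates]


def select_best(trials, final_epochs):
    """Step 2: keep the best hyperparameters and increase the epoch count."""
    best = max(trials, key=lambda t: t.score)
    return {"learning_rate": best.learning_rate, "epochs": final_epochs}


def train(params):
    """Step 3: full training run with the selected hyperparameters."""
    return {"params": params, "score": fake_validation_score(**params)}


def create_model(training_result):
    """Step 4: wrap the training output as a deployable model record."""
    return {"name": "demo-model", **training_result}


def deploy(model):
    """Step 5: return a placeholder endpoint name."""
    return f"endpoint-for-{model['name']}"


def run_pipeline():
    trials = hyperparameter_optimization([0.1, 0.01, 0.001], epochs=2)
    params = select_best(trials, final_epochs=20)
    model = create_model(train(params))
    return params, deploy(model)
```

In an actual pipeline each function would be a separate component with declared inputs and outputs, so the orchestrator can track and rerun steps individually.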
● Model development
● Model training
● Model tracking
● Model deployment
● Hyperparameter tuning
● Data prep
Amazon SageMaker + Kubeflow for Machine Learning
Amazon SageMaker
Product | Architecture | Kubernetes | Orchestration | Dev Interface | GUI | Ease of Use
SageMaker Components | Kubeflow Pipeline Components | Yes | Self-hosted Kubeflow Pipelines | Python | KFP Dashboard | Medium
SageMaker Operators | Kubernetes Operators, Custom Resources | Yes | Kubernetes tools (e.g., Flyte, Argo) | YAML, or custom extension by customer | None, or custom | Advanced
Amazon SageMaker Operators for Kubernetes vs.
Components for Kubeflow Pipelines
Provectus ML Infrastructure
on AWS
Amazon SageMaker Services
How Provectus Adds Value
● Feature Store: store and reuse features to build ML models faster
● ML Workflow Orchestrator: reproduce and track the whole ML workflow
● Dataset Management: track and govern training datasets
● Dataset Sampling: sample from production streams
● Advanced Monitoring: detect drift in text & images
● MLOps: continuous training & delivery
The Core of MLOps and Reproducible Experimentation
Pipelines
1. Backbone of Experimentation flow
2. Essential part of Continuous Integration and Delivery flow
3. Major part of Continuous Retraining flow
4. Production workload (unlike traditional CI/CD)
5. Part of day-to-day model tuning and development process
6. Idempotent — Should produce the same results with the same inputs
ML Pipeline Characteristics
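The idempotency property can be made concrete with a pipeline step whose output depends only on its declared inputs. A minimal sketch, assuming a hypothetical dataset-split step: `split_dataset` and `fingerprint` are names introduced here for illustration.

```python
import hashlib
import random


def split_dataset(record_ids, test_fraction, seed):
    """Split IDs into train/test so that equal inputs always give equal outputs."""
    rng = random.Random(seed)  # local generator, no hidden global state
    shuffled = sorted(record_ids)  # canonical order so input ordering does not matter
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def fingerprint(items):
    """Hash a list of IDs so two runs can be compared cheaply."""
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()


ids = [f"img-{i:04d}" for i in range(100)]
train_a, test_a = split_dataset(ids, 0.2, seed=42)
train_b, test_b = split_dataset(list(reversed(ids)), 0.2, seed=42)
```

Storing the fingerprint with each run's metadata gives a simple check that a rerun with the same inputs reproduced the same split.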
ML Pipeline Options
Component/Option | Amazon SageMaker Managed | AWS Native | Kubernetes Native
DSL | SageMaker Processing | Data Science SDK for Step Functions | Kubeflow Pipelines
Orchestrator | SageMaker Processing | Step Functions | Argo Workflow
Metadata Tracker & UI | Amazon SageMaker Experiments | N/A | Kubeflow Metadata
Integrations (Tuner, Debugger, TensorBoard, etc.) | Amazon SageMaker Services | DIY | Open source, Amazon SageMaker Components
Kubeflow: Orchestrator and Experiments Tracker of Choice
ML Engineer-Centric Flow
End-to-end Amazon SageMaker + Kubeflow Pipelines
MLOps with Argo Workflows, Amazon SageMaker, & Kubeflow
Summary of Kubeflow on AWS
Best Practices:
● Invest into a library of reusable components
● Use Amazon SageMaker Components for Kubeflow
● Deploy on Amazon EKS, consider Provectus Swiss Army
Kube for a quick start
● Use Argo and Kubeflow for MLOps
Benefits:
● Metadata Tracker and Pipeline Orchestrator
● Minimal intervention into existing day-to-day ML routines
Feature Store
Value Proposition of Feature Store
A data management layer for machine learning features.
1. Better ROI from feature engineering — Facilitates collaboration,
sharing and reusing of features
2. Increases ML Engineer productivity — Storage is further
decoupled from ML pipelines
3. Prevents training-serving data skew by design
4. Can encapsulate or facilitate data versioning and features
quality monitoring
Good News: A properly designed Data Lake
covers 80% of requirements for Feature Store
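One way to read "prevents training-serving data skew by design" is that a feature is defined once and the same definition serves both the batch (training) path and the online (serving) path. The sketch below assumes that reading; the `age_in_months` feature, the `FEATURES` registry, and the record fields are hypothetical.

```python
from datetime import date


def age_in_months(birth_date, as_of):
    """Single feature definition shared by training and serving."""
    months = (as_of.year - birth_date.year) * 12 + (as_of.month - birth_date.month)
    if as_of.day < birth_date.day:
        months -= 1  # the current month is not yet complete
    return months


# Registry of feature name -> function (illustrative stand-in for a feature store).
FEATURES = {"age_in_months": age_in_months}


def build_training_rows(records):
    """Batch path: compute features for historical records."""
    return [
        {name: fn(r["birth_date"], r["event_date"]) for name, fn in FEATURES.items()}
        for r in records
    ]


def build_serving_row(record, now):
    """Online path: same registry and functions, evaluated at request time."""
    return {name: fn(record["birth_date"], now) for name, fn in FEATURES.items()}


record = {"birth_date": date(2018, 3, 15), "event_date": date(2020, 6, 10)}
training_row = build_training_rows([record])[0]
serving_row = build_serving_row(record, now=date(2020, 6, 10))
print(training_row)  # {'age_in_months': 26}
```

Because both paths call the same function, a change to the feature logic cannot reach training without also reaching serving.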
Higher Level Operations:
● Fetch batch (take a sample)
● Get one
● Add / Deprecate feature
Lineage Metadata:
● Upstream Models
● Data Sources and transformations
Annotation Metadata:
● Agreements
● Judgements
● Annotation job parameters
Adding ML Awareness to Data Lake
Data Profiling Metadata:
● Min/max
● Uniqueness, missing values, etc.
Governance Metadata:
● Owner
● Description
● Version
● Last updated, SLA
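The metadata categories listed on this slide could be attached to each feature as a small record. The following is a sketch, not a schema from any particular feature store: `FeatureMetadata` and `profile_values` are hypothetical names, and only a few of the listed fields are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureMetadata:
    # Governance metadata
    name: str
    owner: str
    description: str
    version: int
    # Lineage metadata
    upstream_sources: list = field(default_factory=list)
    # Data profiling metadata, filled in by profile_values()
    profile: dict = field(default_factory=dict)


def profile_values(values):
    """Compute min/max, missing-value ratio, and uniqueness ratio for one feature column."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "unique_ratio": len(set(present)) / len(present) if present else 0.0,
    }


meta = FeatureMetadata(
    name="age_in_months",
    owner="ml-team",
    description="Age at screening time, in whole months",
    version=1,
    upstream_sources=["screening_events"],
)
meta.profile = profile_values([3, 7, None, 7])
```

Keeping the profile next to governance and lineage fields lets a monitoring job compare fresh statistics against the stored ones.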
Feature Store: Options
Not a Store. General purpose Data Catalogue.
Adds nice UI, Governance and Searchability.
Great design. Early Stage. Nicely overlaps with Data Lake.
No extensive metadata management yet.
AWS support: https://github.com/feast-dev/feast/issues/367
By PhDs for PhDs. Tremendous amount of work,
very advanced concepts but overcomplicated.
By creators of Uber Michelangelo. Closed source.
1. Modern ML infrastructure accelerates time to value for ML initiatives and increases
trust from the business
2. Eliminates handoffs between Data Scientists, ML Engineers and IT
3. Must-have requirement for small ML shops and for large organizations. Spans from
straightforward “image classification” projects to more complex ML pipelines
4. Must-have requirement for secure and compliant environments
5. Minimizes growing technical debt in machine learning projects
6. Complements fully managed AWS services with Open Source projects for pipeline
orchestration, experiments tracking, dataset versioning, and feature store
Summary of ML Infrastructure
125 University Avenue
Suite 290, Palo Alto
California, 94301
hello@provectus.com
Questions, details?
We would be happy to answer!

MLOps and Reproducible ML on AWS with Kubeflow and SageMaker

  • 1. MLOps and Reproducible ML on AWS with Kubeflow and Amazon SageMaker Presented by: Stepan Pushkarev, CTO @ Provectus Qingwei Li, ML Specialist Solutions Architect @ AWS
  • 2. 1. Learn how to a build scalable and secure ML Infrastructure on AWS with Provectus 2. Explore best practices of using Amazon SageMaker with open source tools for better experience and productivity Webinar Objectives
  • 3. 1. Familiarity with AWS & Amazon SageMaker services 2. Familiarity with ML Workflow 3. Familiarity with Kubeflow & Kubeflow Pipelines Webinar Prerequisites
  • 4. 1. Introductions 2. Case Study: GoCheck Kids 3. Overview of AWS Infrastructure for Machine Learning 4. Provectus ML Infrastructure on AWS a. Experimentation b. MLOps c. Feature Store Agenda
  • 5. AI-First Consultancy & Solutions Provider Сlients ranging from fast-growing startups to large enterprises 450 employees and growing Established in 2010 HQ in Palo Alto Offices across the US, Canada, and Europe We are obsessed about leveraging cloud, data, and AI to reimagine the way businesses operate, compete, and deliver customer value
  • 6. Innovative Tech Vendors Seeking for niche expertise to differentiate and win the market Midsize to Large Enterprises Seeking to accelerate innovation, achieve operational excellence Our Clients
  • 7. Introductions Stepan Pushkarev Chief Technology Officer, Provectus Iskandar Sitdikov ML Solutions Architect, Provectus Rinat Gareev ML Solutions Architect, Provectus Ilnur Garifullin ML Solutions Architect, Provectus Qingwei Li ML Specialist Solutions Architect, AWS
  • 8. The past few years have been like a dream come true for those who work in analytics and big data.There is a new career path for platform engineers to learn Hadoop, Scala and Spark. Java and Python programmers have a chance to move to the Big Data world. There they find higher salaries, new challenges and get to scale up to distributed systems. But recently I am starting to hear some complaints and dashed hopes from engineers who have spent time working there.
  • 9. 1. Tools evolution — The Apache Spark/Hadoop ecosystem is great, but it is not stable and user-friendly enough to just run and forget. Engineers and data scientists should contribute to existing open source projects and create new tools to fill the gaps in day-to-day operations. 2. Education and cross skills — When data scientists write code, they need to think not just about abstractions, but consider the practical issues of what is possible and what is reasonable. For example, they need to think how long their query will run and whether the data they extract will fit into the storage mechanism they are using. 3. Improve the process — DevOps might be a solution. Here DevOps does not just mean writing Ansible scripts and installing Jenkins. We need DevOps working in optimal fashion to reduce handoffs and invent new tools to give everyone self-service to make them as productive as possible.
  • 10. Why ML Infrastructure GoCheck Kids Story: Secure, agile, and compliant ML infrastructure for Deep Vision Screening
  • 12. Reduce manual overhead for child vision screening. Detect strabismus, crescent, dark iris/pupil population, as well as to reject images where child is not looking straight into the camera. Security and compliance requirements - Track everything, do not touch anything. Deep Vision Solution for GoCheck Kids Business Problem Solution End-to-end deep learning image classification models to detect child gaze, strabismus, crescent, and dark iris/pupil population.
  • 13. Provectus has developed quite a few ML models: ● Different input (pre-processing, region cropping, single vs two eyes, etc.), 6 ● Different feature generation backbones (deep convolutional networks: ResNet, MobileNet, EfficientNet, custom, etc.), 7 ● Transfer learning from a synthetic dataset, 3 ● Tweaks with objective functions to tackle data imbalance, 5 ● Different datasets splits, 10 Modeling Hypothesis 6x7x3x5x10 = 6,300 combinations to test in 3 weeks!
  • 14. Conducted ~100* experiments on the entire dataset using pipelines within 3 weeks ● 100 000+ images ● Each experiment takes 15 min – 6 hours on a single GPU (P3 instance type) * not counting development runs and experiments in notebook instances We always had quite a few pending improvement hypotheses in backlog ● Each good hypothesis needs several runs to determine best hyperparameters ● OR automatic hyperparameter optimizer Data preparation took ~5 hours ● Had to parallelize and reuse outputs Each experiment produces artifacts: models, metrics, predictions Met security and compliance requirements Benefits and Outcomes of ML Infrastructure
  • 15. Results Summary 3X Increase in ML model’s recall (same precision) 95% ML Engineer’s time was dedicated to experimentation 100+ Large scale experiments in 3 weeks by 3 ML engineers This could not be achieved without Provectus ML Infrastructure on AWS 100% Secure and FDA Compliant
  • 16. Overview of AWS Infrastructure for Machine Learning
  • 17. VISION SPEECH TEXT SEARCH NEW CHATBOTS PERSONALIZATION FORECASTING FRAUD NEW DEVELOPMENT NEW CONTACT CENTERS Amazon SageMaker Amazon SageMaker Ground Truth Amazon A2I Amazon SageMaker Neo Built-in algorithms SageMaker Notebooks NEW SageMaker Experiments NEW Model tuning SageMaker Debugger NEW SageMaker Autopilot NEW Model hosting SageMaker Model Monitor NEW Deep Learning AMIs & Containers GPUs & CPUs Elastic Inference Inferentia FPGA Amazon Rekognition Amazon Polly Amazon Transcribe +Medical Amazon Comprehend +Medical Amazon Translate Amazon Lex Amazon Personalize Amazon Forecast Amazon Fraud Detector Amazon CodeGuru AWS AI Services AWS ML Services AWS ML Frameworks & Infrastructure Amazon Textract Amazon Kendra Contact Lens For Amazon Connect Amazon SageMaker Studio IDE NEW NEW NEW AWS AI/ML Stack
  • 18. Amazon SageMaker - A Fully Managed Services for ML 10101101 0 0101010 Collect and prepare training data Select or Build ML algorithms Set up and manage environments for training Train, debug, and tune models Deploy models in production Manage training runs Monitor models Scale and manage the production environment Validate predictions
  • 21. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Image registry Container image repository Amazon Elastic Container Registry (Amazon ECR) Compute Where the containers run Amazon Elastic Compute Cloud (Amazon EC2) Jupyter notebook instances High performance algorithms Large-scale training Optimization One-click deployment Fully managed with auto-scaling ML services Fully-managed service that covers the entire machine learning workflow Amazon SageMaker Management Deployment, scheduling, scaling, and management of containerized applications Amazon Elastic Kubernetes Service (Amazon EKS) Amazon Elastic Container Service (Amazon ECS) ML Infrastructure and Services 1 2
  • 22. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubernetes Amazon SageMaker Operators for Kubernetes github.com/aws/amazon-sagemaker-operator-for-k8s Kubeflow Amazon SageMaker Components for Kubeflow Pipelines github.com/kubeflow/pipelines/tree/master/components/ aws/sagemaker Scaling ML on Kubernetes with Amazon SageMaker 2 1
  • 23. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 24. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Fully-managed infrastructure • Ground Truth labeling • Automatic model tuning • Built-in optimized algorithms • Managed Spot Training • Scalable inference endpoints • Model monitoring • Easy scalability • Portability • Composability • Scalability • Shared infrastructure • Repeatable pipelines • Automation • CI/CD • Open-source Open Source + Amazon SageMaker Value Proposition Amazon SageMaker Kubeflow
  • 25. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kubeflow Pipeline Component Other component Pipeline step Pipeline step Pipeline step Input/Output Implementation (container) Metadata Amazon ECR Amazon SageMaker Amazon SageMaker Components for Kubeflow Pipelines Other component
  • 26. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Example pipeline: 1. Hyperparameter optimization 2. Select best hyperparameters and increase epochs 3. Training model using the best hyperparameters 4. Create an Amazon SageMaker model 5. Deploy the model BYO containerBYO training scripts Amazon SageMaker Components for Kubeflow Pipelines
  • 27. Amazon SageMaker + Kubeflow for Machine Learning [Diagram labels: Model development; Model training; Model tracking; Model deployment; Hyper-param tuning; Data prep; Amazon SageMaker]
  • 28. Scaling ML on Kubernetes with Amazon SageMaker
    ● Kubernetes: Amazon SageMaker Operators for Kubernetes, github.com/aws/amazon-sagemaker-operator-for-k8s
    ● Kubeflow: Amazon SageMaker Components for Kubeflow Pipelines, github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker
  • 29. Amazon SageMaker Operators for Kubernetes vs. Components for Kubeflow Pipelines
    Product | Architecture | Kubernetes | Orchestration | Dev Interface | GUI | Ease of Use
    SageMaker Components | Kubeflow Pipeline Components | Yes | Self-hosted Kubeflow Pipelines | Python | KFP Dashboard | Medium
    SageMaker Operators | Kubernetes Operators, Custom Resources | Yes | Kubernetes tools (e.g. Flyte, Argo) | YAML, or custom extension by customer | None, or custom | Advanced
  • 32. How Provectus Adds Value
    ● Feature Store: Store and reuse features to build ML models faster
    ● ML Workflow Orchestrator: Reproduce and track the whole ML workflow
    ● Dataset Management: Track and govern training datasets
    ● Dataset Sampling: Sample from production streams
    ● Advanced Monitoring: Detect drift in text & images
    ● MLOps: Continuous Training & Delivery
  • 33. Pipelines: The Core of MLOps and Reproducible Experimentation
  • 34. ML Pipeline Characteristics
    1. Backbone of the Experimentation flow
    2. Essential part of the Continuous Integration and Delivery flow
    3. Major part of the Continuous Retraining flow
    4. Production workload (unlike traditional CI/CD)
    5. Part of the day-to-day model tuning and development process
    6. Idempotent: should produce the same results with the same inputs
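The idempotency requirement can be illustrated with a small sketch: a step is keyed by a hash of its name and inputs, so re-running it with the same inputs returns the same stored result. This is an assumption-laden toy, not how any particular orchestrator implements caching; `_cache`, `fingerprint`, `run_step`, and `normalize` are hypothetical names, and the in-memory dictionary stands in for a real artifact store.

```python
import hashlib
import json

_cache = {}  # in-memory stand-in for an artifact store (assumption)

def fingerprint(step_name, inputs):
    """Stable hash of the step name and its JSON-serializable inputs."""
    payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def run_step(step_name, func, inputs):
    """Run func(**inputs) once per distinct input set; reuse the stored result otherwise."""
    key = fingerprint(step_name, inputs)
    if key not in _cache:
        _cache[key] = func(**inputs)
    return key, _cache[key]

def normalize(values, scale):
    """A deterministic toy step: no clock reads, no unseeded randomness."""
    return [round(v / scale, 4) for v in values]
```

Calling `run_step("normalize", normalize, {"values": [1, 2, 3], "scale": 2})` twice yields the same key and the same output; changing `scale` yields a different key. A step that reads the current time or uses unseeded randomness would break this property.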
  • 35. ML Pipeline Options. Options compared: Amazon SageMaker Managed, AWS Native, Kubernetes Native. Components compared: DSL, Orchestrator, Metadata Tracker & UI, Integrations (Tuner, Debugger, TensorBoard, etc.)
  • 36. ML Pipeline Options
    Component / Option | Amazon SageMaker Managed | AWS Native | Kubernetes Native
    DSL | SageMaker Processing | Data Science SDK for Step Functions | Kubeflow Pipelines
    Orchestrator | SageMaker Processing | Step Functions | Argo Workflow
    Metadata Tracker & UI | Amazon SageMaker Experiments | N/A | Kubeflow Metadata
    Integrations (Tuner, Debugger, TensorBoard, etc.) | Amazon SageMaker Services | DIY | Open source, Amazon SageMaker Components
  • 37. Kubeflow: Orchestrator and Experiments Tracker of Choice
  • 40. MLOps with Argo Workflows, Amazon SageMaker, & Kubeflow
  • 42. Summary of Kubeflow on AWS
    Best Practices:
    ● Invest in a library of reusable components
    ● Use Amazon SageMaker Components for Kubeflow
    ● Deploy on Amazon EKS; consider Provectus Swiss Army Kube for a quick start
    ● Use Argo and Kubeflow for MLOps
    Benefits:
    ● Metadata Tracker and Pipeline Orchestrator
    ● Minimal intervention in existing day-to-day ML routines
  • 44. Value Proposition of Feature Store
    A data management layer for machine learning features.
    1. Better ROI from feature engineering: facilitates collaboration, sharing, and reuse of features
    2. Increases ML Engineer productivity: storage is further decoupled from ML pipelines
    3. Prevents training-serving data skew by design
    4. Can encapsulate or facilitate data versioning and feature quality monitoring
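One way to read "prevents training-serving data skew by design" is that a feature is defined once and the same definition serves both the batch (training) path and the single-record (serving) path. The sketch below illustrates that idea only; `TinyFeatureStore`, `FeatureDefinition`, and the `age_years` feature are hypothetical names, and a real feature store would persist values in offline and online storage rather than compute them in memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    version: int
    transform: Callable[[dict], float]  # raw record -> feature value

class TinyFeatureStore:
    """Hypothetical in-memory store used only to illustrate the idea."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        self._features[feature.name] = feature

    def get_one(self, names: List[str], row: dict) -> Dict[str, float]:
        """Serving path: compute features for a single raw record."""
        return {n: self._features[n].transform(row) for n in names}

    def get_batch(self, names: List[str], rows: List[dict]) -> List[Dict[str, float]]:
        """Training path: reuse the serving logic for many raw records."""
        return [self.get_one(names, row) for row in rows]

store = TinyFeatureStore()
store.register(FeatureDefinition("age_years", 1, lambda r: r["age_months"] / 12))
```

Because `get_batch` delegates to `get_one`, the training set and the online request go through the same transformation, so the two paths cannot drift apart in this toy model.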
  • 45. Good News: A properly designed Data Lake covers 80% of the requirements for a Feature Store
  • 46. Adding ML Awareness to Data Lake
    Higher Level Operations: ● Fetch batch (take a sample) ● Get one ● Add / Deprecate feature
    Lineage Metadata: ● Upstream Models ● Data Sources and transformations
    Annotation Metadata: ● Agreements ● Judgements ● Annotation job parameters
    Data Profiling Metadata: ● Min/max ● Uniqueness, missing values, etc.
    Governance Metadata: ● Owner ● Description ● Version ● Last updated, SLA
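The profiling and governance metadata listed above could be attached to a feature roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, only a subset of the listed metadata is modeled, and `None` is assumed to mark a missing value.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FeatureProfile:
    minimum: Optional[float]
    maximum: Optional[float]
    unique_ratio: float   # distinct non-missing values / non-missing values
    missing_ratio: float  # missing values / all values

def profile(values: List[Optional[float]]) -> FeatureProfile:
    """Compute min/max, uniqueness, and missing-value ratios for one column."""
    present = [v for v in values if v is not None]
    total = len(values)
    return FeatureProfile(
        minimum=min(present) if present else None,
        maximum=max(present) if present else None,
        unique_ratio=len(set(present)) / len(present) if present else 0.0,
        missing_ratio=(total - len(present)) / total if total else 0.0,
    )

@dataclass
class FeatureMetadata:
    # Governance metadata
    name: str
    owner: str
    description: str
    version: int
    # Lineage metadata: where the feature comes from
    sources: List[str] = field(default_factory=list)
    # Data profiling metadata, filled in by profile()
    profile: Optional[FeatureProfile] = None
```

A catalog entry would then combine both, for example `FeatureMetadata("age_years", "data-team", "Age in years", 1, ["raw.patients"], profile(column_values))`, where the table and owner names are placeholders.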
  • 47. Feature Store: Options
    ● Not a store. General-purpose data catalogue. Adds a nice UI, governance, and searchability.
    ● Great design. Early stage. Nicely overlaps with the Data Lake. No extensive metadata management yet. AWS support: https://github.com/feast-dev/feast/issues/367
    ● By PhDs, for PhDs. Tremendous amount of work and very advanced concepts, but overcomplicated.
    ● By the creators of Uber Michelangelo. Closed source.
  • 48. Summary of ML Infrastructure
    1. Modern ML infrastructure accelerates time to value for ML initiatives and increases trust from the business
    2. Eliminates handoffs between Data Scientists, ML Engineers, and IT
    3. Must-have requirement for small ML shops and for large organizations. Spans from straightforward “image classification” projects to more complex ML pipelines
    4. Must-have requirement for secure and compliant environments
    5. Minimizes growing technical debt in machine learning projects
    6. Complements fully managed AWS services with open source projects for pipeline orchestration, experiment tracking, dataset versioning, and feature store
  • 49. Questions, details? We would be happy to answer! 125 University Avenue, Suite 290, Palo Alto, California 94301 [email protected]