
1

SCIENCE
PASSION
TECHNOLOGY

Architecture of ML Systems
12 Model Deployment & Serving
Matthias Boehm
Graz University of Technology, Austria
Computer Science and Biomedical Engineering
Institute of Interactive Systems and Data Science
BMK endowed chair for Data Management

Last update: June 16, 2021


Announcements/Org
2

 #1 Video Recording
 Link in TeachCenter & TUbe (lectures will be public)
 https://tugraz.webex.com/meet/m.boehm
 Corona traffic light: RED → May 17: ORANGE → Jul 01: YELLOW

 #2 Programming Projects / Exercises
 Soft deadline: June 30 (w/ room for extension)
 Submission of exercises in TeachCenter
 Submission of projects as PRs in Apache SystemDS

 #3 Exams
 Doodle w/ 42/~50 exam slots (45min each)
 July 7/8/9/12/13 (done via skype/webex)

 #4 Course Evaluation
 Please participate; open period: June 1 – July 15
Recap: The Data Science Lifecycle
3

 Data-centric view: application perspective, workload perspective,
data system perspective
 Exploratory process (experimentation, refinements, ML pipelines)

[Figure: lifecycle stages and roles: Data Integration, Data Cleaning,
Data Preparation (data/SW engineer); Model Selection, Training,
Hyper-parameters (data scientist); Validate & Debug, Deployment,
Scoring & Feedback (DevOps engineer)]
Agenda
4

 Model Exchange and Serving


 Model Monitoring and Updates

5

Model Exchange and Serving


Model Exchange Formats


6

 Definition Deployed Model
 #1 Trained ML model (weight/parameter matrix)
 #2 Trained weights AND operator graph / entire ML pipeline
 Especially for DNNs (many weight/bias tensors, hyper-parameters, etc)

 Recap: Data Exchange Formats (model + meta data)
 General-purpose formats: CSV, JSON, XML, Protobuf
 Sparse matrix formats: matrix market, libsvm
 Scientific formats: NetCDF, HDF5
 ML-system-specific binary formats (e.g., SystemDS, PyTorch serialized)

 Problem: ML System Landscape
 Different languages and frameworks, including versions
 Lack of standardization → DSLs for ML are the wild west


Model Exchange Formats, cont.


7

 Why Open Standards?
 Open source allows inspection but no control
 Open governance necessary for an open standard
 Cons: needs adoption, moves slowly
[Nick Pentreath: Open Standards for Machine Learning Deployment, bbuzz 2019]

 #1 Predictive Model Markup Language (PMML)
 Model exchange format in XML, created by the Data Mining Group in 1997
 Packages model weights, hyper-parameters, and a limited set of algorithms

 #2 Portable Format for Analytics (PFA)
 Attempt to fix the limitations of PMML, created by the Data Mining Group
 JSON and AVRO exchange format
 Minimal functional math language → arbitrary custom models
 Scoring in JVM, Python, R
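
As a concrete illustration of #1, a minimal sketch of exporting a
scikit-learn model to PMML; the third-party sklearn2pmml package and its
PMMLPipeline wrapper are assumptions here, not part of the lecture material:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)
pipeline = PMMLPipeline([("classifier", LogisticRegression(max_iter=1000))])
pipeline.fit(X, y)

# writes model weights and hyper-parameters as an XML document
# (the converter requires a local JVM)
sklearn2pmml(pipeline, "LogisticRegression.pmml")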


Model Exchange Formats, cont.


8

 #3 Open Neural Network Exchange (ONNX)
 Model exchange format (data and operator graph) via Protobuf
 First Facebook and Microsoft, then IBM, Amazon → PyTorch, MXNet
 Focused on deep learning and tensor operations
 ONNX-ML: support for traditional ML algorithms
 Scoring engine: https://github.com/Microsoft/onnxruntime
 Cons: low level (e.g., fused ops), DNN-centric → ONNX-ML
(SystemDS ONNX support: python/systemds/onnx_systemds, Lukas Timpl)
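
A minimal sketch of the round trip: export a small PyTorch model to ONNX
and score it with onnxruntime (the model and tensor names below are
illustrative assumptions):

import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2).eval()
dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["x"], output_names=["y"])

# score the exported operator graph with the onnxruntime engine
sess = ort.InferenceSession("model.onnx")
(out,) = sess.run(["y"], {"x": np.random.randn(1, 4).astype(np.float32)})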

 TensorFlow Saved Models
 TensorFlow-specific exchange format for model and operator graph
 Freezes input weights and literals for additional optimizations
(e.g., constant folding, quantization, etc)
 Cloud providers may not be interested in open exchange standards


ML Systems for Serving


9

 #1 Embedded ML Serving
 TensorFlow Lite and new language bindings (small footprint,
dedicated HW acceleration, APIs, and models: MobileNet, SqueezeNet)
 SystemML JMLC (Java ML Connector)

 #2 ML Serving Services
 Motivation: complex DNN models, run on dedicated HW
(example: Google Translate, 140B words/day, 82K GPUs in 2016)
 RPC/REST interface for applications
 TensorFlow Serving: configurable serving w/ batching
[Christopher Olston et al: TensorFlow-Serving: Flexible, High-Performance
ML Serving. NIPS ML Systems 2017]
 Clipper: decoupled multi-framework scoring, w/ batching and result caching
[Daniel Crankshaw et al: Clipper: A Low-Latency Online Prediction Serving
System. NSDI 2017]
 Pretzel: batching and multi-model optimizations in ML.NET
[Yunseong Lee et al: PRETZEL: Opening the Black Box of Machine Learning
Prediction Serving Systems. OSDI 2018]
 Rafiki: optimization for accuracy under latency constraints, and
batching and multi-model optimizations
[Wei Wang et al: Rafiki: Machine Learning as an Analytics Service System.
PVLDB 2018]
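
For illustration, a hedged sketch of calling such a serving service: a
TensorFlow Serving REST endpoint invoked from Python (host, port, and model
name are assumptions; a tensorflow_model_server must be serving the model):

import requests

# TensorFlow Serving REST API: POST /v1/models/{name}:predict
resp = requests.post(
    "http://localhost:8501/v1/models/mymodel:predict",
    json={"instances": [[1.0, 2.0, 3.0, 4.0]]})
print(resp.json()["predictions"])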
Serverless Computing
10
[Joseph M. Hellerstein et al: Serverless Computing: One Step Forward,
Two Steps Back. CIDR 2019]

 Definition Serverless
 FaaS: functions-as-a-service (event-driven, stateless input-output mapping)
 Infrastructure for deployment and auto-scaling of APIs/functions
 Examples: Amazon Lambda, Microsoft Azure Functions, etc

[Figure: event sources (e.g., cloud services) → auto-scaled Lambda functions
→ other APIs and services; pay-per-request (1M requests x 100ms = $0.20)]

 Example (AWS Lambda request handler):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class MyHandler implements RequestHandler<Tuple, MyResponse> {
  @Override
  public MyResponse handleRequest(Tuple input, Context context) {
    return expensiveModelScoring(input); // with read-only model
  }
}
Example SystemDS JMLC
11

 Example Scenario: Sentence Classification

[Figure: token features → feature extraction (e.g., doc structure,
sentences, tokenization, n-grams) → sentence classification (e.g., ⨝, …)
with deployed "model" M; batches of sentences arrive as ΔX]

 Challenges
 Scoring part of larger end-to-end pipeline → embedded scoring
 External parallelization w/o materialization
 Simple synchronous scoring → latency ⇒ throughput
 Data size (tiny ΔX, huge model M) → minimize overhead per ΔX
 Seamless integration & model consistency
 Token inputs & outputs
Example SystemDS JMLC, cont.
12

 Background: Frame
 Abstract data type with schema (boolean, int, double, string)
 Column-wise block layout
 Local/distributed operations: e.g., indexing, append, transform
 Distributed representation: ? x ncol(F) blocks
(shuffle-free conversion of csv datasets)

 Data Preparation via Transform

[Figure: training: FX → transformencode → X, Y (+ meta data frames MX, MY)
→ training → model B; scoring: ΔFX → transformapply (w/ MX) → ΔX,
ΔŶ → transformdecode (w/ MY) → ΔFŶ]
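
To make these semantics concrete, a minimal pandas sketch of a recode-style
transformencode/transformapply pair; this is illustrative only (SystemDS
implements the transforms as built-ins over frames), and the column names
are made up:

import pandas as pd

def transform_encode(F: pd.DataFrame):
    # build recode meta data (distinct value -> code) per column
    meta = {c: {v: i for i, v in enumerate(pd.unique(F[c]))}
            for c in F.columns}
    X = F.apply(lambda col: col.map(meta[col.name]))
    return X.to_numpy(), meta

def transform_apply(dF: pd.DataFrame, meta):
    # reuse the training meta data for consistent encoding at scoring time
    return dF.apply(lambda col: col.map(meta[col.name])).to_numpy()

FX = pd.DataFrame({"token": ["a", "b", "a"], "pos": ["N", "V", "N"]})
X, MX = transform_encode(FX)
dX = transform_apply(pd.DataFrame({"token": ["b"], "pos": ["V"]}), MX)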

Example SystemDS JMLC, cont.
13

 Motivation
 Embedded scoring
 Latency ⇒ throughput
 Minimize overhead per ΔX

 Typical compiler/runtime overheads:
 Script parsing and config: ~100ms
 Validation, compile, IPA: ~10ms
 HOP DAG (re-)compile: ~1ms
 Instruction execute: <0.1μs

 Example

// single-node, no evictions, no recompile, no multi-threading
Connection conn = new Connection();
PreparedScript pscript = conn.prepareScript(
  getScriptAsString("glm-predict-extended.dml"),
  new String[]{"FX","MX","MY","B"}, new String[]{"FY"});

// setup static/constant inputs (for reuse)
pscript.setFrame("MX", MX, true);
pscript.setFrame("MY", MY, true);
pscript.setMatrix("B", B, true);

for( Document d : documents ) { // execute precompiled script many times
  FrameBlock FX = ...; // input pipeline
  pscript.setFrame("FX", FX);
  FrameBlock FY = pscript.executeScript().getFrame("FY");
  // ... remaining pipeline
}
Serving Optimizations – Batching
14

 Recap: Model Batching (see 08 Data Access)
 One-pass evaluation of multiple model configurations over an m x n
input X: O(m*n) read, O(m*n*k) compute for k models, with m >> n >> k
 EL, CV, feature selection, hyper-parameter tuning
 E.g.: TUPAQ [SoCC'16], Columbus [SIGMOD'14]

 Data Batching (minimal sketch below)
 Batching to utilize the HW more efficiently under SLA
 Use case: multiple users use the same model
(wait, collect user requests X1, X2, X3, and merge)
 Adaptive: additive increase, multiplicative decrease
 Benefits for multi-class / complex models [Clipper @ NSDI'17]
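
A minimal sketch of the adaptive (AIMD) data batching referenced above; the
request queue, model interface, and SLA threshold are assumptions for
illustration:

import queue
import time

def serve_loop(requests, model, sla_ms=50.0):
    batch_size = 1
    while True:
        # wait for and collect up to batch_size requests, then merge
        batch = [requests.get()]
        while len(batch) < batch_size:
            try:
                batch.append(requests.get(timeout=0.001))
            except queue.Empty:
                break
        start = time.perf_counter()
        model.predict(batch)  # score the merged batch
        latency_ms = (time.perf_counter() - start) * 1e3
        if latency_ms <= sla_ms:
            batch_size += 1                       # additive increase
        else:
            batch_size = max(1, batch_size // 2)  # multiplicative decrease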
Serving Optimizations – Quantization
15

 Quantization (see 08 Data Access Methods)
 Lossy compression via ultra-low precision / fixed-point
 Ex.: 62.7% of energy spent on data movement
[Amirali Boroumand et al: Google Workloads for Consumer Devices:
Mitigating Data Movement Bottlenecks. ASPLOS 2018]

 Quantization for Model Scoring
 Usually much smaller data types (e.g., UINT8)
 Quantization of model weights, and sometimes also activations
 → Reduced memory requirements and better latency / throughput (SIMD)

import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()

[Credit: https://www.tensorflow.org/lite/performance/post_training_quantization]

Serving Optimizations – MQO
16

 Result Caching
 Establish a function cache for X → Y (memoization of deterministic
function evaluation; minimal sketch below)
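
A minimal memoization sketch of such a function cache; hashing the raw
input bytes as cache key is one possible design choice, assumed here:

import hashlib
import numpy as np

_cache = {}

def cached_predict(model, X: np.ndarray):
    # deterministic function evaluation -> reuse prior results
    key = hashlib.sha1(X.tobytes()).hexdigest()
    if key not in _cache:
        _cache[key] = model.predict(X)
    return _cache[key]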

 Multi-Model Optimizations
 Same input fed into multiple partially redundant model evaluations
 Common subexpression elimination between prediction programs
 Done during compilation or at runtime
 In PRETZEL, programs are compiled into physical stages and registered
with the runtime, plus caching for stages
(decided based on hashing the inputs)
[Yunseong Lee et al: PRETZEL: Opening the Black Box of Machine Learning
Prediction Serving Systems. OSDI 2018]

Serving Optimizations – Compilation
17
(see 04 Adaptation, Fusion, and JIT)

 TensorFlow tf.compile
 Compile entire TF graph into binary function w/ low footprint
 Input: graph, config (feeds+fetches w/ fixed shape sizes)
 Output: x86 binary and C++ header (e.g., inference)
 Specialization for frozen model and sizes
[Chris Leary, Todd Wang: XLA – TensorFlow, Compiled!, TF Dev Summit 2017]
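
tf.compile/tfcompile is an ahead-of-time path driven by build tooling; as a
related illustration (a JIT variant, not the AOT flow above), XLA
compilation can also be requested in TF 2.x via jit_compile:

import tensorflow as tf

@tf.function(jit_compile=True)  # request XLA compilation of this function
def score(x, w):
    return tf.nn.softmax(tf.matmul(x, w))

x = tf.random.normal([8, 4])
w = tf.random.normal([4, 2])
y = score(x, w)  # first call traces + compiles, later calls reuse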

 PyTorch Compile
 Compile Python functions into ScriptModule/ScriptFunction
 Lazily collect operations, optimize, and JIT compile
 Explicit jit.script call or @torch.jit.script
[Vincent Quenneville-Bélair: How PyTorch Optimizes Deep Learning
Computations, Guest Lecture Stanford 2020]

import torch

a = torch.rand(5)

def func(x):
    for i in range(10):
        x = x * x  # unrolled into graph
    return x

jitfunc = torch.jit.script(func)  # JIT compile
jitfunc.save("func.pt")
y = jitfunc(a)  # execute the compiled function

Serving Optimizations – Model Vectorization
18

 HummingBird [https://github.com/microsoft/hummingbird]
 Compile ML scoring pipelines into tensor ops
 Tree-based models (GEMM, 2x tree traversal); minimal sketch below
[Supun Nakandala et al: A Tensor Compiler for Unified Machine Learning
Prediction Serving. OSDI 2020]

[Figure: input → node paths (-1 lhs / 0 / 1 rhs) → path Σ → bucket paths
→ bucket-class mapping → prediction]
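
A minimal sketch of this compilation, assuming the hummingbird-ml package
and its convert API:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(2, size=1000)
rf = RandomForestClassifier(n_estimators=10).fit(X, y)

# compile the tree traversal into GEMM/tensor ops on the PyTorch backend
hb_model = convert(rf, "pytorch")
preds = hb_model.predict(X[:5])

The converted model then scores via the chosen tensor runtime (CPU/GPU)
instead of per-tree traversal code.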

 Model Distillation
 Ensembles of models → single NN model
 Specialized models for different classes
(found via differences to the generalist model)
[Geoffrey E. Hinton, Oriol Vinyals, Jeffrey Dean: Distilling the Knowledge
in a Neural Network. CoRR 2015]
Serving Optimizations – Specialization
19

 NoScope Architecture
 Baseline: YOLOv2 on 1 GPU per video camera @ 30fps
 Optimizer to find filters
[Daniel Kang et al: NoScope: Optimizing Deep CNN-Based Queries over Video
Streams at Scale. PVLDB 2017]

 #1 Model Specialization
 Given query and baseline model
 Train a shallow NN (based on AlexNet) on the output of the baseline model
 Short-circuit if prediction with high confidence

 #2 Difference Detection
 Compute difference to ref-image/earlier-frame
 Short-circuit w/ ref label if no significant difference
20

Model Monitoring and Updates


Part of Model Management and MLOps
(see 10 Model Selection & Management)

Model Deployment Workflow
21

[Figure: #1 Model Deployment (data integration, data cleaning, data
preparation; model selection, training, hyper-parameters) produces model
and meta data MX, MY, B for Model Serving (prediction requests);
#2 continuous data validation / concept drift detection;
#3 model monitoring;
#4 periodic / event-based re-training & updates
(automatic / semi-manual), driven by a DevOps engineer]

Monitoring Deployed Models
22

 Goals: Robustness (e.g., data, latency) and model accuracy
[Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data
Management Challenges in Production Machine Learning, SIGMOD 2017]

 #1 Check Deviations Training/Serving Data
 Different data distributions, distinct items → impact on model accuracy?
 See 09 Data Acquisition and Preparation (Data Validation)

 #2 Definition of Alerts
 Understandable and actionable
 Sensitivity for alerts (ignored if too frequent)

 #3 Data Fixes
 Identify problematic parts
 Impact of fix on accuracy
 How to backfill into training data
"The question is not whether something is 'wrong'.
The question is whether it gets fixed."
Monitoring Deployed Models, cont.
23

 Alert Guidelines
[Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data
Management Challenges in Production Machine Learning, SIGMOD 2017]
 Make them actionable (from more to less actionable: missing field,
field has new values, distribution changes)
 Question data AND constraints
[George Beskales et al: On the relative trust between inconsistent data and
inaccurate constraints. ICDE 2013]
 Combining repairs: principle of minimality
[Xu Chu, Ihab F. Ilyas: Qualitative Data Cleaning. Tutorial, PVLDB 2016]

 Complex Data Lifecycle
 Adding new features to production ML pipelines is a complex process
 Data does not live in a DBMS; data often resides in multiple storage
systems that have different characteristics
 Collecting data for training can be hard and expensive

Concept Drift
24
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift:
Importance, Challenges & Solutions, PAKDD 2011]

 Recap Concept Drift (features → labels)
 Change of statistical properties / dependencies (features-labels)
 Requires re-training; parametric approaches for deciding when to retrain

 #1 Input Data Changes
 Population change (gradual/sudden), but also new categories, data errors
 Covariate shift: p(x) changes with constant p(y|x)

 #2 Output Data Changes
 Label shift: p(y) changes with constant conditional
feature distribution p(x|y)

 Goals: Fast adaptation; noise vs change, recurring contexts, small overhead


Concept Drift, cont.
25
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift:
Importance, Challenges & Solutions, PAKDD 2011]

 Approach 1: Periodic Re-Training
 Training: window of latest data + data selection/weighting
 Alternatives: incremental maintenance, warm starting, online learning

 Approach 2: Event-based Re-Training
 Change detection (supervised, unsupervised)
 Often model-dependent, specific techniques for time series
 Drift Detection Method (DDM): binomial distribution; if the error moves
outside a scaled standard deviation → raise warnings and alerts
 Adaptive Windowing (ADWIN): maintain window W, append new data to W,
drop old values until the averages of the sub-windows W1, W2 (W = W1 · W2)
are similar (difference below epsilon), raise alerts (sketch below)
[Albert Bifet, Ricard Gavaldà: Learning from Time-Changing Data with
Adaptive Windowing. SDM 2007]
[https://scikitmultiflow.readthedocs.io/en/stable/api/generated/
skmultiflow.drift_detection.ADWIN.html]
 Kolmogorov-Smirnov distance / Chi-Squared: univariate statistical tests
on training/serving data
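
A minimal ADWIN sketch against the scikit-multiflow API cited above; the
simulated error stream with a drift at index 1000 is an assumption for
illustration:

import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)
# simulated per-request error indicators with a sudden drift at index 1000
stream = np.concatenate([np.random.binomial(1, 0.1, 1000),
                         np.random.binomial(1, 0.5, 1000)])
for i, err in enumerate(stream):
    adwin.add_element(err)
    if adwin.detected_change():
        print(f"drift detected at index {i}")  # trigger re-training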

Concept Drift, cont.
26
[Sebastian Schelter, Tammo Rukat, Felix Bießmann: Learning to Validate the
Predictions of Black Box Classifiers on Unseen Data. SIGMOD 2020]

 Model-agnostic Performance Predictor (for Approach 2: event-based re-training)
 User-defined error generators
 Synthetic data corruption → impact on black-box model
 Train performance predictor (regression/classification at threshold t)
for expected prediction quality on percentiles of target variable ŷ

[Figure: results of the performance predictor (PPM)]
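
A minimal sketch of this idea under simplifying assumptions (a single
corruption parameter, zeroed-out cells as the error generator, accuracy as
the quality metric; the scikit-learn estimator choice is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

def corrupt(X, frac, rng):
    Xc = X.copy()
    mask = rng.random(X.shape) < frac  # user-defined error generator:
    Xc[mask] = 0.0                     # zero out a fraction of cells
    return Xc

def fit_performance_predictor(model, X_val, y_val):
    rng = np.random.default_rng(7)
    feats, quality = [], []
    for frac in np.linspace(0.0, 0.5, 11):
        acc = accuracy_score(y_val, model.predict(corrupt(X_val, frac, rng)))
        feats.append([frac])   # descriptor of the synthetic corruption
        quality.append(acc)    # observed impact on the black-box model
    # regression from corruption descriptors to expected prediction quality
    return LinearRegression().fit(feats, quality)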

GDPR (General Data Protection Regulation)
27

 GDPR "Right to be Forgotten"
 Recent laws such as GDPR require companies and institutions to
delete user data upon request
 Personal data must not only be deleted from primary data stores,
but also from ML models trained on it (Recital 75)
[https://gdpr.eu/article-17-right-to-be-forgotten/]

 Example Deanonymization
 Recommender systems: models X ≈ UV retain user similarity
 Social network data / clustering / KNN
 Large language models (e.g., GPT-3)
[Sebastian Schelter: "Amnesia" - Machine Learning Models That Can
Forget User Data Very Fast. CIDR 2020]

GDPR, cont.
28
[Sebastian Schelter, Stefan Grafberger, Ted Dunning: HedgeCut: Maintaining
Randomised Trees for Low-Latency Machine Unlearning, SIGMOD 2021]

 HedgeCut Overview
 Extremely Randomized Trees (ERT): ensemble of DTs w/ randomized
attributes and cut-off points
 Online unlearning requests in < 1ms, w/o retraining for few points
 Handling of non-robust splits

Summary and Conclusions
29

 Model Exchange and Serving


 Model Monitoring and Updates

 #1 Finalize Programming Projects by ~June 30

 #2 Oral Exam
 Doodle for July 7/8/9/12/13, 45min each (done via skype/webex)
 Part 1: Describe your programming project, warm-up questions
 Part 2: Questions on 2-3 topics of the 11 lectures
(basic understanding of the discussed topics / techniques)

