Best practices for implementing machine learning on Google Cloud
This document introduces best practices for implementing machine learning (ML) on Google Cloud, with a focus on custom-trained models based on your
data and code. We provide recommendations on how to develop a custom-trained model throughout the machine learning workflow, including key actions and
links for further reading.
The following diagram gives a high-level overview of the stages in the ML workflow addressed in this document:
1. ML development
2. Data processing
3. Operationalized training
4. Model deployment and serving
5. ML workflow orchestration
6. Artifact organization
7. Model monitoring
This document is not an exhaustive list of recommendations; its goal is to help data scientists and machine learning architects understand the scope of
activities involved in using ML on Google Cloud and plan accordingly. While ML development alternatives like AutoML
(/vertex-ai/docs/start/automl-model-types) are mentioned in Use recommended tools and products (#use-recommended-tools-and-products), this document focuses on
custom-trained models.
Before following the best practices in this document, we recommend that you read Introduction to Vertex AI (/vertex-ai/docs/start/introduction-unified-platform).
This document assumes the following:
You are primarily using Google Cloud services; hybrid and on-premises approaches are not addressed in this document.
You have intermediate-level knowledge of machine learning, big data tools, and data preprocessing, as well as familiarity with Cloud Storage
(/storage/docs), BigQuery (/bigquery/docs), and Google Cloud (/training) fundamentals.
If you are new to machine learning, check out Google's Machine Learning Crash Course (https://developers.google.com/machine-learning/crash-course).
This document references the following tools and products:
Terraform (https://www.terraform.io/)
AutoML (/vertex-ai/docs/training-overview)
BigQuery ML (/bigquery-ml/docs)
TensorFlow (https://www.tensorflow.org/overview)
XGBoost (https://xgboost.readthedocs.io/en/stable/)
scikit-learn (https://scikit-learn.org/stable/)
VM co-hosting (/vertex-ai/docs/predictions/model-co-hosting)
Use the following criteria to choose between BigQuery ML, AutoML, and Vertex AI custom-trained models:

BigQuery ML
Description: BigQuery ML brings together data, infrastructure, and pre-defined model types into a single system.
Use it when: All of your data is contained in BigQuery, and you are comfortable with SQL.

AutoML (in the context of Vertex AI)
Description: AutoML provides training routines for common problems like image classification and tabular regression. Nearly all aspects of training and serving a model, like choosing an architecture, hyperparameter tuning, and provisioning machines, are handled for you.
Use it when:
Your problem fits into one of the types that AutoML supports. See AutoML model types (/vertex-ai/docs/start/automl-model-types) for more information.
Your data matches the format and fits within the limits set by each type of AutoML model. See Prepare training data for use with Vertex AI (/vertex-ai/docs/datasets/datasets).
Your model can be served from Google Cloud or deployed to an external device. See Training an AutoML model using Google Cloud console (/vertex-ai/docs/training/automl-console) and Training an AutoML Edge model using Google Cloud console (/vertex-ai/docs/training/automl-edge-console).
For text, video, or tabular models, your model can tolerate inference latencies greater than 100 ms.
Note: You can also train AutoML tabular models from the BigQuery ML (/bigquery-ml/docs) environment.

Vertex AI custom-trained models
Description: Vertex AI lets you run your own custom training routines and deploy models of any type on serverless architecture. Vertex AI offers additional services, like hyperparameter tuning and monitoring, to make it easier to develop a model. See Choosing a custom training method (/vertex-ai/docs/training/custom-training-methods).
Use it when:
Your problem does not match the criteria listed here for BigQuery ML or AutoML.
You are already running training on-premises or on another cloud platform, and you need consistency across the platforms.
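If BigQuery ML fits your use case, a training run can be as simple as a single SQL statement. The following minimal sketch issues a BigQuery ML CREATE MODEL query from Python; the project, dataset, table, and label column names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# CREATE MODEL runs entirely inside BigQuery, so no data leaves the warehouse.
client.query(
    """
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT * FROM `my_dataset.training_data`
    """
).result()  # blocks until the training query finishes
```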
Best practices:
For more information about managing a notebook instance, see Experimentation with Vertex AI Workbench user-managed notebooks
(/architecture/best-practices-for-ml-performance-cost#experimentation_with_ai_platform_notebooks).
Alternatively, you can use the Google Cloud console, which supports the functionality of Vertex AI as a user interface through the browser.
Best practices:
Prepare training data (#prepare-training-data).
Store structured and semi-structured data in BigQuery (#store-tabular-data-in-bigquery).
Store image, video, audio and unstructured data on Cloud Storage (#store-image-video-audio-and-unstructured-data-on-cloud-storage).
Use Vertex AI Data Labeling for unstructured data (#use-vertex-data-labeling).
Use Vertex AI Feature Store with structured data (#use-vertex-feature-store-with-structured-data).
Avoid storing data in block storage (#avoid-storing-data-in-block-storage).
Use Vertex AI TensorBoard and Vertex AI Experiments for analyzing experiments (#use-tensorboard-and-experiments-to-analyze-experiments).
Train a model within a notebook instance for small datasets (#train-a-model-within-notebooks-for-small-datasets).
Maximize your model's predictive accuracy with hyperparameter tuning (#maximize-your-model's-predictive-accuracy-with-hyperparameter-tuning).
Use a notebook instance to understand your models (#use-notebooks-to-understand-your-models).
Use feature attributions to gain insights into model predictions (#use-feature-attributions-to-gain-insights-into-model-predictions).
Machine learning development addresses preparing the data, experimenting, and evaluating the model. When solving a machine learning problem, it is typically
necessary to build and compare many different models to figure out what works best.
Typically, data scientists train models using different architectures, input data sets, hyperparameters, and hardware. Data scientists evaluate the resulting
models by looking at aggregate performance metrics like accuracy, precision, and recall on test datasets. Finally, data scientists evaluate the performance of
the models against particular subsets of their data, different model versions, and different model architectures.
Regardless of your data's origin, extract the data from its source systems and convert it to a format and storage location (separate from the operational
source) that is optimized for ML training. For more information on preparing training data for use with Vertex AI, see Prepare training data for use with Vertex AI
(/vertex-ai/docs/datasets/datasets).
If you use any other framework (such as PyTorch, XGBoost, or scikit-learn), read data from BigQuery by using the BigQuery Python client library
(/bigquery/docs/bigquery-storage-python-pandas).
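For example, a read from BigQuery into a pandas DataFrame might look like the following minimal sketch; the project and table names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# to_dataframe() uses the BigQuery Storage API for the download when the
# google-cloud-bigquery-storage package is installed, which is considerably
# faster for large result sets.
df = client.query(
    "SELECT * FROM `my-project.my_dataset.training_data`"
).to_dataframe()
```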
Combine many individual images, videos, or audio clips into large files to improve your read and write throughput to Cloud Storage. Aim for files of at
least 100 MB, and between 100 and 10,000 shards.
To enable data management, use Cloud Storage buckets and directories to group the shards. For more information, see What is Cloud Storage?
(/storage/docs/introduction)
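As one way to apply this sharding guidance, the following sketch packs many small records into a fixed number of TFRecord shards written directly to Cloud Storage. The bucket path is a hypothetical placeholder, and load_examples() stands in for your real data source.

```python
import tensorflow as tf

NUM_SHARDS = 100  # size the shard count so each file is at least ~100 MB

def load_examples():
    """Stand-in for your real data source (images, clips, and labels)."""
    for i in range(1000):
        yield b"\x00" * 1024, i % 2

def serialize_example(blob: bytes, label: int) -> bytes:
    features = tf.train.Features(feature={
        "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[blob])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    return tf.train.Example(features=features).SerializeToString()

# tf.io supports gs:// paths directly, so shards land in Cloud Storage.
writers = [
    tf.io.TFRecordWriter(
        f"gs://my-bucket/train/shard-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
    for i in range(NUM_SHARDS)
]
for i, (blob, label) in enumerate(load_examples()):
    writers[i % NUM_SHARDS].write(serialize_example(blob, label))
for w in writers:
    w.close()
```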
1. Check whether a suitable feature already exists:
a. Open Vertex AI Feature Store and search to see whether a feature already exists that relates to your use case or covers the signal that you're
interested in passing to the model.
b. If there are features in Vertex AI Feature Store that you want to use, fetch them for your training labels by using Vertex AI Feature Store's
batch serving capability (/vertex-ai/docs/featurestore/serving-batch), as sketched after this list.
2. Create a new feature (/vertex-ai/docs/featurestore/managing-features#create-feature). If Vertex AI Feature Store doesn't have the features you need, create a
new feature using data from your data lake.
a. Fetch raw data from your data lake and write your scripts to perform the necessary feature processing and engineering.
b. Join the feature values you fetched from Vertex AI Feature Store and the new feature values that you created from the data lake. Merging those
feature values produces the training data set.
c. Set up a periodic job to compute updated values of the new feature. Once you determine that a feature is useful and you want to put it into
production, set up a regularly scheduled job with the required cadence to compute updated values of that feature and ingest it into Vertex AI
Feature Store. By adding your new feature to Vertex AI Feature Store, you automatically have a solution to do online serving of the features (for
online prediction use cases), and you can share your feature with others in the organization that may get value from it for their own ML models.
Note: Cadence is an important consideration: depending on the feature, some values may need to be refreshed every three hours, while others only need to be
refreshed every day or every week.
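The batch serving call referenced in step 1b might look like the following minimal sketch, assuming an existing featurestore named my_featurestore with a users entity type; all resource, entity, and feature IDs here are hypothetical.

```python
import pandas as pd
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

fs = aiplatform.Featurestore(featurestore_name="my_featurestore")

# One row per training label: the entity ID plus the label's event timestamp,
# so feature values are fetched as of the correct point in time.
read_instances = pd.DataFrame({
    "users": ["user_1", "user_2"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

training_df = fs.batch_serve_to_df(
    serving_feature_ids={"users": ["age", "country"]},
    read_instances_df=read_instances,
)
```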
Use Vertex AI Experiments to integrate with Vertex ML Metadata and to log and build linkage across parameters, metrics, and dataset and model artifacts.
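For instance, logging a run with Vertex AI Experiments can be a thin wrapper around your training loop, as in this minimal sketch; the experiment and run names, parameters, and metric values are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="churn-experiments",  # hypothetical experiment name
)

with aiplatform.start_run("run-lr-0-01") as run:
    run.log_params({"learning_rate": 0.01, "architecture": "wide-and-deep"})
    # ... train and evaluate the model here ...
    run.log_metrics({"accuracy": 0.92, "auc_roc": 0.95})
```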
To learn more about hyperparameter tuning, see Overview of hyperparameter tuning (/vertex-ai/docs/training/hyperparameter-tuning-overview) and Using
hyperparameter tuning (/vertex-ai/docs/training/using-hyperparameter-tuning).
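A tuning job submission might look like the following sketch, assuming a training script (task.py) that reports a metric named accuracy via the cloudml-hypertune library; the script path, container image, and parameter ranges are illustrative assumptions.

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project="my-project", location="us-central1")

# Wrap the training script in a CustomJob; the image URI is illustrative.
custom_job = aiplatform.CustomJob.from_local_script(
    display_name="trainer",
    script_path="task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest",
)

hp_job = aiplatform.HyperparameterTuningJob(
    display_name="hp-tuning",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale=None),
    },
    max_trial_count=20,
    parallel_trial_count=4,
)
hp_job.run()
```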
Vertex Explainable AI supports custom-trained models based on tabular and image data.
Data processing
Best practices:
The recommended approach for processing your data depends on the framework and data types you're using. This section provides high-level
recommendations for common scenarios.
For general recommendations on data engineering and feature engineering for ML, see Data preprocessing for machine learning: options and
recommendations (/architecture/data-preprocessing-for-ml-with-tf-transform-pt1) and Data preprocessing for machine learning using TensorFlow Transform
(/architecture/data-preprocessing-for-ml-with-tf-transform-pt2).
If you need to perform transformations that are not expressible in SQL, or that are for streaming data, you can use a combination of Dataflow and the pandas
(https://pandas.pydata.org/) library.
Managed datasets are not required; you may choose not to use them if you want more control over splitting your data in your training code, or if lineage
between your data and model isn't critical to your application.
For more information, see Datasets (/vertex-ai/docs/datasets/datasets) and Using a managed dataset in a custom training application
(/vertex-ai/docs/training/using-managed-datasets).
Operationalized training
Best practices:
Operationalized training refers to the process of making model training repeatable, tracking repetitions, and managing performance. While Vertex AI
Workbench notebooks are convenient for iterative development on small datasets, we recommend that you operationalize your code to make it reproducible
and scale to large datasets. In this section, we discuss tooling and best practices for operationalizing your training routines.
Optionally, you can run your code directly in a Deep Learning Virtual Machine (/deep-learning-vm/docs) container or on Compute Engine (/compute); however, we
don't recommend this approach because the Vertex AI managed services provide automatic scaling and burst capability that is more cost effective.
To learn more about checkpoints, see Training checkpoints (https://www.tensorflow.org/guide/checkpoint) for TensorFlow Core, Saving and loading a General
Checkpoint in PyTorch (https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html), and Machine Learning Design Patterns
(https://www.oreilly.com/library/view/machine-learning-design/9781098115777/).
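With Keras, periodic checkpointing to Cloud Storage can be a single callback, as in this minimal sketch; the bucket path is a hypothetical placeholder.

```python
import tensorflow as tf

# Write a checkpoint at the end of every epoch so a preempted or failed job
# can resume instead of restarting from scratch.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="gs://my-bucket/checkpoints/ckpt-{epoch:02d}",
    save_weights_only=True,
)
# model.fit(train_ds, epochs=50, callbacks=[checkpoint_cb])

# On restart, resume from the most recent checkpoint if one exists.
latest = tf.train.latest_checkpoint("gs://my-bucket/checkpoints")
# if latest:
#     model.load_weights(latest)
```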
Use a Cloud Storage bucket in the same Google Cloud project. If your Cloud Storage bucket is in a different Google Cloud project, you need to grant Vertex
AI access (/vertex-ai/docs/general/access-control#foreign-project) to read your model artifacts.
If you're using a Vertex AI pre-built container, ensure that your model artifacts have filenames that exactly match these examples:
TensorFlow SavedModel: saved_model.pb
scikit-learn: model.joblib or model.pkl
XGBoost: model.bst
PyTorch: model.pth
To learn how to save your model in the form of one or more model artifacts, see Exporting model artifacts for prediction
(/vertex-ai/docs/training/exporting-model-artifacts).
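For example, with the pre-built XGBoost container the artifact must be named model.bst. The following sketch trains a toy booster, saves it under that exact filename, and uploads it to a hypothetical bucket.

```python
import numpy as np
import xgboost as xgb
from google.cloud import storage

# Toy stand-in training data; in practice the booster comes from your real job.
dtrain = xgb.DMatrix(np.random.rand(100, 3), label=np.random.randint(0, 2, 100))
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)

# The pre-built XGBoost serving container expects this exact filename.
booster.save_model("model.bst")

# Upload to Cloud Storage (project and bucket names are hypothetical).
storage.Client(project="my-project").bucket("my-bucket") \
    .blob("models/churn/model.bst").upload_from_filename("model.bst")
```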
Model deployment and serving
Best practices:
Model deployment and serving refers to putting a model into production. The output of the training job is one or more model artifacts stored on Cloud Storage,
which you can upload to Vertex AI Model Registry so that the files can be used for prediction serving. There are two types of prediction serving: batch prediction is
used to score batches of data at a regular cadence, and online prediction is used for near real-time scoring of data for live applications. Both approaches let
you obtain predictions from trained models by passing input data to a cloud-hosted ML model and getting inferences for each data instance. To learn more,
see Getting batch predictions (/vertex-ai/docs/predictions/batch-predictions) and Get online predictions from custom-trained models
(/vertex-ai/docs/predictions/get-predictions#get_online_predictions).
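End to end, registering a model and requesting both kinds of predictions can look like the following sketch; the artifact URI, serving container image, and instance payloads are illustrative assumptions.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Register the trained artifacts in Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-7:latest"
    ),
)

# Online prediction: deploy to an endpoint and score live requests.
endpoint = model.deploy(machine_type="n1-standard-4")
print(endpoint.predict(instances=[[0.1, 0.5, 0.3]]))

# Batch prediction: score a file of instances at a regular cadence instead.
model.batch_predict(
    job_display_name="churn-batch",
    gcs_source="gs://my-bucket/batch_inputs/instances.jsonl",
    gcs_destination_prefix="gs://my-bucket/batch_outputs/",
)
```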
For lower latency on peer-to-peer requests between the client and the model server, use Vertex AI private endpoints
(/vertex-ai/docs/predictions/using-private-endpoints). Private endpoints are particularly useful if the application that makes the prediction requests and the serving binary are
within the same local network. They let you avoid the overhead of internet routing and make a peer-to-peer connection using Virtual Private Cloud
(/vpc/docs/overview).
Streaming ingestion lets you make real-time updates to feature values. This method is useful when having the latest available data for online serving is a
priority. For example, you can ingest streaming event data and, within a few seconds, Vertex AI Feature Store streaming ingestion
(/vertex-ai/docs/featurestore/ingesting-stream) makes that data available for online serving scenarios.
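A streaming write can be a single call per entity, as in this hedged sketch; the featurestore, entity type, entity ID, and feature values are hypothetical.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

entity_type = aiplatform.featurestore.EntityType(
    entity_type_name="users", featurestore_id="my_featurestore"
)

# The written values become available for online serving within seconds.
entity_type.write_feature_values(
    instances={"user_1": {"age": 31, "country": "US"}}
)
```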
You can additionally customize the input (request) and output (response) handling and format to and from your model server using custom prediction routines
(/vertex-ai/docs/predictions/custom-prediction-routines).
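A custom prediction routine is, roughly, a Predictor subclass whose hooks you override. The sketch below assumes a scikit-learn model saved as model.joblib and shows illustrative pre- and post-processing; the scaling step and response shape are assumptions, not requirements.

```python
import joblib
import numpy as np
from google.cloud.aiplatform.prediction.predictor import Predictor
from google.cloud.aiplatform.utils import prediction_utils

class MyPredictor(Predictor):
    def load(self, artifacts_uri: str) -> None:
        # Copy the model artifacts from Cloud Storage into the container.
        prediction_utils.download_model_artifacts(artifacts_uri)
        self._model = joblib.load("model.joblib")

    def preprocess(self, prediction_input: dict) -> np.ndarray:
        # Custom request handling, e.g., rescaling raw inputs.
        return np.asarray(prediction_input["instances"]) / 255.0

    def predict(self, instances: np.ndarray) -> np.ndarray:
        return self._model.predict(instances)

    def postprocess(self, prediction_results: np.ndarray) -> dict:
        # Custom response format.
        return {"predictions": prediction_results.tolist()}
```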
To learn more about scaling options, see Scaling machine learning predictions (/blog/products/ai-machine-learning/scaling-machine-learning-predictions).
ML workflow orchestration
Best practices:
Vertex AI provides ML workflow orchestration to automate the ML workflow with Vertex AI Pipelines (/vertex-ai/docs/pipelines), a fully managed service that
allows you to retrain your models as often as necessary. While retraining enables your models to adapt to changes and maintain performance over time,
consider how much your data will change when choosing the optimal model retraining cadence.
ML orchestration workflows work best for customers who have already designed and built their model, put it into production, and want to determine what is
and isn't working in the ML model. The code you use for experimentation will likely be useful for the rest of the ML workflow with some modification. To work
with automated ML workflows, you need to be fluent in Python, understand basic infrastructure like containers, and have ML and data science knowledge.
Vertex AI Pipelines supports running DAGs generated by Kubeflow, TensorFlow Extended (TFX), and Airflow.
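For example, a Kubeflow Pipelines (KFP) definition can be compiled and submitted to Vertex AI Pipelines as follows; the component body, pipeline root bucket, and names are hypothetical placeholders.

```python
from kfp import compiler, dsl
from google.cloud import aiplatform

@dsl.component
def train_model(learning_rate: float) -> str:
    # Placeholder training step for illustration only.
    return f"trained with lr={learning_rate}"

@dsl.pipeline(name="retraining-pipeline")
def retraining_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Compile the DAG, then hand it to the managed Vertex AI Pipelines service.
compiler.Compiler().compile(retraining_pipeline, "pipeline.json")

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="retraining",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",  # hypothetical bucket
)
job.run()
```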
Artifact organization
Best practices:
Artifacts are outputs resulting from each step in the ML workflow. It is a best practice to organize them in a standardized way. For example, track the following artifacts:
Preprocessing functions
Serving functions
Parameters
Hyperparameters
Metaparameters
Metrics
Dataset artifacts
Model artifacts
Pipeline metadata
Use a source control repository for pipeline definitions and training code
You can use source control to version control your ML pipelines and the custom components you build for those pipelines. Use Artifact Registry
(/artifact-registry/docs/docker/quickstart) to store, manage, and secure your Docker container images without making them publicly visible.
Model monitoring
Best practices:
Once you deploy your model into production, you need to monitor performance to ensure that the model is performing as expected. Vertex AI provides two
ways to monitor your ML models:
Skew detection: This approach looks for the degree of distortion between your model training and production data.
Drift detection: In this type of monitoring, you're looking for drift in your production data. Drift occurs when the statistical properties of the inputs and the
target, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions could become less
accurate as time passes.
Model monitoring works for structured data, like numerical and categorical features, but not for unstructured data, like images. For more information, see
Monitoring models for feature skew or drift (/vertex-ai/docs/model-monitoring/overview).
If you do not have access to the training data, turn on drift detection so that you'll know when the inputs change over time.
Use drift detection to monitor whether your production data is deviating over time. For drift detection, enable the features you want to monitor and the
corresponding thresholds to trigger an alert.
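Configuring such a monitoring job from the Vertex AI SDK might look like the following hedged sketch; the endpoint resource name, monitored features, thresholds, sampling rate, and alert address are all hypothetical.

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="my-project", location="us-central1")

# Alert when the distribution of a monitored feature drifts past its threshold.
objective = model_monitoring.ObjectiveConfig(
    drift_detection_config=model_monitoring.DriftDetectionConfig(
        drift_thresholds={"age": 0.3, "country": 0.3}
    )
)

aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="churn-monitoring",
    endpoint="projects/123/locations/us-central1/endpoints/456",
    objective_configs=objective,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=6),  # hours
    alert_config=model_monitoring.EmailAlertConfig(
        user_emails=["ml-team@example.com"]
    ),
)
```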
Note: You can use feature attributions to detect model degradation regardless of the type of feature your model takes as input.
This is particularly useful for complex feature types, like embeddings and time series, which are difficult to compare using traditional skew and drift methods.
With Vertex Explainable AI, feature attributions can indicate when model performance is degrading.
What's next
Vertex AI documentation (/vertex-ai/docs)
Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning (/resources/mlops-whitepaper)
Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center (/architecture).