The ML Test Score: A Rubric For ML Production Readiness and Technical Debt Reduction
Abstract—Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems, to help quantify these issues and present an easy to follow road-map to improve production readiness and pay down ML technical debt.

Keywords—Machine Learning, Testing, Monitoring, Reliability, Best Practices, Technical Debt

I. INTRODUCTION

As machine learning (ML) systems continue to take on ever more central roles in real-world production settings, the issue of ML reliability has become increasingly critical. ML reliability involves a host of issues not found in small toy examples or even large offline experiments, which can lead to surprisingly large amounts of technical debt [1]. Testing and monitoring are important strategies for improving reliability, reducing technical debt, and lowering long-term maintenance cost. However, as suggested by Figure 1, ML system testing is also a more complex challenge than testing manually coded systems, because ML system behavior depends strongly on data and models that cannot be strongly specified a priori. One way to see this is to consider ML training as analogous to compilation, where the source is both code and training data. By that analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks and monitoring.

So, what should be tested, and how much testing is enough? In this paper, we try to answer this question with a test rubric, based on decades of experience engineering production-level ML systems at Google, in systems such as ad click prediction [2] and the Sibyl ML platform [3].

We present the rubric as a set of 28 actionable tests, and offer a scoring system to measure how ready for production a given machine learning system is. The rubric is intended to cover a range from a team just starting out with machine learning up through tests that even a well-established team may find difficult. Note that this rubric focuses on issues specific to ML systems, and so does not include generic software engineering best practices such as ensuring good unit test coverage and a well-defined binary release process. Such strategies remain necessary as well. We do call out a few specific areas for unit or integration tests that have unique ML-related behavior.

How to read the tests: Each test is written as an assertion; our recommendation is to test that the assertion is true, the more frequently the better, and to fix the system if the assertion is not true.

Doesn't this all go without saying?: Before we enumerate our suggested tests, we should address one objection the reader may have – obviously one should write tests for an engineering project! While this is true in principle, in a survey of several dozen teams at Google, none of these tests was implemented by more than 80% of teams (though, even in an engineering culture that values rigorous testing, many of these ML-centric tests are non-obvious). Conversely, most tests had a nonzero score for at least half of the teams surveyed; our tests do represent practices that teams find to be worth doing.

In this paper, we are largely concerned with supervised ML systems that are trained continuously online and perform rapid, low-latency inference on a server. Features are often derived from large amounts of data such as streaming logs of incoming data. However, most of our recommendations apply to other forms of ML systems, such as infrequently trained models pushed to client-side systems for inference.

A. Related work

Software testing is well studied, as is machine learning, but their intersection has been less well explored in the literature. [4] reviews testing for scientific software more generally, and cites a number of articles such as [5], who present an approach for testing ML algorithms. These ideas are a useful complement to the tests we present, which are focused on testing the use of ML in a production system rather than just the correctness of the ML algorithm per se.

Zinkevich provides extensive advice on building effective machine learning models in real world systems [6]. Those rules are complementary to this rubric, which is more concerned with determining how reliable an ML system is rather than how to build one.
Figure 1. ML Systems Require Extensive Testing and Monitoring. The key consideration is that unlike a manually coded system (left), ML-based
system behavior is not easily specified in advance. This behavior depends on dynamic qualities of the data, and on various model configuration choices.
Issues of surprising sources of technical debt in ML systems have been studied before [1]. It has been noted that this prior work identified problems but was largely silent on how to address them; this paper details actionable advice drawn from practice and verified with extensive interviews with the maintainers of 36 real world systems.

II. TESTS FOR FEATURES AND DATA

Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.

Table I
BRIEF LISTING OF THE SEVEN DATA TESTS
1 Feature expectations are captured in a schema.
2 All features are beneficial.
3 No feature's cost is too much.
4 Features adhere to meta-level requirements.
5 The data pipeline has appropriate privacy controls.
6 New features can be added quickly.
7 All input feature code is tested.

Data 1: Feature expectations are captured in a schema: It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably 'the', with other word frequencies following a power-law distribution. Such expectations can be used for tests on input data during training and serving (see test Monitor 2).

How? To construct the schema, one approach is to start by calculating statistics from training data, and then adjust them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data, to avoid an anchoring bias. Visualization tools such as Facets (https://pair-code.github.io/facets/) can be very useful for analyzing the data to produce the schema. Invariants to capture in a schema can also be inferred automatically from your system's behavior [8].
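As a concrete illustration, the following is a minimal sketch of such a schema check in Python; the feature names, ranges, and input format are hypothetical, and a production pipeline would more likely rely on a dedicated data-validation library than on hand-rolled assertions. The same check can be run over serving inputs (see Monitor 2).

```python
# Hypothetical schema derived from training-data statistics plus domain knowledge.
SCHEMA = {
    "height_feet": {"dtype": float, "min": 1.0, "max": 10.0, "allow_missing": False},
    "country_code": {"dtype": str, "vocabulary": {"US", "DE", "IN", "BR"}, "allow_missing": True},
}

def validate_batch(raw_examples, schema=SCHEMA):
    """Return a list of human-readable schema violations for a batch of dict-like examples."""
    violations = []
    for i, example in enumerate(raw_examples):
        for name, spec in schema.items():
            value = example.get(name)
            if value is None:
                if not spec["allow_missing"]:
                    violations.append(f"example {i}: missing required feature '{name}'")
                continue
            if not isinstance(value, spec["dtype"]):
                violations.append(f"example {i}: '{name}' has type {type(value).__name__}")
                continue
            if "min" in spec and not (spec["min"] <= value <= spec["max"]):
                violations.append(f"example {i}: '{name}'={value} outside [{spec['min']}, {spec['max']}]")
            if "vocabulary" in spec and value not in spec["vocabulary"]:
                violations.append(f"example {i}: '{name}'={value!r} not in expected vocabulary")
    return violations

# Example usage, e.g. inside a unit test or as a pipeline precondition:
batch = [{"height_feet": 5.9, "country_code": "US"}, {"height_feet": 42.0}]
assert validate_batch(batch) == ["example 1: 'height_feet'=42.0 outside [1.0, 10.0]"]
```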
Data 2: All features are beneficial: A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it's important to understand the value each feature provides in additional predictive power (independent of other features).

How? Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
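A rough sketch of the leave-one-feature-out variant is shown below, using scikit-learn with a synthetic feature matrix standing in for real data; in practice the candidate model, metric, and data splits would be those of your own pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def feature_benefit_report(X, y, feature_names, seed=0):
    """Estimate each feature's marginal value by retraining with that feature removed."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=seed)

    def auc_with_columns(cols):
        model = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
        return roc_auc_score(y_test, model.predict_proba(X_test[:, cols])[:, 1])

    all_cols = list(range(X.shape[1]))
    baseline = auc_with_columns(all_cols)
    report = {}
    for i, name in enumerate(feature_names):
        ablated = [c for c in all_cols if c != i]
        report[name] = baseline - auc_with_columns(ablated)  # drop in AUC when this feature is removed
    return baseline, report

# Usage with synthetic data standing in for real training data:
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = (X[:, 0] + 0.1 * rng.rand(500) > 0.5).astype(int)  # only feature 0 is informative
baseline, report = feature_benefit_report(X, y, ["f0", "f1", "f2"])
print(baseline, report)  # expect a large drop for "f0", near-zero drops for "f1" and "f2"
```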
Data 3: No feature's cost is too much: It is not only a waste of computing resources, but also an ongoing maintenance burden, to include features that add only minimal predictive benefit [1].

How? To measure the costs of a feature, consider not only added inference latency and RAM usage, but also more upstream data dependencies, and the additional expected instability incurred by relying on that feature. See Rule #22 of [6] for further discussion.

Data 4: Features adhere to meta-level requirements: Your project may impose requirements on the data coming in to the system. It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated. It might require that all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.

How? Programmatically enforce these requirements, so that all models in production properly adhere to them.

Data 5: The data pipeline has appropriate privacy controls: Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams are often aware of the need to remove personally identifiable information (PII), during this kind of exporting and transformation, programming errors and system changes can lead to inadvertent PII leakages that may have serious consequences.
How? Make sure to budget sufficient time during new feature development that depends on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as the access to raw user data, especially for data sources that haven't previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline, and to any learned models.

Data 6: New features can be added quickly: The faster a team can go from a feature idea to the feature running in production, the faster it can both improve the system and respond to external changes. For highly efficient teams, this can be as little as one to two months even for global-scale, high-traffic ML systems. Note that this can be in tension with Data 5, but privacy should always take precedence.

Data 7: All input feature code is tested: Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
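For instance, a unit test for a small feature-generation function might look like the sketch below; the bucketing function is an invented example, not something from this rubric, but the pattern of pinning down boundary and missing-value cases is what catches the kinds of bugs described above.

```python
def age_bucket(age):
    """Hypothetical feature transform: map an age in years to a coarse bucket id."""
    if age is None or age < 0:
        return -1          # sentinel bucket for missing/invalid values
    if age < 18:
        return 0
    if age < 65:
        return 1
    return 2

def test_age_bucket_edge_cases():
    # Boundary values are exactly where off-by-one feature bugs hide,
    # and such bugs silently poison both training and test data.
    assert age_bucket(None) == -1
    assert age_bucket(-3) == -1
    assert age_bucket(0) == 0
    assert age_bucket(17) == 0
    assert age_bucket(18) == 1
    assert age_bucket(64) == 1
    assert age_bucket(65) == 2

test_age_bucket_edge_cases()
```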
III. TESTS FOR MODEL DEVELOPMENT

While the field of software engineering has developed a full range of best practices for developing reliable software systems, similar best practices for ML model development are still emerging.

Table II
BRIEF LISTING OF THE SEVEN MODEL TESTS
1 Model specs are reviewed and submitted.
2 Offline and online metrics correlate.
3 All hyperparameters have been tuned.
4 The impact of model staleness is known.
5 A simpler model is not better.
6 Model quality is sufficient on important data slices.
7 The model is tested for considerations of inclusion.

Model 1: Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one's own personal modifications. In addition, when responding to production incidents, it's crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.

Model 2: Offline proxy metrics correlate with actual online impact metrics: A user-facing production system's impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better scoring model will result in a better production system.

How? The offline/online metric relationship can be measured in one or more small scale A/B experiments using an intentionally degraded model.

Model 3: All hyperparameters have been tuned: An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes, and regularization coefficients. The choice of hyperparameter values can have a dramatic impact on prediction quality.

How? Methods such as grid search [9] or a more sophisticated hyperparameter search strategy [10], [11] not only improve prediction quality, but also can uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through use of an internal hyperparameter tuning service [12], which is closely related to HyperTune [13].
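A minimal grid-search sketch using scikit-learn is shown below; the estimator, parameter grid, and synthetic data are placeholders, and a production system might instead call a dedicated tuning service as mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set.
rng = np.random.RandomState(42)
X = rng.randn(400, 5)
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hypothetical grid; a real model spec would expose more knobs (learning rates,
# layer sizes, ...) and might use a smarter search strategy than a full grid.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # inverse regularization strength

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,
    scoring="neg_log_loss",
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best cross-validated log-loss:", -search.best_score_)
```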
Model 4: The impact of model staleness is known: Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates (see Rule 8 in [6] for related discussion).

How? One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.

Model 5: A simpler model is not better: Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.

Model 6: Model quality is sufficient on all important data slices: Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country, users by frequency of use, or movies by genre. Examining sliced data avoids having fine-grained quality issues masked by a global summary metric, e.g. global accuracy improved by 1% but accuracy for one country dropped by 50%. This class of problems often arises from a fault in the collection of training data that caused an important set of training data to be lost or late.
How? Consider including these tests in your release process, e.g. release tests for models can impose absolute thresholds (e.g., error for slice x must be <5%) to catch large drops in quality, as well as incremental thresholds (e.g., the change in error for slice x must be <1% compared to the previously released model).
Model 7: The model has been tested for considerations of inclusion: There have been a number of recent studies on the issue of ML fairness [14], [15], which may arise inadvertently due to factors such as the choice of training data. For example, Bolukbasi et al. found that a word embedding trained on news articles had learned some striking associations between gender and occupation that may have reflected the content of the news articles but which may have been inappropriate for use in a predictive modeling context [14]. This form of potentially overlooked bias in training data sets may then influence the larger system's behavior.

How? Diagnosing such issues is an important step for creating robust modeling systems that serve all users well. Tests that can be run include examining input features to determine if they correlate strongly with protected user categories, and slicing predictions to determine if prediction outputs differ materially when conditioned on different user groups.

Bolukbasi et al. [14] propose one method for ameliorating such effects by projecting embeddings to spaces that collapse differences along certain protected dimensions. Hardt et al. [15] propose a post-processing step in model creation to minimize disproportionate loss for certain groups. Finally, the approach of collecting more data to ensure data representation for potentially under-represented categories or subgroups can be effective in many cases.

IV. TESTS FOR ML INFRASTRUCTURE

An ML system often relies on a complex pipeline rather than a single running binary.

Table III
BRIEF LISTING OF THE ML INFRASTRUCTURE TESTS
1 Training is reproducible.
2 Model specs are unit tested.
3 The ML pipeline is integration tested.
4 Model quality is validated before serving.
5 The model is debuggable.
6 Models are canaried before serving.
7 Serving models can be rolled back.

Infra 1: Training is reproducible: Ideally, training twice on the same data should produce two identical models. Deterministic training dramatically simplifies reasoning about the whole system and can aid auditability and debugging. For example, optimizing feature generation code is a delicate process, but verifying that the old and new feature generation code will train to an identical model can provide more confidence that the refactoring was correct. This sort of diff-testing relies entirely on deterministic training.

Unfortunately, model training is often not reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. This can manifest as a change in aggregate metrics across an entire dataset, or, even if the aggregate performance appears the same from run to run, as changes on individual examples. Random number generation is an obvious source of non-determinism, which can be alleviated with seeding. But even with proper seeding, initialization order can be underspecified, so that different portions of the model are initialized at different times on different runs, leading to non-determinism. Furthermore, even when initialization is fully deterministic, multiple threads of execution on a single machine or across a distributed system [16] may be subject to unpredictable orderings of training data, which is another source of non-determinism.

How? Besides working to remove nondeterminism as discussed above, ensembling models can help.
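As an illustration, a reproducibility check might train the same small model twice with fixed seeds and diff the resulting weights. The sketch below uses tf.keras (set_random_seed is available in recent TensorFlow releases); the tolerance and model-building function are assumptions of this sketch, and on accelerators additional sources of nondeterminism remain.

```python
import numpy as np
import tensorflow as tf

def train_small_model(seed, x, y):
    # Fix the framework-level seeds; real pipelines also need deterministic
    # data ordering and, on accelerators, deterministic kernels.
    tf.keras.utils.set_random_seed(seed)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(x.shape[1],)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    model.fit(x, y, epochs=2, batch_size=16, shuffle=False, verbose=0)
    return model.get_weights()

def test_training_is_reproducible():
    rng = np.random.RandomState(0)
    x = rng.rand(64, 4).astype("float32")
    y = rng.rand(64, 1).astype("float32")
    w1 = train_small_model(seed=123, x=x, y=y)
    w2 = train_small_model(seed=123, x=x, y=y)
    # Two runs with identical code, data, and seeds should yield identical weights.
    for a, b in zip(w1, w2):
        np.testing.assert_allclose(a, b, rtol=0, atol=1e-6)

test_training_is_reproducible()
```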
Infra 2: Model specification code is unit tested: Although model specifications may seem like "configuration", such files can have bugs and need to be tested. Unfortunately, testing a model specification can be very hard. Unit tests should run quickly and require no external dependencies, but model training is often a very slow process that involves pulling in lots of data from many sources.

How? It's useful to distinguish two kinds of model tests: tests of API usage and tests of algorithmic correctness. We plan to release an open source framework implementing some of these tests soon.

ML APIs can be complex, and code using them can be wrong in subtle ways. Even if code errors would be apparent after training (due to a model that fails to train or results in poor performance), training is expensive and so the development loop is slow. We have found in practice that a simple unit test that generates random input data and trains the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle. Another useful assertion is that a model can restore from a checkpoint after a mid-training job crash.
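The single-step test described above might look roughly like the following tf.keras sketch; the model-building function stands in for your model specification code, and the exact API calls are an assumption of this example rather than a prescribed framework.

```python
import numpy as np
import tensorflow as tf

def build_model(num_features):
    # Stand-in for the real model specification under test.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(num_features,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    return model

def test_model_trains_single_step():
    # Random data is enough to surface shape mismatches, dtype errors,
    # and miswired layers without running a full (slow, expensive) training job.
    rng = np.random.RandomState(0)
    x = rng.rand(8, 4).astype("float32")
    y = rng.randint(0, 2, size=(8, 1)).astype("float32")

    model = build_model(num_features=4)
    loss = model.train_on_batch(x, y)  # a single step of gradient descent
    assert np.isfinite(loss), "loss is NaN/Inf after one training step"

    # Also check that a step actually changes some trainable weights.
    before = [w.copy() for w in model.get_weights()]
    model.train_on_batch(x, y)
    after = model.get_weights()
    assert any(not np.allclose(b, a) for b, a in zip(before, after))

test_model_trains_single_step()
```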
Testing correctness of a novel implementation of an ML algorithm is more difficult, but still necessary – it is not sufficient that the code produces a model with high quality predictions; it must do so for the expected reasons. One solution is to make assertions that specific subcomputations of the algorithm are correct, e.g. that a specific part of an RNN was executed exactly once per element of the input sequence. Another solution involves not training to completion in the unit test, but only training for a few iterations and verifying that loss decreases with training. Still another is to purposefully train a model to overfit: if one can get a model to effectively memorize its training data, then that provides some confidence that learning reliably happens. When testing models, pains should be taken to avoid "golden tests", i.e., tests that partially train a model and compare the results to a previously generated model – such tests are difficult to maintain over time without blindly updating the golden file. In addition to problems with training non-determinism, when these tests do break they provide very little insight into how or why. Additionally, flaky tests remain a real danger here.
Infra 3: The full ML pipeline is integration tested: A complete ML pipeline typically consists of assembling training data, feature generation, model training, model verification, and deployment to a serving system. Although a single engineering team may be focused on a small part of the process, each stage can introduce errors that may affect subsequent stages, possibly even several stages away. That means there must be a fully automated test that runs regularly and exercises the entire pipeline, validating that data and code can successfully move through each stage and that the resulting model performs well.

How? The integration test should run both continuously as well as with new releases of models or servers, in order to catch problems well before they reach production. Faster running integration tests with a subset of training data or a simpler model can give faster feedback to developers, while still backed by less frequent, long running versions with a setup that more closely mirrors production.

Infra 4: Model quality is validated before attempting to serve it: After a model is trained but before it actually affects real traffic, an automated system needs to inspect it and verify that its quality is sufficient; that system must either bless the model or veto it, terminating its entry to the production environment.

How? It is important to test for both slow degradations in quality over many versions as well as sudden drops in a new version. For the former, setting loose thresholds and comparing against predictions on a validation set can be useful; for the latter, it is useful to compare predictions to the previous version of the model while setting tighter thresholds.
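A bare-bones sketch of such a validation gate is shown below; the metric names, thresholds, and input dictionaries are hypothetical placeholders for whatever your pipeline actually produces.

```python
def validate_candidate(candidate_metrics, previous_metrics, baseline_metrics):
    """Bless or veto a candidate model before it is pushed to serving.

    Each argument is a dict such as {"log_loss": 0.31, "auc": 0.92}
    computed on the same validation set.
    """
    reasons = []

    # Loose absolute thresholds against a long-term baseline catch slow decay
    # accumulated over many releases.
    if candidate_metrics["log_loss"] > baseline_metrics["log_loss"] * 1.10:
        reasons.append("log_loss more than 10% worse than long-term baseline")

    # Tight relative thresholds against the immediately previous model catch
    # sudden drops introduced by a single release.
    if candidate_metrics["auc"] < previous_metrics["auc"] - 0.005:
        reasons.append("AUC dropped more than 0.005 vs. previous model")

    blessed = not reasons
    return blessed, reasons

# Example usage with made-up numbers:
ok, why = validate_candidate(
    candidate_metrics={"log_loss": 0.34, "auc": 0.913},
    previous_metrics={"log_loss": 0.33, "auc": 0.921},
    baseline_metrics={"log_loss": 0.30, "auc": 0.92},
)
print(ok, why)  # -> False, with both veto reasons listed
```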
Infra 5: The model allows debugging by observing the step-by-step computation of training or inference on a single example: When someone finds a case where a model is behaving bizarrely, how difficult is it to figure out why? Is there an easy, well documented process for feeding a single example to the model and investigating the computation through each stage of the model (e.g. each internal node of a neural network)?

Observing the step-by-step computation through the model on small amounts of data is an especially useful debugging strategy for issues like numerical instability.

How? An internal tool that allows users to enter examples and see how a specific model version interprets them can be very helpful. The TensorFlow debugger [17] is one example of such a tool.

Figure 2. Importance of a Model Canary before Serving. It is possible for models to incorporate new pieces of code that are not live in separate serving binaries, causing havoc at serving time. Using small scale canary processes can help protect against this.

Infra 6: Models are tested via a canary process before they enter production serving environments: Offline testing, however extensive, cannot by itself guarantee that the model will perform well in live production settings, as the real world often contains significant non-stationarity or other issues that limit the utility of historical data. Consequently, there is always some risk when turning on a new model in production.

One recurring problem that canarying can help catch is a mismatch between model artifacts and serving infrastructure. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. For example, as shown in Figure 2, a refactoring in the core learning library might change the low-level implementation of an operation Op in the model from Op0.1 to a more efficient implementation, Op0.2. A newly trained model will thus expect to be implemented with Op0.2; an older deployed server will not include Op0.2 and so will refuse to load the model.

How? To mitigate the mismatch issue, one approach is testing that a model successfully loads into production serving binaries and that inference on production input data succeeds. To mitigate the new-model risk more generally, one can turn up new models gradually, running old and new models concurrently, with new models only seeing a small fraction of traffic that is gradually increased as the new model is observed to behave sanely.
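The load-and-infer part of a canary could be as simple as the sketch below; it assumes a Keras-serialized model file and a file of sampled, logged production requests, both of which are hypothetical, and a real canary would run against the actual serving binary rather than in-process.

```python
import json
import numpy as np
import tensorflow as tf

def canary_check(model_path, sample_requests_path, batch_size=32):
    """Verify that a freshly trained model loads and serves sampled production inputs."""
    # 1. The serving environment must be able to load the new artifact at all.
    model = tf.keras.models.load_model(model_path)

    # 2. Inference on real (logged) production inputs must succeed and be well-formed.
    with open(sample_requests_path) as f:
        requests = [json.loads(line) for line in f]
    features = np.array([r["features"] for r in requests], dtype="float32")

    predictions = model.predict(features[:batch_size], verbose=0)
    assert predictions.shape[0] == min(batch_size, len(features))
    assert np.all(np.isfinite(predictions)), "model produced NaN/Inf on production inputs"
    return True

# Usage (paths are hypothetical):
# canary_check("models/candidate.keras", "logs/sampled_requests.jsonl")
```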
Infra 7: Models can be quickly and safely rolled back to a previous serving version: A model "roll back" procedure is a key part of incident response to many of the issues that can be detected by the monitoring discussed in Section V. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practice doing it before an emergency actually occurs.
Table IV
BRIEF LISTING OF THE SEVEN MONITORING TESTS
1 Dependency changes result in notification.
2 Data invariants hold for inputs.
3 Training and serving are not skewed.
4 Models are not too stale.
5 Models are numerically stable.
6 Computing performance has not regressed.
7 Prediction quality has not regressed.

Figure 3. Monitoring for Training/Serving Skew. It is often necessary for the same feature to be computed in different ways in different parts of the system. In such cases, we must carefully test that these different codepaths are in fact logically identical.

Table V
Interpreting an ML Test Score. This score is computed by taking the minimum score from each of the four test areas. Note that different systems at different points in their development may reasonably aim to be at different points along this scale.
All tests are worth the same number of points. This is intentional, as we believe the relative importance of tests to teams will vary depending on their specific priorities. This means that choosing any test to implement will raise the score, and we feel that is appropriate, as they are each valuable and often working on one will make it easier to work on another.

To interpret the score, see Table V. These interpretations were calibrated against a number of internal ML systems, and overall have been reflective of other qualitative perceptions of those systems.
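As a sketch of the aggregation described in Table V, the final score is the minimum of the per-area totals; the per-test point values below (e.g., partial credit for a test that is only run manually) are an assumption of this sketch rather than something specified in the text above.

```python
# Hypothetical record of how far a team has gotten on each of the 28 tests,
# grouped into the four areas of the rubric. Point values per test are an
# assumption here (e.g. partial credit for manual-only coverage).
scores = {
    "data":           {"Data 1": 1.0, "Data 2": 0.5, "Data 3": 0.0, "Data 4": 1.0,
                       "Data 5": 1.0, "Data 6": 0.5, "Data 7": 1.0},
    "model":          {"Model 1": 1.0, "Model 2": 0.0, "Model 3": 1.0, "Model 4": 0.5,
                       "Model 5": 1.0, "Model 6": 0.5, "Model 7": 0.0},
    "infrastructure": {"Infra 1": 0.5, "Infra 2": 1.0, "Infra 3": 0.0, "Infra 4": 1.0,
                       "Infra 5": 1.0, "Infra 6": 1.0, "Infra 7": 0.5},
    "monitoring":     {"Monitor 1": 1.0, "Monitor 2": 1.0, "Monitor 3": 0.0,
                       "Monitor 4": 0.5, "Monitor 5": 1.0, "Monitor 6": 1.0, "Monitor 7": 1.0},
}

area_totals = {area: sum(tests.values()) for area, tests in scores.items()}
ml_test_score = min(area_totals.values())  # the rubric takes the minimum across the four areas

print(area_totals)
print("ML Test Score:", ml_test_score)
```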
VII. APPLYING THE RUBRIC TO REAL SYSTEMS

We developed the ML Test Certified program to help engineers doing ML work at Google. Some of our work has involved meeting with teams doing ML and evaluating their performance in a structured interview based on the rubric detailed above. We met with 36 teams from across Google working in a diverse array of product areas; their scores on the rubric are presented in Figure 4. These interviews have offered some unexpected insights.

A. The importance of checklists

Checklists are helpful even for expert teams [18]. For example, one team we worked with discovered a thousand-line code file, completely untested, that created their input features. Code of that size, even if it contains only simple and straightforward logic, will likely have bugs, against which simple unit tests can provide an effective hedge. Another example we found was a team who realized, when we asked, that they had no evaluation or monitoring to discover whether their global service was serving poor predictions localized to a single country. They also relied heavily on informal evaluation of performance based on the team's own usage of the product, which does not protect users very different from the team members. Similarly, the interviews were useful simply as a way of advertising the existing tools – some teams had not even heard of the Facets tools or of our unit testing framework mentioned in Infra 2.

As another example, when we asked one team about ML inclusiveness, they confidently answered that they had given the matter some thought and concluded that there was no way for their system to be biased, since they were only dealing with speech waveforms ("we just get vectors of numbers"). When we asked if they had done any work to ensure their system performed well for African American Vernacular English, or had taken steps to ensure diversity in the population of human raters they hired for scoring, they paused at length and then agreed that this question opened up new possibilities for debiasing which they had not considered and would address.

Finally, the context of our interview provided additional motivation for getting around to implementing tests: one team was motivated to implement feature code tests because of the clear danger of training/serving skew, while others were spurred to automate previously manual processes to make them more frequent and testable.

B. Dependency issues

Data dependencies can lead to outsourcing responsibility for fully understanding the data. Multiple teams initially suggested that, since their features were produced by an upstream, much larger service, any problems in their data would be discovered by the other team. While this can certainly be some protection, it may still be that the smaller team has different requirements for the data that would not be caught by the larger team's validation.

In the other direction, multiple teams initially suggested that their system did not require independent monitoring, as their serving was done via a larger system whose reliability engineers would notice any problems downstream. Again, this can be some protection, but it's also quite possible that the smaller system's errors may be masked in the noise of the larger system. In addition, it's crucial in that regime that the larger system know how to find the appropriate contact person from the smaller one.

For the data tests, several teams indicated a key distinction between features that represent new combinations of existing data sources, and features based on new data sources. The latter require significantly more time and introduce more risk. Depending on a new data source can mean time spent negotiating with the owning team to ensure the data is properly treated. Or, if the data come from newly logged information, the existing training data must be backfilled, or thrown away to wait for new logs that include the data.

C. The importance of frameworks

Integration testing (Infra 3) stood out as a test with much lower adoption than most. When implemented, it often included serving systems but not training.
Figure 4. Average scores for interviewed teams. These graphs display the average score for each test across the 36 systems we examined.
This is in part because training is often developed as an ad hoc set of scripts and manual processes. A training pipeline platform like the TFX system [19] can be beneficial here, as it then allows building a generic integration test.

Model canarying (Infra 6) was frequently implemented by many teams, and cited as a key part of their testing plan. But this masks two interesting issues. First, canarying can indeed catch many issues like unservable models, numeric instability, and so forth. However, it typically occurs long after the engineering decisions that led to the issue, so it would be much preferable to catch issues earlier in unit or integration tests. Second, the teams that implemented canarying usually did so because their existing release framework made it easy – and one team lacking such a framework reported that the one time they did canary it was so painful they'd never do it again.

Perhaps the most important and least implemented test is the one for training/serving skew (Monitor 3). This sort of error is responsible for production issues across a wide swath of teams, and yet it is one of the least frequently implemented tests. In part this is because it is difficult, but again, building this into a framework like TFX allows many teams to benefit from a single investment.
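To make the skew check concrete (see Figure 3), a minimal sketch is shown below: it recomputes features for a sample of logged serving requests through the training-time codepath and diffs the two results. The function names and logging format are hypothetical, and a framework would typically run this continuously over sampled traffic rather than as a one-off script.

```python
import math

def compare_feature_codepaths(logged_examples, training_featurize, serving_featurize,
                              float_tolerance=1e-6):
    """Detect training/serving skew by running both featurization codepaths
    on the same raw examples and diffing the outputs feature by feature."""
    mismatches = []
    for i, raw in enumerate(logged_examples):
        train_feats = training_featurize(raw)   # e.g. batch/offline codepath
        serve_feats = serving_featurize(raw)    # e.g. low-latency server codepath
        for name in set(train_feats) | set(serve_feats):
            a, b = train_feats.get(name), serve_feats.get(name)
            if isinstance(a, float) and isinstance(b, float):
                equal = math.isclose(a, b, rel_tol=float_tolerance, abs_tol=float_tolerance)
            else:
                equal = a == b
            if not equal:
                mismatches.append((i, name, a, b))
    return mismatches

# Usage sketch (featurizers and request logs are placeholders for your own system):
# skew = compare_feature_codepaths(sampled_requests, offline_featurize, online_featurize)
# assert not skew, f"training/serving skew detected: {skew[:10]}"
```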
To test TFX, we evaluated a hypothetical system that used TFX along with its standard recommendations for introductory data analysis and so forth. We found that this hypothetical system already scored as "reasonably tested" according to our criterion. TFX is quite new, however, and we haven't yet measured real world TFX systems.

D. Assessing the assessment

We also conducted some meta-level assessment, asking teams what was useful or non-useful about this rubric. One interesting theme was that teams using purely image or audio data did not feel many of the Data tests were applicable. However, methods like manual inspection of raw data and LIME-style importance analysis [20] remain important tools in such settings. For example, such inspection can reveal skew in distributions or unrealistically consistent background effects correlated with the training target.

Supervised ML requires labeled data, but a number of groups are working in domains where labels are either not present or extremely expensive to acquire. One group had an extremely large data set that was so diverse that using human raters to generate training labels proved infeasible. So they built a simple heuristic system and then used that to train an ML system ("The ML experts told us that training a model like this was crazy and would never work but they were wrong!"). Human raters consistently rate the heuristic system as good but the ML system trained from it as much better – however, this exposes a need for a level of testing of the base heuristic system that is not covered in our rubric. Expensive labels also mean that quality evaluation of a learned model is difficult, which impacts the ability of teams to implement several tests like Model 4 and Infra 4.

ACKNOWLEDGMENT

We are very grateful to Keith Arner, Gary Holt, Josh Lovejoy, Fernando Pereira, Todd Phillips, Tal Shaked, Todd Underwood, Martin Wicke, Cory Williams, and Martin Zinkevich for many helpful discussions and comments on drafts of this paper.

REFERENCES

[8] M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao, "The Daikon system for dynamic detection of likely invariants," Science of Computer Programming, vol. 69, no. 1, pp. 35–45, 2007.

[9] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., "A practical guide to support vector classification," 2003.

[10] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012.

[11] T. Desautels, A. Krause, and J. Burdick, "Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization," Journal of Machine Learning Research (JMLR), vol. 15, pp. 4053–4103, December 2014.

[12] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley, "Google Vizier: A service for black-box optimization," in KDD 2017, 2017.

[13] "Google cloud machine learning: now open to all with new professional services and education programs," https://goo.gl/ULh7ZW, 2017, accessed: 2017-02-08.