The ML Test Score: A Rubric For ML Production Readiness and Technical Debt Reduction
Abstract—Creating reliable, production-level machine learning systems brings on a host of concerns not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for ensuring the production-readiness of an ML system, and for reducing technical debt of ML systems. But it can be difficult to formulate specific tests, given that the actual prediction behavior of any given model is difficult to specify a priori. In this paper, we present 28 specific tests and monitoring needs, drawn from experience with a wide range of production ML systems, to help quantify these issues and present an easy to follow road-map to improve production readiness and pay down ML technical debt.

Keywords—Machine Learning, Testing, Monitoring, Reliability, Best Practices, Technical Debt

I. INTRODUCTION

As machine learning (ML) systems continue to take on ever more central roles in real-world production settings, the issue of ML reliability has become increasingly critical. ML reliability involves a host of issues not found in small toy examples or even large offline experiments, which can lead to surprisingly large amounts of technical debt [1]. Testing and monitoring are important strategies for improving reliability, reducing technical debt, and lowering long-term maintenance cost. However, as suggested by Figure 1, ML system testing is also a more complex challenge than testing manually coded systems, because ML system behavior depends strongly on data and models that cannot be strongly specified a priori. One way to see this is to consider ML training as analogous to compilation, where the source is both code and training data. By that analogy, training data needs testing like code, and a trained ML model needs production practices like a binary does, such as debuggability, rollbacks and monitoring.

So, what should be tested, and how much testing is enough? In this paper, we try to answer this question with a test rubric, based on decades of experience engineering production-level ML systems at Google, in systems such as ad click prediction [2] and the Sibyl ML platform [3].

We present the rubric as a set of 28 actionable tests, and offer a scoring system to measure how ready for production a given machine learning system is. The rubric is intended to cover a range from a team just starting out with machine learning up through tests that even a well-established team may find difficult. Note that this rubric focuses on issues specific to ML systems, and so does not include generic software engineering best practices such as ensuring good unit test coverage and a well-defined binary release process. Such strategies remain necessary as well. We do call out a few specific areas for unit or integration tests that have unique ML-related behavior.

How to read the tests: Each test is written as an assertion; our recommendation is to test that the assertion is true, the more frequently the better, and to fix the system if the assertion is not true.

Doesn't this all go without saying?: Before we enumerate our suggested tests, we should address one objection the reader may have – obviously one should write tests for an engineering project! While this is true in principle, in a survey of several dozen teams at Google, none of these tests was implemented by more than 80% of teams (though, even in an engineering culture that values rigorous testing, many of these ML-centric tests are non-obvious). Conversely, most tests had a nonzero score for at least half of the teams surveyed; our tests do represent practices that teams find to be worth doing.

In this paper, we are largely concerned with supervised ML systems that are trained continuously online and perform rapid, low-latency inference on a server. Features are often derived from large amounts of data such as streaming logs of incoming data. However, most of our recommendations apply to other forms of ML systems, such as infrequently trained models pushed to client-side systems for inference.

A. Related work

Software testing is well studied, as is machine learning, but their intersection has been less well explored in the literature. [4] reviews testing for scientific software more generally, and cites a number of articles such as [5], who present an approach for testing ML algorithms. These ideas are a useful complement to the tests we present, which are focused on testing the use of ML in a production system rather than just the correctness of the ML algorithm per se.

Zinkevich provides extensive advice on building effective machine learning models in real world systems [6]. Those rules are complementary to this rubric, which is more concerned with determining how reliable an ML system is rather than how to build one.
Figure 1. ML Systems Require Extensive Testing and Monitoring. The key consideration is that unlike a manually coded system (left), ML-based
system behavior is not easily specified in advance. This behavior depends on dynamic qualities of the data, and on various model configuration choices.
Issues of surprising sources of technical debt in ML systems have been studied before [1]. It has been noted that this prior work identified problems but was largely silent on how to address them; this paper details actionable advice drawn from practice and verified with extensive interviews with the maintainers of 36 real world systems.

II. TESTS FOR FEATURES AND DATA

Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.

Table I
BRIEF LISTING OF THE SEVEN DATA TESTS
1 Feature expectations are captured in a schema.
2 All features are beneficial.
3 No feature's cost is too much.
4 Features adhere to meta-level requirements.
5 The data pipeline has appropriate privacy controls.
6 New features can be added quickly.
7 All input feature code is tested.

Data 1: Feature expectations are captured in a schema: It is useful to encode intuitions about the data in a schema so they can be automatically checked. For example, an adult human is surely between one and ten feet in height. The most common word in English text is probably 'the', with other word frequencies following a power-law distribution. Such expectations can be used for tests on input data during training and serving (see test Monitor 2).

How? To construct the schema, one approach is to start by calculating statistics from training data, and then adjust them as appropriate based on domain knowledge. It may also be useful to start by writing down expectations and then compare them to the data, to avoid an anchoring bias. Visualization tools such as Facets (https://pair-code.github.io/facets/) can be very useful for analyzing the data to produce the schema. Invariants to capture in a schema can also be inferred automatically from your system's behavior [8].
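As a concrete illustration, the following is a minimal sketch of such a schema check in Python; the feature names, ranges, and input format are hypothetical, and a production pipeline would more likely rely on a dedicated data-validation library than on hand-rolled assertions. The same check can be run over serving inputs (see Monitor 2).

```python
# Hypothetical schema derived from training-data statistics plus domain knowledge.
SCHEMA = {
    "height_feet": {"dtype": float, "min": 1.0, "max": 10.0, "allow_missing": False},
    "country_code": {"dtype": str, "vocabulary": {"US", "DE", "IN", "BR"}, "allow_missing": True},
}

def validate_batch(raw_examples, schema=SCHEMA):
    """Return a list of human-readable schema violations for a batch of dict-like examples."""
    violations = []
    for i, example in enumerate(raw_examples):
        for name, spec in schema.items():
            value = example.get(name)
            if value is None:
                if not spec["allow_missing"]:
                    violations.append(f"example {i}: missing required feature '{name}'")
                continue
            if not isinstance(value, spec["dtype"]):
                violations.append(f"example {i}: '{name}' has type {type(value).__name__}")
                continue
            if "min" in spec and not (spec["min"] <= value <= spec["max"]):
                violations.append(f"example {i}: '{name}'={value} outside [{spec['min']}, {spec['max']}]")
            if "vocabulary" in spec and value not in spec["vocabulary"]:
                violations.append(f"example {i}: '{name}'={value!r} not in expected vocabulary")
    return violations

# Example usage, e.g. inside a unit test or as a pipeline precondition:
batch = [{"height_feet": 5.9, "country_code": "US"}, {"height_feet": 42.0}]
assert validate_batch(batch) == ["example 1: 'height_feet'=42.0 outside [1.0, 10.0]"]
```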
Data 2: All features are beneficial: A kitchen-sink approach to features can be tempting, but every feature added has a software engineering cost. Hence, it's important to understand the value each feature provides in additional predictive power (independent of other features).

How? Some ways to run this test are by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
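A rough sketch of the leave-one-feature-out variant is shown below, using scikit-learn with a synthetic feature matrix standing in for real data; in practice the candidate model, metric, and data splits would be those of your own pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def feature_benefit_report(X, y, feature_names, seed=0):
    """Estimate each feature's marginal value by retraining with that feature removed."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=seed)

    def auc_with_columns(cols):
        model = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
        return roc_auc_score(y_test, model.predict_proba(X_test[:, cols])[:, 1])

    all_cols = list(range(X.shape[1]))
    baseline = auc_with_columns(all_cols)
    report = {}
    for i, name in enumerate(feature_names):
        ablated = [c for c in all_cols if c != i]
        report[name] = baseline - auc_with_columns(ablated)  # drop in AUC when this feature is removed
    return baseline, report

# Usage with synthetic data standing in for real training data:
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = (X[:, 0] + 0.1 * rng.rand(500) > 0.5).astype(int)  # only feature 0 is informative
baseline, report = feature_benefit_report(X, y, ["f0", "f1", "f2"])
print(baseline, report)  # expect a large drop for "f0", near-zero drops for "f1" and "f2"
```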
Data 3: No feature's cost is too much: It is not only a waste of computing resources, but also an ongoing maintenance burden, to include features that add only minimal predictive benefit [1].

How? To measure the costs of a feature, consider not only added inference latency and RAM usage, but also more upstream data dependencies, and the additional expected instability incurred by relying on that feature. See Rule #22 of [6] for further discussion.

Data 4: Features adhere to meta-level requirements: Your project may impose requirements on the data coming in to the system. It might prohibit features derived from user data, prohibit the use of specific features like age, or simply prohibit any feature that is deprecated. It might require that all features be available from a single source. However, during model development and experimentation, it is typical to try out a wide variety of potential features to improve prediction quality.

How? Programmatically enforce these requirements, so that all models in production properly adhere to them.

Data 5: The data pipeline has appropriate privacy controls: Training data, validation data, and vocabulary files all have the potential to contain sensitive user data. While teams are often aware of the need to remove personally identifiable information (PII), during this kind of exporting and transformation, programming errors and system changes can lead to inadvertent PII leakages that may have serious consequences.
How? Make sure to budget sufficient time during new feature development that depends on sensitive data to allow for proper handling. Test that access to pipeline data is controlled as tightly as the access to raw user data, especially for data sources that haven't previously been used in ML. Finally, test that any user-requested data deletion propagates to the data in the ML training pipeline, and to any learned models.

Data 6: New features can be added quickly: The faster a team can go from a feature idea to the feature running in production, the faster it can both improve the system and respond to external changes. For highly efficient teams, this can be as little as one to two months even for global-scale, high-traffic ML systems. Note that this can be in tension with Data 5, but privacy should always take precedence.

Data 7: All input feature code is tested: Feature creation code may appear simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital. Bugs in features may be almost impossible to detect once they have entered the data generation process, especially if they are represented in both training and test data.
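For instance, a unit test for a small feature-generation function might look like the sketch below; the bucketing function is an invented example, not something from this rubric, but the pattern of pinning down boundary and missing-value cases is what catches the kinds of bugs described above.

```python
def age_bucket(age):
    """Hypothetical feature transform: map an age in years to a coarse bucket id."""
    if age is None or age < 0:
        return -1          # sentinel bucket for missing/invalid values
    if age < 18:
        return 0
    if age < 65:
        return 1
    return 2

def test_age_bucket_edge_cases():
    # Boundary values are exactly where off-by-one feature bugs hide,
    # and such bugs silently poison both training and test data.
    assert age_bucket(None) == -1
    assert age_bucket(-3) == -1
    assert age_bucket(0) == 0
    assert age_bucket(17) == 0
    assert age_bucket(18) == 1
    assert age_bucket(64) == 1
    assert age_bucket(65) == 2

test_age_bucket_edge_cases()
```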
III. TESTS FOR MODEL DEVELOPMENT

While the field of software engineering has developed a full range of best practices for developing reliable software systems, similar best practices for ML model development are still emerging.

Table II
BRIEF LISTING OF THE SEVEN MODEL TESTS
1 Model specs are reviewed and submitted.
2 Offline and online metrics correlate.
3 All hyperparameters have been tuned.
4 The impact of model staleness is known.
5 A simpler model is not better.
6 Model quality is sufficient on important data slices.
7 The model is tested for considerations of inclusion.

Model 1: Every model specification undergoes a code review and is checked in to a repository: It can be tempting to avoid code review out of expediency, and run experiments based on one's own personal modifications. In addition, when responding to production incidents, it's crucial to know the exact code that was run to produce a given learned model. For example, a responder might need to re-run training with corrected input data, or compare the result of a particular modification. Proper version control of the model specification can help make training auditable and improve reproducibility.

Model 2: Offline proxy metrics correlate with actual online impact metrics: A user-facing production system's impact is judged by metrics of engagement, user happiness, revenue, and so forth. A machine learning system is trained to optimize loss metrics such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better scoring model will result in a better production system.

How? The offline/online metric relationship can be measured in one or more small scale A/B experiments using an intentionally degraded model.

Model 3: All hyperparameters have been tuned: An ML model can often have multiple hyperparameters, such as learning rates, number of layers, layer sizes, and regularization coefficients. The choice of hyperparameter values can have a dramatic impact on prediction quality.

How? Methods such as grid search [9] or a more sophisticated hyperparameter search strategy [10], [11] not only improve prediction quality, but also can uncover hidden reliability issues. Substantial performance improvements have been realized in many ML systems through use of an internal hyperparameter tuning service [12], which is closely related to HyperTune [13].
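A minimal grid-search sketch using scikit-learn is shown below; the estimator, parameter grid, and synthetic data are placeholders, and a production system might instead call a dedicated tuning service as mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set.
rng = np.random.RandomState(42)
X = rng.randn(400, 5)
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hypothetical grid; a real model spec would expose more knobs (learning rates,
# layer sizes, ...) and might use a smarter search strategy than a full grid.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # inverse regularization strength

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,
    scoring="neg_log_loss",
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best cross-validated log-loss:", -search.best_score_)
```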
Model 4: The impact of model staleness is known: Many production ML systems encounter rapidly changing, non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, we say the model is stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates (see Rule 8 in [6] for related discussion).

How? One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.

Model 5: A simpler model is not better: Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.

Model 6: Model quality is sufficient on all important data slices: Slicing a data set along certain dimensions of interest can improve fine-grained understanding of model quality. Slices should distinguish subsets of the data that might behave qualitatively differently, for example, users by country, users by frequency of use, or movies by genre. Examining sliced data avoids having fine-grained quality issues masked by a global summary metric, e.g. global accuracy improved by 1% but accuracy for one country dropped by 50%. This class of problems often arises from a fault in the collection of training data that caused an important set of training data to be lost or late.
How? Consider including these tests in your release process, e.g. release tests for models can impose absolute thresholds (e.g., error for slice x must be <5%) to catch large drops in quality, as well as incremental thresholds (e.g., the change in error for slice x must be <1% compared to the previously released model).
Model 7: The model has been tested for considerations of inclusion: There have been a number of recent studies on the issue of ML fairness [14], [15], which may arise inadvertently due to factors such as the choice of training data. For example, Bolukbasi et al. found that a word embedding trained on news articles had learned some striking associations between gender and occupation that may have reflected the content of the news articles but which may have been inappropriate for use in a predictive modeling context [14]. This form of potentially overlooked bias in training data sets may then influence the larger system's behavior.

How? Diagnosing such issues is an important step for creating robust modeling systems that serve all users well. Tests that can be run include examining input features to determine if they correlate strongly with protected user categories, and slicing predictions to determine if prediction outputs differ materially when conditioned on different user groups.

Bolukbasi et al. [14] propose one method for ameliorating such effects by projecting embeddings to spaces that collapse differences along certain protected dimensions. Hardt et al. [15] propose a post-processing step in model creation to minimize disproportionate loss for certain groups. Finally, the approach of collecting more data to ensure data representation for potentially under-represented categories or subgroups can be effective in many cases.

IV. TESTS FOR ML INFRASTRUCTURE

An ML system often relies on a complex pipeline rather than a single running binary.

Table III
BRIEF LISTING OF THE ML INFRASTRUCTURE TESTS
1 Training is reproducible.
2 Model specs are unit tested.
3 The ML pipeline is integration tested.
4 Model quality is validated before serving.
5 The model is debuggable.
6 Models are canaried before serving.
7 Serving models can be rolled back.

Infra 1: Training is reproducible: Ideally, training twice on the same data should produce two identical models. Deterministic training dramatically simplifies reasoning about the whole system and can aid auditability and debugging. For example, optimizing feature generation code is a delicate process, but verifying that the old and new feature generation code will train to an identical model can provide more confidence that the refactoring was correct. This sort of diff-testing relies entirely on deterministic training.

Unfortunately, model training is often not reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. This can manifest as a change in aggregate metrics across an entire dataset, or, even if the aggregate performance appears the same from run to run, as changes on individual examples. Random number generation is an obvious source of non-determinism, which can be alleviated with seeding. But even with proper seeding, initialization order can be underspecified, so that different portions of the model are initialized at different times on different runs, leading to non-determinism. Furthermore, even when initialization is fully deterministic, multiple threads of execution on a single machine or across a distributed system [16] may be subject to unpredictable orderings of training data, which is another source of non-determinism.

How? Besides working to remove nondeterminism as discussed above, ensembling models can help.
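As an illustration, a reproducibility check might train the same small model twice with fixed seeds and diff the resulting weights. The sketch below uses tf.keras (set_random_seed is available in recent TensorFlow releases); the tolerance and model-building function are assumptions of this sketch, and on accelerators additional sources of nondeterminism remain.

```python
import numpy as np
import tensorflow as tf

def train_small_model(seed, x, y):
    # Fix the framework-level seeds; real pipelines also need deterministic
    # data ordering and, on accelerators, deterministic kernels.
    tf.keras.utils.set_random_seed(seed)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(x.shape[1],)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    model.fit(x, y, epochs=2, batch_size=16, shuffle=False, verbose=0)
    return model.get_weights()

def test_training_is_reproducible():
    rng = np.random.RandomState(0)
    x = rng.rand(64, 4).astype("float32")
    y = rng.rand(64, 1).astype("float32")
    w1 = train_small_model(seed=123, x=x, y=y)
    w2 = train_small_model(seed=123, x=x, y=y)
    # Two runs with identical code, data, and seeds should yield identical weights.
    for a, b in zip(w1, w2):
        np.testing.assert_allclose(a, b, rtol=0, atol=1e-6)

test_training_is_reproducible()
```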
Infra 2: Model specification code is unit tested: Although model specifications may seem like "configuration", such files can have bugs and need to be tested. Unfortunately, testing a model specification can be very hard. Unit tests should run quickly and require no external dependencies, but model training is often a very slow process that involves pulling in lots of data from many sources.

How? It's useful to distinguish two kinds of model tests: tests of API usage and tests of algorithmic correctness. We plan to release an open source framework implementing some of these tests soon.

ML APIs can be complex, and code using them can be wrong in subtle ways. Even if code errors would be apparent after training (due to a model that fails to train or results in poor performance), training is expensive and so the development loop is slow. We have found in practice that a simple unit test that generates random input data and trains the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle. Another useful assertion is that a model can restore from a checkpoint after a mid-training job crash.
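The single-step test described above might look roughly like the following tf.keras sketch; the model-building function stands in for your model specification code, and the exact API calls are an assumption of this example rather than a prescribed framework.

```python
import numpy as np
import tensorflow as tf

def build_model(num_features):
    # Stand-in for the real model specification under test.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(num_features,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    return model

def test_model_trains_single_step():
    # Random data is enough to surface shape mismatches, dtype errors,
    # and miswired layers without running a full (slow, expensive) training job.
    rng = np.random.RandomState(0)
    x = rng.rand(8, 4).astype("float32")
    y = rng.randint(0, 2, size=(8, 1)).astype("float32")

    model = build_model(num_features=4)
    loss = model.train_on_batch(x, y)  # a single step of gradient descent
    assert np.isfinite(loss), "loss is NaN/Inf after one training step"

    # Also check that a step actually changes some trainable weights.
    before = [w.copy() for w in model.get_weights()]
    model.train_on_batch(x, y)
    after = model.get_weights()
    assert any(not np.allclose(b, a) for b, a in zip(before, after))

test_model_trains_single_step()
```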
Testing correctness of a novel implementation of an ML algorithm is more difficult, but still necessary – it is not sufficient that the code produces a model with high quality predictions; it must do so for the expected reasons. One solution is to make assertions that specific subcomputations of the algorithm are correct, e.g. that a specific part of an RNN was executed exactly once per element of the input sequence. Another solution involves not training to completion in the unit test, but only training for a few iterations and verifying that loss decreases with training. Still another is to purposefully train a model to overfit: if one can get a model to effectively memorize its training data, then that provides some confidence that learning reliably happens. When testing models, pains should be taken to avoid "golden tests", i.e., tests that partially train a model and compare the results to a previously generated model – such tests are difficult to maintain over time without blindly updating the golden file. In addition to problems with training non-determinism, when these tests do break they provide very little insight into how or why. Additionally, flaky tests remain a real danger here.
Infra 3: The full ML pipeline is integration tested: A complete ML pipeline typically consists of assembling training data, feature generation, model training, model verification, and deployment to a serving system. Although a single engineering team may be focused on a small part of the process, each stage can introduce errors that may affect subsequent stages, possibly even several stages away. That means there must be a fully automated test that runs regularly and exercises the entire pipeline, validating that data and code can successfully move through each stage and that the resulting model performs well.

How? The integration test should run both continuously as well as with new releases of models or servers, in order to catch problems well before they reach production. Faster running integration tests with a subset of training data or a simpler model can give faster feedback to developers, while still backed by less frequent, long running versions with a setup that more closely mirrors production.

Infra 4: Model quality is validated before attempting to serve it: After a model is trained but before it actually affects real traffic, an automated system needs to inspect it and verify that its quality is sufficient; that system must either bless the model or veto it, terminating its entry to the production environment.

How? It is important to test for both slow degradations in quality over many versions as well as sudden drops in a new version. For the former, setting loose thresholds and comparing against predictions on a validation set can be useful; for the latter, it is useful to compare predictions to the previous version of the model while setting tighter thresholds.
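A bare-bones sketch of such a validation gate is shown below; the metric names, thresholds, and input dictionaries are hypothetical placeholders for whatever your pipeline actually produces.

```python
def validate_candidate(candidate_metrics, previous_metrics, baseline_metrics):
    """Bless or veto a candidate model before it is pushed to serving.

    Each argument is a dict such as {"log_loss": 0.31, "auc": 0.92}
    computed on the same validation set.
    """
    reasons = []

    # Loose absolute thresholds against a long-term baseline catch slow decay
    # accumulated over many releases.
    if candidate_metrics["log_loss"] > baseline_metrics["log_loss"] * 1.10:
        reasons.append("log_loss more than 10% worse than long-term baseline")

    # Tight relative thresholds against the immediately previous model catch
    # sudden drops introduced by a single release.
    if candidate_metrics["auc"] < previous_metrics["auc"] - 0.005:
        reasons.append("AUC dropped more than 0.005 vs. previous model")

    blessed = not reasons
    return blessed, reasons

# Example usage with made-up numbers:
ok, why = validate_candidate(
    candidate_metrics={"log_loss": 0.34, "auc": 0.913},
    previous_metrics={"log_loss": 0.33, "auc": 0.921},
    baseline_metrics={"log_loss": 0.30, "auc": 0.92},
)
print(ok, why)  # -> False, with both veto reasons listed
```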
Infra 5: The model allows debugging by observing the step-by-step computation of training or inference on a single example: When someone finds a case where a model is behaving bizarrely, how difficult is it to figure out why? Is there an easy, well documented process for feeding a single example to the model and investigating the computation through each stage of the model (e.g. each internal node of a neural network)?

Observing the step-by-step computation through the model on small amounts of data is an especially useful debugging strategy for issues like numerical instability.

How? An internal tool that allows users to enter examples and see how a specific model version interprets them can be very helpful. The TensorFlow debugger [17] is one example of such a tool.

Figure 2. Importance of a Model Canary before Serving. It is possible for models to incorporate new pieces of code that are not live in separate serving binaries, causing havoc at serving time. Using small scale canary processes can help protect against this.

Infra 6: Models are tested via a canary process before they enter production serving environments: Offline testing, however extensive, cannot by itself guarantee that the model will perform well in live production settings, as the real world often contains significant non-stationarity or other issues that limit the utility of historical data. Consequently, there is always some risk when turning on a new model in production.

One recurring problem that canarying can help catch is a mismatch between model artifacts and serving infrastructure. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. For example, as shown in Figure 2, a refactoring in the core learning library might change the low-level implementation of an operation Op in the model from Op0.1 to a more efficient implementation, Op0.2. A newly trained model will thus expect to be implemented with Op0.2; an older deployed server will not include Op0.2 and so will refuse to load the model.

How? To mitigate the mismatch issue, one approach is testing that a model successfully loads into production serving binaries and that inference on production input data succeeds. To mitigate the new-model risk more generally, one can turn up new models gradually, running old and new models concurrently, with new models only seeing a small fraction of traffic that is gradually increased as the new model is observed to behave sanely.
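The load-and-infer part of a canary could be as simple as the sketch below; it assumes a Keras-serialized model file and a file of sampled, logged production requests, both of which are hypothetical, and a real canary would run against the actual serving binary rather than in-process.

```python
import json
import numpy as np
import tensorflow as tf

def canary_check(model_path, sample_requests_path, batch_size=32):
    """Verify that a freshly trained model loads and serves sampled production inputs."""
    # 1. The serving environment must be able to load the new artifact at all.
    model = tf.keras.models.load_model(model_path)

    # 2. Inference on real (logged) production inputs must succeed and be well-formed.
    with open(sample_requests_path) as f:
        requests = [json.loads(line) for line in f]
    features = np.array([r["features"] for r in requests], dtype="float32")

    predictions = model.predict(features[:batch_size], verbose=0)
    assert predictions.shape[0] == min(batch_size, len(features))
    assert np.all(np.isfinite(predictions)), "model produced NaN/Inf on production inputs"
    return True

# Usage (paths are hypothetical):
# canary_check("models/candidate.keras", "logs/sampled_requests.jsonl")
```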
Infra 7: Models can be quickly and safely rolled back to a previous serving version: A model "roll back" procedure is a key part of incident response to many of the issues that can be detected by the monitoring discussed in Section V. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practice doing it before an emergency actually occurs.
Table IV
BRIEF LISTING OF THE SEVEN MONITORING TESTS
1 Dependency changes result in notification.
2 Data invariants hold for inputs.
3 Training and serving are not skewed.
4 Models are not too stale.
5 Models are numerically stable.
6 Computing performance has not regressed.
7 Prediction quality has not regressed.

Figure 3. Monitoring for Training/Serving Skew. It is often necessary for the same feature to be computed in different ways in different parts of the system. In such cases, we must carefully test that these different codepaths are in fact logically identical.

Table V
Interpreting an ML Test Score. This score is computed by taking the minimum score from each of the four test areas. Note that different systems at different points in their development may reasonably aim to be at different points along this scale.
All tests are worth the same number of points. This is intentional, as we believe the relative importance of tests to teams will vary depending on their specific priorities. This means that choosing any test to implement will raise the score, and we feel that is appropriate, as they are each valuable and often working on one will make it easier to work on another.

To interpret the score, see Table V. These interpretations were calibrated against a number of internal ML systems, and overall have been reflective of other qualitative perceptions of those systems.
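As a sketch of the aggregation described in Table V, the final score is the minimum of the per-area totals; the per-test point values below (e.g., partial credit for a test that is only run manually) are an assumption of this sketch rather than something specified in the text above.

```python
# Hypothetical record of how far a team has gotten on each of the 28 tests,
# grouped into the four areas of the rubric. Point values per test are an
# assumption here (e.g. partial credit for manual-only coverage).
scores = {
    "data":           {"Data 1": 1.0, "Data 2": 0.5, "Data 3": 0.0, "Data 4": 1.0,
                       "Data 5": 1.0, "Data 6": 0.5, "Data 7": 1.0},
    "model":          {"Model 1": 1.0, "Model 2": 0.0, "Model 3": 1.0, "Model 4": 0.5,
                       "Model 5": 1.0, "Model 6": 0.5, "Model 7": 0.0},
    "infrastructure": {"Infra 1": 0.5, "Infra 2": 1.0, "Infra 3": 0.0, "Infra 4": 1.0,
                       "Infra 5": 1.0, "Infra 6": 1.0, "Infra 7": 0.5},
    "monitoring":     {"Monitor 1": 1.0, "Monitor 2": 1.0, "Monitor 3": 0.0,
                       "Monitor 4": 0.5, "Monitor 5": 1.0, "Monitor 6": 1.0, "Monitor 7": 1.0},
}

area_totals = {area: sum(tests.values()) for area, tests in scores.items()}
ml_test_score = min(area_totals.values())  # the rubric takes the minimum across the four areas

print(area_totals)
print("ML Test Score:", ml_test_score)
```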
VII. APPLYING THE RUBRIC TO REAL SYSTEMS

We developed the ML Test Certified program to help engineers doing ML work at Google. Some of our work has involved meeting with teams doing ML and evaluating their performance in a structured interview based on the rubric detailed above. We met with 36 teams from across Google working in a diverse array of product areas; their scores on the rubric are presented in Figure 4. These interviews have offered some unexpected insights.

A. The importance of checklists

Checklists are helpful even for expert teams [18]. For example, one team we worked with discovered a thousand-line code file, completely untested, that created their input features. Code of that size, even if it contains only simple and straightforward logic, will likely have bugs, against which simple unit tests can provide an effective hedge. Another example we found was a team who realized, when we asked, that they had no evaluation or monitoring to discover whether their global service was serving poor predictions localized to a single country. They also relied heavily on informal evaluation of performance based on the team's own usage of the product, which does not protect users very different from the team members. Similarly, the interviews were useful simply as a way of advertising the existing tools – some teams had not even heard of the Facets tools or of our unit testing framework mentioned in Infra 2.

As another example, when we asked one team about ML inclusiveness, they confidently answered that they had given the matter some thought and concluded that there was no way for their system to be biased, since they were only dealing with speech waveforms ("we just get vectors of numbers"). When we asked if they had done any work to ensure their system performed well for African American Vernacular English, or had taken steps to ensure diversity in the population of human raters they hired for scoring, they paused at length and then agreed that this question opened up new possibilities for debiasing which they had not considered and would address.

Finally, the context of our interview provided additional motivation for getting around to implementing tests: one team was motivated to implement feature code tests because of the clear danger of training/serving skew, while others were spurred to automate previously manual processes to make them more frequent and testable.

B. Dependency issues

Data dependencies can lead to outsourcing responsibility for fully understanding the data. Multiple teams initially suggested that, since their features were produced by an upstream, much larger service, any problems in their data would be discovered by the other team. While this can certainly be some protection, it may still be that the smaller team has different requirements for the data that would not be caught by the larger team's validation.

In the other direction, multiple teams initially suggested that their system did not require independent monitoring, as their serving was done via a larger system whose reliability engineers would notice any problems downstream. Again, this can be some protection, but it's also quite possible that the smaller system's errors may be masked in the noise of the larger system. In addition, it's crucial in that regime that the larger system know how to find the appropriate contact person from the smaller one.

For the data tests, several teams indicated a key distinction between features that represent new combinations of existing data sources, and features based on new data sources. The latter require significantly more time and introduce more risk. Depending on a new data source can mean time spent negotiating with the owning team to ensure the data is properly treated. Or, if the data come from newly logged information, the existing training data must be backfilled, or thrown away to wait for new logs that include the data.

C. The importance of frameworks

Integration testing (Infra 3) stood out as a test with much lower adoption than most. When implemented, it often included serving systems but not training.
Figure 4. Average scores for interviewed teams. These graphs display the average score for each test across the 36 systems we examined.
This is in part because training is often developed as an ad hoc set of scripts and manual processes. A training pipeline platform like the TFX system [19] can be beneficial here, as it then allows building a generic integration test.

Model canarying (Infra 6) was frequently implemented by many teams, and cited as a key part of their testing plan. But this masks two interesting issues. First, canarying can indeed catch many issues like unservable models, numeric instability, and so forth. However, it typically occurs long after the engineering decisions that led to the issue, so it would be much preferable to catch issues earlier in unit or integration tests. Second, the teams that implemented canarying usually did so because their existing release framework made it easy – and one team lacking such a framework reported that the one time they did canary it was so painful they'd never do it again.

Perhaps the most important and least implemented test is the one for training/serving skew (Monitor 3). This sort of error is responsible for production issues across a wide swath of teams, and yet it is one of the least frequently implemented tests. In part this is because it is difficult, but again, building this into a framework like TFX allows many teams to benefit from a single investment.
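To make the skew check concrete (see Figure 3), a minimal sketch is shown below: it recomputes features for a sample of logged serving requests through the training-time codepath and diffs the two results. The function names and logging format are hypothetical, and a framework would typically run this continuously over sampled traffic rather than as a one-off script.

```python
import math

def compare_feature_codepaths(logged_examples, training_featurize, serving_featurize,
                              float_tolerance=1e-6):
    """Detect training/serving skew by running both featurization codepaths
    on the same raw examples and diffing the outputs feature by feature."""
    mismatches = []
    for i, raw in enumerate(logged_examples):
        train_feats = training_featurize(raw)   # e.g. batch/offline codepath
        serve_feats = serving_featurize(raw)    # e.g. low-latency server codepath
        for name in set(train_feats) | set(serve_feats):
            a, b = train_feats.get(name), serve_feats.get(name)
            if isinstance(a, float) and isinstance(b, float):
                equal = math.isclose(a, b, rel_tol=float_tolerance, abs_tol=float_tolerance)
            else:
                equal = a == b
            if not equal:
                mismatches.append((i, name, a, b))
    return mismatches

# Usage sketch (featurizers and request logs are placeholders for your own system):
# skew = compare_feature_codepaths(sampled_requests, offline_featurize, online_featurize)
# assert not skew, f"training/serving skew detected: {skew[:10]}"
```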
To test TFX, we evaluated a hypothetical system that used TFX along with its standard recommendations for introductory data analysis and so forth. We found that this hypothetical system already scored as "reasonably tested" according to our criterion. TFX is quite new, however, and we haven't yet measured real world TFX systems.

D. Assessing the assessment

We also conducted some meta-level assessment, asking teams what was useful or non-useful about this rubric. One interesting theme was that teams using purely image or audio data did not feel many of the Data tests were applicable. However, methods like manual inspection of raw data and LIME-style importance analysis [20] remain important tools in such settings. For example, such inspection can reveal skew in distributions or unrealistically consistent background effects correlated with the training target.

Supervised ML requires labeled data, but a number of groups are working in domains where labels are either not present or extremely expensive to acquire. One group had an extremely large data set that was so diverse that using human raters to generate training labels proved infeasible. So they built a simple heuristic system and then used that to train an ML system ("The ML experts told us that training a model like this was crazy and would never work but they were wrong!"). Human raters consistently rate the heuristic system as good but the ML system trained from it as much better – however, this exposes a need for a level of testing of the base heuristic system that is not covered in our rubric. Expensive labels also mean that quality evaluation of a learned model is difficult, which impacts the ability of teams to implement several tests like Model 4 and Infra 4.

ACKNOWLEDGMENT

We are very grateful to Keith Arner, Gary Holt, Josh Lovejoy, Fernando Pereira, Todd Phillips, Tal Shaked, Todd Underwood, Martin Wicke, Cory Williams, and Martin Zinkevich for many helpful discussions and comments on drafts of this paper.

REFERENCES

[8] M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao, "The Daikon system for dynamic detection of likely invariants," Science of Computer Programming, vol. 69, no. 1, pp. 35–45, 2007.

[9] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., "A practical guide to support vector classification," 2003.

[10] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012.

[11] T. Desautels, A. Krause, and J. Burdick, "Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization," Journal of Machine Learning Research (JMLR), vol. 15, pp. 4053–4103, December 2014.

[12] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley, "Google Vizier: A service for black-box optimization," in KDD 2017, 2017.

[13] "Google cloud machine learning: now open to all with new professional services and education programs," https://goo.gl/ULh7ZW, 2017, accessed: 2017-02-08.