
SLEP011: Fixing randomness handling in estimators and splitters #24


Closed
wants to merge 11 commits

Conversation

NicolasHug
Member

This SLEP aims to fix the issues related to passing RandomState instances or None to estimators and CV splitters.

The proposed change is to store an actual random state, as returned by NumPy's RandomState.get_state() method, in the self.random_state attribute.

The net result is that calling fit or split on the same instance will always use the same RNG. The changes are mostly backward compatible.
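
In code, the proposed pattern would look roughly like this (a minimal sketch; MyEstimator stands in for any estimator with randomness):

import numpy as np
from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        # Whatever the user passed (None, an int, or a RandomState
        # instance) is converted once, at init, into a concrete state
        # tuple as returned by RandomState.get_state().
        self.random_state = check_random_state(random_state).get_state()

    def fit(self, X, y=None):
        # Each call to fit restores the exact same state, so fitting the
        # same instance twice uses the same RNG stream.
        rng = np.random.RandomState()
        rng.set_state(self.random_state)
        # ... use rng for all random draws ...
        return self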

This is a follow-up to scikit-learn/scikit-learn#14042 and scikit-learn/scikit-learn#15177.

@scikit-learn/core-devs, this probably looks more like a manifesto than a SLEP at the moment, so please share your comments ;)

@amueller
Member

I didn't do a review but from a short discussion:
I think we need to decide on whether we want "None" to result in the same estimator for multiple runs of fit. I think in particular in cross-validation it might be useful to have different random states.

@NicolasHug
Member Author

> I think we need to decide on whether we want "None" to result in the same estimator for multiple runs of fit.

This SLEP's answer is no. The reason, as detailed in the SLEP, is that allowing fit to behave differently across calls on a given instance is precisely the cause of all these bugs and unexpected behaviours.

> I think in particular in cross-validation it might be useful to have different random states.

Yet we/users mostly use random_state=0, right?

@amueller
Member

> Yet we/users mostly use random_state=0, right?

I don't think that's true.

@jnothman left a comment

Wouldn't a more compliant variant of the proposed solution simply copy the RandomState in check_random_state? This would not fix the None case, admittedly.

There are other related ideas: testing with multiple random states, and using a fixed seed by default in all cases.
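
A sketch of what that copying variant could look like (hypothetical; it mirrors the real check_random_state but deep-copies a passed-in instance):

import copy
import numbers

import numpy as np

def check_random_state_copying(seed):
    # Same contract as sklearn.utils.check_random_state, except that a
    # passed-in RandomState is copied so the caller's generator is not
    # consumed. The None case is unchanged: the global RNG is shared.
    if seed is None or seed is np.random.mtrand._rand:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return copy.deepcopy(seed)
    raise ValueError('%r cannot be used to seed a RandomState instance' % seed)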

`__init__`::

    def __init__(self, ..., random_state=None):
        self.random_state = check_random_state(random_state).get_state()
Member

This would be better done with a setter
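
For illustration, a sketch of the setter-based variant:

from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state

    @property
    def random_state(self):
        return self._random_state

    @random_state.setter
    def random_state(self, value):
        # Any assignment (including via set_params) goes through the
        # same conversion to a concrete state tuple.
        self._random_state = check_random_state(value).get_state()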

Member

I believe that such a behavior leads to multiple folds in a CV having the same random state, and as a result it does not seem desirable.

@jnothman
Member

jnothman commented Dec 8, 2019

fit can also store the initial random state to make it reproducible, without changing behaviours otherwise
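
One possible reading of this, as a sketch (the random_state_ attribute is illustrative):

from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        rng = check_random_state(self.random_state)
        # Record the state of the incoming RNG before consuming it, so
        # this particular fit can be replayed later; the behaviour is
        # otherwise unchanged.
        self.random_state_ = rng.get_state()
        # ... use rng for all random draws ...
        return self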

@NicolasHug
Member Author

NicolasHug commented Dec 9, 2019

Thanks for the comments, Joel.

> Wouldn't a more compliant variant of the proposed solution simply copy the RandomState in check_random_state?

Probably, I just didn't want to modify check_random_state.

> There are other related ideas: testing with multiple random states, and using a fixed seed by default in all cases.

Changing the defaults might minimize issues on the users' side, but it only partially addresses the main issues raised in this SLEP. As long as we allow estimators and splitters to be stateful, we are still exposing ourselves to the bugs described here.

> fit can also store the initial random state to make it reproducible, without changing behaviours otherwise

This is discussed in the SLEP here, I believe:
https://github.com/scikit-learn/enhancement_proposals/pull/24/files#diff-5f21dbd421ae7420240d592cf8e56332R351

Statefulness and violation of fit idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Estimators should be stateless. That is, any call to `fit` should forget
Member

Statelessness and identical results are not the same thing imho. Discussing with @GaelVaroquaux and @agramfort yesterday, the three of us agreed (I don't remember @adrinjalali's stance) that having different calls to fit on a RandomForest return different models is desired behavior that we do not want to change.
You are arguing here that it is undesired behavior. Independent of anything else, we need to agree on what our desired behavior is.

Member

If we accept this as the default behavior, there is no way to generally avoid the bugs. We can test for them and fix them. They are bugs after all, caused by subtleties of the current design. I don't see how we can avoid these subtleties while maintaining the behavior, though.

Member Author

> If we accept this as the default behavior, there is no way to generally avoid the bugs

Indeed! My goal with this SLEP is to make this clear :)

  • We can either fix the bugs once and for all by changing the current behavior
  • Or we can keep things as they are, knowing the price we're paying for that

I'll be happy with either outcome. But whatever we end up choosing, I want the decision to be informed and well thought out.

Member Author

> having different calls to fit in a RandomForest return different models is desired behavior that we do not want to change

What is the rationale for this? Clearly this is what is causing all the difficulties.

What about CV splitters?

What's wrong with creating another instance?

Member

Creating another instance with None? So that instantiation records state? That also seems odd and would go against my expectations.

My main use case is running cross-validation, where I prefer my random forest to use different random states for each split.
There are probably use cases where using the same random state makes sense, but I definitely want to be able to use different random states as an option (and I would make it the default option).

Member Author

> So that instantiation records state? That also seems odd and would go against my expectations.

Storing the state in __init__ is precisely the proposed solution ;)

> I prefer my random forest to use different random states for each split.

I think we can easily make this behavior available through our CV routines like cross_validate or cross_val_score, e.g. by adding a new parameter like randomness="splitwise" or something similar.
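
For illustration, a hypothetical sketch of what that could do (the function name and parameter are illustrative, not an existing API):

import numpy as np
from sklearn.base import clone

def cross_validate_splitwise(estimator, X, y, cv, random_state=0):
    # Derive a different, reproducible seed for the estimator on each
    # CV split, so each fold is fit with its own RNG.
    rng = np.random.RandomState(random_state)
    scores = []
    for train, test in cv.split(X, y):
        est = clone(estimator)
        est.set_params(random_state=rng.randint(np.iinfo(np.int32).max))
        est.fit(X[train], y[train])
        scores.append(est.score(X[test], y[test]))
    return scores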

Member

randomness="splitwise" would be a hack that would need to be added in many places, including for fairly simple situations. I'm not in favor.

Member Author

As far as I can tell, only in the tools that take cv as input. Basically:

  • GridSearchCV and RandomizedSearchCV
  • cross_validate, cross_val_predict, cross_val_score, permutation_test_score, validation_curve, learning_curve
  • the EstimatorCV stuff, which hopefully we will deprecate by instead having a decent GridSearch + warm start mechanism.

That's not that many places IMHO

to store a private seed that was generated the first time `fit` is called.

This is a typical example of many other similar bugs that we need to
monkey-patch with potentially overly complex logic.
Member

why monkey-patch?

Member Author

Monkey-patch isn't the right term. I guess what I meant is that the patches are not obvious, and they make the code harder to understand and maintain.

@adrinjalali left a comment

I'm personally undecided on this one.

What I completely agree with is that the documentation is not clear, and that the behavior is odd in some cases. I'm not sure whether we should change the behavior or document it better.

This is a typical example of many other similar bugs that we need to
monkey-patch with potentially overly complex logic.

Cloning may return a different estimator
Member

This to me is one of the most problematic issues.

Member

I don't agree that this is the case.

Member

I think it is about defining what "same" / "not the same" means.

I would argue that the definition should be statistical, not exact equality.

<https://github.com/scikit-learn/scikit-learn/issues/15611>`_

CV-Splitters statefulness
~~~~~~~~~~~~~~~~~~~~~~~~~
Member

I think of this one as more of a documentation issue. As a user, I'd like cv.split to return different values by default.

Member Author

Do you also expect the splitter to be stateless when you pass an int, but stateful when you pass an instance or None?
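
For reference, the asymmetry can be demonstrated like this (current behavior, a sketch):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# With an int, split() is stateless: every call yields the same folds.
cv = KFold(n_splits=2, shuffle=True, random_state=0)
folds_a = [test.tolist() for _, test in cv.split(X)]
folds_b = [test.tolist() for _, test in cv.split(X)]
assert folds_a == folds_b

# With a RandomState instance (or None), the RNG is consumed at each call,
# so two calls to split() on the same instance yield different folds
# (with overwhelming probability).
cv = KFold(n_splits=2, shuffle=True, random_state=np.random.RandomState(0))
folds_a = [test.tolist() for _, test in cv.split(X)]
folds_b = [test.tolist() for _, test in cv.split(X)]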

Member Author

> As a user, I'd like cv.split to return different values by default

Me too, but only across executions, not across calls to split. The current proposal will still yield different splits across executions.

Member

Right now, users expect split to return different results per call if random_state is not an int, and I think that's a reasonable expectation/behavior. I don't see why we need to change that. The alternative behavior is somewhat less intuitive to me.

Member Author

> I don't see why we need to change that

I'm not saying we need to, but the reason we might want to consider changing this is that the current behavior is bug-prone and inconsistent, as (hopefully) illustrated in the SLEP.

entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_.

Potential bugs in custom parameter searches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Member

As we've discussed on the SuccessiveHalving PR, I think that design can be improved. I wouldn't make that a reason to change the random state semantics.

Member

Indeed, changing the whole library is something that needs to be taken seriously.

Member Author

I agree this issue alone isn't enough, but there are many more arguments in the SLEP. That's just yet another reason.

I think it's important to show how the current design affects scikit-learn (and third party libraries) in ways that are prone to subtle yet severe bugs, and in parts where we might not expect. This section is about that.

BTW @GaelVaroquaux, you wrote in #24 (comment) that this issue with SH is the origin of the SLEP, but that's not true! The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.

Comment on lines +288 to +292
- Backward compatibility is preserved between scikit-learn versions. Let A
be a version with the current behaviour (say 0.22) and let B be a version
where the new behaviour is implemented. The models and the splits obtained
will be the same in A and in B. That property may not be respected with
other solutions, see below.
Member

This will make fit and split idempotent, won't it? In that sense, it's not backward compatible. One major issue is that (I think) many users do expect split to return different splits at each run.

Member Author

Sure, we can't preserve full backward compatibility since we're changing the behavior. What I mean is that the change is backward compatible the first time split or fit is called (not for subsequent calls, as you noticed). I'll clarify.

@GaelVaroquaux
Member

Rather than make scattered comments on the PR, let me try to summarize thoughts that arose from a discussion with @amueller, @adrinjalali and @agramfort at NeurIPS. I think it would be great to integrate these notes in the SLEP, partly because the goal of the SLEP is to document the reasons for the behavior.

Problem1 Origin of the SLEP: SuccessiveHalving, but also other suboptimal behaviors such as warm restart.

Problem2 Other problem mentioned: stateful and non-deterministic behavior is annoying (because for advanced usage, stateless and deterministic are usually better)

Two aspects of the considered modifications should be separated:

Question1 Default: should behavior by default be random or deterministic?
Question2 Dealing with an "rng": should rng objects be valid inputs for random_state?

Considerations for defining the API

We have two populations we are targeting: naive users and advanced users. Ideally, things should be easy/natural for naive users, but leave room for advanced patterns.

For question1 To address question 1 above, we need to consider: what is the "naive" user expectation, where naive is defined as not expert in scikit-learn? In other words: to teach (e.g. to teach random forests), what is the best default? Things called "Random" are often expected to be random.

For question2 An advanced usage that is really important is to enable a complete chain of operations with large entropy, yet keep it reproducible. Seeding the global rng is not a good solution, because in parallel computing it breaks down due to out-of-order execution.

To take a specific example, one might want to apply cross_val_score to an estimator with randomness, such as a random forest, and gauge within the cross_val_score multiple sources of variance in the score, including that due to this randomness.

I do not see a way of answering this need in a general setting without passing a random number generator around. Passing a random generator around is actually a common pattern in numerical computing.
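
For example, a sketch of the pattern:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# One generator seeded at the top of the script and passed everywhere:
# each consumer draws different entropy from the shared stream, yet the
# whole chain is reproducible end to end.
rng = np.random.RandomState(42)
X = rng.rand(100, 2)
X_train, X_test = train_test_split(X, random_state=rng)
km = KMeans(n_clusters=3, random_state=rng).fit(X_train)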

Redefining the contract?

Problem1 is a serious one: it arises from statefulness, and the lack of determinism that results makes some computing patterns unreliable.

Taking a step back, to address it, we might need a mechanism to enforce determinism in an estimator in the context of these computing patterns.

Overriding random_state

The simplest approach would be that, in such a context, we modify the estimator that is passed in with something like (code not tested):

import warnings

import numpy as np

def freeze_random_state(estimator, msg='here'):
    # Only act on estimators whose random_state is an actual RandomState
    # instance; ints and estimators without randomness are left alone.
    if not hasattr(estimator, 'random_state'):
        return estimator
    random_state = estimator.random_state
    if not isinstance(random_state, np.random.RandomState):
        return estimator
    warnings.warn('random_state parameter overridden. Indeed, estimator %s has '
                  'an uncontrolled random state leading to non-deterministic '
                  'behavior, which leads to inconsistent behavior %s.'
                  % (estimator, msg))
    # Draw a single int from the RNG and store it as a fixed seed.
    estimator.random_state = random_state.randint(2 ** 32)
    return estimator

We would call this code (or similar) where appropriate, i.e. only where non-deterministic behavior leads to inconsistency.
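
For example, a hypothetical call site (toy data, reusing the function above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(20, 3), rng.randint(0, 2, 20)

clf = RandomForestClassifier(random_state=np.random.RandomState(0))
clf = freeze_random_state(clf, msg='in a parameter search')
# clf.random_state is now a plain int, so refitting is deterministic.
clf.fit(X, y)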

The drawback is that it leads to warnings and special cases.

Imposing determinism

A more invasive change (hence one that requires more thought) is along the lines of what is proposed in the SLEP: at init (or at first fit), run check_random_state, and if the result is a RandomState instance, turn it into a random int.
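
A sketch of this variant (hypothetical; ints pass through unchanged, while None or a RandomState instance is collapsed to a single int drawn at init):

import numbers

import numpy as np
from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        if isinstance(random_state, numbers.Integral):
            # An explicit seed is kept as-is.
            self.random_state = random_state
        else:
            # None or an instance is collapsed to one int at init,
            # making every subsequent fit deterministic.
            rng = check_random_state(random_state)
            self.random_state = rng.randint(np.iinfo(np.int32).max)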

A drawback is that it renders our API more complicated (things happen at init). As a result, libraries that implement our API may not do this, leading to inconsistent behavior.

Also, it may violate users' expectations, as multiple fits of the same random forest will return the same thing, though multiple instantiations / clones will not.

@GaelVaroquaux
Member

I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.

@NicolasHug
Member Author

> I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.

That's the main reason it's not the proposed solution, and that's noted in the SLEP already.

> the behavior of fit is strongly dependent on the past of the instance.

This precisely applies to the current design, and this is exactly what I want to fix.

@rth
Member

rth commented May 1, 2020

Thanks for this work! I think we should add a section with considerations about the compatibility of the proposed solution with future support for the new RNG API in numpy (NEP 19 and scikit-learn/scikit-learn#16988).

I think storing only the random seed in estimators, as proposed here, would potentially allow switching the RNG backend with some config option. Of course, the goal of this SLEP is orthogonal, but I think any in-depth refactoring of the random state handling should take into account that, in the long term, focusing too much on np.random.RandomState might not be the best approach. From NEP 19:

> All current usages of RandomState will continue to work in perpetuity, though some may be discouraged through documentation
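
For instance, storing a plain integer seed keeps both backends reachable (a sketch; the config switch itself is hypothetical):

import numpy as np

seed = 12345
legacy_rng = np.random.RandomState(seed)  # current RandomState backend
new_rng = np.random.default_rng(seed)     # NEP 19 Generator backend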

cc @grisaitis

@NicolasHug
Member Author

@GaelVaroquaux, thank you for your notes in #24 (comment).

I have read them carefully, but unfortunately, I am not quite sure how I can leverage these notes to make this SLEP move forward, or how to integrate them in the SLEP. There are a few things in these notes that I don't completely understand, but also some other things that I don't fully agree with.

> Problem1 Origin of the SLEP: SuccessiveHalving

The SH issue isn't the origin of the SLEP; it's only a data point. The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.

I also don't share the framing of the problem that was proposed, with these two questions: 1. whether randomness should be the default, and 2. whether we should allow passing RNGs. While these two questions are definitely worth considering, I believe they are mostly tangential to this SLEP and to the points that I'm trying to make throughout this doc. To me, and I hope this is properly illustrated in this SLEP, the main relevant question is: do we want our estimators (and splitters) to be stateful across calls to fit() (and split())?

If I interpret your comments properly (#24 (comment) and #24 (comment)), you and I seem to agree that the answer should be no ;)
I might be misinterpreting your thoughts here, so I'd appreciate it if you could clarify when you have time.

Thanks!
