
SLEP011: Fixing randomness handling in estimators and splitters #24


Closed
wants to merge 11 commits

Conversation

NicolasHug
Member

This SLEP aims to fix the issues related to passing RandomState instances or None to estimators and CV splitters.

The proposed change is to store an actual random state, as returned by NumPy's RandomState.get_state() method, in the self.random_state attribute.

The net result is that calling fit or split on the same instance will always use the same RNG. The changes are mostly backward compatible.
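
In code, the proposed pattern would look roughly like this (a minimal sketch; MyEstimator stands in for any estimator with randomness):

import numpy as np
from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        # Whatever the user passed (None, an int, or a RandomState
        # instance) is converted once, at init, into a concrete state
        # tuple as returned by RandomState.get_state().
        self.random_state = check_random_state(random_state).get_state()

    def fit(self, X, y=None):
        # Each call to fit restores the exact same state, so fitting the
        # same instance twice uses the same RNG stream.
        rng = np.random.RandomState()
        rng.set_state(self.random_state)
        # ... use rng for all random draws ...
        return self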

This is a follow-up to scikit-learn/scikit-learn#14042 and scikit-learn/scikit-learn#15177.

@scikit-learn/core-devs, this probably looks more like a manifesto than a SLEP at the moment, so please share your comments ;)

@amueller
Member

I didn't do a review but from a short discussion:
I think we need to decide on whether we want "None" to result in the same estimator for multiple runs of fit. I think in particular in cross-validation it might be useful to have different random states.

@NicolasHug
Member Author

> I think we need to decide on whether we want "None" to result in the same estimator for multiple runs of fit.

This SLEP's answer is no. The reason, as detailed in the SLEP, is that allowing fit to behave differently across calls on a given instance is precisely the cause of all these bugs and unexpected behaviours.

> I think in particular in cross-validation it might be useful to have different random states.

Yet we/users mostly use random_state=0, right?

@amueller
Member

> Yet we/users mostly use random_state=0, right?

I don't think that's true.

@jnothman left a comment

Wouldn't a more compliant variant of the proposed solution simply copy the RandomState in check_random_state? This would not fix the None case, admittedly.

There are other related ideas: testing with multiple random states, and using a fixed seed by default in all cases.
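
A sketch of what that copying variant could look like (hypothetical; it mirrors the real check_random_state but deep-copies a passed-in instance):

import copy
import numbers

import numpy as np

def check_random_state_copying(seed):
    # Same contract as sklearn.utils.check_random_state, except that a
    # passed-in RandomState is copied so the caller's generator is not
    # consumed. The None case is unchanged: the global RNG is shared.
    if seed is None or seed is np.random.mtrand._rand:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return copy.deepcopy(seed)
    raise ValueError('%r cannot be used to seed a RandomState instance' % seed)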

`__init__`::

    def __init__(self, ..., random_state=None):
        self.random_state = check_random_state(random_state).get_state()
Member

This would be better done with a setter
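
For illustration, a sketch of the setter-based variant:

from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state

    @property
    def random_state(self):
        return self._random_state

    @random_state.setter
    def random_state(self, value):
        # Any assignment (including via set_params) goes through the
        # same conversion to a concrete state tuple.
        self._random_state = check_random_state(value).get_state()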

Member

I believe that such a behavior leads to multiple folds in a CV having the same random state, and as a result it does not seem desirable.

@jnothman
Member

jnothman commented Dec 8, 2019

fit can also store the initial random state to make it reproducible, without changing behaviours otherwise
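
One possible reading of this, as a sketch (the random_state_ attribute is illustrative):

from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        rng = check_random_state(self.random_state)
        # Record the state of the incoming RNG before consuming it, so
        # this particular fit can be replayed later; the behaviour is
        # otherwise unchanged.
        self.random_state_ = rng.get_state()
        # ... use rng for all random draws ...
        return self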

@NicolasHug
Member Author

NicolasHug commented Dec 9, 2019

Thanks for the comments, Joel.

> Wouldn't a more compliant variant of the proposed solution simply copy the RandomState in check_random_state?

Probably, I just didn't want to modify check_random_state.

> There are other related ideas: testing with multiple random states, and using a fixed seed by default in all cases.

Changing the defaults might minimize issues on the users' side, but it only partially addresses the main issues raised in this SLEP. As long as we allow estimators and splitters to be stateful, we are still exposing ourselves to the bugs described here.

> fit can also store the initial random state to make it reproducible, without changing behaviours otherwise

This is discussed in the SLEP here, I believe:
https://github.com/scikit-learn/enhancement_proposals/pull/24/files#diff-5f21dbd421ae7420240d592cf8e56332R351

Statefulness and violation of fit idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Estimators should be stateless. That is, any call to `fit` should forget
Member

Statelessness and identical results are not the same thing imho. Discussing with @GaelVaroquaux and @agramfort yesterday, the three of us agreed (I don't remember @adrinjalali's stance) that having different calls to fit on a RandomForest return different models is desired behavior that we do not want to change.
You are arguing here that it is undesired behavior. Independent of anything else, we need to agree on what our desired behavior is.

Member

If we accept this as the default behavior, there is no way to generally avoid the bugs. We can test for them and fix them. They are bugs after all, caused by subtleties of the current design. I don't see how we can avoid these subtleties while maintaining the behavior, though.

Member Author

> If we accept this as the default behavior, there is no way to generally avoid the bugs

Indeed! My goal with this SLEP is to make this clear :)

  • We can either fix the bugs once and for all by changing the current behavior
  • Or we can keep things as they are, knowing the price we're paying for that

I'll be happy with either outcome. But whatever we end up choosing, I want the decision to be informed and well thought out.

Member Author

> having different calls to fit in a RandomForest return different models is desired behavior that we do not want to change

What is the rationale for this? Clearly this is what is causing all the difficulties.

What about CV splitters?

What's wrong with creating another instance?

Member

Creating another instance with None? So that instantiation records state? That also seems odd and would go against my expectations.

My main use case is running cross-validation, where I prefer my random forest to use different random states for each split.
There are probably use cases where using the same random state makes sense, but I definitely want to be able to use different random states as an option (and I would make it the default option).

Member Author

> So that instantiation records state? That also seems odd and would go against my expectations.

Storing the state in __init__ is precisely the proposed solution ;)

> I prefer my random forest to use different random states for each split.

I think we can easily make this behavior available through our CV routines like cross_validate or cross_val_score, e.g. by adding a new parameter like randomness="splitwise" or something similar.
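
For illustration, a hypothetical sketch of what that could do (the function name and parameter are illustrative, not an existing API):

import numpy as np
from sklearn.base import clone

def cross_validate_splitwise(estimator, X, y, cv, random_state=0):
    # Derive a different, reproducible seed for the estimator on each
    # CV split, so each fold is fit with its own RNG.
    rng = np.random.RandomState(random_state)
    scores = []
    for train, test in cv.split(X, y):
        est = clone(estimator)
        est.set_params(random_state=rng.randint(np.iinfo(np.int32).max))
        est.fit(X[train], y[train])
        scores.append(est.score(X[test], y[test]))
    return scores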

Member

randomness="splitwise" would be a hack that would need to be added in many places, including for fairly simple situations. I'm not in favor.

Member Author

As far as I can tell, only in the tools that take cv as input. Basically:

  • GridSearchCV and RandomizedSearchCV
  • cross_validate, cross_val_predict, cross_val_score, permutation_test_score, validation_curve, learning_curve
  • the EstimatorCV stuff, which hopefully we will deprecate by instead having a decent GridSearch + warm start mechanism.

That's not that many places IMHO

to store a private seed that was generated the first time `fit` is called.

This is a typical example of many other similar bugs that we need to
monkey-patch with potentially overly complex logic.
Member

why monkey-patch?

Member Author

Monkey-patch isn't the right term. I guess what I meant is that the patches are not obvious, and they make the code harder to understand and maintain.

@adrinjalali left a comment

I'm personally undecided on this one.

What I completely agree with is that the documentation is not clear, and that the behavior is odd in some cases. I'm not sure whether we should change the behavior or document it better.

This is a typical example of many other similar bugs that we need to
monkey-patch with potentially overly complex logic.

Cloning may return a different estimator
Member

This to me is one of the most problematic issues.

Member

I don't agree that this is the case.

Member

I think it is about defining what "same" / "not the same" means.

I would argue that the definition should be statistical, not exact equality.

<https://github.com/scikit-learn/scikit-learn/issues/15611>`_

CV-Splitters statefulness
~~~~~~~~~~~~~~~~~~~~~~~~~
Member

I think of this one as more of a documentation issue. As a user, I'd like cv.split to return different values by default.

Member Author

Do you also expect the splitter to be stateless when you pass an int, but stateful when you pass an instance or None?
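
For reference, the asymmetry can be demonstrated like this (current behavior, a sketch):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# With an int, split() is stateless: every call yields the same folds.
cv = KFold(n_splits=2, shuffle=True, random_state=0)
folds_a = [test.tolist() for _, test in cv.split(X)]
folds_b = [test.tolist() for _, test in cv.split(X)]
assert folds_a == folds_b

# With a RandomState instance (or None), the RNG is consumed at each call,
# so two calls to split() on the same instance yield different folds
# (with overwhelming probability).
cv = KFold(n_splits=2, shuffle=True, random_state=np.random.RandomState(0))
folds_a = [test.tolist() for _, test in cv.split(X)]
folds_b = [test.tolist() for _, test in cv.split(X)]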

Member Author

> As a user, I'd like cv.split to return different values by default

Me too, but only across executions, not across calls to split. The current proposal will still yield different splits across executions.

Member

Right now, users expect split to return different results per call if random_state is not an int, and I think that's a reasonable expectation/behavior. I don't see why we need to change that. The alternative behavior is somewhat less intuitive to me.

Member Author

> I don't see why we need to change that

I'm not saying we need to, but the reason we might want to consider changing this is that the current behavior is bug-prone and inconsistent, as (hopefully) illustrated in the SLEP.

entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_.

Potential bugs in custom parameter searches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Member

As we've discussed on the SuccessiveHalving PR, I think that design can be improved. I wouldn't make that a reason to change the random state semantics.

Member

Indeed, changing the whole library is something that needs to be taken seriously.

Member Author

I agree this issue alone isn't enough, but there are many more arguments in the SLEP. That's just yet another reason.

I think it's important to show how the current design affects scikit-learn (and third party libraries) in ways that are prone to subtle yet severe bugs, and in parts where we might not expect. This section is about that.

BTW @GaelVaroquaux, you wrote in #24 (comment) that this issue with SH is the origin of the SLEP, but that's not true! The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.

Comment on lines +288 to +292
- Backward compatibility is preserved between scikit-learn versions. Let A
be a version with the current behaviour (say 0.22) and let B be a version
where the new behaviour is implemented. The models and the splits obtained
will be the same in A and in B. That property may not be respected with
other solutions, see below.
Member

This will make fit and split idempotent, won't it? In that sense, it's not backward compatible. One major issue is that (I think) many users do expect split to return different splits at each run.

Member Author

Sure, we can't preserve full backward compatibility since we're changing the behavior. What I mean is that the change is backward compatible the first time split or fit is called (not for subsequent calls, as you noticed). I'll clarify.

@GaelVaroquaux
Member

Rather than make scattered comments on the PR, let me try to summarize thoughts that arose from a discussion with @amueller, @adrinjalali and @agramfort at NeurIPS. I think it would be great to integrate these notes in the SLEP, partly because the goal of the SLEP is to document the reasons for the behavior.

Problem1 Origin of the SLEP: SuccessiveHalving, but also other suboptimal behaviors such as warm restart.

Problem2 Other problem mentioned: stateful and non-deterministic behavior is annoying (because for advanced usage, stateless and deterministic are usually better)

Two aspects of the considered modifications should be separated:

Question1 Default: should behavior by default be random or deterministic?
Question2 Dealing with an "rng": should rng objects be valid inputs for random_state?

Considerations for defining the API

We have two populations we are targeting: naive users and advanced users. Ideally, things should be easy/natural for naive users, but leave room for advanced patterns.

For question1 To address question 1 above, we need to consider: what is the "naive" user expectation, where naive is defined as not expert in scikit-learn? In other words: to teach (e.g. to teach random forests), what is the best default? Things called "Random" are often expected to be random.

For question2 An advanced usage that is really important is to enable a complete chain of operations with large entropy, yet keep it reproducible. Seeding the global rng is not a good solution, because in parallel computing it breaks down due to out-of-order execution.

To take a specific example, one might want to apply cross_val_score to an estimator with randomness, such as a random forest, and gauge within the cross_val_score multiple sources of variance in the score, including that due to this randomness.

I do not see a way of answering this need in a general setting without passing a random number generator around. Passing a random generator around is actually a common pattern in numerical computing.
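
For example, a sketch of the pattern:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# One generator seeded at the top of the script and passed everywhere:
# each consumer draws different entropy from the shared stream, yet the
# whole chain is reproducible end to end.
rng = np.random.RandomState(42)
X = rng.rand(100, 2)
X_train, X_test = train_test_split(X, random_state=rng)
km = KMeans(n_clusters=3, random_state=rng).fit(X_train)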

Redefining the contract?

Problem1 is a serious one: it arises from statefulness, and the lack of determinism that results makes some computing patterns unreliable.

Taking a step back, to address it, we might need a mechanism to enforce determinism in an estimator in the context of these computing patterns.

Overriding random_state

The simplest approach would be that, in such a context, we modify the estimator that is passed in with something like (code not tested):

import warnings

import numpy as np

def freeze_random_state(estimator, msg='here'):
    # Only act on estimators whose random_state is an actual RandomState
    # instance; ints and estimators without randomness are left alone.
    if not hasattr(estimator, 'random_state'):
        return estimator
    random_state = estimator.random_state
    if not isinstance(random_state, np.random.RandomState):
        return estimator
    warnings.warn('random_state parameter overridden. Indeed, estimator %s has '
                  'an uncontrolled random state leading to non-deterministic '
                  'behavior, which leads to inconsistent behavior %s.'
                  % (estimator, msg))
    # Draw a single int from the RNG and store it as a fixed seed.
    estimator.random_state = random_state.randint(2 ** 32)
    return estimator

We would call this code (or similar) where appropriate, i.e. only where non-deterministic behavior leads to inconsistency.
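
For example, a hypothetical call site (toy data, reusing the function above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(20, 3), rng.randint(0, 2, 20)

clf = RandomForestClassifier(random_state=np.random.RandomState(0))
clf = freeze_random_state(clf, msg='in a parameter search')
# clf.random_state is now a plain int, so refitting is deterministic.
clf.fit(X, y)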

The drawback is that it leads to warnings and special cases.

Imposing determinism

A more invasive change (hence one that requires more thought) is along the lines of what is proposed in the SLEP: at init (or at first fit), run check_random_state, and if the result is a RandomState instance, turn it into a random int.
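
A sketch of this variant (hypothetical; ints pass through unchanged, while None or a RandomState instance is collapsed to a single int drawn at init):

import numbers

import numpy as np
from sklearn.utils import check_random_state

class MyEstimator:
    def __init__(self, random_state=None):
        if isinstance(random_state, numbers.Integral):
            # An explicit seed is kept as-is.
            self.random_state = random_state
        else:
            # None or an instance is collapsed to one int at init,
            # making every subsequent fit deterministic.
            rng = check_random_state(random_state)
            self.random_state = rng.randint(np.iinfo(np.int32).max)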

A drawback is that it renders our API more complicated (things happen at init). As a result, libraries that implement our API may not do this, leading to inconsistent behavior.

Also, it may violate users' expectations, as multiple fits of the same random forest will return the same thing, though multiple instantiations / clones will not.

@GaelVaroquaux
Member

I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.

@NicolasHug
Member Author

> I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.

That's the main reason it's not the proposed solution, and that's noted in the SLEP already.

> the behavior of fit is strongly dependent on the past of the instance.

This precisely applies to the current design, and this is exactly what I want to fix.

@rth
Member

rth commented May 1, 2020

Thanks for this work! I think we should add a section with considerations about the compatibility of the proposed solution with future support for the new RNG API in numpy (NEP 19 and scikit-learn/scikit-learn#16988).

I think storing only the random seed in estimators, as proposed here, would potentially allow switching the RNG backend with some config option. Of course, the goal of this SLEP is orthogonal, but I think any in-depth refactoring of the random state handling should take into account that, in the long term, focusing too much on np.random.RandomState might not be the best approach. From NEP 19:

> All current usages of RandomState will continue to work in perpetuity, though some may be discouraged through documentation
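
For instance, storing a plain integer seed keeps both backends reachable (a sketch; the config switch itself is hypothetical):

import numpy as np

seed = 12345
legacy_rng = np.random.RandomState(seed)  # current RandomState backend
new_rng = np.random.default_rng(seed)     # NEP 19 Generator backend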

cc @grisaitis

@NicolasHug
Member Author

@GaelVaroquaux, thank you for your notes in #24 (comment).

I have read them carefully, but unfortunately, I am not quite sure how I can leverage these notes to make this SLEP move forward, or how to integrate them in the SLEP. There are a few things in these notes that I don't completely understand, but also some other things that I don't fully agree with.

> Problem1 Origin of the SLEP: SuccessiveHalving

The SH issue isn't the origin of the SLEP; it's only a data point. The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.

I also don't share the framing of the problem that was proposed, with these two questions: 1. whether randomness should be the default, and 2. whether we should allow passing RNGs. While these two questions are definitely worth considering, I believe they are mostly tangential to this SLEP and to the points that I'm trying to make throughout this doc. To me, and I hope this is properly illustrated in this SLEP, the main relevant question is: do we want our estimators (and splitters) to be stateful across calls to fit() (and split())?

If I interpret your comments properly (#24 (comment) and #24 (comment)), you and I seem to agree that the answer should be no ;)
I might be misinterpreting your thoughts here, so I'd appreciate it if you could clarify when you have time.

Thanks!
