SLEP011: Fixing randomness handling in estimators and splitters #24
Conversation
I didn't do a review, but from a short discussion:
This SLEP's answer is no. The reason is (as detailed in the SLEP): allowing
Yet we/users mostly use random_state=0, right?
I don't think that's true.
Wouldn't a more compliant variant of the proposed solution simply copy the RandomState in check_random_state? This would not fix the None case, admittedly.
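For illustration, a minimal sketch of that variant (the name `check_random_state_copying` is hypothetical, not scikit-learn's actual `check_random_state`; only the RandomState branch changes):

```python
import copy
import numbers

import numpy as np


def check_random_state_copying(seed):
    """Sketch of a copying variant of check_random_state (hypothetical)."""
    if seed is None:
        # Unchanged: the global RNG, still stateful across calls.
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        # The suggested change: work on a copy so the caller's RNG
        # is not consumed by fit/split.
        return copy.deepcopy(seed)
    raise ValueError('%r cannot be used to seed a RandomState instance' % seed)
```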
There are other related ideas: testing with multiple random states; and using fixed seed by default in all cases
`__init__`::

    def __init__(self, ..., random_state=None):
        self.random_state = check_random_state(random_state).get_state()
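To make the quoted proposal concrete, here is a minimal sketch of how an estimator could consume the stored state in `fit` (the class name and the `first_draw_` attribute are illustrative only, not part of the SLEP text):

```python
import numpy as np
from sklearn.utils import check_random_state


class MyEstimator:
    """Hypothetical estimator following the proposed __init__ behaviour."""

    def __init__(self, random_state=None):
        # Proposed behaviour: freeze the RNG state at instantiation time.
        self.random_state = check_random_state(random_state).get_state()

    def fit(self, X, y=None):
        # Rebuild an identical RNG on every call, so repeated fits of the
        # same instance draw the same random numbers.
        rng = np.random.RandomState()
        rng.set_state(self.random_state)
        # ... use `rng` for any randomized computation ...
        self.first_draw_ = rng.rand()
        return self
```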
This would be better done with a setter
I believe that such a behavior leads to multiple folds in a CV to have the same random state, and as a result does not seem desirable.
fit can also store the initial random state to make it reproducible, without changing behaviours otherwise
Thanks for the comments, Joel.
Probably, I just didn't want to modify
Changing the defaults might minimize issues on the users side. But it only partially addresses the main issues raised in this SLEP. As long as we allow estimators and splitters to be stateful, we are still exposing ourselves to the bugs described here.
This is discussed in the SLEP here, I believe
Statefulness and violation of fit idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Estimators should be stateless. That is, any call to `fit` should forget
Statelessness and identical results are not the same thing imho. Discussing with @GaelVaroquaux and @agramfort yesterday, the three of us agreed (don't remember @adrinjalali's stance) that having different calls to fit in a RandomForest return different models is desired behavior that we do not want to change.
You are arguing here that it is undesired behavior. Independent of anything else, we need to agree on what our desired behavior is.
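For reference, a small snippet illustrating the behaviour under discussion, assuming current scikit-learn semantics with `random_state=None`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

# Current semantics with random_state=None: each call to fit draws a fresh
# RNG, so refitting the *same* instance generally yields a different forest.
clf = RandomForestClassifier(n_estimators=10, random_state=None)
imp_1 = clf.fit(X, y).feature_importances_
imp_2 = clf.fit(X, y).feature_importances_
# imp_1 and imp_2 generally differ; under the SLEP's proposal they would match.
```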
If we accept this as the default behavior, there is no way to generally avoid the bugs. We can test for them and fix them. They are bugs after all, caused by subtleties of the current design. I don't see how we can avoid these subtleties while maintaining the behavior, though.
If we accept this as the default behavior, there is no way to generally avoid the bugs
Indeed! My goal with this SLEP is to make this clear :)
- We can either fix the bugs once and for all by changing the current behavior
- Or we can keep things as they are, knowing the price we're paying for that
I'll be happy with either outcome. But whatever we end up choosing, I want the decision to be informed and well thought out.
having different calls to fit in a RandomForest return different models is desired behavior that we do not want to change
What is the rationale for this? Clearly this is what is causing all the difficulties.
What about CV splitters?
What's wrong with creating another instance?
Creating another instance with None? So that instantiation records state? That also seems odd and would go against my expectations.
My main use case is running cross-validation, where I prefer my random forest to use different random states for each split.
There are probably use cases where using the same random state makes sense, but I definitely want to be able to use different random states as an option (and I would make it the default option).
So that instantiation records state? That seems also odd and would go against my expectations.
Storing the state in init is precisely the proposed solution ;)
I prefer my random forest to use different random states for each split.
I think we can easily make this behavior available through our CV routines like cross_validate or cross_val_score, by e.g. adding a new parameter randomness="splitwise" or something similar.
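To sketch the idea (note that `randomness="splitwise"` is only a hypothetical parameter at this point; the helper below is one possible reading of per-split reseeding, not an existing API):

```python
import numpy as np
from sklearn.base import clone


def fit_per_split(estimator, X, y, cv, random_state=0):
    """Fit a clone of `estimator` on each training fold, giving every fold its
    own derived integer seed (one possible reading of randomness="splitwise")."""
    rng = np.random.RandomState(random_state)
    fitted = []
    for train_idx, _ in cv.split(X, y):
        est = clone(estimator)
        # Derive a fresh integer seed for this split only.
        est.set_params(random_state=rng.randint(np.iinfo(np.int32).max))
        fitted.append(est.fit(X[train_idx], y[train_idx]))
    return fitted
```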
randomness="splitwise" would be a hack that would need to be added in many places, including for fairly simple situations. I'm not in favor.
As far as I can tell, only in the tools that take cv as input. Basically:
- GridSearchCV and RandomizedSearchCV
- cross_validate, cross_val_predict, cross_val_score, permutation_test_score, validation_curve, learning_curve
- the EstimatorCV stuff, which hopefully we will deprecate by instead having a decent GridSearch + warm start mechanism.
That's not that many places, IMHO.
slep011/proposal.rst
to store a private seed that was generated the first time `fit` is called.

This is a typical example of many other similar bugs that we need to
monkey-patch with potentially overly complex logic.
why monkey-patch?
monkey patch isn't the right term. I guess what I meant is that the patches are not obvious, and make the code harder to understand and maintain.
I'm personally undecided on this one.
What I completely agree with, is that the documentation is not clear, and the behavior is odd in some cases. I'm not sure if we should change the behavior or document it better.
Cloning may return a different estimator
This to me is one of the most problematic issues.
I don't agree that this is the case.
I think that it is about defining what same / not the same is.
I would argue that the definition should be statistical, and not exact equality.
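As I understand it, one way this issue can manifest is the following sketch, assuming current semantics when a RandomState instance is passed:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

rng = np.random.RandomState(0)
original = RandomForestClassifier(n_estimators=10, random_state=rng).fit(X, y)

# clone() copies the random_state parameter in its *current* state, which the
# original's fit has already consumed, so fitting the clone draws different
# random numbers than the original's fit did.
cloned = clone(original).fit(X, y)
# original.feature_importances_ and cloned.feature_importances_ generally
# differ, even though the clone is meant to be an "identical" unfitted copy.
```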
<https://github.com/scikit-learn/scikit-learn/issues/15611>`_

CV-Splitters statefulness
~~~~~~~~~~~~~~~~~~~~~~~~~
I think of this one as more of a documentation issue. As a user, I'd like cv.split to return different values by default.
Do you also expect the splitter to be stateless when you pass an int but stateful when you pass an instance or None?
As a user, I'd like cv.split to return different values by default
Me too, but only across executions, not across calls to split. The current proposal will still yield different splits across executions.
Right now users expect the split to return different results per call, if random_state is not an int, and I think that's a reasonable expectation/behavior. I don't see why we need to change that. The alternative behavior is somewhat less intuitive to me.
I don't see why we need to change that
I'm not saying we need to, but the reason we might want to consider changing this is because this is bug-prone and inconsistent, as (hopefully) illustrated in the SLEP
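For concreteness, a small demonstration of the splitter statefulness being debated here, assuming current `KFold` semantics:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# With an int seed, split() re-creates the RNG on every call: folds are identical.
cv_int = KFold(n_splits=2, shuffle=True, random_state=0)
folds_a = [train.tolist() for train, _ in cv_int.split(X)]
folds_b = [train.tolist() for train, _ in cv_int.split(X)]
assert folds_a == folds_b

# With a RandomState instance, split() keeps consuming the same RNG:
# two consecutive calls (very likely) yield different folds.
rng = np.random.RandomState(0)
cv_rng = KFold(n_splits=2, shuffle=True, random_state=rng)
folds_c = [train.tolist() for train, _ in cv_rng.split(X)]
folds_d = [train.tolist() for train, _ in cv_rng.split(X)]
# folds_c != folds_d in general: the splitter is stateful across calls.
```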
entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_.

Potential bugs in custom parameter searches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
as we've discussed on the SuccessiveHalving PR, I think that design can be improved. I wouldn't make that a reason to change the random_state semantics.
Indeed, changing the whole library is something that needs to be taken seriously.
I agree this issue alone isn't enough, but there are many more arguments in the SLEP. That's just yet another reason.
I think it's important to show how the current design affects scikit-learn (and third party libraries) in ways that are prone to subtle yet severe bugs, and in parts where we might not expect. This section is about that.
BTW @GaelVaroquaux , you wrote in #24 (comment) that this issue with SH is the origin of the SLEP, but that's not true! The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.
- Backward compatibility is preserved between scikit-learn versions. Let A
  be a version with the current behaviour (say 0.22) and let B be a version
  where the new behaviour is implemented. The models and the splits obtained
  will be the same in A and in B. That property may not be respected with
  other solutions, see below.
this will make fit and split idempotent, won't it? In that sense, it's not backward compatible. One major issue is that (I think) many users do expect split to return different splits at each run.
Sure, we can't preserve full backward compatibility since we're changing the behavior. What I mean is that the change is backward compatible for the first time split or fit is called (not for the subsequent calls, as you noticed). I'll clarify
Rather than make scattered comments on the PR, let me try to summarize thoughts that arose from a discussion with @amueller, @adrinjalali and @agramfort at NeurIPS. I think that it would be great to integrate these notes in the SLEP, partly because the goal of the SLEP is to document the reasons for the behavior.

Problem 1. Origin of the SLEP: SuccessiveHalving, but also other suboptimal behaviors such as warm restart.

Problem 2. Other problem mentioned: stateful and non-deterministic behavior are annoying (because for advanced usage, stateless and deterministic are usually better).

Two aspects of the considered modifications should be separated:

Question 1. Default: should the behavior be random or deterministic by default?

Question 2. Passing RNGs: should we keep allowing RandomState instances to be passed around?

Considerations for defining the API

We have two populations we are targeting: naive users and advanced users. Ideally, things should be easy/natural for naive users, but leave room for advanced patterns.

For question 1: we need to consider what the "naive" user expectation is, where naive is defined as not expert in scikit-learn. In other terms: to teach (e.g. to teach random forests), what is the best default? Things called "Random" are often expected to be random.

For question 2: an advanced usage that is really important is to enable a complete chain of operations with a large entropy, yet keep it reproducible. Seeding the global RNG is not a good solution, because in parallel computing it breaks down due to out-of-order execution. To take a specific example, one might want to apply cross_val_score on an estimator with randomness such as a random forest, and gauge in cross_val_score multiple sources of variance in the score, including that due to this randomness. I do not see a way of answering this need in a general setting without passing a random number generator around. Passing a random generator around is actually a common pattern in numerical computing.

Redefining the contract?

Problem 1 is a serious one: it arises from statefulness and the lack of determinism, which makes some computing patterns unreliable. Taking a step back, to address it we might need a mechanism to enforce determinism in an estimator in the context of these computing patterns.

Overriding random_state

The simplest approach would be that in such a context, we could modify the estimator's random_state:

```python
import warnings

import numpy as np


def freeze_random_state(estimator, msg='here'):
    if not hasattr(estimator, 'random_state'):
        return estimator
    random_state = estimator.random_state
    if not isinstance(random_state, np.random.RandomState):
        return estimator
    warnings.warn('random_state parameter overridden. Indeed, estimator %s has an '
                  'uncontrolled random state leading to non deterministic behavior, '
                  'which leads to inconsistent behavior %s.' % (estimator, msg))
    # Replace the RandomState instance with a fixed integer seed drawn from it.
    estimator.random_state = estimator.random_state.randint(2**32)
    return estimator
```

We would call this code (or similar), where appropriate, i.e. only where

The drawback is that it leads to warnings and special cases.

Imposing determinism

A more invasive change (hence one that requires more thought) is along the lines of what is proposed in the SLEP: at init (or at first fit), run check_random_state, and if it is a RandomState instance, turn it into a random int.

A drawback is that it renders our API more complicated (things happen at init). As a result, libraries that implement our API may not do this, which would lead to inconsistent behavior. Also, it may violate users' expectations, as multiple fits of the same random forest will return the same thing, though not multiple instantiations / clones.
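If we went that route, usage could look roughly like this (reusing the `freeze_random_state` sketch above; the estimator and message are just examples):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

# Freezing before cloning makes every clone deterministic: they all carry
# the same integer seed instead of sharing a mutable RandomState instance.
rf = RandomForestClassifier(random_state=np.random.RandomState(0))
rf = freeze_random_state(rf, msg='in a successive-halving style search')  # warns
candidates = [clone(rf) for _ in range(3)]
```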
I just realized that enforcing determinism at first fit has a strong drawback: it means that the behavior of fit is strongly dependent on the past of the instance.
That's the main reason it's not the proposed solution, and that's noted in the SLEP already.
This precisely applies to the current design and this is exactly what I want to fix.
Thanks for this work! I think we should add a section with considerations about the compatibility of the proposed solution with future support for the new RNG API in numpy (NEP 19 and scikit-learn/scikit-learn#16988). I think storing only the random seed in estimators as proposed here would potentially allow switching the RNG backend with some config option. Of course, the goal of this SLEP is orthogonal, but I think any in-depth refactoring of the random state should take into account that in the long term focusing too much on
cc @grisaitis
@GaelVaroquaux , thank you for your notes in #24 (comment). I have read them carefully, but unfortunately, I am not quite sure how I can leverage these notes to make this SLEP move forward, or how to integrate them in the SLEP. There are a few things in these notes that I don't completely understand, but also some other things that I don't fully agree with.
The SH issue isn't the origin of the SLEP, it's only a data point. The origin is scikit-learn/scikit-learn#14042, and the numerous bugs we've had to fix so far. I want to make that clear because otherwise you might misinterpret my intentions here, which go way beyond the SH issue.

I also don't share the framing of the problem that was proposed, with these two questions: 1. whether randomness should be the default, and 2. whether we should allow passing RNGs. While these two questions are definitely worth considering, I believe they are mostly tangential to this SLEP and to the points that I'm trying to make throughout this doc.

To me, and I hope this is properly illustrated in this SLEP, the main relevant question is: do we want our estimators (and splitters) to be stateful across calls to fit() (and split())? If I interpret your comments properly (#24 (comment), and #24 (comment)) you and I seem to agree that the answer should be no ;)

Thanks!
This SLEP aims at fixing the issues related to passing RandomState instances or None to estimators and CV splitters.
The proposed change is to store an actual random state as returned by numpy's get_state() method in the self.random_state attribute. The net result is that calling fit or split on the same instance will always use the same RNG. The changes are mostly backward compatible.

This is a follow-up to scikit-learn/scikit-learn#14042 and scikit-learn/scikit-learn#15177.
@scikit-learn/core-devs , this probably looks more like a manifesto than a SLEP at the moment, so please share your comments ;)