[MRG+1] Initialize ARPACK eigsh #5012
Conversation
The appveyor failure is unrelated and now tracked at #5013. If you wish, you can trigger a new build by using:
About the fix itself, I agree that using random positive residuals seems like a bad idea. I checked the source code of ARPACK and it seems that they do it by uniform sampling in the range [-1, 1]. I don't know if it's supposed to be better than the standard normal distribution you are using, but I think we should try to replicate the native ARPACK behavior as much as possible. Also, do you think you could contribute a non-regression test for this fix? Finally, please add a new entry for this bug fix in the changelog.
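For illustration, a minimal sketch (names and shapes are assumptions, not this PR's exact code) of replicating ARPACK's native initialization when calling scipy's eigsh explicitly:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

random_state = np.random.RandomState(0)
M = random_state.rand(50, 50)
M = np.dot(M.T, M)  # symmetric positive definite test matrix

# Uniform sampling in [-1, 1], as ARPACK does internally, instead of
# rand(), whose entries are all positive.
v0 = random_state.uniform(-1, 1, M.shape[0])
w, v = eigsh(M, k=3, sigma=0.0, v0=v0)
```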
Also, we have other occurrences of improperly initialized calls to `eigsh` in the code base. This gives us the opportunity to fix them all consistently.
About the non-regression test, I was thinking of a test that would highlight the problem in the locally linear embedding model: some new test that would fail on the current master but would be fixed by the changes in this PR (see the sketch below).
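One possible shape for such a test — a sketch under assumed names, closer to the eigsh-level check this PR ends up adding than to an LLE-level test:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def test_arpack_eigsh_initialization():
    # Non-regression test: a one-sided (all positive) v0 used to make
    # eigsh return spurious negative eigenvalues for an s.p.d. matrix.
    random_state = np.random.RandomState(42)
    A = random_state.rand(50, 50)
    A = np.dot(A.T, A)  # s.p.d. matrix
    v0 = random_state.uniform(-1, 1, A.shape[0])
    w, _ = eigsh(A, k=5, sigma=0.0, v0=v0)
    assert w[0] >= 0  # smallest eigenvalue must be non-negative
```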
Apparently those changes cause the following test to fail:

I don't see how this is related to the changes of this PR, though.
I see what you mean by the non-regression test. I will try to come up with such a test. About the failed build: I also could not trace it to a use of ARPACK or anything related to my changes...
Weird, now the NMF test failure is appearing both on appveyor (Windows) and travis (Linux). I agree, I don't see how this test could possibly be related to changes to ARPACK or FORTRAN RNG seeding in general.
Well, it says in the failed test
Probably, but I don't see where by reading the source of the NMF model on sparse input data. We want our estimators to be explicitly seedable for reproducibility; this is not the case for this estimator. It would be great to understand how the changes in this PR have impacted the results of the NMF model's tests.
I have just merged #5020, which better seeds this estimator. You might want to rebase this PR on top of the current master.
A = random_state.rand(50, 50)
A = np.dot(A.T, A)  # create s.p.d. matrix
A = graph_laplacian(A)
Why call `graph_laplacian` here? This is not guaranteed to be s.p.d. anymore, right? I get the following failure:
======================================================================
FAIL: Non-regression test that shows null-space computation is better with
----------------------------------------------------------------------
Traceback (most recent call last):
File "/volatile/ogrisel/envs/py35/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/tests/test_utils.py", line 152, in test_arpack_eigsh_initialization
assert_greater_equal(w[0], 0)
File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/testing.py", line 126, in assert_greater_equal
assert a >= b, message
nose.proxy.AssertionError: -430.0444945505198 is not greater than or equal to 0
>> assert -430.0444945505198 >= 0, '-430.0444945505198 is not greater than or equal to 0'
The test passes when using `A = np.dot(A.T, A)` directly.
`graph_laplacian` makes the eigenvalue problem more difficult. For an undirected graph, the matrix is symmetric positive semi-definite and the first eigenvalue is 0. The issue with the initialization only occurs for difficult cases.
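A small check of that claim — this sketch uses scipy's csgraph Laplacian as a stand-in for the `graph_laplacian` helper discussed here:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian

rng = np.random.RandomState(42)
W = rng.rand(20, 20)
W = (W + W.T) / 2          # symmetric weights -> undirected graph
L = laplacian(W)           # graph Laplacian of the weight matrix

w = np.linalg.eigvalsh(L)  # exact dense spectrum for reference
print(w[0])                # ~0 up to floating point; the rest are >= 0
```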
ok.
Actually, the smallest eigenvalue is negative most of the time on my box:
>>> for i in range(10):
... print(eigsh(A, k=k, sigma=0.0, v0=np.random.uniform(-1, 1, A.shape[0]))[0][0])
...
-1.13686837722e-15
-146.940587556
-154.837569586
-92.5527859236
-82.614922039
-14.0299334028
-108.269166341
-35.7750459558
-27.0384493697
-42.4412458528
This is strange. For me, the eigenvalue is mostly correct (~90% of runs) with `uniform`, but it mostly fails (~99%) with `rand`.

Still, the algorithm applied to the Laplacian does not seem stable enough to use in the test.
I haven't looked into the issue (sorry, travelling), but if this is a problem with eigsh we should implement a wrapper with a fix and talk to scipy about fixing it upstream.
The failing doctest on travis is caused by a missing

Maybe we could add a
@yanlend can you please squash all your commits and rebase on top of the current master?

@ogrisel done!
Weird, one of the Windows tests has failed with:

I wonder if it's random or not. Will try to relaunch this on appveyor. As the seed is fixed, we should get the same failure. BTW, please fix the failing doctest in
You can ignore the OneClassSVM segfault, which is unrelated. However, the test on non-negative eigenvalues is seemingly randomly failing under Windows with 32-bit Python. Maybe the ARPACK solver is just not stable enough on 32-bit Python, despite the deterministic seeding, for some reason I do not quite understand...
@@ -126,6 +130,28 @@ def test_pinvh_simple_complex():
    assert_almost_equal(np.dot(a, a_pinv), np.eye(3))


def test_arpack_eigsh_initialization():
    '''
We prefer to start tests with comments, not docstrings, so we can see the name of the test that is failing.
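A hypothetical illustration of that convention — with a leading comment, nose reports the test's name on failure instead of the docstring's first line:

```python
def test_arpack_eigsh_initialization():
    # Non-regression test for seeding eigsh's starting vector.
    # (A comment, not a docstring, so a failure reports the test name.)
    ...
```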
I'm not sure I understand the issue then. I thought scipy was using
If no initial `v0` is passed, ARPACK generates one itself. The main issue is that in some places of sklearn, `v0` is initialized with only positive values.
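To illustrate the difference described above (the array size here is made up):

```python
import numpy as np

rng = np.random.RandomState(0)
bad_v0 = rng.rand(100)             # entries in [0, 1): one-sided residual
good_v0 = rng.uniform(-1, 1, 100)  # mixed signs, like ARPACK's own default

print((bad_v0 >= 0).all())   # True: every component is non-negative
print((good_v0 < 0).any())   # True: roughly half the components are negative
```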
ok, got it now. Thanks for the explanation.

What is the status of this?

In particular, this PR needs a rebase (and optionally a squash) on top of the current master.
rt.fit(X_train, y_train)
rt_lm.fit(SelectFromModel(rt, prefit=True).transform(X_train_lr), y_train_lr)
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)
@yanlend: are these changes supposed to be in your PR? They look strange to me and somewhat unrelated.

I wonder if there are merge artifacts in the current state of the PR.
Force-pushed from 540f37d to c6c9160.
@GaelVaroquaux sorry about that, I somehow messed up the rebase. It should be fixed now.

Travis is not running on this guy. I don't understand why.

@yanlend: I am sorry, there are conflicts again. This bothers me, as I would like to have the tests run before merging. Can you rebase once again? I am sorry; this is happening because there is a sprint, so there is a lot of activity on the project.
I don't think the failed tests of PCA whitening are related to this PR. However, the regression test
The failed tests of PCA whitening don't fail on master, and they seem legit to me. For the other, I am still thinking and trying to understand the right way to test this.
With regard to the failing test (not the whitening one): how about we add 1e-7 on the diagonal of the matrix? This should make the test slightly easier, and I would hope that it passes.
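A sketch of that tweak, assuming a dense square array `A` as in the test: adding 1e-7 to the diagonal shifts every eigenvalue up by 1e-7, giving the smallest-eigenvalue assertion a little slack.

```python
import numpy as np

random_state = np.random.RandomState(42)
A = random_state.rand(50, 50)
A = np.dot(A.T, A)                # s.p.d. matrix
A.flat[::A.shape[0] + 1] += 1e-7  # add 1e-7 to the main diagonal in place
```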
`v0 = random_state.rand(M.shape[0])` leads to an initial residual vector in ARPACK which is all positive. However, this is not the absolute or squared residual, but a true difference. Thus, it is better to initialize with `v0 = random_state.uniform(-1, 1, M.shape[0])` to get an equally distributed sign. This is the way that ARPACK initializes the residuals. The effect of the previous initialization is that eigsh frequently does not converge to the correct eigenvalues, e.g. negative eigenvalues for an s.p.d. matrix, which leads to an incorrect null-space.

- initialized all occurrences of sklearn.utils.arpack.eigsh the same way it would be initialized by ARPACK
- added a regression test for the behavior of the new initialization
Comment instead of docstring to follow sklearn convention. Regression test made easier by adding a small value to the main diagonal of the s.p.d. matrix.
I did the rebase and added the suggestion of @GaelVaroquaux.
OK, I am merging. Good job!
[MRG+1] Initialize ARPACK eigsh
Do you want to squash those commits, @GaelVaroquaux?
`v0 = random_state.rand(M.shape[0])` leads to an initial residual vector in ARPACK which is all positive. However, this is not the absolute or squared residual, but a true difference. Thus, it is better to initialize with `randn` to get an equally distributed sign. The effect of the previous initialization is that eigsh frequently does not converge to the correct eigenvalues, e.g. negative eigenvalues for an s.p.d. matrix, which leads to an incorrect null-space.