[MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type #7825

kathyxchen · 2016-11-04T20:53:13Z

Reference Issue

Finishes work in #5642

What does this implement/fix? Explain your changes.

This addresses the comments made on @pradyu1993's commit

Any other comments?

This may still need PEP8 linting. (I'm having trouble with my usual development machine, so I'll run the linter once I'm back on it.) Let me know if there are any suggestions on tests or improvements to how the check is made. Thanks!

jnothman · 2016-11-05T11:38:40Z

Travis is indeed reporting linter failures: https://travis-ci.org/scikit-learn/scikit-learn/jobs/173378047

kathyxchen · 2016-11-06T17:22:51Z

@jnothman done!
Also, the sample_weight parameter is used in many places throughout the linear_model/ridge.py file. Should a modification to accept pandas.Series be applied to all of these, or is that outside the scope of this issue? @amueller

jnothman · 2016-11-07T10:13:58Z

Also, the sample_weight parameter is used in many places throughout the linear_model/ridge.py file.

Do you mean in functions like sklearn.linear_model.ridge_regression? Or do you mean checking that it works for all solvers? At the moment we only have common tests for estimator classes. I think that checking support in functions is a good idea, but perhaps in a separate PR.

jnothman · 2016-11-07T10:11:42Z

sklearn/utils/tests/test_estimator_checks.py

+        estimator = Estimator()
+        if has_fit_parameter(estimator, "sample_weight"):
+            try:
+                # default solver liblinear doesn't support this parameter


solver='liblinear' does support sample_weight. Where did you get this idea from?

This was part of the code taken from the previous PR. I should have double checked it first!

Ah and it may have been true... over a year ago. :)

It was, I think ;)

jnothman · 2016-11-07T10:12:09Z

sklearn/utils/tests/test_estimator_checks.py

+    X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
+    y = pd.Series([1, 1, 1, 2, 2, 2])
+    weights = pd.Series([1] * 6)
+    for estimator_name, Estimator in all_estimators():


This should be implemented like the other checks in sklearn.utils.estimator_checks, not as a separate test.

kathyxchen · 2016-11-14T13:09:36Z

@jnothman @amueller: I keep running into failed test cases that I don't encounter when I run nosetests utils/tests/test_estimator_checks.py on my local machine. Do you have an idea as to why this may be the case, or any tips for how I ought to set up my environment to avoid this issue?

jnothman · 2016-11-14T13:47:18Z

I generally find Travis logs much easier to check than AppVeyor. At a glance, the problems on Travis may be related to Python 2, though I see the same failure in the Python 3 AppVeyor build.

jnothman · 2016-11-16T10:18:50Z

Please fix the PEP8 issues Travis is complaining about.

jnothman

Could you please also rename this to MRG if you think the work is complete enough for a full review.

jnothman · 2016-11-16T13:17:04Z

sklearn/utils/estimator_checks.py

+            weights = pd.Series([1] * 6)
+            try:
+                estimator.fit(X, y, sample_weight=weights)
+            except:


You should (almost) never use a bare except:. Use except Exception: instead.
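To illustrate the point about bare except clauses, here is a minimal, standalone sketch (not code from this PR; the helper names run_bare, run_narrow, fails and interrupts are made up for the example). A bare except: also catches KeyboardInterrupt and SystemExit, while except Exception: lets them propagate.

```python
def run_bare(action):
    """Swallows everything, including KeyboardInterrupt and SystemExit."""
    try:
        action()
    except:  # noqa: E722
        # Deliberately bare, to show the problem.
        return "swallowed"
    return "ok"


def run_narrow(action):
    """Handles ordinary errors only; interrupt/exit signals propagate."""
    try:
        action()
    except Exception:
        return "handled"
    return "ok"


def fails():
    raise ValueError("ordinary failure")


def interrupts():
    raise KeyboardInterrupt


print(run_bare(fails))       # ordinary error is caught
print(run_narrow(fails))     # ordinary error is caught here too
print(run_bare(interrupts))  # the interrupt is silently eaten

try:
    run_narrow(interrupts)
except KeyboardInterrupt:
    print("KeyboardInterrupt propagated past 'except Exception'")
```

In a test helper like the one under review, the narrower clause means a Ctrl-C during a long test run still stops the run.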

jnothman · 2016-11-16T13:17:24Z

sklearn/utils/estimator_checks.py

@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator):


 @ignore_warnings(category=(DeprecationWarning, UserWarning))
+def check_pandas_series(name, Estimator):


Please mention sample_weight in the name.

jnothman · 2016-11-16T13:21:21Z

sklearn/utils/tests/test_estimator_checks.py

+                         accept_sparse=("csr", "csc"),
+                         multi_output=True,
+                         y_numeric=True)
+        # Loosely based on _solve_cholesky_kernel (called in KernelRidge.fit)


This is far too confusing. Just raise an exception directly if sample_weight is a pandas Series.

kathyxchen · 2016-11-16T16:20:36Z

sklearn/utils/tests/test_estimator_checks.py

@@ -72,6 +91,15 @@ def test_check_estimator():
    # check that fit does input validation
    msg = "TypeError not raised by fit"
    assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier)
+    # check that sample_weights in fit accepts pandas.Series type
+    try:
+        from pandas import Series


Here, I check whether pandas can be imported before running the new estimator check (and skip it if not). Is there a better way to handle this?
pep8 complained about an unused import in my previous commit, because I wrapped the import in a try/except for ImportError but did nothing with the imported name. Now I use it in msg to avoid that lint error. Should I be handling this in a different way?

This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.

Raising a skip if there is no pandas requires splitting test_check_estimator into more functions. Which may be a good idea, but isn't really pertinent to this issue.

There should be a way to stop the linter from flagging a specific line... Look it up?
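One common way to do this, assuming the linter is flake8 (or another tool that honours the same marker), is a trailing noqa comment on the import line. The sketch below is illustrative only; HAS_PANDAS is a made-up name and is not taken from this PR.

```python
# The import exists only to test whether pandas is available, so the name is
# intentionally unused. The "noqa: F401" marker asks flake8 not to report the
# unused import on that one line.
try:
    import pandas  # noqa: F401
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False

print("pandas available:", HAS_PANDAS)
```

Restricting the marker to a single error code keeps other problems on the same line visible to the linter.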

amueller

Looks good apart from minor comments.

amueller · 2016-11-16T16:54:05Z

sklearn/linear_model/ridge.py

@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None):
        """
        X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
                         multi_output=True, y_numeric=True)
+        if sample_weight is not None and not isinstance(sample_weight, float):


Why could sample_weight be a float?

This was based on the docstring (line 951) which said that sample_weight could accept a float

I'm not sure why we support this but fair enough.

amueller · 2016-11-16T16:54:35Z

sklearn/utils/estimator_checks.py

@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator):


 @ignore_warnings(category=(DeprecationWarning, UserWarning))
+def check_sample_weights_pandas_series(name, Estimator):


Why UserWarning?

amueller · 2016-11-16T16:56:49Z

sklearn/utils/estimator_checks.py

+                                 "'sample_weight' parameter is type "
+                                 "{1}".format(name, pd.Series))
+        except ImportError:
+            pass


Maybe we should raise a SkipTest("Pandas not installed, not testing pandas series as class weight")? That would be more explicit.
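A rough sketch of what the explicit skip could look like, assuming unittest.SkipTest as the skip exception. WeightedMean is a hypothetical stand-in estimator written for this example, and the body of the check is an approximation of the idea discussed here, not the code that was merged.

```python
from unittest import SkipTest


class WeightedMean:
    """Hypothetical stand-in estimator whose fit accepts sample_weight."""

    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            weights = [1.0] * len(y)
        else:
            weights = list(sample_weight)
        total = sum(w * v for w, v in zip(weights, y))
        self.mean_ = total / sum(weights)
        return self


def check_sample_weights_pandas_series(name, Estimator):
    """Fit with a pandas.Series as sample_weight; skip if pandas is absent."""
    try:
        import pandas as pd
    except ImportError:
        raise SkipTest("pandas is not installed: not testing pandas.Series "
                       "as sample_weight")
    X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
    y = pd.Series([1, 1, 1, 2, 2, 2])
    weights = pd.Series([1] * 6)
    try:
        Estimator().fit(X, y, sample_weight=weights)
    except ValueError:
        raise ValueError("Estimator {0} raises error if 'sample_weight' "
                         "parameter is type {1}".format(name, pd.Series))


try:
    check_sample_weights_pandas_series("WeightedMean", WeightedMean)
    print("check passed")
except SkipTest as exc:
    print("skipped:", exc)
```

Raising SkipTest instead of passing silently lets the test runner report that the check did not run, which is the explicitness asked for above.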

amueller · 2016-11-16T16:58:56Z

sklearn/utils/tests/test_estimator_checks.py

@@ -72,6 +91,15 @@ def test_check_estimator():
    # check that fit does input validation
    msg = "TypeError not raised by fit"
    assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier)
+    # check that sample_weights in fit accepts pandas.Series type
+    try:
+        from pandas import Series


This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.

kathyxchen · 2016-11-22T00:57:00Z

@jnothman @amueller: Adding the SkipTest causes a failure in the doctest run for contributing.rst. Should I remove the SkipTest again (in favor of a pass) or change the contributing.rst example?

jnothman · 2016-11-22T02:13:58Z

Right, so this is due to a discrepancy between how we suggest others run our common tests (with check_estimator) and how we run them internally. It would be good to modify check_estimator to catch SkipTest and warn instead, whether in this PR or separately.
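The catch-and-warn idea could be sketched roughly as follows. This is a simplified illustration, not scikit-learn's check_estimator: run_checks, check_ok and check_needs_pandas are hypothetical names, and the SkipTestWarning category defined here (as a UserWarning subclass) is an assumption made for the example.

```python
import warnings
from unittest import SkipTest


class SkipTestWarning(UserWarning):
    """Hypothetical warning category for checks that were skipped."""


def run_checks(checks):
    """Run each check; turn SkipTest into a warning instead of an error."""
    completed = []
    for check in checks:
        try:
            check()
        except SkipTest as exc:
            warnings.warn(str(exc), SkipTestWarning)
        else:
            completed.append(check.__name__)
    return completed


def check_ok():
    pass


def check_needs_pandas():
    raise SkipTest("pandas is not installed")


# Record warnings so the example can show what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    done = run_checks([check_ok, check_needs_pandas])

print(done)
print([str(w.message) for w in caught])
```

With this shape, a user calling the runner directly sees a warning for the skipped check rather than an unexpected exception.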

jnothman

LGTM

jnothman · 2016-11-23T12:32:47Z

sklearn/utils/estimator_checks.py

+        except SkipTest as message:
+            # the only SkipTest thrown currently results from not
+            # being able to import pandas.
+            warnings.warn(message, ImportWarning)


I'm not certain whether ImportWarning is correct. Should we have a SkipTestWarning or some such?

Should I add a SkipTestWarning to the exceptions.py file that inherits from the base Warning class? (Would it be in the scope of this PR?)

ImportWarning is ignored by default. I'd either use UserWarning or add a SkipTestWarning. Feel free to add this to the PR, or change to UserWarning. Otherwise, LGTM!

amueller

I don't think ImportWarning is good as it's ignored by default.

amueller · 2016-12-01T21:15:36Z

sklearn/linear_model/ridge.py

@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None):
        """
        X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
                         multi_output=True, y_numeric=True)
+        if sample_weight is not None and not isinstance(sample_weight, float):


I'm not sure why we support this but fair enough.

amueller · 2016-12-01T21:16:22Z

sklearn/utils/estimator_checks.py

@@ -379,6 +385,28 @@ def check_estimator_sparse_data(name, Estimator):
            raise


+@ignore_warnings(category=(DeprecationWarning))


You don't need the parentheses around DeprecationWarning; they don't do anything.
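A quick standalone illustration of why the parentheses are redundant here: in Python it is the comma, not the parentheses, that creates a tuple. The variable names are made up for the example.

```python
# Parentheses alone are a no-op: this is just the class itself.
single = (DeprecationWarning)

# A trailing comma is what makes a one-element tuple.
as_tuple = (DeprecationWarning,)

# With two categories, the commas make a tuple, so parentheses are natural.
pair = (DeprecationWarning, UserWarning)

print(single is DeprecationWarning)
print(type(as_tuple).__name__, len(as_tuple))
print(type(pair).__name__, len(pair))
```

So category=(DeprecationWarning) and category=DeprecationWarning pass exactly the same object.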

amueller · 2016-12-01T21:19:00Z

sklearn/utils/estimator_checks.py

+        except SkipTest as message:
+            # the only SkipTest thrown currently results from not
+            # being able to import pandas.
+            warnings.warn(message, ImportWarning)


ImportWarning is ignored by default. I'd either use UserWarning or add a SkipTestWarning. Feel free to add this to the PR, or change to UserWarning. Otherwise, LGTM!

jnothman

LGTM. Please add a changelog entry to doc/whats_new.rst, under your choice of bug fix or enhancement (I can't decide!).

amueller · 2016-12-03T20:56:19Z

thanks :)

kathyxchen mentioned this pull request Nov 4, 2016

[MRG] Fix estimators to work if sample_weight parameter is pandas Series type #5642

Closed

jnothman requested changes Nov 7, 2016

jnothman requested changes Nov 16, 2016

kathyxchen commented Nov 16, 2016

kathyxchen changed the title from "[WIP] Fix estimators to work if sample_weight parameter is pandas Series type" to "[MRG] Fix estimators to work if sample_weight parameter is pandas Series type" Nov 16, 2016

amueller reviewed Nov 16, 2016

jnothman approved these changes Nov 23, 2016

jnothman changed the title from "[MRG] Fix estimators to work if sample_weight parameter is pandas Series type" to "[MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type" Nov 23, 2016

amueller requested changes Dec 1, 2016

jnothman approved these changes Dec 3, 2016

kchen17 added 12 commits December 3, 2016 14:14

addressed comments in the PR about parameters in check_array

dd71fdf

update the test case for the evaluation of estimators with pandas series

3afa007

bug fix, need to check for *not* None explicitly

b0155ae

updated with isinstance check if the documentation says there is acceptance of floats

7cdd58a

ran pep8 linter on modified files

d984250

moving the test case to estimators_check

41b234a

add a predict function into the testing pandas.Series class

eba40d1

avoid running anything beyond the newly added meta checks

807f96b

check if pandas is installed before running the specific test

d721b95

changed the order of the try-catch to check for sample_weight param beforehand

9c4f5ed

pass on import error rather than printing something to std out

cbdc1ad

improve test case naming and pd.Series check in the bad estimator class

8600be3

kchen17 added 6 commits December 3, 2016 14:17

address a pep8 linter error with unused import

ea9b5dc

pep8 warning disabled for potential unused import

b012f3c

throw a warning when SkipTest is raised

180d629

add a SkipTestWarning

bbac679

updated the whats_new.rst with this issue

f6128ea

rebase and fix a spacing issue

5c7b166

kathyxchen force-pushed the estimators-accept-series branch from 8a54e04 to 5c7b166 Compare December 3, 2016 19:25

amueller merged commit 04b67e2 into scikit-learn:master Dec 3, 2016

This was referenced Dec 20, 2016

Common test for sample_weight as list #8064

Closed

Fix for passing pandas series in sample_weights in RidgeCV leading to an error(#5606) #6307

Closed

Using a pandas Series for sample_weights leads to an error: #5606

Closed
