Skip to content

[MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type #7825

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Dec 3, 2016

Conversation

kathyxchen
Copy link
Contributor

Reference Issue

Finishes work in #5642

What does this implement/fix? Explain your changes.

This addresses the comments made on @pradyu1993's commit

Any other comments?

May still warrant a PEP8 linting. (Having trouble with my usual development machine so once I get back on that I'll run the linter & all). Let me know if there are any suggestions on tests or improvements to how the check is made. Thanks!

@jnothman
Copy link
Member

jnothman commented Nov 5, 2016

@kathyxchen
Copy link
Contributor Author

kathyxchen commented Nov 6, 2016

@jnothman done!
Also, the sample_weight parameter is used in many places throughout the linear_model/ridge.py file. Should a modification to accept pandas.Series be applied to all of these, or is that outside the scope of this issue? @amueller

@jnothman
Copy link
Member

jnothman commented Nov 7, 2016

Also, the sample_weight parameter is used in many places throughout the linear_model/ridge.py file.

Do you mean in functions like sklearn.linear_model.ridge_regression? Or do you mean checking that it works for all solvers? At the moment we only have common tests for estimator classes. I think that checking support in functions is a good idea, but perhaps in a separate PR.

estimator = Estimator()
if has_fit_parameter(estimator, "sample_weight"):
try:
# default solver liblinear doesn't support this parameter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solver='liblinear' does support sample_weight . Where did you get this idea from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was part of the code taken from the previous PR. I should have double checked it first!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah and it may have been true... over a year ago. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, I think ;)

X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
y = pd.Series([1, 1, 1, 2, 2, 2])
weights = pd.Series([1] * 6)
for estimator_name, Estimator in all_estimators():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be implemented like the other checks in sklearn.utils.estimator_checks, not as a separate test.

@kathyxchen
Copy link
Contributor Author

@jnothman @amueller: I keep running into failed test cases that I don't encounter when I run nosetests utils/tests/test_estimator_checks.py on my local machine. Do you have an idea as to why this may be the case, or any tips for how I ought to set up my environment to avoid this issue?

@jnothman
Copy link
Member

I generally find Travis logs much easier to check than Appveyor. At a glance, the problems on Travis may be related to Python 2, though I see the same failure in Python 3 AppVeyor.

@jnothman
Copy link
Member

Please fix the PEP8 issues Travis is complaining about

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also rename this to MRG if you think the work is complete enough for a full review.

weights = pd.Series([1] * 6)
try:
estimator.fit(X, y, sample_weight=weights)
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should (almost) never use except:. Use except Exception:

@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator):


@ignore_warnings(category=(DeprecationWarning, UserWarning))
def check_pandas_series(name, Estimator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please mention sample_weight in the name

accept_sparse=("csr", "csc"),
multi_output=True,
y_numeric=True)
# Loosely based on _solve_cholesky_kernel (called in KernelRidge.fit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is far too confusing. Just raise an exception directly if sample_weight is a pandas Series.

@@ -72,6 +91,15 @@ def test_check_estimator():
# check that fit does input validation
msg = "TypeError not raised by fit"
assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier)
# check that sample_weights in fit accepts pandas.Series type
try:
from pandas import Series
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I check if I am able to import pandas before the new estimator check (skip if not). Is there a better way to handle this?
pep8 complained about an unused import in my previous commit because I would just do a try-catch for an ImportError but would do nothing with the import itself. Now, I am using it in msg to avoid that lint error. Should I be handling this in a different way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising a skip if there is no pandas requires splitting test_check_estimator into more functions. Which may be a good idea, but isn't really pertinent to this issue.

There should be a way to disable the linter catching a line... Look it up?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right...

@kathyxchen kathyxchen changed the title [WIP] Fix estimators to work if sample_weight parameter is pandas Series type [MRG] Fix estimators to work if sample_weight parameter is pandas Series type Nov 16, 2016
Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good apart from minor comments.

@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None):
"""
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
multi_output=True, y_numeric=True)
if sample_weight is not None and not isinstance(sample_weight, float):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why could sample_weight be a float?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was based on the docstring (line 951) which said that sample_weight could accept a float

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we support this but fair enough.

@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator):


@ignore_warnings(category=(DeprecationWarning, UserWarning))
def check_sample_weights_pandas_series(name, Estimator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why UserWarning?

"'sample_weight' parameter is type "
"{1}".format(name, pd.Series))
except ImportError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should raise a SkipTest("Pandas not installed, not testing pandas series as class weight")? That would be more explicit.

@@ -72,6 +91,15 @@ def test_check_estimator():
# check that fit does input validation
msg = "TypeError not raised by fit"
assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier)
# check that sample_weights in fit accepts pandas.Series type
try:
from pandas import Series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.

@kathyxchen
Copy link
Contributor Author

@jnothman @amueller: Adding the SkipTest causes a failure in the doctest run for contributing.rst (here). Should I remove the SkipTest again (in favor of a 'pass') or change the contributing.rst example?

@jnothman
Copy link
Member

Right, so this is due to a discrepancy between how we suggest others run our common tests (with check_estimator) and how we do internally. It would be good to modify check_estimator to catch and warn upon SkipTest, whether in this PR or separately.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

except SkipTest as message:
# the only SkipTest thrown currently results from not
# being able to import pandas.
warnings.warn(message, ImportWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain whether ImportWarning is correct. Should we have a SkipTestWarning or some such?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add a SkipTestWarning to the exceptions.py file that inherits from the base Warning class? (Would it be in the scope of this PR?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImportWarning is ignored by default. I'd either use UserWarning or add a SkipTestWarning. Feel free to add this to the PR, or change to UserWarning. otherwise LGTM!

@jnothman jnothman changed the title [MRG] Fix estimators to work if sample_weight parameter is pandas Series type [MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type Nov 23, 2016
Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think ImportWarning is good as it's ignored by default.

@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None):
"""
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
multi_output=True, y_numeric=True)
if sample_weight is not None and not isinstance(sample_weight, float):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we support this but fair enough.

@@ -379,6 +385,28 @@ def check_estimator_sparse_data(name, Estimator):
raise


@ignore_warnings(category=(DeprecationWarning))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the parentheses around DeprecationWarning, they don't do anything.

except SkipTest as message:
# the only SkipTest thrown currently results from not
# being able to import pandas.
warnings.warn(message, ImportWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImportWarning is ignored by default. I'd either use UserWarning or add a SkipTestWarning. Feel free to add this to the PR, or change to UserWarning. otherwise LGTM!

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please add a changelog entry to doc/whats_new.rst, under your choice of bug fix or enhancement (I can't decide!).

@kathyxchen kathyxchen force-pushed the estimators-accept-series branch from 8a54e04 to 5c7b166 Compare December 3, 2016 19:25
@amueller
Copy link
Member

amueller commented Dec 3, 2016

thanks :)

@amueller amueller merged commit 04b67e2 into scikit-learn:master Dec 3, 2016
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants