-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type #7825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type #7825
Conversation
Travis is indeed reporting linter failures: https://travis-ci.org/scikit-learn/scikit-learn/jobs/173378047 |
Do you mean in functions like |
estimator = Estimator() | ||
if has_fit_parameter(estimator, "sample_weight"): | ||
try: | ||
# default solver liblinear doesn't support this parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
solver='liblinear' does support sample_weight
. Where did you get this idea from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was part of the code taken from the previous PR. I should have double checked it first!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah and it may have been true... over a year ago. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was, I think ;)
X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]]) | ||
y = pd.Series([1, 1, 1, 2, 2, 2]) | ||
weights = pd.Series([1] * 6) | ||
for estimator_name, Estimator in all_estimators(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be implemented like the other checks in sklearn.utils.estimator_checks
, not as a separate test.
@jnothman @amueller: I keep running into failed test cases that I don't encounter when I run |
I generally find Travis logs much easier to check than Appveyor. At a glance, the problems on Travis may be related to Python 2, though I see the same failure in Python 3 AppVeyor. |
Please fix the PEP8 issues Travis is complaining about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also rename this to MRG if you think the work is complete enough for a full review.
weights = pd.Series([1] * 6) | ||
try: | ||
estimator.fit(X, y, sample_weight=weights) | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should (almost) never use except:
. Use except Exception:
@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator): | |||
|
|||
|
|||
@ignore_warnings(category=(DeprecationWarning, UserWarning)) | |||
def check_pandas_series(name, Estimator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please mention sample_weight in the name
accept_sparse=("csr", "csc"), | ||
multi_output=True, | ||
y_numeric=True) | ||
# Loosely based on _solve_cholesky_kernel (called in KernelRidge.fit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is far too confusing. Just raise an exception directly if sample_weight is a pandas Series.
@@ -72,6 +91,15 @@ def test_check_estimator(): | |||
# check that fit does input validation | |||
msg = "TypeError not raised by fit" | |||
assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier) | |||
# check that sample_weights in fit accepts pandas.Series type | |||
try: | |||
from pandas import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I check if I am able to import pandas before the new estimator check (skip if not). Is there a better way to handle this?
pep8 complained about an unused import in my previous commit because I would just do a try-catch for an ImportError but would do nothing with the import itself. Now, I am using it in msg
to avoid that lint error. Should I be handling this in a different way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raising a skip if there is no pandas requires splitting test_check_estimator
into more functions. Which may be a good idea, but isn't really pertinent to this issue.
There should be a way to disable the linter catching a line... Look it up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good apart from minor comments.
@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None): | |||
""" | |||
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64, | |||
multi_output=True, y_numeric=True) | |||
if sample_weight is not None and not isinstance(sample_weight, float): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why could sample_weight
be a float?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was based on the docstring (line 951) which said that sample_weight
could accept a float
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we support this but fair enough.
@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator): | |||
|
|||
|
|||
@ignore_warnings(category=(DeprecationWarning, UserWarning)) | |||
def check_sample_weights_pandas_series(name, Estimator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why UserWarning
?
"'sample_weight' parameter is type " | ||
"{1}".format(name, pd.Series)) | ||
except ImportError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should raise a SkipTest("Pandas not installed, not testing pandas series as class weight")
? That would be more explicit.
@@ -72,6 +91,15 @@ def test_check_estimator(): | |||
# check that fit does input validation | |||
msg = "TypeError not raised by fit" | |||
assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier) | |||
# check that sample_weights in fit accepts pandas.Series type | |||
try: | |||
from pandas import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.
Right, so this is due to a discrepancy between how we suggest others run our common tests (with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
except SkipTest as message: | ||
# the only SkipTest thrown currently results from not | ||
# being able to import pandas. | ||
warnings.warn(message, ImportWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain whether ImportWarning is correct. Should we have a SkipTestWarning or some such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a SkipTestWarning to the exceptions.py file that inherits from the base Warning
class? (Would it be in the scope of this PR?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImportWarning
is ignored by default. I'd either use UserWarning
or add a SkipTestWarning
. Feel free to add this to the PR, or change to UserWarning
. otherwise LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ImportWarning is good as it's ignored by default.
@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None): | |||
""" | |||
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64, | |||
multi_output=True, y_numeric=True) | |||
if sample_weight is not None and not isinstance(sample_weight, float): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we support this but fair enough.
@@ -379,6 +385,28 @@ def check_estimator_sparse_data(name, Estimator): | |||
raise | |||
|
|||
|
|||
@ignore_warnings(category=(DeprecationWarning)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the parentheses around DeprecationWarning
, they don't do anything.
except SkipTest as message: | ||
# the only SkipTest thrown currently results from not | ||
# being able to import pandas. | ||
warnings.warn(message, ImportWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImportWarning
is ignored by default. I'd either use UserWarning
or add a SkipTestWarning
. Feel free to add this to the PR, or change to UserWarning
. otherwise LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please add a changelog entry to doc/whats_new.rst, under your choice of bug fix or enhancement (I can't decide!).
8a54e04
to
5c7b166
Compare
thanks :) |
…eries type (scikit-learn#7825) * addressed comments in the PR about parameters in check_array * update the test case for the evaluation of estimators with pandas series * bug fix, need to check for *not* None explicitly * updated with isinstance check if the documentation says there is acceptance of floats * ran pep8 linter on modified files * moving the test case to estimators_check * add a predict function into the testing pandas.Series class * avoid running anything beyond the newly added meta checks * check if pandas is installed before running the specific test * changed the order of the try-catch to check for sample_weight param beforehand * pass on import error rather than printing something to std out * improve test case naming and pd.Series check in the bad estimator class * address a pep8 linter error with unused import * pep8 warning disabled for potential unused import * throw a warning when SkipTest is raised * add a SkipTestWarning * updated the whats_new.rst with this issue * rebase and fix a spacing issue
…eries type (scikit-learn#7825) * addressed comments in the PR about parameters in check_array * update the test case for the evaluation of estimators with pandas series * bug fix, need to check for *not* None explicitly * updated with isinstance check if the documentation says there is acceptance of floats * ran pep8 linter on modified files * moving the test case to estimators_check * add a predict function into the testing pandas.Series class * avoid running anything beyond the newly added meta checks * check if pandas is installed before running the specific test * changed the order of the try-catch to check for sample_weight param beforehand * pass on import error rather than printing something to std out * improve test case naming and pd.Series check in the bad estimator class * address a pep8 linter error with unused import * pep8 warning disabled for potential unused import * throw a warning when SkipTest is raised * add a SkipTestWarning * updated the whats_new.rst with this issue * rebase and fix a spacing issue
…eries type (scikit-learn#7825) * addressed comments in the PR about parameters in check_array * update the test case for the evaluation of estimators with pandas series * bug fix, need to check for *not* None explicitly * updated with isinstance check if the documentation says there is acceptance of floats * ran pep8 linter on modified files * moving the test case to estimators_check * add a predict function into the testing pandas.Series class * avoid running anything beyond the newly added meta checks * check if pandas is installed before running the specific test * changed the order of the try-catch to check for sample_weight param beforehand * pass on import error rather than printing something to std out * improve test case naming and pd.Series check in the bad estimator class * address a pep8 linter error with unused import * pep8 warning disabled for potential unused import * throw a warning when SkipTest is raised * add a SkipTestWarning * updated the whats_new.rst with this issue * rebase and fix a spacing issue
…eries type (scikit-learn#7825) * addressed comments in the PR about parameters in check_array * update the test case for the evaluation of estimators with pandas series * bug fix, need to check for *not* None explicitly * updated with isinstance check if the documentation says there is acceptance of floats * ran pep8 linter on modified files * moving the test case to estimators_check * add a predict function into the testing pandas.Series class * avoid running anything beyond the newly added meta checks * check if pandas is installed before running the specific test * changed the order of the try-catch to check for sample_weight param beforehand * pass on import error rather than printing something to std out * improve test case naming and pd.Series check in the bad estimator class * address a pep8 linter error with unused import * pep8 warning disabled for potential unused import * throw a warning when SkipTest is raised * add a SkipTestWarning * updated the whats_new.rst with this issue * rebase and fix a spacing issue
…eries type (scikit-learn#7825) * addressed comments in the PR about parameters in check_array * update the test case for the evaluation of estimators with pandas series * bug fix, need to check for *not* None explicitly * updated with isinstance check if the documentation says there is acceptance of floats * ran pep8 linter on modified files * moving the test case to estimators_check * add a predict function into the testing pandas.Series class * avoid running anything beyond the newly added meta checks * check if pandas is installed before running the specific test * changed the order of the try-catch to check for sample_weight param beforehand * pass on import error rather than printing something to std out * improve test case naming and pd.Series check in the bad estimator class * address a pep8 linter error with unused import * pep8 warning disabled for potential unused import * throw a warning when SkipTest is raised * add a SkipTestWarning * updated the whats_new.rst with this issue * rebase and fix a spacing issue
Reference Issue
Finishes work in #5642
What does this implement/fix? Explain your changes.
This addresses the comments made on @pradyu1993's commit
Any other comments?
May still warrant a PEP8 linting. (Having trouble with my usual development machine so once I get back on that I'll run the linter & all). Let me know if there are any suggestions on tests or improvements to how the check is made. Thanks!