Skip to content

[MRG+1] BUG: MultiLabelBinarizer.fit_transform sometimes returns an invalid CSR matrix #7750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 26, 2016

Conversation

perimosocordiae
Copy link
Contributor

See scipy/scipy#6719 for context.

The gist is that the inverse array may have a different dtype than yt.indices, which causes trouble down the line because, in those cases, yt.indices and yt.indptr have different dtypes.

Alternately, we could insert yt.check_format(full_check=False) after modifying the sparse matrix members.

See scipy/scipy#6719 for context.

The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes.

Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.
@amueller
Copy link
Member

Thanks. can you add a test please?

@amueller amueller added this to the 0.18.1 milestone Oct 25, 2016
Older versions don't support kwargs for `astype`
@@ -732,7 +732,7 @@ def fit_transform(self, y):
class_mapping = np.empty(len(tmp), dtype=dtype)
class_mapping[:] = tmp
self.classes_, inverse = np.unique(class_mapping, return_inverse=True)
yt.indices = inverse[yt.indices].astype(yt.indices.dtype, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use np.asarray(..., dtype=) to more closely reflect this operation, I think

@jnothman
Copy link
Member

Yes, it looks like that's the sort of test we should perform wherever we do CSR manipulation. Hacky hacky hack hack. Thanks @perimosocordiae.

@perimosocordiae
Copy link
Contributor Author

Tests pass now. Should I squash the commits?

@jnothman jnothman changed the title BUG: MultiLabelBinarizer.fit_transform sometimes returns an invalid CSR matrix [MRG+1] BUG: MultiLabelBinarizer.fit_transform sometimes returns an invalid CSR matrix Oct 26, 2016
@jnothman
Copy link
Member

no need.

@amueller
Copy link
Member

Maybe add a comment why the assert is needed and why the line is needed? It's a bit non-obvious to me.
Would checking if it's possible to convert to lil be a better test?
LGTM as is, but comments might be nice.

@amueller
Copy link
Member

Opened #7762 to track the overall problem.

@perimosocordiae
Copy link
Contributor Author

Okay, comments added. I skipped the full CI treatment for the comment-only changes.

Conversion to LIL format would test the symptom but not the cause of the error. It's possible that we may add checks in scipy to deal with this in the future, so I'd prefer to not rely on .tolil() failing for this case.

@amueller amueller merged commit 94c2094 into scikit-learn:master Oct 26, 2016
@amueller
Copy link
Member

thanks :)

amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
…nvalid CSR matrix (scikit-learn#7750)

* BUG: MultiLabelBinarizer makes invalid CSR matrix

See scipy/scipy#6719 for context.

The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes.

Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.

* Fixing for old numpy

Older versions don't support kwargs for `astype`

* Adding tests

* line-wrapping

* adding comment to tests

[ci skip]

* added rationale comment

[ci skip]
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…nvalid CSR matrix (scikit-learn#7750)

* BUG: MultiLabelBinarizer makes invalid CSR matrix

See scipy/scipy#6719 for context.

The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes.

Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.

* Fixing for old numpy

Older versions don't support kwargs for `astype`

* Adding tests

* line-wrapping

* adding comment to tests

[ci skip]

* added rationale comment

[ci skip]
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…nvalid CSR matrix (scikit-learn#7750)

* BUG: MultiLabelBinarizer makes invalid CSR matrix

See scipy/scipy#6719 for context.

The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes.

Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.

* Fixing for old numpy

Older versions don't support kwargs for `astype`

* Adding tests

* line-wrapping

* adding comment to tests

[ci skip]

* added rationale comment

[ci skip]
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…nvalid CSR matrix (scikit-learn#7750)

* BUG: MultiLabelBinarizer makes invalid CSR matrix

See scipy/scipy#6719 for context.

The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes.

Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.

* Fixing for old numpy

Older versions don't support kwargs for `astype`

* Adding tests

* line-wrapping

* adding comment to tests

[ci skip]

* added rationale comment

[ci skip]
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…nvalid CSR matrix (scikit-learn#7750)

* BUG: MultiLabelBinarizer makes invalid CSR matrix

See scipy/scipy#6719 for context.

The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes.

Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.

* Fixing for old numpy

Older versions don't support kwargs for `astype`

* Adding tests

* line-wrapping

* adding comment to tests

[ci skip]

* added rationale comment

[ci skip]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants