Skip to content

Add base test for ExtensionArray.astype and copy #28488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Sep 17, 2019 · 14 comments · Fixed by #35116
Closed

Add base test for ExtensionArray.astype and copy #28488

TomAugspurger opened this issue Sep 17, 2019 · 14 comments · Fixed by #35116
Assignees
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. good first issue
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 17, 2019

We should be able to test astype to our own type

diff --git a/pandas/tests/extension/base/casting.py b/pandas/tests/extension/base/casting.py
index 7146443bf8..b33ab93031 100644
--- a/pandas/tests/extension/base/casting.py
+++ b/pandas/tests/extension/base/casting.py
@@ -1,3 +1,5 @@
+import pytest
+
 import pandas as pd
 from pandas.core.internals import ObjectBlock
 
@@ -21,3 +23,9 @@ class BaseCastingTests(BaseExtensionTests):
         result = pd.Series(data[:5]).astype(str)
         expected = pd.Series(data[:5].astype(str))
         self.assert_series_equal(result, expected)
+
+    @pytest.mark.parametrize('copy', [True, False])
+    def test_astype_own_type(self, data, copy):
+        result = data.astype(data.dtype, copy=copy)
+        assert (result is data) is (not copy)
+        self.assert_extension_array_equal(result, data)

A few tests fail this. Would welcome investigation.

$ pytest pandas/tests/extension/ -k test_astype_own_type -v --tb=line

========================================================================= FAILURES ==========================================================================
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/base.py:390: TypeError: data type not understood
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/base.py:390: TypeError: data type not understood
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/base.py:390: TypeError: data type not understood
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/base.py:390: TypeError: data type not understood
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/base/casting.py:30: AssertionError
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Sep 17, 2019
@TomAugspurger TomAugspurger added ExtensionArray Extending pandas with custom dtypes or arrays. good first issue and removed good first issue labels Sep 17, 2019
@TomAugspurger
Copy link
Contributor Author

I'm assuming we do want this property. It's what NumPy does

In [27]: a = np.array([1, 2, 3])

In [28]: a.astype(a.dtype, copy=False) is a
Out[28]: True

if isinstance(dtype, _IntegerDtype):
may need an additional branch checking if self.dtype exactly matches dtype. In which case we just return self when copy=False.

We'll also want to update the docs in ExtensionArray.astype that this should return self when the dtype doesn't change and copy=False.

@jimmybeckett
Copy link

Hi! I'd like to work on this issue.

@TomAugspurger
Copy link
Contributor Author

Thanks! Let us know if you need help getting started.

@vishnuyar
Copy link

I would like to work on this issue, can you help me where to start

@TomAugspurger
Copy link
Contributor Author

I think @jimmybeckett may already be working on it.

jimmybeckett added a commit to jimmybeckett/pandas that referenced this issue Sep 20, 2019
@IsabellaRO
Copy link

Hi, can I work on this issue?

@TomAugspurger
Copy link
Contributor Author

@IsabellaRO that'd be great. I don't think anyone has submitted a pull request for it yet.

@tomaszps
Copy link
Contributor

tomaszps commented Jul 2, 2020

Given that there hasn't been any activity on this issue for a few months, I figured it would be okay to jump on this issue. I looked into the first failed test, which fails on the BooleanArray. I want to walk through my understanding of the problem and solution, @TomAugspurger, to make sure I'm getting this right. I'm going to be a little verbose for clarity's sake.

The first test that fails:

pandas/tests/extension/test_boolean.py::TestCasting::test_astype_own_type[True] PASSED           [  2%]
pandas/tests/extension/test_boolean.py::TestCasting::test_astype_own_type[False] FAILED                                           [  5%]

So I dropped in with the --pdb flag to see what's happening.

 55  	    @pytest.mark.parametrize('copy', [True, False])
 56  	    def test_astype_own_type(self, data, copy):
 57  	        result = data.astype(data.dtype, copy=copy)
 58  ->	        assert (result is data) is (not copy)
 59  	        self.assert_extension_array_equal(result, data)

Turns out result and data are different objects, according to id(). That must be happening when result is created from data.astype, so I drop into pandas.core.array.boolean.BooleanArray.astype and add a breakpoint...

    def astype(self, dtype, copy: bool = True) -> ArrayLike:
        """
        Got rid of the docstring to shorten this.
        """
        from pandas.core.arrays.string_ import StringDtype

        dtype = pandas_dtype(dtype)

        if isinstance(dtype, BooleanDtype):
            values, mask = coerce_to_array(self, copy=copy)
            if not copy:
                import pdb; pdb.set_trace()
            return BooleanArray(values, mask, copy=False)

And test equality of the parts
(Pdb) self._data is values
True
(Pdb) self._mask is mask
True

This points to that BooleanArray constructor call as the culprit. I think the solution is to check whether copy is False, and if so, then simply return self instead of calling that constructor.

e.g.

    def astype(self, dtype, copy: bool = True) -> ArrayLike:
        """
        Got rid of the docstring to shorten this.
        """
        from pandas.core.arrays.string_ import StringDtype

        dtype = pandas_dtype(dtype)

        if isinstance(dtype, BooleanDtype):
            values, mask = coerce_to_array(self, copy=copy)
            if copy:
                return BooleanArray(values, mask, copy=False)
            else:
                return self

The changes above make the test pass.
Everything seem good, @TomAugspurger? If so I'll continue w/ the other tests. It wouldn't surprise me if the issues are very similar.

@TomAugspurger
Copy link
Contributor Author

@tomaszps that looks right, thanks.

@tomaszps
Copy link
Contributor

tomaszps commented Jul 2, 2020

I'm a little confused about a detail- pandas.core.arrays.base.ExtensionArray.astype always coerces to and returns a numpy array. This is not the desired behavior, correct? That seems inconsistent with the way other subclasses of extensionarray act, where the method is overridden (e.g. Integer, Boolean extensionarrays).

@tomaszps
Copy link
Contributor

tomaszps commented Jul 2, 2020

take

@tomaszps
Copy link
Contributor

tomaszps commented Jul 2, 2020

I'm digging further, I'll dump the rest of my questions/notes here until there's a response.

That solution or some variant worked for all of the tests up to
pandas/tests/extension/test_string.py::TestCasting::test_astype_own_type[True] FAILED [ 87%]
(assert_extension_array_equal is what fails in testing_ module, at line 1198.)

All the others failed for the reason mentioned above, and I have fixes hacked in. The results are up in my branch. I'd need to go through and clean up the commit for merging to make the style consistent between the fixes.

edit: Sorry for not following standards in my PR. I figured I would get everything down and then do an interactive rebase and force push it or push it to a new branch.

@tomaszps tomaszps mentioned this issue Jul 2, 2020
5 tasks
@tomaszps
Copy link
Contributor

tomaszps commented Jul 3, 2020

@TomAugspurger I looked in the contribution docs but didn't see- is there a standard way to solicit feedback on an issue? (If I'm a new person and I have questions because I'm unfamiliar. @'ing you repeatedly seemed wrong, but I don't know better.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 3, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. good first issue
Projects
None yet
6 participants