Skip to content

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results #18582

Closed
@dhirschfeld

Description

@dhirschfeld
Contributor

According to my understanding of how boolean indexing should work the below test should pass:

import numpy as np
from numpy.random import randn
import pandas as pd

df = pd.DataFrame(randn(10, 5))
df1 = df.copy(); df2 = df.copy()
INVALID = (df < -1)|(df > 1)
df1[INVALID.values] = NaN
df2[INVALID] = NaN
assert df1.equals(df2)

i.e. it shouldn't matter whether or not the boolean mask is a DataFrame or a plain ndarray

Problem description

As can be seen in the picture below, when indexing a DataFrame with a boolean ndarray, rather than
setting the individual True elements entire rows are being set if any value in the row is True.

i.e. what is happening is the result I would expect from df[INVALID.any(axis=1)] = NaN rather than df[INVALID] = NaN

image

This is different from how indexing an ndarray works and to me at least is very surprising and entirely unexpected. This behaviour was the cause of a very subtle bug in my code :(

Output of pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.7.2
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: 2.5.0b1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Activity

jreback

jreback commented on Dec 1, 2017

@jreback
Contributor

the 2d ndarray input hits a path where a 1d array is expected, so this is a bug

Idiomatically and what the DataFrame input hits is .where()

In [1]: df = pd.DataFrame(np.random.randn(10, 5))
   ...: INVALID = (df < -1)|(df > 1)

In [3]: df.where(INVALID.values)
Out[3]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

In [4]: df.where(INVALID)
Out[4]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

Here's a patch to implement the correct conversion (note might fail some existing tests). Alternatively we could raise

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index d3561f8a0..82d9375e5 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2524,10 +2524,10 @@ class DataFrame(NDFrame):
         if indexer is not None:
             return self._setitem_slice(indexer, value)
 
-        if isinstance(key, (Series, np.ndarray, list, Index)):
-            self._setitem_array(key, value)
-        elif isinstance(key, DataFrame):
+        if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
             self._setitem_frame(key, value)
+        elif isinstance(key, (Series, np.ndarray, list, Index)):
+            self._setitem_array(key, value)
         else:
             # set column
             self._set_item(key, value)
@@ -2560,8 +2560,16 @@ class DataFrame(NDFrame):
     def _setitem_frame(self, key, value):
         # support boolean setting with DataFrame input, e.g.
         # df[df > df2] = 0
+
+        if isinstance(key, np.ndarray):
+            if key.shape != self.shape:
+                raise ValueError('Array conditional must be same shape as '
+                                 'self')
+            key = self._constructor(key, **self._construct_axes_dict())
+
         if key.values.size and not is_bool_dtype(key.values):
-            raise TypeError('Must pass DataFrame with boolean values only')
+            raise TypeError('Must pass DataFrame or 2-d ndarray'
+                            'with boolean values only')
 
         self._check_inplace_setting(value)
         self._check_setitem_copy()
jreback

jreback commented on Dec 1, 2017

@jreback
Contributor

would love a PR!

added this to the Next Major Release milestone on Dec 1, 2017
changed the title [-]BUG: Boolean ndarray indexing gives incorrect results?[/-] [+]BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results[/+] on Dec 1, 2017
dhirschfeld

dhirschfeld commented on Dec 1, 2017

@dhirschfeld
ContributorAuthor

would love a PR!

Ha! Seems to be my weekend for getting stuck into pandas so I'll give it a crack...

jreback

jreback commented on Dec 1, 2017

@jreback
Contributor
added a commit that references this issue on Dec 5, 2017
bdc08c9
added a commit that references this issue on Dec 9, 2017
18e7a50

9 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugDtype ConversionsUnexpected or buggy dtype conversionsError ReportingIncorrect or improved errors from pandasIndexingRelated to indexing on series/frames, not to indexes themselves

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      Participants

      @dhirschfeld@jreback

      Issue actions

        BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results · Issue #18582 · pandas-dev/pandas