Obscure AttributeError when dropping on a multi-index dataframe #12078

Closed
nbonnotte opened this issue Jan 18, 2016 · 9 comments
Labels
Error Reporting (incorrect or improved errors from pandas), Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
Comments

@nbonnotte
Contributor

In [2]:  df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [3]: df = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [4]: df
Out[4]: 
b  a b1 b2
c    c1 c2
0  1  3  4

In [5]: df.drop('a', axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-b59fbf92d28f> in <module>()
----> 1 df.drop('a', axis=1)

/home/nicolas/Git/pandas/pandas/core/generic.pyc in drop(self, labels, axis, level, inplace, errors)
   1617                 new_axis = axis.drop(labels, level=level, errors=errors)
   1618             else:
-> 1619                 new_axis = axis.drop(labels, errors=errors)
   1620             dropped = self.reindex(**{axis_name: new_axis})
   1621             try:

/home/nicolas/Git/pandas/pandas/core/index.pyc in drop(self, labels, level, errors)
   5729                     inds.append(loc)
   5730                 else:
-> 5731                     inds.extend(lrange(loc.start, loc.stop))
   5732             except KeyError:
   5733                 if errors != 'ignore':

AttributeError: 'numpy.ndarray' object has no attribute 'start'

This is related to issue #11640. I submitted a solution in pull request #11717, but it was controversial, so I'm creating this issue to separate the problems.

I'll make a PR soon enough.

@nbonnotte
Contributor Author

I'm a bit confused.

As I understand the API, .drop should not work here, because 'a' is not a column, and we should just have a more meaningful error message. If I wanted to remove the columns whose first level is 'a', I should do df.drop('a', axis=1, level=0). Right?
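For reference, the explicit form with level=0 can be tried on the frame from the report. This is only a sketch and assumes a recent pandas release, whose behaviour may differ from the 2016 development version in the traceback; the names raw, dropped and remaining are introduced here for illustration.

```python
import pandas as pd

# Rebuild the frame from the report: pivot, then move 'a' back into the columns.
raw = pd.DataFrame(
    columns=["a", "b", "c", "d"],
    data=[[1, "b1", "c1", 3], [1, "b2", "c2", 4]],
)
df = raw.pivot_table(index="a", columns=["b", "c"], values="d").reset_index()

# Naming the level explicitly avoids the lookup by partial key
# that goes through MultiIndex.get_loc.
dropped = df.drop("a", axis=1, level=0)

# Column tuples left after the drop.
remaining = list(dropped.columns)
print(remaining)
```

Passing level=0 takes the other branch visible in the generic.py traceback (the axis.drop call that receives level), so the result does not depend on what get_loc returns.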

On the other hand, if we consider

In [4]: dg = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

In [5]: dg
Out[5]: 
b  a b1 b2
c    c1 c2
0  1  3  4

then dg and df are equivalent:

In [7]: from pandas.util.testing import assert_frame_equal

In [8]: assert_frame_equal(df, dg) or "No error raised"
Out[8]: 'No error raised'

but

In [14]: dg.drop('a', axis=1)
Out[14]: 
b b1 b2
c c1 c2
0  3  4

Here is what happens:

  • In MultiIndex.drop (see here), inside the try ... except block, a ValueError is raised because the label 'a' is not contained in the axis, which is correct.
  • Execution then continues to loc = self.get_loc(label), here with label='a'.
  • In MultiIndex.get_loc, since the key 'a' is not a tuple, the parameter level=0 is automagically added (see here).

Does that mean that, in the API as it should be, in .drop the parameter level=0 was intended to be superfluous? That is, df.drop('a', axis=1) should be equivalent to df.drop('a', axis=1, level=0) ?

What should I do in my pull request?

As a side note, the reason why .drop fails for the first example df and not for the second example dg comes later: for the former, .get_loc returns a boolean mask, while for the latter it returns a slice, but .drop forgets to handle boolean masks (see those lines).
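The slice-versus-mask distinction can be illustrated without pandas. The sketch below is not the pandas implementation; loc_to_positions and its parameter n are hypothetical names, and a plain list of booleans stands in for the numpy mask (the same loop also works on a one-dimensional boolean array).

```python
import numbers


def loc_to_positions(loc, n):
    """Convert a get_loc-style result into a list of integer positions.

    loc may be an integer, a slice, or a boolean mask of length n.
    """
    if isinstance(loc, numbers.Integral):
        # A single position.
        return [int(loc)]
    if isinstance(loc, slice):
        # slice.indices resolves None start/stop/step against length n.
        return list(range(*loc.indices(n)))
    # Otherwise treat loc as a boolean mask: keep the positions that are True.
    return [i for i, flag in enumerate(loc) if flag]


# The two results shown in this thread map to the same position.
print(loc_to_positions(slice(0, 1, None), 3))
print(loc_to_positions([True, False, False], 3))
```

The failing line in the traceback only covers the first two cases: it reads loc.start and loc.stop, which exist on a slice but not on an array.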

Also, I'm sorry if it seems that I am insisting a bit on these issues about .drop. I just like to understand things, and I'm confused about what the code claims to be doing, what it should do in theory, and what it actually does. I guess that's bound to happen on such a complex project, and I'd be glad to help in any way I can.

@jreback
Contributor

jreback commented Jan 19, 2016

In [1]: dg = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

In [6]: dg.columns.is_lexsorted()
Out[6]: True

In [7]: df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [8]: df = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()
In [9]: df.columns.is_lexsorted()
Out[9]: False

The difference is that this doesn't work when the columns are not lexsorted: the error is incorrectly propagated and an incorrect code path is taken, producing an error message that doesn't make sense. So you need to find where the two cases diverge and what is happening to the exceptions.

@jreback added the Indexing, Error Reporting, and MultiIndex labels Jan 19, 2016
@jreback added this to the Next Major Release milestone Jan 19, 2016
@nbonnotte
Contributor Author

Oki doki, I'll do that ^^

@nbonnotte
Contributor Author

I couldn't find any other exception that would be raised but incorrectly propagated. Except the one that shows up, of course.

And this exception is raised for the reason I gave:

  • when the multi-index is lexsorted, .get_loc() returns a slice
  • when it is not, it returns a boolean mask, but what comes next in MultiIndex.drop can't handle that (see those lines)

In [2]: ref = pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c'])

In [3]: pbm = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index().columns

In [6]: ref.is_lexsorted()
Out[6]: True

In [7]: pbm.is_lexsorted()
Out[7]: False

In [8]: ref.drop('a')
Out[8]: 
MultiIndex(levels=[[u'a', u'b1', u'b2'], [u'', u'c1', u'c2']],
           labels=[[1, 2], [1, 2]],
           names=[u'b', u'c'])

In [9]: pbm.drop('a')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-fcb8cd09713a> in <module>()
----> 1 pbm.drop('a')

/home/nicolas/Git/pandas/pandas/indexes/multi.py in drop(self, labels, level, errors)
   1091                     inds.append(loc)
   1092                 else:
-> 1093                     inds.extend(lrange(loc.start, loc.stop))
   1094             except KeyError as e:
   1095                 if errors != 'ignore':

AttributeError: 'numpy.ndarray' object has no attribute 'start'

In [10]: ref.get_loc('a')
Out[10]: slice(0, 1, None)

In [11]: pbm.get_loc('a')
Out[11]: array([ True, False, False], dtype=bool)

In [12]: ref.get_loc('a').start
Out[12]: 0

In [13]: pbm.get_loc('a').start
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-2a974e7413c7> in <module>()
----> 1 pbm.get_loc('a').start

AttributeError: 'numpy.ndarray' object has no attribute 'start'

But maybe I'm just not looking at the right place. Am I missing something?

@jreback
Contributor

jreback commented Jan 27, 2016

yeh, prob just not correctly implemented.

@nbonnotte
Contributor Author

Can I correct the implementation, so that .drop works for a non-lexsorted multi-index in the same way as for a lexsorted one? :D

In [2]: ref = pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c'])

In [3]: pbm = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index().columns

In [4]: ref.is_lexsorted()
Out[4]: True

In [5]: pbm.is_lexsorted()
Out[5]: False

In [6]: ref.values
Out[6]: array([('a', ''), ('b1', 'c1'), ('b2', 'c2')], dtype=object)

In [7]: pbm.values
Out[7]: array([('a', ''), ('b1', 'c1'), ('b2', 'c2')], dtype=object)

In [8]: ref.drop('a')
Out[8]: 
MultiIndex(levels=[[u'a', u'b1', u'b2'], [u'', u'c1', u'c2']],
           labels=[[1, 2], [1, 2]],
           names=[u'b', u'c'])

Beware that this simple correction might change the API of both .drop and .groupby, as we discussed in the pull request #11717 😇

So perhaps a safer option would be to first have ref.drop('a') raise a KeyError or ValueError, because 'a' is not a valid label and the proper call is ref.drop('a', level=0), and only then correct the implementation?

Let me know what I can do.

@jreback
Contributor

jreback commented Jan 27, 2016

I think .drop on a DataFrame is fine (your example is not that). You can simply lexsort the pivot table, I think.

@nbonnotte
Contributor Author

The problem with the DataFrame arises because of the problem with the MultiIndex, as shown in my examples.

What can I do to remove the obscure error message?

@jreback
Contributor

jreback commented Jan 27, 2016

ahh, yes, see if you can

tm.assert_index_equal(pbm.drop('a'), ref.drop('a'))

though you may want to output a PerformanceWarning for pbm.drop('a')

you'll have to look and see how it's used elsewhere.
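The check suggested here could look roughly like the sketch below. It assumes a pandas release that already handles the boolean-mask case; on the version in the traceback, pbm.drop('a') would raise instead. A plain comparison of tuples is used rather than tm.assert_index_equal, and warnings are only recorded, not asserted, since whether a PerformanceWarning appears is version-dependent. The names from_ref, from_pbm, same and warning_names are illustrative.

```python
import warnings

import pandas as pd

# Index built directly from tuples (lexsorted in the thread).
ref = pd.MultiIndex.from_tuples(
    [("a", ""), ("b1", "c1"), ("b2", "c2")], names=["b", "c"]
)

# Same values, but obtained through pivot_table + reset_index.
pbm = (
    pd.DataFrame(
        columns=["a", "b", "c", "d"],
        data=[[1, "b1", "c1", 3], [1, "b2", "c2", 4]],
    )
    .pivot_table(index="a", columns=["b", "c"], values="d")
    .reset_index()
    .columns
)

# Record any warnings emitted while dropping from the pivoted index.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from_pbm = pbm.drop("a")
from_ref = ref.drop("a")

# Both drops should leave the same tuples behind.
same = from_pbm.tolist() == from_ref.tolist()
warning_names = [type(w.message).__name__ for w in caught]
print(same, warning_names)
```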


2 participants