
AttributeError: 'numpy.ndarray' object has no attribute '_get_repr' with np.abs on a DatetimeIndex #2948


Closed
robintw opened this issue Feb 28, 2013 · 43 comments

@robintw
Contributor

robintw commented Feb 28, 2013

I am trying to find the row of a DataFrame that is closest to a certain datetime. I have my data in a DataFrame which has a DatetimeIndex, and I can't use functions like asof because I need to get the closest row (either before or after the time), so I am trying to implement the standard way of doing this which is getting the absolute difference between each index time and the time I want to find, then finding the row with the minimum difference.

However, there seems to be a strange bug with using np.abs with the pandas index, and it seems to be in the printing of the results stage (relating to _get_repr). Pandas prints things differently depending on how many rows the DataFrame has, and I think this is related here - as running the code on a 200 row DataFrame fails, but on a 100 row DataFrame it works! I've checked that it's not some strange value in the DataFrame which is causing the problem, as a 100 row segment from anywhere in my original input file works, whereas a 200 row segment from anywhere fails.

Two input files (aeronet_large.txt with 200 rows and aeronet.txt with 100 rows) and the code I've been using are available at this gist: https://gist.github.com/robintw/631b53fa0cd9dbdabb36, and I asked a StackOverflow question about this (http://stackoverflow.com/questions/15115547/find-closest-row-of-dataframe-to-given-time-in-pandas) where I was recommended to raise an issue.
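
For readers following along, the approach described above (absolute difference to every index entry, then the row with the smallest difference) can be sketched as follows. This is a minimal sketch against a recent pandas release, not the code from the gist; the helper name `closest_row` and the synthetic hourly data are assumptions made for illustration.

```python
import pandas as pd


def closest_row(df, when):
    """Return (label, row) for the index entry nearest to `when` (hypothetical helper)."""
    target = pd.Timestamp(when)
    # Timedelta between every index entry and the target, keyed by the index itself.
    deltas = df.index.to_series() - target
    # idxmin returns the index label (a Timestamp) with the smallest absolute gap.
    label = deltas.abs().idxmin()
    return label, df.loc[label]


# Synthetic hourly data standing in for the real input files.
index = pd.date_range(start="2009-07-01", periods=200, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"value": range(200)}, index=index)

label, row = closest_row(df, "2009-07-02 13:04")
print(label, row["value"])
```

The row can be before or after the requested time, which is the behaviour asof does not provide.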

@jreback
Contributor

jreback commented Feb 28, 2013

try updating to master
this was fixed in #2899
and let me know

the printing is just a red herring; it's an exception on the subtraction

@robintw
Contributor Author

robintw commented Feb 28, 2013

I downloaded the latest version from http://pandas.pydata.org/pandas-build/dev/ yesterday - looking at the times of that pull request, it looks like it should have been included. Is there a way to check from within pandas which version I am running?

@jreback
Contributor

jreback commented Feb 28, 2013

print pandas.__version__
you are on windows?

which file?

@jreback
Contributor

jreback commented Feb 28, 2013

all of the windows versions look pretty recent, so should be good

if you are on windows, then this should work,
on linux use git clone

@robintw
Contributor Author

robintw commented Feb 28, 2013

The output of pandas.__version__ is: 0.11.0.dev-8ad9516

Yes, I am using Windows with Python 2.7.
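
As a small illustration of the version check discussed above, the snippet below prints the installed pandas version, plus the NumPy version since it becomes relevant later in the thread. It is written for Python 3, whereas the thread itself uses Python 2.7 print statements.

```python
import numpy as np
import pandas as pd

# Both packages expose their version string as __version__.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```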

@jreback
Contributor

jreback commented Feb 28, 2013

yep...that looks right

@robintw
Contributor Author

robintw commented Mar 1, 2013

Does it seem that this wasn't fixed in #2899 then? Is there any other information you'd like me to provide to help fix this issue?

@jreback
Contributor

jreback commented Mar 1, 2013

I would need a test case that reproduces the error
just paste it in a comment

@robintw
Contributor Author

robintw commented Mar 1, 2013

Hmm...I'm not sure entirely where the problem lies - but it definitely seems to be related to the length of the DataFrame. I've tried creating a generic sequence of dates and running np.abs on them, and it doesn't give an error:

import dateutil.parser
import numpy as np
import pandas as pd
from pandas import DataFrame

n = 100
ind = pd.date_range(start='2009-07-01', periods=n, freq='H')
df = DataFrame(data=np.random.randn(n), index=ind)
df['time'] = df.index
image_time = dateutil.parser.parse('2009-07-02 13:04')
np.abs(df.time - image_time)

Also, the results look fairly sensible - differences are a maximum of 2 days. However, if I run the same code with n = 101 instead, I still don't get an error, but I get insane looking results - see the example of the last few lines of the results below:

Results where n = 100
2009-07-05 01:00:00 2 days, 11:56:00
2009-07-05 02:00:00 2 days, 12:56:00
2009-07-05 03:00:00 2 days, 13:56:00

Results where n = 101
2009-07-05 02:00:00 2536 days, 23:59:51.242751
2009-07-05 03:00:00 2580 days, 00:12:35.359744
2009-07-05 04:00:00 2622 days, 00:33:40.130816

Something strange definitely seems to be going on there!
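
One way to make "insane looking results" testable rather than eyeballed: for any set of timestamps, the absolute gap to a target can never exceed the gap to the earliest or latest entry. The sketch below checks that bound on a recent pandas release; the helper name `max_plausible_gap` is an assumption made for illustration.

```python
import pandas as pd


def max_plausible_gap(index, target):
    """Largest possible |entry - target| over a datetime index (hypothetical helper)."""
    return max(abs(index.min() - target), abs(index.max() - target))


index = pd.date_range("2009-07-01", periods=101, freq=pd.Timedelta(hours=1))
target = pd.Timestamp("2009-07-02 13:04")

gaps = (index.to_series() - target).abs()
bound = max_plausible_gap(index, target)

# Any gap above the bound would point to an overflow or dtype problem.
print(gaps.max(), bound)
```

With 101 hourly rows the largest legitimate gap is under three days, so values in the thousands of days are clearly corrupt.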

Going back to the example that I provided when opening this issue (shown in the gist at https://gist.github.com/robintw/631b53fa0cd9dbdabb36), the results seem to be sensible (a max difference of 3 days) when run with the aeronet.txt file, but when run with aeronet_long.txt it gives the error below, which all seems to stem from the printing code:

AttributeError                            Traceback (most recent call last)
<ipython-input-35-97fdad570e66> in <module>()
     22 # Trying to do the absolute difference gives an error for aeronet_long.txt but not for aeronet.txt
     23 result = np.abs(aeronet.time - image_time)
---> 24 print result
     25 

C:\Python27\lib\site-packages\pandas\core\series.pyc in __str__(self)
   1021         if py3compat.PY3:
   1022             return self.__unicode__()
-> 1023         return self.__bytes__()
   1024 
   1025     def __bytes__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __bytes__(self)
   1031         """
   1032         encoding = com.get_option("display.encoding")
-> 1033         return self.__unicode__().encode(encoding, 'replace')
   1034 
   1035     def __unicode__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
   1044                     else get_option("display.max_rows"))
   1045         if len(self.index) > (max_rows or 1000):
-> 1046             result = self._tidy_repr(min(30, max_rows - 4))
   1047         elif len(self.index) > 0:
   1048             result = self._get_repr(print_header=True,

C:\Python27\lib\site-packages\pandas\core\series.pyc in _tidy_repr(self, max_vals)
   1069         """
   1070         num = max_vals // 2
-> 1071         head = self[:num]._get_repr(print_header=True, length=False,
   1072                                     name=False)
   1073         tail = self[-(max_vals - num):]._get_repr(print_header=False,

AttributeError: 'numpy.ndarray' object has no attribute '_get_repr'

That makes me suspect that the results given by the aeronet_test.py code running with aeronet_long.txt are also very strange results, and that is why an error is raised when we try to print them - because they're so insane that they can't be printed.

Does that help?

@ghost

ghost commented Mar 1, 2013

You're subtracting a datetime object from a TimeSeries and that causes an overflow
in _possibly_cast_to_timedelta where it calls tslib.array_to_timedelta64(value.astype(object), coerce=False)
there's a too broad Exception clause in Series._get_values which masks the error
and returns a numpy array (shouldn't this just die, what's the case here), which isn't
a TimeSeries object and so the _get_repr call fails.
this only happens when you exceed the display.max_rows option, because then the repr code path which
attempts this conversion is invoked.

Short Answer: use aeronet.time-pd.Timestamp(image_time) instead.
but you should have gotten a warning that you were doing something wrong.
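
To illustrate why the row count matters at all, the sketch below shows the repr switching to a truncated head-and-tail form once a Series is longer than display.max_rows. It uses the option API of a recent pandas release; the internal code path in the 2013 traceback (`_tidy_repr`) is different, so treat this only as a demonstration of the threshold behaviour.

```python
import pandas as pd

s = pd.Series(range(200))

# With a small display.max_rows the repr is truncated to a head and a tail.
with pd.option_context("display.max_rows", 10):
    truncated = repr(s)

# With the limit disabled (None) every row is rendered.
with pd.option_context("display.max_rows", None):
    full = repr(s)

print(len(truncated.splitlines()), len(full.splitlines()))
```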

Other than that, I think you've got your date format wrong, most of the data is on consecutive days
but you've got some "01-07" days there, which probably should be July first, rather than January seventh.

you need to specify a date format or store your date strings as ISO 8601, which is preferable, exactly
to prevent these kinds of ambiguity errors.
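
A short sketch of the date-format ambiguity described above. The string "01-07-2009" is a made-up example (the actual layout of the input files is not shown in this thread); the point is only that an explicit format, or ISO 8601 input, removes the guesswork.

```python
import pandas as pd

ambiguous = "01-07-2009"  # hypothetical value, not taken from the input files

# The same text means two different days depending on the declared format.
day_first = pd.to_datetime(ambiguous, format="%d-%m-%Y")    # 1 July 2009
month_first = pd.to_datetime(ambiguous, format="%m-%d-%Y")  # 7 January 2009

# ISO 8601 (year-month-day) has only one reading.
iso = pd.to_datetime("2009-07-01")

print(day_first, month_first, iso)
```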

@jreback
Contributor

jreback commented Mar 1, 2013

this is actually a very subtle error
I have pushed 2955 to mostly fix this (not in master yet though)

this 'works', but the final answer you are looking for, 'a',
is the wrong dtype (it's timedelta64[us])...in microseconds..
I know it 'looks' right, but if you use the values they will be 1000 times off...(and THAT
is causing the overflow that @y-p pointed to above)

I think numpy 1.6.2 does the wrong thing on an np.abs(of a timedelta64)....
not easy right now to work around this...
but you can do some trickery depending on exactly what you are after

what should your final answer look like?

In [19]: df = pd.DataFrame(data=np.random.randn(5), index=pd.date_range('20090705',periods=5))

In [20]: df
Out[20]: 
                   0
2009-07-05 -0.659246
2009-07-06 -0.249913
2009-07-07 -0.025722
2009-07-08 -0.115015
2009-07-09 -0.123338

In [21]: df['time'] = df.index

In [22]: df
Out[22]: 
                   0                time
2009-07-05 -0.659246 2009-07-05 00:00:00
2009-07-06 -0.249913 2009-07-06 00:00:00
2009-07-07 -0.025722 2009-07-07 00:00:00
2009-07-08 -0.115015 2009-07-08 00:00:00
2009-07-09 -0.123338 2009-07-09 00:00:00

In [23]: df.dtypes
Out[23]: 
0              float64
time    datetime64[ns]
Dtype: object

In [24]: image_time
Out[24]: <Timestamp: 2009-07-02 13:04:00>

In [26]: df['diff'] = df['time']-image_time

In [27]: df
Out[27]: 
                   0                time             diff
2009-07-05 -0.659246 2009-07-05 00:00:00 2 days, 10:56:00
2009-07-06 -0.249913 2009-07-06 00:00:00 3 days, 10:56:00
2009-07-07 -0.025722 2009-07-07 00:00:00 4 days, 10:56:00
2009-07-08 -0.115015 2009-07-08 00:00:00 5 days, 10:56:00
2009-07-09 -0.123338 2009-07-09 00:00:00 6 days, 10:56:00

In [28]: df.dtypes
Out[28]: 
0               float64
time     datetime64[ns]
diff    timedelta64[ns]
Dtype: object

In [29]: a = df['diff'].abs()

In [30]: a
Out[30]: 
2009-07-05   2 days, 10:56:00
2009-07-06   3 days, 10:56:00
2009-07-07   4 days, 10:56:00
2009-07-08   5 days, 10:56:00
2009-07-09   6 days, 10:56:00
Freq: D, Name: diff, Dtype: timedelta64[us]
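
The dtype drift in the last output above (ns in, us out) is the thing to watch for. A minimal check, assuming a recent pandas release where the absolute value is expected to keep the dtype of its input, could look like this:

```python
import pandas as pd

times = pd.Series(pd.date_range("2009-07-05", periods=5))
image_time = pd.Timestamp("2009-07-02 13:04")

diff = times - image_time
absolute = diff.abs()

# The absolute value should keep the timedelta dtype of the input.
print(diff.dtype, absolute.dtype)
print(absolute.iloc[0])
```

If the two dtypes differ, the raw integer values are in different units even when the printed output looks the same.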

@ghost

ghost commented Mar 1, 2013

together with #2888, this is pointing at low test coverage for large integer cases. is there a way
to get coverage for cython code? I bet coverage would have exposed this.

@robintw
Contributor Author

robintw commented Mar 1, 2013

Thanks for all the investigation of this.

@y-p: Using aeronet.time - pd.Timestamp(image_time) gives the same error for me.

I hadn't noticed that after doing np.abs the result was in us not ns - is that basically a NumPy bug? If so, should I report that separately to NumPy?

Basically, the results given by @jreback in the final line of his example seem right (apart from the fact that the datatype has been switched to us from ns). Once I've got the np.abs stuff working I was then going to find the minimum value and extract that row of the DataFrame, as that will be the row that has the closest time to the input time. So basically, if the result gives the right numbers (which the final result in the example above does), doesn't give an error, and works with something like np.min to get the minimum value, then that's all sorted for me.

Interestingly, there also seems to be a problem with the final bit of my requirement: getting the minimum value. I suspect it is related to the other problems that you've identified:

  • Running a.argmin() gives an OverflowError: long too big to convert
  • Running a.min() gives a TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe'

On a separate note: once we've sorted these problems, would it be worth adding a closest_time method to the DatetimeIndex object, in a similar manner to the asof method? Finding the closest row to a certain time is a fairly common operation with time-series data, so it may be worth including 'out-of-the-box'.

@jreback
Contributor

jreback commented Mar 1, 2013

@robintw the np.abs is a numpy bug, its prob fixed in 1.7, I have a work around for you, min/max should work, will check argmin/max as well.....

I'll update master later today and let you know

@ghost

ghost commented Mar 1, 2013

ah yes, pd.Timestamp has no effect, my bad.
The _get_repr error is due to series slicing returning an ndarray when an exception
occurs, and that's a separate issue the PR fixes for me. the result is still erroneous;
that should be fixed, or the exception raised should be relevant.

@jreback
Contributor

jreback commented Mar 1, 2013

@robintw what is your numpy version?
1.6.1 has many issues in regards to this (which pandas works around)....

@y-p on 3.2 travis build we test with 1.6.1...do we support this? (or are we min numpy 1.6.2)?

@robintw
Contributor Author

robintw commented Mar 1, 2013

I'm running 1.6.1 it seems. I wasn't aware there was a newer version actually. Shall I try updating to 1.7 and see if that solves some of these problems? (Or do you want me to stay on 1.6.1 or go to 1.6.2 for debugging purposes?).

I have a suspicion that pip install --upgrade numpy won't work on my Windows system, so I'll look into how best to install it.

@jreback
Contributor

jreback commented Mar 1, 2013

it won't solve most of these, but you should use at least 1.6.2 in any event

you'll have to wait for me to merge to master and then for the dev builds...(prob tomorrow or next day)...they are auto-generated

I get windows builds from here for all python stuff
http://www.lfd.uci.edu/~gohlke/pythonlibs/

@ghost

ghost commented Mar 1, 2013

Travis provides numpy preinstalled, 1.6.1 is apparently what's bundled with precise.
If it's a problem, it's possible to install numpy 1.6.2 specifically, it'll just make for longer
builds, but as long as the time limit isn't exceeded, that's fine.

@ghost

ghost commented Mar 1, 2013

README says 1.6.1 is supported. if that needs to be changed, you need to ask wes.

@jreback
Contributor

jreback commented Mar 1, 2013

no...numpy 1.6.1 is fine

its good that we test with this (but only on 3.2)...prob not worth having a 2.7 with 1.6.1

thxs all good now

@jreback
Contributor

jreback commented Mar 1, 2013

@robintw ok...i pushed to master....but as you are on windows, check the latest dev builds for updates

It should have this commit: 819e0ad (or later) in it

give it a whirl and report back

also http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas has an update (I think after 5pm EST) with some more supported ops

@jreback
Contributor

jreback commented Mar 1, 2013

@robintw look at #2957 for the way to deal with the np.abs (for now)

@robintw
Contributor Author

robintw commented Mar 1, 2013

Unfortunately I can't test this on my Windows machine at the moment as the builds available at http://pandas.pydata.org/pandas-build/dev/ haven't been updated since the 25th Feb. Any ideas what's going on there?

I'll try and test on my Mac in a bit and get back to you.

@jreback
Contributor

jreback commented Mar 1, 2013

these builds are updated once a day, not exactly sure when...check back later for the windows updates

@robintw
Contributor Author

robintw commented Mar 1, 2013

From the list of commits it looks like there have been changes on the 26th and 28th Feb, but no new builds since the 25th. It also says at http://pandas.pydata.org/getpandas.html: "Stable windows binaries are built on a rotating basis every hour, as long as there have been code changes on github since the previous build. You can find the latest builds here."

Is there someone I should contact to either (a) change the text on the Get Pandas page or (b) see if the automated Windows builds have stopped being built for some reason?

@wesm
Member

wesm commented Mar 1, 2013

I'll check the build box when I get home (it's sitting in my apartment)

@robintw
Contributor Author

robintw commented Mar 1, 2013

Thanks @wesm.

@jreback: I've installed master from source on my Mac and the code sample I gave originally works without errors - which is great :-) Using the original code sample, I can't get min or argmin to work, they give errors as before.

Using the workaround in #2957 I can get min to work fine (it gives a result as an integer in nanoseconds, which I have converted to minutes and checked is correct), but I can't get argmin to work, it gives an error saying:

ValueError: cannot operate on a series with out a rhs of a series/ndarray of type datetime64[ns] or a timedelta

@jreback
Contributor

jreback commented Mar 1, 2013

use idxmin instead (same idea but handles numpy bugs)
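
For context on the two methods, here is a minimal sketch of how they differ on a recent pandas release: argmin gives an integer position, idxmin gives the index label. The behaviour of argmin in the 2013 version discussed here may not match, so this only illustrates the distinction.

```python
import pandas as pd

index = pd.date_range("2009-07-01", periods=48, freq=pd.Timedelta(hours=1))
gaps = (index.to_series() - pd.Timestamp("2009-07-02 13:04")).abs()

position = gaps.argmin()  # integer position of the smallest gap
label = gaps.idxmin()     # index label (a Timestamp) of the smallest gap

# Both point at the same entry, expressed two different ways.
print(position, label, index[position] == label)
```

The label is usually the more convenient of the two, since it can be passed straight to df.loc to pull out the matching row.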

@jreback
Contributor

jreback commented Mar 1, 2013

also post your code again pls

@robintw
Contributor Author

robintw commented Mar 1, 2013

Ahh brilliant - idxmin works fine.

Which code did you want to see?

@jreback
Contributor

jreback commented Mar 1, 2013

where u use min

I think should work directly rather than u have to do a conversion

@robintw
Contributor Author

robintw commented Mar 1, 2013

import dateutil.parser
import numpy as np
import pandas as pd
from pandas import DataFrame

n = 1000
ind = pd.date_range(start='2009-07-01', periods=n, freq='H')
df = DataFrame(data=np.random.randn(n), index=ind)
df['time'] = df.index
image_time = dateutil.parser.parse('2009-07-02 13:04')
res = np.abs(df.time - image_time)
res.min()

Gives an error saying: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe', and the same error is given by a call to res.idxmin().

Replacing the res = ... line with res = pd.Series(np.abs(df.time -image_time)).astype('timedelta64[ns]') makes both res.min() and res.idxmin() work.

(Just FYI, this is with the latest master of pandas, and numpy 1.6.2.)
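
Earlier in the thread min() was reported as returning an integer number of nanoseconds that then had to be converted to minutes. A small sketch of that conversion, using a made-up value of four minutes, both by hand and through pd.Timedelta:

```python
import pandas as pd

ns = 4 * 60 * 10**9  # hypothetical result: four minutes expressed in nanoseconds

# Manual conversion: 60 seconds per minute, 10**9 nanoseconds per second.
minutes_manual = ns / (60 * 10**9)

# The same conversion via pandas, which avoids getting the unit factor wrong.
minutes_via_pandas = pd.Timedelta(ns, unit="ns").total_seconds() / 60

print(minutes_manual, minutes_via_pandas)
```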

@jreback
Contributor

jreback commented Mar 1, 2013

ahh....that is correct, you need my workaround for that....fyi min should work on a timedelta series in any event
(its the abs that's causing the issue)....

thanks for the debug...

let me know if anything else not working, or updated docs (prob available later today)...

@robintw
Contributor Author

robintw commented Mar 1, 2013

Thank you all for the help - as soon as I can get hold of the Windows builds all my problems will have been solved (well...all of my Pandas-related problems anyway).

Thank you very much to everyone who contributed for helping with the debugging and investigation.

Is it worth suggesting a closest_time method for a DatetimeIndex, to do what I've been doing here, but wrapped up nicely in a method, mirroring the asof method? If so, should I raise a separate Feature Request issue about that?

Once I've checked this all works on Windows, I will close this issue.
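
To make the proposal concrete, here is one possible shape for such a helper, sketched against a recent pandas release. The function name `closest_time` comes from the suggestion above and is not an existing pandas method; the sketch relies on Index.get_indexer with method="nearest", which assumes the index is sorted and has no duplicate entries.

```python
import pandas as pd


def closest_time(index, when):
    """Hypothetical helper: entry of a sorted, unique DatetimeIndex nearest to `when`."""
    target = pd.DatetimeIndex([pd.Timestamp(when)])
    # get_indexer returns integer positions; method="nearest" picks the closest entry.
    position = index.get_indexer(target, method="nearest")[0]
    return index[position]


index = pd.date_range("2009-07-01", periods=200, freq=pd.Timedelta(hours=1))

print(closest_time(index, "2009-07-02 13:04"))
print(closest_time(index, "2009-06-01"))  # before the first entry
```

Unlike asof, this returns the nearest entry on either side of the requested time, and it clamps to the first or last entry when the target falls outside the index.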

@jreback
Contributor

jreback commented Mar 1, 2013

yes...pls post a new issue (with your code as revised)

@robintw
Contributor Author

robintw commented Mar 2, 2013

I'm trying to install pandas from source on Windows to test this, as the windows builds online don't seem to have been updated yet, but am running into a lot of errors like:

build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5cf0): undefined reference to `_imp___Py_NoneStruct'                          
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5d60): undefined reference to `_imp__PyDict_Size'                             
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5d91): undefined reference to `_imp__PyDict_GetItem'                          
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5db7): undefined reference to `_imp__PyDict_GetItem'                          
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5f98): undefined reference to `_imp__PyObject_RichCompare'                    
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5fe2): undefined reference to `_imp___Py_TrueStruct'                          
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5ff0): undefined reference to `_imp___Py_ZeroStruct'                          
build\temp.win-amd64-2.7\Release\pandas\index.o:index.c:(.text+0x5ffe): undefined reference to 

Do you have any idea what might be causing this problem? It looks like it isn't able to link into the basic Python stuff properly, but I'm not sure why.

Any ideas?

@jreback
Contributor

jreback commented Mar 2, 2013

Building on windows x64 is quite difficult to set up.

Here's some links...maybe someone has a better guide
I got mine working using the 2010 C++ stuff (so it uses the 2010 redistributable)

http://wiki.cython.org/64BitCythonExtensionsOnWindows
build page - http://mattptr.net/2010/07/28/building-python-extensions-in-a-modern-windows-environment/
create vcvarsamd64.bat and put in C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC
set VS90COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\Common7\Tools
Microsoft Visual Studio Express 2008 - http://www.microsoft.com/en-us/download/details.aspx?id=6506
Microsoft Visual Studio C++ 2008 Redistributable - http://www.microsoft.com/en-us/download/confirmation.aspx?id=29

Microsoft .Net 4.0 - http://www.microsoft.com/en-us/download/confirmation.aspx?id=17851
Microsoft Windows 7 SDK - http://www.microsoft.com/en-us/download/details.aspx?id=8279

@jreback
Contributor

jreback commented Mar 3, 2013

FYI looks like windows builds are updated

@robintw
Contributor Author

robintw commented Mar 5, 2013

Thanks :-)

I've installed the Windows builds and everything seems to be working fine - it seems to have fixed all of the problems that I've reported, and upgrading to NumPy 1.7 seems to have solved some of the NumPy issues.

Thanks for all your help everyone - it's wonderful to find a great open-source community like this.

@robintw robintw closed this as completed Mar 5, 2013
@jreback
Contributor

jreback commented Mar 5, 2013

np....also good to have users who can troubleshoot :)

@robintw
Contributor Author

robintw commented Mar 6, 2013

Just one more quick question: A Python library I have released now depends on this bugfix in Pandas. Do you have any idea when this fix will appear in a formal release? I found the development roadmap, but it didn't have any approximate dates for the next formal release - any ideas?

@jreback
Contributor

jreback commented Mar 6, 2013

should be this month sometime
