Problem with pandas.Series.std() #2888

yashoteja · 2013-02-17T15:59:55Z

Hi,

I ran the following commands on a pandas.Series object named S:

S.std()
39192.660667875185
S.values.std()
17570.025893232512
numpy.std(S)
39192.500822771792
numpy.std(S.values)
17570.025893232512

The correct standard deviation is 17570.02589.
I should have gotten the same value in all 4 cases.

This seems to be a bug in pandas.Series.
The pickle file for the S object is here:
https://gist.github.com/yashoteja/4971970

Thank you,
Yashoteja

yashoteja · 2013-02-17T16:17:13Z

Hi,

I am new to GitHub as a whole.
If there are any issues with the format of my previous post,
please let me know so that I can fix them next time.

Thanks,
Yashoteja

ghost · 2013-02-17T17:51:07Z

I believe this is the usual 1/sqrt(N) vs. 1/sqrt(N-1) issue.
numpy and pandas apparently use different defaults for the
ddof parameter: pandas defaults to ddof=1 (1/sqrt(N-1)), while numpy defaults to ddof=0 (1/sqrt(N)).

a=pd.Series(np.array(range(10)))

In [6]: a.std()
Out[6]: 3.0276503540974917

In [7]: a.values.std(ddof=1)
Out[7]: 3.0276503540974917

In [8]: a.values.std()
Out[8]: 2.8722813232690143

In [9]: a.std(ddof=0)
Out[9]: 2.8722813232690143
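
To make the ddof difference concrete, here is a minimal sketch that computes the standard deviation by hand for both normalizations. The helper name `std_with_ddof` is made up for illustration and is not part of numpy or pandas.

```python
import numpy as np

def std_with_ddof(values, ddof):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    values = np.asarray(values, dtype="float64")
    n = values.shape[0]
    squared_dev = (values - values.mean()) ** 2
    return float(np.sqrt(squared_dev.sum() / (n - ddof)))

a = np.arange(10)
sample_std = std_with_ddof(a, ddof=1)      # pandas default: divide by N - 1
population_std = std_with_ddof(a, ddof=0)  # numpy default: divide by N
```

The two results should line up with the a.std() and a.values.std() outputs in the session above.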

stephenwlin · 2013-02-17T18:21:38Z

Hey, actually I loaded his dataset, and it's apparently an int64 overflow and/or roundoff issue somewhere, because it doesn't happen when you cast his Series to float64 (it's a big Series with a lot of large int64 values). So this is legitimately a bug.

In [5]: S.astype('float64').std()
Out[5]: 17570.097551905212

In [6]: S.astype('int64').std()
Out[6]: 39192.660667875185

still figuring out exactly where the overflow/roundoff is happening.

stephenwlin · 2013-02-17T18:35:55Z

got it, in "nanops.py" with some print statements added:

    X = _ensure_numeric(values.sum(axis))
    XX = _ensure_numeric((values ** 2).sum(axis))
    print type(X)
    print X
    print X ** 2

yields

<type 'numpy.int64'>
4063418664
-1935372834766006720

this would work if X were a builtin int because they overflow to long automatically, but numpy int64s don't...
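
The difference can be reproduced in isolation. The following is a rough sketch using the sum printed above; the variable names are only for illustration. It squares the same number once as a builtin int and once as an int64, and compares the two.

```python
import numpy as np

x = 4063418664  # the sum X printed above

# A builtin Python int has arbitrary precision, so the square is exact.
exact = x ** 2

# A numpy int64 is fixed at 64 bits, so the square wraps around modulo 2**64.
# A one-element array is used because integer array operations wrap silently.
wrapped = int((np.array([x], dtype=np.int64) ** 2)[0])

# The gap between the two values is exactly one wraparound of 2**64.
difference = exact - wrapped
```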

I can do a patch, but I'm not sure which fix is correct: cast values to float64 up front before doing anything, or cast the int64 scalar to a builtin int/long after the fact. The two would yield slightly different results. The second would be faster (it casts up a scalar rather than an array), but it would also be a bit more of an ugly hack (IMHO). I'm also not 100% sure it would cover all cases, because I don't know the overflow rules within ndarray.__pow__ and ndarray.sum() for int64 arrays.

anyone have any thoughts?

stephenwlin · 2013-02-17T18:43:56Z

actually...I think the only safe thing to do is to cast everything to float64 upfront, because you can overflow within ndarray.__pow__ and ndarray.sum() too:

In [6]: np.asarray([4063418664], dtype='int64')
Out[6]: array([4063418664], dtype=int64)

In [7]: np.asarray([4063418664], dtype='int64') ** 2
Out[7]: array([-1935372834766006720], dtype=int64)

In [8]: np.asarray([2**63-1,1], dtype='int64')
Out[8]: array([9223372036854775807,                   1], dtype=int64)

In [9]: np.asarray([2**63-1,1], dtype='int64').sum()
Out[9]: -9223372036854775808

stephenwlin · 2013-02-17T18:50:34Z

(or we can just avoid calling std() in nanops.py in the case of integers, since they won't have NaNs to begin with...)

wesm · 2013-02-17T19:15:04Z

Upcast to float is fine because stdev should always yield a floating point number.
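
As a rough illustration of the upcast approach (not the actual patch), the sketch below computes variance from the sum and the sum of squares, as in the nanops.py snippet above. The function name `var_sum_of_squares`, its `upcast` flag, and the test data are assumptions made up for this example.

```python
import numpy as np

def var_sum_of_squares(values, ddof=1, upcast=True):
    """Variance from sum and sum of squares (hypothetical helper)."""
    values = np.asarray(values)
    if upcast and np.issubdtype(values.dtype, np.integer):
        # Cast integers to float64 up front so no intermediate can wrap around.
        values = values.astype("float64")
    n = values.shape[0]
    # keepdims=True keeps one-element arrays, whose integer ops wrap silently.
    X = values.sum(keepdims=True)
    XX = (values ** 2).sum(keepdims=True)
    var = (XX - X ** 2 / n) / (n - ddof)
    return float(var[0])

# One million int64 values near 4000: the sum is about 4.05e9, so its square
# (about 1.64e19) exceeds the int64 maximum (about 9.22e18).
values = 4000 + (np.arange(1_000_000, dtype=np.int64) % 101)

naive = var_sum_of_squares(values, upcast=False)  # X ** 2 wraps in int64
safe = var_sum_of_squares(values, upcast=True)    # X ** 2 computed in float64
reference = np.var(values.astype("float64"), ddof=1)
```

Here `safe` should agree with `reference`, while `naive` comes out far too large because the wrapped X ** 2 is negative.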

stephenwlin · 2013-02-17T20:27:50Z

okay, apparently kurt and skew were already upcasting, so I just added the same code to var for consistency

ghost · 2013-02-17T22:07:38Z

Thanks, I totally missed his intention with the examples.

yashoteja · 2013-02-18T05:37:32Z

Wow, thanks a lot for such a quick response!
And I am happy to have reported my first bug :)

Thank you,
Yashoteja

stephenwlin mentioned this issue Feb 17, 2013

BUG: nanops.var produces incorrect results due to int64 overflow (fixes #2888) #2889

Merged

jreback closed this as completed in 6dec888 Feb 23, 2013

ghost mentioned this issue Mar 1, 2013

AttributeError: 'numpy.ndarray' object has no attribute '_get_repr' with np.abs on a DatetimeIndex #2948

Closed
