Problem with pandas.Series.std() #2888

Closed
yashoteja opened this issue Feb 17, 2013 · 10 comments · Fixed by #2889

Comments

@yashoteja

Hi,

I ran the following commands on a "pandas.Series" object named "S":

S.std()
39192.660667875185
S.values.std()
17570.025893232512
numpy.std(S)
39192.500822771792
numpy.std(S.values)
17570.025893232512

The correct standard deviation is 17570.02589.
I should have gotten the same value in all 4 cases.

This seems to be a bug in pandas.Series.
The pickle file for the "S" object is here:
https://gist.github.com/yashoteja/4971970

Thank you,
Yashoteja

@yashoteja
Author

Hi,

I am new to GitHub.
If there are any issues with the format of my previous post,
please let me know so that I can fix them next time.

Thanks,
Yashoteja

@ghost

ghost commented Feb 17, 2013

I believe this is the usual 1/sqrt(N) vs. 1/sqrt(N-1) issue.
numpy and pandas apparently use different defaults for the
ddof parameter: pandas defaults to ddof=1, i.e. 1/sqrt(N-1),
while numpy defaults to ddof=0, i.e. 1/sqrt(N).

a=pd.Series(np.array(range(10)))

In [6]: a.std()
Out[6]: 3.0276503540974917

In [7]: a.values.std(ddof=1)
Out[7]: 3.0276503540974917

In [8]: a.values.std()
Out[8]: 2.8722813232690143

In [9]: a.std(ddof=0)
Out[9]: 2.8722813232690143
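
The effect of ddof can be reproduced without either library. The following is a minimal sketch, not the pandas or numpy implementation: the helper name `std` and its signature are made up for illustration, and it only assumes the textbook definition in which the sum of squared deviations is divided by N - ddof.

```python
import math

def std(values, ddof=0):
    """Standard deviation with a divisor of N - ddof (illustrative helper)."""
    values = list(values)
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))

data = range(10)
sample_std = std(data, ddof=1)      # divisor N-1, about 3.0277 (the pandas default)
population_std = std(data, ddof=0)  # divisor N, about 2.8723 (the numpy default)
print(sample_std, population_std)
```

For large N the two divisors give nearly identical results, so ddof alone cannot explain a gap as large as 39192 vs. 17570.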

@stephenwlin
Contributor

hey, actually I loaded his dataset. it's apparently an int64 overflow and/or roundoff issue somewhere, because it doesn't happen when you cast his Series to float64 (it's a big Series with a lot of large int64 values), so this is legitimately a bug.

In [5]: S.astype('float64').std()
Out[5]: 17570.097551905212

In [6]: S.astype('int64').std()
Out[6]: 39192.660667875185

still figuring out exactly where the overflow/roundoff is happening.

@stephenwlin
Contributor

got it, in "nanops.py" with some print statements added:

    X = _ensure_numeric(values.sum(axis))
    XX = _ensure_numeric((values ** 2).sum(axis))
    print type(X)
    print X
    print X ** 2

yields

<type 'numpy.int64'>
4063418664
-1935372834766006720

this would work if X were a builtin int, because builtin ints are promoted to long automatically when they overflow, but numpy int64 scalars wrap around instead...
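
The wraparound in the printed X ** 2 can be checked with plain Python integers. This is only a simulation under the assumption that int64 arithmetic wraps modulo 2**64 (two's complement); `wrap_int64` is a hypothetical helper, not a numpy function.

```python
def wrap_int64(n):
    """Map an arbitrary Python int to the value a signed 64-bit integer would hold."""
    return (n + 2**63) % 2**64 - 2**63

X = 4063418664            # the sum printed in the debug output above

exact = X ** 2            # Python ints have arbitrary precision, so this is exact
wrapped = wrap_int64(exact)

print(exact)              # 16511371238943544896, larger than 2**63 - 1
print(wrapped)            # -1935372834766006720, the value seen in nanops.py
```

The exact square exceeds the int64 maximum of 2**63 - 1, so the stored result is off by exactly 2**64.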

i can do a patch, but I'm not sure whether the correct solution is to cast values to float64 up front, before doing anything, or to cast the int64 scalar to int/long after the fact. the two would yield slightly different results. the latter would be faster (since it casts up a scalar rather than an array), but it would also be a bit of an uglier-looking hack (IMHO). also, I'm not 100% sure it would cover all cases, because I don't know the overflow rules within ndarray.__pow__ and ndarray.sum() for int64 arrays..

anyone have any thoughts?

@stephenwlin
Contributor

actually...I think the only safe thing to do is to cast everything to float64 upfront, because you can overflow within ndarray.__pow__ and ndarray.sum() too:

In [6]: np.asarray([4063418664], dtype='int64')
Out[6]: array([4063418664], dtype=int64)

In [7]: np.asarray([4063418664], dtype='int64') ** 2
Out[7]: array([-1935372834766006720], dtype=int64)

In [8]: np.asarray([2**63-1,1], dtype='int64')
Out[8]: array([9223372036854775807,                   1], dtype=int64)

In [9]: np.asarray([2**63-1,1], dtype='int64').sum()
Out[9]: -9223372036854775808
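
Why widening the scalar after the fact is too late can be sketched in plain Python. This is a simulation, not numpy code: `wrap_int64` and `int64_sum` are hypothetical helpers that assume int64 addition wraps modulo 2**64, and Python floats stand in for float64.

```python
def wrap_int64(n):
    """Map an arbitrary Python int to the value a signed 64-bit integer would hold."""
    return (n + 2**63) % 2**64 - 2**63

def int64_sum(values):
    """Sum the values while wrapping after every addition, like a 64-bit accumulator."""
    total = 0
    for v in values:
        total = wrap_int64(total + v)
    return total

values = [2**63 - 1, 1]

# Option 1: reduce in int64 first, then widen the scalar. The sum has already wrapped.
late = float(int64_sum(values))

# Option 2: widen every element to float first, then reduce. No wraparound occurs.
early = sum(float(v) for v in values)

print(late)   # negative, although every input is positive
print(early)  # positive, close to 2**63
```

Casting up front trades wraparound for ordinary floating-point rounding, which is the smaller problem for a statistic that is returned as a float anyway.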

@stephenwlin
Contributor

(or we can just avoid calling std() in nanops.py in case of integers, since they won't have NaNs to begin with...)

@wesm
Member

wesm commented Feb 17, 2013

Upcast to float is fine because stdev should always yield a floating point number.

@stephenwlin
Contributor

okay, apparently kurt and skew were already upcasting, so I just added the same code to var for consistency

@ghost

ghost commented Feb 17, 2013

Thanks, I totally missed his intention with the examples.

@yashoteja
Author

Wow, thanks a lot for such a quick response!
And I am happy to have reported my first bug :)

Thank you,
Yashoteja
