Replace np.nanstd with Faster rolling_nanstd #273
Comments
@mexxexx FYI
Looks interesting! It looks like the computation is done using the regular variance computation. Maybe one could have a look at the method described here, which might be pretty fast if implemented using numba, since it would be one giant for loop.
Yeah, the catastrophic cancellation concerns me. On the flip side, the article that you pointed to uses Welford's method, which I now have experience with and can probably parallelize with Numba. The only thing that I haven't thought through is how to handle the NaN values. At least, I can verify the results. Do you know how this works in
Do you have a simple test case that might cause catastrophic cancellation?
Btw, how were you able to identify that it was using variance? I couldn't actually decipher the formulation since it was using convolutions.
I have this roughly:
This works when there are no NaN values.
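The snippet itself is elided above, but a minimal sketch of such a no-NaN rolling standard deviation via rolling sums (illustrative names, not the original code from this comment) might look like:

```python
import numpy as np

def rolling_std_conv(x, w):
    # Sketch of a rolling std via the E[x^2] - (E[x])^2 identity.
    # Convolving with a length-w window of ones yields rolling sums.
    kernel = np.ones(w)
    s1 = np.convolve(x, kernel, mode="valid")      # rolling sum of x
    s2 = np.convolve(x * x, kernel, mode="valid")  # rolling sum of x**2
    mean = s1 / w
    # Variance as mean of squares minus square of mean; this form is
    # exactly where catastrophic cancellation can creep in.
    var = s2 / w - mean * mean
    return np.sqrt(np.maximum(var, 0.0))
```

This breaks down as soon as the input contains NaN values, and the one-pass identity is the numerically fragile part.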
Pandas' rolling variance implementation may be useful here as a reference |
Some experience with convolutions 😄 Basically, convolving with a sequence of ones of the window length gives you rolling sums, so the rolling mean and the rolling mean of squares (and from those, the variance) fall out directly.
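As a toy illustration of that rolling-sum trick (the values here are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = 3
# A length-w window of all ones turns convolution into rolling sums.
rolling_sum = np.convolve(x, np.ones(w), mode="valid")
print(rolling_sum)  # [ 6.  9. 12.]
```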
According to Wikipedia, one could take the following test case:
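The test case itself is elided above; in the spirit of the Wikipedia article "Algorithms for calculating variance", a small spread sitting on top of a huge offset exposes the problem (the specific numbers here are illustrative):

```python
import numpy as np

# Small true variance (22.5) on top of a huge offset.
x = 1e9 + np.array([4.0, 7.0, 13.0, 16.0])

# Naive one-pass identity: E[x^2] - (E[x])^2.
naive_var = np.mean(x * x) - np.mean(x) ** 2

# Numerically stable two-pass computation.
stable_var = np.var(x)  # 22.5

# naive_var is wildly off: the two ~1e18 terms cancel catastrophically,
# while the two-pass version recovers the true value.
```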
I don't know Welford's algorithm too well, but my idea would go like this: Treat a
Btw, thanks for the info above! Okay, I think I understand the math and should be able to make this work. I'm deriving it from scratch right now... |
@mexxexx I've been working on this and I can't seem to get the math to work nicely. It appears that the rolling window only works for me if the number of data points used to compute the mean and variance in the last window is the same as in the current window. In other words, if your previous window contained one or more NaN values, the rolling update no longer holds.
Okay, I thought about it more and realized that I don't actually have to make the math work out nicely. Instead, I just need to detect when and where the NaN values occur and recompute those windows directly.
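A sketch of that detect-and-recompute idea on top of the sliding-window Welford update (illustrative, not the actual patch):

```python
import numpy as np

def rolling_nanstd_welford(x, w):
    # Rolling population std: O(1) Welford-style updates where possible,
    # falling back to a from-scratch window whenever a NaN is involved.
    n = len(x) - w + 1
    out = np.empty(n)
    mean = np.mean(x[:w])                 # NaN if the first window has a NaN
    m2 = np.sum((x[:w] - mean) ** 2)      # sum of squared deviations
    out[0] = np.nanstd(x[:w]) if np.isnan(mean) else np.sqrt(m2 / w)
    for i in range(1, n):
        x_old, x_new = x[i - 1], x[i + w - 1]
        if np.isnan(x_old) or np.isnan(x_new) or np.isnan(mean):
            # A NaN entered, left, or sits inside the window: recompute.
            window = x[i : i + w]
            out[i] = np.nanstd(window)
            mean = np.mean(window)
            m2 = np.sum((window - mean) ** 2)
        else:
            # Standard sliding-window Welford update.
            new_mean = mean + (x_new - x_old) / w
            m2 += (x_new - x_old) * (x_new - new_mean + x_old - mean)
            mean = new_mean
            out[i] = np.sqrt(max(m2, 0.0) / w)
    return out
```

The recompute branch keeps the running mean as NaN while a NaN is still inside the window, so the fallback naturally persists until the window is clean again.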
Additionally, there's an opportunity to parallelize this by splitting up the array into chunks.
I can also confirm that while the original FFT nanstd suffers from catastrophic cancellation, the Welford method is safe! |
Great to hear! Do you have any timings to compare it to the original rolling approach? |
Yes! For a time series with
So, Welford is faster than FFT and around 10x faster than np.nanstd.
For the longer time series, Welford required ~4.75GB of memory while FFT required ~8.42GB. In case you were wondering, the timing was about the same when I inserted a bunch of NaN values.
You're absolutely right and your point is well taken. We can certainly parallelize this by splitting up the time series into chunks/sections equal to the number of threads and then we'd be even faster on longer time series. |
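The chunking can be sketched like this (plain threads over a stand-in per-window kernel, purely for illustration; the real implementation would presumably use Numba threads as discussed):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _chunk_rolling_nanstd(x, w):
    # Stand-in per-chunk kernel; imagine the Welford version here.
    return np.array([np.nanstd(x[i : i + w]) for i in range(len(x) - w + 1)])

def parallel_rolling_nanstd(x, w, n_chunks=4):
    # Split the n - w + 1 output positions into n_chunks sections and
    # extend each input slice by w - 1 points, so every window lies fully
    # inside exactly one chunk; then simply concatenate the results.
    n_out = len(x) - w + 1
    bounds = np.linspace(0, n_out, n_chunks + 1).astype(int)
    slices = [x[lo : hi + w - 1]
              for lo, hi in zip(bounds[:-1], bounds[1:]) if hi > lo]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda c: _chunk_rolling_nanstd(c, w), slices))
    return np.concatenate(parts)
```

The w - 1 overlap means a little duplicated input per chunk, but it keeps each chunk's output independent of the others.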
I'm impressed by the timings! Great that we have something fast and stable implemented, because I remember this variance computation giving me headaches on large time series 😄
Yeah, I was surprised too. Mind you, the timing does not include the overhead of compiling the function, which is about 0.1 seconds. So, for small (less than 1 million) data sets, it may be worth using np.nanstd instead.
Computing a standard deviation with np.nanstd is both computationally and memory intensive for large arrays. There is a faster implementation using convolutions. This implementation is around 5x faster than np.nanstd and uses about 5-10x less memory.