
Replace np.nanstd with Faster rolling_nanstd #273


Closed
seanlaw opened this issue Oct 26, 2020 · 16 comments

seanlaw commented Oct 26, 2020

Computing a rolling standard deviation with np.nanstd is both computationally and memory intensive for large arrays. There is a faster implementation that uses convolutions:

import numpy as np


def nanstd_1d(a, w):
    k = np.ones(w, dtype=int)
    m = ~np.isnan(a)
    a0 = np.where(m, a, 0)

    n = np.convolve(m, k, 'valid')    # rolling count of non-NaN values
    c1 = np.convolve(a0, k, 'valid')  # rolling sum
    f2 = c1 ** 2
    p2 = f2 / n ** 2
    f1 = np.convolve((a0 ** 2) * m, k, 'valid') + n * p2
    out = np.sqrt((f1 - (2 / n) * f2) / n)
    return out


def rolling_nanstd(a, m, axis=0):
    axis = axis - 1  # Account for rolling
    return np.apply_along_axis(lambda a_row, m: nanstd_1d(a_row, m), axis=axis, arr=a, m=m)

T = np.random.rand(100)
m = 50
out_std = rolling_nanstd(T, m)


T = np.random.rand(5, 1_000_000)
m = 50
out_std = rolling_nanstd(T, m, axis=T.ndim)

This implementation is around 5x faster than np.nanstd and uses about 5-10x less memory.
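For reference, a rough way to sanity check the output and timing against np.nanstd applied to rolling windows might look like the following sketch (it reuses rolling_nanstd from above and assumes NumPy >= 1.20 for sliding_window_view; the array sizes are illustrative, not the exact benchmark behind the numbers above):

import time

import numpy as np
import numpy.testing as npt
from numpy.lib.stride_tricks import sliding_window_view

T = np.random.rand(1_000_000)
T[100] = np.nan
m = 50

start = time.time()
conv_std = rolling_nanstd(T, m)
print("convolution:", time.time() - start)

start = time.time()
ref_std = np.nanstd(sliding_window_view(T, m), axis=-1)
print("np.nanstd:  ", time.time() - start)

npt.assert_almost_equal(conv_std, ref_std)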


seanlaw commented Oct 26, 2020

@mexxexx FYI

seanlaw self-assigned this Oct 26, 2020
@mihailescum

Looks interesting! It looks like the computation is done using the regular variance formula s_n^2 = (sum x_i^2 - N*mu^2) / N, which suffers from catastrophic cancellation (I think that's what it's called) if the variance is small compared to the mean. Can't say how much of a problem this is.

Maybe one could have a look at the method described here, which might be pretty fast if implemented using numba, since it would be one giant for loop.


seanlaw commented Oct 26, 2020

Yeah, the catastrophic cancellation concerns me.

On the flip side, the article that you pointed to uses Welford's method, which I now have experience with and can probably parallelize with Numba.

The only thing that I haven't thought through is how to handle the NaN values. At least, I can verify the results. Do you know how this works in np.nanstd?

Do you have a simple test case that might cause catastrophic cancellation?


seanlaw commented Oct 26, 2020

Btw, how were you able to identify that it was using variance? I couldn't actually decipher the formulation since it was using convolutions.


seanlaw commented Oct 27, 2020

I have this roughly:

import numpy as np


def _welford_variance(a, w, a_isfinite):
    all_window_variances = np.empty(a.shape[0] - w + 1)
    
    last_window_mean = np.nanmean(a[:w])
    last_window_variance = np.nanstd(a[:w])
    last_window_variance = last_window_variance ** 2
    all_window_variances[0] = last_window_variance
    
    if np.sum(a_isfinite[:w]) == 0:
        last_window_mean = 0.0
        last_window_variance = 0.0
    
    for last_idx, curr_idx in enumerate(range(w, a.shape[0])):
        curr_denom = np.sum(a_isfinite[last_idx + 1: last_idx + 1 + w])
        last_denom = np.sum(a_isfinite[last_idx: last_idx + w])
        last_mean = last_window_mean
        
        if not a_isfinite[curr_idx] and not a_isfinite[last_idx]:
            curr_mean = last_mean * last_denom / curr_denom
            #last_window_variance += ??
        elif not a_isfinite[curr_idx]:
            curr_mean = last_mean * last_denom / curr_denom - a[last_idx] / curr_denom
            #last_window_variance += ??
        elif not a_isfinite[last_idx]:
            curr_mean = last_mean * last_denom / curr_denom + a[curr_idx] / curr_denom
            #last_window_variance += ???
        else:
            curr_mean = last_mean + (a[curr_idx] - a[last_idx]) / curr_denom
            last_window_variance += (a[curr_idx] - a[last_idx]) * (a[curr_idx] - curr_mean + a[last_idx] - last_mean) / curr_denom
        
        last_window_mean = curr_mean
        all_window_variances[last_idx + 1] = last_window_variance
        
    return all_window_variances

def welford_nanstd(a, w):
    a_isfinite = np.isfinite(a)
    
    return np.sqrt(_welford_variance(a, w, a_isfinite))

if __name__ == "__main__":
    T = np.random.rand(10)
    m = 5
    out = welford_nanstd(T, m)

This works when there are no NaN values in the input data but, otherwise, I don't know how to handle the case where NaN values are present. Additionally, all of the curr_mean values should be correct. @mexxexx Any chance you can figure out what to do here?


seanlaw commented Oct 27, 2020

Pandas' rolling variance implementation may be useful here as a reference.


mihailescum commented Oct 27, 2020

Btw, how were you able to identify that it was using variance?

Some experience with convolutions 😄 Basically, convolving with a sequence of m ones, as done there, is equal to a rolling sum with window size m (write it down and you will see it immediately). So f2 is a rolling sum squared, which is the first term in the naive way to calculate the variance, and starting from there you can figure out what the other terms do. If you need further explanation I can help!
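For example (a quick illustration, not taken from the code above), convolving with m ones reproduces the rolling sums of window size m:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
m = 3
print(np.convolve(x, np.ones(m), 'valid'))  # [ 6.  9. 12.]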

Do you have a simple test case that might cause catastrophic cancellation?

According to Wikipedia, one could take the following test case: (10^9 + 4, 10^9 + 7, 10^9 + 13, 10^9 + 16). The sample variance should be 30, but the naive approach will completely fail (giving −170.66666666666666 in their case, but that could depend on the implementation, so I would just calculate the error and see how large it is).
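A tiny sketch of that failure mode (the exact garbage value will vary with the order of operations, so treat the printed numbers as illustrative):

import numpy as np

x = np.array([4., 7., 13., 16.]) + 1e9
n = x.size
naive = (np.sum(x ** 2) - n * np.mean(x) ** 2) / (n - 1)  # catastrophic cancellation
print(naive)              # far from 30, possibly even negative
print(np.var(x, ddof=1))  # NumPy's two-pass result: 30.0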

The only thing that I haven't thought through is how to handle the NaN values. At least, I can verify the results. Do you know how this works in np.nanstd?

I don't know Welford's algorithm too well, but my idea would go like this: treat a NaN value as 0 and don't count it as an observation (effectively ignoring the new datum). On Wikipedia there is an implementation in Python, which I adapted for NaNs. I didn't try it out, so it might be incomplete, but maybe you can see where I'm going.

import numpy as np

# For a new value newValue, compute the new count, new mean, the new M2.
# mean accumulates the mean of the entire dataset
# M2 aggregates the squared distance from the mean
# count aggregates the number of samples seen so far
def update(existingAggregate, newValue):
    (count, mean, M2) = existingAggregate
    if not np.isnan(newValue):
        count += 1
    else:
        newValue = 0
    delta = newValue - mean

    if count != 0:
        mean += delta / count
    else:
        mean = np.nan
    delta2 = newValue - mean
    M2 += delta * delta2
    return (count, mean, M2)

# Retrieve the mean, variance and sample variance from an aggregate
def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    if count < 2:
        return np.nan
    else:
        (mean, variance, sampleVariance) = (mean, M2 / count, M2 / (count - 1))
        return (mean, variance, sampleVariance)
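For a finite (NaN-free) stream this reduces to the standard Welford update; a minimal usage sketch, assuming the update and finalize definitions above:

agg = (0, 0.0, 0.0)
for v in [2.0, 4.0, 6.0]:
    agg = update(agg, v)
print(finalize(agg))  # (4.0, 2.666..., 4.0) -> mean, variance, sample variance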


seanlaw commented Oct 27, 2020

Btw, thanks for the info above!

Okay, I think I understand the math and should be able to make this work. I'm deriving it from scratch right now...


seanlaw commented Nov 6, 2020

@mexxexx I've been working on this and I can't seem to get the math to work nicely. It appears that the rolling window approach only works for me if the number of data points used to compute the mean and variance in the last window is the same as in the current window. In other words, if your previous window contained one or more np.nan then it is not possible to re-use the variance from the last window to compute the variance of the current window.


seanlaw commented Nov 7, 2020

Okay, I thought about it more and realized that I don't actually have to make the math work out nicely. Instead, I just need to detect when and where np.nan is present. So, this hybrid Welford method is actually faster than the FFT version:

import time

import numpy as np
import numpy.testing as npt
import stumpy
from numba import njit


@njit(fastmath={'nsz', 'arcp', 'contract', 'afn', 'reassoc'})
def _welford_variance(a, w, subseq_a_isfinite):
    all_window_variances = np.empty(a.shape[0] - w + 1)
    
    last_window_mean = np.nanmean(a[:w])
    last_window_variance = np.nanvar(a[:w])
    all_window_variances[0] = last_window_variance
        
    for last_idx, curr_idx in enumerate(range(w, a.shape[0])):
        last_mean = last_window_mean
        
        if not subseq_a_isfinite[last_idx] or not subseq_a_isfinite[last_idx+1]:
            curr_mean = np.nanmean(a[last_idx+1:last_idx+1+w])
            last_window_variance = np.nanvar(a[last_idx+1:last_idx+1+w])
        else:
            curr_mean = last_mean + (a[curr_idx] - a[last_idx]) / w
            last_window_variance += (a[curr_idx] - a[last_idx]) * (a[curr_idx] - curr_mean + a[last_idx] - last_mean) / w
        
        last_window_mean = curr_mean
        all_window_variances[last_idx + 1] = last_window_variance
        
    return all_window_variances

def welford_nanstd(a, w):
    subseq_a_isfinite = np.all(np.isfinite(stumpy.core.rolling_window(a, w)), axis=1)
    
    return np.sqrt(_welford_variance(a, w, subseq_a_isfinite))

T, m = np.random.rand(1000000), 50
# T, m = np.random.rand(20), 10
T[1] = np.nan
T[10] = np.nan
T[13:18] = np.nan
start = time.time()
welford_std = welford_nanstd(T, m)
print(time.time() - start)
start = time.time()
np_std = np.nanstd(stumpy.core.rolling_window(T, m), axis=T.ndim)
print(time.time() - start)
npt.assert_almost_equal(welford_std, np_std)

Additionally, there's an opportunity to parallelize this by splitting up the array into chunks.


seanlaw commented Nov 9, 2020

I can also confirm that while the original FFT nanstd suffers from catastrophic cancellation, the Welford method is safe!
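A quick way to spot-check this (a sketch only, reusing nanstd_1d and welford_nanstd from earlier in this thread, with np.nanstd over rolling windows as the numerically stable reference):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T = np.random.rand(1_000) + 1e9  # large offset, small variance
m = 4

ref = np.nanstd(sliding_window_view(T, m), axis=-1)
print(np.max(np.abs(nanstd_1d(T, m) - ref)))       # large error (the naive formula can even go negative under the sqrt)
print(np.max(np.abs(welford_nanstd(T, m) - ref)))  # small error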

seanlaw added a commit that referenced this issue Nov 10, 2020
@mihailescum

Great to hear! Do you have any timings to compare it to the original rolling approach?
I don't think this is a time-critical part, so parallelization would probably be overkill, but just out of interest.


seanlaw commented Nov 10, 2020

Yes! For a time series with T = np.random.rand(10_000_000)

Welford:  0.46 seconds
np.nanstd:  4.96 seconds
FFT:  1.10 seconds

So, Welford is faster than FFT and around 10x faster than np.nanstd. And for a longer time series with T = np.random.rand(100_000_000)

Welford:  5.31 seconds
np.nanstd:  N/A
FFT:  13.77 seconds

For the longer time series, Welford required ~4.75GB of memory while FFT required ~8.42GB (np.nanstd required too much memory).

In case you were wondering, the timing was about the same when I inserted a bunch of np.nan into the time series, so this is actually a really performant algorithm that also does not have catastrophic cancellation problems!


seanlaw commented Nov 10, 2020

I don't think this is a time-critical part, so parallelization would probably be overkill, but just out of interest.

You're absolutely right and your point is well taken. We can certainly parallelize this by splitting up the time series into chunks/sections equal to the number of threads and then we'd be even faster on longer time series.
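A rough sketch of that chunking idea (a hypothetical helper, not what's implemented here; each chunk's input overlaps the next by w - 1 points so the per-chunk windows line up with the global ones, and the loop is shown serially for clarity even though the per-chunk calls are independent and could be dispatched across threads):

import numpy as np

def chunked_welford_nanstd(a, w, n_chunks=4):
    out_len = a.shape[0] - w + 1
    boundaries = np.linspace(0, out_len, n_chunks + 1).astype(np.int64)
    results = []
    for i in range(n_chunks):
        start, stop = boundaries[i], boundaries[i + 1]
        if stop > start:
            # The output slice [start, stop) only needs the input slice
            # [start, stop + w - 1), i.e. a w - 1 overlap with the next chunk
            results.append(welford_nanstd(a[start : stop + w - 1], w))
    return np.concatenate(results)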

@mihailescum

I'm impressed by the timings! Great that we have something fast and stable implemented, because I remember this variance computation giving me headaches on large time series 😄


seanlaw commented Nov 11, 2020

Yeah, I was surprised too. Mind you, the timing does not include the overhead to compile the function, which is about 0.1 seconds. So, for small (fewer than 1 million data points) data sets, it may be worth using np.nanstd instead. I did try to implement parallelization and it compiled and produced the correct result, but it wasn't faster. I suspect that there was something that was blocking the pure parallelization. Numba seems to be particular about these things.
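A hypothetical size-based dispatch along those lines (the 1-million cutoff is just the rough threshold mentioned above, not a tuned constant, and sliding_window_view assumes NumPy >= 1.20):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_nanstd_dispatch(a, w, cutoff=1_000_000):
    # Hypothetical helper: skip the numba compilation overhead on small inputs
    if a.shape[0] < cutoff:
        return np.nanstd(sliding_window_view(a, w), axis=-1)
    return welford_nanstd(a, w)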
