[MRG] Faster manhattan_distances() for sparse matrices #15049


Merged
merged 8 commits into scikit-learn:master on Oct 5, 2019

Conversation

ptocca
Contributor

@ptocca ptocca commented Sep 21, 2019

Reference Issues/PRs

See also PR #14986

What does this implement/fix? Explain your changes.

Provides a faster implementation of manhattan_distances() for sparse matrices. Originally discussed in PR #14986 (which also targeted the dense case).
The Cython implementation in pairwise_fast.pyx iterates over only the non-zero elements in each row.
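For illustration only, here is a minimal pure-Python sketch of that per-row merge (the function name is made up; this is not the Cython code in this PR). It assumes each CSR row has sorted, duplicate-free indices, which is why sort_indices()/sum_duplicates() come up in the review below:

import numpy as np
from scipy.sparse import csr_matrix

def sparse_row_manhattan(X, Y, px, py):
    # L1 distance between row px of X and row py of Y (both CSR),
    # visiting only the stored non-zero entries of each row.
    i, i_end = X.indptr[px], X.indptr[px + 1]
    j, j_end = Y.indptr[py], Y.indptr[py + 1]
    d = 0.0
    while i < i_end and j < j_end:
        if X.indices[i] == Y.indices[j]:
            d += abs(X.data[i] - Y.data[j])
            i += 1
            j += 1
        elif X.indices[i] < Y.indices[j]:
            d += abs(X.data[i])
            i += 1
        else:
            d += abs(Y.data[j])
            j += 1
    # whatever remains in either row contributes its absolute value
    d += np.abs(X.data[i:i_end]).sum() + np.abs(Y.data[j:j_end]).sum()
    return d

X = csr_matrix(np.array([[0., 1., 0., 3.]]))
Y = csr_matrix(np.array([[2., 0., 0., 1.]]))
print(sparse_row_manhattan(X, Y, 0, 0))   # 2 + 1 + 2 = 5.0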

@jnothman
Member

Why is this WIP? Is there work you intend to do before it is fully considered for merge?

Member

@jnothman jnothman left a comment

Otherwise LGTM

cdef int n = D.shape[1]

# We scan the matrices row by row.
# Given row px in X and row py in Y, we find the positions (i and j respectively), in .indices where the indices
Member

Try to keep this under 80 chars per line.

# Below the avoidance of inplace operators is intentional.
# When prange is used, the inplace operator has a special meaning, i.e. it signals a "reduction"

for px in prange(m,nogil=True):
Member

space after comma, please

@@ -765,10 +765,12 @@ def manhattan_distances(X, Y=None, sum_over_features=True):

X = csr_matrix(X, copy=False)
Y = csr_matrix(Y, copy=False)
X.sort_indices()
Member

Formally you might require sum_duplicates too? Perhaps we should make a copy if the matrix indices are unsorted, rather than modifying the input in place without permission.

Member

While I agree that in-place modification of the input is bad, I can't think of a case where it would be bad to sort_indices and sum_duplicates for a CSR array. scipy is somewhat ambiguous about doing that at initialization: https://github.com/scipy/scipy/blob/26b7d3f40905d85845a3af75a67c318a8d441ed1/scipy/sparse/compressed.py#L198

Maybe adding a note to the docstring about it could be enough? Of course, if the arrays are read-only, a copy still needs to be triggered.
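As a rough sketch of the copy-then-canonicalize option being discussed (the helper name is hypothetical; this is not the code merged in this PR):

from scipy.sparse import csr_matrix

def _as_canonical_csr(A):
    # Return a CSR matrix with sorted, duplicate-free indices, copying
    # first so the caller's matrix is never modified in place.
    A = csr_matrix(A)
    if not A.has_canonical_format:
        A = A.copy()
        A.sum_duplicates()   # in scipy this also sorts the indices
    return A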

Member

So let's sum duplicates here and document the in-place operation.

@rth
Member

rth commented Sep 24, 2019

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@ptocca ptocca changed the title [WIP] Faster manhattan_distances() for sparse matrices [MRG] Faster manhattan_distances() for sparse matrices Sep 27, 2019
@jnothman
Member

I like this, but you need to resolve merge conflicts with master.

…hattan

# Conflicts:
#	doc/whats_new/v0.22.rst
@thomasjpfan
Member

I am a little concerned with how this will interact with NearestNeighbors and n_jobs. Will this lead to another over-subscription problem?

@rth
Member

rth commented Oct 1, 2019

I am a little concerned with how this will interact with NearestNeighbors and n_jobs. Will this lead to another over-subscription problem?

Well, it's not too different from Euclidean distances with BLAS in that respect. Anything that applies there (e.g. pairwise_distances being slower with n_jobs>1) will probably also apply here.

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

@jeremiedbb
Member

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

I feel like we need to think about parallelism in pairwise distances. Adding the prange now is good for internal use of manhattan_distances but will probably be detrimental for pairwise distances (unless n_jobs=1).

pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to only use joblib parallelism for metrics we know are sequential.

@thomasjpfan
Member

Just noticed https://github.com/joblib/joblib/blob/master/CHANGES.rst for joblib 0.14.0, which includes joblib/joblib#940. This may be a non-issue now.

@jeremiedbb
Member

It is because pairwise_distances uses the threading backend, which is not covered by joblib/joblib#940.

@thomasjpfan
Member

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to only use joblib parallelism for metrics we know are sequential.

I agree with keeping this PR sequential for now.

@ptocca
Contributor Author

ptocca commented Oct 1, 2019

My initial motivation for this PR was the intense frustration I felt when computing a Laplacian kernel matrix on a 24-core machine and seeing that only one core was used (the Laplacian kernel is computed using the Manhattan distance). When I then looked at the code, I found that in the sparse case the implementation could be improved.
Multicore machines are the norm today, so my view is that we should try to make software that distributes computation across the available resources.
If one wants to limit the number of threads, there are ways to do that (OMP_NUM_THREADS or, in this case, setting num_threads in the prange call).
By the way, the Gaussian kernel computation appears to be multi-threaded. So why should the Laplacian be any different?
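To make those thread-limiting options concrete, a small sketch follows. Note that threadpoolctl is not mentioned in this thread; it is brought in here only as one known way to cap OpenMP thread pools such as the one used by prange, and the matrix sizes are arbitrary:

import os
os.environ.setdefault("OMP_NUM_THREADS", "4")  # only effective if set before OpenMP initializes

from scipy import sparse
from threadpoolctl import threadpool_limits
from sklearn.metrics.pairwise import manhattan_distances

X = sparse.random(2000, 3000, density=0.01, format="csr", random_state=0)
Y = sparse.random(2000, 3000, density=0.01, format="csr", random_state=1)

# Cap the OpenMP thread pool (used by the prange loop) for just this call.
with threadpool_limits(limits=1, user_api="openmp"):
    D = manhattan_distances(X, Y)
print(D.shape)  # (2000, 2000)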

@thomasjpfan
Member

@ptocca What do you think of the following?

    cdef np.npy_intp px, py, i, j, ix, iy, X_indptr_end, Y_indptr_end
    cdef double d = 0.0

    cdef int m = D.shape[0]
    cdef int n = D.shape[1]
    for px in range(m):
        X_indptr_end = X_indptr[px + 1]
        for py in range(n):
            d = 0.0
            Y_indptr_end = Y_indptr[py + 1]
            i = X_indptr[px]
            j = Y_indptr[py]

            while i < X_indptr_end and j < Y_indptr_end:
                ix = X_indices[i]
                iy = Y_indices[j]

                if ix == iy:
                    d += fabs(X_data[i] - Y_data[j])
                    i += 1
                    j += 1
                elif ix < iy:
                    d += fabs(X_data[i])
                    i += 1
                else:
                    d += fabs(Y_data[j])
                    j += 1

            if i == X_indptr_end:
                while j < Y_indptr_end:
                    d += fabs(Y_data[j])
                    j += 1
            else:
                while i < X_indptr_end:
                    d += fabs(X_data[i])
                    i += 1

            D[px, py] = d
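For reference, this is the classic two-pointer merge over two sorted index lists: each (px, py) pair costs on the order of nnz(X[px]) + nnz(Y[py]) operations rather than n_features.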

@rth
Member

rth commented Oct 2, 2019

Multicore machines are the norm today, so my view is that we should try to make software that distributes computation across the available resources.

Indeed. However, the issue starts when multiple levels of parallelism are used. For instance, pairwise_distances has an n_jobs parameter. On a machine with many CPUs, if one uses that parameter to run one job per CPU and each job starts N_CPU threads, one ends up with N_CPU**2 threads; with N_CPU large enough, that can freeze a machine due to CPU oversubscription.

Some of this is addressed in joblib, but not for the threading backend, as mentioned by @jeremiedbb above.

Overall, I think we should handle this in pairwise_distances in any case, since it's currently an issue for Euclidean distances as well.

+1 to keep the prange, but could you please benchmark this implementation with OMP_NUM_THREADS set to 1, 2, 4, 8, 16 (if available) on a reasonably sized dataset, to make sure that the scaling with the number of threads is reasonably good? Thanks!

Member

@thomasjpfan thomasjpfan left a comment

Okay, let's keep the prange and deal with the joblib threading issue at the pairwise_distances level.

Comments based on #15049 (comment)

j = Y_indptr[py]
d = 0.0
while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
if i < X_indptr[px + 1]:
Member

This condition can be removed, because it is always true in this loop

Contributor Author

Oops, you're absolutely right!

while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
if i < X_indptr[px + 1]:
ix = X_indices[i]
if j < Y_indptr[py + 1]:
Member

This condition can be removed because it is always true in this loop


if i == X_indptr[px + 1]:
while j < Y_indptr[py + 1]:
iy = Y_indices[j]
Member

iy = Y_indices[j] is unneeded

j = j + 1
else:
while i < X_indptr[px + 1]:
ix = X_indices[i]
Member

ix = X_indices[i] is unneeded.

i = X_indptr[px]
j = Y_indptr[py]
d = 0.0
while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
Member

Can we define X_indptr[px + 1] before the range(n) loop and Y_indptr[py + 1] before the while loop and then reference them everywhere?

@ptocca
Contributor Author

ptocca commented Oct 3, 2019

@rth: I benchmarked the new implementation with the code that you can see in this gist. I ran it on my laptop and on a node of an HPC system.
My laptop (L) is an Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz, with 4 cores.
The HPC node (H) is an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, with 32 cores (but I only had 12 allocated to me).
The benchmark consists of computing manhattan_distances() 10 times for two 2000x3000 random sparse matrices with 1% density.

The old implementation ran in 34s on my laptop and in 44s on the HPC node.

#cores | User (L) [s] | Wall (L) [s] | User (H) [s] | Wall (H) [s]
------ | ------------ | ------------ | ------------ | ------------
1      | 11.8         | 11.90        | 14.8         | 14.80
2      | 12.0         | 6.04         | 14.9         | 7.51
3      | 14.4         | 5.16         | 14.9         | 5.06
4      | 16.3         | 4.25         | 15.0         | 3.83
6      | NaN          | NaN          | 15.0         | 2.60
8      | NaN          | NaN          | 15.1         | 1.97
10     | NaN          | NaN          | 15.1         | 1.61
12     | NaN          | NaN          | 15.3         | 1.38

Somewhat surprisingly, my puny laptop is faster than the Xeon in the single-threaded case. It does have a faster clock, but I thought that, for starters, the memory subsystem would be much slower.
Where the HPC node really shines is in multi-threaded computation.
At least up to 12 cores, there is almost no multi-threading overhead (compare the "user" totals) and the speed-up is almost linear.
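The gist itself is not reproduced here; a comparable timing harness might look roughly like this (matrix sizes and density follow the description above, while the exact benchmark code is an assumption):

import time
from scipy import sparse
from sklearn.metrics.pairwise import manhattan_distances

X = sparse.random(2000, 3000, density=0.01, format="csr", random_state=0)
Y = sparse.random(2000, 3000, density=0.01, format="csr", random_state=1)

wall0, cpu0 = time.perf_counter(), time.process_time()
for _ in range(10):
    manhattan_distances(X, Y)
wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0
print(f"wall: {wall:.2f}s   cpu (user+sys): {cpu:.2f}s")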

@rth
Member

rth commented Oct 3, 2019

Thanks for doing the benchmarks @ptocca. I can confirm your conclusions after re-running them:

  • for a (2000, 3000) dataset with a density of 0.01, this implementation is >2x faster than master on my desktop. The speed-up when going from 1 to 10 threads is 8.7, which is great, particularly on such a small dataset.
  • for a (2000, 10000) dataset with a density of 0.0001, this implementation is ~23x faster than master, still with good scaling with the number of threads.

Member

@rth rth left a comment

Very nice work @ptocca!

I wonder if applying a similar approach to euclidean distances would also be faster.

@rth
Member

rth commented Oct 3, 2019

I wonder if applying a similar approach to euclidean distances would also be faster.

Seems unlikely, as the current euclidean_distances is up to 10-30x faster than the manhattan_distances in this PR. It's a different metric, but it still sounds difficult to do better in Cython. Although I really wish the sparse dot product (used in euclidean_distances) were multi-threaded in scipy.

Member

@thomasjpfan thomasjpfan left a comment

Small nit

Comment on lines 74 to 78
for px in prange(m, nogil=True):
for py in range(n):
i = X_indptr[px]
j = Y_indptr[py]
d = 0.0
Member

What do you think of doing:

for px in prange(m, nogil=True):
    X_indptr_end = X_indptr[px + 1]
    for py in range(n):
        Y_indptr_end = Y_indptr[py + 1]

And then using X_indptr_end and Y_indptr_end everywhere?

@thomasjpfan thomasjpfan merged commit 24a50e5 into scikit-learn:master Oct 5, 2019
@thomasjpfan
Member

Thank you @ptocca!

@ptocca
Contributor Author

ptocca commented Oct 5, 2019

@thomasjpfan It was my privilege! Thanks everyone for your careful checking and your constructive suggestions!
