[MRG] Faster manhattan_distances() for sparse matrices #15049


Merged
merged 8 commits into scikit-learn:master on Oct 5, 2019

Conversation

ptocca
Contributor

@ptocca ptocca commented Sep 21, 2019

Reference Issues/PRs

See also PR #14986

What does this implement/fix? Explain your changes.

Provides a faster implementation of manhattan_distances() for sparse matrices. Originally discussed in PR #14986 (which also targeted the dense case).
The Cython implementation in pairwise_fast.pyx iterates over only the non-zero elements in each row.
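For illustration only, here is a minimal pure-Python sketch of that per-row merge (the function name is made up; this is not the Cython code in this PR). It assumes each CSR row has sorted, duplicate-free indices, which is why sort_indices()/sum_duplicates() come up in the review below:

import numpy as np
from scipy.sparse import csr_matrix

def sparse_row_manhattan(X, Y, px, py):
    # L1 distance between row px of X and row py of Y (both CSR),
    # visiting only the stored non-zero entries of each row.
    i, i_end = X.indptr[px], X.indptr[px + 1]
    j, j_end = Y.indptr[py], Y.indptr[py + 1]
    d = 0.0
    while i < i_end and j < j_end:
        if X.indices[i] == Y.indices[j]:
            d += abs(X.data[i] - Y.data[j])
            i += 1
            j += 1
        elif X.indices[i] < Y.indices[j]:
            d += abs(X.data[i])
            i += 1
        else:
            d += abs(Y.data[j])
            j += 1
    # whatever remains in either row contributes its absolute value
    d += np.abs(X.data[i:i_end]).sum() + np.abs(Y.data[j:j_end]).sum()
    return d

X = csr_matrix(np.array([[0., 1., 0., 3.]]))
Y = csr_matrix(np.array([[2., 0., 0., 1.]]))
print(sparse_row_manhattan(X, Y, 0, 0))   # 2 + 1 + 2 = 5.0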

@jnothman
Member

Why is this WIP? Is there work you intend to do before it is fully considered for merge?

Member

@jnothman jnothman left a comment

Otherwise LGTM

cdef int n = D.shape[1]

# We scan the matrices row by row.
# Given row px in X and row py in Y, we find the positions (i and j respectively), in .indices where the indices
Member

Try to keep this under 80 chars per line.

# Below the avoidance of inplace operators is intentional.
# When prange is used, the inplace operator has a special meaning, i.e. it signals a "reduction"

for px in prange(m,nogil=True):
Member

space after comma, please

@@ -765,10 +765,12 @@ def manhattan_distances(X, Y=None, sum_over_features=True):

X = csr_matrix(X, copy=False)
Y = csr_matrix(Y, copy=False)
X.sort_indices()
Member

Formally you might require sum_duplicates too? Perhaps we should make a copy if the matrix indices are unsorted, rather than modifying the input in place without permission.

Member

While I agree that in-place modification of the input is bad, I can't think of a case where it would be bad to sort_indices and sum_duplicates for a CSR array. scipy is somewhat ambiguous about doing that at initialization: https://github.com/scipy/scipy/blob/26b7d3f40905d85845a3af75a67c318a8d441ed1/scipy/sparse/compressed.py#L198

Maybe adding a note to the docstring about it could be enough? Of course, if the arrays are read-only, a copy still needs to be triggered.
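As a rough sketch of the copy-then-canonicalize option being discussed (the helper name is hypothetical; this is not the code merged in this PR):

from scipy.sparse import csr_matrix

def _as_canonical_csr(A):
    # Return a CSR matrix with sorted, duplicate-free indices, copying
    # first so the caller's matrix is never modified in place.
    A = csr_matrix(A)
    if not A.has_canonical_format:
        A = A.copy()
        A.sum_duplicates()   # in scipy this also sorts the indices
    return A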

Member

So let's sum duplicates here and document the in-place operation.

@rth
Member

rth commented Sep 24, 2019

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@ptocca ptocca changed the title [WIP] Faster manhattan_distances() for sparse matrices [MRG] Faster manhattan_distances() for sparse matrices Sep 27, 2019
@jnothman
Member

I like this, but you need to resolve merge conflicts with master.

…hattan

# Conflicts:
#	doc/whats_new/v0.22.rst
@thomasjpfan
Member

I am a little concerned with how this will interact with NearestNeighbors and n_jobs. Will this lead to another over-subscription problem?

@rth
Member

rth commented Oct 1, 2019

I am a little concerned with how this will interact with NearestNeighbors and n_jobs. Will this lead to another over-subscription problem?

Well, it's not too different from Euclidean distances with BLAS in that respect. Anything that applies there (e.g. pairwise_distances being slower with n_jobs>1) will probably also apply here.

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

@jeremiedbb
Member

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

I feel like we need to think about parallelism in pairwise distances. Adding the prange now is good for internal use of manhattan_distances but will probably be detrimental for pairwise distances (unless n_jobs=1).

pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to only use joblib parallelism for metrics we know are sequential.

@thomasjpfan
Member

Just noticed https://github.com/joblib/joblib/blob/master/CHANGES.rst for joblib 0.14.0, which includes joblib/joblib#940. This may be a non-issue now.

@jeremiedbb
Member

It is because pairwise_distances uses the threading backend, which is not covered by joblib/joblib#940.

@thomasjpfan
Member

Though we could also add a single thread implementation for now and add the prange later once #14196 and related discussions are concluded. I don't have a strong opinion about it.

pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to only use joblib parallelism for metrics we know are sequential.

I agree with keeping this PR sequential for now.

@ptocca
Contributor Author

ptocca commented Oct 1, 2019

My initial motivation for this PR was the intense frustration I felt when computing a Laplacian kernel matrix on a 24-core machine and seeing that only one core was used (the Laplacian kernel is computed using the Manhattan distance). When I then looked at the code, I found that in the sparse case the implementation could be improved.
Multicore machines are the norm today, so my view is that we should try to make software that distributes computation across the available resources.
If one wants to limit the number of threads, there are ways to do that (OMP_NUM_THREADS or, in this case, setting num_threads in the prange call).
By the way, the Gaussian kernel computation appears to be multi-threaded. So why should the Laplacian be any different?
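To make those thread-limiting options concrete, a small sketch follows. Note that threadpoolctl is not mentioned in this thread; it is brought in here only as one known way to cap OpenMP thread pools such as the one used by prange, and the matrix sizes are arbitrary:

import os
os.environ.setdefault("OMP_NUM_THREADS", "4")  # only effective if set before OpenMP initializes

from scipy import sparse
from threadpoolctl import threadpool_limits
from sklearn.metrics.pairwise import manhattan_distances

X = sparse.random(2000, 3000, density=0.01, format="csr", random_state=0)
Y = sparse.random(2000, 3000, density=0.01, format="csr", random_state=1)

# Cap the OpenMP thread pool (used by the prange loop) for just this call.
with threadpool_limits(limits=1, user_api="openmp"):
    D = manhattan_distances(X, Y)
print(D.shape)  # (2000, 2000)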

@thomasjpfan
Member

@ptocca What do you think of the following?

    cdef np.npy_intp px, py, i, j, ix, iy, X_indptr_end, Y_indptr_end
    cdef double d = 0.0

    cdef int m = D.shape[0]
    cdef int n = D.shape[1]
    for px in range(m):
        X_indptr_end = X_indptr[px + 1]
        for py in range(n):
            d = 0.0
            Y_indptr_end = Y_indptr[py + 1]
            i = X_indptr[px]
            j = Y_indptr[py]

            while i < X_indptr_end and j < Y_indptr_end:
                ix = X_indices[i]
                iy = Y_indices[j]

                if ix == iy:
                    d += fabs(X_data[i] - Y_data[j])
                    i += 1
                    j += 1
                elif ix < iy:
                    d += fabs(X_data[i])
                    i += 1
                else:
                    d += fabs(Y_data[j])
                    j += 1

            if i == X_indptr_end:
                while j < Y_indptr_end:
                    d += fabs(Y_data[j])
                    j += 1
            else:
                while i < X_indptr_end:
                    d += fabs(X_data[i])
                    i += 1

            D[px, py] = d
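For reference, this is the classic two-pointer merge over two sorted index lists: each (px, py) pair costs on the order of nnz(X[px]) + nnz(Y[py]) operations rather than n_features.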

@rth
Member

rth commented Oct 2, 2019

Multicore machines are the norm today, so my view is that we should try to make software that distributes computation across the available resources.

Indeed. However, the issue starts when multiple levels of parallelism are used. For instance, pairwise_distances has an n_jobs parameter. On a machine with many CPUs, if one uses that parameter to run one job per CPU and each job starts N_CPU threads, one ends up with N_CPU**2 threads; with N_CPU large enough, that can freeze a machine due to CPU oversubscription.

Some of this is addressed in joblib, but not for the threading backend, as mentioned by @jeremiedbb above.

Overall, I think we should handle this in pairwise_distances in any case, since it's currently an issue for Euclidean distances as well.

+1 to keep the prange, but could you please benchmark this implementation with OMP_NUM_THREADS set to 1, 2, 4, 8, 16 (if available) on a reasonably sized dataset, to make sure that the scaling with the number of threads is reasonably good? Thanks!

Member

@thomasjpfan thomasjpfan left a comment

Okay, let's keep the prange and deal with the joblib threading issue at the pairwise_distances level.

Comments based on #15049 (comment)

j = Y_indptr[py]
d = 0.0
while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
if i < X_indptr[px + 1]:
Member

This condition can be removed, because it is always true in this loop

Contributor Author

Oops, you're absolutely right!

while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
if i < X_indptr[px + 1]:
ix = X_indices[i]
if j < Y_indptr[py + 1]:
Member

This condition can be removed because it is always true in this loop


if i == X_indptr[px + 1]:
while j < Y_indptr[py + 1]:
iy = Y_indices[j]
Member

iy = Y_indices[j] is unneeded

j = j + 1
else:
while i < X_indptr[px + 1]:
ix = X_indices[i]
Member

ix = X_indices[i] is unneeded.

i = X_indptr[px]
j = Y_indptr[py]
d = 0.0
while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
Member

Can we define X_indptr[px + 1] before the range(n) loop and Y_indptr[py + 1] before the while loop and then reference them everywhere?

@ptocca
Contributor Author

ptocca commented Oct 3, 2019

@rth: I benchmarked the new implementation with the code that you can see in this gist. I ran it on my laptop and on a node of an HPC system.
My laptop (L) is an Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz, with 4 cores.
The HPC node (H) is an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, with 32 cores (but I only had 12 allocated to me).
The benchmark consists of computing manhattan_distances() 10 times for two 2000x3000 random sparse matrices with 1% density.

The old implementation ran in 34s on my laptop and in 44s on the HPC node.

#cores | User (L) [s] | Wall (L) [s] | User (H) [s] | Wall (H) [s]
------ | ------------ | ------------ | ------------ | ------------
1      | 11.8         | 11.90        | 14.8         | 14.80
2      | 12.0         | 6.04         | 14.9         | 7.51
3      | 14.4         | 5.16         | 14.9         | 5.06
4      | 16.3         | 4.25         | 15.0         | 3.83
6      | NaN          | NaN          | 15.0         | 2.60
8      | NaN          | NaN          | 15.1         | 1.97
10     | NaN          | NaN          | 15.1         | 1.61
12     | NaN          | NaN          | 15.3         | 1.38

Somewhat surprisingly, my puny laptop is faster than the Xeon in the single-threaded case. It does have a faster clock, but I thought that, for starters, the memory subsystem would be much slower.
Where the HPC node really shines is in multi-threaded computation.
At least up to 12 cores, there is almost no multi-threading overhead (compare the "user" totals) and the speed-up is almost linear.
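The gist itself is not reproduced here; a comparable timing harness might look roughly like this (matrix sizes and density follow the description above, while the exact benchmark code is an assumption):

import time
from scipy import sparse
from sklearn.metrics.pairwise import manhattan_distances

X = sparse.random(2000, 3000, density=0.01, format="csr", random_state=0)
Y = sparse.random(2000, 3000, density=0.01, format="csr", random_state=1)

wall0, cpu0 = time.perf_counter(), time.process_time()
for _ in range(10):
    manhattan_distances(X, Y)
wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0
print(f"wall: {wall:.2f}s   cpu (user+sys): {cpu:.2f}s")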

@rth
Member

rth commented Oct 3, 2019

Thanks for doing the benchmarks @ptocca. I can confirm your conclusions after re-running them:

  • for a (2000, 3000) dataset with a density of 0.01, this implementation is >2x faster than master on my desktop. The speed-up when going from 1 to 10 threads is 8.7, which is great, particularly on such a small dataset.
  • for a (2000, 10000) dataset with a density of 0.0001, this implementation is ~23x faster than master, still with good scaling with the number of threads.

Member

@rth rth left a comment

Very nice work @ptocca!

I wonder if applying a similar approach to euclidean distances would also be faster.

@rth
Member

rth commented Oct 3, 2019

I wonder if applying a similar approach to euclidean distances would also be faster.

Seems unlikely, as the current euclidean_distances is up to 10-30x faster than the manhattan_distances in this PR. It's a different metric, but it still sounds difficult to do better in Cython. Although I really wish the sparse dot product (used in euclidean_distances) were multi-threaded in scipy.

Member

@thomasjpfan thomasjpfan left a comment

Small nit

Comment on lines 74 to 78
for px in prange(m, nogil=True):
for py in range(n):
i = X_indptr[px]
j = Y_indptr[py]
d = 0.0
Member

What do you think of doing:

for px in prange(m, nogil=True):
    X_indptr_end = X_indptr[px + 1]
    for py in range(n):
        Y_indptr_end = Y_indptr[py + 1]

And then using X_indptr_end and Y_indptr_end everywhere?

@thomasjpfan thomasjpfan merged commit 24a50e5 into scikit-learn:master Oct 5, 2019
@thomasjpfan
Member

Thank you @ptocca!

@ptocca
Contributor Author

ptocca commented Oct 5, 2019

@thomasjpfan It was my privilege! Thanks everyone for your careful checking and your constructive suggestions!
