[MRG] Faster manhattan_distances() for sparse matrices #15049
Conversation
Why is this WIP? Is there work you intend to do before it is fully considered for merge?
Otherwise LGTM
sklearn/metrics/pairwise_fast.pyx
Outdated

    cdef int n = D.shape[1]

    # We scan the matrices row by row.
    # Given row px in X and row py in Y, we find the positions (i and j respectively), in .indices where the indices
Try to keep this under 80 chars per line.
sklearn/metrics/pairwise_fast.pyx
Outdated

    # Below the avoidance of inplace operators is intentional.
    # When prange is used, the inplace operator has a special meaning, i.e. it signals a "reduction"

    for px in prange(m,nogil=True):
space after comma, please
sklearn/metrics/pairwise.py
Outdated

    @@ -765,10 +765,12 @@ def manhattan_distances(X, Y=None, sum_over_features=True):

        X = csr_matrix(X, copy=False)
        Y = csr_matrix(Y, copy=False)
        X.sort_indices()
Formally you might require sum_duplicates too? Perhaps we should make a copy if the matrix indices are unsorted, rather than modifying in place without permission.
While I agree that doing in-place modification of the input is bad, I can't think of a case where it would be bad to call sort_indices and sum_duplicates on a CSR array. scipy is somewhat ambiguous about doing that at initialization: https://github.com/scipy/scipy/blob/26b7d3f40905d85845a3af75a67c318a8d441ed1/scipy/sparse/compressed.py#L198
Maybe adding a note to the docstring about it could be enough? Of course, if the arrays are read-only, a copy still needs to be triggered.
So let's sum duplicates here and document the in-place operation.
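A rough sketch of what that could look like on the Python side (the helper name and the copy-on-non-canonical behaviour are illustrative only, not the code actually merged here):

    import numpy as np
    from scipy.sparse import csr_matrix

    def _as_canonical_csr(X):
        # Hypothetical helper: return a CSR matrix with duplicates summed and
        # indices sorted, copying rather than modifying the caller's data.
        X = csr_matrix(X, copy=False)
        if not X.has_canonical_format:
            X = X.copy()
            X.sum_duplicates()  # in scipy this also sorts the indices
        return X

    X = _as_canonical_csr(np.array([[0., 1., 2.], [3., 0., 0.]]))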
Please add an entry to the change log.
Added entry to change log
I like this, but you need to resolve merge conflicts with master.
…hattan
# Conflicts:
#   doc/whats_new/v0.22.rst
I am a little concerned with how this will interact with
Well, it's not too different from Euclidean distances with BLAS in that respect; anything that applied there applies here too. Though we could also add a single-threaded implementation for now and add the prange later.
I feel like we need to think about parallelism in pairwise distances. Adding the prange now is good for internal use of manhattan_distances but will probably be detrimental for pairwise_distances (unless n_jobs=1). pairwise_distances is a wrapper around scipy metrics (sequential) and sklearn metrics (some multi-threaded, some sequential). I think one solution could be to only use joblib parallelism for metrics we know are sequential.
Just noticed https://github.com/joblib/joblib/blob/master/CHANGES.rst for joblib 0.14.0, which includes joblib/joblib#940. This may be a non-issue now.
It is still an issue, because pairwise_distances uses the threading backend, which is not covered by joblib/joblib#940.
I agree with keeping this PR sequential for now.
My initial motivation for this PR was the intense frustration felt when computing a Laplacian kernel matrix on a 24-core machine and seeing that only one core was used (the Laplacian kernel is computed using the Manhattan distance). When I then looked at the code, I found that in the sparse case the implementation could be improved.
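For context, laplacian_kernel(X, Y) is just exp(-gamma * manhattan_distances(X, Y)), so the Manhattan distance computation dominates its cost on sparse data. A small sanity check (matrix sizes and gamma are arbitrary, for illustration only):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.metrics.pairwise import laplacian_kernel, manhattan_distances

    X = sparse_random(100, 1000, density=0.01, format="csr", random_state=0)
    gamma = 0.1
    K = laplacian_kernel(X, gamma=gamma)
    assert np.allclose(K, np.exp(-gamma * manhattan_distances(X)))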
@ptocca What do you think of the following?

    cdef np.npy_intp px, py, i, j, ix, iy, X_indptr_end, Y_indptr_end
    cdef double d = 0.0
    cdef int m = D.shape[0]
    cdef int n = D.shape[1]

    for px in range(m):
        X_indptr_end = X_indptr[px + 1]
        for py in range(n):
            d = 0.0
            Y_indptr_end = Y_indptr[py + 1]
            i = X_indptr[px]
            j = Y_indptr[py]
            # Merge-style scan over the sorted column indices of row px of X
            # and row py of Y.
            while i < X_indptr_end and j < Y_indptr_end:
                ix = X_indices[i]
                iy = Y_indices[j]
                if ix == iy:
                    d += fabs(X_data[i] - Y_data[j])
                    i += 1
                    j += 1
                elif ix < iy:
                    d += fabs(X_data[i])
                    i += 1
                else:
                    d += fabs(Y_data[j])
                    j += 1
            # One of the rows is exhausted; the remaining entries of the other
            # row contribute their absolute values.
            if i == X_indptr_end:
                while j < Y_indptr_end:
                    d += fabs(Y_data[j])
                    j += 1
            else:
                while i < X_indptr_end:
                    d += fabs(X_data[i])
                    i += 1

            D[px, py] = d
Indeed. However, the issue starts when multiple levels of parallelism are used. Some of this is addressed in joblib, but not for the threading backend, as mentioned by @jeremiedbb above. Overall, I think we should handle this at the pairwise_distances level. +1 to keep the prange, but could you please benchmark this implementation?
Okay, let's keep the prange and deal with the joblib threading issue at the pairwise_distances level.
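In the meantime, oversubscription when combining the prange with joblib-based parallelism could be worked around along these lines (a sketch only: it relies on the threadpoolctl package, and _chunk_distances is a made-up helper, not scikit-learn API):

    from joblib import Parallel, delayed
    from threadpoolctl import threadpool_limits
    from sklearn.metrics import pairwise_distances

    def _chunk_distances(X_chunk, Y):
        # Limit the OpenMP threads used by the Cython prange inside each
        # joblib worker to avoid oversubscribing the machine.
        with threadpool_limits(limits=1, user_api="openmp"):
            return pairwise_distances(X_chunk, Y, metric="manhattan")

    # distances = Parallel(n_jobs=4)(
    #     delayed(_chunk_distances)(chunk, Y) for chunk in X_chunks)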
Comments based on #15049 (comment)
sklearn/metrics/pairwise_fast.pyx
Outdated

            j = Y_indptr[py]
            d = 0.0
            while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
                if i < X_indptr[px + 1]:
This condition can be removed, because it is always true in this loop
Oops, you're absolutely right!
sklearn/metrics/pairwise_fast.pyx
Outdated

            while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
                if i < X_indptr[px + 1]:
                    ix = X_indices[i]
                if j < Y_indptr[py + 1]:
This condition can be removed because it is always true in this loop
sklearn/metrics/pairwise_fast.pyx
Outdated

            if i == X_indptr[px + 1]:
                while j < Y_indptr[py + 1]:
                    iy = Y_indices[j]
The line iy = Y_indices[j] is unneeded.
sklearn/metrics/pairwise_fast.pyx
Outdated

                    j = j + 1
            else:
                while i < X_indptr[px + 1]:
                    ix = X_indices[i]
The line ix = X_indices[i] is unneeded.
sklearn/metrics/pairwise_fast.pyx
Outdated

            i = X_indptr[px]
            j = Y_indptr[py]
            d = 0.0
            while i < X_indptr[px + 1] and j < Y_indptr[py + 1]:
Can we define X_indptr[px + 1] before the range(n) loop and Y_indptr[py + 1] before the while loop, and then reference them everywhere?
Avoided repeated computation of last index
@rth: I benchmarked the new implementation with the code that you can see in this gist. I ran it on my laptop and on a node of an HPC system. The old implementation ran in 34s on my laptop and in 44s on the HPC node.
Somewhat surprisingly, my puny laptop is faster than the Xeon in the single-threaded case. It does have a faster clock, but I thought that, for starters, the memory subsystem would be much slower.
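The gist is not reproduced here, but a benchmark along these lines (matrix sizes and density are made up for illustration) is enough to compare the two implementations:

    import time
    from scipy.sparse import random as sparse_random
    from sklearn.metrics.pairwise import manhattan_distances

    X = sparse_random(5000, 4000, density=0.01, format="csr", random_state=0)
    Y = sparse_random(3000, 4000, density=0.01, format="csr", random_state=1)

    start = time.perf_counter()
    D = manhattan_distances(X, Y)
    print(D.shape, "computed in %.1fs" % (time.perf_counter() - start))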
Thanks for doing the benchmarks @ptocca. I can confirm your conclusions after re-running them.
Very nice work @ptocca !
I wonder if applying a similar approach to euclidean distances would also be faster.
Seems unlikely, as the current euclidean_distances is up to 10-30x faster than the manhattan_distances in this PR. It is a different metric, but it still sounds difficult to do better in Cython. Although I really wish the sparse dot product (used in euclidean_distances) were multi-threaded in scipy.
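For reference, the reason euclidean_distances is hard to beat is that the squared distances reduce to a single sparse matrix product plus row norms, roughly as sketched below with plain scipy/numpy (illustrative only, not the actual scikit-learn code):

    import numpy as np
    from scipy.sparse import random as sparse_random

    X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
    Y = sparse_random(800, 5000, density=0.01, format="csr", random_state=1)

    XX = np.asarray(X.multiply(X).sum(axis=1))      # shape (1000, 1)
    YY = np.asarray(Y.multiply(Y).sum(axis=1)).T    # shape (1, 800)
    D2 = XX + YY - 2 * (X @ Y.T).toarray()          # squared distances
    np.maximum(D2, 0, out=D2)                       # clip tiny negatives
    D = np.sqrt(D2)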
Small nit
sklearn/metrics/pairwise_fast.pyx
Outdated

    for px in prange(m, nogil=True):
        for py in range(n):
            i = X_indptr[px]
            j = Y_indptr[py]
            d = 0.0
What do you think of doing:

    for px in prange(m, nogil=True):
        X_indptr_end = X_indptr[px + 1]
        for py in range(n):
            Y_indptr_end = Y_indptr[py + 1]

and then using X_indptr_end and Y_indptr_end everywhere?
Thank you @ptocca!
@thomasjpfan It was my privilege! Thanks everyone for your careful checking and your constructive suggestions!
Reference Issues/PRs
See also PR #14986
What does this implement/fix? Explain your changes.
Provides a faster implementation of manhattan_distances() for sparse matrices. Originally discussed in PR #14986 (which also targeted the dense case).
The Cython implementation in pairwise_fast.pyx iterates only over the non-zero elements of the rows.
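For reference, a minimal usage example of the code path this PR speeds up (sparse CSR inputs are routed to the Cython routine):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.metrics.pairwise import manhattan_distances

    X = csr_matrix(np.array([[1., 0., 2.], [0., 3., 0.]]))
    Y = csr_matrix(np.array([[0., 1., 1.], [4., 0., 0.]]))

    # Dense (n_samples_X, n_samples_Y) matrix of L1 distances.
    print(manhattan_distances(X, Y))
    # [[3. 5.]
    #  [3. 7.]]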