Accelerating GPU Based Machine Learning in Python

Abstract—The growth of big data applications during the last decade has led to a surge in the deployment and popularity of machine learning (ML) libraries. On the other hand, the high performance offered by GPUs makes them well suited for ML problems. To take advantage of GPU performance for ML, NVIDIA has recently developed the cuML library. cuML is the GPU counterpart of Scikit-learn and provides similar Pythonic interfaces while hiding the complexities of writing GPU compute kernels directly using CUDA. To support the execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems, the cuML library exploits the NVIDIA Collective Communications Library (NCCL) as a backend for collective communications between processes. On the other hand, MPI is a de facto standard for communication in HPC systems. Among various MPI libraries, MVAPICH2-GDR is the pioneer in optimizing GPU communication.
This paper explores various aspects and challenges of providing MPI-based communication support for GPU-accelerated cuML applications. More specifically, it proposes a Python API to take advantage of MPI-based communications for cuML applications. It also gives an in-depth analysis, characterization, and benchmarking of cuML algorithms such as K-Means, Nearest Neighbors, Random Forest, and tSVD. Moreover, it provides a comprehensive performance evaluation and profiling study of MPI-based versus NCCL-based communication for these algorithms. The evaluation results show that the proposed MPI-based communication approach achieves up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively, on up to 32 GPUs.
Index Terms—Machine Learning, cuML, MPI, GPUs

I. INTRODUCTION

The last decade has witnessed unprecedented growth in data generated from diverse sources. These sources range from scientific experiments—like the Large Hadron Collider [1] and the Sloan Digital Sky Survey [2]—to social media platforms [3] and personalized medicine [4]. A daunting challenge—common to these disparate Big Data applications—is to process vast amounts of data in a timely manner to gain vital insights and drive decision making. To aid with this quest for better understanding, there has been a resurgence of Machine Learning (ML) libraries, tools, and techniques that allow processing and extracting useful information from this data through iterative processing.

†These authors contributed equally to this work.

Some of the popular ML libraries employed for Data Analytics include Scikit-learn [5] and Apache Spark's MLlib [6]. These libraries are natively designed to support the execution of ML algorithms on CPUs. On the other hand, GPUs have emerged as a popular platform for optimizing parallel workloads because of their high throughput. This also makes them a good match for ML applications, which require high arithmetic intensity [7].

To take advantage of the performance offered by GPUs, NVIDIA has recently launched the RAPIDS AI [8] framework, which is a collection of open-source software libraries and APIs. The main goal of this effort is to enable end-to-end data science analytics pipelines entirely on GPUs. One of the main components of RAPIDS—within its data science ecosystem—is the cuML [9] library. This GPU-accelerated ML library is the GPU counterpart of Scikit-learn and provides similar Pythonic interfaces while hiding the complexities of writing compute kernels for GPUs directly using CUDA. One of the main benefits of cuML is that it supports the execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems. For this, it takes advantage of a task-based framework called Dask [10], which allows the distributed execution of Python data science applications. Dask is comprised of a central scheduler, workers (one per GPU), and a client process.

When a cuML application is running in an MNMG configuration, workers need to communicate with each other. These communications are required at two stages: 1) the training data is distributed to all workers to execute the training stage (the fit() function), 2) the output of the training stage, i.e., the model parameters, is shared with all workers involved in the inference stage (the predict() function). These communications can either be point-to-point or collective. cuML takes advantage of Dask for handling the point-to-point communications while using the NVIDIA Collective Communications Library (NCCL) to take care of collective communications.

Message Passing Interface (MPI) is considered the de facto standard for writing parallel applications on clusters. The MPI standard [9] provides efficient support for point-to-point and collective communication routines. The MVAPICH2-GDR [10] library is a pioneer MPI library that supports communication between GPU devices.
Table I: Comparison of different ML libraries
Libraries GPU Support MNMG Support Python Interface High Performance
Scikit-learn [5]
Apache Spark’s MLlib [6]
Apache Mahout [7]
RAPIDS cuML [8]
MPI [9]
Our paper
MVAPICH2-GDR has been used to accelerate and scale [11] Deep Learning (DL) frameworks including TensorFlow [12] on large clusters. In order to leverage years of research and development behind the MPI standard and its compliant libraries, this paper introduces support for MPI-based communication for the cuML library. This is done using the MVAPICH2-GDR library, which has efficient collective communication designs—including MPI_Reduce(), MPI_Bcast(), and MPI_Allreduce()—on GPU memory.

A. Motivation and Challenges

Table I compares different libraries that can be used to run ML applications. Among these libraries, Apache Mahout [7], RAPIDS cuML, and MPI support execution on GPUs. Note that NVIDIA has recently created a RAPIDS Accelerator [13] for Spark 3.0 that enables the launch of Spark applications on GPUs; however, Spark does not natively support GPU acceleration. Apache Mahout, however, has poor visualization and does not support a Python interface. It also has poor performance compared to the other libraries. The respective strengths of cuML and MPI are complementary. While cuML provides a Python interface and hides the complexities of CUDA/C++ programming from the user, MPI is well known for its high performance in parallel applications. Therefore, the question we ask is: How can we combine the ease of use provided by cuML for running ML applications with the high performance provided by MPI?

As mentioned earlier, cuML utilizes NCCL for handling collective communications. It has some APIs at the CUDA/C++ layer for MPI communications, but these APIs are not available at the Python layer and cannot be applied by end users developing Python code. At the same time, MPI libraries have many years of research and development behind them. Among these libraries, MVAPICH2-GDR, in particular, provides efficient designs of collective communications for GPUs. We use the OSU microbenchmarks suite [14] to compare the performance of the MVAPICH2-GDR library with NCCL for different collective operations in Figure 1. We run the experiments on the Comet cluster from XSEDE; a detailed description of this cluster is provided in Section VII-A. We run the experiments for Allreduce, Bcast, and Reduce for the 4 Byte to 16 KByte message range. These collectives are extensively used by cuML algorithms within this message range—this will be further discussed in Section VII. Figure 1 shows that MVAPICH2-GDR performs significantly better than NCCL. This brings up another question: How can we replace NCCL-based collective communications in cuML with MPI-based communications to take advantage of the efficient and GPU-aware collective communication designs in MVAPICH2-GDR?

The cuML library supports training/inference based on several compute-bound ML algorithms. These include: 1) Clustering (like K-Means), 2) Dimensionality reduction (like Principal Component Analysis (PCA) and Truncated Singular Value Decomposition (tSVD)), 3) Ensemble methods (like Random Forests), and 4) Linear models (like Linear Regression). Data scientists should attempt to understand cuML in order to achieve the best possible performance on these algorithms. However, cuML—being a relatively new ML library—has not been studied well by the community. With this background in mind, the question is: How can we provide performance characterization for GPU-accelerated cuML algorithms and provide guidelines for data scientists to best take advantage of them?

Each of the cuML algorithms mentioned above has a unique format for data and model features. Moreover, each cuML model has a unique set of hyperparameters that must be tuned for accuracy at every scale. In order to run cuML applications in a fast (time-to-solution) and accurate manner, a synthetic benchmarking suite as well as a hyperparameter tuning framework is required. This brings up another question: How can we provide a synthetic benchmarking suite for cuML algorithms in order to understand their scaling behavior?

B. Contributions

The key contributions of this paper are as follows:
• We provide MPI-based communication support for GPU-accelerated cuML applications in Python. More specifically, we propose a Python API to take advantage of MPI-based communications for cuML applications.
• We provide in-depth analysis and characterization of the state-of-the-art cuML applications and identify regions that can be enhanced using MPI-based communications.
• We provide synthetic benchmarking of cuML algorithms to characterize their throughput behavior. Further, we apply Dask ML's hyperparameter optimization library on the Higgs Boson dataset to ensure cuML's distributed algorithms maintain accuracy at scale.
• We provide a comprehensive performance evaluation and profiling study comparing MPI-based versus NCCL-based communication for cuML applications. The evaluation results show that by adding support for MPI-based communications, we gain up to 38%, 20%, 20%, and 26% performance improvement for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively.
Figure 1: The performance of collective operations with MPI (MVAPICH2-GDR) vs. NCCL with 8 GPUs across 2 nodes on the Comet cluster for various collective operations: Allreduce, Reduce, and Bcast. MVAPICH2-GDR performs 84%, 74%, and 85% better than NCCL for Allreduce, Reduce, and Bcast, respectively.
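The latency numbers in Figure 1 were gathered with the OSU microbenchmarks suite [14]. As a rough illustration of the kind of measurement involved—not a replacement for the OSU suite—the following mpi4py sketch times Allreduce over the same 4 Byte to 16 KByte range; it uses host-memory buffers, whereas the actual experiments exercise GPU memory.

# Illustrative mpi4py timing loop for Allreduce latency over small messages.
# Host-memory (NumPy) buffers are used here; the numbers in Figure 1 come from
# the OSU microbenchmarks with GPU buffers.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iterations = 1000

for msg_bytes in [4 * (2 ** i) for i in range(13)]:          # 4 B ... 16 KB
    sendbuf = np.zeros(msg_bytes, dtype=np.uint8)
    recvbuf = np.zeros_like(sendbuf)
    comm.Barrier()                                           # align all ranks
    start = MPI.Wtime()
    for _ in range(iterations):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    latency_us = (MPI.Wtime() - start) / iterations * 1e6
    if rank == 0:
        print(f"{msg_bytes:6d} bytes: {latency_us:8.2f} us")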
set of processes with ranks 0 to n − 1, where n is the number of ranks. When a program initializes, a rank number is assigned to each process. Communicators use groups to represent the processes that participate in each communication. Each process can be a member of one or more communicators. MPI supports different types of communication, including point-to-point and collective communication. Point-to-point communication is a basic mechanism in which only the sender and receiver take part in the communication. In collective communication, which is extensively used by the applications in cuML, messages are exchanged among a group of processes. Collective communications give processes the opportunity to have one-to-many and many-to-many communications in a convenient, portable, and optimized way. Bcast, Reduce, and Allreduce are some examples of collective communications.

There are various MPI libraries available in the HPC community. Among them, MVAPICH2-GDR is specifically designed to improve communication performance on GPU devices and provides efficient performance for GPU-enabled HPC and deep learning applications [16]. It has support for GPUDirect RDMA, OpenPOWER with the NVLink interconnect, CUDA-aware managed memory, and MPI-3 one-sided communication, among many other features.
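As a concrete illustration of the collective primitives discussed above, the short mpi4py sketch below broadcasts a parameter vector from rank 0 and then reduces per-rank partial results back to rank 0. This is a generic example, not code taken from cuML or MVAPICH2-GDR.

# Generic mpi4py example of the Bcast and Reduce collectives described above.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Rank 0 owns the initial parameters; Bcast copies them to every rank.
params = np.arange(8, dtype=np.float64) if rank == 0 else np.empty(8, dtype=np.float64)
comm.Bcast(params, root=0)

# Every rank contributes a partial result; Reduce sums them onto rank 0.
partial = params * (rank + 1)
total = np.empty_like(partial) if rank == 0 else None
comm.Reduce(partial, total, op=MPI.SUM, root=0)

if rank == 0:
    print("reduced result on rank 0:", total)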
C. NCCL

NCCL implements a subset of the collective communication routines. Specifically, NCCL provides optimized collective communication algorithms for NVIDIA GPUs. The available primitives for collective communication in NCCL are: Allgather, Allreduce, Reduce, Reduce-scatter, and Bcast. It is important to note that NCCL does not follow the MPI standard and lacks support for many common MPI operations such as point-to-point, Gather, Scatter, and Alltoall. NCCL was created to meet the need for communication routines in common distributed deep learning workloads. Given the growth of Deep Learning applications and the volume of NVIDIA processors on HPC systems, NCCL has become a competitor to MPI for some applications.

III. DISTRIBUTED MACHINE LEARNING ALGORITHMS

This section discusses various ML algorithms that are currently available in cuML. This includes clustering algorithms (e.g., K-Means), dimensionality reduction (e.g., PCA and tSVD), ensemble methods (e.g., Random Forests), and linear models (e.g., Linear Regression). It also explains how these algorithms are parallelized in cuML under the MNMG configuration.
Each algorithm in cuML is split into fit() and predict() functions that loosely take the place of training and inference functions. Namely, fit() acts on the training data (e.g., a dense matrix for K-Means) as input and adjusts the weights of the initial model. After fit() has ended and the model is trained, predict() acts on the test data as input and provides a mapping to the output for each test sample (e.g., a mapping from points to clusters for K-Means).

A. K-Means

K-Means is an algorithm that separates n data points into k clusters by minimizing each cluster's sum of squared errors (SSE):

\[ \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2 \]

where \( \mu_j \) is the mean of the samples in cluster \( C \). The extent of parallelism in Scikit-learn is in splitting the data into chunks and processing each chunk with a separate thread. In cuML, however, the single-GPU K-Means model is copied onto each node, the initial centroids are communicated to each worker GPU via a Bcast operation, and the total cluster cost and centroid selections are performed with one-time calls to Allreduce and Allgather, respectively. At each training iteration, the centroids from each Dask worker GPU are communicated with an Allreduce operation. Since the data communicated at each training step is small (the centroid information), the message size for each communication call is also small.
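The communication pattern described above—broadcast the initial centroids once, then exchange only small per-iteration centroid updates—can be sketched as follows with mpi4py and NumPy. This is a simplified, host-memory illustration of the data-parallel idea (one process per GPU, random data), not cuML's actual implementation.

# Simplified sketch of the data-parallel K-Means update described above:
# each worker assigns its local points to the nearest centroid, then the
# per-cluster sums and counts are combined across workers with Allreduce.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
k, n_features = 5, 16

rng = np.random.default_rng(seed=rank)
local_X = rng.standard_normal((10_000, n_features))     # this worker's partition

# Rank 0 picks the initial centroids and broadcasts them (the Bcast step).
centroids = local_X[:k].copy() if rank == 0 else np.empty((k, n_features))
comm.Bcast(centroids, root=0)

for _ in range(10):                                      # training iterations
    # Assign each local point to its nearest centroid.
    dists = ((local_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    # Local per-cluster sums and counts.
    sums = np.zeros((k, n_features))
    counts = np.zeros(k)
    for j in range(k):
        members = local_X[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)

    # The small per-iteration Allreduce: only k * (n_features + 1) values.
    global_sums = np.empty_like(sums)
    global_counts = np.empty_like(counts)
    comm.Allreduce(sums, global_sums, op=MPI.SUM)
    comm.Allreduce(counts, global_counts, op=MPI.SUM)
    centroids = global_sums / np.maximum(global_counts, 1)[:, None]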
B. Random Forest

Random Forest is an ensemble model made up of decision tree classifiers. Decision tree classifiers predict a target variable's value by learning simple decision criteria from the training data's features. A decision tree may be easily explained as a set of yes or no questions posed to the input training variable. The decision tree makes a classification decision based on the answers to these questions. A random forest combines many individual decision trees and takes a majority vote on the tree outputs to decide on the final classification.

In cuML, there is no collective communication used for Random Forest training at the time of writing this paper. Instead, Dask is used to partition the data over each worker GPU. Specifically, to train N trees with ω workers, each worker initializes N/ω trees and trains them on that worker's local data subset. cuML's distributed Random Forest algorithm takes an embarrassingly parallel approach: all communication is carried out at the outset of training, and workers act independently once training has begun. Therefore, the communication overhead for random forests is small; we do not expect any performance difference between communication backends.

C. K Nearest Neighbors

Nearest-neighbors classification seeks to assign each data point based on a simple majority vote of its neighbors. An example of K Nearest Neighbors on a single worker with K = 3 is depicted below in Algorithm 1.

Algorithm 1 Example K Nearest Neighbors (with K = 3)
load(input_data)
K ← 3
/* Classify each point */
foreach data_point_i in input_data do
    list ← {}
    foreach data_point_j in input_data do
        /* Store distances */
        list.append(distance(data_point_i, data_point_j))
    end
    /* Sort points by distance */
    list.sort()
    /* Class set by closest k points */
    data_point_i.class ← majority_class(list[0 : k])
end

In cuML's distributed implementation of K Nearest Neighbors, a subset of data is sent to each worker GPU with a call to Bcast, and the output of each local model is communicated at each global training step with a call to Reduce.

This approximate matrix \( Q \) may be computed by taking a collection of random vectors \( \{\Omega_1, \Omega_2, \ldots\} \) and taking the action of \( M \) on each vector. The resulting subspace of actions may be used to form another intermediate matrix \( \Omega \), which may be used to compute \( Q \). In the second step, a small intermediate matrix \( B \) is constructed such that \( B = Q^{T} M \), and its SVD is computed. Then, since \( M \approx Q Q^{T} M = Q (S \Sigma V^{T}) \), it is clear that \( U = Q S \) constitutes an approximation \( M \approx U \Sigma V^{T} \). In cuML, tSVD is parallelized by performing distributed matrix computations. Specifically, tSVD was parallelized in cuML by applying a Bcast operation at the outset of training, and then applying a sequence of Reduce and Allgather operations to perform the matrix computation.
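The two-step randomized procedure described above can be written compactly with NumPy. The sketch below is a single-process illustration of the mathematics only; the distributed Bcast/Reduce/Allgather version used by cuML is not shown, and the function name is ours.

# Single-process NumPy sketch of the randomized two-step tSVD described above.
import numpy as np

def randomized_tsvd(M, k, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    n_cols = M.shape[1]

    # Step 1: random probing vectors Omega, the action of M on them, and an
    # orthonormal basis Q for the resulting subspace.
    Omega = rng.standard_normal((n_cols, k + oversample))
    Y = M @ Omega
    Q, _ = np.linalg.qr(Y)

    # Step 2: small matrix B = Q^T M, its SVD, and the lift-back U = Q S.
    B = Q.T @ M
    S, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S
    return U[:, :k], sigma[:k], Vt[:k]

# Usage: M is approximated by U @ np.diag(sigma) @ Vt for the leading k components.
M = np.random.default_rng(1).standard_normal((1000, 200))
U, sigma, Vt = randomized_tsvd(M, k=10)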
regression targets as a random linear combination of features with noise. The user may set the levels of sparsity and whether the informative features are uncorrelated or follow a low rank-fat tail singular profile (where a few features account for most of the variance). For the purposes of our study, the default uncorrelated informative features are sufficient. Examples of each of the three synthetic benchmarking methods are depicted in Listing 1.

from cuml import make_blobs, make_classification, make_regression

make_classification(n_samples, n_features, n_clusters_per_class, n_informative, random_state, n_classes)

make_regression(n_samples, n_features, random_state)

Listing 1: cuML example code for generating synthetic data for clustering, classification, and regression applications, respectively.
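To give a flavor of how such synthetic data can drive a throughput benchmark, the following is a minimal single-GPU sketch that times the fit() stage of cuML's K-Means on make_blobs data. It is an illustration only—not the benchmarking suite used in this paper—and the chosen problem sizes are arbitrary.

# Minimal single-GPU sketch: generate synthetic blobs as in Listing 1 and time
# the fit() stage of cuML's K-Means. Sizes and parameters are illustrative.
import time
from cuml import make_blobs
from cuml.cluster import KMeans

X, y = make_blobs(n_samples=1_000_000, n_features=32, centers=8, random_state=123)

model = KMeans(n_clusters=8, random_state=123)
start = time.perf_counter()
model.fit(X)                               # training stage
elapsed = time.perf_counter() - start
print(f"fit() time: {elapsed:.3f} s")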
In order to ensure that accuracy is not affected by the distributed algorithms in cuML, we trained the K-Means and Random Forest algorithms on the Higgs Boson dataset [18]. We applied the hyperparameter optimization suite in Dask ML to maximize the accuracy achievable at each scale. In particular, we used the adaptive cross-validation algorithm HyperbandSearchCV, which is specialized for processors like GPUs, where the search space of hyperparameters can be large. The aggressiveness of the search is passed to HyperbandSearchCV via the user parameter aggressiveness.

from dask_ml.model_selection import HyperbandSearchCV
from cuml.dask.cluster.kmeans import KMeans as cuKMeans

model_kmeans = cuKMeans(init="k-means||", n_clusters=5, random_state=123, oversampling_factor=3)

clusters = [i for i in range(1, 100)]
sample_factors = [i for i in range(1, 5)]

# Hyperparameter search space: number of clusters and oversampling factor.
params = {"n_clusters": clusters, "oversampling_factor": sample_factors}

search = HyperbandSearchCV(model_kmeans, params, max_iter=100, aggressiveness=3, random_state=123)

Listing 2: cuML example code for performing hyperparameter optimization on KMeans.

V. OVERVIEW OF THE SOFTWARE STACKS

A. RAPIDS

NVIDIA has recently launched RAPIDS AI, which is a collection of open-source software libraries and APIs. It is used to run end-to-end data science analytics pipelines entirely on GPUs. For low-level compute optimizations, it relies on NVIDIA CUDA primitives, and it exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
(Figure: (a) cuML components; (b) cuML software stack; (c) cuML software stack in a distributed setting.)
lines show Dask communication paths, while the red lines show NCCL communication paths. An NCCL communicator is created across the worker processes to take advantage of the collective communication support in NCCL. The NCCL communicator is then injected into cumlHandle. cumlHandle is a class in cuML that is used to manage the resources needed for running machine learning algorithms. As can be seen in Figure 5, cumlHandle is used by all workers.

Once the NCCL communicator is attached to the cumlHandle, it can be used as the input of fit() and predict(). This way, the fit() and predict() functions take advantage of NCCL for running collective communications. Note that fit() has the most communication overhead in cuML algorithms. Figure 6 shows the procedure of injecting an NCCL communicator into cumlHandle in Python. As can be seen in the figure, this is done through inject_nccl_comms_py(). Once the NCCL communicator is attached to the cumlHandle, it is passed as the input of fit().

A similar thing should be done to inject an MPI communicator into cumlHandle and add support for MPI-based communications. cuML has a function initialize_mpi_comms which attaches an MPI communicator to cumlHandle. This function is defined in the CUDA/C++ layer, so it cannot be used directly in cuML applications written in Python. To be able to practically use this function for running cuML applications, we have developed a Cython wrapper MPI_plugin.pyx shown in Figure 8. This wrapper defines a new function inject_mpi_comms_py() which injects an MPI communicator into cumlHandle in Python. It requires mpi4py to get an MPI communicator, together with cumlHandle and the function initialize_mpi_comms from cuML. To include cumlHandle and initialize_mpi_comms, the cuML and CUDA libraries are linked to the wrapper through setup.py. We also link MVAPICH2-GDR, which is required by mpi4py. After compiling MPI_plugin.pyx, we have an extension module that injects an MPI communicator into cumlHandle in Python. Listing 3 shows the details of defining inject_mpi_comms_py() in MPI_plugin.pyx.
Listing 4 shows how inject_mpi_comms_py() is used to inject an MPI communicator into cumlHandle in K-Means. Here, we show the modifications we made to K-Means as an example; we have made similar changes to the other cuML algorithms to make them use MPI-based communications. The listing shows a part of _func_fit(). This function is executed by all the worker processes in the cluster. First, we import the module MPI_plugin that we created from our Cython code in Listing 3. In _func_fit(), we create an MPI communicator using mpi4py (Lines 3 and 4). This communicator is created between the worker processes that are calling _func_fit(). Then, we import Handle from cuml.common.handle (Line 6) and use it to create a cumlHandle object handle (Line 7). In Line 8, we call inject_mpi_comms_py(), which we defined in MPI_plugin.pyx. This function gets two inputs, the MPI communicator and the cuML handle, and attaches the communicator to handle. Then, we pass the new handle to cumlKMeans() (Line 12). cumlKMeans() takes care of running the algorithm. It uses the MPI communicator that is attached to handle for executing the collective communications.
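A rough sketch of this pattern is shown below. The module and function names (MPI_plugin, inject_mpi_comms_py(), Handle, cumlKMeans) follow the description above, while the exact arguments and the surrounding Dask plumbing are simplified assumptions.

# Sketch of an MPI-enabled _func_fit() following the description above.
# Exact signatures in the real Listing 4 may differ.
from mpi4py import MPI
from cuml.common.handle import Handle
from cuml.cluster import KMeans as cumlKMeans   # assumed import for cumlKMeans
import MPI_plugin                               # compiled from MPI_plugin.pyx (Listing 3)

def _func_fit(X, **kwargs):
    # MPI communicator between the worker processes calling _func_fit().
    comm = MPI.COMM_WORLD.Dup()

    # Create the cuML handle and attach the MPI communicator to it.
    handle = Handle()
    MPI_plugin.inject_mpi_comms_py(comm, handle)

    # The MPI-enabled handle is passed to K-Means; its collective calls now run
    # over the injected MPI communicator instead of NCCL.
    model = cumlKMeans(handle=handle, **kwargs)
    return model.fit(X)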
Table II: Hardware specification of the SDSC Comet cluster

Specification        SDSC Comet
Number of Nodes      36
Processor Family     Xeon Haswell
Processor Model      E5-2680 v3
Clock Speed          2.5 GHz
Sockets              2
Cores Per Socket     12
RAM (DDR4)           128 GB
GPU Family           NVIDIA Pascal P100
GPUs (per node)      4
GPU Memory           16 GB (HBM2)
Interconnect         IB-EDR (56G)

Forest, Nearest Neighbors, tSVD, and Linear Regression. Scikit-learn and cuML are used for the experiments on CPU and GPU, respectively. As can be seen in Figure 9, the GPU performs significantly better than the CPU for all the algorithms, as expected.
Figure 10: Training Time for GPU-Accelerated Distributed Algorithms using cuML.
information in Figure 11, we can see that the vast majority of collective calls are for the small to medium message size range. Note that we observed the same behavior for other cuML algorithms that use collectives, such as tSVD and Linear Regression. We believe this to be the case for two primary reasons:
• For many distributed cuML algorithms like K-Means and Nearest Neighbors, the update information at each training step consists of many small messages (e.g., for K-Means, only the centroid information is shared between workers at each step). This allows the cuML algorithms

Figure 11: Profile of different collective operations in cuML algorithms: K-Means and Nearest Neighbors. (Histogram of the number of calls versus message size in bytes for Allreduce in K-Means, Bcast in Nearest Neighbors, and Reduce in Nearest Neighbors.)

Given that the communication overhead for cuML algorithms consists of many small messages, we can take advantage of MVAPICH2-GDR's improved small-message algorithms compared to NCCL, as in Figure 1. Further, while cuML's small-message strategy does keep communication overheads low for increasing node counts, the increased number of

Figure 12: Accuracy Numbers. (Panel (a): K-Means.)

accuracy on the test dataset was taken with the default model parameters for Random Forest and K-Means, then HPO was performed in an embarrassingly parallel fashion via Dask ML's HyperbandSearchCV across the job's worker GPUs. The accuracy results are depicted in Figure 12. The accuracy with the optimal parameters for K-Means and Random Forests shows an approximate 5% improvement over the default model. We used K-Means and Random Forests on Higgs because:
• They are representative of different communications in cuML (K-Means following a data-parallel approach, and Random Forests being embarrassingly parallel)
• Both algorithms have enough hyperparameters to stress the HPO implementation in Dask ML, while having a broad enough parameter space to enable accuracy benefits

VIII. RELATED WORK

González et al. [22] present a review of bagging and boosting ML algorithms. XGBoost [23][24] is a gradient boosting library that supports both distributed—through Dask—and GPU-based execution. H2O3 [25] is another ML library that is capable of distributed computation. H2O4GPU [26] is a variant with shared-memory GPU support. While computational support for ML workloads has been around for a long time in the form of software libraries and packages, support for distributed execution on a cluster of GPUs is still in its infancy. As we noted above, XGBoost is one such library that provides boosting algorithms. However, a wide range of new and existing ML algorithms are still being investigated for efficient Multi-Node Multi-GPU execution—this is the main motivation for the development of the cuML library. Deep Learning frameworks like TensorFlow [12] and PyTorch [27] have support for some ML tools and techniques. However, we only focus on specialized ML libraries in this paper. Raschka et al. [28] provide a survey of machine learning in Python, including Scikit-learn training on CPUs and the RAPIDS ecosystem. They also discuss the need and motivation for the cuML library.

The RAPIDS framework—and the cuML library—has been gaining traction in the community as a viable option to support the execution of performance-hungry applications on a cluster of GPUs. For example, Napoli et al. [29] applied the RAPIDS [30] and Dask [31] ecosystems to analyze data for geophysics simulations. While all of these studies exploit the RAPIDS framework with cuML to take advantage of multiple NVIDIA GPUs, none of them provide insight into improving communication performance, nor explore other viable communication options. We fill this gap—in this paper—by proposing MVAPICH2-GDR as an alternative to NCCL in the Multi-Node Multi-GPU setting. We also attempt to characterize our evaluation of these GPU-aware messaging libraries to gain insights into the scaling behavior of ML applications in the cuML library.

IX. CONCLUSION AND FUTURE WORK

In this paper, we add support for MPI-based communications for cuML applications in Python. For this, we propose a Python API that takes advantage of MPI calls to handle collective communications between processes in cuML. Moreover, we provide a synthetic benchmarking suite and an in-depth analysis of cuML applications to understand their behavior and identify the regions that can be improved using MPI-based communications. We believe these analyses and characterizations provide a comprehensive guideline for cuML application developers and data scientists to take the most advantage of the features provided by cuML. Finally, we compare the performance of the proposed MPI-based communication approach with the NCCL-based communication design that is used by default in cuML. The experimental results on state-of-the-art cuML algorithms show that the proposed MPI-based communication approach gives up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively. For future work, we would like to explore the impact of the MPI-based communication support for real ML applications and larger data sets.

ACKNOWLEDGEMENT

We would like to thank Arpan Jain for assisting us in the initial setup of cuML. This research is supported in part by NSF grants #1450440, #1565414, #1664137, #1818253, #1854828, #1931537, #2007991, and XRAC grant #NCR-130002.

REFERENCES

[1] D. Guest, K. Cranmer, and D. Whiteson, "Deep learning and its application to LHC physics," Annual Review of Nuclear and Particle Science, vol. 68, pp. 161–181, 2018.
[2] M. R. Blanton, M. A. Bershady, B. Abolfathi, F. D. Albareti, C. Allende Prieto, A. Almeida, J. Alonso-García, F. Anders, S. F. Anderson, B. Andrews, et al., "Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe," The Astronomical Journal, vol. 154, p. 28, Jul. 2017.
[3] H. Ledford, "How Facebook, Twitter and other data troves are revolutionizing social science," Nature, vol. 582, pp. 328–330, Jun. 2020.
[4] H. Fröhlich, R. Balling, N. Beerenwinkel, O. Kohlbacher, S. Kumar, T. Lengauer, H. M. Maathuis, Y. Moreau, A. S. Murphy, M. T. Przytycka, M. Rebhan, H. Röst, A. Schuppert, M. Schwab, R. Spang, D. Stekhoven, J. Sun, A. Weber, D. Ziemek, and B. Zupan, "From hype to reality: data science enabling personalized medicine," BMC Medicine, 2018.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[6] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, "MLlib: Machine learning in Apache Spark," Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235–1241, Jan. 2016.
[7] R. Anil, S. Owen, T. Dunning, and E. Friedman, Mahout in Action. Greenwich, CT: Manning Publications Co., 2010. [Online]. Available: http://manning.com/owen/
[8] S. Raschka, J. T. Patterson, and C. Nolet, "Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence," arXiv preprint arXiv:2002.04803, 2020.
[9] "MPI-3 Standard Document," http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
[10] MVAPICH2-GDR Development Team, "MVAPICH2-GDR User Guide," http://mvapich.cse.ohio-state.edu/userguide/gdr.
[11] A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda, "HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow," arXiv preprint arXiv:1911.05146, 2019. [Online]. Available: http://arxiv.org/abs/1911.05146
[12] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
[13] NVIDIA, "RAPIDS Accelerator For Apache Spark," https://github.com/NVIDIA/spark-rapids.
[14] D. Panda et al., "OSU Microbenchmarks v5.6.3," http://mvapich.cse.ohio-state.edu/benchmarks/.
[15] P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin, M. Dubman, G. Shainer, R. L. Graham, L. Liss, Y. Shahar, S. Potluri, D. Rossetti, D. Becker, D. Poole, C. Lamb, S. Kumar, C. Stunkel, G. Bosilca, and A. Bouteiller, "UCX: An open source framework for HPC network APIs and beyond," in 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, 2015, pp. 40–43.
[16] A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni, and D. K. Panda, "Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation," arXiv preprint arXiv:1810.11112, 2018.
[17] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions," 2009.
[18] N. Becker, A. Dattagupta, E. Fajardo, P. Gali, B. Rhodes, B. Richardson, and B. Suryadevara, "Streamlined and accelerated cyber analyst workflows with CLX and RAPIDS," in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 2011–2015.
[19] S. Sievert, T. Augspurger, and M. Rocklin, "Better and faster hyperparameter optimization with Dask," in Proceedings of the 18th Python in Science Conference, C. Calloway, D. Lippa, D. Niederhut, and D. Shupe, Eds., 2019, pp. 118–125.
[20] Apache Software Foundation, "Arrow: A cross-language development platform for in-memory data," https://arrow.apache.org.
[21] L. D. Dalcin, R. R. Paz, P. A. Kler, and A. Cosimo, "Parallel distributed computing using Python," Advances in Water Resources, vol. 34, no. 9, pp. 1124–1139, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0309170811000777
[22] S. González, S. García, J. Del Ser, L. Rokach, and F. Herrera, "A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities," Information Fusion, vol. 64, pp. 205–237, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253520303195
[23] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). New York, NY, USA: Association for Computing Machinery, 2016, pp. 785–794. [Online]. Available: https://doi.org/10.1145/2939672.2939785
[24] XGBoost Development Team, "XGBoost Library," https://xgboost.readthedocs.io/en/latest/.
[25] H2O Development Team, "H2O3: Distributed, fast, and scalable machine learning software," http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html.
[26] N. Gill, E. LeDell, and Y. Tang, "H2O4GPU: Machine Learning with GPUs in R and Python," https://github.com/h2oai/h2o4gpu.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[28] S. Raschka, J. Patterson, and C. Nolet, "Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence," Information, vol. 11, no. 4, p. 193, Apr. 2020. [Online]. Available: http://dx.doi.org/10.3390/info11040193
[29] O. O. Napoli, V. M. do Rosario, J. P. Navarro, P. M. C. e Silva, and E. Borin, "Accelerating multi-attribute unsupervised seismic facies analysis with RAPIDS," 2020.
[30] NVIDIA, "RAPIDS: Open GPU Data Science Framework," http://rapids.ai.
[31] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proceedings of the 14th Python in Science Conference, K. Huff and J. Bergstra, Eds., 2015, pp. 130–136.