
2020 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) and Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S) | DOI: 10.1109/MLHPCAI4S51975.2020.00010

Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR

S. Mahdieh Ghazimirsaeed†, Quentin Anthony†, Aamir Shafi, Hari Subramoni, and Dhabaleswar K. (DK) Panda
The Ohio State University
{ghazimirsaeed.3, anthony.301, shafi.16, subramoni.1, panda.2}@osu.edu

†These authors contributed equally to this work.

Abstract—The growth of big data applications during the last decade has led to a surge in the deployment and popularity of machine learning (ML) libraries. On the other hand, the high performance offered by GPUs makes them well suited for ML problems. To take advantage of GPU performance for ML, NVIDIA has recently developed the cuML library. cuML is the GPU counterpart of Scikit-learn, and provides similar Pythonic interfaces to Scikit-learn while hiding the complexities of writing GPU compute kernels directly using CUDA. To support execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems, the cuML library exploits the NVIDIA Collective Communications Library (NCCL) as a backend for collective communications between processes. On the other hand, MPI is a de facto standard for communication in HPC systems. Among various MPI libraries, MVAPICH2-GDR is the pioneer in optimizing GPU communication.

This paper explores various aspects and challenges of providing MPI-based communication support for GPU-accelerated cuML applications. More specifically, it proposes a Python API to take advantage of MPI-based communications for cuML applications. It also gives an in-depth analysis, characterization, and benchmarking of cuML algorithms such as K-Means, Nearest Neighbors, Random Forest, and tSVD. Moreover, it provides a comprehensive performance evaluation and profiling study of MPI-based versus NCCL-based communication for these algorithms. The evaluation results show that the proposed MPI-based communication approach achieves up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively, on up to 32 GPUs.

Index Terms—Machine Learning, cuML, MPI, GPUs

I. INTRODUCTION

The last decade has witnessed unprecedented growth in data generated from diverse sources. These sources range from scientific experiments—like the Large Hadron Collider [1] and the Sloan Digital Sky Survey [2]—to social media platforms [3] to personalized medicine [4]. A daunting challenge—common to these disparate Big Data applications—is to process vast amounts of data in a timely manner to gain vital insights and drive decision making. To aid with this quest for better understanding, there has been a resurgence of Machine Learning (ML) libraries, tools, and techniques that allow processing and extracting useful information from this data through iterative processing.

Some of the popular ML libraries employed for Data Analytics include Scikit-learn [5] and Apache Spark's MLlib [6]. These libraries are natively designed to support the execution of ML algorithms on CPUs. On the other hand, GPUs have emerged as a popular platform for optimizing parallel workloads because of their high throughput. This also makes them a good match for ML applications, which require high arithmetic intensity [7].

To take advantage of the performance offered by GPUs, NVIDIA has recently launched the RAPIDS AI [8] framework, which is a collection of open-source software libraries and APIs. The main goal of this effort is to enable end-to-end data science analytic pipelines entirely on GPUs. One of the main components of RAPIDS—within its data science ecosystem—is the cuML [9] library. This GPU-accelerated ML library is the GPU counterpart of Scikit-learn and provides similar Pythonic interfaces while hiding the complexities of writing compute kernels for GPUs directly using CUDA. One of the main benefits of cuML is that it supports the execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems. For this, it takes advantage of a task-based framework called Dask [10], which allows the distributed execution of Python data science applications. Dask is comprised of a central scheduler, workers (one per GPU), and a client process.

When a cuML application is running in an MNMG configuration, workers need to communicate with each other. These communications are required at two stages: 1) the training data is distributed to all workers to execute the training stage (the fit() function), and 2) the output of the training stage, i.e., the model parameters, is shared with all workers involved in the inference stage (the predict() function). These communications can be either point-to-point or collective. cuML takes advantage of Dask for handling the point-to-point communications while using the NVIDIA Collective Communications Library (NCCL) to take care of collective communications.

Message Passing Interface (MPI) is considered the de facto standard for writing parallel applications on clusters. The MPI standard [9] provides efficient support for point-to-point and collective communication routines. The MVAPICH2-GDR [10] library is a pioneer MPI library that supports communication between GPU devices.

Table I: Comparison of different ML libraries

Libraries                   GPU Support   MNMG Support   Python Interface   High Performance
Scikit-learn [5]            No            No             Yes                No
Apache Spark's MLlib [6]    No            Yes            Yes                No
Apache Mahout [7]           Yes           Yes            No                 No
RAPIDS cuML [8]             Yes           Yes            Yes                No
MPI [9]                     Yes           Yes            No                 Yes
Our paper                   Yes           Yes            Yes                Yes

MVAPICH2-GDR has been used to accelerate and scale [11] Deep Learning (DL) frameworks, including TensorFlow [12], on large clusters. In order to leverage the years of research and development behind the MPI standard and its compliant libraries, this paper introduces support for MPI-based communication for the cuML library. This is done using the MVAPICH2-GDR library, which has efficient collective communication designs—including MPI_Reduce(), MPI_Bcast(), and MPI_Allreduce()—on GPU memory.

A. Motivation and Challenges

Table I compares different libraries that can be used to run ML applications. Among these libraries, Apache Mahout [7], RAPIDS cuML, and MPI support execution on GPUs. Note that NVIDIA has recently created a RAPIDS Accelerator [13] for Spark 3.0 that enables the launch of Spark applications on GPUs; however, Spark does not natively support GPU acceleration. Apache Mahout, on the other hand, has poor visualization support, does not provide a Python interface, and has poor performance compared to the other libraries. The respective strengths of cuML and MPI are complementary: while cuML provides a Python interface and hides the complexities of CUDA/C++ programming from the user, MPI is well known for its high performance in parallel applications. Therefore, the question we ask is: How can we combine the ease of use provided by cuML for running ML applications with the high performance provided by MPI?

As mentioned earlier, cuML utilizes NCCL for handling collective communications. It has some APIs at the CUDA/C++ layer for MPI communications, but these APIs are not available at the Python layer and cannot be used by end users developing Python code. At the same time, MPI libraries have many years of research and development behind them. Among these libraries, MVAPICH2-GDR, in particular, provides efficient designs of collective communications for GPUs. We use the OSU microbenchmarks suite [14] to compare the performance of the MVAPICH2-GDR library with NCCL for different collective operations in Figure 1. We run the experiments on the Comet cluster from XSEDE. A detailed description of this cluster is provided in Section VII-A. We run the experiments for Allreduce, Bcast, and Reduce for the 4 byte to 16 KByte message range. These collectives are extensively used by cuML algorithms within this message range—this will be further discussed in Section VII. Figure 1 shows that MVAPICH2-GDR performs significantly better than NCCL. This brings up another question: How can we replace NCCL-based collective communications in cuML with MPI-based communications to take advantage of efficient and GPU-aware collective communication designs in MVAPICH2-GDR?

The cuML library supports training/inference based on several compute-bound ML algorithms. These include: 1) clustering (like K-Means), 2) dimensionality reduction (like Principal Component Analysis (PCA) and Truncated Singular Value Decomposition (tSVD)), 3) ensemble methods (like Random Forests), and 4) linear models (like Linear Regression). Data scientists should attempt to understand cuML in order to achieve the best possible performance on these algorithms. However, cuML—being a relatively new ML library—has not been studied well by the community. With this background in mind, the question is: How can we provide performance characterization for GPU-accelerated cuML algorithms and provide guidelines for data scientists to best take advantage of them?

Each of the cuML algorithms mentioned above has a unique format for data and model features. Moreover, each cuML model has a unique set of hyperparameters that must be tuned for accuracy at every scale. In order to run cuML applications in a fast (time-to-solution) and accurate manner, a synthetic benchmarking suite as well as a hyperparameter tuning framework is required. This brings up another question: How can we provide a synthetic benchmarking suite for cuML algorithms in order to understand their scaling behavior?

B. Contributions

The key contributions of this paper are as follows:
• We provide MPI-based communication support for GPU-accelerated cuML applications in Python. More specifically, we propose a Python API to take advantage of MPI-based communications for cuML applications.
• We provide an in-depth analysis and characterization of state-of-the-art cuML applications and identify regions that can be enhanced using MPI-based communications.
• We provide synthetic benchmarking of cuML algorithms to characterize their throughput behavior. Further, we apply Dask ML's hyperparameter optimization library on the Higgs Boson dataset to ensure cuML's distributed algorithms maintain accuracy at scale.
• We provide a comprehensive performance evaluation and profiling study comparing MPI-based versus NCCL-based communication for cuML applications. The evaluation results show that by adding support for MPI-based communications, we gain up to 38%, 20%, 20%, and 26% performance improvement for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively, on up to 32 GPUs.

[Figure 1 plots: latency vs. message size (4 bytes to 16 KB) for NCCL and MVAPICH2-GDR: (a) Allreduce, (b) Reduce, (c) Bcast]

Figure 1: The performance of collective operations with MPI (MVAPICH2-GDR) vs. NCCL with 8 GPUs across 2 nodes on the Comet cluster for various collective operations: Allreduce, Reduce, and Bcast. MVAPICH2-GDR performs 84%, 74%, and 85% better than NCCL for Allreduce, Reduce, and Bcast, respectively.
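The kind of GPU-to-GPU collective latency comparison summarized in Figure 1 can be reproduced in spirit with a small mpi4py script. The sketch below assumes a CUDA-aware MPI build (such as MVAPICH2-GDR), mpi4py 3.1 or newer, and CuPy, so that device buffers can be passed to MPI calls directly; it is an illustration, not the OSU microbenchmark itself.

# Rough latency probe for MPI_Allreduce on GPU buffers (illustrative only).
# Requirements (assumed): CUDA-aware MPI, mpi4py >= 3.1, CuPy.
# Launch with e.g.: mpirun -np 8 python allreduce_latency.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters = 100

for nbytes in (4, 64, 1024, 16 * 1024):
    sendbuf = cp.ones(nbytes, dtype=cp.uint8)
    recvbuf = cp.empty_like(sendbuf)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iters):
        comm.Allreduce(sendbuf, recvbuf)       # operates directly on GPU memory
    cp.cuda.runtime.deviceSynchronize()
    elapsed = MPI.Wtime() - start
    if rank == 0:
        print(f"{nbytes:6d} B  {1e6 * elapsed / iters:8.2f} us per Allreduce")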

The rest of the paper is organized as follows. Section II provides background information on the libraries that are currently used to run cuML applications (such as Dask and NCCL) and on MPI. Section III discusses the ML algorithms that are supported by cuML and targeted in this paper: K-Means, Random Forest, Linear Regression, and truncated SVD (tSVD). Section IV provides details on the synthetic benchmarking and hyperparameter optimization of these applications. In Section V, we provide a comprehensive overview of cuML's software stack and discuss the use of Dask and NCCL for handling different communication paths in cuML. The proposed MPI-based communication support is discussed in Section VI. Section VII characterizes state-of-the-art cuML applications and compares the performance of the proposed MPI-based approach vs. the default NCCL-based communication design. We discuss related work in Section VIII. Finally, we conclude the paper in Section IX.

II. BACKGROUND

In this section, we provide background information on the Dask, MPI, and NCCL libraries. These libraries are used to discuss and analyze the cuML software stack in upcoming sections.

A. Dask

Dask is an open source library for parallel computing in Python. It enables the Python ecosystem to scale to multi-core machines and distributed clusters. The main advantage of Dask is that it integrates well with analytic tools (such as Pandas, Scikit-Learn, NumPy, etc.) and provides ways to scale them on distributed clusters with minimal rewriting. In the past, Dask only supported execution on the host processors (CPUs). However, it has recently been integrated with the RAPIDS framework for processing cuDF data structures on GPU devices and also for GPU-accelerated machine learning with cuML. For this, it distributes data and computation over multiple GPUs. These GPUs can be located on the same system or they can be distributed over a multi-node cluster.

Figure 2: Dask Architecture

Figure 2 shows the Dask architecture. As can be seen in the figure, Dask consists of a client, a scheduler, and multiple worker processes. These processes communicate with each other to execute the Python application in a distributed manner. The scheduler is responsible for coordinating data access and sending tasks to workers for execution. It also manages the state of the workers. The workers are responsible for the actual execution of the tasks. The scheduler along with N workers is called a Dask cluster. The Dask cluster in Figure 2 has 3 workers. As the figure shows, the workers are directly connected to each other for sending/receiving data during the execution of parallel jobs. In the past, Dask only supported TCP for the communication between the client, scheduler, and workers. However, it has recently been extended to take advantage of the UCX [15] library for communication between the processes.

The end users submit their Python applications to the Dask cluster through the client process. The client process is connected to the scheduler and provides the execution code and the input data to the scheduler. Then, the scheduler divides up the input data and stores the chunks on workers to achieve data parallelism.
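The scheduler/worker/client roles described above can be stood up on a GPU node with the Dask-CUDA launcher. The following minimal sketch assumes the dask, dask_cuda, and distributed packages are installed; the array sizes and chunking are illustrative only and are not taken from the paper's setup.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da

if __name__ == "__main__":
    # Scheduler plus one worker process per visible GPU on this node
    cluster = LocalCUDACluster()
    # The client connects to the scheduler and submits work
    client = Client(cluster)

    # A chunked array: each chunk becomes a task that the scheduler assigns to a worker
    x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
    result = x.mean().compute()
    n_workers = len(client.scheduler_info()["workers"])
    print("mean =", result, "computed on", n_workers, "workers")

    client.close()
    cluster.close()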

B. MPI

MPI has been the de facto standard for communication in HPC systems, and is by far the dominant parallel programming model in large-scale parallel applications. In MPI, the concepts of groups and communicators are used to define the scope and context of communications. A group is an ordered set of processes with ranks 0 to n − 1, where n is the number of ranks. When a program initializes, a rank number is assigned to each process. Communicators use groups to represent the processes that participate in each communication. Each process can be a member of one or more communicators. MPI supports different types of communications, including point-to-point and collective communications. Point-to-point communication is a basic mechanism in which only the sender and receiver take part in the communication. In collective communication, which is used extensively by the applications in cuML, messages are exchanged among a group of processes. Collective communications give processes the ability to perform one-to-many and many-to-many communications in a convenient, portable, and optimized way. Bcast, Reduce, and Allreduce are some examples of collective communications.

There are various MPI libraries available in the HPC community. Among them, MVAPICH2-GDR is specifically designed to improve communication performance on GPU devices and provides efficient performance for GPU-enabled HPC and deep learning applications [16]. It has support for GPUDirect RDMA, OpenPOWER with the NVLink interconnect, CUDA-aware managed memory, and MPI-3 one-sided communication, among many other features.
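As a concrete illustration of these concepts, the short mpi4py sketch below creates a communicator, performs a point-to-point exchange, and then a collective Allreduce. mpi4py itself is introduced later in Section VI; the sketch simply assumes it is installed.

# Run with e.g.: mpirun -np 4 python mpi_basics.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD          # default communicator: every launched process
rank = comm.Get_rank()         # this process's position (0..n-1) within the group
size = comm.Get_size()

# Point-to-point: only ranks 0 and 1 take part
if rank == 0 and size > 1:
    comm.send({"msg": "hello from rank 0"}, dest=1, tag=7)
elif rank == 1:
    print(comm.recv(source=0, tag=7))

# Collective: every rank in the communicator contributes and receives the result
local = np.array([rank], dtype=np.float64)
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)
print(f"rank {rank}: sum of all ranks = {total[0]}")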
C. NCCL

NCCL implements a subset of the collective communication routines. Specifically, NCCL provides optimized collective communication algorithms for NVIDIA GPUs. The available primitives for collective communication in NCCL are: Allgather, Allreduce, Reduce, Reduce-scatter, and Bcast. It is important to note that NCCL does not follow the MPI standard, and lacks support for many common MPI operations such as point-to-point, Gather, Scatter, and Alltoall. NCCL was created to meet the need for communication routines in common distributed deep learning workloads. Given the growth of Deep Learning applications and the volume of NVIDIA processors on HPC systems, NCCL has become a competitor to MPI for some applications.

III. DISTRIBUTED MACHINE LEARNING ALGORITHMS

This section discusses the ML algorithms that are currently available in cuML. This includes clustering algorithms (e.g., K-Means), dimensionality reduction (e.g., PCA and tSVD), ensemble methods (e.g., Random Forests), and linear models (e.g., Linear Regression). It also explains how these algorithms are parallelized in cuML under the MNMG configuration.

Each algorithm in cuML is split into fit() and predict() functions that loosely take the place of training and inference functions. Namely, fit() acts on the training data (e.g., a dense matrix for K-Means) as input and adjusts the weights of the initial model. After fit() has ended and the model is trained, predict() acts on the test data as input and provides a mapping to the output for each test sample (e.g., a mapping from points to clusters for K-Means).

A. K-Means

K-Means is an algorithm that separates n data points into k clusters by minimizing each cluster's sum of squared errors (SSE):

    Σ_i  min_{μ_j ∈ C} ||x_i − μ_j||²

where μ_j is the mean of the samples in cluster C. The extent of parallelism in Scikit-learn is in splitting the data into chunks and processing each chunk with a separate thread. In cuML, however, the single-GPU K-Means model is copied onto each node, the initial centroids are communicated to each worker GPU via a Bcast operation, and the total cluster cost and centroid selections are performed with one-time calls to Allreduce and Allgather, respectively. At each training iteration, the centroids from each Dask worker GPU are communicated with an Allreduce operation. Since the data communicated at each training step is small (the centroid information), the message size for each communication call is also small.
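The communication pattern just described (a one-time Bcast of the initial centroids followed by a small per-iteration Allreduce of centroid statistics) can be sketched conceptually with mpi4py and NumPy. This is not cuML's implementation, only an illustration of one data-parallel training step under assumed shapes.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
k, d = 5, 16                                   # clusters, feature dimension
X_local = np.random.rand(10_000, d)            # this worker's shard of the data
centroids = np.empty((k, d))
if comm.Get_rank() == 0:
    centroids[:] = X_local[:k]                 # root picks initial centroids
comm.Bcast(centroids, root=0)                  # one-time Bcast, as in cuML

# One training iteration
labels = np.argmin(((X_local[:, None, :] - centroids) ** 2).sum(-1), axis=1)
local_sums = np.zeros((k, d))
local_counts = np.zeros(k)
for j in range(k):
    members = X_local[labels == j]
    local_sums[j] = members.sum(axis=0)
    local_counts[j] = len(members)

global_sums = np.empty_like(local_sums)
global_counts = np.empty_like(local_counts)
comm.Allreduce(local_sums, global_sums)        # small messages: k * d floats
comm.Allreduce(local_counts, global_counts)
centroids = global_sums / np.maximum(global_counts, 1)[:, None]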
B. Random Forest

Random Forest is an ensemble model made up of decision tree classifiers. Decision tree classifiers predict a target variable's value by learning simple decision criteria from the training data's features. A decision tree may be easily explained as a set of yes or no questions posed to the input training variable. The decision tree makes a classification decision based on the answers to these questions. A random forest combines many individual decision trees and takes a majority vote on the tree outputs to decide on the final classification.

In cuML, there is no collective communication used for Random Forest training at the time of writing this paper. Instead, Dask is used to partition the data over each worker GPU. Specifically, to train N trees with ω workers, each worker initializes N/ω trees and trains them on that worker's local data subset. cuML's distributed Random Forest algorithm takes an embarrassingly parallel approach: all communication is carried out at the outset of training, and workers act independently once training has begun. Therefore, the communication overhead for random forests is small; we do not expect any performance difference between communication backends.

C. K Nearest Neighbors

Nearest-neighbors classification seeks to assign each data point a class based on a simple majority vote of its neighbors. An example of K Nearest Neighbors on a single worker with K = 3 is depicted below in Algorithm 1.

Algorithm 1: Example K Nearest Neighbors (with K = 3)
    load(input_data)
    K ← 3
    /* Classify each point */
    foreach data_point_i in input_data do
        list ← {}
        foreach data_point_j in input_data do
            /* Store distances */
            list.append(distance(data_point_i, data_point_j))
        end
        /* Sort points by distance */
        list.sort()
        /* Class set by closest K points */
        data_point_i.class ← majority_class(list[0 : K])
    end

In cuML's distributed implementation of K Nearest Neighbors, a subset of data is sent to each worker GPU with a call to Bcast, and the output of each local model is communicated at each global training step with a call to Reduce.
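A runnable NumPy version of the single-worker procedure in Algorithm 1 might look as follows. It is a sketch only; cuML performs the distance computations on the GPU.

import numpy as np
from collections import Counter

def knn_classify(X, y, K=3):
    """Brute-force K Nearest Neighbors: majority vote among the K closest points."""
    predictions = np.empty(len(X), dtype=y.dtype)
    for i, point in enumerate(X):
        distances = np.linalg.norm(X - point, axis=1)   # distance to every point
        nearest = np.argsort(distances)[1:K + 1]        # skip the point itself
        predictions[i] = Counter(y[nearest]).most_common(1)[0][0]
    return predictions

# Tiny usage example with two well-separated clusters
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
y = np.array([0] * 20 + [1] * 20)
print((knn_classify(X, y, K=3) == y).mean())   # expected to be close to 1.0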

D. Linear Regression

Linear Regression is the classical problem of fitting a set of data points y with a linear combination of the predictors. In cuML, Linear Regression on multiple GPUs can take one of two forms: a standard data-parallel Linear Regression algorithm with two possible fit algorithms (SVD and Eig), or a model-parallel Solver class. For the standard data-parallel version, cuML provides the choice of fit algorithms: SVD and Eig. While computing the SVD (singular value decomposition) for a linear system of equations is stable, it tends to be much slower than finding the eigendecomposition of the associated covariance matrix (the solution referred to in cuML as Eig). Nevertheless, for the purposes of this paper, we found SVD to be more useful for taking reproducible benchmarks. For more details on the SVD, refer to the next section on the tSVD algorithm in cuML. Linear Regression was parallelized in cuML by applying a Bcast operation at the outset of training and then applying a Reduce operation at each training step.

E. Truncated SVD (tSVD)

The Truncated Singular Value Decomposition (tSVD) is another widely-used method of dimensionality reduction (in addition to PCA) that is better suited to sparse matrices. Specifically, it is a matrix factorization that generalizes the eigendecomposition of a matrix M, computing:

    M ≈ U_k Σ_k V_kᵀ

where M is a large m × n matrix, U is an m × m unitary matrix, Σ is an m × n rectangular diagonal matrix, and V is an n × n unitary matrix.

tSVD is a less computationally intensive version of the full SVD that only computes the k largest singular values of a large matrix M. Scikit-learn and cuML use a randomized matrix approximation algorithm originated by Halko et al. [17] that consists of two major steps. First, compute an approximation to the range of M with randomization methods. That is, an intermediate matrix Q must be constructed with a small number of orthonormal columns such that:

    M ≈ Q Qᵀ M

This approximate matrix Q may be computed by taking a collection of random vectors {Ω₁, Ω₂, ...} and taking the action of M on each vector. The resulting subspace of actions may be used to form another intermediate matrix Ω, which may be used to compute Q. In the second step, a small intermediate matrix B is constructed such that B = Qᵀ M, and its SVD is computed. Then, since M ≈ Q Qᵀ M = Q(S Σ Vᵀ), it is clear that U = QS constitutes an approximation M ≈ U Σ Vᵀ. In cuML, tSVD is parallelized by performing distributed matrix computations. Specifically, tSVD was parallelized in cuML by applying a Bcast operation at the outset of training, and then applying a sequence of Reduce and Allgather operations to perform the matrix computation.
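The two-step randomized procedure described above can be sketched in a few lines of NumPy. This is an illustration only; Scikit-learn, cuML, and the original Halko et al. algorithm add refinements such as power iterations that are omitted here.

import numpy as np

def randomized_tsvd(M, k, n_oversample=10):
    m, n = M.shape
    # Step 1: approximate the range of M with random projections,
    # then orthonormalize to obtain Q such that M ~= Q Q^T M
    Omega = np.random.randn(n, k + n_oversample)
    Y = M @ Omega
    Q, _ = np.linalg.qr(Y)

    # Step 2: project M into the small subspace, take its (cheap) SVD,
    # and lift the left singular vectors back with U = Q S
    B = Q.T @ M
    S, Sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S
    return U[:, :k], Sigma[:k], Vt[:k]

# Usage example: an exactly rank-10 matrix is recovered by a rank-10 tSVD
rng = np.random.default_rng(0)
M = rng.standard_normal((2000, 10)) @ rng.standard_normal((10, 500))
U, Sigma, Vt = randomized_tsvd(M, k=10)
print(np.allclose(M, U @ np.diag(Sigma) @ Vt))   # True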

IV. SYNTHETIC BENCHMARKING AND HYPERPARAMETER OPTIMIZATION

In order to take distributed training and accuracy performance data, the existing benchmarking and hyperparameter optimization implementations needed to be expanded to support a distributed workload. We were able to refer to the single-GPU benchmarking suite in cuML as a model for distributed synthetic benchmarking.

For synthetic benchmarking, we seek to maximize throughput without exceeding the memory bound of the worker GPUs. Therefore, cuML's GPU equivalent of the popular Scikit-learn function make_blobs() was used to generate random samples to feed into each clustering algorithm (K-Means and K Nearest Neighbors). The make_blobs() function generates a set of random isotropic Gaussian blobs for clustering applications. To simulate a real dataset, each class is allocated one or more normally-distributed clusters of points. User-set parameters may be tuned to modify the center and standard deviation of each generated cluster of points. Given the flexibility of the make_blobs() function, we also applied dimensionality reduction algorithms (tSVD) to the clustered data for our throughput experiments. During the initial throughput study in Figure 10 we set these parameters in tandem with the hyperparameters to maximize GPU memory allocation while forcing enough model complexity to keep the GPU and interconnect under load for the duration of training. For each algorithm in Figure 10, these parameters and the total number of training iterations were kept fixed at each scale.

For the classification applications in cuML (Random Forests), we applied make_classification() to create a random n-class classification problem. The cuML function make_classification() initially generates normally-distributed clusters of points with std = 1 about the vertices of a K-dimensional hypercube with sides of length M, where K and M are user-defined. An equal number of clusters is assigned to each class, while inserting both interdependence and random noise into the data. Features are then horizontally stacked and split into informative (independent) and redundant (linear combinations of informative) features.

For the regression applications in cuML (Linear Regression), we applied make_regression() to generate a set of regression targets as a random linear combination of features with noise. The user may set the level of sparsity and whether the informative features are uncorrelated or follow a low rank-fat tail singular profile (where a few features account for most of the variance). For the purposes of our study, the default uncorrelated informative features are sufficient. Examples of each of the three synthetic benchmarking methods are depicted in Listing 1.

from cuml import make_blobs, make_classification, make_regression

make_blobs(n_samples, n_features, centers, n_parts, cluster_std)

make_classification(n_samples, n_features, n_clusters_per_class, n_informative,
                    random_state, n_classes)

make_regression(n_samples, n_features, random_state)

Listing 1: cuML example code for generating synthetic data for clustering, classification, and regression applications, respectively

In order to ensure that accuracy is not affected by the distributed algorithms in cuML, we trained the K-Means and Random Forest algorithms on the Higgs Boson dataset [18]. We applied the hyperparameter optimization suite in Dask ML to maximize the accuracy achievable at each scale. In particular, we used the adaptive cross-validation algorithm HyperbandSearchCV, which is well suited to specialized processors like GPUs, where the search space of hyperparameters is large. Specifically, Hyperband optimizers seek to spend the most time training high-performing sets of hyperparameters. To achieve this, Hyperband optimizers initialize many models, start to train all of them, and then drop poor-performing models before high-performing models have finished training. One can think of a Hyperband optimizer as a randomized optimizer that drops poor parameter sets before training is finished, thus saving resources. The Dask ML implementation HyperbandSearchCV follows an embarrassingly parallel approach by assigning each model with a unique parameter set to a worker GPU. Since there are many more parameter sets than worker GPUs in our experiments, we prioritize parameter sets based on the model's most recent loss score. This allows HyperbandSearchCV to be more aggressive in dropping poor-performing parameter sets. This is in accordance with the findings of [19].

Listing 2 shows an example HPO run on the cuML K-Means algorithm with HyperbandSearchCV. In this example, we pass K-Means parameter ranges for the number of clusters (n_clusters), the oversampling factor (sample_factors), and the stopping tolerance (tol) to HyperbandSearchCV, which performs at most max_iter training iterations on the best-performing parameter set. Finally, the aggressiveness in culling off the different estimators is passed to HyperbandSearchCV via the user parameter aggressiveness.

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
# HyperbandSearchCV comes from Dask ML; loguniform comes from SciPy
from dask_ml.model_selection import HyperbandSearchCV
from scipy.stats import loguniform

model_kmeans = cuKMeans(init="k-means||", n_clusters=5, random_state=123,
                        oversampling_factor=3)

clusters = [i for i in range(1, 100)]
sample_factors = [i for i in range(1, 5)]

params = {'n_clusters': clusters,
          'oversampling_factor': sample_factors,
          'tol': loguniform(1e-5, 1e-3)}

search = HyperbandSearchCV(model_kmeans, params, max_iter=100, aggressiveness=3,
                           random_state=123)

Listing 2: cuML example code for performing hyperparameter optimization on KMeans
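A hedged sketch of how such a search is typically driven follows, assuming Dask arrays X and y are already partitioned across the worker GPUs; the attribute names follow the standard scikit-learn-style API that HyperbandSearchCV implements.

# Fit the Hyperband search on the distributed training data, then inspect the
# best parameter set it found. X and y are assumed to be Dask collections
# already distributed across the worker GPUs.
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score:  ", search.best_score_)

# The refitted best estimator can be used directly for inference
predictions = search.best_estimator_.predict(X)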

V. OVERVIEW OF THE SOFTWARE STACKS

A. RAPIDS

NVIDIA has recently launched RAPIDS AI, which is a collection of open source software libraries and APIs. It is used to run end-to-end data science analytics pipelines entirely on GPUs. For low-level compute optimizations, it relies on NVIDIA CUDA primitives, and it exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Figure 3: RAPIDS software stack

Figure 3 shows a high-level overview of the RAPIDS stack. As can be seen in this figure, RAPIDS is built on top of CUDA and follows the standard specification of Apache Arrow [20]. Apache Arrow is a cross-language development platform for in-memory analytics. It defines a language-independent columnar memory format for flat and hierarchical data, and it provides efficient analytic operations on both CPUs and GPUs. RAPIDS has three main components: 1) cuDF, a pandas-like dataframe manipulation library used for data preparation; 2) cuML, a collection of machine learning libraries that provides GPU versions of the algorithms available in Scikit-learn; and 3) cuGraph, an accelerated graph analytics library. The Python layer is built on top of these CUDA/C++ layers, and data scientists can easily use it without worrying about the complexities of the underlying layers. In this paper, we mainly focus on the cuML library as we aim to improve the performance of machine learning applications.

(a) cuML components (b) cuML software stack (c) cuML software stack in a distributed setting

Figure 4: cuML components and software stack

B. cuML Library

cuML is a suite of fast, GPU-accelerated machine learning algorithms within the RAPIDS data science ecosystem. cuML is designed for data science and analytical tasks. It enables users to run traditional ML tasks on GPUs without going into the details of CUDA programming. Figure 4(a) shows the three main components of cuML: primitives, machine learning algorithms, and the Python layer. Primitives are reusable building blocks for constructing machine learning algorithms and are common to many of them. Linear algebra is a big part of these primitives, which includes element-wise operations, matrix multiplication, eigendecomposition, etc. The primitives also include statistical functions (such as the mean and standard deviation), random number generation, distance/matrix calculations, etc. These primitives are used in the second layer to build different machine learning algorithms such as Random Forest, K-Means, Linear Regression, etc. Finally, there is a Python layer that provides a Scikit-learn-like interface to expose these algorithms to data scientists.

The cuML library allows execution of ML workloads on a variety of platforms with three configurations: 1) a single GPU, 2) a system with multiple GPUs, called Single-Node Multi-GPU (SNMG), and 3) multiple systems with multiple GPUs, called Multi-Node Multi-GPU (MNMG). Figure 4(b) shows the software stack of cuML on a system with a single GPU. The primitives and machine learning algorithms are built on top of CUDA libraries in the CUDA/C++ layer. Thrust, nvGraph, cuBLAS, and cuRAND are some examples of these CUDA libraries. The CUDA/C++ layer is wrapped by the Cython layer to expose the cuML algorithms. The Python layer is used to integrate cuML with other tools in the community, such as NumPy and cuDF, and, as mentioned earlier, it provides a Scikit-learn-like interface to the user.

C. cuML Library in a Distributed Setting

One of the main advantages of cuML is its support for distributed execution on MNMG systems. For distributed runs, cuML takes advantage of Dask. Dask enables the cuML code in Python to run in parallel across many nodes. To do this, Dask creates a task graph and divides up the code between workers. cuML also uses NCCL-based communication, especially for collective communications between the workers. This way, cuML takes advantage of both the simplicity and flexibility of Dask and the collective algorithms provided by NCCL.

Figure 4(c) shows the cuML software stack in a distributed setting. As can be seen in the figure, both Dask and NCCL are used for communications in cuML. NCCL is mainly used for the collective communications between the workers in the fit() function. On the other hand, Dask is used for communication between the scheduler and workers while also handling point-to-point communications between workers. For example, Dask is used for sending the model parameters from one worker to the others in the predict() function. The reason Dask is used at this stage is that, at the time of sending the model parameters, the root does not have any knowledge of how the data is going to be distributed among the workers/processes. More specifically, this communication happens among a limited number of workers which are not known beforehand.

Another observation from this figure is that Dask uses UCX underneath to handle the communications. In other words, Dask uses UCX to pass CUDA objects between the workers. The advantage of UCX is that it provides the best communication performance to Dask based on the cluster's available hardware.

Figure 5: Dask and NCCL Communication paths in cuML

Figure 5 shows the Dask and NCCL communication paths for cuML in a cluster with three workers. The black dotted

lines show Dask communication paths, while the red lines show NCCL communication paths. An NCCL communicator is created across the worker processes to take advantage of the collective communication support in NCCL. The NCCL communicator is then injected into cumlHandle. cumlHandle is a class in cuML that is used to manage the resources needed for running machine learning algorithms. As can be seen in Figure 5, cumlHandle is used by all workers.

Once the NCCL communicator is attached to the cumlHandle, it can be used as the input of fit() and predict(). This way, the fit() and predict() functions take advantage of NCCL for running collective communications. Note that fit() has the most communication overhead in cuML algorithms. Figure 6 shows the procedure of injecting an NCCL communicator into cumlHandle in Python. As can be seen in the figure, this is done through inject_nccl_comms_py(). Once the NCCL communicator is attached to the cumlHandle, it is passed as the input of fit().

Figure 6: Injecting NCCL communicator to cumlHandle

VI. MPI-BASED COMMUNICATION SUPPORT IN CUML

In this section, we explain adding MPI-based communication support for cuML. More specifically, we discuss how to take advantage of an MPI library for running collective communications in the fit() function and replace the existing NCCL-based communications. Figure 7 shows the software stack of cuML with MPI-based communication support. We use MVAPICH2-GDR as the MPI library in our work due to its efficiency and high performance on GPU clusters, but any other MPI library can be used. We also take advantage of mpi4py [21] as a Python binding library for MPI so that it can be used for running cuML applications written in Python.

Figure 7: cuML software stack with MPI-based communication

As discussed in Section V-C, to support NCCL-based communications in cuML, an NCCL communicator should be injected into cumlHandle. A similar thing should be done for the MPI communicator to add support for MPI-based communications. cuML has a function initialize_mpi_comms which attaches an MPI communicator to cumlHandle. This function is defined in the CUDA/C++ layer, so it cannot be used directly in cuML applications written in Python. To be able to practically use this function for running cuML applications, we have developed a Cython wrapper, MPI_plugin.pyx, shown in Figure 8. This wrapper defines a new function inject_mpi_comms_py() which injects an MPI communicator into cumlHandle in Python. It requires mpi4py to get an MPI communicator, as well as cumlHandle and the function initialize_mpi_comms from cuML. For including cumlHandle and initialize_mpi_comms, the cuML and CUDA libraries are linked to the wrapper through setup.py. We also linked MVAPICH2-GDR, which is required by mpi4py. After compiling MPI_plugin.pyx, we have an extension module that injects an MPI communicator into cumlHandle in Python. Listing 3 shows the details of defining inject_mpi_comms_py() in MPI_plugin.pyx.

Figure 8: A Cython wrapper MPI_plugin.pyx for injecting an MPI communicator into cumlHandle in Python. MPI_plugin.pyx defines a new function inject_mpi_comms_py that wraps initialize_mpi_comms for Python. For this, MPI_plugin.pyx requires mpi4py to get an MPI communicator, as well as cumlHandle and initialize_mpi_comms from cuML.

cimport mpi4py.MPI as MPI
cimport mpi4py.libmpi as libmpi

# Expose the C++ cumlHandle class from the cuML headers
cdef extern from "cuml/cuml.hpp" namespace "ML":
    cdef cppclass cumlHandle:
        cumlHandle()

# Expose the C++ function that attaches an MPI communicator to a cumlHandle
cdef extern from "cuML_comms.hpp" namespace "ML":
    void initialize_mpi_comms(cumlHandle, mpi_comm)

def inject_mpi_comms_py(handle, MPI.Comm comm):
    handle_ = <cumlHandle*> handle.getHandle()
    initialize_mpi_comms(handle_, comm.ob_mpi)

Listing 3: MPI_plugin.pyx
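The text above mentions that the wrapper is built and linked through setup.py. A hypothetical sketch of such a build script is shown below; the include and library paths are placeholders that must point at the local cuML, CUDA, and MVAPICH2-GDR installations, and the build is typically driven with the MPI compiler wrappers (e.g., CC=mpicc CXX=mpicxx).

from setuptools import setup, Extension
from Cython.Build import cythonize
import mpi4py

ext = Extension(
    name="MPI_plugin",
    sources=["MPI_plugin.pyx"],
    language="c++",
    include_dirs=[
        mpi4py.get_include(),            # mpi4py Cython/C headers
        "/path/to/cuml/include",         # cuml/cuml.hpp and cuML_comms.hpp (placeholder)
        "/path/to/cuda/include",         # placeholder
    ],
    library_dirs=["/path/to/cuml/lib", "/path/to/cuda/lib64"],   # placeholders
    libraries=["cuml++", "cudart"],      # link the cuML C++ library and the CUDA runtime
)

setup(name="MPI_plugin", ext_modules=cythonize([ext]))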

Listing 4 shows how inject_mpi_comms_py() is used to inject an MPI communicator into cumlHandle in K-Means. Here, we show the modifications we made to K-Means as an example. We have made similar changes to the other cuML algorithms to make them use MPI-based communications. The listing shows a part of _func_fit(). This function is executed by all the worker processes in the cluster. First, we import the module MPI_plugin that we have created from our Cython code in Listing 3. In _func_fit(), we create an MPI communicator using mpi4py (Lines 3 and 4). This communicator is created between the worker processes that are calling _func_fit(). Then, we import Handle from cuml.common.handle (Line 6) and use it to create a cumlHandle object handle (Line 7). In Line 8, we call inject_mpi_comms_py(), which we have defined in MPI_plugin.pyx. This function gets two inputs, the MPI communicator and the cuML handle, and attaches the communicator to handle. Then, we pass the new handle to cumlKMeans() (Line 12). cumlKMeans() takes care of running the algorithm. It uses the MPI communicator that is attached to handle for executing the collective communications. Note that in Listing 4 we show only the parts of _func_fit() that we have modified to be able to use the MPI-based communications. Other parts of _func_fit() have not been changed, so we did not include them in Listing 4.

1   import MPI_plugin
2   def _func_fit():
3       from mpi4py import MPI
4       mpicomm = MPI.COMM_WORLD
5
6       from cuml.common.handle import Handle
7       handle = Handle()
8       MPI_plugin.inject_mpi_comms_py(handle, mpicomm)
9       .
10      .
11      .
12      cumlKMeans(handle, ...)

Listing 4: Injecting an MPI communicator to cumlHandle in _func_fit(), which is a part of the KMeans code

VII. PERFORMANCE CHARACTERIZATION

A. Experimental Setup

We performed all experiments on the Comet cluster at the San Diego Supercomputer Center. The GPU partition on Comet is composed of 36 nodes, each with 4 NVIDIA Pascal (P100) GPUs. Each P100 has 16 GB of HBM2 memory and is connected to other nodes on the cluster via InfiniBand FDR. Table II shows the detailed specifications of the Comet cluster. We use cuML v0.15 compiled with CUDA 10.1, NCCL 2.7.8-1, and MVAPICH2-GDR 2.3.4.

Table II: Hardware specification of the SDSC Comet cluster

Specification        SDSC Comet
Number of Nodes      36
Processor Family     Xeon Haswell
Processor Model      E5-2680 v3
Clock Speed          2.5 GHz
Sockets              2
Cores per Socket     12
RAM (DDR4)           128 GB
GPU Family           NVIDIA Pascal P100
GPUs per Node        4
GPU Memory           16 GB (HBM2)
Interconnect         InfiniBand FDR (56G)

B. Experimental Results

First, we compare the baseline performance of different ML algorithms when they are executed on a single CPU vs. a single GPU. Figure 9 shows the results for K-Means, Random Forest, Nearest Neighbors, tSVD, and Linear Regression. Scikit-learn and cuML are used for the experiments on the CPU and GPU, respectively. As can be seen in Figure 9, the GPU performs significantly better than the CPU for all the algorithms, as expected.

Figure 9: Training time of different ML algorithms with GPU vs. CPU on the Comet cluster. For all the algorithms, the training time with the GPU is significantly better than with the CPU.

Now that we have confirmed the efficiency of running ML algorithms on a GPU, we discuss the performance of ML algorithms on GPUs under a distributed (MNMG) setting. Figure 10 shows the results on up to 32 GPUs across 8 nodes on the Comet cluster. We compare the performance of NCCL vs. MPI-based communication with MVAPICH2-GDR. One observation from Figure 10 is that as we increase the number of GPUs, the training time reduces. This shows the scalability of cuML in the distributed setting. Another observation from the figure is that MVAPICH2-GDR performs better than NCCL for almost all algorithms, and the performance improvement increases as we increase the number of GPUs. This is because the communication requirement of each algorithm increases as the scale increases. The improved training performance shows that the MPI-based approach provides better scalability compared to the NCCL-based design. The only ML algorithm that performs the same with NCCL and MVAPICH2-GDR is Random Forest. The reason for this is that cuML does not use any collective communication for training Random Forest. Therefore, there is no room to take advantage of MVAPICH2-GDR's efficient design for this algorithm. In order to better understand why MVAPICH2-GDR performs better than NCCL at all scales, we collected the message size of each NCCL collective call in a training run for both K-Means and Nearest Neighbors on 2 Comet nodes (8 GPUs).

[Figure 10 plots: Training Time (s) and Speedup vs. Number of GPUs (1-32) for NCCL and MVAPICH2-GDR: (a) tSVD, (b) K-Means, (c) Random Forest, (d) Linear Regression, (e) Nearest Neighbors]
Figure 10: Training Time for GPU-Accelerated Distributed Algorithms using cuML.
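The per-message-size call counts reported in Figure 11 were collected by profiling the NCCL collective calls issued during training. A rough, hypothetical way to gather a similar histogram at the mpi4py level is sketched below; the wrapper name and the assumption that buffers expose .nbytes are illustrative and not the authors' instrumentation.

from collections import Counter
from mpi4py import MPI

class CollectiveProfiler:
    """Counts collective calls per message size for buffers that expose .nbytes."""
    def __init__(self, comm):
        self.comm = comm
        self.histogram = Counter()      # message size in bytes -> number of calls

    def Allreduce(self, sendbuf, recvbuf, op=MPI.SUM):
        self.histogram[sendbuf.nbytes] += 1
        return self.comm.Allreduce(sendbuf, recvbuf, op=op)

    def Bcast(self, buf, root=0):
        self.histogram[buf.nbytes] += 1
        return self.comm.Bcast(buf, root=root)

    def Reduce(self, sendbuf, recvbuf, op=MPI.SUM, root=0):
        self.histogram[sendbuf.nbytes] += 1
        return self.comm.Reduce(sendbuf, recvbuf, op=op, root=root)

# Usage: wrap the communicator used by the training code, then inspect
# profiler.histogram after a training run.
profiler = CollectiveProfiler(MPI.COMM_WORLD)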
From this profiling information in Figure 11, we can see that the vast majority of collective calls are for the small to medium message size range. Note that we observed the same behavior for other cuML algorithms that use collectives, such as tSVD and Linear Regression. We believe this to be the case for two primary reasons:

• For many distributed cuML algorithms like K-Means and Nearest Neighbors, the update information at each training step consists of many small messages (e.g., for K-Means, only the centroid information is shared between workers at each step). This allows the cuML algorithms to scale to many GPUs without significantly increasing the ratio of communication to computation.
• At the Dask level, each message is chunked before distributing to individual workers. This increases the volume of collective calls while reducing the size of each individual message.

Figure 11: Profile of different collective operations in cuML algorithms: K-Means and Nearest Neighbors (number of calls per message size for Allreduce in K-Means, and for Bcast and Reduce in Nearest Neighbors).

Given that the communication overhead for cuML algorithms consists of many small messages, we can take advantage of MVAPICH2-GDR's improved small-message algorithms compared to NCCL, as in Figure 1. Further, while cuML's small-message strategy does keep communication overheads low for increasing node counts, the increased number of collective calls allows MVAPICH2-GDR to improve its speedup compared to NCCL for a large number of worker GPUs, as depicted in Figure 10.

Finally, we applied the hyperparameter optimization (HPO) library discussed earlier in Section IV to the real-world Higgs dataset [18]. We used 30% of the dataset to perform HPO at each scale (1-32 GPUs). The accuracy on the test dataset was taken with the default model parameters for Random Forest and K-Means, then HPO was performed embarrassingly parallel via Dask ML's HyperbandSearchCV across the job's worker GPUs. The accuracy results are depicted in Figure 12. The accuracy with the optimal parameters for K-Means and Random Forests shows an approximate 5% improvement over the default model.

Figure 12: Accuracy Numbers ((a) K-Means, (b) Random Forest)

We used K-Means and Random Forests on Higgs because:
• They are representative of different communication patterns in cuML (K-Means following a data-parallel approach, and Random Forests being embarrassingly parallel)

• Both algorithms have enough hyperparameters to stress the HPO implementation in Dask ML, while having a broad parameter space to enable accuracy benefits

VIII. RELATED WORK

González et al. [22] present a review of bagging and boosting ML algorithms. XGBoost [23][24] is a gradient boosting library that supports both distributed—through Dask—and GPU-based execution. H2O3 [25] is another ML library that is capable of distributed computation. H2O4GPU [26] is a variant with shared-memory GPU support. While computational support for ML workloads has been around for a long time in the form of software libraries and packages, support for distributed execution on a cluster of GPUs is still in its infancy. As we noted above, XGBoost is one such library that provides boosting algorithms. However, a wide range of new and existing ML algorithms are still being investigated for efficient Multi-Node Multi-GPU execution—this is the main motivation for the development of the cuML library. Deep Learning frameworks like TensorFlow [12] and PyTorch [27] have support for some ML tools and techniques. However, we only focus on specialized ML libraries in this paper. Raschka et al. [28] provide a survey of machine learning in Python, including Scikit-learn training on CPUs and the RAPIDS ecosystem. They also discuss the need and motivation for the cuML library. The RAPIDS framework—and the cuML library—has been gaining traction in the community as a viable option to support the execution of performance-hungry applications on a cluster of GPUs. For example, Napoli et al. [29] applied the RAPIDS [30] and Dask [31] ecosystems to analyze data for geophysics simulations. While all of these studies exploit the RAPIDS framework with cuML to exploit multiple NVIDIA GPUs, none of them provide insight into improving communication performance, nor explore other viable communication options. We fill this gap—in this paper—by proposing MVAPICH2-GDR as an alternative to NCCL in the Multi-Node Multi-GPU setting. We also attempt to characterize our evaluation for these GPU-aware messaging libraries to gain insight into the scaling behavior of ML applications in the cuML library.

IX. CONCLUSION AND FUTURE WORK

In this paper, we add support for MPI-based communications for cuML applications in Python. For this, we propose a Python API that takes advantage of MPI calls to handle collective communications between processes in cuML. Moreover, we provide a synthetic benchmarking suite and an in-depth analysis of cuML applications to understand their behavior and identify the regions that can be improved using MPI-based communications. We believe these analyses and characterizations provide a comprehensive guideline for cuML application developers and data scientists to take the most advantage of the features provided by cuML. Finally, we compare the performance of the proposed MPI-based communication approach with the NCCL-based communication design that is used by default in cuML. The experimental results on state-of-the-art cuML algorithms show that the proposed MPI-based communication approach gives up to 1.6x, 1.25x, 1.25x, and 1.36x speedup for K-Means, Nearest Neighbors, Linear Regression, and tSVD, respectively. For future work, we would like to explore the impact of the MPI-based communication support for real ML applications and larger data sets.

ACKNOWLEDGEMENT

We would like to thank Arpan Jain for assisting us in the initial setup of cuML. This research is supported in part by NSF grants #1450440, #1565414, #1664137, #1818253, #1854828, #1931537, #2007991, and XRAC grant #NCR-130002.

REFERENCES

[1] D. Guest, K. Cranmer, and D. Whiteson, "Deep learning and its application to LHC physics," Annual Review of Nuclear and Particle Science, vol. 68, pp. 161-181, 2018.
[2] M. R. Blanton, M. A. Bershady, B. Abolfathi, F. D. Albareti, C. Allende Prieto, A. Almeida, J. Alonso-García, F. Anders, S. F. Anderson, B. Andrews, et al., "Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe," The Astronomical Journal, vol. 154, p. 28, Jul. 2017.
[3] H. Ledford, "How Facebook, Twitter and other data troves are revolutionizing social science," Nature, vol. 582, pp. 328-330, Jun. 2020.
[4] H. Fröhlich, R. Balling, N. Beerenwinkel, O. Kohlbacher, S. Kumar, T. Lengauer, H. M. Maathuis, Y. Moreau, A. S. Murphy, M. T. Przytycka, M. Rebhan, H. Röst, A. Schuppert, M. Schwab, R. Spang, D. Stekhoven, J. Sun, A. Weber, D. Ziemek, and B. Zupan, "From hype to reality: data science enabling personalized medicine," BMC Medicine, 2018.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[6] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, "MLlib: Machine learning in Apache Spark," Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235-1241, Jan. 2016.
[7] R. Anil, S. Owen, T. Dunning, and E. Friedman, Mahout in Action, Manning Publications Co., Greenwich, CT, 2010. [Online]. Available: http://manning.com/owen/

[8] S. Raschka, J. T. Patterson, and C. Nolet, "Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence," ArXiv, vol. abs/2002.04803, 2020.
[9] "MPI-3 Standard Document," http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
[10] MVAPICH2-GDR Development Team, "MVAPICH2-GDR User Guide," http://mvapich.cse.ohio-state.edu/userguide/gdr.
[11] A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda, "HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow," arXiv preprint arXiv:1911.05146, 2019. [Online]. Available: http://arxiv.org/abs/1911.05146
[12] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265-283. [Online]. Available: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
[13] NVIDIA, "RAPIDS Accelerator For Apache Spark," https://github.com/NVIDIA/spark-rapids.
[14] D. Panda et al., "OSU microbenchmarks v5.6.3," http://mvapich.cse.ohio-state.edu/benchmarks/.
[15] P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin, M. Dubman, G. Shainer, R. L. Graham, L. Liss, Y. Shahar, S. Potluri, D. Rossetti, D. Becker, D. Poole, C. Lamb, S. Kumar, C. Stunkel, G. Bosilca, and A. Bouteiller, "UCX: An open source framework for HPC network APIs and beyond," in 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, 2015, pp. 40-43.
[16] A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni, and D. K. Panda, "Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation," arXiv preprint arXiv:1810.11112, 2018.
[17] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions," 2009.
[18] N. Becker, A. Dattagupta, E. Fajardo, P. Gali, B. Rhodes, B. Richardson, and B. Suryadevara, "Streamlined and accelerated cyber analyst workflows with CLX and RAPIDS," in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 2011-2015.
[19] S. Sievert, T. Augspurger, and M. Rocklin, "Better and faster hyperparameter optimization with Dask," in Proceedings of the 18th Python in Science Conference, C. Calloway, D. Lippa, D. Niederhut, and D. Shupe, Eds., 2019, pp. 118-125.
[20] Apache Software Foundation, "Arrow: A cross-language development platform for in-memory data," https://arrow.apache.org.
[21] L. D. Dalcin, R. R. Paz, P. A. Kler, and A. Cosimo, "Parallel distributed computing using Python," Advances in Water Resources, vol. 34, no. 9, pp. 1124-1139, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0309170811000777
[22] S. González, S. García, J. Del Ser, L. Rokach, and F. Herrera, "A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities," Information Fusion, vol. 64, pp. 205-237, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253520303195
[23] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), New York, NY, USA: Association for Computing Machinery, 2016, pp. 785-794. [Online]. Available: https://doi.org/10.1145/2939672.2939785
[24] XGBoost Development Team, "XGBoost Library," https://xgboost.readthedocs.io/en/latest/.
[25] H2O Development Team, "H2O3: Distributed, fast, and scalable machine learning software," http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html.
[26] N. Gill, E. LeDell, and Y. Tang, "H2O4GPU: Machine Learning with GPUs in R and Python," https://github.com/h2oai/h2o4gpu.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[28] S. Raschka, J. Patterson, and C. Nolet, "Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence," Information, vol. 11, no. 4, p. 193, Apr. 2020. [Online]. Available: http://dx.doi.org/10.3390/info11040193
[29] O. O. Napoli, V. M. do Rosario, J. P. Navarro, P. M. C. e Silva, and E. Borin, "Accelerating multi-attribute unsupervised seismic facies analysis with RAPIDS," 2020.
[30] NVIDIA, "RAPIDS: Open GPU Data Science Framework," http://rapids.ai.
[31] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proceedings of the 14th Python in Science Conference, K. Huff and J. Bergstra, Eds., 2015, pp. 130-136.
