Accelerating microbiome research with OpenACC
Igor Sfiligoi – University of California San Diego
in collaboration with
Daniel McDonald and Rob Knight
OpenACC Summit 2020 – Sept 2020
Accelerating microbiome research with OpenACC
We are what we eat
Studies have demonstrated a clear link between:
• Gut microbiome
• General human health
https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
UniFrac distance
Need to understand how similar pairs of microbiome samples are with respect to the evolutionary histories of the organisms.
UniFrac distance matrix
• Samples where the organisms are all very similar from an evolutionary perspective will have a small UniFrac distance.
• On the other hand, two samples composed of very different organisms will have a large UniFrac distance.
Lozupone and Knight, Applied and Environmental Microbiology, 2005
Computing UniFrac
• Matrix can be computed using a striped pattern
• Each stripe can be computed independently
• Easy to distribute over many compute units (see the sketch below)
[Figure: the distance matrix stripes partitioned between two compute units, CPU1 and CPU2]
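The deck does not show the distribution code itself; the following is a minimal sketch of the idea only, with hypothetical names (n_stripes, n_workers) that are not part of UniFrac: each compute unit gets a contiguous, independent range of stripes.

// Illustrative sketch: splitting the stripe range across compute units.
// n_stripes and n_workers are assumed names, not UniFrac's actual API.
#include <cstdio>

int main() {
    const unsigned int n_stripes = 1000;  // stripes in the distance matrix
    const unsigned int n_workers = 2;     // e.g. the CPU1 / CPU2 split in the figure

    for (unsigned int w = 0; w < n_workers; w++) {
        // Each worker computes an independent, contiguous [start, stop) range
        const unsigned int start = (n_stripes * w) / n_workers;
        const unsigned int stop  = (n_stripes * (w + 1)) / n_workers;
        std::printf("worker %u: stripes [%u, %u)\n", w, start, stop);
    }
    return 0;
}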
Computing UniFrac
• Most of the compute is localized in a tight loop
• Operating on a stripe range

for (unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for (unsigned int j = 0; j < n_samples / 4; j++) {
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1-v1)*length;
        dm_stripe[k+1] += (u2-v2)*length;
    }
}

Invoked many times with distinct emb[:] buffers
Computing UniFrac
• Most of the compute is localized in a tight loop
• Operating on a stripe range

for (unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for (unsigned int j = 0; j < n_samples / 4; j++) {
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1-v1)*length;
        dm_stripe[k+1] += (u2-v2)*length;
    }
}

Intel Xeon E5-2680 v4 CPU (using all 14 cores): 800 minutes (~13 hours) on the modest-size EMP dataset
Invoked many times with distinct emb[:] buffers
Porting to GPU
• OpenACC makes it trivial to port to GPU compute.
• Just decorate with a pragma.
• But minor refactoring was needed to get a unified buffer (was an array of pointers); see the sketch after the code.

#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int j = 0; j < n_samples / 4; j++) {
        int idx = (stripe-start_idx)*n_samples;
        double *dm_stripe = dm_stripes_buf+idx;
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1-v1)*length;
        dm_stripe[k+1] += (u2-v2)*length;
    }
}

Invoked many times with distinct emb[:] buffers
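The refactoring itself is not shown on the slide; the sketch below is only an illustration, under stated assumptions (all names except dm_stripes_buf, start_idx and n_samples are invented), of going from an array of per-stripe pointers to one contiguous buffer that a single OpenACC data clause can cover, with each stripe found via the same (stripe - start_idx) * n_samples offset used in the kernel above.

// Sketch of the array-of-pointers -> unified-buffer refactoring (illustrative only).
#include <cstddef>
#include <vector>

// Before (CPU-only): one separately allocated row per stripe; hard to expose
// to the GPU as a single object.
std::vector<double*> make_stripe_pointers(unsigned int n_stripes, unsigned int n_samples) {
    std::vector<double*> dm_stripes(n_stripes);
    for (unsigned int s = 0; s < n_stripes; s++)
        dm_stripes[s] = new double[n_samples]();  // caller must delete[] each row
    return dm_stripes;
}

// After (GPU-friendly): one contiguous buffer covering all stripes.
std::vector<double> make_stripe_buffer(unsigned int n_stripes, unsigned int n_samples) {
    return std::vector<double>(static_cast<std::size_t>(n_stripes) * n_samples, 0.0);
}

// Locating a stripe then matches the kernel's indexing.
inline double* stripe_row(double* dm_stripes_buf, unsigned int stripe,
                          unsigned int start_idx, unsigned int n_samples) {
    return dm_stripes_buf + static_cast<std::size_t>(stripe - start_idx) * n_samples;
}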
Porting to GPU
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int j = 0; j < n_samples / 4; j++) {
        int idx = (stripe-start_idx)*n_samples;
        double *dm_stripe = dm_stripes_buf+idx;
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1-v1)*length;
        dm_stripe[k+1] += (u2-v2)*length;
    }
}

Modest-size EMP dataset. NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (~1.5 hours); was 13h on CPU.
• OpenACC makes it trivial to port to GPU compute.
• Just decorate with a pragma.
• But minor refactoring was needed to get a unified buffer (was an array of pointers).
Invoked many times with distinct emb[:] buffers
Optimization step 1
Modest-size EMP dataset: 92 mins before, was 13h on CPU
• Cluster reads and minimize writes (see the allocation sketch after the code)
  • Fewer kernel invocations
  • Memory writes are much more expensive than memory reads.
• Also undo manual unrolls
  • Were optimal for CPU
  • Bad for GPU
• Properly align memory buffers
  • Up to 5x slowdown when not aligned

#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int k = 0; k < n_samples; k++) {
        …
        double my_stripe = dm_stripe[k];
        #pragma acc loop seq
        for (unsigned int e = 0; e < filled_embs; e++) {
            uint64_t offset = n_samples*e;
            double u = emb[offset+k];
            double v = emb[offset+k+stripe+1];
            my_stripe += (u-v)*length[e];
        }
        …
        dm_stripe[k] = my_stripe;   // single write-back per element
    }
}

NVIDIA Tesla V100 (using all 84 SMs): 33 minutes
Invoked fewer times due to batched emb[:] buffers
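The slide does not show how the buffers are aligned; the helper below is a minimal sketch of one common way to do it, where the 4 KiB boundary and the function name are assumptions rather than the values UniFrac actually uses. The motivation is the measurement above: up to a 5x slowdown when the buffers are not aligned.

// Illustrative aligned-allocation helper; the alignment value is an assumption.
#include <cstddef>
#include <cstdlib>

double* alloc_aligned_doubles(std::size_t count) {
    const std::size_t alignment = 4096;  // assumed page-sized boundary
    std::size_t bytes = count * sizeof(double);
    // std::aligned_alloc requires the size to be a multiple of the alignment
    bytes = ((bytes + alignment - 1) / alignment) * alignment;
    return static_cast<double*>(std::aligned_alloc(alignment, bytes));
}
// Release with std::free(), not delete[].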
Optimization step 2
• Reorder loops to maximize cache reuse (the index arithmetic is sketched after the code).

#pragma acc parallel loop collapse(3) present(emb,dm_stripes_buf,length)
for (unsigned int sk = 0; sk < sample_steps; sk++) {
    for (unsigned int stripe = start; stripe < stop; stripe++) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk*step_size + ik;
            …
            double my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples*e;
                double u = emb[offset+k];
                double v = emb[offset+k+stripe+1];
                my_stripe += (u-v)*length[e];
            }
            …
            dm_stripe[k] = my_stripe;   // single write-back per element
        }
    }
}

NVIDIA Tesla V100 (using all 84 SMs): 12 minutes; was 33 mins
Modest-size EMP dataset
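The mapping between the new sk/ik loops and the original sample index k is easy to sanity-check on the CPU; the sketch below assumes n_samples has been padded to a multiple of step_size (how the real code handles any remainder is not shown on the slide). The likely cache benefit is that all stripes processed within the same sample block re-read the same emb[offset+k] values.

// Sanity-check sketch for the loop tiling: k = sk*step_size + ik visits every
// sample index exactly once, assuming n_samples % step_size == 0.
#include <cassert>
#include <vector>

int main() {
    const unsigned int n_samples = 1024;  // assumed (padded) sample count
    const unsigned int step_size = 64;    // assumed tile width
    const unsigned int sample_steps = n_samples / step_size;

    std::vector<int> hits(n_samples, 0);
    for (unsigned int sk = 0; sk < sample_steps; sk++)
        for (unsigned int ik = 0; ik < step_size; ik++)
            hits[sk * step_size + ik]++;  // same index arithmetic as the kernel

    for (unsigned int k = 0; k < n_samples; k++)
        assert(hits[k] == 1);             // each k covered exactly once
    return 0;
}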
Optimization step 2
• Reorder loops to maximize cache reuse.
• CPU code also benefitted from the optimization.

#pragma acc parallel loop collapse(3) present(emb,dm_stripes_buf,length)
for (unsigned int sk = 0; sk < sample_steps; sk++) {
    for (unsigned int stripe = start; stripe < stop; stripe++) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk*step_size + ik;
            …
            double my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples*e;
                double u = emb[offset+k];
                double v = emb[offset+k+stripe+1];
                my_stripe += (u-v)*length[e];
            }
            …
            dm_stripe[k] = my_stripe;   // single write-back per element
        }
    }
}

NVIDIA Tesla V100 (using all 84 SMs): 12 minutes; was 33 mins (modest-size EMP dataset)
Xeon E5-2680 v4 CPU (using all 14 cores): 193 minutes (~3 hours); originally 13h on the same CPU
20x speedup on modest EMP dataset
(times in minutes)   E5-2680 v4 CPU       GPU    GPU     GPU     GPU    GPU
                     Original   New       V100   2080TI  1080TI  1080   Mobile 1050
fp64                 800        193       12     59      77      99     213
fp32                 -          190       9.5    19      31      36     64
Using fp32 adds an additional boost, especially on gaming and mobile GPUs
20x V100 GPU vs Xeon CPU + 4x from general optimization
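The slides do not show how the fp32 path is implemented; the sketch below is purely an assumption about how a single source could serve both precisions, by templating a step-1 style kernel on the floating-point type. It is not the actual UniFrac code.

// Hypothetical sketch: one kernel body shared by fp64 and fp32 via a template.
#include <cstdint>

template <typename TFloat>  // TFloat = double (fp64) or float (fp32)
void unifrac_stripes(TFloat* dm_stripes_buf, const TFloat* emb, const TFloat* length,
                     unsigned int start, unsigned int stop, unsigned int start_idx,
                     unsigned int n_samples, unsigned int filled_embs) {
    #pragma acc parallel loop collapse(2) present(emb, dm_stripes_buf, length)
    for (unsigned int stripe = start; stripe < stop; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            TFloat* dm_stripe = dm_stripes_buf
                                + (uint64_t)(stripe - start_idx) * n_samples;
            TFloat my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                const uint64_t offset = (uint64_t)n_samples * e;
                const TFloat u = emb[offset + k];
                const TFloat v = emb[offset + k + stripe + 1];
                my_stripe += (u - v) * length[e];
            }
            dm_stripe[k] = my_stripe;  // single write-back
        }
    }
}

// Explicit instantiations matching the fp64 / fp32 rows in the tables.
template void unifrac_stripes<double>(double*, const double*, const double*,
                                      unsigned int, unsigned int, unsigned int,
                                      unsigned int, unsigned int);
template void unifrac_stripes<float>(float*, const float*, const float*,
                                     unsigned int, unsigned int, unsigned int,
                                     unsigned int, unsigned int);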
140x speedup on a cutting-edge 113k-sample dataset
Per chip (in minutes)        128x E5-2680 v4 CPU    128x GPU  4x GPU  16x GPU  16x GPU
                             Original   New         V100      V100    2080TI   1080TI
fp64                         415        97          14        29      184      252
fp32                         -          91          12        20      32       82

Aggregated (in chip hours)   128x E5-2680 v4 CPU    128x GPU  4x GPU  16x GPU  16x GPU
                             Original   New         V100      V100    2080TI   1080TI
fp64                         890        207         30        1.9     49       67
fp32                         -          194         26        1.3     8.5      22
140x V100 GPUs vs Xeon CPUs + 4.5x from general optimization
140x speedup on a cutting-edge 113k-sample dataset

Per chip (in minutes)        128x E5-2680 v4 CPU    128x GPU  4x GPU  16x GPU  16x GPU
                             Original   New         V100      V100    2080TI   1080TI
fp64                         415        97          14        29      184      252
fp32                         -          91          12        20      32       82

Aggregated (in chip hours)   128x E5-2680 v4 CPU    128x GPU  4x GPU  16x GPU  16x GPU
                             Original   New         V100      V100    2080TI   1080TI
fp64                         890        207         30        1.9     49       67
fp32                         -          194         26        1.3     8.5      22

The 128x E5-2680 v4 setup is a largish CPU cluster, while the 4x V100 setup fits in a single node.
22x speedup on consumer GPUs
Per chip (in minutes)        128x E5-2680 v4 CPU    128x GPU  4x GPU  16x GPU  16x GPU
                             Original   New         V100      V100    2080TI   1080TI
fp64                         415        97          14        29      184      252
fp32                         -          91          12        20      32       82

Aggregated (in chip hours)   128x E5-2680 v4 CPU    128x GPU  4x GPU  16x GPU  16x GPU
                             Original   New         V100      V100    2080TI   1080TI
fp64                         890        207         30        1.9     49       67
fp32                         -          194         26        1.3     8.5      22

22x 2080TI GPUs vs Xeon CPUs, 4.5x from general optimization
Consumer GPUs are slower than server GPUs but still faster than CPUs (memory bound)
Desiderata
• Support for array of pointers
  • Was able to work around it, but annoying
• Better multi-GPU support
  • Currently handled with multiple processes + a final merge (a device-selection sketch follows this list)
• (Better) AMD GPU support
  • GCC theoretically has it, but performance in tests was dismal
• Non-Linux support
  • Was not able to find an OpenACC compiler for macOS or Windows
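Multi-GPU runs are currently handled with multiple processes plus a final merge; the sketch below shows one possible per-process device-selection pattern using the standard OpenACC runtime API. It is an illustration only, not the actual UniFrac code, and the rank value is assumed to come from whatever launches the processes (an environment variable, MPI, etc.).

// Hypothetical per-process GPU selection with the OpenACC runtime API.
#include <openacc.h>

void select_gpu_for_this_process(int rank) {
    const int n_gpus = acc_get_num_devices(acc_device_nvidia);
    if (n_gpus > 0)
        acc_set_device_num(rank % n_gpus, acc_device_nvidia);  // round-robin over GPUs
    else
        acc_set_device_type(acc_device_host);                  // fall back to the CPU path
}

Each process would then compute its own stripe range, with the partial results merged at the end, as described above.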
Conclusions
• OpenACC made porting UniFrac to GPUs extremely easy
  • With a single code base
• Some additional optimizations were needed to get the maximum benefit
  • But most were needed for the CPU-only code path, too
• Performance on NVIDIA GPUs is great
  • But wondering what to do about AMD GPUs and GPUs on non-Linux systems
Acknowledgments
This work was partially funded by US National Science
Foundation (NSF) grants OAC-1826967, OAC-1541349
and CNS-1730158, and by US National Institutes of
Health (NIH) grant DP1-AT010885.
