
The Challenge of Scaling Genome Big Data Analysis
Software on TH-2 Supercomputer
Shaoliang Peng*¹, Xiangke Liao¹, Canqun Yang¹, Yutong Lu¹, Jie Liu¹,
Yingbo Cui¹, Heng Wang¹, Chengkun Wu¹, Bingqiang Wang²
¹ School of Computer Science, National University of Defense Technology, Changsha 410073, China
² National Supercomputing Center in Shenzhen, Shenzhen 518055, China
*Corresponding author: pengshaoliang@nudt.edu.cn
ABSTRACT
Whole genome re-sequencing plays a crucial role in biomedical
studies. The emergence of genomic big data calls for an enormous
amount of computing power. However, current computational
methods are inefficient in utilizing the available computational
resources. In this paper, we address this challenge by optimizing
the utilization of the fastest supercomputer in the world, the TH-2
supercomputer. TH-2 features a neo-heterogeneous architecture in
which each compute node is equipped with 2 Intel Xeon CPUs and
3 Intel Xeon Phi coprocessors. This heterogeneity, together with
the massive amount of data to be processed, poses great challenges
for deploying the genome analysis software pipeline on TH-2.
Runtime profiling shows that SOAP3-dp and SOAPsnp are the most
time-consuming components of a typical genome analysis pipeline,
accounting for up to 70% of the total runtime. To optimize the
whole pipeline, we first devise a number of parallelization and
optimization strategies for SOAP3-dp and SOAPsnp so that each
node fully utilizes the hardware resources of both the CPUs and
the MICs. We also employ several scaling methods to reduce
communication between nodes. We then scaled our method up on
TH-2. With 8,192 nodes, the whole analysis of a 300 TB dataset of
whole genome sequences from 2,000 humans took 8.37 hours,
whereas it could take as long as 8 months on a commodity server,
a speedup of about 700x.
Keywords
Parallel optimization; TH-2 supercomputer; sequence alignment;
SNP detection; whole genome re-sequencing.
1. INTRODUCTION
Whole genome re-sequencing refers to sequencing the genomes of
individuals of a species for which a reference genome exists,
followed by the relevant bioinformatics analysis. Using whole
genome re-sequencing, researchers can obtain a wealth of
information on variations such as single nucleotide polymorphisms
(SNPs), copy number variations (CNVs), and structural variations
(SVs). It can be applied in medical genomics, population genetics,
association analysis, and evolutionary analysis. The amount of
sequencing data is growing rapidly due to decreasing costs.
However, if these data cannot be analyzed efficiently, a large
amount of useful information will never be discovered or utilized.
Extremely powerful computers are needed to help biologists
handle biological big data [1]. BGI, one of the top three gene
sequencing institutions in the world, produces more than 6 TB of
sequencing data every day. Performing a typical processing
pipeline on these data on a single server can take up to two months.
At this level of efficiency, it would be impossible to fulfill
ambitious goals such as mining valuable information from
population-scale studies. For instance, the DNA sequences of
2,000 people constitute a dataset as large as 300 TB, and the
ongoing million-people genome project will need to analyze
500 PB of DNA sequences.
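These figures are consistent with a simple back-of-the-envelope
estimate (our assumptions, not stated above: a roughly 3 Gbp human
genome sequenced at about 50x depth, stored at roughly one byte
per base):

\[
2000 \ \text{people} \times \underbrace{3\times 10^{9}\,\text{bases} \times 50}_{\approx 150\ \text{GB per person}} \approx 300\ \text{TB}.
\]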
Our analysis identifies the sequence alignment and variant
detection tools as the performance bottleneck: SOAP3-dp [2] and
SOAPsnp [3] together occupy up to 70% of the total execution time
(Fig. 1). Therefore, the whole pipeline would benefit if we
optimized both components by exploiting parallelism on the
available computing platforms.
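To see what this 70% figure implies, consider a standard Amdahl's-
law bound (our illustration; the profiling above supplies only the
fraction): if a fraction \(p \approx 0.7\) of the runtime is spent
in SOAP3-dp and SOAPsnp and these two tools are accelerated by a
factor \(s\), the pipeline speedup is

\[
S_{\text{pipeline}}(s) = \frac{1}{(1-p) + p/s} \le \frac{1}{1-p} \approx 3.3 .
\]

Hence accelerating the two tools alone cannot deliver large
end-to-end gains; it must be combined with scaling all stages of
the pipeline across many nodes, which motivates the two-level
(intra-node and inter-node) strategy pursued in this paper.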
The TH-2 supercomputer, developed by the National University
of Defense Technology (NUDT), is located in the National
Supercomputing Center in Guangzhou. TH-2 features a so-called
neo-heterogeneous architecture in which each node contains
2 multi-core Xeon CPUs and 3 many-core Xeon Phi MICs. The
system comprises a total of 16,000 such nodes.
In this paper, we aim to address the above-mentioned biological
big data problem by carrying out parallel optimization and scaling
of the key components identified above and deploying the
optimized pipeline on TH-2. We propose a set of algorithms and
parallel strategies for intra-node and inter-node computation.
Intra-node: A number of strategies were proposed to fully use the
3 MICs and 2 CPUs on each compute node, including three-channel
I/O latency hiding, elimination of redundant computation,
spatial-temporal compression, vectorization with 512-bit-wide
SIMD instructions, and a CPU/MIC collaborative parallel computing
method (sketched below). These optimizations aim at improving
the scalability of our pipeline within a single compute node.
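As a concrete illustration of the CPU/MIC collaborative model,
the sketch below shows one way a node-level driver could split a
batch of reads between the host CPUs and the three coprocessors,
using the Intel offload pragmas available on the Xeon Phi (Knights
Corner) cards of TH-2. The function align_chunk, the static work
split, and the one-byte-per-element layout are illustrative
assumptions, not our actual implementation:

    /* Minimal sketch of CPU/MIC collaborative processing on one node.
     * align_chunk and the static work split are hypothetical. */
    #include <omp.h>

    #define NUM_MIC 3

    /* compiled for both host and MIC so it can run on either side */
    __attribute__((target(mic)))
    void align_chunk(const char *reads, long n);

    void process_node(const char *reads, long total)
    {
        long mic_share = total / (NUM_MIC + 2);     /* assumed split */
        long cpu_share = total - NUM_MIC * mic_share;

        #pragma omp parallel num_threads(NUM_MIC + 1)
        {
            int t = omp_get_thread_num();
            if (t == 0) {
                align_chunk(reads, cpu_share);      /* host CPUs */
            } else {
                int dev = t - 1;                    /* one host thread per MIC */
                const char *p = reads + cpu_share + (long)dev * mic_share;
                /* ship the slice over PCIe and compute on the coprocessor */
                #pragma offload target(mic : dev) in(p[0:mic_share])
                align_chunk(p, mic_share);
            }
        }
    }

In the real pipeline the split would be tuned to the relative
throughput of the devices, and align_chunk would itself be
multi-threaded on each device.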
Inter-node: We split the genome analysis tasks and data across the
nodes of TH-2 as evenly as possible. In addition, we propose a
number of methods to reduce communication between nodes by
exploiting various characteristics of genome data, such as
sequence sorting, fast window iteration, and scalable multi-level
parallelism. A sketch of the even task split follows this
paragraph.
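The even split can be expressed as a standard MPI block
decomposition, sketched below. The total read count and the
per-node driver process_node_slice are hypothetical placeholders;
the point is that each node computes its own slice boundaries and
reads that slice of the input directly, so no central node has to
scatter hundreds of terabytes of sequence data:

    /* Sketch of an even inter-node task split with MPI.
     * total_reads and process_node_slice are illustrative. */
    #include <mpi.h>

    void process_node_slice(long first, long count);  /* hypothetical driver */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long total_reads = 3000000000L;               /* assumed workload */
        long base = total_reads / nprocs;
        long rem  = total_reads % nprocs;
        /* the first `rem` ranks take one extra read, keeping the split even */
        long mine  = base + (rank < rem ? 1 : 0);
        long first = rank * base + (rank < rem ? rank : rem);

        process_node_slice(first, mine);

        MPI_Finalize();
        return 0;
    }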
We evaluated the performance of our optimized pipeline at scales
ranging from 512 to 8,192 nodes on TH-2. The results are
encouraging. Using 8,192 nodes, we were able to finish the
analysis of a 300 TB whole genome sequencing dataset within
8.37 hours, which would take as long as 8 months on a commodity
server. The speedup is about 700x.
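As a consistency check on the quoted figure (assuming 30-day
months, an assumption on our part): 8 months is roughly
\(8 \times 30 \times 24 = 5760\) hours, and
\(5760 / 8.37 \approx 690\), in line with the reported ~700x.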