
The Challenge of Scaling Genome Big Data Analysis
Software on TH-2 Supercomputer
Shaoliang Peng*¹, Xiangke Liao¹, Canqun Yang¹, Yutong Lu¹, Jie Liu¹,
Yingbo Cui¹, Heng Wang¹, Chengkun Wu¹, Bingqiang Wang²
¹ School of Computer Science, National University of Defense Technology, Changsha 410073, China
² National Supercomputing Center in Shenzhen, Shenzhen 518055, China
*Corresponding author: pengshaoliang@nudt.edu.cn
ABSTRACT
Whole genome re-sequencing plays a crucial role in biomedical
studies. The emergence of genomic big data calls for an enormous
amount of computing power. However, current computational
methods are inefficient in utilizing the available computational
resources. In this paper, we address this challenge by optimizing
the utilization of the fastest supercomputer in the world, the TH-2
supercomputer. TH-2 features a neo-heterogeneous architecture in
which each compute node is equipped with 2 Intel Xeon CPUs and
3 Intel Xeon Phi coprocessors. This heterogeneity, together with
the massive amount of data to be processed, poses great challenges
for deploying the genome analysis software pipeline on TH-2.
Runtime profiling shows that SOAP3-dp and SOAPsnp are the most
time-consuming components of a typical genome analysis pipeline,
accounting for up to 70% of the total runtime. To optimize the
whole pipeline, we first devise a number of parallelization and
optimization strategies for SOAP3-dp and SOAPsnp so that each
node fully utilizes the hardware resources of both the CPUs and
the MICs. We also employ several scaling methods to reduce
communication between nodes. We then scaled our method up on
TH-2. With 8,192 nodes, the whole analysis of a 300 TB dataset of
whole genome sequences from 2,000 humans took 8.37 hours,
whereas it could take as long as 8 months on a commodity server,
a speedup of about 700x.
Keywords
Parallel optimization; TH-2 supercomputer; sequence alignment;
SNP detection; whole genome re-sequencing.
1. INTRODUCTION
Whole genome re-sequencing refers to sequencing the genomes of
individuals of a species for which a reference genome exists,
followed by the relevant bioinformatics analysis. Using whole
genome re-sequencing, researchers can obtain a wealth of
information on variations such as single nucleotide polymorphisms
(SNPs), copy number variations (CNVs), and structural variations
(SVs). It can be applied in medical genomics, population genetics,
association analysis, and evolutionary analysis. The amount of
sequencing data is growing rapidly due to decreasing costs.
However, if these data cannot be analyzed efficiently, a large
amount of useful information will never be discovered or utilized.
Extremely powerful computers are needed to help biologists
handle biological big data [1]. BGI, one of the top three gene
sequencing institutions in the world, produces more than 6 TB of
sequencing data every day. Performing a typical processing
pipeline on these data on a single server can take up to two months.
At this level of efficiency, it would be impossible to fulfill
ambitious goals such as mining valuable information from
population-scale studies. For instance, the DNA sequences of
2,000 people constitute a dataset as large as 300 TB, and the
ongoing million-people genome project will need to analyze
500 PB of DNA sequences.
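These figures are consistent with a simple back-of-the-envelope
estimate (our assumptions, not stated above: a roughly 3 Gbp human
genome sequenced at about 50x depth, stored at roughly one byte
per base):

\[
2000 \ \text{people} \times \underbrace{3\times 10^{9}\,\text{bases} \times 50}_{\approx 150\ \text{GB per person}} \approx 300\ \text{TB}.
\]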
Our analysis identifies the sequence alignment and variant
detection tools as the performance bottleneck: SOAP3-dp [2] and
SOAPsnp [3] together occupy up to 70% of the total execution time
(Fig. 1). Therefore, the whole pipeline would benefit if we
optimized both components by exploiting parallelism on the
available computing platforms.
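To see what this 70% figure implies, consider a standard Amdahl's-
law bound (our illustration; the profiling above supplies only the
fraction): if a fraction \(p \approx 0.7\) of the runtime is spent
in SOAP3-dp and SOAPsnp and these two tools are accelerated by a
factor \(s\), the pipeline speedup is

\[
S_{\text{pipeline}}(s) = \frac{1}{(1-p) + p/s} \le \frac{1}{1-p} \approx 3.3 .
\]

Hence accelerating the two tools alone cannot deliver large
end-to-end gains; it must be combined with scaling all stages of
the pipeline across many nodes, which motivates the two-level
(intra-node and inter-node) strategy pursued in this paper.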
The TH-2 supercomputer, developed by the National University
of Defense Technology (NUDT), is located in the National
Supercomputing Center in Guangzhou. TH-2 features a so-called
neo-heterogeneous architecture in which each node contains
2 multi-core Xeon CPUs and 3 many-core Xeon Phi MICs. The
system comprises a total of 16,000 such nodes.
In this paper, we aim to address the above-mentioned biological
big data problem by carrying out parallel optimization and scaling
of the key components identified above and deploying the
optimized pipeline on TH-2. We propose a set of algorithms and
parallel strategies for intra-node and inter-node computation.
Intra-node: A number of strategies were proposed to fully use the
3 MICs and 2 CPUs on each compute node, including three-channel
I/O latency hiding, elimination of redundant computation,
spatial-temporal compression, vectorization with 512-bit-wide
SIMD instructions, and a CPU/MIC collaborative parallel computing
method (sketched below). These optimizations aim at improving
the scalability of our pipeline within a single compute node.
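As a concrete illustration of the CPU/MIC collaborative model,
the sketch below shows one way a node-level driver could split a
batch of reads between the host CPUs and the three coprocessors,
using the Intel offload pragmas available on the Xeon Phi (Knights
Corner) cards of TH-2. The function align_chunk, the static work
split, and the one-byte-per-element layout are illustrative
assumptions, not our actual implementation:

    /* Minimal sketch of CPU/MIC collaborative processing on one node.
     * align_chunk and the static work split are hypothetical. */
    #include <omp.h>

    #define NUM_MIC 3

    /* compiled for both host and MIC so it can run on either side */
    __attribute__((target(mic)))
    void align_chunk(const char *reads, long n);

    void process_node(const char *reads, long total)
    {
        long mic_share = total / (NUM_MIC + 2);     /* assumed split */
        long cpu_share = total - NUM_MIC * mic_share;

        #pragma omp parallel num_threads(NUM_MIC + 1)
        {
            int t = omp_get_thread_num();
            if (t == 0) {
                align_chunk(reads, cpu_share);      /* host CPUs */
            } else {
                int dev = t - 1;                    /* one host thread per MIC */
                const char *p = reads + cpu_share + (long)dev * mic_share;
                /* ship the slice over PCIe and compute on the coprocessor */
                #pragma offload target(mic : dev) in(p[0:mic_share])
                align_chunk(p, mic_share);
            }
        }
    }

In the real pipeline the split would be tuned to the relative
throughput of the devices, and align_chunk would itself be
multi-threaded on each device.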
Inter-node: We split the genome analysis tasks and data across the
nodes of TH-2 as evenly as possible. In addition, we propose a
number of methods to reduce communication between nodes by
exploiting various characteristics of genome data, such as
sequence sorting, fast window iteration, and scalable multi-level
parallelism. A sketch of the even task split follows this
paragraph.
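The even split can be expressed as a standard MPI block
decomposition, sketched below. The total read count and the
per-node driver process_node_slice are hypothetical placeholders;
the point is that each node computes its own slice boundaries and
reads that slice of the input directly, so no central node has to
scatter hundreds of terabytes of sequence data:

    /* Sketch of an even inter-node task split with MPI.
     * total_reads and process_node_slice are illustrative. */
    #include <mpi.h>

    void process_node_slice(long first, long count);  /* hypothetical driver */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long total_reads = 3000000000L;               /* assumed workload */
        long base = total_reads / nprocs;
        long rem  = total_reads % nprocs;
        /* the first `rem` ranks take one extra read, keeping the split even */
        long mine  = base + (rank < rem ? 1 : 0);
        long first = rank * base + (rank < rem ? rank : rem);

        process_node_slice(first, mine);

        MPI_Finalize();
        return 0;
    }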
We evaluated the performance of our optimized pipeline at scales
ranging from 512 to 8,192 nodes on TH-2. The results are
encouraging. Using 8,192 nodes, we were able to finish the
analysis of a 300 TB whole genome sequencing dataset within
8.37 hours, which would take as long as 8 months on a commodity
server. The speedup is about 700x.
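As a consistency check on the quoted figure (assuming 30-day
months, an assumption on our part): 8 months is roughly
\(8 \times 30 \times 24 = 5760\) hours, and
\(5760 / 8.37 \approx 690\), in line with the reported ~700x.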