
Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Buddy Bland, Project Director, Oak Ridge Leadership Computing Facility

November 13, 2012

Office of Science

ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

Footprint: 4,352 ft² (404 m²)

System specifications:
- Peak performance: 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
- 18,688 compute nodes, each with:
  - 16-core AMD Opteron CPU (32 GB)
  - NVIDIA Tesla K20x GPU (6 GB)
- 512 service and I/O nodes
- 200 cabinets
- 710 TB total system memory
- Cray Gemini 3D torus interconnect
- 8.9 MW peak power, 8.3 MW average
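As a quick sanity check (this arithmetic is not on the slide), the headline numbers follow from the per-node peaks quoted on the next slides:

$$18{,}688 \times 1.311~\text{TF} \approx 24.5~\text{PF (GPUs)}, \qquad 18{,}688 \times 0.141~\text{TF} \approx 2.6~\text{PF (CPUs)}$$
$$18{,}688 \times (32 + 6)~\text{GB} \approx 710~\text{TB of memory}$$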

The x86 processor provides fast single-thread performance for control and communication.

AMD Opteron 6274:
- 16 cores
- 141 GFLOPS peak
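For reference, that peak is consistent with the 6274's published specifications (2.2 GHz base clock and four double-precision flops per core per cycle on the Bulldozer core, neither of which appears on this slide):

$$16~\text{cores} \times 2.2~\text{GHz} \times 4~\text{DP flops/cycle} \approx 141~\text{GFLOPS}$$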


GPUs are designed for extreme parallelism, performance, and power efficiency.

NVIDIA Tesla K20x:
- 14 streaming multiprocessors (SMX)
- 2,688 CUDA cores
- 1.31 TFLOPS peak (double precision)
- 6 GB GDDR5 memory
- HPL: >2.0 GFLOPS per watt (Titan full-system measured power)
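That double-precision peak follows from the SMX count together with the K20x's 64 FP64 units per SMX and 732 MHz clock (the latter two figures are from NVIDIA's spec sheet, not this slide):

$$14~\text{SMX} \times 64~\text{FP64 units} \times 2~\text{flops (FMA)} \times 0.732~\text{GHz} \approx 1.31~\text{TFLOPS}$$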

Cray XK7 Compute Node

XK7 compute node characteristics:
- AMD Opteron 6274 16-core processor @ 141 GF
- Tesla K20x @ 1,311 GF
- Host memory: 32 GB 1600 MHz DDR3
- Tesla K20x memory: 6 GB GDDR5
- Gemini high-speed interconnect

Slide courtesy of Cray, Inc.

Titan: Cray XK7 System

- Compute node: 1.45 TF, 38 GB
- Board: 4 compute nodes, 5.8 TF, 152 GB
- Cabinet: 24 boards (96 nodes), 139 TF, 3.6 TB
- System: 200 cabinets, 18,688 nodes, 27 PF, 710 TB

Why GPUs? High Performance and Power Efficiency on a Path to Exascale

- Hierarchical parallelism: improves scalability of applications by exposing more parallelism through code refactoring and source-code directives.
- Heterogeneous multi-core processor architecture: use the right type of processor for each task.
- Data locality: keep the data near the processing. The GPU has high bandwidth to its local memory for rapid access, and a large internal cache.
- Explicit data management: explicitly manage data movement between CPU and GPU memories (see the sketch below).
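To illustrate the last two points, here is a minimal OpenACC sketch (hypothetical code, not from the talk): a data region keeps the arrays resident in GPU memory across two kernels, so each array crosses the host-device link once rather than once per kernel.

```c
#include <stdlib.h>

/* Hypothetical example: keep data resident on the GPU across kernels. */
void scale_and_sum(const float *a, const float *b, float *restrict c, int n)
{
    /* One host-to-device copy on entry and one device-to-host copy on
       exit; without the enclosing data region, each loop below would
       trigger its own transfers. */
    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        #pragma acc parallel loop   /* first kernel: scale a into c */
        for (int i = 0; i < n; i++)
            c[i] = 2.0f * a[i];

        #pragma acc parallel loop   /* second kernel: accumulate b into c */
        for (int i = 0; i < n; i++)
            c[i] += b[i];
    }
}
```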

Hybrid Programming Model

- On Jaguar, with 299,008 cores, we were seeing the limits of a single level of MPI scaling for most applications.
- To take advantage of the vastly larger parallelism in Titan, users need hierarchical parallelism in their codes (see the sketch after this list):
  - Distributed memory: MPI, SHMEM, PGAS
  - Node local: OpenMP, Pthreads, local MPI communicators
  - Within threads: vector constructs on the GPU, libraries, OpenACC
- These are the same types of constructs needed on all multi-PFLOPS computers to scale to the full size of the systems!
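A minimal sketch of the first two levels (hypothetical code, not from the talk): MPI ranks provide the distributed-memory level across nodes, OpenMP threads the node-local level; the innermost level would then be offloaded to the GPU with OpenACC, as on the next slide.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hypothetical example of two levels of hierarchical parallelism:
   level 1 is one MPI rank per node (distributed memory), level 2 is
   OpenMP threads across the node's CPU cores. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Node-local level: threads accumulate a shared partial sum. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nranks)
        local += 1.0 / (1.0 + (double)i);

    /* Distributed level: combine the per-rank partial sums. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```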

How do you program these nodes?

Compilers:
- OpenACC is a set of compiler directives that lets the user express hierarchical parallelism in the source code, so that the compiler can generate parallel code for the target platform, be it GPU, MIC, or vector SIMD on a CPU (see the example below).
- The Cray compiler supports XK7 nodes and is OpenACC compatible.
- The CAPS HMPP compiler supports C, C++, and Fortran compilation for heterogeneous nodes, with OpenACC support.
- The PGI compiler supports OpenACC and CUDA Fortran.

Tools:
- The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) apps.
- ORNL has worked with the Vampir team at TUD to add support for profiling codes on heterogeneous nodes.
- CrayPAT and Cray Apprentice support XK6 programming.
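To make the directive model concrete, a minimal OpenACC example (hypothetical, not from the talk): a single pragma asks the compiler to generate a GPU kernel for the loop, while the same source still builds as ordinary serial code when the directives are ignored.

```c
/* Hypothetical example: SAXPY offloaded with one OpenACC directive.
   Compilers such as Cray, CAPS, or PGI translate the annotated loop
   into a GPU kernel; compiled without OpenACC support, the pragma is
   ignored and the loop runs on the CPU unchanged. */
void saxpy(int n, float a, const float *x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```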

Early Science Applications on Titan

- Materials science (WL-LSMS): role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
- Climate change (CAM-SE): answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
- Biofuels (LAMMPS): a multiple-capability molecular dynamics code.
- Astrophysics (NRDF): radiation transport critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
- Combustion (S3D): combustion simulations to enable the next generation of diesel/biofuels to burn more efficiently.
- Nuclear energy (Denovo): unprecedented high-fidelity radiation transport calculations that can be used in a variety of nuclear energy and technology applications.

How Effective are GPUs on Scalable Applications?

OLCF-3 early science codes; very early performance measurements on Titan.

Systems compared:
- Cray XK7: K20x GPU plus AMD Opteron 6274 CPU
- Cray XE6: dual AMD Opteron 6274, no GPU
- Cray XK6 w/o GPU: single AMD Opteron 6274, no GPU

Application     XK7 vs. XE6 ratio    Comments
S3D             1.8                  Turbulent combustion; 6% of Jaguar workload
Denovo sweep    3.8                  Sweep kernel of 3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS          7.4*                 High-performance molecular dynamics; 1% of Jaguar workload (*mixed precision)
WL-LSMS         3.8                  Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE          1.8*                 Community atmosphere model; 1% of Jaguar workload (*estimate)

Questions? [email protected]

Want to join our team? ORNL is hiring. Contact us at http://jobs.ornl.gov


The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
