
Available online at www.prace-ri.eu

Partnership for Advanced Computing in Europe

Delft3D Performance Benchmarking Report


J. Donners(a)*, A. Mourits(b), M. Genseberger(b), B. Jagers(b)
(a) SURFsara, Amsterdam, The Netherlands
(b) Deltares, Delft, The Netherlands

Abstract

The Delft3D modelling suite has been ported to the PRACE Tier-0 and Tier-1 infrastructure. The portability of
Delft3D was improved by removing platform-dependent options from the build system and replacing non-standard constructs in the source. Three benchmarks were used to investigate the scaling of Delft3D: (1) a
large, regular domain; (2) a realistic, irregular domain with a low fill-factor; (3) a regular domain with a
sediment transport module. The first benchmark clearly shows good scalability up to a thousand cores for a
suitable problem. The other benchmarks show reasonable scalability up to about 100 cores. For test case (2) the
main bottleneck is the serialized I/O. It was attempted to implement a separate I/O server by using the last MPI
process only for the I/O, but this work is not yet finished. The imbalance due to the irregular domain can be
reduced somewhat by using a cyclic placement of MPI tasks. Test case (3) benefits from inlining of often-called
routines.

Introduction
Delft3D [1] is a world-leading 3D modelling suite used to investigate hydrodynamics, sediment transport, morphology and water quality for fluvial, estuarine and coastal environments. As of 1 January 2011, the Delft3D flow (FLOW), morphology (MOR) and waves (WAVE) modules are available as open source. Delft3D has over 350k lines of code and is developed by Deltares(b).
The software is used and has proven its capabilities all over the world, e.g. in the Netherlands, the USA, Hong Kong, Singapore, Australia and Venice. It is continuously improved and extended with innovative advanced modelling techniques resulting from the research work of Deltares, and is intended to remain a world-leading software package.

Description
The FLOW module is the heart of Delft3D; it is a multi-dimensional (2D or 3D) hydrodynamic (and transport) simulation programme which calculates non-steady flow and transport phenomena resulting from tidal and meteorological forcing on a curvilinear, boundary-fitted grid or in spherical coordinates. A more flexible grid approach is under development. In 3D simulations, the vertical grid is defined following the so-called sigma-coordinate approach or the Z-layer approach. The MOR module computes sediment transport (both suspended and bed total load) and morphological changes for an arbitrary number of cohesive and non-cohesive fractions.

* Corresponding author. E-mail address: [email protected]


(b) Deltares is an independent institute for applied research in the field of water, subsurface and infrastructure. For more information, see http://www.deltares.nl/en
The application consists of mainly Fortran 90, with some routines in C and C++ and some features from Fortran
2003. The parallel version that we considered uses MPI with 1-D domain decomposition as its parallelisation
strategy, where it automatically selects the longest dimension to be partitioned. The length of the domain (i.e. the
direction with most grid-points) is split across MPI processes. It uses an alternating direction implicit (ADI)
method to solve the momentum and continuity equations. The parallel implementation of the ADI method in Delft3D uses the halo regions of each process's part of the computational domain, to some extent, as internal boundary conditions for the iterations local to that process. Convergence could therefore become a problem when scaling up to higher process counts. I/O is implemented using a master-only technique. Although the application should scale well, as shown by similar models run with the same input data set, this was not the case on local hardware at Deltares, and the developers could not obtain an insightful profile that revealed the main bottleneck limiting scalability.
The MPI routines are wrapped in custom routines that are mostly used for halo exchanges and the reduction of
convergence parameters. Halo exchanges are executed by two calls to MPI_Isend and MPI_Irecv, immediately
followed by separate calls to MPI_Wait for all communication. The haloes are stored in temporary arrays.
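As an illustration, the following Fortran sketch shows the general pattern of such a halo exchange between a left and a right neighbour; the routine name, array names and sizes are hypothetical and do not correspond to the actual Delft3D wrapper routines.

    ! Hypothetical sketch of a 1-D halo exchange with non-blocking sends and
    ! receives, followed by separate MPI_Wait calls, as described above.
    subroutine halo_exchange(field, nhalo, left, right, comm)
      use mpi
      implicit none
      double precision, intent(inout) :: field(:)   ! local strip including haloes
      integer, intent(in) :: nhalo, left, right, comm
      double precision :: sendl(nhalo), sendr(nhalo), recvl(nhalo), recvr(nhalo)
      integer :: req(4), stat(MPI_STATUS_SIZE), i, n, ierr

      n = size(field)
      sendl = field(nhalo+1:2*nhalo)                ! copy haloes to temporary arrays
      sendr = field(n-2*nhalo+1:n-nhalo)

      call MPI_Irecv(recvl, nhalo, MPI_DOUBLE_PRECISION, left,  0, comm, req(1), ierr)
      call MPI_Irecv(recvr, nhalo, MPI_DOUBLE_PRECISION, right, 1, comm, req(2), ierr)
      call MPI_Isend(sendl, nhalo, MPI_DOUBLE_PRECISION, left,  1, comm, req(3), ierr)
      call MPI_Isend(sendr, nhalo, MPI_DOUBLE_PRECISION, right, 0, comm, req(4), ierr)

      do i = 1, 4                                   ! separate waits for all communication
        call MPI_Wait(req(i), stat, ierr)
      end do

      field(1:nhalo)     = recvl                    ! unpack received haloes
      field(n-nhalo+1:n) = recvr
    end subroutine halo_exchange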

Configuration and setup


The program uses automake, autoconf and libtool to create a configure script, which then configures the whole package for a particular system. For now, the only way to obtain the sources is to check out the svn version. A packaged version of the Delft3D source is expected to come with a single configure script that can target any platform; at the moment, a few platform-specific options in the configure.ac file still prevent this.

Porting
Delft3D has been ported to and tested on the following systems:
• IBM Power6 system “Huygens” running Linux at SURFsara (IBM XL compilers + IBM POE)
• Intel Xeon Nehalem cluster “Lisa” running Linux at SURFsara (Intel compilers + OpenMPI)
• BullX Intel Xeon cluster “Curie” running Linux at CEA (Intel compilers + BullxMPI)
• BullX Intel Xeon cluster “Cartesius” running Linux at SURFsara (Intel compilers + IntelMPI)
During the porting to these systems, several portability issues were identified and fixed in the mainstream
releases of Delft3D:
• The MPI implementation in Delft3D would check the environment variable PMI_RANK, which is only used by the MPICH2 library and is therefore not portable. New releases of Delft3D have fixed this and now support MPICH2, Intel MPI, MVAPICH, OpenMPI and POE.
• In case of an abnormal exit, the code would write an error message to the output file without closing it. Some compilers (e.g. IBM's) buffer the output, so the error message was never written to disk. This has been fixed.
• An erroneous attempt to de-allocate a static object caused a runtime error with the IBM compiler; this has been fixed.
• The MPI implementation would not call MPI_Finalize when there was only one MPI task, in which case Scalasca would not write its final report. This is now fixed.
• Variables of LOGICAL and INTEGER type were used interchangeably, which is now fixed.
• A bug due to an assumed pointer size of 4 bytes has been fixed.
• When running the first benchmark with 3 or more MPI tasks, a signed integer overflow occurred when multiplying the total number of grid points (9M) with the running sum of the CPU weights (in this case 300). This can be fixed by using INTEGER*8 variables, as illustrated in the sketch after this list.
• The OpenMP option in configure.ac was updated.
• When opening a file, the non-standard specifier access='append' was used; this has been replaced with the standard specifier position='append' (also shown in the sketch after this list).

• The non-standard Fortran function iargc() was replaced with the standard Fortran 2003 intrinsic command_argument_count().
• Delft3D would crash unexpectedly when using more than 60 MPI tasks. This could eventually be traced back to some initial conditions that were left empty, which Delft3D did not handle correctly.
• A compiler-specific flag (-fPIC) was included in some Makefile.am files; it was removed to make the build system (more) platform-independent.
• Some issues were noted, but not yet fixed completely due to other constraints:
  o The library libstdc++ was assumed to be required for every C++ compiler, but this is not true; the IBM XL C++ compiler, for example, uses its own libraries. The hard-coded dependency was therefore removed from configure.ac. As a result, compilation fails because some C++ parts of the Delft3D code are built as static libraries (libstream.a and libesm_c.a) and linked with Fortran parts, which then lack the C++ libraries. The automake files (*/Makefile.am) were changed to build the C++ parts as shared libraries through libtool, which then automatically links in the C++ libraries, even when combined with Fortran code.
  o The latest autoconf releases include an option to detect the Fortran compiler-specific flags for passing preprocessor directives; this has not yet been incorporated.
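For illustration, the following Fortran fragment sketches the nature of some of these standard-conformance fixes: the 64-bit product that avoids the signed integer overflow, the standard position='append' specifier, and command_argument_count(). The variable names, values and file name are made up for the example.

    program conformance_example
      ! Illustrative fragment only; names, values and the file name are hypothetical.
      implicit none
      integer          :: npoints, wsum, nargs, lun
      integer(kind=8)  :: work64

      npoints = 9000000                       ! ~9 million grid points
      wsum    = 300                           ! running sum of the CPU weights
      work64  = int(npoints, kind=8) * int(wsum, kind=8)   ! avoids 32-bit overflow

      nargs = command_argument_count()        ! standard replacement for iargc()

      lun = 11
      ! Standard-conforming append (instead of the non-standard access='append'):
      open(unit=lun, file='example.log', position='append', status='unknown', &
           action='write')
      write(lun,*) 'work estimate: ', work64, '  command arguments: ', nargs
      close(lun)
    end program conformance_example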

Benchmarking
The benchmarks in this report have all been run with Delft3D binaries that use double-precision numbers, in order to represent a real production environment.

Test Case: Waal river

Figure 1: Overview of part of the simulation domain. The grid representation of a groin can be seen at the bottom right.

The simulation is a schematic representation of the Waal, one of the main rivers in the Netherlands, with groins
and part of the floodplain. This model is used to estimate the effect of lowering the groins on the water level
when the area is flooded.
The resolution is high enough to reach good scaling up to at least 80 processors using a similar software package that is maintained by Deltares (this package has only a limited distribution and is applied only in the Netherlands). The total domain is 30x2 km and uses a resolution of 2x2 m in the main channel and 4x2 m on the floodplain. The total number of grid cells is more than 9 million. The domain is homogeneous and has 15,000 grid points in the direction of the 1-D domain decomposition, which makes this test case ideal to investigate scalability.
At the beginning of the project, there was not yet access to a Tier-0 system, so porting started on local systems.
After the first porting issues were resolved, the Delft3D model was compiled without any optimization flags on a
local Intel Xeon cluster “Lisa” and an IBM Power6 system “Huygens”. An initial scaling benchmark showed
about 15 timesteps per minute when using 40 cores with a scaling efficiency above 80%.
Compilation of Delft3D with the -O2 optimization flag shows good scalability on the Curie Tier-0 system (see
Figure 2). Scaling starts to tail off from 512 cores. Scalability was not tested above 1,000 cores, as Delft3D uses temporary files for each process with a maximum of 1,000 files. The results clearly show that the computational core of Delft3D can scale to 1,000 cores with a suitable input dataset that is sufficiently large and has a regular domain that is homogeneously distributed across the processes.

[Chart: steps/min dp and perfect scaling dp plotted against the number of cores.]

Figure 2: Performance of Delft3D with Waal schematic setup in timesteps per minute on Curie thin nodes. Perfect scaling is measured
relative to 1 node (16 cores).

Scalasca was used to profile Delft3D. Scalasca version 1.4.1 was specially compiled with position-independent code, so that the Scalasca libraries can be linked with the shared libraries in Delft3D. By default, the Delft3D code loads shared libraries dynamically, which is not supported by the Scalasca tool. Therefore, we used the option to compile Delft3D as a monolithic executable, with all shared libraries added at link time. It was also necessary to move the call to MPI_Init to the main C++ routine.
The most important routines in terms of computing time are:
• uzd (solve the continuity equation), and

• sud (solve the momentum equation)


A more in-depth analysis of the important bottlenecks for realistic benchmarks can be found in the following
section.
Gperftools
Gperftools was used to create a profile at the source-line level. It was not possible to profile the full program; instead, profiling had to be restricted to the main loop using the routines ProfilerStart and ProfilerStop. A Fortran interface was written that uses the intrinsic module ISO_C_BINDING. Also, the profiler library had to be linked with the Delft3D executable, instead of being preloaded at runtime. No significant insights were gained from these profiles.
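A minimal sketch of such an interface is shown below, assuming the gperftools C functions ProfilerStart(const char*) and ProfilerStop(); it is an illustration, not the exact interface used in Delft3D.

    ! Sketch of a Fortran binding to the gperftools CPU profiler.
    module gperftools_iface
      use, intrinsic :: iso_c_binding, only: c_char, c_int
      implicit none
      interface
        function ProfilerStart(fname) bind(c, name='ProfilerStart') result(ok)
          import :: c_char, c_int
          character(kind=c_char), dimension(*), intent(in) :: fname
          integer(c_int) :: ok
        end function ProfilerStart
        subroutine ProfilerStop() bind(c, name='ProfilerStop')
        end subroutine ProfilerStop
      end interface
    end module gperftools_iface

    ! Usage around the main loop (the profile file name is hypothetical):
    !   use gperftools_iface
    !   use, intrinsic :: iso_c_binding, only: c_int, c_null_char
    !   integer(c_int) :: ok
    !   ok = ProfilerStart('delft3d.prof'//c_null_char)
    !   ... main time loop ...
    !   call ProfilerStop()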

Test Case: Zeedelta

Figure 3: Full domain for the Zeedelta benchmark. Each colour represents a computational domain; in this case there are 8 domains in total.

The Zeedelta model is a simulation of the Rotterdam estuary and represents a real production case.
This is a 3D model with 501x1539 horizontal grid points (total 771,039), of which 171,659 are active grid points and 600,055 inactive grid points (total 771,714). The 3D model has 10 layers. This model has a heterogeneous domain, with a very high number of inactive grid points. It represents the other extreme in comparison to the first benchmark, which had a regular and completely active domain.
The processes at the left of the domain have only active points, while for the processes at the right of the domain a large fraction of the points in memory are inactive. Delft3D splits the domain into partitions with an equal number of active points, so the processes at the right of the domain cover a longer stretch of the domain.
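The following Fortran sketch illustrates this static balancing idea in a simplified form: partition boundaries along the decomposed dimension are chosen so that each process receives roughly the same number of active points. The routine and variable names are invented for the example and do not correspond to Delft3D code.

    ! Simplified illustration of balancing a 1-D decomposition by active points.
    subroutine balance_columns(active, ncols, nproc, last_col)
      implicit none
      integer, intent(in)  :: ncols, nproc
      integer, intent(in)  :: active(ncols)     ! active points per grid column
      integer, intent(out) :: last_col(nproc)   ! last column owned by each process
      integer :: total, cum, p, j

      total = sum(active)
      cum   = 0
      p     = 1
      do j = 1, ncols
        cum = cum + active(j)
        ! hand the partition over once the cumulative count reaches its share
        if (p < nproc .and. cum*nproc >= total*p) then
          last_col(p) = j
          p = p + 1
        end if
      end do
      last_col(p:nproc) = ncols                 ! remaining processes end at the last column
    end subroutine balance_columns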
Scalasca was used to generate a profile of the test case at different numbers of processes. Small routines that are frequently called (comparerealdouble, redvic, reddic, dens_unes) have been filtered from instrumentation to minimize the impact of profiling on performance. The following results were obtained on the PRACE Tier-0 Curie system, using 96 MPI processes on 96 cores of the thin-node partition. The most important routines are:
1. uzd: 30% runtime

2. tritra (compute transports for conservative constituents): 22% runtime


3. postpr (I/O postprocessing): 15% runtime

4. tratur (compute transports of turbulent kinetic energy and dissipation): 14% runtime

5. sud: 10% runtime


which in total represents more than 90% of the runtime, excluding initialisation and finalisation.
The default placement was used, with MPI tasks placed in consecutive blocks on the nodes and cyclically over the sockets within each node. Cyclic placement of tasks across nodes was shown to increase performance by approximately 5% for 64 processes and above.

[Chart: steps/min dp, steps/min dp (cyclic) and perfect scaling plotted against the number of cores.]

Figure 4: Performance of Zeedelta benchmark on Curie thin nodes in timesteps/min. Perfect scaling is measured relative to 16 cores (1 node).

When the number of processes is increased to 128, the master-only I/O becomes the most expensive part of the code, with 22% of the runtime, while uzd only takes 19% of the time. Due to the serial nature of the master-only I/O and the increasing number of messages needed to gather all data, the I/O actually becomes slower at higher process counts.
Unfortunately, for process counts higher than 128, the ADI algorithm for the transport equation no longer converged within 50 iterations from timestep 192 onwards. Although the model does not crash, we decided not to investigate further due to the severe impact on both the performance and the scientific results.
Several scaling bottlenecks were identified:
1. postpr: master-only I/O

2. uzd, tritra: imbalance. MPI_Wait synchronizes neighbours: a process needs to wait for its slowest
neighbour to send the data. MPI_Allreduce is used to check for convergence at every timestep and
synchronizes all processes. From a test with an MPI_Barrier in front of the iterative ADI algorithm,
it is clear that there is no imbalance in the earlier part of the code. The processes with a higher fill
factor need more computation time.
3. tratur: imbalance

4. sud: Imbalance in the routine cucnp that causes waiting time at MPI_Barrier.
5. incbc: boundary conditions are defined per grid point. Boundary data are gathered and broadcast multiple times at every timestep. This routine takes up about 4% of the runtime with 96 tasks and will not scale due to its serial implementation.
This test case has a high fill factor for the processes at the left of the domain and a much lower fill factor at the right. The blocked placement of tasks onto nodes therefore puts all the slower tasks together on the first node, each competing for the limited memory bandwidth. The cyclic placement of tasks circumvents this and decreases the imbalance, which results in a 5% overall performance increase. The lower imbalance reduces waiting time for collective communication, which more than compensates for the longer nearest-neighbour communication. Another solution would be to undersubscribe the nodes, giving every process a second core to increase the available memory bandwidth per process. This improves performance relative to the blocked placement by over 20%, but it should be weighed against the performance of simply using twice as many processes, so it would not usually be advantageous. If energy usage is taken into account, this balance might shift.
Since the main scaling bottleneck for this particular benchmark is the I/O, it was attempted to implement a
separate I/O server by using the last MPI process only for the I/O. Unfortunately, the implementation is not yet
finished and therefore has not yet been tested.
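One possible way to set this up is sketched below, under the assumption that MPI_Comm_split is used to separate the last rank from the compute ranks; this is an illustration of the approach, not the unfinished Delft3D implementation.

    ! Sketch: dedicate the last MPI rank to I/O by splitting MPI_COMM_WORLD.
    subroutine split_io_server(compcomm, is_io_server)
      use mpi
      implicit none
      integer, intent(out) :: compcomm        ! communicator for the compute ranks
      logical, intent(out) :: is_io_server
      integer :: rank, nprocs, colour, ierr

      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      is_io_server = (rank == nprocs - 1)     ! last rank only does file output
      if (is_io_server) then
        colour = 1
      else
        colour = 0
      end if
      call MPI_Comm_split(MPI_COMM_WORLD, colour, rank, compcomm, ierr)

      ! The compute ranks run the flow solver on compcomm and send output
      ! fields to rank nprocs-1 of MPI_COMM_WORLD, which writes them to disk.
    end subroutine split_io_server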
Test Case: Sediment transport
The last test case includes sediment transport and morphology updates. Due to these extra processes, different
parts of the code are activated, which are highly compute-intensive. The computational domain is rectangular
with 243 grid points in each dimension. Since Delft3D uses 1D domain decomposition and each domain needs a
minimum number of columns, the maximum number of processes that we could use is 160.
For 16 processes on the Curie thin nodes, the CPU time is dominated by two routines:
1. erosed (compute sediment fluxes): 48%
2. bott3d (update depth due to changes in bottom sediment): 20%

3. adi (uzd+sud): 11%


4. tritra: 7%

5. taubot: 6%
again representing more than 90% of the computing time. I/O takes less than 1% for this test case.

[Chart: steps/min dp and perfect scaling plotted against the number of cores.]

Figure 5: Performance of Sediment transport benchmark on Curie thin nodes in timesteps/min. Perfect scaling is measured relative to 16
cores (1 node).

Figure 5 shows the performance of the Sediment transport benchmark as measured on the PRACE Tier-0 Curie
system. It can be seen that the benchmark does not scale well and levels off above 64 cores, although the performance does still increase up to 128 cores.
The routines erosed and bott3d call several functions for each grid-point with active sediment transport and
depth changes (bedbc1993, calseddf1993, bedtr1993, comparerealdouble, getsedthick_1point and
more). These functions are called tens of billions of times, even for these short benchmark runs. The overhead of calling these routines is significant, so it is important to make sure that they are inlined by the compiler. Because Delft3D is composed of several dynamic and static libraries, the compiler does not automatically inline functions from other libraries, or even functions within the same static library. For the Intel compiler this requires adding the optimization flag -ipo and replacing the archiver ar with xiar and the linker ld with xild. The resulting performance increase is about 7% when using 96 cores.
The sediment transport routines only exchange haloes for some of their variables; halo values for many other variables are recomputed locally by each process to reduce communication. This reinforces the conclusion that it is important to check that these often-called routines are correctly inlined.

Unfortunately, this particular benchmark is sensitive to the number of processes: the solution of the ADI solver depends on the process count, which cascades into changes in sediment transport and the resulting height changes, which in turn feed back into the circulation, even for short simulations like these. This affects the reliability of the results, although we do not know to what extent.
A cyclic placement of the processes across the nodes results in a 25% penalty for this benchmark, which shows
that this option is not a panacea and should only be used if there is an imbalance between the processes.

Conclusions
Delft3D is a complex application that is used in a broad range of real-world applications. Consequently, it
contains a large number of different modules, many of which can be activated separately. For this white paper
we have selected a representative set of benchmarks with different use patterns. The first benchmark clearly
shows that Delft3D can scale up to 1,000 cores on PRACE Tier-0 systems for suitably large problems. The
second benchmark has a domain with a realistic, irregular bathymetry that has a large fraction of inactive points.
Delft3D uses static load-balancing by assigning an equal number of active points to each domain. Together with
a cyclic placement of processes across the nodes, this works reasonably well. However, the scalability flattens around 100 cores, which can be attributed in large part to the I/O in this particular benchmark. An initial
implementation of a separate I/O server was created, but not yet tested. The third benchmark has a regular
domain and includes an extra module for sediment transport and depth changes. This module takes up about 70%
of the runtime. The benchmark shows little scaling potential beyond 64 cores. Several functions are called
billions of times and it is important to make sure that these are inlined by the compiler.

References
[1] http://oss.deltares.nl/web/opendelft3d/home

Acknowledgements
This work was financially supported by the PRACE project, funded in part by the EU's 7th Framework Programme (FP7/2007-2013) under grant agreement no. RI-283493.
