Performance

Victor Eijkhout

Fall 2023

1 Eijkhout – Performance – Fall 2023


Justification

Programming for performance is an art.


Here are some examples.



Peak performance

Requires all floating point units to be active


Requires all data in L1 cache, or even register
⇒ very hard to achieve.



Arithmetic intensity

How many operations per word?


Equivalently: reuse factor = ratio between operations and data
Reuse: algorithm vs implementation
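
As a rough illustration of "operations per word", the sketch below counts operations and words for two textbook kernels. The function names are hypothetical, and the counts assume every operand moves between memory and processor exactly once, which is the algorithmic best case rather than what a given implementation achieves.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: operations per word, assuming each word moves once.
double arithmetic_intensity(double operations, double words) {
  assert(words > 0);
  return operations / words;
}

// Vector update y[i] = y[i] + a*x[i]: 2 operations per element;
// 3 words per element (read x, read y, write y).
double axpy_intensity(std::size_t n) {
  return arithmetic_intensity(2.0 * n, 3.0 * n);
}

// Matrix-matrix product C = A*B with n-by-n matrices:
// 2n^3 operations on 3n^2 words, so the intensity grows with n.
double matmul_intensity(std::size_t n) {
  double dn = static_cast<double>(n);
  return arithmetic_intensity(2.0 * dn * dn * dn, 3.0 * dn * dn);
}
```

The vector update stays at 2/3 regardless of size, so it is bandwidth-limited; the matrix product has reuse in the algorithm, but an implementation only benefits if it keeps the reused data in cache.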



Bandwidth-limited operations

‘Streaming’ operations

// bandwidth.cxx
vector<double> results(nthreads,0.);
for ( int t=0; t<nthreads; t++) {
  auto start_point = t*stream_length;
  threads.push_back
    ( thread( [=,&results] () {
        results[t] = memory.sumstream(
          how_many_repeats,stream_length,start_point);
      } ) );
}
for ( auto &t : threads )
  t.join();
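
The slides do not show `memory.sumstream` itself. A minimal single-threaded sketch of what one such stream could look like, and of how a bandwidth figure follows from bytes read and elapsed time, is given here. The free function `sumstream`, the helper `measured_bandwidth`, and the flat `std::vector<double>` storage are all assumptions for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for memory.sumstream: sum one contiguous stream,
// repeated several times. The actual class is not shown in the slides.
double sumstream(const std::vector<double>& memory, int how_many_repeats,
                 std::size_t stream_length, std::size_t start_point) {
  assert(start_point + stream_length <= memory.size());
  double sum = 0.;
  for (int r = 0; r < how_many_repeats; r++)
    for (std::size_t i = 0; i < stream_length; i++)
      sum += memory[start_point + i];
  return sum;
}

// Time one call and convert to bytes read per second.
// The sum is returned through `result` so the loop is not optimized away.
double measured_bandwidth(const std::vector<double>& memory,
                          int how_many_repeats, std::size_t stream_length,
                          std::size_t start_point, double& result) {
  auto t0 = std::chrono::steady_clock::now();
  result = sumstream(memory, how_many_repeats, stream_length, start_point);
  auto t1 = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(t1 - t0).count();
  double bytes = double(how_many_repeats) * stream_length * sizeof(double);
  return bytes / seconds;  // infinite if the timer resolution is too coarse
}
```

For a meaningful number the stream has to be long enough, or repeated often enough, that the elapsed time is well above the timer resolution.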



Bandwidth measurement

Aggregate bandwidth

[Figure: bandwidth versus number of cores (0 to 60), with curves labeled 12K, 256K, and 25M]

Bandwidth numbers strictly a posteriori


Cache size effects

Basic idea: go many times over a small data set.


The following code is too simple:

for (int irepeat=0; irepeat<how_many_repeats; irepeat++)
{
  for (int iword=0; iword<cachesize_in_words; iword++)
    memory[iword] += 1.1;
}



Random traversal

Emulate randomness by pointer chasing

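// setup: each word holds the index of the next word to visit
for (int iword=0; iword<cachesize_in_words; iword++)
  memory[iword] = (iword+1) % cachesize_in_words;

// use: every load depends on the result of the previous one
ptr = 0;
for (int iword=0; iword<cachesize_in_words; iword++)
  ptr = memory[ptr];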
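
The setup shown on the slide links each word to its successor, so the chain is still walked in order. One way to make the traversal genuinely random is to fill the array with a random permutation that forms a single cycle. The sketch below uses Sattolo's algorithm for that; this is a substituted technique, not necessarily what the course code does, and the names `random_cycle` and `chase` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Build next[] as one random cycle over all n slots (Sattolo's algorithm),
// so a chase starting at 0 visits every slot before it returns to 0.
std::vector<std::size_t> random_cycle(std::size_t n, unsigned seed) {
  assert(n > 0);
  std::vector<std::size_t> next(n);
  for (std::size_t i = 0; i < n; i++) next[i] = i;
  std::mt19937 gen(seed);
  for (std::size_t i = n - 1; i > 0; i--) {
    // pick strictly below i: this is what guarantees a single cycle
    std::uniform_int_distribution<std::size_t> pick(0, i - 1);
    std::swap(next[i], next[pick(gen)]);
  }
  return next;
}

// Follow the chain for `steps` hops; each load depends on the previous one.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
  std::size_t ptr = 0;
  for (std::size_t s = 0; s < steps; s++) ptr = next[ptr];
  return ptr;
}
```

Returning the final pointer from `chase` gives the caller a value to use, which keeps the compiler from discarding the loop.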



Measurement

[Figure: bandwidth versus dataset size (10^4 to 10^7), with curves labeled frontera, ls6, and linear]


Associativity

Words at certain distance map to the same associativity class


Example: Ice Lake has 48KiB L1 data cache, 12-way associative
⇒ a stride of 4KiB gives conflicts; up to 12 conflicting addresses can be held at once
Cascade Lake: 8-way associative
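
The 4KiB figure can be checked with a small sketch. It assumes a 64-byte cache line and plain modulo indexing on the address, neither of which is stated on the slide; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of sets = cache size / (associativity * line size).
std::size_t number_of_sets(std::size_t cache_bytes, std::size_t ways,
                           std::size_t line_bytes) {
  assert(ways > 0 && line_bytes > 0);
  return cache_bytes / (ways * line_bytes);
}

// Set that a byte address maps to, assuming simple modulo indexing.
std::size_t set_index(std::uint64_t address, std::size_t sets,
                      std::size_t line_bytes) {
  return (address / line_bytes) % sets;
}

// Smallest stride in bytes at which successive addresses share a set.
std::size_t conflict_stride(std::size_t cache_bytes, std::size_t ways,
                            std::size_t line_bytes) {
  return number_of_sets(cache_bytes, ways, line_bytes) * line_bytes;
}
```

With 48KiB, 12 ways, and the assumed 64-byte line this gives 64 sets and a conflict stride of 4096 bytes, so addresses 0, 4096, 8192, ... all compete for the same 12 slots.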
[Figure: access time per element (nsec) versus number of collisions (4 to 14), with curves labeled skx, clx, and icx]



Loop tiling

Multiple passes over array


Rewrite as by-block
⇒ one extra loop level

/* original */
for (n=0; n<10; n++)
  for (i=0; i<100000; i++)
    ... = ...x[i] ...

/* by block */
bs = ... /* the blocksize */
for (b=0; b<100000/bs; b++)
  for (n=0; n<10; n++)
    for (i=b*bs; i<(b+1)*bs; i++)
      ... = ...x[i] ...
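
A runnable sketch of the same transformation follows. The update `x[i] = 0.5*x[i] + 1.0` is a made-up loop body standing in for the elided one. The interchange is only legal because each element is updated independently of the others. Unlike the slide version, the sketch also handles an array length that is not a multiple of the block size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Untiled: each of the `passes` sweeps touches the whole array.
void sweep_untiled(std::vector<double>& x, int passes) {
  for (int n = 0; n < passes; n++)
    for (std::size_t i = 0; i < x.size(); i++)
      x[i] = 0.5 * x[i] + 1.0;
}

// Tiled: all passes are applied to one block before moving on,
// so a block that fits in cache is reused `passes` times.
void sweep_tiled(std::vector<double>& x, int passes, std::size_t bs) {
  assert(bs > 0);
  for (std::size_t start = 0; start < x.size(); start += bs) {
    // clip the last block when bs does not divide the array length
    std::size_t end = start + bs < x.size() ? start + bs : x.size();
    for (int n = 0; n < passes; n++)
      for (std::size_t i = start; i < end; i++)
        x[i] = 0.5 * x[i] + 1.0;
  }
}
```

Both functions apply the same operations to each element in the same order, so they produce identical results; only the memory traffic differs.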



Example transpose

// regular.c
for (int i=0; i<N; i++)
  for (int j=0; j<N; j++)
    A[i][j] = B[j][i];

Does this have any reuse of input or output?



Rewrite

// blocked.c
// ii and jj already advance by blocksize, so they are the block start indices
for (int ii=0; ii<N; ii+=blocksize)
  for (int jj=0; jj<N; jj+=blocksize)
    for (int i=ii; i<MIN(N,ii+blocksize); i++)
      for (int j=jj; j<MIN(N,jj+blocksize); j++)
        A[i][j] = B[j][i];
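
A self-contained variant of the blocked transpose, useful for checking the loop bounds against the straightforward version. Storing the matrices row-major in flat `std::vector<double>` is an assumption made here to keep the example short; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A = transpose(B) for N-by-N row-major matrices, processed in
// blocksize-by-blocksize tiles so that a tile of B stays in cache
// while the matching tile of A is written.
void transpose_blocked(std::vector<double>& A, const std::vector<double>& B,
                       std::size_t N, std::size_t blocksize) {
  assert(blocksize > 0 && A.size() == N * N && B.size() == N * N);
  for (std::size_t ii = 0; ii < N; ii += blocksize)
    for (std::size_t jj = 0; jj < N; jj += blocksize)
      // std::min clips the last tile when blocksize does not divide N
      for (std::size_t i = ii; i < std::min(N, ii + blocksize); i++)
        for (std::size_t j = jj; j < std::min(N, jj + blocksize); j++)
          A[i * N + j] = B[j * N + i];
}
```

A suitable block size depends on the cache being targeted and is best found by measurement.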
