Shared Memory Architecture Concepts and Performance Issues
TDDD56 Lecture 3, Christoph Kessler
PELAB / IDA, Linköping University, Sweden
2012
Outline
Lecture 1: Multicore Architecture Concepts
Lecture 2: Parallel programming with threads and tasks
Lecture 3: Shared memory architecture concepts and performance issues
    Memory hierarchy
    Consistency issues and coherence protocols
    Performance issues, e.g. false sharing
    Optimizations for data locality (briefly, more in Lecture 7)
Lecture 4: Design and analysis of parallel algorithms
Lecture 5: Parallel Sorting Algorithms
Shared Memory vs. Distributed Memory
Shared Memory Variants
Single shared memory module (UMA) quickly becomes a performance bottleneck
Often implemented with caches to leverage access locality
    As done for single-processor systems, too
Can even be realized on top of a distributed memory system (NUMA, non-uniform memory access)
Cache
Cache = small, fast memory (SRAM) between processor and main memory, today typically on-chip
    contains copies of main memory words
    cache hit = accessed word already in cache, get it fast
    cache miss = not in cache, load from main memory (slower)
Cache line holds a copy of a block of adjacent memory words
    size: from 16 bytes upwards, can differ for cache levels
Cache-based systems profit from
    spatial access locality: access also other data in same cache line
    temporal access locality: access same location multiple times
HW-controlled cache line replacement
    dynamic adaptivity of cache contents
    suitable for applications with high (also dynamic) data locality
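The hit/miss behavior described above can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real processor: it assumes a direct-mapped cache with 4 lines of 16 bytes each, and the names Cache, cache_access, and scan_words are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 16   /* assumed line size (the lower bound named above) */
#define NUM_LINES  4    /* assumed; tiny on purpose */

typedef struct {
    bool     valid[NUM_LINES];
    uint64_t block[NUM_LINES];  /* which memory block each line holds */
    unsigned hits, misses;
} Cache;

/* Simulate one access to byte address addr; returns true on a cache hit. */
bool cache_access(Cache *c, uint64_t addr) {
    assert(c != NULL);
    uint64_t block = addr / LINE_BYTES;  /* block of adjacent memory words */
    size_t   slot  = block % NUM_LINES;  /* direct-mapped placement (assumption) */
    if (c->valid[slot] && c->block[slot] == block) {
        c->hits++;
        return true;
    }
    /* Miss: load the block, replacing whatever the slot held before. */
    c->valid[slot] = true;
    c->block[slot] = block;
    c->misses++;
    return false;
}

/* Scan n consecutive 4-byte words starting at address 0. */
void scan_words(Cache *c, size_t n) {
    for (size_t k = 0; k < n; k++)
        cache_access(c, 4 * k);
}
```

Scanning 16 consecutive words misses once per line and hits on the remaining three words of each line (spatial locality). Touching address 0 again hits (temporal locality) until another block mapped to the same slot replaces it.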
Cache (cont.)
Caches: Memory Update Strategies
Optimizing Programs for Improved Access Locality
Example: Loop Interchange
Old iteration order:
    for (j=0; j<M; j++)
        for (i=0; i<N; i++)
            a[i][j] = 0.0;
New iteration order:
    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            a[i][j] = 0.0;
[Figure: row-wise storage of 2D arrays in C, Java, with rows running from a[0][0] ... a[0][M-1] down to a[N-1][0] ...; the old iteration order walks along i (down the columns), the new iteration order walks along j (along the rows)]
Can improve spatial locality of memory accesses (fewer cache misses / page faults)
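The effect of the interchange can be estimated without timing anything. The sketch below assumes 64-byte cache lines, 8-byte doubles, and an array that starts on a line boundary (none of this is stated on the slide). The hypothetical helper line_switches counts how often consecutive accesses move to a different cache line for each loop order.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64  /* assumed cache line size */

/* Cache line number holding element (i,j) of a row-major array with m columns. */
static size_t line_of(size_t i, size_t j, size_t m) {
    return ((i * m + j) * sizeof(double)) / LINE_BYTES;
}

/* Count how often consecutive accesses touch a different cache line.
 * row_order != 0: i outer, j inner (new order);
 * row_order == 0: j outer, i inner (old order). */
size_t line_switches(size_t n, size_t m, int row_order) {
    assert(n > 0 && m > 0);
    size_t switches = 0;
    size_t prev = (size_t)-1;            /* no line touched yet */
    size_t outer = row_order ? n : m;
    size_t inner = row_order ? m : n;
    for (size_t a = 0; a < outer; a++) {
        for (size_t b = 0; b < inner; b++) {
            size_t i = row_order ? a : b;
            size_t j = row_order ? b : a;
            size_t line = line_of(i, j, m);
            if (line != prev) {
                switches++;
                prev = line;
            }
        }
    }
    return switches;
}
```

Under these assumptions, a 4 x 16 array gives 8 line switches in the new order (one per line) and 64 in the old order (one per access). The count is only a proxy for cache misses, since a real cache can keep several lines resident at once.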
Memory Hierarchy Example
Cache Coherence and Memory Consistency
Cache Coherence Formal Definition
Cache Coherence Protocols
Details: Write-Invalidate Protocol
Write-Invalidate Protocol (cont.)
Write-Update Protocol
Bus-Snooping
Write-back invalidation protocol (MSI-protocol)
MSI-Protocol: State Transitions
Example
[Figure sequence: cores i and j, each with cache lines / state, on a bus to memory blocks / state holding x]
Core i: Load x (cache line state I): BusRd x
Core j: Load x: BusRd x, cache line state S
    Processor read / bus read observation: state for other copies remains S
Write requires invalidation
Core j: Store x: BusRdX x, cache line state M
    Must be exclusive owner before writing to local copy: first invalidate the others and update my copy, by BusRdX x
Core i: Load x: BusRd x, cache line state S
Another scenario:
Core i: Store x: BusRdX x (cache line states shown: M, I)
    Observation of BusRdX x: cache controller invalidates own copy
    Subsequent store requires flush/update of the cache line in memory and requesting cache before overwriting
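The load/store sequence in this example can be traced with a small state machine. This is a sketch under stated assumptions: two cores, a single cache line holding x, and a store from state S modeled as BusRdX (as on the slides). The names System, msi_load, msi_store, and snoop are illustrative and not from the lecture.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, MODIFIED } State;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } BusOp;

#define NCORES 2  /* assumed: core i = 0, core j = 1 */

typedef struct {
    State state[NCORES];  /* state of the line holding x in each cache */
    int   flushes;        /* how often a Modified copy had to be flushed */
} System;

/* All other caches observe the bus operation issued by core src. */
static void snoop(System *sys, int src, BusOp op) {
    for (int c = 0; c < NCORES; c++) {
        if (c == src) continue;
        if (sys->state[c] == MODIFIED)
            sys->flushes++;                /* flush the dirty line first */
        if (op == BUS_RDX)
            sys->state[c] = INVALID;       /* invalidate own copy */
        else if (sys->state[c] == MODIFIED)
            sys->state[c] = SHARED;        /* BusRd observed: M -> S */
    }
}

/* Processor read of x on the given core; returns the bus operation issued. */
BusOp msi_load(System *sys, int core) {
    assert(core >= 0 && core < NCORES);
    if (sys->state[core] != INVALID) return BUS_NONE;  /* hit in S or M */
    snoop(sys, core, BUS_RD);
    sys->state[core] = SHARED;
    return BUS_RD;
}

/* Processor write of x on the given core; returns the bus operation issued. */
BusOp msi_store(System *sys, int core) {
    assert(core >= 0 && core < NCORES);
    if (sys->state[core] == MODIFIED) return BUS_NONE; /* already exclusive owner */
    snoop(sys, core, BUS_RDX);
    sys->state[core] = MODIFIED;
    return BUS_RDX;
}
```

Starting with both lines INVALID, the calls msi_load(&sys, 0), msi_load(&sys, 1), msi_store(&sys, 1), msi_load(&sys, 0) follow the first scenario: the second load leaves the other copy in S, the store invalidates it, and the final load forces a flush and moves the writer from M to S.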
MSI-Protocol: State Transitions
MESI-Protocol
CC-NUMA: Directory based Protocol for non-bus-based Architectures
Performance Issue: False Sharing
How to Avoid False Sharing?
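One common remedy is to pad or align per-thread data so that each thread's variable occupies its own cache line. The sketch below is an illustration only and may not match the list on the original slide: it assumes a 64-byte cache line and C11 alignment support, and the names PackedCounter, PaddedCounter, and bump are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64   /* assumed cache line size; check the target hardware */
#define NTHREADS   4    /* assumed number of threads */

/* Prone to false sharing: adjacent counters fall into the same cache line. */
typedef struct { long count; } PackedCounter;

/* Padded: the alignment requirement rounds the struct size up to one line. */
typedef struct { alignas(LINE_BYTES) long count; } PaddedCounter;

static_assert(sizeof(PaddedCounter) == LINE_BYTES,
              "each padded counter should fill exactly one cache line");

static PackedCounter packed[NTHREADS];
static PaddedCounter padded[NTHREADS];

/* Byte distance between the counters of two neighboring threads. */
size_t packed_stride(void) {
    return (size_t)((char *)&packed[1] - (char *)&packed[0]);
}
size_t padded_stride(void) {
    return (size_t)((char *)&padded[1] - (char *)&padded[0]);
}

/* Work done by thread t: it only ever writes its own padded counter. */
void bump(int t, long n) {
    assert(t >= 0 && t < NTHREADS);
    for (long k = 0; k < n; k++)
        padded[t].count++;
}
```

With the packed layout, writes by different threads to neighboring counters invalidate each other's cache line even though no data is logically shared. The padded layout trades memory for the removal of that coherence traffic.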
Shared Memory Consistency Models
Consistency Models: Strict Consistency
Consistency Models: Sequential Consistency
Consistency Models: Weak Consistency
Consistency Models: Weak Consistency in OpenMP
Consistency Models: Weak Consistency in OpenMP (cont.)
Questions?
Further Reading
D. Culler et al.: Parallel Computer Architecture, a Hardware/Software Approach. Morgan Kaufmann, 1998.
J. Hennessy, D. Patterson: Computer Architecture, a Quantitative Approach, Second edition (1996) or later. Morgan Kaufmann.
S. Adve, K. Gharachorloo: Shared memory consistency models: a tutorial. IEEE Computer, 1996.