Final Exam Topics: CSE 564 Computer Architecture Summer 2017
Overview of Final Exam Contents
• Lecture 11 – Lecture 24, not including 13
• Cache Optimization
• Instruction Level Parallelism
• Data Level Parallelism
• Thread Level Parallelism
Speedup_overall = ExTime_old / ExTime_new
                = 1 / ( (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
Using Amdahl’s Law
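A minimal worked example of the formula above, as a C sketch with made-up numbers (40% of execution time enhanced by 10x; both values are illustrative, not from the lecture):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction of execution time is enhanced. */
static double amdahl(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* Hypothetical numbers: 40% of the time is sped up by 10x. */
    double s = amdahl(0.4, 10.0);
    printf("Overall speedup = %.4f\n", s);   /* 1 / (0.6 + 0.04) = 1.5625 */
    return 0;
}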
Cache Performance
• Memory Stall Cycles: the number of cycles during which the processor is
stalled waiting for a memory access.
• Rewriting the CPU performance time:
  CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time
• The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:
  CPU Time = IC * (CPI_Execution + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty) * Clock Cycle Time
• Example: 2/3.1 (64.5%) of the time the processor is stalled waiting for memory!
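As a sanity check on that 2/3.1 figure, here is a minimal C sketch; the base CPI of 1.1 and the 2 memory-stall cycles per instruction are assumptions chosen to reproduce the slide's ratio, not values given on the slide:

#include <stdio.h>

int main(void) {
    double cpi_execution  = 1.1;   /* CPI ignoring memory stalls (assumption) */
    double mem_stall_cpi  = 2.0;   /* memory-stall cycles per instruction (assumption) */
    double total_cpi      = cpi_execution + mem_stall_cpi;   /* 3.1 */
    double stall_fraction = mem_stall_cpi / total_cpi;       /* 2/3.1 */
    printf("Fraction of time stalled on memory = %.1f%%\n", stall_fraction * 100.0); /* ~64.5% */
    return 0;
}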
Memory Hierarchy Performance
• Two indirect performance measures have waylaid many a
computer designer.
– Instruction count is independent of the hardware;
– Miss rate is independent of the hardware.
CPU Time = IC * (CPI_Execution + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty) * Clock Cycle Time
• A better measure of memory hierarchy performance is the Average Memory Access Time (AMAT):
  Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
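A small C sketch of the AMAT formula with made-up parameters (1-cycle hit time, 5% miss rate, 100-cycle miss penalty; none of these values come from the slide):

#include <stdio.h>

/* AMAT = Hit Time + Miss Rate * Miss Penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));  /* 1 + 0.05 * 100 = 6 cycles */
    return 0;
}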
Summary of the 10 Advanced Cache Optimization Techniques
A Summary on Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): first access
to a block
– “Cold” fact of life: not a whole lot you can do about it
– Note: If you are going to run “billions” of instructions, Compulsory Misses are
insignificant
• Conflict (collision):
– Multiple memory locations mapped
to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
• Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
Sequence of access: x[0][0], x[0][1], x[0][2], …
Sequential accesses instead of striding through memory every 100 words; improved
spatial locality
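For contrast, the "before" version of this loop-interchange example (reconstructed here from the standard textbook form, since only the "after" code appears above) walks down a column of x, so consecutive accesses are 100 words apart:

/* Before (reconstructed for illustration): the inner loop touches
 * x[0][j], x[1][j], x[2][j], ... -- a stride of 100 words per access. */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];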
Design Guideline for Caches
• Cache block size: 32 or 64 bytes
– Fixed size across cache levels
• Cache sizes (per core):
– L1: Small and fastest, for low hit time; 2K to 64K each for separate D$ and I$
– L2: Larger, for low miss rate; 256K – 512K, combined D$ and I$
– L3: Even larger, for low miss rate; 1MB – 8MB, combined D$ and I$
• Associativity
– L1: direct-mapped, 2/4-way
– L2: 4/8-way
• Banked, pipelined and non-blocking access
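Not from the slides, but a small sketch of how these parameters determine cache geometry (number of sets, index bits, offset bits); the 32 KB / 4-way / 64-byte values are assumptions for illustration:

#include <stdio.h>

/* Derive cache geometry from capacity, associativity, and block size. */
static unsigned log2u(unsigned x) { unsigned b = 0; while (x >>= 1) b++; return b; }

int main(void) {
    unsigned capacity = 32 * 1024;   /* e.g. a 32 KB L1 data cache (assumption) */
    unsigned assoc    = 4;           /* 4-way set associative (assumption)      */
    unsigned block    = 64;          /* 64-byte blocks, as recommended above    */
    unsigned sets     = capacity / (assoc * block);
    printf("sets=%u, index bits=%u, offset bits=%u\n",
           sets, log2u(sets), log2u(block));   /* 128 sets, 7 index bits, 6 offset bits */
    return 0;
}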
Topics for Instruction Level Parallelism
• ILP Introduction, Compiler Techniques and Branch
Prediction
– 3.1, 3.2, 3.3
• Dynamic Scheduling (OOO)
– 3.4, 3.5 and C.5, C.6 and C.7 (FP pipeline and scoreboard)
• Hardware Speculation and Static Superscalar/VLIW
– 3.6, 3.7
• Dynamic Scheduling, Multiple Issue and Speculation
– 3.8, 3.9
• ILP Limitations and SMT
– 3.10, 3.11, 3.12
Data Dependences and Hazards
• Three kinds of dependences: data dependences (true data
dependences), name dependences, and control dependences.
• An instruction j is data dependent on instruction i if either:
1. Instruction i produces a result that may be used by instruction j (i
→ j), or
2. Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i (i → k → j, a dependence chain).
• For example, a code sequence
Loop: L.D F0, 0(x1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in
S.D F4, 0(x1) ;store result
DADDUI x1, x1, #-8 ;decrement pointer 8 bytes
BNE x1, x2, Loop ;branch x1!=x2
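In C, this loop corresponds roughly to the sketch below (my illustration: the scalar in F2 is written as s, the array has 1000 elements, and it is walked from the last element down, matching the pointer decrement):

/* Rough C equivalent of the loop above: add a scalar s to each element,
 * walking the array from the last element down to the first. */
void add_scalar(double x[1000], double s) {
    int i;
    for (i = 999; i >= 0; i = i - 1)
        x[i] = x[i] + s;
}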
Data Dependence
• Floating-point data part
Loop: L.D F0, 0(x1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in
S.D F4, 0(x1) ;store result
FP Loop: Where are the Hazards?
• First translate into MIPS code
– To simplify, assume 8 is the lowest address
– R1 stores the address of X[999] when the loop starts
FP Loop Showing Stalls: V1
• Example 3-1 (p.158): Show how the loop would look on MIPS, both
scheduled and unscheduled including any stalls or idle clock cycles. Schedule
for delays from floating-point operations, but remember that we are ignoring
delayed branches.
• Answer
† 9 clock cycles, 6 for useful work
Rewrite the code to minimize stalls?
Revised FP Loop Minimizing Stalls: V2
† 7 clock cycles
† 3 for execution (L.D, ADD.D, S.D)
† 4 for loop overhead; how to make it faster?
Unroll Loop Four Times: V3
1. Loop: L.D F0,0(R1)
3. ADD.D F4,F0,F2
6. S.D 0(R1),F4 ;drop DSUBUI & BNEZ
7. L.D F6,-8(R1)
9. ADD.D F8,F6,F2
12. S.D -8(R1),F8 ;drop DSUBUI & BNEZ
13. L.D F10,-16(R1)
15. ADD.D F12,F10,F2
18. S.D -16(R1),F12 ;drop DSUBUI & BNEZ
19. L.D F14,-24(R1)
21. ADD.D F16,F14,F2
24. S.D -24(R1),F16
25. DADDUI R1,R1,#-32 ;alter to 4*8
26. BNEZ R1,LOOP
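At the source level, unrolling by four corresponds roughly to the C sketch below (my illustration, reusing the x[]/s loop from the data-dependence example; like the slide's version, it assumes the trip count is a multiple of 4):

/* Source-level view of unrolling by 4: four copies of the body, and one
 * decrement plus one branch amortized over four elements. */
void add_scalar_unrolled(double x[1000], double s) {
    int i;
    for (i = 999; i >= 3; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}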
Unrolled Loop That Minimizes Stalls: V4
1. Loop: L.D F0, 0(R1)
2. L.D F6, -8(R1)
3. L.D F10, -16(R1)
4. L.D F14, -24(R1)
5. ADD.D F4 ,F0, F2
6. ADD.D F8, F6, F2
7. ADD.D F12, F10, F2
8. ADD.D F16, F14, F2
9. S.D 0(R1), F4
10. S.D -8(R1), F8
11. S.D -16(R1), F12
12. DSUBUI R1, R1, #32
13. S.D 8(R1), F16 ; 8-32 = -24
14. BNEZ R1, LOOP
† 14 clock cycles
Four Versions Compared
Latency and Interval
• Latency
– The number of intervening cycles between an instruction that
produces a result and an instruction that uses the result.
– Usually the number of stages after EX that an instruction
produces a result
• Integer ALU latency 0, load latency 1
• Initiation or repeat interval
– the number of cycles that must elapse between issuing two
operations of a given type.
Data Hazards: An Example
I1 FDIV.D f6, f6, f4
(Figure: the full instruction sequence I1 – I6, with its RAW, WAR, and WAW hazards marked.)
Instruction Scheduling
I1 FDIV.D f6, f6, f4
I2 FLD f2, 45(x3)
(Figure: dependence graph for I1 – I6.)
Valid orderings:
  in-order      I1 I2 I3 I4 I5 I6
  out-of-order  I2 I1 I3 I4 I5 I6
  out-of-order  I1 I2 I3 I5 I4 I6
Dynamic Scheduling
• Rearrange order of instructions to reduce stalls while
maintaining data flow
– Minimize RAW Hazards
– Minimize WAW and WAR hazards via Register Renaming
– Handle hazards between registers and memory (loads/stores)
• Advantages:
– Compiler doesn’t need to have knowledge of microarchitecture
– Handles cases where dependencies are unknown at compile time
• Disadvantage:
– Substantial increase in hardware complexity
– Complicates exceptions
Dynamic Scheduling
• Dynamic scheduling implies:
– Out-of-order execution
– Out-of-order completion
• Tomasulo’s Approach
– Tracks when operands are available
– Introduces register renaming in hardware
• Minimizes WAW and WAR hazards
Register Renaming
• Example:
  DIV.D F0,F2,F4
  ADD.D F6,F0,F8
  S.D   F6,0(R1)
  SUB.D F8,F10,F14   ; anti-dependence on F8 (with the ADD.D)
  MUL.D F6,F10,F8    ; output dependence on F6 (with the ADD.D)
Register Renaming
• Register renaming by reservation stations (RS)
– Each entry contains:
• The instruction
• Buffered operand values (when available)
• Reservation station number of instruction providing the
operand values
– RS fetches and buffers an operand as soon as it becomes available (not
necessarily involving register file)
– Pending instructions designate the RS to which they will send their output
• Result values broadcast on the common data bus (CDB)
– Only the last output updates the register file
– As instructions are issued, the register specifiers are renamed with the
reservation station
– May be more reservation stations than registers
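The fields in the example tables below (Busy, Op, Vj, Vk, Qj, Qk) can be pictured as one record per reservation station; this is only an illustrative C sketch with my own field names, not the lecture's exact structure:

/* Illustrative sketch of one reservation-station entry (field names follow
 * the Vj/Vk/Qj/Qk convention used in the example tables below). */
typedef struct {
    int    busy;     /* entry in use?                                         */
    int    op;       /* operation to perform (e.g. ADDD, MULTD)               */
    double Vj, Vk;   /* operand values, once available                        */
    int    Qj, Qk;   /* tag of the RS producing the operand, or 0 if in Vj/Vk */
    int    dest;     /* tag broadcast on the CDB when the result is ready     */
} ReservationStation;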
Tomasulo Example Cycle 4
Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD    F6   34+  R2       1         3            4
  LD    F2   45+  R3       2         4
  MULTD F0   F2   F4       3
  SUBD  F8   F6   F2       4
  DIVD  F10  F0   F6
  ADDD  F6   F8   F2
Load buffers:
  Load1  Busy: No
  Load2  Busy: Yes   Address: 45+R3   (waiting for the data from memory)
  Load3  Busy: No
Reservation Stations:
  Time  Name   Busy  Op     Vj      Vk      Qj      Qk
        Add1   Yes   SUBD   M(A1)                   Load2
        Add2   No
        Add3   No
        Mult1  Yes   MULTD           R(F4)   Load2
        Mult2  No
  (M(A1): the value loaded from the address originally in Load1)
• + Reorder Buffer
• – Store Buffer
  – Integrated into the ROB
Four Steps of Speculative Tomasulo
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send
operands & reorder buffer no. for destination (this stage sometimes called
“dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result;
when both in reservation station, execute; checks RAW (sometimes called
“issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update register with
result (or store to memory) and remove instr from reorder buffer.
Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
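One way to picture the reorder-buffer entry used by these four steps is the C sketch below; the field names are mine for illustration, not the lecture's exact structure:

/* Illustrative sketch of one reorder-buffer (ROB) entry. Instructions are
 * allocated ROB entries in issue order and commit from the head in order. */
typedef struct {
    int    busy;        /* entry allocated?                         */
    int    instr_type;  /* e.g. ALU op, load, store, branch         */
    int    dest_reg;    /* architectural destination register       */
    double value;       /* result, once written on the CDB          */
    int    ready;       /* result present, so the entry can commit  */
} ROBEntry;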
Instruction In-order Commit
• Also called completion or graduation
• In-order commit
– In-order issue
– Out-of-order execution
– Out-of-order completion
• Three cases when an instr reaches the head of ROB
– Normal commit: when an instruction reaches the head of the ROB and its result
is present in the buffer
• The processor updates the register with the result and removes the instruction
from the ROB.
– Committing a store:
• is similar except that memory is updated rather than a result register.
– A branch with incorrect prediction
• indicates that the speculation was wrong.
• The ROB is flushed and execution is restarted at the correct successor of the
branch.
Example with ROB and Reservation Stations (Dynamic Scheduling and Speculation)
• The ROB is FLUSHED if a misprediction is detected
Lecture 17: Instruction Level Parallelism
-- Hardware Speculation
and VLIW (Static Superscalar)
Loop Unrolling in VLIW
• Unroll 10 times
  – Assuming enough registers
• 10 results in 10 clocks, or 1 clock per iteration
• Average: 3.2 ops per clock (32/10); 64% slot efficiency (32/50)
Very Important Terms
• Dynamic Scheduling → Out-of-order Execution
• Speculation → In-order Commit
• Superscalar → Multiple Issue
Techniques, their goals, implementations, the hazards they address, and approaches:
– Dynamic Scheduling: goal is out-of-order execution; implemented with reservation stations, load/store buffers, and the CDB; addresses data hazards (RAW, WAW, WAR) via register renaming.
– Speculation: goal is in-order commit; implemented with branch prediction (BHT/BTB) and the reorder buffer; addresses control hazards (branch, func, exception) via prediction and misprediction recovery.
– Superscalar/VLIW: goal is multiple issue, to lower CPI below 1; implemented in software and/or hardware, by the compiler (VLIW) or by hardware (superscalar).
Lecture 18: Instruction Level Parallelism
-- Dynamic Scheduling, Multiple
Issue, and Speculation
(Figure: a vector register file v0, v1, v2, v3, each register holding elements [0], [1], …, [VLRMAX-1]; the vector length register VLR; a vector arithmetic instruction ADDV v3, v1, v2 adding v1 and v2 element-wise over elements [0] … [VLR-1]; and vector memory access with the base address in r1 and the stride in r2.)
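In scalar C terms, ADDV v3, v1, v2 under a vector length VLR behaves like the sketch below (illustrative only; real hardware executes the elements in parallel or in a pipeline):

/* Element-wise behaviour of ADDV v3, v1, v2 for the first vlr elements. */
void addv(double v3[], const double v1[], const double v2[], int vlr) {
    int i;
    for (i = 0; i < vlr; i = i + 1)
        v3[i] = v1[i] + v2[i];
}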
VMIPS Vector Instructions
• Suffix: VV suffix (vector–vector), VS suffix (vector–scalar)
• Load/Store: LV/SV (unit stride), LVWS/SVWS (with stride)
• Registers: VLR (vector length register), VM (vector mask)
AXPY (64 elements) (Y = a * X + Y) in MIPS and VMIPS
for (i = 0; i < 64; i++)
    Y[i] = a * X[i] + Y[i];
The starting addresses of X and Y are in Rx and Ry, respectively.
• # instrs: 6 (VMIPS) vs ~600 (MIPS)
• Pipeline stalls: ~64x more in MIPS than in VMIPS
• Vector chaining (forwarding): V1, V2, V3 and V4
Vector Instruction Execution with Pipelined Functional Units
ADDV C,A,B
(Figure: execution using one pipelined functional unit, completing one element pair such as A[6]+B[6] per cycle, versus execution using four pipelined functional units, completing four element pairs such as A[24]+B[24] … A[27]+B[27] per cycle.)
GPU Multi-Threading (SIMD)
• NVIDIA calls it Single-Instruction, Multiple-Thread (SIMT)
– Many threads execute the same instructions in lock-step
• A warp (32 threads)
• Each thread ≈ vector lane; 32 lanes lock step
– Implicit synchronization after every instruction (think vector
parallelism)
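A rough way to picture lock-step execution of one instruction across a 32-thread warp is the C sketch below (purely illustrative; real hardware issues the instruction once to a SIMD datapath rather than looping over lanes):

#define WARP_SIZE 32

/* Illustrative sketch: all 32 lanes (threads) of a warp execute the same
 * instruction in lock step, each on its own data. */
void warp_add(float c[WARP_SIZE], const float a[WARP_SIZE], const float b[WARP_SIZE]) {
    int lane;
    for (lane = 0; lane < WARP_SIZE; lane = lane + 1)
        c[lane] = a[lane] + b[lane];   /* one "add" issued, 32 lanes execute */
}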
Executing Many Threads (e.g. 8000) on a GPU
GPU Multi-Threading
• In SIMT, all threads share instructions but operate on their
own private registers, allowing threads to store thread-local
state
GPU Multi-Threading
• GPUs execute many groups of SIMT threads in parallel
– Each executes instructions independent of the others
Warp Switching
SMs can support more concurrent SIMT groups (warps) than their core count would
suggest: coarse-grained multiwarping (the term I coined)
Topics for Thread Level Parallelism
• TLP Introduction
– 5.1
• SMP and Snooping Cache Coherence Protocol
– 5.2
• Distributed Shared-Memory and Directory-Based Coherence
– 5.4
• Synchronization Basics and Memory Consistency Model
– 5.5, 5.6
• Others
Examples of MIMD Machines
• Symmetric Shared-Memory Multiprocessor (SMP)
  – Multiple processors in a box with shared-memory communication
  – Current multicore chips are like this
  – Every processor runs a copy of the OS
  (Figure: processors P on a shared bus with memory.)
• Distributed/Non-uniform Shared-Memory Multiprocessor
  – Multiple processors, each with local memory, connected by a general scalable network
  – Extremely light “OS” on each node provides simple services
    • Scheduling/synchronization
  – Network-accessible host for I/O
  (Figure: a grid of processor/memory (P/M) nodes connected by a network, with a host.)
• Cluster
  – Many independent machines connected with a general network
  – Communication through messages
Caches and Cache Coherence
• Caches play a key role in all cases
  – Reduce average data access time
  – Reduce bandwidth demands placed on the shared interconnect
• Example: u is initially 5 in memory; processors read u into their caches (a3 = *u, a2 = *u, b1 = *u) while another processor writes *u = 7
  (Figure: three processors with caches holding u:5, shared memory, and I/O devices; the accesses are numbered as events 1, 2, 3.)
Things to note:
– Processors see different values for u after event 3
– With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
– Processes accessing main memory may see a very stale value
– Unacceptable to programs, and frequent!
Cache Coherence Protocols
• Snooping Protocols
– Send all requests for data to all processors
– Processors snoop a bus to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for centralized shared memory machines
• Directory-Based Protocols
– Keep track of what is being shared in a centralized location
– Distributed memory => distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Commonly used for distributed shared memory machines
Implementation of Cache Coherence Protocol -- 1
(Figure: CPU 0 and CPU 1, each with a cache, sharing memory. A block written by CPU 0 is invalidated in CPU 1's cache; the block is then owned by CPU 0, which services subsequent read/write misses.)
(Figure: write-invalidate example — P1 and P2 have cached u = 5; P3 writes u = 7, invalidating their copies, so their later reads of u (events 4 and 5) obtain the new value 7.)
Write-Update (Broadcast)
• Update all the cached copies of a data item when that item
is written.
– Updates are sent even to processors that may never need the copy again
• Consumes considerably more bandwidth
• Recent multiprocessors have opted to implement a write
invalidate protocol
(Figure: write-update example — P1 and P2 have cached u = 5; when P3 writes u = 7, the update is broadcast, so the cached copies and memory become 7 and the later reads of u (events 4 and 5) return 7.)
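To make the invalidate-versus-update contrast concrete, here is a minimal C sketch of a snooping write-invalidate (MSI-style) state machine for one cache line; the three states are the textbook convention and the broadcast_invalidate() helper named in the comment is hypothetical, not code from the lecture:

/* Minimal MSI-style write-invalidate sketch for one cache line. */
typedef enum { INVALID, SHARED, MODIFIED } LineState;

/* Local processor writes the line: gain exclusive ownership; other caches
 * must drop their copies. */
LineState on_cpu_write(LineState s) {
    /* if (s != MODIFIED) broadcast_invalidate();  -- bus transaction (sketch) */
    return MODIFIED;
}

/* An invalidate for this line is observed on the bus: drop our copy. */
LineState on_bus_invalidate(LineState s) {
    (void)s;
    return INVALID;
}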