Cache Design
CPU-DRAM Gap
Memory Cache
(diagram: CPU, cache, and memory)
Memory Locality
Memory hierarchies take advantage of memory locality: the principle that future memory accesses are near past accesses. They exploit two types of locality:
Temporal locality -- near in time: we will often access the same data again very soon.
Spatial locality -- near in space: our next access is often very close to our last access (or recent accesses).
Example reference stream: 1,2,3,4,5,6,7,8,8,47,8,9,8,10,8,8... (the sequential run 1-8 shows spatial locality; the repeated accesses to 8 show temporal locality)
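A minimal C sketch of both kinds of locality (the array and loop are illustrative, not from the slides):

#include <stdio.h>

int main(void) {
  int a[1024];
  long sum = 0;             /* temporal locality: sum is touched every iteration */
  for (int i = 0; i < 1024; i++)
    a[i] = i;               /* spatial locality: a[i] sits right after a[i-1] */
  for (int i = 0; i < 1024; i++)
    sum += a[i];            /* both: sequential a[i], plus the reused sum */
  printf("sum = %ld\n", sum);
  return 0;
}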
(diagram: memory hierarchy -- levels of memory above main memory, with disk below)
Cache Fundamentals
(diagram: CPU, highest-level cache, next-level memory/cache)
cache hit -- an access where the data is found in the cache
cache miss -- an access where the data is not found in the cache
hit time -- time to access the higher-level memory/cache
miss penalty -- time to move data from the lower level to the upper level, then to the CPU
hit ratio -- percentage of accesses where the data is found in the higher-level cache
miss ratio -- (1 - hit ratio)
cache block size (or cache line size) -- the amount of data that gets transferred on a cache miss
instruction cache -- a cache that holds only instructions
data cache -- a cache that holds only data
unified cache -- a cache that holds both (a unified L1 corresponds to the Princeton architecture)
Cache Characteristics
Cache Organization
Cache Access
Cache Replacement
Write Policy
(diagram: a cache -- an array of tag and data entries)
Cache Organization
A typical cache has three dimensions:
Number of sets (together with the other two, this determines cache size)
Block size
Blocks per set (associativity -- the number of ways)
The address divides into tag, index, and block offset.
(diagram: tag and data arrays arranged as sets x ways)
Numbers are averages across a set of benchmarks. Performance improvements vary greatly by individual benchmarks.
(diagram: direct-mapped cache -- the address splits into tag, index, and word offset; the index selects an entry, and a valid bit plus tag comparison determines hit/miss)
(diagram: set-associative cache v. direct mapped -- the index selects a set; each way has its own valid bit and tag, compared in parallel, with a mux selecting the hitting way's data)
21264 L1 Cache
64 KB, 64-byte blocks, 2-way set associative (SA = 2)
v. direct mapped: a set-associative cache must know which way holds the data to control the output mux. A direct-mapped cache can start operating on the data without waiting for the tag check; a set-associative cache first needs to know which way hit. The 21264 uses a line predictor to get around this.
Cache Organization
Evaluation of cache access time via CACTI + simulation: Intel wins by a hair.
(From Mark Hill's SPEC data) SPEC2000 miss rates: Intel Core 2 Duo 0.00346, AMD Opteron 0.00366.
Cache Access
Cache Size = #sets x block size x associativity
What is the cache size of a direct-mapped, 128-set cache with a 32-byte block?
What is the associativity of a 128 KByte cache with 512 sets and a block size of 64 bytes?
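Worked solutions: cache size = 128 sets x 32 bytes x 1 way (direct mapped) = 4 KB. Associativity = 128 KB / (512 sets x 64 bytes) = 131072 / 32768 = 4-way.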
Cache Access
16 KB, 4-way set-associative cache, 32-bit addresses, byte-addressable memory, 32-byte cache blocks/lines. How many tag bits? Where would you find the word at address 0x200356A4? (worked solution below)
(diagram: 4-way set-associative lookup -- the index selects a set of four tag/data pairs)
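Worked solution: 16 KB / 32-byte blocks = 512 blocks; 512 / 4 ways = 128 sets, so 7 index bits and 5 block-offset bits. Tag bits = 32 - 7 - 5 = 20.
For 0x200356A4 (binary 0010 0000 0000 0011 0101 0110 1010 0100): block offset = 0b00100 = byte 4 (word 1 of the block), index = 0b0110101 = set 53, tag = 0x20035.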
Write Policy
Write-through (WT) is always combined with write buffers so that the CPU does not wait for the lower-level memory.
Write allocate -- the block is loaded into the cache on a write miss.
No-write allocate -- the block is modified in the lower levels of memory but not brought into the cache.
A write buffer allows merging of writes.
(diagram: write-buffer merging -- sequential writes to addresses 100, 104, 108, 112 collapse into one buffer entry)
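A minimal sketch of the two write-miss policies; cache_fill, cache_write, and memory_write are hypothetical stand-ins, not a real API:

#include <stdint.h>

static void cache_fill(uint64_t block)  { (void)block; } /* fetch block into cache */
static void cache_write(uint64_t addr)  { (void)addr; }  /* update the word in cache */
static void memory_write(uint64_t addr) { (void)addr; }  /* update lower level only */

enum write_miss_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };

void handle_write_miss(uint64_t addr, uint64_t block_size, enum write_miss_policy p) {
  if (p == WRITE_ALLOCATE) {
    cache_fill(addr / block_size);  /* bring the block in on a write miss... */
    cache_write(addr);              /* ...then write into the cache */
  } else {
    memory_write(addr);             /* modify lower-level memory; cache untouched */
  }
}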
Area-limited designs may consider unified caches. Generally, the benefits of separating the caches are overwhelming (what are the benefits?)
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Cache Performance
CPU time = IC x (CPI_execution + Memory stalls per instruction) x Clock cycle time
CPU time = IC x (CPI_execution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
(includes hit time as part of CPI)
Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
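A quick worked example with assumed numbers (not from the slides): hit time = 1 cycle, miss rate = 2%, miss penalty = 100 cycles, 1.5 memory accesses per instruction, CPI_execution = 1.0. Then AMAT = 1 + 0.02 x 100 = 3 cycles, and memory stalls per instruction = 1.5 x 0.02 x 100 = 3, so overall CPI = 1.0 + 3 = 4.0 -- the machine spends three quarters of its cycles stalled on memory.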
Reducing Misses
Classifying Misses: 3 Cs
Compulsory -- the first access to a block cannot hit, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
Capacity -- if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved.
Conflict -- if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses.
How to measure:
Compulsory -- misses in an infinite cache
Capacity -- non-compulsory misses in a size-X fully associative cache
Conflict -- non-compulsory, non-capacity misses
16K cache; the miss penalty for a 16-byte block is 42 cycles, for a 32-byte block 44, and for a 64-byte block 48. The miss rates are 3.94%, 2.87%, and 2.64%. Which block size is best?
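Worked comparison, assuming a 1-cycle hit time (not given on the slide):
16-byte blocks: AMAT = 1 + 0.0394 x 42 = 2.65
32-byte blocks: AMAT = 1 + 0.0287 x 44 = 2.26
64-byte blocks: AMAT = 1 + 0.0264 x 48 = 2.27
32-byte blocks win: past that point the larger miss penalty outweighs the lower miss rate.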
Prefetching relies on extra memory bandwidth that can be used without penalty
Stream Buffers
Allocate a stream buffer on a cache miss
Run ahead of the execution stream, prefetching N blocks into the stream buffer
Search the stream buffer in parallel with the cache access
If it hits, move the block to the cache and prefetch another block (sketch below)
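A minimal sketch of this policy for a single sequential stream buffer; cache_lookup, cache_fill, and prefetch are hypothetical stand-ins, not a real API:

#include <stdbool.h>
#include <stdint.h>

#define STREAM_DEPTH 4  /* N, the number of blocks to run ahead (assumed) */

static bool cache_lookup(uint64_t block) { (void)block; return false; }
static void cache_fill(uint64_t block)   { (void)block; } /* install block in cache */
static void prefetch(uint64_t block)     { (void)block; } /* start fetching block */

typedef struct {
  uint64_t next_block;  /* block at the head of the FIFO */
  int      filled;      /* prefetched blocks currently waiting */
} stream_buffer;

void cache_access(stream_buffer *sb, uint64_t addr, uint64_t block_size) {
  uint64_t block = addr / block_size;
  /* hardware searches the stream buffer in parallel with the cache;
     modeled sequentially here */
  if (cache_lookup(block))
    return;                                    /* ordinary cache hit */
  if (sb->filled > 0 && block == sb->next_block) {
    cache_fill(block);                         /* move block from buffer to cache */
    sb->next_block = block + 1;                /* advance the FIFO head */
    prefetch(sb->next_block + sb->filled - 1); /* prefetch another block */
  } else {
    cache_fill(block);                         /* true miss: fetch the block */
    sb->next_block = block + 1;                /* and (re)allocate the buffer */
    for (int i = 0; i < STREAM_DEPTH; i++)
      prefetch(block + 1 + i);
    sb->filled = STREAM_DEPTH;
  }
}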
The impact of hardware prefetching on the Pentium 4 is pretty large (the 15 benchmarks not shown benefited by less than 15%).
Data Reordering
Merging arrays: improve spatial locality with a single array of compound elements instead of 2 separate arrays
Loop interchange: change the nesting of loops to access data in the order it is stored in memory (see the sketch below)
Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
Blocking: improve temporal locality by accessing blocks of data repeatedly instead of walking whole rows or columns
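A minimal loop-interchange sketch (x and N are illustrative; the other transformations follow the same spirit):

#define N 1024
static double x[N][N];

/* Before: the inner loop walks down a column; in C's row-major layout
   consecutive accesses are N doubles apart (stride N), wasting each block */
void scale_before(void) {
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row (stride 1),
   so every word of a fetched cache block is used */
void scale_after(void) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      x[i][j] = 2 * x[i][j];
}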
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1) {  /* bound fixed: jj+B-1 would skip the last column of each block */
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;  /* x must be zeroed before the loops, since it now accumulates */
      }
Capacity misses fall from 2N^3 + N^2 to 2N^3/B + N^2 (B is called the blocking factor). Conflict misses are not as easy...
Key Points
Remember the danger of concentrating on just one parameter when evaluating performance. Next: reducing miss penalty.
Sub-blocks
Each sub-block gets its own valid bit, so only part of a block must be fetched on a miss (some PPC machines do this).
Most useful with large blocks. Spatial locality is a problem: we often want the next sequential word soon, so it is not always a benefit (early restart).
Assumes stall-on-use, not stall-on-miss, which works naturally with dynamic scheduling but can also work with static scheduling.
- 8 KB cache, 16-cycle miss penalty, 32-byte blocks
- old data; a good model for misses to L2, not a good model for misses to main memory (~300 cycles)
Definitions:
Local miss rate -- misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate -- misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
Average memory-access time = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
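A worked example with assumed numbers (not from the slides): Hit Time_L1 = 1 cycle, Miss Rate_L1 = 4%, Hit Time_L2 = 10 cycles, local Miss Rate_L2 = 50%, Miss Penalty_L2 = 100 cycles. Then AMAT = 1 + 0.04 x (10 + 0.5 x 100) = 1 + 0.04 x 60 = 3.4 cycles, and the global L2 miss rate is 0.04 x 0.5 = 2% -- far lower than the alarming-looking 50% local rate.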