Cache Performance Average Memory Access Time

This document discusses cache performance including cache hit/miss rates, miss penalties, and average memory access time. It also covers ways to improve cache performance such as prefetching, interleaving memory modules, and using multiple cache levels.

Cache Performance

Reading: Chap. 8.7


Recall: Cache Usage
• Cache Read (or Write) Hit/Miss: The read (or write)
operation can/cannot be performed on the cache.
[Diagram: Processor — Cache (unit: cache line) — Main Memory (unit: cache block / word)]
• Cache Block / Line: The unit composed of multiple
successive memory words (size: cache block > word).
– The contents of a cache block (of memory words) will be
loaded into or unloaded from the cache at a time.
• Mapping Functions: Decide how cache is organized
and how addresses are mapped to the main memory.
• Replacement Algorithms: Decide which item is to be
unloaded from the cache when the cache is full.
CSCI2510 Lec08: Cache Performance 2
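The mapping idea above can be made concrete with a tiny sketch. This shows a direct-mapped address split for illustration only; the line size and line count below are assumed values, not parameters from this lecture:

```python
BLOCK_BYTES = 64   # assumed cache line size
NUM_LINES = 128    # assumed number of cache lines

def split_address(addr):
    """Return (tag, line index, offset within the block) for a byte address."""
    offset = addr % BLOCK_BYTES                    # which byte within the line
    index = (addr // BLOCK_BYTES) % NUM_LINES      # which cache line it maps to
    tag = addr // (BLOCK_BYTES * NUM_LINES)        # identifies the block in memory
    return tag, index, offset

print(split_address(0x12345))  # (9, 13, 5)
```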
Outline
• Performance Evaluation
– Cache Hit/Miss Rate and Miss Penalty
– Average Memory Access Time

• Performance Enhancements
– Prefetch
– Memory Module Interleaving
– Load-Through

Cache Hit/Miss Rate and Miss Penalty
• Cache Hit:
– The access can be done in the cache.
– Hit Rate: The ratio of number of hits to all accesses.
• Hit rates over 0.9 are essential for high-performance PCs.

• Cache Miss:
– The access cannot be done in the cache.
– Miss Rate: The ratio of number of misses to all accesses.
– When a cache miss occurs, extra time is needed to bring a block
from the slower main memory to the faster cache.
• During that time, the processor is stalled.
– Miss Penalty: the total access time seen by the processor
when a cache miss occurs.
Average memory access time

AMAT = Hit time + Miss rate * Miss penalty


First, let's calculate the miss penalty for each cache level:

Miss penalty for L1 cache = Access time of L2 cache + Miss rate of L2 * Miss penalty of L2 cache
Miss penalty for L2 cache = Access time of main memory + Miss rate of main memory * Miss penalty of main memory

AMAT can also be written as a weighted sum over the level that finally serves each access:

AMAT = (L1 Cache hit time * L1 Cache hit ratio) + (L2 Cache hit time * L2 Cache hit ratio) + (Main Memory access time * Main Memory hit ratio)
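The recursive miss-penalty form above can be checked with a small Python sketch. All latencies and rates below are made-up illustration values, not numbers from this document:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Hit time + Miss rate * Miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed illustration values: L2 access 10 cycles, memory 100 cycles,
# L2 miss rate 0.2, L1 hit time 1 cycle, L1 miss rate 0.05.
l1_miss_penalty = amat(10, 0.2, 100)   # L2 access + L2 misses to memory = 30
print(amat(1, 0.05, l1_miss_penalty))  # ≈ 2.5 cycles
```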
An Example of Miss Penalty
• Miss Penalty: the total access time seen by the processor
when a cache miss occurs.
• Consider a system with only one level of cache with the
following parameters:
[Diagram: CPU — Cache (access time t) — Main Memory (access time 10t)]
– Word access time to the cache: t
– Word access time to the main memory: 10t
– When a cache miss occurs, a cache block of 8 words will be
transferred from the main memory to the cache.
• Time to transfer the first word: 10t
• Time to transfer one word of the rest 7 words: t (hardware support!)
• The miss penalty can be derived as:
t + 10t + 7 × t + t = 19t
(the first t: the initial cache access that results in a miss;
the last t: the CPU accesses the requested data in the cache)
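The 19t derivation above can be transcribed directly into Python (the numbers are the ones on this slide, in units of t):

```python
def miss_penalty(t=1, block_words=8):
    """Miss penalty of the slide's example, in units of t.

    initial cache access that misses (t) + first word from memory (10t)
    + the remaining 7 words (t each, with hardware support)
    + the CPU reading the requested word from the cache (t).
    """
    return t + 10 * t + (block_words - 1) * t + t

print(miss_penalty())  # 1 + 10 + 7*1 + 1 = 19
```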
Average Memory Access Time
• Consider a system with only one level of cache:
– h: Cache Hit Rate
– 1 − h: Cache Miss Rate
– C: Cache Access Time
– M: Miss Penalty
• It mainly consists of the time to access a block in the main memory.
(Recall the expected value in probability: E[X] = Σᵢ xᵢ × f(xᵢ))
• The average memory access time can be defined as:
t_avg = h × C + (1 − h) × M
[Diagram: Processor — Cache — Main Memory; a fraction h of accesses is
served by the cache and a fraction (1 − h) goes to the main memory]
– For example, given h = 0.9, C = 1 cycle, M = 19 cycles:
 Avg. memory access time: 0.9 × 1 + 0.1 × 19 = 2.8 (cycles)
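The single-level formula and the slide's worked example can be checked in a couple of lines of Python:

```python
def t_avg(h, C, M):
    """Average memory access time for one level of cache."""
    return h * C + (1 - h) * M

print(t_avg(0.9, 1, 19))  # ≈ 2.8 cycles, matching the slide
```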
Real-life Example: Intel Core 2 Duo
• Number of Processors :1
• Number of Cores : 2 per processor
• Number of Threads : 2 per processor
• Name : Intel Core 2 Duo E6600
• Code Name : Conroe
• Specification : Intel(R) Core(TM)2 CPU 6600@2.40GHz
• Technology : 65 nm
• Core Speed : 2400 MHz
• Multiplier x Bus speed : 9.0 x 266.0 MHz = 2400 MHz
• Front-Side-Bus speed : 4 x 266.0MHz = 1066 MHz
• Instruction Sets : MMX, SSE, SSE2, SSE3, SSSE3, EM64T
• L1 Cache
– Data Cache : 2 x 32 KBytes, 8-way set associative, 64-byte line size
– Instruction Cache : 2 x 32 KBytes, 8-way set associative, 64-byte line size
• L2 Cache : 4096 KBytes, 16-way set associative, 64-byte line size
Separate Instruction/Data Caches (1/2)
• Consider the system with only one level of cache:
– Word access time to the cache: 1 cycle
– Word access time to the main memory: 10 𝑐𝑦𝑐𝑙𝑒𝑠
– When a cache miss occurs, a cache block of 8 words will be
transferred from the main memory to the cache.
• Time to transfer the first word: 10 𝑐𝑦𝑐𝑙𝑒𝑠
• Time to transfer one word of the rest 7 words: 1 cycle
– Miss Penalty: 1 + 10 + 7 × 1 + 1 = 19 (cycles)

• Assume there are 130 memory accesses in total:
– 100 memory accesses for instructions, with hit rate 0.95
– 30 memory accesses for data (operands), with hit rate 0.90


Separate Instruction/Data Caches (2/2)
• Total execution cycles without cache:
t_without = 100 × 10 + 30 × 10 = 1300 cycles
– All of the memory accesses will result in a reading of a
memory word (of latency 10 cycles).
• Total execution cycles with cache, using the avg. memory access
time h × C + (1 − h) × M for instructions and for data:
t_with = 100 × (0.95 × 1 + 0.05 × 19) + 30 × (0.9 × 1 + 0.1 × 19)
= 274 cycles
• The performance improvement:
t_without / t_with = 1300 / 274 = 4.74 (speed up!)
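The with/without-cache comparison above reduces to a few lines of Python using the slide's numbers:

```python
def t_avg(h, C, M):
    """Average memory access time: h*C + (1-h)*M."""
    return h * C + (1 - h) * M

t_without = 100 * 10 + 30 * 10  # every access reads a word from memory: 1300
t_with = 100 * t_avg(0.95, 1, 19) + 30 * t_avg(0.90, 1, 19)
print(t_with, t_without / t_with)  # ≈ 274 cycles, speed-up ≈ 4.74
```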
Class Exercise 8.2
• Consider the same system with one level of cache.
– Word access time to the cache: 1 cycle
– Word access time to the main memory: 10 𝑐𝑦𝑐𝑙𝑒𝑠
– Miss Penalty: 1 + 10 + 7 × 1 + 1 = 19 (cycles)
• What is the performance difference between this
cache and an ideal cache?
– Ideal Cache: All the accesses can be done in cache.



Multi-Level Caches
• In high-performance processors, two levels of caches are
normally used, L1 and L2.
– L1 Cache: Must be very fast as they determine the memory
access time seen by the processor.
– L2 Cache: Can be slower, but it should be much larger
than the L1 cache to ensure a high hit rate.
• The avg. memory access time of two levels of caches:
t_avg = h1 × C1 + (1 − h1) × (h2 × M_L1 + (1 − h2) × M_L2)
– h1 / h2: hit rate of L1 cache / L2 cache
– C1 / C2 / Mem: access time to L1 cache / L2 cache / memory
– M_L1: miss penalty of L1 miss & L2 hit
• E.g., M_L1 = C1 + C2 + C1
– M_L2: miss penalty of L1 miss & L2 miss
• E.g., M_L2 = C1 + C2 + Mem + C2 + C1
(Compare: the avg. memory access time of one level of cache is
t_avg = h × C + (1 − h) × M)
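The two-level formula can be sketched in Python using the example miss penalties on this slide. The latencies below (C1 = 1, C2 = 8, Mem = 100 cycles) are assumed illustration values, not from the lecture:

```python
def t_avg_two_level(h1, h2, C1, C2, Mem):
    """Two-level AMAT, with the slide's example miss-penalty paths."""
    M_L1 = C1 + C2 + C1               # L1 miss, L2 hit
    M_L2 = C1 + C2 + Mem + C2 + C1    # L1 miss, L2 miss
    return h1 * C1 + (1 - h1) * (h2 * M_L1 + (1 - h2) * M_L2)

print(t_avg_two_level(0.9, 0.9, 1, 8, 100))  # ≈ 2.98 cycles
```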
Class Exercise 8.3
• Given a system with one level of cache, and a
system with two levels of caches.
• Assume the hit rates of L1 cache and L2 cache (if
any) are both 0.9.
• What are the probabilities that the miss penalty must be
paid to read a block from memory in the two systems?



How to Improve the Performance?
• Recall the system with only one level of cache:
– ℎ: Cache Hit Rate
– 1 − ℎ: Cache Miss Rate
– 𝐶: Cache Access Time
– 𝑀: Miss Penalty
• It mainly consists of the time to access a block in the main memory.
• The average memory access time can be defined as:
𝑡𝑎𝑣𝑔 = ℎ × 𝐶 + (1 − ℎ) × 𝑀
• Possible ways to further reduce 𝑡𝑎𝑣𝑔 :
 Use a faster cache (i.e., C ↓)? $$$...
 Improve the hit rate (i.e., h ↑)?
 Reduce the miss penalty (i.e., M ↓)?
How to Improve Hit Rate?
• How about larger block size?
– Larger blocks take more advantage of the spatial locality.
• Spatial Locality: If all items in a larger block are needed in a
computation, it is better to load them into cache in a single miss.
– Larger blocks are effective only up to a certain size:
• Too many items will remain unused before the block is replaced.
• It takes longer time to transfer larger blocks, and may also increase
the miss penalty.
– Block sizes of 16 to 128 bytes are most popular.

B
B
Main
Processor Cache Memory
Larger B
Larger B

CSCI2510 Lec08: Cache Performance 22


Prefetch: More rather than Larger
• Prefetch: Load more (rather than larger) blocks into
the cache before they are needed, while CPU is busy.
– Prefetch instructions can be inserted by the programmer or the compiler.

• Some prefetched data may be loaded into the cache and
replaced without ever being used.
– Nevertheless, the overall effect on performance is positive.
– Most processors support the prefetch instruction.

[Diagram: Processor — Cache — Main Memory, with prefetch requests
bringing blocks into the cache ahead of use]
Outline
• Performance Evaluation
– Cache Hit/Miss Rate and Miss Penalty
– Average Memory Access Time

• Performance Enhancements
– Prefetch
– Memory Module Interleaving
– Load-Through



Memory Module Interleaving (3/3)
• Which scheme below can be better interleaved?
– Scheme (a): Consecutive words in the same module.
– Scheme (b): Consecutive words in successive modules.
• Keeps multiple modules busy at any one time.
[Diagram: address layouts of the two schemes. In (a), the high-order
bits select one of the modules 0 … n−1 and the low-order m bits give
the address within the module. In (b), the low-order k bits select one
of the 2^k modules and the high-order bits give the address within the
module. Each module has an ABR (address buffer register) and a DBR
(data buffer register).]
Example of Memory Module Interleaving
• Consider a cache read miss, and we need to load a
block of 8 words from main memory to the cache.
– Assume consecutive words are in successive modules for
the better interleaving (i.e., Scheme (b)).
• For every memory module:
– Address Buffer Register (ABR) & Data Buffer Register (DBR)
– Module Operations:
• Send an address to ABR: 1 cycle
• Read the first word from module into DBR: 6 cycles
• Read a subsequent word from module into DBR: 4 cycles
• Read the data from DBR: 1 cycle
– Assume a module's reads can proceed in parallel with ABR/DBR
accesses, but only the ABR or the DBR of a module can be
accessed at a time.
Without Interleaving (Single Module)
• Total cycles to read a single word from the module:
– 1 cycle to send the address to the ABR
– 6 cycles to read the first word into the DBR
– 1 cycle to read the data from the DBR
 1 + 6 + 1 = 8 cycles
• Total cycles to read an 8-word block from the module:
– The address send and the DBR read of each word overlap with the
module reading the next word, so every word after the first adds
only 4 cycles:
 1 + 6 + 4 × 7 + 1 = 36 cycles
[Timing diagram: 1st word: 1 + 6 + 1; 2nd to 8th words: 1 + 4 + 1 each,
overlapped with the previous word's transfer]
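The single-module count above can be written as a small Python function, directly from the slide's formula:

```python
def block_read_cycles(words=8, addr=1, first=6, rest=4, dbr=1):
    """Cycles to read a block from a single module (no interleaving).

    Address sends and DBR reads of later words overlap with the module
    reading the next word, so only the first word pays the full
    addr + first + dbr path; each later word adds only `rest` cycles.
    """
    return addr + first + rest * (words - 1) + dbr

print(block_read_cycles())  # 1 + 6 + 4*7 + 1 = 36
```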
With Interleaving
• Total cycles to read an 8-word block from four interleaved
memory modules (module operations as before: send an address to
ABR: 1 cycle; read the first word: 6 cycles; read a subsequent
word: 4 cycles; read the data from DBR: 1 cycle):
– The bus is shared, so the memory addresses can only be sent to
the modules one by one.
[Timing diagram: the addresses for words 1–4 go to modules #0–#3 on
cycles 1–4, and each module reads its first word (6 cycles) in
parallel; words 5–8 are read from modules #0–#3 again (4 cycles
each), overlapped so that the DBR reads (1 cycle each) drain
back-to-back]
 1 + 6 + 1 × 8 = 15 cycles
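The interleaved count reduces to the slide's closed form. Note this formula matches the specific 4-module, 8-word case shown here; a general model would also have to check that each module finishes its next read before its DBR is drained again:

```python
def interleaved_block_read_cycles(words=8, addr=1, first=6, dbr=1):
    """Cycles to read a block from interleaved modules (Scheme (b)).

    One address goes out per cycle on the shared bus; once the first
    word is ready, one word is drained from a DBR every cycle.
    """
    return addr + first + dbr * words

single_module = 1 + 6 + 4 * 7 + 1  # without interleaving (previous slide)
print(interleaved_block_read_cycles(), single_module)  # 15 vs. 36 cycles
```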
Load-through
• Consider a read cache miss:
– Copy the block containing the requested word to the cache.
– Then forward the requested word to the CPU after the entire
block is loaded.
• Load-through: Instead of waiting for the whole block to be
transferred, send the requested word to the processor as soon
as it is ready.
– Pros: Reduces the CPU's waiting time (i.e., the miss penalty).
– Cons: At the expense of more complex circuitry ($).
[Diagram: Main Memory copies a block into the Cache, and the requested
word is forwarded to the Processor as soon as it is read from the
main memory]
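Using the timing model from the earlier 19t example, load-through lets the CPU resume once the requested word arrives instead of after the full block. A rough sketch, under the assumption (not stated on the slide) that the requested word is the first word transferred:

```python
def miss_penalty_no_load_through(t=1, block_words=8):
    # cache probe + first word + remaining 7 words + final cache access
    return t + 10 * t + (block_words - 1) * t + t

def miss_penalty_load_through(t=1):
    # cache probe + first word from memory, forwarded straight to the CPU
    return t + 10 * t

print(miss_penalty_no_load_through(), miss_penalty_load_through())  # 19 vs. 11
```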
Summary
• Performance Evaluation
– Cache Hit/Miss Rate and Miss Penalty
– Average Memory Access Time

• Performance Enhancements
– Prefetch
– Memory Module Interleaving
– Load-Through

