Cache Memory
Computer Organization and Architecture
by
William Stallings
Cache
• Small amount of fast memory
• Sits between normal main memory and CPU
• May be located on CPU chip or module
Cache/Main Memory Structure
Cache operation – overview
• CPU requests contents of memory location
• Check cache for this data
• If present, get from cache (fast)
• If not present, read required block from main memory to
cache
• Then deliver from cache to CPU
• Cache includes tags to identify which block of main memory
is in each cache line
Cache Read Operation - Flowchart
Typical Cache Organization
Elements of Cache Design
1. Size
2. Mapping Function
1) associative, 2) direct, 3) set-associative
3. Replacement Algorithm
1) LRU, 2) LFU, 3) FIFO, 4) Random
4. Write Policy
1) Write-through, 2) Write-back, 3) Write-Once
5. Block Size
6. Number of Caches
1) Single or 2-level, 2) Unified or Split
Size
• Cost
—More cache is expensive
• Speed
—More cache is faster (up to a point)
—Checking cache for data takes time
Block Placement
Strategies
Cache organizations
Direct mapped: Each block has only one place in the cache.
Mapping: (Block address) MOD (Number of blocks in cache)
Set associative: A block can be placed in a restricted set of
places in the cache
A set is a group of blocks in the cache.
Mapping: (Block address) MOD (Number of sets in the cache)
If there are n blocks in a set, the cache placement is called n-way set
associative.
Fully associative: A block can be placed anywhere in the cache.
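To make the two modulo mappings concrete, here is a minimal Python sketch; the cache geometry (8 lines, 2 lines per set) is assumed for illustration and is not taken from the slides.

# Minimal sketch of the placement rules (illustrative geometry, assumed values)
NUM_LINES = 8          # total cache lines (assumed)
LINES_PER_SET = 2      # associativity for the set-associative case (assumed)
NUM_SETS = NUM_LINES // LINES_PER_SET

def direct_mapped_line(block_address):
    # (Block address) MOD (Number of blocks in cache)
    return block_address % NUM_LINES

def set_associative_set(block_address):
    # (Block address) MOD (Number of sets in the cache)
    return block_address % NUM_SETS

block = 12
print(direct_mapped_line(block))   # 12 % 8 -> line 4
print(set_associative_set(block))  # 12 % 4 -> set 0 (either of its 2 lines)
# Fully associative: block 12 may go in any of the 8 lines.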
Example
Block Identification
The three portions of an address in a set-associative
or direct-mapped cache.
The tag is used to check all the blocks in the set
The index is used to select the set.
The block offset is the address of the desired data within the
block.
Fully associative caches have no index field.
Mapping Function
• Specification of correspondence between main memory blocks
and cache blocks
• The transformation of data from main memory to cache
memory
1) Associative mapping
2) Direct mapping
3) Set-associative mapping
Direct Mapping
• Each block of main memory maps to only one cache line
• Address consists of two parts
– Least Significant w bits - identify unique word
– Most Significant s bits - specify one memory block
• Most Significant s bits are split into
– a cache line field r bits and
– a tag of s-r bits (most significant)
Direct Mapping - Address Structure
• 24 bit address
• 2 bit word identifier (4 byte block)
• 22 bit block identifier
– 8 bits tag (=22-14)
– 14 bits line
• No two blocks in the same line have the same Tag field
• Content of cache is checked by finding line and checking Tag
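As a concrete illustration of this address structure, the following sketch splits a 24-bit address into the 8-bit tag, 14-bit line and 2-bit word fields; the sample address is hypothetical.

# Split a 24-bit address into tag (8 bits), line (14 bits), word (2 bits)
def split_direct(address):
    word = address & 0x3            # lowest 2 bits select the word in the block
    line = (address >> 2) & 0x3FFF  # next 14 bits select the cache line
    tag  = (address >> 16) & 0xFF   # top 8 bits are the tag
    return tag, line, word

tag, line, word = split_direct(0xFF0004)   # hypothetical address
print(f"tag=0x{tag:02X} line=0x{line:04X} word={word}")   # tag=0xFF line=0x0001 word=0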
Direct Mapping
Cache Line Table
Cache line   Main memory blocks held
0            0, m, 2m, 3m, …, 2^s - m
1            1, m+1, 2m+1, …, 2^s - m + 1
…            …
m-1          m-1, 2m-1, 3m-1, …, 2^s - 1
Direct mapping is a procedure used to assign
each memory block in the main memory to a
particular line in the cache. If a line is already
filled with a memory block and a new block
needs to be loaded, then the old block is
discarded from the cache.
Direct Mapping Cache Organization
Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s – r) bits
• Index = line + word
• Index field is used to access the word from cache
Direct Mapping pros & cons
• Simple
• Inexpensive
• Fixed location for given block
– If a program accesses 2 blocks that map to the same line
repeatedly, cache misses are very high.
Example: Direct Mapping Function
• Cache memory = 64 kByte, Block size = 4 bytes, Main memory = 16 MBytes
• Main memory = 16 MBytes = 2^4 * 2^10 * 2^10 = 2^24 words
  – Each word is directly addressable by a 24-bit address
• Block = 4 bytes = 2^2 words
• Cache = (64k/4) lines = 16k lines of 4 bytes each = 2^4 * 2^10 = 2^14 lines
• Main memory = (16M/4) blocks = 4M blocks of 4 bytes each = 2^2 * 2^10 * 2^10 = 2^22 blocks
• Summary:
  – Addressable units: 2^24
  – Address length: 24 bits
  – Block size: 2^2 words
  – Blocks in main memory: 2^22
  – Lines in cache: 2^14
  – s = 22, w = 2, r = 14
  – Size of tag (s - r): 8 bits
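The parameters above can be checked with a short sketch (Python, using bit_length to count address bits; the sizes are those of the example).

# Derive the direct-mapping parameters of the example above
cache_bytes  = 64 * 1024        # 64 kByte cache
block_bytes  = 4                # 4-byte blocks
memory_bytes = 16 * 1024 * 1024 # 16 MByte byte-addressable main memory

address_bits = (memory_bytes - 1).bit_length()   # 24
w = (block_bytes - 1).bit_length()               # 2 (word field)
lines = cache_bytes // block_bytes               # 2^14 lines
r = (lines - 1).bit_length()                     # 14 (line field)
s = address_bits - w                             # 22 (block identifier)
tag_bits = s - r                                 # 8

print(address_bits, w, r, s, tag_bits)           # 24 2 14 22 8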
Example
• Consider a machine with a byte-addressable main memory of 2^16
bytes and block size of 8 bytes. Assume that a direct mapped cache
consisting of 32 lines is used with this machine. How is a 16-bit
memory address divided into tag, line number, and byte number?
• Answer:
  – Block size = 8 bytes = 2^3 bytes
    • 3 bits are used for the byte offset.
  – Cache lines = 32 = 2^5
    • 5 bits are used as index bits.
  – Main memory = 2^16 bytes
    • Each word is directly addressable by a 16-bit address.
  – Remaining 16 - (5 + 3) = 8 bits are used as tag bits.
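A small sketch confirming the 8/5/3 split for this machine (the sample address is hypothetical):

# Split a 16-bit address: 8-bit tag, 5-bit line, 3-bit byte offset
def split(address):
    offset = address & 0b111           # 3 bits: byte within an 8-byte block
    line   = (address >> 3) & 0b11111  # 5 bits: one of 32 cache lines
    tag    = address >> 8              # remaining 8 bits: tag
    return tag, line, offset

print(split(0xABCD))   # (171, 25, 5) -> tag 0xAB, line 25, byte 5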
Assignment
1. A 16MB main memory has 32KB cache with 8 bytes per line.
i) How many lines are there in the cache?
ii) Show how the main memory and cache memory are
organized when the cache is direct-mapped.
iii) Show how the main memory address is partitioned.
2. A digital computer has a memory unit of 128K x 16 and a cache
memory of 1K words. The cache uses direct mapping with a
block size of four words.
– How many bits are there in the tag, index, block and word fields
of the address format?
– How many blocks can the cache accommodate?
Associative Mapping
• A main memory block can load into any line of cache
• Memory address is interpreted as tag and word
• Tag uniquely identifies block of memory
• Every line’s tag is examined for a match.
• Cache searching gets expensive.
• Address Structure
– Compare tag field with tag entry in cache to check for hit
– Least significant 2 bits of address identify the word
Associative Cache Organization
Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = undetermined
• Size of tag = s bits
Example: Associative Mapping
• Cache memory = 64 kByte, Block size = 4 bytes, Main memory = 16 MBytes
• Main memory = 16 MBytes = 2^4 * 2^10 * 2^10 = 2^24 words
  – Each word is directly addressable by a 24-bit address
• Block = 4 bytes = 2^2 words
• Cache = (64k/4) lines = 16k lines of 4 bytes each = 2^4 * 2^10 = 2^14 lines
• Main memory = (16M/4) blocks = 4M blocks of 4 bytes each = 2^2 * 2^10 * 2^10 = 2^22 blocks
• Summary:
  – Addressable units: 2^24
  – Address length: 24 bits
  – Block size: 2^2 words
  – Blocks in main memory: 2^22
  – Lines in cache: 2^14
  – s = 22, w = 2
  – Size of tag: 22 bits
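A minimal sketch of an associative lookup, using a Python dict keyed by tag as a stand-in for the parallel tag comparison a real cache performs; block contents and addresses are illustrative.

# Fully associative lookup sketch: any block may occupy any line
cache = {}                     # tag -> block data

def lookup(address, word_bits=2):
    tag = address >> word_bits                 # everything above the word field is the tag
    word = address & ((1 << word_bits) - 1)
    if tag in cache:                           # hit: block found in some line
        return cache[tag][word]
    return None                                # miss: caller fetches the block from memory

def fill(address, block, word_bits=2):
    cache[address >> word_bits] = block

fill(0x1000, [10, 11, 12, 13])
print(lookup(0x1001))   # 11 (hit, word 1 of the cached block)
print(lookup(0x2000))   # None (miss)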
Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
—e.g. Block B can be in any line of set i
• e.g. 2 lines per set
—2 way associative mapping
—A given block can be in one of 2 lines in only one set
• Address Structure
– Use set field to determine cache set to look in
– Compare tag field to see if we have a hit
Two Way Set Associative Cache Organization
Set Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = kv = k * 2^d
• Size of tag = (s – d) bits
Example: Two way set associative Mapping
• Cache memory = 64 kByte, Block size = 4 bytes, Main memory = 16 MBytes
• Each set = 2 lines (two-way)
• Main memory = 16 MBytes = 2^4 * 2^10 * 2^10 = 2^24 words
  – Each word is directly addressable by a 24-bit address
• Block = 4 bytes = 2^2 words
• Set size = 2 * 4 bytes = 8 = 2^3 words
• Cache = (64k/4) lines = 16k lines of 4 bytes each = 2^4 * 2^10 = 2^14 lines
• Cache = (64k/8) sets = 8k sets of 2 lines each = 2^3 * 2^10 = 2^13 sets
• Main memory = (16M/4) blocks = 4M blocks of 4 bytes each = 2^2 * 2^10 * 2^10 = 2^22 blocks
• Summary:
  – Addressable units: 2^24
  – Address length: 24 bits
  – Block size: 2^2 words
  – Blocks in main memory: 2^22
  – Lines in set: 2
  – Number of sets: 2^13
  – Lines in cache: 2^14
  – s = 22, w = 2, d = 13
  – Size of tag (s - d): 9 bits
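The numbers above can be reproduced with a short sketch (sizes taken from the example):

# Check the two-way set-associative parameters of the example above
cache_bytes = 64 * 1024
block_bytes = 4
lines_per_set = 2

lines = cache_bytes // block_bytes        # 2^14 lines
sets  = lines // lines_per_set            # 2^13 sets
w = (block_bytes - 1).bit_length()        # 2 (word field)
d = (sets - 1).bit_length()               # 13 (set field)
s = 24 - w                                # 22 (block identifier, 24-bit address)
tag_bits = s - d                          # 9

print(lines, sets, w, d, tag_bits)        # 16384 8192 2 13 9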
Example: Set Associative Mapping
• A set associative cache consists of 64 lines divided into four-line sets. Main
memory contains 4K blocks of 128 words each. Show the format of memory
addresses.
• Each block contains 128 words.
  – Block = 128 words = 2^7 words
  Therefore, 7 bits are needed to identify the word within the block.
• Cache = 64 lines / 4
  – Cache is divided into 16 sets of 4 lines each (2^4 sets).
  – Therefore, 4 bits are needed to identify the set number.
• Main memory = 4K blocks of 128 words each. (2^2 * 2^10 = 2^12 blocks)
  – Therefore, 12 bits are needed to specify the block (set + tag = 12 bits).
  – Tag length is 12 - 4 = 8 bits.
• Main memory = 4K blocks of 128 words each. (2^2 * 2^10 * 2^7 = 2^19 words)
  – Each word is directly addressable by a 19-bit address.
Replacement Algorithms
• Direct mapping
– No choice
– Each block only maps to one line
– Replace that line
• Associative & Set Associative
– Least recently used (LRU)
—replace the block in the set that has gone longest without being referenced
– First in first out (FIFO)
—replace the block that has been in the cache longest
– Least frequently used (LFU)
—replace the block that has had the fewest references
– Random
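A minimal sketch of LRU replacement within one set, using an OrderedDict to track recency; the tags and associativity are illustrative.

from collections import OrderedDict

# LRU policy for a single cache set of k lines; keys are block tags
class LRUSet:
    def __init__(self, k):
        self.k = k
        self.lines = OrderedDict()      # least recently referenced tag first

    def access(self, tag):
        if tag in self.lines:           # hit: mark tag as most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.k:   # miss with full set: evict the LRU victim
            self.lines.popitem(last=False)
        self.lines[tag] = None          # bring the new block in
        return False

s = LRUSet(2)
print(s.access(1), s.access(2), s.access(1), s.access(3))  # False False True False
print(list(s.lines))   # [1, 3] -- tag 2 was least recently used, so it was evicted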
Write Policy
• Must not overwrite a cache block unless main memory is up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly
• Write through
– All writes go to main memory as well as cache
– Multiple CPUs can monitor main memory traffic to keep local
(to CPU) cache up to date
– Lots of traffic
– Slows down writes
Write Policy
• Write back
– Updates initially made in cache only.
– Update bit for cache slot is set when update occurs
– If block is to be replaced, write to main memory only if
update bit is set
– Other caches get out of sync.
– I/O must access main memory through cache
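A minimal sketch of the write-back policy with an update (dirty) bit; the structures and values are illustrative, not a model of any particular cache.

# Write-back: writes set the dirty bit; memory is updated only on eviction
class Line:
    def __init__(self):
        self.tag = None
        self.data = None
        self.dirty = False

def write(line, tag, data):
    line.tag, line.data = tag, data
    line.dirty = True                     # main memory is now stale

def evict(line, main_memory):
    if line.dirty:                        # write back only if the block was modified
        main_memory[line.tag] = line.data
    line.tag, line.data, line.dirty = None, None, False

memory = {}
l = Line()
write(l, 0x40, [1, 2, 3, 4])
evict(l, memory)
print(memory)   # {64: [1, 2, 3, 4]}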
Multilevel Caches
• High logic density enables caches on chip
– Faster than bus access
– Frees bus for other transfers
• Common to use both on and off chip cache
– L1 on chip, L2 off chip in static RAM
– L2 access much faster than DRAM or ROM
– L2 often uses separate data path
– L2 may now be on chip
– Resulting in L3 cache
– L3 access via bus, or now also on chip
Unified v Split Caches
• One cache for data and instructions or two, one for data and
one for instructions
• Advantages of unified cache
– Higher hit rate
– Balances load of instruction and data fetch
– Only one cache to design & implement
• Advantages of split cache
– Eliminates cache contention between instruction
fetch/decode unit and execution unit
Line Size
• Retrieve not only desired word but a number of adjacent words as
well
• Increased block size will increase hit ratio at first
• Hit ratio will decrease as the block becomes even bigger
– Probability of using newly fetched information becomes less
than probability of reusing replaced
• Larger blocks
– Reduce number of blocks that fit in cache
– Data overwritten shortly after being fetched
– Each additional word is less local so less likely to be needed
• No definitive optimum value has been found
• 8 to 64 bytes seems reasonable
Principle of Locality
A program tends to access a relatively small region of
memory irrespective of its actual memory footprint in
any given interval of time. While the region of activity
may change over time, such changes are gradual.
Principle of Locality
Spatial Locality: Tendency for locations close to a location
that has been accessed to also be accessed
Temporal Locality: Tendency for a location that has been
accessed to be accessed again
Basic terminologies
Hit: CPU finding contents of memory address in cache
Hit rate (h) is probability of successful lookup in cache by CPU.
Miss: CPU failing to find what it wants in cache (incurs a trip to deeper levels of the
memory hierarchy)
Miss rate (m) is probability of missing in cache and is equal to 1-h.
Miss penalty: Time penalty associated with servicing a miss at any particular
level of memory hierarchy
Effective Memory Access Time (EMAT): Effective access time experienced by
the CPU when accessing memory.
Time to lookup cache to see if memory location is already there
Upon cache miss, time to go to deeper levels of memory hierarchy
EMAT = Tc + m * Tm
where m is cache miss rate, Tc the cache access time and Tm the miss
penalty
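A small worked example of the EMAT formula; the timings and miss rate are assumed for illustration.

# EMAT = Tc + m * Tm, with illustrative (assumed) values
Tc = 2      # cache access time, ns (assumed)
Tm = 100    # miss penalty, ns (assumed)
m  = 0.05   # miss rate, i.e. 1 - hit rate of 0.95 (assumed)

emat = Tc + m * Tm
print(emat)   # 2 + 0.05 * 100 = 7.0 ns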
Performance analysis
Look through: The cache is checked first for a
hit, and if a miss occurs then the access to main
memory is started.
Look aside: the access to main memory is started in parallel
with the cache lookup.
Hit ratio: ratio of the number of hits to the total number of references
= number of hits / (number of hits + number of misses)
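A small worked example of the hit-ratio formula (the reference counts are assumed):

# Hit ratio = hits / (hits + misses)
hits, misses = 950, 50
hit_ratio = hits / (hits + misses)
print(hit_ratio)   # 0.95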