Memory Hierarchy - Introduction: Cost Performance of Memory Reference
Programs demand large amounts of fast memory. An economical solution to that demand is a memory hierarchy, which takes advantage of the principle of locality and of the cost-performance of memory references.
The principle of locality says that most programs do not access all code or data uniformly.
A memory hierarchy is organized into several levels, each smaller, faster, and more expensive per byte than the next lower level.
The goal is to provide a memory system whose cost per byte is almost as low as the cheapest level of memory and whose speed is almost as high as the fastest level.
Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy.
Terminologies
Cache: the name given to the highest or first level of the memory hierarchy encountered once the address leaves the processor. A cache is a temporary storage area where frequently accessed data can be stored for rapid access.
Cache hit: when the processor finds the requested data item in the cache, it is called a cache hit; when it does not, it is a cache miss.
1) Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that would occur even with an infinite cache.
2) Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because of blocks being discarded and later retrieved.
3) Conflict: if the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved if conflicting blocks map to its set.
The time required for a cache miss depends on both the latency and bandwidth of the memory. Latency determines the time to retrieve the first word of the block, and bandwidth determines the time to retrieve the rest of the block.
Cache misses are handled by hardware and cause processors using in-order execution to stall until the data are available. With out-of-order execution, an instruction using the result must wait, but other instructions may proceed during the miss.
Conti..
Block: a fixed-size collection of data containing the requested word, retrieved from main memory and placed into the cache.
Pages: the address space is usually broken into fixed-size blocks called pages. At any time, each page resides either in memory or on disk.
Page fault: when the processor references an item within a page that is not present in memory, a page fault occurs and the entire page is moved from disk to memory. Since page faults take so long, they are handled in software and the processor is not stalled; the processor usually switches to some other task while the page is fetched.
Cache performance
A memory hierarchy can substantially improve performance. One measure of its benefit is the number of cycles during which the processor is stalled waiting for a memory access, called memory stall cycles, used as given below:
CPU execution time = (CPU clock cycles + Memory stall clock cycles) * Clock cycle time
This equation assumes that the CPU clock cycles include the time to handle a cache hit and that the processor is stalled during a cache miss.
Contin.
The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:
Memory stall cycles = Number of misses * Miss penalty
                    = IC * (Misses / Instruction) * Miss penalty
                    = IC * (Memory accesses / Instruction) * Miss rate * Miss penalty
The component miss rate is simply the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the number of accesses).
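As a quick illustration, here is a small Python sketch of these formulas; the instruction count, miss rate, and penalty below are made-up numbers, not values from the slides.

```python
# Hypothetical figures chosen only to exercise the formulas above.
ic = 1_000_000               # instruction count (IC)
cpi_base = 1.0               # CPU clock cycles per instruction, ignoring memory stalls
mem_accesses_per_instr = 1.5
miss_rate = 0.02             # fraction of cache accesses that miss
miss_penalty = 100           # clock cycles per miss
clock_cycle_time = 0.5e-9    # seconds

# Memory stall cycles = IC * (Memory accesses / Instruction) * Miss rate * Miss penalty
memory_stall_cycles = ic * mem_accesses_per_instr * miss_rate * miss_penalty

# CPU execution time = (CPU clock cycles + Memory stall clock cycles) * Clock cycle time
cpu_clock_cycles = ic * cpi_base
cpu_time = (cpu_clock_cycles + memory_stall_cycles) * clock_cycle_time

print(f"Memory stall cycles: {memory_stall_cycles:,.0f}")
print(f"CPU execution time: {cpu_time * 1e3:.3f} ms")
```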
This formula is an approximation, since miss rates and miss penalties are often different for reads and writes. Memory stall cycles can also be defined in terms of the number of memory accesses per instruction, the miss penalty (in clock cycles) for reads and writes, and the miss rate for reads and writes:
Memory stall cycles = IC * Reads per instruction * Read miss rate * Read miss penalty
                    + IC * Writes per instruction * Write miss rate * Write miss penalty
Miss rate is often reported as misses per instruction rather than misses per memory reference. The two are related:
Misses / Instruction = (Miss rate * Memory accesses) / Instruction count = Miss rate * (Memory accesses / Instruction)
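A similar sketch, again with made-up per-instruction statistics, splits the stall cycles into read and write components and derives misses per instruction from the combined miss rate.

```python
# Hypothetical per-instruction statistics, only to illustrate the read/write split.
ic = 1_000_000
reads_per_instr, writes_per_instr = 1.0, 0.5
read_miss_rate, write_miss_rate = 0.02, 0.05
read_miss_penalty = write_miss_penalty = 100

memory_stall_cycles = (ic * reads_per_instr * read_miss_rate * read_miss_penalty
                       + ic * writes_per_instr * write_miss_rate * write_miss_penalty)

# Misses per instruction = Miss rate * (Memory accesses / Instruction)
mem_accesses_per_instr = reads_per_instr + writes_per_instr
combined_miss_rate = ((reads_per_instr * read_miss_rate
                       + writes_per_instr * write_miss_rate) / mem_accesses_per_instr)
misses_per_instr = combined_miss_rate * mem_accesses_per_instr

print(f"Memory stall cycles: {memory_stall_cycles:,.0f}")
print(f"Misses per instruction: {misses_per_instr:.3f}")
```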
These four questions show how different trade-offs of memories help us at different levels of the hierarchy:
1. Where can a block be placed in the upper level? (block placement)
2. How is a block found if it is in the upper level? (block identification)
3. Which block should be replaced on a miss? (block replacement)
4. What happens on a write? (write strategy)
1. If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block address) MOD (number of blocks in cache).
2. If a block can be placed anywhere in the cache, the cache is said to be fully associative.
3. If a block can be placed in a restricted set of places in the cache, the cache is set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then it can be placed anywhere within that set. The set is usually chosen by bit selection; that is, (block address) MOD (number of sets in cache).
Conti..
If there are n blocks in a set, the cache placement is called n-way set associative. Direct mapped is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative. Equivalently, direct mapped can be thought of as having m sets and fully associative as having one set. The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative.
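The bit-selection mapping can be sketched in a few lines of Python; the cache size and block addresses below are arbitrary examples, not values from the slides.

```python
def cache_set_index(block_addr: int, num_blocks: int, ways: int) -> int:
    """Return the set a block maps to, using the bit-selection rule
    (block address) MOD (number of sets in cache)."""
    num_sets = num_blocks // ways     # direct mapped: ways = 1, so sets = blocks
    return block_addr % num_sets      # fully associative: ways = num_blocks, one set

# An 8-block cache: direct mapped vs. 2-way set associative vs. fully associative.
for ways in (1, 2, 8):
    print(f"{ways}-way:", [cache_set_index(b, 8, ways) for b in (0, 8, 12, 13)])
```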
Caches have an address tag on each block frame that gives the block address. The tag of every cache block is checked to see whether it matches the block address from the processor. All possible tags are searched in parallel because speed is critical. A valid bit is added to the tag to say whether or not this entry contains a valid address; if this bit is not set, there cannot be a match on this address.
An address is divided into the block frame address and the block offset. The block frame address is further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against the tags in the selected set for a hit.
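A minimal sketch of this address split, assuming power-of-two block size and set count; the sizes and the address are arbitrary examples.

```python
def split_address(addr: int, block_size: int, num_sets: int):
    """Split an address into (tag, index, block offset) by bit selection.
    block_size and num_sets are assumed to be powers of two."""
    offset_bits = block_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 64-byte blocks, 128 sets: the index selects the set, the offset selects the data
# within the block, and the tag is compared against the tags in that set.
print(split_address(0x12345678, 64, 128))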
Memory Hierarchy - Review
By Chandru, 1RV08SCS05
The hierarchical arrangement of storage in current computer architectures is called the memory hierarchy. It is designed to take advantage of memory locality in computer programs.
Most modern CPUs are so fast that, for most program workloads, the locality of reference of memory accesses and the efficiency of the caching and memory transfer between different levels of the hierarchy are the practical limitations on processing speed.
The L1 cache holds cache lines retrieved from the L2 cache; the L2 cache holds cache lines retrieved from main memory.
[Figure: memory-hierarchy pyramid continuing to lower levels L3 and L4.]
Two Different Types of Locality:
Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
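For instance, a simple loop over an array exhibits both kinds of locality:

```python
# Illustrative only: summing a list touches consecutive elements (spatial locality)
# and reuses the accumulator on every iteration (temporal locality).
data = list(range(1024))

total = 0
for x in data:      # spatial: elements are accessed in address order
    total += x      # temporal: 'total' is referenced again and again
print(total)
```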
[Figure: the larger, slower, cheaper storage device at level k+1 is partitioned into fixed-size blocks, numbered 0-15.]
Cold (compulsory) miss: cold misses occur because the cache is empty.
Conflict miss: if the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved when conflicting blocks map to the same set. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur.
Replacement policies:
Random: candidate blocks are randomly selected; some systems generate pseudorandom block numbers.
Least Recently Used (LRU): relies on a corollary of locality; if recently used blocks are likely to be used again, then the LRU block is a good candidate for eviction.
First In, First Out (FIFO): used in highly associative caches.
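A minimal sketch of LRU replacement for a single set, assuming a simple tag-only model (no data, no valid bits); the access sequence is an arbitrary example.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement (illustrative sketch only)."""
    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()           # block tag -> True, ordered by recency

    def access(self, tag):
        if tag in self.blocks:                # hit: move block to most-recently-used end
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:     # miss with a full set: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in (0, 8, 0, 8, 12, 0)])
```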
On a miss, the level k cache must fetch the requested block from level k+1, e.g., block 12. If the level k cache is full, then some current block must be replaced (evicted). Which one is the victim?
[Figure: level k holding a few blocks (e.g., 4*, 12, 14) and level k+1 holding blocks 0-15, with requests for blocks 14 and 12 being serviced by fetching from level k+1.]
Placement policy: where can the new block go? E.g., b mod 4.
Replacement policy: which block should be evicted? E.g., LRU.
Write-through: write both the cache and memory; generally higher traffic, but it simplifies cache coherence.
Write-back: write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic.
Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100].
What are the number of hits and misses when using no-write allocate versus write allocate?
Answer: For no-write allocate, the address 100 is not in the cache, and
there is no allocation on write, so the first two writes will result in misses. Address 200 is also not in the cache, so the read is also a miss. The subsequent write to address 200 is a hit. The last write to 100 is still a miss. The result for no-write allocate is four misses and one hit. For write allocate, the first accesses to 100 and 200 are misses, and the rest are hits since 100 and 200 are both found in the cache. Thus, the result for write allocate is two misses and three hits.
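The example can be checked with a small simulation sketch; the cache is modeled as a simple set of allocated addresses, and write-back traffic is ignored since it does not affect the hit/miss counts.

```python
def simulate(ops, write_allocate: bool):
    """Count hits and misses for a fully associative cache that starts empty.
    ops is a list of ("read" | "write", address) pairs; illustrative sketch only."""
    cache, hits, misses = set(), 0, 0
    for op, addr in ops:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            # Reads always allocate; writes allocate only under write allocate.
            if op == "read" or write_allocate:
                cache.add(addr)
    return hits, misses

ops = [("write", 100), ("write", 100), ("read", 200), ("write", 200), ("write", 100)]
print("no-write allocate:", simulate(ops, write_allocate=False))  # (1 hit, 4 misses)
print("write allocate:   ", simulate(ops, write_allocate=True))   # (3 hits, 2 misses)
```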
Average memory access time = Hit time + Miss rate * Miss penalty
Three Categories of Cache Optimizations:
Reducing the miss rate: larger block size, larger cache size, and higher associativity.
Reducing the miss penalty: multilevel caches and giving reads priority over writes.
Reducing the time to hit in the cache: avoiding address translation when indexing the cache.
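The average memory access time formula in code form, with hypothetical numbers (1-cycle hit time, 2% miss rate, 100-cycle miss penalty):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = Hit time + Miss rate * Miss penalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.02, 100))   # 3.0 cycles
```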
Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. Compulsory misses are those that would occur even in an infinite cache.
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks discarded and later retrieved. Capacity misses are those that occur in a fully associative cache, beyond the compulsory misses.
Conflict: if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses. Conflict misses are those that occur in going from fully associative to eight-way associative, four-way associative, and so on.
Four Divisions of Conflict Misses
Eight-way: conflict misses due to going from fully associative to eight-way associative.
Four-way: conflict misses due to going from eight-way associative to four-way associative.
Two-way: conflict misses due to going from four-way associative to two-way associative.
One-way: conflict misses due to going from two-way associative to one-way associative (direct mapped).
First Optimization: Larger Block Size to Reduce Miss Rate
The simplest way to reduce the miss rate is to increase the block size; a larger block size also reduces compulsory misses. This reduction occurs because the principle of locality has two components, temporal locality and spatial locality, and larger blocks take advantage of spatial locality. But larger blocks also increase the miss penalty, and since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small.
Cont
The selection of block size depends on both the latency and
bandwidth of the lower-level memory. High latency and high bandwidth encourage large block size since the cache gets many more bytes per miss for a small increase in miss penalty. Low latency and low bandwidth encourage smaller block size since there is little time saved from a larger block.
Second Optimization: Larger Caches to Reduce Miss Rate
The obvious way to reduce capacity misses is to increase the capacity of the cache. The drawback is potentially longer hit time and higher cost and power. This technique has been especially popular in off-chip caches.
Third Optimization: Higher Associativity to Reduce Miss Rate
Two general rules of thumb apply. The first is that eight-way set associative is, for practical purposes, as effective in reducing misses for these sized caches as fully associative. The second rule, called the 2:1 cache rule of thumb, is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2. The drawback is that higher associativity can increase hit time and thus average memory access time.
Fourth Optimization: Multilevel Caches to Reduce Miss Penalty
To reduce the miss penalty, the designer can add another level of cache between the original cache and memory. The first-level cache can be small enough to match the clock cycle time of the fast processor, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory, thereby lessening the effective miss penalty.
Average memory access time for a two-level cache:
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
So:
Average memory access time = Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2))
Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache. For the first-level cache it is equal to Miss rate(L1), and for the second-level cache it is Miss rate(L2).
Global miss rate: the number of misses in a cache divided by the total number of memory accesses generated by the processor. The global miss rate of the first-level cache is still just Miss rate(L1), but for the second-level cache it is Miss rate(L1) * Miss rate(L2).
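A small sketch combining the two-level AMAT formula with the local/global miss-rate distinction; all numbers are hypothetical.

```python
def two_level_amat(hit_l1, miss_l1, hit_l2, miss_l2_local, miss_penalty_l2):
    """AMAT = Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2)).
    miss_l2_local is the local L2 miss rate: L2 misses / accesses that reach L2."""
    return hit_l1 + miss_l1 * (hit_l2 + miss_l2_local * miss_penalty_l2)

# Hypothetical figures, only to exercise the formulas above.
hit_l1, miss_l1 = 1, 0.04
hit_l2, miss_l2_local, miss_penalty_l2 = 10, 0.5, 200

print("AMAT:", two_level_amat(hit_l1, miss_l1, hit_l2, miss_l2_local, miss_penalty_l2))
print("Global L2 miss rate:", miss_l1 * miss_l2_local)   # Miss rate(L1) * Miss rate(L2)
```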
Fifth Optimization: Giving Priority to Read Misses over Writes to Reduce Miss Penalty
With a write-through cache, the most important improvement is a write buffer of the proper size. Write buffers, however, do complicate memory accesses because they might hold the updated value of a location needed on a read miss. The simplest way out of this is for the read miss to wait until the write buffer is empty. The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue.
Sixth Optimization: Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time
We could use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed virtual caches, with physical cache used to identify the traditional caches that use physical addresses. Two tasks are important here: indexing the cache and comparing addresses. Full virtual addressing for both the index and the tags eliminates address translation time from a cache hit.
Some reasons for not building virtually addressed caches:
Protection: page-level protection is checked as part of the virtual-to-physical address translation, and it must be enforced. One solution is to copy the protection information from the TLB on a miss, add a field to hold it, and check it on every access to the virtually addressed cache.
Another reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to increase the width of the cache address tag with a process-identifier tag (PID). If the operating system assigns these tags to processes, it only needs to flush the cache when a PID is recycled; that is, the PID distinguishes whether or not the data in the cache are for this program.
A third reason is that operating systems and user programs may use two different virtual addresses for the same physical address. These duplicate addresses, called synonyms or aliases, could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this would not happen, since the accesses would first be translated to the same physical cache block. Hardware solutions to the synonym problem, called antialiasing, guarantee every cache block a unique physical address. Software can make the problem much easier by forcing aliases to share some address bits; this restriction is called page coloring.
The final concern is I/O, which typically uses physical addresses and thus would require mapping to virtual addresses to interact with a virtual cache. One alternative is to use part of the page offset (the part that is identical in both virtual and physical addresses) to index the cache. At the same time as the cache is being read using that index, the virtual part of the address is translated, and the tag match uses physical addresses. This alternative allows the cache read to begin immediately, and yet the tag comparison is still with physical addresses. The limitation of this virtually indexed, physically tagged alternative is that a direct-mapped cache can be no bigger than the page size.
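This size limit can be expressed as a small check; the page and cache sizes below are hypothetical, and the rule that higher associativity relaxes the limit (cache size per way must not exceed the page size) is a standard corollary, not stated on the slide.

```python
def vipt_fits(cache_size: int, ways: int, page_size: int) -> bool:
    """A virtually indexed, physically tagged cache can be indexed before translation
    only if its index and block-offset bits lie entirely within the page offset,
    i.e., cache_size / ways <= page_size."""
    return cache_size // ways <= page_size

# With 4 KB pages: a direct-mapped 4 KB cache qualifies, a direct-mapped 8 KB cache
# does not, but an 8 KB two-way set-associative cache does.
print(vipt_fits(4096, 1, 4096))   # True
print(vipt_fits(8192, 1, 4096))   # False
print(vipt_fits(8192, 2, 4096))   # True
```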