Advanced Architecture Memory
[Figure: Memory hierarchy, from top to bottom: Cache (SRAM), Main Memory (DRAM), Magnetic Disks, Magnetic Tapes. Moving up the hierarchy, speed and cost per bit increase; moving down, capacity increases.]
The CPU is faster than memory access. A hierarchical memory system can be used to bridge the speed gap. The higher levels are expensive, but they are fast. As we move down the hierarchy, the cost generally decreases, whereas the access time increases.
• Cache is very high-speed memory, used to increase the speed of processing by making the current program and data available to the CPU at a rapid rate. It is employed in computer systems to compensate for the speed difference between main memory and the processor. Cache memory consists of static RAM cells.
• Main memory or primary memory stores the programs and data that are currently needed
by the processor. All other information is stored in secondary memory and transferred to
main memory when needed.
• Secondary memory provides backup storage. The most common secondary memories used in computer systems are magnetic disks and magnetic tapes. They are used for storing system programs, large data files and other backup information.
Locality of reference:
The references to memory at any given interval of time tend to be confined within a few
localized areas in memory. This phenomenon is known as the property of locality of reference.
i) Temporal locality: Recently referenced instructions are likely to be referenced again in the near future. This is called temporal locality. In the case of iterative loops and subroutines, a small code segment will be referenced repeatedly.
ii) Spatial locality: This refers to the tendency of a program to access instructions whose addresses are near one another. For example, in the case of arrays, memory accesses are generally to consecutive addresses.
Cache performance:
One method to evaluate cache performance is to expand the CPU execution time equation. The CPU execution time is then the product of the clock cycle time and the sum of the CPU clock cycles and the memory stall cycles.
CPU execution time = (CPU clock cycles + Memory stall cycles) × clock cycle time
Memory stall cycles are the number of cycles during which the CPU is stalled waiting for a memory access. The number of memory stall cycles depends on both the number of misses and the miss penalty.
In general, the miss penalty is the time needed to bring the desired information from a slower unit in the memory hierarchy to a faster unit. Here we consider the miss penalty to be the time needed to bring a block of data from main memory to the cache.
Memory stall cycles = IC × (Misses / Instruction) × Miss penalty
= IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
Here IC means Instruction Count, that is, the total number of instructions executed.
Example:
Assume we have a computer where the clock cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these are 50% of the total instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
CPU execution time that always hits = (CPU clock cycles + Memory stall cycles) × clock cycle time
Here no miss occurs, so the memory stall cycles are zero. CPU clock cycles is the product of IC (Instruction Count) and the number of clock cycles needed per instruction (CPI).
Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
Memory accesses per instruction include one instruction access from memory (for instruction fetch, the memory access is 100%) plus the data accesses per instruction (for operand fetch, the memory accesses are 50%).
Memory stall cycles = IC × (1 + 0.5) × 0.02 × 25
= IC × 0.75
CPU execution time with cache = (IC × 1.0 + IC × 0.75) × clock cycle time
= IC × 1.75 × clock cycle time
Performance ratio = CPU execution time with cache / CPU execution time that always hits
= (IC × 1.75 × clock cycle time) / (IC × 1.0 × clock cycle time)
= 1.75
So the computer with all cache hits is 1.75 times faster.
Example:
A processor has instruction and data caches with miss rates of 2% and 4%, respectively. The frequency of loads and stores is 36% on average and the CPI is 2. The miss penalty can be taken to be 40 cycles for all misses. How much time is needed for CPU execution if 1000 instructions are present in the program? Consider the clock cycle time to be 2 ns.
CPU execution time = (CPU clock cycles + Memory stall cycles) × clock cycle time
Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
Instruction cache stalls = 1000 × 1 × 0.02 × 40 = 800 cycles
Data cache stalls = 1000 × 0.36 × 0.04 × 40 = 576 cycles
Memory stall cycles = 800 + 576 = 1376 cycles
CPU execution time = (1000 × 2 + 1376) × 2 ns = 3376 × 2 ns = 6752 ns
Average memory access time = Hit time + Miss rate × Miss penalty
So, the average memory access time depends on three factors: hit time, miss rate and miss penalty. The average memory access time is reduced if any of these three factors is reduced. First we describe the miss penalty reduction techniques.
Miss penalty reduction technique:
There are five techniques to reduce the miss penalty.
1) Multilevel caches
2) Victim caches
1) Multilevel caches:
[Figure: Processor → L1 Cache → L2 Cache → Main Memory]
The first-level cache (L1 cache) is smaller in size compared to the second-level cache (L2 cache). The L1 cache is an on-chip cache, whose access time is close to the clock speed of the CPU. The L2 cache is an off-chip cache, large enough to capture many accesses that would otherwise go to main memory, so it reduces the miss penalty. The speed of the L1 cache affects the clock rate of the CPU, while the speed of the L2 cache only affects the miss penalty of the L1 cache.
Average memory access time for a two level cache is defined by the following formula.
Average memory access time = Hit timeL1 + Miss rateL1 × Miss penaltyL1
where Miss penaltyL1 = Hit timeL2 + Miss rateL2 × Miss penaltyL2, so
Average memory access time = Hit timeL1 + Miss rateL1 × (Hit timeL2 + Miss rateL2 × Miss penaltyL2)
Local miss rate: the number of misses in a cache divided by the total number of memory accesses to that cache (for the L2 cache, this is Miss rateL2 above).
Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the CPU (for the L2 cache, Miss rateL1 × Miss rateL2).
Average memory stall cycles per instruction = Misses per instructionL1 × Hit timeL2 + Misses per instructionL2 × Miss penaltyL2
• Multilevel inclusion:
By the multilevel inclusion property, data present in the L1 cache are always present in the L2 cache. Inclusion is desirable because consistency between I/O and caches can be determined just by checking the L2 cache.
• Multilevel exclusion:
By the multilevel exclusion property, data present in the L1 cache are never found in the L2 cache. A cache miss in L1 results in a swap of blocks between L1 and L2 instead of a replacement of an L1 block with an L2 block.
The advantage of multilevel exclusion is that this policy prevents wasting space in the L2 cache.
Example:
In 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates?
Assume the miss penalty from the L2 cache to memory is 100 clock cycles, the hit time of the L2
cache is 10 clock cycles, the hit time of L1 is 1 clock-cycle and there are 1.6 memory references
per instruction. What is the average memory access time and average stall cycles per instruction?
Local miss rate of L1 cache = (40 / 1000) × 100 = 4%
Local miss rate of L2 cache = (20 / 40) × 100 = 50%
Global miss rate of L2 cache = (20 / 1000) × 100 = 2%
Average memory access time = Hit timeL1 + Miss rateL1 × (Hit timeL2 + Miss rateL2 × Miss penaltyL2)
= 1 + 0.04 × (10 + 0.5 × 100) clock cycles
= 1 + 0.04 × 60 = 3.4 clock cycles
Average stall cycles per instruction = Misses per instructionL1 × Hit timeL2 + Misses per instructionL2 × Miss penaltyL2
With 1.6 memory references per instruction, 1000 references correspond to x instructions, where 1.6 × x = 1000, so x = 1000 / 1.6 = 625 instructions.
Misses per instructionL1 = 40 / 625 = 0.064 and Misses per instructionL2 = 20 / 625 = 0.032.
Average stall cycles per instruction = 0.064 × 10 + 0.032 × 100 = 0.64 + 3.2 = 3.84 clock cycles
2) Victim caches:
[Figure: Processor → Cache → Main Memory, with a small Victim Cache beside the cache. Placement of the victim cache in the memory hierarchy.]
A victim cache is a small, fully associative buffer that holds blocks recently evicted from the cache; on a cache miss, it is checked before going to main memory, and a hit there avoids the full miss penalty.
Miss rate reduction technique:
Cache misses can be classified into three categories:
Compulsory miss: The very first access to a block cannot be in the cache, so the block must be brought into the cache.
Capacity miss: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur.
Conflict miss: If the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses.
1) Larger block size:
• Larger block sizes will reduce compulsory misses.
• A larger block size takes advantage of spatial locality.
• At the same time, larger blocks increase the miss penalty.
• Since it reduces the number of blocks in the cache, a larger block may increase conflict misses and even capacity misses if the cache is small.
• Choose the optimum block size such that the miss rate is reduced and the other factors are not increased.
2) Larger caches:
3) Compiler optimization:
The previous miss rate reduction techniques require changes to the hardware: larger blocks, larger caches, higher associativity, or pseudoassociativity. This technique reduces the miss rate using a software approach.
To give a feeling for this type of optimization, two examples are shown here.
Loop interchange
for (j=0; j<100; j++)
{
for (i=0; i<500; i++)
{
X[i][j]=2*X[i][j];
}
}
The above program has nested loops that access data in memory in non-sequential order (C stores two-dimensional arrays in row-major order, but the inner loop strides through rows). Simply exchanging the nesting of the loops can make the code access the data in the order they are stored:
for (i=0; i<500; i++)
{
for (j=0; j<100; j++)
{
X[i][j]=2*X[i][j];
}
}
Here the memory access is sequential and this technique reduces misses by improving spatial
locality.
Blocking
This optimization reduces misses via improved temporal locality.
In the case of matrix multiplication, one matrix is accessed in row-major order and the other matrix is accessed in column-major order. So, for the matrix accessed in column-major order, the memory access is not sequential.
The solution is that, instead of operating on entire rows or columns of a matrix, blocked algorithms operate on submatrices or blocks.
Memory latency is the time gap between two consecutive word accesses. Since main memory uses DRAM cells, an extra precharge time is needed to access a word.
[Figure: Main memory organizations. (a) One-word-wide memory: CPU → Cache → Memory. (b) Wide memory: CPU → MUX → Cache → wide Memory.]
• Wider main memory for higher bandwidth
First-level caches are often organized with a physical width of 1 word because most CPU accesses are that size. Doubling or quadrupling the width of the cache and the memory will therefore double or quadruple the memory bandwidth. A wider memory usually pairs a narrow L1 cache with a wide L2 cache.
There is a cost in the wider connection between the CPU and memory, typically called a memory bus. The CPU will still access the cache one word at a time, so there now needs to be a multiplexer between the cache and the CPU. A second-level cache can help, since the multiplexing can be between the first- and second-level caches.
• Simple interleaved memory for higher bandwidth
Memory chips can be organized in banks to read or write multiple words at a time rather than a single word. In general, the purpose of interleaved memory is to take advantage of the potential memory bandwidth of all the chips in the system. Most memory systems activate only the chips containing the needed words, so less power is required in the memory system.
[Figure: CPU → Cache → four 1-word-wide memory banks.
Memory bank 0 holds words 0, 4, 8, 12; bank 1 holds words 1, 5, 9, 13; bank 2 holds words 2, 6, 10, 14; bank 3 holds words 3, 7, 11, 15.]
The banks are often 1 word wide so that the width of the bus and the cache need not change, but sending an address to several banks permits them all to read simultaneously. In the above example, the addresses of the four banks are interleaved at the word level: bank 0 has all words whose address modulo 4 is 0, bank 1 has all words whose address modulo 4 is 1, and so on.
EX:
Assume it takes 4 clock cycles to send the address, 56 clock cycles for the access time per word, and 4 clock cycles to send a word of data. Given a cache block of 4 words and that a word is 8 bytes, calculate the miss penalty and the effective memory bandwidth.
Recompute the miss penalty and the memory bandwidth assuming we have
• Main memory width of 2 words
• Main memory width of 4 words
• Interleaved main memory with 4 banks, with each bank 1 word wide.
Ans:
With a 1-word-wide main memory, each of the 4 words needs a full address, access and transfer cycle:
Miss penalty = 4 × (4 + 56 + 4) = 256 clock cycles
Memory bandwidth = (4 × 8) / 256 = 1/8 byte per clock cycle
In case of main memory width of 2 words,
Miss penalty = 2 × (4 + 56 + 4) = 128 clock cycles
Memory bandwidth = (4 × 8) / 128 = 1/4 byte per clock cycle
In case of main memory width of 4 words,
Miss penalty = 4 + 56 + 4 = 64 clock cycles
Memory bandwidth = (4 × 8) / 64 = 1/2 byte per clock cycle
In case of interleaved main memory with 4 banks, the address is sent once, the banks access their words in parallel, and the 4 words are transferred one after another:
Miss penalty = 4 + 56 + 4 × 4 = 76 clock cycles
Memory bandwidth = (4 × 8) / 76 ≈ 0.4 byte per clock cycle