Advanced Architecture Memory

Memory Hierarchy

[Figure: the memory hierarchy, from top to bottom: Cache (SRAM), Main Memory (DRAM), Magnetic Disks, Magnetic Tapes. Moving down the hierarchy, capacity increases while speed and cost per bit decrease.]

Memory hierarchy according to speed, size and cost

The CPU is faster than memory access, so a hierarchical memory system can be used to close the speed gap. The higher levels are expensive, but they are fast. As we move down the hierarchy, the cost per bit generally decreases, whereas the access time increases.

• Cache is very high-speed memory used to increase the speed of processing by making the current program and data available to the CPU at a rapid rate. It is employed in computer systems to compensate for the speed difference between main memory and the processor. Cache memory consists of static RAM cells.
• Main memory or primary memory stores the programs and data that are currently needed
by the processor. All other information is stored in secondary memory and transferred to
main memory when needed.
• Secondary memory provides backup storage. The most common secondary memories used in computer systems are magnetic disks and magnetic tapes. They are used for storing system programs, large data files and other backup information.
Locality of reference:
The references to memory at any given interval of time tend to be confined to a few localized areas of memory. This phenomenon is known as the property of locality of reference.

There are two types of locality of reference.

i) Temporal locality

ii) Spatial locality

i) Temporal locality: Recently referenced instructions are likely to be referenced again in the near future; this is called temporal locality. In iterative loops and subroutines, a small code segment is referenced repeatedly.

ii) Spatial locality: This refers to the tendency of a program to access instructions whose addresses are near one another. For example, in the case of arrays, memory accesses are generally to consecutive addresses.
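A short C fragment makes both kinds of locality concrete (an illustrative sketch, not part of the original notes):

#include <stdio.h>

int main(void) {
    int a[1000];

    /* Spatial locality: consecutive addresses a[0], a[1], ... are
       touched in order, so each cache block fetched is fully used. */
    for (int i = 0; i < 1000; i++)
        a[i] = i;

    /* Temporal locality: the loop instructions and the variable sum
       are referenced again and again in the near future. */
    int sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}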

Cache performance:
One method to evaluate cache performance is to expand the CPU execution time equation. CPU execution time is then the product of the clock cycle time and the sum of the CPU clock cycles and the memory stall cycles.

CPU execution time = (CPU clock cycles + Memory stall cycles) × clock cycle time

Memory stall cycles are the number of cycles during which the CPU is stalled waiting for a memory access. The number of memory stall cycles depends on both the number of misses and the miss penalty.

In general, the miss penalty is the time needed to bring the desired information from a slower
unit in the memory hierarchy to a faster unit.

Here we consider the miss penalty to be the time needed to bring a block of data from main memory to the cache.

Memory stall cycles = Number of misses × Miss penalty

= IC × (Misses / Instruction) × Miss penalty

= IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

Here IC means Instruction Count, the total number of instructions executed.
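The formula can be checked directly in code. The following is a minimal sketch (the function name and parameters are ours, not from the notes); it reproduces the example below with IC normalized to one instruction:

#include <stdio.h>

/* CPU clock cycles including stalls:
   IC x CPI + IC x (accesses/instruction) x miss rate x miss penalty */
double cpu_cycles(double ic, double cpi, double accesses_per_instr,
                  double miss_rate, double miss_penalty) {
    double stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty;
    return ic * cpi + stall_cycles;
}

int main(void) {
    /* Numbers from the example below: CPI = 1.0, 1.5 memory accesses
       per instruction, 2% miss rate, 25-cycle miss penalty. */
    double with_misses = cpu_cycles(1.0, 1.0, 1.5, 0.02, 25.0);
    double all_hits    = cpu_cycles(1.0, 1.0, 1.5, 0.00, 25.0);
    printf("speedup if all accesses hit: %.2f\n", with_misses / all_hits);
    return 0;  /* prints 1.75, matching the worked example */
}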

Example:

Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these are 50% of the total instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

CPU execution time when all accesses hit = (CPU clock cycles + Memory stall cycles) × clock cycle time

Here no miss occurs, so the memory stall cycles are zero. CPU clock cycles are the product of IC (Instruction Count) and the number of clock cycles needed per instruction (CPI).

= (IC × CPI + 0) × clock cycle time

= IC × 1.0 × clock cycle time

Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

Memory accesses per instruction consist of one instruction access from memory (instruction fetch, so 100% of instructions) plus the data accesses (operand fetch, 50% of instructions).

= IC × (1 + 0.5) × 0.02 × 25

= IC × 0.75

CPU execution time with both hits and misses

= (CPU clock cycles + Memory stall cycles) × clock cycle time

= (IC × 1.0 + IC × 0.75) × clock cycle time

= IC × 1.75 × clock cycle time

Performance ratio = (CPU execution time with cache misses) / (CPU execution time when all accesses hit)

= (IC × 1.75 × clock cycle time) / (IC × 1.0 × clock cycle time)

= 1.75

The computer with no cache misses is 1.75 times faster.

Example:

A processor has an instruction cache and a data cache with miss rates of 2% and 4%, respectively. The frequency of loads and stores is 36% on average and the CPI is 2. The miss penalty can be taken to be 40 cycles for all misses. How much time is needed for CPU execution if the program contains 1000 instructions? Consider a clock cycle time of 2 ns.

CPU execution time = (CPU clock cycles + Memory stall cycles) × clock cycle time

=(IC × CPI + Memory stall cycles) × clock cycle time

Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

= IC × (1 × 0.02 + 0.36 × 0.04) × 40 cycles

= 1000 × 0.0344 × 40 cycles

= 1376 cycles

CPU execution time = (1000 × 2 + 1376) × 2 ns

= 6752 ns

Average memory access time

A better measure of memory performance is the average memory access time, defined as

Average memory access time = Hit time + Miss rate × Miss penalty

So the average memory access time depends on three factors: hit time, miss rate and miss penalty. It is reduced if any of these factors is reduced. First we describe miss penalty reduction techniques.

Miss penalty reduction technique:

Among the techniques for reducing the miss penalty, two are described here:

1) Multilevel caches
2) Victim caches

1) Multilevel caches:

[Figure: the processor with an L1 cache, an L2 cache and main memory in sequence.]

L1 Cache and L2 Cache

The first-level cache (L1 cache) is smaller in size compared to the second-level cache (L2 cache). The L1 cache is an on-chip cache whose access time is close to the clock speed of the CPU. The L2 cache is an off-chip cache, large enough to capture many accesses that would otherwise go to main memory, so it reduces the miss penalty. The speed of the L1 cache affects the clock rate of the CPU, while the speed of the L2 cache only affects the miss penalty of the L1 cache.

Average memory access time for a two-level cache is defined by the following formulas.

Average memory access time = Hit timeL1 + Miss rateL1 × Miss penaltyL1

Miss penaltyL1 = Hit timeL2 + Miss rateL2 × Miss penaltyL2

Average memory access time = Hit timeL1 + Miss rateL1 × (Hit timeL2 + Miss rateL2 × Miss penaltyL2)
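The two-level formula translates directly into code; below is a minimal sketch (function and parameter names are illustrative), evaluated with the numbers of the worked example later in this section:

#include <stdio.h>

/* AMAT = Hit_L1 + MR_L1 x (Hit_L2 + MR_L2 x MP_L2),
   where MR_L2 is the local miss rate of the L2 cache. */
double amat_two_level(double hit_l1, double mr_l1,
                      double hit_l2, double mr_l2, double mp_l2) {
    double miss_penalty_l1 = hit_l2 + mr_l2 * mp_l2;
    return hit_l1 + mr_l1 * miss_penalty_l1;
}

int main(void) {
    /* Hit_L1 = 1 cycle, MR_L1 = 4%, Hit_L2 = 10 cycles,
       local MR_L2 = 50%, MP_L2 = 100 cycles. */
    printf("AMAT = %.1f clock cycles\n",
           amat_two_level(1.0, 0.04, 10.0, 0.5, 100.0)); /* 3.4 */
    return 0;
}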

Local miss rate:

Local miss rate = (Number of misses in this cache) / (Total number of memory accesses to this cache)

Local miss rate of L1 = Miss rateL1

Local miss rate of L2 = Miss rateL2

Global miss rate:

Global miss rate = (Number of misses in this cache) / (Total number of memory accesses generated by the CPU)

Global miss rate of L1 = Miss rateL1

Global miss rate of L2 = Miss rateL1 × Miss rateL2

Memory stalls per instruction can be defined as

Average memory stalls per instruction = Misses per instructionL1 × Hit timeL2 + Misses per instructionL2 × Miss penaltyL2
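This definition can be evaluated in a few lines of C; the sketch below anticipates the worked example that follows and takes its numbers from it:

#include <stdio.h>

int main(void) {
    /* From the example below: 40 L1 misses and 20 L2 misses per 1000
       memory references, 1.6 references per instruction. */
    double instructions = 1000.0 / 1.6;             /* 625 */
    double stalls = (40.0 / instructions) * 10.0    /* L1 misses x Hit_L2 */
                  + (20.0 / instructions) * 100.0;  /* L2 misses x MP_L2  */
    printf("average memory stalls per instruction = %.2f\n", stalls); /* 3.84 */
    return 0;
}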

• Multilevel inclusion:
By the multilevel inclusion property, data present in the L1 cache are always present in the L2 cache. Inclusion is desirable because consistency between I/O and the caches can be determined just by checking the L2 cache.

The disadvantage of multilevel inclusion is that the L2 cache holds a redundant copy of the L1 cache, so space is wasted in the L2 cache.

• Multilevel exclusion:
By the multilevel exclusion property, data present in the L1 cache are never found in the L2 cache. A cache miss in L1 results in a swap of blocks between L1 and L2 instead of a replacement of the L1 block with an L2 block.

The advantage of multilevel exclusion is that this policy prevents wasting space in the L2 cache.

Example:

In 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates?
Assume the miss penalty from the L2 cache to memory is 100 clock cycles, the hit time of the L2 cache is 10 clock cycles, the hit time of L1 is 1 clock cycle and there are 1.6 memory references per instruction. What are the average memory access time and the average stall cycles per instruction?

Local miss rate of L1 cache = (40 / 1000) × 100 = 4%

Global miss rate of L1 cache = 4%

Local miss rate of L2 cache = (20 / 40) × 100 = 50%

Global miss rate of L2 cache = (20 / 1000) × 100 = 2%

Average memory access time = Hit timeL1 + Miss rateL1 × (Hit timeL2 + Miss rateL2 × Miss penaltyL2)

= 1 + 0.04 × (10 + 0.5 × 100) clock cycles

= (1 + 2.4) clock cycles

= 3.4 clock cycles

Average memory stalls per instruction =

Misses per instructionL1 × Hit timeL2 + Misses per instructionL2 × Miss penaltyL2

Let x instructions be present. Then

1.6 × x = 1000

x = 1000 / 1.6 = 625

Average memory stalls per instruction

= (40/625 × 10 + 20/625 × 100) clock cycles

= (400/625 + 2000/625) clock cycles

= (2400/625) clock cycles = 3.84 clock cycles

2) Victim caches:

[Figure: the victim cache sits between the cache and main memory, on the cache's refill path.]

Placement of victim cache in the memory hierarchy

• Suppose a block is discarded and is needed again soon after.
• Since the discarded block has already been fetched, it can be used again at small cost.
• Such recycling requires a small, fully associative cache placed between the original cache and its refill path.
• This small cache is called a victim cache because it contains only blocks that are discarded from the cache on a miss.
• The victim cache is checked on a miss to see whether it contains the desired block before main memory is accessed.
• If the desired block is found in the victim cache, the victim block and the cache block are swapped, as the sketch below illustrates.
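The lookup-and-swap behaviour can be sketched in C as follows; this is a simplified software model (tags only, with hypothetical structure and function names), while real hardware performs the fully associative search with parallel comparators:

#define VC_SIZE 4  /* a victim cache holds only a few blocks */

typedef struct {
    int valid;
    unsigned tag;
} Block;

static Block victim[VC_SIZE];

/* Called on a miss in the main cache: evicted points to the block just
   discarded from the cache. Returns 1 if the victim cache supplies the
   desired block (the two blocks are swapped), 0 if main memory must be
   accessed. */
int victim_lookup(unsigned tag, Block *evicted) {
    for (int i = 0; i < VC_SIZE; i++) {
        if (victim[i].valid && victim[i].tag == tag) {
            Block tmp = victim[i];   /* swap victim block and cache block */
            victim[i] = *evicted;
            *evicted  = tmp;
            return 1;
        }
    }
    return 0;
}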

• Compulsory miss - The very first access to a block cannot be in the cache, so the block must be brought into the cache.
• Capacity miss - If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur.
• Conflict miss - If the block placement strategy is set-associative or direct-mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses.

Miss rate reduction technique:

Among the techniques for reducing the miss rate, three are described here:
1) Larger block size
2) Larger caches
3) Compiler optimization
1) Larger block size:
Using larger blocks in the cache, the miss rate can be reduced.

• Larger block sizes reduce compulsory misses.
• Larger block sizes take advantage of spatial locality.
• At the same time, larger blocks increase the miss penalty.
• Since larger blocks reduce the number of blocks in the cache, they may increase conflict misses and even capacity misses if the cache is small.
• Choose the optimum block size such that the miss rate is reduced without increasing the other factors.

2) Larger caches:

• Larger caches reduce capacity misses.
• The drawbacks are a longer hit time and higher cost.
• This technique is especially popular in off-chip caches.

3) Compiler optimization:

The previous miss rate reduction techniques require changes to the hardware: larger blocks, larger caches, higher associativity, or pseudo-associativity. This technique reduces the miss rate using a software approach.

To give a feeling for this type of optimization, two examples are shown here.

Loop interchange

for (j = 0; j < 100; j++)
{
    for (i = 0; i < 500; i++)
    {
        X[i][j] = 2 * X[i][j];
    }
}

The above program has nested loops that access data in memory in non-sequential order. Simply exchanging the nesting of the loops makes the code access the data in the order in which they are stored.

for (i = 0; i < 500; i++)
{
    for (j = 0; j < 100; j++)
    {
        X[i][j] = 2 * X[i][j];
    }
}

Here the memory access is sequential and this technique reduces misses by improving spatial
locality.

Blocking

This optimization reduces misses via improved temporal locality.
In matrix multiplication, one matrix is accessed in row-major order and the other in column-major order, so for the column-major matrix the memory accesses are not sequential.

The solution is that, instead of operating on entire rows or columns of a matrix, blocked algorithms operate on submatrices or blocks, as in the sketch below.
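A blocked version of matrix multiplication in C illustrates the idea (a sketch; the tile size BSIZE is an assumed tuning parameter, chosen so that the tiles in use fit in the cache):

#define N 512
#define BSIZE 32  /* tile size: an assumed tuning parameter */

/* z = z + x * y, computed tile by tile so that the portions of x, y
   and z in use are reused from the cache before being evicted. */
void blocked_matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int jj = 0; jj < N; jj += BSIZE)
        for (int kk = 0; kk < N; kk += BSIZE)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + BSIZE; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + BSIZE; k++)
                        r += x[i][k] * y[k][j];
                    z[i][j] += r;
                }
}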

Hit time reduction technique

Hit time is critical because it affects the clock rate of the processor. Among the general techniques for reducing cache hit time, two are described here.
1) Small and simple cache:
• Smaller hardware is faster, so a small cache certainly has a lower hit time.
• A smaller cache is easier to fit on-chip; otherwise off-chip access time is added.
• A simple cache means a direct-mapped cache. Here the tag is shorter than in other cache mapping techniques and only one tag has to be checked, so the searching time is reduced (see the address-breakdown sketch after this list).
2) Avoid address translation to the cache:
• Translation of a virtual address to a physical address takes extra time.
• Use virtual addresses for the cache, since hits are much more common than misses.
• A cache that uses virtual addresses is called a virtual cache. Protection checking is weakened, so protection information must be added to the virtual cache.
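For the direct-mapped cache of technique 1, the address breakdown that makes the single tag check possible can be sketched as follows (the field widths and the sample address are illustrative assumptions, not from the notes):

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 6  /* assumed 64-byte blocks */
#define INDEX_BITS  8  /* assumed 256 sets      */

int main(void) {
    uint32_t addr   = 0x12345678;                        /* sample address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);  /* byte in block  */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* Only the one tag stored at `index` is compared against `tag`,
       which is why a direct-mapped cache has the lowest hit time. */
    printf("tag = 0x%x, index = %u, offset = %u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}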

Main memory organizations for improving performance

Performance measures of main memory emphasize both latency and bandwidth.

Memory bandwidth is the number of bytes read or written per unit time; in other words, it is how many bytes can be accessed from main memory per unit time.

Memory latency is the time gap between two consecutive word accesses. Since main memory uses DRAM cells, an extra precharge time is needed to access a word.

[Figure: two memory organizations: (a) CPU, cache and memory all one word wide; (b) a wide memory and cache with a multiplexer (MUX) between the one-word-wide CPU and the cache.]
• Wider main memory for higher bandwidth

First-level caches are often organized with a physical width of one word because most CPU accesses are that size. Doubling or quadrupling the width of the cache and the memory will therefore double or quadruple the memory bandwidth. A wider memory is paired with a narrow L1 cache and a wide L2 cache.
There is a cost in the wider connection between the CPU and memory, typically called the memory bus. The CPU will still access the cache one word at a time, so a multiplexer is now needed between the cache and the CPU. A second-level cache can help, since the multiplexing can then be between the first- and second-level caches.
• Simple interleaved memory for higher bandwidth

Memory chips can be organized in banks to read or write multiple words at a time rather than a single word. In general, the purpose of interleaved memory is to take advantage of the potential memory bandwidth of all the chips in the system. Most memory systems activate only those chips containing the needed words, so less power is required in the memory system.

[Figure: four-way interleaved memory. The CPU and cache are connected to four one-word-wide banks, with word addresses assigned round-robin:

Bank 0: words 0, 4, 8, 12
Bank 1: words 1, 5, 9, 13
Bank 2: words 2, 6, 10, 14
Bank 3: words 3, 7, 11, 15]

The banks are often one word wide so that the width of the bus and the cache need not change, but sending the address to several banks permits them all to read simultaneously. In the above example, the addresses of the four banks are interleaved at the word level: bank 0 has all words whose address modulo 4 is 0, bank 1 has all words whose address modulo 4 is 1, and so on.
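The interleaving rule is simple modulo arithmetic, as this small C sketch shows (illustrative, matching the four-bank figure above):

#include <stdio.h>

#define NUM_BANKS 4

int main(void) {
    for (unsigned word = 0; word < 16; word++) {
        unsigned bank   = word % NUM_BANKS;  /* which bank holds the word */
        unsigned offset = word / NUM_BANKS;  /* position inside the bank  */
        printf("word %2u -> bank %u, offset %u\n", word, bank, offset);
    }
    return 0;
}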

Example:

Assume the performance of a 1-word-wide primary memory organization is

• 4 clock cycles to send the address

• 56 clock cycles for the access time per word

• 4 clock cycles to send a word of data.

Given a cache block of 4 words and that a word is 8 bytes, calculate the miss penalty and the effective memory bandwidth.

Recompute the miss penalty and the memory bandwidth assuming we have

• Main memory width of 2 words

• Main memory width of 4 words

• Interleaved main memory with 4 banks, each bank 1 word wide.

Ans:

In case of a main memory width of one word,

Miss penalty = (4 + 56 + 4) clock cycles × 4 = 256 clock cycles

Memory bandwidth = (4 × 8) / 256 = 1/8 byte per clock cycle

In case of a main memory width of 2 words,

Miss penalty = (4 + 56 + 4) clock cycles × 2 = 128 clock cycles

Memory bandwidth = (4 × 8) / 128 = 1/4 byte per clock cycle

In case of a main memory width of 4 words,

Miss penalty = (4 + 56 + 4) clock cycles × 1 = 64 clock cycles

Memory bandwidth = (4 × 8) / 64 = 1/2 byte per clock cycle

In case of interleaved main memory with 4 banks,

Miss penalty = (4 + 56 + 4 × 4) clock cycles = 76 clock cycles

Memory bandwidth = (4 × 8) / 76 ≈ 0.4 byte per clock cycle
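The four cases can be recomputed with a few lines of C (an illustrative check of the answers above; the constant names are ours):

#include <stdio.h>

int main(void) {
    const int addr = 4, access = 56, transfer = 4;        /* clock cycles */
    const int block_words = 4, bytes_per_word = 8;
    const int block_bytes = block_words * bytes_per_word; /* 32 bytes     */

    /* Memory widths of 1, 2 and 4 words: the block needs
       block_words / width sequential accesses. */
    for (int width = 1; width <= 4; width *= 2) {
        int penalty = (addr + access + transfer) * (block_words / width);
        printf("width %d word(s): %3d cycles, %.3f bytes/cycle\n",
               width, penalty, (double)block_bytes / penalty);
    }

    /* Four interleaved banks: the accesses overlap, but the four word
       transfers still happen one after another on the shared bus. */
    int penalty = addr + access + transfer * block_words;
    printf("4-way interleaved: %3d cycles, %.3f bytes/cycle\n",
           penalty, (double)block_bytes / penalty);
    return 0;
}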
