
Memory Design

Key References:
• Computer Organization and Design: The Hardware/Software Interface – Patterson and Hennessy
• Digital Design and Computer Architecture – Harris and Harris
A Computing System
• Three key components
• Computation
• Communication
• Storage/memory

Memory (Programmer’s View)

Memory Abstraction: Virtual vs. Physical
• Programmer sees virtual memory [in general-purpose machines]
• Can assume the memory is "infinite"

• Reality: Physical memory size is much smaller than what the programmer assumes

• Who helps? The system (system software + hardware, cooperatively) maps virtual memory addresses to physical memory
• The system automatically manages the physical memory space transparently to the programmer
+ Programmer does not need to know the physical size of memory nor manage it → A small physical memory can appear as a huge one to the programmer.
Memory Abstraction:
• We need to automatically manage the small physical memory so that it appears to the programmer as a much larger level of storage.
Ideal Memory Conditions

• Instruction supply:
- Zero-latency access
- Infinite capacity
- Zero cost
- Infinite bandwidth

• Pipeline (instruction execution):
- No pipeline stalls
- Perfect data flow (register/memory dependencies)
- Zero-cycle interconnect (operand communication)
- Perfect control flow
- Enough functional units
- Zero-latency compute

• Data supply:
- Zero-latency access
- Infinite capacity
- Infinite bandwidth
- Zero cost
Methods to Store Data?
• Flip-Flops (or Latches)
• Very fast, parallel access
• Very expensive (one bit costs tens of transistors)
• Static RAM (SRAM)
• Relatively fast, but only one data word at a time
• Expensive (one bit costs 6 transistors); used for cache memories
• Dynamic RAM (DRAM)
• Slower, one data word at a time; reading destroys the content (needs refresh); needs a special manufacturing process
• Cheap (one bit costs only one transistor plus one capacitor)
• Other storage technology (flash memory, hard disk, tape)
• Much slower, access takes a long time, non-volatile
• Very cheap (no transistors directly involved)
Building Larger Memories
• Requires larger memory arrays; large → slow
• How do we make the memory large without making it very slow?

• Idea: Divide the memory into smaller arrays and interconnect the arrays to input/output buses.
• Interleaving (banking)
• Goal: Reduce the latency of memory array access and enable multiple accesses in parallel
• Task: Divide a large array into multiple banks that can be accessed independently (in the same cycle / in consecutive cycles)
• Each bank is smaller than the entire memory storage
• Accesses to different banks can be overlapped (a simple address-to-bank sketch follows)
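A minimal Python sketch, not from the slides, of how a block-interleaved layout could assign addresses to banks; the bank count and block size are illustrative assumptions.

```python
# Illustrative sketch: mapping an address to (bank, row) in a block-interleaved memory.
NUM_BANKS = 4        # assumption: power of two, so simple arithmetic suffices
BLOCK_SIZE = 8       # assumption: bytes per block

def bank_of(address: int) -> tuple[int, int]:
    """Return (bank index, row within that bank) for a block-interleaved layout."""
    block = address // BLOCK_SIZE        # which block the byte falls in
    bank = block % NUM_BANKS             # consecutive blocks go to consecutive banks
    row_in_bank = block // NUM_BANKS     # position of the block inside the selected bank
    return bank, row_in_bank

# Consecutive blocks land in different banks, so their accesses can be overlapped.
for addr in (0, 8, 16, 24, 32):
    print(addr, bank_of(addr))
```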

DRAM vs. SRAM
• DRAM
• Slower access (capacitor)
• Higher density (1T 1C cell)
• Lower cost
• Requires refresh (power, performance, circuitry)
• Manufacturing requires putting capacitor and logic together

• SRAM
• Faster access (no capacitor)
• Lower density (6T cell)
• Higher cost
• No need for refresh
• Manufacturing compatible with logic process (no capacitor)
The Memory Hierarchy: Ideal Memory
• Zero access time (latency)
• Infinite capacity
• Zero cost
• Infinite bandwidth (to support multiple accesses in parallel)

• Observe: different memories have different properties, each away from the ideal. So we need to do something better and different.

The Problem
• Ideal memory's requirements oppose each other

• Bigger is slower
• Bigger → takes longer to determine the location

• Faster is more expensive
• Memory technology: SRAM → DRAM → Disk → Tape

• Higher bandwidth is more expensive
• Need more banks, more ports, higher frequency, or faster technology
The Problem
• Bigger is slower
• SRAM, 512 Bytes, sub-nanosec
• SRAM, KByte~MByte, ~nanosec
• DRAM, Gigabyte, ~50 nanosec
• Hard Disk, Terabyte, ~10 millisec

• Faster is more expensive (dollars and chip area)
• SRAM, < $10 per Megabyte
• DRAM, < $1 per Megabyte
• Hard Disk, < $1 per Gigabyte
• Flash memory and others…
Why Memory Hierarchy?
• Yet, we want memory that is both fast and large

• Unfortunately, we cannot achieve both with a single level of memory

• So, what now?

• Have multiple levels of storage (progressively bigger and slower as the levels get farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)
The Memory Hierarchy
• Move frequently needed data here: small, faster per byte, close to the processor
• Back up everything here: big but slow, cheaper per byte

**With good locality of reference, memory appears as fast and as large as possible
Memory Hierarchy
• Fundamental tradeoff
• Fast memory: small
• Large memory: slow
• Idea: Memory hierarchy

CPU (RF) ↔ Cache ↔ Main Memory (DRAM) ↔ Hard Disk

• The levels differ in latency, cost, size, and bandwidth
Locality
• Locality underlies the idea of the memory hierarchy.

• Temporal Locality: with reference to time, you tend to do the same task again in the near future.
• Temporal: A program tends to reference the same memory location many times, all within a small time frame. For example, a loop: the same instructions get executed again and again.

• Spatial Locality: with reference to space (surroundings), you tend to repeat the task in the neighborhood as well.
• Spatial: A program tends to reference a set of nearby memory locations at a time [PC, PC+4, PC+8… are all consecutive], or array/data-structure references. (The loop sketch below illustrates both.)
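An illustrative Python loop, not from the slides, showing both kinds of locality in one place.

```python
# Illustrative only: a simple reduction loop exhibiting temporal and spatial locality.
data = list(range(1024))

total = 0
for i in range(len(data)):   # temporal locality: the loop body (and `total`)
    total += data[i]         #   is reused again and again within a short time window
                             # spatial locality: data[0], data[1], data[2], ...
                             #   occupy consecutive addresses, like PC, PC+4, PC+8
print(total)
```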
Caching Basics: Temporal & Spatial Locality
• Temporal Idea: Store recently accessed data in the cache (fast memory)
• Temporal locality principle
• Recently accessed data will be accessed again in the near future

• Spatial Idea: Store addresses adjacent to the recently accessed one in the cache.
• Logically divide memory into equal-size blocks (e.g., some IBM systems used a 16 KByte cache with 64-byte blocks)
• Fetch the accessed block into the cache in its entirety
• Spatial locality principle
• Nearby data in memory will be accessed in the near future
Cache Memory
• An automatically managed memory structure, based on SRAM, that memorizes frequently used results to avoid:
• repeating the long-latency operations required to reproduce the results from scratch [i.e., paying the DRAM access latency]
Caching in a Pipelined Design
• The cache needs to be synchronized with the pipeline
• i.e., accessed in 1 cycle so that load-dependent operations do not stall
• High-frequency pipeline → cannot make the cache large (why? we don't want delay!)
• But we want a large cache AND a pipelined design
• Idea: Cache hierarchy [i.e., zoom in on the memory hierarchy] (a rough two-level AMAT sketch follows)

CPU (RF) ↔ Level 1 Cache ↔ Level 2 Cache ↔ Main Memory (DRAM)

** An L3 cache is also possible.
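A rough Python sketch, not from the slides, of why a small L1 plus a larger L2 works: average access time stays close to L1. All latencies and hit rates below are made-up illustrative numbers, and the formula assumes a miss first pays the access time of the level that missed.

```python
# Hypothetical latencies (cycles) and hit rates, chosen only for illustration.
L1_HIT, L2_HIT, DRAM = 1, 10, 100
L1_RATE, L2_RATE = 0.90, 0.95      # hit rate of each level, for accesses that reach it

# The L1 miss penalty is itself an average over L2 hits and DRAM accesses.
l2_amat = L2_RATE * L2_HIT + (1 - L2_RATE) * (L2_HIT + DRAM)
amat    = L1_RATE * L1_HIT + (1 - L1_RATE) * (L1_HIT + l2_amat)
print(f"AMAT = {amat:.2f} cycles")  # ~2.5 cycles: far closer to L1 than to DRAM
```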


A Modern Memory Hierarchy
• Register File: 32 words, sub-nsec
• L1 cache: ~32 KB, ~nsec
• L2 cache: 512 KB ~ 1 MB, many nsec
• L3 cache: …
• Main memory (DRAM): GBs, ~100 nsec
• Swap Disk: 100 GB, ~10 msec
Hit or Miss??
• If the processor requests data that is available in the cache, it is returned quickly. This is called a cache hit.

• Otherwise, the processor retrieves the data from main memory (DRAM). This is called a cache miss.

• Memory system performance metrics are the miss rate (or hit rate) and the average memory access time.
Hit or Miss??
• Average memory access time (AMAT) is the average time a processor must wait for memory per load or store instruction.
• [i.e., search the cache → then DRAM → then disk (virtual memory)?]
Caching Basics
• Block (line): Unit of storage in the cache
• Memory is logically divided into cache blocks that map to locations in the cache

• On a reference:
• HIT: If in the cache, use the cached data instead of accessing memory
• MISS: If not in the cache, bring the block into the cache
• May have to evict something else to make room

• Some important cache design decisions
• Placement: where and how to place/find a block in the cache?
• Replacement: what data to remove to make room in the cache?
• Instructions/data: do we treat them separately? (i.e., do we have separate caches?)
Cache Abstraction

Address → [ Tag Store (is the address in the cache? + bookkeeping) | Data Store (stores memory blocks) ] → Hit/miss? + Data
• Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)


• Average memory access time (AMAT)
• = ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
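A small Python sketch applying the hit-rate and AMAT formulas above; the hit/miss counts and latencies are illustrative numbers, not from the slides.

```python
# Illustrative numbers only.
hits, misses = 950, 50
hit_latency, miss_latency = 1, 60     # cycles; miss latency = time to service a miss

hit_rate  = hits / (hits + misses)    # = (# hits) / (# accesses)
miss_rate = 1 - hit_rate

amat = hit_rate * hit_latency + miss_rate * miss_latency
print(hit_rate, amat)                 # 0.95 hit rate -> AMAT = 0.95*1 + 0.05*60 = 3.95 cycles
```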

Blocks and Addressing the Cache
• Memory is logically divided into fixed-size blocks

• Each block maps to a location in the cache, determined by the index bits in the address
• The index bits are used to index into the tag and data stores
• Example 8-bit address: tag (2 bits) | index (3 bits) | byte in block (3 bits); the index finds the block, the byte offset finds the byte

• Cache access process (a bit-slicing sketch follows):
• 1) index into the tag and data stores with the index bits of the address
• 2) check the valid bit in the tag store
• 3) compare the tag bits of the address with the tag stored in the tag store

• If a block is in the cache (cache hit), the stored tag should be valid and match the tag of the block
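A minimal Python sketch of step 1 of the access process, splitting the 8-bit address used on this slide (2-bit tag, 3-bit index, 3-bit byte-in-block).

```python
# 8-bit address: 2-bit tag | 3-bit index | 3-bit byte-in-block.
def split_address(addr: int) -> tuple[int, int, int]:
    byte_in_block = addr & 0b111          # low 3 bits: which byte within the block
    index         = (addr >> 3) & 0b111   # next 3 bits: which tag/data store entry
    tag           = (addr >> 6) & 0b11    # top 2 bits: compared against the stored tag
    return tag, index, byte_in_block

print(split_address(0b10_011_101))        # -> (2, 3, 5)
```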
Direct-Mapped Cache: Placement and Access
• Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks (i.e., 256/8 = 32)
• Assume cache: 64 bytes, 8 blocks (i.e., each block is 8 bytes)
• Direct-mapped: A block can go to only one location
• Address: tag (2 bits) | index (3 bits) | byte in block (3 bits)
• The tag store holds a valid bit (V) and the tag for each cache block; the data store holds the blocks
• On an access, the index selects one entry; the stored tag is compared (=?) with the address tag, and a MUX selects the requested byte within the block, producing Hit? and Data
• Addresses with the same index contend for the same location → cause conflict misses
[Figure: main memory shown as 32 blocks, 00000 through 11111]
Direct-Mapped Caches
• Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
• One index → one entry

• Can lead to a 0% hit rate if more than one block accessed in an interleaved manner maps to the same index
• Assume addresses A and B have the same index bits but different tag bits
• A, B, A, B, A, B, A, B, … → conflict in the cache index [the rest of the cache is completely wasted]
• All accesses are conflict misses (a small simulation of this pattern follows)

• Summary: when two recently accessed addresses map to the same cache block, a conflict occurs, and the most recently accessed address evicts the previous one from the block.
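A Python sketch, not from the slides, of a tiny direct-mapped tag store that replays the A, B, A, B pattern; the 8-entry, 8-byte-block geometry follows the earlier example.

```python
# Tiny direct-mapped tag store (8 entries, 8-byte blocks); sketch only.
NUM_BLOCKS, BLOCK_BITS, INDEX_BITS = 8, 3, 3
tags = [None] * NUM_BLOCKS            # None = invalid entry

def access(addr: int) -> bool:
    index = (addr >> BLOCK_BITS) & (NUM_BLOCKS - 1)
    tag   = addr >> (BLOCK_BITS + INDEX_BITS)
    hit = tags[index] == tag
    if not hit:
        tags[index] = tag             # the new block evicts whatever was there
    return hit

A, B = 0b00_000_000, 0b01_000_000     # same index (000), different tags
print([access(x) for x in (A, B, A, B, A, B)])   # all False: every access is a conflict miss
```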
• The two least significant bits of the 32-bit address are called the byte offset.
• The next three bits are called the set bits; they indicate which cache set the address maps to.
• The remaining 27 bits form the tag, which identifies the memory address of the data stored in a given cache set.

• A load instruction reads the specified entry from the cache and checks the tag and valid bits.
• If the tag matches the most significant 27 bits of the address and the valid bit is 1, the cache hits and the data is returned to the processor. (A field-extraction sketch follows.)
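A Python sketch following the field widths described above (2-bit byte offset, 3-bit set, 27-bit tag); the cache data structure itself is a hypothetical stand-in used only to show the hit check.

```python
# Field widths from the text: 2-bit byte offset, 3-bit set, 27-bit tag (32-bit address).
def fields(addr: int) -> tuple[int, int, int]:
    byte_offset = addr & 0b11
    set_index   = (addr >> 2) & 0b111
    tag         = addr >> 5
    return tag, set_index, byte_offset

# Hypothetical cache state: one (valid, tag, data) entry per set (8 sets).
cache = [{"valid": False, "tag": 0, "data": 0} for _ in range(8)]

def load(addr: int):
    tag, s, _ = fields(addr)
    entry = cache[s]
    if entry["valid"] and entry["tag"] == tag:   # tag matches and the valid bit is 1
        return entry["data"]                     # cache hit
    return None                                  # cache miss: go to main memory
```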
• Consider a cache consisting of 128 blocks of 16 words each, for a total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main memory has 64K words, which we will view as 4K blocks of 16 words each.

• Direct Mapping:
• In this technique, block j of the main memory maps onto block j mod 128 of the cache.
• Whenever one of the main memory blocks 0, 128, 256, … is loaded into the cache, it is stored in cache block 0. Blocks 1, 129, 257, … are stored in cache block 1, and so on.

• Now there is contention! How do we handle it?

• The memory address can be divided into three fields (see the sketch below):
• Tag (5 bits): 4096/128 = 32 (which of the 32 candidate memory blocks occupies the cache block)
• Block (7 bits): the cache has 128 blocks
• Word (4 bits): each block has 16 words (block offset)
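A Python sketch reproducing the field arithmetic of this example: a 16-bit address split into a 5-bit tag, 7-bit block, and 4-bit word field, with block j of memory mapping to cache block j mod 128.

```python
# Direct mapping for the example above: 16-bit address = 5-bit tag | 7-bit block | 4-bit word.
CACHE_BLOCKS = 128

def direct_map(mem_block_j: int) -> int:
    return mem_block_j % CACHE_BLOCKS       # block j of memory -> cache block j mod 128

def fields(addr16: int) -> tuple[int, int, int]:
    word  = addr16 & 0xF            # 4 bits: word within the 16-word block
    block = (addr16 >> 4) & 0x7F    # 7 bits: which of the 128 cache blocks
    tag   = addr16 >> 11            # 5 bits: which of the 4096/128 = 32 candidate memory blocks
    return tag, block, word

print(direct_map(0), direct_map(128), direct_map(256))   # 0, 0, 0 -> they contend for cache block 0
```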
Associative Mapping
• The most flexible mapping method, in which a main memory block can be placed into any cache block position.
• So we only need to worry about the word offset; the rest of the memory address is the tag itself → simple! (A lookup sketch follows.)
• 12 bits: tag (identifies one of the 4096 main memory blocks)
• 4 bits: which word in a block (16 words per block)

• More efficient use of the space in the cache.
• When a new block is brought into the cache, it replaces an existing block only if the cache is full → we need an algorithm to select the block to be replaced [not discussed here].

• The complexity of an associative cache is higher than that of a direct-mapped cache, because we need to search all tag patterns to determine whether a given block is in the cache.
• We do this searching of tags in parallel → but it is expensive.

• Done with a Content-Addressable Memory (CAM): it compares input search data against a table of stored data and returns the address of the matching data.
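A Python sketch, not from the slides, that mimics in software what the CAM does in hardware: in hardware all tag comparisons happen in parallel, here they are simply scanned. The cache structure is an illustrative assumption; field widths (12-bit tag, 4-bit word offset) follow the example above.

```python
# Software stand-in for the CAM lookup in a fully associative cache.
cache = []   # list of entries: {"tag": block_number, "data": [... 16 words ...]}

def lookup(addr16: int):
    tag  = addr16 >> 4               # the whole 12-bit block number is the tag
    word = addr16 & 0xF
    for entry in cache:              # CAM: in hardware, all comparisons happen at once
        if entry["tag"] == tag:
            return entry["data"][word]   # hit: a matching entry was found
    return None                      # miss: the block may be placed in ANY cache block
```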
Set-Associative Mapping
• It uses a combination of the direct- and associative-mapping techniques.
• The blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any block of a specific set.
• Less contention than in the direct method: there are a few choices for block placement.
• Hardware cost is reduced (compared to fully associative) by decreasing the size of the associative search.

• Example: a cache with two blocks per set.
• In this case, memory blocks 0, 64, 128, …, 4032 map into cache set 0, and they can occupy either of the two block positions within this set.

• The 16-bit address is divided as follows (see the sketch below):
• 6 bits (set): 64 sets
• 4 bits: word offset
• 6 bits (tag): 4096/64 sets = 64 → compared with the tags of the two blocks of the set
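A Python sketch of a 2-way set-associative lookup for this example (6-bit tag, 6-bit set, 4-bit word); the cache data structure is an illustrative assumption.

```python
# 2-way set-associative lookup: 16-bit address = 6-bit tag | 6-bit set | 4-bit word.
SETS, WAYS = 64, 2
cache = [[{"valid": False, "tag": 0} for _ in range(WAYS)] for _ in range(SETS)]

def lookup(addr16: int) -> bool:
    word = addr16 & 0xF
    s    = (addr16 >> 4) & 0x3F      # 6 set bits select one of the 64 sets
    tag  = addr16 >> 10              # 6 tag bits, compared against BOTH blocks in the set
    return any(way["valid"] and way["tag"] == tag for way in cache[s])

# Memory blocks 0, 64, 128, ... all share set 0,
# but each can occupy either of its two block positions.
```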
• Further Reading:
• The number of blocks per set is a parameter that can be selected to suit the requirements of a particular computer.
• For example: four blocks per set can be accommodated by a 5-bit set field (128/4 = 32 sets), eight blocks per set by a 4-bit set field, and so on.
• The extreme condition of 128 blocks per set requires no set bits and corresponds to the fully associative technique, with 12 tag bits.
• The other extreme of one block per set is the direct-mapping method.
• A cache that has k blocks per set is referred to as a k-way set-associative cache. (We studied the 2-way case; the calculation below tabulates the field widths.)
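A small Python calculation reproducing the field widths quoted above for different degrees of associativity of the 128-block cache with 4096 memory blocks.

```python
import math

TOTAL_CACHE_BLOCKS = 128
BLOCK_NUMBER_BITS  = 12            # 4096 memory blocks -> 12-bit block number

for k in (1, 2, 4, 8, 128):        # blocks per set (k-way)
    sets     = TOTAL_CACHE_BLOCKS // k
    set_bits = int(math.log2(sets))
    tag_bits = BLOCK_NUMBER_BITS - set_bits
    print(f"{k:3}-way: {sets:3} sets, {set_bits} set bits, {tag_bits} tag bits")
# 1-way -> direct-mapped (7 set bits); 128-way -> fully associative (0 set bits, 12 tag bits)
```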
