Memory
Chapter 8
Brett H. Meyer
Winter 2024
Revision history:
Warren Gross – 2017
Christophe Dubach – W2020, F2020, F2021, F2022, F2023
Brett H. Meyer – W2021, W2022, W2023, W2024
Some material from Hamacher, Vranesic, Zaky, and Manjikian, Computer Organization and Embedded Systems, 6th ed., 2012, McGraw Hill, and "Introduction to the ARM Processor using Altera Toolchain."
Disclaimer
Introduction
What is Memory? What is Storage?
source: www.ifixit.com
Random Access Memory (RAM)
Memory Technology
1024x1 RAM
Static RAM
[Figure: n-type and p-type MOSFET cross-sections. Source: VectorVoyager (PNG version: user:rogerb), CC BY-SA 3.0, via Wikimedia Commons]
6T SRAM Bit Cell
Dynamic RAM
DRAM Bit Cell
A DRAM cell stores a '1' when the voltage across C is VDD*. The charge in C leaks through T even when T is off.
* In practice, voltages less than VDD are recognized as '1', too.
Reading DRAM
Reading a DRAM cell refreshes its contents. Note that an entire row
is read and refreshed at the same time.
To refresh the entire DRAM, each row must be periodically read.
Refresh Overhead
Assume that each row needs to be refreshed every 64 ms, that the minimum time between two row accesses is 50 ns, and that the DRAM has 8192 rows, so a full refresh takes 8192 row accesses.
Read/write operations have to be delayed until refresh is finished.
What is the refresh overhead?
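One way to work it out: a full refresh takes 8192 × 50 ns = 409,600 ns ≈ 0.41 ms, and it must happen once every 64 ms. The overhead is therefore 0.41 ms / 64 ms ≈ 0.0064, i.e., reads and writes are delayed about 0.64% of the time.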
256 Mb Asynchronous DRAM (32M x 8)
The 25-bit address is broken into 14 bits for row select and 11 for column select (a sketch of the split follows below).
• First, A24-11 is driven and RAS (row address strobe) asserted, reading a row.
• Then, A10-0 is driven and CAS (column address strobe) asserted, selecting a byte.
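The same split in C (the helper names are made up for illustration; the field positions come from the slide):

#include <stdint.h>

/* Split a 25-bit address into the DRAM's row and column fields. */
uint32_t row_field(uint32_t addr) { return (addr >> 11) & 0x3FFF; } /* A24-11, 14 bits */
uint32_t col_field(uint32_t addr) { return addr & 0x7FF; }          /* A10-0, 11 bits  */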
Synchronous DRAM
• Synchronous DRAM (SDRAM) integrates an on-chip memory controller
• A clock helps generate internal timing signals (i.e., RAS and CAS)
• Refresh is also built-in
• The "dynamic" nature of the chip is invisible to the user
Efficient Block Transfers
Memory Latency and Bandwidth
Double-Data-Rate (DDR) SDRAM
Modern SDRAM uses both rising and falling edges of the clock
(“double data rate”).
E.g., DDR4-2400 uses a 1200 MHz bus clock and, transferring on both clock edges, supports up to 2400 M transfers per second.
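To connect the numbers: DDR transfers twice per clock cycle, so a 1200 MHz clock yields 2 × 1200 = 2400 MT/s; with the standard 64-bit (8 B) DIMM data path, that is a peak of 2400 M × 8 B = 19.2 GB/s.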
Multi-chip Memories
Memory Technology
Read-only Memories, Non-volatile Memories
Textbook §8.3
Read-only Memory (ROM)
PROM, EPROM, and EEPROM
Flash Memory
Direct Memory Access (DMA)
Textbook §8.4
Direct Memory Access
DMA Controller
DMA controllers may be shared; individual I/O devices may also have their own DMA controllers.
• The CPU writes the control registers (starting address, count, R/W) and initiates the transfer; a sketch follows below.
• The controller keeps track of progress with a counter.
• An interrupt can be used to signal transfer completion.
• DMA can also be invoked to make repeated transfers triggered by a timer.
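A minimal sketch of the CPU side of such a transfer, assuming a hypothetical memory-mapped controller (the base address, register offsets, and control bits below are all made up for illustration):

#include <stdint.h>

#define DMA_BASE  0x40001000u
#define DMA_ADDR  (*(volatile uint32_t *)(DMA_BASE + 0x0)) /* starting address */
#define DMA_COUNT (*(volatile uint32_t *)(DMA_BASE + 0x4)) /* transfer counter */
#define DMA_CTRL  (*(volatile uint32_t *)(DMA_BASE + 0x8)) /* control/status   */

#define CTRL_WRITE (1u << 0) /* direction: memory to device      */
#define CTRL_IEN   (1u << 1) /* raise an interrupt when done     */
#define CTRL_GO    (1u << 2) /* initiate the transfer            */

void dma_start(const uint32_t *buf, uint32_t words) {
    DMA_ADDR  = (uint32_t)(uintptr_t)buf;
    DMA_COUNT = words; /* the controller decrements this as it goes */
    DMA_CTRL  = CTRL_WRITE | CTRL_IEN | CTRL_GO;
    /* the CPU is now free; the completion interrupt signals the end */
}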
Caches
Textbook §8.5, 8.6
The Memory Problem
Solution: use both DRAM and SRAM such that the memory appears
to the CPU to be large, and fast.
The solution should be transparent to the programmer.
The Memory Problem
Library: large, slow access. Desk: small, fast access.
Unlimited amounts of fast memory?
Memory Hierarchy
Even when two different systems have the same number of levels of hierarchy, different use cases may mean different sizes for each memory.
Locality, Locality, Locality
Cache Basics
Caches are too small to store copies of the entire address space; at any given time, some recently accessed things will be in the cache, and other things will not.
Each time the CPU (a) fetches an instruction or (b) accesses data, the cache is checked first: a hit is serviced quickly from the cache, while a miss requires fetching the block from the next level of the hierarchy.
Hit and Miss Rate
Where are items put in the cache?
Some mapping functions are simple; others are more complex, but
result in a higher hit rate.
Direct-mapped Cache
Direct-mapped Cache
What happens when a cache block is accessed for the first time? The
tag could match, but the data would be invalid.
Each cache block also has a valid bit, initialized to '0' and set to '1' whenever a block is copied into the cache.
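A sketch of the address decoding in C, for a hypothetical cache with 16 B blocks and 128 blocks (the sizes are made up; only the tag/index/offset structure comes from the slides):

#include <stdint.h>

#define OFFSET_BITS 4 /* log2(16 B block)  */
#define INDEX_BITS  7 /* log2(128 blocks)  */

uint32_t offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t index_(uint32_t addr) { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

/* A hit requires both a set valid bit and a matching tag. */
int is_hit(uint32_t addr, uint32_t stored_tag, int valid) {
    return valid && tag(addr) == stored_tag;
}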
Direct-mapped Cache Hardware Design
Fully-associative Cache
This is slower and more expensive, but achieves the highest hit rate.
Fully-associative Cache
[Figure: fully-associative cache. The tag of every cache line is compared in parallel (one comparator per line); multiplexers then select the matching cache line and the requested word on a hit.]
Set-associative Cache
Set-associative Cache
[Figure: set-associative cache. A decoder selects one set; the tags within that set are compared in parallel, and multiplexers select the hit word.]
Every Cache is Set-associative
Block Replacement Policies
Each policy choice has pros and cons related to hardware complexity and resulting miss rate; many other policies exist beyond those shown here.
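As one example, a minimal sketch of least-recently-used (LRU) replacement for a single 4-way set; the age counters are an illustrative software model, not necessarily how hardware implements LRU:

#include <stdint.h>

#define WAYS 4
typedef struct { uint32_t tag; int valid; int age; } Line;

/* On each access, reset the used way's age and age the others. */
void touch(Line set[WAYS], int used) {
    for (int w = 0; w < WAYS; w++) set[w].age++;
    set[used].age = 0;
}

/* Pick a victim: any invalid line first, otherwise the oldest. */
int victim(Line set[WAYS]) {
    int v = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (set[w].age > set[v].age) v = w;
    }
    return v;
}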
Writes to Cache
Write-through
Write-back
• Hit: write to the cache. Update main memory only when that cache block is removed from the cache. A dirty bit (or modified bit) is set to indicate that the cache block has been modified and is no longer identical to the block in main memory (sketched below).
• Miss: first copy the block containing the addressed word from main memory into the cache, and then write the new word into the cache block.
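A sketch of this hit/miss handling, with hypothetical stand-ins for the memory-side operations:

#include <stdint.h>

typedef struct { uint32_t tag; int valid; int dirty; uint8_t data[16]; } Block;

static void write_block_to_memory(Block *b) { (void)b; /* device-specific */ }
static void fetch_block(Block *b, uint32_t tag) { b->tag = tag; b->valid = 1; b->dirty = 0; }

void write_byte(Block *blk, uint32_t tag, uint32_t off, uint8_t value) {
    if (!(blk->valid && blk->tag == tag)) {   /* miss */
        if (blk->valid && blk->dirty)
            write_block_to_memory(blk);       /* evicted block is stale in memory */
        fetch_block(blk, tag);                /* copy the block into the cache */
    }
    blk->data[off] = value;                   /* hit path: write the cache only */
    blk->dirty = 1;                           /* memory is updated later, on eviction */
}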
Caching Example
Caching Example: Direct-mapped Cache Results
Caching Example: Fully-associative Cache Results
Caching Example: Set-associative Cache Results
Split L1 Cache
L1 is usually split into instruction and data caches; later levels are unified.
• Harvard architecture: a unified L1 would slow things down
• Instruction and data access patterns are quite different
• Instruction accesses are predictable: loops; basic blocks
• Instruction accesses are read-only
• Splitting the L1 cache results in higher hit rates
Secondary Storage
Textbook §8.10
Secondary Storage
Magnetic Hard Disk Drives
Each disk is divided into concentric tracks, and each track into sectors. A cylinder is a set of tracks on a stack of disks; such tracks can be accessed simultaneously without moving the read/write heads.
• Data is written
sector-by-sector (e.g., 512 B)
• Formatting information
(including track/sector
markers) and error-correcting
code (ECC) information is
stored on disk
• The file system is on disk, too:
data structures that the OS
uses to keep track of files
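For a sense of scale (illustrative numbers, not from the slides): a drive with 8 recording surfaces, 100,000 tracks per surface, 500 sectors per track, and 512 B sectors stores 8 × 100,000 × 500 × 512 B ≈ 205 GB, before formatting and ECC overhead.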
HDD Access Time
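The usual decomposition: access time = seek time + rotational latency + transfer time. For illustration, at 7200 RPM one rotation takes 60/7200 s ≈ 8.33 ms, so the average rotational latency is about 4.17 ms; with an average seek of roughly 9 ms and negligible transfer time for a single sector, a random access costs on the order of 13 ms.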
Virtual Memory
Textbook §8.8, 8.9
Virtual Memory
Memory Management Unit
Virtual Memory Organization
Address Translation
Page Table
The page table stores all translations from virtual to physical pages.
• The MMU stores the start address of the page table in the page table base register (PTBR)
• PTBR + VPN gives the address of the page table entry (PTE) for the given virtual page number (VPN); see the sketch below
• Each PTE maintains control bits (valid? modified?)
• Each PTE also stores the page frame number if the page is in memory
• Otherwise, it may indicate where on disk the page can be found
• PTEs also track process information, read/write permissions, etc.
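A minimal sketch of that lookup, assuming a flat (single-level) table of 4-byte PTEs and 4 KiB pages; the field layout is illustrative, not that of any particular MMU:

#include <stdint.h>

#define PAGE_BITS 12          /* 4 KiB pages */
#define PTE_VALID (1u << 31)

uint32_t *ptbr;               /* page table base register (set by the OS) */

/* Translate a virtual address; returns 0 on a page fault. */
int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;        /* virtual page number        */
    uint32_t pte = ptbr[vpn];                 /* PTBR + VPN locates the PTE */
    if (!(pte & PTE_VALID)) return 0;         /* not in memory: page fault  */
    uint32_t pfn = pte & 0x000FFFFFu;         /* page frame number (assumed field) */
    *paddr = (pfn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    return 1;
}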
Translation Lookaside Buffer (TLB)
The MMU must perform translation for each memory access (i.e., every fetch and every load or store). If each translation requires a reference to the page table in main memory, translation is slow!
• When physical memory is large, the page table has many entries
• It isn't practical to store the page table in the MMU
• The translation lookaside buffer (TLB) in the MMU caches recently accessed PTEs; a sketch follows below
• The TLB is fully associative; on a miss, the full table is accessed and the TLB updated (e.g., using LRU replacement)
• Split L1 caches? Two TLBs: one for instruction accesses, another for data accesses
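A sketch of the TLB lookup, with a loop standing in for the hardware's parallel comparators (sizes and layout illustrative):

#include <stdint.h>

#define TLB_ENTRIES 16
typedef struct { uint32_t vpn; uint32_t pfn; int valid; } TlbEntry;

/* Fully associative: the VPN is compared against every entry. */
int tlb_lookup(TlbEntry tlb[TLB_ENTRIES], uint32_t vpn, uint32_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) { *pfn = tlb[i].pfn; return 1; }
    return 0; /* miss: walk the page table, then update the TLB (e.g., LRU) */
}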
Page Faults
Conclusions