Unit-III : Memory
Topics
Types and hierarchy model, level organization, cache memory, performance considerations, mapping, virtual memory, swapping, paging, segmentation, replacement policies
Random-Access Memory (RAM)
Key features
RAM is packaged as a chip. Basic storage unit is a cell (one bit per cell). Multiple RAM chips form a memory.
Static RAM (SRAM)
Each cell stores a bit using a six-transistor circuit.
Retains its value indefinitely, as long as it is kept powered.
Relatively insensitive to disturbances such as electrical noise.
Faster and more expensive than DRAM.
Dynamic RAM (DRAM)
Each cell stores a bit with a capacitor and a transistor.
Value must be refreshed every 10-100 ms.
Sensitive to disturbances.
Slower and cheaper than SRAM.
SRAM vs DRAM Summary
        Tran. per bit   Access time   Persist?   Sensitive?   Cost   Applications
SRAM    6               1X            Yes        No           100X   Cache memories
DRAM    1               10X           No         Yes          1X     Main memories, frame buffers
Traditional Architecture
[Figure 5.1. Connection of the memory to the processor: the processor's MAR drives a k-bit address bus (up to 2^k addressable locations) and its MDR connects to an n-bit data bus (word length = n bits); control lines (R/W, M/IO, etc.) coordinate the transfers.]
Conventional DRAM Organization
d x w DRAM:
dw total bits organized as d supercells of size w bits
[Figure: a 16 x 8 DRAM chip organized as a 4 x 4 array of supercells, e.g. supercell (2,1); the memory controller sends 2-bit row and column addresses (addr) and transfers 8 bits of data; an internal row buffer holds one full row.]
Reading DRAM Supercell (2,1)
Step 1(a): Row access strobe (RAS) selects row 2. Step 1(b): Row 2 copied from DRAM array to row buffer.
[Figure: RAS = 2 on the addr lines selects row 2 of the 16 x 8 DRAM chip; the entire row is copied from the DRAM array into the internal row buffer.]
Reading DRAM Supercell (2,1)
Step 2(a): Column access strobe (CAS) selects column 1. Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually back to the CPU.
[Figure: CAS = 1 on the addr lines selects column 1; supercell (2,1) is copied from the internal row buffer onto the 8-bit data lines and returned to the CPU via the memory controller.]
Memory Modules
addr (row = i, col = j): each DRAM chip supplies its supercell (i, j).
[Figure: a 64 MB memory module built from eight 8M x 8 DRAMs (DRAM 0 ... DRAM 7). Each chip contributes 8 bits of a 64-bit doubleword: DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63. The memory controller broadcasts the same (row, col) address to all eight chips and assembles their outputs into the 64-bit doubleword at main memory address A.]
Enhanced DRAMs
All enhanced DRAMs are built around the conventional DRAM core.
Fast page mode DRAM (FPM DRAM)
Access the contents of a row with [RAS, CAS, CAS, CAS, CAS] instead of [(RAS, CAS), (RAS, CAS), (RAS, CAS), (RAS, CAS)].
Extended data out DRAM (EDO DRAM)
Enhanced FPM DRAM with more closely spaced CAS signals.
Synchronous DRAM (SDRAM)
Driven with the rising clock edge instead of asynchronous control signals.
Double data-rate synchronous DRAM (DDR SDRAM)
Enhancement of SDRAM that uses both clock edges as control signals.
Video RAM (VRAM)
Like FPM DRAM, but output is produced by shifting the row buffer.
Dual ported (allows concurrent reads and writes).
Nonvolatile Memories
DRAM and SRAM are volatile memories
Lose information if powered off.
Nonvolatile memories retain value even if powered off.
Generic name is read-only memory (ROM). The name is misleading, because some ROMs can be modified as well as read.
Types of ROMs
Programmable ROM (PROM)
Erasable programmable ROM (EPROM)
Electrically erasable PROM (EEPROM)
Flash memory
Firmware
Program stored in a ROM.
Examples: boot-time code, BIOS (basic input/output system), code on graphics cards and disk controllers.
Disk Geometry
Disks consist of platters, each with two surfaces. Each surface consists of concentric rings called tracks. Each track consists of sectors separated by gaps.
[Figure: a disk surface with concentric tracks (e.g. track k) around the spindle; each track is divided into sectors separated by gaps.]
Disk Geometry (Multiple-Platter View)
Aligned tracks form a cylinder.
[Figure: three platters (platter 0, 1, 2) on one spindle give six surfaces (surface 0-5); track k on every surface lines up to form cylinder k.]
Disk Capacity
Capacity: maximum number of bits that can be stored.
Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes.
Capacity is determined by these technology factors:
Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track.
Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment.
Areal density (bits/in^2): product of recording density and track density.
Modern disks partition tracks into disjoint subsets called recording zones.
Each track in a zone has the same number of sectors, determined by the circumference of the innermost track in the zone. Different zones have different numbers of sectors/track.
Computing Disk Capacity
Capacity = (# bytes/sector) x (avg. # sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)
Example:
512 bytes/sector
300 sectors/track (on average)
20,000 tracks/surface
2 surfaces/platter
5 platters/disk

Capacity = 512 x 300 x 20,000 x 2 x 5
         = 30,720,000,000 bytes = 30.72 GB
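The formula is easy to check in code. A minimal C sketch (the function name and its parameters are ours, not from the slides) that reproduces the example:

#include <stdio.h>

/* Computes disk capacity from the five factors listed above. */
long long disk_capacity(long long bytes_per_sector,
                        long long sectors_per_track,
                        long long tracks_per_surface,
                        long long surfaces_per_platter,
                        long long platters_per_disk)
{
    return bytes_per_sector * sectors_per_track * tracks_per_surface
         * surfaces_per_platter * platters_per_disk;
}

int main(void)
{
    /* Values from the example above. */
    long long cap = disk_capacity(512, 300, 20000, 2, 5);
    printf("Capacity = %lld bytes = %.2f GB\n", cap, cap / 1e9);
    /* Prints: Capacity = 30720000000 bytes = 30.72 GB */
    return 0;
}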
Disk Operation (Single-Platter View)
The disk surface spins at a fixed rotational rate. The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air. By moving radially, the arm can position the read/write head over any track.
Disk Operation (Multi-Platter View)
The read/write heads move in unison from cylinder to cylinder.
[Figure: one arm and head per surface, all mounted on a common actuator around the spindle.]
Disk Access Time
Average time to access some target sector is approximated by:
Taccess = Tavg seek + Tavg rotation + Tavg transfer
Seek time (Tavg seek)
Time to position heads over cylinder containing target sector. Typical Tavg seek = 9 ms
Rotational latency (Tavg rotation)
Time waiting for the first bit of the target sector to pass under the r/w head.
Tavg rotation = (1/2) x (1/RPM) x (60 s/min)
Transfer time (Tavg transfer)
Time to read the bits in the target sector.
Tavg transfer = (1/RPM) x (1/(avg # sectors/track)) x (60 s/min)
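A small C sketch of the three formulas; the RPM and sectors/track values in main() are illustrative assumptions, not from the slides:

#include <stdio.h>

/* Average disk access time in milliseconds. */
double avg_access_ms(double seek_ms, double rpm, double sectors_per_track)
{
    double rotation_ms = 0.5 * (60.0 / rpm) * 1000.0;               /* half a revolution */
    double transfer_ms = (60.0 / rpm) / sectors_per_track * 1000.0; /* one sector */
    return seek_ms + rotation_ms + transfer_ms;
}

int main(void)
{
    /* Assumed: Tavg seek = 9 ms, 7200 RPM, 400 sectors/track. */
    printf("Taccess = %.3f ms\n", avg_access_ms(9.0, 7200.0, 400.0));
    /* 9 + 4.167 + 0.021 = 13.188 ms: seek and rotation dominate. */
    return 0;
}

Note how little of the total is spent transferring the sector itself; this is why disks are accessed in large blocks.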
Logical Disk Blocks
Modern disks present a simpler abstract view of the complex sector geometry:
The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...).
Mapping between logical blocks and actual (physical) sectors
Maintained by a hardware/firmware device called the disk controller. Converts requests for logical blocks into (surface, track, sector) triples.
Allows the controller to set aside spare cylinders for each zone.
Accounts for the difference between formatted capacity and maximum capacity.
The CPU-Memory Gap
The increasing gap between DRAM, disk, and CPU speeds.
[Figure: access time in ns (log scale, 1 to 100,000,000) versus year (1980-2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]
Locality
Principle of Locality:
Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
Temporal locality: recently referenced items are likely to be referenced in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Locality Example:
Data
Reference array elements in succession (stride-1 reference pattern): spatial locality.
Reference sum each iteration: temporal locality.
Instructions
Reference instructions in sequence: spatial locality.
Cycle through the loop repeatedly: temporal locality.
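A short C sketch (not from the slides) that contrasts the two access patterns on a 2-D array stored in row-major order; the stride-1 loop enjoys spatial locality, while the stride-N loop touches a new cache block on almost every access:

#define N 1024
int a[N][N];

/* Good spatial locality: stride-1, visits elements in memory order. */
long sum_rows(void)
{
    long sum = 0;                 /* sum is reused every iteration: temporal locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Poor spatial locality: stride-N, jumps N * sizeof(int) bytes per access. */
long sum_cols(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

Both functions compute the same total, but sum_rows typically runs several times faster on cached machines.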
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
Fast storage technologies cost more per byte and have less capacity.
The gap between CPU and main memory speed is widening.
Well-written programs tend to exhibit good locality.
These fundamental properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy
[Figure: the memory hierarchy pyramid. Toward the top, storage devices are smaller, faster, and costlier per byte; toward the bottom, larger, slower, and cheaper per byte.]
L0: registers. CPU registers hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM). Holds cache lines retrieved from main memory.
L3: main memory (DRAM). Holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Holds files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).
Memory Hierarchy
[Figure: the CPU talks to the cache, the cache to main memory, and main memory to an I/O processor that serves magnetic disks and magnetic tapes.]
Cache Memory
Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Programs tend to access the data at level k more often than they access the data at level k+1.
Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Net effect: a large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
Why do memory hierarchies work?
[Figure 5.14. Use of a cache memory: the cache sits between the processor and main memory. Key design issues: replacement algorithm, hit/miss handling, write-through vs. write-back, load-through.]
Cache Memory Operation
1. The cache fetches data from addresses at and next to the current address in main memory.
2. The CPU checks whether the next instruction it requires is in the cache.
3. If it is, the instruction is fetched from the cache: a very fast access.
4. If not, the CPU has to fetch the next instruction from main memory: a much slower process.
[Figure: the CPU is connected by buses to the cache memory (SRAM), which in turn is connected to main memory (DRAM).]
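The four steps can be sketched in C. A minimal simulation under assumed parameters (a direct-mapped cache of 8 one-word lines in front of a 64-word memory), not a description of any real controller:

#include <stdio.h>
#include <stdbool.h>

#define LINES 8
#define MEM_WORDS 64

int  memory[MEM_WORDS];     /* main memory: slow (simulated) */
int  cache_data[LINES];     /* cache: fast (simulated) */
int  cache_tag[LINES];
bool cache_valid[LINES];

int read_word(int addr)
{
    int line = addr % LINES;                 /* which cache line this address maps to */
    int tag  = addr / LINES;
    if (cache_valid[line] && cache_tag[line] == tag)
        return cache_data[line];             /* step 3: hit, served from the cache */
    /* step 4: miss, fetch from main memory, then stage it in the cache (step 1) */
    cache_data[line]  = memory[addr];
    cache_tag[line]   = tag;
    cache_valid[line] = true;
    return cache_data[line];
}

int main(void)
{
    for (int i = 0; i < MEM_WORDS; i++) memory[i] = i * 10;
    printf("%d\n", read_word(5));  /* miss: fetched from memory and cached */
    printf("%d\n", read_word(5));  /* hit: served from the cache */
    return 0;
}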
Cache Memory
[Figure: on a hit, the CPU is served from the fast cache; on a miss, the access goes to slow main memory. With a 95% hit ratio:
Access time = 0.95 x (cache access time) + 0.05 x (main memory access time)]
Caching in a Memory Hierarchy
[Figure: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (e.g. blocks 4, 8, 9, 10, 14). Data is copied between levels in block-sized transfer units. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks numbered 0-15.]
General Caching Concepts
Program needs object d, which is stored in some block b.
Cache hit: the program finds b in the cache at level k. E.g., block 14.
Cache miss: b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12. If the level k cache is full, some current block must be replaced (evicted). Which one is the victim?
Placement policy: where can the new block go? E.g., b mod 4.
Replacement policy: which block should be evicted? E.g., LRU.
General Caching Concepts
Types of cache misses:
Cold (compulsory) miss
Cold misses occur because the cache is empty.
Conflict miss
Most caches limit the blocks from level k+1 to a small subset (sometimes a singleton) of the block positions at level k. E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
Occurs when the set of active cache blocks (the working set) is larger than the cache.
Examples of Caching in the Hierarchy
Cache Type            What Cached            Where Cached          Latency (cycles)   Managed By
Registers             4-byte word            CPU registers         0                  Compiler
TLB                   Address translations   On-chip TLB           0                  Hardware
L1 cache              32-byte block          On-chip L1            1                  Hardware
L2 cache              32-byte block          Off-chip L2           10                 Hardware
Virtual memory        4-KB page              Main memory           100                Hardware + OS
Buffer cache          Parts of files         Main memory           100                OS
Network buffer cache  Parts of files         Local disk            10,000,000         AFS/NFS client
Browser cache         Web pages              Local disk            10,000,000         Web browser
Web cache             Web pages              Remote server disks   1,000,000,000      Web proxy server
Performance Considerations
Overview
Two key factors: performance and cost
Price/performance ratio
Performance depends on how fast machine instructions can be brought into the processor for execution and how fast they can be executed.
For the memory hierarchy to perform well, transfers to and from the slower units should proceed at a rate close to that of the faster unit.
This is not possible if both the slow and the fast units are accessed in the same manner.
However, it can be achieved when parallelism is used in the organization of the slower unit.
Interleaving
If the main memory is structured as a collection of physically separate modules, each with its own ABR (address buffer register) and DBR (data buffer register), memory access operations may proceed in more than one module at the same time.
[Figure 5.25. Addressing multiple-module memory systems. Each of the n modules has its own ABR and DBR. (a) Consecutive words in a module: the high-order k bits of the MM address select one of the modules, and the remaining m bits give the address within that module. (b) Consecutive words in consecutive modules: the low-order k bits select the module, so successive addresses fall in successive modules and can be accessed in parallel.]
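A C sketch of the two address layouts, assuming k = 2 module bits and m = 4 address-in-module bits (illustrative values, not from the slides):

#include <stdio.h>

#define K 2   /* module-number bits */
#define M 4   /* address-in-module bits */

/* (a) Consecutive words in a module: high-order k bits pick the module. */
void split_a(unsigned addr, unsigned *module, unsigned *offset)
{
    *module = addr >> M;
    *offset = addr & ((1u << M) - 1);
}

/* (b) Consecutive words in consecutive modules: low-order k bits pick the
 * module, so sequential addresses spread across the modules. */
void split_b(unsigned addr, unsigned *module, unsigned *offset)
{
    *module = addr & ((1u << K) - 1);
    *offset = addr >> K;
}

int main(void)
{
    unsigned mod, off;
    for (unsigned addr = 0; addr < 4; addr++) {
        split_b(addr, &mod, &off);
        printf("addr %u -> module %u, word %u\n", addr, mod, off);
    }
    /* Addresses 0, 1, 2, 3 land in modules 0, 1, 2, 3: a transfer of
     * consecutive words keeps all four modules busy at once. */
    return 0;
}

Layout (b) is what makes interleaving effective for cache-block refills, which read consecutive words.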
Hit Rate and Miss Penalty
The success rate in accessing information at the various levels of the memory hierarchy is the hit rate; its complement is the miss rate.
Ideally, the entire memory hierarchy would appear to the processor as a single memory unit with the access time of the on-chip cache and the size of a magnetic disk. How closely this ideal is approached depends on the hit rate (which should be well above 0.9).
A miss causes extra time to be needed to bring the desired information into the cache.
Hit Rate and Miss Penalty (cont.)
Tave = hC + (1 - h)M
Tave: average access time experienced by the processor
h: hit rate
M: miss penalty, the time to access information in the main memory
C: the time to access information in the cache
Example:
Assume that 30 percent of the instructions in a typical program perform a read or write operation, which means that there are 130 memory accesses for every 100 instructions executed.
h = 0.95 for instructions, h = 0.9 for data
C = 1 clock cycle, M = 17 clock cycles (interleaved main memory); a plain memory access takes 10 cycles
Time without cache: 130 x 10 = 1300 cycles
Time with cache: 100(0.95 x 1 + 0.05 x 17) + 30(0.9 x 1 + 0.1 x 17) = 258 cycles
Ratio: 1300 / 258 = 5.04
The computer with the cache performs about five times better.
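The arithmetic of the example, reproduced in a short C program (all values are from the slide):

#include <stdio.h>

int main(void)
{
    double h_inst = 0.95, h_data = 0.90;  /* hit rates */
    double C = 1.0, M = 17.0;             /* hit time and miss penalty, cycles */
    double no_cache   = 130 * 10;         /* 130 accesses x 10 cycles each */
    double with_cache = 100 * (h_inst * C + (1 - h_inst) * M)
                      + 30  * (h_data * C + (1 - h_data) * M);
    printf("without cache: %.0f cycles\n", no_cache);        /* 1300 */
    printf("with cache:    %.0f cycles\n", with_cache);      /* 258  */
    printf("speedup:       %.2f\n", no_cache / with_cache);  /* 5.04 */
    return 0;
}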
How to Improve Hit Rate?
Use a larger cache: increased cost.
Increase the block size while keeping the total cache size constant. However, if the block size is too large, some items may not be referenced before the block is replaced, and the miss penalty increases.
Use the load-through approach: forward the requested word to the processor as soon as it is read, instead of waiting for the whole block to be loaded.
Caches on the Processor Chip
On-chip vs. off-chip caches.
Two separate caches for instructions and data, or a single cache for both?
Which one has the better hit rate? The single cache.
What is the advantage of separate caches? Parallelism, hence better performance.
Level 1 and Level 2 caches
L1 cache: faster and smaller. Accesses more than one word simultaneously and lets the processor use them one at a time.
L2 cache: slower and larger.
How about the average access time?
Average access time: tave = h1*C1 + (1 - h1)*h2*C2 + (1 - h1)*(1 - h2)*M
where h1 and h2 are the hit rates of the L1 and L2 caches, C1 and C2 are the times to access information in them, and M is the time to access information in main memory.
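A one-function C sketch of the two-level formula; the parameter values in main() are illustrative assumptions, not from the slides:

#include <stdio.h>

double tave2(double h1, double h2, double C1, double C2, double M)
{
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M;
}

int main(void)
{
    /* 95% of accesses hit in L1 (1 cycle); 90% of the rest hit in L2
     * (10 cycles); the remainder go to main memory (100 cycles). */
    printf("tave = %.2f cycles\n", tave2(0.95, 0.90, 1, 10, 100));
    /* 0.95 + 0.45 + 0.50 = 1.90 cycles */
    return 0;
}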
Other Enhancements
Write buffer: the processor doesn't need to wait for a memory write to be completed.
Prefetching: prefetch data into the cache before they are needed.
Lockup-free cache: the processor is able to access the cache while a miss is being serviced.
Mapping
[Figure: main memory locations 00000000-3FFFFFFF must be mapped onto the much smaller set of cache locations 00000-FFFFF. Deciding how is the address-mapping problem.]
Direct Mapping
Block j of main memory maps onto block j modulo 128 of the cache
[Figure 5.15. Direct-mapped cache: main memory blocks 0, 128, 256, ... map onto cache block 0; blocks 1, 129, 257, ... onto cache block 1; and so on, up to main memory block 4095.]
Main memory address: Tag (5 bits) | Block (7 bits) | Word (4 bits)
Word (4 bits): selects one of 16 words (each block has 16 = 2^4 words).
Block (7 bits): points to a particular block in the cache (128 = 2^7 blocks).
Tag (5 bits): compared with the tag bits stored at that cache location, to identify which of the 32 possible memory blocks (4096/128 = 32) is resident.
Direct Mapping
[Worked example: the address is split into a 10-bit tag and a 20-bit cache address (16-bit data per entry). For address 000 00500, cache location 00500 holds tag 000 and data 01A6; the stored tag matches the address tag, so the access hits and returns 01A6. Other entries shown: location 00900 holds (080, 47CC), location 01400 holds (150, 0005).
What happens when the address is 100 00500? The same cache location 00500 is selected, but its stored tag 000 does not match 100: a miss.]
Direct Mapping with Blocks
[Worked example with a block size of 16 words: address 000 0050 0 is split into tag 000, block 0050, and word 0. One tag is now stored per block rather than per word: locations 00500 (data 01A6) and 00501 (0254) share tag 000; 00900 (47CC) and 00901 (A0B4) share tag 080; 01400 (0005) and 01401 (5C04) share tag 150. The 10-bit tag comparison works as before; on a match, the word bits select the word within the block.]
Direct Mapping
Main memory address: Tag (5 bits) | Block (7 bits) | Word (4 bits)
Example: 11101,1111111,1100
Tag: 11101
Block: 1111111 = 127, block 127 of the cache
Word: 1100 = 12, word 12 of block 127 in the cache
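Extracting the fields is a pair of shifts and masks. A C sketch that reproduces the example (0xEFFC is just the address 11101 1111111 1100 written in hex):

#include <stdio.h>

int main(void)
{
    unsigned addr  = 0xEFFC;               /* 11101 1111111 1100 in binary */
    unsigned word  =  addr        & 0xF;   /* low 4 bits  */
    unsigned block = (addr >> 4)  & 0x7F;  /* next 7 bits */
    unsigned tag   = (addr >> 11) & 0x1F;  /* top 5 bits  */
    printf("tag=%u block=%u word=%u\n", tag, block, word);
    /* Prints tag=29 block=127 word=12, matching the example above
     * (11101 = 29, 1111111 = 127, 1100 = 12). */
    return 0;
}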
Associative Mapping
[Figure 5.16. Associative-mapped cache: any main memory block (0 ... 4095) can be loaded into any cache block (0 ... 127); a 12-bit tag stored with each cache block identifies which memory block it holds.]
Main memory address: Tag (12 bits) | Word (4 bits)
Word (4 bits): selects one of 16 words (each block has 16 = 2^4 words).
Tag (12 bits): identifies which of the 4096 = 2^12 memory blocks is resident; the address tag is compared with the tags of all cache blocks.
Associative Memory
[Figure: the cache behaves as an associative memory that stores (address, data) pairs. Any main memory location (00000000 ... 3FFFFFFF), e.g. 00012000, 08000000, or 15000000, can occupy any cache location (00000 ... FFFFF); the address acts as the search key.]
Associative Mapping
[Worked example: the 30-bit address 00012000 is used as a key and compared against the stored keys of all entries at once (entries shown: 00012000 -> 01A6, 15000000 -> 0005, 08000000 -> 47CC); the matching entry returns data 01A6. A block can be placed in any location, so how many comparators are needed? One per cache entry, each comparing a 30-bit key (with 16-bit data per entry).]
Associative Mapping
Main memory address: Tag (12 bits) | Word (4 bits)
Example: 111011111111,1100
Tag: 111011111111
Word: 1100 = 12, word 12 of a block that may reside anywhere in the cache
Set-Associative Mapping
[Figure 5.17. Set-associative-mapped cache with two blocks per set: the 128 cache blocks form 64 sets (Set 0: blocks 0 and 1, Set 1: blocks 2 and 3, ..., Set 63: blocks 126 and 127). Main memory blocks 0, 64, 128, ... map onto set 0; blocks 1, 65, 129, ... onto set 1; and so on, up to main memory block 4095.]
Main memory address: Tag (6 bits) | Set (6 bits) | Word (4 bits)
Word (4 bits): selects one of 16 words (each block has 16 = 2^4 words).
Set (6 bits): points to a particular set in the cache (128/2 = 64 = 2^6 sets).
Tag (6 bits): compared with the tags of both blocks in the set to check whether the desired block is present (4096/64 = 2^6).
Set-Associative Mapping
[Worked example, 2-way set associative: address 000 00500 selects one set, which holds two (tag, data) pairs, here (000, 01A6) and (010, 0721). Both stored tags are compared with the address tag 000 in parallel; the first way matches, so the access hits and returns 01A6. Fields: a 20-bit address index, and for each way a 10-bit tag with 16-bit data.]
Set-Associative Mapping
Main memory address: Tag (6 bits) | Set (6 bits) | Word (4 bits)
Example: 111011,111111,1100
Tag: 111011
Set: 111111 = 63, set 63 of the cache
Word: 1100 = 12, word 12 of the matching block in set 63
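The same shift-and-mask idea with the set-associative field widths; again the constant is the example address written in hex:

#include <stdio.h>

int main(void)
{
    unsigned addr = 0xEFFC;               /* 111011 111111 1100 in binary */
    unsigned word =  addr        & 0xF;   /* low 4 bits  */
    unsigned set  = (addr >> 4)  & 0x3F;  /* next 6 bits */
    unsigned tag  = (addr >> 10) & 0x3F;  /* top 6 bits  */
    printf("tag=%u set=%u word=%u\n", tag, set, word);
    /* Prints tag=59 set=63 word=12; either of the two blocks in set 63
     * may hold the data, so both stored tags are compared against 59. */
    return 0;
}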
Replacement Algorithms
It is difficult to determine which blocks to kick out.
The Least Recently Used (LRU) block is a good candidate.
The cache controller tracks references to all blocks as computation proceeds, incrementing or clearing per-block counters as hits and misses occur.
Replacement Algorithms
For associative and set-associative caches:
Which location should be emptied when the cache is full and a miss occurs?
First In First Out (FIFO)
Least Recently Used (LRU)
A valid bit distinguishes an empty location from a full one.
Replacement Algorithms
FIFO, 4-block cache (contents shown after each reference):

CPU reference:  A     B     C     A     D     E     A     D     C     F
Hit/miss:       Miss  Miss  Miss  Hit   Miss  Miss  Miss  Hit   Hit   Miss

Cache:          A     A     A     A     A     E     E     E     E     E
                      B     B     B     B     B     A     A     A     A
                            C     C     C     C     C     C     C     F
                                        D     D     D     D     D     D

Hit Ratio = 3 / 10 = 0.3
Replacement Algorithms
LRU, 4-block cache (contents shown after each reference, most recently used first):

CPU reference:  A     B     C     A     D     E     A     D     C     F
Hit/miss:       Miss  Miss  Miss  Hit   Miss  Miss  Hit   Hit   Hit   Miss

Cache:          A     B     C     A     D     E     A     D     C     F
                      A     B     C     A     D     E     A     D     C
                            A     B     C     A     D     E     A     D
                                        B     C     C     C     E     A

Hit Ratio = 4 / 10 = 0.4
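Both policies are easy to simulate. A compact C sketch (assuming, as in the tables above, a 4-block fully associative cache) that replays the reference string A B C A D E A D C F and reproduces both hit ratios:

#include <stdio.h>
#include <string.h>

#define WAYS 4

/* Cache kept oldest/least-recently-used first. FIFO and LRU differ only
 * in whether a hit moves the block to the most-recent position. */
static int simulate(const char *refs, int lru)
{
    char cache[WAYS];
    int n = 0, hits = 0;
    for (int i = 0; refs[i]; i++) {
        int found = -1;
        for (int j = 0; j < n; j++)
            if (cache[j] == refs[i]) { found = j; break; }
        if (found >= 0) {
            hits++;
            if (lru) {                               /* LRU: refresh recency */
                char c = cache[found];
                memmove(cache + found, cache + found + 1, n - found - 1);
                cache[n - 1] = c;
            }                                        /* FIFO: hits don't reorder */
        } else if (n < WAYS) {
            cache[n++] = refs[i];                    /* fill an empty (invalid) slot */
        } else {
            memmove(cache, cache + 1, WAYS - 1);     /* evict oldest / least recent */
            cache[WAYS - 1] = refs[i];
        }
    }
    return hits;
}

int main(void)
{
    const char *refs = "ABCADEADCF";
    printf("FIFO: %d/10 hits\n", simulate(refs, 0));  /* 3/10 = 0.3 */
    printf("LRU:  %d/10 hits\n", simulate(refs, 1));  /* 4/10 = 0.4 */
    return 0;
}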