
Lecture № 7

CACHE MEMORY
The goal of the lecture: to analyze and study the principles of cache operation, the elements of cache design, the mapping function (direct, associative, and set-associative
techniques), and cache organization in PENTIUM & PowerPC processors.

Contents

1. Purpose and principles of work. Elements of Cache Design.


2. Mapping Function. Direct, associative and set associative techniques.
3. Cache organization in PENTIUM & PowerPC processors.

Literature.
1. W. Stallings. Computer Organization and Architecture, 10th ed. Available at: http://home.ustc.edu.cn/~leedsong/reference_books_tools/Computer%20Organization%20and%20Architecture%2010th%20-%20William%20Stallings.pdf.
2. Pustovarov V. I. Assembler: Programming and Analysis of Machine Program Correctness. Kiev: Irina, 2010. 476 pp.
3. A. Tanenbaum, T. Austin. Structured Computer Organization, 6th ed. St. Petersburg: Piter, 2016. 816 pp.
Keywords.
Locality of reference principle, cache design, mapping function, direct, associative,
and set-associative techniques, replacement algorithm, data integrity, out-of-order
execution.

The computer memory hierarchy is organized as follows: smaller, more expensive,
faster memories are supplemented by larger, cheaper, slower memories. The key to
the success of this organization is the decreasing frequency of access at each lower
level. The basis for the validity of that condition is a principle known as locality of
reference. During the course of execution of a program, memory references by the
processor, for both instructions and data, tend to cluster. Programs typically contain
a number of iterative loops and subroutines. Once a loop or subroutine is entered,
there are repeated references to a small set of instructions. Similarly, operations on
tables and arrays involve access to a clustered set of data words. Over a long period
of time, the clusters in use change, but over a short period of time, the processor is
primarily working with fixed clusters of memory references. Accordingly, it is
possible to organize data across the hierarchy such that the percentage of accesses
to each successively lower level is substantially less than that of the level above.

Purpose and principles of Cache memory work.


Cache memory is intended to give memory speed approaching that of the fastest
memories available, and at the same time to provide a large memory at the price of
less expensive types of semiconductor memory.
In other words, cache memory combines the access time of expensive, high-speed
memory with the large capacity of less expensive, lower-speed memory. The
concept is illustrated in Figures 1a and 1b. There is a
relatively large and slow main memory together with a smaller, faster cache
memory. The cache contains a copy of portions of main memory. When the
processor attempts to read a word of memory, a check is made to determine if the
word is in the cache. If so, the word is delivered to the processor. If not, a block of
main memory, consisting of some fixed number of words, is read into the cache
and then the word is delivered to the processor. Because of the phenomenon of
locality of reference, when a block of data is fetched into the cache to satisfy a
single memory reference, it is likely that there will be future references to that
same memory location or to other words in the block.

Figure 1a. Typical Cache organization.

Figure 1b. Cache and Main Memory (words are transferred between the CPU and the fast cache; blocks are transferred between the cache and the slow main memory).

The effectiveness of the cache mechanism is based on a property of


computer programs called locality of reference. Analysis of programs
shows that most of their execution time is spent on routines in which
many instructions are executed repeatedly. These instructions may
constitute a simple loop, nested loops, or a few procedures that
repeatedly call each other. The actual detailed pattern of instruction
sequencing is not important – the point is that many instructions in
localized areas of the program are executed repeatedly during some time
period, and the remainder of the program is accessed relatively
infrequently. This is referred to as locality of reference. It manifests
itself in two ways: temporal and spatial. The first means that a recently
executed instruction is likely to be executed again very soon. The spatial
aspect means that instructions in close proximity to a recently executed
instruction (with respect to the instructions’ addresses) are also likely to
be executed soon.
If the active segments of a program can be placed in a fast cache
memory, then the total execution time can be reduced significantly.
Conceptually, operation of a cache memory is very simple. The memory
control circuitry is designed to take advantage of the property of locality
of reference. The temporal aspect of the locality of reference suggests
that whenever an information item (instruction or data) is first needed,
this item should be brought into the cache where it will hopefully remain
until it is needed again. The spatial aspect suggests that instead of
fetching just one item from the main memory to the cache, it is useful to
fetch several items that reside at adjacent addresses as well. We will use
the term block to refer to a set of contiguous address locations of some
size. Another term that is often used to refer to a cache block is cache
line.
Principles of work:

 Small amount of fast memory
 Sits between normal main memory and CPU (off-chip cache)
 May be located on the CPU chip or module (on-chip cache)
 CPU requests the contents of a memory location
 The cache is checked for this data
 If present, the word is delivered from the cache (fast); this is called a hit
 If not present, the required block is read from main memory into the cache; this is called a miss
 Then the word is delivered from the cache to the CPU. The cache includes tags to identify which block of main memory is in each cache slot

Cache efficiency is characterized by the hit ratio.

The hit ratio is the ratio of the number of cache hits to
the total number of the CPU's accesses to memory.
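
A minimal sketch of how the hit ratio could be computed from hit/miss counters (the counter names are illustrative, not part of the lecture):

/* Counters that a hypothetical cache simulator would update on every access. */
unsigned long hits = 0, misses = 0;

double hit_ratio(void)
{
    unsigned long accesses = hits + misses;          /* total CPU accesses to memory */
    return accesses ? (double)hits / accesses : 0.0;
}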

Cache/Main Memory Structure


Figure 2. Cache/Main Memory Structure

Cache Design.

Cache Size

Mapping Function
 Direct
 Associative
 Set associative

Replacement Algorithm
 Least recently used (LRU)
 First in first out (FIFO)
 Least frequently used (LFU)
 Random

Write Policy
 Write through
 Write back

Line Size

Number of Caches
 Single or two level
 Unified or split
Analysis of Cache Design Elements.

Size does matter.


 Cost
 More cache is expensive
 Speed
 A large cache is slightly slower than a small one
 Checking the cache for data takes time

Mapping Function
 Cache of 64 KBytes
 Cache slot (line) of 4 bytes
 i.e., the cache holds 16K (2^14) lines (slots) of 4 bytes each
 16 MBytes of main memory
 24-bit address (2^24 = 16M)
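
For reference, these figures fit together as follows (a worked restatement of the numbers above):

64 KBytes / 4 bytes per line = 16K = 2^14 cache lines
16 MBytes / 4 bytes per block = 4M = 2^22 blocks of main memory
24-bit address = 22-bit block identifier + 2-bit word field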

Direct Mapping
 Each block of main memory maps to only one cache line
 i.e., if a block is in the cache, it must be in one specific place
 The address is interpreted in two parts
 The least significant w bits identify a unique word within a block
 The most significant s bits specify one memory block
 The MSBs are further split into a cache line field of r bits and a tag of s-r bits
(most significant)

Direct Mapping Address Structure

Tag (s-r): 8 bits | Line or slot (r): 14 bits | Word (w): 2 bits

 24-bit address
 The low-order 2 bits select one of 4 words in the 4-byte block
 22-bit block identifier
 8-bit tag (= 22-14): the high-order 8 bits of the memory address of the block are stored in the 8 tag bits associated with its location in the cache
 14-bit slot or line field (determines the position of the block within the cache)
 No two blocks that map to the same line have the same tag field
 Check the contents of the cache by finding the line and comparing the tag (see the sketch below)
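
A minimal C sketch of how the tag, line, and word fields could be extracted from a 24-bit address under the field widths above (the address value and names are illustrative only):

#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 2    /* w: selects one of 4 words in the 4-byte block */
#define LINE_BITS 14   /* r: selects one of 16K cache lines             */

int main(void)
{
    uint32_t addr = 0x16339C;                                       /* example 24-bit address */
    uint32_t word = addr & ((1u << WORD_BITS) - 1);                 /* low-order 2 bits       */
    uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);  /* next 14 bits           */
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);                /* high-order 8 bits      */
    printf("tag=%02X line=%04X word=%X\n", tag, line, word);
    return 0;
}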

Direct Mapping Cache Line Table

 Cache line 0 holds main memory blocks 0, m, 2m, ..., 2^s - m
 Cache line 1 holds main memory blocks 1, m+1, 2m+1, ..., 2^s - m + 1
 ...
 Cache line m-1 holds main memory blocks m-1, 2m-1, 3m-1, ..., 2^s - 1

(Here m = 2^r is the number of cache lines and 2^s is the number of blocks of main memory.)

Direct Mapping: advantages & disadvantages

 Simple
 Inexpensive
 Fixed location for a given block
o If a program repeatedly accesses 2 blocks that map to the same line, the cache
miss rate is very high

Associative Mapping
 A main memory block can be loaded into any line of the cache
 The memory address is interpreted as a tag and a word field
 The tag uniquely identifies a block of memory
 Every line's tag is examined for a match
 Cache searching gets expensive

Associative Mapping Address Structure

Tag: 22 bits | Word: 2 bits

 A 22-bit tag is stored with each 32-bit block of data
 The tag field of the address is compared with each tag entry in the cache to check for a hit
 The least significant 2 bits of the address identify which word is required from the 32-bit (4-byte) data block
 e.g.,
o Address FFFFFC, Tag 3FFFFF, Data 24682468, Cache line 3FFF (illustrated in the sketch below)
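
A minimal C sketch of the associative address split for the example above (all names are illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t addr = 0xFFFFFC;        /* example 24-bit address from the lecture     */
    uint32_t word = addr & 0x3;      /* least significant 2 bits: word within block */
    uint32_t tag  = addr >> 2;       /* remaining 22 bits form the tag              */
    printf("tag=%06X word=%X\n", tag, word);  /* prints tag=3FFFFF word=0 */
    return 0;
}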
Set Associative Mapping
 Cache is divided into a number of sets
 Each set contains a number of lines
 A given block maps to any line in a given set
 e.g., Block B can be in any line of set i
 e.g., 2 lines per set
 2-way associative mapping
 A given block can be in one of 2 lines in only one set

Set Associative Mapping Address Structure

Tag: 9 bits | Set: 13 bits | Word: 2 bits
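
A minimal C sketch of a lookup in a two-way set-associative cache matching the address structure above (9-bit tag, 13-bit set, 2-bit word); the structures and names are illustrative, not from the lecture:

#include <stdint.h>
#include <stdbool.h>

#define WORD_BITS 2
#define SET_BITS  13
#define NUM_SETS  (1u << SET_BITS)   /* 8K sets, two lines (ways) per set */

struct line { bool valid; uint16_t tag; uint8_t data[4]; };
struct set  { struct line way[2]; };
static struct set cache[NUM_SETS];

/* Returns true on a hit and copies the requested word into *out. */
bool lookup(uint32_t addr, uint8_t *out)
{
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t set  = (addr >> WORD_BITS) & ((1u << SET_BITS) - 1);
    uint16_t tag  = (uint16_t)(addr >> (WORD_BITS + SET_BITS));
    for (int w = 0; w < 2; w++) {                     /* check both lines of the set */
        struct line *l = &cache[set].way[w];
        if (l->valid && l->tag == tag) { *out = l->data[word]; return true; }
    }
    return false;                                     /* miss: fetch the block from main memory */
}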

Replacement Algorithms
Replacement algorithms are only needed for the associative and set-associative
techniques.
1. Least Recently Used (LRU) – replace the cache line that has been in the cache
the longest with no references to it.
2. First-In First-Out (FIFO) – replace the cache line that has been in the cache
the longest.
3. Least Frequently Used (LFU) – replace the cache line that has experienced the
fewest references.
4. Random – pick a line at random from the candidate lines.

Note 1: LRU is probably the most effective (a sketch for the two-way case is given below).

Note 2: Simulations have shown that random replacement is only slightly inferior to
algorithms based on usage.
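
For a two-way set-associative cache, LRU can be implemented with a single bit per set that records which way was used last; a minimal sketch assuming the two-way structure used earlier (names are illustrative):

#include <stdint.h>

#define NUM_SETS (1u << 13)
static uint8_t lru_way[NUM_SETS];   /* per set: the way to evict next (least recently used) */

/* Call on every hit or fill so that the bit tracks the last-used way. */
void touch(uint32_t set, uint8_t used_way)
{
    lru_way[set] = used_way ^ 1u;   /* the other way becomes the next victim */
}

/* Pick the victim line when both ways of the set are occupied. */
uint8_t victim(uint32_t set)
{
    return lru_way[set];
}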

Write Policy
If a cache line has not been modified, then it can be overwritten immediately;
however, if one or more words have been written to a cache line, then main
memory must be updated before replacing the cache line.
There are two main potential write problems:
• If an I/O module can read and write main memory directly and a cache line has
been modified, then a word read by the I/O module from main memory may be
stale; conversely, if the I/O module writes to main memory, the corresponding
cache line becomes invalid.
• If multiple processors each have their own cache and one processor modifies its
cache, then the corresponding cache lines of the other processors may become invalid.
1. write through – this is the simplest technique, where all write operations are
made to main memory as well as to the cache, ensuring that main memory is always
valid. This generates a lot of main memory traffic and creates a potential bottleneck;
2. write back – updates are made only to the cache and not to main memory until
the line is replaced.

Cache coherency – keeps the same word in other caches up to date using some
technique. This is an active field of research.

 Must not overwrite a cache block unless main memory is up to date


 Multiple CPUs may have individual caches
 I/O may address main memory directly

Write through
 All writes go to main memory as well as cache
 Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache
up to date
 Lots of traffic
 Slows down writes

Write back
 Updates initially made in cache only
 Update bit for cache slot is set when update occurs
 If block is to be replaced, write to main memory only if update bit is set
 Other caches get out of sync
 I/O must access main memory through cache
 15% of memory references are writes
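
A minimal C sketch contrasting the write-through and write-back policies described above, using an update (dirty) bit per line; the structures and names are illustrative, and main memory is simulated by an array:

#include <stdint.h>
#include <stdbool.h>

static uint8_t main_memory[1u << 24];   /* simulated 16-MByte main memory */

struct cache_line { bool valid; bool dirty; uint32_t tag; uint8_t data[4]; };

/* Write-through: every store updates both the cache and main memory,   */
/* so main memory is always valid, at the cost of extra memory traffic. */
void store_write_through(struct cache_line *l, uint32_t addr, uint8_t value)
{
    l->data[addr & 0x3] = value;
    main_memory[addr] = value;
}

/* Write-back: the store updates only the cache and sets the update     */
/* (dirty) bit; main memory is written only when the line is replaced.  */
void store_write_back(struct cache_line *l, uint32_t addr, uint8_t value)
{
    l->data[addr & 0x3] = value;
    l->dirty = true;
}

void replace_line(struct cache_line *l, uint32_t block_addr)
{
    if (l->dirty) {                      /* write back only if the update bit is set */
        for (uint32_t i = 0; i < 4; i++)
            main_memory[block_addr + i] = l->data[i];
        l->dirty = false;
    }
    l->valid = false;                    /* the line is now free for the incoming block */
}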
Line Size
Cache line sizes between 8 and 64 bytes seem to produce optimum results.

Number of Caches
An on-chip cache reduces the processor's external bus activity. Further, an off-chip cache
is usually desirable. This is the typical level 1 (L1) and level 2 (L2) cache design where
the L2 cache is composed of static RAM. As chip densities have increased, the L2 cache
has been moved onto the on-chip area and an additional L3 cache has been added.

 On-chip cache (L1)


 Reduces the processor’s external bus activity
 Speeds up execution times and increases overall system performance
 External cache (L2)
 If an L2 SRAM cache is used, then information that misses in the L1 cache can frequently be
retrieved quickly from the L2 cache.
 The data can be accessed using the fastest type of bus transfer.
 Contemporary designs include both L1 and L2 caches
 The potential savings due to the use of an L2 cache depend on the hit rates of both the L1
and L2 caches

Unified vs Split Caches


Recent cache designs have gone from a unified cache to a split cache design (one
for instructions and one for data).
Unified caches have the following advantages:
1. unified caches typically have a higher hit rate;
2. only one cache is designed and implemented.
Split caches have the following advantages:
 parallel instruction execution and prefetching are better handled because
contention between the instruction fetch/decode unit and the execution unit is
eliminated

Pentium 4 Cache Organization.


The evolution of cache organization is seen clearly in the evolution of Intel
microprocessors. The 80386 does not include an on-chip cache. The 80486
includes a single on-chip cache of 8 kB, using a line size of 16 bytes and a four-
way set-associative organization. All of the Pentium processors include two on-chip
L1 caches, one for data and one for instructions. For the Pentium 4, the L1 data
cache is 16 kB, using a line size of 64 bytes and a four-way set-associative
organization. The Pentium 4 instruction cache is described subsequently. The
Pentium 4 also includes an L2 cache that feeds both of the L1 caches. The L2
cache is eight-way set-associative with a size of 512 kB and a line size of 128 bytes.
An L3 cache was added for the Pentium III and became on-chip with high-end
versions of the Pentium 4.
Figure 3. provides a simplified view of the Pentium 4 organization, highlighting
the placement of the three caches. The processor core consists of four major
components:
■ Fetch/decode unit: Fetches program instructions in order from the L2 cache,
decodes these into a series of micro-operations, and stores the results in the L1
instruction cache.
■ Out-of-order execution logic: Schedules execution of the micro-operations
subject to data dependencies and resource availability; thus, micro-operations may
be scheduled for execution in a different order than they were fetched from the
instruction stream. As time permits, this unit schedules speculative execution of
micro-operations that may be required in the future.
■ Execution units: These units execute micro-operations, fetching the required data
from the L1 data cache and temporarily storing results in registers.
■ Memory subsystem: This unit includes the L2 and L3 caches and the system bus,
which is used to access main memory when the L1 and L2 caches have a cache
miss and to access the system I/O resources.
Unlike the organization used in all previous Pentium models, and in most other
processors, the Pentium 4 instruction cache sits between the instruction decode
logic and the execution core. The reasoning behind this design decision is as
follows: the Pentium processor decodes, or translates, Pentium machine instructions
into simple RISC-like instructions called micro-operations. The use of simple,
fixed-length micro-operations enables the use of superscalar pipelining and
scheduling techniques that enhance performance. However, the Pentium machine
instructions are cumbersome to decode; they have a variable number of bytes and
many different options. It turns out that performance is enhanced if this decoding is
done independently of the scheduling and pipelining logic.
The data cache employs a write-back policy: Data are written to main memory
only when they are removed from the cache and there has been an update. The
Pentium 4 processor can be dynamically configured to support write-through
caching.
The L1 data cache is controlled by two bits in one of the control registers, labeled
the CD (cache disable) and NW (not write-through) bits. There are also two
Pentium 4 instructions that can be used to control the data cache: INVD invalidates
(flushes) the internal cache memory and signals the external cache (if any) to
invalidate. WBINVD writes back and invalidates internal cache and then writes
back and invalidates external cache. Both the L2 and L3 caches are eight-way set-
associative with a line size of 128 bytes.
Figure 3. Pentium 4 Block diagram.

Data Cache Consistency


To provide cache consistency, the data cache supports the MESI protocol
(modified/exclusive/shared/invalid).
The data cache includes two status bits per tag, so each line can be in one of four
states (a small encoding sketch is given after the list):
 Modified: The line in the cache has been modified and differs from that in
main memory, so it is available only in this cache.
 Exclusive: The line in the cache is the same as that in main memory and
is not present in any other cache.
 Shared: The line in the cache is the same as that in main memory and may
be present in another cache.
 Invalid: The line in the cache does not contain valid data.
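
The two status bits per line can encode the four MESI states; a minimal illustrative sketch in C (the particular bit encoding and structure are assumptions, not the actual Pentium 4 implementation):

#include <stdint.h>

/* Four MESI states encoded in two status bits per cache line. */
enum mesi_state {
    MESI_INVALID   = 0x0,  /* line holds no valid data                     */
    MESI_SHARED    = 0x1,  /* same as main memory, may be in other caches  */
    MESI_EXCLUSIVE = 0x2,  /* same as main memory, in no other cache       */
    MESI_MODIFIED  = 0x3   /* differs from main memory, only in this cache */
};

struct dcache_line {
    uint32_t tag;
    enum mesi_state state;   /* kept in the two status bits per tag        */
    uint8_t  data[64];       /* the lecture gives a 64-byte L1 line size   */
};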

Problems.
1. What is the main purpose of Cache Memory implementation?
2. Describe principles of Cache Memory work.
3. Enumerate elements of Cache Design.
4. Analyze the block-diagram of Pentium 4 processor.
5. How is data cache consistency ensured?
