DDCO_M5_Notes_sessionwise
Module 5
Basic Processing Unit: Some Fundamental Concepts: Register Transfers, Performing ALU
operations, fetching a word from Memory, Storing a word in memory. Execution of a Complete
Instruction. Pipelining: Basic concepts, Role of Cache memory, Pipeline Performance.
• The processor's control circuitry is responsible for:
→ issuing the signals that control the operation of all the units inside the processor (and for interacting with the memory bus).
→ implementing the actions specified by the instruction (loaded in the IR).
• Registers R0 through R(n-1) are provided for general purpose use by the programmer.
• Three registers Y, Z & TEMP are used by the processor for temporary storage during execution of some instructions. These are transparent to the programmer, i.e. the programmer need not be concerned with them because they are never referenced explicitly by any instruction.
• MUX (Multiplexer) selects either
→ output of Y, or
→ the constant value 4 (used to increment the PC)
as input A of the ALU.
• As instruction execution progresses, data are transferred from one register to another,
often passing through ALU to perform arithmetic or logic operation.
Session 2
• Instruction execution involves a sequence of steps in which data are transferred from one
register to another.
• Input & output of register Ri are connected to the bus via switches controlled by 2 control-signals: Riin & Riout. These are called gating signals.
• When edge-triggered flip-flops are not used, 2 or more clock-signals may be needed to guarantee proper transfer of data. This is known as multiphase clocking.
• Input & output gating for one register bit can be implemented using a D flip-flop:
• When Riin=1, mux selects data on bus. This data will be loaded into flip-flop at rising edge of clock.
When Riin=0, mux feeds back the value currently stored in flip-flop.
• Q output of flip-flop is connected to bus via a tri-state gate. The gate is enabled when Riout=1, placing the value of Q on the bus; otherwise its output is in the high-impedance state.
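• As an illustration (not part of the original notes), here is a minimal Python sketch of the gating idea: a shared bus where a register drives the bus only when its Riout signal is 1, and captures the bus value at the clock edge only when its Riin signal is 1. The class and helper names are invented for this sketch.

```python
class Register:
    """One word-wide register with gating signals Rin and Rout (hypothetical model)."""
    def __init__(self, name, value=0):
        self.name = name
        self.value = value          # currently stored value (the flip-flop contents)
        self.r_in = 0               # gating signal Riin
        self.r_out = 0              # gating signal Riout (enables the tri-state driver)

def bus_value(registers):
    """The bus carries the value of the single register whose Rout signal is 1."""
    drivers = [r for r in registers if r.r_out == 1]
    assert len(drivers) <= 1, "only one register may drive the bus at a time"
    return drivers[0].value if drivers else None   # None models the high-impedance state

def clock_edge(registers):
    """At the rising edge, every register with Rin = 1 loads the current bus value."""
    data = bus_value(registers)
    for r in registers:
        if r.r_in == 1 and data is not None:
            r.value = data

# Example: transfer R1 -> R4 by asserting R1out and R4in for one clock cycle.
R1, R4 = Register("R1", 25), Register("R4")
R1.r_out, R4.r_in = 1, 1
clock_edge([R1, R4])
R1.r_out, R4.r_in = 0, 0
print(R4.value)   # 25
```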
Session 3
The ALU performs arithmetic operations on the 2 operands applied to its A and B inputs. One of the operands is output of MUX & the other operand is obtained directly from the bus.
• The signals are activated for the duration of the clock cycle corresponding to that step.
All other signals are inactive.
Write the complete control sequence for the instruction: Move (Rs),Rd
• This instruction copies the contents of the memory-location pointed to by Rs into Rd. This is a memory read operation. This requires the following actions:
→ fetch the operand (i.e. the contents of the memory-location pointed to by Rs).
→ transfer the data to Rd.
• The control-sequence is written as follows:
1) PCout, MARin, Read, Select4, Add, Zin
2) Zout, PCin, Yin, WMFC
3) MDRout, IRin
4) Rsout, MARin, Read
5) MDRinE, WMFC
6) MDRout, Rdin, End
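• To make the step/signal notation concrete, here is a small Python sketch (my own illustration, not from the notes) that stores the above control sequence as a list of signal sets and prints, step by step, which gating and function signals are active. Signal names follow the notes; the interpreter itself is hypothetical.

```python
# Control sequence for Move (Rs),Rd, written as one set of active signals per step.
# All other signals are assumed inactive during that step (as stated in the notes).
MOVE_RS_RD = [
    {"PCout", "MARin", "Read", "Select4", "Add", "Zin"},   # fetch: address to MAR, PC+4 into Z
    {"Zout", "PCin", "Yin", "WMFC"},                       # update PC, wait for memory
    {"MDRout", "IRin"},                                    # fetched word goes to IR
    {"Rsout", "MARin", "Read"},                            # operand address to MAR
    {"MDRinE", "WMFC"},                                    # operand arrives in MDR
    {"MDRout", "Rdin", "End"},                             # operand copied into Rd
]

def run(sequence):
    for step, signals in enumerate(sequence, start=1):
        print(f"step {step}: " + ", ".join(sorted(signals)))
        if "End" in signals:
            print("End signal issued: control returns to step 1 for the next instruction")

run(MOVE_RS_RD)
```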
• When the requested data are received from memory, they are stored in MDR. From MDR, they are transferred to other registers.
• MFC (Memory Function Completed): The addressed device sets MFC to 1 to indicate that the contents of the specified location have been read and are available on the data lines of the memory bus.
• e.g. Fetching a word of memory into R2, whose address is in R1 (Move (R1),R2):
1) R1out, MARin, Read   ; desired address is loaded into MAR & Read command is issued
2) MDRinE, WMFC         ; load MDR from memory bus & wait for MFC response from memory
3) MDRout, R2in         ; load R2 from MDR
where WMFC = control signal that causes the processor's control circuitry to wait for the arrival of the MFC signal.
• Consider the instruction Add (R3),R1, which adds the contents of the memory-location pointed to by R3 to register R1. Executing this instruction requires the following actions:
→ fetch the instruction.
→ fetch the first operand (the contents of the memory-location pointed to by R3).
→ perform the addition.
→ load the result into R1.
Step1--> The instruction is fetched and the PC is incremented: Select4 causes the MUX to select constant 4. This value is added to the operand at input B (the PC's contents), and the result is stored in Z.
Step2--> Updated value in Z is moved to PC.
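• For completeness, the sketch below lists the usual seven-step control sequence for Add (R3),R1 on this single-bus organization, reconstructed from the standard textbook treatment; treat it as a reference sketch rather than part of the notes.

```python
# Seven-step control sequence for Add (R3),R1 on the single-bus datapath
# (reconstructed from the standard textbook sequence; not copied from these notes).
ADD_R3_R1 = [
    ("PCout, MARin, Read, Select4, Add, Zin", "fetch instruction, compute PC+4 in Z"),
    ("Zout, PCin, Yin, WMFC",                 "update PC, wait for memory"),
    ("MDRout, IRin",                          "instruction word into IR"),
    ("R3out, MARin, Read",                    "operand address from R3 into MAR"),
    ("R1out, Yin, WMFC",                      "move R1 into Y while waiting for the operand"),
    ("MDRout, SelectY, Add, Zin",             "add operand (bus) to Y (R1), result into Z"),
    ("Zout, R1in, End",                       "store the sum back into R1"),
]

for step, (signals, comment) in enumerate(ADD_R3_R1, start=1):
    print(f"{step}) {signals:38s} ; {comment}")
```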
Session 5
5.6 Branching Instructions
• Since the updated value of PC is already available in register Y, the offset X is gated onto the
bus, and an addition operation is performed.
• In step 5, the result, which is the branch-address, is loaded into the PC.
• The offset X used in a branch instruction is usually the difference between the branch target-address and the address immediately following the branch instruction. (For example, if the branch instruction is at location 1000 and the branch target-address is 1200, then the value of X must be 196, since the PC will contain the address 1004 after fetching the instruction at location 1000.)
• In case of conditional branch, we need to check the status of the condition-codes before
loading a new value into the PC.
e.g.: Offset-field-of-IRout, Add, Zin, If N=0 then End
If N=0, the processor returns to step 1 immediately after step 4. If N=1, step 5 is performed to load a new value into the PC.
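• As a quick check of the offset arithmetic in the example above (my own illustration), the following snippet computes the offset X and the branch target for a 4-byte instruction at address 1000 with target 1200.

```python
# Offset X = branch target address - address following the branch instruction.
# Assumes 4-byte instructions, as in the notes' example (PC = 1004 after the fetch).
INSTRUCTION_SIZE = 4

def branch_offset(branch_addr, target_addr):
    pc_after_fetch = branch_addr + INSTRUCTION_SIZE
    return target_addr - pc_after_fetch

def branch_target(branch_addr, offset_x, n_flag):
    """Return the next PC value; the branch is taken only when N = 1."""
    pc_after_fetch = branch_addr + INSTRUCTION_SIZE
    return pc_after_fetch + offset_x if n_flag == 1 else pc_after_fetch

print(branch_offset(1000, 1200))        # 196, matching the worked example
print(branch_target(1000, 196, 1))      # 1200 (branch taken)
print(branch_target(1000, 196, 0))      # 1004 (branch not taken)
```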
Session 6
5.7 Pipelining:
The basic building blocks of a computer are introduced in preceding chapters. In this chapter, we discuss in detail the concept of pipelining, which is used in modern computers to achieve high performance. We begin by explaining the basics of pipelining and how it can lead to improved performance. Then we examine machine instruction features that facilitate pipelined execution, and we show that the choice of instructions and instruction sequencing can have a significant effect on performance. Pipelined organization requires sophisticated compilation techniques, and optimizing compilers have been developed for this purpose. Among other things, such compilers rearrange the sequence of operations to maximize the benefits of pipelined execution.
5.71 BASIC CONCEPTS
The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to build the processor and the main memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed. We have encountered concurrent activities several times before. Chapter 1 introduced the concept of multiprogramming and explained how it is possible for I/O transfers and computational activities to proceed simultaneously. DMA devices make this possible because they can perform I/O transfers independently once these transfers are initiated by the processor.
Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly-line operation. Readers are undoubtedly familiar with the assembly line used in car manufacturing. The first station in an assembly line may prepare the chassis of a car, the next station adds the body, the next one installs the engine, and so on. While one group of workers is installing the engine on one car, another group is fitting a car body on the chassis of another car, and yet another group is preparing a new chassis for a third car. It may take days to complete work on a given car, but it is possible to have a new car rolling off the end of the assembly line every few minutes.
Consider how the idea of pipelining can be used in a computer. The processor executes a program by fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 8.1a.
Now consider a computer that has two separate hardware units, one for fetching instructions and another for executing them, as shown in Figure 8.1b.
The instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1.
This buffer is needed to enable the execution unit to execute the instruction while the fetch
unit is fetching the next instruction. The results of execution are deposited in the destination
location specified by the instruction. For the purposes of this discussion, we assume that
both the source and the destination of the data operated on by the instructions are inside the
block labelled Execution unit.
The computer is controlled by a clock whose period is such that the fetch and execute steps of
any instruction can each be completed in one clock cycle. Operation of the computer
proceeds as in Figure 8.1c. In the first clock cycle, the fetch unit fetches an instruction I1
(step F1) and stores it in buffer B1 at the end of the clock cycle. In the second clock cycle,
the instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2).
Meanwhile, the execution unit performs the operation specified by instruction I1, which
is available to it in buffer B1 (step E1). By the end of the second clock cycle, the execution
of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 8.1c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 8.1a.
In summary, the fetch and execute units in Figure 8.1b constitute a two-stage pipeline in which each stage performs one step in processing an instruction. An inter-stage storage buffer, B1, is needed to hold the information being passed from one stage to the next. New information is loaded into this buffer at the end of each clock cycle.
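As a rough back-of-the-envelope check of the "twice the completion rate" claim (my own sketch, not from the notes), the snippet below counts the clock cycles needed to run n instructions with purely sequential fetch/execute versus the two-stage overlap of Figure 8.1c.

```python
def sequential_cycles(n):
    # Each instruction needs one fetch cycle and one execute cycle, back to back.
    return 2 * n

def two_stage_pipelined_cycles(n):
    # Fetch of I1 takes cycle 1; thereafter one instruction completes per cycle,
    # so the last execute finishes in cycle n + 1.
    return n + 1

for n in (4, 100, 10_000):
    speedup = sequential_cycles(n) / two_stage_pipelined_cycles(n)
    print(f"n={n}: sequential={sequential_cycles(n)}, "
          f"pipelined={two_stage_pipelined_cycles(n)}, speedup={speedup:.2f}")
```

For large n the speedup approaches 2, the ideal value for a two-stage pipeline.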
The processing of an instruction need not be divided into only two steps. For example, a
pipelined processor may process each instruction in four steps, as follows:
F Fetch: read the instruction from the memory.
D Decode: decode the instruction and fetch the source operand(s).
E Execute: perform the operation specified by the instruction.
W Write: store the result in the destination location.
The sequence of events for this case is shown in Figure 8.2a. Four instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown in Figure 8.2b. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer. As an instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along. For example, during clock cycle 4, the information in the buffers is as follows:
• Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be performed. This is the information produced by the decoding hardware in cycle 3. The buffer also holds the information needed for the Write step of instruction I2 (step W2).
• Buffer B3 holds the results produced by the execution unit and the destination
information for instruction I1.
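The following Python sketch (an illustration I added, with invented function names) generates the kind of timing diagram shown in Figure 8.2a for an ideal four-stage pipeline: each row is an instruction, each column a clock cycle, and the entry shows which step (F, D, E, W) the instruction is in.

```python
STAGES = ["F", "D", "E", "W"]

def ideal_pipeline_timetable(num_instructions):
    """Return rows of (instruction, {cycle: stage}) for an ideal 4-stage pipeline."""
    table = []
    for i in range(num_instructions):
        # Instruction I(i+1) enters stage k (0-based) in cycle i + k + 1.
        row = {i + k + 1: stage for k, stage in enumerate(STAGES)}
        table.append((f"I{i + 1}", row))
    return table

def print_timetable(table):
    last_cycle = max(max(row) for _, row in table)
    print("      " + "".join(f"{c:>4}" for c in range(1, last_cycle + 1)))
    for name, row in table:
        cells = "".join(f"{row.get(c, ''):>4}" for c in range(1, last_cycle + 1))
        print(f"{name:<6}{cells}")

print_timetable(ideal_pipeline_timetable(4))
```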
Session 7
5.72 ROLE OF CACHE MEMORY:
Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence, the clock period should be sufficiently long to complete the task being performed in any stage. If different units require different amounts of time, the clock period must allow the longest task to be completed. A unit that completes its task early is idle for the remainder of the clock period. Hence, pipelining is most effective in improving performance if the tasks being performed in different stages require about the same amount of time.
This consideration is particularly important for the instruction fetch step, which is assigned one clock period in Figure 8.2a. The clock cycle has to be equal to or greater than the time needed to complete a fetch operation. However, the access time of the main memory may be as much as ten times greater than the time needed to perform basic pipeline stage operations inside the processor, such as adding two numbers. Thus, if each instruction fetch required access to the main memory, pipelining would be of little value.
The use of cache memories solves the memory access problem. In particular, when a cache is included on the same chip as the processor, access time to the cache is usually the same as the time needed to perform other basic operations inside the processor. This makes it possible to divide instruction fetching and processing into steps that are more or less equal in duration. Each of these steps is performed by a different pipeline stage, and the clock period is chosen to correspond to the longest one.
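To illustrate why the longest stage fixes the clock period (a sketch with made-up stage delays, not figures from the notes), the snippet below compares the resulting clock with a slow main-memory fetch and with an on-chip cache fetch.

```python
def pipeline_clock_period(stage_delays_ns):
    # Every stage gets the same clock period, so the slowest stage sets it.
    return max(stage_delays_ns)

# Hypothetical delays (ns) for the F, D, E, W stages.
with_main_memory_fetch = {"F": 10.0, "D": 1.0, "E": 1.0, "W": 1.0}
with_cache_fetch       = {"F": 1.0,  "D": 1.0, "E": 1.0, "W": 1.0}

for label, delays in [("main memory fetch", with_main_memory_fetch),
                      ("on-chip cache fetch", with_cache_fetch)]:
    period = pipeline_clock_period(delays.values())
    print(f"{label}: clock period = {period} ns, "
          f"ideal throughput = {1000 / period:.0f} million instructions/s")
```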
Session 8
5.73 PIPELINE PERFORMANCE:
The pipelined processor in Figure 8.2 completes the processing of one instruction in each
clock cycle, which means that the rate of instruction processing is four times that of
sequential operation. The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages. However, this increase would be achieved only if pipelined operation as depicted in Figure 8.2a could be sustained without interruption throughout program execution. Unfortunately, this is not the case. For a variety of reasons, one of the pipeline stages may not be able to complete its processing task for a given instruction in the time allotted. For example, stage
E in the four-stage pipeline of Figure 8.2b is responsible for arithmetic and logic operations, and
one clock cycle is assigned for this task. Although this may be sufficient for most operations,
some operations, such as divide, may require more time to complete. Figure 8.3 shows an example in which the operation specified in instruction I2 requires three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the Write stage must be told to do nothing, because it has no data to work with. Meanwhile, the information in buffer B2 must remain intact until the Execute stage has completed its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten. Thus, steps D4 and F5 must be postponed as shown.
Pipelined operation in Figure 8.3 is said to have been stalled for two clock cycles. Normal pipelined operation resumes in cycle 7. Any condition that causes the pipeline to stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed, and the pipeline stalls.
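The sketch below (my own illustration, under a simple in-order model where every extra Execute cycle stalls all later instructions by one cycle) shows how a case like Figure 8.3, where instruction I2's Execute step takes three cycles, adds two stall cycles to the total execution time.

```python
def total_cycles(num_instructions, num_stages, execute_latencies=None):
    """Cycles to complete all instructions in a simple in-order pipeline.
    Each Execute cycle beyond the first stalls the whole pipeline by one cycle."""
    execute_latencies = execute_latencies or {}
    stall_cycles = sum(latency - 1 for latency in execute_latencies.values())
    return num_instructions + num_stages - 1 + stall_cycles

# Ideal case: 4 instructions, 4 stages -> 7 cycles (as in Figure 8.2a).
print(total_cycles(4, 4))
# Figure 8.3-style case: I2's Execute takes 3 cycles -> 2 stall cycles, 9 cycles total.
print(total_cycles(4, 4, execute_latencies={"I2": 3}))
```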
The pipeline may also be stalled because of a delay in the availability of an instruction. For
example, this may be a result of a miss in the cache, requiring the instruction to be fetched
from the main memory. Such hazards are often called control hazards or instruction
hazards. The effect of a cache miss on pipelined operation is illustrated in Figure 8.4. Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its normal operation at that point.
An alternative representation of the operation of a pipeline in the case of a cache miss is shown in Figure 8.4b. This figure gives the function performed by each pipeline stage in each clock cycle. Note that the Decode unit is idle in cycles 3 through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in cycles 5 through 7. Such idle periods are called stalls. They are also often referred to as bubbles in the pipeline. Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit.
A third type of hazard, known as a structural hazard, arises when two instructions require the use of a given hardware resource at the same time. Consider how the load instruction
Load X(R1),R2
can be accommodated in our example 4-stage pipeline. The memory address, X+[R1], is computed in step E2 in cycle 4; then memory access takes place in cycle 5. The operand read from memory is written into register R2 in cycle 6. This means that the execution step of this instruction takes two clock cycles (cycles 4 and 5). It causes the pipeline to stall for one cycle, because both instructions I2 and I3 require access to the register file in cycle 6. Even though the instructions and their data are all available, the pipeline is stalled because one hardware resource, the register file, cannot handle two operations at once. If the register file had two input ports, that is, if it allowed two simultaneous write operations, the pipeline would not be stalled. In general, structural hazards are avoided by providing sufficient hardware resources on the processor chip.
It is important to understand that pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases, where throughput is measured by the rate at which instruction execution is completed. Any time one of the stages in the pipeline cannot complete its operation in one clock cycle, the pipeline stalls, and some degradation in performance occurs. Thus, the performance level of one instruction completion in each clock cycle is actually the upper limit for the throughput achievable in a pipelined processor organized as in Figure 8.2b. An important goal in designing processors is to identify all hazards that may cause the pipeline to stall and to find ways to minimize their impact. In the following sections, we discuss various hazards, starting with data hazards, followed by control hazards.
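Finally, a small sketch of how stalls cap throughput (my own numbers and model, not from the notes): with one instruction completing per cycle as the upper limit, the achieved throughput falls as the average number of stall cycles per instruction grows.

```python
def throughput_per_cycle(stall_cycles_per_instruction):
    """Instructions completed per clock cycle in a long-running pipeline.
    The ideal pipeline of Figure 8.2b completes 1 instruction/cycle; each average
    stall cycle per instruction stretches the effective time per instruction."""
    return 1.0 / (1.0 + stall_cycles_per_instruction)

for stalls in (0.0, 0.25, 0.5, 1.0):
    print(f"average stalls/instruction = {stalls}: "
          f"throughput = {throughput_per_cycle(stalls):.2f} instructions/cycle")
```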