COA Unit 3 Pipelining 31.5.23

Pipelining is a technique to enhance processor performance by executing instructions concurrently through multiple stages such as fetch, decode, execute, and write. Hazards such as data, control, and structural hazards can cause stalls in the pipeline, which can be mitigated using techniques like operand forwarding and branch prediction. Effective management of these hazards is crucial for optimizing the performance of pipelined processors.

UNIT 3 PIPELINING (2nd Half)

Pipelining means executing instructions concurrently by overlapping their processing steps.


Basic concepts
The speed of execution of programs is influenced by many factors. There are two main ways to improve performance:

1. Use faster circuit technology to build the processor and the main memory.

2. Arrange the hardware so that more than one operation can be performed at the same time. In this way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

2-Stage Pipeline:

The processor executes a program by fetching and executing instructions one after the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. An execution of a program consists of a sequence of fetch and execute steps. The computer has two separate units, one for fetching instructions and another for executing them. The instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction. The result of execution is deposited in the destination location specified by the instruction.

The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle. In the first clock cycle, the fetch unit fetches instruction I1 (step F1) and stores it in buffer B1 at the end of the clock cycle. In the second clock cycle, the instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2). Meanwhile, the execution unit performs the operation specified by instruction I1, which is available to it in buffer B1 (step E1). By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit.
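As a minimal illustration of this overlap (a Python sketch assumed here, not part of the original notes), the following prints which step each unit performs in each clock cycle:

# Sketch: fetch/execute overlap in a 2-stage pipeline (illustrative assumption:
# one instruction fetched and one executed per clock cycle).
n = 4  # number of instructions I1..I4
for cycle in range(1, n + 2):
    fetch = f"F{cycle}" if cycle <= n else "--"       # fetch unit works on I_cycle
    execute = f"E{cycle - 1}" if cycle > 1 else "--"  # execute unit works on the previous instruction
    print(f"cycle {cycle}: {fetch} {execute}")

From cycle 2 onward, both units are busy every cycle, which is exactly the overlap described above.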

4-Stage Pipeline:
A pipelined processor may process each instruction in 4 steps, as follows:
1. F – Fetch: read the instruction from the memory.
2. D – Decode: decode the instruction and fetch the source operands.
3. E – Execute: perform the operation specified by the instruction.
4. W – Write: store the result in the destination location.

Four instructions are in progress at any given time. This means that four distinct hardware units are needed. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer. As an instruction progresses through the pipeline, all the information needed by the stages downstream must be passed along. During clock cycle 4, the information in the buffers is as follows:
1. Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-decoding unit.
2. Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be performed. This is the information produced by the decoding hardware in cycle 3. The buffer also holds the information needed for the write step of instruction I2 (step W2). Even though it is not needed by stage E, this information must be passed on to stage W in the following clock cycle to enable that stage to perform the required Write operation.
3. Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
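A short Python sketch (illustrative; the ideal stall-free schedule is assumed) reproduces this picture by printing the stage each instruction occupies in each cycle:

# Sketch: ideal 4-stage pipeline chart. Instruction Ii is in stage s during cycle i + s.
stages = ["F", "D", "E", "W"]
n = 4  # instructions I1..I4
for cycle in range(1, n + len(stages)):
    row = []
    for i in range(1, n + 1):
        s = cycle - i  # stage index occupied by Ii in this cycle
        row.append(f"I{i}:{stages[s]}" if 0 <= s < len(stages) else f"I{i}:-")
    print(f"cycle {cycle}: " + "  ".join(row))

In cycle 4 this prints I1 in W, I2 in E, I3 in D, and I4 in F, matching the buffer contents listed above.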

Role of cache memory:

The use of cache memories solves the memory access problem. In particular, when a cache is included on the same chip as the processor, access time to the cache is usually the same as the time needed to perform other basic operations inside the processor. This makes it possible to divide instruction fetching and processing into steps that are more or less equal in duration. Each of these steps is performed by a different pipeline stage, and the clock period is chosen to correspond to the longest one.
Pipeline Performance:
The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
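The notes state this proportionality without a formula. A standard way to quantify it (a derivation assumed here, for an ideal k-stage pipeline with one-cycle stages and no stalls) is that executing n instructions takes n \cdot k cycles without pipelining but only k + (n - 1) cycles with pipelining, giving a speedup of

S = \frac{n k}{k + (n - 1)} \longrightarrow k \quad \text{as } n \to \infty.

For example, n = 100 instructions on a k = 4 stage pipeline take 103 cycles instead of 400, a speedup of about 3.9, close to the ideal factor of 4.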
Hazard: Any condition that causes the pipeline to stall is called a hazard.
Types of hazards:
1. Data hazard
2. Control hazard (instruction hazard)
3. Structural hazard

Data hazard: It is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed, and the pipeline stalls.

Control hazards or instruction hazards: The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be fetched from the main memory. Such hazards are called control or instruction hazards.

Structural hazard: This is the situation when two instructions require the use of a given hardware resource at the same time. The most common case in which this hazard may arise is in access to memory. One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed. Many processors use separate instruction and data caches to avoid this delay.
Data Hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason.
A data dependency arises when the destination of one instruction is used as a source in the next instruction.
For example, the two instructions
Mul R2, R3, R4
Add R5, R4, R6
give rise to a data dependency. The result of the multiply instruction is placed into register R4, which in turn is one of the two source operands of the Add instruction. As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence, the D step of that instruction cannot be completed until the W step of the multiply instruction has been completed. Completion of step D2 must be delayed to clock cycle 5. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence, pipelined execution is stalled for two cycles.
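The two-cycle stall can be checked with a tiny calculation (a Python sketch under the 4-stage timing assumed above; not part of the notes):

# Sketch: without forwarding, I2's Decode cannot finish until I1's Write has completed.
w1 = 4                              # Mul's Write step W1 occurs in cycle 4
d2_normal = 3                       # Add's Decode step D2 would normally occur in cycle 3
d2_actual = max(d2_normal, w1 + 1)  # D2 must wait until after W1
print(f"D2 completes in cycle {d2_actual}: stall of {d2_actual - d2_normal} cycles")

This prints a stall of two cycles, as stated above.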
Operand Forwarding:
The data hazard arises because one instruction, instruction I2, is waiting for data to be written in the register file. However, these data are available at the output of the ALU once the Execute stage completes step E1. Hence, the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2. The registers SRC1, SRC2, and RSLT constitute the interstage buffers needed for pipelined operation. SRC1 and SRC2 are part of buffer B2, and RSLT is part of B3. The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of either the SRC1 or SRC2 register.
[Figure: datapath, showing the position of the source (SRC1, SRC2) and result (RSLT) registers in the processor pipeline]

After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding. The operand not involved in the dependency, register R5, is read and loaded into register SRC1 in clock cycle 3. In the next clock cycle, the product produced by instruction I1 is available in register RSLT, and because of the forwarding connection, it can be used in step E2. Hence, execution of I2 proceeds without interruption.
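The multiplexer selection can be sketched as follows (illustrative Python; the function name and interface are assumptions, only the register names come from the text):

# Sketch: forwarding multiplexer at an ALU input. If the previous instruction's
# destination matches this source register, the value on the destination (RSLT)
# bus is selected instead of the stale contents of SRC1/SRC2.
def alu_operand(src_reg, src_value, prev_dest_reg, rslt_value):
    if src_reg == prev_dest_reg:  # dependency detected when I2 was decoded
        return rslt_value         # forwarded result of step E1
    return src_value              # normal path through the SRC register

# Example: Mul R2,R3,R4 followed by Add R5,R4,R6 -- R4 is forwarded from RSLT.
product = 42                                   # hypothetical result of the Mul
print(alu_operand("R4", None, "R4", product))  # -> 42, taken from the destination bus
print(alu_operand("R5", 7, "R4", product))     # -> 7, read normally into SRC1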
Handling Data hazards in software:
The control hardware delays reading register R4 until cycle 5, thus introducing a 2-cycle stall unless operand forwarding is used. An alternative approach is to leave the task of detecting data dependencies and dealing with them to the software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (No-operation) instructions, as follows:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert the NOP instructions to obtain a correct result. This possibility illustrates the close link between the compiler and the hardware. Being aware of the need for a delay, the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots, and thus achieve better performance. On the other hand, the insertion of NOP instructions leads to larger code size. Also, it is often the case that a given processor architecture has several hardware implementations offering different features; NOP padding that suits one implementation may be unnecessary, and hence wasteful, on another.
Side effects:
When a location other than the one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect. For example, stack instructions such as push and pop have side effects because they implicitly use the autoincrement and autodecrement addressing modes, and hence modify the stack pointer.

Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry. Suppose that registers R1 and R2 hold a double-precision integer number that is to be added to another double-precision number held in registers R3 and R4. This may be accomplished as follows:
Add R1, R3
AddWithCarry R2, R4
An implicit dependency exists between these two instructions through the carry flag. This flag is set by the first instruction and used in the second instruction, which performs the operation
R4 ← [R2] + [R4] + carry
Instructions that have side effects give rise to multiple data dependencies, which lead to a
substantial increase in the complexity of the hardware or software needed to resolve them.
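To make the carry dependency concrete, here is a worked sketch in Python (assuming 32-bit registers and example values; none of this is in the notes):

# Sketch: double-precision (64-bit) addition from two 32-bit adds linked by the carry flag.
MASK = 0xFFFFFFFF
r1, r2 = 0xFFFFFFFF, 0x00000001  # first operand:  high word in R2, low word in R1
r3, r4 = 0x00000001, 0x00000002  # second operand: high word in R4, low word in R3

total = r1 + r3
r3, carry = total & MASK, total >> 32  # Add R1,R3: the low-word add sets the carry flag
r4 = (r2 + r4 + carry) & MASK          # AddWithCarry R2,R4: consumes that flag
print(hex(r4), hex(r3))                # -> 0x4 0x0, i.e. the 64-bit sum 0x00000004_00000000

If the second add did not see the carry produced by the first, the high word would be wrong, which is exactly the dependency the pipeline must respect.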

Instruction Hazards
∙ A pipeline stalled because of a delay in the availability of an instruction is said to suffer a control hazard.
∙ This type of hazard is also called an instruction hazard.
∙ For example, a cache miss requires the instruction to be fetched from the main memory, delaying its availability.

[Figure: instruction execution steps in successive clock cycles]

[Figure: functions performed by each processor stage in successive clock cycles]

Such idle periods are called stalls. They are also often referred to as bubbles in the pipeline. A branch instruction may also cause the pipeline to stall. It may be:
∙ a conditional branch
∙ an unconditional branch
Unconditional Branches:
Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch instruction. Let the branch target be instruction Ik. The time lost as a result of a branch instruction is often referred to as the branch penalty. In the four-stage pipeline described above, the penalty is two cycles if the target address is computed in the Execute stage; it can be reduced to one cycle by computing the address earlier, in the Decode stage.

Instruction Queue and Prefetching

∙ Either a cache miss or a branch instruction stalls the pipeline for one or more clock cycles.
∙ To reduce this stall period, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue.
∙ Typically, the instruction queue can store several instructions.
∙ A separate unit, called the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit.
∙ Every fetch operation adds one instruction to the queue and every dispatch operation reduces the queue length by one. Hence, in the example the notes consider, the queue length remains the same for the first four clock cycles.
Instructions I1, I2, I3, I4, and Ik complete execution in successive clock cycles, because the instruction fetch unit has executed the branch instruction concurrently with the execution of other instructions. This technique is referred to as branch folding.

∙ Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue.
∙ The effectiveness of this technique is enhanced when the instruction fetch unit is able to read more than one instruction at a time from the instruction cache. A sketch of the folding behavior follows.
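The sketch below (a Python model that is an assumption of this rewrite, simplified to one fetch and one dispatch per cycle) shows the branch consuming no dispatch slot:

# Sketch: the fetch unit handles an unconditional branch itself (branch folding)
# provided the queue holds at least one other instruction, so the dispatch unit
# never sees the branch and no execution cycle is lost.
from collections import deque

queue = deque(["I1", "I2"])             # assumed: fetching has run ahead of dispatch
incoming = ["Branch Ik", "Ik", "Ik+1"]  # stream seen by the fetch unit (redirected at the branch)
for cycle, instr in enumerate(incoming, start=1):
    if instr.startswith("Branch") and queue:
        print(f"cycle {cycle}: branch folded by the fetch unit, fetch redirected to Ik")
    else:
        queue.append(instr)
    print(f"cycle {cycle}: dispatched {queue.popleft()}")

Instructions I1, I2, and Ik are dispatched in successive cycles, with no bubble for the branch.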
Conditional Branches and Branch Prediction :

∙ A conditional branch instruction introduces the added hazard caused by the dependency of
the branch condition on the result of a preceding instruction.

∙ The decision to branch cannot be made until the execution of that instruction has been
completed.

∙ Branch instructions represent about 20 percent of the dynamic instruction count of most
programs.

∙ Because of the branch penalty, this large percentage would reduce the gain in
performance expected from pipelining.
∙ The different solutions for this are:
▪ Delayed branching
▪ Static branch prediction
▪ Dynamic branch prediction

Delayed branching:

- The location following a branch instruction is called a branch delay slot.
- In the example the notes refer to (figure not reproduced here), there is one delay slot.

- The instructions in the delay slots are always fetched and at least partially executed before the branch decision is made and the branch target address is computed.
- A technique called delayed branching can minimize the penalty incurred as a result of conditional branch instructions.
- The objective is to be able to place useful instructions in these slots.
- If no useful instructions can be placed in the delay slots, these slots must be filled with NOP instructions.
- After the code is reordered for delayed branching, as in the example below, a useful instruction fills the delay slot and no cycle is wasted.
- With this scheme, branching takes place one instruction later than where the branch instruction appears in the instruction sequence in the memory, hence the name "delayed branch."
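As an illustration in the style of the earlier examples (a hypothetical loop; the notes' own figure is not reproduced here):

Original sequence (delay slot wasted):
LOOP: Shift_left R1
Decrement R2
Branch_if>0 LOOP
NOP (branch delay slot)

Reordered for delayed branching:
LOOP: Decrement R2
Branch_if>0 LOOP
Shift_left R1 (fills the delay slot; executed on every iteration, whether or not the branch is taken)

The shift instruction is executed the same number of times in both versions, but the reordered version spends no cycles on NOPs.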
Branch Prediction
- It is a technique for reducing the branch penalty associated with conditional branches.
- It attempts to predict whether or not a particular branch will be taken.

- The simplest form of branch prediction is to assume that the branch will not take place and
to continue to fetch instructions in sequential address order.

- The other variety is speculative execution, which means that instructions are executed before the processor is certain that they are in the correct execution sequence.

- Thus, care must be taken that no processor registers or memory locations are updated until
it is confirmed that these instructions should indeed be executed.

- If the branch decision indicates otherwise, the instructions and all their associated data in
the execution units must be purged, and the correct instructions fetched and executed.
- The simplest approach is to assume that branches will not be taken.
- This would save the time lost to conditional branches roughly 50 percent of the time.
- However, better performance can be achieved if, for some branch instructions, the prediction is taken and, for others, not taken, depending on the expected program behavior.
- Any approach in which the prediction for a given branch instruction stays fixed is called static branch prediction.
Dynamic Branch Prediction

- It is an approach in which the prediction decision may change depending on execution history.
- The objective of branch prediction algorithms is to reduce the probability of making a wrong decision, and thus to avoid fetching instructions that eventually have to be discarded.
- The processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that instruction is executed.
- In its simplest form, the scheme uses the branch instruction's recent execution history.
- Different types of dynamic prediction exist, based on the amount of execution history that can be accumulated.
- Example schemes use 1 bit or 2 bits of history, as described below.
Dynamic prediction based on 1 bit
The two states are:
◆ LT: Branch is likely to be taken
◆ LNT: Branch is likely not to be taken

Assume that the algorithm is started in state LNT. The prediction follows the current state, and after each execution the state is set to LT if the branch was taken and to LNT if it was not; in other words, the predictor assumes the branch will do what it did the previous time.
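A minimal sketch of this scheme (illustrative Python; the encoding and function name are assumptions):

# Sketch: 1-bit dynamic predictor -- predict that the branch will repeat its last outcome.
state = "LNT"                          # start in LNT, as stated above
def predict_and_update(taken):
    global state
    prediction = (state == "LT")       # predict taken only in state LT
    state = "LT" if taken else "LNT"   # remember the latest outcome
    return prediction

for outcome in [True, True, False, True]:  # hypothetical branch outcomes
    print(f"predicted taken: {predict_and_update(outcome)}, actual: {outcome}")

A known weakness of the 1-bit scheme is that a loop branch is mispredicted twice per execution of the loop: once when the loop exits, and again when it is re-entered.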

Dynamic prediction based on 2 bits

- The four states are:
o ST: Strongly likely to be taken
o LT: Likely to be taken
o LNT: Likely not to be taken
o SNT: Strongly likely not to be taken
- Assume that the state of the algorithm is initially set to LNT. Each taken branch moves the state one step toward ST and each not-taken branch one step toward SNT, so the prediction is reversed only after two successive mispredictions.
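One common realization of the four states is a 2-bit saturating counter (a Python sketch; the numeric encoding is an assumption):

# Sketch: 2-bit predictor as a saturating counter. SNT=0, LNT=1, LT=2, ST=3.
state = 1                                  # start in LNT, as stated above
def predict_and_update(taken):
    global state
    prediction = state >= 2                # predict taken in states LT and ST
    state = min(state + 1, 3) if taken else max(state - 1, 0)
    return prediction

for outcome in [True, True, False, True]:  # hypothetical branch outcomes
    print(f"predicted taken: {predict_and_update(outcome)}, actual: {outcome}")

Because two successive mispredictions are needed to flip the prediction, a loop branch is now mispredicted only once per execution of the loop, improving on the 1-bit scheme.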
