Homework 2, Q3
Computer Architecture CIS 655/CSE661
Instructor: Dr. Mo Abdallah
Author name: Milen Dimitrov
Created on: 07/14/2021
Syracuse University, College of Engineering & Computer Science
Question 3: [30 points] Write a review report about
computer pipelining. The report must be more than 2500
words in any format. Use your own words.
1. Introduction
In the field of computer architecture, instruction pipelining is a technique that implements ILP (instruction-level parallelism) within a single CPU core. The purpose of a pipeline is to keep all parts of the processor busy with multiple instructions at the same time: incoming instructions are split into a sequence of stages, the "pipeline", which are executed by different CPU circuit units so that different parts of several instructions run at the same time, in parallel.
2. Concepts
In a single-CPU computer, only one task can be executed at a time; to simulate many tasks, computers do context switching. To execute multiple instructions at the same time, multi-processor or multi-core systems are usually used. But there is also another way to increase the throughput of a single CPU: make it run parts of several instructions at the same time, so that no part of the CPU is left idle. In a computer with a pipeline, instructions flow through the CPU in stages. For example, one stage may be performed per clock cycle: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write back to the registers (WB). Pipelined computers usually have "pipeline registers" after each stage. These registers temporarily store the instruction information and intermediate results so that the next stage's logic can work on them in the following cycle, and they allow one stage to read its inputs while another stage writes its results during the same clock cycle.
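To make the stage-by-stage flow concrete, here is a minimal Python sketch of instructions moving through the five stages, one stage per clock cycle (the instruction names are invented for illustration and no hazards are modeled):

    # Minimal sketch: instructions advance one stage per clock cycle.
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def run_pipeline(instructions):
        """Print which stage each instruction occupies on every clock cycle."""
        total_cycles = len(instructions) + len(STAGES) - 1
        for cycle in range(total_cycles):
            occupancy = []
            for i, instr in enumerate(instructions):
                stage = cycle - i            # instruction i enters IF on cycle i
                if 0 <= stage < len(STAGES):
                    occupancy.append(f"{instr}:{STAGES[stage]}")
            print(f"cycle {cycle + 1:2d}: " + "  ".join(occupancy))

    run_pipeline(["ADD", "SUB", "LOAD", "STORE"])

Once the pipeline is full, one instruction finishes in every cycle, which is exactly the effect described above.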
This architecture allows the CPU to complete one instruction every clock cycle. Often the even stages run on one edge of the clock and the odd stages run on the other edge. At a given clock frequency this gives more throughput than a multi-cycle computer, but it may increase the latency of an individual instruction because of the additional overhead of the pipeline registers themselves. Moreover, even if the electronic elements have a fixed maximum speed, a pipelined computer can be made faster by changing the number of stages in the pipeline: the more stages there are, the less work each stage has to do, so each stage has less gate delay and the whole pipeline can run at a higher clock rate (higher frequency).
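A small numerical illustration of this trade-off (all timings below are invented, not measured on any real CPU): assume an unpipelined datapath that needs 1000 ps per instruction, split into 5 stages with 20 ps of pipeline-register overhead each.

    # Invented numbers illustrating the throughput vs. latency trade-off.
    unpipelined_time_ps = 1000        # one instruction in a single long cycle
    stages = 5
    register_overhead_ps = 20         # extra delay added by each pipeline register

    stage_time_ps = unpipelined_time_ps / stages + register_overhead_ps   # 220 ps
    latency_ps = stage_time_ps * stages          # 1100 ps: a single instruction got slower
    throughput_speedup = unpipelined_time_ps / stage_time_ps              # ~4.5x instructions/second

    print(f"cycle time {stage_time_ps:.0f} ps, "
          f"instruction latency {latency_ps:.0f} ps, "
          f"throughput speedup {throughput_speedup:.2f}x")

The clock can run roughly 4.5 times faster, so throughput rises, while each individual instruction actually takes slightly longer than before.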
When the cost is measured in logic gates per instruction per second, the pipelined computer is usually the most economical design. At each moment an instruction occupies only one pipeline stage, so on average a pipeline stage is cheaper than a multi-cycle computer. In addition, when done well, most of a pipelined computer's logic is in use most of the time; in contrast, out-of-order computers usually have large amounts of idle logic at any given moment. Similar calculations usually indicate that a pipelined computer also uses less energy per instruction.
Pipelined computers are usually more complex and more costly than comparable multi-cycle computers: they have more transistors, registers and gates, and more complex control blocks. For the same reason they may use more total energy, even though each instruction uses less energy. Out-of-order CPUs can usually execute more instructions per second because they can execute several instructions at once. In a pipelined computer, the control block arranges for the instructions to start, continue and stop according to the program. Instruction data is usually passed from one stage to the next through the pipeline registers, and each stage has somewhat separate control logic. The control block also ensures that the operation of one stage does not interfere with the instructions in the other stages; the situations where such interference can occur are called hazards. For example, a data hazard can occur when the same piece of data must be used in two stages, and the control logic must ensure that it is used in the correct order.
When running efficiently, a pipelined CPU has one instruction in each stage and processes all of them at the same time, so it can complete approximately one instruction per clock cycle. But when the program switches to a different instruction sequence, the pipeline sometimes must discard the instructions already in flight and restart; the resulting pause is called a stall (and the discarding itself is often called a flush). Much of the design effort in a pipelined computer goes into preventing interference between stages and reducing such stalls.
Pipelines most often have around 3 to 7 stages, but extreme cases exist. IBM used 3 stages in the fifties and sixties of the last century; the classic RISC architecture has 5 stages; Atmel AVR and PIC microcontrollers use 2-stage pipelines. The Intel Pentium 4 has 20 stages, and some Intel Xeon cores even 31. An extreme example is the Xelerated X10q with more than a thousand stages, although those are really small cores dedicated to specific instructions.
3. History
Pioneering uses of pipelining were in the ILLIAC II project and the IBM Stretch project, although a simple form was used earlier in the Z1 in 1939 and the Z3 in 1941. The Atanasoff-Berry Computer (ABC) of John Vincent Atanasoff and Clifford Berry, one of the earliest electronic digital computers, also used a form of pipelining.
Pipelining began in earnest in supercomputers such as vector processors and array processors in the late 1970s. One of the early supercomputers was the Cyber series built by Control Data Corporation; its principal architect, Seymour Cray, later headed Cray Research. Cray developed the X-MP line of supercomputers, which used pipelining for the multiply and add/subtract functions. Later, Star Technologies added parallelism (several pipelined functions working in parallel), developed by Roger Chen, and in 1984 it added a pipelined divide circuit developed by James Bradley. By the mid-1980s, pipelining was used by many different companies around the world.
Pipelining is not limited to supercomputers. In 1976, Amdahl's 470 series general-purpose mainframes had a 7-stage pipeline and a patented branch prediction circuit. Intel introduced pipelining with the 80486 in 1989, and the Pentium of 1993 had two pipelines.
4. Hazards
Using a pipeline to handle several instructions at the same time can cause problematic situations when an instruction depends on a previous one. These situations are called hazards, and there are a few well-known classes: structural hazards, data hazards (of several types, reviewed later) and control hazards, which involve branch instructions.
4.1 Structural hazard (sometimes called a resource hazard)
Structural hazards occur when two instructions try to use the same hardware resource at the same time. As an example, consider that several instructions are ready to enter the execute stage and there is only one ALU. One solution to this kind of hazard is to increase the available resources, for example by adding multiple ports to main memory or multiple ALUs.
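As a hedged sketch of the same idea in code (the instruction mix and the assumption of a single shared memory port are invented for illustration): when a load or store occupies the only memory port in its MEM stage, the fetch of a later instruction cannot use the port in the same cycle and must wait.

    # Simplified model of a structural hazard on a single shared memory port.
    # In a 5-stage pipeline, instruction i is in MEM on the same cycle in which
    # instruction i + 3 wants to be in IF; both need the one memory port.
    def port_conflict_stalls(instructions):
        stalls = 0
        for i, instr in enumerate(instructions):
            uses_data_port = instr in ("LOAD", "STORE")
            fetch_pending = i + 3 < len(instructions)
            if uses_data_port and fetch_pending:
                stalls += 1        # the later fetch is delayed by one cycle
        return stalls              # (knock-on delays from earlier stalls are ignored)

    program = ["LOAD", "ADD", "SUB", "STORE", "ADD", "LOAD", "ADD", "ADD"]
    print("stall cycles caused by the shared memory port:", port_conflict_stalls(program))

Adding a second memory port (for example separate instruction and data caches), as suggested above, removes these conflicts entirely.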
4.2 Data hazard
Data hazards occur when an instruction scheduled in the pipeline tries to use data before it is available in the registers, or, in other words, when instructions that have a data dependence modify data in different stages of the pipeline. If not handled properly, potential data hazards can lead to race conditions (a.k.a. race hazards).
There are three possible data hazards: a true dependency, read after write (RAW); an anti-dependency, write after read (WAR); and an output dependency, write after write (WAW).
True data dependency hazard, read after write (instruction 2, I2, tries to read a source before instruction 1, I1, has written it): a read-after-write hazard refers to the situation where an instruction refers to a result that has not yet been calculated or retrieved. This can happen because, even though one instruction executes after a previous one, the previous instruction has only been partially processed in the pipeline. For example, the first instruction calculates a value to be stored in, say, register R2, and the second instruction uses this value to calculate a result for register R3. But when the pipeline fetches the operands for the second instruction, the result of the first one has not yet been saved, so there is a data dependence.
Anti-dependency data hazard, write after read (I2 tries to write the destination before I1 has read it): a write-after-read hazard indicates a problem with concurrent execution. If I2 may complete before I1, it must be ensured that I2 does not store its result into the register before I1 has had a chance to read its operand.
Output dependency data hazard, write after write (I2 tries to write an operand before I1 has written it): write-after-write hazards may occur in a concurrent execution environment; the write-back of I2 must be delayed until I1 completes its execution.
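The three cases can be spotted mechanically. The following Python sketch (with an invented three-operand instruction format) scans a short program and reports RAW, WAR and WAW dependences between instructions that are close enough to overlap in the pipeline:

    # Sketch: classify data hazards between pairs of nearby instructions.
    # Each instruction is (name, destination register, source registers).
    def classify_hazards(instructions, window=2):
        hazards = []
        for i, (n1, d1, s1) in enumerate(instructions):
            for j in range(i + 1, min(i + 1 + window, len(instructions))):
                n2, d2, s2 = instructions[j]
                if d1 in s2:
                    hazards.append(f"RAW: {n2} reads {d1} written by {n1}")
                if d2 in s1:
                    hazards.append(f"WAR: {n2} writes {d2} read by {n1}")
                if d1 == d2:
                    hazards.append(f"WAW: {n2} and {n1} both write {d1}")
        return hazards

    program = [
        ("I1", "R2", ["R1", "R3"]),   # R2 = R1 + R3
        ("I2", "R4", ["R2", "R5"]),   # R4 = R2 + R5  -> RAW on R2
        ("I3", "R5", ["R6", "R7"]),   # R5 = R6 + R7  -> WAR on R5 with I2
        ("I4", "R4", ["R8", "R9"]),   # R4 = R8 + R9  -> WAW on R4 with I2
    ]
    for h in classify_hazards(program):
        print(h)

Only the RAW case is a true data dependence; the WAR and WAW cases exist only because the same register names are reused, which is why register renaming (discussed in section 4.4) can remove them.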
4.3 Control hazard (a.k.a. instruction hazard or branch hazard)
A control hazard occurs when the pipeline makes a wrong branch prediction and therefore brings into the pipeline instructions that must later be discarded.
4.4 Methods to eliminate or decrease pipeline hazards
The classic 5-stage RISC pipeline avoids these hazards by adding hardware. In particular, branch and jump instructions could use the ALU to calculate the target address of the branch; but if the ALU were used for that purpose in the decode stage, an ALU instruction followed by a branch would mean two instructions trying to use the ALU at the same time. It is relatively easy to resolve this conflict by putting a dedicated branch-target adder in the decode stage, so that at least for calculating the branch offset and indexing the memory address there is no need to use the main ALU.
In the classic, standard 5-stage RISC pipeline, data hazards can be reduced or avoided in several ways: bypassing (operand forwarding), out-of-order execution, or a simple pipeline bubble of NOP instructions. The last one is the worst and a last-resort solution, because it introduces delays: some CPU stages do nothing useful and resources are wasted. Operand forwarding is costly, because many additional gates and registers are needed. Reordering of instructions has to be done by the compiler and the programmers; to help with reordering, register renaming may be used, which means the compiler uses different general-purpose registers even when the program reuses the same register (a minimal sketch of this idea follows below). That is one of the reasons why more and more general-purpose registers are needed.
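Here is a minimal sketch of the register renaming idea (the physical register names P0, P1, ... and the instruction format are invented): every write gets a fresh register, so only the true read-after-write dependences remain.

    # Sketch of compiler-side register renaming: give every new value a fresh
    # register so that WAR and WAW hazards (name dependences) disappear.
    from itertools import count

    def rename(instructions):
        """instructions: list of (dest, src1, src2) architectural register names."""
        fresh = (f"P{i}" for i in count())      # unlimited pool of new names
        current = {}                            # architectural name -> latest new name
        renamed = []
        for dest, src1, src2 in instructions:
            s1 = current.get(src1, src1)        # read the latest version of each source
            s2 = current.get(src2, src2)
            d = next(fresh)                     # every write gets a brand-new register
            current[dest] = d
            renamed.append((d, s1, s2))
        return renamed

    program = [
        ("R1", "R2", "R3"),   # R1 = R2 op R3
        ("R4", "R1", "R5"),   # reads R1 (true dependence, kept)
        ("R1", "R6", "R7"),   # reuses R1 -> WAR/WAW only by name, removed by renaming
    ]
    for before, after in zip(program, rename(program)):
        print(before, "->", after)

After renaming, the second instruction still reads the value produced by the first (the true dependence), but the reuse of R1 by the third instruction no longer conflicts with anything.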
Bypassing is also called operand forwarding. It is additional hardware (transistors and gates) that allows a result to be sent directly from the ALU output of the previous instruction to the input of the next instruction, before the value is saved in the destination register (it will still be saved later; it is just used ahead of time). With forwarding, the fetch and decode stages can issue the dependent instruction in the very next cycle, without waiting for the write-back. Another method is pipeline interlocking, where hardware detects the hazard and stalls the earlier pipeline stages.
A data hazard of this kind can easily be detected when the machine code of the program is generated by the compiler. The Stanford MIPS machine therefore relied on the compiler to add NOP instructions instead of having the circuitry detect the hazard and (more laboriously) stall the first two pipeline stages, hence the name MIPS: Microprocessor without Interlocked Pipeline Stages. It turned out that the extra NOP instructions added by the compiler enlarged the program binaries and thereby reduced the instruction cache hit rate. The stall hardware, although expensive, was reintroduced in later designs to improve the instruction cache hit rate, at which point the acronym no longer made sense.
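A small sketch comparing the two approaches, under the simplifying assumption of a 5-stage pipeline in which a result written in WB can be read in the same cycle (so a dependent neighbour needs two bubbles without forwarding and none with EX-to-EX forwarding); the register names are invented:

    # Compare bubbles (NOPs / interlock stalls) with and without operand forwarding.
    def bubbles(program, forwarding):
        """program: list of (dest, sources). Return bubbles inserted before each instruction."""
        result = []
        for i, (dest, sources) in enumerate(program):
            bubble = 0
            if i > 0 and program[i - 1][0] in sources:   # depends on the previous result
                bubble = 0 if forwarding else 2          # wait for WB, or forward from EX
            result.append(bubble)
        return result

    program = [("R2", ["R1", "R3"]),    # R2 = R1 + R3
               ("R4", ["R2", "R5"])]    # needs R2 from the previous instruction
    print("bubbles without forwarding:", bubbles(program, forwarding=False))
    print("bubbles with forwarding:   ", bubbles(program, forwarding=True))

The compiler-inserted NOPs of the Stanford MIPS correspond to the two bubbles in the first case; forwarding hardware makes them unnecessary at the cost of extra gates.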
Control hazards
Control hazards are caused by conditional and unconditional branches. The classic 5-stage RISC pipeline resolves branches in the decode stage, which means the branch resolution takes two cycles. This has three consequences:
Branch resolution goes through quite a lot of circuitry: the instruction cache read, the register file read, the branch condition computation (which involves a 32-bit comparison on a MIPS CPU), and the next-instruction-address multiplexer.
Because branch and jump targets are calculated in parallel with the register read, RISC ISAs usually do not have instructions that branch to a register + offset address.
On any taken branch, the instruction immediately following the branch is always fetched from the instruction cache. If this instruction is discarded, there is a one-cycle IPC penalty per taken branch, which is significant.
There are four common solutions to this performance problem with branches:
Predict not taken: the instruction after the branch is always fetched from the instruction cache, but it is only executed if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as a NOP), and one cycle's opportunity to finish an instruction is lost.
Branch likely: the instruction after the branch is always fetched from the instruction cache, but it is only executed if the branch is taken. The compiler can always fill the branch delay slot on such a branch, and because branches are taken more often than not, the IPC penalty of these branches is smaller than that of the previous kind.
Branch delay slot: the instruction after the branch is always fetched from the instruction cache and always executed, even if the branch is taken. Instead of an IPC penalty for some fraction of branches, either taken (perhaps 60%) or not taken (perhaps 40%), the branch delay slot gives an IPC penalty only for those branches into which the compiler could not schedule a useful delay-slot instruction. The designers of SPARC, MIPS and MC88K designed a branch delay slot into their ISAs.
Branch prediction: in parallel with fetching each instruction, guess whether the instruction is a branch or a jump and, if so, guess the target. On the cycle after a branch or jump, the instruction at the guessed target is fetched. When the guess is wrong, the incorrectly fetched instructions are flushed. A minimal sketch of one simple dynamic prediction scheme follows below.
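As a minimal sketch of the fourth option, here is a simple dynamic predictor with one 2-bit saturating counter per branch address, a common textbook scheme (the branch address and the outcome trace are invented):

    # Sketch of dynamic branch prediction with a 2-bit saturating counter per branch.
    class TwoBitPredictor:
        def __init__(self):
            self.counters = {}                      # branch address -> counter in 0..3

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2    # >= 2 means "predict taken"

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

    predictor = TwoBitPredictor()
    outcomes = [True, True, True, False, True, True, True, False]   # e.g. a short loop
    mispredictions = 0
    for taken in outcomes:
        if predictor.predict(pc=0x400) != taken:
            mispredictions += 1                     # each miss costs a pipeline flush
        predictor.update(pc=0x400, taken=taken)
    print(f"mispredictions: {mispredictions} out of {len(outcomes)} branches")

Each misprediction corresponds to a flush of the wrongly fetched instructions, so a higher prediction hit rate directly reduces the IPC penalty.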
Delayed branches were controversial, first of all because of their complicated semantics: a delayed branch specifies that the jump to the new location happens after the next instruction, which is the instruction the instruction cache unavoidably loads after the branch.
Delayed branches have also been criticized as a poor short-term choice in instruction set architecture design: compilers often find it difficult to find logically independent instructions to place after the branch (the slot after the branch is called the delay slot), so they must insert NOPs into the delay slots.
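A sketch of what such a compiler pass might look like (the instruction representation is invented and the check is deliberately simplified): try to move one earlier instruction that does not affect the branch condition into the delay slot, otherwise fall back to a NOP.

    # Sketch of filling a branch delay slot with an independent earlier instruction.
    # (A real compiler would also check dependences with the instructions being hopped over.)
    def fill_delay_slot(body, branch_sources):
        """body: list of (name, dest, sources) before the branch.
        Return (remaining body, delay-slot instruction)."""
        for i in reversed(range(len(body))):
            name, dest, sources = body[i]
            if dest not in branch_sources:          # must not change the branch condition
                return body[:i] + body[i + 1:], body[i]
        return body, ("NOP", None, [])              # nothing safe to move

    body = [("ADD",  "R1", ["R2", "R3"]),
            ("MUL",  "R4", ["R5", "R6"]),           # independent of the branch condition
            ("SUBI", "R1", ["R1"])]                 # branch condition depends on R1
    new_body, slot = fill_delay_slot(body, branch_sources=["R1"])
    print("before branch:", [i[0] for i in new_body])
    print("delay slot:   ", slot[0])

If no independent instruction exists, the NOP in the delay slot gives exactly the one-cycle penalty the mechanism was meant to avoid.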
Superscalar processors, which fetch multiple instructions per cycle and must have some form of branch prediction, do not benefit from delayed branches. The Alpha ISA left delayed branches out because it was designed for superscalar processors.
The most serious disadvantage of delayed branches is the additional control complexity they bring. If an exception occurs in the delay-slot instruction, the processor must be restarted on the branch, not on the next instruction. An exception then essentially has two addresses, the exception address and the restart address, and generating and distinguishing the two correctly in all cases has been a source of bugs in later designs.
4.5 Other interesting cases - Self-modifying programs and uninterruptible instructions
Self-modifying programs were very popular at the beginning of the computer age, because they permitted very efficient use of the then very expensive memory. On pipelined CPUs, however, the technique of self-modifying code can create problems: a routine of the program may modify one of its own upcoming instructions, but if the CPU has an instruction cache or prefetch queue, the original instruction may already have been copied into it, and the modification does not take effect.
An instruction may need to be uninterruptible to ensure its atomicity, for example when it exchanges two items, in order to avoid race conditions. A sequential processor allows interrupts between instructions, but a pipelined processor overlaps instructions, so executing an uninterruptible instruction can make portions of ordinary instructions uninterruptible as well.
5. When it is not good to use pipelines
While pipelines in general increase throughput, there are situations in which they may do more harm than good. One of the biggest problems is that the average latency of an individual instruction increases. For example, in low-latency trading it is common practice to disable some of the CPU's cores and hyper-threading: a single-threaded core is faster for one stream of work, even though the overall throughput is lower (an additional effect is that when only one core is working the thermal load is lower, so the clock rate can be increased, i.e. the CPU can be overclocked). Another case is when more predictability is needed, since pipelining introduces more timing uncertainty. Programming is also more complicated for a pipelined computer, and compilers do not always produce optimal code for it.
6. Conclusion
Pipelines greatly increase throughput through ILP, and even in current multi-core chips they remain one of the most important parts of a CPU. With Moore's law slowing down in recent years and with the inability to keep raising CPU clock frequency, pipelines, together with multi-core architectures, help CPUs grow "horizontally". More and more space on the silicon die is dedicated to pipelines and caches; this was anticipated already in the RISC/CISC discussions by DEC and Patterson in the 1980s, in the papers we reviewed in Homework 1, where the silicon freed by removing complex instructions could be used for more pipeline stages and cache.
But as pipelines grow more complex, programmers need to know more about them and to learn the internal structure of the CPU, its pipeline design and workflow, so that they can write more efficient programs.