Pipelining
The PowerPC and Pentium II
CSE 3322-001
Fall 1999
November 8, 1999
I Introduction
Pipelining breaks the execution of an instruction into discrete stages. This allows work to be done on
a different instruction in each stage during the same clock cycle, which improves overall instruction
throughput. The Pentium II and the PowerPC (MPC750) are two popular microprocessors that have
implemented pipelining to achieve greater performance. Both processors are superscalar, which refers to
the ability to initiate and complete more than one instruction per clock cycle. A superscalar processor can
therefore achieve a CPI (clocks per instruction) of less than 1.
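As a rough illustration of the throughput claim, the sketch below compares cycle counts for an idealized machine. It assumes one clock per stage and no stalls or hazards; the function names and the `width` parameter (instructions completed per clock) are illustrative and do not come from either processor's documentation.

```python
import math

def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages, width=1):
    """Ideal pipeline: fill the stages once, then finish `width` instructions per clock."""
    if n_instructions == 0:
        return 0
    groups = math.ceil(n_instructions / width)
    return n_stages + groups - 1

def cpi(cycles, n_instructions):
    """Clocks per instruction."""
    return cycles / n_instructions

n = 1000
print(cpi(unpipelined_cycles(n, 5), n))          # no pipelining
print(cpi(pipelined_cycles(n, 5), n))            # single pipeline, CPI approaches 1
print(cpi(pipelined_cycles(n, 5, width=2), n))   # two-way superscalar, CPI below 1
```

Under these assumptions a single pipeline can only approach a CPI of 1, while a two-way superscalar design approaches 0.5.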
The first IA (Intel Architecture) processor to include pipelining was the i386, which had a single
pipeline with three stages. The i486 expanded the pipeline to five stages. The Pentium added a second
pipeline to achieve two-way superscalar performance, along with branch prediction. The Pentium Pro
introduced a three-way superscalar architecture and out-of-order execution. The Pentium II added MMX
instructions, which required the addition of some new stages to the integer pipeline.
The PowerPC processor has been pipelined and superscalar since it came to the market in 1991. The
PowerPC pipeline has not changed much since the 601. The 603 added a separate floating-point execution
unit to the pipeline and the 740 added a second integer execution unit.
The instruction set architecture determines how easy it will be to design a pipeline for the instruction
set, particularly the decoder. The PowerPC (750) instructions are all one word (32 bits) in length. The x86
instructions vary in length from 1 byte to 17 bytes.
II Overview
The pipelines for the PowerPC and the Pentium II are very similar. An instruction goes through four
phases in both processors. Refer to figure 1 for the generic pipeline organization of the PowerPC and the PII.
1) Fetch
2) Decode/Dispatch
3) Execute
4) Commit
The PII can send 16 bytes per clock to the decoding unit, for a maximum of three instructions; the
actual number depends on the instruction mix. The PowerPC can decode two instructions per clock plus a
branch instruction.
The PII has eight execution units clustered around five ports. Complex integer, complex floating point
and simple floating point are clustered around port 0. Simple integer and branch are clustered on port 1.
Load, store, and store address are on the remaining three ports. The PII can send a maximum of five
instructions to execution units in a clock cycle. The PowerPC has two integer units, a branch unit, a
floating-point unit, a load/store unit and a system register unit. Figure 1 is a generic representation of the
two pipelines combined to show the similarities.
After execution, all instructions are sent to a commit unit that finalizes the results once all preceding
instructions have been committed. This is necessary because instructions do not necessarily finish in program
order. In the event of a mis-predicted branch, all instructions after the prediction must be cleared and
fetching restarts at the location indicated by the branch.
Figure 1: Generic pipeline organization of the PowerPC and the Pentium II. Blocks shown, top to bottom:
PC, instruction cache and data cache; branch prediction, instruction queue and register file;
decode/dispatch unit; six reservation stations; execution units (integer, FPU, branch, load, store, store
address, system register); commit unit and reorder buffer. Units marked * are PowerPC units and units
marked ** are Pentium II units.
III Fetch
The PowerPC can retrieve 16 bytes per clock (4 instructions) into the 24-byte (6 instructions)
instruction queue (IQ). The PII can retrieve 32 bytes into a prefetch buffer. The PowerPC therefore always
fetches four instructions per clock, while the number of x86 instructions fetched varies with their lengths.
The instruction fetch unit for the PII has three stages (refer to figure 2):
1) IFU1 – Load 32 bytes into the prefetch buffer
2) IFU2 – Mark the boundaries between the instructions in a 16-byte block. Present any branches to the
BTB (branch target buffer) for dynamic prediction. If a branch entry exists and the branch is predicted as
taken, the prefetch stream will be adjusted.
3) IFU3 – Align the instructions for presentation to the appropriate decoder
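The boundary-marking step in IFU2 can be sketched as follows. This is only an illustration: the instruction lengths are passed in as a hypothetical list, whereas the hardware has to derive them from the instruction bytes themselves, and the function name `mark_boundaries` is an assumption.

```python
def mark_boundaries(lengths, block_size=16):
    """Return the start offsets of the instructions that begin inside one block.

    `lengths` is a hypothetical list of instruction lengths in bytes, in
    program order, with the first instruction starting at offset 0.
    """
    starts = []
    offset = 0
    for length in lengths:
        if offset >= block_size:
            break                 # this instruction starts in the next block
        starts.append(offset)
        offset += length
    return starts

# Variable-length instructions: every boundary depends on the one before it.
print(mark_boundaries([1, 3, 7, 2, 6, 4]))   # [0, 1, 4, 11, 13]

# Fixed four-byte instructions: the boundaries are known in advance.
print(mark_boundaries([4, 4, 4, 4, 4]))      # [0, 4, 8, 12]
```

In the first call the last marked instruction starts at offset 13 and spills into the following block, a case that fixed-length instructions never produce.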
The PowerPC has two stages (refer to figure 3):
1) Load up to 16 bytes into the 24-byte instruction queue
2) Present any branches to the BPU (branch processing unit)
Note that the PowerPC does not need to mark any boundaries between instructions or to change the
alignment of instructions for the decoders. The PowerPC also has one of its execution units, the BPU,
attached directly to the fetch unit. The BPU contains an adder to compute the target address of the branch.
The BPU also calculates the return address of a procedure call and saves it in the programmer-visible link
register (LR). If a branch instruction does not update the count register (CTR) or the LR, it is removed from
the instruction stream.
Both processors maintain a history of previously seen branches. The Pentium II refers to the table as a
branch target buffer (BTB) and the PowerPC calls it a branch history table (BHT). When referring to both
tables collectively, I will call them a branch table (BT).
The BT is a dynamic branch predictor. It stores the history of previously seen branches and their
targets. When a branch instruction is fetched, its address is looked up in the BT. If the address is found, a
prediction is made; otherwise, an entry is generated in the BT and fetching continues in sequential order (the
branch is predicted as not taken). Each entry contains the branch address, the destination address and a
status field, which is four bits in the PII and two bits in the PowerPC. Both processors maintain 512 entries
in the BT. If the branch is predicted as taken, the instruction buffer is flushed and fetching continues from
the address supplied by the BT.
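The lookup described above can be sketched with a small model of a branch table that uses a two-bit status field. The paper does not say how the two bits are interpreted, so the sketch assumes the classic two-bit saturating counter. The class and method names, the oldest-first eviction policy and the initial counter values are all assumptions, and the entry is created when the branch resolves rather than at lookup, to keep the model short.

```python
class BranchTable:
    """Toy branch table: branch address -> [target, 2-bit saturating counter]."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.entries = {}  # dicts keep insertion order, so the oldest entry is first

    def predict(self, branch_addr):
        """Return the predicted target, or None for 'not taken' (fall through)."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            return None              # unseen branch: predicted as not taken
        target, counter = entry
        return target if counter >= 2 else None

    def update(self, branch_addr, taken, target):
        """Record the actual outcome once the branch has executed."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            if len(self.entries) >= self.capacity:
                # Assumed replacement policy: evict the oldest entry.
                self.entries.pop(next(iter(self.entries)))
            # Assumed initial state: weakly taken (2) or weakly not taken (1).
            self.entries[branch_addr] = [target, 2 if taken else 1]
            return
        entry[0] = target
        entry[1] = min(entry[1] + 1, 3) if taken else max(entry[1] - 1, 0)

bt = BranchTable()
print(bt.predict(0x100))            # None: no entry yet, fetch sequentially
bt.update(0x100, taken=True, target=0x200)
print(hex(bt.predict(0x100)))       # 0x200: predicted taken
```

The saturating counter gives the prediction some hysteresis: a branch that is strongly taken needs two consecutive not-taken outcomes before its prediction flips.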
In the PII, the BTB can predict four branches simultaneously and uses Yeh’s algorithm. The four bits
in the BTB record the branch’s behavior the last four times it was executed. The estimated accuracy is 90%.
The BTB also has a return stack buffer (RSB), which saves the return addresses of procedure calls.
Procedure calls and returns are unconditional branches, and the RSB is used to speed up returns from procedures.
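To show why a four-bit history helps, here is a simplified model of a two-level adaptive predictor for a single branch. It is a sketch, not the PII's actual BTB organization: the use of two-bit counters in the second level, the initial counter values and the class name `TwoLevelPredictor` are assumptions.

```python
class TwoLevelPredictor:
    """One branch: a 4-bit outcome history selects one of 16 two-bit counters."""

    HISTORY_BITS = 4

    def __init__(self):
        self.history = 0                                 # last four outcomes, newest in bit 0
        self.counters = [1] * (1 << self.HISTORY_BITS)   # assumed start: weakly not taken

    def predict(self):
        """Predict taken if the counter for the current history is 2 or 3."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the selected counter, then shift the outcome into the history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A loop branch that is taken three times and then falls through, repeatedly.
pattern = [True, True, True, False]
p = TwoLevelPredictor()
for outcome in pattern * 3:          # warm-up
    p.update(outcome)

hits = 0
for outcome in pattern * 3:
    hits += (p.predict() == outcome)
    p.update(outcome)
print(hits, "of", len(pattern) * 3)  # 12 of 12
```

Once trained, each four-outcome history is followed by exactly one outcome in this pattern, so the predictor also gets the loop exit right, which a single two-bit counter per branch would miss on every iteration.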
The PowerPC can predict only two branches deep. The PowerPC maintains a 64-entry branch target
instruction cache (BTIC). The first two instructions of a previously taken branch are stored here. When a
branch is predicted as being taken, the first two instructions from the new fetching address can be loaded
from here. The instructions are fetched one clock cycle sooner than if the fetching unit had to go to the
instruction cache. The branch processing unit (BPU) of the PowerPC retrieves branch instructions from the
fetch queue for dynamic branch prediction. The BPU will also get the branch instruction from the
instruction queue (IQ) for execution.
IV Decode
The PowerPC and the Pentium II differ the most in the decoding of instructions. This is due to the
widely variable instruction lengths of the x86.
The Pentium II decodes each instruction into one or more 118-bit micro-ops. The micro-ops are then
forwarded on through the pipeline. The PII has three decode units and can submit up to three instructions
to them per clock, dependent upon the instruction mix. The first unit can decode an instruction that results
in 1 to 4 micro-ops and is seven or fewer bytes long. The other two decoders can only decode an instruction
that results in one micro-op. The PII can rotate the three instructions to align them with an appropriate
decoder to yield better throughput. For instance, consider a complex instruction to be one that decodes into
2-4 micro-ops. If the next three instructions presented to the decoder unit arrive in the order
complex/simple/simple, the last stage of the fetch process passes them directly to the three decoders. If the
ordering is different, such as simple/complex/simple, the decoder unit can rearrange the instructions so that
the complex instruction goes to the first decoder and the two simple instructions go to decoders two and
three. This is done while still preserving the program order of the instructions. An instruction of five or
more micro-ops must be run through the Micro Instruction Sequencer (MIS). The MIS is a microprogram
that translates an instruction into a sequence of micro-ops. A maximum of six micro-ops per clock advance
to the decoded instruction (ID) queue in strict program order. If any of the operations in the ID queue are
branch instructions, static branch prediction is used as a backup predictor to the BTB. The ID queue can
submit three micro-ops per clock to the register allocation table (RAT).
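The dependence on instruction mix can be made concrete with a small grouping model of the three decoders. It follows the rules stated above (at most three instructions per clock, at most one of them complex), but it ignores the seven-byte limit and assumes that an instruction handled by the MIS occupies a clock by itself. The function name `decode_groups` is illustrative.

```python
def decode_groups(uop_counts):
    """Split instructions, given as micro-op counts, into per-clock decode groups."""
    groups, current, has_complex = [], [], False
    for uops in uop_counts:
        if uops > 4:
            # Assumption: a microcoded (MIS) instruction takes a clock by itself.
            if current:
                groups.append(current)
            groups.append([uops])
            current, has_complex = [], False
            continue
        is_complex = uops >= 2
        # Start a new clock when all three decoders are used, or when a second
        # complex instruction would need the one decoder that can handle it.
        if len(current) == 3 or (is_complex and has_complex):
            groups.append(current)
            current, has_complex = [], False
        current.append(uops)
        has_complex = has_complex or is_complex
    if current:
        groups.append(current)
    return groups

print(decode_groups([1, 3, 1, 2, 2, 1, 1, 1, 6, 1]))
# [[1, 3, 1], [2], [2, 1, 1], [1], [6], [1]]
```

In this model two adjacent complex instructions cannot share a clock, and the largest group, one four-micro-op instruction plus two simple ones, yields the six micro-ops per clock mentioned above.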
In the PII, the RAT looks at the operands referred to by the micro-op and reassigns them as necessary
to point to the rename registers in the reorder buffer (ROB). This allows for data forwarding. Data
forwarding lets an instruction get the result of a previous instruction as soon as it is produced, before the
previous instruction has been committed. This allows the following instruction to get an earlier start. The
PII has a set of 40 rename registers in the ROB to facilitate data forwarding. The ROB is a 40-entry circular
queue and can accept three micro-ops per clock. The reservation station (RS) can copy up to five micro-ops
out of the ROB in one clock. The micro-op entry remains in the ROB until the instruction finishes. The PII
uses a 20-entry RS. The RS is the scheduler/dispatcher for the rest of the pipeline. The RS holds a micro-op
until its operands are available, which is what allows out-of-order execution. If two micro-ops become
ready at the same time for the same resource, the RS uses a FIFO rule to determine which micro-op gets
dispatched. The RS has five ports with multiple execution units clustered on ports 0 and 1. The commit
unit will commit the results in-order.
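The renaming performed by the RAT can be sketched as a table that maps each architectural register to the ROB entry of its most recent producer. This is a minimal model under stated assumptions: one ROB entry per instruction, no retirement, and no wrap-around of the circular queue. The function name `rename` and the tuple layout are hypothetical.

```python
def rename(instructions):
    """Rename registers for a list of (dest, src1, src2) architectural registers.

    Returns a list of (rob_entry, src1_tag, src2_tag). A source tag is either
    the architectural register name (value already committed) or the integer
    index of the ROB entry that will produce the value.
    """
    rat = {}       # architectural register -> ROB entry of its newest producer
    renamed = []
    for rob_entry, (dest, src1, src2) in enumerate(instructions):
        # Sources are looked up before the destination is remapped.
        tags = tuple(rat.get(src, src) for src in (src1, src2))
        rat[dest] = rob_entry
        renamed.append((rob_entry, *tags))
    return renamed

program = [
    ("eax", "ebx", "ecx"),   # eax = ebx op ecx
    ("edx", "eax", "ebx"),   # reads the eax produced by entry 0
    ("eax", "esi", "edi"),   # overwrites eax, but gets its own entry
    ("ebx", "eax", "edx"),   # reads eax from entry 2 and edx from entry 1
]
for line in rename(program):
    print(line)
# (0, 'ebx', 'ecx')
# (1, 0, 'ebx')
# (2, 'esi', 'edi')
# (3, 2, 1)
```

Because the third instruction writes to its own ROB entry, it does not have to wait for the second instruction to read the old value of eax, which is how renaming compensates for a small register file.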
The decoding of instructions for the PowerPC is straightforward. Two instructions are submitted to
the decoding/dispatch unit in a clock cycle. All the PowerPC instructions are decoded in one clock cycle.
The decoded instructions are held by the dispatch unit until an entry is available in the six-entry reorder
buffer (ROB). The ROB and commit units work like the same units of the PII. The difference is that the
ROB is a six-entry queue with twelve rename registers. Since the PowerPC has 32 general-purpose registers,
the pipeline design does not need as many rename registers as the PII, which has fewer GPRs. The dispatch
unit looks for operand dependencies and changes the operand to point to a rename register if necessary. The
instructions can finish in any order. The commit unit will commit them in program order.
The Pentium II takes 2.5 clock cycles to decode an instruction while the PowerPC takes only one
cycle. The PII’s MIS can take longer to decode because a complex instruction can generate a large number of
micro-ops.
The ability to execute operations out-of-order allows the execution units to achieve higher
throughput.
V Execute
The Pentium II (PII) has eight execution units. The first three units, complex integer, complex floating
point and simple floating point, are clustered on port 0 of the RS. Port 1 has simple integer and branch.
Ports 2, 3 and 4 have store, store address and load. The MMX instructions are executed with additional
stages on the integer units. Floating point divide is not pipelined.
The PowerPC has six execution units: integer 1, integer 2, floating point (FPU), branch (BPU), the
load/store unit (LSU) and a system register unit (SRU).
In the PII, the branch unit computes the results of the branch and compares the result to the predicted
result. If the prediction was accurate, all instructions in the speculative stream are marked as valid and
retired. If the branch was mis-predicted, the BTB is updated and the pipeline is flushed and restarted from
the new address. When an instruction finishes, its results and status are written back to the reorder buffer
(ROB).
In the PowerPC, a maximum of two instructions can be dispatched from the bottom two entries of the
instruction queue (IQ). Each of the two integer units (IU) has a single-entry reservation station (RS). Each
IU has three subunits: a fast adder, a logic unit and a unit for rotates and shifts. IU1 also has a 32-bit
multiplier and divider; the second integer unit cannot do multiplication or division. The FPU is pipelined
into three stages, allowing three FP instructions to be executing simultaneously. The load/store unit (LSU)
is pipelined into two stages and can execute load instructions out-of-order. Store instructions are executed
in-order. The system register unit (SRU) executes various system level instructions, condition register logic
operations and moves that use special purpose registers. The SRU executes instructions in program order.
VI Commit
In the PII, the retire unit (RU) checks the status of the micro-ops in the ROB. The RU retires the
micro-ops in program order. An operation must be complete, valid (no mis-predicted branches) and next in
program order to be retired. When an operation is retired, its entry is removed from the ROB and its results
are committed to the programmer-visible registers. The RU can retire up to three operations per clock.
The PowerPC completion unit (CU) works the same as the retire unit in the PII. The CU can retire
two instructions per clock cycle.
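In-order retirement from the ROB can be sketched as a queue whose head must be complete before anything behind it can retire. This is a simplified model: the dictionary keys `name` and `done` are hypothetical, and the validity check for mis-predicted branches is omitted.

```python
from collections import deque

def retire(rob, max_per_clock=3):
    """Retire completed entries from the head of the ROB, in program order.

    `rob` is a deque of dicts with the hypothetical keys 'name' and 'done'.
    Returns the names retired in this clock.
    """
    retired = []
    while rob and rob[0]["done"] and len(retired) < max_per_clock:
        retired.append(rob.popleft()["name"])
    return retired

rob = deque([
    {"name": "a", "done": True},
    {"name": "b", "done": False},   # still executing
    {"name": "c", "done": True},    # finished early, but must wait for b
])
print(retire(rob))       # ['a']
rob[0]["done"] = True    # b finishes
print(retire(rob))       # ['b', 'c']
```

Passing `max_per_clock=2` models the PowerPC completion unit's limit of two instructions per clock.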
VII Summary
Both processors are pipelined, superscalar and use dynamic execution. The PII and the PowerPC
have very similar pipelines that can be characterized by the following stages:
1) Fetch
2) Decode
3) Resource Allocation
4) Out-of-order execution
5) In-order commit
Both processors fetch 16 bytes from the instruction cache. The PII marks the instruction boundaries
and aligns the instructions for decoding. The PowerPC does not have to mark and align the instructions since
they are all four bytes in length.
The PII can decode a maximum of three instructions at once. The PowerPC can decode two
instructions and a branch instruction.
Both processors reassign operands to rename registers as necessary, which allows data forwarding and
greater throughput. The PII has 40 rename registers while the PowerPC has only 12. The PII needs
more rename registers because it has fewer general purpose registers (GPR).
The execution resources differ slightly. The reorder buffer (ROB) of the PII has 40 entries while that of
the PowerPC has only 6 entries. Both processors make use of reservation stations (RS) that can submit the
instructions out-of-order.
The commit unit of both processors will retire the instructions in-order and commit the results to the
programmer visible registers.
Figure 2
Figure 3