Pipelining

The PowerPC and Pentium II

CSE 3322-001
Fall 1999

November 8, 1999

I Introduction

Pipelining breaks the execution of an instruction into discrete stages. This allows work to be done on
different instructions in each stage during the same clock cycle, which improves overall instruction
throughput. The Pentium II and the PowerPC (MPC750) are two popular microprocessors that have implemented
pipelining to achieve greater performance. Both processors are superscalar, which refers to the ability to
initiate and complete more than one instruction per clock cycle. A superscalar processor can therefore
achieve a CPI (clocks per instruction) of less than 1.
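As a small worked example of the CPI claim above, the sketch below computes CPI from a cycle count and an instruction count. The function names and the sample counts are assumptions made for illustration; they are not measurements of either processor.

```python
def cpi(clock_cycles: int, instructions: int) -> float:
    """Clocks per instruction: total clocks divided by instructions completed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return clock_cycles / instructions


def best_case_cpi(completions_per_clock: int) -> float:
    """Lower bound on CPI if this many instructions complete every clock."""
    return 1 / completions_per_clock


# Hypothetical counts chosen for illustration, not measurements.
print(cpi(clock_cycles=300, instructions=600))  # 0.5
print(best_case_cpi(2))                         # 0.5
```

A scalar pipeline can at best complete one instruction per clock (CPI of 1), so any CPI below 1 requires completing more than one instruction in at least some clocks.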

The first IA (Intel Architecture) processor to include pipelining was the i386, which had a single pipeline
with three stages. The i486 expanded the pipeline to five stages. The Pentium added a second pipeline to
achieve two-way superscalar performance and also added branch prediction. The Pentium Pro has a three-way
superscalar architecture and out-of-order execution. The Pentium II added MMX instructions to the Pentium
Pro design, which required the addition of some new stages to the integer pipeline.

The PowerPC processor has been pipelined and superscalar since it came to the market in 1991. The
PowerPC pipeline has not changed much since the 601. The 603 added a separate floating-point execution
unit to the pipeline and the 740 added a second integer execution unit.

The instruction set architecture determines how easy it is to design a pipeline for the instruction
set, particularly the decoder. PowerPC (750) instructions are all one word (32 bits) long. x86
instructions vary in length from 1 byte to 15 bytes.
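To illustrate why instruction length matters to the decoder, here is a minimal sketch that finds instruction start offsets in a fetched block. The function names are hypothetical, and the variable-length case takes the instruction lengths as a given list; a real x86 decoder has to work each length out from the instruction bytes themselves.

```python
from typing import List

WORD_BYTES = 4  # every PowerPC instruction is one 32-bit word


def fixed_length_starts(block_len: int) -> List[int]:
    """Start offsets for fixed-length instructions.

    Each offset is known without looking at any other instruction.
    """
    return list(range(0, block_len, WORD_BYTES))


def variable_length_starts(lengths: List[int], block_len: int) -> List[int]:
    """Start offsets when every instruction has its own length.

    Each offset depends on the lengths of all instructions before it.
    `lengths` is assumed input for this sketch.
    """
    starts, offset = [], 0
    for n in lengths:
        if offset >= block_len:
            break
        starts.append(offset)
        offset += n
    return starts


print(fixed_length_starts(16))                         # [0, 4, 8, 12]
print(variable_length_starts([1, 3, 6, 2, 5], 16))     # [0, 1, 4, 10, 12]
```

In the fixed-length case all four offsets can be produced at once, while the variable-length case is a running sum that must be resolved before the instructions can be handed to decoders.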

II Overview

The pipelines for the PowerPC and the Pentium II are very similar. An instruction goes through four
phases in both processors. Refer to figure 1 for the generic pipeline organization of the PowerPC and the PII.

1) Fetch
2) Decode/Dispatch
3) Execute
4) Commit

The PII can send 16 bytes per clock to the decoding unit, for a maximum of three instructions. How many
instructions the Intel chip actually decodes depends on the instruction mix. The PowerPC can decode two
instructions per clock plus a branch instruction.

The PII has eight execution units clustered around five ports. Complex integer, complex floating point
and simple floating point are clustered around port 0. Simple integer and branch are clustered on port 1.
Load, store, and store address are on the remaining three ports. The PII can send a maximum of five
instructions to execution units in a clock cycle. The PowerPC has six execution units: two integer units,
a branch unit, a floating-point unit, a load/store unit and a system register unit. Figure 1 is a generic
representation of the two pipelines combined to show the similarities.

After execution, all instructions are sent to a commit unit that finalizes the results of an instruction
after all preceding instructions have been committed. This is necessary because instructions do not
necessarily finish in program order. In the event of a mis-predicted branch, all instructions after the
branch must be cleared and fetching restarts at the location indicated by the branch.

Figure 1: Generic pipeline organization of the PowerPC and the Pentium II. Blocks shown: PC, branch
prediction, instruction cache, instruction queue, decode/dispatch unit, register file, data cache, six
reservation stations, the execution units (integer, FPU, branch, load, store, store address, load/store
and system register), the reorder buffer and the commit unit. In the diagram, * marks a PowerPC unit and
** marks a Pentium II unit.

III Fetch

The PowerPC can retrieve 16 bytes (4 instructions) per clock into the 24-byte (6-instruction)
instruction queue (IQ). The PII can retrieve 32 bytes into a prefetch buffer. The PowerPC fetches four
instructions per clock, while the number of x86 instructions in a fetch is variable.
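The PowerPC figures above can be turned into a rough occupancy model of the instruction queue. This is only a sketch under stated assumptions: the dispatch rate of two per clock is taken from the decode description, dispatch is modeled before fetch within each clock, and stalls and branches are ignored. The function name `simulate_iq` is hypothetical.

```python
from typing import List

INSTRUCTION_BYTES = 4
IQ_ENTRIES = 24 // INSTRUCTION_BYTES        # 24-byte queue holds 6 instructions
FETCH_PER_CLOCK = 16 // INSTRUCTION_BYTES   # 16 bytes per clock is 4 instructions
DISPATCH_PER_CLOCK = 2                      # assumption: two instructions leave per clock


def simulate_iq(clocks: int) -> List[int]:
    """Queue occupancy at the end of each clock, starting from an empty queue."""
    occupancy = 0
    history = []
    for _ in range(clocks):
        # Assumed ordering: drain first, then refill what fits.
        occupancy -= min(DISPATCH_PER_CLOCK, occupancy)
        occupancy += min(FETCH_PER_CLOCK, IQ_ENTRIES - occupancy)
        history.append(occupancy)
    return history


print(simulate_iq(4))  # [4, 6, 6, 6]
```

Because the fetch rate exceeds the dispatch rate, the queue fills within two clocks in this model and then stays full, which gives the dispatch unit a steady supply of instructions.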

The instruction fetch unit for the PII has three stages (refer to figure 2):

1) IFU1 - Load 32 bytes into the prefetch buffer.

2) IFU2 - Mark the boundaries between the instructions in a 16-byte block. Present any branches to the
BTB (branch target buffer) for dynamic prediction. If a branch entry exists and the branch is predicted as
taken, the prefetch stream will be adjusted.

3) IFU3 - Align the instructions for presentation to the appropriate decoder.

The PowerPC has two stages (refer to figure 3):

1) Load up to 16 bytes into the 24 byte instruction queue

2) Present any branches to the BPU (branch processing unit)

Note that the PowerPC does not need to mark any boundaries between instructions or to change the
alignment of instructions for the decoders. The PowerPC also has one of its execution units, the BPU,
directly off of the fetching unit. The BPU contains an adder to compute the target address of the branch.
The BPU also calculates the return address of a procedure call and saves it in the programmer-visible link
register (LR). If a branch instruction does not update the count register (CTR) or the LR, it is removed
from the instruction stream.

Both processors maintain a history of previously seen branches. The Pentium II refers to its table as a
branch target buffer (BTB) and the PowerPC calls its table a branch history table (BHT). When referring to
both tables collectively, I will call them a branch table (BT).

The BT is a dynamic branch predictor. The BT stores the history of the previously seen branches and
their targets. When a branch instruction is fetched, its address is looked up in the BT. If the address is
found, a prediction is made; otherwise, an entry is generated in the BT and fetching continues in
sequential order (the branch is predicted as not taken). The BT contains the branch address, the
destination address and a status field, which is four bits in the PII and two bits in the PowerPC. Both
processors maintain 512 entries in the BT. If the branch is predicted as taken, the instruction buffer is
flushed and fetching continues from the address supplied by the BT.
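The lookup described above can be sketched as a small table with a two-bit status field per entry. The paper does not say how the status bits are interpreted, so this sketch assumes the common two-bit saturating counter (0 and 1 predict not taken, 2 and 3 predict taken). The direct-mapped indexing, the initial counter values and the class name `BranchTable` are also assumptions.

```python
from typing import Dict, Optional, Tuple

BT_ENTRIES = 512  # both processors keep 512 entries


class BranchTable:
    def __init__(self) -> None:
        # index -> (branch address, target address, 2-bit counter)
        self.entries: Dict[int, Tuple[int, int, int]] = {}

    @staticmethod
    def _index(addr: int) -> int:
        # Assumption: direct-mapped on the word address.
        return (addr >> 2) % BT_ENTRIES

    def predict(self, addr: int) -> Optional[int]:
        """Return the predicted target, or None for 'not taken' (including a miss)."""
        entry = self.entries.get(self._index(addr))
        if entry is None or entry[0] != addr:
            return None  # miss: fetching continues sequentially
        _, target, counter = entry
        return target if counter >= 2 else None

    def update(self, addr: int, taken: bool, target: int) -> None:
        """Record the actual outcome once the branch has executed."""
        i = self._index(addr)
        entry = self.entries.get(i)
        if entry is None or entry[0] != addr:
            counter = 2 if taken else 1  # assumed starting state for a new entry
        elif taken:
            counter = min(3, entry[2] + 1)
        else:
            counter = max(0, entry[2] - 1)
        self.entries[i] = (addr, target, counter)


bt = BranchTable()
print(bt.predict(0x1000))          # None: never seen, predicted not taken
bt.update(0x1000, True, 0x2000)
print(hex(bt.predict(0x1000)))     # 0x2000
```

With a saturating counter, a branch that has been taken repeatedly needs two consecutive not-taken outcomes before the prediction flips, which keeps a loop branch predicted as taken across a single loop exit.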

In the PII, the BTB can predict four branches simultaneously and uses Yeh's algorithm, a two-level
adaptive scheme. The four bits in the BTB record the branch's behavior the last four times it was executed.
The estimated accuracy is 90%. The BTB also has a return stack buffer (RSB), which saves the return
addresses of procedure calls. Procedure calls and returns are unconditional branches, and the RSB is used
to speed up returns from procedures.
The PowerPC can predict only two branches deep. The PowerPC maintains a 64-entry branch target
instruction cache (BTIC), which stores the first two instructions at the target of a previously taken
branch. When a branch is predicted as taken, the first two instructions from the new fetching address can
be loaded from the BTIC, one clock cycle sooner than if the fetching unit had to go to the instruction
cache. The branch processing unit (BPU) of the PowerPC retrieves branch instructions from the fetch queue
for dynamic branch prediction. The BPU will also get the branch instruction from the instruction queue (IQ)
for execution.

IV Decode

The PowerPC and the Pentium II differ the most in the decoding of instructions. This is due to the
widely variable instruction lengths of the x86.

The Pentium II decodes each instruction into one or more 118-bit micro-ops. The micro-ops are then
forwarded on through the pipeline. The PII can submit up to three instructions per clock to its three
decode units, dependent upon the instruction mix. The first unit can decode an instruction that is seven or
fewer bytes long and results in one to four micro-ops. The other two decoders can only decode an
instruction that results in one micro-op. The PII can rotate the three instructions to align them with an
appropriate decoder to yield better throughput. For instance, let a complex instruction be one that decodes
into two to four micro-ops. If the next three instructions presented to the decoder unit arrive in the
order complex/simple/simple, then the last stage of the fetch process passes them directly to the three
decoders. If the ordering is different, say simple/complex/simple, the decoder unit can rearrange the
instructions so that the complex instruction goes to the first decoder and the two simple instructions go
to decoders two and three. This is done while still preserving the program order of the instructions.

An instruction of five or more micro-ops must be run through the Micro Instruction Sequencer (MIS).
The MIS is a microprogram that translates an instruction into a sequence of micro-ops. A maximum of six
micro-ops per clock advance to the decoded instruction (ID) queue in strict program order. If any of the
operations in the ID queue are branch instructions, static branch prediction is used as a backup predictor
to the BTB. The ID queue can submit three micro-ops per clock to the register allocation table (RAT).
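The decoder rules above can be sketched as a grouping function that takes the micro-op count of each instruction, in program order, and returns the instructions decoded together in each clock. It is a simplified model: the seven-byte limit on the first decoder is ignored, and an instruction sent to the MIS is assumed to occupy a clock by itself. The function name `decode_groups` is hypothetical.

```python
from typing import List

MAX_PER_CLOCK = 3      # three decoders
MAX_COMPLEX_UOPS = 4   # the first decoder handles up to four micro-ops


def decode_groups(uop_counts: List[int]) -> List[List[int]]:
    """Group consecutive instructions into per-clock decode groups.

    Each group holds at most three instructions and at most one complex
    (2 to 4 micro-op) instruction, which is steered to the first decoder.
    """
    groups: List[List[int]] = []
    i = 0
    while i < len(uop_counts):
        if uop_counts[i] > MAX_COMPLEX_UOPS:
            # Five or more micro-ops: handled by the MIS (assumed to take the clock alone).
            groups.append([uop_counts[i]])
            i += 1
            continue
        group: List[int] = []
        complex_seen = False
        while i < len(uop_counts) and len(group) < MAX_PER_CLOCK:
            n = uop_counts[i]
            if n > MAX_COMPLEX_UOPS:
                break                  # leave the MIS instruction for the next clock
            if n > 1:
                if complex_seen:
                    break              # only one decoder accepts complex instructions
                complex_seen = True
            group.append(n)
            i += 1
        groups.append(group)
    return groups


print(decode_groups([1, 3, 1, 2, 1, 1, 6, 1]))  # [[1, 3, 1], [2, 1, 1], [6], [1]]
print(decode_groups([2, 2, 1]))                 # [[2], [2, 1]]
```

The second call shows the dependence on instruction mix: two complex instructions in a row cannot share a clock, so three instructions take two clocks to decode.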

In the PII, the RAT looks at the operands referred to by the micro-op and reassigns them as necessary
to point to the rename registers in the reorder buffer (ROB). This allows for data forwarding. Data
forwarding lets an instruction obtain the result of a previous instruction as soon as that result is
produced, before it has been committed to the programmer-visible registers. This allows the following
instruction to get an earlier start. The PII has a set of 40 rename registers in the ROB to facilitate
data forwarding. The ROB is a 40-entry circular queue and can accept three micro-ops per clock. The
reservation station (RS) can copy up to five micro-ops out of the ROB in one clock. The micro-op entry
remains in the ROB until the instruction finishes. The PII uses a 20-entry RS. The RS is the
scheduler/dispatcher for the rest of the pipeline. The RS holds a micro-op until its operands are
available, which is what causes out-of-order execution. If two micro-ops become ready at the same time for
the same resource, the RS uses a FIFO rule to determine which micro-op gets dispatched. The RS has five
ports with multiple execution units clustered on ports 0 and 1. The commit unit will commit the results in
order.
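The renaming step can be illustrated with a minimal table that maps each architectural register to the ROB entry holding its newest value. This is a sketch only: the class name `RenameTable`, the `robN` tags and the register names in the usage lines are illustrative, and freeing entries at retirement is not modeled.

```python
from typing import Dict, List, Tuple


class RenameTable:
    def __init__(self, rob_entries: int = 40) -> None:
        self.rob_entries = rob_entries     # 40 rename registers in the PII's ROB
        self.next_entry = 0                # the ROB is used as a circular queue
        self.rat: Dict[str, str] = {}      # architectural register -> newest ROB tag

    def rename(self, dest: str, srcs: List[str]) -> Tuple[str, List[str]]:
        """Rename one micro-op; returns (destination tag, source tags)."""
        # Sources are looked up first so an op that reads and writes the
        # same register sees the previous producer, not itself.
        src_tags = [self.rat.get(s, s) for s in srcs]
        tag = f"rob{self.next_entry}"
        self.next_entry = (self.next_entry + 1) % self.rob_entries
        self.rat[dest] = tag
        return tag, src_tags


rt = RenameTable()
print(rt.rename("eax", ["ebx", "ecx"]))  # ('rob0', ['ebx', 'ecx'])
print(rt.rename("edx", ["eax"]))         # ('rob1', ['rob0'])
print(rt.rename("eax", ["eax"]))         # ('rob2', ['rob0'])
```

The second micro-op is pointed at `rob0` rather than at `eax`, so it can pick up the value as soon as the first micro-op produces it; the third write to `eax` gets a fresh entry, so it does not have to wait for earlier readers of the old value.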

The decoding of instructions for the PowerPC is straightforward. Two instructions are submitted to
the decoding/dispatch unit in a clock cycle. All the PowerPC instructions are decoded in one clock cycle.
The decoded instructions are held by the dispatch unit until an entry is available in the six-entry
reorder buffer (ROB). The ROB and commit units work like the same units of the PII. The difference is that
the PowerPC ROB is a six-entry queue with twelve rename registers. Since the PowerPC has 32 general-purpose
registers, the pipeline design does not need as many rename registers as the PII, which has fewer GPRs.
The dispatch unit looks for operand dependencies and changes the operand to point to a rename register if
necessary. The instructions can finish in any order. The commit unit will commit them in program order.

The Pentium II takes 2.5 clock cycles to decode an instruction while the PowerPC takes only one
cycle. The PII’s MIS can take longer to decode because a complex instruction can generate a large number of
micro-ops.

The ability to execute operations out-of-order allows the execution units to achieve higher
throughput.

V Execute

The Pentium II (PII) has eight execution units. The first three units, complex integer, complex
floating point and simple floating point, are clustered on port 0 of the RS. Port 1 has simple integer and
branch. Ports 2, 3 and 4 have store, store address and load. The MMX instructions are executed with
additional stages on the integer units. Floating point divide is not pipelined.

The PowerPC has six execution units: integer 1, integer 2, floating point (FPU), branch (BPU), the
load/store unit (LSU) and a system register unit (SRU).

In the PII, the branch unit computes the results of the branch and compares the result to the predicted
result. If the prediction was accurate, all instructions in the speculative stream are marked as valid and
retired. If the branch was mis-predicted, the BTB is updated and the pipeline is flushed and restarted from
the new address. When an instruction finishes, its results and status are written back to the reorder buffer
(ROB).

In the PowerPC, a maximum of two instructions can be dispatched per clock from the bottom two entries
of the instruction queue (IQ). Each of the two integer units (IU) has a single-entry reservation station
(RS). Each IU has three subunits: a fast adder, a logic unit and a unit for rotates and shifts. Only IU1
has a 32-bit multiplier and divider, so the second integer unit cannot do multiplication or division. The
FPU is pipelined into three stages, allowing three FP instructions to be executing simultaneously. The
load/store unit (LSU) is pipelined into two stages and can execute load instructions out of order. Store
instructions are executed in order. The system register unit (SRU) executes various system-level
instructions, condition register logic operations and moves that use special-purpose registers. The SRU
executes instructions in program order.
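As a worked example of what a three-stage pipelined FPU buys, the sketch below compares cycle counts with and without pipelining. It is an idealization: every stage is assumed to take one clock and the instructions are assumed to be independent, so no stalls occur. The function names are hypothetical.

```python
def pipelined_cycles(n_instructions: int, stages: int) -> int:
    """Clocks to finish n independent instructions in an ideal pipeline.

    The first instruction needs `stages` clocks; each later one finishes
    one clock after its predecessor.
    """
    if n_instructions <= 0:
        return 0
    return stages + n_instructions - 1


def unpipelined_cycles(n_instructions: int, stages: int) -> int:
    """Clocks if each instruction must finish before the next one starts."""
    return max(0, n_instructions) * stages


FPU_STAGES = 3  # the PowerPC FPU is pipelined into three stages

print(pipelined_cycles(10, FPU_STAGES))    # 12
print(unpipelined_cycles(10, FPU_STAGES))  # 30
```

For long runs of independent FP instructions the pipelined unit approaches one result per clock, which is where the factor of three in this example comes from.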

VI Commit

In the PII, the retire unit (RU) checks the status of the micro-ops in the ROB. The RU retires the
micro-ops in program order. An operation must be complete, valid (no mis-predicted branches) and next in
program order to be retired. When an operation is retired, its entry is removed from the ROB and its
results are committed to the programmer-visible registers. The RU can retire up to three operations per
clock.

The PowerPC completion unit (CU) works the same as the retire unit in the PII. The CU can retire
two instructions per clock cycle.
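In-order retirement from a reorder buffer can be sketched as a queue whose oldest entry must be finished before anything leaves. The class and method names are hypothetical, and the validity check for mis-predicted branches is left out to keep the sketch short.

```python
from collections import deque
from typing import Deque, List


class ReorderBuffer:
    def __init__(self, capacity: int, retire_width: int) -> None:
        self.capacity = capacity          # 40 for the PII, 6 for the PowerPC
        self.retire_width = retire_width  # 3 per clock for the PII, 2 for the PowerPC
        self.entries: Deque[list] = deque()  # oldest first; each entry is [name, done]

    def allocate(self, name: str) -> bool:
        """Add an operation in program order; False means the buffer is full."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append([name, False])
        return True

    def finish(self, name: str) -> None:
        """Mark an operation as executed; this may happen in any order."""
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = True
                return
        raise KeyError(name)

    def retire(self) -> List[str]:
        """One clock of retirement: oldest finished operations only, up to the width."""
        retired: List[str] = []
        while (self.entries and self.entries[0][1]
               and len(retired) < self.retire_width):
            retired.append(self.entries.popleft()[0])
        return retired


rob = ReorderBuffer(capacity=6, retire_width=2)
for op in ("a", "b", "c"):
    rob.allocate(op)
rob.finish("c")
print(rob.retire())  # []  ("c" is done but "a" is still ahead of it)
rob.finish("a")
print(rob.retire())  # ['a']
rob.finish("b")
print(rob.retire())  # ['b', 'c']
```

The first call retires nothing even though one operation has finished, which is the behavior that keeps the programmer-visible state consistent with program order.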

VII Summary

Both processors are pipelined, superscalar and use dynamic execution. The PII and the PowerPC
have very similar pipelines that can be characterized by the following stages:

1) Fetch
2) Decode
3) Resource Allocation
4) Out-of-order execution
5) In-order commit

Both processors process instructions in 16-byte blocks during fetch, although the PII first loads
32 bytes into its prefetch buffer. The PII marks the instruction boundaries and aligns the instructions
for decoding. The PowerPC does not have to mark and align the instructions since they are all four bytes
in length.

The PII can decode a maximum of three instructions at once. The PowerPC can decode two
instructions and a branch instruction.

Both processors reassign operands as necessary to rename registers to allow data forwarding and
achieve greater throughput. The PII has 40 rename registers while the PowerPC has only 12. The PII needs
more rename registers because it has fewer general-purpose registers (GPRs).

The execution stages differ slightly. The reorder buffer (ROB) of the PII has 40 entries while that of
the PowerPC has only 6. Both processors make use of reservation stations (RS) that can submit instructions
to the execution units out of order.

The commit unit of both processors will retire the instructions in order and commit the results to the
programmer-visible registers.

Figure 2

Figure 3
