
Computer Architecture

Overview
Abstraction means programmers can describe algorithms in a ``high level'' notation
that is independent of details about the machine that will execute the algorithm.

Portability is a byproduct of abstraction that allows programs to be run on a wide
variety of computers as long as there is a compiler that will translate them for each
machine.
Computer architecture is the study of the components that make up computer systems
and how they are interconnected. Computer organization is concerned with the
implementation of a computer architecture.
Computer engineering refers to the actual construction of a system: lengths of wires,
sizes of circuits, cooling and electrical requirements, etc. Programmers often use
knowledge of a system's architecture, and sometimes organization, to optimize
performance of their programs, but rarely, if ever, are they concerned with
engineering aspects.

Why the need to know the architecture of a computer?


We aim to give enough background information on common structures such as vector
processors and cache memories so you will be able to
(a) recognize when your program is not performing near the capacity of your system,
(b) understand performance improvement techniques recommended by the compiler
writers and/or system architects of your system, and
(c) decide whether the benefits of increased performance are worth sacrificing
abstraction and portability.

Another, closely related, goal is to provide the necessary background in computer
architecture to evaluate competing algorithms to decide which is likely to be the most
efficient for a given machine, even before they are expressed in a programming
language.

A general-purpose computer has these parts:

1. processor: the ``brain'' that does arithmetic, responds to incoming information,
and generates outgoing information
2. primary storage (memory or RAM): the ``scratchpad'' that remembers
information that can be used by the processor. It is connected to the processor
by a system bus (wiring).

3. system and expansion busses: the transfer mechanisms (wiring plus connectors)
that connect the processor to primary storage and input/output devices.

A computer usually comes with several input/output devices: for input, a keyboard and a
mouse; for output, a display (monitor) and a printer; for both input and output, an
internal disk drive, memory key, CD reader/writer, etc., as well as connections to
external networks.

For reasons of speed, primary storage is connected ``more closely'' to the processor
than are the input/output devices. Most of the devices (e.g., internal disk, printer) are
themselves primitive computers in the sense that they contain simple processors that
help transfer information to/from the processor to/from the device.

Here is a simple picture that summarizes the above:

Information and binary coding

For humans, information can be pictures, symbols, words, sounds, movements, and
more. A typical computer has a keyboard and mouse so that words and movements
can be sent to the processor as information. The information must be converted into
electrical off-on (``0 and 1'') pulses that travel on the bus and arrive to the processor,
which can save them in primary storage.

It is premature to study precisely how numbers and symbols can be represented as off-
on (0-1) pulses, but here is a review of base-2 (binary) coding of numbers, which is the
concept upon which computer information is based:
number binary coding
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
...
14 1110
15 1111
and so on. It is possible to do arithmetic in base two, e.g. 3+5 is written:
0011
+0101
-----
1000
The addition works like normal (base-10) arithmetic, where 1 + 1 = 10 (0 with a carry
of 1). Subtraction, multiplication, etc., work this way, too, and it is possible to wire an
electrical circuit that mechanically does the addition of the 0s and 1s. Indeed, a
processor uses such a wiring, which operates on binary numbers held in registers,
where a register is a sequence of bits (electronic ``flip-flops'' each of which can
remember a 0 or 1). Here is a picture of an 8-bit register that holds the number 9:
+--+--+--+--+--+--+--+--+
| 0| 0| 0| 0| 1| 0| 0| 1|
+--+--+--+--+--+--+--+--+

A processor has multiple such registers, and it can compute 3+5 by placing 3 (0000
0011) and 5 (0000 0101) into two registers and then using the wiring between the
registers to compute the sum, which might be saved in a third register. A typical,
modern register has 32 bits, called a fullword. Such a register can store a value in the
approximate range of -2 billion to +2 billion.
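One way to picture this wiring in software is a small, purely illustrative C sketch (not
part of the original notes) that adds two numbers using only bitwise operations, the way
a chain of one-bit adders would: XOR gives the per-column sums and AND gives the carries.

#include <stdio.h>

/* add two numbers using only bitwise operations, mimicking the adder
   circuit: XOR produces the sum bits, AND produces the carries */
unsigned add(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned carry = (a & b) << 1;  /* a carry of 1 moves one column left */
        a = a ^ b;                      /* per-column sum, ignoring carries   */
        b = carry;
    }
    return a;
}

int main(void) {
    printf("%u\n", add(3, 5));  /* prints 8, just like the register example */
    return 0;
}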

When an answer, like 3+5 = 8, is computed, the processor might copy the answer to
primary storage to save it for later use. Later, the processor can copy the number from
storage back into a register and do more arithmetic with it.

The CPU
The processor is truly the computer --- it is wired to compute arithmetic and related
operations on numbers that it can hold in its data registers. A processor is also called
a Central Processing Unit (CPU).

Here is a simplistic picture of the parts of a processor:

3
 The data registers hold numbers for computation, as noted earlier.
 There is a simple clock --- a pulse generator --- that helps the Control Unit do
instructions in proper time steps.
 The arithmetic-logic unit (ALU) holds the wiring for doing arithmetic on the
numbers held in the data registers. (Review the addition example above.)
 The control unit holds wiring that triggers the arithmetic operations in the
ALU. How does the control unit know to request an addition or a subtraction?
The answer is: it obtains instructions, one at a time, that have been stored in
primary storage.
 The instruction counter is a register that tells the control unit where to find the
instruction that it must do. (The details will be explained shortly.)
 The instruction register is where the instruction can be copied and held for
study by the control unit.
 The address buffer and data buffer are two registers that are a ``drop-off'' point
when the processor wishes to copy information from a register to primary
storage (or read information from primary storage to a register). We study them
later.
 The interrupt register is studied much later.

A processor's speed is measured in Hertz (a kind of vibration speed) and is literally
the speed of the computer's internal clock; the larger the Hertz number, the faster the
processor.

Figure 3-1. The CPU


Primary storage

Primary storage (also called random-access memory --- RAM) is literally a long
sequence of fullwords, also called cells, where numbers can be saved for later use by
the processor. (Recall that a fullword is 32 bits.) Here is a simplistic picture:

The picture shows that each fullword (cell) is numbered by a
unique address (analogous to street addresses for houses), so that information
transferred from the processor can be saved at a specific cell's address and can later
be retrieved by referring to that same address.

The picture shows an additional component, the memory controller, which is itself a
primitive processor that can quickly find addresses and copy information stored in the
addresses to/from the system bus. This works faster than if the processor did the work
of reaching into storage to extract information.

When a number is copied from the processor into storage, we say it is written; when it
is copied from storage into the processor, we say it is read.

As the diagram suggests, the address lines in the system bus are wires that transfer the
bits that form the address of the cell in storage that must be read or written (the
address is transmitted from the processor's address buffer --- see the previous section);
the data lines are wires that transfer the information between the processor's data
buffer and the cell in storage; and the control lines transmit whether the operation is a
read or write to primary storage.

The tradition is to measure size of storage in bytes, where 8 bits equal one byte, and 4
bytes equal one fullword. The larger the number, the larger the storage.

To greatly simplify, a computer consists of a central processing unit (CPU) attached to
memory. The figure above illustrates the general principle behind all computer operations.

The CPU executes instructions read from memory. There are two categories of instructions:

1. Those that load values from memory into registers and store values from registers to
memory.
2. Those that operate on values stored in registers: for example adding, subtracting,
multiplying or dividing the values in two registers, performing bitwise operations (and,
or, xor, etc.) or performing other mathematical operations (square root, sin, cos, tan,
etc.).

So in the example we are simply adding 100 to a value stored in memory, and storing this new
result back into memory.

Stored programs

In the 1940s, John von Neumann realized that primary storage could hold not only
numbers, but patterns of bits that represented instructions that could tell the processor
(actually, tell the processor's control unit) what to do. A sequence of instructions was
called a program, and this was the beginning of stored-program, general purpose
computers, where each time a computer was started, it could receive a new program in
storage, which told the processor what computations to do.

Here is a simplistic example of a stored program that tells the processor to compute
the sum of three numbers held in primary storage at addresses 64, 65, and 66 and
place the result into the cell at address 67:
LOAD (read) the number at storage address 64 into data register 1
LOAD the number at storage address 65 into data register 2
ADD register 1 to register 2 and leave the sum in register 2
LOAD the number at address 66 to register 1
ADD register 1 to register 2 and leave the sum in register 2
STORE (write) the value in register 2 to storage address 67
Instructions like LOAD, ADD, and STORE can be represented as bit patterns that are
copied into the processor's instruction register.

Here is a simple coding of the six-instruction program, which is situated at addresses
1-6 of primary storage (and the numbers are at 64-66). The instructions are coded in
bit patterns, and we assume that LOAD is 1001, ADD is 1010, and STORE is 1011.
Registers 1 and 2 are 0001 and 0010. Storage addresses 64 -- 67 are of course 0100
0000 to 0100 0011.

The format of each instruction is: IIII RRRR DDDD DDDD, where IIII is the coding
that states the operation required, RRRR is the coding of which data register to use,
and DDDD DDDD is the data, which is either a storage address or another register
number.
PRIMARY STORAGE

address: contents
------- --------
0: ...
1: 1001 0001 0100 0000
2: 1001 0010 0100 0001
3: 1010 0010 0000 0001
4: 1001 0001 0100 0010
5: 1010 0010 0000 0001
6: 1011 0010 0100 0011
7: ...
...
64: 0000 0000 0000 0100
65: 0000 0000 0000 0011
66: 0000 0000 0000 0001
67: ...
...
(Note: I have shortened the instructions to 16 bits, rather than use 32, because I got
tired typing lots of zeros!)

The example is contrived, but it should convince you that it is indeed possible to
write instructions in terms of binary codings that a control unit can decode,
disassemble, and execute.
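To make that claim concrete, here is a minimal C sketch (an illustration, not part of
the original notes) that decodes and executes the six-instruction program above, using
the IIII RRRR DDDD DDDD format; the loop is also a preview of the fetch-decode-execute
cycle described in the next section.

#include <stdio.h>
#include <stdint.h>

#define LOAD  0x9   /* 1001 */
#define ADD   0xA   /* 1010 */
#define STORE 0xB   /* 1011 */

uint16_t memory[128];   /* primary storage: 16-bit cells, matching the shortened format */
uint16_t reg[16];       /* the data registers */

int main(void) {
    /* the six-instruction program from the text, at addresses 1-6 */
    memory[1] = 0x9140;  /* LOAD  R1 64 : 1001 0001 0100 0000 */
    memory[2] = 0x9241;  /* LOAD  R2 65 : 1001 0010 0100 0001 */
    memory[3] = 0xA201;  /* ADD   R2 R1 : 1010 0010 0000 0001 */
    memory[4] = 0x9142;  /* LOAD  R1 66 : 1001 0001 0100 0010 */
    memory[5] = 0xA201;  /* ADD   R2 R1 : 1010 0010 0000 0001 */
    memory[6] = 0xB243;  /* STORE R2 67 : 1011 0010 0100 0011 */
    memory[64] = 4; memory[65] = 3; memory[66] = 1;   /* the data */

    for (int counter = 1; counter <= 6; counter++) {   /* the instruction counter */
        uint16_t instr = memory[counter];              /* fetch                   */
        int op = (instr >> 12) & 0xF;                  /* decode IIII             */
        int r  = (instr >> 8) & 0xF;                   /* decode RRRR             */
        int d  = instr & 0xFF;                         /* decode DDDD DDDD        */
        switch (op) {                                  /* execute                 */
        case LOAD:  reg[r] = memory[d];       break;
        case ADD:   reg[r] = reg[r] + reg[d]; break;
        case STORE: memory[d] = reg[r];       break;
        }
    }
    printf("cell 67 holds %d\n", memory[67]);  /* prints 8: 4 + 3 + 1 */
    return 0;
}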

It is painful for humans to read and write such codings, which are called machine
language, and there are abbreviations, called assembly language, that use text forms.
Here is a sample assembly-language version of the addition program:
LOAD R1 64
LOAD R2 65
ADD R2 R1
LOAD R1 66
ADD R2 R1
STORE R2 67

Instruction cycle

The instruction cycle is the sequence of actions taken by the processor to execute one
instruction. Each time the processor's clock pulses (ticks), the control unit does these
steps (in reality, modern processors overlap the cycles of several instructions per clock pulse):

1. uses the number in the instruction counter to fetch an instruction from primary
storage and copy it into the instruction register
2. reads the pattern of bits in the instruction register and decodes the instruction
3. based on the decoding, tells the ALU to execute the instruction, which means
that the ALU manipulates the registers accordingly.
4. There is a fourth step in the instruction cycle, an interrupt check, that we study
later.

Of course, the control unit is not alive, and it does not ``read'' or ``tell'' anything to
anyone, but there is wiring between electrical components that propagate electrical 0-
1 signals --- a kind of falling domino game--- that gives the appearance of conscious
execution.

Here is a small example. Say that the clock has ``ticked'' (pulsed), and the instruction
counter holds 3. Say that address 3 in primary storage holds the coding of the
instruction, ADD R2 R1. The instruction cycle might go like this:

1. Fetch: Consult the instruction counter; see it holds 0000 0011, that is, 3. Signal
the memory controller to copy the contents of the cell at address 0000 0011 into
the data buffer.

When the instruction arrives, copy it from the data buffer into the instruction
register.

Increment the instruction counter to 4 (that is, 0000 0100).

2. Decode: Read the first (leading or high-order) bits and see that they indicate an
ADD. Extract the bits that state the two registers to be added, here, R2 and R1.
3. Execute: Signal the ALU to add the values in registers 1 and 2 and place the
result in register 2.

The previous description reads a bit tediously. This is OK, because the processor is
incredibly fast. Nonetheless, modern processors can be made even faster, because
while the ALU is doing the execution step, the controller can start the fetch-and-
decode steps of the next instruction cycle. This form of speedup is
called pipelining and is a topic intensively studied in computer architecture.

The forms of instruction that the processor can execute are called the instruction set.

There are these forms of instructions found in an instruction set:

1. data transfer between storage and registers (LOAD and STORE)


2. arithmetic and logic (ADD, SUBTRACT, ...)
3. control (test and branch) (the ALU perhaps resets the instruction counter)
4. input and output (the ALU sends a request on the system bus to an input/output
device to read or write new information into storage)

Even small examples are painful to write in assembly language, and people quickly
developed simpler notations that could be mechanically converted to assembly (which
could itself be mechanically converted into base-2 codings).

FORTRAN (formula translator language) is a famous example, developed in the
1950s by John Backus. When a human writes a program using FORTRAN, she writes
a set of mathematical equations that the computer executes. Instead of using specific
numerical storage addresses, names from algebra (``variable names''), like x and y, can
be used instead.

Here is an example, coded in FORTRAN, that places a value in a storage cell, named
x, and then divides it by 2, saving the result again in the same cell:
x = 3.14159
x = x / 2
And here is an example that divides x by y, saving the answer in x's cell, provided
that y has a non-zero value:
if ( y .NE. 0 ) x = x / y
(read this as ``if y is not equal to 0, then compute x = x / y'')

With some work, one can write a program that mechanically translates FORTRAN
programs into (long) sequences of machine code; such a program is called a compiler.

There is another ``translation program,'' called an interpreter, which does not convert
a program to machine code, but instead reads a program one line at a time and tells the
processor to execute ``pre-fabricated'' sequences of instructions that match the
program's lines. These concepts are developed in another lecture.

Languages like FORTRAN (and COBOL and LISP and C and Java and ...) are
called high-level programming languages.

Secondary storage: disks

The previous section stated that programs and numbers can be saved in primary
storage. But there is a limited amount of primary storage, and it is used to hold the
program that the computer executes now. Programs and information that are saved for
later use can be copied to secondary storage, such as the internal disk that is common
to almost all computers.

Although it looks and operates differently than primary storage, it is perfectly fine to
think of disk storage (and other forms of secondary storage, like a memory key or a
CD), as a variant of primary storage, connected to the processor by means of the
system bus, using its own controller to help read and write information. The main
distinction is that secondary storage is cheaper (to buy) than primary storage, but it
is slower to read and write information to and from it.

A typical computer uses disk secondary storage to hold a wide variety of programs
that can be copied into primary storage for execution, as requested by the user.
Secondary storage is also used to archive data files.

Secondary-storage devices are activated when the processor executes a READ or
WRITE instruction. These instructions are not as simple to do as the LOAD and
STORE instructions, because the secondary-storage devices are so slow, and the
processor should not waste time, doing nothing, waiting for the device to finish its
work.

The solution is: The processor makes the request for a read or write and then
proceeds to do other work.

Consider how a processor might execute a WRITE instruction to the disk; here is how
the instruction cycle might go:

1. Fetch: The control unit obtains the instruction from primary storage and places
it in the instruction register, as usual.

2. Decode: The control unit reads the instruction and determines that it is a
WRITE. It extracts the name of the device to be written (the disk), it extracts the
address on the device where the information should be written, and it extracts
the name of the register that holds the information to be written.
3. Execute: The control unit writes the address and data to the disk's address
buffer and data buffer, which are two fullwords in primary storage. When these
writes are finished, the controller signals the disk along the control lines of the
system bus that there is information waiting for it in primary storage.

Now that the processor has initiated the disk-write, it proceeds to the next instruction
to execute, and at the same time, the disk starts to spin, its own controller does a read
of primary storage for the address and data information saved there, and finally, the
data is written from primary storage to the disk.

Each secondary-storage device has its own ``buffers'' reserved for it in primary
storage --- this is simpler than wiring the processor for buffers for each possible
storage device.

An important ``secondary storage'' device (actually, it is an output device!) is the
computer's display. A typical display is a huge grid of pixels (colored dots), each of
which is defined by a trio of red-green-blue numerical values. The display has a huge
buffer in primary storage, where there is one (or more) cell that describes the color of
each pixel. A write instruction executed by the processor causes the display's buffer to
be altered at the appropriate cells, and the display's controller (called the ``video
controller'') reads the information in the buffer and copies its contents to the display,
thus repainting the display.
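As a sketch of the idea (the 640x480 size and the packing of one pixel per fullword are
assumptions for illustration, not details from the notes), the display's buffer is just
an array of cells, one per pixel:

#include <stdint.h>

#define WIDTH  640
#define HEIGHT 480

uint32_t framebuffer[WIDTH * HEIGHT];   /* one fullword per pixel */

/* pack a red-green-blue trio into one cell and write it; the video
   controller would later copy this cell's value to the screen */
void set_pixel(int x, int y, uint8_t r, uint8_t g, uint8_t b) {
    framebuffer[y * WIDTH + x] =
        ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

int main(void) {
    set_pixel(10, 20, 255, 0, 0);   /* a red dot at column 10, row 20 */
    return 0;
}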

To summarize, here is a picture of a computer with buffers reserved for input/output
devices in primary storage:

It is important to see in the picture that (the controllers in) the various storage devices
can use the system bus to read/write from primary storage without bothering the
processor. So, input and output can proceed at the same time that the processor
executes instructions.

When a computer is connected to an outside network, the network can also be
considered a kind of secondary-storage device that responds to read and write
instructions, but the format of the reads and writes is far more complex --- they must
include the address of the destination computer, the kind of data transmitted, the stage
of interaction that is being done, etc. So, there are standardized patterns of bits,
called protocols, that must be transmitted as ``reads'' and ``writes'' from the processor
to the system bus to the port to the network. To accomplish a complete read or write,
there might well be multiple transmissions from processor to bus to port to network.
The design of protocols is a crucial issue to computer networks.

Interrupts

The previous section noted that a processor should not wait for a secondary-storage
device to complete a write operation. But what if the processor asks the device to
perform a read operation, how will the processor know when the information has been
successfully read and deposited into the device's buffer in storage?

Here is a second, similar situation: A human presses the mouse's button, demanding
attention from the processor (perhaps to start or stop a program or to provide input to
the program that the processor is executing). How is the processor signaled about the
mouse click?

To handle these situations, all processors are wired for interruption of their normal
executions. Such an interruption is called an interrupt.

Recall the standard execution cycle:

1. fetch
2. decode
3. execute
4. check for interrupts

and recall the extra register, the interrupt register, that is embedded in the processor:

The interrupt register is connected to the system bus, so that when a secondary storage
device has completed an action, it signals the control unit by setting to 1 one of the
bits in the interrupt register.

Now, we can explain the final step of the execution cycle, the check for interrupts:
After the execution step, the control unit examines the contents of the interrupt
register, checking to see if any bit in the register is set to 1. If all bits are 0, then no
device has completed an action, so the processor can start a new instruction.

But if a bit is set to 1, then there is an interrupt --- the processor must pause its
execution and do whatever instructions are needed:

For example, perhaps the user has pressed the mouse button. The device controller for
the mouse sends a signal on the system bus to set to 1 the bit for a ``mouse interrupt''
in the interrupt register. When the control unit examines the interrupt register at the
end of its current execution cycle, it sees that the bit for the mouse is set to 1. So, it
resets the bit to 0 and resets the instruction counter to the address of the program that
must be executed whenever the mouse button is pressed. Once the mouse-button
program finishes, the processor can resume the work it was doing.

The mouse-button program is called an interrupt handler.

The previous story skipped a lot of details: Where does the processor find the
interrupt-handler program for the mouse? What happens to the information resting in
the registers if we must pause execution and start a new program, namely, the

interrupt handler? What if more than one interrupt bit is set? What if a new interrupt
bit gets set while the processor is executing the mouse-button program?

Some of the answers are a bit complex. Based on this picture, we can provide
simplistic answers:

Cells in primary storage hold the addresses of the starting instructions for each of the
interrupt handlers for the devices. The sequence of addresses is called an interrupt
vector. The processor finds the address of the needed interrupt handler from the
interrupt vector.
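As a simplistic illustration, the interrupt vector can be pictured as an array of handler
addresses indexed by device number. In the C sketch below, the eight-device limit and all
the names are assumptions made for the example:

#include <stdint.h>

typedef void (*handler_t)(void);          /* a handler is just code at some address */

#define NUM_DEVICES 8
handler_t interrupt_vector[NUM_DEVICES];  /* one handler address per device         */
volatile uint8_t interrupt_register;      /* one bit per device, set by controllers */

/* the ``check for interrupts'' step of the execution cycle */
void check_for_interrupts(void) {
    for (int dev = 0; dev < NUM_DEVICES; dev++) {
        if (interrupt_register & (1u << dev)) {
            interrupt_register &= ~(1u << dev);   /* reset the bit to 0        */
            interrupt_vector[dev]();              /* run that device's handler */
        }
    }
}

static void mouse_handler(void) { /* the mouse-button program would go here */ }

int main(void) {
    interrupt_vector[0] = mouse_handler;   /* say device 0 is the mouse    */
    interrupt_register |= 1u << 0;         /* the controller flags a click */
    check_for_interrupts();                /* runs mouse_handler once      */
    return 0;
}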

Before the processor starts executing an interrupt handler, it must copy the current
values in all its registers to a register-save area in primary storage. When the interrupt
handler is finished, the values in the register-save area are copied back into the
registers in the processor, so that the processor can resume what it was doing before
the interrupt.

The case of multiple interrupts is not covered here, but the basic idea is that an
executing interrupt handler can itself be interrupted and its own registers can be saved.

The Operating System

The previous narrative shows that the computer's operation is getting complicated ---
there are special storage areas, special programs, etc. It is useful to have a startup
program that creates these special items and manages everything.

The startup- and manager-program is the operating system. When the computer is first
started, the operating system is the program that executes first. As noted, it initializes
the computer's storage as well as the controllers for the various devices. The interrupt
handlers just discussed are considered parts of the operating system.

In addition, the operating system helps the processor execute multiple programs
``simultaneously'' by executing each program a bit at a time. This technique, which is
studied carefully in another lecture, is crucial so that a human user can start and use,
say, a web browser and a text editor, at the same time.

The operating system is especially helpful at managing one particular output device
--- the computer's display. The operating system includes a program called
the window manager, which when executed, paints and repaints as needed the pixels
in the display. The window manager must be executing ``all the time,'' even while the
human user starts programs like a web browser, text editor, etc.

The operating system lets the window manager repaint the display in stages: when the
window-manager program repaints the display, it must execute a sequence of WRITE
instructions. When the processor executes one of the WRITE instructions, this triggers
the display's controller to paint part of the display. When the display controller
finishes painting the part, it sets a bit in the interrupt register so that the interrupt
handler for the display can execute and tell the processor to restart the window
manager and continue repainting the display. In this way, the window manager is
executing ``all the time,'' in starts and stops.

Here is a revised picture of the computer's storage, which shows the inclusion of the
operating system (``OS'') and the division of the remaining storage for the multiple
user programs that are executing:

The actions of the operating system are developed in a later lecture.

Branching
Apart from loading or storing, the other important operation of a CPU is branching. Internally,
the CPU keeps a record of the next instruction to be executed in the instruction pointer.
Usually, the instruction pointer is incremented to point to the next instruction sequentially;
the branch instruction will usually check if a specific register is zero or if a flag is set and, if
so, will modify the pointer to a different address. Thus the next instruction to execute will be
from a different part of the program; this is how loops and decision statements work.

For example, a statement like if (x==0) might be implemented by finding the or of two
registers, one holding x and the other zero; if the result is zero the comparison is true (i.e.
all bits of x were zero) and the body of the statement should be taken; otherwise the
processor branches past the body code.
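A rough way to see the shape of that translation is this hand-written C sketch (illustrative
only, not actual compiler output): the branch form tests the opposite condition and jumps
past the body when it holds.

#include <stdio.h>

void do_body(void) { printf("x was zero\n"); }

int main(void) {
    int x = 0;

    /* source form:   if (x == 0) do_body();       */
    /* branch form: test, then jump past the body  */
    if (x != 0) goto skip;    /* the conditional branch */
    do_body();                /* the body of the if     */
skip:
    return 0;
}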

Cycles

We are all familiar with the speed of the computer, given in Megahertz or Gigahertz (millions
or billions of cycles per second). This is called the clock speed since it is the speed
that an internal clock within the computer pulses.

The pulses are used within the processor to keep it internally synchronised. On each tick or
pulse another operation can be started; think of the clock like the person beating the drum to
keep the rowers' oars in sync.

Fetch, Decode, Execute, Store

Executing a single instruction consists of a particular cycle of events: fetching, decoding,
executing and storing.

For example, to do the add instruction above the CPU must:

1. Fetch: get the instruction from memory into the processor.
2. Decode: internally decode what it has to do (in this case add).
3. Execute: take the values from the registers and actually add them together.
4. Store: store the result back into another register. You might also see the
term retiring the instruction.

Looking inside a CPU

Internally the CPU has many different sub components that perform each of the above steps,
and generally they can all happen independently of each other. This is analogous to a physical
production line, where there are many stations where each step has a particular task to
perform. Once done it can pass the results to the next station and take a new input to work
on.

Figure 3-2. Inside the CPU

Above we have a very simple block diagram illustrating some of the main parts of a modern
CPU.

You can see the instructions come in and are decoded by the processor. The CPU has two
main types of registers, those for integer calculations and those for floating
point calculations. Floating point is a way of representing numbers with a decimal place in
binary form, and is handled differently within the CPU. MMX (multimedia extension)
and SSE (Streaming SIMD Extensions) or Altivec registers are similar to floating
point registers.

A register file is the collective name for the registers inside the CPU. Below that we have the
parts of the CPU which really do all the work.

We said that processors are either loading or storing a value into a register or from a register
into memory, or doing some operation on values in registers.

The Arithmetic Logic Unit (ALU) is the heart of the CPU operation. It takes values in registers
and performs any of the multitude of operations the CPU is capable of. All modern processors
have a number of ALUs so each can be working independently. In fact, processors such as the
Pentium have both fast and slow ALUs; the fast ones are smaller (so you can fit more on the
CPU) but can do only the most common operations, slow ALUs can do all operations but are
bigger.

The Address Generation Unit (AGU) handles talking to cache and main memory to get values
into the registers for the ALU to operate on and get values out of registers back into main
memory.

Floating point registers have the same concepts, but use slightly different terminology for
their components.

Pipelining

As we can see above, the ALU adding registers together is completely separate from the
AGU writing values back to memory, so there is no reason why the CPU cannot be doing both
at once. We also have multiple ALUs in the system, each of which can be working on separate
instructions. Finally the CPU could be doing some floating point operations with its floating
point logic whilst integer instructions are in flight too. This process is called pipelining.[1] In
fact, any modern processor has many more than four stages it can pipeline; above we have
shown only a very simplified view. The more stages that can be executed at the same time,
the deeper the pipeline. A processor that can also issue multiple instructions at once into its
pipelines is referred to as a superscalar architecture. All modern processors are superscalar.

Another analogy might be to think of the pipeline like a hose that is being filled with marbles,
except our marbles are instructions for the CPU. Ideally you will be putting your marbles in
one end, one after the other (one per clock pulse), filling up the pipe. Once full, for each
marble (instruction) you push in all the others will move to the next position and one will fall
out the end (the result).

Branch instructions play havoc with this model however, since they may or may not cause
execution to start from a different place. If you are pipelining, you will have to basically
guess which way the branch will go, so you know which instructions to bring into the pipeline.
If the CPU has predicted correctly, everything goes fine![2] Processors such as the Pentium
use a trace cache to keep track of which way branches are going. Much of the time the CPU
can predict which way a branch will go by remembering its previous result. For example, in a
loop that happens 100 times, if you remember the last result of the branch you will be right
99 times, since only the last time will you actually continue with the program.

Conversely, if the processor has predicted incorrectly it has wasted a lot of time and has to
clear the pipeline and start again.

This process is usually referred to as a pipeline flush and is analogous to having to stop and
empty out all your marbles from your hose!

Branch Prediction

pipeline flush, predict taken, predict not taken, branch delay slots

Reordering

In fact, if the CPU is the hose, it is free to reorder the marbles within the hose, as long as
they pop out the end in the same order you put them in. We call this program order since this
is the order that instructions are given in the computer program.

Consider an instruction stream such as

Figure 3-3. Reorder buffer example

1: r3 = r1 * r2
2: r4 = r2 + r3
3: r7 = r5 * r6
4: r8 = r1 + r7

Instruction 2 needs to wait for instruction 1 to complete fully before it can start. This means
that the pipeline has to stall as it waits for the value to be calculated. Similarly instruction 4
must wait for instruction 3 to produce r7. However, instructions 2 and 3 have no dependency
on each other at all; they operate on completely separate registers. If we swap instructions 2
and 3 we get a much better ordering for the pipeline, since the processor can be doing useful
work rather than waiting for a previous instruction to complete.
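For clarity, the swapped stream (this reordered listing is not in the original figure) would be:

1: r3 = r1 * r2
3: r7 = r5 * r6
2: r4 = r2 + r3
4: r8 = r1 + r7

Now the independent multiply in instruction 3 proceeds while instruction 1 is still completing,
rather than the pipeline sitting idle.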

However, when writing very low level code some instructions may require guarantees
about how operations are ordered. We call this requirement memory semantics. If an
operation has acquire semantics, no memory operation that comes after it may be reordered
to happen before it. If an operation has release semantics, all memory operations that come
before it must be completed and visible before it. Another even stricter semantic is a memory
barrier or memory fence, which requires that operations have been committed to memory
before continuing.

On some architectures these semantics are guaranteed for you by the processor, whilst on
others you must specify them explicitly. Most programmers do not need to worry directly
about them, although you may see the terms.
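For readers who do want to see the terms in action, here is a hedged C11 sketch (the thread
setup and names are assumptions for illustration): the producer publishes data with a releasing
store, and the consumer spins with acquiring loads, so once the consumer sees the flag it is
guaranteed to see the data.

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

int data;               /* the payload being published */
atomic_int ready = 0;   /* the flag that protects it   */

void *producer(void *arg) {
    (void)arg;
    data = 42;                          /* ordinary write                 */
    atomic_store_explicit(&ready, 1,    /* release: the write to data is  */
        memory_order_release);          /* visible before the flag is     */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready,   /* acquire: reads after this   */
        memory_order_acquire))             /* are guaranteed to see data  */
        ;                                  /* spin until the flag is set  */
    printf("data = %d\n", data);           /* always prints 42            */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}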

CISC v RISC

A common way to divide computer architectures is into Complex Instruction Set
Computer (CISC) and Reduced Instruction Set Computer (RISC).

Note in the first example, we have explicitly loaded values into registers, performed an
addition and stored the result value held in another register back to memory. This is an
example of a RISC approach to computing -- only performing operations on values in registers
and explicitly loading and storing values to and from memory.

A CISC approach may have only a single instruction taking values from memory, performing
the addition internally and writing the result back. This means the instruction may take many
cycles, but ultimately both approaches achieve the same goal.

All modern architectures would be considered RISC architectures.[3] Even the most common
architecture, the Intel Pentium, whilst having an instruction set that is categorised as CISC,
internally breaks down instructions into RISC-style sub-instructions inside the chip before
executing them.

There are a number of reasons for this:

 Whilst RISC makes assembly programming more complex, virtually all programmers
use high level languages and leave the hard work of producing assembly code to the
compiler, so the other advantages outweigh this disadvantage.
 Because the instructions in a RISC processor are much more simple, there is more
space inside the chip for registers. As we know from the memory hierarchy, registers
are the fastest type of memory and ultimately all instructions must be performed on
values held in registers, so all other things being equal more registers leads to higher
performance.
 Since all instructions execute in the same time, pipelining is possible. We know
pipelining requires streams of instructions being constantly fed into the processor, so
if some instructions take a very long time and others do not, the pipeline becomes far
too complex to be effective.

EPIC

The Itanium processor is an example of a modified architecture called Explicitly Parallel
Instruction Computing.

We have discussed how superscalar processors have pipelines that have many instructions in
flight at the same time in different parts of the processor. Obviously for this to work as well
as possible, instructions should be given to the processor in an order that can make best use
of the available elements of the CPU.

Traditionally organising the incoming instruction stream has been the job of the hardware.
Instructions are issued by the program in a sequential manner; the processor must look ahead
and try to make decisions about how to organise the incoming instructions.

The theory behind EPIC is that there is more information available at higher levels which can
make these decisions better than the processor. Analysing a stream of assembly language
instructions, as current processors do, loses a lot of information that the programmer may
have provided in the original source code. Think of it as the difference between studying a
Shakespeare play and reading the Cliff's Notes version of the same. Both give you the same
result, but the original has all sorts of extra information that sets the scene and gives you
insight into the characters.

Thus the logic of ordering instructions can be moved from the processor to the compiler. This
means that compiler writers need to be smarter to try and find the best ordering of code for
the processor. The processor is also significantly simplified, since a lot of its work has been
moved to the compiler.[4]

[4] Another term often used around EPIC is Very Long Instruction Word (VLIW), which is
where each instruction to the processor is extended to tell the processor about where it
should execute the instruction in its internal units. The problem with this approach is
that code is then completely dependent on the model of processor it has been compiled
for. Companies are always making revisions to hardware, and making customers
recompile their application every single time, and maintain a range of different binaries,
was impractical.

EPIC solves this in the usual computer science manner by adding a layer of abstraction.
Rather than explicitly specifying the exact part of the processor the instructions should
execute on, EPIC creates a simplified view with a few core units like memory, integer
and floating point.

Memory
Memory is a passive component that simply stores information until it is requested by another
part of the system. During normal operations it feeds instructions and data to the processor,
and at other times it is the source or destination of data transferred by I/O devices.

Memory Hierarchy

The CPU can only directly fetch instructions and data from cache memory, located directly on
the processor chip. Cache memory must be loaded in from the main system memory (the
Random Access Memory, or RAM). RAM, however, only retains its contents when the power is
on, so data that must survive a power cycle needs to be kept on more permanent storage.

We call these layers of memory the memory hierarchy.

Table 3-1. Memory Hierarchy

Fastest -- Cache: Cache memory is memory actually embedded inside the CPU. Cache memory
is very fast, typically taking only one cycle to access, but since it is embedded directly into
the CPU there is a limit to how big it can be. In fact, there are several sub-levels of cache
memory (termed L1, L2, L3), each slightly slower and larger than the one before.

Middle -- RAM: All instructions and storage addresses for the processor must come from RAM.
Although RAM is very fast, there is still some significant time taken for the CPU to access it
(this is termed latency). RAM is stored in separate, dedicated chips attached to the
motherboard, meaning it is much larger than cache memory.

Slowest -- Disk: We are all familiar with software arriving on a floppy disk or CDROM, and
saving our files to the hard disk. We are also familiar with the long time a program can take
to load from the hard disk -- having physical mechanisms such as spinning disks and moving
heads means disks are the slowest form of storage. But they are also by far the largest form
of storage.

The important point to know about the memory hierarchy is the trade offs between speed and
size -- the faster the memory the smaller it is. Of course, if you can find a way to change this
equation, you'll end up a billionaire!

Cache in depth

Cache is one of the most important elements of the CPU architecture. To write efficient code
developers need to have an understanding of how the cache in their systems works.

The cache is a very fast copy of the slower main system memory. Cache is much smaller than
main memory because it is included inside the processor chip alongside the registers and
processor logic. This is prime real estate in computing terms, and there are both economic
and physical limits to its maximum size. As manufacturers find more and more ways to cram
more and more transistors onto a chip, cache sizes grow considerably, but even the largest
caches are tens of megabytes, rather than the gigabytes of main memory or terabytes of
hard disk otherwise common. (XXXexample)

The cache is made up of small chunks of mirrored main memory. The size of these chunks is
called the line size, and is typically something like 64 bytes. When talking about cache, it
is very common to talk about the line size, or a cache line, which refers to one chunk of
mirrored main memory. The cache can only load and store memory in sizes a multiple of a
cache line.
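To illustrate (the 64-byte figure below is an assumption; real line sizes vary), every address
inside the same line-sized chunk maps to the same cache line, so touching one byte pulls its
whole line into the cache:

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed cache line size in bytes */

int main(void) {
    char buf[256];
    /* addresses in the same 64-byte chunk share a line index, so reading
       buf[0] brings buf[0..63] into the cache as a single line */
    for (int i = 0; i < 256; i += 32) {
        uintptr_t addr = (uintptr_t)&buf[i];
        printf("buf[%3d] -> line %lu\n", i, (unsigned long)(addr / LINE_SIZE));
    }
    return 0;
}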

As the cache is quite small compared to main memory, it will obviously fill up quite quickly as
a process goes about its execution. Once the cache is full the processor needs to get rid of a
line to make room for a new line. There are many algorithms by which the processor can
choose which line to evict; for example least recently used (LRU) is an algorithm where the
oldest unused line is discarded to make room for the new line.

When data is only read from the cache there is no need to ensure consistency with main
memory. However, when the processor starts writing to cache lines it needs to make some
decisions about how to update the underlying main memory. A write-through cache writes
the changes directly into the main system memory as the processor updates the cache; this
is slower, since writing to main memory is, as we have seen, much slower than writing to
cache. Alternatively a write-back cache delays writing the changes to RAM until absolutely
necessary. The obvious advantage is that less main memory access is required when cache
entries are written. Cache lines that have been written but not committed to memory are
referred to as dirty. The disadvantage is that when a cache entry is evicted, it may require
two memory accesses (one to write the dirty data back to main memory, and another to load
the new data).

XXX: associativity, thrashing?

Peripherals and busses

Peripherals are any of the many external devices that connect to your computer. Obviously,
the processor must have some way of talking to the peripherals to make them useful.

The communication channel between the processor and the peripherals is called a bus. The
devices directly connected to the processor use a type of bus called Peripheral Component
Interconnect, commonly referred to as PCI.

PCI Bus

PCI transfers data between the device and memory, but importantly allows for the automatic
configuration of attached peripherals. The configuration broadly falls into two categories:

Interrupts

An interrupt allows the device to literally interrupt the processor to flag some information.
For example, when a key is pressed, an interrupt is generated and delivered to the CPU. An
interrupt number (called the IRQ) is assigned to the device on system boot by the system BIOS.

When the device wants to interrupt, it signals the processor by raising the voltage on its
interrupt pin. The processor will acknowledge the interrupt, and pass the IRQ on to the
operating system. This part of the operating system code is called the interrupt handler.

The interrupt handler knows what to do with the interrupt because, when each device driver
initialises, it registers itself with the kernel to accept interrupts from the peripheral it
is written for. So as the interrupt arrives it is passed to the driver, which can deal with the
information from the device correctly.

Most drivers will split handling of interrupts into top and bottom halves. The top half
will acknowledge the interrupt and return the processor to what it was doing quickly. The
bottom half will then run later, when the CPU is free, and do the more intensive processing.
This is to stop an interrupt hogging the entire CPU.

IO Space

Obviously the processor will need to communicate with the peripheral device, and it does this
via IO operations. The most common form of IO is so called memory mapped IO where
registers on the device are mapped into memory.

This means that to communicate with the device, you need simply read or write to a specific
address in memory. TODO: expand
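Pending that expansion, here is a hedged C sketch of the idea; the device address and its
register layout are invented for the example, and on a real system the operating system
would first have to map the device's memory into your address space.

#include <stdint.h>

#define DEV_BASE 0x10000000u   /* hypothetical physical address of the device */

int main(void) {
    /* volatile tells the compiler every access really touches the device,
       so the reads and writes cannot be optimised away or reordered */
    volatile uint32_t *dev_reg = (volatile uint32_t *)DEV_BASE;

    *dev_reg = 0x1;              /* write: send a command to the device */
    uint32_t status = *dev_reg;  /* read: fetch the device's status     */
    (void)status;
    return 0;
}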

DMA

Since the speed of devices is far below the speed of processors, there needs to be some way
to avoid making the CPU wait around for data from devices.

Direct Memory Access (DMA) is a method of transferring data directly between a peripheral
and system RAM.

The driver can set up a device to do a DMA transfer by giving it the area of RAM to put its
data into. It can then start the DMA transfer and allow the CPU to continue with other tasks.

Once the device is finished, it will raise an interrupt and signal to the driver the transfer is
complete. From this time the data from the device (say a file from a disk, or frames from a
video capture card) is in memory and ready to be used.

Other Busses

Other busses connect between the PCI bus and external devices. Some of you will have heard
of

 USB/Firewire for small external data devices.
 IDE/SCSI for disk drives.

Small to big systems

As Moore's law predicted, computing power has been growing at a furious pace and shows
no signs of slowing down. It is now relatively uncommon for high end servers to contain only
a single CPU. This is achieved in a number of different fashions.

Symmetric Multi-Processing

Symmetric Multi-Processing, commonly shortened to SMP, is currently the most common
configuration for including multiple CPUs in a single system.

The symmetric term refers to the fact that all the CPUs in the system are the same (e.g.
architecture, clock speed). In an SMP system there are multiple processors that share other
system resources (memory, disk, etc).

Cache Coherency

For the most part, the CPUs in the system work independently; each has its own set of
registers, program counter, etc. Despite running separately, there is one component that
requires strict synchronization.

This is the CPU cache; remember the cache is a small area of quickly accessible memory that
mirrors values stored in main system memory. If one CPU modifies data in main memory and
another CPU has an old copy of that memory in its cache the system will obviously not be in a
consistent state. Note that the problem only occurs when processors are writing to memory,
since if a value is only read the data will be consistent.

To co-ordinate keeping the cache coherent on all processors, an SMP system uses snooping.
Snooping is where a processor listens on the bus to which all processors are connected for
cache events, and updates its cache accordingly.

One protocol for doing this is the MOESI protocol; standing for Modified, Owner, Exclusive,
Shared, Invalid. Each of these is a state that a cache line can be in on a processor in the
system. There are other protocols for doing as much, however they all share similar concepts.
Below we examine MOESI so you have an idea of what the process entails.

When a processor requires reading a cache line from main memory, it firstly has to snoop all
other processors in the system to see if they currently know anything about that area of
memory (e.g. have it cached). If it does not exist in any other processor's cache, then the
processor can load the memory into cache and mark it as exclusive. When it writes to the
cache, it then changes state to be modified. Here the specific details of the cache come into
play; some caches will immediately write the modified line back to system memory (known as
a write-through cache, because writes go through to main memory). Others will not, and will
leave the modified value only in the cache until it is evicted, when the cache becomes full
for example.

The other case is where the processor snoops and finds that the value is in another
processor's cache. If this value has already been marked as modified, it will copy the data
into its own cache and mark it as shared. It will send a message to the other processor (that
we got the data from) to mark its cache line as owner. Now imagine that a third processor in
the system wants to use that memory too. It will snoop and find both a shared and an owner
copy; it will thus take its value from the owner copy. While all the other processors are only
reading the value, the cache line stays shared in the system. However, when one processor
needs to update the value it sends an invalidate message through the system. Any processors
with that cache line must then mark it as invalid, because it no longer reflects the "true"
value. When the processor sends the invalidate message, it marks the cache line as modified
in its cache and all others will mark theirs as invalid (note that if the cache line is exclusive
the processor knows that no other processor is depending on it, so it can avoid sending an
invalidate message).

From this point the process starts all over. Thus whichever processor has the modified value
has the responsibility of writing the true value back to RAM when it is evicted from the cache.
By thinking through the protocol you can see that this ensures consistency of cache lines
between processors.

There are several issues with this system as the number of processors starts to increase. With
only a few processors, the overhead of checking if another processor has the cache line (a
read snoop) or invalidating the data in every other processor (invalidate snoop) is
manageable; but as the number of processors increases so does the bus traffic. This is why SMP
systems usually only scale up to around 8 processors.

Having the processors all on the same bus starts to present physical problems as well. Physical
properties of wires only allow them to be laid out at certain distances from each other and to
only have certain lengths. With processors that run at many gigahertz the speed of light starts
to become a real consideration in how long it takes messages to move around a system.

Note that system software usually has no part in this process, although programmers should
be aware of what the hardware is doing underneath in response to the programs they design
to maximize performance.
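A classic example of such awareness is avoiding false sharing: if two counters updated by
different processors share one cache line, every write invalidates the line in the other
processor's cache. A hedged C sketch (the 64-byte line size and the padding idiom are
assumptions, not from the text):

#include <stdio.h>

#define LINE_SIZE 64   /* assumed cache line size */

/* pad each counter so it sits in its own cache line; without the padding
   both counters would share a line, and writes from different CPUs would
   ping-pong the line between their caches via invalidate messages */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
};

struct padded_counter counters[2];

int main(void) {
    counters[0].value++;   /* imagine CPU 0 doing this in a hot loop */
    counters[1].value++;   /* and CPU 1 doing this one               */
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}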

Hyperthreading

Much of the time of a modern processor is spent waiting for much slower devices in the
memory hierarchy to deliver data for processing.

Thus strategies to keep the pipeline of the processor full are paramount. One strategy is to
include enough registers and state logic such that two instruction streams can be processed at
the same time. This makes one CPU look for all intents and purposes like two CPUs.

While each CPU has its own registers, they still have to share the core logic, cache and input
and output bandwidth from the CPU to memory. So while two instruction streams can keep

the core logic of the processor busier, the performance increase will not be as great as having
two physically separate CPUs. Typically the performance improvement is below 20% (XXX
check), however it can be drastically better or worse depending on the workloads.

Multi Core

With increased ability to fit more and more transistors on a chip, it became possible to put
two or more processors in the same physical package. Most common is dual-core, where two
processor cores are in the same chip. These cores, unlike hyperthreading, are full processors
and so appear as two physically separate processors, as in an SMP system.

While generally the processors have their own L1 cache, they do have to share the bus
connecting to main memory and other devices. Thus performance is not as great as a full SMP
system, but considerably better than a hyperthreading system (in fact, each core can still
implement hyperthreading for an additional enhancement).

Multi core processors also have some advantages that are not performance related:

1. As we mentioned, external physical buses between processors have physical limits; by
containing the processors on the same piece of silicon, extremely close to each other, some
of these problems can be worked around.

2. The power requirements for multi core processors are much less than for two separate
processors. This means that there is less heat needing to be dissipated, which can be a big
advantage in data centre applications where computers are packed together and cooling
considerations can be considerable.

3. By having the cores in the same physical package, multi-processing becomes practical in
applications where it otherwise would not be, such as laptops.

4. It is also considerably cheaper to only have to produce one chip rather than two.

Clusters

Many applications require systems much larger than the number of processors a SMP system
can scale to. One way of scaling up the system further is a cluster.

A cluster is simply a number of individual computers which have some ability to talk to each
other. At the hardware level the systems have no knowledge of each other; the task of
stitching the individual computers together is left up to software.

Software such as - allows programmers to write their software and then "farm out" parts of
the program to other computers in the system. For example, imagine a loop that executes a
thousand times performing an independent action (that is, no iteration of the loop affects any
other iteration). With four computers in a cluster, the software could make each computer do
250 iterations each.
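The software name is elided in the text; MPI is one widely used example of such a library.
Here is a hedged MPI-style C sketch of farming out that loop (the work function is
hypothetical; compile with mpicc, assuming an MPI implementation is installed):

#include <mpi.h>

void do_iteration(int i) { (void)i; /* hypothetical independent work */ }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which computer am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many are there?   */

    /* 1000 independent iterations, dealt out round-robin: with four
       computers in the cluster, each one runs 250 of them */
    for (int i = rank; i < 1000; i += size)
        do_iteration(i);

    MPI_Finalize();
    return 0;
}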

The interconnect between the computers varies, and may be as slow as an internet link or as
fast as dedicated, special buses (Infiniband). Whatever the interconnect, however, it is still
going to be further down the memory hierarchy and much, much slower than RAM. Thus a
cluster will not perform well in a situation where each CPU requires access to data that may be stored in the RAM of another computer: each time this happens the software will need to request a copy of the data from the other computer and copy it across the slow link into local RAM before the processor can get any work done.

However, many applications do not require this constant copying around between computers.
One large scale example is SETI@Home, where data collected from a radio antenna is
analysed for signs of alien life. Each computer can be distributed a few minutes of data to analyse, and only needs to report back a summary of what it found. SETI@Home is effectively a
very large, dedicated cluster.

Another application is rendering of images, especially for special effects in films. Each
computer can be handed a single frame of the movie which contains the wire-frame models,
textures and light sources which need to be combined (rendered) into the amazing special effects we now take for granted. Since each frame is static, once the computer has the initial input it does not need any more communication until the final frame is ready to be sent back and combined into the movie. For example, the blockbuster Lord of the Rings had its special effects rendered on a huge cluster running Linux.

Non-Uniform Memory Access

Non-Uniform Memory Access, more commonly abbreviated to NUMA, is almost the opposite of the cluster system mentioned above. As in a cluster system, it is made up of individual nodes linked together; however, the linkage between nodes is highly specialised (and expensive!). Whereas in a cluster system the hardware has no knowledge of the linkage between nodes, in a NUMA system the software has no (well, less) knowledge about the layout of the system, and the hardware does all the work of linking the nodes together.

The term non-uniform memory access comes from the fact that RAM may not be local to the
CPU and so data may need to be accessed from a node some distance away. This obviously
takes longer, and is in contrast to a single processor or SMP system where RAM is directly
attached and always takes a constant (uniform) time to access.

NUMA Machine Layout

With so many nodes talking to each other in a system, minimizing the distance between each
node is of paramount importance. Obviously it is best if every single node has a direct link to
every other node as this minimizes the distance any one node needs to go to find data. This is
not a practical situation when the number of nodes starts growing into the hundreds and
thousands as it does with large supercomputers; if you remember your high school maths, the problem is basically combinations taken two at a time (each node talking to another), which grows as n!/(2!(n-2)!) = n(n-1)/2.

To combat this quadratic growth, alternative layouts are used to trade off the distance
between nodes with the interconnects required. One such layout common in modern NUMA
architectures is the hypercube.

A hypercube has a strict mathematical definition (way beyond this discussion) but as a cube is a 3-dimensional counterpart of a square, so a hypercube is a 4-dimensional counterpart of a cube.

Figure 3-4. A Hypercube

Above we can see the outer cube contains 8 nodes. The maximum number of hops required for any node to talk to another node is 3. When another cube is placed inside this cube, we now have double the number of processors but the maximum path cost has only increased to 4. This means that as the number of processors grows as 2^n, the maximum path cost grows only linearly, as n.
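
One way to see this linear growth: if hypercube nodes are numbered so that directly linked nodes differ in exactly one bit of their label, the minimum number of hops between two nodes is the Hamming distance between their labels. A small sketch of the calculation (the labelling scheme is the standard one, but the code itself is purely illustrative):

/* Minimum hop count between two hypercube nodes. */
#include <stdio.h>

static int hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;   /* bit positions where the labels differ */
    int count = 0;
    while (diff) {
        count += diff & 1;
        diff >>= 1;
    }
    return count;
}

int main(void)
{
    /* A 4-dimensional hypercube has 16 nodes, labelled 0..15.
     * The worst case is a node and its bitwise complement. */
    printf("%d\n", hops(0x0, 0xF));   /* prints 4 */
    return 0;
}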

Cache Coherency

Cache coherency can still be maintained in a NUMA system (this is referred to as a cache-
coherent NUMA system, or ccNUMA). As we mentioned, the broadcast based scheme used to
keep the processor caches coherent in an SMP system does not scale to hundreds or even
thousands of processors in a large NUMA system. One common scheme for cache coherency in
a NUMA system is referred to as a directory based model. In this model, processors in the system communicate with special cache directory hardware. The directory hardware maintains
a consistent picture to each processor; this abstraction hides the working of the NUMA system
from the processor.

The Censier and Feautrier directory based scheme maintains a central directory where each
memory block has a flag bit known as the valid bit for each processor and a single dirty bit.
When a processor reads the memory into its cache, the directory sets the valid bit for that
processor.

When a processor wishes to write to the cache line the directory needs to set the dirty bit for
the memory block. This involves sending an invalidate message to those processors who are
using the cache line (and only those processors whose flags are set, avoiding broadcast traffic).

After this should any other processor try to read the memory block the directory will find the
dirty bit set. The directory will need to get the updated cache line from the processor with
the valid bit currently set, write the dirty data back to main memory and then provide that
data back to the requesting processor, setting the valid bit for the requesting processor in the
process. Note that this is transparent to the requesting processor and the directory may need
to get that data from somewhere very close or somewhere very far away.
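
As a sketch, a directory entry in such a scheme might hold per-processor valid bits and a single dirty bit. The field names and the 64-processor limit below are illustrative, not taken from any particular implementation:

/* A Censier/Feautrier-style directory entry for one memory block. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t valid;   /* one bit per processor: who holds a copy        */
    bool     dirty;   /* true when one processor holds a modified copy  */
} dir_entry;

/* On a read by processor p, the directory marks p as a sharer. */
static inline void note_read(dir_entry *e, int p)
{
    e->valid |= (uint64_t)1 << p;
}

/* On a write by processor p, the directory sets the dirty bit; the
 * hardware would also send invalidates to the other processors whose
 * valid bits are set, leaving p as the sole holder. */
static inline void note_write(dir_entry *e, int p)
{
    e->dirty = true;
    e->valid = (uint64_t)1 << p;
}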

Obviously, having thousands of processors communicating with a single directory also does not scale well. Extensions to the scheme involve having a hierarchy of directories that
communicate between each other using a separate protocol. The directories can use a more
general purpose communications network to talk between each other, rather than a CPU bus,
allowing scaling to much larger systems.

NUMA Applications

NUMA systems are best suited to the types of problems that require much interaction
between processor and memory. For example, in weather simulations a common idiom is to
divide the environment up into small "boxes" which respond in different ways (oceans and
land reflect or store different amounts of heat, for example). As simulations are run, small
variations will be fed in to see what the overall result is. As each box influences the
surrounding boxes (e.g. a bit more sun means a particular box puts out more heat, affecting
the boxes next to it) there will be much communication (contrast that with the individual
image frames for a rendering process, each of which does not influence the other). A similar
process might happen if you were modelling a car crash, where each small box of the
simulated car folds in some way and absorbs some amount of energy.

Although the software has no direct knowledge that the underlying system is a NUMA system,
programmers need to be careful when programming for the system to get maximum
performance. Obviously keeping memory close to the processor that is going to use it will
result in the best performance. Programmers need to use techniques such as profiling to
analyze the code paths taken and the consequences their code has for the system, in order to extract the best performance.

Memory ordering, locking and atomic operations

The multi-level cache, superscalar multi-processor architecture brings with it some interesting issues relating to how a programmer sees the processor running code.

Imagine program code is running on two processors simultaneously, both processors sharing effectively one large area of memory. If one processor issues a store instruction to put a register value into memory, when can the program be sure that the other processor, doing a load of that memory, will see the correct value?

In the simplest situation the system could guarantee that if a program executes a store
instruction, any subsequent load instructions will see this value. This is referred to as strict
memory ordering, since the rules allow no room for movement. You should be starting to
realize why this sort of thing is a serious impediment to performance of the system.

Much of the time, the memory ordering is not required to be so strict. The programmer can
identify points where they need to be sure that all outstanding operations are seen globally,
but in between these points there may be many instructions where the semantics are not
important.

Take, for example, the following situation.

Example 3-1. Memory Ordering

#include <stdlib.h>

typedef struct {
    int a;
    int b;
} a_struct;

/*
 * Pass in the address of a pointer, to be filled in with a
 * newly allocated structure.
 */
void get_struct(a_struct **new_struct)
{
    a_struct *p = malloc(sizeof(a_struct));

    /* We don't particularly care what order the following two
     * stores actually execute in */
    p->a = 100;
    p->b = 150;

    /* However, they must be done before this store. Otherwise,
     * another processor that looks at the value of *new_struct
     * could find it pointing into a structure whose values have
     * not been filled out.
     */
    *new_struct = p;
}

In this example, we have two stores that can be done in any particular order, as it suits the
processor. However, in the final case, the pointer must only be updated once the two
previous stores are known to have been done. Otherwise another processor might look at the new pointer value, follow the pointer to the memory, load it, and get some completely incorrect value!

31
To indicate this, loads and stores are given semantics that describe the ordering behaviour they must obey. Memory semantics are described in terms of fences that dictate how loads and stores may be reordered around the load or store.

By default, a load or store can be re-ordered anywhere.

Acquire semantics. Whenever you use atomic operations to gain access to some data, your code must make sure that other processors see the lock before any of the changes that will be made. This is what we call acquire semantics, because the code is trying to acquire ownership of some data. This is also referred to as a read barrier or import barrier. Acquire semantics is like a fence that only allows loads and stores to move downwards through it. That is, when this load or store is complete you can be guaranteed that any later loads or stores will see the value (since they cannot be moved above it).

Release semantics. On the other hand, if you use atomic operations to release recently modified data, your code must make sure that the new data is visible before releasing it. This is what we call release semantics, because the code is trying to release ownership of some data. This is also referred to as a write barrier or export barrier.

Release semantics is the opposite: a fence that allows later loads and stores to move upwards through it, but prevents anything before it from moving downwards past it. Thus, when a load or store with release semantics is processed, you can be sure that all earlier loads and stores have completed.

Figure 3-5. Acquire and Release semantics

A full memory fence is a combination of both; where no loads or stores can be reordered in
any direction around the current load or store.

The strictest memory model would use a full memory fence for every operation. The weakest
model would leave every load and store as a normal re-orderable instruction.
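
As a concrete illustration, Example 3-1 can be re-expressed with explicit acquire and release semantics using the C11 <stdatomic.h> interface. This is a sketch, and the names payload and ready are illustrative:

/* Publish a value to another processor using release/acquire. */
#include <stdatomic.h>

int payload;                 /* ordinary data     */
atomic_int ready = 0;        /* publication flag  */

void producer(void)
{
    payload = 42;            /* plain store */
    /* Release: the store to payload cannot be re-ordered
     * below this store to ready. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* Acquire: later loads cannot be re-ordered above this
     * load of ready. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                    /* spin until the flag is seen */
    return payload;          /* guaranteed to read 42       */
}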

Processors and memory models

Different processors implement different memory models.

The x86 (and AMD64) processors have a quite strict memory model; all stores have release semantics (that is, the result of a store is guaranteed to be seen by any later load or store) but loads have normal semantics. The lock instruction prefix provides a full memory fence.

Itanium, in contrast, allows all loads and stores to be re-ordered unless an instruction explicitly specifies acquire or release semantics.

Locking

Knowing the memory ordering requirements of each architecture is not practical for all programmers, and would make programs difficult to port and debug across different processor types.

Programmers use a higher level of abstraction, called locking, to allow simultaneous operation of programs when there are multiple CPUs.

When a program acquires a lock over a piece of code, no other processor can obtain the lock
until it is released. Before any critical pieces of code, the processor must attempt to take the
lock; if it cannot have it, it does not continue.

You can see how this is tied into the naming of the memory ordering semantics in the previous
section. We want to ensure that before we acquire a lock, no operations that should be
protected by the lock are re-ordered before it. This is how acquire semantics works.

Conversely, when we release the lock, we must be sure that every operation we have done
whilst we held the lock is complete (remember the example of updating the pointer
previously?). This is release semantics.

There are many software libraries available that allow programmers to not have to worry
about the details of memory semantics and simply use the higher level of abstraction
of lock() and unlock().

Locking difficulties

Locking schemes make programming more complicated, as it is possible to deadlock programs. Imagine one processor is currently holding a lock over some data and is waiting for a lock on some other piece of data. If the processor holding that second lock is itself waiting for the lock the first processor holds, we have a deadlock situation. Each processor is waiting for the other and neither can continue without the other's lock.
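
A minimal sketch of how such a deadlock arises, using POSIX threads (the lock names and thread bodies are illustrative):

/* Two threads taking two locks in opposite orders: with unlucky
 * timing, each holds one lock and waits forever for the other. */
#include <pthread.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *thread_one(void *arg)
{
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);   /* blocks forever if thread_two holds b */
    /* ... critical work ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return arg;
}

void *thread_two(void *arg)
{
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);   /* blocks forever if thread_one holds a */
    /* ... critical work ... */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return arg;
}

One standard cure is to impose a global lock ordering (always take lock_a before lock_b, in every thread), which makes the circular wait impossible.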

Often this situation arises because of a subtle race condition; one of the hardest bugs to track down. If two processors are relying on operations happening in a specific order in time, there is always the possibility of a race condition occurring. A gamma ray from an exploding star in a different galaxy might hit one of the processors, making it skip a beat and throwing the ordering of operations out. What will often happen then is a deadlock situation like the one above. It is for this reason that program ordering needs to be ensured by explicit semantics, and not by relying on timing-specific behaviours.

A similar situation is the opposite of deadlock, called livelock. One strategy to avoid deadlock
might be to have a "polite" lock; one that you give up to anyone who asks. This politeness
might cause two threads to be constantly giving each other the lock, without either ever
taking the lock long enough to get the critical work done and be finished with the lock (a
similar situation in real life might be two people who meet at a door at the same time, both
saying "no, you first, I insist". Neither ends up going through the door!).

Locking strategies

Underneath, there are many different strategies for implementing the behaviour of locks.

A lock that has just two states - locked or unlocked - is referred to as a mutex (short for mutual exclusion; that is, if one holder has it, no one else can have it).

There are, however, a number of ways to implement a mutex lock. In the simplest case, we have what is commonly called a spinlock. With this type of lock, the processor sits in a tight loop waiting to take the lock; equivalent to it constantly asking "can I have it now?", much as a young child might ask of a parent.

The problem with this strategy is that it essentially wastes time. Whilst the processor is sitting constantly asking for the lock, it is not doing any useful work. For locks that are likely to be held for only a very short amount of time this may be appropriate, but in many cases the amount of time the lock is held might be considerably longer.
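
As a sketch of the idea, a spinlock can be built from a C11 atomic_flag, with acquire semantics on the test-and-set and release semantics on the clear, tying this back to the memory ordering discussion above (the names are illustrative):

/* A minimal spinlock built on C11 atomics. */
#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* "Can I have it now?" -- repeat until the flag was clear. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* busy-wait */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}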

Thus another strategy is to sleep on a lock. In this case, if the processor can not have the lock
it will start doing some other work, waiting for notification that the lock is available for use.

A mutex is, however, just a special case of a semaphore (a variable or abstract data type used to control access to a common resource by multiple processes in a concurrent system, such as a multiprogramming operating system), famously invented by the Dutch computer scientist Edsger Dijkstra. In a case where there are multiple resources available, a semaphore can be set to count accesses to the resources. In the case where the number of resources is one, you have a mutex. The operation of semaphores is detailed in any algorithms book.
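
For instance, a counting semaphore guarding four identical resources might look like this sketch using the POSIX semaphore API (the slot count of 4 is an arbitrary illustrative choice; setting it to 1 gives a mutex):

/* A counting semaphore over N identical resources. */
#include <semaphore.h>

sem_t slots;

void init(void)
{
    sem_init(&slots, 0, 4);   /* 4 resources available */
}

void use_resource(void)
{
    sem_wait(&slots);         /* decrement; blocks when none are free */
    /* ... use one resource ... */
    sem_post(&slots);         /* release it again */
}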

These locking schemes still have some problems, however. In many cases processors only want to read data that is updated only rarely. Requiring every reader to take a lock can lead to lock contention, where less work gets done because everyone is waiting to obtain the same lock for some data.

Assignment 1

Atomic Operations

1. Explain what an atomic operation is.

2. Explain why the clock speed, given in cycles per second, is not the best
indicator of actual processor speed.

A typical schematic symbol for an ALU: A & B are operands; R is the output; F is the input from
the Control Unit; D is an output status
At the heart of any computer, modern or early, is a circuit called an ALU, or Arithmetic Logic Unit. It provides a few simple operations which can be done very quickly. This, along with a small amount of memory running at processor speed called registers, makes up what is known as the CPU, or Central Processing Unit.
A CPU isn't very useful unless there is some way to communicate with it, and receive information back from it. This is usually known as a Bus. The Bus is the input/output, or I/O, gateway for the CPU. The primary area with which the CPU communicates is its system memory, commonly known as RAM, or Random Access Memory. Depending on the platform, the CPU may communicate with other parts of the system directly, or it may communicate just through memory.

The "word" size of a platform is the native amount of bits that can be moved over the bus that is
internal to the CPU.* Early computers varied on bit sizes, but most modern computers work in
multiples of 8 bits, commonly known as a Byte. The first general purpose CPU on a chip, built
by Intel, was the 8080 built in 1974. The 8080 used an 8 bit word, meaning it would
communicate over the bus 1 byte at a time. In contrast, the 80386 has a 32 bit (4 byte) word,
and the IA64 is a 64 bit (8 byte) word.

* Note that a few companies considered the "word" size to be the number of bits transferred at one time over the CPU<=>RAM bus. Technically, knowing the actual "word" size inside the CPU can be quite important for people developing system-level software, such as device drivers.

2.3 Buses
A bus is used to transfer information between several different modules. Small and
mid-range computer systems, such as the Macintosh, have a single bus connecting all
major components. Supercomputers and other high performance machines have more
complex interconnections, but many components will have internal buses.

Communication on a bus is broken into discrete transactions. Each transaction has a sender and receiver. In order to initiate a transaction, a module has to gain control of the bus and become (temporarily, at least) the bus master. Often several devices have the ability to become the master; for example, the processor controls transactions that transfer instructions and data between memory and CPU, but a disk controller becomes the bus master to transfer blocks between disk and memory. When two or more devices want to transfer information at the same time, an arbitration protocol is used to decide which will be given control first. A protocol is a set of signals exchanged between devices in order to perform some task, in this case to agree which device will become the bus master.

Once a device has control of the bus, it uses a communication protocol to transfer the
information. In an asynchronous (unclocked) protocol the transfer can begin at any
time, but there is some overhead involved in notifying potential receivers that
information needs to be transferred. In a synchronous protocol transfers are controlled
by a global clock and begin only at well-known times.

The performance of a bus is defined by two parameters, the transfer time and the
overall bandwidth (sometimes called throughput). Transfer time is similar to latency
in memories: it is the amount of time it takes for data to be delivered in a single
transaction. For example, the transfer time defines how long a processor will have to
wait when it fetches an instruction from memory. Bandwidth, expressed in units of
bits per second (bps), measures the capacity of the bus. It is defined to be the product
of the number of bits that can be transferred in parallel in any one transaction by the
number of transactions that can occur in one second. For example, if the bus has 32
data lines and can deliver 1,000,000 packets per second, it has a bandwidth of
32Mbps.

At first it may seem these two parameters measure the same thing, but there are subtle
differences. The transfer time measures the delay until a piece of data arrives. As soon
as the data is present it may be used while other signals are passed to complete the
communication protocol. Completing the protocol will delay the next transaction, and
bandwidth takes this extra delay into account.
Another factor that distinguishes the two is that in many high performance systems a
block of information can be transferred in one transaction; in other words, the
communication protocol may say ``send n items starting from location x.'' There will be some
initial overhead in setting up the transaction, so there will be a delay in receiving the
first piece of data, but after that information will arrive more quickly.

Bandwidth is a very important parameter. It is also used to describe processor performance, when we count the number of instructions that can be executed per unit time, and the performance of networks.

I/O
Many computational science applications generate huge amounts of data which must
be transferred between main memory and I/O devices such as disk and tape. We will
not attempt to characterize file I/O in this chapter since the devices and their
connections to the rest of the system tend to be idiosyncratic (do things in their own
way). If your application needs to read or write large data files you will need to learn
how your system organizes and transfers files and tune your application to fit that
system. It is worth reiterating, though, that performance is measured in terms of
bandwidth: what counts is the volume of data per unit of time that can be moved into
and out of main memory.

The rest of this section contains a brief discussion of video displays. These output
devices and their capabilities also vary from system to system, but since scientific
visualization is such a prominent part of this work we should introduce some concepts
and terminology for readers who are not familiar with video displays.

Most users who generate high quality images will do so on workstations configured
with extra hardware for creating and manipulating images. Almost every workstation
manufacturer includes in its product line versions of their basic systems that are
augmented with extra processors that are dedicated to drawing images. These extra
processors work in parallel with the main processor in the workstation. In most cases
data generated on a supercomputer is saved in a file and later viewed on a video
console attached to a graphics workstation. However there are situations that make
use of high bandwidth connections from supercomputers directly to video displays;
these are useful when the computer is generating complex data that should be viewed
in ``real time.'' For example, a demonstration program from Thinking Machines, Inc.
allows a user to move a mouse over the image of a fluid moving through a pipe. When
the user pushes the mouse button, the position of the mouse is sent to a parallel
processor which simulates the path of particles in a turbulent flow at this position. The
results of the calculations are sent directly to the video display, which shows the new
positions of the particles in real time. The net effect is as if the user is holding a
container of fluid that is being poured into the pipe.

There are many different techniques for drawing images with a computer, but the
dominant technology is based on a raster scan. A beam of electrons is directed at a
screen that contains a quick-fading phosphor. The beam can be turned on and off very
quickly, and it can be bent in two dimensions via magnetic fields. The beam is swept
from left to right (from the user's point of view) across the screen. When the beam is
on, a small white dot will appear on the screen where the beam is aimed, but when it
is off the screen will remain dark. To paint an image on the entire screen, the beam is
swept across the top row; when it reaches the right edge, it is turned off, moved back
to the left and down one row, and then swept across to the right again. When it
reaches the lower right corner, the process repeats again in the upper left corner.

The number of times per second the full screen is painted determines the  refresh rate.
If the rate is too low, the image will flicker, since the bright spots on the phosphor will
fade before the gun comes back to that spot on the next pass. Refresh rates vary from
30 times per second up to 60 times per second.

The individual locations on a screen that can be either painted or not are known
as pixels (from ``picture element''). The resolution of the image is the number of pixels per
inch. A high resolution display will have enough pixels in a given area that from a
reasonable distance (an arm's length away) the gaps between pixels are not visible and
a sequence of pixels that are all on will appear to be a continuous line. A common
screen size is 1280 pixels across and 1024 pixels high on a 16'' or 19'' monitor.

The controller for the electron gun decides whether a pixel will be black or white by
reading information from a memory that has one bit per pixel. If the bit is a 1, the
pixel will be painted, otherwise it will remain dark.
In early systems, the operating system set aside a portion of the main memory for the display, and all an application had to do to paint something on the screen was to write a bit pattern into this portion of memory. This was an economical choice for the time (early 1980s), but
it came at the cost of performance: the processor and video console had to alternate
accesses to memory. During periods when the electron gun was being moved back to
the upper left hand corner, the display did not access memory, and the processor was
able to run at full speed. Once the gun was positioned and ready for the next scan line,
however, the processor and display went back to alternating memory cycles.

With the fall in memory prices and the rising demand for higher performance, modern
systems use a dedicated memory known as a  frame buffer for holding bit patterns that
control the displays. On inexpensive systems the main processor will compute the
patterns and transfer them to the frame buffer. On high performance systems, though,
the main processor sends information to the ``graphics engine'', a dedicated processor
that performs the computations. For example, if the user wants to draw a rectangle,
the CPU can send the coordinates to the graphics processor, and the latter will figure
out which pixels lie within the rectangle and turn on the corresponding bits in the
frame buffer. Sophisticated graphics processors do all the work required in complex
shading, texturing, overlapping of objects (deciding what is visible and what is not),
and other operations required in 3D images.

The discussion so far has dealt only with black and white images. Color displays are
based on the same principles: a raster scan illuminates regions on a phosphor, with the
information that controls the display coming from a frame buffer. However, instead of
one gun there are three, one for each primary color. When combining light, the
primary colors are red, green, and blue, which is why these displays are known as
RGB monitors. Since we need to specify whether or not each gun should be on for
each pixel, the frame buffer will have at least three bits per pixel. To have a wide
variety of colors, though, it is not enough just to turn a gun on or off; we need to
control its intensity. For example, a violet color can be formed by painting a pixel
with the red gun at 61% of full intensity, green at 24%, and blue at 80%.

Typically a system will divide the range of intensities into 256 discrete values, which
means the intensity can be represented by an 8-bit number. 8 bits times 3 guns means
24 bits are required for each pixel. Recall that high resolution displays have 1024
rows of 1280 pixels each, for a total of 1.3 million pixels. Dedicating 24 bits to each
pixel would require almost 4MB of RAM for the frame buffer alone. What is done
instead is to create a color map with a fixed number of entries, typically 256. Each
entry in the color map is a full 24 bits wide. Each pixel only needs to identify a
location in the map that contains its color, and since a color map of 256 entries
requires only 8 bits per pixel to specify one of the entries there is a savings of 16 bits
per pixel. The drawback is that only 256 different colors can be displayed in any one
image, but this is enough for all applications except those that need to create highly
realistic images.
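
The arithmetic behind those figures, as a quick sketch in C:

/* Frame buffer sizing for a 1280x1024 display. */
#include <stdio.h>

int main(void)
{
    long pixels = 1280L * 1024L;            /* ~1.3 million pixels          */
    long direct = pixels * 3;               /* 24 bits (3 bytes) per pixel  */
    long mapped = pixels * 1 + 256L * 3;    /* 8-bit indices + 256-entry map */

    printf("24-bit direct: %ld bytes\n", direct);   /* 3,932,160 (~3.9MB) */
    printf("8-bit mapped:  %ld bytes\n", mapped);   /* 1,311,488 (~1.3MB) */
    return 0;
}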

2.5 Operating Systems


The user's view of a computer system is of a complex set of services that are provided
by a combination of hardware (the architecture and its organization) and software (the
operating system). Attributes of the operating system also affect the performance of
user programs.

Operating systems for all but the simplest personal computers are multi-tasking operating systems. This means the computer will be running several jobs at
once. A program is a static description of an algorithm. To run a program, the system
will decide how much memory it needs and then start a  process for this program; a
process (also known as a task) can be viewed as a dynamic copy of a program. For
example, the C compiler is a program. Several different users can be compiling their
code at the same time; there will be a separate process in the system for each of these
invocations of the compiler.

Processes in a multi-tasking operating system will be in one of three states. A process is active if the CPU is executing the corresponding program. In a single processor
system there will be only one active process at any time. A process is idle if it is
waiting to run. In order to allocate time on the CPU fairly to all processes, the
operating system will let a process run for a short time (known as a  time slice;
typically around 20ms) and then interrupt it, change its status to idle, and install one
of the other idle tasks as the new active process. The previous task goes to the end of
a process queue to wait for another time slice.

The third state for a process is blocked. A blocked process is one that is waiting for
some external event. For example, if a process needs a piece of data from a file, it will
call the operating system routine that retrieves the information and then voluntarily
give up the remainder of its time slice. When the data is ready, the system changes the
process' state from blocked to idle, and it will be resumed again when its turn comes.

The predominant operating system for workstations is Unix, developed in the 1970s
at Bell Labs and made popular in the 1980s by the University of California at
Berkeley. Even though there may be just one user, and that user is executing only one
program (e.g. a text editor), there will be dozens of tasks running. Many Unix services
are provided by small systems programs known as daemons that are dedicated to one
special purpose. There are daemons for sending and receiving mail, using the network
to find files on other systems, and several other jobs.

The fact that there may be several processes running in a system at the same time as
your computational science application has ramifications for performance. One is that
it makes it slightly more difficult to measure performance. You cannot simply start a
program, look at your watch, and then look again when the program stops to measure
the time spent. This measure is known as  real time or ``wall-clock time,'' and it
depends as much on the number of other processes in the system as it does on the
performance of your program. Your program will take longer to run on a heavily-loaded system since it will be competing for CPU cycles with those other jobs. To get
an accurate assessment of how much time is required to run your program you need to
measure CPU time. Unix and other operating systems have system routines that can be
called from an application to find out how much CPU time has been allocated to the
process since it was started.
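
For example, the standard C library's clock() routine reports the CPU time consumed by the calling process. A minimal sketch (the loop is an arbitrary stand-in for real work):

/* Measure CPU time rather than wall-clock time. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start = clock();

    volatile double x = 0.0;                 /* the work being timed */
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;

    double cpu_secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("CPU time: %.2f s\n", cpu_secs);  /* excludes time given to other processes */
    return 0;
}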

Another impact of having several other jobs in the process queue is that as they are
executed they work themselves into the cache, displacing your program and data.
During your application's time slice its code and data will fill up the cache. But when
the time slice is over and a daemon or other user's program runs, its code and data will
soon replace yours, so that when yours resumes it will have a higher miss rate until it
reloads the code and data it was working on when it was interrupted. This period
during which your information is being moved back into the cache is known as
a reload transient. The longer the interval between time slices and the more processes
that run during this interval the longer the reload transient.

Supercomputers and parallel processors also use variants of Unix for their runtime
environments. You will have to investigate whether or not daemons run on the main
processor or a ``front end'' processor and how the operating system allocates
resources. As an example of the range of alternatives, on an Intel Paragon XPS with
56 processors some processors will be dedicated to system tasks (e.g. file transfers)
and the remainder will be split among users so that applications do not have to share
any one processor. The MasPar 1104 consists of a front-end (a DEC workstation) that
handles the system tasks and 4096 processors for user applications. Each processor
has its own 64KB RAM. More than one user process can run at any one time, but
instead of allocating a different set of processors to each job the operating system
divides up the memory. The memory is split into equal size partitions, for example
8KB, and when a job starts the system figures out how many partitions it needs. All
4096 processors execute that job, and when the time slice is over they all start
working on another job in a different set of partitions.

2.6 Data Representations


Another important interaction between user programs and computer architecture is in
the representation of numbers. This interaction does not affect performance as much
as it does portability. Users must be extremely careful when moving programs and/or
data files from one system to another because numbers and other data are not always
represented the same way. Recently programming languages have begun to allow
users to have more control over how numbers are represented and to write code that
does not depend so heavily on data representations that it fails when executed on the
``wrong'' system.

The binary number system is the starting point for representing information. All items
in a computer's memory - numbers, characters, instructions, etc. - are represented by
strings of 1's and 0's. These two values designate one of two possible states for the
underlying physical memory. It does not matter to us which state corresponds to 1 and
which corresponds to 0, or even what medium is used. In an electronic memory, 1
could stand for a positively charged region of semiconductor and 0 for a neutral
region, or on a device that can be magnetized a 1 would represent a portion of the
surface that has a flux in one direction, while a 0 would indicate a flux in the opposite
direction. It is only important that the mapping from the set {1,0} to the two states be
consistent and that the states can be detected and modified at will.

Systems usually deal with fixed-length strings of binary digits. The smallest unit of
memory is a single bit, which holds a single binary digit. The next largest unit is
a byte, now universally recognized to be eight bits (early systems used anywhere from
six to eight bits per byte). A word is 32 bits long in most workstations and personal
computers, and 64 bits in supercomputers. A  double word is twice as long as a single
word, and operations that use double words are said to be  double precision operations.

Storing a positive integer in a system is trivial: simply write the integer in binary and
use the resulting string as the pattern to store in memory. Since numbers are usually
stored one per word, the number is padded with leading 0's first. For example, the
number 52 is represented in a 16-bit word by the pattern 0000000000110100.

The meaning of an n-bit string b_{n-1} b_{n-2} ... b_1 b_0 when it is interpreted as a binary number is defined by the formula

x = b_{n-1}·2^{n-1} + b_{n-2}·2^{n-2} + ... + b_1·2^1 + b_0·2^0

i.e. bit number i has weight 2^i.

Compiler writers and assembly language programmers often take advantage of the
binary number system when implementing arithmetic operations. For example, if the
pattern of bits is ``shifted left'' by one, the corresponding number is multiplied by two.
A left shift is performed by moving every bit left and inserting 0's on the right side. In
an 8-bit system, for example, the pattern 00000110 represents the number 6; if this
pattern is shifted left, the resulting pattern is 00001100, which is the representation of
the number 12. In general, shifting left by k bits is equivalent to multiplying by 2^k.

Shifts such as these can be done in one machine cycle, so they are much faster than multiplication instructions, which usually take several cycles. Other ``tricks'' are using a right shift to implement integer division by a power of 2, in which the result is an integer and the remainder is ignored (e.g. 15 ÷ 4 = 3), and taking the modulus or remainder with respect to a power of 2 (see problem 8).
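
These tricks are easy to check in a few lines of C:

/* Shift tricks for multiplying and dividing by powers of two. */
#include <stdio.h>

int main(void)
{
    int x = 6;
    printf("%d\n", x << 1);   /* 12: a left shift by 1 multiplies by 2  */
    printf("%d\n", 15 >> 2);  /* 3: a right shift by 2 divides by 4,
                                 discarding the remainder               */
    printf("%d\n", 15 & 3);   /* 3: AND with (4 - 1) gives 15 mod 4     */
    return 0;
}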

A fundamental relationship about binary patterns is that there are 2^n distinct n-digit strings. For example, for n = 8 there are 2^8 = 256 different strings of 1's and 0's. From this relationship it is easy to see that the largest integer that can be stored in an n-bit word is 2^n − 1: the 2^n patterns are used to represent the 2^n integers in the interval [0, 2^n − 1].

An overflow occurs when a system generates a value greater than the largest integer. For example, in a 32-bit system, the largest unsigned integer is 2^32 − 1 = 4,294,967,295. If a program tries to add 3,000,000,000 and 2,000,000,000 it will cause an overflow. Right away we can see one source of problems that can arise when moving a program from one system to another: if the word size is smaller on the new system a program that runs successfully on the original system may crash with an overflow error on the new system.
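
The wrap-around is easy to demonstrate with unsigned arithmetic, which C defines to be modular:

/* Unsigned 32-bit addition wraps around modulo 2^32. */
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 3000000000u, b = 2000000000u;
    /* The true sum, 5,000,000,000, exceeds 2^32 - 1, so the stored
     * result is 5,000,000,000 - 2^32. */
    printf("%" PRIu32 "\n", a + b);   /* prints 705032704 */
    return 0;
}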

There are two different techniques for representing negative values. One method is to
divide the word into two fields, i.e. represent two different types of information within
the word. We can use one field to represent the sign of the number, and the other field
to represent the value of the number. Since a number can be just positive or negative,
we need only one bit for the sign field. Typically the leftmost bit represents the sign,
with the convention that a 1 means the number is negative and a 0 means it is positive.
This type of representation is known as a sign-magnitude representation, after the
names of the two fields. For example, in a 16-bit sign-magnitude system, the pattern 1000000011111111 represents the number −255 and the pattern 0000000000000101 represents +5.

The other technique for representing both positive and negative integers is known
as two's complement. It has two compelling advantages over the sign-magnitude
representation, and is now universally used for integers, but as we will see below sign-
magnitude is still used to represent real numbers. The two's complement method is
based on the fact that binary arithmetic in fixed-length words is actually arithmetic
over a finite cyclic group. If we ignore overflows for a moment, observe what happens
when we add 1 to the largest possible number in an n-bit system (this number is represented by a string of n 1's):

111...1 (n 1's) + 1 = 1000...0 (a 1 followed by n 0's)

The result is a pattern with a leading 1 and n 0's. In an n-bit system only the low order n bits of each result are saved, so this sum is functionally equivalent to 0. Operations that lead to sums with very large values ``wrap around'' to 0, i.e. the system is a finite cyclic group. Operations in this group are defined by arithmetic modulo 2^n.

For our purposes, what is interesting about this type of arithmetic is that 2^n, which is represented by a 1 followed by n 0's, is equivalent to 0, which means −x ≡ 2^n − x for all x between 0 and 2^n. A simple ``trick'' that has its roots in this fact can be applied to the bit pattern of a number in order to calculate its additive inverse: if we invert every bit (turn a 1 into a 0 and vice versa) in the representation of a number x and then add 1, we come up with the representation of −x. For example, the representation of 5 in an 8-bit system is 00000101. Inverting every bit and adding 1 to the result gives the pattern 11111011. This is also the representation of 251, but in arithmetic modulo 2^8 we have 251 ≡ −5, so this pattern is a perfectly acceptable representation of −5 (see problem 7).
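
The invert-and-add-one rule can be verified directly in C:

/* The two's complement trick in an 8-bit system. */
#include <stdio.h>

int main(void)
{
    unsigned char five = 0x05;                 /* 00000101 */
    unsigned char neg  = (unsigned char)(~five + 1);

    printf("%u\n", (unsigned)neg);             /* 251, i.e. 11111011        */
    printf("%d\n", (int)(signed char)neg);     /* -5 read as two's complement */
    return 0;
}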

In practice we divide all n-bit patterns into two groups. Patterns that begin with 0 represent the positive integers 0 through 2^{n-1} − 1, and patterns beginning with 1 represent the negative integers −1 through −2^{n-1}. To determine which integer is represented by a pattern that begins with a 1, compute its complement (invert every bit and add 1). For example, in an 8-bit two's complement system the pattern 11100001 represents −31, since the complement is 00011110 + 1 = 00011111 = 31. Note that the leading bit determines the sign, just as in a sign-magnitude system, but one cannot simply look at the remaining bits to ascertain the magnitude of the number. In a sign-magnitude system, the same pattern represents −97.

The first step in defining a representation for real numbers is to realize that binary notation can be extended to cover negative powers of two, e.g. the string ``110.101'' is interpreted as

1·2^2 + 1·2^1 + 0·2^0 + 1·2^{-1} + 0·2^{-2} + 1·2^{-3} = 6.625

Thus a straightforward method for representing real numbers would be to specify some location within a word as the ``binary point'' and give bits to the left of this
location weights that are positive powers of two and bits to the right weights that are
negative powers of two. For example, in a 16-bit word, we can dedicate the rightmost
5 bits for the fraction part and the leftmost 11 bits for the whole part. In this system,
the representation of 6.625 is 0000000011010100 (note there are leading 0's to pad the
whole part and trailing 0's to pad the fraction part). This representation, where there is
an implied binary point at a fixed location within the word, is known as a  fixed
point representation.

There is an obvious tradeoff between range and precision in fixed point representations. k bits for the fraction part means there will be 2^k numbers in the system between any two successive integers. With 5-bit fractions there are 32 numbers in the system between any two integers; e.g. the numbers between 5 and 6 are 5 + 1/32 (5.03125), 5 + 2/32 (5.0625), etc. To allow more precision, i.e. smaller divisions between successive numbers, we need more bits in the fraction part. The number of bits in the whole part determines the magnitude of the largest positive number we can represent, just as it does for integers. With 11 digits in the whole part, as in the example above, the largest number we can represent in 16 bits is 2^11 − 2^{-5} = 2047.96875. Moving one bit from the whole part to the fraction part in order to increase precision cuts the range in half, and the largest number is now 2^10 − 2^{-6} = 1023.984375.

To allow for a larger range without sacrificing precision, computer systems use a technique known as floating point. This representation is based on the familiar ``scientific notation'' for expressing both very large and very small numbers in a concise format as the product of a small real number and a power of 10, e.g. 6.022 × 10^23. This notation has three components: a base (10 in this example); an exponent (in this case 23); and a mantissa (6.022). In computer systems, the base is either 2 or 16. Since it never changes for any given computer system it does not have to be part of the representation, and we need only two fields to specify a value, one for the mantissa and one for the exponent.

As an example of how a number is represented in floating point, consider again the number 6.625. In binary, it is

110.101 = 1.10101 × 2^2

If a 16-bit system has a 10-bit mantissa and 6-bit exponent, the number would be represented by the string 1101010000 000010. The mantissa is stored in the first ten bits (padded on the right with trailing 0's), and the exponent is stored in the last six bits.

As the above example illustrates, computers transform the numbers so the mantissa is a manageable number. Just as 6.022 × 10^23 is preferred to 60.22 × 10^22 or 0.6022 × 10^24 in scientific notation, in binary the mantissa should be between 1 and 2. When the mantissa is in this range it is said to be normalized. The definition of the normal form varies from system to system, e.g. in some systems a normalized mantissa is between 1/2 and 1.

Since we need to represent both positive and negative real numbers, the complete representation for a real number in a floating point format has three fields: a one-bit sign, a fixed number of bits for the mantissa, and the remainder of the bits for the exponent. Note that the exponent is an integer, and that this integer can be either positive or negative, e.g. we will want to represent very small numbers with large negative exponents. Any method such as two's complement that can represent both positive and negative integers can be used within the exponent field. The sign bit at the front of the number determines the sign of the entire number, which is independent of the sign of the exponent, e.g. it indicates whether the number is +6.625 or −6.625.

In the past every computer manufacturer used their own floating point representation, which made it a nightmare to move programs and datasets from one system to another. The IEEE 754 standard is now being widely adopted and will add stability to this area of computer architecture. For 32-bit systems, the standard calls for a 1-bit sign, 8-bit exponent, and 23-bit mantissa. The largest number that can be represented is approximately 3.4 × 10^38, and the smallest positive number (closest to 0.0) is approximately 1.4 × 10^{-45}. Details of the standard are presented in an appendix to this chapter.

Figure 2. Distribution of Floating Point Numbers

Figure 2 illustrates the numbers that can be stored in a typical computer system with a floating point representation. The figure shows three disjoint regions: positive numbers from m to M, 0.0, and negative numbers from −M to −m. Here M is the largest number that can be stored in the system; in the IEEE standard representation M ≈ 3.4 × 10^38. m is the smallest positive number, which is about 1.4 × 10^{-45} in the IEEE standard.

Programmers need to be aware of several important attributes of the floating point representation that are illustrated by this figure. The first is the magnitude of the range between m and M. There are about 10^38 integers in this range, but there are only 2^32 (about 4 × 10^9) different 32-bit patterns. What this means is there are numbers in the range that do not have representations. Whenever a calculation results in one of these numbers, a round-off error will occur when the system approximates the result by the nearest (we hope) representable number. The arithmetic circuitry will produce a binary pattern that is close to the desired result, but not an exact representation. An interesting illustration of just how common these round-off errors are is the fact that 1/10 does not have a finite representation in binary, but is instead the infinitely repeating pattern 0.000110011001100....
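
A short C experiment makes the effect visible (the exact digits printed may vary slightly by platform):

/* 1/10 is not exactly representable in binary floating point. */
#include <stdio.h>

int main(void)
{
    float tenth = 0.1f;
    printf("%.20f\n", tenth);   /* close to, but not exactly, 0.1 */

    float sum = 0.0f;
    for (int i = 0; i < 10; i++)
        sum += tenth;
    printf("%.20f\n", sum);     /* ten round-off errors: not exactly 1.0 */
    return 0;
}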

The next important point is that there is a gap between m, the smallest positive number, and 0.0. A round-off error in a calculation that should produce a small non-zero value but instead results in 0.0 is called an underflow. One of the strengths of the IEEE standard is that it allows a special denormalized form for very small numbers in order to stave off underflows as long as possible. This is why the exponents in the largest and smallest positive numbers are not symmetrical. Without denormalized numbers, the smallest positive number in the IEEE standard would be around 1.2 × 10^{-38}.

Finally, and perhaps most important, is the fact that the numbers that can be represented are not distributed evenly throughout the range. Representable numbers are very dense close to 0.0, but then grow steadily further apart as they increase in magnitude. The dark regions in Figure 2 correspond to parts of the number line where representable numbers are packed close together. It is easy to see why the distribution is not even by asking what two numbers are represented by two successive values of the mantissa for any given exponent. To make the calculations easier, suppose we have a 16-bit system with a 7-bit mantissa and 8-bit exponent. No matter what the exponent is, the distance between any two successive values of the mantissa will be 2^{-7}, so the distance between two successive floating point numbers with exponent e is 2^{-7} × 2^e. For numbers closest to 0.0, the exponent will be a large negative number, e.g. −127, and the distance between two successive floating point numbers will be 2^{-7} × 2^{-127} = 2^{-134}. At the other end of the scale, when exponents are large, the distance between two numbers will be approximately 2^{-7} × 2^{127}, namely 2^{120}.

2.7 Performance Models


The most widely recognized aspect of a machine's internal organization that relates to
performance is the clock cycle time, which controls the rate of internal operations in
the CPU (Section 2.1). A shorter clock cycle time, or equivalently a larger number of
cycles per second, implies more operations can be performed per unit time.

For a given architecture, it is often possible to rank systems according to their clock
rates. For example, the HP 9000/725 and 9000/735 workstations have basically the
same architecture, meaning they have the same instruction set and, in general, appear
to be the same system as far as compiler writers are concerned. The 725 has a 66MHz
clock, while the 735 has a 99MHz clock, and indeed the 735 has a higher performance
on most programs.

There are several reasons why simply comparing clock cycle times is an inadequate
measure of performance. One reason is that processors don't operate ``in a vacuum'',
but rely on memories and buses to supply information. The size and access times of
the memories and the bandwidth of the bus all play a major role in performance. It is
very easy to imagine a program that requires a large amount of memory running faster
on an HP 725 that has a larger cache and more main memory than a 735. We will
return to the topic of memory organization and processor-memory interconnection in
later sections on vector processors and parallel processors since these two aspects of
systems organization are even more crucial for high performance in those systems.

A second reason clock rate by itself is an inadequate measure of performance is that it doesn't take into account what happens during a clock cycle. This is especially true
when comparing systems with different instruction sets. It is possible that a machine
might have a lower clock rate, but because it requires fewer cycles to execute the
same program it would have higher performance. For example, consider two
machines, A and B, that are almost identical except that A has a multiply instruction
and B does not. A simple loop that multiplies a vector by a scalar (the constant 3 in
this example) is shown in the table below. The number of cycles for each instruction
is given in parentheses next to the instruction.

Table 3

The first instruction loads an element of the vector into an internal processor
register X. Next, machine A multiplies the vector element by 3, leaving the result in
the register. Machine B does the same operation by shifting and adding,
i.e. 3x = 2x + x. B copies the contents of X to another register Y, shifts X left one bit
(which multiplies it by 2), and then adds Y, again leaving the result in X. Both
machines then store the result back into the vector in memory and branch back to the
top of the loop if the vector index is not at the end of the vector (the comparison and
branch are done by the dbr instruction). Machine A's clock might be slightly slower than B's, but since the loop takes fewer cycles on A it will execute the loop faster. For example, if A's clock rate is 9 MHz (0.11 µs per cycle) and B's is 10 MHz (0.10 µs per cycle), A will execute one pass through the loop in 1.1 µs but B will require 1.2 µs.

As a historical note, microprocessor and microcomputer designers in the 1970s tended
to build systems with instruction sets like those of machine A above. The goal was to
include instructions with a large ``semantic content,'' e.g. multiplication is relatively
more complex than loading a value from memory or shifting a bit pattern. The payoff
was in reducing the overhead to fetch instructions, since fewer instructions could
accomplish the same job. By the 1980s, however, it became widely accepted that
instruction sets such as those of machine B were in fact a better match for VLSI chip
technology. The move toward simpler instructions became known as  RISC, for
Reduced Instruction Set Computer. A RISC has fewer instructions in its repertoire,
but more importantly each instruction is very simple. The fact that operations are so
simple and so uniform leads to some very powerful implementation techniques, such
as pipelining, and opens up room on the processor chip for items such as on-chip
caches or multiple functional units, e.g. a CPU that has two or more arithmetic units.
We will discuss these types of systems in more detail later, in the section on
superscalar designs (Section 3.5.2). Another benefit to simple instructions is that cycle
times can also be much shorter; instead of being only moderately faster, e.g. 10MHz
vs. 9MHz as in the example above, cycle times on RISC machines are often much
faster, so even though they fetch and execute more instructions they typically
outperform complex instruction set (CISC) machines designed at the same time.

In order to compare performance of two machines with different instruction sets, and
even different styles of instruction sets (e.g. RISC vs. CISC), we can break the total
execution time into constituent parts [11]. The total time to execute any given program is the product of the number of machine cycles required to execute the program and the processor cycle time:

T = N_cycles × τ

The number of cycles executed can be rewritten as the number of instructions executed times the average number of cycles per instruction:

N_cycles = N_instr × CPI

The middle factor in this expression describes the average number of machine cycles the processor devotes to each instruction. It is the number of cycles per instruction, or CPI. The basic performance model for a single processor computer system is thus

T = N_instr × CPI × τ

where N_instr is the number of instructions executed, CPI is the average number of cycles per instruction, and τ is the processor cycle time.
The three factors each describe different attributes of the execution of a program. The
number of instructions depends on the algorithm, the compiler, and to some extent the
instruction set of the machine. Total execution time can be reduced by lowering the
instruction count, either through a better algorithm (one that executes an inner loop
fewer times, for example), a better compiler (one that generates fewer instructions for
the body of the loop), or perhaps by changing the instruction set so it requires fewer
instructions to encode the same algorithm. As we saw earlier, however, a more
compact encoding as a result of a richer instruction set does not always speed up a
program since complex instructions require more cycles. The interaction between
instruction complexity and the number of cycles to execute a program is very
involved, and it is hard to predict ahead of time whether adding a new instruction will
really improve performance.

The second factor in the performance model is CPI. At first it would seem this factor
is simply a measure of the complexity of the instruction set: simple instructions
require fewer cycles, so RISC machines should have lower CPI values. That view is
misleading, however, since it concerns a static quantity. The performance equation
describes the average number of cycles per instruction  measured during the execution
of a program. The difference is crucial. Implementation techniques such as pipelining
allow a processor to overlap instructions by working on several instructions at one
time. These techniques will lower CPI and improve performance since more
instructions are executed in any given time period. For example, the average
instruction in a system might require three machine cycles: one to fetch it from cache,
one to fetch its operands from registers, and one to perform the operation and store the
result in a register. Based on this static description one might conclude the CPI is 3.0,
since each instruction requires three cycles. However, if the processor can juggle three
instructions at once, for example by fetching instruction $i+2$ while it is locating the
operands for instruction $i+1$ and executing instruction $i$, then the effective CPI
observed during the execution of the program is just a little over 1.0 (Figure 3). Note
that this is another illustration of the difference between speed and bandwidth. Overall
performance of a system can be improved by increasing bandwidth, in this case by
increasing the number of instructions that flow through the processor per unit time,
without changing the execution time of the individual instructions.
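As a back-of-the-envelope check of this claim, consider an idealized three-stage
pipeline with no stalls (a sketch, not a model of any particular processor):
$n$ instructions take $3 + (n-1)$ cycles, so the effective CPI approaches 1.0 as
$n$ grows.

    #include <stdio.h>

    /* Effective CPI of an idealized three-stage pipeline (fetch,
     * operand fetch, execute) with no stalls: the first instruction
     * needs all three cycles, then one instruction completes per cycle. */
    int main(void)
    {
        const long stages = 3;
        const long counts[] = {1, 10, 100, 1000000};

        for (int i = 0; i < 4; i++) {
            long n = counts[i];
            long cycles = stages + (n - 1);
            printf("n = %7ld instructions: effective CPI = %.6f\n",
                   n, (double)cycles / (double)n);
        }
        return 0;
    }

For one instruction the CPI is 3.0, but for a million instructions it is 1.000002,
i.e. "just a little over 1.0."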

The third factor in the performance model is the processor cycle time $t$. This is
usually in the realm of computer engineering: a better layout of the components on the
surface of the chip might shorten wire lengths and allow for a faster clock, or a
different material (e.g. gallium arsenide vs. silicon-based semiconductors) might have

a faster switching time. However, the architecture can also affect cycle time. One of
the reasons RISC is such a good fit for current VLSI technology is that if the
instruction set is small, it requires less logic to implement. Less logic means less space
on the chip, and smaller circuits run faster and consume less power  [12]. Thus the
design of the instruction set, the organization of pipelines, and other attributes of the
architecture and its implementation can impact cycle time.

Figure 3: Pipelined execution.

We conclude this section with a few remarks on some metrics that are commonly used
to describe the performance of computer systems.  MIPS stands for ``millions of
instructions per second.'' With the variation in instruction styles, internal organization,
and number of processors per system it is almost meaningless for comparing two
systems. As a point of reference, the DEC VAX 11/780 executed approximately one
million instructions per second. You may see a system described as having
performance rated at ``X VAX MIPS.'' This is a measure of performance normalized
to VAX 11/780 performance. What this means is someone ran a program on the VAX,
then ran the same program on the other system, and the ratio of the VAX's execution
time to the other system's is X. The term ``native
MIPS'' refers to the number of millions of instructions of the machine's own
instruction set that can be executed per second.
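Written as a formula (simply restating the definition above, with $T$ denoting the
measured execution time of the same program on each machine):

\[ \text{rating in VAX MIPS} = \frac{T_{\text{VAX 11/780}}}{T_{\text{machine}}} \]

so a machine that finishes the program in one tenth of the VAX's time is rated at
10 VAX MIPS.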

MFLOPS (pronounced ``megaflops'') stands for ``millions of floating point
operations per second.'' This is often used as a ``bottom-line'' figure. If you know
ahead of time how many operations a program needs to perform, you can divide the
number of operations by the execution time to come up with a MFLOPS rating. For
example, the standard algorithm for multiplying $n \times n$ matrices requires
$n^2(2n-1)$ operations ($n^2$ inner products, with $n$ multiplications and $n-1$
additions in each product). Suppose you compute the product of two $n \times n$
matrices in 0.35 seconds. Your computer achieved

\[ \frac{n^2(2n-1)}{0.35 \times 10^6}\ \text{MFLOPS} \]

Obviously this type of comparison ignores the overhead involved in setting up loops,
checking terminating conditions, and so on, but as a ``bottom line'' it gets to the point:
what you care about (in this example) is how long it takes to multiply two matrices,
and if that operation is a major component of your research it makes sense to compare
machines by how fast they can multiply matrices. A standard set of reference
programs known as LINPACK (linear algebra package) is often used to compare

systems based on their MFLOPS ratings by measuring execution times for Gaussian
elimination on $100 \times 100$ matrices [8].
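The measurement itself is easy to sketch. The following program is a minimal
illustration of this style of rating, not the LINPACK code; the matrix size and the
use of the standard C clock() timer are choices made here. It times the triple-loop
multiply and converts the $n^2(2n-1)$ operation count into a MFLOPS figure.

    #include <stdio.h>
    #include <time.h>

    #define N 300   /* matrix dimension, chosen arbitrarily for this sketch */

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        /* Fill the operands with arbitrary values. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = 1.0;
                b[i][j] = 2.0;
                c[i][j] = 0.0;
            }

        clock_t start = clock();
        /* Triple-loop matrix multiply.  The MFLOPS rating below uses
         * the textbook count of N*N inner products, each counted as
         * N multiplications and N-1 additions. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
        double t = (double)(clock() - start) / CLOCKS_PER_SEC;

        double ops = (double)N * N * (2.0 * N - 1.0);
        printf("%d x %d multiply: %.3f seconds, %.1f MFLOPS\n",
               N, N, t, ops / (t * 1.0e6));
        return 0;
    }

On a fast machine the run may finish below the resolution of clock(), so a larger N
may be needed to get a stable rating.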

The term ``theoretical peak MFLOPS'' refers to how many operations per second
would be possible if the machine did nothing but numerical operations. It is obtained
by calculating the time it takes to perform one operation and then computing how
many of them could be done in one second. For example, if it takes 8 cycles to do one
floating point multiplication, the cycle time on the machine is 20 nanoseconds, and
arithmetic operations are not overlapped with one another, it takes 160ns for one
multiplication, and

\[ \frac{1\ \text{operation}}{160 \times 10^{-9}\ \text{seconds}} = 6.25 \times 10^{6}\ \text{operations per second,} \]

so the theoretical peak performance is 6.25 MFLOPS. Of course, programs are not
just long sequences of multiply and add instructions, so a machine rarely comes close
to this level of performance on any real program. Most machines will achieve less
than 10% of their peak rating, but vector processors or other machines with internal
pipelines that have an effective CPI near 1.0 can often achieve 70% or more of their
theoretical peak on small programs.
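The same arithmetic in code form, using the figures from the example above
(8 cycles per multiplication, a 20 ns cycle time, no overlap of operations):

    #include <stdio.h>

    /* Theoretical peak MFLOPS: one floating point multiplication
     * every 8 cycles at 20 ns per cycle, with no overlap. */
    int main(void)
    {
        double cycle_time    = 20e-9;  /* seconds per machine cycle */
        double cycles_per_op = 8.0;
        double time_per_op   = cycles_per_op * cycle_time;  /* 160 ns */

        printf("theoretical peak = %.2f MFLOPS\n",
               1.0 / (time_per_op * 1.0e6));
        return 0;
    }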

Using metrics such as CPI, MIPS, or MFLOPS to compare machines depends heavily
on the programs used to measure execution times. A  benchmark is a program written
specifically for this purpose. There are several well-known collections of benchmarks.
One that is particularly interesting to computational scientists is LINPACK, which
contains a set of linear algebra routines written in Fortran. MFLOPS ratings based on
LINPACK performance are published regularly [8]. Two collections of a wider range
of programs are SPEC (System Performance Evaluation Cooperative) and the Perfect
Club, which is oriented toward parallel processing. Both include widely used
programs such as a C compiler and a text formatter, not just small special purpose
subroutines, and are useful for comparing systems such as high performance
workstations that will be used for other jobs in addition to computational science
modelling.
