Subject: Advanced Architecture Author: Dr. Deepti Mehrotra Paper Code: MS Vetter: Lesson: Parallel Computer Models Lesson No.: 01
Objective
Introduction
Keywords
Summary
1.0 Objective
The main aim of this chapter is to learn about the evolution of computer systems, various
attributes on which performance of system is measured, classification of computers on
their ability to perform multiprocessing and various trends towards parallel processing.
1.1 Introduction
From an application point of view, the mainstream of usage of computer is experiencing
a trend of four ascending levels of sophistication:
Data processing
Information processing
Knowledge processing
Intelligence processing
With more and more data structures developed, many users are shifting to computer roles
from pure data processing to information processing. A high degree of parallelism has
been found at these levels. As the accumulated knowledge bases expanded rapidly in
recent years, there grew a strong demand to use computers for knowledge processing.
Intelligence is very difficult to create; its processing even more so. Today's computers are very fast and obedient and have many reliable memory cells, qualifying them for data, information, and knowledge processing.
Parallel processing is emerging as one of the key technologies in the area of modern computers. Parallelism appears in various forms, such as lookahead, vectorization, concurrency, simultaneity, data parallelism, interleaving, overlapping, multiplicity, replication, multiprogramming, multithreading and distributed computing, at different processing levels.
1.2 The state of computing
Modern computers are equipped with powerful hardware technology and at the same time loaded with sophisticated software packages. To assess the state of the art of computing, we first review the history of computers and then study the attributes used for analysing the performance of computers.
1.2.1
implementations. Designers always tried to manufacture new machines that would be upward compatible with the older machines.
3) The concept of specialized registers was introduced: for example, index registers were introduced in the Ferranti Mark I, the concept of a register that saves the return address of an instruction was introduced in the UNIVAC I, immediate operands appeared in the IBM 704, and the detection of invalid operations was introduced in the IBM 650.
4) Punch cards or paper tape were the devices used at that time for storing programs. By the end of the 1950s the IBM 650 had become one of the popular computers of that time; it used drum memory onto which programs were loaded from punch cards or paper tape. Some high-end machines also introduced core memory, which was able to provide higher speeds. Hard disks also started becoming popular.
5) As noted earlier, machines of the early 1950s were design-specific: most of them were designed for particular numerical processing tasks. Many of them even used decimal numbers as the base number system of their instruction sets; in such machines there were actually ten vacuum tubes per digit in each register.
6) Software used was machine level language and assembly language.
7) Mostly designed for scientific calculation and later some systems were developed for
simple business systems.
8) Architecture features
Vacuum tubes and relay memories
CPU driven by a program counter (PC) and accumulator
Machines had only fixed-point arithmetic
9) Software and Applications
Machine and assembly language
Single user at a time
No subroutine linkage mechanisms
Programmed I/O required continuous use of CPU
10) Examples: ENIAC, Princeton IAS, IBM 701
IInd generation of computers (1954-64)
The transistor was invented by Bardeen, Brattain and Shockley in 1947 at Bell Labs, and by the 1950s transistors had caused an electronic revolution, as a transistor is smaller, cheaper and dissipates less heat than a vacuum tube. Transistors were now used instead of vacuum tubes to construct computers. Another major invention was that of magnetic cores for storage; these cores were used to build large random-access memories. Computers of this generation had better processing speed, larger memory capacity and smaller size compared to the previous generation.
The key features of this generation of computers were:
1) The IInd generation computers were designed using germanium transistors; this technology was much more reliable than vacuum tube technology.
2) Use of transistor technology reduced the switching time to 1 to 10 microseconds, thus providing an overall speedup.
3) Magnetic cores were used as main memory, with a capacity of about 100 KB. Tape and disk peripheral memory were used as secondary memory.
4) Introduction of the concept of an instruction set, so that the same program could be executed on different systems.
5) High-level languages (FORTRAN, COBOL, ALGOL) and batch operating systems appeared.
6) Computers were now used for extensive business applications, engineering design, optimization using linear programming, and scientific research.
7) The binary number system was widely used.
8) Technology and Architecture
Discrete transistors and core memories
I/O processors, multiplexed memory access
Floating-point arithmetic available
Register Transfer Language (RTL) developed
9) Software and Applications
High-level languages (HLL): FORTRAN, COBOL, ALGOL with compilers and
subroutine libraries
Batch operating system was used although mostly single user at a time
10) Examples: CDC 1604, UNIVAC LARC, IBM 7090
IIIrd generation of computers (1965 to 1974)
In the 1950s and 1960s, discrete components (transistors, resistors, capacitors) were manufactured and packaged in separate containers. To design a computer, these discrete units were soldered or wired together on circuit boards. Another revolution in computer design came in the 1960s, when projects such as the Apollo guidance computer and the Minuteman missile drove the development of the integrated circuit (commonly called the IC). ICs made circuit design more economical and practical. IC-based computers are called third generation computers. As an integrated circuit puts transistors, resistors and capacitors on a single chip, eliminating wired interconnections, the space required for the computer was greatly reduced. By the mid-1970s, the use of ICs in computers had become very common and the price of transistors had fallen greatly. It now became possible to put all the components required for a CPU on a single printed circuit board. This advancement of technology resulted in the development of minicomputers, usually with a 16-bit word size and memory in the range of 4 KB to 64 KB. This began a new era of microelectronics, where it became possible to design small identical chips (thin wafers of silicon), each chip containing many gates plus a number of input/output pins.
Key features of IIIrd Generation computers:
1) The use of silicon-based ICs led to major improvements of the computer system. The switching speed of transistors went up by a factor of 10, size was reduced by a factor of 10, reliability increased by a factor of 10, and power dissipation was reduced by a factor of 10. The cumulative effect of this was the emergence of extremely powerful CPUs with the capacity of carrying out 1 million instructions per second.
2) The size of main memory reached about 4 MB by improving the design of magnetic core memories, and hard disks of 100 MB became feasible.
3) On-line systems became feasible. In particular, dynamic production control systems, airline reservation systems, interactive query systems, and real-time closed-loop process control systems were implemented.
4) The concept of integrated database management systems emerged.
5) 32-bit instruction formats.
6) Time-shared operating systems.
7) Technology and Architecture features
Integrated circuits (SSI/MSI)
Microprogramming
Pipelining, cache memories, lookahead processing
In parallel we have multiple execution units, i.e., separate arithmetic-logic units (ALUs). Instead of executing a single instruction at a time, the system divides the program into several independent instructions; the CPU looks for several instructions that are not dependent on each other and executes them in parallel. Examples of this design are VLIW and EPIC.
1) Technology and Architecture features
ULSI/VHSIC processors, memory, and switches
High-density packaging
Scalable architecture
Vector processors
2) Software and Applications
Massively parallel processing
Grand challenge applications
Heterogeneous processing
3) Examples : Fujitsu VPP500, Cray MPP, TMC CM-5, Intel Paragon
Elements of Modern Computers
The hardware, software, and programming elements of modern computer systems can be characterized by looking at a variety of factors. In the context of parallel computing, these factors are:
Computing problems
Hardware resources
Operating systems
Compiler support
Computing Problems
Traditional algorithms and data structures are designed for sequential machines.
New, specialized algorithms and data structures are needed to exploit the
capabilities of parallel architectures.
These often require interdisciplinary interactions among theoreticians, experimentalists, and computer programmers.
Hardware Resources
The operating system and applications also significantly influence the overall
architecture.
Not only must the processor and memory architectures be considered, but also the
architecture of the device interfaces (which often include their advanced
processors).
Operating System
UNIX, Mach, and OSF/1 provide support for multiprocessors and multicomputers
Compilers, assemblers, and loaders are traditional tools for developing programs in high-level languages. Together with the operating system, these tools determine the binding of resources to applications, and the effectiveness of this binding determines the efficiency of hardware utilization and the system's programmability.
Most programmers still employ a sequential mind set, abetted by a lack of popular
parallel software support.
New languages have obvious advantages (like new constructs specifically for
parallelism), but require additional programmer education and system software.
Compiler Support
Compiler directives are often inserted into source code to aid compiler
parallelizing efforts
Flynn's taxonomy distinguishes multi-processor computer architectures according to two independent dimensions: instruction and data.
They are also called scalar processors, i.e., they execute one instruction at a time, and each instruction has only one set of operands.
Single instruction: only one instruction stream is being acted on by the CPU
during any one clock cycle
Single data: only one data stream is being used as input during any one clock
cycle
Deterministic execution
This is the oldest and until recently, the most prevalent form of computer
Single instruction: All processing units execute the same instruction issued by the control unit at any given clock cycle, as shown in figure 1.5, where there are multiple processors executing the instruction given by one control unit.
Multiple data: Each processing unit can operate on a different data element, as shown in the figure below, where the processors are connected to shared memory or an interconnection network providing multiple data to the processing units.
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
Two varieties: processor arrays, e.g., Connection Machine CM-2, MasPar MP-1 and MP-2; and vector pipeline processors, e.g., IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820.
Thus in these computers the same data flows through a linear array of processors executing different instruction streams, as shown in figure 1.6.
Few actual examples of this class of parallel computer have ever existed. One is
the experimental Carnegie-Mellon C.mmp computer (1971).
Multiple data: every processor may be working with a different data stream, as shown in figure 1.7, where the multiple data streams are provided by shared memory.
As shown in figure 1.8, there are different processors, each processing a different task.
hardware technology
architectural features
algorithm design
data structures
language efficiency
programmer skill
compiler technology
When we talk about the performance of a computer system, we describe how quickly a given system can execute a program or programs. Thus we are interested in knowing the turnaround time. Turnaround time depends on:
compilation time
CPU time
An ideal performance of a computer system means a perfect match between the machine
capability and program behavior. The machine capability can be improved by using
better hardware technology and efficient resource management. But as far as program
behavior is concerned it depends on code used, compiler used and other run time
conditions. Also, a machine's performance may vary from program to program. Because there are too many programs and it is impractical to test a CPU's speed on all of them, benchmarks were developed. Computer architects have come up with a variety of metrics to describe computer performance.
Clock rate and CPI / IPC: Since I/O and system overhead frequently overlap processing by other programs, it is fair to consider only the CPU time used by a program, and the user CPU time is the most important factor. The CPU is driven by a clock with a constant cycle time t (usually measured in nanoseconds), which controls the rate of internal operations in the CPU. The inverse of the cycle time is the clock rate (f = 1/t, measured in megahertz). A shorter clock cycle time, or equivalently a larger number of cycles per second, implies that more operations can be performed per unit time. The size of a program is determined by its instruction count, Ic, the number of machine instructions to be executed by the program. Different machine instructions require different numbers of clock cycles to execute. CPI (cycles per instruction) is thus an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
Of course, any estimate is valid only for a specific set of programs (which defines the instruction mix), and then only if there is a sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given program mix. The time required to execute a program containing Ic instructions is just T = Ic * CPI * t.
Each instruction must be fetched from memory, decoded, then operands fetched from
memory, the instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k times the processor cycle time t. The value of k depends on the memory technology and the processor-memory interconnection scheme. The processor cycles required for each instruction (CPI) can be attributed to cycles needed for instruction decode and execution (p), and cycles needed for memory references (m * k).
The total time needed to execute a program can then be rewritten as
T = Ic * (p + m*k) * t.
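As an illustration of these formulas, the following Python sketch computes the average CPI and the total execution time T = Ic * (p + m*k) * t. The instruction mix, cycle counts and instruction count below are invented assumptions, not figures from the text.

# Sketch: CPU time from instruction count, CPI and cycle time (assumed numbers).
tau = 2e-9          # processor cycle time in seconds (assumed 2 ns clock)
k = 4               # memory cycle time = k * tau (assumed)

# (fraction of instructions, decode/execute cycles p, memory references m) - assumed mix
instruction_mix = [
    (0.60, 1, 1),   # ALU-type instructions
    (0.30, 2, 2),   # load/store instructions
    (0.10, 2, 1),   # branch instructions
]

Ic = 1_000_000      # instruction count of the program (assumed)

# Average cycles per instruction: CPI = p + m*k, weighted by the mix.
cpi = sum(f * (p + m * k) for f, p, m in instruction_mix)

# Total execution time: T = Ic * CPI * tau
T = Ic * cpi * tau
print(f"average CPI = {cpi:.2f}, execution time T = {T*1e3:.3f} ms")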
MIPS: millions of instructions per second; this is calculated by dividing the number of instructions executed in running a program by the time required to run the program. The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program. MIPS has not proved to be an effective measure, as it does not account for the fact that different systems often require different numbers of instructions to implement the same program; it does not tell us how many instructions are required to perform a given task. With variations in instruction styles, internal organization, and number of processors per system, it is almost meaningless for comparing two systems.
MFLOPS (pronounced "megaflops") stands for "millions of floating point operations per second." This is often used as a "bottom-line" figure. If one knows ahead of time how many operations a program needs to perform, one can divide the number of operations by the execution time to come up with a MFLOPS rating. For example, the standard algorithm for multiplying n x n matrices requires 2n^3 - n operations (n^2 inner products, with n multiplications and n-1 additions in each product). Suppose you compute the product of two 100 x 100 matrices in 0.35 seconds. Then the computer achieves
(2(100)^3 - 100)/0.35 = 5,714,000 ops/sec = 5.714 MFLOPS
The term "theoretical peak MFLOPS" refers to how many operations per second would be possible if the machine did nothing but numerical operations. It is obtained by calculating the time it takes to perform one operation and then computing how many of them could be done in one second. For example, if it takes 8 cycles to do one floating point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic operations are not overlapped with one another, then one multiplication takes 160 ns, and
(1,000,000,000 ns / 1 sec) * (1 multiplication / 160 ns) = 6.25 * 10^6 multiplications per second,
so the theoretical peak performance is 6.25 MFLOPS.
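The MFLOPS arithmetic above translates directly into a few lines of code. The sketch below (Python, using the matrix size, timing and cycle figures from the examples) reproduces both the measured rating and the theoretical peak figure.

# Sketch: MFLOPS rating of an n x n matrix multiplication (2n^3 - n operations)
# and theoretical peak MFLOPS from the cycle time, as in the examples above.
def mflops(n, seconds):
    ops = 2 * n**3 - n              # n^2 inner products, n mults + (n-1) adds each
    return ops / seconds / 1e6

print(mflops(100, 0.35))            # about 5.71 MFLOPS

# Theoretical peak: 8 cycles per multiply, 20 ns cycle time -> 160 ns per multiply
cycle_ns = 20
cycles_per_mult = 8
peak = 1e9 / (cycle_ns * cycles_per_mult) / 1e6
print(peak)                         # 6.25 MFLOPS peak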
Total CPU time can be used as a basis in estimating the execution rate of a
processor.
Programming Environments
Programmability depends on the programming environment provided to the users.
Conventional computers are used in a sequential programming environment with tools
developed for a uniprocessor computer. Parallel computers need parallel tools that allow
specification or easy detection of parallelism and operating systems that can perform
parallel scheduling of concurrent events, shared memory allocation, and shared peripheral
and communication links.
Implicit Parallelism
Use a conventional language (like C, Fortran, Lisp, or Pascal) to write the program.
Use a parallelizing compiler to translate the source code into parallel code.
The compiler must detect parallelism and assign target machine resources.
Success relies heavily on the quality of the compiler.
Explicit Parallelism
Programmer writes explicit parallel code using parallel dialects of common languages.
Compiler has reduced need to detect parallelism, but must still preserve existing
parallelism and assign target machine resources.
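As a minimal illustration of explicit parallelism, the sketch below uses Python's standard multiprocessing module to sum a list with an explicitly programmed four-way decomposition; the data size and worker count are arbitrary assumptions for the example.

# Sketch: explicit parallelism - the programmer states the parallel decomposition.
from multiprocessing import Pool

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]        # explicit 4-way partitioning
    with Pool(processes=4) as pool:                # programmer chooses 4 workers
        total = sum(pool.map(partial_sum, chunks))
    print(total)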
Needed Software Tools
Parallel extensions of conventional high-level languages.
Integrated environments to provide different levels of program abstraction; validation, testing and debugging; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computational results.
1.3 MULTIPROCESSOR AND MULTICOMPUTERS
Two categories of parallel computers are discussed below, namely shared (common) memory and unshared (distributed) memory machines.
1.3.1 Shared memory multiprocessors
Shared memory parallel computers vary widely, but generally have in common
the ability for all processors to access all memory as global address space.
Multiple processors can operate independently but share the same memory
resources.
Changes in a memory location effected by one processor are visible to all other
processors.
Shared memory machines can be divided into classes based upon memory access times: UMA, NUMA and COMA.
Identical processors
If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages:
CPU path, and for cache coherent systems, geometrically increases the traffic associated with cache/memory management.
Like shared memory systems, distributed memory systems vary widely but share
a common characteristic. Distributed memory systems require a communication
network to connect inter-processor memory.
Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space
across all processors.
Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task
of the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
Advantages:
Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
Disadvantages:
The programmer is responsible for many of the details associated with data
communication between processors.
A vector processor consists of a scalar processor and a vector unit, which could be
thought of as an independent functional unit capable of efficient vector operations.
1.4.1 Vector Hardware
Vector computers have hardware to perform the vector operations efficiently. Operands cannot be used directly from memory but rather are loaded into registers and are put back into registers after the operation. Vector hardware has the special ability to overlap or pipeline operand processing.
clock is used. Thus at each step, i.e., when the global clock pulse changes, all processors execute the same instruction, each on different data (single instruction, multiple data).
SIMD machines are particularly useful in solving problems involving vector calculations, where one can easily exploit data parallelism. In such calculations the same set of instructions is applied to all subsets of data. Let us add two vectors, each having N elements, on a SIMD machine with N/2 processing elements. The same addition instruction is issued to all N/2 processors and all processing elements execute the instruction simultaneously. It takes 2 steps to add the two vectors, as compared to N steps on a SISD machine. The distributed data can be loaded into the PEMs from an external source via the system bus, or via system broadcast mode using the control bus.
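The two-step vector addition described above can be mimicked with a short serial simulation of what the N/2 processing elements would do in lock-step; the vector length N = 8 below is an assumed example.

# Sketch: simulating SIMD addition of two N-element vectors on N/2 PEs.
N = 8
A = list(range(N))
B = list(range(N, 2 * N))
C = [0] * N
PEs = N // 2

# Each "step" corresponds to one broadcast add instruction executed by all PEs.
for step in range(2):
    for pe in range(PEs):                    # all PEs act simultaneously in hardware
        i = step * PEs + pe                  # element assigned to this PE in this step
        C[i] = A[i] + B[i]

print(C)   # element-wise sum computed in 2 steps instead of N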
Array processors can be classified into two categories depending on how the memory units are organized:
a. Dedicated memory organization
b. Global memory organization
A SIMD computer C is characterized by the following set of parameters:
C = <N, F, I, M>
where N = the number of PEs in the system (for example, the Illiac IV has N = 64 and the BSP has N = 16);
F = a set of data routing functions provided by the interconnection network;
I = the set of machine instructions for scalar, vector, data routing and network manipulation operations;
M = the set of masking schemes, where each mask partitions the set of PEs into disjoint subsets of enabled and disabled PEs.
EREW - Exclusive read, exclusive write; any memory location may only be
accessed once in any one step. Thus forbids more than one processor from reading
or writing the same memory cell simultaneously.
CREW - Concurrent read, exclusive write; any memory location may be read any
number of times during a single step, but only written to once, with the write
taking place after the reads.
ERCW - Exclusive read, concurrent write; this allows exclusive reads but concurrent writes to the same memory location.
CRCW - Concurrent read, concurrent write; any memory location may be written
to or read from any number of times during a single step. A CRCW PRAM model
must define some rule for resolving multiple writes, such as giving priority to the
lowest-numbered processor or choosing amongst processors randomly. The
PRAM is popular because it is theoretically tractable and because it gives
remote memory data at a much slower speed. PRAM and VLSI are the advanced technologies that are used for designing the architecture.
1.7 Keywords
multiprocessor A computer in which processors can execute separate instruction
streams, but have access to a single address space. Most multiprocessors are shared
memory machines, constructed by connecting several processors to one or more memory
banks through a bus or switch.
multicomputer A computer in which processors can execute separate instruction
streams, have their own private memories and cannot directly access one another's
memories. Most multicomputers are disjoint memory machines, constructed by joining
nodes (each containing a microprocessor and some memory) via links.
MIMD Multiple Instruction, Multiple Data; a category of Flynn's taxonomy in which
many instruction streams are concurrently applied to multiple data sets. A MIMD
architecture is one in which heterogeneous processes may execute at different rates.
MIPS one Million Instructions Per Second. A performance rating usually referring to
integer or non-floating point instructions
vector processor A computer designed to apply arithmetic operations to long vectors or
arrays. Most vector processors rely heavily on pipelining to achieve high performance
pipelining Overlapping the execution of two or more operations
Lesson No.: 02
Objective
Introduction
Condition of parallelism
o Data dependence and resource dependence
o Hardware and software parallelism
o The role of compilers
Summary
Keywords
2.0 Objective
In this lesson we will study the fundamental properties of programs and how parallelism can be introduced into a program. We will study granularity, partitioning of programs, program flow mechanisms and compilation support for parallelism. Interconnection architectures, both static and dynamic types, will be discussed.
2.1 Introduction
The advantage of multiprocessors lies in the ability to exploit the parallelism in a program by implementing it on multiple processors. Thus, in order to implement parallelism, we should understand the various conditions of parallelism.
What are the various bottlenecks in implementing parallelism? For the full implementation of parallelism there are three significant areas to be understood, namely computation models for parallel computing, interprocessor communication in parallel architectures, and system integration for incorporating parallel systems. A multiprocessor system thus poses a number of problems that are not encountered in sequential processing, such as designing a parallel algorithm for the application, partitioning the application into tasks, coordinating communication and synchronization, and scheduling the tasks onto the machine.
2.2 Condition of parallelism
The ability to execute several program segments in parallel requires each segment to be
independent of the other segments. We use a dependence graph to describe the relations.
The nodes of a dependence graph correspond to the program statements (instructions), and
directed edges with different labels are used to represent the ordered relations among the
statements. The analysis of dependence graphs shows where opportunity exists for
parallelization and vectorization.
2.2.1 Data and resource Dependence
Data dependence: The ordering relationship between statements is indicated by data dependence. Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input (operand to be used) to S2. This is also called a RAW (read-after-write) hazard.
2. Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. This is also called a WAR (write-after-read) hazard.
3. Output dependence: Two statements are output-dependent if they produce (write) the same output variable. This is also called a WAW (write-after-write) hazard.
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence: the dependence relation cannot be determined when, for example, a variable appears more than once with subscripts having different coefficients of the loop variable.
Parallel execution of program segments which do not have total data independence can
produce non-deterministic results.
Consider the following fragment of a program:
S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
Here there is flow dependency from S1 to S2, from S3 to S4, and from S2 to S2 (R2 is both read and written by S2);
anti-dependency from S2 to S3;
and output dependency from S1 to S3.
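The dependences listed above can be derived mechanically from the read/write sets of the four statements. The sketch below is a simplified illustration (it considers only distinct statement pairs, so the self-dependence of S2 on R2 is not reported); the register sets are transcribed from the fragment.

# Sketch: deriving RAW / WAR / WAW dependences from read/write sets of S1..S4.
stmts = [
    ("S1", {"A"},        {"R1"}),   # Load  R1, A
    ("S2", {"R1", "R2"}, {"R2"}),   # Add   R2, R1
    ("S3", {"R3"},       {"R1"}),   # Move  R1, R3
    ("S4", {"R1"},       {"B"}),    # Store B, R1
]

def not_killed(reg, i, j):
    """True if no statement strictly between i and j rewrites reg."""
    return all(reg not in stmts[k][2] for k in range(i + 1, j))

for i in range(len(stmts)):
    for j in range(i + 1, len(stmts)):
        ni, ri, wi = stmts[i]
        nj, rj, wj = stmts[j]
        if any(not_killed(r, i, j) for r in wi & rj):
            print(f"flow (RAW)   {ni} -> {nj}")
        if ri & wj:
            print(f"anti (WAR)   {ni} -> {nj}")
        if wi & wj:
            print(f"output (WAW) {ni} -> {nj}")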
Bernstein's Conditions
In terms of data dependencies, Bernstein's conditions imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent. The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk). Therefore, || is not an equivalence relation. Intersection of the input sets is allowed.
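A minimal sketch of the test implied by Bernstein's conditions: two processes can run in parallel only if neither reads what the other writes and they do not write to common locations, while their input sets may intersect. The example statements and variable sets below are invented for illustration.

# Sketch: Bernstein's conditions for two processes with input set I and output set O.
def bernstein_parallel(I1, O1, I2, O2):
    # P1 || P2  iff  I1 & O2 == {}, I2 & O1 == {} and O1 & O2 == {} (I1 & I2 may overlap).
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

# Example (made-up statements): P1: C = A + B,  P2: D = A * E
print(bernstein_parallel({"A", "B"}, {"C"}, {"A", "E"}, {"D"}))   # True: can run in parallel
# P3: A = C + 1 writes A, which P1 reads, and reads C, which P1 writes -> not parallel
print(bernstein_parallel({"A", "B"}, {"C"}, {"C"}, {"A"}))        # False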
2.2.2 Hardware and software parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity, i.e., functional parallelism times processor parallelism. It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. Conventional processors are one-issue machines. This provides the user with information about the peak attainable performance. For example, the Intel i960CA is a three-issue processor (arithmetic, memory access, branch), and the IBM RS/6000 is a four-issue processor (arithmetic, floating-point, memory access, branch). A machine with n k-issue processors should be able to handle a maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the program's flow graph; i.e., it is defined by dependencies within the code and is a function of algorithm, programming style, and compiler optimization.
2.2.3 The Role of Compilers
Compilers are used to exploit hardware features to improve performance. Interaction between compiler and architecture design is a necessity in modern computer development. It is not necessarily the case that more software parallelism will improve performance on conventional scalar processors. The hardware and the compiler should be designed at the same time.
2.3 Program Partitioning & Scheduling
2.3.1 Grain size and latency
The size of the parts or pieces of a program that can be considered for parallel execution
can vary. The sizes are roughly classified using the term granule size, or simply
granularity. The simplest measure, for example, is the number of instructions in a
program part. Grain sizes are usually described as fine, medium or coarse, depending on
the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a
computer. Memory latency, for example, is the time required by a processor to access
memory. Synchronization latency is the time required for two processes to synchronize
their execution. Computational granularity and communication latency are closely
related. Latency and grain size are interrelated, and some general observations are:
As grain size is reduced, there are fewer operations between communications, and hence the impact of latency increases.
Levels of Parallelism
Instruction Level Parallelism
This fine-grained, or smallest granularity, level typically involves less than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with about five instructions or statements being the average level of parallelism.
Advantages:
There are usually many candidates for parallel execution
Compilers can usually do a reasonable job of finding this parallelism
Loop-level Parallelism
A typical loop has less than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline or by a SIMD machine. Loops are the most optimized program construct to execute on a parallel or vector machine. Some loops (e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.
Procedure-level Parallelism
Medium-sized grain; usually less than 2000 instructions. Detection of parallelism is more difficult than with smaller grains; interprocedural dependence analysis is difficult and history-sensitive. Communication requirements are less than at the instruction level. SPMD (single program, multiple data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; the grain typically has thousands of instructions; medium- or coarse-grain level. Job steps can overlap across different jobs. Multiprogramming is conducted at this level. No compilers are available to exploit medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
Corresponds to execution of essentially independent jobs or programs on a parallel
computer. This is practical for a machine with a small number of powerful processors,
but impractical for a machine with a large number of simple processors (since each
processor would take too long to process a single job).
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies are attributed to machine architecture, technology, and the communication patterns used. Latency imposes a limiting factor on machine scalability. For example, memory latency increases as memory capacity increases, limiting the amount of memory that can be used with a given tolerance for communication latency.
Interprocessor Communication Latency
Communication Patterns
How can I partition a program into parallel pieces to yield the shortest execution time?
What is the optimal size of parallel grains?
There is an obvious tradeoff between the time spent scheduling and synchronizing
parallel grains and the speedup obtained by parallel execution.
One approach to the problem is called grain packing.
Program Graphs and Packing
A program graph is similar to a dependence graph. Nodes = {(n, s)}, where n = node name and s = size (larger s = larger grain size).
Edges = {(v, d)}, where v = the variable being communicated and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more
edges to other nodes. Packing is done to eliminate unnecessary communication delays or
reduce overall scheduling overhead.
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication
delay requirements are observed, and no two nodes are executing on the same processor
at the same time. Some general scheduling goals
Select grain sizes for packing to achieve better schedules for a particular parallel
machine.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not
always produce a shorter schedule. By duplicating nodes (that is, executing some
instructions on multiple processors), we may eliminate some interprocessor
communication, and thus produce a shorter schedule.
Program partitioning and scheduling
Scheduling and allocation is a highly important issue since an inappropriate scheduling of
tasks can fail to exploit the true potential of the system and can offset the gain from
parallelization. Here we focus on the scheduling aspect. The objective of
scheduling is to minimize the completion time of a parallel application by properly
allocating the tasks to the processors. In a broad sense, the scheduling problem exists in
two forms: static and dynamic. In static scheduling, which is usually done at compile
time, the characteristics of a parallel program (such as task processing times,
communication, data dependencies, and synchronization requirements) are known before
program execution
A parallel program, therefore, can be represented by a node- and edge-weighted directed
acyclic graph (DAG), in which the node weights represent task processing times and the
edge weights represent data dependencies as well as the communication times between
tasks. In dynamic scheduling, only a few assumptions about the parallel program can be made before execution, and thus scheduling decisions have to be made on-the-fly. The goal of a dynamic scheduling algorithm as such includes not only the minimization of the program completion time but also the minimization of the scheduling overhead, which constitutes a significant portion of the cost paid for running the scheduler. In general, dynamic scheduling is an NP-hard problem.
2.4 Program flow mechanism
Conventional machines use a control flow mechanism in which the order of program execution is explicitly stated in the user program. Dataflow machines are those in which instructions can be executed by determining operand availability. Reduction machines trigger an instruction's execution based on the demand for its results.
Control Flow vs. Data Flow: In control flow computers the next instruction is executed when the last instruction, as stored in the program, has been executed, whereas in data flow computers an instruction is executed when the data (operands) required for executing that instruction are available.
Control flow machines use shared memory for instructions and data. Since variables are
updated by many instructions, there may be side effects on other instructions. These side
effects frequently prevent parallel processing. Single processor systems are inherently
sequential.
Instructions in dataflow machines are unordered and can be executed as soon as their
operands are available; data is held in the instructions themselves. Data tokens are passed
from an instruction to its dependents to trigger execution.
Tagged tokens enter PE through local path (pipelined), and can also be communicated to
other PEs through the routing network. Instruction address(es) effectively replace the
program counter in a control flow machine. Context identifier effectively replaces the
frame base register in a control flow machine. Since the dataflow machine matches the
data tags from one instruction with successors, synchronized instruction execution is
implicit.
An I-structure in each PE is provided to eliminate excessive copying of data structures.
Each word of the I-structure has a two-bit tag indicating whether the value is empty, full,
or has pending read requests.
This is a retreat from the pure dataflow approach. Special compiler technology needed for
dataflow machines.
Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their
operands; this is essentially a bottom-up approach.
Demand-driven machines take a top-down approach, attempting to execute the
instruction (a demander) that yields the final result. This triggers the execution of
instructions that yield its operands, and so forth. The demand-driven approach matches
naturally with functional programming languages (e.g. LISP and SCHEME).
Pattern driven computers: An instruction is executed when we obtain a particular data pattern as output. There are two types of pattern driven computers:
String-reduction model: each demander gets a separate copy of the expression string to evaluate; each reduction step has an operator and embedded references to demand the corresponding operands; each operator is suspended while its arguments are evaluated.
Graph-reduction model: the expression graph is reduced by evaluation of branches or subgraphs, possibly in parallel, with demanders given pointers to the results of reductions. It is based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered.
2.5 System interconnect architecture.
Various types of interconnection networks have been suggested for SIMD computers. These have been classified, on the basis of network topology, into two categories, namely:
Static Networks
Dynamic Networks
Static versus Dynamic Networks
The topological structure of an SIMD array processor is mainly characterized by the data routing network used in interconnecting the processing elements. To execute a communication, the routing function f is executed, and via the interconnection network each PEi copies the content of its Ri register into the Rf(i) register of PEf(i); here f(i) is the processor identified by the mapping function f. The data routing operation occurs in all active PEs simultaneously.
2.5.1 Network properties and routing
The goals of an interconnection network are to provide low latency, a high data transfer rate, and wide communication bandwidth. Analysis includes latency, bisection bandwidth, data-routing functions, and scalability of the parallel architecture.
Such networks are usually represented by a graph with a finite number of nodes linked by directed or undirected edges.
Number of nodes in the graph = network size.
Number of edges (links or channels) incident on a node = node degree d (also note in and out degrees when edges are directed).
Node degree reflects number of I/O ports associated with a node, and should ideally be
small and constant.
Network is symmetric if the topology is the same looking from any node; these are easier
to implement or to program.
Diameter: The maximum distance between any two processors in the network; in other words, the diameter is the maximum number of (routing) processors through which a message must pass on its way from source to destination. Thus the diameter measures the maximum delay in transmitting a message from one processor to another, as it determines the communication time; hence the smaller the diameter, the better the network topology.
Connectivity: How many paths are possible between any two processors, i.e., the multiplicity of paths between two processors. Higher connectivity is desirable as it minimizes contention.
Arc connectivity of the network: the minimum number of arcs that must be removed from the network to break it into two disconnected networks. The arc connectivities of various networks are as follows:
1 for linear arrays and binary trees
2 for rings and 2-d meshes
4 for 2-d torus
d for d-dimensional hypercubes
The larger the arc connectivity, the lesser the contention, and the better the network topology.
Channel width :
between two halves of the network with equal numbers of processors. This is important for networks with weighted arcs, where the weights correspond to the link width, i.e., how much data it can transfer. The larger the bisection width, the better the network topology is considered.
Cost: the cost of networking can be estimated on a variety of criteria; here we consider the number of communication links or wires used to design the network as the basis of cost estimation. The smaller the cost, the better.
Data Routing Functions: A data routing network is used for inter-PE data exchange. It can be static, as in the case of the hypercube routing network, or dynamic, such as a multistage network. Various types of data routing functions are shifting, rotating, permutation (one to one), broadcast (one to all), multicast (many to many), personalized broadcast (one to many), shuffle, exchange, etc.
Permutations
Given n objects, there are n ! ways in which they can be reordered (one of which is no
reordering). A permutation can be specified by giving the rule for reordering a group of
objects. Permutations can be implemented using crossbar switches, multistage networks,
shifting, and broadcast operations. The time required to perform permutations of the
connections between nodes often dominates the network performance when n is large.
Perfect Shuffle and Exchange
Stone suggested the special permutation that reorders entries according to the mapping of the k-bit binary number a1 a2 ... ak to a2 a3 ... ak a1 (that is, shifting 1 bit to the left and wrapping it around to the least significant bit position). The inverse perfect shuffle reverses the effect of the perfect shuffle.
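The perfect-shuffle mapping (a cyclic left rotation of the k-bit address) and its inverse are easy to express in code; the sketch below assumes an 8-node network (k = 3).

# Sketch: perfect shuffle on k-bit node addresses (cyclic left rotation of the bits).
def perfect_shuffle(i, k):
    msb = (i >> (k - 1)) & 1          # the bit that wraps around
    return ((i << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(i, k):
    lsb = i & 1                       # cyclic right rotation undoes the shuffle
    return (i >> 1) | (lsb << (k - 1))

k = 3
print([perfect_shuffle(i, k) for i in range(2 ** k)])
# [0, 2, 4, 6, 1, 3, 5, 7]  e.g. node 1 (001) maps to node 2 (010)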
Hypercube Routing Functions
If the vertices of an n-dimensional cube are labeled with n-bit numbers so that only one bit
differs between each pair of adjacent vertices, then n routing functions are defined by the
bits in the node (vertex) address. For example, with a 3-dimensional cube, we can easily
identify routing functions that exchange data between nodes with addresses that differ in
the least significant, most significant, or middle bit.
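A sketch of these hypercube routing functions: routing along dimension d simply complements bit d of the node address, and a message reaches its destination by fixing, one dimension at a time, the bits in which the source and destination addresses differ. The 3-cube example below is an assumption for illustration.

# Sketch: hypercube routing - neighbours differ in exactly one address bit.
def cube_route(src, dst, n):
    """Return the sequence of nodes visited when routing src -> dst in an n-cube."""
    path = [src]
    node = src
    for d in range(n):                      # resolve one dimension per hop
        if (node ^ dst) & (1 << d):
            node ^= (1 << d)                # routing function C_d: complement bit d
            path.append(node)
    return path

print(cube_route(0b000, 0b101, 3))   # [0, 1, 5] - two hops, since the addresses differ in 2 bits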
Factors Affecting Performance
of source and destination addresses. If there is a match, the message goes to the next stage via pass-through; otherwise, in case of a mismatch, it goes via cross-over using the switch.
There are two classes of dynamic networks, namely single-stage and multistage networks. The basic 2 x 2 switch box used to build them can be set to one of four states:
Straight
Exchange
Upper broadcast
Lower broadcast
A two-function switch box can assume only two possible states, namely the straight or exchange states, whereas a four-function switch box can be in any of the four possible states. A multistage network is capable of connecting any input terminal to any output terminal. Multistage networks are basically constructed from so-called shuffle-exchange switching elements, each of which is basically a 2 x 2 crossbar. Multiple layers of these elements are connected to form the network.
Figure 2.5 A two-by-two switching box and its four interconnection states
A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Generally it consists of n stages, where N = 2^n is the number of input and output lines, and each stage uses N/2 switch boxes. The interconnection pattern from one stage to the next is determined by the network topology. Each stage is connected to the next stage by at least N paths. The total wait time is proportional to the number of stages, i.e., n, and the total cost depends on the total number of switches used, which is proportional to N log2 N.
The control structure can be individual stage control, i.e., the same control signal is used to set all switch boxes in the same stage, so that n control signals are needed. The second control structure is individual box control, where a separate control signal is used to set the state of each switch box. This provides flexibility but at the same time requires (N/2) log2 N control signals, which increases the complexity of the control circuit. An intermediate approach is the use of partial stage control.
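The stage, switch and control-signal counts just described are simple functions of N; the sketch below tabulates them for a few assumed network sizes, taking N/2 two-by-two boxes per stage.

# Sketch: size parameters of an N-input multistage network built from 2x2 switch boxes, N = 2^n.
import math

for N in (8, 16, 64):
    n = int(math.log2(N))           # number of stages
    switches = (N // 2) * n         # N/2 switch boxes per stage
    stage_control = n               # one control signal per stage (individual stage control)
    box_control = switches          # one control signal per box (individual box control)
    print(f"N={N:3d}: stages={n}, switches={switches}, "
          f"stage-control signals={stage_control}, box-control signals={box_control}")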
One-sided networks: also called full switches, having input and output ports on the same side.
Two-sided multistage networks: these have an input side and an output side, and can be further divided into three classes:
o Blocking: In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. Examples of blocking networks are the data manipulator, flip, n-cube, omega and baseline networks. All multistage networks that are based on shuffle-exchange elements are blocking networks, because it is not always possible to make all the desired input-output connections at the same time; one path might block another. Figure 2.6(a) shows an omega network.
o Rearrangeable: A rearrangeable network can perform all possible connections between inputs and outputs by rearranging its existing connections, so that a connection path for a new input-output pair can always be established. An example of this network topology is the Benes network (see figure 2.6(b), showing an 8 x 8 Benes network), which supports synchronous data permutation and asynchronous interprocessor communication.
o Non-blocking: A non-blocking network is a network which can handle all possible connections without blocking. There are two possible cases: the first one is the Clos network (see figure 2.6(c)), where a one-to-one connection
neighboring nodes are only allowed to communicate data in one step, i.e., each PEi is allowed to send data to any one of PE(i+1), PE(i-1), PE(i+r) and PE(i-r), where r = square root of N (in the case of the Illiac IV, r = 8). In a periodic mesh, nodes on the edge of the mesh have wrap-around connections to nodes on the other side; this is also called a toroidal mesh.
Mesh Metrics
For a q-dimensional non-periodic lattice with k^q nodes:
Network connectivity = q
Network diameter = q(k-1)
Network narrowness = k/2
Bisection width = k^(q-1)
Expansion increment = k^(q-1)
Edges per node = 2q
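These mesh metrics are direct formulas in k and q, as the short sketch below shows for an assumed 4 x 4 two-dimensional mesh.

# Sketch: metrics of a q-dimensional non-periodic k-ary lattice (k^q nodes),
# using the formulas listed above.
def mesh_metrics(k, q):
    return {
        "nodes": k ** q,
        "connectivity": q,
        "diameter": q * (k - 1),
        "narrowness": k / 2,
        "bisection_width": k ** (q - 1),
        "expansion_increment": k ** (q - 1),
        "edges_per_node": 2 * q,
    }

print(mesh_metrics(4, 2))   # a 4 x 4 two-dimensional mesh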
Thus we observe that the output of IS_k is connected to the inputs of OS_j, where j = k-1, k+1, k-r, k+r, as shown in the figure below.
Each PEi is directly connected to its four neighbors in the mesh network. The graph shows that in one step a PE can reach four PEs, in two steps seven PEs, and in three steps eleven PEs. In general it takes I steps (recirculations) to route data from PEi to another PEj in a network of size N, where I is bounded by
I <= square root(N) - 1
Thus in the above example, for N = 16, it will require at most 3 steps to route data from one PE to another, and an Illiac IV network with 64 PEs needs a maximum of 7 steps for routing data from one PE to another.
Cube Interconnection Networks
The cube network can be implemented as either a recirculating network or as a multistage network for an SIMD machine. It can be 1-D, i.e., a single line with two PEs, one at each end of the line; a square with four PEs at the corners in the case of 2-D; a cube for 3-D; and a hypercube in 4-D. In the case of an n-dimensional hypercube, each processor connects to n neighbors. This can also be visualized as the unit (hyper)cube embedded in d-dimensional Euclidean space, with one corner at 0 and lying in the positive orthant. The processors can be thought of as lying at the corners of the cube, with their (x1, x2, ..., xd) coordinates identical to their processor numbers, and connected to their nearest neighbors on the cube. Popular examples where the cube topology is used are the iPSC, nCUBE, and SGI O2K.
Vertical lines connect vertices (PEs) whose addresses differ in the most significant bit position. Vertices at both ends of the diagonal lines differ in the middle bit position. Horizontal lines differ in the least significant bit position. The unit cube concept can be extended to an n-dimensional unit space, called an n-cube, with n bits per vertex. A cube network for an SIMD machine with N PEs corresponds to an n-cube where n = log2 N. We use a binary sequence to represent the vertex (PE) address of the cube. Two processors are neighbors if and only if their binary addresses differ in exactly one bit position.
of the children are gotten by toggling the next address bit, and so are 000, 010, 100 and
110. Note that each node also plays the role of the left child. Finally, the leaves are gotten
by toggling the third bit. Having one child identified with the parent causes no problems
as long as algorithms use just one row of the tree at a time. Here is a picture.
Figure 2.16
The diameter is m = log2 p, since all messages must traverse m stages. The bisection width is p. This network was used in the IBM RP3, BBN Butterfly, and NYU Ultracomputer. If we compare the omega network with the cube network, we find that the omega network can perform one-to-many connections while the n-cube cannot. However, as far as bijection (one-to-one) connections are concerned, the n-cube and omega networks perform more or less the same.
2.6 Summary
Fine-grain exploited at instruction or loop levels, assisted by the compiler.
Medium-grain (task or job step) requires programmer and compiler support.
Coarse-grain relies heavily on effective OS support.
Shared-variable communication used at fine- and medium grain levels.
Message passing can be used for medium- and coarse-grain communication, but fine grain really needs a better technique because of heavier communication requirements.
Control flow machines give complete control, but are less efficient than other approaches.
Data flow (eager evaluation) machines have high potential for parallelism and throughput and freedom from side effects, but have high control overhead, lose time waiting for unneeded arguments, and have difficulty in manipulating data structures. Reduction (lazy
The table below compares the three classes of dynamic connection networks (n denotes the network size and w the channel width in bits):

Network characteristic                    Bus System    Multistage Network    Crossbar Switch
Minimum latency for unit data transfer    -             O(log_k n)            Constant
Bandwidth per processor                   -             O(w) to O(nw)         O(w) to O(nw)
Wiring complexity                         O(w)          O(nw log_k n)         O(n^2 w)
Switching complexity                      O(n)          O(n log_k n)          O(n^2)
2.7 Keywords
Dependence graph : A directed graph whose nodes represent calculations and whose
edges represent dependencies among those calculations. If the calculation represented by
node k depends on the calculations represented by nodes i and j, then the dependence
graph contains the edges i-k and j-k.
data dependency : a situation existing between two statements if one statement can store
into a location that is later accessed by the other statement
granularity The size of operations done by a process between communications events. A
fine grained process may perform only a few arithmetic operations between processing
one message and the next, whereas a coarse grained process may perform millions
control-flow computers refers to an architecture with one or more program counters that
determine the order in which instructions are executed.
dataflow A model of parallel computing in which programs are represented as
dependence graphs and each operation is automatically blocked until the values on which
it depends are available. The parallel functional and parallel logic programming models
are very similar to the dataflow model.
network A physical communication medium. A network may consist of one or more
buses, a switch, or the links joining processors in a multicomputer.
Static networks: point-to-point direct connections that will not change during program
execution
Dynamic networks: switched channels dynamically configured to match user program
communication demands include buses, crossbar switches, and multistage networks
routing The act of moving a message from its source to its destination. A routing
technique is a way of handling the message as it passes through individual nodes.
Diameter D of a network is the maximum shortest path between any two nodes, measured
by the number of links traversed; this should be as small as possible (from a
communication point of view).
Channel bisection width b = minimum number of edges cut to split a network into two
parts each having the same number of nodes. Since each channel has w bit wires, the wire
bisection width B = bw. Bisection width provides good indication of maximum
communication bandwidth along the bisection of a network, and all other cross sections
should be bounded by the bisection width.
Wire (or channel) length = length (e.g. weight) of edges between nodes.
3.0 Objective
3.1 Introduction
3.2 Linear pipeline
3.3 Nonlinear pipeline
3.4 Design instruction and arithmetic pipeline
3.5 Superscalar and super pipeline
3.6 Pipelining in RISC
3.6.1 CISC approach
3.6.2 RISC approach
3.6.3 CRISC
3.7 VLIW architecture
3.8 Summary
3.9 Key words
3.10 Self assessment questions
3.11 References/Suggested readings
3.0 Objective
The main objective of this lesson is to know the basic properties of pipelining, the classification of pipeline processors and the required memory support. The main aim of this lesson is to learn how pipelining is implemented in various computer architectures like RISC and CISC, and how the issues related to the limitations of pipelining are overcome by using superscalar pipeline architectures.
3.1 Introduction
A pipeline is similar to an assembly line in an industrial plant. To achieve pipelining one must divide the input process into a sequence of sub-tasks, each of which can be executed concurrently with the other stages. The various classifications of pipeline processors, namely arithmetic pipelining, instruction pipelining and processor pipelining, are also briefly discussed. Limitations of pipelining are discussed and the shift to pipeline architecture to
A basic function must be divisible into independent stages such that each stage has minimal overlap.
Now let us see how the problem behaves with the pipelining concept. This can be illustrated with a space-time diagram, given below in figure 3.1, which shows the segment utilization as a function of time. Let us take 6 processes to be handled (represented in the figure as P1, P2, P3, P4, P5 and P6), with each process divided into 4 segments (S1, S2, S3, S4). For the sake of simplicity we take each segment to require equal time to complete its assigned job, i.e., one clock cycle. The horizontal axis displays the time in clock cycles and the vertical axis gives the segment number. Initially, process P1 is handled by segment 1. After the first clock, segment 2 handles process P1 and segment 1 handles the new process P2. Thus the first process takes 4 clock cycles, and the remaining processes are completed at the rate of one process per clock cycle. Thus for the above example the total time required to complete the whole job will be 9 clock cycles (with the pipeline organization) instead of the 24 clock cycles required for the non-pipelined configuration.
Figure 3.1: Space-time diagram for a four-segment pipeline handling six processes

Clock cycle:   1    2    3    4    5    6    7    8    9
Segment S1:    P1   P2   P3   P4   P5   P6
Segment S2:         P1   P2   P3   P4   P5   P6
Segment S3:              P1   P2   P3   P4   P5   P6
Segment S4:                   P1   P2   P3   P4   P5   P6
Efficiency: The efficiency of a linear pipeline is measured by the percentage of time the
pipeline stages are busy over the total time taken, i.e., the sum of busy time plus idle time. If n is
the number of tasks, k the number of pipeline stages and t the clock period, the efficiency is given by
eta = n / [k + n - 1]
Thus the larger the number of tasks in the pipeline, the busier the pipeline and the better the
efficiency; it is easily seen from the expression that as n tends to infinity, eta tends to 1. Equivalently,
eta = Sk / k
where Sk = nk / (k + n - 1) is the speedup of the k-stage pipeline over non-pipelined execution (for equal stage delays).
Thus the efficiency of the pipeline is the speedup divided by the number of stages, or one
can say the actual speedup over the ideal speedup. In the steady state, where n >> k, eta
approaches 1.
Throughput: The number of tasks completed by a pipeline per unit time is called the
throughput; this represents the computing power of the pipeline. We define throughput as
W = n / [k*t + (n - 1)*t] = eta / t
In the ideal case, as eta approaches 1, the throughput approaches 1/t, which equals the clock frequency. Thus the
maximum throughput is obtained when there is one output per clock pulse.
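As a quick check of these formulas, the sketch below (an illustration added here, not taken from the text) evaluates speedup, efficiency and throughput for n tasks on a k-stage linear pipeline with clock period t.
```python
# Sketch: linear pipeline performance measures for n tasks, k stages, clock period t.

def pipeline_metrics(n, k, t):
    non_pipelined_time = n * k * t        # each task needs k cycles on its own
    pipelined_time = (k + n - 1) * t      # fill the pipe, then one result per cycle
    speedup = non_pipelined_time / pipelined_time   # Sk = nk / (k + n - 1)
    efficiency = speedup / k                        # eta = Sk / k = n / (k + n - 1)
    throughput = n / pipelined_time                 # W = eta / t
    return speedup, efficiency, throughput

# Example: 100 tasks on a 6-stage pipeline with a 10 ns clock (compare Que 3.1 below).
s, e, w = pipeline_metrics(n=100, k=6, t=10e-9)
print(f"speedup = {s:.2f}, efficiency = {e:.2%}, throughput = {w/1e6:.1f} M tasks/s")
# speedup = 5.71, efficiency = 95.24%, throughput = 95.2 M tasks/s
```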
Que 3.1. A non-pipeline system takes 60 ns to process a task. The same task can be
processed in six segment pipeline with a clock cycle of 10 ns. Determine the speedup
ratio of the pipeline for 100 tasks. What is the maximum speed up that can be achieved?
Soln. Total time taken by the non-pipelined system to complete 100 tasks = 100 * 60 = 6000 ns
Total time taken by the pipelined configuration to complete 100 tasks
= (100 + 6 - 1) * 10 = 1050 ns
Thus the speedup ratio = 6000 / 1050 = 5.71
The maximum speedup that can be achieved for this process = 60 / 10 = 6
Thus, if the time taken by the non-pipelined system for one task equals the total time that task
spends in the pipeline, the maximum speedup ratio is equal to the number of segments.
Que 3.2. A non-pipeline system takes 50 ns to process a task. The same task can be
processed in a six segment pipeline with a clock cycle of 10 ns. Determine the speedup
ratio of the pipeline for 100 tasks. What is the maximum speed up that can be achieved?
Soln. Total time taken by the non-pipelined system to complete 100 tasks = 100 * 50 = 5000 ns
Total time taken by the pipelined configuration to complete 100 tasks
= (100 + 6 - 1) * 10 = 1050 ns
Thus the speedup ratio = 5000 / 1050 = 4.76, and the maximum speedup that can be achieved = 50 / 10 = 5.
control hazards, which happen when there is a change in the flow of control,
e.g., due to branch, jump, or other control-transfer conditions
fixed function. A dynamic pipeline allows feedforward and feedback connections in
addition to the streamline connections. A dynamic pipeline may initiate tasks from
different reservation tables simultaneously, allowing multiple initiations of
different functions in the same pipeline.
3.3.1 Reservation Tables and latency analysis
Reservation tables show how successive pipeline stages are utilized for a specific
evaluation function, i.e., the sequence in which the function
utilizes each stage. The rows correspond to pipeline stages and the columns to clock time
units. The total number of clock units in the table is called the evaluation time. A
reservation table represents the flow of data through the pipeline for one complete
evaluation of a given function. (For example, think of X as being a floating square root,
and Y as being a floating cosine. A simple floating multiply might occupy just S1 and S2
in sequence.) We could also denote multiple stages being used in parallel, or a stage
being drawn out for more than one cycle with these diagrams.
We determine the next start time for one or the other of the functions by lining up the
diagrams and sliding one with respect to another to see where one can fit into the open
slots. Once an X function has been scheduled, another X function can start after 1, 3 or 6
cycles. A Y function can start after 2 or 4 cycles. Once a Y function has been scheduled,
another Y function can start after 1, 3 or 5 cycles. An X function can start after 2 or 4
cycles. After two functions have been scheduled, no more can be started until both are
complete.
Job Sequencing and Collision Prevention
An initiation is the start of a single function evaluation. A collision may occur when two or more
initiations attempt to use the same stage at the same time. Thus queued tasks awaiting
initiation must be properly scheduled in order to avoid collisions and to achieve high
throughput. We can define a collision as follows:
1. A collision occurs when two tasks are initiated with a latency (initiation interval) equal
to the column distance between two X marks on some row of the reservation table.
2. The set of column distances F = {l1, l2, ..., lr} between all possible pairs of X marks on each
row of the reservation table is called the forbidden set of latencies.
3. The collision vector is a binary vector C = (Cn ... C2 C1), where Ci = 1 if i belongs to F
(the set of forbidden latencies) and Ci = 0 otherwise.
Some fundamental concepts used in it are:
Latency - the number of time units between two initiations (any positive integer 1, 2, 3, ...)
Latency sequence - a sequence of latencies between successive initiations
Latency cycle - a latency sequence that repeats itself
Control strategy - the procedure for choosing a latency sequence
Greedy strategy - a control strategy that always chooses the minimum permissible latency between the
current initiation and the next one
Example: Let us consider a reservation table with the following set of forbidden
latencies F and permitted latencies P (the complement of F).
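Since the example table itself is not reproduced here, the sketch below works on a hypothetical three-stage reservation table of my own (an illustration, not the book's example); it derives the forbidden latency set F and the collision vector C exactly as defined above.
```python
# Sketch: forbidden latencies and collision vector from a reservation table.
# The table is a made-up example. Rows = pipeline stages, columns = clock cycles;
# a 1 marks that the stage is used by function X in that cycle.

reservation_table = [
    [1, 0, 0, 0, 0, 1],   # stage S1 used at times 0 and 5
    [0, 1, 0, 1, 0, 0],   # stage S2 used at times 1 and 3
    [0, 0, 1, 0, 1, 0],   # stage S3 used at times 2 and 4
]

def forbidden_latencies(table):
    """Column distances between every pair of marks in the same row."""
    forbidden = set()
    for row in table:
        times = [t for t, used in enumerate(row) if used]
        for i in range(len(times)):
            for j in range(i + 1, len(times)):
                forbidden.add(times[j] - times[i])
    return forbidden

def collision_vector(forbidden, evaluation_time):
    """C = (Cm ... C2 C1) with Ci = 1 if latency i is forbidden."""
    m = evaluation_time - 1          # largest latency worth recording
    bits = ['1' if i in forbidden else '0' for i in range(m, 0, -1)]
    return ''.join(bits)

F = forbidden_latencies(reservation_table)
C = collision_vector(F, evaluation_time=len(reservation_table[0]))
print("forbidden latencies F =", sorted(F))   # [2, 5]
print("collision vector C =", C)              # 10010
```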
1. The collision vector shows both permitted and forbidden latencies from the same
reservation table.
2. One can use an n-bit shift register to hold the collision vector for implementing a control
strategy for successive task initiations in the pipeline. Upon initiation of the first task, the
collision vector is parallel-loaded into the shift register as the initial state. The shift
register is then shifted right one bit at a time, entering 0s from the left end. A collision-free
initiation is allowed at time instant t + k if a 0 is shifted out of the register after k
shifts from time t.
A state diagram is used to characterize the successive initiations of tasks in the pipeline
in order to find the shortest latency sequence to optimize the control strategy. A state on
the diagram is represented by the contents of the shift register after the proper number of
shifts is made, which is equal to the latency between the current and next task initiations.
3. The successive collision vectors are used to prevent future task collisions with
previously initiated tasks, while the collision vector C is used to prevent possible
collisions with the current task. If a collision vector has a 1 in the ith bit (from the
right), at time t, then the task sequence should avoid the initiation of a task at time t+i.
4. Closed loops or cycles in the state diagram indicate steady-state, sustainable latency
sequences of task initiations without collisions. The average latency of a cycle is the sum
of its latencies (its period) divided by the number of states in the cycle.
5. The throughput of a pipeline is the reciprocal of the average latency, i.e., it is inversely
proportional to the average latency. A latency sequence is called permissible if no collisions exist in the successive
initiations governed by the given latency sequence.
6. The maximum throughput is achieved by an optimal scheduling strategy that achieves
the minimum average latency (MAL) without collisions.
Simple cycles are those latency cycles in which each state appears only once per
iteration of the cycle. A simple cycle is a greedy cycle if each latency contained in the
cycle is the minimal latency (outgoing arc) from a state in the cycle. A good task-initiation sequence should include the greedy cycle.
Procedure to determine the greedy cycles
1. From each state of the state diagram, choose the arc with the smallest latency label
until a closed simple cycle is formed.
2. The average latency of any greedy cycle is no greater than the number of latencies in
the forbidden set, which equals the number of 1s in the initial collision vector.
3. The average latency of any greedy cycle is always lower-bounded by the
MAL obtained from the collision vector.
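To make the state-diagram procedure concrete, the sketch below (again an illustration, using the hypothetical collision vector 10010 from the earlier sketch rather than the book's example) enumerates the states reachable by right-shifting and ORing, collects the simple cycles, and reports the one with the minimum average latency.
```python
# Sketch: state diagram and MAL search for the collision vector C = 10010 (F = {2, 5}).
# A state is an integer whose bit i-1 corresponds to Ci (1 = latency i forbidden).

C = 0b10010
m = C.bit_length()   # number of bits kept in the shift register

def next_state(state, latency):
    """Shift right by the chosen latency and OR in the initial collision vector."""
    return (state >> latency) | C

def permissible(state):
    """Latencies 1..m whose bit is 0; any latency > m is also allowed (m+1 stands for all of them)."""
    return [i for i in range(1, m + 1) if not (state >> (i - 1)) & 1] + [m + 1]

def find_simple_cycles():
    """Depth-first search over simple paths, yielding (average latency, latency cycle)."""
    cycles = []
    def dfs(state, path_states, path_lats):
        for lat in permissible(state):
            nxt = next_state(state, lat)
            if nxt in path_states:                     # closed a simple cycle
                start = path_states.index(nxt)
                lats = path_lats[start:] + [lat]
                cycles.append((sum(lats) / len(lats), lats))
            elif len(path_states) < 2 * m:             # bound the search depth
                dfs(nxt, path_states + [nxt], path_lats + [lat])
    dfs(C, [C], [])
    return cycles

mal, best_cycle = min(find_simple_cycles())
print("MAL =", mal, "achieved by latency cycle", best_cycle)
```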
Two methods for improving dynamic pipeline throughput have been proposed by
Davidson and Patel.
High throughput can be achieved by using a modified reservation table yielding a
more desirable latency pattern, such that each stage is maximally utilized. A computation
can be delayed by inserting a noncompute (delay) stage.
Reconfigurable pipelines with different function types are more desirable, but this requires
extensive resource sharing among different functions. To achieve it, one needs a more
complicated structure of pipeline segments and their interconnection controls, such as bypass
techniques to avoid unwanted stages.
A dynamic pipeline would allow several configurations to be simultaneously present, for example
an arithmetic unit performing both addition and multiplication at the same time. But to
achieve this, tremendous control overhead and increased interconnection complexity
would be expected.
3.4 Design of Instruction pipeline
As we know, in the general case each instruction executed by a computer undergoes the
following steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result.
For the sake of simplicity we take the calculation of the effective address and the operand fetch from
memory as a single segment, the operand fetch unit. The figure below shows how the
instruction cycle in the CPU can be processed with a five-segment instruction pipeline.
While the first instruction is being decoded (ID) in segment 2, a new instruction is fetched (IF)
by segment 1. Similarly, in the third clock cycle, while the effective operand of the first instruction is
fetched (OF), the second instruction is decoded and the third instruction is fetched. In the same
manner, in the fourth and subsequent clock cycles all following instructions can be
fetched and placed in the instruction FIFO. Thus up to five different instructions can be
processed at the same time. The figure shows how the instruction pipeline works, with
time on the horizontal axis divided into steps of equal duration. The major
difficulty with an instruction pipeline is that different segments may take different amounts of time to
process the incoming information. For example, an operand in register mode requires
much less time than an operand that has to be fetched from memory, especially with
indirect addressing modes. The design of an instruction pipeline will be most effective if
the instruction cycle is divided into segments of equal duration. Moreover, because of resource
conflicts, data dependency, branching, interrupts and other reasons, execution can
branch out of the normal sequence.
Que 3.3 Consider a program of 15,000 instructions executed by a linear pipeline
processor with a clock rate of 25 MHz. The instruction pipeline has five stages and one
instruction is issued per clock cycle. Calculate the speedup ratio, efficiency and throughput
of this pipelined processor.
Soln: Time taken to execute without pipelining = 15000 * 5 * (1/25) = 3000 microseconds
Time taken with pipelining = (15000 + 5 - 1) * (1/25) = 600.16 microseconds
Thus the speedup ratio = 3000 / 600.16, approximately 5.0; the efficiency = speedup / 5, approximately 0.9997 (99.97%); and the
throughput = 15000 / 600.16, approximately 25 million instructions per second, i.e., close to the clock rate.
Store-Store Overwriting
The following two memory updates of the same word can be combined into one, since
the second store overwrites the first. Two memory accesses,
Mi <- (R1) (store)
Mi <- (R2) (store)
are replaced by only one memory access,
Mi <- (R2) (store)
The above shows how internal forwarding can simplify a sequence of
arithmetic and memory-access operations (in the figure, thick arrows denote memory accesses and
dotted arrows denote register transfers).
Forwarding and Data Hazards
Sometimes it is possible to avoid data hazards by noting that a value that results from one
instruction is not needed until a late stage in a following instruction, and sending the data
directly from the output of the first functional unit back to the input of the second one
(which is sometimes the same unit). In the general case, this would require the output of
every functional unit to be connected through switching logic to the input of every
functional unit.
Data hazards can take three forms:
Read after write (RAW): Attempting to read a value that hasn't been written yet. This is
the most common type, and can be overcome by forwarding.
Write after write (WAW): Writing a value before a preceding write has completed. This
can only happen in complex pipes that allow instructions to proceed out of order, or that
have multiple write-back stages (mostly CISC), or when we have multiple pipes that can
write (superscalar).
Write after read (WAR): Writing a value before a preceding read has completed. These
also require a complex pipeline that can sometimes write in an early stage, and read in a
later stage. It is also possible when multiple pipelines (superscalar) or out-of-order issue
are employed.
The fourth situation, read after read (RAR) does not produce a hazard.
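As a small illustration (added here, not from the text), the sketch below classifies the hazard between an earlier and a later instruction given only the registers each one reads and writes; it assumes the later instruction could otherwise overtake the earlier one, as in an out-of-order or superscalar pipe.
```python
# Sketch: classify data hazards between an earlier and a later instruction,
# each described by the set of registers it reads and the set it writes.

def hazards(earlier_writes, earlier_reads, later_writes, later_reads):
    found = []
    if earlier_writes & later_reads:
        found.append("RAW")   # later reads a value the earlier one has not written yet
    if earlier_writes & later_writes:
        found.append("WAW")   # later write could finish before the earlier write
    if earlier_reads & later_writes:
        found.append("WAR")   # later write could happen before the earlier read
    return found or ["none (RAR or independent)"]

# Example: ADD R1, R2, R3 followed by SUB R4, R1, R5 -> RAW on R1
print(hazards(earlier_writes={"R1"}, earlier_reads={"R2", "R3"},
              later_writes={"R4"}, later_reads={"R1", "R5"}))
```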
Forwarding does not solve every RAW hazard situation. For example, if a functional unit
is merely slow and fails to produce a result that can be forwarded in time, then the
pipeline must stall. A simple example is the case of a load, which has a high latency. This
is the sort of situation where compiler scheduling of instructions can help, by rearranging
independent instructions to fill the delay slots. The processor can also rearrange the
instructions at run time, if it has access to a window of prefetched instructions (called a
prefetch buffer). It must perform much the same analysis as the compiler to determine
which instructions are dependent on each other, but because the window is usually small,
the analysis is more limited in scope. The small size of the window is due to the cost of
providing a wide enough datapath to predecode multiple instructions at once, and the
complexity of the dependence testing logic.
Out of order execution introduces another level of complexity in the control of the
pipeline, because it is desirable to preserve the abstraction of in-order issue, even in the
presence of exceptions that could flush the pipe at any stage. But we'll defer this to later.
Branch Penalty Hiding
The control hazards due to branches can cause a large part of the pipeline to be flushed,
greatly reducing its performance. One way of hiding the branch penalty is to fill the pipe
behind the branch with instructions that would be executed whether or not the branch is
taken. If we can find the right number of instructions that precede the branch and are
independent of the test, then the compiler can move them immediately following the
branch and tag them as branch delay filling instructions. The processor can then execute
the branch, and when it determines the appropriate target, the instruction is fetched into
the pipeline with no penalty.
The filling of branch delays can be done dynamically in hardware by reordering
instructions out of the prefetch buffer. But this leads to other problems. Another way to
hide branch penalties is to avoid certain kinds of branches. For example, if we have
IF A < 0
THEN A = -A
we would normally implement this with a nearby branch. However, we could instead use
an instruction that performs the arithmetic conditionally (skips the write back if the
condition fails). The advantage of this scheme is that, although one pipeline cycle is
wasted, we do not have to flush the rest of the pipe (also, for a dynamic branch prediction
scheme, we need not put an extra branch into the prediction unit). These are called
predicated instructions, and the concept can be extended to other sorts of operations, such
as conditional loading of a value from memory.
Branch Prediction
Branches are the bane of any pipeline, causing a potentially large decrease in
performance as we saw earlier. There are several ways to reduce this loss by predicting
the action of the branch ahead of time.
Simple static prediction assumes that all branches are either always taken or always not taken. The designer
decides which way to predict from instruction trace statistics. Once the choice is made,
the compiler can help by properly ordering local jumps. A slightly more complex static
branch prediction heuristic is that backward branches are usually taken and forward
branches are not (backwards taken, forwards not or BTFN). This assumes that most
backward branches are loop returns and that most forward branches are the less likely
cases of a conditional branch.
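A minimal sketch of the BTFN heuristic follows (an illustration, not from the text): given the address of a conditional branch and its target, predict taken only when the target lies at a lower address, i.e. a backward, typically loop-closing, branch.
```python
# Sketch: backwards-taken, forwards-not (BTFN) static branch prediction.

def btfn_predict(branch_pc, target_pc):
    """Predict 'taken' for backward branches (likely loop-closing), else 'not taken'."""
    return target_pc < branch_pc

# A loop-closing branch at 0x1040 jumping back to 0x1000 is predicted taken;
# a forward branch skipping ahead to 0x1080 is predicted not taken.
print(btfn_predict(0x1040, 0x1000))   # True  (backward -> taken)
print(btfn_predict(0x1040, 0x1080))   # False (forward -> not taken)
```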
Compiler static prediction involves the use of special branches that indicate the most
likely choice (taken or not, or more typically taken or other, since the most predictable
branches are those at the ends of loops that are mostly taken). If the prediction fails in this
case, then the usual cancellation of the instructions in the delay slots occurs and a branch
penalty results.
Dynamic instruction scheduling
As discussed above, static instruction scheduling can be optimized by the compiler; dynamic
scheduling is achieved either by using a scoreboard or with Tomasulo's register-tagging
algorithm, both discussed under superscalar processors.
3.5 Arithmetic pipeline
Pipelined arithmetic units are used in very high-speed computers, especially those involved in scientific
computations; they are a basic principle behind vector processors and array processors. They are
used to implement floating-point operations, multiplication of fixed-point numbers
and similar computations encountered in scientific problems. Such computation
problems can easily be decomposed into suboperations. Arithmetic pipelining is well
suited to systems involved in repeated calculations, such as calculations
with matrices and vectors. Let us consider a simple vector calculation like
A[i] + B[i] * C[i] for i = 1, 2, 3, ..., 8
The above operation can be subdivided into a three-segment pipeline such that each
segment has some registers and combinational circuits. Segment 1 loads the contents of B[i]
and C[i] into registers R1 and R2; segment 2 loads A[i] into R3 and multiplies the contents of
R1 and R2, storing the product in R4; finally, segment 3 adds the contents of R3 and R4 and stores the sum in R5, as
shown in the table below.
Clock pulse   Segment 1        Segment 2              Segment 3
number        R1      R2       R3      R4             R5
1             B1      C1
2             B2      C2       A1      B1*C1
3             B3      C3       A2      B2*C2          A1 + B1*C1
4             B4      C4       A3      B3*C3          A2 + B2*C2
5             B5      C5       A4      B4*C4          A3 + B3*C3
6             B6      C6       A5      B5*C5          A4 + B4*C4
7             B7      C7       A6      B6*C6          A5 + B5*C5
8             B8      C8       A7      B7*C7          A6 + B6*C6
9                              A8      B8*C8          A7 + B7*C7
10                                                    A8 + B8*C8
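The table above can be reproduced with a short simulation. The sketch below (an illustration added here, with made-up values for A, B and C) steps the three segments once per clock cycle and prints the register contents, mirroring the A[i] + B[i]*C[i] example.
```python
# Sketch: three-segment arithmetic pipeline computing A[i] + B[i]*C[i].
# Segment 1: R1 <- B[i], R2 <- C[i]; Segment 2: R3 <- A[i], R4 <- R1*R2;
# Segment 3: R5 <- R3 + R4.

A = [10, 20, 30, 40, 50, 60, 70, 80]   # made-up operand values
B = [1, 2, 3, 4, 5, 6, 7, 8]
C = [9, 8, 7, 6, 5, 4, 3, 2]

R1 = R2 = R3 = R4 = R5 = None
results = []

for clock in range(len(A) + 2):                      # 8 operands + 2 cycles to drain
    # Each segment consumes what the previous segment produced in the prior clock.
    R5 = R3 + R4 if R3 is not None else None
    R3, R4 = (A[clock - 1], R1 * R2) if 1 <= clock <= len(A) else (None, None)
    R1, R2 = (B[clock], C[clock]) if clock < len(B) else (None, None)
    if R5 is not None:
        results.append(R5)
    print(f"clock {clock + 1}: R1={R1} R2={R2} R3={R3} R4={R4} R5={R5}")

assert results == [a + b * c for a, b, c in zip(A, B, C)]
```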
mantissa determines the left shift of the mantissa, and the same number should be subtracted from the
exponent. Various registers R are used to hold intermediate results.
In order to implement a pipelined adder we need extra circuitry, but its cost is compensated
when it is used for a large number of floating-point operands. Operations at each
stage can be performed on different pairs of inputs; e.g., one stage can be comparing the
exponents of one pair of operands at the same time as another stage is adding the mantissas
of a different pair of operands.
3.6 Superpipeline and Superscalar technique
Instruction level parallelism is obtained primarily in two ways in uniprocessors: through
pipelining and through keeping multiple functional units busy executing multiple
instructions at the same time. When a pipeline is extended in length beyond the normal
five or six stages (e.g., I-Fetch, Decode/Dispatch, Execute, D-fetch, Writeback), then it
may be called Superpipelined. If a processor executes more than one instruction at a time,
it may be called Superscalar. A superscalar architecture is one in which several
instructions can be initiated simultaneously and executed independently. These two
techniques can be combined into a Superscalar pipeline architecture.
3.6.1 Superpipeline
In order to make processors even faster, various methods of optimizing pipelines have
been devised. Superpipelining refers to dividing the pipeline into more steps. The more
pipe stages there are, the faster the pipeline is, because each stage is then shorter. Thus
superpipelining increases the number of instructions that are supported by the pipeline
at a given moment. For example, if we divide each stage into two, the clock cycle period t
will be reduced to half, t/2; hence, at maximum capacity, the pipeline produces a
result every t/2 seconds. For a given architecture and the corresponding instruction set there is
an optimal number of pipeline stages; increasing the number of stages beyond this limit
reduces the overall performance. Ideally, a pipeline with five stages should be five times
faster than a non-pipelined processor (or rather, a pipeline with one stage). The
instructions are executed at the speed at which each stage is completed, and each stage
takes one fifth of the amount of time that the non-pipelined instruction takes. Thus, a
processor with an 8-step pipeline (the MIPS R4000) will be even faster than its 5-step
counterpart. The MIPS R4000 chops its pipeline into more pieces by dividing some steps
into two. Instruction fetching, for example, is now done in two stages rather than one.
The stages are as shown:
Instruction Fetch (First Half)
Instruction Fetch (Second Half)
Register Fetch
Instruction Execute
Data Cache Access (First Half)
Data Cache Access (Second Half)
Tag Check
Write Back
Given a pipeline stage time T, it may be possible to execute at a higher rate by starting
operations at intervals of T/n. This can be accomplished in two ways:
Further divide each of the pipeline stages into n substages.
Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into
segments with uniform latency. It may also require more complex inter-stage interlocking
and stall-restart logic.
The second approach could be viewed in a sense as staggered superscalar operation, and
has associated with it all of the same requirements except that instructions and data can
be fetched with a slight offset in time. In addition, inter-pipeline interlocking is more
difficult to manage because of the sub-clock period differences in timing between the
pipelines.
Inevitably, superpipelining is limited by the speed of logic, and the frequency of
unpredictable branches. Stage time cannot productively grow shorter than the interstage
latch time, and so this is a limit for the number of stages.
The MIPS R4000 is sometimes called a superpipelined machine, although its 8 stages
really only split the I-fetch and D-fetch stages of the pipe and add a Tag Check stage.
Nonetheless, the extra stages enable it to operate with higher throughput. The
UltraSPARC's 9-stage pipe definitely qualifies it as a superpipelined machine, and in fact
it is a Super-Super design because of its superscalar issue. The Pentium 4 splits the
pipeline into 20 stages to enable increased clock rate. The benefit of such extensive
pipelining is really only gained for very regular applications such as graphics. On more
irregular applications, there is little performance advantage.
3.6.2 Superscalar
A solution to further improve speed is the superscalar architecture. Superscalar pipelining
involves multiple pipelines in parallel. Internal components of the processor are
replicated so it can launch multiple instructions in some or all of its pipeline stages. The
RISC System/6000 has a forked pipeline with different paths for floating-point and
integer instructions. If there is a mixture of both types in a program, the processor can
keep both forks running simultaneously. Both types of instructions share two initial
stages (Instruction Fetch and Instruction Dispatch) before they fork. Often, however,
superscalar pipelining refers to multiple copies of all pipeline stages (In terms of laundry,
this would mean four washers, four dryers, and four people who fold clothes). Many of
today's machines attempt to find two to six instructions that they can execute in every
pipeline stage. If some of the instructions are dependent, however, only the first
instruction or instructions are issued.
Dynamic pipelines have the capability to schedule around stalls. A dynamic pipeline is
divided into three units: the instruction fetch and decode unit, five to ten execute or
functional units, and a commit unit. Each execute unit has reservation stations, which act
as buffers and hold the operands and operations.
While the functional units have the freedom to execute out of order, the instruction
fetch/decode and commit units must operate in-order to maintain simple pipeline
behavior. When the instruction is executed and the result is calculated, the commit unit
decides when it is safe to store the result. If a stall occurs, the processor can schedule
other instructions to be executed until the stall is resolved. This, coupled with the
efficiency of multiple units executing instructions simultaneously, makes a dynamic
pipeline an attractive alternative.
Superscalar processing has its origins in the Cray-designed CDC supercomputers, in
which multiple functional units are kept busy by multiple instructions. The CDC
machines could pack as many as 4 instructions in a word at once, and these were fetched
together and dispatched via a pipeline. Given the technology of the time, this
configuration was fast enough to keep the functional units busy without outpacing the
instruction memory.
In some cases superscalar machines still employ a single fetch-decode-dispatch pipe that
drives all of the units. For example, the UltraSPARC splits execution after the third stage
of a unified pipeline. However, it is becoming more common to have multiple fetch-decode-dispatch pipes feeding the functional units.
The choice of approach depends on tradeoffs of the average execute time vs. the speed
with which instructions can be issued. For example, if execution averages several cycles,
and the number of functional units is small, then a single pipe may be able to keep the
units utilized. When the number of functional units grows large and/or their execution
time approaches the issue time, then multiple issue pipes may be necessary.
Having multiple issue pipes requires
inter-pipeline interlocking
multiport D-cache and/or register file, and/or functionally split register file
Reordering may be either static (compiler) or dynamic (using hardware lookahead). It can
be difficult to combine the two approaches because the compiler may not be able to
predict the actions of the hardware reordering mechanism.
and the execution rules. The scoreboard can be thought of as preceding dispatch, but it
also controls execution after the issue. In a scoreboarded system, the results can be
forwarded directly to their destination register (as long as there are no write after read
hazards, in which case their execution is stalled), rather than having to proceed to a final
write-back stage.
In the CDC scoreboard, each register has a matching Result Register Designator that
indicates which functional unit will write a result into it. The fact that only one functional
unit can be designated for writing to a register at a time ensures that WAW dependences
cannot occur. Each functional unit also has a corresponding set of Entry-Operand
Register Designators that indicate what register will hold each operand, whether the value
is valid (or pending) and if it is pending, what functional unit will produce it (to facilitate
forwarding). None of the operands is released to a functional unit until they are all valid,
precluding RAW dependences. In addition, the scoreboard stalls any functional unit
whose result would write a register that is still listed as an Entry-Operand to a functional
unit that is waiting for an operand or is busy, thus avoiding WAR violations. An
instruction is only allowed to issue if its specified functional unit is free and its result
register is not reserved by another functional unit that has not yet completed.
Four Stages of Scoreboard Control
1. Issuedecode instructions & check for structural hazards (ID1) If a functional
unit for the instruction is free and no other active instruction has the same destination
register (WAW), the scoreboard issues the instruction to the functional unit and updates
its internal data structure. If a structural or WAW hazard exists, then the instruction issue
stalls, and no further instructions will issue until these hazards are cleared.
2. Read operandswait until no data hazards, then read operands (ID2) A source
operand is available if no earlier issued active instruction is going to write it, or if the
register containing the operand is being written by a currently active functional unit.
When the source operands are available, the scoreboard tells the functional unit to
proceed to read the operands from the registers and begin execution. The scoreboard
resolves RAW hazards dynamically in this step, and instructions may be sent into
execution out of order.
The scoreboard determines when an instruction can read its operands, when it can execute, and when it can write back.
Hazard detection and resolution are centralized.
Reservation Stations The reservation station approach releases instructions directly to a
pool of buffers associated with their intended functional units (if more than one unit of a
particular type is present, then the units may share a single station). The reservation
stations are a distributed resource, rather than being centralized, and can be thought of as
following dispatch. A reservation is a record consisting of an instruction and its
requirements to execute -- its operands as specified by their sources and destination and
bits indicating when valid values are available for the sources. The instruction is released
to the functional unit when its requirements are satisfied, but it is important to note that
satisfaction doesn't require an operand to actually be in a register -- it can be forwarded to
the reservation station for immediate release or to be buffered (see below) for later
release. Thus, the reservation station's influence on execution can be thought of as more
implicit and data dependent than the explicit control exercised by the scoreboard.
Tomasulo Algorithm
The hardware dependence-resolution technique used for the IBM 360/91, about 3 years
after the CDC 6600.
Three Stages of Tomasulo Algorithm
1. Issueget instruction from FP Op Queue
If reservation station free, then issue instruction & send operands (renames registers).
2. Executionoperate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result
3. Write resultfinish execution (WB)
Write on Common Data Bus to all awaiting units; mark reservation station available.
Here the storage of operands resulting from instructions that completed out of order is
done through renaming of the registers. There are two mechanisms commonly used for
renaming. One is to assign physical registers from a free pool to the logical registers as
they are identified in an instruction stream. A lookup table is then used to map the logical
register references to their physical assignments. Usually the pool is larger than the
logical register set to allow for temporary buffering of results that are computed but not
yet ready to write back. Thus, the processor must keep track of a larger set of register
names than the instruction set architecture specifies. When the pool is empty, instruction
issue stalls.
The other mechanism is to keep the traditional association of logical and physical
registers, but then provide additional buffers either associated with the reservation
stations or kept in a central location. In either case, each of these "reorder buffers" is
associated with a given instruction, and its contents (once computed) can be used in
forwarding operations as long as the instruction has not completed. When an instruction
reaches the point that it may complete in a manner that preserves sequential semantics,
then its reservation station is freed and its result appears in the logical register that was
originally specified. This is done either by renaming the temporary register to be one of
the logical registers, or by transferring the contents of the reorder buffer to the
appropriate physical register.
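The free-pool renaming scheme described above can be sketched in a few lines. The following is an illustration only (the class name, register counts and instruction format are assumptions, not any particular processor's implementation): logical destination registers are mapped to physical registers drawn from a free list, and source operands are looked up through the current map.
```python
# Sketch: register renaming with a free pool of physical registers.
# Instructions name logical registers R0..R7; freeing a physical register at
# instruction retirement is omitted to keep the sketch short.

class Renamer:
    def __init__(self, num_logical=8, num_physical=16):
        self.free = list(range(num_logical, num_physical))      # free physical regs
        self.map = {f"R{i}": i for i in range(num_logical)}     # logical -> physical

    def rename(self, dest, src1, src2):
        p1, p2 = self.map[src1], self.map[src2]   # sources read the current mapping
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        pd = self.free.pop(0)                     # allocate a new physical destination
        self.map[dest] = pd                       # later readers of dest see pd
        return pd, p1, p2

r = Renamer()
# ADD R1, R2, R3 ; SUB R1, R1, R4 -- the second write gets a new physical register,
# so WAW/WAR hazards on R1 disappear while the RAW chain is preserved.
print(r.rename("R1", "R2", "R3"))   # (8, 2, 3)
print(r.rename("R1", "R1", "R4"))   # (9, 8, 4)  -- reads the renamed R1
```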
Out of Order Issue
To enable out-of-order dispatch of instructions to the pipelines, we must provide at least
two reservation stations per pipe that are available for issue at once. An alternative would
be to rearrange instructions in the prefetch buffer, but without knowing the status of the
pipes, it would be difficult to make such a reordering effective. By providing multiple
reservation stations, however, we can continue issuing instructions to pipes, even though
an instruction may be stalled while awaiting resources. Then, whatever instruction is
ready first can enter the pipe and execute. At the far end of the pipeline, the out-of-order
instruction must wait to be retired in the proper order. This necessitates a mechanism for
keeping track of the proper order of instructions (note that dependences alone cannot
guarantee that instructions will be properly reordered when they complete).
3.7 RISC Pipelines
Efficient use of the instruction pipeline is one of the characteristic features of RISC
architecture. A RISC processor pipeline operates in much the same way, although the
stages in the pipeline are different. As discussed earlier, the length of the pipeline is
dependent on the length of the longest step. Because RISC instructions are simpler than
those used in pre-RISC processors (now called CISC, or Complex Instruction Set
Computer), they are more conducive to pipelining. While CISC instructions varied in
length, RISC instructions are all the same length and can be fetched in a single operation.
Ideally, each of the stages in a RISC processor pipeline should take 1 clock cycle so that
the processor finishes an instruction each clock cycle and averages one cycle per
instruction (CPI). Hence RISC can use pipeline segments requiring just one
clock cycle, while CISC may use many segments in its pipeline, with the longest segment
requiring two or more clock cycles.
Most RISC data-manipulation operations are register-to-register, so an
instruction cycle has the following two phases:
1. I : Instruction fetch
2. E : Execute . Performs an ALU operation with register input and output.
The data transfer instructions in RISC are limited to Load and Store. These instructions
use register indirect addressing and require a three-stage pipeline:
I : Instruction fetch
E: Calculate memory address
D: Memory. Register to memory or memory to register operation.
To prevent conflicts between a memory access to fetch an instruction and one to load or store an
operand, most RISC machines use two separate buses with two memories: one for storing
the instructions and the other for storing data.
Another feature of RISC over CISC, as far as pipelining is concerned, is compiler support.
Instead of designing hardware to handle the data dependencies and branch penalties,
RISC relies on the efficiency of the compiler to detect and minimize the delays caused by
these problems.
A RISC processor pipeline operates in much the same way, although the stages in the
pipeline are different. While different processors have different numbers of steps, let us
consider a three-segment instruction pipeline:
I: Instruction fetch
A: ALU operation
E: Execute instruction
The I segment fetches the instruction from memory and decodes it. The ALU is used for
three different functions: data manipulation, effective-address calculation for
LOAD and STORE operations, or calculation of the branch address for a program-control
instruction, depending on the type of instruction. The E segment directs the output of the
ALU to one of three destinations, i.e., to a destination register, as an effective address to data
memory for loading or storing, or as the branch address to the program counter, depending on the
decoded instruction.
Multiplying Two Numbers in Memory
Let us consider an example of multiplying two numbers stored in memory. Here the main memory is divided into
locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4. The execution unit
is responsible for carrying out all computations. However, the execution unit can only
operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).
Let's say we want to find the product of two numbers - one stored in location 2:3 and
another stored in location 5:2 - and then store the product back in location 2:3.
The CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly
as possible. This is achieved by building processor hardware that is capable of
understanding and executing a series of operations. For this particular task, a CISC
processor would come prepared with a specific instruction (we'll call it "MULT"). When
executed, this instruction loads the two values into separate registers, multiplies the
operands in the execution unit, and then stores the product in the appropriate register.
Thus, the entire task of multiplying two numbers can be completed with one instruction:
MULT 2:3, 5:2
The RISC Approach
RISC processors use only simple instructions that can be executed within one clock cycle. Thus, the
"MULT" command described above could be divided into three separate
commands: "LOAD," which moves data from the memory bank to a register, "PROD,"
which finds the product of two operands located within the registers, and "STORE,"
which moves data from a register to the memory banks. In order to perform the exact
series of steps described in the CISC approach, a programmer would need to code four
lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
At first, this may seem like a much less efficient way of completing the operation.
Because there are more lines of code, more RAM is needed to store the assembly level
instructions. The compiler must also perform more work to convert a high-level language
statement into code of this form.
CRISC: CISC and RISC Convergence
State of the art processor technology has changed significantly since RISC chips were
first introduced in the early '80s. Because a number of advancements (including the ones
described on this page) are used by both RISC and CISC processors, the lines between
the two architectures have begun to blur. In fact, the two architectures almost seem to
have adopted the strategies of the other. Because processor speeds have increased, CISC
chips are now able to execute more than one instruction within a single clock. This also
allows CISC chips to make use of pipelining. With other technological improvements, it
is now possible to fit many more transistors on a single chip. This gives RISC processors
enough space to incorporate more complicated, CISC-like commands. RISC chips also
make use of more complicated hardware, making use of extra function units for
superscalar execution. All of these factors have led some groups to argue that we are now
in a "post-RISC" era, in which the two styles have become so similar that distinguishing
between them is no longer relevant. This era is often called complex-reduced instruction
set computing (CRISC).
3.8 VLIW Machines
Very Long Instruction Word (VLIW) machines typically have many more functional units than
superscalars, and thus need longer instructions, usually hundreds of bits long (256 to 1024 bits), to provide
control for them. These machines mostly use
microprogrammed control units with relatively slow clock rates because of the need to
use ROM to hold the microcode. Each instruction word essentially carries multiple short
instructions. Each of the short instructions are effectively issued at the same time.
(This is related to the long words frequently used in microcode.) Compilers for VLIW
architectures should optimally try to predict branch outcomes to properly group
instructions.
Pipelining in VLIW Processors
Decoding of instructions is easier in VLIW than in superscalars, because each region of
an instruction word is usually limited as to the type of instruction it can contain. Code
density in VLIW is less than in superscalars, because if a region of a VLIW word isn't
needed in a particular instruction, it must still exist (to be filled with a no-op).
Superscalars can be compatible with scalar processors; this is difficult with VLIW
parallel and non-parallel architectures. Random parallelism among scalar operations is
exploited in VLIW, instead of regular parallelism in a vector or SIMD machine.
The efficiency of the machine is entirely dictated by the success, or goodness, of the
compiler in planning the operations to be placed in the same instruction words.
Different implementations of the same VLIW architecture may not be binary-compatible
with each other, because their instruction latencies differ.
5.7 Summary
1. The job-sequencing problem is equivalent to finding a permissible latency cycle with
the MAL in the state diagram.
2. The maximum number of X marks in any single row of the reservation table is a lower
bound on the MAL.
Pipelining allows several instructions to be executed at the same time, but they have to be
in different pipeline stages at a given moment. Superscalar architectures include all
features of pipelining but, in addition, there can be several instructions executing
simultaneously in the same pipeline stage. They have the ability to initiate multiple
instructions during the same clock cycle. There are two typical approaches today, in order
to improve performance:
1. Superpipelining
2. Superscalar
VLIW reduces the effort required to detect parallelism using hardware or software
techniques.
The main advantage of VLIW architecture is its simplicity in hardware structure and
instruction set. Unfortunately, VLIW does require careful analysis of code in order to
compact the most appropriate short instructions into a VLIW word.
3.9 Keywords
pipelining Overlapping the execution of two or more operations. Pipelining is used
within processors by prefetching instructions on the assumption that no branches are
going to preempt their execution; in vector processors, in which application of a single
operation to the elements of a vector or vectors may be pipelined to decrease the time
needed to complete the aggregate operation; and in multiprocessors and multicomputers,
in which a process may send a request for values before it reaches the computation that
requires them.
scoreboard A hardware device that maintains the state of machine resources to enable
instructions to execute without conflict at the earliest opportunity.
instruction pipelining strategy of allowing more than one instruction to be in some stage
of execution at the same time.
3.10 Self assessment questions
1. Define the following terms with regard to clocking and timing control.
2. Describe the speedup factors and the optimal number of pipeline stages for a linear
pipeline unit.
4.
5. Explain the pipelined execution of the following instructions:
a) X = Y + Z b) A = B X C
6. What are the possible hazards that can occur between read and write operations in
an instruction pipeline?
Lesson No. : 04
4.0 Objective
4.1 Introduction
4.2 Cache addressing models
4.2.1 Physical addressing mode
4.2.2 Virtual addressing mode
4.3 Cache mapping
4.3.1 Direct mapping
4.3.2 Associative mapping
4.3.3 Set associative mapping
4.3.4 Sector mapping
4.3.5 Cache performance
4.4 Replacement policies
4.5 Cache Coherence and Synchronization
4.5.1 Cache coherence problem
4.5.2 Snoopy bus protocol
4.5.3 Write back vs write through
4.5.4 Directory based protocol
4.6 Summary
4.7 Key words
4.8 Self assessment questions
4.9 References/Suggested readings
4.0 Objective
In this lesson we will discuss the bus that is used for interconnection between
different processors. We will discuss the use of cache memory in a multiprocessor
environment and the various addressing schemes used for cache memory. The
replacement policy and the performance of the cache are also examined. We will also discuss how
the shared memory concept is used in multiprocessors. Various issues regarding event
ordering, especially in the case of memory events that deal with shared memory, create
synchronization problems; we will also discuss various models designed to overcome these
issues.
4.1 Introduction
In the memory hierarchy, cache memory is the fastest memory; it lies between the registers
and RAM. It holds recently used data and/or instructions and has a size varying from a
few KB to several MB.
For the comparison of the address generated by the CPU, the memory controller uses some
algorithm which determines whether the value currently being addressed in memory is
available in the cache. The transformation of data from main memory to cache memory is
referred to as a mapping process. Let us derive an address translation scheme using the cache as
a linear array of entries, each entry having the structure shown in Figure 4.3.
Each cache storage entry is divided into three fields:
Data - the block of data from memory that is stored in a specific line in the cache
Tag - a small field of length K bits, used for comparison, to check the correct address of the
data
Valid Bit - a one-bit field that indicates the status of the data written into the cache.
The N-bit address produced by the processor to access cache data is divided into three
fields:
Tag - a K-bit field that corresponds to the K-bit tag field in each cache entry
Index - an M-bit field in the middle of the address that points to one cache entry
Byte Offset - an L-bit field that selects the particular data within a line when a valid cache entry is found.
It follows that the length of the virtual address is given by N = K + M + L bits.
Cache Address Translation. As shown in Figure 4.3, we assume that the cache address
has length 32 bits. Here, bits 12-31 are occupied by the Tag field, bits 2-11 contain the
Index field, and bits 0,1 contain the Offset information. The index points to the line in
cache that supposedly contains the data requested by the processor. After the cache line is
retrieved, the Tag field in the cache line is compared with the Tag field in the cache
address. If the tags do not match, then a cache miss is detected and the comparator
outputs a zero value. Otherwise, the comparator outputs a one, which is and-ed with the
valid bit in the cache row pointed to by the Index field of the cache address. If the valid
bit is a one, then the Hit signal output from the and gate is a one, and the data in the
cached block is sent to the processor. Otherwise a cache miss is registered.
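The field widths in this example (L = 2 offset bits, M = 10 index bits, K = 20 tag bits for a 32-bit address) can be checked with a short sketch, added here as an illustration only.
```python
# Sketch: splitting an N-bit cache address into Tag (K bits), Index (M bits)
# and Byte Offset (L bits), using the example widths K=20, M=10, L=2.

K, M, L = 20, 10, 2          # N = K + M + L = 32

def split_address(addr):
    offset = addr & ((1 << L) - 1)
    index = (addr >> L) & ((1 << M) - 1)
    tag = addr >> (L + M)
    return tag, index, offset

addr = 0x1234ABCD            # an arbitrary example address
tag, index, offset = split_address(addr)
print(f"tag=0x{tag:05X} index=0x{index:03X} offset={offset}")
# A hit requires cache[index].valid and cache[index].tag == tag, as described above.
```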
no synonym problem (which arises when several different virtual addresses span the same
physical address); a much better hit ratio between processes
increase in hit time, because the virtual address must be translated before accessing the
cache
(Figure: physical address cache organizations. The CPU issues a virtual address (VA) to the MMU, which supplies the physical address (PA) used to access the cache and main memory for instructions (I) or data (D); a split variant uses a separate I-cache together with first-level and second-level D-caches, all accessed with physical addresses. VA = virtual address, PA = physical address, I = instructions, D = data stream.)
Virtual Address caches: when a cache is indexed or tagged with virtual address it is
called virtual address cache. In this model both cache and MMU translation or validation
are done in parallel. The physical address generated by the MMU can be saved in tags for
later write back but is not used during the cache lookup operations.
Advantage of virtually-addressed caches
(Figure 4.6(a): virtual address for a unified cache. The CPU presents the VA to the cache (I or D) and to the MMU in parallel; the MMU supplies the PA used for main memory. Figure 4.6(b): Virtual address for split cache. Separate I-cache and D-Cache (8K Bytes) are indexed with the VA from the IU and FU, with 32-bit addresses and 64/128-bit data paths, and the MMU supplies the PA to main memory. VA = virtual address, PA = physical address, I = instructions, D = data stream.)
Aliasing: The major problem with this cache organization in multiprocessors is that multiple
virtual addresses can map to a single physical address, i.e., different virtually (logically)
addressed data items have the same index/tag in the cache. Most processors guarantee
that all updates to that single physical address will happen in program order. To deliver
on that guarantee, the processor must ensure that only one copy of a physical address
resides in the cache at any given time.
4.3 Cache mapping
Caches can be organized according to four different strategies:
Direct
Fully associative
Set associative
Sectored
4.3.1 Direct-Mapped Caches
The easiest way of organizing a cache memory employs direct mapping that is based on a
simple algorithm to map data block i from the main memory into data block j in the
cache. Each memory block corresponds to exactly one block position in the cache; thus, to find a
memory block i, there is one and only one place in
the cache where i can be stored.
Suppose we have 2^n words in main memory and 2^k words in cache memory. In cache
memory each word consists of the data word and its associated tag. The n-bit memory
address is divided into fields: the low-order k bits are referred to as the index field and are
used to address a word in the cache; the remaining n-k high-order bits are called the tag.
The index field is further divided into the slot field, which is used to find a particular
slot in the cache, and the offset field, which is used to identify a particular memory word in the
slot. When a block is stored in the cache, its tag field is stored in the tag field of the cache
slot.
When the CPU generates an address, the index field is used to access the cache. The
tag field of the CPU address is compared with the tag in the word read from the cache. If the two
tags match, there is a hit; otherwise there is a miss and the required word is read from main
memory. Whenever a cache miss occurs, the cache line will be replaced by a new line
of information from main memory at an address with the same index but with a different
tag.
Let us understand how direct mapping is implemented with the following simple
example (Figure 4.7). The memory is composed of 32 words and is accessed by a 5-bit
address. Let the address have a 2-bit tag (set) field, a 2-bit slot (line) field and a 1-bit word
field. The cache memory holds 2^2 = 4 lines, each having two words. When the processor
generates an address, the appropriate line (slot) in the cache is accessed. For example, if
the processor generates the 5-bit address 11110, line 4 in set 4 is accessed. The memory
space is divided into sets and the sets into lines. Figure 4.7 reveals that there are four
possible lines that can occupy cache line 4: line 4 of each of the four sets. In
this example the processor accessed line 4 in set 4. Now, how does the system resolve
this contention?
Figure 4.7 shows how a direct mapped cache resolves the contention between lines. Each
line in the cache memory has a tag or label that identifies which set this particular line
belongs to. When the processor accesses line 4, the tag belonging to line 4 in the cache is
sent to a comparator. At the same time the set field from the processor is also sent to the
comparator. If they are the same, the line in the cache is the desired line and a hit occurs.
If they are not the same, a miss occurs and the cache must be updated. Figure 4.17
provides a skeleton structure of a direct mapped cache memory system.
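The 5-bit example above can be modelled directly. The sketch below is an illustration (not the book's figure): it keeps four cache lines of two words each and resolves contention exactly as described, by comparing the stored tag with the set field of the address. The memory contents are made-up values.
```python
# Sketch: direct-mapped cache for a 32-word memory and a 5-bit address laid out
# as | tag (2 bits, the set) | slot (2 bits, the line) | word (1 bit) |.

memory = list(range(100, 132))              # 32 words of dummy data
cache = [{"valid": False, "tag": None, "data": [None, None]} for _ in range(4)]

def access(addr):
    word = addr & 0b1
    slot = (addr >> 1) & 0b11
    tag = (addr >> 3) & 0b11
    line = cache[slot]
    if line["valid"] and line["tag"] == tag:
        return "hit", line["data"][word]
    # Miss: fetch the two-word block containing this address and update the line.
    base = addr & ~0b1
    line.update(valid=True, tag=tag, data=[memory[base], memory[base + 1]])
    return "miss", line["data"][word]

print(access(0b11110))   # line 4 of set 4 in the text's numbering -> miss, block loaded
print(access(0b11111))   # same block, other word -> hit
print(access(0b00110))   # same slot, different tag -> miss, the line is replaced
```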
Suppose a get-data routine and a compare routine use two blocks that have the
same index but different tags and are repeatedly accessed. Consequently, the performance
of a direct-mapped cache can be very poor under such circumstances. However,
statistical measurements on real programs indicate that the very poor worst-case behavior
of direct-mapped caches has no significant impact on their average behavior.
4.3.3 Associative Mapping:
One way of organizing a cache memory that overcomes the limitation of the direct-mapped
cache, so that there is no restriction on what data it can contain, is the associative
cache. An associative memory is the fastest and most flexible
way of cache organization. It stores both the address and the value (data) from main
memory in the cache. An associative memory has an n-bit input. An address from the
processor is divided into three fields: the tag, the line, and the word. The mapping is done
by storing the tag information in an n-bit argument register and comparing it with the address tag
in each location simultaneously. If the input tag matches a stored tag, the data associated
with that location is output; otherwise the associative memory produces a miss output.
Unfortunately, large associative memories are not yet cost-effective. Once the associative
cache is full, a new line can be brought in only by overwriting an existing line, which
requires a suitable line-replacement policy. Associative cache memories are efficient
because they place no restriction on the data they hold, since any location of the cache can
store any word from main memory.
(Figure: associative cache organization. The CPU address is held in an argument register and compared simultaneously with the address field of every entry; on a match, the associated data word is read out.)
4.3.4 Sector mapping
The idea is to partition both the cache and memory into fixed size sectors. Thus in a
sectored cache, main memory is partitioned into sectors, each containing several blocks.
The cache is partitioned into sector frames, each containing several lines. (The number of
lines/sector frame = the number of blocks/sector.) As shown in the figure below, the sector size is
16 blocks. Each sector can be mapped to any of the sector frames, with full associativity at
the sector level.
Each sector can be placed in any of the available sector frames. The memory requests are
destined for blocks, not for sectors; a request is filtered by comparing the sector tag in
the memory address with all sector tags using a fully associative search.
When block b of a new sector c is brought in,
it is brought into line b within some sector frame f, and
the rest of the lines in sector frame f are marked invalid.
Thus, if there are S sector frames, there are S choices of where to place a block.
4.3.5 Cache Performance Issues
As far as the performance of a cache is concerned, a trade-off exists among the cache size,
set number, block size and memory speed. Important aspects of cache design with
regard to performance are:
a. the cycle count : This refers to the number of basic machine cycles needed for
cache access, update and coherence control. This count is affected by underlying
static or dynamic RAM technology, the cache organization and the cache hit
ratios. The write-through or write-back policy also affects the cycle count. The
cycle count is directly related to the hit ratio, which decreases almost linearly with
increasing values of the above cache parameters.
b. Hit ratio: The processor generates the address of a word to be read and sends it to the
cache controller; if the word is in the cache, the controller generates a HIT signal and also
delivers the word to the processor. If the data is not found in the cache, a MISS
signal is generated and the data is delivered to the processor from main memory and
simultaneously loaded into the cache. The hit ratio is the number of hits divided by the
total number of CPU references to memory (hits plus misses). The hit ratio increases with
cache size and approaches 1 when the cache size becomes very large.
c. Effect of Block Size: With a fixed cache size, cache performance is sensitive to
the block size. This block size is determined mainly by the temporal locality in
typical program.
d. Effect of the number of sets in a set-associative cache.
4.4 Cache replacement algorithm
When a new block is brought into the cache, one of the existing blocks must be replaced. The
obvious question that arises is: which block should be replaced? With direct mapping the solution is
easy, as we have no choice. But in other circumstances we do. The three most commonly
used algorithms are Least Recently Used, First In First Out and Random.
Random -- The simplest algorithm is random replacement, whereby the location in the cache to
which a block is to be written is chosen at random from the range of cache
indices. The random replacement strategy is usually implemented using a random-number
generator. In a 2-way set-associative cache, this can be accomplished with a single
modulo-2 random variable obtained, for example, from an internal clock.
First in, first out (FIFO) -- here the block that has been in the cache longest (the first one
stored) is the one replaced. For a 2-way set-associative cache, this replacement
strategy can be implemented by setting a pointer to the previously loaded word each time
a new word is stored in the cache; this pointer need only be a single bit.
Least recently used (LRU) -- here the value which was actually used least recently is
replaced. In general, it is more likely that the most recently used value will be the one
required in the near future. This approach, while not always optimal, is intuitively
attractive from the perspective of temporal locality. That is, a given program will likely
not access a page or block that has not been accessed for some time. The LRU
replacement algorithm requires that each cache or page table entry have a timestamp.
This is a combination of date and time that uniquely identifies the entry as having been
written at a particular time. Given a timestamp t with each of N entries, LRU merely
finds the minimum of the cached timestamps, as
tmin = min{ti : i = 1..N} .
The cache or page table entry having t = tmin is then overwritten with the new entry.
For a 2-way set associative cache, LRU is readily implemented with a single ``USED'' bit per cache location: whenever one word of the set is accessed, the USED bit of the other word is set and the bit of the accessed word is reset. The value to be replaced is then the one whose USED bit is set. For an n-way set associative cache, an approximation of this strategy can be implemented by storing a modulo-n counter with each data word.
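As a concrete illustration of the USED-bit idea, here is a minimal sketch (assumed, not taken from any particular machine) of replacement in one set of a 2-way set associative cache; the tags and the access sequence are illustrative.

# 2-way set LRU via a single USED bit, as described above.
# used_bit marks the way that was NOT touched most recently,
# i.e. the way to be replaced on the next miss.

class TwoWaySet:
    def __init__(self):
        self.tags = [None, None]   # tags cached in way 0 and way 1
        self.used_bit = 0          # index of the least-recently-used way

    def access(self, tag):
        if tag in self.tags:                       # hit
            way = self.tags.index(tag)
            self.used_bit = 1 - way                # the other way becomes the LRU victim
            return "hit"
        victim = self.used_bit                     # miss: replace the LRU way
        self.tags[victim] = tag
        self.used_bit = 1 - victim
        return "miss"

s = TwoWaySet()
for t in [7, 9, 7, 3, 9]:                          # 3 evicts 9 (the LRU way), so 9 misses again
    print(t, s.access(t))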
4.5 Cache Coherence and Synchronization
4.5.1Cache coherence problem
An important problem that must be addressed in many parallel systems - any system that
allows multiple processors to access (potentially) multiple copies of data - is cache
coherence. The existence of multiple cached copies of data creates the possibility of
inconsistency between a cached copy and the shared memory or between cached copies
themselves.
Such inconsistency can arise, for example, from DMA I/O operations that bypass the cache; this problem is present even in a uniprocessor and can be removed by OS-initiated cache flushes.
In practice, these issues are managed by a memory bus, which by its very nature ensures write serialization and also allows us to broadcast invalidation signals (we essentially just put the memory address to be invalidated on the bus). We can add an extra valid bit to cache tags to mark them invalid. Typically, we would use a write-back cache, because it has much lower memory bandwidth requirements. Each processor must keep track of which cache blocks are dirty - that is, which it has written to - again by adding a bit to the cache tag. If a processor sees a memory access for a word in a cache block it has marked as dirty, it intervenes and provides the (updated) value. There are numerous other issues to address when considering cache coherence.
One approach to maintaining coherence is to recognize that not every location needs to be
shared (and in fact most don't), and simply reserve some space for non-cacheable data
such as semaphores, called a coherency domain.
Using a fixed area of memory, however, is very restrictive. Restrictions can be reduced
by allowing the MMU to tag segments or pages as non-cacheable. However, that requires
the OS, compiler, and programmer to be involved in specifying data that is to be
coherently shared. For example, it would be necessary to distinguish between the sharing
of semaphores and simple data so that the data can be cached once a processor owns its
semaphore, but the semaphore itself should never be cached.
In order to remove this data inconsistency there are a number of approaches based on hardware and software techniques; a few are given below:
Make shared data non-cacheable: this is the simplest software solution but produces low performance if a lot of data is shared
Software flush at strategic times: e.g., after critical sections; this is a relatively simple technique but has low performance if synchronization is not frequent
Hardware cache coherence: this is achieved by making the memory and the caches coherent (consistent) with each other, in other words the memory and the other processors see writes without intervention of the software
Absolute coherence: all copies of each block have the same data at all times
In general a cache coherence protocol consists of the set of possible states in the local caches, the state in shared memory, and the state transitions caused by the messages transported through the interconnection network to keep memory coherent. There are basically two kinds of protocols, depending on how writes are handled.
4.5.2 Snooping Cache Protocol (for bus-based machines)
With a bus interconnection, cache coherence is usually maintained by adopting a "snoopy
protocol", where each cache controller "snoops" on the transactions of the other caches
and guarantees the validity of the cached data. In a (single-) multi-stage network,
however, the unavailability of a system "bus" where transactions are broadcast makes
snoopy protocols not useful. Directory based schemes are used in this case.
In a snooping protocol each processor performs some form of snooping - that is, it keeps track of the other processors' memory writes. All caches/memories see and react to all bus events. The protocol relies on global visibility of requests (ordered broadcast). This allows each processor to make state transitions for its cache blocks.
Write invalidate protocol
The state of a cached copy of a block changes in response to read, write and replacement operations on the cache. The most common variant of snooping is a write invalidate protocol. For example, when processor A writes to a shared location X, it broadcasts the fact and all other processors with a copy of X in their cache mark it invalid. When another processor (B, say) tries to access X again there will be a cache miss and either
(i) in the case of a write-through cache the value of X will have been updated (actually, it might not have been, because not enough time may have elapsed for the memory write to complete - but that is another issue); or
(ii) in the case of a write-back cache processor A must spot the read request and substitute the correct value for X.
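To make the write-invalidate behaviour concrete, here is a minimal sketch of a bus-based protocol with per-block Invalid/Shared/Modified states. It is an illustrative simulation only; the state names, the Bus class and the method names are assumptions, not the protocol of any specific machine.

# Toy write-invalidate snooping simulation: every cache sees every bus event
# and invalidates its copy when another processor broadcasts a write.

INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self, name, bus):
        self.name, self.bus, self.state = name, bus, {}   # state per block address
        bus.caches.append(self)

    def read(self, addr):
        if self.state.get(addr, INVALID) == INVALID:       # miss: fetch and share
            self.bus.broadcast("read", addr, self)
            self.state[addr] = SHARED

    def write(self, addr):
        self.bus.broadcast("write", addr, self)             # invalidate the other copies
        self.state[addr] = MODIFIED

    def snoop(self, op, addr, origin):
        if origin is self:
            return
        if op == "write":
            self.state[addr] = INVALID                       # another cache wrote: drop our copy
        elif op == "read" and self.state.get(addr) == MODIFIED:
            self.state[addr] = SHARED                        # supply the data, fall back to shared

class Bus:
    def __init__(self):
        self.caches = []
    def broadcast(self, op, addr, origin):
        for c in self.caches:
            c.snoop(op, addr, origin)

bus = Bus()
A, B = Cache("A", bus), Cache("B", bus)
A.read(0x10); B.read(0x10)            # both hold the block shared
A.write(0x10)                         # A becomes modified, B is invalidated
print(A.state[0x10], B.state[0x10])   # -> M I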
(Figure: cache-line states of an ownership-based snooping protocol, including an Invalid state, omitted here.)
The last two states indicate ownership. The trouble with this scheme is that if a non-owner frequently accesses an owned shared value, accesses can slow down to main-memory speed or slower and generate excessive bus traffic, because all accesses must go to the owning cache, and the owning cache would have to perform a broadcast on its next write to signal that the line is again invalid.
One solution is to grant ownership to the first processor to write to the location and not
allow reading directly from the cache. This eliminates the extra read cycles, but then the
cache must write-through all cycles in order to update the copies.
We can change the scheme so that when a write is broadcast, any other processor that has a snoop hit signals this back to the owner. The owner then knows it must write through again. However, if no other processor has a copy (no snoop signal is returned), it can proceed to write privately. The processor's cache must then snoop for read accesses from other processors and respond to these with the current data, marking the line as snooped. The line can return to private status once a write-through results in a no-snoop response.
One interesting side effect of ownership protocols is that they can sometimes result in a
speedup greater than the number of processors because the data resides in faster memory.
Thus, other processors gain some speed advantage on misses because instead of fetching
from the slower main memory, they get data from another processor's fast cache.
However, it takes a fairly unusual pattern of access for this to actually be observed in real
system performance.
If multiple processors read and update the same data item, they generate
coherence functions across processors.
Rather than flush the cache completely, hardware can be provided to "snoop" on the bus,
watching for writes to main memory locations that are cached.
Another approach is to have the DMA traffic go through the cache, as if the processor itself were writing the data to memory. This keeps the cache locations valid, but any processor cache accesses are stalled during that time, and it clearly does not work well in a multiprocessor, as it would require copies being written to all caches and a protocol for write-back to memory that avoids inconsistency.
4.5.4 Directory-based Protocols
When a multistage network is used to build a large multiprocessor system, the snoopy
cache protocols must be modified. Since broadcasting is very expensive in a multistage
network, consistency commands are sent only to caches that keep a copy of the block.
This leads to Directory Based protocols. A directory is maintained that keeps track of the
sharing set of each memory block. Thus each bank of main memory can keep a directory
of all caches that have copied a particular line (block). When a processor writes to a
location in the block, individual messages are sent to any other caches that have copies.
Thus directory-based protocols selectively send invalidation/update requests only to those caches having copies (the sharing set), limiting the network traffic to essential updates. Proposed schemes differ in the latency with which memory operations
are performed and the implementation cost of maintaining the directory. The memory
must keep a bit-vector for each line that has one bit per processor, plus a bit to indicate
ownership (in which case there is only one bit set in the processor vector).
These bitmap entries are sometimes referred to as the presence bits. Only processors that
hold a particular block (or are reading it) participate in the state transitions due to
coherence operations. Note that there may be other state transitions triggered by
processor read, write, or flush (retiring a line from cache) but these transitions can be
handled locally with the operation reflected in the presence bits and state in the directory.
If different processors operate on distinct data blocks, these blocks become dirty in the
respective caches and all operations after the first one can be performed locally.
If multiple processors read (but do not update) a single data block, the data block gets
replicated in the caches in the shared state and subsequent reads can happen without
triggering any coherence overheads.
Various directory-based protocols differ mainly in how the directory maintains
information and what information is stored. Generally speaking the directory may be
central or distributed. Contention and long search times are two drawbacks in using a
central directory scheme. In a distributed-directory scheme, the information about
memory blocks is distributed. Each processor in the system can easily "find out" where to
go for "directory information" for a particular memory block. Directory-based protocols
fall under one of three categories:
Full-map directories, limited directories, and chained directories.
The full-map protocol is extremely expensive in terms of memory, as it stores enough state with each block in global memory that every cache in the system could simultaneously hold a copy of any block of data. It thus defeats the purpose of leaving a bus-based architecture.
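The following is a minimal sketch of the full-map bookkeeping described above: one presence-bit vector and an owner flag per memory block, with invalidations sent only to the sharing set on a write. The class and message names are illustrative assumptions, not a description of any particular machine.

# Full-map directory entry: one presence bit per processor plus a dirty/owner flag.

class DirectoryEntry:
    def __init__(self, num_procs):
        self.presence = [False] * num_procs   # which caches hold a copy
        self.dirty = False                    # True => exactly one presence bit set (the owner)

    def read(self, proc):
        messages = []
        if self.dirty:                        # fetch the current value from the owner
            owner = self.presence.index(True)
            messages.append(("fetch", owner))
            self.dirty = False
        self.presence[proc] = True            # the reader joins the sharing set
        return messages

    def write(self, proc):
        # Invalidate every other copy in the sharing set, then mark the writer as owner.
        messages = [("invalidate", p) for p, bit in enumerate(self.presence)
                    if bit and p != proc]
        self.presence = [False] * len(self.presence)
        self.presence[proc] = True
        self.dirty = True
        return messages

entry = DirectoryEntry(num_procs=4)
entry.read(0); entry.read(2)
print(entry.write(1))   # -> [('invalidate', 0), ('invalidate', 2)]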
A limited-map protocol stores a small number of processor ID tags with each line in main
memory. The assumption here is that only a few processors share data at one time. If
there is a need for more processors to share the data than there are slots provided in the
directory, then broadcast is used instead.
Chained directories have the main memory store a pointer to a linked list that is itself
stored in the caches. Thus, an access that invalidates other copies goes to memory and
then traces a chain of pointers from cache to cache, invalidating along the chain. The
actual write operation stalls until the chain has been traversed. Obviously this is a slow
process.
Duplicate directories can be expensive to implement, and there is a problem with keeping
them consistent when processor and bus accesses are asynchronous. For a write-through
cache, consistency is not a problem because the cache has to go out to the bus anyway,
precluding any other master from colliding with its access.
But in a write-back cache, care must be taken to stall processor cache writes that change
the directory while other masters have access to the main memory.
On the other hand, if the system includes a secondary cache that is inclusive of the
primary cache, a copy of the directory already exists. Thus, the snooping logic can use
the secondary cache directory to compare with the main memory access, without stalling
the processor in the main cache. If a match is found, then the comparison must be passed
up to the primary cache, but the number of such stalls is greatly reduced due to the
filtering action of the secondary cache comparison.
A variation on this approach that is used with write-back caches is called dirty inclusion,
and simply requires that when a primary cache line first becomes dirty, the secondary line
is similarly marked. This saves writing through the data, and writing status bits on every
write cycle, but still enables the secondary cache to be used by the snooping logic to
monitor the main memory accesses. This is especially important for a read-miss, which
must be passed to the primary cache to be satisfied.
The previous schemes have all relied heavily on broadcast operations, which are easy to
implement on a bus. However, buses are limited in their capacity and thus other
structures are required to support sharing for more than a few processors. These
structures may support broadcast, but even so, broadcast-based protocols are limited.
The problem is that broadcast is an inherently limited means of communication. It
implies a resource that all processors have access to, which means that either they
contend to transmit, or they saturate on reception, or they have a factor of N hardware for
dealing with the N potential broadcasts.
Snoopy cache protocols are not appropriate for large-scale systems because of the
bandwidth consumed by the broadcast operations
In a multistage network, cache coherence is supported by using cache directories to store
information on where copies of cache reside.
A cache coherence protocol that does not use broadcast must store the locations of all
cached copies of each block of shared data. This list of cached locations whether
centralized or distributed is called a cache directory. A directory entry for each block of
data contains a number of pointers to specify the locations of copies of the block.
Distributed directory schemes
In scalable architectures, memory is physically distributed across processors. The
corresponding presence bits of the blocks are also distributed. Each processor is
responsible for maintaining the coherence of its own memory blocks. Since each memory
block has an owner its directory location is implicitly known to all processors. When a
processor attempts to read a block for the first time, it requests the owner for the block.
The owner suitably directs this request based on presence and state information locally
available. When a processor writes into a memory block, it propagates an invalidate to
the owner, which in turn forwards the invalidate to all processors that have a cached copy
of the block. Note that the communication overhead associated with state update
messages is not reduced. Distributed directories permit O(p) simultaneous coherence
operations, provided the underlying network can sustain the associated state update
messages. From this point of view, distributed directories are inherently more scalable
than snoopy systems or centralized directory systems. The latency and bandwidth of the
network become fundamental performance bottlenecks for such systems.
4.6 Keywords
cache: A high-speed memory, local to a single processor, whose data transfers are carried out automatically in hardware. Items are brought into a cache when they are referenced, and any changes to values in a cache are automatically written back to main memory when they are no longer needed, when the cache becomes full, or when some other process attempts to access them. Also used as a verb: to bring something into a cache.
cache consistency: The problem of ensuring that the values associated with a particular variable in the caches of several processors are never visibly different.
associative memory: Memory that can be accessed by content rather than by address;
content addressable is often used synonymously. An associative memory permits its user
to specify part of a pattern or key and retrieve the values associated with that pattern.
120
direct mapping: A cache that has a set associativity of one, so that each item has a unique place in the cache at which it can be stored.
4.7 Summary
In this lesson we had learned how cache memory in multiprocessor is organized and how
its address are generated both for physical and virtual address. Various techniques of
cache mapping are discussed.
Mapping
Advantage
disadvantage
technique
Direct
Mapping
Fully
associative
121
122
Lesson No. : 05
5.0 Objective
5.1 Introduction
5.2 Multithreading
5.2.1 Multiple context processor
5.2.2 Multidimensional processor
5.3 Data flow architecture
5.3.1 Data flow graph
5.3.2 Static dataflow
5.3.3 Dynamic dataflow
5.4 Self assignment questions
5.5 Reference.
5.0 Objective
In this lesson we will study advanced concepts for improving the performance of multiprocessors. The techniques studied are multithreading, multiple-context processors and data flow architectures.
5.1 Introduction
Computers are basically designed to execute instructions, which are stored as programs in memory. When these instructions are executed sequentially, execution is slow, since the next instruction can be executed only after the output of the previous instruction has been obtained. As discussed earlier, the concept of parallel processing was introduced to improve speed and throughput. To execute more than one instruction simultaneously, one has to identify independent instructions that can be passed to separate processors. Parallelism in a multiprocessor can in principle be exploited at three levels:
Instruction Level Parallelism
The potential for overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. Instruction-level parallelism is obtained primarily in two ways in uniprocessors: through pipelining and through keeping multiple functional units busy executing multiple instructions at the same time.
Data Level Parallelism
The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among the iterations of a loop. This type of parallelism is often called loop-level parallelism; vector processors are a typical example of hardware that exploits it. It is difficult to keep extracting instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single sequential thread of control, but many workloads can make use of thread-level parallelism (TLP).
Thread Level Parallelism
Thread-level parallelism (TLP) is the act of running multiple flows of execution of a single process simultaneously. TLP is most often found in applications that need to run independent, unrelated tasks (such as computation, memory accesses, and I/O) simultaneously. These types of applications are often found on machines that have a high workload, such as web servers. TLP is a popular area of current research due to the rising popularity of multi-core and multiprocessor systems, which allow different threads to truly execute in parallel. TLP can be exploited either through multiprogramming (running independent sequential jobs) or through multithreaded applications (running one job faster using parallel threads). Thus multithreading uses TLP to improve the utilization of a single processor.
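As a small, hedged illustration of the kind of independent tasks mentioned above (one I/O-bound, one compute-bound) running concurrently within a single process, the sketch below uses Python threads; the task bodies and timings are purely illustrative.

# Two independent tasks overlapped with threads: while one waits on "I/O",
# the other keeps the CPU busy, which is exactly the utilization gain TLP targets.
import threading
import time

def io_task():
    time.sleep(0.5)                               # stands in for a blocking I/O or memory wait
    print("I/O task finished")

def compute_task():
    total = sum(i * i for i in range(100_000))    # stands in for useful computation
    print("compute task finished:", total)

start = time.time()
threads = [threading.Thread(target=io_task), threading.Thread(target=compute_task)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elapsed:", round(time.time() - start, 2), "s")  # close to the longest task, not the sum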
From a designer's perspective there are various possible ways in which one can design a system, depending on how instructions are executed. Four possible ways are:
Control flow computers: the next instruction is executed when the previous instruction, as stored in the program, has been executed.
Data flow computers: an instruction is executed when the data (operands) required for executing that instruction are available.
Demand driven computers: an instruction is executed when its result is demanded (needed as input) by another instruction.
Pattern driven computers: an instruction is executed when particular data patterns appear as output.
5.2 Multi-Threading
b. The number of threads: the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set and the required context status word.
c. The context switching overhead: this refers to the cycles lost in performing a context switch in a processor. It depends on the switching mechanism and the amount of processor state devoted to maintaining active threads.
d. The interval between switches: this refers to the cycles between switches triggered by remote references; it is the inverse of the rate of such requests.
There are a number of ways that multithreading can be implemented, including fine-grained multithreading, coarse-grained multithreading, and simultaneous multithreading.
Fine-Grained Multithreading
Fine-grained multithreading involves instructions from threads issuing in a round-robin
fashion--one instruction from process A, one instruction from process B, another from A,
and so on (note that there can be more than two threads). This type of multithreading
applies to situations where multiple threads share a single pipeline or are executing on a
single-issue CPU.
Coarse-Grained Multithreading
The next type of multithreading is coarse-grained multithreading. Coarse-grained
multithreading allows one thread to run until it executes an instruction that causes a
latency (cache miss), and then the CPU swaps another thread in while the memory access
completes. If a thread doesn't require a memory access, it will continue to run until its time limit is up. As with fine-grained multithreading, this applies to multiple threads sharing a single pipeline or executing on a single-issue CPU.
Simultaneous Multithreading (SMT)
Simultaneous multithreading is a refinement on coarse-grained multithreading. The
scheduling algorithm allows the active thread to issue as many instructions as it can (up
to the issue-width) to keep the functional units busy. If a thread does not have sufficient
ILP to do this, other threads may issue instructions to fill the empty slots. SMT only
applies to superscalar architectures which are characterized by multiple-issue CPUs. With
the advent of multithreaded architectures, dependence management has become easier
due to availability of more parallelism. But, the demand for hardware resources has
increased. In order for the processor to cater efficiently to multiple threads, it would be
useful to consider resource conflicts between instructions from different threads. This
need is greater for simultaneous multithreaded processors, since they issue instructions
from multiple threads in the same cycle. Similar to the operating systems interest in
maintaining a good job mix, the processor is now interested in maintaining a good mix of
instructions. One way to achieve this is for the processor to exploit the choice available
during instruction fetch. To aid this, a good thread selection mechanism should be in
place. Dependences - data and control - limit the exploitation of instruction level
parallelism (ILP) in processors. This is especially so in superscalar processors, where
multiple instructions are issued in a single cycle. Hence, a considerable amount of
research has been carried out in the area of dependence management to improve
processor performance.
Data dependences are of two types: true and false. False data dependences: anti and
output dependences are removed using register renaming, a process of allocating different
hardware registers to an architectural register. True data dependences are managed with
the help of queues where instructions wait for their operands to become available. The
same structure is used to wait for FUs. Control dependences are managed with the help of
branch prediction.
Multithreaded processors add another dimension to dependence management by bringing
in instruction fetch from multiple threads. The advantage in this approach is that the
latencies of true dependences can be covered more effectively. Thus thread-level
parallelism is used to make up for lack of instruction-level parallelism.
Simultaneous multithreading (SMT) combines the best features of multithreading and
superscalar architectures. Like a superscalar, SMT can exploit instruction-level
parallelism in one thread by issuing multiple instructions each cycle. Like a
multithreaded processor, it can hide long latency operations by executing instructions
from different threads. The difference is that it can do both at the same time, that is, in the
same cycle.
The main issue in SMT is effective thread scheduling and selection. While scheduling of
threads from the job mix may be handled by the operating system, selection of threads to
be fetched is handled at the microarchitecture level. One technique for job scheduling
called Symbiotic Job scheduling collects information about different schedules and
selects a suitable schedule for different threads.
A number of techniques have been used for thread selection. The Icount feedback
technique gives the highest priority to the threads that have the least number of
instructions in the decode, renaming, and queue pipeline stages. Another technique
minimizes branch mispredictions by giving priority to threads with the fewest
outstanding branches. Yet another technique minimizes load delays by giving priority to
threads with the fewest outstanding on-chip cache misses. Of these the Icount technique
has been found to give better results.
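The Icount heuristic described above is simple to express in code. Below is a minimal sketch, with assumed data structures (a dictionary of per-thread counts of instructions sitting in the decode, rename and issue-queue stages), of choosing which threads to fetch from next.

# Icount fetch policy sketch: prefer the threads with the fewest instructions
# currently occupying the front end (decode + rename + queue), since they are
# making the fastest progress and are least likely to clog the pipeline.

def icount_select(front_end_counts, num_to_fetch=2):
    """front_end_counts maps thread id -> instructions in decode/rename/queue."""
    ranked = sorted(front_end_counts, key=front_end_counts.get)
    return ranked[:num_to_fetch]          # fetch from the least-represented threads

counts = {0: 12, 1: 3, 2: 7, 3: 9}
print(icount_select(counts))              # -> [1, 2]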
Costs incurred in implementing multithreading
5.3 Data flow architecture
Data flow is one of the techniques that meet the above requirements and is therefore considered useful for designing future supercomputers. Before we study these data flow computers in detail, let us review the drawbacks of processors based on a pipelined architecture. The major hazards are:
o Structural hazards
o Data hazards
o Control hazards
Among these, data hazards are due to true dependences and care is required to avoid them, while control hazards can be handled if the next instructions entering the pipeline are drawn from different contexts. Hence, if data dependency can be removed, the performance of the system will definitely improve. It can be removed by one of the following techniques:
o By renaming the data; this places an extra burden on the compiler, since the renaming operation is performed by the compiler.
Data flow computers are based on the principle of data-driven computation, which is very different from the von Neumann architecture: the latter is based on control flow, whereas the data flow architecture is driven by the availability of data, and such machines are therefore also called data-driven computers. There are various types of data flow models - static, dynamic, VLSI and hybrid - and we will discuss them in this module. The concept of data flow computing was originally developed in the 1960s by Karp and Miller, who used a graphical means of representing computations. Later, in the early 1970s, Dennis and then others developed computer architectures based on data flow principles.
Concept of dataflow computing finds its application in specialized architectures for
Digital Signal Processing (DSP) and specialized architectures for demanding
computation in the fields of graphics and virtual reality.
Data driven computing and languages
In order to understand how data flow differs from control flow, let us first look at the working of the von Neumann architecture, which is based on the control flow computing model. Here each program is a sequence of instructions stored in memory. Each addressable instruction encodes an operation together with the memory locations that hold its operands; in the case of an interrupt or a function call it stores the address of the location to which control has to be transferred, and in the case of a conditional transfer it specifies the status bits to be checked and the location to which control has to be transferred.
The next instruction to be executed depends on what happened during the execution of the current instruction. Accordingly, the address of the next instruction to be executed is transferred to the PC, and on the next clock cycle that instruction is executed, its operands being fetched from the memory locations named in the instruction. The instruction is executed even if some of its operands are not yet valid (e.g. uninitialized). The fetching of data and instructions from memory becomes the bottleneck in
exploiting the available parallelism to its maximum possible extent. The key features of the control flow model are:
Flow of control is implicitly sequential but special control operators can be used
for explicit parallelism
The data-driven model, by contrast, permits the execution of any instruction only on the availability of its operands. Data flow programs are represented by directed graphs which show the flow of data between instructions. Each instruction consists of an operator, one or two operands, and one or more destinations to which the result is to be transferred. The key features of the data-driven model are as follows:
Intermediate results as well as final results are passed directly as data tokens between instructions.
In contrast to control-driven computers, where the program has complete control over instruction sequencing, in a data-driven computer the program sequencing is constrained only by the data dependences among the instructions.
Instructions are examined to check operand availability, and if a functional unit and the operands are both available the instruction is executed immediately.
Because the repeated fetching of data from memory, which is part of the instruction cycle of the von Neumann model, is replaced by forwarding the available data directly, the bottleneck in exploiting parallelism disappears; in other words, parallelism is implemented better in a data-driven system. This is because there is no concept of shared memory cells, and data flow graphs are free from side effects: in data-driven computers operands are transferred directly as token values instead of as address variables, as in the control flow model, where changes to memory words always carry a chance of side effects.
The data-driven concept implies asynchrony: many instructions can be executed simultaneously, and no program counter or global updatable store is required. The information units in a data flow computer are operation packets, composed of an opcode, operands and the destinations of its successor instructions, and data tokens, formed from a result value and its destinations. Many such packets are passed among the various resource sections of a data flow machine. The basic rules of computation in a data flow computer are:
Enabling rule: an instruction is enabled (i.e. executable) if all of its operands are available, whereas in a control flow computer, as in the von Neumann model, an instruction is enabled if it is pointed to by the PC.
(Figure: a simple data flow graph; operator nodes such as Add (+) are connected by arcs along which data tokens flow.)
These graphs demonstrate the data dependency among the instructions. In data flow
computers the machine level program is represented by data flow graphs.
In a conventional computer the only concern when designing a program is the assignment of control flow. To implement parallel computing in this architecture we need many processing elements (electronic chips such as ALUs) working in parallel simultaneously, and the prospect of programming each chip individually becomes unmanageable. Researchers have designed various computer architectures based on the von Neumann principle, i.e. creating a single large machine from many processors, such as the Illiac IV, C.mmp, etc. The major problem in these machines (based on the von Neumann architecture) is implementing implicit parallelism.
Data flow languages make a clean break from the von Neumann framework, giving a new definition to concurrent programming languages. They manage to make optimal use of the implicit parallelism in a program. Consider the following segment:
1. P = X + Y (waits for the availability of the input values X and Y)
2. Q = P / Y (as P is a required input, it must wait for instruction 1 to finish)
3. R = X * P (as P is a required input, it must wait for instruction 1 to finish)
4. S = R - Q (as R and Q are required as inputs, it must wait for instructions 2 and 3 to finish)
A dataflow program is a graph in which nodes represent operations and edges represent data paths. Various notations are used to construct a data flow diagram with the help of operators (nodes) and links (arcs).
The figure referred to above shows the symbols commonly used in a data flow graph. Data links are used to transmit all types of data, whether integer or floating point, except for Boolean values, for which special links are used. An operator is stored in a node and has two or more input arcs and one output arc, except for the identity operator, which has one input arc and transfers the value of the data token unchanged. For conditional and iterative computations, deciders, gates and merge operators are used in data flow graphs. A decider requires a value from each of its input arcs, tests a condition, and transmits a truth value according to whether the condition is satisfied. Control tokens bearing Boolean values control the flow of data tokens by means of the gates, namely the T gates, the F gates, and the merge operators. A T gate transmits a data token from its input arc to its output arc if the value on its control input is true; if it receives a false value it absorbs the data token from its data input arc and places nothing on its output arc. The F gate has similar behavior, except that it transmits on a false control value. A merge operator has T and F input arcs and a truth-value control arc. When a true value is received on its control arc, the data token on the T input is transmitted, and the token on the other, unused input arc is discarded. Similarly, the F input is passed to the output when the control arc carries false.
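A compact way to pin down the gate semantics just described is to write them as small functions. The sketch below is illustrative only; the return value None stands for "no token placed on the output arc".

# T gate, F gate and merge operator semantics from the text.
# None means no token is produced on the output arc.

def t_gate(data_token, control):
    return data_token if control else None        # pass the token only on True

def f_gate(data_token, control):
    return data_token if not control else None    # pass the token only on False

def merge(t_token, f_token, control):
    # Forward the token from the arm selected by the control value;
    # the token on the other input arc is simply discarded.
    return t_token if control else f_token

print(t_gate(42, True), t_gate(42, False))   # 42 None
print(merge("from-T", "from-F", False))      # from-F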
As said earlier, in data flow graphs the tokens flow through the graph. When a node receives the tokens on its incoming edges it executes and places the result, as tokens, on its output edges. Unlike a control flow computer there is no predetermined sequence of execution in a data flow computer; the data drive the order of execution. When a node is activated and the operation stored in it is performed, the node is said to have fired, and the output of the operation is passed along the arcs to the waiting nodes. This process is repeated until all of the nodes have fired and the final result is produced. Parallelism arises because more than one node can fire simultaneously.
Let us see the data flow computation for the roots of a quadratic equation ax^2 + bx + c = 0 (for example x^2 - 2x + 3, with a = 1, b = -2, c = 3):
{
t1 = a*c;          /* a.c                      */
t2 = 4*t1;         /* 4ac                      */
t3 = b*b;          /* b^2                      */
t4 = t3 - t2;      /* discriminant b^2 - 4ac   */
t5 = sqrt( t4);    /* sqrt(b^2 - 4ac)          */
t6 = -b;
t7 = t6 - t5;      /* -b - sqrt(b^2 - 4ac)     */
t8 = t6 + t5;      /* -b + sqrt(b^2 - 4ac)     */
t9 = 2*a;
r1 = t7/t9;        /* first root               */
r2 = t8/t9;        /* second root              */
}
In a control flow computer this algorithm is executed line by line. To implement it on a data flow computer one should first note the dependences between the operations: for example, t2 cannot be computed before t1, but t3 could be computed before t1 or t2.
Let us consider the example of the iterative computation z = x^n and represent it by the data flow graph of Figure 5.3, using the symbols shown in Fig. 2. The inputs required are x and n, and the variables used are y and i:
y = 1; i = n
while i > 0 do
begin y = y * x; i = i - 1 end
z = y
output z
The computation involves successive calculation of the loop variables y and i; these values pass through the links and are tested against the loop condition. The initial values on the control arcs are labeled false to initiate the computation. The result z is obtained when the decider's output is false.
Two important characteristics of data flow graphs, as noted above, are their data-driven (asynchronous) execution and their freedom from side effects. A number of issues must still be addressed before data flow machines become practical, among them:
1. The development of efficient data flow languages which are easy to use and to be
interpreted by machine hardware
2. The decomposition of programs and the assignment of program modules to data
flow processors
3. Controlling and supporting large amounts of interprocessor communication with
cost-effective packet-switched networks
4. Developing intelligent data-driven mechanisms for either static or dynamic data
flow machines
5. Efficient handling of complex data structures, such as arrays, in a data flow
environment
6. Developing a memory hierarchy and memory allocation schemes for supporting
data flow computations
7. A large need for user acquaintance of functional data flow languages, software
supports, data flow compiling, and new programming methodologies
8. Performance evaluation of data flow hardware in a large variety of application
domains, especially in the scientific areas
Disadvantages of the dataflow model
o Data flow programs tend to waste a lot of memory space, owing to the increased code length caused by the single-assignment rule and the excessive copying of data arrays.
Data flow computing models have evolved along several lines: the pure static and dynamic dataflow architectures, and hybrid approaches such as the RISC approach, multithreading, large-grain computation, etc.
Let us begin the study with the pure dataflow computer. The basic principle of any dataflow computer is data-driven execution: it executes a program by receiving, processing and sending out tokens. These tokens consist of some data and a tag. Tags are used to represent all types of dependences between instructions; dependences are thus handled by translating them into tag matching and tag transformation. The processing unit is composed of two parts: a matching unit, used for matching tokens, and an execution unit, used for the actual execution of instructions. When a processing element receives a token, the matching unit performs the matching operation, and once a set of matched tokens has been assembled, execution begins in the execution unit. The operation to be performed is fetched from the instruction store, which is addressed using the information carried in the tag.
The matching unit and the execution unit are connected through an asynchronous pipeline, with queues added between the stages. To perform fast token matching, some form of fast associative memory is used; several implementations of the associative memory supporting token matching are possible.
Jack Dennis and his associates at MIT pioneered the area of data flow research and came forward with two models, the Dennis machine and the Arvind machine. The Dennis machine has a static architecture, while the Arvind machine uses tagged tokens and colored activities and was designed as a dynamic architecture.
There is a variety of static, dynamic and also hybrid dataflow computing models. In the static model only one token may be placed on an edge at any given time, and when an actor fires no token may be present on its output edge. It is called the static model because tokens are not labeled, and control tokens must be used to acknowledge the proper timing of transferring a data token from one node to another.
The static architecture was proposed by Dennis and Misunas [1975]. In the static data flow computer, data tokens are assumed to move along the arcs of the data flow program graph to the operator nodes. A nodal operation is executed only when all of its inputs are present on its input arcs. A data flow graph used in the Dennis machine must follow the static execution rule that only one token is allowed to exist on any arc at any given time, since otherwise successive sets of tokens could not be distinguished; thus, instead of a FIFO queue of tokens on each arc, a simpler design is used in which an arc can hold at most one data token. The scheme is called static because tokens are not labeled and control tokens are used for acknowledgement, so that the proper timing of transferring data tokens from node to node can be maintained. Here the complete program is loaded into memory before execution begins, and the same storage space is used for storing both the instructions and the data. To implement this, acknowledge arcs are implicitly added to the dataflow graph; they run in the opposite direction to each existing arc and carry an acknowledgement token. Some examples of static data flow computers are the MIT Static Dataflow machine, the DDM1 Utah Data Driven machine, the LAU System, the TI Distributed Data Processor and the NEC Image Pipelined Processor.
The graph itself is stored in the computer as a collection of activity templates, such that each template represents a node of the graph. As shown in the figure below, a template holds an opcode specifying the operation to be performed by the node; a memory slot for the value of the data token (i.e. the operand) on each input arc, with a presence flag for each one; and a list of destination addresses for the output tokens, referring to the operand slots in subsequent activity templates that are to receive the result value.
The instruction stored in a memory cell is thus represented as:
Opcode | Operands (with presence flags) | Destinations
The advantage of this approach is that an operand can only be affected by one selected node at a time. On the other hand, complex data structures, or even simple arrays, cannot reasonably be carried in the instruction and therefore cannot be handled by this mechanism.
The resulting packet or token consists only of a value and a destination address, and has the following form:
Value | Destination
The output from an instruction cell is generated when all of the input packets (tokens) have been received. Static dataflow therefore has the following firing rules:
1) A node fires when all of its input tokens are available and the previous output tokens have been consumed.
2) The input tokens are then removed and new output tokens are generated.
The major drawback of this scheme is if different tokens are destined for the same
destination data flow computer cannot be distinguished between them. However Static
dataflow overcome this problem by allowing at most one token on any one arc which
extends the basic firing rule as follows:
o A node is enabled only if, in addition to tokens being present on all of its input arcs, there is no token on any of its output arcs.
This rule allows pipelined computations and loops, but it does not allow computations that involve code sharing or recursion.
Static data flow adopts a handshaking acknowledgement mechanism, which can take the form of special control tokens sent from processors once they respond to a fired node. To implement this, acknowledge arcs are implicitly added to the dataflow graph; they run in the opposite direction to each existing arc and carry an acknowledgement token. Thus additional acknowledge signals (tokens) travel along additional arcs from consuming to producing nodes. With the acknowledgement concept in place, we can restate the firing rule in its original form:
o A node fires when a token is present on each of its input arcs and an acknowledgement has been received for each of its output arcs (i.e. its previous output tokens have been consumed).
Some examples of dynamic dataflow computers are the Manchester Dataflow machine, the MIT Tagged Token machine, CSIRAC II, the NTT Dataflow Processor Array, the Distributed Data Driven Processor, the Stateless Dataflow Architecture, SIGMA-1 and the Parallel Inference Machine (1984).
Case study of the MIT Static Dataflow computer
The static dataflow mechanism was the first one to receive attention for hardware realization, at MIT. The MIT Static Dataflow Machine consists of the following sections:
Memory section: consists of instruction cells which hold instructions and their operands. The memory section is a collection of memory cells, each cell composed of three memory words that represent an instruction template. The first word of each instruction cell contains the op-code and destination address(es), and the next two words hold the operands.
Arbitration network: delivers operation packets from the memory section to the processing section. Its purpose is to establish a smooth flow of enabled instructions (i.e. instruction packets) from the memory section to the processing section. An instruction packet contains the corresponding op-code, operand value(s), and destination address(es).
Control network: delivers control tokens from the processing section to the memory section. The control network reduces the load on the distribution network by transferring the Boolean tokens and the acknowledgement signals from the processing section to the memory section.
Distribution network: delivers data tokens from the processing section to the memory section.
Instructions stored in the memory section are enabled for execution by the arrival of their operands, in data tokens from the distribution network and control tokens from the control network. The instructions, together with their data and control inputs, are sent as operation packets to the processing section through the arbitration network. The results of an instruction are sent through the distribution network and the control network back to the memory section, where they become input data for other instructions.
Deficiencies of static dataflow
Consecutive iterations of a loop can only be pipelined; in certain cases the single-token-per-arc limitation means that a second loop iteration cannot begin executing until the present iteration has completed its execution.
No procedure calls.
No recursion.
Advantage:
The static architecture's main strength is its simplicity: it does not require a data structure such as a queue or a stack to hold lists of tokens, since only one token is allowed per arc, and it can quickly detect whether or not a node is fireable. Additionally, memory can be allocated for each arc at compile time, as each arc will only ever hold 0 or 1 data token. This means there is no need for complex hardware to manage queues of data tokens: each arc can be assigned to a particular piece of memory store.
Dynamic Dataflow
In Dynamic machine data tokens are tagged ( labeled or colored) to allow multiple tokens
to appear simultaneously on any input arc of an operator. No control tokens are needed to
acknowledge the transfer of data tokens among the instructions. The tagging is achieve
by attaching a label with each token which uniquely identifies the context of that
particular token. This dynamically tagged data flow model suggests that maximum
parallelism is exploited from the program graph. However here the matching of token
tags ( labels or colors) is performed to merge them for instructions requiring more than
one operand token. Thus the dynamic model, it exposed to an additional parallelism by
allowing multiple invocations of a subgraph that is for implementation of an iterative
loop by performing dynamically unfolding of the iterative loop. While this is the
conceptual view of the tagged token model, in reality only one copy of the graph is kept
in memory and tags are used to distinguish between tokens that belong to each
invocation. A general format for instruction has opcode, the number of constants stored
in instruction and number of destination for the result token. Each destination is identified
by four fields namely the destination address, the input port at the destination instruction,
number of token needed to enable the destination and the assignment function used in
selecting processing element for the execution of destination instruction. The dynamic
architecture has following characteristic different from static architecture. Here Program
nodes can be instantiated at run time unlike in static architecture where it is loaded in the
beginning. Also in dynamic architecture Several instances of an data packet are enabled
and also Separate storage space used for instructions and data
Dynamic dataflow refers to a system in which the dataflow graph being executed is not fixed and can be altered through such actions as code sharing and recursion. Tags are attached to the packets to identify tokens with particular computations.
Dynamic dataflow has the following firing rules:
1) A node fires when all input tokens with the same tag appear.
2) More than one token is allowed on each arc and previous output tokens need not
be consumed before the node can be fired again.
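The tag-matching step behind these firing rules can be sketched in a few lines. Below, the waiting-token store is a plain dictionary keyed by (destination instruction, tag); this is an illustrative assumption, not the organization of any particular machine.

# Tagged-token matching: a dyadic node fires only when two tokens with the
# same (destination, tag) pair have arrived; tokens with different tags
# (e.g. different loop iterations) never get confused with one another.

waiting = {}   # (dest_node, tag) -> first operand seen so far

def token_arrives(dest_node, tag, port, value):
    key = (dest_node, tag)
    if key not in waiting:
        waiting[key] = (port, value)            # wait for the partner token
        return None
    other_port, other_value = waiting.pop(key)  # partner found: the node can fire
    operands = {port: value, other_port: other_value}
    return (dest_node, tag, operands[0], operands[1])

# Two loop iterations (tags 0 and 1) feeding the same "add" node:
print(token_arrives("add", 0, 0, 10))   # None  (waiting for the second operand)
print(token_arrives("add", 1, 0, 99))   # None  (different iteration, kept separate)
print(token_arrives("add", 0, 1, 5))    # ('add', 0, 10, 5)  -> iteration 0 fires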
The dynamic architecture requires storage space for the unmatched tokens; a simple first-in, first-out token queue is not suitable. A tag contains a unique subgraph-invocation ID, as well as an iteration ID if the subgraph is a loop. These pieces of information, taken together, are commonly known as the color of the token. No acknowledgement mechanism is required. The term coloring is used for the token-labeling operation, and tokens with the same color belong together.
A tag has three fields:
Iteration level | Activation name | Index
Each field holds a number: the iteration level identifies the particular activation of a loop body, the activation name represents the particular function call, and the index identifies the particular element of an array.
Thus instead of the single-token-per-arc rule of the static model, the dynamic model
represents each arc as a large queue that can contain any number of tokens, each with a
different tag. In this scenario, a given node is said to be fireable whenever the same tag is
found in a data token on each input arc. It is important to note that, because the data
tokens are not ordered in the tagged-token model, processing of tokens does not
necessarily proceed in the same order as they entered the system. However, the tags
ensure that the tokens do not conflict, so this does not cause a problem. The tags
themselves are generated by the system. Tokens being processed in a given invocation
of a subgraph are given the unique invocation ID of that subgraph. Their iteration
ID is set to zero. When the token reaches the end of the loop and is being fed
back into the top of the loop, a special control operator increments the iteration ID.
Whenever a token finally leaves the loop, another control operator sets its iteration
ID back to zero.
A hardware architecture based on the dynamic model is necessarily more complex
than the static architecture . Additional units are required
to form tokens and match tags. More memory is also required to store the
extra tokens that will build up on the arcs. The key advantage of the tagged-token
model is that it can take full advantage of pipelining effects and can even execute
separate loop iterations simultaneously. It can also execute out-of-order, bypassing
any tokens that require complex execution and that delay the rest of the computation.
It has been shown that this model offers the maximum possible parallelism
in any dataflow interpreter.
Each token thus carries, besides its value and tag, the address of the instruction for which the particular data value is destined. Each arc can be viewed as a bag that may contain an arbitrary number of tokens with different tags.
The major advantage of dynamic data flow computers is their better performance compared with static data flow computers: the architecture allows multiple tokens to exist on each arc, which lets iterative programs be unfolded and thereby exposes more parallelism. The price is the need for an efficient implementation of the matching unit that collects tokens with matching tags.
Dyadic instructions lead to pipeline bubbles while only the first of their operand tokens has arrived. The main disadvantage of the tagged-token model is the extra overhead required to match tags on tokens, instead of simply checking their presence or absence. More memory is also required and, owing to the quantity of data being stored, an associative memory is not practical, so memory access is not as fast as it could be. Nevertheless, the tagged-token model does seem to offer advantages over the static model, and a number of computers using this model have been built and studied.
Case study of dynamic data flow computers
Two related dynamic data flow projects, the Irvine machine and the Arvind machine at MIT, are introduced below. In dynamic machines, data tokens are tagged (labeled or colored) to allow multiple tokens to appear simultaneously on any input arc of an operator node. No control tokens are needed to acknowledge the transfer of data tokens among instructions. Instead, the matching of token tags (labels or colors) is performed to merge them for instructions requiring more than one operand token. Therefore, additional hardware is needed to attach tags to data tokens and to perform tag matching. We shall present the Arvind machine, which was designed with the following objectives:
1) Modularity: The machine should be constructed from
only a few different component types, regularly interconnected, but internally these
components will probably be quite complex (e.g., a processor).
2) Reliability and Fault-Tolerance: Components should be pooled, so removal of a failed
component may lower speed and capacity but not the ability to complete a computation.
The development of the Irvine data flow machine was motivated by the desire to exploit the potential of VLSI and to provide a high-level, highly concurrent program organization. The project originated at the University of California at Irvine and was continued at the Massachusetts Institute of Technology by Arvind and his associates. The architecture of the original Irvine machine is shown conceptually in Figure 10.15. The Id programming language was developed for this machine. The machine itself was not built, but extensive simulation studies were performed on its projected performance.
The Irvine machine was proposed to consist of multiple PE clusters. All PE clusters (physical domains) can operate concurrently. Each PE is organized as a pipelined processor; each box in the figure is a unit that performs work on one item at a time, drawn from FIFO input queue(s).
The physical domains are interconnected by two system buses. The token bus is
a pair of bidirectional shift-register rings. Each ring is partitioned into as many slots as
there are PEs and each slot is either empty or holds one data token. Obviously, the token
rings are used to transfer tagged tokens among the PEs.
Each cluster of PEs (four PEs per cluster, as shown in Figure 10.15) shares a local memory through a local bus and a memory controller. A global bus is used to transfer data structures among the local memories. Each PE must accept all tokens that are sent to it and sort those tokens into groups by activity name. When all input tokens for an activity have arrived (established through tag matching), the PE must execute that activity. The U-interpreter can help implement iterative or procedure computations by mapping the loop or procedure instances onto the PE clusters for parallel execution.
The Arvind machine at MIT is a modification of the Irvine machine, but is still based on the Id language. Instead of using token rings, the Arvind machine uses an N-by-N packet-switched interconnection network, as shown in Figure 10.16a. The machine consists of N PEs, where each PE is a complete computer with an instruction set, a memory, tag-matching hardware, etc. Activities are divided among the PEs according to a mapping from tags to PE numbers; each PE uses a statistically chosen assignment function to determine the destination PE number.
5.4 Keywords
context switching Saving the state of one process and replacing it with that of another
that is time sharing the same processor. If little time is required to switch contexts,
processor overloading can be an effective way to hide latency in a message passing
system
data flow graph (1) machine language for a data flow computer; (2) result of data flow
analysis.
dataflow A model of parallel computing in which programs are represented as
dependence graphs and each operation is automatically blocked until the values on which
it depends are available. The parallel functional and parallel logic programming models
are very similar to the dataflow model.
thread a lightweight or small granularity process.
5.5 Summary
The Multithreading paradigm has become more popular as efforts to further exploit
instruction level parallelism have stalled since the late-1990s. This allowed the concept of
Throughput Computing to re-emerge to prominence from the more specialized field of
transaction processing:
Techniques that would allow speed up of the overall system throughput of all
tasks would be a meaningful performance gain.
The two major techniques for throughput computing are multiprocessing and
multithreading.
Advantages :
If a thread gets a lot of cache misses, the other thread(s) can continue, taking
advantage of the unused computing resources, which thus can lead to faster
overall execution, as these resources would have been idle if only a single thread
was executed.
If a thread cannot use all the computing resources of the CPU (because its instructions depend on each other's results), running another thread keeps those resources from sitting idle.
If several threads work on the same set of data, they can actually share their
cache, leading to better cache usage or synchronization on its values.
Lesson No. : 06
6.0 Objective
6.1 Introduction
6.2 Vector Processors
6.2.1 functional units,
6.2.2 vector instruction,
6.2.3 processor implementation,
6.3 Vector memory
6.3.1 modeling vector memory performance,
6.3.2 Gamma Binomial model.
6.4 Vector processor speedup
6.5 Multiple issue processors
6.6 Self assignment questions
6.7 Reference.
6.0 Objective
In this lesson we will study various types of concurrent processors. For vector processors we will look at how pipelining is implemented, through the instruction format and the functional units, and give a general overview of the architecture of a vector computer, including an introduction to vectors and vector arithmetic and a discussion of the performance measures used to evaluate this type of machine. Various models of memory organization for vector processors are also discussed. We will also study multiple-instruction-issue machines, including VLIW, EPIC, etc.
6.1 Introduction
Concurrent processors must be able to execute multiple instructions at the same time. They must be able to make simultaneous accesses to memory and to execute multiple operations simultaneously. Concurrent processors depend on sophisticated compilers to detect the various types of instruction-level parallelism that exist within a program. They are classified as vector processors and multiple-issue processors.
Vector processors
A Vector processor is a processor that can operate on an entire vector in one instruction.
The operands to the instructions are complete vectors instead of one element.
Vector processors reduce the fetch and decode bandwidth because fewer instructions are fetched.
They also exploit the data parallelism in large scientific and multimedia applications. Based on how the operands are fetched, vector processors can be divided into two categories: in the memory-memory architecture, operands are streamed directly to the functional units from memory and results are written back to memory as the vector operation proceeds; in the vector-register architecture, operands are read into vector registers, from which they are fed to the functional units, and the results of operations are written back to vector registers.
Many performance optimization schemes are used in vector processors. Memory banks are used to reduce load/store latency. Strip mining is used to generate code so that vector operations are possible even for vectors whose length is smaller or larger than the vector register length. Various techniques are used for fast access; these include:
Vector chaining - the equivalent of forwarding in vector processors - which is used when there is a data dependency among vector instructions.
Special scatter and gather instructions, which are provided to operate efficiently on sparse matrices.
The instruction set is designed with the property that all vector arithmetic instructions only allow element N of one vector register to take part in operations with element N
from other vector registers. This dramatically simplifies the construction of a highly
parallel vector unit, which can be structured as multiple parallel lanes. As with a traffic
highway, we can increase the peak throughput of a vector unit by adding more lanes.
Adding multiple lanes is a popular technique to improve vector performance as it requires
little increase in control complexity and does not require changes to existing machine
code. The reason behind the declining popularity of vector processors is their cost compared to multiprocessors and superscalar processors. The reasons behind the high cost of vector processors are:
Vector processors do not use commodity parts. Since they sell very few copies, design
cost dominates overall cost.
Vector processors need high-speed on-chip memories, which are expensive.
It is difficult to package processors operating at such high speeds; in the past, vector manufacturers have employed expensive designs for this.
There have been few architectural innovations, compared to superscalar processors, to improve performance while keeping cost low.
Vector processing has the following semantic advantages:
Program size is small, as fewer instructions are required. Vector instructions also hide many branches by executing a whole loop in one instruction.
Vector memory access has no wastage of the kind seen with cache access; every data item requested by the processor is actually used.
Once a vector instruction starts operating, only the functional unit (FU) and the register buses feeding it need to be powered; the fetch unit, decode unit, ROB, etc. can be powered off. This reduces power usage.
6.2 Vector processors
The vector computer or vector processor is a machine designed to efficiently handle
arithmetic operations on elements of arrays, called vectors. Such machines are especially
useful in high-performance scientific computing, where matrix and vector arithmetic are
quite common. The Cray Y-MP and the Convex C3880 are two examples of vector
processors used today.
As an example of pipelined operation, consider the steps required to add two floating-point numbers:
[A:] The exponents of the two floating-point numbers to be added are compared to find the number with the smaller magnitude.
[B:] The significand of the number with the smaller magnitude is shifted so that the exponents of the two numbers agree.
[C:] The significands are added.
[D:] The result of the addition is normalized.
[E:] Checks are made to see if any floating-point exceptions occurred during the addition, such as overflow.
[Figures: a worked example showing two floating-point operands (0.1234E.. and 0.5678E..) passing through the stages of the addition pipeline, followed by timing diagrams that trace successive operand pairs (x1, y1), (x2, y2), ... through the pipeline at clock steps 2t, 3t, ..., 8t, contrasting purely sequential execution with pipelined execution, in which a new result emerges every clock period once the pipeline is full.]
A vector operand contains an ordered set of n elements, where n is called the length of
the vector. Each element in a vector is a scalar quantity, which may be a floating point
number, an integer, a logical value or a character. A vector processor consists of a scalar
processor and a vector unit, which could be thought of as an independent functional unit
capable of efficient vector operations.
Vector Hardware
Vector computers have hardware to perform the vector operations efficiently. Operands
can not be used directly from memory but rather are loaded into registers and are put
back in registers after the operation. Vector hardware has the special ability to overlap or
pipeline operand processing: an operation on one pair of operands can begin before the operation on the previous pair of operands has completed. The processing of a number of operands may be carried out
simultaneously.
The loading of a vector register is itself a pipelined operation, with the ability to load one
element each clock period after some initial startup overhead.
Chaining
Theoretical speedup depends on the number of segments in the pipeline, so there is a direct relationship between the number of pipeline stages you can keep full and the performance of the code. The effective size of the pipeline can be increased by chaining: the Cray combines more than one pipeline to increase its effective length. Chaining means that the result from one pipeline can be used as an operand in a second pipeline, as illustrated in the next diagram.
Figure 6.9 Vector chaining used to compute a scalar value a times a vector x, adding the elements of the resultant vector to the elements of a second vector y (of the same length).
Chaining can double the number of floating-point operations that are done in tau units of time. Once both the multiplication and addition pipelines have been filled, one floating-point multiplication and one floating-point addition (a total of two floating-point operations) are completed every tau time units. Conceptually, it is possible to chain more than two functional units together, providing an even greater speedup. However, this is rarely (if ever) done, due to difficult timing problems.
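The loop below is a minimal C sketch (the function name and signature are illustrative, not from the source) of the a*x + y computation that Figure 6.9 describes; the comments indicate how a chained vector unit would overlap the multiply and add pipelines.

#include <stddef.h>

/* Sketch of y = a*x + y (SAXPY). On a chained vector machine the multiply
   pipeline feeds the add pipeline directly, so both operations proceed
   element by element without writing the intermediate product a*x back
   to a register file first. */
void saxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        /* multiply pipeline produces a*x[i]; its result is chained
           straight into the add pipeline that adds y[i] */
        y[i] = a * x[i] + y[i];
    }
}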
Masking instructions have the general form fa: Va x Vm -> Vb (e.g., MMOVE V1, V2, V3), where Vm is a Boolean masking vector.
Gather and scatter are used to process sparse matrices and vectors. The gather operation uses a base address and a set of indices to load from memory a "few" of the elements of a large vector into one of the vector registers. The scatter operation does the opposite. The masking operation allows conditional execution of an instruction based on a "masking" register; a C sketch of these operations follows below.
A Boolean vector can be generated as a result of comparing two vectors, and can
be used as a masking vector for enabling and disabling component operations in a
vector instruction.
A merge instruction combines two vectors under the control of a masking vector.
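The following C loops are a minimal sketch (the array names and the index array are illustrative assumptions, not from the source) of what the gather, scatter and masked operations do one element at a time; a vector processor performs each loop as a single instruction.

#include <stddef.h>

/* Gather: load selected elements of a large vector into a dense register. */
void gather(size_t n, const double *base, const size_t *index, double *vreg)
{
    for (size_t i = 0; i < n; i++)
        vreg[i] = base[index[i]];        /* vreg <- a few elements of base */
}

/* Scatter: store the dense register back to the selected positions. */
void scatter(size_t n, double *base, const size_t *index, const double *vreg)
{
    for (size_t i = 0; i < n; i++)
        base[index[i]] = vreg[i];
}

/* Masked move (fa: Va x Vm -> Vb): copy only where the mask bit is set. */
void masked_move(size_t n, const double *va, const int *vm, double *vb)
{
    for (size_t i = 0; i < n; i++)
        if (vm[i])
            vb[i] = va[i];
}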
In general, a machine operation suitable for pipelining should have the following properties:
Identical processes (or functions) are repeatedly invoked many times, each of which can be subdivided into subprocesses (or subfunctions).
Successive operands are fed through the pipeline segments and require as few buffers and local controls as possible.
A vector instruction must therefore specify several pieces of information:
The operation code, in order to select the functional unit or to reconfigure a multifunctional unit to perform the specified operation.
For a memory-reference instruction, the base addresses of both source operands and of the result vector; if the operands and results are located in the vector register file, the designated vector registers must be specified instead.
The address offset relative to the base address; using the base address and the offset, the effective addresses can be calculated. (A sketch of such an instruction format is given below.)
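As an illustration only (the field names and their widths are assumptions, not taken from any particular machine), the information listed above could be collected into a vector instruction encoding such as:

/* Hypothetical vector instruction fields - a sketch of the information a
   vector-register machine needs, not the encoding of any real ISA. */
struct vector_instruction {
    unsigned opcode;      /* selects or reconfigures the functional unit  */
    unsigned vdst;        /* destination vector register                  */
    unsigned vsrc1;       /* first source vector register                 */
    unsigned vsrc2;       /* second source vector register (if any)       */
    unsigned base;        /* base-address register, for memory references */
    unsigned offset;      /* offset added to the base to form addresses   */
    unsigned length;      /* number of elements to process                */
};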
The major hurdle for designing a vector unit is to ensure that the flow of data from
memory to the vector unit will not pose a bottleneck. In particular, for a vector unit to be
effective, the memory must be able to deliver one datum per clock cycle. This is usually achieved by pipelining memory accesses, using the C-access memory organization (concurrent access), the S-access memory organization (simultaneous access), or a combination of the two.
A scalar operation works on only one pair of operands from the S register and returns the
result to another S register whereas a vector operation can work on 64 pairs of operands
together to produce 64 results while executing only one instruction. Computational efficiency is achieved by processing each element of a vector identically, e.g., initializing all the elements of a vector to zero.
A vector instruction provides iterative processing of successive vector register elements
by obtaining the operands from the first element of one or more V registers and
delivering the result to another V register. Successive operand pairs are transmitted to a
functional unit in each clock period so that the first result emerges after the start up time
of the functional unit and successive results appear each clock cycle.
Vector overhead is larger than scalar overhead, one reason being the vector length which
has to be computed to determine how many vector registers are going to be needed (i.e.,
the number of elements divided by 64).
Each vector register can hold up to 64 words, so vectors can only be processed in 64-element segments. This is important when it comes to programming, as one situation to be avoided is where the number of elements to be processed exceeds the register capacity by a small amount, e.g., a vector length of 65. In this case the first 64 elements are processed from one register, and the 65th element must then be processed using a separate register after the first 64 elements have been processed. The functional unit will process this single element in a time equal to the start-up time instead of one clock cycle, hence reducing the computational efficiency.
There is a sharp decrease in performance at each point where the vector length spills over
into a new register.
The Cray can receive a result in a vector register and retransmit it as an operand to a subsequent operation in the same clock period. In other words, a register may be both a result register and an operand register, which allows the chaining of two or more vector operations
together as seen earlier. In this way two or more results may be produced per clock cycle.
Parallelism is also possible as the functional units can operate concurrently and two or
more units may be co-operating at once. This combined with chaining, using the result of
one functional unit as the input of another, leads to very high processing speeds.
Scalar and vector processing examples
DO 10 I = 1, 3
JJ(I) = KK(I)+LL(I)
10 CONTINUE
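As a rough illustration (the C restatement below is mine, not from the source), a scalar processor executes this loop as three separate load-add-store iterations, whereas a vector processor can perform the whole loop with a single vector add once KK and LL have been loaded into vector registers:

/* C restatement of the Fortran loop above. */
int jj[3], kk[3], ll[3];

void scalar_version(void)
{
    for (int i = 0; i < 3; i++)      /* one iteration per element:      */
        jj[i] = kk[i] + ll[i];       /* fetch, decode, add, store, loop */
}

/* A vector machine would instead issue one vector instruction sequence,
   conceptually:  V1 <- load KK ; V2 <- load LL ; V3 <- V1 + V2 ; store V3 */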
Scalar registers
Scalar registers behave like general purpose or floating-point registers; they hold a single
value. However, these registers are configured so that they may be used by a vector
pipeline; the value in the register is read once every tau units of time and put into the
pipeline, just as a vector element is released from the vector pipeline. This allows the
elements of a vector to be operated on by a scalar. To compute
y = 2.5 * x,
the 2.5 is stored in a scalar register and fed into the vector multiplication pipeline every
tau units of time in order to be multiplied by each element of x to produce y.
6.2.4 Vector computing performance
For typical vector architectures, the value of tau (the time to complete one pipeline stage) is equivalent to one clock cycle of the machine; on some machines, it may be equal to two or more clock cycles. Once a pipeline like the one shown in figure 3 has been filled, it generates one result every tau units of time, that is, every clock cycle. This means the hardware performs one floating-point operation per clock cycle.
Let k represent the number of tau time units the same sequential operation would take (i.e., the number of stages in the pipeline). Then the time to execute that sequential operation on a vector of length n is
Ts = k * n * tau,
and the time to perform the pipelined version is approximately
Tp = (k + n - 1) * tau,
since k cycles are needed to fill the pipeline, after which one result is produced every tau.
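A small worked example under the formulas above (the numbers are illustrative): the speedup of the pipelined version over the sequential one is Ts/Tp = k*n / (k + n - 1), which approaches k for long vectors. The short C function below simply evaluates this ratio.

#include <stdio.h>

/* Speedup of a k-stage pipeline over sequential execution for n elements:
   Ts / Tp = (k * n) / (k + n - 1).  For large n this tends to k. */
double pipeline_speedup(int k, int n)
{
    return (double)(k * n) / (double)(k + n - 1);
}

int main(void)
{
    printf("%f\n", pipeline_speedup(5, 64));   /* ~4.7 for a 5-stage pipe  */
    printf("%f\n", pipeline_speedup(5, 1024)); /* ~4.98, approaching k = 5 */
    return 0;
}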
The following measures are commonly used to describe vector performance:
R_n: for a vector processor, the number of Mflops obtainable for a vector of length n.
R_infinity: the limiting value of R_n as the vector length n grows without bound, i.e., the asymptotic Mflop rate of the machine.
n_1/2: the vector length needed to achieve half of R_infinity.
n_v: the length, n, of a vector such that performing a vector operation on the n elements of that vector is more efficient than executing the n scalar operations instead.
[Table (individual entries not recoverable): performance characteristics - year, clock cycle (nsec), peak performance (Mflops), R_infinity for x*y (Mflops), and n_1/2 for x*y - of representative vector machines: the Cray-1, Cray X-MP (and its 4-processor configuration), Cray-2 (and its 4-processor configuration), IBM 3090 (and its 8-processor configuration), ETA 10 (and its 8-processor configuration), Alliant FX/8 (and its 8-processor configuration), Cray C90 multiprocessor, Convex C3880, and Cray 3-128.]
Let {a_i, 1 <= i <= n} be n scalar constants, let x_j = (x_1j, x_2j, ..., x_mj)^T for j = 1, 2, ..., n be n column vectors, and let y = (y_1, y_2, ..., y_m)^T be a column vector of m components. The computation to be performed is
y = a_1 * x_1 + a_2 * x_2 + ... + a_n * x_n.
Writing z_ij = a_j * x_ij, the components of y are
y_1 = z_11 + z_12 + ... + z_1n
y_2 = z_21 + z_22 + ... + z_2n
...
y_m = z_m1 + z_m2 + ... + z_mn
Horizontal Vector Processing
In this method all components of the vector y are calculated in sequential order, y_i for i = 1, 2, ..., m. Each summation, involving n - 1 additions, must be completed before switching to the evaluation of the next summation.
Vertical Vector Processing:
In this method the partial sums of all m components are computed column by column: first the partial sums z_i1 + z_i2 are streamed through the pipeline for i = 1, 2, ..., m, and then the next column of z values is added to the column of partial sums, repeating the process until all n columns have been accumulated.
Vector Looping Method:
It combines the horizontal and vertical approaches into a block approach (a loop sketch of the first two methods is given below).
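The loops below are a minimal C sketch (array names are illustrative) of the horizontal and vertical orderings of the summations y_i = z_i1 + ... + z_in described above; only the loop order differs.

/* y[i] = sum over j of z[i][j], for i = 0..m-1. */

void horizontal(int m, int n, double z[m][n], double y[m])
{
    for (int i = 0; i < m; i++) {        /* finish one component of y ...  */
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += z[i][j];             /* ... before starting the next   */
    }
}

void vertical(int m, int n, double z[m][n], double y[m])
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0;
    for (int j = 0; j < n; j++)          /* add one column of z at a time; */
        for (int i = 0; i < m; i++)      /* all m partial sums advance     */
            y[i] += z[i][j];             /* together through the pipeline  */
}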
Vector processor speedup
If f is the fraction of the work that can be executed in vector mode and r is the speed of the vector unit relative to the scalar unit, the overall speedup is
Speedup = r / ((1 - f) * r + f)
So even if the performance of the vector unit is extremely high (r -> infinity), we get a speedup of less than 1/(1 - f), which shows that the ratio f is crucial to performance, since it places a limit on the attainable speedup. This ratio depends on the efficiency of the compilation, etc. It also means that a scalar unit with mediocre performance, even if coupled with the fastest vector unit, will yield a mediocre speedup.
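As an illustrative worked example (the numbers are mine, not from the source): with f = 0.9 and r = 10, the speedup is 10 / (0.1 * 10 + 0.9) = 10 / 1.9, or about 5.3; raising f to 0.99 with the same vector unit gives 10 / (0.1 + 0.99), or about 9.2, showing how strongly the vectorized fraction f dominates the result.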
Strip-mining
If a vector to be processed has a length greater than that of the vector registers, then strip-mining is used, whereby the original vector is divided into equal-size segments (equal to the size of the vector registers) and these segments are processed in sequence. Strip-mining is usually performed by the compiler, but in some architectures (like the Fujitsu VP series) it can be done by the hardware. A loop sketch is given below.
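A minimal C sketch of strip-mining, assuming a maximum vector length of 64 elements (MVL and the array names are illustrative): the compiler turns one long loop into an outer loop over 64-element strips, each of which fits in a vector register.

#define MVL 64                      /* assumed vector register length */

void vadd_stripmined(int n, const double *a, const double *b, double *c)
{
    for (int start = 0; start < n; start += MVL) {
        int len = (n - start < MVL) ? (n - start) : MVL;  /* last, short strip */
        /* each inner loop corresponds to one vector instruction of length len */
        for (int i = 0; i < len; i++)
            c[start + i] = a[start + i] + b[start + i];
    }
}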
Compound Vector Processing
A sequence of vector operations may be bundled into a "compound" vector function (CVF), which can be executed as one operation (without having to store intermediate results in vector registers, etc.) using the chaining technique, which is an extension of the bypassing used in scalar pipelines. The purpose of "discovering" CVFs is to explore opportunities for concurrent processing of linked vector operations. Notice that the number of available vector registers and functional units imposes limitations on how many CVFs can be executed simultaneously (e.g., compound vector processing of the SAXPY code leads to a speedup of 5/3 on the Cray-1, and to a speedup of 5 on the X-MP).
6.3 Vector memory
The memory of a vector machine is typically divided into n interleaved memory banks, where
n = 2^k, with k = 1, 2, 3, or 4.
One memory access (load or store) of a data value in a memory bank takes several clock
cycles to complete. Each memory bank allows only one data value to be read or stored in
a single memory access, but more than one memory bank may be accessed at the same
time. When the elements of a vector stored in an interleaved memory are read into a
vector register, the reads are staggered across the memory banks so that one vector
element is read from a bank per clock cycle. If one memory access takes n clock cycles,
then n elements of a vector may be fetched at a cost of one memory access; this is n
times faster than the same number of memory accesses to a single bank.
The figure below shows an interleaved memory; as can be seen, it places consecutive words of memory in different memory modules:
Since a read or write to one module can be started before a read/write to another module
finishes, reads/writes can be overlapped. Only the leading bits of the address are used to
determine the address within the module. The least-significant bits (in the diagram above,
the two least-significant bits) determine the memory module. Thus, by loading a single
address into the memory-address register (MAR) and saying read or write, the
processor can read/write M words of memory. We say that memory is M-way interleaved.
Low-order interleaving distributes the addresses so that consecutive addresses are located in consecutive modules. For example, for 8-way interleaving the module number is given by the three least-significant address bits, as sketched below:
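The following C fragment is a small sketch (the names are mine) of how a low-order-interleaved memory splits an address into a module number and an address within that module, for M = 8 modules:

#define M 8                      /* number of memory modules (8-way interleave) */

/* Low-order interleaving: the low log2(M) bits select the module,
   the remaining high-order bits select the word within the module. */
unsigned module_of(unsigned addr)  { return addr % M; }   /* addr & 0x7 */
unsigned offset_of(unsigned addr)  { return addr / M; }   /* addr >> 3  */

/* Example: addresses 0..7 fall in modules 0..7 at offset 0, addresses 8..15
   fall in modules 0..7 at offset 1, and so on, so a stride-1 vector fetch
   touches a different module on each successive access. */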
Interleaved memory access improves available bandwidth and may reduce latency for
concurrent accesses.
High-end machines use multiple concurrent banks:
They might use a crossbar switch (instead of a bus, not instead of the VDS) to connect several memory banks to the VDS simultaneously.
The banks might be interleaved, with different subsets of banks connected in each clock cycle.
Interleaved-memory designs: Interleaved memory divides an address into two portions:
one selects the module, and the other selects an address within the module.
Each module has a separate MAR and a separate MDR.
When an address is presented, a decoder determines which MAR should be loaded with this address; it uses the low-order m = log2(M) bits of the address to decide this. The high-order n - m bits of the n-bit address are actually loaded into the MAR; they select the proper location within the module.
The vector registers are large: 64 to 256 floating-point numbers each. 256 floating-point numbers at 64 bits each, times 8 registers, is equivalent to a 16 Kbyte internal data cache.
6.3.1 Vector Memory Modeling
In a vector processor, memory access can be overlapped with vector execution; a problem arises if the memory cannot keep up with the vector execution rate.
6.3.2 Gamma (γ) binomial model
This model is based on the principle of using a vector request buffer to bypass waiting requests. An associated issue is the degree of bypassing, or out-of-order requests, that a source can make to the memory system. Suppose a conflict arises: a request is directed to a busy module. How many subsequent requests can the source make before it must wait? Assume each of the s access ports to memory has a buffer of size TBF / s (Fig 7.19). This buffer holds requests (element addresses) to memory that are being held up due to a conflict. For each source, the degree of bypassing is defined as the allowable number of requests waiting before stalling of subsequent requests occurs.
From a modeling point of view, this is different from the simple binomial or δ-binomial models. The basic difference is that the queue awaiting service from a module is larger by an amount γ, where γ is the mean queue size of bypassed requests awaiting service. Note that the average queue size γ is always less than or equal to the buffer size:
γ <= TBF / s
(although, depending on the organization of the TBF, one source buffer could borrow from another).
With or without request bypassing, there is a buffer between the s request sources and the m memory modules (Figure 7.19). This must be large enough to accommodate the denied requests (in the absence of bypassing), i.e.:
Buffer = TBF >= m * Qc,
where Qc is the expected number of denied requests per module and m is the number of modules; m * Qc = n - B, as discussed in chapter 6. If we allow bypassing, we will require additional buffer entries and additional control. Typically, an entry could include the module id. While some optimization is possible, it is clear that large bypassed-request buffers can be complex.
We now develop the γ-binomial model of bypassed vector memory behavior. Assume that each vector source issues a request each cycle (δ = 1), and that each physical requestor in the vector processor has the same buffer capacity and characteristics. If the vector processor can make s requests per cycle, and there are t cycles per Tc, we have:
Total requests per Tc = t * s = n.
This is the same as our n requests per Tc in the simple binomial model, but the situation in the vector processor is more complex. We assume that each of the s sources makes a request each cycle and that each of its γ buffered requests also makes a request.
Depending on the buffer control, these buffer requests are made only implicitly. The
controller knows when a target module will be free and therefore schedules the actual
request for that time. From a memory modeling point of view, this is equivalent to the
buffer requesting service each cycle until the module is free.
Thus, we now have:
Total offered requests per Tc = t * s * (1 + γ) = n * (1 + γ),
since each source presents its own request plus, on average, γ bypassed requests in every cycle. [The remainder of this derivation and the accompanying figures could not be recovered from the source.]
6.5 Multiple issue processors
Another way of looking at superscalar machines is as dynamic instruction schedulers: the hardware decides on the fly which instructions to execute in parallel, out of order, etc.
An alternative approach would be to get the compiler to do it beforehand - that is, to
statically schedule execution. This is the basic concept behind Very Long Instruction
Word, or VLIW machines.
VLIW machines have, as you may guess, very long instruction words - in which a
number of 'traditional' instructions can be packed. (Actually for more recent examples,
this is arguably not really true but it's a convenient mental model for now.) For example,
suppose we have a processor which has two integer operation units; a floating point unit;
a load/store unit; and a branch unit. An 'instruction' for such a machine would consist of
[up to] two integer operations, a floating-point operation, a load or store, and a branch. It is the compiler's responsibility to find the appropriate operations and pack them together into a very long instruction, which the hardware can then execute simultaneously without worrying about dependencies (because the compiler has already considered them).
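As a sketch only (the slot layout below is hypothetical, not the encoding of any real VLIW machine), the instruction word for the five-unit processor described above could be represented as:

/* Hypothetical VLIW instruction word for a machine with two integer units,
   one floating-point unit, one load/store unit and one branch unit.
   Unused slots are filled with no-ops by the compiler. */
struct vliw_word {
    unsigned int_slot[2];   /* up to two integer operations */
    unsigned fp_slot;       /* one floating-point operation */
    unsigned mem_slot;      /* one load or store            */
    unsigned branch_slot;   /* one branch                   */
};

/* All five slots of one vliw_word are issued in the same cycle; the compiler
   guarantees that the packed operations are mutually independent. */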
Pros and Cons
VLIW has both advantages and disadvantages. The main advantage is the saving in
hardware - the compiler now decides what can be executed in parallel, and the hardware
just does it. There is no need to check for dependencies or decide on scheduling - the
compiler has already resolved these issues. (Actually, as we shall see, this may not be
entirely true either.) This means that much more hardware can be devoted to useful
computation, bigger on-chip caches etc., meaning faster processors.
Not surprisingly, there are also disadvantages.
Compilers. First, obviously compilers will be harder to build. In fact, to get the
best out of current, dynamically scheduled superscalar processors it is necessary
for compilers to do a fair bit of code rearranging to 'second guess' the hardware,
so this technology is already developing.
Code Bigger. Secondly, programs will get bigger. If there are not enough instructions that can be done in parallel to fill all the available slots in an instruction (which will be the case most of the time), there will consequently be empty slots in instructions. It is likely that the majority of instructions, in typical applications, will have empty code slots, meaning wasted space and bigger code.
(It may well be the case that to ensure that all scheduling problems are resolved at
compiler time, we will need to put in some completely empty instructions.)
Memory and disk space is cheap - however, memory bandwidth is not. Even with large and efficient caches, we would prefer not to have to fetch large, half-empty instructions.
One Stalls, all Stall. Unfortunately, it is not possible at compile time to identify
all possible sources of pipeline stalls and their durations. For example, suppose a
memory access causes a cache miss, leading to a longer than expected stall. If
other, parallel, functional units are allowed to continue operating, sources of data
dependency may dynamically emerge. For example, consider two operations
which have an output dependency. The original scheduling by the compiler would
ensure that there is no consequent WAW hazard. However, if one stalls and the
other 'runs ahead', the dependency may turn into a WAW hazard. In order to get
the compiler to do all dependency resolution, it is required to stall all pipeline
elements together. This is another performance problem.
Hardware Shows Through. A significant issue is the break in the barrier between architecture and implementation which has existed since the IBM 360 in the early/mid 60s. It will be necessary for compilers to know exactly what the capabilities of the processor are - for example, how many functional units there are. A consequence is that code compiled for one generation of VLIW machines is not backward compatible with older, narrower implementations.
Load responses from a memory hierarchy which includes CPU caches and
DRAM do not give a deterministic delay of when the load response returns to the
processor. This makes static scheduling of load instructions by the compiler very
difficult.
A check load instruction also aids speculative loads by checking that a load was
not dependent on a previous store.
The EPIC architecture also includes a grab-bag of architectural concepts to increase ILP:
Delayed exceptions (using a Not-A-Thing bit within the general purpose registers)
also allow more speculative execution past possible exceptions.
Very large architectural register files avoid the need for register renaming.
The IA-64 architecture also added register rotation - a digital signal processing concept
useful for loop unrolling and software pipelining.
6.6 Summary
Vector supercomputers are not viable for cost reasons, but the vector instruction set architecture is still useful. Vector supercomputers are adapting commodity technology like SMT to improve their price-performance. Superscalar microprocessor designs have begun to absorb some of the techniques made popular in earlier vector computer systems (e.g., the Intel MMX extension). Vector processors are useful for embedded and multimedia applications, which require low power, small code size and high performance.
Vector processors vs. multiple-issue processors
Advantages of vector processors:
Good speedup on large scientific problems.
[The remainder of this comparison is not recoverable from the source.]
6.8 References
Advanced Computer Architecture, by Kai Hwang.
Computer Architecture: Pipelined and Parallel Processor Design, by Michael J. Flynn.