COME6102 Chapter 1 Introduction 2 of 2
Two categories of parallel computers are discussed below, namely shared (common) memory and unshared (distributed) memory machines.
Shared Memory
Shared memory machines can be divided into three main classes based upon memory access times: UMA, NUMA and COMA.
UMA machines are sometimes referred to as CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
Advantages:
A global address space provides a user-friendly programming perspective to memory.
Data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs.
Disadvantages:
The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
The programmer is responsible for the synchronization constructs that ensure "correct" access of global memory (a minimal sketch follows this list).
Expense: it becomes increasingly difficult and expensive to design and produce shared
memory machines with ever increasing numbers of processors.
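As a minimal sketch of that synchronization burden (using Python's multiprocessing module purely for illustration; the counter and worker names are made up and not part of these notes), two processes increment a value in shared memory under an explicit lock so that their read-modify-write sequences cannot interleave:

# Programmer-managed synchronization on shared memory (illustrative only).
# Without the lock, the two workers could interleave their read-modify-write
# steps on the shared counter and lose updates.
from multiprocessing import Process, Value, Lock

def worker(counter, lock, n_increments):
    for _ in range(n_increments):
        with lock:                 # exclusive access to the shared location
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)        # an integer placed in shared memory
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock, 100000)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)           # 200000 with the lock; unpredictable without it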
Distributed Memory
Like shared memory systems, distributed memory systems vary widely but share a common characteristic: they require a communication network to connect inter-processor memory, as depicted in Figure 1.11.
Processors have their own local memory. Memory addresses in one processor do not map to another
processor, so there is no concept of global address space across all processors.
Because each processor has its own local memory, it operates independently and the changes it makes
to its local memory have no effect on the memory of other processors. Hence, the concept of cache
coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of the programmer
to explicitly define how and when data is communicated. Synchronization between tasks is likewise
the programmer's responsibility.
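To make that explicit communication concrete, here is a minimal sketch using mpi4py as one possible message-passing library (the library choice and the data being sent are illustrative assumptions, not part of these notes): process 0 decides what to send and when, and process 1 must post a matching receive.

# Explicit message passing between two processes with private memories.
# Run with, for example: mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = [1, 2, 3, 4]                # exists only in process 0's local memory
    comm.send(data, dest=1, tag=0)     # the programmer chooses what, where and when
elif rank == 1:
    data = comm.recv(source=0, tag=0)  # the matching receive completes the transfer
    print("rank 1 received", data)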
Modern multicomputers use hardware routers to pass messages. Based on the interconnection network, routers and channels used, multicomputers are divided into generations:
o 1st generation: based on board technology, using hypercube architecture and software-controlled message switching.
o 2nd generation: implemented with mesh-connected architecture, hardware message routing and a software environment for medium-grained distributed computing.
o 3rd generation: fine-grained multicomputers such as the MIT J-Machine.
The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
Advantages:
Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without the
overhead incurred with trying to maintain cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
The programmer is responsible for many of the details associated with data communication
between processors.
It may be difficult to map existing data structures, based on global memory, to this memory
organization.
Non-uniform memory access (NUMA) times: data residing on a remote node takes longer to access than node-local data.
Vector Processors
A vector processor consists of a scalar processor and a vector unit, which can be thought of as an independent functional unit capable of efficient vector operations.
Vector Hardware
Vector computers have hardware to perform the vector operations efficiently. Operands cannot be used directly from memory; rather, they are loaded into registers, and the results are put back into registers after the operation. Vector hardware has the special ability to overlap or pipeline operand processing, as depicted in Figure 1.12.
Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of the function on different operand(s), and once the pipeline is full, a new result is produced each clock period (cp).
Pipelining
The pipeline is divided up into individual segments, each of which is completely independent and
involves no hardware sharing. This implies that the machine can be working on separate operands at
the same time. This ability enables it to produce one result per clock period as soon as the pipeline is
full. The same instruction is followed repeatedly using the pipeline technique so the vector processor
processes all the elements of a vector in exactly the same way. The pipeline segments an arithmetic operation, such as a floating-point multiply, into stages, passing the output of one stage to the next stage as input. The next pair of operands may enter the pipeline after the first stage has processed
the previous pair of operands. The processing of a number of operands may be carried out
simultaneously.
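One way to quantify this overlap is a simple clock-period count: with an s-stage pipeline, the first result appears after s clock periods and each subsequent result one period later, so n operand pairs take roughly s + (n - 1) periods instead of s * n without pipelining. The sketch below uses hypothetical numbers (a 6-stage multiply pipeline and a 64-element vector) chosen only for illustration:

# Clock-period count for a fully segmented (pipelined) vector functional unit
# compared with an unpipelined unit performing the same n operations.
def pipelined_cycles(stages, n):
    # first result after 'stages' periods, then one new result per period
    return stages + (n - 1)

def unpipelined_cycles(stages, n):
    # each operation must drain the whole unit before the next can start
    return stages * n

s, n = 6, 64   # hypothetical 6-stage floating-point multiply, 64-element vector
print("pipelined:  ", pipelined_cycles(s, n), "clock periods")    # 69
print("unpipelined:", unpipelined_cycles(s, n), "clock periods")  # 384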
The loading of a vector register is itself a pipelined operation, with the ability to load one element
each clock period after some initial startup overhead.
SIMD Array Processors
An array processor is basically a single instruction, multiple data (SIMD) computer. Its control unit is a computer with high-speed registers, local memory and an arithmetic logic unit. There are N data streams, one per processor, so different data can be used in each processor. Figure 1.13 below shows a typical SIMD or array processor.
These processors consist of a number of memory modules which can be either global or dedicated to each processor. Thus the main memory is the aggregate of the memory modules. The processing elements and memory units communicate with each other through an interconnection network. SIMD processors are especially designed for performing vector computations. SIMD has two basic architectural organizations, depending on whether the memory modules are dedicated to the processing elements or shared globally.
All N identical processors operate under the control of a single instruction stream issued by a central control unit. Popular examples of this type of SIMD configuration are the ILLIAC IV, CM-2 and MP-1.
Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and local
memory PEMi for the storage of distributed data.
The CU also has its own main memory for the storage of programs. The function of the CU is to decode the instructions and determine where the decoded instructions should be executed. The PEs perform the same function (the same instruction) synchronously, in a lock-step fashion, under command of the CU. In order to maintain synchronous operation a global clock is used. Thus at each step, i.e., when the global clock pulse changes, all processors execute the same instruction, each on different data (single instruction, multiple data).
SIMD machines are particularly useful for solving problems involving vector calculations, where one can easily exploit data parallelism. In such calculations the same set of instructions is applied to all subsets of data. Suppose we add two vectors, each having N elements, on a SIMD machine with N/2 processing elements. The same addition instruction is issued to all N/2 processors and all processing elements execute the instruction simultaneously. It takes 2 steps to add the two vectors, as compared to N steps on a SISD machine. The distributed data can be loaded into the PEMs from an external source via the system bus, or via a system broadcast mode using the control bus.
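To make the N/2-processor example concrete, the following sketch emulates it with NumPy (ordinary sequential library code standing in for real SIMD hardware; the vector length N = 8 and the 4 processing elements are illustrative assumptions):

# Emulating vector addition on a SIMD machine with N/2 processing elements:
# N = 8 elements, N/2 = 4 PEs, each PE adds one element pair per step,
# so the whole addition completes in 2 lock-step steps instead of N = 8.
import numpy as np

a = np.arange(8)          # vector A, N = 8 elements
b = np.arange(8, 16)      # vector B
c = np.empty(8, dtype=a.dtype)

n_pe = 4                  # number of processing elements
for step in range(2):     # 2 steps instead of N steps on a SISD machine
    idx = np.arange(n_pe) + step * n_pe   # element handled by each PE this step
    c[idx] = a[idx] + b[idx]              # every PE executes the same add together

print(c)                  # [ 8 10 12 14 16 18 20 22]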
The array processor can be classified into two categories depending on how the memory units are organized: a dedicated (local) memory organization or a global memory organization. Formally, an SIMD computer can be characterized by the 4-tuple C = <N, F, I, M>, where:
N = the number of PEs in the system; for example, the ILLIAC IV has N = 64 and the BSP has N = 16.
F = a set of data-routing functions provided by the interconnection network.
I = the set of machine instructions for scalar, vector, data-routing and network-manipulation operations.
M = the set of masking schemes, where each mask partitions the set of PEs into disjoint subsets of enabled PEs and disabled PEs.
PRAM Model:
In the parallel random access machine (PRAM) model, a set of processors share a single global memory that any processor can access in unit time. The PRAM model can apply to SIMD-class machines if all processors execute identical instructions on the same cycle, or to MIMD-class machines if the processors execute different instructions.
Load imbalance is the only form of overhead in the PRAM model.
The four most important variations of the PRAM are:
EREW - Exclusive read, exclusive write; any memory location may only be accessed once in
any one step. This forbids more than one processor from reading or writing the same memory cell simultaneously.
CREW - Concurrent read, exclusive write; any memory location may be read any number of
times during a single step, but only written to once, with the write taking place after the reads.
ERCW - Exclusive read, concurrent write; any memory location may be written to concurrently by several processors during a single step, but may be read by only one.
CRCW - Concurrent read, concurrent write; any memory location may be written to or read
from any number of times during a single step. A CRCW PRAM model must define some
rule for resolving multiple writes, such as giving priority to the lowest-numbered processor
or choosing amongst processors randomly.
The PRAM is popular because it is theoretically tractable and because it gives algorithm designers a common target. Nevertheless, PRAMs cannot be emulated optimally on all architectures.
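As an illustration of one such CRCW resolution rule, the sketch below simulates a single write step in which conflicting writes to the same cell are resolved in favour of the lowest-numbered processor (the memory size and the particular writes are made-up values for illustration):

# Simulating one CRCW PRAM write step: several processors may target the same
# address; conflicts go to the lowest-numbered processor.
def crcw_priority_write(memory, writes):
    # writes: list of (processor_id, address, value) issued in the same step
    winners = {}
    for pid, addr, value in writes:
        # keep only the write from the lowest-numbered processor per address
        if addr not in winners or pid < winners[addr][0]:
            winners[addr] = (pid, value)
    for addr, (_, value) in winners.items():
        memory[addr] = value

memory = [0] * 4
crcw_priority_write(memory, [(3, 2, 30), (1, 2, 10), (2, 0, 99)])
print(memory)   # [99, 0, 10, 0] -- processor 1 wins the conflict at address 2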
VLSI Model:
Parallel computers rely on the use of VLSI chips to fabricate their major components, such as processor arrays, memory arrays and large-scale switching networks. With the rapid advent of very large scale integration (VLSI) technology, computer architects are now trying to implement parallel algorithms directly in hardware. The AT^2 model is an example of a complexity model for two-dimensional VLSI chips.
Processor speed is often measured in terms of millions of instructions per second, frequently called the MIPS rate of the processor. Multiprocessor architectures can be broadly classified as tightly coupled multiprocessors and loosely coupled multiprocessors.
A tightly coupled multiprocessor is also called a UMA (uniform memory access) machine, because each CPU can access memory data in the same (uniform) amount of time. This is the true multiprocessor. A loosely coupled multiprocessor is called a NUMA (non-uniform memory access) machine: each of its node computers can access its local memory data at one (relatively fast) speed, and remote memory data at a much slower speed.
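A small worked example of the difference (with hypothetical latencies of 100 ns for local and 300 ns for remote accesses, chosen only for illustration): on a NUMA node the effective access time depends on how many references stay local, whereas on a UMA machine every reference costs the same.

# Effective memory access time on a NUMA node for a given fraction of
# node-local references, using hypothetical latencies.
def effective_access_ns(local_fraction, local_ns=100.0, remote_ns=300.0):
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

print(effective_access_ns(0.9))   # 120.0 ns when 90% of accesses are node-local
print(effective_access_ns(0.5))   # 200.0 ns when half of the accesses are remote
# A UMA machine would see the same (uniform) cost, e.g. 100 ns, for every access.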
PRAM and VLSI are the advanced models used for designing and analyzing parallel architectures.
Key Words
multiprocessor A computer in which processors can execute separate instruction streams, but have
access to a single address space. Most multiprocessors are shared memory machines, constructed by
connecting several processors to one or more memory banks through a bus or switch.
multicomputer A computer in which processors can execute separate instruction streams, have their
own private memories and cannot directly access one another's memories. Most multicomputers are
disjoint memory machines, constructed by joining nodes (each containing a microprocessor and some
memory) via links.
MIMD Multiple Instruction, Multiple Data; a category of Flynn's taxonomy in which many
instruction streams are concurrently applied to multiple data sets. A MIMD architecture is one in
which heterogeneous processes may execute at different rates.
MIPS one Million Instructions Per Second. A performance rating usually referring to integer or non-floating-point instructions.
vector processor A computer designed to apply arithmetic operations to long vectors or arrays. Most vector processors rely heavily on pipelining to achieve high performance.
pipelining Overlapping the execution of two or more operations.