COME6102 Chapter 1 Introduction 2 of 2

Shared memory multiprocessors have memory that can be accessed by all processors, allowing independent processors to share resources. They are divided into UMA, where access times are uniform, and NUMA, where access times vary depending on memory location. Distributed memory systems require a communication network since processors have separate, non-shared memories and must explicitly communicate to access data on other processors. Vector processors efficiently perform same operations on all elements of a vector using pipelining. SIMD computers execute the same instruction on multiple data elements simultaneously using an array of identical processors controlled by a single control unit.


MULTIPROCESSORS AND MULTICOMPUTERS

Two categories of parallel computers are discussed below, namely shared (common) memory and unshared (distributed) memory.

Shared Memory Multiprocessors


Shared memory parallel computers vary widely, but generally have in common the ability for all
processors to access all memory as global address space. Therefore, multiple processors can operate
independently but share the same memory resources. Changes in a memory location effected by one
processor are visible to all other processors.

Shared memory machines can be divided into two main classes based upon memory access times, UMA and NUMA, with COMA as a special case of NUMA.

Uniform Memory Access (UMA)


UMA machines are most commonly represented today by Symmetric Multiprocessor (SMP) machines. They have identical processors with equal access and equal access times to memory, as depicted in Figure 1.9.

They are sometimes referred to as CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Non-Uniform Memory Access (NUMA)


Most often they are made by physically linking two or more SMPs. One SMP can directly access the memory of another SMP, but not all processors have equal access time to all memories: memory access across the link is slower. If cache coherency is maintained, the machine may also be called CC-NUMA (Cache Coherent NUMA).
The COMA model
The COMA model is a special case of NUMA machine in which the distributed main memories are
converted to caches. All caches form a global address space and there is no memory hierarchy at each
processor node.

Advantages:

• Global address space provides a user-friendly programming perspective to memory

• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:

• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access of global memory (see the sketch after this list).
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
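As a minimal sketch of this shared-memory model (not part of the original text; it assumes a C compiler with OpenMP support, e.g. compiled with -fopenmp), the fragment below shows several threads updating one location in the global address space. The atomic directive is an example of the synchronization construct the programmer must supply to ensure correct access to shared memory.

    /* Illustrative sketch: four threads share one address space. */
    #include <stdio.h>

    int main(void)
    {
        long counter = 0;                     /* lives in the shared address space */

        #pragma omp parallel num_threads(4)
        {
            for (int i = 0; i < 100000; i++) {
                #pragma omp atomic            /* without this, updates may be lost */
                counter++;
            }
        }

        printf("counter = %ld\n", counter);   /* expect 400000 */
        return 0;
    }

Without the atomic directive the threads' updates can interleave and overwrite one another, which is exactly the "correct access" problem referred to in the list above.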

Distributed Memory
Like shared memory systems, distributed memory systems vary widely but share a common
characteristic. Distributed memory systems require a communication network to connect inter-
processor memory as depicted in Figure 1.11.

Processors have their own local memory. Memory addresses in one processor do not map to another
processor, so there is no concept of global address space across all processors.
Because each processor has its own local memory, it operates independently and the changes it makes
to its local memory have no effect on the memory of other processors. Hence, the concept of cache
coherency does not apply.

When a processor needs access to data in another processor, it is usually the task of the programmer
to explicitly define how and when data is communicated. Synchronization between tasks is likewise
the programmer's responsibility.
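As a minimal illustration (not from the text; it assumes the standard MPI library and a launcher such as mpirun -np 2), the sketch below shows this explicit communication: process 0 holds the data in its own local memory and must send it before process 1 can use it, because there is no global address space to read it from.

    /* Illustrative message-passing sketch for two processes. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                     /* data in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }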

Modern multicomputers use hardware routers to pass messages. Based on the interconnection network, the routers and the channels used, multicomputers are divided into generations:

o 1st generation: based on board technology, using hypercube architecture and software-controlled message switching.
o 2nd generation: implemented with mesh-connected architecture, hardware message routing and a software environment for medium-grained distributed computing.
o 3rd generation: fine-grained multicomputers such as the MIT J-Machine.
The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.

Advantages:

• Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.

Disadvantages:

• The programmer is responsible for many of the details associated with data communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this memory organization.
• Non-uniform memory access (NUMA) times.

MULTIVECTOR AND SIMD COMPUTERS


A vector operand contains an ordered set of n elements, where n is called the length of the vector.
Each element in a vector is a scalar quantity, which may be a floating point number, an integer, a
logical value or a character.

A vector processor consists of a scalar processor and a vector unit, which could be thought of as an
independent functional unit capable of efficient vector operations.

Vector Hardware
Vector computers have hardware to perform the vector operations efficiently. Operands cannot be used directly from memory; rather, they are loaded into registers, and results are put back into registers after the operation. Vector hardware has the special ability to overlap, or pipeline, operand processing, as depicted in Figure 1.12.

Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of the function on different operands, and once the pipeline is full, a new result is produced each clock period (cp).

Pipelining
The pipeline is divided into individual segments, each of which is completely independent and involves no hardware sharing. This means the machine can be working on separate operands at the same time, which enables it to produce one result per clock period as soon as the pipeline is full. The same instruction is applied repeatedly using the pipeline technique, so the vector processor processes all the elements of a vector in exactly the same way. The pipeline segments an arithmetic operation, such as a floating-point multiply, into stages, passing the output of one stage to the next stage as input. The next pair of operands may enter the pipeline after the first stage has processed the previous pair, so the processing of a number of operands is carried out simultaneously.

The loading of a vector register is itself a pipelined operation, with the ability to load one element
each clock period after some initial startup overhead.
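As a brief illustrative sketch (the function name vmul is an assumption, not from the text), the loop below is the kind of operation a vector processor executes as a single vector instruction: the same multiply is applied to every element, and because the multiplier is pipelined, one result c[i] emerges per clock period once the pipeline is full.

    /* Illustrative sketch: an element-wise vector multiply. */
    void vmul(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)       /* same operation on every element */
            c[i] = a[i] * b[i];
    }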

SIMD Array Processors


Synchronous parallel architectures coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers. A synchronous array of parallel processors is called an array processor. These processors are composed of N identical processing elements (PEs) under the supervision of one control unit (CU).

The control unit is a computer with high-speed registers, local memory and an arithmetic logic unit. An array processor is basically a single instruction, multiple data (SIMD) computer. There are N data streams, one per processor, so different data can be used in each processor. Figure 1.13 below shows a typical SIMD or array processor.

These processors consist of a number of memory modules, which can be either global or dedicated to each processor; thus the main memory is the aggregate of the memory modules. The processing elements and memory units communicate with each other through an interconnection network. SIMD processors are especially designed for performing vector computations. SIMD has two basic architectural organizations:

a. Array processors using random access memory


b. Associative processors using content addressable memory.

All N identical processors operate under the control of a single instruction stream issued by a central control unit. Popular examples of this type of SIMD configuration are the ILLIAC IV, CM-2 and MP-1. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and local memory PEMi for the storage of distributed data.

The CU also has its own main memory for the storage of programs. The function of the CU is to decode the instructions and determine where the decoded instructions should be executed. The PEs perform the same function (the same instruction) synchronously, in a lockstep fashion, under command of the CU. In order to maintain synchronous operation, a global clock is used. Thus at each step, i.e., when the global clock pulse changes, all processors execute the same instruction, each on different data (single instruction, multiple data).

SIMD machines are particularly useful for solving problems that involve vector calculations, where one can easily exploit data parallelism: the same set of instructions is applied to all subsets of the data. For example, consider adding two vectors, each having N elements, on a SIMD machine with N/2 processing elements. The same addition instruction is issued to all N/2 processors and all processing elements execute it simultaneously, so it takes 2 steps to add the two vectors, compared to N steps on an SISD machine (see the sketch below). The distributed data can be loaded into the PEMs from an external source via the system bus, or in system broadcast mode via the control bus.
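A minimal sketch of this example (illustrative only; the names simd_add, a, b and c are assumptions): each pass of the outer loop models one instruction broadcast in lockstep by the control unit, and the inner loop models the N/2 PEs acting concurrently, so the whole addition completes in 2 steps rather than N.

    /* Illustrative sketch: adding two N-element vectors with N/2 PEs. */
    #define N 8

    void simd_add(const int a[N], const int b[N], int c[N])
    {
        for (int step = 0; step < 2; step++)          /* 2 broadcast instructions */
            for (int pe = 0; pe < N / 2; pe++) {      /* PEs act concurrently     */
                int i = step * (N / 2) + pe;
                c[i] = a[i] + b[i];
            }
    }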
Array processors can be classified into two categories depending on how the memory units are organized:

a. Dedicated memory organization


b. Global memory organization
A SIMD computer C is characterized by the following set of parameters:

C = <N, F, I, M>

where

N = the number of PEs in the system; for example, the ILLIAC IV has N = 64 and the BSP has N = 16.
F = a set of data-routing functions provided by the interconnection network.
I = the set of machine instructions for scalar, vector, data-routing and network manipulation operations.
M = the set of masking schemes, where each mask partitions the set of PEs into disjoint subsets of enabled PEs and disabled PEs.
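As an illustration of the masking parameter M (a hypothetical sketch; masked_add and its arguments are not from the text), a mask enables a subset of the PEs while the disabled PEs simply ignore the broadcast instruction during that cycle:

    /* Illustrative sketch: one masked SIMD instruction over 4 PEs. */
    #define NUM_PE 4

    void masked_add(const int a[NUM_PE], const int b[NUM_PE],
                    int c[NUM_PE], const int mask[NUM_PE])
    {
        for (int pe = 0; pe < NUM_PE; pe++)   /* every PE receives the instruction */
            if (mask[pe])                     /* but only enabled PEs execute it   */
                c[pe] = a[pe] + b[pe];
    }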

PRAM AND VLSI MODELS


PRAM model (Parallel Random Access Machine)
The PRAM is a theoretical model of parallel computation in which an arbitrary but finite number of processors can access any value in an arbitrarily large shared memory in a single time step. Processors may execute different instruction streams, but work synchronously. This model assumes a shared memory, multiprocessor machine as shown:

1. The machine size n can be arbitrarily large


2. The machine is synchronous at the instruction level. That is, each processor is executing its
own series of instructions, and the entire machine operates at a basic time step (cycle). Within
each cycle, each processor executes exactly one operation or does nothing, i.e. it is idle. An
instruction can be any random access machine instruction, such as: fetch some operands from
memory, perform an ALU operation on the data, and store the result back in memory.
3. All processors implicitly synchronize on each cycle and the synchronization overhead is
assumed to be zero. Communication is done through reading and writing of shared variables.
4. Memory access can be specified to be UMA, NUMA, EREW, CREW, or CRCW with a
defined conflict policy.

The PRAM model can apply to SIMD class machines if all processors execute identical instructions
on the same cycle, or to MIMD class machines if the processors are executing different instructions.
Load imbalance is the only form of overhead in the PRAM model.
The four most important variations of the PRAM are:

• EREW - Exclusive read, exclusive write; any memory location may only be accessed once in any one step. This forbids more than one processor from reading or writing the same memory cell simultaneously.
• CREW - Concurrent read, exclusive write; any memory location may be read any number of times during a single step, but only written to once, with the write taking place after the reads.
• ERCW - Exclusive read, concurrent write; this allows only exclusive reads but concurrent writes to the same memory location.
• CRCW - Concurrent read, concurrent write; any memory location may be written to or read from any number of times during a single step. A CRCW PRAM model must define some rule for resolving multiple writes, such as giving priority to the lowest-numbered processor or choosing amongst processors randomly.

The PRAM is popular because it is theoretically tractable and because it gives algorithm designers a common target. Nevertheless, PRAMs cannot be emulated optimally on all architectures.
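As a minimal sketch (illustrative only, not from the text), the following program simulates summing n = 8 values on a PRAM in log2(n) = 3 synchronous steps. Each pass of the outer loop stands for one PRAM cycle with its implicit zero-cost synchronization, and the access pattern is EREW because every memory cell is read and written by at most one processor per step.

    /* Illustrative sketch: tree-style parallel sum on a simulated PRAM. */
    #include <stdio.h>

    int main(void)
    {
        int x[8] = {3, 1, 4, 1, 5, 9, 2, 6};           /* shared memory */

        for (int stride = 1; stride < 8; stride *= 2)  /* one PRAM cycle per pass */
            for (int p = 0; p + stride < 8; p += 2 * stride)
                x[p] = x[p] + x[p + stride];           /* processor p's operation */

        printf("sum = %d\n", x[0]);                    /* prints 31 */
        return 0;
    }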

VLSI Model:
Parallel computers rely on the use of VLSI chips to fabricate the major components such as processor
arrays memory arrays and large scale switching networks. The rapid advent of very large scale
integrated (VSLI) technology now computer architects are trying to implement parallel algorithms
directly in hardware. An AT2 model is an example for two dimension VLSI chips.

Chapter 1 in a Nutshell


Computer architecture has gone through evolutionary, rather than revolutionary, change. Sustaining features are those that are proven to improve performance. Starting with the von Neumann architecture (strictly sequential), computer architectures have evolved to include lookahead processing, parallelism, and pipelining. A variety of parallel architectures were also discussed, such as SIMD, MIMD, associative processors, array processors, multicomputers and multiprocessors. The performance of a system is measured in terms of CPI and MIPS. It depends on the clock cycle time, say t, or equivalently the clock rate f = 1/t. If C is the total number of clock cycles needed to execute a given program, then the total CPU time can be estimated as

T = C * t = C / f

Other relationships are easily observed (where Ic is the total instruction count of the program):

CPI = C / Ic
T = Ic * CPI * t
T = Ic * CPI / f

Processor speed is often measured in terms of millions of instructions per second, frequently called
the MIPS rate of the processor. The multiprocessor architecture can be broadly classified as tightly
coupled multiprocessor and loosely coupled multiprocessor.
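As a quick worked example (the numbers are illustrative, not from the text): suppose a program executes Ic = 2 x 10^9 instructions on a processor with CPI = 2 and clock rate f = 1 GHz. Then

T = Ic * CPI / f = (2 x 10^9 * 2) / 10^9 = 4 seconds
MIPS rate = Ic / (T * 10^6) = 2 x 10^9 / (4 * 10^6) = 500 MIPS

which is the same as f / (CPI * 10^6).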

A tightly coupled multiprocessor is also called a UMA machine, for uniform memory access, because each CPU can access memory data in the same (uniform) amount of time. This is the true multiprocessor.

A loosely coupled multiprocessor is called a NUMA machine. Each of its node computers can access its local memory data at one (relatively fast) speed, and remote memory data at a much slower speed. The PRAM and VLSI models are advanced models used for designing and analyzing parallel architectures.

Key Words
multiprocessor A computer in which processors can execute separate instruction streams, but have
access to a single address space. Most multiprocessors are shared memory machines, constructed by
connecting several processors to one or more memory banks through a bus or switch.

multicomputer A computer in which processors can execute separate instruction streams, have their
own private memories and cannot directly access one another's memories. Most multicomputers are
disjoint memory machines, constructed by joining nodes (each containing a microprocessor and some
memory) via links.
MIMD Multiple Instruction, Multiple Data; a category of Flynn's taxonomy in which many
instruction streams are concurrently applied to multiple data sets. A MIMD architecture is one in
which heterogeneous processes may execute at different rates.

MIPS one Million Instructions Per Second. A performance rating usually referring to integer or non-floating-point instructions.

vector processor A computer designed to apply arithmetic operations to long vectors or arrays. Most vector processors rely heavily on pipelining to achieve high performance.

pipelining Overlapping the execution of two or more operations.
