Module 1, Chapter 2

Chapter 2 discusses program and network properties essential for parallel computing, including conditions of parallelism, data dependencies, and scheduling. It outlines various types of data and control dependencies, Bernstein's conditions for parallel execution, and the mismatch between software and hardware parallelism. Additionally, it covers program partitioning, communication latencies, and flow mechanisms in conventional and dataflow architectures.

Chapter 2

Program and Network Properties
Program and Network Properties

• Conditions of parallelism
• Program partitioning and scheduling
• Program flow mechanisms
• System interconnect architectures
Conditions of Parallelism
The exploitation of parallelism in computing requires
understanding the basic theory associated with it.
Progress is needed in several areas:
• computation models for parallel computing
• interprocessor communication in parallel architectures
• integration of parallel systems into general environments
Data and Resource Dependences

Data dependences

The ordering relationship between statements is indicated by
the data dependence.
• Flow dependence
• Anti dependence
• Output dependence
• I/O dependence
• Unknown dependence
Data Dependence - 1

• Flow dependence: S1 precedes S2, and at least one output
  of S1 is input to S2.
• Anti dependence: S1 precedes S2, and the
output of S2 overlaps the input to S1.
• Output dependence: S1 and S2 write to the
same output variable.
• I/O dependence: two I/O statements
(read/write) reference the same variable, and/or
the same file.
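
As an informal illustration (not part of the original slides), the first three dependence types can be seen in a short straight-line Python fragment:

b, c, e, f, g = 1, 2, 3, 4, 5   # arbitrary values, for illustration only
a = b + c      # S1
d = a * 2      # S2: flow dependence on S1 (S2 reads the 'a' written by S1)
a = e - 1      # S3: anti dependence with S2 (S3 overwrites 'a' after S2 reads it)
a = f + g      # S4: output dependence with S3 (S3 and S4 write the same variable 'a')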
Data Dependence - 2

• Unknown dependence:
– The subscript of a variable is itself subscripted.
– The subscript does not contain the loop index variable.
– A variable appears more than once with subscripts
having different coefficients of the loop variable (that is,
different functions of the loop variable).
– The subscript is nonlinear in the loop index variable.

• Parallel execution of program segments which do not have
  total data independence can produce non-deterministic
  results.
Data dependence example

S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1

(The accompanying dependence graph over S1–S4 is omitted.)
I/O dependence example

S1: Read (4), A(I)
S2: Rewind (4)
S3: Write (4), B(I)
S4: Rewind (4)

(S1 and S3 are I/O-dependent since both reference the same
file, unit 4; graph omitted.)
Control dependence

• The order of execution of statements cannot be determined
  before run time
  o Conditional branches
  o Successive operations of a looping procedure
Control dependence examples

Do 20 I = 1, N
  A(I) = C(I)
  IF (A(I) .LT. 0) A(I) = 1
20 Continue

Do 10 I = 1, N
  IF (A(I-1) .EQ. 0) A(I) = 0
10 Continue
Resource dependence

• Concerned with the conflicts in using shared resources
o Integer units
o Floating-point units
o Registers
o Memory areas
o ALU
o Workplace storage
Bernstein’s conditions

• Set of conditions for two processes to execute in parallel:

  I1 ∩ O2 = Ø
  I2 ∩ O1 = Ø
  O1 ∩ O2 = Ø
Bernstein’s Conditions - 2

• In terms of data dependences, Bernstein's conditions imply
  that two processes can execute in parallel if they are
  flow-independent, anti-independent, and output-independent.

• The parallelism relation || is commutative (Pi || Pj
  implies Pj || Pi), but not transitive (Pi || Pj and Pj || Pk
  does not imply Pi || Pk). Therefore, || is not an
  equivalence relation.
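
A minimal sketch (in Python, not from the text) of testing Bernstein's conditions for two processes given their input and output sets; the sets below are taken from statements P1 and P2 of the example that follows:

# Two processes can run in parallel (P1 || P2) when
# I1 ∩ O2 = Ø, I2 ∩ O1 = Ø, and O1 ∩ O2 = Ø.

def bernstein_parallel(I1, O1, I2, O2):
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

I1, O1 = {"D", "E"}, {"C"}      # P1: C = D x E
I2, O2 = {"G", "C"}, {"M"}      # P2: M = G + C
print(bernstein_parallel(I1, O1, I2, O2))   # False, since I2 ∩ O1 = {"C"}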
Utilizing Bernstein’s conditions

P1: C = D x E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E

(The dependence graph among P1–P5 is omitted.)
Hardware parallelism

• A function of cost and performance tradeoffs
• Displays the resource utilization patterns of
  simultaneously executable operations
• A processor that issues k instructions per machine cycle
  is called a k-issue processor
• A multiprocessor system with n k-issue processors
should be able to handle a maximum number of nk
threads of instructions simultaneously
Software parallelism

• Defined by the control and data dependences of programs
• A function of the algorithm, programming style, and
  compiler optimization
• The program flow graph displays the patterns of
  simultaneously executable operations
Mismatch between software and hardware
parallelism - 1

Maximum software parallelism (L = load, X/+/- = arithmetic):
  Cycle 1: L1 L2 L3 L4
  Cycle 2: X1 X2
  Cycle 3: +  -      (producing results A and B)
Mismatch between software and hardware
parallelism - 2
Same problem, but considering the parallelism on a two-issue
superscalar processor:
  Cycle 1: L1
  Cycle 2: L2
  Cycle 3: X1 L3
  Cycle 4: L4
  Cycle 5: X2
  Cycle 6: +
  Cycle 7: -         (producing results A and B)
Mismatch between software and hardware
parallelism - 3
Same problem, with two single-issue processors
(S1, S2 and L5, L6 are inserted for synchronization):
  Cycle 1: L1 L3
  Cycle 2: L2 L4
  Cycle 3: X1 X2
  Cycle 4: S1 S2
  Cycle 5: L5 L6
  Cycle 6: +  -      (producing results A and B)
Software parallelism

• Control parallelism – allows two or more operations to be
  performed concurrently
  o Pipelining, multiple functional units

• Data parallelism – almost the same operation is performed
  over many data elements by many processors concurrently
  o Code is easier to write and debug
Types of Software Parallelism

• Control parallelism – two or more operations can be
  performed simultaneously. This can be detected by a
  compiler, or a programmer can explicitly indicate control
  parallelism by using special language constructs or by
  dividing a program into multiple processes.
• Data parallelism – multiple data elements have the same
  operations applied to them at the same time. This offers
  the highest potential for concurrency (in SIMD and MIMD
  modes). Synchronization in SIMD machines is handled by
  hardware.
Solving the Mismatch Problems

• Develop compilation support
• Redesign hardware for more efficient exploitation by
  compilers
• Use large register files and sustained instruction
  pipelining
• Have the compiler fill the branch and load delay slots in
  code generated for RISC processors
The Role of Compilers

• Compilers are used to exploit hardware features to improve
  performance.
• Interaction between compiler and architecture design is a
  necessity in modern computer development.
• It is not necessarily the case that more software
  parallelism will improve performance in conventional
  scalar processors.
• The hardware and the compiler should be designed at the
  same time.
Program Partitioning & Scheduling

• The size of the parts or pieces of a program that can be
  considered for parallel execution can vary.
• The sizes are roughly classified using the term
“granule size,” or simply “granularity.”
• The simplest measure, for example, is the
number of instructions in a program part.
• Grain sizes are usually described as fine, medium
or coarse, depending on the level of parallelism
involved.
Latency

• Latency is the time required for communication between
  different subsystems in a computer.
• Memory latency, for example, is the time required by a
  processor to access memory.
• Synchronization latency is the time required for two
  processes to synchronize their execution.
• Computational granularity and communication latency are
  closely related.
Levels of Parallelism

(Figure: a hierarchy of grain sizes, annotated with two trends
across the levels: "increasing communication demand and
scheduling overhead" and "higher degree of parallelism".)

• Jobs or programs (coarse grain)
• Subprograms, job steps, or related parts of a program
  (coarse grain)
• Procedures, subroutines, tasks, or coroutines (medium grain)
• Non-recursive loops or unfolded iterations (fine grain)
• Instructions or statements (fine grain)
Instruction Level Parallelism

• This fine-grained, or smallest-granularity, level typically
  involves fewer than 20 instructions per grain. The number
  of candidates for parallel execution varies from 2 to
  thousands, with about five instructions or statements on
  average.
• Advantages:
– There are usually many candidates for parallel
execution
Loop-level Parallelism

• A typical loop has fewer than 500 instructions.
• If a loop operation is independent between iterations, it
  can be handled by a pipeline or by a SIMD machine.
• Loops are the most optimized program construct to execute
  on a parallel or vector machine.
• Some loops (e.g. recursive) are difficult to handle.
• Loop-level parallelism is still considered fine-grain
  computation.
Procedure-level Parallelism

• Medium-sized grain; usually fewer than 2000 instructions.
• Detection of parallelism is more difficult than with
  smaller grains; interprocedural dependence analysis is
  difficult and history-sensitive.
• The communication requirement is less than at the
  instruction level.
• SPMD (single procedure multiple data) is a special case.
• Multitasking belongs to this level.
Subprogram-level Parallelism

• Job step level; a grain typically has thousands of
  instructions; medium- or coarse-grain level.
• Job steps can overlap across different jobs.
• Multiprogramming is conducted at this level.
• No compilers are currently available to exploit medium- or
  coarse-grain parallelism.
Job or Program-Level Parallelism

• Corresponds to the execution of essentially independent
  jobs or programs on a parallel computer.
• This is practical for a machine with a small number of
  powerful processors, but impractical for a machine with a
  large number of simple processors (since each processor
  would take too long to process a single job).
Communication Latency

• Balancing granularity and latency can yield better
  performance.
• Various latencies are attributed to machine architecture,
  technology, and the communication patterns used.
• Latency imposes a limiting factor on machine
scalability. Ex. Memory latency increases as
memory capacity increases, limiting the amount
of memory that can be used with a given
tolerance for communication latency.
Interprocessor Communication Latency

• Needs to be minimized by the system designer
• Affected by signal delays and communication patterns
• Ex. n communicating tasks may require n(n - 1)/2
  communication links; this complexity grows quadratically,
  effectively limiting the number of processors in the
  system.
Communication Patterns

• Determined by the algorithms used and the architectural
  support provided
• Patterns include
  – permutations
  – broadcast
  – multicast
  – conference
• Tradeoffs often exist between the granularity of
  parallelism and the communication demand.
Grain Packing and Scheduling

• Two questions:
  – How can I partition a program into parallel "pieces" to
    yield the shortest execution time?
  – What is the optimal size of parallel grains?
• There is an obvious tradeoff between the time spent
  scheduling and synchronizing parallel grains and the
  speedup obtained by parallel execution.
• One approach to the problem is called "grain packing."
Program Graphs and Packing

• A program graph is similar to a dependence graph
  – Nodes = { (n,s) }, where n = node name, s = size (larger
    s = larger grain size).
  – Edges = { (v,d) }, where v = variable being
    "communicated," and d = communication delay.
• Packing two (or more) nodes produces a node with a larger
  grain size and possibly more edges to other nodes.
• Packing is done to eliminate unnecessary communication
  delays or reduce overall scheduling overhead.
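
A small illustrative sketch (node names, sizes, and delays are invented, not the textbook example) of representing a program graph as described above and packing two nodes into a coarser grain:

nodes = {"A": 4, "B": 1, "C": 1, "D": 2}   # node name -> grain size s
edges = {("A", "B"): ("a", 8),             # (src, dst) -> (variable v, delay d)
         ("A", "C"): ("a", 8),
         ("B", "D"): ("b", 1),
         ("C", "D"): ("c", 1)}

def pack(n1, n2, merged):
    """Merge nodes n1 and n2 into one coarser-grain node named `merged`."""
    nodes[merged] = nodes.pop(n1) + nodes.pop(n2)      # larger grain size
    for (src, dst), (v, d) in list(edges.items()):
        if {src, dst} <= {n1, n2}:
            del edges[(src, dst)]                      # edge becomes internal
        elif src in (n1, n2) or dst in (n1, n2):
            del edges[(src, dst)]                      # re-attach edge to merged node
            edges[(merged if src in (n1, n2) else src,
                   merged if dst in (n1, n2) else dst)] = (v, d)

pack("A", "B", "AB")
print(nodes)   # "AB" now has grain size 5
print(edges)   # the A-B edge is gone; remaining edges attach to "AB"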
Example 2.5

• Example 2.5 illustrates a matrix multiplication program
  requiring 8 multiplications and 7 additions.
• Using various approaches, the program requires:
  – 212 cycles (software parallelism only)
  – 864 cycles (sequential program on one processor)
  – 741 cycles (8 processors) - speedup = 1.16
  – 446 cycles (4 processors) - speedup = 1.94
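
The quoted speedups are simply the sequential cycle count divided by the parallel cycle count; a quick check using the numbers above:

sequential = 864
for processors, cycles in [(8, 741), (4, 446)]:
    print(processors, "processors: speedup =", round(sequential / cycles, 2))
# 8 processors: speedup = 1.17 (the slide quotes 1.16, a truncated value)
# 4 processors: speedup = 1.94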
Scheduling

• A schedule is a mapping of nodes to processors and start
  times such that communication delay requirements are
  observed, and no two nodes are executing on the same
  processor at the same time.
• Some general scheduling goals:
  – Schedule all fine-grain activities in a node to the same
    processor to minimize communication delays.
  – Select grain sizes for packing to achieve better
    schedules for a particular parallel machine.
Static multiprocessor scheduling

• Grain packing may not be optimal
• Dynamic multiprocessor scheduling is an NP-hard problem
• Node duplication is a static scheme for multiprocessor
  scheduling
Node duplication

• Duplicate some nodes to eliminate idle time and reduce
  communication delays
• Grain packing and node duplication are often used jointly
  to determine the best grain size and the corresponding
  schedule
Schedule without node duplication

(Figure: two-processor Gantt chart scheduling nodes A–E
without node duplication; omitted.)
Schedule with node duplication

(Figure: two-processor Gantt chart in which nodes A and C are
duplicated on both processors, yielding a shorter schedule;
omitted.)
Grain determination and scheduling
optimization

Step 1: Construct a fine-grain program graph
Step 2: Schedule the fine-grain computation
Step 3: Perform grain packing to produce coarse grains
Step 4: Generate a parallel schedule based on the packed
        graph
Program Flow Mechanisms

• Conventional machines use a control flow mechanism in
  which the order of program execution is explicitly stated
  in the user program.
• Dataflow machines execute instructions as soon as their
  operands become available.
• Reduction machines trigger an instruction's execution
  based on the demand for its results.
Control Flow vs. Data Flow

• Control flow machines use shared memory for instructions
  and data. Since variables are updated by many
  instructions, there may be side effects on other
  instructions. These side effects frequently prevent
  parallel processing. Single-processor systems are
  inherently sequential.

• Instructions in dataflow machines are unordered and can be
  executed as soon as their operands are available; data is
  held in the instructions themselves. Data tokens are
  passed from an instruction to its dependents to trigger
  execution.
Data Flow Features

• No need for
– shared memory
– program counter
– control sequencer

• Special mechanisms are required to
  – detect data availability
  – match data tokens with instructions needing them
  – enable a chain reaction of asynchronous instruction
    execution
A Dataflow Architecture - 1

• The Arvind machine (MIT) has N PEs and an N-by-N
  interconnection network.
• Each PE has a token-matching mechanism that dispatches
  only instructions with data tokens available.
• Each datum is tagged with
  – the address of the instruction to which it belongs
  – the context in which the instruction is being executed
• Tagged tokens enter a PE through the local path
  (pipelined), and can also be communicated to other PEs
  through the routing network.
A Dataflow Architecture - 2

• Instruction addresses effectively replace the program
  counter in a control flow machine.
• The context identifier effectively replaces the frame base
  register in a control flow machine.
• Since the dataflow machine matches the data tags from one
  instruction with its successors, synchronized instruction
  execution is implicit.
A Dataflow Architecture - 3

• An I-structure in each PE is provided to eliminate
  excessive copying of data structures.
• Each word of the I-structure has a two-bit tag indicating
  whether the value is empty, full, or has pending read
  requests.
• This is a retreat from the pure dataflow approach.
• Special compiler technology is needed for dataflow
  machines.
FIRST INTERNAL PORTION
Demand-Driven Mechanisms

• Data-driven machines select instructions for execution
  based on the availability of their operands; this is
  essentially a bottom-up approach.

• Demand-driven machines take a top-down approach,
  attempting to execute the instruction (a demander) that
  yields the final result. This triggers the execution of
  instructions that yield its operands, and so forth.

• The demand-driven approach matches naturally with
  functional programming languages (e.g. LISP and SCHEME).
Reduction Machine Models

• String-reduction model:
– each demander gets a separate copy of the expression
string to evaluate
– each reduction step has an operator and embedded
reference to demand the corresponding operands
– each operator is suspended while arguments are evaluated
• Graph-reduction model:
– expression graph reduced by evaluation of branches or
subgraphs, possibly in parallel, with demanders given
pointers to results of reductions.
– based on sharing of pointers to arguments; traversal and
reversal of pointers continues until constant arguments
are encountered.
System Interconnect Architectures

• Direct networks for static connections
• Indirect networks for dynamic connections
• Networks are used for
  – internal connections in a centralized system among
    • processors
    • memory modules
    • I/O disk arrays
  – distributed networking of multicomputer nodes
Goals and Analysis

• The goals of an interconnection network are to provide
  – low latency
  – high data transfer rate
  – wide communication bandwidth
• Analysis includes
  – latency
  – bisection bandwidth
  – data-routing functions
  – scalability of the parallel architecture
Network Properties and Routing

• Static networks: point-to-point direct connections that
  will not change during program execution
• Dynamic networks:
  – switched channels dynamically configured to match user
    program communication demands
  – include buses, crossbar switches, and multistage
    networks
• Both network types are also used for inter-PE data routing
  in SIMD computers
Network Parameters

• Network size: the number of nodes in the graph used to
  represent the network
• Node degree d: the number of edges incident on a node
  (the sum of the in-degree and out-degree)
• Network diameter D: the maximum shortest path between any
  two nodes
Network Parameters (cont.)

• Bisection width:
  o Channel bisection width b: the minimum number of edges
    along a cut that divides the network into two equal
    halves
  o Each channel consists of w bit wires
  o Wire bisection width B = b x w; B reflects the wiring
    density of the network and is a good indicator of the
    maximum communication bandwidth along the bisection of
    the network
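
As an illustrative sketch (not from the text), these parameters can be computed for a simple topology; here a bidirectional ring of n nodes, whose degree is 2, diameter is ⌊n/2⌋, and channel bisection width is b = 2 (a cut across the ring severs two links). The channel width w is an assumed value.

def ring_parameters(n, w=16):
    """Return (degree, diameter, b, B) for an n-node bidirectional ring
    with w-bit-wide channels."""
    degree = 2
    diameter = n // 2          # floor(n/2)
    b = 2                      # channel bisection width
    B = b * w                  # wire bisection width
    return degree, diameter, b, B

print(ring_parameters(16))     # (2, 8, 2, 32)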
Data Routing Functions

• Shifting
• Rotating
• Permutation (one to one)
• Broadcast (one to all)
• Multicast (many to many)
• Personalized broadcast (one to many)
• Shuffle
• Exchange

These routing functions can be implemented on ring, mesh,
hypercube, or multistage networks.
Permutations

• For n objects there are n! permutations by which the n
  objects can be reordered. The set of all permutations
  forms a permutation group with respect to a composition
  operation. Cycle notation can be used to specify a
  permutation operation.
• The permutation p = (a, b, c)(d, e) means a -> b, b -> c,
  c -> a, d -> e, and e -> d in a circular fashion. The
  cycle (a, b, c) has a period of 3, and the cycle (d, e)
  has a period of 2, so p has a period of 2 x 3 = 6.
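
A short Python sketch (illustrative only) of applying a permutation given in cycle notation, using the p = (a, b, c)(d, e) example above; the period of p is the least common multiple of the cycle lengths:

from math import lcm

def permutation_from_cycles(cycles):
    """Build a mapping from a list of cycles."""
    p = {}
    for cycle in cycles:
        for i, x in enumerate(cycle):
            p[x] = cycle[(i + 1) % len(cycle)]   # each element maps to the next
    return p

p = permutation_from_cycles([("a", "b", "c"), ("d", "e")])
print(p)             # {'a': 'b', 'b': 'c', 'c': 'a', 'd': 'e', 'e': 'd'}
print(lcm(3, 2))     # 6, the period of p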
Permutations (cont.)

• Can be implemented using crossbar switches, multistage
  networks, or shifting or broadcast operations.
• Permutation capability is an indication of a network's
  data routing capabilities.
Perfect Shuffle and Exchange

• Stone suggested the special permutation that reorders
  entries according to the mapping of the k-bit binary
  number a b ... k to b c ... k a (that is, shifting 1 bit
  to the left and wrapping it around to the least
  significant bit position).
• The inverse perfect shuffle reverses the effect of the
  perfect shuffle.
Perfect Shuffle

• Special permutation function
• n = 2^k objects; each object's representation requires k
  bits
• The perfect shuffle maps x to y where:
  o x = ( xk-1, ..., x1, x0 )
  o y = ( xk-2, ..., x1, x0, xk-1 )
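
A minimal Python sketch of the perfect shuffle on n = 2^k objects: rotate the k-bit address left by one position, so the most significant bit wraps around to the least significant position.

def perfect_shuffle(x, k):
    """Perfect-shuffle destination of address x among n = 2**k objects."""
    msb = (x >> (k - 1)) & 1
    return ((x << 1) & ((1 << k) - 1)) | msb

k = 3                                         # n = 8 objects, 3-bit addresses
print([perfect_shuffle(x, k) for x in range(1 << k)])
# [0, 2, 4, 6, 1, 3, 5, 7]   e.g. 001 -> 010 and 100 -> 001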
Hypercube Routing Functions

• If the vertices of an n-dimensional cube are labeled with
  n-bit numbers so that only one bit differs between each
  pair of adjacent vertices, then n routing functions are
  defined by the bits in the node (vertex) address.
• For example, with a 3-dimensional cube, we can easily
  identify routing functions that exchange data between
  nodes with addresses that differ in the least significant,
  most significant, or middle bit.
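
A small sketch of the n hypercube routing functions described above: routing function i pairs each node with the neighbour whose address differs only in bit i.

def hypercube_route(node, i):
    """Exchange partner of `node` along dimension i (flip bit i)."""
    return node ^ (1 << i)

n = 3                                          # 3-dimensional cube, nodes 0..7
for i in range(n):
    print("dimension", i, ":", [(x, hypercube_route(x, i)) for x in range(2 ** n)])
# dimension 0 exchanges nodes differing in the least significant bit (0-1, 2-3, ...)
# dimension 2 exchanges nodes differing in the most significant bit (0-4, 1-5, ...)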
Broadcast and Multicast

• Broadcast: one-to-all mapping
• Multicast: one subset to another subset
• Personalized broadcast: personalized messages sent to only
  selected receivers
Factors Affecting Performance

• Functionality – how the network supports data routing,
  interrupt handling, synchronization, request/message
  combining, and coherence
• Network latency – worst-case time for a unit message to be
  transferred
• Bandwidth – maximum data rate
• Hardware complexity – implementation costs for wires,
  logic, switches, connectors, etc.
• Scalability – how easily does the scheme adapt to an
  increasing number of processors, memories, etc.?
Static Networks

• Linear Array
• Ring and Chordal Ring
• Barrel Shifter
• Tree and Star
• Fat Tree
• Mesh and Torus
Static Networks – Linear Array

• n nodes connected by n - 1 links (not a bus); segments
  between different pairs of nodes can be used in parallel.
• Internal nodes have degree 2; end nodes have degree 1.
• Diameter = n - 1
• Bisection width = 1
• For small n this is economical, but for large n it is
  obviously inappropriate.
Static Networks – Ring, Chordal Ring

• Like a linear array, but the two end nodes are connected
  by an nth link; the ring can be uni- or bidirectional.
  The diameter is ⌊n/2⌋ for a bidirectional ring, or n for
  a unidirectional ring.
• By adding additional links (e.g. "chords" in a circle),
  the node degree is increased, and we obtain a chordal
  ring. This reduces the network diameter.
• In the limit, we obtain a fully connected network, with a
  node degree of n - 1 and a diameter of 1.
Static Networks – Barrel Shifter

• Like a ring, but with additional links between all pairs
  of nodes whose distance is a power of 2.
• With a network of size N = 2^n, each node has degree
  d = 2n - 1, and the network has diameter D = n/2.
• Barrel shifter connectivity is greater than that of any
  chordal ring of lower node degree.
• The barrel shifter is much less complex than a fully
  interconnected network.
Static Networks – Tree and Star

• A k-level completely balanced binary tree has N = 2^k - 1
  nodes, a maximum node degree of 3, and a network diameter
  of 2(k - 1).
• The balanced binary tree is scalable, since it has a
  constant maximum node degree.
• A star is a two-level tree with a node degree d = N - 1
  and a constant diameter of 2.
Static Networks – Fat Tree

• A fat tree is a tree in which the number of edges between
  nodes increases closer to the root (similar to the way the
  thickness of limbs increases in a real tree as we get
  closer to the root).
• The edges represent communication channels ("wires"), and
  since communication traffic increases as the root is
  approached, it seems logical to increase the number of
  channels there.
Static Networks – Mesh and Torus

• Pure mesh – N = n^k nodes, with links between each
  adjacent pair of nodes in a row or column (or higher
  dimension). This is not a symmetric network; the interior
  node degree is d = 2k, and the diameter is k(n - 1).

• Illiac mesh (used in the Illiac IV computer) – wraparound
  is allowed, reducing the network diameter to about half
  that of the equivalent pure mesh.

• A torus has ring connections in each dimension, and is
  symmetric. An n x n binary torus has a node degree of 4
  and a diameter of 2⌊n/2⌋.
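
A quick sketch evaluating the diameter formulas quoted above for a k-dimensional mesh with n nodes per dimension and for an n x n torus:

def mesh_diameter(k, n):
    return k * (n - 1)          # pure mesh: k(n - 1)

def torus_diameter(n):
    return 2 * (n // 2)         # n x n torus: 2 * floor(n/2)

print(mesh_diameter(2, 4))      # 6 for a 4 x 4 mesh
print(torus_diameter(4))        # 4 for a 4 x 4 torus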
Static Networks – Systolic Array

• A systolic array is an arrangement of processing elements
  and communication links designed specifically to match the
  computation and communication requirements of a specific
  algorithm (or class of algorithms).
• This specialized character may yield better performance
  than more generalized structures, but it also makes such
  arrays more expensive and more difficult to program.
Static Networks – Hypercubes

• A binary n-cube architecture has N = 2^n nodes spanning n
  dimensions, with two nodes per dimension.
• Hypercube scalability is poor, and packaging is difficult
  for higher-dimensional hypercubes.
Static Networks – Cube-connected Cycles

• k-cube-connected cycles (CCC) can be created from a k-cube
  by replacing each vertex of the k-dimensional hypercube
  with a ring of k nodes.
• A k-cube can thus be transformed into a k-CCC with k x 2^k
  nodes.
• The major advantage of a CCC is that each node has a
  constant degree (though with longer latency than in the
  corresponding k-cube). In that respect, it is more
  scalable than the hypercube architecture.
Static Networks – k-ary n-Cubes

• Rings, meshes, tori, binary n-cubes, and Omega networks
  (to be seen) are topologically isomorphic to a family of
  k-ary n-cube networks.
• n is the dimension of the cube, and k is the radix, or
  number of nodes in each dimension.
• The number of nodes in the network is N = k^n.
• Folding (alternating nodes between connections) can be
  used to avoid the long "end-around" delays in the
  traditional implementation.
Static Networks – k-ary n-Cubes

• The cost of k-ary n-cubes is dominated by the amount of
  wire, not the number of switches.
• With a constant wire bisection, low-dimensional networks
  with wider channels provide lower latency, less
  contention, and higher "hot-spot" throughput than
  higher-dimensional networks with narrower channels.
Network Throughput

• Network throughput – the number of messages a network can
  handle in a unit time interval.

• One way to estimate throughput is to calculate the maximum
  number of messages that can be present in the network at
  any instant (its capacity); throughput is usually some
  fraction of the capacity.

• A hot spot is a pair of nodes that accounts for a
  disproportionately large portion of the total network
  traffic (possibly causing congestion).

• Hot-spot throughput is the maximum rate at which messages
  can be sent between two specific nodes.
Minimizing Latency

• Latency is minimized when the network radix k and
  dimension n are chosen so as to make the components of
  latency due to distance (number of hops) and to the
  message aspect ratio L / W (message length L divided by
  channel width W) approximately equal.
• This occurs at a very low dimension. For up to 1024 nodes,
  the best dimension (in this respect) is 2.
Dynamic Connection Networks

• Dynamic connection networks can implement all
  communication patterns based on program demands.

• In increasing order of cost and performance, these include
  – bus systems
  – multistage interconnection networks
  – crossbar switch networks

• Price can be attributed to the cost of wires, switches,
  arbiters, and connectors.

• Performance is indicated by network bandwidth, data
  transfer rate, network latency, and the communication
  patterns supported.
Dynamic Networks – Bus Systems

• A bus system (contention bus, time-sharing bus) has
  – a collection of wires and connectors
  – multiple modules (processors, memories, peripherals,
    etc.) which connect to the wires
  – data transactions between pairs of modules

• A bus supports only one transaction at a time.

• Bus arbitration logic must deal with conflicting requests.

• Buses have the lowest cost and bandwidth of all dynamic
  schemes.

• Many bus standards are available.


Dynamic Networks – Switch Modules

• An a x b switch module has a inputs and b outputs. A
  binary switch has a = b = 2.
• It is not necessary that a = b, but usually a = b = 2^k
  for some integer k.
• In general, any input can be connected to one or more of
  the outputs. However, multiple inputs may not be connected
  to the same output.
• When only one-to-one mappings are allowed, the switch is
  called a crossbar switch.
Multistage Networks

• In general, a multistage network is comprised of a
  collection of a x b switch modules and fixed network
  modules. The a x b switch modules are used to provide
  variable permutation or other reordering of the inputs,
  which are then further reordered by the fixed network
  modules.

• A generic multistage network consists of a sequence of
  alternating dynamic switches (with relatively small values
  for a and b) and static networks (with larger numbers of
  inputs and outputs). The static networks are used to
  implement interstage connections (ISC).
Omega Network

• A 2 x 2 switch can be configured for
  – straight-through
  – crossover
  – upper broadcast (upper input to both outputs)
  – lower broadcast (lower input to both outputs)
  – (no output is a somewhat vacuous possibility as well)
• With four stages of eight 2 x 2 switches, and a static
  perfect shuffle for each of the four ISCs, a 16-by-16
  Omega network can be constructed (but not all permutations
  are possible).
• In general, an n-input Omega network requires log2(n)
  stages of 2 x 2 switches, with n/2 switch modules per
  stage.
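
A quick sketch of the Omega sizing rule just stated: an n-input network needs log2(n) stages, each with n/2 two-by-two switch modules.

from math import log2

def omega_size(n):
    """Return (stages, switches per stage, total switches) for an n-input Omega network."""
    stages = int(log2(n))
    per_stage = n // 2
    return stages, per_stage, stages * per_stage

print(omega_size(16))   # (4, 8, 32): four stages of eight 2 x 2 switches
print(omega_size(64))   # (6, 32, 192)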
Baseline Network

• A baseline network can be shown to be topologically
  equivalent to other networks (including Omega), and it has
  a simple recursive generation procedure.
• Stage k (k = 0, 1, ...) is an m x m switch block (where
  m = N / 2^k) composed entirely of 2 x 2 switch blocks,
  each having two configurations: straight-through and
  crossover.
4 x 4 Baseline Network (figure omitted)
Crossbar Networks

• An m x n crossbar network can be used to provide a
  constant-latency connection between devices; it can be
  thought of as a single-stage switch.

• Different types of devices can be connected, yielding
  different constraints on which switches can be enabled.
  – With m processors and n memories, one processor may be
    able to generate requests for multiple memories in
    sequence; thus several switches might be set in the same
    row.
  – For m x m interprocessor communication, each PE is
    connected to both an input and an output of the
    crossbar; only one switch in each row and column can be
    turned on simultaneously. Additional control processors
    are used to manage the crossbar itself.
Summary of Dynamic Network Characteristics (table omitted)
