18CSC203J - Computer Organization and Architecture
UNIT - 4
Parallelism

2
Course Outcome
• CLR-3 : Understand the concepts of Pipelining
and basic processing units
• CLR-4 : Study parallel processing and performance considerations.
• CLO-3 : Analyze the detailed operation of basic processing units and the performance of pipelining
• CLO-4 : Analyze concepts of parallelism and
multi-core processors.
Contents
• Parallelism
• Need for parallelism
• Types of Parallelism
• Applications of Parallelism
• Parallelism in Software
– Instruction level parallelism
– Data level parallelism
• Challenges in Parallelism
• Architecture of Parallel system
– Flynn’s Classification
– SISD, SIMD
– MISD, MIMD
• Hardware Multithreading
– Coarse-grained multithreading
– Fine-grained multithreading
• Uniprocessor and Multiprocessor
• Multi-core Processor
• Memory in multi-processor system
• Cache Coherency in multi-processor system
• MESI Protocol for multi-processor system

4
Parallelism
• Executing two or more operations at the same time is
known as parallelism.
• Parallel processing is a method to improve computer system performance by executing two or more instructions simultaneously.
• A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem.
• Two or more ALUs in the CPU can work concurrently to increase throughput.
• The system may have two or more processors operating concurrently.

5
Goals of parallelism
• To increase the computational speed, i.e. to reduce the amount of time that you need to wait for a problem to be solved
• To increase throughput, i.e. the amount of processing that can be accomplished during a given interval of time
• To improve the performance of the computer for
a given clock speed
• To solve bigger problems that might not fit in the
limited memory of a single CPU

6
Applications of Parallelism
• Numeric weather prediction
• Socio economics
• Finite element analysis
• Artificial intelligence and automation
• Genetic engineering
• Weapon research and defence
• Medical Applications
• Remote sensing applications

7
Applications of Parallelism

8
Types of parallelism
1. Hardware Parallelism
2. Software Parallelism

• Hardware Parallelism :
The main objective of hardware parallelism is to increase the processing speed. Based on the hardware architecture, hardware parallelism can be divided into two types: processor parallelism and memory parallelism.
• Processor parallelism means that the computer architecture has multiple nodes, multiple CPUs or multiple sockets, multiple cores, and multiple threads.
• Memory parallelism means shared memory, distributed memory, hybrid distributed-shared memory, multilevel pipelines, etc. Sometimes it is also called a parallel random access machine (PRAM): “an abstract model for parallel computation which assumes that all the processors operate synchronously under a single clock and are able to randomly access a large shared memory. In particular, a processor can execute an arithmetic, logic, or memory access operation within a single clock cycle”. This is what is meant by using overlapping or pipelining of instructions to achieve parallelism.

9
Hardware Parallelism
• One way to characterize the parallelism in a processor is by
the number of instruction issues per machine cycle.
• If a processor issues k instructions per machine cycle, then
it is called a k-issue processor.
• In a modern processor, two or more instructions can be
issued per machine cycle.
• A conventional processor takes one or more machine
cycles to issue a single instruction. These types of
processors are called one-issue machines, with a single
instruction pipeline in the processor.
• A multiprocessor system built with n k-issue processors should be able to handle a maximum of nk threads of instructions simultaneously.
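For example, a system with four 2-issue processors (n = 4, k = 2) could issue at most nk = 8 instructions per machine cycle, and so handle up to 8 threads of instructions simultaneously.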

10
Software Parallelism
• It is defined by the control and data dependence of
programs.
• The degree of parallelism is revealed in the program
flow graph.
• Software parallelism is a function of algorithm,
programming style, and compiler optimization.
• The program flow graph displays the patterns of
simultaneously executable operations.
• Parallelism in a program varies during the execution period.
• This variation limits the sustained performance of the processor.

11
Software Parallelism - Types
Parallelism in software:
• Instruction-level parallelism
• Task-level parallelism
• Data-level parallelism
• Transaction-level parallelism

18
Instruction level parallelism
• Instruction level Parallelism (ILP) is a measure of
how many operations can be performed in
parallel at the same time in a computer.

• Parallel instructions are a set of instructions that do not depend on each other to be executed.
• ILP allows the compiler and processor to overlap
the execution of multiple instructions or even to
change the order in which instructions are
executed.
19
Example: Instruction-level parallelism
Consider the following example:
1. x = a + b
2. y = c - d
3. z = x * y
Operation 3 depends on the results of 1 and 2, so z cannot be calculated until x and y have both been calculated. But 1 and 2 do not depend on any other operation, so they can be computed simultaneously.

20
• If we assume that each operation can be completed in one unit of time, then these 3 operations can be completed in 2 units of time.
• The ILP factor is 3/2 = 1.5, which is better than purely sequential execution (3 units of time).
• A superscalar CPU architecture implements ILP inside a single processor, which allows faster CPU throughput at the same clock rate.
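A minimal sketch of this calculation (an illustration, assuming each operation takes one time unit; the helper ilp_factor is hypothetical): the ILP factor is the number of operations divided by the length of the longest dependency chain.

def ilp_factor(deps):
    """deps maps each operation to the list of operations it depends on."""
    depth_cache = {}

    def depth(op):
        # Depth = length of the longest dependency chain ending at `op`,
        # assuming every operation takes one time unit.
        if op not in depth_cache:
            depth_cache[op] = 1 + max((depth(d) for d in deps[op]), default=0)
        return depth_cache[op]

    critical_path = max(depth(op) for op in deps)
    return len(deps) / critical_path

# The example above: x and y are independent, z needs both.
deps = {"x": [], "y": [], "z": ["x", "y"]}
print(ilp_factor(deps))  # 3 operations / 2 time units = 1.5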

21
Data-level parallelism (DLP)

• Data parallelism is parallelization across multiple processors in parallel computing environments.
• It focuses on distributing the data across
different nodes, which operate on the data in
parallel.
• Instructions from a single stream operate concurrently on several data elements.

22
DLP - example
• Let us assume we want to sum all the
elements of the given array of size n and the
time for a single addition operation is Ta time
units.
• In the case of sequential execution, the time taken by the process will be n*Ta time units.
• If we execute this job as a data-parallel job on 4 processors, the time taken would reduce to (n/4)*Ta + merging overhead time units.
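A minimal sketch of this idea (an assumed illustration using Python's standard multiprocessing module): each of 4 worker processes sums one chunk of the array, and the partial sums are then merged, mirroring the (n/4)*Ta + merging-overhead estimate.

from multiprocessing import Pool

def chunk_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    workers = 4
    size = len(data) // workers
    chunks = [data[i * size:(i + 1) * size] for i in range(workers)]
    chunks[-1].extend(data[workers * size:])   # any leftover elements

    with Pool(workers) as pool:
        partial = pool.map(chunk_sum, chunks)  # (n/4)*Ta phase, done in parallel
    print(sum(partial))                        # merging overhead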

23
DLP in Adding elements of array

24
DLP in matrix multiplication

• A[m x n] dot B[n x k] can be finished in O(n) instead of O(m∗n∗k) when executed in parallel using m*k processors.
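A minimal sketch of this scheme (an assumed illustration, not a full implementation): each of the m*k output elements C[i][j] is an independent dot product of length n, so all of them can be computed in parallel; here a process pool stands in for the m*k processors.

from multiprocessing import Pool
from itertools import product

A = [[1, 2, 3],
     [4, 5, 6]]           # m x n = 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]            # n x k = 3 x 2

def cell(ij):
    # One "processor" computes one output element: a length-n dot product, O(n).
    i, j = ij
    return sum(A[i][t] * B[t][j] for t in range(len(B)))

if __name__ == "__main__":
    m, k = len(A), len(B[0])
    with Pool() as pool:                       # the pool plays the role of m*k processors
        flat = pool.map(cell, product(range(m), range(k)))
    C = [flat[i * k:(i + 1) * k] for i in range(m)]
    print(C)   # [[58, 64], [139, 154]]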

25
• The locality of data references plays an
important part in evaluating the performance
of a data parallel programming model.
• Locality of data depends on the memory
accesses performed by the program as well as
the size of the cache.

26
Flynn’s Classification
▪ Proposed by researcher Michael J. Flynn in 1966.
▪ It is the most commonly accepted taxonomy of computer organization.
▪ In this classification, computers are classified by whether they process a single instruction at a time or multiple instructions simultaneously, and whether they operate on one or multiple data sets.

27
Flynn’s Classification
• This taxonomy distinguishes multi-processor computer
architectures according to the two independent dimensions
of Instruction stream and Data stream.
• An instruction stream is a sequence of instructions executed by the machine.
• A data stream is a sequence of data, including input and partial or temporary results, used by the instruction stream.
• Each of these dimensions can have only one of two possible states: Single or Multiple.
• Flynn’s classification depends on the distinction between the performance of the control unit and the data processing unit rather than on its operational and structural interconnections.

28
Flynn’s Classification
• The four categories of Flynn’s classification:

29
SISD
• They are also called scalar processors, i.e. one instruction at a time, and each instruction has only one set of operands.
• An SISD computer has one control unit, one processor unit and a single memory unit.
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
• Single data: only one data stream is being used as input during any one clock cycle.
• Deterministic execution.
• Instructions are executed sequentially.
30
SIMD
• A type of parallel computer in which a single instruction is executed by different processing units on different sets of data.
• Single instruction: all processing units execute the same instruction, issued by the control unit, at any given clock cycle.
• Multiple data: each processing unit can operate on a different data element. As shown in the figure below, the processors are connected to a shared memory or an interconnection network providing multiple data to the processing units.

31
MISD
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent instructions.
• The same data flows through a linear array of processors executing different instruction streams.
• A single data stream is forwarded to the different processing units, each of which is connected to a different control unit and executes the instructions given to it by the control unit to which it is attached.

32
MIMD
• Multiple instruction: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream.
• Different processors may each be processing a different task.
• Execution can be synchronous or asynchronous, deterministic or nondeterministic.

33
Hardware Multithreading
• Hardware multithreading allows multiple threads to share the
functional units of a single processor in an overlapping fashion to
try to utilize the hardware resources efficiently.
• To permit this sharing, the processor must duplicate the independent state of each thread. This increases the utilization of the processor.
• For example, each thread would have a separate copy of register
file and program counter. The memory itself can be shared
through virtual memory mechanisms, which already support
multi-programming.
• In addition, the hardware must support the ability to switch to a different
thread relatively quickly. In particular, a thread switch should be
much more efficient than a process switch, which typically
requires hundreds to thousands of processor cycles while a thread
switch can be instantaneous.

34
Fine grained multi threading
• Fine-grained multithreading switches between threads on each
instruction, resulting in interleaved execution of multiple threads.
• This interleaving is often done in a round-robin fashion, skipping
any threads that are stalled at that clock cycle.
• To make fine-grained multithreading practical, the processor must be able to switch threads on every clock cycle.
• One advantage of fine-grained multithreading is that it can hide throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls.
• The primary disadvantage of fine-grained multithreading is that it slows down the execution of individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
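A minimal sketch of this interleaving (an assumed toy model, with made-up thread names and a "stall" marker standing in for a stalled cycle): each cycle the issue logic picks one ready instruction in round-robin order and skips any thread whose next entry is a stall, so another thread's instruction fills that cycle.

threads = {
    "T0": ["A0", "A1", "stall", "A2"],
    "T1": ["B0", "B1", "B2"],
    "T2": ["C0", "stall", "stall", "C1"],
}

pc = {t: 0 for t in threads}            # per-thread program counter
order = list(threads)                   # round-robin order of threads
cycle = 0
while any(pc[t] < len(threads[t]) for t in threads):
    issued = "(bubble)"                 # nothing ready this cycle
    for _ in range(len(order)):
        t = order[0]
        order = order[1:] + order[:1]   # rotate to the next thread
        if pc[t] >= len(threads[t]):
            continue                    # this thread has finished
        if threads[t][pc[t]] == "stall":
            pc[t] += 1                  # its stall resolves; try another thread
            continue
        issued = t + ":" + threads[t][pc[t]]   # issue one instruction from thread t
        pc[t] += 1
        break
    print(f"cycle {cycle}: issued {issued}")
    cycle += 1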

35
Coarse grained multi threading
• Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading.
• Coarse-grained multithreading switches threads only on costly stalls, such as
last-level cache misses.
• This change relieves the need to have thread switching be extremely fast and is much less likely to slow down the execution of an individual thread, since instructions from other threads will only be issued when a thread encounters a costly stall.
• Drawback: it is limited in its ability to overcome throughput losses, especially from shorter stalls.
• This limitation arises from the pipeline start-up costs of coarse-grained multithreading. Because a processor with coarse-grained multithreading issues instructions from a single thread, when a stall occurs the pipeline must be emptied or frozen.
• The new thread that begins executing after the stall must fill the pipeline before instructions will be able to complete. Due to this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where pipeline refill time is negligible compared to the stall time.

36
Comparison

37
Single-core computer

38
Single-core CPU chip

39
Multi-core architectures
• Replicate multiple processor cores
on a single die.
[Figure: Core 1, Core 2, Core 3, Core 4 replicated on a single multi-core CPU chip]

40


Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)

[Figure: cores 1-4 side by side on one chip]

41
The cores run in parallel
[Figure: thread 1, thread 2, thread 3, thread 4 each running on its own core (cores 1-4)]

42
Within each core, threads are time-sliced (just
like on a uniprocessor)
[Figure: several threads time-sliced within each of cores 1-4]

43
Memory in Multiprocessor System
• Two architectures:
– Shared common memory
– Unshared Distributed memory.

44
Shared memory multiprocessors

• A system with multiple CPUs “sharing” the same main memory is called a shared memory multiprocessor.
• In a multiprocessor system all processes on the various
CPUs share a unique logical address space.
• Multiple processors can operate independently but
share the same memory resources.
• Changes in a memory location effected by one
processor are visible to all other processors.
• Shared memory machines can be divided into two
main classes based upon memory access times: UMA ,
NUMA.
45
Uniform Memory Access (UMA)
• Most commonly represented today by Symmetric
Multiprocessor (SMP) machines.
• Identical processors.
• Equal access and access times to memory.
• Sometimes called CC-UMA - Cache Coherent
UMA. Cache coherent means if one processor
updates a location in shared memory, all the
other processors know about the update. Cache
coherency is accomplished at the hardware level.
• It can be used to speed up the execution of a
single large program in time critical applications
46
Non-Uniform Memory Access (NUMA)
• These systems have a shared logical address space, but physical memory is distributed among CPUs, so that the access time to data depends on the data's position, in local or in a remote memory (thus the NUMA denomination).
• These systems are also called Distributed Shared Memory (DSM) architectures.
• Memory access across link is slower
• If cache coherency is maintained, then may also
be called CC-NUMA - Cache Coherent NUMA
47
• The COMA model : The COMA model is a special
case of NUMA machine in which the distributed
main memories are converted to caches. All
caches form a global address space and there is
no memory hierarchy at each processor node.
• Data have no specific “permanent” location (no specific memory address) where they stay; they can be read (copied into local caches) and/or modified (first in the cache and then updated at their “permanent” location).

48
Shared Memory

[Figure: Uniform Memory Access vs Non-Uniform Memory Access organizations]

49
Distributed memory systems
• Distributed memory systems require a communication network to
connect inter-processor memory.
• Processors have their own local memory.
• Memory addresses in one processor do not map to another processor, so
there is no concept of global address space across all processors.
• Because each processor has its own local memory, it operates
independently.
• Changes it makes to its local memory have no effect on the memory of
other processors. Hence, the concept of cache coherency does not
apply.
• When a processor needs access to data in another processor, it is usually
the task of the programmer to explicitly define how and when data is
communicated.
• Synchronization between tasks is likewise the programmer's responsibility.

50
The memory hierarchy
• If simultaneous multithreading only:
– all caches shared
• Multi-core chips:
– L1 caches private
– L2 caches private in some architectures
and shared in others
• Memory is always shared

51
“Fish” machines
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches
• Shared L2 cache
[Figure: CORE0 and CORE1, each with its own L1 cache and hyper-threads, sharing an L2 cache and memory]
52
Designs with private L2 caches

[Figure: two designs — in both, each core has private L1 and L2 caches; the second design adds an L3 cache level above memory]
• Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D.
• A design with L3 caches. Example: Intel Itanium 2.
53
Private vs shared caches?
• Advantages/disadvantages?

54
Private vs shared caches
• Advantages of private:
– They are closer to core, so faster access
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the
same cache data
– More cache space available if a single (or a
few) high-performance thread runs on the
system
55
The cache coherence problem
• Since we have private caches:
How to keep the data consistent across caches?
• Each core should perceive the memory as a
monolithic array, shared by all the cores

56
The cache coherence problem
Suppose variable x initially contains 15213

[Figure: four cores, each with one or more levels of private cache, on a multi-core chip; main memory holds x=15213]

57
The cache coherence problem
Core 1 reads x

[Figure: Core 1's cache now holds x=15213; main memory holds x=15213]

58
The cache coherence problem
Core 2 reads x

[Figure: Core 1 and Core 2 caches both hold x=15213; main memory holds x=15213]

59
The cache coherence problem
Core 1 writes to x, setting it to 21660

[Figure: Core 1's cache holds x=21660, Core 2's cache still holds x=15213; main memory holds x=21660 (assuming write-through caches)]

60
The cache coherence problem
Core 2 attempts to read x… gets a stale copy
[Figure: Core 2 reads the stale value x=15213 from its own cache, while Core 1's cache and main memory hold x=21660]

61
Solutions for cache coherence
• This is a general problem with
multiprocessors, not limited just to
multi-core
• There exist many solution
algorithms, coherence protocols,
etc.

• A simple solution:
invalidation-based protocol with snooping
62
Inter-core bus

[Figure: four cores with private caches on a multi-core chip, connected by an inter-core bus; main memory is attached to the same bus]

63
Invalidation protocol with
snooping
• Invalidation:
If a core writes to a data item, all other
copies of this data item in other
caches are invalidated
• Snooping:
All cores continuously “snoop”
(monitor) the bus connecting the
cores.

64
Bus Snooping
• Each CPU (cache system) ‘snoops’ (i.e. watches continually) for write activity concerned with data addresses which it has cached.
• This assumes a bus structure which is ‘global’, i.e. all communication can be seen by all.
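A minimal sketch of this mechanism (an assumed illustration for write-through caches, using the same x=15213 / x=21660 scenario as the slides that follow): every write is visible on the shared bus, and any other cache holding that address invalidates its copy, so the next read by another core misses and fetches the fresh value.

memory = {"x": 15213}
caches = {1: {}, 2: {}, 3: {}, 4: {}}        # one private cache per core

def read(core, addr):
    if addr not in caches[core]:             # miss: fetch from main memory
        caches[core][addr] = memory[addr]
    return caches[core][addr]

def write(core, addr, value):
    caches[core][addr] = value
    memory[addr] = value                      # write-through to main memory
    for c, cache in caches.items():           # every other cache snoops the write...
        if c != core and addr in cache:
            del cache[addr]                   # ...and invalidates its own copy

read(1, "x")            # Core 1 caches x=15213
read(2, "x")            # Core 2 caches x=15213
write(1, "x", 21660)    # Core 1 writes; Core 2's copy is invalidated
print(read(2, "x"))     # Core 2 misses and reloads the fresh value 21660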

65
The cache coherence problem
Revisited: Cores 1 and 2 have both read x

[Figure: Core 1 and Core 2 caches both hold x=15213; main memory holds x=15213]

66
The cache coherence problem
Core 1 writes to x, setting it to 21660

[Figure: Core 1 writes x=21660 and sends an invalidation request over the inter-core bus; Core 2's copy of x is INVALIDATED; main memory holds x=21660 (assuming write-through caches)]

67
The cache coherence problem
After invalidation:

[Figure: only Core 1's cache holds x=21660; main memory holds x=21660]

68
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new
copy.

[Figure: Core 1 and Core 2 caches both hold x=21660; main memory holds x=21660]

69
Alternative to invalidate protocol:
update protocol
Core 1 writes x=21660:

[Figure: Core 1 writes x=21660 and broadcasts the updated value on the inter-core bus; Core 2's copy is UPDATED to x=21660; main memory holds x=21660 (assuming write-through caches)]

70
Which do you think is better?
Invalidation or update?

71
Invalidation vs update
• Multiple writes to the same location
– invalidation: only the first time
– update: must broadcast each write
(which includes new variable value)

• Invalidation generally performs better: it generates less bus traffic.

72
MESI Protocol

For Multiprocessor Systems

73
• The MESI protocol is an Invalidate-based cache
coherence protocol, and is one of the most common
protocols which support write-back caches.
• Write back caches can save a lot on bandwidth that
is generally wasted on a write through cache.
• There is always a dirty state present in write-back caches, which indicates that the data in the cache is different from that in main memory.

74
MESI Protocol (2)
Any cache line can be in one of 4 states (2 bits)
• Modified - The cache line is present only in the current cache, and is
dirty - it has been modified (M state) from the value in main
memory. The cache is required to write the data back to main
memory at some time in the future, before permitting any other
read of the (no longer valid) main memory state. The write-back
changes the line to the Shared state(S).
• Exclusive – The cache line is present only in the current cache, but is
clean - it matches main memory. It may be changed to the Shared
state at any time, in response to a read request. Alternatively, it may
be changed to the Modified state when writing to it.
• Shared – Indicates that this cache line may be stored in other caches
of the machine and is clean - it matches the main memory. The line
may be discarded (changed to the Invalid state) at any time
• Invalid – Indicates that this cache line is invalid (unused).

75
Operation

• A processor P1 has a block X in its cache, and there is a request from the processor to read or write from that block.
• The second stimulus comes from other processors, which don't have the cache block or the updated data in their caches.
• The bus requests are monitored with the help of Snoopers which
snoops all the bus transactions.
• Following are the different type of Processor requests and Bus
side requests:
• Processor Requests to Cache includes the following operations:
• PrRd: The processor requests to read a Cache block.
• PrWr: The processor requests to write a Cache block

76
• Bus side requests are the following:
• BusRd: Snooped request that indicates there is a read request to a
Cache block made by another processor
• BusRdX: Snooped request that indicates there is a write request to
a Cache block made by another processor which doesn't already
have the block.
• BusUpgr: Snooped request that indicates that there is a write
request to a Cache block made by another processor but that
processor already has that Cache block resident in its Cache.
• Flush: Snooped request that indicates that an entire cache block is
written back to the main memory by another processor.
• FlushOpt: Snooped request that indicates that an entire cache block is posted on the bus in order to supply it to another processor (cache-to-cache transfer).

77
78
• Snooping Operation: In a snooping system, all
caches on the bus monitor (or snoop) all the bus
transactions. Every cache has a copy of the
sharing status of every block of physical memory
it has stored. The state of the block is changed
according to the State Diagram of the protocol used (refer to the MESI state diagram above).
The bus has snoopers on both sides:
• Snooper towards the Processor/Cache side.
• The snooping function on the memory side is
done by the Memory controller.

79
State Transitions and response to
various Processor Operations

80
Illustration of MESI protocol operations
• Let us assume the following stream of read/write references. All the references are to the same location, and the digit refers to the processor issuing the reference.
• The stream is : R1, W1, R3, W3, R1, R3, R2.
• Initially it is assumed that all the caches are empty.

83
• Step 1: As the cache is initially empty, the main memory provides P1 with the block, and the block enters the Exclusive state.
• Step 2: As the block is already present in the cache and in the Exclusive state, P1 modifies it directly without any bus transaction. The block is now in the Modified state.
• Step 3: In this step, a BusRd is posted on the bus and the snooper on P1 senses this. It then flushes the data and changes its state to Shared. The block on P3 also changes its state to Shared, as it has received the data from another cache. There is no main memory access here.

84
• Step 4: Here a BusUpgr is posted on the bus, and the snooper on P1 senses this and invalidates its block, as it is going to be modified by another cache. P3 then changes its block's state to Modified.
• Step 5: As the current state on P1 is Invalid, it posts a BusRd on the bus. The snooper at P3 senses this and flushes the data out. The states of the blocks on both P1 and P3 now become Shared. Notice that this is also when the main memory is updated with the previously modified data.
• Step 6: There is a hit in the cache and it is in the Shared state, so no bus request is made here.
• Step 7: There is a cache miss on P2 and a BusRd is posted. The snoopers on P1 and P3 sense this and both attempt a flush. Whichever gets access to the bus first will perform that operation.
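A minimal sketch of these transitions (an assumed, simplified simulator of the MESI states for one block cached by P1, P2 and P3; it replays the stream R1, W1, R3, W3, R1, R3, R2 and prints each cache's state after every reference, mirroring steps 1-7 above).

state = {1: "I", 2: "I", 3: "I"}    # caches of P1, P2, P3 start empty (Invalid)

def others(p):
    return [q for q in state if q != p]

def read(p):
    if state[p] == "I":                            # read miss: issue BusRd
        holders = [q for q in others(p) if state[q] != "I"]
        for q in holders:                          # an M or E holder flushes the data;
            state[q] = "S"                         # every holder ends up Shared
        state[p] = "S" if holders else "E"         # Exclusive if main memory supplied it
    # M, E or S: read hit, no bus transaction

def write(p):
    if state[p] == "S":                            # BusUpgr: invalidate the other copies
        for q in others(p):
            state[q] = "I"
    elif state[p] == "I":                          # write miss: BusRdX invalidates others
        for q in others(p):
            state[q] = "I"
    state[p] = "M"                                 # E->M and M->M need no bus traffic

stream = [("R", 1), ("W", 1), ("R", 3), ("W", 3), ("R", 1), ("R", 3), ("R", 2)]
for op, p in stream:
    (read if op == "R" else write)(p)
    print(f"{op}{p}: P1={state[1]} P2={state[2]} P3={state[3]}")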

85
