COA Unit - V Notes
UNIT- V
Pipelining: Basic concepts of pipelining, Arithmetic pipeline, Instruction pipeline, Instruction Hazards.
Parallel Processors: Introduction to parallel processors, Multiprocessor, Interconnection structures and
Cache coherency.
---------------------------------------------------------------------------------------------------------------------------------
Pipelining is the process of feeding instructions to the processor through a pipeline, which allows
instructions to be stored and executed in an orderly, overlapped fashion. It is also known as pipeline processing.
Pipelining is a technique in which the execution of multiple instructions is overlapped. The pipeline is divided
into stages, and these stages are connected to one another to form a pipe-like structure. Instructions enter
at one end and exit at the other.
In a pipeline system, each segment consists of an input register followed by a combinational circuit. The
register is used to hold data and the combinational circuit performs operations on it. The output of the
combinational circuit is applied to the input register of the next segment.
A pipeline system is like a modern-day assembly line set up in a factory. For example, in a car manufacturing
plant, huge assembly lines are set up with robotic arms at each point to perform a certain task, after which
the car moves on to the next arm.
Types of Pipeline
1. Arithmetic Pipeline
2. Instruction Pipeline
Arithmetic Pipeline
Arithmetic Pipelines are mostly used in high-speed computers. They are used to implement floating-point
operations, multiplication of fixed-point numbers, and similar computations encountered in scientific
problems.
To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an example of
a pipeline unit for floating-point addition and subtraction.
The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers defined
as:
X = A * 2^a = 0.9504 * 10^3
Y = B * 2^b = 0.8200 * 10^2
Where A and B are two fractions that represent the mantissa and a and b are the exponents.
The combined operation of floating-point addition and subtraction is divided into four segments. Each
segment contains the corresponding sub operation to be performed in the given pipeline. The sub
operations that are shown in the four segments are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
We will discuss each sub-operation in more detail later in this section.
The following block diagram represents the sub operations performed in each segment of the pipeline.
Note: Registers are placed after each suboperation to store the intermediate results.
The exponents are compared by subtracting them to determine their difference. The larger exponent is
chosen as the exponent of the result.
The difference of the exponents, i.e., 3 - 2 = 1 determines how many times the mantissa associated with the
smaller exponent must be shifted to the right.
The mantissa associated with the smaller exponent is shifted according to the difference of exponents
determined in segment one.
X = 0.9504 * 10^3
Y = 0.08200 * 10^3
3. Add the mantissas:
Z = X + Y = 1.0324 * 10^3
4. Normalize the result:
Z = 0.10324 * 10^4
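The four sub-operations can also be sketched in code. The following Python fragment is only an illustration (it is not part of the original notes); the function names and the decimal mantissa/exponent representation are assumptions chosen to mirror the worked example above.

# Minimal sketch of the four sub-operations of a floating-point
# addition pipeline, using decimal mantissa/exponent pairs as in
# the worked example (0.9504 * 10^3 + 0.8200 * 10^2).

def compare_exponents(a_exp, b_exp):
    # Segment 1: the larger exponent becomes the result exponent;
    # the difference tells how far to shift the smaller mantissa.
    diff = a_exp - b_exp
    return max(a_exp, b_exp), diff

def align_mantissas(a_man, b_man, diff):
    # Segment 2: shift the mantissa of the smaller exponent to the right.
    if diff > 0:
        b_man = b_man / (10 ** diff)
    elif diff < 0:
        a_man = a_man / (10 ** (-diff))
    return a_man, b_man

def add_mantissas(a_man, b_man):
    # Segment 3: add (or subtract) the aligned mantissas.
    return a_man + b_man

def normalize(man, exp):
    # Segment 4: renormalize so the mantissa is a fraction below 1.
    while abs(man) >= 1.0:
        man /= 10.0
        exp += 1
    return man, exp

# X = 0.9504 * 10^3, Y = 0.8200 * 10^2
exp, diff = compare_exponents(3, 2)
a_man, b_man = align_mantissas(0.9504, 0.8200, diff)
man = add_mantissas(a_man, b_man)          # 1.0324
man, exp = normalize(man, exp)             # 0.10324 * 10^4
print(man, exp)                            # approx 0.10324 4

Running it reproduces the result of the example: the aligned mantissas 0.9504 and 0.0820 add to 1.0324, which normalizes to 0.10324 * 10^4.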
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well.
Most digital computers with complex instructions require an instruction pipeline to carry out operations
such as fetching, decoding and executing instructions.
In general, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
Each step is executed in a particular segment, and there are times when different segments may take
different times to operate on the incoming information. Moreover, there are times when two or more
segments may require memory access at the same time, causing one segment to wait until another is
finished with the memory.
The organization of an instruction pipeline will be more efficient if the instruction cycle is divided into
segments of equal duration. One of the most common examples of this type of organization is a Four-
segment instruction pipeline.
A four-segment instruction pipeline combines two or more of these steps into a single segment. For
instance, the decoding of the instruction can be combined with the calculation of the effective address into
one segment.
The following block diagram shows a typical example of a four-segment instruction pipeline. The
instruction cycle is completed in four segments.
Segment 1:
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:
The instruction fetched from memory is decoded in the second segment, and eventually, the effective
address is calculated in a separate arithmetic circuit.
Segment 3:
An operand from memory is fetched in the third segment.
Segment 4:
The instructions are finally executed in the last segment of the pipeline organization.
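To make the overlap concrete, here is a small Python sketch (an illustration, not part of the notes) that prints which segment each instruction occupies in every clock cycle of an ideal four-segment pipeline, assuming no hazards; the segment names FI, DA, FO and EX follow the four segments described above.

# Ideal four-segment instruction pipeline: in cycle c, instruction i
# (0-based) occupies segment (c - i) if 0 <= c - i < 4.

SEGMENTS = ["FI", "DA", "FO", "EX"]   # fetch, decode/address, operand fetch, execute

def schedule(num_instructions):
    total_cycles = num_instructions + len(SEGMENTS) - 1
    for cycle in range(total_cycles):
        row = []
        for i in range(num_instructions):
            stage = cycle - i
            row.append(SEGMENTS[stage] if 0 <= stage < len(SEGMENTS) else "--")
        print(f"cycle {cycle + 1}: " + "  ".join(row))

schedule(6)   # six instructions finish in 6 + 4 - 1 = 9 cycles instead of 24

With k = 4 segments and n = 6 instructions, the pipeline completes in k + n - 1 = 9 cycles instead of the n * k = 24 cycles a non-pipelined unit would need.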
Pipeline Conflicts
There are some factors that cause the pipeline to deviate from its normal performance. Some of these
factors are given below:
1. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction processing,
where different instructions have different operand requirements and thus different processing times.
2. Data Hazards
When several instructions are in partial execution and they reference the same data, a problem arises. We
must ensure that the next instruction does not attempt to read data before the current instruction has
written it, because this would lead to incorrect results.
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the present
instruction is a conditional branch, and its result will lead us to the next instruction, then the next
instruction may not be known until the current one is processed.
4. Interrupts
Interrupts inject unwanted instructions into the instruction stream and thus affect the execution of instructions.
5. Data Dependency
It arises when an instruction depends upon the result of a previous instruction but this result is not yet
available.
Advantages of Pipelining
The cycle time of the processor is reduced and instruction throughput increases. Several instructions are in
progress at the same time, so the ALU and other functional units are utilized more efficiently.
Disadvantages of Pipelining
The design of a pipelined processor is more complex and costly. The latency of an individual instruction may
increase, and hazards (structural, data and control) can stall the pipeline and reduce the ideal speed-up.
There are mainly three types of dependencies possible in a pipelined processor. These are :
1) Structural Dependency
2) Control Dependency
3) Data Dependency
Structural dependency
This dependency arises due to the resource conflict in the pipeline. A resource conflict is a situation when
more than one instruction tries to access the same resource in the same cycle. A resource can be a register,
memory, or ALU.
Example:
Instruction / Cycle   1         2         3         4         5
I1                    IF(Mem)   ID        EX        Mem
I2                              IF(Mem)   ID        EX
I3                                        IF(Mem)   ID        EX
I4                                                  IF(Mem)   ID
In the above scenario, in cycle 4, instructions I1 and I4 are trying to access the same resource (memory),
which introduces a resource conflict.
To avoid this problem, we have to make the instruction wait until the required resource (memory in our
case) becomes available. This wait introduces stalls in the pipeline, as shown below:
Cycle                 1         2         3         4        5        6        7         8
I1                    IF(Mem)   ID        EX        Mem      WB
I2                              IF(Mem)   ID        EX       Mem      WB
I3                                        IF(Mem)   ID       EX       Mem      WB
I4                                                  –        –        –        IF(Mem)
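The stall insertion shown in the two tables above can be modelled with a short Python sketch. This is only an illustration under the simplifying assumption that a single memory port is shared by the IF and MEM stages; the stage list and helper names are not from the notes.

# Sketch: single memory port shared by the IF and MEM stages.
# An instruction's fetch is delayed (stall) whenever an earlier
# instruction occupies memory in the same cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n):
    start = [0] * n                      # cycle (0-based) in which each instruction's IF begins
    for i in range(1, n):
        start[i] = start[i - 1] + 1      # ideal case: one cycle after the previous fetch
        # memory-port conflict: IF of instruction i clashes with MEM of an earlier instruction j
        while any(start[j] + STAGES.index("MEM") == start[i] for j in range(i)):
            start[i] += 1                # insert a stall cycle
    for i, s in enumerate(start):
        print(f"I{i + 1}: " + " ".join(f"{stg}@{s + k + 1}" for k, stg in enumerate(STAGES)))

schedule(4)

For four instructions this prints I4's fetch starting at cycle 7, matching the stalled table above.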
Control dependency
This dependency arises because of control-transfer instructions such as BRANCH, CALL and JMP. The
processor does not know the target address of such an instruction until a later pipeline stage, so the
instructions that sequentially follow it are fetched in the meantime even though they may not be needed.
For example, let I2 be a jump whose target address is 250 and let BI1 be the instruction stored at that
address; the expected execution order is I1 → I2 → BI1, but the pipeline behaves as shown below.
NOTE: Generally, the target address of the JMP instruction is known only after the ID stage.
Instruction / Cycle   1    2    3             4     5     6
I1                    IF   ID   EX            MEM   WB
I2                         IF   ID (PC:250)   EX    Mem   WB
I3                              IF            ID    EX    Mem
BI1                                           IF    ID    EX
To correct the above problem, we need to stop instruction fetch until the target address of the branch
instruction is known. This can be implemented by introducing a delay slot until the target address is available.
Instruction / Cycle   1    2    3             4     5     6
I1                    IF   ID   EX            MEM   WB
I2                         IF   ID (PC:250)   EX    Mem   WB
Delay                 –    –    –             –     –     –
BI1                                           IF    ID    EX
As the delay slot performs no operation, this output sequence is equal to the expected output sequence. But
this slot introduces a stall in the pipeline.
Solution for control dependency: Branch prediction is the method through which stalls due to control
dependency can be eliminated. In this method, a prediction about whether the branch will be taken is made
in the first stage itself; when the prediction is correct, the branch penalty is zero.
Branch penalty : The number of stalls introduced during the branch operations in the pipelined processor
is known as branch penalty.
NOTE: As we have seen, the target address is available after the ID stage, so the number of stalls introduced
in the pipeline is 1. Had the branch target address been available only after the ALU stage, there would
have been 2 stalls. In general, if the target address becomes available after the k-th stage, then there will be
(k – 1) stalls in the pipeline.
Total number of stalls introduced in the pipeline due to branch instructions = Branch frequency * Branch
Penalty
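As a numerical illustration of this relation (the figures below are assumed, not from the notes), a short Python calculation:

# Sketch: stall cycles contributed by branches, using the relation
# stated above (stalls = branch frequency * branch penalty).

instructions   = 1_000_000
branch_freq    = 0.20        # assume 20% of instructions are branches
branch_penalty = 1           # target known after the ID stage -> 1 stall

branch_stalls = instructions * branch_freq * branch_penalty
ideal_cycles  = instructions          # 1 instruction per cycle once the pipe is full
effective_cpi = (ideal_cycles + branch_stalls) / instructions
print(branch_stalls, effective_cpi)   # 200000.0 stalls, CPI = 1.2

So with a 20% branch frequency and a 1-cycle penalty, the effective CPI grows from 1 to 1.2.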
Flow (data) dependence: O(S1) ∩ I(S2) ≠ ∅, S1 → S2, and S2 reads something previously written by S1.
Anti-dependence: I(S1) ∩ O(S2) ≠ ∅, S1 → S2, and S1 reads something before S2 overwrites it.
Output dependence: O(S1) ∩ O(S2) ≠ ∅, S1 → S2, and both write the same memory location.
Consider, for example, two instructions in which the second uses the result produced by the first, say
I1: ADD R1, R2, R3 followed by I2: SUB R4, R1, R2. When these instructions are executed in a pipelined
processor, a data-dependency condition occurs: I2 tries to read the data before I1 writes it, and therefore
I2 incorrectly gets the old value.
Instruction / Cycle   1    2    3               4
I1                    IF   ID   EX              DM
I2                         IF   ID (Old value)  EX
Operand Forwarding: In operand forwarding, we use the interface registers present between the stages to
hold the intermediate output, so that a dependent instruction can read the new value directly from the
interface register.
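A minimal Python sketch of the idea (an illustration only; the latch dictionary stands in for the EX/MEM interface register, and the instruction pair is assumed):

# Sketch of operand forwarding: the EX/MEM interface register of the
# producing instruction feeds the dependent instruction's EX stage
# directly, instead of waiting for the value to be written back.

regs = {"R1": 5, "R2": 7, "R3": 0, "R4": 0}
ex_mem_latch = {}                      # interface register between EX and MEM

def execute(dest, src1, src2):
    # read a source from the forwarding latch if a newer value is waiting there
    a = ex_mem_latch.get(src1, regs[src1])
    b = ex_mem_latch.get(src2, regs[src2])
    ex_mem_latch.clear()
    ex_mem_latch[dest] = a + b         # result is available for the next instruction
    return ex_mem_latch[dest]

def write_back(dest, value):
    regs[dest] = value                 # the register file is updated later

r3 = execute("R3", "R1", "R2")         # I1: R3 <- R1 + R2  (= 12)
r4 = execute("R4", "R3", "R1")         # I2: R4 <- R3 + R1, R3 forwarded (= 17)
write_back("R3", r3)
write_back("R4", r4)
print(regs)                            # {'R1': 5, 'R2': 7, 'R3': 12, 'R4': 17}

Because I2 reads R3 from the interface latch rather than from the register file, it sees the new value 12 even though the write-back of I1 has not happened yet.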
Data Hazards
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a
pipeline. Hazards cause delays in the pipeline. There are mainly three types of data hazards:
1) RAW (Read after Write) [Flow/True data dependency]
2) WAR (Write after Read) [Anti-Data dependency]
3) WAW (Write after Write) [Output data dependency]
RAW hazard occurs when instruction J tries to read data before instruction I writes it.
Eg:
I: R2 <- R1 + R3
J: R4 <- R2 + R3
WAR hazard occurs when instruction J tries to write data before instruction I reads it.
Eg:
I: R2 <- R1 + R3
J: R3 <- R4 + R5
WAW hazard occurs when instruction J tries to write output before instruction I writes it.
Eg:
I: R2 <- R1 + R3
J: R2 <- R4 + R5
WAR and WAW hazards occur during the out-of-order execution of the instructions.
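The three hazard types can be summarized in a small classification routine. This Python sketch is only an illustration (the function and the register sets are assumptions); it applies the definitions above to the example instruction pairs:

# Sketch: classify the hazard between two instructions I (earlier)
# and J (later), given each instruction's destination and source registers.

def classify(i_dest, i_srcs, j_dest, j_srcs):
    hazards = []
    if i_dest in j_srcs:
        hazards.append("RAW")          # J reads what I writes (true/flow dependency)
    if j_dest in i_srcs:
        hazards.append("WAR")          # J writes what I reads (anti-dependency)
    if i_dest == j_dest:
        hazards.append("WAW")          # both write the same register (output dependency)
    return hazards or ["none"]

# I: R2 <- R1 + R3, J: R4 <- R2 + R3   -> RAW on R2
print(classify("R2", {"R1", "R3"}, "R4", {"R2", "R3"}))
# I: R2 <- R1 + R3, J: R3 <- R4 + R5   -> WAR on R3
print(classify("R2", {"R1", "R3"}, "R3", {"R4", "R5"}))
# I: R2 <- R1 + R3, J: R2 <- R4 + R5   -> WAW on R2
print(classify("R2", {"R1", "R3"}, "R2", {"R4", "R5"}))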
Parallel Processors:
1.Multiprocessor:
A multiprocessor is a computer system in which two or more central processing units (CPUs) share full access
to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed;
other objectives are fault tolerance and application matching.
There are two types of multiprocessors: shared-memory multiprocessors and distributed-memory
multiprocessors. In a shared-memory multiprocessor, all the CPUs share the common memory, whereas in a
distributed-memory multiprocessor, every CPU has its own private memory.
Applications of Multiprocessor –
Enhanced performance.
Multiple applications.
Multi-tasking inside an application.
High throughput and responsiveness.
Hardware sharing among CPUs.
M.J. Flynn proposed a classification for the organization of a computer system by the number of
instructions and data items that are manipulated simultaneously.
The sequence of instructions read from memory constitutes an instruction stream, and the operations
performed on the data in the processor constitute a data stream.
Flynn's classification divides computers into four major groups: SISD, SIMD, MISD and MIMD, described below.
Parallel computing is a form of computing in which jobs are broken into discrete parts that can be executed
concurrently. Each part is further broken down into a series of instructions. Instructions from each part
execute simultaneously on different CPUs. Parallel systems deal with the simultaneous use of multiple
computer resources that can include a single computer with multiple processors, a number of computers
connected by a network to form a parallel processing cluster or a combination of both.
Parallel systems are more difficult to program than computers with a single processor because the
architecture of parallel computers varies accordingly and the processes of multiple CPUs must be
coordinated and synchronized.
CPUs are at the crux of parallel processing. Based on the number of instruction and data streams that can
be processed simultaneously, computing systems are classified into four major categories:
Flynn’s classification –
1. Single-instruction, single-data (SISD) systems:
An SISD computing system is a uniprocessor machine which is capable of executing a single
instruction, operating on a single data stream. In SISD, machine instructions are processed in a
sequential manner and computers adopting this model are popularly called sequential computers.
Most conventional computers have SISD architecture. All the instructions and data to be processed
have to be stored in primary memory.
The speed of the processing element in the SISD model is limited (dependent) by the rate at which
the computer can transfer information internally. Dominant representative SISD systems are the IBM
PC and workstations.
2. Single-instruction, multiple-data (SIMD) systems:
An SIMD system is a multiprocessor machine capable of executing the same instruction on all the
CPUs but operating on different data streams. Machines based on the SIMD model are well suited to
scientific computing, since it involves lots of vector and matrix operations. The data elements of
vectors are organized into multiple sets (N sets for an N-PE system) so that the information can be
passed to all the processing elements (PEs), and each PE processes one data set. (A small
programming-model sketch contrasting SISD and SIMD is given after this list.)
3. Multiple-instruction, single-data (MISD) systems:
An MISD computing system is a multiprocessor machine capable of executing different instructions
on different PEs, but all of them operating on the same data set.
Example: Z = sin(x) + cos(x) + tan(x)
The system performs different operations on the same data set. Machines built using the MISD
model are not useful in most applications; a few machines have been built, but none of them are
available commercially.
4. Multiple-instruction, multiple-data (MIMD) systems:
An MIMD system is a multiprocessor machine which is capable of executing multiple instructions
on multiple data sets. Each PE in the MIMD model has separate instruction and data streams;
therefore machines built using this model are suited to any kind of application. Unlike SIMD and
MISD machines, PEs in MIMD machines work asynchronously.
MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory
MIMD based on the way PEs are coupled to the main memory.
In the shared memory MIMD model (tightly coupled multiprocessor systems), all the PEs are
connected to a single global memory and they all have access to it. The communication between
PEs in this model takes place through the shared memory; any modification of the data stored in the
global memory by one PE is visible to all the other PEs.
Dominant representative shared memory MIMD systems are Silicon Graphics machines and
Sun/IBM’s SMP (Symmetric Multi-Processing).
In Distributed memory MIMD machines (loosely coupled multiprocessor systems) all PEs have a
local memory. The communication between PEs in this model takes place through the
interconnection network (the inter-process communication channel, or IPC). The network
connecting the PEs can be configured as a tree, mesh, or other topology in accordance with the requirement.
The shared-memory MIMD architecture is easier to program but is less tolerant to failures and
harder to extend compared with the distributed-memory MIMD model. Failures in a shared-memory
MIMD system affect the entire system, whereas this is not the case for the distributed model, in which each
of the PEs can be easily isolated. Moreover, shared memory MIMD architectures are less likely to
scale because the addition of more PEs leads to memory contention.
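As promised after the SIMD item above, here is a small programming-model sketch (an illustration only; NumPy's vectorized addition merely stands in for hardware vector units, it is not literally a SIMD machine):

import numpy as np

# SISD style: one instruction stream operating on one data element at a time.
x = list(range(8))
y = list(range(8))
z_sisd = []
for i in range(len(x)):          # each iteration handles a single pair of elements
    z_sisd.append(x[i] + y[i])

# SIMD style: a single operation is applied to whole vectors of data at once.
xv = np.arange(8)
yv = np.arange(8)
z_simd = xv + yv                  # one vector add over all 8 elements

print(z_sisd)                     # [0, 2, 4, ..., 14]
print(z_simd)                     # [ 0  2  4 ... 14]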
Interconnection Structures:
The interconnection between the components of a multiprocessor system can have different physical
configurations, depending on the number of transfer paths that are available between the processors and
memory in a shared-memory system, and among the processing elements in a loosely coupled system.
Some of the schemes are:
Time-Shared Common Bus –
Multiport Memory –
Crossbar Switch –
Multistage Switching Network –
Hypercube System
Time shared common Bus
All processors (and memory) are connected to a common bus or busses
- Memory access is fairly uniform, but not very scalable
- A collection of signal lines that carry module-to-module communication
- Data highways connecting several digital system elements
- Operations of Bus
In the above figure, a number of local buses are connected, each to its own local memory and to one or
more processors. Each local bus may be connected to a CPU, an IOP, or any combination of processors. A system bus
controller links each local bus to a common system bus. The I/O devices connected to the local IOP, as
well as the local memory, are available to the local processor. The memory connected to the common
system bus is shared by all processors. If an IOP is connected directly to the system bus, the I/O devices
attached to it may be made available to all processors.
Disadvantages:
• Only one processor can communicate with the memory or another processor at any given time.
• As a consequence, the total overall transfer rate within the system is limited by the speed of the single
path.
A multiport memory system employs separate buses between each memory module and each CPU. A
processor bus comprises the address, data and control lines necessary to communicate with memory. Each
memory module connects to each processor bus. The memory module must have internal control logic to
determine which port will have access to memory at any given time.
A memory module can be said to have four ports, with each port accommodating one of the buses. Memory
access conflicts are resolved by assigning fixed priorities to each memory port: the priority for memory
access associated with each processor is established by the physical position of the port that its bus occupies
in each module. Thus CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the
lowest priority. (A small sketch of this arbitration scheme follows the advantages and disadvantages below.)
Advantages:
A high transfer rate can be achieved because of the multiple paths.
Disadvantages:
It requires expensive memory control logic and a large number of cables and connectors.
It is only good for systems with a small number of processors.
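The fixed-priority port arbitration described above can be sketched as follows (a Python illustration only; the function name and request encoding are assumptions):

# Sketch of fixed-priority arbitration for one memory module: the module
# grants access to the requesting CPU connected to its highest-priority
# port (CPU 1 highest, CPU 4 lowest).

def arbitrate(requests):
    # 'requests' is the set of CPU numbers requesting this module in a cycle
    for cpu in (1, 2, 3, 4):           # port 1 has the highest priority
        if cpu in requests:
            return cpu                 # this CPU is granted the module
    return None                        # no request this cycle

print(arbitrate({2, 4}))               # CPU 2 wins over CPU 4
print(arbitrate({1, 3, 4}))            # CPU 1 wins
print(arbitrate(set()))                # None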
Cache coherence :
In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of
the memory hierarchy.
In a shared memory multiprocessor with a separate cache memory for each processor, it is possible to have
many copies of any one instruction operand: one copy in the main memory and one in each cache memory.
When one copy of an operand is changed, the other copies of the operand must be changed also.
Example :
Cache and the main memory may have inconsistent copies of the same object.
Suppose there are three processors, each having a cache. Consider the following scenario:
Processor 1 reads X: obtains 24 from memory and caches it.
Processor 2 reads X: obtains 24 from memory and caches it.
Processor 1 then writes X = 64: its locally cached copy is updated. Now, when processor 3 reads X,
what value should it get?
Memory and processor 2 think it is 24, while processor 1 thinks it is 64.
As multiple processors operate in parallel and independently, multiple caches may possess different copies
of the same memory block; this creates the cache coherence problem.
Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated
throughout the system in a timely fashion.
There are various cache coherence protocols in multiprocessor systems. These protocols maintain, for each
cached block, one of the following states:
Modified –
It means that the value in the cache is dirty, that is, the value in the current cache is different from
the value in main memory.
Exclusive –
It means that the value present in the cache is the same as that present in main memory, that is, the
value is clean.
Shared –
It means that the cache value holds the most recent data copy, and it is shared among all the
caches and main memory as well.
Owned –
It means that the current cache holds the block and is the owner of that block, that is, it has all
rights on that particular block.
Invalid –
This states that the current cache block itself is invalid and the data must be fetched from another
cache or from main memory.
Coherency mechanisms:
There are three types of coherence mechanisms:
1. Directory-based –
In a directory-based system, the data being shared is placed in a common directory that maintains
the coherence between caches. The directory acts as a filter through which the processor must ask
permission to load an entry from the primary memory to its cache. When an entry is changed, the
directory either updates or invalidates the other caches with that entry.
2. Snooping –
First introduced in 1983, snooping is a process where the individual caches monitor address lines
for accesses to memory locations that they have cached. It is typically implemented as a write-invalidate
protocol: when a write operation is observed to a location that a cache has a copy of, the cache controller
invalidates its own copy of the snooped memory location.
3. Snarfing –
It is a mechanism where a cache controller watches both address and data in an attempt to update its
own copy of a memory location when a second master modifies a location in main memory. When
a write operation is observed to a location that a cache has a copy of, the cache controller updates its
own copy of the snarfed memory location with the new data.
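To tie the X = 24 / 64 example to the snooping (write-invalidate) mechanism, here is a minimal Python sketch. It is an illustration only: the class names and the write-through simplification are assumptions, not part of the notes.

# Minimal sketch of a write-invalidate snooping scheme on a shared bus.
# Each cache snoops bus writes and invalidates its own copy of the block.

class Cache:
    def __init__(self, name):
        self.name = name                    # processor label, for identification only
        self.lines = {}                     # address -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:          # miss: fetch the value from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value, memory, bus):
        self.lines[addr] = value
        memory[addr] = value                # write-through, for simplicity
        bus.broadcast_write(addr, self)     # other caches snoop this write

    def snoop_write(self, addr):
        self.lines.pop(addr, None)          # invalidate our stale copy, if any

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def broadcast_write(self, addr, writer):
        for c in self.caches:
            if c is not writer:
                c.snoop_write(addr)

memory = {"X": 24}
p1, p2, p3 = Cache("P1"), Cache("P2"), Cache("P3")
bus = Bus([p1, p2, p3])

p1.read("X", memory)                        # P1 caches 24
p2.read("X", memory)                        # P2 caches 24
p1.write("X", 64, memory, bus)              # P2's stale copy is invalidated
print(p2.read("X", memory))                 # 64, re-fetched after invalidation
print(p3.read("X", memory))                 # 64

Because processor 2's copy was invalidated by the snooped write, its next read misses and re-fetches the up-to-date value 64, so no processor continues to see the stale 24.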