
COMPUTER ORGANIZATION AND ARCHITECTURE

UNIT- V
Pipelining: Basic concepts of pipelining, Arithmetic pipeline, Instruction pipeline, Instruction Hazards.
Parallel Processors: Introduction to parallel processors, Multiprocessor, Interconnection structures and
Cache coherency.
---------------------------------------------------------------------------------------------------------------------------------

Pipelining: Basic concepts of pipelining:

Pipelining is the process of feeding instructions to the processor through a pipeline. It allows storing and executing instructions in an orderly, overlapped manner. It is also known as pipeline processing.

Pipelining is a technique in which multiple instructions are overlapped during execution. The pipeline is divided into stages, and these stages are connected to one another to form a pipe-like structure. Instructions enter at one end and exit at the other.

Pipelining increases the overall instruction throughput.

In a pipeline system, each segment consists of an input register followed by a combinational circuit. The register holds the data and the combinational circuit performs operations on it. The output of the combinational circuit is applied to the input register of the next segment.

A pipeline system is like a modern-day assembly line in a factory. For example, in a car manufacturing plant, huge assembly lines are set up with robotic arms performing specific tasks at each station, after which the car moves on to the next station.

Types of Pipeline

Pipelines are divided into two categories:

1. Arithmetic Pipeline
2. Instruction Pipeline

Arithmetic Pipeline
Arithmetic Pipelines are mostly used in high-speed computers. They are used to implement floating-point
operations, multiplication of fixed-point numbers, and similar computations encountered in scientific
problems.

To understand the concepts of arithmetic pipeline in a more convenient way, let us consider an example of
a pipeline unit for floating-point addition and subtraction.

The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers defined as:

X = A * 2^a
Y = B * 2^b

where A and B are two fractions that represent the mantissas and a and b are the exponents. For clarity, the worked example below uses decimal numbers:

X = 0.9504 * 10^3
Y = 0.8200 * 10^2

The combined operation of floating-point addition and subtraction is divided into four segments. Each
segment contains the corresponding sub operation to be performed in the given pipeline. The sub
operations that are shown in the four segments are:

1. Compare the exponents by subtraction.


2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.

We will discuss each sub operation in a more detailed manner later in this section.

The following block diagram represents the sub operations performed in each segment of the pipeline.
Note: Registers are placed after each suboperation to store the intermediate results.

1. Compare exponents by subtraction:

The exponents are compared by subtracting them to determine their difference. The larger exponent is
chosen as the exponent of the result.

The difference of the exponents, i.e., 3 - 2 = 1 determines how many times the mantissa associated with the
smaller exponent must be shifted to the right.

2. Align the mantissas:

The mantissa associated with the smaller exponent is shifted according to the difference of exponents
determined in segment one.
X = 0.9504 * 10^3
Y = 0.08200 * 10^3

3. Add mantissas:

The two mantissas are added in segment three.


Z = X + Y = 1.0324 * 10^3
4. Normalize the result:

After normalization, the result is written as:

Z = 0.10324 * 10^4
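The four segments can also be mirrored in software. Below is a minimal Python sketch (not from the source; the function names are illustrative) of the same four suboperations, using decimal (mantissa, exponent) pairs as in the example above. Real hardware would operate on binary fields, with latch registers between the stages.

def compare_exponents(a_exp, b_exp):
    """Segment 1: subtract exponents; the larger becomes the result exponent."""
    return max(a_exp, b_exp), a_exp - b_exp

def align_mantissas(a_man, b_man, diff):
    """Segment 2: shift the mantissa with the smaller exponent right by |diff| digits."""
    if diff > 0:
        b_man /= 10 ** diff
    elif diff < 0:
        a_man /= 10 ** (-diff)
    return a_man, b_man

def add_mantissas(a_man, b_man):
    """Segment 3: add (or subtract) the aligned mantissas."""
    return a_man + b_man

def normalize(man, exp):
    """Segment 4: renormalize the mantissa into the fraction range [0.1, 1)."""
    while abs(man) >= 1.0:
        man, exp = man / 10.0, exp + 1
    while 0 < abs(man) < 0.1:
        man, exp = man * 10.0, exp - 1
    return man, exp

# X = 0.9504 * 10^3, Y = 0.8200 * 10^2
exp, diff = compare_exponents(3, 2)
a, b = align_mantissas(0.9504, 0.8200, diff)    # Y becomes 0.08200 * 10^3
man, exp = normalize(add_mantissas(a, b), exp)  # 1.0324 * 10^3 -> 0.10324 * 10^4
print(f"Z = {man:.5f} * 10^{exp}")              # Z = 0.10324 * 10^4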

Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well.

Most digital computers with complex instructions require an instruction pipeline to carry out operations such as fetching, decoding, and executing instructions.

In general, the computer needs to process each instruction with the following sequence of steps.

1. Fetch instruction from memory.


2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

Each step is executed in a particular segment, and there are times when different segments may take
different times to operate on the incoming information. Moreover, there are times when two or more
segments may require memory access at the same time, causing one segment to wait until another is
finished with the memory.

The organization of an instruction pipeline will be more efficient if the instruction cycle is divided into
segments of equal duration. One of the most common examples of this type of organization is a Four-
segment instruction pipeline.

A four-segment instruction pipeline combines two or more of these steps into a single segment. For instance, the decoding of the instruction can be combined with the calculation of the effective address into one segment.

The following block diagram shows a typical example of a four-segment instruction pipeline. The
instruction cycle is completed in four segments.
Segment 1:
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:
The instruction fetched from memory is decoded in the second segment, and eventually, the effective
address is calculated in a separate arithmetic circuit.
Segment 3:
An operand from memory is fetched in the third segment.
Segment 4:
The instructions are finally executed in the last segment of the pipeline organization.
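To see how the four segments overlap in time, here is a small illustrative Python sketch (not from the source) that prints the space-time diagram of an ideal four-segment pipeline. FI, DA, FO, and EX abbreviate the four segments described above, and no hazards are modelled.

STAGES = ["FI", "DA", "FO", "EX"]  # fetch, decode/address, fetch operand, execute

def space_time(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    print("Instr/Cycle " + " ".join(f"{c:>3}" for c in range(1, n_cycles + 1)))
    for i in range(n_instructions):
        row = [" . "] * n_cycles             # '.' marks an idle cell
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:>3}"        # instruction i enters in cycle i + 1
        print(f"I{i + 1:<11}" + " ".join(row))

space_time(5)  # five instructions complete in 5 + 4 - 1 = 8 cycles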

Pipeline Conflicts

There are some factors that cause the pipeline to deviate from its normal performance. Some of these factors are given below:

1. Timing Variations

All stages may not take the same amount of time. This problem generally occurs in instruction processing, where different instructions have different operand requirements and thus different processing times.

2. Data Hazards

When several instructions are in partial execution and they reference the same data, a problem arises. We must ensure that the next instruction does not attempt to access the data before the current instruction has finished with it, because this would lead to incorrect results.

3. Branching

In order to fetch and execute the next instruction, we must know what that instruction is. If the present instruction is a conditional branch, its outcome determines the next instruction, so the next instruction may not be known until the current one is processed.

4. Interrupts

Interrupts insert unwanted instructions into the instruction stream and affect the execution of instructions.

5. Data Dependency

It arises when an instruction depends upon the result of a previous instruction but this result is not yet
available.

Advantages of Pipelining

1. The cycle time of the processor is reduced.
2. It increases the throughput of the system.
3. It makes the system more reliable.

Disadvantages of Pipelining

1. The design of a pipelined processor is complex and costly to manufacture.
2. Instruction latency increases.
Instruction Hazards:

Dependencies in a pipelined processor

There are mainly three types of dependencies possible in a pipelined processor. These are :
1) Structural Dependency
2) Control Dependency
3) Data Dependency

These dependencies may introduce stalls in the pipeline.

Stall : A stall is a cycle in the pipeline without new input.

Structural dependency

This dependency arises due to the resource conflict in the pipeline. A resource conflict is a situation when
more than one instruction tries to access the same resource in the same cycle. A resource can be a register,
memory, or ALU.

Example:

Instruction/Cycle  1         2         3         4         5
I1                 IF(Mem)   ID        EX        Mem
I2                           IF(Mem)   ID        EX
I3                                     IF(Mem)   ID        EX
I4                                               IF(Mem)   ID

In the above scenario, in cycle 4, instructions I1 and I4 try to access the same resource (memory), which introduces a resource conflict.
To avoid this problem, we have to make the instruction wait until the required resource (memory, in our case) becomes available. This wait introduces stalls in the pipeline, as shown below:
Instruction/Cycle  1         2         3         4         5         6         7         8
I1                 IF(Mem)   ID        EX        Mem       WB
I2                           IF(Mem)   ID        EX        Mem       WB
I3                                     IF(Mem)   ID        EX        Mem       WB
I4                                               –         –         –         IF(Mem)

Solution for structural dependency


To minimize structural dependency stalls in the pipeline, we use a hardware mechanism called renaming.
Renaming: the memory is divided into two independent modules, called code memory (CM) and data memory (DM), which store instructions and data separately. CM contains all the instructions and DM contains all the operands required by the instructions.
Instruction/Cycle  1         2         3         4         5         6         7
I1                 IF(CM)    ID        EX        DM        WB
I2                           IF(CM)    ID        EX        DM        WB
I3                                     IF(CM)    ID        EX        DM        WB
I4                                               IF(CM)    ID        EX        DM
I5                                                         IF(CM)    ID        EX
I6                                                                   IF(CM)    ID
I7                                                                             IF(CM)
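The effect of renaming can be illustrated with a rough Python sketch (an assumed toy model, not a hardware description): with a unified memory, an instruction's fetch must wait whenever an earlier instruction's Mem stage owns the single port; with split code/data memories, the conflict disappears.

def fetch_stalls(n_instr, unified=True, mem_offset=3):
    # Issue one instruction per cycle; with a unified memory, an instruction's
    # IF must wait while an earlier instruction's Mem stage owns the port.
    issue = []          # cycle in which each instruction performs IF
    mem_busy = set()    # cycles in which some instruction occupies the Mem stage
    cycle = 1
    for _ in range(n_instr):
        while unified and cycle in mem_busy:
            cycle += 1                    # structural stall
        issue.append(cycle)
        mem_busy.add(cycle + mem_offset)  # IF in cycle c -> Mem in cycle c + 3
        cycle += 1
    return issue

print(fetch_stalls(4, unified=True))   # [1, 2, 3, 7]: I4's fetch stalls to cycle 7
print(fetch_stalls(4, unified=False))  # [1, 2, 3, 4]: split memories, no stalls

The two outputs match the two tables above: I4 fetches in cycle 7 with the single memory port, and in cycle 4 once code and data memories are separated.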

Control Dependency (Branch Hazards)


This type of dependency occurs with transfer-of-control instructions such as BRANCH, CALL, JMP, etc. On many architectures, the processor does not know the target address of these instructions at the time it must insert the next instruction into the pipeline. As a result, unwanted instructions are fed into the pipeline.

Consider the following sequence of instructions in the program:


100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1

Expected output: I1 -> I2 -> BI1

NOTE: Generally, the target address of the JMP instruction is known only after the ID stage.
Instruction/Cycle  1     2     3              4     5     6
I1                 IF    ID    EX             MEM   WB
I2                       IF    ID (PC: 250)   EX    Mem   WB
I3                             IF             ID    EX    Mem
BI1                                           IF    ID    EX

Output Sequence: I1 -> I2 -> I3 -> BI1


So the output sequence is not equal to the expected output, which means the pipeline is not implemented correctly.

To correct the above problem, we need to stop instruction fetch until the target address of the branch instruction is known. This can be implemented by introducing a delay slot until the target address becomes available.
Instruction/Cycle  1     2     3              4     5     6
I1                 IF    ID    EX             MEM   WB
I2                       IF    ID (PC: 250)   EX    Mem   WB
Delay              –     –     –              –     –     –
BI1                                           IF    ID    EX

Output Sequence: I1 -> I2 -> Delay (Stall) -> BI1

As the delay slot performs no operation, this output sequence is equal to the expected output sequence. But the slot introduces a stall in the pipeline.

Solution for control dependency

Branch prediction is the method through which stalls due to control dependency can be eliminated. In branch prediction, the outcome of the branch is predicted in the first stage itself, so that instruction fetch can continue along the predicted path. When the prediction is correct, the branch penalty is zero.

Branch penalty : The number of stalls introduced during the branch operations in the pipelined processor
is known as branch penalty.

NOTE: As we saw, the target address becomes available after the ID stage, so the number of stalls introduced in the pipeline is 1. If the branch target address were available only after the EX (ALU) stage, there would have been 2 stalls. In general, if the target address becomes available after the k-th stage, there will be (k – 1) stalls in the pipeline.

Total number of stalls introduced in the pipeline due to branch instructions = Branch frequency * Branch
Penalty
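As a worked example of this formula (the numbers below are assumed for illustration, not taken from the source): with a branch frequency of 20%, a one-cycle branch penalty, and 1000 instructions,

branch_frequency = 0.20   # fraction of instructions that are branches (assumed)
branch_penalty = 1        # stalls per branch: target known after ID, so k - 1 = 1
instructions = 1000

stall_cycles = instructions * branch_frequency * branch_penalty
print(stall_cycles)       # 200.0 stall cycles added by branches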

Data Dependency (Data Hazard)


Let us consider an ADD instruction S, such that
S : ADD R1, R2, R3
Addresses read by S = I(S) = {R2, R3}
Addresses written by S = O(S) = {R1}

Now, we say that instruction S2 depends on instruction S1 when

[I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ ∅

This condition is called the Bernstein condition.

Three cases exist (a small sketch checking them follows this list):

 Flow (data) dependence: O(S1) ∩ I(S2) ≠ ∅, S1 → S2, and S1 writes something that S2 later reads
 Anti-dependence: I(S1) ∩ O(S2) ≠ ∅, S1 → S2, and S1 reads something before S2 overwrites it
 Output dependence: O(S1) ∩ O(S2) ≠ ∅, S1 → S2, and both write the same memory location.
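These intersection tests translate directly into Python sets. The sketch below is illustrative; I(S) and O(S) are the read and write sets defined above.

def depends(I1, O1, I2, O2):
    # Return the dependences that prevent S1 and S2 from running in parallel.
    deps = []
    if O1 & I2: deps.append("flow (RAW)")    # S1 writes, S2 later reads
    if I1 & O2: deps.append("anti (WAR)")    # S1 reads, S2 later overwrites
    if O1 & O2: deps.append("output (WAW)")  # both write the same location
    return deps                              # empty list = Bernstein satisfied

# S1: ADD R1, R2, R3 -> I(S1) = {R2, R3}, O(S1) = {R1}
# S2: SUB R4, R1, R2 -> I(S2) = {R1, R2}, O(S2) = {R4}
print(depends({"R2", "R3"}, {"R1"}, {"R1", "R2"}, {"R4"}))  # ['flow (RAW)']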

Example: Let there be two instructions I1 and I2 such that:


I1 : ADD R1, R2, R3
I2 : SUB R4, R1, R2

When the above instructions are executed in a pipelined processor, a data dependency arises: I2 tries to read R1 before I1 has written it, so I2 incorrectly gets the old value of R1.
Instruction/Cycle  1     2     3                4
I1                 IF    ID    EX               DM
I2                       IF    ID (old value)   EX

To minimize data dependency stalls in the pipeline, operand forwarding is used.

Operand Forwarding: In operand forwarding, we use the interface registers present between the stages to hold intermediate output, so that a dependent instruction can access the new value from the interface register directly.

Considering the same example:


I1 : ADD R1, R2, R3
I2 : SUB R4, R1, R2
Instruction/Cycle  1     2     3     4
I1                 IF    ID    EX    DM
I2                       IF    ID    EX
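The forwarding path can be sketched in Python as follows. This is a toy model (the EX/MEM interface register is a dict, and the cycle comments are illustrative), not a description of real bypass hardware; register names follow the ADD/SUB example above.

regfile = {"R1": 0, "R2": 5, "R3": 7, "R4": 0}
ex_mem = {}  # interface register between EX and MEM: {dest: value}

def read_operand(reg):
    # Prefer the forwarded value in the interface register over the register file.
    return ex_mem.get(reg, regfile[reg])

# Cycle n:   I1 (ADD R1, R2, R3) executes; result parked in EX/MEM, not yet in R1.
ex_mem["R1"] = read_operand("R2") + read_operand("R3")   # 12

# Cycle n+1: I2 (SUB R4, R1, R2) executes; forwarding supplies R1 = 12.
regfile["R4"] = read_operand("R1") - read_operand("R2")  # 12 - 5 = 7

# Later, I1's writeback finally commits R1 to the register file.
regfile["R1"] = ex_mem.pop("R1")
print(regfile)  # {'R1': 12, 'R2': 5, 'R3': 7, 'R4': 7}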

Data Hazards

Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Hazards cause delays in the pipeline. There are mainly three types of data hazards:
1) RAW (Read after Write) [Flow/True data dependency]
2) WAR (Write after Read) [Anti-Data dependency]
3) WAW (Write after Write) [Output data dependency]

Let there be two instructions I and J, such that J follows I. Then,

 RAW hazard occurs when instruction J tries to read data before instruction I writes it.
Eg:
I: R2 <- R1 + R3
J: R4 <- R2 + R3
 WAR hazard occurs when instruction J tries to write data before instruction I reads it.
Eg:
I: R2 <- R1 + R3
J: R3 <- R4 + R5
 WAW hazard occurs when instruction J tries to write output before instruction I writes it.
Eg:
I: R2 <- R1 + R3
J: R2 <- R4 + R5

WAR and WAW hazards occur during the out-of-order execution of the instructions.
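For completeness, here is a small illustrative Python sketch that scans a toy three-address instruction list and reports every pairwise RAW, WAR, and WAW hazard. The instruction format and parser are assumptions for illustration only.

def parse(instr):
    # "dest <- src1 + src2" -> (dest, {operands read})
    dest, expr = instr.split("<-")
    return dest.strip(), set(expr.replace("+", " ").replace("-", " ").split())

def hazards(program):
    found = []
    parsed = [parse(i) for i in program]
    for i in range(len(parsed)):
        for j in range(i + 1, len(parsed)):
            (di, ri), (dj, rj) = parsed[i], parsed[j]
            if di in rj: found.append((i, j, "RAW"))  # j reads what i wrote
            if dj in ri: found.append((i, j, "WAR"))  # j overwrites what i read
            if di == dj: found.append((i, j, "WAW"))  # both write the same register
    return found

prog = ["R2 <- R1 + R3",
        "R4 <- R2 + R3",
        "R3 <- R4 + R5",
        "R2 <- R4 + R5"]
print(hazards(prog))  # includes (0, 1, 'RAW'), (0, 2, 'WAR'), (0, 3, 'WAW'), ...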

Parallel Processors:

Introduction to parallel processors:


Parallel processing can be described as a class of techniques that enable a system to carry out simultaneous data-processing tasks in order to increase its computational speed.
A parallel processing system can carry out simultaneous data-processing to achieve faster execution time.
For instance, while an instruction is being processed in the ALU component of the CPU, the next
instruction can be read from memory.
The primary purpose of parallel processing is to enhance the computer processing capability and increase
its throughput, i.e. the amount of processing that can be accomplished during a given interval of time.
A parallel processing system can be achieved by having a multiplicity of functional units that perform
identical or different operations simultaneously. The data can be distributed among various multiple
functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units
operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram:
 The adder and integer multiplier perform arithmetic operations on integer numbers.
 The floating-point operations are separated into three circuits operating in parallel.
 The logic, shift, and increment operations can be performed concurrently on different data. All units
are independent of each other, so one number can be shifted while another number is being
incremented.

1.Multiprocessor:
A multiprocessor is a computer system in which two or more central processing units (CPUs) share full access to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed; other objectives are fault tolerance and application matching.

There are two types of multiprocessors: shared memory multiprocessors and distributed memory multiprocessors. In a shared memory multiprocessor, all CPUs share a common memory, whereas in a distributed memory multiprocessor, every CPU has its own private memory.
Applications of Multiprocessor –

1. As a uniprocessor, such as single instruction, single data stream (SISD).


2. As a multiprocessor, such as single instruction, multiple data stream (SIMD), which is usually used
for vector processing.
3. Multiple series of instructions on a single data stream, such as multiple instruction, single data stream (MISD), which is used to describe hyper-threaded or pipelined processors.
4. Inside a single system for executing multiple, individual series of instructions on multiple data streams, such as multiple instruction, multiple data stream (MIMD).

Benefits of using a Multiprocessor –

 Enhanced performance.
 Multiple applications.
 Multi-tasking inside an application.
 High throughput and responsiveness.
 Hardware sharing among CPUs.

Flynn's Classification of Computers

M.J. Flynn proposed a classification for the organization of a computer system by the number of
instructions and data items that are manipulated simultaneously.

The sequence of instructions read from memory constitutes an instruction stream.

The operations performed on the data in the processor constitute a data stream.

Note: The term 'Stream' refers to the flow of instructions or data.


Parallel processing may occur in the instruction stream, in the data stream, or both.

Flynn's classification divides computers into four major groups that are:

1. Single instruction stream, single data stream (SISD)


2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction stream, single data stream (MISD)
4. Multiple instruction stream, multiple data stream (MIMD)

Parallel computing is a form of computing in which jobs are broken into discrete parts that can be executed concurrently. Each part is further broken down into a series of instructions, and instructions from each part execute simultaneously on different CPUs. Parallel systems deal with the simultaneous use of multiple computer resources, which can include a single computer with multiple processors, a number of computers connected by a network to form a parallel processing cluster, or a combination of both.
Parallel systems are more difficult to program than computers with a single processor because the
architecture of parallel computers varies accordingly and the processes of multiple CPUs must be
coordinated and synchronized.
At the crux of parallel processing are the CPUs. Based on the number of instruction and data streams that can be processed simultaneously, computing systems are classified into four major categories:

Flynn’s classification –

1. Single-instruction, single-data (SISD) systems:
An SISD computing system is a uniprocessor machine which is capable of executing a single
instruction, operating on a single data stream. In SISD, machine instructions are processed in a
sequential manner and computers adopting this model are popularly called sequential computers.
Most conventional computers have SISD architecture. All the instructions and data to be processed
have to be stored in primary memory.

The speed of the processing element in the SISD model is limited by the rate at which the computer can transfer information internally. Dominant representative SISD systems are the IBM PC and workstations.

2. Single-instruction, multiple-data (SIMD) systems:
An SIMD system is a multiprocessor machine capable of executing the same instruction on all the CPUs while operating on different data streams. Machines based on the SIMD model are well suited to scientific computing, since it involves many vector and matrix operations. So that the information can be passed to all the processing elements (PEs), the data elements of vectors are divided into multiple sets (N sets for an N-PE system), and each PE processes one data set.

A dominant representative SIMD system is Cray's vector processing machine.

3. Multiple-instruction, single-data (MISD) systems:
An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, with all of them operating on the same data set.

Example: Z = sin(x) + cos(x) + tan(x)
The system performs different operations on the same data set. Machines built using the MISD model are not useful in most applications; a few such machines have been built, but none of them is available commercially.

4. Multiple-instruction, multiple-data (MIMD) systems:
An MIMD system is a multiprocessor machine capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore, machines built using this model are suited to any kind of application. Unlike SIMD and MISD machines, the PEs in MIMD machines work asynchronously.
MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory
MIMD based on the way PEs are coupled to the main memory.

In the shared memory MIMD model (tightly coupled multiprocessor systems), all the PEs are connected to a single global memory and all have access to it. Communication between PEs in this model takes place through the shared memory; a modification of the data stored in the global memory by one PE is visible to all other PEs.

Dominant representative shared memory MIMD systems are Silicon Graphics machines and
Sun/IBM’s SMP (Symmetric Multi-Processing).
In distributed memory MIMD machines (loosely coupled multiprocessor systems), all PEs have a local memory. Communication between PEs in this model takes place through the interconnection network (the inter-process communication channel, or IPC). The network connecting the PEs can be configured as a tree, mesh, or other topology in accordance with the requirement.

The shared-memory MIMD architecture is easier to program but is less tolerant of failures and harder to extend than the distributed-memory MIMD model. Failures in a shared-memory MIMD system affect the entire system, whereas this is not the case in the distributed model, in which each of the PEs can be easily isolated. Moreover, shared-memory MIMD architectures are less likely to scale, because the addition of more PEs leads to memory contention.

Interconnection Structures:
The interconnection between the components of a multiprocessor system can have different physical configurations, depending on the number of transfer paths available between the processors and memory in a shared memory system, and among the processing elements in a loosely coupled system. Some of the schemes are:
1. Time-Shared Common Bus
2. Multiport Memory
3. Crossbar Switch
4. Multistage Switching Network
5. Hypercube System
Time-Shared Common Bus

In this scheme, all processors (and memory modules) are connected to a common bus or buses. A bus is a collection of signal lines that carry module-to-module communication, a data highway connecting several digital system elements. Memory access through the bus is fairly uniform, but the scheme is not very scalable.

In this organization, each local bus is connected to its own local memory and to one or more processors. Each local bus may be connected to a CPU, an IOP, or any combination of processors. A system bus controller links each local bus to a common system bus. The I/O devices connected to a local IOP, as well as the local memory, are available to the local processor. The memory connected to the common system bus is shared by all processors. If an IOP is connected directly to the system bus, the I/O devices attached to it may be made available to all processors.
Disadvantages:
• Only one processor can communicate with the memory or with another processor at any given time.
• As a consequence, the total overall transfer rate within the system is limited by the speed of the single path.

A multiport memory system employs separate buses between each memory module and each CPU. A processor bus comprises the address, data, and control lines necessary to communicate with memory. Each memory module connects to each processor bus, and the module must have internal control logic to determine which port has access to memory at any given time.

A memory module can be said to have four ports, with each port accommodating one of the buses. Memory access conflicts are resolved by assigning fixed priorities to each memory port: the priority for memory access associated with each processor is established by the physical port position that its bus occupies in each module. Thus CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest priority (a small arbitration sketch follows the lists below).

Advantage:
 A high transfer rate can be achieved because of the multiple paths.
Disadvantages:
 It requires expensive memory control logic and a large number of cables and connectors.
 It is only suitable for systems with a small number of processors.
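The fixed-priority resolution described above amounts to a very small arbiter: among all ports requesting a module in a given cycle, the lowest-numbered port wins. A minimal Python sketch (the port numbering is illustrative, not from the source):

def grant(requests):
    # requests: set of CPU numbers contending for one memory module this cycle
    return min(requests) if requests else None  # lowest number = highest priority

print(grant({2, 4}))     # 2 -> CPU 2 wins over CPU 4
print(grant({1, 2, 3}))  # 1 -> CPU 1 always has the highest priority
print(grant(set()))      # None -> module idle this cycle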

Cache coherence :
In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of
the memory hierarchy.

In a shared memory multiprocessor with a separate cache memory for each processor, it is possible to have
many copies of any one instruction operand: one copy in the main memory and one in each cache memory.
When one copy of an operand is changed, the other copies of the operand must be changed also.

Example :
Cache and the main memory may have inconsistent copies of the same object.

Suppose there are three processors, each with a cache, and consider the following scenario:
 Processor 1 reads X: obtains 24 from memory and caches it.
 Processor 2 reads X: obtains 24 from memory and caches it.
 Processor 1 writes X = 64: its locally cached copy is updated. Now processor 3 reads X; what value should it get?
 Memory and processor 2 think it is 24, while processor 1 thinks it is 64.

Because multiple processors operate in parallel and, independently, multiple caches may hold different copies of the same memory block, a cache coherence problem arises.
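The scenario above can be reproduced with a deliberately naive Python model (illustrative only): each processor's cache is a private dict, writes stay in the writer's cache, and the stale read appears exactly as described.

memory = {"X": 24}
caches = [dict(), dict(), dict()]  # private caches of P1, P2, P3

def read(p, addr):
    if addr not in caches[p]:      # miss: fill from main memory
        caches[p][addr] = memory[addr]
    return caches[p][addr]

def write(p, addr, value):
    caches[p][addr] = value        # write-back: memory not updated yet

read(0, "X")         # P1 caches X = 24
read(1, "X")         # P2 caches X = 24
write(0, "X", 64)    # P1 updates only its own copy
print(read(2, "X"))  # 24 -> P3 reads the stale value from memory
print(read(0, "X"))  # 64 -> P1 sees its own new value: incoherent views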

Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated
throughout the system in a timely fashion.

There are three distinct levels of cache coherence:

1. Every write operation appears to occur instantaneously.


2. All processors see exactly the same sequence of changes of values for each separate operand.
3. Different processors may see an operation and assume different sequences of values; this is known
as non-coherent behavior.

There are various Cache Coherence Protocols in multiprocessor system. These are :-

1. MSI protocol (Modified, Shared, Invalid)


2. MOSI protocol (Modified, Owned, Shared, Invalid)
3. MESI protocol (Modified, Exclusive, Shared, Invalid)
4. MOESI protocol (Modified, Owned, Exclusive, Shared, Invalid)

These important terms are discussed as follows (a protocol sketch follows the list):

 Modified –
It means that the value in the cache is dirty, i.e., the value in the current cache differs from that in main memory.
 Exclusive –
It means that the value in the cache is the same as that in main memory, i.e., the value is clean.
 Shared –
It means that the cache holds the most recent copy of the data, which is shared among all the caches and main memory as well.
 Owned –
It means that the current cache holds the block and is the owner of that block, i.e., it has all rights to that particular block.
 Invalid –
This states that the current cache block is invalid, and the data must be fetched from another cache or from main memory.
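Building on the naive model above, the following sketch adds an MSI-style write-invalidate protocol (Modified, Shared, Invalid). It is an illustrative simplification, with bus transactions modelled as direct function calls, and is an assumption rather than a description of any particular hardware.

memory = {"X": 24}
caches = [dict(), dict(), dict()]            # each line: addr -> (state, value)

def read(p, addr):
    if addr not in caches[p]:                # read miss: fetch a Shared copy
        for c in caches:                     # a Modified owner must supply it
            if addr in c and c[addr][0] == "M":
                memory[addr] = c[addr][1]    # write the dirty value back
                c[addr] = ("S", c[addr][1])  # owner downgrades to Shared
        caches[p][addr] = ("S", memory[addr])
    return caches[p][addr][1]

def write(p, addr, value):
    for q, c in enumerate(caches):           # snoop: invalidate all other copies
        if q != p:
            c.pop(addr, None)
    caches[p][addr] = ("M", value)           # writer's line becomes Modified

read(0, "X"); read(1, "X")  # P1 and P2 hold X = 24 in state S
write(0, "X", 64)           # P1 -> M; P2's copy is invalidated
print(read(2, "X"))         # 64 -> P3's miss obtains the fresh value from P1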
Coherency mechanisms:
There are three types of coherence mechanisms:

1. Directory-based –
In a directory-based system, the data being shared is placed in a common directory that maintains
the coherence between caches. The directory acts as a filter through which the processor must ask
permission to load an entry from the primary memory to its cache. When an entry is changed, the
directory either updates or invalidates the other caches with that entry.
2. Snooping –
First introduced in 1983, snooping is a process where the individual caches monitor address lines for accesses to memory locations that they have cached. In a write-invalidate protocol, when a write operation is observed to a location of which a cache holds a copy, the cache controller invalidates its own copy of the snooped memory location.
3. Snarfing –
It is a mechanism where a cache controller watches both the address and data lines in an attempt to update its own copy of a memory location when a second master modifies main memory. When a write operation is observed to a location of which a cache holds a copy, the cache controller updates its own copy of the snarfed memory location with the new data.
