UNIT – III

Pipelining: Basic concepts of pipelining, throughput and speedup, pipeline hazards.


Parallel Processors: Introduction to parallel processors, Concurrent access to memory and cache
coherency.

Basic concepts of pipelining:


The performance of a computer can be increased by increasing the performance of the CPU.
One way to do this is to work on more than one task at a time; this technique is referred to as pipelining.
The idea of pipelining is to allow the processing of a new task to begin even though the processing of the
previous task has not yet ended.
Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess
being executed in a special dedicated segment that operates concurrently with all other segments. A
pipeline can be visualized as a collection of processing segments through which binary information
flows. Each segment performs partial processing dictated by the way the task is partitioned. The result
obtained from the computation in each segment is transferred to the next segment in the pipeline. The
final result is obtained after the data have passed through all segments.

Consider the following operation: Result=(A+B)*C


First, the values of A and B are fetched from memory; this is simply a “Fetch Operation”.
The results of these fetch operations are given as inputs to the addition, which is an arithmetic
operation.
Next, the operand C is fetched from memory, and the sum (A+B) together with C is used in another
arithmetic operation, a multiplication in this scenario.
Finally, the product is stored back into the “Result” variable.

In this process the operation is divided among the following pipeline segments (a short code sketch of this division follows the list):


Fetch Operation (A), Fetch Operation(B)
Addition of (A & B), Fetch Operation(C)
Multiplication of ((A+B), C)
Load ( (A+B)*C)
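As a rough illustration (our own sketch, not from the original text), the division of work can be written in Python, using hypothetical registers R1 to R5 to stand for the segment registers that hold the intermediate results:

# Illustrative sketch only: each assignment stands for the work of one pipeline segment,
# whose result is latched into a register so the next set of operands can enter behind it.
def compute_result(A, B, C):
    R1, R2 = A, B          # Segment 1: Fetch Operation (A), Fetch Operation (B)
    R3 = R1 + R2           # Segment 2: Addition of (A & B) ...
    R4 = C                 #            ... Fetch Operation (C) in the same segment
    R5 = R3 * R4           # Segment 3: Multiplication of ((A+B), C)
    return R5              # Segment 4: Load/store of (A+B)*C into Result

print(compute_result(2, 3, 4))   # (2 + 3) * 4 = 20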



Now consider the case where a k-segment pipeline with a clock cycle time t is used to execute n tasks.
The first task T1 requires a time equal to k·t to complete its operation, since there are k segments in the
pipe. The remaining n − 1 tasks emerge from the pipe at the rate of one task per clock cycle, and they will
be completed after a time equal to (n − 1)·t. Therefore, to complete
n tasks using a k-segment pipeline requires k + (n − 1) clock cycles. For example, the figure
shows four segments and six tasks.
The time required to complete all the operations is 4 + (6 − 1) = 9 clock cycles, as indicated in the
diagram.
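A quick sanity check of this count, as a small Python sketch (the function name is ours, not from the text):

def pipeline_cycles(k, n):
    """Clock cycles needed to finish n tasks on a k-segment pipeline."""
    return k + (n - 1)    # k cycles to fill the pipe, then one result per clock cycle

print(pipeline_cycles(4, 6))   # 9 clock cycles, matching the four-segment, six-task example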

Throughput and Speedup


Parallel processing is a term used to denote a large class of techniques that are used to provide
simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer
system. The purpose of parallel processing is to speed up the computer processing capability and
increase its throughput.
Throughput: the amount of processing that can be accomplished during a given interval of time. The
amount of hardware increases with parallel processing, and with it the cost of the system increases.
However, technological developments have reduced hardware costs to the point where parallel
processing techniques are economically feasible.
Speedup of pipeline processing: The speedup of pipeline processing over an equivalent nonpipelined
processing is defined by the ratio
S = Tseq / Tpipe = n*k / (k + n − 1)

The maximum speedup, also called the ideal speedup, of a pipelined processor with k stages over an
equivalent nonpipelined processor is k. In other words, the ideal speedup is equal to the number of
pipeline stages. That is, when n is very large, a pipelined processor can produce output approximately k
times faster than a nonpipelined processor. When n is small, the speedup decreases.
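The behaviour of this ratio can be checked with a small Python sketch (illustrative only; like the formula above, it assumes equal clock cycle times for the pipelined and nonpipelined units):

def speedup(k, n):
    t_seq = n * k           # nonpipelined: each of the n tasks takes k cycles
    t_pipe = k + (n - 1)    # pipelined: fill time plus one task per cycle thereafter
    return t_seq / t_pipe

for n in (1, 4, 100, 10000):
    print(n, round(speedup(4, n), 2))
# prints 1.0, 2.29, 3.88 and about 4.0 -- the speedup approaches k = 4 as n grows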

Pipeline Hazards
There are situations in pipelining when the next instruction cannot execute in the
following clock cycle. These events are called hazards, and there are three different
types.
Structural Hazards
The first hazard is called a structural hazard. It means that the hardware cannot support the
combination of instructions that we want to execute in the same clock cycle. A structural hazard in the
laundry room would occur if we used a washer-dryer combination instead of a separate washer and
dryer, or if our roommate was busy doing something else and wouldn't put clothes away. Our carefully
scheduled pipeline plans would then be foiled.
As we said above, the MIPS instruction set was designed to be pipelined, making it fairly easy
for designers to avoid structural hazards when designing a pipeline. Suppose, however, that we had a
single memory instead of two memories. If the pipeline in Figure 4.27 had a fourth instruction, we
would see that in the same clock cycle the first instruction is accessing data from memory while the
fourth instruction is fetching an instruction from that same memory. Without two memories, our pipeline
could have a structural hazard.
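As a hedged sketch (our own illustration, not from the figure), the conflict can be counted for a classic five-stage pipeline in which instruction i reaches its MEM stage in the same cycle that instruction i+3 is being fetched:

def structural_conflicts(is_memory_op):
    """is_memory_op[i] is True if instruction i reads or writes data memory.
    With a single shared memory, instruction i's data access (stage 4, cycle i+4)
    collides with the instruction fetch of instruction i+3 (stage 1, also cycle i+4)."""
    n = len(is_memory_op)
    return sum(1 for i in range(n) if is_memory_op[i] and i + 3 < n)

# Four instructions, the first of which accesses data memory: one fetch/data collision
print(structural_conflicts([True, False, False, False]))   # 1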

Data Hazards
Data hazards occur when the pipeline must be stalled because one step must wait for another to
complete. Suppose you found a sock at the folding station for which no match existed. One possible
strategy is to run down to your room and search through your clothes bureau to see if you can find the
match. Obviously, while you are doing the search, loads that have completed drying and are ready to
fold, as well as those that have finished washing and are ready to dry, must wait.
In a pipeline, data hazards arise from the dependence of one instruction on an earlier one that is still in
the pipeline (a relationship that does not really exist when doing laundry). For example, suppose we
have an add instruction followed immediately by a subtract instruction that uses the sum ($s0):
add $s0, $t0, $t1   # $s0 = $t0 + $t1
sub $t2, $s0, $t3   # uses the sum in $s0 before the add has written it back

Without intervention, a data hazard could severely stall the pipeline. The add instruction doesn't
write its result until the fifth stage, meaning that we would have to waste three clock cycles in the
pipeline. Although we could try to rely on compilers to remove all such hazards, the results would not be
satisfactory. These dependences happen just too often, and the delay is just too long to expect the
compiler to rescue us from this dilemma.
The primary solution is based on the observation that we don't need to wait for the instruction to
complete before trying to resolve the data hazard. For the code sequence above, as soon as the ALU
creates the sum for the add, we can supply it as an input for the subtract. Adding extra hardware to
retrieve the missing item early from the internal resources is called forwarding or bypassing.
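The effect can be made concrete with a small Python sketch (our own, under the usual five-stage assumption IF, ID, EX, MEM, WB, with instruction i reaching stage s in cycle s + i):

def stall_cycles(result_ready_stage, distance, forwarding):
    """Bubbles needed between a producer and a consumer `distance` instructions later."""
    if forwarding:
        # the consumer needs the value as it enters EX (stage 3)
        return max(0, result_ready_stage - (3 + distance) + 1)
    # without forwarding, the consumer's register read must follow write-back (stage 5)
    return max(0, 5 - (2 + distance) + 1)

# add produces its sum at the end of EX (stage 3); sub follows immediately (distance 1)
print(stall_cycles(3, 1, forwarding=False))   # 3 wasted cycles, as stated above
print(stall_cycles(3, 1, forwarding=True))    # 0 -- forwarding supplies the sum in time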
In this graphical representation of events, forwarding paths are valid only if the destination stage
is later in time than the source stage. For example, there cannot be a valid forwarding path from the
output of the memory access stage in the first instruction to the input of the execution stage of the
following, since that would mean going backward in time.



Forwarding cannot prevent all pipeline stalls, however. For example, suppose the first instruction were a
load of $s0 instead of an add. As we can imagine from looking at Figure 4.29, the desired data would be
available only after the fourth stage of the first instruction in the dependence, which is too late for the
input of the third stage of sub. Hence, even with forwarding, we would have to stall one stage for a
load-use data hazard, as Figure 4.30 shows. This figure shows an important pipeline concept, officially
called a pipeline stall, but often given the nickname bubble. We shall see stalls elsewhere in the
pipeline.
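A hedged sketch of the check a hazard-detection unit would make for this load-use case (the field names here are illustrative, not taken from the book):

def needs_load_use_stall(prev, curr):
    """True if `prev` is a load whose destination feeds either source of `curr`."""
    return prev["op"] == "lw" and prev["dest"] in (curr.get("src1"), curr.get("src2"))

lw_s0 = {"op": "lw",  "dest": "$s0"}
sub   = {"op": "sub", "dest": "$t2", "src1": "$s0", "src2": "$t3"}
print(needs_load_use_stall(lw_s0, sub))   # True -> insert one bubble before sub executes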

Control Hazards
The third type of hazard is called a control hazard, arising from the need to make a decision based on
the results of one instruction while others are executing. Suppose our laundry crew was given the happy
task of cleaning the uniforms of a football team. Given how filthy the laundry is, we need to determine
whether the detergent and water temperature setting we select is strong enough to get the uniforms clean
but not so strong that the uniforms wear out sooner. In our laundry pipeline, we have to wait until after
the second stage to examine the dry uniform to see if we need to change the washer setup or not. What
to do?
Here is the first of two solutions to control hazards in the laundry room and its computer equivalent.
Stall: Just operate sequentially until the first batch is dry and then repeat until you have the right
formula.
This conservative option certainly works, but it is slow.
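A rough sketch of what this conservative option costs (illustrative only; the branch frequency and the one-bubble penalty are assumptions, not figures from the text):

def cycles_with_branch_stalls(n_instructions, branch_fraction, bubbles_per_branch):
    ideal = n_instructions    # one instruction per clock once the pipeline is full
    return ideal + n_instructions * branch_fraction * bubbles_per_branch

# e.g. 1000 instructions, 17% branches, 1 bubble each -> about 17% slower than ideal
print(cycles_with_branch_stalls(1000, 0.17, 1))   # 1170.0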



Parallel Processors
Introduction to parallel processors:
Parallel processing is a term used to denote a large class of techniques that are used to provide
simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer
system. Instead of processing each instruction sequentially as in a conventional computer, a parallel
processing system is able to perform concurrent data processing to achieve faster execution time.
The purpose of parallel processing is to speed up the computer processing capability and increase
its throughput, that is, the amount of processing that can be accomplished during a given interval of
time. The amount of hardware increases with parallel processing and with it, the cost of the system
increases. However, technological developments have reduced hardware costs to the point where
parallel processing techniques are economically feasible.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we
distinguish between parallel and serial operations by the type of registers used. Shift registers operate in
serial fashion one bit at a time, while registers with parallel load operate with all the bits of the word
simultaneously.
Parallel processing at a higher level of complexity can be achieved by having a multiplicity of
functional units that perform identical or different operations simultaneously. Parallel processing is
established by distributing the data among the multiple functional units. For example, the arithmetic,
logic, and shift operations can be separated into three units and the operands diverted to each unit under
the supervision of a control unit.
Figure 9-1 shows one possible way of separating the execution unit into eight functional units
operating in parallel. The operands in the registers are applied to one of the units depending on the
operation specified by the instruction associated with the operands. The operation performed in each
functional unit is indicated in each block of the diagram. The adder and integer multiplier perform the
arithmetic operations with integer numbers.



There are a variety of ways that parallel processing can be classified. It can be considered from
the internal organization of the processors, from the interconnection structure between processors, or
from the flow of information through the system. One classification introduced by M. J. Flynn considers
the organization of a computer system by the number of instructions and data items that are manipulated
simultaneously. The normal operation of a computer is to fetch instructions from memory and execute
them in the processor.
The sequence of instructions read from memory constitutes an instruction stream. The operations
performed on the data in the processor constitute a data stream. Parallel processing may occur in the
instruction stream, in the data stream, or in both.

Flynn's classification divides computers into four major groups as follows:


Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)

SISD represents the organization of a single computer containing a control unit, a processor unit,
and a memory unit. Instructions are executed sequentially and the system may or may not have internal
parallel processing capabilities. Parallel processing in this case may be achieved by means of multiple
functional units or by pipeline processing.
SIMD represents an organization that includes many processing units under the supervision of a
common control unit. All processors receive the same instruction from the control unit but operate on
different items of data. The shared memory unit must contain multiple modules so that it can
communicate with all the processors simultaneously.
MISD structure is only of theoretical interest since no practical system has been constructed
using this organization.
MIMD organization refers to a computer system capable of processing several programs at the
same time. Most multiprocessor and multicomputer systems can be classified in this category.
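The SISD/SIMD distinction can be illustrated with a short Python sketch (our own example; NumPy's vectorised addition stands in for a SIMD-style operation applied to many data items by one instruction):

import numpy as np

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# SISD style: a single instruction stream processes one pair of data items at a time
c_sisd = [a[i] + b[i] for i in range(len(a))]

# SIMD style: one "add" is applied to all the data items at once
c_simd = np.array(a) + np.array(b)

print(c_sisd)            # [11, 22, 33, 44]
print(c_simd.tolist())   # [11, 22, 33, 44] -- same result, computed element-wise in parallel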

Concurrent access to memory and cache coherency:


The primary advantage of cache is its ability to reduce the average access time in uniprocessors.
When the processor finds a word in cache during a read operation, the main memory is not involved in
the transfer. If the operation is to write, there are two commonly used procedures to update memory.
Write-through policy: In the write-through policy, both cache and main memory are updated with
every write operation.
Write-back policy: In the write-back policy, only the cache is updated and the location is marked so
that it can be copied later into main memory.
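The difference between the two policies can be sketched for a single cached word (a hedged illustration; the class and method names are ours, not from the text):

class CachedWord:
    def __init__(self, value):
        self.cache = value
        self.memory = value
        self.dirty = False

    def write_through(self, value):
        self.cache = value
        self.memory = value      # memory is updated on every write

    def write_back(self, value):
        self.cache = value
        self.dirty = True        # only the cache is updated; the location is marked

    def copy_back(self):
        if self.dirty:
            self.memory = self.cache   # the marked location is copied into memory later
            self.dirty = False

x = CachedWord(52)
x.write_back(120)
print(x.cache, x.memory)   # 120 52 -> memory is stale until the marked block is copied back
x.copy_back()
print(x.cache, x.memory)   # 120 120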
In a shared memory multiprocessor system, all the processors share a common memory. In
addition, each processor may have a local memory, part or all of which may be a cache. The compelling
reason for having separate caches for each processor is to reduce the average access time in each
processor. The same information may reside in a number of copies in some caches and main memory.
To ensure the ability of the system to execute memory operations correctly, the multiple copies must be
kept identical.
This requirement imposes a cache coherence problem. A memory scheme is coherent if the value
returned on a load instruction is always the value given by the latest store instruction with the same
address. Without a proper solution to the cache coherence problem, caching cannot be used in bus-
oriented multiprocessors with two or more processors.



Conditions for Incoherence
Cache coherence problems exist in multiprocessors with private caches because of the need to share
writable data. Read-only data can safely be replicated without cache coherence enforcement
mechanisms.
To illustrate the problem, consider the three-processor configuration with private caches shown in Fig.
13-12. Sometime during the operation an element X from main memory is loaded into the three
processors, P1, P2, and P3. As a consequence, it is also copied into the private caches of the three
processors. For simplicity, we assume that X contains the value of 52. The load on X to the three
processors results in consistent copies in the caches and main memory. If one of the processors performs
a store to X, the copies of X in the caches become inconsistent. A load by the other processors will not
return the latest value. Depending on the memory update policy used in the cache, the main memory
may also be inconsistent with respect to the cache.

This is shown in Fig. 13-13. A store to X (of the value of 120) into the cache of processor P1 updates
memory to the new value in a write-through policy. A write-through policy maintains consistency
between memory and the originating cache, but the other two caches are inconsistent since they still
hold the old value. In a write-back policy, main memory is not updated at the time of the store. The
copies in the other two caches and main memory are inconsistent. Memory is updated eventually when
the modified data in the cache are copied back into memory.
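The scenario can be traced in a few lines of Python (a hedged sketch of the situation in the figures, not the book's own code):

memory = {"X": 52}
caches = {"P1": {"X": 52}, "P2": {"X": 52}, "P3": {"X": 52}}   # consistent copies after the loads

# P1 stores 120 to X under a write-through policy: memory and P1's own cache are updated
caches["P1"]["X"] = 120
memory["X"] = 120

print(caches["P2"]["X"], caches["P3"]["X"], memory["X"])
# 52 52 120 -> the copies held by P2 and P3 are now inconsistent (stale)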



Another configuration that may cause consistency problems is a direct memory access (DMA)
activity in conjunction with an IOP connected to the system bus. In the case of input, the DMA may
modify locations in main memory that also reside in cache without updating the cache. During a DMA
output, memory locations may be read before they are updated from the cache when using a write-back
policy. I/O-based memory incoherence can be overcome by making the IOP a participant in the cache
coherence solution that is adopted in the system.

