
ADVANCED COMPUTER ARCHITECTURE

Authors
Ms. G. Anjana Harshitha Reddy,
Assistant Professor, ECE
St. Peter’s Engineering College, Hyderabad
[email protected]

Ms. Kavya Chalamalashetty,
Assistant Professor, ECE
Sri Vasavi Engineering College, Thadepalligudam
[email protected]

Ms. M. Hamsalekha,
Assistant Professor, ECE
Sri Vasavi Engineering College, Thadepalligudam
[email protected]
UNIT-1
a) Define pipelining in computer architecture? 2M
Ans: Pipelining is a technique of decomposing a sequential process into sub-operations, where each
sub-operation is executed in a special dedicated segment that operates concurrently with all other
segments.
b) What are the stages of an instruction pipeline? 2M
Ans: Computers with complex instructions require several phases to process an instruction completely, as
shown below:
 Fetch the instruction from memory.
 Decode the instruction.
 Calculate the effective address.
 Fetch the operands from memory.
 Execute the instruction.
 Store the result in the proper place.
c) What is Vector processing? 2M
Ans: Vector processing is a computational approach where a single instruction operates
simultaneously on multiple data points, known as a vector. Vector processing is commonly used in
applications such as scientific computing, engineering simulations, and image processing.
d) What is RISC pipelining? 2M
Ans: RISC stands for "Reduced Instruction Set Computer." RISC pipelining is a technique used in
computer architecture to enhance the performance of RISC processors. RISC instructions have a uniform
length, and only load and store instructions access memory; all other instructions operate on registers.
e) Define parallel processing? 2M
Ans: It is a technique of processing data simultaneously or concurrently to perform
computational tasks. Parallel processing increases the computational speed of a computer system
and reduces the execution time.
a) List the types of pipeline hazards and solutions? 3M
Ans: The pipeline hazards are resource conflicts,
data dependency, and
branch difficulties.
Solutions for pipeline hazards are hardware interlocks, operand forwarding, delayed load, delayed
branch, handling of branch instructions, prefetching the target instruction, branch target buffer, loop
buffer and branch prediction.
b) What are the applications of parallel processing ?3M
Ans:1) Weather forecasting
2) Computational aerodynamics
3) Remote sensing applications
4) Weapon research and defense
c) List different pipelining techniques ?3M
Ans:
1. Arithmetic pipeline: a pipeline technique used for performing arithmetic operations.
2. Instruction pipeline: a technique for processing instructions in parallel by breaking
them down into different steps.
3. RISC pipeline: a technique for processing instructions in parallel by breaking them
down into different steps of uniform length.
d) What are problems in pipelining ?3M
Ans: Pipelining problems, also known as pipeline hazards, occur when a pipeline stalls for any
reason. Some of the pipelining hazards are data dependency, memory delay, branch delay, and
resource conflict.
e) What is Flynn's classification based on? Give the names of Flynn's classifications? 3M
Ans: Flynn's classification is based on the types of instruction and data streams:
1) Single Instruction Stream, Single Data Stream (SISD)
2) Single Instruction Stream, Multiple Data Stream (SIMD)
3) Multiple Instruction Stream, Single Data Stream (MISD)
4) Multiple Instruction Stream, Multiple Data Stream (MIMD)
a) What is memory interleaving ? Explain with a neat diagram? 5M
Ans: What? It involves dividing the memory into multiple modules or banks and distributing the
memory addresses across these banks in a way that allows simultaneous access to multiple memory
locations.
Why? Memory interleaving is a technique used in computer architecture to improve the
performance of memory access. This helps in reducing the wait time for memory access and
increases the overall throughput of the system.
 An instruction pipeline may require the fetching of an instruction and an operand at the same
time from two different segments. An arithmetic pipeline usually requires two or more
operands to enter the pipeline at the same time. Instead of using two memory buses for
simultaneous access, the memory can be partitioned into a number of modules connected to a
common memory address and data buses.

The advantage of a modular memory is that it allows the use of a technique called interleaving. In
an interleaved memory, different sets of addresses are assigned to different memory modules. By
staggering the memory access, the effective memory cycle time can be reduced by a factor close to
the number of modules.
b) What is pipelining? Explain Ai*Bi+Ci where i goes from 1 to 4 by using pipelining ? 5M
Ans: Pipelining is a technique of decomposing a sequential process into sub-operations, where each
subprocess being executed in a special dedicated segment that operates concurrently with all other
segments.
The pipeline organization will be demonstrated by means of a simple example: performing the
combined multiply and add operation Ai * Bi + Ci for a stream of numbers, i = 1, 2, 3, ..., 7.
Each suboperation is to be implemented in a segment within a pipeline.
 R1←Ai, R2 ←Bi Input Ai and Bi
 R3 ← R1 * R2, R4 ← Ci Multiply and input Ci
 R5 ← R3 + R4 Add Ci to product
The five registers are loaded with new data every clock pulse. The effect of each clock is shown in
Table below. The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse
transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and
B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously. It places
A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and
places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the
first output from R5. From there on, each clock produces a new output and moves the data one step
down the pipeline. This happens as long as new input data flow into the system. When no more
input data are available, the clock must continue until the last output emerges out of the pipeline.

The main characteristic of pipelining is that several computations can be in progress in distinct
segments at the same time. The registers provide isolation between segments, so each segment can
work on distinct data simultaneously.
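The segment behaviour described above can be illustrated with a small simulation. The sketch below is a hypothetical Python model (register and loop names are illustrative, not from the text) that clocks the three segments on four input triples and shows that a new result emerges every clock once the pipe is full.

```python
# Minimal sketch of the 3-segment multiply-add pipeline producing R5 = Ai*Bi + Ci.
# Segment 1: R1 <- Ai, R2 <- Bi (Ci is staged alongside for the next segment)
# Segment 2: R3 <- R1*R2, R4 <- Ci
# Segment 3: R5 <- R3 + R4

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
C = [9, 10, 11, 12]

R1 = R2 = R3 = R4 = R5 = None   # pipeline registers
ci_staged = None                # Ci travelling with segment 1's data

results = []
for clock in range(len(A) + 2):          # extra clocks to drain the pipeline
    if R3 is not None:                   # segment 3 uses values latched earlier
        R5 = R3 + R4
        results.append(R5)
    if R1 is not None:                   # segment 2 uses segment 1's latches
        R3, R4 = R1 * R2, ci_staged
    else:
        R3 = R4 = None
    if clock < len(A):                   # segment 1 accepts a new input pair
        R1, R2, ci_staged = A[clock], B[clock], C[clock]
    else:
        R1 = R2 = ci_staged = None

print(results)   # [14, 22, 32, 44] -> Ai*Bi + Ci for i = 1..4
```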
c) What is array processing? Explain different types of array processors? 5M
Ans:
WHAT? An array processor is a processor that performs computations on large arrays of data. Array
processors are used to perform high computational tasks and to operate on multiple data sets.
Array processing is used in different types of processors.
Attached array processor: An attached array processor is a parallel processor with multiple
functional units. The objective of the attached array processor is to provide vector manipulation
capabilities to a conventional computer at a fraction of the cost of a supercomputer. Fig. D shows the
interconnection of an attached array processor to a host computer. It is an auxiliary processor,
intended to improve the performance of the host computer in specific numerical computation tasks.
Fig D: Attached array processor with host computer

SIMD array processor: An SIMD array processor is a computer with multiple processing units
operating in parallel. A general block diagram of an array processor is shown in Fig. E.
 It contains a set of identical processing elements (PEs), each having a local memory M. Each
PE includes an ALU, a floating-point arithmetic unit, and working registers. Vector instructions
are broadcast to all PEs simultaneously. Masking schemes are used to control the status of each
PE during the execution of vector instructions. Each PE has a flag that is set when the PE is
active and reset when the PE is inactive.

Fig E: SIMD array processor organization

d) What is vector processing ? Explain how Matrix multiplication is done using vector
processing?5M

Ans: Vector processing is a computational approach where a single instruction operates
simultaneously on multiple data points, known as a vector. This contrasts with scalar processing,
where one instruction operates on a single data point at a time. Vector processing is commonly used
in applications that involve large datasets and require the same operation to be performed on each
element, such as scientific computing, engineering simulations, and image processing. A computer
capable of vector processing eliminates the overhead associated with the time it takes to fetch and
execute the instructions in the program loop.

C(1:100) = A(1:100) + B(1:100)


A possible instruction format for a vector instruction is shown in Fig. A. This assumes that the
vector operands reside in memory.

Matrix Multiplication
The multiplication of two n x n matrices consists of n^2 inner products or n^3 multiply-add
operations.
o Consider, for example, the multiplication of two 3 x 3 matrices A and B.
c11 = a11*b11 + a12*b21 + a13*b31
o This requires three multiplications and (after initializing c11 to 0) three additions. In general, the
inner product consists of the sum of k product terms of the form
C = A1B1 + A2B2 + A3B3 + ... + AkBk.
o In a typical application k may be equal to 100 or even 1000. The inner product calculation on a
pipeline vector processor is shown in Fig. B.

C = A1B1+A5B5+A9B9+A13B13
+A2B2+A6B6+A10B10+A14B14
+A3B3+A7B7+A11B11+A15B15
+A4B4+A8B8+A12B12+A16B16

Fig B: Pipeline for calculating an inner product
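The grouping of terms in the sum above corresponds to a pipelined adder that keeps four partial sums, each accumulating every fourth product. The sketch below is a hypothetical Python illustration of that accumulation (function and variable names are illustrative, and the 4-lane split is the assumption taken from the sum shown above).

```python
# Sketch of inner-product accumulation with a 4-stage pipelined adder:
# product i is added into partial sum i mod 4, mirroring
# C = (A1B1 + A5B5 + ...) + (A2B2 + A6B6 + ...) + (A3B3 + ...) + (A4B4 + ...).

def pipelined_inner_product(a, b, lanes=4):
    partial = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % lanes] += x * y      # each lane receives every 4th product
    return sum(partial)                  # final addition of the four partial sums

a = list(range(1, 17))      # A1..A16
b = list(range(1, 17))      # B1..B16
print(pipelined_inner_product(a, b))          # 1496.0
print(sum(x * y for x, y in zip(a, b)))       # same value, computed serially
```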


e) Explain the RISC pipeline and its characteristics. Discuss the advantages of RISC architecture
over CISC in terms of pipelining efficiency? 5M
Ans: RISC stands for Reduced Instruction Set Computer. RISC pipelining is a technique used in
computer architecture to enhance the performance of RISC processors. The main idea behind
pipelining is to divide the processing of instructions into several stages, with each stage performing
a part of the instruction. This allows multiple instructions to be processed simultaneously, with each
instruction being at a different stage of completion.
 One segment fetches the instruction from program memory
 The other segment executes the instruction in the ALU
 Third segment may be used to store the result of the ALU operation in a destination register
 RISC can achieve pipeline segments requiring just one clock cycle each, by using a compiler that
translates the high-level language program into a machine language program.
 Instead of designing hardware to handle the difficulties associated with data conflicts and
branch penalties, RISC processors rely on the efficiency of the compiler to detect and minimize
the delays encountered with these problems.
Feature                      | RISC Architecture                        | CISC Architecture
Instruction Set Complexity   | Simple instruction set                   | Complex instruction set
Instruction Length           | Fixed-length instructions                | Variable-length instructions
Instruction Execution Time   | Uniform execution time                   | Variable execution time
Pipelining Efficiency        | Highly pipelinable                       | Less pipelinable
Memory Access                | Load/store architecture (memory access   | Memory access within complex
                             | limited to load and store instructions)  | instructions
a) Explain addition of floating point numbers X = 0.9504 x 10^3 and Y = 0.8200 x 10^2 by using
arithmetic pipelining architecture? Explain with a neat diagram? 10M
Ans: Floating-point operations are easily decomposed into suboperations.
An example of a pipeline unit for floating-point addition and subtraction is shown in the following:
1. The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers,
X = A x 2^a and Y = B x 2^b; for the decimal example here, X = 0.9504 x 10^3 and Y = 0.8200 x 10^2.
2. A and B are two fractions that represent the mantissas; a and b are the exponents. The floating-
point addition and subtraction can be performed in four segments, as shown in the Fig below. The
suboperations that are performed in the four segments are:
a) Compare the exponents: the larger exponent is chosen as the exponent of the result.
b) Align the mantissas (the mantissa, or significand, is the fraction that contains the significant digits
of the number). The exponent difference determines how many times the
mantissa associated with the smaller exponent must be shifted to the right.
c) Add or subtract the mantissas.
d) Normalize the result: when an overflow occurs, the mantissa of the sum or difference is shifted
right and the exponent incremented by one.
If an underflow occurs, the number of leading zeros in the mantissa determines the number of left
shifts in the mantissa and the number that must be subtracted from the exponent.
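For the given operands the four segments produce the result 0.10324 x 10^4. The sketch below walks that example through the four suboperations in plain Python; it keeps decimal mantissa/exponent pairs as the question does, and the function name and structure are illustrative assumptions, not the text's notation.

```python
# Four-segment floating-point addition applied to
# X = 0.9504 x 10^3 and Y = 0.8200 x 10^2 (decimal mantissa/exponent pairs).

def fp_add(ma, ea, mb, eb):
    # Segment 1: compare exponents; the larger one becomes the result exponent
    diff = ea - eb
    exp = max(ea, eb)
    # Segment 2: align mantissas by shifting the smaller-exponent operand right
    if diff > 0:
        mb = mb / (10 ** diff)        # Y becomes 0.0820 x 10^3
    elif diff < 0:
        ma = ma / (10 ** -diff)
    # Segment 3: add the mantissas
    m = ma + mb                       # 0.9504 + 0.0820 = 1.0324
    # Segment 4: normalize (mantissa overflow -> shift right, exponent + 1)
    if abs(m) >= 1.0:
        m, exp = m / 10.0, exp + 1    # result: 0.10324 x 10^4
    return m, exp

print(fp_add(0.9504, 3, 0.8200, 2))   # roughly (0.10324, 4), i.e. 0.10324 x 10^4 = 1032.4
```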
b) Explain instruction pipelining with a neat diagram when a branch occurs at instruction 3? 10M
Ans: An instruction pipeline processes instructions in parallel by breaking them down into different
steps. Pipeline processing can occur not only in the data stream but in the instruction stream as well.
Computers with complex instructions require several phases to process an instruction completely, as shown
below
 Fetch the instruction from memory.
 Decode the instruction.
 Calculate the effective address.
 Fetch the operands from memory.
 Execute the instruction.
 Store the result in the proper place.

An instruction in the sequence may cause a branch out of the normal sequence. In that case the
pending operations in the last two segments are completed and all information stored in the
instruction buffer is deleted. Similarly, an interrupt request will cause the pipeline to empty and
start again from a new address value.
The time in the horizontal axis is divided into steps of equal duration. The four segments are
represented in the diagram with an abbreviated symbol.
1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.

It is assumed that the processor has separate instruction and data memories so that the operations in
FI and FO can proceed at the same time. In the absence of a branch instruction, each segment
operates on different instructions. Thus, in step 4, instruction 1 is being executed in segment EX; the
operand for instruction 2 is being fetched in segment FO; instruction 3 is being decoded in segment
DA; and instruction 4 is being fetched from memory in segment FI. Assume now that instruction 3
is a branch instruction. As soon as this instruction is decoded in segment DA in step 4, the transfer
from FI to DA of the other instructions is halted until the branch instruction is executed in step 6. If
the branch is taken, a new instruction is fetched in step 7. If the branch is not taken, the instruction
fetched previously in step 4 can be used. The pipeline then continues until a new branch instruction
is encountered. Another delay may occur in the pipeline if the EX segment needs to store the result
of the operation in the data memory while the FO segment needs to fetch an operand. In that case,
segment FO must wait until segment EX has finished its operation.

c) What is RISC pipelining ? Explain delayed load and delayed branch with a neat diagram ? 10M
Ans: RISC stands for "Reduced Instruction Set Computer." RISC pipelining is a technique used in computer
architecture to enhance the performance of RISC processors. RISC instructions have a uniform length, and
only load and store instructions access memory; all other instructions operate on registers.
Delayed Load :Consider the operation of the following four instructions:
o LOAD: R1 ← M[address 1]
o LOAD: R2 ← M[address 2]
o ADD: R3 ← R1 + R2
o STORE: M[address 3] ← R3
There will be a data conflict in instruction 3 because the operand in R2 is not yet available in the
A segment. This can be seen from the timing of the pipeline shown in Fig A below
o The E segment in clock cycle 4 is in a process of placing the memory data into R2.
o The A segment in clock cycle 4 is using the data from R2.
It is up to the compiler to make sure that the instruction following the load instruction uses the data
fetched from memory. This concept of delaying the use of the data loaded from memory is referred
to as delayed load.
Fig A
Fig. (b) shows the same program with a no-op instruction inserted after the load to R2
instruction. Thus the no-op instruction is used to advance one clock cycle in order to compensate
for the data conflict in the pipeline. The advantage of the delayed load approach is that the data
dependency is taken care of by the compiler rather than the hardware.

Fig b
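A minimal sketch of what such a compiler pass does is given below. It is hypothetical Python (the instruction encoding and helper name are illustrative, not from the text): it scans an instruction list and inserts a no-op whenever a load is immediately followed by an instruction that uses the loaded register, exactly the situation in the four-instruction example above.

```python
# Sketch of delayed-load handling by the compiler: insert a NOP when the
# instruction right after a LOAD reads the register being loaded.

def insert_load_delays(program):
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:                      # load-use hazard detected
                out.append(("NOP", None, ()))          # one-cycle delay slot
    return out

# (operation, destination, source operands) -- mirrors the four-instruction example
program = [
    ("LOAD",  "R1", ("M1",)),
    ("LOAD",  "R2", ("M2",)),
    ("ADD",   "R3", ("R1", "R2")),     # uses R2 loaded just before -> conflict
    ("STORE", "M3", ("R3",)),
]
for ins in insert_load_delays(program):
    print(ins)       # a NOP appears between the second LOAD and the ADD
```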
Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine the branches so that
they take effect at the proper time in the pipeline. This method is referred to as delayed branch.
The compiler is designed to analyze the instructions before and after the branch and rearrange the
program sequence by inserting useful instructions in the delay steps.
It is up to the compiler to find useful instructions to put after the branch instruction. Failing that, the
compiler can insert no-op instructions.
An Example of Delayed Branch:The program for this example consists of five instructions.
o Load from memory to R1
o Increment R2
o Add R3 to R4
o Subtract R5 from R6
o Branch to address X
o The branch address X is transferred to PC in clock cycle 7.
In Fig. (c) the compiler inserts two no-op instructions after the branch.

Fig (c): Using no operation instruction


In Fig. (d) the program is rearranged by placing the add and subtract instructions after the branch
instruction instead of the no-op instructions.
o PC is updated to the value of X in clock cycle 5.

Fig (d): Rearranging the instructions

d) Discuss the various types of pipeline hazards and the techniques used to overcome them.
Provide examples to illustrate your points. 10M
Ans: In general, there are three major difficulties that cause the instruction pipeline to deviate from
its normal operation.
 Resource conflicts caused by access to memory by two segments at the same time. Can be
resolved by using separate instruction and data memories and separate buses for memories
 Data dependency conflicts arise when an instruction depends on the result of
a previous instruction, but this result is not yet available.
 Branch difficulties arise from branch and other instructions that change the
value of PC.
 Data dependency: A difficulty that may cause a degradation of performance in an instruction
pipeline is the possible collision of data or address. A data dependency occurs when an
instruction needs data that are not yet available. An address dependency may occur when an
operand address cannot be calculated because the information needed by the addressing mode
is not available.

Pipelined computers deal with such conflicts between data dependencies in a variety of ways.
 Hardware interlocks: an interlock is a circuit that detects instructions whose source operands
are destinations of instructions farther up in the pipeline. This approach maintains the program
sequence by using hardware to insert the required delays.

 Operand forwarding: uses special hardware to detect a conflict and then avoid it by routing the
data through special paths between pipeline segments. This method requires additional
hardware paths through multiplexers as well as the circuit that detects the conflict.

 Delayed load: The compiler for such computers is designed to detect a data conflict and reorder
the instructions as necessary to delay the loading of the conflicting data by inserting no-
operation instructions.

 Handling of branch instructions: One of the major problems in operating an instruction
pipeline is the occurrence of branch instructions. An unconditional branch always alters the
sequential program flow by loading the program counter with the target address. In a
conditional branch, the control selects the target instruction if the condition is satisfied or the
next sequential instruction if the condition is not satisfied. Pipelined computers employ various
hardware techniques to minimize the performance degradation caused by instruction branching.

 Prefetch target instruction: To prefetch the target instruction in addition to the instruction
following the branch. Both are saved until the branch is executed.

 Branch target buffer(BTB): The BTB is an associative memory included in the fetch segment
of the pipeline. Each entry in the BTB consists of the address of a previously executed branch
instruction and the target instruction for that branch. It also stores the next few instructions after
the branch target instruction.
 Loop buffer: This is a small very high speed register file maintained by the instruction fetch
segment of the pipeline.

 Branch prediction: A pipeline with branch prediction uses some additional logic to guess the
outcome of a conditional branch instruction before it is executed.

 Delayed branch: in this procedure, the compiler detects the branch instructions and rearranges
the machine language code sequence by inserting useful instructions (or, failing that, no-operation
instructions) that keep the pipeline operating without interruptions. This procedure is employed in
most RISC processors.
UNIT-2
a) What are the different arithmetic operations ? 2M
Ans: Arithmetic operations are fundamental to digital computers. They involve
manipulating data to produce results required for solving computational problems.
The four basic arithmetic operations are:

 Addition
 Subtraction
 Multiplication
 Division

b) How are the signed numbers represented using binary numbers ?2M
Ans:Signed numbers are integers with a positive or negative sign. Since computers
understand only binary, it's necessary to represent these signed integers in binary
form. There are three common methods for this:

1. Sign Bit: A bit is designated to indicate the sign, typically 0 for positive and 1
for negative.
2. 1's Complement: The bits of the positive number are inverted (0 becomes 1,
1 becomes 0).
3. 2's Complement: 1 is added to the 1's complement. This is the most common
method used in computers. The specific representation method used depends
on the computer architecture and the application's requirements.

c) What is the 1's complement of 0001101? 2M


Ans:The 1's complement of a binary number is obtained by inverting all the bits.
So, 0 becomes 1, and 1 becomes 0.Therefore, the 1's complement of 0001101 is
1110010.
d) Explain Exclusive OR (XOR) gate with a table? 2M
Ans: An XOR gate is a digital logic gate that takes two binary inputs and produces
an output that is 1 only when the inputs are different.

A | B | A XOR B
0 | 0 |    0
0 | 1 |    1
1 | 0 |    1
1 | 1 |    0

e) What is the 2's complement of 1001101? 2M


Ans: The 2's complement of a binary number is obtained by adding 1 to its 1's
complement.
a) Find the 1's complement: 1's complement of 1001101 is 0110010.
b) Add 1 to the 1's complement: 0110010 + 1 = 0110011.Therefore, the 2's
complement of 1001101 is 0110011.
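Both complements can be checked with a few lines of Python. The helpers below are an illustrative sketch (the function names are not from the text) that operate on bit strings of a fixed width.

```python
# Sketch: 1's and 2's complement of a bit string of fixed width.

def ones_complement(bits):
    return "".join("1" if b == "0" else "0" for b in bits)

def twos_complement(bits):
    width = len(bits)
    value = (int(ones_complement(bits), 2) + 1) % (1 << width)  # add 1, keep width
    return format(value, f"0{width}b")

print(ones_complement("0001101"))   # 1110010
print(twos_complement("1001101"))   # 0110011
```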
a) How are 2's complement addition and subtraction better than signed magnitude
addition? 3M
Ans: 2's complement addition and subtraction offer several advantages over signed
magnitude addition, making them more efficient for computer arithmetic:
a) Unique Representation for Zero: 2's complement has a single representation of
zero, whereas signed magnitude has both a positive and a negative zero.
b) Simplified Arithmetic Operations : Arithmetic operations are simpler and
more direct in 2's complement, as there's no need for separate circuits to
handle signs.
c) Overflow Handling : Overflow detection is easier in 2's complement.
d) Consistent Behavior : 2's complement arithmetic provides consistent results,
regardless of the sign of the operands.
b) Draw the hardware architecture of signed 2's complement addition and
subtraction? 3M
Ans: The hardware architecture of signed 2's complement addition and subtraction
consists of the addend in the BR register, the augend in the AC register, a complementer &
parallel adder for performing complementing and adding, and V (overflow) to
store the overflow bit.

Diagram: BR register → complementer & parallel adder (with V overflow flip-flop) → AC register

c) What are floating point arithmetic operations? 3M


Ans:Floating-point operations refer to arithmetic operations performed on floating-
point numbers, such as addition, subtraction, multiplication, and division. These
operations are essential for implementing mathematical functions like sine and
cosine in computer programs.
d) Perform the arithmetic addition and subtraction of 20 and 15 i.e (20+15),(20-
15) using binary representations? 3M

Ans: First, convert the decimal numbers 20 and 15 to binary: 20 in binary is
10100 and 15 in binary is 01111.

Now, let's perform the binary addition:

  10100 (20)
+ 01111 (15)
-------
 100011 (35)

Next, let's perform the binary subtraction. To subtract using binary, we use a
method similar to 2's complement subtraction.

Find the 2's complement of the subtrahend (15):

1) Invert the bits: 10000

2) Add 1: 10001

Add the minuend (20) and the 2's complement of the subtrahend:

  10100 (20)
+ 10001 (2's complement of 15)
-------
 100101

Discarding the end carry leaves 00101, which is 5.
Therefore:

 20 + 15 = 35 (binary: 100011)
 20 - 15 = 5 (binary: 00101)

e) Perform the binary multiplication of 10 and 9 ? 3M

Ans:Let's convert the decimal numbers 10 and 9 to binary:

 10 in binary is 1010
 9 in binary is 1001

Now, let's perform the binary multiplication:

1010
x 1001
------
1010
0000
0000
+ 1010
------
1011010

So, the binary multiplication of 10 and 9 is 1011010.


a) Explain the hardware implementation of signed magnitude addition and
subtraction? 5M

Ans: Components:

1. B Register: This register stores one of the operands for arithmetic


operations.
2. Complementer: This unit generates the 2's complement of the input
operand.
3. Parallel Adder: This adder performs the addition or subtraction operation
based on the mode control signal (M).
4. A Register: This register stores the other operand and the result of the
operation.
5. Mode Control (M): This signal determines whether the operation is
addition or subtraction.

Operation:

Addition: The value in the B register is loaded into the parallel adder. The value in
the A register is also loaded into the parallel adder. The mode control signal (M) is
set to 0 (indicating addition). The parallel adder performs the addition operation,
and the result is stored back in the A register.

Subtraction: The value in the B register is loaded into the complementer. The
complementer generates the 2's complement of the input value. The 2's complement
value is loaded into the parallel adder. The value in the A register is also loaded
into the parallel adder. The mode control signal (M) is set to 1 (indicating subtraction).
The parallel adder adds the 2's complement of B to the A register value, which
effectively subtracts the B register value from it, and the
result is stored back in the A register.

b) Explain the hardware implementation of signed 2's complement addition and
subtraction? 5M
Ans:
The hardware architecture of signed 2's complement addition and subtraction consists
of the addend in the BR register, the augend in the AC register, a complementer & parallel
adder for performing complementing and adding, and V (overflow) to store the
overflow bit. The hardware implementation of signed 2's complement addition and
subtraction is as shown above.
The algorithm for signed 2's complement addition and subtraction is as shown below:

Subtract: minuend in AC, subtrahend in BR; AC ← AC + (BR)' + 1; V ← overflow; END.
Add: augend in AC, addend in BR; AC ← AC + BR; V ← overflow; END.

Subtract:

 Minuend is placed in the AC (Accumulator) register.
 Subtrahend is placed in the BR register.
 The 2's complement of the subtrahend is added to the minuend.
 If an overflow occurs, it is indicated by the V flag.

Add:

 Augend is placed in the AC register.
 Addend is placed in the BR register.
 The contents of the AC and BR registers are added.
 If an overflow occurs, it is indicated by the V flag.

c) Explain the hardware implementation of multiplication? 5M


Ans: In the beginning, the multiplicand is in B and the multiplier in Q. Their
corresponding signs are in Bs and Qs respectively. The signs Bs and Qs are compared, and
both As and Qs are set to the sign of the product, since a double-length product
will be stored in registers A and Q. Registers A and E are cleared and the sequence
counter SC is set to the number of bits of the multiplier. Since an operand must be
stored with its sign, one bit of the word will be occupied by the sign and the
magnitude will consist of n-1 bits. Now, the low order bit of the multiplier in Qn is
tested. If it is 1, the multiplicand (B) is added to the present partial product (A); if it is 0,
nothing is added. Register EAQ is then shifted once to the right to form the new partial
product. The sequence counter is decremented by 1 and its new value checked. If it
is not equal to zero, the process is repeated and a new partial product is formed.
When SC = 0, the process stops.
d) Explain the hardware implementation of division ? 5M

Ans:

Components:

1. A Register: This register holds the high-order half of the dividend and, during the
operation, the partial remainder.
2. B Register: This register holds the divisor.
3. Sequence Counter (SC): This counter keeps track of the number of steps in the
sequence of operations being performed.
4. Q Register: This register initially holds the low-order half of the dividend and
collects the quotient bits as they are generated.
5. E Flip-flop: This flip-flop receives the end carry of the adder and indicates whether
the partial remainder is greater than or equal to the divisor.
6. Complementer: This circuit takes the value of B and produces its 2's
complement, which is used for the subtraction (compare) operations.
7. Parallel Adder: This adder takes the values of A and B (or A and the complement of B)
as inputs and produces their sum.

Operation:

1) The divisor is loaded into B, and the dividend into A and Q.
2) The sequence counter (SC) is initialized to the number of quotient bits, to
indicate the number of steps to be performed.
3) In each step, registers E, A, and Q are shifted left one position.
4) The divisor is subtracted from A (by adding the 2's complement of B through the
complementer and parallel adder) to compare the two values.
5) If A is greater than or equal to B, the quotient bit Qn is set to 1 and the
difference is kept in A; otherwise Qn is set to 0 and A is restored.
6) After each step the SC is decremented; steps 3 to 5 are repeated until the SC
reaches zero, leaving the quotient in Q and the remainder in A.

Figure: Hardware for the divide operation — the B register (sign Bs) and sequence counter SC feed the
complementer and parallel adder; the A register (sign As, with end-carry flip-flop E) holds the partial
remainder, and the Q register (sign Qs, low-order bit Qn) holds the quotient bits.

e) Explain the hardware implementation of floating point addition? 5M


Ans: The register configuration for floating-point operations is shown in Figure
4.13. As a rule, the same registers and adder used for fixed-point arithmetic
operations are also used for processing the mantissas. The difference lies in the
way the exponents are handled. There are three registers: BR, AC, and QR. Each
register is subdivided into two parts. The mantissa part has the same uppercase
letter symbols as in fixed-point representation; the exponent part uses the
corresponding lowercase letter symbol.
Addition and Subtraction of Floating Point Numbers :
During addition or subtraction, the two floating-point operands are kept in AC and
BR. The sum or difference is formed in the AC. The algorithm can be divided into
four consecutive parts:
1. Check for zeros.
2. Align the mantissas.
3. Add or subtract the mantissas
4. Normalize the result
A floating-point number cannot be normalized, if it is 0. If this number is used for
computation, the result may also be zero. Instead of checking for zeros during the
normalization process we check for zeros at the beginning and terminate the
process if necessary. The alignment of the mantissas must be carried out prior to
their operation. After the mantissas are added or subtracted,the result may be un-
normalized. The normalization procedure ensures that the result is normalized
before it is transferred to memory. If the magnitudes were subtracted, there may be
zero or may have an underflow in the result. If the mantissa is equal to zero the
entire floating-point number in the AC is cleared to zero. Otherwise, the mantissa
must have at least one bit that is equal to 1. The mantissa has an underflow if the
most significant bit in position A1, is 0. In that case, the mantissa is shifted left and
the exponent decremented. The bit in A1 is checked again and the process is
repeated until A1 = 1. When A1 = 1, the mantissa is normalized and the operation
is completed.

Figure 4.13: Registers for Floating Point arithmetic operations

a) Draw and explain the algorithm for signed magnitude addition and
subtraction ?10M
Ans:
Algorithm:
The flowchart is shown in the Figure below. The two signs As and Bs are compared by
an exclusive-OR gate. If the output of the gate is 0 the signs are identical; if it is 1,
the signs are different. For an add operation, identical signs dictate that the
magnitudes be added. The magnitudes are added with the micro-operation EA ← A +
B, where EA is a register that combines E and A. The carry in E after the addition
constitutes an overflow if it is equal to 1. The value of E is transferred into the add-
overflow flip-flop AVF. The two magnitudes are subtracted if the signs are
different for an add operation or identical for a subtract operation. The magnitudes
are subtracted by adding A to the 2's complement of B. No overflow can occur if
the numbers are subtracted, so AVF is cleared to 0. A 1 in E indicates that A >= B
and the number in A is the correct result. If this number is zero, the sign As must be
made positive to avoid a negative zero. A 0 in E indicates that A < B. For this case it
is necessary to take the 2's complement of the value in A. The operation can be
done with one micro-operation, A ← A' + 1. However, we assume that the A register
has circuits for the complement and increment micro-operations, so the 2's
complement is obtained from these two micro-operations. In the other paths of the
flowchart, the sign of the result is the same as the sign of A, so no change in A is
required. However, when A < B, the sign of the result is the complement of the
original sign of A. It is then necessary to complement As to obtain the correct sign.
The final result is found in register A and its sign in As. The value in AVF
provides an overflow indication. The final value of E is immaterial.

b) Draw and explain the algorithm for signed magnitude multiplication with an
example ? 10M
Ans:Multiplication Algorithm:
In the beginning, the multiplicand is in B and the multiplier in Q. Their
corresponding signs are in Bs and Qs respectively. The signs Bs and Qs are compared, and
both As and Qs are set to the sign of the product, since a double-length product
will be stored in registers A and Q. Registers A and E are cleared and the sequence
counter SC is set to the number of bits of the multiplier. Since an operand must be
stored with its sign, one bit of the word will be occupied by the sign and the
magnitude will consist of n-1 bits. Now, the low order bit of the multiplier in Qn is
tested. If it is 1, the multiplicand (B) is added to the present partial product (A); if it is 0,
nothing is added. Register EAQ is then shifted once to the right to form the new partial
product. The sequence counter is decremented by 1 and its new value checked. If it
is not equal to zero, the process is repeated and a new partial product is formed.
When SC = 0, the process stops.
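As an example, the sketch below multiplies 10 x 9 with the same shift-and-add register scheme (E, A, Q, SC). It is a hypothetical Python model of the algorithm described above; the register names follow the text, everything else is illustrative and handles magnitudes only (the sign of the product is Bs XOR Qs and is handled separately).

```python
# Shift-and-add multiplication with registers E, A, Q and counter SC.

def shift_add_multiply(multiplicand, multiplier, n_bits):
    B = multiplicand
    A, Q, E = 0, multiplier, 0
    SC = n_bits
    while SC > 0:
        if Q & 1:                      # test low-order multiplier bit Qn
            total = A + B              # add multiplicand to partial product
            A, E = total & ((1 << n_bits) - 1), total >> n_bits
        # shift E, A, Q one position to the right (shr EAQ)
        Q = ((A & 1) << (n_bits - 1)) | (Q >> 1)
        A = (E << (n_bits - 1)) | (A >> 1)
        E = 0
        SC -= 1
    return (A << n_bits) | Q           # double-length product held in A:Q

print(shift_add_multiply(0b1010, 0b1001, 4))   # 90, i.e. 1011010 in binary
```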

c) Draw and explain the algorithm for division with an example ? 10M
Ans: Division of two fixed-point binary numbers in signed magnitude
representation is performed with paper and pencil by a process of successive
compare, shift and subtract operations. Binary division is much simpler than
decimal division because here the quotient digits are either 0 or 1 and there is no
need to estimate how many times the dividend or partial remainder fits into the
divisor. The division process is described in the Figure below. The divisor is compared
with the five most significant bits of the dividend. If this 5-bit number is
smaller than B, we examine one more bit of the dividend and compare again. When the 6-bit number is greater
than B, we place a 1 for the quotient bit in the sixth position above the dividend.
Now we shift the divisor once to the right and subtract it from the dividend. The
difference is known as a partial remainder because the division could have stopped
here to obtain a quotient of 1 and a remainder equal to the partial remainder.
Comparing a partial remainder with the divisor continues the process. If the partial
remainder is greater than or equal to the divisor, the quotient bit is equal to 1. The
divisor is then shifted right and subtracted from the partial remainder. If the partial
remainder is smaller than the divisor, the quotient bit is 0 and no subtraction is
needed. The divisor is shifted once to the right in any case. Obviously the result
gives both a quotient and a remainder.
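A compact Python sketch of this compare-shift-subtract (restoring) scheme is shown below. It is illustrative only (the function name, operands and bit width are assumptions), and it reproduces the quotient/remainder behaviour described above.

```python
# Sketch of restoring division on unsigned magnitudes: at each step the divisor,
# aligned with the current quotient bit position, is compared with the partial
# remainder; if it fits, subtract it and set the quotient bit to 1, otherwise 0.

def restoring_divide(dividend, divisor, n_bits):
    remainder, quotient = dividend, 0
    for i in range(n_bits - 1, -1, -1):
        shifted = divisor << i                 # divisor aligned with bit i
        quotient <<= 1
        if remainder >= shifted:               # partial remainder >= divisor
            remainder -= shifted
            quotient |= 1                      # quotient bit = 1
        # else: quotient bit stays 0, no subtraction needed
    return quotient, remainder

q, r = restoring_divide(0b100100, 0b101, 6)    # 36 / 5
print(format(q, "b"), format(r, "b"))          # 111 1 -> quotient 7, remainder 1
```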
d) Draw and explain the algorithm for floating point multiplication with an
example ? 10M

Ans: Floating-Point Multiplication:

Floating-point numbers consist of a sign, a mantissa (significand), and an


exponent. The multiplication of two floating-point numbers involves the following
steps:

1. Initialization: The multiplicand (BR) and multiplier (QR) are loaded into
their respective registers. The accumulator (AC) is initialized to zero.
2. Sign Handling:The signs of the multiplicand and multiplier are XORed to
determine the sign of the product. This is because the product of two numbers
with the same sign is positive, and the product of two numbers with different
signs is negative.
3. Exponent Handling: The exponents of the multiplicand and multiplier are
added to determine the exponent of the product. This is because multiplying
two numbers with exponents a and b multiplies their
mantissas and adds their exponents, giving an exponent of a+b.
4. Mantissa Multiplication (Fixed-Point Multiplication):The mantissas of
the multiplicand and multiplier are multiplied using a fixed-point
multiplication algorithm. This is typically done using a shift-and-add or
Booth's algorithm.
5. Normalization:The product obtained from the mantissa multiplication
might not be in the normalized form (i.e., the most significant bit might not be
1). The product is normalized by shifting it left or right until the most
significant bit becomes 1. The exponent is adjusted accordingly to compensate
for the shift.
6. Rounding:The product might have more bits than the desired precision.
Rounding is applied to the product to fit it within the specified number of bits.
7. Result:The final product is stored in the accumulator (AC), with the sign,
exponent, and rounded mantissa.

Diagram Explanation: The diagram depicts the flow of the floating-point
multiplication process. Here's a breakdown of the key components:

BR: Multiplicand register; QR: Multiplier register; AC: Accumulator register; a:
exponent part of the AC register.

Steps:

1. Load Multiplicand and Multiplier: The multiplicand and multiplier are


loaded into the BR and QR registers, respectively.
2. Check Signs: The signs of the operands are XORed to determine the sign
of the product.
3. Add Exponents: The exponents of the operands are added to determine the
exponent of the product.
4. Initialize AC: The AC is initialized to zero.
5. Mantissa Multiplication: The mantissas are multiplied using a fixed-point
multiplication algorithm.
6. Shift and Add: The product in the AC is shifted and added iteratively to
form the final product.
7. Normalize and Round: The product is normalized and rounded to fit
within the desired precision.
8. END: The final product is stored in the AC.
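The steps above can be condensed into a short sketch. The Python below is illustrative only (decimal mantissa/exponent triples, with rounding omitted for brevity; the names are not from the text): it mirrors steps 2-5 by XORing the signs, adding the exponents, multiplying the mantissas, and normalizing.

```python
# Sketch of floating-point multiplication on (sign, mantissa, exponent) triples,
# with mantissas kept in the range 0.1 <= m < 1 as in decimal examples.

def fp_multiply(sa, ma, ea, sb, mb, eb):
    sign = sa ^ sb              # step 2: sign of the product
    exp = ea + eb               # step 3: add exponents
    m = ma * mb                 # step 4: multiply mantissas
    while m != 0 and m < 0.1:   # step 5: normalize (shift left, exponent - 1)
        m, exp = m * 10, exp - 1
    return sign, m, exp

# (0.5 x 10^2) * (0.4 x 10^3) = 0.2 x 10^5 = 20000
print(fp_multiply(0, 0.5, 2, 0, 0.4, 3))   # (0, 0.2, 5)
```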
UNIT-3
a) What are different generations of computer?2M
Ans:There are 5 different generations of computer:
 First generation (1945-54) used vacuum tubes and relay memories.
 Second generation (1955-64) used transistors and core memories.
 Third generation (1965-74) used ICs (Integrated Circuits), i.e. SSI & MSI (small-scale
and medium-scale integration).
 Fourth generation (1975-90) used LSI & VLSI (large-scale and very-large-scale
integration).
 Fifth generation (1991-present) uses advanced VLSI processors & ULSI.
b) Name some system attributes to performance?2M
Ans: System attributes to performance are
1. Clock Rate
2. CPI
3. MIPS Rate
4. Execution Time
5. Throughput Rate
c) What is a vector super computer ? 2M
Ans: A vector supercomputer is a type of high-performance computing system that
specializes in handling vector processing tasks. It is optimized to perform complex
mathematical operations on large data sets by processing data in "vectors" or
"arrays" instead of individual scalar values. A vector processor is usually built on top
of scalar processor ,it is attached to scalar processor as an optional feature
d) What is full form of SIMD and why do we need it ? 2M
Ans: The full form of SIMD is Single Instruction Multiple Data. SIMD is ideal for
applications that require the same operation to be performed on large sets of data,
such as matrix calculations, image processing, and simulations.
SIMD supercomputers are designed to improve performance, efficiency, resource
sharing and scalability.
e) What are parallel computer models ?2M
Ans: Parallel computer models are architectures of computers designed to perform
multiple operations simultaneously by dividing tasks across multiple processing
units. Here are the main types of parallel computer models:
a) Shared Memory Multiprocessor
b) Distributed Memory Multicomputer
c) SIMD Supercomputer
d) Vector Supercomputer
a) What is meant by shared memory model ? 3M
Ans: Shared memory multiprocessor systems are a type of parallel computer
architecture where multiple processors access the same physical memory space.
These systems allow processors to communicate and share data directly through a
common memory, making them suitable for a wide range of parallel computing
tasks. The system consists of two or more processors (CPUs) that can execute
instructions simultaneously. All processors have access to a shared global memory.
Communication between processors occurs implicitly through the shared memory.
When one processor writes to a memory location, other processors can read from
that location.
b) What is meant by distributed memory Multi computer ? 3M
Ans: A distributed memory multicomputer is a type of parallel computing
architecture consisting of multiple computers (called nodes), where each node is an
autonomous computer consisting of a processor, local memory, and attached disks or I/O
peripherals, and the processors communicate with one another via message passing. In a
distributed memory multicomputer the nodes are interconnected by a message-passing
network, which can be a mesh, ring, torus, hypercube, etc.

c) What is SIMD super computer ?3M


Ans: SIMD (Single Instruction, Multiple Data) is a type of parallel computing
architecture where a single instruction is executed on multiple data points
simultaneously. SIMD is ideal for applications that require the same operation to be
performed on large sets of data, such as matrix calculations, image processing, and
simulations.
Single Instruction: A single control unit issues one instruction to be executed.
Multiple Data: The same instruction operates on multiple data points at the same
time using multiple processors or ALU’s (Arithmetic Logic Units).
d) What are different shared memory models ? 3M
Ans: Different shared memory models are UMA, NUMA and COMA.
 UMA (Uniform Memory Access): Physical memory is uniformly shared by all
processors. All processors (PE1...PEn) take equal access time to memory;
thus they are termed Uniform Memory Access computers.
 NUMA (Non-Uniform Memory Access computers): Access time varies with the
location of memory. Shared memory is distributed to all processors and the
collection of all local memories forms a global memory space accessible by all
processors.
 COMA (Multiprocessor + Cache Memory = COMA model): A multiprocessor
using cache-only memory is a COMA machine. It is a special case of a NUMA
machine in which the distributed main memories are converted to caches; all the
caches together form a global address space.
e) What is vector super computer?3M
Ans: A vector supercomputer is a type of high-performance computing system that
specializes in handling vector processing tasks. It is optimized to perform complex
mathematical operations on large data sets by processing data in "vectors" or
"arrays" instead of individual scalar values. A vector processor is usually built on top
of scalar processor ,it is attached to scalar processor as an optional feature .
VECTOR PROCESSOR MODELS are of 2 types :
 Register to Register architecture
 Memory to memory architecture
a) What are different stages in the evolution of computer architecture ?5M
Ans: The study of computer architecture combines software and hardware aspects to
design efficient systems. Traditional computer architecture started with the
1. Von Neumann model, which executes instructions sequentially on scalar data. This
approach, though foundational, is limited in speed due to its serial nature. To address
these limitations,
2. lookahead techniques and pipelining were introduced. Lookahead allows
prefetching of instructions to overlap operations, while pipelining divides tasks into
smaller stages, enabling different parts of an instruction to be processed
simultaneously, significantly boosting performance.
3. Vector processing was another major development to enhance performance by
handling large data sets, particularly useful for repeated operations on arrays. Vector
processors contain multiple pipelines that can concurrently operate on one-
dimensional arrays, or vectors, controlled by hardware or firmware. There are two
types:
1) memory-to-memory (transferring data directly between memory and pipelines)
2) register-to-register (using registers to manage data flow). In addition,
4. SIMD (Single Instruction, Multiple Data) systems enable synchronized parallelism
across multiple data points by using multiple processing elements (PEs). Further
innovations led to
5. MIMD (Multiple Instruction, Multiple Data) systems, which allow processors to
execute different instructions on different data streams simultaneously. MIMD
systems are categorized into
1) shared memory multiprocessors, where processors share a common memory,
2) message-passing multicomputers, where each node has its local memory and
communicates via messages.
Together, these architectures form the backbone of modern parallel processing,
enabling faster and more complex computations in fields such as scientific research,
graphics, and AI.

b) Explain the working of UMA with a neat diagram?5M


Ans: In UMA, physical memory is uniformly shared by all processors. All processors
(PE1...PEn) take equal access time to memory; thus they are termed Uniform
Memory Access computers. Each PE can have its own private cache.
- High degree of resource sharing (memory and I/O) – tightly coupled
- Interconnection network can be a common bus, crossbar switch or multistage network
- When all PEs have equal access to all peripheral devices – Symmetric
Multiprocessor
- In an asymmetric multiprocessor only one subset of processors has peripheral
access. Master processors control slave (attached) processors.
Applications of UMA Model
- Suitable for general purpose and time-sharing applications by multiple users
- Can be used to speed up execution of a single program in time-critical
applications
DISADVANTAGES
- Interacting processes cause simultaneous access to the same locations – this causes a problem
when an update is followed by a read operation (the old value will be read)
- Poor scalability – as the number of processors increases, the shared memory area increases, and thus
the network becomes a bottleneck.
- The number of processors is usually in the range 10-100
c) Explain the working of NUMA with a neat diagram?5M
Ans: In Non-Uniform Memory Access (NUMA) systems, access time varies depending on
the location of the memory in relation to the processor requesting it. Here's how it
works:
1. Memory Distribution: In a NUMA architecture, shared memory is distributed
across all processors, with each processor having its own local memory. This local
memory can be accessed quickly by the processor it is attached to.
2.Global Memory Space: The collection of all local memories across processors
forms a global memory space. This global memory space is accessible by all
processors, allowing them to retrieve data from any processor's memory, not just
their own.
3. Access Time Variation: Accessing data stored within a processor's own local
memory is much faster than accessing data in remote memory. Accessing remote
memory incurs additional delay due to the interconnection between processors. As a
result, access time is non-uniform: it depends on whether the data is available locally
or must be fetched from a remote location.
In summary, NUMA architecture enhances efficiency by allowing each processor
faster access to its local memory, while also enabling inter-processor memory
sharing, albeit with slower access to remote memory.
d) Explain the working of a SIMD super computer ? 5M
Ans:
SIMD (Single Instruction, Multiple Data) is a type of parallel computing
architecture where a single instruction is executed on multiple data points
simultaneously. SIMD is ideal for applications that require the same operation to be
performed on large sets of data, such as matrix calculations, image processing, and
simulations.
How does a SIMD supercomputer work?
In a SIMD supercomputer, the control unit sends a single instruction to multiple
processing units (PEs), which execute the same instruction on different sets of
data. Here's a simplified step-by-step process:
1. Instruction Fetch: The control unit fetches a single instruction (e.g., add,
subtract, multiply).
2. Data Distribution: The data set is divided across multiple processors or
processing elements (PEs). Each processor gets a unique piece of the data.
3. Instruction Execution: All processors execute the same instruction at the
same time on their respective data points.
4. Result Aggregation: After execution, the results from each processor are
collected and, if needed, combined.
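As a rough software analogy (not an actual SIMD supercomputer, just an illustration of "one instruction, many data points"), NumPy's vectorized operations apply one operation across a whole array at once, in contrast with an explicit scalar loop:

```python
# Analogy only: one "instruction" (the + operator) applied to many data
# elements at once, versus a scalar loop that handles one element per step.
import numpy as np

a = np.arange(8)          # data distributed across "lanes": 0..7
b = np.arange(8, 16)      # second operand vector: 8..15

c_simd = a + b            # single vectorized add over all elements

c_scalar = np.empty_like(a)
for i in range(len(a)):   # SISD-style: one element per iteration
    c_scalar[i] = a[i] + b[i]

print(c_simd)                              # [ 8 10 12 14 16 18 20 22]
print(np.array_equal(c_simd, c_scalar))    # True
```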
e) Explain the working of a COMA?5M
Ans: The Cache-Only Memory Architecture (COMA) model is a type of
multiprocessor architecture that eliminates the use of traditional main memory,
relying instead on cache memory for data storage. Here's an overview of how the
COMA model works:
1. In COMA, each processor in a multiprocessor system has its own cache memory,
and there is no traditional main memory. Instead, the distributed caches of all
processors collectively form a global address space. This means that data
required by any processor can be stored in the cache of any processor in the system,
creating a shared memory environment through the distributed cache.
2. Data Access and Remote Caches: When a processor needs data, it first
checks its own cache. If the data is not present locally, the system searches for it in
the caches of other processors. The distributed cache directories (marked as D in
some architectures) keep track of data locations across the distributed caches,
making it possible to locate and retrieve data efficiently from remote caches.
Application of COMA: COMA is especially suited for general-purpose,
multi-user applications that benefit from high-speed data access and low-latency
memory operations.
In essence, COMA is a special case of Non-Uniform Memory Access (NUMA)
where the distributed main memories are transformed into caches. By using only
cache memory and coordinating data access through distributed directories, COMA
offers an efficient memory access model for parallel computing applications.
a) Explain the working, characteristics and applications of a SIMD? 10M
Ans:SIMD (Single Instruction, Multiple Data) is a type of parallel computing
architecture where a single instruction is executed on multiple data points
simultaneously.
The basic architecture of a SIMD system consists of:
Control Unit: Responsible for issuing a single instruction that all processing units
execute.
Processing Elements (PEs): A large number of processors that execute the same
instruction on different data sets in parallel.
Memory System: Each processor may have its own memory or share memory with
other processors. In most designs, memory is structured to support the simultaneous
access of multiple data sets by multiple processors.
Interconnection Network: This links the processors and memory, allowing efficient
data distribution and collection.
Working of a SIMD Supercomputer
In a SIMD supercomputer, the control unit sends a single instruction to multiple
processing units (PEs), which execute the same instruction on different sets of data.
Here’s a simplified step-by-step process:
1.Instruction Fetch: The control unit fetches a single instruction (e.g., add, subtract,
multiply).
2.Data Distribution: The data set is divided across multiple processors or
processing elements (PE’s). Each processor gets a unique piece of the data.
3.Instruction Execution: All processors execute the same instruction at the same
time on their respective data points.
4.Result Aggregation: After execution, the results from each processor are collected
and, if needed, combined.

Characteristics:
SIMD supercomputers are designed to exploit data-level parallelism, meaning
they can handle large volumes of repetitive operations on different data sets. The
main reasons for using SIMD supercomputers include:
1. Efficiency: They are highly efficient for tasks that involve repeating the same
operation on a large set of data.
2. Performance: SIMD reduces the overhead of fetching and decoding
multiple instructions, leading to faster computation times compared to SISD (Single
Instruction, Single Data) systems.
3. Resource Sharing: By sharing the same control unit among multiple processing
units, SIMD supercomputers are able to process vast amounts of data in parallel,
optimizing resource use.
4. Scalability: SIMD systems can easily scale to handle larger datasets or more
complex computations by adding more processing elements.
Applications:
SIMD is ideal for applications that require the same operation to be performed on
large sets of data, such as
Real-World Applications of SIMD Supercomputers
1. Graphics processing in images, videos, and games.
2. Scientific simulations such as weather forecasting and fluid dynamics simulations.
3. Machine learning, where operations like matrix multiplication and activation
functions can be efficiently executed using SIMD.
b) Explain the working, characteristics and applications of a distributed memory
multi computer ?10M
Ans:A distributed memory multicomputer is a type of parallel computing
architecture consisting of multiple computers (called nodes), where each node is an
autonomous computer consisting of a processor, local memory, and attached disks or I/O
peripherals, and processors communicate with one another via message passing. In a
distributed memory multicomputer, the nodes are interconnected by a message-passing
network, which can be a mesh, ring, torus, hypercube, etc.
 Thus multicomputers are also called No-Remote-Memory-Access (NORMA) machines.
Communication between nodes, when required, is carried out by passing messages
through the static connection network.
Working:
In distributed memory multi-computers, each processor executes its own program
and has its own local memory. They work using the following key principles:
1.Independent Processors: Each processor performs computations independently,
using its local memory.
2.Message Passing for Communication: Since the memory is not shared, processors
communicate with each other through message-passing mechanisms. This could
involve specialized communication libraries like MPI (Message Passing Interface); a
minimal sketch is given after this list.
3.Task Distribution: Tasks or workloads are divided among the processors. Each
processor may compute part of the problem independently. The processors may
exchange data during computation when needed, often at the start or end of an
operation.
4.Network Interconnect: A communication network (like Ethernet, InfiniBand,
etc.) connects the processors. Latency and bandwidth in this network can
significantly affect the performance of distributed memory systems.
5.Data Distribution:Data is distributed across the processors to minimize the
communication overhead. The division of data should ideally be balanced, so no
single processor becomes a bottleneck.
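Building on the message-passing mechanism in point 2, the following is a minimal MPI sketch; the buffer size, partner ranks, and the work done are illustrative assumptions only:
```c
#include <stdio.h>
#include <mpi.h>

/* Minimal message-passing sketch: every node computes on its own local
   data, then node 0 sends its partial result to node 1.  The buffer
   size and the work performed are illustrative assumptions only. */
int main(int argc, char **argv) {
    int rank, size, local_sum = 0, data[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each node works only on its own local memory. */
    for (int i = 0; i < 4; i++) {
        data[i] = rank * 4 + i;
        local_sum += data[i];
    }

    if (rank == 0 && size > 1) {
        /* No shared memory exists, so results move by explicit messages. */
        MPI_Send(&local_sum, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int other;
        MPI_Recv(&other, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1: local %d + received %d = %d\n",
               local_sum, other, local_sum + other);
    }

    MPI_Finalize();
    return 0;
}
```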
Characteristics:
Distributed memory multi-computers are used for several reasons, especially in
large-scale parallel computing systems:
Scalability: Shared memory architectures do not scale well for a large number of
processors because of memory access bottlenecks (that is when one processor is
using memory the other cannot use it) . Distributed memory systems allow systems
to scale efficiently by adding more nodes, with each node managing its own
memory.
Cost-Effectiveness: Distributed memory systems are cheaper to scale because each
node is relatively independent and can be made of standard, off-the-shelf
components.
Higher Memory Bandwidth: In shared memory systems, processors often compete
for access to a central memory, leading to performance degradation. Distributed
memory systems avoid this issue by providing each processor with its own local
memory.
Suitability for Data-Parallel Applications: In distributed memory systems, each
node can handle a portion of the data independently, performing computations in
parallel, and only communicating results when necessary.
Fault Tolerance: In a distributed system, if one node fails, it often does not bring
down the entire system. Other nodes can continue working, which improves
reliability.

Applications:

High-performance computing (HPC) applications like

1. Climate modeling,
2. Fluid dynamics simulations, and
3. Machine learning training often use distributed memory systems.

c) Explain the working, characteristics and applications of a Vector super computer ?


10M
Ans: A vector supercomputer is designed to process data in vector form, allowing it
to handle multiple data elements simultaneously. Unlike traditional scalar processors
that handle one operation at a time a vector processor operates on one-dimensional
arrays of data, known as vectors. This enables the system to perform the same
operation on multiple data points in parallel, significantly speeding up computations
that involve large data sets. A vector processor is usually built on top of a scalar
processor; it is attached to the scalar processor as an optional feature.

Working: Program and data are first loaded into main memory through the host
computer. Instructions are first decoded by the scalar control unit: if an instruction is a
scalar or program-control operation, it is executed directly using the scalar functional
pipelines; if it is a vector operation, it is sent to the vector control unit. The vector
control unit supervises and coordinates the flow of vector data between main memory
and the vector functional pipelines. A number of vector functional pipelines may be
built into a vector processor.
VECTOR PROCESSOR MODELS are of 2 types :
 Register to Register architecture
 Memory to memory architecture
REGISTER to REGISTER architecture
- In a register-to-register architecture, vector registers are used to hold vector
operands and intermediate and final vector results. All vector registers are
programmable, and the length of a vector register is usually fixed; some machines use
re-configurable vector registers to dynamically match the register length (e.g., Fujitsu
VP2000).
MEMORY to MEMORY architecture
- It differs from the register-to-register architecture in its use of a vector stream unit
in place of vector registers. Vector operands and results are directly retrieved from
and stored into main memory in superwords (e.g., 512 bits in the Cyber 205).
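To make the contrast with scalar execution concrete, the loop below is the classic SAXPY kernel written in scalar C; the array length and values are arbitrary. A vector processor would fetch x and y as vector operands (into vector registers, or streamed from memory in a memory-to-memory design) and execute the whole loop as a single pipelined vector multiply-add operation:
```c
#include <stdio.h>

#define N 64

/* SAXPY (Y = a*X + Y), the classic vectorizable kernel.  A vector
   processor would treat x and y as vector operands and issue the
   whole loop as one pipelined vector instruction. */
int main(void) {
    float a = 2.0f, x[N], y[N];

    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    for (int i = 0; i < N; i++)       /* one vector operation's worth of work */
        y[i] = a * x[i] + y[i];

    printf("y[0] = %.1f, y[%d] = %.1f\n", y[0], N - 1, y[N - 1]);
    return 0;
}
```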
Characteristics:
1. High Performance on Repetitive Operations
2. Vector Instruction Set
3. Memory Bandwidth
4. Pipelining
5. SIMD Architecture
Applications of a Vector Supercomputer:
1) Scientific Computing: Vector supercomputers are widely used in scientific
fields for tasks such as weather forecasting, climate modeling, and fluid
dynamics simulations.
2) Engineering and Computational Fluid Dynamics: Fields like aerospace
engineering rely on vector processing to model airflow, heat distribution, and
stress analysis.
3) Big Data Analytics: In applications that analyze large volumes of data, such as
machine learning and data mining, vector supercomputers process datasets
efficiently by applying vectorized algorithms. For example, they can be used in
genomic research, where they rapidly process genetic data.
4) High-Performance Physics Simulations: In physics, simulations of particle
interactions or molecular dynamics require intensive computations, which are
well-suited for vector processing due to the similarity of operations across large
data sets.
UNIT-4

a) What are different processor families ? 2M


Ans: Major processor families are CISC, RISC, superscalar, VLIW, super-
pipelined, vector, and symbolic processors. Scalar and vector processors are for
numerical computations. Symbolic processors have been developed for AI
applications.
b) What is instruction pipelining? 2M
Ans: Instruction Pipelining is a technique used in computer architecture to
improve the throughput of instruction execution. Instead of executing one
instruction at a time, pipelining divides the execution of an instruction into
multiple stages, where each stage performs a specific operation (e.g., fetching,
decoding, executing, etc.). These stages are executed concurrently for different
instructions, allowing multiple instructions to be processed simultaneously, one at
each stage of the pipeline.
c) Draw the design space of processors with RISC and CISC processor
families? 2M
Ans:

Processor families can be mapped onto a coordinated space of clock rate versus
cycles per instruction (CPI), as illustrated in Fig. 4.1.
• Two main categories of processors are:-
o CISC (eg:X86 architecture)
o RISC(e.g. Power series, SPARC, MIPS, etc.) .
Under both CISC and RISC categories, products designed for multi-core chips,
embedded applications, or for low cost and/or low power consumption, tend to
have lower clock speeds.
d) Define CISC & RISC scalar processor ? 2M
Ans: CISC (Complex Instruction Set Computer) scalar processors are designed
with a complex instruction set that includes instructions capable of performing
multiple low-level operations, such as memory access, arithmetic, and branching,
in a single instruction. These processors aim to reduce the number of instructions
per program by providing highly specialized instructions.
RISC (Reduced Instruction Set Computer) scalar processors use a simpler
instruction set with each instruction performing a small, atomic operation. These
processors focus on achieving high performance by optimizing instruction
execution through simplicity and pipelining.
e) What is a superscalar processor ? 2M
Ans:A superscalar processor is a type of CPU that can execute more than one
instruction per clock cycle by using multiple execution units. Unlike scalar
processors, which handle only one instruction at a time, superscalar processors
achieve parallelism by issuing and executing multiple instructions simultaneously.
a) Explain how the design space of processors impact performance and
efficiency? 3M
Ans: The design space of processors refers to the various architectural features and
choices made during processor development, such as instruction set design,
parallelism, memory hierarchy, and energy efficiency. These design choices
directly influence a processor's performance and efficiency. Below are key design
aspects and their impacts:
CISC (Complex Instruction Set Computing): Provides rich, complex
instructions, reducing code size but increasing decoding complexity, which may
slow performance.
RISC (Reduced Instruction Set Computing): Uses simpler instructions executed
faster, allowing pipelining and parallelism for better efficiency.
b) Describe the differences between CISC and RISC processors? 3M
Ans:
S.No  CISC                                      RISC
1     Large set of instructions with variable   Small set of instructions with fixed
      formats (16-64 bits per instruction)      (32-bit) format, mostly register based
2     12-24 addressing modes                    3-5 addressing modes
3     8-24 general purpose registers,           Large register file (32-192 registers),
      unified instruction/data cache            split instruction and data caches
4     CPI between 2 and 15                      Average CPI below 1.5
5     Clock rates of 33-50 MHz                  Clock rates of 50-150 MHz

c) Explain processor and co-processor connection with a neat diagram?


3M
Ans:Processors: The central processing unit (CPU) is the main processor of a computer. It
handles all the computational tasks.
Co-processors: Co-processors are specialized processors designed to handle specific types of tasks,
such as floating-point arithmetic or graphics processing. They work in conjunction with the main
CPU to improve overall performance. CPUs can have an integrated FPU or use attached
co-processors.
a) Discuss the architectural features of CISC scalar processors with a
neat diagram ? 5M
Ans:The architectural features of the CISC architecture :

Micro-programmed Control:The control unit utilizes microcode, a sequence of


micro-instructions that execute atomic operations.Micro-instructions are stored in a
special memory called the micro-program control memory.This allows for flexible
and complex instruction execution, as each instruction is broken down into a series
of micro-instructions.

Unified Cache:The cache is used to store both instructions and data.This design
simplifies the cache hardware and reduces the need for separate instruction and
data caches.However, it can potentially lead to performance bottlenecks if both
instructions and data are heavily accessed simultaneously.

Complex Instruction Set:CISC architectures typically have a large and complex


instruction set.Each instruction can perform multiple operations, such as
arithmetic, logical, and memory access operations.This can reduce the number of
instructions needed to execute a program, but can also increase instruction
decoding and execution time.

Instruction and Data Path:The instruction and data path is responsible for
fetching instructions, decoding them, and executing the corresponding operations
on data.This path includes components like registers, ALU’s (Arithmetic Logic
Units), and data buses.
Additional Considerations:CISC architectures often have variable-length
instructions, which can complicate instruction decoding and execution.They may
also have complex addressing modes, which can increase instruction complexity.
 12-24 addressing modes
 8-24 general purpose registers
 Unified instruction and data cache
 CPI between 2 and 15
While CISC architectures were dominant in the past, they have been largely
replaced by RISC architectures.
b) Discuss in detail the architectural features of RISC scalar processors
with a neat diagram ? 5M

Ans:Key Architectural Features of the RISC Architecture:


Hardwired Control:The control unit uses a fixed logic circuit to generate control
signals.This design is simpler and faster than micro-programmed control, as it
doesn't require the overhead of fetching and decoding micro-instructions.

Split Instruction and Data Cache:The cache is divided into two separate caches:
one for instructions and one for data.This design can improve performance by
reducing cache conflicts and increasing the likelihood of cache hits.

Reduced Instruction Set:RISC architectures have a smaller and simpler


instruction set compared to CISC architectures.Each instruction performs a single,
well-defined operation, making them easier to decode and execute.

Load-Store Architecture:Memory access is limited to load and store


instructions.This simplifies the instruction set and reduces the complexity of the
data path.

Pipelining:RISC architectures are highly pipelined, meaning that multiple


instructions can be executed in parallel.This increases the overall performance of
the processor.

Additional Considerations:RISC architectures often have fixed-length


instructions, which simplifies instruction decoding and fetching.They typically
have a smaller number of addressing modes, which also simplifies instruction
execution.Small set of instructions with fixed (32 bit) format, mostly register based

 3-5 addressing modes
 Large register file (32-192 general purpose registers)
 Separate instruction and data caches
 Average CPI below 1.5 through pipelining
c) Explain the working mechanism of VAX8600 CISC Processor with a
diagram? 5M
Ans:VAX 8600 processor uses typical CISC architecture with micro-programmed
control.
• The instruction set contained about 300 instructions with 20 different addressing
modes.
• The CPU in the VAX 8600 consisted of two functional units for concurrent
execution of integer and floating-point instructions.
• The unified cache was used for holding both instructions and data.
• There were 16 GPR’s in the instruction unit.
• Instruction pipelining was built with six stages in the VAX 8600.
• The instruction unit prefetched and decoded instructions, handled branching
operations, and supplied operands to the two functional units in a pipelined
fashion.
• A translation lookaside buffer [TLB) was used in the memory control unit for fast
generation of a physical address from a virtual address.
• Both integer and floating-point units were pipelined.
• The CPI of a VAX 8600 instruction varied from 2 to 20 cycles, because both
multiply and divide instructions need the execution units for a large number of cycles.

d) Explain the working mechanism of Motorola MC 68040 CISC


Processor with a diagram? 5M
Ans:

The architecture has involved


• Separate instruction and data memory unit, with a 4-Kbyte data cache, and a
4-Kbyte instruction cache, with separate memory management units (MMUs)
supported by an address translation cache (ATC), equivalent to the TLB used in
other systems.
• The processor implements 113 instructions using 16 general-purpose registers.
• The 18 addressing modes include: register direct and indirect, indexing, memory
indirect, program counter indirect, absolute, and immediate modes.
• The instruction set includes data movement, integer, BCD, and floating point
arithmetic, logical, shifting, bit-field manipulation, cache maintenance, and
multiprocessor communications, in addition to program and system control and
memory management instructions.
• The integer unit is organized in a six-stage instruction pipeline.
• The floating-point unit consists of three pipeline stages .
• All instructions are decoded by the integer unit. Floating-point instructions are
forwarded to the floating point unit for execution.
• Separate instruction and data buses are used to and from the instruction and data
from memory units, respectively. Dual MMU’s allow interleaved fetch of
instructions and data from the main memory.
• Three simultaneous memory requests can be generated by the dual MMU’s,
including data operand read and write and instruction pipeline refill.
• Snooping logic is built into the memory units for monitoring bus events for cache
invalidation.
• The complete memory management is provided with support for virtual demand
paged operating system.
• Each of the two ATC’s has 64 entries providing fast translation from virtual
address to physical address.
e) Explain the working of SUN SPARK RISC Processor with floating point
unit in the diagram? 5M

Ans: SUN SPARC Architecture Overview: SPARC (Scalable Processor


Architecture) is a RISC architecture known for its simplicity, efficiency, and
scalability. It has been widely used in servers and workstations due to its high
performance and reliability.

Key Features of SPARC:

 Reduced Instruction Set: SPARC uses a simple instruction set with fixed-
length instructions, making it easier to decode and execute.
 Load-Store Architecture: Memory access is limited to load and store
instructions, simplifying the data path.
 Register Windows: SPARC uses a unique register window scheme to reduce
the number of load and store instructions, improving performance.
 Pipelining: SPARC employs pipelining to execute multiple instructions
concurrently, increasing throughput.

Floating-Point Unit (FPU)

The FPU is a specialized unit within the SPARC processor that handles floating-
point arithmetic operations. It is designed to perform calculations on real numbers,
which are represented in a format that includes a mantissa and an exponent.

Working of SPARC with FPU: Here's a simplified overview of how a SPARC


processor with an FPU executes a floating-point operation:

1. Instruction Fetch: The instruction containing the floating-point operation is fetched from
memory and decoded by the instruction decoder.
2. Operand Fetch: The operands for the operation are fetched from the register file.
3. FPU Execution: The FPU performs the specified floating-point operation on the
operands. This may involve several stages, including:

o Normalization: Adjusting the mantissa and exponent to a standard format.


o Arithmetic Operation: Performing the actual addition, subtraction,
multiplication, or division.
o Rounding: Rounding the result to the desired precision.
4. Result Write-Back: The result of the operation is written back to the register file.
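As a small aside on the mantissa/exponent form mentioned above, the standard C library function frexp() exposes this decomposition; the value 6.75 is an arbitrary example chosen only for illustration:
```c
#include <stdio.h>
#include <math.h>

/* frexp() splits a double into a normalized fraction m (0.5 <= |m| < 1)
   and an integer exponent e so that x = m * 2^e. */
int main(void) {
    int e;
    double m = frexp(6.75, &e);
    printf("6.75 = %g * 2^%d\n", m, e);   /* prints 0.84375 * 2^3 */
    return 0;
}
```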

Diagram of SPARC Processor with FPU

In conclusion, the SPARC architecture with its integrated FPU is well-suited for
applications that require high-performance floating-point computations, such as
scientific simulations, financial modeling, and image processing.
a) Describe in detail the design space of processors and how it
influences the choice between CISC and RISC architectures ? 10M
Ans:Design space of CISC ,RISC, Superscalar and VLIW processors
 The CPI of different CISC instructions varies from 1 to 20. Therefore, CISC
processors are at the upper part of the design space. With advanced
implementation techniques, the clock rate of today‘s CISC processors ranges up to
a few GHz.
 With efficient use of pipelines, the average CPI of RISC instructions has been
reduced to between one and two cycles.
 An important subclass of RISC processors are the superscalar processors, which
allow multiple instructions to be issued simultaneously during each cycle. Thus the
effective CPI of a superscalar processor should be lower than that of a scalar RISC
processor. The clock rate of superscalar processors matches that of scalar RISC
processors.
 The very long instruction word (VLIW) architecture can in theory use even more
functional units than a superscalar processor. Thus the CPI of a VLIW processor
can be further lowered. Intel‘s i860 RISC processor had VLIW architecture.
The effective CPI of a processor used in a supercomputer should be very low,
positioned at the lower right corner of the design space. However, the cost and
power consumption increase appreciably if processor design is restricted to the
lower right corner

Processor families can be mapped onto a coordinated space of clock rate versus
cycles per instruction (CPI), as illustrated in Fig. 4.1. As implementation
technology evolves rapidly, the clock rates of various processors have moved from
low to higher speeds toward the right of the design space (i.e., an increase in clock rate),
and processor manufacturers have been trying to lower the CPI (the number of cycles
taken to execute an instruction) using innovative hardware approaches.
Two main categories of processors are:-
o CISC (eg:X86 architecture)
o RISC(e.g. Power series, SPARC, MIPS, etc.) .
Under both CISC and RISC categories, products designed for multi-core chips,
embedded applications, or for low cost and/or low power consumption, tend to
have lower clock speeds. High performance processors must necessarily be
designed to operate at high clock speeds. The category of vector processors has
been marked VP; vector processing features may be associated with CISC or RISC
main processors.

b) Explain working of super scalar processor with a neat diagram ? 10M


Ans:
 The instruction issue degree (m) in superscalar processors is typically limited to
between 2 and 5 (2 ≤ m ≤ 5).
 A superscalar processor of degree m can issue m instructions per cycle. In this
sense, the base scalar processor, implemented either in RISC or CISC, has m =
1.
 In order to fully utilize a superscalar processor of degree m, m instructions
must be executable in parallel.
 This situation may not be true in all clock cycles. In that case, some of the
pipelines may be stalling in a wait state.
 In a superscalar processor, the simple operation latency should require only
one cycle, as in the base scalar processor.
 Due to the desire for a higher degree of instruction-level parallelism in
programs, the superscalar processor depends more on an optimizing
compiler to exploit parallelism.
 The instruction cache supplies multiple instructions per fetch. However, the
actual number of instructions issued to various functional units may vary in
each cycle.
 The number is constrained by data dependencies and resource conflicts among
instructions that are simultaneously decoded .
 Multiple functional units are built into the integer unit and into the floating
point unit. Multiple data buses exist among the functional units. In theory, all
functional units can be simultaneously used if conflicts and dependencies do
not exist among them during a given cycle.
 The maximum number of instructions issued per cycle ranges from two to
five in these superscalar processors.
 Typically, the register files in the IU and FPU each have 32 registers. Most
superscalar processors implement both the IU and the FPU on the same chip.
The superscalar degree is low due to limited instruction parallelism that can be
exploited in ordinary programs.

c) Explain working of Intel i860 with a neat diagram ? 10M


Ans: It was a 64-bit RISC processor fabricated on a single chip containing more
than 1 million transistors.
• The peak performance of the i860 was designed to reach 80 Mflops single-
precision or 60 Mflops double-precision, or 40 MIPS in 32-bit integer operations at
a 40-MHz clock rate.
• In the block diagram there were nine functional units (shown in 9 boxes)
interconnected by multiple data paths with widths ranging from 32 to 128 bits.
All external or internal address buses were 32-bit wide, and the external data path
or internal data bus was 64 bits wide. However, the internal RISC integer ALU
was only 32 bits wide.
• The instruction cache had 4 Kbytes organized as a two-way set-associative
memory with 32 bytes per cache block. lt transferred 64 bits per clock cycle,
equivalent to 320 Mbytes/s at 40 MHz.
• The data cache was a two-way set associative memory of 8 Kbytes. lt transferred
128 bits per clock cycle (640 Mbytes/s) at 40 MHZ .
• The bus control unit coordinated the 64-bit data transfer between the chip and the
outside world.
• The MMU implemented protected 4 Kbyte paged virtual memory of 2^32 bytes
via a TLB .
• The RISC integer unit executed load, store. Integer , bit, and control instructions
and fetched instructions for the floating-point control unit as well.
• There were two floating-point units, namely, the multiplier unit and the adder
unit which could be used separately or simultaneously under the coordination of
the floating-point control unit.Special dual-operation floating-point instructions
such as add-and-multiply and subtract-and-multiply used both
the multiplier and adder units in parallel .
• The graphics unit supported three-dimensional drawing in a graphics frame
buffer, with color intensity, shading, and hidden surface elimination.
• The merge register was used only by vector integer instructions. This register
accumulated the results of multiple addition operations .
UNIT-5
a) What are Asynchronous and Synchronous models in pipeline processing?
2M
Ans: In an asynchronous pipeline, data flow between adjacent stages is controlled
by a handshake protocol. In a synchronous pipeline, data flow between stages is
synchronized by a common clock signal. This clock signal triggers the transfer of
data between stages at regular intervals.
b) Define Clocking and Timing Control in the context of pipelining? 2M
Ans:The clock cycle of a pipeline is the time it takes for data to move from one
stage to the next. It's determined by the maximum stage delay (Tmax) and the latch
delay (d).Clock skew refers to the difference in arrival times of the clock signal at
different stages of the pipeline.This can lead to timing issues, as data may not be
stable at the receiving stage when the clock signal arrives.
Clock Cycle Time (T): T = max{Ti} + d = Tm + d, where Ti is the delay of stage i,
Tm is the maximum stage delay, and d is the latch delay.
c) Explain the concept of Speedup in pipeline processing ? 2M
Ans: The speedup ratio is the ratio of the time taken by a non-pipelined processor to
the time taken when pipelining is used. Speedup (S): S = T_np / T_p, where
T_np is the execution time of a non-pipelined processor and
T_p is the execution time of a k-stage pipelined processor.
Let tp be one clock cycle time. The time taken for n processes on a k-segment
pipeline is k*tp + (n-1)*tp = (k+n-1)*tp. If the time taken for one process is tn, the
time taken to complete n processes in a non-pipelined configuration is n*tn. Thus the
speedup ratio is S = n*tn / [(k+n-1)*tp].
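For example, with k = 4 stages, n = 100 tasks, and tn = 4*tp (each task needs k cycles without pipelining), the pipelined time is (4 + 100 - 1)*tp = 103*tp, the non-pipelined time is 100*4*tp = 400*tp, and the speedup is 400/103 ≈ 3.88, approaching the ideal value k = 4 as n grows.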
d) What do you understand by pipeline Efficiency?2M
Ans: The efficiency of a linear pipeline is measured by the percentage of time the
processors are busy over the total time taken, i.e., the sum of busy time plus idle time.
Thus if n is the number of tasks, k the number of pipeline stages, and t the clock period,
the efficiency is given by η = n / [k + n - 1]. Efficiency (E): E = S / S_ideal
e) Define Throughput in pipeline processing ?2M
Ans: The number of tasks completed by a pipeline per unit time is called the
throughput; this represents the computing power of the pipeline. We define throughput
as W = n / [k*t + (n-1)*t] = η/t.
a) What are the factors affecting Speedup in pipelined processors ?3M
Ans: Speedup is a measure of how much faster a pipelined processor can execute
tasks compared to a non-pipelined processor.Ideally,a k-stage pipeline can
process n tasks in k+(n-1) clock cycles.The speedup factor (S) is calculated by
dividing the execution time of the non-pipelined processor (Tnp) by the execution
time of the pipelined processor (Tp).

 The speedup factor is directly related to the number of stages (k) in the
pipeline. Speedup (S): S = T_np / T_p, where T_np is the execution time of a
non-pipelined processor and T_p is the execution time of a k-stage pipelined
processor. Ideal Speedup (S_ideal): S_ideal = k

b) What are Asynchronous models in pipeline processing? Explain with an


example? 3M
Ans:In an asynchronous pipeline, data flow between adjacent stages is controlled
by a handshake protocol. This handshake mechanism ensures that data is
transferred only when the receiving stage is ready, preventing data loss or
corruption due to timing mismatches.
This protocol involves the following steps:
1. Data Input: External data (operands) is fed into the first stage (S₁) of the pipeline.
2. Ready Signal: The first stage processes the data and sends a "ready" signal to the
second stage (S₂).
3. Acknowledge and Transfer: S₂ acknowledges the data transfer and sends an
"acknowledge" signal back to S₁, after which S₁ sends the processed data to S₂.
4. Sequential Processing: This process continues sequentially for each stage in the
pipeline, with each stage processing its data and sending a "ready" signal to the next
stage.
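A toy, single-threaded sketch of this ready/acknowledge ordering is given below; real asynchronous pipelines implement the handshake with hardware signals, so the loop only illustrates the sequence of events, and all names are illustrative:
```c
#include <stdio.h>

/* Toy sketch of a two-stage hand-off: stage 1 raises a "ready" flag
   when it has produced a result, and stage 2 raises "ack" once it has
   taken the data. */
int main(void) {
    int ready = 0, ack = 0, data_latch = 0;

    for (int item = 1; item <= 3; item++) {
        /* Stage S1: produce a result and signal ready. */
        data_latch = item * 10;
        ready = 1;
        printf("S1: data %d ready\n", data_latch);

        /* Stage S2: consume only when ready is set, then acknowledge. */
        if (ready) {
            printf("S2: received %d, sending ack\n", data_latch);
            ack = 1;
        }

        /* Stage S1: on ack, drop ready and move on to the next item. */
        if (ack) { ready = 0; ack = 0; }
    }
    return 0;
}
```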
c) What are Synchronous models in pipeline processing? Explain with an
example ?3M
Ans:In a synchronous pipeline, data flow between stages is synchronized by a
common or global clock signal. This clock signal triggers the transfer of data
between stages at regular intervals. Each stage in the pipeline has a latch that
stores the data being processed. When the clock signal arrives, the latches
simultaneously transfer the data to the next stage.

1. Data Input: External data (operands) is fed into the first stage (S₁) of the
pipeline.
2. Processing and Latching: The first stage processes the data and stores the
result in its latch.
3. Clock Signal: The clock signal arrives, triggering the transfer of data from
the latches in all stages to the next stage.
4. Sequential Processing: This process continues sequentially for each stage in
the pipeline, with each stage processing its data and storing the result in its
latch.

d) Describe the phases of Instruction Execution in a pipeline processor? 3M


Ans: These phases are suited to overlapped execution on a linear pipeline; each
phase may require one or more clock cycles to execute depending on the
instruction type and the processor/memory architecture.
Fetch Stage – fetches instructions from cache memory, ideally one per cycle.
Decode Stage – reveals the function the instruction performs and identifies the
resources needed (registers, buses, functional units, etc.).
Issue Stage – reserves resources; operands are read from registers.
Execute Stage – instructions are executed in one or several execute stages.
Write-back Stage – writes results into registers.
e) Explain the Differences between Synchronous and Asynchronous models ?
3M
Ans:
Data Flow Control: Synchronous pipelines are synchronized by a common clock signal,
while asynchronous pipelines are controlled by a handshake protocol between stages.
Clock Dependency: Synchronous pipelines require a clock signal for timing and data
transfer, while asynchronous pipelines operate without a clock, enabling self-timed
operation.
Processing Mechanism: In synchronous pipelines, data transfer occurs simultaneously at
regular intervals with the clock signal; in asynchronous pipelines, data transfer occurs
only when the receiving stage is ready.
Data Storage: Synchronous pipelines use latches in each stage to store and transfer
data, while asynchronous pipelines rely on acknowledgment signals for data transfer.
a) Analyze the differences between Asynchronous and Synchronous models in
pipelining ? 5M
Ans:
Data Flow Control: Synchronous pipelines are synchronized by a common clock signal,
while asynchronous pipelines are controlled by a handshake protocol between stages.
Clock Dependency: Synchronous pipelines require a clock signal for timing and data
transfer, while asynchronous pipelines operate without a clock, enabling self-timed
operation.
Processing Mechanism: In synchronous pipelines, data transfer occurs simultaneously at
regular intervals with the clock signal; in asynchronous pipelines, data transfer occurs
only when the receiving stage is ready.
Data Storage: Synchronous pipelines use latches in each stage to store and transfer
data, while asynchronous pipelines rely on acknowledgment signals for data transfer.
Flexibility: Synchronous pipelines are less flexible, as all stages operate at the same
clock rate; asynchronous pipelines are more flexible, as they accommodate variable
processing times and delays.
Design Complexity: Synchronous pipelines are simpler to design due to the use of a
global clock; asynchronous pipelines are more complex to design due to the need for
handshake protocols.
Performance: Synchronous pipelines give predictable performance with regular data
transfer intervals; asynchronous performance may vary depending on processing times
and delays.
Application: Synchronous pipelines are commonly used in modern processors due to
their simplicity and predictability; asynchronous pipelines suit scenarios with variable
processing times or unpredictable delays.
b) Discuss the Instruction Execution phases ? 5M
Ans:Instruction Execution Phases :
These phases are suited to overlapped execution on a linear pipeline; each
phase may require one or more clock cycles to execute depending on the
instruction type and the processor/memory architecture.
Fetch Stage – fetches instructions from cache memory, ideally one per cycle.
Decode Stage – reveals the function the instruction performs and identifies the
resources needed (registers, buses, functional units, etc.).
Issue Stage – reserves resources; operands are read from registers.
Execute Stage – instructions are executed in one or several execute stages.
Write-back Stage – writes results into registers.

(Figure a and Figure b: instruction issue timing before and after reordering)

Figure above shows the flow of machine instructions through a typical pipeline.
These eight instructions are for pipelined execution of the high-level language
statements X = Y + Z and A = B * C. Here we have assumed that load and store
instructions take four execution clock cycles, while floating-point add and
multiply operations take three cycles. Figure a (above) illustrates the issue of
instructions following the original program order. The shaded boxes correspond
to idle cycles when instruction issues are blocked due to resource latency or
conflicts or due to data dependencies. The first two load instructions issue on
consecutive cycles. The add is dependent on both loads and must wait three
cycles before the data (Y and Z) are loaded in. Similarly, the store of the sum to
memory location X must wait three cycles for the add to finish due to a flow
dependence. Figure b (above) shows an improved timing after the instruction
issuing order is changed to eliminate unnecessary delays due to dependence. The
idea is to issue all four load operations in the beginning. Both the add and
multiply instructions are blocked fewer cycles due to this data prefetch. The
reordering should not change the end results. The time required is being reduced
to 11 cycles, measured from cycle 4 to cycle 14.
c) Explain the concept of Pipeline Schedule Optimization with an example ?
5M
Ans: Pipeline Schedule Optimization technique based on the Minimal Average
Latency (MAL) concept inserts non-compute delay stages into a pipeline to
modify the reservation table, resulting in a new collision vector and an improved
state diagram. This aims to achieve an optimal latency cycle, which is the shortest
possible.
Bounds on the MAL:Shar (1972) determined the following bounds on the MAL
for a statically reconfigured pipeline:

1. Lower Bound: The MAL is lower-bounded by the maximum number of


check marks in any row of the reservation table.
2. Upper Bound: The MAL is upper-bounded by the number of 1's in the
initial collision vector plus 1.

These bounds suggest that the optimal latency cycle must be selected from one of
the lowest greedy cycles in the state diagram. However, a greedy cycle alone does
not guarantee the optimality of the MAL. The lower bound provides a guarantee
of optimality.

Example:

Reservation table for a function X (one placement of the marks consistent with the
forbidden and permissible latencies listed below):

Stage/Time   1   2   3   4   5   6   7   8
S1           X                   X       X
S2               X       X
S3                   X       X       X

Reservation table for a function Y:

Stage/Time   1   2   3   4   5   6
S1           Y               Y
S2               Y
S3               Y       Y       Y

Latency: The number of time units (clock cycles) between two initiations of a
pipeline is the latency between them. Latency values must be non-negative
integers.

Collision: When two or more initiations attempt to use the same pipeline stage at the
same time, a collision occurs. A collision implies a resource conflict between
two initiations in the pipeline, so it should be avoided.

Forbidden Latency: Latencies that cause collisions are called forbidden


latencies. (E.g. in above reservation table 2, 4, 5 and 7 are forbidden latencies).

Permissible Latency: Latencies that do not cause any collision are called
permissible latencies. (E.g. in above reservation table 1, 3 and 6 are permissible
latencies).

Latency Cycle: A Latency cycle is a latency sequence which repeats the same
subsequence (cycle) indefinitely.The Average Latency of a latency cycle is
obtained by dividing the sum of all latencies by the number of latencies along the
cycle.The latency cycle (1, 8) has an average latency of (1+8)/2=4.5. A Constant
Cycle is a latency cycle which contains only one latency value. (E.g. Cycles (3)
and (6) both are constant cycle).

Collision Vector: The combined set of permissible and forbidden latencies can
be easily displayed by a collision vector, which is an m-bit (m<=n-1 in a n
column reservation table) binary vector C=(CmCm-1….C2C1). The value of
Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible. (E.g. Cx =
(1011010)).
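The collision vector can be computed mechanically from a reservation table: a latency d is forbidden whenever some row has two marks d columns apart. The sketch below does this for the function-X table shown earlier (the mark positions are the placement used in that table, which reproduces the forbidden latencies 2, 4, 5 and 7):
```c
#include <stdio.h>

#define STAGES 3
#define COLS   8

/* Computes forbidden latencies and the collision vector for the
   function-X reservation table above. */
int main(void) {
    int rt[STAGES][COLS] = {
        {1,0,0,0,0,1,0,1},   /* S1 marked at cycles 1, 6, 8 */
        {0,1,0,1,0,0,0,0},   /* S2 marked at cycles 2, 4    */
        {0,0,1,0,1,0,1,0}    /* S3 marked at cycles 3, 5, 7 */
    };
    int forbidden[COLS] = {0};    /* forbidden[d] = 1 if latency d collides */

    for (int s = 0; s < STAGES; s++)
        for (int i = 0; i < COLS; i++)
            for (int j = i + 1; j < COLS; j++)
                if (rt[s][i] && rt[s][j])
                    forbidden[j - i] = 1;

    printf("Collision vector C = (");
    for (int d = COLS - 1; d >= 1; d--)   /* print Cm ... C1, m = COLS - 1 */
        printf("%d", forbidden[d]);
    printf(")\n");                        /* prints (1011010) */
    return 0;
}
```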

State Diagram: Specifies the permissible state transitions among successive


initiations based on the collision vector.

Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in
which each state appears only once. In above state diagram only (3), (6), (8), (1,
8), (3, 8), and (6, 8) are simple cycles. The cycle(1, 8, 6, 8) is not simple as it
travels twice through state (1011010).A Greedy Cycle is a simple cycle whose
edges are all made with minimum latencies from their respective starting states.
The cycle (1, 8) and (3) are greedy cycles.MAL (Minimum Average Latency) is
the minimum average latency obtained from the greedy cycle. In greedy cycles
(1, 8) and (3), the cycle (3) leads to MAL value 3.For functions X and Y, the
MAL is 3, and both have met the lower bound of 3 from their respective
reservation tables. However, the upper bound on the MAL for function X is
4+1=5, a rather loose bound. On the other hand, the upper bound for function Y is
2+1=3, a tighter bound. Therefore, all greedy cycles for function Y lead to the
optimal latency value of 3, which cannot be further reduced.

Optimization Technique

To optimize the MAL, one needs to find the lower bound by modifying the
reservation table. The approach is to reduce the maximum number of checkmarks
in any row while preserving the original function being evaluated. Patel and
Davidson (1976) proposed using non-compute delay stages to increase pipeline
performance and achieve a shorter MAL.
d) Describe the role of Branch Handling Techniques in maintaining pipeline
efficiency ? 5M
Ans:BRANCH HANDLING TECHNIQUES :
TERMS used in branching :
Branch Taken : The action of fetching a non sequential or remote instruction
after a branch instruction is called Branch taken.
Branch Target : The instruction to be executed after a branch taken is called
Branch target
Delay slot : The number of pipeline cycles wasted between a branch taken and its
branch target is called the delay slot, denoted by b, 0 ≤ b ≤ k-1 (k = number of pipeline
stages).
EFFECTS OF BRANCHING :
When a branch is taken all instructions following the branch in the pipeline
becomes useless and will be drained from the pipeline. Thus branch taken causes
a pipeline to be flushed losing a number of pipeline stages.

BRANCH HANDLING TECHNIQUES include BRANCH PREDICTION


( Static and Dynamic )
Static Branch Prediction Strategy : Branches are predicted based on branch
types statically or based on branch history during program execution. The
frequencies and probabilities of branches taken and branch types across a large number
of program traces are used to predict a branch. The static prediction direction (taken
or not taken) can even be wired into the processor.The wired-in static prediction
cannot be changed once committed to the hardware.
Dynamic Branch Prediction Strategy (works better than the static strategy):
Uses recent branch history to predict whether or not the branch will be taken
next time when it occurs. To be accurate we may need to use entire history of
branch to predict future choice – but not practical. Thus we use limited recent
history.Requires additional hardware to keep track of the past behavior of the
branch instructions at run time.
Branch Target Buffers are used to implement branch prediction – BTB holds
recent branch information including address of branch target used. The address of
branch instr locates its entry in BTB.
Another way to deal with branching is using delayed branch which is more
effective in short instruction pipelines.
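As a hedged illustration of dynamic prediction from limited recent history, the sketch below keeps a 2-bit saturating counter per branch, one common way a BTB-style entry can record past behavior; the table size, indexing by PC, and outcome pattern are illustrative assumptions, not details from the text:
```c
#include <stdio.h>

#define ENTRIES 16

/* One 2-bit saturating counter per table entry: values 0-1 predict
   "not taken", values 2-3 predict "taken". */
static unsigned char counter[ENTRIES];           /* all start at 0 */

int predict(unsigned pc)             { return counter[pc % ENTRIES] >= 2; }

void update(unsigned pc, int taken) {
    unsigned char *c = &counter[pc % ENTRIES];
    if (taken  && *c < 3) (*c)++;                /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;                /* saturate at 0 */
}

int main(void) {
    /* A loop branch that is taken three times and then falls through. */
    int outcomes[] = {1, 1, 1, 0, 1, 1, 1, 0};
    int hits = 0, n = 8;

    for (int i = 0; i < n; i++) {
        int p = predict(0x40);                   /* predict before resolving */
        hits += (p == outcomes[i]);
        update(0x40, outcomes[i]);               /* record the real outcome  */
    }
    printf("correct predictions: %d of %d\n", hits, n);
    return 0;
}
```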
e) Discuss the impact of Clocking and Timing Control on the speedup,
efficiency, and throughput of a pipelined processor? 5M

Ans: Clocking and Timing Control:

Clock Cycle:The clock cycle of a pipeline is the time it takes for data to move
from one stage to the next. It's determined by the maximum stage delay (Tmax)
and the latch delay (d).The clock cycle is essentially the time it takes for a clock
pulse to rise and fall. Clock Skew:Clock skew refers to the difference in arrival
times of the clock signal at different stages of the pipeline.This can lead to timing
issues, as data may not be stable at the receiving stage when the clock signal
arrives. Clock Cycle Time (T): T = max{Ti} + d = Tm + d

 where: Ti: Delay of stage i


 d: Latch delay
 Clock Skew (S): Maximum difference in arrival times of the clock
signal at different pipeline stages.

Speedup ratio : The speedup ratio is the ratio of the time taken by a non-pipelined
process to the time taken when pipelining is used. Thus the speedup for n processes
in non-pipelined versus pipelined configurations is S = n*tn / [(k+n-1)*tp], where
the time taken for one process is tn, so the time taken to complete n processes in the
non-pipelined configuration is n*tn, and the time taken for n processes on a k-segment
pipeline is k*tp + (n-1)*tp = (k+n-1)*tp.

In practice, limitations such as unequal stage delays, latch overhead and clock skew
prevent a pipeline from operating at its maximum theoretical speedup of k.

Ideal Speedup (S_ideal): S_ideal = k

Efficiency : The efficiency of linear pipeline is measured by the percentage of


time when processors are busy over the total time taken, i.e., the sum of busy time plus
idle time. Thus if n is number of task , k is stage of pipeline and t is clock period
then efficiency is given by η = n/ [k + n -1]. Thus efficiency η of the pipeline is
the speedup divided by the number of stages, η = Sk/k

Efficiency (E): E = S / S_ideal

Throughput: The number of tasks completed by a pipeline per unit time is called the
throughput; this represents the computing power of the pipeline. We define throughput as
W= n/[k*t + (n-1) *t] = η/t.

The concepts of clock cycle, clock skew, speedup, efficiency, and throughput are
interrelated and form the foundation of pipeline performance analysis.Mitigating
clock skew ensures reliability and consistency in pipelined operations.However,
real-world limitations like varying stage delays and clock cycle constraints
prevent achieving the ideal speedup.Higher throughput indicates better utilization
of the pipeline stages and faster processing rates for tasks.
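A small numerical sketch of these relationships is given below; the stage count, task count, and clock period are arbitrary example values, not figures taken from the text:
```c
#include <stdio.h>

/* Numerical illustration of the speedup, efficiency and throughput
   formulas discussed above. */
int main(void) {
    int    k = 4;            /* number of pipeline stages        */
    int    n = 100;          /* number of tasks                  */
    double t = 20e-9;        /* clock period in seconds          */

    double t_pipe    = (k + n - 1) * t;    /* pipelined time                */
    double t_nonpipe = (double)n * k * t;  /* non-pipelined time (tn = k*t) */

    double S = t_nonpipe / t_pipe;         /* speedup = n*k / (k + n - 1)    */
    double E = S / k;                      /* efficiency = S / S_ideal       */
    double W = n / t_pipe;                 /* throughput in tasks per second */

    printf("Speedup    S = %.2f (ideal %d)\n", S, k);
    printf("Efficiency E = %.2f\n", E);
    printf("Throughput W = %.3g tasks/s\n", W);
    return 0;
}
```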
Evaluate the significance of Dynamic Instruction Scheduling in modern
processors and its impact on overall pipeline performance ? 10M
Ans:

Data dependencies in a sequence of instructions create interlocked relationships.


Interlocking is resolved either through a compiler-based static scheduling approach or by
using dynamic instruction scheduling, which requires additional hardware units.
Dynamic Instruction scheduling has 2 techniques
1.Tomasulo’s Algorithm
2. CDC score-boarding
Tomasulo’s algorithm scheme: Named after the chief designer. This hardware
dependence –resolution scheme was first implemented with multiple floating
point units of the IBM 360/91 processor. Functional units are internally pipelined
and can complete one operation in every clock cycle, provided the reservation
station of the unit is ready with the required input operand values. If source
register is busy when an instruction reaches the issue stage, the tag for the source
register is forwarded to the RS. When the register becomes available, the tag signals
availability, and the value is copied into all reservation stations that have the matching tag.
Thus operand forwarding is achieved here with the use of tags. All
destinations which require a data value receive it in the same clock cycle over the
common data bus, by matching stored operand tags with source tag sent over the
bus.

The fig above shows a functional unit connected to common data bus with
three reservation stations provided on it.
Op – the operation to be carried out; Opnd-1 and Opnd-2 – the two operand values
needed for the operation; t1 and t2 – the two source tags associated with the operands.
When the needed operand values are available in reservation station, the
functional unit can initiate the required operation in the next clock cycle. At time
of instruction issue the reservation station is filled out with the operation
code(op).If an operand value is available in programmable register it is
transferred to the corresponding source operand field in the reservation station. It
waits until its data dependencies are resolved and operands become available.
Dependence is resolved by monitoring the result bus, and when all operands of an
instruction are available it is dispatched to a functional unit for execution. If the operand
value is not available at the time of issue, the corresponding source tag(t1
and/or t2) is copied into the reservation station. The source tag identifies the
source of the required operand. As soon as the required operand is available at its
source- typically output of functional unit – the data value is forwarded over the
common data bus along with source tag
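A minimal data-structure sketch of a reservation station and its tag-matching broadcast is shown below; the field names, sizes, and values are illustrative assumptions and not taken from the IBM 360/91 design:
```c
#include <stdio.h>

/* Sketch of the reservation-station fields described above: an operation
   code, two operand slots, and two source tags filled in when an operand
   is not yet available. */
struct reservation_station {
    int op;         /* operation to perform                       */
    int busy;       /* station currently holds an instruction     */
    int opnd[2];    /* operand values, valid when tag[i] == 0     */
    int tag[2];     /* source tag of a pending operand, 0 = none  */
};

/* Broadcast a (tag, value) pair on the common data bus: every waiting
   station whose source tag matches captures the value. */
void broadcast(struct reservation_station *rs, int n, int tag, int value) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < 2; j++)
            if (rs[i].busy && rs[i].tag[j] == tag) {
                rs[i].opnd[j] = value;
                rs[i].tag[j]  = 0;       /* operand is now available */
            }
}

int main(void) {
    struct reservation_station rs[3] = {{0}};

    /* An instruction issued with one operand present and one pending (tag 7). */
    rs[0] = (struct reservation_station){ .op = 1, .busy = 1,
                                          .opnd = {5, 0}, .tag = {0, 7} };

    broadcast(rs, 3, 7, 42);             /* the tagged result appears on the bus */
    printf("opnd2 = %d (tag %d)\n", rs[0].opnd[1], rs[0].tag[1]);
    return 0;
}
```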

CDC SCOREBOARDING

Figure above shows CDC6600 like processor that uses dynamic instruction
scheduling hardware. Here multiple functional units appear as multiple
execution-unit pipelines. The parallel units allow instructions to complete out of
original program order. The processor had instruction buffers for each execution unit.
Instructions are issued to available functional units regardless of whether register input
data are available. To control the correct routing of data between execution units and
registers, the CDC 6600 used a centralized control unit known as the scoreboard. The
scoreboard kept track of the registers needed by instructions waiting for the various
functional units. When all registers have valid data, the scoreboard enables instruction
execution. When a functional unit finishes, it signals the scoreboard to release the
resources. The scoreboard is thus centralized control logic that keeps track of the status
of the registers and the multiple functional units.
Explain dynamic Instruction Scheduling And branch handling techniques?
10M
Ans:
Data dependencies in a sequence of instructions create interlocked relationships.
Interlocking is resolved either through a compiler-based static scheduling approach or by
using dynamic instruction scheduling, which requires additional hardware units.
Dynamic scheduling has 2 techniques
1.Tomasulo’s Algorithm
2. CDC scoreboarding
Tomasulo’s algorithm scheme :Named after the chief designer. This hardware
dependence –resolution scheme was first implemented with multiple floating
point units of the IBM 360/91 processor.
Functional units are internally pipelined and can complete one operation in every
clock cycle, provided the reservation station(Structure of RS shown below) of the
unit is ready with the required input operand values. If a source register is busy
when an instruction reaches the issue stage, the tag for the source register is forwarded
to the RS. When the register becomes available, the tag signals availability, and the value
is copied into all reservation stations that have the matching tag. Thus operand
forwarding is achieved here with the use of tags. All destinations which require
a data value receive it in the same clock cycle over the common data bus, by
matching stored operand tags with source tag sent over the bus.
The fig above shows a functional unit connected to common data bus with
three reservation stations provided on it.
Op – the operation to be carried out; Opnd-1 and Opnd-2 – the two operand values
needed for the operation; t1 and t2 – the two source tags associated with the operands.
When the needed operand values are available in reservation station, the
functional unit can initiate the required operation in the next clock cycle. At time
of instruction issue the reservation station is filled out with the operation
code(op).If an operand value is available in programmable register it is
transferred to the corresponding source operand field in the reservation station. It
waits until its data dependencies are resolved and operands become available.
Dependence is resolved by monitoring the result bus, and when all operands of an
instruction are available it is dispatched to a functional unit for execution. If the operand
value is not available at the time of issue, the corresponding source tag(t1
and/or t2) is copied into the reservation station. The source tag identifies the
source of the required operand. As soon as the required operand is available at its
source- typically output of functional unit – the data value is forwarded over the
common data bus along with source tag
(OR)
CDC SCOREBOARDING

Figure above shows CDC6600 like processor that uses dynamic instruction
scheduling hardware. Here multiple functional units appear as multiple
execution-unit pipelines. The parallel units allow instructions to complete out of
original program order. The processor had instruction buffers for each execution unit.
Instructions are issued to available functional units regardless of whether register input
data are available. To control the correct routing of data between execution units and
registers, the CDC 6600 used a centralized control unit known as the scoreboard. The
scoreboard kept track of the registers needed by instructions waiting for the various
functional units. When all registers have valid data, the scoreboard enables instruction
execution. When a functional unit finishes, it signals the scoreboard to release the
resources. The scoreboard is thus centralized control logic that keeps track of the status
of the registers and the multiple functional units.
BRANCH HANDLING TECHNIQUES :
TERMS used in branching :
Branch Taken : The action of fetching a non sequential or remote instruction
after a branch instruction is called Branch taken.
Branch Target : The instruction to be executed after a branch taken is called
Branch target
Delay slot : The number of pipeline cycles wasted between a branch taken and its
branch target is called the delay slot, denoted by b, 0 ≤ b ≤ k-1 (k = number of pipeline
stages).
EFFECTS OF BRANCHING :
When a branch is taken all instructions following the branch in the pipeline
becomes useless and will be drained from the pipeline. Thus branch taken causes
a pipeline to be flushed losing a number of pipeline stages.

BRANCH HANDLING TECHNIQUES include BRANCH PREDICTION


( Static and Dynamic )
Static Branch Prediction Strategy
Branches are predicted based on branch types statically or based on branch
history during program execution. The frequencies and probabilities of branches
taken and branch types across a large number of program traces are used to predict a
branch. The static prediction direction (taken or not taken) can even be wired into
the processor. The wired-in static prediction cannot be changed once committed
to the hardware.
Dynamic Branch Prediction Strategy (works better than the static strategy):
Uses recent branch history to predict whether or not the branch will be taken
next time when it occurs. To be accurate we may need to use entire history of
branch to predict future choice – but not practical. Thus we use limited recent
history.Requires additional hardware to keep track of the past behavior of the
branch instructions at run time.
Branch Target Buffers are used to implement branch prediction – BTB holds
recent branch information including address of branch target used. The address of
branch instr locates its entry in BTB.
Another way to deal with branching is using delayed branch which is more
effective in short instruction pipelines.
c) Explain how Pipeline Schedule Optimization can be used to maximize
Speedup and Efficiency, with examples from real-world processors ? 10M

Ans: Pipeline Schedule Optimization technique based on the Minimal Average


Latency (MAL) concept inserts non-compute delay stages into a pipeline to
modify the reservation table, resulting in a new collision vector and an improved
state diagram. This aims to achieve an optimal latency cycle, which is the shortest
possible.

Bounds on the MAL: Shar (1972) determined the following bounds on the MAL
for a statically reconfigured pipeline:
Lower Bound: The MAL is lower-bounded by the maximum number of checkmarks
in any row of the reservation table.
Upper Bound: The MAL is upper-bounded by the number of 1's in the initial collision
vector plus 1.

Example:

Reservation table for function X (one placement of the marks consistent with the
latencies discussed below):

Stage/Time   1   2   3   4   5   6   7   8
S1           X                   X       X
S2               X       X
S3                   X       X       X

Reservation table for function Y:

Stage/Time   1   2   3   4   5   6
S1           Y               Y
S2               Y
S3               Y       Y       Y

Latency: The number of time units (clock cycles) between two initiations of a
pipeline is the latency between them. Latency values must be non-negative
integers.

Collision: When two or more initiations attempt to use the same pipeline stage at the
same time, a collision occurs. A collision implies a resource conflict between
two initiations in the pipeline, so it should be avoided.

Forbidden Latency: Latencies that cause collisions are called forbidden


latencies. (E.g. in above reservation table 2, 4, 5 and 7 are forbidden latencies).

Permissible Latency: Latencies that do not cause any collision are called
permissible latencies. (E.g. in above reservation table 1, 3 and 6 are permissible
latencies).

Latency Sequence and Latency Cycle: A Latency Sequence is a sequence of


permissible non-forbidden latencies between successive task initiations.

A Latency cycle is a latency sequence which repeats the same subsequence


(cycle) indefinitely.The Average Latency of a latency cycle is obtained by
dividing the sum of all latencies by the number of latencies along the cycle.

The latency cycle (1, 8) has an average latency of (1+8)/2=4.5.

A Constant Cycle is a latency cycle which contains only one latency value. (E.g.
Cycles (3) and (6) both are constant cycle).

Collision Vector: The combined set of permissible and forbidden latencies can
be easily displayed by a collision vector, which is an m-bit (m<=n-1 in a n
column reservation table) binary vector C=(CmCm-1….C2C1). The value of
Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible. (E.g. Cx =
(1011010)).

State Diagram: Specifies the permissible state transitions among successive


initiations based on the collision vector.

Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in
which each state appears only once. In above state diagram only (3), (6), (8), (1,
8), (3, 8), and (6, 8) are simple cycles. The cycle(1, 8, 6, 8) is not simple as it
travels twice through state (1011010).A Greedy Cycle is a simple cycle whose
edges are all made with minimum latencies from their respective starting states.
The cycle (1, 8) and (3) are greedy cycles.MAL (Minimum Average Latency) is
the minimum average latency obtained from the greedy cycle. In greedy cycles
(1, 8) and (3), the cycle (3) leads to MAL value 3.

For functions X and Y, the MAL is 3, and both have met the lower bound of 3
from their respective reservation tables. However, the upper bound on the MAL
for function X is 4+1=5, a rather loose bound. On the other hand, the upper bound
for function Y is 2+1=3, a tighter bound. Therefore, all greedy cycles for function
Y lead to the optimal latency value of 3, which cannot be further reduced.

Optimization Technique

To optimize the MAL, one needs to find the lower bound by modifying the
reservation table. The approach is to reduce the maximum number of checkmarks
in any row while preserving the original function being evaluated. Patel and
Davidson (1976) proposed using non-compute delay stages to increase pipeline
performance and achieve a shorter MAL.

Pipeline Throughput:This concept refers to the average number of tasks


initiated per clock cycle. The pipeline throughput is primarily determined by the
inverse of the Minimal Average Latency (MAL) adapted.

MAL and throughput: A shorter MAL leads to a higher throughput. Unless the
MAL is reduced to 1, the pipeline throughput becomes a fraction.

Pipeline Efficiency:Another important measure is pipeline efficiency. It


represents the percentage of time each pipeline stage is utilized . The accumulated
rate of all stage utilizations determines the overall pipeline efficiency.
