ADVANCED COMPUTER ARCHITECTURE
Authors
Ms. G. Anjana Harshitha Reddy,
Assistant Professor, ECE
St. Peter’s Engineering College, Hyderabad
[email protected]
Ms. Kavya Chalamalashetty,
Assistant Professor, ECE
Sri Vasavi Engineering College, Thadepalligudam
[email protected]
Ms. M. Hamsalekha,
Assistant Professor, ECE
Sri Vasavi Engineering College, Thadepalligudam
[email protected]
UNIT-1
a) Define pipelining in computer architecture? 2M
Ans: Pipelining is a technique of decomposing a sequential process into sub-operations, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments.
b) What are the stages of an instruction pipeline? 2M
Ans: Computers with complex instructions require several phases to process an instruction completely, as shown below:
Fetch the instruction from memory.
Decode the instruction.
Calculate the effective address.
Fetch the operands from memory.
Execute the instruction.
Store the result in the proper place.
c) What is vector processing? 2M
Ans: Vector processing is a computational approach where a single instruction operates simultaneously on multiple data points, known as a vector. Vector processing is commonly used in applications such as scientific computing, engineering simulations, and image processing.
d) What is RISC pipelining? 2M
Ans: RISC stands for "Reduced Instruction Set Computer." RISC pipelining is a technique used to enhance the performance of RISC processors: instructions have a uniform length, and only load and store instructions access memory, while all other instructions operate on registers.
e) Define parallel processing? 2M
Ans: It is a technique of processing data simultaneously (concurrently) to perform computational tasks. Parallel processing increases the computational speed of a computer system and reduces the execution time.
a) List the types of pipeline hazards and solutions? 3M
Ans: The pipeline hazards are:
1) Resource conflicts
2) Data dependency
3) Branch difficulties
Solutions for pipeline hazards are hardware interlocks, operand forwarding, delayed load, and delayed branch; for handling branch instructions specifically: prefetching the target instruction, the branch target buffer, the loop buffer, and branch prediction.
b) What are the applications of parallel processing? 3M
Ans: 1) Weather forecasting
2) Computational aerodynamics
3) Remote sensing applications
4) Weapon research and defense
c) List different pipelining techniques? 3M
Ans:
1. Arithmetic pipeline: a pipeline used to perform arithmetic operations.
2. Instruction pipeline: a technique for processing instructions in parallel by breaking them down into different steps.
3. RISC pipeline: a technique for processing instructions in parallel by breaking them down into different steps of uniform length.
d) What are the problems in pipelining? 3M
Ans: Pipelining problems, also known as pipeline hazards, occur when a pipeline stalls for any
reason. Some of the pipelining hazards are data dependency, memory delay, branch delay, and
resource conflict.
e) What is Flynn's classification based on? Give the names of the Flynn classifications? 3M
Ans: Flynn's classification is based on the types of instruction and data streams:
1) Single Instruction Stream, Single Data Stream (SISD)
2) Single Instruction Stream, Multiple Data Stream (SIMD)
3) Multiple Instruction Stream, Single Data Stream (MISD)
4) Multiple Instruction Stream, Multiple Data Stream (MIMD)
a) What is memory interleaving? Explain with a neat diagram? 5M
Ans: What? Memory interleaving involves dividing the memory into multiple modules or banks and distributing the memory addresses across these banks in a way that allows simultaneous access to multiple memory locations.
Why? Memory interleaving is a technique used in computer architecture to improve the performance of memory access. It helps in reducing the wait time for memory access and increases the overall throughput of the system.
An instruction pipeline may require the fetching of an instruction and an operand at the same
time from two different segments. An arithmetic pipeline usually requires two or more
operands to enter the pipeline at the same time. Instead of using two memory buses for
simultaneous access, the memory can be partitioned into a number of modules connected to a
common memory address and data buses.
The advantage of a modular memory is that it allows the use of a technique called interleaving. In
an interleaved memory, different sets of addresses are assigned to different memory modules. By
staggering the memory access, the effective memory cycle time can be reduced by a factor close to
the number of modules.
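As a small illustration of the idea (not from the text), here is a minimal Python sketch of low-order interleaving, where an address a in an m-module memory maps to module a mod m; the module count and helper name are assumptions made for this example:

```python
# A minimal sketch of low-order memory interleaving (module count M and the
# helper name are assumptions made for this example, not from the text).

M = 4  # number of memory modules

def interleaved_map(address: int, m: int = M) -> tuple[int, int]:
    """Return (module, word-within-module) for a low-order interleaved memory."""
    return address % m, address // m

# Consecutive addresses land in distinct modules, so an instruction and an
# operand (or several vector operands) can be fetched in the same cycle.
for a in range(8):
    module, word = interleaved_map(a)
    print(f"address {a} -> module {module}, word {word}")
```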
b) What is pipelining? Explain Ai*Bi+Ci where i goes from 1 to 4 by using pipelining? 5M
Ans: Pipelining is a technique of decomposing a sequential process into sub-operations, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments.
The pipeline organization will be demonstrated by means of a simple example: performing the combined multiply-and-add operation Ai * Bi + Ci on a stream of numbers, for i = 1, 2, 3, …, 7. Each sub-operation is implemented in a segment within the pipeline.
R1 ← Ai, R2 ← Bi        (Input Ai and Bi)
R3 ← R1 * R2, R4 ← Ci   (Multiply and input Ci)
R5 ← R3 + R4            (Add Ci to product)
The five registers are loaded with new data every clock pulse. The effect of each clock is shown in
Table below. The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse
transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and
B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously. It places
A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and
places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the
first output from R5. From there on, each clock produces a new output and moves the data one step
down the pipeline. This happens as long as new input data flow into the system. When no more
input data are available, the clock must continue until the last output emerges out of the pipeline.
The main characteristic of pipelining is that several computations can be in progress in distinct segments at the same time. The registers provide isolation between segments, so each segment can work on distinct data simultaneously.
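The register transfers above can be simulated clock by clock. The following Python sketch (the sample data values are assumed; register names follow the text) latches all segments on each clock pulse and shows the first result emerging after three pulses:

```python
# A clock-by-clock simulation of the three-segment pipeline computing
# Ai*Bi + Ci. Register names follow the text (R1..R5); the sample data values
# are assumed. All register transfers are latched together on each clock pulse.

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
C = [9, 10, 11, 12]

R1 = R2 = R3 = R4 = None
results = []

for clock in range(len(A) + 2):                 # n inputs + 2 cycles to drain
    # Compute each segment's output from the registers' current contents.
    new_R5 = R3 + R4 if R3 is not None else None             # segment 3: adder
    new_R3 = R1 * R2 if R1 is not None else None             # segment 2: multiplier
    new_R4 = C[clock - 1] if 1 <= clock <= len(C) else None  # segment 2: input Ci
    new_R1 = A[clock] if clock < len(A) else None            # segment 1: input Ai
    new_R2 = B[clock] if clock < len(B) else None            # segment 1: input Bi
    R1, R2, R3, R4 = new_R1, new_R2, new_R3, new_R4          # one clock pulse
    if new_R5 is not None:
        results.append(new_R5)                  # first output after 3 pulses

print(results)    # [14, 22, 32, 44] = [1*5+9, 2*6+10, 3*7+11, 4*8+12]
```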
c) What is array processing? Explain different types of array processors? 5M
Ans:
WHAT? An array processor is a processor that performs computations on large arrays of data. Array processors are used to perform highly computational tasks and to execute operations over multiple data sets.
Array processing is used in different types of processors.
Attached array processor: An attached array processor is a parallel processor with multiple functional units. The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of a supercomputer. Fig. D shows the interconnection of an attached array processor to a host computer. It is an auxiliary processor, intended to improve the performance of the host computer in specific numerical computation tasks.
Fig D: Attached array processor with host computer
SIMD array processor: An SIMD array processor is a computer with multiple processing units
operating in parallel. A general block diagram of an array processor is shown in Fig. E.
It contains a set of identical processing elements (PEs), each having a local memory M. Each
PE includes an ALU, a floating-point arithmetic unit, and working registers. Vector instructions
are broadcast to all PEs simultaneously. Masking schemes are used to control the status of each
PE during the execution of vector instructions. Each PE has a flag that is set when the PE is
active and reset when the PE is inactive.
d) What is vector processing ? Explain how Matrix multiplication is done using vector
processing?5M
Matrix Multiplication
The multiplication of two n x n matrices consists of n^2 inner products or n^3 multiply-add operations.
o Consider, for example, the multiplication of two 3 x 3 matrices A and B:
c11 = a11*b11 + a12*b21 + a13*b31
o This requires three multiplications and (after initializing c11 to 0) three additions. In general, an inner product consists of the sum of k product terms of the form
C = A1B1 + A2B2 + A3B3 + … + AkBk.
o In a typical application k may be equal to 100 or even 1000. The inner product calculation on a pipeline vector processor is shown in Fig. B. With a four-segment pipeline adder, the products are accumulated as four interleaved partial sums:
C = A1B1 + A5B5 + A9B9 + A13B13
  + A2B2 + A6B6 + A10B10 + A14B14
  + A3B3 + A7B7 + A11B11 + A15B15
  + A4B4 + A8B8 + A12B12 + A16B16
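The grouping above can be mimicked in software. The following Python sketch (with assumed sample values for A1..A16 and B1..B16) folds the products into four interleaved partial sums, one per adder segment in flight, and combines them at the end:

```python
# The products are folded into four interleaved partial sums, one per slot of
# the four-segment pipeline adder, and combined at the end (data values assumed).

A = list(range(1, 17))          # A1..A16
B = list(range(16, 0, -1))      # B1..B16

partial = [0, 0, 0, 0]          # one accumulator per pipeline slot in flight
for i, (a, b) in enumerate(zip(A, B)):
    partial[i % 4] += a * b     # A1B1+A5B5+A9B9+A13B13, A2B2+A6B6+..., etc.

C = sum(partial)                # combine the four partial sums
assert C == sum(a * b for a, b in zip(A, B))
print(partial, C)
```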
An instruction in the sequence may cause a branch out of the normal sequence. In that case the pending operations in the last two segments are completed and all information stored in the instruction buffer is deleted. Similarly, an interrupt request will cause the pipeline to empty and start again from a new address value.
The time in the horizontal axis is divided into steps of equal duration. The four segments are
represented in the diagram with an abbreviated symbol.
1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.
It is assumed that the processor has separate instruction and data memories, so that the operations in FI and FO can proceed at the same time. In the absence of a branch instruction, each segment operates on a different instruction. Thus, in step 4, instruction 1 is being executed in segment EX; the operand for instruction 2 is being fetched in segment FO; instruction 3 is being decoded in segment DA; and instruction 4 is being fetched from memory in segment FI. Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded in segment DA in step 4, the transfer from FI to DA of the other instructions is halted until the branch instruction is executed in step 6. If
the branch is taken, a new instruction is fetched in step 7. If the branch is not taken, the instruction
fetched previously in step 4 can be used. The pipeline then continues until a new branch instruction
is encountered. Another delay may occur in the pipeline if the EX segment needs to store the result
of the operation in the data memory while the FO segment needs to fetch an operand. In that case,
segment FO must wait until segment EX has finished its operation.
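To visualize the overlap described above, here is a minimal Python sketch (our own, ignoring branches and conflicts) that prints the space-time diagram of the four-segment pipeline:

```python
# A space-time diagram for the four-segment instruction pipeline (FI, DA, FO,
# EX), ignoring branches and conflicts.

SEGMENTS = ["FI", "DA", "FO", "EX"]

def timing_diagram(n_instructions: int) -> None:
    for i in range(1, n_instructions + 1):
        # Instruction i enters FI at step i and advances one segment per step.
        row = {i + s: SEGMENTS[s] for s in range(4)}
        steps = [row.get(t, "  ") for t in range(1, n_instructions + 4)]
        print(f"I{i}: " + " ".join(steps))

timing_diagram(4)
# In step 4, I1 is in EX, I2 in FO, I3 in DA, and I4 in FI, as in the text.
```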
c) What is RISC pipelining? Explain delayed load and delayed branch with a neat diagram? 10M
Ans: RISC stands for "Reduced Instruction Set Computer." RISC pipelining is a technique used to enhance the performance of RISC processors: instructions have a uniform length, and only load and store instructions access memory, while all other instructions operate on registers.
Delayed Load: Consider the operation of the following four instructions:
o LOAD: R1 ← M[address 1]
o LOAD: R2 ← M[address 2]
o ADD: R3 ← R1 + R2
o STORE: M[address 3] ← R3
There will be a data conflict in instruction 3 because the operand in R2 is not yet available when the A segment needs it. This can be seen from the timing of the pipeline shown in Fig. A below.
o The E segment in clock cycle 4 is in a process of placing the memory data into R2.
o The A segment in clock cycle 4 is using the data from R2.
It is up to the compiler to make sure that the instruction following the load instruction uses the data
fetched from memory. This concept of delaying the use of the data loaded from memory is referred
to as delayed load.
Fig. A
Fig. (b) shows the same program with a no-op instruction inserted after the load to R2
instruction. Thus the no-op instruction is used to advance one clock cycle in order to compensate
for the data conflict in the pipeline. The advantage of the delayed load approach is that the data
dependency is taken care of by the compiler rather than the hardware.
Fig. B
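A compiler pass of the kind described can be sketched as follows. This toy Python version (the tuple-based instruction format is a convention made up for the sketch) inserts a no-op whenever an instruction uses a register loaded by the immediately preceding LOAD:

```python
# A toy compiler pass for delayed load. The tuple-based instruction format
# (opcode, destination, sources) is a convention made up for this sketch. A
# NOP is inserted whenever an instruction uses the register loaded by the
# immediately preceding LOAD, as in Fig. B.

NOP = ("NOP", None, ())

def insert_delayed_load_nops(program):
    out = []
    for instr in program:
        prev = out[-1] if out else None
        if prev and prev[0] == "LOAD" and prev[1] in instr[2]:
            out.append(NOP)          # one bubble fills the load delay slot
        out.append(instr)
    return out

program = [
    ("LOAD",  "R1", ("M1",)),        # R1 <- M[address 1]
    ("LOAD",  "R2", ("M2",)),        # R2 <- M[address 2]
    ("ADD",   "R3", ("R1", "R2")),   # R3 <- R1 + R2 (needs R2 one cycle early)
    ("STORE", "M3", ("R3",)),        # M[address 3] <- R3
]
for instr in insert_delayed_load_nops(program):
    print(instr)                     # a NOP appears between the second LOAD and ADD
```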
Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine the branches so that
they take effect at the proper time in the pipeline. This method is referred to as delayed branch.
The compiler is designed to analyze the instructions before and after the branch and rearrange the
program sequence by inserting useful instructions in the delay steps.
It is up to the compiler to find useful instructions to put after the branch instruction. Failing that, the
compiler can insert no-op instructions.
An Example of Delayed Branch: The program for this example consists of five instructions:
o Load from memory to R1
o Increment R2
o Add R3 to R4
o Subtract R5 from R6
o Branch to address X
The branch address X is transferred to PC in clock cycle 7. In Fig. (c) the compiler inserts two no-op instructions after the branch.
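The effect of the rearrangement can be sketched as follows, assuming two delay slots after the branch (the mnemonics are illustrative, not from the text): padding with no-ops costs two extra cycles, while moving the two independent ALU instructions into the delay slots does useful work in them:

```python
# Delayed branch with two delay slots (illustrative mnemonics).

padded = [
    "LOAD  R1 <- M",       # load from memory to R1
    "INC   R2",            # increment R2
    "ADD   R4 <- R3+R4",   # add R3 to R4
    "SUB   R6 <- R6-R5",   # subtract R5 from R6
    "BR    X",             # branch to address X
    "NOP",                 # delay slot 1
    "NOP",                 # delay slot 2
]

rearranged = [
    "LOAD  R1 <- M",
    "INC   R2",
    "BR    X",             # branch issued earlier
    "ADD   R4 <- R3+R4",   # delay slot 1 now does useful work
    "SUB   R6 <- R6-R5",   # delay slot 2 now does useful work
]

print(len(padded), "vs", len(rearranged), "issue slots to reach the target")
```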
d) Discuss the various types of pipeline hazards and the techniques used to overcome them.
Provide examples to illustrate your points. 10M
Ans: In general, there are three major difficulties that cause the instruction pipeline to deviate from
its normal operation.
Resource conflicts, caused by access to memory by two segments at the same time. These can be resolved by using separate instruction and data memories and separate memory buses.
Data dependency conflicts arise when an instruction depends on the result of
a previous instruction, but this result is not yet available.
Branch difficulties arise from branch and other instructions that change the
value of PC.
Data dependency: A difficulty that may cause a degradation of performance in an instruction pipeline is a possible collision of data or addresses. A data dependency occurs when an instruction needs data that are not yet available. An address dependency may occur when an operand address cannot be calculated because the information needed by the addressing mode is not available.
Pipelined computers deal with such data-dependency conflicts in a variety of ways.
Hardware interlocks: an interlock is a circuit that detects instructions whose source operands
are destinations of instructions farther up in the pipeline.This approach maintains the program
sequence by using hardware to insert the required delays.
Operand forwarding: uses special hardware to detect a conflict and then avoid it by routing the
data through special paths between pipeline segments. This method requires additional
hardware paths through multiplexers as well as the circuit that detects the conflict.
Delayed load: The compiler for such computers is designed to detect a data conflict and reorder
the instructions as necessary to delay the loading of the conflicting data by inserting no-
operation instructions.
Prefetch target instruction: Prefetch the target instruction in addition to the instruction following the branch, and save both until the branch is executed.
Branch target buffer (BTB): The BTB is an associative memory included in the fetch segment
of the pipeline. Each entry in the BTB consists of the address of a previously executed branch
instruction and the target instruction for that branch. It also stores the next few instructions after
the branch target instruction.
Loop buffer: This is a small very high speed register file maintained by the instruction fetch
segment of the pipeline.
Branch prediction: A pipeline with branch prediction uses some additional logic to guess the
outcome of a conditional branch instruction before it is executed.
Delayed branch: In this procedure, the compiler detects the branch instructions and rearranges the machine-language code sequence by inserting useful instructions (or, failing that, no-op instructions) that keep the pipeline operating without interruptions. This procedure is employed in most RISC processors.
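As one concrete and commonly used form of branch prediction, here is a minimal 2-bit saturating-counter predictor in Python; the text does not prescribe this particular scheme, so treat it as an illustrative sketch:

```python
# A minimal 2-bit saturating-counter branch predictor (a common scheme used
# for illustration; the text above does not mandate this particular design).
# States 0..3; predict "taken" when the counter is 2 or 3.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2                      # start in "weakly taken"

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 6 + [False] + [True] * 3   # a loop branch: mostly taken
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(f"{hits}/{len(outcomes)} predictions correct")   # 9/10, one loop-exit miss
```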
UNIT-2
a) What are the different arithmetic operations ? 2M
Ans: Arithmetic operations are fundamental to digital computers. They involve
manipulating data to produce results required for solving computational problems.
The four basic arithmetic operations are:
Addition
Subtraction
Multiplication
Division
b) How are signed numbers represented using binary numbers? 2M
Ans:Signed numbers are integers with a positive or negative sign. Since computers
understand only binary, it's necessary to represent these signed integers in binary
form. There are three common methods for this:
1. Sign Bit: A bit is designated to indicate the sign, typically 0 for positive and 1
for negative.
2. 1's Complement: The bits of the positive number are inverted (0 becomes 1,
1 becomes 0).
3. 2's Complement: 1 is added to the 1's complement. This is the most common method used in computers. The specific representation method used depends on the computer architecture and the application's requirements.
[Figure: hardware for signed-magnitude addition and subtraction — BR register, complementer, parallel adder, overflow flip-flop V, and AC register]
  10100 (20)
+ 01111 (15)
  -------
 100011 (35)
Next, let's perform the binary subtraction. To subtract in binary, we use 2's complement subtraction: add the minuend (20) and the 2's complement of the subtrahend (15):
  10100 (20)
+ 10001 (2's complement of 15)
  -------
 100101 (discard the end carry, leaving 00101 = 5)
Therefore:
20 + 15 = 35 (binary: 100011)
20 - 15 = 5 (binary: 00101)
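These results can be checked quickly in Python; the 5-bit width and the helper name below are choices made for this example:

```python
# A quick check of the 5-bit arithmetic above (width and helper are ours).

WIDTH = 5
MASK = (1 << WIDTH) - 1            # 0b11111

def twos_complement(x: int) -> int:
    return (~x + 1) & MASK         # invert the bits, then add 1

print(bin(20 + 15))                            # 0b100011 = 35
print(bin((20 + twos_complement(15)) & MASK))  # 0b101 = 5 (end carry discarded)
```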
10 in binary is 1010
9 in binary is 1001
      1010
    x 1001
    ------
      1010
     0000
    0000
   1010
   -------
   1011010 (= 90)
Ans: Components: BR register, complementer, parallel adder, overflow indicator (V), and AC register.
Operation:
Addition: The value in the B register is loaded into the parallel adder. The value in the A register is also loaded into the parallel adder. The mode control signal (M) is set to 0 (indicating addition). The parallel adder performs the addition, and the result is stored back in the A register.
Subtract: minuend in AC, subtrahend in BR; AC ← AC + (BR)' + 1; V ← overflow; END.
Add: augend in AC, addend in BR; AC ← AC + BR; V ← overflow; END.
Ans:
Components:
1. A Register: This register likely stores the operand A, which is used as input
for the adder.
2. B Register: This register likely stores the operand B, which is also used as
input for the adder.
3. Sequence Counter (SC): This counter keeps track of the sequence of
operations being performed.
4. Q Register: This register might store the result of the operation or some
intermediate value.
5. A' Register: This register could be used to store the complemented value of A
(A').
6. Complementer: This circuit takes the value of A and produces its two's complement (A'). This is likely used for subtraction operations.
7. Parallel Adder:This adder takes the values of A and B (or A' and B) as inputs
and produces their sum.
Operation:
[Figure: hardware for signed-magnitude multiplication — complementer and parallel adder; A register (sign As); Q register (sign Qs, low-order bit Qn); E flip-flop]
a) Draw and explain the algorithm for signed magnitude addition and subtraction? 10M
Ans:
Algorithm:
The flowchart is shown in the figure below. The two signs As and Bs are compared by an exclusive-OR gate. If the output of the gate is 0, the signs are identical; if it is 1, the signs are different.
For an add operation, identical signs dictate that the magnitudes be added. The magnitudes are added with the micro-operation EA ← A + B, where EA is a register that combines E and A. The carry in E after the addition constitutes an overflow if it is equal to 1. The value of E is transferred into the add-overflow flip-flop AVF.
The two magnitudes are subtracted if the signs are different for an add operation or identical for a subtract operation. The magnitudes are subtracted by adding A to the 2's complement of B. No overflow can occur if the numbers are subtracted, so AVF is cleared to 0. A 1 in E indicates that A >= B and the number in A is the correct result. If this number is zero, the sign As must be made positive to avoid a negative zero. A 0 in E indicates that A < B. For this case it is necessary to take the 2's complement of the value in A. The operation can be done with one micro-operation, A ← A' + 1. However, we assume that the A register has circuits for the complement and increment micro-operations, so the 2's complement is obtained from these two micro-operations.
In other paths of the flowchart, the sign of the result is the same as the sign of A, so no change in As is required. However, when A < B, the sign of the result is the complement of the original sign of A. It is then necessary to complement As to obtain the correct sign. The final result is found in register A and its sign in As. The value in AVF provides an overflow indication. The final value of E is immaterial.
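A high-level sketch of this flowchart in Python follows; the register-level details (E, AVF, EA) are folded into ordinary arithmetic, and the function signature is our own convention (As, Bs are sign bits, A, B are magnitudes):

```python
# A high-level sketch of the flowchart: signs compared with XOR, magnitudes
# added or subtracted, negative zero avoided. As/Bs are sign bits (0 = plus),
# A/B are magnitudes; E and AVF are folded into ordinary Python arithmetic.

def signed_magnitude_add_sub(As, A, Bs, B, subtract: bool, bits: int = 8):
    if subtract:
        Bs ^= 1                        # subtraction: complement Bs, then add
    if As == Bs:                       # identical signs: add the magnitudes
        total = A + B
        avf = (total >> bits) & 1      # carry out of the magnitude adder
        return As, total & ((1 << bits) - 1), avf
    # Different signs: subtract magnitudes (A plus the 2's complement of B).
    if A >= B:
        mag = A - B                    # sign of the result is As
    else:
        mag, As = B - A, Bs            # A < B: result takes the other sign
    if mag == 0:
        As = 0                         # avoid negative zero
    return As, mag, 0                  # no overflow on a subtraction

print(signed_magnitude_add_sub(0, 20, 0, 15, subtract=True))   # (0, 5, 0)
print(signed_magnitude_add_sub(0, 20, 1, 15, subtract=False))  # (0, 5, 0)
```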
b) Draw and explain the algorithm for signed magnitude multiplication with an example? 10M
Ans: Multiplication Algorithm:
In the beginning, the multiplicand is in B and the multiplier in Q. Their corresponding signs are in Bs and Qs respectively. The signs Bs and Qs are compared, and both As and Qs are set to the sign of the product, since a double-length product will be stored in registers A and Q. Registers A and E are cleared and the sequence counter SC is set to the number of bits of the multiplier. Since an operand must be stored with its sign, one bit of the word will be occupied by the sign and the magnitude will consist of n-1 bits.
Now the low-order bit of the multiplier in Qn is tested. If it is 1, the multiplicand (B) is added to the present partial product (A); if it is 0, nothing is added. Register EAQ is then shifted once to the right to form the new partial product. The sequence counter is decremented by 1 and its new value checked. If it is not equal to zero, the process is repeated and a new partial product is formed. When SC = 0 the process stops.
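The loop can be sketched at the register level in Python as follows (the bit width n is assumed; signs are omitted since they are handled separately by Bs XOR Qs):

```python
# A register-level sketch of the shift-and-add multiply loop described above:
# test Qn, add B to A if it is 1, then shift E-A-Q right; repeat SC times.
# The bit width n is an assumed parameter; signs are handled separately.

def shift_add_multiply(multiplicand: int, multiplier: int, n: int = 5):
    B, Q, A, E = multiplicand, multiplier, 0, 0
    mask = (1 << n) - 1
    for _ in range(n):                    # SC counts down from n to 0
        if Q & 1:                         # Qn = 1: add the multiplicand
            A += B
            E, A = A >> n, A & mask       # carry out of the adder goes to E
        # shift E-A-Q one place to the right
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (E << (n - 1))
        E = 0
    return (A << n) | Q                   # double-length product in A and Q

print(shift_add_multiply(10, 9))          # 90, i.e. 1010 x 1001 = 1011010
```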
c) Draw and explain the algorithm for division with an example? 10M
Ans: Division of two fixed-point binary numbers in signed-magnitude representation is performed with paper and pencil by a process of successive compare, shift, and subtract operations. Binary division is much simpler than decimal division because the quotient digits are either 0 or 1 and there is no need to estimate how many times the dividend or partial remainder fits into the divisor.
The division process is described in the figure below. The divisor is compared with the five most significant bits of the dividend. Since the 5-bit number is smaller than B, we repeat the comparison with one more bit. Now the 6-bit number is greater than B, so we place a 1 for the quotient bit in the sixth position above the dividend. We then shift the divisor once to the right and subtract it from the dividend. The difference is known as a partial remainder because the division could have stopped here to obtain a quotient of 1 and a remainder equal to the partial remainder.
Comparing a partial remainder with the divisor continues the process. If the partial remainder is greater than or equal to the divisor, the quotient bit is equal to 1, and the divisor is then shifted right and subtracted from the partial remainder. If the partial remainder is smaller than the divisor, the quotient bit is 0 and no subtraction is needed. The divisor is shifted once to the right in any case. The result gives both a quotient and a remainder.
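A compact Python sketch of this restoring-division idea follows (a pure-integer version; the register-level E/A/Q details are omitted, and the word size n is an assumed parameter):

```python
# A compact sketch of restoring division: bring down one dividend bit at a
# time, compare with the divisor, and subtract when it fits.

def restoring_divide(dividend: int, divisor: int, n: int = 5):
    quotient, remainder = 0, 0
    for i in range(n - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # next dividend bit
        quotient <<= 1
        if remainder >= divisor:       # partial remainder fits: quotient bit 1
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

print(restoring_divide(0b11010, 0b101))   # 26 / 5 -> (5, 1)
```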
d) Draw and explain the algorithm for floating point multiplication with an example? 10M
Steps:
1. Check for zeros: if either operand is zero, the product is set to zero.
2. Add the exponents (and subtract the bias once, if biased exponents are used).
3. Multiply the mantissas.
4. Normalize the product and adjust the exponent accordingly.
UNIT-3
Characteristics:
SIMD supercomputers are designed to exploit data-level parallelism, meaning they can handle large volumes of repetitive operations on different data sets. The main reasons for using SIMD supercomputers include:
1. Efficiency: They are highly efficient for tasks that involve repeating the same operation on a large set of data.
2. Performance: SIMD reduces the overhead of fetching and decoding multiple instructions, leading to faster computation times compared to SISD (Single Instruction, Single Data) systems.
3. Resource Sharing: By sharing the same control unit for multiple processing units, SIMD supercomputers are able to process vast amounts of data in parallel, optimizing resource use.
4. Scalability: SIMD systems can easily scale to handle larger datasets or more complex computations by adding more processing elements.
Applications:
SIMD is ideal for applications that require the same operation to be performed on large sets of data. Real-world applications of SIMD supercomputers include:
1. Graphics processing in images, videos, and games.
2. Scientific simulations, such as weather forecasting and fluid dynamics.
3. Machine learning, where operations like matrix multiplication and activation functions can be executed efficiently using SIMD (see the sketch below).
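A small illustration of data-level parallelism in this style, using NumPy (assumed to be available) as a software stand-in for SIMD hardware: a single vectorized expression is applied to a whole array at once, where an SISD-style loop would touch one element at a time:

```python
# Data-level parallelism in the SIMD style, sketched with NumPy (assumed to
# be available): one vectorized operation covers the whole data set, where an
# SISD-style loop would process one element per "instruction".

import numpy as np

a = np.arange(1_000_000, dtype=np.float32)
b = np.arange(1_000_000, dtype=np.float32)

c_simd = a * b + 1.0              # single vector operation over all elements

c_sisd = np.empty_like(a)         # the equivalent element-at-a-time loop
for i in range(3):                # only the first few elements, for brevity
    c_sisd[i] = a[i] * b[i] + 1.0

assert np.allclose(c_simd[:3], c_sisd[:3])
```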
b) Explain the working, characteristics and applications of a distributed memory multicomputer? 10M
Ans: A distributed memory multicomputer is a type of parallel computing architecture consisting of multiple computers (called nodes). Each node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals, and the processors communicate with one another via message passing. The nodes are interconnected by a message-passing network, which can be a mesh, ring, torus, hypercube, etc.
Multicomputers are therefore also called no-remote-memory-access (NORMA) machines. Communication between nodes, when required, is carried out by passing messages through the static connection network.
Working:
In distributed memory multi-computers, each processor executes its own program
and has its own local memory. They work using the following key principles:
1.Independent Processors: Each processor performs computations independently,
using its local memory.
2. Message Passing for Communication: Since the memory is not shared, processors communicate with each other through message-passing mechanisms. This could involve specialized communication libraries like MPI (Message Passing Interface).
3.Task Distribution: Tasks or workloads are divided among the processors. Each
processor may compute part of the problem independently. The processors may
exchange data during computation when needed, often at the start or end of an
operation.
4.Network Interconnect: A communication network (like Ethernet, InfiniBand,
etc.) connects the processors. Latency and bandwidth in this network can
significantly affect the performance of distributed memory systems.
5. Data Distribution: Data is distributed across the processors to minimize communication overhead. The division of data should ideally be balanced, so that no single processor becomes a bottleneck.
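A toy sketch of these principles using Python's multiprocessing module as a stand-in for MPI-style message passing (the node count and data split are arbitrary choices): each "node" computes on its local slice only and sends its partial result over an explicit channel:

```python
# A toy distributed-memory sketch: each "node" computes on its local data and
# sends its partial result over an explicit channel instead of sharing memory.

from multiprocessing import Process, Queue

def node(rank: int, local_data: list, q: Queue) -> None:
    partial = sum(x * x for x in local_data)   # computation on local memory
    q.put((rank, partial))                     # explicit message passing

if __name__ == "__main__":
    data = list(range(16))
    q = Queue()
    procs = [Process(target=node, args=(r, data[r::4], q)) for r in range(4)]
    for p in procs:
        p.start()
    total = sum(q.get()[1] for _ in procs)     # gather the partial sums
    for p in procs:
        p.join()
    print(total, sum(x * x for x in data))     # both print 1240
```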
Characteristics:
Distributed memory multi-computers are used for several reasons, especially in
large-scale parallel computing systems:
Scalability: Shared memory architectures do not scale well to large numbers of processors because of memory-access bottlenecks (while one processor is using the memory, another cannot). Distributed memory systems scale efficiently by adding more nodes, with each node managing its own memory.
Cost-Effectiveness: Distributed memory systems are cheaper to scale because each
node is relatively independent and can be made of standard, off-the-shelf
components.
Higher Memory Bandwidth: In shared memory systems, processors often compete
for access to a central memory, leading to performance degradation. Distributed
memory systems avoid this issue by providing each processor with its own local
memory.
Suitability for Data-Parallel Applications: In distributed memory systems, each
node can handle a portion of the data independently, performing computations in
parallel, and only communicating results when necessary.
Fault Tolerance: In a distributed system, if one node fails, it often does not bring
down the entire system. Other nodes can continue working, which improves
reliability.
Applications:
1. Climate modeling,
2. Fluid dynamics simulations, and
3. Machine learning training often use distributed memory systems.
Working: Program and data are first loaded into main memory through the host computer. Instructions are first decoded by the scalar control unit; if an instruction is a scalar operation or a program-control operation, it is executed directly using the scalar functional pipelines. If it is a vector operation, it is sent to the vector control unit. The vector control unit supervises the flow of vector data between main memory and the vector functional pipelines, coordinating the vector data flow. A number of vector functional pipelines may be built into a vector processor.
VECTOR PROCESSOR MODELS are of 2 types:
1) Register-to-register architecture
2) Memory-to-memory architecture
REGISTER-TO-REGISTER architecture
The fig above shows a register-to-register architecture. Vector registers are used to hold vector operands and intermediate and final vector results. All vector registers are programmable, and the length of a vector register is usually fixed; some machines use reconfigurable vector registers to dynamically match the register length (e.g., the Fujitsu VP2000).
MEMORY-TO-MEMORY architecture
This differs from the register-to-register architecture in its use of a vector stream unit in place of vector registers. Vector operands and results are directly retrieved from and stored into main memory in superwords (e.g., 512 bits in the Cyber 205).
Characteristics:
1. High Performance on Repetitive Operations
2. Vector Instruction Set
3. Memory Bandwidth
4. Pipelining
5. SIMD Architecture
Applications of a Vector Supercomputer:
1) Scientific Computing: Vector supercomputers are widely used in scientific
fields for tasks such as weather forecasting, climate modeling, and fluid
dynamics simulations.
2) Engineering and Computational Fluid Dynamics: Fields like aerospace
engineering rely on vector processing to model airflow, heat distribution, and
stress analysis.
3) Big Data Analytics: In applications that analyze large volumes of data, such as
machine learning and data mining, vector supercomputers process datasets
efficiently by applying vectorized algorithms. For example, they can be used in
genomic research, where they rapidly process genetic data.
4) High-Performance Physics Simulations: In physics, simulations of particle
interactions or molecular dynamics require intensive computations, which are
well-suited for vector processing due to the similarity of operations across large
data sets.
UNIT-4
Processor families can be mapped onto a coordinated space of clock rate versus
cycles per instruction (CPI), as illustrated in Fig. 4.1.
• Two main categories of processors are:
o CISC (e.g., the x86 architecture)
o RISC (e.g., the Power series, SPARC, MIPS, etc.)
Under both CISC and RISC categories, products designed for multi-core chips,
embedded applications, or for low cost and/or low power consumption, tend to
have lower clock speeds.
d) Define CISC & RISC scalar processors? 2M
Ans: CISC (Complex Instruction Set Computer) scalar processors are designed
with a complex instruction set that includes instructions capable of performing
multiple low-level operations, such as memory access, arithmetic, and branching,
in a single instruction. These processors aim to reduce the number of instructions
per program by providing highly specialized instructions.
RISC (Reduced Instruction Set Computer) scalar processors use a simpler
instruction set with each instruction performing a small, atomic operation. These
processors focus on achieving high performance by optimizing instruction
execution through simplicity and pipelining.
e) What is a superscalar processor? 2M
Ans:A superscalar processor is a type of CPU that can execute more than one
instruction per clock cycle by using multiple execution units. Unlike scalar
processors, which handle only one instruction at a time, superscalar processors
achieve parallelism by issuing and executing multiple instructions simultaneously.
a) Explain how the design space of processors impacts performance and efficiency? 3M
Ans: The design space of processors refers to the various architectural features and
choices made during processor development, such as instruction set design,
parallelism, memory hierarchy, and energy efficiency. These design choices
directly influence a processor's performance and efficiency. Below are key design
aspects and their impacts:
CISC (Complex Instruction Set Computing): Provides rich, complex
instructions, reducing code size but increasing decoding complexity, which may
slow performance.
RISC (Reduced Instruction Set Computing): Uses simpler instructions executed
faster, allowing pipelining and parallelism for better efficiency.
b) Describe the differences between CISC and RISC processors? 3M
Ans:
S.No  CISC                                              RISC
1     Large set of instructions with variable formats   Small set of instructions with fixed (32-bit)
      (16-64 bits per instruction)                      format, mostly register-based
2     12-24 addressing modes                            3-5 addressing modes
3     8-24 general-purpose registers                    Large register file (32-192 registers)
4     CPI between 2 and 15                              CPI below 1.5 on average
5     Clock rates of 33-50 MHz                          Clock rates of 50-150 MHz
Unified Cache: The cache is used to store both instructions and data. This design simplifies the cache hardware and reduces the need for separate instruction and data caches. However, it can potentially lead to performance bottlenecks if both instructions and data are heavily accessed simultaneously.
Instruction and Data Path: The instruction and data path is responsible for fetching instructions, decoding them, and executing the corresponding operations on data. This path includes components like registers, ALUs (Arithmetic Logic Units), and data buses.
Additional Considerations: CISC architectures often have variable-length instructions, which can complicate instruction decoding and execution. They may also have complex addressing modes, which can increase instruction complexity.
Typical CISC characteristics: 12-24 addressing modes, 8-24 general-purpose registers, and a CPI between 2 and 15.
While CISC architectures were dominant in the past, they have been largely displaced by RISC designs in many domains.
b) Discuss in detail the architectural features of RISC scalar processors with a neat diagram? 5M
Split Instruction and Data Cache:The cache is divided into two separate caches:
one for instructions and one for data.This design can improve performance by
reducing cache conflicts and increasing the likelihood of cache hits.
Reduced Instruction Set: SPARC uses a simple instruction set with fixed-
length instructions, making it easier to decode and execute.
Load-Store Architecture: Memory access is limited to load and store
instructions, simplifying the data path.
Register Windows: SPARC uses a unique register window scheme to reduce
the number of load and store instructions, improving performance.
Pipelining: SPARC employs pipelining to execute multiple instructions
concurrently, increasing throughput.
The FPU is a specialized unit within the SPARC processor that handles floating-
point arithmetic operations. It is designed to perform calculations on real numbers,
which are represented in a format that includes a mantissa and an exponent.
1. Instruction Fetch: The instruction containing the floating-point operation is fetched from
memory and decoded by the instruction decoder.
2. Operand Fetch: The operands for the operation are fetched from the register file.
3. FPU Execution: The FPU performs the specified floating-point operation on the operands. This may involve several internal stages.
In conclusion, the SPARC architecture with its integrated FPU is well-suited for
applications that require high-performance floating-point computations, such as
scientific simulations, financial modeling, and image processing.
a) Describe in detail the design space of processors and how it influences the choice between CISC and RISC architectures? 10M
Ans: Design space of CISC, RISC, superscalar, and VLIW processors:
The CPI of different CISC instructions varies from 1 to 20. Therefore, CISC
processors are at the upper part of the design space. With advanced
implementation techniques, the clock rate of today's CISC processors ranges up to a few GHz.
With efficient use of pipelines, the average CPI of RISC instructions has been
reduced to between one and two cycles.
An important subclass of RISC processors are the superscalar processors, which
allow multiple instructions to be issued simultaneously during each cycle. Thus the
effective CPI of a superscalar processor should be lower than that of a scalar RISC
processor. The clock rate of superscalar processors matches that of scalar RISC
processors.
The very long instruction word (VLIW) architecture can in theory use even more functional units than a superscalar processor. Thus the CPI of a VLIW processor can be further lowered. Intel's i860 RISC processor had a VLIW architecture.
The effective CPI of a processor used in a supercomputer should be very low,
positioned at the lower right corner of the design space. However, the cost and
power consumption increase appreciably if processor design is restricted to the
lower right corner.
Processor families can be mapped onto a coordinated space of clock rate versus cycles per instruction (CPI), as illustrated in Fig. 4.1. As implementation technology evolves rapidly, the clock rates of various processors have moved from low to higher speeds toward the right of the design space (i.e., an increase in clock rate), and processor manufacturers have been trying to lower the CPI (the cycles taken to execute an instruction) using innovative hardware approaches.
Two main categories of processors are:
o CISC (e.g., the x86 architecture)
o RISC (e.g., the Power series, SPARC, MIPS, etc.)
Under both CISC and RISC categories, products designed for multi-core chips, embedded applications, or for low cost and/or low power consumption tend to have lower clock speeds. High-performance processors must necessarily be designed to operate at high clock speeds. The category of vector processors has been marked VP; vector processing features may be associated with CISC or RISC main processors.
The speedup factor is directly related to the number of stages (k) in the pipeline.
Speedup: S = T_np / T_p, where T_np is the execution time of a non-pipelined processor and T_p is the execution time of the k-stage pipelined processor. For n tasks, T_np = n*k*tp and T_p = (k + n - 1)*tp, so S = n*k / (k + n - 1).
Ideal speedup: S_ideal = k (approached as n becomes large).
1. Data Input: External data (operands) is fed into the first stage (S₁) of the
pipeline.
2. Processing and Latching: The first stage processes the data and stores the
result in its latch.
3. Clock Signal: The clock signal arrives, triggering the transfer of data from
the latches in all stages to the next stage.
4. Sequential Processing: This process continues sequentially for each stage in
the pipeline, with each stage processing its data and storing the result in its
latch.
[Figure (a): instruction issue in original program order; Figure (b): reordered instruction issue]
Figure above shows the flow of machine instructions through a typical pipeline.
These eight instructions are for pipelined execution of the high-level language
statements X = Y + Z and A = B * C. Here we have assumed that load and store
instructions take four execution clock cycles, while floating-point add and
multiply operations take three cycles. Figure (a) above illustrates instruction issue in the original program order. The shaded boxes correspond to idle cycles when instruction issue is blocked due to resource latency, conflicts, or data dependencies. The first two load instructions issue on consecutive cycles. The add depends on both loads and must wait three cycles before the data (Y and Z) are loaded in. Similarly, the store of the sum to memory location X must wait three cycles for the add to finish due to a flow dependence. Figure (b) above shows an improved timing after the instruction issue order is changed to eliminate unnecessary delays due to dependences. The idea is to issue all four load operations in the beginning. Both the add and multiply instructions are blocked for fewer cycles because of this data prefetch. The reordering does not change the end results. The time required is reduced to 11 cycles, measured from cycle 4 to cycle 14.
c) Explain the concept of Pipeline Schedule Optimization with an example? 5M
Ans: Pipeline schedule optimization based on the Minimal Average Latency (MAL) concept inserts non-compute delay stages into a pipeline to modify the reservation table, resulting in a new collision vector and an improved state diagram. The aim is to achieve an optimal latency cycle, i.e., the shortest one possible.
Bounds on the MAL: Shar (1972) determined the following bounds on the MAL for a statically reconfigured pipeline:
o Lower bound: the MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table.
o Upper bound: the MAL is upper-bounded by the number of 1's in the initial collision vector plus 1.
These bounds suggest that the optimal latency cycle must be selected from one of the lowest greedy cycles in the state diagram. However, a greedy cycle alone does not guarantee the optimality of the MAL; the lower bound provides the guarantee of optimality.
Example (reservation tables for function X, over 8 clock cycles, and function Y, over 6 clock cycles):
Time:  1  2  3  4  5  6  7  8
S1     X              X     X
S2        X     X
S3           X     X     X

Time:  1  2  3  4  5  6
S1     Y           Y
S2           Y
S3        Y     Y     Y
Latency: The number of time units (clock cycles) between two initiations of a
pipeline is the latency between them. Latency values must be non-negative
integers.
Collision: When two or more initiations attempt to use the same pipeline stage at the same time, a collision occurs. A collision implies a resource conflict between two initiations in the pipeline, so it should be avoided.
Permissible Latency: Latencies that do not cause any collision are called permissible latencies. (E.g., in the reservation table for X above, 1, 3, and 6 are permissible latencies.)
Latency Cycle: A latency cycle is a latency sequence which repeats the same subsequence (cycle) indefinitely. The average latency of a latency cycle is obtained by dividing the sum of all latencies by the number of latencies along the cycle. The latency cycle (1, 8) has an average latency of (1+8)/2 = 4.5. A Constant Cycle is a latency cycle which contains only one latency value. (E.g., cycles (3) and (6) are both constant cycles.)
Collision Vector: The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which is an m-bit (m <= n-1 for an n-column reservation table) binary vector C = (Cm Cm-1 … C2 C1). The value of Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible. (E.g., Cx = (1011010).)
Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in
which each state appears only once. In above state diagram only (3), (6), (8), (1,
8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple as it travels twice through state (1011010). A Greedy Cycle is a simple cycle whose edges are all made with minimum latencies from their respective starting states. The cycles (1, 8) and (3) are greedy cycles. The MAL (Minimum Average Latency) is the minimum average latency obtained from a greedy cycle. Of the greedy cycles (1, 8) and (3), the cycle (3) leads to the MAL value of 3. For functions X and Y, the
MAL is 3, and both have met the lower bound of 3 from their respective
reservation tables. However, the upper bound on the MAL for function X is
4+1=5, a rather loose bound. On the other hand, the upper bound for function Y is
2+1=3, a tighter bound. Therefore, all greedy cycles for function Y lead to the
optimal latency value of 3, which cannot be further reduced.
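The definitions above can be made concrete with a short Python sketch that derives the forbidden latencies and the collision vector from a reservation table; the mark positions used for function X below are the ones from the reconstructed example table:

```python
# Deriving forbidden latencies and the collision vector from a reservation
# table (rows = stages, values = marked clock cycles). The mark positions for
# function X follow the reconstructed example table above.

def collision_vector(table: dict, n_columns: int):
    forbidden = set()
    for marks in table.values():
        forbidden |= {abs(a - b) for a in marks for b in marks if a != b}
    m = n_columns - 1
    bits = "".join("1" if lat in forbidden else "0" for lat in range(m, 0, -1))
    return forbidden, bits                     # C = (Cm ... C2 C1)

X = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}
print(collision_vector(X, 8))                  # ({2, 4, 5, 7}, '1011010')
```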
Optimization Technique
To optimize the MAL, one needs to reach its lower bound by modifying the reservation table. The approach is to reduce the maximum number of checkmarks in any row while preserving the original function being evaluated. Patel and Davidson (1976) proposed using non-compute delay stages to increase pipeline performance and achieve a shorter MAL.
d) Describe the role of branch handling techniques in maintaining pipeline efficiency? 5M
Ans: BRANCH HANDLING TECHNIQUES:
Terms used in branching:
Branch Taken: The action of fetching a non-sequential or remote instruction after a branch instruction is called a branch taken.
Branch Target: The instruction to be executed after a branch taken is called the branch target.
Delay Slot: The number of pipeline cycles wasted between a branch taken and its branch target is called the delay slot, denoted by b, where 0 ≤ b ≤ k-1 (k = number of pipeline stages).
EFFECTS OF BRANCHING:
When a branch is taken, all instructions following the branch in the pipeline become useless and will be drained from the pipeline. Thus a branch taken causes the pipeline to be flushed, losing a number of pipeline cycles.
Clock Cycle: The clock cycle of a pipeline is the time it takes for data to move from one stage to the next. It is determined by the maximum stage delay (Tm) and the latch delay (d); it is essentially the interval between successive clock pulses.
Clock Skew: Clock skew refers to the difference in arrival times of the clock signal at different stages of the pipeline. This can lead to timing issues, as data may not be stable at the receiving stage when the clock signal arrives.
Clock Cycle Time: T = max{Ti} + d = Tm + d, where Ti is the delay of stage i (1 ≤ i ≤ m) and Tm is the largest of these delays.
Speedup Ratio: The speedup ratio is the ratio of the time taken by a non-pipelined processor to the time taken with pipelining. If the time taken for one task is tn, then the time taken to complete n tasks in a non-pipelined configuration is n*tn. The time taken for n tasks on a pipeline with k segments is k*tp + (n-1)*tp = (k+n-1)*tp. Thus the speedup ratio is
S = n*tn / ((k+n-1)*tp).
With tn = k*tp, this becomes S = n*k / (k+n-1), which approaches the maximum theoretical value k as n grows large; in practice, various limitations prevent a pipeline from operating at this rate.
Throughput: The number of tasks completed by a pipeline per unit time is called the throughput; it represents the computing power of the pipeline. Throughput is defined as
W = n / ((k+n-1)*tp) = η/tp,
where η = n/(k+n-1) is the pipeline efficiency.
The concepts of clock cycle, clock skew, speedup, efficiency, and throughput are
interrelated and form the foundation of pipeline performance analysis.Mitigating
clock skew ensures reliability and consistency in pipelined operations.However,
real-world limitations like varying stage delays and clock cycle constraints
prevent achieving the ideal speedup.Higher throughput indicates better utilization
of the pipeline stages and faster processing rates for tasks.
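Putting the formulas together, the following Python sketch (the k, n, and tp values are arbitrary examples) computes speedup, efficiency, and throughput, and shows the speedup approaching k as n grows:

```python
# Speedup, efficiency, and throughput from the formulas above, for a k-stage
# pipeline processing n tasks (tp = pipeline clock period; tn = k*tp for the
# equivalent non-pipelined processor). The sample numbers are arbitrary.

def pipeline_metrics(k: int, n: int, tp: float):
    tn = k * tp
    speedup = (n * tn) / ((k + n - 1) * tp)    # S = n*tn / ((k+n-1)*tp)
    efficiency = speedup / k                   # eta = S / k
    throughput = n / ((k + n - 1) * tp)        # W = n / ((k+n-1)*tp) = eta/tp
    return speedup, efficiency, throughput

for n in (1, 10, 100, 1000):
    s, e, w = pipeline_metrics(k=4, n=n, tp=10e-9)
    print(f"n={n:4d}: speedup={s:5.2f}  efficiency={e:4.2f}  throughput={w:.2e}/s")
# As n grows, the speedup approaches the ideal value k = 4.
```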
Evaluate the significance of dynamic instruction scheduling in modern processors and its impact on overall pipeline performance? 10M
Ans:
The fig above shows a functional unit connected to common data bus with
three reservation stations provided on it.
Op: the operation to be carried out; Opnd-1 and Opnd-2: the two operand values needed for the operation; t1 and t2: the two source tags associated with the operands.
When the needed operand values are available in the reservation station, the
functional unit can initiate the required operation in the next clock cycle. At the time of instruction issue, the reservation station is filled out with the operation code (Op). If an operand value is available in a programmable register, it is transferred to the corresponding source operand field in the reservation station. Otherwise the instruction waits until its data dependencies are resolved and the operands become available. Dependences are resolved by monitoring the result bus: when all operands of an instruction are available, it is dispatched to the functional unit for execution. If an operand value is not available at the time of issue, the corresponding source tag (t1 and/or t2) is copied into the reservation station instead. The source tag identifies the source of the required operand. As soon as the required operand is available at its source (typically the output of a functional unit), the data value is forwarded over the common data bus along with its source tag.
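A much-simplified model of one reservation station and the common-data-bus broadcast can be sketched in Python as follows; the field names (Op, Opnd-1/Opnd-2, t1/t2) follow the text, while everything else is our simplification:

```python
# A much-simplified reservation station with common-data-bus snooping. The
# field names (Op, Opnd-1/Opnd-2, t1/t2) follow the text; the rest is our own
# simplification of the Tomasulo scheme.

class ReservationStation:
    def __init__(self, op, opnd1=None, opnd2=None, t1=None, t2=None):
        self.op, self.opnd1, self.opnd2 = op, opnd1, opnd2
        self.t1, self.t2 = t1, t2       # source tags for operands not yet known

    def snoop(self, tag, value):
        """Capture a (tag, value) broadcast seen on the common data bus."""
        if self.t1 == tag:
            self.opnd1, self.t1 = value, None
        if self.t2 == tag:
            self.opnd2, self.t2 = value, None

    def ready(self) -> bool:
        return self.t1 is None and self.t2 is None   # all operands available

rs = ReservationStation("ADD", opnd1=3.0, t2="FU2")  # waiting on tag FU2
rs.snoop("FU2", 4.5)            # the producing unit forwards its result
print(rs.ready(), rs.opnd1 + rs.opnd2)               # True 7.5
```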
CDC SCOREBOARDING
The figure above shows a CDC 6600-like processor that uses dynamic instruction scheduling hardware. Here, multiple functional units appear as multiple execution pipelines. The parallel units allow instructions to complete out of the original program order. The processor has instruction buffers for each execution unit, and instructions are issued to available functional units regardless of whether their register input data are available. To control the correct routing of data between execution units and registers, the CDC 6600 used a centralized control unit known as the scoreboard. The scoreboard keeps track of the registers needed by instructions waiting for the various functional units. When all registers have valid data, the scoreboard enables the instruction's execution; when a functional unit finishes, it signals the scoreboard to release the resources. Thus the scoreboard is a centralized control logic which keeps track of the status of the registers and of the multiple functional units.
Explain dynamic instruction scheduling and branch handling techniques?
10M
Ans:
Data dependencies in a sequence of instructions create interlocked relationships. Interlocking is resolved either through a compiler-based static scheduling approach or by dynamic instruction scheduling, which includes additional hardware units. Dynamic scheduling has two techniques:
1. Tomasulo's algorithm
2. CDC scoreboarding
Tomasulo's algorithm: Named after its chief designer, this hardware dependence-resolution scheme was first implemented with the multiple floating-point units of the IBM 360/91 processor.
The functional units are internally pipelined and can complete one operation in every clock cycle, provided the reservation station of the unit (the structure of an RS is shown below) is ready with the required input operand values. If a source register is busy when an instruction reaches the issue stage, the tag for the source register is forwarded to the RS. When the register becomes available, the tag signals its availability, and the value is copied into all reservation stations that hold the matching tag. Thus operand forwarding is achieved here with the use of tags. All destinations which require a data value receive it in the same clock cycle over the common data bus, by matching stored operand tags with the source tag sent over the bus.
The fig above shows a functional unit connected to common data bus with
three reservation stations provided on it.
Op: the operation to be carried out; Opnd-1 and Opnd-2: the two operand values needed for the operation; t1 and t2: the two source tags associated with the operands.
When the needed operand values are available in the reservation station, the
functional unit can initiate the required operation in the next clock cycle. At the time of instruction issue, the reservation station is filled out with the operation code (Op). If an operand value is available in a programmable register, it is transferred to the corresponding source operand field in the reservation station. Otherwise the instruction waits until its data dependencies are resolved and the operands become available. Dependences are resolved by monitoring the result bus: when all operands of an instruction are available, it is dispatched to the functional unit for execution. If an operand value is not available at the time of issue, the corresponding source tag (t1 and/or t2) is copied into the reservation station instead. The source tag identifies the source of the required operand. As soon as the required operand is available at its source (typically the output of a functional unit), the data value is forwarded over the common data bus along with its source tag.
(OR)
CDC SCOREBOARDING
The figure above shows a CDC 6600-like processor that uses dynamic instruction scheduling hardware. Here, multiple functional units appear as multiple execution pipelines. The parallel units allow instructions to complete out of the original program order. The processor has instruction buffers for each execution unit, and instructions are issued to available functional units regardless of whether their register input data are available. To control the correct routing of data between execution units and registers, the CDC 6600 used a centralized control unit known as the scoreboard. The scoreboard keeps track of the registers needed by instructions waiting for the various functional units. When all registers have valid data, the scoreboard enables the instruction's execution; when a functional unit finishes, it signals the scoreboard to release the resources. Thus the scoreboard is a centralized control logic which keeps track of the status of the registers and of the multiple functional units.
BRANCH HANDLING TECHNIQUES:
Terms used in branching:
Branch Taken: The action of fetching a non-sequential or remote instruction after a branch instruction is called a branch taken.
Branch Target: The instruction to be executed after a branch taken is called the branch target.
Delay Slot: The number of pipeline cycles wasted between a branch taken and its branch target is called the delay slot, denoted by b, where 0 ≤ b ≤ k-1 (k = number of pipeline stages).
EFFECTS OF BRANCHING:
When a branch is taken, all instructions following the branch in the pipeline become useless and will be drained from the pipeline. Thus a branch taken causes the pipeline to be flushed, losing a number of pipeline cycles.
Bounds on the MAL: Shar (1972) determined the following bounds on the MAL for a statically reconfigured pipeline:
o Lower bound: the MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table.
o Upper bound: the MAL is upper-bounded by the number of 1's in the initial collision vector plus 1.
Example (reservation tables for function X, over 8 clock cycles, and function Y, over 6 clock cycles):
Time:  1  2  3  4  5  6  7  8
S1     X              X     X
S2        X     X
S3           X     X     X

Time:  1  2  3  4  5  6
S1     Y           Y
S2           Y
S3        Y     Y     Y
Latency: The number of time units (clock cycles) between two initiations of a
pipeline is the latency between them. Latency values must be non-negative
integers.
Collision: When two or more initiations attempt to use the same pipeline stage at the same time, a collision occurs. A collision implies a resource conflict between two initiations in the pipeline, so it should be avoided.
Permissible Latency: Latencies that do not cause any collision are called permissible latencies. (E.g., in the reservation table for X above, 1, 3, and 6 are permissible latencies.)
A Constant Cycle is a latency cycle which contains only one latency value. (E.g., cycles (3) and (6) are both constant cycles.)
Collision Vector: The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which is an m-bit (m <= n-1 for an n-column reservation table) binary vector C = (Cm Cm-1 … C2 C1). The value of Ci = 1 if latency i causes a collision and Ci = 0 if latency i is permissible. (E.g., Cx = (1011010).)
Simple Cycle, Greedy Cycle and MAL: A Simple Cycle is a latency cycle in
which each state appears only once. In above state diagram only (3), (6), (8), (1,
8), (3, 8), and (6, 8) are simple cycles. The cycle (1, 8, 6, 8) is not simple as it travels twice through state (1011010). A Greedy Cycle is a simple cycle whose edges are all made with minimum latencies from their respective starting states. The cycles (1, 8) and (3) are greedy cycles. The MAL (Minimum Average Latency) is the minimum average latency obtained from a greedy cycle. Of the greedy cycles (1, 8) and (3), the cycle (3) leads to the MAL value of 3.
For functions X and Y, the MAL is 3, and both have met the lower bound of 3
from their respective reservation tables. However, the upper bound on the MAL
for function X is 4+1=5, a rather loose bound. On the other hand, the upper bound
for function Y is 2+1=3, a tighter bound. Therefore, all greedy cycles for function
Y lead to the optimal latency value of 3, which cannot be further reduced.
Optimization Technique
To optimize the MAL, one needs to reach its lower bound by modifying the reservation table. The approach is to reduce the maximum number of checkmarks in any row while preserving the original function being evaluated. Patel and Davidson (1976) proposed using non-compute delay stages to increase pipeline performance and achieve a shorter MAL.
MAL and throughput: A shorter MAL leads to a higher throughput. Unless the MAL is reduced to 1, the pipeline throughput remains a fraction of its theoretical maximum.