The document discusses different types of pipelines in computers, including arithmetic and instruction pipelines. Arithmetic pipelines perform operations such as floating-point addition in multiple stages. Instruction pipelines allow overlapping of the fetch, decode, and execute stages of instructions. The document also discusses the hazards that can occur in pipelines, including data hazards from instruction dependencies, control hazards from branches, and structural hazards from resource conflicts. Common solutions to hazards include branch prediction, scheduling instructions to avoid dependencies, and optimizing pipeline usage.
Concepts of Pipelining
By Suparna Dutta
• Arithmetic Pipeline
• Instruction Pipeline
Arithmetic pipelining
Arithmetic pipelines are found in most computers. They are used for floating-point operations, multiplication of fixed-point numbers, etc. For example, the input to a floating-point adder pipeline is a pair of normalized floating-point numbers. Floating-point addition and subtraction is done in 4 parts:
• Compare the exponents.
• Align the mantissas.
• Add or subtract the mantissas.
• Produce (normalize) the result.

Instruction pipelining
• In instruction pipelining, a stream of instructions is executed by overlapping the fetch, decode, and execute phases of the instruction cycle. This technique is used to increase the throughput of the computer system.
• An instruction pipeline reads instructions from memory while previous instructions are being executed in other segments of the pipeline. Thus multiple instructions can execute simultaneously. The pipeline is more efficient if the instruction cycle is divided into segments of equal duration.
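The four adder stages above can be sketched in software. This is a minimal illustration, not real hardware: it uses Python's `math.frexp` to stand in for the exponent/mantissa decomposition a real FP adder performs on bit fields.

```python
from math import frexp

def fp_add(a, b):
    """Illustrative 4-stage floating-point addition, one stage per step
    of the pipeline described above (sketch, not real FP hardware)."""
    # Stage 1: compare exponents (decompose each operand as mantissa * 2**exp).
    m_a, e_a = frexp(a)
    m_b, e_b = frexp(b)
    # Stage 2: align mantissas by shifting the operand with the smaller exponent.
    shift = e_a - e_b
    if shift >= 0:
        m_b /= 2 ** shift
        e = e_a
    else:
        m_a /= 2 ** (-shift)
        e = e_b
    # Stage 3: add the aligned mantissas.
    m = m_a + m_b
    # Stage 4: produce (normalize) the result.
    return m * (2 ** e)

print(fp_add(1.5, 2.25))  # 3.75
```

In a real pipeline each stage would operate on a different operand pair every clock cycle, so four additions are in flight at once.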
Pipelining Hazards
• As we all know, the CPU's speed is limited by memory. There is one more case to consider: in a pipelined design, several instructions are at some stage of execution at once. There is a chance that these instructions will become dependent on one another, reducing the pipeline's pace. Dependencies arise for a variety of reasons, which we will examine shortly. The dependencies in the pipeline are referred to as hazards, since they put the execution at risk.
• The terms dependency and hazard are used interchangeably in computer architecture. A hazard, in essence, prevents an instruction present in the pipe from being performed during its designated clock cycle. We use the term clock cycle because each instruction may be in a separate machine cycle.

Types of Hazards
The three different types of hazards in computer architecture are:
• 1. Structural
• 2. Data
• 3. Control

Data Hazard
• Data hazards in pipelining emerge when the execution of one instruction depends on the result of another instruction that is still being processed in the pipeline. The order of the READ and WRITE operations on a register is used to classify data hazards into three groups.

Types of Data Hazard
There are mainly three types of data hazards:
1) RAW (Read After Write) [flow/true data dependency]
2) WAR (Write After Read) [anti-data dependency]
3) WAW (Write After Write) [output data dependency]
Let there be two instructions I and J, such that J follows I. Then:
• A RAW hazard occurs when instruction J tries to read data before instruction I writes it.
  Eg: I: R2 <- R1 + R3   J: R4 <- R2 + R3
• A WAR hazard occurs when instruction J tries to write data before instruction I reads it.
  Eg: I: R2 <- R1 + R3   J: R3 <- R4 + R5
• A WAW hazard occurs when instruction J tries to write its output before instruction I writes it.
  Eg: I: R2 <- R1 + R3   J: R2 <- R4 + R5
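The three rules can be checked mechanically. The sketch below represents each instruction as a destination register plus its source registers (a simplification: real detectors also track pipeline stage timing) and classifies the hazards between I and a later J:

```python
def classify_hazards(i_instr, j_instr):
    """Classify data hazards between instruction I and a later instruction J.
    Each instruction is (dest_register, [source_registers]).
    Returns the set of hazard names per the RAW/WAR/WAW rules above."""
    i_dst, i_srcs = i_instr
    j_dst, j_srcs = j_instr
    hazards = set()
    if i_dst in j_srcs:   # J reads a register before I has written it
        hazards.add("RAW")
    if j_dst in i_srcs:   # J writes a register before I has read it
        hazards.add("WAR")
    if j_dst == i_dst:    # both instructions write the same register
        hazards.add("WAW")
    return hazards

# The three examples from the slide, with I: R2 <- R1 + R3
I = ("R2", ["R1", "R3"])
print(classify_hazards(I, ("R4", ["R2", "R3"])))  # {'RAW'}
print(classify_hazards(I, ("R3", ["R4", "R5"])))  # {'WAR'}
print(classify_hazards(I, ("R2", ["R4", "R5"])))  # {'WAW'}
```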
Control Hazard
• Branch hazards, caused by branch instructions, are known as control hazards in computer architecture. The flow of program/instruction execution is controlled by branch instructions. Remember that conditional statements are used in higher-level languages for iterative loops and condition testing (correlate with while, for, and if statements). These are converted into one of the BRANCH instruction variants. As a result, when the decision to execute one instruction depends on the result of another instruction, such as a conditional branch that examines a condition's resulting value, a control hazard develops.

Solution for Control Dependency
Branch prediction is the method through which stalls due to control dependency can be eliminated. In the first stage, a prediction is made about which branch will be taken; with correct branch prediction the branch penalty is zero.
• Branch penalty: the number of stalls introduced during branch operations in the pipelined processor is known as the branch penalty.
NOTE: Since the target address is available after the ID stage, the number of stalls introduced in the pipeline is 1. If the branch target address were available only after the ALU stage, there would be 2 stalls. Generally, if the target address is available after the kth stage, there will be (k - 1) stalls in the pipeline.
• Total number of stalls introduced in the pipeline due to branches = branch frequency * branch penalty

Structural Hazard
Hardware resource conflicts among the instructions in the pipeline cause structural hazards. Memory, a GPR register, or an ALU might all be used as resources here. When more than one instruction in the pipe requires access to the very same resource in the same clock cycle, a resource conflict is said to arise. In an overlapped pipelined execution, this is a circumstance in which the hardware cannot handle all potential combinations of instructions.
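The two formulas above (penalty = k - 1 stalls, total stalls = branch frequency * branch penalty) combine into a quick CPI estimate. A minimal sketch; the function names and the base-CPI parameter are our own, not from the slide:

```python
def branch_stall_cycles(branch_frequency, branch_penalty):
    """Average stall cycles per instruction contributed by branches,
    using the slide's formula: total stalls = frequency * penalty."""
    return branch_frequency * branch_penalty

def effective_cpi(base_cpi, branch_frequency, target_stage):
    """CPI including branch stalls when the target address becomes
    available after stage k (penalty = k - 1 stalls, as noted above)."""
    penalty = target_stage - 1
    return base_cpi + branch_stall_cycles(branch_frequency, penalty)

# 20% branches, target known after the ID stage (stage 2 -> 1 stall):
print(effective_cpi(1.0, 0.20, 2))  # 1.2
# Target known only after the ALU stage (stage 3 -> 2 stalls):
print(effective_cpi(1.0, 0.20, 3))  # 1.4
```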
Pipeline Optimization
To maximize processing speed, locate the bottleneck stage, then allow stages that are not bottlenecks to consume as much time as the bottleneck.
• For a given reservation table, find the current average sample period (ASP).
• Find the largest number of cycles for which any resource is busy. This is equal to the minimum possible average sampling period (MASP).
• If ASP = MASP, there is nothing to be done.
• Otherwise, try to re-schedule events so that MASP is achieved.

Non-linear pipeline
A non-linear pipeline is a pipeline made up of different pipelines that are present at different stages. The different pipelines are connected to perform multiple functions, and it also has feedback and feed-forward connections. It is built so that it performs various functions at different time intervals. In a non-linear pipeline, the functions are dynamically assigned.
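The MASP bound in the second bullet is easy to compute: it is the maximum number of busy cycles in any one row of the reservation table. A sketch, using a made-up table (reservation tables themselves are defined in the next section):

```python
def min_avg_sampling_period(reservation_table):
    """Lower bound on the average sampling period (MASP):
    the largest number of cycles for which any single resource is
    busy, i.e. the maximum number of 1s in any row of the table."""
    return max(sum(row) for row in reservation_table)

# Rows = resources, columns = time slices (1 = resource busy).
# This table is a hypothetical example, not one from the slides.
table = [
    [1, 0, 0, 0, 0, 1],   # resource S1 busy in cycles 0 and 5
    [0, 1, 0, 1, 0, 0],   # resource S2 busy in cycles 1 and 3
    [0, 0, 1, 0, 1, 0],   # resource S3 busy in cycles 2 and 4
]
print(min_avg_sampling_period(table))  # 2
```

No schedule can initiate tasks faster than one per MASP cycles on average, because the busiest resource must serve every task.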
Reservation Table
• A reservation table is a way of representing the task-flow pattern of a pipelined system. Each row of the reservation table represents one resource of the pipeline and each column represents one time-slice of the pipeline. All the elements of the table are either 0 or 1. If a resource (say, resource i) is used in a time-slice (say, time-slice j), then the (i, j)-th element of the table has the entry 1. On the other hand, if a resource is not used in a particular time-slice, then that entry of the table has the value 0.

Example (for the reservation table in the original figure, not reproduced here):
1. Forbidden latencies are: 0, 2, 3, 5
2. Pipeline collision vector is: (101101)
3. Greedy cycle is: (1, 6)*
4. Minimal average latency is: 3.5
5. Throughput is 0.28

Definitions
• Latency: the number of time units (clock cycles) between two initiations of a pipeline.
• Forbidden latency: a latency that causes a collision.
• Permissible latency: a latency that does not cause a collision.
• Latency sequence: a sequence of permissible latencies between successive task initiations.
• Latency cycle: a latency sequence that repeats the same subsequence (cycle) indefinitely.
• Average latency: the average latency of a latency cycle, obtained by dividing the sum of all latencies by the number of latencies along the cycle.
• Collision vector: the combined set of permissible and forbidden latencies can be compactly displayed as a collision vector.
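Forbidden latencies and the collision vector follow mechanically from the table: latency p is forbidden iff some row has two 1s exactly p columns apart, since the two initiations would then collide on that resource. A sketch, again using a hypothetical table rather than the one in the original figure:

```python
def forbidden_latencies(table):
    """Latency p is forbidden iff two initiations p cycles apart
    would both need the same resource in the same cycle, i.e. some
    row has two 1s whose column distance is exactly p."""
    forbidden = set()
    for row in table:
        busy = [t for t, used in enumerate(row) if used]
        for a in busy:
            for b in busy:
                if b > a:
                    forbidden.add(b - a)
    return forbidden

def collision_vector(table):
    """Collision vector (c_m ... c_1): bit i is 1 iff latency i is
    forbidden, where m is the largest forbidden latency."""
    f = forbidden_latencies(table)
    m = max(f)
    return "".join("1" if i in f else "0" for i in range(m, 0, -1))

# Hypothetical 3-resource, 6-cycle reservation table:
table = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
]
print(forbidden_latencies(table))  # {2, 5}
print(collision_vector(table))     # 10010
```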
• Latency sequence: a sequence of permissible latencies between successive initiations.
• Latency cycle: a latency sequence that repeats the same subsequence (cycle) indefinitely.
• Simple cycle: a latency cycle in which each state appears only once, e.g. (3), (6), (8), (1, 8), (3, 8), and (6, 8).
• Greedy cycle: a simple cycle whose edges are all made with the minimum latencies from their respective starting states. Greedy cycles must first be simple, and their average latencies must be lower than those of the other simple cycles, e.g. (1, 8) and (3); one of them gives the MAL (minimum average latency).
• MAL (minimum average latency): the minimum average latency obtained from a greedy cycle. Of the greedy cycles (1, 8) and (3), the cycle (3) leads to the MAL value 3.
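Given a set of candidate cycles, picking the MAL is just a matter of comparing average latencies, as in this short sketch using the greedy cycles named above:

```python
def average_latency(cycle):
    """Average latency of a latency cycle: sum of its latencies
    divided by the number of latencies along the cycle."""
    return sum(cycle) / len(cycle)

# Greedy cycles from the example above:
greedy_cycles = [(1, 8), (3,)]
mal = min(average_latency(c) for c in greedy_cycles)

print(average_latency((1, 8)))  # 4.5
print(mal)                      # 3.0  (cycle (3) gives the MAL)
```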
Cycles: (1, 8), (1, 8, 6, 8), (1, 8, 3, 8), (3), (6), (3, 8), (3, 6, 3)

Instruction Level Parallelism
• Instruction-level parallelism (ILP) refers to architectures in which multiple operations can be performed in parallel within a particular process, with its own set of resources: address space, registers, identifiers, state, and program counters. It refers to compiler design techniques and processors designed to execute operations, such as memory load and store, integer addition, and floating-point multiplication, in parallel to improve processor performance. Examples of architectures that exploit ILP are VLIW and superscalar architectures.

Compiler Techniques for Exposing ILP
Basic Pipeline Scheduling and Loop Unrolling
To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline. Throughout this chapter we will assume the FP unit latencies shown in the figure.
Compiler Techniques for Exposing ILP (contd.): Example
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
How will this loop run when it is scheduled on a simple pipeline for MIPS with the given latencies? Unscheduled, this code takes 10 clock cycles per iteration.

Compiler Techniques for Exposing ILP (contd.)
This loop is parallel: the body of each iteration is independent. The first step is to translate the above segment to MIPS assembly language. In the following code segment, R1 is initially the address of the element in the array with the highest address, and F2 contains the scalar value s. Register R2 is precomputed, so that 8(R2) is the last element to operate on. The straightforward MIPS code, not scheduled for the pipeline, looks like this:

Loop: L.D    F0,0(R1)    ;F0 = array element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    F4,0(R1)    ;store result
      DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (per DW)
      BNE    R1,R2,Loop  ;branch if R1 != R2

We can schedule the loop to obtain only one stall:

Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      BNE    R1,R2,Loop  ;delayed branch
      S.D    F4,8(R1)    ;altered & interchanged with DADDUI
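The stall counting can be modeled with a small simulator. This is a simplified sketch under assumed latencies (load-to-ALU 1 stall, FP-ALU-to-store 2 stalls, integer-to-branch 1 stall); it tracks, for each register, the first cycle its value is usable and delays any consumer until then:

```python
# Assumed extra wait cycles a consumer of each producer must endure.
LATENCY = {"L.D": 1, "ADD.D": 2, "DADDUI": 1, "S.D": 0, "BNE": 0}

def count_cycles(instrs):
    """Count clock cycles for one pass over `instrs` on a simple
    in-order pipeline. Each instruction is (op, dest_or_None, srcs)."""
    ready = {}   # register -> first cycle its value is usable
    cycle = 0
    for op, dest, srcs in instrs:
        # Stall until every source operand is ready.
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])
        cycle = issue + 1
        if dest:
            ready[dest] = cycle + LATENCY[op]
    return cycle

unscheduled = [
    ("L.D",    "F0", ["R1"]),
    ("ADD.D",  "F4", ["F0", "F2"]),
    ("S.D",    None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNE",    None, ["R1", "R2"]),
]
print(count_cycles(unscheduled))  # 9
```

The model counts 9 cycles (5 instructions plus 4 data stalls); it omits the one-cycle branch delay, which brings the total to the 10 cycles quoted above.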
Compiler Techniques for Exposing ILP (contd.)
• Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the data-use stall by creating additional independent instructions within the loop body. If we simply replicated the instructions when we unrolled the loop, the resulting reuse of the same registers could prevent us from effectively scheduling the loop. Thus, we will want to use different registers for each iteration, increasing the required register count.
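The transformation is easiest to see at the source level. A sketch of the x[i] = x[i] + s loop unrolled four times (the factor 4 and the cleanup-loop structure are illustrative choices, not from the text); each statement in the unrolled body is independent, which is what lets the scheduler interleave them:

```python
def saxpy_unrolled(x, s):
    """x[i] = x[i] + s, unrolled four times. In the MIPS version each
    of the four adds would use its own F register, as noted above."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        # Four independent operations a scheduler can interleave:
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:            # cleanup when n is not a multiple of 4
        x[i] = x[i] + s
        i += 1
    return x

print(saxpy_unrolled([1, 2, 3, 4, 5], 10))  # [11, 12, 13, 14, 15]
```

One branch now covers four iterations, so the loop-overhead instructions (pointer decrement and branch) are amortized over four element updates.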