
My_Thesis (2)

Uploaded by

Abdul Ahad
Copyright
© All Rights Reserved

Design and Implementation of FPU in a Pipelined RISC-V Processor

Malik Nauman

May 2024
Dedication

Enter your Dedication

Abstract

Enter your Abstract

Acknowledgement

I would like to express my deepest appreciation to everyone who made it possible
for me to work on a project that required extreme attention. Special thanks to my
supervisor, Dr. Arshad Hussain, for the support and advice he gave me throughout
this project. My thesis work took place in the System on Chip (SoC) lab of the
Electronics Department, Quaid-e-Azam University. The SoC lab provided advanced
software tools, all the necessary equipment and a great professional environment.
I would also like to express my appreciation to all the members of the Electronics
Department who played an important role in making my project a success.
Most importantly, I thank my parents, who supported me and believed in me
throughout my bachelor's program.
Contents

Dedication
Abstract
Acknowledgement

1 Introduction
1.1 Background
1.2 Motivation
1.3 Objective

2 Literature Review
2.1 Overview of RISC-V architecture
2.2 Single Cycle RISC-V
2.3 Pipeline RISC-V
2.4 Floating Point Unit and IEEE Standard
2.5 FPGA

3 Design and Implementation of Single-Cycle RISC-V Processor
3.1 Overview of Single-Cycle Processor
3.2 Architectural Design
3.2.1 Components Overview
3.2.2 Instruction Execution Flow
3.3 Instruction Set Architecture (ISA)
3.4 Single Cycle Data Path
3.5 Single Cycle Control Unit
3.6 Simulation

4 Upgrading to a Pipelined RISC-V Processor
4.1 Introduction to Pipelining
4.2 Pipeline Stages
4.3 Hazards
4.3.1 Data Hazards
4.3.2 Control Hazards
4.4 Pipeline Control
4.5 Performance Evaluation

5 Floating Point Unit (FPU)
5.1 Introduction to IEEE 754 Standard
5.1.1 IEEE 754 Special Values
5.2 Floating Point Arithmetic
5.2.1 Addition or Subtraction of IEEE 754 Numbers
5.3 FPU Architecture
5.4 Integration with RISC-V ISA
5.5 Implementation Details

6 Integrating FPU into Pipelined RISC-V Processor
6.1 Modifying the Pipeline Stages
6.2 Pipeline Control for FPU
6.3 Verification and Testing
6.4 Performance Analysis

7 Implementation on FPGA
7.1 Synthesis of the Design
7.2 PMod
7.3 Constraint File
List of Figures

2.1 SiFive RISC-V Processor
2.2 Common RISC-V Standard Extensions
2.3 Floating Point Representation Types
2.4 Simplified CLB
2.5 Internal structure of an FPGA
2.6 Zedboard FPGA

3.1 Program Counter
3.2 Instruction Memory
3.3 Word-Addressable Memory
3.4 Byte-Addressable Memory
3.5 Register File
3.6 Data Memory
3.7 RISC-V base instructions
3.8 R-Type Instruction
3.9 I-Type Instruction
3.10 S-Type Instruction
3.11 U-Type Instruction
3.12 J-Type Instruction
3.13 RV32I Base Instruction Set
3.14 Example Program
3.15 Fetch instruction from memory
3.16 Read source operand from register file
3.17 Sign-extend the immediate
3.18 Compute memory address
3.19 Increment program counter
3.20 Complete single-cycle processor
3.21 Control Unit
3.22 Main Decoder
3.23 ALU Decoder

4.1 Data Hazards
4.2 Forwarding
4.3 Branch is taken
4.4 Pipelined RISC-V Processor

5.1 IEEE 32-bit (single-precision) floating-point number
5.2 IEEE 754 special values
Acronyms

ISA Instruction Set Architecture

FPU Floating Point Unit

IEEE Institute of Electrical and Electronics Engineers

FPGA Field Programmable Gate Array

ASIC Application Specific Integrated Circuit

CLB Configurable Logic Block

LUT Look Up Table

CPI Clock Cycles Per Instruction

PC Program Counter

ALU Arithmetic Logic Unit


Chapter 1

Introduction

1.1 Background

RISC-V is an open-standard instruction set architecture (ISA) based on established
reduced instruction set computer (RISC) principles. RISC-V started as a three-month
project in 2010 at the University of California, Berkeley. Two graduate students,
Andrew Waterman and Yunsup Lee, and their professor, Krste Asanović, needed a simple
ISA that could easily be extended. They found that the commercial instruction set
architectures were too complex and also presented legal issues. At a high level,
RISC-V is a high-quality, license-free, royalty-free RISC instruction set architecture.
It is maintained by the RISC-V Foundation (now RISC-V International), a non-profit
organization formed in 2015, and it is suitable for all types of computing, from
microcontrollers to supercomputers. The RISC-V architecture is freely available under
a permissive license and does not mandate any specific CPU implementation.

1.2 Motivation

The demand for computational capacity continues to rise in domains such as data
science, artificial intelligence and consumer electronics. With its modular, flexible
and open-source architecture, RISC-V is an attractive platform for innovation in
processor development. Many high-performance applications cannot function effectively
without floating-point arithmetic. However, creating a Floating-Point Unit (FPU) that
balances accuracy, efficiency and design complexity is difficult. Integrating an FPU
improves the capabilities of a pipelined RISC-V processor, but it also introduces design
challenges such as data and control hazards. This research addresses these gaps by
developing a method for efficient FPU integration in pipelined RISC-V processors, with
the aim of improving processor performance and reliability. The outcomes of this work
have the potential to benefit both academic research and industrial applications,
contributing to the development of more powerful and efficient computing systems.

1.3 Objective

The main objective of this thesis is to design a five-stage pipelined RISC-V processor
with a Floating-Point Unit (FPU) implemented from scratch, and to study the behaviour
of the processor in depth, including the interactions between its components (ALU,
register file, memories, control unit, FPU, etc.). The complete processor is then
implemented on the Zedboard FPGA.
Chapter 2

Literature Review

2.1 Overview of RISC-V architecture

RISC-V is a new instruction set architecture (ISA), based on Reduced Instruction Set
Computing (RISC) principles, that was originally designed to support computer
architecture research and education. It has evolved so much over time that RISC-V
processors may soon appear in laptops, mobile devices and personal computers. Large
technology companies such as NVIDIA, Google, Huawei, Tesla, Samsung and dozens of others
have begun to adopt the RISC-V architecture, and more products with RISC-V processors on
board are expected because of improved battery efficiency, security and, most
importantly, a royalty-free license. As a result, organizations do not have to pay for
expensive proprietary licenses such as those of Intel and Arm, which can be costly and
cumbersome to deal with when working on a project.

Figure 2.1: SiFive RISC-V Processor

RISC-V uses a standard naming convention to describe the ISA supported by a given
implementation. The name starts with RV, which stands for RISC-V, followed by a number
that can be 32, 64 or 128. This number indicates the width of the integer registers and
the size of the user address space. The number is followed by a sequence of letters
that indicates the set of standard extensions supported by the implementation.
Extensions define the instructions supported by an implementation, and the only
extension required in every RISC-V implementation is the "I" extension, where I stands
for integer. The I extension defines a set of 40 instructions. The RISC-V specification
also defines a number of standard extensions, that is, extensions defined by the RISC-V
Foundation. For example, the M extension defines hardware multiplication and division
instructions, the A extension defines atomic instructions, and so on.

Figure 2.2: Common RISC-V Standard Extensions

One of the great things about RISC-V is that it allows easy, modular customization. You
can add your own custom instructions and indicate them in the ISA string with the letter
X, which denotes a custom non-standard extension. For example, RV32IMAC denotes a 32-bit
implementation of the base integer ISA with the multiply/divide (M), atomic (A) and
compressed (C) extensions.
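The naming convention described above can be illustrated with a small Python sketch that splits a simple ISA string such as RV32IMAC into its register width (XLEN) and its single-letter extensions. The function parse_isa_string and the EXTENSION_NAMES table are hypothetical names used only for this illustration. The sketch handles single-letter extensions only. It does not handle version numbers, the G shorthand, or the multi-letter Z/X extensions separated by underscores that the full specification allows.

```python
import re

# Single-letter extensions mentioned in this section (illustrative subset only).
EXTENSION_NAMES = {
    "I": "base integer",
    "E": "reduced base integer",
    "M": "integer multiply/divide",
    "A": "atomic",
    "F": "single-precision floating point",
    "D": "double-precision floating point",
    "C": "compressed",
}


def parse_isa_string(isa):
    """Split a simple ISA string such as 'RV32IMAC' into XLEN and extensions.

    Hypothetical helper: version numbers, the 'G' shorthand and multi-letter
    Z/X extensions separated by underscores are not handled.
    """
    match = re.fullmatch(r"RV(32|64|128)([A-Z]+)", isa.upper())
    if match is None:
        raise ValueError(f"not a simple RISC-V ISA string: {isa!r}")
    xlen = int(match.group(1))          # integer register width in bits
    letters = match.group(2)
    if letters[0] not in "IE":          # the base integer ISA must come first
        raise ValueError("ISA string must start with the base integer ISA (I or E)")
    names = [EXTENSION_NAMES.get(ch, f"other ({ch})") for ch in letters]
    return xlen, names


print(parse_isa_string("RV32IMAC"))
```

Letters that are not in the small table are reported as "other" rather than rejected. The sketch only decodes the string and does not check whether a combination of extensions is legal.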


RV32I defines 32 integer registers. Implementations of the F and D extensions add an
optional set of 32 floating-point registers. There is also another standard variant,
RV32E, which introduces a reduced register file with only 16 integer registers; it is
intended primarily for area-constrained embedded devices. The width of the registers is
determined by the ISA. The RISC-V architecture also defines an application binary
interface (ABI), the set of conventions (such as register usage and calling conventions)
that software components use to communicate with each other.

2.2 Single Cycle RISC-V

A single-cycle processor executes each instruction in a single clock cycle. The whole
process, from fetching the instruction to writing back the result, is completed in one
cycle. The control unit of a single-cycle processor is very simple, because each
instruction completes in one cycle and no non-architectural state is required. The
single-cycle design includes essential components such as the control unit, arithmetic
logic unit (ALU), register file, instruction memory and program counter, all
interconnected through a straightforward datapath. The main disadvantage of this
processor is that its cycle time is limited by the slowest instruction, the load word
(lw) instruction, because lw has the longest critical path.
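The short Python sketch below shows why the slowest instruction sets the clock period. It adds up component delays along the paths of add and lw through a single-cycle datapath. The delay values and the component names in DELAY_PS are hypothetical and were chosen only for illustration. They are not measurements of the processor designed in this thesis.

```python
# Hypothetical component delays in picoseconds (illustrative values only,
# not measurements of the processor in this thesis).
DELAY_PS = {
    "pc_clk_to_q": 40,    # PC register clock-to-Q
    "instr_mem": 200,     # instruction memory read
    "reg_read": 100,      # register file read
    "alu": 120,           # ALU operation (add or address calculation)
    "data_mem": 200,      # data memory read
    "result_mux": 30,     # multiplexer selecting ALU result or memory data
    "reg_setup": 60,      # register file setup time before the clock edge
}

# Components each instruction passes through before its result is written back.
PATHS = {
    "add": ["pc_clk_to_q", "instr_mem", "reg_read", "alu", "result_mux", "reg_setup"],
    "lw": ["pc_clk_to_q", "instr_mem", "reg_read", "alu", "data_mem",
           "result_mux", "reg_setup"],
}


def path_delay(instr):
    """Sum of the delays along one instruction's path through the datapath."""
    return sum(DELAY_PS[part] for part in PATHS[instr])


delays = {instr: path_delay(instr) for instr in PATHS}
clock_period = max(delays.values())  # one clock must fit the slowest instruction

for instr, d in delays.items():
    print(f"{instr}: {d} ps (idle for {clock_period - d} ps of every cycle)")
print(f"clock period = {clock_period} ps")
```

With these assumed numbers, lw needs 750 ps and add needs 550 ps. Because there is only one clock, add also takes 750 ps and sits idle for the remaining 200 ps.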

2.3 Pipeline RISC-V

A pipelined RISC-V processor applies pipelining to the single-cycle microarchitecture.
Pipeline registers are inserted between the stages so that each stage can work on a
different instruction. Because several instructions execute simultaneously, throughput
improves significantly. Pipelining makes the processor faster but also increases its
complexity, because it introduces dependencies between simultaneously executing
instructions. These dependencies, called hazards, are handled by the hazard unit, which
is an essential component of the pipelined RISC-V processor.
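The throughput gain can be shown with an ideal pipeline diagram. This hypothetical Python sketch assumes the five stages IF, ID, EX, MEM and WB, independent instructions, and no hazards or stalls. It prints the stage that each instruction occupies in every cycle.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(instructions):
    """Return the total cycle count and, per instruction, its stage in each cycle."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i, instr in enumerate(instructions):
        row = ["."] * total_cycles          # "." = instruction not in the pipeline
        for s, stage in enumerate(STAGES):
            row[i + s] = stage              # instruction i enters IF in cycle i
        rows.append((instr, row))
    return total_cycles, rows


# Independent instructions, so no forwarding or stalls are needed.
program = ["lw  x1, 0(x0)", "add x2, x3, x4", "sub x5, x6, x7", "or  x8, x9, x10"]
cycles, rows = pipeline_schedule(program)
for instr, row in rows:
    print(f"{instr:<16}" + " ".join(f"{s:>3}" for s in row))
print(f"{len(program)} instructions finish in {cycles} cycles")
```

With N independent instructions, the ideal pipeline needs 5 + N - 1 cycles. For long programs, it therefore approaches one completed instruction per (shorter) clock cycle. Hazards, covered later, add stall or flush cycles to this ideal count.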

2.4 Floating Point Unit and IEEE Standard

The floating-point unit (FPU) is a math coprocessor designed to carry out operations on
floating-point numbers. The FPU is an optional hardware component. For example, some
Cortex-M4 microcontrollers do not have an FPU, so the datasheet must be checked to see
whether one is present. Typical FPU operations include addition, subtraction,
multiplication, division and square root. Certain FPUs can also compute transcendental
functions, including trigonometric and exponential functions, although their precision
may be low.
Until the early 1990s, many computers were built with a separate FPU chip to perform
these calculations. Computer manufacturers then began to incorporate the FPU into the
microprocessor itself, for example in the Motorola 68040 and the Intel 80486DX. FPUs are
now a common component of the central processing unit (CPU).

There are three methods by which a floating-point operation can be executed when a CPU
is executing a program:

• A floating-point unit emulator (a floating-point library in software)

• Additional FPU hardware

• Hardware-integrated FPU

The Institute of Electrical and Electronics Engineers (IEEE) established the IEEE
Standard for Floating-Point Arithmetic (IEEE 754) in 1985. The standard, as revised in
2008, defines four common binary formats for representing floating-point numbers: half
(16 bits), single (32 bits), double (64 bits) and quad (128 bits). Only the
single-precision format is discussed in this thesis.

Figure 2.3: Floating Point Representation Types

Although 2's complement representation is frequently used for negative numbers, neither
the fraction nor the exponent in the IEEE floating-point representations uses 2's
complement. The IEEE 754 designers wanted the format to be easy to sort, so they used a
biased notation for the exponent and a sign-magnitude scheme for the fraction.
The IEEE 754 floating-point formats have three sub-fields: sign, exponent and fraction.
The fractional part of the number uses a sign-magnitude representation, which means
there is an explicit sign bit (S) for the fraction. The sign is 0 for positive numbers
and 1 for negative numbers. In normalized binary scientific notation, the leading bit
before the binary point is always 1. The designers of the IEEE format therefore decided
to make this bit implied and to store only the bits after the binary point.
The exponent in the IEEE floating-point formats uses what is known as a biased notation.
In a biased representation, every number is represented by the number plus a fixed bias.
In the IEEE single-precision format, the bias is 127. Hence, an exponent of +1 is
represented as 1 + 127 = 128, and an exponent of -2 is represented as -2 + 127 = 125.
Stored exponents less than 127 therefore indicate actual negative exponents, and stored
exponents greater than 127 indicate actual positive exponents. In the double-precision
format, the bias is 1023.
If a positive exponent becomes too large to fit in the exponent field, the situation is
called overflow. If a negative exponent is too large in magnitude to fit in the exponent
field, the situation is called underflow. The IEEE 754 standard also supports special
cases such as infinity, zero, denormalized (subnormal) numbers and NaN (Not a Number).
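The field layout and the bias described above can be checked with a small Python sketch. It uses the standard struct module to obtain the 32-bit pattern of a single-precision value and then splits that pattern into S, E and F. The function names decode_float32 and value_of are illustrative only. As a simplification, value_of handles only normalized numbers and does not handle zero, subnormals, infinity or NaN.

```python
import struct

BIAS = 127  # single-precision exponent bias


def decode_float32(x):
    """Return (S, E, F) of x after rounding it to IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1-bit sign
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent
    fraction = bits & 0x7FFFFF       # 23-bit fraction; the leading 1 is implied
    return sign, exponent, fraction


def value_of(sign, exponent, fraction):
    """Rebuild a normalized number: (-1)^S x 1.F x 2^(E - 127)."""
    if not 0 < exponent < 255:
        raise ValueError("only normalized numbers are handled in this sketch")
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - BIAS)


for x in (2.0, 0.25, -6.5):
    s, e, f = decode_float32(x)
    print(f"{x:>5}: S={s} E={e} (actual exponent {e - BIAS:+d}) F=0x{f:06X}")
```

For 2.0 the stored exponent is 128, and for 0.25 it is 125. These values match the biased-exponent examples above for +1 and -2.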

2.5 FPGA

FPGA stands for Field Programmable Gate Array, an integrated circuit designed to be
reprogrammed after manufacture. The design is described in a hardware description
language such as Verilog HDL or VHDL. Reprogrammability distinguishes FPGAs from
application-specific integrated circuits (ASICs), which are manufactured to perform
specific design tasks. Although one-time programmable FPGAs exist, the dominant types
are reprogrammable.

Figure 2.4: Simplified CLB

An FPGA is typically based on a matrix of configurable logic blocks (CLBs). A CLB is
made of four basic components: look-up tables (LUTs), multiplexers, a full adder and
D flip-flops. The other main resources of an FPGA are:

• reconfigurable interconnects, which allow CLBs to be connected to one another;

• input/output blocks, which provide the FPGA's external connections by carrying
signals into and out of the device;

• fixed-function logic blocks, such as multipliers or digital signal processing (DSP)
blocks;

• block RAM, which serves as a large memory structure, so that memory does not have to
be built from a group of flip-flops tied together.
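A LUT implements any Boolean function of its inputs by storing the function's truth table and using the inputs as multiplexer select lines. The Python sketch below models this behaviour with hypothetical helper functions (make_lut, lut_read). It programs two 3-input LUTs as the sum and carry outputs of a full adder. It is only a behavioural illustration. In a real FPGA, the LUT contents are generated by the synthesis and implementation tools.

```python
from itertools import product


def make_lut(func, k):
    """Return the 2**k configuration bits (truth table) of a k-input LUT."""
    # product() enumerates inputs in binary counting order, first input = MSB.
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]


def lut_read(config, *inputs):
    """Use the inputs as multiplexer select lines to pick one stored bit."""
    index = 0
    for bit in inputs:              # first input is the most significant select bit
        index = (index << 1) | bit
    return config[index]


# Program two 3-input LUTs as the sum and carry outputs of a full adder.
sum_lut = make_lut(lambda a, b, cin: a ^ b ^ cin, 3)
carry_lut = make_lut(lambda a, b, cin: (a & b) | (a & cin) | (b & cin), 3)

for a, b, cin in product((0, 1), repeat=3):
    s, cout = lut_read(sum_lut, a, b, cin), lut_read(carry_lut, a, b, cin)
    print(f"a={a} b={b} cin={cin} -> cout={cout} sum={s}")
```

Changing the function means rewriting only the stored bits, never the circuit. This is what makes the logic in a CLB configurable.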

Figure 2.5: Internal structure of an FPGA

Some advantages of FPGAs in today's market are:

• Performance: FPGAs can process signals quickly and in parallel. This is hard for
processors, because they typically execute instructions sequentially.

• Reprogrammability: no money is wasted on manufacturing new chips when the design
changes, because the FPGA can simply be reprogrammed.

• Cost: FPGAs are getting cheaper to manufacture, so they are becoming more popular in
industry.

FPGAs are used in aerospace and defense, audio processing, medical devices, the
automotive industry, security systems, video and image processing, wireless
communication, ASIC prototyping and many other areas. Some FPGA design tools are the
LabVIEW System Design Software, Altera software and the Xilinx Vivado Design Suite.

In this thesis, the Zedboard FPGA board is used, and the design software is Xilinx Vivado.

Figure 2.6: Zedboard FPGA


Chapter 3

Design and Implementation of Single-Cycle RISC-V Processor

3.1 Overview of Single-Cycle Processor

The single-cycle RISC-V processor remains one of the simplest and most direct approaches
in computer architecture. In a single-cycle processor, everything needed to carry out an
instruction is done in one clock cycle, from fetching the instruction to writing the
result back to the register file.
The greatest strength of a single-cycle processor is its simplicity. Each instruction
completes in a single cycle, so the design needs no logic to manage the overlap and
dependencies between instructions that arise in multi-cycle or pipelined processors.
This makes the single-cycle model an excellent educational tool for introducing the
basics of processor design and instruction execution. Its plain control logic and
datapath also provide a clear and comprehensible basis for students and engineers
working in computer architecture.

However, this simplicity comes at a high cost. The clock cycle of a single-cycle
processor must be long enough for the slowest instruction, which makes it less efficient
than modern pipelined designs. Because every instruction gets the same long cycle,
simple instructions are slowed down to the speed of the most complex one. Their lower
efficiency makes single-cycle processors poorly suited to high-performance applications,
but they are effective in small-scale applications and in education.
This chapter provides a comprehensive examination of the single-cycle RISC-V processor.
It describes the architectural design, instruction set architecture (ISA), datapath,
control path and simulation results. Considering each stage and its role in the
processor gives a more complete picture of the design and of the solutions to the issues
it raises.

3.2 Architectural Design

A microarchitecture can be divided into two interacting parts: the datapath and the
control unit. The datapath operates on words of data and contains structures such as
memories, registers, ALUs and multiplexers. We use the 32-bit RISC-V (RV32I)
architecture, so the datapath is 32 bits wide. The control unit receives the current
instruction from the datapath and tells the datapath how to execute that instruction.

3.2.1 Components Overview

This section looks at the basic components of the single-cycle RISC-V processor. Within
the constraint of a single clock cycle, each component performs its part so that every
instruction executes accurately and efficiently. The critical components of the
single-cycle RISC-V processor are the Program Counter (PC), instruction memory, register
file, Arithmetic Logic Unit (ALU) and data memory. The role of each component, and how
the components interact in the single-cycle design, is described below.

• Program Counter

The program counter (PC) is a register that holds the address of the current
instruction. It is an essential part of a computer's central processing unit (CPU),
because it keeps track of the memory location of the instruction being executed, from
which the address of the next instruction to fetch is computed. When a program runs,
the program counter tells the CPU which instruction to retrieve from memory and execute
next.

Figure 3.1: Program Counter

The process of executing instructions and updating the program counter continues until
the program ends or a special circumstance, such as a branch or jump instruction,
modifies the program flow. When a branch or jump instruction is taken, the program
counter is changed to point to the memory address of the target instruction. This
enables the CPU to alter the program's execution path.
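The next-PC selection described above can be sketched in Python. The select flag pc_src and the function next_pc are names assumed for this illustration. The sketch covers only PC + 4 and PC-relative targets (branches and jal). It does not cover jalr, whose target is computed from a register.

```python
MASK32 = 0xFFFFFFFF


def next_pc(pc, pc_src=False, imm=0):
    """Next-PC selection: PC + 4 normally, PC + imm for a taken branch or jal.

    pc_src stands in for the multiplexer select signal produced by the
    control logic (an assumed name for this sketch).
    """
    target = pc + imm if pc_src else pc + 4
    return target & MASK32  # the PC is a 32-bit register in RV32I


pc = 0x0000_1000
pc = next_pc(pc)                         # sequential: 0x1004
pc = next_pc(pc)                         # sequential: 0x1008
pc = next_pc(pc, pc_src=True, imm=-8)    # taken backward branch: 0x1000
print(hex(pc))
```

The increment is 4 rather than 1 because every RV32I instruction occupies four byte addresses. The following subsections on addressable memory explain this.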

• Instruction Memory

The instruction memory is a state element that provides read access to the
instructions of a program. Given an address as input, it supplies the instruction stored
at that address. The instruction memory has a single read port. It takes a 32-bit
instruction address input, A, and reads the 32-bit data (i.e., the instruction) at that
address onto the read data output, RD. RISC-V memory is byte-addressable and each
instruction is 4 bytes long, so we add 4 to the program counter to get the next
instruction. To see why we add 4 rather than 1, we first discuss the two types of
addressable memory.

Figure 3.2: Instruction Memory

Memory can be of two types: word-addressable memory and byte-addressable memory. In
word-addressable memory, each word has its own address; in byte-addressable memory, each
byte has its own address.

Word-Addressable Memory: Each 32-bit data word has its own unique address. Consider the
load word instruction (lw) with the format lw r1, 10(r0), where r1 is the destination
register, 10 is the offset (also called the constant or immediate) and r0 is the base
register. The address is calculated by adding the base register and the offset. Since
r0 holds 0, the address in this example is 0 + 10 = 10. After this instruction executes,
r1 holds the data value at address r0 + 10.

Figure 3.3: Word-Addressable Memory

Now consider a store word (sw) example. Suppose we want to store the value of r4 into
memory word 3. The instruction is sw r4, 0x3(r0), where r4 is the source register, 0x3
is the offset and r0 is the base register. The address is again calculated by adding the
base register and the offset: (0 + 0x3) = 3. The value in register r4 is therefore
written to word 3 of the memory.

Byte-Addressable Memory: Each data byte has its own unique address. RISC-V memory is
byte-addressable, and a word is 32 bits (4 bytes), so consecutive word addresses increase
by 4 (0, 4, 8, ...). Now consider a load word in byte-addressable memory. The address is
still calculated by adding the base register and the offset, but the offset is now
measured in bytes. To read word 10, which was at address 10 in the word-addressable
example, the offset must be 10 x 4 = 40, which is 0x28 in hexadecimal, so the
instruction is lw r1, 40(r0). After this instruction executes, r1 holds the word stored
at byte address 40 (bytes 40 to 43).

Figure 3.4: Byte-Addressable Memory
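The difference between byte addresses and word addresses can be checked with a small Python model of byte-addressable memory. ByteAddressableMemory and effective_address are hypothetical helpers. The model stores words in little-endian order, as RISC-V does. As a simplification, it rejects word addresses that are not multiples of 4.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base, offset):
    """lw/sw address: base register value plus the sign-extended byte offset."""
    return (base + offset) & MASK32


class ByteAddressableMemory:
    """Toy little-endian memory in which every byte has its own address."""

    def __init__(self, size_bytes=256):
        self.data = bytearray(size_bytes)

    def _check(self, addr):
        # Simplification for this sketch: only word-aligned accesses are allowed.
        if addr % 4 != 0:
            raise ValueError(f"address {addr} is not a multiple of 4")

    def store_word(self, addr, value):
        self._check(addr)
        self.data[addr:addr + 4] = (value & MASK32).to_bytes(4, "little")

    def load_word(self, addr):
        self._check(addr)
        return int.from_bytes(self.data[addr:addr + 4], "little")


mem = ByteAddressableMemory()
r0 = 0                                   # base register holding 0
addr = effective_address(r0, 40)         # lw r1, 40(r0)
mem.store_word(addr, 0x12345678)
r1 = mem.load_word(addr)
print(f"byte address {addr} (0x{addr:X}) = word {addr // 4}, r1 = 0x{r1:08X}")
print("bytes 40..43:", [hex(b) for b in mem.data[40:44]])
```

Byte address 40 is word 10, and the least significant byte (0x78) is stored at the lowest address. This is also why the PC advances by 4 from one instruction to the next.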



• Register File

A register file is an essential part of a computer's central processing unit (CPU) that
holds and manages a group of registers. Each register is a small, fast storage location
that the CPU uses to process and store temporary data. Each one holds a fixed-size
binary value, generally 32 or 64 bits in contemporary architectures. The register file
is essential for providing quick data access and modification while a program's
instructions are being executed. The registers are arranged as a two-dimensional array
of rows and columns. The number of registers depends on the CPU's architecture; register
files with 32, 64 or even more registers are common. In RISC-V, register x0 is hardwired
to 0.

Figure 3.5: Register File

The register file has two read ports and one write port. Each read port receives a
5-bit address input, A1 or A2, which specifies one of 2^5 = 32 registers as a source
operand. RD1 and RD2 are the read data outputs, which carry the 32-bit register values
read from the register file. The third port, the write port, has a 5-bit address input,
A3; a write enable input, WE3; a write data input, WD3; and a clock input. On the rising
edge of the clock, if the write enable WE3 is asserted, the register file writes the
data on WD3 into the destination register specified by A3.
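The port behaviour described above can be summarized in a behavioural Python sketch. The class and method names (RegisterFile, read, clock_edge) are hypothetical, and this is not the hardware description used in this thesis. Reads are modelled as combinational, and writes happen only when clock_edge is called with WE3 asserted. Writes to x0 are ignored, so x0 always reads 0.

```python
class RegisterFile:
    """Behavioural model of a 32 x 32-bit RISC-V register file (hypothetical)."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, a1, a2):
        """Read ports: return (RD1, RD2) for the 5-bit addresses A1 and A2."""
        return self.regs[a1 & 0x1F], self.regs[a2 & 0x1F]

    def clock_edge(self, we3, a3, wd3):
        """Write port: on a rising clock edge, reg[A3] <- WD3 if WE3 is asserted."""
        a3 &= 0x1F
        if we3 and a3 != 0:          # x0 is hardwired to 0, so writes to it are ignored
            self.regs[a3] = wd3 & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(we3=True, a3=5, wd3=42)    # write x5
rf.clock_edge(we3=True, a3=0, wd3=99)    # attempted write to x0 is discarded
rf.clock_edge(we3=False, a3=6, wd3=7)    # write enable low: nothing changes
print(rf.read(5, 0), rf.read(6, 0))
```

Separating the read and write methods mirrors the hardware. The two read ports return values at any time, while the write port changes state only at a clock edge.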

Overall, the register file is a crucial part of contemporary processors. It offers
quick, temporary storage for the data needed during the execution of instructions and
considerably boosts the CPU's overall speed.

• Data Memory: A computer system's data memory, sometimes called data storage or a data
memory unit, is the part of the system where information can be stored and retrieved
during program execution. It may hold temporary or permanent data, depending on the
memory technology used.

Data memory is different from instruction memory (code memory), which stores the
program instructions that the processor retrieves and executes.

The data memory has a single read/write port. If its write enable input, WE, is high, it
writes the data on the WD input into address A on the rising edge of the clock. If WE is
low, it reads the data at address A onto the output RD.

Figure 3.6: Data Memory

3.2.2 Instruction Execution Flow

Conceptually, instruction execution in the single-cycle processor consists of five
stages, each of which plays a vital role: Instruction Fetch (IF), Instruction Decode
(ID), Execute (EX), Memory (MEM) and Write Back (WB). Each instruction passes through all
of these stages within one single clock cycle. This section covers the operations
performed in each stage, illustrating how an instruction flows from the fetch stage to
the write-back stage.

• Instruction Fetch(IF): This is the first stage of the instruction flow execu-
tion. Here, in this stage the processor retrieves the next instruction that is
to be executed form the instruction memory. The next instruction to be per-
formed is indicated by the PC pointing to its memory address. The fetched
instruction and PC is passed to the Decode stage.

• Instruction Decode(ID): Here in this stage the fetched instruction is de-


coded to determine which instruction is this whether it is R-type, I-type, S-
3.2. Architectural Design 22

type, U-type, etc. After determining the type of instruction it generates the
control signals using the control unit. This stage also involves the reading from
the register file and then providing values of the registers r1 and r2 and control
signals to the next stage.

• Execute(EX): Here in this stage the actual computation is performed. The


main part in-charge of carrying out the operations is the ALU. This stage
also involves the branch determination, if the instruction is of B-type then the
ALU calculates the branch target address and determines whether the branch
is taken or not. The memory address computation may also take place at
this point if the instruction needs to access memory (such as load or store
instructions) using the immediate generator. The result of the ALU is passed
to the next stage which is Memory stage.

• Memory(MEM): In this stage, load and store instructions access the data memory. A load reads data from memory into the processor, and the data is then passed to the writeback stage. A store takes data from the register file and writes it to memory.

• WriteBack(WB): In this stage, the result of the instruction is written back to the destination register. This completes the execution of the instruction, and the processor proceeds to the next instruction.

3.3 Instruction Set Architecture(ISA)

RISC-V (pronounced “risk-five”) is a new instruction set architecture (ISA) that was originally designed to support computer architecture research and education. The ISA is the most important part of a computer architecture: it defines the instructions a processor can execute at the lowest level of abstraction. It acts as the interface between software and hardware, defining how software commands are translated into hardware actions. An ISA specifies the instructions and their formats, the addressing modes, and the behavior of each instruction when it is executed. More generally, an ISA defines the supported instructions, data types, and registers, the hardware support for managing main memory, fundamental features such as the memory consistency model, addressing modes, and virtual memory, and the input/output model of its implementations.
RISC-V has many advantages as an ISA for everything from microcontrollers to high-performance computing. It is designed to be simple to implement and modular, which makes it easy to build on for both academic and commercial users. Because the ISA is modular and extensible, it targets all markets, from simple embedded systems to the most complex high-performance computing applications: a base ISA contains the essential instructions, and additional functionality is provided through optional extensions.

• RV32I The base integer instruction set for 32-bit processors; it contains integer computational instructions, integer loads, integer stores, and control flow instructions.

• RV32F The standard extension for single-precision floating-point operations.

• RV32D The standard extension for double-precision floating-point operations.

• RV32E A reduced version of RV32I with fewer registers (16 instead of 32), intended for embedded systems.

• RV32A The standard atomic instruction extension; it adds instructions that atomically read, modify, and write memory for inter-processor synchronization.

• RV64I The base integer instruction set for 64-bit processors.

• RV64M The standard extension for integer multiplication and division.

• RV64C The standard compressed instruction extension, which provides narrower 16-bit forms of common instructions.

In this thesis only RV32I is implemented, and only a subset of its instructions, so we now discuss the RV32I base instruction set. RV32I is the base integer ISA of the RISC-V family, an open-standard ISA based on the principles of Reduced Instruction Set Computing (RISC). The name RV32I denotes the 32-bit version of the RISC-V ISA with the “I” base integer instruction set.
The RV32I base integer instruction set is the basis for RISC-V processors. The ISA can be extended for specific purposes: for example, if the processor must execute atomic instructions, the A extension is used, and if it must perform 32-bit floating-point operations, the F extension is used. In this chapter only the I base set is used; the F extension is addressed later, when we examine the FPU.
RV32I contains 40 unique instructions, of which only some are implemented in this thesis. RV32I has 32 registers, each 32 bits wide. Register x0 is hardwired to 0. The general-purpose registers x1–x31 hold values that various instructions interpret as a collection of Boolean values, as two’s complement signed binary integers, or as unsigned binary integers. There is one additional unprivileged register: the program counter (PC), which holds the address of the current instruction.
The base integer ISA has no dedicated stack pointer or subroutine return-address link register; the instruction encoding allows any x register to be used for these purposes. However, the standard software calling convention uses register x1 to hold the return address for a call, with register x5 available as an alternate link register, and uses register x2 as the stack pointer.
RV32I has four core instruction formats, R, I, S, and U, all 32 bits long. (The B and J formats used by branches and jumps are variants of S and U that differ only in how the immediate bits are arranged.) An instruction-address-misaligned exception occurs when a taken branch or jump targets an address that is not aligned to four bytes. The exception is reported on the branch or jump instruction itself, not on the instruction at the target address, and no exception is raised if a conditional branch is not taken. The RISC-V ISA keeps the source registers (rs1 and rs2) and the destination register (rd) at the same bit positions in all formats to simplify decoding.
The instruction formats are as follows.

Figure 3.7: RISC-V base instructions

• R-Type Instruction: R-type instructions perform an operation on two source registers and store the result in a destination register. Their fields are used as follows.

Figure 3.8: R-Type Instruction

rs1: Source register 1 (operand).
rs2: Source register 2 (operand).
rd: Destination register, where the result is stored.
opcode: Defines the type of instruction, i.e. I-type, R-type, S-type, etc.
Funct3 and Funct7: Define which operation is to be performed.

• I-Type Instruction: I-type instructions operate on one source register and a 12-bit immediate value and write the result to a destination register. Arithmetic instructions with an immediate operand (such as addi) and loads (such as lw) use this format.

Figure 3.9: I-Type Instruction

rs1: Source register 1 (operand); for loads it holds the base address.
imm: 12-bit constant (operand); for loads it is the offset added to the base address.
rd: Destination register, where the result is stored.
opcode: Defines the type of instruction, i.e. I-type, R-type, S-type, etc.
Funct3: Defines which operation is to be performed.

Note: The immediate value is represented in two's complement notation and is sign-extended to 32 bits.

• S-Type Instruction: S-type instructions store data from a source register to memory. The effective address is calculated by adding an immediate offset to a base register; S-type instructions have no destination register.

Figure 3.10: S-Type Instruction

rs1: Base register, holds the base address for the memory operation.
rs2: Source register, contains the data that is to be stored in memory.
imm: Constant (offset), used with the base register to calculate the memory address. It is split into imm[11:5] (Instr[31:25]) and imm[4:0] (Instr[11:7]).
opcode: Defines the type of instruction, i.e. I-type, R-type, S-type, etc.
Funct3: Defines which operation is to be performed.

• U-Type Instruction: U-type instructions use a large immediate value to construct addresses or constants. Their fields are used as follows.

Figure 3.11: U-Type Instruction

rd: Destination register, where the result is stored.
imm: 20-bit immediate value, placed in the upper 20 bits of the result and typically used for constructing addresses or constants.
opcode: Defines the type of instruction, i.e. I-type, R-type, S-type, etc.
opcode: Defines the type of instruction i.e I-type, R-type, S-type, etc.

• J-Type Instruction: J-type instructions perform jumps whose target address is specified by an immediate value. Their fields are used as follows.

Figure 3.12: J-Type Instruction

rd: Destination register, which receives the return address (PC + 4).
imm: Immediate value that specifies the jump target as an offset relative to the PC (a multiple of 2 bytes).
opcode: Defines the type of instruction, i.e. I-type, R-type, S-type, etc.
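The formats differ mainly in where the immediate bits are placed. The following Python sketch reassembles and sign-extends the immediate for the I, S, U, and J formats described above; the function names `sign_extend` and `immediate` are hypothetical, the bit positions follow the RISC-V specification, and the B format is omitted. The test encodings were worked out by hand from the field layouts.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def immediate(instr: int, fmt: str) -> int:
    """Reassemble and sign-extend the immediate of an I-, S-, U- or J-type instruction."""
    if fmt == "I":  # imm[11:0] = Instr[31:20]
        return sign_extend(instr >> 20, 12)
    if fmt == "S":  # imm[11:5] = Instr[31:25], imm[4:0] = Instr[11:7]
        return sign_extend(((instr >> 25) << 5) | ((instr >> 7) & 0x1F), 12)
    if fmt == "U":  # imm[31:12] = Instr[31:12], low 12 bits are zero
        return sign_extend(instr & 0xFFFFF000, 32)
    if fmt == "J":  # imm[20|10:1|11|19:12] = Instr[31|30:21|20|19:12], imm[0] = 0
        imm = (((instr >> 31) & 0x1) << 20) \
            | (((instr >> 12) & 0xFF) << 12) \
            | (((instr >> 20) & 0x1) << 11) \
            | (((instr >> 21) & 0x3FF) << 1)
        return sign_extend(imm, 21)
    raise ValueError(f"unsupported format: {fmt}")
```

For example, `immediate(0xFFC4A303, "I")` gives -4 for lw x6, -4(x9), and `immediate(0x008000EF, "J")` gives 8 for jal x1, 8.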

These are the basic RV32I instruction formats. Figure 3.13 lists the 40 instructions supported by RV32I.

Figure 3.13: RV32I Base Instruction Set

3.4 Single Cycle Data Path

We now turn to the datapath of the processor. This section constructs the single-cycle datapath incrementally, adding new elements to the state elements at each step. To make the construction concrete, we follow the example program in Figure 3.14.

Figure 3.14: Example Program

The PC holds the address of the instruction to be executed. The first step is to fetch this instruction from the instruction memory. As Figure 3.15 shows, the PC is connected directly to the address input of the instruction memory, which reads out the 32-bit instruction, called Instr. The operation the processor performs depends on which instruction is fetched; in this example, Instr is a load word (lw). We first establish the datapath connections for lw and then discuss how to extend the datapath to cover more instructions.

Figure 3.15: Fetch instruction from memory

To execute lw, we first read the source register containing the base address. Recall that lw is an I-type instruction, and the base register is specified in the rs1 field of the instruction, Instr[19:15].

Figure 3.16: Read source operand from register file

As Figure 3.16 shows, these instruction bits are fed to the A1 address input of the register file. The register file reads the register value and drives it onto RD1. In our example, the register file reads 0x2004 from x9.
An offset is also needed for the lw instruction. The offset is held in the 12-bit immediate field of the instruction, Instr[31:20]. Because the offset is signed, it must be sign-extended to 32 bits. Sign extension copies the sign bit into the most significant bits: ImmExt[11:0] = Instr[31:20], and every bit of ImmExt[31:12] equals Instr[31]. An Extend unit, shown in Figure 3.17, performs this sign extension. It takes the 12-bit signed immediate in Instr[31:20] and produces the 32-bit sign-extended immediate, ImmExt. In our example, the two’s complement immediate -4 is sign-extended from its 12-bit representation, 0xFFC, to the 32-bit value 0xFFFFFFFC.

Figure 3.17: Sign-extend the immediate

To determine the address to read from memory, the processor adds the offset to the base address. Figure 3.18 introduces an ALU to perform this addition. The ALU receives two operands: SrcA, the base address read from the register file, and SrcB, the sign-extended immediate ImmExt. The ALU can perform many operations; the 3-bit ALUControl signal specifies which one. The ALU takes 32-bit operands and produces a 32-bit ALUResult. For lw, ALUControl is set to 000 to perform addition. As Figure 3.18 shows, ALUResult is sent to the data memory as the address to read.

Figure 3.18: Compute memory address



The ALU forwards this memory address to the address (A) port of the data memory. The data memory reads the data at that address onto the ReadData bus, and the data is then written back to the destination register. While executing each instruction, the processor must also compute PCNext, the address of the next instruction. Because instructions are 32 bits (4 bytes) long, the next instruction is at PC + 4.
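The datapath steps for lw can be summarized as a short behavioral model. The Python sketch below is not the hardware design, only an illustration under stated assumptions: the function names are hypothetical, the register file and data memory are modeled as a list and a dictionary, x9 = 0x2004 and the offset -4 come from the text, while the destination register x6 (giving the encoding 0xFFC4A303 for lw x6, -4(x9)) and the data value stored at 0x2000 are assumptions.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_12(imm12: int) -> int:
    """Extend unit: copy bit 11 of the immediate into bits 31:12."""
    return (imm12 | 0xFFFFF000) if imm12 & 0x800 else imm12


def execute_lw(pc: int, instr: int, regs: list, dmem: dict):
    """One single-cycle lw: read rs1, extend, add, read memory, write back, PC + 4."""
    rs1 = (instr >> 15) & 0x1F                       # base register, Instr[19:15]
    rd = (instr >> 7) & 0x1F                         # destination, Instr[11:7]
    imm_ext = sign_extend_12((instr >> 20) & 0xFFF)  # ImmExt from Instr[31:20]
    src_a = regs[rs1]                                # RD1 of the register file
    alu_result = (src_a + imm_ext) & MASK32          # ALUControl = 000 (add)
    read_data = dmem[alu_result]                     # data memory read (ReadData)
    if rd != 0:                                      # x0 is hardwired to 0
        regs[rd] = read_data                         # write back
    pc_next = (pc + 4) & MASK32                      # PCNext = PC + 4
    return pc_next, alu_result


regs = [0] * 32
regs[9] = 0x2004                   # base address from the text
dmem = {0x2000: 0x12345678}        # hypothetical data value
pc_next, address = execute_lw(0x1000, 0xFFC4A303, regs, dmem)
```

Here the sign-extended immediate is 0xFFFFFFFC, the computed address is 0x2004 + (-4) = 0x2000, the hypothetical word at that address is written to x6, and PCNext is 0x1004.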

Figure 3.19: Increment program counter

This completes the datapath for the load word instruction. The store word instruction, R-type instructions, and any other instructions the microprocessor should support can be implemented in the same way: by examining the architecture and then setting the control signals appropriately. There is a trade-off between the number of instructions a processor supports and its complexity: the more instructions you add, the more complex the design becomes. The complete datapath for the RISC-V microprocessor, which handles S-type, I-type, R-type, B-type, and J-type instructions, is shown in Figure 3.20.

Figure 3.20: Complete single-cycle processor

3.5 Single Cycle Control Unit

The control signals for the single-cycle processor are computed from the op, funct3, and funct7 fields. Since only bit 5 of funct7 is used in the RV32I instruction set, the control unit needs only three inputs: op (Instr[6:0]), funct3 (Instr[14:12]), and funct7bit5 (Instr[30]).
The control unit is divided into two components: the Main Decoder, which determines the instruction type and produces the corresponding control signals, and the ALU Decoder, which determines which operation the ALU performs. Together, their outputs form all of the processor's control signals.

Figure 3.21: Control Unit

Figure 3.22 shows the control signals produced by the Main Decoder; these signals were determined as part of the datapath design. The Main Decoder uses the instruction type, given by the opcode, to send the correct control signals to the datapath, and it produces the bulk of the datapath's control signals. In addition, it produces two signals internal to the controller, Branch and ALUOp. The truth table in Figure 3.22 can be implemented as combinational logic using any standard logic design technique.

Figure 3.22: Main Decoder

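The ALU Decoder produces ALUControl from ALUOp and funct3. As Figure 3.23 shows, it also uses funct7bit5 and op5 (bit 5 of the opcode) to distinguish sub from add. Both bits are needed because in I-type instructions such as addi, bit 30 belongs to the immediate rather than to funct7; only an R-type instruction (op5 = 1) with funct7bit5 = 1 is a subtraction.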
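The decoding described above can be sketched in Python. Only ALUControl = 000 for addition is stated in the text; the ALUOp values, the other ALUControl codes, and the set of supported operations below are assumptions that follow a common textbook convention and may differ from Figures 3.22 and 3.23. All names are hypothetical.

```python
# ALUOp produced by the Main Decoder for each opcode (assumed encoding)
ALUOP_FROM_OPCODE = {
    0b0000011: 0b00,  # lw: address addition
    0b0100011: 0b00,  # sw: address addition
    0b1100011: 0b01,  # beq: compare by subtraction
    0b0110011: 0b10,  # R-type: funct3/funct7 select the operation
    0b0010011: 0b10,  # I-type ALU (addi, ...): funct3 selects the operation
}

# ALUControl codes: 000 = add is from the text, the others are assumed
ADD, SUB, AND, OR, SLT = 0b000, 0b001, 0b010, 0b011, 0b101


def alu_decoder(alu_op: int, funct3: int, op5: int, funct7b5: int) -> int:
    if alu_op == 0b00:
        return ADD
    if alu_op == 0b01:
        return SUB
    # alu_op == 0b10: arithmetic or logical instruction
    if funct3 == 0b000:
        # sub only for R-type (op5 = 1); in addi, bit 30 is part of the immediate
        return SUB if (op5 and funct7b5) else ADD
    if funct3 == 0b010:
        return SLT
    if funct3 == 0b110:
        return OR
    if funct3 == 0b111:
        return AND
    raise ValueError("funct3 not supported by this sketch")


def alu_control(instr: int) -> int:
    """Combine the Main Decoder's ALUOp with the ALU Decoder."""
    alu_op = ALUOP_FROM_OPCODE[instr & 0x7F]  # op = Instr[6:0]
    funct3 = (instr >> 12) & 0x7              # Instr[14:12]
    op5 = (instr >> 5) & 0x1                  # Instr[5]
    funct7b5 = (instr >> 30) & 0x1            # Instr[30]
    return alu_decoder(alu_op, funct3, op5, funct7b5)
```

With these assumptions, sub x5, x6, x7 (0x407302B3) decodes to SUB, while addi x5, x6, -1 (0xFFF30293), whose bit 30 is also 1, still decodes to ADD because op5 is 0.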

Figure 3.23: ALU Decoder

3.6 Simulation

Now that our single-cycle RISC-V processor is complete, it is time to check whether it produces the expected output. To do this, we first fill the instruction memory with the instructions we want to execute and then run the simulation. The following is an example simulation of the single-cycle processor.
Chapter 4

Upgrading to a Pipelined RISC-V Processor

4.1 Introduction to Pipelining

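Pipelining is a technique in which the execution of multiple instructions is overlapped, much like an assembly line. To build the pipelined processor, we divide the single-cycle datapath into five stages. Because five instructions can now be in progress simultaneously, one in each stage, the throughput can ideally approach five times that of the single-cycle processor. In practice the speedup is lower, because the stages are not perfectly balanced, the pipeline registers add overhead, and hazards cause stalls and flushes.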

4.2 Pipeline Stages

The pipelined processor has five stages. More stages are possible, but for simplicity we divide the processor into five stages.

• Instruction Fetch(IF): The CPU fetches the next instruction from memory.

• Instruction Decode(ID): The fetched instruction is decoded to obtain the operation to be carried out and the operands involved.

• Execute(EX): The necessary operations, such as arithmetic or logical operations, are applied to the operands.

• Memory(MEM): This stage handles memory operations if the instruction accesses memory (for example, loading or storing data).

• WriteBack(WB): The results of the instruction are written back to the appropriate processor registers.

Each stage works on one instruction at a time, and instructions flow through the stages in order, so up to five instructions are in progress at once. By overlapping the execution of instructions in this way, the processor spends less time idle and achieves higher overall performance.
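The overlap can be made concrete by listing which instruction occupies each stage in each cycle. The Python sketch below assumes an ideal pipeline with no stalls or flushes; the function names are hypothetical.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def pipeline_schedule(instructions):
    """Map each cycle (starting at 1) to the instruction occupying each stage,
    assuming an ideal pipeline with no stalls or flushes."""
    table = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            # instruction i enters stage s in cycle i + s + 1
            table.setdefault(i + s + 1, {})[stage] = instr
    return table


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions: fill the pipeline, then one per cycle."""
    return n_stages + n_instructions - 1


schedule = pipeline_schedule(["i1", "i2", "i3", "i4", "i5"])
```

In cycle 5 of this five-instruction example every stage is busy, with i1 in WB and i5 in IF, and all five instructions finish after 9 cycles. The individual instructions are not faster; the gain comes from the overlap.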

4.3 Hazards

Hazards arise when a pipelined processor is designed: a hazard occurs when one instruction depends on the result of another instruction that has not yet completed. There are several types of hazards; the most important are described below.

4.3.1 Data Hazards

The most common data hazards are read-after-write (RAW) hazards, which occur when the result of one instruction is a source operand of a following instruction. For example, suppose the first instruction is add x2,x1,x3 and the second instruction has the operands x5,x2,x0, so it reads x2. When the first instruction is in the Memory stage, the second instruction is in the Execute stage, but the updated value of x2 has not yet been written to the register file. Without hazard handling, the second instruction would therefore operate on a stale value of x2 and produce a wrong result.

Figure 4.1: Data Hazards

Data hazards can be resolved by inserting nops or by forwarding. Inserting nops is undesirable because it wastes cycles, so forwarding is used instead: the result is sent directly from the Memory or Writeback stage to the Execute stage of the dependent instruction. A RAW hazard is possible whenever one instruction depends on the result of another instruction that has not yet been written to the register file. Forwarding can resolve the hazard if the result is computed early enough; otherwise the pipeline must be stalled until the result is available.
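The forwarding decision for one ALU operand can be sketched as follows. This is an illustration, not the thesis design: the function and signal names are hypothetical, and the sketch assumes results are forwarded only from the Memory and Writeback stages, with the more recent result (in Memory) taking priority.

```python
def forward_select(rs_e: int, rd_m: int, reg_write_m: bool,
                   rd_w: int, reg_write_w: bool) -> str:
    """Choose the source of one ALU operand in the Execute stage.

    Returns "MEM" to forward the ALU result of the instruction in Memory,
    "WB" to forward the result being written back, or "RF" to use the value
    read from the register file in Decode.
    """
    if reg_write_m and rs_e != 0 and rs_e == rd_m:
        return "MEM"  # the most recent result has priority
    if reg_write_w and rs_e != 0 and rs_e == rd_w:
        return "WB"
    return "RF"       # no hazard (x0 is never forwarded)
```

In the example above, add x2,x1,x3 is in Memory with rd = 2 while the dependent instruction reads x2 in Execute, so `forward_select(2, 2, True, 0, False)` selects "MEM".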

4.3.2 Control Hazards

Control hazards occur when the decision about which instruction to fetch next has not been made by the time the next fetch takes place. This happens with branch instructions, because whether the branch is taken is determined only in the Execute stage.

Figure 4.2: forwarding

Control hazards can be handled by stalling the pipeline until the branch decision is made, or by predicting which instruction should be fetched next and flushing the pipeline if the prediction proves wrong.
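Stalls and flushes can be generated together by a small hazard unit. The Python sketch below is an assumption-laden illustration: it supposes branches are resolved in Execute, the pipeline keeps fetching sequentially (predict not taken), and a load followed immediately by an instruction that uses its result needs a one-cycle stall, because the loaded data exists only after the Memory stage. The function and signal names are hypothetical.

```python
def hazard_unit(rs1_d: int, rs2_d: int, rd_e: int,
                load_in_e: bool, branch_taken_e: bool) -> dict:
    """Stall and flush controls for a 5-stage pipeline (predict not taken)."""
    # Load-use hazard: the load in Execute writes a register that the
    # instruction in Decode reads, and forwarding alone cannot supply it in time.
    load_use = load_in_e and rd_e != 0 and rd_e in (rs1_d, rs2_d)
    return {
        "stall_f": load_use,                    # hold the PC
        "stall_d": load_use,                    # hold the instruction in Decode
        "flush_d": branch_taken_e,              # clear Decode register: drop the instruction in Fetch
        "flush_e": load_use or branch_taken_e,  # clear Execute register: a bubble enters Execute
    }
```

For a taken branch, both flushes fire and the two wrongly fetched instructions are discarded, which matches the two-cycle flush penalty; a load-use hazard costs one stall cycle.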

Figure 4.3: Branch is taken

4.4 Pipeline Control

The pipelined processor uses the same control signals as the single-cycle processor, because it includes the same control unit. The control unit in the Decode stage inspects the op, funct3, and funct7 fields of the instruction to produce the control signals. These control signals must then be pipelined along with the data through all subsequent stages. The pipelined processor with its pipeline control is shown in Figure 4.4.

Figure 4.4: Pipelined RISC-V Processor

4.5 Performance Evaluation

Ideally, a pipelined processor has a CPI of 1, since a new instruction is issued (fetched) every cycle. However, as discussed above, each stall or flush costs one or two cycles, so the actual CPI is somewhat higher than 1; how much higher depends on how often the program being executed causes stalls and flushes.
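The effect of stalls and flushes on CPI can be estimated by adding their cycles to the ideal one cycle per instruction. The Python sketch below assumes a one-cycle load-use stall and a two-cycle penalty per taken branch, ignores the few cycles needed to fill the pipeline, and uses hypothetical instruction counts.

```python
def effective_cpi(n_instructions: int, load_use_stalls: int,
                  taken_branches: int, branch_penalty: int = 2) -> float:
    """CPI after adding stall and flush cycles to one cycle per instruction."""
    extra_cycles = load_use_stalls * 1 + taken_branches * branch_penalty
    return (n_instructions + extra_cycles) / n_instructions


# Hypothetical program: 1000 instructions, 40 load-use stalls, 50 taken branches
cpi = effective_cpi(1000, load_use_stalls=40, taken_branches=50)
```

With these made-up counts, the 140 extra cycles give a CPI of 1.14.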
Chapter 5

Floating Point Unit (FPU)

5.1 Introduction to IEEE 754 Standard

In the 1970s, as integrated-circuit microprocessors came into wide use, efficient floating-point arithmetic circuits supporting the new CPUs began to emerge. As usual, several competing proposals for encoding floating-point numbers appeared, and the resulting incompatibilities between manufacturers hindered interoperability. The IEEE defined a standard to provide a uniform way of encoding floating-point numbers. The IEEE 754 Standard for Floating-Point Arithmetic, first published in 1985, defined several formats and special values for floating-point numbers; it has since been revised in 2008 and 2019. IEEE 754 is the most commonly used method for encoding floating-point numbers, and it includes several formats, such as half precision, single precision, and double precision. Since our processor is 32-bit, we use the 32-bit single-precision format.
Single precision, also known as binary32, uses 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits, as shown in Figure 5.1.

Figure 5.1: IEEE 32-bit (single-precision) floating-point number

A sign bit of 0 indicates a positive number, and a sign bit of 1 a negative number. The 8-bit exponent field stores the exponent with a bias of 127 added, so the stored value equals the actual exponent plus 127; for example, an actual exponent of 2 is stored as 129. The bias allows both negative and positive exponents to be represented as an unsigned 8-bit number in the range 0 to 255, without a separate sign bit for the exponent. The stored values 0 and 255 are reserved for special values, so normalized numbers have actual exponents from -126 to +127.
The 23 mantissa bits represent the significant bits of the number. IEEE 754 uses the concept of an implied 1: any nonzero number in binary scientific notation (SN) can be written so that the digit to the left of the binary point is 1. Unlike some other formats, IEEE 754 does not store this leading 1 in the 23-bit field, which gives the encoded number one extra bit of precision. The implied 1 is therefore omitted when a number is encoded as a binary32 word, and it is added back to the mantissa when the number is decoded, recovering the original value. A number is said to be normalized if it is represented in binary SN with a single 1 to the left of the binary point.
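The encoding rules above can be checked with a short Python sketch that splits a binary32 word into its fields and rebuilds the value of a normalized number, adding back the implied 1 and removing the bias. The function names are hypothetical; Python's standard `struct` module is used only to obtain the raw bit pattern of a float.

```python
import struct


def float_to_bits(x: float) -> int:
    """Raw 32-bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def unpack_fields(bits: int):
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF   # biased exponent
    fraction = bits & 0x7FFFFF       # 23 stored mantissa bits
    return sign, exponent, fraction


def decode_normalized(bits: int) -> float:
    """Value of a normalized binary32 word (stored exponent 1..254)."""
    sign, exponent, fraction = unpack_fields(bits)
    significand = 1 + fraction / 2**23   # add back the implied 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)
```

For example, -6.5 = -1.625 × 2^2, so its sign bit is 1, its stored exponent is 2 + 127 = 129, and its fraction is 0.625 × 2^23 = 0x500000, giving the word 0xC0D00000.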

5.1.1 IEEE 754 Special Values

IEEE 754 also defines a set of special values, such as zero, infinity, and NaN (not a number). IEEE 754 provides distinct codes for +0 and -0: zero is represented by exponent and mantissa fields that both contain all 0s, and the sign bit determines whether the zero is positive or negative. IEEE 754 likewise provides distinct codes for +infinity and -infinity. Infinity is represented by setting all exponent bits to 1 and all mantissa bits to 0; if the exponent bits are all 1 and any mantissa bit is nonzero, the value is NaN.

Figure 5.2: IEEE 754 special values
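The special-value rules reduce to a few checks on the exponent and mantissa fields. The Python sketch below (with a hypothetical function name) classifies a binary32 word. It also reports the remaining case, an all-zero exponent with a nonzero mantissa, which the text does not discuss; IEEE 754 uses it for subnormal numbers.

```python
def classify(bits: int) -> str:
    """Classify a binary32 word by its exponent and mantissa fields."""
    sign = "-" if (bits >> 31) & 0x1 else "+"
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                 # all exponent bits are 1
        return "NaN" if fraction else sign + "infinity"
    if exponent == 0:                    # all exponent bits are 0
        return sign + "zero" if fraction == 0 else sign + "subnormal"
    return sign + "normal"
```

For example, 0x80000000 is -0, 0x7F800000 is +infinity, and 0x7FC00000 is a NaN.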



5.2 Floating Point Arithmetic

Now that we have seen how a number is represented in the single-precision format, we turn to how operations are performed on two binary32 numbers.

5.2.1 Addition or Subtraction of IEEE 754 Numbers

Addition and subtraction of binary32 numbers follow steps very similar to those used for base-10 numbers in scientific notation: in both cases, the two numbers must have the same exponent before their mantissas can be added. IEEE 754 addition and subtraction, however, involve a few more steps than the base-10 SN operation. The first step, called unpacking, splits each input number into three fields: the sign bit, the exponent bits, and the mantissa bits. During unpacking, the bias is removed from the exponent and the implied 1 is added to the mantissa field, increasing its size by one bit.
The second step modifies the inputs so that they have the same exponent. The two exponents are compared, and the mantissa of the number with the smaller exponent is shifted right by the difference between the two exponents.

5.3 FPU Architecture

FPU Architecture

5.4 Integration with RISC-V ISA

Integration with RISC-V ISA

5.5 Implementation Details

Implementation Details
Chapter 6

Integrating FPU into Pipelined RISC-V Processor

6.1 Modifying the Pipeline Stages

Modifying the Pipeline Stages

6.2 Pipeline Control for FPU

Pipeline Control for FPU

6.3 Verification and Testing

Verification and Testing

6.4 Performance Analysis

Performance Analysis

Chapter 7

Implementation on FPGA

7.1 Synthesis of the Design

Synthesis of the Design

7.2 Pmod

Pmod

7.3 Constraint File

Constraint File
