
UNIT 1

SYSTEM-ON-CHIP (SOC) ARCHITECTURE

 An SoC integrates multiple components (processors, memories, interconnects) on a single chip, optimized for specific applications.

 Example: Emotion Engine in the Sony PlayStation 2:

o Two primary functions: behaviour simulation and geometry translation.

o Key components:

 A main processor: RISC architecture.

 Two vector processing units (VPUs):

 VPU0 and VPU1 contain four SIMD-style parallel processors.

COMPONENTS OF SOC

 The system architecture defines the system-level building blocks, such as processors and memories, and the interconnection between them.
 The processor architecture determines the processor's instruction set, the associated programming model, and its detailed implementation.
 The implementation of a processor is also known as the microarchitecture.
 Some of the basic elements of an SOC system include a number of heterogeneous processors interconnected to one or more memory elements. Frequently, the SOC also has analog circuitry for managing sensor data and analog-to-digital conversion, or to support wireless data transmission.

 If all the elements cannot be contained on a single chip, the implementation is probably best referred to as a system on a board, but often is still called a SOC. What distinguishes a system on a board (or chip) from the conventional general-purpose computer plus memory on a board is the specific nature of the design target.

 Core processor: a general-purpose ARM Cortex-A9 processor
 Media processor: Mali-400 MP graphics processor and Mali-VE engine
 Analog circuitry: ADCs for sensor data (analog-to-digital conversion)
 Interconnections: AXI (Advanced eXtensible Interface) interconnects
 System components: interfacing with peripherals like the camera, screen, and communication unit

PROCESSOR ARCHITECTURES: Functional view


Parallelism in Processors

Modern processors aim to execute multiple instructions simultaneously using various techniques:

1. Sequential vs. Concurrent Execution:

o Sequential processors execute one instruction at a time.

o Concurrent execution involves multiple instructions processed at the same time, often transparent to the programmer.

2. Techniques for Concurrent Execution:

o Pipelining: Divides instruction execution into stages, with multiple instructions in different stages at the same time.

o Multiple Execution Units: Allows processors to handle different tasks concurrently.

o Multiple Cores: Provides parallelism by running instructions on separate cores.

Levels of Parallelism

Parallelism can be exploited at multiple levels:

1. Instruction-Level Parallelism (ILP):

o Executes multiple instructions in parallel within a single program.

o Achieved using:

 Hardware: Pipelining, superscalar execution.

 Compiler Techniques: Reordering instructions to optimize performance.

 Operating Systems: Efficiently scheduling tasks.

2. Loop-Level Parallelism:

o Parallelizes consecutive iterations of a loop if no dependencies exist between iterations.

3. Procedure-Level Parallelism:

o Executes different procedures or functions in parallel, depending on the algorithm.

4. Program-Level Parallelism:

o Runs multiple independent programs in parallel.


PROCESSOR ARCHITECTURES: Architectural view

1. SIMPLE SEQUENTIAL PROCESSOR:


 These processors process instructions sequentially from the instruction stream.
 The next instruction is not processed until all execution for the current instruction is complete and its results have been committed.

 1. fetching the instruction into the instruction register (IF)
2. decoding the opcode of the instruction (ID)
3. generating the address in memory of any data item residing there (AG)
4. fetching data operands into executable registers (DF)
5. executing the specified operation (EX)
6. writing back the result to the register file (WB)
 During execution, a sequential processor executes one or more operations per clock cycle from the instruction stream.

 An instruction is a container that represents the smallest execution packet managed explicitly by the processor.
 One or more operations are contained within an instruction.
 Executing each instruction sequentially has significant performance drawbacks: a considerable amount of time is spent on overhead and not on actual execution.

2. PIPELINED PROCESSOR:
 Pipelining in a processor is a technique that allows the CPU to work on multiple instructions at the same time by dividing the processing of instructions into different stages.
 Each stage of the pipeline handles a different part of the instruction, such as fetching the instruction, decoding it, executing it, and so on.
 Phases of Instruction Processing:
1. Instruction Fetch (IF): The CPU fetches the instruction from memory.
2. Instruction Decode (ID): The CPU decodes the fetched instruction to understand
what needs to be done.
3. Address Generation (AG): The CPU calculates the memory addresses required
for the instruction.
4. Data Fetch (DF): The CPU accesses the operands needed for the instruction.
5. Execution (EX): The CPU performs the operation defined by the instruction.
6. Write Back (WB): The CPU writes the result of the execution back to the register
or memory.
 Overlapping of phases increases the efficiency of the CPU because it allows the processor to work on multiple instructions at different stages rather than waiting for one instruction to complete before starting the next.
 Static vs. Dynamic Pipelining:
1. Static Pipeline: The processor must go through every stage of the pipeline for
each instruction, regardless of whether all stages are needed. This is simpler but
less flexible.
2. Dynamic Pipeline: The processor can skip unnecessary stages depending on the
instruction's needs. Dynamic pipelines can even execute instructions out of
order, but they must ensure the program's final result is as if the instructions
were executed in the correct order.
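
The benefit of overlapping these phases can be shown with a small timing calculation. The sketch below is a simplified model, assuming a perfect six-stage pipeline with no stalls (real pipelines rarely achieve this); it compares the cycle count of purely sequential execution with ideal pipelined execution.

    #include <stdio.h>

    int main(void) {
        const int stages = 6;          /* IF, ID, AG, DF, EX, WB            */
        const int instructions = 100;  /* length of the instruction stream  */

        /* Sequential execution: each instruction passes through all stages
           before the next one starts. */
        int sequential_cycles = instructions * stages;

        /* Ideal pipelined execution: 'stages' cycles to fill the pipeline,
           then one instruction completes per cycle (no hazards assumed). */
        int pipelined_cycles = stages + (instructions - 1);

        printf("Sequential: %d cycles\n", sequential_cycles);
        printf("Pipelined : %d cycles\n", pipelined_cycles);
        printf("Speedup   : %.2fx\n",
               (double)sequential_cycles / pipelined_cycles);
        return 0;
    }

For 100 instructions this gives 600 versus 105 cycles, a speedup approaching the pipeline depth; the data hazards, resource contention, and branch effects discussed later reduce this ideal figure.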

3. ILP – INSTRUCTION LEVEL PARALLELISM:


 It is a technique where multiple instructions are executed simultaneously within a CPU to increase performance.
1. SUPERSCALAR PROCESSOR:
 The processor schedules the execution of instructions with respect to time; this is called scheduling.
 2 types:
 Dynamic scheduling
 Static scheduling
 Dynamic scheduling is a method in which hardware determines which instruction to execute.
 Static scheduling is a method in which the compiler determines the order of execution.
 These processors have multiple functional units that allow them to execute several instructions per cycle.
 Limitations: The complexity of dynamic scheduling limits the number of instructions that can be processed per cycle, often topping out at four to six due to the complexity of ensuring correct instruction dependencies.

 Pre-decode block: recognizes and partially decodes instructions, identifying those that should be kept together.
 Rename buffer: the processor allocates these buffers to the destination registers of instructions (removing false dependencies).
 Dispatch block: determines which instructions the CPU should issue to the functional units next.
 Reorder buffer: allows instructions to complete (retire) in program order even when they execute out of order.

2. VLIW PROCESSOR:
 Unlike superscalar processors, VLIW processors rely on the compiler to analyze and schedule instructions.
 VLIW processors are less complex than superscalar processors because they do not need dynamic scheduling hardware. The complexity is offloaded to the compiler.
 VLIW processors can potentially offer high performance, especially for applications where the parallelism can be statically determined by the compiler.
 Limitations:
i. delayed results from operations whose latency differs from the assumed latency scheduled by the compiler, and
ii. interruptions from exceptions or interrupts, which change the execution path to a completely different schedule.
4. SIMD ARCHITECTURE:
 SIMD architectures are designed to handle operations on regular data structures like vectors and matrices efficiently.
 SIMD processors can execute the same operation on multiple data points simultaneously.
1. ARRAY PROCESSORS:
 Array processors consist of multiple interconnected processor elements (PEs), each with its own local memory.
 A control processor broadcasts instructions to all PEs.
 Each PE processes its portion of the data, with data being carefully distributed to minimize complex routing between PEs.
 Ideal for tasks with regular data structures and uniform computations, such as solving matrix equations or other tasks involving large datasets.
 Example: The ClearSpeed processor, designed for signal processing, is an example of an array processor.

2. VECTOR PROCESSOR:
 A vector processor resembles a traditional processor but includes special function units and registers designed to handle vectors (sequences of data) as single entities.
 Vector processors have deeply pipelined function units, enabling high throughput despite potentially higher latencies.
 When a vector is longer than the processor's registers, it is processed in segments.
 Vector processors often support chaining, where the result of one operation can be immediately used by the next, allowing for efficient sequential computations with minimal latency.
 Suitable for applications requiring high throughput on vectorized data, such as scientific computations and tasks involving large datasets.
 Example: IBM mainframes offer vector instructions for scientific computing, highlighting their use in high-performance environments.
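
To make the SIMD idea concrete, the fragment below contrasts a scalar loop with a "vectorized" view of the same computation. It is only an illustration of the programming model; the four-wide grouping is an arbitrary assumption, and a real vector or array processor would execute each group of element operations in parallel hardware lanes rather than in an inner software loop.

    #include <stdio.h>

    #define N     16
    #define WIDTH 4   /* assumed vector width: 4 elements per vector operation */

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* Scalar view: one add per instruction, N instructions of work. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* Vector view: one vector add (VADD) per group of WIDTH elements,
           so only N/WIDTH vector instructions are issued; the inner loop
           stands in for the parallel lanes of the vector unit. */
        for (int i = 0; i < N; i += WIDTH)
            for (int lane = 0; lane < WIDTH; lane++)
                c[i + lane] = a[i + lane] + b[i + lane];

        printf("c[5] = %.1f\n", c[5]);
        return 0;
    }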

MEMORY STRUCTURES IN SOC

1. Simple Memory Design:

o Programs are stored in on-chip read-only memory (ROM).

o Data is stored in on-chip random-access memory (RAM).

2. Complex Memory Design:

o For advanced applications (e.g., operating systems), memory systems use:

 Off-chip memory (e.g., DRAM).

 A memory management unit (MMU).

 Cache hierarchies to manage memory access efficiently.

Why Include Memory on the Processor Die?

Advantages:

1. Faster access time and better bandwidth.

2. Reduces reliance on cache memory.

3. Improves performance for memory-intensive tasks.

Challenges:

1. Technology Difference:

o DRAM memory technology differs from processor technology, reducing memory


density if integrated.

2. Limited Size:

o On-die memory is small and unsuitable for applications requiring large memory.
MEMORY ADDRESSING

The user’s view of memory depends on the addressing features:

 Application programmers work with virtual memory to:

o Use more memory than physically available.

o Isolate memory regions for security (important for multitasking).

 Operating system programmers manage the underlying mechanisms to enable this.

Why This is Important: Virtual memory ensures efficient memory usage and protection when multiple applications run simultaneously. However, translating between different types of addresses is essential to make this work seamlessly.

Address Translation: The Process

Address translation involves converting the addresses used in programs (virtual addresses) into physical addresses in memory. This is done in three steps:

1. Creating the Virtual Address

 The application generates a process address, which is used to compute a virtual address:
Virtual Address = Offset + Base + Index

o Offset: Specified in the program instruction.

o Base and Index: Stored in registers.

Why This is Important: The virtual address allows the program to reference memory without worrying about the actual physical location. This simplifies application programming and ensures portability.
2. Creating the System Address

 Since multiple processes share memory, each process's virtual address must be mapped to a unique system address: System Address = Virtual Address + (Process Base)

o A segment table helps coordinate and relocate process addresses.

o The system address must be within bounds defined by the segment table.

Why This is Important: Segment tables manage memory spaces for each process, ensuring no overlap between processes and maintaining system stability and security.

3. Virtual vs. Real Address

 If the memory space exceeds physical memory (common in large SoC applications), virtual memory comes into play:

o The program’s memory is divided into pages.

o A page table maps virtual pages to physical pages.

o Only the most recently used pages are kept in memory; others remain on disk.

The physical address is calculated as:

 Upper bits: From the page table (to locate the page in memory).

 Lower bits: From the virtual address (relative position within the page).

Why This is Important: This system allows applications to use more memory than is physically available, enabling the efficient use of limited hardware resources.
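
A minimal sketch of the page-based part of this translation is shown below. The page size, the page-table contents, and the field widths are illustrative assumptions; the point is only that the upper bits of the virtual address index a page table while the lower bits pass through unchanged as the offset within the page.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12u                      /* assumed 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* Toy page table: virtual page number -> physical page (frame) number. */
    static const uint32_t page_table[8] = { 5, 9, 2, 7, 1, 0, 3, 6 };

    static uint32_t translate(uint32_t virtual_addr) {
        uint32_t vpn    = virtual_addr >> PAGE_BITS;        /* upper bits        */
        uint32_t offset = virtual_addr & (PAGE_SIZE - 1u);  /* lower bits        */
        uint32_t frame  = page_table[vpn];                  /* page-table lookup */
        return (frame << PAGE_BITS) | offset;               /* physical address  */
    }

    int main(void) {
        uint32_t va = (3u << PAGE_BITS) + 0x2A4u;  /* virtual page 3, offset 0x2A4 */
        printf("virtual 0x%05X -> physical 0x%05X\n", va, translate(va));
        return 0;
    }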

Translation Lookaside Buffer (TLB)

Address translation involves multiple tables (e.g., segment and page tables), which can be slow. To speed this up:

 A Translation Lookaside Buffer (TLB) stores recently used address translations.

 When an address is requested:

1. The TLB checks if the virtual address is already translated.

2. If found, the TLB provides the physical address directly.

3. If not, a not-in-TLB event occurs, and the system performs a full translation.

Why This is Important: The TLB reduces the overhead of frequent address translation, significantly improving performance.
SYSTEM-LEVEL INTERCONNECTION

Why Interconnection Matters

1. Raising the Design Level:

o SoC technology moves from designing individual circuits to assembling a system from predesigned modules (IP blocks). This modular approach simplifies and speeds up design.

2. Efficient Communication:

o A good interconnection method ensures that modules communicate effectively and operate in parallel without bottlenecks, which is crucial for high performance.

3. Interoperability and Reuse:

o Well-defined communication protocols encourage interoperability (modules from different vendors working together) and design reuse, making SoC development cost-effective and scalable.

Two Main Interconnection Approaches

1. Bus-Based Approach

In this approach, modules communicate through shared buses:

 How It Works:

o Modules behave according to standard bus protocols like ARM’s AMBA or IBM’s CoreConnect.

o Communication occurs by sharing a physical bus with address, data, and control signals.

 Features:

o Hierarchical Design: Multiple buses are arranged hierarchically:

 The bus closest to the CPU has high bandwidth to handle frequent, high-speed data exchanges.

 Peripheral buses farther from the CPU have lower bandwidth, saving cost and complexity.

 Why It’s Used:

o Simple and widely adopted for SoC designs.

o Effective for systems with a moderate number of IP blocks.

Limitation: As the number of connected modules grows, shared buses can become a bottleneck, limiting scalability.
2. Network-on-Chip (NoC) Approach

NoC is a newer approach, inspired by computer networks, where communication between modules occurs through a network of switches.

 How It Works:

o Modules are connected through switches, forming either a dynamically switched crossbar or a statically switched mesh network.

o Communication occurs through data packets routed between switches.

 Advantages:

1. Higher Throughput:

 Unlike buses, NoC supports simultaneous communications over multiple paths, avoiding congestion.

2. Multiple Clock Domains:

 Crossbar networks use asynchronous channels, allowing modules to run at different clock speeds.

3. Reduced Delay and Noise:

 Mesh networks ensure fixed interconnection distances, minimizing wire delay and crosstalk.

 Mesh Example:

o Each module (or "node") has:

 Processing logic: For computation.

 Routing logic: To direct data packets to their destination.

Why It’s Better for Larger Systems:

 Scales more efficiently than buses as the number of IP blocks grows.

 Better suited for complex SoCs requiring high-speed communication.


DESIGN COMPLEXITY AND ITS EVOLUTION

The complexity of designing systems changes dramatically as more transistors become available per die. Let's break it down:

Early Design Complexity

 Single Processor: Initially, implementing a 32-bit pipelined processor with a small first-level cache may require about 100,000 transistors.

o Die Density: As transistor density increases, it becomes easier to fit multiple processors onto the same chip, but design complexity increases due to the need for additional functionality and coordination between components.

Expanding the Processor

 Cache: Adding multiple levels of cache (L1, L2, etc.) helps speed up data access, but each cache level comes with its own design challenges, especially as cache size grows.

 More Advanced Processors: Implementing more powerful processors, such as superscalar or Very Long Instruction Word (VLIW) processors, introduces complexity by allowing multiple instructions per cycle and speeding up execution units (especially for floating-point operations). This adds more transistors and increases complexity.

 Multiple Processors: Implementing multiple processors with their own multilevel caches increases design complexity significantly, as synchronization and memory access consistency across processors must be managed. As a result, the system architecture becomes much more complex, with millions of transistors required to coordinate the processors and memory.

Managing Complexity: Reuse and Specialization

To manage this complexity, design reuse becomes crucial. Instead of designing one advanced processor for all tasks, designers can reuse simpler processors and specialize them for specific tasks. This approach allows designers to:

 Combine specialized processors for different parts of the application, improving performance by matching processors to the task at hand.

 Use interconnection mechanisms (like buses or switching mechanisms) to link these processors and memory.
UNIT-2

PROCESSOR SELECTION IN SOC DESIGN

 Processor Selection is Critical: The processor in an SoC design is a fundamental element since it runs the system's software. Often, the initial task is to select a processor that meets the functional and performance requirements of the application.

 General-Purpose Processors (GPPs): In many cases, a general-purpose processor (GPP) is selected for the system. This processor serves as the core processor responsible for executing most of the system's tasks.

Key Considerations:

1. System Software: The selected processor must be able to run the specific system software required by the application.

2. Compute-Limited Applications: For compute-intensive applications, the focus is on ensuring that the processor meets performance requirements, often including real-time constraints. In such cases, real-time processing ability becomes a primary design consideration early on.

3. Memory and Interconnects: During the initial design phase, the memory and interconnect components are simplified to "idealized components." These are treated as basic delay elements with conservative performance estimates, allowing for a simplified view of system behaviour in the early stages.

The goal is to choose a processor and set its parameters to build an initial design that meets the system's functional and performance requirements.

Example: Soft Processors

 A soft core is a type of processor design stored in bitstream format. These processors can be programmed into a field-programmable gate array (FPGA). Soft processors are commonly used in SoC designs because they offer flexibility and customization.

Advantages of Soft Processors:

1. Cost Reduction: By integrating the processor design into an SoC, the need for separate chips is reduced, lowering system-level costs.

2. Design Reuse: Soft processors allow for reuse of existing processor designs, reducing time and effort for new projects, especially when variations of a processor are needed.

3. Customization: They allow the creation of processors that are customized for specific microcontroller/peripheral combinations.

4. Futureproofing: Soft processors can be used to avoid reliance on specific microcontroller variants that might be discontinued, providing a level of protection against market changes.

Examples of Soft Processors:

 Nios II (Altera): A soft processor developed for use on Altera FPGAs and ASICs.

 MicroBlaze (Xilinx): A soft processor designed for Xilinx FPGAs and ASICs.

 OpenRISC: A free, open-source soft processor.

 Leon: An open-source soft processor that implements the SPARC v8 instruction set architecture (ISA).
PROCESSOR CORE SELECTION

Processor Core Selection involves choosing the right type of processor core based on system requirements, such as performance, area, power, and other factors. It's a critical step in designing a processor to meet the specific needs of a particular application.

Here's a simplified breakdown of the core selection process:

Example 1: General Core Path

1. Initial Design: Assume you start with a processor that has a performance of 1 and uses 100K "rbe" (register bit equivalent) of area.

2. Doubling Performance: If you want to double the performance (i.e., reduce execution time by half), it requires increasing the area to 400K rbe, and power increases by a factor of 8 (because each rbe now uses more power).

3. Memory System Impact: As performance increases, cache misses also increase, which negatively affects overall system performance. To counter this, you need to increase the cache size.

4. Cache Size Increase: To reduce cache misses, you need to double the cache size. If the initial cache size was 100K rbe, the new system would have 600K rbe and use significantly more power.

5. Is It Worth It?: The decision depends on whether there's enough available area and if power isn't a major concern. If the increased performance provides critical features (e.g., better security or I/O capabilities), it might be worth the trade-off.
Example 2: Compute Core Path

1. Parallelizable Application: Suppose the application can be split into smaller tasks (parallelized) and two options are available for increasing performance:

o Option 1: A 10-stage pipelined vector processor with 300K rbe.

o Option 2: Multiple simpler processors, with a single processor using 100K rbe.

2. Increasing Performance: You need to increase the performance to 1.5. There are two main options:

o Option 1: Increase the number of vector pipelines, which doubles the area and power but keeps the clock rate unchanged.

o Option 2: Use an array of simpler processors. To meet the target, you need at least four processors, adding more interconnect and memory-sharing circuitry, resulting in a larger area.

3. Choosing the Best Option: The decision depends on several factors:

o Application Partitioning: Can the work be easily divided for both options?

o Support Software: Is there software (compilers, OS) that works better for one approach?

o Fault Tolerance: Can the multiprocessor approach help with fault tolerance (i.e., reliability)?

o Integration: Can the multiprocessor approach be integrated with the rest of the system?

o Design Effort: How difficult is it to implement each approach?


PROCESSOR ARCHITECTURE OVERVIEW

 The architecture of a processor consists primarily of its instruction set, which defines the set of operations the processor can perform.
 However, the actual implementation of the processor (microarchitecture) goes beyond the instruction set, involving trade-offs between area, time, and power to meet specific user requirements.
 Instruction Set Basics
o Most processors use a register set to hold operands and addresses.
o Program Status Word (PSW): This includes control status information, like condition codes (CCs) that reflect the results of operations.
o Instruction Set Architectures:
 Load/Store (L/S) Architecture:
 Used in RISC processors.
 Requires operands to be in registers before execution.
 Simplifies instruction decode and execution timing.
 Register/Memory (R/M) Architecture:
 Used in processors like Intel's x86 series.
 Allows operations between registers and memory directly.
 More complex to decode but provides compact code with fewer instructions.
 Branches
o Manage control flow (e.g., jumps, calls, returns). Conditional branches (BC) depend on the condition codes (CC) set by ALU instructions,
o for example, specifying whether the instruction has generated
 1. a positive result
 2. a negative result
 3. a zero result
 4. an overflow/underflow.
 Interrupts and Exceptions
o User Requested vs. Coerced: User errors (e.g., divide by zero) vs. external events (e.g., device failure).
o Maskable vs. Non-maskable: Can be ignored or not.
o Terminate vs. Resume: May stop processing or allow continuation.
o Asynchronous vs. Synchronous: Occur independently or in sync with the processor's clock.
o Between vs. Within Instructions: Can be recognized either between or during instruction execution.

BASIC CONCEPTS IN PROCESSOR MICROARCHITECTURE

 Modern processors utilize an instruction execution pipeline design to enhance performance.
 This design allows multiple instructions to be processed simultaneously.
 Every processor has a memory system, an execution unit (data paths), and an instruction unit.

 The faster the cache and memory, the smaller the number of cycles required for fetching instructions and data.
 The control of the cache and execution unit is done by the instruction unit.
 The pipeline mechanism or control has many possibilities. Potentially, it can execute one or more instructions each cycle.
 Pipeline performance is primarily limited by delays or breaks, which can arise from several factors:
o Data Conflicts (Data Hazards):
 Occur when a current instruction requires a source operand that is the result of a preceding instruction that hasn't completed yet.
 Solution: Extensive buffering of operands can reduce this conflict.
o Resource Contention:
 Happens when multiple instructions compete for the same resource.
 Solution: Adding more resources and using techniques like out-of-order execution can minimize contention.
o Run-On Delays (In-Order Execution Only):
 Occur when instructions must complete in the exact order they appear in the program. Any delay in one instruction will delay subsequent instructions.
 This is specific to in-order execution pipelines where instructions cannot be re-ordered.
o Branches:
 The next instruction to be executed depends on the outcome of a branch (e.g., if-else conditions).
 Solution: Techniques like branch prediction, branch tables, and branch target buffers help reduce delays caused by branches by predicting the outcome of the branch and pre-fetching the target instruction.
BASIC ELEMENTS IN INSTRUCTION HANDLING

 Instruction handling in a processor involves several key components that work together to ensure the proper execution of instructions in the correct order.
 Instruction Register: Holds the current instruction being executed.
 Instruction Buffer: Pre-fetches instructions into registers, allowing them to be quickly decoded and executed. This helps to keep the pipeline full and reduces delays.
 Instruction Decoder:
o Controls various components like the cache, Arithmetic Logic Unit (ALU), and registers.
o In pipelined processors, it helps in sequencing instructions, often managed by hardware.
 Interlock Unit: Ensures that the concurrent execution of multiple instructions produces the same result as if they were executed serially.

Instruction Decoder and Interlocks

 The instruction decoder plays a critical role in managing the pipeline and ensuring correct execution.
 Scheduling the Current Instruction:
o The decoder might delay the current instruction if there's a data dependency or if an exception occurs.
 Scheduling Subsequent Instructions:
o Later instructions may need to be delayed to ensure that instructions complete in the correct order.
 Branch Prediction:
o The decoder also selects or predicts the path of branch instructions, determining which instruction to execute next based on the outcome of conditional branches.

Data Interlocks

 Data interlocks are mechanisms within the instruction decoder that manage dependencies between instructions. They ensure that an instruction does not use a result from a previous instruction until that result is available.
 When an instruction is decoded, its source registers are compared with the destination registers of previously issued but uncompleted instructions.
 If the execution of the current instruction takes more cycles than specified by the timing template, subsequent instructions may need to be delayed to maintain the correct execution order.

Store Interlocks

 Store interlocks manage dependencies related to storage addresses.

 These interlocks compare the storage address with any pending store operations to detect dependencies.
 The interlock ensures that the read is delayed until the write is complete.

BYPASSING (FORWARDING)

Bypassing, or forwarding, is a technique used in pipelined processors to improve performance by passing the result of a computation directly from one stage to another, without needing to store it in a register first.

 How it works:

o Normally, the result of a computation (such as an addition in the ALU) would be written to a register, and then that value would be read in a later stage of the pipeline.

o With bypassing, the result is directly routed from the ALU to the next stage of the pipeline (where it's needed) instead of waiting for it to be written to a register first. This reduces the time delay in getting the result to where it's needed.

 Example: If an instruction in a pipeline needs the result of an operation that is still being computed by a previous instruction, bypassing sends the result directly to the next instruction, so it doesn't have to wait for the result to be stored in a register.
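
The condition under which a bypass path is used can also be written down explicitly. The sketch below assumes a simple two-instruction window (an ADD followed by a SUB that reads the ADD's destination register); the structure and field names are made up for illustration, not taken from any particular pipeline.

    #include <stdbool.h>
    #include <stdio.h>

    /* Just enough pipeline state to express the forwarding condition. */
    struct instr { int dest_reg; int src1; int src2; bool writes_reg; };

    /* Forward the ALU result of the older instruction (in EX) to the younger
       instruction behind it when the younger one reads the register the older
       one is about to write; otherwise the younger one would have to stall
       until write-back. */
    static bool forward_needed(struct instr older_in_ex, struct instr younger)
    {
        return older_in_ex.writes_reg &&
               (older_in_ex.dest_reg == younger.src1 ||
                older_in_ex.dest_reg == younger.src2);
    }

    int main(void) {
        struct instr add = { .dest_reg = 1, .src1 = 2, .src2 = 3, .writes_reg = true };
        struct instr sub = { .dest_reg = 4, .src1 = 1, .src2 = 5, .writes_reg = true };
        /* ADD r1,r2,r3 followed by SUB r4,r1,r5: r1 must be bypassed. */
        printf("forwarding needed: %s\n", forward_needed(add, sub) ? "yes" : "no");
        return 0;
    }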

Execution Unit

The execution unit is responsible for performing arithmetic and logical operations in a processor. It usually includes components like the ALU (Arithmetic Logic Unit) and FPU (Floating-Point Unit).

 Basic Function:

o The execution unit performs the core operations like addition, subtraction, multiplication, division, and more complex operations such as floating-point calculations.

o In some processors, multiple execution units (for integer and floating-point operations) work in parallel to handle different types of instructions.

 Floating-Point Unit (FPU):

o The FPU is a specific type of execution unit that handles floating-point arithmetic (e.g., decimal calculations), which is more complex than integer operations.

o Floating-point operations can take longer to execute, and the area needed for these units is usually large due to the complexity of the calculations.

 Pipelining:

o In a pipelined execution unit, tasks are broken into smaller stages, and each stage can execute a part of the operation. This allows multiple operations to be processed at the same time (in parallel).

o The pipelined design allows for continuous throughput, with a new operation being completed in each cycle after the initial latency.

 Area-Time Tradeoff:

o There's a tradeoff between the area (how much space the execution unit takes up on the chip) and the execution time (how fast it performs operations). A more complex FPU may provide higher precision but require more area and time to execute operations.

BUFFERS: MINIMIZING PIPELINE DELAYS

 Buffers are essential components in processors that help manage the timing of instruction and data handling, reducing delays and improving overall performance.
 Buffers are temporary storage areas that hold data or instructions while they are waiting to be processed or used.
 Latency Tolerance: Buffers help the processor tolerate delays; even if there are delays in one part of the system, buffers can hold data until it is needed, reducing the impact of those delays.
 Minimizing Pipeline Delays: By holding data temporarily, buffers prevent the pipeline from stalling due to delays in data retrieval or processing.

 Types of Buffer Design

o Mean Request Rate Buffers:
 Designed based on the average rate at which data requests are expected.
 The size of the buffer is chosen to match the average request rate. This design aims to balance between buffer size and the probability of overflow.
o Maximum Request Rate Buffers:
 Designed for situations where data or instruction requests are critical and can dominate performance, such as in high-speed instruction fetches or video buffers.
 The size is chosen to handle the maximum expected request rate. This helps ensure that the processor can continue to operate at its highest performance level without running out of data or instructions.
APPROACHES TO REDUCING THE COST OF BRANCHES

Branches in a processor can significantly impact performance, primarily because the processor needs to decide whether to take the branch and which instruction to fetch next. This decision introduces delays in instruction fetch and execution. Several approaches have been developed to reduce or mitigate the performance cost associated with branches. These approaches are categorized into simple and complex strategies.

Simple Approaches

1. Branch Elimination: This approach works for certain types of code sequences where the branch can be replaced with another operation, avoiding the branch altogether. This reduces the delay caused by branching.

2. Simple Branch Speedup: This method aims to reduce the time spent waiting for the branch's outcome. It speeds up the process of determining whether the branch will be taken and fetching the target instruction.

Complex Approaches

The more complex strategies involve improving the prediction of branch outcomes, which helps to fetch the right instructions ahead of time, minimizing delays due to branches.

1. Branch Target Capture (Branch Target Buffers - BTBs)

o What it is: A Branch Target Buffer (BTB) stores the target instruction of a branch that was previously executed, allowing the processor to fetch the target instruction early when the same branch is encountered again.

o How it works:

 The BTB holds the target address and the corresponding instruction for branches that were recently executed.

 When a branch is encountered again, the processor checks the BTB to see if the branch is listed. If it is, the processor can immediately fetch the target instruction without waiting for the branch to be fully resolved.

 If the branch was not previously recorded or the prediction is wrong, the processor must still fetch and resolve the branch.

o Impact: This reduces the delay for branches that are commonly encountered, especially in loops or frequently executed code paths. The effectiveness depends on the hit ratio, or the probability that a branch is found in the BTB. A higher hit ratio leads to better performance.
2. Branch Prediction

o What it is: Branch prediction involves predicting the outcome of a branch instruction before it is resolved, so that the processor can continue executing instructions without waiting for the branch decision.

o There are several strategies for branch prediction, including:

 Fixed Strategy: This is the simplest form of prediction, where the processor always predicts that the branch will be taken or not taken (based on the branch type or other fixed factors). For example, predicting that backward branches in loops will be taken, and forward branches will not.

 Static Strategy: This is more advanced than the fixed strategy, where the branch's opcode or direction (e.g., whether the branch is forward or backward) is used to predict its outcome. For example, backward branches are usually taken, and forward branches are not.

 Dynamic Strategy: This strategy predicts branch outcomes based on the history of the branch's behaviour, using past execution data to guide future predictions. This strategy adapts based on how often branches are taken or not taken.
o Types of Dynamic Prediction:

 Bimodal Prediction:

 The simplest dynamic approach, where a saturating counter is used to track the outcome of a branch (taken or not taken) based on its past history. For example, a 2-bit counter records whether a branch was recently taken or not taken.

 The counter has 4 possible states: 00 (predict not taken), 01 (predict not taken), 10 (predict taken), and 11 (predict taken). The branch is predicted as taken or not taken based on the counter's state. (A minimal code sketch of such a counter follows this list.)

 Effectiveness: Bimodal predictors can achieve prediction accuracy between 83% and 96%, depending on the program's behaviour.

 Two-Level Adaptive Prediction:

 This more sophisticated method keeps track of the history of the last few outcomes of a branch. A shift register records the recent history of the branch, and this history is used to index into a table of counters (similar to the bimodal approach) to make a prediction.

 This method can adapt based on the pattern of taken and not-taken branches. For example, if a branch has been taken twice, it may be predicted as taken in the future.

 Effectiveness: This approach can improve prediction accuracy to 95% or higher for large programs with stable branch patterns.
 Combined Methods:

 A more advanced technique that combines different prediction strategies (such as bimodal and adaptive) to improve accuracy. A vote table is used to combine predictions from multiple predictors.

 Effectiveness: The combined methods can further improve prediction accuracy by making use of different types of information.
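
A minimal sketch of the 2-bit (bimodal) saturating counter referenced above is given below. The table size and the branch trace are invented values for illustration; a real predictor would index the counter table with low-order bits of the branch address.

    #include <stdbool.h>
    #include <stdio.h>

    #define TABLE_SIZE 256   /* assumed number of 2-bit counters */

    /* Counter states: 0,1 -> predict not taken; 2,3 -> predict taken. */
    static unsigned char counters[TABLE_SIZE];

    static bool predict(unsigned int branch_pc) {
        return counters[branch_pc % TABLE_SIZE] >= 2;
    }

    static void update(unsigned int branch_pc, bool taken) {
        unsigned char *c = &counters[branch_pc % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;   /* saturate at 3 (strongly taken)     */
        if (!taken && *c > 0) (*c)--;   /* saturate at 0 (strongly not taken) */
    }

    int main(void) {
        /* A loop branch taken nine times and then falling through once. */
        bool outcomes[10] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 };
        int correct = 0;

        for (int i = 0; i < 10; i++) {
            bool p = predict(0x400);            /* same branch address each time */
            if (p == outcomes[i]) correct++;
            update(0x400, outcomes[i]);
        }
        printf("correct predictions: %d / 10\n", correct);
        return 0;
    }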

VECTOR PROCESSORS AND VECTOR INSTRUCTION EXTENSIONS

 Vector processors are specialized computing units designed to handle vector operations
efficiently.
 Vector instructions boost performance by
o reducing the number of instructions required to execute a program
o organizing data into regular sequences that can be efficiently handled by the
hardware
o representing simple loop constructs, thus removing the control overhead for loop
execution.

 Vector Processor Design


o Vector Registers (VRs):
 Vector registers store vector operands and results. Each vector register
typically contains multiple elements
o Vector Functional Units:
 Vector processors include specialized functional units for different
operations such as addition, subtraction, multiplication, division, and logical
operations.
 Vector functional units often use pipelining to improve performance. For
example, a vector add (VADD) operation may pass through multiple stages
of an adder pipeline.
o Execution and Timing:
 Once a vector operation begins, it continues at the system’s clock rate. A
vector operation processes one element per clock cycle, enabling high-
throughput data processing.
 Vector operations can be chained together, where the result of one
operation is directly used as an operand in the next.
o Memory Bandwidth:
 Vector processors require significant memory bandwidth to handle vector
loads and stores efficiently. Ideally, the system should support at least two
memory references per cycle
 Insufficient bandwidth can lead to idle times where the processor waits for
data, reducing overall performance.

VLIW Processors (Very Long Instruction Word):

1. Types of VLIW Processors:

o VLIW processors can be either statically scheduled or dynamically scheduled.

 Statically scheduled: The compiler schedules the instructions in advance, determining which ones can be executed simultaneously. The instructions are grouped into instruction packets, which are then decoded and executed at run time.

 Dynamically scheduled: The compiler may arrange the code to optimize execution, but the hardware in the processor ultimately decides which instructions to execute based on availability and dependencies during run time.

2. VLIW Machines:

o VLIW processors from early manufacturers like Multiflow and Cydrome use long instruction words (up to 200 bits) where each fragment controls a specific execution unit. This allows for executing multiple instructions in parallel, but the processor must have a large register set to support this.

o A key technology to overcome branch delays in VLIW processors is trace scheduling, where branches are predicted, and predicted paths are included in a larger basic block of code to minimize branches during execution.

3. Simultaneous Multithreading (SMT):

o SMT allows multiple threads to use the same execution hardware, with each thread having its own registers and instruction counters. For example, a two-way SMT processor with two cores can run four programs simultaneously.

4. VLIW Data Path:

o The data paths in a VLIW processor require extensive use of register ports to allow for simultaneous access to multiple execution units. This can be a bottleneck because the number of register ports required increases with the number of execution units.

Superscalar Processors:

1. Superscalar Processors Overview:

o Superscalar processors use multiple execution units (like ALUs) and multiple buses to connect registers and functional units. This allows for parallel execution of independent instructions, similar to VLIW processors, but with a key difference: independence detection is done in hardware.

o Superscalar processors must handle dynamic detection of instruction independence, which complicates the control hardware.

2. Data Dependencies:

o In out-of-order execution (common in superscalar processors), there are three types of data dependencies (a short example follows this list):

 Read-After-Write (RAW): This happens when an instruction needs to read a value that hasn't yet been written by a previous instruction.

o Also called: True Dependency or Data Dependency.

 Write-After-Read (WAR): This happens when an instruction writes to a register that was read by a previous instruction.

o If instructions execute out of order, the later instruction might overwrite the register before the earlier instruction finishes using it.

o Also called: Anti-Dependency.

 Write-After-Write (WAW): Happens when two instructions write to the same destination register, potentially causing incorrect results if executed in the wrong order.

o Also called: Output Dependency.

3. Processor Design Challenges:

o Superscalar processors face challenges in managing dependencies and detecting independent instructions in hardware, which requires more complex control mechanisms compared to simpler processors.
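
The three dependency types mentioned above are easiest to see on a short instruction sequence. The register numbers are arbitrary; the C variables simply stand in for registers so the hazards can be named in comments.

    #include <stdio.h>

    int main(void) {
        int r1, r2 = 4, r3 = 6, r4, r5 = 1;

        r1 = r2 + r3;   /* I1: writes r1                                          */
        r4 = r1 - r5;   /* I2: reads r1 -> RAW (true dependency) on I1            */
        r1 = r5 + 2;    /* I3: writes r1 -> WAR (anti-dependency) with I2's read,
                               and WAW (output dependency) with I1's write        */

        /* Out-of-order hardware must respect the RAW dependency; the WAR and WAW
           hazards can be removed by register renaming, i.e., directing I3's
           result into a different physical (rename) register. */
        printf("r4 = %d, r1 = %d\n", r4, r1);
        return 0;
    }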
UNIT-3

SOC EXTERNAL MEMORY: FLASH

Flash memory is a type of non-volatile storage, meaning it retains data even when power is lost. It's widely used in systems where large amounts of data need to be stored and retrieved but not frequently changed.

 Structure: Flash memory uses floating-gate transistors, which store electrical charge to represent data. This charge is non-volatile, meaning it stays in place even without power.

 Write Limitation: Flash memory has a limited number of write cycles, typically less than a million. After many writes, the memory can degrade, so error detection and correction are often added to improve reliability.

 Types:

o NOR Flash: More flexible but has lower density. It's good for storing code (e.g., firmware).

o NAND Flash: Offers higher density (more data storage per unit) but is less flexible. It's ideal for storing large amounts of data.

o Hybrid NOR/NAND: A combination, using NOR for flexibility and NAND for density.

 Use Cases: Flash memory is commonly used for storage in devices like USB drives, SD cards, and embedded systems. Flash can be stacked in large sizes (up to 256 GB) for high-capacity storage.

 Variants: Flash technology is evolving with alternatives like SONOS (non-volatile) and Z-RAM (a DRAM replacement). These newer types don't suffer from write-cycle limitations, and Z-RAM offers high density and speed like DRAM.

SOC INTERNAL MEMORY: PLACEMENT

The placement of memory in a System-on-Chip (SOC) design is crucial because it affects both performance and system complexity.

 On-die Memory: The memory is placed on the same chip as the processor, which allows faster access times due to shorter physical distances.
 Off-die Memory: The memory is placed on a separate chip. This often increases the access time because the data has to travel over longer distances. However, it can provide larger memory sizes.

Two important factors in memory system design:

1. Access Time: How long it takes to retrieve data from memory. This depends on the distance and delays between the processor and memory.

2. Memory Bandwidth: How quickly the memory can handle multiple requests. More independent memory arrays and optimized access methods help improve bandwidth.

 Challenges: For high-performance systems, placing memory off-die can be a challenge because it increases access time. The processor's cache system helps by temporarily storing frequently accessed data, thus reducing the need to fetch data from slower off-die memory.

SCRATCHPAD AND CACHE

1. Scratchpad:
 Scratchpad memory is a small, fast memory directly managed by the programmer. The programmer explicitly controls what data is stored and when it is accessed.
 Scratchpad memory is particularly useful in System-on-Chip (SoC) designs where the application is well-known.
 By eliminating the need for cache control hardware, scratchpad memory frees up space that can be used to increase the scratchpad size, leading to improved performance.
 Limitations: Scratchpad memory is typically used for data rather than instructions because manually managing instruction storage and retrieval can be not worth the programming effort.
2. Cache memory
 Cache memory is also a small, fast memory, but it is managed automatically by the hardware. The hardware decides what data should be stored in the cache based on the program's access patterns.

 Principles of Cache Memory: (*** 5M CACHE LOCALITIES)

o Temporal Locality
 Temporal locality refers to the reuse of specific data or resources within a relatively short period of time.
 If a program accesses a specific memory location (e.g., a variable) repeatedly in a short period of time, the chances are high that this data will be accessed again soon.
o Spatial Locality
 Spatial locality refers to the tendency of a program to access data locations that are physically close to each other within memory.
 When a program accesses an element of an array, it is likely that nearby elements (e.g., in the same or adjacent memory blocks) will be accessed soon after.
o Sequential Locality
 Sequential locality is a subset of spatial locality where the memory locations accessed are sequential or contiguous.
 In a loop where an array is processed element by element, the memory accesses typically proceed in a sequential manner.
 Caches are designed to prefetch sequentially following memory blocks or lines to enhance performance for programs that exhibit this pattern.
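
The two access patterns below illustrate why these localities matter to a cache. The array size and the row-major layout are the only assumptions: the row-order loop touches consecutive addresses (spatial/sequential locality), while the column-order loop strides across memory and tends to miss far more often.

    #include <stdio.h>

    #define N 512

    static float a[N][N];   /* C stores this row-major: a[i][0..N-1] are contiguous */

    int main(void) {
        float sum = 0.0f;

        /* Good spatial/sequential locality: consecutive addresses, so each
           fetched cache line is fully used before the next line is needed. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor spatial locality: a stride of N*sizeof(float) bytes between
           accesses, so each cache line contributes only one element before
           it is likely to be evicted. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }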

CACHE ORGANIZATION

 Cache memory is a small, fast storage area that stores frequently accessed data and instructions to speed up processing.
 Fetch Strategies
1. Fetch-on-Demand:
 This strategy brings data into the cache only when it is needed, i.e., when a "miss" occurs (the data is not already in the cache).
 Commonly used in simple processors. It only loads data into the cache when the processor requests it and finds it missing.
2. Prefetch Strategy:
 Anticipates the data that will be needed soon and loads it into the cache before the processor requests it.
 Commonly used in instruction caches (I-caches). By preloading instructions, the processor can execute them without waiting for a cache miss.

 There are three basic types of cache mapping or organization:

1. Fully Associative (FA) Cache:

 The cache can place a block of memory data into any cache line. The entire cache is searched to find if a requested data block is present (this is called a "directory hit").
 Advantage: Very flexible; low conflict misses since any block can go anywhere.
 Disadvantage: Slow and complex due to the need to compare the requested address with every cache line.

2. Direct-Mapped Cache:
 Each block of memory data is mapped to exactly one location (cache line) in the cache. The lower bits of the memory address are used as an index to locate the cache line.
 Advantage: Fast, as the location is directly determined by the address, allowing for simultaneous access to both the cache array and directory.
 Disadvantage: High conflict miss rate; if multiple memory addresses map to the same cache line, they will keep replacing each other.

3. Set-Associative Cache:
 A hybrid between fully associative and direct-mapped caches. The cache is divided into "sets," and each memory block can be stored in any cache line within a set. A set-associative cache with 2 lines per set is called "2-way set-associative," with 4 lines per set "4-way set-associative," and so on.
 Advantage: Balances speed and flexibility, offering better performance than direct-mapped and simpler implementation than fully associative caches.
 Disadvantage: Slower than direct-mapped but faster than fully associative. More complex to implement.

 Cache Addressing
 When accessing the cache, the memory address provided by the processor is divided into several parts:

 Tag: The most significant bits, used to compare against the addresses in the cache to check for a hit.
 Index: Used to locate the specific set or line within the cache.
 Offset: Identifies the specific word within the cache line.
 Byte: Specifies a specific byte within the word, used during partial writes.
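
The address split described above can be expressed directly in code. The cache geometry used here (32-byte lines, 256 sets) is an assumption chosen only to make the bit fields concrete.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES  32u    /* assumed cache line size               */
    #define NUM_SETS    256u   /* assumed number of sets (index range)  */
    #define OFFSET_BITS 5u     /* log2(LINE_BYTES)                      */
    #define INDEX_BITS  8u     /* log2(NUM_SETS)                        */

    int main(void) {
        uint32_t addr = 0x12345678u;

        uint32_t offset = addr & (LINE_BYTES - 1u);                 /* byte/word within line   */
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1u);  /* selects the set or line */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);       /* compared with directory */

        printf("addr 0x%08X -> tag 0x%X, index %u, offset %u\n",
               addr, tag, index, offset);
        return 0;
    }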

Write Policies

 There are two primary strategies for handling writes in a cache:

 Write-Through (WT):
 The cache writes data to both the cache and the main memory simultaneously.
 Advantage: Ensures that memory always contains the most up-to-date data.
 Disadvantage: Higher memory bandwidth usage because every write operation goes to main memory.

 Copy-Back (Write-Back) (CB):

 The cache writes data only to the cache and updates the main memory only when the cache line is replaced.
 Advantage: Reduces memory traffic since data is written to the main memory less frequently.
 Disadvantage: Requires tracking which lines have been modified (marked as "dirty") to ensure they are eventually written back to memory.
 Handling Write Misses
 When a write operation misses in the cache (i.e., the data is not already present in the cache), there are two possible approaches:
 Write Allocate (WA): Fetches the missing line into the cache and then writes the data.
 No Write Allocate (NWA): Writes the data directly to memory without bringing it into the cache.
 WTNWA (Write-Through No Write Allocate): Commonly used in write-through caches to avoid unnecessary cache line allocations.
 CBWA (Copy-Back Write Allocate): Used in copy-back caches to ensure that frequently written data is brought into the cache for faster subsequent accesses.
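
The copy-back policy depends on tracking a dirty bit per line. The structure below is a minimal sketch of that bookkeeping under a copy-back, write-allocate (CBWA) policy, not a full cache model; the "copy back to memory" step is reduced to a print statement.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cache_line {
        bool     valid;
        bool     dirty;     /* set on a write; cleared when copied back */
        uint32_t tag;
        uint32_t data;      /* stand-in for the line's contents         */
    };

    static void write_word(struct cache_line *line, uint32_t tag, uint32_t value) {
        if (line->valid && line->dirty && line->tag != tag) {
            /* Victim line is dirty: it must be copied back before replacement. */
            printf("copy back old line (tag 0x%X) to memory\n", line->tag);
        }
        if (!line->valid || line->tag != tag) {
            /* Write miss: allocate the line (the fetch from memory is omitted). */
            line->valid = true;
            line->tag   = tag;
        }
        line->data  = value;   /* write only into the cache ...              */
        line->dirty = true;    /* ... and remember that memory is now stale  */
    }

    int main(void) {
        struct cache_line line = {0};
        write_word(&line, 0x1A, 111);   /* miss: allocate, line becomes dirty    */
        write_word(&line, 0x1A, 222);   /* hit: stays dirty, no memory traffic   */
        write_word(&line, 0x2B, 333);   /* conflict: dirty victim is copied back */
        return 0;
    }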

Strategies for line replacement

In cache memory systems, a cache miss occurs when a requested memory line is not found in the cache. When this happens, two main actions must be taken:

1. Fetching the Missed Line: The line must be retrieved from the main memory. Depending on the cache policy, this can work in two ways:

o Write-Through Cache: The fetched line replaces the current line without any extra action.

o Copy-Back Cache: If the line to be replaced has been modified (dirty), it must be written back to memory before replacement. If it's clean, it can simply be replaced.

To speed up this process, caches may use a nonblocking or prefetching approach, allowing the processor to continue executing instructions while the cache miss is handled, as long as the missing data isn't immediately needed by the processor.

2. Line Replacement Policy: When the cache is full, a replacement policy determines which line to remove:

o Least Recently Used (LRU): Replaces the least recently accessed line, aligning with the idea of temporal locality, but is more complex to implement.

o First In – First Out (FIFO): Replaces the line that has been in the cache the longest, with simpler implementation than LRU.

o Random (RAND): Replaces a randomly selected line; simplest to implement but less efficient than LRU.

Although LRU generally performs best due to its alignment with temporal locality, simpler FIFO or RAND policies are often acceptable, resulting in only a slight performance loss. (A minimal LRU sketch follows this list.)

3. Cache Environments and Miss Rates: The effectiveness of a cache can vary depending on system demands and environment:

o Multiprogrammed Environment: With multiple programs sharing memory, a warm cache is created, retaining some recently used lines when a program resumes, though the miss rate may increase slightly.

o Transaction Processing Environment: Short transactions run to completion before switching, creating a cold cache since each transaction starts with an empty cache.

Both environments involve context switching before fully loading a program’s working set, which can increase the overall miss rate due to disrupted data locality.
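
Returning to the replacement policies above, the fragment below is a minimal LRU sketch for one set of a 4-way set-associative cache. Timestamps are a simple (though not the cheapest) way to track recency; real hardware usually approximates LRU with a few status bits per set.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4   /* assumed associativity of one set */

    struct way { bool valid; uint32_t tag; uint64_t last_used; };

    static uint64_t now;   /* logical clock, incremented on every access */

    /* Access one set; returns true on a hit, applies LRU replacement on a miss. */
    static bool access_set(struct way set[WAYS], uint32_t tag) {
        now++;
        /* First pass: look for a hit. */
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {
                set[w].last_used = now;
                return true;
            }
        }
        /* Miss: pick an empty way if one exists, otherwise the LRU way. */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) { victim = w; break; }
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        }
        set[victim] = (struct way){ true, tag, now };
        return false;
    }

    int main(void) {
        struct way set[WAYS] = {0};
        uint32_t refs[] = { 1, 2, 3, 4, 1, 5, 2 };   /* tag 5 evicts tag 2, the LRU */
        for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
            printf("tag %u: %s\n", refs[i], access_set(set, refs[i]) ? "hit" : "miss");
        return 0;
    }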
Other Types of Cache

Split (I/D) Cache

Split caches divide memory into separate instruction (I) and data (D) caches. This setup increases the cache bandwidth by allowing simultaneous access to both instructions and data, effectively doubling the access speed.

However, unified caches, which combine both data and instructions into a single cache, tend to have a lower miss rate at the same total size, as they adapt better to changing instruction-to-data demands during program execution.

Despite the higher miss rate, split caches are often more efficient in practice:

 Flexible sizing: Split caches allow for unequal partitioning, like a 75% data cache and 25% instruction cache, optimizing for workload needs.

 Simplified I-cache: Since the instruction cache (I-cache) only reads data and does not need to handle write operations, its design is simpler.

Multilevel Cache

Multilevel caches involve two or more levels (L1, L2, etc.), where L1 is closest to the processor, offering faster access but smaller capacity, and L2 and beyond are progressively larger but slower. Cache access time increases with cache size due to the longer wiring and additional circuitry. Research shows that the access time (in nanoseconds) for a direct-mapped cache can be approximated as a function of the following parameters:

 f = feature size (in microns),

 C = cache capacity (in KB),

 A = associativity level (1 for direct mapping).


Evaluating Multilevel Cache Performance

In a multilevel cache system, performance analysis often focuses on both levels (L1 and L2) using data from the L1 cache. The principle of inclusion applies to systems where all data in L1 is also contained in L2.

1. Principle of Inclusion:

 The contents of L1 must also be contained in L2.

 The line size of L2 must be greater than or equal to the line size of L1.

o If L2 had smaller lines, loading a single line into L1 could trigger multiple misses in L2.

 The second-level cache must be larger than the first-level cache (it must hold at least as many lines as L1).

o A larger second-level cache helps reduce overall miss rates and benefits the system.

2. Miss Rates in Multilevel Systems:

 Local Miss Rate: (misses in L2) / (references made to L2, i.e., L1 misses).

 Global Miss Rate: (misses in L2) / (total references made by the processor).

 Solo Miss Rate: The hypothetical miss rate if L2 were the only cache, defined by the principle of inclusion.
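
A short numeric sketch helps separate the two rates; the counts below are invented purely for illustration. If the processor makes 10,000 references, 400 of them miss in L1 (and therefore go to L2), and 100 of those also miss in L2, the local and global miss rates of L2 differ by more than an order of magnitude.

    #include <stdio.h>

    int main(void) {
        /* Invented counts, for illustration only. */
        long refs      = 10000;  /* total references made by the processor */
        long l1_misses = 400;    /* these become references to L2          */
        long l2_misses = 100;    /* not found in L2 either                 */

        printf("L1 miss rate        : %.2f%%\n", 100.0 * l1_misses / refs);
        printf("L2 local miss rate  : %.2f%%\n", 100.0 * l2_misses / l1_misses);
        printf("L2 global miss rate : %.2f%%\n", 100.0 * l2_misses / refs);
        return 0;
    }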

3. Guidelines for Miss Rate Analysis:

 When L1 is the same size as or larger than L2, the principle of inclusion still provides a reliable estimation of L2 behavior.

 When L2 is much larger than L1, L2 can operate independently, and its miss rate will align with the solo miss rate, focusing analysis only on L2 behavior.
4. Logical Inclusion
 ensures that all data in L1 cache is also present in L2, guaranteeing full consistency between
the two levels
Key Requirements for Logical Inclusion:

Write Policy:

o Write-Through L1 Cache: For logical inclusion, L1 should be configured as a write-


through cache. This means every write to L1 is also immediately updated in L2, keeping
both caches aligned.
o L2 Cache Flexibility: L2 does not necessarily need to be write-through because it can still fulfil the consistency requirements by aligning with L1 writes.

Write-Back in L1 and Inconsistencies:

o If L1 used a write-back policy, it would only write changes to L2 when the data is evicted
from L1, creating temporary differences in content between L1 and L2. This makes
logical inclusion challenging as L1 and L2 could hold different values temporarily.

Consistent Cache Policies:

o When logical inclusion is needed, L1 and L2 should use coordinated cache policies to
ensure synchronized data content.
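
As a minimal sketch of the miss-rate definitions above (the event counts are illustrative assumptions, not values from the text), the local and global L2 miss rates can be computed directly from counters:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative event counts for one program run (assumed values). */
    double processor_refs = 1000000.0;  /* total references issued by the CPU   */
    double l1_misses      = 40000.0;    /* references that miss in L1 (go to L2) */
    double l2_misses      = 8000.0;     /* references that also miss in L2       */

    /* Local miss rate: L2 misses relative to the traffic L2 actually sees. */
    double local_miss_rate  = l2_misses / l1_misses;
    /* Global miss rate: L2 misses relative to all processor references.    */
    double global_miss_rate = l2_misses / processor_refs;

    printf("L1 miss rate   : %.4f\n", l1_misses / processor_refs);
    printf("L2 local miss  : %.4f\n", local_miss_rate);
    printf("L2 global miss : %.4f\n", global_miss_rate);
    return 0;
}
```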

Virtual-to-Real Translation and the TLB (Translation Lookaside Buffer)

The TLB is a crucial component that translates virtual addresses (used by programs) into real (physical) addresses that the hardware needs to access memory. This translation is necessary because virtual memory lets processes run as if they each had their own dedicated memory space, while real addresses are needed for actual data access.
TLB Translation Process (a lookup sketch follows this list):

1. Two-Way Set Associative TLB:

o The TLB can be organized in various ways. A two-way set-associative TLB has pairs of entries that store virtual-to-real address mappings (translations).
o Virtual addresses are divided into a page address and an offset. Only the page-address part requires translation.

2. TLB Indexing and Hashing:
o The TLB uses a portion of the virtual address bits (a "hash") to select an entry in the TLB, minimizing the chance of collisions between different virtual addresses that happen to have similar page values.
o The index size depends on the number of entries divided by the degree of set associativity; for example, log₂(t) bits, where t is the number of entries divided by the set-associative level.

3. Address Matching:
o After an indexed set is accessed, the system compares the virtual page address against the tag in each entry. When a match is found, the TLB outputs the corresponding real address.
o If no match is found, this is a TLB miss (a "not-in-TLB" event), which takes several cycles to resolve by loading the translation from memory.

4. Simultaneous TLB and Cache Access:
o Ideally, TLB access happens at the same time as cache access for efficiency. If the cache requires real addresses, however, the TLB translation must finish before the cache access can begin.

5. Special Considerations for SOC and Board Systems:
o Small TLBs can lead to frequent TLB misses, slowing down programs.
o Cache Access with Real Addresses: If a cache is set up to use real addresses, the TLB lookup must complete before the cache can be accessed, which adds delay.
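
The following is a minimal sketch, under assumed field widths and a simple modulo hash, of how a two-way set-associative TLB lookup proceeds (real TLBs differ in entry format, hashing, and replacement policy):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_SETS  64         /* assumed: 128 entries, 2-way => 64 sets         */
#define PAGE_BITS 12         /* assumed 4 KB pages: low 12 bits are the offset */

typedef struct {
    bool     valid;
    uint32_t vpn;            /* virtual page number (tag)      */
    uint32_t ppn;            /* corresponding real page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS][2];   /* two ways per set */

/* Returns true on a TLB hit and writes the real address to *real_addr;
 * returns false on a not-in-TLB event (handled elsewhere, over several cycles). */
bool tlb_lookup(uint32_t virt_addr, uint32_t *real_addr) {
    uint32_t offset = virt_addr & ((1u << PAGE_BITS) - 1);
    uint32_t vpn    = virt_addr >> PAGE_BITS;
    uint32_t set    = vpn % TLB_SETS;           /* index/hash on the page bits */

    for (int way = 0; way < 2; way++) {         /* compare the tag in each way */
        if (tlb[set][way].valid && tlb[set][way].vpn == vpn) {
            *real_addr = (tlb[set][way].ppn << PAGE_BITS) | offset;
            return true;                        /* hit: real address available */
        }
    }
    return false;                               /* miss: load translation from memory */
}
```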

SOC (On-Die) Memory Systems

On-die memory is embedded memory within a System-on-Chip (SOC) design, optimized for performance and area on the chip.

Types of On-Die Memory:

1. SRAM (Static RAM):

o SRAM uses a 6-transistor cell, making it fast but relatively large and less dense.
o Built in the same technology as the processor, SRAM has very low latency, making it ideal for speed-sensitive structures such as caches.
o However, it is expensive in terms of area, taking up 10x to 20x more die space than DRAM.

2. DRAM (Dynamic RAM):
o DRAM uses a 1-transistor cell plus a deep-trench capacitor, allowing it to be highly dense and compact.
o Although denser, DRAM is slower than SRAM because of leakage currents and the need for periodic refresh to maintain stored data.
o DRAM is usually not fabricated in the same process as the processor and is more often used off-die or in separate chips.

3. eDRAM (Embedded DRAM):
o eDRAM provides a middle ground, combining DRAM's density with access times closer to SRAM's.
o The eDRAM process requires extra fabrication steps (additional masks and layers), which can increase manufacturing cost by about 20%.
o eDRAM suits SOC applications that need a large memory capacity without sacrificing too much speed, providing a reasonable balance between density and access time.
UNIT 5

SoC Design Approach

Designing a System on Chip (SoC) device is a complex process that involves balancing cost,
performance, and functionality. The process often requires multiple iterations to ensure the final
design meets all requirements.

 Initial Project Plan: The design starts with a project plan, including budget, schedule, target
market, competitive analysis, and goals for cost and performance.
 Placeholder Product Design: An early version of the design, known as a "straw man," is
created to give a rough idea of the product’s structure and performance.
 Detailed Specifications and Analysis: All functions and performance requirements are
specified. Models are created to understand the trade-offs between functionality and
performance.
 System Design:
o Memory and Processor Selection: Memory and storage are first allocated. Then,
processors are chosen, often with a base processor for the operating system.
o Interconnect Architecture: The memory layout and processor choices define how
components will connect and communicate. Bandwidth needs are analyzed and
cache is added to support data transfer speeds.
o Peripheral Selection: If required, peripherals are selected based on bandwidth needs, such as a JPEG encoder for a camera.
 Cost and Performance Estimation: The initial design is assessed to get a rough estimate of
overall cost and performance.
 Optimization and Verification: Tools help refine the design by improving efficiency and
reducing cost. Each change is evaluated for its impact on accuracy, speed, and energy
consumption.
 Final Evaluation: After several optimization rounds, the design’s profitability and market
potential are evaluated to decide on the final design.

AES: Algorithm and Requirements

The Advanced Encryption Standard (AES) is a widely used symmetric encryption algorithm that ensures data security. It operates on fixed-size blocks of data and applies a series of transformations to encrypt and decrypt that data.

Block and Key Sizes:

 AES operates on 128-bit blocks; the three variants differ in key length:

o AES-128: 128-bit key

o AES-192: 192-bit key

o AES-256: 256-bit key

Rounds:

 The encryption process consists of:

o One Initial Round

o r − 1 Standard Rounds, where r is determined by the key length (10 rounds for AES-128, 12 for AES-192, and 14 for AES-256).

o One Final Round: similar to the standard rounds but without the MixColumns step.

Major Transformations in AES

1. SubBytes:

o Each byte in the input block is replaced with a corresponding byte from a predefined substitution box (S-Box). This step adds non-linearity to the cipher.

2. ShiftRows:

o The bytes of the input are arranged into four rows. Each row is then rotated by a predefined step according to its row index.

3. MixColumns:

o Each column of the four-row structure is transformed using polynomial multiplication over the Galois field GF(2^8).

4. AddRoundKey:

o The input block is XORed with a round key derived from the original encryption key. This operation is performed in every round and adds a layer of security.

Rounds Structure (a round-loop sketch follows this list)

 Initial Round: only the AddRoundKey operation is performed.

 Standard Rounds: each of the four transformations (SubBytes, ShiftRows, MixColumns, AddRoundKey) is applied.

 Final Round: the MixColumns transformation is omitted, and the other three transformations are applied.

 Decryption applies the inverse transformations of the four main steps (using an inverse S-Box, inverse row shifts, and inverse column mixing).

 The round transformations in AES can be parallelized, enabling faster implementations, especially in hardware architectures.
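
A minimal sketch of the round structure described above (the SubBytes, ShiftRows, and MixColumns bodies are stubbed out; a real implementation would supply the S-Box, the row rotations, the GF(2^8) column mixing, and the key schedule):

```c
#include <stdint.h>

#define NB 16                       /* AES state is a 16-byte (128-bit) block */

/* Stubs for three of the transformations; real bodies are omitted here. */
static void sub_bytes(uint8_t s[NB])   { (void)s; /* S-Box substitution per byte   */ }
static void shift_rows(uint8_t s[NB])  { (void)s; /* rotate each row by its index  */ }
static void mix_columns(uint8_t s[NB]) { (void)s; /* GF(2^8) column multiplication */ }

/* AddRoundKey really is just a byte-wise XOR with the round key. */
static void add_round_key(uint8_t s[NB], const uint8_t rk[NB]) {
    for (int i = 0; i < NB; i++) s[i] ^= rk[i];
}

/* r = total rounds (10/12/14); round_keys[0..r] come from the key schedule. */
void aes_encrypt_block(uint8_t state[NB], const uint8_t round_keys[][NB], int r) {
    add_round_key(state, round_keys[0]);          /* initial round               */
    for (int round = 1; round < r; round++) {     /* r - 1 standard rounds       */
        sub_bytes(state);
        shift_rows(state);
        mix_columns(state);
        add_round_key(state, round_keys[round]);
    }
    sub_bytes(state);                             /* final round: no MixColumns  */
    shift_rows(state);
    add_round_key(state, round_keys[r]);
}
```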

AES: Design and Evaluation

Initial Design Considerations

 The design specifies a PLCC68 (Plastic Leaded Chip Carrier) package with a die size of 24.2 mm × 24.2 mm.
 The ARM7TDMI, a 32-bit RISC processor, is considered. Its die sizes are:
o 180 nm process: 0.59 mm²
o 90 nm process: 0.18 mm²
 Both versions of the ARM7 processor fit comfortably within the specified die size.
 The AES encryption routine, according to the SimpleScalar tool set, has a cycle count of 16,511.

Optimization Strategies

Cache Modification:

 To improve system throughput without exceeding the initial area constraint, the study considers modifying the cache organization.

 By doubling the block size of a 512-set L1 direct-mapped instruction cache from 32 bytes to 64 bytes:

o The AES cycle count decreases from 16,511 to 16,094 (a 2.6% improvement).

 Area considerations:

o Initial area without cache: 60K rbe

o With an 8K L1 instruction cache: 68K rbe

o After doubling the cache size: 76K rbe

o The area increase is over 11%, which is judged not worthwhile for such a small speed improvement.
Alternative Architectural Approaches:

 The ARM7 already uses pipelining. Exploring parallel pipelined datapaths could yield better performance for applications needing high throughput; however, these approaches may cost more area and power than ASIC designs.

 Another suggestion is to extend the processor's instruction set with custom instructions specific to AES, potentially improving performance.

Further Considerations

1. Pipelined AES Implementation:

o AES can be fully pipelined and implemented on FPGA devices, achieving high throughput (over 21 Gbit/s) by exploiting FPGA-specific resources such as block memories and multipliers.

2. AES in Larger Systems:

o AES cores are often part of larger systems. For example, the AES core can be integrated into the ViaLink FPGA fabric on a QuickMIPS device, which contains a 32-bit MIPS 4Kc processor core.

o AES can also be used in designs involving secure hash methods.

Image Compression

Image compression methods such as JPEG share common intraframe operations with video compression methods like MPEG and H.264. These operations include:

 Color Space Transformation: changing the color representation of the image to make it easier to compress.

 Entropy Coding (EC): a lossless compression step that encodes the data efficiently.

JPEG Compression

JPEG compression involves three main steps (a DCT sketch follows this list):

1. Color Space Transformation:

o The image, originally in RGB (24 bits per pixel, with 8 bits each for red, green, and blue), is transformed into the YCbCr color space.

 Y: represents brightness (luminance).

 Cb and Cr: represent color information (chrominance).

o Human vision is more sensitive to luminance than to chrominance, allowing the Cb and Cr components to be downsampled.

o Common downsampling ratios in JPEG are:

 4:4:4: no downsampling.

 4:2:2: downsampled by a factor of 2 in the horizontal direction.

 4:2:0: downsampled by a factor of 2 in both the horizontal and vertical directions.

o Each component (Y, Cb, Cr) is processed separately for compression.

2. Discrete Cosine Transform (DCT):

o Each component is divided into 8×8 pixel blocks.

o The DCT converts each block from the spatial domain to the frequency domain using 8×8 matrix multiplications. This transformation allows the high-frequency components (which carry less visual information) to be reduced more aggressively than the low-frequency components.

o Quantization is applied after the DCT to further reduce the high-frequency components, which helps achieve higher compression rates.

3. Entropy Coding (EC):

o This step arranges the frequency coefficients in zigzag order to prioritize the low-frequency components.

o Run-Length Coding (RLC): used to compress the AC components (the coefficients after the DCT).

o Differential Pulse Code Modulation (DPCM): applied to the DC components.

o Finally, either Huffman coding or arithmetic coding is used to encode the remaining data. While arithmetic coding is generally more efficient, it is also more complex to decode.
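
As a minimal sketch of the DCT step above, here is a naive row–column 2-D DCT of one block (an O(k³) matrix form for clarity, not an optimized JPEG kernel; the cosine basis is computed on the fly rather than taken from precomputed tables):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define K 8   /* JPEG uses 8x8 blocks */

/* Naive 2-D DCT-II of one K x K block: out = C * in * C^T,
 * computed as two passes of 1-D transforms (columns, then rows). */
void dct_block(const double in[K][K], double out[K][K]) {
    double c[K][K], tmp[K][K];

    /* Build the orthonormal DCT basis matrix C. */
    for (int u = 0; u < K; u++) {
        double s = (u == 0) ? sqrt(1.0 / K) : sqrt(2.0 / K);
        for (int x = 0; x < K; x++)
            c[u][x] = s * cos((2 * x + 1) * u * M_PI / (2.0 * K));
    }

    /* Pass 1: tmp = C * in (transform along the columns). */
    for (int u = 0; u < K; u++)
        for (int x = 0; x < K; x++) {
            double acc = 0.0;
            for (int i = 0; i < K; i++) acc += c[u][i] * in[i][x];
            tmp[u][x] = acc;
        }

    /* Pass 2: out = tmp * C^T (transform along the rows). */
    for (int u = 0; u < K; u++)
        for (int v = 0; v < K; v++) {
            double acc = 0.0;
            for (int i = 0; i < K; i++) acc += tmp[u][i] * c[v][i];
            out[u][v] = acc;
        }
}
```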

Performance and Operations

This section estimates the computational load of the JPEG pipeline:

 Operations for DCT:

o For a k × k block, each output value of a 1-D transform pass requires:

 k data loads for image data,

 k data loads for DCT coefficients,

 k multiply-accumulate operations,

 1 data store.

o This totals 3k + 1 operations per output value, or 2k²(3k + 1) operations for the entire block over the two (row and column) passes.

 For frames of size n × n pixels at f frames per second (fps), the operation rate is therefore approximately (a worked calculation follows this list):

2fn²(3k + 1) operations per second

 Common Formats:

o CIF (Common Intermediate Format): 352 × 288 pixels.

o QCIF (Quarter CIF): 176 × 144 pixels.
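
A quick worked check of the operation-count estimate above, using the CIF format at an assumed 30 fps (the frame rate is an illustration only; since CIF is not square, n² is replaced by width × height):

```c
#include <stdio.h>

int main(void) {
    const double k      = 8.0;        /* 8x8 DCT blocks          */
    const double width  = 352.0;      /* CIF                     */
    const double height = 288.0;
    const double fps    = 30.0;       /* assumed frame rate      */

    /* 2 * f * (pixels per frame) * (3k + 1) operations per second */
    double ops_per_sec = 2.0 * fps * width * height * (3.0 * k + 1.0);

    printf("DCT workload for CIF @ %.0f fps: %.2f Mops/s\n",
           fps, ops_per_sec / 1e6);   /* about 152 Mops/s */
    return 0;
}
```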

Compression Ratios

 Lossless compression: typically achieves a size reduction of up to 3 times.

 Lossy compression: can achieve reductions of up to 25 times.

Example JPEG System for a Digital Still Camera

1. Processor Overview:

o The TMS320C549 processor is used to implement the imaging pipeline, which processes 16 × 16 blocks of pixels.

o It features:

 32 KB of 16-bit RAM.

 16 KB of 16-bit ROM.

o The processor executes all imaging operations on-chip, minimizing the need for slower external memory and improving processing speed.

2. Performance Specifications (a worked check follows this list):

o The TMS320C549 delivers up to 100 million instructions per second (MIPS).

o Power consumption is low, approximately 0.45 mA/MIPS.

o The imaging pipeline, including JPEG compression, requires about 150 cycles per pixel, i.e., roughly 150 instructions per pixel at 100 MIPS and a 100 MHz clock.

o The processor can therefore process a 1-megapixel CCD (Charge-Coupled Device) image in 1.5 seconds.
3. Shot-to-Shot Delay:

o There is a 2-second shot-to-shot delay, which includes the time required to transfer data from external memory to the on-chip memory. This delay ensures that each image is fully processed before the next shot.

4. Image Playback:

o After capturing images, users can display them on the camera's LCD screen or on an external TV monitor.

o The images are stored as JPEG bitstreams on a flash memory card.

o Playback-mode software decodes these JPEG images, scales them to suitable resolutions, and displays them, requiring about 100 cycles per pixel for a 1-second playback of a megapixel image.

5. Memory Requirements:

o The processor requires:

 1.7 KB of program memory.

 4.6 KB of data memory.
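
A quick arithmetic check of the figures in item 2 above: 10^6 pixels × 150 cycles/pixel = 1.5 × 10^8 cycles, and at a 100 MHz clock (10^8 cycles/s) that is 1.5 s per 1-megapixel image, which is consistent with the quoted 2-second shot-to-shot delay once the external-to-on-chip transfer time is added.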


UNIT 4

INTERCONNECT ARCHITECTURES

The interconnect architecture in a System-on-Chip (SoC) plays a vital role in enabling communication between the different components (or IP blocks) on the chip and with external devices.

Components and considerations involved:

 The SoC typically includes various IP blocks such as processors, caches, graphics processors, video codecs, and network units, all integrated onto a single chip.

 These blocks need to communicate effectively with each other and with off-chip devices (e.g., external memory or peripherals) to ensure smooth operation and high performance.

 IP blocks communicate through an interconnect network within the SoC.

 An Interconnect Interface Unit (ICU) standardizes communication by providing a common interface protocol that allows each IP block to interact with the others seamlessly.

Key Considerations in Interconnect Design

 Communication Bandwidth: measures the rate of data transfer, often in bytes per second. Higher bandwidth means faster data transmission, which is critical for throughput in high-demand applications.

 Communication Latency: the delay from when data is requested to when it is received. Low latency is essential for real-time systems (like mobile communication) but may be less crucial in applications where slight delays (like video streaming) are acceptable.

 Master and Slave Roles:

o A master can initiate communication, while a slave responds to requests.

o For instance, a processor (master) may request data from memory (slave); SoCs typically have multiple masters and several slaves.

 Concurrency Requirement: indicates how many simultaneous communication channels are active. More channels generally mean higher bandwidth but require a more sophisticated interconnect design.

 Packet or Bus Transactions: describes how data is transmitted.

o In a bus system, each transaction carries an address, control bits, and data.

o In a Network-on-Chip (NoC), transactions are broken into packets, each with a header (address/control) and a payload (data).

 ICU (Interconnect Protocol Management): manages the protocol-specific details of data transfer. For example:

o A bus wrapper allows IP blocks that do not use the primary protocol to communicate with the others.

o In NoCs, it manages packet buffering and transmission order.

 Multiple Clock Domains: different parts of the SoC may operate at different clock speeds because of differing operational requirements (e.g., a processor vs. a video input).

o Clock domains separate these areas, but design care is needed to avoid synchronization problems that can cause data transfer errors.
BUS: BASIC ARCHITECTURE

The bus architecture in a computer system serves as the main communication pathway between components (e.g., processor, memory, and peripherals).

 A bus's design significantly affects system performance. If poorly designed, it can restrict (or throttle) data transfer, creating a bottleneck that slows down the system.

 Conventional bus systems are often not optimal for System-on-Chip (SoC) applications, since they were designed for backplane connections in larger systems such as rack-mounted servers or motherboards.

 Limitations include restricted signal pin counts on IC packages, high-capacitance loads, connector resistance, and electromagnetic noise.

Bus Ownership and Arbitration

 In a bus system, multiple units share the bus. When a unit gains exclusive access, it is said to "own" the bus. Bus masters (e.g., processors) initiate communication, while slaves (e.g., memory) respond.

 Arbitration is the process of granting bus ownership. In a centralized approach, an arbitration unit manages requests and allocates the bus to one unit at a time based on specific rules.

 The bus protocol dictates the communication rules, including data ordering, acknowledgment of successful reception, data compression methods, error checking, and arbitration priority.

Bus Bridges

 A bus bridge connects two different bus systems and can serve three main functions:

1. Protocol Conversion: if the buses use different communication protocols, the bridge translates between them.

2. Traffic Segmentation: bridges keep traffic contained within their own sections, enabling both buses to operate concurrently.

3. Memory Buffering: the bridge temporarily stores data in buffers, allowing the master to continue its operations before the data reaches the slave, improving performance.

Physical Bus Structure

 The physical structure, including wire paths and cycle time, determines how bus transactions occur.

 Arbitration cycles determine access priorities; more complex systems add extra lines and logic to maintain priority without extra delay.

Types of Buses

 Unified Bus: uses the same pathway for both address and data, transmitted sequentially.

 Split Bus: has separate pathways for address and data, allowing them to be handled independently.

 Single-Transaction vs. Tenured Bus: single-transaction buses are dedicated to each request individually, while tenured buses support buffered transactions, allowing one transaction's data to occupy the bus even as new transactions are initiated.
AMBA

 The AMBA (Advanced Microcontroller Bus Architecture) was introduced by ARM in 1997 as a structured interconnect standard, primarily for ARM-based SoCs.
 It provides multiple bus levels to support different performance and power needs within a system.
 AMBA's main buses are the Advanced High-Performance Bus (AHB) for high-speed components and the Advanced Peripheral Bus (APB) for lower-power, slower peripherals.
 Additionally, there is an older bus, the Advanced System Bus (ASB), intended for simpler microcontrollers.

Key AMBA Bus Components

1. AHB (Advanced High-Performance Bus):

o High Speed and High Bandwidth: designed for ARM processor cores, DMA controllers, on-chip memory, and other high-performance peripherals. It has separate buses for address, read data, and write data, which allows faster data flow.

o Multimaster Support: the AHB supports multiple masters, such as processors or DMA controllers, allowing concurrent transactions with multiple slaves (e.g., memory).

o Burst Mode and Split Transactions: AHB can handle burst transfers, where large blocks of data move in one operation, and split transactions, which allow a master to initiate a transfer and return to it later.

o Pipelining: it uses a two-phase operation (address phase and data phase), allowing one transfer's address phase to overlap with the previous transfer's data phase. This pipelining increases data throughput.
AHB Protocol and Transfer Process

In a typical AMBA system, the AHB forms the primary bus for high-speed components such as processors and memory. An AHB transfer proceeds as follows:

 Master Access: a master (e.g., an ARM processor) requests bus access from an arbiter. If multiple masters request access, the arbiter grants it based on priority.

 Initiate Transfer: the bus master drives the address and control signals, indicating the type and width of the transfer and whether it is a burst operation. Data flows from master to slave in a write operation and from slave to master in a read.

 Slave Response: the slave responds with status indicators (e.g., success, delay, error) to notify the master of the transfer's status.

In pipelined (tenured) AHB buses, one transfer's address phase overlaps with the previous transfer's data phase, increasing speed.

2. APB (Advanced Peripheral Bus):

o Low Power and Low Complexity: APB is optimized for simple, low-power interfaces to slower peripheral devices (e.g., GPIO, timers).

o Simpler Operation: unlike AHB, APB has a straightforward three-state data transfer process (idle, setup, and enable states), making it easy to implement in low-complexity applications.

APB Protocol and Simplicity

APB operates on a simple, state-driven protocol:

 Idle State: the bus remains idle until a data transfer is requested.

 Setup and Enable States: the bus enters the setup and then the enable state for each transfer, providing simple, low-power operation suitable for peripherals.

3. ASB (Advanced System Bus):

o An earlier alternative to AHB, used in lower-performance systems where the full capabilities of AHB are not needed, with 16/32-bit operation for simpler microcontrollers.

AMBA System Operation and Benefits

1. Modular Design and Reuse: AMBA's well-defined interface makes it easier to design modular, reusable SoC components, reducing development complexity and improving interoperability.

2. Clocking and Reset Flexibility: the AMBA interface is simple yet flexible, with options for multimaster systems, split transactions, and burst modes.

3. Low-Power Design: AMBA's partitioned design (AHB for high performance, APB for low-power peripherals) supports efficient power consumption, essential for portable devices.

4. On-Chip Testing: AMBA supports on-chip test access through its bus infrastructure, simplifying the testing of bus-connected modules.
CORECONNECT

 IBM's CoreConnect bus is a structured interconnect standard for SoC systems, primarily designed around IBM's PowerPC processor but flexible enough for other processors.
 CoreConnect organizes data pathways into a hierarchical bus system, providing high-performance data transfers alongside simpler, low-power connections.

Key Components of CoreConnect

1. Processor Local Bus (PLB):
o Purpose: the PLB is a high-speed, high-bandwidth bus that interconnects performance-critical components such as the processor, memory, and DMA controllers.
o Architecture: it is a fully synchronous, split-transaction bus with separate read, write, and address buses, allowing simultaneous data transfers.
o Transaction Phases: PLB transactions are divided into distinct phases:
 Address Phases: these include the request (RQ), transfer (XFER), and address-acknowledge (ACK) phases, in which masters request bus ownership and present the address to slaves.
 Data Phases: in a data tenure there is a transfer and an acknowledge phase for each data beat. During the transfer phase the master drives the write data bus for a write transfer, or samples the read data bus for a read transfer.
o Split Transactions: this feature decouples the address, read, and write buses so that different masters can operate concurrently on the bus.

2. On-Chip Peripheral Bus (OPB):

o The OPB serves as a secondary bus for low-bandwidth peripherals such as UARTs, GPIOs, and timers. It offloads the PLB by reducing its capacitive load.
o The OPB supports multiple masters and slaves through a distributed multiplexer architecture, which allows peripherals to be added without altering the existing configuration.

3. Device Control Register (DCR) Bus:
o This is a low-speed, low-complexity bus used mainly for configuration and control functions. It is often daisy-chained for simplicity.

Comparison between CoreConnect and AMBA Architectures

In a System-on-Chip (SOC) environment, integrating reusable Intellectual Property (IP) blocks with different bus standards can be challenging, because each bus standard has its own protocol that may not be compatible with the others. To solve this, bus interface units are used, which include bus sockets and bus wrappers. These components isolate the IP core from the bus protocol, enabling flexibility in connecting IP blocks across different bus systems.

Bus Wrappers/Hardware Sockets: these interface components sit between the IP core and the physical bus. They enable communication across different bus protocols by adapting the IP core's protocol to match the bus's protocol.
Contention and the Shared Bus

In bus-based systems, contention arises when multiple units (such as processors or memory modules) request access to a shared resource (such as a bus) at the same time. Contention causes delays because only one request can be serviced at a time. There are two ways to handle a denied request:

1. Idle Until Available: the requesting unit waits, remaining idle until the shared resource becomes available.

2. Queue in a Buffer: the request is placed in a buffer, allowing the unit to continue other processing until the resource is free. This approach only works when the requested item is not critical to the current execution, as with a cache prefetch.

Whether a bus must be analyzed for contention depends on its bandwidth relative to the memory bandwidth. If the bus is the bottleneck (i.e., it has less available bandwidth than the memory), it must be analyzed for contention, since it restricts data flow. Buses with no buffering slow the system down because denied requests are rejected immediately.

There are two main access patterns:

1. Requests without Immediate Resubmission: a denied request does not need to be fulfilled immediately, so the system can continue. For instance, a cache line prefetch can wait without stalling the program.

2. Requests that Are Immediately Resubmitted: in this common case, a denied request must be resubmitted at once. This is typical of systems in which multiple processors share a bus and the program cannot proceed until the request is granted, so the processor remains idle until the resource becomes available.

Analytical Models of Buses and Computing Offered Occupancy

1. Simple Bus Model: Without Resubmission

 Bus Transaction Time (T_line_access): the time the bus takes to service one request.

 Processor Time (T_processor): the average time a processor spends computing before making its next bus request.

Occupancy Calculation (ρ):

 The offered bus occupancy per processor is

ρ = T_line_access / (T_line_access + T_processor)

 This ratio indicates how often a single processor would keep the bus busy relative to the total time available.

 The probability that a given processor is not using the bus is 1 − ρ, so the probability that none of n processors is requesting the bus is (1 − ρ)^n.

 The realized bus bandwidth is therefore the fraction of time the bus is busy, 1 − (1 − ρ)^n, multiplied by the bus's peak bandwidth.

Achieved Occupancy (ρ_a):

 The achieved occupancy per processor is

ρ_a = (1 − (1 − ρ)^n) / n

 This tells us how much of the bus's bandwidth is effectively used by each processor.

Impact of Congestion (a sketch of these formulas follows):

 A processor's speed is reduced by the ratio ρ_a / ρ due to bus congestion, which quantifies the performance impact of contention.
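
A minimal sketch of the simple (no-resubmission) model as reconstructed above; the timing values are illustrative assumptions only:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double t_bus  = 4.0;    /* bus transaction time, cycles (assumed)        */
    double t_proc = 16.0;   /* mean compute time between requests (assumed)  */
    int    n      = 8;      /* number of processors sharing the bus          */

    double rho   = t_bus / (t_bus + t_proc);   /* offered occupancy per processor  */
    double busy  = 1.0 - pow(1.0 - rho, n);    /* fraction of time the bus is busy */
    double rho_a = busy / n;                   /* achieved occupancy per processor */

    printf("offered occupancy rho    = %.3f\n", rho);
    printf("achieved bus occupancy   = %.3f\n", busy);
    printf("achieved occupancy rho_a = %.3f\n", rho_a);
    printf("per-processor slowdown   = %.3f (rho_a / rho)\n", rho_a / rho);
    return 0;
}
```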
2. Bus Model with Request Resubmission

This model uses a more detailed analysis to handle the case in which denied requests are resubmitted.

Iterative Solution:

 Two coupled equations relate the achieved occupancy (ρ_a) to the actual offered request rate a.

 The iterative process starts with an initial guess a = ρ and typically converges within about four iterations.

3. Computing the Offered Occupancy

Mean Bus Transaction Time:

 The model requires the average bus transaction time in order to compute the offered occupancy, which indicates how busy the bus would be without contention (ranging from 0.0 to 1.0).

Types of Transactions:

 Blocking Transactions: after initiating a bus request, the processor stays idle until the bus transaction completes.

o For a single bus master, the achieved occupancy ρ_a equals the offered occupancy ρ.

o For multiple bus masters, the total offered occupancy becomes nρ, and contention can occur, so the bus model is needed to find ρ_a.

 Nonblocking (Buffered) Transactions:

o More complex processors can continue processing after making a bus request, potentially issuing several requests before the first one completes.

SOC Customization

 Customization in System-on-Chip (SoC) design refers to optimizing hardware and software to meet specific application requirements and implementation constraints.
 Customization can occur at several stages: at design time (which includes fabrication time and compile time) and at run time. Each stage shapes the final characteristics of the SoC.

Stages of Customization

1. Fabrication Time:

o This is when the physical device is constructed. For custom chips, such as ASICs, much of the functionality is fixed at this stage.

o If the device is configurable, further customization remains possible after fabrication.

2. Compile Time:

o This stage generates, from design descriptions, the configuration information that will be used to customize the device at run time.

o It includes producing instructions tailored to the specific architecture of the processor.

3. Run Time:

o Customization can also occur while the system is operational, allowing dynamic reconfiguration to adapt to changing application needs.

Classifications of Customizable SoCs

Customizable SoCs can be classified by how the reconfigurable fabric interfaces with the processor:

 Attached to the System Bus: the reconfigurable fabric connects to the system bus, allowing shared access but with potentially lower performance.

 As a Co-Processor: the fabric acts as a co-processor, providing tighter coupling and faster communication with the main CPU.

 Tightly Coupled Fabric: the reconfigurable fabric is part of the processor itself, enabling custom instructions that can enhance performance.

 Embedded Processor in Programmable Fabric: the processor is implemented within the reconfigurable fabric using its resources, as with soft processors such as MicroBlaze and Nios.

Customizing Instruction Processors

Instruction processors in a System-on-Chip (SoC) can be specialized for specific tasks, such as media processing or encryption. Customization usually happens before fabrication but can also be applied to soft processors. The goal is to optimize performance in terms of speed, size, power consumption, and accuracy.

Approaches to Customization

1. Family of Processors: some companies, such as ARM, offer families of processors optimized for different applications. For example:

o Cortex-A series for high-performance applications.

o Cortex-M series for low-power microcontrollers.

2. Custom Processor Generation: companies such as ARC and Tensilica provide tools that let designers configure processors by choosing the features they need and removing unnecessary ones, optimizing the design for specific applications.

Tools and Automation

Modern SoC design tools automate much of the customization process, making it easier and faster to create custom processors. Common capabilities of these tools include:

 Integrating components from various sources.

 Generating test scripts and simulation models.

 Extending software tools to support custom instructions.

Architecture Description Languages

As processors become more complex, architecture description languages help automate their design and the associated software tools. These languages let designers describe the processor at a high level so that tools can automatically generate the required hardware and software.

Description Languages:

 Behavioural languages: focus on the instruction set, making it easier to generate tools such as compilers. They offer high abstraction but less flexibility in hardware design. Examples: nML and TIE.

 Structural languages: describe the hardware components and their connections. They allow direct hardware synthesis but require more detailed specifications. Example: SPREE.

Hybrid approaches: some languages combine behavioural and structural elements for greater flexibility. Example: LISA.

Identifying Custom Instructions

Designers can identify which custom instructions to add by analysing high-level application descriptions. Techniques include:

 Grouping related operations.

 Using methods like VLIW (Very Long Instruction Word) to execute multiple operations simultaneously.

 Developing vector operations that work on multiple data items at once.

These techniques help in creating efficient processors tailored for specific tasks while accounting for factors such as power consumption and performance.
Reconfigurable Functional Units (FUs)

Types of FUs

Reconfigurable functional units (FUs) fall into two main categories according to their granularity (a LUT sketch follows this list):

1. Fine-Grained FUs:

o These implement simple functions at a very small scale, usually operating on single bits or small groups of bits.

o The most common fine-grained FUs are Look-Up Tables (LUTs), typically found in FPGAs. A LUT can be configured to perform any function of up to three inputs by setting the appropriate bit pattern.

o LUTs are often grouped into clusters (e.g., Logic Array Blocks in Altera FPGAs and Configurable Logic Blocks in Xilinx FPGAs). Each cluster allows flexible implementation of a variety of digital circuits.

2. Coarse-Grained FUs:

o These FUs are larger and handle more complex functions, often integrating components such as arithmetic and logic units (ALUs).

o Examples:

 Embedded Multipliers: some FPGAs (e.g., Xilinx Virtex) include embedded units such as 18×18-bit multipliers to perform multiplication efficiently. These are highly beneficial for applications requiring extensive multiplication but are of little use otherwise.

 DSP Blocks: Altera's DSP blocks (e.g., in Altera Stratix) handle a range of operations, providing more flexibility than simple multipliers, but may require more area and be slower for specific tasks.
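
As a minimal sketch of how a 3-input LUT realizes an arbitrary logic function, the configuration is just an 8-bit truth table indexed by the inputs (the chosen function, a full-adder sum bit, is only an illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* A 3-input LUT is configured by an 8-bit truth table: bit i of `config`
 * is the output when the inputs form the value i = (c<<2)|(b<<1)|a. */
static int lut3(uint8_t config, int a, int b, int c) {
    int index = (c << 2) | (b << 1) | a;
    return (config >> index) & 1;
}

int main(void) {
    /* Example configuration: a XOR b XOR c (full-adder sum bit) = 0x96. */
    uint8_t sum_cfg = 0x96;

    for (int c = 0; c < 2; c++)
        for (int b = 0; b < 2; b++)
            for (int a = 0; a < 2; a++)
                printf("a=%d b=%d c=%d -> %d\n", a, b, c, lut3(sum_cfg, a, b, c));
    return 0;
}
```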

Example of a Coarse-Grained Architecture: ADRES

The ADRES architecture is an example of a reconfigurable system built entirely from coarse-grained FUs. Each FU is a 32-bit ALU (Arithmetic Logic Unit) that can perform functions such as addition, multiplication, and logic operations.

 Characteristics:

o Each FU contains two small register files.

o The ALU is designed to handle tasks such as addition or multiplication very efficiently.

o Although these FUs are less flexible than fine-grained FUs (like LUTs), they implement operations that match their capabilities very efficiently, making them ideal for applications dominated by arithmetic and logic functions.
Reconfigurable Interconnects

Whether the FUs are fine-grained or coarse-grained, they need to be connected flexibly for good performance. There are two types of reconfigurable interconnect architectures:

1. Fine-Grained Interconnects:

o Each wire can be switched independently, giving maximum flexibility in routing signals between FUs.

o Fine-grained routing is common in FPGAs, where the FUs are arranged in a grid and connected through horizontal and vertical channels. This flexibility comes with increased complexity and overhead.

2. Coarse-Grained Interconnects:

o Connections switch entire buses as a unit rather than individual wires, requiring fewer programming bits and incurring lower overhead.

o Examples:

 Totem System: features flexible interconnects that can establish arbitrary connections between FUs.

 Silicon Hive System: less flexible but faster and smaller; it is designed to connect only those units that are likely to communicate with each other.
Software Configurable Processors

Software configurable processors, developed by Stretch, combine conventional instruction processing with a reconfigurable fabric, letting application programs customize the instruction set dynamically.

Key Features of Software Configurable Processors

1. Architecture:

o Conventional Processor: at its core is a 32-bit Reduced Instruction Set Computer (RISC) processor.

o Programmable Instruction Set Extension Fabric (ISEF): this component extends the conventional processor's capabilities by allowing custom instructions tailored to specific applications.

2. Performance Benefits:

o Data Parallelism: the ability to perform multiple operations simultaneously, giving higher throughput.

o Operator Specialization: custom operations can be defined to optimize specific computational tasks.

o Deep Pipelining: the architecture supports deep pipelining of instructions, allowing multiple stages of instruction processing to proceed simultaneously, which increases efficiency.

3. ISEF Components:

o ALUs and Multipliers: the ISEF consists of blocks containing arrays of 4-bit ALUs and multipliers. The 4-bit ALUs can be cascaded via a fast carry circuit to create larger ALUs (up to 64 bits).

o Logic Functions: each 4-bit ALU can implement multiple 3-input logic functions and has four register bits for storing instruction state variables or supporting pipelining.

4. Instruction Handling:

o Extension Instructions: the ISEF can hold multiple application-specific instructions, called extension instructions. Each can read up to three 128-bit operands and write up to two 128-bit results using a set of 32 wide registers (128 bits each).

o Load/Store Instructions: a comprehensive set of dedicated instructions supports efficient movement of data between the registers, cache, and memory.

o State Variables: extension instructions can define arbitrary state variables in the ISEF's registers, allowing multiple instructions to share state information and reducing wide-register traffic.

Mapping Designs to FPGAs

1. Design Description:

o Designs for FPGAs are usually described using Hardware Description Languages (HDLs) such as VHDL and Verilog at the Register Transfer Level (RTL). This level of description specifies the operations performed in each clock cycle.

2. Synthesis Process:

o The synthesis process consists of several stages:

 Identification of Operations: the initial stage identifies datapath operations and translates them into basic logic gates (AND, OR, XOR).

 Netlist Optimization: the netlist of basic gates is optimized for size and efficiency.

o The optimized netlist is then mapped to the specific FPGA architecture (e.g., Xilinx Virtex or Altera Stratix).

3. Architecture-Specific Optimization:

o Additional optimizations exploit the FPGA architecture, such as dedicated carry chains for adders or specific shift functions in the logic blocks.

o Packing and Clustering: LUTs and registers are packed and clustered into logic blocks to minimize the interconnections between blocks.

4. Placement and Routing:

o Placement: the optimized logic blocks are placed on the FPGA with goals such as speed, routability, and wire length in mind.

o Routing: this step determines how the logic block inputs and outputs connect through the FPGA's programmable routing resources, ultimately generating a configuration bitstream that defines these connections.

5. High-Level Programming Support:

o To improve developer productivity, high-level programming languages and tools (e.g., Autopilot, Harmonic, ROCC) are included in the FPGA tool flow. These tools allow developers to work without deep knowledge of the hardware implementation and can automatically optimize designs for parallelism and pipelining.

6. Analysis Tools:

o Additional tools analyze metrics such as delay, area, and power consumption to ensure that the circuit meets the application requirements.

Instance-Specific Design

Instance-specific design refers to customizing hardware and software implementations to optimize performance for particular computations. The aim is to increase speed and reduce resource usage, and thereby power and energy consumption, at the cost of some flexibility. Three primary techniques are used to automate instance-specific design:

1. Constant Folding (a sketch follows)

 Constant folding propagates known, static input values through a computation to eliminate unnecessary hardware or software operations.

 In hardware design, if certain filter coefficients are constant, the design can be specialized to use one-input constant-coefficient multipliers instead of two-input multipliers. This specialization yields smaller and faster multipliers.

 By optimizing a design for fixed parameters, instance-specific designs can yield significant efficiency improvements, making reconfigurable logic potentially more effective than ASICs for certain applications. For example, in FIR (Finite Impulse Response) filters, techniques such as modified common-subexpression elimination can reduce FPGA slice usage by up to 50% and LUT usage by up to 75%, which also translates into substantial reductions in dynamic power consumption.
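
A minimal software analogue of the constant-folding idea above: when the FIR coefficients are known constants, the generic multiply-accumulate can be specialized into shift-and-add form, much as a hardware constant-coefficient multiplier would be (the coefficients and tap count here are illustrative):

```c
#include <stdint.h>

/* Generic 3-tap FIR: coefficients are run-time inputs (two-input multiplies). */
int32_t fir_generic(const int32_t x[3], const int32_t h[3]) {
    return x[0] * h[0] + x[1] * h[1] + x[2] * h[2];
}

/* Instance-specific version with constants h = {4, 6, 4} folded in:
 * each constant multiply becomes shifts and adds. */
int32_t fir_folded(const int32_t x[3]) {
    int32_t t0 = x[0] << 2;                  /* 4 * x[0]             */
    int32_t t1 = (x[1] << 2) + (x[1] << 1);  /* 6 * x[1] = 4x + 2x   */
    int32_t t2 = x[2] << 2;                  /* 4 * x[2]             */
    return t0 + t1 + t2;
}
```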
2. Function Adaptation

 Function adaptation modifies functions in hardware or software to find the best trade-off between performance, resource usage, and output quality for a specific application instance.

 Word-length optimization is a key form of function adaptation. In FPGA implementations, the word length and scaling of signals in a digital signal processing (DSP) system can be customized to the application. This flexibility lets designers choose variable sizes that trade off numerical accuracy, design size, speed, and power consumption, unlike the fixed word lengths of traditional microprocessors.

3. Architecture Adaptation

 Architecture adaptation modifies the underlying hardware and software architecture to suit a specific application instance, for example by supporting relevant custom instructions.

 For instance-specific applications, this may involve creating custom instructions that accelerate particular computations, improving the efficiency of the architecture.

Customizable Soft Processor: CUSTARD

CUSTARD is a multithreaded soft processor designed to demonstrate processor customization through a flexible instruction set and multithreading capabilities.

 Single-Cycle Context Switch: allows quick switching between threads, reducing overhead and improving the interleaving of execution.
 Latency Hiding: if one thread waits (e.g., for memory), the processor can switch to another thread, preventing stalls.
 Resource Management: supporting multiple threads requires more register files, but modern FPGAs have enough on-chip memory to handle these additional needs efficiently.

Key architectural features include:

 Base Architecture: CUSTARD uses a standard load/store RISC architecture similar to MIPS, with a fully bypassed and interlocked 4-stage pipeline. The architecture can also accommodate custom instructions using spare opcode space.

 Parameterization: CUSTARD supports four main sets of parameters for customization:

1. Multithreading Support: choose the number of threads and the threading type (Block Multithreading (BMT) or Interleaved Multithreading (IMT)).

Block Multithreading (BMT):

o Context switches are triggered by run-time events in the currently active thread, such as cache misses or explicit yield instructions.

o When only one thread is active, BMT behaves like a conventional single-threaded processor. With multiple threads, it hides latencies by switching contexts during the execution stage of the pipeline.

Interleaved Multithreading (IMT):

o A context switch occurs every cycle, giving interleaved execution of threads.

o IMT simplifies the pipeline architecture, because independent instructions can be guaranteed in certain pipeline stages, reducing hazards. This allows the processor to be optimized by selectively removing unnecessary forwarding paths.

2. Custom Instructions: define custom instructions and the associated datapaths for their execution.

Custom Instruction Generation

 Instruction Customization Flow: the CUSTARD tool flow generates custom instructions through a pre-optimization stage and a scheduling stage that orders instructions to minimize pipeline stalls.

 Hardware and Software Integration: the compilation process generates both the hardware datapaths for the custom instructions and the software that executes on the CUSTARD processor. Custom instructions are encoded into unused portions of the opcode space, extending processor functionality.

3. Forwarding and Interlock Architecture: specify the need for branch and load delay slots and for forwarding paths.

4. Register File Configuration: customize the number of registers and the ports per register file for greater flexibility.
1 Give the context of usage of integrated (unified) caches and the split I/D cache

 Integrated (Unified) Cache: combines instruction and data caching in a single cache. This can optimize cache usage and save space but may lead to contention when instructions and data are accessed simultaneously.
 Split I/D Cache: separates the instruction (I-cache) and data (D-cache) caches, allowing concurrent access to instructions and data and improving performance at the cost of added complexity and space.

2 List the system-level issues and specifications used to choose an interconnect architecture

 Bandwidth and Latency Requirements: determine data transfer speed and delay tolerance.
 Scalability: supports growing numbers of cores or components.
 Power Efficiency: minimizes power consumption, especially for mobile or low-power devices.
 Compatibility: aligns with existing protocols and component standards.
 Reliability and Fault Tolerance: ensures stable operation and error handling.
 Cost and Complexity: balances performance against budget and design constraints.

3 What is instance-specific design, and what are the ways to automate it?

Instance-Specific Design: customizes a design for a specific instance or set of inputs to optimize performance, power, or area.

Automation Methods:

 Constant Folding: simplifies expressions by pre-computing known constants at compile time.
 Function Adaptation: customizes functions based on specific input patterns or operations.
 Architecture Adaptation: tailors the hardware architecture to optimize for particular applications or workloads.

4 State reasons why system design is more challenging than processor design

 Complex Interactions: system design must manage interactions across multiple heterogeneous components, unlike single-processor design.
 Broad Requirements: it balances power, performance, security, and scalability, often with conflicting demands.
 Integration and Compatibility: it must ensure seamless integration of hardware, software, and interfaces.
 Customization Needs: it adapts to diverse applications, requiring flexible, domain-specific optimizations.
 Reliability and Fault Tolerance: it demands higher resilience because of complex dependencies and varied usage conditions.

5 How can the problems raised in the bus contention model be addressed?

 Arbitration: use bus arbitration techniques to control access, ensuring only one device communicates at a time.
 Bus Segmentation: split the bus into segments to reduce contention.
 Caching: cache frequently accessed data to minimize bus accesses.
 Faster Bus Protocols: upgrade to higher-speed protocols to reduce delay.
 Prioritization: implement priority levels so that critical data is served first.

7 What is AddRoundKey?

AddRoundKey: a step in cryptographic algorithms, particularly in AES (Advanced Encryption Standard), in which a round key is combined with the current state of the data using a bitwise XOR. This operation adds security by mixing the key into the data during encryption or decryption.

8 What is resolution? Describe the P-frame.

Resolution: in video, resolution refers to the amount of detail an image holds, typically defined by its width and height in pixels (e.g., 1920 × 1080).

P-Frame (Predictive Frame): a type of video frame that stores only the differences between the current frame and a reference frame (usually a preceding I-frame or P-frame). P-frames use motion compensation to compress video data efficiently, reducing file size while maintaining quality by predicting the contents of the frame from previous frames.

9 What is the DCT?

DCT (Discrete Cosine Transform): a mathematical transform used in signal processing and image compression (e.g., JPEG) to convert spatial-domain data into frequency-domain data. It reduces redundancy by representing the image in terms of its frequency components, allowing effective compression while preserving the essential visual information.

10 What are packet and bus transactions?

Packet Transaction: a data packet is a formatted unit of data carried by a packet-switched network. In packet transactions, data is divided into packets for transmission, enabling efficient routing and error checking.

Bus Transaction: a bus transaction is a communication between devices over a shared bus. It includes the transfer of data, control signals, and address information, allowing devices to request and transfer data in a coordinated manner.

11 What are arbitration and a bus bridge?

Arbitration: a method used to control access to a shared resource (such as a bus) among multiple devices. It determines which device may use the bus at any given time, preventing conflicts and ensuring orderly communication.

Bus Bridge: a hardware component that connects two different bus architectures, allowing data transfer between them. It lets devices that operate on incompatible bus standards communicate, ensuring interoperability within a system.

12 Differentiate tenured and unified bus architectures

Tenured Bus Architecture: a bus structure in which transactions are buffered (tenured), so one transaction's data can occupy the bus even as new transactions are initiated; resources can thus be dedicated to high-bandwidth or low-latency traffic, optimizing performance for specific needs.

Unified Bus Architecture: a single bus system that handles all data types and device communications over the same pathway. It simplifies design and integration but can lead to contention and bottlenecks, since all devices share the same bus bandwidth.

13 What are a socket and a bus wrapper?

Socket: a software endpoint for sending and receiving data across a network. It gives programs a way to communicate over a network using protocols such as TCP or UDP, encapsulating the necessary networking functionality.

Bus Wrapper: a hardware or software component that encapsulates and manages the communication between a device and a bus interface. It translates signals and protocols to ensure compatibility and efficient data transfer between devices and the bus system.

14 What is access time?

Access Time: the time it takes to retrieve data from a storage device or memory after a request is made. It includes the time needed to locate the data and the time to transfer it, and it influences overall system performance.

Memory Bandwidth: how quickly the memory can service multiple requests. More independent memory arrays and optimized access methods improve bandwidth.

15 Tag, index, offset (a decoding sketch follows this list)

 Tag: the portion of an address used to identify whether a specific block of data is stored in the cache. It distinguishes between different memory addresses that map to the same cache line.
 Index: the part of an address that selects which cache line or set to access. It determines where to look for the data within the cache.
 Offset: the location within a cache line or memory block that indicates the exact byte or word being accessed within the selected line.
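
A minimal sketch of how the tag, index, and offset fields are extracted from an address, for an assumed cache geometry (32-byte lines, 256 sets):

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 32u     /* assumed line size -> 5 offset bits */
#define NUM_SETS   256u    /* assumed set count -> 8 index bits  */

int main(void) {
    uint32_t addr = 0x1234ABCDu;                        /* example address      */

    uint32_t offset = addr % LINE_BYTES;                /* byte within the line */
    uint32_t index  = (addr / LINE_BYTES) % NUM_SETS;   /* which set            */
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);   /* remaining high bits  */

    printf("addr=0x%08X  tag=0x%05X  index=%u  offset=%u\n",
           addr, tag, index, offset);
    return 0;
}
```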

16 What are warm and cold caches?

 Warm Cache: a cache that has been used recently and contains data that is likely to be accessed again soon. It typically shows a higher hit rate because frequently accessed data is already present.
 Cold Cache: a cache that is empty or has not yet been populated with relevant data. It has a lower hit rate, since the required data may not be present, leading to more cache misses.

17 What is a soft processor?

Soft Processor: a processor implemented in the programmable logic of an FPGA rather than in dedicated hardware. It is flexible and customizable and is used for specific applications or prototyping, but it is generally slower than a hard processor.

18 What is meant by a pipeline break?

Pipeline Break: a disruption in the sequential flow of instructions in a pipeline, caused by hazards such as data dependencies, control dependencies, or resource conflicts. It stalls or flushes the pipeline, temporarily reducing performance.

19 Components of the execution unit

 ALU (Arithmetic Logic Unit): performs arithmetic and logical operations.
 Floating-Point Unit (FPU): handles floating-point calculations.
 Registers: store operands and intermediate results.
 Control Logic: manages instruction execution and coordination.
 Shifter/Barrel Shifter: executes bit-shifting and rotation operations.
 Branch Unit: handles branching and conditional instructions.

20 What is the processor as IP?

Processor as IP: a pre-designed, reusable processor core provided as Intellectual Property (IP) for integration into custom chips (SoCs). It simplifies design, accelerates development, and can be customized for specific applications. Examples include ARM Cortex and RISC-V cores.

21 Why include memory on the processor die?

Advantages:

 Faster access time and better bandwidth.
 Reduced reliance on cache memory.
 Improved performance for memory-intensive tasks.

Challenges:

 Technology difference (logic vs. DRAM processes).
 Limited size.

Parallelism

 pipelining
 multiple execution units
 multiple cores

Levels of parallelism

 instruction level
 loop level
 procedure level
 program level

Pipelining

 static: every stage
 dynamic: skip stages

Pipeline breaks

 data conflict
 resource contention
 run-on delay
 branch

Scheduling

 static: compiler
 dynamic: hardware

SOC memory structure

 simple: RAM, ROM
 complex: off-chip memory, MMU, cache hierarchies

Interconnection approaches

 bus based
 NoC

Key considerations for processor selection

 system software
 compute limited
 memory interconnects

Instruction set architecture

 L/S (load/store)
 R/M (register/memory)

Interrupts

 user requested vs. coerced
 maskable vs. non-maskable
 terminate vs. resume
 asynchronous vs. synchronous
 between vs. within instructions

Buffer design

 mean-request-rate buffers
 maximum-request-rate buffers

Branch cost reduction

 simple:
o branch elimination
o simple branch speedup
 complex:
o branch target capture
o branch prediction
 fixed
 static
 dynamic
 bimodal
 two-level adaptive

Types of data dependencies

 read after write
 write after read
 write after write

Considerations for interconnect design

 communication bandwidth
 communication latency
 master and slave roles
 concurrency requirement
 packet or bus transactions
 multiple clock domains

Bus bridge

 protocol conversion
 traffic segmentation
 memory buffering

Bus

 unified
 split
 single transaction vs. tenured

Types of transactions

 blocking transactions
 non-blocking transactions

Customizable SOC

 attached to the system bus
 as a coprocessor
 tightly coupled fabric
 embedded processor

Multithreading

 block multithreading
 interleaved multithreading
