SOC
o Two primary functions: behaviour simulation and geometry translation.
o Key components:
COMPONENTS OF SOC
The system architecture defines the system-level building blocks, such as processors and memories, and the interconnection between them.
The processor architecture determines the processor's instruction set, the associated programming model, and its detailed implementation.
The implementation of a processor is also known as its microarchitecture.
Some of the basic elements of an SOC system include a number of heterogeneous processors interconnected to one or more memory elements. Frequently, the SOC also has analog circuitry for managing sensor data and analog-to-digital conversion, or to support wireless data transmission.
Modern processors aim to execute multiple instructions simultaneously using various techniques:
o Concurrent execution involves multiple instructions processed at the same time, often transparent to the programmer.
o Pipelining: Divides instruction execution into stages, with multiple instructions in different stages at the same time.
o Multiple Execution Units: Allows processors to handle different tasks concurrently.
o Multiple Cores: Provides parallelism by running instructions on separate cores.
Levels of Parallelism
o Achieved using:
1. Instruction-Level Parallelism:
2. Loop-Level Parallelism:
3. Procedure-Level Parallelism:
4. Program-Level Parallelism:
2. PIPELINED PROCESSOR:
Pipelining in a processor is a technique that allows the CPU to work on multiple instructions at the same time by dividing the processing of instructions into different stages.
Each stage of the pipeline handles a different part of the instruction, such as fetching the instruction, decoding it, executing it, and so on.
Phases of Instruction Processing:
1. Instruction Fetch (IF): The CPU fetches the instruction from memory.
2. Instruction Decode (ID): The CPU decodes the fetched instruction to understand
what needs to be done.
3. Address Generation (AG): The CPU calculates the memory addresses required
for the instruction.
4. Data Fetch (DF): The CPU accesses the operands needed for the instruction.
5. Execution (EX): The CPU performs the operation defined by the instruction.
6. Write Back (WB): The CPU writes the result of the execution back to the register
or memory.
Overlapping of phases increases the efficiency of the CPU because it allows the processor to work on multiple instructions at different stages rather than waiting for one instruction to complete before starting the next.
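To make the overlap concrete, here is a minimal sketch (an illustration assuming an ideal six-stage pipeline with no stalls, not something from the notes) that prints which stage each instruction occupies in every cycle:

```python
# Ideal pipeline timeline: the six phases listed above, no stalls assumed.
STAGES = ["IF", "ID", "AG", "DF", "EX", "WB"]

def pipeline_timeline(num_instructions):
    """One row per instruction, one column per cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage            # instruction i enters stage s at cycle i + s
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_timeline(4)):
    print(f"I{i}: " + " ".join(row))
```

Once the pipeline is full, one instruction finishes every cycle instead of every six cycles.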
Static vs. Dynamic Pipelining:
1. Static Pipeline: The processor must go through every stage of the pipeline for
each instruction, regardless of whether all stages are needed. This is simpler but
less flexible.
2. Dynamic Pipeline: The processor can skip unnecessary stages depending on the
instruction's needs. Dynamic pipelines can even execute instructions out of
order, but they must ensure the program's final result is as if the instructions
were executed in the correct order.
Pre-decode block: recognizes and decodes instructions that should be kept together.
Rename Buffer: the processor allocates these buffers to the destination registers of instructions (register renaming).
Dispatch block: determines which instructions the CPU should issue next.
Reorder buffer: ensures instructions are completed (retired) in program order even when they execute out of order.
2. VLIW PROCESSOR:
Unlike superscalar processors, VLIW processors rely on the compiler to analyze and schedule instructions.
VLIW processors are less complex than superscalar processors because they do not need dynamic scheduling hardware. The complexity is offloaded to the compiler.
VLIW processors can potentially offer high performance, especially for applications where the parallelism can be statically determined by the compiler.
Limitations:
i. delayed results from operations whose latency differs from the assumed latency scheduled by the compiler, and
ii. interruptions from exceptions or interrupts, which change the execution path to a completely different schedule.
4. SIMD ARCHITECTURE:
SIMD (single instruction, multiple data) architectures are designed to handle operations on regular data structures like vectors and matrices efficiently.
SIMD processors can execute the same operation on multiple data points simultaneously.
1. ARRAY PROCESSORS:
Array processors consist of multiple interconnected processor elements (PEs), each with its own local memory.
A control processor broadcasts instructions to all PEs.
Each PE processes its portion of the data, with data being carefully distributed to minimize complex routing between PEs.
Ideal for tasks with regular data structures and uniform computations, such as solving matrix equations or other tasks involving large datasets.
Example: The ClearSpeed processor, designed for signal processing, is an example of an array processor.
2. VECTOR PROCESSOR:
A vector processor resembles a traditional processor but includes special function units and registers designed to handle vectors (sequences of data) as single entities.
Vector processors have deeply pipelined function units, enabling high throughput despite potentially higher latencies.
When a vector is longer than the processor's registers, it is processed in segments (see the sketch after this list).
Vector processors often support chaining, where the result of one operation can be immediately used by the next, allowing for efficient sequential computations with minimal latency.
Suitable for applications requiring high throughput on vectorized data, such as scientific computations and tasks involving large datasets.
Example: IBM mainframes offer vector instructions for scientific computing, highlighting their use in high-performance environments.
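As mentioned above, a vector longer than the machine's vector registers is processed in segments (strip mining). A minimal sketch of that idea, with an assumed register length of 64 elements:

```python
# Strip mining: process a long vector in register-sized segments.
VECTOR_REG_LENGTH = 64                 # assumed hardware vector register length

def vector_add(a, b):
    """Add two equal-length vectors one segment at a time."""
    assert len(a) == len(b)
    result = []
    for start in range(0, len(a), VECTOR_REG_LENGTH):
        seg_a = a[start:start + VECTOR_REG_LENGTH]          # "load" one segment
        seg_b = b[start:start + VECTOR_REG_LENGTH]
        result.extend(x + y for x, y in zip(seg_a, seg_b))  # one vector operation
    return result

print(vector_add(list(range(200)), list(range(200)))[:5])   # [0, 2, 4, 6, 8]
```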
o For advanced applications (e.g., operating systems), memory systems use:
Advantages:
Challenges:
1. Technology Difference:
2. Limited Size:
o On-die memory is small and unsuitable for applications requiring large memory.
MEMORY ADDRESSING
Why This is Important: Virtual memory ensures efficient memory usage and protection when multiple applications run simultaneously. However, translating between different types of addresses is essential to make this work seamlessly.
Address translation involves converting the addresses used in programs (virtual addresses) into physical addresses in memory. This is done in three steps:
1. Creating the Virtual Address
The application generates a process address, which is used to compute a virtual address:
Virtual Address = Offset + Base + Index
Why This is Important: The virtual address allows the program to reference memory without worrying about the actual physical location. This simplifies application programming and ensures portability.
2. Creating the System Address
Since multiple processes share memory, each process's virtual address must be mapped to a unique system address: System Address = Virtual Address + Process Base
o The system address must be within the bounds defined by the segment table.
Why This is Important: Segment tables manage memory spaces for each process, ensuring no overlap between processes and maintaining system stability and security.
If the memory space exceeds physical memory (common in large SoC applications), virtual memory comes into play:
o Only the most recently used pages are kept in memory; others remain on disk.
Upper bits: From the page table (to locate the page in memory).
Lower bits: From the virtual address (relative position within the page).
Why This is Important: This system allows applications to use more memory than is physically available, enabling the efficient use of limited hardware resources.
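A minimal sketch of the three-step translation described above (the table contents, page size, and field values are invented for illustration):

```python
PAGE_SIZE = 4096                                                 # assumed 4 KB pages
segment_table = {"procA": {"base": 0x10000, "limit": 0x8000}}    # hypothetical segment entry
page_table = {0x14: 0x7A, 0x15: 0x7B}                            # virtual page -> page frame

def translate(process, offset, base, index):
    virtual = offset + base + index                 # Virtual Address = Offset + Base + Index
    seg = segment_table[process]
    system = virtual + seg["base"]                  # System Address = Virtual Address + Process Base
    if system >= seg["base"] + seg["limit"]:
        raise MemoryError("system address out of segment bounds")
    page, page_offset = divmod(system, PAGE_SIZE)   # split into upper and lower bits
    frame = page_table[page]                        # upper bits come from the page table
    return frame * PAGE_SIZE + page_offset          # lower bits keep their position in the page

print(hex(translate("procA", offset=0x30, base=0x4000, index=0x10)))   # 0x7a040
```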
Address translation involves multiple tables (e.g., segment and page tables), which can be slow. To speed this up:
1. A Translation Lookaside Buffer (TLB) stores recently used address translations.
2. If the required translation is found in the TLB (a hit), it is used immediately.
3. If not, a not-in-TLB event occurs, and the system performs a full translation.
Why This is Important: The TLB reduces the overhead of frequent address translation, significantly improving performance.
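A minimal sketch (invented tables) of how a TLB sits in front of the full translation: hits skip the table walk, misses fall back to it and refill the TLB:

```python
PAGE_SIZE = 4096
page_table = {0x14: 0x7A, 0x15: 0x7B}    # virtual page -> frame (hypothetical)
tlb = {}                                  # small cache of recent translations

def lookup(virtual_address):
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page in tlb:                       # TLB hit: translation reused directly
        frame = tlb[page]
    else:                                 # not-in-TLB event: full translation, then refill
        frame = page_table[page]
        tlb[page] = frame
    return frame * PAGE_SIZE + offset

print(hex(lookup(0x14040)))   # miss: walks the page table
print(hex(lookup(0x14080)))   # hit: served from the TLB
```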
SYSTEM-LEVEL INTERCONNECTION
o A good interconnection method ensures that modules communicate effectively and operate in parallel without bottlenecks, which is crucial for high performance.
1. Bus-Based Approach
How It Works:
o Modules behave according to standard bus protocols like ARM's AMBA or IBM's CoreConnect.
o Communication occurs by sharing a physical bus with address, data, and control signals.
Features:
Limitation: As the number of connected modules grows, shared buses can become a bottleneck, limiting scalability.
2. Network-on-Chip (NoC) Approach
NoC is a newer approach, inspired by computer networks, where communication between modules occurs through a network of switches.
How It Works:
Advantages:
1. Higher Throughput:
Unlike buses, NoC supports simultaneous communications over multiple paths, avoiding congestion.
Mesh networks ensure fixed interconnection distances, minimizing wire delay and crosstalk.
Mesh Example:
The complexity of designing systems changes dramatically as more transistors become available per die. Let's break it down:
Single Processor: Initially, implementing a 32-bit pipelined processor with a small first-level cache may require about 100,000 transistors.
Cache: Adding multiple levels of cache (L1, L2, etc.) helps speed up data access, but each cache level comes with its own design challenges, especially as cache size grows.
Multiple Processors: Implementing multiple processors with their own multilevel caches increases design complexity significantly, as synchronization and memory access consistency across processors must be managed. As a result, the system architecture becomes much more complex, with millions of transistors required to coordinate the processors and memory.
To manage this complexity, design reuse becomes crucial. Instead of designing one advanced
processor for all tasks, designers can reuse simpler processors and specialize them for specific tasks.
This approach allows designers to:
Combine specialized processors for different parts of the application, improving performance by matching processors to the task at hand.
Processor Selection is Critical: The processor in an SoC design is a fundamental element since it runs the system's software. Often, the initial task is to select a processor that meets the functional and performance requirements of the application.
1. System Software: The selected processor must be able to run the specific system software required by the application.
2. Compute-limited Applications: For compute-intensive applications, the focus is on ensuring that the processor meets performance requirements, often including real-time constraints. In such cases, real-time processing ability becomes a primary design consideration early on.
3. Memory and Interconnects: During the initial design phase, the memory and interconnect components are simplified to "idealized components." These are treated as basic delay elements with conservative performance estimates, allowing for a simplified view of system behaviour in the early stages.
The goal is to choose a processor and set its parameters to build an initial design that meets the system's functional and performance requirements.
Example: Soft Processors
A soft core is a type of processor design stored in bitstream format. These processors can be programmed into a field-programmable gate array (FPGA). Soft processors are commonly used in SoC designs because they offer flexibility and customization.
Advantages of Soft Processors:
1. Cost Reduction: By integrating the processor design into an SoC, the need for separate chips is reduced, lowering system-level costs.
2. Design Reuse: Soft processors allow for reuse of existing processor designs, reducing time and effort for new projects, especially when variations of a processor are needed.
3. Customization: They allow the creation of processors that are customized for specific microcontroller/peripheral combinations.
Examples of Soft Processors:
Nios II (Altera): A soft processor developed for use on Altera FPGAs and ASICs.
Processor Core Selection involves choosing the right type of processor core based on system requirements, such as performance, area, power, and other factors. It's a critical step in designing a processor to meet the specific needs of a particular application.
1. Initial Design: Assume you start with a processor that has a performance of 1 and uses 100K "rbe" (register bit equivalent) of area.
2. Doubling Performance: If you want to double the performance (i.e., reduce execution time by half), it requires increasing the area to 400K rbe, and power increases by a factor of 8 (because each rbe now uses more power).
3. Memory System Impact: As performance increases, cache misses also increase, which negatively affects overall system performance. To counter this, you need to increase the cache size.
4. Cache Size Increase: To reduce cache misses, you need to double the cache size. If the initial cache size was 100K rbe, the new system would have 600K rbe and use significantly more power.
5. Is It Worth It?: The decision depends on whether there's enough available area and if power isn't a major concern. If the increased performance provides critical features (e.g., better security or I/O capabilities), it might be worth the trade-off.
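The arithmetic in this example, restated as a short sketch (the scaling factors are the rule-of-thumb values quoted above, not measurements):

```python
base_core_area = 100_000              # rbe, performance = 1
base_cache_area = 100_000             # rbe

new_core_area = 4 * base_core_area    # ~2x performance costs ~4x core area
new_cache_area = 2 * base_cache_area  # cache doubled to hold the miss rate down
new_total = new_core_area + new_cache_area

print(new_total)                      # 600000 rbe versus 200000 rbe originally
print("core power rises by roughly a factor of 8")
```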
Example 2: Compute Core Path
1. Parallelizable Application: Suppose the application can be split into smaller tasks (parallelized) and two options are available for increasing performance:
o Option 2: Multiple simpler processors, with a single processor using 100K rbe.
2. Increasing Performance: You need to increase the performance to 1.5. There are two main options:
o Option 1: Increase the number of vector pipelines, which doubles the area and power but keeps the clock rate unchanged.
o Option 2: Use an array of simpler processors. To meet the target, you need at least four processors, adding more interconnect and memory-sharing circuitry, resulting in a larger area.
o Application Partitioning: Can the work be easily divided for both options?
o Support Software: Is there software (compilers, OS) that works better for one approach?
o Fault Tolerance: Can the multiprocessor approach help with fault tolerance (i.e., reliability)?
o Integration: Can the multiprocessor approach be integrated with the rest of the system?
The architecture of a processor consists primarily of its instruction set, which defines the set of operations the processor can perform.
However, the actual implementation of the processor (microarchitecture) goes beyond the instruction set, involving trade-offs between area, time, and power to meet specific user requirements.
Instruction Set Basics
o Most processors use a register set to hold operands and addresses.
o Program Status Word (PSW): This includes control status information, like condition codes (CCs) that reflect the results of operations.
o Instruction Set Architectures:
Load/Store (L/S) Architecture:
Used in RISC processors.
Requires operands to be in registers before execution.
Simplifies instruction decode and execution timing.
Register/Memory (R/M) Architecture:
Used in processors like Intel's x86 series.
Allows operations between registers and memory directly.
More complex to decode but provides compact code with fewer instructions.
Branches
o Manage control flow (e.g., jumps, calls, returns). Conditional branches (BC) depend on the condition codes (CC) set by ALU instructions,
o for example, specifying whether the instruction has generated
1. a positive result
2. a negative result
3. a zero result
4. an overflow/underflow.
Interrupts and Exceptions
o User Requested vs. Coerced: User errors (e.g., divide by zero) vs. external events (e.g., device failure).
o Maskable vs. Non-maskable: Can be ignored or not.
o Terminate vs. Resume: May stop processing or allow continuation.
o Asynchronous vs. Synchronous: Occur independently or in sync with the processor's clock.
o Between vs. Within Instructions: Can be recognized either between or during instruction execution.
The faster the cache and memory, the smaller the number of cycles required for fetching instructions and data.
The control of the cache and execution unit is done by the instruction unit.
The pipeline mechanism or control has many possibilities. Potentially, it can execute one or more instructions for each cycle.
Pipeline performance is primarily limited by delays or breaks, which can arise from several factors:
o Data Conflicts (Data Hazards):
Occur when a current instruction requires a source operand that is the result of a preceding instruction that hasn't completed yet.
Solution: Extensive buffering of operands can reduce this conflict.
o Resource Contention:
Happens when multiple instructions compete for the same resource.
Solution: Adding more resources and using techniques like out-of-order execution can minimize contention.
o Run-On Delays (In-Order Execution Only):
Occurs when instructions must complete in the exact order they appear in the program. Any delay in one instruction will delay subsequent instructions. This is specific to in-order execution pipelines where instructions cannot be re-ordered.
o Branches:
The next instruction to be executed depends on the outcome of a branch (e.g., if-else conditions).
Solution: Techniques like branch prediction, branch tables, and branch target buffers help reduce delays caused by branches by predicting the outcome of the branch and pre-fetching the target instruction.
BASIC ELEMENTS IN INSTRUCTION HANDLING
Instruction handling in a processor involves several key components that work together to ensure the proper execution of instructions in the correct order.
Instruction Register: Holds the current instruction being executed.
Instruction Buffer: Pre-fetches instructions into registers, allowing them to be quickly decoded and executed. This helps to keep the pipeline full and reduces delays.
Instruction Decoder:
o Controls various components like the cache, Arithmetic Logic Unit (ALU), and registers.
o In pipelined processors, it helps in sequencing instructions, often managed by hardware.
Interlock Unit: Ensures that the concurrent execution of multiple instructions produces the same result as if they were executed serially.
The instruction decoder plays a critical role in managing the pipeline and ensuring correct execution.
Scheduling the Current Instruction:
o The decoder might delay the current instruction if there's a data dependency or if an exception occurs.
Scheduling Subsequent Instructions:
o Later instructions may need to be delayed to ensure that instructions complete in the correct order.
Branch Prediction:
o The decoder also selects or predicts the path of branch instructions, determining which instruction to execute next based on the outcome of conditional branches.
Data Interlocks
Data interlocks are mechanisms within the instruction decoder that manage dependencies between instructions. They ensure that an instruction does not use a result from a previous instruction until that result is available.
When an instruction is decoded, its source registers are compared with the destination registers of previously issued but uncompleted instructions.
If the execution of the current instruction takes more cycles than specified by the timing template, subsequent instructions may need to be delayed to maintain the correct execution order.
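A minimal sketch (invented register names) of the decoder-side check just described: a new instruction's source registers are compared against the destination registers of issued-but-uncompleted instructions, and the instruction is held back on a match:

```python
in_flight_destinations = {"r3", "r7"}     # destinations of issued, uncompleted instructions

def must_stall(source_registers):
    """True if any source operand is still being produced."""
    return any(src in in_flight_destinations for src in source_registers)

print(must_stall({"r1", "r3"}))   # True  -> delay this instruction
print(must_stall({"r1", "r2"}))   # False -> safe to issue
```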
Store Interlocks
BYPASSING (FORWARDING)
How it works:
o With bypassing, the result is directly routed from the ALU to the next stage of the pipeline (where it's needed) instead of waiting for it to be written to a register first. This reduces the time delay in getting the result to where it's needed.
Execution Unit
The execution unit is responsible for performing arithmetic and logical operations in a processor. It usually includes components like the ALU (Arithmetic Logic Unit) and FPU (Floating-Point Unit).
o The execution unit performs the core operations like addition, subtraction, multiplication, division, and more complex operations such as floating-point calculations.
o In some processors, multiple execution units (for integer and floating-point operations) work in parallel to handle different types of instructions.
o The FPU is a specific type of execution unit that handles floating-point arithmetic (e.g., calculations with fractional values), which is more complex than integer operations.
o Floating-point operations can take longer to execute, and the area needed for these units is usually large due to the complexity of the calculations.
Pipelining:
o In a pipelined execution unit, tasks are broken into smaller stages, and each stage can execute a part of the operation. This allows multiple operations to be processed at the same time (in parallel).
o The pipelined design allows for continuous throughput, with a new operation being completed in each cycle after the initial latency.
Area-Time Tradeoff:
o There's a tradeoff between the area (how much space the execution unit takes up on the chip) and the execution time (how fast it performs operations). A more complex FPU may provide higher precision but require more area and time to execute operations.
Buffers are essential components in processors that help manage the timing of instruction and data handling, reducing delays and improving overall performance.
Buffers are temporary storage areas that hold data or instructions while they are waiting to be processed or used.
Latency Tolerance: Buffers help the processor tolerate delays; even if there are delays in one part of the system, buffers can hold data until it is needed, reducing the impact of those delays.
Minimizing Pipeline Delays: By holding data temporarily, buffers prevent the pipeline from stalling due to delays in data retrieval or processing.
Branches in a processor can significantly impact performance, primarily because the processor needs to decide whether to take the branch and which instruction to fetch next. This decision introduces delays in instruction fetch and execution. Several approaches have been developed to reduce or mitigate the performance cost associated with branches. These approaches are categorized into simple and complex strategies.
Simple Approaches
1. Branch Elimination: This approach works for certain types of code sequences where the branch can be replaced with another operation, avoiding the branch altogether. This reduces the delay caused by branching.
2. Simple Branch Speedup: This method aims to reduce the time spent waiting for the branch's outcome. It speeds up the process of determining whether the branch will be taken and fetching the target instruction.
Complex Approaches
The more complex strategies involve improving the prediction of branch outcomes, which helps to fetch the right instructions ahead of time, minimizing delays due to branches.
1. Branch Target Buffer (BTB)
o What it is: A Branch Target Buffer (BTB) stores the target instruction of a branch that was previously executed, allowing the processor to fetch the target instruction early when the same branch is encountered again.
o How it works:
The BTB holds the target address and the corresponding instruction for branches that were recently executed.
When a branch is encountered again, the processor checks the BTB to see if the branch is listed. If it is, the processor can immediately fetch the target instruction without waiting for the branch to be fully resolved.
If the branch was not previously recorded or the prediction is wrong, the processor must still fetch and resolve the branch.
o Impact: This reduces the delay for branches that are commonly encountered, especially in loops or frequently executed code paths. The effectiveness depends on the hit ratio, or the probability that a branch is found in the BTB. A higher hit ratio leads to better performance.
2. Branch Prediction
o What it is: Branch prediction involves predicting the outcome of a branch instruction before it is resolved, so that the processor can continue executing instructions without waiting for the branch decision.
Fixed Strategy: This is the simplest form of prediction, where the processor always predicts that the branch will be taken or not taken (based on the branch type or other fixed factors). For example, predicting that backward branches in loops will be taken, and forward branches will not.
Static Strategy: This is more advanced than the fixed strategy, where the branch's opcode or direction (e.g., whether the branch is forward or backward) is used to predict its outcome. For example, backward branches are usually taken, and forward branches are not.
Dynamic Strategy: This strategy predicts branch outcomes based on the history of the branch's behaviour, using past execution data to guide future predictions. This strategy adapts based on how often branches are taken or not taken.
Bimodal Prediction: The simplest dynamic approach, where a saturating counter is used to track the outcome of a branch (taken or not taken) based on its past history. For example, a 2-bit counter records whether a branch was recently taken or not taken.
The counter has 4 possible states: 00 (predict not taken), 01 (predict not taken), 10 (predict taken), and 11 (predict taken). The branch is predicted as taken or not taken based on the counter's state.
Effectiveness: Bimodal predictors can achieve prediction accuracy between 83% and 96%, depending on the program's behavior.
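A minimal sketch of the 2-bit saturating counter described above (the starting state is an arbitrary choice): states 0-1 predict not taken, states 2-3 predict taken, and each outcome nudges the counter one step toward what actually happened:

```python
class TwoBitPredictor:
    def __init__(self):
        self.counter = 1                          # start in a weakly "not taken" state

    def predict(self):
        return self.counter >= 2                  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:         # actual branch behaviour
    print("predicted taken:", p.predict(), "actually taken:", outcome)
    p.update(outcome)
```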
A more sophisticated method keeps track of the history of the last few outcomes of a branch. A shift register records the recent history of the branch, and this history is used to index into a table of counters (similar to the bimodal approach) to make a prediction.
This method can adapt based on the pattern of taken and not taken branches. For example, if a branch has been taken twice, it may be predicted as taken in the future.
Effectiveness: This approach can improve prediction accuracy to 95% or higher for large programs with stable branch patterns.
Combined Methods:
Effectiveness: The combined methods can further improve prediction accuracy by making use of different types of information.
Vector processors are specialized computing units designed to handle vector operations
efficiently.
Vector instructions boost performance by
o reducing the number of instructions required to execute a program
o organizing data into regular sequences that can be efficiently handled by the
hardware
o representing simple loop constructs, thus removing the control overhead for loop
execution.
Statically scheduled: The compiler schedules the instructions in advance, determining which ones can be executed simultaneously. The instructions are grouped into instruction packets, which are then decoded and executed at run time.
2. VLIW Machines:
o VLIW processors from early manufacturers like Multiflow and Cydrome use long instruction words (up to 200 bits) where each fragment controls a specific execution unit. This allows for executing multiple instructions in parallel, but the processor must have a large register set to support this.
3. Simultaneous Multithreading (SMT):
o SMT allows multiple threads to use the same execution hardware, with each thread having its own registers and instruction counters. For example, a two-way SMT processor with two cores can run four programs simultaneously.
4. VLIW Data Path:
o The data paths in a VLIW processor require extensive use of register ports to allow for simultaneous access to multiple execution units. This can be a bottleneck because the number of register ports required increases with the number of execution units.
Superscalar Processors:
o Superscalar processors use multiple execution units (like ALUs) and multiple buses to connect registers and functional units. This allows for parallel execution of independent instructions, similar to VLIW processors, but with a key difference: independence detection is done in hardware.
2. Data Dependencies:
Write-After-Write (WAW): Happens when two instructions write to the same destination register, potentially causing incorrect results if executed in the wrong order (an output dependency).
Flash memory is a type of non-volatile storage, meaning it retains data even when power is lost. It's widely used in systems where large amounts of data need to be stored and retrieved but not frequently changed.
Structure: Flash memory uses floating-gate transistors, which store electrical charge to represent data. This charge is non-volatile, meaning it stays in place even without power.
Write Limitation: Flash memory has a limited number of write cycles, typically less than a million. After many writes, the memory can degrade, so error detection and correction are often added to improve reliability.
Types:
o NOR Flash: More flexible but has lower density. It's good for storing code (e.g., firmware).
o NAND Flash: Offers higher density (more data storage per unit) but is less flexible. It's ideal for storing large amounts of data.
o Hybrid NOR/NAND: A combination, using NOR for flexibility and NAND for density.
Use Cases: Flash memory is commonly used for storage in devices like USB drives, SD cards,
and embedded systems. Flash can be stacked in large sizes (up to 256 GB) for high-capacity
storage.
Variants: Flash technology is evolving with alternatives like SONOS (non-volatile) and Z-RAM (DRAM replacement). These newer types don't suffer from write cycle limitations, and Z-RAM offers high density and speed like DRAM.
The placement of memory in a System-on-Chip (SOC) design is crucial because it affects both
performance and system complexity.
On-die Memory: The memory is placed on the same chip as the processor, which allows faster access times due to shorter physical distances.
Off-die Memory: The memory is placed on a separate chip. This often increases the access time because the data has to travel over longer distances. However, it can provide larger memory sizes.
1. Access Time: How long it takes to retrieve data from memory. This depends on the distance and delays between the processor and memory.
2. Memory Bandwidth: How quickly the memory can handle multiple requests. More independent memory arrays and optimized access methods help improve bandwidth.
1. Scratchpad:
Scratchpad memory is a small, fast memory directly managed by the programmer. The programmer explicitly controls what data is stored and when it is accessed.
Scratchpad memory is particularly useful in System on Chip (SoC) designs where the application is well-known.
By eliminating the need for cache control hardware, scratchpad memory frees up space that can be used to increase the scratchpad size, leading to improved performance.
Limitations: Scratchpad memory is typically used for data rather than instructions because manually managing instruction storage and retrieval can be not worth the programming effort.
2. Cache memory
Cache memory is also a small, fast memory, but it is managed automatically by the hardware. The hardware decides what data should be stored in the cache based on the program's access patterns.
o Sequential Locality
Sequential locality is a subset of spatial locality where the memory locations accessed are sequential or contiguous.
In a loop where an array is processed element by element, the memory accesses typically proceed in a sequential manner.
Caches are designed to prefetch sequentially following memory blocks or lines to enhance performance for programs that exhibit this pattern.
CACHE ORGANIZATION
Cache memory is a small, fast storage area that stores frequently accessed data and instructions to speed up processing.
Fetch Strategies
1. Fetch-on-Demand:
This strategy brings data into the cache only when it is needed, i.e., when a
"miss" occurs (the data is not already in the cache).
Commonly used in simple processors. It only loads data into the cache when
the processor requests it and finds it missing.
2. Prefetch Strategy:
Anticipates the data that will be needed soon and loads it into the cache before the processor requests it.
Commonly used in instruction caches (I-caches). By preloading instructions, the processor can execute them without waiting for a cache miss.
2. Direct-Mapped Cache:
Each block of memory data is mapped to exactly one location (cache line) in the cache. The lower bits of the memory address are used as an index to locate the cache line.
Advantage: Fast, as the location is directly determined by the address, allowing for simultaneous access to both the cache array and directory.
Disadvantage: High conflict miss rate; if multiple memory addresses map to the same cache line, they will keep replacing each other.
3. Set-Associative Cache:
A hybrid between fully associative and direct-mapped caches. The cache is divided into "sets," and each memory block can be stored in any cache line within a set. A set-associative cache with 2 lines per set is called "2-way set-associative," with 4 lines per set is "4-way set-associative," and so on.
Advantage: Balances speed and flexibility, offering better performance than direct-mapped and simpler implementation than fully associative caches.
Disadvantage: Slower than direct-mapped but faster than fully associative. More complex to implement.
Cache Addressing
When accessing the cache, the memory address provided by the processor is divided
into several parts:
Tag: The most significant bits used to compare against the addresses in the
cache to check for a hit.
Index: Used to locate the specific set or line within the cache.
Offset: Identifies the specific word within the cache line.
Byte: Specifies a specific byte within the word, used during partial writes.
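A minimal sketch of this address split for an assumed direct-mapped cache with 256 lines of 32 bytes (so 5 offset bits and 8 index bits):

```python
LINE_SIZE = 32      # bytes per cache line -> 5 offset bits
NUM_LINES = 256     # lines in the cache   -> 8 index bits

def split_address(addr):
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_LINES
    tag = addr // (LINE_SIZE * NUM_LINES)
    return tag, index, offset

tag, index, offset = split_address(0x12345)
print(hex(tag), hex(index), hex(offset))   # 0x9 0x1a 0x5
```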
Write Policies
In cache memory systems, a cache miss occurs when a requested memory line is not found in the cache. When this happens, two main actions must be taken:
1. Fetching the Missed Line: The line must be retrieved from the main memory. Depending on the cache policy, this can work in two ways:
o Write-through Cache: The fetched line replaces the current line without any extra action.
o Copy-back Cache: If the line to be replaced has been modified (dirty), it must be written back to memory before replacement. If it's clean, it can simply be replaced.
To speed up this process, caches may use a nonblocking or prefetching approach, allowing the processor to continue executing instructions while the cache miss is handled, as long as the missing data isn't immediately needed by the processor.
2. Line Replacement Policy: When the cache is full, a replacement policy determines which line to remove:
o Least Recently Used (LRU): Replaces the least recently accessed line, aligning with the idea of temporal locality but is more complex to implement.
o First In – First Out (FIFO): Replaces the line that has been in the cache the longest, with simpler implementation than LRU.
o Random (RAND): Replaces a randomly selected line, simplest to implement but less efficient than LRU.
Although LRU generally performs best due to its alignment with temporal locality, simpler FIFO or RAND policies are often acceptable, resulting in only a slight performance loss.
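A minimal sketch of LRU replacement for one cache set (the two-way set size and the tag names are arbitrary): on a hit the line becomes most recently used, and on a miss with a full set the least recently used line is evicted:

```python
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()            # tag -> data, least recently used first

    def access(self, tag):
        if tag in self.lines:                 # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return "hit"
        if len(self.lines) >= self.ways:      # miss with a full set: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = "data"
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss']  ("B" was evicted when "C" arrived)
```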
3. Cache Environments and Miss Rates: The effectiveness of a cache can vary depending on system demands and environment:
o Multiprogrammed Environment: With multiple programs sharing memory, a warm cache is created, retaining some recently used lines when a program resumes, though the miss rate may increase slightly.
Both environments involve context switching before fully loading a program’s working set, which can
increase the overall miss rate due to disrupted data locality.
Other Types of Cache
Split (I/D) Cache
Split caches divide memory into separate instruction (I) and data (D) caches. This setup increases the cache bandwidth by allowing simultaneous access to both instructions and data, effectively doubling the access speed.
However, unified caches, which combine both data and instructions into a single cache, tend to have a lower miss rate at the same total size, as they adapt better to changing instruction-to-data demands during program execution.
Despite the higher miss rate, split caches are often more efficient in practice:
Flexible sizing: Split caches allow for unequal partitioning, like a 75% data cache and 25% instruction cache, optimizing for workload needs.
Simplified I-cache: Since the instruction cache (I-cache) only reads data and does not need to handle write operations, its design is simpler.
where:
In a multilevel cache system, performance analysis often focuses on both levels (L1 and L2) using data from the L1 cache. The principle of inclusion applies to systems where all data in L1 is also contained in L2.
1. Principle of Inclusion:
o If L2 has smaller lines, loading data to L1 could trigger multiple misses in L2,
Miss rate
Solo Miss Rate: The hypothetical miss rate if L2 were the only cache, defined by the principle of inclusion.
When L1 is the same size as or larger than L2, the principle of inclusion still provides a reliable estimation of L2 behavior.
When L2 is much larger than L1, L2 can operate independently, and its miss rate will align
with the solo miss rate, focusing analysis only on L2 behavior.
4. Logical Inclusion
ensures that all data in L1 cache is also present in L2, guaranteeing full consistency between
the two levels
Key Requirements for Logical Inclusion:
Write Policy:
o If L1 used a write-back policy, it would only write changes to L2 when the data is evicted from L1, creating temporary differences in content between L1 and L2. This makes logical inclusion challenging, as L1 and L2 could temporarily hold different values.
o When logical inclusion is needed, L1 and L2 should use coordinated cache policies to
ensure synchronized data content.
The TLB is a crucial component that translates virtual addresses (used by programs) into real
(physical) addresses that hardware needs to access memory. This translation is necessary because virtual memory allows processes to run as if they have their own dedicated memory space, but real addresses are needed for actual data access.
TLB Translation Process:
On-die memory is a type of embedded memory in System-on-Chip (SOC) designs, optimized for performance and space within the chip.
Designing a System on Chip (SoC) device is a complex process that involves balancing cost,
performance, and functionality. The process often requires multiple iterations to ensure the final
design meets all requirements.
Initial Project Plan: The design starts with a project plan, including budget, schedule, target
market, competitive analysis, and goals for cost and performance.
Placeholder Product Design: An early version of the design, known as a "straw man," is
created to give a rough idea of the product’s structure and performance.
Detailed Specifications and Analysis: All functions and performance requirements are
specified. Models are created to understand the trade-offs between functionality and
performance.
System Design:
o Memory and Processor Selection: Memory and storage are first allocated. Then,
processors are chosen, often with a base processor for the operating system.
o Interconnect Architecture: The memory layout and processor choices define how
components will connect and communicate. Bandwidth needs are analyzed and
cache is added to support data transfer speeds.
o Peripheral Selection: required, peripherals are selected based on bandwidth needs,
like a JPEG encoder for a camera.
Cost and Performance Estimation: The initial design is assessed to get a rough estimate of
overall cost and performance.
Optimization and Verification: Tools help refine the design by improving efficiency and
reducing cost. Each change is evaluated for its impact on accuracy, speed, and energy
consumption.
Final Evaluation: After several optimization rounds, the design’s profitability and market
potential are evaluated to decide on the final design.
The Advanced Encryption Standard (AES) is a widely used symmetric encryption algorithm that ensures data security. It operates on fixed-size blocks of data and employs a series of transformations to encrypt and decrypt that data.
Block Sizes:
Rounds:
o r − 1 Standard Rounds: Where r is determined by the key length (10 rounds for AES-
128, 12 for AES-192, and 14 for AES-256).
o One Final Round: Similar to the standard rounds but without the MixColumns step.
1. SubBytes:
o Each byte in the input block is replaced with a corresponding byte from a predefined substitution box (S-Box). This step adds non-linearity to the cipher.
2. ShiftRows:
o The bytes of the input are arranged into four rows. Each row is then rotated by a predefined step according to its row value.
3. MixColumns:
o Each column of the state is transformed by a fixed linear mixing operation that combines its four bytes, spreading the influence of each byte across the column.
4. AddRoundKey:
o The input block is XORed with a round key derived from the original encryption key. This operation is performed in each round and adds a layer of security.
Rounds Structure
Standard Rounds: Each of the four transformations (SubBytes, ShiftRows, MixColumns, AddRoundKey) is applied.
Final Round: The MixColumns transformation is omitted, and the other three transformations are applied.
Decryption involves applying the inverse transformations of the four main steps (using an inverse S-Box, inverse row shifts, and inverse column mixing).
The round transformations in AES can be parallelized, enabling faster implementations, especially in hardware architectures.
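To make two of the round steps concrete, here is a minimal sketch (not a full or secure AES implementation) of ShiftRows and AddRoundKey on a 4x4 byte state:

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def add_round_key(state, round_key):
    """XOR the state with the round key, byte by byte."""
    return [[b ^ k for b, k in zip(row, key_row)]
            for row, key_row in zip(state, round_key)]

state = [[0x00, 0x01, 0x02, 0x03],
         [0x10, 0x11, 0x12, 0x13],
         [0x20, 0x21, 0x22, 0x23],
         [0x30, 0x31, 0x32, 0x33]]
round_key = [[0xFF] * 4 for _ in range(4)]

print([hex(b) for b in shift_rows(state)[1]])                # ['0x11', '0x12', '0x13', '0x10']
print([hex(b) for b in add_round_key(state, round_key)[0]])  # ['0xff', '0xfe', '0xfd', '0xfc']
```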
The design specifies the use of a PLCC68 (Plastic Leaded Chip Carrier) package with a die size of 24.2 mm × 24.2 mm.
The ARM7TDMI, a 32-bit RISC processor, is considered. Its die sizes are:
o 180 nm process: 0.59 mm²
o 90 nm process: 0.18 mm²
Both versions of the ARM7 processor can fit within the specified die size.
The AES encryption process, according to the SimpleScalar tool set, has a cycle count of 16,511.
Optimization Strategies
To enhance the system throughput without exceeding the initial area constraint, the study looks into modifying the cache size based on techniques.
By doubling the block size of a 512-set L1 direct-mapped instruction cache from 32 bytes to 64 bytes:
o The AES cycle count decreases from 16,511 to 16,094 (a 2.6% improvement).
o The area increase is over 11%, which is deemed not worthwhile for the minimal
speed improvement.
Alternative Architectural Approaches:
The ARM7 already utilizes pipelining. Exploring parallel pipelined datapaths could yield better performance, especially for applications needing high throughput. However, these approaches may lead to larger area and power consumption compared to ASIC designs.
Another suggestion is to enhance the instruction set of the processor with custom instructions specific to AES, potentially improving performance.
o AES can be fully pipelined and implemented on FPGA devices, achieving high throughput (over 21 Gbit/s) by leveraging FPGA-specific technologies, such as block memories and multipliers.
o AES cores are often part of more extensive systems. For example, integrating the AES core into the ViaLink FPGA fabric on a QuickMIPS device, which contains a 32-bit MIPS 4Kc processor core, showcases such an implementation.
Image Compression
Image compression methods, such as JPEG, share common intraframe operations with video compression methods like MPEG and H.264. These operations include:
Color Space Transformation: Changing the color representation of the image to optimize compression.
JPEG Compression
o The image, originally in RGB (24 bits per pixel, with 8 bits each for red, green, and
blue), is transformed into the YCbCr color space.
4:4:4: No downsampling.
o The DCT converts each block from the spatial domain to the frequency domain using an 8x8 matrix multiplication. This transformation allows the high-frequency components (which contain less visual information) to be reduced more than the low-frequency components.
o This step involves arranging the frequency coefficients in a zigzag order to prioritize low-frequency components.
o Finally, either Huffman coding or arithmetic coding is used to encode the remaining data. While arithmetic coding is generally more efficient, it is also more complex to decode.
The section estimates the computational load involved in processing the JPEG compression:
o For a k × k block, the operations required for DCT involve:
1 data store.
o This totals to 3k + 1 operations per pixel, and 2k^2(3k + 1) for the entire block.
For frames of size n × n at f frames per second (fps), the number of operations can be calculated as:
2fn^2(3k + 1)
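A quick numeric check of that formula with assumed values (8x8 DCT blocks, a 1024x1024 frame, 30 frames per second):

```python
k, n, f = 8, 1024, 30
ops_per_pixel = 2 * (3 * k + 1)            # 50 operations per pixel
ops_per_second = f * n * n * ops_per_pixel
print(ops_per_pixel, ops_per_second)       # 50, about 1.57e9 operations per second
```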
Common Formats:
Compression Ratios
1. Processor Overview:
o The TMS320C549 processor is utilized for implementing the imaging pipeline, which processes 16 × 16 blocks of pixels.
o It features:
32 KB of 16-bit RAM.
16 KB of 16-bit ROM.
o The processor executes all imaging operations on-chip, minimizing the need for slower external memory and enhancing processing speed.
o The TMS320C549 can achieve up to 100 Million Instructions Per Second (MIPS).
o The imaging pipeline, including JPEG compression, requires about 150 cycles per pixel, translating to around 150 instructions/pixel at 100 MIPS and 100 MHz clock speed.
o The processor can process a 1-megapixel CCD (Charge Coupled Device) image in 1.5
seconds.
3. Shot-to-Shot Delay:
4. Image Playback:
o After capturing images, users can display them on the LCD screen of the camera or an external TV monitor.
5. Memory Requirements:
INTERCONNECT ARCHITECTURES
The SoC typically includes various IP blocks like processors, caches, graphics processors,
video codecs, and network units, all integrated onto a single chip.
These blocks need to communicate effectively with each other and with off-chip devices (e.g., external memory or peripherals) to ensure smooth operation and high performance.
Communication Latency: Refers to the delay from when data is requested to when it is received. Low latency is essential for real-time systems (like mobile communication) but may be less crucial in applications where slight delays (like video streaming) are acceptable.
o A master can initiate communication, while a slave responds to requests.
o For instance, a processor (master) may request data from memory (slave), with SoCs typically having multiple masters and several slaves.
o In a bus system, each transaction contains an address, control bits, and data.
o In a Network-on-Chip (NoC), transactions are broken into packets, each with a header (address/control) and payload (data).
o A bus wrapper allows IP blocks that don't use the primary protocol to communicate with others.
Multiple Clock Domains: Different parts of the SoC may operate at different clock speeds due to differing operational requirements (e.g., a processor vs. a video input).
o Clock domains help separate these areas, but design care is needed to avoid synchronization problems that can cause data transfer errors.
BUS: BASIC ARCHITECTURE
The bus architecture in a computer system serves as the main communication pathway between various components (e.g., processor, memory, and peripherals).
A bus's design significantly impacts system performance. If poorly designed, it can restrict (or throttle) data transfer, creating a bottleneck that slows down the system.
Conventional bus systems are often not optimal for System-on-Chip (SoC) applications since they were designed for backplane connections in larger systems, like rack-mounted servers or motherboards.
Limitations include restricted signal pin counts on IC packages, high-capacitance loads, connector resistance, and electromagnetic noise.
In a bus system, multiple units share the bus. When a unit gains exclusive access, it's said to "own" the bus. Bus masters (e.g., processors) initiate communication, while slaves (e.g., memory) respond.
The bus protocol dictates communication rules, including data order, acknowledgment of successful reception, data compression methods, error-checking, and arbitration priority.
Bus Bridges
A bus bridge connects two different bus systems and can serve three main functions:
1. Protocol Conversion: If the buses use different communication protocols, the bridge translates between them.
2. Traffic Segmentation: Bridges segment traffic to keep it contained within sections, enabling both buses to operate concurrently.
3. Memory Buffering: The bridge temporarily stores data in buffers, allowing the master to continue its operations before data reaches the slave, enhancing performance.
The physical structure, including wire paths and cycle time, influences how bus transactions occur.
Arbitration cycles determine access priorities, with complex systems adding additional lines and logic to maintain priority without extra delay.
6. Types of Buses
Unified Bus: Uses the same pathway for both address and data, transmitted sequentially.
Split Bus: Has separate pathways for address and data, allowing them to be processed independently.
Single Transaction vs. Tenured Bus: Single-transaction buses are dedicated to each request individually, while tenured buses support buffered transactions, allowing one transaction's data to occupy the bus even as new transactions are initiated.
AMBA
The AMBA (Advanced Microcontroller Bus Architecture) was introduced by ARM in 1997 as a
structured interconnect standard primarily for ARM-based SoCs.
It provides multiple bus levels to support different performance and power needs within a system.
AMBA's main buses are the Advanced High-Performance Bus (AHB) for high-speed components and the Advanced Peripheral Bus (APB) for lower-power, slower peripherals.
Additionally, there is an older bus, the Advanced System Bus (ASB), intended for simpler microcontrollers.
o Multimaster Support: The AHB supports multiple masters, such as processors or DMA controllers, allowing concurrent transactions with multiple slaves (e.g., memory).
o Burst Mode & Split Transactions: AHB can handle burst transfers, where large blocks of data move in one operation, and split transactions, allowing a master to initiate a transfer and return to it later.
In a typical AMBA system, the AHB forms the primary bus for high-speed components like processors
and memory. Here’s how an AHB transfer works:
Master Access: A master (e.g., ARM processor) requests bus access from an arbiter. If multiple masters request access, the arbiter grants it based on priority.
Initiate Transfer: The bus master drives the address and control signals, indicating the type and width of the transfer and whether it's a burst operation. Data flows from master to slave in a write operation and vice versa in a read.
Slave Response: The slave responds with status indicators (e.g., success, delay, error) to notify the master of the transfer's status.
In pipelined (tenured) AHB buses, one transfer's address phase overlaps with the previous transfer's data phase, enhancing speed.
o Low Power & Low Complexity: APB is optimized for simple, low-power interfaces with slower peripheral devices (e.g., GPIO, timers).
o Simpler Operation: Unlike AHB, APB has a straightforward, three-state data transfer process (idle, setup, and enable states), making it easier to implement in low-complexity applications.
Setup and Enable States: The bus enters setup and enable states in sequence for each transfer, facilitating simple, low-power operation suitable for peripherals.
1. Modular Design & Reuse: AMBA's well-defined interface makes it easier to design modular, reusable SoC components, reducing development complexity and improving interoperability.
2. Clocking and Reset Flexibility: The AMBA interface's design is simple yet flexible, with options for multimaster systems, split transactions, and burst modes.
3. Low-Power Design: AMBA's partitioned design (with AHB for high-performance and APB for low-power peripherals) ensures efficient power consumption, essential for portable devices.
4. On-Chip Testing: AMBA supports on-chip test access through its bus infrastructure, simplifying the testing of bus-connected modules.
CORE CONNECT
IBM's Core Connect Bus is a structured interconnect standard for SoC systems, primarily designed
around IBM's PowerPC processor but flexible enough for other processors.
Core Connect organizes data pathways into a hierarchical bus system, ensuring high-performance data transfers alongside simpler, low-power connections.
*** In a System-on-Chip (SOC) environment, integrating reusable Intellectual Property (IP) blocks with different bus standards can be challenging because each bus standard has its own protocol, which may not be compatible with other standards. To solve this, Bus Interface Units are used, which include bus sockets and bus wrappers. These components help isolate the IP core from the bus protocol, enabling flexibility in connecting IP blocks across different bus systems.
Bus Wrappers/Hardware Sockets: These interface components sit between the IP core and the physical bus. They enable communication across different bus protocols by adapting the IP core's protocol to match the bus's protocol.
Contention and Shared Bus
In bus-based systems, contention arises when multiple units (like processors or memory modules) request access to a shared resource (such as a bus) at the same time. Contention leads to delays because only one request can be processed at a time. There are two ways to handle contention:
1. Idle Until Available: The requesting unit waits and remains idle until the shared resource becomes available.
2. Queue in Buffer: The request is placed in a buffer, allowing the unit to continue other processes until the resource is free. This approach only works when the requested resource isn't critical to the current execution, such as cache prefetching.
The need to analyze a bus for contention depends on its bandwidth relative to the memory bandwidth. If the bus is a bottleneck (i.e., has less available bandwidth than memory), then it must be analyzed for contention as it restricts data flow. Buses with no buffering lead to system slowdowns as requests get denied immediately.
There are two main access patterns:
1. Requests without Immediate Resubmissions: The denied request does not need to be immediately fulfilled, allowing the system to continue. For instance, a cache line prefetch can wait without stalling the program.
2. Requests Are Immediately Resubmitted: In this common case, a denied request must be resubmitted instantly. This is typical for systems where multiple processors share a bus, and the program cannot proceed until the request is granted, causing the processor to remain idle until the resource becomes available.
Bus Transaction Time (T_line_access): This is the time the bus takes to handle a request.
Processor Time: This is the average time a processor needs to perform computations before making a bus request.
This ratio indicates how often the bus is busy relative to the total time available for processing.
The probability that a processor does not access the bus is given by 1 − ρ.
The probability that no processor accesses the bus is (1 − ρ)^n, so the probability that the bus is busy is 1 − (1 − ρ)^n for n processors.
This tells us how much of the bus's bandwidth is effectively used per processor.
A processor's speed is reduced by the ratio ρ_a/ρ due to bus congestion, highlighting the performance impact of contention.
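A minimal numeric sketch of this simple (no-resubmission) model, under the assumption that each of n processors independently offers occupancy ρ, so the bus is busy with probability 1 − (1 − ρ)^n and the achieved per-processor occupancy is that busy fraction divided by n:

```python
def bus_model(rho, n):
    busy = 1 - (1 - rho) ** n         # probability the bus is in use
    rho_a = busy / n                  # achieved occupancy per processor
    return busy, rho_a, rho_a / rho   # last value is the slowdown ratio rho_a / rho

for n in (1, 2, 4, 8):
    busy, rho_a, ratio = bus_model(rho=0.3, n=n)
    print(f"n={n}: bus busy {busy:.2f}, per-processor {rho_a:.2f}, ratio {ratio:.2f}")
```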
2. Bus Model with Request Resubmission
This model incorporates a more complex analysis to handle scenarios where requests are resubmitted after being denied.
Two equations are provided to calculate the achieved occupancy (ρ_a):
The variable a represents the actual offered request rate. The iterative process starts with an initial guess a = ρ, and typically converges within four iterations.
The model requires knowledge of the average bus transaction time to compute the offered occupancy, which indicates how busy the bus would be without contention (ranging from 0.0 to 1.0).
Blocking Transactions: After initiating a bus request, the processor becomes idle until the bus transaction is complete.
o For a single bus master, achieved occupancy ρ_a equals offered occupancy ρ.
o For multiple bus masters, the offered occupancy becomes nρ, and contention can occur, necessitating the use of the bus model to find ρ_a.
o More complex processors can continue processing after making a bus request, potentially issuing several requests before the initial one is completed.
SOC Customization
Customization in the context of System on Chip (SoC) design refers to the optimization of hardware and software to meet specific application requirements and implementation constraints.
Customization can occur at various stages: during design time (which includes fabrication and compile time) and at run time. Each of these stages plays a crucial role in shaping the final characteristics of the SoC.
Stages of Customization
1. Fabrication Time:
o This is when the physical device is constructed. For custom chips, such as ASICs, much of the functionality is predetermined at this stage.
o If the device is configurable, it may allow for further customization post-fabrication.
2. Compile Time:
o This stage involves generating configuration information from design descriptions that will be used to customize the device at run time.
o This includes producing instructions tailored to the specific architecture of the processor.
3. Run Time:
o Customization can also occur while the system is operational, allowing for dynamic reconfiguration to adapt to changing application needs.
Customizable SoCs can be classified based on how the reconfigurable fabric interfaces with the processor:
Instruction processors in a System on Chip (SoC) can be specialized for specific tasks, such as media processing or encryption. Customization usually happens before fabrication but can also be done for soft processors. This process optimizes performance in terms of speed, size, power consumption, and accuracy.
Approaches to Customization
1. Family of Processors: Some companies, like ARM, offer different families of processors
optimized for various applications. For example, ARM's Cortex-A, Cortex-R, and Cortex-M
families target application, real-time, and microcontroller workloads respectively.
2. Custom Processor Generation: Companies like ARC and Tensilica provide tools that let
designers configure processors by choosing the features they need and removing unnecessary
ones. This helps in optimizing the design for specific applications.
Modern SoC design tools help automate much of the customization process, making it easier and
faster to create custom processors. Some common functionalities of these tools include:
Integrating components from various sources.
As processors become more complex, architecture description languages help automate their design
and that of the associated software tools. These languages allow designers to describe the processor
at a high level so that tools can automatically generate the hardware and software needed.
Description Languages:
Behavioural Languages: Focus on the instruction set, making it easier to generate tools like
compilers. They offer high abstraction but less flexibility in hardware design. Ex: nML and TIE.
Structural Languages: Describe the hardware components and their connections. They allow
for direct hardware synthesis but require more detailed specifications. Ex: SPREE.
Hybrid Approaches: Some languages combine both behavioural and structural elements for greater
flexibility. Ex: LISA.
Designers can identify which custom instructions to add by analysing high-level application
descriptions. Techniques include:
Using methods like VLIW (Very Long Instruction Word) to execute multiple operations
simultaneously.
Developing vector operations that work on multiple data items at once.
These techniques help in creating efficient processors tailored for specific tasks while considering
factors like power consumption and performance.
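As a toy illustration of this analysis, the sketch below scans a hypothetical operation trace and
counts how often a multiply-followed-by-add pattern occurs; a pattern that dominates the trace is a
natural candidate for a fused custom instruction (e.g., multiply-accumulate) or a vector operation.
The trace format and threshold are invented for illustration only:

# Toy sketch: spot recurring adjacent-operation patterns in a trace.
# Frequently occurring pairs are candidates for fused custom instructions.
from collections import Counter

trace = ["load", "mul", "add", "store", "mul", "add", "mul", "add", "load", "sub"]

pair_counts = Counter(zip(trace, trace[1:]))        # count adjacent operation pairs
for (op1, op2), count in pair_counts.most_common():
    if count >= 2:                                  # arbitrary 'hot pattern' threshold
        print(f"candidate fused operation: {op1}+{op2} (seen {count} times)")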
Reconfigurable Functional Units (FUs)
Types of FUs
Reconfigurable functional units (FUs) can be categorized into two main types based on their
granularity:
1. Fine-Grained FUs:
o These are built from small lookup tables (LUTs), each of which can implement any
logic function of its inputs (see the LUT sketch after this list).
o LUTs are often grouped into clusters (e.g., Logic Array Blocks in Altera
FPGAs and Configurable Logic Blocks in Xilinx FPGAs). Each cluster
allows for flexible implementation of various digital circuits.
2. Coarse-Grained FUs:
o These FUs are larger and can handle more complex functions,
often integrating components like arithmetic and logic units
(ALUs).
o Examples:
DSP Blocks: Altera's DSP blocks (e.g., in the Altera Stratix family)
can handle various operations, providing more flexibility than
simple multipliers, but may require more area and have slower
performance for specific tasks.
Characteristics:
o While these FUs are less flexible than fine-grained FUs (like
LUTs), they can efficiently implement operations that match
their capabilities, making them ideal for applications that
primarily involve arithmetic and logic functions.
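To make the fine-grained case concrete, the sketch below models a k-input LUT as a 2^k-entry truth
table: loading different configuration bits makes the same LUT implement any Boolean function of its
inputs. The class and the example configuration are illustrative only, not a model of a specific FPGA
family:

# Minimal model of a k-input LUT: the configuration bits are simply the
# truth table of the function being implemented.
class LUT:
    def __init__(self, k: int, config_bits):
        assert len(config_bits) == 2 ** k, "one configuration bit per input combination"
        self.bits = list(config_bits)

    def evaluate(self, *inputs):
        index = 0
        for bit in inputs:                 # inputs form a binary index into the table
            index = (index << 1) | (bit & 1)
        return self.bits[index]

# Configure a 4-input LUT as a 4-input AND gate (only the last table entry is 1).
and4 = LUT(4, [0] * 15 + [1])
print(and4.evaluate(1, 1, 1, 1))   # -> 1
print(and4.evaluate(1, 0, 1, 1))   # -> 0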
Reconfigurable Interconnects
Regardless of whether the FUs are fine-grained or coarse-grained, they need to be connected flexibly
to optimize performance. There are two types of reconfigurable interconnect architectures:
1. Fine-Grained Interconnects:
o Each wire in this architecture can be switched independently, allowing for maximum
flexibility in routing signals between FUs.
o Fine-grained routing is commonly used in FPGAs, where the FUs are arranged in a
grid and connected through horizontal and vertical channels. This flexibility comes
with increased complexity and overhead.
2. Coarse-Grained Interconnects:
o The connections in this architecture switch entire buses as a unit rather than
individual wires, resulting in fewer programming bits and lower overhead.
o Examples:
Totem System: This system features flexible interconnects that can establish
arbitrary connections between FUs.
Silicon Hive System: This architecture is less flexible but faster and smaller,
designed to connect only those units likely to communicate with each other.
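A rough way to see why bus-level switching needs fewer programming bits: assume, purely for
illustration, a full crossbar between a set of source and destination ports carrying w-bit buses.
Switching each wire of every bus independently needs one configuration bit per wire crosspoint,
while switching whole buses needs only one per bus crosspoint:

# Back-of-the-envelope comparison of programming bits for an assumed full
# crossbar: wire-level (fine-grained) vs bus-level (coarse-grained) switching.
def crossbar_bits(sources: int, destinations: int, bus_width: int):
    fine = sources * destinations * bus_width   # one bit per wire crosspoint
    coarse = sources * destinations             # one bit per bus crosspoint
    return fine, coarse

fine, coarse = crossbar_bits(sources=8, destinations=8, bus_width=32)
print(f"fine-grained: {fine} bits, coarse-grained: {coarse} bits")   # 2048 vs 64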
Software Configurable Processors
Software configurable processors, developed by Stretch, combine traditional instruction processing
with a reconfigurable fabric, enabling dynamic customization of the instruction set by application
programs.
1. Architecture:
o Conventional Processor: At its core is a 32-bit Reduced Instruction Set Computer
(RISC) processor.
o Programmable Instruction Set Extension Fabric (ISEF): This component extends the
conventional processor's capabilities by allowing custom instructions that can be
tailored to specific applications.
2. Performance Benefits:
o Data Parallelism: The ability to perform multiple operations simultaneously, leading
to higher throughput.
o Operator Specialization: Custom operations can be defined to optimize specific
computational tasks.
o Deep Pipelining: The architecture supports deep pipelining of instructions, allowing
multiple stages of instruction processing to occur simultaneously, which increases
efficiency.
3. ISEF Components:
o ALUs and Multipliers: The ISEF consists of blocks containing arrays of 4-bit ALUs and
multipliers. These 4-bit ALUs can be cascaded via a fast carry circuit to create larger
ALUs (up to 64 bits).
o Logic Functions: Each 4-bit ALU can implement multiple 3-input logic functions and
has four register bits for storing instruction state variables or facilitating pipelining.
4. Instruction Handling:
o Extension Instructions: The ISEF can support multiple application-specific
instructions, called extension instructions. Each can read up to three 128-bit
operands and write up to two 128-bit results using a set of 32 wide registers (128
bits each).
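The carry-cascading idea can be shown in software: the sketch below chains 4-bit adder slices
through a carry signal to form a wider adder, mirroring how the notes describe the ISEF's 4-bit ALUs
cascading into ALUs of up to 64 bits. It is a behavioural sketch only, not a model of the actual
Stretch hardware:

# Behavioural sketch: build a wide adder from 4-bit slices linked by a carry chain.
def add4(a: int, b: int, carry_in: int):
    """Add two 4-bit values plus a carry; return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4

def cascaded_add(a: int, b: int, width: int = 16):
    """Add two width-bit values using chained 4-bit slices."""
    result, carry = 0, 0
    for slice_index in range(width // 4):
        shift = 4 * slice_index
        s, carry = add4((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= s << shift
    return result, carry

print(hex(cascaded_add(0x1234, 0x0FFF)[0]))   # -> 0x2233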
FPGA Design Flow
1. Design Description:
o Designs for FPGAs are usually described using Hardware Description Languages
(HDLs) such as VHDL and Verilog at the Register Transfer Level (RTL). This level of
description specifies the operations performed in each clock cycle.
2. Synthesis Process:
Identification of Operations: The initial stage identifies datapath operations
and translates them into basic logic gates (AND, OR, XOR).
Netlist Optimization: The netlist of basic gates is optimized for size and
efficiency through logic-optimization techniques.
3. Technology Mapping:
o The optimized netlist is then mapped to the specific FPGA architecture (e.g., Xilinx
Virtex or Altera Stratix).
o Additional optimizations are made based on the FPGA architecture, including using
dedicated features like carry chains for adders or specific shift functions for logic
blocks.
o Packing and Clustering: LUTs and registers are packed and clustered into logic blocks
to minimize interconnections between blocks.
4. Placement and Routing:
o Placement: The optimized logic blocks are placed on the FPGA considering goals like
speed, routability, and wire length.
o Routing: This step determines how the logic block inputs and outputs will connect
via the programmable routing resources in the FPGA, ultimately generating a
configuration bitstream that defines these connections.
6. Analysis Tools:
o Additional tools are available to analyze metrics such as delay, area, and power
consumption to ensure that the circuit meets application requirements.
Instance-Specific Design
Instance-specific design refers to the customization of hardware and software implementations to
optimize performance for particular computations. This approach aims to enhance speed and reduce
resource usage, thereby lowering power and energy consumption, although it sacrifices some
flexibility. Here are three primary techniques for automating instance-specific design:
1. Constant Folding
Constant folding involves propagating known, static input values through computations to
eliminate unnecessary hardware or software operations.
In hardware design, if certain filter coefficients are constant, the design can be specialized to
use one-input constant-coefficient multipliers instead of two-input multipliers. This
specialization results in smaller and faster multipliers.
By optimizing specific designs for fixed parameters, instance-specific designs can yield
significant improvements in efficiency, making reconfigurable logic potentially more effective
than ASICs for certain applications. For example, in FIR (Finite Impulse Response) filters,
techniques like modified common subexpression elimination can lead to up to a 50% reduction
in FPGA slice usage and a 75% reduction in LUT usage. This translates into substantial
reductions in dynamic power consumption as well.
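A software analogue of this specialisation: the generic FIR routine below multiplies by arbitrary
coefficients, while the specialised version folds in a fixed set of coefficients, replacing
multiplications by shifts and adds where the constants allow, much as a constant-coefficient
multiplier would in hardware. The coefficient values are made up for illustration:

# Constant folding illustrated on a tiny FIR filter.
def fir_generic(samples, coeffs):
    # Coefficients are runtime inputs: every tap needs a full multiply.
    taps = len(coeffs)
    return [sum(coeffs[k] * samples[i - k] for k in range(taps))
            for i in range(taps - 1, len(samples))]

def fir_specialised(samples):
    # Coefficients fixed at (4, 2, 1): multiplications by known powers of two
    # fold into shifts, eliminating general-purpose multipliers.
    return [(samples[i] << 2) + (samples[i - 1] << 1) + samples[i - 2]
            for i in range(2, len(samples))]

data = [1, 2, 3, 4, 5]
print(fir_generic(data, [4, 2, 1]))   # [17, 24, 31]
print(fir_specialised(data))          # [17, 24, 31] -- same result, constants folded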
2. Function Adaptation
Function adaptation involves modifying functions in hardware or software to find the best
trade-off between performance, resource usage, and output quality for a specific application
instance.
Word-length optimization is a key aspect of function adaptation. In FPGA implementations,
the word length and scaling of signals in a digital signal processing (DSP) system can be
customized based on application needs. This flexibility allows designers to choose variable
sizes that optimize trade-offs in numerical accuracy, design size, speed, and power
consumption, unlike the fixed architectures of traditional microprocessors.
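A small sketch of the word-length trade-off: quantising the same signal to different fixed-point
fractional word lengths and measuring the worst-case error shows how numerical accuracy can be
traded against storage (and, in hardware, area and power). The signal values and word lengths are
arbitrary examples:

# Sketch of word-length optimisation: quantise values in [-1, 1) to fixed point
# with different fractional word lengths and compare the worst-case error.
def quantise(x: float, frac_bits: int) -> float:
    step = 2.0 ** -frac_bits
    return round(x / step) * step

signal = [0.1, -0.37, 0.82, -0.5601, 0.333]
for bits in (4, 8, 12):
    err = max(abs(x - quantise(x, bits)) for x in signal)
    print(f"{bits:2d} fractional bits -> max quantisation error {err:.6f}")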
3. Architecture Adaptation
For instance-specific applications, this may involve creating custom instructions that enhance
performance for certain computations, improving the efficiency of the architecture used. An
example is the CUSTARD customizable multithreaded soft processor described below, whose
multithreading support provides:
Single-Cycle Context Switch: Allows quick switching between threads, reducing overhead
and improving execution interleaving.
Latency Hiding: If one thread waits (e.g., for memory), the processor can switch to another
thread, preventing stalls.
Resource Management: Supporting multiple threads requires more register files, but
modern FPGAs have sufficient on-chip memory to handle these additional needs efficiently.
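A toy utilisation model of latency hiding: with one thread, every memory stall leaves the processor
idle; with several threads and a cheap context switch, another thread's compute cycles fill the
stall. The cycle counts below are arbitrary and only meant to show the trend, assuming ideal
interleaving:

# Toy utilisation model for latency hiding with multithreading.
# Each thread repeatedly computes for 'compute' cycles, then stalls on memory
# for 'stall' cycles; other threads' work can fill those stall cycles.
def utilisation(threads: int, compute: int = 4, stall: int = 12) -> float:
    period = compute + stall                   # one thread's repeating pattern
    useful = min(threads * compute, period)    # work available to fill the period
    return useful / period

for t in (1, 2, 4):
    print(f"{t} thread(s): utilisation ~ {utilisation(t):.0%}")   # 25%, 50%, 100%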
Key architectural features include:
Parameterization: CUSTARD supports four main sets of parameters for customization:
1. Multithreading Support: Choose the number of threads and the threading type (Block
Multithreading (BMT) or Interleaved Multithreading (IMT)).
o When only one thread is active, BMT behaves like a conventional single-threaded
processor. With multiple threads, it hides latencies by switching contexts during
the execution stage of the pipeline.
o IMT simplifies the pipeline architecture because independent instructions can be
guaranteed in certain pipeline stages, thus reducing hazards. This allows the
processor to be optimized by selectively removing unnecessary forwarding
paths.
2. Custom Instructions: Define custom instructions and the associated datapaths for their
execution.
3. Forwarding and Interlock Architecture: Specify the necessity of branch and load
delay slots and forwarding paths.
4. Register File Configuration: Customize the number of registers and ports per register
file to enhance flexibility.
1 Give the context of usage of integrated caches and split I/D caches
Integrated Cache: Combines both instruction and data caching in a single, unified
cache. This setup can optimize cache usage and save space but may lead to contention when
instructions and data are accessed simultaneously.
Split I/D Cache: Separates the instruction cache (I-cache) and data cache (D-cache), allowing
concurrent access to instructions and data and improving performance at the cost of added
complexity and space.
2 List the system-level issues and specifications used to choose an interconnect architecture
Bandwidth and Latency Requirements: Determine data transfer speed and delay tolerance.
Scalability: Supports growing numbers of cores or components.
Power Efficiency: Minimizes power consumption, especially for mobile or low-power
devices.
Compatibility: Aligns with existing protocols and component standards.
Reliability and Fault Tolerance: Ensures stable operation and error handling.
Cost and Complexity: Balances performance with budget and design constraints.
Instance-Specific Design: Customizes a design for a specific instance or set of inputs to optimize
performance, power, or area.
Automation Methods: constant folding, function adaptation, and architecture adaptation.
4 State reasons why system design is more challenging than processor design
Complex Interactions: System design must manage interactions across multiple
heterogeneous components, unlike single-processor design.
Broad Requirements: Balances power, performance, security, and scalability, often with
conflicting demands.
Integration and Compatibility: Ensures seamless integration of hardware, software, and
interfaces.
Customization Needs: Adapts to diverse applications, requiring flexible, domain-specific
optimizations.
Reliability and Fault Tolerance: Demands higher resilience due to complex dependencies
and varied usage conditions.
Arbitration: Use bus arbitration techniques to control access, ensuring only one device
communicates at a time.
Bus Segmentation: Split the bus into segments to reduce contention.
Caching: Cache frequently accessed data to minimize bus access needs.
Use Faster Bus Protocols: Upgrade to high-speed protocols to reduce delay.
Prioritization: Implement priority levels to manage critical data access first.
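A minimal sketch of fixed-priority arbitration, combining the arbitration and prioritization points
above: among the devices currently requesting the bus, the one with the highest priority (lowest
number here) is granted access, so only one device communicates at a time. The device names and
priorities are invented:

# Minimal fixed-priority bus arbiter: grant the bus to the highest-priority
# requester (lowest priority number); all other requesters wait.
def arbitrate(requests, priority):
    active = [dev for dev, wants_bus in requests.items() if wants_bus]
    if not active:
        return None                            # bus stays idle this cycle
    return min(active, key=lambda dev: priority[dev])

priority = {"cpu": 0, "dma": 1, "uart": 2}
requests = {"cpu": False, "dma": True, "uart": True}
print(arbitrate(requests, priority))           # -> 'dma'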
7 What is AddRoundKey
AddRoundKey: A step in cryptographic algorithms, particularly in AES (the Advanced Encryption
Standard), where a round key is combined with the current state of the data using a bitwise XOR. This
operation adds security by mixing the key into the data during encryption or decryption.
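AddRoundKey itself is just a byte-wise XOR of the state with the round key, as the short sketch below
shows on a 16-byte state. The state and key values are arbitrary examples; this is not a full AES
implementation:

# AddRoundKey: XOR each byte of the AES state with the corresponding round-key byte.
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    assert len(state) == len(round_key) == 16
    return bytes(s ^ k for s, k in zip(state, round_key))

state = bytes(range(16))                  # example 16-byte state
round_key = bytes([0x2B] * 16)            # example round key
mixed = add_round_key(state, round_key)
print(mixed.hex())
# Applying the same key again undoes the step, since XOR is its own inverse.
print(add_round_key(mixed, round_key) == state)   # -> True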
Resolution: In video, resolution refers to the amount of detail an image holds, typically defined by
the width and height in pixels (e.g., 1920x1080).
P-Frame (Predictive Frame): A type of video frame that stores only the differences between the
current frame and a reference frame (usually a preceding I-frame or P-frame). P-frames use motion
compensation to compress video data efficiently, reducing file size while maintaining quality by
predicting the contents of the frame from previous frames.
9 What is DCT
DCT (Discrete Cosine Transform): A mathematical transform used in signal processing and image
compression (e.g., JPEG) to convert spatial-domain data into frequency-domain data. It helps reduce
redundancy in data by representing the image in terms of its frequency components, allowing
effective compression while preserving essential visual information.
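For reference, the 1-D DCT-II can be written directly from its definition,
X[k] = Σ_n x[n]·cos(π·(n + ½)·k / N); the unscaled sketch below is enough to see how a smooth
signal's energy concentrates in the low-frequency coefficients:

# Direct (unscaled) 1-D DCT-II: X[k] = sum_n x[n] * cos(pi * (n + 0.5) * k / N)
import math

def dct_ii(x):
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

# A smooth ramp: most of the energy lands in the first few coefficients.
samples = [float(v) for v in range(8)]
for k, coeff in enumerate(dct_ii(samples)):
    print(f"X[{k}] = {coeff:8.3f}")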
Packet Transaction: A data packet is a formatted unit of data carried by a packet-switched network.
In packet transactions, data is divided into packets for transmission, enabling efficient routing and
error checking.
Bus Transaction: A bus transaction involves communication between devices over a shared bus. It
includes the transfer of data, control signals, and address information, allowing devices to request
and transfer data in a coordinated manner.
Arbitration: A method used to control access to a shared resource (like a bus) among multiple
devices. It determines which device can use the bus at any given time, preventing conflicts and
ensuring orderly communication.
Bus Bridge: A hardware component that connects two different bus architectures, allowing data
transfer between them. It facilitates communication between devices that operate on incompatible
bus standards, ensuring interoperability within a system.
Tenured Bus Architecture: A bus structure where multiple buses are used for different types of data,
allowing specialized buses for high-bandwidth or low-latency tasks. It can optimize performance by
dedicating resources based on specific needs.
Unified Bus Architecture: A single bus system that handles all data types and device
communications. It simplifies design and integration but can lead to contention and bottlenecks, as
all devices share the same bus bandwidth.
13 What are a socket and a bus wrapper
Socket: A software endpoint for sending and receiving data across a network, providing a way for
programs to communicate using protocols like TCP or UDP. In SoC design, the term also refers to the
standardized interface through which an IP core is attached to the on-chip interconnect.
Bus Wrapper: A hardware or software component that encapsulates and manages the
communication between a device and a bus interface. It translates signals and protocols to ensure
compatibility and efficient data transfer between devices and the bus system.
14 What is access time
Access Time: The duration it takes to retrieve data from a storage device or memory after a request
is made. It includes the time needed to locate the data and the time to transfer it, influencing overall
system performance.
Memory Bandwidth: How quickly the memory can handle multiple requests. More independent
memory arrays and optimized access methods help improve bandwidth.
Tag: A portion of an address used to identify whether a specific block of data is stored in a cache. It
helps differentiate between different memory addresses that may map to the same cache
line.
Index: A part of an address that specifies which cache line or set to access. It determines
where to look for the data within the cache.
Offset: The specific location within a cache line or memory block that indicates the exact
byte or word being accessed. It helps pinpoint the exact data within the selected cache line.
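The sketch below splits a byte address into tag, index, and offset for a hypothetical direct-mapped
cache with 32-byte lines and 256 sets; the parameters are examples, not a specific processor's cache:

# Split a byte address into tag / index / offset for a direct-mapped cache.
LINE_SIZE = 32                               # bytes per cache line (example)
NUM_SETS = 256                               # number of sets (example)
OFFSET_BITS = LINE_SIZE.bit_length() - 1     # 5 bits of byte offset
INDEX_BITS = NUM_SETS.bit_length() - 1       # 8 bits of set index

def split_address(addr: int):
    offset = addr & (LINE_SIZE - 1)                  # byte within the line
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)   # which set to look in
    tag = addr >> (OFFSET_BITS + INDEX_BITS)         # identifies the stored block
    return tag, index, offset

tag, index, offset = split_address(0x12345)
print(f"tag=0x{tag:x}, index={index}, offset={offset}")   # tag=0x9, index=26, offset=5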
Warm Cache: A cache that has been recently used and contains data that is likely to be
accessed again soon. It typically has a higher hit rate due to the presence of frequently
accessed data.
Cold Cache: A cache that is empty or has not yet been populated with relevant data. It has a
lower hit rate since the required data may not be present, leading to more cache misses.
17 What is a soft processor
Soft Processor: A processor implemented in the programmable logic (LUTs and registers) of an FPGA
rather than as fixed silicon, so its architecture can be customized at design time. Examples include
Xilinx MicroBlaze and Altera Nios II.
Pipeline Break: A disruption in the sequential flow of instructions in a pipeline, caused by hazards
such as data dependency, control dependency, or resource conflicts. It stalls or flushes the pipeline,
temporarily reducing performance.
ALU (Arithmetic Logic Unit): Performs arithmetic and logical operations.
Floating-Point Unit (FPU): Handles floating-point calculations.
Registers: Store operands and intermediate results.
Control Logic: Manages instruction execution and coordination.
Shifter/Barrel Shifter: Executes bit-shifting and rotation operations.
Branch Unit: Handles branching and conditional instructions.
20 What is the processor as IP
Processor as IP: A pre-designed, reusable processor core provided as Intellectual Property (IP) for
integration into custom chips (SoCs). It simplifies design, accelerates development, and is
customizable for specific applications. Examples include ARM Cortex and RISC-V cores.
Advantages:
Challenges:
Technology Difference:
Limited Size:
parallelism:
pipelining
multiple execution units
multiple cores
levels of parallelism
instruction level
loop level
procedure level
program level
pipelining
pipeline breaks
data conflict
resource contention
run-on delay
branch
scheduling
static: compiler
dynamic: hardware
interconnection approaches
bus based
NOC
System software
compute limited
memory interconnects
instruction set architecture
L/S
R/M
Interrupts
Buffer design
simple:
o branch elimination
o simple branch speedup
complex
o branch target capture
o branch prediction
fixed
static
dynamic
bimodal
two level adaptive
read after write
write after read
write after write
communication bandwidth
communication latency
master and slave
concurrency requirement
packet or bus transactions
multiple clock domain
Bus bridge
protocol conversion
traffic segmentation
memory buffering
Bus
unified
split
single transaction vs tenure
types of transaction
Blocking transaction
Non blocking Transaction
customizable SOC
Multithreading