
16 FPGA Synthesis and Physical Design

Mike Hutton, Vaughn Betz, and Jason Anderson

CONTENTS
16.1 Introduction 374
16.1.1 Architecture of FPGAs 375
16.1.2 CAD Flow for FPGAs 377
16.2 High-Level Synthesis 380
16.2.1 Domain-Specific Synthesis 380
16.2.2 HLS 380
16.2.2.1 Commercial FPGA HLS Tools 382
16.2.2.2 Academic FPGA HLS Research 382
16.2.2.3 Research Challenges for HLS 384
16.3 Logic Synthesis 385
16.3.1 RTL Synthesis 385
16.3.2 Logic Optimization 388
16.3.3 Technology Mapping 389
16.3.4 Power-Aware Synthesis 391
16.4 Physical Design 392
16.4.1 Placement and Clustering 392
16.4.1.1 Problem Formulation 392
16.4.1.2 Clustering 392
16.4.1.3 Placement 393
16.4.2 Physical Resynthesis Optimizations 398

© 2016 by Taylor & Francis Group, LLC

16.4.3 Routing 400


16.4.3.1 Problem Formulation 400
16.4.3.2 Two-Step Routing 401
16.4.3.3 Single-Step Routers 404
16.5 CAD for Emerging Architecture Features 406
16.5.1 Power Management 406
16.5.2 More-than-2D Integration 406
16.6 Looking Forward 407
References 407

16.1 INTRODUCTION

Since their introduction in the early 1980s, Field-Programmable Gate Arrays (FPGAs) have
evolved from implementing small glue-logic designs to implementing large complete systems.
Programmable logic devices range from lower-capacity nonvolatile devices such as Altera
MAX™, Xilinx CoolRunner™, and MicroSemi ProASIC™ to very-high-density static RAM
(SRAM)-programmed devices with significant components of hard logic (ASIC). The latter are
commonly called FPGAs. Most of the interesting CAD problems apply to these larger devices,
which are dominated by Xilinx (Virtex™, Kintex™, Artix™ families) and Altera (Stratix™,
Arria™, Cyclone™ families) [1]. All of these are based on a tiled arrangement of lookup table
(LUT) cells, embedded memory blocks, digital signal processing (DSP) blocks, and I/O tiles.
FPGAs have been historically used for communications infrastructure, including wireless
base stations, wireline packet processing, traffic management, protocol bridging, video process-
ing, and military radar. The growth domains for FPGAs include industrial control, automotive,
high-performance computing, and datacenter compute acceleration.
The increasing use of FPGAs across this wide range of applications, combined with the growth
in logic density, has resulted in significant research in tools targeting programmable logic. Flows
for High-Level Synthesis (Vivado HLS™) and OpenCL™ programming of FPGAs have recently
emerged to target the new application domains. Xilinx Zynq™ and Altera Cyclone/Arria SoC™
contain embedded processors and introduce new CAD directions.
FPGA CAD tool research has two branches: developing algorithms that target a given FPGA,
and developing the tools required to design FPGA architectures themselves. This split
emphasizes the interdependence between CAD and
architecture: unlike ASIC flows where CAD is an implementation of a design in silicon, the CAD
flow for FPGAs is an embedding of the design into a device architecture with fixed cells and rout-
ing. Some algorithms for FPGAs continue to overlap with ASIC-targeted tools, notably language
and technology-independent synthesis. However, technology mapping, routing, and aspects of
placement are notably different.
Emerging tools for FPGAs now concentrate on power and timing optimization and the grow-
ing areas of embedded and system-level design. Going forward, power optimizations will be a
combination of semiconductor industry–wide techniques and ideas targeting the programmable
nature of FPGAs, closely tuned to the evolving FPGA architectures. System-level design tools will
attempt to exploit two of the key benefits of FPGAs—fast design and verification time combined
with a high degree of programmable flexibility.
FPGA tools differ from ASIC tools in that the core portions of the tool flow are owned by
the silicon vendors. Third-party EDA tools exist for synthesis and verification, but with very few
exceptions, physical design tools are supplied by the FPGA vendor, supporting only that vendor’s
products. The two largest FPGA vendors (Altera and Xilinx) offer complete CAD flows from lan-
guage extraction, synthesis, placement, and routing, along with power and timing analysis. These
complete design tools are Quartus™ for Altera and Vivado™ for Xilinx.


After some introductory description of FPGA architectures and CAD flows, we will describe
research in the key areas of CAD for FPGAs roughly in flow order: high-level synthesis (HLS),
register-transfer level (RTL) and logic synthesis, technology mapping, placement, and routing.

16.1.1 ARCHITECTURE OF FPGAs

To appreciate CAD algorithms for FPGAs, it is necessary to have some understanding of FPGA
devices.
Figure 16.1 shows a high-level block diagram of a modern FPGA, showing block resources.
This is not any specific commercial device but an abstraction of typical features. FPGAs contain
high-speed serial I/O or transceivers, usually including the Physical Coding Sublayer process-
ing for a variety of different protocols, and sometimes an ASIC IP (embedded intellectual prop-
erty or macro) block for PCI Express or Ethernet. All FPGAs have a general-purpose parallel
I/O, which can be programmed for different parallel standards, and sometimes have dedicated
external memory interface controllers for SRAM or DDR. Designer logic will target embedded
memory blocks (e.g., 10 kB or 20 kB block RAM with programmable organization), DSP blocks
(multiply-and-accumulate blocks, built with ASIC standard cell methodology), and logic blocks
(Altera Logic Array Block [LAB], Xilinx Configurable Logic Block [CLB]) comprising the LUTs.
The device may have an embedded processor, shown here as a dual-core CPU with memories and
a set of peripheral devices.
Figure 16.2 illustrates the routing interface. A LAB or CLB cluster is composed of a set of
logic elements (LEs) and local routing to interconnect them. Each LE is composed of an LUT,
a register, and dedicated circuitry for arithmetic functions. Figure 16.2a shows a four-input
LUT (4 LUT). The 16 bits of a programmable LUT mask specify the truth table for a four-input
function, which is then controlled by the A, B, C, and D inputs to the LUT. Figure 16.2b shows
an example of an LE (from Stratix I), comprising a 4 LUT and a D flip-flop (DFF) with select-
able control signals. The arithmetic in this LE and a number of other devices is accomplished

[Figure body omitted: a tiled die with transceiver I/O and PCIe hard IP on the periphery, columns of embedded memory blocks and DSP (MAC) blocks, external memory interfaces (EMIF) with parallel DDR I/O, and an embedded dual-core processor (µP).]

FIGURE 16.1 FPGA high-level block diagram.


[Figure body omitted: (a) a 4 LUT built from a 16-bit SRAM mask and a multiplexer tree selected by inputs A–D; (b) the Stratix logic element with its four-input LUT, carry in/out, register chain, and DFF with LAB-wide control signals; (c) a LAB cluster with LAB input multiplexers (LIMs), LE input multiplexers (LEIMs), local feedback routing, and drivers onto the H and V routing wires.]

FIGURE 16.2 Inside an FPGA logic block. (a) 4 LUT showing LUT mask, (b) 4 LUT Stratix Logic ele-
ment, and (c) LAB cluster showing routing interface to inter-logic-block H and V routing wires.

[Figure body omitted: a fracturable 6 LUT with 8 inputs and 4 outputs, configurable as one 6 LUT, two 5 LUTs, or four 4 LUTs feeding arithmetic, with packable registers and carry in/out.]

FIGURE 16.3 Fracturable LUT/FF (Stratix V ALM).


by using the two 3 LUTs independently to generate sum and ripple-carry functions. The carry-
out becomes a dedicated fifth connection into the neighboring LE and optionally replaces the
data3 or C input. By chaining the carry-out of several LEs together, a fast multibit adder can
be formed. Figure 16.2c shows how several LEs are grouped together to form a LAB. A subset
of the global horizontal (H) and vertical (V) lines that form the FPGA interconnect can drive
into a number of LAB input multiplexers (LIMs). The signals generated by the LIMs (along with
feedback from other LEs in the same LAB) can drive the inputs to the LE through logic-element
input multiplexers (LEIMs). On the output side of the LE, routing multiplexers allow signals
from one or more LEs to drive H and V routing wires in order to reach other LABs. These rout-
ing multiplexers also allow H and V routing wires to connect to each other in order to form
longer routes that can reach across the chip. Each of the LAB input, LE input, and routing mul-
tiplexers is controlled by SRAM configuration bits. Thus, the total SRAM bits programming
the LUT, DFF, modes of the LE and the various multiplexers comprise the configuration of the
FPGA to perform a specific logical function.
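As an illustration of how configuration bits define logic, the following C sketch (the function names and structure are ours, purely illustrative, not a vendor API) evaluates a 4 LUT from its 16-bit mask, then chains sum and carry 3 LUT masks into a small ripple-carry adder, mirroring the LE arithmetic described above.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate a 4-LUT: the 16-bit mask is the truth table, indexed by the
 * four inputs packed as bits 3..0 = D,C,B,A. In a real device the mask
 * comes from SRAM configuration cells. */
static int lut4_eval(uint16_t mask, int a, int b, int c, int d) {
    unsigned idx = (unsigned)((d << 3) | (c << 2) | (b << 1) | a);
    return (mask >> idx) & 1;
}

/* Sketch of a ripple-carry chain built from LEs: each LE uses two
 * 3-input truth tables (sum and carry), and the carry-out feeds the
 * dedicated carry input of the neighboring LE. */
static unsigned add4(unsigned x, unsigned y) {
    /* 3-input masks with inputs packed cin,b,a -> index bits 2..0:
     * sum = a ^ b ^ cin, cout = majority(a, b, cin). */
    const uint16_t SUM_MASK   = 0x96; /* 1001_0110 */
    const uint16_t CARRY_MASK = 0xE8; /* 1110_1000 */
    unsigned cin = 0, result = 0;
    for (int i = 0; i < 4; i++) {
        int a = (x >> i) & 1, b = (y >> i) & 1;
        int s = lut4_eval(SUM_MASK, a, b, (int)cin, 0);
        cin   = (unsigned)lut4_eval(CARRY_MASK, a, b, (int)cin, 0);
        result |= (unsigned)s << i;
    }
    return result & 0xF;
}
```

Changing only the mask bits changes the logical function, which is exactly why a single LUT cell can implement any function of its inputs.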
There is extensive literature on FPGA architecture exploration; the reader is referred to Betz [2] for
a description of early FPGA research. Depending on the architectural choices, some of the blocks
in Figure 16.1 may not be present, for example, the device could have no DSP blocks, the rout-
ing could be organized differently, and control signals on the DFF could be added or removed.
More modern devices do not have the simplified 4 LUT/DFF structure shown in Figure 16.2. To
reduce circuit depth and address common six-input functions such as a 4:1 multiplexer, they are
now more often 6 LUT based and this 6 LUT usually has additional circuitry that allows it to be
fractured to compute two different functions of less than 6 inputs each. Lewis describes one such
architecture in [3], which is shown in Figure 16.3.
The Versatile Place and Route (VPR) toolset [4] introduced what is now the standard para-
digm for empirical architecture evaluation. An architecture specification file controls a param-
eterized CAD flow capable of targeting a wide variety of FPGA architectures, and individual
architecture parameters are swept across different values to determine their impact on FPGA
area, delay, and recently power, as shown in Figure 16.4. Some example parameters include
LUT and cluster sizes, resource counts, logic-element characteristics (e.g., arithmetic and sec-
ondary signal structures), and the lengths and numbers of routing wires. These are used both to generate the
architecture and to modify the behavior of the CAD tools to target it. There have been about
100 research papers published that use VPR for either architecture exploration or evaluating
alternative CAD algorithms. The open-source VPR has recently been updated [5] to include
a wide range of new architectural enhancements, notably Verilog RTL synthesis, memory,
and arithmetic support, in combination with new benchmarks for architecture research (and
renamed VTR for Verilog-to-Routing).
The interaction of CAD and architecture is highlighted by Yan [6], who illustrated the sensi-
tivity of architecture results to the CAD algorithms and settings used in experiments and how
this could dramatically alter design conclusions. There have been a number of studies related to
power optimization and other architectural parameters. For example, Li [7] expanded the VPR
toolset to evaluate voltage islands on FPGA architectures, and Wilton [8] evaluated CAD and
architecture targeting embedded memory blocks in FPGAs.

16.1.2 CAD FLOW FOR FPGAS

The core RTL CAD flow seen by an FPGA user is shown in Figure 16.5. After design entry, the
design is elaborated into operators (e.g., adders, multiplexers, multipliers), state machines, and
memory blocks. Gate-level logic synthesis follows, then technology mapping to LUTs and regis-
ters. Clustering groups of LEs into logic blocks (LABs) and placement determines a fixed location
for each logic or other block. Routing selects which programmable switches to turn on in order
to connect all the terminals of each signal net, essentially determining the configuration settings
for each of the LIM, LEIM, and routing multiplexers described in Section 16.1.1. Physical resyn-
thesis, shown between placement and routing in this flow, is an optional step that resynthesizes
the netlist to improve area or delay now that preliminary timing is known. Timing analysis [9]
and power analysis [10] for FPGAs are approximated at many points in the flow to guide the

[Figure body omitted: a VPR-style architecture description file (logic architecture parameters such as subblock_lut_size and subblocks_per_clb; routing architecture parameters such as switch_block_type, Fc_output, Fc_input, switch buffering, and segment length/frequency; process parameters such as R_minW_nmos and R_minW_pmos) parameterizes prototype synthesis, place and route, and architecture generation. Test designs drawn from domains such as computer storage, automotive, wireless, medical, and wireline networking run through the flow to analyze area, speed, and power.]
FIGURE 16.4 FPGA modeling toolkit flow based on Versatile Place and Route.

Design entry

RTL synthesis

Logic synthesis

Technology mapping

Clustering and placement

Physical synthesis

Routing

Timing and power analysis

Bitstream generation

FIGURE 16.5 FPGA CAD flow.

optimization tools but are performed for final analysis and reporting at the end of the flow. The
last step is bit-stream generation, which determines the sequence of 1’s and 0’s that will be serially
loaded to configure the device. Notice that some ASIC CAD steps are absent: sizing and com-
paction, buffer insertion, clock tree, and power grid synthesis are not relevant to a prefabricated
architecture.
Prior to the design entry phase, a number of different HLS flows can be used to generate
hardware description language (HDL) code—domain-specific languages for DSP or other application
domains and HLS from C or OpenCL are several examples that will be discussed in Section 16.2.
Research on FPGA algorithms can take place at all parts of this flow. However, the academic
literature has generally been dominated by synthesis (particularly LUT technology mapping)
at the front end and place-and-route algorithms at the back end, with most other aspects left
to commercial tools and the FPGA vendors. The literature on FPGA architecture and CAD
can be found in the major FPGA conferences: the ACM International Symposium on FPGAs
(FPGA), the International Conference on Field-Programmable Logic and Applications (FPL), the
International Conference on Field-Programmable Technology (FPT), the International Symposium
on Field-Programmable Custom-Computing Machines (FCCM) [11], as well as the general CAD
conferences.
In this chapter, we will survey algorithms for both the end-user and architecture development
FPGA CAD flows. The references we have chosen to include are representative of a large body
of research in FPGAs, but the list cannot be comprehensive. For a more complete discussion,
the reader is referred to [12,13] for the overall FPGA CAD flow and [14,15,16] for FPGA-specific
synthesis.


16.2 HIGH-LEVEL SYNTHESIS

Though the majority of current FPGA designs are entered directly in VHDL or Verilog, there
have been a number of attempts to raise the level of abstraction to the behavioral or block-
integration level with higher-level compilation tools. These tools then generate RTL HDL, which
is shown in Figure 16.5. This section briefly describes HLS tools for domain-specific tasks and
then the emerging commercial HLS. It closes with research challenges in the field.

16.2.1 DOMAIN-SPECIFIC SYNTHESIS

Berkeley Design Technologies Inc. [17] found that FPGAs have better price/performance than
DSP processors for many DSP applications. However, HDL-based flows are not natural for
most DSP designers, so higher-level DSP design flows are an important area of research. Altera
DSP Builder™ and Xilinx System Generator™ link the MATLAB ® and Simulink® algorithm
exploration and simulation environments popular with DSP designers to VHDL and Verilog
descriptions targeting FPGAs. See Chapter 8 of [12] for a case study on the use of Xilinx System
Generator™.
For embedded processing and connecting complex systems, the FPGA vendors provide a
number of system-level interconnect tools such as Altera’s QSYS™ and Xilinx IP Integrator™ to
enable the stitching of design blocks across programmable bus interconnect and to the embedded
processors.
The Field-Programmable Port Extender (FPX) system of Lockwood [18] developed system-
level and modular design methodologies for the network processing domain. The modular nature
of FPX also allows for the exploration of hardware and software implementations. Kulkarni [19]
proposed a methodology for mapping networking-specific functions into FPGAs.
Maxeler Technologies [20] offers a vertical solution for streaming dataflow applications that
include software tools and FPGA hardware. The user describes the application’s dataflow graph in
a variant of the Java language—including the portion of the application to run on the FPGA and a
portion to run on a connected x86 host processor. The company provides integrated desktop and
rack-scale x86/FPGA hardware platforms comprising Intel processors connected with Altera/
Xilinx FPGAs.

16.2.2 HLS

HLS refers to the automated compilation of a software program into a hardware circuit described
at the RTL in VHDL or Verilog. For some HLS tools, the software program is expressed in a
standard language, such as C or C++, whereas other tools rely on extended versions of standard
languages [21–23] or even entirely new languages as in the case of BlueSpec [24]. Most HLS tools
provide a mechanism for the user to influence the hardware produced by the tool via (1) con-
straints in a side file provided to the HLS tool, (2) pragmas in the source code, or (3) constructs
available in the input language. HLS tools exist for both ASIC and FPGA technologies, with the
main difference being that the FPGA tools have models for the speed, area, and power of the
resources available on the target FPGA, including soft logic and hard IP blocks. Regardless of
the specific input language, constraints, and target IC media, we consider the defining feature
of HLS to be the translation of an untimed clockless behavioral input specification into an RTL
hardware circuit. The RTL is then subsequently synthesized by FPGA vendor synthesis. HLS thus
eases hardware design by raising the level of abstraction an engineer uses, namely, by permitting
the use of software design methodologies.
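To make the input side concrete, a minimal C kernel of the kind C-based HLS tools accept might look as follows (our example, not from any particular tool's documentation): the source contains no clocks or registers, and the HLS scheduler decides which cycle each multiply and add executes in.

```c
#include <assert.h>

/* An untimed behavioral specification: HLS would translate this loop
 * into an FSM-controlled datapath with multipliers and an adder. */
int dot_product(const int *a, const int *b, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```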
HLS is attractive to two different types of users: (1) hardware engineers who wish to shorten/
ease the design process, perhaps just for certain modules in the system, and (2) software engineers
who lack hardware design skills but wish to glean some of the energy and speed benefits associ-
ated with implementing computations in hardware versus software. To be sure, recent studies
have shown that FPGA hardware implementations can outperform software implementations
by orders of magnitude in speed and energy [25]. Traditionally, FPGA circuit design has required


hardware skills; however, software engineers outnumber hardware engineers by 10:1 [26]. FPGA
vendors are keenly interested in broadening access to their technology to include software devel-
opers who can use FPGAs as computing platforms, for example, to implement accelerators that
work alongside traditional processors. Indeed, both Altera and Xilinx have invested heavily in
HLS in recent years, and both companies have released commercial solutions: Altera’s OpenCL
SDK performs HLS for an OpenCL program, while Xilinx’s AutoESL-based HLS accepts a C pro-
gram as input. Both are overviewed in further detail in the remainder of this section, along with
recent academic FPGA HLS research.
Figure 16.6 shows the HLS flow. The program is first parsed/optimized by a front-end com-
piler. Then, the allocation step determines the specifications of the hardware to be synthesized,
for example, the number and types of the functional units (e.g., number of divider units). Speed
constraints and characterization models of the target FPGA device are also provided to the
allocation step (e.g., the speed of an 8-bit addition, a 16-bit addition). Following allocation, the
scheduling step assigns the computations in the (untimed) program to specific hardware clock
cycles, defining a finite-state machine (FSM). The subsequent binding step assigns (binds) the
computations from the C to specific hardware units, while adhering to the scheduling results.
For example, a program may contain 10 division operations, and the HLS-generated hardware
chosen in the allocation step may contain 3 hardware dividers. The binding step assigns each of
the 10 division operations to one of the dividers. Memory operations, for example, loads/stores,
are also bound to specific ports on specific memories. The final RTL generation step produces the
RTL to be passed to back-end vendor tools to complete the implementation steps, which will be
described later in Sections 16.3 and 16.4.
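The interaction of allocation and scheduling can be sketched with a toy resource-constrained scheduler. The following is an illustration of the idea only, not any tool's actual algorithm: operations have unit latency, each op has at most one predecessor, and the allocation result caps how many operations may start per cycle.

```c
#include <assert.h>

/* Toy list scheduler: ops arrive in topological order; dep[i] is the
 * single predecessor of op i (-1 if none); at most 'units' ops may
 * start in any cycle. Fills sched[] with each op's cycle and returns
 * the schedule length. Assumes fewer than 64 cycles (small examples). */
static int list_schedule(int n, const int *dep, int units, int *sched) {
    int per_cycle[64] = {0};
    int last = 0;
    for (int i = 0; i < n; i++) {
        /* Earliest legal cycle: one past the producer's cycle. */
        int earliest = (dep[i] >= 0) ? sched[dep[i]] + 1 : 0;
        while (per_cycle[earliest] >= units) /* defer if no unit free */
            earliest++;
        sched[i] = earliest;
        per_cycle[earliest]++;
        if (earliest > last) last = earliest;
    }
    return last + 1;
}
```

With one allocated unit, three independent operations serialize over three cycles; with two units they finish in two, which is the area/speed tradeoff allocation controls.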
Often, there is a gap in quality—area, speed, and power—between HLS auto-generated hard-
ware and that designed by a human hardware expert, especially for applications where there exists
a particular spatial hardware layout, such as in the implementation of a fast Fourier transform. The
impact of that gap can be higher in custom ASIC technologies than in FPGAs, where the entire
purpose of a custom implementation is to achieve the best-possible power/speed in the lowest
silicon area. With FPGAs, as long as the (possibly bloated) synthesized implementation fits in the
target device, additional area is usually not a problem. For these reasons, it appears that FPGAs
may be the IC medium through which HLS will enter the mainstream of hardware design.

Software program

Compiler/
Optimizer

User constraints
(e.g., timing, resource) Allocation Target H/W
characterization

Scheduling

Binding

RTL
generation

Synthesizable RTL

FIGURE 16.6 High-level synthesis design flow.


16.2.2.1 COMMERCIAL FPGA HLS TOOLS

In 2011, Xilinx purchased AutoESL, an HLS vendor spawned from research at the University of
California, Los Angeles [27]. AutoESL has since become VivadoHLS—Xilinx’s commercial HLS
solution [28], which supports the synthesis of programs written in C, C++, and SystemC. By using
pragmas in the code, the user can control the hardware produced by VivadoHLS, for example, by
directing the tool to perform loop pipelining. Loop pipelining is a key performance concept in HLS
that permits a loop iteration to commence before its prior iteration has completed—essentially
implementing loop-level parallelism. Loop pipelining is illustrated in Figure 16.7, where the left
side of the figure shows a C code snippet having three addition operations in the loop body, and
the right side of the figure shows the loop pipelined schedule. In this example, it is assumed that
an addition operation takes one cycle. In cycle #1, the first addition of the 0th loop iteration is
executed. In cycle #2, the second addition of the 0th loop iteration and the first addition of the 1st
loop iteration are executed. Observe that by cycle #3, three iterations of the loop are in flight at
once, utilizing three adder functional units—referred to as the steady state of the loop pipeline.
The entire loop execution is concluded after N+2 clock cycles. The ability to perform loop pipelin-
ing depends on the data dependencies between loop iterations (loop-carried dependencies) and
amount of hardware available. In VivadoHLS, one would insert the pragma, #pragma AP pipeline
II=1, just prior to a loop to direct the tool to pipeline the loop with an initiation interval (II) of
1, meaning that the tool should start a loop iteration every single cycle. Additional pragmas per-
mit the control of hardware latency, function inlining, loop unrolling, and so on. A recent study
showed that for certain benchmarks, VivadoHLS produces hardware of comparable quality to
human-crafted RTL [29].
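The cycle count in this example follows the standard pipelined-loop relation, total = (N − 1) × II + depth: with a pipeline depth of three adders and II = 1, the loop finishes in N + 2 cycles as described above. A one-line C sketch (ours, for illustration):

```c
#include <assert.h>

/* Cycle count of a pipelined loop: the last of n_iters iterations
 * starts at cycle (n_iters - 1) * ii and takes 'depth' further cycles
 * to drain the pipeline. */
static long pipelined_cycles(long n_iters, long ii, long depth) {
    if (n_iters == 0) return 0;
    return (n_iters - 1) * ii + depth;
}
```

Note that without pipelining (II equal to the depth), the same loop takes roughly depth × N cycles, which is why achieving II = 1 is the usual goal.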
Altera takes a different approach to HLS versus Xilinx by using OpenCL [30,31] as the input
language. OpenCL is a C-like language that originated in the graphics computing domain but
has since been used as an input language for varied computing platforms. In OpenCL, parallel-
ism is expressed explicitly by the programmer—a feature aligned nicely with an FPGA’s spatial
parallelism. An OpenCL program has two parts: a host program that runs on a standard proces-
sor and one or more kernels, which are C-like functions that execute computations on OpenCL
devices—in this case, an Altera FPGA connected to the host x86 processor via a PCIe interface.
Altera’s OpenCL SDK performs HLS on the kernels to produce deeply pipelined implementations
that aim to keep the underlying hardware as busy as possible by accepting a new thread into the
pipeline every cycle, where possible. The HLS-generated kernel implementations connect to the
host processor via a PCIe interface. A recent demonstration [32] for a fractal video compression
application showed the OpenCL HLS providing 3× performance improvement over a GPU and
two orders of magnitude improvement over a CPU.

16.2.2.2 ACADEMIC FPGA HLS RESEARCH

HLS has also been the focus of recent academic research, with several research frameworks
under active development. GAUT is an HLS tool from the Université de Bretagne-Sud that
is specifically designed for DSP applications [32]. Shang is a generic (application-agnostic) HLS tool
under development at the Advanced Digital Sciences Center in Singapore [33]. Riverside optimizing

[Figure body omitted: a C loop whose body performs three additions per iteration (roughly, for (int i = 0; i < N; i++) { sum[i] = a + b + c + d; }), shown beside its pipelined schedule. By cycle 3 three iterations are in flight at once (the steady state), and all N iterations complete after N + 2 cycles.]

FIGURE 16.7 Loop pipelining.


compiler for configurable computing (ROCCC) [34] is a framework developed at UC Riverside that
is specifically aimed at streaming applications. Bambu [35] and Dwarv [36] are generic HLS tools
being developed at Politecnico di Milano and Delft University of Technology, respectively. Of these
tools, Bambu and Shang have been made open source. A binary is available for GAUT and ROCCC.
Dwarv is not available publicly as it is built within a proprietary compiler framework.
LegUp is an open-source FPGA HLS tool from the University of Toronto, which was first
released in 2011 and is presently on its third release [37]. LegUp accepts C as input and synthe-
sizes the program entirely to hardware or, alternately, to a hybrid system containing a MIPS soft
processor and one or more HLS-generated accelerators. It specifically targets Altera FPGAs, with
the processor and accelerators connecting to one another using a memory-mapped on-chip bus
interface. LegUp is implemented as back-end passes of the open-source low-level virtual machine
(LLVM) compiler framework [38], and therefore, it leverages the parser and optimizations avail-
able in LLVM. Scheduling in LegUp is formulated mathematically as a linear program [39], and
binding is implemented using a weighted bipartite matching approach [40]—operations from
the C are matched to hardware units, balancing the number of operations assigned to each unit.
Figure 16.8 shows the LegUp design flow. Beginning at the top left, a C program is compiled and
run on a self-profiling processor, which gathers data as the program executes. The profiling data
are used to select portions of the program (at the function level of granularity) to implement as
accelerators. The selected functions are passed through HLS, and the original program is modi-
fied, with the functions replaced by wrappers that invoke and communicate with the accelerators.
Ultimately, a complete FPGA-based processor/accelerator system is produced.
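LegUp's binding step uses weighted bipartite matching; as a much-simplified illustrative stand-in (ours, not LegUp's code), a greedy load-balancing assignment captures the flavor of spreading scheduled operations evenly over the allocated functional units.

```c
#include <assert.h>

/* Illustrative binding: assign each of n_ops operations to one of
 * n_units functional units, always choosing the least-loaded unit so
 * the operation counts stay balanced, as the matching objective in the
 * text describes. binding[op] receives the chosen unit; load[u] the
 * final per-unit operation count. */
static void bind_ops(int n_ops, int n_units, int *binding, int *load) {
    for (int u = 0; u < n_units; u++) load[u] = 0;
    for (int op = 0; op < n_ops; op++) {
        int best = 0;
        for (int u = 1; u < n_units; u++)
            if (load[u] < load[best]) best = u;
        binding[op] = best;
        load[best]++;
    }
}
```

A real binder also weights edges by factors such as multiplexer sharing and wirelength, which is what the bipartite-matching formulation optimizes globally.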
Beyond the traditional HLS steps, several FPGA-specific HLS studies have been conducted
using LegUp. Resource sharing is an HLS area reduction technique that shares a hardware func-
tional unit (e.g., a divider) among several operations in the input program (e.g., multiple division
operations in the source). Resource sharing can be applied whenever the operations are sched-
uled in different clock cycles, and it requires adding multiplexers to the inputs of the functional
unit to steer the correct input signals into the unit, depending on which operation is executing.
Hadjis [41] studied resource sharing in the FPGA context, where, because multiplexers are costly
to implement with LUTs, the authors showed there to be little area savings to be had by resource
sharing unless the target resource is large. Dividers and repeated patterns of interconnected com-
putational operators were deemed worth sharing, and the authors also showed the benefits of
sharing to depend on whether the target FPGA architecture contained 4 LUTs or 6 LUTs, with

[Figure body omitted: a C program (an FIR filter kernel accumulating h[i] * z[i] over ntaps) is compiled and executed on a self-profiling processor. The profiling data (execution cycles, estimated power, cache misses) select program segments to harden; those segments pass through high-level synthesis to the FPGA fabric, while the altered software binary calls the resulting hardware accelerators.]

FIGURE 16.8 LegUp design flow.


[Figure body omitted: a doubly nested loop updating M[i][j] from neighboring entries, shown beside its iteration space drawn as points in the (i, j) plane; the points form a polyhedron, and arrows mark the dependencies between loop iterations.]

FIGURE 16.9 Polyhedral loop model.

sharing being more profitable in 6-LUT-based architectures owing to the steering multiplexers
being covered within the same LUTs as the shared logic. In another FPGA-centric study, Canis
[42] takes advantage of the high-speed DSP blocks in commercial FPGAs, which usually can
operate considerably faster than the surrounding system (with LUTs and programmable inter-
connect). Their idea was to multipump the DSP blocks, operating them at 2× the clock frequency
of the system, thereby allowing 2 multiply operations per system cycle.
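The multipumping idea can be illustrated with a toy Python model (purely illustrative; the names and the cycle accounting are our own, not LegUp's):

```python
def multipumped_dsp(pairs):
    """Toy model: one DSP block clocked at 2x the system frequency
    computes two products per system clock cycle."""
    results, cycles = [], 0
    for i in range(0, len(pairs), 2):       # one system cycle per iteration
        for a, b in pairs[i:i + 2]:         # two fast DSP cycles within it
            results.append(a * b)
        cycles += 1
    return results, cycles

ops = [(2, 3), (4, 5), (6, 7), (8, 9)]
products, cycles = multipumped_dsp(ops)
# Four multiplies complete in two system cycles instead of four.
assert products == [6, 20, 42, 72] and cycles == 2
```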
There has also been considerable recent HLS research that, while not unique to FPGAs, nevertheless uses FPGAs as its test vehicle. Huang [43] studied the 50+ compiler optimization passes
distributed with LLVM (e.g., constant propagation, loop rotation, and common subexpression
elimination) and assessed which passes were beneficial to LegUp HLS-generated hardware. The
authors also considered customized recipes of passes specifically tailored to an application and
showed that it is possible to improve hardware performance by 18% versus using the blanket “-O3”
optimization level. Choi [44] added support for pthreads and OpenMP to LegUp: two widely used
software parallelization paradigms. The authors synthesize parallel software threads into parallel
operating hardware accelerators, thereby providing a relatively straightforward way for a software
engineer to realize spatial parallelism in an FPGA. Zheng et al. studied multicycling of combinational paths in HLS—a technique that allows combinational paths with cycle slack to span multiple clock cycles, lowering the minimum clock period and raising performance [45]. There has also been work on supporting other input languages, such as CUDA, where, like OpenCL, parallelism is expressed explicitly [46].
Another popular HLS topic in recent years has been on the use of the polyhedral model to
analyze and optimize loops in ways that improve the HLS hardware (Figure 16.9). The iteration
space of the nested loops on the left side of the figure is shown as black points on the right side of
the figure. Observe that the iteration space resembles a geometrical triangle—a polyhedron. The
arrows in the figure illustrate the dependencies between loop iterations. Polyhedral loop analysis
and optimization [47] represent loop iteration spaces mathematically as polyhedra and can be
applied to generate alternative implementations of a loop where the order of computations, and
thereby the dependencies between iterations, is changed. A straightforward example of such a
manipulation would be to interchange the inner and outer loops with one another, and polyhedral
loop optimizers also consider optimizations that are considerably more complex. Optimization via the polyhedral model may permit loop nests to be pipelined with a lower initiation interval, improving hardware performance [48], or may reduce the amount of hardware resources required to meet a given throughput [49]. The polyhedral model has also been applied to the synthesis of partitioned memory architectures [50] and to the optimization of off-chip memory bandwidth [51].
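A minimal sketch of the legality reasoning the polyhedral model automates, using a hypothetical stencil-style loop nest: each iteration (i, j) depends only on (i − 1, j) and (i, j − 1), both of which precede (i, j) under either loop order, so the loops may be interchanged without changing the result:

```python
N = 4

def run(order):
    # M is bordered with zeros; only M[1..N][1..N] is computed.
    M = [[0] * (N + 1) for _ in range(N + 1)]
    for a in range(1, N + 1):
        for b in range(1, N + 1):
            # "ij" runs i in the outer loop; "ji" interchanges the loops.
            i, j = (a, b) if order == "ij" else (b, a)
            M[i][j] = M[i - 1][j] + M[i][j - 1] + 1
    return M

# Both loop orders respect the dependencies, so the results match.
assert run("ij") == run("ji")
```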

[Link] RESEARCH CHALLENGES FOR HLS

There are a number of future challenges to the widespread adoption of HLS. First is the need
for debugging and visualization tools for HLS-generated hardware. Presently, with many HLS
tools, the user must resort to logic simulation and waveform inspection to resolve bugs in the
hardware or its integration with the surrounding system—a methodology that is unacceptable
and incomprehensible to software engineers. Likewise, the machine-generated RTL produced
by HLS tools is often extremely difficult for a human reader to follow, making it difficult for the
engineer to understand the hardware produced and determine how to optimize it. A second key
challenge is the need to raise the quality of HLS hardware from the power, performance, and area

perspectives, to narrow the gap between HLS and human-crafted hardware. Within this chal-
lenge, a present issue is that the quality of the hardware produced depends on the specific style of
the input program. For example, some HLS tools may only support a subset of the input language,
or the output may depend strongly on the input coding style. Another issue on this front is that,
presently, it is difficult for the HLS tool to be able to leverage the parallelism available in the target
fabric, given that the input to the HLS tool is typically a sequential specification. The extraction
of parallelism is mitigated somewhat in some cases by the use of parallel input languages, like
OpenCL™, where more of the parallelism is specified by the designer.

16.3 LOGIC SYNTHESIS

This section covers the application of traditional RTL and logic synthesis to FPGA design and
then overviews technology mapping algorithms specific to the LUT-covering problem of FPGA
synthesis.

16.3.1 RTL SYNTHESIS

RTL synthesis includes the inference and processing of high-level structures—adders, multipliers,
multiplexers, buses, shifters, crossbars, RAMs, shift registers, and FSMs—prior to decomposition
into generic gate-level logic. In commercial tools, 20%–30% of logic elements eventually seen
by placement are mapped directly from RTL into the device-specific features rather than being
processed by generic gate-level synthesis. For example, arithmetic is synthesized into carry-select
adders in some FPGAs and ripple carry in others. An 8:1 multiplexor will be synthesized differ-
ently for 4 LUT or 6 LUT architectures and for devices with dedicated multiplexor hardware.
Though RTL synthesis is an important area of work for FPGA and CAD tool vendors, it has received very little attention in the published literature. One reason for this is that the implementation of operators can be architecture specific. A more mundane issue is simply that there are historically very few public-domain VHDL/Verilog front-end tools or high-level designs with arithmetic and other features, and both are necessary for research in the area. There are some promising
improvements to research infrastructure, however. The VTR toolset [5] is an open-source flow that adds a full Verilog analysis and elaboration front end to VPR (the physical design portion), enabling academic research, for the first time, to address the RTL synthesis flow and to examine CAD for binding and mapping.
In commercial FPGA architectures, dedicated hardware is provided for multipliers, adders,
clock enables, clear, preset, RAM, and shift registers. Arithmetic was discussed briefly earlier; all
major FPGAs either convert 4 LUTs into 3-LUT-based sum and carry computations or provide
dedicated arithmetic hardware separate from the LUT. The most important effect of arithmetic
is the restriction it imposes on placement, since cells in a carry chain must be placed in fixed
relative positions.
One of the goals of RTL synthesis is to take better advantage of the hardware provided. This
will often result in different synthesis than that which would take place in an ASIC flow because
the goal is to minimize LEs rather than gates. For example, a 4:1 mux can be implemented opti-
mally in two 4 LUTs, as shown in Figure 16.10. RTL synthesis typically produces these premapped
cells and then protects them from processing by logic synthesis, particularly when they occur in
a bus and there is a timing advantage from a symmetric implementation. RTL synthesis will rec-
ognize barrel shifters (“y <= x >> s” in Verilog) and convert these into shifting networks as
shown in Figure 16.11, again protecting them from gate-level manipulation.
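The two-LUT implementation of the 4:1 mux can be checked exhaustively. The wiring below is one workable decomposition (the exact routing in Figure 16.10 may differ): the first LUT either selects between C and D or passes S0 through, and the second LUT interprets that result accordingly:

```python
from itertools import product

def mux4(a, b, c, d, s1, s0):
    return [a, b, c, d][2 * s1 + s0]     # reference 4:1 multiplexer

def lut1(s0, s1, c, d):
    # When s1 = 1, select between c and d; when s1 = 0, pass s0 through.
    return (d if s0 else c) if s1 else s0

def lut2(g, s1, a, b):
    # When s1 = 1, forward lut1's data; otherwise g carries s0,
    # which selects between a and b.
    return g if s1 else (b if g else a)

# Check all 64 input combinations against the reference mux.
for a, b, c, d, s1, s0 in product([0, 1], repeat=6):
    assert lut2(lut1(s0, s1, c, d), s1, a, b) == mux4(a, b, c, d, s1, s0)
```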
Recognition of register control signals such as clock enable, clear/preset, and synchronous/
asynchronous load signals adds additional complications to FPGA synthesis. Since these preexist
in the logic cell hardware (see Figure 16.2), there is a strong incentive to synthesize to them even
when it would not make sense in ASIC synthesis. For example, a 4:1 mux with one constant input
does not fit in a 4 LUT, but when it occurs in a datapath, it can be synthesized for most com-
mercial LEs by using the LAB-wide (i.e., shared by all LEs in a LAB) synchronous load signal as
a fifth input. Similarly, a clock enable already exists in the hardware, so register feedback can be

[Figure: a 4:1 multiplexer built from two cascaded 4 LUTs, with data inputs A–D and select signals S0 and S1.]

FIGURE 16.10 Implementing a 4:1 mux in two 4 LUTs.

[Figure: a three-stage shifting network controlled by select signals s0, s1, and s2.]

FIGURE 16.11 An 8-bit barrel-shifter network.

converted to an alternative structure with a clock enable to hold the current value but no routed
register feedback. For example, if f = z in Figure 16.12, we can express the cone of logic with a
clock enable signal CE = c5∙c3∙c1′ added to DFF z, and MUX( f, g) replaced simply by g—this is a
win for bus widths of two or more. This transformation can be computed with a binary decision
diagram (BDD) or other functional techniques. Tinmaung [52] applies BDD-based techniques for
this and other power optimizations.
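A simplified simulation of the feedback-to-clock-enable transformation (a single hold condition stands in for the full mux chain of Figure 16.12; the function names are our own):

```python
import random

def step_mux_feedback(z, hold, d):
    # Original: the register's input mux routes z back when holding,
    # requiring a routed feedback connection.
    return z if hold else d

def step_clock_enable(z, hold, d):
    # Transformed: the built-in clock enable (CE = not hold) retains the
    # old value, so no routed register feedback is needed.
    ce = not hold
    return d if ce else z

random.seed(0)
z1 = z2 = 0
for _ in range(1000):
    hold, d = random.getrandbits(1), random.getrandbits(1)
    z1 = step_mux_feedback(z1, hold, d)
    z2 = step_clock_enable(z2, hold, d)
    assert z1 == z2   # both forms produce identical register traces
```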
Several algorithms have addressed the RTL synthesis area, nearly all from FPGA vendors. Metzgen and Nancekievill [53,54] presented algorithms for the optimization of multiplexer-based buses, which would otherwise be inefficiently decomposed into gates. Most modern FPGA architectures do not provide on-chip tri-state buses, so multiplexers are the only choice for buses
and are heavily used in designs. Multiplexers are very interesting structures for FPGAs because
the LUT cell yields a relatively inefficient implementation of a mux, and hence, special-purpose
hardware for handling multiplexers is common. Figure 16.12 shows an example taken from [53],
which restructures buses of multiplexers for better technology mapping (covering) into 4 LUTs.
The structure on the left requires 3 LUTs per bit to implement in 4 LUTs, while the structure on
the right requires only 2 LUTs per bit.


[Figure: two implementations of a multiplexer bus computing z from data inputs a–g under select signals c1–c6. The original structure on the left covers into 3 LUTs per bit; the restructured version on the right, which uses a derived select signal (c5 and ~c3), covers into 2 LUTs per bit.]

FIGURE 16.12 Multiplexer bus restructuring for LUT packing.

Also to address multiplexers, newer FPGA architectures have added clever hardware to aid
in the synthesis of multiplexer structures such as crossbars and barrel shifters. Virtex devices
provide additional stitching multiplexers for adjacent LEs, which can be combined to build larger multiplexers efficiently; an abstraction of this composable LUT is shown in
Figure 16.13a. These are also used for stitching RAM bits together when the LUT is used as a
16-bit RAM (discussed earlier). Altera’s adaptive logic module [55], shown abstractly in Figure
16.13b, allows a 6 LUT to be fractured to implement a 6 LUT, two independent 4 LUTs, two
5 LUTs that share two input signals, and also two 6 LUTs that have 4 common signals and two
different signals (a total of 8). This latter feature allows two 4:1 muxes with common data and different select signals to be implemented in a single LE, which means crossbars and barrel shifters built out of 4:1 muxes can use half the area they would otherwise require.
At the RTL, a number of FPGA-specific hardware optimizations can be made for addressing
the programmable hardware. For example, Tessier [56] examines alternative methods for producing logical memories out of the hard-embedded memory blocks in the FPGA: a logical 16K-word × 16-bit memory can be constructed either by joining sixteen 16K×1 physical memories with a shared address bus or by joining sixteen 1K×16 memories with an external mux built of logic.
These come with trade-offs on performance and area versus power consumption, and Tessier

[Figure: (a) a composable LUT, in which adjacent LEs are stitched by extra multiplexers to build larger functions (e.g., z0(a,b,c0,d0,e,f) from z1(a,b,c0,d0,e) and z2(a,b,c1,d1,f)); (b) a fracturable LUT, in which one physical LUT splits into two smaller LUTs that share some inputs.]

FIGURE 16.13 Composable and fracturable logic elements: (a) Composable LUT and (b) fracturable LUT.


shows up to 26% dynamic power reduction (on memory blocks) by synthesizing the min-power
configurations versus the min-area versions.

16.3.2 LOGIC OPTIMIZATION

Technology-independent logic synthesis for FPGAs is similar to ASIC synthesis. The Berkeley SIS system [57] and, more recently, Mishchenko's AIG-based synthesis [58] implemented in the ABC system [59] are in widespread use. Nearly all FPGA-based synthesis research now uses ABC as its base tool, in the way that VPR is the standard for physical design. The general topic of logic synthesis is described in [14,15]. Synthesis tools for FPGAs contain basically the same two-level minimization algorithms and algebraic and Boolean algorithms
for multilevel synthesis. Here, we will generally restrict our discussion to the differences from
ASIC synthesis and refer the reader to the chapter on logic synthesis in this book [14] for the
shared elements.
One major difference between standard and FPGA synthesis is in cost metrics. The target
technology in a standard cell ASIC library is a more finely grained cell (e.g., a two-input NAND
gate), while a typical FPGA cell is a generic k-input LUT. A 4 LUT is a 16-bit SRAM LUT mask
driving a 4-level tree of 2:1 mux controlled by the inputs A, B, C, D (Figure 16.2b). Thus, A + B +
C + D and AB + CD + AB′D′ + A′B′C′ have identical costs in LUTs, even though the former has
4 literals and the latter 10—the completeness of LUTs makes input counts more important than
literals. In general, the count of two-input gates correlates much better to 4 LUT implementation
cost than the literal-count cost often used in ASIC synthesis algorithms, but this is not always the
case, as described for the 4:1 mux in the preceding section. A related difference is that inverters
are free in FPGAs because (1) the LUT mask can always be reprogrammed to remove an inverter
feeding or fed by an LUT and (2) programmable inversion at the inputs to RAM, IO, and DSP
blocks is available in most FPGA architectures. In general, registers are also free because all LEs
have a built-in DFF. This changes cost functions for retiming and state-machine encoding as well
as designer preference for pipelining.
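The input-count point can be made concrete with a small support-size check: both example expressions (A + B + C + D and AB + CD + AB′D′ + A′B′C′) depend on exactly four variables, so each costs exactly one 4 LUT despite a 4-versus-10 literal count (illustrative Python, not from any synthesis tool):

```python
from itertools import product

def support_size(f, n):
    """Number of inputs the n-input Boolean function f actually depends on."""
    dep = set()
    for bits in product([0, 1], repeat=n):
        for i in range(n):
            flipped = list(bits)
            flipped[i] ^= 1
            if f(*bits) != f(*flipped):
                dep.add(i)
    return len(dep)

f1 = lambda a, b, c, d: a | b | c | d                   # A+B+C+D: 4 literals
f2 = lambda a, b, c, d: ((a & b) | (c & d)              # AB + CD
                         | (a & (1 - b) & (1 - d))      # + AB'D'
                         | ((1 - a) & (1 - b) & (1 - c)))  # + A'B'C': 10 literals
# Despite very different literal counts, both cost exactly one 4 LUT.
assert support_size(f1, 4) == 4 and support_size(f2, 4) == 4
```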
Subfactor extraction algorithms are much more important for FPGA synthesis than the academic literature, where ASIC gates are assumed, would suggest. It is not clear whether this arises from the much larger and more datapath-oriented designs seen in industrial flows (versus open-source gate-level circuits), from the more structured synthesis of a complete HDL-to-gates flow, or from the larger cell granularity. In contrast, algorithms in the class of
speed_up in SIS do not have significant effects on circuit performance for commercial FPGA
designs. Again, this can be due either to the flow and reference circuits or to differing area/depth
trade-offs. Commercial tools perform a careful balancing of area and depth during multilevel
synthesis.
The synthesis of arithmetic functions is typically performed separately in commercial FPGA
tools (see Section 16.3.1). Prior to VTR and some more recent versions of ABC, public tools like
SIS synthesized arithmetic into LUTs, which distorted circuit behavior in benchmarking results.
Though reasonable for gate-level synthesis targeting ASICs, this can result in a dramatic shift
in critical paths and area metrics when targeting FPGAs—an LUT–LUT delay may be as much
as 10× the delay of a dedicated cin–cout path visible in Figure 16.2b. Typical industrial designs
contain 10%–25% of logic cells in arithmetic mode, in which the dedicated carry circuitry is used
with or instead of the LUT [60].
Retiming algorithms from general logic synthesis [61] have been adapted specifically for
FPGAs [62], taking into account practical restrictions such as metastability (retiming should not
be applied to synchronization registers), I/O versus core timing trade-offs (similar asynchronous
transfers), power-up conditions (which can be provably unsatisfiable), and the abundance of reg-
isters (unique to FPGAs). Retiming is also used as part of physical resynthesis algorithms, as will
be discussed later.
There are a number of resynthesis algorithms that are of particular interest to FPGAs, specifically structural decomposition, functional decomposition, and postoptimization using set of pairs of functions to be distinguished (SPFD)-based rewiring. SPFDs, proposed by Yamashita [63], exploit the inherent flexibility in LUT-based netlists and are a generalization of observability don't-care functions. The on/off/dc sets of a function are represented abstractly as a bipartite graph whose edges denote the minterm pairs that must be distinguished; a coloring of the graph thus gives an alternative implementation of the function. One inherent flexibility of LUTs in FPGAs is that they do not need to represent inverters, because these can always be absorbed by changing the destination node's LUT mask. By storing distinctions rather than functions, SPFDs generalize this to allow for more efficient expressions of logic.
Cong [64,65] applied SPFD calculations to the problem of rewiring a previously technology-
mapped netlist. The algorithm consists of precomputing the SPFDs for each node in the network,
identifying a target wire (e.g., a delay-critical input of an LUT after technology mapping), and
then trying to replace that wire with another LUT output that satisfies its SPFD. The don’t-care
sets in the SPFDs occur in the internal nodes of the network where flexibility exists in the LUT
implementation after synthesis and technology mapping. Rewiring was shown to have benefits
for both delay and area.
An alternative, more FPGA-specific approach to synthesis was taken by Vemuri's BDS system [66,67], building on BDD-based decomposition [68]. These authors argued that the separation of
technology-independent synthesis from technology mapping disadvantaged FPGAs, which need
to optimize LUTs rather than literals due to their greater flexibility and larger granularity (many
SIS algorithms target literal count). The BDS system integrated technology-independent optimi-
zation using BDDs with LUT-based logic restructuring and used functional decomposition
to target decompositions into LUT mappings. The standard sweep, eliminate, decomposition,
and factoring algorithms from SIS were implemented in a BDD framework. The end result uses a
technology mapping step but on a netlist more amenable to LUT mapping. Comparisons between
SIS and BDS-pga using the same technology mapper showed area and delay benefits attributable
to the BDD-based algorithms.

16.3.3 TECHNOLOGY MAPPING

Technology mapping for FPGAs is the process of turning a network of primitive gates into a
network of LUTs of size at most k. The constant k is historically 4, though many recent commer-
cial architectures have used fracturable logic cells with k = 6: Altera starting with Stratix II and
Xilinx with Virtex-5. LUT-based technology mapping is best seen as a covering problem, since
it is both common and necessary to cover some gates by multiple LUTs for an efficient solution.
Figure 16.14, taken from the survey in [69], illustrates this concept in steps from the original
netlist (a), covering (b), and final result (c). Technology mapping aims for minimum unit depth combined with the fewest cells in the mapped network.

[Figure: a small network over inputs a–e with outputs f and g, shown three times: (a) the original netlist, (b) a covering by 4 LUTs in which some gates are covered more than once, and (c) the final mapped result.]

FIGURE 16.14 Technology mapping as a covering problem. (From Ling, A. et al., FPGA technology
mapping: A study of optimality, in Proceedings of the Design Automation Conference, 2005.)


FPGA technology mapping differs significantly from library-based mapping for cell-based
ASICs and uses different techniques. The most successful attempts divide into two paradigms:
dynamic programming approaches derived from Chortle [70] and cut-based approaches branch-
ing from FlowMap [71]. Technology mapping is usually preceded by the decomposition of the
netlist into two-input gates using an algorithm such as DOGMA [72].
In the first FPGA-specific technology mapping algorithm, Chortle [70], the netlist is decom-
posed into two-input gates and then divided into a forest of trees. Chortle computes an optimum
set of k-feasible mappings for the current node. A k-feasible mapping is a subcircuit comprising
the node and (some of) its predecessors such that the number of inputs is no more than k and
only the mapped node has an output edge. Chortle combines solutions for children within reach
of one LUT implemented at the current node, following the dynamic programming paradigm.
Improvements on Chortle considered not trees but maximum fanout-free cones (MFFCs), which
allowed for mapping with duplication. Area mapping with no duplication was later shown to be
optimally solvable in polynomial time for MFFCs [73]. But, perhaps contrary to intuition, duplication is important in improving results for LUTs because it allows nodes with fanout greater than one to be implemented as internal nodes of the cover; this is required to obtain improved
delay and can also contribute to improved area. Figure 16.14b shows a mapping to illustrate this
point.
A breakthrough in technology mapping research came with the introduction of FlowMap [71]
proposed by Cong and Ding. In that work, the authors consider the combinational subcircuit rooted
at each node of a Boolean network. A cut of that subcircuit divides it into two parts: one part con-
taining the root node, referred to as A′, and a second part containing the rest of the subcircuit,
referred to as A. In mapping to LUTs, one cares only that the number of signals to every LUT does
not exceed k, the number of LUT inputs. Consequently, the portion of the circuit, A′, can be covered
by an LUT as long as the number of signals crossing from A to A′ does not exceed k. FlowMap uses
network flow techniques to find cuts of the circuit network in a manner that provably produces a
depth-optimal mapping (for a given fixed decomposition into two-input gates).
All state-of-the-art approaches to FPGA technology mapping use the notion of k-feasible cuts.
However, later work has shown that network flow methods are not needed to find such cuts. In
fact, it is possible to find all k-feasible cuts for every node in the network. To achieve this, the
network is traversed in topological order from primary inputs to outputs. The set of k-feasible
cuts for any given node is generated by combining cuts from its fanin nodes and discarding those
cuts that are not k-feasible. Figure 16.15 gives an example, where it is assumed we are at the point
of computing the cuts for node z, in which case, the cuts for x and c have already been computed
(owing to the topological traversal order). In this example, Cx is a cut for node x, and Cc is a cut for node c. We can find a cut for node z by combining Cx and Cc, producing cut Cz in the figure.

[Figure: a small network over primary inputs a, b, d, and e, in which a cut Cx for node x and a cut Cc for node c are combined to produce the cut Cz for node z.]

FIGURE 16.15 Example of cut generation.


Taking all such pairs yields the complete k-feasible cut set for node z. Schlag [74] proved that this
cut generation approach is comprehensive—it does not miss out on finding any cuts. Technology
mapping with cuts then proceeds by (1) finding all cuts for each node in the network, (2) selecting a best cut for each node, and (3) constructing a mapping solution in reverse topological order using the selected best cuts.
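The cut generation procedure just described can be sketched compactly (illustrative code, not taken from any particular mapper):

```python
from itertools import product

def enumerate_cuts(fanins, k):
    """fanins: dict mapping each node to its fanin list ([] for primary
    inputs), given in topological order. Returns, per node, the set of
    k-feasible cuts, each cut a frozenset of leaf nodes."""
    cuts = {}
    for node, ins in fanins.items():
        node_cuts = {frozenset([node])}               # the trivial cut
        if ins:
            # Combine one cut from each fanin; keep only k-feasible merges.
            for combo in product(*(cuts[f] for f in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts

# z = g(x, y) with x = f(a, b) and y = h(b, c)
net = {"a": [], "b": [], "c": [], "x": ["a", "b"], "y": ["b", "c"], "z": ["x", "y"]}
cuts = enumerate_cuts(net, k=4)
assert cuts["z"] == {frozenset(s) for s in
                     [{"z"}, {"x", "y"}, {"x", "b", "c"}, {"a", "b", "y"}, {"a", "b", "c"}]}
```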
CutMap by Cong and Hwang [75] improved the area of mapping solutions, while maintaining
the property of optimal delay. The key concept in CutMap is to first compute a depth-optimal
mapping and then, based on the slack in that mapping, remap portions of the network to mini-
mize area. That is, for portions of the input network that are on the critical path, the cuts seen as
best are those that minimize delay. Conversely, for portions of the network that are not on the
critical path, the cuts seen as best are those that minimize area. Empirical results for CutMap
show 15% better area than FlowMap with the same unit delay at the cost of longer runtime.
DAOmap [76] is a more recent work that generates k-feasible cones for nodes as in CutMap and
then iteratively traverses the graph forward and backward to choose implementation cuts balanc-
ing area and depth. The forward traversal identifies covering cones for each node (depth optimal
for critical and area optimal for noncritical nodes), and the backward traversal then selects the
covering set and updates the heights of remaining nodes. The benefit of iteration is to relax the
need for delay-optimal implementation once updated heights mark nodes as no longer critical,
allowing greater area improvement, while maintaining depth optimality. DAOmap additionally
considers the potential node duplications during the cut enumeration procedure, with look-ahead
pruning. It is worth mentioning that while it is possible to compute a depth-optimal mapping
for a given input network in polynomial time, computing an area-optimal mapping was shown
to be NP-hard by Farrahi [77]. However, clever heuristics have been devised on the area front,
such as the concept of area flow [78], which, like DAOmap, iterates over the network to find good
mappings for multifanout nodes. Specifically, area flow–based mapping makes smart decisions
regarding whether a multifanout node should be replicated in multiple LUTs or should be the
fanout node of a subcircuit covered by an LUT in the final mapping.
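Formulations of area flow vary across papers; one common scheme, sketched here with hypothetical names, divides a node's accumulated LUT area among its fanouts so that shared logic is not double-counted:

```python
def area_flow(chosen_cut, num_fanouts, topo_order, lut_area=1.0):
    """chosen_cut: the selected cut (set of leaves) per node, empty for
    primary inputs. A node's flow is its LUT area plus its leaves' flows,
    amortized over the node's fanouts."""
    af = {}
    for node in topo_order:
        leaves = chosen_cut[node]
        flow = 0.0 if not leaves else lut_area + sum(af[l] for l in leaves)
        af[node] = flow / max(1, num_fanouts[node])
    return af

cut = {"a": set(), "b": set(), "c": set(), "m": {"a", "b"}, "z": {"m", "c"}}
fanout = {"a": 1, "b": 1, "c": 1, "m": 2, "z": 1}
af = area_flow(cut, fanout, ["a", "b", "c", "m", "z"])
# m's LUT is shared by two fanouts, so z is charged only half its area.
assert af["z"] == 1.5
```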
The ABC logic synthesis framework [59] also incorporates technology mapping to LUTs. It
mirrors prior work in that it computes and costs cuts for each node in the network. However,
rather than storing all cuts for each node in the network (which can scale exponentially in the
worst case), ABC only stores a limited set of priority cuts—a set whose size can be set by the user
[79]. Experimental results demonstrated little quality loss, despite considering a reduced number
of cuts. A circuit having a certain function can be represented in a myriad of ways, and more recent work in ABC has considered the effect of the input network's structure on technology mapping results, proposing ways to perform technology mapping on several different networks representing the same circuit and choosing the best results from each [80,81].

16.3.4 POWER-AWARE SYNTHESIS

Recently, synthesis and technology mapping algorithms have begun to address power [82].
Anderson and Najm [83] proposed a modification to technology mapping algorithms to mini-
mize node duplication and thereby minimize the number of wires between LUTs (as a proxy for
dynamic power required to charge up the inter-LUT routing wires). EMAP by Lamoureux [84] modifies CutMap with an additional cost function component that favors cuts reusing already-cut nodes and cuts with low (probabilistically estimated) activity factors, so as to trap high-activity nodes inside clusters with lower capacitance. Chen [85] extends this type of mapping to a heterogeneous FPGA architecture with dual voltage supplies, where high-activity nodes additionally need to be routed to low-VDD (low-power) LEs and critical nodes to high-VDD (high-performance) LEs.
Anderson [86] proposes some interesting ideas on modifying the LUT mask during synthesis
to place LUTs into a state that will reduce leakage power in the FPGA routing. Commercial tools
synthesize clock enable circuitry to reduce dynamic power consumption within blocks. These
are likely the beginning of many future treatments for power management in FPGAs. In a differ-
ent work, Anderson [87] shows a technique for adding guarded evaluation of functions to reduce
dynamic power consumption—this is a technique unique to FPGAs, as it utilizes the leftover


circuitry in the FPGA that would otherwise be unused. The technique adds a new cost function to
technology mapping that looks for unobservable logic (i.e., don’t-care cones) and gates the activity
to the subdesign, trading off additional area versus power savings. Hwang [88] and Kumthekar
[89] also applied SPFD-based techniques for power reduction.

16.4 PHYSICAL DESIGN

The physical design flow for FPGAs consists of clustering, placement, physical resynthesis, and
routing; we discuss each of these phases in the following subsections.
Commercial tools have additional preprocessing steps to allocate clock and reset signals to spe-
cial low-skew global clock networks, to place phase-locked loops, and to place transceiver blocks
and I/O pins to meet the many electrical restrictions imposed on them by the FPGA device and
the package. With the exception of some work on placing FPGA I/Os to respect electrical restric-
tions [90,91], however, these preprocessing steps are typically not seen in any literature.
FPGA physical design can broadly be divided into routability-driven and timing-driven algo-
rithms. Routability-driven algorithms seek primarily to find a legal placement and routing of the
design by optimizing for reduced routing demand. In addition to optimizing for routability, timing-
driven algorithms also use timing analysis to identify critical paths and/or connections and attempt
to optimize the delay of those connections. Since the majority of delay in an FPGA is contributed
by the programmable interconnect, timing-driven placement and routing can achieve a large cir-
cuit speedup versus routability-driven approaches. For example, a Xilinx commercial CAD system
achieves an average of 50% higher design performance with full effort timing-driven placement and
routing versus routability-only placement and routing at the cost of 5× runtime [92].
In addition to optimizing timing and routability, some recent FPGA physical design algo-
rithms also implement circuits such that power is minimized.

16.4.1 PLACEMENT AND CLUSTERING

Since nearly all FPGAs have clustering (into CLB or LAB structures), physical design usually consists
of a clustering phase followed by direct placement or a two-step (global and detailed) placement.

[Link] PROBLEM FORMULATION

The placement problem for FPGAs differs from the placement problem for ASICs in several impor-
tant ways. First, placement for FPGAs is a slot assignment problem—each circuit element in the
technology-mapped netlist must be assigned to a discrete location, or slot, on the FPGA device of a
type that can accommodate it. Figure 16.1 shows the floorplan of a typical modern FPGA. An LE,
for example, must be assigned to a location on the FPGA where an LE has been fabricated, while
an input/output (I/O) block or RAM block must each be placed in a location where the appropriate
resource exists on the FPGA. Second, there are usually a large number of constraints that must be
satisfied by a legal FPGA placement. For example, groups of LEs that are placed in the same logic
block have limits on the maximum number of distinct input signals and the number of distinct
clocks they can use, and cells in carry chains must be placed together as a macro. Finally, all rout-
ing in FPGAs consists of prefabricated wires and transistor-based switches to interconnect them.
Hence, the amount of routing required to connect two circuit elements, and the delay between
them, is a function not just of the distance between the circuit elements but also of the FPGA rout-
ing architecture. The amount of (prefabricated) routing is also strictly limited, and a placement that
requires more routing in some region of the FPGA than exists there cannot be routed.

[Link] CLUSTERING

A common adjunct to FPGA placement algorithms is a bottom-up clustering step that runs before
the main placement algorithm in order to group related circuit elements together into clusters
(LABs, RAM, and DSP blocks in Figure 16.1). Clustering reduces the number of elements to place,

© 2016 by Taylor & Francis Group, LLC


Chapter 16 – FPGA Synthesis and Physical Design    393

improving the runtime of the main placement algorithm. In addition, the clustering algorithm
usually deals with many of the complex FPGA legality constraints by grouping primitives (such
as LEs in Figure 16.2) into legal function blocks (e.g., LABs in Figure 16.2), simplifying legality
checking for the main placement algorithm.
Many FPGA clustering algorithms are variants of the VPack algorithm [93]. VPack clusters
LEs into logic blocks by choosing a seed LE for a new cluster and then greedily packing the LE
with the highest attraction to the current cluster until no further LEs can be legally added to the
cluster. The attraction function is the number of nets in common between an LE and the current
cluster. The T-VPack algorithm by Marquardt [94] is a timing-driven enhancement of VPack,
where the attraction function for an LE, L, to cluster C becomes

(16.1) attraction(L) = 0.75 × criticality(L, C) + 0.25 × |Nets(L) ∩ Nets(C)| / MaxNets

The first term gives higher attraction to LEs that are connected to the current cluster by timing-
critical connections, while the second term is taken from VPack and favors grouping LEs with
many common signals together. Somewhat surprisingly, T-VPack improves not only circuit speed
versus VPack but also routability, by absorbing more connections within clusters. The iRAC [95]
clustering algorithm achieves further reductions in the amount of routing necessary to intercon-
nect the logic blocks by using attraction functions that favor the absorption of small nets within
a cluster and by sometimes leaving empty space in clusters. The study of [96] showed that the
Quartus II commercial CAD tool also significantly reduces routing demand by not packing clus-
ters to capacity when doing so would require grouping unrelated logic in a single cluster.
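As a concrete illustration, the greedy seed-and-grow loop shared by VPack and T-VPack can be sketched as follows. The netlist representation, the deterministic seed choice, and the fixed cluster capacity are simplifying assumptions for the sketch, not details of the published algorithms; the attraction function follows Equation 16.1.

```python
def cluster_les(les, nets_of, criticality, capacity=4, max_nets=10):
    """Greedy T-VPack-style clustering (simplified sketch).

    les: list of LE names; nets_of: LE -> set of net ids;
    criticality: (le, cluster_tuple) -> float in [0, 1].
    """
    unclustered = set(les)
    clusters = []
    while unclustered:
        # Seed a new cluster with an arbitrary unclustered LE
        # (deterministic choice here for reproducibility).
        seed = min(unclustered)
        cluster = [seed]
        unclustered.remove(seed)
        cluster_nets = set(nets_of[seed])
        while len(cluster) < capacity and unclustered:
            # Pick the LE with the highest attraction to the cluster (Eq. 16.1).
            def attraction(le):
                shared = len(nets_of[le] & cluster_nets)
                return 0.75 * criticality(le, tuple(cluster)) + 0.25 * shared / max_nets
            best = max(unclustered, key=attraction)
            if attraction(best) == 0.0:
                break  # nothing related remains; start a new cluster
            cluster.append(best)
            unclustered.remove(best)
            cluster_nets |= nets_of[best]
        clusters.append(cluster)
    return clusters
```

With a zero criticality function this degenerates to plain VPack behavior, grouping LEs purely by shared nets.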
Feng introduced an alternative clustering approach with PPack2 [97]. Instead of greedily form-
ing clusters one by one with an attraction function, PPack2 recursively partitions a circuit into
smaller and smaller partitions until each partition fits or nearly fits into a single cluster. A rebal-
ancing step after partitioning moves LEs from overfull clusters to clusters with some spare room
and prefers moves between clusters that are close in the partitioning hierarchy. This approach
minimizes the packing of unrelated logic into a single cluster, creating a clustered netlist for the
placement step that has better locality. PPack2 is able to reduce wiring by 35% and circuit delay
by 11% versus T-VPack, without increasing the cluster count when clusters have no legality con-
straints other than a limit on the number of logic cells they can contain. When clusters have lim-
its on the number and type of signals they can accommodate, PPack2 requires a postprocessing
step to produce legal clusters and this can moderately increase the cluster count versus T-VPack.
The AAPack algorithm of [98] uses the T-VPack approach to clustering but adds much more
complex legality checking to ensure the clusters created are legal and routable when the con-
nectivity within a cluster is limited by the architecture. Using these complex legality checkers,
AAPack can cluster not only logic cells into logic blocks but also other circuit primitives such as
RAM slices and basic multipliers into RAM and DSP blocks, respectively. AAPack uses a variety
of heuristics to check if a group of primitives are routable in a cluster; usually, fast heuristics pro-
vide an answer, but if they fail, a slower routing algorithm checks the cluster legality.
Lamoureux [84] developed a power-aware modification of T-VPack that adds a term to the
attraction function of Equation 16.1 such that LEs connected to the current cluster by connec-
tions with a high rate of switching have a larger attraction to the cluster. This favors the absorp-
tion of nets that frequently switch logic states, resulting in lower capacitance for these nets and
lower dynamic power.

[Link] PLACEMENT

Simulated annealing is the most widely used placement algorithm for FPGAs due to its ability
to adapt to different FPGA architectures and optimization goals. However, the growth in FPGA
design size has outpaced the improvement in CPU speeds in recent years, and this has created
a need to speed up placement by using multiple CPUs in parallel, incorporating new heuristics
within an annealing framework, or by using other algorithms to create a coarse or starting place-
ment that is usually then refined by an annealer.


P = InitialPlacement ();
T = InitialTemperature ();
while (ExitCriterion () == False) {
    while (InnerLoopCriterion () == False) {   /* "Inner loop" */
        Pnew = PerturbPlacementViaMove (P);
        ∆Cost = Cost (Pnew) − Cost (P);
        r = random (0,1);
        if (r < e−∆Cost/T) {
            P = Pnew;                          /* Move accepted */
        }
    }                                          /* End "Inner loop" */
    T = UpdateTemp (T);
}

FIGURE 16.16 Pseudocode of a generic simulated annealing placement algorithm.

Figure 16.16 shows the basic flow of simulated annealing. An initial placement is generated,
and a placement perturbation is proposed by a move generator, generally by moving a small num-
ber of circuit elements to new locations. A cost function is used to evaluate the impact of each
proposed move. Moves that reduce cost are always accepted or applied to the placement, while
those that increase cost are accepted with probability e−(Δcost/T), where T is the current temperature.
Temperature starts at a high level and gradually decreases throughout the anneal, according to
the annealing schedule. The annealing schedule also controls how many moves are performed
between temperature updates and when the ExitCriterion that terminates the anneal is met.
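The loop of Figure 16.16 can be rendered as a small executable sketch. The swap move, the toy one-net cost function, and the geometric cooling schedule below are illustrative assumptions, not the choices of any production placer.

```python
import math
import random

def anneal(cost, perturb, initial, t0=10.0, alpha=0.9, inner=50, t_min=0.01, seed=0):
    """Generic simulated annealing loop (cf. Figure 16.16)."""
    rng = random.Random(seed)
    p, t = initial, t0
    while t > t_min:                      # ExitCriterion
        for _ in range(inner):            # InnerLoopCriterion
            p_new = perturb(p, rng)
            d_cost = cost(p_new) - cost(p)
            # Downhill moves always accepted; uphill with prob e^(-dCost/T).
            if d_cost < 0 or rng.random() < math.exp(-d_cost / t):
                p = p_new                 # move accepted
        t *= alpha                        # UpdateTemp (geometric cooling)
    return p

# Toy use: place 4 blocks on a 1-D strip to minimize the span of one net.
def swap_move(placement, rng):
    p = list(placement)
    i, j = rng.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]
    return p

def net_span(placement):
    # A single net connects blocks 0 and 1; cost is their slot distance.
    return abs(placement.index(0) - placement.index(1))

best = anneal(net_span, swap_move, [0, 2, 3, 1])
```

At low temperatures the exponential acceptance test makes the loop effectively greedy, which is why the schedule ends with a near-zero temperature refinement.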
There are two key strengths of simulated annealing that many other approaches lack:

1. It is possible to enforce all the legality constraints imposed by the FPGA architecture in
a fairly direct manner. The two basic techniques are either to forbid the creation of illegal
placements in the move generator or to add a penalty cost to illegal placements.
2. By creating an appropriate cost function, it is possible to directly model the impact of the
FPGA routing architecture on circuit delay and routing congestion.

VPR [1,93,94] contains a timing-driven simulated annealing placement algorithm as well as tim-
ing-driven routing. The VPR placement algorithm is usually used in conjunction with T-VPack
or AAPack, which preclusters the LEs into legal logic blocks. The placement annealing schedule
is based on the monitoring statistics generated during the anneal, such as the fraction of pro-
posed moves that are accepted. This adaptive annealing schedule lets VPR automatically adjust
to different FPGA architectures. VPR’s cost function also automatically adapts to different FPGA
architectures [94]:

(16.2) Cost = (1 − λ) × Σ_{i ∈ AllNets} q(i) × [ bbx(i)/Cav,x(i) + bby(i)/Cav,y(i) ]
              + λ × Σ_{j ∈ AllConnections} criticality(j) × delay(j)

The first term in Equation 16.2 causes the placement algorithm to optimize an estimate of the
routed wirelength, normalized to the average wiring supply in each region of the FPGA; see
Figure 16.17 for an example computation. The wirelength needed to route each net i is estimated
as the sum of the x- and y-directed (bbx and bby) span of the bounding box that just encloses
all the terminals of the net, multiplied by a fanout-based correction factor, q(i). Figure 16.17b
shows that for higher-fanout nets, the half perimeter of the net bounding box underestimates
wiring; multiplying by q(i) helps correct this bias. Figure 16.17b also shows that Equation 16.2
tends to underestimate wiring when an FPGA contains longer wiring segments, as some of the
wiring segment may extend beyond what is needed to route a net. So long as an FPGA contains
at least some short-wiring segments, however, Equation 16.2 is usually sufficiently accurate to
guide the placement algorithm. In FPGAs with differing amounts of routing available in differ-
ent regions or directions, it is beneficial to move wiring demand to the more routing-rich areas,
so the estimated wiring required is divided by the average routing capacity over the bounding
box in the appropriate direction (Cav,x and Cav,y).
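The first (wirelength) term of Equation 16.2 can be computed directly from block positions. This sketch assumes a simple net-to-blocks representation and treats the fanout correction q(i) as an optional per-net table, defaulting to 1.0.

```python
def wiring_cost(nets, positions, cav_x, cav_y, q=None):
    """First term of Equation 16.2 for a placement.

    nets: net -> list of block names; positions: block -> (x, y);
    cav_x/cav_y: net -> average channel capacity over its bounding box;
    q: optional fanout correction factor per net (assumed 1.0 here).
    """
    total = 0.0
    for net, blocks in nets.items():
        xs = [positions[b][0] for b in blocks]
        ys = [positions[b][1] for b in blocks]
        bb_x = max(xs) - min(xs)          # x-span of the bounding box
        bb_y = max(ys) - min(ys)          # y-span of the bounding box
        qi = 1.0 if q is None else q[net]
        total += qi * (bb_x / cav_x[net] + bb_y / cav_y[net])
    return total
```

The example values below match Figure 16.17a: a net spanning 5 columns and 4 rows, with average channel capacities of 180 and 120 wires.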



FIGURE 16.17 (a) Estimating the wiring cost of a net by half-perimeter bounding box and
(b) best-case routing of the same net on an FPGA with all length four wires.

The second term in Equation 16.2 optimizes timing by favoring placements in which
timing-critical connections have the potential to be routed with low delay. To evaluate the
second term quickly, VPR needs to be able to quickly estimate the delay of a connection. To
accomplish this, VPR assumes that the delay is a function only of the difference in the coor-
dinates of a connection’s endpoints, (Δx, Δy), and invokes the VPR router with each possible
(Δx, Δy) to determine a table of delays versus (Δx, Δy) for the current FPGA architecture
before the simulated annealing algorithm begins. The criticality of each connection in the
design is determined via periodic timing analysis using delays computed from the current
placement.
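The table-driven delay model and the criticality-weighted timing term (the second term of Equation 16.2) can be sketched as follows. The monotone delay function standing in for router-computed delays is a placeholder assumption.

```python
def build_delay_table(width, height, route_delay):
    """Precompute delay versus (dx, dy), as VPR does before annealing.

    route_delay(dx, dy) stands in for invoking the router; any
    monotone delay model can be plugged in for the sketch.
    """
    return {(dx, dy): route_delay(dx, dy)
            for dx in range(width) for dy in range(height)}

def timing_cost(connections, positions, delay_table, criticality):
    """Second term of Equation 16.2: sum of criticality(j) * delay(j)."""
    total = 0.0
    for src, dst in connections:
        dx = abs(positions[src][0] - positions[dst][0])
        dy = abs(positions[src][1] - positions[dst][1])
        total += criticality[(src, dst)] * delay_table[(dx, dy)]
    return total
```

Because the table is indexed only by (Δx, Δy), each move's timing cost update is a constant-time lookup rather than a routing run.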
Many enhancements have been made to the original VPR algorithm. The PATH algorithm
from Kong [99] uses a new timing criticality formulation in which the timing criticality of a
connection is a function of the slacks of all paths passing through it, rather than just a func-
tion of the worst-case (smallest) slack of any path through that connection. This technique
significantly improves timing optimization and results in circuits with 15% smaller critical


path delay on average. Lin [100] models unknown placement and routing delays as statisti-
cal process variation to make decisions during placement that maximize the probability of
improving the circuit speed.
The SCPlace algorithm [101] enhances VPR so that a portion of the moves are fragment
moves in which a single logic cell is moved instead of an entire logic block. This allows
the placement algorithm to modify the initial clustering, and it improves both circuit timing and wirelength. Lamoureux [84] modified VPR's cost function by adding a third term,
PowerCost, to Equation 16.2:

(16.3) PowerCost = Σ_{i ∈ AllNets} q(i) × [ bbx(i) + bby(i) ] × activity(i)

where activity(i) represents the average number of times net i transitions per second. This addi-
tional cost function term reduces circuit power, although the gains are less than those obtained
by power-aware clustering.
Independence [102] is an FPGA placement tool that can effectively target a very wide variety
of FPGA routing architectures by directly evaluating (rather than heuristically estimating) the
routability of each placement generated during the anneal. It is purely routability driven, and
its cost function monitors both the amount of wiring used by the placement and the routing
congestion:

(16.4) Cost = Σ_{i ∈ Nets} RoutingResources(i) + λ × Σ_{k ∈ RoutingResources} max(occupancy(k) − capacity(k), 0)

The λ parameter in Equation 16.4 is a heuristic weighting factor. Independence uses the
Pathfinder routing algorithm [127,128] (discussed in detail in Section 16.4.3) to find new
routes for all affected nets after each move and allows wire congestion by routing two nets on
the same routing resource. Such a routing is not legal. However, by summing the overuse of all
the routing resources in the FPGA, Independence can directly monitor the amount of routing
congestion implicit in the current placement. The Independence cost function monitors not
only routing congestion but also the total number of routing resources (wires and block inputs/
outputs) used by the router to create a smoother cost function that is easier for the annealer to
optimize. Independence produces high-quality results on a wide variety of FPGA architectures
but requires very high CPU time.

FIGURE 16.18 Typical recursive partitioning sequence for placement. The critical net will force the
right terminal to be in the bottom partition when the design is partitioned along cut line 9.
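Equation 16.4 itself is straightforward to evaluate once each net's routing resources are known; the following sketch assumes simple dictionary representations of the routes, occupancies, and capacities.

```python
def independence_cost(net_resources, occupancy, capacity, lam=0.5):
    """Equation 16.4: total resource usage plus weighted overuse.

    net_resources: net -> set of routing-resource ids used by its route;
    occupancy/capacity: resource id -> int; lam is the heuristic weight.
    """
    usage = sum(len(r) for r in net_resources.values())
    overuse = sum(max(occupancy[k] - capacity[k], 0) for k in occupancy)
    return usage + lam * overuse
```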
Sankar and Rose [103] seek the opposite trade-off of reduced quality for extremely low run-
times. They create a hierarchical annealer that clusters the logic blocks twice to reduce the size
of the placement problem, as shown in Figure 16.19. They first group logic blocks into level-one
clusters of approximately 64 logic blocks and then cluster four of these level-one clusters into
each level-two cluster. The level-two clusters are placed with a greedy (temperature = 0) anneal
seeded by a fast constructive initial placement. Next, each level-one cluster is initially placed within
the boundary of the level-two cluster that contained it, and another temperature = 0 anneal is
performed. Finally, the placement of each logic block is refined with a low-starting-temperature
anneal. For very fast CPU times, this algorithm significantly outperforms VPR in terms of
achieved wirelength, while for longer permissible CPU times, it lags VPR.
Maidee [104] also seeks reduced placement runtime but does so by creating a coarse placement
using recursive bipartitioning of a circuit netlist into smaller squares of the physical FPGA, as
shown in Figure 16.18, followed by a low-starting-temperature anneal to refine the placement. This
algorithm also includes modifications to the partitioning algorithm to improve circuit speed. As
recursive partitioning proceeds, the algorithm records the minimum length each net could achieve,
given the current number of partitioning boundaries it crosses and the FPGA routing architec-
ture. Timing-critical connections to terminals outside of the region being partitioned act as anchor
points during each partitioning. This forces the other end of the connection to be allocated to the
partition that allows the critical connection to be made short, as shown in Figure 16.18. Once partitioning has proceeded to the point that each region contains only a few cells, the placement is
fine-tuned by a low-temperature anneal with VPR. This step allows blocks to move anywhere in the
device, so early placement decisions made by the partitioner, when little information about the criti-
cal paths or the final wirelength of each net was available, can be reversed. The technique achieves
wirelength and speed comparable to VPR, with significantly reduced CPU time.
Analytic techniques are another approach to create a coarse placement. Analytic algorithms
are based on creating a continuous and differentiable function of a placement that approximates
routed wirelength. Efficient numerical techniques are used to find the global minimum of this
function, and if the function approximates wirelength well, this solution is a placement with
good wirelength. However, the global minimum is usually an illegal placement with overlapping
blocks, so constraints and heuristics must be applied to guide the algorithm to a legal solution.
Analytic approaches have been very popular for ASIC placement but have been less widely used
for FPGAs, likely due to the more difficult legality constraints and the fact that delay is a function
of not just wirelength but also the FPGA routing architecture. Analytic approaches scale well to
very large problems, however, and this has resulted in increased interest in their use for FPGAs
in recent years.

FIGURE 16.19 Hierarchical annealing placement algorithm: (a) multilevel clustering; (b) place large
clusters, then uncluster and refine the placement.
Gort and Anderson [105] develop the HeAP algorithm by adapting the SimPL [106] analytic
placement algorithm to target heterogeneous FPGAs that contain RAM, DSP, and logic blocks.
HeAP approximates the half perimeter of the bounding box enclosing each net with a smooth
function and minimizes the sum of these wirelengths by solving a matrix equation to determine
the (x,y) location of each block. This solution places more blocks of a certain type in some regions
than the chip can accommodate. HeAP spreads the blocks out while maintaining their rela-
tive positions as much as possible, adds new terms to the matrix equation to reflect the desired
spreading, and solves again. This solve/spread procedure iterates many times until the placement
converges, and the authors of HeAP found that controlling which blocks are movable in each
iteration is very important in heterogeneous FPGAs. Allowing all blocks to be placed by the solver
and then spread simultaneously can result in blocks of different types (e.g., RAM and logic) that
were solved to be close together moving in different directions during spreading. Better perfor-
mance is achieved when some iterations of solve/spread place only one type of block (e.g., RAM
blocks only) with other blocks being kept in their prior locations. HeAP provides better results
at low CPU times than the commercial Quartus II placer, while the simulated annealing–based
Quartus II placer produces higher quality results when longer CPU times are permitted.
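To see why analytic formulations scale, note that minimizing a quadratic wirelength objective places each movable block at the average position of its neighbors, which an iterative solver can exploit without ever forming the full matrix. The one-dimensional Gauss–Seidel sketch below is illustrative only; it is not HeAP's actual solver or spreading step.

```python
def quadratic_place_1d(edges, fixed, movable, iters=200):
    """Minimize sum of (x_u - x_v)^2 over edges by Gauss-Seidel iteration.

    edges: list of (u, v) block pairs; fixed: block -> x position
    (e.g., I/O pads acting as anchors); movable: blocks to place.
    """
    x = dict(fixed)
    for b in movable:
        x[b] = 0.0  # arbitrary starting position
    nbrs = {b: [] for b in movable}
    for u, v in edges:
        if u in nbrs:
            nbrs[u].append(v)
        if v in nbrs:
            nbrs[v].append(u)
    for _ in range(iters):
        for b in movable:
            if nbrs[b]:
                # Optimal position given fixed neighbors: their average.
                x[b] = sum(x[n] for n in nbrs[b]) / len(nbrs[b])
    return x
```

With anchors at 0 and 3 and a chain of two movable blocks between them, the iteration converges to the evenly spread minimum (positions 1 and 2); without anchors, the unconstrained minimum would collapse all blocks to one point, which is why spreading heuristics are needed.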
HeAP uses an iterative and greedy swap algorithm to refine the final placement; essentially,
this is a temperature = 0 anneal. The Synopsys commercial CAD tools also combine analytic
placement with annealing for final refinement [107] but still allow some hill climbing during the
anneal.
Another approach to speeding up placement is to leverage multiple CPUs working in parallel.
The commercial Quartus II algorithm speeds up annealing in two ways [108]: first by using
directed moves that explore the search space more productively than purely random moves and
second by evaluating multiple moves in parallel. Evaluating moves in parallel yields a speedup of
2.4× versus a serial algorithm, without compromising quality. This parallel placer also maintains
determinism (obtains the same results for every run of a certain input problem) by detecting
when two moves would access the same blocks or nets (termed a “collision”) and aborting and
later retrying the move that would have occurred later in a serial program. An et al. [109] paral-
lelized the simpler VPR annealer by prechecking if multiple moves would interact and evaluating
them in parallel when they would not—they achieved speedups of 5× while maintaining deter-
minism and 34× with a nondeterministic algorithm, with negligible impact on result quality.
Goeders [110] took an alternative approach of changing the moves and cost function to guarantee
that multiple CPUs could each optimize a different region in parallel without ever making con-
flicting changes to the placement. Their approach is deterministic and achieves 51× speedup over
serial VPR, but at a cost of a 10% wirelength increase and a 5% circuit slowdown.
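The collision test that keeps parallel move evaluation deterministic can be sketched as follows; representing a move as the set of blocks it relocates is a simplifying assumption of the sketch.

```python
def moves_collide(move_a, move_b, nets_of):
    """Return True if two proposed moves touch a common block or net.

    A move is the set of block names it relocates; nets_of maps a block
    to the set of nets incident to it. Colliding moves cannot be
    evaluated in parallel without risking nondeterministic results.
    """
    if move_a & move_b:               # both moves relocate the same block
        return True
    nets_a = set().union(*(nets_of[b] for b in move_a))
    nets_b = set().union(*(nets_of[b] for b in move_b))
    return bool(nets_a & nets_b)      # both would re-evaluate a shared net
```

In a deterministic parallel placer, the move that would have occurred later in a serial schedule is aborted and retried when this test fires.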

16.4.2 PHYSICAL RESYNTHESIS OPTIMIZATIONS

Timing visibility can be poor during FPGA synthesis. What appears to be a noncritical path can
turn out to be critical after placement and routing. This is true for ASICs too, but the problem is
especially acute for FPGAs. Unlike an ASIC implementation, FPGAs have a predefined logic and
routing fabric and hence cannot use drive-strength selection, wire sizing, or buffer insertion to
increase the speed of long routes.
Recently, physical (re)synthesis techniques have arisen both in the literature and in commer-
cial tools. Physical resynthesis techniques for FPGAs generally refer either to the resynthesis of
the netlist once some approximate placement has occurred, and thus, some visibility of timing
exists, or local modifications to the netlist during placement itself. Figure 16.20 highlights the
difference between the two styles of physical resynthesis flow. The iterative flow of Figure 16.20a
iterates between synthesis and physical design. The advantage of this flow is that the synthesis
tool is free to make large-scale changes to the circuit implementation, while its disadvantage is
that the placement and routing of this new design may not match the synthesis tool expectations,
and hence, the loop may not converge well. The incremental flow of Figure 16.20b instead makes



FIGURE 16.20 Physical resynthesis flows: Example of (a) “iterative” and (b) “incremental” physical
synthesis flow.

only more localized changes to the circuit netlist such that it can integrate these changes into the
current placement with only minor perturbations. This flow has the advantage that convergence
is easier, since a legal or near-legal placement is maintained at all times, but it has the disadvan-
tage that it is more difficult to make large-scale changes to the circuit structure.
Commercial tools from Synopsys (formerly Synplicity) [1] follow the iterative flow. They resynthesize
a netlist given output from the FPGA vendor place and route tool and provide constraints to the
place and route tool in subsequent iterations to assist convergence. Lin [111] described a similar
academic flow in which remapping is performed either after a placement estimate or after the
actual placement delays are known. Suaris [112] used timing budgets for resynthesis, where the
budget is calculated using a quick layout of the design. This work also makes modifications to
the netlist to facilitate retiming in the resynthesis step. In a later improvement [113], the flow was
altered to incrementally modify the placement after each netlist transform, assisting convergence.
There are commercial and academic examples of the incremental physical resynthesis flow as
well. Schabas [114] used logic duplication as a postprocessing step at the end of placement, with
an algorithm that simultaneously duplicates logic and finds legal and optimized locations for
the duplicates. Logic duplication, particularly on high-fanout registers, allows significant relax-
ation on placement critical paths because it is common for a multifanout register to be pulled in
multiple directions by its fanouts, as shown in Figure 16.21. Chen [115] integrated duplication
throughout a simulated annealing–based placement algorithm. Before each temperature update,
logic duplicates are created and placed if deemed beneficial to timing, and previously duplicated
logic may be unduplicated if the duplicates are no longer necessary.
Manohararajah [116,117] performed local restructuring of timing-critical logic to shift delay
from the critical path to less critical paths. An incremental placement algorithm then integrates
any changed or added LUTs into a legal placement. Ding [118] gave an algorithm for postplace-
ment pin permutation in LUTs. This algorithm reorders LUT inputs to take advantage of the
fact that each input typically has a different delay. This algorithm also swaps inputs among
several LUTs that form a logic cone in which inputs can be legally swapped, such as an AND-tree
or EXOR-tree. An advantage of this algorithm is that no placement change is required, since only
the ordering of inputs is affected and no new LUTs are created.
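The core of delay-aware pin permutation is a sort-and-match between signal criticalities and pin delays: the most timing-critical input signal gets the fastest physical pin. This sketch omits the logic-cone swapping; the pin delay values in the example are hypothetical.

```python
def permute_lut_pins(signals, signal_criticality, pin_delay):
    """Assign each LUT input signal to a physical pin (sketch of the idea).

    pin_delay lists each physical pin's intrinsic delay; the returned
    dict maps signal -> pin index, most critical signal to fastest pin.
    """
    pins_fastest_first = sorted(range(len(pin_delay)), key=lambda p: pin_delay[p])
    by_criticality = sorted(signals, key=lambda s: -signal_criticality[s])
    return {s: pins_fastest_first[i] for i, s in enumerate(by_criticality)}
```

Because only the input ordering changes (and the LUT mask is permuted to match), no placement or routing update is needed, which is what makes the optimization essentially free.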
Singh and Brown [119] present a postplacement retiming algorithm. This algorithm initially
places added registers and duplicated logic at the same location as the original logic and then



FIGURE 16.21 Duplicating registers to optimize timing in physical resynthesis: (a) Register with
three time-critical output connections. (b) Three register duplicates created and legally placed to
optimize timing.

invokes an incremental placement algorithm to legalize the placement. Chen and Singh [120]
describe an improved incremental placement algorithm that reduces runtime.
In an alternative to retiming, Singh and Brown [121] employed unused PLLs and global clock-
ing networks to create several shifted versions of a clock and developed a postplacement algo-
rithm that selects a time-shifted clock for each register to improve speed. This approach is similar
to retiming after placement but involves shifting clock edges at registers rather than moving reg-
isters across combinational logic. Chao-Yang and Marek-Sadowska [122] extended this beneficial
clock-skew timing optimization to a proposed FPGA architecture, where clocks can be delayed
via programmable delay elements on the global clock distribution networks.

16.4.3 ROUTING

Routing for FPGAs is unique in that it is a purely discrete problem: the wires already exist
as part of the underlying architecture. This section describes historical development and state-
of-the-art FPGA routing algorithms.

[Link] PROBLEM FORMULATION

All FPGA routing consists of prefabricated metal wires and programmable switches to connect
the wires to each other and to the circuit element input and output pins. Figure 16.22 shows an
example of FPGA routing architecture. In this example, each routing channel contains four wires


FIGURE 16.22 Example of an FPGA routing architecture, showing logic blocks, routing wires,
programmable switches between routing wires, and programmable switches from routing wires to
logic block inputs and outputs.

of length 4—wires that span four logic blocks before terminating—and one wire of length 1. In
the example shown in Figure 16.22, the programmable switches allow wires to connect only at
their endpoints, but many FPGA architectures also allow programmable connections from interior points of long wires.
Usually, the wires and the circuit element input and output pins are represented as nodes
in a routing-resource graph, while programmable switches that allow connections to be made
between the wires and pins become directed edges. Programmable switches can be fabricated
as pass transistors, tri-state buffers, or multiplexers. Multiplexers are the dominant form of programmable interconnect in recent FPGAs due to a superior area-delay product [123]. Figure 16.23
shows how a small portion of an FPGA’s routing is transformed into a routing-resource graph.
This graph can also efficiently store information on which pins are logically equivalent and
hence may be swapped by the router, by including source and sink nodes that connect to all the
pins that can perform a desired function. It is common to have many logically equivalent pins
in commercial FPGAs—for example, all the inputs to an LUT are logically equivalent and may
be swapped by the router. A legal routing of a design consists of a tree of routing-resource nodes
for each net in the design such that (1) each tree electrically connects the net source to all the
net sinks and (2) no two trees contain the same node, as that would imply a short between two
signal nets.
Since the number of routing wires in an FPGA is limited and the limited number of program-
mable switches also creates many constraints on which wires can be connected to each other,
congestion detection and avoidance is a key feature of FPGA routers. Also since most delay in
FPGAs is due to the programmable routing, timing-driven routing is important to obtain the
best speed.
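The two conditions defining a legal routing can be checked directly against the routing-resource graph; this sketch assumes each net's route tree is stored as parent pointers, with the net source mapped to None.

```python
def legal_routing(route_trees, graph):
    """Check the two legality conditions for a set of routed nets.

    route_trees: net -> {node: parent_node}, with the net's source
    mapped to None. graph: node -> set of fanout nodes (the
    routing-resource graph; programmable switches are the edges).
    """
    used = set()
    for net, tree in route_trees.items():
        for node, parent in tree.items():
            # Condition (2): no routing resource may appear in two trees,
            # as that would short two signal nets together.
            if node in used:
                return False
            used.add(node)
            # Condition (1): each tree edge must correspond to an actual
            # programmable switch in the routing-resource graph.
            if parent is not None and node not in graph[parent]:
                return False
    return True
```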

[Link] TWO-STEP ROUTING

Some FPGA routers operate in two sequential phases, as shown in Figure 16.24. First, a global
route for each net in the design is determined, using channeled global routing algorithms that are
essentially the same as those for ASICs. The output of this stage is the series of channel segments
through which each connection should pass. Next, a detailed router is invoked to determine
exactly which wire segment should be used within each channel segment. The SEGA [124] algo-
rithm finds detailed routes by employing different levels of effort in searching the routing graph.



FIGURE 16.23 Transforming FPGA routing circuitry to a routing-resource graph: (a) Example of
FPGA routing circuitry. (b) Equivalent routing-resource graph.

A search of only a few routing options is conducted first in order to quickly find detailed routes
for nets that are in uncongested regions, while a more exhaustive search is employed for nets
experiencing routing difficulty. An alternative approach by Nam formulates the FPGA detailed
routing problem as a Boolean satisfiability problem [125]. This approach guarantees that a legal
detailed routing (which obeys the current global routing) will be found if one exists but can take
high CPU time.
The divide-and-conquer approach of two-step routing reduces the problem space for both the
global and detailed routers, helping to keep their CPU times down. However, the flexibility loss
of dividing the routing problem into two phases in this way can result in significantly reduced
result quality. The global router optimizes only the wirelength of each route and attempts to con-
trol congestion by trying to keep the number of nets assigned to a channel segment comfortably
below the number of routing wires in that channel segment. However, the fact that FPGA wiring
is prefabricated and can be interconnected only in limited patterns makes the global router’s view
of both wirelength and congestion inaccurate. For example, a global route one logic block long
may require the detailed router to use a wire that is four logic blocks long to actually complete
the connection, thereby wasting wire and increasing delay. Figure 16.25 highlights this behavior;
the global route requires 9 units of wire, but the final wires used in the detailed routing of the net
are 13 wiring units long in total. Similarly, a global route where the number of nets assigned to
each channel segment is well below the capacity of each segment may still fail detailed routing
because the wiring patterns (i.e., the limited connectivity between wires) may not permit this
pattern of global routes.



FIGURE 16.24 Two-step FPGA routing flow: (a) Step one, global routing chooses a set of channel
segments for a net. (b) Step two, detailed routing wires within each channel segment and switches
to connect them.


FIGURE 16.25 Routing Cost Valleys routing delay cost compared to Pathfinder routing delay cost.


[Link] SINGLE-STEP ROUTERS

Most modern FPGA routers are single-step routers that find routing paths through the routing-
resource graph in a single unified search algorithm. Most such routers use some variant of a maze
router [126] as their inner loop—a maze router uses Dijkstra’s algorithm to search through the
routing-resource graph and find a low-cost path to connect two terminals of a net. Single-step
FPGA routers differ primarily in their costing of various routing alternatives and their congestion
resolution techniques.
The Pathfinder algorithm by McMurchie and Ebeling [127] introduced the concept of negoti-
ated congestion routing, which now underlies many academic and commercial FPGA routers.
In a negotiated congestion router, each connection is initially routed to minimize some metric,
such as delay or wirelength, with little regard to congestion, or overuse of routing resources.
After each routing iteration, in which every net in the circuit is ripped up and rerouted, the cost
of congestion is increased such that it is less likely that overused nodes will occur in the next
routing iteration. Over the course of many routing iterations, the increasing cost of conges-
tion gradually forces some nets to accept suboptimal routing in order to resolve congestion and
achieve a legal routing.
The congestion cost of a node is

(16.5) CongestionCost(n) = [b(n) + h(n)] × p(n)

where
b(n) is the base cost of the node
p(n) is the present congestion of the node
h(n) is the historical cost of the node

The base cost of a node could be its intrinsic delay, its length, or simply 1 for all nodes. The pres-
ent congestion cost of a node is a function of the overuse of the node and the routing iteration.
For nodes that are not currently overused, p(n) is one. In early routing iterations, p(n) will be
only slightly higher than 1 for nodes that are overused, while in later routing iterations, to ensure
congestion is resolved, p(n) becomes very large for overused nodes. h(n) maintains a congestion
history for each node. h(n) is initially 0 for all nodes but is increased by the amount of overuse on
node n at the end of each routing iteration. The incorporation of not just the present congestion
but also the entire history of congestion of a node, into the cost of that node, is a key innovation
of negotiated congestion. Historical congestion ensures that nets that are trapped in a situation
where all their routing choices have present congestion can see which choices have been overused
the most in the past. Exploring the least historically congested choices ensures new portions of
the solution space are being explored and resolves many cases of congestion that the present con-
gestion cost term alone cannot resolve.
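As a concrete illustration, the congestion cost of Equation 16.5 and its end-of-iteration update can be sketched as follows. The growth schedule for the present-congestion multiplier (`pres_fac`) and the history gain are illustrative assumptions, not the published constants of any particular router.

```python
# Sketch of negotiated congestion (Equation 16.5):
#   CongestionCost(n) = [b(n) + h(n)] * p(n)
# The pres_fac growth and history gain below are illustrative choices.

def congestion_cost(base_cost, occupancy, capacity, history, pres_fac):
    """Cost of routing-resource node n for one more connection."""
    overuse = occupancy + 1 - capacity        # overuse if this route also uses n
    p = 1.0 if overuse <= 0 else 1.0 + overuse * pres_fac
    return (base_cost + history) * p

def end_of_iteration_update(nodes, pres_fac, pres_fac_mult=1.5, hist_gain=1.0):
    """After ripping up and rerouting every net, sharpen the congestion costs."""
    for n in nodes:
        overuse = n.occupancy - n.capacity
        if overuse > 0:
            n.history += hist_gain * overuse  # h(n) accumulates all past overuse
    return pres_fac * pres_fac_mult           # p(n) grows each iteration
```

Because `history` only ever grows, nodes that were congested in earlier iterations stay expensive even after their present congestion clears, which is the mechanism that steers trapped nets toward less historically contested resources.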
In the Pathfinder algorithm, the complete cost of using a routing-resource node n in the rout-
ing of a connection c is

(16.6) Cost(n) = [1 - Crit(c)] × CongestionCost(n) + Crit(c) × Delay(n)

The criticality is one minus the ratio of the connection slack to the longest delay in the circuit:

(16.7) Crit(c) = 1 - Slack(c)/Dmax

The total cost of a routing-resource node is therefore a weighted sum of its congestion cost and
its delay, with the weighting being determined by the timing criticality of the connection being
routed. This formulation results in the most timing-critical connections receiving delay-optimized
routes, with non-timing-critical connections using routes optimized for minimal wirelength
and congestion. Since timing-critical connections see less cost from congestion, these connec-
tions are also less likely to be forced off their optimal routing paths due to congestion—instead,
non-timing-critical connections will be moved out of the way of timing-critical connections.
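Equations 16.6 and 16.7 combine into a small cost computation, sketched below. The clamp of the criticality to [0, 1] is an assumed implementation convention rather than part of the equations.

```python
# Sketch of the criticality-weighted node cost (Equations 16.6 and 16.7).
# Clamping Crit(c) to [0, 1] is an assumed implementation detail.

def criticality(slack, d_max):
    """Crit(c) = 1 - Slack(c)/Dmax, clamped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - slack / d_max))

def routing_node_cost(congestion_cost, delay, crit):
    """Cost(n) = [1 - Crit(c)] * CongestionCost(n) + Crit(c) * Delay(n)."""
    return (1.0 - crit) * congestion_cost + crit * delay
```

A zero-slack connection has criticality 1 and sees pure delay cost, while a connection whose slack equals Dmax sees pure congestion cost, which is exactly the behavior the paragraph above describes.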


The VPR router [1,93] is based on the Pathfinder algorithm but introduces several enhance-
ments. The most significant enhancement is that instead of using a breadth-first search or an A*
search through the routing-resource graph to determine good routes, VPR uses a more aggressive
directed search technique. This directed search sorts each routing-resource node, n, found dur-
ing graph search toward a sink, j, by a total cost given by

(16.8) TotalCost(n) = PathCost(n) + α × ExpectedCost(n, j)

Here, PathCost(n) is the known cost of the routing path from the connection source to node n,
while ExpectedCost(n, j) is a prediction of the remaining cost that will be incurred in completing
the route from node n to the target sink. The directedness of the search is controlled by α. An α
of 0 results in a breadth-first search of the graph, while α larger than 1 makes the search more
efficient but may result in suboptimal routes. An α of 1.2 leads to improved CPU time without a
noticeable reduction in result quality.
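The wavefront ordering of Equation 16.8 can be sketched with a priority queue. The graph representation, the cost and expected-cost callbacks, and the termination handling below are illustrative simplifications, not VPR's actual data structures.

```python
# Sketch of a directed search (Equation 16.8): nodes on the wavefront are
# ordered by PathCost + alpha * ExpectedCost. With alpha = 0 this degenerates
# to Dijkstra's algorithm; alpha > 1 is faster but may return suboptimal paths.

import heapq

def directed_route(graph, cost, expected, source, sink, alpha=1.2):
    """graph: {node: [neighbors]}; cost(m): cost of entering node m;
    expected(m, sink): estimate of the remaining cost from m to the sink."""
    frontier = [(alpha * expected(source, sink), 0.0, source, [source])]
    best = {}                                  # cheapest known PathCost per node
    while frontier:
        _, path_cost, n, path = heapq.heappop(frontier)
        if n == sink:
            return path
        if best.get(n, float("inf")) <= path_cost:
            continue                           # already expanded more cheaply
        best[n] = path_cost
        for m in graph[n]:
            pc = path_cost + cost(m)
            heapq.heappush(
                frontier, (pc + alpha * expected(m, sink), pc, m, path + [m]))
    return None                                # sink unreachable
```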
An FPGA router based on negotiated congestion but designed for very low CPU times is presented by Swartz [129]. It achieves its speed through an aggressive directed search during routing-graph exploration and a binning technique that accelerates the routing of high-fanout nets. When routing the kth terminal of a net, most algorithms begin
the route toward terminal k by considering every routing-resource node used in routing the pre-
vious k − 1 terminals. For a k-terminal net, this results in an O(k2) algorithm, which becomes
slow for large k. By examining only the portions of the routing of the previous terminals that lie
within a geographical bin near the sink for connection k, the algorithm achieves a significant CPU
reduction.
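The binning idea can be sketched as a filter on the search's start nodes. The coordinate model, bin shape, and fallback behavior below are illustrative assumptions, not the exact scheme of [129].

```python
# Sketch of binned start-node selection for high-fanout nets: when routing
# toward a new sink, seed the search only with already-routed nodes that fall
# in a bin around that sink, instead of all O(k) nodes of the partial route.

def bin_start_nodes(routed_nodes, sink_xy, radius):
    """routed_nodes: iterable of (x, y) positions already used by this net's
    partial route; sink_xy: position of the next terminal to reach."""
    sx, sy = sink_xy
    binned = [(x, y) for (x, y) in routed_nodes
              if abs(x - sx) <= radius and abs(y - sy) <= radius]
    # Fall back to the whole partial route if the bin happens to be empty.
    return binned if binned else list(routed_nodes)
```

Since each of the k searches now starts from a bounded number of nodes instead of O(k), the overall per-net work drops from O(k²) toward O(k) for geographically spread nets.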
Wilton developed a cross-talk-aware FPGA routing algorithm [130]. This algorithm enhances
the VPR router by adding an additional term to the routing cost function that penalizes routes in
proportion to the amount of delay they will add to neighboring routes due to cross talk, weighted
by the timing criticality of those neighboring routes. Hence, this router achieves a circuit speedup
by leaving routing tracks near those used by critical connections vacant.
Lamoureux [84] enhanced the VPR router to optimize power by adding a term to the routing node cost, Equation 16.6, which includes the capacitance of a routing node multiplied by the
switching activity of the net being routed. This drives the router to achieve low-energy routes for
rapidly toggling nets.
The Routing Cost Valleys (RCV) algorithm [131] combines negotiated congestion with a new
routing cost function and an enhanced slack allocation algorithm. RCV is the first FPGA routing
algorithm that not only optimizes long-path timing constraints, which specify that the delay on
a path must be less than some value, but also addresses the increasing importance of short-path
timing constraints, which specify that the delay on a path must be greater than some value.
Short-path timing constraints arise in FPGA designs as a consequence of hold time constraints
within the FPGA or of system-level hold time constraints on FPGA input pins and system-level
minimum clock-to-output constraints on FPGA output pins. As with ASIC design, increasing
process variation in clock trees can result in the need to address hold time in addition to setup.
To meet short-path timing constraints, RCV will intentionally use slow or circuitous routes to
increase the delay of a connection and guarantee minimum delays when required.
RCV allocates both short-path and long-path slack to determine a pair of delay budgets,
D Budget,Min(c) and D Budget,Max(c), for each connection, c, in the circuit. A routing of the circuit in
which every connection has a delay between D Budget,Min(c) and D Budget,Max(c) will satisfy all the long-
path and short-path timing constraints. Such a routing may not exist for all connections, however, so, where possible, it is desirable for connection delays to lie at the middle of the delay window defined by D Budget,Min(c) and D Budget,Max(c), referred to as D Target(c). The extra timing margin
achieved by connection c may allow another connection on the same path to have a delay outside
its delay budget window, without violating any of the path-based timing constraints. Figure 16.25
shows the form of the RCV routing cost function compared to that of the original Pathfinder
algorithm. RCV strongly penalizes routes that have delays outside the delay budget window and
weakly guides routes to achieve D Target. RCV achieves superior results on short-path timing con-
straints and also outperforms Pathfinder in optimizing traditional long-path timing constraints.
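One way to realize the valley-shaped delay cost of Figure 16.25 is sketched below. The strong and weak penalty weights are illustrative assumptions; only the shape (steep outside the budget window, a gentle pull toward the target inside it) reflects the published idea.

```python
# Sketch of an RCV-style valley cost: delays outside [d_min, d_max] are
# penalized strongly, and within the window the cost weakly pulls routes
# toward the target delay at the middle of the budget window.

def rcv_delay_cost(delay, d_min, d_max, strong=10.0, weak=0.1):
    """delay: candidate route delay; [d_min, d_max]: the connection's
    delay budget window (DBudget,Min and DBudget,Max)."""
    d_target = 0.5 * (d_min + d_max)       # middle of the delay budget window
    if delay < d_min:                      # too fast: short-path violation
        return strong * (d_min - delay) + weak * (d_target - d_min)
    if delay > d_max:                      # too slow: long-path violation
        return strong * (delay - d_max) + weak * (d_max - d_target)
    return weak * abs(delay - d_target)    # inside the window: gentle valley
```

The function is continuous at the window edges and has its minimum at the target delay, so a negotiated congestion router using it will accept a slightly slow or circuitous route when that moves a connection's delay into its budget window.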


Routing is a time-consuming portion of the FPGA CAD flow, motivating work by Gort and
Anderson [132] to reduce the CPU time of negotiated congestion routing. They first find that rerouting every net in each routing iteration is wasteful. By rerouting only those nets that
are illegally routed (use some congested resources), they achieve a 3× speedup versus VPR. The
authors then achieve a further 2.3× speedup by routing multiple nets in parallel when the bounding boxes of those nets do not overlap and, hence, routings constrained to lie within the net-terminal bounding boxes will not interact. This algorithm remains deterministic and achieves
the same result quality as VPR.
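The non-overlap condition that enables this parallelism can be sketched as a simple batching step. The greedy batch-assignment policy below is an illustrative scheduling choice, not the scheduler of [132].

```python
# Sketch of the parallel-routing condition: two nets may be routed
# concurrently when their terminal bounding boxes do not overlap, since each
# route is constrained to stay inside its own net's box.

def boxes_overlap(a, b):
    """Boxes are (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def parallel_batches(net_boxes):
    """Greedily group (net, box) pairs so that no two boxes in a batch
    overlap; each batch can then be routed in parallel."""
    batches = []
    for net, box in net_boxes:
        for batch in batches:
            if all(not boxes_overlap(box, other) for _, other in batch):
                batch.append((net, box))
                break
        else:
            batches.append([(net, box)])
    return batches
```

Because the batch contents depend only on the input order and the boxes, not on thread timing, routing each batch in parallel preserves determinism, which is the property highlighted above.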

16.5 CAD FOR EMERGING ARCHITECTURE FEATURES

As FPGA architectures incorporate new hardware features, new CAD flows are needed to sup-
port them. Two such areas that have attracted significant research interest are new power man-
agement hardware and 2.5D or 3D FPGA systems that are built from multiple silicon dice.

16.5.1 POWER MANAGEMENT

As the previous sections of this chapter have described, power can be saved at every stage of
the FPGA CAD flow by making power-aware decisions concerning a design’s implementation.
However, while an ASIC CAD tool can change the underlying hardware to save power—by making a low-voltage island or using high-VT transistors, for example—an FPGA tool can only work
with the hardware that exists in the FPGA. This has led to several proposals to augment FPGA
hardware with features that enable more advanced power management.
Li et al. investigated several different FPGA architectures where the logic blocks and routing
switches could select from either a high- or low-voltage supply [133]. They found that hardwir-
ing the choice of which logic blocks used high Vdd and which used low Vdd led to poor results,
while adding extra transistors to allow a programmable selection per design worked well. They
augmented the FPGA CAD flow to incorporate a voltage assignment step after placement and
routing and found that they could reduce FPGA power by 48%, at a cost of 18% delay increase due
to the voltage selection switches.
The commercial Stratix III FPGA [134] added programmable back-bias at the tile granularity
to reduce leakage power. A tile is a pair of logic blocks along with all their adjacent routing, and
the nMOS transistors in a tile can use either the conventional body voltage of 0 V for maximum
speed or a negative voltage for lower leakage. After placement and routing, a new CAD step
chooses the body voltage for each tile and seeks to back-bias as many tiles as possible without
violating any timing constraints. Typically, this approach is able to set 80% of the tiles to the low-
leakage state, reducing the FPGA static power by approximately half.
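The tile-selection step can be pictured with a greedy sketch. The path-slack model, the delay penalty per biased tile, and the traversal order below are simplifying assumptions for illustration, not the commercial algorithm.

```python
# Simplified sketch of post-layout back-bias assignment: greedily put a tile
# in the low-leakage state when the added delay still fits within the slack of
# every timing path through that tile, debiting each affected path's slack.

def assign_back_bias(tiles, paths, delay_penalty):
    """tiles: list of tile ids. paths: {path_id: (slack, set_of_tile_ids)}.
    Returns the set of tiles placed in the low-leakage state."""
    slack = {p: s for p, (s, _) in paths.items()}
    on_paths = {t: [p for p, (_, ts) in paths.items() if t in ts]
                for t in tiles}
    biased = set()
    for t in tiles:
        if all(slack[p] >= delay_penalty for p in on_paths[t]):
            for p in on_paths[t]:
                slack[p] -= delay_penalty      # this path is now slower
            biased.add(t)
    return biased
```

The key invariant is that every accepted tile debits the slack of all paths through it before the next tile is considered, so no timing constraint is ever violated by the combination of choices.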
Huda et al. [135] proposed a small change to FPGA routing switch design that enables unused
routing wires to become free-floating capacitors. They then modify the VPR router to prefer rout-
ing rapidly switching signals on wires that are adjacent to free-floating routing wires to reduce
their effective coupling capacitance and hence dynamic power. They see a routing power reduc-
tion of 10%–15% with only a 1% delay increase.

16.5.2 MORE-THAN-2D INTEGRATION

While FPGA capacity has grown for many years in line with Moore’s law, there is an increasing
interest in even larger FPGA systems that combine multiple silicon dice. Xilinx’s Virtex-7 series
FPGAs contain up to four 28 nm FPGA dice connected via microbumps to a 65 nm silicon inter-
poser that creates approximately 10,000 connections between adjacent dice [136], as shown in
Figure 16.26. This system enables very large 2.5D FPGAs with over 2 million logic cells, and the
Xilinx Vivado CAD system allows designers to target it as if it were a single very large FPGA. Hahn
and Betz investigate such systems in [137] and show that with suitable modification to the place-
ment algorithm to understand that wires crossing the interposer are relatively scarce and slow,


[Figure: four FPGA slices connected by micro-bumps to a silicon interposer with TSVs, mounted via C4 bumps on a package substrate with BGA balls]

FIGURE 16.26 2.5D Virtex-7 FPGA built with silicon interposer technology.

the system does indeed behave like a single FPGA. So long as the interposer crossing provides at
least 30% of the vertical wiring within an FPGA, this 2.5D device remains quite routable. Even
more ambitious systems that stack multiple FPGA dice on top of each other and connect them
with through-silicon vias are investigated in [138]. The CAD system of [138] first employs parti-
tioning to assign blocks to different silicon layers and then places blocks within a single die/layer.

16.6 LOOKING FORWARD

In this chapter, we have surveyed the current algorithms for FPGA synthesis, placement, and
routing. Some of the more recent publications in this area point to the growth areas in CAD tools
for FPGAs, specifically HLS [30,33].
Power modeling and optimization algorithms are likely to continue to increase in importance
as power constraints become ever more stringent with shrinking process nodes. Placement and
routing for multi-VDD and other power-reduction modes have been the subject of several publications [83,85–87], but many of these techniques have yet to be fully integrated into commercial tools. A workshop at the FPGA 2012
conference [139] discussed these and other emerging topics such as high-performance comput-
ing acceleration on FPGAs.
Timing modeling for FPGA interconnect will need to take into account variation, multicorner analysis, cross talk, and other physical effects that have been largely ignored to date and incorporate
these into the optimization algorithms. Though this work parallels the development for ASIC
(e.g., PrimeTime-SI™), FPGAs have some unique problems in that the design is not known at
fabrication.
Producing ever-larger designs without lengthening design cycles will require tools with
reduced runtime and will likely also further accelerate the adoption of HLS flows. As FPGAs
incorporate processors, fast memory interfaces, and other system-level features, tool flows that
enable complete embedded system development will also become ever more important.

REFERENCES

INTRODUCTION
1. See www.<companyname>.com for commercial tools and architecture information.
2. Betz, V., Rose, J., and Marquardt, A. Architecture and CAD for Deep-Submicron FPGAs, Kluwer,
February 1999.
3. Lewis, D. et al., Architectural enhancements in Stratix V, in Proceedings of the 21st ACM International
Symposium on FPGAs, 2013, pp. 147–156.
4. Betz, V. and Rose, J., Automatic generation of FPGA routing architectures from high-level descrip-
tions, in Proceedings Seventh International Symposium on FPGAs, 2000, pp. 175–184.
5. Luu, J. et al., VTR 7.0: Next generation architecture and CAD system for FPGAs, ACM Transactions
Reconfigurable Technology and Systems, 7(2), 6:1–6:30, June 2014.
6. Yan, A., Cheng, R., and Wilton, S., On the sensitivity of FPGA architectural conclusions to experi-
mental assumptions, tools and techniques, in Proceedings of the 10th International Symposium on
FPGAs, 2003, pp. 147–156.


7. Li, F., Chen, D., He, L., and Cong, J., Architecture evaluation for power-efficient FPGAs, in Proceedings
of the 11th International Symposium on FPGAs, 2003, pp. 175–184.
8. Wilton, S., SMAP: Heterogeneous technology mapping for FPGAs with embedded memory arrays, in
Proceedings of the Eighth International Symposium on FPGAs, 1998, pp. 171–178.
9. Hutton, M., Karchmer, D., Archell, B., and Govig, J., Efficient static timing analysis and applications
using edge masks, in Proceedings of the 13th International Symposium FPGAs, 2005, pp. 174–183.
10. Poon, K., Yan, A., and Wilton, S.J.E., A flexible power model for FPGAs, ACM Transactions on Design
Automation of Digital Systems, 10(2), 279–302, April 2005.
11. Conference websites: [Link], [Link], [Link], [Link].
12. DeHon, A. and Hauck, S., Reconfigurable Computing, Morgan Kaufman, San Francisco, CA, 2007.
13. Chen, D., Cong, J., and Pan, P., FPGA design automation: A survey, Foundations and Trends in
Electronic Design Automation, 1(3), 195–330, October 2006.
14. Khatri, S., Shenoy, N., Khouja, A., and Giomi, J.C., Logic synthesis, Electronic Design Automation for
IC Implementation, Circuit Design, and Process Technology, L. Lavagno, I.L. Markov, G.E. Martin, and
L.K. Scheffer, eds., Taylor & Francis Group, Boca Raton, FL, 2016.
15. De Micheli, G., Synthesis and Optimization of Digital Circuits, McGraw Hill, New York, 1994.
16. Murgai, R., Brayton, R., and Sangiovanni-Vincentelli, A. Logic Synthesis for Field-Programmable Gate
Arrays, Kluwer, Norwell, MA, 2000.

SYNTHESIS: HLS
17. Berkeley Design Technology Inc., Evaluating FPGAs for communication infrastructure applications,
in Proceedings of the Communications Design Conference, 2003.
18. Lockwood, J., Naufel, N., Turner, J., and Taylor, D., Reprogrammable network packet processing on
the Field-Programmable Port Extender (FPX), in Proceedings of the Ninth International Symposium
FPGAs, 2001, pp. 87–93.
19. Kulkarni, C., Brebner, G., and Schelle, G., Mapping a domain-specific language to a platform FPGA,
in Proceedings of the Design Automation Conference, 2004.
20. Maxeler Technologies, [Link]
21. Impulse Accelerated Technologies, [Link]
22. Auerbach, J., Bacon, D., Cheng, P., and Rabbah, R., Lime: A Java-compatible and synthesizable lan-
guage for heterogeneous architectures, in Object-Oriented Programming, Systems, Languages and
Applications, 2010, pp. 89–108.
23. Grotker, T., Liau, S., and Martin, G., System Design with SystemC, Kluwer, Norwell, MA, 2010.
24. Bluespec Inc., [Link]
25. Cong, J. and Zou, Y., FPGA-based hardware acceleration of lithographic aerial image simulation,
ACM Transactions on Reconfigurable Technology and Systems, 2(3), 17.1–17.29, 2009.
26. United States Bureau of Labor Statistics. Occupational Outlook Handbook 2010–2011 Edition, 2010.
27. Cong, J., Fan, Y., Han, G., Jiang, W., and Zhang, Z., Platform-based behavior-level and system-level
synthesis, in Proceedings of IEEE International SOC Conference, Austin, TX, 2006, pp. 199–202.
28. Xilinx Vivado High-Level Synthesis, [Link]
tion/esl-design/, 2014.
29. Cong, J., Liu, C., Neuendorffer, S., Noguera, J., Vissers, K., and Zhang, Z. High-level synthesis for
FPGAs: From prototyping to deployment, IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 30(4), 473–491, 2011.
30. Chen, D. and Singh, D., Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and
FPGAs as acceleration platforms, in IEEE/ACM Asia and South Pacific Design Automation Conference,
2013, pp. 297–304.
31. Altera SDK for OpenCL, [Link] 2014.
32. Coussy, P., Lhairech-Lebreton, G., Heller, D., and Martin, E., GAUT—A free and open source high-
level synthesis tool, in Proceedings of the IEEE/ACM Design Automation and Test in Europe, University
Booth, 2010.
33. Zheng, H., Shang high-level synthesis, [Link]
34. Villarreal, J., Park, A., Najjar, W., and Halstead, R. Designing modular hardware accelerators in C
with ROCCC 2.0, in IEEE International Symposium on Field-Programmable Custom Computing
Machines, 2010, pp. 127–134.
35. Pilato, C. and Ferrandi, F., Bambu: A free framework for the high-level synthesis of complex applica-
tions, in ACM/IEEE Design Automation and Test in Europe, University Booth, 2012.
36. Nane, R., Sima, V., Olivier, B., Meeuws, R., Yankova, Y., and Bertels, K., DWARV 2.0: A CoSy-based
C-to-VHDL hardware compiler, in International Conference on Field-Programmable Logic and
Applications, 2012, pp. 619–622.


37. Canis, A., Choi, J., Aldham, M., Zhang, V., Kammoona, A., Czajkowski, T., Brown, S., and Anderson,
J. LegUp: Open source high-level synthesis for FPGA-based processor/accelerator systems, ACM
Transactions on Embedded Computing Systems, 13(2), 1:1–1:25, 2013.
38. Lattner, C. and Adve, V., LLVM: A compilation framework for lifelong program analysis & transfor-
mation, in International Symposium on Code Generation and Optimization, 2004, pp. 75–88.
39. Cong, J. and Zhang, Z., An efficient and versatile scheduling algorithm based on SDC formulation, in
IEEE/ACM Design Automation Conference, 2006, pp. 433–438.
40. Huang, C., Che, Y., Lin, Y., and Hsu, Y., Data path allocation based on bipartite weighted matching, in
IEEE/ACM Design Automation Conference, 1990, pp. 499–504.
41. Hadjis, S., Canis, A., Anderson, J., Choi, J., Nam, K., Brown, S., and Czajkowski, T. Impact of FPGA
architecture on resource sharing in high-level synthesis, in ACM/SIGDA International Symposium
on Field Programmable Gate Arrays, 2012, pp. 111–114.
42. Canis, A., Anderson, J., and Brown, S., Multi-pumping for resource reduction in FPGA high-level
synthesis, in IEEE/ACM Design Automation and Test in Europe Conference, 2013, pp. 194–197.
43. Huang, Q., Lian, R., Canis, A., Choi, J., Xi, R., Brown, S., and Anderson, J., The effect of compiler
optimizations on high-level synthesis for FPGAs, in IEEE International Symposium on Field-
Programmable Custom Computing Machines, 2013, pp. 89–96.
44. Choi, J., Anderson, J., and Brown, S., From software threads to parallel hardware in FPGA high-level
synthesis, in IEEE International Conference on Field-Programmable Technology (FPT), 2013,
pp. 270–279.
45. Zheng, H., Swathi T., Gurumani, S., Yang, L., Chen, D., and Rupnow, K., High-level synthesis with
behavioral level multi-cycle path analysis, in International Conference on Field-Programmable Logic
and Applications, 2013.
46. Papakonstantinou, A., Gururaj, K., Stratton, J., Chen, D., Cong, J., and Hwu, W., FCUDA: Enabling
efficient compilation of CUDA kernels onto FPGAs, in Symposium on Application-Specific Processors,
2009, pp. 35–42.
47. Zuo, W., Liang, Y., Li, P., Rupnow, K., Chen, D., and Cong, J., Improving high-level synthesis optimiza-
tion opportunity through polyhedral transformations, in ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, 2013, pp. 9–18.
48. Bastoul, C., Cohen, A., Girbal, S., Sharma, S., and Temam, O, Putting polyhedral loop transforma-
tions to work, in International Workshop on Languages and Compilers for Parallel Computing,
College Station, TX, 2003, pp. 209–225.
49. Cong, J., Huang, M., and Zhang, P., Combining computation and communication optimizations
in system synthesis for streaming applications, in ACM/SIGDA International Symposium on Field
Programmable Gate Arrays, 2014, pp. 213–222.
50. Wang, Y., Li, P., and Cong, J., Theory and algorithm for generalized memory partitioning in high-level
synthesis, in ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2014,
pp. 199–208.
51. Bayliss, S. and Constantinides, G., Optimizing SDRAM bandwidth for custom FPGA loop accelerators,
in ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2012, pp. 195–204.

SYNTHESIS: LOGIC OPTIMIZATION


52. Tinmaung, K. and Tessier, R., Power-aware FPGA logic synthesis using binary decision diagrams, in
Proceedings of the 15th International Symposium on FPGAs, 2007.
53. Metzgen, P. and Nancekievill, D., Multiplexor restructuring for FPGA implementation cost reduc-
tion, in Proceedings of the Design Automation Conference, 2005.
54. Nancekievill, D. and Metzgen, P., Factorizing multiplexors in the datapath to reduce cost in FPGAs,
in Proceedings of the International Workshop on Logic Synthesis, 2005.
55. Hutton, M. et al., Improving FPGA performance and area using an adaptive logic module, in
Proceedings of the 14th International Symposium Field-Programmable Logic, 2004, pp. 134–144.
56. Tessier, R., Betz, V., Neto, D., Egier, A., and Gopalsamy, T., Power-efficient RAM-mapping algorithms
for FPGA embedded memory blocks, IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 26(2), 278–290, February 2007.
57. Sentovich, E. et al., SIS: A system for sequential circuit synthesis, Tech Report No. UCB/ERL M92/41,
UC Berkeley, Berkeley, CA, 1992.
58. Mishchenko, A., Brayton, R., Jiang, J., and Jang, S., DAG-aware AIG re-writing: A fresh look at combinational logic synthesis, in Proceedings of the ACM DAC, 2006, pp. 532–536.
59. Mishchenko, A. et al., ABC—A system for sequential synthesis and verification, [Link]
[Link]/~alanmi/abc/, 2009.


60. Luu, J. et al., On hard adders and carry chains in FPGAs, in International Symposium on Field-
Programmable Custom Computing Machines (FCCM), 2014.
61. Shenoy, N. and Rudell, R., Efficient implementation of retiming, in Proceedings of the International
Conference on CAD (ICCAD), 1994, pp. 226–233.
62. van Antwerpen, B., Hutton, M., Baeckler, G., and Yuan, R., A safe and complete gate-level register
retiming algorithm, in Proceedings of the IWLS, 2003.
63. Yamashita, S., Sawada, H., and Nagoya, A., A new method to express functional permissibilities
for LUT-based FPGAs and its applications, in Proceedings of the International Conference on CAD
(ICCAD), 1996, pp. 254–261.
64. Cong, J., Lin, Y., and Long, W., SPFD-based global re-wiring, in Proceedings of the 10th International
Symposium on FPGAs, 2002, pp. 77–84.
65. Cong, J., Lin, Y., and Long, W., A new enhanced SPFD rewiring algorithm, in Proceedings of the
International Conference on CAD (ICCAD), 2002, pp. 672–678.
66. Vemuri, N., Kalla, P., and Tessier, R., BDD-based logic synthesis for LUT-based FPGAs, ACM
Transactions on Design Automation of Electronic Systems, 7(4), 501–525, 2002.
67. Yang, C., Ciesielski, M., and Singhal, V., BDS: A BDD-based logic optimization system, in Proceedings
of the Design Automation Conference, 2000, pp. 92–97.
68. Lai, Y., Pedram, M., and Vrudhala, S., BDD-based decomposition of logic functions with application
to FPGA synthesis, in Proceedings of the Design Automation Conference, 1992, pp. 448–451.

TECHNOLOGY MAPPING FOR FPGAS


69. Ling, A., Singh, D.P., and Brown, S.D., FPGA technology mapping: A study of optimality, in Proceedings
of the Design Automation Conference, 2005.
70. Francis, R.J., Rose, J., and Chung, K., Chortle: A technology mapping program for lookup table-
based field-programmable gate arrays, in Proceedings of the Design Automation Conference, 1990,
pp. 613–619.
71. Cong, J. and Ding, E., An optimal technology mapping algorithm for delay optimization in lookup
table based FPGA designs, IEEE Transactions on CAD, 13(1), 1–12, 1994.
72. Cong, J. and Hwang, Y. Structural gate decomposition for depth-optimal technology mapping in LUT-
based FPGA designs, ACM Transactions on Design Automation of Digital Systems, 5(2), 193–225,
2000.
73. Cong, J. and Ding, Y., On area/depth trade-off in LUT-based FPGA technology mapping, IEEE
Transactions on VLSI, 2(2), 137–148, 1994.
74. Schlag, M., Kong, J., and Chan, P.K., Routability-driven technology mapping for lookup table-based
FPGA’s, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(1),
13–26, 1994.
75. Cong, J. and Hwang, Y., Simultaneous depth and area minimization in LUT-based FPGA mapping, in
Proceedings of the fourth International Symposium FPGAs, 1995, pp. 68–74.
76. Chen, D. and Cong, J., DAOmap: A depth-optimal area optimization mapping algorithm for FPGA
designs, in Proceedings of the International Conference on CAD (ICCAD), November 2004.
77. Farrahi, A. and Sarrafzadeh, M., Complexity of the lookup-table minimization problem for FPGA
technology mapping, IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 13(11), 1319–1332, 1994.
78. Manohararajah, V., Brown, S.D., and Vranesic, Z.G., Heuristics for area minimization in LUT-based
FPGA technology mapping, IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 25(11), 2331–2340, 2006.
79. Mishchenko, A. et al., Combinational and sequential mapping with priority cuts, in Proceedings of the
IEEE International Conference on CAD (ICCAD), 2007, pp. 354–361.
80. Mishchenko, A., Chatterjee, S., and Brayton, R., Improvements to technology mapping for LUT-based
FPGAs, in Proceedings of the 14th International Symposium on FPGAs, 2006.
81. Mishchenko, A., Brayton, R., and Jang, S., Global delay optimization using structural choices, in
Proceedings of the 18th International Symposium on FPGAs, 2010, pp. 181–184.

POWER-AWARE SYNTHESIS
82. Farrahi, A.H. and Sarrafzadeh, M., FPGA technology mapping for power minimization, in Proceedings
of the International Workshop on Field-Programmable Logic and Applications, 1994.
83. Anderson, J. and Najm, F.N., Power-aware technology mapping for LUT-based FPGAs, in Proceedings
of the International Conference on Field-Programmable Technology, 2002.


84. Lamoureux, J. and Wilton, S.J.E., On the interaction between power-aware CAD algorithms for
FPGAs, in Proceedings of the International Conference on CAD (ICCAD), 2003.
85. Chen, D., Cong, J., Li, F., and He, L., Low-power technology mapping for FPGA architectures with dual
supply voltages, in Proceedings of the 12th International Symposium on FPGAs, 2004, pp. 109–117.
86. Anderson, J., Najm, F., and Tuan, T., Active leakage power estimation for FPGAs, in Proceedings of the
12th International Symposium on FPGAs, 2004, pp. 33–41.
87. Anderson, J. and Ravishankar, C., FPGA power reduction by guarded evaluation, in Proceedings of the
18th International Symposium on FPGAs, 2010, pp. 157–166.
88. Hwang, J.M., Chiang, F.Y., and Hwang, T.T., A re-engineering approach to low power FPGA design
using SPFD, in Proceedings of the 35th ACM/IEEE Design Automation Conference, 1998, pp. 722–725.
89. Kumthekar, B. and Somenzi, F., Power and delay reduction via simultaneous logic and placement
optimization in FPGAs, in Proceedings of the Design and Test in Europe (DATE), 2000, pp. 202–207.

PHYSICAL DESIGN
90. Anderson, J., Saunders, J., Nag, S., Madabhushi, C., and Jayaraman, R., A placement algorithm for
FPGA designs with multiple I/O standards, in Proceedings of the International Conference on Field
Programmable Logic and Applications, 2000, pp. 211–220.
91. Mak, W., I/O placement for FPGA with multiple I/O standards, in Proceedings of the 11th International
Symposium on FPGAs, 2003, pp. 51–57.
92. Anderson, J., Nag, S., Chaudhary, K., Kalman, S., Madabhushi, C., and Cheng, P., Run-time conscious
automatic timing-driven FPGA layout synthesis, in Proceedings of the 14th International Conference
on Field-Programmable Logic and Applications, 2004, pp. 168–178.

CLUSTERING
93. Betz, V. and Rose, J., VPR: A new packing, placement and routing tool for FPGA research, in
Proceedings of the Seventh International Conference on Field-Programmable Logic and Applications,
1997, pp. 213–222.
94. Marquardt, A., Betz, V., and Rose, J., Timing-driven placement for FPGAs, in Proceedings of the
International Symposium on FPGAs, 2000, pp. 203–213.
95. Singh, A. and Marek-Sadowska, M., Efficient circuit clustering for area and power reduction in
FPGAs, in Proceedings of the International Symposium on FPGAs, 2002, pp. 59–66.
96. Murray, K., Whitty, S., Luu, J., Liu, S., and Betz, V., Titan: Enabling large and realistic benchmarks for
FPGAs, in Proceedings of the International Conference on Field-Programmable Logic and Applications,
2013, pp. 1–8.
97. Feng, W., Greene, J., Vorwerk, K., Pevzner, V., and Kundu, A., Rent’s rule based FPGA packing for
routability optimization, in Proceedings of the International Symposium on FPGAs, 2014, pp. 31–34.
98. Luu, J., Rose, J., and Anderson, J., Towards interconnect-adaptive packing for FPGAs, in Proceedings
of the International Symposium on FPGAs, 2014, pp. 21–30.

PLACEMENT
99. Kong, T., A novel net weighting algorithm for timing-driven placement, in Proceedings of the
International Conference on CAD (ICCAD), 2002, pp. 172–176.
100. Lin, Y., He, L., and Hutton, M., Stochastic physical synthesis considering pre-routing interconnect
uncertainty and process variation for FPGAs, IEEE Transactions on VLSI Systems, 16(2), 124–133,
2008.
101. Chen, G. and Cong, J., Simultaneous timing driven clustering and placement for FPGAs, in Proceedings
of the International Conference on Field Programmable Logic and Applications, 2004, pp. 158–167.
102. Sharma, A., Ebeling, C., and Hauck, S., Architecture adaptive routability-driven placement for FPGAs,
in Proceedings of the International Conference on Field-Programmable Logic and Applications, 2005,
pp. 95–100.
103. Sankar, Y. and Rose, J., Trading quality for compile time: Ultra-fast placement for FPGAs, in
Proceedings of the International Symposium on FPGAs, 1999, pp. 157–166.
104. Maidee, M., Ababei, C., and Bazargan, K., Fast timing-driven partitioning-based placement for island
style FPGAs, in Proceedings of the Design Automation Conference (DAC), 2003, pp. 598–603.
105. Gort, M. and Anderson, J., Analytical placement for heterogeneous FPGAs, in Proceedings of the
International Conference on Field-Programmable Logic and Applications, 2012, pp. 143–150.

© 2016 by Taylor & Francis Group, LLC


106. Kim, M.-C., Lee, D., and Markov, I., SimPL: An effective placement algorithm, IEEE Transactions on
CAD, 31(1), 50–60, 2012.
107. Wu, K. and McElvain, K., A fast discrete placement algorithm for FPGAs, in Proceedings of the
International Symposium on FPGAs, 2012, pp. 115–119.
108. Ludwin, A. and Betz, V., Efficient and deterministic parallel placement for FPGAs, ACM Transactions
on Design Automation of Electronic Systems, 16(3), 22:1–22:23, June 2011.
109. An, M., Steffan, G., and Betz, V., Speeding up FPGA placement: Parallel algorithms and methods,
in Proceedings of the International Symposium on Field-Programmable Custom Computing Machines,
2014, pp. 178–185.
110. Goeders, J., Lemieux, G., and Wilton, S., Deterministic timing-driven parallel placement by simulated
annealing using half-box window decomposition, in Proceedings of the International Conference on
Reconfigurable Computing and FPGAs, 2011, pp. 41–48.

PHYSICAL RESYNTHESIS
111. Lin, J., Jagannathan, A., and Cong, J., Placement-driven technology mapping for LUT-based FPGAs, in
Proceedings of the 11th International Symposium on FPGAs, 2003, pp. 121–126.
112. Suaris, P., Wang, D., and Chou, N., Smart move: A placement-aware retiming and replication method
for field-programmable gate arrays, in Proceedings of the Fifth International Conference on ASICs,
2003.
113. Suaris, P., Liu, L., Ding, Y., and Chou, N., Incremental physical re-synthesis for timing optimization,
in Proceedings of the 12th International Symposium on FPGAs, 2004, pp. 99–108.
114. Schabas, K. and Brown, S., Using logic duplication to improve performance in FPGAs, in Proceedings
of the 11th International Symposium on FPGAs, 2003, pp. 136–142.
115. Chen, G. and Cong, J., Simultaneous timing-driven placement and duplication, in Proceedings of the
13th International Symposium on FPGAs, 2005, pp. 51–61.
116. Manohararajah, V., Singh, D., Brown, S., and Vranesic, Z., Post-placement functional decomposition
for FPGAs, in Proceedings of the International Workshop on Logic Synthesis, 2004, pp. 114–118.
117. Manohararajah, V., Singh, D.P., and Brown, S., Timing-driven functional decomposition for FPGAs,
in Proceedings of the International Workshop on Logic Synthesis, 2005.
118. Ding, Y., Suaris, P., and Chou, N., The effect of post-layout pin permutation on timing, in Proceedings
of the 13th International Symposium on FPGAs, 2005, pp. 41–50.
119. Singh, D. and Brown, S., Integrated retiming and placement for FPGAs, in Proceedings of the 10th
International Symposium on FPGAs, 2002, pp. 67–76.
120. Chen, D. and Singh, D., Line-level incremental resynthesis techniques for FPGAs, in Proceedings of
the 19th International Symposium on FPGAs, 2011, pp. 133–142.
121. Singh, D. and Brown, S., Constrained clock shifting for field programmable gate arrays, in Proceedings
of the 10th International Symposium on FPGAs, 2002, pp. 121–126.
122. Yeh, C.-Y. and Marek-Sadowska, M., Skew-programmable clock design for FPGA and skew-aware
placement, in Proceedings of the International Symposium on FPGAs, 2005, pp. 33–40.

ROUTING
123. Lemieux, G. and Lewis, D., Design of Interconnection Networks for Programmable Logic, Kluwer,
Norwell, MA, 2004.
124. Lemieux, G. and Brown, S., A detailed router for allocating wire segments in FPGAs, in Proceedings
of the Physical Design Workshop, 1993, pp. 215–226.
125. Nam, G.-J., Aloul, F., Sakallah, K., and Rutenbar, R., A comparative study of two Boolean formulations
of FPGA detailed routing constraints, in Proceedings of the International Symposium on Physical
Design, 2001, pp. 222–227.
126. Lee, C.Y., An algorithm for path connections and applications, IRE Transactions on Electronic
Computers, EC-10(2), 346–365, 1961.
127. McMurchie, L. and Ebeling, C., PathFinder: A negotiation-based performance-driven router for
FPGAs, in Proceedings of the Fifth International Symposium on FPGAs, 1995, pp. 111–117.
128. Youssef, H. and Shragowitz, E., Timing constraints for correct performance, in Proceedings of the
International Conference on CAD, 1990, pp. 24–27.
129. Swartz, J., Betz, V., and Rose, J., A fast routability-driven router for FPGAs, in Proceedings of the Sixth
International Symposium on FPGAs, 1998, pp. 140–151.



130. Wilton, S., A crosstalk-aware timing-driven router for FPGAs, in Proceedings of the Ninth ACM
International Symposium on FPGAs, 2001, pp. 21–28.
131. Fung, R., Betz, V., and Chow, W., Slack allocation and routing to improve FPGA timing while repairing
short-path violations, IEEE Transactions on CAD, 27(4), 686–697, April 2008.
132. Gort, M. and Anderson, J., Accelerating FPGA routing through parallelization and engineering
enhancements, IEEE Transactions on CAD, 31(1), 61–74, January 2012.

CAD FOR EMERGING ARCHITECTURE FEATURES
133. Li, F., Lin, Y., and He, L., Field programmability of supply voltages for FPGA power reduction, IEEE
Transactions on CAD, 26(4), 752–764, April 2007.
134. Lewis, D., et al., Architectural enhancements in Stratix-III and Stratix-IV, in Proceedings of the
International Symposium on FPGAs, 2009, pp. 33–41.
135. Huda, S., Anderson, J., and Tamura, H., Optimizing effective interconnect capacitance for FPGA
power reduction, in Proceedings of the International Symposium on FPGAs, 2014, pp. 11–19.
136. Chaware, R., Nagarajan, K., and Ramalingam, S., Assembly and reliability challenges in 3D integra-
tion of 28 nm FPGA die on a large high density 65 nm passive interposer, in Proceedings of the IEEE
Electronic Components and Technology Conference, 2012, pp. 279–283.
137. Hahn Pereira, A. and Betz, V., CAD and architecture for interposer-based multi-FPGA systems, in
Proceedings of the International Symposium on FPGAs, 2014, pp. 75–84.
138. Ababei, C., Mogal, H., and Bazargan, K., Three-dimensional place and route for FPGAs, IEEE
Transactions on CAD, 25(6), 1132–1140, June 2006.
139. “FPGAs in 2032”, Workshop at the 2012 ACM/IEEE International Symposium on FPGAs (slides avail-
able at [Link]).
