Hindawi Publishing Corporation

International Journal of Reconfigurable Computing


Volume 2013, Article ID 853510, 12 pages
http://dx.doi.org/10.1155/2013/853510

Research Article
Frequency Optimization Objective during System Prototyping
on Multi-FPGA Platform

Mariem Turki,1 Zied Marrakchi,2 Habib Mehrez,1 and Mohamed Abid3


1 LIP6, UPMC, 75005 Paris, France
2 FLEXRAS Technologies, 93521 Saint-Denis Cedex, France
3 CES Lab, ENIS, 3038 Sfax, Tunisia

Correspondence should be addressed to Mariem Turki; [email protected]

Received 30 April 2013; Accepted 16 October 2013

Academic Editor: Michael Hübner

Copyright © 2013 Mariem Turki et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Multi-FPGA hardware prototyping is becoming increasingly important in the system on chip design cycle. However, after
partitioning the design on the multi-FPGA platform, the number of inter-FPGA signals is greater than the number of physical
connections available on the prototyping board. Therefore, these signals should be time-multiplexed which lowers the system
frequency. The way in which the design is partitioned affects the number of inter-FPGA signals. In this work, we propose a set
of constraints to be taken into account during the partitioning task. Then, the resulting inter-FPGA signals are routed with an
iterative routing algorithm in order to obtain the best multiplexing ratio. Indeed, signals are grouped and then routed using the
intra-FPGA routing algorithm: Pathfinder. This algorithm is adapted to deal with the inter-FPGA routing problem. Many scenarios
are proposed to obtain the best results in terms of prototyping system frequency. Using this technique, the system
frequency is improved by an average of 12.8% compared to a constructive routing algorithm.

1. Introduction

With the ever-increasing complexity of system-on-chip circuits, software and hardware developers can no longer wait for the fabrication phase to test their designs [1]. Currently, it is estimated that 60 to 80 percent of the ASIC design effort is spent on verification [2].

FPGA-based prototyping is an important step in the creation of the final product and is key to reaching the market in time. The main advantage of FPGA-based prototyping is the ability to run a cycle-accurate, bit-accurate model of the SoC at high speed (sometimes at almost real-time speed) [3]. The availability of automatic FPGA mapping tools has streamlined the design conversion process, making the path from ASIC design to FPGA implementation more straightforward.

When the logic capacity of a single FPGA is less than the size of the design under test, a multi-FPGA platform is used to map the entire design. Because the silicon area overhead of FPGA versus ASIC technology has been measured to be about 40x [4], FPGA programming technology requires that an ASIC logic design be partitioned across multiple FPGA devices to achieve the necessary device logic capacity. The number of FPGAs depends on the size of the prototyping system, ranging from a few [5] up to 60 FPGAs [6].

In order to map the design onto a multi-FPGA board, a partitioning tool decomposes the design into pieces that fit within the logic resources of individual FPGA devices. Partitioning is often performed to minimize the required inter-FPGA interconnect, control the system-wide critical path delay, and localize memory access. For some systems, partitioning must be performed so that routing restrictions in terms of available FPGA pin count and system topology are taken into account. These constraints are considered in order to manage the large timing delays of inter-FPGA communications compared to intra-FPGA ones and also to cope with the limited bandwidth between FPGAs, which is due to the limited number of I/Os per FPGA. Indeed, the number of I/Os is increasing with each new FPGA generation, but the ratio of FPGA I/Os over FPGA logic capacity is decreasing. Thus, even though the logic capacity of the multi-FPGA board is sufficient to map the complete design, the number of signals,
which appear at the interface and which should be transmitted between FPGAs, is significantly higher than the number of available traces between those FPGAs. The communication of interpartition signals between FPGAs is based on routing algorithms. The most widely used routing algorithm determines the shortest feasible path between FPGAs, using the available board interconnect resources, for each cut signal [7]. This approach is not recursive and leads inevitably to a blockage.

In this paper, we propose a set of constraints to be considered during the partitioning task. These constraints are intended to get the best results in terms of the number of cut signals and the critical path optimization. We also propose a new approach to route the resulting inter-FPGA signals, based on the signal multiplexing technique. To reach this goal, we use an iterative routing algorithm called Pathfinder [8]. This algorithm was originally used to route intra-FPGA signals. We extend it to inter-FPGA signals in order to obtain the best routing results.

The rest of this paper is organized as follows. Section 2 is dedicated to the different techniques used in the state of the art to route the inter-FPGA signals. Section 3 describes the different steps of the design prototyping flow. In Section 4, we present the proposed routing algorithm, which was initially used to route intra-FPGA signals. Section 5 explains the scenarios we propose to test the performance of the routing algorithm. These scenarios include the inter-FPGA signal form and also the routing graph direction. In Section 6, we describe the multiplexing IP that we use to transfer the multiplexed signals. Section 7 is dedicated to the experimental results and to the evaluation of the proposed methods. Finally, Section 8 concludes the paper.

2. Related Works

To address the inter-FPGA signal routing problem, the authors in [9, 10] proposed heuristic algorithms to route multiterminal signals in partial crossbar architectures. In [11, 12], multiterminal signals are decomposed into two-terminal nets. The routing algorithm is then applied to these nets.

In this paper, our goal is to find the signal shape which gives the best routing results. For this reason, many scenarios are applied with the proposed routing graph in order to get the best system frequency.

To remedy the pin count limitation, Babb et al. [13, 14] introduced time multiplexing of I/O pins. Multiplexing means that multiple design signals are assembled and serialized through the same board connection and then demultiplexed at the receiving FPGA. This technique dramatically increases the available inter-FPGA communication bandwidth. On the other hand, it makes the prototyping system much slower since the system clock period is composed of several phases. Each phase contains a number of slots. Consequently, in each phase, the selected signals are transmitted, each in a slot, between a pair of FPGAs. Signals are selected based on their criticality, which is calculated from the logic dependency analysis. A signal is selected if all signals it depends upon have been routed in previous phases. The router then uses the shortest path analysis with a cost function based on pin utilisation to route as many selected signals as possible, routing the most critical signals first. Any selected signals which cannot be routed are delayed to the next phase. In this technique, all the signals are multiplexed, without promoting the signals on the critical path, whereas leaving some critical signals unmultiplexed could give a better system frequency. Another disadvantage, related to combinational loops, is that any unpredictable delay of an inter-FPGA signal causes the transmission of nonupdated values and then system errors. Even though such a sophisticated approach may achieve a faster verification speed, it decreases the reliability of circuit verification, which is its most critical issue.

In [15, 16], the authors proposed a new multiplexing approach based on integer linear programming. The main objective of this study is to select which signals must be multiplexed and which must not. Using this technique, all signals are transmitted in each phase, but only those with updated values are considered. Since all the signals are transmitted in each phase, the number of slots per phase increases and the system frequency decreases. This technique, like the one in [9, 10], uses a constructive routing algorithm which is not optimized. In fact, once a signal is routed, it cannot be rerouted to release the routing resources it uses to another signal that has a greater need for them. This disadvantage is addressed by the iterative routing algorithm proposed in this study. The objective of our iterative approach is that all the signals negotiate the use of the routing resources. Each physical wire will be used by the signal which has the greatest need for this resource. This negotiation is carried out over several iterations to solve all the conflicts, unlike the constructive routing algorithm, which runs in a single iteration.

3. Prototyping Flow

To prototype an ASIC design on a multi-FPGA platform, the input circuit is transformed into a multi-FPGA configuration bitstream to be downloaded onto the prototyping board. Figure 1 presents the prototyping flow.

3.1. Logic Synthesis. The HDL description files of the prototyping architecture are mapped onto the target library of FPGA primitives. In this paper, the benchmarks are synthesized with the Synplify industrial tool [18]. The output of this task is a postsynthesis Verilog netlist.

3.2. Partitioning. After mapping the netlist onto the target technology, it is divided into partitions, each of which can fit into a single target FPGA. The partitioner performs K-way partitioning with a multiobjective function. The partitioning step is very critical since it has a significant impact on the performance of the prototyping system. In this study, we use the Wasga partitioning tool of Flexras technologies [19]. For
this tool, we set some constraints in order to have a good trade-off between the following criteria.

Figure 1: Prototyping flow. (HDL design → logic synthesis, with constraints → design netlist → partitioning, with board description and constraints → routing and multiplexing → individual FPGA compilation; for each partition: partition netlist → place and route → partition bitstream.)

(a) Minimize the Number of Cut Signals. For big designs, it is difficult, if not impossible, to find a partitioning solution which meets the constraint related to the number of physical connections between FPGAs. As will be explained subsequently, the solution is a postpartitioning process that allows a number of signals to share the same physical wire in different time slots. The insertion of these multiplexers increases the delays on combinatorial paths. These delays are correlated with the number of multiplexed signals (the multiplexing ratio). Thus, the main goal of the partitioner is to reduce the number of cut signals in order to get the lowest multiplexing ratio.

On the other hand, the ratio between the number of cut signals and the number of available wires should be balanced between all pairs of FPGAs. Therefore, the objective of the partitioning tool is to minimize the $C_p$ parameter presented in

$$C_p = \sqrt{\sum_{p=0}^{N} \left(\frac{S_p}{T_p}\right)^{2}}, \tag{1}$$

with $N$ being the number of FPGA pairs in the prototyping platform, $S_p$ being the number of signals between the pair $p$ of FPGAs, and $T_p$ being the number of available tracks between the same pair.

Finally, the partitioner aims to provide guidance about the signals which should not be multiplexed since they affect the critical path.

(b) Combinatorial Paths. The system frequency is imposed by the delay of the longest combinatorial path (between two registers). The delay on a combinatorial path is strongly correlated with the number of times a path crosses the border
of an FPGA, called a combinatorial hop. This is because the transmission through an inter-FPGA connection is much slower than the one inside the FPGA. Therefore, it is important to absorb the signals belonging to the critical combinatorial paths. In Figure 2(a), the number of combinatorial hops is equal to 2, and the number of cut signals is equal to 2. If the partitioner identifies the best module to move, the partitioning solution is improved since the number of inter-FPGA signals and the number of combinatorial hops are reduced, as shown in Figure 2(b).

Figure 2: Combinatorial hop example. (a) Partitioning solution with cut signals = 2, combi-hop = 2. (b) Partitioning solution with cut signals = 1, combi-hop = 1.

(c) Logical Resources Limitation. The number of logical resources in the FPGA circuits is limited. During the partitioning, an occupancy rate constraint is set, so the partitioning tool must take this number into account and try to produce a partitioning solution which meets the available resources. These resources are heterogeneous since they include different types (LUT, RAM, DSP, etc.). The occupancy rate should consider the additional logical area which will be occupied by the multiplexing IPs after the inter-FPGA routing tasks.

Unlike most commercial tools, the partitioning tool used in our experiments operates on synthesized netlists, which gives accurate information about the size of the design so that it can meet the available logical resources of each FPGA.

3.3. Routing and Multiplexing. The system clock is the clock of the logic design being prototyped. The system clock period is divided into a number of slots as shown in Figure 3. Each signal is transmitted between a pair of FPGAs within one slot period. These slots are controlled by an I/O clock which is faster than that of the system in order to transfer all the signals within one system clock period.

The system clock period is given by the following equation:

$$T_{\text{SYS CLK}} = \text{settle start} + \text{comm delay} + \text{settle end}. \tag{2}$$

Settle start and settle end correspond to the intra-FPGA propagation delay inside the source and destination FPGA, respectively. During the intra-FPGA place and route tasks, we define a multicycle path constraint to set the intra-FPGA propagation delay to 3 times the intercommunication period, that is, $3 \times T_{\text{IOclk}}$, in order to relax the timing constraint inside each FPGA. The comm delay is the delay of the inter-FPGA communication. This delay should be reduced in order to optimize the system frequency. The communication delay is represented by the following expression:

$$\text{Comm delay} = T_{\text{mux}} + T_{\text{routing hop}} + T_{\text{latencies}}. \tag{3}$$

$T_{\text{mux}}$ is the delay spent transferring all signals via the same physical wire, and it is proportional to the multiplexing ratio. $T_{\text{routing hop}}$ is the delay spent crossing all the routing hops, where the number of routing hops is the number of FPGAs to cross to route a signal between the source and the destination. Finally, $T_{\text{latencies}}$ is the latency of the SERDES modules.

In order to reduce the multiplexing ratio, the effort should be spent on the routing task. Indeed, using an appropriate routing algorithm, the router can find the optimized solution related to the given constraints. As shown in Figure 4, the router takes as input the architecture of the prototyping platform, the list of cut signals to be routed, and the initial mux ratio parameter, which is the number of inter-FPGA signals to be transmitted through the same physical wire. This parameter is calculated as the maximum of the multiplexing ratios of all the FPGA pairs. The mux ratio of one FPGA pair is the ratio between the number of signals and the number of connection wires between these two FPGAs.

Figure 4 shows the proposed flow to reduce the multiplexing ratio. Depending on the given inputs, the router tries to route all inter-FPGA signals while meeting the mux ratio calculated initially. If a feasible solution exists, the mux ratio is decremented and the router attempts to find another routing solution with the new mux ratio. Otherwise, the router exits with the best multiplexing ratio obtained.

3.4. FPGA Place and Route. Once the routing is achieved, the multiplexing IPs are inserted on the source and destination FPGAs to ensure the transmission of the inter-FPGA signals in the corresponding time slots. One netlist is generated for each FPGA. Each netlist must be processed with FPGA-specific automated place and route software to generate configuration bitstreams.

4. Inter-FPGA Signals Routing Strategy

To route inter-FPGA signals, it is necessary to find an algorithm that can assign, in an optimized manner, signals to the
available resources. The techniques mentioned in Section 2 use a constructive routing algorithm. This algorithm keeps track of the reserved and available physical connections between FPGAs. The router applies Dijkstra's shortest path algorithm [20] to determine the shortest path between the source and destination FPGAs. If the shortest path exists, the capacity of all used resources is decremented, and they can no longer be used to route the next signals. Otherwise, the router returns unsuccessfully. The main disadvantage of this method is its irreversibility. Indeed, once a signal is routed, it cannot be rerouted to release the routing resources it uses to another signal that has a greater need for them. In the example of Figure 5, signals are routed in random order. If the signal S1 is first routed through FPGA1, then S2 cannot be routed since the wire between FPGA1 and FPGA2 is used by S1. In this case, the design is considered nonroutable. To avoid this problem, we route the inter-FPGA signals with an iterative routing algorithm. Among existing techniques, the Pathfinder routing algorithm seems to be best suited to our problem as it offers a compromise between performance and routability goals.

Figure 3: Clocking framework. (A D flip-flop and combinatorial logic in FPGA 1, the MUX IP between FPGA 1 and FPGA 2, and combinatorial logic and a D flip-flop in FPGA 2; the SYS CLK period spans settle start, comm delay, and settle end, and the I/O CLK is the faster clock.)

Figure 4: Mux ratio optimization based on iterative approach. (Inputs: architecture description, signals to route, initial mux ratio; signal routing; on success, the mux ratio is updated and routing is repeated; otherwise, the best mux ratio is returned.)

4.1. Routing Graph. Since we have chosen Pathfinder to route all inter-FPGA signals, our interest was in the modelling of the multi-FPGA board. Therefore, we chose to model all the routing resources by an oriented routing graph $G(V, E)$. The set of vertices, $V = \{v_1, \ldots, v_n\}$, represents the I/O pins of all FPGAs, and each FPGA is represented by a top vertex. The set of edges, $E = \{e_1, \ldots, e_n\}$, represents all the inter-FPGA connections. A unidirectional connection is modelled by a directed edge, while a bidirectional connection (e.g., between a vertex and a top vertex) is represented by two directed edges.

Figure 6 presents a routing graph of a three-FPGA-based platform.

4.2. Routing Algorithm: Pathfinder. Pathfinder is used primarily for routing intra-FPGA signals. We adapt it to deal with the inter-FPGA signals [21]. Pathfinder uses an iterative, negotiation-based approach to successfully route all the signals. The routing problem for a given signal is to find a directed tree embedded in $G$ that connects the source of the signal to each of its FPGA destinations. During the first routing iteration, the signals are freely routed without paying attention to resource sharing. Individual signals are routed using Dijkstra's shortest path algorithm [20]. At the end of the first iteration, resources may be congested because multiple signals have used them. During subsequent iterations, the cost of using a resource is increased, based on the number of signals that share the resource and the history of congestion on that resource. Thus, signals are forced to negotiate for routing resources. If a resource is highly congested, nets which can use lower congestion alternatives are forced to do so. On the other hand, if the alternatives are more congested than the resource, then a signal may still use that resource.
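The negotiation described in Section 4.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cost form `(base + history) * (1 + pres_fac * overuse)` is one common way to write a negotiated-congestion cost, and the factor schedule, the node names (`F0`, `w01`, and so on), the base costs, and the restriction to two-terminal signals are all assumptions made for this example. Wires are modelled as capacity-1 nodes, and the small graph is built so that S1 and S2 first collide on the wire between F1 and F2, loosely in the spirit of Figure 5.

```python
import heapq

def shortest_path(graph, src, dst, node_cost):
    """Dijkstra over node costs; returns the node list from src to dst, or None."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in graph.get(u, ()):
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None

def pathfinder(graph, capacity, base_cost, signals, max_iters=20):
    """Negotiated-congestion routing of two-terminal signals (illustrative only)."""
    occupancy = {n: 0 for n in capacity}
    history = {n: 0.0 for n in capacity}
    routes = {}
    pres_fac = 0.0  # assumed schedule: the first iteration ignores sharing
    for iteration in range(1, max_iters + 1):
        for name, (src, dst) in signals.items():
            for n in routes.pop(name, []):  # rip up the previous route
                occupancy[n] -= 1

            def cost(n):
                overuse = max(0, occupancy[n] + 1 - capacity[n])
                return (base_cost[n] + history[n]) * (1.0 + pres_fac * overuse)

            path = shortest_path(graph, src, dst, cost)
            if path is None:
                return None, iteration
            routes[name] = path
            for n in path:
                occupancy[n] += 1
        overused = [n for n in capacity if occupancy[n] > capacity[n]]
        if not overused:
            return routes, iteration  # every resource is within capacity
        for n in overused:  # remember congestion for later iterations
            history[n] += occupancy[n] - capacity[n]
        pres_fac = 1.0 if pres_fac == 0.0 else 2.0 * pres_fac
    return None, max_iters

# Hypothetical 4-FPGA example: wires are nodes with capacity 1.
graph = {"F0": ["w01", "w03"], "w01": ["F1"], "F1": ["w12"], "w12": ["F2"],
         "w03": ["F3"], "F3": ["w32"], "w32": ["F2"]}
wires = {"w01": 1.0, "w12": 1.0, "w03": 1.5, "w32": 1.0}  # assumed base costs
capacity = {n: 1 for n in wires}
capacity.update({f: 10**6 for f in ("F0", "F1", "F2", "F3")})
base_cost = dict(wires, F0=0.0, F1=0.0, F2=0.0, F3=0.0)
signals = {"S1": ("F0", "F2"), "S2": ("F1", "F2")}

routes, iterations = pathfinder(graph, capacity, base_cost, signals)
print(iterations, routes)
```

In this example both signals use `w12` after the first iteration. In the second iteration the history and present-sharing terms make `w12` expensive for S1, which moves to the path through F3, while S2 keeps the only path it has.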
Observing the final routing results, we notice that inter-FPGA signals can be routed directly between source and destination FPGAs, or intermediate routing hops may be necessary.

Figure 5: Conflict resolution by an iterative routing algorithm. (First iteration: S1 and S2 conflict; second iteration: conflict resolved.)

Figure 6: Modelling multi-FPGA platform as routing graph. (FPGA0, FPGA1, and FPGA2 are represented by the top vertices F0, F1, and F2, with pin vertices such as P0-F0, P1-F0, P0-F1, P1-F1, P2-F1, and P0-F2.)

5. Routing Algorithm Adaptation

Taking into account some problems to be detailed later, we adapt our routing approach to the new routing topology. In this section, we discuss the proposed solutions and the various changes we make.

5.1. Convention. All FPGAs on the prototyping board are indexed sequentially, starting at 0. We say that a signal has a direct direction if the index of its source FPGA is lower than that of its destination FPGA. A signal with an indirect direction is one directed the opposite way.

5.2. Signal Direction Conflicts. The Pathfinder routing algorithm processes each signal independently. Each routing resource (node) may be shared by more than one signal. Signals that share the same resource are multiplexed together. As mentioned above, we model our architecture by a bidirectional routing graph. This causes direction conflicts since the signals sharing the same resources can have different directions.

5.2.1. Unidirectional Routing Graph. To avoid direction conflicts, we apply the Pathfinder routing algorithm on a unidirectional graph. The idea is to assign, according to criteria to be detailed later, a definite direction to all physical wires. In the routing graph, this is translated by a single edge between each pair of nodes.

Figure 7: Routing flow on unidirectional graph. (Unidirectional routing graph modelling; compute initial mux ratio; set nodes capacity to "mux ratio"; start Pathfinder; on success, the mux ratio is updated and Pathfinder is run again; otherwise, the best mux ratio is returned.)

Figure 7 represents the routing flow on a unidirectional graph. The first step generates the unidirectional graph
depending on the number of inter-FPGA signals between each pair. The number of physical wires that transmit direct (resp., indirect) signals between two FPGAs is proportional to the number of direct (resp., indirect) signals between these two FPGAs. The following equation represents the number of physical wires in a given direction:

$$\text{NBwires}_{[f_i \to f_j]} = \frac{\text{Sig}_{[f_i \to f_j]}}{\text{Sig}_{[f_i, f_j]}} \times \text{NBwires}_{[f_i, f_j]}. \tag{4}$$

$\text{Sig}_{[f_i \to f_j]}$ and $\text{Sig}_{[f_i, f_j]}$ are, respectively, the number of direct signals between FPGA$_i$ and FPGA$_j$ and the total number of signals between the same pair. $\text{NBwires}_{[f_i, f_j]}$ is the total number of available physical wires between FPGA$_i$ and FPGA$_j$.

Figure 8: Unidirectional wires selection proportional to the number of signals. (Wires between FPGA 0 and FPGA 1.)

In the example of Figure 8, the number of direct wires is set to 3, and the number of indirect wires is set to 2. The second step consists in computing the initial mux ratio. This parameter is calculated as follows:

$$\text{mux ratio} = \max_{f_1 \to f_2 \in \text{FPGAs}} \frac{\text{Sig}_{f_1 \to f_2}}{\text{Wires}_{f_1 \to f_2}}. \tag{5}$$

That is, the initial mux ratio is the maximum, over all FPGA pairs, of the ratio between the number of signals and the number of available physical wires between the pair.

After calculating the multiplexing ratio, the capacity of all nodes is set to mux ratio. Then, the Pathfinder routing algorithm tries to find a feasible solution in which all signals are routed and no node is shared by more than "mux ratio" signals. If these two constraints are met, the mux ratio parameter is decremented and the router tries to find a feasible solution with the new value of mux ratio. Otherwise, the router exits with the best solution found.

5.2.2. Bidirectional Routing Graph. Selecting the unidirectional wires in proportion to the number of signals between each pair of FPGAs is not an optimized decision. For this reason, we keep the bidirectional graph and we assemble signals into groups. Indeed, signals that have the same source and the same destinations are grouped together in "GSignals" and are considered as a single signal. Each GSignal contains a maximum of mux ratio signals. Therefore, the capacity of all resources in the routing graph is set to 1. The bidirectional graph allows a better use of the available routing wires of the multi-FPGA prototyping board.

Figure 9: Routing flow on bidirectional graph. (Bidirectional routing graph modelling; compute initial mux ratio; create GSignals containing N signals, with N ≤ mux ratio; start Pathfinder; on success, the mux ratio is updated and the flow is repeated; otherwise, the best mux ratio is returned.)

Figure 9 presents the steps to route inter-FPGA signals on a bidirectional routing graph. The first step creates the graph using two edges of opposite directions to represent each physical wire. Next, the initial mux ratio parameter is calculated in the same way as in the unidirectional graph. This parameter determines the number of signals to be grouped together into one GSignal.

After running the Pathfinder algorithm to route the GSignals, all GSignals should be routed and each node should be used by only one GSignal. Finally, the router retains the routing solution with the best mux ratio.

This method avoids conflict management, since the Pathfinder algorithm prevents congestion; at the end of every iteration, no node is used by more than one group of signals (GSignal), and all signals of a group have the same direction.

However, this method is not fully optimized. Indeed, from a source "S" to a destination "D," the number $N$ of existing signals can be much below the maximum number of signals allowed in a GSignal, which is equal to mux ratio. Consequently, since the capacity of the nodes is set to 1, only $N$ signals are routed in a path between "S" and "D," which means poor utilization of the routing resources.

5.3. Signal Modelling. For better routing results, we notice that the choice of signal form is essential. Two signal shapes are possible: a multiterminal or a two-terminal signal.

5.3.1. Multiterminal Signal. After partitioning the prototyping design, the next step routes each net from the part containing its driver to all parts containing its destinations.
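To make the GSignal grouping of Section 5.2.2 and the two signal forms considered in Section 5.3 concrete, the following Python sketch groups nets that share the same source and destinations into GSignals of at most mux ratio signals, first keeping nets multiterminal and then decomposing them into two-terminal branches. The function names, the net tuples, and the branch naming scheme are hypothetical and chosen only for this example. The grouping order is also an assumption, since the paper does not specify one.

```python
from collections import defaultdict

def decompose(nets):
    """Split each multiterminal net (name, src, dests) into two-terminal branches."""
    return [(f"{name}/{d}", src, (d,))
            for name, src, dests in nets
            for d in sorted(dests)]

def group_gsignals(nets, mux_ratio):
    """Group nets with identical source and destinations, at most mux_ratio per group."""
    buckets = defaultdict(list)
    for name, src, dests in nets:
        buckets[(src, tuple(sorted(dests)))].append(name)
    gsignals = []
    for (src, dests), names in sorted(buckets.items()):
        for i in range(0, len(names), mux_ratio):
            gsignals.append((src, dests, names[i:i + mux_ratio]))
    return gsignals

# Hypothetical cut nets: (name, source FPGA index, set of destination FPGA indices).
nets = [("a", 0, {1, 2}), ("b", 0, {1, 2}), ("c", 0, {1}), ("d", 0, {2}), ("e", 0, {1})]
MUX_RATIO = 3  # assumed value for the example

multi = group_gsignals(nets, MUX_RATIO)
two = group_gsignals(decompose(nets), MUX_RATIO)
sizes_multi = [len(members) for _, _, members in multi]
sizes_two = [len(members) for _, _, members in two]
print("multiterminal GSignal sizes:", sizes_multi)
print("two-terminal GSignal sizes: ", sizes_two)
```

With these inputs, no multiterminal GSignal reaches the allowed mux ratio of 3, whereas two of the three two-terminal GSignals are full. This is only an illustration of the fill effect on one hand-made input, not a general result.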
Actually, a signal can have more than one destination. The Pathfinder routing algorithm can route multiterminal nets. The algorithm starts by selecting the source and the list of all destinations. After routing the first destination, Pathfinder moves to the next one and so on. Such a signal can be routed in 2 different ways, as shown in Figure 10. The first solution is to use 2 different paths to route the source to its 2 destinations. Indeed, after routing the first destination via path 1, the router selects the second destination and tries to route it. An uncongested path can be found via a second path that does not intersect with path 1. This solution is represented in Figure 10(a). The second solution is to route a destination using a part of the path used to route an already routed destination. It means that a destination can be reached from the last routed one, as shown in Figure 10(b).

Figure 10: Routing solutions of multiterminal net. (A net driven in Part 0 with destinations in Part 1 and Part 2, mapped onto FPGA 0, FPGA 1, and FPGA 2; panels (a) and (b).)

Although the routing of multiterminal signals can be the optimal solution considering the number of used I/O pins, the design is considered inflexible, especially when grouping those signals into GSignals. Indeed, in some cases, signals with the same source and the same destinations are not numerous; consequently, some GSignals do not contain the maximum number of signals, equal to mux ratio.

5.3.2. Two-Terminal Signal. In order to make the design more flexible, we decompose the multiterminal signals into branches, each with one source and only one destination. The Pathfinder routing algorithm tries to find a routing path for each branch separately. With this decomposition, only the solution shown in Figure 10(a) is feasible.

6. Multiplexing IP

The approach described above determines which signals are to be multiplexed together. These signals are transmitted through the same physical wire and transferred using 2 multiplexing IPs placed in the sending and receiving FPGAs, as shown in Figure 11. To ensure the inter-FPGA communications, dedicated output parallel-to-serial converters (OSERDES) and input serial-to-parallel converters (ISERDES) are instantiated in the sending and receiving FPGAs. Low-voltage differential signalling (LVDS), a signalling standard providing high-speed data transfers, is used to transfer the data between SERDES converters.

When the number of cut signals exceeds the number of available I/O pins, the partitioning tool inserts 4-bit wide SERDES converters. Nevertheless, if the number of cut signals is not a multiple of 4, some OSERDES inputs (resp., ISERDES outputs) can be left unconnected. The maximum number of signals transmitted between an ISERDES/OSERDES pair is defined as mux ratio (2 ≤ mux ratio ≤ 4).

In highly connected designs, the number of signals can still exceed the transmission capacity between a pair of FPGAs, even after implementing the SERDES converters. In this case, $n$-bit wide MUXs (resp., DEMUXs) are added at the input of the OSERDES (resp., at the output of the ISERDES). The number $n$ equals the number of 4-bit data words to be sent. When the mux ratio is less than or equal to 4, then $n = 1$. On the other hand, if mux ratio ≥ 5, then $n \geq 2$.

The combination of one 4-bit wide SERDES with one $n$-bit wide MUX/DEMUX constitutes the multiplexing IP. This IP manages the inter-FPGA communication by sending the data, as well as a 4-bit start pattern (for the inter-FPGA synchronisation) and a 4-bit checksum (to verify the integrity of the transmitted data). Since 2 inter-FPGA clock cycles are required to send a 4-bit word, $2 + 2 + 2 \times n$ cycles are needed to send the start pattern, the checksum, and the $n$ 4-bit data words. If we consider the SERDES converter latencies as well as the IP latency, $12 + 2 \times n$ cycles are needed to complete the communication between a pair of FPGAs.

On the other hand, some signals are routed through hops since the direct path from the source to the destination does not exist. So, when a signal is routed through one or more routing hops, we insert 5 registers in the netlist of each routing hop in order to recover the 4-bit data before sending it to the next FPGA on the routing path. Therefore, 5 cycles are needed to cross a hop.

According to (3), the communication delay is equal to

$$\text{Comm delay} = \text{NB}_{R\,\text{hop}} \times 5 + 12 + \frac{\text{mux ratio}}{2}, \tag{6}$$

where the term mux ratio/2 corresponds to the $2 \times n$ data cycles when $n$ is taken as mux ratio/4.

Since the comm delay causes the biggest delay, we neglect the effect of the intradelays in the sending and the receiving FPGAs defined in (2). Therefore, the system frequency is represented in

$$\text{Sys freq} = \frac{\text{I/O frequency}}{\text{NB}_{R\,\text{hop}} \times 5 + 12 + \text{mux ratio}/2}. \tag{7}$$

7. Experimental Results

We use the benchmark generator [22] to generate several synthetic designs. The generated benchmarks are hierarchical since the partitioner operates on high levels of hierarchy in order to reduce the partitioning runtime and the number

Figure 11: Multiplexing IP architecture. [Block diagram: in FPGA 0, the design partition feeds the sending IP (multiplexer, OSERDES, LVDS); in FPGA 1, the receiving IP (LVDS, ISERDES, demultiplexer) feeds the design partition. The two IPs are linked by the differential pair Data p/Data n, and each FPGA has a clock generator with SYS CLK and I/O CLK.]

Table 1: Comparison between routing results of WASGA and CERTIFY partitioning tools (NR: not routable).

                 WASGA                                      CERTIFY
Benchmarks       Cut signals  NB FPGA  R hop  Mux ratio     Cut signals  NB FPGA  R hop  Mux ratio
CPU20 occ10      1545         6        0      3             3316         6        0      10
CPU20 occ20      1002         4        0      3             1634         4        0      4
CPU30 occ20      1710         4        0      3             3076         4        0      6
CPU30 occ30      1487         4        0      4             2521         3        0      7
CPU50 occ30      2819         4        0      5             5279         4        0      11
CPU50 occ50      2202         4        0      6             4019         3        0      9
CPU125 occ50     7809         6        1      11            NR           NR       NR     NR
CPU125 occ65     7644         5        0      12            NR           NR       NR     NR

of managed elements. The targeted multi-FPGA prototyping board that we use for the experiments is a DNV6F6PCIe from the Dini group [17]. As shown in Figure 12, this board contains six Virtex-6 LX550T FPGAs, all in the same FF1759 package, meaning that they have the same number of total user I/Os. The inter-FPGA clock frequency is set to 500 MHz. Applying this frequency to the multiplexing IPs (ISERDES/OSERDES with LVDS), the inter-FPGA communication data rate on this board is 1 Gbps using double data rate (DDR).

To map the designs onto this board, we use the WASGA partitioning flow provided by Flexras Technologies [19]. WASGA partitions the designs and outputs the list of inter-FPGA signals that should be routed. After routing these signals, using the routing methodology detailed in this paper, WASGA generates a netlist for each FPGA which contains the multiplexing IP to ensure the transmission of the multiplexed signals. The resulting netlists are entered into the FPGA flow to execute the place and route and the bitstream generation individually for each FPGA.

Firstly, we apply the constraints listed in Section 3.2 to the WASGA partitioning tool. Table 1 presents the routing results obtained by the WASGA flow and the CERTIFY partitioning tool [23]. The number of cut signals obtained by the WASGA partitioner is considerably lower than the number of signals obtained by CERTIFY. WASGA aims to optimize the number of combinational hops. Therefore, for all the tested designs, the mux ratio results are improved compared to those obtained by CERTIFY. Table 1 also shows results related to the number of routing hops used in each benchmark. The number of routing hops is the number of FPGAs crossed by a signal from the source until reaching its destination. Results presented in Table 1 reflect the importance of partitioning on the system frequency. Note that for the last two benchmarks, the designs are not routable (NR) with the CERTIFY tool. In fact, as the number of cut signals grows, the required mux ratio grows as well. CERTIFY provides multiplexing IP with a maximum width equal to 32 [24]. Thus, all the designs which need a multiplexing ratio ≥ 32 are not routable.

On the other hand, we tried to select the best shape of the routing signals. Table 2 shows the results for each routing scenario described in Section 5. These scenarios are defined depending on the signal shape and the routing graph. Four scenarios are selected to test the performance of the iterative routing algorithm on the multi-FPGA prototyping platform.
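The two signal shapes compared in these scenarios can be illustrated with a minimal sketch. It assumes a signal is described by a name, a source FPGA, and a list of destination FPGAs, and that two-terminal branches sharing the same source and destination are packed into GSignals of at most mux ratio signals, as described in Section 5.3. The function names `to_branches` and `group_branches` and the sample data are hypothetical; this is not the WASGA implementation.

```python
from collections import defaultdict

def to_branches(signals):
    """Split each (name, source, destinations) signal into two-terminal branches."""
    return [(name, src, dst) for name, src, dsts in signals for dst in dsts]

def group_branches(branches, mux_ratio):
    """Pack branches with the same (source, destination) into GSignals.

    Each GSignal holds at most mux_ratio signals (illustrative assumption).
    """
    by_pair = defaultdict(list)
    for name, src, dst in branches:
        by_pair[(src, dst)].append(name)
    gsignals = []
    for (src, dst), names in sorted(by_pair.items()):
        for i in range(0, len(names), mux_ratio):
            gsignals.append((src, dst, names[i:i + mux_ratio]))
    return gsignals

# Hypothetical inter-FPGA signals: two of them are multiterminal.
signals = [
    ("s0", "A", ["B", "C"]),
    ("s1", "A", ["B"]),
    ("s2", "A", ["B", "C"]),
    ("s3", "B", ["C"]),
]
branches = to_branches(signals)
gsignals = group_branches(branches, mux_ratio=2)
for g in gsignals:
    print(g)
```

With these inputs, the three A-to-B branches need two GSignals, one of which is only half full, which is the kind of under-filled group the text mentions for signals that share few endpoints.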

Figure 12: Prototyping board based on six Virtex-6 FPGAs from the Dini group [17]. [Block diagram: six Virtex-6 LX550T (FF1759) FPGAs, labeled A to F, with inter-FPGA buses, DDR3 SODIMM slots, GTX and MEG array expansion connectors, frequency synthesizers, a configuration FPGA, and a Marvell MV 78200 host. Legend: LVDS when paired but can be run single-ended at reduced frequency; GTX: packet I/O transceivers, 6.5 Gb/s per channel bidirectional (13 Gb/s max).]

Table 2: Comparison of routing strategies effects on prototyping system performance.

             Scenario 1                     Scenario 2                     Scenario 3                     Scenario 4
Benchmarks   Mux ratio  R hop  Freq (MHz)   Mux ratio  R hop  Freq (MHz)   Mux ratio  R hop  Freq (MHz)   Mux ratio  R hop  Freq (MHz)
Circuit A    12         2      17.85        15         2      16.66        4          2      20.83        4          1      26.31
Circuit B    18         3      13.88        24         2      14.7         4          3      17.24        7          1      23.8
Circuit C    24         3      12.82        44         2      11.36        11         3      15.15        11         1      21.73
Circuit D    50         3      9.61         50         2      10.63        15         3      14.28        20         1      18.51
Circuit E    119        6      4.9          116        4      5.55         57         2      9.8          56         4      8.33
Circuit F    160        3      4.67         168        3      4.5          68         3      8.19         68         1      9.8
Circuit G    220        5      3.4          256        1      3.44         89         2      7.46         86         3      7.14

In these experiments, we use benchmarks where 70% of signals are multiterminal ones.

(i) In scenario 1, multiterminal signals are routed on a unidirectional routing graph.
(ii) In scenario 2, two-terminal signals are routed on a unidirectional routing graph where the node capacities can be greater than or equal to 1.
(iii) In scenario 3, multiterminal signals are grouped into GSignals. These GSignals are routed on a bidirectional routing graph where all node capacities are set to 1.
(iv) Finally, in the fourth scenario, the two-terminal branches are combined into groups and routed on a bidirectional routing graph.

Results show that routing on a bidirectional graph gives much better results since the router has more flexibility to select the routing path. On the other hand, routing multiterminal signals is not always optimal: although the mux ratio of scenario 3 is sometimes lower than that of scenario 4, the larger number of routing hops penalizes the system frequency.
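The frequency columns of Table 2 follow from the delay model of (6) and (7). The sketch below assumes the 500 MHz inter-FPGA clock stated for the board and rounds mux ratio/2 up to a whole cycle; that rounding is an assumption, chosen because it reproduces the tabulated values for odd mux ratios. The function names are illustrative.

```python
from math import ceil

def comm_delay(r_hops, mux_ratio):
    """Communication delay in I/O clock cycles, following (6).

    5 cycles per routing hop, 12 cycles of fixed latency, and
    mux_ratio / 2 cycles of data (rounded up: an assumption).
    """
    return 5 * r_hops + 12 + ceil(mux_ratio / 2)

def sys_freq_mhz(r_hops, mux_ratio, io_freq_mhz=500.0):
    """System frequency in MHz, following (7)."""
    return io_freq_mhz / comm_delay(r_hops, mux_ratio)

# (benchmark, scenario, mux ratio, routing hops, frequency reported in Table 2)
samples = [
    ("Circuit A", 1, 12, 2, 17.85),
    ("Circuit A", 4, 4, 1, 26.31),
    ("Circuit B", 4, 7, 1, 23.8),
    ("Circuit C", 3, 11, 3, 15.15),
    ("Circuit G", 2, 256, 1, 3.44),
]
for name, scenario, mux, hops, reported in samples:
    print(f"{name} scenario {scenario}: {sys_freq_mhz(hops, mux):.2f} MHz "
          f"(Table 2: {reported})")
```

For Circuit A, moving from scenario 1 to scenario 4 cuts the delay from 28 to 19 cycles, which shows how both a lower mux ratio and fewer hops contribute to the frequency gain.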

Table 3: Comparison between OAR and NCR strategies on system performance.

                OAR                             NCR
Benchmarks      R hop  Mux ratio  Freq (MHz)    R hop  Mux ratio  Freq (MHz)    Gain
CPU50 occ30     0      9          29.41         0      9          29.41         0%
CPU125 occ50    2      16         16.66         1      16         20            20.04%
CPU150 occ30    3      24         12.82         1      29         15.62         21.84%
CPU150 occ50    2      51         10.41         1      51         11.62         11.65%
CPU375 occ80    2      51         10.41         1      51         11.62         11.65%
CPU375 occ85    2      79         8.06          2      69         8.77          8.8%
CPU700 occ80    2      134        5.61          2      109        6.49          15.68%
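The Gain column of Table 3 is the relative frequency increase of NCR over OAR. The short sketch below recomputes it from the two frequency columns as printed; because those inputs are already rounded, individual gains may differ from the table in the second decimal place. The function name `gain_percent` is illustrative.

```python
def gain_percent(f_oar, f_ncr):
    """Relative frequency gain of NCR over OAR, in percent."""
    return 100.0 * (f_ncr - f_oar) / f_oar

# (OAR frequency, NCR frequency) in MHz, copied from Table 3.
rows = [
    (29.41, 29.41),
    (16.66, 20.0),
    (12.82, 15.62),
    (10.41, 11.62),
    (10.41, 11.62),
    (8.06, 8.77),
    (5.61, 6.49),
]
gains = [gain_percent(f_oar, f_ncr) for f_oar, f_ncr in rows]
average_gain = sum(gains) / len(gains)
print([round(g, 2) for g in gains])
print(f"average gain: {average_gain:.1f}%")
```

Averaging over the seven benchmarks gives approximately 12.8%, the average improvement quoted for NCR.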

Since we have demonstrated that scenario 4 usually gives the best results, we apply Pathfinder and the obstacle avoidance routing algorithms (constructive algorithms) to route inter-FPGA signals, all with one source and one destination (branch), and grouped into GSignals. Table 3 shows the results of the comparison. OAR means obstacle avoidance routing and NCR refers to negotiated congestion routing. Results show the important impact of the NCR iterative routing and its efficiency in improving system performance. The frequency is increased on average by 12.8%, and the impact of NCR is important for highly congested partitioning results. In fact, thanks to its negotiation aspect, it easily avoids local minima and reduces the path length from a source FPGA to a destination FPGA. In addition, it leads to a good trade-off between maximum multiplexing ratio and routing hops.

8. Conclusion

Prototyping is no longer optional due to the cost of chips and the difficulty of simulating huge designs. To validate designs more efficiently, the highest frequency should be reached. The system frequency depends on the way the inter-FPGA signals are routed. In this paper, we presented our approach to route these inter-FPGA signals. We impose a number of constraints on the partitioning tool in order to get the best partitioning solution, which leads to the optimal routing. We extend the Pathfinder routing algorithm to deal with the inter-FPGA signals. In order to select the best signal shape, we tested the performance of this iterative routing algorithm on four scenarios.

The best scenario in terms of system frequency consists in grouping signals into GSignals where each one has 1 source and only 1 destination. Compared to common obstacle avoidance algorithms, we obtain a significant prototyping system frequency improvement of 12.8%.

Acknowledgment

This research paper is made possible through the help and support of the Feder European Grant.

References

[1] C. Huang, Y. Yin, and C. Hsu, “SoC HW/SW verification and validation,” in Proceedings of the 16th Asia and South Pacific Design Automation Conference (ASP-DAC ’11), pp. 297–300, January 2011.
[2] M. Santarini, “ASIC prototyping: make versus buy,” EDN, vol. 50, no. 24, pp. 30–40, 2005.
[3] D. Amos, A. Lesea, and R. Richter, FPGA-Based Prototyping Methodology Manual, Synopsys, 2011.
[4] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” in Proceedings of the 14th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 21–30, February 2006.
[5] H. Krupnova, “Mapping multi-million gate SoCs on FPGAs: industrial methodology and experience,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE ’04), vol. 2, pp. 1236–1241, February 2004.
[6] S. Asaad, R. Bellofatto, B. Brezzo et al., “A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’12), pp. 153–162, February 2012.
[7] J. Babb, R. Tessier, M. Dahl, S. Z. Hanono, D. M. Hoki, and A. Agarwal, “Logic emulation with virtual wires,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 6, pp. 609–626, 1997.
[8] L. McMurchie and C. Ebeling, “PathFinder: a negotiation-based performance-driven router for FPGAs,” in Proceedings of the International Workshop on Field Programmable Gate Array, pp. 111–117, February 1995.
[9] A. Ejnioui and N. Ranganathan, “Multiterminal net routing for partial crossbar-based multi-FPGA systems,” IEEE Transactions on Very Large Scale Integration Systems, vol. 11, no. 1, pp. 71–78, 2003.
[10] X. Song, W. N. N. Hung, A. Mishchenko, M. Chrzanowska-Jeske, A. Kennings, and A. Coppola, “Board-level multiterminal net assignment for the partial cross-bar architecture,” IEEE Transactions on Very Large Scale Integration Systems, vol. 11, no. 3, pp. 511–513, 2003.
[11] W. Mak and D. F. Wong, “Board-level multiterminal net routing for FPGA-based logic emulation,” ACM Transactions on Design Automation of Electronic Systems, vol. 2, no. 2, pp. 151–157, 1997.
[12] W. Mak and D. F. Wong, “On optimal board-level routing for FPGA-based logic emulation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 3, pp. 282–289, 1997.
[13] J. Babb, R. Tessier, and A. Agarwal, “Virtual wires: overcoming pin limitations in FPGA-based logic emulators,” in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM ’93), pp. 142–151, April 1993.

[14] R. Tessier, J. Babb, M. Dahl et al., “The virtual wires emulation system: a gate-efficient ASIC prototyping environment,” in Proceedings of the International Workshop on Field-Programmable Gate Array, ACM, Berkeley, Calif, USA, February 1994.
[15] M. Inagi, Y. Takashima, Y. Nakamura, and A. Takahashi, “Optimal time-multiplexing in inter-FPGA connections for accelerating multi-FPGA prototyping systems,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences A, vol. E91, no. 12, pp. 3539–3547, 2008.
[16] M. Inagi, Y. Takashima, and Y. Nakamura, “Globally optimal time-multiplexing in inter-FPGA connections for accelerating multi-FPGA systems,” in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL ’09), pp. 212–217, September 2009.
[17] [Online], http://www.dinigroup.com/new/products.php.
[18] Synopsys FPGA Synthesis User Guide, 2011.
[19] [Online], http://www.flexras.com/.
[20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, London, UK, 2001.
[21] M. Turki, Z. Marrakchi, H. Mehrez, and M. Abid, “Iterative routing algorithm of Inter-FPGA signals for Multi-FPGA prototyping platform,” in Proceedings of the 9th International Conference on Reconfigurable Computing (ARC ’13), Los Angeles, Calif, USA, March 2013.
[22] M. Turki, Z. Marrakchi, H. Mehrez, and M. Abid, “Towards synthetic benchmarks generator for CAD tool evaluation,” in Proceedings of the 8th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME ’12), 2012.
[23] [Online], http://www.synopsys.com/Systems/FPGABasedPrototyping/Pages/Certify.aspx.
[24] Certify Partition Driven Synthesis, User Guide, March 2011, p. 153.
