Exploring and Optimizing Partitioning of Large Designs For
All content following this page was uploaded by Umer Farooq on 11 November 2020.
1 Introduction
Modern System on Chip (SoC) designs have huge computation capability and
are enormously complex to design. Moreover, shrinking product life cycles and
time-to-market pressures increase the need for an efficient, fault-free design
process [1] [2], because a faulty or inefficient design can cost a fortune [3] [4].
U. Farooq
Electrical and Computer Engineering Department
Dhofar University, Salalah, Oman
E-mail: [email protected]
B. Alzahrani
Faculty of Computing & Information Technology
King Abdulaziz University, Jeddah, Saudi Arabia
E-mail: [email protected]
2 Umer Farooq, Bander A Alzahrani
In this regard, FPGA-based prototyping offers a good option for complete
design-to-silicon system verification. FPGA-based prototyping is a pre-silicon verification
technique that offers better speed than simulation-based verification [5].
Simulation-based solutions are cost-effective, but they are very slow and offer only
an abstract-level view of the system. Although emulation-based pre-silicon verification
gives good speed, the unique feature of FPGA-based prototyping is that it gives the user
a real-world testing and troubleshooting experience.
Prototyping of a less complex Application Specific Integrated Circuit (ASIC) can be
performed on a single FPGA, as modern FPGAs are quite capable and have huge
logic capacity. However, as the complexity of the system under consideration grows,
the capability of even the most modern FPGAs becomes insufficient to handle the
resource and I/O requirements of the ASIC. For such scenarios, multi-FPGA platforms
are required because the gap between FPGA capability and ASIC requirements is
huge [6] and, with every new processing technology, it is becoming increasingly
difficult to bridge this gap. Normally, the number of FPGAs required to prototype a
design depends upon the complexity of the design under consideration, and this number
may vary from a few FPGAs to a couple of dozen FPGAs [7] [8]. The prototyping
of complex ASIC designs using multi-FPGA platforms usually follows a complex
back end flow that involves several optimization steps. The core objective of this back
end flow is to optimize the frequency and the execution speed of the design under
consideration. The back end flow starts with the RTL description of the design. The
design is first synthesized and then partitioned using a partitioning algorithm. After
partitioning, the inter-FPGA routing of the design is performed. Finally, the flow culminates in
the intra-FPGA placement and routing of the design.
Partitioning is one of the most critical steps of the multi-FPGA prototyping flow. In
this step, based on the number of FPGAs on the multi-FPGA board, the design under
consideration is divided into multiple parts. Because of the several optimization
constraints, finding an optimal partitioning solution is an NP-hard problem [9]. When
we consider the partitioning problem from the multi-FPGA prototyping perspective, several
constraints are associated with the partitioning of a complex design. Two of the
principal objectives of a partitioning tool are to respect the logic capacity of the target
FPGA architecture and to keep the communication between different partitions
as small as possible. Thanks to improved design processes and better processing
technology, both the logic capacity and the number of I/Os of modern generations of
FPGAs have increased. However, the logic capacity of FPGAs has
increased at a much higher rate than the number of I/Os. This
trend has led to an increased logic-to-I/O ratio in newer generations of FPGAs, and it
has become particularly difficult for a partitioning tool to minimize the inter-partition
communication. Thus, the number of signals traversing different partitions (also termed
cut-nets) is larger than the number of available I/Os between different FPGAs of a
multi-FPGA board. These signals are routed between different FPGAs in the next
step of the prototyping flow, which is called inter-FPGA routing.
Inter-FPGA routing follows the partitioning of the design under consideration and also
plays an important role in its overall optimization. In
this step, the cut-nets of the partitioned design are routed on the tracks of the multi-FPGA
board in a time division multiplexed (TDM) manner. So, a higher number of cut-nets will
Title Suppressed Due to Excessive Length 3
lead to a higher multiplexing ratio, which in turn will reduce the execution
speed of the final prototyped design. The results produced by the routing tool are
directly linked with the quality of the preceding partitioning process. Even a highly
efficient routing tool cannot overturn the poor results of a partitioning tool. An in-depth
discussion on the quality of the partitioning tool and its impact on the frequency of
the final prototyped design is presented in the subsequent sections of the paper.
It is evident from the discussion presented above that partitioning plays a very important
role in the multi-FPGA prototyping flow. In this work, we propose and explore two
partitioning approaches, namely the hierarchical and the multilevel partitioning approach. For
this purpose, we use an open source flow. This flow gives complete experience for the
prototyping of multi-FPGA systems. However, our focus in this work remains on the
partitioning aspect of the flow. The flow proposed in this work starts with the generation
of large and complex benchmarks. These benchmarks are generated using a generic
academic tool that can generate both flat and hierarchical benchmarks. The generated
benchmarks are next logically synthesized using an open source tool by VERIFIC [10].
This tool not only performs standard cell synthesis, it also gives complete information
about the interconnect of the design under consideration. After synthesis, we perform
partitioning of the design. In order to produce the best partitioning results, we strive to
exploit the inherent interconnect patterns of the design under consideration. For this
purpose, we explore two different partitioning approaches. One approach exploits the
hierarchical interconnect which is inherent in certain designs. We call this
the hierarchical partitioning approach. The second partitioning approach that
we use in this work performs partitioning using multilevel clustering and refinement.
We call this the multilevel partitioning approach. Both proposed
approaches are novel in the sense that they have been specifically customized in
the context of prototyping for multi-FPGA systems. An in-depth discussion on both
approaches is given in Section 4 of the paper. After partitioning, the inter-FPGA routing
of the design is performed. After routing, system frequency results are obtained for the
two partitioning approaches and a thorough analysis of those results is also presented.
Here, only a brief overview of different steps involved in the proposed back end flow
is given. Detailed discussion on these steps is given in the subsequent sections of the
paper. The main contributions of this paper are summarized as follows:
– Development of an open source, generic, back end flow for prototyping of multi-
FPGA systems. All the steps of the proposed flow either use open source tools or
the tools that are free for academia.
– Development and implementation of two partitioning approaches for exploration
and optimization of different designs in multi-FPGA prototyping.
– Extensive experimentation and thorough analysis of results obtained through the
proposed back end flow.
In the rest of the paper, Section 2 discusses the background and related work and also
elaborates the contribution of this work. Section 3 then gives a detailed discussion on
the proposed flow where comprehensive details of all the steps of back end flow are
presented. Section 4 presents in-depth discussion on the two partitioning approaches
that we propose and explore in this work. Section 5 gives a detailed analysis of the
results obtained through experimentation and Section 6 concludes this paper with
discussion on the future work.
The discussion presented in Section 1 shows that partitioning plays a very important
role in determining the system frequency of the final prototyped design. Partitioning is a
well formulated research problem and researchers have been active in this area since
the 1970s. Many techniques have been proposed in the past to find efficient solutions to
partitioning problems. Mainly, three different types of techniques are
used to solve a partitioning problem.
1. Analytical partitioning [11] [12] is commonly used where the objective
function is to optimize the quadratic length of the critical path. Although
minimizing the quadratic length of the critical path is only an indirect measure of
the quality of the partitioning solution, its main advantage is that the objective function can be
optimized in very little time. This kind of approach is particularly suitable for very
large problems. A quadratic function, however, does not give the best possible
solution and it is often followed by several local tweaks.
2. Simulated annealing based placement [13] [14] is another technique. It borrows the
concept of annealing, in which molten metal is cooled down gradually, to produce
high quality solutions. The objective function of this approach is to minimize the
overall Manhattan distance between all the connected instances. This approach
is quite effective in finding a reasonably good solution in a small amount of time.
This type of technique is commonly used for island style architectures. However,
simulated annealing is classified more as a placement technique than
a partitioning technique.
3. Min-Cut based partitioning approach [15] [16] is generally suitable for partitioning
of complex designs. The min-cut partitioner recursively partitions the design under
consideration. The aim of the partitioner is to minimize the cut-nets of the design
by merging the connected instances in a single cluster. Because of the ability to
find a good solution in small time, in this work, we mainly consider min-cut based
partitioning algorithms. Further discussion on different min-cut based partitioning
algorithms is given next.
In the min-cut based partitioning approach, the design is represented as a hypergraph and
the connections between different instances of the design are represented as hyperedges.
The main objective of the partitioner is to minimize the number of cut hyperedges
(connections that traverse more than one partition) in the graph. In this regard, the authors
of [17] present the Kernighan-Lin bi-partitioning algorithm. The authors of [18] present the FM
partitioning algorithm that uses a recursive bi-partitioning approach to solve
a partitioning problem. Similarly, the authors of [19] present another bi-partitioning
algorithm that promises to give optimal results for small graphs. However, this
algorithm gives either sub-optimal results or no results for large to very large hypergraphs.
The aforementioned three algorithms are the main partitioning algorithms used for
digital systems and the research work done later is mainly an extension of one of
partitioning solutions indicates that either these solutions are platform dependent
or they offer only partial solution. Moreover, all of them are proprietary tools with
thousands of dollars in annual subscription fees.
On the other hand, if we look at state-of-the-art academic solutions to partitioning
from the multi-FPGA prototyping perspective, little work is available. The authors
of [30] [31] propose a new multilevel hierarchical FPGA architecture and they propose
to use a multilevel partitioning tool for the partitioning of the design. However, their
proposed solution can handle only homogeneous blocks and gives a partitioning solution for
a single FPGA only. Similarly, the authors of [32] explore the partitioning problem for
multi-FPGA systems. They only compare solutions obtained through the
commercial WASGA and CERTIFY partitioning tools and do not give any
academic solution. Also, the authors of [33] [34] explore prototyping of multi-FPGA systems.
However, for partitioning, they use a commercial tool called CERTIFY [26] by
Synopsys. The main focus of their flow remains the inter-FPGA routing issue of the back end
flow. Furthermore, the authors of [35] [36] also address the back end flow for multi-FPGA
systems, but their focus remains mainly on inter-FPGA routing as well.
In this work, we not only address the routing issue but also focus on the partitioning
problem, because even a highly efficient routing tool cannot improve the frequency of the
final prototyped design if it is preceded by an inefficient partitioning process. In order
to make the partitioning process efficient, we put particular emphasis on the knowledge
of the interconnect of the design under consideration. We extract the information on the
interconnect of the design through an open source tool called VERIFIC [10]. This matters
because different types of designs exhibit different interconnect patterns.
Some of them are hierarchical in nature while others have a rather flat interconnect. So,
partitioning all the designs with a single approach is not justified and it may eventually
lead to poor frequency results. For this reason, in our back end flow, we propose and
explore two different partitioning approaches. The first approach is called the
hierarchical partitioning approach and it uses a hierarchical partitioning algorithm.
This approach is more useful for designs exhibiting hierarchical interconnect. The second
approach is based on a multilevel partitioning algorithm and it is more suitable for rather
flat designs. Details about the two proposed approaches are given in Section 4. The
two partitioning approaches coupled with an efficient inter-FPGA routing tool give
the best frequency results for the partitioned design.
To the best of our knowledge, there is not enough academic work in the state of the art for
multi-FPGA prototyping systems from the partitioning perspective. As discussed before,
the work that exists either uses commercial tools or compares the
partitioning results of commercial tools. The unique contribution of this work is that
we extract the information on the interconnect of the design through VERIFIC tool
which is free for academia. Next, we apply one of two partitioning approaches that
best exploits the interconnect in terms of minimizing the cut-nets of the partitioned
design. Both the proposed partitioning approaches used in this work are either based
on academic tools or the customized versions of those tools. So, through this work,
we strive to provide a platform for academia in multi-FPGA prototyping and advance
the research in the important domain of pre-silicon verification through multi-FPGA
prototyping.
[Figure 1: prototyping flow. Blocks, in order: Benchmark; Logic Synthesis; Hierarchical Partitioner / Multilevel Partitioner; Routing]
3 Prototyping Flow
In this paper, we propose a prototyping flow for multi-FPGA based systems. In this
flow, we explore two different partitioning approaches and analyze their effect on the
system frequency of final prototyped design. An overview of the complete flow is
shown in Figure 1. It can be seen from this figure that the flow starts with the logic
synthesis of the benchmark under consideration. After passing through various steps,
the flow terminates at the bitstream generation of the design. Further discussion on the
steps of the flow is given next.
For any exploration flow, benchmarks are a fundamental requirement. For a multi-FPGA
prototyping flow, this requirement is even more pertinent, as complex benchmarks
mimicking real life applications are essential to test the capability of
the tools of a prototyping flow. Researchers in the past [37] [38] [39] have used
different sets of benchmarks for different types of exploration environments. But these
benchmarks are either too small to pose a real challenge to the exploration tools or
they are synthetic in nature and lack resemblance to real life applications. In this
work, we use benchmarks that are generated by DSX [40] academic tool. Using this
tool, we can generate mono-core and multi-core MPSoC architectures. A mono-core
MPSoC architecture contains components like UART, RAM, multiple FIFOs, and co-
processors. These components are further connected with each other through a cross
3.2 Synthesis
It can be seen from Figure 1 that the benchmarks generated through the DSX tool
are first logically synthesized. During synthesis, the design is logically optimized.
For logic synthesis, in this work, we use open source tool by VERIFIC [10] which
is free for non-commercial academic purposes. When the benchmark is given to this
tool, it parses the whole design through a very powerful parser. The parser of this
tool builds a comprehensive database of all the components of the design and gives
complete information about the interconnect of different components of the design.
This information is very useful as it is used by the hierarchical partitioner later in the
flow. The tool also performs transformation of the design into the standard logic gate
format. We use this tool to keep our flow open source and generic in nature.
3.3 Partitioning
After synthesis, the partitioning of the design under consideration is performed. Since
the designs are quite large and complex, a single FPGA cannot satisfy their logic and
I/O resource requirements; thus, they have to be divided into multiple partitions.
As discussed in Section 1, partitioning plays a very important role in the final
execution speed of the design under consideration. Normally, the number of physical
connections between different partitions is quite small while the number of cut-nets
that span these partitions is quite large. So, in the subsequent process, these cut-nets have
to share the physical resources between different FPGAs in a time multiplexed manner.
Eventually, a larger number of cut-nets leads to a larger multiplexer, hence increasing the
delay and reducing the overall speed. Thus, the main goal of any partitioner is to keep
the number of cut-nets as small as possible. Another constraint that a partitioner has
to deal with is the logic capacity of the target FPGA architecture. A partitioner must
satisfy this constraint while performing partitioning. These two combined constraints
make partitioning an NP-hard problem [9], and for large and complex designs it is not
practical to find an optimal solution.
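The two constraints described above can be made concrete with a small sketch. It assumes a design given as a dictionary of nets (each net lists the instances it connects), a per-instance size, and a single capacity value shared by all partitions. The function names and the toy data are hypothetical and are not taken from the tools used in this work.

```python
from collections import Counter


def count_cut_nets(nets, assignment):
    """Number of nets whose instances lie in more than one partition."""
    return sum(1 for pins in nets.values()
               if len({assignment[p] for p in pins}) > 1)


def respects_capacity(sizes, assignment, capacity):
    """True if the total instance size in every partition fits the capacity."""
    load = Counter()
    for inst, part in assignment.items():
        load[part] += sizes[inst]
    return all(v <= capacity for v in load.values())


# Hypothetical toy design: 5 instances, 3 nets, 2 partitions.
nets = {"n1": ["a", "b"], "n2": ["b", "c", "d"], "n3": ["d", "e"]}
sizes = {"a": 2, "b": 2, "c": 1, "d": 3, "e": 2}
assignment = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

cut = count_cut_nets(nets, assignment)                   # only n2 is cut
fits = respects_capacity(sizes, assignment, capacity=6)  # loads are 4 and 6
```

A partitioner searches for an assignment that keeps `cut` small while `fits` stays true; tightening `capacity` to 5 in this toy case makes the same assignment infeasible.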
Figure 4 summarizes the partitioning problem. Figure 4a shows two partitions where
the number of cut-nets is 2. Figure 4b shows the partitioning solution where the
number of cut-nets is reduced from 2 to 1. But in order to do that, we have to move a
large block of combinational logic from partition 2 to partition 1, and the new combinational logic
may not fit in the logic capacity of partition 1. So, a partitioner always has to find
a trade-off between the logic capacity and the cut-net constraint.
[Fig. 4: (a) Partitioning Solution with 2 cut-nets; (b) Partitioning Solution with 1 cut-net]
To find an efficient
partitioning solution, the partitioner should know and exploit the interconnect of the
design under consideration. For this purpose, in this work, we explore two different
partitioning approaches. The details of these approaches are given in Section 4 of this
paper.
3.4 Routing
Once the partitioning is completed, the routing of the design under consideration is
performed on the multi-FPGA board. The aim of the partitioning approaches discussed in
Section 3.3 is to minimize the number of cut-nets. However, as discussed in Section 1,
the number of cut-nets is always greater than the available I/O resources of the FPGAs.
This is because of the higher logic capacity and fewer I/Os of newer generations of FPGAs.
Therefore, we have to route the cut-nets in a time division multiplexed manner. A
Once the routing is complete, the netlists are generated as shown in Figure 1. These
netlists contain all the information related to the partitioned design and its routing.
The netlists are next passed to the vendor specific tool to perform
intra-FPGA synthesis, placement, and routing of all the partitions. After successful
completion of this step, the bitstreams of the partitions are generated, which can finally
be loaded into the respective FPGAs to complete the prototyping flow. Loading
the bitstreams allows in-circuit verification and debugging
of the partitioned design. Moreover, it also gives real-world, cycle-accurate and
bit-accurate execution information of the partitioned design.
A comprehensive overview of all the steps of prototyping flow is given in this section.
In the next section, a further detailed discussion is provided on the two partitioning
approaches that are proposed and explored in this work.
[Figure: flowchart of the hierarchical partitioning approach. Nodes: Start; Choose N instances; Partitions where instance can go?; Choose where instance is most connected; Instance breakable?; Get lower level, remake instance list; Partitioning impossible; End]
It is discussed in Section 3.2 that we use VERIFIC to perform logic synthesis of the
design under consideration. While performing logic synthesis, VERIFIC parses the
whole design and it gives complete information about the interconnect of the design.
In hierarchical partitioning approach, we extract information about the hierarchy of the
design from VERIFIC parser tool. At the next step, based on the required number of
Get hierarchy;
Get partitions;
Get capacity;
while unassigned instances do
    instances = N;
    find(max_connection);
    if capacity > N then
        assign_instances(N, M);
        assigned = N;
    end
    else if instance breakable then
        level = level - 1;
    end
    else
        Partitioning impossible;
    end
end
Algorithm 1: Pseudo-code for the Hierarchical Algorithm
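The loop of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the hierarchy is a tree of a hypothetical `Instance` class, nets connect leaf names, every partition has the same `capacity`, and the largest pending instance is handled first. All names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    size: int = 0                      # size of a leaf; unused for non-leaves
    children: list = field(default_factory=list)

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

    def total_size(self):
        return sum(leaf.size for leaf in self.leaves())


def hierarchical_partition(top, nets, num_parts, capacity):
    """Assign leaf names to partitions, descending the hierarchy on demand."""
    free = [capacity] * num_parts
    owner = {}                          # leaf name -> partition index
    work = list(top.children) or [top]
    while work:                         # while unassigned instances
        work.sort(key=lambda i: i.total_size())
        inst = work.pop()               # largest remaining instance
        names = {leaf.name for leaf in inst.leaves()}
        fits = [p for p in range(num_parts) if free[p] >= inst.total_size()]
        if fits:
            def connections(p):
                # nets touching this instance and something already in p
                return sum(1 for pins in nets
                           if names & set(pins)
                           and any(owner.get(x) == p for x in pins))
            best = max(fits, key=lambda p: (connections(p), free[p]))
            for n in names:
                owner[n] = best
            free[best] -= inst.total_size()
        elif inst.children:
            work.extend(inst.children)  # get lower level, remake instance list
        else:
            raise ValueError(f"partitioning impossible: {inst.name}")
    return owner


# Hypothetical two-cluster design.
top = Instance("top", children=[
    Instance("c0", children=[Instance("a", 2), Instance("b", 2)]),
    Instance("c1", children=[Instance("c", 2), Instance("d", 3)]),
])
nets = [["a", "b"], ["c", "d"], ["b", "c"]]
owner = hierarchical_partition(top, nets, num_parts=2, capacity=5)
```

With a capacity of 5, each cluster is kept whole and only the net between `b` and `c` is cut. With a capacity of 3, the clusters are broken into leaves and the sketch eventually reports that partitioning is impossible, since the total size (9) exceeds the total capacity (6).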
The aforementioned steps are performed iteratively where connectivity among the
instances is given top priority and the partition size is always respected. As described
in Section 5, the above approach is more suited for designs which have an inherent
[Figure 7: multilevel clustering example. A hypergraph with instances C1 to C7 and nets N1 to N5 is reduced over successive moves into the clusters C1C2C3C4 and C5C6C7; nets N5, N3 and N4 are absorbed along the way, leaving N1 and N2]
Contrary to the hierarchical approach that exploits the hierarchy of the design, the
multilevel approach uses clustering and refinement over multiple levels. In
this approach, the instances of the benchmark are first represented in the form of a
hypergraph. Initially, the graph is quite complex as it contains a lot of instances and
it is difficult to partition. Therefore, the graph is next reduced by merging smaller
instances together. This process is called clustering and it is repeated over multiple
levels until the number of clusters is reduced to a few dozen, at which point
the graph is considerably smaller and the refinement becomes
easy. An example of this multilevel clustering process is given in Figure 7, where
a large hypergraph is reduced to a smaller hypergraph after multiple iterations of
clustering.
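A single clustering level can be illustrated with the sketch below. The pairing rule (merge each vertex with the unmatched neighbour that shares the most nets) is an assumption chosen for the example, since the text does not specify the clustering criterion, and the function name is hypothetical. Nets that end up entirely inside one cluster are dropped, which corresponds to the "absorbed" nets of Figure 7.

```python
def cluster_once(vertices, nets):
    """Pair each vertex with the unmatched neighbour sharing the most nets."""
    cluster_of = {}                     # vertex -> representative vertex
    for v in vertices:
        if v in cluster_of:
            continue
        shared = {}                     # unmatched neighbour -> shared nets
        for pins in nets.values():
            if v in pins:
                for u in pins:
                    if u != v and u not in cluster_of:
                        shared[u] = shared.get(u, 0) + 1
        cluster_of[v] = v
        if shared:
            mate = max(sorted(shared), key=shared.get)
            cluster_of[mate] = v
    coarse_nets = {}
    for name, pins in nets.items():
        merged = sorted({cluster_of[p] for p in pins})
        if len(merged) > 1:             # nets inside one cluster are absorbed
            coarse_nets[name] = merged
    return cluster_of, coarse_nets


# Hypothetical 4-vertex hypergraph.
vertices = ["c1", "c2", "c3", "c4"]
nets = {"n1": ["c1", "c2"], "n2": ["c2", "c3"],
        "n3": ["c3", "c4"], "n4": ["c1", "c2", "c3"]}
cluster_of, coarse_nets = cluster_once(vertices, nets)
```

Here `c1` and `c2` form one cluster and `c3` and `c4` another, nets `n1` and `n3` are absorbed, and `n2` and `n4` survive as connections between the two clusters. Calling the function again on the coarse graph gives the next level.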
Once the clustering process is complete, the refinement of the graph is done and the
graph is expanded in a reverse manner. During the refinement process, the instances
are moved between different clusters. The objective of the refinement process is to
minimize the overall cut-net count of the design. Each time a block (i.e. instance) is
moved from one cluster to another, the change in the total cut-net count is computed.
If the change is negative (which means the total number of cut-nets is reduced), the move is
accepted and it is rejected otherwise. This is a greedy approach which may get
trapped in a local minimum. To avoid such a situation, moves that increase the cut-net count are
also accepted depending upon the level of refinement. At higher levels, such moves
are accepted. However, these moves are not accepted when the refinement is being
performed at lower levels. The refinement process continues until the bottom level of
the graph is reached. Upon reaching this point, the partitioning process is complete
and we have the final partitioned result. An overview of the refinement process is
shown in Figure 8 where only 2-way refinement is shown. However, the proposed
multilevel partitioning tool is able to perform N-way partitioning as it is generic in nature.
[Fig. 8: An Overview of Multilevel Refinement (coarsening phase, followed by uncoarsening and refinement phase)]
The multilevel partitioning tool uses the same approach as presented in [43], where
clustering is performed first and is then followed by initial partitioning and refinement
phases. However, the work presented in [43] performs partitioning of homogeneous
instances only. In contrast, the proposed tool can handle heterogeneous instances
and also takes into account the maximum partition size.
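The acceptance rule of the refinement step, together with the maximum partition size, might look like the sketch below. It is an illustration only: the `max_increase` parameter is a hypothetical stand-in for the level-dependent tolerance described above, whose exact schedule is not given in the text, and the cut is recomputed from scratch for clarity rather than updated incrementally.

```python
def count_cut_nets(nets, part):
    """Number of nets spanning more than one partition."""
    return sum(1 for pins in nets if len({part[p] for p in pins}) > 1)


def try_move(nets, part, sizes, capacity, inst, target, max_increase=None):
    """Move inst to target if it fits and the change in cut-nets is acceptable.

    By default only moves that reduce the cut are kept. Passing max_increase
    (assumed to be set only at the higher levels) also keeps moves that raise
    the cut by at most that amount.
    """
    source = part[inst]
    load = sum(sizes[i] for i, p in part.items() if p == target)
    if source == target or load + sizes[inst] > capacity:
        return False
    before = count_cut_nets(nets, part)
    part[inst] = target
    delta = count_cut_nets(nets, part) - before
    if delta < 0 or (max_increase is not None and delta <= max_increase):
        return True
    part[inst] = source                 # undo a rejected move
    return False


# Hypothetical toy instance list with heterogeneous sizes left at 1 for brevity.
nets = [["a", "b"], ["a", "c"], ["c", "d"]]
sizes = {"a": 1, "b": 1, "c": 1, "d": 1}
part = {"a": 0, "b": 1, "c": 1, "d": 1}
accepted = try_move(nets, part, sizes, 4, "a", 1)   # cut drops from 2 to 0
```

After this move, sending `d` to partition 0 would raise the cut by one, so it is rejected by default and accepted only when `max_increase=1` is passed.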
The multilevel partitioning is a highly sophisticated technique and for flat designs, it
offers better results when compared to hierarchical approach. However, it requires
significantly more time to produce the partitioning result. Furthermore, the hierarchical
approach gives equal or better results for designs which are purely hierarchical in
nature. The pseudo code of the multilevel algorithm used in this work is shown in
Algorithm 2.
In this section, we present the experimental results that are obtained through the
exploration flow described in Section 3. Initially, an overview of the benchmarks
used in this work is presented and next the results obtained for those benchmarks are
discussed.
5.1 Benchmarks
level = 0;
hierarchy[level] = hypergraph;
min_vertices = 200;
while hierarchy[level].vertex_count() > min_vertices do
    next_level = cluster(hierarchy[level]);
    level = level + 1;
    hierarchy[level] = next_level;
end
partitioning[level] = a random initial solution for top-level hypergraph;
FM(hierarchy[level], partitioning[level]);
while level > 0 do
    level = level - 1;
    partitioning[level] = project(partitioning[level+1], hierarchy[level]);
    FM(hierarchy[level], partitioning[level]);
end
Algorithm 2: Pseudo-code for the Multilevel Partitioning Algorithm
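The structure of Algorithm 2 (coarsen, solve the coarsest graph, then project and refine level by level) can be sketched in Python for the 2-way case. Several parts are simplifying assumptions rather than the method of this work: `cluster` pairs each vertex with its first unmatched neighbour, the initial solution is a shuffled and roughly balanced split instead of a purely random one, and `refine` is a plain greedy pass standing in for FM, so it can get stuck in a local minimum. All function names are hypothetical.

```python
import random
from itertools import combinations


def cut_size(nets, part):
    return sum(1 for pins in nets if len({part[v] for v in pins}) > 1)


def cluster(vertices, nets, weight):
    """One coarsening level: pair each vertex with one unmatched neighbour."""
    parent = {}
    for v in vertices:
        if v in parent:
            continue
        parent[v] = v
        mates = [u for pins in nets if v in pins for u in pins if u not in parent]
        if mates:
            parent[mates[0]] = v
    coarse = sorted(set(parent.values()))
    coarse_weight = {c: 0 for c in coarse}
    for v in vertices:
        coarse_weight[parent[v]] += weight[v]
    merged = (sorted({parent[p] for p in pins}) for pins in nets)
    coarse_nets = [m for m in merged if len(m) > 1]
    return coarse, coarse_nets, coarse_weight, parent


def initial_solution(vertices, weight, rng):
    """Shuffled split that always fills the lighter side first."""
    order = list(vertices)
    rng.shuffle(order)
    part, load = {}, [0, 0]
    for v in order:
        side = 0 if load[0] <= load[1] else 1
        part[v] = side
        load[side] += weight[v]
    return part


def refine(vertices, nets, weight, part, capacity):
    """Greedy stand-in for FM: keep single moves that lower the cut."""
    improved = True
    while improved:
        improved = False
        for v in vertices:
            target = 1 - part[v]
            load = sum(weight[u] for u in vertices if part[u] == target)
            if load + weight[v] > capacity:
                continue
            before = cut_size(nets, part)
            part[v] = target
            if cut_size(nets, part) < before:
                improved = True
            else:
                part[v] = 1 - target    # undo


def multilevel_bipartition(vertices, nets, capacity, min_vertices=4, seed=0):
    rng = random.Random(seed)
    levels = [(list(vertices), nets, {v: 1 for v in vertices}, None)]
    while len(levels[-1][0]) > min_vertices:
        vs, ns, w, _ = levels[-1]
        cvs, cns, cw, parent = cluster(vs, ns, w)
        if len(cvs) == len(vs):         # nothing left to merge
            break
        levels.append((cvs, cns, cw, parent))
    vs, ns, w, _ = levels[-1]
    part = initial_solution(vs, w, rng)
    refine(vs, ns, w, part, capacity)
    for i in range(len(levels) - 1, 0, -1):
        parent = levels[i][3]
        vs, ns, w, _ = levels[i - 1]
        part = {v: part[parent[v]] for v in vs}   # project to the finer level
        refine(vs, ns, w, part, capacity)
    return part


# Hypothetical design: two 4-vertex cliques joined by a single net.
a = [f"a{i}" for i in range(4)]
b = [f"b{i}" for i in range(4)]
nets = ([list(p) for p in combinations(a, 2)]
        + [list(p) for p in combinations(b, 2)] + [["a3", "b0"]])
part = multilevel_bipartition(a + b, nets, capacity=5)
```

The sketch uses a small `min_vertices` so that coarsening is visible on a toy graph; Algorithm 2 uses 200. A real FM pass would also consider moves that temporarily worsen the cut, which this greedy replacement does not.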
generate both mono- and multi-cluster benchmarks which have varying degrees of
complexity. The mono-cluster benchmarks mainly exhibit a non-hierarchical interconnect
pattern. Multi-cluster benchmarks, on the other hand, are hierarchical in nature where
different clusters are connected to each other in a hierarchy. The connection patterns
of the two types of benchmarks used in this work are further verified through the
VERIFIC parsing tool. The core objective of incorporating two types of benchmarks with
different interconnect patterns is to test the capability of the two partitioning approaches
being used in this work. The details of these benchmarks are given in Table 1. It can
be seen from this table that we use four mono- and ten multi-cluster benchmarks. The
internal structure of each mono-cluster benchmark is already discussed in Section 3.
The number of coprocessors in each mono-cluster benchmark is indicated at the end
of each benchmark’s name, as can be seen from Table 1. As far as the multi-cluster
[Figure: bar charts comparing the Hierarchical and Multilevel approaches per benchmark (x-axis: Benchmark Name); y-axes: Biterminal cut-nets and Multiterminal cut-nets]
benchmarks are concerned, they are named like CPUXxYxZ where XxY indicates the
arrangement of clusters (X times Y clusters in total) and Z indicates the number of
processors in each cluster. For example,
the name CPU2x2x6 indicates that this benchmark has four clusters and inside each
cluster there are six processors. As shown in Table 1, in this work, we use a variety of
benchmarks that have varying requirements in terms of number of components.
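The naming convention can be decoded mechanically. The helper below is a hypothetical illustration that follows the CPU2x2x6 example from the text; the total processor count is simply the product of the number of clusters and the processors per cluster.

```python
import re


def parse_benchmark_name(name):
    """Decode a multi-cluster benchmark name of the form CPUXxYxZ."""
    match = re.fullmatch(r"CPU(\d+)x(\d+)x(\d+)", name)
    if match is None:
        raise ValueError(f"not a multi-cluster benchmark name: {name}")
    x, y, z = (int(g) for g in match.groups())
    return {
        "clusters": x * y,
        "processors_per_cluster": z,
        "total_processors": x * y * z,
    }


info = parse_benchmark_name("CPU2x2x6")
```

For CPU2x2x6 this gives four clusters of six processors each, that is 24 processors in total.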
[Fig. 11: Cut-Net Comparison Between Hierarchical and Multilevel Partitioning Approach (y-axis: Cut-Nets; x-axis: Benchmark Name)]
[Fig. 12: Multiplexing Ratio Comparison Between Hierarchical and Multilevel Partitioning Approach (y-axis: MUX Ratio; x-axis: Benchmark Name)]
Once the routing of the benchmark is completed, its system frequency is estimated
according to [44] using equation 1. It can be seen from this equation that partitioning
approach with smaller mux ratio values will result in better system frequency results.
$$\text{sys\_freq} = \frac{125}{\text{mux\_ratio}}\ \text{MHz} \qquad (1)$$
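Equation 1 is easy to evaluate numerically. In the sketch below, `system_frequency_mhz` is a direct transcription of the equation, while `mux_ratio_for` is an assumed simplification (cut-nets shared evenly over the available inter-FPGA wires, rounded up) that is not taken from the routing tool of this work. Both function names are hypothetical.

```python
import math

BASE_FREQ_MHZ = 125  # numerator of Equation 1


def system_frequency_mhz(mux_ratio):
    """Estimated system frequency in MHz for a given multiplexing ratio."""
    if mux_ratio < 1:
        raise ValueError("mux_ratio must be at least 1")
    return BASE_FREQ_MHZ / mux_ratio


def mux_ratio_for(cut_nets, wires):
    """Assumed model: cut-nets spread evenly over the wires, rounded up."""
    return math.ceil(cut_nets / wires)


freq_at_10 = system_frequency_mhz(10)
freq_at_25 = system_frequency_mhz(25)
ratio = mux_ratio_for(1000, 100)
```

A multiplexing ratio of 10 gives 12.5 MHz and a ratio of 25 gives 5 MHz, which shows why a partitioning with fewer cut-nets translates directly into a faster prototype.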
The system frequency results obtained using the two partitioning approaches are
shown in Figure 13. It can be seen from this figure that, for mono-cluster benchmarks,
the multilevel partitioning approach gives better system frequency results whereas for
[Fig. 13: System Frequency Comparison Between Hierarchical and Multilevel Partitioning Approach (y-axis: Frequency (MHz); x-axis: Benchmark Name)]
[Fig. 14: Execution Time Comparison Between Hierarchical and Multilevel Partitioning Approach (y-axis: Execution Time (Sec); x-axis: Benchmark Name)]
6 Conclusion
For multi-FPGA systems, partitioning plays a very important role in determining the
quality of the final prototyped design. This work explores two partitioning approaches
for multi-FPGA prototyping systems. One approach exploits the inherent hierarchy
of benchmarks while the second approach uses multilevel clustering and refinement
to partition the design under consideration. For exploration purposes, we
use a set of fourteen large, complex and realistic benchmarks. Experimental results
obtained through the exploration environment of this work demonstrate that the multilevel
partitioning approach gives overall better results for mono-cluster benchmarks. On
the other hand, the hierarchical partitioning approach gives better results for multi-cluster
benchmarks. On average, the multilevel approach gives 12.5% better frequency results
for mono-cluster benchmarks whereas the hierarchical approach gives 13% better
frequency results for multi-cluster benchmarks. Execution time comparison between the two
approaches further reveals that the hierarchical approach gives better results irrespective
of the nature of the benchmarks under consideration. The hierarchical partitioning approach
gives on average 60% better execution time results than the multilevel
partitioning approach.
In this work, our emphasis has mainly been on the exploration of partitioning approaches.
In the future, we will make the proposed multi-FPGA prototyping flow more
comprehensive by introducing novel in-circuit verification techniques. These techniques can
be used for the functional verification of a design after its prototyping is
finished.
References
1. M. Santarini, “Asic prototyping: Make versus buy,” EDN, vol. 11, 2005.
2. “Sigenics: Custom asic calculator,” http://www.sigenics.com/page/custom-asic-cost-calculator, 2017.
3. AMD, “http://techreport.com/news/13721/chip-problem-limits-supply-of-quad-core-opterons,” 2007.
4. Pentium, “https://en.wikipedia.org/wiki/pentium fdiv bug,” 1994.
5. M. Graphics, “https://www.mentor.com/products/fv/modelsim/,” 2017.
6. I. Kuon and J. Rose, Quantifying and Exploring the Gap Between FPGAs and ASICs. Springer
US, 2010.
7. H. Krupnova, “Mapping multi-million gate socs on fpgas: industrial methodology and experience,” in
Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, vol. 2, Feb
2004, pp. 1236–1241.
8. S. Asaad, R. Bellofatto, B. Brezzo, C. Haymes, M. Kapur, B. Parker, T. Roewer, P. Saha, T. Takken,
and J. Tierno, “A cycle-accurate, cycle-reproducible multi-fpga system for accelerating multi-core
processor simulation,” in Proceedings of the ACM/SIGDA International Symposium on Field
Programmable Gate Arrays, ser. FPGA ’12. New York, NY, USA: ACM, 2012, pp. 153–162.
[Online]. Available: http://doi.acm.org/10.1145/2145694.2145720
9. M. R. Garey and D. S. Johnson, Computers and Intractability; A Guide to the Theory of NP-
Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.
10. VERIFIC, “https://www.verific.com/,” 2019.
11. G. Sigl, K. Doll, and F. Johannes, “Analytical Placement: A Linear or a Quadratic Objective Function?”
Design Automation Conference, pp. 427–432, 1991.
12. C. J. Alpert, T. Chan, D. Huang, A. Kahng, I. Markov, P. Mulet, and K. Yan, “Faster Minimization of Linear
Wirelength for Global Placement,” ACM Symposium on Physical Design, pp. 4–11, 1997.
13. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by Simulated Annealing,” Science, vol. 220, pp.
671–680, 1983.
14. C. Sechen and A. Sangiovanni-Vincentelli, “The Timberwolf Placement and Routing Package,” JSSC,
pp. 510–522, April 1985.
15. A. Dunlop and B. Kernighan, “A Procedure for Placement of Standard-cell VLSI Circuits,” IEEE
Transactions on CAD, pp. 92–98, Jan 1985.
16. D. Huang and A. Kahng, “Partitioning-based Standard-cell Global Placement with an Exact Objective,”
ACM Symposium on Physical Design, pp. 18–25, 1997.
17. B. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs,” Bell System Tech.
Journal, vol. 49, pp. 291–307, 1970.
18. C. M. Fiduccia and R. M. Mattheyses, “A Linear-time Heuristic for Improving Network Partitions,”
Design Automation Conference, pp. 175–181, 1982.
19. T. Bui, S. Chaudhuri, T. Leighton, and M. Sipser, “Graph Bisection Algorithms with Good Average
Behavior,” Combinatorica, vol. 7, no. 2, pp. 171–191, Jun. 1987.
20. C. J. Alpert, L. W. Hagen, and A. B. Kahng, “Multilevel Circuit Partitioning,” Design Automation
Conference, pp. 530–533, 1997.
21. G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, “Multilevel Hypergraph Partitioning: Application in
VLSI Design,” Design Automation Conference, pp. 526–529, June 1997.
22. G. Karypis and V. Kumar, “Multilevel k-way Hypergraph Partitioning,” Design Automation Conference,
June 1999.
23. “Haps protocompiler by synopsys,” http://www.synopsys.com/Prototyping/FPGA BasedPrototyp-
ing/Pages /protocompiler.aspx, 2017.
24. “Haps multi-fpga board by synopsys,” http://www.synopsys.com/Prototyping/FPGA BasedPrototyp-
ing/Pages /HAPS.aspx, 2017.
25. Auspy, “https://www.mentor.com/products/fv/aupsy,” 2017.
26. “Certify partitioning tool by synopsys,” http://www.synopsys.com/Prototyping/FPGA BasedPrototyp-
ing/ Pages/Certify.aspx, 2017.
27. C. P. Series, “http://www.cadence.com/products/sd/palladium xp series/ pages / default.aspx,” 2017.
28. M. G. Veloce, “https://www.mentor.com/products/fv/emulation-systems/,” 2017.
29. “Zebu-server asic emulator by synopsys,” http://www.synopsys.com/tools/verification /hardware-
verification/emulation/Pages/default.aspx, 2017.
30. Z. Marrakchi, H. Mrabet, and H. Mehrez, “Hierarchical FPGA Clustering to Improve Routability,”
Conference on Ph.D Research in Microelectronics and Electronics, PRIME, 2005.
31. Z. Marrakchi, H. Mrabet, and H. Mehrez, “A new Multilevel Hierarchical MFPGA and its suitable
configuration tools,” Proc. ISVLSI, Karlsruhe, Germany, March 2006.
32. M. Turki, H. Mehrez, Z. Marrakchi, and M. Abid, “Partitioning constraints and signal routing approach
for multi-fpga prototyping platform,” in 2013 International Symposium on System on Chip (SoC), Oct
2013, pp. 1–4.
33. Q. Tang, H. Mehrez, and M. Tuna, “Routing algorithm for multi-fpga based systems using multi-point
physical tracks,” in Rapid System Prototyping (RSP), 2013 International Symposium on, Oct 2013, pp.
2–8.
34. U. Farooq, I. Baig, and B. A. Alzahrani, “An efficient inter-fpga routing exploration environment for
multi-fpga systems,” IEEE Access, vol. 6, pp. 56 301–56 310, 2018.
35. M. Inagi, Y. Takashima, and Y. Nakamura, “Globally optimal time-multiplexing in inter-fpga connec-
tions for accelerating multi-fpga systems,” in Field Programmable Logic and Applications, 2009. FPL
2009. International Conference on, Aug 2009, pp. 212–217.
36. S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based
Computation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007.
37. D. Stroobandt, P. Verplaetse, and J. Van Campenhout, “Generating synthetic benchmark circuits for
evaluating cad tools,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions
on, vol. 19, no. 9, pp. 1011–1022, Sep 2000.
38. U. Farooq, H. Parvez, H. Mehrez, and Z. Marrakchi, “A new heterogeneous tree-based
application specific fpga and its comparison with mesh-based application specific fpga,”
Microprocess. Microsyst., vol. 36, no. 8, pp. 588–605, Nov. 2012. [Online]. Available:
http://dx.doi.org/10.1016/j.micpro.2012.06.012
39. S. Yang, “Logic synthesis and optimization benchmarks user guide, version 3.0,” Jan 1991.
40. N. Pouillon and A. Greiner, “Soc lib project,” 2010. [Online]. Available:
https://www.asim.lip6.fr/trac/dsx/
41. I. Miro Panades, A. Greiner, and A. Sheibanyrad, “A low cost network-on-chip with guaranteed
service well suited to the gals approach,” in Nano-Networks and Workshops, 2006. NanoNet ’06. 1st
International Conference on, Sept 2006, pp. 1–5.
42. L. McMurchie and C. Ebeling, “Pathfinder: A negotiation-based performance-driven router for fpgas,”
in ACM International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: ACM
Press, 1995, pp. 111–117.
43. G. Karypis and V. Kumar, “Multilevel k-way hypergraph partitioning,” in Proceedings of the 36th
Annual ACM/IEEE Design Automation Conference, ser. DAC ’99. New York, NY, USA: ACM, 1999,
pp. 343–348. [Online]. Available: http://doi.acm.org/10.1145/309847.309954
44. Synopsys, “http://www.synopsys.com/prototyping/fpgabasedprototyping/,” 2017. [Online]. Available:
http://www.synopsys.com/Prototyping/FPGA BasedPrototyping/FPMM/ Pages/default.aspx