ICCAD24-TopoOrderPart A Multi-Level Scheduling-Driven Partitioning Framework For Processor-Based Emulation
ABSTRACT
In a compilation flow of processor-based emulation (PBE), partitioning involves dividing a large netlist into smaller pieces and assigning them to different processors. Furthermore, the scheduling process must adhere to the levels of the netlist, which are determined by topological ordering, and the logic gates in the same level can be emulated in parallel. However, during the netlist partitioning stage, assigning most gates at the same level to one processor would undermine the benefits of parallelization in scheduling, leading to overall performance degradation. This paper proposes the Topo-

[Figure: example netlist and its hypergraph, with scheduling and partitioning shown without TOB. Panel labels: Netlist, Hypergraph, Scheduling, Partitioning.]
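As an illustration of the level structure described in the abstract, the following Python sketch assigns each node of a small DAG its topological level (the longest distance from any source node) and counts how many nodes of each level land on each processor. The example graph, the partition vector, and the function names `topological_levels` and `level_histogram` are hypothetical and are not taken from the TopoOrderPart implementation.

```python
from collections import defaultdict, deque


def topological_levels(num_nodes, edges):
    """Return level[v] = length of the longest path from any source to v."""
    succ = defaultdict(list)
    indeg = [0] * num_nodes
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    level = [0] * num_nodes
    queue = deque(v for v in range(num_nodes) if indeg[v] == 0)
    visited = 0
    while queue:  # Kahn's algorithm, propagating the longest-path level
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != num_nodes:
        raise ValueError("graph contains a cycle")
    return level


def level_histogram(level, part):
    """For every level, count how many nodes each processor holds."""
    hist = defaultdict(lambda: defaultdict(int))
    for v, lv in enumerate(level):
        hist[lv][part[v]] += 1
    return {lv: dict(counts) for lv, counts in hist.items()}


if __name__ == "__main__":
    # Hypothetical 5-node DAG and a 2-processor assignment.
    edges = [(0, 2), (1, 2), (2, 3), (1, 3), (0, 4)]
    part = [0, 0, 0, 1, 1]
    levels = topological_levels(5, edges)
    print(levels)                         # [0, 0, 1, 2, 1]
    print(level_histogram(levels, part))  # per-level node counts per processor
```

In this toy case both level-0 nodes sit on processor 0, so that level cannot be evaluated in parallel, whereas level 1 is split evenly across the two processors.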
The second term $K(\sigma_\tau^2)$ denotes the TOB metric of a partition result and is defined in equation (3). The balance of a topological order 𝜏

The TOB gain is calculated using equation (3.12) before the node 𝑣 (gray dashed circle) moves from 𝑖 to 𝑗, and the gain is shown in blue boxes in Figure 5. After node 𝑣 moves to 𝑗, the number of nodes in $|P_i^\tau|$ reduces to 4, while the number of nodes in $|P_j^\tau|$ increases to 2. The changes are highlighted in red font in Figure 5. The numbers of nodes in $|P_m^\tau|$ and $|P^\tau|$ remain the same, at 2 and 8, respectively. The updated TOB gains are shown in yellow boxes in Figure 5.

[Figure 5 contents: TOB gain entries; rows marked ˆ are the updated values after the move, and "/" marks a node's own block.]

          Nodes in i              Nodes in j              Nodes in m
(u, i)    /                       2(1 − 5 + 1)/(3 × 8²)   2(2 − 5 + 1)/(3 × 8²)
ˆ(u, i)   /                       2(2 − 4 + 1)/(3 × 8²)   2(2 − 4 + 1)/(3 × 8²)
(u, j)    2(5 − 1 + 1)/(3 × 8²)   /                       2(2 − 1 + 1)/(3 × 8²)
ˆ(u, j)   2(4 − 2 + 1)/(3 × 8²)   /                       2(2 − 2 + 1)/(3 × 8²)
(u, m)    2(5 − 2 + 1)/(3 × 8²)   2(1 − 2 + 1)/(3 × 8²)   /
ˆ(u, m)   2(4 − 2 + 1)/(3 × 8²)   2(2 − 2 + 1)/(3 × 8²)   /

Figure 5: An example of how to update the TOB gain in local refinement.

4 EXPERIMENTAL SETUP AND RESULTS
In this work, we developed TopoOrderPart using C++ and compiled it with g++ version 11.3.0. Our experiments were carried out on a single core of a Linux workstation equipped with an Intel Xeon Gold 6248R CPU at 3.0 GHz, running Ubuntu 22.04. Given the absence of existing benchmarks or baselines that address topological order balancing in the DAH partitioning problem, we generated a set of benchmarks derived from a public benchmark suite and constructed a baseline for a fair comparison.

To assess the effectiveness of TopoOrderPart, we compared it to the state-of-the-art hypergraph partitioner hMETIS. Furthermore, the hMETIS+TopoRefine algorithm invokes hMETIS to obtain a partition and then applies the TopoRefine algorithms introduced in this work to improve the TOB. The hMETIS parameters are set to the defaults given in [19]. To account for the influence of the random seed in hMETIS, we take the best result of five runs. We conducted two separate experiments to validate our ideas.

4.1 Benchmarks and Baseline
In the experiments, we adopt the Titan23 benchmark suite [20], whose statistics are summarized in Table 2. “# Nodes” is the number of DAH nodes, and “# Nets” is the number of nets in the circuit netlist. Since the input to our algorithm is a DAH, we make the Titan23 benchmarks acyclic. After the cycles are removed, the acyclic hypergraph has fewer nets, while the node set is the same as in the original hypergraph. We make the generated benchmarks public at [21].

Given that there is no existing research on DAH partitioning that handles topological order balancing, we use TopoOrderPart– as our baseline. The only distinction between TopoOrderPart– and TopoOrderPart lies in the initial partitioning stage, where TopoOrderPart– employs the random root selection method and TopoOrderPart uses the super-far root selection method. To reduce the impact of randomness, we pick the best result of five runs when applying TopoOrderPart–.

4.2 Experimental Results
4.2.1 The Performance of Single level Partition. In Table 2, “Cut Size” denotes the sum of the hyperedges’ connectivity, and “TOB Metric” refers to the topological order balancing. The “Avg. Ratio” of “Cut Size”, “TOB Metric” and “Time” is the average relative performance of TopoOrderPart and hMETIS+TopoRefine compared
with TopoOrderPart– over all benchmarks. In addition, Figure 6 presents a more detailed per-benchmark comparison focusing on cut size and TOB. The y-axis represents the ratio of the best result divided by each algorithm's result. A ratio closer to 1 indicates superior performance, whereas a ratio closer to 0 denotes inferior results.

[Figure 6 plots: x-axis: # Benchmarks; series: hMETIS+TopoRefine, TopoOrderPart–, TopoOrderPart.]

Figure 6: The single level partitioning comparison of cut size and TOB between TopoOrderPart, TopoOrderPart– and hMETIS+TopoRefine. A quality ratio closer to 1 indicates better performance.

The results in Table 2 and Figure 6 show that TopoOrderPart achieves a 69% reduction on average in the TOB metric, with 0.53× the runtime and comparable cut size, when compared with hMETIS+TopoRefine. This comparison indicates a significant improvement in topological order balancing under our framework. Although both TopoOrderPart and hMETIS+TopoRefine use the same TopoRefine algorithm, the former significantly outperforms the latter in terms of the TOB metric. This suggests that considering the TOB during the coarsening and initial partitioning stages helps improve topological order balancing. A further comparison supports this conclusion: the baseline exhibits a 62% improvement in TOB over hMETIS+TopoRefine.

Compared with the baseline, despite increasing runtime by 13%, TopoOrderPart generates partitioning solutions that reduce the cut size and the TOB by an average of 17% and 7%, respectively. This result demonstrates that the super-far root selection method substantially improves solution quality while maintaining an acceptable run time.

[Figure 7 plots: panels “TOB metric Comparison” and “Runtime Comparison”; y-axis: Quality relative to best; x-axis: # Benchmarks; series: TopoOrderPart, hMETIS+TopoRefine.]

Figure 7: The two level partitioning comparison of TOB and runtime between TopoOrderPart and hMETIS+TopoRefine. A quality ratio closer to 1 indicates better performance.

4.2.2 The Performance of Two level Partition. In this section, we conduct a separate experiment focusing on the architecture-level partitioning in TopoOrderPart. We set 𝑘₁ = 8 and 𝑘₂ = 6. Since no existing work addresses architecture-level partitioning, we directly employ hMETIS+TopoRefine with 𝑘ₕ = 48. The results are shown in Table 3 and Figure 7. Compared to hMETIS+TopoRefine, TopoOrderPart reduces the average run time by 55% and reduces the TOB to 73% of the hMETIS+TopoRefine value. It obtains the smaller cut size on half of the benchmarks, although its average cut size ratio is 1.18. The results demonstrate our framework's superior performance in balancing topological order and optimizing the partitioning process. Moreover, TopoOrderPart achieves better TOB performance while employing only straightforward coarsening and refinement methods. We believe that integrating our proposed TOB optimization techniques into advanced partitioners such as KaHyPar, which adopt more sophisticated coarsening and refinement methods, would further enhance the performance.

Table 3: [...] 𝑘₁ = 8, 𝑘₂ = 6 and 𝜀 = 0.03

                hMETIS+TopoRefine                TopoOrderPart
Benchmark       Cut Size  TOB Metric  Time(s)    Cut Size  TOB Metric  Time(s)
dart               20129        0.48    16.36       18561        0.45     8.67
denoise            21446       76.10    71.35       31082       65.12    20.05
stereo_vision       8706        1.64     7.14       11927        1.12     2.74
sparcT1_core       20735        2.63     7.05       27647        1.13     4.32
stap_qrd           15335        5.03    14.19       25140        3.18     9.92
openCV             21786        1.26    17.48       19872        0.63     8.76
des90              12473        0.37     8.03       11333        0.36     7.14
directrf           66314        2.59   146.43       65239        1.86    39.85
cholesky_bdti      17459        4.25    20.58       15485        2.78    12.41
minres             16719       15.45    27.36       15686       12.95     8.71
sparcT2_core       52758        1.72    48.71       50094        1.19    17.44
LU_Network         44157        0.59   100.92       43196        0.37    24.52
cholesky_mc        12561        5.18     9.92       16104        3.70     3.76
LU230              35781        0.69    80.16       34367        0.60    25.01
SLAM_spheric       18104        5.90    10.88       27431        3.75     5.07
neuron             12687       12.91    14.87       11380        8.21     3.28
bitonic_mesh       15457        0.44    27.98       14030        0.42    11.91
bitcoin_miner      64212        4.14    77.01       85616        3.92    50.86
segmentation       15882      173.55    57.31       26037      167.62     9.17
gsm_switch         40020      296.69    80.08       56922      274.61    64.32
mes_noc            30425        3.59    73.67       45411        1.92    43.22
sparcT1_chip2      48459        3.89   106.97       62128        2.22    44.57
Avg. Ratio             1           1        1        1.18        0.73     0.45

5 CONCLUSION
In this paper, we propose TopoOrderPart, a partitioning framework designed to optimize topological order balancing and cut size. Unlike previous state-of-the-art partitioners, it considers topological order balancing in every stage of the partitioning process. Notably, it achieves the lowest TOB metric among the compared methods. The quality of partitioning is further improved by an efficient root node selection method in the initial partitioning stage. Furthermore, two high-performance refinement algorithms are introduced to iteratively reduce the TOB metric and the cut size. We integrate these algorithms into a framework that accommodates the PBE architecture level. In our empirical study, TopoOrderPart outperforms the proposed baseline in both topological order balancing and cut size.
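The “Avg. Ratio” row of Table 3 and the “quality relative to best” axis of Figures 6 and 7 can be illustrated with a small Python sketch. It assumes that the average ratio is the arithmetic mean of per-benchmark ratios (the text does not state which mean is used) and that lower values are better for cut size, TOB, and time. Only four rows of Table 3 are copied in, so the printed average is not expected to equal the table's 0.73. The function names are hypothetical.

```python
from statistics import mean

# (benchmark, TOB of hMETIS+TopoRefine, TOB of TopoOrderPart); four rows of Table 3.
TOB_ROWS = [
    ("dart", 0.48, 0.45),
    ("openCV", 1.26, 0.63),
    ("neuron", 12.91, 8.21),
    ("mes_noc", 3.59, 1.92),
]


def average_ratio(rows):
    """Arithmetic mean of (ours / reference) over benchmarks (assumed definition)."""
    return mean(ours / ref for _, ref, ours in rows)


def quality_relative_to_best(values):
    """Map each algorithm to best / value, assuming smaller values are better."""
    best = min(values.values())
    return {name: best / value for name, value in values.items()}


if __name__ == "__main__":
    print(f"average TOB ratio on the sample rows: {average_ratio(TOB_ROWS):.2f}")
    # Per-benchmark quality ratios for openCV, as plotted on the y-axis.
    print(quality_relative_to_best({"hMETIS+TopoRefine": 1.26, "TopoOrderPart": 0.63}))
```

An average ratio below 1 means the method is better on average than the reference, while a quality ratio of exactly 1 marks the best algorithm on that benchmark.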
REFERENCES
[1] Cadence. 2024. Cadence Palladium. [Link] ools/system-design-and-verification/emulation-and-prototyping/palladium.html.
[2] Mentor Graphics. 2024. Mentor Veloce. [Link] /veloce/strato-hardware.
[3] Amir Yazdanshenas and Mohammed A. S. Khalid. 2007. A new scheduling algorithm for processor-based logic emulation systems. In 2007 50th Midwest Symposium on Circuits and Systems, 1505–1508.
[4] Amir Ali Yazdanshenas. 2006. Hardware design and CAD for processor-based logic emulation systems. In Electronic Theses and Dissertations.
[5] Marwan Kanaan. 2007. A low-cost processor-based logic emulation system using FPGAs. In Electronic Theses and Dissertations.
[6] Ruiyao Pu, Yiwei Sun, Pei-Hsin Ho, Fan Yang, Li Shang, and Xuan Zeng. 2023. Sphinx: a hybrid Boolean processor-FPGA hardware emulation system. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1–9.
[7] Yong Fu. 2022. Processor based logic simulation acceleration and emulation system. (Mar. 2022). US Patent App. 17/448,216.
[8] Ngai Ngai William Hung and Amiya Ranjan Satapathy. 2024. Systems and methods for distributed and parallelized emulation processor configuration. (Jan. 2024). US Patent App. 11/868,786.
[9] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. 1999. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7, 1, 69–79.
[10] U.V. Catalyurek and C. Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10, 7, 673–693.
[11] Sebastian Schlag, Vitali Henne, Tobias Heuer, and Henning Meyerhenke. 2016. K-way hypergraph partitioning via n-level recursive bisection. In 2016 Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX), 53–67.
[12] Julien Herrmann, Jonathan Kho, Bora Uçar, and Kamer Kaya. 2017. Acyclic partitioning of large directed acyclic graphs. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 371–380.
[13] Julien Herrmann, M. Yusuf Özkaya, Bora Uçar, and Kamer Kaya. 2019. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM J. Sci. Comput., 41, 4, A2117–A2145.
[14] Merten Popp, Sebastian Schlag, Christian Schulz, and Daniel Seemaier. 2021. Multilevel acyclic hypergraph partitioning. In 2021 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), 1–15.
[15] Ismail Bustany, Andrew B. Kahng, Ioannis Koutis, and Bodhisatta Pramanik. 2022. SpecPart: a supervised spectral framework for hypergraph partitioning solution improvement. In 2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 1–9.
[16] Ismail Bustany, Andrew B. Kahng, Ioannis Koutis, and Bodhisatta Pramanik. 2024. K-SpecPart: supervised embedding algorithms and cut overlay for improved hypergraph partitioning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43, 4, 1232–1245.
[17] Ismail Bustany, Grigor Gasparyan, Andrew B. Kahng, and Ioannis Koutis. 2023. An open-source constraints-driven general partitioning multi-tool for VLSI physical design. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1–9.
[18] Sebastian Schlag, Yaroslav Akhremtsev, Tobias Heuer, and Peter Sanders. 2017. Engineering a direct k-way hypergraph partitioning algorithm. In ALENEX’17.
[19] G. Karypis and V. Kumar. 1998. hMETIS, a hypergraph partitioning package, version 1.5.3. [Link] df.
[20] Kevin E. Murray, Scott Whitty, Suya Liu, Jason Luu, and Vaughn Betz. 2013. Titan: enabling large and complex benchmarks in academic CAD. In 2013 23rd International Conference on Field programmable Logic and Applications, 1–8.
[21] 2024. DAH benchmark generated from the Titan23. [Link] le/d/1J6wE7AwuBvnZZfICbnqa42Kia8yXcF4d/view?usp=drive_link.