ICCAD24-TopoOrderPart A Multi-Level Scheduling-Driven Partitioning Framework For Processor-Based Emulation

The document presents TopoOrderPart, a novel multi-level scheduling-driven partitioning framework designed for processor-based emulation (PBE). This framework aims to balance topological order and minimize cut size, significantly enhancing scheduling efficiency and reducing run time by 0.53× while achieving a 69% improvement in topological order balancing. Experimental results demonstrate its effectiveness compared to existing partitioners, emphasizing the importance of considering topological ordering in the partitioning stage.

TopoOrderPart: A Multi-level Scheduling-Driven Partitioning Framework for Processor-Based Emulation

Shunyang Bi† , Jing Tang† , Hailong You†,∗ , Haonan Wu† , Cong Li† , Richard Sun§
† Faculty of Integrated Circuit, Xidian University, Xi’an, China
§ S2C Inc., Shenzhen, China

ABSTRACT
In the compilation flow of processor-based emulation (PBE), partitioning divides a large netlist into smaller pieces and assigns them to different processors. The scheduling process must adhere to the levels of the netlist, which are determined by topological ordering, and logic gates in the same level can be emulated in parallel. However, assigning most gates of the same level to one processor during netlist partitioning undermines the benefits of parallelization in scheduling and degrades overall performance. This paper proposes TopoOrderPart, the first scheduling-driven partitioning framework that simultaneously balances topological order and minimizes cut size, which is valuable for reducing the number of time steps in scheduling. Topological order balancing and cut size are considered throughout the multilevel paradigm: balance-aware coarsening achieves balance between clusters in the early stage, super-far root growing initial partitioning obtains a better partition by selecting root nodes that are distant from each other in the connection space, and two novel TopoRefine algorithms further improve the solution. Experimental results show that TopoOrderPart improves topological order balancing by 69% and achieves 0.53× run time while maintaining comparable cut size, compared to the state-of-the-art partitioner.

[Figure 1 panels: netlist and hypergraph (top); scheduling over time steps (bottom left) and partitioning onto processors P1, P2, P3 (bottom right), each shown with and without TOB.]
Figure 1: The result of scheduling with/without topological ordering balancing (TOB) in the partitioning stage. The bottom left of the figure illustrates that the scheduling stage takes fewer time steps when the partitioning stage considers TOB (without TOB: 7 time steps, with TOB: 5 time steps).
CCS CONCEPTS
• Hardware → Partitioning and floorplanning.

KEYWORDS
processor-based emulation, partitioning, topological order, scheduling

∗ Corresponding author: Hailong You. This work is supported by NSFC 62234010 and The 111 Project of China 61574109.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
ICCAD ’24, October 27–31, 2024, New York, NY, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1077-3/24/10
[Link]

1 INTRODUCTION
As the scale and complexity of designs increase, processor-based emulation (PBE)[1][2] has widespread applications in functional verification due to its fast compilation speed and high logic capacity[3]. A processor-based emulator consists of a massive array of interconnected processors[4]. In a typical compilation flow for PBE, the scheduling stage schedules the processors in the correct time order[5]. Furthermore, partitioning needs to be performed in the early stage to divide a massively large circuit into smaller sub-circuits that fit within a processor’s capacity. With circuit systems becoming increasingly vast and sophisticated, partitioning and scheduling play a more critical role in PBE.

In the compilation flow of PBE, the scheduling process follows the hierarchical structure of the circuit, since the emulation is performed level by level: the current level is only calculated once the previous level is completed[6]. In addition, the level of the netlist is determined by topological ordering, and all gates at the same level are independent and can be calculated in parallel[6]. In the partitioning stage, if most gates at the same level are assigned to one processor, the compiler will produce an inappropriate schedule of tasks. Such a schedule erodes the benefits of parallelization by allocating tasks to processors poorly[3], which keeps the system from reaching its potential. Therefore, ignoring the topological ordering balancing (TOB) of the netlist in partitioning could negatively impact system performance.

A toy example is given in Figure 1 to demonstrate the importance of this scheduling-aware partitioning. The top of Figure 1 shows a netlist containing logic gates and flip-flops that is converted to a directed acyclic hypergraph (DAH). For the same DAH, two different partitions without TOB and with TOB are shown in the bottom right of Figure 1. Note that the topological order of
flip-flops in the netlist is zero after topological ordering. Therefore, the number of nodes with topological order 0 in processors 1, 2, and 3 when considering TOB is 1, 1, and 2, respectively, so the variances for topological orders 0, 1, 2, and 3 are {0.22, 0, 0.22, 0.22}. Similarly, the variances without TOB are {3.55, 2, 0.89, 0.22}, which are more dispersed than those with TOB. Although neither of these two partitions is a min-cut result, the scheduling results differ considerably. In the bottom left of Figure 1, the partition without TOB needs 7 time steps, while the partition with TOB needs only 5. Considering TOB in partitioning is therefore preferable for better scheduling results.

In addition, a processor-based emulator typically consists of multiple boards, wherein each board is a cluster of CPUs[7][8]. Each CPU includes a plurality of processors. It is thus common that a very large design needs to be partitioned onto the emulator level by level, according to the emulator architecture.

However, to the best of our knowledge, there is no existing research on DAH partitioning algorithms that considers the above characteristics of the processor-based emulator. The traditional and well-known hypergraph partitioners hMETIS[9], PaToH[10], and KaHyPar[11] mainly focus on minimizing cut size under balancing constraints, but they are neither TOB-aware nor do they address emulator architecture-level partitioning. Although DAH partitioning is taken into consideration in [12][13][14], the main problem those works solve is that the quotient graph must be acyclic. There are also some recent works on hypergraph partitioning. [15] proposed a supervised spectral framework to solve the two-way hypergraph partitioning problem. The work [16] focused on improving the k-way cut size by generating vertex embeddings and employing cut-overlay clustering. The author in [17] studied the hypergraph partitioning problem with multiple constraints, such as multi-dimensional balance, embedding, and timing constraints. However, none of these works consider the TOB or emulator architecture-level requirements in partitioning.

In this paper, we propose a multi-level scheduling-driven partitioning framework, named TopoOrderPart, which takes into account the TOB and the emulator architecture level at the partitioning stage. The goal of TopoOrderPart is to minimize the TOB metric and the cut size while handling the architecture-level partitioning. TopoOrderPart exhibits a strong ability to minimize the TOB metric, which is crucial for reducing time steps in scheduling. Our contributions include:
• We design the first multi-level partitioning framework considering both the topological order balancing and the emulator architecture level.
• We propose two theorems that provide efficient objective values to guide the initial partitioning and refinement. In the initial partitioning stage, the super-far root growing partitioning offers root nodes in distant relationship and improves the cut size to 17% while maintaining less TOB. In the refinement stage, boundary nodes refinement and local refinement are applied to handle TOB and cut size simultaneously.
• Experimental results show that, compared to the best partitioner, TopoOrderPart achieves a 69% reduction in TOB on average and 0.53× run time with comparable cut size.

The remainder of the paper is organized as follows. Section 2 provides terminology and problem formulation. Section 3 explains the details of the proposed algorithm. Section 4 presents the benchmarks and experimental results. Finally, the paper is concluded in Section 5.

2 PRELIMINARIES

Table 1: Terminology and Notation

Term | Description
$H(V, E)$ | The DAH $H$; $V$ denotes the set of hypernodes and $E$ the set of hyperedges
$c(v)$ | The weight of node $v$
$\omega(e)$ | The weight of edge $e$
$I(v)$ | Incident edges of node $v$, $I(v) := \{e \mid v \in e\}$
$\Gamma(v)$ | The neighbors of node $v$, $\Gamma(v) := \{u \mid \exists e \in E : \{v, u\} \subseteq e\}$
$p_k$ | Partition block id
$N(p_k)$ | The set of hyperedges connected to block $p_k$
$\Pi_H$ | The $k$ partitions of DAH $H$, $\{V_i \mid i = 1, 2, \ldots, k\}$
$b(v)$ | The partition id $p_k$ of node $v$
$cut(e)$ | The connectivity of edge $e$
$\varepsilon$ | Imbalance parameter
$\tau(v)$ | The topological order of node $v$
$r_\tau$ | The number of nodes with topological order $\tau$
$\tau_{max}$ | Max topological order of all nodes
$v^{\dot\tau}$ | The set of nodes with topological order $\dot\tau$ in node $v$, $v^{\dot\tau} := \{u \mid u \in v, (u, v) \in V, \tau(u) = \dot\tau, \dot\tau \in [0, \tau_{max}]\}$
$P_i^{\dot\tau}$ | The set of nodes with topological order $\dot\tau$ in block $i$, $P_i^{\dot\tau} := \{v \mid v \in V_i, \tau(v) = \dot\tau, \dot\tau \in [0, \tau_{max}]\}$
$|P_i^{\dot\tau}|$ | The number of nodes in the set $P_i^{\dot\tau}$
$P^{\dot\tau}$ | The set of nodes with topological order $\dot\tau$
$|P^{\dot\tau}|$ | The number of nodes in the set $P^{\dot\tau}$, $|P^{\dot\tau}| := \sum_{i=1}^{k} |P_i^{\dot\tau}|$
$\sigma_{\dot\tau}^2$ | The variance of nodes with topological order $\dot\tau$ in all blocks
$K(\sigma_\tau^2)$ | The TOB metric

2.1 Problem Formulation
A given netlist can be represented by a DAH $H(V, E)$. The logic gates are represented by the set of nodes $V$, and the nets are denoted as the set of edges $E$. For a DAH $H(V, E)$, each node $v \in V$ has a non-negative weight $c : V \to \mathbb{R}_{\ge 0}$, and each edge $e$ has a positive weight $\omega : E \to \mathbb{R}_{>0}$. In addition, the emulator’s boards, CPUs, and processors are denoted as the partitioning target blocks. The PBE architecture-level partitioning is described by a series of partitioning tasks, each with its own $k$ target blocks. Specifically, the partitioning task at the next level is based on one of the sub-DAHs generated by the partition result $\Pi_H$ of the previous level. At each partition level, a positive integer $k$ is provided, and the scheduling-driven partitioning problem is to find a $\Pi_H$ wherein each block $V_i \in \Pi_H$ satisfies the balance constraint for a given $\varepsilon$: $c(V_i) \le (1 + \varepsilon) \lceil c(V)/k \rceil$, while minimizing the objective
function (equation (1)). Furthermore, the problem is NP-hard.

$$\Phi(H, \Pi_H) = \sum_{e \in E} cut(e) + \beta K(\sigma_\tau^2), \quad \tau \in [0, \tau_{max}] \tag{1}$$

In equation (1), the first term $cut(e)$ is the connectivity of the edge $e$, which denotes the number of blocks that $e$ connects, as shown in equation (2). Moreover, $cut(e)$ signifies the data communication between processors, which could become the performance bottleneck due to its larger overhead in comparison to communication within a processor[6].

$$cut(e) := |\{V_i \mid \exists v \in e : v \in V_i,\ i \in [1, k]\}| \tag{2}$$

The second term $K(\sigma_\tau^2)$ denotes the TOB metric of a partition result and is defined in equation (3). The balance of a topological order $\tau$ is represented by the variance $\sigma_\tau^2$ of the number of nodes with $\tau$ in each block: the smaller the variance $\sigma_\tau^2$, the more balanced the topological order $\tau$. The variance $\sigma_\tau^2$ is defined in equation (4). However, in practical benchmarks, we observed that the number of nodes with each topological order varies. There are more nodes in smaller topological orders and fewer in larger ones. To eliminate the impact of the number of nodes in different topological orders, we use $\sigma_\tau^2 / |P^\tau|^2$ to represent the balance of $\tau$. As a result, the TOB metric is the aggregate balance across all topological orders, thereby providing a more accurate measure of balance.

Furthermore, the $\beta$ in equation (1) is a parameter that controls the relative importance of the TOB metric. The selection of $\beta$ is a trade-off between cut size and TOB. In our experiment, the $\beta$ is set as 51 . In equation (5), $\bar{P}^\tau$ is the average number of nodes with topological order $\tau$ over all blocks.

$$K(\sigma_\tau^2) = \sum_{\tau=0}^{\tau_{max}} \left( \sigma_\tau^2 \big/ |P^\tau|^2 \right) \tag{3}$$

$$\sigma_\tau^2 = \frac{\sum_{i=1}^{k} \left( |P_i^\tau| - \bar{P}^\tau \right)^2}{k}, \quad \tau \in [0, \tau_{max}] \tag{4}$$

$$\bar{P}^\tau = \frac{\sum_{i=1}^{k} |P_i^\tau|}{k} \tag{5}$$

Important terms throughout this work are summarized in Table 1.

3 METHODOLOGY
In this section, we first discuss the flow of the scheduling-driven partitioning framework. Figure 2 is a flow chart of the framework. Like other state-of-the-art partitioners, TopoOrderPart is based on the multi-level partitioning paradigm. In particular, TopoOrderPart employs efficient algorithms throughout the multi-level paradigm to integrate the two objectives smoothly. Moreover, according to the emulator architecture level, we integrate TopoOrderPart into the framework and partition the DAH level by level in parallel.
The following sections elaborate on the components of TopoOrderPart.

[Figure 2 flow chart. Pre-process: given a DAH H(V,E) from the netlist, merge duplicate hyperedges, topological ordering (Kahn's algorithm). Partitioning: obtain the partitioning level, employ the TopoOrderPart, generate the sub-DAHs from the result, and repeat until the last level; then obtain the final partition result. TopoOrderPart: balancing-aware coarsening, super-far root growing initial partitioning, uncoarsening, TopoRefine algorithms, rollback to the best result.]
Figure 2: Flow chart of the TopoOrderPart.

3.1 Improved TOB Metric
Before presenting TopoOrderPart, we introduce Theorems 3.1 and 3.3, which provide efficient objectives derived from the TOB metric to guide the initial partitioning and refinement stages. By applying the corollaries of these two theorems, the time complexity of updating TOB can be decreased to $O(1)$. The difference between Theorems 3.1 and 3.3 is that the former applies to the unpartitioned DAH, while the latter is mainly for the partitioned one.

Theorem 3.1. Let $K(\sigma_\tau^2)$ denote the TOB metric of an unpartitioned DAH, and $\vartheta$ represent the change in set $P_m^\tau$. We denote the change in $K(\sigma_\tau^2)$ as $\Delta_1$. Then, $\Delta_1$ is only related to

$$\left[ \frac{(1 - k)\vartheta^2}{k} - 2\vartheta \left( |P_m^\tau| - \bar{P}^\tau \right) \right] \Big/ |P^\tau|^2$$

Proof. We denote the changed set $P_m^\tau$ as $\hat{P}_m^\tau$, and the changed TOB metric as $\hat{K}(\sigma_\tau^2)$. Then we have:

$$\vartheta = |\hat{P}_m^\tau| - |P_m^\tau| \tag{3.1}$$

$$\Delta_1 = K(\sigma_\tau^2) - \hat{K}(\sigma_\tau^2) \tag{3.2}$$

$$\hat{K}(\sigma_\tau^2) = \frac{\left( |\hat{P}_m^\tau| - \hat{\bar{P}}^\tau \right)^2 + \sum_{i=1, i \ne m}^{k} \left( |P_i^\tau| - \hat{\bar{P}}^\tau \right)^2}{k \cdot |P^\tau|^2} \tag{3.3}$$

$$K(\sigma_\tau^2) = \frac{\sum_{i=1}^{k} \left( |P_i^\tau| - \bar{P}^\tau \right)^2}{k \cdot |P^\tau|^2} \tag{3.4}$$

where $\hat{\bar{P}}^\tau$ equals $\bar{P}^\tau + \vartheta / k$. According to equations (3.1), (3.2), (3.3), and (3.4), we further obtain equation (3.5):

$$\Delta_1 = \frac{(1 - k)\vartheta^2}{k \cdot |P^\tau|^2} - \frac{2k\vartheta \left( |P_m^\tau| - \bar{P}^\tau \right)}{k \cdot |P^\tau|^2} + \frac{\sum_{i=1}^{k} 2\vartheta \cdot \left[ \left( |P_i^\tau| - \bar{P}^\tau \right) / k \right]}{k \cdot |P^\tau|^2} \tag{3.5}$$

Combined with equation (5), the value of the term $\sum_{i=1}^{k} \left[ \left( |P_i^\tau| - \bar{P}^\tau \right) / k \right]$ is 0. Therefore, $\Delta_1$ is

$$\Delta_1 = \left[ \frac{(1 - k)\vartheta^2}{k} - 2\vartheta \left( |P_m^\tau| - \bar{P}^\tau \right) \right] \Big/ |P^\tau|^2 \tag{3.6}$$
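The per-order variance of equations (4) and (5) and the incremental update of Theorem 3.1 can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, `total` stands for $|P^\tau|$ (treated as a constant, as in Corollary 3.2), and the placement `[4, 0, 0]` is an assumed distribution rather than one stated in the text. The closed form keeps the $1/k$ of equation (4) explicit, that is, it divides the bracketed expression of Theorem 3.1 by $k \cdot |P^\tau|^2$; this normalization is an assumption of the sketch and only rescales $\Delta_1$ by a constant, which leaves the ranking of candidate blocks unchanged.

```python
from fractions import Fraction


def variance(counts):
    """Equation (4): variance of the per-block node counts for one topological order."""
    k = len(counts)
    mean = Fraction(sum(counts), k)  # equation (5), the per-block average
    return sum((c - mean) ** 2 for c in counts) / k


def tob_term(counts, total):
    """One summand of equation (3): sigma^2 / |P^tau|^2, with |P^tau| given as `total`."""
    return variance(counts) / total ** 2


def delta1(counts, m, theta, total):
    """Closed-form K - K_hat when block m gains `theta` nodes (assumed 1/k normalization)."""
    k = len(counts)
    mean = Fraction(sum(counts), k)
    bracket = Fraction(1 - k, k) * theta ** 2 - 2 * theta * (counts[m] - mean)
    return bracket / (k * total ** 2)


# Topological order 0 of the Figure 1 example: four flip-flops over three processors.
with_tob = [1, 1, 2]     # counts stated in the text
without_tob = [4, 0, 0]  # hypothetical placement, not stated in the text
print(float(variance(with_tob)), float(variance(without_tob)))

# Assign one more node of this order to block 0 of a partial assignment.
before, after, total = [1, 0, 0], [2, 0, 0], 4
direct = tob_term(before, total) - tob_term(after, total)
print(direct == delta1(before, m=0, theta=1, total=total))
```

Exact fractions are used so that the direct recomputation and the closed form can be compared with `==` instead of a floating-point tolerance.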
Corollary 3.2. When assigning a node with topological order $\tau$ to block $m$ for an unpartitioned DAH, we can use $|P_m^\tau|$ to update the TOB metric, and the time complexity is $O(1)$.

Proof. According to Theorem 3.1, for an unpartitioned DAH, when a node with topological order $\tau$ is assigned to block $m$, the value of $\vartheta$ is 1. Therefore, we have:

$$\Delta_1 = \left[ \frac{(1 - k)}{k} - 2 \left( |P_m^\tau| - \bar{P}^\tau \right) \right] \Big/ |P^\tau|^2 \tag{3.7}$$

Additionally, the number of blocks $k$, the average $\bar{P}^\tau$ for topological order $\tau$, and the squared total number of nodes $|P^\tau|^2$ with topological order $\tau$ are all constant values. Therefore, the time complexity of updating the TOB metric is constant.

Theorem 3.3. Let $K(\sigma_\tau^2)$ denote the TOB metric of partition $\Pi_H$, and $\vartheta$ represent the number of nodes moved from the set $P_f^\tau$ to the set $P_t^\tau$. We denote the change in $K(\sigma_\tau^2)$ as $\Delta_2$. Then, $\Delta_2$ is only related to $|P_f^\tau|$ and $|P_t^\tau|$:

$$\frac{2\vartheta \left( |P_f^\tau| - |P_t^\tau| - \vartheta \right)}{k \cdot |P^\tau|^2}$$

Proof. We denote the changed sets $P_f^\tau$ and $P_t^\tau$ as $\hat{P}_f^\tau$ and $\hat{P}_t^\tau$, respectively, and the new TOB metric as $\hat{K}(\sigma_\tau^2)$. Then we have:

$$\vartheta = |P_f^\tau| - |\hat{P}_f^\tau| = |\hat{P}_t^\tau| - |P_t^\tau| \tag{3.8}$$

$$\Delta_2 = K(\sigma_\tau^2) - \hat{K}(\sigma_\tau^2) \tag{3.9}$$

$$\hat{K}(\sigma_\tau^2) = \frac{\left( |\hat{P}_f^\tau| - \bar{P}^\tau \right)^2 + \left( |\hat{P}_t^\tau| - \bar{P}^\tau \right)^2}{k \cdot |P^\tau|^2} + \frac{\sum_{i=1, i \ne f, i \ne t}^{k} \left( |P_i^\tau| - \bar{P}^\tau \right)^2}{k \cdot |P^\tau|^2} \tag{3.10}$$

Combined with equation (3.4), we further get $\Delta_2$:

$$\Delta_2 = \frac{2\vartheta \left( |P_f^\tau| - |P_t^\tau| - \vartheta \right)}{k \cdot |P^\tau|^2} \tag{3.11}$$

Corollary 3.4. When moving a node with topological order $\tau$ from block $f$ to block $t$ for a partitioned DAH, we can use $|P_f^\tau|$ and $|P_t^\tau|$ to update the TOB metric, and the time complexity is $O(1)$.

Proof. According to Theorem 3.3, for a partitioned DAH, when a node with topological order $\tau$ is moved from block $f$ to block $t$, the value of $\vartheta$ is 1. Therefore, we have:

$$\Delta_2 = \frac{2 \left( |P_f^\tau| - |P_t^\tau| - 1 \right)}{k \cdot |P^\tau|^2} \tag{3.12}$$

Additionally, the squared total number of nodes $|P^\tau|^2$ with topological order $\tau$ is a constant value. Therefore, the time complexity of updating the TOB metric when moving a node is $O(1)$.

3.2 Balancing-Aware Coarsening
The first step in the multilevel partitioning paradigm is coarsening, which constructs a sequence of progressively coarser hypergraphs that reflect the basic structure of the input[18]. Specifically, at each level, nodes with higher connectivity are merged into a new supernode. The connectivity between a pair of nodes $(u, v)$ is defined in hMETIS[9] as:

$$r(u, v) = \sum_{e \in \{I(v) \cap I(u)\}} \frac{\omega(e)}{|e| - 1} \tag{6}$$

However, applying the connectivity method directly can be inadequate for solving the topological order balancing issue. This approach tends to cluster nodes with high connectivity and ignore the topological order balancing of nodes, which could lead to an unbalanced coarser hypergraph and reduce the problem-solving capabilities of initial partitioning and refinement.

[Figure 3: representative node $u$ and contracted node $v$ are merged. Table: $ave_1 = 1$, $ave_2 = 2$, $ave_3 = 2$; $|v^1| = 1$, $|v^2| = 1$, $|v^3| = 1$; $|u^1| = 1$, $|u^2| = 1$, $|u^3| = 0$; gains $-1$, $0$, $1$.]
Figure 3: A toy example of calculating topological order balancing gain in coarsening.

Therefore, we present a new coarsening method incorporating topological order into the connectivity rating. The new rating $\hat{r}$ is shown in equation (7). A parameter $\gamma$ is set to adjust the ratio between connectivity and topological order balancing.

$$\hat{r}(u, v) = r(u, v) + \gamma \cdot \sum_{\tau}^{\tau_{max}} \left( ave_\tau - |u^\tau| - |v^\tau| \right) \tag{7}$$

where $ave_\tau = r_\tau / bound$, $r_\tau$ represents the number of all nodes with topological order $\tau$, and $bound$ denotes the lower bound at which the coarsening process stops. Here, $|u^\tau|$ and $|v^\tau|$ are the numbers of nodes with topological order $\tau$ in node $u$ and node $v$, respectively. $\gamma$ is a parameter that controls the topological order gain and influences the number of nodes in the coarser DAH. In practice, we set $\gamma$ as 1. The nodes $u$ and $v$ could each be representative of other nodes due to the coarsening. When merging two nodes $u$ and $v$, with $v$ being the node merged into $u$ (making $u$ the representative node), we visit all topological orders involving the merged node. For each of these topological orders, we compute the sum of the number of nodes in $u$ and $v$. If the sum exceeds the average value for a given topological order, then the gain of that topological order is negative. Conversely, the gain is positive if the sum is below the average. The overall gain is determined by aggregating the gains of all topological orders.

In Figure 3, a toy example illustrates how to compute the gain of topological order balancing in this stage. The example involves a representative node $u$ and a contracted node $v$. In the left part of Figure 3, $v$ consists of nodes with three different topological orders, while $u$ has nodes with two topological orders. On the right side of Figure 3, the first row of the table shows that the mean value of nodes in the blocks with topological order 1, 2, and 3 is 1, 2, and 2,
respectively. Both $u^1$ and $v^1$ have one node. Therefore, the gain for topological order 1 is calculated as $1 - 1 - 1 = -1$. In the same way, the gains of the other two topological orders are 0 and 1, so the overall topological order balancing gain is 0.

3.3 Super-Far Root Growing Initial Partitioning
After the coarsening process, topological order balancing aware initial partitioning is applied to the coarsest hypergraph. In the initial partitioning, an efficient super-far root selecting method is introduced to select root nodes located at significant distances from each other. Then, we employ three growing partitioning strategies to assign nodes around the roots while considering the cut and the TOB metric. Finally, the best partition, with the smallest cut and TOB, is chosen for refinement.

3.3.1 Super-far Root Selection. Firstly, we present the super-far root selection method, as detailed in Algorithm 1. Initially, the connectivity ($c$) of each node $v_i$ is computed (line 3). Additionally, we define the edge length between $v_i$ and $v_j$ as $1/c$. This mirrors the intuitive concept that a long edge length reflects a distant relationship and a low probability of cuts between two nodes. Then, we introduce a virtual super source node $v_s$ (line 5) and employ the single-source shortest path algorithm to calculate the maximum shortest edge length from $v_s$ to obtain the first root node (from line 10 to line 22). Similarly, based on the first root, we repeat the process to get the second one. Finally, after multiple iterations, each block has one root node. These nodes represent the most distant relationships within the DAH. Note that the single-source shortest path algorithm we apply in the process is an improved Bellman-Ford algorithm. Maintaining the $dis$ array is time-consuming when the Bellman-Ford algorithm is executed multiple times. Therefore, we only update one root node’s distance to zero and push it to the queue $Q$ (line 27 to line 28) rather than resetting the $dis$ array after finding the shortest path. In our experiments, performing the improved Bellman-Ford algorithm multiple times achieves a time complexity of $O(K|E|)$ $(K < 2)$, while the time complexity of the Bellman-Ford algorithm is $O(K|kE|)$ $(K < 2)$.

Algorithm 1: Super-Far Root Selecting Algorithm
Input: Coarsest DAH $H_C(V, E)$, block set $k$, block number $blockNum$
Output: Root nodes set $rootNodes(v_i)$ for each block
1  for $v_i \in V$ do
2      for $v_j \in \Gamma(v_i)$ do
3          $c \leftarrow \sum_{v_i, v_j \in e} \frac{\omega(e)}{|e| - 1}$
4          $weight(v_i, v_j, 1/c)$
5  $v_s \leftarrow V$ // randomly pick a node as super source node
6  $Q.push(v_s)$ // push $v_s$ to queue $Q$
7  $dis.resize(|V|, MAX)$
8  $dis \leftarrow (v_s, 0)$ // the initial $dis$ of $v_s$ is 0
9  while $blockNum--$ do
10     while $Q.size() > 0$ do
11         $t \leftarrow Q.front()$
12         $Q.pop()$
13         for $w \in weight$ do
14             if $dis(w.second) > dis(t) + w.third$ then
15                 $dis(w.second) = dis(t) + w.third$
16                 $Q.push(w.second)$
17     $tarBlock \leftarrow \{k_i \mid k_i \in k\}$ // pick a block without root
18     $maxdis, maxnode \leftarrow 0$
19     for $v \in V$ do
20         if $SatisfyBalance(maxnode, tarBlock)$ and $dis(v) \ne MAX$ and $dis(v) > maxdis$ then
21             $maxdis \leftarrow dis(v)$
22             $maxnode \leftarrow v$
23     $rootNodes \leftarrow (maxnode, tarBlock)$
24     if $blockNum == |k| - 1$ then
25         $dis.clear()$
26         $dis.resize(|V|, MAX)$
27     $dis \leftarrow (maxnode, 0)$
28     $Q.push(maxnode)$
29 return $rootNodes$

3.3.2 Three Strategies Growing Partitioning. Now we introduce the gain function considering the net weight and the TOB metric (Corollary 3.2). The gain function is:

$$gain(v, p_k) = \delta \Delta_1(v, p_k) + \sum_{e \in I(v),\ e \cap N(p_k) \ne \emptyset} \omega(e) \tag{8}$$

where $p_k$ is the block ID, $N(p_k)$ is the set of edges connected to block $p_k$, and $\Delta_1$ is given by equation (3.7). $\delta$ is a parameter to control the TOB gain. We do not want any large TOB in the initial partition, so the larger this value, the better. In our experiments, we set $\delta$ as 10000. For every block, we use a priority queue to store the $gain$ between a node and the block. Then, we adopt the portfolio-based approach[11] to select the highest gain, which increases diversification and produces the best result. Technically, the approach includes three strategies: Global-Grow, Sequential-Grow, and Loop-Grow. The difference between them is illustrated by an example shown in Figure 4. In the upper-left of Figure 4, the gains of all node-block pairs are stored in three priority queues. $PQ_0$ represents the gain priority queue of the nodes with block 0, and the gains in each priority queue are sorted in descending order. Global-Grow always selects the maximum gain from all the priority queues when assigning a node (the result is shown in the upper-right of Figure 4). Sequential-Grow selects gains from the first priority queue until it is empty, then proceeds to the second one, and so forth (the bottom-left of Figure 4). Loop-Grow selects the maximum gain from the current priority queue and then proceeds to pick the maximum from the next priority queue (the bottom-right of Figure 4). Since the size of the DAH is significantly scaled down after coarsening, performing the three strategies is not time-consuming. After the priority queues are empty, we select the partition with the best result to be projected back to the original hypergraph.
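The order in which the three growing strategies consume the per-block priority queues can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the gains are static illustrative values chosen to resemble the queues of Figure 4, and the sketch omits the gain updates and the removal of an assigned node from the other queues that a real partitioner would perform.

```python
import heapq


def _as_heaps(queues):
    """Build one max-heap per block; heapq is a min-heap, so gains are negated."""
    heaps = []
    for gains in queues:
        heap = [-g for g in gains]
        heapq.heapify(heap)
        heaps.append(heap)
    return heaps


def global_grow(queues):
    """Always take the largest gain available in any priority queue."""
    heaps, order = _as_heaps(queues), []
    while any(heaps):
        block = max((i for i, h in enumerate(heaps) if h), key=lambda i: -heaps[i][0])
        order.append((block, -heapq.heappop(heaps[block])))
    return order


def sequential_grow(queues):
    """Drain the first priority queue completely, then the second, and so on."""
    heaps, order = _as_heaps(queues), []
    for block, heap in enumerate(heaps):
        while heap:
            order.append((block, -heapq.heappop(heap)))
    return order


def loop_grow(queues):
    """Take the best gain of the current queue, then move on to the next queue."""
    heaps, order = _as_heaps(queues), []
    while any(heaps):
        for block, heap in enumerate(heaps):
            if heap:
                order.append((block, -heapq.heappop(heap)))
    return order


# Illustrative gains for PQ0, PQ1, and PQ2 (assumed values).
pqs = [[12, 8, 5], [7, 6, 4], [11, 9, 3]]
for strategy in (global_grow, sequential_grow, loop_grow):
    print(strategy.__name__, [gain for _, gain in strategy(pqs)])
```

Each function returns `(block, gain)` pairs in selection order, so the three strategies can be compared on the same input before choosing the partition with the best objective value.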
[Figure 4 panels: PQs with gain ($PQ_0$: 12, 8, 5; $PQ_1$: 7, 6, 4; $PQ_2$: 11, 9, 3); Global-Grow; Sequential-Grow; Loop-Grow.]
Figure 4: Difference in selecting gain among Global-Grow, Sequential-Grow and Loop-Grow in the initial partitioning.

3.4 TopoRefine Algorithms
After a solution of the final coarsened DAH is obtained by initial partitioning, we perform uncoarsening and move-based refinement to improve the partitioning solution. In the traditional refinement algorithm, the movement of nodes is determined by their gain value, which represents the change in cut size caused by the movement. However, to consider the topological order balancing introduced by scheduling, the gain function now includes both the TOB metric and the cut size. The $gain_r$ is defined in equation (9).

$$gain_r(v, p_k) = \zeta \Delta_2(v, p_k) + \sum_{e \in I(v)} \omega(e) \tag{9}$$

where $\Delta_2$ represents the TOB metric in equation (3.12) according to Corollary 3.4.

Our local search algorithms are similar to the FM algorithm but greedier. Specifically, we employ two greedy strategies. The boundary nodes refinement strategy only focuses on the nodes in cut edges and moves these nodes to reduce the cut size. In contrast, the local refinement strategy selects nodes whose gain is positive to move. The two algorithms are introduced in this section.

3.4.1 Boundary Nodes Refinement. In the boundary nodes refinement, we store the boundary nodes in a randomly permuted array, which breaks ties randomly to increase diversification. Then, we calculate the gain of a boundary node according to equation (9). Finally, we move the node if the gain is positive. Note that a node is locked after it is moved. The boundary nodes refinement finishes once all boundary nodes are locked. Algorithm 2 describes the procedure of boundary nodes refinement.

Algorithm 2: Boundary Nodes Refinement Algorithm
Input: Boundary nodes $V_b$, block number $blockNum$
Output: Improved partition solution $S$
$nodesRandomPermutation(V_b)$
for $v \in V_b$ do
    $maxgain \leftarrow 0$, $tarBlock \leftarrow -1$, $j \leftarrow 1$
    while $j++ \le blockNum$ do
        if $j == v.block$ then
            continue
        $gain_r \leftarrow \zeta \Delta_2(v, j) + \sum_{e \in I(v)} \omega(e)$
        if $gain_r \ge maxgain$ then
            $maxgain \leftarrow gain_r$, $tarBlock \leftarrow j$
    if $tarBlock == -1$ or $SatisfyBalance(v, tarBlock) == false$ then
        continue
    $v.block \leftarrow tarBlock$

3.4.2 Local Refinement. In contrast to boundary nodes refinement, local refinement involves the process of TOB gain updating. When a node $v$ with topological order $\tau$ is moved from block $i$ to block $j$, the TOB gain of all nodes with $\tau$ should be updated. This update is determined by where the updating nodes are located, leading to three potential scenarios:
• If updating nodes are located within block $i$, the TOB gains of the nodes will decrease.
• If updating nodes belong to block $j$, their TOB gains will increase.
• In cases where the updating nodes are neither in block $i$ nor in block $j$, their gain will increase with respect to block $i$ and decrease with respect to block $j$.

Based on equation (3.12), the TOB gain update can be defined as equation (10).

$$\hat{\Delta}_2(u) = \begin{cases} \Delta_2(u, j) - 4 / (k \cdot |P^\tau|^2), & \text{if } u \in V_i \\ \Delta_2(u, m) - 2 / (k \cdot |P^\tau|^2), & \text{if } u \in V_i \\ \Delta_2(u, i) + 4 / (k \cdot |P^\tau|^2), & \text{if } u \in V_j \\ \Delta_2(u, m) + 2 / (k \cdot |P^\tau|^2), & \text{if } u \in V_j \\ \Delta_2(u, i) + 2 / (k \cdot |P^\tau|^2), & \text{if } u \in V_m \\ \Delta_2(u, j) - 2 / (k \cdot |P^\tau|^2), & \text{if } u \in V_m \end{cases} \tag{10}$$

where $\hat{\Delta}_2(u)$ denotes the updated TOB gain for a neighbor $u$ of the moved node $v$. Here, blocks $i$ and $j$ represent the source and target blocks of $v$, respectively, while block $m$ refers to any block other than $i$ and $j$. We briefly derive the expression for $\hat{\Delta}_2(u, j)$ with $u \in V_i$; the other derivations are analogous. In equation (3.12), the terms $|P_j^\tau|$ and $|P_i^\tau|$ are the numbers of nodes with topological order $\tau$ in the target and source blocks, respectively. Upon moving node $v$ from block $i$ to block $j$, these counts become $|P_j^\tau| + 1$ and $|P_i^\tau| - 1$. As a result, the TOB gain associated with block $j$ for node $u$ is:

$$\hat{\Delta}_2(u, j) = \frac{2 \left( (|P_i^\tau| - 1) - (|P_j^\tau| + 1) - 1 \right)}{k \cdot |P^\tau|^2} = \frac{2 \left( |P_i^\tau| - |P_j^\tau| - 1 \right)}{k \cdot |P^\tau|^2} - \frac{4}{k \cdot |P^\tau|^2} = \Delta_2(u, j) - \frac{4}{k \cdot |P^\tau|^2}$$

In Figure 5, a toy example illustrates how to update the TOB gain. There are eight yellow circles on the left of Figure 5 representing nodes with the same topological order $\tau$ assigned to three blocks $i$, $j$, and $m$. Among these nodes, five are in block $i$, one is in block $j$, and two are in block $m$. The total number $|P^\tau|$ of nodes with topological order $\tau$ is 8, with $|P_i^\tau| = 5$, $|P_j^\tau| = 1$, and $|P_m^\tau| = 2$.
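The six offsets of equation (10) can be reproduced by recomputing equation (3.12) before and after a move. The sketch below is an illustrative check rather than the authors' code: the function names are hypothetical, and the block counts are the ones quoted for the Figure 5 example (five nodes in block $i$, one in block $j$, two in block $m$).

```python
from fractions import Fraction


def delta2(counts, src, dst, total):
    """Equation (3.12): TOB gain of moving one node of order tau from src to dst."""
    k = len(counts)
    return Fraction(2 * (counts[src] - counts[dst] - 1), k * total ** 2)


def gain_shifts(counts, i, j, total):
    """Change in every Delta_2(u, dst) after one node moves from block i to block j."""
    moved = dict(counts)
    moved[i] -= 1
    moved[j] += 1
    return {
        (src, dst): delta2(moved, src, dst, total) - delta2(counts, src, dst, total)
        for src in counts
        for dst in counts
        if src != dst
    }


counts = {"i": 5, "j": 1, "m": 2}  # per-block counts of the Figure 5 example
total = sum(counts.values())       # |P^tau| = 8
unit = Fraction(1, len(counts) * total ** 2)
for (src, dst), shift in gain_shifts(counts, "i", "j", total).items():
    # Express each shift as a multiple of 1 / (k * |P^tau|^2), as in equation (10).
    print(f"u in V_{src}, target {dst}: {int(shift / unit):+d} / (k * |P^tau|^2)")
```

Because every shift is a fixed multiple of $1 / (k \cdot |P^\tau|^2)$, a refinement pass can adjust stored gains by a constant instead of recomputing them, which is the point of the update rule.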
Table 2: Single level partition result comparison with 𝑘 = 8 and 𝜀 = 0.03

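Column groups, left to right: TopoOrderPart–, hMETIS+TopoRefine, TopoOrderPart. Each group reports Cut size, TOB metric, and Time (s).

Benchmark # Nodes # Nets | TopoOrderPart–: Cut size, TOB metric, Time (s) | hMETIS+TopoRefine: Cut size, TOB metric, Time (s) | TopoOrderPart: Cut size, TOB metric, Time (s)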
dart 202354 174814 43865 0.96 6.28 33574 1.49 40.77 37285 0.93 7.77
denoise 275638 273088 49716 246.74 14.85 46060 402.99 11.28 49211 260.17 19.88
stereo_vision 94050 93079 17527 3.69 1.28 17368 11.65 6.02 17723 3.49 1.97
sparcT1_core 91976 79063 26535 5.14 2.10 20131 10.79 7.77 22239 4.66 3.72
stap_qrd 240240 215292 13096 24.00 5.45 9112 19.25 9.00 9463 21.40 9.01
openCV 217453 162176 25646 3.94 7.19 10341 7.28 6.50 11146 4.08 4.90
des90 111221 92253 19793 1.95 3.72 5385 2.20 6.06 5668 2.02 2.77
directrf 931275 923215 34241 9.26 26.65 25000 23.90 62.59 27934 11.19 24.27
cholesky_bdti 266422 250741 8894 19.76 11.68 8167 20.86 6.98 8472 16.98 11.45
minres 261359 251527 6715 127.49 17.03 5350 115.88 5.48 5744 87.50 4.46
sparcT2_core 300109 269116 46646 6.66 11.58 43726 8.81 50.31 47869 5.89 12.53
LU_Network 635456 554309 29475 1.15 18.34 28718 7.37 94.69 28803 1.25 14.98
cholesky_mc 113250 104090 18350 20.41 2.71 13047 26.91 5.17 13724 18.67 2.79
LU230 574372 495657 50996 3.06 19.40 25211 3.34 75.56 28013 2.90 20.02
SLAM_spheric 113115 99675 22419 25.57 4.18 19503 26.14 5.08 20316 21.51 2.59
neuron 92290 83630 18940 29.13 2.35 17512 58.52 2.64 17936 31.70 2.30
bitonic_mesh 192064 159011 4607 2.63 10.75 5086 2.45 10.31 5576 2.32 4.37
bitcoin_miner 1089284 1088855 35607 21.24 36.50 25767 7.96 38.41 26295 7.05 40.01
segmentation 138295 137133 33290 773.94 6.33 30019 872.40 7.92 30249 791.67 7.23
gsm_switch 493260 454497 34394 1550.30 12.84 26077 1573.42 16.97 30045 1551.24 59.21
mes_noc 547544 470092 10497 16.00 49.03 7375 21.65 25.94 8105 18.56 35.00
sparcT1_chip2 820886 726374 18209 24.66 68.01 14351 23.40 55.25 15432 21.31 21.32
Avg. Ratio - - 1 1 1 0.77 1.62 2.12 0.83 0.93 1.13
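One way to read the "Avg. Ratio" row: each algorithm's value is divided by the TopoOrderPart– value on the same benchmark, and the resulting ratios are averaged over all benchmarks. The Python sketch below assumes an arithmetic mean (the text does not state which mean is used) and takes only the first three cut-size rows of Table 2, so it does not reproduce the reported averages. The helper name `avg_ratio` is hypothetical.

```python
def avg_ratio(values, baseline):
    """Arithmetic mean of per-benchmark ratios value / baseline (assumed averaging rule)."""
    ratios = [v / b for v, b in zip(values, baseline)]
    return sum(ratios) / len(ratios)


# Cut sizes for dart, denoise, and stereo_vision, copied from Table 2.
baseline_cut = [43865, 49716, 17527]   # TopoOrderPart- (baseline)
hmetis_cut = [33574, 46060, 17368]     # hMETIS+TopoRefine
topo_cut = [37285, 49211, 17723]       # TopoOrderPart

print(f"baseline         : {avg_ratio(baseline_cut, baseline_cut):.2f}")
print(f"hMETIS+TopoRefine: {avg_ratio(hmetis_cut, baseline_cut):.2f}")
print(f"TopoOrderPart    : {avg_ratio(topo_cut, baseline_cut):.2f}")
```

A value below 1 means the algorithm improves on the baseline for that metric, since cut size, TOB metric, and run time are all minimized.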

The TOB gain is calculated using equation (3.12) before node 𝑣 (gray dashed circle) moves from 𝑖 to 𝑗, and the gains are shown in blue boxes in Figure 5. After node 𝑣 moves to 𝑗, $|P_i^\tau|$ decreases to 4, while $|P_j^\tau|$ increases to 2. These changes are highlighted in red font in Figure 5. $|P_m^\tau|$ and $|P^\tau|$ remain unchanged at 2 and 8, respectively. The updated TOB gains are shown in yellow boxes in Figure 5.

[Figure 5 image: for nodes located in blocks 𝑖, 𝑗, and 𝑚, the TOB gains Δ2(𝑢, ·) before the move and Δ̂2(𝑢, ·) after the move toward blocks 𝑖, 𝑗, and 𝑚, with denominators 3 × 8².]
Figure 5: An example of how to update the TOB gain in local refinement.

4 EXPERIMENTAL SETUP AND RESULTS
In this work, we developed TopoOrderPart in C++ and compiled it with g++ version 11.3.0. Our experiments were carried out on a single core of a Linux workstation equipped with an Intel Xeon Gold 6248R CPU at 3.0 GHz, running Ubuntu 22.04. Given the absence of existing benchmarks or baselines that address topological order balancing in the DAH partitioning problem, we generated a set of benchmarks derived from a public benchmark suite and constructed a baseline for a fair comparison.

To assess the effectiveness of TopoOrderPart, we compared it to the state-of-the-art hypergraph partitioner hMETIS. The hMETIS+TopoRefine algorithm invokes hMETIS to obtain a partition and then applies the TopoRefine algorithms introduced in this work to improve the TOB. The default parameters for hMETIS are set as in [19]. Considering the influence of the random seed in hMETIS, we take the best result of 5 runs. We conducted two separate experiments to validate the efficiency of our ideas.

4.1 Benchmarks and Baseline
In experiments, we adopt the Titan23 benchmark suite [20], whose statistics are summarized in Table 2. "# Nodes" is the number of DAH nodes, and "# Nets" is the number of nets in the circuit netlist. Since the input to our algorithm is a DAH, we acyclicize the Titan23 benchmarks. After the cycles are removed, the acyclic hypergraph has fewer nets, while its nodes are the same as in the original hypergraph. We make the generated benchmarks public at [21].

Given that there is no existing research on DAH partitioning that handles topological order balancing, we use TopoOrderPart– as our baseline. The only distinction between TopoOrderPart– and TopoOrderPart lies in the initial partitioning stage, where TopoOrderPart– employs the random root selection method and TopoOrderPart uses the super-far root selection method. To reduce the impact of randomness, we pick the best result from five runs when applying TopoOrderPart–.

4.2 Experimental Results
4.2.1 The Performance of Single level Partition. In Table 2, "Cut Size" denotes the sum of the hyperedges' connectivity. "TOB Metric" refers to the topological order balancing. The "Avg. Ratio" of "Cut Size", "TOB Metric" and "Time" is the average relative performance of TopoOrderPart and hMETIS+TopoRefine compared
with the TopoOrderPart– for all the benchmarks. In addition, Figure 6 presents a more detailed per-benchmark comparison of cut size and TOB. The y-axis represents the ratio of the best result to each algorithm's result. A ratio closer to 1 indicates superior performance, whereas a ratio closer to 0 denotes inferior results.

[Figure 6 image: two panels, "Cut-size Comparison" and "TOB metric Comparison"; x-axis "# Benchmarks", y-axis "Quality relative to best"; series TopoOrderPart–, hMETIS+TopoRefine, and TopoOrderPart.]
Figure 6: The single level partitioning comparison of cut-size and TOB between TopoOrderPart, TopoOrderPart– and hMETIS+TopoRefine. A quality ratio closer to 1 indicates better performance.

The results in Table 2 and Figure 6 show that TopoOrderPart achieves a 69% reduction on average in the TOB metric (Avg. Ratio 0.93 versus 1.62, both normalized to TopoOrderPart–), with 0.53× the runtime and comparable cut size, when compared with hMETIS+TopoRefine. This comparison indicates a significant improvement in topological order balancing under our framework. Although both TopoOrderPart and hMETIS+TopoRefine use the same TopoRefine algorithm, the former significantly outperforms the latter in terms of the TOB metric. This suggests that considering the TOB in the coarsening and initial partitioning stages helps improve topological order balancing. A further comparison underscores this conclusion: the baseline itself exhibits a 62% improvement in TOB over hMETIS+TopoRefine (Avg. Ratio 1 versus 1.62).

Compared with the baseline, despite increasing runtime by 13%, TopoOrderPart generates partitioning solutions that reduce the cut size and the TOB by an average of 17% and 7%, respectively. The result demonstrates that the super-far root selection method substantially strengthens the quality of the solutions while maintaining an acceptable run time.

[Figure 7 image: two panels, "TOB metric Comparison" and "Runtime Comparison"; x-axis "# Benchmarks", y-axis "Quality relative to best"; series TopoOrderPart and hMETIS+TopoRefine.]
Figure 7: The two level partitioning comparison of TOB and runtime between TopoOrderPart and hMETIS+TopoRefine. A quality ratio closer to 1 indicates better performance.

4.2.2 The Performance of Two level Partition. In this section, we conduct a separate experiment focusing on the architecture-level partitioning in TopoOrderPart. We set 𝑘1 = 8 and 𝑘2 = 6. Since no existing work addresses the architecture-level partitioning, we directly employ hMETIS+TopoRefine with 𝑘ℎ = 48. The result is shown in Table 3 and Figure 7. Compared to hMETIS+TopoRefine, the average run time of TopoOrderPart is reduced by 55%, and the TOB is reduced to 73%, while TopoOrderPart leads on cut size for half of the benchmarks. The results demonstrate our framework's superior performance on balancing topological order and optimizing the partitioning process. On the other hand, TopoOrderPart achieved better TOB performance by only employing a straightforward coarsening and refinement method. We believe that integrating our proposed TOB optimization techniques into advanced partitioners like KaHyPar, which adopt sophisticated and complex coarsening and refinement methods, would further enhance the performance.

Table 3: Two level partition result comparison with 𝑘1 = 8, 𝑘2 = 6 and 𝜀 = 0.03

Benchmark [hMETIS+TopoRefine] Cut Size TOB Metric Time (s) [TopoOrderPart] Cut Size TOB Metric Time (s)
dart 20129 0.48 16.36 18561 0.45 8.67
denoise 21446 76.10 71.35 31082 65.12 20.05
stereo_vision 8706 1.64 7.14 11927 1.12 2.74
sparcT1_core 20735 2.63 7.05 27647 1.13 4.32
stap_qrd 15335 5.03 14.19 25140 3.18 9.92
openCV 21786 1.26 17.48 19872 0.63 8.76
des90 12473 0.37 8.03 11333 0.36 7.14
directrf 66314 2.59 146.43 65239 1.86 39.85
cholesky_bdti 17459 4.25 20.58 15485 2.78 12.41
minres 16719 15.45 27.36 15686 12.95 8.71
sparcT2_core 52758 1.72 48.71 50094 1.19 17.44
LU_Network 44157 0.59 100.92 43196 0.37 24.52
cholesky_mc 12561 5.18 9.92 16104 3.70 3.76
LU230 35781 0.69 80.16 34367 0.60 25.01
SLAM_spheric 18104 5.90 10.88 27431 3.75 5.07
neuron 12687 12.91 14.87 11380 8.21 3.28
bitonic_mesh 15457 0.44 27.98 14030 0.42 11.91
bitcoin_miner 64212 4.14 77.01 85616 3.92 50.86
segmentation 15882 173.55 57.31 26037 167.62 9.17
gsm_switch 40020 296.69 80.08 56922 274.61 64.32
mes_noc 30425 3.59 73.67 45411 1.92 43.22
sparcT1_chip2 48459 3.89 106.97 62128 2.22 44.57
Avg. Ratio 1 1 1 1.18 0.73 0.45

5 CONCLUSION
In this paper, we propose TopoOrderPart, a partitioning framework designed to optimize topological order balancing and cut size. Compared to previous state-of-the-art partitioners, it considers the topological order balancing during every partitioning process. Notably, it achieves a partitioning scheme with the minimum topological order balancing, showcasing its unique capabilities. The quality of partitioning is further improved through optimization algorithms, including an efficient root node selection method in the initial partitioning stage. Furthermore, two high-performance refinement algorithms are introduced to iteratively reduce the topological order balancing and cut size. We integrate these algorithms into a framework that accommodates the PBE architecture level. Through empirical study, TopoOrderPart has shown outstanding performance in minimizing topological order balancing and cut size compared to the proposed baseline.
REFERENCES
[1] Cadence. 2024. Cadence palladium. [Link] ools/system-design-and-verification/emulation-and-prototyping/palladium.html.
[2] Mentor Graphics. 2024. Mentor veloce. [Link] /veloce/strato-hardware.
[3] Amir Yazdanshenas and Mohammed A. S. Khalid. 2007. A new scheduling algorithm for processor-based logic emulation systems. In 2007 50th Midwest Symposium on Circuits and Systems, 1505–1508.
[4] Amir Ali Yazdanshenas. 2006. Hardware design and cad for processor-based logic emulation systems. In Electronic Theses and Dissertations.
[5] Marwan Kanaan. 2007. A low-cost processor-based logic emulation system using fpgas. In Electronic Theses and Dissertations.
[6] Ruiyao Pu, Yiwei Sun, Pei-Hsin Ho, Fan Yang, Li Shang, and Xuan Zeng. 2023. Sphinx: a hybrid boolean processor-fpga hardware emulation system. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1–9.
[7] Yong Fu. 2022. Processor based logic simulation acceleration and emulation system. (Mar. 2022). US Patent App. 17/448,216.
[8] Ngai Ngai William Hung and Amiya Ranjan Satapathy. 2024. Systems and methods for distributed and parallelized emulation processor configuration. (Jan. 2024). US Patent App. 11/868,786.
[9] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. 1999. Multilevel hypergraph partitioning: applications in vlsi domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7, 1, 69–79.
[10] U.V. Catalyurek and C. Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10, 7, 673–693.
[11] Sebastian Schlag, Vitali Henne, Tobias Heuer, and Henning Meyerhenke. 2016. K-way hypergraph partitioning via n-level recursive bisection. In 2016 Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX), 53–67.
[12] Julien Herrmann, Jonathan Kho, Bora Uçar, and Kamer Kaya. 2017. Acyclic partitioning of large directed acyclic graphs. In 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 371–380.
[13] Julien Herrmann, M. Yusuf Özkaya, Bora Uçar, and Kamer Kaya. 2019. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM J. Sci. Comput., 41, 4, A2117–A2145.
[14] Merten Popp, Sebastian Schlag, Christian Schulz, and Daniel Seemaier. 2021. Multilevel acyclic hypergraph partitioning. In 2021 Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX), 1–15.
[15] Ismail Bustany, Andrew B. Kahng, Ioannis Koutis, and Bodhisatta Pramanik. 2022. Specpart: a supervised spectral framework for hypergraph partitioning solution improvement. In 2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 1–9.
[16] Ismail Bustany, Andrew B. Kahng, Ioannis Koutis, and Bodhisatta Pramanik. 2024. K-specpart: supervised embedding algorithms and cut overlay for improved hypergraph partitioning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43, 4, 1232–1245.
[17] Ismail Bustany, Grigor Gasparyan, Andrew B. Kahng, and Ioannis Koutis. 2023. An open-source constraints-driven general partitioning multi-tool for vlsi physical design. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1–9.
[18] Sebastian Schlag, Yaroslav Akhremtsev, Tobias Heuer, and Peter Sanders. 2017. Engineering a direct k-way hypergraph partitioning algorithm. In ALENEX'17.
[19] G. Karypis and V. Kumar. 1998. Hmetis, a hypergraph partitioning package, version 1.5.3. [Link] df.
[20] Kevin E. Murray, Scott Whitty, Suya Liu, Jason Luu, and Vaughn Betz. 2013. Titan: enabling large and complex benchmarks in academic cad. In 2013 23rd International Conference on Field programmable Logic and Applications, 1–8.
[21] 2024. Dah benchmark generated from the titan23. [Link] le/d/1J6wE7AwuBvnZZfICbnqa42Kia8yXcF4d/view?usp=drive_link.
