License: arXiv.org perpetual non-exclusive license
arXiv:2604.07287v1 [cs.AR] 08 Apr 2026

Symbolic Polyhedral-Based Energy Analysis for
Nested Loop Programs

Avinash Mahesh Nirmala, Dominik Walter, Frank Hannig, and Jürgen Teich
Abstract

This work presents a symbolic approach for estimating the energy consumption for nested loop programs when mapped and scheduled on parallel processor array accelerator architectures. Instead of simulation-based evaluation, we derive a methodology for symbolic energy analysis that captures the impact of mapping and scheduling decisions of loop nests on processor arrays. We compare our approach against simulation-based results for selected benchmarks and varying sizes of the iteration spaces. Whereas the latter are not scalable, our symbolic analysis is shown to be independent of the problem size. The presented evaluation methodology can be beneficially used during the design space exploration of mapping and scheduling decisions, for studying the influence of array size variations, and for comparisons with other loop nest accelerator architectures.

Index Terms—Loop programs, Processor arrays, Loop Compilation

I Introduction

AI workloads are growing rapidly. Particularly for applications running on embedded and edge platforms, power and energy budgets are hard to meet as computational requirements grow. Understanding and optimizing the energy impact of operations and of data movements through a processing architecture is therefore of utmost importance. Often, these workloads can be described adequately by nested loop programs, for which dedicated accelerator architectures have been proposed, known as massively parallel processor architectures, including CGRAs, TCPAs [6, 4, 18], and GPUs. For these, efficient compilation techniques exist that map loop nests so as to maximize performance. What is missing, however, is a highly accurate yet efficient analysis of the impact of mapping decisions on energy consumption. Measurement-based approaches are highly accurate but require a physical prototype, whereas simulation-based approaches at the register-transfer level (RTL) may be too time-consuming for an early-stage energy analysis, such as during a design space exploration (DSE). Moreover, the effect of tiling, scheduling, and mapping transformations on energy consumption has not yet been adequately studied for modern loop-intensive workloads. Accuracy and efficiency of evaluation are both necessary, yet conflicting, requirements for suitable analysis techniques.

Based on the above requirements, this paper presents an analytical framework for symbolic (parametric) evaluation of energy consumption of loop-intensive workloads when mapped and scheduled on processor array accelerator architectures. The framework is applicable to a broad class of accelerator architectures and can be used at early design stages, thus neither requiring a hardware prototype nor time-intensive simulations.

The presented analysis methodology takes a loop nest with parametric loop bounds and a space-time mapping to a target processor array of given size. Using a one-time classification of energy costs for different memory and register accesses, we show how parametric expressions can be derived for the number of accesses and operations during the execution of a loop nest without actually executing it. These volumes of accesses and operations can be obtained symbolically, i.e., for loop nests whose bounds remain parametric. By combining these symbolic counts with architecture-specific per-access energy weights, our framework enables fast, fully analytical energy estimation without requiring cycle-accurate simulation or physical prototype measurements. Whereas the approach can be applied to a broad class of parallel processor architectures, we present an evaluation study focusing on Tightly Coupled Processor Arrays (TCPAs) [6, 4, 18], which are massively parallel loop accelerator architectures that execute loop-intensive applications across grids of lightweight processing elements (PEs).

This paper is structured as follows: In Section II, we provide a review of existing work on energy estimation for massively parallel processor architectures. In Section III, we introduce a class of VLSI processor arrays called TCPA, define a notation for loop nests, and present the basic methodologies for mapping and scheduling loop nests to such architectures. In Section IV, we then describe our symbolic energy analysis approach, which is based on the efficient computation of volumes of polyhedral spaces for different types of memory accesses. Typical amounts of energy can be pre-characterized per access, thus providing a memory-centric evaluation of the overall energy consumption of a loop nest. We show that such volume computations need to be carried out only once for parametric loop bounds, i.e., symbolically. Section V gives evidence of the accuracy and speedup of this symbolic approach over a simulation-based approach. Finally, Section VI summarizes and concludes the paper.

II Related Work

Prior work on accelerator energy estimation largely assumes that execution behavior is fixed and known. Activity-based analytical models, such as AccelWattch [5], estimate energy by weighting cycle-level activity statistics obtained from detailed simulation or hardware counters. Trace-driven architectural models, including CGRA-EAM [21], estimate energy by evaluating execution traces over pre-characterized functional units and interconnect components. Component-level frameworks such as Accelergy [25] compute energy by associating architectural components with per-action energy costs and combining them with externally supplied action counts generated by analytical models, simulators, or mapping tools.

While effective for analyzing a fixed design and workload, these approaches share a common limitation: memory-access counts are evaluated only for explicitly specified workload instances. Even architecture-specific studies such as Eyeriss [3] and analytical frameworks such as Timeloop [8] derive access counts only after fixing loop bounds, tensor dimensions, and a concrete dataflow, tiling, or mapping. Consequently, when workload sizes or execution order changes, the analysis must be recomputed.

In the polyhedral compiler community, the integer set library (ISL) was introduced in [16], which efficiently computes the volume of parametric polyhedra using Ehrhart polynomials, based on Barvinok’s algorithm [1]. The authors of [11] apply this calculus to the analysis of cache effects on execution time in single-core architectures.

In this paper, we rather analyze the energy consumption of loop nests when executed on massively parallel processor arrays in which many tiny processing elements do not even carry a cache. For these, we show how to derive closed-form expressions relating tiling choices, execution schedules, and resource bindings to energy consumption, including operations and memory accesses. This calculus enables an efficient analytical evaluation of energy consumption that is parametric in the problem size (loop bounds). To the best of our knowledge, this is among the first approaches to symbolically analyze the energy consumption of nested-loop programs in dependence on space–time mapping decisions.

In the domain of processor array accelerators, coarse-grained reconfigurable arrays (CGRAs) [22, 7] represent a prominent class of architectures that employ an operation-centric mapping approach [19], where the operations from a data flow graph (DFG) are individually assigned to processing elements. For these, simulation is usually used to analyze performance and energy tradeoffs of a compiled loop nest. But obviously, this is not a scalable approach. In order to exploit scalable symbolic energy analysis techniques as introduced in this paper, a polyhedral iteration space representation is necessary. Whereas CGRA compilers usually operate at the granularity of individual operations of a single-dimensional loop, other classes of processor array architectures, such as tightly coupled processor arrays (TCPAs) [6, 4], start the compilation from a given polyhedral representation of a loop nest to determine schedules and mappings over tiled, multidimensional iteration spaces (iteration-centric mapping). Thus, although in general the symbolic approach described in the following could also be applied to many CGRA architectures from a hardware perspective, compilation approaches would only benefit when using parametric polyhedral loop descriptions.

III Loop Nests and their Mapping and Scheduling on Processor Arrays

This section introduces a class of massively parallel processor arrays known as TCPAs [6, 4], and outlines the core concepts of loop-nest representation, space–time mapping, and symbolic loop scheduling [12].

III-A Tightly Coupled Processor Arrays (TCPAs)

TCPAs [6, 4] are parallel processor array architectures designed to execute multidimensional loop applications represented as polyhedral recurrence equations, so-called piecewise regular algorithms (PRAs). A TCPA works alongside a host, such as a CPU or FPGA. In this section, we provide an overview of the TCPA architecture. Next, we introduce the PRA representation for expressing loop nests and explain how iteration spaces are partitioned, scheduled, and mapped for spatial execution.

Figure 1: Example of a tightly coupled processor array (TCPA) architecture [6].

TCPAs are two-dimensional arrays of programmable processing elements (PEs) connected by a configurable circuit-switched neighbor-to-neighbor interconnect, as shown in Figure 1. The PE grid is bordered by four input/output (I/O) buffers, each comprising multiple dedicated address generators (AG). The architecture also includes peripheral control units, a global controller (GC), and a configuration manager (CM), which orchestrates loop-program execution on the array. The TCPA operates as a memory-coupled coprocessor alongside a host system. The loop I/O controller schedules direct memory access (DMA) to transfer data between the host’s external memory and the I/O buffers. After the TCPA is configured by the host, it executes the loop program independently, while the loop I/O controller schedules DMA transfers during execution to support dynamic refilling.

TCPAs avoid costly direct PE-to-DRAM communication by transferring data between the host and the I/O buffers rather than having the PEs access external memory directly. During execution, all active data movement is restricted to the I/O buffers and the PE array [20]. This ensures predictable, high-bandwidth on-chip communication.

Each PE can be configured with multiple parallel functional units (FUs). Each FU has its own instruction memory, branch unit, and program counter. Input and output data are streamed between the I/O buffers and the PEs, while all intermediate values remain within the PE’s internal register hierarchy. Within each PE, the architecture supports multiple register classes, namely general-purpose, feedback, and input/output registers. The compiler [23] maps data to these register classes based on dependence structures and schedules. During execution, values produced and consumed by operations are stored and forwarded through these registers according to architectural constraints, and the circuit-switched interconnect supports inter-PE communication when dependencies span multiple PEs. The exact mapping rules are defined in the binding phase and discussed later.

III-B Loop-Nest Representation

Almost any standard programming language supports loop nests syntactically, e.g., C, C++, Java, and Python. In this paper, we assume loop nests described using a polyhedral notation called Piecewise Linear Algorithms (PLAs) [14, 15]. Normal sequential for-loop specifications can be converted to such a polyhedral description for parallelization. A PLA describes an $n$-dimensional loop nest by an iteration space $\mathcal{I}\subseteq\mathbb{Z}^{n}$, where each element $\mathbf{i}=(i_{0},i_{1},\dots,i_{n-1})^{\mathrm{T}}\in\mathcal{I}$, called iteration vector, corresponds to one loop iteration. In the following, we assume $\mathcal{I}$ is described by a polyhedral description $\mathcal{I}=\{\,\mathbf{i}\mid\mathbf{A}\mathbf{i}\geq\mathbf{b}\,\}$, where $\mathbf{A}\in\mathbb{Z}^{m\times n}$ and $\mathbf{b}\in\mathbb{Z}^{m}$. The computations of a loop nest are described by a set of quantified statements $S=\{\ldots,S_{q},\ldots\}$, with $S_{q}$ given by

$$S_{q}:\quad x_{q}[\mathbf{P}_{q}\mathbf{i}+\mathbf{f}_{q}]=F_{q}(\ldots,x_{q,r}[\mathbf{Q}_{q,r}\mathbf{i}-\mathbf{d}_{q,r}],\ldots)\quad\text{if }\mathbf{i}\in\mathcal{I}_{q}. \tag{1}$$

For all $\mathbf{i}\in\mathcal{I}\cap\mathcal{I}_{q}$ (with $\mathcal{I}_{q}$ called condition space), the variable $x_{q}$ at index (vector) $\mathbf{P}_{q}\mathbf{i}+\mathbf{f}_{q}$ is defined as the value of the function $F_{q}$ evaluated on variables $x_{q,r}$ at $\mathbf{Q}_{q,r}\mathbf{i}-\mathbf{d}_{q,r}$. We finally call a PLA a Piecewise Regular Algorithm (PRA) in the special case that $\mathbf{P}_{q}$ and $\mathbf{Q}_{q,r}$ are identity matrices and $\mathbf{f}_{q}$ is zero. As a result, a statement of a PRA has the following form:

$$S_{q}:\quad x_{q}[\mathbf{i}]=F_{q}(\ldots,x_{q,r}[\mathbf{i}-\mathbf{d}_{q,r}],\ldots)\quad\text{if }\mathbf{i}\in\mathcal{I}_{q}. \tag{2}$$

Note that in a PLA or PRA, there exists neither an explicit order of execution of iterations like in imperative loop programs, nor any order of execution of statements. Instead, each quantified statement only implies implicit schedule restrictions by so-called data dependencies. For example, in a PRA with a statement given in Eq. (2), the left-hand-side variable $x_{q}[\mathbf{i}]$ can only be evaluated after all variables $x_{q,r}[\mathbf{i}-\mathbf{d}_{q,r}]$ (arguments of $F_{q}$) on the right-hand side have been computed. We also say $x_{q}[\mathbf{i}]$ depends on $x_{q,r}[\mathbf{i}-\mathbf{d}_{q,r}]$, and call the constant vector $\mathbf{d}_{q,r}$ the dependence vector. Finally, instances of variables appearing only on the right-hand side of statements are called input variables. Similarly, instances of variables appearing only on the left-hand side of statements are called output variables. All other instances of variables are called internal variables.

Example 1

In the following, we introduce as a running example the nested loop program GESUMMV from the PolyBench [10] benchmark suite, which computes the sum of two matrix–vector products involving a vector $X\in\mathbb{Z}^{N_{1}}$ and matrices $A,B\in\mathbb{Z}^{N_{0}\times N_{1}}$:

$$Y[i_{0}]=\sum_{i_{1}=0}^{N_{1}-1}\bigl(A[i_{0},i_{1}]\cdot X[i_{1}]+B[i_{0},i_{1}]\cdot X[i_{1}]\bigr),\quad 0\leq i_{0}<N_{0}.$$

A corresponding 2-dimensional PRA can be formulated with an iteration space $\mathcal{I}=\{(i_{0},i_{1})\mid 0\leq i_{0}<N_{0},\;0\leq i_{1}<N_{1}\}$ and the following set of statements:

$$\begin{aligned}
S_{1}:&\quad x[i_{0},i_{1}]=X[i_{1}] &&\text{if } i_{0}=0\\
S_{2}:&\quad x[i_{0},i_{1}]=x[i_{0}-1,i_{1}] &&\text{if } i_{0}>0\\
S_{3}:&\quad a[i_{0},i_{1}]=A[i_{0},i_{1}]\cdot x[i_{0},i_{1}]\\
S_{4}:&\quad b[i_{0},i_{1}]=B[i_{0},i_{1}]\cdot x[i_{0},i_{1}]\\
S_{5}:&\quad s_{A}[i_{0},i_{1}]=a[i_{0},i_{1}] &&\text{if } i_{1}=0\\
S_{6}:&\quad s_{A}[i_{0},i_{1}]=s_{A}^{*}[i_{0},i_{1}]+a[i_{0},i_{1}] &&\text{if } i_{1}>0\\
S_{7}:&\quad s_{A}^{*}[i_{0},i_{1}]=s_{A}[i_{0},i_{1}-1] &&\text{if } i_{1}>0\\
S_{8}:&\quad s_{B}[i_{0},i_{1}]=b[i_{0},i_{1}] &&\text{if } i_{1}=0\\
S_{9}:&\quad s_{B}[i_{0},i_{1}]=s_{B}^{*}[i_{0},i_{1}]+b[i_{0},i_{1}] &&\text{if } i_{1}>0\\
S_{10}:&\quad s_{B}^{*}[i_{0},i_{1}]=s_{B}[i_{0},i_{1}-1] &&\text{if } i_{1}>0\\
S_{11}:&\quad Y[i_{0}]=s_{A}[i_{0},i_{1}]+s_{B}[i_{0},i_{1}] &&\text{if } i_{1}=N_{1}-1
\end{aligned}$$

In the above description, all instances of variables $A$, $B$, and $X$ are input variables. Similarly, all instances of variable $Y$ are output variables. Instances of $a$ and $b$ represent the element-wise products of $A\cdot X$ and $B\cdot X$. Instances $s_{A}$ and $s_{B}$ are partial sums accumulated along the dimension $i_{1}$ (statements $S_{6},S_{7}$ and $S_{9},S_{10}$), initialized in statements $S_{5},S_{8}$, respectively (at $i_{1}=0$). The outputs $Y[i_{0}]$ are obtained at $i_{1}=N_{1}-1$ (statement $S_{11}$).
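As a plausibility check of the PRA formulation, the following minimal Python sketch evaluates statements $S_1$ to $S_{11}$ one iteration at a time and compares the result with the direct summation formula. The function names `gesummv_pra` and `gesummv_direct` and the test data are illustrative assumptions and not part of the described tool flow. The lexicographic iteration order used here is just one order that respects the data dependencies; a PRA itself prescribes none.

```python
def gesummv_pra(A, B, X):
    """Evaluate statements S1..S11 iteration by iteration (i0 outer, i1 inner)."""
    N0, N1 = len(A), len(X)
    x, a, b, s_a, s_b = {}, {}, {}, {}, {}
    Y = [0] * N0
    for i0 in range(N0):
        for i1 in range(N1):
            # S1, S2: propagate X[i1] along dimension i0
            x[i0, i1] = X[i1] if i0 == 0 else x[i0 - 1, i1]
            # S3, S4: element-wise products
            a[i0, i1] = A[i0][i1] * x[i0, i1]
            b[i0, i1] = B[i0][i1] * x[i0, i1]
            if i1 == 0:
                # S5, S8: initialize the partial sums
                s_a[i0, i1] = a[i0, i1]
                s_b[i0, i1] = b[i0, i1]
            else:
                # S7, S10: fetch the previous partial sums
                s_a_star = s_a[i0, i1 - 1]
                s_b_star = s_b[i0, i1 - 1]
                # S6, S9: accumulate along dimension i1
                s_a[i0, i1] = s_a_star + a[i0, i1]
                s_b[i0, i1] = s_b_star + b[i0, i1]
            if i1 == N1 - 1:
                # S11: emit the output
                Y[i0] = s_a[i0, i1] + s_b[i0, i1]
    return Y


def gesummv_direct(A, B, X):
    """Reference: Y[i0] = sum over i1 of A[i0][i1]*X[i1] + B[i0][i1]*X[i1]."""
    return [sum(A[i0][i1] * X[i1] + B[i0][i1] * X[i1] for i1 in range(len(X)))
            for i0 in range(len(A))]


# Arbitrary 4 x 5 test data (an assumption for illustration only).
A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [0, 1, 0, 1, 0], [2, 2, 2, 2, 2]]
B = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
X = [1, 2, 3, 4, 5]
assert gesummv_pra(A, B, X) == gesummv_direct(A, B, X)
```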

Now, in order to map a polyhedral loop nest to a 1- or 2-dimensional array of processing elements, the given iteration space $\mathcal{I}$ is partitioned into as many tiles as there are processing elements available. Then, a feasible schedule of the iterations within a tile and between tiles needs to be found. This is explained in the following.

III-C Symbolic Tiling

According to [23, 14, 13], we partition the $n$-dimensional iteration space $\mathcal{I}$ into congruent rectangular tiles $\mathcal{J}$ by using a partitioning matrix $P=\mathrm{diag}(p_{0},p_{1},\ldots,p_{n-1})$ such that $\mathcal{I}\subseteq\mathcal{J}\oplus P\mathcal{K}$. A tile $\mathcal{J}$ can be described as follows:

$$\mathcal{J}=\bigl\{\,\mathbf{j}=(j_{0},\ldots,j_{\ell},\ldots,j_{n-1})^{\mathrm{T}}\mid 0\leq j_{\ell}<p_{\ell}\,\bigr\}. \tag{3}$$

Similarly, the set of non-empty tile origins $\mathcal{K}$ can be described as:

$$\mathcal{K}=\bigl\{\,\mathbf{k}=(k_{0},\ldots,k_{\ell},\ldots,k_{n-1})^{\mathrm{T}}\mid 0\leq k_{\ell}<t_{\ell}\,\bigr\}. \tag{4}$$
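The decomposition of an iteration vector into an intra-tile part and a tile origin can be sketched in a few lines of Python. The helper names `tile_counts`, `to_tile_coords`, and `from_tile_coords` are hypothetical. The sketch assumes rectangular tiles with positive sizes $p_\ell$ and an iteration space that starts at the origin, so that $\mathbf{i}=\mathbf{j}+P\mathbf{k}$ with $\mathbf{j}\in\mathcal{J}$ and $\mathbf{k}\in\mathcal{K}$.

```python
def tile_counts(N, p):
    """Number of tiles t_l per dimension needed to cover N_l iterations."""
    return tuple(-(-n_l // p_l) for n_l, p_l in zip(N, p))  # ceiling division


def to_tile_coords(i, p):
    """Split i into (j, k) such that i = j + P k with 0 <= j_l < p_l."""
    j = tuple(i_l % p_l for i_l, p_l in zip(i, p))
    k = tuple(i_l // p_l for i_l, p_l in zip(i, p))
    return j, k


def from_tile_coords(j, k, p):
    """Inverse mapping: recover the original iteration vector i = j + P k."""
    return tuple(j_l + p_l * k_l for j_l, k_l, p_l in zip(j, k, p))


# A 4 x 5 iteration space with tiles of size 2 x 3.
N, p = (4, 5), (2, 3)
t = tile_counts(N, p)
j, k = to_tile_coords((3, 4), p)
assert from_tile_coords(j, k, p) == (3, 4)
```

Because the tiles must cover the iteration space, the last tile in a dimension may be only partially filled whenever $p_\ell$ does not divide $N_\ell$.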

The vector $T=(t_{0},\ldots,t_{\ell},\ldots,t_{n-1})$ describes the number of tiles along each dimension $\ell$. After tiling, each original dependence vector $\mathbf{d}\in\mathbb{Z}^{n}$ of a statement as shown in Eq. (2) is decomposed into an intra-tile dependence vector ($\mathbf{d}_{J}$) and an inter-tile dependence vector ($\mathbf{d}_{K}$), and all instances of variables are embedded into the $2n$-dimensional space with dependence vector $(\mathbf{d}_{J},\mathbf{d}_{K})^{\mathrm{T}}$. Without loss of generality, we can split the statement $S_{q}$ in Eq. (2) into the two equations below [14]:

$$S_{q}:\quad x_{q}[\mathbf{j},\mathbf{k}]=F_{q}(\ldots,\,x_{q,r}^{\ast}[\mathbf{j},\mathbf{k}],\,\ldots)\quad\text{if }\mathbf{j}+P\mathbf{k}\in\mathcal{I}_{q} \tag{5}$$

$$S_{q}^{\ast}:\quad x_{q,r}^{\ast}[\mathbf{j},\mathbf{k}]=x_{q,r}[\mathbf{j}-d-P\gamma,\,\mathbf{k}+\gamma]\quad\text{if }\mathbf{j}+P\mathbf{k}\in\mathcal{I}_{q}\;\wedge\;\mathbf{j}-d-P\gamma\in\mathcal{J}. \tag{6}$$

Whereas the dependencies of the transformed statement $S_{q}$ are all zero, we need to create one statement of the form $S_{q}^{\ast}$ for each solution $\gamma$ that satisfies the set of inequalities

$$\{\gamma\in\mathbb{Z}^{n}:-e<\gamma+P^{-1}d<e\} \tag{7}$$

with $e=(1\;1\,\cdots\,1)^{\mathrm{T}}$.

Figure 2: Visualization of the tiled iteration space of the GESUMMV benchmark for an iteration space of size $N_{0}\times N_{1}=4\times 5=20$ on a processor array of size $t_{0}\times t_{1}=2\times 2=4$ PEs and tiles of size $p_{0}\times p_{1}=2\times 3$. The dependencies are also indicated by arrows. Green arrows denote I/O buffer accesses.

Example 2

In the following, we tile the iteration space of the GESUMMV PRA listing introduced in Example 1. Assume a matrix of size $N_{0}\times N_{1}=4\times 5$ to be mapped onto a $2\times 2$ target processor array. Figure 2 provides a visual illustration of the tiling into congruent tiles $\mathcal{J}=\bigl\{\,\mathbf{j}=(j_{0},j_{1})^{\mathrm{T}}\mid 0\leq j_{\ell}<p_{\ell}\,\bigr\}$ of size $p_{0}\times p_{1}=2\times 3=6$ iterations. This tile size was chosen to partition the iteration space into exactly as many tiles as there are PEs in each dimension of the processor array. Thus, the resulting set $\mathcal{K}$ of tile origins is $\mathcal{K}=\bigl\{\,\mathbf{k}=(k_{0},k_{1})^{\mathrm{T}}\mid 0\leq k_{\ell}<t_{\ell}\,\bigr\}$ with $t_{0}=t_{1}=2$, thus a total of $t_{0}\times t_{1}=2\times 2=4$ tiles. To explain the transformation of data dependencies due to tiling, consider statement $S_{7}$ as an example. According to Eq. (6), we obtain the following transformed statement $S_{7}^{\ast}$:

$$S_{7}^{\ast}:\quad s_{A}^{\ast}[\mathbf{j},\mathbf{k}]=s_{A}[\mathbf{j}-(0,1)^{\mathrm{T}}-P\gamma,\;\mathbf{k}+\gamma]\quad\text{if }j_{1}+p_{1}k_{1}>0\;\wedge\;\mathbf{j}-(0,1)^{\mathrm{T}}-P\gamma\in\mathcal{J},$$

for which we can find two solutions for the vector $\gamma$ that satisfy Eq. (7): $\gamma\in\left\{(0,0)^{\mathrm{T}},\ (0,-1)^{\mathrm{T}}\right\}$. The final new equations $S_{7}^{\ast 1}$ and $S_{7}^{\ast 2}$ that replace $S_{7}^{\ast}$ are thus:

$$S_{7}^{\ast 1}:\quad s_{A}^{\ast}[\mathbf{j},\mathbf{k}]=s_{A}[\mathbf{j}-(0,1)^{\mathrm{T}},\;\mathbf{k}]\quad\text{if }j_{1}+p_{1}k_{1}>0\;\wedge\;\mathbf{j}-(0,1)^{\mathrm{T}}\in\mathcal{J}.$$

$$S_{7}^{\ast 2}:\quad s_{A}^{\ast}[\mathbf{j},\mathbf{k}]=s_{A}[\mathbf{j}-(0,1-p_{1})^{\mathrm{T}},\;\mathbf{k}+(0,-1)^{\mathrm{T}}]\quad\text{if }j_{1}+p_{1}k_{1}>0\;\wedge\;\mathbf{j}-(0,1-p_{1})^{\mathrm{T}}\in\mathcal{J}.$$

The two new resulting dependence vectors

$$\mathbf{d}_{7}^{\ast}\!\bigl((0,1)^{\mathrm{T}}\bigr)=\left\{(0,1,0,0)^{\mathrm{T}},\;(0,1-p_{1},0,1)^{\mathrm{T}}\right\}$$

are also visualized in Figure 2. The first vector (shown in yellow) corresponds to an intra-tile dependence and leads to an intra-processor memory access, whereas the second (shown in orange) represents an inter-tile dependence that leads to an inter-processor memory access and communication, as we exploit later in our energy analysis.
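The enumeration of the solutions $\gamma$ of Eq. (7) and of the resulting $2n$-dimensional dependence vectors can be sketched as follows. This is a minimal illustration under the assumption of a diagonal partitioning matrix with positive tile sizes; the function names are hypothetical, and the condition $\mathbf{j}-d-P\gamma\in\mathcal{J}$ of Eq. (6), which selects the iterations where each vector applies, is not modeled here.

```python
from itertools import product


def gamma_solutions(d, p):
    """All integer gamma with -1 < gamma_l + d_l / p_l < 1 in every dimension l."""
    per_dim = []
    for d_l, p_l in zip(d, p):
        bound = abs(d_l) // p_l + 2  # safe search window around zero
        # Multiply the inequality by p_l > 0 to stay in integer arithmetic.
        per_dim.append([g for g in range(-bound, bound + 1)
                        if -p_l < p_l * g + d_l < p_l])
    return [tuple(g) for g in product(*per_dim)]


def tiled_dependences(d, p):
    """Map each gamma to the vector (d_J, d_K) = (d + P*gamma, -gamma)."""
    result = []
    for gamma in gamma_solutions(d, p):
        d_j = tuple(d_l + p_l * g_l for d_l, p_l, g_l in zip(d, p, gamma))
        d_k = tuple(-g_l for g_l in gamma)
        result.append(d_j + d_k)
    return result


# Dependence d = (0, 1) of statement S7 with tile sizes p0 = 2, p1 = 3.
solutions = gamma_solutions((0, 1), (2, 3))
vectors = tiled_dependences((0, 1), (2, 3))
```

For $p_1=3$, the inter-tile vector $(0,1-p_{1},0,1)^{\mathrm{T}}$ evaluates to $(0,-2,0,1)^{\mathrm{T}}$.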

Now, in order to execute the iterations within each tile and to respect data dependencies both within a tile and across tiles, a feasible schedule must be determined. This is explained in the following.

III-D Symbolic Scheduling

Whereas tiling defines the mapping of iterations and their corresponding computations to PEs, we still need to find execution schedules that minimize execution time. A schedule assigns to each variable $x_{q}$ defined in a statement $S_{q}$ a start time $t_{q}(J,K)$ at which the operation $F_{q}$ is computed for each iteration $(J,K)$. According to [12, 23], iterations within a tile must be executed in either a sequential or a pipelined (modulo-scheduled) order with an initiation interval $\pi$ between two iterations. Different tiles are scheduled in parallel across a PE array. Such an execution corresponds to a locally sequential, globally parallel (LSGP) modulo schedule. Moreover, the regular computations of loop nests can be efficiently scheduled using linear schedules in which the schedule of iterations within a tile is described by an intra-tile schedule vector $\boldsymbol{\lambda}^{J}$. Similarly, the start time of tiles is described by a so-called inter-tile schedule vector $\boldsymbol{\lambda}^{K}$. According to [24], schedules $(\boldsymbol{\lambda}^{J},\boldsymbol{\lambda}^{K})$ that minimize the global latency $L$ of a loop nest for a given initiation interval $\pi$ can be determined efficiently, with $L$ given as follows:

$$L\;=\;\boldsymbol{\lambda}^{J}\begin{pmatrix}p_{0}-1\\ \vdots\\ p_{n-1}-1\end{pmatrix}\;+\;\boldsymbol{\lambda}^{K}\begin{pmatrix}t_{0}-1\\ \vdots\\ t_{n-1}-1\end{pmatrix}\;+\;L_{c}. \tag{8}$$

The first term determines the maximum difference of start times of any pair of iterations within a tile. The second term determines the maximum difference of start times between any two tiles. Finally, $L_{c}=\max_{1\leq q\leq|S|}(\tau_{q}+w_{q})$ describes the latency of a single iteration within a tile, where $\tau_{q}$ is the start time and $w_{q}$ the execution latency of operation $F_{q}$ in statement $S_{q}$.

Example 3

Consider the tiling of the iteration space of the GESUMMV program introduced in Example 2. For this application, and assuming an initiation interval $\pi=1$, we can find the symbolic intra-tile schedule $\boldsymbol{\lambda}^{J}=(1,p_{0})^{\mathrm{T}}$ and the inter-tile schedule $\boldsymbol{\lambda}^{K}=(p_{0},p_{0}\cdot(p_{1}-1)+1)^{\mathrm{T}}$, minimizing the global latency given by $L=\boldsymbol{\lambda}^{J}\cdot(p_{0}-1,p_{1}-1)^{\mathrm{T}}+\boldsymbol{\lambda}^{K}\cdot(t_{0}-1,t_{1}-1)^{\mathrm{T}}+L_{c}$ with $L_{c}=4$, assuming each $F_{q}$ has a latency of $w_{q}=1$. This leads to $L=(p_{0}p_{1}-1)+p_{0}(t_{0}-1)+(p_{0}(p_{1}-1)+1)(t_{1}-1)+4$. In the running example with $p_{0}=2$, $p_{1}=3$, and $t_{0}=t_{1}=2$, we obtain $L=5+7+4=16$.
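The latency formula of Eq. (8) is straightforward to evaluate once concrete tile sizes and tile counts are inserted. The following Python sketch reproduces the numbers of this example; the function names are hypothetical, and the schedule vectors and $L_c=4$ are taken from the example as given rather than derived.

```python
def global_latency(lam_j, lam_k, p, t, l_c):
    """Eq. (8): L = lam_J . (p - 1) + lam_K . (t - 1) + L_c."""
    intra = sum(l * (p_l - 1) for l, p_l in zip(lam_j, p))  # within a tile
    inter = sum(l * (t_l - 1) for l, t_l in zip(lam_k, t))  # between tiles
    return intra + inter + l_c


def gesummv_schedule(p):
    """Schedule vectors of the GESUMMV example for initiation interval 1."""
    p0, p1 = p
    lam_j = (1, p0)
    lam_k = (p0, p0 * (p1 - 1) + 1)
    return lam_j, lam_k


p, t = (2, 3), (2, 2)
lam_j, lam_k = gesummv_schedule(p)
latency = global_latency(lam_j, lam_k, p, t, l_c=4)
assert latency == 16  # 5 (intra-tile) + 7 (inter-tile) + 4 (L_c)
```

Since the schedule vectors are expressed in terms of $p_0$ and $p_1$, the same two functions can be reused for any other tile size or array size without rescheduling by hand.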

IV Symbolic Loop Nest Energy Analysis

In this section, we introduce our symbolic energy analysis methodology for loop nests when executed on processor arrays. The methodology starts by analyzing both the computational energy (for the execution of loop statements of type $S_{q}$ in Eq. (5)) and the data transport energy for any data movement (by analyzing loop statements of type $S_{q}^{\ast}$ in Eq. (6)), distinguishing transfers across the chip boundary, e.g., from and to DRAM, from transfers inside the processor array. This first part of the analysis delivers a formula for the energy consumption of each loop statement per loop iteration. Subsequently, these energy-by-statement expressions are multiplied by the number of iterations in which each statement is executed to deliver a final overall energy estimate. This step relies on the efficient calculation of volumes of polyhedral spaces related to the size of the condition space of each loop statement. We demonstrate that these volumes can be computed symbolically for parametric loop bounds. Hence, the energy analysis needs to be carried out only once, and a scalability analysis then reduces to inserting a set of concrete loop bound values into the derived volume formulae.
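To illustrate what such a volume is, the following Python sketch counts the iterations in the condition spaces of some GESUMMV statements from Example 1 by brute-force enumeration and compares the counts with closed-form polynomials in $N_0$ and $N_1$. The closed forms below were written down by hand for this rectangular iteration space and are an assumption of this sketch; they are not the output of the symbolic tool, and all names are hypothetical.

```python
def count_points(n0, n1, condition):
    """Brute-force volume: iterations of the N0 x N1 space meeting a condition."""
    return sum(1 for i0 in range(n0) for i1 in range(n1)
               if condition(i0, i1, n0, n1))


# Condition spaces of selected statements of the untiled GESUMMV PRA.
CONDITIONS = {
    "S1": lambda i0, i1, n0, n1: i0 == 0,
    "S2": lambda i0, i1, n0, n1: i0 > 0,
    "S3": lambda i0, i1, n0, n1: True,
    "S6": lambda i0, i1, n0, n1: i1 > 0,
    "S11": lambda i0, i1, n0, n1: i1 == n1 - 1,
}

# Hand-derived parametric counts, valid for N0 >= 1 and N1 >= 1.
CLOSED_FORMS = {
    "S1": lambda n0, n1: n1,
    "S2": lambda n0, n1: (n0 - 1) * n1,
    "S3": lambda n0, n1: n0 * n1,
    "S6": lambda n0, n1: n0 * (n1 - 1),
    "S11": lambda n0, n1: n0,
}

for name, condition in CONDITIONS.items():
    # Enumeration cost grows with N0 * N1; the closed form is constant time.
    assert count_points(4, 5, condition) == CLOSED_FORMS[name](4, 5)
```

The enumeration has to be repeated for every problem size, whereas each closed form is derived once and then evaluated in constant time, which is the scalability argument made above.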

IV-A Energy-by-Statement Analysis

Without loss of generality, we can assume that any PRA loop statement as described in the most general form in Eq. (1) can be split into one statement $S_{q}$ whose right-hand side (RHS) arguments all have zero dependence vectors and which contains the computation expressed by a function $\mathcal{F}_{q}$, as described in Eq. (5), and statements that just relate each RHS variable with variables at displacements, i.e., statements of type $S_{q}^{\ast}$ in Eq. (6). This eases the description of the following energy analysis by separating the analysis of computational energy from the analysis of memory-related energy. In the following, we denote by $C$ the set of computational statements and by $M$ the set of memory statements.

Example 4

In the running example introduced in Ex. 1, all 11 statements already adhere to the above form with $C=\{S_{3},S_{4},S_{6},S_{9},S_{11}\}$ denoting statements with computations and $M=\{S_{1},S_{2},S_{5},S_{7},S_{8},S_{10}\}$ being the set of memory-related statements.

A processor array, as shown in Figure 2, comprises a memory system consisting of six different memory types $\tau\in\mathrm{T}$ as classified in Table I: on-chip I/O buffers ($\mathrm{IOb}$) at the periphery, through which data transfers for fetching and storing tensor I/O data from/to a host DRAM ($\mathrm{DR}$) are performed explicitly via DMA operations. Because DRAM accesses are the most expensive, the loop mapping strategy should avoid moving any input or output variable instance of a loop nest across the chip boundary more than once. Moreover, within each PE, we distinguish four classes of registers with the following use: general-purpose registers ($\mathrm{RD}$) for handling intra-iteration dependencies, feedback registers ($\mathrm{FD}$) for temporarily storing variables with inter-iteration dependencies that are processed locally within a PE, and finally input/output registers ($\mathrm{ID}$ and $\mathrm{OD}$) used to receive data from, or transmit data to, neighboring PEs via communication ports. A breakdown of typical energies per access for the different types of memory accesses $\mathrm{T}=\{\mathrm{RD},\mathrm{FD},\mathrm{ID},\mathrm{OD},\mathrm{IOb},\mathrm{DR}\}$ and for some basic arithmetic operations is given in Table I for a 45 nm technology [9].

TABLE I: Energy related to memory accesses/operations in 45 nm technology [9].

| Category | Memory class / operation type | Energy $E$ [pJ] |
|---|---|---|
| Register files | General-purpose register ($\mathrm{RD}$) | 0.12 |
| Register files | Feedback register ($\mathrm{FD}$) | 0.35 |
| Register files | Input register ($\mathrm{ID}$) | 0.24 |
| Register files | Output register ($\mathrm{OD}$) | 0.12 |
| Buffers / off-chip access | I/O buffer ($\mathrm{IOb}$) | 16 |
| Buffers / off-chip access | DRAM ($\mathrm{DR}$) | 1280 |
| Arithmetic operations | Addition ($\mathrm{add}$) | 0.36 |
| Arithmetic operations | Multiplication ($\mathrm{mul}$) | 1.24 |

IV-A1 Energy Computational Statements

We start with the analysis of energy related to computational statements $S_{q}\in C$. Let $L:x\rightarrow\mathrm{T}$ denote the function that returns the memory class of a variable $x$ occurring either on the LHS (write location) or the RHS (read locations). Then, the computational energy for one loop iteration of such a statement $S_{q}$ can be estimated by:

$$E_{q}^{C}=\sum_{r=1}^{R}E(L(x_{q,r}^{*}))+E(\mathcal{F}_{q})+E(L(x_{q})) \tag{9}$$

Explanation: The first term sums up the energy needed to read each input location according to the RHS of the statement $S_{q}$, the middle term denotes the energy needed to compute the function $\mathcal{F}_{q}$, and the right term denotes the energy needed to write this result into the location $L(x_{q})\in\mathrm{T}$.
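A minimal Python sketch of Eq. (9) is given below, using the per-access energies of Table I. The function name `computational_energy` is hypothetical, and the operand classes in the example (two reads from general-purpose registers and a write to a general-purpose register) are an assumption chosen purely for illustration; the memory classes of the operands of an actual statement follow from its dependence vectors.

```python
import math

# Per-access and per-operation energies in pJ, copied from Table I.
ENERGY_PJ = {
    "RD": 0.12, "FD": 0.35, "ID": 0.24, "OD": 0.12,
    "IOb": 16.0, "DR": 1280.0,
    "add": 0.36, "mul": 1.24,
}


def computational_energy(read_classes, operation, write_class):
    """Eq. (9): energy of all RHS reads + the operation + the LHS write."""
    reads = sum(ENERGY_PJ[c] for c in read_classes)
    return reads + ENERGY_PJ[operation] + ENERGY_PJ[write_class]


# Hypothetical multiplication with two RD operands and an RD result:
# 0.12 + 0.12 (reads) + 1.24 (mul) + 0.12 (write) = 1.60 pJ per iteration.
e_mul = computational_energy(["RD", "RD"], "mul", "RD")
assert math.isclose(e_mul, 1.60)
```

In this configuration the multiplication itself accounts for most of the energy per iteration, whereas a single operand fetched from DRAM would exceed it by roughly three orders of magnitude.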

To systematically capture these dependencies, our analysis tool uses an internal representation called reduced dependence graph (RDG), a directed multigraph where nodes represent inputs, outputs, and computations, and edges capture data dependencies between them over the iteration space. The RDG of a computational statement after tiling as given in Eq. (5) can be represented as shown in Figure 3.

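Figure 3: RDG of a computational statement $S_{q}$ as given in Eq. (5), with nodes for the arguments $x_{q,r}^{*}$, the function $\mathcal{F}_{q}$, and the result $x_{q}$, and edges annotated with the dependence vectors $d_{j},d_{k}$.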

IV-A2 Energy Memory-Related Statements

For a statement $S_{q}^{*}\in M$, we estimate the energy per statement similarly:

$$E_{q}^{M}=E(L(x_{q,r}))+E(L(x_{q,r}^{*})) \tag{10}$$

Explanation: The first term calculates the energy needed to read the right-hand-side variable $x_{q,r}$ of statement $S_{q}^{*}$, and the second term denotes the energy needed to copy its value to location $L(x_{q,r}^{*})\in\mathrm{T}$. Finally, holding for both types of statements, the energy of a read or write access of any variable $x$ depends on its location $L(x)\in\mathrm{T}$ as follows:

$$E(L(x))=\begin{cases}
E(\mathrm{DR})+E(\mathrm{IOb})+E(\mathrm{ID}) & \text{if } x\in\{I\}\\
E(\mathrm{DR})+E(\mathrm{IOb})+E(\mathrm{OD}) & \text{elsif } x\in\{O\}\\
E(\mathrm{RD}) & \text{elsif } \mathbf{d}_{j}=\mathbf{d}_{k}=\mathbf{0}\\
E(\mathrm{FD}) & \text{elsif } \mathbf{d}_{j}\neq\mathbf{0}\wedge\mathbf{d}_{k}=\mathbf{0}\\
E(\mathrm{ID}) & \text{else } (\mathbf{d}_{k}\neq\mathbf{0})
\end{cases}$$

Explanations: The first two cases hold when $x$ is an input variable or an output variable, respectively. Here, the energy accounts for transferring the variable between DRAM and an I/O buffer and between the I/O buffer and an input or output register of a PE.

Example 5

In the GESUMMV kernel introduced in Example 1, $A$, $B$, and $X$ are the names of input variables (appearing in statements $S_{1}$, $S_{3}$, and $S_{4}$). Therefore, the execution of any statement involving an instance of such tensors requires a transport of the related variable instance from DRAM to an I/O buffer, and from there to an input register of a PE, as indicated by the incoming green arrows at the left and top of the PE grid in Figure 2. Similarly, variable $Y$ in statement $S_{11}$ is an output variable; each instance of it is written to an output register and then over an I/O buffer back to the DRAM, as illustrated by the outgoing green arrows at the right of the PE grid in Figure 2.

Otherwise, the location $L$ of a variable $x$ is a register ($\mathrm{RD}$), if the data is stored tile-locally (zero dependence vectors).

Example 6

In the GESUMMV example, statements $S_{5}$ and $S_{8}$ are examples of local register accesses ($\mathrm{RD}$). These PE-local data accesses are illustrated by circular blue arrows within each iteration node in Figure 2.

Otherwise, if only the intra-tile dependence vector $\mathbf{d}_{j}$ is non-zero, the variable might get used again in the very same PE later and is thus stored in an $\mathrm{FD}$ register.

Example 7

In the GESUMMV example, statements $S_{2}$, $S_{7}$, and $S_{10}$ are examples of transport statements that, after tiling, produce a statement in which the left-hand-side variable is indexed by $(j_{0},j_{1},k_{0},k_{1})$ and is assigned the value of a right-hand-side variable indexed by $(j_{0},j_{1}-1,k_{0},k_{1})$. Hence, $\mathbf{d}_{j}\neq 0\wedge\mathbf{d}_{k}=0$ holds, and the RHS variable will thus be stored in an $\mathrm{FD}$ register for PE-internal re-use. These tile-local data dependencies are illustrated by the yellow arrows in Figure 2.

Else, the data comes from a different tile, and thus from a different PE, and must be read into an $\mathrm{ID}$ register.

Example 8

For the given GESUMMV example, inter-tile dependencies may arise due to tiling. For example, the statement $S_{7}^{*}$ described in Example 2 is split into two statements, of which the second has the dependence vector $\mathbf{d}^{\ast}=(0,1-p_{1},0,1)^{\mathrm{T}}$. This dependence represents an inter-tile communication along the $k_{1}$-direction. For an iteration vector $(j_{0},j_{1},k_{0},k_{1})^{\mathrm{T}}$, the data thus comes from iteration $(j_{0},j_{1},k_{0},k_{1}-1)$, corresponding to the left neighbor tile (PE). The data is therefore stored in an $\mathrm{ID}$ register, as is illustrated by the orange arrows in Figure 2.
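The case distinction for $E(L(x))$ and Examples 5 to 8 can be sketched in Python as follows. This is a minimal illustration and not the implementation of the analysis tool: the function names `access_components` and `access_energy` are hypothetical, and the energy table contains only the RD, FD, and ID values that appear later in Example 9 (assumed to be in pJ). The DR, IOb, and OD entries are intentionally left out and would have to be taken from Table I.

```python
def access_components(d_j, d_k, is_input=False, is_output=False):
    """Return the energy components of one access, following the case
    distinction for E(L(x)): I/O variables first, then dependence vectors."""
    if is_input:
        return ("DR", "IOb", "ID")
    if is_output:
        return ("DR", "IOb", "OD")
    if not any(d_j) and not any(d_k):   # d_j = d_k = 0: tile-local register
        return ("RD",)
    if not any(d_k):                    # d_j != 0 and d_k = 0: intra-tile reuse
        return ("FD",)
    return ("ID",)                      # d_k != 0: data from a neighboring PE


def access_energy(components, table):
    """Sum the per-component energies looked up in the given table."""
    return sum(table[c] for c in components)


# Values used in Example 9 (assumed pJ); I/O-related entries omitted on purpose.
TABLE = {"RD": 0.12, "FD": 0.35, "ID": 0.24}

intra = access_components(d_j=(0, 1), d_k=(0, 0))      # as in Example 7
inter = access_components(d_j=(0, 1 - 3), d_k=(0, 1))  # Example 8 with p_1 = 3
print(intra, access_energy(intra, TABLE))
print(inter, access_energy(inter, TABLE))
```

Returning the component names instead of a single number keeps the classification separate from the technology-dependent energy values, so the same classification can be reused with a different energy table.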

IV-B Total Energy Evaluation

With the above analysis of energy estimates per statement and iteration, we can finally derive estimates of the total energy $E_{\mathrm{tot}}$ for the execution of a parametric loop nest by summing up the energies per statement after tiling, each multiplied by the volume $\operatorname{Vol}(S_{q})$ (the number of integer points) of the polyhedral space where statement $S_{q}$ is defined:

$$E_{\mathrm{tot}}=\sum_{S_{q}\in C}\operatorname{Vol}(S_{q})\cdot E_{q}^{C}+\sum_{S_{q}\in M}\operatorname{Vol}(S_{q})\cdot E_{q}^{M}.$$ (11)

IV-C Symbolic Volume Computation

In the previous section, we provided a generic energy analysis per statement of a loop nest and a formula for the total energy for executing a given loop nest by multiplying these estimates by the number of integer points in the corresponding parametric polyhedra. For computational statements $S_{q}\in C$ as described in Eq. (5), the number of iterations after tiling is obtained from the tiled iteration space:

$$\operatorname{Vol}(S_{q})=\big|\big\{\mathbf{i}\mid\mathbf{i}=\mathbf{j}+P\mathbf{k}\land\mathbf{i}\in\mathcal{I}_{q}\cap\mathcal{I}\big\}\big|$$ (12)

For memory-related statements of type $S_{q}^{*}\in M$ as described in Eq. (6), the execution count of the associated memory accesses additionally depends on the dependencies $\mathbf{d}_{j}$ and $\mathbf{d}_{k}$. Therefore, for $S_{q}^{*}\in M$, the volume is given by

$$\operatorname{Vol}(S_{q}^{\ast})=\big|\big\{\mathbf{i}\mid\mathbf{i}=\mathbf{j}+P\mathbf{k}\land\mathbf{j}-\mathbf{d}_{j}\in\mathcal{J}\land\mathbf{i}\in\mathcal{I}_{q}\cap\mathcal{I}\big\}\big|$$ (13)

The computation of such volumes corresponds to counting the number of integer points in parametric polyhedra. This can be performed symbolically using Barvinok’s algorithm [2] as implemented in the Integer Set Library (ISL) [17]. The resulting volumes are returned as piecewise quasi-polynomial functions of the loop bounds $N_{i}$ and tile size parameters $p_{i}$, which can then be inserted into Eq. (11) to determine the total energy consumption $E_{\mathrm{tot}}$ of all statements as executed in a given loop nest.

Example 9

To illustrate the symbolic volume computation, we consider statement $S_{7}^{\ast}$ of the GESUMMV kernel introduced in Example 2. The transformed statement is decomposed into two cases based on the location of the source operand. The iteration spaces of $S_{7}^{\ast 1}$ and $S_{7}^{\ast 2}$ are given as follows with $t_{i}=\lceil N_{i}/p_{i}\rceil$:

$$\mathcal{I}_{7}^{\ast 1}=\left\{(j_{0},j_{1},k_{0},k_{1})\in\mathbb{Z}^{4}\;\middle|\;\begin{aligned} &0\leq j_{0}<p_{0}\land 0\leq j_{1}<p_{1}\land\\ &0\leq k_{0}<t_{0}\land 0\leq k_{1}<t_{1}\land\\ &0\leq j_{0}+p_{0}k_{0}<N_{0}\land\\ &0<j_{1}+p_{1}k_{1}<N_{1}\land\\ &0\leq j_{1}-1<p_{1}\end{aligned}\right\}$$

$$\mathcal{I}_{7}^{\ast 2}=\left\{(j_{0},j_{1},k_{0},k_{1})\in\mathbb{Z}^{4}\;\middle|\;\begin{aligned} &0\leq j_{0}<p_{0}\land 0\leq j_{1}<p_{1}\land\\ &0\leq k_{0}<t_{0}\land 0\leq k_{1}<t_{1}\land\\ &0\leq j_{0}+p_{0}k_{0}<N_{0}\land\\ &0<j_{1}+p_{1}k_{1}<N_{1}\land\\ &0\leq j_{1}+p_{1}-1<p_{1}\end{aligned}\right\}$$

Note that the above inequalities contain non-linear terms such as the expression $\ldots\leq j+p\cdot k<\ldots$ (i.e., the product of the parameters $p$ and $k$). However, for a given (fixed) processor array size, $k$ (the processor element index in a given processor array dimension with a total of $t$ elements in this dimension) is bounded, i.e., $0\leq k<t$. Therefore, in practice, we can unfold the respective inequality constraints as follows: $\ldots\leq\{j+p\cdot 0,\; j+p\cdot 1,\;\dots,\; j+p(t-1)\}<\ldots$. (Footnote 1: Interestingly, the symbolic energy analysis time remains on the order of 1 minute, even for large processor arrays of size $50\times 50=2500$ processors.) By applying Barvinok’s algorithm [1], the integer-point count for $S_{7}^{\ast 1}$ and $S_{7}^{\ast 2}\in M$ is then obtained for the example of a $2\times 2$ processor array target as shown in Figure 2:

$$\operatorname{Vol}(S_{7}^{\ast 1})=|\mathcal{I}_{7}^{\ast 1}|=\begin{cases}4\,p_{0}(p_{1}-1)&\text{if }0<p_{0}\land 2p_{0}<N_{0}\land p_{1}\geq 2\land 2p_{1}<N_{1}\\ 2\,N_{0}(p_{1}-1)&\text{if }N_{0}>0\land 2p_{0}\geq N_{0}\land p_{1}\geq 2\land 2p_{1}<N_{1}\\ (2N_{1}-4)\,p_{0}&\text{if }0<p_{0}\land 2p_{0}<N_{0}\land p_{1}\leq N_{1}-2\land 2p_{1}\geq N_{1}\\ N_{0}(N_{1}-2)&\text{if }N_{0}>0\land 2p_{0}\geq N_{0}\land p_{1}\leq N_{1}-2\land 2p_{1}\geq N_{1}\\ 0&\text{otherwise}\end{cases}$$

The above expression corresponds to the intra-tile case, where the dependence is resolved locally within a tile.

$$\operatorname{Vol}(S_{7}^{\ast 2})=|\mathcal{I}_{7}^{\ast 2}|=\begin{cases}2\,p_{0}&\text{if }0<p_{0}<N_{0}/2\land 0<p_{1}<N_{1}\\ N_{0}&\text{if }N_{0}>0\land p_{0}\geq N_{0}/2\land 0<p_{1}<N_{1}\\ 0&\text{otherwise.}\end{cases}$$

This expression corresponds to the inter-tile case, where the dependence crosses tile boundaries. The volume remains fully parametric in the loop bounds $N_{i}$ and tile sizes $p_{i}$, and can thus be evaluated simply by inserting concrete values for the parameters $N_{0}$, $N_{1}$, $p_{0}$, and $p_{1}$. For example, for the values shown in Figure 2, with an iteration space of size $N_{0}\times N_{1}=4\times 5=20$ and tiles of size $2\times 3$, the resulting volumes are $\operatorname{Vol}(S_{7}^{\ast 1})=12$ and $\operatorname{Vol}(S_{7}^{\ast 2})=4$, corresponding to 12 intra-tile dependences (exactly matching the number of yellow arrows in the $j_{1}$ direction) and 4 inter-tile dependences (exactly matching the number of orange arrows in the $k_{1}$ direction).
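The two volumes stated for the configuration of Figure 2 can be cross-checked by brute force. The following Python sketch enumerates the integer points of the two iteration spaces of $S_{7}^{\ast}$ for a fixed $2\times 2$ array (which mirrors the unfolding over the bounded tile index $k$) and compares the counts with the piecewise expressions transcribed from above. All function names are hypothetical, and the comparison is only made for this single configuration; it is not a general validation of the closed-form expressions.

```python
from itertools import product


def count_s7(N0, N1, p0, p1, t0=2, t1=2, inter_tile=False):
    """Count integer points of I_7^{*1} (intra-tile) or I_7^{*2} (inter-tile)."""
    count = 0
    for j0, j1, k0, k1 in product(range(p0), range(p1), range(t0), range(t1)):
        if not (0 <= j0 + p0 * k0 < N0 and 0 < j1 + p1 * k1 < N1):
            continue
        # Last constraint of the sets: j1 - 1 (intra) or j1 + p1 - 1 (inter)
        # must be a valid intra-tile index.
        src = j1 + p1 - 1 if inter_tile else j1 - 1
        if 0 <= src < p1:
            count += 1
    return count


def vol_s7_intra(N0, N1, p0, p1):
    """Piecewise expression for Vol(S_7^{*1}) as printed for a 2x2 array."""
    if 0 < p0 and 2 * p0 < N0 and p1 >= 2 and 2 * p1 < N1:
        return 4 * p0 * (p1 - 1)
    if N0 > 0 and 2 * p0 >= N0 and p1 >= 2 and 2 * p1 < N1:
        return 2 * N0 * (p1 - 1)
    if 0 < p0 and 2 * p0 < N0 and p1 <= N1 - 2 and 2 * p1 >= N1:
        return (2 * N1 - 4) * p0
    if N0 > 0 and 2 * p0 >= N0 and p1 <= N1 - 2 and 2 * p1 >= N1:
        return N0 * (N1 - 2)
    return 0


def vol_s7_inter(N0, N1, p0, p1):
    """Piecewise expression for Vol(S_7^{*2}) as printed for a 2x2 array."""
    if 0 < p0 and 2 * p0 < N0 and 0 < p1 < N1:
        return 2 * p0
    if N0 > 0 and 2 * p0 >= N0 and 0 < p1 < N1:
        return N0
    return 0


cfg = dict(N0=4, N1=5, p0=2, p1=3)  # configuration of Figure 2
print(count_s7(**cfg), vol_s7_intra(**cfg))                   # 12 12
print(count_s7(**cfg, inter_tile=True), vol_s7_inter(**cfg))  # 4 4
```

The enumeration cost grows with the number of iterations, whereas evaluating the piecewise expressions takes constant time, which is the property the symbolic approach relies on.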

Using Eq. (10), the energy of $S_{7}^{\ast 1}$ is given by one FD read and one RD write,

$$E_{7^{\ast 1}}^{M}=E(\mathrm{FD})+E(\mathrm{RD})=0.35+0.12=0.47\,\mathrm{pJ}.$$

Similarly, the energy of $S_{7}^{\ast 2}$ is given by one ID read and one RD write,

$$E_{7^{\ast 2}}^{M}=E(\mathrm{ID})+E(\mathrm{RD})=0.24+0.12=0.36\,\mathrm{pJ}.$$

For the concrete configuration, the contribution of these two statements to $E_{\mathrm{tot}}$ in Eq. (11) evaluates to

$$\operatorname{Vol}(S_{7}^{\ast 1})\,E_{7^{\ast 1}}^{M}+\operatorname{Vol}(S_{7}^{\ast 2})\,E_{7^{\ast 2}}^{M}=12\cdot 0.47+4\cdot 0.36=7.08\,\mathrm{pJ}.$$

This energy corresponds to the total energy contribution of statement S7S_{7}^{\ast} for the given configuration. The overall kernel energy is obtained by summing the contributions of all statements according to Eq. (11).
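The arithmetic of this example can be reproduced with a few lines of Python. The sketch below evaluates Eq. (11) restricted to the two sub-statements of $S_{7}^{\ast}$, using the volumes and per-access energies given in this example (energies assumed to be in pJ). The function name `total_energy` and the list layout are illustrative choices.

```python
# Per-access energies as used in Example 9, assumed to be in pJ.
E_FD, E_ID, E_RD = 0.35, 0.24, 0.12

# (volume, per-instance energy) pairs for the configuration of Figure 2.
contributions = [
    (12, E_FD + E_RD),  # S_7^{*1}: one FD read + one RD write
    (4, E_ID + E_RD),   # S_7^{*2}: one ID read + one RD write
]


def total_energy(terms):
    """Eq. (11) restricted to the given statements: sum of Vol(S_q) * E_q."""
    return sum(vol * energy for vol, energy in terms)


e_s7 = total_energy(contributions)
print(f"{e_s7:.2f} pJ")  # prints "7.08 pJ"
```

Extending the list by one pair per remaining statement of the kernel would yield the overall kernel energy in the same way.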

V Experimental Results

This section evaluates the proposed symbolic energy analysis framework from two complementary perspectives. First, we validate the accuracy and analysis-time efficiency of the symbolic approach by deriving parametric volumes for both computations and memory operations, and comparing these with counts obtained from a cycle-accurate simulation of the execution of a loop nest. Second, we analyze energy and latency scaling with increasing problem sizes and show that the symbolic model provides a scalable basis for energy evaluation, making it particularly suitable to explore application-specific architecture sizing.

Figure 4: Comparison of symbolic and simulation-based analysis times for the GESUMMV benchmark on an 8×8 PE array across increasing matrix sizes. The symbolic approach remains nearly constant, while simulation time grows rapidly with problem size.

V-A Validation of Symbolic Volume Computation Results

We validate the accuracy of the symbolic volume computation results by comparing analytically derived integer-point counts for both data transfer and computation against reference counts obtained from a cycle-accurate simulator. The simulator operates using an XML-based architectural description that captures the entire TCPA architecture. Based on this description, the TCPA compiler [23] automatically tiles the input loop programs, maps the tiles onto the processor array, and schedules their execution. During simulation, all data transfers and computations are tracked, yielding exact reference counts. Each selected PolyBench kernel [10] is then simulated for a specified architectural configuration to obtain ground-truth memory access and computation counts.

Note that our analytical method can analyze the energy of a loop kernel symbolically without requiring concrete loop bounds. The resulting volumes are expressed as closed-form quasi-polynomials. In the following, we evaluate our approach across eight different benchmarks from [10] for varying problem sizes and architectural parameters. The analytically derived access counts and the resulting total energy values match the simulation results exactly, confirming the accuracy of the symbolic formulation.

Figure 4 compares the analysis time of simulation-based counting and the proposed symbolic method for the GESUMMV benchmark mapped onto an $8\times 8$ PE array. The simulation-based approach exhibits a rapid increase in analysis time with increasing matrix size, as it explicitly executes all loop iterations, instruction activities, and memory accesses within the cycle-accurate model. As the iteration space grows quadratically with the matrix dimension, the simulation cost scales accordingly.

In contrast, the symbolic approach evaluates one-time calculated closed-form expressions for computation and memory access statements, independent of the number of loop iterations. As a result, the analysis time remains almost constant at less than $0.5\,\mathrm{s}$ across all evaluated problem sizes. This decoupling of analysis cost from the dynamic execution makes the symbolic method scalable and suitable for exploring large loop bounds that are impractical for simulation-based analysis.

Figure 5: Energy $E_{\mathrm{tot}}$ and latency $L$ vs. matrix size for GEMM on an 8×8 PE grid TCPA, showing a shift from DRAM-dominated energy to increasing on-chip communication (FD/RD) at larger scales due to growing tile sizes.

V-B Energy and Latency Scaling Analysis

Since the proposed analysis derives closed-form expressions, it enables a fast evaluation for different loop bounds and an exploration of architectural configurations. In the following, we study the total energy $E_{\mathrm{tot}}$ and latency $L$ of a given loop nest with increasing loop bounds. Thereby, we provide a fine-grained breakdown of energy contributions across different access locations $L(x)\in\mathrm{T}$, revealing how memory accesses and computations contribute individually. As will be shown, computation energy remains relatively small compared to data transfer energy across all configurations. The total energy and latency are computed using Eqs. (11) and (8), respectively. For this analysis, we consider the GEMM kernel [10], where the iteration space grows cubically with problem size (i.e., $O(N^{3})$). As loop bounds increase, both computation and data movement grow rapidly. Figure 5 illustrates how total energy and latency increase with matrix size for GEMM, analytically evaluated for an $8\times 8$ PE grid. As expected, both metrics grow rapidly with increasing loop bounds due to the cubic growth of the iteration space.

The energy breakdown further reveals how different components scale. For smaller problem sizes, DRAM accesses dominate the total energy consumption. However, as the loop bounds increase, the relative contribution of DRAM energy decreases, while the energy associated with on-chip storage and communication (such as FD and RD registers), as well as computation, increases. This shift is primarily due to larger tile sizes, which increase intra-tile data reuse and consequently amplify activity within local storage locations. This growth is not strictly proportional to the loop bounds, as data accessed from different locations scales differently depending on tile sizes and data dependencies. In many existing energy estimation approaches, activity counts are obtained by simulating a fixed workload instance for smaller loop bounds and extrapolating to larger ones, since simulation does not scale to larger bounds. Our proposed symbolic analysis does not suffer from this scalability limitation.

Since not only the energy estimates are parametric but also the schedules, performance metrics such as latency, throughput, and thus also energy efficiency can be computed analytically. This paves the way for a rapid comparison of architectural configurations and supports DSE to identify suitable accelerator architectures for a huge number of loop applications.

VI Conclusion

This paper introduces a symbolic methodology for energy analysis of loop nests when mapped and scheduled on parallel processor array accelerator architectures. In contrast to simulation-based approaches, which require an explicit execution of all loop iterations, all memory accesses, and all operations, our method derives closed-form expressions for computations and memory accesses directly from the program representation and mapping. By combining symbolic volume computation of polyhedral spaces with pre-characterized energy margins per access and operation types, the approach enables a symbolic and accurate estimation of energy without requiring time-intensive cycle-accurate simulations for each setting of loop bounds.

The proposed analysis lifts energy evaluation from a simulation-based process to a polyhedral-model-based formulation that is applicable at early design stages. Once the symbolic expressions are derived, the total energy can be evaluated efficiently for different loop bounds and architecture configurations without repeated analysis. This is particularly important for loop-intensive applications, where simulation time increases rapidly with problem size, while the symbolic evaluation remains nearly constant, providing a scalable basis for design space exploration.

Our evaluations have demonstrated that our symbolic and cycle-accurate simulation-based energy analysis approaches match in their results, with the symbolic approach also providing a fine-grained view of the contributions of different types of memory and register accesses. This makes it possible to not only study the total energy, but also the influence of array size, mapping, scheduling, and data movement across different access locations.

Overall, the presented framework provides a practical and efficient approach to energy estimation for processor-array accelerators. By enabling fast symbolic evaluation of energy, latency, and related performance metrics from parametric loop bounds, it supports a fast analysis of architectural configurations and helps to identify suitable accelerator architectures for a given application.

References

  • [1] A. I. Barvinok (1994) A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed. Mathematics of Operations Research 19 (4), pp. 769–779.
  • [2] A. I. Barvinok (2002) A course in convexity. Graduate Studies in Mathematics, Vol. 54, American Mathematical Society. ISBN 978-0-8218-2968-4.
  • [3] Y. Chen, T. Krishna, J. S. Emer, and V. Sze (2017) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 52 (1), pp. 127–138.
  • [4] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche (2014) Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Trans. Embedded Comput. Syst. 13 (4s), pp. 133:1–133:29.
  • [5] V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas (2021) AccelWattch: A power modeling framework for modern GPUs. In Int. Symposium on Microarchitecture (MICRO), pp. 738–753.
  • [6] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich (2006) A highly parameterizable parallel processor array architecture. In IEEE Int. Conf. on Field Programmable Technology (FPT), Bangkok, Thailand, pp. 105–112.
  • [7] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei (2019) A survey of coarse-grained reconfigurable architecture and design: taxonomy, challenges, and applications. Comput. Surv. 52 (6), pp. 118:1–118:39.
  • [8] A. Parashar, P. Raina, Y. S. Shao, Y. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. S. Emer (2019) Timeloop: A systematic approach to DNN accelerator evaluation. In IEEE Int. Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, USA, pp. 304–315.
  • [9] A. Pedram, S. Richardson, M. Horowitz, S. Galal, and S. Kvatinsky (2017) Dark memory and accelerator-rich system optimization in the dark silicon era. IEEE Des. Test 34 (2), pp. 39–50.
  • [10] L. Pouchet. PolyBench: the polyhedral benchmark suite. Accessed: Oct. 10, 2025.
  • [11] R. Seghir, S. Verdoolaege, K. Beyls, and V. Loechner (2004) Analytical computation of Ehrhart polynomials and its application in compile-time generated cache hints. In Int. Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES).
  • [12] J. Teich, A. Tanase, and F. Hannig (2014) Symbolic mapping of loop programs onto processor arrays. J. Signal Process. Syst. 77 (1-2), pp. 31–59.
  • [13] J. Teich, L. Thiele, and L. Z. Zhang (1997) Partitioning processor arrays under resource constraints. J. VLSI Signal Process. 17 (1), pp. 5–20.
  • [14] J. Teich and L. Thiele (1993) Partitioning of processor arrays: a piecewise regular approach. Integr. 14 (3), pp. 297–332.
  • [15] J. Teich (1993) A compiler for application specific processor arrays. Ph.D. Thesis, Saarland University, Germany. ISBN 978-3-86111-701-8.
  • [16] S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe (2004) Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations. In Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 248–258. ISBN 1581138903.
  • [17] S. Verdoolaege (2010) isl: an integer set library for the polyhedral model. In Third Int. Congress on Mathematical Software (ICMS), K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama (Eds.), Vol. 6327, pp. 299–302.
  • [18] D. Walter, M. Brand, C. Heidorn, M. Witterauf, F. Hannig, and J. Teich (2024) ALPACA: an accelerator chip for nested loop programs. In Int. Symposium on Circuits and Systems (ISCAS), pp. 1–5.
  • [19] D. Walter, M. Halm, D. Seidel, I. Ghosh, C. Heidorn, F. Hannig, and J. Teich (2026) Modeling and Mapping of Regular Nested Loops on Processor Arrays: CGRAs vs. TCPAs. In 29. Workshop zu Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen.
  • [20] D. Walter, M. Witterauf, and J. Teich (2020) Real-time scheduling of I/O transfers for massively parallel processor arrays. In 18th ACM/IEEE Int. Conf. on Formal Methods and Models for System Design (MEMOCODE), Jaipur, India, December 2-4, 2020, pp. 1–11.
  • [21] M. Wijtvliet, H. Corporaal, and A. Kumar (2021) CGRA-EAM - rapid energy and area estimation for coarse-grained reconfigurable architectures. ACM Trans. Reconfigurable Technol. Syst. 14 (4), pp. 19:1–19:28.
  • [22] M. Wijtvliet, L. Waeijen, and H. Corporaal (2016) Coarse grained reconfigurable architectures in the past 25 years: overview and classification. In SAMOS, pp. 235–244.
  • [23] M. Witterauf, D. Walter, F. Hannig, and J. Teich (2021) Symbolic loop compilation for Tightly Coupled Processor Arrays. ACM Trans. Embedded Comput. Syst. (TECS) 20 (5), pp. 1–31.
  • [24] M. Witterauf, A. Tanase, F. Hannig, and J. Teich (2016) Modulo scheduling of symbolically tiled loops for Tightly Coupled Processor Arrays. In Int. Conf. on Application-specific Systems, Architectures and Processors (ASAP), pp. 58–66.
  • [25] Y. N. Wu, J. S. Emer, and V. Sze (2019) Accelergy: an architecture-level energy estimation methodology for accelerator designs. In Proceedings of the Int. Conf. on Computer-Aided Design (ICCAD), Westminster, CO, USA, November 4-7, 2019, D. Z. Pan (Ed.), pp. 1–8.