Symbolic Polyhedral-Based Energy Analysis for
Nested Loop Programs

Avinash Mahesh Nirmala, Dominik Walter, Frank Hannig, and Jürgen Teich

Abstract

This work presents a symbolic approach for estimating the energy consumption for nested loop programs when mapped and scheduled on parallel processor array accelerator architectures. Instead of simulation-based evaluation, we derive a methodology for symbolic energy analysis that captures the impact of mapping and scheduling decisions of loop nests on processor arrays. We compare our approach against simulation-based results for selected benchmarks and varying sizes of the iteration spaces. Whereas the latter are not scalable, our symbolic analysis is shown to be independent of the problem size. The presented evaluation methodology can be beneficially used during the design space exploration of mapping and scheduling decisions, for studying the influence of array size variations, and for comparisons with other loop nest accelerator architectures.

Index Terms—Loop programs, Processor arrays, Loop Compilation

I Introduction

AI workloads are growing rapidly. Particularly for applications running on embedded and edge platforms, however, power and energy budgets are very challenging to meet, especially as computational requirements grow. Understanding and optimizing the impact of operations as well as data movements through a processing architecture is therefore of utmost importance. Often, the above workloads can be described adequately by nested loop programs, for which dedicated accelerator architectures have been proposed, known as massively parallel processor architectures, including CGRAs, TCPAs [6, 4, 18], and GPUs. For these, efficient compilation techniques exist to map loop nests to maximize performance. What is missing, however, is a highly accurate, yet efficient analysis of the impact of mapping decisions on energy consumption. Whereas measurement-based approaches, being highly accurate, would require a physical prototype, simulation-based approaches at the register-transfer level (RTL) might be too time-consuming and thus not applicable for an early-stage energy analysis, such as during a design space exploration (DSE). Moreover, the effect of tiling, scheduling, and mapping transformations on energy consumption has not yet been adequately studied for modern loop-intensive workloads. Accuracy and efficiency of evaluation are both necessary, but conflicting, requirements for suitable analysis techniques.

Based on the above requirements, this paper presents an analytical framework for symbolic (parametric) evaluation of energy consumption of loop-intensive workloads when mapped and scheduled on processor array accelerator architectures. The framework is applicable to a broad class of accelerator architectures and can be used at early design stages, thus neither requiring a hardware prototype nor time-intensive simulations.

The presented analysis methodology takes a loop nest with parametric loop bounds and a space-time mapping to a target processor array of given size. Using a one-time classification of energy costs for different memory and register accesses, we show how parametric expressions can be derived for the number of accesses and operations during the execution of a loop nest without actually executing it. These volumes of accesses and operations can be obtained symbolically, i.e., for loop nests whose bounds remain parametric. By combining these symbolic counts with architecture-specific per-access energy weights, our framework enables fast, fully analytical energy estimation without requiring cycle-accurate simulation or physical prototype measurements. Whereas the approach can be applied to a broad class of parallel processor architectures, we present an evaluation study focusing on Tightly Coupled Processor Arrays (TCPAs) [6, 4, 18], which are massively parallel loop accelerator architectures that execute loop-intensive applications across grids of lightweight processing elements (PEs).

This paper is structured as follows: In Section II, we provide a review of existing work on energy estimation for massively parallel processor architectures. In Section III, we introduce a class of VLSI processor arrays called TCPA, define a notation for loop nests, and present the basic methodologies for mapping and scheduling loop nests to such architectures. In Section IV, we then describe our symbolic energy analysis approach, which is based on the efficient computation of volumes of polyhedral spaces of different types of memory access for which typical amounts of energy can be pre-characterized per access, thus providing a memory-centric evaluation of the overall energy consumption of a loop nest. We show that such volume computations need to be carried out only once for parametric loop bounds, i.e., symbolically. Section V gives evidence of the accuracy and speedup of this symbolic approach over a simulation-based approach. Finally, Section VI summarizes and concludes the paper.

II Related Work

Prior work on accelerator energy estimation largely assumes that execution behavior is fixed and known. Activity-based analytical models, such as AccelWattch [5], estimate energy by weighting cycle-level activity statistics obtained from detailed simulation or hardware counters. Trace-driven architectural models, including CGRA-EAM [21], estimate energy by evaluating execution traces over pre-characterized functional units and interconnect components. Component-level frameworks such as Accelergy [25] compute energy by associating architectural components with per-action energy costs and combining them with externally supplied action counts generated by analytical models, simulators, or mapping tools.

While effective for analyzing a fixed design and workload, these approaches share a common limitation: memory-access counts are evaluated only for explicitly specified workload instances. Even architecture-specific studies such as Eyeriss [3] and analytical frameworks such as Timeloop [8] derive access counts only after fixing loop bounds, tensor dimensions, and a concrete dataflow, tiling, or mapping. Consequently, when workload sizes or the execution order change, the analysis must be recomputed.

In the polyhedral compiler community, the authors of [16] launched the integer set library (ISL), which efficiently computes the volume of parametric polyhedra using Ehrhart polynomials, based on Barvinok’s algorithm [1]. In [11], this calculus is applied to the analysis of cache effects on execution time in single-core architectures.

In this paper, we rather analyze the energy consumption of loop nests when executed on massively parallel processor arrays in which many tiny processing elements do not even carry a cache. For these, we show how to derive closed-form expressions relating tiling choices, execution schedules, and resource bindings to energy consumption, including operations and memory accesses. This calculus enables an efficient analytical evaluation of energy consumption that is parametric in the problem size (loop bounds). To the best of our knowledge, this is among the first approaches to symbolically analyze the energy consumption of nested-loop programs in dependence on space–time mapping decisions.

In the domain of processor array accelerators, coarse-grained reconfigurable arrays (CGRAs) [22, 7] represent a prominent class of architectures that employ an operation-centric mapping approach [19], where the operations from a data flow graph (DFG) are individually assigned to processing elements. For these, simulation is usually used to analyze performance and energy tradeoffs of a compiled loop nest. But obviously, this is not a scalable approach. In order to exploit scalable symbolic energy analysis techniques as introduced in this paper, a polyhedral iteration space representation is necessary. Whereas CGRA compilers usually operate at the granularity of individual operations of a single-dimensional loop, other classes of processor array architectures, such as tightly coupled processor arrays (TCPAs) [6, 4], start the compilation from a given polyhedral representation of a loop nest to determine schedules and mappings over tiled, multidimensional iteration spaces (iteration-centric mapping). Thus, although in general the symbolic approach described in the following could also be applied to many CGRA architectures from a hardware perspective, compilation approaches would only benefit when using parametric polyhedral loop descriptions.

III Loop Nests and their Mapping and Scheduling on Processor Arrays

This section introduces a class of massively parallel processor arrays known as TCPAs [6, 4], and outlines the core concepts of loop-nest representation, space–time mapping, and symbolic loop scheduling [12].

III-A Tightly Coupled Processor Arrays (TCPAs)

TCPAs [6, 4] are parallel processor array architectures designed to execute multidimensional loop applications represented as polyhedral recurrence equations, so-called piecewise regular algorithms (PRAs). A TCPA works with a host, such as a CPU or FPGA. In this section, we provide an overview of the TCPA architecture. Next, we introduce the PRA representation for expressing loop nests and explain how iteration spaces are partitioned, scheduled, and mapped for spatial execution.

Figure 1: Example of a tightly coupled processor array (TCPA) architecture [6].

TCPAs are two-dimensional arrays of programmable processing elements (PEs) connected by a configurable circuit-switched neighbor-to-neighbor interconnect, as shown in Figure 1. The PE grid is bordered by four input/output (I/O) buffers, each comprising multiple dedicated address generators (AG). The architecture also includes peripheral control units, a global controller (GC), and a configuration manager (CM), which orchestrates loop-program execution on the array. The TCPA operates as a memory-coupled coprocessor alongside a host system. The loop I/O controller schedules direct memory access (DMA) to transfer data between the host’s external memory and the I/O buffers. After the TCPA is configured by the host, it executes the loop program independently, while the loop I/O controller schedules DMA transfers during execution to support dynamic refilling.

TCPAs avoid costly direct PE-to-DRAM communication by transferring data between the host and the I/O buffers rather than having the PEs access DRAM directly. During execution, all active data movement is restricted to the I/O buffers and the PE array [20]. This ensures predictable, high-bandwidth on-chip communication.

Each PE can be configured with multiple parallel functional units (FUs). Each FU has its own instruction memory, branch unit, and program counter. Input and output data are streamed between the I/O buffers and the PEs, while all intermediate values remain within the PE’s internal register hierarchy. Within each PE, the architecture supports multiple register classes, namely general-purpose, feedback, and input/output registers. The compiler [23] maps data to these register classes based on dependence structures and schedules. During execution, values produced and consumed by operations are stored and forwarded through these registers according to architectural constraints, and the circuit-switched interconnect supports inter-PE communication when dependencies span multiple PEs. The exact mapping rules are defined in the binding phase and discussed later.

III-B Loop-Nest Representation

Almost any standard programming language supports loop nests syntactically, e.g., C, C++, Java, and Python. In this paper, we assume loop nests described using a polyhedral notation called Piecewise Linear Algorithms (PLAs) [14, 15]. Normal sequential for-loop specifications can be converted to such a polyhedral description for parallelization. A PLA describes an $n$ -dimensional loop nest by an iteration space $\mathcal{I}\subseteq\mathbb{Z}^{n}$ , where each element $\mathbf{i}=(i_{0},i_{1},\dots,i_{n-1})^{\mathrm{T}}\in\mathcal{I}$ , called iteration vector, corresponds to one loop iteration. In the following, we assume $\mathcal{I}$ is described by a polyhedral description $\mathcal{I}=\{\,\mathbf{i}\mid\mathbf{A}\mathbf{i}\geq\mathbf{b}\,\}$ , where $\mathbf{A}\in\mathbb{Z}^{m\times n},\;\mathbf{b}\in\mathbb{Z}^{m}$ . The computations of a loop nest are described by a set of quantified statements $S=\{\ldots,S_{q},\ldots\}$ , with $S_{q}$ given by

S_{q}:\quad x_{q}[\mathbf{P}_{q}\mathbf{i}+\mathbf{f}_{q}]=F_{q}(\ldots,x_{q,r}[\mathbf{Q}_{q,r}\mathbf{i}-\mathbf{d}_{q,r}],\ldots)\;\text{if }\mathbf{i}\in\mathcal{I}_{q}.

(1)

For all $\mathbf{i}\in\mathcal{I}\cap\mathcal{I}_{q}$ (with $\mathcal{I}_{q}$ called condition space), the variable $x_{q}$ at index (vector) $\mathbf{P}_{q}\mathbf{i}+\mathbf{f}_{q}$ is defined as the value of the function $F_{q}$ evaluated on variables $x_{q,r}$ at $\mathbf{Q}_{q,r}\mathbf{i}-\mathbf{d}_{q,r}$ . Finally, we call a PLA a Piecewise Regular Algorithm (or PRA) in the special case that $\mathbf{P}_{q}$ and $\mathbf{Q}_{q,r}$ are identity matrices and $\mathbf{f}_{q}$ is zero. As a result, a statement of a PRA has the following form:

S_{q}:\quad x_{q}[\mathbf{i}]=F_{q}(\ldots,x_{q,r}[\mathbf{i}-\mathbf{d}_{q,r}],\ldots)\;\text{if }\mathbf{i}\in\mathcal{I}_{q}.

(2)

Note that in a PLA or PRA, there exists neither an explicit order of execution of iterations like in imperative loop programs, nor any order of execution of statements. Instead, each quantified statement only implies implicit schedule restrictions by so-called data dependencies. For example, in a PRA with a statement given in Eq. (2), the left-hand-side variable $x_{q}[\mathbf{i}]$ can only be evaluated after all variables $x_{q,r}[\mathbf{i}-\mathbf{d}_{q,r}]$ (arguments of $F_{q}$ ) on the right-hand side have been computed. We also say $x_{q}[\mathbf{i}]$ depends on $x_{q,r}[\mathbf{i}-\mathbf{d}_{q,r}]$ , and call the constant vector $\mathbf{d}_{q,r}$ the dependence vector. Finally, instances of variables appearing only on the right-hand side of statements are called input variables. Similarly, instances of variables appearing only on the left-hand side of statements are called output variables. All other instances of variables are called internal variables.

Example 1

In the following, we introduce as a running example the nested loop program GESUMMV from the PolyBench [10] benchmark suite, which computes the sum of two matrix–vector products involving a vector $X\in\mathbb{Z}^{N_{1}}$ and matrices $A,B\in\mathbb{Z}^{N_{0}\times N_{1}}$ :

Y[i_{0}]=\sum_{i_{1}=0}^{N_{1}-1}\bigl(A[i_{0},i_{1}]\cdot X[i_{1}]+B[i_{0},i_{1}]\cdot X[i_{1}]\bigr),\ \ 0\leq i_{0}<N_{0}.

A corresponding 2-dimensional PRA can be formulated with an iteration space $\mathcal{I}=\{(i_{0},i_{1})\mid 0\leq i_{0}<N_{0},\;0\leq i_{1}<N_{1}\}$ and the following set of statements:

$\displaystyle S_{1}:$	$\displaystyle x[i_{0},i_{1}]=X[i_{1}]$	$\displaystyle\text{if }i_{0}=0$
$\displaystyle S_{2}:$	$\displaystyle x[i_{0},i_{1}]=x[i_{0}-1,i_{1}]$	$\displaystyle\text{if }i_{0}>0$
$\displaystyle S_{3}:$	$\displaystyle a[i_{0},i_{1}]=A[i_{0},i_{1}]\cdot x[i_{0},i_{1}]$
$\displaystyle S_{4}:$	$\displaystyle b[i_{0},i_{1}]=B[i_{0},i_{1}]\cdot x[i_{0},i_{1}]$
$\displaystyle S_{5}:$	$\displaystyle s_{A}[i_{0},i_{1}]=a[i_{0},i_{1}]$	$\displaystyle\text{if }i_{1}=0$
$\displaystyle S_{6}:$	$\displaystyle s_{A}[i_{0},i_{1}]=s_{A}^{*}[i_{0},i_{1}]+a[i_{0},i_{1}]$	$\displaystyle\text{if }i_{1}>0$
$\displaystyle S_{7}:$	$\displaystyle s_{A}^{*}[i_{0},i_{1}]=s_{A}[i_{0},i_{1}-1]$	$\displaystyle\text{if }i_{1}>0$
$\displaystyle S_{8}:$	$\displaystyle s_{B}[i_{0},i_{1}]=b[i_{0},i_{1}]$	$\displaystyle\text{if }i_{1}=0$
$\displaystyle S_{9}:$	$\displaystyle s_{B}[i_{0},i_{1}]=s_{B}^{*}[i_{0},i_{1}]+b[i_{0},i_{1}]$	$\displaystyle\text{if }i_{1}>0$
$\displaystyle S_{10}:$	$\displaystyle s_{B}^{*}[i_{0},i_{1}]=s_{B}[i_{0},i_{1}-1]$	$\displaystyle\text{if }i_{1}>0$
$\displaystyle S_{11}:$	$\displaystyle Y[i_{0}]=s_{A}[i_{0},i_{1}]+s_{B}[i_{0},i_{1}]$	$\displaystyle\text{if }i_{1}=N_{1}-1$

In the above description, all instances of variables $A$ , $B$ , and $X$ are input variables. Similarly, all instances of variable $Y$ are output variables. Instances of $a$ and $b$ represent the element-wise products of $A\cdot X$ and $B\cdot X$ . Instances $s_{A}$ and $s_{B}$ are partial sums accumulated along the dimension $i_{1}$ (statements $S_{6},S_{7}$ and $S_{9},S_{10}$ ), initialized in statements $S_{5},S_{8}$ , respectively (at $i_{1}=0$ ). The outputs $Y[i_{0}]$ are obtained at $i_{1}=N_{1}-1$ (statement $S_{11}$ ).
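The PRA statements above can be checked against the direct summation with a short Python sketch. It evaluates $S_{1}$ to $S_{11}$ in lexicographic iteration order, which is only one of many orders that respect the data dependencies; the function names, the problem size, and the input values are illustrative assumptions and not part of the original formulation.

```python
# Sketch: evaluate the GESUMMV PRA statements S1..S11 and compare the
# result with the direct sum. Sizes and input values are assumptions.

def gesummv_reference(A, B, X):
    """Direct evaluation of Y[i0] = sum_i1 (A[i0][i1]*X[i1] + B[i0][i1]*X[i1])."""
    return [sum(a * x + b * x for a, b, x in zip(row_a, row_b, X))
            for row_a, row_b in zip(A, B)]

def gesummv_pra(A, B, X):
    """Evaluate the PRA statement by statement in lexicographic order."""
    n0, n1 = len(A), len(X)
    x, a, b, s_a, s_b = {}, {}, {}, {}, {}
    Y = [None] * n0
    for i0 in range(n0):
        for i1 in range(n1):
            # S1 (i0 == 0) and S2 (i0 > 0): propagate X along i0
            x[i0, i1] = X[i1] if i0 == 0 else x[i0 - 1, i1]
            # S3 and S4: element-wise products
            a[i0, i1] = A[i0][i1] * x[i0, i1]
            b[i0, i1] = B[i0][i1] * x[i0, i1]
            if i1 == 0:
                # S5 and S8: initialize the partial sums
                s_a[i0, i1] = a[i0, i1]
                s_b[i0, i1] = b[i0, i1]
            else:
                # S7 then S6, and S10 then S9: accumulate along i1
                s_a_star = s_a[i0, i1 - 1]
                s_a[i0, i1] = s_a_star + a[i0, i1]
                s_b_star = s_b[i0, i1 - 1]
                s_b[i0, i1] = s_b_star + b[i0, i1]
            if i1 == n1 - 1:
                # S11: emit the output
                Y[i0] = s_a[i0, i1] + s_b[i0, i1]
    return Y

N0, N1 = 4, 5
A = [[i0 + i1 for i1 in range(N1)] for i0 in range(N0)]
B = [[i0 * i1 + 1 for i1 in range(N1)] for i0 in range(N0)]
X = [i1 + 1 for i1 in range(N1)]
Y_PRA = gesummv_pra(A, B, X)
Y_REF = gesummv_reference(A, B, X)
```

For these inputs, both functions return the same vector, which illustrates that the single-assignment PRA form is only a restructuring of the original summation.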

Now, in order to map a polyhedral loop nest to a 1- or 2-dimensional array of processing elements, the given iteration space $\mathcal{I}$ is partitioned into as many tiles as there are processing elements available. Then, a feasible schedule of the iterations within each tile and across tiles needs to be found. This is explained in the following.

III-C Symbolic Tiling

According to [23, 14, 13], we partition the $n$ -dimensional iteration space $\mathcal{I}$ into congruent rectangular tiles $\mathcal{J}$ by using a partitioning matrix $P={\mathrm{diag}}(p_{0},\ p_{1},\ldots,p_{n-1})$ such that $\mathcal{I}\subseteq\mathcal{J}\boldsymbol{\oplus}P\mathcal{K}$ . A tile $\mathcal{J}$ can be described as follows:

\mathcal{J}=\bigl\{\,\mathbf{j}=(j_{0},\ldots,j_{\ell},\ldots,j_{n-1})^{\mathrm{T}}\mid 0\leq j_{\ell}<p_{\ell}\,\bigr\}.

(3)

Similarly, the set of non-empty tile origins $\mathcal{K}$ can be described as:

\mathcal{K}=\bigl\{\,\mathbf{k}=(k_{0},\ldots,k_{\ell},\ldots,k_{n-1})^{\mathrm{T}}\mid 0\leq k_{\ell}<t_{\ell}\,\bigr\}.

(4)
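The decomposition $\mathbf{i}=\mathbf{j}+P\mathbf{k}$ behind Eqs. (3) and (4) can be illustrated with a minimal Python sketch. It assumes rectangular iteration spaces starting at the origin and tile counts $t_{\ell}=\lceil N_{\ell}/p_{\ell}\rceil$; the helper names and the concrete sizes are hypothetical.

```python
import math
from itertools import product

def tile_counts(bounds, p):
    """Tiles per dimension so that the tiles cover all iterations
    (assumption: t_l = ceil(N_l / p_l))."""
    return tuple(math.ceil(n / pl) for n, pl in zip(bounds, p))

def decompose(i, p):
    """Split iteration i into intra-tile index j and tile index k, i = j + P k."""
    j = tuple(il % pl for il, pl in zip(i, p))
    k = tuple(il // pl for il, pl in zip(i, p))
    return j, k

bounds = (4, 5)   # illustrative N0 x N1 iteration space
p = (2, 3)        # illustrative tile sizes p0 x p1
t = tile_counts(bounds, p)

# Check that every iteration lands in exactly one (j, k) pair with
# j inside the tile J and k inside the set of tile origins K.
covered = True
for i in product(*(range(n) for n in bounds)):
    j, k = decompose(i, p)
    in_tile = all(0 <= jl < pl for jl, pl in zip(j, p))
    in_origins = all(0 <= kl < tl for kl, tl in zip(k, t))
    recomposed = tuple(jl + pl * kl for jl, pl, kl in zip(j, p, k))
    covered = covered and in_tile and in_origins and recomposed == i
```

Note that the tiles may cover more points than the iteration space contains (here $4\times 6$ versus $4\times 5$), which is why the condition spaces of the statements are kept after tiling.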

The vector $T=(t_{0},\ldots,t_{\ell},\ldots,t_{n-1})$ describes the number of tiles along each dimension $\ell$ . After tiling, each original dependence vector $\mathbf{d}\in\mathbb{Z}^{n}$ of a statement as shown in Eq. (2) is decomposed into an intra-tile dependence vector ( $\mathbf{d}_{J}$ ) and an inter-tile dependence vector ( $\mathbf{d}_{K}$ ), and all instances of variables are embedded into the $2n$ -dimensional space with dependence vector $(\mathbf{d}_{J},\mathbf{d}_{K})^{\mathrm{T}}$ . Without loss of generality, we can split the statement $S_{q}$ in Eq. (2) into the two equations below [14]:

S_{q}:\;x_{q}[\mathbf{j},\mathbf{k}]=F_{q}(\ldots,\,x_{q,r}^{\ast}[\mathbf{j},\mathbf{k}],\,\ldots)\;\text{if }\mathbf{j}+P\mathbf{k}\in\mathcal{I}_{q}

(5)

S_{q}^{\ast}:\;x_{q,r}^{\ast}[\mathbf{j},\,\mathbf{k}]=x_{q,r}[\mathbf{j}-d-P\gamma,\,\mathbf{k}+\gamma]\;\text{if }\mathbf{j}+P\mathbf{k}\in\mathcal{I}_{q}\;\wedge\;\mathbf{j}-d-P\gamma\in\mathcal{J}.

(6)

Whereas the dependencies of the transformed statement $S_{q}$ are all zero, we need to create one statement of the form $S_{q}^{\ast}$ for each solution $\gamma$ that satisfies the set of inequalities

\{\gamma\in\mathbb{Z}^{n}:-e<\gamma+P^{-1}d<e\}

(7)

with $e=(1\;1\,\cdots\,1)^{\mathrm{T}}.$

Example 2

In the following, we tile the iteration space of the GESUMMV PRA listing introduced in Example 1. Assume a matrix of size $N_{0}\times N_{1}=4\times 5$ to be mapped onto a $2\times 2$ target processor array. Figure 2 provides a visual illustration of the tiling into congruent tiles $\mathcal{J}=\bigl\{\,\mathbf{j}=(j_{0},j_{1})^{\mathrm{T}}\mid 0\leq j_{\ell}<p_{\ell}\,\bigr\}$ of size $p_{0}\times p_{1}=2\times 3=6$ iterations. This tile size was chosen to partition the iteration space into exactly as many tiles as there are PEs available in each dimension of the processor array. Thus, the resulting set $\mathcal{K}$ of tile origins is $\mathcal{K}=\bigl\{\,\mathbf{k}=(k_{0},k_{1})^{\mathrm{T}}\mid 0\leq k_{\ell}<t_{\ell}\,\bigr\}$ with $t_{0}=t_{1}=2$ , thus a total of $t_{0}\times t_{1}=2\times 2=4$ tiles. To explain the transformation of data dependencies due to tiling, consider statement $S_{7}$ as an example. According to Eq. (6), we obtain the following transformed statement $S_{7}^{*}$ :

S_{7}^{\ast}:\;s_{A}^{\ast}[\mathbf{j},\mathbf{k}]=s_{A}[\mathbf{j}-(0,1)^{\mathrm{T}}-P\gamma,\;\mathbf{k}+\gamma]\;\text{if }j_{1}+p_{1}k_{1}>0\;\wedge\;\mathbf{j}-(0,1)^{\mathrm{T}}-P\gamma\in\mathcal{J},

for which we can find two solutions for the vector $\gamma$ that satisfy Eq. (7): $\gamma\in\left\{(0,0)^{\mathrm{T}},\ (0,-1)^{\mathrm{T}}\right\}.$ The two final statements that replace $S_{7}^{\ast}$ are thus:

S_{7}^{\ast 1}:\;s_{A}^{\ast}[\mathbf{j},\mathbf{k}]=s_{A}[\mathbf{j}-(0,1)^{\mathrm{T}},\;\mathbf{k}]\;\text{if }j_{1}+p_{1}k_{1}>0\;\wedge\;\mathbf{j}-(0,1)^{\mathrm{T}}\in\mathcal{J}.

S_{7}^{\ast 2}:\;s_{A}^{\ast}[\mathbf{j},\mathbf{k}]=s_{A}[\mathbf{j}-(0,1-p_{1})^{\mathrm{T}},\;\mathbf{k}+(0,-1)^{\mathrm{T}}]\;\text{if }j_{1}+p_{1}k_{1}>0\;\wedge\;\mathbf{j}-(0,1-p_{1})^{\mathrm{T}}\in\mathcal{J}.

The two new resulting dependence vectors

\mathbf{d}_{7}^{\ast}\!\bigl((0,1)^{\mathrm{T}}\bigr)=\left\{(0,1,0,0)^{\mathrm{T}},\;(0,1-p_{1},0,1)^{\mathrm{T}}\right\}

are also visualized in Figure 2. The first vector (shown in yellow) corresponds to an intra-tile dependence and leads to an intra-processor memory access, whereas the second (shown in orange) represents an inter-tile dependence that leads to an inter-processor memory access and communication, as we exploit later in our energy analysis.
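The enumeration of the solutions $\gamma$ of Eq. (7) and the resulting tiled dependence vectors can be reproduced with the following Python sketch. It checks the strict inequalities per dimension in integer arithmetic and assumes a diagonal partitioning matrix given as a tuple of tile sizes; the function names are hypothetical.

```python
from itertools import product

def gamma_solutions(d, p):
    """All integer gamma with -e < gamma + P^{-1} d < e (Eq. (7)).
    Per dimension this is checked as -p_l < p_l * g_l + d_l < p_l."""
    per_dim = []
    for dl, pl in zip(d, p):
        bound = abs(dl) // pl + 1  # no solution can lie outside this range
        per_dim.append([g for g in range(-bound, bound + 1)
                        if -pl < pl * g + dl < pl])
    return [tuple(g) for g in product(*per_dim)]

def tiled_dependence(d, p, gamma):
    """Embed d into the 2n-dimensional space as (d_J, d_K) = (d + P gamma, -gamma)."""
    d_j = tuple(dl + pl * gl for dl, pl, gl in zip(d, p, gamma))
    d_k = tuple(-gl for gl in gamma)
    return d_j + d_k

d = (0, 1)   # dependence vector of statement S7
p = (2, 3)   # tile sizes p0, p1 of the running example
GAMMAS = gamma_solutions(d, p)
VECTORS = [tiled_dependence(d, p, g) for g in GAMMAS]
```

With $p_{1}=3$, the second vector evaluates to $(0,-2,0,1)^{\mathrm{T}}$, which matches $(0,1-p_{1},0,1)^{\mathrm{T}}$.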

Now, in order to execute the iterations within each tile and to respect data dependencies both within a tile and across tiles, a feasible schedule must be determined. This is explained in the following.

III-D Symbolic Scheduling

Whereas tiling defines the mapping of iterations and their corresponding computations to PEs, we still need to find execution schedules that minimize execution time. A schedule assigns to each variable $x_{q}$ defined in a statement $S_{q}$ a start time $t_{q}(J,K)$ at which the operation $F_{q}$ is computed for each iteration $(J,K)$ . According to [12, 23], iterations within a tile must be executed in either a sequential or a pipelined (modulo-scheduled) order with an initiation interval $\pi$ between two iterations. Different tiles are scheduled in parallel across a PE array; such an execution corresponds to a locally sequential, globally parallel (LSGP) modulo schedule. Moreover, the regular computations of loop nests can be efficiently scheduled using linear schedules in which the schedule of iterations within a tile is described by an intra-tile schedule vector $\boldsymbol{\lambda}^{J}$ . Similarly, the start time of tiles is described by a so-called inter-tile schedule vector $\boldsymbol{\lambda}^{K}$ . According to [24], schedules $(\boldsymbol{\lambda}^{J},\boldsymbol{\lambda}^{K})$ that minimize the global latency $L$ of a loop nest for a given initiation interval $\pi$ can be determined efficiently as follows:

L\;=\;\boldsymbol{\lambda}^{J}\begin{pmatrix}p_{0}-1\\ \vdots\\ p_{n-1}-1\end{pmatrix}\;+\;\boldsymbol{\lambda}^{K}\begin{pmatrix}t_{0}-1\\ \vdots\\ t_{n-1}-1\end{pmatrix}\;+\;L_{c}.

(8)

The first term determines the maximum difference of start times between any pair of iterations within a tile. The second term determines the maximum difference of start times between any two tiles. Finally, $L_{c}=\max_{1\leq q\leq|S|}(\tau_{q}+w_{q})$ describes the latency of a single iteration within a tile, where $\tau_{q}$ is the start time and $w_{q}$ the execution latency of operation $F_{q}$ in statement $S_{q}$ .

Example 3

Consider the tiling of the iteration space of the GESUMMV program introduced in Example 2. For the given application, and assuming a pipeline interval $\pi=1$ , we can find the symbolic intra-tile schedule $\lambda^{J}=(1,p_{0})^{\mathrm{T}}$ and the inter-tile schedule $\lambda^{K}=(p_{0},p_{0}\cdot(p_{1}-1)+1)^{\mathrm{T}}$ , minimizing the global latency given by $L\;=\;\lambda^{J}\cdot(p_{0}-1,p_{1}-1)^{\mathrm{T}}+\lambda^{K}\cdot(t_{0}-1,t_{1}-1)^{\mathrm{T}}\;+\;L_{c}$ with $L_{c}=4$ , assuming each $F_{q}$ has a latency of $w_{q}=1$ , which leads to $L=(p_{0}p_{1}-1)+p_{0}(t_{0}-1)+(p_{0}(p_{1}-1)+1)(t_{1}-1)+4$ . In the running example with $p_{0}=2,p_{1}=3,t_{0}=t_{1}=2$ , we obtain $L=5+7+4=16$ .
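The latency expression of Eq. (8) is cheap to evaluate for any concrete tile configuration once the symbolic schedule vectors are known. The following Python sketch evaluates it for the schedule of this example; the function names are hypothetical, and the schedule vectors are taken as stated above rather than derived.

```python
def example_schedule(p):
    """Symbolic schedule vectors of the GESUMMV example for pi = 1."""
    p0, p1 = p
    lam_j = (1, p0)                    # intra-tile schedule vector
    lam_k = (p0, p0 * (p1 - 1) + 1)    # inter-tile schedule vector
    return lam_j, lam_k

def global_latency(lam_j, lam_k, p, t, l_c):
    """Eq. (8): intra-tile term + inter-tile term + single-iteration latency."""
    intra = sum(l * (pl - 1) for l, pl in zip(lam_j, p))
    inter = sum(l * (tl - 1) for l, tl in zip(lam_k, t))
    return intra + inter + l_c

p, t = (2, 3), (2, 2)
lam_j, lam_k = example_schedule(p)
L_EXAMPLE = global_latency(lam_j, lam_k, p, t, l_c=4)

# The same expression evaluated for a larger, purely illustrative tile size.
p_big = (4, 8)
L_BIG = global_latency(*example_schedule(p_big), p_big, t, l_c=4)
```

Changing the tile sizes only requires re-evaluating the closed-form expression, not re-running a scheduler.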

IV Symbolic Loop Nest Energy Analysis

In this section, we introduce our symbolic energy analysis methodology for loop nests when executed on processor arrays. The methodology starts by analyzing both the computational energy (for execution of loop statements of type $S_{q}$ in Eq. (5)) and the data transport energy for any data movement (by analyzing loop statements of type $S_{q}^{\ast}$ in Eq. (6)), distinguishing transfers across the chip boundary, e.g., from and to DRAM, from transfers inside the processor array. This first part of the analysis delivers a formula for the energy consumption of each loop statement per loop iteration. Subsequently, these energy-by-statement expressions are multiplied by the number of iterations in which each statement is executed to deliver a final overall energy estimate. This step relies on the efficient calculation of volumes of polyhedral spaces related to the size of the condition space of each loop statement. We demonstrate that these volumes can be computed symbolically for parametric loop bounds. The energy analysis thus needs to be computed only once, and a fast scalability analysis reduces to inserting concrete loop bound values into the derived volume formulae.

IV-A Energy-by-Statement Analysis

Without loss of generality, we can assume that any PRA loop statement as described in the most general form in Eq. (1) can be split into one statement $S_{q}$ whose right-hand side (RHS) arguments all have zero dependence vectors and which contains the computation expressed by a function ${\cal F}_{q}$ , as described in Eq. (5), and statements that just relate each RHS variable with variables at displacements, i.e., statements of type $S_{q}^{*}$ in Eq. (6). This eases the description of the following energy analysis by separating the analysis of computational energy from the analysis of memory-related energy. In the following, we denote by $C$ the set of computational statements and by $M$ the set of memory statements.

Example 4

In the running example introduced in Ex. 1, all 11 statements already adhere to the above form with $C=\{S_{3},S_{4},S_{6},S_{9},S_{11}\}$ denoting statements with computations and $M=\{S_{1},S_{2},S_{5},S_{7},S_{8},S_{10}\}$ being the set of memory-related statements.

A processor array, as shown in Figure 2, comprises a memory system consisting of six different memory types $\tau\in\mathrm{T}$ as classified in Table I: on-chip I/O buffers ( $\mathrm{IOb}$ ) at the periphery, through which data transfers for fetching and storing tensor I/O data from/to a host DRAM ( $\mathrm{DR}$ ) are performed explicitly via DMA operations. Because DRAM accesses are the most expensive, the loop mapping strategy should avoid moving any input or output variable instance of a loop nest across the chip boundary more than once. Moreover, within each PE, we distinguish four classes of registers with the following use: general-purpose registers ( $\mathrm{RD}$ ) for handling intra-iteration dependencies, feedback registers ( $\mathrm{FD}$ ) for temporarily storing variables with inter-iteration dependencies that are processed locally within a PE, and finally input/output registers ( $\mathrm{ID}$ and $\mathrm{OD}$ ) used to receive data from, or transmit data to, neighboring PEs via communication ports. A breakdown of typical energies per access for different types of memory accesses ${\mathrm{T}}=\{\mathrm{RD},\mathrm{FD},\mathrm{ID},\mathrm{OD},\mathrm{IOb},\mathrm{DR}\}$ and for some basic arithmetic operations is given in Table I for a 45 nm technology [9].

TABLE I: Energy related to memory accesses/operations in 45 nm technology [9].

Memory Class/Operation Type	Energy $E$ [pJ]
Register Files
General-purpose register ( $\mathrm{RD}$ )	0.12
Feedback register ( $\mathrm{FD}$ )	0.35
Input register ( $\mathrm{ID}$ )	0.24
Output register ( $\mathrm{OD}$ )	0.12
Buffers / Off-chip Access
I/O buffer ( $\mathrm{IOb}$ )	16
DRAM ( $\mathrm{DR}$ )	1280
Arithmetic Operations
Addition ( $\mathrm{add}$ )	0.36
Multiplication ( $\mathrm{mul}$ )	1.24

IV-A1 Energy of Computational Statements

We start with the analysis of energy related to computational statements $S_{q}\in C$ . Let $L:x\rightarrow\mathrm{T}$ denote the function that returns the memory class of a variable $x$ occurring either on the LHS (write location) or the RHS (read locations). Then, the computational energy for one loop iteration of such a statement $S_{q}$ can be estimated by:

E_{q}^{C}=\sum_{r=1}^{R}E(L(x_{q,r}^{*}))+E({\cal F}_{q})+E(L(x_{q}))

(9)

Explanation: The first term sums up the energy needed to read each of the $R$ input locations on the RHS of the statement $S_{q}$ , the middle term denotes the energy needed to compute the function ${\cal F}_{q}$ , and the last term denotes the energy needed to write this result into the location $L(x_{q})\in\mathrm{T}$ .
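As a small numerical illustration of Eq. (9), the following Python sketch evaluates the per-iteration energy of a computational statement from the values of Table I. It assumes that all operands already reside in the given register classes, so no DRAM or I/O buffer cost is included; the function name and the chosen operand classes are hypothetical.

```python
# Per-access and per-operation energies in pJ, copied from Table I.
ENERGY_PJ = {
    "RD": 0.12, "FD": 0.35, "ID": 0.24, "OD": 0.12,
    "IOb": 16.0, "DR": 1280.0,
    "add": 0.36, "mul": 1.24,
}

def computational_energy(read_classes, op, write_class):
    """Eq. (9): energy of all RHS reads + operation energy + LHS write."""
    reads = sum(ENERGY_PJ[c] for c in read_classes)
    return reads + ENERGY_PJ[op] + ENERGY_PJ[write_class]

# Assumption: an addition with two operands in general-purpose registers
# whose result is written to a general-purpose register.
E_ADD = computational_energy(["RD", "RD"], "add", "RD")

# Assumption: a multiplication with one operand in an input register and
# one in a general-purpose register, result in a general-purpose register.
E_MUL = computational_energy(["ID", "RD"], "mul", "RD")
```

Under these assumptions, the addition costs about 0.72 pJ and the multiplication about 1.72 pJ per iteration, so register traffic contributes a noticeable share even for purely local statements.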

To systematically capture these dependencies, our analysis tool uses an internal representation called reduced dependence graph (RDG), a directed multigraph where nodes represent inputs, outputs, and computations, and edges capture data dependencies between them over the iteration space. The RDG of a computational statement after tiling as given in Eq. (5) can be represented as shown in Figure 3.

Figure 3: RDG of a computational statement $S_{q}$ as given in Eq. (5).

IV-A2 Energy of Memory-Related Statements

For a statement $S_{q}^{*}\in M$ , we estimate the energy by statement similarly:

E_{q}^{M}=E(L(x_{q,r}))+E(L(x_{q,r}^{*}))

(10)

Explanation: The first term calculates the energy needed to read the right-hand-side variable $x_{q,r}$ of statement $S_{q}^{*}$ , and the second term denotes the energy needed to copy its value to location $L(x_{q,r}^{*})\in\mathrm{T}$ . Finally, holding for both types of statements, the energy of a read or write access of any variable $x$ depends on its location $L(x)\in{\mathrm{T}}$ as follows:

E(L(x))=\begin{cases}E(\mathrm{DR})+E(\mathrm{IOb})+E(\mathrm{ID})&\text{if }x\in\{I\}\\ E(\mathrm{DR})+E(\mathrm{IOb})+E(\mathrm{OD})&\text{else if }x\in\{O\}\\ E(\mathrm{RD})&\text{else if }\mathbf{d}_{j}=\mathbf{d}_{k}=0\\ E(\mathrm{FD})&\text{else if }\mathbf{d}_{j}\neq 0\wedge\mathbf{d}_{k}=0\\ E(\mathrm{ID})&\text{else }(\mathbf{d}_{k}\neq 0)\end{cases}

Explanations: The first two cases hold when $x$ is an input variable or an output variable, respectively. Here, the energy accounts for fetching an input variable from DRAM into an I/O buffer and from there into an input register of a PE, or for writing an output variable from an output register of a PE via an I/O buffer back to DRAM.

Example 5

In the GESUMMV kernel introduced in Example 1, $A$ , $B$ , and $X$ are the names of input variables (appearing in statements $S_{1},S_{3}$ , and $S_{4}$ ). Therefore, the execution of any statement involving an instance of such tensors requires a transport of the related variable instance from DRAM to an I/O buffer, and from there to an input register of a PE, as indicated by the incoming green arrows at the left and top of the PE grid in Figure 2. Similarly, variable $Y$ in statement $S_{11}$ is an output variable; each instance of it is written to an output register and then over an I/O buffer back to the DRAM, as illustrated by the outgoing green arrows at the right of the PE grid in Figure 2.

Otherwise, the location $L$ of a variable $x$ is a register ( $\mathrm{RD}$ ), if the data is stored tile-locally (zero dependence vectors).

Example 6

In the GESUMMV example, statements $S_{5}$ and $S_{8}$ are examples of local register accesses ( $\mathrm{RD}$ ). These PE-local data accesses are illustrated by circular blue arrows within each iteration node in Figure 2.

Otherwise, if only the intra-tile dependence vector $\mathbf{d}_{j}$ is non-zero, the variable might get used again in the very same PE later and is thus stored in an $\mathrm{FD}$ register.

Example 7

In the GESUMMV example, statements $S_{2}$ , $S_{7}$ , and $S_{10}$ are examples of transport statements. After tiling, each produces a statement in which the left-hand-side variable is indexed by $(j_{0},j_{1},k_{0},k_{1})$ and is assigned the value of a right-hand-side variable indexed by $(j_{0},j_{1}-1,k_{0},k_{1})$ . Hence, $\mathbf{d}_{j}\neq 0\wedge\mathbf{d}_{k}={0}$ holds, and the RHS variable is stored in an $\mathrm{FD}$ register for PE-internal re-use. These tile-local data dependencies are illustrated by the yellow arrows in Figure 2.

Else, the data comes from a different tile and thus PE and must be read into an $\mathrm{ID}$ register.

Example 8

For the given GESUMMV example, inter-tile dependencies may arise due to tiling. For instance, the statement $S_{7}^{*}$ described in Example 2 is split into two statements, of which the second has the dependence vector $\mathbf{d}^{\ast}=(0,1-p_{1},0,1)^{\textrm{T}}$ . This dependence represents an inter-tile communication along the $k_{1}$ -direction. For an iteration vector $(j_{0},j_{1},k_{0},k_{1})^{\textrm{T}}$ , the data thus comes from iteration $(j_{0},j_{1}+p_{1}-1,k_{0},k_{1}-1)^{\textrm{T}}$ , i.e., from the tile with index $k_{1}-1$ , corresponding to the left neighbor tile (PE). The data is therefore stored in an $\mathrm{ID}$ register, as illustrated by the orange arrows in Figure 2.
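The case distinction for $E(L(x))$ above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `access_energy`, the `kind` labels, and the dictionary layout are hypothetical and not part of the paper's tooling. The three register energies are the values used in Example 9; the entries for DR, IOb, and OD are deliberately left out and would have to be supplied from the energy characterization before the input and output cases can be evaluated.

```python
from typing import Dict, Sequence


def access_energy(kind: str, d_j: Sequence[int], d_k: Sequence[int],
                  E: Dict[str, float]) -> float:
    """Energy of one access of a variable x, selected by its location L(x).

    kind : "input", "output", or "internal" (hypothetical labels).
    d_j  : intra-tile part of the dependence vector.
    d_k  : inter-tile part of the dependence vector.
    E    : energy table keyed by "DR", "IOb", "ID", "OD", "RD", "FD".
    """
    if kind == "input":                       # x in {I}
        return E["DR"] + E["IOb"] + E["ID"]
    if kind == "output":                      # x in {O}
        return E["DR"] + E["IOb"] + E["OD"]
    intra_zero = all(c == 0 for c in d_j)
    inter_zero = all(c == 0 for c in d_k)
    if intra_zero and inter_zero:             # tile-local data: RD register
        return E["RD"]
    if inter_zero:                            # re-use within the same PE: FD register
        return E["FD"]
    return E["ID"]                            # data from another tile (PE): ID register


# Register energies as used in Example 9; DR, IOb, and OD are not listed here.
E_REG = {"RD": 0.12, "FD": 0.35, "ID": 0.24}

p1 = 3  # tile size in dimension 1, as in the Figure 2 configuration
e_local = access_energy("internal", (0, 0), (0, 0), E_REG)       # Example 6
e_intra = access_energy("internal", (0, 1), (0, 0), E_REG)       # Example 7
e_inter = access_energy("internal", (0, 1 - p1), (0, 1), E_REG)  # Example 8
print(e_local, e_intra, e_inter)  # 0.12 0.35 0.24
```

The order of the tests mirrors the if/elsif/else chain of the formula, so the inter-tile case is reached only when $\mathbf{d}_{k}\neq 0$ .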

IV-B Total Energy Evaluation

With the above analysis of energy estimates per statement and iteration, we can finally derive an estimate of the total energy $E_{\mathrm{tot}}$ for the execution of a parametric loop nest by summing up the energies per statement after tiling, each multiplied by the volume $\operatorname{Vol}({S}_{q})$ , i.e., the number of integer points of the polyhedral space in which statement $S_{q}$ is defined:

E_{\mathrm{tot}}=\sum_{S_{q}\in C}\operatorname{Vol}({S}_{q})\cdot E_{q}^{C}+\sum_{S_{q}\in M}\operatorname{Vol}({S}_{q})\cdot E_{q}^{M}.

(11)

IV-C Symbolic Volume Computation

In the previous section, we provided a generic energy analysis per statement of a loop nest and a formula for the total energy of executing a given loop nest, obtained by multiplying these estimates by the number of integer points in the corresponding parametric polyhedra. For computational statements $S_{q}\in C$ as described in Eq. (5), the number of iterations after tiling is obtained from the tiled iteration space:

\operatorname{Vol}({S}_{q})=|\big\{\mathbf{i}\mid\mathbf{i}=\mathbf{j}+P\mathbf{k}\land\mathbf{i}\in\mathcal{I}_{q}\cap\mathcal{I}\big\}|

(12)

For memory-related statements of type $S_{q}^{*}\in M$ as described in Eq. (6), the execution count of the associated memory accesses additionally depends on the dependencies $\mathbf{d}_{j}$ and $\mathbf{d}_{k}$ . Therefore, for $S_{q}^{*}\in M$ , the volume is given by

\operatorname{Vol}(S_{q}^{\ast})=|\big\{\mathbf{i}\mid\mathbf{i}=\mathbf{j}+P\mathbf{k}\land\mathbf{j}-\mathbf{d}_{j}\in\mathcal{J}\land\mathbf{i}\in\mathcal{I}_{q}\cap\mathcal{I}\big\}|

(13)

The computation of such volumes corresponds to counting the number of integer points in parametric polyhedra. This can be performed symbolically using Barvinok’s algorithm [2] as implemented in the Integer Set Library (ISL) [17]. The resulting volumes are returned as piecewise quasi-polynomial functions of the loop bounds $N_{i}$ and tile size parameters $p_{i}$ , which can then be inserted into Eq. (11) to determine the total energy consumption $E_{\mathrm{tot}}$ of all statements as executed in a given loop nest.

Example 9

To illustrate the symbolic volume computation, we consider statement $S_{7}^{\ast}$ of the GESUMMV kernel introduced in Example 2. The transformed statement is decomposed into two cases based on the location of the source operand. The iteration spaces of $S_{7}^{\ast 1}$ and $S_{7}^{\ast 2}$ are given as follows with $t_{i}=\lceil N_{i}/p_{i}\rceil$ :

\mathcal{I}_{7}^{\ast 1}=\left\{(j_{0},j_{1},k_{0},k_{1})\in\mathbb{Z}^{4}\;\middle|\;\begin{aligned} &0\leq j_{0}<p_{0}\land 0\leq j_{1}<p_{1}\land\\ &0\leq k_{0}<t_{0}\land 0\leq k_{1}<t_{1}\land\\ &0\leq j_{0}+p_{0}k_{0}<N_{0}\land\\ &0<j_{1}+p_{1}k_{1}<N_{1}\land\\ &0\leq j_{1}-1<p_{1}\end{aligned}\right\}

\mathcal{I}_{7}^{\ast 2}=\left\{(j_{0},j_{1},k_{0},k_{1})\in\mathbb{Z}^{4}\;\middle|\;\begin{aligned} &0\leq j_{0}<p_{0}\land 0\leq j_{1}<p_{1}\land\\ &0\leq k_{0}<t_{0}\land 0\leq k_{1}<t_{1}\land\\ &0\leq j_{0}+p_{0}k_{0}<N_{0}\land\\ &0<j_{1}+p_{1}k_{1}<N_{1}\land\\ &0\leq j_{1}+p_{1}-1<p_{1}\end{aligned}\right\}

Note that the above inequalities contain non-linear terms such as the expression $...\leq j+p\cdot k<...$ (i.e., the product of the parameter $p$ and the index $k$ ). But for a given (fixed) processor array size, $k$ (the processor element index in a given processor array dimension with a total of $t$ elements in this dimension) is bounded, i.e., $0\leq k<t$ . Therefore, we can unfold the respective inequality constraints in practice as follows: $...\leq\{j+p\cdot 0,\quad j+p\cdot 1,\quad\dots,\quad j+p(t-1)\}<...$ . (Interestingly, the symbolic energy analysis time remains on the order of 1 minute, even for large processor arrays of size $50\times 50=2500$ processors.) By applying Barvinok’s algorithm [1], the integer-point count for $S_{7}^{\ast 1}$ and $S_{7}^{\ast 2}\in M$ is then obtained for the example of a $2\times 2$ processor array target as shown in Figure 2:

\operatorname{Vol}({{S}_{7}^{\ast 1}})=|{\mathcal{I}_{7}^{\ast 1}}|=\begin{cases}4\,p_{0}(p_{1}-1)&\begin{aligned} \text{if }&0<p_{0}\land 2p_{0}<N_{0}\land\\ &p_{1}\geq 2\land 2p_{1}<N_{1}\end{aligned}\\ 2\,N_{0}(p_{1}-1)&\begin{aligned} \text{if }&N_{0}>0\land 2p_{0}\geq N_{0}\land\\ &p_{1}\geq 2\land 2p_{1}<N_{1}\end{aligned}\\ (2N_{1}-4)\,p_{0}&\begin{aligned} \text{if }&0<p_{0}\land 2p_{0}<N_{0}\land\\ &p_{1}\leq N_{1}-2\land\\ &2p_{1}\geq N_{1}\end{aligned}\\ N_{0}(N_{1}-2)&\begin{aligned} \text{if }&N_{0}>0\land 2p_{0}\geq N_{0}\land\\ &p_{1}\leq N_{1}-2\land\\ &2p_{1}\geq N_{1}\end{aligned}\\ 0&\text{otherwise}\end{cases}

The above expression corresponds to the intra-tile case, where the dependence is resolved locally within a tile.

\operatorname{Vol}({{S}_{7}^{\ast 2}})=|{\mathcal{I}_{7}^{\ast 2}}|=\begin{cases}2\,p_{0}&\text{if}\ 0<p_{0}<{N_{0}}/{2}\land 0<p_{1}<N_{1}\\ N_{0}&\begin{aligned} &\text{if}\ N_{0}>0\land p_{0}\geq{N_{0}}/{2}\land\\ &0<p_{1}<N_{1}\end{aligned}\\ 0&\text{otherwise.}\end{cases}

This expression corresponds to the inter-tile case, where the dependence crosses tile boundaries. The volume remains fully parametric in the loop bounds $N_{i}$ and tile sizes $p_{i}$ and can thus be evaluated simply by inserting concrete values for the parameters $N_{0},N_{1},p_{0},$ and $p_{1}$ . For example, for the configuration shown in Figure 2, with an iteration space of size $N_{0}\times N_{1}=4\times 5=20$ and tiles of size $p_{0}\times p_{1}=2\times 3$ , the resulting volumes are ${\operatorname{Vol}}(\mathcal{S}_{7}^{\ast 1})=12$ and ${\operatorname{Vol}}(\mathcal{S}_{7}^{\ast 2})=4$ . These correspond to 12 intra-tile dependences (exactly matching the number of yellow arrows in the $j_{1}$ direction) and 4 inter-tile dependences (exactly matching the number of orange arrows in the $k_{1}$ direction).
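As a cross-check of these numbers, the following Python sketch counts the integer points of $\mathcal{I}_{7}^{\ast 1}$ and $\mathcal{I}_{7}^{\ast 2}$ by plain enumeration and compares them with the piecewise closed forms transcribed from the two case expressions above. It is not the symbolic procedure of the paper (no Barvinok or ISL call is made); it only evaluates the sets for concrete parameters. The function names are hypothetical, and the tile counts $t_{0}=t_{1}=2$ are fixed to the $2\times 2$ processor array of the example, which is how the bounded index $k$ gets unfolded.

```python
from itertools import product


def count_s7(N0, N1, p0, p1, t0, t1, intra):
    """Enumerate (j0, j1, k0, k1) and count the points of I_7^{*1} or I_7^{*2}."""
    count = 0
    for j0, j1, k0, k1 in product(range(p0), range(p1), range(t0), range(t1)):
        # Constraints shared by both sets: the point lies in the iteration space.
        if not (0 <= j0 + p0 * k0 < N0 and 0 < j1 + p1 * k1 < N1):
            continue
        # Source operand inside the same tile (intra) or in the neighbor tile.
        src_j1 = j1 - 1 if intra else j1 + p1 - 1
        if 0 <= src_j1 < p1:
            count += 1
    return count


def vol_intra(N0, N1, p0, p1):
    """Closed form for Vol(S_7^{*1}) on a 2x2 array, transcribed case by case."""
    if p1 >= 2 and 2 * p1 < N1:
        if 0 < p0 and 2 * p0 < N0:
            return 4 * p0 * (p1 - 1)
        if N0 > 0 and 2 * p0 >= N0:
            return 2 * N0 * (p1 - 1)
    if p1 <= N1 - 2 and 2 * p1 >= N1:
        if 0 < p0 and 2 * p0 < N0:
            return (2 * N1 - 4) * p0
        if N0 > 0 and 2 * p0 >= N0:
            return N0 * (N1 - 2)
    return 0


def vol_inter(N0, N1, p0, p1):
    """Closed form for Vol(S_7^{*2}) on a 2x2 array, transcribed case by case."""
    if 0 < p1 < N1:
        if 0 < p0 and 2 * p0 < N0:
            return 2 * p0
        if N0 > 0 and 2 * p0 >= N0:
            return N0
    return 0


# Configuration of Figure 2: N0 x N1 = 4 x 5, tiles 2 x 3, 2 x 2 processor array.
N0, N1, p0, p1, t0, t1 = 4, 5, 2, 3, 2, 2
enum_intra = count_s7(N0, N1, p0, p1, t0, t1, intra=True)
enum_inter = count_s7(N0, N1, p0, p1, t0, t1, intra=False)
print(enum_intra, vol_intra(N0, N1, p0, p1))  # 12 12
print(enum_inter, vol_inter(N0, N1, p0, p1))  # 4 4
```

The enumeration visits every candidate point, so its cost grows with the tile and array sizes, whereas each closed form needs only a handful of comparisons and multiplications regardless of the parameter values.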

Using Eq. (10), the energy of $S_{7}^{\ast 1}$ is given by one FD read and one RD write,

E_{7^{\ast 1}}^{M}=E(\mathrm{FD})+E(\mathrm{RD})=0.35+0.12=\qty{0.47}{\pico\joule}.

Similarly, the energy of $S_{7}^{\ast 2}$ is given by one ID read and one RD write,

E_{7^{\ast 2}}^{M}=E(\mathrm{ID})+E(\mathrm{RD})=0.24+0.12=\qty{0.36}{\pico\joule}.

For the concrete configuration, the contribution of these two statements to $E_{\mathrm{tot}}$ in Eq. (11) evaluates to

\operatorname{Vol}(\mathcal{S}_{7}^{\ast 1})E_{7^{\ast 1}}^{M}+\operatorname{Vol}(\mathcal{S}_{7}^{\ast 2})E_{7^{\ast 2}}^{M}=12\cdot 0.47+4\cdot 0.36=\qty{7.08}{\pico\joule}.

This energy corresponds to the total energy contribution of statement $S_{7}^{\ast}$ for the given configuration. The overall kernel energy is obtained by summing the contributions of all statements according to Eq. (11).
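The arithmetic of this example follows the volume-times-energy pattern of Eq. (11) and can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `total_energy` and the list-of-pairs representation are assumptions made for illustration, and the numbers are the volumes and register energies quoted in the example (energies taken to be in picojoules, an assumption about the unit).

```python
def total_energy(terms):
    """Sum Vol(S_q) * E_q over (volume, energy) pairs, following Eq. (11)."""
    return sum(volume * energy for volume, energy in terms)


# Register access energies quoted in Example 9 (assumed unit: pJ).
E_FD, E_RD, E_ID = 0.35, 0.12, 0.24

e_intra = E_FD + E_RD  # S_7^{*1}: one FD read and one RD write
e_inter = E_ID + E_RD  # S_7^{*2}: one ID read and one RD write

# Volumes for N0 x N1 = 4 x 5, tiles of size 2 x 3, 2 x 2 processor array.
contribution = total_energy([(12, e_intra), (4, e_inter)])
print(round(contribution, 2))  # 7.08
```

Passing one pair per statement of the kernel to the same helper would give $E_{\mathrm{tot}}$ ; only the two pairs belonging to $S_{7}^{\ast}$ are available from the example.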

V Experimental Results

This section evaluates the proposed symbolic energy analysis framework from two complementary perspectives. First, we validate the accuracy and analysis-time efficiency of the symbolic approach by deriving parametric volumes for both computations and memory operations, and comparing these with counts obtained from a cycle-accurate simulation of the execution of a loop nest. Second, we analyze energy and latency scaling with increasing problem sizes and show that the symbolic model provides a scalable basis for energy evaluation, making it particularly suitable to explore application-specific architecture sizing.

V-A Validation of Symbolic Volume Computation Results

We validate the accuracy of the symbolic volume computation results by comparing analytically derived integer-point counts for both data transfer and computation against reference counts obtained from a cycle-accurate simulator. The simulator operates using an XML-based architectural description that captures the entire TCPA architecture. Based on this description, the TCPA compiler [23] automatically tiles the input loop programs, maps the tiles onto the processor array, and schedules their execution. During simulation, all data transfers and computations are tracked, yielding exact reference counts. Each selected PolyBench kernel [10] is then simulated for a specified architectural configuration to obtain ground-truth memory access and computation counts.

Note that our analytical method can analyze the energy of a loop kernel symbolically without requiring concrete loop bounds. The resulting volumes are expressed as closed-form quasi-polynomials. In the following, we evaluate our approach on eight different benchmarks from [10] for varying problem sizes and architectural parameters. The analytically derived access counts and the resulting total energy values match the simulation results exactly, confirming the accuracy of the symbolic formulation.

Figure 4 compares the analysis time of simulation-based counting and the proposed symbolic method for the GESUMMV benchmark mapped onto an $8\times 8$ PE array. The simulation-based approach exhibits a rapid increase in analysis time with increasing matrix size, as it explicitly executes all loop iterations, instruction activities, and memory accesses within the cycle-accurate model. As the iteration space grows quadratically with the matrix dimension, the simulation cost scales accordingly.

In contrast, the symbolic approach evaluates one-time calculated closed-form expressions for computation and memory access statements, independent of the number of loop iterations. As a result, the analysis time remains almost constant at less than $0.5\ \unit{s}$ across all evaluated problem sizes. This decoupling of analysis cost from the dynamic execution makes the symbolic method scalable and suitable for exploring large loop bounds that are impractical for simulation-based analysis.

V-B Energy and Latency Scaling Analysis

Since the proposed analysis derives closed-form expressions, it enables a fast evaluation for different loop bounds and a fast exploration of architectural configurations. In the following, we study the total energy $E_{\mathrm{tot}}$ and latency $L$ of a given loop nest with increasing loop bounds. Thereby, we provide a fine-grained breakdown of energy contributions across different access locations $L(x)\in\mathrm{T}$ , revealing how memory accesses and computations contribute individually. As will be shown, computation energy remains relatively small compared to data transfer energy across all configurations. The total energy and latency are computed using Eqs. (11) and (8), respectively. For this analysis, we consider the GEMM kernel [10], whose iteration space grows cubically with the problem size (i.e., $O(N^{3})$ ), so that both computation and data movement grow rapidly as the loop bounds increase. Figure 5 illustrates how total energy and latency increase with matrix size for GEMM, analytically evaluated for an $8\times 8$ PE grid. As expected, both metrics grow rapidly with increasing loop bounds due to the cubic growth of the iteration space.

The energy breakdown further reveals how different components scale. For smaller problem sizes, DRAM accesses dominate the total energy consumption. However, as the loop bounds increase, the relative contribution of DRAM energy decreases, while the energy associated with on-chip storage and communication (such as FD and RD registers) as well as computation increases. This shift is primarily due to larger tile sizes, which increase intra-tile data reuse and consequently amplify activity within local storage locations. This growth is not strictly proportional to the loop bounds, as data accessed from different locations scales differently depending on tile sizes and data dependencies. In many existing energy estimation approaches, activity counts are obtained by simulating a fixed workload instance for smaller loop bounds and extrapolating to larger ones, since simulation does not handle larger bounds efficiently. Our proposed symbolic analysis does not suffer from this scalability limitation.

Since not only the energy estimates are parametric but also the schedules, performance metrics such as latency, throughput, and thus also energy efficiency can be computed analytically. This paves the way for a rapid comparison of architectural configurations and supports DSE to identify suitable accelerator architectures for a huge number of loop applications.

VI Conclusion

This paper introduces a symbolic methodology for energy analysis of loop nests when mapped and scheduled on parallel processor array accelerator architectures. In contrast to simulation-based approaches, which require an explicit execution of all loop iterations, all memory accesses, and all operations, our method derives closed-form expressions for computations and memory accesses directly from the program representation and mapping. By combining symbolic volume computation of polyhedral spaces with pre-characterized energy margins per access and operation types, the approach enables a symbolic and accurate estimation of energy without requiring time-intensive cycle-accurate simulations for each setting of loop bounds.

The proposed analysis lifts energy evaluation from a simulation-based process to a polyhedral-model-based formulation that is applicable at early design stages. Once the symbolic expressions are derived, the total energy can be evaluated efficiently for different loop bounds and architecture configurations without repeated analysis. This is particularly important for loop-intensive applications, where simulation time increases rapidly with problem size, while the symbolic evaluation remains nearly constant, providing a scalable basis for design space exploration.

Our evaluations have demonstrated that our symbolic and cycle-accurate simulation-based energy analysis approaches match in their results, with the symbolic approach also providing a fine-grained view of the contributions of different types of memory and register accesses. This makes it possible to not only study the total energy, but also the influence of array size, mapping, scheduling, and data movement across different access locations.

Overall, the presented framework provides a practical and efficient approach to energy estimation for processor-array accelerators. By enabling fast symbolic evaluation of energy, latency, and related performance metrics from parametric loop bounds, it supports a fast analysis of architectural configurations and helps to identify suitable accelerator architectures for a given application.

References

[1] A. I. Barvinok (1994) A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed. Mathematics of Operations Research 19 (4), pp. 769–779.
[2] A. I. Barvinok (2002) A course in convexity. Graduate Studies in Mathematics, Vol. 54, American Mathematical Society. ISBN 978-0-8218-2968-4.
[3] Y. Chen, T. Krishna, J. S. Emer, and V. Sze (2017) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 52 (1), pp. 127–138.
[4] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche (2014) Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Trans. Embedded Comput. Syst. 13 (4s), pp. 133:1–133:29.
[5] V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas (2021) AccelWattch: A power modeling framework for modern GPUs. In Int. Symposium on Microarchitecture (MICRO), pp. 738–753.
[6] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich (2006) A highly parameterizable parallel processor array architecture. In IEEE Int. Conf. on Field Programmable Technology (FPT), Bangkok, Thailand, pp. 105–112.
[7] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei (2019) A survey of coarse-grained reconfigurable architecture and design: taxonomy, challenges, and applications. Comput. Surv. 52 (6), pp. 118:1–118:39.
[8] A. Parashar, P. Raina, Y. S. Shao, Y. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. S. Emer (2019) Timeloop: A systematic approach to DNN accelerator evaluation. In IEEE Int. Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, USA, pp. 304–315.
[9] A. Pedram, S. Richardson, M. Horowitz, S. Galal, and S. Kvatinsky (2017) Dark memory and accelerator-rich system optimization in the dark silicon era. IEEE Des. Test 34 (2), pp. 39–50.
[10] L. Pouchet. PolyBench: the polyhedral benchmark suite. Accessed: Oct. 10, 2025.
[11] R. Seghir, S. Verdoolaege, K. Beyls, and V. Loechner (2004) Analytical computation of Ehrhart polynomials and its application in compile-time generated cache hints. In Int. Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES).
[12] J. Teich, A. Tanase, and F. Hannig (2014) Symbolic mapping of loop programs onto processor arrays. J. Signal Process. Syst. 77 (1-2), pp. 31–59.
[13] J. Teich, L. Thiele, and L. Z. Zhang (1997) Partitioning processor arrays under resource constraints. J. VLSI Signal Process. 17 (1), pp. 5–20.
[14] J. Teich and L. Thiele (1993) Partitioning of processor arrays: a piecewise regular approach. Integr. 14 (3), pp. 297–332.
[15] J. Teich (1993) A compiler for application specific processor arrays. Ph.D. Thesis, Saarland University, Germany. ISBN 978-3-86111-701-8.
[16] S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe (2004) Analytical computation of Ehrhart polynomials: enabling more compiler analyses and optimizations. In Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 248–258. ISBN 1581138903.
[17] S. Verdoolaege (2010) isl: an integer set library for the polyhedral model. In Third Int. Congress on Mathematical Software (ICMS), K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama (Eds.), Vol. 6327, pp. 299–302.
[18] D. Walter, M. Brand, C. Heidorn, M. Witterauf, F. Hannig, and J. Teich (2024) ALPACA: an accelerator chip for nested loop programs. In Int. Symposium on Circuits and Systems (ISCAS), pp. 1–5.
[19] D. Walter, M. Halm, D. Seidel, I. Ghosh, C. Heidorn, F. Hannig, and J. Teich (2026) Modeling and Mapping of Regular Nested Loops on Processor Arrays: CGRAs vs. TCPAs. In 29. Workshop zu Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen.
[20] D. Walter, M. Witterauf, and J. Teich (2020) Real-time scheduling of I/O transfers for massively parallel processor arrays. In 18th ACM/IEEE Int. Conf. on Formal Methods and Models for System Design (MEMOCODE), Jaipur, India, December 2-4, 2020, pp. 1–11.
[21] M. Wijtvliet, H. Corporaal, and A. Kumar (2021) CGRA-EAM - rapid energy and area estimation for coarse-grained reconfigurable architectures. ACM Trans. Reconfigurable Technol. Syst. 14 (4), pp. 19:1–19:28.
[22] M. Wijtvliet, L. Waeijen, and H. Corporaal (2016) Coarse grained reconfigurable architectures in the past 25 years: overview and classification. In SAMOS, pp. 235–244.
[23] M. Witterauf, D. Walter, F. Hannig, and J. Teich (2021) Symbolic loop compilation for Tightly Coupled Processor Arrays. ACM Trans. Embedded Comput. Syst. (TECS) 20 (5), pp. 1–31.
[24] M. Witterauf, A. Tanase, F. Hannig, and J. Teich (2016) Modulo scheduling of symbolically tiled loops for Tightly Coupled Processor Arrays. In Int. Conf. on Application-specific Systems, Architectures and Processors (ASAP), pp. 58–66.
[25] Y. N. Wu, J. S. Emer, and V. Sze (2019) Accelergy: an architecture-level energy estimation methodology for accelerator designs. In Proceedings of the Int. Conf. on Computer-Aided Design (ICCAD), Westminster, CO, USA, November 4-7, 2019, D. Z. Pan (Ed.), pp. 1–8.
