Department of Electrical Engineering and Computer Science
Kassel, Hesse, Germany
11email: {r.nather}@uni-kassel.com
Finding Conservation Laws of Large Dynamical Systems with Tasks and Futures:
A Case Study in Utilizing Dynamic Data Dependencies
Abstract
As parallel workloads grow in complexity, managing fine-grained data dependencies becomes a critical challenge. Futures offer a promising model for handling these dependencies, particularly in irregular algorithms, but they also come with the restriction of value-immutability. This immutability limits the ability to perform in-place memory updates, a necessity for high-performance linear algebra where memory recycling is paramount.
In this paper, we address these limitations by introducing a new construct, await_delete, which extends traditional future semantics to allow safe value reuse once consumers are finished. Building on this extension, we present a novel future-based algorithm for the block-wise inversion of dense, symmetric matrices, motivated by a recent algorithm for finding conservation laws of dynamical systems.
We implement our approach in an extended version of Taskflow and evaluate it through strong-scaling experiments. Our results demonstrate that while futures incur significant overhead on smaller problem sizes, they achieve nearly linear scaling on large matrices. We analyze the amortization threshold and show that futures are a viable high-performance tool for large-scale linear algebra.
1 Introduction
As the demand for computational power continues to grow, parallel programming models must evolve to handle increasingly complex workloads. Among these models, futures, awaitable placeholders for asynchronously computed values, offer a compelling approach to managing data dependencies with fine granularity. By allowing a program to only await the specific parts of a computation that are strictly necessary, futures can effectively avoid induced dependencies that might otherwise stall execution. This capability makes them an attractive candidate for parallelizing algorithms with complicated, irregular dependency graphs; especially those containing dynamic (data) dependencies which can only be observed at runtime.
However, the adoption of futures is not without trade-offs. The runtime bookkeeping required to track asynchronous execution state incurs additional memory allocations and management, which can become a bottleneck. Furthermore, the standard semantics of futures typically require that values remain constant once computed. This immutability simplifies reasoning about concurrency but limits the flexibility of the model, particularly in scenarios where in-place updates or memory recycling are essential for performance.
To understand where futures reside within the vast design space of dependency management and parallelization, and to analyze how well they perform in practice, it is necessary to not just evaluate them on synthetic benchmarks, but to also examine a real-world problem characterized by a heavy computational load and complicated dependencies.
For this purpose, we focus on the parallelization of an algorithm for finding conservation laws of dynamical systems, recently proposed by Mebrartie et al[4]. The most computationally expensive component of this algorithm involves the inversion of large, symmetric matrices.
To mitigate the substantial memory requirements of this approach, the original authors suggested the use of hierarchical matrices (-matrices). -matrices are recursively subdivided block matrices in which some leaf blocks are stored in a compressed form. However, parallel algorithms for -matrix arithmetic typically involve complicated dependency patterns (see [1, 5]). Interestingly, future-based algorithms have already been proposed specifically tailored for the -LU decomposition [5, 6], demonstrating the utility of futures for this exact problem.
But, while -matrices offer significant memory savings, we first need a baseline to determine how well futures perform on dense matrices. Otherwise, isolating the overhead of the future model would be difficult, as the computational gain from -matrix arithmetic would skew the performance measurement. Therefore, to isolate the overheads and benefits of the futures model itself, we first restrict our investigation to the dense matrix case. This provides a controlled baseline that captures the essential dependency structure without the performance gain from -techniques.
To apply futures effectively to this dense matrix inversion, we must address the limitation of value-immutability, specifically the inability to recycle memory buffers once their consumers have finished. In high-performance linear algebra, the ability to update values in place is often crucial. To bridge this gap, we introduce a new construct, await_delete, which allows us to reuse or change the value of a future once no other readers for this value exist. This extension allows us to manage memory more directly, while retaining the dependency-handling benefits of the futures model.
Building on this extension, we present a future-based algorithm for the block-wise inversion of matrices. While we only focus on the fully dense case in this paper, this algorithm serves as the foundation for future work. By introducing case distinctions for the leaf-blocks, it can readily be extended to -matrices. We have implemented this algorithm in an extended version of Taskflow[6, 3] and conducted a series of strong-scaling experiments to determine the threshold at which the overhead associated with futures amortizes relative to the problem size.
We summarize our contributions as follows:
-
•
We extend the standard futures model to support value updates via the await_delete construct and discuss the theoretical implications of this relaxation.
-
•
We propose a novel future-based algorithm for the block-wise inversion of symmetric matrices, designed to maximize parallelism through fine-grained dependency management.
-
•
We provide a implementation and analysis of the proposed algorithm. Our results demonstrate that futures can achieve linear scaling on large problem sizes, but suffer from a significant overhead in smaller problem sizes. We discuss potential causes and outline how to lower the amortization threshold.
The remainder of this paper is structured as follows: Section 2 provides some background about the future model and the algorithm from Mebrartie et al. Section 3 presents the block-wise inversion algorithm, and Section 4 discusses the implementation details and performance results. Section 5 summarizes related work and Section 6 presents our conclusions.
2 Background
This section provides the necessary background for this paper. We begin by defining the specific future model and semantics utilized in our implementation, drawing upon established work in [5, 6]. We then review how these futures are applied to manage the complex, recursive dependencies found in hierarchical matrix (-matrix) arithmetic. Finally, we outline the mathematical formulation for finding conservation laws in dynamical systems, which serves as the primary driver for our case study.
2.1 Futures for dynamic dependencies
We adopt the definition and implementation of futures described in [5] and [6]. Futures act as placeholders for values, featuring distinct writer and reader sides. Specifically, each placeholder is associated with a single promise (used to write the value) and one or more futures (used to read the value). The promise is fulfilled exactly once; upon fulfillment, all associated futures become ready immediately.
Promises and futures are identified by their name and type, and can be stored in standard data structures such as arrays. The element type is unrestricted, enabling advanced constructs such as futures of futures. When passed as parameters, futures are duplicated as needed, while promises remain unique and are always moved (and never copied). Moving a promise transfers the responsibility for filling it to the recipient task. A task is ready to execute only when all futures it depends on have been fulfilled.
Regarding memory management, our implementation utilizes a reference-counting scheme. When all futures referencing a specific placeholder go out of scope, the placeholder and its underlying memory are automatically deallocated, providing a limited form of automatic memory reclamation.
We will use the general notion that spawning a task will return a future for the result of the task. A task can also be spawned into a promise, thereby binding the result of that task to a pre-existing promise. The code in Listing 1 demonstrates how recursive computations and dependencies are expressed using the promise/future model.
-
add:
When used to create a task, takes two integers, computing their sum, and fulfills the output promise p via p.set(i+j).
-
fib:
Computes the -th Fibonacci number and fulfills promise p:
-
–
Base case: For , fulfills p immediately with .
-
–
Recursive case:
-
1.
Spawn fib(n-1), obtaining future f1 for .
-
2.
Create promise p2 for and obtain its future f2.
-
3.
Spawn add with f1, f2, and output promise p (scheduled only after both futures are ready)
-
4.
Compute directly via fib(n-2, p2) (synchronous in this example).
-
1.
-
–
2.2 Futures for recursive block-matrices
The application of futures to -matrix arithmetic, as detailed in [5], leverages the recursive nature of the data structure. Non-elementary tasks recursively spawn other non-elementary tasks as they traverse the matrix hierarchy, eventually reaching the leaf level where elementary tasks are executed. Elementary tasks perform the actual, numerical work; in our implementation, this will be done by an external library. Futures are used to track dependencies throughout this process.
As tasks execute, several pieces of information must be communicated between them to manage dependencies and data flow:
-
(i1)
Has the task been completed? For non-elementary tasks, this means that all child tasks have been spawned, whereas for elementary tasks it indicates that the associated matrix block has been updated. The cases are combined, since a (non-elementary) task generally does not know whether another task is elementary or not.
-
(i2)
Is the update of the associated matrix block complete? (This is only relevant for non-elementary tasks, where (i2) differs from (i1))
-
(i3)
Contents of the (updated) matrix block (This is only relevant for elementary tasks)
-
(i4)
Futures of all child tasks (This is only relevant for non-elementary tasks)
To facilitate this communication, we define the recursive node structure in Listing 2. The rec_matrix_node struct contains a data future for storing the memory address of leaf blocks and a boolean flag leaf to distinguish between internal and leaf nodes. The matrix_t type represents the underlying matrix type used by the external library mentioned above. The array of futures subnodes contains the futures for child tasks in non-leaf nodes. Finally, the done future signals the completion of the operation associated with the node.
It is important to distinguish the roles of these futures: data becoming ready signifies that the memory address for the block is known (essential for non-elementary tasks to schedule children), whereas done becoming ready signifies that the computation on that block (and all subnodes) has finished. Which task is responsible for setting them depends on the algorithm being implemented.
2.3 Finding conservation laws by Kernel Ridge Regression
The primary motivation for our matrix inversion algorithm stems from the problem of identifying conservation laws in dynamical systems.
Consider a dynamical system defined by , where . A conservation law is a scalar function that remains constant along the trajectories of the system, satisfying .
As proposed by Mebrartie et al. [4], this problem can be formulated as a regression task using Kernel Ridge Regression (KRR). Given a set of trajectory data , where is the number of samples per trajectory and is the number of trajectories, the objective is to find a function that minimizes the deviation from constancy along each trajectory.
The conservation law is then expressed as a linear combination of kernel functions , where are the coefficients to be determined.
To find the coefficients , one must solve a quadratic optimization problem. This involves the inversion of the regularized kernel matrix , where is a regularization parameter and the symmetric matrix is computed from the data points .
The authors note that execution time is dominated by the repeated inversion of kernel matrices. Since matrix size depends on the number of training samples (), these matrices can become very large. Furthermore, identifying optimal parameters via cross-validation requires solving this inverse problem multiple times for different matrix configurations.
3 Algorithm
Standard futures enforce value immutability, preventing in-place updates and forcing new memory allocations for every computation step. To address this, we introduce await_delete, a primitive that allows values to be reused or overwritten once it is guaranteed that no other tasks will read them.
3.1 The await_delete-Construct
The await_delete construct takes an existing future as input and returns a new future . The promise associated with is implicit and tied to the lifetime of the original future . Specifically, is fulfilled only when the reference count of drops to zero, indicating that all consumers of the original value have released their handles.
From the perspective of a task waiting on , the behavior is identical to waiting on a standard future. The fulfillment of then signals not just that data is available, but that the exclusive ownership of the underlying memory has been reclaimed. This allows the consumer of to safely treat the memory as mutable and reuse it for subsequent operations without allocating new buffers.
It is important to note that await_delete can only be invoked once per future, as it assumes that the reference count is sufficient to track exclusivity. Provided this invariant holds, scheduling tasks according to their future-based dependencies (as outlined in Subsection 2.1) guarantees deterministic execution order. However, as with standard parallel floating-point arithmetic, bit-wise reproducibility is not guaranteed due to the non-associative nature of floating-point operations.
Furthermore, the await_delete construct composes safely within arbitrary dependency graphs, including nested structures. Whether is the result of a simple computation or a complex chain of nested futures, is fulfilled only when the specific future becomes unreachable. This local property ensures that await_delete can be applied to any node in a graph without introducing race conditions or deadlocks, provided the write-once constraint of the promise is maintained and await_delete is invoked at most once per future.
3.1.1 Application to Leaf Blocks:
In the context of dense matrix arithmetic, elementary tasks operate on individual matrix blocks. Without await_delete, a sequence of operations (e.g., scaling followed by addition) would typically require allocating a temporary buffer for the intermediate result. With await_delete, we can chain these operations on the same memory region.
Consider a block represented by a future . A task reads and produces a new block , represented by future . If a subsequent task requires as input, it typically waits on . However, if is intended to update in-place, we instead pass to . This ensures that can only be executed once is finished and no more consumers for the result of exist, thereby allowing us to reuse the memory where block is stored. Note that this extends to (de)allocation as well: allocating space for the block is fulfilling the corresponding promise, whereas deallocation simply becomes the last task in the sequence of operations.
3.1.2 Application to Recursive Structures:
In recursive data structures like the one shown in Listing 2, operations often involve traversing the tree and modifying nodes. A critical challenge arises when multiple sequential operations must be performed on the same structure (e.g., a forward factorization pass followed by a backward substitution pass). In this case it needs to be ensured that the elementary tasks on the leafs are sequenced correctly.
Consider a recursive node and two sequential operations: and . If these non-elementary tasks are scheduled using standard futures for the child nodes, a race condition may occur. Specifically, the parent task responsible for might finish spawning its children before the parent task for has finished spawning its children. This would allow elementary tasks from to execute before elementary tasks from on the same data, inverting the intended sequential dependencies.
Passing an await_deleted future for a child node solves this issue by enforcing strict sequentialization at the spawner level. When a parent task completes, that is, finishes spawning tasks, it drops its handles to the child futures. By having the next task wait for await_delete(child_future), we ensure that the next operation cannot begin spawning until the previous operation has finished spawning. This effectively “peels back” the recursive layers one by one, ensuring that task spawns cannot overtake each other. Crucially, this does not require waiting for the entire sub-tree of tasks to complete; we only synchronize the boundary between the sequential passes at each layer individually.
3.2 Blockwise Inversion of Symmetric Matrices
We apply the await_delete construct to the problem of inverting a symmetric matrix using a standard block-wise recursive approach. Given a symmetric matrix partitioned into four blocks, the inverse can be computed using the following block decomposition:
where the intermediate steps are defined as:
Note that each operation overwrites the respective block; the notation above distinguishes the intermediate values (, ) to clarify the data flow.
Translating this into the future-model is then straightforward; each overwrite happens on an await_deleted future, where the result is passed to the subsequent operations. Algorithm 1 details this process.
The function Invert takes a recursive node representing a matrix block. If the node is a leaf, we simply invoke an external routine to invert the dense block in-place.
For non-leaf nodes, the algorithm follows the dependencies defined in the block decomposition. We spawn tasks to compute the intermediate components () and tasks for the final inverse blocks () sequentially.
Crucially, each step reuses the memory of a specific input block:
-
•
Line 5: We recursively invert the top-left block . The result is stored directly in the memory previously occupied by .
-
•
Line 6: We compute . The result will overwrite the top-right block .
-
•
Line 7: We compute the Schur complement . This is a fused multiply-add operation that will overwrite the bottom-right block .
-
•
Line 8: We recursively invert the Schur complement to obtain , reusing the same memory slot.
-
•
Line 9: We compute . This will overwrite the bottom-left block .
-
•
Line 10: We compute as the transpose of . This reuses the memory currently holding (which was previously ).
-
•
Line 11: Finally, we compute . This will overwrite the memory holding (which was originally ).
While standard block-wise algorithms overwrite input blocks to save memory, traditional futures would necessitate allocating new memory for every intermediate result, incurring significant overhead. The await_delete construct resolves this by enabling precise memory reuse sequencing. It signals when a block is no longer needed by dependent tasks, allowing its memory to be safely reclaimed for subsequent operations. This creates a pipeline that recycles memory exactly between computational phases, maintaining parallelism while minimizing footprint as described in Subsection 3.1.2.
Note that this algorithm does not make any assumptions about the storage layout of the leaf blocks. Indeed, the dependency management with futures allows the computation to flow downward, ultimately overwriting each (leaf) block without the tasks higher up in the recursion hierarchy needing to know where they are actually stored. By the same token, it also generalizes to both operations on non-square matrices, as well as the inversion of non-symmetric matrices. However, while inverting non-symmetric matrices will require additional working space, utilizing the await_delete construct allows us to bound this overhead. Specifically, we can reuse the memory of the input matrix to store the result (as above), but also of the additional working space, thereby limiting the total auxiliary memory requirement to at most half the size of the matrix.
4 Evaluation
To analyze the performance of our future-based matrix inversion, we conducted a series of strong-scaling experiments. We varied both the total size of the matrix and the granularity of the tasks by adjusting the size of the leaf blocks. Specifically, we tested three matrix sizes: , , and . For each matrix size, we varied the leaf block size such that the ratio between the matrix size and the block size was , , and . It is important to note that the total numerical work for each matrix size remains constant across different block sizes; the only variable is the granularity of the parallel tasks.
4.1 Runtime Analysis
Figure 1 presents the execution time measurements. Each subfigure corresponds to one matrix dimension, with the three curves representing the different block size ratios. The results confirm that parallelism is effectively exploited, as execution time decreases with increasing core count for all configurations. However, the impact of task granularity is significant. In Subfigure 1A (), we observe that finer granularity (smaller block sizes, corresponding to higher ratios) worsens performance. Specifically, the configuration with a ratio of (smallest blocks) exhibits notably higher runtimes compared to the coarser configurations. For the larger matrix sizes ( and ), the performance gap between block sizes and (ratios and ) is less pronounced, though the trend of coarser granularity yielding better performance persists.
4.2 Speedup Analysis
To better quantify the efficiency of the parallelization, Figure 2 depicts the computed speedup curves. Each subfigure corresponds to a ratio between matrix size and block size. For the largest matrix size (), the speedup curves are nearly identical across all block size ratios (Subfigure 2C). This suggests that for sufficiently large problems, the overhead associated with fine-grained task management becomes negligible compared to the computational work. For the intermediate matrix size (), the optimal performance is achieved at a ratio of (Subfigure 2B). Here, the finer granularity of ratio leads to a noticeable degradation in speedup, while the coarser ratio of performs similarly to the optimal case. Interestingly, for the smallest matrix size (), the speedup at ratio converges with the performance of the matrix at the same ratio, indicating that the task granularity is sufficiently coarse to amortize the overhead even for smaller problem sizes.
4.3 Discussion of Overheads
The performance degradation at fine granularities points to a runtime bottleneck rather than an algorithmic flaw. We attribute this to the current future implementation, which uses a locking scheme (atomic compare-and-swap) for reference counting. With many small tasks, congestion arises as numerous threads simultaneously attempt to read or fulfill futures, causing significant synchronization overhead. This effect is most pronounced when the task creation-to-computation time ratio is high. As matrix size increases, the computation-to-communication ratio improves, diminishing the relative cost of lock contention and closing the performance gap for the case.
To validate our hypothesis that lock contention is the primary bottleneck, we profiled the execution to separate the time spent in actual computation from the time spent managing futures. Figure 3 provides a detailed breakdown of the runtime for matrix sizes (Subfigure 3A) and (Subfigure 3B), categorizing the time into three components: compute time (actual numerical computation), future management (every operation that touches the shared state of futures), and RTS overhead.
The data clearly illustrates the trade-off between granularity and overhead. For configurations with fine granularity (high ratio ), the time spent on future management (red) constitutes a significant portion of the total runtime. This confirms that the cost of managing dependencies dominates when tasks are small. Conversely, as the block size increases (lower ratio ), the proportion of time spent on future management drops and the application becomes compute-bound.
Comparing this approach to a traditional threading model like OpenMP would likely show lower overheads for the dependency management. However, such models lack the dynamic flexibility required for our target application, specifically the handling of irregular, runtime-generated dependencies inherent to -matrix arithmetic and the conservation law problem (Subsections 2.2 and 2.3). Thus, the observed overhead represents the cost of this increased expressiveness.
5 Related Work
Taskflow [3] already provides a notion of composability of dependencies through module tasks, but these are static programming constructs designed as a design-time abstraction and lack the flexibility required for the dynamic and recursive task generation inherent in our matrix inversion algorithm.
The data-versioning approach in Superglue [7] is similar to our approach. However, it requires manual management of both data partitioning and versioning, whereas our future-based approach follows the recursive nature of subdivided matrices more closely. Together with our new await_delete construct, this gives an implicit data versioning which avoids the additional development effort.
While Parsec [2] automatically orders tasks to prevent premature data overwrites, its dependency model does not easily accommodate dependencies between non-sibling tasks, which are common in -arithmetic.
6 Conclusion
This work demonstrates that futures offer a compelling solution for managing complex, irregular dependency patterns in parallel algorithms. Our empirical analysis reveals a critical trade-off: while futures incur significant overhead for small problem sizes, they achieve nearly linear scaling on large matrices.
The await_delete construct is essential to this performance profile. By enabling safe memory reuse at the task level, it allows to maintain the memory efficiency of in-place linear algebra while preserving the dependency management benefits of futures. This facilitates seamless integration with external numerical libraries, as memory management is handled entirely within the task model. While await_delete provides a pragmatic solution, we view it as a temporary extension; future work should aim to integrate memory ownership semantics more natively into the future model to eliminate the need for ad-hoc constructs.
Looking ahead, we plan to extend our approach to hierarchical matrices to fully implement the pipeline for finding conservation laws in dynamical systems. Additionally, we aim to optimize the runtime to reduce the overhead of the current implementation and lower the amortization threshold.
References
- [1] Carratalá-Sáez, R., Christophersen, S., Aliaga, J.I., Beltran, V., Börm, S., Quintana-Ortí, E.S.: Exploiting Nested Task-Parallelism in the H-LU Factorization. Journal of Computational Science 33, 20–33 (2019). https://doi.org/10.1016/j.jocs.2019.02.004
- [2] Hoque, R., Herault, T., Bosilca, G., Dongarra, J.: Dynamic task discovery in PaRSEC: a data-flow task-based runtime. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. ScalA ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3148226.3148233
- [3] Huang, T.W., Lin, D.L., Lin, C.X., Lin, Y.: Taskflow: A lightweight parallel and heterogeneous task graph computing system. IEEE Transactions on Parallel and Distributed Systems 33(6), 1303–1320 (2021). https://doi.org/10.1109/TPDS.2021.3104255
- [4] Mebratie, M.A., Nather, R., von Rudorff, G.F., Seiler, W.M.: Machine learning conservation laws of dynamical systems. Phys. Rev. E 111, 025305 (Feb 2025). https://doi.org/10.1103/PhysRevE.111.025305
- [5] Nather, R., Fohry, C.: Futures for Dynamic Dependencies–Parallelizing the -LU Factorization. In: Workshop on Asynchronous Many-Task Systems and Applications. pp. 9–21. Springer (2024). https://doi.org/10.1007/978-3-031-61763-8_2
- [6] Nather, R., Fohry, C.: Integration of a Powerful Future Construct into the Taskflow System. SN Computer Science 7, 31 (2025). https://doi.org/10.1007/s42979-025-04577-y
- [7] Tillenius, M.: Superglue: A shared memory framework using data versioning for dependency-aware task-based parallelization. SIAM Journal on Scientific Computing 37(6), C617–C642 (2015). https://doi.org/10.1137/140989716