
This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3182500

Deep Reinforcement Learning Enabled Self-Configurable Networks-on-Chip for High-Performance and Energy-Efficient Computing Systems
Md Farhadur Reza, (Member, IEEE)
Department of Mathematics and Computer Science, Eastern Illinois University, Charleston, IL 61920 USA (e-mail: [email protected])
Corresponding author: Md Farhadur Reza (e-mail: [email protected]).

ABSTRACT Networks-on-Chip (NoCs) have become the dominant interconnect fabric for multi/many-core on-chip systems because of their scalability and parallelism. On-chip network resources can be dynamically configured to improve the energy efficiency and performance of the NoC. However, the large and complex design space of heterogeneous NoC architectures is difficult to explore within a reasonable time for optimal energy-performance trade-offs. Furthermore, reactive resource management is not effective in preventing problems, such as thermal hotspots, from happening in adaptive systems. Therefore, we propose machine learning techniques to provide proactive solutions within an instant in NoC-based computing systems. We present a deep reinforcement learning technique to configure the voltage/frequency levels of NoC routers and links for both high performance and energy efficiency while meeting a global energy budget constraint. A distributed reinforcement learning technique is proposed, where each reinforcement learning agent configures a NoC router and its associated links intelligently based on system utilization and application demands. Additionally, neural networks are used to approximate the actions of the distributed reinforcement learning agents. Simulation results for 256-core and 16-core NoC architectures under real applications and synthetic traffic show that the proposed self-configurable approach improves energy-delay product (EDP) by 30-40% compared to a traditional non-machine-learning-based solution.

INDEX TERMS Network-on-Chip (NoC), Multicore Architecture, Manycore Processor, Machine Learning (ML), Reinforcement Learning (RL), Distributed RL, Deep Reinforcement Learning (Deep RL), Q-learning, Neural Networks (NNs), Self-Configurable, Energy Efficiency, High Performance

I. INTRODUCTION
Multiprocessor Systems-on-Chip (MPSoCs) and chip multiprocessors (CMPs) are moving towards the integration of hundreds to thousands of cores on a chip. Manycore on-chip systems have better power efficiency, interconnection, and latency compared to traditional local area network (LAN) based systems [28]. Industry and academia researchers are working on manycore single-chip solutions that could replace the traditional big data-center solution and/or that can be used as the base chip for supercomputers with millions of cores [1], [8], [10]. For example, a single chip with 850K independent cores has been developed by Cerebras [1], and each processor chip of Sunway TaihuLight, which has 40960 processors, contains 256 cores [10]. As an on-chip system can contain hundreds to thousands of cores, the network-on-chip (NoC) has been adopted as the standard solution to manage the complex on-chip communication among the cores running the tasks of applications. NoC offers several important benefits over the traditional bus in terms of scalability, parallelism, throughput, and power efficiency for on-chip communication in multi/many-core systems [5], [14], [16]. The number of NoC router and link resources needed to support the increased number of cores in an on-chip system has also increased significantly. This results in a significant increase in NoC power consumption relative to the total chip power. Several NoC prototypes have shown that the on-chip network consumes 10-40% of the total chip power, including 30% in the Intel 80-core Terascale chip [24] and 40% in the MIT RAW chip [48]. Furthermore, [29] demonstrated that data movement consumes 25% of the total energy. Because of the gap between transistor density and transistor power efficiency in process technology (the failure of Dennard scaling [17]), manycore chips face a power budget problem [47]. High power consumption also affects the lifespan of systems due to increased heat buildup.


With the advancement of technology, computation becomes more energy-efficient compared to communication. In 7nm transistor technology, a unit of communication consumes 6 times more energy than a unit of computation [9]. The fraction of time an application spends in communication increases with the number of processing cores in the system, and it can be more than 50% for large-scale systems [6], which significantly hampers parallel performance (as computations need to wait for data transferred in communication). The key element of scalable chip performance is the on-chip interconnect connecting the cores and memory. Therefore, a challenging research problem is to design an energy-efficient and high-performance NoC for multi/many-core computing systems. Our objective is to obtain a high-performance NoC while achieving energy efficiency. In this work, besides a small-scale 16-core NoC, we consider a large-scale 256-core architecture that is placed in an 8 × 8 concentrated mesh (CMesh) topology. A 64-core CMesh architecture is illustrated in Figure 1, where the routers are placed in a 4 × 4 2D-Mesh topology. Four cores are concentrated with a single router in a tile, where every core can be heterogeneous with a different computational capacity. Each core has an individual L1 cache, and each router has an L2 cache shared among the four cores connected to that router. Routers are connected to each other via links to form the NoC, where every router can be heterogeneous with a different communication capacity, including varying link bandwidths and buffer counts.

FIGURE 1: 64-core architecture connected through a 4 × 4 concentrated 2D-Mesh NoC.

Different tasks of an application can have different computation and communication demands; for example, one task can be computation-intensive while another task can be communication-intensive. Furthermore, multiple applications with different demands can run simultaneously on a multi/many-core chip. Multi/many-core systems can contain heterogeneous computing nodes, for example, GPUs, accelerators, and CPUs. The NoC can be designed with heterogeneous capacities in different parts of the NoC. For example, routers with more buffers and links with higher bandwidth can be designed to support computing nodes with higher communication requirements. Because of the diverse traffic patterns of the application(s), different routers and links in the NoC can carry different amounts of data by merging traffic flows from various tasks of the application(s). To cope with this heterogeneity of traffic workloads and computing resources, the NoC can be configured heterogeneously for energy efficiency and high performance. Heterogeneous NoC configurations include V/F-scaling of routers and links. Prior research has shown that dynamic voltage and frequency scaling (DVFS) can reduce the dynamic energy consumption of NoC routers and/or links at run-time [33], [35]. Many works have effectively applied voltage/frequency (V/F) scaling on on-chip systems and networks for energy reduction [4], [11], [22], [51], [53]. With DVFS, the supply voltage is increased during high NoC traffic to meet the NoC performance requirements, while the supply voltage needs to be decreased during low traffic for energy savings. Because of the large and complex NoC design space (V/F-levels, heterogeneity of cores and routers, multicores, topology, task-core mapping, etc.), reactive solutions using heuristics or linear optimization solvers are not effective at run-time due to the high time complexity of exploring many designs and configuration parameters for optimal/near-optimal solutions. Large problem instances cannot be computed using linear programming optimizers (e.g., integer linear programming solvers) for optimal solutions, since they fail to compute results in time for run-time configuration decisions. Heuristics may not cover all possible cases under heterogeneous tasks (and traffic) and heterogeneous resources. Therefore, reactive or ad-hoc resource configuration may not be an effective technique for preventing problems, such as creating thermal hotspots and exceeding the chip power budget, from happening in adaptive systems. For example, reactive/slow solutions may not increase or decrease V/F-levels properly to meet the demands of the applications while providing both a high-performance and an energy-efficient solution. Machine learning (ML) can help by predicting NoC resource requirements for different applications in advance (before they are needed) and configuring the NoC accordingly to minimize power and thermal variations while avoiding any loss in performance.

In this paper, we propose the use of deep reinforcement learning (Deep RL) to dynamically configure NoC resources based on precise system utilization and application demands. An RL agent is used to monitor the states (in terms of features) and state transitions, and to evaluate the reward of each NoC configuration action. The RL agent reinforces the positive or negative rewards of the taken decision, which helps the system learn to take proper actions and achieve the optimization objectives of energy efficiency and high performance. We take advantage of neural networks (NNs) to learn the patterns in application demands and approximate the NoC configuration decisions accordingly. The major contributions of this work are outlined below:

• Self-configurable NoC using Reinforcement Learning: RL agents are trained to automatically configure the NoC resources (routers and links) to maximize NoC performance while ensuring energy efficiency. The V/F-levels of the NoC routers and links are dynamically configured at run-time based on the application demands using RL, while a global energy budget constraint is used to limit the energy consumption of NoC resources. RL agents select an exploration (random-action) or exploitation (best-action) policy with random probability ε to approach the globally optimal solution.
• Distributed Reinforcement Learning Agents and Neural Network Approximator: Distributed RL techniques for NoC configuration are proposed and implemented to make the proposed ML-enabled technique feasible for large-scale NoC-based systems with reasonable hardware overhead (for ML). NNs are used to approximate the actions of the RL agents based on the learned state (features), since a traditional table-based learning approach is not feasible because of the need for large state-action mapping tables and the high convergence time for learning.
• Evaluation on Real and Synthetic Benchmarks: The proposed approach is evaluated using both large-scale (256-core) and small-scale (16-core) architectures on a full-system simulator. Both real benchmarks (with large and small applications) and synthetic traffic are used to evaluate the proposed approach against an existing non-machine-learning (non-ML) based solution. Simulation results under real (COSMIC and E3S) and synthetic benchmarks show that the proposed approach, on average, improves latency by 25% and improves energy and throughput by 6% (improving EDP by 30-40%) compared to the non-ML based solution, and improves EDP by 8% compared to an RL-based solution [38].

The paper is organized as follows. We discuss the related work on NoC configuration in Section II. The self-configurable NoC configuration strategy using RL and an NN approximator is presented in Section III. Simulation results are presented in Section IV.

II. RELATED WORK
ML techniques, such as RL, regressions, and NNs, have been proposed for design and optimization challenges, including energy and performance, of multi/many-core systems and on-chip networks. [31] proposes an automated data-driven framework to quickly configure and design manycore systems for a wide range of applications and operating scenarios. The authors proposed to use ML for both design-time and run-time decisions to create fully adaptive systems.

A few existing works using RL for NoC optimization and configuration are discussed here. [49] used distributed RL to simultaneously optimize the performance, energy efficiency, and reliability of the NoC in manycore systems, where each router independently takes its decisions. [38] proposed RL to configure NoC link bandwidths dynamically by scaling V/F-levels for energy savings. Multiple wires on a link (between routers) are activated or deactivated to configure the required link width based on the dynamic communication requirements between the tasks of an application. [52] presented a deep RL approach for efficient NoC arbitration. The proposed self-learning decision-making mechanism reduces packet latency, which results in improved NoC throughput. In [25], the authors proposed cooperative multi-agent RL-based co-optimization techniques to jointly perform different performance and power optimizations involving cache partitioning and DVFS of the NoC, cores, and caches.

In addition to RL, other ML techniques (e.g., decision trees) have also been proposed for NoC optimization and configuration. [30] proposed an imitation learning (IL) based methodology for dynamic V/F-island (cluster of nodes/links) control in manycore systems. [19] leveraged decision trees to predict and mitigate errors before a fault affects NoC-based systems. The proposed decision tree model achieves a reduction in packet re-transmissions and energy savings. [27] presented an NN-based intelligent hotspot prediction mechanism that was used with a congestion-control mechanism to handle hotspot formation efficiently. [39] proposed run-time predictive configuration of node voltage-levels and link widths of the NoC using NNs for energy savings while addressing both power and temperature constraints of manycore NoC. [32] proposed an NN-based predictive routing algorithm for NoC which uses network state and congestion information to estimate routing costs and perform low-latency routing of traffic. [13] proposed learning-enabled energy-aware dynamic V/F scaling for NoC architectures. The proposed work relies on an offline-trained regression model and provides a wide variety of V/F pairs. [20] extended this work by adopting RL to select the DVFS mode directly, which removes the need for labelling in linear regression. Some works focus only on the core resources of multi/many-core systems instead of the uncore (including NoC) resources. [46] proposed an RL-enabled online power management technique that learns the power management policy giving the minimum power consumption for a given performance constraint, without any prior information about the workload. [21] proposed RL-based I/O management for energy-efficient communication between a manycore processor and memory, instead of transmitting data with a fixed large voltage swing. [12] presented an online distributed RL-based DVFS control algorithm for manycore systems under power constraints. A per-core RL method is used to learn the optimal control policy of the V/F-levels in the system, and at a coarser grain, an efficient global power budget reallocation algorithm is used to maximize the overall performance. [15] proposed ML to intelligently explore the design space of 3D NoCs to optimize the placement of both planar and vertical communication links for energy efficiency.

[26] developed a learning-based framework using lasso regression to enable fast and accurate transient thermal prediction in chip multiprocessors.

However, unlike previous works that focus mostly on small-scale NoCs, this work (which is an extended version of [37]) provides solutions for both small-scale and large-scale NoCs using deep RL techniques. Furthermore, this work improves both energy and performance using an online RL technique that predicts the V/F-levels of the NoC routers and links depending on the computation and communication requirements of the various applications running in multi/many-core systems. Besides performance improvement and energy efficiency, the main advantage of RL is its self-adaptive nature, configuring the NoC in an instant to meet the real-time requirements of the application(s), whereas supervised ML techniques need complete retraining to adapt to changes in systems/applications, and traditional non-ML algorithms and optimization solvers fail to provide solutions within a reasonable time (because of their high time complexity).

III. DYNAMIC NOC CONFIGURATION USING DEEP REINFORCEMENT LEARNING
In this work, model-free RL, namely Q-learning, is adopted. Q-learning directly estimates the optimal Q-value of each action in each state, from which a policy is derived (instead of learning a model of the environment). When applications are running in the NoC-based multi/many-core system, RL is used to configure the NoC V/F-levels dynamically based on the communication demands of the tasks to improve NoC performance. We have selected four different voltage levels for NoC configuration: 0.8V, 0.9V, 1.0V, and 1.1V. We limited the number of voltage levels because too many levels would impose a high overhead on the voltage regulator(s). However, we chose voltage levels that meet the energy (power) budget constraint. With each voltage (V) level, a frequency (F) is also selected to form a V/F pair. The following V/F pairs are used in this work: 0.8 V/1 GHz, 0.9 V/1.5 GHz, 1.0 V/2 GHz, and 1.1 V/2.5 GHz. RL agents are trained to maximize performance, while a global energy budget constraint is used to limit the V/F-levels of NoC resources.

A. REINFORCEMENT LEARNING AND Q-LEARNING
In RL, training data is given as feedback to the program's actions in a dynamic environment. Feedback is given in the form of rewards and punishments to reinforce the actions, such as scaling V/F-levels and link bandwidth in the NoC. RL is well suited to intelligent decisions in autonomic and dynamic systems for the following reasons [45]: firstly, an autonomic system learns (using RL) what actions to take to maximize the long-term reward from a specific state. Secondly, RL properly treats dynamic phenomena, as it can take into account the delayed consequences (decisions and states) of the current action in the environment. Because of this autonomic property, RL is very effective at handling run-time changes (link/router faults, changes in application demands, etc.) by interacting with the system, observing the costs, and then optimizing the system. Another advantage of RL is that it does not need labelling of the training data (output labels), which is required in supervised learning (e.g., NNs, regressions). Therefore, RL is adopted in this work to dynamically configure V/F-levels depending on the application demands to maximize the performance and energy efficiency of the NoC. Our proposed approach applies online learning because it allows the algorithm to learn as data becomes available instead of learning from a static data set. Performance and utilization data are collected in each interval after taking an action, and that data is used for training the RL agent.

Q-learning is used as the RL technique for finding the optimal policy for selecting NoC V/F-level configuration actions. The core of Q-learning is a simple value iteration update, using an iterative algorithm, where the Q-value is calculated as the weighted average of the old value and the new information. The Q-value of taking action a in state s at the current time step t is denoted Q(s_t, a_t). Q-learning does not require building explicit representations of the state transitions and the expected reward to estimate Q-values. This helps the system because, initially, the system is not aware of the state transition probabilities or the rewards. Q-learning applies incremental updates with the current state, the next state, the action, and the immediate reward to approximate the new Q-value. That is why Q-learning is an efficient algorithm, especially for large-scale environments. Based on the NoC state s_t (utilization of NoC resources) at the current time step t, there may be several possible V/F actions to take. An RL agent chooses the action a that has the highest (currently estimated) Q-value among all possible actions (or chooses a random action with some probability). After taking the action, the agent transitions to a new state s_{t+1} (new utilization values of NoC resources), while the NoC environment provides a reward r_t (to maximize NoC performance and energy efficiency). The Q-learning algorithm tries to maximize the expected cumulative reward achievable from a given state-action pair (s_t, a_t), and it can approximate the optimal solution of the Bellman equation using the following iterative update for the Q-value Q(s_t, a_t), shown in equation 1:

Q(s_t, a_t) = (1 − α) · Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a))   (1)

where r_t is the reward observed for the current state s_t, α is the learning rate (0 ≤ α ≤ 1), and max_a Q(s_{t+1}, a) is the maximum Q-value over all possible actions a in the next state s_{t+1}. The learning rate α determines the importance of new experience compared to past experience: a factor of 0 makes the agent learn nothing from new experience, while a factor of 1 makes the agent consider only the most recent information. The discount factor γ determines the importance of future rewards: a factor of 0 makes the agent consider only current rewards, while a factor approaching 1 makes it strive for long-term reward.
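For illustration, the update of equation (1) can be written directly as code. The following sketch is ours and not part of the proposed implementation; the state encoding, the discount factor value of 0.9, and the variable names are assumptions (the learning rate of 0.5 and the four V/F-level actions are taken from later sections).

```python
# Minimal tabular form of the Q-learning update in equation (1).
# Hypothetical encoding: 4 V/F-level actions; states are discretized
# (flit count, buffer utilization, link utilization) tuples.
from collections import defaultdict

NUM_ACTIONS = 4                 # 0.8V/1GHz, 0.9V/1.5GHz, 1.0V/2GHz, 1.1V/2.5GHz
Q = defaultdict(float)          # Q[(state, action)] -> estimated return

def q_update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    best_next = max(Q[(next_state, a)] for a in range(NUM_ACTIONS))
    Q[(state, action)] = ((1 - alpha) * Q[(state, action)]
                          + alpha * (reward + gamma * best_next))
```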

B. DEEP REINFORCEMENT LEARNING
The major disadvantage of tabular RL is that the system needs to maintain a mapping table for states, actions, and rewards, and this table grows exponentially with the problem size in multi/many-core systems. Traditional Q-learning uses a table, the Q-table, to store the Q-value Q(s, a) for each state-action pair, as shown in Figure 2. As the NoC features have continuous values, the state-action space and the corresponding Q-table can be large. For a large-scale NoC with many resources (cores/routers/links), the system needs a large number of Q-tables, and the overall size and cost of the Q-tables would be extremely large. This state-action mapping table challenge of RL can be addressed by an approximate function of states, actions, and rewards. In this work, NNs are used to approximate the Q-function Q(s, a) that estimates the future return of taking action a from state s. NN function approximation removes the need for a large state-action mapping table, which makes the proposed work scalable to large-scale systems. Given a state s, an NN can output a vector of approximated Q-values, one for each possible action a. Then, the action with the highest Q-value is chosen, or a random action is chosen with a small probability. This technique of combining RL and NNs is called deep RL. Deep RL solutions have made many breakthroughs, for example achieving human-level performance in AlphaGo [43] and Atari [36] games. The significance of these deep RL contributions motivated us to apply deep RL to solve NoC optimization and configuration issues in multi/many-core systems.

FIGURE 2: Q-Table: State (s), Action (a), and Q-value (Q(s,a)).

NNs are used to approximate the state-action mapping table in an RL agent, as NNs have shown significant advantages in many domains such as image processing, speech recognition, and machine translation. The NNs are used to discover the patterns in the NoC statistics (collected during current and past experience) and predict the NoC V/F configuration actions with corresponding values (Q-values) for the next period. With NNs, we can easily extend the number of V/F-levels by just adding an additional output neuron in the output layer of the NN. An RL agent takes the V/F configuration decision with the maximum Q-value, which optimizes the latency and throughput of the NoC. With a small probability, an RL agent also chooses a random action (instead of the best action) to avoid local optima and so approach the globally optimal solution.

The overall framework for dynamic NoC configuration using deep RL is shown in Figure 3. In the NoC configuration framework, an RL agent is integrated with each router for V/F-level configuration decisions. The distributed RL agents collect NoC statistics at a fixed interval and configure the routers and the links connected to them. Though RL agents take local decisions, their target is to improve the global NoC performance by interacting with the NoC environment, as an RL agent checks the impact of its actions by evaluating NoC latency and power consumption. RL agents learn the best actions over time as they train the NNs. The backpropagation algorithm [41] is used to train the NNs, learning their parameters to predict the decision. A gradient descent approach is used to propagate the prediction errors and learn the parameters of the NNs, improving the Q-value predictions. Upon receiving a decision from the RL agent, the DVFS controller, with the help of the voltage regulator, selects the appropriate V/F-level. Synchronization is needed between two routers as their V/F-levels may differ.

FIGURE 3: NoC configuration framework using Deep RL.

C. COMPONENTS OF THE DEEP RL MODEL FOR NOC CONFIGURATION
An RL model contains state, action, environment, and reward components. The deep RL model additionally contains an NN component. The components of the proposed deep RL model for the self-configurable NoC are shown in Figure 4 and are described below.

FIGURE 4: Deep RL model components and flow for the self-configurable NoC.

1) Environment (Architecture)
The environment of the NoC configuration framework consists of routers, links, and processing cores. The environment generates the reward, in terms of NoC latency and power consumption, for the NoC V/F-level configuration decision taken by an RL agent (at a router), depending on the current state (discussed in the next section) of the router and its associated links.

2) State (Agent's view of the architecture)
A vector of features is defined as the system state. Feature selection is very important for an RL model: if the features do not correlate with the action (the NoC configuration decision), then the RL algorithm chosen to create the model will not be able to predict the outcome. Therefore, a great deal of thought is put into the features selected for the RL model. The selected features are the utilization and capacity of the NoC resources: routers and links. The flit count in the buffers of the router and on the links is used to calculate router and link utilization. Three features are used in a state vector: flit count, % of buffer utilization, and % of link utilization. The V/F-level configuration decision is taken based on the utilization and capacity of the link and router resources. The number of features is kept to a minimum in order to decrease the time needed to calculate the predicted Q-values at run-time. The features are also selected using information already present at each router. We collect performance data to evaluate the performance of the model and energy consumption data to evaluate whether the decisions from the RL model meet the global energy budget constraint.

3) Action (Resource configuration)
The agent takes online decisions to configure the V/F-level settings of the NoC routers and links based on the utilization of NoC resources. As mentioned before, the following V/F-level pairs are used in this RL model: 0.8 V/1 GHz, 0.9 V/1.5 GHz, 1.0 V/2 GHz, and 1.1 V/2.5 GHz. However, the V/F-level pairs are configurable based on the available V/F-levels. The agent generates a vector of Q-values, one for each V/F-level configuration action. Figure 5 shows the Q-values for the four different V/F configurations based on the router and link features at a router port. The NN approximates the Q-values for the different V/F configurations, and the RL agent generally selects the V/F configuration with the maximum Q-value, which is the best action to maximize NoC performance. However, in this work, an RL agent selects the exploration or exploitation policy with random probability ε to avoid local optima. The ε parameter controls the amount of exploration vs. exploitation. In the exploitation phase, with probability 1 − ε, an RL agent selects the V/F-level configuration action with the maximum Q-value from the trained NNs. In the exploration phase, with probability ε, an RL agent takes a random decision (instead of the best action from the trained NNs) to reduce the probability of getting stuck in a local optimum. For example, for an ε value of 0.1, an RL agent takes a random decision with 10% probability or takes the best Q-value action with 90% probability. The value of ε can be tuned based on the performance of the agent and the demands of the application(s). The ε value should be set to a higher value (e.g., 0.5, or 50%) initially, as an RL agent needs to explore more to learn the impact of its actions while its knowledge is still incomplete. The ε value can be reduced over time as the RL agent becomes more skilled at taking correct decisions by learning the actions and their corresponding energy and performance impact on the NoC environment. Because of this exploration capability, the proposed RL agent selects the optimal action almost all the time after a certain number of training steps. In this work, we stop training an agent after convergence is achieved, as the agent then correctly predicts the correct actions.

FIGURE 5: NoC features and actions for voltage-level configuration using neural networks.
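For concreteness, the state and action spaces described in 2) and 3) could be encoded as below. This is an illustrative sketch (the dataclass and field names are ours, not the simulator's), using the V/F pairs listed in Section III.

```python
from dataclasses import dataclass

# Four actions, one per V/F pair used in this work: (voltage in volts, frequency in GHz).
V_F_LEVELS = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5)]

@dataclass
class RouterState:
    """Per-router state vector: the three features collected each interval."""
    flit_count: int
    buffer_utilization: float   # fraction of buffers active in the last interval
    link_utilization: float     # fraction of links active in the last interval

    def as_vector(self):
        return [float(self.flit_count), self.buffer_utilization, self.link_utilization]
```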

4) Reward Function
An RL agent gets feedback for its action from the NoC environment in terms of a positive or negative reward. Depending on the V/F-level configuration action in the current state, the environment generates a reward, based on the optimization target, and sends it to the agent. The reward determines how good the taken action is, and the Q-learning algorithm is designed to maximize the long-term reward. In this run-time configuration context, a reward function is used to maximize NoC performance and energy efficiency. The product of latency and power is used in the reward function to minimize both the latency and the power consumption of the NoC, as shown in equation 2. A negative reward is used to reduce latency and power consumption: less negative values mean a higher-performance and more energy-efficient NoC. Higher values of latency and/or power make the reward more negative, and the RL agent tries to make the reward less negative.

reward = −latency ∗ power   (2)

where the unit of the reward is joules, as the units of power and latency are watts and (nano)seconds, respectively.
5) Neural Networks for Function Approximation
Since the state attributes (e.g., buffer and link utilization) are continuous numbers, the number of states can be infinite, which leads to large state-action mapping tables and high convergence times for Q-learning. NNs are used for action-value function approximation in RL to solve these problems (large state-action mapping tables and high convergence time) of the traditional Q-learning solution. An NN provides a mechanism for generalizing training experience across states, so that it is no longer necessary to keep a state-action mapping table for the RL agent's decisions. The NN approximates the vector of Q-value scores for the possible V/F-level configuration actions, and the RL agent selects the action with the maximum score to maximize NoC performance. The proposed deep RL approach uses a fully connected NN with three layers: an input layer with three neurons (for the three features), a hidden layer with 8 neurons, and an output layer with 4 neurons. Three layers and 8 hidden neurons were chosen empirically, as they provide sufficient accuracy with low hardware overhead. The sigmoid function is used at the neurons of the hidden layer, where the activation value ranges from 0 to 1. The ReLU activation function is used at the output layer neurons, as Q-values can be greater than 1 (for the corresponding V/F-levels). The layers of the NN are computed in sequential order; for example, the first layer must gather the features and compute its values before the next layer can be computed. Within each layer, however, all units and operations can be parallelized, which increases the speed of prediction by the RL agent. Because of this parallel nature of the operations, the computation time in NNs is reduced significantly.
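A minimal sketch of the forward pass of this 3-8-4 approximator is shown below. It is our illustration with random placeholder weights; in the proposed framework the weights are learned online, and the inputs are the three router features.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.1, (8, 3)), np.zeros(8)   # input (3) -> hidden (8)
W2, b2 = rng.normal(0.0, 0.1, (4, 8)), np.zeros(4)   # hidden (8) -> output (4)

def q_values(state_vector):
    """Approximate Q-values: sigmoid hidden layer, ReLU output layer."""
    x = np.asarray(state_vector, dtype=float)          # [flit count, buffer %, link %]
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))      # sigmoid: activations in (0, 1)
    return np.maximum(0.0, W2 @ hidden + b2)           # ReLU: Q-values can exceed 1
```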
The NNs are trained to approximate the Q-values of actions based on the current state (features), past learning experience, and the target optimization objective (as reflected in the reward function) for NoC configuration. The θ parameters (weights) determine the prediction of the NN from the input parameters taken from the NoC statistics. An NN is trained with the input state (old state) and target Q-value data collected through experience to learn the θ parameters. The backpropagation algorithm [41], using the gradient descent method, is used to learn the θ parameters. The target Q-values are calculated using the Q-learning equation (as shown in equation 1). The backpropagation algorithm works by computing the gradient of the loss function shown in equation 3, which is the squared error between the predicted Q-value Q and the target Q-value Q′.

loss_Q = (r + γ · max_{a_{t+1}} Q′(s_{t+1}, a_{t+1}) − Q(s_t, a_t))²   (3)

In each training step, the backpropagation algorithm computes the gradient with respect to the prediction network's θ parameters. The backpropagation algorithm computes the gradient one layer at a time and iterates backward from the last layer (based on the feedback from the target Q-value) to update and learn the θ parameters of the NN. The goal of learning the θ parameters is to correctly approximate the Q-value. The use of NNs (instead of a state-action mapping table) makes the RL agent robust to the different demands of the applications because of the generalization property of NNs.

D. STABILITY OF REINFORCEMENT LEARNING AGENT
RL suffers from instability when the Q-value function is approximated with a non-linear supervised learning model, such as logistic regression or NNs [36]. The following techniques are adopted to improve the stability of (i.e., reduce the fluctuation in) the predictions of the RL agent.

1) Experience Replay for Deep RL Stability
Similarity between subsequent training samples can lead the NN generalization into local minima because of the correlation between the current approximate action-value and the target action-value. Experience replay breaks up this correlation in the data to stabilize the deep RL technique [34]. Instead of training the RL agent on state-action pairs as they occur during simulation or actual experience, the system stores the data discovered during state transitions (state, action, reward, next state) in a table with a limited number of entries (to reduce hardware overhead). The system then randomly selects experience data from the table to provide decorrelated data for training the NNs. Experience replay also allows the model to learn from past experience multiple times, as the same data can be selected repeatedly. This learning strategy leads to faster convergence. In the experience replay table, the oldest data is overwritten by new data (as the size of the table is kept limited to reduce hardware overhead).
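The replay table described above can be sketched as a small bounded buffer with uniform random sampling. This is an illustrative snippet rather than the paper's implementation; the capacity of roughly 200 entries and the mini-batch size of 50 are taken from the text.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded experience table: oldest entries are overwritten when full."""
    def __init__(self, capacity=200):
        self.entries = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.entries.append((state, action, reward, next_state))

    def sample(self, batch_size=50):
        # uniform random mini-batch to decorrelate consecutive experiences
        return random.sample(list(self.entries), min(batch_size, len(self.entries)))
```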
2) Separate Target Network for Deep RL Stability
The second modification to online deep RL learning is to use a separate NN for generating the target Q-values in the Q-learning update. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases the current state's Q-value also increases the target Q-value, possibly leading to oscillations of the policy. The target and predictor NNs are trained at different intervals: the predictor network parameters are updated in each training step, while the target network parameters are updated only periodically.

For example, we set the training intervals for the predictor and target NNs to 50 cycles and 100 cycles (in simulation), respectively. This policy adds a delay between the target and predictor network parameter updates, and this delay reduces the impact of the predictor network on the target network parameters, which in turn reduces divergence in the approximation (using NNs). Generating the targets using an older set of parameters adds a delay between the time an update to the predicted Q-value is made (each training step) and the time the update affects the target Q-value (periodically), making divergence or oscillations much more unlikely.
ALGORITHM FOR NOC CONFIGURATION using action in previous step;
Step 4:
The simplified pseudocode of the deep RL algorithm for NoC Observe reward (-latency*power) of current action and
find next state (NoC utilization);
configuration is shown in Algorithm 1. Firstly, the algorithm Store experience (state, action, reward, next state) into the
initializes both NN predictor and target NN using random Experience Replay buffer;
weights so that predictor can take action in its initial periods. Sample the random records from the replay memory to
Based on the current state of a router (from the router statis- form mini-batches;
tics), RL agent takes V/F-level configuration decision for the Step 5:
Calculate the predicted Q-values using predictor
router and associated links for the next period. RL agent uses Q-network for different V/F actions in the current state,
exploration or exploitation phase to take the decision depend- Qcurrent_predictors ;
ing on the random probability, . In the exploration phase, Calculate the target Q-values using target Q-network for
RL agent takes random decision to reduce the probability different V/F actions (a0 ) in the next state, Qnext_state ;
of local optimal solution (to get global optimal solution). Set Q-values targets as
ycurrent_targets = reward + γmaxa0 Qnext_state ;
In the exploitation phase, RL agent uses the trained NNs to Train and update the connection weights of the predictor
select the V/F-level configuration action with the maximum Q-network using the square loss of calculated predicted
Q-value. RL agent considers router and all the associated Q-values (from predictor Q-network) and target Q-values
links of a router at a time for V/F-level configurations. (from target Q-network) using sample data from replay
memory, (ycurrent_targets − Qcurrent_predictors )2 ;
Because of the cost of training in terms of computation Step 6:
and time, we train the predictor NN and target NN in fixed if cycle_count == target_interval then
Update the NN connection weights of the target
intervals instead of per cycle training. As mentioned be-
Q-network as the same as the predictor Q-network
fore, training intervals are different for predictor and tar- for the next period;
get networks to improve the deep RL stability. We set the end
training intervals for predictor and target NNs to 50 cy- cycle_count = cycle_count + predictor_interval;
end
cles and 100 cycles, respectively. Q-values for current state,
Qcurrent_predictors , are predicted by the predictor NN. Q-
values targets, ycurrent_targets , are set by using the predicted RL Agent for Router i RL Agent for Router j
Q-values from target Q-network and the reward from the
environment. Then training for predictor network to update
connection weights is done by using the square loss of the 12 13 14 15
RL Agent
calculated Q-values from predictor and target Q-networks,
Core Core
(ycurrent_targets − Qcurrent_predictors )2 . For training, we
8 9 10 11
sample random records from the experience replay table to
form mini-batches and then use gradient descent approach Core Core
to back propagate this error to the hidden layer to tune their 4 5 6 7
Router
weights. The mini-batch size is set to 50 in this work. Target
Q-network parameters (connection weights of NNs) are set
to same value as predictor Q-network, after a fixed interval 0 1 2 3 Link
(set as target Q-network training interval). The algorithm
runs until the maximum number of simulation cycles reached FIGURE 6: Distributed RL-Agents based Optimization in a
or packet count per node exceeds the maximum number of 4 × 4 CMesh NoC based 64-Core System.
packets per node threshold.
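A condensed, runnable skeleton of the per-router loop in Algorithm 1 is given below. The stub functions are placeholders for the simulator hooks (Garnet statistics, DVFS controller) and the training routine; they are not part of the paper's implementation.

```python
import random

EPSILON, PREDICTOR_INTERVAL, MAX_CYCLES, NUM_ACTIONS = 0.1, 50, 1000, 4

# Placeholder hooks; in the framework these come from the simulator and the NNs.
def read_local_state(router):           return (0, 0.0, 0.0)    # Step 1
def best_action_from_nn(state):         return 0                # argmax of predicted Q-values
def apply_vf_level(router, action):     pass                    # Step 3 (DVFS controller)
def observe_reward_and_state(router):   return 0.0, (0, 0.0, 0.0)
def store_and_train(experience, cycle): pass                    # Steps 4-6

def run_router_agent(router):
    cycle = 0
    while cycle <= MAX_CYCLES:
        state = read_local_state(router)
        if random.random() < EPSILON:                            # Step 2: explore
            action = random.randrange(NUM_ACTIONS)
        else:                                                    # Step 2: exploit
            action = best_action_from_nn(state)
        apply_vf_level(router, action)
        reward, next_state = observe_reward_and_state(router)
        store_and_train((state, action, reward, next_state), cycle)
        cycle += PREDICTOR_INTERVAL
```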


F. DISTRIBUTED REINFORCEMENT LEARNING FOR LARGE-SCALE NOC
In a NoC-based manycore architecture, it is not feasible to implement a centralized RL (ML) monitoring and decision framework for all the NoC resources, due to the exponential increase in communication delay with the distance of the nodes from the centralized controller. The decision (dynamic configuration) delay increases because every node has to communicate with the centralized RL agent for its local NoC router and link configuration decisions. As the configuration decision delay increases, the decision could arrive too late to be effective in a real-time system, which could result in a lost opportunity to reduce energy consumption and/or improve NoC performance. Furthermore, a centralized decision maker can become a communication bottleneck in a manycore NoC, as many communications pass through that one controller node. A hotspot can form around the centralized controller, and packets passing through the hotspot area will be delayed, as the NoC resources (e.g., buffers and links) are already occupied by the packets in the hotspot region. This contention/delay problem in the centralized controller region can propagate throughout the NoC. A distributed RL technique is needed in a manycore NoC to reduce the problems mentioned above (high communication delay and hotspots) of the centralized RL technique. A distributed per-router RL agent framework in a 4 × 4 CMesh NoC is presented in Figure 6. In the CMesh NoC architecture, each router is connected to four cores, which results in 64 cores in total, as shown in Figure 6. In a router, an RL agent controls the configuration decisions of the router itself and all of its associated outgoing NoC links to meet the communication demands through that node. The RL agent computes the best voltage-level configuration decisions based on experience from previous historical data. Initially, the RL agents may provide only locally optimal solutions. With more training, the RL agents gain a better understanding of the impact of their actions on the overall NoC, as the reward value changes with the latency and energy consumption of the whole NoC. Because all features are local to the router, if more routers/cores were added to the network, the proposed deep RL enabled model could easily scale to the new architecture with no change in the algorithm.
G. HARDWARE OVERHEAD FOR DEEP REINFORCEMENT LEARNING
Each new feature for deep RL increases the arithmetic overhead. That is why the number of features for deep RL is kept small, to reduce hardware overhead. The proposed deep RL uses a fully connected NN with three layers. As the proposed approach uses three features for configuration decisions, it needs three (3) neurons in the input layer. Similarly, it needs four (4) output neurons in the output layer, as the proposed approach uses 4 V/F-levels as Q-values. We have empirically selected eight (8) hidden neurons, as 8 neurons provide good performance. For the input-layer-to-hidden-layer connections, the NN requires a total of 24 multiplies, 16 additions, and 8 comparisons. For the hidden-layer-to-output-layer connections, the NN requires a total of 32 multiplies, 28 additions, and 4 comparisons. This equates to a total of 56 multiplies, 44 additions, and 12 comparisons to gather the features, compute the state-action values, and select the voltage-level with the largest action value. In 45nm, the energy cost of a single 16-bit floating-point add is estimated to be 0.4 pJ and the area cost is 1360 um² [23]. The energy cost of a multiply is estimated to be 1.1 pJ and the area cost is 1640 um² in 45nm [23]. Therefore, for a single RL agent, the total energy overhead is 75.6 pJ and the total area overhead is 0.168 mm². As we use a separate target network for deep RL stability, the overall overhead is doubled, so the total energy overhead is 151.2 pJ and the total area overhead is 0.336 mm². Additionally, an RL agent needs a buffer to hold the historical data for experience replay; because of the limited number of entries (e.g., 200 entries) in the buffer, the overhead of the buffer is not significant. In the proposed distributed RL algorithm, a per-router RL agent decides the V/F action values based on the feature values (state) of the associated router and links of the NoC. Though the energy and area cost of a single RL agent is not significant, for a very large scale NoC, such as a 25 × 25 NoC, the overhead of many RL agents can be high. To address that, we propose a CMesh NoC, where a single RL agent at a router supports multiple cores in the NoC. The number of concentrated cores per router can be configured based on NoC size, performance, and the allowed hardware overhead. In this work, four cores are connected to a router. This CMesh NoC approach makes the overhead (both energy and area) of the distributed RL implementation feasible for large-scale computing systems.
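The operation counts above follow a simple per-layer convention (fan_in × n multiplies, (fan_in − 1) × n additions, and n comparisons per layer of n neurons); the small sketch below only reproduces that arithmetic and is not part of the paper.

```python
def layer_ops(fan_in, n_out):
    """Per-layer cost: multiplies, additions, comparisons (counting convention above)."""
    return fan_in * n_out, (fan_in - 1) * n_out, n_out

layers = [(3, 8), (8, 4)]   # the 3-8-4 network used by each RL agent
mults, adds, comps = (sum(col) for col in zip(*(layer_ops(i, o) for i, o in layers)))
print(mults, adds, comps)   # -> 56 44 12, matching the totals quoted in the text
```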
IV. SIMULATION AND RESULTS
Real-system experiments are carried out using the gem5 [7] and Garnet [3] platforms to evaluate NoC performance and energy efficiency. The following four V/F-levels are used for NoC configuration: 0.8 V/1 GHz, 0.9 V/1.5 GHz, 1.0 V/2 GHz, and 1.1 V/2.5 GHz. The DSENT [44] tool is integrated with gem5 to evaluate the energy/power consumption. The gem5 and DSENT configurations are shown in Table 1 and Table 2, respectively. The proposed deep RL algorithm is implemented and integrated in Garnet for dynamic configuration of the NoC. An RL agent module is added to each router in Garnet. The NN function approximator implementation consists of one input layer, one hidden layer, and one output layer. Sigmoid and ReLU activation functions are used for the hidden and output layers, respectively. We use three features at each router for V/F-level configuration prediction: flit count, % of buffer utilization, and % of link utilization. The buffer utilization is calculated as the number of active buffers in the previous interval divided by the total number of buffers at a router. Similarly, the link utilization is calculated as the number of active links in the previous interval divided by the total number of NoC links. A reward function equal to the negative product of latency and power is used to evaluate the V/F-level configuration actions, to maximize the performance and energy efficiency of the NoC.
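As a small illustration of the feature computation just described (the counter names are hypothetical; the actual values come from Garnet's per-interval statistics):

```python
def interval_features(flit_count, active_buffers, total_buffers, active_links, total_links):
    """Build the three-entry state vector from the previous interval's counters."""
    return [flit_count,
            active_buffers / total_buffers,   # % of buffer utilization at the router
            active_links / total_links]       # % of link utilization
```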

The ε value is set to 0.1, which means an RL agent takes a random action instead of the best action in 10% of cases. The ε value is not changed in this work, as changing ε did not significantly impact our results: the proposed RL agent selects the optimal action almost all the time after a certain number of training steps for an ε value of 0.1. The learning rate α is set to 0.5.

TABLE 1: gem5 configurations for 256-core CMesh NoC.
  Instruction Set Architecture (ISA): ALPHA
  CPU Frequency: 2 GHz
  Interconnection Network: GARNET
  No. of Virtual Networks (VN): 3
  No. of Virtual Channels (VC) per VN: 4
  No. of Buffers per VC: 4
  Network Size: 8 × 8 Mesh
  No. of Cores: 256
  No. of Routers: 64
  No. of Cores per Router: 4
  Routing: XY-Routing
  Flit Width: 128-bit
  Max. Packets per Node: 50000
  Simulation Cycles: 1000

TABLE 2: DSENT configurations for 256-core CMesh NoC.
  Technology: 22nm
  Operating Frequency: 1 GHz
  No. of Virtual Networks (VN): 3
  No. of Virtual Channels (VC) per VN: 4
  No. of Buffers per VC: 4
  Network Type: CMesh
  No. of Cores: 256
  No. of Routers: 64
  No. of Cores per Router: 4
  No. of Input Ports: 5
  No. of Output Ports: 5
  Flit Width: 128-bit
  Buffer Model: 128-bit
  Crossbar Model: Multiplexer
  Switch Allocator Model: MatrixArbiter

We evaluate the proposed deep RL based NoC configuration model using the communication-observant schedulable memory-inclusive computation (COSMIC) benchmark suite [50] and the embedded system synthesis benchmarks suite (E3S) [18]. For large-scale problems, we simulate 256 cores connected through an 8 × 8 CMesh architecture, where each router is connected to four computing nodes (cores), and run the large applications of the COSMIC benchmark. The COSMIC benchmark suite is used for large-scale problem simulation because it is based on real applications with a large number of tasks per application. The following five applications from the COSMIC benchmark are mapped in our simulations: face recognition, cifar, ultrasound imaging, Reed-Solomon code decoder (RS-Decoder), and Reed-Solomon code encoder (RS-Encoder). The face recognition (face), cifar (cfr), ultrasound (ultra), RS-Decoder (RSD), and RS-Encoder (RSE) applications have 33, 33, 526, 527, and 141 tasks, respectively. For small-scale problems, we simulate 16 cores connected through a 4 × 4 2D-Mesh NoC and run the applications of the E3S benchmark. The E3S benchmark comprises applications in the consumer (C), auto-industry (A), networking (N), telecom (T), and office-automation (O) domains. E3S applications also contain several sub-applications; for example, the telecom application has 8 sub-applications. In total, the C, A, N, T, and O applications have 12, 24, 13, 28, and 5 tasks, respectively. We also simulate complex multi-application systems by mixing multiple applications from the COSMIC and E3S benchmarks, e.g., FaceCfr for the face recognition and cifar applications and CN for the consumer and networking applications. Furthermore, we evaluate the proposed deep RL based NoC configuration model using the following eight synthetic traffic patterns: (i) Uniform Random (UniR): destinations are randomly selected with a uniform distribution; (ii) Bit Complement (BitC): each node sends messages only to the node corresponding to the 1's complement of its own address; (iii) Tornado: every node i sends a packet to node (i + 3) mod 8; (iv) Bit Rotation (BitRt): the destination is found by circularly shifting the bits of the source; (v) Neighbor (NBor): each node sends messages only to its neighbors; (vi) Shuffle (Shuf): the destination is calculated using the source address and the number of destinations; (vii) Bit Reverse (BitRv): the destination is found by reversing the bits of the source; (viii) Transpose (Trans): node (x, y) sends messages only to node (y, x).
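For illustration, destination generators for a few of these patterns could be written as below (our sketch with illustrative function names, not the simulator's traffic generator API; bits is the width of the node address).

```python
def bit_complement(src, bits):
    return (~src) & ((1 << bits) - 1)     # 1's complement of the source address

def tornado(src, n=8):
    return (src + 3) % n                  # node i sends to node (i + 3) mod 8

def bit_reverse(src, bits):
    return int(format(src, f"0{bits}b")[::-1], 2)   # reverse the address bits

def transpose(x, y):
    return (y, x)                         # node (x, y) sends to node (y, x)
```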
in our simulations: face recognition, cifar, ultrasound imag- balanced mapping solution in [40] for both ML and non-
ing, reed-solomon code decoder (RS-Decoder), and reed- ML solutions for fair comparisons. The total (global) energy
solomon code encoder (RS-Encoder). The face recognition budget constraint is configured depending on the demands of
(face), cifar (cfr), ultrasound (ultra), RS-Decoder (RSD), the applications in a benchmark. The global energy budget
and RS-Encoder (RSE) application have 33,33, 526, 527, constraint is set to 1.5 joule (J) for the COSMIC benchmark,
and 141 tasks, respectively. For small-scale problems, we and it (global energy budget constraint) is set to 50 mJ (milli-
have simulated 16-core connected through 4 × 4 2D-Mesh Joule) for both E3S benchmark and synthetic traffic patterns.
NoC, and simulated applications in the E3S benchmark. The Higher energy budget constraint is used for the COSMIC
E3S benchmark comprises applications in consumer (C), benchmark (compared to other benchmarks) because of the
autoindustry (A), networking (N), telecom (T) and office- larger number of tasks and demands of the applications.
automation (O). E3S applications also have several sub-
applications within themselves, for example, telecom appli- A. PERFORMANCE UNDER LARGE-SCALE NOC AND
cation has 8 sub-applications. In overall, C, A, N, T, and O APPLICATION SETTINGS
applications have 12, 24, 13, 28, and 5 tasks, respectively. Figure 7 shows the energy consumption for various applica-
We also simulate complex multi-application domain systems tions in the COSMIC benchmark for all techniques. Simula-
by mixing multiple applications in the COSMIC and E3S tion results for application in the COSMIC benchmark sug-
benchmarks, e.g., FaceCfr for face recognition and cifar gest that latency improvement is significant (throughput also
applications and CN for consumer and networking applica- improves) in the proposed self-configurable NoC solution,
tions. Furthermore, we evaluate the proposed deep RL based while minimizing the energy consumption in the system. The
NoC configuration model using the following eight synthetic proposed approach improves (reduces) energy consumption
traffic patterns: (i) Uniform Random (UniR): destinations by up to 9% (by 6% on average) compared to the non-ML
are randomly selected with a uniform distribution; (ii) Bit- based configuration solution. Energy consumption decreases
complement (BitC): each node sends messages only to the due to lower energy consumption in the NoC because of
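To make the synthetic traffic patterns above concrete, the listing below computes destinations for a few of them. The address width, the mesh coordinates, and the shift direction for Bit Rotation are assumptions made for illustration; the exact generators used by the simulator may differ in these details.

    # Illustrative Python sketch of synthetic-traffic destination functions
    def bit_complement(src, n_bits):
        # Bit-complement (BitC): destination is the 1's complement of the source address
        return (~src) & ((1 << n_bits) - 1)

    def tornado(src, k=8, offset=3):
        # Tornado: every node i sends a packet to node (i + 3) mod 8, as in the text
        return (src + offset) % k

    def bit_rotation(src, n_bits):
        # Bit Rotation (BitRt): circular shift of the source bits (right shift assumed here)
        return ((src >> 1) | ((src & 1) << (n_bits - 1))) & ((1 << n_bits) - 1)

    def bit_reverse(src, n_bits):
        # Bit Reverse (BitRv): destination is the source address with its bits reversed
        dst = 0
        for i in range(n_bits):
            dst = (dst << 1) | ((src >> i) & 1)
        return dst

    def transpose(x, y):
        # Transpose (Trans): node (x, y) sends messages only to node (y, x)
        return (y, x)

    # Example with 6-bit addresses (64 routers in the 8 × 8 CMesh):
    print(bit_complement(0b000101, 6))  # -> 58 (0b111010)
    print(tornado(5))                   # -> 0
    print(transpose(2, 7))              # -> (7, 2)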
We compared our solution with a non-ML based NoC configuration solution in [40]. In [40], system managers use a heuristic (non-ML) to increase or decrease the V/F-levels of the NoC resources depending on changes in application traffic. We have evaluated our NoC configuration framework using latency, throughput, and energy metrics. For the COSMIC and E3S benchmarks, applications are initially mapped to the NoC-based system using the same load-balanced mapping solution from [40] for both the ML and non-ML solutions for fair comparisons. The total (global) energy budget constraint is configured depending on the demands of the applications in a benchmark. The global energy budget constraint is set to 1.5 joule (J) for the COSMIC benchmark, and it is set to 50 mJ (millijoule) for both the E3S benchmark and the synthetic traffic patterns. A higher energy budget constraint is used for the COSMIC benchmark (compared to the other benchmarks) because of the larger number of tasks and the demands of its applications.

A. PERFORMANCE UNDER LARGE-SCALE NOC AND APPLICATION SETTINGS

FIGURE 7: Energy comparison under COSMIC Benchmarks in 256-core NoC. (Bar chart, Non-ML vs. Deep RL; y-axis: Energy (mJ).)
FIGURE 8: Latency comparison under COSMIC Benchmarks in 256-core NoC. (Bar chart, Non-ML vs. Deep RL; y-axis: Latency (ns).)
FIGURE 9: Throughput comparison under COSMIC Benchmarks in 256-core NoC. (Bar chart, Non-ML vs. Deep RL; y-axis: Throughput (Gbps).)

Figure 7 shows the energy consumption for the various applications in the COSMIC benchmark for all techniques. Simulation results for the applications in the COSMIC benchmark suggest that the latency improvement is significant (throughput also improves) in the proposed self-configurable NoC solution, while minimizing the energy consumption in the system. The proposed approach improves (reduces) energy consumption by up to 9% (by 6% on average) compared to the non-ML based configuration solution. Energy consumption decreases due to lower energy consumption in the NoC because of the proactive V/F-level configuration using RL to maximize reward, where the reward is configured to minimize energy (and latency). RL assigns the required V/F-levels by learning from the historical data. As the RL agents learn the policy to maximize the reward (product of latency and power), communication latency in the NoC improves significantly because of the lower queuing latency at the NoC routers. The proposed approach improves latency by up to 70% (by 25% on average) compared to the non-ML based configuration solution, as shown in Figure 8. Because of the reduction in latency, throughput in the proposed solution improves by 10% (on average) compared to the non-ML solution, as shown in Figure 9. Furthermore, we observe that the proposed approach improves energy, latency, and throughput even more for all the multiple-application mixes running in the system. This further indicates the effectiveness of the proposed approach for running multiple applications in manycore architectures.
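The reward described above couples latency and power so that a single scalar drives both objectives under the global energy budget. The listing below is one hedged interpretation of such a per-router reward signal: the product of latency and power is negated (so maximizing reward minimizes the latency-power product), and an extra penalty is applied when the measured global energy exceeds the configured budget (1.5 J for COSMIC, 50 mJ otherwise). The penalty weight and the exact functional form are illustrative assumptions, not the paper's exact formulation.

    # Illustrative Python sketch of a per-router reward with a global energy budget
    def router_reward(latency_ns, power_w, global_energy_j, energy_budget_j,
                      budget_penalty=10.0):
        reward = -(latency_ns * power_w)          # maximize reward = minimize latency x power
        if global_energy_j > energy_budget_j:     # global energy budget constraint
            reward -= budget_penalty * (global_energy_j - energy_budget_j)
        return reward

    # Example calls with the budgets reported in the text (other values are placeholders):
    print(router_reward(40.0, 0.8, global_energy_j=1.2, energy_budget_j=1.5))    # within budget
    print(router_reward(40.0, 0.8, global_energy_j=0.06, energy_budget_j=0.05))  # 50 mJ budget exceeded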
B. PERFORMANCE UNDER SMALL-SCALE NOC AND APPLICATION SETTINGS

FIGURE 10: Energy comparison under E3S Benchmarks in 16-core NoC. (Bar chart, Non-ML vs. Deep RL; y-axis: Energy (mJ).)
FIGURE 11: Latency comparison under E3S Benchmarks in 16-core NoC. (Bar chart, Non-ML vs. Deep RL; y-axis: Latency (ns).)
FIGURE 12: Throughput comparison under E3S Benchmarks in 16-core NoC. (Bar chart, Non-ML vs. Deep RL; y-axis: Throughput (Gbps).)

The E3S applications are simulated under the 16-core architecture connected through a 2D-Mesh NoC. Like the COSMIC benchmark results, the simulation results for the E3S benchmark suggest that the latency improvement is significant (throughput also improves) in the proposed self-configurable NoC solution, while minimizing the energy consumption in the system. As seen in Figure 10, the proposed configurable solution improves energy consumption by up to 14% (by 5% on average) compared to the non-ML solution. The energy improvement is higher because of the dynamic heterogeneous assignment of the required V/F-levels to NoC routers and links by the online RL solution (instead of assigning homogeneous maximum voltages). As the RL agents learn the policy to minimize latency, the proposed approach improves latency by up to 120% (by 34% on average) compared to the non-ML based configuration solution, as shown in Figure 11. Because of the reduction in flit transmission delay, throughput in the proposed solution improves by 6% on average compared to the non-ML solution, as shown in Figure 12.

Furthermore, the proposed RL enabled approach performs better for mixtures of applications in the E3S benchmark on all the metrics (energy, latency, and throughput). For example, the proposed approach improves latency by 120% for the combined autoindustry (A) and telecom (T) applications, while the improvement is 30%, on average, for the individual applications running separately. This indicates the feasibility of the proposed approach for manycore systems running multiple applications. For two single applications, networking (N) and office-automation (O), the proposed approach did not significantly improve (and in one case degrades) energy, latency, and throughput, mainly due to the lower number of tasks and communication traffic (and so less opportunity for improvement) in those applications.
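The heterogeneous V/F assignment mentioned above means that each agent can place its own router and the associated links at a different operating point instead of pinning everything to the maximum voltage. The listing below illustrates the idea with an assumed table of discrete V/F levels and a simple routine that applies an agent's chosen level to one router and its links; the specific voltage/frequency pairs and the data structures are placeholders, not values from the paper.

    # Illustrative Python sketch of per-router V/F-level assignment
    from dataclasses import dataclass, field

    VF_LEVELS = [          # assumed discrete operating points (volts, Hz)
        (0.7, 0.5e9),
        (0.8, 1.0e9),
        (0.9, 1.5e9),
        (1.0, 2.0e9),      # a homogeneous static baseline would keep every router here
    ]

    @dataclass
    class RouterConfig:
        router_id: int
        vf_level: int = len(VF_LEVELS) - 1                   # start at the maximum level
        link_vf_levels: dict = field(default_factory=dict)   # per-output-link levels

    def apply_action(cfg, action):
        # An agent's action selects one V/F level for its router and associated links
        cfg.vf_level = action
        for link in cfg.link_vf_levels:
            cfg.link_vf_levels[link] = action

    cfg = RouterConfig(router_id=5, link_vf_levels={"north": 3, "east": 3})
    apply_action(cfg, action=1)          # only this router and its links are lowered
    print(VF_LEVELS[cfg.vf_level])       # -> (0.8, 1000000000.0)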
C. PERFORMANCE UNDER SYNTHETIC TRAFFIC

FIGURE 13: Energy comparison under synthetic traffic in 16-core NoC. (Bar chart, Homogeneous (Static) vs. Reconfigurable (Deep RL); y-axis: Energy (mJ).)
FIGURE 14: Latency comparison under synthetic traffic in 16-core NoC. (Bar chart, Homogeneous (Static) vs. Reconfigurable (Deep RL); y-axis: Latency (ns).)
FIGURE 15: Throughput comparison under synthetic traffic in 16-core NoC. (Bar chart, Homogeneous (Static) vs. Reconfigurable (Deep RL); y-axis: Throughput (Gbps).)

For the synthetic traffic patterns, we compare the proposed deep RL based run-time heterogeneous NoC configuration with a static homogeneous NoC configuration solution in the 16-core architecture. We analyze whether RL can improve the energy, latency, and throughput even in a regular NoC (without the help of mapping). As seen in Figure 13, the proposed configurable NoC solution improves energy by up to 17% (by 8% on average) compared to the homogeneous static NoC solution. Energy consumption decreases because of the proactive resource configuration of the routers and links using RL based on the past historical traffic requirements and system utilization (e.g., link and buffer utilization). As the RL agents learn the policy to maximize the reward under constrained energy consumption, the proposed approach, on average, improves latency by 25% compared to the non-ML based configuration solution, as shown in Figure 14. Throughput remains almost the same or slightly improves (by 2%) in the configurable deep RL solution, as shown in Figure 15.

D. HOTSPOT REDUCTION

FIGURE 16: Router-wise energy distribution in 16-core NoC for CNA applications under E3S Benchmarks. (Per-router bar chart titled Energy Consumption; x-axis: Router 0-15; y-axis: Energy (mJ).)

This work uses a load-balanced mapping solution, adopted from [40], to evenly distribute the tasks among cores in the NoC to prevent hotspots. Hotspots increase the probability of several problems, including electromigration, chip burning, and faults. This work focuses only on communication energy consumption in the NoC. The energy consumption at a router is calculated by summing up the energy consumption on the router and on the associated communication links (to adjacent routers). Figure 16 shows the router-wise energy distribution in the 16-core NoC for a mixture of applications (consumer (C), networking (N), and auto-industry (A)) under the E3S benchmarks. The router-wise energy distribution shows that energy is almost uniformly distributed among the routers and the links, except for the edge/corner routers/links in the NoC. The balanced energy distribution reduces the possibility of hotspots in the NoC.

E. SCALABILITY

FIGURE 17: Latency for different NoC sizes under E3S Benchmarks. (Bar chart, 16-core / 64-core / 256-core; y-axis: Latency (ns).)
FIGURE 18: Throughput for different NoC sizes under E3S Benchmarks. (Bar chart, 16-core / 64-core / 256-core; y-axis: Throughput (Gbps).)

Latency and throughput results are calculated for the E3S benchmarks with the number of cores ranging from 64 to 256 (64, 100, 144, 196, and 256) to test the scalability of the proposed approach. Figure 17 shows that latency decreases with the increase in NoC resources, as packets face less contention (and less congestion) in NoC links and routers. Because of less congestion, NoC throughput increases with the addition of more NoC resources, as shown in Figure 18. The latency and throughput performance results demonstrate that the proposed DRL enabled self-configurable NoC approach is scalable.
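As a concrete illustration of the router-wise energy accounting used for the hotspot analysis above (Figure 16), the listing below sums a router's own energy with the energy of the communication links that connect it to adjacent routers in the 4 × 4 mesh. The data layout and numbers are placeholders; in the experiments these values come from the DSENT-based NoC energy model.

    # Illustrative Python sketch of per-router energy aggregation in a 4 x 4 mesh
    def neighbors(router_id, mesh_dim=4):
        # adjacent routers of a node in a mesh_dim x mesh_dim 2D mesh
        x, y = router_id % mesh_dim, router_id // mesh_dim
        nbrs = []
        if x > 0:
            nbrs.append(router_id - 1)
        if x < mesh_dim - 1:
            nbrs.append(router_id + 1)
        if y > 0:
            nbrs.append(router_id - mesh_dim)
        if y < mesh_dim - 1:
            nbrs.append(router_id + mesh_dim)
        return nbrs

    def router_energy(router_id, router_energy_mj, link_energy_mj):
        # energy at a router = its own energy + energy of its associated links
        links = sum(link_energy_mj.get((router_id, n), 0.0) for n in neighbors(router_id))
        return router_energy_mj[router_id] + links

    # Example with dummy values: an interior router has 4 links, a corner router only 2,
    # which is why edge/corner routers show lower energy in Figure 16.
    router_energy_mj = {i: 0.20 for i in range(16)}
    link_energy_mj = {(r, n): 0.03 for r in range(16) for n in neighbors(r)}
    print(router_energy(5, router_energy_mj, link_energy_mj))   # interior -> 0.32
    print(router_energy(0, router_energy_mj, link_energy_mj))   # corner   -> 0.26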
F. REWARDS

FIGURE 19: Rewards for a Distributed RL Agent. (Plot of reward over time; x-axis: Cycles (200-1000); y-axis: Rewards (approximately -50000 to 0).)

The reward for an RL agent is calculated by considering both latency (delay) and energy consumption to maximize both performance and energy efficiency. The proposed deep RL approach has distributed RL agents, and Figure 19 shows the reward values over time (in cycles) for a sample distributed RL agent. The reward value increases with time, as shown in Figure 19, which means that the RL agent is learning and becoming skillful as time goes on. After around 500 cycles, the reward value saturates and no longer changes, which means that the RL agent has become expert at making decisions. Similar patterns of reward change over time have been observed for the other distributed RL agents.

G. COMPARISON WITH AN RL WORK

FIGURE 20: EDP comparison with an RL Approach [38]. (Bar chart, RL [38] vs. Deep RL (this work); x-axis: Benchmarks (C, N, A, T, O, CN, CA, CT, NA, AT, NT, TO, CNA, Mean); y-axis: EDP.)

The proposed work has been compared with another ML work that configures the link bandwidths dynamically using RL for energy savings [38]. [38] changes the link width by activating or deactivating wires on links using distributed RL agents, where an agent stores Q-values for state-action mapping in a table, namely a Q-table. [38] is selected for comparison as the goal of [38] is to achieve an energy-efficient NoC, like this proposed work (the proposed work also targets high performance). We found that the proposed neural-network enabled deep RL approach performs better than the RL approach in [38] for NoC configuration in terms of both energy efficiency and latency (delay). As shown in Figure 20, the EDP in this proposed work is around 8% lower compared to that of [38]. The proposed work also performs better than [38] for all the applications under the E3S benchmarks. The major reason for the better EDP in this work is that the proposed solution addresses both performance (latency) and energy efficiency, whereas [38] focused mainly on energy efficiency.

H. NOC AREA OVERHEAD

The area overhead of the NoC is calculated using the DSENT tool. Because of the concentrated mesh implementation with 4 cores per router, a router has 4 local ports in the NoC. For connection with adjacent routers, a router has an additional 2 ports (corner nodes), 3 ports (edge nodes), or 4 ports, based on the position of the router in the mesh architecture. We calculate the NoC overhead using the number of buffers and ports at routers, the number of routers, and the bandwidths of the links. The hardware overhead for the RL agents is calculated in Sec. III-G. The NoC area overheads (excluding RL agents) for 16-core, 64-core, and 256-core NoCs are 4.51 mm², 18.02 mm², and 72.08 mm², respectively. The hardware overheads for the RL agents for 16-core, 64-core, and 256-core NoCs are 1.34 mm², 5.38 mm², and 21.50 mm², respectively. So, the total NoC area overheads (including RL agents) for 16-core, 64-core, and 256-core NoCs are 5.85 mm², 23.40 mm², and 93.59 mm². Therefore, the hardware overhead of the RL agents is about 23% of the total NoC area overhead.

The uniqueness of the proposed self-configurable NoC solution is that it maximizes both performance and energy efficiency while configuring the NoC instantly with the help of RL-based distributed intelligent agents.
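Since both the comparison in Figure 20 above and the summary that follows report results in terms of energy-delay product (EDP), a minimal sketch of how EDP and a relative EDP improvement can be computed is given below. The numbers in the example are illustrative only, not measured results from the paper.

    # Illustrative Python sketch of the EDP metric
    def edp(energy_j, delay_s):
        # energy-delay product: lower is better
        return energy_j * delay_s

    def edp_improvement_pct(baseline_edp, proposed_edp):
        return 100.0 * (baseline_edp - proposed_edp) / baseline_edp

    baseline = edp(energy_j=0.050, delay_s=80e-9)
    proposed = edp(energy_j=0.046, delay_s=60e-9)
    print(round(edp_improvement_pct(baseline, proposed), 1))   # -> 31.0 (illustrative)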
The proposed solution addresses the challenges of the reactive and slow solutions from heuristics (as in [40]) and optimization solvers (e.g., IBM CPLEX [2]). Because of their reactive/slow nature, the non-ML based approach [40] cannot increase or decrease V/F-levels efficiently enough to provide both a high-performance and an energy-efficient solution. Furthermore, as shown in the simulation results, the proposed solution does not try to maximize NoC performance by assigning maximum V/F-levels, since high power consumption and the corresponding high temperature, even in only one region of the NoC, can create problems in systems, including failure of electrical components [42]. The proposed solution improves EDP by 27%, 40%, and 35% for the COSMIC, E3S, and synthetic traffic benchmarks, respectively, compared to the corresponding non-ML based solution, and improves EDP by 8% compared to an RL-based solution [38]. Therefore, the proposed self-configurable NoC solution maximizes performance with limited power consumption to quickly provide an energy-efficient and high-performance solution.

V. CONCLUSION
We have proposed dynamic configuration of on-chip networks in multi/many-core computing systems using machine learning techniques for an energy-efficient and high-performance NoC. The NoC is configured proactively based on historical data using deep reinforcement learning, where distributed reinforcement learning agents take the voltage/frequency-level configuration actions for NoC routers and links using a neural-network function approximator. The use of neural networks in reinforcement learning and of distributed per-router reinforcement learning agents in the NoC makes the proposed approach feasible for large-scale systems. Simulation results under real and synthetic traffic demonstrate that the proposed machine learning enabled self-configurable NoC solution improves NoC performance significantly while maximizing energy efficiency. The proposed solution incurs low hardware overhead while providing a self-configurable NoC to meet the real-time requirements of multiple applications.

REFERENCES
[1] Cerebras Wafer Scale Engine, https://www.cerebras.net/product/.
[2] IBM CPLEX Optimization Studio, http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/.
[3] N. Agarwal, T. Krishna, L. S. Peh, and N. K. Jha. GARNET: A detailed on-chip network model inside a full-system simulator. In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 33–42, April 2009.
[4] Y. Bai, V. W. Lee, and E. Ipek. Voltage regulator efficiency aware power management. In Proc. ASPLOS, pages 825–838, April 2017.
[5] L. Benini and G. D. Micheli. Networks on chip: A new paradigm for systems on chip design. In Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 418–419, Nov. 2002.
[6] K. Bergman et al. Exascale computing study: Technology challenges in achieving exascale systems, 2008.
[7] N. Binkert et al. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Aug. 2011.
[8] B. Bohnenstiehl et al. KiloCore: A 32-nm 1000-processor computational array. IEEE Journal of Solid-State Circuits, 52(4):891–902, April 2017.
[9] S. Borkar. Exascale computing - a fact or a fiction? In Keynote at IEEE International Symposium on Parallel and Distributed Processing (IPDPS), May 2013.
[10] Y. Cai, C. Yang, W. Ma, and Y. Ao. Extreme-scale realistic stencil computations on Sunway TaihuLight with ten million cores. In Proceedings of 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 566–571, 2018.
[11] X. Chen, Z. Xu, H. Kim, P. V. Gratz, J. Hu, M. Kishinevsky, U. Ogras, and R. Ayoub. Dynamic voltage and frequency scaling for shared resources in multicore processor designs. In Proceedings of 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–7, 2013.
[12] Z. Chen and D. Marculescu. Distributed reinforcement learning for power limited many-core system performance optimization. In Proceedings of IEEE Design, Automation and Test in Europe (DATE), pages 1521–1526, March 2015.
[13] M. Clark, A. Kodi, R. Bunescu, and A. Louri. LEAD: Learning-enabled energy-aware dynamic voltage/frequency scaling in NoCs. In Proceedings of Design Automation Conference (DAC), pages 1–6, 2018.
[14] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proceedings of the 38th Design Automation Conference, pages 684–689, 2001.
[15] S. Das, J. R. Doppa, D. H. Kim, P. P. Pande, and K. Chakrabarty. Optimizing 3D NoC design for energy efficiency: A machine learning approach. In Proceedings of IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 705–712, Nov 2015.
[16] G. De Micheli, C. Seiculescu, S. Murali, L. Benini, F. Angiolini, and A. Pullini. Networks on chips: From research to products. In Proceedings of Design Automation Conference (DAC), pages 300–305, June 2010.
[17] R. H. Dennard et al. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5):256–268, Oct. 1974.
[18] R. Dick. Embedded system synthesis benchmarks suites (E3S), http://robertdick.org/tools.html.
[19] D. DiTomaso, T. Boraten, A. Kodi, and A. Louri. Dynamic error mitigation in NoCs using intelligent prediction techniques. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, Oct 2016.
[20] Q. Fettes, M. Clark, R. Bunescu, A. Karanth, and A. Louri. Dynamic voltage and frequency scaling in NoCs with supervised and reinforcement learning techniques. IEEE Transactions on Computers, 68(3):375–389, March 2019.
[21] H. Hantao, P. D. S. Manoj, D. Xu, H. Yu, and Z. Hao. Reinforcement learning based self-adaptive voltage-swing adjustment of 2.5D I/Os for many-core microprocessor and memory communication. In Proceedings of IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 224–229, Nov 2014.
[22] S. Herbert and D. Marculescu. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In Proceedings of ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), pages 38–43, Aug 2007.
[23] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In Proceedings of International Solid-State Circuits Virtual Conference (ISSCC), pages 10–14, Feb 2014.
[24] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar. A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro, 27(5):51–61, Sept. 2007.
[25] R. Jain, P. R. Panda, and S. Subramoney. Machine learned machines: Adaptive co-optimization of caches, cores, and on-chip network. In Proceedings of IEEE Design, Automation and Test in Europe (DATE), pages 253–256, March 2016.
[26] D. C. Juan, H. Zhou, D. Marculescu, and X. Li. A learning-based autoregressive model for fast transient thermal analysis of chip-multiprocessors. In Proceedings of ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC), pages 597–602, Jan 2012.
[27] E. Kakoulli, V. Soteriou, and T. Theocharides. Intelligent hotspot prediction for network-on-chip-based multicore systems. IEEE Trans. TCAD, 31(3):418–431, March 2012.
[28] M. Kas. Toward on-chip datacenters: A perspective on general trends and on-chip particulars. J. Supercomput., 62(1):214–226, Oct. 2012.
[29] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie. Quantifying the energy cost of data movement in scientific applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC), pages 56–65, 2013.
[30] R. G. Kim, W. Choi, Z. Chen, J. R. Doppa, P. P. Pande, D. Marculescu, and R. Marculescu. Imitation learning for dynamic VFI control in large-scale manycore systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(9):2458–2471, Sep. 2017.
[31] R. G. Kim, J. R. Doppa, P. P. Pande, D. Marculescu, and R. Marculescu. Machine learning and manycore systems design: A serendipitous symbiosis. Computer, 51(7):66–77, July 2018.
[32] M. A. Kinsy, S. Khadka, and M. Isakov. PreNoC: Neural network based predictive routing for network-on-chip architectures. In Proceedings of ACM International Conference on Great Lakes Symposium on VLSI (GLSVLSI), pages 65–70, 2017.
[33] L. Shang, L. Peh, and N. K. Jha. Power-efficient interconnection networks: Dynamic voltage scaling with links. IEEE Computer Architecture Letters, 1(1):6–6, 2002.
[34] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, May 1992.
[35] A. K. Mishra, R. Das, S. Eachempati, R. Iyer, N. Vijaykrishnan, and C. R. Das. A case for dynamic frequency tuning in on-chip networks. In Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 292–303, Dec 2009.
[36] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
[37] M. F. Reza. Deep reinforcement learning for self-configurable NoC. In Proceedings of IEEE International System-on-Chip Conference (SOCC), 2020.
[38] M. F. Reza. Reinforcement learning based dynamic link configuration for energy-efficient NoC. In Proceedings of IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2020.
[39] M. F. Reza, T. Le, B. Dey, M. Bayoumi, and D. Zhao. Neuro-NoC: Energy optimization in heterogeneous many-core NoC using neural networks in dark silicon era. In Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2018.
[40] M. F. Reza, D. Zhao, and M. Bayoumi. Power-thermal aware balanced task-resource co-allocation in heterogeneous many CPU-GPU cores NoC in dark silicon era. In Proceedings of 31st IEEE International System-on-Chip Conference (SOCC), 2018.
[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of Research, chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, 1988.
[42] L. Shang and R. Dick. Thermal crisis: Challenges and potential solutions. IEEE Potentials, 25(5):31–35, 2006.
[43] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan. 2016.
[44] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic. DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In Proceedings of IEEE/ACM International Symposium on Networks-on-Chip (NOCS), pages 201–210, May 2012.
[45] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[46] Y. Tan, W. Liu, and Q. Qiu. Adaptive power management using reinforcement learning. In Proceedings of IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 461–467, New York, NY, USA, 2009. ACM.
[47] M. B. Taylor. A landscape of the new dark silicon design regime. IEEE Micro, 33(5):8–19, Sept. 2013.
[48] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25–35, March 2002.
[49] K. Wang, A. Louri, A. Karanth, and R. Bunescu. IntelliNoC: A holistic design framework for energy-efficient and reliable on-chip communication for manycores. In Proceedings of International Symposium on Computer Architecture (ISCA), pages 589–600, 2019.
[50] Z. Wang et al. A case study on the communication and computation behaviors of real applications in NoC-based MPSoCs. In Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 480–485, July 2014.
[51] W. Kim, M. S. Gupta, G. Wei, and D. Brooks. System level analysis of fast, per-core DVFS using on-chip switching regulators. In Proceedings of IEEE 14th International Symposium on High Performance Computer Architecture, pages 123–134, 2008.
[52] J. Yin, Y. Eckert, S. Che, M. Oskin, and G. H. Loh. Toward more efficient NoC arbitration: A deep reinforcement learning approach. In Proceedings of International Workshop on AI-assisted Design for Architecture (AIDArc), held in conjunction with International Symposium on Computer Architecture (ISCA), June 2018.
[53] Y. Zhang, X. Hu, and D. Z. Chen. Task scheduling and voltage selection for energy minimization. In Proceedings of ACM/IEEE Design Automation Conference (DAC), pages 183–188, June 2002.

MD FARHADUR REZA is an Assistant Professor in the Department of Mathematics and Computer Science at Eastern Illinois University (EIU). Prior to joining EIU in Fall 2021, he was an Assistant Professor of Computer Science at the University of Central Missouri. He was a Postdoctoral Researcher at Virginia Tech and at George Washington University in 2019 and 2018, respectively. He has been serving as a Publication Co-Chair for IEEE SOCC since 2020 and has been working as a Publicity Co-Chair for IEEE COINS 2022. He has been serving on the Technical Program Committees of several conferences, including IEEE SOCC 2020-2022, IEEE COINS 2020-2022, and IEEE/ACM NoCArc 2021. He organized a special session at IEEE AICAS 2022. He serves as a track chair for the Internet-of-Things (IoT) track in IEEE COINS and served as a session chair at the IEEE SOCC and IEEE MWSCAS conferences. He serves on the Review Committees of the IEEE ISCAS and IEEE AICAS conferences. He has published papers in peer-reviewed journals and conferences, including ACM/IEEE NOCS, ACM GLSVLSI, and Elsevier MICPRO. He received the A. Richard Newton Young Fellow award from the ACM/IEEE Design Automation Conference (DAC) in 2014. He earned Ph.D. and M.Sc. degrees in computer science from the University of Louisiana at Lafayette in 2017 and 2014, respectively. He received an MBA degree from the Institute of Business Administration (IBA) at the University of Dhaka in 2011, and a B.Sc. degree in computer science and engineering from Bangladesh University of Engineering and Technology in 2005. He has 7 years of working experience in the telecom and software industries. His research interests include resource management, networks-on-chip, multi-core architectures, and machine learning/artificial intelligence. He is an IEEE member and an ACM member.