
MSc. Data Science (120 ECTS)

Reinforcement Learning (DLMAIRIL01)

WRITTEN ASSIGNMENT

Exploring Optimal Neural Network Architectures


What benefits does Reinforcement Learning offer?

Joël Kazadi

Matriculation Number: 9213934

Tutor: Prof. Dr. Janki Dodiya

Delivery date: May 17, 2025


Abstract

This research essay provides a comprehensive survey of Reinforcement Learning-based Neural Architecture Search (RL-NAS), elucidating how modeling architecture design as
a Markov Decision Process (MDP) enables automated, sequential decision-making over
complex search spaces. We begin by reviewing the foundational principles of RL, including
MDPs and policy-optimization techniques, before detailing the formulation of NAS as
an RL problem, with specific attention to controller architectures, action representations,
and composite reward functions. Through various case studies, we demonstrate dramatic
reductions in search time alongside consistent gains in accuracy on benchmark tasks. We then
analyze persistent challenges in RL-NAS, including sample inefficiency, search-space design
trade-offs, reproducibility concerns, and transferability limitations. Finally, we outline future
research perspectives in model-based planning, meta-reinforcement learning, hierarchical
search-space regularization, and theoretical frameworks, setting the stage for fully automated,
adaptive, and reproducible neural architecture design.

Keywords: Reinforcement Learning, Neural Architecture Search, Policy Gradient, Sample Efficiency.

Summary

(Exploring optimal neural network architectures: What benefits does RL offer?)
This research essay provides a comprehensive overview of Reinforcement Learning-based Neural Architecture Search (RL-NAS), clarifying how modeling architecture design as a Markov Decision Process enables automated, sequential decision-making over complex search spaces. We begin with an overview of the basic principles of RL, including Markov Decision Processes and policy-optimization techniques, before describing the formulation of NAS as an RL problem, with specific attention to controller architectures, action representations, and composite reward functions. Through several case studies, we show dramatic reductions in search time alongside consistent gains in accuracy on benchmark tasks. We then analyze persistent challenges in RL-NAS, including sample inefficiency, trade-offs in search-space design, reproducibility concerns, and limitations in transferability. Finally, we outline future research perspectives in model-based planning, meta-reinforcement learning, hierarchical regularization of the search space, and theoretical frameworks, paving the way for fully automated, adaptive, and reproducible neural architecture design.

Keywords: Reinforcement Learning, Neural Architecture Search, Policy Gradient, Sample Efficiency.

Contents

1 Introduction
2 Reinforcement Learning foundations
2.1 Markov Decision Processes
2.2 Policy optimization methods
3 Neural Architecture Search overview
3.1 Search Space design
3.2 Traditional NAS methods
4 RL-based NAS framework
4.1 NAS as an RL problem
4.2 Controller architecture and Actions
4.3 Reward signal and Evaluation
4.4 Training protocols and Sample efficiency
5 Case Studies and Empirical Results
5.1 RNN controller for CNN design
5.2 Extensions and Improvements
6 Challenges and Research Perspectives
7 Conclusion

List of Tables

1 Key hyperparameters of the LSTM controller in NASNet-A search
2 Comparison of NASNet-A with human-designed baselines on CIFAR-10
3 Comparison of efficiency-oriented NAS methods on CIFAR-10
4 ImageNet transfer performance of RL-based and manual architectures

1 Introduction

Over the past decade, the complexity and depth of neural network architectures have grown at an unprecedented
rate. From AlexNet’s eight layers in 2012 to today’s residual networks hundreds of layers deep and transformer models
exceeding a billion parameters, the design space has drastically expanded (Alarcon 2020). Manual exploration of
such vast architectural configurations is laborious and susceptible to human bias, leading to sub-optimal solutions.
Neural Architecture Search (NAS) has thus emerged as a critical research frontier, promising to automate the
discovery of high-performance architectures while dramatically reducing human intervention (Mnih et al. 2015).

Among the array of NAS methodologies – ranging from evolutionary algorithms and Bayesian optimization to
gradient-based search – Reinforcement Learning (RL) stands out for its conceptual alignment with sequential
decision making and its capacity to handle large, discrete action spaces. In an RL-based NAS framework, a controller
agent formulates architecture components one decision at a time, receiving a scalar reward only after the network
has been trained and evaluated on a validation set. This paradigm elegantly mirrors the exploration–exploitation
dilemma inherent to architecture design, yet introduces formidable challenges in sample efficiency, reward sparsity,
and computational cost.

This essay investigates the question “What benefits does Reinforcement Learning offer in the search for optimal
neural network architectures?” To address this, we first formalize the NAS problem as a Markov Decision Process
(MDP), identifying states, actions, transitions, and reward structures. We then scrutinize the design and implemen-
tation of controller models, typically recurrent or attention-based networks trained via policy-gradient methods,
highlighting their flexibility in capturing complex architectural dependencies. Subsequently, we examine advanced
techniques that mitigate the steep cost of evaluating thousands of candidate networks, including weight-sharing
supernets, multi-fidelity evaluations, and performance prediction models. Through comparative analysis of seminal
case studies and recent extensions, the discussion elucidates how RL-driven NAS accelerates discovery, fosters
novel architectural motifs, and adapts to diverse objectives such as latency, energy consumption, and model size.

Finally, the essay outlines remaining limitations — namely the high resource demands and the sensitivity of learned
policies to search-space design — and identifies promising research directions. These include the integration of
model-based RL to predict network performance, meta-reinforcement learning for rapid adaptation across tasks,
and hierarchical controllers that decompose architecture search into macro- and micro-architectural decisions (Zoph
and Le 2017). By providing a comprehensive, critical perspective on RL-based NAS, this paper aims to clarify the
unique contributions of Reinforcement Learning to automated architecture design and to chart pathways toward
more efficient and generalizable search strategies.

The remainder of the paper is organized as follows. Section 2 reviews the fundamental concepts of Reinforcement
Learning, including MDPs and policy-gradient methods. Section 3 introduces Neural Architecture Search, detailing
search space design, cell-based versus layer-wise parameterizations, and traditional methods. Section 4 formulates
NAS as an RL problem, describing the controller architecture, action and state representations, reward design, and
training protocols. Section 5 presents key case studies and surveys extensions that improve efficiency and scalability.
Section 6 discusses the main challenges facing RL-driven NAS, including computational cost, sample inefficiency,
and search-space bias, and outlines promising future research directions. Section 7 concludes.
2 Reinforcement Learning foundations

Reinforcement Learning (RL) provides a formal framework for sequential decision-making under uncertainty,
modeling the interaction between an agent and its environment as a Markov Decision Process (MDP). In this
section, we first introduce the MDP formalism and its core elements, followed by an in-depth examination of
policy-optimization techniques that form the backbone of RL-driven NAS.

2.1 Markov Decision Processes

An RL problem is cast as an MDP, defined by the tuple M = (S, A, P, R, β), where S is the state space, A the action space, P(s′ | s, a) the transition probability kernel, R(s, a) the immediate reward function, and β ∈ [0, 1] the discount factor. At each time stage t, the agent observes the environment state s_t, takes an action a_t given by the policy function a_t = π(s_t), transitions to the next state s_{t+1} with probability P(s_{t+1} | s_t, a_t), and receives reward r_t = R(s_t, a_t). The agent’s objective is to learn a policy π that maximizes the expected discounted return

    L = E_π[ Σ_{t=0}^{∞} β^t r_t ].    (1)

By construction, an MDP is “memoryless”, i.e. the agent’s behavior depends only on the current state–action pair (s_t, a_t). Mathematically, let H_t = (s_0, a_0, s_1, a_1, . . . , s_{t−1}, a_{t−1}) be the process history available at time t, containing all state–action pairs from previous time stages. In general, a policy function may use both the current state and the process history, i.e. a_t = π(H_t, s_t). To satisfy the Markov property, we assume that all information needed to model the agent’s decision rule is contained in the current state s_t, so that H_t adds no further informational value. Ignoring this history, we obtain the memoryless policy function a_t = π(s_t).
To evaluate the policy π, we define the state-value function V^π(s) = E_π[ Σ_{t=0}^{∞} β^t r_t | s_0 = s ], which estimates the return from state s under π. By conditioning on the first action and successor state, one obtains the Bellman equation that captures the recursive decomposition of the expected total reward (Bellman 1957):

    V^π(s) = R(s, a) + β Σ_{s′ ∈ S} P(s′ | s, a) V^π(s′),  with a = π(s).    (2)
Similarly, the action-value function Q^π(s, a) = E_π[ Σ_{t=0}^{∞} β^t r_t | s_0 = s, a_0 = a ] quantifies the return of taking action a in s and thereafter following π. Whereas the state-value function depends only on the state s, the action-value function depends on both the state s and the action a taken by the agent. The associated Bellman equation is:

    Q^π(s, a) = R(s, a) + β Σ_{s′ ∈ S} P(s′ | s, a) Q^π(s′, a′),  with a′ = π(s′).    (3)

Optimization seeks the optimal value functions V*(s) = max_π V^π(s) and Q*(s, a) = max_π Q^π(s, a), which satisfy the Bellman optimality equations that provide the foundation for dynamic programming (Puterman 1994):

    (i)  state-value function:   V*(s) = max_a [ R(s, a) + β Σ_{s′} P(s′ | s, a) V*(s′) ]
    (ii) action-value function:  Q*(s, a) = R(s, a) + β Σ_{s′} P(s′ | s, a) max_{a′} Q*(s′, a′)    (4)

In small or discrete MDPs, value iteration and policy iteration directly solve these Bellman equations via iterative
updates. However, for high-dimensional or continuous problems – including Neural Architecture Search – explicit
solution is intractable, motivating policy-optimization methods that directly adjust parameterized policies without
solving value functions everywhere (Sutton et al. 1999). We turn next to these techniques in sub-section 2.2.
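Before doing so, it helps to see the dynamic-programming view in code. The following minimal sketch runs value iteration on a tiny, hypothetical two-state MDP; the transition kernel, rewards, and discount factor are illustrative assumptions rather than values taken from any benchmark.

import numpy as np

# Hypothetical toy MDP (all numbers are illustrative assumptions):
# P[a, s, s2] = P(s2 | s, a), R[s, a] = immediate reward, beta = discount factor.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [2.0, 1.0]])
beta = 0.95

def value_iteration(P, R, beta, tol=1e-8):
    """Apply the Bellman optimality backup of Eq. (4) until the values converge."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + beta * sum_{s2} P(s2 | s, a) * V(s2)
        Q = R + beta * np.einsum("ask,k->sa", P, V)
        V_new = Q.max(axis=1)                        # greedy backup, Eq. (4)(i)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)           # optimal values and a greedy policy
        V = V_new

V_star, pi_star = value_iteration(P, R, beta)
print("V* =", V_star, "greedy policy =", pi_star)

Each sweep applies the optimality backup to every state; once the value estimates stop changing, the greedy policy extracted from Q is optimal for this toy problem.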

2.2 Policy optimization methods

Policy optimization methods represent a class of algorithms that directly adjust a parameterized policy to maxi-
mize expected cumulative reward without explicitly computing the complete value function over the state space.
These methods are particularly well-suited for high-dimensional or continuous action domains, where traditional
value-based techniques such as Q-learning become intractable due to the curse of dimensionality. By parameterizing
the decision-making strategy as a differentiable function, typically a neural network, policy optimization enables
gradient-based updates that iteratively improve performance.

A foundational result in this area is the “policy gradient theorem”, which provides an expression for the gradient of
the expected return with respect to policy parameters θ (Sutton et al. 1999). The theorem states that:

    ∇_θ J(θ) = E_{s ∼ d^{π_θ}, a ∼ π_θ}[ ∇_θ log π_θ(a | s) Q^{π_θ}(s, a) ],    (5)

where d^{π_θ}(s) denotes the discounted state-visitation distribution under policy π_θ, and Q^{π_θ}(s, a) is the action-value function for policy π_θ. This result underpins the Monte Carlo policy gradient algorithm, commonly known as “REINFORCE”, which estimates the gradient via sampled trajectories. The REINFORCE update rule takes the form:

    θ ← θ + α Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) G_t,    (6)

where G_t = Σ_{k=t}^{T−1} β^{k−t} r_k is the return from time t. While REINFORCE is conceptually straightforward, it suffers from high variance in its gradient estimates, which can slow convergence (Williams 1992; Peters and Schaal 2005). To mitigate this, various variance-reduction techniques are employed, such as subtracting a baseline function, often chosen to be an approximation of the state-value function V_w(s).
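As a concrete illustration, the sketch below implements the update of Eq. (6) for a tabular softmax policy in plain NumPy; the state and action counts, the step size, and the fabricated episode are assumptions made for the example, and the optional baseline argument stands in for the variance-reduction term discussed above.

import numpy as np

# Illustrative tabular softmax policy: theta[s, a] are logits. The sizes, step
# size alpha, and discount beta are assumptions for this sketch.
n_states, n_actions, alpha, beta = 5, 3, 0.01, 0.99
theta = np.zeros((n_states, n_actions))

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, episode, baseline=0.0):
    """One step of Eq. (6): theta += alpha * sum_t grad log pi(a_t | s_t) * (G_t - b)."""
    grad = np.zeros_like(theta)
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        # Return from time t: G_t = sum_{k=t}^{T-1} beta^(k-t) * r_k
        G_t = sum(beta ** (k - t) * episode[k][2] for k in range(t, T))
        p = softmax(theta[s])
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        grad[s] += (one_hot - p) * (G_t - baseline)  # grad of log softmax w.r.t. the logits
    return theta + alpha * grad

# Usage with a fabricated episode of (state, action, reward) triples:
episode = [(0, 1, 0.0), (2, 0, 0.0), (4, 2, 1.0)]
theta = reinforce_update(theta, episode)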

Actor–Critic methods extend this idea by learning both a policy (the actor) and a value function (the critic) simultaneously. The critic evaluates the current policy by estimating the value function V_w(s), which serves as a baseline to reduce variance in the policy gradient. The temporal-difference (TD) error δ_t = r_t + β V_w(s_{t+1}) − V_w(s_t) provides an estimate of the advantage A(s_t, a_t) = Q^π(s_t, a_t) − V_w(s_t), and the actor update becomes:

    θ ← θ + α δ_t ∇_θ log π_θ(a_t | s_t).    (7)
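A minimal tabular actor–critic step, reusing the same illustrative sizes as the REINFORCE sketch above, might look as follows; the table-based critic and both learning rates are assumptions made for the example.

import numpy as np

n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))    # actor: softmax logits
V_w = np.zeros(n_states)                   # critic: tabular state-value estimates
alpha_actor, alpha_critic, beta = 0.01, 0.1, 0.99

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def actor_critic_step(s, a, r, s_next, done):
    """TD error delta_t = r + beta * V(s') - V(s), then the actor update of Eq. (7)."""
    target = r if done else r + beta * V_w[s_next]
    delta = target - V_w[s]                          # TD error, used as the advantage estimate
    V_w[s] += alpha_critic * delta                   # critic: move V(s) toward the target
    p = softmax(theta[s])
    one_hot = np.zeros(n_actions)
    one_hot[a] = 1.0
    theta[s] += alpha_actor * delta * (one_hot - p)  # actor: Eq. (7) with grad log pi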

More advanced policy optimization algorithms incorporate mechanisms to enforce stability and improve sample
efficiency. Trust Region Policy Optimization (TRPO) constrains the step size in policy space by optimizing under a
KL-divergence constraint, ensuring that each policy update does not deviate excessively from the previous policy
(Schulman et al. 2015). Proximal Policy Optimization (PPO) simplifies this idea by using a clipped surrogate
objective, which limits the ratio between new and old action probabilities (Schulman et al. 2017). These approaches
have demonstrated significant improvements in both convergence speed and final performance on challenging tasks.
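The clipped surrogate at the heart of PPO is compact enough to sketch directly; the per-step log-probabilities and advantage estimates below are fabricated inputs used only to show the shape of the computation.

import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A),
    where rho is the probability ratio between the new and old policies."""
    rho = np.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))    # the quantity to maximize

# Usage with fabricated per-step values:
logp_new = np.array([-0.9, -1.2, -0.3])
logp_old = np.array([-1.0, -1.0, -0.5])
adv = np.array([0.5, -0.2, 1.0])
print(ppo_clip_objective(logp_new, logp_old, adv))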
3 Neural Architecture Search overview

Neural Architecture Search (NAS) seeks to automate the design of neural networks by framing architecture selection
as an optimization problem over a discrete search space X (Elsken et al. 2019). Formally, the goal is to identify:

    x* = arg max_{x∈X} F(x),    (8)

where F (x) quantifies the performance of architecture x under a fixed training regimen. The NAS pipeline is
typically decomposed into three integral components: (i) the search space design, which defines the universe X
of permissible network topologies; (ii) the search strategy, which is the algorithmic mechanism that explores X;
and (iii) the performance estimation, which approximates F (x) efficiently. In this section, we focus on the first two
aspects, providing a comprehensive overview of design philosophies and classical exploration techniques.
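As a deliberately simplified illustration of this decomposition, the sketch below wires the three components together using the most naive possible strategy; the operation vocabulary, the fake proxy score, and all function names are assumptions made for the example, not a reference implementation.

import random
from typing import Tuple

Architecture = Tuple[str, ...]          # e.g. a sequence of layer types

def sample_search_space(rng: random.Random) -> Architecture:
    """Search-space design: defines the universe X of permissible topologies."""
    ops = ("conv3x3", "conv5x5", "maxpool", "skip")
    depth = rng.randint(3, 6)
    return tuple(rng.choice(ops) for _ in range(depth))

def estimate_performance(x: Architecture) -> float:
    """Performance estimation: a cheap stand-in for F(x) (here a fake proxy score)."""
    return sum(1.0 for op in x if op != "skip") / len(x)

def search_strategy(budget: int, seed: int = 0) -> Tuple[Architecture, float]:
    """Search strategy: explores X and keeps the best architecture seen (Eq. 8)."""
    rng = random.Random(seed)
    best_x, best_f = None, float("-inf")
    for _ in range(budget):
        x = sample_search_space(rng)
        f = estimate_performance(x)
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f

print(search_strategy(budget=20))

Swapping out search_strategy (for an evolutionary loop, a Bayesian optimizer, or the RL controller of Section 4) while keeping the other two components fixed is exactly the modularity that this decomposition is meant to capture.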

3.1 Search Space design

The design of the search space X directly dictates both the expressivity and the computational tractability of NAS. A
well–structured X balances the richness of potential architectures against the combinatorial explosion of choices.
Common parameterizations include:

– Cell-based search: In this hierarchical approach, X is defined in terms of “cells” – small directed acyclic graphs (DAGs) that are repeated to form the full network. A cell C can be represented by an adjacency matrix A ∈ {0, 1}^{N×N} and an operation assignment matrix O ∈ O^{N×N}, where N is the number of nodes and O the set of candidate operations. The full network of depth L then becomes:

    network(C, L) = C ∘ C ∘ · · · ∘ C  (L times),    (9)

and the size of X is |X| = |{A, O}|. As stated by Liu et al. (2025), by factorizing the search into cell-structure discovery and stacking rules, one reduces the effective dimensionality from O(Σ_l |O|^l) to O(|O|^{N(N−1)}).

– Layer-wise search: Each layer index ℓ = 1, 2, . . . , D is associated with a tuple of hyperparameters (t_ℓ, k_ℓ, w_ℓ), where t_ℓ denotes the layer type (e.g., convolution, pooling, skip), k_ℓ the kernel size, and w_ℓ the channel width. Thus,

    x = [(t_1, k_1, w_1), (t_2, k_2, w_2), . . . , (t_D, k_D, w_D)],    (10)

and the cardinality of X scales as Π_{ℓ=1}^{D} |T| × |K| × |W|, often yielding billions of candidates even for moderate D. Although layer-wise search is highly expressive, it demands efficient search strategies to remain tractable (a minimal sketch of this parameterization follows the list).

– Hierarchical and Constraint-driven spaces: More elaborate definitions of X introduce multi-level motifs, grouping primitive operations into “micro-architectures” and “macro-architectures”. To ensure that only architectures whose cost(x) stays under a budget B are considered, constraints reflecting hardware budgets (e.g., FLOPs or memory limits) are incorporated via feasibility masks:

    X_hw = {x ∈ X : cost(x) ≤ B}.    (11)
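A minimal sketch of the layer-wise parameterization of Eq. (10) combined with the hardware feasibility mask of Eq. (11) is given below; the candidate sets, the depth, and the FLOP proxy are assumptions chosen to keep the enumeration small.

import itertools

# Illustrative candidate sets for the layer-wise parameterization of Eq. (10).
TYPES = ("conv", "pool", "skip")
KERNELS = (3, 5)
WIDTHS = (16, 32, 64)
DEPTH = 3

def flop_proxy(x):
    """Assumed rough cost model: wider conv layers with larger kernels cost more."""
    return sum(k * k * w for (t, k, w) in x if t == "conv")

def hardware_feasible_space(budget):
    """Enumerate X and keep only architectures with cost(x) <= B, as in Eq. (11)."""
    layer_choices = list(itertools.product(TYPES, KERNELS, WIDTHS))
    space = itertools.product(layer_choices, repeat=DEPTH)   # |X| = (|T||K||W|)^D
    return [x for x in space if flop_proxy(x) <= budget]

X_hw = hardware_feasible_space(budget=1000)
print(len(X_hw), "feasible architectures out of",
      (len(TYPES) * len(KERNELS) * len(WIDTHS)) ** DEPTH)

Even with these deliberately tiny candidate sets, the unconstrained space already contains several thousand architectures, which is exactly why realistic layer-wise spaces demand efficient search strategies and why feasibility masks are applied before, rather than after, expensive evaluation.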


3.2 Traditional NAS methods

Prior to the rise of learning-based NAS, three algorithmic paradigms dominated the exploration of X: (i) random
search, (ii) evolutionary algorithms, and (iii) Bayesian optimization. Each offers distinct trade-offs between simplicity,
sample efficiency, and global search capabilities.

– Random search: The baseline method draws architectures x ∼ U(X) and evaluates F (x), often employing
early stopping heuristics if partial training reveals poor performance. Although naive and often inefficient, random sampling
can achieve competitive results in sufficiently constrained spaces and serves as a crucial sanity check for more
sophisticated approaches.

– Evolutionary algorithms: Evolutionary NAS maintains a population P_g of size M at generation g, where each individual x ∈ P_g is evaluated to yield fitness F(x). New populations are formed via:

    P_{g+1} = select_{top-k}( mutate( crossover(P_g) ) ),    (12)

with mutation operators altering adjacency or operation assignments and crossover exchanging substructures between parent architectures. Yuan et al. (2024) argue that the process iterates until convergence or resource exhaustion, favoring exploration of diverse topologies and exploitation of high-fitness individuals.

– Bayesian optimization: BO models F(x) with a surrogate f̂(x) – commonly a Gaussian process or tree-structured Parzen estimator – and selects the next candidate by maximizing an acquisition function α(x), such as expected improvement (EI):

    x_next = arg max_{x∈X} α(x; f̂, D),    (13)

where D is the set of previously evaluated pairs (x, F(x)). While BO is highly sample-efficient in low-dimensional settings, scaling to the high cardinalities of NAS remains challenging without advanced embeddings or surrogate approximations.
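The acquisition step of Eq. (13) can be sketched with a Gaussian-process surrogate and the expected-improvement criterion; the numeric encoding of architectures (here simply depth and width) and the accuracy values are fabricated for illustration.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI acquisition for maximization; tends to zero where the surrogate is certain."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def propose_next(X_evaluated, y_evaluated, X_candidates):
    """Fit a surrogate f_hat on D = {(x, F(x))} and maximize the acquisition (Eq. 13)."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_evaluated, y_evaluated)
    mu, sigma = gp.predict(X_candidates, return_std=True)
    ei = expected_improvement(mu, sigma, f_best=np.max(y_evaluated))
    return X_candidates[np.argmax(ei)]

# Usage with a fabricated numeric encoding of architectures (depth, width):
X_eval = np.array([[3, 16], [5, 32], [4, 64]], dtype=float)
y_eval = np.array([0.71, 0.78, 0.81])                # fake validation accuracies
X_cand = np.array([[3, 32], [6, 16], [5, 64]], dtype=float)
print("next candidate:", propose_next(X_eval, y_eval, X_cand))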

Despite their historical importance, these traditional methods suffer from high computational cost per evaluation
and limited scalability (Sun et al. 2021). In the next section, we examine how a Reinforcement Learning–based
approach overcomes these limitations by sharing parameters and exploiting structural priors within X.

4 RL-based NAS framework

RL recasts NAS as a sequential decision-making process, conferring unparalleled flexibility in exploring complex
search spaces and directly optimizing black-box performance metrics through reward-driven feedback (Zoph and
Le 2017; Pham et al. 2018). By treating each partial network as an MDP state and architectural choices as actions,
RL-based NAS inherently balances exploration of novel designs against exploitation of high-reward motifs, yielding
more robust and diverse architectures than traditional heuristics (Cai et al. 2019; Schulman et al. 2017). This
framework accommodates multi-objective rewards and, when combined with parameter-sharing supernets and multi-fidelity evaluations, achieves dramatic gains in sample efficiency (Mnih et al. 2016).
4.1 NAS as an RL problem

In the context of RL-NAS, the search space X is redefined as the state–action space of an MDP. Each partial architecture s_t corresponds to an MDP state, while each design decision a_t is an action that enriches the network’s description. The overarching goal of the agent is to learn a policy π_θ(a_t | s_t) that maximizes the expected cumulative reward, expressed mathematically as:

    L(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T−1} r_t ],    (14)

where r_t can encapsulate various performance metrics such as validation accuracy, latency penalties, or other resource-usage metrics (Cai et al. 2019). Each trajectory τ = (s_0, a_0, s_1, a_1, . . . , s_T) delineates a fully specified architecture, with the final reward r_T guiding the parameter updates of the controller through policy-gradient methods, fostering a continuous learning process.
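A minimal sketch of this formulation casts architecture construction as an episodic environment with a sparse terminal reward; the action set, the maximum depth, and the random stand-in for the evaluation function are assumptions for illustration, not a specific published design.

import random

class NASEnvironment:
    """NAS as an MDP: states are partial architectures, actions append a design
    decision, and the reward arrives only once the architecture is complete."""

    ACTIONS = ("conv3x3", "conv5x5", "maxpool", "skip")   # assumed action set

    def __init__(self, max_depth=6, evaluate=None):
        self.max_depth = max_depth
        # evaluate(architecture) -> validation score; a random stub stands in here.
        self.evaluate = evaluate or (lambda arch: random.random())

    def reset(self):
        self.state = []                                   # s_0: the empty architecture
        return tuple(self.state)

    def step(self, action_index):
        self.state.append(self.ACTIONS[action_index])     # a_t enriches the description
        done = len(self.state) == self.max_depth
        reward = self.evaluate(tuple(self.state)) if done else 0.0   # sparse terminal r_T
        return tuple(self.state), reward, done

# One trajectory tau = (s_0, a_0, ..., s_T) under a uniform-random policy:
env = NASEnvironment()
s, done = env.reset(), False
while not done:
    s, r, done = env.step(random.randrange(len(NASEnvironment.ACTIONS)))
print("sampled architecture:", s, "terminal reward:", round(r, 3))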

4.2 Controller architecture and Actions

Typically, the controller is parameterized using either a Long Short-Term Memory (LSTM) or a transformer network.
The hidden state h_t of this architecture encodes the sequence of prior design decisions and informs the probability distribution over subsequent actions via the softmax function:

    π_θ(a_t | s_t) = softmax(W h_t + b).    (15)

By modeling long-range dependencies, these sequence models are crucial in ensuring that significant architectural
patterns, such as repeated blocks or skip connections, naturally emerge during the search process (Zoph and Le
2017). Recent advancements have seen the replacement of recurrent neural networks (RNNs) with transformer-based
controllers, which attend to all prior actions in parallel. This shift not only enhances convergence speed but also
allows for scaling to larger search spaces without sacrificing sample efficiency. Such architectures empower the
controller to produce coherent, globally optimized networks rather than merely locally tuned layers.
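A simplified, single-layer version of such a controller can be sketched in PyTorch as follows; the operation count, hidden size, and sequence length are illustrative assumptions, and the returned log-probability is the quantity a REINFORCE-style update would scale by the reward.

import torch
import torch.nn as nn
from torch.distributions import Categorical

class LSTMController(nn.Module):
    """Sketch of an RL-NAS controller: an LSTM emits one design decision per step
    through a softmax head, as in Eq. (15). Sizes and the operation set are assumed."""

    def __init__(self, n_ops=5, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(n_ops + 1, hidden)   # previous action (+ a start token)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, n_ops)           # W h_t + b -> softmax logits
        self.n_ops = n_ops

    def sample(self, n_decisions=12):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        prev = torch.tensor([self.n_ops])              # start token
        actions, log_probs = [], []
        for _ in range(n_decisions):
            h, c = self.cell(self.embed(prev), (h, c)) # h_t encodes prior decisions
            dist = Categorical(logits=self.head(h))    # pi_theta(a_t | s_t)
            a = dist.sample()
            actions.append(a.item())
            log_probs.append(dist.log_prob(a))
            prev = a
        return actions, torch.stack(log_probs).sum()   # sum_t log pi_theta(a_t | s_t)

# Usage: sample one architecture description and keep its log-probability.
controller = LSTMController()
arch, logp = controller.sample()
print(arch)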

4.3 Reward signal and Evaluation

A key benefit of the RL paradigm is its capacity to incorporate composite and non-differentiable reward functions.
For example, the reward function can be formulated as:

    r(x) = Acc(x) − λ · FLOPs(x) / 10^9,    (16)
which effectively penalizes computational costs in tandem with accuracy (Cai et al. 2019). To alleviate the
high costs associated with full training cycles, RL-NAS leverages weight-sharing supernets, wherein all sampled
architectures share parameters within a single over-parameterized network. This approach reduces the evaluation
process to efficient subgraph forward passes, streamlining the search process. Additionally, multi-fidelity schemes
further enhance efficiency by employing low-cost proxy evaluations to filter out suboptimal architectures before
committing to more expensive training cycles.
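The sketch below combines a composite reward in the spirit of Eq. (16) with an assumed two-stage, multi-fidelity evaluation; the value of λ, the proxy threshold, and the numbers in the usage lines are all illustrative.

def composite_reward(val_accuracy, flops, lam=0.1):
    """Reward of Eq. (16): validation accuracy minus a scaled FLOP penalty (in GFLOPs)."""
    return val_accuracy - lam * (flops / 1e9)

def multi_fidelity_evaluate(arch, proxy_eval, full_eval, proxy_threshold=0.6):
    """Assumed two-stage scheme: a cheap proxy (e.g. a few epochs on a data subset)
    filters weak candidates before the expensive full evaluation is paid for."""
    proxy_score = proxy_eval(arch)
    if proxy_score < proxy_threshold:
        return proxy_score                 # early rejection: skip full training
    return full_eval(arch)

# Usage with fabricated numbers: 92.3% accuracy at 0.8 GFLOPs.
print(composite_reward(val_accuracy=0.923, flops=0.8e9, lam=0.1))   # 0.843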

4.4 Training protocols and Sample efficiency

Early iterations of RL-NAS approaches necessitated thousands of GPU-days for effective search operations. How-
ever, innovations in parameter-sharing, exemplified by Pham et al. (2018) with the Efficient Neural Architecture
Search (ENAS), have drastically reduced search costs by three orders of magnitude, enabling the exploration of
architectures for datasets such as CIFAR-10 in under a single day. Advanced RL optimizers, including Proximal
Policy Optimization (PPO) and distributed replay strategies like Ape-X, have been instrumental in stabilizing
controller updates while leveraging off-policy data to accelerate convergence (Mnih et al. 2016). Furthermore, Lee
and Park (2022) show that Meta-RL methodologies train controllers across a distribution of tasks, facilitating rapid
adaptation to novel domains with minimal additional search effort. This confluence of techniques renders RL-based
NAS both automated and adaptive, unlocking the potential for bespoke architectures that were previously deemed
infeasible to discover manually.

5 Case Studies and Empirical Results

To illustrate the role RL plays in automating the search for optimal network architectures, the
paper reviews two representative case studies: (i) the original RNN-controller approach for CNN design, and (ii)
subsequent extensions that enhance efficiency and scalability. Across these works, the progression from thousands
of GPU-days down to hours of search, combined with continual improvements in validation accuracy, underscores
the transformative power of RL paradigms in automated architecture design.

5.1 RNN controller for CNN design

Zoph and Le (2017) introduced the first RL-NAS framework, employing an LSTM controller to generate convolu-
tional cell descriptions, each of which was then trained to convergence on the CIFAR-10 dataset¹. Their NASNet-A
cell achieved a test error of 3.65%, surpassing the previous state-of-the-art by 0.09% but at the cost of approximately
2,000 GPU-days of search. This pioneering work validated the RL-NAS concept but highlighted the prohibitive
computational expense inherent to fully training sampled architectures.

Tab. 1: Key hyperparameters of the LSTM controller in NASNet-A search

Parameter                   Value
Controller type             2-layer LSTM
Hidden units per layer      100
Learning rate               0.00035
Entropy regularization      0.0001
Training epochs             20
Source: Designed by the author.

These settings reflect a trade-off between controller expressivity and training stability: a deeper LSTM can capture
complex dependencies but requires more data to train effectively. Entropy regularization encourages exploration of
diverse architectures, crucial in the early stages of search but often reduced over time to refine high-reward designs.
¹ CIFAR-10 stands for Canadian Institute for Advanced Research (CIFAR) 10. It is a dataset used for training machine learning models, consisting of 60,000 color images in 10 different classes, with 6,000 images per class. The classes are airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
Tab. 2: Comparison of NASNet-A with human-designed baselines on CIFAR-10

Method                             Test Error   GPU-Days   Params (M)
NASNet-A (Zoph and Le 2017)        3.65%        2,000      27.6
ResNet-56 (He et al. 2016)         6.97%        –          0.9
DenseNet-BC (Huang et al. 2017)    3.46%        –          0.8
Source: Designed by the author.

Although NASNet-A outperformed human-designed models, its computational cost was orders of magnitude
higher, underscoring the need for efficiency improvements. The parameter count of NASNet-A also exceeded
that of conventional architectures, reflecting the controller’s tendency to favor larger cells for accuracy rather than
parsimony.

5.2 Extensions and Improvements

Subsequent work focused on reducing search cost while maintaining or improving accuracy. Pham et al. (2018)
introduced ENAS, which utilized weight-sharing supernets to evaluate architectures via subgraph passes, slashing
search time to under 1 GPU-day and achieving a test error of 2.89% on CIFAR-10. Cai et al. (2019) presented
ProxylessNAS, which eliminated proxy datasets by optimizing directly on target tasks with latency-aware regulariza-
tion, reaching 97.92% accuracy (2.08% error) on CIFAR-10 and 75.1% top-1 on ImageNet under 5.1 ms latency in
approximately 200 GPU-hours.

Tab. 3: Comparison of efficiency-oriented NAS methods on CIFAR-10.

Method                             Error (%)   GPU-Days   Params (M)   Latency (ms)
ENAS (Pham et al. 2018)            2.89        1          4.6          –
ProxylessNAS (Cai et al. 2019)     2.08        8.3        5.7          5.1
DARTS (Liu et al. 2019)            2.76        0.17       3.3          –
Source: Designed by the author.

The ENAS supernet approach demonstrated that parameter sharing can preserve performance while vastly im-
proving sample efficiency, a finding later generalized by differentiable methods like DARTS. ProxylessNAS further
showed that search need not rely on proxy tasks, enabling hardware-aware designs with real-time latency constraints
baked into the reward function.
Tab. 4: ImageNet transfer performance of RL-based and manual architectures.

Method                               Top-1 (%)   Top-5 (%)   GPU-Days   FLOPs (M)
NASNet-Large (Zoph and Le 2017)      82.7        96.2        2,000      564
MobileNetV2 (Sandler et al. 2018)    72.0        91.0        –          300
ProxylessNAS (Cai et al. 2019)       75.1        92.7        8.3        320
Source: Designed by the author.

NASNet-Large achieved state-of-the-art ImageNet accuracy but required prohibitive compute, motivating Proxy-
lessNAS’s direct optimization for mobile latency. MobileNetV2 serves as a human-designed baseline, illustrating
the practical gains available through RL-based hardware-aware search.

6 Challenges and Research Perspectives

NAS driven by RL has delivered state-of-the-art models across vision and language tasks, yet it remains impeded by
severe computational and sample-efficiency bottlenecks. Early RL-NAS frameworks often required thousands of
GPU-days to converge, with evaluation of each candidate architecture dominating search costs despite advances in
parameter-sharing and multi-fidelity strategies (Salmani Pour Avval et al. 2025). Even with supernet weight-sharing,
recent studies report that over 80% of total search time is expended on architecture evaluation, underscoring the
persistent sample-efficiency challenge in RL-based NAS (Jaafra et al. 2019).

Beyond cost, the design of the search space X critically shapes the efficiency of RL-NAS. Overly expansive or
unconstrained spaces exacerbate the combinatorial explosion of architectures, leading to unstable training and
sub-optimal convergence behaviors (Baymurzina et al. 2022). Conversely, aggressively restrictive spaces – while
reducing evaluation load – risk eliminating novel, high-performance motifs and embedding human bias into the
search process. Striking the right balance between expressivity and tractability remains an open problem, particularly
as hardware-aware constraints (e.g., latency budgets) are increasingly incorporated into reward signals.

Reproducibility and theoretical understanding of RL-NAS lag behind empirical progress. Many published studies
omit full details of controller initialization, hyper-parameter settings, and evaluation protocols, complicating efforts
to replicate results or perform fair comparisons (Ying et al. 2019). Moreover, the discrete, high-dimensional
optimization landscape induced by RL controllers defies existing convergence guarantees, leaving questions around
policy-gradient stability and sample complexity unanswered. Establishing standardized benchmarks, such as the
NAS-Bench suites, and developing theory to characterize RL-NAS dynamics are critical steps toward robust, trans-
parent progress (Elsken et al. 2019).

Transferability of learned controllers across tasks and domains offers both promise and challenge. Meta-RL
approaches have begun to demonstrate rapid adaptation of NAS agents to new datasets, reducing search time by
leveraging prior knowledge (Li et al. 2023). However, achieving truly domain-agnostic controllers remains elusive;
controllers pre-trained on CIFAR-10 often require extensive fine-tuning to perform competitively on large-scale
tasks like ImageNet. Research into invariant representations and task-conditioning mechanisms may unlock broader
applicability of RL-NAS agents.

Looking ahead, several research directions stand out as especially promising. Model-based RL for NAS, wherein
learned performance predictors guide search without full evaluations, could dramatically slash sample requirements
by planning in latent space (Cassimon et al. 2025). Hierarchical and regularized search spaces that blend human
priors with learnable structures may achieve the expressivity of unconstrained spaces while retaining tractability
(Wang and Zhu 2024). Meta-learning frameworks hold the potential to amortize search costs across tasks, enabling
zero-shot or few-shot NAS on new domains. Finally, developing theoretical foundations will underpin the next
generation of reliable, reproducible RL-NAS methodologies.
By addressing these intertwined challenges – efficiency, expressivity, reproducibility, and transferability – and
pursuing these forward-looking perspectives, the field can advance toward fully automated, scalable, and theoretically
grounded neural architecture design.

7 Conclusion

Reinforcement Learning–based Neural Architecture Search (RL-NAS) has revolutionized the way deep networks
are designed, transforming a manual, intuition-driven process into a principled, automated framework. By framing
architecture selection as a sequential decision-making problem, RL-NAS leverages the exploration–exploitation
dilemma to discover novel motifs and connectivity patterns that often surpass human-crafted designs in terms of
accuracy and efficiency. The initial demonstrations, epitomized by the NASNet-A controllers, validated the concept
but also highlighted critical barriers of computational cost and sample inefficiency that spurred a wave of innovations.

Subsequent advances – ENAS’s parameter-sharing supernets, ProxylessNAS’s hardware-aware optimization, and differentiable approaches like DARTS – collectively reduced search time from thousands of GPU-days to
practical GPU-hours, without compromising performance. These methods introduced multi-fidelity evaluation
and gradient-based relaxations that dramatically lowered the barrier to entry, broadening RL-NAS’s accessibility
to researchers and practitioners alike. Empirical benchmarks across CIFAR-10 and ImageNet demonstrate that
RL-NAS can consistently produce state-of-the-art architectures with far less human intervention and expert tuning.

Despite these successes, several challenges persist. Evaluating large numbers of candidate architectures remains the
dominant cost, and sample inefficiency continues to constrain the speed of convergence. The design of search spaces
involves a delicate balance between expressivity and tractability: spaces that are too large overwhelm RL controllers,
while overly restrictive spaces may omit groundbreaking architectures. Moreover, reproducibility and theoretical
understanding lag behind practice, with many RL-NAS studies lacking detailed reporting of hyper-parameters,
controller initialization, and evaluation protocols.

Looking forward, integrating model-based performance predictors and meta-reinforcement learning promises to
mitigate evaluation costs and enhance transferability across tasks. Hierarchical and regularized search-space designs
can embed human priors in a learnable manner, guiding exploration without undue restriction. Meanwhile, the
development of rigorous theoretical frameworks and standardized, reproducible benchmarks will be essential to
validate progress and ensure robust comparisons across methods. Combining these advances will move RL-NAS
toward an era of fully automated, efficient, and transparent neural architecture design.

To sum up, RL-based NAS has matured from a proof-of-concept into a scalable, high-performance paradigm, yet
the journey is far from over. Continued innovation in sample-efficiency, search-space regularization, and theoretical
analysis will be crucial to unlocking the next wave of breakthroughs. As these research directions converge, we
anticipate a future where bespoke architectures are generated on demand – tailored to specific tasks and hardware
constraints – with minimal human oversight, ushering in a new era of intelligent, data-driven model design.

References

Alarcon, N. (2020). OpenAI presents GPT-3, a 175 billion parameters language model (NVIDIA Developer).
https://developer.nvidia.com/blog/openai-presents-gpt-3-a-175-billion-parameters-language-model/.

Baymurzina, D., Golikov, E., and Burtsev, M. (2022). A review of Neural Architecture Search. Neurocomputing,
474:82–93.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

Cai, H., Zhu, L., and Han, S. (2019). ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware.
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net.

Cassimon, A., Mercelis, S., and Mets, K. (2025). Scalable Reinforcement Learning-based Neural Architecture
Search. Neural Computing and Applications, 37:231–261.

Elsken, T., Metzen, J. H., and Hutter, F. (2019). Neural Architecture Search: A survey. Journal of Machine Learning
Research, 20(1):1997–2017.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of
2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’16, pages 770–778. IEEE.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2017). Densely connected Convolutional Networks.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July
21-26, 2017, pages 2261–2269. IEEE Computer Society.

Jaafra, Y., Luc Laurent, J., Deruyver, A., and Saber Naceur, M. (2019). Reinforcement Learning for Neural
Architecture Search: A review. Image and Vision Computing, 89:57–66.

Lee, K. and Park, S. (2022). Meta-Reinforcement Learning for Neural Architecture Search. Expert Systems with
Applications, 200:116926.

Li, Y., Wu, J., and Deng, T. (2023). Meta-GNAS: Meta-reinforcement learning for graph neural architecture search.
Engineering Applications of Artificial Intelligence, 123:106300.

Liu, B., Zhao, H., Yuan, T., Zhang, T., and Liu, Z. (2025). Rethinking cell-based neural architecture search: A
theoretical perspective. Neural Networks, page 107557.

Liu, H., Simonyan, K., and Yang, Y. (2019). DARTS: Differentiable Architecture Search. In International Conference
on Learning Representations.

Mnih, V., Badia, A., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference
on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,
Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D.,
Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through Deep Reinforcement Learning.
Nature, 518:529–533.

Peters, J. and Schaal, S. (2005). Natural Actor–Critic. In Proceedings of the 16th European Conference on Machine
Learning, Lecture Notes in Computer Science, pages 280–291. Springer.

Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. (2018). Efficient Neural Architecture Search via Parameter
Sharing. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of
Proceedings of Machine Learning Research, pages 4095–4104.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley &
Sons Ltd., New York.

Salmani Pour Avval, S., Eskue, N., Groves, R., and Yaghoubi, V. (2025). Systematic review on Neural Architecture
Search. Artificial Intelligence Review, 58:73.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and
Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 4510–4520.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust Region Policy Optimization. In Bach,
F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of
Proceedings of Machine Learning Research, pages 1889–1897, Lille, France. PMLR.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization
Algorithms. https://arxiv.org/abs/1707.06347.

Sun, Q., Zhang, X., and Xue, J. (2021). A Review of Neural Architecture Search. Neurocomputing, 455:174–196.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy Gradient Methods for Reinforcement
Learning with Function Approximation. In Solla, S., Leen, T., and Müller, K., editors, Advances in Neural
Information Processing Systems (NeurIPS), volume 12, pages 1057–1063. MIT Press.

Wang, X. and Zhu, W. (2024). Advances in Neural Architecture Search. National Science Review, 11(8).

Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.
Machine Learning, 8(3):229–256.

Ying, C., Klein, A., Real, E., Christiansen, E., Murphy, K., and Hutter, F. (2019). NAS-Bench-101: Towards
Reproducible Neural Architecture Search. https://arxiv.org/abs/1902.09635.

Yuan, G., Xue, B., and Zhang, M. (2024). An Evolutionary Neural Architecture Search Method based on Performance
Prediction and Weight Inheritance. Information Sciences, 667:120466.

Zoph, B. and Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. In International
Conference on Learning Representations (ICLR).
