Autonomous Penetration Testing via RL & LLMs: A
Proposal
Menashe Ashkenazi 3266485321 and Noam Leshem 2130213632
1 Computer Science, Ariel University, Israel, [email protected]
2 Computer Science, Ariel University, Israel, [email protected]
Repository: https://github.com/menashe12346/cyberai.git
1. Introduction
This research proposal presents a novel framework for the development of intelligent agents
in the domain of autonomous offensive cybersecurity. With the growing complexity of
digital infrastructures and the dynamic nature of networked environments, traditional
penetration testing tools—which rely heavily on manual control and static logic—are no
longer sufficient.
Intelligent agents, particularly those driven by reinforcement learning (RL), have shown
promise in adapting to uncertain, adversarial settings. These agents are capable of learn-
ing optimal actions through interaction with their environment, improving over time based
on success and failure.
Simultaneously, recent advancements in large language models (LLMs) have introduced
powerful tools capable of interpreting and reasoning over complex textual outputs, making
them especially valuable in cybersecurity tasks that involve tool output parsing, context
inference, and decision support.
This project aims to integrate these two paradigms: fast, policy-driven exploration via
RL agents (based on deep Q-networks), and semantically-aware advisory reasoning using
fine-tuned LLMs. Together, these components enable a multi-agent architecture that can
autonomously perform reconnaissance and exploitation within dynamic cyber environ-
ments.
The system will be evaluated in realistic penetration testing scenarios, with emphasis
on autonomous behavior, learning efficiency, and adaptability in the face of network de-
fenses and heterogeneous topologies.
2. Related Work and Literature Review
Several recent studies have investigated the use of artificial intelligence in the context of
penetration testing, agent-based decision-making, and cyber-defense architectures. How-
ever, few have explored the full integration of hierarchical reasoning, real-world tooling, and
language-model-based interpretation within a single autonomous framework.
Hu et al. (2020) introduced a pioneering approach using deep reinforce-
ment learning (DRL) to automate penetration testing in simplified simulated networks.
Their work demonstrated that DRL agents could learn attack sequences through explor-
ation and reward signals. However, their design was limited to monolithic agents, lacked
modularity, and did not incorporate real-world tools or dynamic parsing of output.
Ghanem et al. (2023) extended this idea by proposing a hierarch-
ical reinforcement learning framework specifically tailored to large-scale networks. Their
model decomposes attack paths into subtasks across subnetworks, significantly improv-
ing learning efficiency and scalability. While this hierarchical breakdown aligns with our
agent-based modular design, their system does not include real-time language processing
or blackboard-based memory for multi-agent coordination.
Zhou et al. (2024) provide a comprehensive survey of transfer learn-
ing techniques within cybersecurity, identifying key challenges in domain adaptation, data
scarcity, and knowledge reuse. Their work emphasizes the potential of transfer learning
to improve model robustness across varied security tasks, including malware detection,
intrusion analysis, and vulnerability assessment. However, their analysis is primarily fo-
cused on classification-based settings. In contrast, our system leverages transfer learning
in a multi-agent reinforcement learning context, where LLMs serve as semantic decision
advisors within dynamic, evolving attack environments.
3. System Architecture
The proposed system is built around a modular and hierarchical agent-based architecture
designed for autonomous offensive cybersecurity. At its core lies a shared blackboard that
serves as the central memory and coordination hub, allowing multiple agents to operate
in parallel and update shared state representations.
The architecture includes the following key components:
Blackboard API: A centralized, structured memory that stores the evolving sys-
tem state, including target configuration, service fingerprints, CVEs, attack history, and
impact assessments. All agents read from and write to this memory.
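A minimal sketch of such a blackboard is given below; the class name, field names, and locking
scheme are illustrative assumptions rather than the project's actual API.

# Minimal blackboard sketch; field names are illustrative assumptions.
from dataclasses import dataclass, field
from threading import Lock
from typing import Any, Dict, List


@dataclass
class Blackboard:
    """Central shared memory read and written by all agents."""
    target: Dict[str, Any] = field(default_factory=dict)          # IP, OS, hostnames
    services: List[Dict[str, Any]] = field(default_factory=list)  # service fingerprints
    cves: List[Dict[str, Any]] = field(default_factory=list)      # matched CVEs
    attack_history: List[Dict[str, Any]] = field(default_factory=list)
    impact: Dict[str, Any] = field(default_factory=dict)          # assessed attack impact
    _lock: Lock = field(default_factory=Lock, repr=False)

    def read(self, key: str) -> Any:
        with self._lock:
            return getattr(self, key)

    def write(self, key: str, value: Any) -> None:
        with self._lock:
            setattr(self, key, value)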
Reconnaissance Agent: A learning-based agent that performs information gather-
ing actions (e.g., Nmap, Gobuster) and selects optimal scan strategies via Deep Q-Network
(DQN). It receives feedback from the system to improve scanning policies over time.
Vulnerability Agent: A deterministic agent that analyzes scan results and computes
CPEs, matches known CVEs from the NVD dataset, and ranks them by severity and
relevance.
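A simplified sketch of this matching-and-ranking step is shown below; the file name, JSON layout,
and field names (affected_cpes, cvss) are assumptions about a pre-processed dataset, not the raw
NVD schema.

# Illustrative CVE matching and ranking over an assumed pre-processed NVD file.
import json
from typing import Dict, List


def rank_cves(observed_cpes: List[str], nvd_path: str = "nvd_cves.json") -> List[Dict]:
    """Return CVE entries whose affected CPEs intersect the observed ones, ranked by CVSS."""
    with open(nvd_path) as f:
        nvd_entries = json.load(f)

    matches = [
        entry for entry in nvd_entries
        if any(cpe in entry.get("affected_cpes", []) for cpe in observed_cpes)
    ]
    # Higher CVSS base score first; entries without a score sort last.
    return sorted(matches, key=lambda e: e.get("cvss", 0.0), reverse=True)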
Exploitation Agent: A DQN-based agent that selects a CVE and attempts ex-
ploitation using known Metasploit or ExploitDB modules. If a shell is opened or system
behavior changes, this feedback updates the learning model.
Defense Identification Agent: A diagnostic module that monitors whether exploits
are blocked, captures PCAPs, and checks for known IDS/IPS patterns. This information
helps determine the true impact of the attack.
LLM Modules: There are four separate LLMs, each fine-tuned for a distinct task:
parsing recon output, interpreting exploit results, selecting the most suitable exploit, and
recommending next actions. These are queried only when the DQN is uncertain.
Embedding Cache: A semantic memory structure that encodes states as embed-
dings. When a new state is encountered, it is compared with previously seen embeddings
using cosine similarity. Cached LLM results are reused when similarity is high.
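A possible realization of this cache is sketched below, assuming a sentence-transformers embedding
model and an illustrative similarity threshold; the real system may use a different encoder or
threshold.

# Embedding-cache sketch; the model choice and threshold are illustrative.
from typing import List, Optional, Tuple

import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")
_cache: List[Tuple[np.ndarray, str]] = []  # (state embedding, cached LLM response)


def lookup(state_text: str, threshold: float = 0.92) -> Optional[str]:
    """Return a cached LLM response if a sufficiently similar state was seen before."""
    query = _encoder.encode(state_text)
    for emb, response in _cache:
        cos = float(np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb)))
        if cos >= threshold:
            return response
    return None


def store(state_text: str, response: str) -> None:
    """Cache the LLM response keyed by the state embedding."""
    _cache.append((_encoder.encode(state_text), response))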
4. LLMs
The system includes several fine-tuned LLM modules, each optimized for a specific pur-
pose. These modules enhance agent capabilities by providing semantic reasoning and
interpretation in cases where reinforcement learning alone is insufficient:
LLM-ReconParser: Translates unstructured outputs of recon tools (e.g., nmap, gobuster)
into structured JSON states; an illustrative example of such a state appears after this list.
LLM-ExploitAnalyzer: Analyzes results of exploit attempts, determining if a shell was
opened, the access level, or if the attempt was blocked.
LLM-ExploitSelector: Given a detailed system state, recommends the most likely ex-
ploit path to succeed.
LLM-ReconAdvisor: Suggests which reconnaissance command to run next based on
the current environment.
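For illustration, a structured state emitted by LLM-ReconParser might look as follows; the field
names and values are hypothetical and the final schema is still open.

# Hypothetical LLM-ReconParser output for a single target (illustrative schema).
parsed_state = {
    "target": "10.0.0.5",
    "open_ports": [22, 80],
    "services": [
        {"port": 22, "name": "ssh", "product": "OpenSSH", "version": "7.2p2"},
        {"port": 80, "name": "http", "product": "Apache httpd", "version": "2.4.18"},
    ],
    "web_paths": ["/admin", "/uploads"],  # discovered by gobuster
}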
Each LLM response is stored in a semantic cache. During training or execution, the
current state is converted into an embedding and compared against cached entries using
cosine similarity. If a match is found (above threshold), the cached result is reused,
avoiding unnecessary LLM queries. This integration reduces computational cost, improves
generalization to novel states, adds semantic depth in ambiguous scenarios, and provides smooth
fallback behavior for low-confidence DQN predictions, as sketched below. Together, the DQN and
LLMs form a hybrid intelligent agent capable of acting quickly in familiar cases and
reasoning deeply in unfamiliar or complex ones.
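The fallback could take the following shape; the max-Q confidence test, the threshold value, and
the llm_advisor interface are assumptions about how the hybrid policy might be wired, not the
project's final API.

# Hybrid decision sketch: trust the DQN when confident, otherwise ask an LLM advisor.
import numpy as np

Q_CONFIDENCE_THRESHOLD = 0.3  # illustrative value


def choose_action(q_values: np.ndarray, state_text: str, llm_advisor) -> int:
    """Return the DQN's best action, deferring to the LLM advisor on low confidence."""
    best = int(np.argmax(q_values))
    if q_values[best] >= Q_CONFIDENCE_THRESHOLD:
        return best
    # Low confidence: the advisor (possibly served from the embedding cache) suggests an action.
    return llm_advisor.recommend_action(state_text)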
5. Learning Loop and Reward Signal
The learning loop begins by initializing the scenario, setting the target IP, and resetting
the agent’s internal state and replay buffer. Each episode simulates a full attack session,
where agents interact with the environment and learn from outcomes.
At each step within an episode:
1. ReconAgent is activated: It queries the target (e.g., using nmap, gobuster) to
collect information.
2. State is encoded: All current observations are flattened and turned into a fixed-
length numerical vector.
3. DQN selects an action: Based on the current encoded state, the policy model
returns Q-values for all actions, and the best one is selected.
4. Action is executed: The selected command is run, and the system captures its
output.
5. LLM is invoked (if needed): If the DQN is uncertain (e.g., low Q-value), a rel-
evant LLM module parses the output or suggests an alternative action. Its response
is stored in the cache.
6. New state is observed: Any change in system behavior (e.g., opened port, dis-
covered CVE, shell access) is encoded again.
7. Reward is calculated: The reward reflects how much new information the transition
adds to the current state, for example newly discovered ports, services, CVEs, or shell
access (a sketch of this computation appears at the end of this section).
8. Experience is stored: The tuple (s, a, r, s′) is saved to the prioritized replay buffer.
9. Model is updated: Every few steps, the agent samples mini-batches and improves
the policy via gradient descent.
The process repeats until the episode ends. This cycle enables continuous learning
and adaptation across different target environments.
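A condensed sketch of the state encoding (step 2) and reward computation (step 7) appears below;
the vector layout, weights, and dictionary keys are illustrative assumptions, and CVEs are assumed
to be stored as identifier strings.

# Illustrative state encoding and reward signal; sizes, weights, and keys are assumptions.
import numpy as np

NUM_PORTS = 1024  # this sketch encodes only the low port range


def encode_state(state: dict) -> np.ndarray:
    """Flatten observations into a fixed-length vector (step 2)."""
    vec = np.zeros(NUM_PORTS + 2, dtype=np.float32)
    for port in state.get("open_ports", []):
        if port < NUM_PORTS:
            vec[port] = 1.0
    vec[NUM_PORTS] = min(len(state.get("cves", [])), 10) / 10.0  # normalized CVE count
    vec[NUM_PORTS + 1] = 1.0 if state.get("shell") else 0.0      # shell access flag
    return vec


def compute_reward(prev_state: dict, new_state: dict) -> float:
    """Reward the new information a transition adds to the state (step 7)."""
    reward = 0.0
    reward += 1.0 * len(set(new_state.get("open_ports", [])) - set(prev_state.get("open_ports", [])))
    reward += 2.0 * len(set(new_state.get("cves", [])) - set(prev_state.get("cves", [])))
    if new_state.get("shell") and not prev_state.get("shell"):
        reward += 10.0  # gaining a shell is the most valuable outcome
    if reward == 0.0:
        reward = -0.1   # small penalty for uninformative steps
    return reward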
6. Datasets
The learning-based system relies on structured datasets representing real attack scen-
arios, exploit knowledge, and network reconnaissance. To support both DQN and LLM
components, several datasets were curated and constructed.
6.1 CVE and Exploit Datasets
Two main vulnerability datasets are used:
• NVD (CVE) JSON: Contains detailed CVE entries from the National Vulnerabil-
ity Database, including affected CPEs, CVSS scores, and descriptions. This dataset
is used by the VulnAgent to match CPEs to likely vulnerabilities.
• Metasploit and ExploitDB Mappings: A processed dataset that links CVE
identifiers to Metasploit exploit modules and ExploitDB scripts. For each exploit,
the path, payloads, required options, and CVE ID are recorded.
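A single record in this mapping might look like the following; the field names are illustrative
rather than the dataset's actual schema, although the module path and payload shown do exist in
Metasploit.

# Hypothetical record in the CVE-to-exploit mapping (field names are illustrative).
exploit_record = {
    "cve": "CVE-2017-0144",
    "source": "metasploit",
    "module_path": "exploit/windows/smb/ms17_010_eternalblue",
    "payloads": ["windows/x64/meterpreter/reverse_tcp"],
    "required_options": ["RHOSTS", "LHOST"],
}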
6.2 LLM Training Sets
Each of the four LLM modules is trained on task-specific data:
• LLM-ReconParser: supervised fine-tuning using recon outputs
• LLM-ExploitAnalyzer: logs of attack results and corresponding impact
• LLM-ExploitSelector: state-to-exploit pairs with CVE context
• LLM-ReconAdvisor: sequences of reconnaissance decisions across different server
topologies
At this stage, we are still exploring methods for constructing these datasets; one possible record
format is sketched below.
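As a working assumption, a record for LLM-ExploitSelector could take a prompt/completion form such
as the one below; the schema is a placeholder until the dataset construction method is fixed.

# Placeholder fine-tuning record for LLM-ExploitSelector (schema is a working assumption).
training_example = {
    "prompt": "System state: <structured JSON of services, versions, and candidate CVEs>",
    "completion": "Recommended exploit: <CVE identifier and matching Metasploit module path>",
}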
7. Expected Challenges
Developing an autonomous penetration testing system involves several key challenges:
Dynamic Environments: Real-world networks change frequently, making it hard for
agents to generalize from static simulations.
Adversarial Defenses: IDS, firewalls, and evasive mechanisms can cause inconsistent
outcomes, breaking assumptions of environment stationarity.
LLM Latency: LLMs add semantic depth but are costly and slow. Efficient use and
caching are essential.
8. Conclusion and Future Work
This work presents a novel architecture for intelligent penetration testing that integrates
reinforcement learning and large language models. By combining fast decision-making
with deep contextual understanding, the system balances adaptability and interpretabil-
ity in complex network environments.
In the long term, such systems may transform automated red teaming, enabling persist-
ent agents that learn over time, adapt to defenses, and generalize across infrastructure
types. Beyond penetration testing, similar architectures may be applied in cyber threat
emulation, autonomous blue teaming, and real-time network monitoring.
Another promising extension is the development of specialized expert agents, each fo-
cused on a narrow domain such as privilege escalation, service fingerprinting, or post-
exploitation.
Additionally, representing system states as structured graphs instead of flat vectors could
enhance relational reasoning and allow the use of graph neural networks (GNNs) to cap-
ture complex interdependencies between services, ports, and observed behaviors.
This hybrid approach lays the foundation for cyber agents that not only act but un-
derstand — leveraging structured learning and reasoning to navigate evolving security
landscapes.
References
Ghanem, M. C., Chen, T. M., & Nepomuceno, E. G. (2023). Hierarchical reinforcement
learning for efficient and effective automated penetration testing of large networks.
Journal of Intelligent Information Systems. https://doi.org/10.1007/s10844-023-
00755-1
Hu, H., Yang, Y., Qian, C., Yu, T., & Zhang, B.-T. (2020). Automated penetration
testing using deep reinforcement learning. IEEE European Symposium on Security
and Privacy Workshops (EuroS&PW), 112–121. https://doi.org/10.1109/EuroSPW51379.2020.00025
Zhou, L., Huang, M., Wang, J., Chen, X., & Su, L. (2024). Transfer learning for security:
Challenges and future directions [Preprint available at https://arxiv.org/abs/2403.00935].
Proceedings of the 2024 ACM Conference on Artificial Intelligence and Security.