0% found this document useful (0 votes)
84 views52 pages

Generative AI in Cybersecurity: A Comprehensive Review of LLM Applications and Vulnerabilities

This paper reviews the role of Generative AI and Large Language Models (LLMs) in enhancing cybersecurity, covering their applications in areas such as intrusion detection, malware detection, and phishing response. It also addresses vulnerabilities associated with LLMs, including prompt injection and data poisoning, while proposing mitigation strategies. The study evaluates the performance of 42 LLM models and emphasizes the need for robust model deployment to combat evolving cyber threats.

Uploaded by

Jai Shanthi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
84 views52 pages

Generative AI in Cybersecurity: A Comprehensive Review of LLM Applications and Vulnerabilities

This paper reviews the role of Generative AI and Large Language Models (LLMs) in enhancing cybersecurity, covering their applications in areas such as intrusion detection, malware detection, and phishing response. It also addresses vulnerabilities associated with LLMs, including prompt injection and data poisoning, while proposing mitigation strategies. The study evaluates the performance of 42 LLM models and emphasizes the need for robust model deployment to combat evolving cyber threats.

Uploaded by

Jai Shanthi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 52

Generative AI in Cybersecurity: A Comprehensive

Review of LLM Applications and Vulnerabilities


Mohamed Amine Ferrag∗∥ , Fatima Alwahedi† , Ammar Battah† , Bilel Cherif† , Abdechakour Mechri‡ ,
Norbert Tihanyi† , Tamas Bisztray§ , and Merouane Debbah¶
∗ Department of Computer Science, Guelma University, Algeria
† Technology Innovation Institute, UAE
‡ Concordia University, Canada
§ University of Oslo, Norway
¶ Khalifa University of Science and Technology, UAE
∥ Corresponding author: [email protected]
arXiv:2405.12750v2 [cs.CR] 17 Jan 2025

Abstract—This paper provides a comprehensive review of GQA Grouped-Query Attention


the future of cybersecurity through Generative AI and Large HPC High-Performance Computing
Language Models (LLMs). We explore LLM applications across HLS High-Level Synthesis Design Verification
various domains, including hardware design security, intrusion HQQ Half-Quadratic Quantization
detection, software engineering, design verification, cyber threat IDS Intrusion Detection System
intelligence, malware detection, and phishing detection. We
present an overview of LLM evolution and its current state, LLM Large Language Model
focusing on advancements in models such as GPT-4, GPT-3.5, LoRA Low-rank Adapters
Mixtral-8x7B, BERT, Falcon2, and LLaMA. Our analysis extends LSTM Long Short-Term Memory
to LLM vulnerabilities, such as prompt injection, insecure ML Machine Learning
output handling, data poisoning, DDoS attacks, and adversarial MLP Multi-Layer Perceptron
instructions. We delve into mitigation strategies to protect these
models, providing a comprehensive look at potential attack sce- MQA Multi-Query Attention
narios and prevention techniques. Furthermore, we evaluate the NIST National Institute of Standards and Technology
performance of 42 LLM models in cybersecurity knowledge and NLP Natural Language Processing
hardware security, highlighting their strengths and weaknesses. NLU Natural Language Understanding
We thoroughly evaluate cybersecurity datasets for LLM training ORPO Odds Ratio Preference Optimization
and testing, covering the lifecycle from data creation to usage
and identifying gaps for future research. In addition, we review PEFT Parameter Efficient Fine-Tuning
new strategies for leveraging LLMs, including techniques like PLM Pre-trained Language Model
Half-Quadratic Quantization (HQQ), Reinforcement Learning PPO Proximal Policy Optimization
with Human Feedback (RLHF), Direct Preference Optimization RAG Retrieval Augmentation Generation
(DPO), Quantized Low-Rank Adapters (QLoRA), and Retrieval- RLHF Reinforcement Learning from Human Feedback
Augmented Generation (RAG). These insights aim to enhance
real-time cybersecurity defenses and improve the sophistication RNN Recurrent Neural Networks
of LLM applications in threat detection and response. Our paper RTL Register-Transfer Level
provides a foundational understanding and strategic direction SARD Software Assurance Reference Dataset
for integrating LLMs into future cybersecurity frameworks, em- SFT Supervised Fine-Tuning
phasizing innovation and robust model deployment to safeguard SVM Support Vector Machine
against evolving cyber threats.
TRPO Trust Region Policy Optimization
Index Terms—Generative AI, LLM, Transformer, Security,
Cyber Security.
I. I NTRODUCTION
The history of Natural Language Processing (NLP) dates
L IST OF A BBREVIATIONS back to the 1950s when the Turing test was developed. How-
AI Artificial Intelligence ever, NLP has seen significant advancements in recent decades
AIGC Artificial Intelligence Generated Content with the introduction of Recurrent Neural Networks (RNN)
APT Advanced Persistent Threat [1], Long Short-Term Memory (LSTM) [2], Gated Recurrent
CNN Convolutional Neural Network Units (GRU) [3], and Transformer methods [4]. RNN was first
CTG Controllable Text Generation introduced in the 1990s to model data sequences. LSTM, a
CVE Common Vulnerabilities and Exposures variant of RNN, was introduced in 1997, which addressed
CWE Common Weakness Enumeration the vanishing gradient problem and allowed for longer-term
FNN Feed-Forward Neural Network memory in NLP models. GRU, another variant of RNN, was
FRR False Refusal Rate introduced in 2014, which reduced the number of parame-
GPT Generative Pre-trained Transformers ters and improved computational efficiency [5]. The latest
GRU Gated Recurrent Units breakthrough in NLP was the introduction of Transformers

1
in 2017, enabling parallel processing of sequential data and Automation [28].
revolutionizing tasks like machine translation. These methods 7) Penetration Testing: LLMs can help generate scripts
have significantly improved various NLP tasks, including or modify existing ones to automate certain parts of
sentiment analysis, language generation, and translation [4], the penetration testing process. This includes scripts for
[6]–[8]. vulnerability scanning, network mapping, and exploiting
Cybersecurity is an ever-evolving field, with threats becom- known vulnerabilities [29].
ing increasingly sophisticated and complex. As organizations 8) Security Protocols Verification: LLMs can help verify
and individuals rely on digital technologies for communica- the security of protocols such as TLS/SSL, IPSec, . . . etc.
tion, commerce, and critical infrastructure, the need for robust 9) Security Training and Awareness: LLMs can generate
cybersecurity measures has never been greater [9]. The scale training materials tailored to an organization’s needs.
and diversity of cyber threats make it a daunting challenge They can also simulate phishing attacks and other se-
for security professionals to effectively identify, detect, and curity scenarios to train employees to recognize and
defend against them. In this context, Large Language Models respond to security threats [30].
(LLMs) have emerged as a game-changing technology with
the potential to enhance cybersecurity practices significantly
[10]–[14]. These models, powered by advanced NLP and
Large Language Models for Nine Cybersecurity Use Cases
Machine Learning (ML) techniques, offer a new frontier in
the fight against cyber threats [15], [16]. This article explores
the motivations and applications of LLMs in cybersecurity.
Cybersecurity professionals often need to sift through a
vast amount of textual data, including security alerts, incident Threat Detection Phishing Detection Incident Response
and Analysis and Response
reports, threat feeds, and research papers, to stay ahead of
evolving threats. LLMs, like Falcon 180b [17], possess natural
language understanding capabilities that enable them to parse,
summarize, and contextualize this information efficiently [7],
Security Cyber Forensics Chatbots
[18], [19]. They can assist in rapidly identifying relevant Automation
threat intelligence, allowing analysts to make more informed
decisions and prioritize responses [20]. LLMs can excel in
various domains within cybersecurity [21], [22]. Figure 1
highlights the top nine use cases and applications for LLMs
Penetration Testing Security Protocols Security Training
in this field [23], [24]. Verification and Awareness
1) Threat Detection and Analysis: LLMs can analyze
vast network data in real-time to detect anomalies and
Fig. 1: LLM Use Cases And Applications for Cybersecurity.
potential threats. They can recognize patterns indicative
of cyber attacks, such as malware, phishing attempts,
and unusual network traffic [20]. The primary aim of this paper is to provide an in-depth
2) Phishing Detection and Response: LLMs can identify and comprehensive review of the future of cybersecurity using
phishing emails by analyzing the text for malicious Generative AI and LLMs, covering all relevant topics in the
intent and comparing it to known phishing examples. cyber domain. The contributions of this study are summarized
They can also generate alerts and recommend preventive below:
actions [25]. • We review LLMs’ applications for cybersecurity use
3) Incident Response: During a cybersecurity incident, cases, such as hardware design security, intrusion de-
LLMs can assist by providing rapid analysis of the sit- tection, software engineering, design verification, cyber
uation, suggesting mitigation strategies, and automating threat intelligence, malware detection, phishing, and spam
responses where applicable [26]. detection, etc., providing a nuanced understanding of
4) Security Automation: LLMs can facilitate the automa- LLM capabilities across different cybersecurity domains;
tion of routine security tasks such as patch management, • We present a comprehensive overview of LLMs in cy-
vulnerability assessments, and compliance checks. This bersecurity, detailing their evolution and current state,
reduces the workload on cybersecurity teams and allows including advancements in 42 specific models, such as
them to focus on more complex tasks [10]. GPT-4o, GPT-4, BERT, Falcon, and LLaMA models;
5) Cyber Forensics: LLMs can help in forensic analysis • We analyze the vulnerabilities associated with LLMs,
by parsing through logs and data to determine the cause including prompt injection, insecure output handling,
and method of attack, thus aiding in the recovery process training data poisoning, inference data poisoning, DDoS
and future prevention strategies [27]. attacks, and adversarial natural language instructions. We
6) Chatbots: LLMs significantly enhance the capabilities also examine the mitigation strategies to safeguard these
of chatbots in cybersecurity environments by providing models from such vulnerabilities, providing a compre-
User Interaction, Incident Reporting and Handling, Real- hensive look at potential attack scenarios and prevention
time Assistance, Training and Simulations, and FAQ techniques;

2
SURVEY STRUCTURE

IX. LLM Cybersecurity Insights, Challenges and Limitations

A. Challenges and Limitations


B. LLM Cybersecurity Insights
X. Conclusion

I. Introduction VIII. LLM Vulnerabilities and Mitigation

Cybersecurity use cases A. Prompt Injection


Contributions B. Insecure Output Handling
C. Adversarial Natural Language Instructions
D. Automatic adversarial prompt generation
E. Training Data Poisoning
II. Related Reviews F. Inference Data Poisoning
G. Insecure Plugins
A. Applications of LLMs in Hardware Design Security H. Denial of Service (DoS) attack
B. Evaluation of LLMs
C. Evolution and State of LLMs in AI LLM
D. Advancements in PLMs for NLP VII. Cybersecurity datasets for LLMS
E. Instruction Fine-Tuning for LLMs
F. LLMs in Software Engineering A. Cyber Security Dataset Lifecycle
G. Multimodal Algorithms B. Software Cyber Security datasets
H. Alignment Requirements for LLMs
I. Knowledge-Enhanced Pre-trained Language Models
J. Controllable Text Generation in NLG
K. LLM for Cyber Security
L. Our survey compared to related surveys VI. Code specific LLMs

A. Prevalent LLMs
III. Preliminaries of NLP for Cybersecurity B. Datasets Development for Code-centric LLM Models
C. Vulnerabilities Analysis of LLM-Generated Code
A. Recurrent neural network
B. Transformer models

IV. LLMS-based models for Cybersecurity V. General LLMs

A. Recurrent Neural Networks-based models A. Prevalent LLMs


B. Transformer-based models B. LLMs performance in the cybersecurity domain

Fig. 2: Survey Structure (From Section I. to Section X.)

• We evaluated the performance of 42 LLM models in detection and response.


different datasets in the cybersecurity domain.
• We thoroughly evaluate cybersecurity datasets tailored
for LLM training and testing. This includes a lifecycle
analysis from dataset creation to usage, covering various
The rest of this paper is organized as follows. Section
stages such as data cleaning, preprocessing, annotation,
II presents an in-depth analysis of related reviews in the
and labeling. We also compare cybersecurity datasets to
field, charting the evolution and state of LLMs in artificial
identify gaps and opportunities for future research;
intelligence. Section III delves into the preliminaries of NLP
• We provide the challenges and limitations of employing
applications for cybersecurity, covering foundational models
LLMs in cybersecurity settings, such as dealing with
and their advancements. Section IV discusses LLM-based
adversarial attacks and ensuring robustness. We also dis-
solutions specific to cybersecurity. Section V reviews general
cuss the implications of these challenges for future LLM
LLM models. Section VI reviews Code-specific LLMs models.
deployments and the development of secure, optimized
Section VII explores various cybersecurity datasets designed
models;
for LLM training and evaluation, detailing their development
• We discuss novel insights and strategies for leveraging
lifecycle and specific attributes. Section VIII focuses on the
LLMs in cybersecurity, including advanced techniques
vulnerabilities associated with LLMs and the strategies for
such as Half-Quadratic Quantization (HQQ), Reinforce-
their mitigation, introducing a classification of potential threats
ment Learning with Human Feedback (RLHF), Direct
and defense mechanisms. Section IX offers comprehensive in-
Preference Optimization (DPO), Odds Ratio Preference
sights into the challenges and limitations of integrating LLMs
Optimization (ORPO), GPT-Generated Unified Format
into cybersecurity frameworks, including practical considera-
(GGUF), Quantized Low-Rank Adapters (QLoRA), and
tions and theoretical constraints. Finally, Section X concludes
Retrieval-Augmented Generation (RAG). These insights
the paper by summarizing the key findings and proposing
aim to enhance real-time cybersecurity defenses and
directions for future research in LLMs and cybersecurity. A
improve the sophistication of LLM applications in threat
brief overview of the paper’s structure is illustrated in Figure 2.

3
TABLE I: Summary of Related Reviews on Large Language Models
Focused Area of Study Year Authors Key Points Data. Vuln. Comp. Optim. Hardw.
LLMs in Enhancing Hard- 2023 Saha et Discuss applications of LLMs in hardware design security, in- ✘ ✘ ✘ ✘ ✘
ware Design Security al. [31] cluding vulnerability introduction, assessment, verification, and
countermeasure development.
Comprehensive Evaluation 2023 Chang et Provides an analysis of LLM evaluations focusing on criteria, ✘ ✘ ✘ ✘ ✘
Methodologies for LLMs al. [30] context, methodologies, and future challenges.
The Evolutionary Path of 2023 Zhao et Surveys the evolution of LLMs in AI, focusing on pre-training, ✘ ✘ ✘ ✘ ✘
LLMs in AI al. [26] adaptation tuning, utilization, and capacity evaluation.
Recent Advancements in 2023 Min et al. Reviews advancements in PLMs for NLP, covering paradigms ✘ ✘ ✘ ✘ ✘
PLMs for NLP [32] like Pre-train then Fine-tune, Prompt-based Learning, and NLP
as Text Generation.
Exploring Instruction Fine- 2023 Zhang et Explores instruction fine-tuning for LLMs, covering methodolo- ✘ ✘ ✘ ✘ ✘
Tuning in LLMs al. [33] gies, datasets, models, and multi-modality techniques.
Applying LLMs in Soft- 2023 Fan et al. Survey the use of LLMs in Software Engineering, discussing ✘ ✘ ✘ ✘ ✘
ware Engineering [34] applications, challenges, and hybrid approaches.
Understanding Multimodal 2023 Wu et al. Provides an overview of multimodal algorithms, covering defini- ✘ ✘ ✘ ✘ ✘
Algorithms [35] tion, evolution, technical aspects, and challenges.
Defining Alignment Re- 2023 Liu et al. Proposes a taxonomy of alignment requirements for LLMs and ✘ ✘ ✘ ✘ ✘
quirements for LLMs [36] discusses harmful content concepts.
Incorporating External 2023 Hu et al. Reviews KE-PLMs, focusing on incorporating different types of ✘ ✘ ✘ ✘ ✘
Knowledge in PLMs [37] knowledge into PLMs for NLP.
Advances in Controllable 2023 Zhang et Reviews CTG in NLG, focusing on Transformer-based PLMs and ✘ ✘ ✘ ✘ ✘
Text Generation al. [38] challenges in controllability.
LLM for Blockchain Secu- 2024 He et al. Analyze existing research to understand how LLMs can improve ✘ ✘ ✘
rity [39] blockchain systems’ security.
LLM for Critical Infras- 2024 Yigit et Proposing advanced strategies using Generative AI and Large ✘ ✘ ✘ ✘ ✘
tructure Protection al. [40] Language Models to enhance resilience and security.
Software Testing with 2024 Wang et Explore how Large Language Models (LLMs) can enhance ✘ ✘ ✘ ✘ ✘
Large Language Models al. [41] software testing, examining tasks, techniques, and future research
directions.
Malicious Insider Threat 2024 Alzaabi et Recommends advanced ML methods like deep learning and ✘ ✘ ✘
Detection Using Machine al. [27] NLP for better detection and mitigation of insider threats in
Learning Methods cybersecurity, emphasizing the need for integrating time-series
techniques.
Advancements in Large 2024 Raiaan et Reviews the evolution, architectures, applications, societal im- ✘ ✘ ✘ ✘ ✘
Language Models al. [28] pacts, and challenges of LLMs, aiding practitioners, researchers,
and experts in understanding their development and prospects.
Applications of LLMs in 2024 Xu et al. Highlights the diverse applications of LLMs in cybersecurity ✘ ✘ ✘
cybersecurity tasks [42] tasks such as vulnerability detection, malware analysis, and
intrusion and phishing detection.
Retrieval-Augmented Gen- 2024 Zhao et Reviews how RAG has been integrated into various AIGC scenar- ✘ ✘ ✘ ✘ ✘
eration for LLMs al. [28] ios to overcome common challenges such as updating knowledge,
handling long-tail data, mitigating data leakage, and managing
costs associated with training and inference.
Provides an overview of 2024 Han et al. Reviews various PEFT algorithms, their effectiveness, and the ✘ ✘ ✘ ✘ ✘
Parameter Efficient Fine- [43] computational overhead involved.
Tuning (PEFT)
LLM for Cyber Security 2024 Zhang et The paper conducts a systematic literature review of over 180 ✘ ✘ ✘
al. [44] works on applying LLMs in cybersecurity.
LLM with security and pri- 2024 Yao et al. Explores the dual impact of LLMs on security and privacy, ✘ ✘ ✘
vacy issues [10] highlighting their potential to enhance cybersecurity and data
protection while also posing new risks and vulnerabilities.
THIS SURVEY 2024 Ferrag et This paper provides an in-depth review of using Generative AI ✔ ✔ ✔ ✔ ✔
al. and Large Language Models (LLMs) in cybersecurity.
✘ : Not covered; : Partially covered; ✔: Covered; Data.: Datasets used for training and fine-tuning LLMs for security use cases; Vuln.: LLM Vulnerabilities and Mitigation ;
Comp.: Experimental Analysis of LLMs Models’ Performance in Cyber Security Knowledge; Optim.: Optimization Strategies for Large Language Models in Cybersecurity;
Hardw. : Experimental Analysis of LLMs Models’ Performance in Hardware Security.

II. R ELATED REVIEWS tuning for LLMs, and explore their impactful integration into
software engineering. The section also encompasses an in-
This section delves into a curated collection of recent depth look at multimodal algorithms, examines the critical
articles that significantly contribute to the evolving landscape aspect of alignment requirements for LLMs, and discusses
of LLMs and their multifaceted applications. These reviews integrating external knowledge into PLMs to enhance NLP
offer a comprehensive and insightful exploration into various tasks. Lastly, it sheds light on the burgeoning field of Control-
dimensions of LLMs, including their innovative applications lable Text Generation (CTG) in Natural Language Generation
in hardware design security, evaluation methodologies, and (NLG), highlighting the latest trends and challenges in this
evolving role in artificial intelligence. Further, they cover dynamic and rapidly advancing area of research [45]–[47].
cutting-edge advancements in Pre-trained Language Models Table I presents a comprehensive summary of existing reviews
(PLMs) for NLP, delve into the intricacies of instruction fine- on LLMS across various application domains.

4
A. Evaluation of LLMs highlights a range of instruction-fine-tuned models, showcas-
Chang et al. [30] offers a comprehensive analysis of LLM ing their diversity and capabilities. It also examines multi-
evaluations, addressing three key aspects: the criteria for modality techniques and datasets, including those involving
evaluation (what to evaluate), the context (where to evaluate), images, speech, and video, reflecting the broad applicability
and the methodologies (how to evaluate). It thoroughly reviews of instruction tuning. The adaptation of LLMs to different
various tasks across different domains to understand the suc- domains and applications using instruction tuning strategies is
cesses and failures of LLMs, contributing to future research reviewed, demonstrating the versatility of this method. Addi-
directions. The paper also discusses current evaluation metrics, tionally, the survey addresses efforts to enhance the efficiency
datasets, and benchmarks and introduces novel approaches, of instruction fine-tuning, focusing on reducing computational
providing a deep understanding of the current evaluation and time costs. Finally, it evaluates these models, including
landscape. Additionally, it highlights future challenges in LLM performance analysis and critical perspectives, offering a holis-
evaluation and supports the research community by open- tic view of the current state and potential of instruction fine-
sourcing related materials, fostering collaborative advance- tuning in LLMs.
ments in the field.
E. LLMs in Software Engineering
B. Evolution and State of LLMs in AI Fan et al. [34] present a survey on using LLMs in Software
Zhao et al. [26] provides an in-depth survey of LLMs’ Engineering (SE), highlighting their potential applications and
evolution and current state in artificial intelligence. It traces open research challenges. LLMs, known for their emergent
the progression from statistical language models to neural lan- properties, offer novel and creative solutions across various
guage models, specifically focusing on the recent emergence Software Engineering activities, including coding, design,
of pre-trained language models (PLMs) using Transformer requirements analysis, bug fixing, refactoring, performance
models trained on extensive corpora. The paper emphasizes the optimization, documentation, and analytics. Despite these ad-
significant advancements achieved by scaling up these mod- vantages, the paper also acknowledges the significant technical
els, noting that LLMs demonstrate remarkable performance challenges these emergent properties bring, such as the need
improvements beyond a certain threshold and exhibit unique for methods to eliminate incorrect solutions, notably hallu-
capabilities not found in smaller-scale models. The survey cinations. The survey emphasizes the crucial role of hybrid
covers four critical aspects of LLMs: pre-training, adaptation approaches, which combine traditional Software Engineering
tuning, utilization, and capacity evaluation, providing insights techniques with LLMs, in developing and deploying reliable,
into both their technical evolution and the challenges they efficient, and effective LLM-based solutions for Software
pose. Additionally, the paper discusses the resources available Engineering. This approach suggests a promising pathway
for LLM development and explores potential future research for integrating advanced AI models into practical software
directions, underlining the transformative effect of LLMs on development processes.
AI development and application.
F. Multimodal Algorithms
C. Advancements in PLMs for NLP Wu et al. [35] addresses a significant gap in understand-
Min et al. [32] surveys the latest advancements in leveraging ing multimodal algorithms by providing a comprehensive
PLMs for NLP, organizing the approaches into three main overview of their definition, historical development, applica-
paradigms. Firstly, the ”Pre-train then Fine-tune” method tions, and challenges. It begins by defining multimodal models
involves general pre-training on large unlabeled datasets fol- and algorithms, then traces their historical evolution, offering
lowed by specific fine-tuning for targeted NLP tasks. Secondly, insights into their progression and significance. The paper
”Prompt-based Learning” uses tailored prompts to transform serves as a practical guide, covering various technical aspects
NLP tasks into formats akin to a PLM’s pre-training, enhanc- essential to multimodal models, such as knowledge represen-
ing the model’s performance, especially in few-shot learning tation, selection of learning objectives, model construction,
scenarios. Lastly, the ”NLP as Text Generation” paradigm information fusion, and prompts. Additionally, it reviews cur-
reimagines NLP tasks as text generation problems, fully capi- rent algorithms employed in multimodal models and discusses
talizing on the strengths of generative models like GPT-2 and commonly used datasets, thus laying a foundation for future
T5. These paradigms represent the cutting-edge methods in research and evaluation in this field. The paper concludes
utilizing PLMs for various NLP applications. by exploring several applications of multimodal models and
delving into key challenges that have emerged from their
D. Instruction Fine-Tuning for LLMs recent development, shedding light on both the potential and
the limitations of these advanced computational tools.
Zhang et al. [33] delves into the field of instruction fine-
tuning for LLMs, offering a detailed exploration of various
facets of this rapidly advancing area. It begins with an G. Alignment Requirements for LLMs
overview of the general methodologies used in instruction Liu et al. [36] propose a taxonomy of alignment require-
fine-tuning, then discusses the construction of commonly-used, ments for LLMs to aid practitioners in understanding and ef-
representative datasets tailored for this approach. The survey fectively implementing alignment dimensions and inform data

5
collection efforts for developing robust alignment processes. challenging research area. The paper surveys various ap-
The paper dissects the concept of ”harmful” generated content proaches that have emerged in the last 3-4 years, each targeting
into specific categories, such as harm to individuals (like emo- different CTG tasks with varying controlled constraints. It
tional harm, offensiveness, and discrimination), societal harm provides a comprehensive overview of common tasks, main
(including instructions for violent or dangerous behaviors), approaches, and evaluation methods in CTG and discusses the
and harm to stakeholders (such as misinformation impacting current challenges and potential future directions in the field.
business decisions). Citing an imbalance in Anthropic’s align- Claiming to be the first to summarize state-of-the-art CTG
ment data, the paper points out the uneven representation of techniques from the perspective of Transformer-based PLMs,
various harm categories, like the high frequency of ”violence” this paper aims to assist researchers and practitioners in keep-
versus the marginal appearance of ”child abuse” and ”self- ing pace with the academic and technological developments
harm.” This observation supports the argument that alignment in CTG, offering them an insightful landscape of the field and
techniques heavily dependent on data cannot ensure that LLMs a guide for future research.
will uniformly align with human behaviors across all aspects.
The authors’ own measurement studies reveal that aligned
J. LLM for Cyber Security
models do not consistently show improvements across all
harm categories despite the alignment efforts claimed by the Zhang et al. [44] examines the integration of LLMs within
model developers. Consequently, the paper advocates for a cybersecurity. Through an extensive literature review involving
framework that allows a more transparent, multi-objective over 127 publications from leading security and software
evaluation of LLM trustworthiness, emphasizing the need for engineering venues, this paper aims to shed light on LLMs’
a comprehensive and balanced approach to alignment in LLM multifaceted roles in enhancing cybersecurity measures. The
development. survey pinpoints various applications for LLMs in detecting
vulnerabilities, analyzing malware, and managing network
H. Knowledge-Enhanced Pre-trained Language Models intrusions and phishing threats. It highlights the current lim-
itations regarding the datasets used, which often lack size
Hu et al. [37] offers a comprehensive review of Knowledge-
and diversity, thereby underlining the necessity for more
Enhanced Pre-trained Language Models (KE-PLMs), a bur-
robust datasets tailored to these security tasks. The paper
geoning field aiming to address the limitations of standard
also identifies promising methodologies like fine-tuning and
PLMs in NLP. While PLMs trained on vast text corpora
domain-specific pre-training, which could better harness the
demonstrate impressive performance across various NLP tasks,
potential of LLMs in cybersecurity contexts.
they often fall short in areas like reasoning due to the absence
Yao et al. [10] explores the dual role of LLMs in se-
of external knowledge. The paper focuses on how incorpo-
curity and privacy, highlighting their benefits in enhancing
rating different types of knowledge into PLMs can overcome
code security and data confidentiality and detailing potential
these shortcomings. It introduces distinct taxonomies for Nat-
risks and inherent vulnerabilities. The authors categorize the
ural Language Understanding (NLU) and Natural Language
applications and challenges into ”The Good,” ”The Bad,”
Generation (NLG) to distinguish between these two core
and ”The Ugly,” where they discuss LLMs’ positive impacts,
areas of NLP. For NLU, the paper categorizes knowledge
their use in offensive applications, and their susceptibility to
types into linguistic, text, knowledge graph (KG), and rule
specific attacks, respectively. The paper emphasizes the need
knowledge. In the context of NLG, KE-PLMs are classified
for further research on threats like model and parameter extrac-
into KG-based and retrieval-based methods. By outlining these
tion attacks and emerging techniques such as safe instruction
classifications and exploring the current state of KE-PLMs,
tuning, underscoring the complex balance between leveraging
the paper provides not only clear insights into this evolving
LLMs for improved security and mitigating their risks.
domain but also identifies promising future directions for the
Saha et al. [31] discussed several key applications of LLMs
development and application of KE-PLMs, highlighting their
in the context of hardware design security. The paper illus-
potential to enhance the capabilities of PLMs in NLP tasks
trates how LLMs can intentionally introduce vulnerabilities
significantly.
and weaknesses into RTL (Register-Transfer Level) designs.
This process is guided by well-crafted prompts in natural
I. Controllable Text Generation in NLG language, demonstrating the model’s ability to understand and
Zhang [38] provides a critical and systematic review of manipulate complex technical designs. The authors explore
Controllable Text Generation (CTG), a burgeoning field in using LLMs to assess the security of hardware designs. The
NLG that is essential for developing advanced text generation model is employed to identify vulnerabilities, weaknesses,
technologies tailored to specific practical constraints. The and potential threats. It’s also used to pinpoint simple cod-
paper focuses on using large-scale pre-trained language models ing issues that could evolve into significant security bugs,
(PLMs), particularly those based on transformer architecture, highlighting the model’s ability to evaluate technical designs
which have established a new paradigm in NLG due to their critically. In this application, LLMs verify whether a hardware
ability to generate more diverse and fluent text. However, design adheres to specific security rules or policies. The
the limited interpretability of deep neural networks poses paper examines the model’s proficiency in calculating secu-
challenges to the controllability of these methods, making rity metrics, understanding security properties, and generating
transformer-based PLM-driven CTG a rapidly evolving and functional testbenches to detect weaknesses. This part of the

6
study underscores the LLM’s ability to conduct thorough and A. Recurrent neural networks
detailed verification processes. Finally, the paper investigates Recurrent Neural Networks (RNNs) [48] are artificial neural
how effectively LLMs can be used to develop countermea- networks that handle data sequences such as time series or
sures against existing vulnerabilities in a design. This aspect NLP tasks. The RNN model consists of two linked recur-
focuses on the model’s capability to solve problems and create rent neural networks. The first RNN encodes sequences of
solutions to enhance the security of hardware designs. Overall, symbols into a fixed-length vector, while the second decodes
the paper presents an in-depth analysis of how LLMs can be this vector into a new sequence. This architecture aims to
a powerful tool in various stages of hardware design security, maximize the conditional probability of a target sequence from
from vulnerability introduction and assessment to verification a given source sequence. When applied to cybersecurity, this
and countermeasure development. model could be instrumental in threat detection and response
systems by analyzing and predicting network traffic or log data
sequences that indicate malicious activity. Integrating the con-
K. Our survey compared to related surveys ditional probabilities generated by this model could enhance
anomaly detection frameworks, improving the identification
Our paper presents a more specialized and technical explo- of subtle or novel cyber threats. The model’s ability to learn
ration of generative artificial intelligence and large language meaningful representations of data sequences further supports
models in cybersecurity than the previous literature review. its potential to recognize complex patterns and anomalies in
Focusing on a broad array of cybersecurity domains such cybersecurity environments [49], [50].
as hardware design security, intrusion detection systems, and 1) Gated Recurrent Units: GRUs are a recurrent neural
software engineering, it targets a wider professional audience, network architecture designed to handle the vanishing gradient
including engineers, researchers, and industrial practitioners. problem that can occur with standard recurrent networks.
This paper reviews 35 leading models like GPT-4, BERT, Introduced by Cho et al. in 2014 [51], GRUs simplify the
Falcon, and LLaMA, not only highlighting their applications LSTM (Long Short-Term Memory) model while retaining its
but also their developmental trajectories, thereby providing a ability to model long-term dependencies in sequential data.
comprehensive insight into the current capabilities and future GRUs achieve this through two main gates: the update gate,
potentials of these models in cybersecurity. which controls how much a new state overwrites the old
The paper also delves deeply into the vulnerabilities associ- state, and the reset gate, which determines how much past
ated with LLMs, such as prompt injection, adversarial natural information to forget. These gates effectively regulate the
language instructions, and insecure output handling. It presents flow of information, making GRUs adept at tasks like time
sophisticated attack scenarios and robust mitigation strategies, series prediction, speech recognition, and natural language
offering a detailed analysis crucial for understanding and pro- processing. The main steps of GRUs are organized as follows:
tecting against potential threats. Additionally, the lifecycle of • Update Gate: The update gate determines how much
specialized cybersecurity datasets—covering creation, clean- information from the previous hidden state should be
ing, preprocessing, annotation, and labeling—is scrutinized, passed to the new one. The update gate is calculated using
providing essential insights into improving data integrity and the following formula:
utility for training and testing LLMs. This level of detail is
vital for developing robust cybersecurity solutions that can
effectively leverage the power of LLMs. zt = σ(Wz xt + Uz ht−1 ) (1)
Lastly, the paper examines the challenges associated with
deploying LLMs in cybersecurity contexts, emphasizing the where zt is the update gate at time step t, Wz and Uz are
necessity for model robustness and the implications of adver- the weight matrices, xt is the input at time step t, and
sarial attacks. It introduces advanced methodologies such as ht−1 is the previous hidden state. The sigmoid function,
Reinforcement Learning with Human Feedback (RLHF) and represented by σ, squishes the equation’s results between
Retrieval-Augmented Generation (RAG) to enhance real-time 0 and 1. The update gate allows the GRU to decide how
cybersecurity operations. This focus not only delineates the much of the previous hidden state information should be
current state of LLM applications in cybersecurity but also passed on to the new hidden state. If the update gate
sets the direction for future research and practical applications, is close to 1, it means that a lot of the previous hidden
aiming to optimize and secure LLM deployments in an evolv- state information should be passed on, while if the update
ing threat landscape. This makes the paper an indispensable gate is close to 0, it means that very little of the previous
resource for anyone involved in cybersecurity and AI, bridging hidden state information should be passed on. The Update
the gap between academic research and practical applications. Gate formula can be explored in the following three
different parts:
Part 1: Linear combination of inputs:
III. P RELIMINARIES OF NLP FOR C YBER S ECURITY

This section presents the preliminaries of NLP for cyber- zt = Wr xt + Ur ht−1 (2)
security, including recurrent neural networks (LSTMs and
GRUs) and transformer models. Part 2: Application of the sigmoid function:

7
close to 1, the new hidden state is primarily influenced
r̃t = σ(zt ) (3) by the previous hidden state. If the update gate is close
to 0, the candidate’s hidden state primarily influences the
Part 3: Element-wise multiplication of r̃t and previous new hidden state. The element-wise product between the
hidden state: update gate (zt ) and the candidate hidden state (h̃t ) is
used to create the update vector (zt ⊙ h̃t ). The element-
wise product between the complement of the update gate
rt = r̃t ⊙ ht−1 (4) (1 − zt ) and the previous hidden state (ht−1 ) is used to
create the reset vector ((1−zt )⊙ht−1 ). Finally, the update
Where ⊙ represents the Hadamard product, also known and reset vectors are added to calculate the new hidden
as the element-wise multiplication. state (ht ).
• Reset Gate: The reset gate determines how much of the
previous hidden state should be forgotten. The reset gate 2) Long Short-Term Memory: The LSTM [2] was designed
is calculated using the following formula: to overcome the vanishing gradient problem that affects tra-
ditional recurrent neural networks (RNNs) during training,
particularly over long sequences. By integrating memory cells
rt = σ(Wr xt + Ur ht−1 ) (5) that can maintain information over extended periods and gates
that regulate the flow of information into and out of the
where rt is the reset gate at time step t, Wr and Ur are cell, LSTMs provide an effective mechanism for learning
the weight matrices, xt is the input at time step t, and dependencies and retaining information over time. This archi-
ht−1 is the previous hidden state. tecture has proved highly influential, becoming foundational
• Candidate Hidden State: The candidate’s hidden state to numerous applications in machine learning that require
combines the input and the previous hidden state, filtered handling sequential data, such as natural language processing,
through the reset gate. The candidate’s hidden state is speech recognition, and time series analysis. The impact of
calculated using the following formula: this work has been extensive, as it enabled the practical
use and development of deep learning models for complex
sequence modeling tasks. In cybersecurity, LSTMs can be used
h̃t = tanh(W xt + U (rt ⊙ ht−1 )) (6) for anomaly detection, where they analyze network traffic or
system logs to identify unusual patterns that may signify a
where h̃t is the candidate hidden state at time step t, W security breach or malicious activity [52]–[54]. Their ability to
and U are the weight matrices, xt is the input at time learn from long sequences makes them particularly useful for
step t, and rt is the reset gate at time step t. In this detecting sophisticated attacks that evolve, such as advanced
equation, the input at time step t (xt ) is combined with the persistent threats (APTs) and ransomware. The main steps of
previous hidden state (ht−1 ) through the weight matrices LSTM models are organized as follows:
W and U . The reset gate (rt ) is used to control the extent
to which the previous hidden state (ht−1 ) is passed to • Input Gate: The first step in an LSTM-based RNN
the candidate hidden state (h̃t ). The element-wise product involves calculating the input gate. The input gate deter-
between the reset gate (rt ) and the previous hidden state mines the extent of new input to be added to the current
(ht−1 ) is used to create the reset vector (rt ⊙ ht−1 ). The state. The formula for the input gate is:
reset vector is combined with the input (xt ) through the
weight matrix U . Finally, the result is passed through
the hyperbolic tangent function to calculate the candidate it = σ(Wi · [ht−1 , xt ] + bi ) (8)
hidden state (h̃t ).
• New Hidden State: The new hidden state combines the
previous and candidate hidden states, filtered through the where it is the input gate at time step t, Wi is the weight
update gate. The new hidden state is calculated using the matrix for the input gate, ht−1 is the hidden state from
following formula: the previous time step, xt is the input at time step t, and
bi is the bias for the input gate. The function σ(x) is the
sigmoid activation function. This formula calculates the
ht = (1 − zt ) ⊙ ht−1 + zt ⊙ h̃t (7) input gate it by first concatenating the previous hidden
state ht−1 with the current input xt . This combined vector
where ht is the new hidden state at time step t, zt is is multiplied by the weight matrix Wi , and the bias bi is
the update gate at time step t, and h̃t is the candidate added. Finally, the sigmoid activation function is applied
hidden state at time step t. The new hidden state (ht ) to produce it , which ranges from 0 to 1 and represents
is calculated by taking a weighted combination of the how much the current input updates the hidden state.
previous hidden state (ht−1 ) and the candidate hidden • Forget Gate: The second step calculates the forget gate,
state (h̃t ). The weight of the previous hidden state is determining how much the previous state should be
determined by the update gate (zt ) - if the update gate is forgotten. The formula for the forget gate is:

8
ft = σ(Wf · [ht−1 , xt ] + bf ) (9)

where ft is the forget gate at time step t, Wf is the weight


matrix for the forget gate, ht−1 is the hidden state from
the previous time step, xt is the input at time step t, and
bf is the bias for the forget gate. The forget gate ft is
calculated like the input gate. It involves concatenating
the previous hidden state ht−1 with the current input xt ,
multiplying by the weight matrix Wf , and adding the
bias bf . The resulting value is passed through the sigmoid
activation function to determine the forget gate ft , which
ranges from 0 to 1 and represents the degree to which
the previous hidden state is preserved or forgotten in the
current hidden state.
• Candidate Memory Cell: The third step calculates the
candidate memory cell, representing the potential mem-
ory state update. The formula for the candidate memory
cell is:
Fig. 3: How Transformer works for Software Security.

c̃t = tanh(Wc · [ht−1 , xt ] + bc ) (10)


B. Transformer models
The Transformer architecture proposed by Vaswani et al.
where c̃t is the candidate memory cell at time step t, [4] in 2017 is a significant advancement in natural language
Wc is the weight matrix for the candidate memory cell, processing built entirely around attention mechanisms. These
ht−1 is the hidden state from the previous time step, xt mechanisms allow the model to assess the relevance of dif-
is the input at time step t, and bc is the bias for the ferent words in a sentence, independent of their positional
candidate memory cell. The function tanh(x) is the hy- relationships. This foundational technology has enhanced the
perbolic tangent activation function. In this formula, the efficiency of tasks like translation and text summarization
candidate memory cell c̃t is calculated by concatenating and has broad cybersecurity applications. In cybersecurity,
the previous hidden state ht−1 and the input xt , then Transformer models can detect and respond to threats by ana-
multiplying by the weight matrix Wc and adding the bias lyzing source code patterns and network traffic and identifying
bc . The result is passed through the hyperbolic tangent anomalies in system logs, as presented in Figure 3. They can
activation function, which ranges from -1 to 1, to control also be used for the automated generation of security policies
the magnitude of the memory cell update. based on the evolving landscape of threats and for intelligent
• Current Memory Cell: The fourth step calculates the threat hunting, where the system predicts and neutralizes
current memory cell, which is the updated state of the threats before they cause harm. This makes Transformers
memory cell, combining the effects of the forget and input versatile in enhancing security protocols and defending against
gates. The formula for the current memory cell is: cyber attacks [20]. The main steps of Transformer models are
organized as follows:
• Attention Mechanism: The attention mechanism in the
ct = ft · ct−1 + it · c̃t (11) Transformer model computes attention scores between
the input and output representations. These scores are
calculated using the scaled dot-product of the query and
where ct is the current memory cell at time step t, ft is
key representations and then normalized by a softmax
the forget gate at time step t, ct−1 is the memory cell
function. The attention scores are subsequently used to
from the previous time step, it is the input gate at time
compute a weighted sum of the value representations,
step t, and c̃t is the candidate memory cell at time step
forming the output of the attention mechanism.
t. This equation represents the new memory cell state as
The equation defines the attention scores:
a combination of the old state (modulated by the forget
gate) and the potential update (modulated by the input
gate).
QK T
 
• Output Gate: The final step calculates the output gate, Attention(Q, K, V ) = softmax √ V
which determines the amount of information output from dk
(12)
the LSTM cell. The details and formula for the output
gate should follow.
Where:

9
Q, K, and V represent the matrices of queries, keys, and bution of activations in a layer changes during training.
values transformed from the input representations. The The normalization operation is performed by subtracting
dimension of the keys is denoted by dk . The attention the mean of the activations and dividing by the square
mechanism involves computing the dot product of Q and root of the variance. This ensures the activations have
the transpose of K, which is then scaled by the inverse a zero mean and unit variance, leading to more stable
square root of dk to stabilize the gradients. The result training.
is passed through a softmax function to normalize the • Position-wise Feed Forward: The position-wise feed-
scores, ensuring they sum to 1. These scores, represent- forward network transforms the input and output repre-
ing attention weights, compute a weighted sum of the sentations in the Transformer model. The position-wise
values in V , resulting in the final attention output. This feedforward can be calculated as follows:
mechanism allows the model to dynamically focus on
the most relevant parts of the input sequence for making
predictions. F F N (x) = max(0, xW1 + b1 )W2 + b2 (16)
• Multi-Head Attention: In the Transformer model, mul-
tiple attention heads enhance the model’s capability to
Where: x is the input to the feed-forward network. W1 ,
simultaneously focus on different parts of the input se-
b1 , W2 , and b2 are the weight and bias parameters of the
quence. The multi-head attention is calculated as follows:
feed-forward network. max(0, x) is the ReLU activation
function. This equation represents a simple feed-forward
neural network (FFN) operation in deep learning models.
MHead(Q, K, V ) = Concat(hd1 , hd2 , . . . , hdh )W O The FFN operation is a multi-layer perceptron (MLP)
(13) that transforms the input x into a new representation by
passing it through two fully connected (dense) layers. The
Where: hdi represents the output of the i-th at- first layer is followed by a ReLU activation function,
tention head, computed using the attention formula which applies a non-linear activation to the input by
Attention(Qi , Ki , Vi ). Each Qi , Ki , and Vi are different setting all negative values to zero. This activation function
linear projections of the original inputs Q, K, and V . helps the model learn complex non-linear relationships
W O is a linear transformation matrix applied to the between the input and output. The second layer is a linear
concatenated results of all attention heads. The Concat transformation that produces the final output of the FFN.
function concatenates the outputs of each head along a The weight and bias parameters of the two layers, W1 ,
specific dimension. The outputs of individual heads, hdi , b1 , W2 , and b2 , are learned during training and allow the
are each computed using the scaled dot-product attention model to learn different representations of the input data.
mechanism: • Encoder and Decoder Blocks: In the Transformer model,
the encoder and decoder blocks transform the input
sequences into the output sequences. The encoder and
Qi KiT decoder blocks can be calculated as follows:
 
hdi = Attention(Qi , Ki , Vi ) = softmax √ Vi
dk
(14)
Enc(x) = LN(x + MHead(x, x, x)) (17)
This approach enables the Multi-Head Attention mech-
anism to capture various aspects of the input sequence,
simultaneously focusing on different subspace represen-
tations. As a result, it facilitates the model’s capture of Dec(x, y) = LN(x+MHead(x, y, y)+MHead(x, x, x))
more complex relationships and improves performance (18)
across different types of tasks.
• Layer Normalization: In the Transformer model, layer Where: x is the input to the encoder/decoder block. y
normalization ensures the input is within a standard is the output from the previous encoder/decoder block.
range. The layer normalization can be calculated as The Encoder block Enc(x) takes the input x and applies
follows: the Multi-Head Attention mechanism to compute the
attention scores between the input and itself. The result
x − mean(x) is then added to the input and passed through a Layer
LN (x) = p (15) Normalization operation. The output of the encoder block
var(x)
is the new representation of the input after processing
through the Multi-Head Attention and Layer Normaliza-
Where x is the input to the layer normalization. tion operations. The Decoder block Dec(x, y) is similar to
textmean(x) and textvar(x) are the mean and variance the encoder block but also takes the output from the previ-
of x, respectively. Layer Normalization aims to mitigate ous decoder block, y, as input. The Multi-Head Attention
the internal covariate shift, which arises when the distri- mechanism is applied to compute the attention scores

10
between the input and the previous output and between which are used for effective classification using an enhanced
the input and itself. The results are added to the input LSTM framework. The proposed system outperformed other
and passed through a Layer Normalization operation. The methods, such as LPBoost and DNNs, in accuracy, precision,
output of the decoder block is the new representation recall, and error rate. The NSL-KDD dataset was used for
of the input after processing through the Multi-Head validation and testing, and further verification was done on
Attention and Layer Normalization operations. other datasets. While the paper provides a comprehensive
solution, future research could explore the applicability of the
IV. LLM S - BASED MODELS FOR C YBER S ECURITY proposed system to other datasets and real-world scenarios.
Additionally, a more detailed analysis of the computational
This section reviews recent studies employing LLM-
cost of the proposed system compared to other methods could
based models (i.e., Recurrent Neural Networks-based and
be beneficial.
transformer-based models) for threat detection, malware clas-
Zhao et al. [63] presents ERNN, an end-to-end RNN
sification, intrusion detection, and software vulnerability de-
model with a novel gating unit called session gate, designed
tection.Table II presents the RNN-based models for Cyber
to address network-induced phenomena that may result in
Security, while Tables III and IV present the transformer-based
misclassifications in traffic detection systems used in cyber-
models for Cyber Security. Figure 4 presents the LLM-based
security. The gating unit includes four types of actions to
solutions for Cyber Security Use Cases.
simulate network-induced phenomena during model training
and the Mealy machine to adjust the probability distribution
A. Recurrent Neural Networks-based models of network-induced phenomena. The paper demonstrates that
1) Intrusion Detection: Yin et al. [55] propose a deep learn- ERNN outperforms state-of-the-art methods by 4% accuracy
ing approach for intrusion detection using recurrent neural and is scalable in terms of parameter settings and feature
networks (RNN-ID) and study its performance in binary and selection. The paper also uses the Integrated Gradients method
multiclass classification tasks. The results show that the RNN- to interpret the gating mechanism and demonstrates its ability
ID model outperforms traditional machine learning methods to reduce dependencies on local packets. Althubiti et al.
in accuracy. Chawla et al. [60] presented an anomaly-based [57] propose a deep learning-based intrusion detection system
intrusion detection system that leverages recurrent neural net- (IDS) that uses a Long Short-Term Memory (LSTM) RNN
works (RNNs) with gated recurrent units (GRUs) and stacked to classify and predict known and unknown intrusions. The
convolutional neural networks (CNNs) to detect malicious experiments show that the proposed LSTM-based IDS can
cyber attacks. The system establishes a baseline of normal achieve a high accuracy rate of 0.9997. Xu et al. [58] propose
behavior for a given system by analyzing sequences of system a novel IDS that consists of a recurrent neural network with
calls made by processes. It identifies anomalous sequences gated recurrent units (GRU), multilayer perceptron (MLP),
based on a language model trained on normal call sequences and softmax module. The experiments on the KDD 99 and
from the ADFA dataset of system call traces. The authors NSL-KDD data sets show that the system has a high overall
demonstrate that using GRUs instead of LSTMs results in detection rate and a low false positive rate. Ferrag and Lean-
comparable performance with reduced training times and dros [59] propose a novel deep learning and blockchain-based
that combining GRUs with stacked CNNs leads to improved energy framework for smart grids, which uses a blockchain-
anomaly detection. The proposed system shows promising based scheme and a deep learning-based scheme for intrusion
results in detecting anomalous system call sequences in the detection. The deep learning-based scheme employs recurrent
ADFA dataset. However, further research is needed to evaluate neural networks to detect network attacks and fraudulent
its performance in other datasets and real-world scenarios and transactions in the blockchain-based energy network. The
address issues related to adversarial attacks. performance of the proposed IDS is evaluated using three
Ullah et al. [61] introduce the deep learning models to tackle different data sources.
the challenge of managing cybersecurity in the growing realm Polat et al. [65] introduce a method for improving the
of IoT devices and services. The models utilize Recurrent detection of DDoS attacks in SCADA systems that use SDN
Neural Networks, Convolutional Neural Networks, and hybrid technology. The authors propose using a Recurrent Neural Net-
techniques to detect anomalies in IoT networks accurately. work (RNN) classifier model with two parallel deep learning-
The proposed models are validated using various datasets (i.e., based methods: Long Short-Term Memory (LSTM) and Gated
IoT-DS2, MQTTset, IoT-23, and datasets) and achieve high Recurrent Units (GRU). The proposed model is trained and
accuracy, precision, recall, and F1 score. However, the models tested on a dataset from an experimentally created SDN-
need to be tested on more extensive and diverse datasets, based SCADA topology containing DDoS attacks and regular
and further research is necessary to enhance their scalability network traffic data. The results show that the proposed RNN
for practical applications in cybersecurity. Donkol et al. [62] model achieves an accuracy of 97.62% for DDoS attack de-
presents a technique, ELSTM-RNN, for improving security in tection, and transfer learning further improves its performance
intrusion detection systems. Using likely point particle swarm by around 5%.
optimization (LPPSO) and enhanced LSTM classification, the 2) Software Security: Wang et al. [64] propose a deep
proposed system addresses gradient vanishing, generalization, learning-based defense system called PatchRNN to automat-
and overfitting issues. The system uses an enhanced parti- ically detect secret security patches in open-source software
cle swarm optimization technique to select efficient features, (OSS). The system leverages descriptive keywords in the

11
TABLE II: RNN-based models for Cyber Security.
Study Year Type of Model Dataset Used Domain Key Contributions Open Issues
Yin et al. 2017 RNN-ID (Recurrent Benchmark data set Intrusion Detection The proposed model can improve Other machine learning algorithms
[55] Neural Network- the accuracy of intrusion detection and deep learning models, such as
Intrusion Detection) convolutional neural networks and
transformers are not considered in
the comparison
Güera et al. 2018 Temporal-aware Large set of deepfake videos collected Detection of Deep- The proposed method achieves The proposed approach’s effective-
[56] Pipeline (CNN and from multiple video websites fake Videos competitive results in detecting ness might be limited to the specific
RNN) deepfake videos while using a sim- types of deepfakes present in the
ple architecture dataset
Althubiti et 2018 LSTM RNN CSIC 2010 HTTP dataset Web Intrusion De- Proposal of LSTM RNN for web The paper only uses the CSIC 2010
al. [57] tection intrusion detection. High accuracy HTTP dataset, which may not be
rate (0.9997) in binary classifica- representative of all types of web
tion. application attacks
Xu et al. 2018 GRU-MLP-Softmax KDD 99 and NSL-KDD data sets Network Intrusion The system achieves leading perfor- The paper does not provide infor-
[58] (Gated Recurrent Detection mance with overall detection rates mation about the scalability of the
Unit, Multilayer of 99.42% using KDD 99 and proposed model
Perceptron, 99.31% using NSL-KDD, with low
Softmax) false positive rates
Ferrag and 2019 Blockchain and CICID2017 dataset, Power system Energy framework Proposal of DeepCoin framework The paper does not address the po-
Leandros RNN dataset, Bot-IoT dataset for Smart Grids combining blockchain and deep tential scalability issues that may
[59] learning for smart grid security arise as the number of nodes in the
network increases
Chawla et 2019 GRU with CNN ADFA (Australian Defence Force Intrusion Detection Achieved improved performance by The proposed system is vulnerable
al. [60] Academy) dataset combining GRUs and CNNs to adversarial attacks
Ullah et al. 2022 LSTM, BiLSTM, IoT-DS2, MQTTset, IoT-23, and Intrusion Detection Validation of the proposed mod- Further research is necessary to en-
[61] and GRU datasets els using various datasets, achieving hance their scalability for practical
high accuracy, precision, recall, and applications in cybersecurity
F1 score
Donkol et 2023 LSTM CSE-CIC-IDS2018, CICIDS2017, and Intrusion Detection The proposed system outperformed Future research could explore the
al. [62] UNSW-NB15 datasets other methods such as LPBoost and applicability of the proposed system
DNNs in terms of accuracy, preci- to other datasets
sion, recall, and error rate
Zhao et al. 2023 End-to-End Recur- IDS2017 and IDS2018 datasets Intrusion attacks + Address network-induced phe- The proposed system is vulnerable
[63] rent Neural Network and malware nomena that may result in misclas- to adversarial attacks
sifications in traffic detection sys-
tems used in cybersecurity
Wang et al. 2021 RNN A large-scale patch dataset PatchDB Software Security The PatchRNN system can effec- The PatchRNN system can only
[64] tively detect secret security patches support C/C++
with a low false positive rate
Polat et al. 2022 LSTM and GRU SDN-based SCADA system Detection of DDoS The results show that the proposed The paper only focuses on detecting
[65] attacks RNN model achieves an accuracy DDoS attacks and does not address
of 97.62% for DDoS attack detec- other types of cyber threats (e.g., in-
tion sider threats or advanced persistent
threats)

commit message and syntactic and semantic features at the B. Transformer-based models
source-code level. The system’s performance was evaluated
1) Cloud Threat Forensics: Parra et al. [66] proposed
on a large-scale real-world patch dataset and a case study
an interpretable federated transformer log learning model
on NGINX. The results indicate that the PatchRNN system
for threat detection in syslogs. The model is generated by
can effectively detect secret security patches with a low false
training local transformer-based threat detection models at
positive rate.
each client and aggregating the learned parameters to generate
a global federated learning model. The authors demonstrate
the difference between normal and abnormal log time series
3) Detection of Deepfake Videos: Güera et al. [56] propose through the goodness of fit test and provide insights into
a temporal-aware pipeline that automatically detects deepfake the model’s decision-making process through an attention-
videos by using a convolutional neural network (CNN) to based interpretability module. The results from the HDFS and
extract frame-level features and a recurrent neural network CTDD datasets validate the proposed approach’s effectiveness
(RNN) to classify the videos. The results show that the system in achieving threat forensics in real-world operational set-
can achieve competitive results in this task with a simple tings. Evange et al. [75] discuss the importance of actionable
architecture. threat intelligence in defending against increasingly sophis-
ticated cyber threats. Cyber Threat Intelligence is available
Overall, the reviewed studies demonstrate the potential of on various online sources, and Named Entity Recognition
deep learning methods, particularly RNNs, for intrusion detec- (NER) techniques can extract relevant information from these
tion in various domains. The results show that the proposed sources. The paper investigates the use of transformer-based
deep learning-based models outperform traditional machine models in NER and how they can facilitate the extraction
learning methods in accuracy. However, more research is of cybersecurity-related named entities. The DNRTI dataset,
needed to address the limitations and challenges associated which contains over 300 threat intelligence reports, tests the ef-
with these approaches, such as data scalability and inter- fectiveness of transformer-based models compared to previous
pretability. approaches. The experimental results show that transformer-

12
TABLE III: Transformer-based models for Cyber Security (Part I).
Study Year Type of Model Dataset Used Domain Key Contributions Open Issues
Parra et al. 2022 Federated HDFS and CTDD datasets Threat detection and The interpretability module inte- The paper briefly mentions the ap-
[66] Transformer Log forensics grated into the model provides plicability of the proposed approach
Learning Model insightful interpretability of the in edge computing systems but does
model’s decision-making process not discuss the scalability of the
approach to larger systems
Ziems et al. 2021 Transformer Model, Malware family datasets Malware Classifica- Demonstration that transformer- The experiments are conducted
[67] BERT, CANINE, tion based models outperform on preprocessed NIST NVD/SARD
Bagging-based traditional machine and deep databases, which may not reflect
random transformer learning models in classifying real-world conditions
forest (RTF) malware families
Wu et al. 2022 Robust CICID2017 and CIC-DDoS2019 Intrusion Detection The proposed method outperforms There is no discussion in the pa-
[68] Transformer-based datasets classical machine learning algo- per regarding the scalability of the
Intrusion Detection rithms such as support vector ma- proposed method, particularly when
System (RTID) chine (SVM) and deep learning al- dealing with large-scale and real-
gorithms (i.e., RNN, FNN, LSTM) time network traffic
on the two evaluated datasets
Demirkıran 2022 Transformer-based Catak dataset, Oliveira dataset, Malware classifica- The paper demonstrates that The study only focuses on mal-
et al. [69] models VirusShare dataset, and VirusSample tion transformer-based models, ware families that use API call se-
dataset specifically BERT and CANINE, quences, which means that it does
outperform traditional machine and not consider other malware types
deep learning models in classifying that may not use API calls
malware families
Ghourbi et 2022 An optimized ToN-IoT and Edge IIoTset datasets Threat Detection The experimental evaluation of the The paper does not discuss the scal-
al. [70] LightGBM model approach showed remarkable accu- ability of the proposed system for
and a Transformer- racies of 99% large-scale healthcare networks
based model
Thapa et al. 2022 Transformer-based Software vulnerability datasets of Software security The paper highlights the advan- The paper only focuses on detect-
[71] language models C/C++ source codes and vulnerability tages of transformer-based language ing vulnerabilities in C/C++ source
detection in models over contemporary models code and does not explore the use
programming of large transformer-based language
languages, models in detecting vulnerabilities
specifically C/C++ in other programming languages
Ranade et 2021 A transformer-based WebText dataset Fake Cyber Threat The attack is shown to introduce Further research is needed to ex-
al. [72] language model, Intelligence adverse impacts such as returning plore how to prevent or detect data
specifically GPT-2 incorrect reasoning outputs poisoning attacks on cyber-defense
system
Fu et al. [73] 2022 Transformer- Large-scale real-world dataset with Software The proposed system is accurate The model’s performance can be
based line-level more than 188k C/C++ functions vulnerability for predicting vulnerable functions changed when applied to different
vulnerability prediction in safety- affected by the Top-25 most dan- programming languages or software
prediction model critical software gerous CWEs systems
systems
Mamede et 2022 A transformer-based Software Assurance Reference Software security The proposed system can identify The proposed method cannot be ex-
al. [74] deep learning model Dataset (SARD) project, which in the context of up to 21 vulnerability types and tended to other programming lan-
contains vulnerable and non- Java programming achieved an accuracy of 98.9% in guages and integrated into existing
vulnerable Java files language multi-label classification software development processes
Evange et 2021 A transformer-based DNRTI (Dataset for NER in Threat Cybersecurity threat The experimental results demon- Further research is needed to test
al. [75] model Intelligence) intelligence strate that transformer-based tech- the effectiveness of transformer-
niques outperform previous state- based models on larger and more
of-the-art approaches for NER in diverse datasets
threat intelligence
Hashemi et 2023 Transformer models Labeled dataset from vulnerability Vulnerability Infor- The proposed approach outper- The paper does not address the is-
al. [76] (including BERT, databases mation Extraction forms existing rule-based and CRF- sue of bias in the labeled dataset
XLNet, RoBERTa, based models
and DistilBERT)
Liu et al. 2022 Transformer model A commit benchmark dataset that Commit message The experimental results demon- The pre-training dataset used in the
[77] includes over 7.99 million commits generation strate that CommitBART signifi- paper is limited to GitHub commits
across 7 programming languages (generation task) cantly outperforms previous pre-
and security patch trained models for code
identification
(understanding task)
Ahmad et al. 2024 Transformer model Set of 15 hardware security bug Hardware Security Bug repair potential demonstrated The need for designer assistance
[78] benchmark designs from three Bugs by ensemble of LLMs, outperform- in bug identification, handling com-
sources: MITRE website, OpenTitan ing state-of-the-art automated tool plex bugs, limited evaluations due
System-on-Chip (SoC) and the to simulation constraints, and chal-
Hack@DAC 2021 SoC lenges with token limits and repair
generation using LLMs
Wan et al. 2024 Transformer model Chrysalis dataset, comprising over Design Verification Creating the Chrysalis dataset Refining LLM techniques, integrat-
[79] 1,000 function-level HLS designs with for HLS debugging, and enabling ing LLMs into development envi-
injected logical bugs LLM-based bug detection and ronments, and addressing scalabil-
integration into development ity and generalization challenges
environments
Jang et al. 2024 Transformer model Includes 150K online security articles, Threat Detection Pre-trained language model for The paper’s limitations include a
[80] 7.3K security paper abstracts, 3.4K the cybersecurity domain, CyBER- narrow focus on specific non-
Wikipedia articles, and 185K CVE Tuned incorporates non-linguistic linguistic element (NLE) types, ac-
descriptions. elements (NLEs) such as URLs and knowledging the existence of more
hash values commonly found in cy- complex NLE types like code
bersecurity texts. blocks and file paths that require
future exploration

based techniques are more effective than previous methods in extracting cybersecurity-related named entities.

13
TABLE IV: Transformer-based models for Cyber Security (Part II).
Study Year Type of Model Dataset Used Domain Key Contributions Open Issues
Bayer et al. 2024 Transformer model A dataset consisting of 4.3 million Intrusion attacks
Created a high-quality dataset and the model may not be suitable as
[81] entries of Twitter, Blogs, Paper, and and malware a domain-adapted language model a replacement for every type of cy-
CVEs related to the cybersecurity do- for the cybersecurity domain, which bersecurity model. They also state
main improves the internal representation that the hyperparameters may not
space of domain words and per- be generalizable to other language
forms best in cybersecurity scenar- models, especially very large lan-
ios guage models
Shestov et 2024 Transformer model The dataset comprises 22,945 Vulnerability detec- Finetuning the state-of-the-art code The proposed study shows that the
al. [82] function-level source code samples. It tion LLM, WizardCoder, increasing its main bottlenecks of the task that
includes 13,247 samples for training, training speed without performance limit performance lie in the field
5,131 for validation, and 4,567 for harm. of dataset quality and suggests the
testing usage of the project-level context
information
He et al. 2024 Transformer model Used three datasets: one with over blockchain technol- The introduction of a novel model, Include the model’s limitation in
[83] 100,000 entries from Ethereum main- ogy and smart con- BERT-ATT-BiLSTM, for advanced recognizing unseen contract struc-
net contracts, another with 892,913 tracts vulnerability detection in smart tures or novel types of vulnerabil-
addresses labelled across five vulner- contracts, and the evaluation of its ities, and the need to incorporate
ability categories, and a third with performance against other models support for multiple programming
6,498 smart contracts, including 314 languages to enhance universality
associated with Ponzi schemes and robustness
Patsakis et 2024 LLM fine-tuned for Malicious scripts from the Emotet Malware Classifica- Demonstrated 69.56% accuracy in Optimizing LLM fine-tuning for
al. [84] deobfuscation tasks malware campaign tion extracting URLs and 88.78% for improved accuracy and integrating
domains of droppers; explored deobfuscation capabilities into op-
LLM potential in malware deobfus- erational security pipelines
cation and reverse engineering
Guo et al. 2024 Fine-tuned open- Compiled dataset and five benchmark Software Security Demonstrated fine-tuning’s effec- Addressing dataset mislabeling and
[85] source and general- datasets for vulnerability detection tiveness in improving detection ac- improving generalizability of mod-
purpose LLMs for curacy; highlighted limitations of els to unseen code scenarios
binary classification existing benchmark datasets
Jamal et al. 2024 Transformer model Two open-source datasets, 747 spam, Phishing and spam Proposing IPSDM, a fine-tuned ver- Class imbalance, addressed with
[25] 189 phishing, 4825 ham; class imbal- detection sion of DistilBERT and RoBERTA, ADASYN, but potential bias re-
ance addressed with ADASYN outperforming baseline models and mains
the demonstration of the effective-
ness of LLMs in addressing cyber-
security challenges
Lykousas 2024 LLMs for detecting Public code repositories with embed- Authentication and Highlighted differences in password Improving LLM accuracy in detect-
and Patsakis hard-coded creden- ded secrets and passwords Code Security patterns between developers and ing secrets and addressing context-
[86] tials in source code users; evaluated LLMs for detect- sensitive password vulnerabilities
ing hard-coded credentials and dis-
cussed their limitations
Karlsen et 2024 Fine-tuned LLMs Six datasets from web application and Cybersecurity Log Proposed a new pipeline leveraging Scaling models for more diverse
al. [87] for sequence system logs Analysis 60 fine-tuned models for log analy- log formats and optimizing for real-
classification (e.g., sis; DistilRoBERTa achieved an F1- time analysis in dynamic environ-
DistilRoBERTa, score of 0.998, outperforming state- ments
GPT-2, GPT-Neo) of-the-art techniques
Mechri et al. 2025 Decoder-only Python dataset (1.875M function-level Software Security High accuracy in detecting vulnera- Further improvement in identifying
[88] Transformer with code snippets from GitHub, Codepar- bilities across 14 CWEs, F1 scores complex vulnerabilities and han-
64K context length rot, and GPT4-o-generated data) ranging from 84% to 99% dling diverse programming patterns
Ding et al. 2025 LLM-enhanced SolidiFI benchmark dataset Blockchain Security Recall of 95.06% and F1-score of Expanding the framework’s
[89] framework with 94.95% for detecting smart contract applicability to more complex
in-context learning vulnerabilities; self-check architec- blockchain environments and new
and CoT reasoning ture for CoT generation vulnerability types
Arshad et al. 2025 LLM-based Simulation data for vehicular commu- Autonomous Trans- 18% reduction in latency, 12% im- Addressing node selfishness, scal-
[90] decentralized nication scenarios portation Systems provement in throughput, and en- ability in larger networks, and
vehicular network hanced secure V2X communication privacy-preserving real-time data
architecture using blockchain and LLMs exchange
Xiao et al. 2025 LLM with Solidity v0.8 vulnerabilities dataset Blockchain Security Reduced false-positive rates by over Improving recall for newer Solidity
[91] advanced prompting 60%; evaluated latest five LLMs versions and adapting to evolving
techniques and identified root causes for re- library and framework changes
duced recall in Solidity v0.8
Hassanin et 2025 Pre-trained UNSW NB 15 and TON IoT datasets Intrusion Detection Achieves 100% accuracy Exploring scalability for larger and
al. [92] Transformer with on UNSW NB 15 dataset, more diverse datasets; integrating
specialized input significantly outperforming real-time detection capabilities
transformation BiLSTM, GRU, and CNN models
module
Liu et al. 2025 LLM-powered static Real-world firmware datasets Hardware Security Fully automated taint analysis with Exploring adaptability for diverse
[93] binary taint analysis Bugs 37 newly discovered bugs and 10 binary formats and enhancing real-
assigned CVEs; low engineering time analysis capabilities
cost
Gaber et al. 2025 Transformer-based Assembly instructions captured by the Malware Classifica- Introduced a novel AI-based frame- Enhancing scalability for larger
et al. [94] framework for zero- Peekaboo tool tion work leveraging Assembly data datasets and addressing advanced
day ransomware for high-accuracy zero-day ran- evasion techniques in novel ran-
detection somware detection; demonstrated somware samples
the relevance of Transformer mod-
els to ransomware classification by
aligning with Zipf’s law

Karlsen et al. [87] proposed the LLM4Sec framework models, including architectures like BERT, RoBERTa, GPT-2,
that demonstrates the potential of large language models in and GPT-Neo. The study highlights the importance of fine-
cybersecurity log analysis by benchmarking 60 fine-tuned tuning for domain adaptation, with DistilRoBERTa achieving

14
Fig. 4: LLM-based Solutions for Cyber Security Use Cases.

an exceptional F1-score of 0.998 across diverse datasets. This malware deobfuscation, focusing on real-world scripts from
work introduces a novel experimentation pipeline that can the Emotet malware campaign. The evaluation highlights the
serve as a foundation for further advancements in automated potential of LLMs in identifying key indicators of compro-
log analysis. Future research could focus on scaling these mise, achieving 69.56% accuracy for URLs and 88.78% for
models to handle various log formats and optimizing them associated domains. These findings emphasize the importance
for real-time, dynamic cybersecurity environments. of fine-tuning LLMs for specialized cybersecurity tasks, such
2) Malware classification: Ziems et al. [67] explore as reverse engineering and malware analysis. While promising,
transformer-based models for malware classification using API the work identifies areas for improvement, including optimiz-
call sequences as features. The study compares the perfor- ing fine-tuning strategies to enhance accuracy and integrating
mance of the traditional machine and deep learning models these capabilities into threat intelligence frameworks for real-
with transformer-based models. It shows that transformer- world application.
based models outperform traditional models in terms of F1- Gaber et al. et al. [94] proposed a Pulse framework that pio-
score and AUC score. The authors also propose a bagging- neers the use of Transformer models for zero-day ransomware
based random transformer forest (RTF) model that reaches detection by analyzing Assembly instructions captured through
state-of-the-art evaluation scores on three out of four datasets. the Peekaboo dynamic binary instrumentation tool. By lever-
Demirkıran et al. [69] proposes using transformer-based mod- aging Zipf’s law, the study effectively connects linguistic prin-
els for classifying malware families, better suited for capturing ciples with ransomware behavior, making Transformer models
sequence relationships among API calls than traditional ma- ideal for classification tasks. This innovative approach forces
chine and deep learning models. The experiments show that the model to focus on malicious patterns by excluding familiar
the proposed transformer-based models outperform traditional functionality, ensuring robust detection of novel ransomware.
models such as LSTM and pre-trained models such as BERT Future research could expand scalability to accommodate
or CANINE in classifying highly imbalanced malware families larger datasets and address increasingly sophisticated evasion
based on evaluation metrics like F1-score and AUC score. techniques in emerging ransomware threats.
Additionally, the proposed bagging-based random transformer 3) Intrusion Detection: Wu et al. [68] proposed an RTID
forest (RTF) model, an ensemble of BERT or CANINE, that reconstructs feature representations in imbalanced datasets
achieves state-of-the-art performance on three out of four to make a trade-off between dimensionality reduction and
datasets, including a state-of-the-art F1-score of 0.6149 on one feature retention. The proposed method utilizes a stacked
of the commonly used benchmark datasets. encoder-decoder neural network and a self-attention mech-
Patsakis et al. [84] investigates the application of LLMs in anism for network traffic type classification. The results

15
with CICID2017 and CIC-DDoS2019 datasets demonstrate files. The authors report an accuracy of 98.9% for multi-label
the proposed method’s effectiveness in intrusion detection classification and provide a demonstration video, source code,
compared to classical machine learning and deep learning and datasets for the tool.
algorithms. Ghourbi et al. [70] propose an intrusion and Liu et al. [77] introduce CommitBART, a pre-trained Trans-
malware detection system to secure the entire network of the former model specifically designed to understand and generate
healthcare system independently of the installed devices and natural language messages for GitHub commits. The model is
computers. The proposed solution includes two components: trained on a large dataset of over 7.99 million commits, cov-
an intrusion detection system for medical devices installed ering seven different programming languages, using a variety
in the healthcare network and a malware detection system of pre-training objectives, including denoising, cross-modal
for data servers and medical staff computers. The proposed generation, and contrastive learning, across six pre-training
system is based on optimized LightGBM and Transformer- tasks. The authors propose a ”commit intelligence” framework
based models. It is trained with four different datasets to encompassing one understanding task and three generation
ensure a varied knowledge of the different attacks affecting tasks for commits. The experimental results demonstrate that
the healthcare sector. The experimental evaluation of the CommitBART significantly outperforms previous pre-trained
approach showed remarkable accuracies of 99%. PLLM-CS models for code, and the analysis suggests that each pre-
[92] introduces a transformative approach to satellite network training task contributes to the model’s performance.
security, achieving perfect accuracy on a benchmark dataset Ding et al. [95] discuss the effectiveness of code language
and demonstrating superior performance over traditional deep models (code LMs) in detecting vulnerabilities. It identifies
learning models. significant issues in current datasets, such as poor quality,
4) Software Vulnerability Detection: Thapa et al. [71] low accuracy, and high duplication rates, which compromise
explores the use of large transformer-based language models model performance in realistic scenarios. To overcome these
in detecting software vulnerabilities in C/C++ source code, challenges, it introduces the PrimeVul dataset, which uses
leveraging the transferability of knowledge gained from nat- advanced data labeling, de-duplication, and realistic evaluation
ural language processing. The paper presents a systematic metrics to represent real-world conditions accurately. The find-
framework for source code translation, model preparation, ings reveal that current benchmarks, like the BigVul, greatly
and inference. It conducts an empirical analysis of software overestimate code LMs’ capabilities, with much lower per-
vulnerability datasets to demonstrate the good performance of formance observed on PrimeVul. This significant discrepancy
transformer-based language models in vulnerability detection. highlights the need for further innovative research to meet the
The paper also highlights the advantages of transformer- practical demands of deploying code LMs in security-sensitive
based language models over contemporary models, such as environments.
bidirectional long short-term memory and bidirectional gated SecureQwen [88] is a vulnerability detection system de-
recurrent units, in terms of F1-score. However, the paper does signed for Python codebases. It uses a decoder-only trans-
not discuss the limitations or potential drawbacks of using former model with an extended context length of 64K tokens
transformer-based language models for software vulnerability to analyze large-scale datasets. The model identifies vulnera-
detection, and further research is needed in this area. Fu et bilities across 14 types of CWEs with high accuracy, achieving
al. [73] propose an approach called LineVul, which uses a F1 scores ranging from 84% to 99%. By leveraging a dataset
Transformer-based model to predict software vulnerabilities of 1.875 million function-level code snippets from various
at the line level. The approach is evaluated on a large-scale sources, including GitHub and synthetic data, SecureQwen
dataset (i.e., on a large-scale real-world dataset with more demonstrates its capability to detect security issues in both
than 188k C/C++ functions). It achieves higher F1-measure human-written and AI-generated code.
for function-level predictions and higher Top-10 accuracy Guo et al. [85] explores the role of LLMs in detecting
for line-level predictions compared to baseline approaches. vulnerabilities in source code, comparing the performance of
The analysis also shows that LineVul accurately predicts fine-tuned open-source models and general-purpose LLMs.
vulnerable functions affected by the top 25 most dangerous Leveraging a binary classification task and multiple datasets
CWEs. However, the model’s performance can be changed demonstrates the importance of fine-tuning smaller models for
when applied to different programming languages or software specific tasks, sometimes outperforming larger counterparts.
systems. The analysis also exposes critical issues with current bench-
Mamede et al. [74] presented a transformer-based VS Code mark datasets, such as mislabeling, which significantly affects
extension that uses state-of-the-art deep learning techniques model training and evaluation. Future research directions in-
for automatic vulnerability detection in Java code. The authors clude improving dataset quality and developing strategies to
emphasize the importance of early vulnerability detection enhance model generalization for more diverse and complex
within the software development life cycle to promote applica- software vulnerabilities.
tion security. Despite the availability of advanced deep learn- Lykousas and Patsakis [86] examine developer password
ing techniques for vulnerability detection, the authors note that patterns and the role of LLMs in detecting hard-coded creden-
these techniques are not yet widely used in development envi- tials in source code. The study reveals that while developers
ronments. The paper describes the architecture and evaluation tend to select more complex passwords compared to regular
of the VDet tool, which uses the Transformer architecture for users, context often influences weaker patterns. It underscores
multi-label classification of up to 21 vulnerability types in Java the risks posed by public repositories containing secrets and

16
the need for enhanced security practices. Additionally, the often surpassing human detection. Furthermore, they discuss
paper evaluates LLMs for detecting hard-coded credentials the economic impact of AI in lowering the costs of orches-
and identifying their potential and limitations. Future work trating phishing attacks.
should focus on refining LLM capabilities to detect sensitive Chataut et al. [98] focused on the effectiveness of LLMs
information and raising developers’ awareness about secure in detecting phishing emails amidst threat actors’ constant
password management. evolution of phishing strategies. Their study emphasizes the
5) Cyber Threat Intelligence: Ranade et al. [72] presented necessity for continual development and adaptation of detec-
a method for automatically generating fake Cyber Threat In- tion models to keep pace with innovative phishing techniques.
telligence (CTI) using transformers, which can mislead cyber- The role of LLMs in this context highlights their potential to
defense systems. The generated fake CTI is used to perform a significantly enhance email security by improving detection
data poisoning attack on a Cybersecurity Knowledge Graph capabilities.
(CKG) and a cybersecurity corpus. The attack introduces 7) Hardware Security Evaluation: Ahmad et al. [78]
adverse impacts such as returning incorrect reasoning outputs, delves into leveraging LLMs to automatically repair identified
representation poisoning, and corruption of other dependent security-relevant bugs present in hardware designs, explicitly
AI-based cyber defense systems. A human evaluation study focusing on Verilog code. Hardware security bugs pose signif-
was conducted with cybersecurity professionals and threat icant challenges in ensuring the reliability and safety of hard-
hunters, which reveals that professional threat hunters were ware designs. They curated a corpus of hardware security bugs
equally likely to consider the generated fake CTI and authentic through a meticulously designed framework. They explored
CTI as true. the performance of various LLMs, including OpenAI’s Codex
Hashemi et al. [76] propose an alternative approach for and CodeGen, in generating replacement code to fix these
automated vulnerability information extraction using Trans- bugs. The experiments reveal promising results, demonstrating
former models, including BERT, XLNet, RoBERTa, and Dis- that LLMs can effectively repair hardware security bugs, with
tilBERT, to extract security-related words and terms and success rates varying across different bugs and LLM mod-
phrases from descriptions of vulnerabilities. The authors fine- els. By optimizing parameters such as instruction variation,
tune several language representation models similar to BERT temperature, and model selection, they achieved successful
on a labeled dataset from vulnerability databases for Named repairs for a significant portion of the bugs in their dataset. In
Entity Recognition (NER) to extract complex features without addition, the results demonstrate that LLMs, including GPT-
requiring domain-expert knowledge. This approach outper- 4, code-davinci-002, and code-cushman-001, yield successful
forms the CRF-based models and can detect new information repairs for simple security bugs, with GPT-4 achieving a
from vulnerabilities with different description text patterns. success rate of 67% at variation e, temp 0.5. However, LLMs’
The authors conclude that this approach provides a structured performance varies across bugs, showing success rates over
and unambiguous format for disclosing and disseminating vul- 75% with some bugs, while others are more challenging to
nerability information, which is crucial for preventing security repair, with success rates below 10%. The study emphasizes
attacks. the importance of detailed prompt instructions, with variation
6) Phishing and spam detection: Koide et al. introduced d showing the highest success rate among OpenAI LLMs.
[96], a novel system leveraging LLMs to detect phishing Further investigation is needed to evaluate LLMs’ scalability
emails. Despite advances in traditional spam filters, significant and effectiveness for diverse hardware security bug scenarios.
challenges such as oversight and false positives persist. The Their findings underscore the potential of LLMs in automating
system transforms email data into prompts for LLM analy- the bug repair process in hardware designs, marking a crucial
sis, achieving a high accuracy rate (99.70%) and providing step towards developing automated end-to-end bug repair tools
detailed reasoning for its determinations. This helps users for hardware security.
make informed decisions about suspicious emails, potentially Mohamadreza et al. [99] explored the potential of using
enhancing the effectiveness of phishing detection. large language models to enhance the input generation in the
Jamal et al. [25] explored the potential of LLMs to address process of hardware design verification for security-related
the growing sophistication of phishing and spam attacks. Their bugs. Mohamadreza et al. introduced Chatfuzz, a novel ML-
work, IPSDM, is an improved model based on the BERT based hardware fuzzer that leverages LLMs and reinforce-
family, specifically fine-tuned to detect phishing and spam ment learning to generate complex and random machine
emails. Compared to baseline models, IPSDM shows superior code sequences for exploring processor security vulnerabil-
accuracy, precision, recall, and F1-score performance on both ities. Chatfuzz introduces a specialized LLM model into a
balanced and unbalanced datasets while addressing overfitting hardware fuzzing approach to enhance the input generation
concerns. quality, outperforming the existing approaches regarding cov-
Heiding et al. [97] compared automatically generated phish- erage, scalability, and efficiency. Utilizing LLMs to understand
ing emails by GPT-4, manually designed emails using the V- processor language and generate data/control flow entangled
Triad method, and their combination. Their findings suggest machine code sequences, Chatfuzz integrates RL to guide
that emails designed with the V-Triad achieved the highest input generation based on code coverage metrics. Their ex-
click-through rates, indicating the effectiveness of exploiting periment on real-world cores, namely RocketCore and BOOM
cognitive biases. The study also evaluated the capability of cores, showed significantly faster coverage than state-of-the-art
four different LLMs to detect phishing intentions, with results hardware fuzzes. ChatFuzz achieves 75% condition coverage

17
in RocketCore in 52 minutes and 97.02% in BOOM in 49 Design Verification (HLS). The authors created a dataset
minutes, identifying unique mismatches and new bugs and named Chrysalis to solve the problem of the non-existence of
showcasing its effectiveness in hardware security testing. specialized HLS bug detection and evaluation capabilities. The
Weimin et al. [100] introduces LLM4SECHW, a novel Chrysalis dataset comprises over 1000 function-level designs
framework for hardware debugging that utilizes domain- extracted from reputable sources with intentionally injected
specific Large Language Models. The authors addressed the known bugs to evaluate and refine LLM-based HLS bug
limitations of out-of-the-shelf LLMs in the hardware security localization. The set of the introduced bugs was selected based
domain by gathering a dataset of hardware design defects on the most common human coding errors and has been shaped
and remediation steps. The collected dataset has been built to elude most of the existing conventional HLS synthesis
by leveraging open-sourced hardware designs from GitHub; tools detection mechanisms. The paper’s authors suggest that
the data consists of different Hardware Description Language Chrysalis would contribute to the LLM-aided HLS design
modules with their respective commits. By harnessing ver- verification by offering a benchmarking to the existing and
sion control information from open-source hardware projects specialized models. The paper also suggests a prompt engi-
and processing it to create a debugging-oriented dataset, neering approach that would enhance the efficiency of a large
LLM4SECHW fine-tunes hardware domain-specific language language model on the studied task. The proposed prompt
models to locate and rectify bugs autonomously, enhancing structure introduces a separation of concern approach, where
bug localization. LLM4SECHW has been evaluated with two the used prompt deals with each class of bugs separately. The
objectives: bug identification and design patching. The authors prompt starts by explicitly defining the context of the task,
demonstrated that non-fine-tuned LLMs lack hardware domain the functional description, the implementation context, and the
knowledge, which makes them incapable of locating bugs in task objective. The prompt is implemented through three main
the hardware design of a popular security-specialized chip sections: context, requirements, and complementary rules. The
project named OpenTitan. The based models (falcon 7b, llama highlighted works lay a foundation for a methodological,
2, Bard, chatbot, and stableLM) did not efficiently locate the practical approach to benchmarking, evaluating, and deploying
introduced hardware bugs. The three fine-tuned models (falcon LLM tasks for HLS design verification. While the paper does
7b, llama2, stableLM) successfully located the introduced bugs not provide any conclusive results about LLMs’ performance
in the hardware design. in such tasks, the authors believe that such methodology would
Zhang et al. [100] introduces Hardware Phi-1.5B, a large accelerate the adoption of new techniques to integrate LLMs
language model tailored for the hardware domain of the semi- into the design verification flow.
conductor industry, addressing the complexity of hardware- Mingjie et al. [103] evaluated the LLMs’ performance in
specific issues. The research focused on developing datasets solving Verilog related design tasks and generating design
specifically for the hardware domain to enhance the model’s testbenchs by introducing VerilogEval. VerilogEval comprises
performance in comprehending complex terminologies. The different hardware design tasks ranging from module imple-
authors claim to surpass general code language models and mentation of simple combinatorial circuits to complex finite
natural language models like CodeLlama, BERT, and GPT-2 state machines, code debugging, and testbench construction.
in the Hardware understanding tasks. VerilogEval suggests an end-to-end evaluation framework that
Madhav et al. [101] evaluated the security of the HDL fits better in the context of the hardware design verification
code generated by ChatGPT. The authors introduced a similar process benchmarking. The VerilogEval framework validates
taxonomy to the NIST CWE 1 . The authors conducted various the correctness of the prompted tasks by comparing the
experiments to explore the impact of prompt engineering on behavior simulation to an established golden model of the
the security of the generated hardware design. prompted design. The authors used pass@k metric instead
Liu et al. [93] introduces a groundbreaking approach, named of the generic NLP related metrics like the BLEU score
LATTE, to binary program security by utilizing LLMs for probability metric. The study demonstrates that pre-trained
static binary taint analysis. Unlike traditional tools like Emtaint language models’ Verilog code generation capabilities can be
and Karonte, which rely on manually crafted taint propagation improved through supervised fine-tuning. The experimental re-
and vulnerability inspection rules, LATTE is fully automated, sults show that fine-tuning LLMs on the hardware design tasks
reducing dependency on human expertise. Its effectiveness and using the pass@k metric helps assess the performance of
is demonstrated through the discovery of 37 previously un- the resulting models properly. The pass@k metric helps assess
known bugs in real-world firmware, with 10 earning CVE the performance of Large Language Models (LLMs) in Verilog
assignments. Additionally, LATTE offers a scalable and cost- code generation by quantifying the number of successful code
efficient solution, making it highly accessible to researchers completions out of k samples, offering a clear evaluation
and practitioners. This work highlights the potential of LLMs criterion. The used metric shows that a fine-tuned model
to revolutionize binary program analysis, though future re- could have equal or better performance than the state-of-the-
search could focus on enhancing adaptability to diverse binary art OpenAI models (gpt-3 and gpt-4). VerilogEval highlights
formats and integrating real-time capabilities. the growing significance of Large Language Models (LLMs)
8) Hardware design & Verification: Lily et al. [102] in- and their application in various domains, emphasizing their
troduced the application of LLMs into High-Level Synthesis potential in Verilog code generation for hardware design and
verification. The findings underscore the importance of the
1 https://nvd.nist.gov/vuln/categories proposed benchmarking framework in advancing the state of

18
the art in Verilog code generation, highlighting the vast poten- The authors claim that LLMIF achieved a notable increase
tial of LLMs in assisting the hardware design and verification in protocol message coverage and code coverage by 55.2%
process. and 53.9%, respectively, outperforming other Zigbee fuzzers
9) Protocols verification: Ruijie et al. [105] introduced in these aspects.
ChatAFL, an LLM-based protocol fuzzer. ChatAFL introduces LLMIF algorithm successfully uncovered 11 vulnerabilities
an LLM-guided protocol fuzzing to address the challenge of on real-world Zigbee devices, including eight previously un-
finding security flaws in protocol implementations without known vulnerabilities, showcasing its effectiveness in iden-
a machine-readable specification. The study suggests three tifying security flaws in IoT devices. By incorporating the
strategies for integrating an LLM into a mutation-based proto- large language model into IoT fuzzing, LLMIF demonstrated
col fuzzer, focusing on grammar extraction, seed enrichment, enhanced capabilities in protocol message coverage and vul-
and saturation handling to enhance code coverage and state nerability discovery, highlighting its potential for improving
transitions. ChatAFL prototype implementation demonstrates the security testing of IoT devices.
that the LLM-guided stateful fuzzer outperforms state-of-the- 10) Blockchain Security: SmartGuard [89] is a framework
art fuzzers like AFLNET [106] and NSFUZZ [107] in terms that combines large language models with advanced reasoning
of protocol state space coverage and code coverage. techniques to detect vulnerabilities in smart contracts. It uses
The experiments evaluated CHATAFL’s improvement over semantic similarity to retrieve relevant code snippets and
the baselines in terms of transition coverage achieved in employs Chains of Thought (CoT) reasoning for in-context
24 hours, speed-up in achieving the same coverage, and learning. The framework includes a self-check mechanism for
the probability of outperforming the baselines in a random generating reliable reasoning chains from labeled data. Tests
campaign. CHATAFL demonstrated significant efficacy by on the SolidiFI benchmark dataset show exceptional results,
covering 47.60% and 42.69% more state transitions, 29.55% with a recall of 95.06% and an F1-score of 94.95%, outper-
and 25.75% more states, and 5.81% and 6.74% more code forming existing tools in smart contract security. BlockLLM
than AFLNET and NSFUZZ, respectively. [90] introduces a decentralized network architecture for au-
CHATAFL discovered nine unique and previously unknown tonomous vehicles, integrating blockchain with large language
vulnerabilities in widely used and extensively tested proto- models to improve security and communication. It enhances
col implementations on real widely used projects (live555, vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I)
proFTPD, kamailio). The discovered vulnerabilities encom- communication by providing adaptive decision-making and
pass various memory vulnerabilities, including use-after-free, ensuring data integrity. With features like incentive mecha-
buffer overflow, and memory leaks, which have potential se- nisms for node reliability, BlockLLM achieves significant im-
curity implications such as remote code execution or memory provements, including an 18% reduction in latency and a 12%
leakage. The study demonstrated the effectiveness of utilizing increase in throughput, offering a scalable solution for secure
LLMs for guiding protocol fuzzing to enhance state and code vehicular networks. Xiao et al. [91] advances the field of
coverage in protocol implementations. smart contract vulnerability detection by focusing on Solidity
Wang et al. [108] introduced LLMIF and LLM-aided v0.8, the latest version, unlike earlier works based on outdated
fuzzing approach for IoT devices protocols. LLMIF intro- versions. By leveraging advanced prompting techniques with
duces an LLM augmentation-based approach. The developed five cutting-edge LLMs, the study significantly reduces false-
pipeline incorporates an enhanced seed generation strategy positive rates (over 60%), showcasing the potential of refined
by building an augmentation based on domain knowledge. LLM utilization. However, the findings also reveal a significant
The domain knowledge structure is extracted from the various drop in recall for specific vulnerabilities due to challenges
specifications of the under-fuzzing protocol. The flow starts adapting to newly introduced libraries and frameworks. Ad-
by selecting a seed from the extracted augmentation set and dressing these limitations could further enhance the precision
then enriching the extracted seed by exploring the protocol and robustness of LLM-based smart contract analysis.
specification. The enriching process is driven by the various
ranges of input values extracted during the augmentation
phase. Furthermore, LLMIF introduces a coverage approach
by mutating the selected seed through the various enrichment V. G ENERAL LLM S
and mutation operators that have been selected.
The evaluation part of LLMIF mainly aimed to evaluate Tables V, VI, VII compare general transformer-based Large
three axes: code coverage, ablation, and bug identification. Language Models. LLM models are generally trained on a
The authors used an out-of-the-shelf popular SOC (CC2530) diverse and broad range of data to provide a relatively com-
for the evaluation. 11 commercial devices have been selected prehensive understanding. They can handle various language
to conduct the various experiments. While the ablation and tasks like translation, summarization, and question-answering.
bug detection could be easily evaluated, the code coverage is In contrast, code-specific LLMs are specialized models trained
impossible using the custom firmware that ships with the se- primarily on programming languages and related technical
lected devices. The authors used an open-source Zigbee stack literature, which makes their primary role in understanding
to demonstrate the coverage capabilities. The authors claimed and generating programming code well-suited for tasks like
that LLMIF outperforms Z-FUZZER [109], and BOOFUZZ automated code generation, code completion, and bug detec-
[110] in terms of code coverage for the target Zigbee stack. tion.

19
Q1: In cryptography, what is the purpose of a message Total Correct
authentication code (MAC) or digital signature (SIG)? Total Questions
Q2: "What is the role of Uncoordinated Frequency Hopping Accuracy: %
(UFH) in anti-jamming broadcast communication?
...

80
Questions
role_prompt = "You are a security expert
who answers questions."
2k 500
Questions Questions user_prompt = f"Question:
{question}\nOptions: {', '.join([f'{key}) {value}'
for key, value in
answers.items()])}\n\nChoose the correct
answer (A, B, C, or D) only. Always return in
this format: 'ANSWER: X'"
Inference Model

10 k
Questions Gpt-4-turbo, Gpt-3.5-turbo, Mixtral-8x7B-Instruct, LLM prompt
GEMINI-pro (Bard), Falcon-180B-Chat, Flan-T5-XXL, engineering
CyberMetric Dataset Zephyr-7B-beta, Llama 2-70B, Falcon-40B-Instruct,
Flan-T5-Base, ....

Fig. 5: LLMs Performance Steps in the cybersecurity domain using CyberMetric Dataset [104].

A. Prevalent LLMs encoder-decoder-based model that operates within the unified


1) GPT-3: GPT-3 (the third version of the Generative Pre- text-to-text framework. Multiple variants of T5 with different
trained Transformer series by OpenAI) was developed to sizes - ranging between 220M to 11B parameters- were devel-
prove that scaling language models substantially improves oped to broaden the experimental scope and were trained on
their task-agnostic few-shot performance [111]. Based on massive amounts of data from various sources, including C4,
transformer architecture, GPT-3 has eight variants ranging Web Text, and Wikipedia. Building on the foundation of these
between 125M and 175B parameters, all trained for 300B diverse model sizes and rich data sources, multiple approaches
tokens from datasets like Common Crawl, WebText, Books, and different settings for pre-training and fine-tuning were
and Wikipedia. Additionally, the models were trained on V100 examined and discussed, achieving performance that nearly
GPU leveraging techniques like autoregressive training, scaled matched human levels on one of the benchmarks. Considering
cross-entropy loss, and others. GPT-3, especially its most that, the model’s potential in cybersecurity applications is
capable 175B version, has demonstrated strong performance particularly promising. For instance, T5 can be utilized for
on many NLP tasks in different settings (i.e., zero-shot, one- threat intelligence by extracting critical information from vast
shot, and few-shots), suggesting it could significantly improve security documents and then summarizing and organizing that
cybersecurity applications if appropriately fine-tuned. This information.
could translate to more effective Phishing Detection through 4) BERT: Bidirectional Encoder Representations from
precise language analysis, faster Incident Response, and other Transformers, commonly known as BERT, was presented by
critical applications to enhance digital security measures. [114] to enhance fine-tuning-based approaches in NLP. It is
2) GPT-4: In 2023, the GPT-4 transformer-based model available in two versions: BERT-Base, with 110M parameters,
was released by OpenAI as the first large-scale multi- and BERT-Large, with 340M parameters, trained on 126GB of
modal model, exhibiting unprecedented performance in var- data from BooksCorpus and English Wikipedia. During its pre-
ious benchmarks. The model’s capability of processing image training phase, BERT employed two key techniques: Masked
and text inputs has shifted the AI paradigm to a new level, Language Modeling (MLM) and Next Sentence Prediction
expanding beyond traditional NLP. [112] declared that GPT- (NSP). Building on these approaches, fine-tuning, and feature-
4 was trained using a vast corpus of web-based data and based methods have led to competitive performance from
data licensed from third-party sources with autoregressive BERT-Large in particular. Since encoder-only models like
techniques and Reinforcement Learning from Human Feed- BERT are known for their robust contextual understanding,
back (RLHF). However, other specifics, such as the model applying such models to tasks like malware detection and
size, data size, and comprehensive training details, remain software vulnerability can be highly effective in cybersecurity.
undisclosed. Although GPT-4 could potentially be leveraged 5) ALBERT: Aiming to address the limitations related to
by cybercriminals for a wide range of attacks, such as social GPU/TPU memory and training time in Large Language
engineering, if implemented strategically, it can also help Models (LLMs), Google researchers developed A Lite BERT
reduce the likelihood of individuals and organizations falling (ALBERT), a modified version of BERT with significantly
prey to them. fewer parameters [115]. And like other LLMs, ALBERT was
3) T5: Motivated by the trend of applying transfer learning introduced in various sizes, with options ranging from 12M
for NLP, researchers of Google have introduced T5 [113], an to 235M parameters, all trained on data from BooksCorpus

20
TABLE V: Comparison of Large Language Models
Model Architecture Base Model Para- Training Pre-training Corpus Released Applications Use Cases in Training Key Training Quanti- Ref
meters Tokens Volume By Cybersecurity Scheme Techniques zation
GPT-3 Decoder-only NA 175B 300B Books, +570GB Open AI Language Malware Pre-training, Autoregressive NA [111]
Web text, Modeling, Text Detection, In-context training,
Wikipedia, Completion, Threat learning Scaled Cross
Common QA Intelligence, Entropy Loss,
Crawl Social Backpropagation
Engineering and gradient
Detection descent, Mixed
precision training.
GPT-4 Decoder-only NA NA NA Web Data, NA Open AI Language Malware Pre-training, Autoregressive NA [112]
Third-party Modeling, Text Detection, RLHF training
licensed data Completion, Threat
QA Intelligence,
Social
Engineering
Detection
T5 Encoder- NA 11B 1000B C4, Web Text, 750GB Google Language Malware Pre-training, Text-to-text frame- NA [113]
decoder Wikipedia Modeling, Detection, Fine-tuning work, Denotation-
Summa- Threat based pretraining
rization, Intelligence,
Translation Social
Engineering
Detection
BERT Encoder-only NA 340M 250B BooksCorpus, 126GB Google Language Malware Detec- Pre-training Masked NA [114]
English Modeling, tion, Threat In- LM(MLM),
Wikipedia Classification, telligence, Intru- Next-sentence
QA, NER sion Detection, prediction(NSP)
Phishing Detec-
tion
ALBERT Encoder-only BERT 235M +250B BooksCorpus, NA Google Language Malware Detec- Pre-training Factorized NA [115]
(calcu- English Modeling, tion, Threat In- embedding
lated) Wikipedia Classification telligence, Intru- parameterization,
sion Detection, Cross-layer
Phishing Detec- parameter sharing,
tion Inter-sentence
coherence loss,
Sentence order
prediction (SOP)
RoBERTa Encoder-only BERT 355M 2000B BooksCorpus, NA Meta Language Malware Detec- Pre-training Dynamic Masking, NA [116]
English Modeling, tion, Threat In- Full-Sentences
Wikipedia Classification, telligence, Intru- without NSP
QA, NER sion Detection, loss, Large mini-
Phishing Detec- batches, Larger
tion byte-level BPE
XLNet Encoder-only Transformer- 340M +2000B English 158GB CMU, Language Malware Detec- Pre-training Permutation NA [117]
XL (calcu- Wikipedia (calcu- Google Modeling, tion, Threat In- LM(PLM), Two-
lated) lated) Classification, telligence, Intru- stream self-
QA sion Detection, attention, Segment
Phishing Detec- Recurrence and
tion Relative Encoding
ProphetNet Encoder- NA 550M +260B Web Data, 160GB Microsoft Language Cybersecurity Pre-training, Masked Sequence NA [118]
decoder (calcu- Books Research Modeling, Reporting, Fine-tuning generation,
lated) Asia Question Threat Autoregressive
Generation, Intelligence training, Denoising
Summarization Autoencoder
objective, Shared
Parameters
between encoder
and decoder,
Maximum
Likelihood
Estimation (MLE)
Falcon Decoder-only NA 7-180B 5000B Web Data NA TII Language Malware Pre-training Autoregressive NA [119]
Modeling, Text Detection, training,
Completion, Threat FlashAttention,
QA Intelligence, ALiBi Positional
Social encoding
Engineering
Detection
Reformer Encoder- NA Up to +150B Web Data NA Google Language Malware Detec- Pre-training Locality-Sensitive NA [120]
decoder 6B (calcu- Modeling, tion, Threat In- Hashing (LSH)
lated) Classification telligence, Intru- Attention,
sion Detection, Chunked
Phishing Detec- Processing,
tion Shared-QK
Attention Heads,
Reversible layers

21
TABLE VI: Continued
Model Architecture Base Model Para- Training Pre-training Corpus Released Applications Use Cases in Training Key Training Quanti- Ref
meters Tokens Volume By Cybersecurity Scheme Techniques zation
PaLM Decoder-only NA 540B 780B Webpages, 2TB Google Language Threat Pre-training SwiGLU NA [121]
books, Modeling, QA, Intelligence, Activation, Parallel
Wikipedia, Translation Security Layers, Multi-
news articles, Policies Query attention
source code, Generation (MQA), RoPE
social media embeddings,
conversations, Shared Input-
GitHub Output embedding
PaLM2 Decoder-only NA NA NA web NA Google Language Threat Pre-training Compute optimal NA [122]
documents, Modeling, QA, Intelligence, scaling, Canary
books, code, Summarization Security token sequences,
mathematics, Policies Control tokens for
conversational Generation inference
data
LLaMA Decoder-only NA 7-65B 1400B CommonCrawl, 177GB Meta Language Threat Pre-training Pre-normalization, NA [123]
C4, GitHub, Modeling, Text Intelligence, SwiGLU activation
Wikipedia, Completion, Malware function, Rotary
Books, arXiv, QA Detection Embedding, Model
StackExchange and sequence
parallelism
LLaMA2 Decoder-only NA 7-70B 2000B Mix of puli- NA Meta Language Threat Pre-training, Optimized NA [124]
cally available Modeling, Text Intelligence, Fine-tuning, autoregressive
data Completion, Malware RLHF training, Grouped
QA Detection Query Attention
(GQA)
GShard MoE NA 600B 1000B Web Data NA Google Language Threat Pre-training Conditional NA [125]
Modeling Intelligence, Computation,
Intrusion Lightweight
Detection, Annotation APIs,
Malware XLA SPMD
Detection partitioning,
Position-wise MoE
ELECTRA Encoder-only NA 335M +1800B BooksCorpus, 158GB Google Language Threat Pre-training, Replaced token NA [126]
(calcu- English Modeling, Intelligence, Fine-tuning detection,
lated) Wikipedia Classification Intrusion Generator-
Detection, discriminator
Malware framework, Token
Detection, replacement,
Phishing Weight-sharing
Detection
MPT-30B Decoder-only NA 30B 1000B C4, mC4, NA MosaicML Language Threat Pre-training FlashAttention, NA [127]
Common- Modeling, Text Intelligence, ALiBi positional
Crawl, Completion, Malware encoding
Wikipedia, QA Detection,
Books, arXiv Software
Vulnerability
Yi-34B NA NA 34B 3000B Chinese and NA 01.AI Language Threat Pre-training, NA GPTQ, [128]
English dataset Modeling, Intelligence, Fine-tuning AWQ
Question Phishing
Answering Detection,
Vulnerability
Assessment
Phi-3-mini Decoder-only NA 3.8B 3.3T Phi-3 datasets NA Microsoft Language Threat Pre-training, LongRope, Query NA [129]
(Public Modeling, Text Intelligence, Fine-tuning Attention (GQA)
documents, Completion, Intrusion
synthetic, chat QA Detection,
formats) Malware
Detection
Mistral 7B Decoder-only NA 7.24B NA NA NA Mistral Language Threat Pre-training, Sliding Window NA [130]
AI Modeling, Text Intelligence, Fine-tuning Attention, Query
Completion, Intrusion Attention (GQA),
QA Detection, Byte-fallback BPE
Malware tokenizer
Detection
Cerebras- Decoder-only NA 2.7B 371B The Pile 825 GB Cerebras Language Threat Pre-training standard trainable NA [131]
GPT 2.7B Dataset Modeling, Text Intelligence, positional embed-
Completion, Intrusion dings and GPT-2
QA Detection, transformer, GPT-
Malware 2/3 vocabulary and
Detection tokenizer block
ZySec- Decoder-only NA 7.24B NA Trained across NA ZySec AI Language Expert guidance Pre-training NA NA [132]
AI/ZySec 30+ domains in Modeling, Text in cybersecurity
7B cybersecurity Completion, issues
QA
DeciLM Decoder-only NA 7.04 NA NA NA Deci Language Threat Pre-trained Grouped-Query NA [133]
7B Modeling, Text Intelligence, Attention (GQA)
Completion, Intrusion
QA Detection,
Malware
Detection

22
TABLE VII: Continued
Model Architecture Base Model Para- Training Pre-training Corpus Released Applications Use Cases in Training Key Training Quanti- Ref
meters Tokens Volume By Cybersecurity Scheme Techniques zation
Zephyr 7B Decoder-only Mistral 7B 7.24B NA NA NA Hugging- Language Threat Fine-tuning Flash Attention, NA [134]
Beta Face Modeling, Text Intelligence, Direct Preference
Completion, Intrusion Optimization
QA Detection, (DPO)
Malware
Detection
Dolly v2 12B Decoder-only Pythia 12B 12B 3T The Pile 825GiB Databricks Language Threat Fine-tuning NA NA [135]
Dataset Modeling, Text Intelligence,
Completion, Intrusion
QA Detection,
Malware
Detection
Falcon2 11B Decoder-only NA 11.1B 5T RefinedWeb NA TII Language Malware Pre-training ZeRO, high- NA [136]
enhanced Modeling, Text Detection, performance
with curated Completion, Threat Triton kernels,
corpora. QA Intelligence, FlashAttention-2
Social
Engineering
Detection

and English Wikipedia. Various methods and techniques were making it -after appropriate fine-tuning- a capable tool for
deployed during the pre-training stage, including Factorized enhancing various aspects of the cybersecurity field.
Embedding Parameterization, Cross-layer Parameter Sharing, 8) ProphetNet: ProphetNet LLM, proposed by Microsoft,
Inter-sentence Coherence Loss, and Sentence Order Prediction is a sequence-to-sequence pre-trained model that aims to
(SOP). As a result, one of the models (i.e., ALBERT-xxlarge) address the issue of overfitting on strong local correlations by
outperformed BERT-Large despite having fewer parameters. leveraging two novel techniques, namely: future n-gram pre-
Thus, utilizing ALBERT in cybersecurity applications, such diction and n-stream self-attention [118]. Built on an encoder-
as phishing detection and malware classification, could signif- decoder architecture and trained on 16GB base-scale and
icantly contribute to advancing cybersecurity infrastructure. 160GB large-scale datasets sourced from web data and books,
6) RoBERTa: RoBERTa, proposed by Meta, is an opti- ProphetNet, with its 550M parameters, achieved new state-of-
mized replication of BERT that demonstrates how the choice the-art results on multiple benchmarks. The model was also
of hyperparameters can significantly impact the model’s per- fine-tuned for two downstream tasks, Question Generation and
formance [116]. RoBERTa has only one version with 355M Text Summarization, where it achieved the best performance.
parameters but is trained and tested in various data sizes Therefore, utilizing ProphetNet in cybersecurity tasks such as
and training steps. Similar to BERT, the training data was automated security incident summarization could significantly
taken from the Books corpus and English Wikipedia. However, enhance efficiency and decision-making.
the key optimizations in this model were in the training 9) Falcon: Falcon LLM, built on decoder-only architec-
techniques, which included multiple methods such as Dynamic ture, was introduced by the Technology Innovation Institute
Masking, training on Full Sentences without NSP loss, using (TII) as a proof-of-concept that enhancing data quality can
Large Mini-Batches, and employing a Larger Byte-Level BPE. significantly improve the LLM performance even with purely
Consequently, RoBERTa achieved state-of-the-art results in web-sourced data [119]. This insight is increasingly relevant as
some of the benchmarks. With proper fine-tuning, RoBERTa’s scaling in LLMs, which is becoming more prevalent, requires
ability to understand, interpret, and generate human-like text is more data for processing. The model has three versions (i.e.,
leveraged to automate and enhance various tasks in the realm 7B, 40B, 180B) pre-trained on the “RefinedWeb” dataset
of cybersecurity. proposed by TII. RefinedWeb, sourced exclusively from web
7) XLNet: The advances and limitations of Masked Lan- data, was subjected to various filtering and deduplication
guage Modeling (MLM) in bidirectional encoders and Au- techniques to ensure high quality. Autoregressive training,
toregressive Language Modeling have inspired researchers at Flash Attention, and ALiBi Positional encoding were the
CMU and Google AI to develop XLNet [117]. Based on methods used for pre-training. With further fine-tuning, Falcon
the Transformer-XL model, XLNet combines aspects of both can advance cybersecurity, particularly in threat intelligence
approaches, enabling the learning of bidirectional contexts and analysis.
while addressing common MLM issues, such as neglecting 10) Reformer: Striving to address common memory limi-
dependencies between masked positions and the discrepancy tations in LLMs, Google proposed the Reformer, an encoder-
between pretraining and finetuning phases. With 340M pa- decoder memory-efficient LLM [120]. With up to 6B param-
rameters, XLNet was pre-trained using data from English eters, Reformer was pre-trained on web data using techniques
Wikipedia and utilizing techniques like Permutation Language including Locality-Sensitive Hashing (LSH) Attention, Chun-
Modeling (PLM), Two-stream attention, Segment Recurrence, ked Processing, Shared-QK Attention Heads, and Reversible
and Relative Encoding. Due to the careful design of the model layers. These techniques were proven to have a negligible
and strategic pre-training techniques, XLNet has achieved impact on the training process compared to the standard
substantial performance over other popular models like BERT, Transformer, as the Reformer achieved results that matched

23
the full Transformer but with much faster processing and with Chinchilla-70B and PaLM-540B. Given its relatively
better memory efficiency. Subsequently, employing Reformer small size and superior performance, fine-tuning LLaMA on
for tasks like large-scale data analysis could serve the cyberse- cyber threat intelligence tasks could significantly enhance the
curity field by enabling more efficient processing and analysis security of edge devices.
of extensive datasets. 14) LLaMA2: LLaMA2 is an optimized version of LLaMA
11) PaLM: Driven by the advancement in machine learning developed by Meta and a collection of pre-trained and fine-
and natural language processing, Google has developed PaLM tuned LLMs with sizes ranging from 7 to 70B parameters
to examine the impact of scale on few-shot learning [121]. [124]. In the pre-training, a mixture of publicly available data
PaLM, built on decoder-only architecture, was trained with was used for up to 2000B training tokens. Moreover, multiple
540B parameters using Pathways, a new system that utilizes techniques were used in the predecessor LLaMA, such as Pre-
highly efficient training across multiple TPU pods. The model normalization, SwiGLU activation function, and Rotary posi-
was trained on 2TB of data from multiple sources, including tional embeddings. Two additional methods, namely increased
news articles, Wikipedia, source code, etc. SwiGLU Activa- context length and group-query attention (GQA), were also
tion, Parallel Layers, and other techniques were deployed for used. After pre-training, variants of the model (i.e., LLaMA2-
pre-training three different parameter scales, 8B, 62B, and Chat) were optimized for dialog use cases by supervised
540B, to understand the scaling behavior better. An observed fine-tuning and reinforcement learning with human feedback
discontinuous improvement indicated that as LLMs reach a (RLHF). The model evaluation, which focused on helpfulness
certain level of scale, they exhibit new abilities. Furthermore, and safety, showed superiority over the other open-source
these emerging capabilities continue to evolve and become models and competitive performance to some closed-source
apparent even beyond the scales that have been previously models.
explored and documented. Subsequently, PaLM achieved a 15) GShard: GShard LLM was introduced by Google in
breakthrough by outperforming the finetuned state-of-the-art 2020, aiming to address neural network scaling issues related
and average human on some benchmarks, proving that when to computation cost and training efficiency [125]. Based on
scaling is combined with chain-of-thought prompting, basic a Mixture-of-Experts (MoE) transformer with 600B parame-
few-shot evaluation has the potential to equal or surpass the ters, GShard was pre-trained on 1000B tokens of web data.
performance of fine-tuned state-of-the-art models across a Multiple techniques were deployed for the training stage,
broad spectrum of reasoning tasks. With such strong capabili- such as conditional computation, XLA SPMD partitioning,
ties, utilizing PaLM for tasks like generating security policies position-wise MoE, and parallel execution using annotation
and incident response automation can enhance the efficiency APIs. Subsequently, GShard outperformed prior models in
and effectiveness of cybersecurity operations. translation tasks and exhibited a favorable trade-off between
12) PaLM2: PaLM2 is an advanced variant of the PaLM scale and computational cost, resulting in a practical and
model that is more compute-efficient, although it offers better sample-efficient model. These results highlight the importance
multilingual and reasoning capabilities [122]. The key en- of considering training efficiency when scaling LLMs, which
hancements in the model are the improved dataset mixtures, makes it more viable in the real world.
the compute-optimal scaling, and architectural and objective 16) ELECTRA: The extensive computation cost of MLM
improvements. The significant evaluation results of PaLM2 pre-training methods has inspired Google to propose ELEC-
indicate that various approaches could elaborate on the model’s TRA LLM, which is a 335M parameters’ encoder-only trans-
enhancement besides scaling, such as meticulous data selection former model that utilizes a novel pre-training approach called
and efficient architecture/objectives. Moreover, the fact that “replaced token detection” [126]. This technique allows the
PaLM2 outperformed the predecessor PaLM despite its signif- model to learn from the entire sequence rather than just a
icantly smaller size shows that the model quality has a greater small portion of masked tokens. Given that the quality and
influence on the performance than the model size as it could diversity of ELECTRA training data play a pivotal role in
enable more efficient inference, reducing serving costs and its ability to generalize across tasks, the model was trained
potentially allowing for broader applications and accessibility on a vast Books Corpus and English Wikipedia. Pre-training
to more users. techniques were utilized, including replaced token detection,
13) LLaMA: Proposed by Meta, the LLaMA decoder-only generator-discriminator framework, token replacement, and
model is a proof-of-concept that it’s possible to achieve state- weight-sharing. As a result, ELECTRA was able to perform
of-the-art performance by training exclusively on publicly comparably to popular models like RoBERTa and XLNet
available data [123]. LLaMA, with multiple variants ranging when using less than 25% of their compute and outperform
between 7 and 65 billion parameters, was trained on 1400B to- them when using equivalent compute. Deploying such a robust
kens of publicly available datasets, including CommonCrawl, model in the security field after fine-tuning can provide an
C4, arXiv, and others. Interestingly, the techniques used for efficient solution for detecting and mitigating sophisticated
training the model were inspired by multiple popular models cyber threats, thanks to its nuanced understanding of context
like GPT-3 (Pre-normalization), PaLM (SwiGLU activation and language patterns.
function), and GPTNeo (Rotary Embedding). As a result of 17) MPT-30B: MPT-30B LLM is a decoder-only trans-
this incorporation, LLaMA-13B was able to outperform GPT- former introduced by MosaicML after the notable success
3(175B) on most benchmarks despite it being more than ten of MPT-7B [127]. The model has multiple variants, the
times smaller, while LLaMA-65B has shown to be competitive base model and two fine-tuned variants, namely MPT-30B-

24
Instruct and MPT-30B-Chat. Training the model on a variety code samples. The prompt that has been used instructs the
of datasets such as C4, CommonCrawl, and arXiv, among model to check the concerned code for any issue or bug and
others, besides the strategic selection of pre-training meth- respond only with yes or no. The result is presented as the
ods like FlashAttention and ALiBi positional encoding, have ratio of the responses where the model successfully identified
contributed to a robust performance, surpassing even the a buggy code from the total samples used. The hardware CWE
original GPT-3 benchmarks. MPT-30B has also significantly column evaluates the capability of the models to link a buggy
performed in programming tasks, outperforming some open- code to its CWE number. The prompt has been designed to
source models designed specifically for code generation. With ask for a well-defined CWE number on the buggy design. This
these capabilities, deploying MPT-30B in cybersecurity could evaluation process asses the capability of an LLM model in
substantially enhance threat detection and response systems. bug detection and classification into the correct CWE class
Its adeptness at understanding and generating programming number.
languages promises advancements in automated vulnerability The top performers in this evaluation in terms of design bug
assessment and developing sophisticated security protocols. detection are LLama3 and Mixtral. While the LLama3 model
18) Yi-34B: The newly released LLM Yi-34B developed by performs better in the bug detection tasks, it lacks the proper
01.AI is getting attention as one of the best open-source LLMs identification of the CWE issue related to the faulty section.
[128]. Given the recent release of the model, its technical paper Mixtral models show less performance at identifying bugs but
has not yet been published; hence, the available information higher diversity in identifying a bug’s security impact on the
is limited. The model has multiple variants: base and chat overall design implementation. The outcomes of this experi-
models, some quantized. All variants are trained on a dataset ment reveal that some models cannot identify the right issues
containing Chinese and English only, and the chat versions with the source code, which might require further refinement
have gone through supervised fine-tuning, resulting in more of the used prompt and/or fine-tuning the general-purpose
efficient models for downstream tasks. The base model out- models on bug-locating tasks. The results also show that the
performed many open LLMs in certain benchmarks, including model size doesn’t greatly impact the model performance at
renowned ones like LLaMA2-70B and Falcon-180B. Even the locating the bugs nor reasoning about their according impact
quantized versions have demonstrated impressive performance, (CWE class identification). While the samples that have been
paving the way for their deployment in cybersecurity applica- picked do not exceed the context length of the selected models,
tions, such as edge security solutions. the token size of the model itself might reveal a superiority
19) Falcon2-11B: Falcon2-11B LLM [136] built by TII, is for the larger models when dealing with large source codes.
a decoder-only model with 11 billion parameters, trained on However, superior bug identification and reasoning are also
an immense corpus of text data totaling over 5,000 billion required to provide the required performance.
tokens. In terms of performance, Falcon2-11B showcases In conclusion, the highlighted results reveal that the existing
impressive capabilities, supporting 11 languages: English, models might be subject to weaknesses in identifying bugs in
German, Spanish, French, Italian, Portuguese, Polish, Dutch, Hardware designs that might lead to security-related issues.
Romanian, Czech, and Swedish. While it excels in generating The two-step evaluation process gives better visibility in
human-like text, it also carries the biases and stereotypes building more robust dedicated LLMs for Hardware design
prevalent in its training data, a common challenge LLMs security evaluation. Models that properly locate bugs do not
face. To address this, TII recommends fine-tuning the model show similar performance in classifying the bug’s impact on
for specific tasks and implementing guardrails for production the overall design. The outcomes could be evaluated with a
use. In the training process of Falcon2-11B, they utilized larger sample size and a more dedicated study at a large scale
a four-stage strategy with increasing context lengths; in the to get conclusive results.
final stage, they reached 8162 context lengths. This stage
focused on enhancing performance using high-quality data.
C. LLMs performance in the cybersecurity knowledge
Additionally, the training leveraged 1024 A100 40GB GPUs
and a custom distributed training codebase named Gigatron, Table IX compares various 42 LLMs performance in the
which employs a 3D parallelism approach combined with cybersecurity domain using CyberMetric dataset [104] . Figure
ZeRO, high-performance Triton kernels, and FlashAttention-2 5 presents the LLMs performance steps. The models are
for efficient and effective training. evaluated based on their accuracy across four question sets:
80 questions, 500 questions, 2000 questions, and 10,000 ques-
tions. The performance is represented in percentage accuracy,
B. LLMs performance in the hardware cybersecurity offering a comprehensive view of each model’s proficiency in
Table VIII compares 19 publicly available LLMs’ perfor- handling cybersecurity-related queries.
mance in Hardware design-related bug detection and security The top performers in this evaluation are the GPT-4 and
issues identification using samples from various sources. A GPT-4-turbo models by OpenAI. These models demonstrate
portion of the Chrystalis dataset [102] has been used to exceptional performance, with GPT-4 achieving 96.25% ac-
evaluate the performance of the LLM models in bug detection curacy on the 80-question set and maintaining high accuracy
tasks. A set of faults has been injected intentionally into a with 88.89% on the 10,000-question set. GPT-4-turbo closely
functional code and labeled as faulty. The sample size that follows with similar accuracy percentages. Both models are
has been processed comprises 10K of hardware design-related proprietary and developed by OpenAI, indicating a high

25
TABLE VIII: Comparison of 19 LLMs Models’ Performance in Hardware Security Knowledge.
Hardware CWE Number
LLM model Size Design bug detection
1245 1221 1224 1298 1254 1209 1223 1234 1231
Llama 3-7b-instruct 8B 39.556% Yes No No No Yes No No Yes No
Mixtral-8x7B-Instruct 8x7B 16.154% No No No No No No No No No
Dolphin-mistral-7B 7B 16,024% Yes Yes No No No No No No No
Codegemma-9b-instruct 9B 10.746% No No No No No No No No Yes
CodeQwen-7b-instruct 7B 10.269% No No No No No No No No No
Wizard-vicuna-uncensored-7b-instruct 7B 9.374% No No No No No No No No No
Mistral-openorca-7b-instruct 7B 8.241% No No No No No Yes No No No
Wizardlm2-7b-instruct 7B 5.646% No No No No No No No No No
Llama2-uncensored-7b-instruct 7B 2.505% No No No No No No No No No
Falcon-40b-instruct 40B 1.620% No No No No No No No No No
Deepseek-coder-33b-instruct 33B 1.570% No No No No No No No No No
Orca-mini-3b-instruct 3B 1.173% No No Yes No No No No No No
Qwen2-4b-instruct 4B 0.576% No No No No No No No No No
CodeLlama-7b-instruct 7B 0.218% No No No No No No No No No
Phi3-4b-instruct 4B 0.019% No No No No No No No No No
Hardware-Phi 1.5B 0% No No No No No No No No No
Llava-13b-instruct 13B 0% No No No No No No No No No
Gemma-9b-instruct 9B 0% No No No No No No No No No
Starcoder2-15b-instruct 15B 0% No No No No No No No No No
Yes: Detected the CWE sample by MITRE, No: Did not Detect the CWE sample by MITRE. CWE: Common Weakness Enumeration.

TABLE IX: Comparison of 42 LLMs Models’ Performance in Cyber Security Knowledge.


Accuracy
LLM model Company Size License
80 Q 500 Q 2k Q 10k Q
GPT-4o OpenAI N/A Proprietary 96.25% 93.40% 91.25% 88.89%
GPT-4-turbo OpenAI N/A Proprietary 96.25% 93.30% 91.00% 88.50%
Mixtral-8x7B-Instruct Mistral AI 45B Apache 2.0 92.50% 91.80% 91.10% 87.00%
Falcon-180B-Chat TII 180B Apache 2.0 90.00% 87.80% 87.10% 87.00%
GEMINI-pro 1.0 Google 137B Proprietary 90.00% 85.05% 84.00% 87.50%
GPT-3.5-turbo OpenAI 175B Proprietary 90.00% 87.30% 88.10% 80.30%
Yi-1.5-9B-Chat 01-ai 9B Apache 2.0 87.50% 80.80% 77.15% 76.04%
Hermes-2-Pro-Llama-3-8B NousResearch 8B Open 86.25% 80.80% 77.95% 77.33%
Dolphin-2.8-mistral-7b-v02 Cognitive Computations 7B Apache 2.0 83.75% 77.80 % 76.60% 75.01%
Mistral-7B-OpenOrca Open-Orca 7B Apache 2.0 83.75% 80.20% 79.00% 76.71 %
Gemma-1.1-7b-it Google 7B Open 82.50% 75.40% 75.75% 73.32%
Flan-T5-XXL Google 11B Apache 2.0 81.94% 71.10% 69.00% 67.50%
Meta-Llama-3-8B-Instruct Meta 8B Open 81.25 % 76.20% 73.05% 71.25%
Zephyr-7B-beta HuggingFace 7B MIT 80.94% 76.40% 72.50% 65.00%
Yi-1.5-6B-Chat 01-ai 6B Apache 2.0 80.00% 75.80% 75.70% 74.84%
Mistral-7B-Instruct-v0.2 Mistral AI 7B Apache 2.0 78.75% 78.40% 76.40% 74.82%
Llama 2-70B Meta 70B Apache 2.0 75.00% 73.40% 71.60% 66.10%
Qwen1.5-7B Qwen 7B Open 73.75% 60.60% 61.35% 59.79%
Qwen1.5-14B Qwen 14B Open 71.25% 70.00% 72.00% 69.96%
Mistral-7B-Instruct-v0.1 Mistral AI 7B Apache 2.0 70.00% 71.80% 68.25% 67.29%
Llama-3-8B-Instruct-Gradient-1048k Bartowski 8B Open 66.25% 58.00% 56.30% 55.09%
Qwen1.5-MoE-A2.7B Qwen 2.7B Open 62.50% 64.60% 61.65% 60.73%
Phi-2 Microsoft 2.7B MIT 53.75% 48.00% 52.90% 52.13%
Llama3-ChatQA-1.5-8B Nvidia 8B Open 53.75% 52.80% 49.45 % 49.64%
DeciLM-7B Deci 7B Apache 2.0 52.50% 47.20% 50.44% 50.75%
Flan-T5-Base Google 0.25B Apache 2.0 51.25% 50.40% 48.55% 47.09%
Deepseek-moe-16b-chat Deepseek 16B MIT 47.50% 45.80% 49.55% 48.76%
Mistral-7B-v0.1 Mistral AI 7B Apache 2.0 43.75% 39.40% 38.15% 39.28%
Qwen-7B Qwen 7B Open 43.75% 58.00% 55.75% 54.09%
Gemma-7b Google 7B Open 42.50% 37.20% 36.00% 34.28%
Meta-Llama-3-8B Meta 8B Open 38.75% 35.80% 37.00% 36.00%
Genstruct-7B NousResearch 7B Apache 2.0 38.75% 40.60% 37.55% 36.93%
Qwen1.5-4B Qwen 4B Open 36.25% 41.20% 40.50% 40.29%
Llama-2-13b-hf Meta 13B Open 33.75% 37.00% 36.40% 34.49%
Dolly V2 12b BF16 Databricks 12B MIT 33.75% 30.00% 28.75% 27.00%
Deepseek-llm-7b-base DeepSeek 7B MIT 33.75% 25.20% 27.00% 26.48%
Cerebras-GPT-2.7B Cerebras 7B Apache 2.0 25.00% 20.20% 19.75% 19.27%
Gemma-2b Google 2B Open 25.00% 23.20% 18.20% 19.18%
Stablelm-2-1 6b Stability AI 6B Open 16.25% 21.80% 19.55% 20.09%
ZySec-7B ZySec-AI 7B Apache 2.0 12.50% 16.40% 15.55% 14.04%
Phi-3-mini-4k-instruct Microsoft 3.8B MIT 5.00% 5.00% 4.41% 4.80%
Phi-3-mini-128k-instruct Microsoft 3.8B MIT 1.25% 0.20% 0.70% 0.88%

26
optimization level for specialized tasks within a controlled VI. C ODE - SPECIFIC LLM S
environment. Another strong performer is the Mixtral-8x7B-
Instruct by Mistral AI, which boasts accuracy of 92.50% on the The rapid evolution of technology and software develop-
80-question set and 87.00% on the 10,000-question set. This ment has increased the demand for specialized tools that
model is open-source under the Apache 2.0 license, demon- aid in coding, debugging, and enhancing software security
strating the potential of community-driven development in [152], [153]. Recognizing this need, various organizations
achieving high performance. Additionally, GEMINI-pro 1.0 by have developed Code-specific LLMs, each offering unique
Google shows robust performance, achieving 90.00% accuracy features and capabilities. These models leverage advanced
on the 80-question set and 87.50% on the 10,000-question set, machine learning techniques to understand, generate, and
highlighting the capabilities of large-scale corporate research manipulate code, thereby revolutionizing the field of software
and development in LLMs. development [154], [155]. This section delves into several
notable Code-specific LLMs, exploring their architectures,
Mid-tier performers include models like Yi-1.5-9B-Chat by training methods, and potential applications in cybersecurity
01-ai and Hermes-2-Pro-Llama-3-8B by NousResearch. Yi- and beyond [156]–[159]. Table X and Table XI compare Code-
1.5-9B-Chat performs reasonably well with an 87.50% accu- specific Large Language Models.
racy on the 80-question set, tapering to 76.04% on the 10,000-
question set. Under the Apache 2.0 license, this model shows a
balance between open-source collaboration and performance. A. Prevalent LLMs
Hermes-2-Pro-Llama-3-8B achieves 86.25% accuracy on the 1) SantaCoder: As part of the BigCode project, Hug-
80-question set and 77.33% on the 10,000-question set, further gingFace and ServiceNow have proposed SantaCoder LLM
underscoring the effectiveness of collaborative research efforts. [137]. Based on the decoder-only architecture and with a
1.1B parameter, SantaCoder was trained on 268GB of Python,
Lower-tier performers include models like Qwen1.5-7B by Java, and JavaScript subsets of The Stack dataset. Multiple
Qwen. Qwen1.5-7B scores 73.75% on the 80-question set, filtering techniques were used for the training data with-
dropping to 59.79% on the 10,000-question set. As an open out much impact except for one (i.e., filtering files from
model, Qwen1.5-7B indicates the challenges faced by smaller repositories with 5+ GitHub stars), significantly deteriorat-
models in maintaining high accuracy with increasing question ing the performance on text2code benchmarks. Pre-training
set sizes. Falcon-40B-Instruct achieves 67.50% accuracy on methods included Multi-Query-Attention (MQA) and Fill-in-
the 80-question set and 64.50% on the 10,000-question set. the-Middle (FIM). Although these techniques have led to a
Licensed under Apache 2.0, it highlights the competitive slight drop in the model’s performance compared to Multi-
landscape of open-source LLMs. Head-Attention (MHA) and training without FIM, the model
The lowest-tier performers include models such as Phi- could still outperform previous multi-lingual code models
3-mini-128k-instruct by Microsoft and Stablelm-2-1 6b by like CodeGen0Multi-2.7B and InCoder-6.7B despite being
Stability AI. Phi-3-mini-128k-instruct has the lowest perfor- substantially smaller. Such performance can be promising if
mance, with only 1.25% accuracy on the 80-question set and deployed in cybersecurity for tasks like software vulnerability
0.88% on the 10,000-question set. Despite being from a major and secure code generation.
company like Microsoft and licensed under MIT, this model 2) StarCoder: StarCoder is another decoder-only model
underscores the importance of continuous development and developed within the BigCode project [138]. With 15.5B
optimization in LLMs. Stablelm-2-1 6b scores 16.25% on the parameters, StarCoder was pre-trained on 1000B tokens from
80-question set, decreasing to 20.09% on the 10,000-question over 80 different programming languages. The pre-training
set, demonstrating smaller models’ difficulties in scaling up utilized techniques such as FIM and MQA and Learned
effectively. Absolute Positional Embeddings. After pre-training, the base
model was fine-tuned on an additional 35B tokens of Python.
In conclusion, the table reveals that proprietary models Compared to other Code LLMs, StarCoder outperformed all
perform better than open-source models, suggesting that con- fine-tuning models on Python. Moreover, the base model
trolled environments and dedicated resources may significantly outperformed OpenAI code-cushman-001. StarCoder’s excep-
enhance model performance. However, larger models do not tional performance in Python and its broad training in multiple
always guarantee higher performance, as seen with some mid programming languages position it as a highly versatile tool
and lower-tier performers. Additionally, many models show for various coding tasks.
a decline in accuracy as the number of questions increases, 3) StarChat-Alpha: StarChat Alpha is a variant of Star-
highlighting the challenges in maintaining performance con- Coder fine-tuned to act as a helpful coding assistant that ac-
sistency across larger datasets. The analysis indicates that cepts natural language prompting (considering that StarCoder
while top-tier proprietary models lead in performance, there needs specific structured prompting) [139]. With 16B param-
is significant potential within the open-source community eters, the model was fine-tuned on a mixture of oasst1 and
to develop competitive models. Continuous improvements in databricks-dolly-15k datasets. The model has not undergone
model architecture, training data quality, and optimization RLHF or similar methods, which would have helped align it
techniques are crucial for advancing state-of-the-art cyberse- with human preferences. Nevertheless, the comprehensive pre-
curity knowledge within LLMs. training of the base model contributed to the model’s ability to

27
TABLE X: Comparison of Code-specific Large Language Models
Model Architecture Base Model Para- Training Pre-training Corpus Released Applications Use Cases in Training Key Training Quanti- Ref
meters Tokens Volume By Cybersecurity Scheme Techniques zation
SantaCoder Decoder-only NA 1.1B 236B The Stack 268GB Hugging- Code Threat Pre-training Multi Query NA [137]
v1.1 dataset Face, Generation, Intelligence, Attention (MQA),
(Python, Java, Servi- Code Software Fill-in-the-Middle
and JavaScript) ceNow Completion, Vulnerability, (FIM)
Code Analysis, Source Code
QA Generation
StarCoder Decoder-only NA 15.5B PT 80+ +800GB Hugging- Code Threat Pre-training, Fill-in-the-Middle NA [138]
1000B, programming Face, Generation, Intelligence, Fine-tuning (FIM), Multi
FT 35B languages, Servi- Code Software Query Attention
Git commits, ceNow Completion, Vulnerability (MQA), Learned
GitHub issues, Code Analysis, Detection absolute positional
and Jupyter QA embeddings
notebooks
StarChat Al- Decoder-only StarCoder- 16B NA oasst1 and NA Hugging- Code Threat Fine-tuning NA NA [139]
pha base databricks- Face, Generation, Intelligence,
dolly-15k Servi- Code Software
datasets ceNow Completion, Vulnerability
Code Analysis,
QA
CodeGen2 Decoder-only NA 1-16B 400B Stack dataset NA Salesforce Program Threat Pre-training Causal Language NA [140]
(causal LM) v1.1 Synthesis, Intelligence, Modeling, Cross-
Code Software entropy Loss,
Generation Vulnerability File-level Span
Corruption,
Infilling
CodeGen2.5 Decoder-only NA 7B 1400B StarCoderData NA Salesforce Code Threat Pre-training Flash Attention, NA [141]
(causal LM) Generation, Intelligence, Infill Sampling,
Code Software Span Corruption
Completion, Vulnerability
Code Analysis
CodeT5+ Encoder- NA 220M- 51.5B CodeSearchNet NA Salesforce Code Threat Pre-training Span Denoising, NA [142]
decoder 16B dataset, Generation and Intelligence, Contrastive
GitHub code Completion, Software Learning, text-
dataset Math Vulnerability code Matching,
Programming, Causal Language
Text-to-code Modeling (CLM)
Retrieval Tasks
XGen-7B Decoder-only NA 7B 1500B GitHub, NA Salesforce Code Genera- Threat Pre-training, Standard Dense NA [143]
Several public tion, Summa- Intelligence, Fine-tuning Attention, Two-
sources, Apex rization Software stage Training
code data Vulnerability Strategy
(mixture of
natural text
data and code
data)
Replit Code Decoder-only NA 2.7B 525B Stack Dedup NA Replit, Code Threat Pre-training Flash Attention, Matrix [144]
V1 (causal LM) v1.2 dataset Inc. Completion, Intelligence, AliBi Positional Multipli-
(20 different Code Software Embeddings, cation
languages) Generation Vulnerability LionW Optimizer
DeciCoder- Decoder-only NA 1B 446B StarCoderData NA Deci Code Threat Pre-training Fill-in-the-Middle NA [145]
1B (Python, Java, Completion, Intelligence, training (FIM),
and JavaScript) Code Software Grouped Query
Generation, Vulnerability Attention (GQA)
Code Analysis
CodeLLAMA Decoder-only LLaMA2 7-34B 620B Text and code NA Meta Code Threat Pre-training, Causal Infilling, NA [146]
from multiple Completion, Intelligence, Fine-tuning Autoregressive
datasets Code Software Training,
Generation, Vulnerability Repository-level
Code Analysis Reasoning, Long-
context Fine-
tuning
CodeQwen1.5- Decoder-only Qwen1.5 7.25B 3T code-related NA Qwen Code Threat Pre-training Flash Attention, NA [147]
7B data Generation, Intelligence, RoPE, Grouped-
Code Software Query Attention
Completion, Vulnerability, (GQA)
Code Analysis Bug fixes
DeepSeek Decoder-only NA 33.3B 2T Composition NA DeepSeek Code Threat Pre-training, Flash Attention, NA [148]
Coder-33B- of code Generation, Intelligence, Long- RoPE, Grouped-
instruct and natural Code Software context Query Attention
language Completion, Vulnerability pre-training, (GQA)
Code Analysis Instruction
fine-tuning
CodeGemma- Decoder-only Gemma 8.54B 500B Code NA Google Code Threat Pre-training, Fill-in-the-middle NA [149]
7B repositories, completion, Intelligence, Fine-tuning (FIM) tasks,
Mathematics Code Software dependency graph-
datasets, generation, Vulnerability based packing,
Synthetic code Code chat, unit test-based
Instruction lexical packing
following

28
TABLE XI: Continued
Granite 8B Decoder-only NA 8.05B 4.05T Publicly NA IBM Code Threat Pre-trained RoPE embedding, NA [150]
Code Datasets Granite generation, Intelligence, in two Grouped-Query
(GitHub Code Intrusion phases (the Attention (GQA),
Code Clean, explanation, Detection, second Context Length of
Starcoder data) Code fixing, Malware phase for 4096 Tokens
etc. Detection high-quality
data)
DeepSeek-V2 Decoder-only NA 236B 8.1T Composition NA DeepSeek Code Threat Pre-training, Mixture-of- NA [151]
of code Generation, Intelligence, SFT, RL, Experts (MoE),
and natural Code Software Long Multi-head Latent
language Completion, Vulnerability Context Attention (MLA)
Code Analysis Extension

interpret various coding tasks and provide accurate code sug- 7) XGen-7B: Another production of Salesforce AI Re-
gestions. This capability makes it an invaluable programming search is XGen-7B LLM, a decoder-only transformer with
tool, simplifying code development and problem-solving. 7B parameters [143]. The model was developed to address
4) CodeGen-2: Developed by Salesforce AI research, the problem of sequence length constraints in the available
CodeGen2 was proposed as a product of extensive research open-source LLMs as many tasks require inference over an
in the field of LLM aimed at optimizing model architectures input context. XGen-7B, with up to 8K sequence length, was
and learning algorithms to enhance the efficiency and reduce trained on 1500B tokens from a mixture of text and code
the costs associated with LLMs [140]. The final findings were data. Techniques like standard dense attention and a two-stage
examined in multiple variants with parameters ranging from training strategy were utilized for pre-training. Additionally,
1B to 16B, where the 16B model is trained on 400B tokens the model was enhanced with instructional tuning, a technique
from the Stack dataset. Causal language modeling, cross- that refines its responses to align closely with specific user
entropy loss, and other techniques were used for pre-training, instructions. As a result, XGen-7B achieved comparable or
resulting in a robust program synthesis model. CodeGen2’s better results than other 7B state-of-the-art open-source LLMs.
proficiency in program synthesis makes it a valuable asset 8) Replit code v1: Proposed by Replit, Inc., the 2.7B
in cybersecurity applications, such as aiding in vulnerability parameters causal language model Replit-code-v1-3b, with
detection and enhancing code security analysis. Its ability to a focus on code completion, was trained on 525B tokens
understand and generate complex small models can be trained from a subset of the stack Dedup v1.2 dataset [144]. The
for multiple epochs with specific settings, efficient security model underwent advanced pre-training techniques such as
protocols, and automated threat detection systems. Flash Attention for efficient computation, AliBi positional
5) CodeGen-2.5: Another version of the CodeGen family is embeddings for enhanced context interpretation, and the Li-
CodeGen 2.5 [141]. The 7B parameters model was introduced onW optimizer for improved training dynamics. The Replit
to prove that good models don’t necessarily have to be big, code v1 model is also available in two quantization options:
especially with the trend of scaling up LLMs and the data size 8-bit and 4-bit. The Replit-code-v1-3b model’s capabilities
limitations. CodeGen 2.5 was trained on 1400B training to- in understanding and generating code make it particularly
kens from StarCoderData. A strategic selection of pre-training suited for cybersecurity applications, such as automating the
techniques, such as Flash Attention, Infill Sampling, and Span detection of code vulnerabilities and generating secure coding
Corruption, enhanced the model’s performance. Moreover, that patterns. Additionally, its quantized versions can be utilized
led to a good performance that is on par with popular LLMs for edge security.
of larger size. The results indicated that small models can be 9) DeciCoder-1B: DeciCoder-1B is an open-source 1B
trained for multiple epochs with specific settings and achieve parameter decoder-only transformer developed by Deci AI
comparable results to bigger models. with a 2048-context window [145]. Subsets of Python, Java,
6) CodeT5+: CodeT5+ is an encoder-decoder transformer and JavaScript from the StarCoderData dataset were used for
proposed by Salesforce AI Research to address some code training. The model architecture was built using Automated
LLMs limitations [142]. Specifically, those related to the ar- Neural Architecture Construction (AutoNAC) developed by
chitecture being either inflexible or serving as a single system the company, which is a technology designed to automatically
and pre-training task limitations related to a limited set of pre- create and optimize deep learning models, particularly neural
training objectives can result in a substantial degradation in networks, for specific tasks and hardware environments. More-
performance. The proposed model has different variants rang- over, Grouped Query Attention (GQA) and FIM were utilized
ing from 220M to 16B parameters. Trained on 51.5B tokens to pre-train the model. Consequently, the model has shown
from CodeSearchNet and GitHub code datasets using tech- smaller memory usage compared to popular code LLMs like
niques like span denoising, contrastive learning, and others, the StarCoder and outperformed SantaCoder in the languages it
model achieved new state-of-the-art results on various code- was trained on with remarkable inference speed.
related tasks like code generation, code completion, etc. A 10) CodeLLAMA: Based on LLAMA 2, CodeLLAMA was
model with such capabilities can be valuable to cybersecurity introduced by Meta as a decoder-only transformer code LLM
for threat intelligence and software vulnerability. [146]. With variants ranging from 7 to 34B parameters of

29
base, python specialized, and instruction-following models, all V2 excels at live coding tasks and open-ended generation,
trained on text and code from multiple datasets, CodeLLAMA supporting both English and Chinese.
emerges as a comprehensive suite of models, adept at handling
a wide array of programming-related tasks. Causal infilling, B. Datasets Development for Code-centric LLM Models
Long-context fine-tuning, and other techniques were utilized
The development of large-scale datasets has played a crucial
for pre-training and fine-tuning. CodeLLAMA models’ family
role in advancing LLM models, especially those focused on
achieved state-of-the-art performance in multiple benchmarks,
understanding and generating code. Table XII presents the
indicating their potential for transformative applications in
datasets used for pre-training foundation models in Coding.
cybersecurity. Their advanced code analysis and generation
Datasets like CodeSearchNet [160] and The Pile [161] have
capabilities could be crucial in automating threat detection and
been instrumental in bridging the gap between natural lan-
enhancing vulnerability assessments.
guage and code, improving semantic search capabilities, and
11) CodeQwen1.5-7B: CodeQwen1.5-7B-Chat [147] is a
enhancing language model training across diverse domains.
transformer-based decoder-only language model trained on 3
These datasets provide a rich source of real-world code in
trillion tokens of code data. It supports 92 coding languages
multiple programming languages and include expert annota-
and has strong code-generation capabilities. The model can
tions and natural language queries that challenge and push the
understand and generate long contexts of up to 64,000 tokens
boundaries of LLM performance in code-related tasks.
and has shown excellent performance in text-to-SQL and bug-
Over time, the focus has shifted towards increasing the
fixing tasks. It is based on Qwen1.5, which offers eight model
size, diversity, and ethical considerations of the data used in
sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B, and 72B dense
training AI models. Introducing datasets such as ROOTS and
models, and an MoE model of 14B with 2.7B activated.
The Stack v2 [164] reflects a growing emphasis on responsible
12) DeepSeek Coder-33B-instruct: Deepseek Coder [148]
LLM development. These newer datasets encompass a broader
is a series of code language models, with each model trained
range of programming languages and coding scenarios, and
from scratch on 2 trillion tokens, 87% of which are code and
they incorporate governance frameworks to ensure the ethical
13% natural language in English and Chinese. The model
use of the data. In addition, these datasets are designed to
comes in various sizes, ranging from 1B to 33B, with the
address the needs of large multilingual language models and
33B model being fine-tuned on 2 billion tokens of instruction
the specific challenges of code generation and comprehension,
data. It achieves state-of-the-art performance among open-
demonstrating the evolving landscape of LLM research driven
source code models on multiple programming languages and
by enhanced dataset quality and scope.
benchmarks.
13) CodeGemma-7B: CodeGemma [149] is a collection of
lightweight open code models built on top of Gemma. It is a C. Vulnerabilities Analysis of LLM-Generated Code
text-to-text and text-to-code decoder-only model with 7 billion The evolution of LLMs in software development has
parameters, specializing in code completion and generation brought significant advancements and new security challenges
tasks. It can answer questions about code fragments, gener- [173]. Table XIII presents a comparative analysis of vulnera-
ate code from natural language, or discuss programming or bilities in LLM-generated code.
technical problems. CodeGemma was trained on 500 billion Schuster et al. [165] demonstrate how LLMs employed
tokens of primarily English language data from publicly avail- in code autocompletion are susceptible to poisoning attacks,
able code repositories, open-source mathematics datasets and which can manipulate the model’s output to suggest insecure
synthetically generated code. code. This vulnerability is intensified by the ability to target
14) Granite 8B Code: IBM released a family of Granite specific developers or repositories, making the attacks more
code models [150], including the Granite-8B-Code-Base, to effective and difficult to detect. Despite defenses against such
make coding more accessible and efficient for developers. attacks, their effectiveness remains limited, raising concerns
Granite-8B-Code-Base is a decoder-only code model designed over the secure deployment of these technologies [165].
for code generation, explanation, and fixing. It is trained in Recent studies, such as those by Asare et al. [166] and
two phases: first on 4 trillion tokens from 116 programming Sandoval et al. [167], provide an empirical and comparative
languages, then on 500 billion tokens from a carefully de- analysis of the security aspects of code generated by LLMs
signed mixture of high-quality code and natural language data. like GitHub’s Copilot and OpenAI Codex. Asare et al. [166]
This two-phase training strategy ensures the model can reason find that while Copilot occasionally replicates vulnerabilities
and follow instructions while understanding programming known from human-written code, it does not consistently do
languages and syntax. so across different vulnerabilities. In contrast, Sandoval et
15) DeepSeek-V2: DeepSeek-V2 [151] is a mixture-of- al. [167] report a minimal increase in security risks when
experts (MoE) language model with 236 billion parameters, developers use LLMs in coding, indicating that LLMs do not
of which 21 billion are activated for each token. It is a sig- necessarily degrade the security of the code more than human
nificant upgrade from the previous DeepSeek model, offering developers would.
stronger performance while reducing training costs by 42.5%. Moreover, Perry et al. [168] reveal a concerning trend where
The model was pre-trained on a vast and diverse corpus of users interacting with AI code assistants tend to write less
8.1 trillion tokens, followed by supervised fine-tuning and secure code but believe otherwise. Their findings underscore
reinforcement learning to maximise its capabilities. DeepSeek- the need for heightened awareness and better design of user

30
TABLE XII: Datasets Used for Pre-training Foundation Models in Coding
Dataset Title Year Purpose Content Significance
Contains about 6 million
Advances the semantic code
”CodeSearchNet Challenge: functions from six languages
CodeSearchNet Focuses on bridging natural search field with a challenge
Evaluating the State of 2019 and 2 million automatically
[160] language and code. including 99 queries and 4k
Semantic Code Search” generated query-like
expert annotations.
annotations.
Improves model
”The Pile: An 800GB Dataset Comprises 22 high-quality, di-
Designed to train large-scale generalization capabilities;
The Pile [161] of Diverse Text for Language 2020 verse text subsets totaling 825
language models. evaluates with GPT-2 and
Modeling” GiB.
GPT-3.
Facilitates model training in Consists of 115M code files
2 Aids in diverse language and
CodeParrot CodeParrot Dataset 2022 code understanding and gen- from GitHub in 32 program-
format model training.
eration. ming languages, totaling 1TB.
Demonstrates improved per-
The Stack ”The Stack: 3 TB of permis- Aimed at fostering research Features 3.1 TB of code in 30 formance on text2code bench-
2022
[162] sively licensed source code” on AI for code. programming languages. marks; introduces data gover-
nance.
”The BigScience ROOTS Spans 59 languages and fo- Advances large-scale
Supports ethical, multilingual
ROOTS [163] Corpus: A 1.6TB Composite 2023 cuses on diverse, inclusive language model research
model research.
Multilingual Dataset” data. with an ethical approach.
Built from sources including
Shows improvements in code
The Stack v2 ”StarCoder 2 and The Stack Enhances foundation models 619 programming languages,
2024 LLM benchmarks; ensures
[164] v2: The Next Generation” for code. significantly larger than its
transparency in training data.
predecessor.

TABLE XIII: Comparative Analysis of Vulnerabilities in LLM-Generated Code


Reference Year Primary Focus Methodology Key Findings
Schuster et al. [165] 2021 Poisoning in code autocom- Experimental poisoning attacks on autocom- Demonstrated effective targeted and untar-
pletion pleters geted poisoning; current defenses are largely
ineffective.
Asare et al. [166] 2023 Security analysis of GitHub’s Empirical analysis comparing human and Copilot does not consistently replicate human
Copilot Copilot-generated code vulnerabilities vulnerabilities, showing variable performance
across different types.
Sandoval et al. 2023 Security implications of LLM User study with AI-assisted coding tasks Minimal increase in security risks from LLM
[167] code assistants in C program- assistance compared to control.
ming
Perry et al. [168] 2023 Impact of AI code assistants Large-scale user study on security task per- Participants using AI wrote less secure code
on security formance but were overconfident in its security.
Hamer et al. [169] 2024 Security vulnerabilities in Empirical analysis of code snippets for secu- LLM-generated code had fewer vulnerabili-
LLM vs. StackOverflow code rity vulnerabilities ties than StackOverflow, highlighting differ-
ences in security risks.
Cotroneo et al. 2024 Security assessment tool for Development and validation of DeVAIC tool DeVAIC effectively identifies vulnerabilities
[170] AI-generated code in Python code, outperforming other tools.
Tóth et al. [171] 2024 Evaluating security of LLM- Hybrid evaluation using static and dynamic Significant vulnerabilities found in AI-
generated PHP web code analysis generated PHP code, emphasizing the need
for thorough testing.
Tihanyi et al. [172] 2024 Security of LLM-generated C Dataset creation and analysis using formal Over 63% of generated C programs were
code from neutral prompts verification found vulnerable, with minor variations be-
tween different LLMs.

interfaces to foster critical engagement with the code sugges- accurate, and effective for training or evaluating the models.
tions provided by LLMs [168]. In a similar vein, Hamer et Figure 6 presents the cyber security dataset lifecycle for LLM
al. [169] emphasize the educational gap among developers development.
regarding the security implications of using code snippets from 1) Define Objectives: Defining the objectives for a cyberse-
AI like ChatGPT or traditional sources like StackOverflow, curity dataset for LLMs is crucial as it dictates its construction
highlighting that both sources can propagate insecure code. and application. For training purposes, the dataset should cover
Lastly, novel tools like DeVAIC introduced by Cotroneo various cybersecurity topics and incorporate various data types
et al. [170] and comprehensive vulnerability evaluations in like text, code, and logs, aiming to develop a robust and
LLM-generated web application code by Tóth et al. [171] and versatile LLM capable of understanding diverse threats (e.g.,
Tihanyi et al. [172] illustrate ongoing efforts to understand Edge-IIoT dataset [174] for Network Security and FormAI
better and mitigate the risks associated with AI-generated dataset [172], [175] for Software Security). For evaluation,
code. DeVAIC, for instance, offers a promising approach to the focus narrows to specific areas, such as benchmarking the
detecting vulnerabilities in incomplete Python code snippets, LLMs’ knowledge in cybersecurity (e.g., CyberMetric [104]).
potentially enhancing the security assessment capabilities for
AI-generated code. 2) Scope and Content Gathering: For the scope and content
gathering stage of building a cybersecurity dataset aimed at
training and fine-tuning LLMs, selecting a broad range of
VII. C YBER S ECURITY DATASETS FOR LLM S
topics is essential to ensure comprehensive coverage. Key
A. Cyber Security Dataset Lifecycle areas include network security, malware analysis, software
Creating a cybersecurity dataset for use with LLMs in- security, cryptographic protocols, cloud security, and incident
volves several steps that ensure the dataset is comprehensive, response. The data should be sourced from diverse and reliable

31
origins, such as public and private databases such as Common on C/C++ program code out of many other programming
Weakness Enumeration (CWE) and Common Vulnerabilities languages. The dataset contains the source files, with each test
and Exposures (CVE) [176], [177]. case containing bad functions and good functions that patch
3) Data Cleaning and Preprocessing: This process involves the vulnerable “bad” code. Test cases are labeled with CWEs
filtering out irrelevant content to maintain a focus on cy- to indicate the type of vulnerability exposed in the program.
bersecurity and standardizing formats across the dataset. For The dataset contains keywords to indicate precisely where vul-
example, processing the Starcoder 2 dataset [164] involves nerable and non-vulnerable functions exist. Thus, the dataset
several meticulous steps to refine GitHub issues collected needs careful sanitization /obfuscation. While the dataset has
from GHArchive. Initially, auto-generated texts from email many vulnerability types and gives concrete examples, they are
replies and brief messages under 200 characters are removed, still programs purposefully built to demonstrate vulnerabilities
along with truncating longer comments to maintain a max- rather than naturally occurring ones.
imum of 100 lines while preserving the last 20 lines. This 2) Draper dataset: Researchers in work [179] leveraged
step alone reduced the dataset volume by 17%. The dataset a new dataset for vulnerability detection using deep rep-
then undergoes further cleaning to remove comments by bots resentation. A custom lexer was used to create a generic
identified through specific keywords in usernames, eliminating representation to capture the essential tokens while minimizing
an additional 3% of the issues. A notable focus is placed on the the token count. It was curated using open-source C/C++ code
interaction quality within the issues; conversations with two or from SATE IV, Github, and Debian and labeled using three
more users are prioritized, and those with extensive text under static analyzers. The dataset is substantial, but the vulnerability
a single user are preserved if they stay under 7,000 characters. percentage is low, standing at roughly 6.8%. The dataset is
Issues dominated by a single user with more than ten events multi-labelled, where more than one CWE can exist in a
are excluded, recognizing them as potentially low-quality or code sample. The dataset focuses on four main CWEs or
bot-driven, resulting in a 38% reduction of the remaining categories, while the rest of the vulnerabilities are grouped
dataset. For privacy, usernames are anonymized by replac- into one class. The researchers mapped the static analyzer
ing them with a sequential participant counter, maintaining findings to CWEs and binary labels. Furthermore, since the
confidentiality while preserving the integrity of conversational researchers did the mapping, warnings, and functions flagged
dynamics. by static analyzers that would not typically be exploited were
4) Annotation and Labeling: A sophisticated hybrid ap- not flagged as vulnerable. In addition, a strict deduplication
proach can be adopted to ensure precision and scalability in process was used to refine the dataset. The authors utilize this
the annotation and labeling stage of developing a cybersecurity dataset to train their model after lexing the source code to
dataset for LLMs. Cybersecurity experts manually annotate the reduce the code representation and use a limited vocabulary
dataset, meticulously labeling complex attributes such as threat size. Due to lexing the source code, the approach reduces the
type, guaranteeing high accuracy. Concurrently, automated needed vocabulary size compared to the original size required.
tools like static analyzers (e.g., Clang for C/C++ and Bandit However, the vulnerable portion is minimal compared to
for Python), formal verification methods (e.g., ESBMC), and the dataset. Moreover, the labeling considers four categories,
dynamic tools are employed to handle the large volume of which are limited compared to other datasets.
data efficiently. These tools initially tag the data, which human 3) Reveal dataset: Reveal [180] was proposed to provide
experts carefully review and correct [178]. an accurate dataset that reflects the real world, which is
why it is also reflected in the imbalance of the samples.
Their work finds that performance drops by more than 50%
B. Software Cyber Security datasets in real-world prediction, highlighting the need for a dataset
In software cyber security, datasets play a crucial role subjected to a realistic setting. The authors focus on two
in understanding, detecting, and mitigating vulnerabilities in open-source projects, Linux Debian and Chromium, as they
software systems. This sub-section explores several significant are popular, well-maintained, showcase important domains,
software cybersecurity datasets, each offering unique insights and have publicly available vulnerability reports. The data
and methodologies for vulnerability analysis in cybersecurity. is not used as text but as Code Property Graphs (CPG),
From the extensive BigVul dataset, which links vulnerabilities which are then converted to graph embeddings for training
in the CVE database to specific code commits, to the inno- a Gated Graph Neural Network (GGNN). The authors use
vative FormAI dataset, leveraging AI-generated C programs an approach inspired by Zhou et. al [181] to identify the
and advanced verification methods for precise vulnerability security vulnerabilities in the project, and they remedy the
classification, each dataset contributes uniquely to the field. class imbalance due to the majority of non-vulnerable code
These datasets range from manually labeled by security ex- through SMOTE. Such data was collected from Bugzilla and
perts to those generated using state-of-the-art automated tools, Debian security tracker. Looking at the vulnerable portion,
providing diverse resources for researchers and practitioners. it constitutes 9.16% out of the 18,169 programs. While the
Table XIV provides an overview of software vulnerability dataset attempts to depict a realistic dataset, relying on two
datasets that can be used for fine-tuning LLMs for software sole projects might limit how well a model trained on the
security. dataset would perform in a real-world prediction case.
1) Sate IV - Juliet dataset: Nist has developed the SARD 4) Devign dataset: Researchers of Devign [177] required
Juliet dataset to assess the capabilities of static analysis tools an accurate dataset to be used in several graph forms, which

32
Fig. 6: Cyber Security Dataset Lifecycle for LLM development.

they believe is better in reflecting the structural and logical teristics that can be useful for thoroughly analyzing vulner-
aspects of source code. The proposed approach contains a abilities and the history of change. Moreover, the diversity
graph embedding layer that uses Abstract Syntax Tree (AST), of the projects and the vulnerability types expose the models
Control Flow Graph (CFG), Data Flow Graph (DFG), and being trained on it to several patterns. However, the dataset
Natural Code Sequence (NCS) to generate a joint graph rep- only contains 11,823 vulnerable functions as opposed to the
resentation. The rationale behind a joint representation is the 253,096 non-vulnerable functions. While it may depict real
ability of certain graphs to portray different vulnerability types projects, the data is imbalanced, and more vulnerable functions
not uncovered by others. Motivated to propose a more accurate are needed to train large models.
dataset instead of those generated using static analyzers, the 7) D2A dataset: A Dataset proposed by IBM [176] is
researchers invested in a security team to manually label curated using differential analysis to label issues reported by
the samples. The data is collected from large open-source static analysis tools. Bug-fixing commit pairs are extracted
projects: Linux, Wireshark, QEMU, and FFmpeg. The dataset from open-source projects with a static analyzer running on
is manually labeled over two rounds, with 600 hours put into them. If issues were detected in the “before” version and
labeling it. While a significant advantage of the dataset is that disappeared in the “after” version, then it is assumed to be
it is manually labeled and verified, the dataset is only binary a bug. Compared to other datasets, the bug trace is included
labeled. Also, it is worth noting that only 2 out of the four in the dataset to determine the type and exact location of the
datasets are available. bug. Open-source projects such as FFmpeg, OpenSSL, httpd,
5) VUDENC: The VUDENC dataset [182] is comprised libtiff, libav and NGINX constitute the curated dataset. This
of 25,040 vulnerability-fixing commits from 14,686 different dataset also has a limited number of vulnerable samples, and a
GitHub repositories. The commits were filtered only to include manual validation experiment shows that the results are better
those that changed the code in a limited number of places, than those of regular differential analysis. However, it is still
ensuring that the changed code was related to the commit mes- not at the desired accuracy, with manual validation showing
sage. The dataset covers seven common vulnerability types: an accuracy of 53%. The paper’s authors applied the dataset
SQL injection, cross-site scripting (XSS), command injection, to build a classifier to identify false alarms in static analyzers
cross-site request forgery (XSRF), remote code execution, path to reduce the false positive rate.
disclosure, and open redirect. This Python-specific dataset 8) CVEfixes: CVEfixes [184] is a dataset built using the
focuses on improving software systems’ security by identi- method proposed by the authors to curate vulnerability datasets
fying potentially vulnerable code. Each vulnerability type has based on CVEs. The automated tool was used to release
a dedicated dataset, with the number of repositories ranging CVEfixes, a dataset covering CVEs up to 9 June 2021. The
from 39 to 336 and the number of changed files ranging from dataset is organized in a relational database, which can be
80 to 650. The total lines of code across all vulnerability types used to extract data with the desired information. It contains
exceed 200,000, demonstrating the comprehensive nature of the code changes in several levels, namely on the repository,
the dataset. commit, file, and method levels. The dataset contains 5495
6) BigVul dataset: BigVul [183] is a C/C++ vulnerability vulnerability fixing commits with 5365 CVE records, covering
dataset curated from the CVE database and its relevant open- 1754 open-source projects. The mining tool is shared, and the
source projects. 3,754 code vulnerabilities were collected from most recent CVE records can be mined.
348 open-source projects spanning 91 vulnerability types. The 9) CrossVul: The CrossVul dataset [185] encompasses a
dataset links CVEs in the CVE database with code commits diverse range of programming languages, exceeding 40 in
and project bug reports. Furthermore, the dataset contains 21 total, and comprises vulnerable source files. The dataset was
features to show changes and where the vulnerability lies. curated by extracting data from GitHub projects referenced
Compared to other datasets, BigVul provides many charac- by the National Vulnerability Database (NVD), specifically

33
focusing on files modified through git-diff. Files preceding to produce diverse and realistic code samples. A standout
the commit are tagged as vulnerable, while those following feature of the FormAI dataset is its meticulous vulnerability
the commit are designated as non-vulnerable. Organized by classification. Each program is thoroughly analyzed for vulner-
Common Weakness Enumerations (CWEs)/Common Vulnera- abilities, with the type of vulnerability, the specific line num-
bilities and Exposures (CVEs), as well as language types, the ber, and the name of the vulnerable function clearly labeled.
dataset offers a comprehensive classification of vulnerabilities. This precise labeling is achieved using the Efficient SMT-
It encompasses 1675 GitHub projects, spanning 5877 commits based Bounded Model Checker (ESBMC), an advanced formal
and 27,476 files, with an equal distribution of 13,738 files verification method. ESBMC employs techniques like model
marked as vulnerable and non-vulnerable, respectively. A checking, abstract interpretation, constraint programming, and
supplementary dataset containing the commit messages for satisfiability modulo theories to rigorously assess safety and
each sample is provided. security properties in the programs. This approach ensures that
10) SySeVR dataset: SySeVR framework was proposed vulnerabilities are definitively detected, providing a formal
in [186], which builds on the previous work in VulDeeP- model or counterexample for each finding and effectively
ecker [187]. While VulDeePecker only considers library/ API eliminating false positives.
function calls, SySeVR covers a variety of vulnerabilities. 13) Chrysalis-HLS: Chrysalis-HLS [79] dataset, a helpful
Furthermore, SySeVR utilized a unique approach using the resource for improving Large Language Models’ performance
notions of syntax-based Vulnerability Candidates(SyVCs) and in hardware and software design. This comprehensive dataset
Semantics-based Vulnerability Candidates (SeVCs) to rep- targets functional verification and code debugging in High-
resent programs as vectors that accommodate syntax and Level Synthesis (HLS). It offers a realistic evaluation envi-
semantic information. Their approach results show a reduced ronment with over 1,000 function-level designs and up to 45
false-negative rate. The dataset is collected from the National injected bug combinations. Named ”Chrysalis” to symbolize
Vulnerability Database (NVD) and Software Assurance Refer- code transformation, it includes diverse HLS applications with
ence Dataset (SARD). The NVD dataset contains 19 popular various error types. Created with GPT-4 and curated prompts,
C/C++ open source products and the SARD data comprises Chrysalis-HLS is a valuable resource for advancing LLM
126 vulnerability types. There are 1,591 programs from open- capabilities in HLS verification and debugging, enhancing
source projects, of which 874 are vulnerable. As for SARD, hardware engineering.
there are 14,000 programs, with 13,906 being vulnerable.
While this dataset uses the existing datasets published by 14) ReFormAI: The ReFormAI dataset [189] is a large-
NIST, the datasets would need further processing in most scale dataset of 60,000 independent SystemVerilog designs
cases. For example, many vulnerable SARD programs contain with varied complexity levels, targeting different Common
the vulnerable snippet and its patch. Not separating them into Weakness Enumerations (CWEs). The dataset was generated
different samples might yield unwanted results depending on by four different LLMs and features a unique set of designs
the application. for each of the 10 CWEs evaluated. The designs were labeled
11) DiverseVul dataset: DiverseVul [188] is proposed as based on the vulnerabilities identified by formal verification
a new vulnerable source code dataset that covers 295 than with unbounded proof. The LLMs evaluated include GPT-3.5-
the previous datasets combined. Furthermore, the dataset is Turbo, Perplexity AI, Text-Davinci-003, and LLaMA. The re-
60% bigger than previous open-source C/C++ datasets. The sults indicate that at least 60% of the samples from the 60,000
data is collected by crawling security issue websites and ex- SystemVerilog designs are vulnerable to CWEs, highlighting
tracting the commits. The security-based commits are labeled the need for caution when using LLM-generated code in real-
vulnerable before and non-vulnerable for the version after world projects.
the commit. DiverseVul covers over 797 projects and 7,514 15) PrimeVul: PrimeVul [95] dataset is a benchmark
commits with more than 130 CWEs. The MD5 hashes are dataset based on existing open-source datasets. Mainly taking
used to de-duplicate functions, yielding 18,495 vulnerable and into consideration BigVul [183], CrossVul [185], CVEfixes
330,492 non-vulnerable functions. The authors conduct several [184] and DiverseVul [188]. The proposed pipeline consists of
experiments to validate the dataset, combining their dataset merging, de-duplication, and labeling through 1) PRIMEVUL-
with previous datasets and showing insights and possibilities of ONEFUNC and 2) PRIMEVUL-NVDCHECK. ONEFUNC
their use. The paper shows that using their dataset along with selects only single functions that are associated with security
the previous datasets yields the best result in their experiments, commits. NVDCHECK is the compartment where a commit
as opposed to using a single dataset. is linked to its CVE and checked for in the NVD database.
12) FormAI dataset: The FormAI dataset [175] represents The function is labeled vulnerable if the description precisely
a significant advancement in cybersecurity and LLM, featuring mentions the function. The other case is the description
an extensive collection of 112,000 AI-generated, independent, containing the file name and the function being the single
and compilable C programs. This dataset is unique because function changed by a security commit. After such a process,
it utilizes dynamic zero-shot prompting techniques to create the yielded dataset consists of 7k vulnerable functions and
various programs, ranging from complex tasks like network 228,800 benign functions. The dataset spans 755 projects and
management and encryption to simpler ones like string manip- contains 6,827 commits. Their work also assesses the label
ulation. These programs were generated using GPT-3.5-turbo, quality of their dataset and other related datasets, showing
demonstrating the ability of Large Language Models (LLMs) low label errors in PrimeVul.

34
16) X1: X1 [82] dataset is constructed from several open- occur with explicit user approval. Therefore, the influence
source vulnerability datasets: CVEFixes, a Manually-Curated of untrusted or unfamiliar content on user prompts should
Dataset, and VCMatch. The dataset contains standalone func- be minimized to prevent indirect manipulations. Establishing
tions labeled as either vulnerable or non-vulnerable. The label- clear trust boundaries within the system is also crucial. These
ing process involves extracting functions from vulnerability- boundaries maintain user control and prevent unauthorized
fixing commits, assuming pre-change versions are vulnerable actions, safeguarding the system from external manipulations
and post-change versions are non-vulnerable. A modified [196].
dataset (X1) is created to address potential false positives, 3) Potential Attack Scenarios: The scenarios for prompt
containing only functions that were the sole change in a injection attacks are diverse and concerning. One scenario
commit. The final dataset consists of X1 without P3, which involves adversarial prompt injections on websites, leading
has 1334 samples, and X1 with P3, which has 22945 samples. to unauthorized actions by the LLM. Another potential threat
X1 without P3 is balanced, with a 1:1 ratio of positive to is hidden prompt injections in documents like resumes, de-
negative classes, while X1 with P3 is imbalanced, reflecting signed to manipulate the LLM’s output [197]. Furthermore,
the real-world distribution of vulnerable functions with a 1:34 there’s the risk of direct user control over the LLM through
ratio. The dataset size is relatively small, which may limit its prompt injections, where malicious users craft inputs to gain
representativeness of the real vulnerability distribution. undue influence over the model’s responses. By understanding
these risks and implementing robust prevention strategies,
VIII. LLM VULNERABILITIES AND M ITIGATION developers and users of LLMs can protect against potential
This section reviews the OWASP Top 10 for LLM Appli- exploitations [198].
cations project [190], a comprehensive initiative designed to
increase awareness about LLM security vulnerabilities. This B. Insecure Output Handling
project targets a wide audience, including developers, design- This issue arises when an application or plugin blindly
ers, architects, managers, and organizations that deploy and trusts LLM outputs, funneling them into client-side or backend
manage LLMs. Its core deliverable lists the top 10 most critical operations. Such oversight can lead to critical security risks
security vulnerabilities commonly found in LLM applications. like Cross-Site Scripting (XSS), Cross-Site Request Forgery
In addition, we include other LLM vulnerabilities not included (CSRF), Server-Side Request Forgery (SSRF), privilege esca-
in the OWASP project, as presented in Table XV. Figure 7 lation, and remote code execution.
presents LLM vulnerabilities included in the OWASP project. 1) Nature of Insecure Output Handling Vulnerabilities:
The core of the problem lies in the unverified acceptance
A. Prompt Injection of LLM outputs. For example, if LLM-generated content,
such as JavaScript or Markdown, is directly processed by a
Integrating LLMs into various digital platforms has brought
browser or a backend function, it can lead to XSS or remote
to light the critical issue of prompt injection [191]. This
code execution. This highlights the danger of assuming LLM
cybersecurity concern involves crafting inputs that manipulate
outputs are safe by default, emphasizing the need for thorough
LLMs, potentially leading to unauthorized system exploitation
validation and sanitization.
or sensitive information disclosure. As LLMs become more
2) Prevention Strategies: Preventing these vulnerabilities
prevalent, understanding and countering prompt injection at-
involves two key strategies. Firstly, implementing stringent
tacks is paramount for safeguarding the integrity and security
validation for LLM outputs before interacting with backend
of these systems [192].
functions can help identify and neutralize potential threats.
1) Nature of Prompt Injection Attacks: Prompt injection
Secondly, encoding LLM outputs before they reach the end
attacks in LLMs can manifest in various forms. One common
user can prevent misinterpretation of the code, thereby reduc-
method involves manipulating the model to retrieve private
ing the risk of malicious executions.
information. Attackers may craft inputs that subtly direct the
3) Potential Attack Scenarios: The scenarios for exploita-
LLM to divulge confidential data. Another technique involves
tion are varied. They range from an application inadvertently
embedding hidden prompts in web pages, which can solicit
allowing LLM-generated responses to manipulate internal
sensitive information from unsuspecting users [193]. In addi-
functions, leading to unauthorized actions, to an LLM-powered
tion, attackers might embed specific prompts in documents,
tool capturing and transmitting sensitive data to malicious
such as resumes, to alter the LLM’s output for deceptive pur-
entities. Other risks include allowing users to generate unvetted
poses. Finally, the risk of web plugins being exploited through
SQL queries through an LLM, which could result in data
rogue website instructions leads to unauthorized actions by the
breaches and the potential for LLMs to create and execute
LLM [194].
harmful XSS payloads.
2) Mitigation Strategies: To combat these threats, several
mitigation strategies can be employed. First, operational re-
strictions are vital; limiting the LLM’s capabilities to es- C. Adversarial Natural Language Instructions
sential functions significantly reduces the risk of malicious Wu et al. [199] proposed presented DeceptPrompt, high-
exploitation. Requiring user consent for sensitive operations lighting a critical vulnerability in Code LLMs: their sus-
is another critical measure [195]. This approach ensures that ceptibility to adversarial natural language instructions. These
high-risk activities or operations involving sensitive data only instructions are designed to appear benign while leading

35
TABLE XIV: Overview of Software Vulnerability Datasets that can be used for fine-tuning LLMs for software security.
Dataset Year Lang Source Multi- Type Samples Labelling Method Classification Challenges/Limitations
class Method
Sate IV - 2012 C, SARD Yes Synthetic Approx 60k Testcases are vulner- CWE Designed to be vulnerable,
Juliet C++ (C/C++) & able by design, with might not accurately depict
& 29k (Java) test corresponding patch real-world projects.
Java cases
Draper 2018 C Open-source Yes Real Total: 1.27M Static analyzers CWE Small percentage of vulnera-
[179] V: 82K NV: ble samples. Limited to four
1.19M categories.
Reveal 2018 C/C++ Open-source No Real Total: 18k V: Vulnerability-fixing Binary classes Imbalance in sample distribu-
[180] 1.6k NV: 16k commits identified by tion and only binary labeled.
security terms Limited to two projects.
Devign 2019 C Open-source No Real Total: 26K V: Binary Manual label- Binary classes Binary labeled. Partial dataset
[177] 12K NV: 14K ing release.
VUDENC 2019 Python Open-source Yes Real 1,009 commits Vulnerability-fixing Vulnerability Relatively small Dataset, No
[182] from 812 commits from GitHub type guarantee that the commits
repositories repositories fixed vulnerabilities.
BigVul 2020 C/C++ Open-source Yes Real Total: 264k Vulnerability-fixing CVE/CWE Significant class imbalance.
[183] V: 11.8k NV: commits from CVE Lack of CWEs/categories for
253k database all samples.
D2A [176] 2021 C/C++ Open-source Yes Real Total: 1.3M Vunerability-fixing Categories Small percentage of vulnera-
V: 18.6k NV: commits with static based on static ble samples. Manual valida-
1.27M analyzers analyzer tion shows low accuracy.
CVEfixes 2021 27 Open-source Yes Real 5,495 Vulnerability-fixing CVE/CWE Labelling accuracy needs en-
[184] lan- commits, commits from CVE hancement and dataset size in-
guages 50k methods database creased (only limited to CVE
records).
CrossVul 2021 40+ Open-source Yes Real 5,877 Vulnerability-fixing CVE/CWE Labelling accuracy needs en-
[185] lan- commits, commits from CVE hancement and dataset size in-
guages 27k (13,738 database creased. Takes the whole file
V/NV) files without pinpointing functions.
(only limited to CVE records).
SySeVR 2022 C/C++ SARD/NVD Yes Semi- Total: 15.6k Extracted from exist- CVE/CWE Limited subset of
[186] Synthetic V: 14.8k NV: ing databases NVD SARD/NVD. SARD is
811 and SARD synthetic, while NVD is
limited in the number of
labeled vulnerabilities.
DiverseVul 2023 C/C++ Open-source Yes Real Total: 349K Vulnerability-fixing CWE Labelling accuracy needs en-
[188] V: 18.9k NV: commits from hancement and dataset size in-
330K security trackers creased (specifically vulnera-
ble functions).
FormAI 2023 C AI-generated Yes Artificial Total: 112k V: Formal verification Custom Bounded formal verification
[175] 57k NV: 55K Bounded Model categories does not cover all types of
checker vulnerabilities and depth.
Chrysalis- 2024 C++ Open-source Yes Synthetic Over 1,000 Predefined errors Bug Type Addressing scalability and
HLS [79] function-level generalization challenges
HLS designs
FormAI v2 2024 C AI-generated Yes Artificial Total: 265k V: Formal verification Custom Bounded formal verification
[172] 168k NV: 23k Bounded Model categories does not cover all vulnerabil-
checker ities and depth.
ReFormAI 2024 System AI-generated Yes Artificial Total: 60k V: Formal verification CWE Formal verification with an
[189] Ver- 60k NV: 0 Bounded Model unbounded proof.
ilog checker
PrimeVul 2024 C/C++ Open-source Yes Real Total: 236k V: Single function se- CWE Limited vulnerable samples
[95] 7k NV: 229k lection and extraction due filtering existing samples
from NVD and specific function selec-
tion.
X1 [82] 2024 Java Open-source Yes Real Total: 22.9k Analyzing Binary classes Imbalanced, small, and may
V: 0.6k NV: vulnerability-fixing not represent the true vulner-
22.3k commits ability distribution.
V : Vulnerable , NV: Non Vulnerable

Code LLMs to produce functionally accurate code containing erated code. Enriching the training of LLMs with adversarial
hidden vulnerabilities. The DeceptPrompt algorithm utilizes a examples produced by DeceptPrompt is recommended to boost
sophisticated evolution-based methodology with a fine-grained their defense against security threats. Furthermore, continuous
loss design, crafting deceptive instructions that maintain the updates and security patches, informed by the latest cybersecu-
appearance of normal language inputs while introducing se- rity research, are crucial for maintaining the LLMs’ defenses
curity flaws into the generated code. This vulnerability is against new adversarial techniques. Addressing these chal-
exacerbated by the challenges in preserving the code’s func- lenges involves preserving the code’s functionality, targeting
tionality, targeting specific vulnerabilities, and maintaining the specific vulnerabilities, and maintaining the semantics of the
semantics of the natural language prompts [200]. natural language prompts used in the generation process.
1) Prevention Strategies: The study suggests a set of pre- 2) Potential Attack Scenarios: The authors highlight vari-
vention strategies to counter these threats. This involves in- ous potential attack scenarios that could exploit the vulnera-
tegrating advanced code validation mechanisms within LLMs bilities exposed by DeceptPrompt. These scenarios include at-
to identify and mitigate potential vulnerabilities in the gen- tackers using crafted natural language prompts to induce Code

36
Fig. 7: LLM vulnerabilities included in the OWASP project.

LLMs into generating code with vulnerabilities, leading to data • Advanced Alignment Algorithms: Develop new align-
breaches, unauthorized access, or system compromises. The ment strategies that are more resistant to adversarial
effectiveness of DeceptPrompt in real-world settings under- manipulations, ensuring that maliciously crafted suffixes
scores the urgency for robust security measures in Code LLMs, cannot easily override guardrails.
given their increasing use in critical systems and infrastructure. • Real-Time Monitoring: Implement systems capable of
The challenges in preserving the code’s functionality, targeting detecting suspicious prompt patterns or gradients in real
specific vulnerabilities, and maintaining the semantics of the time, enabling swift neutralization of emerging attacks.
natural language prompts add complexity to these potential • Ongoing Model Retraining: Continuously update mod-
attack scenarios, amplifying the need for enhanced security els with fresh adversarial examples to bolster resilience
protocols in Code LLMs. against newly discovered attack vectors.
• Adaptive Response Mechanisms: Design LLMs and sup-
D. Automatic Adversarial Prompt Generation porting infrastructure to adapt to changing tactics, reduc-
1) Nature of the Attack: Zou et al. [201] propose a method ing the window of opportunity for adversaries.
for automatically generating adversarial prompts in aligned 3) Potential Attack Scenarios: Automated adversarial
language models. Specifically, they craft a targeted suffix prompt generation can pose significant risks in various con-
that, when appended to diverse LLM queries, maximizes the texts:
likelihood of producing objectionable or undesirable content. • Widespread Content Manipulation: Attackers may propa-
Unlike earlier approaches, this method employs automated gate malicious suffixes across large-scale user interactions
techniques such as greedy and gradient-based search to cir- (such as web forums or social media), causing misaligned
cumvent existing alignment measures systematically. By ex- outputs on a broad scale.
ploiting gaps in the alignment framework, attackers can bypass • Targeted Model Evasion: In specialized applications like
safeguards to prevent harmful or prohibited outputs. content filtering or customer support bots, adversaries
2) Prevention Strategies: The findings from Zou et al. [201] might exploit gradient-based techniques to bypass spe-
highlight the need for comprehensive defenses against auto- cific policy checks repeatedly.
mated adversarial prompt generation. Potential countermea- • Dynamic Model Attacks: As new alignment protocols are
sures include: introduced, attackers can use automated search methods

37
TABLE XV: Overview of LLM vulnerabilities and Mitigation
Vulnerabilities Nature of the Vulnerability Examples Mitigation Strategies Potential Attack Scenarios
Prompt Injection Manipulation of LLMs through
crafted inputs leading to unautho- • Hidden prompts in web • Operational restrictions • Adversarial injections on web-
rized exploitation or sensitive in- pages • User consent for sensitive op- sites
formation disclosure. • Deceptive documents erations • Hidden prompts in documents
• Rogue web plugin in- • Trust boundaries establishment • Direct user control through
structions crafted inputs

Insecure Output Blind trust in LLM outputs lead


Handling to security risks like XSS, CSRF, • Direct processing • Validation of LLM outputs • LLM responses manipulating
SSRF, etc. of LLM-generated • Encoding outputs before reach- internal functions
JavaScript or Markdown ing end-users • Generating unvetted SQL
queries
• Creating harmful XSS pay-
loads

Inference Data Poi- Stealthy activation of malicious


soning responses under specific opera- • Conditions based on • Monitoring and anomaly de- • Manipulated responses under
tional conditions such as token- token-output limits in tection systems specifically de- token limitations leading to
limited output. user settings signed for conditional outputs misinformation
• Stealthily altered outputs • Regular audits of outputs under • Triggered malicious behavior
when cost-saving modes various token limitations in cost-sensitive environments
are enabled

Adversarial Code LLMs produce functionally


Natural Language accurate code with hidden vulner- • DeceptPrompt algorithm • Advanced code validation • Crafted prompts leading to
Instructions abilities due to adversarial instruc- creating deceptive • Training LLMs with adversar- code with vulnerabilities
tions. instructions ial examples • Unauthorized access or system
• Continuous updates and secu- compromises
rity patches

Automatic Automated methods to generate


Adversarial Prompt prompts that bypass LLM align- • Crafting specific suffixes • Developing advanced align- • Bypassing alignment measures
Generation ment measures. for objectionable content ment algorithms leading to the generation of ob-
generation • Real-time monitoring jectionable content
• Training models with new ad-
versarial examples

Training Data Poi- Manipulation of training data to


soning skew LLM learning, introducing • Injecting biased or harm- • Verifying data sources • Misleading outputs spreading
biases or vulnerabilities. ful data into training sets • Employing dedicated models biased opinions
• Sandboxing, input filters • Injection of false data into
• Monitoring for poisoning signs training

Insecure Plugins Vulnerabilities in plugin design


and interaction with external sys- • Inadequate input valida- • Rigorous input validation • Exploiting input handling vul-
tems or data sources. tion • Adherence to least privilege nerabilities
• Overprivileged access • Secure API practices • Overprivileged plugins for
• Insecure API interactions • Regular security audits privilege escalation
• SQL injections

Denial of Service Attempts to make a system inac-


(DoS) Attack cessible by overwhelming it with • Volume-based attacks • Rate limiting • Overloading servers
traffic or triggering crashes. • Protocol attacks • Robust infrastructure • Disrupting communication be-
• Application layer attacks • Continuous monitoring and tween users and services
rapid response • Straining system resources

to uncover fresh vulnerabilities, fueling an ongoing arms significantly compromising the model’s security, effectiveness,
race between defenders and adversaries. and ethical behavior. Examples include intentionally including
targeted, inaccurate documents, training models using unver-
E. Training Data Poisoning ified data, or allowing unrestricted dataset access, leading to
loss of control. Such actions can detrimentally affect model
Training Data Poisoning in LLMs represents a critical
performance, erode user trust, and harm brand reputation
security and ethical issue, where malicious actors manipulate
[203].
the training dataset to skew the model’s learning process. This
manipulation can range from introducing biased or incorrect 2) Prevention Strategies: To combat training data poison-
data to embedding hidden, harmful instructions, compromising ing, several prevention strategies are essential. Firstly, verify-
model integrity and reliability. The impact is profound, as ing the supply chain of training data and the legitimacy of data
poisoned LLMs may produce biased, offensive, or inaccurate sources is crucial. This step ensures the integrity and quality
outputs, raising significant challenges in detection due to the of the data used for training models. Employing dedicated
vast and complex nature of training datasets [202]. models for specific use cases can help isolate and protect
1) Nature of Training Data Poisoning: Training data poi- different applications from a compromised data source [201].
soning in LLMs occurs when an attacker deliberately ma- Another effective strategy is implementing sandboxing and
nipulates the training data or fine-tuning processes. This input filters and ensuring adversarial robustness. In addition,
manipulation introduces vulnerabilities, backdoors, or biases, regularly monitoring for signs of poisoning attacks through

38
loss measurement and model analysis is vital in identifying this vector, demonstrating high success rates in controlled
and mitigating such threats. experiments, highlighting the need for heightened security
The prevention of training data poisoning in LLMs can measures in environments where LLMs are deployed.
be significantly bolstered by incorporating advanced strategies
before and after the training phase. The pre-training defense
G. Insecure Plugins
is a dataset-level strategy that filters suspicious samples from
the training data. This method assumes that text and image 1) Nature of Insecure Plugins: The nature of insecure
pairs (i.e., Multimodal data) in a dataset should be relevant plugins in LLMs revolves around several key vulnerabilities
to each other. The post-training defense is another crucial that stem from how these plugins are designed, implemented,
strategy, which involves ”sterilizing” a poisoned model by and interact with external systems or data sources. These
further fine-tuning it on clean data, thus maintaining its utility. vulnerabilities can compromise the security, reliability, and
This is conducted by fine-tuning the poisoned models on a integrity of both the LLM and the systems it interacts with.
clean dataset (e.g., the VG dataset in the study) with a specific The primary issues associated with insecure plugins in LLMs
learning rate [202]. include inadequate input validation, overprivileged access,
3) Potential Attack Scenarios: Several potential attack sce- insecure API interactions, SQL injection, and database vul-
narios arise from training data poisoning. These include the nerabilities.
generation of misleading LLM outputs that could spread 2) Prevention Strategies: To counter the Insecure Plugins,
biased opinions or even incite hate crimes. Malicious users a multi-faceted approach to security is essential. Implementing
might inject false data into training, intentionally skewing rigorous input validation, including type-checking, sanitiza-
the model’s outputs [204]. Adversaries could also manipulate tion, and parameterization, is crucial, especially in data query
a model’s training data to compromise its integrity. Such construction. Adhering to the principle of least privilege is key
scenarios highlight the need for stringent security measures in plugin design; each plugin should only access necessary
in the training and maintaining LLMs, as the implications of resources and functionalities. Ensuring secure API practices
compromised models extend beyond technical performance to and avoiding direct URL construction from user inputs is vital.
societal impacts and ethical considerations. Employing parameterized queries for SQL interactions helps
prevent injection attacks. In addition, regular security audits
and vulnerability assessments are necessary to identify and
F. Inference Data Poisoning address potential weaknesses proactively.
1) Nature of Inference Data Poisoning: Inference data 3) Potential Attack Scenarios: Various attack scenarios
poisoning targets LLMs during their operational phase, unlike emerge from Insecure Plugins. For instance, an attacker could
training-time attacks that tamper with a model’s training exploit input handling vulnerabilities to extract sensitive data
dataset. This attack subtly alters the input data to trigger or gain unauthorized system access. Overprivileged plugins
specific, often malicious behaviors in a model without any could be used for privilege escalation, allowing attackers to
modifications to the model itself. The approach detailed by perform restricted actions. Manipulation of API calls can lead
He et al. [205] utilizes a novel method where the poison is to redirection to malicious sites, opening doors to further
activated not by obvious, fixed triggers but by conditions re- system exploits. SQL injection through plugin queries can
lated to output token limitations. Such conditions are generally compromise database integrity and confidentiality, leading to
overlooked as they are a part of normal user interactions aimed significant data breaches.
at managing computational costs, thus enhancing the stealth
and undetectability of attacks.
H. Denial of Service (DoS) attack
2) Prevention Strategies: Preventing inference data poison-
ing requires a multi-faceted approach. Firstly, robust anomaly 1) Nature of DoS attack: A Denial of Service (DoS) attack
detection systems can be implemented to scrutinize input is a malicious attempt to disrupt the normal functioning of a
patterns and detect deviations from typical user queries. Reg- targeted system, making it inaccessible to its intended users.
ular audits of model responses under various conditions can The attack typically involves overwhelming the target with a
also help identify any inconsistencies that suggest poisoning. flood of internet traffic. This could be achieved through various
Implementing stricter input handling controls and limiting means, such as sending more requests than the system can
the impact of token limitation settings could also reduce handle or sending information that triggers a crash. In the
vulnerabilities. context of services like LLM, a DoS attack could bombard the
3) Potential Attack Scenarios: The potential scenarios for service with a high volume of complex queries, significantly
inference data poisoning are varied and context-dependent. slowing down the system or causing it to fail [206].
For example, in a cost-sensitive environment where users 2) Potential Attack Scenarios: The DoS attacks against
frequently limit token outputs to manage expenses, an attacker LLM can be divided into three categories: volume-based
could leverage this setting to trigger harmful responses from attacks, protocol attacks, and application layer attacks.
the model. Such scenarios could include delivering incorrect or • Volume-based Attacks: This is the most straightforward
biased information, manipulating sentiment in text generation, kind of DoS attack, where the attacker attempts to sat-
or generating content that could lead to reputational damage or urate the bandwidth of the targeted system. For LLM,
legal issues. The BrieFool framework [205] effectively exploits this could involve sending many requests simultaneously,

39
more than what the servers are equipped to handle, LLMs are expected to assist in managing this data deluge
leading to service disruption [207]. efficiently. However, ensuring these models can process vast
• Protocol Attacks: These attacks exploit weaknesses in the amounts of data accurately and identify threats amidst this
network protocol stack layers to render the target inacces- complexity is daunting, necessitating high levels of efficiency
sible. They could involve, for instance, manipulating the and accuracy in the LLMs. The corporation faced a situation
communication process between the user and the LLM where the LLM failed to recognize a sophisticated cyber-
service in a way that disrupts or halts service [208]. attack hidden within the massive influx of data. This oversight
• Application Layer Attacks: These are more sophisticated occurred because the model hadn’t been trained with the latest
and involve sending requests that appear to be legitimate attack patterns, highlighting a gap in its learning. The incident
but are designed to exhaust application resources. For underscored the need for LLMs to process data efficiently and
LLM, this could involve complex queries requiring ex- maintain high accuracy and adaptability in threat detection.
tensive processing power or memory, thereby straining 3) Training Data Availability and Quality: A critical chal-
the system [209]. lenge for AI-based cyber defense is the lack of high-quality,
3) Prevention Strategies: To combat DoS attacks in LLM accessible training data, as organizations generally hesitate
services, the following prevention strategies can be applied: to share sensitive information. The effectiveness of LLMs in
• Rate Limiting: Implementing a rate-limiting strategy is cybersecurity depends heavily on the quality and availability of
crucial. This involves limiting the number of requests a training data. Overcoming this data gap remains a significant
user can make within a given timeframe, which helps hurdle, whether through synthetic data generation or other
prevent an overload of the system. means.
• Robust Infrastructure: A robust and scalable server in- 4) Developing and Training Custom Models for Unique
frastructure can help absorb the influx of traffic during Cybersecurity Domains: Certain specialized areas in cyber-
an attack. This could involve using load balancers, re- security require custom models due to their unique vocab-
dundant systems, and cloud-based services that can scale ularies or data structures, which standard LLMs might not
dynamically in response to increased traffic. address adequately. Unique Vocabularies and Data Structures:
• Monitoring and Rapid Response: Continuous traffic mon- Cybersecurity domains, such as network security, malware
itoring can help quickly identify unusual patterns in- analysis, and threat intelligence, have their terminologies,
dicative of a DoS attack. Once detected, rapid response data formats, and communication protocols. Standard LLMs,
measures, such as traffic filtering or rerouting, can be typically trained on general datasets, might not be familiar with
employed to mitigate the attack. these specialized terms and structures, leading to ineffective
or inaccurate threat detection and response. Customizing and
training these models to handle specific cybersecurity scenar-
IX. LLM C YBERSECURITY I NSIGHTS , C HALLENGES AND
ios is complex and demands substantial resources, presenting
L IMITATIONS
a significant challenge in the field.
A. Challenges and Limitations 5) Real-Time Information Provision by Security Copilots:
1) Adapting to Sophisticated Phishing Techniques: The Security copilots powered by LLMs need to provide accurate,
increasing sophistication of phishing attacks, especially those up-to-date information in real-time to be effective in the
enhanced by AI, presents a major challenge for LLMs in dynamic threat landscape of cybersecurity. Ensuring the rele-
cybersecurity. These models need to evolve to identify and vance and accuracy of information provided by these models
counteract these threats effectively continuously. The chal- in real-time is challenging but essential for effective responses
lenge lies in the need for regular updates and training to keep to cybersecurity threats.
pace with the advanced tactics of attackers, which demands
substantial resources and expertise. For example, a large com-
pany implemented an LLM-based security system to detect B. LLM Cybersecurity Insights
phishing emails. Initially, the system was highly effective, Table XVI presents various facets of LLM integration into
identifying and blocking 95% of phishing attempts. However, cybersecurity, providing insights into architectural nuances,
attackers quickly adapted, using AI to generate more con- dataset creation, pre-training, fine-tuning methodologies, eval-
vincing phishing emails that mimicked the company’s official uation metrics, advanced techniques, deployment strategies,
communication style and included personalized information security measures, and optimization approaches.
about the customers. The company’s LLM struggled to keep 1) LLM architecture: A cyber security scientist venturing
up with these advanced tactics. Phishing emails have become into utilizing LLMs must understand the architecture’s nuances
so sophisticated that they can bypass traditional detection (presented in Section III) to tailor these tools for security ap-
methods, significantly increasing the number of successful plications effectively. Understanding the architecture of LLMs,
attacks. Hence, evolving and adapting LLMs in cybersecurity including their ability to process and generate language-based
to combat AI-enhanced phishing threats is an open challenge. data, is crucial for detecting phishing attempts, deciphering
2) Managing Data Overload in Enterprise Applications: malicious scripts, or identifying unusual patterns in network
With the proliferation of enterprise applications, IT teams are traffic that may indicate a breach. Knowledge of how these
overwhelmed by the sheer volume of data they need to manage models tokenize input data, their attention mechanisms to
and secure, often without corresponding increases in staffing. weigh information, and their output generation techniques

40
TABLE XVI: LLM Cybersecurity Insights.
Aspect Details Tools/Methods Applications
Architecture Focus on model components such as Threat Detection and Analysis, Security Au-
tokenization, attention mechanisms, and • Paper: Attention Is All You Need tomation, Cyber Forensics, Penetration Test-
output generation. ing, Security Training and Awareness, and
Chatbots.
Cyber Security Creation of prompt-response pairs that Building datasets that mirror real-world
Dataset simulate cyber threats using synthetic • OpenAI API for synthetic data threats for training and refining LLMs.
data. • Evol-Instruct for data refinement
• Regex filtering for uniqueness

Pre-training Models Training on large-scale datasets com- Preparing LLMs to understand and predict
prising billions of tokens, filtered and • Megatron-LM for handling large cybersecurity-specific content accurately.
aligned with cybersecurity lexicon. datasets
• gpt-neox for sequential data handling
• Distributed training tools

Supervised Fine- Incorporating specialized cybersecurity Enhancing LLMs to address unique cyberse-
Tuning datasets into pre-trained models for tai- • LoRA for parameter-efficient adjust- curity threats and scenarios specifically.
lored applications. ments
• QLoRA for quantization and efficient
memory management

Cyber Security Setting up specialized frameworks and Evaluating how well LLMs detect, under-
Evaluation datasets to test LLMs against potential • Bespoke cybersecurity benchmarks stand, and respond to cyber threats.
cyber threats. • Authoritative datasets for threat detec-
tion

Advanced LLM Implementing techniques like RAG and Improving response relevance and accuracy
Techniques RLHF to augment LLMs with real-time • RAG for context retrieval from in cybersecurity applications.
data and expert-aligned feedback. databases
• RLHF with specialized preference
datasets and reward models

LLM Deployments Adopting deployment strategies that Deploying LLMs in various environments
range from local installations to large- • Platforms like Gradio and Streamlit to ensure accessibility and responsiveness
scale server setups. for prototyping across devices.
• Cloud services for robust deployment
• Edge deployment strategies for
resource-limited environments

Securing LLMs Addressing vulnerabilities unique to Preventing and mitigating security threats to
LLMs such as prompt hacking and train- • Security measures like prompt injec- maintain data integrity and model reliability
ing data leakage. tion prevention in LLMs.
• Red teaming
• Continuous monitoring systems

Optimizing LLMs Implementing strategies to reduce mem- Enabling efficient LLM operation on various
ory and computational requirements • Model quantization hardware, making them scalable and practical
while maintaining output quality. • Use of bfloat16 data formats for diverse applications.
• Optimization of attention mechanisms

provide the foundational skills necessary to tweak models for a pre-defined vocabulary to ensure relevance and accuracy.
optimized threat detection and response [210]. Techniques such as causal language modeling, distinct from
2) Building Cyber Security dataset: Building a robust cy- masked language modeling, are employed, where the loss
bersecurity dataset using LLMs involves generating and refin- functions and training methodologies, such as those used in
ing intricate prompt-response pairs to mirror real-world cyber Megatron-LM [213] or gpt-neox [214], are optimized for han-
threats. Employing synthetic data generation via the OpenAI dling sequential data predictively. Understanding the scaling
API allows for diverse cybersecurity scenarios, while advanced laws is crucial, as these laws help predict how increases in
tools like Evol-Instruct [211] enhance dataset quality by model size, dataset breadth, and computational power can
adding complexity and removing outdated threats. Techniques proportionally enhance model performance [215]. While in-
such as regex filtering and removing near-duplicates ensure depth knowledge of High-Performance Computing (HPC) isn’t
the data’s uniqueness and relevance. In addition, familiarizing necessary for using pre-trained models, it becomes essential
with various prompt templates like Alpaca [212] is essential when building a large-scale language model for cyber security
for structuring this data effectively, ensuring that the LLM from scratch, requiring an understanding of hardware capabil-
can be finely tuned to respond to the nuanced landscape of ities and managing distributed workloads effectively.
cybersecurity challenges efficiently. Most pre-training LLM models are trained using smdis-
3) Pre-training models: Pre-training a model for cyberse- tributed libraries, proposed by AWS SageMaker, which of-
curity tasks involves a complex and resource-intensive pro- fer robust solutions for distributed training machine learning
cess to prepare a language model to understand and predict models, enhancing efficiency on large-scale deployments. The
cybersecurity-specific content. This requires a massive dataset smdistributed.dataparallel library supports data parallelism,
comprising billions or trillions of tokens, which undergo optimizing GPU usage by partitioning the training data across
rigorous processes like filtering, tokenizing, and aligning with multiple GPUs, thus speeding up the learning process and

41
Fig. 8: Parameter Efficient Fine-Tuning (PEFT) provides an efficient approach by minimizing the number of parameters needed
for fine-tuning and reducing memory consumption comparable to that of traditional fine-tuning.

minimizing communication overhead. On the other hand, time threat analysis. This strategic fine-tuning enhances model
smdistributed.modelparallel is tailored for model parallelism, specificity and significantly boosts their utility in practical
allowing large models to be split across multiple GPUs when cybersecurity applications.
a single model cannot fit into the memory of one GPU. These 5) Cyber Security Evaluation: To evaluate the code gen-
tools seamlessly integrate with frameworks like TensorFlow, eration models, Hugging Face uses the following 7 code
PyTorch, and MXNet, simplifying the implementation of com- generation Python tasks: DS-1000, MBPP, MBPP+, APPS, In-
plex distributed training tasks. structHumanEval, HumanEval+, and HumanEval [227]–[233].
4) Supervised Fine-Tuning: Supervised fine-tuning (SFT) In cybersecurity, evaluating large language models demands
of pre-trained Large Language Models for cybersecurity ap- a specialized framework considering such applications’ unique
plications enables these models to move beyond basic next- security and accuracy needs. When setting up evaluation met-
token prediction tasks, transforming them into specialized rics for cybersecurity-focused LLMs, test cases should closely
tools tailored to specific cybersecurity needs. This fine-tuning mimic potential security scenarios to assess how well the
process allows for incorporating proprietary or novel datasets model detects, understands, and responds to cyber threats. This
that have not been previously exposed to models like Fal- involves configuring the LLM with tailored inputs, expected
con 180b, providing a significant edge in addressing unique outputs, and security-specific contextual data [30]. For in-
security challenges. Figure 8 outlines a comprehensive three- stance, IBM’s D2A dataset [176] and Microsoft’s dataset [234]
step process for training a large language model specialized aids in evaluating AI models’ capability to identify software
in cybersecurity, beginning with unsupervised pre-training on vulnerabilities using specific metrics such as accuracy.
a vast corpus of cybersecurity-related texts, including diverse Table XVII compares benchmarks for evaluating LLMs
data such as malware, network security, and dark web content. in cybersecurity knowledge. CyberMetric [104] is a bench-
Following this, the model undergoes traditional fine-tuning mark dataset designed explicitly for evaluating large language
using a smaller, targeted dataset to refine its capabilities for models in knowledge cybersecurity. It consists of 10,000
specific cybersecurity tasks. However, the Parameter-Efficient questions derived from various authoritative sources within
Fine-Tuning (PEFT) [216] involves freezing the original model the field. The dataset is used to measure the knowledge of
weights and fine-tuning a small set of new parameters, enhanc- LLMs across a spectrum of cybersecurity topics, facilitating
ing the model’s adaptability and efficiency while minimizing direct comparisons between human expertise and machine
the risk of overfitting, thus preparing the LLM to tackle capabilities. This unique dataset aids in understanding the
advanced cybersecurity challenges efficiently. strengths and limitations of LLMs in cybersecurity, providing
Techniques such as LoRA (Low-rank Adapters) [217] offer a foundation for further development and specialized training
a parameter-efficient approach by adjusting only a subset of the of these models in this critical area.
model’s parameters, thus optimizing computational resources Similar to the CyberMetric benchmark, Meta proposed the
while maintaining performance. More advanced methods like CyberSecEval 2 benchmark [223] to quantify security risks
QLoRA [218] enhance this by quantizing the model’s weights associated with LLMs such as GPT-4 and Meta Llama 3 70B-
and managing memory more efficiently, making executing Instruct. They highlight new testing areas, notably prompt
these operations even on limited platforms like Google Colab injection and code interpreter abuse, revealing that mitigating
with only one GPU A100. In addition, tools like Axolotl and attack risks in LLMs remains challenging, with significant
DeepSpeed [219], [220] facilitate the deployment of these fine- rates of successful prompt injections. The study also explores
tuned models across various hardware setups, ensuring that the safety-utility tradeoff, proposing the False Refusal Rate
the enhanced models can be scaled efficiently for real-world (FRR) to measure how conditioning LLMs to avoid unsafe
cybersecurity tasks, ranging from intrusion detection to real- prompts might reduce their overall utility by also rejecting

42
TABLE XVII: Comparison of Benchmarks for Evaluating LLMs in Cybersecurity Knowledge
Benchmark Source Year Description Key Features and Metrics
CyberSecEval 1 Meta 2023 A benchmark tests LLMs across two critical security do- It measures the frequency and
[221] mains—generating insecure code and compliance with re- conditions LLMs propose insecure
quests to assist in cyberattacks. code solutions under.
SecQA Liu et al. 2023 A dataset of multiple-choice questions designed to evaluate Evaluates understanding and appli-
[222] the performance of LLMs in computer security. Features two cation of security principles.
versions of varying complexity and tests LLMs in both 0-shot
and 5-shot learning settings.
CyberMetric Tihanyi et 2024 A dataset designed for evaluating LLMs in cybersecurity Direct comparison between human
al. [104] knowledge, consisting of 10,000 questions from various au- expertise and LLMs.
thoritative sources. Used to measure the spectrum of cyber-
security topics covered by LLMs.
CyberSecEval 2 Meta 2024 Focuses on quantifying security risks associated with LLMs, Testing areas: prompt injection,
[223] such as prompt injection and code interpreter abuse. It high- code interpreter abuse; Metric:
lights challenges in mitigating attack risks and introduces the FRR.
False Refusal Rate (FRR) metric.
WMDP-Cyber Li et al. 2024 Consists of 3,668 multiple-choice questions designed to mea- Covers biosecurity, cybersecurity,
[224] sure LLMs’ knowledge in biosecurity, cybersecurity, and and chemical security.
chemical security. Excludes sensitive and export-controlled
information.
LLM4Vuln Sun et al. 2024 A unified evaluation framework for assessing the vulnerability Focuses on vulnerability reasoning
[225] reasoning capabilities of LLMs, using 75 verified high-risk in LLMs.
smart contract vulnerabilities in 4,950 scenarios across three
LLMs.
CyberBench Liu et al. 2024 A domain-specific, multi-task benchmark for assessing LLM Includes diverse tasks such as vul-
[226] performance in cybersecurity tasks. nerability detection, threat analysis,
and incident response.

benign requests. Additionally, the research assesses LLMs’ vulnerabilities were tested in 4,950 scenarios across three
capabilities in automating core cybersecurity tasks, suggesting LLMs: GPT-4, Mixtral, and Code Llama.
that models with coding abilities perform better in exploit 6) Advanced LLM techniques (RAG and RLHF): Advanced
generation tasks. The benchmark code is open source to techniques like Retrieval Augmentation Generation (RAG)
facilitate further research 3 . can significantly enhance Language Model performance by
Liu et al. introduced SecQA [222], a novel dataset designed enabling the model to access external databases for additional
to evaluate the performance of LLMs in computer security. context and information, making it highly effective in special-
The dataset, generated by GPT-4, consists of multiple-choice ized fields such as cybersecurity. In cybersecurity applications,
questions to assess LLMs’ understanding and application of RAG can dynamically retrieve up-to-date information from
security principles. SecQA features two versions with varying well-known databases such as CVE (Common Vulnerabilities
complexity to challenge LLMs across different difficulty lev- and Exposures), CWE (Common Weakness Enumeration), and
els. The authors comprehensively evaluated prominent LLMs, the NIST (National Institute of Standards and Technology)
including GPT-3.5-Turbo, GPT-4, Llama-2, Vicuna, Mistral, database [243]. This capability allows the model to offer
and Zephyr, in both 0-shot and 5-shot learning settings. current and specific advice regarding vulnerabilities, threat
The findings from the SecQA v1 and v2 datasets reveal intelligence, and compliance standards. Integrating real-time
diverse capabilities and limitations of these models in han- data from these authoritative sources into the response gener-
dling security-related content. Li et al. [224] introduced the ation process allows RAG to empower Language Models to
Weapons of Mass Destruction Proxy (WMDP) benchmark. deliver precise and contextually relevant cybersecurity insights
This publicly available dataset consists of 3,668 multiple- without extensive retraining, thus enhancing decision-making
choice questions designed to measure LLMs’ knowledge in in critical security operations [244].
biosecurity, cybersecurity, and chemical security, ensuring the Reinforcement Learning from Human Feedback (RLHF) is
exclusion of sensitive and export-controlled information. Sun an advanced method to enhance LLMs tailored for cybersecu-
et al. [225] introduced LLM4Vuln, a unified evaluation frame- rity applications, focusing on aligning the model’s responses
work designed to precisely assess the vulnerability reasoning with expert expectations in the security domain. This involves
capabilities of LLMs independent of their other functions such utilizing specialized preference datasets, which contain re-
as information seeking, knowledge adoption, and structured sponses ranked by cybersecurity professionals, presenting a
output. This framework aims to determine how enhancing more challenging production process than typical instructional
these separate capabilities could boost LLMs’ effectiveness in datasets. Techniques like Proximal Policy Optimization (PPO)
identifying vulnerabilities. To test the efficacy of LLM4Vuln, leverage a reward model to evaluate how well text outputs
controlled experiments were conducted with 75 verified high- align with security expert rankings, refining the model’s
risk smart contract vulnerabilities sourced from Code4rena training through adjustments based on KL divergence [240].
audits conducted between August and November 2023. These Direct Preference Optimization (DPO) further optimizes this
by framing it as a classification challenge, using a stable
3 https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks
reference model that avoids the complexities of training reward

43
TABLE XVIII: Optimization Strategies for Large Language Models in Cybersecurity
Optimization Description Key Benefits Cybersecurity Use Case Scenarios
Strategy
Advanced Attention Implements techniques like Flash Attention [235] Speeds up processing saves com- Efficient processing of long log files and network
Mechanisms to optimize self-attention layers, reducing compu- pute resources. traffic data for anomaly detection.
tation times, particularly effective for long input
sequences.
Bitsnbytes Introduces k-bit quantization (notably 8-bit) us- Halves memory usage without Efficient real-time malware analysis and intrusion
ing block-wise methods to maintain performance loss in performance. detection on edge devices.
while halving memory usage.
GPTQ [236] A novel quantization method for GPT models Compresses model size, mini- Deploying large-scale threat prediction models on
that reduces bit width to 3 or 4 bits per weight, mizes accuracy loss. consumer-grade hardware.
enabling the operation of large models on single
GPUs with minimal accuracy loss.
GGUF Quantization Optimized for quick model loading and saving, Enhances efficiency of model de- Rapid deployment of updated models to respond
making LLM inference more efficient. Supported ployment. to emerging threats and vulnerabilities.
by Hugging Face Hub.
QLoRA [218] Enables training using memory-saving techniques Preserves performance with re- Training complex cybersecurity models on sys-
with a small set of trainable low-rank adaptation duced memory. tems with limited memory resources.
weights.
Lower-precision Uses formats like bfloat16 instead of float32 for Reduces computational Enhancing the speed and efficiency of continuous
data Formats training and inference to optimize resource usage overhead. cybersecurity monitoring systems.
without compromising performance accuracy.
FSDP-QLoRA Combines Fully Sharded Data Parallelism (FSDP) Scales up model training across Enabling the collaborative training of security
with 4-bit quantization and LoRA to shard model multiple GPUs. models across different organizations without re-
parameters, optimizer states, and gradients across quiring top-tier hardware.
GPUs.
Half-Quadratic A model quantization technique that enables the Works efficiently with HQQ can be employed in cybersecurity to pro-
Quantization quantization of large models rapidly and accu- CUDA/Triton kernels and tect models by reducing the precision of model
(HQQ) [237] rately without the need for calibration data. aims for seamless integration weights, making it harder for attackers to reverse
with torch.compile. engineer or tamper with the models..
Multi-token Predic- A new training approach for large language mod- Models trained with 4-token pre- Multi-token prediction can enhance the modeling
tion [238] els where models predict multiple future tokens dictions can achieve up to 3x of sophisticated cyber attack patterns.
simultaneously rather than the next token only. faster inference speeds, even with
large batch sizes.
Trust Region An advanced policy gradient method in reinforce- TRPO enhances training stability In environments with dynamic and evolving
Policy Optimization ment learning that addresses the inefficiencies of by using trust regions to prevent threats, TRPO can help maintain a stable and
(TRPO) [239] standard policy gradient methods. overly large updates that could effective response mechanism, adjusting policies
destabilize the policy. incrementally to handle new types of malware.
Proximal Policy A reinforcement learning technique designed to Prevents ”falling off the cliff” By limiting the extent of policy updates, PPO
Optimization (PPO) improve training stability by cautiously updating scenarios where a policy update helps maintain a steady adaptation to evolving cy-
[240] policies. is too large could irreversibly bersecurity threats, reducing the risk of overfitting
damage the policy’s effective- to specific attack patterns.
ness.
Direct Preference A fine-tuning methodology for foundation mod- Requires significantly less data Reduces the computational and data demands of
Optimization els optimize policies directly using a Kull- and compute resources than pre- continuously training cybersecurity models, allow-
(DPO) [241] back–Leibler divergence-constrained framework, vious methods like PPO. ing for more scalable solutions.
removing the need for a separate reward model.
Odds Ratio Pref- An algorithm designed for supervised fine-tuning Eliminates the need for an Enables dynamic adaptation of security models
erence Optimization (SFT) of language models that optimizes prefer- additional preference alignment to new and evolving cyber threats by optimizing
(ORPO) [242] ence alignment without the need for a separate phase, simplifying the fine- preference alignment efficiently.
reference model. tuning process.

models and requires minimal hyperparameter adjustments invaluable tools for tracking vulnerabilities, managing risks,
[245]. These methods are crucial for reducing biases, fine- and adhering to industry standards in the cybersecurity domain
tuning threat detection accuracy, and enhancing the overall [247].
effectiveness of cybersecurity-focused LLMs.
7) LLM deployments: Deploying LLMs offers a range of
In practical cybersecurity applications, the integration of approaches tailored to the scale and specific needs of different
RAG can be facilitated by orchestrators like LangChain, applications. At one end of the spectrum, local deployment of-
LlamaIndex, and FastRAG, which connect Language Models fers enhanced privacy and control, utilizing platforms like LM
to relevant tools, databases, and other resources. These orches- Studio and Ollama to power apps directly on users’ machines,
trators ensure efficient information flow, enabling Language thus capitalizing on the open-source nature of some LLMs.
Models to seamlessly access and incorporate essential cyberse- For more dynamic or temporary setups, frameworks such as
curity information [246]. Advanced techniques such as multi- Gradio and Streamlit allow developers to prototype and share
query retrievers and HyDE are used to optimize the retrieval of demos quickly, with hosting options like Hugging Face Spaces
relevant cybersecurity documents and adapt user queries into providing an accessible path to broader distribution. On the
more effective forms for document retrieval. Furthermore, in- industrial scale, deploying LLMs can require robust server
corporating a memory system that recalls previous interactions setups, utilizing cloud services or on-premises infrastructure
allows these models to provide consistent and context-aware that might leverage specialized frameworks for peak perfor-
responses over time. This amalgamation of advanced retrieval mance and efficiency. Meanwhile, edge deployment strategies
mechanisms and memory enhancement through RAG signif- bring LLM capabilities to devices with limited resources,
icantly boosts the efficacy of Language Models in handling using advanced, lightweight frameworks to integrate smart
complex and evolving cybersecurity challenges, making them capabilities directly into mobile and web platforms, ensuring

44
responsiveness and accessibility across a broad spectrum of ethically sourced and handled, maintaining transparency about
user environments [248], [249]. data sources and model training methodologies. This struc-
Currently, LLMs can be deployed on Phones. Microsoft tured approach to security and governance ensures that LLMs
[129] propose phi-3-mini. This highly efficient 3.8 billion are used responsibly and remain secure from conventional
parameter language model delivers robust performance on par cyber threats and those unique to their operational nature.
with much larger models such as Mixtral 8x7B and GPT- 9) Optimizing LLMs: Optimizing LLMs for production
3.5, achieving impressive scores like 69% on the MMLU encompasses several crucial techniques to enhance speed,
and 8.38 on MT-bench. Remarkably, the phi-3-mini’s compact reduce memory requirements, and maintain output quality.
size allows for deployment on mobile devices, expanding One pivotal strategy is model quantization, which significantly
its accessibility and utility. This performance breakthrough reduces the precision of model weights—often to 4-bit or 8-
is primarily attributed to an innovative approach in training bit—thereby decreasing the GPU memory requirements. Table
data selection—a significantly enhanced version of the dataset XVIII presents the optimization strategies for LLMs that can
used for phi-2, which integrates heavily filtered web data and be adopted for Cybersecurity use cases. For instance, quantiz-
synthetic data tailored for relevance and diversity. It has been ing a model to 4-bit can bring down the VRAM requirement
further aligned to ensure the model’s practicality in real-world from 32 GB to just over 9 GB, allowing these models to
applications for enhanced robustness, safety, and optimization run efficiently on consumer-level hardware like the RTX
for chat formats. In addition, the research extends into larger 3090 GPU. Therefore, advanced attention mechanisms such
models, phi-3-small and phi-3-medium, which are trained on as Flash Attention reduce computation times by optimizing
4.8 trillion tokens with 7 billion and 14 billion parameters, self-attention layers, which are integral to transformers [235].
respectively. These models retain the foundational strengths This optimization is especially beneficial for handling long
of phi-3-mini and exhibit superior performance, scoring up to input sequences, where traditional self-attention mechanisms
78% on MMLU and 8.9 on MT-bench, illustrating significant could become prohibitively expensive regarding memory and
enhancements in language understanding capabilities with processing power [251], [252].
scaling. In addition, AirLLM 4 enhances memory management
The quantization methods include Bitsnbytes, 4-bit GPTQ,
for inference, enabling large language models, such as those
2-bit GPTQ, and GGUF quantization. Bitsnbytes introduces a
with 70 billion parameters (e.g., Llama3 70B), to operate on a
k-bit quantization approach that significantly reduces mem-
single 4GB GPU card. This can be achieved without requiring
ory consumption while maintaining performance [236]. It
quantization, distillation, pruning, or any other form of model
employs an 8-bit optimization using block-wise quantization
compression that could diminish performance.
to achieve 32-bit performance at a lower memory cost and
8) Securing LLMs: Securing LLMs is essential due to their
uses LLM.Int() for 8-bit quantization during inference, halving
inherent susceptibility to traditional software vulnerabilities
the required memory without performance loss. Furthermore,
and unique risks stemming from their design and operational
QLoRA [218], or 4-bit quantization, enables the training
methods. Specifically, LLMs are prone to prompt hacking,
of LLMs using memory-saving techniques that include a
where techniques such as prompt injection can be used to
small set of trainable low-rank adaptation weights, allowing
manipulate model responses, prompt leaking that risks expo-
for performance preservation. In parallel, GPTQ is a novel
sure of training data, and jailbreaking intended to circumvent
quantization method for GPT models, facilitating the reduction
built-in safety mechanisms. These specific threats necessitate
of bit width to 3 or 4 bits per weight, enabling the operation
implementing comprehensive security measures that directly
of models as large as 175 billion parameters on a single GPU
address the unique challenges LLMs pose. Additionally, in-
with minimal accuracy loss. This method provides substantial
serting backdoors during training, either by poisoning the data
compression and speed advantages, making high-performance
or embedding secret triggers, can significantly alter a model’s
LLMs more accessible and cost-effective. Additionally, the
behavior during inference, posing severe risks to data integrity
GGUF format, supported by Hugging Face Hub and optimized
and model reliability.
for quick model loading and saving, enhances the efficiency
As discussed in Section VIII, to mitigate these threats
of LLM inference.
effectively, organizations must adopt rigorous defensive strate-
gies as recommended by the OWASP LLM security checklist Another effective optimization is incorporating lower-
5
. This includes testing LLM applications against known precision data formats such as bfloat16 for training and
vulnerabilities using methods like red teaming and specific inference. This approach aligns with the training precision
tools such as garak [250] to identify and address security and avoids the computational overhead associated with float32
flaws. In addition, deploying continuous monitoring systems precision, optimizing resource usage without compromising
like langfuse 6 in production environments helps detect and performance accuracy. The potential VRAM requirements for
rectify anomalous behaviors or potential breaches in real- different models using bfloat16 are substantial. For example,
time. The OWASP checklist also emphasizes the importance GPT-3 might require up to 350 GB. In comparison, smaller
of governance frameworks that ensure data used in training is models like Llama-2-70b and Falcon-40b require 140 GB and
80 GB, respectively, illustrating the scale of resources needed
4 https://pypi.org/project/airllm/ even with efficient data formats 7 .
5 https://owasp.org/www-project-top-10-for-large-language-model-
applications/
6 https://github.com/langfuse/langfuse 7 https://huggingface.co/docs/transformers/main/en/llm tutorial optimization

45
Recently, FSDP-QLoRA 8 , a new technique combining data capabilities to develop more robust and sophisticated defenses
parallelism, 4-bit quantization, and LoRA, was introduced by against evolving cyber threats. The strategic direction outlined
Answer.AI in collaboration with bitsandbytes. Utilizing Fully in this paper aims to guide future research and deployment,
Sharded Data Parallelism (FSDP) to shard model parameters, emphasizing the importance of innovation and resilience in
optimizer states, and gradients across GPUs, this approach safeguarding digital infrastructures.
enables the training of LLMs up to 70 billion parameters
on dual 24GB GPU systems. FSDP-QLoRA represents a
significant step forward in making the training of large-scale R EFERENCES
LLMs. [1] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct
Collectively, these techniques make it feasible to deploy deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
powerful LLMs on a wider range of hardware and enhance [2] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
their scalability and practicality in diverse applications, ensur- [3] R. Dey and F. M. Salem, “Gate-variants of gated recurrent unit (gru)
ing they can deliver high performance even under hardware neural networks,” in 2017 IEEE 60th international midwest symposium
constraints. on circuits and systems (MWSCAS). IEEE, 2017, pp. 1597–1600.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
X. C ONCLUSION Advances in neural information processing systems, vol. 30, 2017.
[5] B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, and X. Cheng, “On
In this paper, we presented a comprehensive and in-depth protecting the data privacy of large language models (llms): A survey,”
review of the future of cybersecurity through the lens of arXiv preprint arXiv:2403.05156, 2024.
Generative AI and Large Language Models (LLMs). Our [6] D. Myers, R. Mohawesh, V. I. Chellaboina, A. L. Sathvik, P. Venkatesh,
Y.-H. Ho, H. Henshaw, M. Alhawawreh, D. Berdik, and Y. Jararweh,
exploration covered a wide range of LLM applications in “Foundation and large language models: fundamentals, challenges,
cybersecurity, including hardware design security, intrusion opportunities, and social impacts,” Cluster Computing, vol. 27, no. 1,
detection, software engineering, design verification, cyber pp. 1–26, 2024.
threat intelligence, malware detection, and phishing and spam [7] S. Tonmoy, S. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, and
A. Das, “A comprehensive survey of hallucination mitigation tech-
detection, illustrating the broad potential of LLMs across niques in large language models,” arXiv preprint arXiv:2401.01313,
various domains. 2024.
We provided a detailed examination of the evolution and [8] M. A. Ferrag, M. Debbah, and M. Al-Hawawreh, “Generative ai for
cyber threat-hunting in 6g-enabled iot networks,” in 2023 IEEE/ACM
current state of LLMs, highlighting advancements in 35 23rd International Symposium on Cluster, Cloud and Internet Comput-
specific models, such as GPT-4, GPT-3.5, BERT, Falcon, ing Workshops (CCGridW). IEEE, 2023, pp. 16–25.
and LLaMA. Our analysis included an in-depth look at the [9] I. H. Sarker, H. Janicke, M. A. Ferrag, and A. Abuadbba, “Multi-
aspect rule-based ai: Methods, taxonomy, challenges and directions
vulnerabilities associated with LLMs, such as prompt injec- toward automation, intelligence and transparent cybersecurity modeling
tion, insecure output handling, training and inference data for critical infrastructures,” Internet of Things, p. 101110, 2024.
poisoning, DDoS attacks, and adversarial natural language [10] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on
large language model (llm) security and privacy: The good, the bad,
instructions. We discussed mitigation strategies to protect these and the ugly,” High-Confidence Computing, p. 100211, 2024.
models, offering a thorough understanding of potential attack [11] Y. Yan, Y. Zhang, and K. Huang, “Depending on yourself when
scenarios and prevention techniques. you should: Mentoring llm with rl agents to become the master in
cybersecurity games,” arXiv preprint arXiv:2403.17674, 2024.
Our evaluation of 40 LLM models in terms of cybersecurity [12] M. Sladić, V. Valeros, C. Catania, and S. Garcia, “Llm in the shell:
knowledge and hardware security demonstrated their vary- Generative honeypots,” arXiv preprint arXiv:2309.00155, 2023.
ing strengths and weaknesses. We also conducted a detailed [13] W. Tann, Y. Liu, J. H. Sim, C. M. Seah, and E.-C. Chang, “Using
large language models for cybersecurity capture-the-flag challenges and
assessment of cybersecurity datasets used for LLM training certification questions,” arXiv preprint arXiv:2308.10443, 2023.
and testing, from data creation to usage, identifying gaps and [14] O. G. Lira, A. Marroquin, and M. A. To, “Harnessing the advanced
opportunities for future research. capabilities of llm for adaptive intrusion detection systems,” in In-
We addressed the challenges and limitations of employing ternational Conference on Advanced Information Networking and
Applications. Springer, 2024, pp. 453–464.
LLMs in cybersecurity settings, including the difficulty of [15] C. Ebert and M. Beck, “Artificial intelligence for cybersecurity,” IEEE
defending against adversarial attacks and ensuring model ro- Software, vol. 40, no. 6, pp. 27–34, 2023.
bustness. Additionally, we explored advanced techniques like [16] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software
testing with large language models: Survey, landscape, and vision,”
Half-Quadratic Quantization (HQQ), Reinforcement Learning IEEE Transactions on Software Engineering, 2024.
with Human Feedback (RLHF), Direct Preference Optimiza- [17] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojo-
tion (DPO), Odds Ratio Preference Optimization (ORPO), caru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic
et al., “The falcon series of open language models,” arXiv preprint
GPT-Generated Unified Format (GGUF), Quantized Low- arXiv:2311.16867, 2023.
Rank Adapters (QLoRA), and Retrieval-Augmented Genera- [18] H. Zhou, C. Hu, Y. Yuan, Y. Cui, Y. Jin, C. Chen, H. Wu, D. Yuan,
tion (RAG) to enhance real-time cybersecurity defenses and L. Jiang, D. Wu, X. Liu, C. Zhang, X. Wang, and J. Liu, “Large
language model (llm) for telecommunications: A comprehensive survey
improve the sophistication of LLM applications in threat on principles, key techniques, and opportunities,” 2024.
detection and response. [19] H. Lai and M. Nissim, “A survey on automatic generation of figurative
Our findings underscore the significant potential of LLMs language: From rule-based systems to large language models,” ACM
Computing Surveys, 2024.
in transforming cybersecurity practices. By integrating LLMs
[20] M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Cordeiro, M. Deb-
into future cybersecurity frameworks, we can leverage their bah, T. Lestable, and N. S. Thandi, “Revolutionizing cyber threat
detection with large language models: A privacy-preserving bert-based
8 https://huggingface.co/docs/bitsandbytes/main/en/fsdp qlora lightweight model for iot/iiot devices,” IEEE Access, 2024.

46
[21] N. Tihanyi, T. Bisztray, R. A. Dubniczky, R. Toth, B. Borsos, B. Cherif, [43] Z. Han, C. Gao, J. Liu, S. Q. Zhang et al., “Parameter-efficient fine-
M. A. Ferrag, L. Muzsai, R. Jain, R. Marinelli et al., “Dynamic tuning for large models: A comprehensive survey,” arXiv preprint
intelligence assessment: Benchmarking llms on the road to agi with arXiv:2403.14608, 2024.
a focus on model confidence,” arXiv preprint arXiv:2410.15490, 2024. [44] J. Zhang, H. Bu, H. Wen, Y. Chen, L. Li, and H. Zhu, “When llms
[22] N. Tihanyi, M. A. Ferrag, R. Jain, T. Bisztray, and M. Debbah, meet cybersecurity: A systematic literature review,” arXiv preprint
“Cybermetric: A benchmark dataset based on retrieval-augmented gen- arXiv:2405.03644, 2024.
eration for evaluating llms in cybersecurity knowledge,” in 2024 IEEE [45] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu,
International Conference on Cyber Security and Resilience (CSR). Z. Yang, K.-D. Liao et al., “A survey on multimodal large language
IEEE, 2024, pp. 296–302. models for autonomous driving,” in Proceedings of the IEEE/CVF
[23] Z. Liu, “A review of advancements and applications of pre-trained lan- Winter Conference on Applications of Computer Vision, 2024, pp. 958–
guage models in cybersecurity,” in 2024 12th International Symposium 979.
on Digital Forensics and Security (ISDFS), 2024, pp. 1–10. [46] G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi,
[24] O. Friha, M. A. Ferrag, B. Kantarci, B. Cakmak, A. Ozgun, and Z. Yu, M. Zhu, Y. Zhang et al., “Beyond efficiency: A systematic
N. Ghoualmi-Zine, “Llm-based edge intelligence: A comprehensive survey of resource-efficient large language models,” arXiv preprint
survey on architectures, applications, security and trustworthiness,” arXiv:2401.00625, 2024.
IEEE Open Journal of the Communications Society, 2024. [47] S. Tian, Q. Jin, L. Yeganova, P.-T. Lai, Q. Zhu, X. Chen, Y. Yang,
[25] S. Jamal, H. Wimmer, and I. H. Sarker, “An improved transformer- Q. Chen, W. Kim, D. C. Comeau et al., “Opportunities and challenges
based model for detecting phishing, spam and ham emails: A large for chatgpt and large language models in biomedicine and health,”
language model approach,” Security and Privacy, p. e402, 2024. Briefings in Bioinformatics, vol. 25, no. 1, p. bbad493, 2024.
[Online]. Available: https://doi.org/10.1002/spy2.402 [48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre-
[26] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, sentations by back-propagating errors,” nature, vol. 323, no. 6088, pp.
B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language 533–536, 1986.
models,” arXiv preprint arXiv:2303.18223, 2023. [49] S. M. Kasongo, “A deep learning technique for intrusion detection
[27] F. R. Alzaabi and A. Mehmood, “A review of recent advances, system using a recurrent neural networks based framework,” Computer
challenges, and opportunities in malicious insider threat detection using Communications, vol. 199, pp. 113–125, 2023.
machine learning methods,” IEEE Access, vol. 12, pp. 30 907–30 927, [50] S. M. Sohi, J.-P. Seifert, and F. Ganji, “Rnnids: Enhancing network
2024. intrusion detection systems through deep learning,” Computers &
[28] M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, Security, vol. 102, p. 102151, 2021.
M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, “A review on large [51] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
language models: Architectures, applications, taxonomies, open issues H. Schwenk, and Y. Bengio, “Learning phrase representations using
and challenges,” IEEE Access, vol. 12, pp. 26 839–26 874, 2024. rnn encoder-decoder for statistical machine translation,” arXiv preprint
[29] R. Fang, R. Bindu, A. Gupta, and D. Kang, “Llm agents arXiv:1406.1078, 2014.
can autonomously exploit one-day vulnerabilities,” arXiv preprint [52] H. Sedjelmaci, F. Guenab, S.-M. Senouci, H. Moustafa, J. Liu, and
arXiv:2404.08144, 2024. S. Han, “Cyber security based on artificial intelligence for cyber-
[30] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, physical systems,” IEEE Network, vol. 34, no. 3, pp. 6–7, 2020.
C. Wang, Y. Wang et al., “A survey on evaluation of large language [53] P. Dixit and S. Silakari, “Deep learning algorithms for cybersecurity
models,” ACM Transactions on Intelligent Systems and Technology, applications: A technological and status review,” Computer Science
2023. Review, vol. 39, p. 100317, 2021.
[54] S. Gaba, I. Budhiraja, V. Kumar, S. Martha, J. Khurmi, A. Singh,
[31] D. Saha, S. Tarek, K. Yahyaei, S. K. Saha, J. Zhou, M. Tehranipoor,
K. K. Singh, S. Askar, and M. Abouhawwash, “A systematic analysis
and F. Farahmandi, “Llm for soc security: A paradigm shift,” arXiv
of enhancing cyber security using deep learning for cyber physical
preprint arXiv:2310.06046, 2023.
systems,” IEEE Access, 2024.
[32] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz,
[55] C. Yin, Y. Zhu, J. Fei, and X. He, “A deep learning approach for
E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language
intrusion detection using recurrent neural networks,” Ieee Access,
processing via large pre-trained language models: A survey,” ACM
vol. 5, pp. 21 954–21 961, 2017.
Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023.
[56] D. Güera and E. J. Delp, “Deepfake video detection using recurrent
[33] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, neural networks,” in 2018 15th IEEE international conference on
T. Zhang, F. Wu et al., “Instruction tuning for large language models: advanced video and signal based surveillance (AVSS). IEEE, 2018,
A survey,” arXiv preprint arXiv:2308.10792, 2023. pp. 1–6.
[34] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, [57] S. Althubiti, W. Nick, J. Mason, X. Yuan, and A. Esterline, “Applying
and J. M. Zhang, “Large language models for software engineering: long short-term memory recurrent neural network for intrusion detec-
Survey and open problems,” arXiv preprint arXiv:2310.03533, 2023. tion,” in SoutheastCon 2018. IEEE, 2018, pp. 1–5.
[35] J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, “Multimodal large [58] C. Xu, J. Shen, X. Du, and F. Zhang, “An intrusion detection system
language models: A survey,” arXiv preprint arXiv:2311.13165, 2023. using a deep neural network with gated recurrent units,” IEEE Access,
[36] Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. G. H. Cheng, Y. Klochkov, vol. 6, pp. 48 697–48 707, 2018.
M. F. Taufiq, and H. Li, “Trustworthy llms: a survey and guideline [59] M. A. Ferrag and L. Maglaras, “Deepcoin: A novel deep learning and
for evaluating large language models’ alignment,” arXiv preprint blockchain-based energy exchange framework for smart grids,” IEEE
arXiv:2308.05374, 2023. Transactions on Engineering Management, vol. 67, no. 4, pp. 1285–
[37] L. Hu, Z. Liu, Z. Zhao, L. Hou, L. Nie, and J. Li, “A survey of 1297, 2019.
knowledge enhanced pre-trained language models,” IEEE Transactions [60] A. Chawla, B. Lee, S. Fallon, and P. Jacob, “Host based intrusion
on Knowledge and Data Engineering, 2023. detection system with combined cnn/rnn model,” in ECML PKDD 2018
[38] H. Zhang, H. Song, S. Li, M. Zhou, and D. Song, “A survey of con- Workshops: Nemesis 2018, UrbReas 2018, SoGood 2018, IWAISe 2018,
trollable text generation using transformer-based pre-trained language and Green Data Mining 2018, Dublin, Ireland, September 10-14, 2018,
models,” ACM Computing Surveys, vol. 56, no. 3, pp. 1–37, 2023. Proceedings 18. Springer, 2019, pp. 149–158.
[39] Z. He, Z. Li, and S. Yang, “Large language models for blockchain secu- [61] I. Ullah and Q. H. Mahmoud, “Design and development of rnn anomaly
rity: A systematic literature review,” arXiv preprint arXiv:2403.14280, detection model for iot networks,” IEEE Access, vol. 10, pp. 62 722–
2024. 62 750, 2022.
[40] Y. Yigit, M. A. Ferrag, I. H. Sarker, L. A. Maglaras, C. Chrysoulas, [62] A. A. E.-B. Donkol, A. G. Hafez, A. I. Hussein, and M. M. Mabrook,
N. Moradpoor, and H. Janicke, “Critical infrastructure protec- “Optimization of intrusion detection using likely point pso and en-
tion: Generative ai, challenges, and opportunities,” arXiv preprint hanced lstm-rnn hybrid technique in communication networks,” IEEE
arXiv:2405.04874, 2024. Access, vol. 11, pp. 9469–9482, 2023.
[41] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, “Software [63] Z. Zhao, Z. Li, J. Jiang, F. Yu, F. Zhang, C. Xu, X. Zhao, R. Zhang,
testing with large language models: Survey, landscape, and vision,” and S. Guo, “Ernn: Error-resilient rnn for encrypted traffic detection
IEEE Transactions on Software Engineering, pp. 1–27, 2024. towards network-induced phenomena,” IEEE Transactions on Depend-
[42] H. Xu, S. Wang, N. Li, Y. Zhao, K. Chen, K. Wang, Y. Liu, T. Yu, able and Secure Computing, 2023.
and H. Wang, “Large language models for cyber security: A systematic [64] X. Wang, S. Wang, P. Feng, K. Sun, S. Jajodia, S. Benchaaboun, and
literature review,” arXiv preprint arXiv:2405.04760, 2024. F. Geck, “Patchrnn: A deep learning-based system for security patch

47
identification,” in MILCOM 2021-2021 IEEE Military Communications [86] N. Lykousas and C. Patsakis, “Decoding developer password patterns:
Conference (MILCOM). IEEE, 2021, pp. 595–600. A comparative analysis of password extraction and selection practices,”
[65] H. Polat, M. Türkoğlu, O. Polat, and A. Şengür, “A novel approach for Computers & Security, vol. 145, p. 103974, 2024.
accurate detection of the ddos attacks in sdn-based scada systems based [87] E. Karlsen, X. Luo, N. Zincir-Heywood, and M. Heywood, “Bench-
on deep recurrent neural networks,” Expert Systems with Applications, marking large language models for log analysis, security, and interpre-
vol. 197, p. 116748, 2022. tation,” Journal of Network and Systems Management, vol. 32, no. 3,
[66] G. Parra, L. Selvera, J. Khoury, H. Irizarry, E. Bou-Harb, and P. Rad, p. 59, 2024.
“Interpretable federated transformer log learning for cloud threat foren- [88] A. Mechri, M. A. Ferrag, and M. Debbah, “Secureqwen: Leveraging
sics,” in Proceedings of the Network and Distributed Systems Security llms for vulnerability detection in python codebases,” Computers &
(NDSS) Symposium, 2022. Security, vol. 148, p. 104151, 2025.
[67] N. Ziems and S. Wu, “Security vulnerability detection using deep [89] H. Ding, Y. Liu, X. Piao, H. Song, and Z. Ji, “Smartguard: An llm-
learning natural language processing,” in IEEE INFOCOM 2021-IEEE enhanced framework for smart contract vulnerability detection,” Expert
Conference on Computer Communications Workshops (INFOCOM Systems with Applications, p. 126479, 2025.
WKSHPS). IEEE, 2021, pp. 1–6. [90] U. Arshad and Z. Halim, “Blockllm: A futuristic llm-based decen-
[68] Z. Wu, H. Zhang, P. Wang, and Z. Sun, “Rtids: A robust transformer- tralized vehicular network architecture for secure communications,”
based approach for intrusion detection system,” IEEE Access, vol. 10, Computers and Electrical Engineering, vol. 123, p. 110027, 2025.
pp. 64 375–64 387, 2022. [91] Z. Xiao, Q. Wang, H. Pearce, and S. Chen, “Logic meets
[69] F. Demirkıran, A. Çayır, U. Ünal, and H. Dağ, “An ensemble of magic: Llms cracking smart contract vulnerabilities,” arXiv preprint
pre-trained transformer models for imbalanced multiclass malware arXiv:2501.07058, 2025.
classification,” Computers & Security, vol. 121, p. 102846, 2022. [92] M. Hassanin, M. Keshk, S. Salim, M. Alsubaie, and D. Sharma, “Pllm-
[70] A. Ghourabi, “A security model based on lightgbm and transformer to cs: Pre-trained large language model (llm) for cyber threat detection in
protect healthcare systems from cyberattacks,” IEEE Access, vol. 10, satellite networks,” Ad Hoc Networks, vol. 166, p. 103645, 2025.
pp. 48 890–48 903, 2022. [93] P. Liu, C. Sun, Y. Zheng, X. Feng, C. Qin, Y. Wang, Z. Xu, Z. Li,
[71] C. Thapa, S. I. Jang, M. E. Ahmed, S. Camtepe, J. Pieprzyk, and P. Di, Y. Jiang et al., “Llm-powered static binary taint analysis,” ACM
S. Nepal, “Transformer-based language models for software vulnera- Transactions on Software Engineering and Methodology, 2025.
bility detection,” in Proceedings of the 38th Annual Computer Security [94] M. Gaber, M. Ahmed, and H. Janicke, “Zero day ransomware detection
Applications Conference, 2022, pp. 481–496. with pulse: Function classification with transformer models and assem-
[72] P. Ranade, A. Piplai, S. Mittal, A. Joshi, and T. Finin, “Generating bly language,” Computers & Security, vol. 148, p. 104167, 2025.
fake cyber threat intelligence using transformer-based models,” in 2021
[95] Y. Ding, Y. Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair,
International Joint Conference on Neural Networks (IJCNN). IEEE,
D. Wagner, B. Ray, and Y. Chen, “Vulnerability detection with code
2021, pp. 1–9.
language models: How far are we?” arXiv preprint arXiv:2403.18624,
[73] M. Fu and C. Tantithamthavorn, “Linevul: a transformer-based line- 2024.
level vulnerability prediction,” in Proceedings of the 19th International
[96] T. Koide, N. Fukushi, H. Nakano, and D. Chiba, “Chatspamdetector:
Conference on Mining Software Repositories, 2022, pp. 608–620.
Leveraging large language models for effective phishing email detec-
[74] C. Mamede, E. Pinconschi, and R. Abreu, “A transformer-based ide
tion,” arXiv preprint arXiv:2402.18093, 2024.
plugin for vulnerability detection,” in 37th IEEE/ACM International
Conference on Automated Software Engineering, 2022, pp. 1–4. [97] F. Heiding, B. Schneier, A. Vishwanath, J. Bernstein, and P. S. Park,
“Devising and detecting phishing emails using large language models,”
[75] P. Evangelatos, C. Iliou, T. Mavropoulos, K. Apostolou, T. Tsikrika,
IEEE Access, 2024.
S. Vrochidis, and I. Kompatsiaris, “Named entity recognition in cyber
threat intelligence using transformer-based models,” in 2021 IEEE [98] R. Chataut, P. K. Gyawali, and Y. Usman, “Can ai keep you safe? a
International Conference on Cyber Security and Resilience (CSR). study of large language models for phishing detection,” in 2024 IEEE
IEEE, 2021, pp. 348–353. 14th Annual Computing and Communication Workshop and Conference
[76] F. Hashemi Chaleshtori and I. Ray, “Automation of vulnerability (CCWC). IEEE, 2024, pp. 0548–0554.
information extraction using transformer-based language models,” in [99] M. Rostami, M. Chilese, S. Zeitouni, R. Kande, J. Rajendran, and A.-R.
Computer Security. ESORICS 2022 International Workshops. Springer, Sadeghi, “Beyond random inputs: A novel ml-based hardware fuzzing,”
2023, pp. 645–665. 2024.
[77] S. Liu, Y. Li, and Y. Liu, “Commitbart: A large pre-trained model for [100] Z. Zhang, G. Chadwick, H. McNally, Y. Zhao, and R. Mullins,
github commits,” arXiv preprint arXiv:2208.08100, 2022. “Llm4dv: Using large language models for hardware test stimuli
[78] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce, “On hardware generation,” 2023.
security bug code fixes by prompting large language models,” IEEE [101] M. Nair, R. Sadhukhan, and D. Mukhopadhyay, “Generating
Transactions on Information Forensics and Security, pp. 1–1, 2024. secure hardware using chatgpt resistant to cwes,” Cryptology
[79] L. J. Wan, Y. Huang, Y. Li, H. Ye, J. Wang, X. Zhang, and D. Chen, ePrint Archive, Paper 2023/212, 2023, https://eprint.iacr.org/2023/212.
“Invited paper: Software/hardware co-design for llm and its application [Online]. Available: https://eprint.iacr.org/2023/212
for design verification,” in 2024 29th Asia and South Pacific Design [102] L. J. Wan, Y. Huang, Y. Li, H. Ye, J. Wang, X. Zhang,
Automation Conference (ASP-DAC), 2024, pp. 435–441. and D. Chen, “Software/hardware co-design for llm and its
[80] E. Jang, J. Cui, D. Yim, Y. Jin, J.-W. Chung, S. Shin, and Y. Lee, application for design verification,” in Proceedings of the 29th Asia
“Ignore me but don’t replace me: Utilizing non-linguistic elements and South Pacific Design Automation Conference, ser. ASPDAC
for pretraining on the cybersecurity domain,” arXiv preprint, 2024, to ’24. IEEE Press, 2024, p. 435–441. [Online]. Available: https:
appear in NAACL Findings 2024. //doi.org/10.1109/ASP-DAC58780.2024.10473893
[81] M. Bayer, P. Kuehn, R. Shanehsaz, and C. Reuter, “Cysecbert: A [103] M. Liu, N. Pinckney, B. Khailany, and H. Ren, “Verilogeval: Evaluating
domain-adapted language model for the cybersecurity domain,” ACM large language models for verilog code generation,” 2023.
Transactions on Privacy and Security, vol. 27, no. 2, pp. 1–20, 2024. [104] N. Tihanyi, M. A. Ferrag, R. Jain, and M. Debbah, “Cybermetric: A
[82] A. Shestov, R. Levichev, R. Mussabayev, E. Maslov, A. Cheshkov, and benchmark dataset for evaluating large language models knowledge in
P. Zadorozhny, “Finetuning large language models for vulnerability cybersecurity,” arXiv preprint arXiv:2402.07688, 2024.
detection,” arXiv preprint, 2024, version 4. [105] R. Meng, M. Mirchev, M. Böhme, and A. Roychoudhury, “Large
[83] F. He, F. Li, and P. Liang, “Enhancing smart contract security: Leverag- language model guided protocol fuzzing,” in Proceedings of the 31st
ing pre-trained language models for advanced vulnerability detection,” Annual Network and Distributed System Security Symposium (NDSS),
IET Blockchain, 2024, first published: 29 March 2024. 2024.
[84] C. Patsakis, F. Casino, and N. Lykousas, “Assessing llms in malicious [106] V.-T. Pham, M. Böhme, and A. Roychoudhury, “Aflnet: A greybox
code deobfuscation of real-world malware campaigns,” Expert Systems fuzzer for network protocols,” in 2020 IEEE 13th International Con-
with Applications, vol. 256, p. 124912, 2024. [Online]. Available: ference on Software Testing, Validation and Verification (ICST), 2020,
https://www.sciencedirect.com/science/article/pii/S0957417424017792 pp. 460–465.
[85] Y. Guo, C. Patsakis, Q. Hu, Q. Tang, and F. Casino, “Outside the [107] S. Qin, F. Hu, Z. Ma, B. Zhao, T. Yin, and C. Zhang, “Nsfuzz:
comfort zone: Analysing llm capabilities in software vulnerability Towards efficient and state-aware network service fuzzing,” ACM
detection,” in European symposium on research in computer security. Trans. Softw. Eng. Methodol., vol. 32, no. 6, sep 2023. [Online].
Springer, 2024, pp. 271–289. Available: https://doi.org/10.1145/3580598

48
[108] J. Wang, L. Yu, and X. Luo, “Llmif: Augmented large language model [131] N. Dey, G. Gosal, Z. C. Chen, H. Khachane, W. Marshall, R. Pathria,
for fuzzing iot devices,” in 2024 IEEE Symposium on Security and M. Tom, and J. Hestness, “Cerebras-gpt: Open compute-optimal
Privacy (SP). IEEE Computer Society, 2024, pp. 196–196. language models trained on the cerebras wafer-scale cluster,” arXiv
[109] M. Ren, X. Ren, H. Feng, J. Ming, and Y. Lei, “Z-fuzzer: device- preprint arXiv:2304.03208, 2023, submitted on 6 Apr 2023. [Online].
agnostic fuzzing of zigbee protocol implementation,” in Proceedings Available: https://doi.org/10.48550/arXiv.2304.03208
of the 14th ACM Conference on Security and Privacy in Wireless and [132] ZySec-AI, “Zysec-ai: Project zysec,” Webpage, accessed: 2024-05-01.
Mobile Networks, ser. WiSec ’21. New York, NY, USA: Association [Online]. Available: https://github.com/ZySec-AI/project-zysec
for Computing Machinery, 2021, p. 347–358. [Online]. Available: [133] DeciAI Research Team, “Decilm-7b,” 2023. [Online]. Available:
https://doi.org/10.1145/3448300.3468296 https://huggingface.co/Deci/DeciLM-7B
[110] J. Pereyda, “Boofuzz: Network protocol fuzzing for humans,” https: [134] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada,
//boofuzz.readthedocs.io/en/stable, 2020. S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. San-
[111] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, seviero, A. M. Rush, and T. Wolf, “Zephyr: Direct distillation of lm
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language mod- alignment,” 2023.
els are few-shot learners,” Advances in neural information processing [135] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah,
systems, vol. 33, pp. 1877–1901, 2020. A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. (2023) Free
[112] OpenAI, “Gpt-4 technical report,” 2023. dolly: Introducing the world’s first truly open instruction-tuned
[113] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, llm. [Online]. Available: https://www.databricks.com/blog/2023/04/12/
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learn- dolly-first-open-commercially-viable-instruction-tuned-llm
ing with a unified text-to-text transformer,” The Journal of Machine [136] TIIUAE, “Falcon-11b,” https://huggingface.co/tiiuae/falcon-11B, 2024,
Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020. accessed: 2024-05-01.
[114] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training [137] L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis,
of deep bidirectional transformers for language understanding,” arXiv N. Muennighoff, M. Mishra, A. Gu, M. Dey et al., “Santacoder: don’t
preprint arXiv:1810.04805, 2018. reach for the stars!” arXiv preprint arXiv:2301.03988, 2023.
[115] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, [138] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,
“Albert: A lite bert for self-supervised learning of language represen- M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source
tations,” arXiv preprint arXiv:1909.11942, 2019. be with you!” arXiv preprint arXiv:2305.06161, 2023.
[116] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, [139] Hugging Face & ServiceNow, “Huggingfaceh4/starchat-alpha,” https://
L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert huggingface.co/HuggingFaceH4/starchat-alpha, 2023, accessed: 2023-
pretraining approach,” arXiv preprint arXiv:1907.11692, 2019. 12-10.
[117] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
[140] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou,
Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language
“Codegen2: Lessons for training llms on programming and natural
understanding,” Advances in neural information processing systems,
languages,” arXiv preprint arXiv:2305.02309, 2023.
vol. 32, 2019.
[141] Salesforce AI Research, “Codegen2.5: Small, but mighty,”
[118] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang,
2023, accessed: 2023-12-10. [Online]. Available: https://blog.
and M. Zhou, “Prophetnet: Predicting future n-gram for sequence-to-
salesforceairesearch.com/codegen25/
sequence pre-training,” arXiv preprint arXiv:2001.04063, 2020.
[119] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, [142] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi,
H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refined- “Codet5+: Open code large language models for code understanding
web dataset for falcon llm: outperforming curated corpora with web and generation,” arXiv preprint arXiv:2305.07922, 2023.
data, and web data only,” arXiv preprint arXiv:2306.01116, 2023. [143] E. Nijkamp, T. Xie, H. Hayashi, B. Pang, C. Xia, C. Xing, J. Vig,
[120] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient S. Yavuz, P. Laban, B. Krause et al., “Xgen-7b technical report,” arXiv
transformer,” arXiv preprint arXiv:2001.04451, 2020. preprint arXiv:2309.03450, 2023.
[121] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, [144] Replit, Inc., “replit-code-v1-3b,” 2023, accessed: 2023-12-10. [Online].
P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling Available: https://huggingface.co/replit/replit-code-v1-3b
language modeling with pathways,” Journal of Machine Learning [145] Deci AI, “Introducing decicoder: The new gold standard
Research, vol. 24, no. 240, pp. 1–113, 2023. in efficient and accurate code generation,” August 2023,
[122] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, accessed: 2023-12-10. [Online]. Available: https://deci.ai/blog/
S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical decicoder-efficient-and-accurate-code-generation-llm/
report,” arXiv preprint arXiv:2305.10403, 2023. [146] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi,
[123] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: for code,” arXiv preprint arXiv:2308.12950, 2023.
Open and efficient foundation language models,” arXiv preprint [147] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge,
arXiv:2302.13971, 2023. Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu,
[124] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu,
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang,
2: Open foundation and fine-tuned chat models,” arXiv preprint J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang,
arXiv:2307.09288, 2023. Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu, “Qwen
[125] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, technical report,” arXiv preprint, Tech. Rep., 2023, 59 pages, 5 figures.
N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with [148] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen,
conditional computation and automatic sharding,” arXiv preprint X. Bi, Y. Wu, Y. Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-
arXiv:2006.16668, 2020. coder: When the large language model meets programming – the rise
[126] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre- of code intelligence,” arXiv preprint, 2024, submitted on 25 Jan 2024,
training text encoders as discriminators rather than generators,” arXiv Last revised 26 Jan 2024.
preprint arXiv:2003.10555, 2020. [149] C. Team, A. J. Hartman, A. Hu, C. A. Choquette-Choo, H. Zhao,
[127] The MosaicML NLP Team, “Mpt-30b: Raising the bar for open- J. Fine, J. Hui, J. Shen, J. Kelley, J. Howland, K. Bansal, L. Vilnis,
source foundation models,” June 2023, accessed: 2023-12-10. [Online]. M. Wirth, N. Nguyen, P. Michel, P. Choy, P. Joshi, R. Kumar,
Available: https://www.mosaicml.com/blog/mpt-30b S. Hashmi, S. Agrawal, S. Zuo, T. Warkentin, and Z. Gong,
[128] 01.AI, “Yi-34b,” https://huggingface.co/01-ai/Yi-34B, 2023, accessed: “Codegemma: Open code models based on gemma,” 2024. [Online].
2023-12-10. Available: https://goo.gle/codegemma
[129] M. A. et al., “Phi-3 technical report: A highly capable language model [150] M. Mishra, M. Stallone, G. Zhang, Y. Shen, A. Prasad, A. Meza Soria,
locally on your phone,” arXiv preprint arXiv:2404.14219, 2024. M. Merler, P. Selvam, S. Surendran, S. Singh, M. Sethi, X.-H. Dang,
[130] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. P. Li, K.-L. Wu, S. Zawad, A. Coleman, M. White, M. Lewis,
Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, R. Pavuluri, Y. Koyfman, B. Lublinsky, M. de Bayser, I. Abdelaziz,
L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, K. Basu, M. Agarwal, Y. Zhou, C. Johnson, A. Goyal, H. Patel,
T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, “Mistral 7b,” arXiv Y. Shah, P. Zerfos, H. Ludwig, A. Munawar, M. Crouse, P. Kapanipathi,
preprint arXiv:2310.06825, 2023, submitted on 10 Oct 2023. [Online]. S. Salaria, B. Calio, S. Wen, S. Seelam, B. Belgodere, C. Fonseca,
Available: https://doi.org/10.48550/arXiv.2310.06825 A. Singhee, N. Desai, D. D. Cox, R. Puri, and R. Panda, “Granite code

49
models: A family of open foundation models for code intelligence,” [173] N. S. Harzevili, A. B. Belle, J. Wang, S. Wang, Z. Ming, N. Nagappan
arXiv preprint arXiv:2405.04324, May 2024. et al., “A survey on automated software vulnerability detection using
[151] DeepSeek-AI, “Deepseek-v2: A strong, economical, and efficient machine learning and deep learning,” arXiv preprint arXiv:2306.11673,
mixture-of-experts language model,” arXiv preprint arXiv:2405.04434, 2023.
May 2024, submitted on 7 May 2024 (v1), last revised 8 May 2024 [174] M. A. Ferrag, O. Friha, D. Hamouda, L. Maglaras, and H. Janicke,
(this version, v2). [Online]. Available: https://arxiv.org/abs/2405.04434 “Edge-iiotset: A new comprehensive realistic cyber security dataset of
[152] P. Haller, J. Golde, and A. Akbik, “Pecc: Problem extraction and coding iot and iiot applications for centralized and federated learning,” IEEE
challenges,” arXiv preprint arXiv:2404.18766, 2024. Access, vol. 10, pp. 40 281–40 306, 2022.
[153] A. Z. Yang, Y. Takashima, B. Paulsen, J. Dodds, and D. Kroening, [175] N. Tihanyi, T. Bisztray, R. Jain, M. A. Ferrag, L. C. Cordeiro, and
“Vert: Verified equivalent rust transpilation with few-shot learning,” V. Mavroeidis, “The formai dataset: Generative ai in software security
arXiv preprint arXiv:2404.18852, 2024. through the lens of formal verification,” in Proceedings of the 19th
[154] D. Nichols, P. Polasam, H. Menon, A. Marathe, T. Gamblin, and International Conference on Predictive Models and Data Analytics in
A. Bhatele, “Performance-aligned llms for generating fast code,” arXiv Software Engineering, 2023, pp. 33–43.
preprint arXiv:2404.18864, 2024. [176] Y. Zheng, S. Pujar, B. Lewis, L. Buratti, E. Epstein, B. Yang, J. Laredo,
[155] Z. Ma, A. R. Chen, D. J. Kim, T.-H. Chen, and S. Wang, “Llmparser: A. Morari, and Z. Su, “D2a: A dataset built for ai-based vulnerability
An exploratory study on using large language models for log parsing,” detection methods using differential analysis,” in 2021 IEEE/ACM
in Proceedings of the IEEE/ACM 46th International Conference on 43rd International Conference on Software Engineering: Software
Software Engineering, 2024, pp. 1–13. Engineering in Practice (ICSE-SEIP), 2021, pp. 111–120.
[156] T. H. Le, M. A. Babar, and T. H. Thai, “Software vulnerability [177] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective Vulner-
prediction in low-resource languages: An empirical study of codebert ability Identification by Learning Comprehensive Program Semantics
and chatgpt,” arXiv preprint arXiv:2404.17110, 2024. via Graph Neural Networks,” arXiv e-prints, p. arXiv:1909.03496, Sep.
[157] B. Guan, Y. Wan, Z. Bi, Z. Wang, H. Zhang, Y. Sui, P. Zhou, and 2019.
L. Sun, “Codeip: A grammar-guided multi-bit watermark for large [178] H. Hanif, M. H. N. M. Nasir, M. F. Ab Razak, A. Firdaus, and N. B.
language models of code,” arXiv preprint arXiv:2404.15639, 2024. Anuar, “The rise of software vulnerability: Taxonomy of software
[158] X.-C. Wen, X. Wang, Y. Chen, R. Hu, D. Lo, and C. Gao, “Vuleval: To- vulnerabilities detection and machine learning approaches,” Journal of
wards repository-level evaluation of software vulnerability detection,” Network and Computer Applications, vol. 179, p. 103009, 2021.
arXiv preprint arXiv:2404.15596, 2024. [179] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir,
[159] Z. Zhang, C. Chen, B. Liu, C. Liao, Z. Gong, H. Yu, J. Li, and R. Wang, P. Ellingwood, and M. McConley, “Automated vulnerability detection
“Unifying the perspectives of nlp and software engineering: A survey in source code using deep representation learning,” in 2018 17th
on language models for code,” arXiv preprint arXiv:2311.07989, 2023. IEEE International Conference on Machine Learning and Applications
[160] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, (ICMLA), 2018, pp. 757–762.
“Codesearchnet challenge: Evaluating the state of semantic code [180] ——, “Automated vulnerability detection in source code using deep
search,” arXiv preprint arXiv:1909.09436, 2019. representation learning,” in 2018 17th IEEE International Conference
[161] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, on Machine Learning and Applications (ICMLA), 2018, pp. 757–762.
J. Phang, H. He, A. Thite, N. Nabeshima et al., “The pile: An [181] Y. Zhou and A. Sharma, “Automated identification of security issues
800gb dataset of diverse text for language modeling,” arXiv preprint from commit messages and bug reports,” in Proceedings of the 2017
arXiv:2101.00027, 2020. 11th joint meeting on foundations of software engineering, 2017, pp.
[162] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. Ferrandis, 914–919.
Y. Jernite, M. Mitchell, S. Hughes, T. Wolf et al., “The stack: 3 tb of [182] L. Wartschinski, Y. Noller, T. Vogel, T. Kehrer, and L. Grunske,
permissively licensed source code,” arXiv preprint arXiv:2211.15533, “Vudenc: Vulnerability detection with deep learning on a natural
2022. codebase for python,” Information and Software Technology, vol.
[163] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, 144, p. 106809, 2022, arXiv preprint arXiv:2201.08441. [Online].
T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen Available: https://doi.org/10.48550/arXiv.2201.08441
et al., “The bigscience roots corpus: A 1.6 tb composite multilingual [183] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, “A c/c++ code
dataset,” Advances in Neural Information Processing Systems, vol. 35, vulnerability dataset with code changes and cve summaries,” in
pp. 31 809–31 826, 2022. Proceedings of the 17th International Conference on Mining Software
[164] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, Repositories, ser. MSR ’20. New York, NY, USA: Association
A. Tang, D. Pykhtar, J. Liu, Y. Wei et al., “Starcoder 2 and the stack for Computing Machinery, 2020, p. 508–512. [Online]. Available:
v2: The next generation,” arXiv preprint arXiv:2402.19173, 2024. https://doi.org/10.1145/3379597.3387501
[165] R. Schuster, C. Song, E. Tromer, and V. Shmatikov, “You autocomplete [184] G. Bhandari, A. Naseer, and L. Moonen, “Cvefixes: automated collec-
me: Poisoning vulnerabilities in neural code completion,” in 30th tion of vulnerabilities and their fixes from open-source software,” in
USENIX Security Symposium (USENIX Security 21), 2021, pp. 1559– Proceedings of the 17th International Conference on Predictive Models
1575. and Data Analytics in Software Engineering, 2021, pp. 30–39.
[166] O. Asare, M. Nagappan, and N. Asokan, “Is github’s copilot as bad [185] G. Nikitopoulos, K. Dritsa, P. Louridas, and D. Mitropoulos, “Crossvul:
as humans at introducing vulnerabilities in code?” Empirical Software a cross-language vulnerability dataset with commit data,” in Proceed-
Engineering, vol. 28, no. 6, p. 129, 2023. ings of the 29th ACM Joint Meeting on European Software Engineering
[167] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, Conference and Symposium on the Foundations of Software Engineer-
“Lost at c: A user study on the security implications of large language ing, 2021, pp. 1565–1569.
model code assistants,” in 32nd USENIX Security Symposium (USENIX [186] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “Sysevr: A
Security 23), 2023, pp. 2205–2222. framework for using deep learning to detect software vulnerabilities,”
[168] N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write IEEE Transactions on Dependable and Secure Computing, vol. 19,
more insecure code with ai assistants?” in Proceedings of the 2023 no. 4, pp. 2244–2258, 2022.
ACM SIGSAC Conference on Computer and Communications Security, [187] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong,
2023, pp. 2785–2799. “Vuldeepecker: A deep learning-based system for vulnerability detec-
[169] S. Hamer, M. d’Amorim, and L. Williams, “Just another copy and tion,” arXiv preprint arXiv:1801.01681, 2018.
paste? comparing the security vulnerabilities of chatgpt generated code [188] Y. Chen, Z. Ding, L. Alowain, X. Chen, and D. Wagner, “DiverseVul:
and stackoverflow answers,” arXiv preprint arXiv:2403.15600, 2024. A New Vulnerable Source Code Dataset for Deep Learning Based
[170] D. Cotroneo, R. De Luca, and P. Liguori, “Devaic: A tool for security Vulnerability Detection,” arXiv e-prints, p. arXiv:2304.00409, Apr.
assessment of ai-generated code,” arXiv preprint arXiv:2404.07548, 2023.
2024. [189] D. N. Gadde, A. Kumar, T. Nalapat, E. Rezunov, and F. Cappellini,
[171] R. Tóth, T. Bisztray, and L. Erdodi, “Llms in web-development: Evalu- “All artificial, less intelligence: Genai through the lens of formal
ating llm-generated php code unveiling vulnerabilities and limitations,” verification,” Infineon Technologies, 2024.
arXiv preprint arXiv:2404.14459, 2024. [190] OWASP Foundation, “Owasp top 10 for large
[172] N. Tihanyi, T. Bisztray, M. A. Ferrag, R. Jain, and L. C. Cordeiro, “Do language model applications,” https://owasp.org/
neutral prompts produce insecure code? formai-v2 dataset: Labelling www-project-top-10-for-large-language-model-applications/, 2023,
vulnerabilities in code generated by large language models,” 2024. accessed: 2023-12-26.

50
[191] F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques [214] A. Andonian, Q. Anthony, S. Biderman, S. Black, P. Gali, L. Gao,
for language models,” arXiv preprint arXiv:2211.09527, 2022. E. Hallahan, J. Levy-Kramer, C. Leahy, L. Nestler, K. Parker,
[192] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Pieler, J. Phang, S. Purohit, H. Schoelkopf, D. Stander, T. Songz,
M. Fritz, “More than you’ve asked for: A comprehensive analysis of C. Tigges, B. Thérien, P. Wang, and S. Weinbach, “GPT-NeoX:
novel prompt injection threats to application-integrated large language Large Scale Autoregressive Language Modeling in PyTorch,” 9 2023.
models,” arXiv e-prints, pp. arXiv–2302, 2023. [Online]. Available: https://www.github.com/eleutherai/gpt-neox
[193] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, [215] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess,
X. Ren, and H. Jin, “Virtual prompt injection for instruction-tuned R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws
large language models,” arXiv preprint arXiv:2307.16888, 2023. for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
[194] R. Pedro, D. Castro, P. Carreira, and N. Santos, “From prompt [216] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe,
injections to sql injection attacks: How protected is your llm-integrated A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
web application?” arXiv preprint arXiv:2308.01990, 2023. learning for nlp,” in International conference on machine learning.
[195] S. Abdelnabi, K. Greshake, S. Mishra, C. Endres, T. Holz, and M. Fritz, PMLR, 2019, pp. 2790–2799.
“Not what you’ve signed up for: Compromising real-world llm- [217] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,
integrated applications with indirect prompt injection,” in Proceedings and W. Chen, “Lora: Low-rank adaptation of large language models,”
of the 16th ACM Workshop on Artificial Intelligence and Security, 2023, arXiv preprint arXiv:2106.09685, 2021.
pp. 79–90. [218] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora:
[196] Y. Liu, G. Deng, Y. Li, K. Wang, T. Zhang, Y. Liu, H. Wang, Efficient finetuning of quantized llms,” Advances in Neural Information
Y. Zheng, and Y. Liu, “Prompt injection attack against llm-integrated Processing Systems, vol. 36, 2024.
applications,” arXiv preprint arXiv:2306.05499, 2023. [219] X. Wu, H. Xia, S. Youn, Z. Zheng, S. Chen, A. Bakhtiari, M. Wyatt,
[197] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, R. Y. Aminabadi, Y. He, O. Ruwase, L. Song et al., “Zeroquant(4+2):
X. Ren, and H. Jin, “Backdooring instruction-tuned large language Redefining llms quantization with a new fp6-centric strategy for diverse
models with virtual prompt injection,” in NeurIPS 2023 Workshop on generative tasks,” arXiv preprint arXiv:2312.08583, 2023.
Backdoors in Deep Learning-The Good, the Bad, and the Ugly, 2023. [220] H. Xia, Z. Zheng, X. Wu, S. Chen, Z. Yao, S. Youn, A. Bakhtiari,
[198] D. Glukhov, I. Shumailov, Y. Gal, N. Papernot, and V. Papyan, “Llm M. Wyatt, D. Zhuang, Z. Zhou et al., “Fp6-llm: Efficiently serving large
censorship: A machine learning challenge or a computer security language models through fp6-centric algorithm-system co-design,”
problem?” arXiv preprint arXiv:2307.10719, 2023. arXiv preprint arXiv:2401.14112, 2024.
[199] F. Wu, X. Liu, and C. Xiao, “Deceptprompt: Exploiting llm-driven [221] M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov,
code generation via adversarial natural language instructions,” arXiv D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana et al., “Purple
preprint arXiv:2312.04730, 2023. llama cyberseceval: A secure coding benchmark for language models,”
[200] B. D. Son, N. T. Hoa, T. Van Chien, W. Khalid, M. A. Ferrag, W. Choi, arXiv preprint arXiv:2312.04724, 2023.
and M. Debbah, “Adversarial attacks and defenses in 6g network- [222] Z. Liu, “Secqa: A concise question-answering dataset for evalu-
assisted iot systems,” IEEE Internet of Things Journal, 2024. ating large language models in computer security,” arXiv preprint
[201] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and arXiv:2312.15838, 2023.
transferable adversarial attacks on aligned language models,” arXiv [223] M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan,
preprint arXiv:2307.15043, 2023. F. Ahmad, C. Aschermann, Y. Chen, D. Kapil, D. Molnar, S. Whitman,
[202] Z. Yang, X. He, Z. Li, M. Backes, M. Humbert, P. Berrang, and and J. Saxe, “Cyberseceval 2: A wide-ranging cybersecurity evaluation
Y. Zhang, “Data poisoning attacks against multimodal encoders,” in suite for large language models,” 2024.
International Conference on Machine Learning. PMLR, 2023, pp. [224] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-
39 299–39 313. K. Dombrowski, S. Goel, L. Phan et al., “The wmdp benchmark:
[203] A. E. Cinà, K. Grosse, A. Demontis, S. Vascon, W. Zellinger, B. A. Measuring and reducing malicious use with unlearning,” arXiv preprint
Moser, A. Oprea, B. Biggio, M. Pelillo, and F. Roli, “Wild patterns arXiv:2403.03218, 2024.
reloaded: A survey of machine learning security against training data [225] Y. Sun, D. Wu, Y. Xue, H. Liu, W. Ma, L. Zhang, M. Shi, and
poisoning,” ACM Computing Surveys, vol. 55, no. 13s, pp. 1–39, 2023. Y. Liu, “Llm4vuln: A unified evaluation framework for decoupling and
[204] P. Gupta, K. Yadav, B. B. Gupta, M. Alazab, and T. R. Gadekallu, “A enhancing llms’ vulnerability reasoning,” 2024.
novel data poisoning attack in federated learning based on inverted loss [226] Z. Liu, J. Shi, and J. F. Buford, “Cyberbench: A multi-task benchmark
function,” Computers & Security, vol. 130, p. 103270, 2023. for evaluating large language models in cybersecurity.” [Online].
[205] J. He, W. Jiang, G. Hou, W. Fan, R. Zhang, and H. Li, “Talk too much: Available: http://aics.site/AICS2024/AICS CyberBench.pdf
Poisoning large language models under token limit,” arXiv preprint [227] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,
arXiv:2404.14795, 2024. and O. Tafjord, “Think you have solved question answering? try arc,
[206] A. B. de Neira, B. Kantarci, and M. Nogueira, “Distributed denial of the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
service attack prediction: Challenges, open issues and opportunities,” [228] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-
Computer Networks, vol. 222, 2023. t. Yih, D. Fried, S. Wang, and T. Yu, “Ds-1000: A natural and
[207] N. Hoque, D. K. Bhattacharyya, and J. K. Kalita, “Botnet in ddos reliable benchmark for data science code generation,” in International
attacks: Trends and challenges,” IEEE Communications Surveys and Conference on Machine Learning. PMLR, 2023, pp. 18 319–18 345.
Tutorials, vol. 17, 2015. [229] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by
[208] O. Osanaiye, K. K. R. Choo, and M. Dlodlo, “Distributed denial of chatgpt really correct? rigorous evaluation of large language models for
service (ddos) resilience in cloud: Review and conceptual cloud ddos code generation,” Advances in Neural Information Processing Systems,
mitigation framework,” 2016. vol. 36, 2024.
[209] Q. Yan, F. R. Yu, Q. Gong, and J. Li, “Software-defined networking [230] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hel-
(sdn) and distributed denial of service (ddos) attacks in cloud com- laswag: Can a machine really finish your sentence?” arXiv preprint
puting environments: A survey, some research issues, and challenges,” arXiv:1905.07830, 2019.
IEEE Communications Surveys and Tutorials, vol. 18, 2016. [231] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and
[210] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, J. Steinhardt, “Measuring massive multitask language understanding,”
and M. Du, “Explainability for large language models: A survey,” ACM arXiv preprint arXiv:2009.03300, 2020.
Transactions on Intelligent Systems and Technology, vol. 15, no. 2, pp. [232] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,
1–38, 2024. M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers
[211] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
D. Jiang, “Wizardlm: Empowering large language models to follow [233] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang,
complex instructions,” arXiv preprint arXiv:2304.12244, 2023. D. Song, and J. Steinhardt, “Measuring mathematical problem solving
[212] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, with the math dataset,” arXiv preprint arXiv:2103.03874, 2021.
and T. Hashimoto, “Alpaca: a strong, replicable instruction-following [234] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B.
model; 2023,” URL https://crfm. stanford. edu/2023/03/13/alpaca. html. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou,
[213] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng,
M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset
in large transformer models,” Proceedings of Machine Learning and for code understanding and generation,” CoRR, vol. abs/2102.04664,
Systems, vol. 5, 2023. 2021.

51
[235] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast
and memory-efficient exact attention with io-awareness,” Advances in
Neural Information Processing Systems, vol. 35, pp. 16 344–16 359,
2022.
[236] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accu-
rate post-training quantization for generative pre-trained transformers,”
arXiv preprint arXiv:2210.17323, 2022.
[237] H. Badri and A. Shaji, “Half-quadratic quantization of large
machine learning models,” November 2023. [Online]. Available:
https://mobiusml.github.io/hqq blog/
[238] F. Gloeckle, B. Youbi Idrissi, B. Rozière, D. Lopez-Paz, and G. Syn-
naeve, “Better & Faster Large Language Models via Multi-token
Prediction,” arXiv e-prints, p. arXiv:2404.19737, Apr. 2024.
[239] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
region policy optimization,” in Proceedings of the 32nd International
Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille,
France: PMLR, 07–09 Jul 2015, pp. 1889–1897. [Online]. Available:
https://proceedings.mlr.press/v37/schulman15.html
[240] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” 2017.
[241] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning,
S. Ermon, and C. Finn, “Direct preference optimization: Your
language model is secretly a reward model,” in Advances in
Neural Information Processing Systems, A. Oh, T. Naumann,
A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds.,
vol. 36. Curran Associates, Inc., 2023, pp. 53 728–53 741.
[Online]. Available: https://proceedings.neurips.cc/paper files/paper/
2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf
[242] J. Hong, N. Lee, and J. Thorne, “Orpo: Monolithic preference opti-
mization without reference model,” 2024.
[243] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal,
H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela,
“Retrieval-augmented generation for knowledge-intensive nlp tasks,”
in Advances in Neural Information Processing Systems, H. Larochelle,
M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran
Associates, Inc., 2020, pp. 9459–9474.
[244] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun,
M. Wang, and H. Wang, “Retrieval-augmented generation for large
language models: A survey,” 2024.
[245] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and
C. Finn, “Direct preference optimization: Your language model is
secretly a reward model,” 2023.
[246] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang,
J. Jiang, and B. Cui, “Retrieval-augmented generation for ai-generated
content: A survey,” 2024.
[247] Y. Huang and J. Huang, “A survey on retrieval-augmented text gener-
ation for large language models,” 2024.
[248] M. team, “MLC-LLM,” 2023. [Online]. Available: https://github.com/
mlc-ai/mlc-llm
[249] ——, “MNN-LLM,” 2023. [Online]. Available: https://github.com/
wangzhaode/mnn-llm/
[250] L. Derczynski, E. Galinkin, and S. Majumdar, “garak: A Framework
for Large Language Model Red Teaming,” https://garak.ai, 2024.
[251] N. Shazeer, “Fast transformer decoding: One write-head is all you
need,” 2019.
[252] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and
S. Sanghai, “Gqa: Training generalized multi-query transformer models
from multi-head checkpoints,” 2023.

52

You might also like