A Multi-Agent System For Cybersecurity Threat Detection and Correlation Using Large Language Models
ABSTRACT As cyber-attacks rapidly evolve across communication, infrastructure and data layers, tradi-
tional security solutions such as rule-based intrusion detection systems (IDS) or signature-based antivirus
programs are effective at detecting known threats, but they often lack the contextual understanding and
semantic interpretation necessary to detect complex or evolving attacks. For example, spear-phishing
campaigns, advanced persistent threats (APTs), and multi-stage attacks often escape detection due to their
subtle and context-dependent nature. This limitation creates a critical gap in detecting coordinated or subtle
attack patterns that span multiple systems and domains. The need for semantic understanding, cross-domain
visibility, and adaptive detection is increasingly urgent, particularly as threat actors employ polymorphic and
AI-driven strategies that traditional systems cannot interpret or correlate effectively. This paper presents a
modular multi-agent architecture that integrates established cybersecurity analysis tools with large language
models (LLMs) to achieve intelligent, explainable and highly accurate detection of threats across diverse
data types. Three specialized agents (email verification, log analysis, and IP address scanning)
each operate independently with tailored detection pipelines that combine domain-specific tools and LLM-
powered semantic analysis components to identify, characterize, and report threats specific to their domain.
At the core of the system lies a contextual recommendation system that processes and cross-analyzes the
outputs of all specialized agents to detect complex threat patterns such as multi-vector, time-based, or stealth
attacks that would otherwise evade isolated detection mechanisms. The evaluation on benchmark datasets,
including CIC-IDS 2017, SpamAssassin, and custom simulated network environments, demonstrates threat
detection accuracy of 93.6%, multi-agent correlation accuracy of 87%, and false positive reduction of 41.3%
compared to traditional approaches. The use of LLMs for both structured explanations and chain-of-thought
reporting further enhances analyst confidence and reduces triage time.
INDEX TERMS Multi-agent systems, LLMs, contextual threat analysis, semantic analysis, email phishing
detection, log-based anomaly detection, IP scanning.
OSINT Open Source Intelligence.
RAG Retrieval-Augmented Generation.
RegEx Regular Expression.
SIEM Security Information and Event Management.
SMTP Simple Mail Transfer Protocol.
SOC Security Operations Center.
SSH Secure Shell.
Suricata Open-source Network Threat Detection Engine.
TLD Top-Level Domain.
URL Uniform Resource Locator.
XAI Explainable Artificial Intelligence.
I. INTRODUCTION
The sophistication and frequency of cyberattacks continue to grow as increasingly interconnected digital systems are targeted. Whether socially engineered phishing campaigns or stealthy lateral movements across compromised endpoints, modern threats are rarely isolated. They cover multiple vectors such as email, logs and IP communications, thus creating detection challenges that exceed the capabilities of siloed or rule-based security systems. Conventional signature and model-based approaches, despite their effectiveness against known threats, often fail to detect latent indicators distributed across different domains, limiting their ability to support timely and explainable decision-making [1], [8]. To remedy these limitations, researchers have looked to multi-agent architectures in the field of cybersecurity. Such systems decentralize detection logic into domain-specific intelligent agents, each one responsible for analyzing a particular threat layer [43]. This modular design provides scalability, resilience and parallel processing capabilities [6], [11]. However, many of these frameworks still suffer from a lack of deep semantic reasoning or natural language understanding, making them hard to interpret and limited in terms of adaptability when dealing with polymorphic or AI-driven threats [25], [27].
The emergence in recent years of LLMs such as GPT-4 and LLaMA has created new avenues for the application of natural language processing and contextual interpretation to cybersecurity problems [40], [44]. LLMs have shown strong capabilities in interpreting unstructured data, generating threat descriptions and even in interacting with live security environments through chain-of-thought and retrieval-augmented generation [2], [4], [17]. Their effectiveness in tasks ranging from phishing detection to network protocol analysis has been demonstrated in studies [3], [19], [20]. Although LLMs have shown strong capabilities across various cybersecurity tasks, their integration into multi-agent security systems for correlating diverse threat indicators and producing interpretable outputs is still at an early stage of development [16], [18].
In this work, we present an LLM-enhanced multi-agent cybersecurity system designed to detect, correlate and interpret cyberthreats in three main areas: i) e-mail traffic, ii) server logs and iii) IP range activities. The architecture includes three intelligent agents, each equipped with classic tools, combined with refined LLM prompts that enhance semantic understanding [21], [29]. Each agent operates autonomously and generates structured results and risk signals that are sent to a central recommendation system component. This system contextualizes the indicators, correlates threats between domains and synthesizes a final threat description [11], [31].
Our system is evaluated on real and synthetic datasets [12], including the SpamAssassin corpus, LLM-generated phishing samples, CIC-IDS 2017 [13] logs and emulated IP scanning activities. The system achieves 93.6% detection accuracy, 87% correlation accuracy and a 41.3% reduction in false positives compared to traditional rule-based or single-agent pipelines. Incorporating LLM-based explainability also produced a significant increase in analyst confidence and a decrease in triage time in simulated red-team scenarios [23], [36].
Our approach is based on four contributions. Firstly, we propose a modular multi-agent architecture for cyberthreat analysis enhanced by LLM-based reasoning [5], [16]. Secondly, we introduce a contextual recommendation system correlating inter-agent evidence into coherent narratives [10], [37]. Thirdly, we show a series of prompt-based pipelines that allow for explainable results from agents [28], [39]. Finally, we evaluate the entire system using both public and synthetic datasets, underlining its operational impact and scalability [7], [35].
The paper is organized as follows. Section II provides a review of existing research on multi-agent cybersecurity systems, LLM applications and semantic threat correlation. Section III presents the architecture of our system, including the agent pipelines and the recommendation system. Section IV details the implementation and integration of the tools. Section V discusses the results of our experiments. Section VI sets out future directions and concludes the paper.
II. BACKGROUND
The growing complexity of cyber threats has sparked extensive research into enhancing detection, correlation, and interpretability within cybersecurity systems. Scholars have explored multiple interdependent avenues, as shown in Table 1: the decentralization of detection through multi-agent architectures [1], the integration of LLMs into security pipelines [2], the evolution of phishing detection via AI [3], sophisticated contextual threat correlation [4], and explainability frameworks for trust and usability [5], [34]. These threads form the foundation upon which our system builds, linking LLM capabilities to a modular, workflow-driven approach rather than cognitively autonomous agents.
Decentralized systems have emerged as resilient strategies for cyber threat response in dynamic environments. Soltani et al. [6] introduced a multi-agent deep learning framework enabling agents to adaptively detect intrusions in distributed networks, while Liu [11] proposed LLM-based
collaboration for incident response. Hasanov et al. [17] underscored multi-agent designs as scalable paradigms for cyber defense. Beyond academic conceptualizations, industry initiatives such as those explored by Kshetri [33] suggest that agentic AI will increasingly underpin critical infrastructure defense. Nonetheless, these systems often lack semantic coherence between detection components, prompting interest in workflow-based models like ours that maintain interpretability without full agent autonomy.
In parallel, LLMs have rapidly entered cybersecurity workflows, powering log analysis, threat summarization, and attack simulation. Kasri et al. [5] provided a systematic assessment of LLMs for tasks ranging from anomaly detection to malware classification. Liu et al. [4] highlighted their flexibility in protocol analysis under resource constraints, while Chen et al. [18] and Guven [24] addressed both operational advantages and security risks. Recent work has expanded this view: Yigit et al. [32] proposed hybrid LLM architectures for infrastructure protection, and Braun [36] evaluated methodologies for explainability in agent-LM coordination. Such efforts illustrate the range and versatility of LLMs, yet also support our stance that these models are most effective when structured within strict pipelines rather than deployed as autonomous entities.
Because phishing is a historically dominant attack vector, its detection remains a central use case for LLMs in cybersecurity. Koide et al. [3] introduced ChatPhishDetector, capable of analyzing multilingual phishing threats using GPT-4 Vision. Afane et al. [1] revealed how attackers exploit LLMs to craft adaptive, evasive phishing content, emphasizing the escalating arms race. Altwaijry et al. [19] and Atawneh and Aljehani [20] benchmarked deep learning-based detectors, while Trad and Chehab [23] explored LLM prompt engineering versus fine-tuning for this task. More recently, López Delgado and López Ramos [41] have examined phishing detection pipelines that leverage soft attention mechanisms, pointing to novel integration paths for natural language processing and security signal recognition.
To go beyond detection, a growing body of work now addresses how to correlate disparate threat signals across environments. For example, Landauer et al. [8] surveyed anomaly detection techniques in log sequences, while Ruzickova et al. [9] proposed narrative construction through LLM-based post-event synthesis. Approaches such as LogPrompt by Liu et al. [21] facilitate zero-shot log interpretation, and Lohar and Baraskar [22] implement a complete LLM-driven pipeline from log ingestion to actionable alerts. Additionally, Marantos et al. [37] presented a multi-layer correlation approach incorporating explainable AI for IoT security, reaffirming the importance of interpretability alongside automation. These studies support our strategy of linking security components (logs, network events, and email flows) through workflow-aware LLM modules to enable transparent cross-domain reasoning.
Explainability plays a pivotal role in the real-world adoption of AI-driven security systems. Balogh et al. [2] investigated how generative models enhance analyst comprehension, while Combs et al. [7] examined LLM reasoning through analogical lenses. Shenoy and Mbaziira [29] proposed frameworks for prompt engineering that align with analyst needs, and Garde et al. [30] offered techniques for summarizing posture indicators from multiple inputs. Notably, Pasca et al. [39] demonstrated that interpretable explanations significantly improve security decision-making, supporting our commitment to traceability and auditability in every step of our system.
Building upon these foundations, our system proposes a state-of-the-art architecture that leverages LLMs not as autonomous agents, but as modular processors integrated into a robust, workflow-driven cybersecurity pipeline. Each LLM module executes specific, well-defined tasks such as log summarization, phishing detection, or contextual linking, without engaging in independent reasoning [10], [16]. This design maximizes interpretability, reduces inter-module inconsistency, and facilitates traceable, multi-source correlation, offering a scalable, secure, and auditable framework aligned with the critical needs of modern cyber defense [27], [31], [42].
a phishing email linked to an anomalous login timestamp and a suspicious IP address signals that, when considered in isolation, may appear benign. This semantic fusion enables high-confidence threat correlation and greatly reduces false positives.
Unlike systems that require tightly coupled, complex inference models, our architecture embraces simplicity and modularity. Each agent is designed to follow a clearly defined and auditable workflow, ensuring that outputs are consistent, traceable, and interpretable by both automated systems and human analysts. This not only facilitates debugging and tuning but also simplifies upgrades and the integration of new detection capabilities.
Fig. 1 captures the complete flow: user-defined tasks pass through the dispatcher, are analyzed in their respective domains, and the resulting insights are unified by the recommendation layer to produce a comprehensive threat assessment. The combination of semantic precision, domain specialization, and centralized correlation makes the architecture an effective and practical choice for adaptive, explainable, and efficient cyber threat detection.
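The flow summarized for Fig. 1 (dispatcher, domain agents, recommendation layer) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and introduced only for illustration: the `Finding` record, the three stub agents with hard-coded outputs, and the `recommend` function, which simply keeps indicators reported by at least two agents. The paper does not specify its data structures or routing code, so this should be read as one possible shape under those assumptions, not as the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    """Hypothetical structured result returned by a domain agent."""
    agent: str
    verdict: str
    indicators: set[str] = field(default_factory=set)


# Stub agents with hard-coded outputs; real agents would run their own pipelines.
def email_agent(payload: str) -> Finding:
    return Finding("email", "suspicious", {"evil.example"})


def log_agent(payload: str) -> Finding:
    return Finding("log", "anomalous", {"evil.example", "203.0.113.7"})


def ip_agent(payload: str) -> Finding:
    return Finding("ip", "exposed", {"203.0.113.7"})


AGENTS: dict[str, Callable[[str], Finding]] = {
    "email": email_agent,
    "log": log_agent,
    "ip": ip_agent,
}


def dispatch(task_type: str, payload: str) -> Finding:
    """Validate the request, then route it to the matching domain agent."""
    if task_type not in AGENTS:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not payload.strip():
        raise ValueError("empty payload")
    return AGENTS[task_type](payload)


def recommend(findings: list[Finding]) -> dict[str, list[str]]:
    """Return indicators seen by two or more agents, with the agents that saw them."""
    seen: dict[str, set[str]] = {}
    for finding in findings:
        for indicator in finding.indicators:
            seen.setdefault(indicator, set()).add(finding.agent)
    return {ind: sorted(agents) for ind, agents in seen.items() if len(agents) > 1}
```

Calling `dispatch` once per task type and passing the collected findings to `recommend` yields the indicators that more than one domain reported, which is the kind of signal that looks benign in isolation but becomes meaningful once unified.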
agent plays a pivotal role in maintaining system reliability, extensibility, and responsiveness.
Upon receiving a request, the dispatcher first performs input validation to ensure that the submitted data, whether an email, a log file, or a range of IP addresses, is syntactically and structurally correct. The dispatcher agent also checks for completeness, such as ensuring the presence of metadata fields, proper formatting, and necessary contextual information. This verification stage prevents malformed or insufficient data from propagating through the analytical layers.
The dispatcher then determines the nature of the task: for instance, distinguishing whether the submission is intended for phishing verification, log anomaly analysis, or IP address reputation scoring. This classification is essential, as it triggers the activation of specialized pipelines with distinct tools, prompt strategies, and scoring mechanisms.
Once the task type is identified, the dispatcher securely routes it to the corresponding domain agent: the email verification agent, log analyzer agent, or IP address range agent. This design separates data ingestion from processing, ensuring that each analytical agent can remain optimized and focused on a single type of cybersecurity signal.
Moreover, the dispatcher manages asynchronous task handling, enabling parallel processing and scalable task queuing. This means the architecture can gracefully accommodate future expansions, such as integrating new agents (e.g., browser behavior trackers or mobile application monitors), without modifying the core logic of the dispatcher. The modularity and abstraction layer contribute significantly to the extensibility and maintainability of the system.
By abstracting preprocessing, task classification, and routing into a unified gateway, the task dispatcher agent enforces discipline across the system, reduces error propagation and provides a clear control point for auditing and operational logging. The agent is the backbone that upholds the modularity and effectiveness of the entire architecture.
FIGURE 2. The detailed workflow of the task dispatcher agent.
B. EMAIL VERIFICATION AGENT
The Email Verification Agent is designed to identify a wide range of threats in email communications, from traditional phishing and spoofing to polymorphic and AI-generated attacks. As shown in Fig. 3, this agent processes each submitted email through a carefully orchestrated workflow that integrates lightweight rule-based tools with advanced semantic analysis powered by LLMs.
Upon receiving a job from the dispatcher, the agent performs a series of pre-processing checks, extracting metadata such as the sender domain, message headers, and embedded URLs. This initial layer ensures structural validity and provides essential indicators for risk classification.
To identify reused content or subtle variations of known phishing attacks, the agent performs a vector similarity search using FAISS. Each submitted e-mail is encoded and compared with a pre-indexed vector database built from annotated datasets of legitimate and fraudulent e-mails. This approach enables rapid approximate matching, which is particularly useful for identifying threats that have been paraphrased or obscured by style.
At the same time, the agent applies symbolic analysis using tools such as RegEx (to spot keywords, brand spoofing and suspicious entities), and tldextract and dns.resolver (to decompose and validate URLs). These modular tools provide deterministic information, notably by analyzing links and identifying unusual domain behavior.
A distinguishing feature of the agent is its hybrid semantic reasoning layer, implemented using a Retrieval-Augmented Generation (RAG) architecture powered by LLaMA 3.3-70B Versatile through the Groq API. The RAG mechanism dynamically retrieves relevant contextual excerpts (such as similar phishing samples, behavioral patterns, or vector store indicators) and integrates them into a tailored prompt to guide the reasoning of the LLM. This approach ensures that the model’s responses are grounded in known threat behaviors while maintaining the flexibility to interpret emerging indicators.
The LLM is responsible for synthesizing all intermediate results, technical signals, metadata anomalies, sentiment changes and vector matches into a natural language report. This report serves two purposes: (1) it provides a clear risk verdict (e.g., safe, suspicious, phishing) and (2) it provides a justification trail, helping analysts to quickly understand the reasoning behind each classification.
This design achieves a balance between performance, cost-effectiveness and transparency. The use of Groq’s ultra-fast inference API significantly reduces latency and computational overhead compared to local deployments, making the agent responsive enough for real-time applications. The modular nature of the tools also means that upgrades and replacements can be carried out without affecting the entire pipeline.
The email verification agent is a scalable, explainable and future-proof component that captures both surface and deep contextual signals. The layered approach is particularly effective against evolving phishing strategies, including those generated by adversarial LLMs, and forms a central pillar of the domain-specific defense capability of the system.
The design of the email verification agent is based on recent advances in phishing email detection. Alhuzali et al. [52] conducted an exhaustive comparative analysis of fourteen machine learning and deep learning
transforming agent-specific analytical results into actionable information about multi-domain threats. This system collects structured results from the email verification agent, log analysis agent, and IP range analysis agent. The system is designed to provide a comprehensive view of the threat landscape by combining data from multiple sources and integrating it into a single view.
When the system receives reports from the analytics agents, the first step is to normalize and aggregate the data, resulting in a consistent structure for further reasoning. Next, a search for matches is performed using regular expressions to detect common entities such as IP addresses, URLs, or domain names that may appear in different analysis contexts. For example, if an email analysis flags a suspicious domain and that same domain appears in a failed login event or an exposed service from an analyzed host, the system flags it as a correlated anomaly.
Beyond syntactic matches, the system also applies temporal correlation by examining the timing of events, identifying coordinated activities such as phishing campaigns followed by lateral movement. When multiple indicators confirm the link, the system assigns a correlation confidence level, indicating the likelihood that these events are part of a unified threat scenario.
To make these results understandable to analysts, the contextual recommendation system structures a report that includes: (i) a narrative summary of the incident, (ii) evidence traces from each agent involved, and (iii) actionable recommendations. These recommendations are generated using a large language model (LLM), which is given a structured summary of the correlated events. In this architecture, the LLaMA 3.3-70B model, accessible via the Groq API, is used to synthesize clear and contextualized threat narratives. The LLM is guided by carefully designed prompt templates to avoid overgeneralization and ensure consistency in results.
FIGURE 6. The cross-context reasoning architecture.
As shown in Fig. 6, when no significant correlation is found, the system isolates the results and flags them independently. If, however, a match between different contexts is found, the system generates unified threat intelligence, enriched with human-readable information and priority indicators. These final outputs provide cybersecurity teams with not only raw indicators, but also contextual understanding, accelerating informed decision-making and incident response.
With this systematic consolidation and improved interpretability, the contextual recommendation system moves from a set of disparate analytical agents to an orchestrated cybersecurity platform that supports real-time threat assessment from multiple, explainable sources.
IV. SYSTEM IMPLEMENTATION AND INTEGRATION
In this section, we present the implementation resources that support our modular cybersecurity system. This includes the LLMs used for task-specific language processing, the cybersecurity tools integrated into each analytical pipeline, and the datasets employed for validation and performance evaluation. The system executes clearly defined workflows within each agent, ensuring reproducibility, traceability, and operational clarity. Together, these components enable robust, context-aware threat detection across diverse data sources, while maintaining transparency and ease of integration into security operations.
A. LARGE LANGUAGE MODELS
Each analytical agent in our architecture incorporates an LLM to perform domain-specific semantic analysis based on a well-defined and deterministic workflow. The system uses LLaMA 3.3-70B, provided via the Groq API, selected for its exceptional performance in terms of instruction following, long-context understanding, and structured output generation. Each LLM operates strictly within a pre-scripted pipeline, receiving constructed prompts and returning standardized outputs based on its designated role. This ensures control and interpretability across a variety of tasks in the field of cybersecurity analysis.
To improve the accuracy and contextual grounding of LLM responses, as shown in Fig. 7, retrieval-augmented generation (RAG) is used in key modules such as the email verification and IP range analysis agents. These agents dynamically retrieve relevant records from external sources, such as phishing repositories or vulnerability databases like the NVD, and provide them as context to the LLM, thus enriching its results without granting it decision-making autonomy. The Chain-of-Thought prompting method is used across all agents, guiding the LLM through a step-by-step analytical reasoning process appropriate for each task. For example, when analyzing log sequences, the model is tasked with extracting relevant timestamps, linking them to potential threat signatures, and reconstructing the event in a narrative form, always following the sequence defined in the workflow.
The prompts are designed to specify the input data, expected formats, and contextual variables specific to the mission of each agent. This approach enhances the reliability of the results and prevents any deviation from the assigned workflow. In parallel, the email verification agent uses semantic similarity via FAISS-based vector matching, where incoming emails are transformed into embeddings and compared to a cleaned index of legitimate and fraudulent messages. This allows the LLM to contextualize new messages based on their proximity to known patterns, without having to deduce intent or engage in generative reasoning.
By design, this system combines the expressive capabilities of LLMs with a closely structured processing pipeline, ensuring both interpretability and operational security. The result is a scalable and robust cybersecurity system that leverages LLMs as task-specific processors integrated into an explainable decision pipeline.
TABLE 2. Cybersecurity tools.
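The embedding-and-compare step described above for the email verification agent can be illustrated with a small, dependency-free Python sketch. It is explicitly not FAISS: it performs an exact brute-force cosine search over a toy bag-of-words representation, whereas the paper reports FAISS-based matching over learned embeddings. The `embed`, `cosine`, and `nearest` functions and the two-entry `INDEX` are hypothetical names and data chosen only to show the idea of labeling a new message by its proximity to annotated examples.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest(query: str, index: list[tuple[str, str]], k: int = 1) -> list[tuple[float, str, str]]:
    """Return the k most similar (score, label, text) entries by brute-force search."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), label, text) for text, label in index]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


# Hypothetical annotated index of known messages (illustrative data only).
INDEX = [
    ("verify your account password now", "phishing"),
    ("meeting agenda for monday attached", "legitimate"),
]
```

A query such as `nearest("please verify your password", INDEX)` ranks the phishing example first because it shares three of the query's four tokens. In a full pipeline, the retrieved neighbors and their labels would be passed to the LLM as context rather than used as the final verdict.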
compilation of logs and packet captures from various cyberattack simulations. It includes real network traffic and behavioral anomalies such as brute force attacks, intrusions, botnet activity, and DDoS attacks. This dataset was used to evaluate the ability of the log analyzer agent to identify suspicious behavior and generate natural language interpretations using LLMs. The dataset validated the detection of temporal anomalies, sequence extraction, and cross-domain event mapping. The format of the dataset and the coverage of attacks make it particularly suitable for log-based machine learning systems [13].
Despite its age, CIC-IDS 2017 continues to be one of the most comprehensive labeled datasets for intrusion detection, covering a wide range of attack types. Its use facilitates compatibility with previous research and enables baseline performance validation. Nevertheless, we recognize that Internet protocols and attack vectors have evolved since its publication.
3) SYNTHETIC PHISHING EMAILS GENERATED BY LLMS
To supplement real data from emails and simulate advanced phishing techniques, a customized dataset was generated using instruction-tuned language models. These models produced a diverse set of phishing messages that reflect modern adversary strategies, including psychological manipulation, domain spoofing and multilingual targeting. The synthetic dataset contains approximately 2,000 examples and was specifically designed to test the robustness of the semantic and contextual reasoning capabilities of the email agent. It was also used to evaluate the ability of the model to interpret new threats and return explainable classifications under ambiguous or unfavorable conditions.
The synthetic dataset was generated to simulate advanced phishing attempts and adversary techniques that are not represented in public datasets. Although it follows realistic construction rules and internal annotations, it is not made public due to confidentiality constraints and its validation status.
4) SIMULATED IP SCANNING AND VULNERABILITY ASSESSMENT
A controlled test environment was established to evaluate the IP range analyzer agent, using virtual machines configured with obsolete services and known vulnerabilities. Each host was scanned using Nmap, and the corresponding service indicators were mapped to vulnerability data extracted from the National Vulnerability Database (NVD). This configuration validated the ability of the agent to scan, extract relevant service metadata, retrieve accurate CVEs, and summarize risk exposure using semantic prompts. It also served as a baseline for evaluating the quality of the summaries generated by the LLM and the results of the decision support.
5) DATA PREPROCESSING AND USAGE
Each dataset was preprocessed according to its structure and the requirements of the corresponding agents. For the SpamAssassin corpus, email headers and body texts were parsed and tokenized using NLTK, and spam labels were retained to simulate detection signals for the Email Verification Agent. HTML tags and malformed characters were removed to ensure clean semantic parsing by the LLM.
For the CIC-IDS 2017 dataset, we extracted flow-based features including source/destination IPs, ports, timestamps, and attack labels. These were used by the Log Analyzer Agent to detect anomalies and later by the Correlation Agent to match IPs and timestamps with email events and scan metadata. The data was filtered to retain only malicious and ambiguous sessions relevant to CVE-based correlation.
The custom dataset of 60 tasks was built by merging synthetic email messages (including obfuscated indicators), Nmap scan outputs (for vulnerable ports), and log sequences (simulated from CIC flow data). Each task was designed to test specific reasoning patterns: single-agent detection, multi-agent inconsistency resolution, false positive filtering, and semantic explanation generation.
All datasets were unified under a JSON-based task format and indexed to be processed sequentially during evaluation. Preprocessing scripts and task generation pipelines will be made available upon request.
Table 3 summarizes the role of each dataset within the multi-agent system:
The implementation of the Log Analyzer agent is based on recent research on system log analysis and anomaly detection using deep learning. Le and Zhang [3] provided a comprehensive review of deep learning techniques applied to system logs, highlighting both their effectiveness and limitations in real-world environments. Their study focused on the challenges of generalization, false positives, and log diversity, which our agent addresses through modular reasoning and correlation layers. In addition, Xie et al. [4] presented LogGD, a graph neural network-based approach capable of capturing structural dependencies in log sequences. Their work demonstrated the importance of integrating contextual relationships between logs for accurate anomaly detection. These results confirm our architectural choices, which favor hybrid symbolic-neural reasoning over purely statistical classifiers for anomaly detection in logs.
D. LLM INTEGRATION STRATEGY
In our multi-agent cybersecurity system, LLMs are not used as autonomous cognitive entities, but as structured semantic modules integrated into predefined analytical workflows. As shown in Fig. 8, each agent operates according to clearly scripted instructions, and the role of the LLM is limited to extracting, interpreting, and generating semantically enriched results to support these workflows. This architectural choice ensures maximum interpretability, reproducibility and operational reliability.
Each analytical agent dynamically generates prompts that are precisely adapted to the context of the data it is processing. These prompts include a clear definition of the task, a detailed structure of the inputs and the expected formatting of the outputs. For example, in the email verification agent,
prompts include information about the sender, message headers, textual content and embedded URLs. The LLM then evaluates these components to provide structured assessments of phishing risk, urgency or identity fraud attempts. Similarly, the IP range analyzer agent formulates prompts that contextualize Nmap scan results and vulnerability metadata to produce semantic summaries describing host exposure and potential exploitability.
This system relies heavily on chain-of-thought prompts across all agents. This technique guides the LLM through intermediate reasoning steps to improve transparency by breaking down complex assessments into traceable, human-readable logic. The interpretive value of this approach is amplified by its consistency across all detection domains. In addition, retrieval-augmented generation (RAG) is selectively applied in components such as the email and IP agents.
RAG enables the system to dynamically integrate external
knowledge sources such as phishing datasets or CVE reposi-
tories into the prompt context, significantly improving topical
relevance and domain alignment.
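As a rough illustration of the prompt structure described above (task definition, structured inputs, retrieved context, reasoning steps, and expected output format), the following Python sketch assembles a prompt for the email verification agent. The class `EmailSample`, the function `build_email_prompt`, the field names, and the JSON output schema are hypothetical and are not taken from the authors' implementation. Retrieved context is passed in as plain strings, so no particular retrieval backend is assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EmailSample:
    """Hypothetical container for the fields the email agent sends to the LLM."""
    sender: str
    headers: Dict[str, str]
    body: str
    urls: List[str] = field(default_factory=list)


def build_email_prompt(sample: EmailSample, retrieved: List[str]) -> str:
    """Assemble a prompt with task, inputs, retrieved context (RAG),
    explicit reasoning steps (chain-of-thought), and an output format."""
    header_lines = "\n".join(f"{k}: {v}" for k, v in sample.headers.items())
    url_lines = "\n".join(f"- {u}" for u in sample.urls) or "- (none)"
    context_lines = "\n".join(f"- {c}" for c in retrieved) or "- (none)"
    return (
        "TASK: Assess whether the email below is a phishing attempt.\n\n"
        f"SENDER: {sample.sender}\n"
        f"HEADERS:\n{header_lines}\n"
        f"BODY:\n{sample.body}\n"
        f"URLS:\n{url_lines}\n\n"
        f"REFERENCE CONTEXT (retrieved):\n{context_lines}\n\n"
        "REASONING STEPS:\n"
        "1. Check sender and header consistency.\n"
        "2. Look for urgency or impersonation cues in the body.\n"
        "3. Compare the URLs with the reference context.\n\n"
        "OUTPUT FORMAT (JSON): "
        '{"phishing_risk": "low|medium|high", "urgency": true|false, '
        '"impersonation": true|false, "rationale": "<short explanation>"}'
    )


# Illustrative input only; the addresses and the context entry are made up.
sample = EmailSample(
    sender="billing@example.com",
    headers={"Reply-To": "support@example.net", "Subject": "Account suspended"},
    body="Your account will be closed today unless you confirm your password.",
    urls=["http://example.org/login"],
)
prompt = build_email_prompt(
    sample, retrieved=["example.org: listed in a phishing feed (illustrative)"]
)
print(prompt)
```

Keeping the template fixed and varying only the inputs is one way to obtain the repeatability that the text attributes to scripted, prompt-driven agents.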
All LLM inputs and outputs, including associated prompts,
are systematically recorded and archived. This ensures com-
plete decision traceability and allows analysts to reconstruct
and audit any step of the semantic processing. These records
also serve as a training ground for red teaming exercises
and for refining prompt design over time, without requiring
retraining or adjustment of the model itself.
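The recording step described above could be sketched as an append-only JSON Lines file, as in the Python example below. The function names `record_llm_call` and `load_llm_calls`, the entry fields, and the file name are assumptions made for illustration; the paper does not specify the storage format the authors use.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_llm_call(log_path, agent, prompt, response):
    """Append one prompt/response pair to a JSON Lines audit log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "prompt": prompt,
        "response": response,
        # The hash lets an auditor check that a stored prompt was not altered.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def load_llm_calls(log_path, agent=None):
    """Read the log back, optionally keeping only one agent's entries."""
    with Path(log_path).open(encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    return [e for e in entries if agent is None or e["agent"] == agent]


# Illustrative usage with a temporary file and made-up prompt/response text.
log_file = Path(tempfile.mkdtemp()) / "llm_audit.jsonl"
record_llm_call(log_file, "email", "TASK: assess email ...", '{"phishing_risk": "high"}')
record_llm_call(log_file, "ip", "TASK: summarize host exposure ...", '{"exposure": "medium"}')
email_calls = load_llm_calls(log_file, agent="email")
print(len(email_calls), email_calls[0]["response"])
```

Because every entry stores the full prompt next to the response, an analyst can replay or compare prompts over time without touching the model itself.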
The strategy for integrating LLMs into the architec-
ture therefore emphasizes modularity, semantic clarity and
repeatability. By integrating LLMs into the system as deter-
ministic, prompt-driven components instead of granting them
autonomy to reason, we create a secure and scalable semantic
processing layer. This integration supports the broader sys-
tem goals of interpretable detection, workflow control and
cross-domain threat correlation, while ensuring alignment
with operational security standards and analyst expectations.
V. DISCUSSION
This section presents a detailed evaluation of the performance
and operational viability of the proposed system. The dis-
cussion summarizes the evaluation results from both isolated
agent tests and full pipeline integration, providing quantita-
tive insights into detection accuracy, semantic interpretability
and cross-domain correlation effectiveness. A comparative
analysis with existing peer-reviewed cybersecurity solutions
is also included, highlighting the relative strengths of the system in terms of modularity, semantic clarity and workflow efficiency. Together, these results establish the practical suitability of the system and provide important insights into its scalability, explainability and real-world applicability.
FIGURE 8. LLM integration pipeline within the multi-agent workflow.
TABLE 3. Dataset utilization overview.
A. EVALUATION SYSTEM
In order to evaluate the effectiveness and robustness of the
proposed modular cybersecurity system, we conducted a
comprehensive assessment in a controlled experimental envi-
ronment [13]. The evaluation was carried out in two stages:
independent agent validation and integrated pipeline testing.
Each agent was validated individually using datasets tailored to its operational scope. The Email Verification Agent was assessed on the SpamAssassin corpus [12], widely adopted in both classic and recent phishing detection research [7]. The Log Analyzer Agent was tested using system log sequences derived from the CIC-IDS2017 dataset [13], which remains a widely used benchmark for
log-based intrusion scenarios [5]. For the IP Range Analyzer Agent, synthetic IP scanning scenarios were constructed using realistic configurations and matched against CVE data from the National Vulnerability Database [14].
These datasets collectively represent a broad spectrum of threats, including phishing campaigns, abnormal system logs, and host/network enumeration attempts [19], [20], [22]. Following the isolated validation of each agent, we evaluated the complete workflow to measure inter-agent communication, multi-domain correlation, and overall system coherence [6], [11], [16].
Performance was measured using standard metrics from artificial intelligence and cybersecurity, as shown in Table 4 [8], [17]. These metrics included accuracy (correct classification rate), precision (rate of true positives among predicted positives), recall (rate of true positives among actual positives), F1 score (harmonic mean of precision and recall), false positive rate (FPR), correlation accuracy (accuracy of the link between inter-agent events), and reduction in analyst response time (time saved thanks to the explanation and synthesis provided by LLMs) [9], [18], [35]. Each indicator was formulated mathematically to reflect its diagnostic usefulness across all components of the system [2], [24], [36].
TABLE 4. Performance metrics and their mathematical definitions.
B. RESULTS
In accordance with the evaluation methodology described in Section A, the results are presented from three interdependent perspectives: the individual performance of each agent, the collective behavior of the system in full integration, and its ranking relative to peer-reviewed cybersecurity architectures [7], [18]. This layered analysis reveals not only the detection and semantic capabilities of the individual components of the system, but also their combined ability to provide reliable and contextual threat descriptions at scale [5], [24], [31].
Each agent, specialized respectively in email verification, log analysis, and IP range analysis, was tested using a domain-specific dataset reflecting realistic threat conditions [12], [13], [14]. As shown in Table 5, all agents achieved good performance in terms of classification accuracy, with values ranging from 91.8% to 94.1%. The semantic interpretability of the system is equally strong: over 90% of the explanations generated by LLMs were judged to be both technically valid and useful in practice [2], [25], [39]. The IP range analyzer achieved the highest success rate for explanation, at 93.2%, closely followed by the email verification module [15], [30]. The log analyzer, although significantly lower in terms of explanation rate, clearly benefited from the use of chain-of-thought prompting, which enabled the LLM to express sequential behavior and complex anomaly chains in a readable and informative manner [9], [21], [22]. These observations validate the effectiveness of integrating LLMs into a deterministic workflow, where the language model acts as a modular and contextual summarizer for the task at hand [4], [16], [29].
Once all components were integrated into the unified architecture, the system demonstrated its capability to synthesize disparate threat signals from email content, system logs and network exposure analysis. The contextual recommendation system played a central role in correlating indicators across domains, identifying common temporal and behavioral patterns, and formulating a single narrative for each threat event. This cross-domain synthesis capability was evaluated using system-wide performance metrics, as shown in Table 6. The architecture achieved an overall detection accuracy of 93.6% and a correlation accuracy of 87.0%, indicating effective linking between events from different agent pipelines. In addition, false positive alerts were reduced by 41.3% compared to a traditional SIEM reference, and triage time for analysts decreased by 38.5% on average due to clear, contextual explanations accompanying each automated decision. Analysts rated the interpretability and relevance of the results with an average confidence score of 4.6 on a 5-point scale, highlighting the operational viability of the system.
To contextualize the performance of our proposed system within the larger cybersecurity research landscape, we conducted an in-depth comparison with twelve peer-reviewed systems that incorporate LLMs, multi-agent designs, or hybrid detection architectures [17], [18], [27]. These systems represent a wide range of approaches, including semantic enrichment, phishing detection, OSINT integration and IoT threat analysis [10], [24], [35].
As shown in Table 6, our system demonstrates superior performance in terms of accuracy and F1 score. With a classification accuracy of 93.6% and an F1 score of 0.94, it outperforms the best previously reported scores among the compared systems. It should be noted that while several studies, such as those by Koide et al. [3] and Zhang et al. [27], have demonstrated good results with variants of GPT-4, they were generally limited to specific domains (e.g., phishing websites, synthetic benchmarks) and did not allow for the integration of multiple data types. Systems such as those by Kalyuzhnaya et al. [16] and Pratama et al. [15] adopted multi-agent approaches, but they did not include inter-contextual correlation or semantic normalization of results [5], [11].
TABLE 5. Agent-level performance metrics.
chain-of-thought prompts, the system achieves high accuracy without sacrificing visibility or scalability [4], [6], [29].
To obtain the results presented in Tables 5 and 6, a two-step evaluation protocol was followed. In the first step, the system was tested using labeled samples from the three datasets (SpamAssassin, CIC-IDS2017, and synthetic phishing emails). Each agent processed its corresponding domain (emails, logs, IP address ranges) and the results were collected to calculate accuracy. Measures such as true positive rate, false positive rate, and accuracy were calculated for each agent and then aggregated.
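For reference, the standard classification metrics named in this section can be computed from confusion-matrix counts as in the Python sketch below. The `Confusion` class, the function names, and the counts are illustrative assumptions; they are not the figures reported in Tables 5 and 6. The paper does not state how per-agent values were aggregated, so pooling the counts across agents (micro-averaging) is shown here as one possible choice.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one agent (binary threat / benign labels)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def compute_metrics(c):
    """Accuracy, precision, recall (TPR), F1 and FPR from raw counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(c.fp, c.fp + c.tn),
    }


def pool(counts):
    """Sum the counts of several agents before computing system-wide metrics."""
    counts = list(counts)
    return Confusion(
        tp=sum(c.tp for c in counts),
        fp=sum(c.fp for c in counts),
        tn=sum(c.tn for c in counts),
        fn=sum(c.fn for c in counts),
    )


# Made-up counts for two agents, for illustration only.
per_agent = {
    "email": Confusion(tp=8, fp=2, tn=88, fn=2),
    "log": Confusion(tp=6, fp=4, tn=86, fn=4),
}
email_metrics = compute_metrics(per_agent["email"])
system_metrics = compute_metrics(pool(per_agent.values()))
print(email_metrics)
print(system_metrics)
```

Averaging the per-agent metrics instead of pooling the counts (macro-averaging) would weight each agent equally regardless of its dataset size, which can give a different system-wide figure.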
In a second step, 60 anonymized tasks were created from
a mix of the datasets and submitted to the entire multi-agent
pipeline. For each task, the final recommendation report was
evaluated by three human analysts who were unaware of the
ground truth. They rated each report based on its clarity, relia-
bility, and usefulness, and indicated the time saved compared
to a manual inspection.
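The analyst ratings from this second step could be summarized as in the short Python sketch below. The data layout, the names `report_score` and `summarize`, the equal weighting of the three criteria, and the example values are all assumptions; the paper does not describe how the individual ratings were combined into a single score.

```python
from statistics import mean

CRITERIA = ("clarity", "reliability", "usefulness")


def report_score(analyst_ratings):
    """Average the three criteria per analyst, then average across analysts."""
    return mean(mean(r[c] for c in CRITERIA) for r in analyst_ratings)


def summarize(tasks):
    """Mean report score and mean time saved over all evaluated tasks."""
    return {
        "mean_score": mean(report_score(t["ratings"]) for t in tasks),
        "mean_minutes_saved": mean(mean(t["minutes_saved"]) for t in tasks),
    }


# Two made-up tasks, each rated by three analysts on a 5-point scale.
tasks = [
    {
        "ratings": [
            {"clarity": 5, "reliability": 4, "usefulness": 5},
            {"clarity": 4, "reliability": 4, "usefulness": 4},
            {"clarity": 5, "reliability": 5, "usefulness": 5},
        ],
        "minutes_saved": [10, 12, 8],
    },
    {
        "ratings": [
            {"clarity": 4, "reliability": 4, "usefulness": 4},
            {"clarity": 4, "reliability": 4, "usefulness": 4},
            {"clarity": 4, "reliability": 4, "usefulness": 4},
        ],
        "minutes_saved": [6, 6, 6],
    },
]
summary = summarize(tasks)
print(summary)
```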
Table 5 shows the classification accuracy of the system for
all entries. Table 6 summarizes the qualitative comments from
the analysts for all scenarios.
To enhance the interpretability and comparative analysis of the results, Fig. 9 presents a heat map of performance metrics at the agent level, including classification accuracy, false positive rate (FPR), and explanation success. The visualization highlights the high accuracy of the email verification agent, the semantic power of the IP range analyzer, and the relatively higher FPR of the log analyzer. This visual summary provides a clear comparative view of each agent’s performance against key evaluation criteria.
TABLE 6. Full pipeline evaluation.
Fig. 10 shows the overall improvements achieved by the
proposed system compared to a baseline SIEM reference. The
architecture demonstrates a significant 41.3% reduction in
false positives and a 38.5% reduction in analyst triage time,
confirming the operational effectiveness and efficiency of the
system in practical deployment environments.
FIGURE 10. System-wide improvements compared to baseline SIEM.
FIGURE 11. Confidence score of the analyst across 60 evaluation.
above 4.5, highlighting the ability of the system to generate reliable and contextually consistent results.
The second graph, Fig. 12, shows the number of agents (email, log, IP) involved in each correlated decision. This graph confirms that in most cases, two or more agents jointly contributed to a final decision regarding the threat, demonstrating the strength of the cooperative correlation mechanism. It also reflects the complementary nature of the agents’ roles in covering different aspects of the cyber attack surface.
This visual information helps reinforce the effectiveness and reliability of our multi-agent architecture for real-world cybersecurity applications.
FIGURE 12. Number of correlated agents per security event.
VI. CONCLUSION AND FUTURE WORK
This study presented a multi-agent cybersecurity architecture designed to address the growing demand for semantically enriched and explainable threat detection. The proposed system leverages LLMs not as autonomous decision makers, but as structured processing units operating within clearly defined workflows. Agents focus on specific domains (email verification, log analysis and IP range inspection), and their results are centrally synthesized by a contextual recommendation system that performs cross-domain correlation. This architecture enables the system to go beyond static signature-based alerts and provide a more coherent and traceable interpretation of evolving cyber threats.
An extensive evaluation using public and adversarial datasets confirmed the robustness and usefulness of the system. With a system-wide detection accuracy of 93.6%, the platform demonstrated not only strong classification capabilities, but also high-quality semantic explanations, reduced false positives and improved triage efficiency for human analysts. However, performance metrics are not the only thing that matters. The architecture has proven its ability to deliver structured and transparent results, which is essential in security operations where traceability, compliance and human oversight remain non-negotiable.
One of the system's major strengths lies in its modular scalability. Each agent operates independently in its domain of analysis while contributing to a shared correlation layer with standardized results. This division of work allows for gradual evolution of the system without compromising the cohesion or consistency of results. Thus, the architecture is not only a detection tool, but a coordinated decision support system adapted to operational environments that require both interpretability and actionability.
Future work will aim to extend the capabilities of the framework without compromising its design principles. One direction for development is to integrate multimodal inputs, such as attachments, embedded links, and visual payloads, which would extend the detection capability in phishing and advanced social engineering contexts. Another path is to improve real-time responsiveness by adapting the log analysis agent to handle continuous data streams. Federated deployment models will also be explored to address scenarios involving distributed infrastructure and local privacy constraints. Finally, integrating lighter LLM variants and refining prompt engineering strategies could facilitate deployment on edge systems or in low-bandwidth environments, while preserving semantic clarity and result traceability.
Despite the encouraging results and solid architecture design, this study has some limitations and threats to validity that need to be considered for practical deployment and generalization.
First, the system relies on third-party tools and APIs (e.g., Nmap, Groq), which can cause latency issues, especially when deployed in high-throughput or real-time environments. These dependencies can also become single points of failure if services are interrupted or limited in throughput.
Second, the integration of LLMs for explanation generation, while powerful, introduces computational cost and potential unpredictability depending on prompt construction or model updates. Although prompts are currently templated and agents operate in a modular fashion, consistency of explainability across updates remains a challenge.
Third, the use of legacy datasets, while justified for comparative analysis, limits performance evaluation in the context of emerging and constantly evolving threat scenarios.
To mitigate these limitations, the architecture has been purposefully designed with modularity, replaceability, and fallback in mind. For example, external LLMs can be replaced with local models in restricted environments. Future iterations will also explore asynchronous task execution and resilience strategies to reduce latency and improve robustness in production environments.
This research contributes to the development of a deployable, transparent and adaptable cybersecurity system. The research demonstrates that when structured workflows and LLM reasoning are properly combined, organizations can achieve both highly accurate threat detection and operational clarity without resorting to anecdotal information or opaque automation. The architecture provides a foundation for further innovation in the field of explainable, human-centric cyber defense.
REFERENCES
[1] K. Afane, W. Wei, Y. Mao, J. Farooq, and J. Chen, “Next-generation phishing: How LLM agents empower cyber attackers,” in Proc. IEEE Int. Conf. Big Data, Washington, DC, USA, Dec. 2024, pp. 2558–2567, doi: 10.1109/BIGDATA62323.2024.10825018.
[2] Š. Balogh, M. Mlynček, O. Vraňák, and P. Zajac, “Using generative AI models to support cybersecurity analysts,” Electronics, vol. 13, no. 23, p. 4718, Nov. 2024, doi: 10.3390/electronics13234718.
[3] T. Koide, H. Nakano, and D. Chiba, “ChatPhishDetector: Detecting phishing sites using large language models,” IEEE Access, vol. 12, pp. 154381–154400, 2024, doi: 10.1109/ACCESS.2024.3483905.
[4] C. Liu, X. Xie, X. Zhang, and Y. Cui, “Large language models for networking: Workflow, advances and challenges,” IEEE Netw., p. 1, 2024, doi: 10.1109/MNET.2024.3510936.
[5] W. Kasri, Y. Himeur, H. A. Alkhazaleh, S. Tarapiah, S. Atalla, W. Mansoor, and H. Al-Ahmad, “From vulnerability to defense: The role of large language models in enhancing cybersecurity,” Computation, vol. 13, no. 2, p. 30, Jan. 2025, doi: 10.3390/computation13020030.
[6] M. Soltani, K. Khajavi, M. J. Siavoshani, and A. H. Jahangir, “A multi-agent adaptive deep learning framework for online intrusion detection,” Cybersecurity, vol. 7, no. 1, May 2024, doi: 10.1186/s42400-023-00199-0.
[7] K. Combs, T. Bihl, S. Howlett, and Y. Adams, “Zero-shot comparison of large language models (LLMs) reasoning abilities on long-text analogies,” in Proc. Annu. Hawaii Int. Conf. Syst. Sci., 2025. [Online]. Available: https://hdl.handle.net/10125/109034
[8] M. Landauer, S. Onder, F. Skopik, and M. Wurzenberger, “Deep learning for anomaly detection in log data: A survey,” Mach. Learn. Appl., vol. 12, Jun. 2023, Art. no. 100470.
[9] M. Ruzickova, I. Dzhalladova, O. Kaminsky, O. Bartash, and A. Pavlov, “AI and LLM models to analyze and identify cybersecurity incidents,” in Proc. CEUR Workshop, 2023, pp. 1–9. [Online]. Available: https://ceur-ws.org/Vol-3746/Short_6.pdf
[10] S. Shafee, A. Bessani, and P. M. Ferreira, “Evaluation of LLM-based chatbots for OSINT-based cyber threat awareness,” Expert Syst. Appl., vol. 261, Feb. 2025, Art. no. 125509, doi: 10.1016/j.eswa.2024.125509.
[11] Z. Liu, “AutoBnB: Multi-agent incident response with large language models,” in Proc. 13th Int. Symp. Digit. Forensics Secur. (ISDFS), Apr. 2025, pp. 1–6.
[12] SpamAssassin Project, Apache Software Foundation. (2006). Public Corpus. [Online]. Available: https://spamassassin.apache.org/publiccorpus/
[13] I. Sharafaldin, A. Habibi Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in Proc. 4th Int. Conf. Inf. Syst. Secur. Privacy, 2018, pp. 108–116. [Online]. Available: https://www.unb.ca/cic/datasets/ids-2017.html
[14] National Vulnerability Database (NVD), National Institute of Standards and Technology, U.S. Department of Commerce, Washington, DC, USA, 2023. [Online]. Available: https://nvd.nist.gov
[15] D. Pratama, N. Suryanto, A. A. Adiputra, T.-T.-H. Le, A. Y. Kadiptya, M. Iqbal, and H. Kim, “CIPHER: Cybersecurity intelligent penetration-testing helper for ethical researcher,” Sensors, vol. 24, no. 21, p. 6878, Oct. 2024, doi: 10.3390/s24216878.
[16] A. Kalyuzhnaya, S. Mityagin, E. Lutsenko, A. Getmanov, Y. Aksenkin, K. Fatkhiev, K. Fedorin, N. O. Nikitin, N. Chichkova, V. Vorona, and A. Boukhanovsky, “LLM agents for smart city management: Enhancing decision support through multi-agent AI systems,” Smart Cities, vol. 8, no. 1, p. 19, Jan. 2025, doi: 10.3390/smartcities8010019.
[17] I. Hasanov, S. Virtanen, A. Hakkala, and J. Isoaho, “Application of large language models in cybersecurity: A systematic literature review,” IEEE Access, vol. 12, pp. 176751–176778, 2024, doi: 10.1109/ACCESS.2024.3505983.
[18] Y. Chen, M. Cui, D. Wang, Y. Cao, P. Yang, B. Jiang, Z. Lu, and B. Liu, “A survey of large language models for cyber threat detection,” Comput. Secur., vol. 145, Oct. 2024, Art. no. 104016, doi: 10.1016/j.cose.2024.104016.
[19] N. Altwaijry, I. Al-Turaiki, R. Alotaibi, and F. Alakeel, “Advancing phishing email detection: A comparative study of deep learning models,” Sensors, vol. 24, no. 7, p. 2077, Mar. 2024, doi: 10.3390/s24072077.
[20] S. Atawneh and H. Aljehani, “Phishing email detection model using deep learning,” Electronics, vol. 12, no. 20, p. 4261, Oct. 2023, doi: 10.3390/electronics12204261.
[21] Y. Liu, S. Tao, W. Meng, F. Yao, X. Zhao, and H. Yang, “LogPrompt: Prompt engineering towards zero-shot and interpretable log analysis,” in Proc. IEEE/ACM 46th Int. Conf. Softw. Eng., Companion, Lisbon, Portugal, Apr. 2024, pp. 364–365, doi: 10.1145/3639478.3643108.
[22] P. Lohar and T. Baraskar, “Automated AI tool for log file analysis,” in Proc. 6th Int. Conf. Mobile Comput. Sustain. Informat. (ICMCSI), Goathgaun, Nepal, Jan. 2025, pp. 1762–1766, doi: 10.1109/icmcsi64620.2025.10883511.
[23] F. Trad and A. Chehab, “Prompt engineering or fine-tuning? A case study on phishing detection with large language models,” Mach. Learn. Knowl. Extraction, vol. 6, no. 1, pp. 367–384, Feb. 2024, doi: 10.3390/make6010018.
[24] M. Guven, “A comprehensive review of large language models in cyber security,” Int. J. Comput. Experim. Sci. Eng., vol. 10, no. 3, Sep. 2024, doi: 10.22399/ijcesen.469.
[25] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly,” High-Confidence Comput., vol. 4, no. 2, Jun. 2024, Art. no. 100211, doi: 10.1016/j.hcc.2024.100211.
[26] M. Mudassar Yamin, E. Hashmi, M. Ullah, and B. Katt, “Applications of LLMs for generating cyber security exercise scenarios,” IEEE Access, vol. 12, pp. 143806–143822, 2024, doi: 10.1109/ACCESS.2024.3468914.
[27] J. Zhang, H. Bu, H. Wen, Y. Liu, H. Fei, R. Xi, L. Li, Y. Yang, H. Zhu, and D. Meng, “When LLMs meet cybersecurity: A systematic literature review,” Cybersecurity, vol. 8, no. 1, Feb. 2025, doi: 10.1186/s42400-025-00361-w.
[28] A. A. M. Jawad, M. G. Zapata, and M. S. Al-Radhi, “Robust LLMs in cybersecurity: Protection against attacks and preventing malicious use,” Tech. Rep., 2025.
[29] N. Shenoy and A. V. Mbaziira, “An extended review: LLM prompt engineering in cyber defense,” in Proc. Int. Conf. Electr., Comput. Energy Technol., Sydney, NSW, Australia, Jul. 2024, pp. 1–6, doi: 10.1109/icecet61485.2024.10698605.
[30] T. Garde, M. Rathi, S. Dubey, and S. S. Narkhede, “Security posture detection using LLM,” AIP Conf. Proc., vol. 3222, Apr. 2024, Art. no. 070008, doi: 10.1063/5.0227627.
[31] K. R. Ismail, Z. A. Brata, G. A. Nelistiani, S. Heo, H. Kim, and H. Kim, “Toward robust security orchestration and automated response in SOCs,” Information, vol. 16, no. 5, p. 365, 2025. [Online]. Available: https://www.mdpi.com/2078-2489/16/5/365
[32] Y. Yigit, M. A. Ferrag, M. C. Ghanem, I. H. Sarker, L. A. Maglaras, C. Chrysoulas, N. Moradpoor, N. Tihanyi, and H. Janicke, “Generative AI and LLMs for critical infrastructure protection: Evaluation benchmarks, agentic AI, challenges, and opportunities,” Sensors, vol. 25, no. 6, p. 1666, Mar. 2025, doi: 10.3390/s25061666.
[33] N. Kshetri, “Transforming cybersecurity with agentic AI to combat emerging cyber threats,” Telecommun. Policy, vol. 49, no. 6, Jul. 2025, Art. no. 102976, doi: 10.1016/j.telpol.2025.102976.
[34] M. Ghasemigol, J. Carnerero-Cano, and S. Narula, “Exploring AI security: A systematic mapping study,” IEEE Access, vol. 13, pp. 119841–119858, 2025, doi: 10.1109/ACCESS.2025.3567195.
[35] M. Uddin, M. S. Irshad, I. A. Kandhro, F. Alanazi, F. Ahmed, M. Maaz, S. Hussain, and S. S. Ullah, “Generative AI revolution in cybersecurity: A comprehensive review of threat intelligence and operations,” Artif. Intell. Rev., vol. 58, no. 8, May 2025, doi: 10.1007/s10462-025-11219-5.
[36] J. F. Braun, “Examining methodologies to explain autonomous cyber defence agents in critical networks,” University of Stuttgart, Stuttgart, Germany, Tech. Rep., 2024. [Online]. Available: https://elib.uni-stuttgart.de/bitstreams/d6db3876-262a-464b-bed8-6aaba66d7e7b/download
[37] C. Marantos, S. Evangelatos, and E. Veroni, “Leveraging LLMs for dynamic cyber-threat detection and training,” IEEE Big Data, 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10825681
[38] E. Pleshakova, A. Osipov, S. Gataullin, T. Gataullin, and A. Vasilakos, “Next gen cybersecurity paradigm towards artificial general intelligence: Russian market challenges and future global technological trends,” J. Comput. Virol. Hacking Techn., vol. 20, no. 3, pp. 429–440, Jul. 2024, doi: 10.1007/s11416-024-00529-x.
[39] E. Marian Pasca, D. Delinschi, R. Erdei, and O. Matei, “LLM-driven, self-improving framework for security test automation: Leveraging karate DSL for augmented API resilience,” IEEE Access, vol. 13, pp. 56861–56886, 2025, doi: 10.1109/ACCESS.2025.3554960.
[40] A. Patil, P. Deore, P. Patil, A. Talekar, and M. Mali, “ForenSift: Gen-AI powered integrated digital forensics and incident response platform using LangChain framework,” Digit. Forensics Secur. Appl., vol. 2024, pp. 1–15, Jul. 2024. [Online]. Available: https://cdn.weijiwangluo.com/docs/1734192283868.pdf
[41] J. L. López Delgado and J. A. López Ramos, “A comprehensive survey on generative AI solutions in IoT security,” Electronics, vol. 13, no. 24, p. 4965, Dec. 2024. [Online]. Available: https://www.mdpi.com/2079-9292/13/24/4965
[42] A. Almorjan, M. Basheri, and M. Almasre, “Large language models for synthetic dataset generation of cybersecurity indicators of compromise,” Sensors, vol. 25, no. 9, p. 2825, Apr. 2025. [Online]. Available: https://www.mdpi.com/1424-8220/25/9/2825
[43] M. Andreoni, W. T. Lunardi, G. Lawton, and S. Thakkar, “Enhancing autonomous system security and resilience with generative AI: A comprehensive survey,” IEEE Access, vol. 12, pp. 109470–109493, 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10623653
[44] M. A. Z. Khan, J. Al-Karaki, and M. Omar, “LLMs for Malware detection: Review, framework design, and countermeasure approaches,” SSRN, 2025. [Online]. Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4995252
[45] A. H. Nasution, W. Monika, A. Onan, and Y. Murakami, “Benchmarking 21 open-source large language models for phishing link detection with prompt engineering,” Information, vol. 16, no. 5, p. 366, Apr. 2025, doi: 10.3390/info16050366.
[46] Y. Wang, W. Zhu, H. Xu, Z. Qin, K. Ren, and W. Ma, “A large-scale pretrained deep model for phishing URL detection,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Rhodes Island, Greece, Jun. 2023, pp. 1–5, doi: 10.1109/ICASSP49357.2023.10095719.
[47] Y. Li, J. Wu, X. Zhang, and S. Wang, “KnowPhish: LLMs meet multimodal knowledge graphs for phishing detection,” in Proc. USENIX Secur. Symp., 2024, pp. 1–12. [Online]. Available: https://www.usenix.org/conference/usenixsecurity24/presentation
[48] P. Balasubramanian, A. Sinha, I. Khan, and M. Ahmed, “CYGENT: A conversational GPT-based agent for log anomaly detection and cyber incident explanation,” in Proc. IEEE Int. Conf. Big Data, Jul. 2023, pp. 1587–1596, doi: 10.1109/AIIOT58432.2024.10574658.
[49] E. Karlsen, X. Luo, N. Zincir-Heywood, and M. Heywood, “Benchmarking large language models for log analysis, security, and interpretation,” J. Netw. Syst. Manage., vol. 32, no. 3, Jul. 2024, doi: 10.1007/s10922-024-09831-x.
[50] T. ZeMicheal, J. Yang, and A. Doupe, “LLM agents for vulnerability identification and CVE verification,” in Proc. CEUR Workshop, vol. 3562, 2024. [Online]. Available: https://ceur-ws.org/Vol-3562/
[51] O. Koucham, S. Mocanu, G. Hiet, J.-M. Thiriet, and F. Majorczyk, “Cross-domain alert correlation methodology for industrial control systems,” Comput. Secur., vol. 118, Jul. 2022, Art. no. 102723.
[52] A. Alhuzali, A. Alloqmani, M. Aljabri, and F. Alharbi, “In-depth analysis of phishing email detection: Evaluating the performance of machine learning and deep learning models across multiple datasets,” Appl. Sci., vol. 15, no. 6, p. 3396, Mar. 2025, doi: 10.3390/app15063396.
[53] V.-H. Le and H. Zhang, “Log-based anomaly detection with deep learning: How far are we?” in Proc. IEEE/ACM 44th Int. Conf. Softw. Eng. (ICSE), New York, NY, USA, May 2022, pp. 1356–1367, doi: 10.1145/3510003.3510155.
[54] Y. Xie, H. Zhang, and M. Ali Babar, “LogGD: Detecting anomalies from system logs by graph neural networks,” 2022, arXiv:2209.07869.
[55] J. M. Pittman, “Machine learning and port scans: A systematic review,” 2023, arXiv:2301.13581.
[56] J. C. Mondragon, P. Branco, G.-V. Jourdan, A. E. Gutierrez-Rodriguez, and R. R. Biswal, “Advanced IDS: A comparative study of datasets and machine learning algorithms for network flow-based intrusion detection systems,” Int. J. Speech Technol., vol. 55, no. 7, May 2025, doi: 10.1007/s10489-025-06422-4.
[57] Z. Iqbal Khan, M. Mazhar Afzal, and K. Naim Shamsi, “A comprehensive study on CIC-IDS2017 dataset for intrusion detection systems,” Int. Res. J. Adv. Eng. Hub, vol. 2, no. 2, pp. 254–260, Feb. 2024, doi: 10.47392/irjaeh.2024.0041.

YASSER HMIMOU received the State Engineering degree in computer engineering and networks from the Moroccan School of Engineering Sciences (EMSI), Casablanca, Morocco, in 2024. He is currently pursuing the Ph.D. degree in computer science with LPRI Laboratory, EMSI, in collaboration with the 2IACS Laboratory, ENSET Mohammedia, Hassan II University of Casablanca. His research interests include artificial intelligence, cybersecurity, and multi-agent systems.

MOHAMED TABAA (Member, IEEE) received the Engineering degree in telecommunication and networking from EMSI, Casablanca, Morocco, the master’s degree in radiocommunication and embedded electronic systems from the University of Paul Verlaine of Metz, France, and the Ph.D. and H.D.R. (Habilité à Diriger des Recherches) Diploma degrees in electronics systems from the University of Lorraine, Metz, France. Since 2015, he has been the Director and the Founder of the LPRI Private Laboratory, EMSI. His research interests include array of digital signal processing for wireless communications, embedded systems, energy, and AI. He is a member of ACM. He has served on the Organizing Committees and Technical Program Committees of several international conferences, including IEEE ICM, IEEE REPS&GIE, IEEE Systol, TMREES, POWER AFRICA, INTIS, ASD, JDSI, and FIoE. He is an Editor of several Special Issues: AEU - International Journal of Electronics and Communications (Elsevier), Sustainability, and Energies.

AZEDDINE KHIAT received the H.D.R. degree in computer science and the Ph.D. degree in computer science, networks, and telecommunications from ENSET Mohammedia, University Hassan II of Casablanca, Morocco. He is currently a Professor Researcher with the Department of Mathematics and Computer Science. He is a Researcher Member of the Computing, Artificial Intelligence, and Cyber Security Laboratory (2IACS). He is an outstanding reviewer for various indexed journals and has served on the organizing and technical program committees of tens of international conferences.

ZINEB HIDILA (Member, IEEE) received the M.Sc. degree in transport and logistics optimization and the Ph.D. degree in artificial intelligence. She is currently a Professor Researcher with the Moroccan School of Engineering Science and an Active Member of the Multidisciplinary Research and Innovation Laboratory (LPRI). She is also a Certified University Instructor with the Nvidia Deep Learning Institute. Her research and development interests cover a wide range, including artificial intelligence applied to cybersecurity, the IoT, and health.