Using Large Language Models For Certification Questions
Singapore
[email protected]
ABSTRACT
The assessment of cybersecurity Capture-The-Flag (CTF) exercises involves participants finding text strings or "flags" by exploiting system vulnerabilities. Large Language Models (LLMs) are natural-language models trained on vast amounts of words to understand and generate text; they can perform well on many CTF challenges. Such LLMs are freely available to students. In the context of CTF exercises in the classroom, this raises concerns about academic integrity. Educators must understand LLMs' capabilities to modify their teaching to accommodate generative AI assistance. This research investigates the effectiveness of LLMs, particularly in the realm of CTF challenges and questions. Here we evaluate three popular LLMs: OpenAI ChatGPT, Google Bard, and Microsoft Bing. First, we assess the LLMs' question-answering performance on five Cisco certifications with varying difficulty levels. Next, we qualitatively study the LLMs' abilities in solving CTF challenges to understand their limitations. We report on the experience of using the LLMs for seven test cases across all five types of CTF challenges. In addition, we demonstrate how jailbreak prompts can bypass and break LLMs' ethical safeguards. The paper concludes by discussing the LLMs' impact on CTF exercises and its implications.

Figure 1: Investigating if large language models (e.g., OpenAI ChatGPT, Google Bard, Microsoft Bing) can aid participants in CTF test environments and solving challenges.

CCS CONCEPTS
• Security and privacy; • Computing methodologies → Natural language generation;

KEYWORDS
AI, Large language models (LLM), cybersecurity capture-the-flag (CTF) challenges, professional certifications, academic integrity

1 INTRODUCTION
Capture The Flag (CTF) exercises in cybersecurity can be a powerful tool in an educator's toolbox, especially for participants to learn and grow their security skills in the different types of CTF challenges [13]. It offers an engaging and interactive environment. Studies have revealed that simulations of cybersecurity breach scenarios in CTF sessions increase student engagement and lead to more well-developed skills [10].

Large language models (LLMs) are a type of generative AI that processes human language data to comprehend, extract, and generate new texts [2, 4, 17]. In November 2022, OpenAI released ChatGPT 1 to the public, which was shortly followed by Google Bard and Microsoft Bing. These services are free and have experienced widespread adoption by students. Whether we view their role in education as a boon or bane, many students will continue to use free LLM services for assignments and exercises without learning to develop their security skills. This paper investigates using LLMs to solve CTF challenges and answer professional certification questions, and considers their role in cybersecurity education.

Recent work on using large language models in cybersecurity applications has demonstrated promising results [1, 7, 12]. One study [7] gives an overview of security risks associated with ChatGPT (e.g., malicious code generation, fraudulent services), while another work [12] generates phishing attacks using LLMs. However, at this point (August 2023), there is no study on the performance of LLMs in solving CTF challenges and answering security professional certification questions.

In this work, we investigate (Figure 1) whether popular large language models can be utilized to (1) solve the five different types of CTF challenges on the Capture-The-Flag platform CTFd, and (2) answer Cisco certification questions across all levels, from CCNA (Associate level) to CCIE (Expert level). The following questions guide our research.
• RQ1: How well can LLMs answer professional certification questions?
• RQ2: What is the experience of AI-aided CTF challenge solutions that LLMs generate?

∗ Both authors contributed equally to this research.
1 https://chat.openai.com/
2.1 Capture The Flag (CTF) Challenges
Capture The Flag (CTF) in computer security is a competition where individuals or teams of competitors pit against each other to solve a number of challenges [6]. In these challenges, "flags" are hidden in vulnerable computer systems or websites. Participating teams race to complete as many challenges as possible. There are five main types of challenges during the event, as listed below.
• Forensics challenges can include file format analysis such as steganography, memory dump analysis, or network packet capture analysis.
• Cryptography challenges cover how data is constructed, such as XOR, Caesar Cipher, Substitution Cipher, Vigenere Cipher, Hashing Functions, Block Ciphers, Stream Ciphers, and RSA.
• Web Exploitation challenges involve exploiting a bug to gain higher-level privileges, through techniques such as SQL Injection, Command Injection, Directory Traversal, Cross-Site Request Forgery, Cross-Site Scripting, and Server-Side Request Forgery.
• Reverse Engineering challenges involve taking a compiled (machine code, bytecode) program and converting it into a more human-readable format, covering Assembly / Machine Code, the C Programming Language, Disassemblers, and Decompilers.
• Binary Exploitation is a broad topic within cybersecurity that comes down to finding a vulnerability in a program and exploiting it to gain control of a shell or to modify the program's functions, covering Registers, the Stack, Calling Conventions, the Global Offset Table (GOT), and Buffers.

CTFd 2 is an easy-to-use and customizable Capture The Flag framework platform to run the challenges.

2.2 Large Language Models (LLMs)
A large language model (LLM) is artificial intelligence (AI) that builds on massive human language data and deep learning to comprehend, extract, and generate new language content. LLMs are sometimes also referred to as generative AI. These models have an architecture specifically designed to generate text-based content [17]. In particular, transformer models [14], a deep learning architecture in natural language processing, have rapidly become a core technology in LLMs. One of the most popular AI chatbots, ChatGPT, developed by OpenAI, uses a Generative Pre-trained Transformer, the GPT-3 language model [3].

GPT-3 can generate convincing content, write code, compose poetry copying various styles of humans, and more. In addition, GPT-3 is a powerful tool in security; it was shown very recently that it can be used to find security vulnerabilities in code [9].

2.3 LLM Safety Standards
As generative AI tools become increasingly accessible and familiar, the safety policy of LLMs is a significant concern in their development. It is essential to ensure responsible AI, designed to distinguish between legitimate uses and potential harms, estimate the likelihood of occurrence, and build solutions to mitigate these risks and empower society [15].

OpenAI ChatGPT 3. OpenAI's approach is based on four principles to ensure AI benefits all of humanity. They strive to: 1) Minimize harm from misuse and abuse, 2) Build trust among the user and developer community, 3) Learn and iterate to improve the system over time, and 4) Be a pioneer in trust and safety to support research into challenges posed by generative AI.

Google Bard 4. Google published a set of AI principles in 2018 and added a Generative AI Prohibited Use Policy in 2023. It states categorically that users are not allowed to: 1) Perform or facilitate dangerous or illegal activities; 2) Generate and distribute content intended to misinform or mislead; 3) Generate sexually explicit content.

Microsoft Bing 5. The Responsible AI program is designed to Identify, Measure, and Mitigate. Potential misuse is first identified through processes like stress testing. Next, abuses are measured, and mitigation methods are developed to address them.

2.4 Jailbreaking LLMs
While LLMs have safety safeguards in place, a particular class of attacks aims to bypass these safeguards. Jailbreaking is a form of hacking designed to break the ethical safeguards of LLMs [16]. It uses creative prompts to trick LLMs into ignoring their rules, producing hateful content, or releasing information their safety and responsibility policies would otherwise restrict.

2 https://ctfd.io/
3 https://openai.com/safety-standards
4 https://policies.google.com/terms/generative-ai/use-policy
5 https://blogs.microsoft.com/wp-content/uploads/prod/sites/5/2023/04/RAI-for-the-new-Bing-April-2023.pdf

3 PROFESSIONAL CERTIFICATIONS
In this section, we first list the certifications that technology professionals take in the security industry. We then classify the questions into different categories and present the results of ChatGPT in answering these questions.

The purpose is to investigate whether LLMs, such as the popular ChatGPT, can successfully pass a series of professional certification exams widely recognized by the industry. All our experiments were performed in July 2023 and are available on GitHub 6.

6 https://github.com/
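The certification evaluation described in this section can be scripted: format each bank question as a prompt, send it to a model, and compare the extracted option letter against the answer key. The sketch below is a minimal illustration in which a canned reply stands in for the LLM call; `format_mcq`, `extract_choice`, and `score` are hypothetical helpers, not the harness used in the paper.

```python
import re

def format_mcq(question, options):
    """Build a single prompt asking the model to pick exactly one option."""
    lines = [question]
    lines += [f"{letter}) {text}" for letter, text in sorted(options.items())]
    lines.append("Answer with the letter of the single best option.")
    return "\n".join(lines)

def extract_choice(reply):
    """Pull the first standalone option letter (A-E) out of a free-text reply."""
    match = re.search(r"\b([A-E])\b", reply)
    return match.group(1) if match else None

def score(expected, reply):
    """Mark a reply correct if its extracted choice matches the answer key."""
    return extract_choice(reply) == expected

# A canned reply stands in for the actual LLM call here:
prompt = format_mcq("Which authentication mechanism is available to OSPFv3?",
                    {"A": "MD5 key chains", "B": "IPsec AH/ESP", "C": "Plaintext passwords"})
print(score("B", "The correct answer is B, because OSPFv3 relies on IPsec."))  # True
```

Scoring by extracting a single letter keeps the comparison objective; MRQ scoring would instead extract the full set of letters and compare sets.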
3.1 Certification Questions
For our experiments, we use questions from the Cisco Career Certifications 2023, which offer varying levels of network certification. All questions are from a publicly available exam bank 7. In increasing order of difficulty, the questions are from the certifications CCNA, CCNP (SENSS), CCNP (SISAS), CCNP (THR), and CCIE. These certifications are a comprehensive set of credentials that validate expertise in different aspects of networking. They are divided into three main levels: Associate, Professional, and Expert.

Question Classification. Questions from the certifications can be broadly classified into two main categories: factual and conceptual.
(1) Factual questions are answered with information stated directly in the text. We define factual knowledge simply as the terminologies, specific details, and basic elements within any domain.
(2) Conceptual questions are based only on the knowledge of relevant concepts to draw conclusions. They require finding relationships and connections between various concepts, constructs, or variables.
For example, a factual question such as "Which authentication mechanism is available to OSPFv3?" has a definitive answer and does not involve subjective interpretation, whereas a conceptual question such as "A router has four interfaces addressed as 10.1.1.1/24, 10.1.2.1/24, 10.1.3.1/24, and 10.1.4.1/24. What is the smallest summary route that can be advertised covering these four subnets?" requires critical reasoning to arrive at a conclusion.

The questions are further distinguished between Multiple-Choice Questions (MCQ) and Multiple-Response Questions (MRQ): MCQ questions ask for one choice, and MRQ questions can require multiple choices. We note that the classification of questions can be biased. Hence, our sorting was done independently by two experts. Most of the questions were labeled the same; for the small number of ambiguous questions, we resolved conflicts by labeling them as conceptual.

Table 1: Number of questions in each category.

                              MCQ Questions       MRQ Questions
Cisco Certification           Fact.   Concep.     Fact.   Concep.   Total
CCNA (Associate)               22      19           8       6        55
CCNP SENSS (Professional)      13      24          14       7        58
CCNP SISAS (Professional)      11       4           7       2        24
CCNP THR (Professional)        20       8           4       6        38
CCIE (Expert)                  40      23           –       –        63

Using this classification, we divide the questions from the five certification question banks into two categories (see Table 1). Across the five certification question banks, there are more factual questions than conceptual ones. However, the mix is well balanced, as there are usually 2/3 factual questions and 1/3 conceptual questions.

3.2 Question-Answering Performance
In our evaluation, ChatGPT showcases its question-answering performance on the Cisco certification questions across all levels, from CCNA to CCIE (see Table 2). As the results demonstrate, there is a trend where ChatGPT consistently answers factual MCQ questions with higher accuracy than conceptual MCQ questions. However, when answering MRQs, its accuracy on conceptual questions stays around the same, but its performance on factual questions drops to levels similar to the conceptual ones.

Table 2: ChatGPT score (correct %) on Cisco certification question banks (Associate, Professional, Expert) with increasing levels of difficulty.

                               MCQ (%)             MRQ (%)
Cisco Certification           Fact.   Concep.     Fact.   Concep.
CCNA (Associate)              81.82    52.63      50.0    33.33
CCNP SENSS (Professional)     69.23    62.5       42.86   42.86
CCNP SISAS (Professional)     45.45    25.0       42.86   50.0
CCNP THR (Professional)       60.0     62.5       75.0    50.0
CCIE (Expert)                 82.5     56.52       –       –

To our understanding, Large Language Models (LLMs) like ChatGPT are powerful models that can generate human-like text. While LLMs excel in various language tasks and can provide helpful information for factual questions, they have limitations when answering conceptual questions. We believe the following are some reasons why LLMs might struggle with conceptual questions: (1) the model does not always have up-to-date, industry-specific information to make informed choices, (2) it may lack the ability to reason logically and can provide inaccurate responses when dealing with complex concepts, and (3) due to limited training data in the security domain, it lacks depth in its subjective interpretation. Hence, as shown in the results, it performs much worse on conceptual questions than on factual ones.

7 https://www.examtopics.com/

4 CTF CHALLENGES AND LLMs
Next, we study the role of LLMs in solving Capture-The-Flag challenges. In this section, we first outline the goals of our investigation. Next, we detail the three different generative AI LLMs tested and the five different CTF challenges used in our evaluation.

The purpose is to investigate whether users who have access to LLMs can use them to aid in solving CTF challenges. More specifically, we:
• Use test cases as examples to investigate the ability of LLMs to solve CTF challenges.
• Analyze the effectiveness of jailbreaking prompts in bypassing most of OpenAI's policy guidelines, particularly when solving CTF challenges.
• Create a program that can automatically perform some steps of the CTF challenge analysis by using tools, such as penetration-testing tools.
• Analyze the results of the test cases to understand the types of CTF challenges easily broken by LLMs.
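The goal of automatically performing some steps of challenge analysis can be sketched as a thin wrapper that runs standard triage tools over a challenge file and packages their output into an LLM prompt. This is a minimal illustration assuming common Unix tools (`file`, `strings`, `binwalk`, skipped when absent); it is not the paper's actual program.

```python
import shutil
import subprocess

# Candidate triage commands; each is skipped if the tool is not installed.
TRIAGE_TOOLS = [
    ["file"],                # identify the file type
    ["strings", "-n", "8"],  # printable strings of length >= 8
    ["binwalk"],             # scan for embedded files/firmware
]

def triage(path):
    """Run every available triage tool on `path`; results keyed by tool name."""
    results = {}
    for cmd in TRIAGE_TOOLS:
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed; skip rather than fail
        proc = subprocess.run(cmd + [path], capture_output=True, text=True)
        results[cmd[0]] = proc.stdout
    return results

def build_prompt(challenge_text, results):
    """Assemble the challenge text and tool output into one LLM question."""
    sections = [f"--- {tool} ---\n{out}" for tool, out in results.items()]
    return f"{challenge_text}\n\n" + "\n".join(sections) + "\nWhere might the flag be hidden?"
```

Keeping the tool invocations separate from the prompt assembly makes it easy to add further penetration-testing tools to the pipeline without touching the LLM-facing code.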
LastName et al.
Finally, our end goal is to use the most prominent LLM, ChatGPT, to create an automatic interface tool that can auto-login to either a CTF website or a hands-on environment to finish CTF competitions. This will be achieved through the use of AutoGPT, an experimental AI tool, as the interface between our CTF-GPT module and the CTFd website and test cloud environment.

Table 3: Various large language models (LLMs) tested on the different CTF challenges.

AI Research Institute   LLM       AI Model     Release Date
OpenAI                  ChatGPT   GPT-3.5      November 30, 2022
Google                  Bard      PaLM 2       March 21, 2023
Microsoft               Bing      Prometheus   May 04, 2023
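The interface layer toward CTFd can be sketched against its REST API, where flag submissions go to `POST /api/v1/challenges/attempt` with an access token. The snippet below only prepares such a request (the URL, token, and challenge id are placeholders); the AutoGPT wiring and actual sending are beyond this sketch.

```python
import json
from urllib import request

def attempt_payload(challenge_id, flag):
    """JSON body CTFd expects for a flag submission."""
    return {"challenge_id": challenge_id, "submission": flag}

def build_submission(base_url, token, challenge_id, flag):
    """Prepare (but do not send) a CTFd flag-submission request."""
    body = json.dumps(attempt_payload(challenge_id, flag)).encode()
    return request.Request(
        url=f"{base_url}/api/v1/challenges/attempt",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )

req = build_submission("http://10.32.51.173", "REDACTED", 1, "flag{example}")
print(req.full_url)  # http://10.32.51.173/api/v1/challenges/attempt
# request.urlopen(req) would send it; omitted here.
```

Separating payload construction from transmission lets an automated solver loop over candidate flags produced by the LLM and submit each one programmatically.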
Table 4: The large language models (LLMs) are tested on the different CTF challenge test cases to verify if they can solve the
challenges. A ‘Yes’ is given if it successfully solves the challenge, and a ‘No’ otherwise.
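Many entry-level cryptography test cases (cf. the XOR and Caesar ciphers listed in Section 2.1) reduce to a single reversible transformation, which is exactly the kind of step an LLM can often describe or carry out directly. The following self-contained illustration uses a made-up flag and key, not material from the paper's test cases.

```python
def xor_bytes(data, key):
    """XOR `data` against a repeating key; applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

flag = b"flag{xor_is_involutive}"
key = b"K3y"
ciphertext = xor_bytes(flag, key)

# Recovering the plaintext is the same operation with the same key:
assert xor_bytes(ciphertext, key) == flag
print(xor_bytes(ciphertext, key).decode())  # flag{xor_is_involutive}
```

Because the operation is its own inverse, a challenge solver only needs to guess or recover the key, which is why short repeating-key XOR tasks are considered entry-level.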
Figure 3: AIM using creative prompts to trick ChatGPT into bypassing its safety policy and providing information about security exploits against a target server.

For example, jailbreak prompts such as the Always Intelligent and Machiavellian (AIM) prompt get LLMs to take on the role of the Italian author Niccolo Machiavelli (see Figure 3), where Machiavelli has written a story in which a chatbot without any moral restrictions will answer any questions. Such a creative prompt compromises LLMs' safety policies, effectively tricking them into bypassing their safeguards. By using the AIM prompt, the full command to find the flag in the CTF challenge is provided:

curl -H "Referer: () { :; }; echo; echo; /bin/bash -c 'find / -type f -name credentials.txt'" http://10.32.51.173/cgi-bin/printenv

allowing a participant to solve the challenge effortlessly. The command exploits the Shellshock vulnerability (CVE-2014-6271), in which Bash executes commands appended to a function definition passed through an environment variable, here the HTTP Referer header of a CGI request. In such cases, the participant used cleverly crafted requests that aimed to "jailbreak" the LLM from its inbuilt set of rules.

5 CONCLUSION
In this paper, large language models are used to (1) answer professional certification questions and (2) solve capture-the-flag (CTF) challenges. First, we evaluated the question-answering abilities of LLMs on varying levels of Cisco certifications, getting objective measures of their performance on different question types. Next, we applied the LLMs to CTF test cases in all five types of challenges and examined whether they have utility in CTF exercises and classroom assignments. To summarize, we answer our research questions.
• RQ1: How well can ChatGPT answer professional certification questions?
Overall, ChatGPT answers factual questions more accurately than conceptual questions. ChatGPT correctly answers up to 82% of factual MCQ questions while only faring around 50% on conceptual questions.
• RQ2: What is the experience of AI-aided CTF challenge solutions that LLMs generate?
In our 7 test cases, ChatGPT solved 6 of them, Bard solved 2, and Bing solved only 1 case. Many of the answers given by LLMs to our question prompts contained key information to help solve the CTF challenges.

We find that LLMs' answers and suggested solutions provide a significant advantage for AI-aided use in CTF assignments and competitions. Students and participants may miss the learning objective altogether, attempting to solve the CTF challenges as an end without understanding the underlying security underpinnings and implications.

The presented results were obtained using the unpaid versions of OpenAI ChatGPT, Google Bard, and Microsoft Bing; these LLMs were the latest versions at the time of the study (July 2023). As LLMs continually improve with more data and new models, our reported results create a baseline for future work in AI-aided CTF competitions, as well as for investigating the application of LLMs and CTFs in classroom settings.
REFERENCES
[1] Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter. 2022. CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain.
[2] Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. 2007. Large language models in machine translation. (2007).
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems.
[4] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21).
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv:2204.02311 (2022).
[6] C. Cowan, S. Arnold, S. Beattie, C. Wright, and J. Viega. 2003. Defcon Capture the Flag: defending vulnerable code from intense attack. In Proceedings DARPA Information Survivability Conference and Exposition.
[7] Erik Derner and Kristina Batistič. 2023. Beyond the Safeguards: Exploring the Security Risks of ChatGPT.
[8] Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy.
[9] Chris Koch. 2023. I used GPT-3 to find 213 security vulnerabilities in a single codebase. https://betterprogramming.pub/i-used-gpt-3-to-find-213-security-vulnerabilities-in-a-single-codebase-cc3870ba9411
[10] Kees Leune and Salvatore J. Petrilli. 2017. Using Capture-the-Flag to Enhance the Effectiveness of Cybersecurity Education. In Proceedings of the 18th Annual Conference on Information Technology Education (SIGITE '17).
[11] Yusuf Mehdi. 2023. Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web. https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/
[12] Sayak Saha Roy, Krishna Vamsi Naragam, and Shirin Nilizadeh. 2023. Generating Phishing Attacks using ChatGPT.
[13] Erik Trickel, Francesco Disperati, Eric Gustafson, Faezeh Kalantari, Mike Mabey, Naveen Tiwari, Yeganeh Safaei, Adam Doupé, and Giovanni Vigna. 2017. Shell We Play A Game? CTF-as-a-service for Security Education. In 2017 USENIX Workshop on Advances in Security Education (ASE 17).
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017).
[15] Oliver R Wearn, Robin Freeman, and David MP Jacoby. 2019. Responsible AI for conservation. Nature Machine Intelligence (2019).
[16] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail?
[17] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).